
How to solve your Natural Language Processing classification task?

A new post starts with you beautiful people, and a very Happy New Year to all! Last year was fantastic for learning many new things in Data Science, and I was very happy to see that many aspiring data scientists like you contacted me through my Facebook page. It was an honor for me to answer your queries and to motivate many of you!
This year I will continue to post about what I learn and share it with you all, so don't stop, stay positive, keep practicing and try again and again :)

In this post I am going to share the nuts and bolts of handling a natural language processing (NLP) task, which I learnt while working for a Singapore client. In that project I collected product data from different e-commerce merchants through their APIs and web scraping, and then I cleaned the text before applying a machine learning algorithm to classify the product categories. You may encounter this kind of task in your own work. So let's start our journey by first understanding some important fundamentals of NLP.

You must be aware that a word or a sentence plays a very strong role in this world, since both can affect a human being in different ways. Teaching a machine to understand human language is therefore a very fast growing field; Amazon Alexa is one example. NLP is everywhere, whether it is translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition or topic segmentation. You can see NLP as a field focused on making sense of language. Let's understand the key terms and their meaning before jumping into the Python coding part-

Corpus:- A corpus is a large collection of texts. It is a body of written or spoken material upon which a linguistic analysis is based. The plural form of corpus is corpora. NLTK comes with various built-in corpora, so for most tasks you just need to download them by calling nltk.download() from a Python session (for example one started from your anaconda prompt). You can see a list of built-in corpora in this link also.

Lexicon:- A lexicon is a vocabulary, a list of words or a dictionary, for example an English dictionary. In NLTK, any lexicon is considered a corpus since a list of words is also a body of text. For example: to a financial investor, the first meaning of the word "Bull" is someone who is confident about the market, whereas in the common English lexicon the first meaning of the word "Bull" is an animal. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.

Stop Words:- Stop words are words that carry little meaning or contribute very little to the text analysis process. Very common words such as 'the', 'is' and 'and' are typical examples. Many people also use fillers like 'hmmmm' or 'ummmm' in text messages, which have no meaning in English grammar, so such words should be excluded from the processed text as well. NLTK ships ready-made stop word lists which you can access via the corpus: from nltk.corpus import stopwords. Fillers like 'hmmmm' are not in NLTK's English list, so you have to add them to the list yourself.

Stemming:- Stemming is a sort of normalizing method. In documents you may find many variations of a word that carry the same meaning, apart from tense. For example, the sentences 'I was taking a ride in the car.' and 'I was riding in the car.' have the same meaning, and a machine should be able to recognize that 'ride' and 'riding' are the same word. This is why we do stemming, which reduces each variation to a common stem. The most widely used stemming algorithm in NLP is the Porter stemmer, which you can import as from nltk.stem import PorterStemmer.
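To make the idea concrete, here is a minimal sketch of the Porter stemmer on a few variations of 'ride' and 'take'. The word list is my own illustration and not part of the original project.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list (an assumption, not data from the project).
words = ["ride", "rides", "riding", "take", "taking"]

# Map every word to the stem produced by the Porter algorithm.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Variations of the same word should end up with the same stem, which is what lets a model treat them as one feature.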

Lemmatizing:- Lemmatizing is a very similar operation to stemming. The major difference between the two is that stemming can often create non-existent words, whereas lemmas are actual words. Lemmatization normally aims to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. You can lemmatize with WordNetLemmatizer from the nltk.stem package.

Tokenization:- Since a machine does not have any understanding of alphabets, grammar, punctuation etc., it cannot understand raw text. You need to divide the raw text into small chunks or tokens. This text breaking process is known as tokenization. The tokens may be words, numbers or punctuation marks. One challenge in this process is that it depends on the language, so the same approach gives different results for different languages. For example, languages such as English and French are referred to as space-delimited, as most of the words are separated from each other by white space. Languages such as Chinese and Thai are referred to as unsegmented, as words do not have clear boundaries. In Python the most common and useful toolkit for handling NLP is nltk.

We can tokenize a string into words, splitting off punctuation other than periods, using the word_tokenize function, and we can tokenize a document into sentences using the sent_tokenize function; both come from the nltk.tokenize library. There is also a special class for tokenizing Twitter tweets, named TweetTokenizer, in the nltk.tokenize library. I recommend visiting the following link once to understand all of the NLTK tokenizers- tokenization
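As a small illustration, the sketch below runs TweetTokenizer on a made-up tweet. I picked this tokenizer for the example because, as far as I know, it is rule based and works without downloading extra NLTK data, whereas word_tokenize and sent_tokenize need the Punkt models to be downloaded first.

```python
from nltk.tokenize import TweetTokenizer

# Made-up tweet used only for illustration.
tweet = "This is a cooool #dummysmiley: :-) :-P <3"

tokenizer = TweetTokenizer()
tokens = tokenizer.tokenize(tweet)

# Hashtags and emoticons are kept as single tokens.
print(tokens)
```

A plain whitespace split would leave the colon glued to the hashtag, which is why a tweet-aware tokenizer is handy for social media text.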

POS Tagging:- POS or Part Of Speech tagging is one of the more powerful aspects of the NLTK module. Using POS tagging, you can easily label the words in a sentence as nouns, adjectives, verbs etc. The tagging itself is done with the nltk.pos_tag() function on a list of word tokens; PunktSentenceTokenizer from the nltk.tokenize library can be used first to split the text into sentences.

Regex:- A regex or regular expression is a pattern which can match one or more character strings. It allows us to pick off numerous patterns. Some common regex are as below-

We can use regex features by importing Python's library re. The re library has various methods including split, findall, search and match which you can use in finding patterns.
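Here is a short sketch of the four re methods mentioned above. The sample sentence and the patterns are my own examples, chosen only to show how each method behaves.

```python
import re

# Made-up sample text for illustration.
text = "Order 66 shipped on 2019-01-05; order 67 ships on 2019-01-09."

# findall: every non-overlapping match of the pattern, as a list.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

# split: break the string wherever the pattern matches.
parts = re.split(r";\s*", text)

# search: the first match anywhere in the string, or None.
first_number = re.search(r"\d+", text)

# match: succeeds only if the pattern matches at the very start.
starts_with_order = re.match(r"Order", text)
starts_with_digit = re.match(r"\d", text)

print(dates)
print(parts)
print(first_number.group())
print(starts_with_order is not None, starts_with_digit is not None)
```

The difference between search and match trips up many beginners: match is anchored at the start of the string, search is not.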

Named Entity Recognition:- NER or Named Entity Recognition is a process which enables a machine to pull out "entities" like people, places, things, locations, monetary figures, and more. NLTK can either mark all named entities without a type, or label each named entity with its respective type like people, places, locations, etc.

WordNet:- WordNet is a lexical database for the English language which is available as one of the NLTK corpora. It is used to find the meanings of words, synonyms, antonyms etc. We can also use WordNet to measure how similar two words are.

Feature Extraction:- Feature extraction or vectorization is the process of encoding words into numerical form, either as integers or as floating point values. This process is required because most machine learning algorithms accept only numerical input. The scikit-learn library provides easy-to-use tools for this step like CountVectorizer, TfidfVectorizer and HashingVectorizer. CountVectorizer converts text to word count vectors. TfidfVectorizer converts text to TF-IDF vectors, where the frequency of a word is down-weighted if the word appears in many documents. HashingVectorizer uses a hash function to map words to column indices, so it does not need to store a vocabulary.
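The sketch below applies the three vectorizers to two toy sentences so you can compare their outputs. The sentences and the n_features=16 value are assumptions made only for this illustration.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Two toy documents (illustrative only).
docs = ["the cat sat on the mat", "the dog sat on the log"]

# Word counts: one column per distinct word in the learned vocabulary.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)

# TF-IDF: same columns, but values are weighted floats instead of counts.
tfidf = TfidfVectorizer().fit_transform(docs)

# Hashing: a fixed number of columns and no stored vocabulary.
hashed = HashingVectorizer(n_features=16).fit_transform(docs)

print(sorted(count_vec.vocabulary_))
print(counts.toarray())
print(tfidf.shape, hashed.shape)
```

All three return sparse matrices with one row per document, so they can be passed directly to most scikit-learn estimators.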

That is enough theory to get started with any NLP task. Now it's time to work on an actual NLP problem where we need to preprocess the text before applying any machine learning algorithm. For this exercise we are going to work on a text classification problem where we have a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. You can download this dataset from here . The types of toxicity are:

  1. toxic 
  2. severe_toxic 
  3. obscene 
  4. threat 
  5. insult 
  6. identity_hate 

Here your goal is to create a model which predicts a probability of each type of toxicity for each comment.

Step 1:- Importing required libraries

Step 2:- Reading datasets in Pandas Dataframe

Don't forget to change the path of downloaded dataset here!

Step 3:- Exploring the dataset


Step 4:- Separate the target variable
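A minimal sketch of this step is shown below. It uses a tiny in-memory DataFrame instead of the downloaded file, and the column name comment_text as well as the variable names are assumptions on my side; only the six label names come from the list above.

```python
import pandas as pd

# The six toxicity labels listed earlier in this post.
label_cols = [
    "toxic", "severe_toxic", "obscene",
    "threat", "insult", "identity_hate",
]

# Tiny stand-in for the training data; "comment_text" is an assumed column name.
train = pd.DataFrame({
    "comment_text": ["Thanks for the helpful edit!", "You are an idiot."],
    "toxic": [0, 1],
    "severe_toxic": [0, 0],
    "obscene": [0, 0],
    "threat": [0, 0],
    "insult": [0, 1],
    "identity_hate": [0, 0],
})

# Separate the raw text (model input) from the target columns (model output).
train_text = train["comment_text"]
targets = train[label_cols]

print(train_text.shape, targets.shape)
```

With the real dataset you would build train with pd.read_csv() on your downloaded file instead of the hand-written DataFrame.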

Step 5:- Perform Feature Extraction using word n grams

Notice the parameters I am passing to TfidfVectorizer. Each parameter has its own role and a summary of each one is as below-


  1. sublinear_tf:- It is a boolean flag; if True, it applies sublinear tf scaling, i.e. replaces tf with 1 + log(tf).
  2. strip_accents:- It removes accents and performs other character normalization during the preprocessing step. ‘ascii’ is a fast method that only works on characters that have a direct ASCII mapping. ‘unicode’ is a slightly slower method that works on any characters.
  3. analyzer:- Whether the features should be made of word or character n-grams.
  4. token_pattern:- Regular expression denoting what constitutes a “token”, only used if analyzer='word'.
  5. ngram_range:- The lower and upper boundary of the range of n-values for different n-grams to be extracted.
  6. max_features:- If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.
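The parameters above can be put together as in the following sketch. The parameter values and the three toy comments are assumptions chosen for illustration; they are not necessarily the values used in the original project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments standing in for the real training text.
train_text = ["Café owners are nice", "You are a fool", "nice cafe"]

word_vectorizer = TfidfVectorizer(
    sublinear_tf=True,        # use 1 + log(tf) instead of raw tf
    strip_accents="unicode",  # e.g. "Café" is normalized to "cafe"
    analyzer="word",          # features are word n-grams
    token_pattern=r"\w{1,}",  # keep even one-character tokens such as "a"
    ngram_range=(1, 1),       # unigrams only
    max_features=10000,       # cap on the vocabulary size (assumed value)
)

train_word_features = word_vectorizer.fit_transform(train_text)

print(sorted(word_vectorizer.vocabulary_))
print(train_word_features.shape)
```

Note that the default token_pattern drops one-character tokens, so setting it to \w{1,} is what keeps a word like "a" in the vocabulary.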


Step 6:- Perform Feature Extraction using char n grams

Step 7:- Stack sparse matrices horizontally (column wise)
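Steps 6 and 7 can be sketched together as below: one vectorizer for character n-grams, one for words, and then scipy's hstack to join both matrices column wise. The toy comments and the ngram_range=(2, 4) value are assumptions for illustration.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments standing in for the real training text.
train_text = ["you are nice", "you are a fool", "nice work"]

# Word-level features.
word_vectorizer = TfidfVectorizer(
    analyzer="word", token_pattern=r"\w{1,}", ngram_range=(1, 1)
)
# Character-level features: every run of 2 to 4 characters is a feature.
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))

word_features = word_vectorizer.fit_transform(train_text)
char_features = char_vectorizer.fit_transform(train_text)

# Stack horizontally: same rows (comments), columns placed side by side.
train_features = hstack([char_features, word_features]).tocsr()

print(word_features.shape, char_features.shape, train_features.shape)
```

Character n-grams help with misspelled or obfuscated words, which are common in toxic comments, while word features capture the normal vocabulary.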

Step 8:- Train your model and Predict Toxicity on Test data

Here I am using the cross_val_score() function to check how well the model generalizes, which helps to detect overfitting. Its 'cv' argument determines the cross-validation splitting strategy and 'roc_auc' is the classification scoring metric. You may refer to the following link to see all metrics- scoring-parameter.
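The sketch below shows how cross_val_score() can be used with cv=3 and scoring='roc_auc' for a single label. The LogisticRegression classifier, the pipeline and the twelve toy comments are all assumptions for illustration; for the real task you would repeat this for each of the six toxicity labels.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy comments with a single binary label (1 = toxic), illustration only.
texts = [
    "you are an idiot", "what a stupid fool", "shut up you moron",
    "you stupid idiot", "nobody likes you fool", "you moron go away",
    "thanks for the edit", "great work on this page", "nice article thanks",
    "please add a source", "good point well made", "thanks for your help",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier in one pipeline, refit inside every fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# 3-fold cross-validation scored with ROC AUC.
scores = cross_val_score(model, texts, labels, cv=3, scoring="roc_auc")
print("CV score:", scores.mean())

# Fit on all data and predict the probability of the positive class.
model.fit(texts, labels)
proba = model.predict_proba(["you are a fool"])[:, 1]
print(proba)
```

Keeping the vectorizer inside the pipeline means each fold learns its vocabulary from its own training part only, so the validation score is not inflated by information from the held-out comments.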

Result:-

Output CSV:-

As a bonus, I am going to show you a great visualization technique (WordCloud) that shows which words are the most frequent in a given text. First you need to install it using either of the following commands in your anaconda prompt:- conda install -c conda-forge wordcloud or conda install -c conda-forge/label/gcc7 wordcloud. Then you need to import this package along with the basic visualization libraries as follows-

The WordCloud library has only one required input, and that is a text. So in our case you can use 'train_text' or 'test_text' as input. Let me show you how you can use this library-


Here, the argument interpolation="bilinear" in plt.imshow() is used to make the displayed image appear smoother. Try different values and see what changes you find! Once you run the above function, it will display the following beautiful cloud-

Notice that in the above image every word has a different size, and WordCloud does this on purpose. The size of a word tells you how frequently it occurs in the train_text data: the bigger the word, the more frequent it is. If you are interested in learning more about WordCloud then please follow this fantastic article- explore wordcloud.

That's it for today, friends. Try the above exercise in your notebook, read all the links I have shared above, explore more, practice your learning on a new dataset like Twitter sentiment analysis data and don't be afraid of failure. Keep trying! With that said my friends- Go chase your dreams, have an awesome day, make every second count and see you later in my next post.
