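Introduction

In this tutorial, we will classify movie reviews by sentiment using an NLP model. This is an application-based tutorial in which we use a pre-trained LSTM model from the Allen NLP library. The outline of the tutorial is as follows:

1. Setting up the environment
2. All About the Dataset
3. Loading the Data
4. Data Preprocessing
5. Loading the Allen NLP Model
6. Making Predictions
7. Evaluating the results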
The full Jupyter file can be seen on our GitHub page.

Setting up the environment

This tutorial is carried out in Jupyter Notebooks (Anaconda version 4.8.3) with Python 3.8 on the Windows 10 operating system. The following packages, all imported later in this tutorial, need to be installed before you continue with the code: pandas, nltk, allennlp, allennlp-models, and scikit-learn.
You can install the above-mentioned packages using pip or conda. Simply type pip install package-name or conda install package-name in the command line. To access a GridDB database through Python, the GridDB Python client (imported below as griddb_python) is also required.
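Before running the notebook, it can help to confirm that the packages are importable. The following is a minimal sketch, assuming the required modules are the ones imported later in this tutorial; the helper name missing_modules is hypothetical and not part of any library.

```python
from importlib.util import find_spec

# Import names used later in this tutorial (note that scikit-learn is
# imported as "sklearn"). griddb_python is only needed for the GridDB route.
REQUIRED_MODULES = ["pandas", "nltk", "allennlp", "allennlp_models", "sklearn"]


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Any name printed as missing can then be installed with pip or conda as described above.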
All About the Dataset

We are using the IMDB Sentiment Analysis Dataset, which is publicly available on Kaggle. The format of the dataset is simple; it has two attributes: text (the movie review) and label (the sentiment of the review).
A label of 0 represents a negative movie review, whereas 1 represents a positive one. Since we will be using a pre-trained model, there is no need to download the training and validation datasets. We will use only the test dataset, which has 5000 instances. Once you download the dataset, put it in your working directory. Now let's go ahead and load the dataset into our Python environment.

Loading the Data

GridDB makes it easier to work with data because we can call the database directly through its Python client and load the result as a pandas dataframe.

    import griddb_python as griddb
    import pandas as pd

    # cont must already refer to the GridDB container that holds the data
    sql_statement = ('SELECT * FROM movie_review_test')
    movie_review_test = pd.read_sql_query(sql_statement, cont)

The cont variable holds the information for the container in which your data is stored. A detailed tutorial on reading and writing to GridDB using pandas is available on the blog. Alternatively, if you have the CSV file, you can use the read_csv() function of pandas. The outcome will be the same in both scenarios.

    import pandas as pd

    movie_review_test = pd.read_csv("movie_review_test.csv")

Let's print out the first five rows to get a sneak peek at our data.

    movie_review_test.head()

    len(movie_review_test)

    5000
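Whichever route is used to load the data, a quick sanity check can confirm that the dataframe has the shape described above. This is a minimal sketch, assuming the two columns are named text and label and that labels are 0 or 1; the helper check_review_frame and the toy rows are hypothetical and only serve to make the example runnable.

```python
import pandas as pd


def check_review_frame(df):
    """Raise ValueError if df lacks the expected columns or label values."""
    expected_columns = {"text", "label"}
    missing = expected_columns - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {sorted(missing)}")
    unexpected = set(df["label"].unique()) - {0, 1}
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    # Number of reviews per label, e.g. to see how balanced the data is.
    return df["label"].value_counts().to_dict()


# Toy stand-in for movie_review_test (made-up rows, not from the dataset).
toy_reviews = pd.DataFrame(
    {"text": ["Loved it", "Terrible plot", "A fine film"], "label": [1, 0, 1]}
)
print(check_review_frame(toy_reviews))
```

On the real data, the same call would be check_review_frame(movie_review_test).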
Data Preprocessing

Data preprocessing is an important step for avoiding unexpected behaviour from a machine learning model. Null or missing values tend to distort the overall results if they are not dealt with properly. Let's see if our data contains any null values.

    movie_review_test.isna().sum()

    text     0
    label    0
    dtype: int64
Great! Fortunately, we have zero null/missing values in our test dataset. However, if you do encounter null values, consider dropping or replacing them before moving further.

Removing Punctuation and Stop Words

Punctuation and stop words only increase the word count of a text. They do not contribute to model learning and mostly act as noise. It is therefore important to remove them before the training step. In our case, although there is no training step, we still want to make sure that the input we're providing is valid and appropriate. You can extend this step to the training dataset as well.

Various libraries provide a list of stop words. We'll be using the nltk library for this task. Note that the list of stop words varies from package to package. You might get a slightly different result if you're using another library, say spaCy.

    from nltk.corpus import stopwords
    import nltk

    # If the corpus is not installed yet, run nltk.download('stopwords') once
    stop = stopwords.words('english')

    len(stop)

    179

    type(stop)

    list
We now have a list of 179 stop words. You can add custom words to the list as well. In fact, let's go ahead and add a couple of words to it.

    extra_words = ['Yeah', 'Okay']
    for word in extra_words:
        if word not in stop:
            stop.append(word)

    len(stop)

    181
Alternatively, you can use extend() to append all the items of a list at once. The if condition inside the for loop just makes sure we're not adding the same word twice.

    movie_review_test['text'] = movie_review_test['text'].apply(lambda words: ' '.join(word for word in words.split() if word not in stop))

    movie_review_test.head()
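As we can see, stop words such as personal pronouns have been removed. Because the comparison is case-sensitive and the list from nltk is lowercase, capitalised forms such as "I" or "The" are left in place at this stage. Let's go ahead and convert the text to lowercase and remove the punctuation as well.

    movie_review_test['text'] = movie_review_test['text'].str.lower()
    # regex=True makes pandas treat the pattern as a regular expression
    movie_review_test['text'] = movie_review_test['text'].str.replace(r'[^\w\s]', '', regex=True)

    movie_review_test.head()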
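The cleaning steps above can also be collected into one function. The sketch below is a variant rather than the exact procedure used in this tutorial: it lowercases the text first, so that capitalised stop words are matched too, then drops stop words and strips punctuation. The function name clean_review and the small STOP_WORDS set are hypothetical stand-ins; in practice the nltk list built earlier would be passed in.

```python
import re

# Small stand-in stop list; the tutorial uses nltk's English list instead.
STOP_WORDS = {"i", "the", "was", "and", "a", "so", "this"}


def clean_review(text, stop_words=STOP_WORDS):
    """Lowercase text, drop stop words, then strip punctuation."""
    lowered = text.lower()
    kept = [word for word in lowered.split() if word not in stop_words]
    # Same pattern as above: remove anything that is not a word character
    # or whitespace.
    return re.sub(r"[^\w\s]", "", " ".join(kept))


print(clean_review("This movie was so great. I laughed and cried, a lot!"))
```

It could be applied to the dataframe with movie_review_test['text'].apply(clean_review). Since this changes which words reach the model, the predictions and accuracy may differ from the ones reported in this tutorial.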
Now that our data is ready to be used, let's load our model and start making some predictions!

Loading the Allen NLP Model

Allen NLP has made available many machine learning models targeting different problem statements. We will be using the GloVe-LSTM binary classifier for our movie review dataset. As per the official documentation, the model achieved an overall accuracy of 87% on the Stanford Sentiment Treebank. A live demo of the model is available on Allen NLP's official website. Let's go ahead and load our predictor.

    from allennlp.predictors.predictor import Predictor
    import allennlp_models.tagging

    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/basic_stanford_sentiment_treebank-2020.06.09.tar.gz")

    error loading _jsonnet (this is expected on Windows), treating C:\Users\SHRIPR~2\AppData\Local\Temp\tmpfjmtd8u3\config.json as plain json
Note that these models can be heavy. If you have a GPU-enabled system, simply pass the argument cuda_device=0 to Predictor.from_path() above. To check that the predictor works, let's pass a sample review and see what kind of output we get.

    sample_review = "This movie was so great. I laughed and cried, a lot!"

    predictor.predict(sample_review)
As we can see, the predictor returns a dictionary with 5 keys: logits, probs, token_ids, label, and tokens. Since we know the sample review is a positive one, we can say that the model correctly returned the label '1'. In addition to the label, the probs list gives the confidence score, or probability, of each label, which in our case is 0 or 1. The first item of the probs list, i.e. the probability of label 1, is 0.98 (or 98%), which means the model was 98% confident that the review was positive. Now that we know the predictor is working, it is time to make some predictions.

Making Predictions

We'll define a predict function that takes a movie review and returns the label as an integer. Note that the original labels are of type int. It'll be easier to compare the actual and predicted values if they're of the same data type.

    def predict_review(movie_review):
        # The predictor returns the label as a string, so cast it to int
        return int(predictor.predict(movie_review)['label'])

    movie_review_test['predicted_label'] = movie_review_test['text'].apply(predict_review)

    movie_review_test.head()
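If the confidence score is useful alongside the label, the predict function can return both. This is a minimal sketch, assuming the output dictionary has the label and probs keys described earlier and that the predicted label is the most probable one. The StubPredictor class is a hypothetical stand-in with made-up numbers so that the example runs without downloading the model; with the real predictor loaded above, it would be passed in instead.

```python
class StubPredictor:
    """Hypothetical stand-in that mimics the output keys described above."""

    def predict(self, sentence):
        # Made-up rule and numbers, only to make the sketch runnable.
        positive = "great" in sentence.lower()
        return {
            "label": "1" if positive else "0",
            "probs": [0.9, 0.1] if positive else [0.2, 0.8],
        }


def predict_with_confidence(predictor, movie_review):
    """Return (label as int, probability of the predicted label)."""
    output = predictor.predict(movie_review)
    # Assumption: the predicted label is the most probable one, so its
    # probability is the largest entry of probs.
    return int(output["label"]), max(output["probs"])


stub = StubPredictor()
print(predict_with_confidence(stub, "This movie was so great."))
print(predict_with_confidence(stub, "A dull, lifeless film."))
```

Keeping the confidence makes it possible to inspect the reviews on which the model was least certain.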
Now we simply need to calculate the accuracy of our model. The prediction cell took 6 minutes to execute for 5000 instances because it was running on a CPU, and these models can be heavy. If you'll be using the code on large datasets, consider using a GPU.

Evaluating the results

Allen NLP has its own set of metrics for evaluation. For the sake of simplicity, we'll be using the scikit-learn library. You can find more information on the metrics in the Allen NLP documentation.

    from sklearn.metrics import accuracy_score

    actual = movie_review_test['label']
    predicted = movie_review_test['predicted_label']

    accuracy = accuracy_score(actual, predicted)
    accuracy

    0.7208
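Accuracy alone does not show whether the errors fall mostly on positive or on negative reviews. The sketch below, written in plain Python, counts the four outcomes for 0/1 labels and derives accuracy from them as the share of matching predictions. The function names and the toy labels are hypothetical and are not the results of this tutorial.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if truth == 1 and guess == 1:
            counts["tp"] += 1
        elif truth == 0 and guess == 0:
            counts["tn"] += 1
        elif truth == 0 and guess == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy_from_counts(counts):
    """Accuracy is the share of predictions that match the true label."""
    correct = counts["tp"] + counts["tn"]
    return correct / sum(counts.values())


# Toy labels for illustration only, not the tutorial's results.
toy_actual = [1, 0, 1, 1, 0, 0, 1, 0]
toy_predicted = [1, 0, 0, 1, 0, 1, 1, 0]
toy_counts = confusion_counts(toy_actual, toy_predicted)
print(toy_counts, accuracy_from_counts(toy_counts))
```

The label and predicted_label columns of the dataframe can be passed in the same way. scikit-learn's confusion_matrix function in sklearn.metrics reports the same four counts as a 2x2 array.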
Our model has an overall accuracy of 72% on the test dataset. That's decent for starters, right? You can save the predictions to a CSV file using the dataframe's to_csv() method, for example movie_review_test.to_csv(file_path). Go ahead and try the code for yourself. Happy coding!