Sentiment Analysis of Movie Reviews Using AllenNLP and GridDB

Introduction

In this tutorial, we will classify movie reviews using sentiment analysis with an NLP model. This is an application-based tutorial where we will use a pre-trained LSTM model from the AllenNLP library. The outline of the tutorial is as follows:

  1. Setting up the environment
  2. All about the Dataset
  3. Data Preprocessing
  4. Loading the AllenNLP model
  5. Making predictions
  6. Evaluating the results

The full Jupyter Notebook is available on our GitHub page.

Setting up the environment

This tutorial is carried out in Jupyter Notebook (Anaconda version 4.8.3) with Python version 3.8 on the Windows 10 operating system. The following packages need to be installed before you continue with the code:

  1. Pandas
  2. allennlp
  3. allennlp-models
  4. nltk
  5. scikit-learn

You can install the above-mentioned packages using pip or conda: simply type pip install package-name or conda install package-name on the command line.
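For example, to grab everything in one go with pip:

pip install pandas allennlp allennlp-models nltk scikit-learn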

To access GridDB's database through Python, the following packages will also be required:

  1. GridDB C-client
  2. SWIG (Simplified Wrapper and Interface Generator)
  3. GridDB Python-client

All About the Dataset

We are using the IMDB Sentiment Analysis Dataset, which is publicly available on Kaggle. The format of the dataset is pretty simple: it has two attributes:

  1. Movie Review (string)
  2. Sentiment Label (int, binary)

A label of 0 represents a negative movie review, whereas 1 represents a positive one. Since we will be using a pre-trained model, there is no need to download the train and validation datasets. We will use only the test dataset, which has 5,000 instances. Once you download the dataset, put it in the same working directory.

Now let's go ahead and load the dataset into our Python environment.

Loading the Data

GridDB has made it easier to work with data, as we can directly query the database using its Python client and load the results into a pandas dataframe.

import griddb_python as griddb
import pandas as pd

sql_statement = ('SELECT * FROM movie_review_test')
movie_review_test = pd.read_sql_query(sql_statement, cont)

The cont variable holds the connection information for the container in which your data is stored. A detailed tutorial on reading from and writing to GridDB using pandas is available on the blog.
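For reference, here is a minimal sketch of how a GridDB store handle (the cont referred to above) might be obtained with the Python client; the host, port, cluster name, and credentials are placeholders, and you should consult the GridDB Python client documentation for the connection style that matches your setup:

import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()
# placeholder connection details -- replace with your own cluster settings
cont = factory.get_store(
    host="239.0.0.1", port=31999,
    cluster_name="defaultCluster", username="admin", password="admin")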

Alternatively, if you have the CSV file, you can use the read_csv() function of pandas. The outcome will be the same in both scenarios.

import pandas as pd

movie_review_test = pd.read_csv("movie_review_test.csv")

Let's print out the first five rows to get a little sneak peek into our data.

movie_review_test.head()
   text                                                label
0  I always wrote this series off as being a comp...      0
1  1st watched 12/7/2002 3 out of 10(Dir-Steve ...        0
2  This movie was so poorly written and directed ...      0
3  The most interesting thing about Miryang (Secr...      1
4  when i first read about berlin am meer i did...        0
len(movie_review_test)
5000

Data Preprocessing

Data preprocessing is an important step to avoid unexpected behaviour from the machine learning model. Null or missing values tend to skew the overall results if not dealt with properly. Let's see if our data contains any null values.

movie_review_test.isna().sum()
text     0
label    0
dtype: int64

Great! Fortunately, we have zero null/missing values in our test dataset. However, if you do encounter null values, consider dropping or replacing them before moving further, as sketched below.
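For instance, if nulls did show up, either of the following one-liners would deal with them (a quick sketch using standard pandas methods):

# option 1: drop any rows with a missing review or label
movie_review_test = movie_review_test.dropna(subset=['text', 'label'])

# option 2: keep the rows and replace missing review text with an empty string
movie_review_test['text'] = movie_review_test['text'].fillna('')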

Removing Punctuation and Stop Words

Punctuation and stop words only increase the total word count of a text. They do not contribute to model learning and mostly act as noise. It is, therefore, important to remove them before the training step. In our case, although there is no training step, we still want to make sure that the input we're providing is valid and appropriate. You can extend this step to the training dataset as well.

Various libraries provide a list of stop words. We'll be using the nltk library for this task. Note that the list of stop words varies from package to package; you might get a slightly different result if you're using some other library, say spaCy.

from nltk.corpus import stopwords
import nltk
# if the stop word corpus is missing, run nltk.download('stopwords') once
stop = stopwords.words('english')
len(stop)
179
type(stop)
list

We now have a list of 179 stop words. You can add custom words to the list as well. In fact, let's go ahead and add a couple of words to it.

extra_words = ['Yeah', 'Okay']
for word in extra_words:
    if word not in stop:
        stop.append(word)
len(stop)
181

Alternatively, you can use extend() to append all the items of a list at once, as sketched below. The if condition inside the for loop just makes sure we're not adding the same word twice.
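As a sketch, the same de-duplicated append can be written with extend() and a generator expression:

stop.extend(word for word in extra_words if word not in stop)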

movie_review_test['text'] = movie_review_test['text'].apply(
    lambda words: ' '.join(word for word in words.split() if word not in stop))
movie_review_test.head()
   text                                                label
0  I always wrote series complete stink-fest Jim ...      0
1  1st watched 12/7/2002 3 10(Dir-Steve Purcell...        0
2  This movie poorly written directed I fell asle...      0
3  The interesting thing Miryang (Secret Sunshine...      1
4  first read berlin meer expect much. thought ...        0

As we can see, personal pronouns such as I, we, etc. have been removed. Let's go ahead and remove the punctuation as well.

movie_review_test['text'] = movie_review_test['text'].str.lower()
movie_review_test['text'] = movie_review_test['text'].str.replace(r'[^\w\s]', '', regex=True)
movie_review_test.head()
   text                                                label
0  i always wrote series complete stinkfest jim b...      0
1  1st watched 1272002 3 10dirsteve purcell typi...        0
2  this movie poorly written directed i fell asle...      0
3  the interesting thing miryang secret sunshine ...       1
4  first read berlin meer expect much thought rig...      0

Now that our data is ready to be used, let's load up our model and start making some predictions!

Loading the AllenNLP Model

AllenNLP provides many machine learning models targeting different problem statements. We will be using the GloVe-LSTM binary classifier for our movie review dataset. As per the official documentation, the model achieves an overall accuracy of 87% on the Stanford Sentiment Treebank. A live demo of the model is available on AllenNLP's official website.

Let's go ahead and load our predictor.

from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging

predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/basic_stanford_sentiment_treebank-2020.06.09.tar.gz")
error loading _jsonnet (this is expected on Windows), treating C:\Users\SHRIPR~2\AppData\Local\Temp\tmpfjmtd8u3\config.json as plain json

Note that these models can be heavy; if you have a GPU-enabled system, simply pass the argument cuda_device=0 to the predictor function above, as shown below.
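For example, loading the same archive onto the first GPU would look like this:

predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/basic_stanford_sentiment_treebank-2020.06.09.tar.gz",
    cuda_device=0)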

To check that the predictor works fine, let's pass it a sample text review and see what kind of output we get.

sample_review = "This movie was so great. I laughed and cried, a lot!"
predictor.predict(sample_review)
{'logits': [...], 'probs': [0.98..., ...], 'token_ids': [...], 'label': '1', 'tokens': [...]}

As we can see, the predictor returns a dictionary with five keys: logits, probs, token_ids, label, and tokens. Since we know the sample review is a positive one, we can say that the model correctly returned the label '1'.

In addition to the label, the probs list also tells us the confidence score, or probability, of each label, which in our case is 0 or 1. The first item of the probs list, i.e. the probability of label 1, is 0.98 (or 98%), which means the model was 98% confident that the review was positive.
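To pull those fields out programmatically, here is a small sketch using the keys listed above:

result = predictor.predict(sample_review)
result['label']     # predicted class as a string, e.g. '1'
result['probs'][0]  # confidence that the review is positive (label 1)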

Now that we know the predictor is working fine, it's time to make some predictions.

Making Predictions

We'll define a predict function that takes a movie review and returns the label as an integer. Note that the original labels are of type int; it'll be easier to compare the actual and predicted values if they're of the same data type.

def predict_review(movie_review):
    return int(predictor.predict(movie_review)['label'])
movie_review_test['predicted_label'] = movie_review_test['text'].apply(predict_review)
movie_review_test.head()
   text                                                label  predicted_label
0  I always wrote this series off as being a comp...      0                1
1  1st watched 12/7/2002 3 out of 10(Dir-Steve ...        0                0
2  This movie was so poorly written and directed ...      0                0
3  The most interesting thing about Miryang (Secr...      1                1
4  when i first read about berlin am meer i did...        0                1

Now we simply need to calculate the accuracy of our model. The prediction cell took 6 minutes to execute for 5,000 instances because it was running on a CPU, and these models can be heavy. If you'll be using the code on large data, consider using a GPU.
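Another possible speed-up, independent of the GPU, is to send reviews to the model in batches via predict_batch_json() instead of one at a time. A sketch (the "sentence" key matches this predictor's JSON input, and batch_size is an arbitrary choice):

batch_size = 64  # arbitrary; tune to your memory budget
texts = movie_review_test['text'].tolist()
predicted = []
for i in range(0, len(texts), batch_size):
    batch = [{"sentence": t} for t in texts[i:i + batch_size]]
    predicted.extend(int(r['label']) for r in predictor.predict_batch_json(batch))
movie_review_test['predicted_label'] = predicted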

Evaluating the results

AllenNLP has its own set of metrics for evaluation. For the sake of simplicity, we'll be using the scikit-learn library. You can find more information on AllenNLP metrics here.

from sklearn.metrics import accuracy_score
actual = movie_review_test['label']
predicted = movie_review_test['predicted_label']
accuracy = accuracy_score(actual, predicted)
accuracy
0.7208

Our model has an overall accuracy of 72% on the test dataset. That's decent for starters, right? You can save the predictions to a CSV file with the dataframe's to_csv() method, as shown below. Go ahead and try the code for yourself.
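For example (the file name is just a placeholder):

movie_review_test.to_csv('movie_review_predictions.csv', index=False)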

Happy coding!
