Sentiment analysis via machine learning – Bag of words meets bag of popcorn


This is a not-so-sophisticated tutorial on predicting sentiment using R. The following is based on a Kaggle competition that I took part in. I must tell you in advance that this tutorial is basically the ‘101’ of predicting sentiment using a bag of words. A more advanced way to predict sentiment would be to use Google's word2vec. This tutorial will get you started on predicting sentiment.

Sentiment analysis Kaggle tutorial
Image credits: Pixabay

Moving on, the first few baby steps involve reading the data from your working directory.
You can use the following commands to check or change the directory:


#Get your current working directory
getwd()
 
#Set a different working directory
setwd('/documents/popcorn')

Everything looks perfect? Good. Let's now read our data and see what it looks like. I generally take a sneak peek at the data files to optimize my code to read them faster. But let's not talk about that in this tutorial, and start begging for some more RAM!

The files are tab-separated (.tsv), so you can read them in R with the read.csv() command by specifying the separator:

labeledtrain <- read.csv('labeledTrainData.tsv', quote='', sep='\t')
 
testData <- read.csv('testData.tsv', quote='', sep ='\t')

Now that you have read your data in R, let’s check out the structure and type of data in our data frames.

> str(labeledtrain)
'data.frame':	25000 obs. of  3 variables:
 $ id       : Factor w/ 25000 levels "\"0_3\"","\"0_9\"",..: 15704 8076 20024 10851 23882 20996 18707 1415 9871 22146 ...
 $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
 $ review   : Factor w/ 24904 levels "\"\b\b\b\bA Turkish Bath sequence in a film noir located in New York in the 50's, that must be a hint at something ! Something "| __truncated__,..: 24348 834 17316 12571 16735 7766 20881 11186 1345 1049 ...
> str(testData)
'data.frame':	25000 obs. of  2 variables:
 $ id    : chr  "\"12311_10\"" "\"8348_2\"" "\"5828_4\"" "\"7186_2\"" ...
 $ review: chr  "\"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it"| __truncated__ "\"This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw aw"| __truncated__ "\"All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my kid's excitement was so great "| __truncated__ "\"Afraid of the Dark left me with the impression that several different screenplays were written, all too short for a feature l"| __truncated__ ...

 

Now, if we wish to apply a bag-of-words model to this data, we first need to combine both data sets so that we build one giant bag of words with a common set of features.

alldata <- rbind(labeledtrain[,c(1,3)], testData)

 

‘tm’ is a text mining library in R. Since our data is now in an appropriate format, we can start applying text mining methods to it.
If you paid attention, you'll have noticed that we excluded ‘sentiment’ from ‘alldata’; I'll add it back later.

If you don’t have ‘tm’ installed, you can install it with this command:
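
install.packages('tm')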

After installation, load the ‘tm’ package:
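
library(tm)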

If you observe the text, there are a few things you already know we need to do. First, we need to convert our text into a machine-readable form (a corpus). That's easy: all you need to do in this case is use ‘Corpus(VectorSource(text))’. There are other situations where lowercasing text gets a little trickier than the standard process, e.g. scraped data, which might contain Unicode characters. To make the corpus more machine readable, you should remove punctuation and stopwords, lowercase all the text, and stem it. Fortunately, R provides easy steps to accomplish this:

# Build a corpus from the review text
corpus <- Corpus(VectorSource(alldata$review))
# Lowercase everything (with tm >= 0.6, wrap base functions in content_transformer(), e.g. tm_map(corpus, content_transformer(tolower)))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
# Convert back to plain text documents after the transformations above
corpus <- tm_map(corpus, PlainTextDocument)
# Drop common English stopwords, then stem (stemDocument needs the SnowballC package installed)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)

We can now convert this corpus into a document-term matrix (DTM). What is that, you ask? A DTM is a matrix that shows the frequency of terms in each document; in our case, how frequently each word occurs in a particular review. Our matrix consists of a lot of terms, with frequencies ranging from zero to a few million (joking!). It is definitely a good idea to reduce the number of terms (features). I tried different values with removeSparseTerms, but after a certain prediction accuracy (~0.71) it didn't matter how many terms I kept; the results wouldn't change. That is a limitation of this method, I guess, which is why the Kaggle tutorial suggests using word2vec.

Making a document term matrix and removing sparse terms involves these easy steps.

dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
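
If you're curious how aggressive that filter is: removeSparseTerms() drops every term whose sparsity exceeds the threshold, so 0.999 keeps only terms that appear in at least roughly 0.1% of the ~50,000 documents. A quick, optional check (using the objects above; the 5000 cutoff is arbitrary):

dim(dtm)                            # documents x terms kept after filtering
findFreqTerms(dtm, lowfreq = 5000)  # terms occurring at least 5000 times overall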


Convert your document term matrix to a data frame.

dtmsparse <- as.data.frame(as.matrix(dtm))

Now, be patient! This might take some time to finish.

Using R, one starts loving data frames for the ease they bring. You can examine ‘dtmsparse’ to get an idea of the term frequencies; for instance, you can find the most and least frequent terms by taking column sums (a quick sketch follows the split below). That's just exploration, though; we don't need it at the moment. We will now separate the training and test data sets that we merged earlier. Remember, the training and test sets each consist of 25,000 rows. To train our model, we need to add the sentiments back to the training set.

train <- dtmsparse[1:25000,]
test <- dtmsparse[25001:50000,]
train$sentiment <- labeledtrain$sentiment
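
As an aside, here is what that column-sum exploration might look like (entirely optional; it just reuses the ‘dtmsparse’ data frame built above):

termfreq <- colSums(dtmsparse)                # total count of each stem across all 50,000 reviews
head(sort(termfreq, decreasing = TRUE), 10)   # most frequent stems
head(sort(termfreq), 10)                      # least frequent stems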

Let’s now implement our model. We are going to use ‘rpart’ (recursive partitioning, i.e. decision trees) for our predictions. To load this library, use the following command:
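
library(rpart)  # rpart ships with R, so there is usually nothing extra to install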

Build your model using ‘rpart’:

model <- rpart(sentiment ~., data = train, method= 'class')

To build this model, we have used all the independent variables (the term columns) in the training data set; that is what the ‘~ .’ in the formula means.
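
If you are curious what the fitted tree looks like, base rpart can draw it. This is a quick, optional sketch; the uniform, margin and cex values are just cosmetic choices:

plot(model, uniform = TRUE, margin = 0.1)  # draw the tree skeleton
text(model, cex = 0.7)                     # label the splits and leaves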

Woah! You just successfully built your first sentiment analysis model. That was easy, right? Now, let's quickly finish this basic analysis by predicting sentiments on the test set. R's predict() takes the model you built, the test set, and a few other arguments (in this case, we need to specify that the response type is ‘class’).

mypred <- predict(model, newdata =test, type = 'class')


There you have it, your own sentiment analysis model. Save the predictions to a CSV file and submit it to Kaggle to get a prediction accuracy of ~71%.
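
Writing the submission file could look something like this; the ‘id’/‘sentiment’ column names follow Kaggle's sample submission for this competition, the file name is arbitrary, and the gsub() call just strips the stray quotes that came from reading the .tsv with quote='':

submission <- data.frame(id = gsub('"', '', testData$id), sentiment = mypred)
write.csv(submission, file = 'bag_of_words_submission.csv', row.names = FALSE, quote = FALSE)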

In another series, I will write about predicting sentiments using more mature algorithms such as Google’s word2vec. Meanwhile, if you are finding this tutorial helpful, you can start experimenting with other models such as Naive Bayes and see if you can improve your results ;-).

Let me know your thoughts in the comments, folks!

10 comments

  • It’s really a nice and helpful piece of information. I’m satisfied that
    you shared this helpful information with us.

    Please keep us up to date like this. Thanks for sharing.

  • Very good tutorial. A nice introduction for beginners. Thank you!

  • So, mypred has a csv file with the prediction. But what file is that comparing to? In other words, what columns do I match this up against to see how well my prediction did?

    • Mypred was supposed to be uploaded to Kaggle to cross-check the accuracy of the predictions. However, now that this competition has closed, you can split the training data set into train and test sets to check your prediction accuracy.
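
      For example, a rough sketch of that split, reusing the ‘train’ data frame built in the post (the 80/20 split and the seed are arbitrary):

      set.seed(123)
      idx <- sample(nrow(train), 0.8 * nrow(train))
      trainpart <- train[idx, ]
      validpart <- train[-idx, ]
      m <- rpart(sentiment ~ ., data = trainpart, method = 'class')
      p <- predict(m, newdata = validpart, type = 'class')
      mean(p == validpart$sentiment)  # share of correct predictions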

      • Parikshit,

        I took a snapshot of part of my prediction (of a different data set that I was predicting on), and it looks like the following: http://imgur.com/KQLNgjp. So, how do I take the column A, consisting of the character(0).1, character(0).2, etc… and compare this to my unlabeled training set that I created? How do I check how well it did? Thanks for your help!

        • Hey Nick, Column A is just a bad product of this method, I usually remove it using row.names(dataframe) <- NULL. As I said earlier, you should split the labelled dataset into two pieces and check for accuracy. Let me know if there’s something I missed.

  • Hi Parikshit. I have finance data in which I have to predict the sentiment of conversations between mutual fund associates. I have data of around 20 rows; I need a large financial data set. How do I get this? Also, please share your word2vec R code for sentiment analysis.

    • Parikshit Joshi

      Hi Navdeep,

      If you have less data, I would advise bootstrapping your data to generate a bigger data set. I couldn’t find my word2vec submission that I used for this particular Kaggle competition. Hopefully, I’ll rewrite it within this week and share it with you.

      Best Regards,
      Parikshit

  • An Exploration of Graph-based Algorithms for Cross-domain Sentiment Classification
