Sentiment analysis via machine learning – Bag of Words Meets Bags of Popcorn
This is a no-frills tutorial on predicting sentiment using R. The following is part of a Kaggle competition that I took part in. I must tell you in advance that this tutorial is basically a ‘101’ of predicting sentiment using bag of words. A more advanced way to predict sentiment would be to use word2vec by Google. This tutorial will get you started on predicting sentiment.
Moving on, the first few baby steps involve reading the data from your directory.
You can use the following commands to check or change the working directory:
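The directory commands themselves aren’t shown above, so here is a minimal sketch using base R (the path is only a placeholder – substitute your own data folder):

```r
getwd()                      # print the current working directory
# setwd("~/kaggle/popcorn")  # uncomment and point at your own data folder
list.files()                 # check that the competition files are visible here
```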
Everything looks perfect? Good. Let’s now read our data and see what it looks like. I generally take a sneak peek at the data files to optimize my code to read them faster. But let’s not talk about that in this tutorial, and start begging for some more RAM instead!
To read your files into R, you can use the read.csv() command as follows:
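The read commands themselves aren’t shown, so here is a hedged sketch. This competition’s Kaggle files are tab-separated (.tsv), hence sep = "\t"; quote = "" keeps the embedded quotation marks you can see in the str() output. The file names assume the Kaggle downloads sit in the working directory:

```r
# Read the labeled training set and the unlabeled test set
labeledtrain <- read.csv("labeledTrainData.tsv", sep = "\t", quote = "")
testData     <- read.csv("testData.tsv", sep = "\t", quote = "")
```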
Now that you have read your data in R, let’s check out the structure and type of data in our data frames.
> str(labeledtrain)
'data.frame':	25000 obs. of  3 variables:
 $ id       : Factor w/ 25000 levels "\"0_3\"","\"0_9\"",..: 15704 8076 20024 10851 23882 20996 18707 1415 9871 22146 ...
 $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
 $ review   : Factor w/ 24904 levels "\"\b\b\b\bA Turkish Bath sequence in a film noir located in New York in the 50's, that must be a hint at something ! Something "| __truncated__,..: 24348 834 17316 12571 16735 7766 20881 11186 1345 1049 ...
> str(testData)
'data.frame':	25000 obs. of  2 variables:
 $ id    : chr  "\"12311_10\"" "\"8348_2\"" "\"5828_4\"" "\"7186_2\"" ...
 $ review: chr  "\"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it"| __truncated__ "\"This movie is a disaster within a disaster film. It is full of great action scenes, which are only meaningful if you throw aw"| __truncated__ "\"All in all, this is a movie for kids. We saw it tonight and my child loved it. At one point my kid's excitement was so great "| __truncated__ "\"Afraid of the Dark left me with the impression that several different screenplays were written, all too short for a feature l"| __truncated__ ...
Now, if we wish to apply a bag of words model to this data, we first need to combine both data sets in order to make one giant bag of words with common features.
‘tm’ is a text mining library in R. Since our data is now in the appropriate format, we can start using text mining methods on it.
If you paid attention, we excluded ‘sentiment’ from ‘alldata’; I’ll add it back later.
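The merge step itself isn’t shown, so here is one way to build ‘alldata’ with rbind(), keeping only the id and review columns (sentiment is deliberately left out, as noted above):

```r
# Stack train and test reviews so both share one vocabulary later;
# sentiment is dropped so the two frames have identical columns
alldata <- rbind(
  data.frame(id = labeledtrain$id, review = labeledtrain$review),
  data.frame(id = testData$id, review = testData$review)
)
```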
If you don’t have ‘tm’ installed, you can install it with this command:
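A sketch of the install command (stemDocument(), used later, additionally relies on the SnowballC package, so it is worth installing at the same time):

```r
install.packages("tm")
install.packages("SnowballC")  # needed by stemDocument() later on
```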
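After installation, load the ‘tm’ package:

```r
library(tm)
```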
If you look at the text, there are a few things you already know we need to do. First, we need to convert our text into a machine-readable form (a Corpus). That’s easy: all you need to do in this case is use ‘Corpus(VectorSource(text))’. There are other situations where lowercasing text gets a little different from the standard process, e.g. scraped data, which might contain Unicode characters. To make the corpus more machine readable, you should remove punctuation and stopwords, lowercase all the text, and stem it. Fortunately, R provides easy steps to accomplish this:
corpus <- Corpus(VectorSource(alldata$review))
corpus <- tm_map(corpus, content_transformer(tolower))  # content_transformer() keeps the corpus class intact on current tm versions
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument)
We can now convert this corpus into a document term matrix (DTM). What is that, you ask? A DTM is a matrix that shows the frequency of terms in each document. In our case, we are going to find out how frequently words occur in a particular review. Our matrix consists of a lot of terms with frequencies ranging from zero to a few million (joking!). It is definitely a good idea to reduce the number of terms (features). I tried different values with removeSparseTerms, but after a prediction threshold (~0.71), it didn’t matter how many terms I kept; the results wouldn’t change. That is a limitation of this method, I guess, and why this tutorial suggests using word2vec afterwards.
Making a document term matrix and removing sparse terms involves these easy steps.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.999)
Convert your document term matrix to a data frame.
Now, be patient! This might take some time to finish.
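The conversion command isn’t shown, but the ‘dtmsparse’ name used below suggests something along these lines. as.matrix() densifies the whole DTM at once, which is why this step is slow and memory hungry on 50,000 reviews:

```r
# Densify the sparse DTM and wrap it as a data frame
dtmsparse <- as.data.frame(as.matrix(dtm))
# Make sure every term becomes a syntactically valid column name for modelling
colnames(dtmsparse) <- make.names(colnames(dtmsparse))
```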
Using R, one starts loving data frames for the ease they bring. You can examine ‘dtmsparse’ to get an idea of the term frequencies; you can find the most and least frequent terms by taking column sums. That’s just exploration, though, and we don’t need it at the moment. We will now separate the training and test sets that we merged earlier. Remember that the training and test sets each consisted of 25,000 rows. To train our model, we need to add sentiment back to the training set.
train <- dtmsparse[1:25000,]
test <- dtmsparse[25001:50000,]
train$sentiment <- labeledtrain$sentiment
Let’s now implement our model. We are going to use ‘rpart’ for our predictions. To load this library, use the following command:
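Loading it is one line (rpart ships with the standard R distribution as a recommended package, so no install is usually needed):

```r
library(rpart)
```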
Build your model using ‘rpart’:
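The call itself isn’t shown; given that all independent variables are used (as the next sentence says), it was presumably something along these lines. method = 'class' is an assumption that matches the type = 'class' prediction made later:

```r
# Classification tree on every term column; method = 'class' because
# sentiment is a 0/1 label, not a quantity to regress on
model <- rpart(sentiment ~ ., data = train, method = 'class')
```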
To build this model, we used all the independent variables in the training data set.
Whoa! You just successfully built your first sentiment analysis model. That was easy, right? Now let’s quickly finish this basic analysis by predicting sentiment on the test set. R’s predict() takes the model you built, the test set, and a few other arguments (in this case, we need to tell it that the response type is ‘class’).
mypred <- predict(model, newdata = test, type = 'class')
There you have it: your own sentiment analysis model. Save the predictions, along with the ids, in a CSV file to get a prediction accuracy of ~71%.
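The write-out isn’t shown; here is a minimal sketch of the submission file. The two-column id/sentiment layout is what the competition expects, and the file name is only an example:

```r
# Kaggle wants exactly two columns: id and sentiment
submission <- data.frame(id = testData$id, sentiment = mypred)
write.csv(submission, file = "bag_of_words_submission.csv", row.names = FALSE)
```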
In another post, I will write about predicting sentiment using more mature algorithms such as Google’s word2vec. Meanwhile, if you find this tutorial helpful, you can start experimenting with other models such as Naive Bayes and see if you can improve your results ;-).
Let me know your thoughts in the comments, folks!