Twitter Sentiment Analysis Using R: Amazon vs Walmart Sentiment Scores [Tutorial]

Amazon and Walmart are both extremely disruptive organizations. With this tutorial on Twitter sentiment analysis using R, I wanted to see how the Twitterverse feels about Amazon compared with Walmart.

Mining Twitter for Sentiment analysis using R

Twitter is my obvious choice whenever I need to quickly source data for sentiment-related work. In this tutorial, I am using the Twitter search API as opposed to the Twitter streaming API. To understand the difference between the two, check out this blog of mine that highlights the differences between the Twitter streaming API and the Twitter search API.

What will you learn in this blog post?

  • How to mine Twitter data using the Twitter search API and keywords
  • How to clean Twitter data for sentiment analysis
  • A simple sentiment-based scoring model for tweets

Let’s get started.

Mining Twitter data using Twitter search API and keywords

On your R console, enter the following commands to make sure your local environment is ready to mine data from Twitter:


# Install the dependencies, then grab twitteR itself from GitHub
install.packages(c('devtools', 'rjson', 'bit64', 'httr'))
library(devtools)
install_github('geoffjentry/twitteR')
library(twitteR)
 

See a lot of libraries there? They all play an important role in making twitteR work – httr, for example, handles authorization and caching.

Note: twitteR is being deprecated; rtweet is now the preferred package.
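If you would rather start with rtweet, the equivalent workflow looks roughly like the sketch below (function names follow rtweet's pre-1.0 API, so double-check against your installed version; you will need the same four credentials described in the next step):

# Sketch of the rtweet equivalent (pre-1.0 API); argument names may
# differ in newer versions of the package
install.packages('rtweet')
library(rtweet)
token <- create_token(app = 'your_app_name',
                      consumer_key = 'API key',
                      consumer_secret = 'API secret',
                      access_token = 'access token',
                      access_secret = 'access token secret')
Amazon_tweets_df <- search_tweets('amazon', n = 2000, token = token)

The rest of this tutorial sticks with twitteR, though.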

Moving on, you need to create an account on https://apps.twitter.com/ and create an application. Doing so successfully will provide you with an API key, an API secret, an access token and an access token secret.

Now, using these four you can set up Twitter authentication:


api_key <- 'Weird looking API key'
api_secret <- 'Extremely long API secret'
access_token <- 'Even longer access token'
access_token_secret <- 'Last of those, finally!'
 

Make sure that you don’t share these four parameters with anyone.
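One common way to keep them out of your scripts entirely is to read them from environment variables. A minimal sketch (the variable names below are my own choice, not a twitteR convention):

# Optional: store each credential in ~/.Renviron (one KEY=value per line)
# and read it at runtime so it never appears in your code
api_key             <- Sys.getenv('TWITTER_API_KEY')
api_secret          <- Sys.getenv('TWITTER_API_SECRET')
access_token        <- Sys.getenv('TWITTER_ACCESS_TOKEN')
access_token_secret <- Sys.getenv('TWITTER_ACCESS_TOKEN_SECRET')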

Now, when you enter


 setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

you will be prompted to choose between two different types of authentication. I usually prefer direct authentication, but the choice is completely up to you.

Now, let’s fetch some data from Twitter. Excited?


Let’s fetch 2,000 tweets each for Amazon and Walmart (we’ll need both sets for the comparison later):

Amazon_tweets = searchTwitter("amazon", n = 2000)
Walmart_tweets = searchTwitter("walmart", n = 2000)
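By the way, searchTwitter() also accepts optional filters if you want more control over what comes back. A small sketch (the date below is just an illustrative value):

# Optional: restrict the search to English tweets posted since a given date
Amazon_tweets <- searchTwitter("amazon", n = 2000, lang = "en", since = "2017-01-01")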

If you don’t see any error, your code worked fine. Let’s check what it has scraped from Twitter by using the head function:

head(Amazon_tweets)

You should see an output similar to this (but with different content, obviously!):


 > head(Amazon_tweets)
 [[1]]
 [1] "ishi_shiro_kuro: シングル55作品の両A面曲を含む全63曲の映像が収録予定\nSMAP の Clip! Smap! コンプリートシングルス(初回生産分) [Blu-ray] https://t.co/LnMh5fQH6H"
 [[2]]
 [1] "foxnumber6: @arurenR 諫山創 の 進撃の巨人(17) (週刊少年マガジンコミックス) を Amazon でチェック! https://t.co/8PBh7gMVOW @さんから"
 [[3]]
 [1] "natume_aoi: RT @kimi_lica: @AKIDOsyoi こちらの3話目ですね。 潜入!スパイカメラ~ペンギン 極限の親子愛 Amazonビデオ ~ ジョン・ダウナー https://t.co/KWfTMq883l @AmazonJPさんから"
 .....
 

You just successfully downloaded tweets from Twitter using the search query “Amazon”. As you may have noticed, there is a lot of foreign-language text and noise in our data. Don’t worry about it. As we move forward in this blog post, I will show you how to clean this information and make it 100% ready to be consumed by our sentiment analysis scoring algorithm.

So, let’s start the cleaning process!

We will use “sapply” on our Twitter data to extract only the textual part. In case you didn’t know, Twitter sends us a huge amount of data that isn’t limited to text: we get the timestamp, retweet count, favourite count, etc., in addition to the text we want. So, we have to strip out the other parts and keep only the text we wish to analyze.

Here’s how you would do it:

#Cleaning tweets: keep only the text of each status object
amazon_t <- sapply(Amazon_tweets, function(x) x$getText())
walmart_t <- sapply(Walmart_tweets, function(x) x$getText())

Let’s now build some helper functions to clean the tweets and handle encoding errors and missing values in our data:


#Dealing with encoding errors and missing values
FindError = function(x){
  y = NA
  # tolower() can throw an error on badly encoded characters;
  # trap it and leave the tweet as NA in that case
  finderror = tryCatch(tolower(x), error = function(e) e)
  if(!inherits(finderror, 'error'))
    y = tolower(x)
  return(y)
}

cleanTweets <- function(tweet){
  # Remove URLs
  tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet)
  # Remove retweet/via markers and the handles that follow them
  tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", tweet)
  # Remove hashtags and @mentions
  tweet = gsub("#\\w+", " ", tweet)
  tweet = gsub("@\\w+", " ", tweet)
  # Remove punctuation and digits
  tweet = gsub("[[:punct:]]", " ", tweet)
  tweet = gsub("[[:digit:]]", " ", tweet)
  # Collapse repeated whitespace and trim leading/trailing spaces
  tweet = gsub("[ \t]{2,}", " ", tweet)
  tweet = gsub("^\\s+|\\s+$", "", tweet)
  tweet = FindError(tweet)
  tweet
}

cleanTweetsAndRemoveNAs <- function(Tweets){
  TweetsCleaned = sapply(Tweets, cleanTweets)
  # Remove the NAs produced by encoding errors
  TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)]
  names(TweetsCleaned) = NULL
  # Drop duplicate tweets (mostly retweets of the same text)
  TweetsCleaned = unique(TweetsCleaned)
  TweetsCleaned
}
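Before running these on the full dataset, here is a quick sanity check on a single made-up tweet (the input string is hypothetical):

# A hypothetical raw tweet to exercise the cleaning pipeline
cleanTweets("RT @someone: Loving my new Kindle!!! #amazon https://t.co/abc123")
# Expected result: the RT marker, handle, hashtag, URL, punctuation and
# extra whitespace are stripped and the text is lower-cased:
# "loving my new kindle"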

Now, let’s apply these cleaning functions to our data.

#Cleaning Amazon's and Walmart's tweets
amazon_c <- cleanTweetsAndRemoveNAs(amazon_t)
walmart_c <- cleanTweetsAndRemoveNAs(walmart_t)
head(amazon_c)
head(walmart_c)

Here’s what the data looks like before and after cleaning:

[Screenshot: Amazon vs Walmart tweets before and after cleaning]

If you haven’t run into any errors so far, great! Let’s move ahead with building our vocabulary of sentiments.

Sentiment word lists

To do some quick and dirty work here, I used the list of positive and negative words from Williamgun’s GitHub repository.

Building this vocabulary of positive and negative sentiments is extremely simple.

#Estimating sentiments
opinion.lexicon.pos = scan('positive-words.txt', what = 'character')
opinion.lexicon.neg = scan('negative-words.txt', what = 'character')
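One gotcha worth mentioning: if your copies of these files come from the original Hu and Liu opinion-lexicon distribution, they start with a block of comment lines prefixed with ';', and scan() will read those as words too. If that applies to you, a comment.char argument skips them:

# Skip the ';'-prefixed comment header present in the original lexicon files
opinion.lexicon.pos = scan('positive-words.txt', what = 'character', comment.char = ';')
opinion.lexicon.neg = scan('negative-words.txt', what = 'character', comment.char = ';')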

Let’s take a look at what we just imported into our console:

[Screenshot: the positive and negative word lists as imported into R]

Whenever you build a sentiment vocabulary from an external source, make sure you add some industry-specific keywords as well. For example, waiting for anything in a brick-and-mortar or eCommerce business would be considered negative, so I would go ahead and add that keyword to my vocabulary.

#Adding industry specific keywords
pos.words = c(opinion.lexicon.pos)
neg.words = c(opinion.lexicon.neg, 'wait')

Here’s what these positive and negative words look like once placed in our sentiment analysis library:

[Screenshot: the extended positive and negative word lists]

Let’s create a function that computes sentiment scores using our vocabulary of positive and negative words:

#Creating a sentiment score
getSentimentScore = function(sentences, words.positive, words.negative, .progress = 'none'){
  require(plyr)
  require(stringr)
  scores = laply(sentences, function(sentence, words.positive, words.negative){
    # Strip digits, punctuation and control characters, then lower-case
    sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '', gsub('\\d+', '', sentence)))
    sentence = tolower(sentence)
    # Split the sentence into individual words
    words = unlist(str_split(sentence, '\\s+'))
    # Flag every word that appears in the positive/negative lexicon
    pos.matches = !is.na(match(words, words.positive))
    neg.matches = !is.na(match(words, words.negative))
    # Score = number of positive matches minus number of negative matches
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, words.positive, words.negative, .progress = .progress)
  return(data.frame(text = sentences, score = scores))
}
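Before scoring the full datasets, a quick sanity check on two made-up sentences (“great” appears in the positive lexicon, and we added “wait” to the negative one):

# Hypothetical inputs: expect a score of +1 for the first sentence
# ("great") and -1 for the second ("wait")
getSentimentScore(c('great prices', 'such a long wait'), pos.words, neg.words)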

Let’s try to build sentiment scores based on the data we have for both Amazon and Walmart.

amazon_results = getSentimentScore(amazon_c, pos.words, neg.words)
amazon_results
walmart_results = getSentimentScore(walmart_c, pos.words, neg.words)
walmart_results

Here’s a sample of the sentiment scores that our R code produced:

[Screenshot: sample sentiment scores for Amazon and Walmart tweets]

Notice the <U+3010>-style entries (unconverted Unicode characters) that got a sentiment score of zero? Also, did you notice how @amazon’s own tweet at index 35 is rated positively? Nice, isn’t it?

Row 26 of the output, though, shows where this approach is flawed. The Twitter user was probably talking about a book with the word “dark” in its title, but our sentiment analysis algorithm scored the tweet negatively simply because it contains the negative word “dark”.

If I were to perform sentiment analysis using R at production grade, I would have to think about hundreds of these edge cases. Only then could we come up with something that can be reliably leveraged in a production system. But since this is an educational blog post about performing sentiment analysis, we can move on assuming that we did “something”.

Visualising the sentiment analysis data

The sample output I attached above isn’t the best way to glance at a result. With just one or two more lines, we can build charts and graphs that surface highly actionable information.

For the sake of simplicity, let’s quickly build a histogram of the distribution of sentiment scores for Amazon and Walmart.
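The resulting charts are below; here is a minimal base-R sketch of the code that would produce the first one (the title, label and colour are my own choices):

# Distribution of Amazon sentiment scores
hist(amazon_results$score, main = 'Amazon sentiment scores',
     xlab = 'Sentiment score', col = 'lightblue')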

[Histogram: distribution of Amazon sentiment scores]

Let’s do the same for Walmart as well (swap in walmart_results$score) and see the distribution.

[Histogram: distribution of Walmart sentiment scores]

To someone who’s new to sentiment analysis, these graphical representations may not be the best indicator of who wins on the sentiment scale. Let’s take a look at the mean score instead.

The mean score values for Amazon and Walmart are 0.215103 and 0.2123995, respectively.
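Those means come straight from the score column:

# Average sentiment score per brand
mean(amazon_results$score)   # 0.215103 on my run
mean(walmart_results$score)  # 0.2123995 on my run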

When I initially started with Amazon and Walmart, I thought they could serve as a really awesome ground for showcasing contrasting customer sentiment. But we didn’t really get to see anything revelatory. Maybe if I dug a little deeper into this data, we’d eventually see what we initially set out to find. But to keep the focus here on sentiment analysis and high-level sentiment visualization, we won’t go there.

Maybe in a future blog post. Hit me up on Twitter or comments if you want to see such a blog post!
