Introduction

The purpose of this project is to identify and analyze tweets about topics being discussed on Twitter. Given a topic defined by the user, it gathers the most recent tweets and extracts keywords in order to analyze sentiment around the topic: whether people view the topic positively or negatively, and what kinds of emotions they express about it. Furthermore, it also allows the user to analyze the sentiment of arbitrary text.

The project is written in ‘R’, a language for statistical computing and data manipulation, and the web app is published using the ‘Shiny’ web framework.
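
As a rough sketch only (the published app’s actual code is in the repository linked below), a Shiny front end for this analysis could be wired up along these lines; the layout and the input/output names here are illustrative assumptions, not the project’s real ones:

library(shiny)
library(rtweet)

# illustrative skeleton: a topic input drives the tweet search
ui <- fluidPage(
  textInput("topic", "Topic keyword", value = "Bitcoin"),
  actionButton("go", "Analyze"),
  plotOutput("sentiment_plot")
)

server <- function(input, output) {
  # fetch recent tweets whenever the user clicks Analyze
  tweets <- eventReactive(input$go, {
    search_tweets(input$topic, n = 2000, include_rts = FALSE)
  })

  output$sentiment_plot <- renderPlot({
    df <- tweets()
    # the cleaning and sentiment steps described below would go here;
    # as a placeholder, plot the distribution of tweet lengths
    hist(nchar(df$text), main = "Tweet lengths", xlab = "Characters")
  })
}

shinyApp(ui, server)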



Applications



Demo

  • Source code: https://github.com/Sambhav101/Twitter-Sentiment-Analysis-in-R
  • Project: https://sambhav101.shinyapps.io/Twitter-Sentiments/




  • Project Overview

    The first step of the project is to import the libraries that provide functions for data acquisition and data wrangling. To fetch data from Twitter, I created a Twitter developer account, which provides the credentials and tokens needed to connect to the Twitter API. rtweet is a library that provides functions to fetch tweet data by keyword, date, username, and other parameters. The other libraries used are dplyr, tidyr, and tidytext for cleaning the data, ggplot2 for creating data visualizations, wordcloud for building word clouds, and so on.


    Imported Libraries

    # load rtweet from CRAN to access the Twitter API
    library(rtweet)
    library(dplyr)      # data manipulation verbs (select, filter, mutate)
    library(tidyr)      # data tidying
    library(tidytext)   # tokenization, stop words, sentiment lexicons
    library(ggplot2)    # data visualization
    library(textdata)   # downloads lexicons such as NRC for tidytext
    library(wordcloud)  # wordcloud() and comparison.cloud()
    library(reshape2)   # acast() to reshape data for comparison.cloud()
    library(textclean)  # additional text-cleaning helpers

    Authenticate by creating an access token

    # authenticate via access token; the credentials below come from
    # the app registered in the Twitter developer portal
    token <- create_token(
      app = app_name,
      consumer_key = api_key,
      consumer_secret = api_secret_key,
      access_token = access_token,
      access_secret = access_secret_token
    )



    Searching the tweets containing keyword on Twitter

    search_tweets() is a function in the rtweet library that lets users search tweets by parameters such as the number of tweets, specific keywords, whether to include retweets, date, and more. For this project, I gather the 2,000 most recent original tweets (excluding retweets) containing the keyword “Bitcoin”.

    # gather the 2,000 most recent tweets containing the keyword;
    # include_rts = FALSE drops retweets (replies could additionally be
    # excluded by appending "-filter:replies" to the query)
    var <- 'Bitcoin'
    stock1 <- search_tweets(
      var,
      n = 2000,
      include_rts = FALSE
    )



    Data Cleaning and Pre-processing

    This is the second most important part of the project. Before we analyze sentiment, the data needs to be cleaned and processed. The search_tweets() function returns all the details of each tweet, including the original poster, the date posted, the number of characters, the source, the text of the post, the numbers of likes, comments, and retweets, user interaction data, and more. Since we are only interested in the text of the tweet, we use only the tweet text for the analysis.

    Furthermore, the original tweets still contain a lot of unnecessary information, such as mentions of other users, hashtags, links, stop words (words like a, an, the, he, she, etc.), vulgar words, foreign-language text, and more. R provides several libraries that help us filter out such words. After all the words have been filtered, we move to the next step, which is sentiment classification.

    Raw Data

    stock1


    Select only the text of each tweet from the tweet data

    tweets.stock1 <- stock1 %>% select(text)
    tweets.stock1


    Clean up the text by removing links, mentions, and hashtags, tokenizing it into lowercase words without punctuation, and filtering out stop words.

    # remove links, $cashtags, @mentions, and #hashtags
    tweets.stock1$stripped_text <- gsub('http\\S+|\\$\\S+|\\@\\S+|\\#\\S+', "", tweets.stock1$text)
    
    # create a token for each word in stripped_text (unnest_tokens strips
    # punctuation and, by default, lowercases the text)
    tweets.stock1_words <- tweets.stock1 %>% select(stripped_text) %>% unnest_tokens(word, stripped_text)
    
    # remove stop words, numbers, vulgar words, the keyword itself,
    # and single-character tokens
    tweets.stock1_filtered <- tweets.stock1_words %>% 
      anti_join(stop_words) %>%
      filter(!grepl("\\d|@|amp|nigg|fuc", word)) %>%
      filter(!grepl(gsub(" ", "|", var), word, ignore.case = TRUE)) %>%
      filter(nchar(word) > 1)
    
    tweets.stock1_filtered



    Top 10 words associated with the keyword

    Here, we find the 10 most frequently used words in the tweets containing our keyword. Although some words may not appear in a dictionary, we keep them because they may be slang and could carry useful information.

    # top 10 words in the tweets
    tweets.stock1_filtered %>%
      count(word, sort = TRUE) %>%
      top_n(10) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(x = word, y = n)) +
      geom_col() +
      coord_flip() +
      theme_classic() +
      labs(x = NULL, y = "Count", title = "Unique words found in tweets containing Bitcoin")



    Sentiment Analysis

    Here, we identify the words that carry sentiment value, positive or negative, and distribute them into categories, keeping track of how many times each word appears. Notice that some of the top words found in the last table are not present here. This is because we keep only words that carry either positive or negative sentiment; words that hold no sentiment, or that are not present in the bing lexicon, are not included. In the project, I also used the NRC lexicon to classify words into emotions such as surprise, disgust, fear, joy, and sadness (a sketch of this is shown after the bing code below).


    The bing lexicon is a collection of words labeled as either positive or negative

    # bing sentiment analysis: keep only words found in the bing lexicon
    bing_stock1 <- tweets.stock1_filtered %>%
      inner_join(get_sentiments("bing")) %>%
      count(word, sentiment, sort = TRUE) %>%
      ungroup()
    
    bing_stock1
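
    As a minimal sketch of the NRC classification mentioned above (assuming the NRC lexicon has already been downloaded via the textdata package), the same filtered words can be joined against the NRC lexicon and counted per emotion:

    # NRC sentiment analysis: count how many words fall into each emotion
    nrc_stock1 <- tweets.stock1_filtered %>%
      inner_join(get_sentiments("nrc")) %>%
      count(sentiment, sort = TRUE)
    
    nrc_stock1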



    Data Visualization

    The final and most important part of this project is data visualization. Visualizations turn data into meaning. Here, I use the ggplot2 library not just to visualize the data but also to reshape it into comparison tables and graphs.

    Sentiment contribution of each word

    bing_stock1 %>%
      group_by(sentiment) %>%
      top_n(20) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      coord_flip() +
      theme_bw() +
      labs(title = "Tweets containing Bitcoin", x = NULL, y = "Contribution to sentiment")


    Wordcloud visualization

    tweets.stock1_filtered %>%
      count(word) %>%
      with(wordcloud(word, n, max.words=200))

    Visualization of positive and negative words in a comparison word cloud

    bing_stock1 %>%
      acast(word ~ sentiment, value.var = 'n', fill=0) %>%
      comparison.cloud(colors=c("#F8766D", "#00BFC4"),
                       max.words=100)






    Conclusion

    Currently, the app only supports lexicon-based sentiment classification. In other words, it identifies individual words and classifies them as positive or negative. It fails when a post contains negative words but has a positive meaning as a whole (for example, “Bitcoin is not a bad investment”). The goal of the next version is to apply NLP techniques so that the app can analyze not only individual words but whole sentences.
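
    One possible direction, sketched here with the sentimentr package (an assumption of mine, not part of the current project), is sentence-level scoring that accounts for negators and other valence shifters:

    library(sentimentr)
    
    # sentence-level sentiment that handles valence shifters such as "not";
    # a word-by-word lexicon lookup would score "bad" here as negative
    sentiment_by("Bitcoin is not a bad investment at all.")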




    Thank you