Introduction

The purpose of this project is to identify and analyze tweets about topics being discussed on Twitter. Given a topic defined by the user, it gathers the most recent tweets and extracts keywords in order to analyze sentiment around the topic: whether people view the topic positively or negatively, and what kinds of emotions they express about it. Furthermore, it also allows the user to analyze the sentiment of arbitrary text.

The project is written in ‘R’, a language for statistical computing and data manipulation, and the web app is published using the ‘Shiny’ web framework.
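
As a rough sketch only (the published app’s actual code is in the repository linked below), a Shiny front end for this analysis could be wired up along these lines; the layout and the input/output names here are illustrative assumptions, not the project’s real ones:

library(shiny)
library(rtweet)

# illustrative skeleton: a topic input drives the tweet search
ui <- fluidPage(
  textInput("topic", "Topic keyword", value = "Bitcoin"),
  actionButton("go", "Analyze"),
  plotOutput("sentiment_plot")
)

server <- function(input, output) {
  # fetch recent tweets whenever the user clicks Analyze
  tweets <- eventReactive(input$go, {
    search_tweets(input$topic, n = 2000, include_rts = FALSE)
  })

  output$sentiment_plot <- renderPlot({
    df <- tweets()
    # the cleaning and sentiment steps described below would go here;
    # as a placeholder, plot the distribution of tweet lengths
    hist(nchar(df$text), main = "Tweet lengths", xlab = "Characters")
  })
}

shinyApp(ui, server)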



Applications



Demo

  • Source code: https://github.com/Sambhav101/Twitter-Sentiment-Analysis-in-R
  • Project: https://sambhav101.shinyapps.io/Twitter-Sentiments/




  • Project Overview

    The first step of the project is to import the libraries that provide functions for data acquisition and data wrangling. To fetch data from Twitter, I created a Twitter developer account, which provides the credentials and tokens needed to connect to the Twitter API. rtweet is a library that provides functions to fetch tweet data by keyword, date, username, and other parameters. The other libraries used are dplyr, tidyr, and tidytext for cleaning the data, ggplot2 for creating data visualizations, wordcloud for building word clouds, and so on.


    Imported Libraries

    # load rtweet from CRAN to access the Twitter API
    library(rtweet)
    library(dplyr)      # data manipulation verbs (select, filter, mutate)
    library(tidyr)      # data tidying
    library(tidytext)   # tokenization, stop words, sentiment lexicons
    library(ggplot2)    # data visualization
    library(textdata)   # downloads lexicons such as NRC for tidytext
    library(wordcloud)  # wordcloud() and comparison.cloud()
    library(reshape2)   # acast() to reshape data for comparison.cloud()
    library(textclean)  # additional text-cleaning helpers

    Authenticate by creating an access token

    # authenticate via access token; the credentials below come from
    # the app registered in the Twitter developer portal
    token <- create_token(
      app = app_name,
      consumer_key = api_key,
      consumer_secret = api_secret_key,
      access_token = access_token,
      access_secret = access_secret_token
    )



    Searching the tweets containing keyword on Twitter

    search_tweets() is a function in the rtweet library that lets users search tweets by parameters such as the number of tweets, specific keywords, whether to include retweets, date, and more. For this project, I gather the 2,000 most recent original tweets (excluding retweets) containing the keyword “Bitcoin”.

    # gather the 2,000 most recent tweets containing the keyword;
    # include_rts = FALSE drops retweets (replies could additionally be
    # excluded by appending "-filter:replies" to the query)
    var <- 'Bitcoin'
    stock1 <- search_tweets(
      var,
      n = 2000,
      include_rts = FALSE
    )



    Data Cleaning and Pre-processing

    This is the second most important part of the project. Before we analyze sentiment, the data needs to be cleaned and processed. The search_tweets() function returns all the details of each tweet, including the original poster, the date posted, the number of characters, the source, the text of the post, the numbers of likes, comments, and retweets, user interaction data, and more. Since we are only interested in the text of the tweet, we use only the tweet text for the analysis.

    Furthermore, the original tweets still contain a lot of unnecessary information, such as mentions of other users, hashtags, links, stop words (words like a, an, the, he, she, etc.), vulgar words, foreign-language text, and more. R provides several libraries that help us filter out such words. After all the words have been filtered, we move to the next step, which is sentiment classification.

    Raw Data

    stock1


    Select only the text of each tweet from the tweet data

    tweets.stock1 <- stock1 %>% select(text)
    tweets.stock1


    Clean up the text by removing links, mentions, and hashtags, tokenizing it into lowercase words without punctuation, and filtering out stop words.

    # remove links, $cashtags, @mentions, and #hashtags
    tweets.stock1$stripped_text <- gsub('http\\S+|\\$\\S+|\\@\\S+|\\#\\S+', "", tweets.stock1$text)
    
    # create a token for each word in stripped_text (unnest_tokens strips
    # punctuation and, by default, lowercases the text)
    tweets.stock1_words <- tweets.stock1 %>% select(stripped_text) %>% unnest_tokens(word, stripped_text)
    
    # remove stop words, numbers, vulgar words, the keyword itself,
    # and single-character tokens
    tweets.stock1_filtered <- tweets.stock1_words %>% 
      anti_join(stop_words) %>%
      filter(!grepl("\\d|@|amp|nigg|fuc", word)) %>%
      filter(!grepl(gsub(" ", "|", var), word, ignore.case = TRUE)) %>%
      filter(nchar(word) > 1)
    
    tweets.stock1_filtered



    Top 10 words associated with the keyword

    Here, we find the 10 most frequently used words in the tweets containing our keyword. Although some words may not appear in a dictionary, we keep them because they may be slang and could carry useful information.

    # top 10 words in the tweets
    tweets.stock1_filtered %>%
      count(word, sort = TRUE) %>%
      top_n(10) %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(x = word, y = n)) +
      geom_col() +
      coord_flip() +
      theme_classic() +
      labs(x = NULL, y = "Count", title = "Unique words found in tweets containing Bitcoin")



    Sentiment Analysis

    Here, we identify the words that carry sentiment value, positive or negative, and distribute them into categories, keeping track of how many times each word appears. Notice that some of the top words found in the last table are not present here. This is because we keep only words that carry either positive or negative sentiment; words that hold no sentiment, or that are not present in the bing lexicon, are not included. In the project, I also used the NRC lexicon to classify words into emotions such as surprise, disgust, fear, joy, and sadness (a sketch of this is shown after the bing code below).


    The bing lexicon is a collection of words labeled as either positive or negative

    # bing sentiment analysis: keep only words found in the bing lexicon
    bing_stock1 <- tweets.stock1_filtered %>%
      inner_join(get_sentiments("bing")) %>%
      count(word, sentiment, sort = TRUE) %>%
      ungroup()
    
    bing_stock1
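
    As a minimal sketch of the NRC classification mentioned above (assuming the NRC lexicon has already been downloaded via the textdata package), the same filtered words can be joined against the NRC lexicon and counted per emotion:

    # NRC sentiment analysis: count how many words fall into each emotion
    nrc_stock1 <- tweets.stock1_filtered %>%
      inner_join(get_sentiments("nrc")) %>%
      count(sentiment, sort = TRUE)
    
    nrc_stock1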



    Data Visualization

    The final and most important part of this project is data visualization. Visualizations turn data into meaning. Here, I use the ggplot2 library not just to visualize the data but also to reshape it into comparison tables and graphs.

    Sentiment contribution of each word

    bing_stock1 %>%
      group_by(sentiment) %>%
      top_n(20) %>%
      ungroup() %>%
      mutate(word = reorder(word, n)) %>%
      ggplot(aes(word, n, fill = sentiment)) +
      geom_col(show.legend = FALSE) +
      facet_wrap(~sentiment, scales = "free_y") +
      coord_flip() +
      theme_bw() +
      labs(title = "Tweets containing Bitcoin", x = NULL, y = "Contribution to sentiment")


    Wordcloud visualization

    tweets.stock1_filtered %>%
      count(word) %>%
      with(wordcloud(word, n, max.words=200))

    Visualization of positive and negative words in a comparison word cloud

    bing_stock1 %>%
      acast(word ~ sentiment, value.var = 'n', fill=0) %>%
      comparison.cloud(colors=c("#F8766D", "#00BFC4"),
                       max.words=100)






    Conclusion

    Currently, the app only supports lexicon-based sentiment classification. In other words, it identifies individual words and classifies them as positive or negative. It fails when a post contains negative words but has a positive meaning as a whole (for example, “Bitcoin is not a bad investment”). The goal of the next version is to apply NLP techniques so that the app can analyze not only individual words but whole sentences.
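
    One possible direction, sketched here with the sentimentr package (an assumption of mine, not part of the current project), is sentence-level scoring that accounts for negators and other valence shifters:

    library(sentimentr)
    
    # sentence-level sentiment that handles valence shifters such as "not";
    # a word-by-word lexicon lookup would score "bad" here as negative
    sentiment_by("Bitcoin is not a bad investment at all.")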




    Thank you