Week ten was an exciting week because sentiment analysis was introduced. In the week ten discussion, I posted about sentiment analysis of the Game of Thrones season 6 premiere. I feel that we did not cover the topic in depth, so my proposal is the following.
The election is over, and unless the Electoral College votes against Donald Trump, he will be president. I want to perform sentiment analysis using Twitter. My primary goal is to capture the mood of the people in November and December, classify tweets as positive, negative, or neutral, and identify the words behind each classification. I will apply the material and techniques learned in the course.
1 Scrape Twitter for data regarding the election (message, date, and possibly geographical location)
2 After cleaning the data, I will use MongoDB to store the information
3 The analysis will be performed by querying MongoDB and plotting with ggplot2
I will use R, MongoDB, Twitter, and R packages (tidyr, dplyr, tm, ggplot2)
The goal of the final project was to gather information from a social media website, clean the data, and then transform and classify it. The topic intrigues me because of its many possible applications. I see this project as a small step toward a sentiment-based market research tool that could later include many other social media sites.
I start by connecting to the Twitter API. I first tried to connect using library("ROAuth"), but because the API did not validate my access code, I searched for a different implementation. What worked for me was direct access authentication with the Twitter API.
#options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
consumerKey = ""
consumerSecret = ""
accessToken =""
accessTokenSecret=""
#reqURL = "https://api.twitter.com/oauth/request_token" # must be https; Twitter requires a secure connection
#accessURL = "https://api.twitter.com/oauth/access_token"
#authURL = "https://api.twitter.com/oauth/authorize"
#twitCred = OAuthFactory$new(consumerKey=consumerKey,consumerSecret=consumerSecret,requestURL=reqURL,accessURL=accessURL,authURL=authURL)
#twitCred$handshake()
#registerTwitterOAuth(twitCred)
#setup_twitter_oauth(consumerKey,consumerSecret,accessToken,accessTokenSecret)
Two search methods were applied with twitteR and the API connection. The first was searchTwitter and the second was getUser. Using both gave me two different data sources to which I could apply sentiment analysis. In the first method, I searched for the term "trump" and downloaded the results, harvesting a total of 10,000 tweets. In the second method, I used Donald Trump's Twitter handle to collect the tweets from his timeline. I also tried to get his 17.7 M followers, but the direct connection only returned 56 followers. Finally, each data set was exported to a CSV file that was later uploaded to GitHub.
#tweets=searchTwitter("trump", n=10000,lang = "en")
#df = do.call("rbind", lapply(tweets, as.data.frame))
#write.csv(df, "Trump10000Tweets.csv", row.names=FALSE)
#TrumpTwiterAcct <- getUser("realDonaldTrump")
#donaldtweetslist = userTimeline(TrumpTwiterAcct, n=3200, includeRts=TRUE, excludeReplies=TRUE)
#tumpprofiletweetsdf = do.call("rbind", lapply(donaldtweetslist, as.data.frame))
#write.csv(tumpprofiletweetsdf, "realDonaldTrump3200Tweets.csv", row.names=FALSE)
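The do.call("rbind", lapply(..., as.data.frame)) idiom above stacks a list of per-tweet records into a single data frame before it is written to CSV. A minimal sketch of that idea in base R follows. The records, their two fields, and the names records, rows, and csvPath are made up for illustration; the real objects come from twitteR and carry many more columns.

```r
# Hypothetical stand-ins for the status objects returned by searchTwitter();
# each record is a plain list holding only two made-up fields.
records <- list(
  list(text = "first example tweet",  created = "2016-12-01 10:00:00"),
  list(text = "second example tweet", created = "2016-12-02 11:30:00"),
  list(text = "third example tweet",  created = "2016-12-03 09:15:00")
)

# Convert each record to a one-row data frame, then stack the rows.
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)
df   <- do.call("rbind", rows)

# Round-trip through a temporary CSV file, as the export step does.
csvPath <- tempfile(fileext = ".csv")
write.csv(df, csvPath, row.names = FALSE)
roundTrip <- read.csv(csvPath, stringsAsFactors = FALSE)
```

Setting row.names = FALSE keeps R's row index out of the file, so the CSV has exactly one column per field when it is read back.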
After uploading the data to GitHub, RCurl is used to bring it back into the project.
library(RCurl) # provides getURL()
url1 = "https://raw.githubusercontent.com/chrisestevez/DataAnalyticsProjects/master/FinalProject/Trump10000Tweets.csv"
Rdata1 = getURL(url1)
TrumpSearch = read.csv(text = Rdata1,header = TRUE,stringsAsFactors = F,sep=",")
head(TrumpSearch,5)
TrumpSearchText = as.vector(TrumpSearch$text)
url2 = "https://raw.githubusercontent.com/chrisestevez/DataAnalyticsProjects/master/FinalProject/realDonaldTrump3200Tweets.csv"
Rdata2 = getURL(url2)
TrumpPersonal = read.csv(text = Rdata2,header = TRUE,stringsAsFactors = F,sep=",")
TrumpPersonalText = as.vector(TrumpPersonal$text)
head(TrumpPersonal,5)
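The download step passes the string returned by getURL to read.csv through the text argument rather than a file path. A small sketch of that behaviour, using a made-up in-memory CSV (csvText and its contents are assumptions for illustration only):

```r
# A made-up two-row CSV held in memory, standing in for the string
# that getURL() returns.
csvText <- "text,created\n\"hello, world\",2016-12-01\n\"second row\",2016-12-02"

# text = tells read.csv to parse the string itself instead of opening a file.
parsed <- read.csv(text = csvText, header = TRUE, stringsAsFactors = FALSE)
```

Quoted fields keep embedded commas intact, which matters for tweet text.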
The sentiment analysis algorithm used here is based on the NRC Word-Emotion Association Lexicon of Saif Mohammad and Peter Turney. It uses a dictionary that associates words with eight different emotions and with a negative or positive sentiment. Please see the examples below.
library(syuzhet) # provides get_nrc_sentiment()
get_nrc_sentiment("Donald Trump is awesome and amazing I'm happy he is running for president")
get_nrc_sentiment("I hate Donald Trump he is a liar and deceiving person")
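To make the dictionary idea concrete, here is a minimal sketch of lexicon-based scoring in base R. The toy lexicon and the scoreText helper are hypothetical and invented for this illustration; they are not the NRC data and not how get_nrc_sentiment is implemented internally. The sketch only shows the general principle of counting category hits per word.

```r
# Toy lexicon: hypothetical word-to-category entries, NOT the real NRC data.
toyLexicon <- data.frame(
  word     = c("awesome", "amazing", "happy", "happy", "hate", "liar", "liar"),
  category = c("positive", "positive", "joy", "positive", "negative", "anger", "negative"),
  stringsAsFactors = FALSE
)

scoreText <- function(text, lexicon) {
  # Lower-case the text and split it into words on anything that is not a letter.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Collect the lexicon rows matched by each word occurrence.
  idx  <- unlist(lapply(words, function(w) which(lexicon$word == w)))
  hits <- lexicon$category[idx]
  # Count hits per category, including categories with zero hits.
  table(factor(hits, levels = sort(unique(lexicon$category))))
}

scoreText("He is awesome and amazing, I am happy", toyLexicon)
scoreText("I hate him, he is a liar", toyLexicon)
```

A word can belong to several categories at once (here "happy" counts toward both joy and positive), which is why the emotion and polarity totals are reported separately.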
In this part of the project, I investigate whether there is any pattern in emotion or sentiment in the Twitter community. I begin with the data obtained by searching for the term "trump". The tweets were converted into a vector to process the information effectively. I used gsub to remove various unwanted terms. Next, I applied the sentiment algorithm and merged the results with the original data. After merging the data, I used dplyr and tidyr to transform the data and ggplot2 to plot it.
head(TrumpSearchText,5)
## [1] "RT @AboveTopSecret: ACLU Threatens Donald Trump Via New York Times Ad #ATS https://t.co/WPc6NC5Ci3"
## [2] "#Breaking News: A Gang Of Trump Fans Just Viciously Attacked Peaceful Protesters https://t.co/bqqyPz2Vai"
## [3] "@Siclittlemonkey As far as I know Trump hasn't done anything illegal. Hillary, on the other hand, should be behind bars for her many crimes."
## [4] "RT @bannerite: #Shameless Donald Trump's sons behind nonprofit selling access to president-elect | Center for Public Integrity https://t.co<U+0085>"
## [5] "RT @feistybunnygirl: When u voted for Trump bc he promised to deport all the brown people, but then u realize he's going to cut ur SS & Med<U+0085>"
cleanTweet = gsub("\\b(rt|RT)\\b", "", TrumpSearchText) # remove the retweet marker as a whole word only
cleanTweet = gsub("http\\S+", "", cleanTweet) # remove links (the whole URL, not just the scheme)
cleanTweet = gsub("<.*?>", "", cleanTweet) # remove html tags
cleanTweet = gsub("@\\w+", "", cleanTweet) # remove @mentions
cleanTweet = gsub("[[:punct:]]", "", cleanTweet) # remove punctuation
cleanTweet = gsub("\r?\n|\r", " ", cleanTweet) # replace line breaks with a space
cleanTweet = gsub("[[:digit:]]", "", cleanTweet) # remove digits
cleanTweet = iconv(cleanTweet, to = "ASCII", sub = "") # remove non-ASCII characters such as Asian letters
cleanTweet = gsub("[ \t]{2,}", " ", cleanTweet) # collapse repeated spaces and tabs into one space
cleanTweet = gsub("^\\s+", "", cleanTweet) # remove whitespace at the beginning
cleanTweet = gsub("\\s+$", "", cleanTweet) # remove whitespace at the end
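Because the same cleaning steps are needed for both data sets, they could also be bundled into one function. The sketch below is one possible way to do that in base R. The name cleanTweetText is hypothetical, and the patterns are assumptions: they match the retweet marker only as a whole word and remove a link up to the next whitespace.

```r
# Hypothetical helper name; it bundles the same kind of gsub() steps used above.
cleanTweetText <- function(x) {
  x <- gsub("\\bRT\\b", "", x, ignore.case = TRUE) # retweet marker as a whole word
  x <- gsub("http\\S+", "", x)       # links, up to the next whitespace
  x <- gsub("@\\w+", "", x)          # @mentions
  x <- gsub("[[:punct:]]", "", x)    # punctuation
  x <- gsub("[[:digit:]]", "", x)    # digits
  x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace to one space
  trimws(x)                          # strip leading and trailing whitespace
}

cleanTweetText("RT @user: Party at 10pm!!  https://t.co/abc123")
# [1] "Party at pm"
```

Matching the marker as a whole word keeps the "rt" inside words such as "Party" intact, and collapsing whitespace to a single space keeps neighbouring words from being glued together.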
TrumpSearchSentiment = get_nrc_sentiment(cleanTweet)
head(TrumpSearchSentiment,5)
TrumpSearchFinalData = cbind(TrumpSearch,TrumpSearchSentiment)
library(tidyr) # gather(), separate()
library(dplyr) # group_by(), summarise(), select(), %>%
library(ggplot2)
plotData1 = gather(TrumpSearchFinalData, "sentiment", "values", 17:24) %>% # columns 17:24 hold the eight emotions
  group_by(sentiment) %>%
  summarise(Total = sum(values))
ggplot(data = plotData1, aes(x = sentiment, y = Total)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Emotions") + ylab("Total") + ggtitle("Emotion for Search Term Trump") +
  geom_text(aes(label = Total), position = position_dodge(width = 0.75), vjust = -0.25)
plotData2 = gather(TrumpSearchFinalData, "Polarity", "values", 25:26) %>% # columns 25:26 hold negative and positive
  group_by(Polarity) %>%
  summarise(Total = sum(values))
ggplot(data = plotData2, aes(x = Polarity, y = Total)) +
  geom_bar(aes(fill = Polarity), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total") + ggtitle("Sentiment for Search Term Trump") +
  geom_text(aes(label = Total), position = position_dodge(width = 0.75), vjust = -0.25)
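The gather, group_by, and summarise steps above reduce to one total per score column. A minimal base R sketch of the same totals is shown below on made-up scores. The toyScores values are invented, and the columns are selected by name here, an assumption that avoids relying on column positions such as 17:24.

```r
# Made-up scores for three tweets; column names mirror get_nrc_sentiment() output.
toyScores <- data.frame(
  anger = c(1, 0, 2), joy = c(0, 3, 1), trust = c(1, 1, 0),
  negative = c(2, 0, 1), positive = c(0, 4, 1)
)

# Emotion totals: one sum per emotion column.
emotionTotals <- colSums(toyScores[, c("anger", "joy", "trust")])

# Polarity totals and each polarity's share of all polarity hits.
polarityTotals <- colSums(toyScores[, c("negative", "positive")])
polarityShare  <- polarityTotals / sum(polarityTotals)
```

Shares can be easier to compare than raw totals when two data sets differ in size, as the 10,000 searched tweets and the personal timeline do.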
In this section, I focus on Donald Trump's personal Twitter handle. The data set includes retweets and ranges from 2/2016 to 12/2016. I also try to make sense of the emotions and sentiment by plotting the data monthly.
head( TrumpPersonalText,5)
## [1] "\"@mike_pence: Congratulations to @RealDonaldTrump; officially elected President of the United States today by the Electoral College!\""
## [2] "\"@Franklin_Graham: Congratulations to President-elect @realDonaldTrump--the electoral votes are in and it's official.\" Thank you Franklin!"
## [3] "RT @DanScavino: #TrumpTrain<ed><U+00A0><U+00BD><ed><U+00BA><U+0082><ed><U+00A0><U+00BD><ed><U+00B2><U+00A8><ed><U+00A0><U+00BC><ed><U+00B7><U+00BA><ed><U+00A0><U+00BC><ed><U+00B7><U+00B8><ed><U+00A0><U+00BC><ed><U+00B7><U+00BA><ed><U+00A0><U+00BC><ed><U+00B7><U+00B8><ed><U+00A0><U+00BC><ed><U+00B7><U+00BA><ed><U+00A0><U+00BC><ed><U+00B7><U+00B8><ed><U+00A0><U+00BC><ed><U+00B7><U+00BA><ed><U+00A0><U+00BC><ed><U+00B7><U+00B8> https://t.co/qAQdBGEwSv"
## [4] "We did it! Thank you to all of my great supporters, we just officially won the election (despite all of the distorted and inaccurate media)."
## [5] "Today there were terror attacks in Turkey, Switzerland and Germany - and it is only getting worse. The civilized world must change thinking!"
cleanTweetp = gsub("\\b(rt|RT)\\b", "", TrumpPersonalText) # remove the retweet marker as a whole word only
cleanTweetp = gsub("http\\S+", "", cleanTweetp) # remove links (the whole URL, not just the scheme)
cleanTweetp = gsub("<.*?>", "", cleanTweetp) # remove html tags
cleanTweetp = gsub("@\\w+", "", cleanTweetp) # remove @mentions
cleanTweetp = gsub("[[:punct:]]", "", cleanTweetp) # remove punctuation
cleanTweetp = gsub("\r?\n|\r", " ", cleanTweetp) # replace line breaks with a space
cleanTweetp = gsub("[[:digit:]]", "", cleanTweetp) # remove digits
cleanTweetp = iconv(cleanTweetp, to = "ASCII", sub = "") # remove non-ASCII characters such as Asian letters
cleanTweetp = gsub("[ \t]{2,}", " ", cleanTweetp) # collapse repeated spaces and tabs into one space
cleanTweetp = gsub("^\\s+", "", cleanTweetp) # remove whitespace at the beginning
cleanTweetp = gsub("\\s+$", "", cleanTweetp) # remove whitespace at the end
TrumpPersonalSentiment = get_nrc_sentiment(cleanTweetp)
head(TrumpPersonalSentiment,5)
TrumpPersonalFinalData = cbind(TrumpPersonal,TrumpPersonalSentiment)
plotData3 = gather(TrumpPersonalFinalData, "sentiment", "values", 17:24) %>% # columns 17:24 hold the eight emotions
  group_by(sentiment) %>%
  summarise(Total = sum(values))
ggplot(data = plotData3, aes(x = sentiment, y = Total)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Emotions") + ylab("Total") + ggtitle("Emotions for @realDonaldTrump") +
  geom_text(aes(label = Total), position = position_dodge(width = 0.75), vjust = -0.25)
plotData4 = gather(TrumpPersonalFinalData, "Polarity", "values", 25:26) %>% # columns 25:26 hold negative and positive
  group_by(Polarity) %>%
  summarise(Total = sum(values))
ggplot(data = plotData4, aes(x = Polarity, y = Total)) +
  geom_bar(aes(fill = Polarity), stat = "identity") +
  theme(legend.position = "none") +
  xlab("Sentiment") + ylab("Total") + ggtitle("Sentiment for @realDonaldTrump") +
  geom_text(aes(label = Total), position = position_dodge(width = 0.75), vjust = -0.25)
plotData5 = select(TrumpPersonalFinalData, created, 17:24)
plotData5 = separate(plotData5, created, c("date", "Time"), " ") %>% # split the timestamp into date and time
  group_by(date) %>%
  summarise(Anger = sum(anger), Anticipation = sum(anticipation), Disgust = sum(disgust), Fear = sum(fear), Joy = sum(joy), Sadness = sum(sadness), Surprise = sum(surprise), Trust = sum(trust))
plotData5$date = as.Date(plotData5$date, "%Y-%m-%d")
plotData5$date <- as.Date(cut(plotData5$date, breaks = "month")) # snap each date to the first day of its month
plotData5 = gather(plotData5, "sentiment", "values", 2:9) %>%
  group_by(date, sentiment) %>%
  summarise(Total = sum(values))
ggplot(data = plotData5, aes(x = date, y = Total, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) + # stat is not an aesthetic, so it is not set inside aes()
  geom_point(size = 0.5) +
  theme(legend.title = element_blank(), axis.title.x = element_blank()) +
  ylab("Total") +
  ggtitle("Emotions of @realDonaldTrump 2/2016-12/2016") +
  scale_y_continuous(limits = c(0, 300))
plotData6 = gather(TrumpPersonalFinalData, "Polarity", "values", 25:26) %>%
  group_by(created, Polarity) %>%
  summarise(Total = sum(values))
plotData6 = separate(plotData6, created, c("date", "Time"), " ") # split the timestamp into date and time
plotData6$date = as.Date(plotData6$date, "%Y-%m-%d")
plotData6$date <- as.Date(cut(plotData6$date, breaks = "month")) # snap each date to the first day of its month
plotData6 = select(plotData6, date, Polarity, Total) %>%
  group_by(date, Polarity) %>%
  summarise(Total = sum(Total))
ggplot(data = plotData6, aes(x = date, y = Total, group = Polarity)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = Polarity)) + # stat is not an aesthetic, so it is not set inside aes()
  geom_point(size = 0.5) +
  theme(legend.title = element_blank(), axis.title.x = element_blank()) +
  ylab("Total") +
  ggtitle("Sentiment of @realDonaldTrump 2/2016-12/2016") +
  scale_y_continuous(limits = c(0, 500))
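The monthly plots rely on as.Date(cut(date, breaks = "month")) to place every tweet in its calendar month before the totals are summed. A small base R sketch of that bucketing on made-up daily totals follows; the daily data frame and its values are invented for illustration.

```r
# Made-up daily totals; dates and values are for illustration only.
daily <- data.frame(
  date  = as.Date(c("2016-11-03", "2016-11-21", "2016-12-05", "2016-12-19")),
  trust = c(4, 6, 3, 2)
)

# cut() labels each date with the first day of its month; as.Date() turns
# the resulting factor back into a Date so it plots on a continuous axis.
daily$month <- as.Date(cut(daily$date, breaks = "month"))

# Sum the daily values within each month.
monthly <- aggregate(trust ~ month, data = daily, FUN = sum)
```

Changing breaks to "week" would give weekly buckets in the same way, which may be useful when a month hides short spikes.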
library(tm) # Corpus(), tm_map(), TermDocumentMatrix()
library(stringr) # str_replace_all()
vector = TrumpPersonal$text
Corpus <- Corpus(VectorSource(vector))
Corpus = tm_map(Corpus, removeNumbers)
# content_transformer() lets plain string functions operate on corpus documents
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "http\\w+", replacement = " ")
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "<.*?>", replacement = " ")
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "@\\w+", replacement = " ")
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "\\=", replacement = " ")
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "[[:punct:]]", replacement = " ")
Corpus = tm_map(Corpus, content_transformer(str_replace_all), pattern = "amp", replacement = " ")
Corpus = tm_map(Corpus, removeWords, words = stopwords("en"))
Corpus = tm_map(Corpus, content_transformer(tolower))
Corpus = tm_map(Corpus, stripWhitespace)
tdm = TermDocumentMatrix(Corpus)
tdm
## <<TermDocumentMatrix (terms: 6678, documents: 3195)>>
## Non-/sparse entries: 31745/21304465
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
library(wordcloud) # wordcloud(); also attaches RColorBrewer for brewer.pal()
wordcloud(words = Corpus,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
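A word cloud draws each word at a size that reflects how often it occurs, so the underlying computation is a term frequency count. The base R sketch below shows that count on three made-up documents. The docs vector is invented, and the sketch pools all documents together instead of building a per-document matrix as TermDocumentMatrix does.

```r
# Made-up cleaned tweets for illustration.
docs <- c("make america great again", "thank you america", "great crowd thank you")

# Split every document into words and count how often each word occurs.
words    <- unlist(strsplit(docs, " ", fixed = TRUE))
termFreq <- sort(table(words), decreasing = TRUE)

# The most frequent terms are the ones a word cloud would draw largest.
head(termFreq, 3)
```

With max.words = 200, only the 200 most frequent terms of such a ranking are drawn.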
Sentiment analysis can be applied to many topics. It was interesting to see that Trump conveyed a largely positive message through his Twitter handle, and these positive messages overshadowed the negative ones. Also, toward November the trust emotion was very high, which suggests support and self-confidence. For the tweets returned by searching for the term "trump", surprise appears to be the dominant emotion.
http://technokarak.com/how-to-clean-the-twitter-data-using-r-twitter-mining-tutorial.html
http://juliasilge.com/blog/Joy-to-the-World/
https://www.r-bloggers.com/plot-weekly-or-monthly-totals-in-r/
http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107