Data 607 - Final Project
Research Objective
Trump’s presidency is unique among presidencies in the way he communicates public and economic policy via social media. We want to examine whether his remarks during the global pandemic (mid-March through April 2020) had tangible effects on the stock market. We analyze the text of the daily White House briefings using sentiment analysis, then compare and contrast that sentiment with stock market performance and Trump’s approval ratings over the same period.
Data Sources:
- White House Briefings - each weekday President Trump addresses the nation, and the transcripts can be found at https://www.whitehouse.gov/briefings-statements
- Stock Market Data - extracted with the quantmod package to obtain small-, mid-, and large-cap growth funds
- Trump Approval Ratings - daily approval ratings from FiveThirtyEight
- Twitter API (https://dev.twitter.com/apps)
Necessary R Packages:
The following libraries are used throughout our analysis:
- quantmod
- rvest
- stringr
- purrr
- tidytext
- dplyr
- tidyr
- ggplot2
- lubridate
- RCurl
- ggpubr
- gridExtra
- twitteR
- tm
- textmineR
- RWeka
- wordcloud
- RColorBrewer
Gather Data
Daily Presidential White House Briefings
These URLs all end in a number, so we first create a list of URLs and then use read_html() to grab the <p> tags from each page.
base_url <- 'https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-'
getPageURLs <- function(base_url) {
  add_number <- seq(2, 33)
  urls <- str_c(base_url, add_number)
  return(urls)
}
urls <- getPageURLs(base_url)
head(urls)
## [1] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-2"
## [2] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-3"
## [3] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-4"
## [4] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-5"
## [5] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-6"
## [6] "https://www.whitehouse.gov/briefings-statements/remarks-president-trump-vice-president-pence-members-coronavirus-task-force-press-briefing-7"
Stock Market Data
We decided it would be best to compare the top small-, mid-, and large-cap growth funds.
start<- as.Date("2020-03-15")
end <- as.Date("2020-04-30")
#Retrieving information on Top 5 Small Growth Funds as ranked by "U.S. News"
smallCapFunds <- c("PSGAX", "FKASX", "PGSGX", "QUASX", "TRSSX")
getSymbols(smallCapFunds, src = "yahoo", from = start, to = end)
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
## [1] "PSGAX" "FKASX" "PGSGX" "QUASX" "TRSSX"
#Retrieving information on Top 5 Mid Growth Funds as ranked by "U.S. News"
midCapFunds <- c("DFDMX", "CCSMX","PRDMX", "OTCAX", "BMGAX")
getSymbols(midCapFunds, src = "yahoo", from = start, to = end)
## [1] "DFDMX" "CCSMX" "PRDMX" "OTCAX" "BMGAX"
#Retrieving information on Top 5 Large Growth Funds as ranked by "U.S. News"
largeCapFunds <- c("TRLGX", "PREFX", "TPLGX", "FDSVX", "PBLAX")
getSymbols(largeCapFunds, src = "yahoo", from = start, to = end)
## [1] "TRLGX" "PREFX" "TPLGX" "FDSVX" "PBLAX"
## [1] "DJIA"
## An 'xts' object on 2020-03-16/2020-04-29 containing:
## Data: num [1:32, 1:6] 30.3 31.4 29.3 30.5 29.2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:6] "PSGAX.Open" "PSGAX.High" "PSGAX.Low" "PSGAX.Close" ...
## Indexed by objects of class: [Date] TZ: UTC
## xts Attributes:
## List of 2
## $ src : chr "yahoo"
## $ updated: POSIXct[1:1], format: "2020-05-12 17:28:06"
Trump Approval Ratings
These ratings were available on FiveThirtyEight, broken out by Voters, Adults, and All polls for each day during April 2020.
#github URL
theURL <- getURL("https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Final_Project/approval_topline.csv")
# Read csv from github
trump_apprdf <- read.csv(text = theURL,stringsAsFactors = FALSE)
# glimpse data
dplyr::glimpse(trump_apprdf)
## Observations: 3,582
## Variables: 10
## $ president <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup <chr> "Voters", "Adults", "All polls", "Adults", "Voter…
## $ modeldate <chr> "4/30/2020", "4/30/2020", "4/30/2020", "4/29/2020…
## $ approve_estimate <dbl> 42.84215, 43.65953, 42.92110, 43.02131, 42.72302,…
## $ approve_hi <dbl> 46.74216, 47.53337, 46.79893, 46.37986, 46.54502,…
## $ approve_lo <dbl> 38.94214, 39.78569, 39.04326, 39.66275, 38.90103,…
## $ disapprove_estimate <dbl> 52.64487, 51.65192, 52.38926, 52.15911, 52.61465,…
## $ disapprove_hi <dbl> 56.27713, 56.48534, 56.54765, 56.86387, 56.20207,…
## $ disapprove_lo <dbl> 49.01261, 46.81850, 48.23088, 47.45436, 49.02723,…
## $ timestamp <chr> "15:46:08 30 Apr 2020", "15:44:36 30 Apr 2020", "…
Cleaning Data
Clean WH Briefing Data
In order to easily analyze the WH briefings, we needed to attach each briefing’s date along with its page number, i. The page number corresponds to the number at the end of the page URL.
#create empty dataframe
testFrame <- data.frame(date = character(),
stringsAsFactors = FALSE)
# Bind each briefing's scraped date(s) to its page index, i
for (i in 1:length(wh_dates)){
  testFrame <- rbind(testFrame, cbind(as.data.frame(unlist(wh_dates[[i]]), stringsAsFactors = FALSE), i))
}
dateFrame <- na.omit(testFrame)
data0 <- data.frame(text=character(),
Day=integer(),
stringsAsFactors=FALSE)
# Bind each briefing's paragraphs to its page index, i
for (i in 1:length(wh_briefings)){
  data0 <- rbind(data0, cbind(as.data.frame(unlist(wh_briefings[[i]]), stringsAsFactors = FALSE), i))
}
colnames(data0) <- c('text','day')
correctMatch <- inner_join(dateFrame,data0,by= c("i" = "day"))
colnames(correctMatch) <- c('date', 'day', 'text')
wh_data <- correctMatch
tidy_data <- wh_data %>%
mutate(linenumber = row_number()) %>%
ungroup() %>%
unnest_tokens(word, text)
head(tidy_data)
## date day linenumber word
## 1 March 15, 2020 1 1 remarks
## 2 March 15, 2020 1 2 healthcare
## 3 March 15, 2020 1 3 issued
## 3.1 March 15, 2020 1 3 on
## 3.2 March 15, 2020 1 3 march
## 3.3 March 15, 2020 1 3 15
Clean Approval Ratings
To analyze the approval ratings, we needed to convert modeldate into the Date format ‘yyyy-mm-dd’ and filter the data to April 2020.
# modeldate - convert into date
trump_apprdf$modeldate <- mdy(trump_apprdf$modeldate)
dplyr::glimpse(trump_apprdf)
## Observations: 3,582
## Variables: 10
## $ president <chr> "Donald Trump", "Donald Trump", "Donald Trump", "…
## $ subgroup <chr> "Voters", "Adults", "All polls", "Adults", "Voter…
## $ modeldate <date> 2020-04-30, 2020-04-30, 2020-04-30, 2020-04-29, …
## $ approve_estimate <dbl> 42.84215, 43.65953, 42.92110, 43.02131, 42.72302,…
## $ approve_hi <dbl> 46.74216, 47.53337, 46.79893, 46.37986, 46.54502,…
## $ approve_lo <dbl> 38.94214, 39.78569, 39.04326, 39.66275, 38.90103,…
## $ disapprove_estimate <dbl> 52.64487, 51.65192, 52.38926, 52.15911, 52.61465,…
## $ disapprove_hi <dbl> 56.27713, 56.48534, 56.54765, 56.86387, 56.20207,…
## $ disapprove_lo <dbl> 49.01261, 46.81850, 48.23088, 47.45436, 49.02723,…
## $ timestamp <chr> "15:46:08 30 Apr 2020", "15:44:36 30 Apr 2020", "…
Data Analysis
White House Briefing & Sentiment Analysis
The tidytext package provides access to three general-purpose sentiment lexicons: AFINN, bing, and nrc, all based on unigrams, i.e., single words. (A quick look at the format each lexicon returns follows this list.)
- AFINN from Finn Årup Nielsen assigns each word an integer score between -5 and 5 for negative/positive sentiment.
- bing from Bing Liu and collaborators categorizes words in a binary fashion into positive and negative categories.
- nrc from Saif Mohammad and Peter Turney categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
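As referenced above, get_sentiments() returns each lexicon as a two-column tibble, which is why they can be attached with inner_join() in the chunks below:
# AFINN pairs each word with an integer value; bing and nrc pair each
# word with a sentiment label
head(get_sentiments("afinn")) # columns: word, value
head(get_sentiments("bing"))  # columns: word, sentiment
head(get_sentiments("nrc"))   # columns: word, sentiment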
# Using lexicon afinn
afinnLex <- get_sentiments("afinn")
wh.afinn <- tidy_data %>%
  anti_join(stop_words) %>%
  group_by(day) %>%
  inner_join(afinnLex) %>%
  summarise(sentiment = sum(value)) %>%
  mutate(method = "afinn")
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 32 x 3
## day sentiment method
## <int> <dbl> <chr>
## 1 1 172 afinn
## 2 2 43 afinn
## 3 3 243 afinn
## 4 4 114 afinn
## 5 5 228 afinn
## 6 6 142 afinn
## 7 7 207 afinn
## 8 8 152 afinn
## 9 9 190 afinn
## 10 10 174 afinn
## # … with 22 more rows
# bar plot for positive and negative cumulative score
wh.afinn %>%
summarise(Positive = sum(sentiment[sentiment>0]), Negative = sum(sentiment[sentiment<0])) %>%
gather(variable, value, Positive:Negative) %>%
ggplot(aes(variable, value, fill = variable)) +
geom_bar(stat="identity") +
geom_text(aes(label=value), position=position_dodge(width=0.9), vjust=-0.25)
# Using lexicon bing
bingLex <- get_sentiments("bing")
wh.bing <- tidy_data %>%
anti_join(stop_words) %>%
group_by(day) %>%
inner_join(bingLex) %>%
count(day, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sentiment = positive - negative) %>%
mutate(method="bing")
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 32 x 5
## # Groups: day [32]
## day negative positive sentiment method
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 125 187 62 bing
## 2 2 178 141 -37 bing
## 3 3 230 241 11 bing
## 4 4 180 180 0 bing
## 5 5 204 295 91 bing
## 6 6 241 274 33 bing
## 7 7 246 295 49 bing
## 8 8 333 295 -38 bing
## 9 9 97 143 46 bing
## 10 10 168 220 52 bing
## # … with 22 more rows
# bar plot for positive and negative sentiments
wh.bing %>%
ungroup() %>%
select(-day) %>%
select(negative, positive) %>%
summarise_all(funs(sum)) %>%
gather(variable, value, negative:positive) %>%
ggplot(aes(variable, value, fill = variable)) +
geom_bar(stat="identity") +
geom_text(aes(label=value), position=position_dodge(width=0.9), vjust=-0.25)
# Using lexicon nrc
nrcLex <- get_sentiments("nrc")
wh.nrc <- tidy_data %>%
anti_join(stop_words) %>%
group_by(day) %>%
inner_join(nrcLex) %>%
filter(sentiment %in% c("positive", "negative")) %>%
count(day, sentiment) %>%
spread(sentiment, n, fill=0) %>%
mutate(sentiment = positive - negative) %>%
mutate(method="nrc")
## Joining, by = "word"
## Joining, by = "word"
## # A tibble: 32 x 5
## # Groups: day [32]
## day negative positive sentiment method
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 145 496 351 nrc
## 2 2 198 503 305 nrc
## 3 3 251 710 459 nrc
## 4 4 226 540 314 nrc
## 5 5 256 691 435 nrc
## 6 6 282 804 522 nrc
## 7 7 318 838 520 nrc
## 8 8 412 884 472 nrc
## 9 9 108 341 233 nrc
## 10 10 217 587 370 nrc
## # … with 22 more rows
# bar graph for all categories in nrc lexicon
tidy_data %>%
anti_join(stop_words) %>%
group_by(day) %>%
inner_join(nrcLex) %>%
#filter(sentiment %in% c("positive", "negative")) %>%
count(day, sentiment) %>%
spread(sentiment, n, fill=0) %>%
ungroup() %>%
select(-day) %>%
summarise_all(funs(sum)) %>%
gather(variable, value, anger:trust) %>%
ggplot(aes(variable, value, fill = variable)) +
geom_bar(stat="identity") +
geom_text(aes(label=value), position=position_dodge(width=0.9), vjust=-0.25)
## Joining, by = "word"
## Joining, by = "word"
## president subgroup modeldate approve_estimate approve_hi approve_lo
## 1 Donald Trump Adults 2020-04-02 45.58512 50.25256 40.91767
## 2 Donald Trump Voters 2020-04-02 46.01426 50.15618 41.87234
## 3 Donald Trump All polls 2020-04-02 45.67397 49.99093 41.35702
## disapprove_estimate disapprove_hi disapprove_lo timestamp
## 1 49.40673 54.87109 43.94237 14:48:24 2 Apr 2020
## 2 50.63609 55.11203 46.16016 14:49:51 2 Apr 2020
## 3 50.03857 54.96909 45.10805 14:47:26 2 Apr 2020
## president subgroup modeldate approve_estimate approve_hi approve_lo
## 1 Donald Trump Voters 2020-04-03 45.77735 49.91039 41.64430
## 2 Donald Trump Adults 2020-04-03 46.01294 50.58268 41.44320
## 3 Donald Trump All polls 2020-04-03 45.84590 50.08724 41.60456
## disapprove_estimate disapprove_hi disapprove_lo timestamp
## 1 50.77266 55.23004 46.31528 09:44:49 3 Apr 2020
## 2 49.08885 54.33594 43.84177 09:43:23 3 Apr 2020
## 3 49.88251 54.68981 45.07522 09:42:26 3 Apr 2020
## president subgroup modeldate approve_estimate approve_hi approve_lo
## 1 Donald Trump Adults 2020-04-04 46.01294 50.58270 41.44317
## 2 Donald Trump Voters 2020-04-04 45.70196 49.85291 41.55100
## 3 Donald Trump All polls 2020-04-04 45.78480 50.03462 41.53498
## disapprove_estimate disapprove_hi disapprove_lo timestamp
## 1 49.08885 54.33595 43.84176 07:31:21 4 Apr 2020
## 2 50.79717 55.27833 46.31600 07:32:47 4 Apr 2020
## 3 49.91005 54.74942 45.07069 07:30:25 4 Apr 2020
Comparing Lexicons
The following graphs show the sentiment score each lexicon assigns to each day of briefings in the corpus.
# Comparing all 3 lexicons
wh.aff_bin_nrc <- bind_rows(wh.afinn, wh.bing, wh.nrc)
wh.aff_bin_nrc %>%
ggplot(aes(day, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
# comparing the cumulative sentiments for all 3 lexicons
wh.aff_bin_nrc %>%
group_by(method) %>%
summarise(sentiment = sum(sentiment)) %>%
ggplot(aes(method, sentiment, fill = method)) +
geom_bar(stat="identity") +
geom_text(aes(label=sentiment), position=position_dodge(width=0.9), vjust=-0.25)
# line graph for all 3 lexicons
wh.aff_bin_nrc %>%
ggplot(aes(x=day, y=sentiment, group=method, color=method)) +
geom_line(size=1) +
geom_point() + labs(x="day", y="sentiment", title = "Sentiment vs Days for all 3 lexicons") +
scale_color_manual(values=c("red", "green", "blue")) +
theme(plot.title = element_text(hjust = 0.5))
The following two graphs show the nrc sentiments: the former shows positive and trust, and the latter shows the remaining sentiments.
- There are peaks in words of trust and positive on April 3rd, April 7th and April 14th
- There are peaks in words of fear on April 1st and April 10th
a <- tidy_data %>%
anti_join(stop_words) %>%
inner_join(nrcLex) %>%
count(day, sentiment) %>%
spread(sentiment, n, fill=0) %>%
gather(variable, value, c('positive','trust')) %>%
ggplot(aes(day, value, color = variable,fill = variable)) +
geom_line(stat="identity") +
ylab('Sentiment')
## Joining, by = "word"
## Joining, by = "word"
b <- tidy_data %>%
anti_join(stop_words) %>%
inner_join(nrcLex) %>%
count(day, sentiment) %>%
spread(sentiment, n, fill=0) %>%
gather(variable, value, c('anger','anticipation','disgust','fear','joy','negative','sadness','surprise')) %>%
ggplot(aes(day, value, color = variable,fill = variable)) +
geom_line(stat="identity") +
ylab('Sentiment')
## Joining, by = "word"
## Joining, by = "word"
Trump approval ratings analysis
The two graphs below show Trump’s approval and disapproval ratings for April 2020. The three subgroups included in this analysis are Adults, All polls, and Voters. The graphs show approval estimates declining while disapproval estimates increase for all three subgroups.
# Disapproval line plot for all 3 sub groups
disapprove <- trump_apprsubdf %>%
ggplot(aes(x=modeldate,y=disapprove_estimate, group=subgroup, color=subgroup)) +
geom_line(size=1) +
geom_point() + labs(x="Date", y="Disapproval estimates", title = "") +
scale_color_manual(values=c("red", "green", "blue")) +
ggtitle('Trump disapproval estimates for April 2020')+
theme(plot.title = element_text(hjust = 0.5))
# Approval line plot for all 3 sub groups
approve <- trump_apprsubdf %>%
ggplot(aes(x=modeldate,y=approve_estimate, group=subgroup, color=subgroup)) +
geom_line(size=1) +
geom_point() + labs(x="Date", y="Approval estimates", title = "") +
scale_color_manual(values=c("red", "green", "blue")) +
ggtitle('Trump approval estimates for April 2020')+
theme(plot.title = element_text(hjust = 0.5))
ggarrange(approve, disapprove, ncol=1)
Stock Market Performance
We used the U.S. News top 5 small, mid, and large growth funds as proxies for the performance of small businesses vs. mid-sized businesses vs. big business in America. It was difficult to pick something that would track the health of “mom and pop” shops, which make up a sizable share of businesses.
A data frame was created that included all the closing prices along with the bing sentiment. The average closing price was calculated and compared against the sentiment over the designated time frame.
The cor() and lm() functions were used to calculate correlation coefficients and p-values between the average closing price and the sentiment variable.
Small Growth Funds & White House Sentiment Analysis
smFrame <- dateFrame
colnames(smFrame) <- c("Date","Day")
smFrame$PSGAX_Close <- PSGAX$PSGAX.Close
smFrame$FKASX_Close <- FKASX$FKASX.Close
smFrame$PGSGX_Close <- PGSGX$PGSGX.Close
smFrame$QUASX_Close <- QUASX$QUASX.Close
smFrame$TRSSX_Close <- TRSSX$TRSSX.Close
smFrame$Sentiment <- wh.bing$sentiment
tstFrame <- smFrame
tstFrame$AvgClose <- (smFrame$PSGAX_Close + smFrame$FKASX_Close + smFrame$PGSGX_Close + smFrame$TRSSX_Close + smFrame$QUASX_Close)/5
SmLm2 <- lm(AvgClose ~ Sentiment, data= tstFrame)
summary(SmLm2)
##
## Call:
## lm(formula = AvgClose ~ Sentiment, data = tstFrame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6750 -1.7080 0.1007 1.6570 4.5108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.955444 0.513187 56.423 <2e-16 ***
## Sentiment 0.011512 0.008987 1.281 0.21
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.458 on 30 degrees of freedom
## Multiple R-squared: 0.05186, Adjusted R-squared: 0.02026
## F-statistic: 1.641 on 1 and 30 DF, p-value: 0.21
## Sentiment AvgClose
## Sentiment 1.0000000 0.2277354
## AvgClose 0.2277354 1.0000000
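The correlation matrix above also lacks a visible source chunk; a sketch of the presumed call, where as.numeric() strips the xts class from the averaged closes (the same pattern would produce the matrices in the mid- and large-cap sections below):
# Correlation between daily bing sentiment and the average closing price
cor(data.frame(Sentiment = tstFrame$Sentiment,
               AvgClose = as.numeric(tstFrame$AvgClose)))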
plot1 <- tstFrame %>%
ggplot(aes(x=Day, y=AvgClose, group=1)) +
geom_line() +
geom_point() + labs(x=" Days", y="Closing Price", title = "Avg of Closing Price for Top 5 Small Growth Funds", subtitle = " March 15, 2020 - April 30, 2020")
plot2 <- tstFrame %>%
ggplot(aes(x=Day, y=Sentiment, group=1)) +
geom_line() +
geom_point() + labs(x="Days", y="WH Briefing Sentiment", title = "Bing Sentiment Scores over 32 Days of WH Briefings", subtitle = "March 15, 2020 - April 30, 2020")
grid.arrange(plot1, plot2, nrow = 2)
## Don't know how to automatically pick scale for object of type xts/zoo. Defaulting to continuous.
Mid Growth Funds & White House Sentiment Analysis
mdFrame <- dateFrame
colnames(mdFrame) <- c("Date","Day")
mdFrame$DFDMX_Close <- DFDMX$DFDMX.Close
mdFrame$CCSMX_Close <- CCSMX$CCSMX.Close
mdFrame$PRDMX_Close <- PRDMX$PRDMX.Close
mdFrame$OTCAX_Close <- OTCAX$OTCAX.Close
mdFrame$BMGAX_Close <- BMGAX$BMGAX.Close
mdFrame$Sentiment <- wh.bing$sentiment
tst2Frame <- mdFrame
tst2Frame$AvgClose <- (mdFrame$DFDMX_Close + mdFrame$CCSMX_Close + mdFrame$PRDMX_Close + mdFrame$OTCAX_Close + mdFrame$BMGAX_Close)/5
SmLm3 <- lm(AvgClose ~ Sentiment, data= tst2Frame)
summary(SmLm3)
##
## Call:
## lm(formula = AvgClose ~ Sentiment, data = tst2Frame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3896 -0.9952 0.1580 1.1440 2.7797
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.370883 0.358911 59.544 <2e-16 ***
## Sentiment 0.006627 0.006285 1.054 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.719 on 30 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.003596
## F-statistic: 1.112 on 1 and 30 DF, p-value: 0.3001
## Sentiment AvgClose
## Sentiment 1.000000 0.189044
## AvgClose 0.189044 1.000000
plot1 <- tst2Frame %>%
ggplot(aes(x=Day, y=AvgClose, group=1)) +
geom_line() +
geom_point() + labs(x=" Days", y="Closing Price", title = "Avg of Closing Price for Top 5 Mid Growth Funds", subtitle = " March 15, 2020 - April 30, 2020")
plot2 <- tst2Frame %>%
ggplot(aes(x=Day, y=Sentiment, group=1)) +
geom_line() +
geom_point() + labs(x="Days", y="WH Briefing Sentiment", title = "Biin Sentiment Scores over 32 Days of WH Briefings", subtitle = "March 15, 2020 - April 30, 2020")
grid.arrange(plot1, plot2, nrow = 2)
## Don't know how to automatically pick scale for object of type xts/zoo. Defaulting to continuous.
Large Growth Funds & White House Sentiment Analysis
lgFrame <- dateFrame
colnames(lgFrame) <- c("Date","Day")
lgFrame$TRLGX_Close <- TRLGX$TRLGX.Close
lgFrame$PREFX_Close <- PREFX$PREFX.Close
lgFrame$TPLGX_Close <- TPLGX$TPLGX.Close
lgFrame$FDSVX_Close <- FDSVX$FDSVX.Close
lgFrame$PBLAX_Close <- PBLAX$PBLAX.Close
lgFrame$Sentiment <- wh.bing$sentiment
tst3Frame <- lgFrame
tst3Frame$AvgClose <- (lgFrame$TRLGX_Close + lgFrame$PREFX_Close + lgFrame$TPLGX_Close + lgFrame$FDSVX_Close + lgFrame$PBLAX_Close)/5
SmLm4 <- lm(AvgClose ~ Sentiment, data= tst3Frame)
summary(SmLm4)
##
## Call:
## lm(formula = AvgClose ~ Sentiment, data = tst3Frame)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2845 -1.9643 0.3148 2.2355 3.9929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.300249 0.570832 60.088 <2e-16 ***
## Sentiment 0.011255 0.009996 1.126 0.269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.734 on 30 degrees of freedom
## Multiple R-squared: 0.04055, Adjusted R-squared: 0.008564
## F-statistic: 1.268 on 1 and 30 DF, p-value: 0.2691
## Sentiment AvgClose
## Sentiment 1.0000000 0.2013588
## AvgClose 0.2013588 1.0000000
plot1 <- tst3Frame %>%
ggplot(aes(x=Day, y=AvgClose, group=1)) +
geom_line() +
geom_point() + labs(x=" Days", y="Closing Price", title = "Avg of Closing Price for Top 5 Large Growth Funds", subtitle = " March 15, 2020 - April 30, 2020")
plot2 <- tst3Frame %>%
ggplot(aes(x=Day, y=Sentiment, group=1)) +
geom_line() +
geom_point() + labs(x="Days", y="WH Briefing Sentiment", title = "Biin Sentiment Scores over 32 Days of WH Briefings", subtitle = "March 15, 2020 - April 30, 2020")
grid.arrange(plot1, plot2, nrow = 2)
## Don't know how to automatically pick scale for object of type xts/zoo. Defaulting to continuous.
Twitter API
In this analysis, we use tweets directed at Donald Trump’s Twitter account, realDonaldTrump, that discuss the coronavirus pandemic. We requested 10,000 tweets matching #COVID-19 and @realDonaldTrump from March 15 through May 8, 2020.
We use R with the tidyr, tidytext, and dplyr packages for the text analytics and the twitteR package to connect to Twitter and download the data.
To collect the data, we created a Twitter application, downloaded the tweets, stored them in a CSV file in a Git repository, and performed the sentiment analysis on that file. The steps to connect the app and download data from Twitter are:
- Create a Twitter account
- Go to https://dev.twitter.com/apps and log in with your Twitter credentials
- Click “Create a new app”
- Provide a Name, Description, Website, and Callback URL (we used http://localhost:1410 for the Callback URL)
- Check the Developer Agreement box
- After creating the app, go to “Keys and Access Tokens”
- Copy the Consumer key (API Key) and Consumer secret (API Secret)
- Click “Create my access token”, then copy the Access token and Access token secret
Load necessary packages and collect the data from twitter.
library(twitteR)
library(tidyr)
consumer_key <- "Lsk1wfKakbL6rfFC1dcLfpSrb"
consumer_secret <- "rKywBX5uJ3d13drP8NxgWGxC3TtPraKZiucS58J2OzFW0Rpcci"
access_token <- "1257449305616650243-J8E4TpO0Fmu6v5KXyY8T9owgSsPI7v"
access_secret <- "tgCzUd7fXgnSZFrLvmeMBYNJzH3i1hZkDX16PK3APkM3E"
#now lets connect
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
# Collect 10000 tweets
trump_tweets <- twitteR::searchTwitter("#COVID-19 + @realDonaldTrump", n = 10000,lang = "en", since = "2020-03-15", until = "2020-05-08",retryOnRateLimit = 1e2)
# set to a data frame
trump_tweets_df = twListToDF(trump_tweets)
# write data in to a csv
write.csv(trump_tweets_df, file = "trump_tweets.csv")
# read data from Github
trump_tweets_df <- read.csv("https://raw.githubusercontent.com/geeman1209/MSDATA2020/master/DATA607/Final_Project/trump_tweets.csv", stringsAsFactors = FALSE)
# Raw data
head(trump_tweets_df)
## X
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## text
## 1 RT @no_silenced: San Francisco Delivers Weed, Opioids, Booze and Ciggies to Homeless Locked Up for COVID-19\n\nMr President, can you please n…
## 2 RT @grey_sister: @kyledcheney @juliabhaber & what a co-inkydink that @TheJusticeDept #AGBarr let Flynn walk free today, also good distracti…
## 3 RT @CaslerNoel: While Trump was berating a nurse in the Oval Office yesterday who was telling the truth about the lack on PPE she experienc…
## 4 RT @PARISDENNARD: Brittany from @SamaritansPurse gave a powerful testimony about her work in New York during #Covid_19 with other front lin…
## 5 RT @the_resistor: @JoeBiden "A Grave Hardship" ~ The INABILITY & UNWILLINGNESS of @realDonaldTrump to LEAD \n\n33.5 MILLION Unemployed\n76,681…
## 6 RT @Free_Media_Hub: @realDonaldTrump COVID-19 US update infections 1,291,804 recorded deaths 76,889 deaths per million population 232 new d…
## favorited favoriteCount replyToSN created truncated replyToSID
## 1 FALSE 0 <NA> 2020-05-07 23:59:48 FALSE NA
## 2 FALSE 0 <NA> 2020-05-07 23:59:39 FALSE NA
## 3 FALSE 0 <NA> 2020-05-07 23:59:31 FALSE NA
## 4 FALSE 0 <NA> 2020-05-07 23:59:22 FALSE NA
## 5 FALSE 0 <NA> 2020-05-07 23:59:21 FALSE NA
## 6 FALSE 0 <NA> 2020-05-07 23:59:14 FALSE NA
## id replyToUID
## 1 1.258547e+18 NA
## 2 1.258547e+18 NA
## 3 1.258547e+18 NA
## 4 1.258547e+18 NA
## 5 1.258547e+18 NA
## 6 1.258547e+18 NA
## statusSource
## 1 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 2 <a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>
## 3 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## 4 <a href="http://tapbots.com/tweetbot" rel="nofollow">Tweetbot for iΟS</a>
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 6 <a href="https://mobile.twitter.com" rel="nofollow">Twitter Web App</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 1 jabelliott 524 TRUE FALSE NA NA
## 2 juliabhaber 3 TRUE FALSE NA NA
## 3 HollyL85225600 1305 TRUE FALSE NA NA
## 4 michaeljashmore 15 TRUE FALSE NA NA
## 5 dmgraden 25 TRUE FALSE NA NA
## 6 Roberttimestwo 6 TRUE FALSE NA NA
# Remove http from statusSource
trump_tweets_df$statusSource <- gsub("<.*?>", "",trump_tweets_df$statusSource)
# Most favorited tweets
trump_fav <- trump_tweets_df %>%
dplyr::arrange(desc(favoriteCount))
# Top 6 favorited tweets among the extracted 10000 tweets
head(trump_fav)
## X
## 1 5294
## 2 4216
## 3 5883
## 4 2214
## 5 7136
## 6 3682
## text
## 1 While Trump was berating a nurse in the Oval Office yesterday who was telling the truth about the lack on PPE she e… https://t.co/meo4q6ijTX
## 2 Whoops! @realDonaldTrump's personal valet **who serves Trump's food and beverage** has tested positive for… https://t.co/RjgrN4IePP
## 3 Trump could have been spreading Covid-19 to everyone in that factory on Tuesday because he didn’t want to look bad… https://t.co/O5ddIdL1d0
## 4 Today I launch the #ThankYou45 campaign to highlight how @realDonaldTrump has delivered for FL to respond to COVID-… https://t.co/wHJjsT1qj7
## 5 This concern of COVID-19 is a huge scam and political stunt. Orchestrated by globalists next step to one government… https://t.co/Bgg1pwmTXs
## 6 @realDonaldTrump The TRUTH IS going to come out \n\nIt ISN'T CHINA'S or @WHO's Fault\n\nIt IS @realDonaldTrump's FAULT… https://t.co/FeIgyTa8LH
## favorited favoriteCount replyToSN created truncated
## 1 FALSE 3991 <NA> 2020-05-07 16:01:22 TRUE
## 2 FALSE 2058 <NA> 2020-05-07 17:02:21 TRUE
## 3 FALSE 1461 <NA> 2020-05-07 15:11:51 TRUE
## 4 FALSE 802 <NA> 2020-05-07 19:27:40 TRUE
## 5 FALSE 602 <NA> 2020-05-07 12:37:54 TRUE
## 6 FALSE 512 realDonaldTrump 2020-05-07 17:30:36 TRUE
## replyToSID id replyToUID statusSource screenName
## 1 NA 1.258427e+18 NA Twitter for iPhone CaslerNoel
## 2 NA 1.258442e+18 NA Twitter Web App DrDenaGrayson
## 3 NA 1.258414e+18 NA Twitter for iPhone CaslerNoel
## 4 NA 1.258479e+18 NA Twitter for iPhone realannapaulina
## 5 NA 1.258376e+18 NA Twitter for iPhone RL9631
## 6 1.258448e+18 1.258449e+18 25073877 Twitter Web App the_resistor
## retweetCount isRetweet retweeted longitude latitude
## 1 1305 FALSE FALSE NA NA
## 2 661 FALSE FALSE NA NA
## 3 471 FALSE FALSE NA NA
## 4 276 FALSE FALSE NA NA
## 5 518 FALSE FALSE NA NA
## 6 78 FALSE FALSE NA NA
# Most retweeted
trump_retweet <- trump_tweets_df %>%
dplyr::arrange(desc(retweetCount)) %>%
dplyr::distinct(text, .keep_all = TRUE)
# Top 6 retweeted texts among the extracted 10000 tweets
head(trump_retweet)
## X
## 1 6284
## 2 4694
## 3 1449
## 4 316
## 5 6248
## 6 557
## text
## 1 RT @narendramodi: Fully agree with you President @realDonaldTrump. Times like these bring friends closer. The India-US partnership is stron…
## 2 RT @GOPChairwoman: State Rep. Whitsett nearly died from coronavirus, but because she dared to say something positive about @realDonaldTrump…
## 3 RT @MichaelCoudrey: UPDATE: @Twitter just suspended the account of the publicly traded biotech company AYTU BioScience that created a novel…
## 4 RT @SteveSchmidtSES: This ad has unnerved and rattled the unsteady hand of @realDonaldTrump. He was played by the Chinese over Covid-19.…
## 5 RT @IvankaTrump: Breaking: The House finally passes $480 billion package to deliver aid to millions of small businesses, workers and hospit…
## 6 RT @DonaldJTrumpJr: “COVID-19’s economic fallout is shining a light on exactly how dangerous it is to so heavily rely on China, as well as…
## favorited favoriteCount replyToSN created truncated replyToSID
## 1 FALSE 0 <NA> 2020-05-07 14:23:54 FALSE NA
## 2 FALSE 0 <NA> 2020-05-07 16:31:52 FALSE NA
## 3 FALSE 0 <NA> 2020-05-07 20:39:23 FALSE NA
## 4 FALSE 0 <NA> 2020-05-07 23:10:41 FALSE NA
## 5 FALSE 0 <NA> 2020-05-07 14:28:56 FALSE NA
## 6 FALSE 0 <NA> 2020-05-07 22:40:31 FALSE NA
## id replyToUID statusSource screenName retweetCount
## 1 1.258402e+18 NA Twitter for Android jaykrishna324 34595
## 2 1.258434e+18 NA Twitter Web App arrowsmithwoman 19387
## 3 1.258497e+18 NA Twitter for iPhone CarrieNemitz21 14000
## 4 1.258535e+18 NA Twitter for iPad handdonordad 12728
## 5 1.258403e+18 NA Twitter for Android njq5qd4hCP477fH 6947
## 6 1.258527e+18 NA Twitter for iPad RonaldM79125420 6728
## isRetweet retweeted longitude latitude
## 1 TRUE FALSE NA NA
## 2 TRUE FALSE NA NA
## 3 TRUE FALSE NA NA
## 4 TRUE FALSE NA NA
## 5 TRUE FALSE NA NA
## 6 TRUE FALSE NA NA
## X screenName retweetCount isRetweet
## 1 6284 jaykrishna324 34595 TRUE
## 2 4694 arrowsmithwoman 19387 TRUE
## 3 1449 CarrieNemitz21 14000 TRUE
## 4 316 handdonordad 12728 TRUE
## 5 6248 njq5qd4hCP477fH 6947 TRUE
## 6 557 RonaldM79125420 6728 TRUE
Data cleaning and tokenization
We convert the data set into a corpus and then clean it: lower-casing all characters and removing punctuation marks, extra white space, and stop words.
library(tm)
library(textmineR)
library(RWeka)
library(wordcloud)
library(RColorBrewer)
trump_tweets_df_2 <- trump_tweets_df[c(1,2)]
# Remove emojis and other non-ASCII characters from the text
trump_tweets_df_2$text <- gsub("[^\x01-\x7F]", "", trump_tweets_df_2$text)
# Convert the text column into a corpus
trump_tweets_corp <- tm::VCorpus(tm::VectorSource(trump_tweets_df_2$text))
# Data cleaning
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, tolower)
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, PlainTextDocument)
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, removePunctuation)
# Remove stop words
new_stops <-c("covid","iphone","coronavirus","android","web","rt","chuonlinenews","Fashion", "fashionblogger", "Covid_19", "Juventus", "WuhanVirus","covid19","dranthonyfauci","scotgov youre", "rvawonk two","false","president","realdonaldtrump","champion")
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, removeWords, words = c(stopwords("english"), new_stops))
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, stripWhitespace)
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, PlainTextDocument)
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, removePunctuation)
trump_tweets_corp <- tm::tm_map(trump_tweets_corp, removeNumbers)
# Tokenize tweets texts into words
tokenizer <- function(x) {
RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}
tdm <- TermDocumentMatrix(
trump_tweets_corp,
control = list(tokenize = tokenizer)
)
tdm <- as.matrix(tdm)
trump_tweets_cleaned_freq <- rowSums(tdm)
# Create a bi-gram (2-word) word cloud
pal <- RColorBrewer::brewer.pal(8,"Set1")
wordcloud::wordcloud(names(trump_tweets_cleaned_freq), trump_tweets_cleaned_freq, min.freq=50,max.words = 50, random.order=TRUE,random.color = TRUE, rot.per=.15, colors = pal,scale = c(3,1))
This word cloud shows the frequency of bi-grams (two-word phrases). From the bi-grams we can see what most people are talking about in response to Trump’s posts.
Sentiment Analysis
Sentiment analysis helps us understand people’s feelings toward a specific subject. We break the tweets’ sentences into words for further analysis.
library(tibble)
library(tidytext)
# Transform sentences into words
trump_data <- trump_tweets_df %>%
tidytext::unnest_tokens(output = "words", input = text, token = "words")
# Remove stop words from tibble
trump_clean_data <- trump_data %>%
dplyr::anti_join(stop_words, by=c("words"="word")) %>% dplyr::filter(words != "trump" )
Polarity scores help us make quantitative judgments about the feelings in a piece of text. In short, we categorize words from the tweets as positive or negative and give each a score for analysis. We assigned sentiments using the bing lexicon, then filtered the dataset to keep only words with an absolute polarity score of 80 or more.
library(tidyr)
library(ggplot2)
sentiment_data <- trump_clean_data %>%
# Inner join to bing lexicon by term = word
dplyr::inner_join(get_sentiments("bing"), by = c("words" = "word")) %>%
# Count by term and sentiment, weighted by count
dplyr::count(words, sentiment) %>%
# Spread sentiment, using n as values
tidyr::spread(sentiment, n, fill = 0) %>%
# Mutate to add a polarity column
dplyr::mutate(polarity = positive - negative)
# show summary of sentiment data
summary(sentiment_data)
## words negative positive polarity
## Length:860 Min. : 0.00 Min. : 0.000 Min. :-984.000
## Class :character 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: -3.000
## Mode :character Median : 1.00 Median : 0.000 Median : -1.000
## Mean : 10.16 Mean : 2.238 Mean : -7.926
## 3rd Qu.: 3.00 3rd Qu.: 1.000 3rd Qu.: 1.000
## Max. :984.00 Max. :854.000 Max. : 854.000
polarity_data <- sentiment_data %>%
# Filter for absolute polarity at least 80
dplyr::filter(abs(polarity) >= 80) %>%
# add new column named as sentiments, shows positive/negative
dplyr::mutate(
Sentiments = ifelse(polarity > 0, "positive", "negative")
)
ggplot2::ggplot(polarity_data, aes(reorder(words, polarity), polarity, fill = Sentiments)) +
geom_col() +
ggtitle("Sentiment Word Frequency") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5, size = 10))+
xlab("Word")
From the sentiment word frequencies we can see that negative sentiments occur much more often than positive ones.
To get a clearer picture of how positive and negative words are used, we assigned sentiments using the ‘bing’ lexicon and did a simple count to find the 15 most common positive and negative words in the extracted tweets.
word_counts <- trump_clean_data %>%
# sentiment analysis using the "bing" lexicon
dplyr::inner_join(get_sentiments("bing"), by = c("words" = "word")) %>%
# Count by word and sentiment
dplyr::count(words, sentiment)
top_words <- word_counts %>%
# Group by sentiment
dplyr:: group_by(sentiment) %>%
# Take the top 15 for each sentiment
dplyr::top_n(15) %>%
dplyr::ungroup() %>%
# Make word a factor in order of n
dplyr::mutate(words = reorder(words, n))
ggplot2::ggplot(top_words, aes(words, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = n, hjust=1), size = 3.5, color = "black") +
facet_wrap(~sentiment, scales = "free") +
coord_flip() +
ggtitle("Most common positive and negative words")
# Sentiment word cloud
tokenizer <- function(x) {
RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 1))
}
tdm <- TermDocumentMatrix(
trump_tweets_corp,
control = list(tokenize = tokenizer)
)
tdm <- as.matrix(tdm)
trump_tweets_cleaned_freq <- rowSums(tdm)
# Create a uni-gram (1-word) word cloud
pal <- RColorBrewer::brewer.pal(9,"Set2")
wordcloud::wordcloud(names(trump_tweets_cleaned_freq), trump_tweets_cleaned_freq, min.freq=50,max.words = 50, random.order=TRUE,random.color = TRUE, rot.per=.15, colors = pal,scale = c(3,1))
Conclusion
Based on the \(R^2\) values (roughly 4%–5%) between White House briefing sentiment and the funds, there is little correlation and no statistical significance tying the briefings to the average closing price of small-, mid-, or large-cap growth funds. Ideally, we wanted to perform a sentiment analysis of Trump’s tweets to visualize the impact of his unfiltered words on the economy, especially during the current crisis. Unfortunately, Twitter’s API only allows downloading roughly a week’s worth of tweets from the present date.
Something to consider: there are many variables that affect stock and economic performance which we did not take into account. In the future, a sentiment analysis of news articles from the Financial Times or the Wall Street Journal, as well as the social media accounts of prominent financial commentators, compared against the White House briefings, might show a stronger correlation with market performance. Also, bing was the only lexicon we compared against market performance; we did not test the others.
Sentence-level sentiment analysis could also be considered here, since it scores whole sentences rather than individual words. It works well with larger amounts of data and could be explored to produce sentiment scores that align better with the stocks.
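As an illustration of this idea (not part of the original analysis), the sentimentr package scores whole sentences and accounts for valence shifters such as negators and amplifiers that unigram lexicons miss; a minimal sketch against the briefing data:
library(sentimentr)
# Split each briefing paragraph into sentences and score them;
# element_id maps each score back to a row of wh_data
briefing_sentences <- get_sentences(wh_data$text)
sentence_scores <- sentiment(briefing_sentences)
head(sentence_scores) # element_id, sentence_id, word_count, sentiment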
Overall, the tweets convey an optimistic sentiment, with high frequencies of words such as “positive”, “safe”, and “lead” around defeating the coronavirus, while the most frequent negative words are “lack”, “scam”, and “concern”. In the Sentiment Word Frequency bar graph, “lack” has the highest frequency of any word, suggesting many tweets about deaths in the COVID-19 pandemic and claims that the President’s statements were not accurate.
The most frequent bi-grams in the word cloud, such as “tested positive”, “approved trials”, “lung disinfectant”, “concern huge”, and “huge scam”, suggest that while the government pushed toward a vaccine, the number of infections kept rising and people grew more anxious about their health.
In this sentiment analysis we see far more negative sentiment than positive: infections were increasing daily with no vaccine yet released, the government was trying hard to help, and health-care professionals were themselves getting infected.
Challenges & Additional Comments
- Incorporating such different data into one graph for comparison
- Twitter data was only available for the past ~8 days, so we were unable to chart it over the full time frame
- A feature not discussed in class: using linear regression to relate stock prices to the sentiment scores
Resources
- https://bradleyboehmke.github.io/2015/12/scraping-html-text.html
- https://projects.fivethirtyeight.com/trump-approval-ratings/
- Twitter API using R: https://www.youtube.com/watch?v=M_PnapGrpNI
- How Twitter reacted to the coronavirus pandemic: https://towardsdatascience.com/how-did-twitter-react-to-the-coronavirus-pandemic-2857592b449a