United States of America being one of the Superpowers, the leader of the country will have a significant influence not just for the country but the entire world. The President of the United States is considered one of the world’s most powerful people, leading the world’s only contemporary superpower. The role includes being the commander-in-chief of the world’s most expensive military with the largest nuclear arsenal and leading the nation with the largest economy by real and nominal GDP. The office of the president holds significant hard and soft power both in the United States and abroad.1
The President is elected by Electoral College to a four-year term. Current President Barack Obama will be ending his second four-year term and The United States Presidential Election is scheduled for Tuesday, November 8, 2016. The series of presidential primary elections and caucuses took place between February 1 and June 14, 2016.
Former Secretary of State and New York Senator Hillary Clinton is the Democratic Party’s presidential nominee. Businessman and reality television personality Donald Trump is the Republican Party’s presidential nominee. Hillary Clinton, if elected will be the first woman to take the Office of the President in United States which makes this year’s election very interesting. Opinion Polls are conducted Nationwide and they seem to tell a consistent story as to where the race stands.
The Report is an attempt at predicting the probable President of United States of America based on the Opinion Polls conducted from January 2016 to present day.
Huffington Post, One of the leading news aggregators in America, has been publishing the results of various Polls conducted across the nation. R’s XML library will be used to get the live data from Huffington Post - http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton
The Dataset contains Poll information, Percent Votes for Trump, Clinton, Others, Undecided and Spread. Results of over 30 agents conducting the polls from past 20 months are listed. For the current study, results from January 2016 and onwards will be used.
Data will be extracted from the website when code is run and results from January 2016 and onwards will be extracted
#Get data from the website
library(XML)
#Data From Polls published on Huffington Post
rawHuff <- readHTMLTable('http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton')
Huff <- data.frame(rawHuff[[1]])
Huff <- Huff %>% dplyr::filter(!grepl("2015",Poll))
Data Extracted Contains Information from Polls conducted by various agencies like Rasmussen, CNN totalling to 26 sources. There are several challenges with this data
# Replace Poll Field to end with "Voters"
Huff$Poll <- gsub("Adults","Adult Voters",Huff$Poll)
# Extract num_voters from Poll Column
num_voters <- str_extract(Huff$Poll,word(Huff$Poll,-3))
num_voters <- as.numeric(gsub(",","",num_voters))
# Extract Voter Type from Poll Column
voter_type <- as.vector(str_extract(Huff$Poll,word(Huff$Poll,-2)))
# Extract Poll name and clean it.
poll_name <- str_extract(Huff$Poll,word(Huff$Poll,1))
poll_name <- gsub("\n","",poll_name)
# Extract poll week
patt <- '(\\w+)\\s*(\\w+)\\s*\u2013\\s*(\\w+)\\s*(\\w+)'
poll_week <- str_extract(Huff$Poll,patt)
start_date <- word(poll_week,1,2)
end_date <- word(poll_week,-2,-1)
poll_start_date <- as.Date(start_date,"%b %d")
poll_end_date <- as.Date(end_date,"%b %d")
Extracted Values are added as columns to the dataframe. Based on the poll week, polls need to be sorted on a monthly basis. Percent Votes for each candidate will be converted to numeric values. Finally Poll field will be removed from the dataset. Data will be re-arrange for better readability
# Add 5 new columns to Huff Dataframe
Huff <- mutate(Huff, num_of_voters = num_voters,type_of_voter = voter_type,
poll_name = poll_name,poll_week = poll_week,
poll_start_date = poll_start_date,poll_end_date = poll_end_date)
# Compute month variable from poll_start_date
Huff$month <- ifelse(month(Huff$poll_start_date)==1,1,
ifelse(month(Huff$poll_start_date)==2,2,
ifelse(month(Huff$poll_start_date)==3,3,
ifelse(month(Huff$poll_start_date)==4,4,
ifelse(month(Huff$poll_start_date)==5,5,
ifelse(month(Huff$poll_start_date)==6,6, ifelse(month(Huff$poll_start_date)==7,7,
ifelse(month(Huff$poll_start_date)==8,8,
ifelse(month(Huff$poll_start_date)==9,9,
ifelse(month(Huff$poll_start_date)==10,10,
ifelse(month(Huff$poll_start_date)==11,11,11)
))))))))))
Huff$month <- as.factor(Huff$month)
# Remove Poll Column
Huff$Poll <- NULL
# Convert percent_votes from char to numeric
Huff$Trump <- as.numeric(as.character(Huff$Trump))
Huff$Clinton <- as.numeric(as.character(Huff$Clinton))
Huff$Other <- as.numeric(as.character(Huff$Other))
Huff$Undecided <- as.numeric(as.character(Huff$Undecided))
Huff <- Huff[c(8,9,1,2,3,4,12,10,11,7,6,5)]
kable(head(Huff[,1:ncol(Huff)]), format = "markdown")
| poll_name | poll_week | Trump | Clinton | Other | Undecided | month | poll_start_date | poll_end_date | type_of_voter | num_of_voters | Spread |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UPI/CVOTER | Aug 16 Aug 22 | 47 | 48 | 4 | NA | 8 | 2016-08-16 | 2016-08-22 | Likely | 1214 | Clinton +1 |
| NBC/SurveyMonkey | Aug 15 Aug 21 | 42 | 50 | NA | 8 | 8 | 2016-08-15 | 2016-08-21 | Registered | 17459 | Clinton +8 |
| Morning | Aug 18 Aug 20 | 38 | 44 | NA | 18 | 8 | 2016-08-18 | 2016-08-20 | Registered | 2001 | Clinton +6 |
| ARG | Aug 17 Aug 20 | 42 | 47 | 6 | 5 | 8 | 2016-08-17 | 2016-08-20 | Registered | 994 | Clinton +5 |
| Ipsos/Reuters | Aug 13 Aug 17 | 36 | 41 | 10 | 13 | 8 | 2016-08-13 | 2016-08-17 | Likely | 1049 | Clinton +5 |
| Normington, | Aug 9 Aug 15 | 40 | 50 | 5 | 5 | 8 | 2016-08-09 | 2016-08-15 | Likely | 1000 | Clinton +10 |
Data has to be transformed to obtain certain plots to explain the trend of the candidates across polls and in various months. Data is re-arranged for better readability
# Transpose data
Huff_gathered <- Huff %>% gather(Candidate,percent_votes,Trump:Undecided)
Huff_gathered$Candidate <- as.factor(Huff_gathered$Candidate)
Huff_gathered$Poll <- NULL
Huff_gathered <- Huff_gathered[c(1,2,9,10,3,4,5,6,7,8)]
kable(head(Huff_gathered[,1:ncol(Huff_gathered)]), format = "markdown")
| poll_name | poll_week | Candidate | percent_votes | month | poll_start_date | poll_end_date | type_of_voter | num_of_voters | Spread |
|---|---|---|---|---|---|---|---|---|---|
| UPI/CVOTER | Aug 16 Aug 22 | Trump | 47 | 8 | 2016-08-16 | 2016-08-22 | Likely | 1214 | Clinton +1 |
| NBC/SurveyMonkey | Aug 15 Aug 21 | Trump | 42 | 8 | 2016-08-15 | 2016-08-21 | Registered | 17459 | Clinton +8 |
| Morning | Aug 18 Aug 20 | Trump | 38 | 8 | 2016-08-18 | 2016-08-20 | Registered | 2001 | Clinton +6 |
| ARG | Aug 17 Aug 20 | Trump | 42 | 8 | 2016-08-17 | 2016-08-20 | Registered | 994 | Clinton +5 |
| Ipsos/Reuters | Aug 13 Aug 17 | Trump | 36 | 8 | 2016-08-13 | 2016-08-17 | Likely | 1049 | Clinton +5 |
| Normington, | Aug 9 Aug 15 | Trump | 40 | 8 | 2016-08-09 | 2016-08-15 | Likely | 1000 | Clinton +10 |
Line Plots for all the candidates over the months as reported by Various Polls.
ggplot(Huff_gathered,aes(month,percent_votes)) +
geom_line(data = Huff_gathered,aes(group = Candidate,color = Candidate)) +
facet_wrap(~poll_name) + scale_colour_manual(values=c("blue","green","red","black"))
# Trend from January to Present
Huff_CT_trend <- aggregate(percent_votes ~ Candidate + month,Huff_gathered,mean)
ggplot(Huff_CT_trend,aes(month,percent_votes)) + geom_line(data = Huff_CT_trend,aes(group = Candidate,color = Candidate)) + scale_colour_manual(values=c("blue","green","red","black"))
“Undecided” and “Other” Candidates do not seem to be in competition as indicated by all of the polls (Plot 1.).
Percent Votes from combined polls for each candidate based on month also indicates that the “Undecided” and “Other” Candidates are not in competition from beginning of the year. (Plot 2.)
Hence further reporting and predictions will be done for the top runners Hillary Clinton and Donald Trump.
Huff_CT<- Huff_gathered %>% filter(Candidate == "Clinton" |Candidate == "Trump")
ggplot(Huff_CT, aes(x=month,y=percent_votes,fill=Candidate))+
geom_bar(stat="identity",position="dodge") +
facet_wrap(~poll_name) +
scale_fill_manual(values=c("blue","red"))
ggplot(Huff_CT, aes(month,percent_votes, fill=Candidate)) + geom_bar(stat="identity",position="dodge") +
scale_fill_manual(values=c("blue","red"))
Most of the poll results as well as overall monthly trend indicates that Hillary Clinton has been constantly leading since January.
It is intersting to note that the difference in votes varies on monthly basis this could be mostly due to the issues addressed by the candidates in their rallies. Although the impact of issues on Voter’s mood will not be considered in the current report, such a study in future could show intersting trends.
Based on the observations and conclusions predicitions will be done for Hillary Clinton using TimeSeries Modelling.
Data needs to be stored in Time-Series format
Huff$ClintonLeads <- Huff$Clinton - Huff$Trump
Huff_clinton <- select(Huff,month,Clinton)
Huff_clintonLeads <- select(Huff,month,ClintonLeads)
Huff_agg <- aggregate(Clinton~month,data= Huff_clinton,mean)
rdate <- as.Date(Huff_agg$month,"%Y-%m-%d")
Huff_ts_agg <- ts(Huff_agg$Clinton)
Plot of the time series data is generated. This gives the trend over time
plot.ts(Huff_agg$Clinton,col="blue",axes=F,xlab = "month", ylab = "Clinton Leads")
box()
axis(1,Huff_agg$month,format(Huff_agg$month,format="%b"))
Looking at the plot, there seems to have been fluctuations for Clinton earlier during the year, this could probably be because there was race among democratic candidates. However after May , Clinton has been having Upward trend. The fluctuation in the upward trend could be due to the issues she has been addressing during the rallies. This is a potential factor that could be used for predictions in the future(this report does not use the issues addressed by candidates)
Predictions is done using forecast() function. Data is fitted in time series linear model .
Clintonforecasts <- forecast(Huff_ts_agg,h=8,level=c(80,95))
plot(Clintonforecasts)
plot(fitted(Clintonforecasts))
summary(Clintonforecasts)
##
## Forecast method: ETS(A,N,N)
##
## Model Information:
## ETS(A,N,N)
##
## Call:
## ets(y = object, lambda = lambda, allow.multiplicative.trend = allow.multiplicative.trend)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 45.6694
##
## sigma: 1.0889
##
## AIC AICc BIC
## 21.99844 24.39844 22.15732
##
## Error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0003289607 1.088915 0.880404 -0.05695047 1.916417
## MASE ACF1
## Training set 0.6286183 -0.1107369
##
## Forecasts:
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 9 45.66942 44.27392 47.06492 43.53519 47.80366
## 10 45.66942 44.27392 47.06492 43.53519 47.80366
## 11 45.66942 44.27392 47.06492 43.53519 47.80366
## 12 45.66942 44.27392 47.06492 43.53519 47.80366
## 13 45.66942 44.27392 47.06492 43.53519 47.80366
## 14 45.66942 44.27392 47.06492 43.53519 47.80366
## 15 45.66942 44.27392 47.06492 43.53519 47.80366
## 16 45.66942 44.27392 47.06492 43.53519 47.80366
From the forecast above, it is very much apparent that Clinton will maintain her lead in next couple of months up until the election which will be result of final election as well.
I am grateful to my mentor for being a constant support throught out my Data Science learning Journey and Suggesting Capstone Project and Time Series modelling method.