Predictions for United States Presidential Elections 2016

Introduction

United States of America being one of the Superpowers, the leader of the country will have a significant influence not just for the country but the entire world. The President of the United States is considered one of the world’s most powerful people, leading the world’s only contemporary superpower. The role includes being the commander-in-chief of the world’s most expensive military with the largest nuclear arsenal and leading the nation with the largest economy by real and nominal GDP. The office of the president holds significant hard and soft power both in the United States and abroad.1

The President is elected by Electoral College to a four-year term. Current President Barack Obama will be ending his second four-year term and The United States Presidential Election is scheduled for Tuesday, November 8, 2016. The series of presidential primary elections and caucuses took place between February 1 and June 14, 2016.

Former Secretary of State and New York Senator Hillary Clinton is the Democratic Party’s presidential nominee. Businessman and reality television personality Donald Trump is the Republican Party’s presidential nominee. Hillary Clinton, if elected will be the first woman to take the Office of the President in United States which makes this year’s election very interesting. Opinion Polls are conducted Nationwide and they seem to tell a consistent story as to where the race stands.

The Report is an attempt at predicting the probable President of United States of America based on the Opinion Polls conducted from January 2016 to present day.

Description of Dataset

Huffington Post, One of the leading news aggregators in America, has been publishing the results of various Polls conducted across the nation. R’s XML library will be used to get the live data from Huffington Post - http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton

The Dataset contains Poll information, Percent Votes for Trump, Clinton, Others, Undecided and Spread. Results of over 30 agents conducting the polls from past 20 months are listed. For the current study, results from January 2016 and onwards will be used.

Analysis and Cleaning of Variables in the Data Set

Data will be extracted from the website when code is run and results from January 2016 and onwards will be extracted

#Get data from the website
library(XML)

#Data From Polls published on Huffington Post
rawHuff <- readHTMLTable('http://elections.huffingtonpost.com/pollster/2016-general-election-trump-vs-clinton')
Huff <- data.frame(rawHuff[[1]])

Huff <- Huff %>% dplyr::filter(!grepl("2015",Poll))

Dataset : 191, 6

Data Extracted Contains Information from Polls conducted by various agencies like Rasmussen, CNN totalling to 26 sources. There are several challenges with this data

Data is extracted using readHTMLTable function in XML library which has advantage of capturing the latest and greatest poll results
Data contains information from previous year, hence only Poll results from January 2016 and onwards will be filtered.
Poll Field in the results is a combination of Poll name, Poll week, Number of people who participated in the poll and the type of voters.
Fields need to be extracted from the above field

Extract poll_name, start_date,end_date,type_of_voter and number_of_voters from Poll field

# Replace Poll Field to end with "Voters"
Huff$Poll <- gsub("Adults","Adult Voters",Huff$Poll)

# Extract num_voters from Poll Column
num_voters <- str_extract(Huff$Poll,word(Huff$Poll,-3))
num_voters <- as.numeric(gsub(",","",num_voters))

# Extract Voter Type from Poll Column
voter_type <- as.vector(str_extract(Huff$Poll,word(Huff$Poll,-2)))

# Extract Poll name and clean it.
poll_name <- str_extract(Huff$Poll,word(Huff$Poll,1))
poll_name <- gsub("\n","",poll_name)

# Extract poll week
patt <- '(\\w+)\\s*(\\w+)\\s*\u2013\\s*(\\w+)\\s*(\\w+)'

poll_week <- str_extract(Huff$Poll,patt)

start_date <- word(poll_week,1,2)
end_date <- word(poll_week,-2,-1)

poll_start_date <- as.Date(start_date,"%b %d")
poll_end_date <- as.Date(end_date,"%b %d")

Add the extracted Variable

Extracted Values are added as columns to the dataframe. Based on the poll week, polls need to be sorted on a monthly basis. Percent Votes for each candidate will be converted to numeric values. Finally Poll field will be removed from the dataset. Data will be re-arrange for better readability

# Add 5 new columns to Huff Dataframe
Huff <- mutate(Huff, num_of_voters = num_voters,type_of_voter = voter_type,
                    poll_name = poll_name,poll_week = poll_week,
                    poll_start_date = poll_start_date,poll_end_date = poll_end_date)

# Compute month variable from poll_start_date
Huff$month <- ifelse(month(Huff$poll_start_date)==1,1,
                 ifelse(month(Huff$poll_start_date)==2,2,
                    ifelse(month(Huff$poll_start_date)==3,3,
                       ifelse(month(Huff$poll_start_date)==4,4,
                          ifelse(month(Huff$poll_start_date)==5,5,
                             ifelse(month(Huff$poll_start_date)==6,6,                                                           ifelse(month(Huff$poll_start_date)==7,7,
                                   ifelse(month(Huff$poll_start_date)==8,8,
                                      ifelse(month(Huff$poll_start_date)==9,9,
                                         ifelse(month(Huff$poll_start_date)==10,10,
                                            ifelse(month(Huff$poll_start_date)==11,11,11)
                                               ))))))))))


Huff$month <- as.factor(Huff$month)

# Remove Poll Column
Huff$Poll <- NULL

# Convert percent_votes from char to numeric
Huff$Trump <- as.numeric(as.character(Huff$Trump))
Huff$Clinton <- as.numeric(as.character(Huff$Clinton))
Huff$Other <- as.numeric(as.character(Huff$Other))
Huff$Undecided <- as.numeric(as.character(Huff$Undecided))

Huff <- Huff[c(8,9,1,2,3,4,12,10,11,7,6,5)]

kable(head(Huff[,1:ncol(Huff)]), format = "markdown")

poll_name	poll_week	Trump	Clinton	Other	Undecided	month	poll_start_date	poll_end_date	type_of_voter	num_of_voters	Spread
UPI/CVOTER	Aug 16 Aug 22	47	48	4	NA	8	2016-08-16	2016-08-22	Likely	1214	Clinton +1
NBC/SurveyMonkey	Aug 15 Aug 21	42	50	NA	8	8	2016-08-15	2016-08-21	Registered	17459	Clinton +8
Morning	Aug 18 Aug 20	38	44	NA	18	8	2016-08-18	2016-08-20	Registered	2001	Clinton +6
ARG	Aug 17 Aug 20	42	47	6	5	8	2016-08-17	2016-08-20	Registered	994	Clinton +5
Ipsos/Reuters	Aug 13 Aug 17	36	41	10	13	8	2016-08-13	2016-08-17	Likely	1049	Clinton +5
Normington,	Aug 9 Aug 15	40	50	5	5	8	2016-08-09	2016-08-15	Likely	1000	Clinton +10

Data has to be transformed to obtain certain plots to explain the trend of the candidates across polls and in various months. Data is re-arranged for better readability

# Transpose data

Huff_gathered <- Huff %>% gather(Candidate,percent_votes,Trump:Undecided)

Huff_gathered$Candidate <- as.factor(Huff_gathered$Candidate)

Huff_gathered$Poll <- NULL

Huff_gathered <- Huff_gathered[c(1,2,9,10,3,4,5,6,7,8)]

kable(head(Huff_gathered[,1:ncol(Huff_gathered)]), format = "markdown")

poll_name	poll_week	Candidate	percent_votes	month	poll_start_date	poll_end_date	type_of_voter	num_of_voters	Spread
UPI/CVOTER	Aug 16 Aug 22	Trump	47	8	2016-08-16	2016-08-22	Likely	1214	Clinton +1
NBC/SurveyMonkey	Aug 15 Aug 21	Trump	42	8	2016-08-15	2016-08-21	Registered	17459	Clinton +8
Morning	Aug 18 Aug 20	Trump	38	8	2016-08-18	2016-08-20	Registered	2001	Clinton +6
ARG	Aug 17 Aug 20	Trump	42	8	2016-08-17	2016-08-20	Registered	994	Clinton +5
Ipsos/Reuters	Aug 13 Aug 17	Trump	36	8	2016-08-13	2016-08-17	Likely	1049	Clinton +5
Normington,	Aug 9 Aug 15	Trump	40	8	2016-08-09	2016-08-15	Likely	1000	Clinton +10

Exploratory Data Analysis

Line Plots for all the candidates over the months as reported by Various Polls.

ggplot(Huff_gathered,aes(month,percent_votes)) + 
  geom_line(data = Huff_gathered,aes(group = Candidate,color = Candidate)) +
  facet_wrap(~poll_name) + scale_colour_manual(values=c("blue","green","red","black"))

# Trend from January to Present 

Huff_CT_trend <- aggregate(percent_votes ~ Candidate + month,Huff_gathered,mean)

ggplot(Huff_CT_trend,aes(month,percent_votes)) + geom_line(data = Huff_CT_trend,aes(group = Candidate,color = Candidate)) + scale_colour_manual(values=c("blue","green","red","black"))

Conclusions

“Undecided” and “Other” Candidates do not seem to be in competition as indicated by all of the polls (Plot 1.).
Percent Votes from combined polls for each candidate based on month also indicates that the “Undecided” and “Other” Candidates are not in competition from beginning of the year. (Plot 2.)
Hence further reporting and predictions will be done for the top runners Hillary Clinton and Donald Trump.

Bar Plots for Clinton Vs Trump Based on Polls and well as monthly Trends

Huff_CT<- Huff_gathered %>% filter(Candidate == "Clinton" |Candidate == "Trump")

ggplot(Huff_CT, aes(x=month,y=percent_votes,fill=Candidate))+
  geom_bar(stat="identity",position="dodge") + 
  facet_wrap(~poll_name) + 
  scale_fill_manual(values=c("blue","red"))

ggplot(Huff_CT, aes(month,percent_votes, fill=Candidate)) +              geom_bar(stat="identity",position="dodge") + 
scale_fill_manual(values=c("blue","red"))

Conclusions

Most of the poll results as well as overall monthly trend indicates that Hillary Clinton has been constantly leading since January.

It is intersting to note that the difference in votes varies on monthly basis this could be mostly due to the issues addressed by the candidates in their rallies. Although the impact of issues on Voter’s mood will not be considered in the current report, such a study in future could show intersting trends.

Predictions

Based on the observations and conclusions predicitions will be done for Hillary Clinton using TimeSeries Modelling.

Data needs to be stored in Time-Series format

Huff$ClintonLeads <- Huff$Clinton - Huff$Trump
Huff_clinton <- select(Huff,month,Clinton)
Huff_clintonLeads <- select(Huff,month,ClintonLeads)

Huff_agg <- aggregate(Clinton~month,data= Huff_clinton,mean)
rdate <- as.Date(Huff_agg$month,"%Y-%m-%d")
Huff_ts_agg <- ts(Huff_agg$Clinton)

Plot of the time series data is generated. This gives the trend over time

Time Series Plots for Clinton monthly Trends

plot.ts(Huff_agg$Clinton,col="blue",axes=F,xlab = "month", ylab = "Clinton Leads")
box()
axis(1,Huff_agg$month,format(Huff_agg$month,format="%b"))

Looking at the plot, there seems to have been fluctuations for Clinton earlier during the year, this could probably be because there was race among democratic candidates. However after May , Clinton has been having Upward trend. The fluctuation in the upward trend could be due to the issues she has been addressing during the rallies. This is a potential factor that could be used for predictions in the future(this report does not use the issues addressed by candidates)

Predictions is done using forecast() function. Data is fitted in time series linear model .

Clintonforecasts <- forecast(Huff_ts_agg,h=8,level=c(80,95))

plot(Clintonforecasts)

plot(fitted(Clintonforecasts))

summary(Clintonforecasts)

## 
## Forecast method: ETS(A,N,N)
## 
## Model Information:
## ETS(A,N,N) 
## 
## Call:
##  ets(y = object, lambda = lambda, allow.multiplicative.trend = allow.multiplicative.trend) 
## 
##   Smoothing parameters:
##     alpha = 1e-04 
## 
##   Initial states:
##     l = 45.6694 
## 
##   sigma:  1.0889
## 
##      AIC     AICc      BIC 
## 21.99844 24.39844 22.15732 
## 
## Error measures:
##                         ME     RMSE      MAE         MPE     MAPE
## Training set -0.0003289607 1.088915 0.880404 -0.05695047 1.916417
##                   MASE       ACF1
## Training set 0.6286183 -0.1107369
## 
## Forecasts:
##    Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
##  9       45.66942 44.27392 47.06492 43.53519 47.80366
## 10       45.66942 44.27392 47.06492 43.53519 47.80366
## 11       45.66942 44.27392 47.06492 43.53519 47.80366
## 12       45.66942 44.27392 47.06492 43.53519 47.80366
## 13       45.66942 44.27392 47.06492 43.53519 47.80366
## 14       45.66942 44.27392 47.06492 43.53519 47.80366
## 15       45.66942 44.27392 47.06492 43.53519 47.80366
## 16       45.66942 44.27392 47.06492 43.53519 47.80366

From the forecast above, it is very much apparent that Clinton will maintain her lead in next couple of months up until the election which will be result of final election as well.

Acknowledgements

I am grateful to my mentor for being a constant support throught out my Data Science learning Journey and Suggesting Capstone Project and Time Series modelling method.

Predictions for United States Presidential Elections 2016

Harshala Rajesh

August 16, 2016

Introduction

Description of Dataset

Analysis and Cleaning of Variables in the Data Set

Dataset : 191, 6

Extract poll_name, start_date,end_date,type_of_voter and number_of_voters from Poll field

Add the extracted Variable

Exploratory Data Analysis

Conclusions

Bar Plots for Clinton Vs Trump Based on Polls and well as monthly Trends

Conclusions

Predictions

Time Series Plots for Clinton monthly Trends

Acknowledgements