Introduction

During the 2016 US election, hundreds of skilled analysts tried to predict the outcome of the race between the two leading candidates. Even so, many forecasts were overwhelmingly wrong, giving the eventual losing party winning chances of over 80%. In this analysis, we attempt to predict the results of the 2016 election ourselves and dive into the many difficulties of predicting an election, as well as what might have gone wrong in so many of the 2016 forecasts.

What makes voter behavior prediction and election forecasting a difficult problem?

There are several factors that make it difficult to accurately predict voter behavior, all of which stem from the fact that large-scale human behavior is difficult to predict and hard to poll. Broken into more specific categories of data-analysis error, some of the issues that could have produced a wrong prediction are biases such as nonresponse bias and sampling bias, as well as changes in voter behavior or outright dishonesty, all of which make it hard to collect reliable data that translates into an accurate election forecast.

Nonresponse bias occurs when a population is measured inaccurately because individuals in the sample are reluctant to respond or avoid responding, producing a false projection of the actual population. In an election, for example, individuals who choose not to reveal their candidate preference can drastically skew the results toward the opposing party.

Sampling bias occurs when data is collected from a sample that leans toward a particular preference; in other words, the data is not collected diversely enough to represent the whole population. In the election, for example, if sample data were collected largely from heavily Democratic states, a predictive model would significantly overestimate a Democratic candidate's chances of victory.

Voter behaviors that can cause model inaccuracy include over-enthusiastic respondents, who can distort the apparent level of support for each candidate among likely voters. Voters may also lie about their intentions, introducing false information that drastically changes the model's output. Even honest respondents may change their minds, making any data collected before the actual vote irrelevant and inaccurate. In short, even if the data accurately captures voters' stated preferences at the time they are polled, it may not reflect how they actually vote on election day.

Who is Nate Silver, and what was unique about his approach in 2012 that allowed him to achieve good predictions?

Nate Silver is a journalist and statistician who has built forecast models for many US elections. What made his 2012 approach so successful was how aggressively it compensated for the biases described above compared with other predictive models. Silver relied heavily on time series analysis, Bayesian probability, hierarchical clustering, and graph theory to build his model. He explicitly modeled the bias in polls to account for possible voter dishonesty, and he tracked and simulated shifts in public voting behavior and trends toward a particular candidate to better predict the actual 2012 outcome. Hierarchical clustering was crucial to his success in measuring voter response, as it allowed him to adjust the prediction for each individual state using correlated data from states with similar demographics and responses.

What went wrong in 2016? What should be done to make future predictions better?

The problem with the 2016 forecasts lay in overconfidence and a heavy bias toward Hillary Clinton's support and dominance over Donald Trump in the polls. Many of the states projected to favor Clinton instead turned in favor of Trump, likely because of the biases and errors discussed in question 1 above. A large influx of Trump voters was not accounted for in the predictive models because they were missing from survey polls, most plausibly due to nonresponse bias, which produced a false projection of the actual number of Trump voters. Likewise, a substantial share of Clinton's projected support did not materialize: actual votes fell far short in states where she was expected to dominate, which can be attributed to changes in voter mindset.

To make future election predictions more accurate, data should be collected over a longer time span to track fluctuations in voter behavior in the run-up to election day. Aggregating these results helps quantify the day-to-day variance in likely-voter behavior and thus gives a more accurate estimate of the actual vote totals for each candidate's camp. States where results are unclear or where the sample size is small should be treated as swing states whose data cannot reliably reflect the actual vote. Furthermore, each state should be correlated with other states, in addition to its own past election results, to better gauge the behavior of consistent, returning voters.

Studying the dimensions of our raw data file, election.raw

county               fips   candidate        state  votes
Los Angeles County   6037   Hillary Clinton  CA     2464364
Los Angeles County   6037   Donald Trump     CA     769743
Los Angeles County   6037   Gary Johnson     CA     88968
Los Angeles County   6037   Jill Stein       CA     76465
Los Angeles County   6037   Gloria La Riva   CA     21993

Dimension Table
Observations   Variables
18345          5

After removing all observations with fips == 2000 from the dataset election.raw, we obtain 18,345 observations of 5 variables.

As for why we remove the observations with fips == 2000: these are rows whose results are not tied to a clear county. We remove them because they lack important information and could skew the final data.

Many of the remaining rows are summary totals for candidates, and we do not need these summary results when building our predictive model. These rows provide no information beyond what is already present in the county-level data; left in place, they would be miscounted as counties and grossly skew our results.
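A minimal sketch of this cleaning step, assuming election.raw has already been read in (if fips is stored as character, compare against "2000" instead):

```r
library(dplyr)

# Drop rows whose fips code is 2000, since they cannot be tied to a single county
election.raw <- election.raw %>%
  filter(fips != 2000)

dim(election.raw)  # expected: 18345 rows, 5 columns
```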

Remove summary rows from election.raw

From the data above, we observe that there are summary rows for each candidate at the federal, state, and county level. Mixing the three levels of summaries would not help our model, so it is better to separate each level into its own dataset; otherwise, they would grossly skew our results, since the vote totals at each level differ vastly in scale.

We now split the election.raw dataset into three distinct categories.

  1. Federal-level summaries are placed into a dataset called election_federal.

  2. State-level summaries are placed into a dataset called election_state.

  3. County-level data is placed in a dataset called election.

We will prioritize building models from the smallest and most specific dataset, which in this case is the county-level dataset, election.
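A sketch of one way to perform this split with dplyr, assuming federal summary rows are flagged with fips == "US" and state summary rows have a state-level fips but no county; the exact coding in election.raw may differ, so these filters are assumptions:

```r
library(dplyr)

# Federal-level totals: one row per candidate for the whole country
election_federal <- election.raw %>% filter(fips == "US")

# State-level totals: no county recorded, but not the federal rows
election_state <- election.raw %>% filter(fips != "US", is.na(county))

# County-level results: everything with a named county
election <- election.raw %>% filter(!is.na(county))
```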

A bar chart of all votes received by each candidate, separated into multiple plots

Based on the data above, there are 32 candidates, with Democratic nominee Hillary Clinton and Republican nominee Donald Trump as the top two candidates in the 2016 election.
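A sketch of how such a chart can be drawn with ggplot2, assuming election_federal holds one total row per candidate; the log scale keeps the minor candidates visible next to the two front-runners:

```r
library(ggplot2)

ggplot(election_federal, aes(x = reorder(candidate, votes), y = votes)) +
  geom_col(fill = "steelblue") +
  coord_flip() +                 # horizontal bars so 32 names stay readable
  scale_y_log10() +              # log scale: Clinton and Trump dwarf everyone else
  labs(x = "Candidate", y = "Total votes (log scale)")
```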

Here we create the variables county_winner and state_winner by taking, for each county and state, the candidate with the highest proportion of votes.
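One plausible dplyr implementation, assuming fips uniquely identifies each county and state in these tables (slice_max() needs dplyr >= 1.0; top_n() is the older equivalent):

```r
library(dplyr)

county_winner <- election %>%
  group_by(fips) %>%
  mutate(total = sum(votes), pct = votes / total) %>%
  slice_max(pct, n = 1) %>%      # keep the candidate with the largest share
  ungroup()

state_winner <- election_state %>%
  group_by(fips) %>%
  mutate(total = sum(votes), pct = votes / total) %>%
  slice_max(pct, n = 1) %>%
  ungroup()
```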

The graph below shows each state of the U.S., drawn using functions from the R package ggplot2.

A county-level map created with counties = map_data("county"), colored by county

Here, the map is further divided into the individual counties of the U.S. at the time of the 2016 election.
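A sketch of the county outline map using ggplot2 and the maps package:

```r
library(ggplot2)
library(maps)

counties <- map_data("county")

# One polygon per county; fill by subregion only to distinguish neighbours
ggplot(counties, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = subregion), colour = "white", size = 0.1) +
  coord_quickmap() +
  theme_void() +
  theme(legend.position = "none")   # 3000+ counties: a legend is unreadable
```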

Now we color the map by the winning candidate for each state. First, we combine the states map data and the state_winner dataset we created earlier using left_join(). Note that left_join() needs to match up values of states to join the tables. A call to left_join() takes all the values from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values:

Here, we'll be combining the two datasets based on state name. However, the state names are in different formats in the two tables, e.g. AZ vs. arizona. Before using left_join(), create a common column by creating a new column for states named fips = state.abb[match(some_column, some_function(state.name))]. Replace some_column and some_function to complete the creation of this new column. Then left_join(). The figure will look similar to the state-level New York Times map.

Each state is colored according to its winning candidate.
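A sketch of the join and the resulting map, assuming state_winner stores the two-letter abbreviation in its fips column; one common choice for the placeholders above is region and tolower():

```r
library(dplyr)
library(ggplot2)

states <- map_data("state")

states <- states %>%
  # map_data() gives lower-case full names in `region`; convert to abbreviations
  mutate(fips = state.abb[match(region, tolower(state.name))]) %>%
  left_join(state_winner, by = "fips")

ggplot(states, aes(x = long, y = lat, group = group, fill = candidate)) +
  geom_polygon(colour = "white") +
  coord_quickmap() +
  theme_void()
```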

The counties map data does not have a fips column, so we create one by pooling information from maps::county.fips: split the polyname column into region and subregion, use left_join() to combine county.fips with counties, and then left_join() the previously created county_winner. The figure will look similar to the county-level New York Times map.

The map above is further separated by county.
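A sketch of the county-level join, assuming county_winner keys counties by their numeric fips code; note that a few polyname entries carry suffixes such as ":main" and may need extra cleaning:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

county_fips <- maps::county.fips %>%
  separate(polyname, into = c("region", "subregion"), sep = ",")

counties_joined <- counties %>%
  left_join(county_fips, by = c("region", "subregion")) %>%
  left_join(county_winner %>% mutate(fips = as.integer(fips)), by = "fips")

ggplot(counties_joined, aes(x = long, y = lat, group = group, fill = candidate)) +
  geom_polygon(colour = "white", size = 0.1) +
  coord_quickmap() +
  theme_void()
```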

We create a visualization of our choice using the census data. Many exit polls noted that demographics played a big role in the election; the Washington Post article and the R graph gallery referenced in the assignment provided ideas and inspiration.

We created 4 visualizations from the census dataset to better understand the correlations between income, population, and ethnicity. More specifically, we created 4 density plots using the hexbin library in R to clearly visualize how poverty and income are distributed throughout the United States.
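As an illustration, a hex-bin density plot of income against poverty can be built as below; geom_hex() relies on the hexbin package, and the data frame name `census` and its column names are assumptions based on the attributes listed later:

```r
library(ggplot2)   # geom_hex() uses the hexbin package under the hood

ggplot(census, aes(x = Income, y = Poverty)) +
  geom_hex(bins = 40) +                      # hexagonal 2-D density
  scale_fill_viridis_c(name = "Count") +
  labs(x = "Median household income ($)", y = "Poverty rate (%)")
```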

From graph (1), we can see that the majority of the U.S. population earns around $50,000 and sits below the 25% poverty line. This graph demonstrates the purpose of a density plot: as income increases, observations move further away from high poverty percentages, whereas places high on the poverty axis tend to have low income. Furthermore, the density plot shows that outliers such as $250,000 incomes or 100% poverty rates are rare and thus have low density on the graph.

We can use graph (1) to further study what graphs (2), (3), and (4) tell us about the effects of population and ethnicity in relation to income. From graph (2), we can tell that most U.S. counties have populations under 20,000. Furthermore, this graph supports graph (1)'s conclusion that most people in the U.S. earn around $50,000: the densest concentration of observations comes from places with an estimated population of about 8,000 and an average income around $50,000.

We can study this distribution further by looking at graph (3). In communities where 50% or more of the population is Black, income is usually below $100,000, and very few observations exceed that. Furthermore, there is an inverse relationship between the percentage of the Black population and income: as the Black share of a community decreases, income tends to be higher. Graph (4) shows the opposite pattern for predominantly white communities: there is far more income diversity and far more observations earning over $100,000 in predominantly white communities than in predominantly Black communities, and there is a positive relationship in which whiter communities tend to have higher average incomes. While we need to be mindful of how each individual sample was distributed and collected by the census, these graphs let us approximate and visualize the rough disparity in how ethnicity relates to income and where most of each ethnicity lives in terms of their predominant communities.

The census data contains high-resolution information (more fine-grained than county level). In this problem, we aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county, creating the following variables:
State County Men White Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority
Alabama Autauga 48.43 75.79 73.75 51696 7771 24974 3434 12.91 18.71 32.79 17.17 24.28 17.16 87.51 8.781 0.0953 1.3060 1.8357 26.50 43.44 73.74 5.433 0.0000 7.734 22.54
Alabama Baldwin 48.85 83.10 75.69 51074 8745 27317 3804 13.42 19.48 32.73 17.95 27.10 11.32 84.60 8.959 0.1266 1.4438 3.8505 26.32 44.05 81.28 5.909 0.3633 7.590 15.21
Alabama Barbour 53.83 46.23 76.91 32959 6031 16824 2430 26.51 43.56 26.12 16.46 23.28 23.32 83.33 11.057 0.4954 1.6217 1.5019 24.52 31.92 71.59 7.150 0.0898 17.526 51.94
Alabama Bibb 53.41 74.50 77.40 38887 5662 18431 3074 16.60 27.20 21.59 17.96 17.47 23.74 83.43 13.154 0.5031 1.5621 0.7315 28.71 36.69 76.74 6.638 0.3942 8.163 24.17
Alabama Blount 49.41 87.85 73.38 46238 8696 20532 2052 16.72 26.86 28.53 13.94 23.84 20.10 84.85 11.279 0.3626 0.4199 2.2654 34.84 38.45 81.83 4.229 0.3565 7.700 10.59
Alabama Bullock 53.01 22.20 75.45 33293 9000 17580 3111 24.50 37.29 19.55 14.92 20.17 25.74 74.77 14.839 0.7732 1.8238 3.0999 28.63 36.20 79.09 5.274 0.0000 17.890 76.54
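A condensed sketch of the weighted aggregation, assuming census.clean holds the cleaned sub-county data with only numeric attribute columns besides State and County (the data frame name is an assumption):

```r
library(dplyr)

census.ct <- census.clean %>%
  group_by(State, County) %>%
  # TotalPop-weighted mean of every remaining attribute within each county
  summarise(across(-TotalPop, ~ weighted.mean(.x, w = TotalPop)),
            .groups = "drop")
```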

Run PCA for both county and sub-county level data. Save the first two principal components PC1 and PC2 into a two-column data frame, called ct.pc and subct.pc, respectively. Decide whether to center and scale the features before running PCA. Analyze the three features with the largest absolute values in the first principal component. Which features have opposite signs, and what does that mean about the correlation between these features?

For our PCA on both ct.pc and subct.pc, we chose scale = TRUE, because otherwise variables with large raw values would dominate the resulting principal components. We also chose to center the data. Both choices matter: setting scale = TRUE gives every variable a standard deviation of one, while centering gives every variable a mean of zero. In other words, all variables are compared on an equal scale with equal variance, which produces much more meaningful weights for each predictor and therefore better-behaved principal components.
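A sketch of the county-level PCA (the sub-county call is analogous). Whether ct.pc should hold the loadings or the scores is ambiguous in the prompt; here we keep the loadings, which is what the tables below report:

```r
library(dplyr)

ct.pca <- prcomp(select(census.ct, -State, -County),
                 center = TRUE, scale. = TRUE)

# First two principal components (loadings) as a two-column data frame
ct.pc <- data.frame(ct.pca$rotation[, 1:2])

# Features with the largest absolute weight on PC1
head(sort(abs(ct.pca$rotation[, "PC1"]), decreasing = TRUE))
```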

PCA on Counties
PC1 Weights by Absolute Value for Counties
  IncomePerCap   0.3516
  ChildPoverty   0.3436
  Poverty        0.3423
  Employed       0.3278
  Income         0.3200
  Unemployment   0.2907

PCA on Sub-counties
PC1 Weights by Absolute Value for Sub-counties
  IncomePerCap   0.3185
  Professional   0.3067
  Poverty        0.3048
  Income         0.3029
  ChildPoverty   0.2980
  Service        0.2690
Here, we use the head function to obtain the first couple of rows of Principal Component 1 for ct.pc and subct.pc. We can see that for ct.pc, the three features with the largest absolute values of the first component are IncomePerCap, ChildPoverty, and Poverty. While for subct.pc, the three features with the largest absolute values of the first component are IncomePerCap, Professional, and Poverty.
Largest absolute values of PC1 for County
                  PC1      PC2
IncomePerCap      0.3516   0.0675
ChildPoverty     -0.3436  -0.0516
Poverty          -0.3423  -0.0789

Largest absolute values of PC1 for Sub-county
                  PC1      PC2
IncomePerCap     -0.3185  -0.1784
Professional     -0.3067  -0.1541
Poverty           0.3048  -0.0768

According to the two tables above, for ct.pc, principal component 1 attributes much of its weight to IncomePerCap, ChildPoverty, and Poverty. We can also see that IncomePerCap is positive, while both ChildPoverty and Poverty are negative.

Principal component 1 is the component that explains the largest portion of the variance in the dataset. From its loadings we can see that if Poverty increases, ChildPoverty tends to increase as well, since the two share the same sign, while if IncomePerCap increases, both ChildPoverty and Poverty tend to decrease, since PC1 gives them opposite signs, indicating a negative correlation between them.

For subct.pc, we can see that IncomePerCap and Professional share the same sign, indicating a positive relationship between the two, while Poverty has the opposite sign, indicating a negative relationship with both. In other words, if poverty goes up, IncomePerCap and Professional will most likely go down.

These correlations make sense: if income per capita goes up, it is logical that both ChildPoverty and Poverty go down. Similarly for sub-counties, more professionals and higher income per capita are likely to be negatively correlated with poverty.

Determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Plot the proportion of variance explained (PVE) and cumulative PVE for both county and sub-county analyses.
Minimum number of PCs needed to capture 90% of the variance for the county and sub-county analyses
Counties Sub-counties
13 15
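A sketch of how the PVE numbers above can be computed from the county PCA object; the sub-county case is identical:

```r
pr.var <- ct.pca$sdev^2            # variance explained by each PC
pve    <- pr.var / sum(pr.var)     # proportion of variance explained

plot(pve, type = "b",
     xlab = "Principal Component", ylab = "PVE")
plot(cumsum(pve), type = "b",
     xlab = "Principal Component", ylab = "Cumulative PVE")

which(cumsum(pve) >= 0.90)[1]      # minimum number of PCs reaching 90%
```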

With census.ct, perform hierarchical clustering with complete linkage. Cut the tree to partition the observations into 10 clusters. Re-run the hierarchical clustering algorithm using the first 5 principal components of ct.pc as inputs instead of the original features. Compare and contrast the results. For both approaches, investigate the cluster that contains San Mateo County. Which approach seemed to put San Mateo County in a more appropriate cluster? Comment on what you observe and discuss possible explanations for these observations.

Summary of Hierarchical Clustering Approaches
                                              Mean    Std. Dev.
Hierarchical clustering (original features)   1.151   0.8219
Hierarchical clustering (first 5 PCs)         1.153   0.8192

We believe that hierarchical clustering on the first five principal components is the superior approach, as it has a smaller standard deviation. A smaller standard deviation suggests that this approach will place San Mateo County in the correct cluster more consistently. We also noticed that the mean for the clustering on the first five principal components is higher than for the regular method. A possible explanation is that the variables that carry the most weight in the first five PCs have higher mean values than the full feature set. The lower deviation can be explained by the fact that the first five components mostly capture variables that are strongly correlated with each other, producing lower variance than the regular method.
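A sketch of the two clustering runs, assuming census.ct and ct.pca from the earlier steps; whether the raw features are scaled before computing distances is a choice that affects the comparison:

```r
library(dplyr)

# (1) Complete linkage on the (scaled) original county features
ct.dist  <- dist(scale(select(census.ct, -State, -County)))
ct.hc    <- hclust(ct.dist, method = "complete")
ct.clust <- cutree(ct.hc, k = 10)

# (2) Same procedure on the first five principal component scores
pc.dist  <- dist(ct.pca$x[, 1:5])
pc.hc    <- hclust(pc.dist, method = "complete")
pc.clust <- cutree(pc.hc, k = 10)

# Which cluster contains San Mateo County under each approach?
sm <- which(census.ct$County == "San Mateo")
ct.clust[sm]
pc.clust[sm]
```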

Decision tree: train a decision tree with cv.tree(). Prune the tree to minimize misclassification error. Be sure to use the folds from above for cross-validation. Visualize the trees before and after pruning. Save training and test errors to the records variable. Interpret and discuss the results of the decision tree analysis. Use this plot to tell a story about voting behavior in the US (remember the NYT infographic?).
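A sketch of the fit-and-prune workflow, assuming trn.cl is the training set with a factor response candidate and folds is the fold assignment built earlier:

```r
library(tree)

tree.fit <- tree(candidate ~ ., data = trn.cl)

# Cross-validate on the pre-built folds, pruning by misclassification error
cv.fit <- cv.tree(tree.fit, rand = folds, FUN = prune.misclass)

best.size   <- cv.fit$size[which.min(cv.fit$dev)]
tree.pruned <- prune.misclass(tree.fit, best = best.size)

plot(tree.pruned)
text(tree.pruned, pretty = 0)
```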

From our pruned tree, we can see that Transit is the first variable to be split, at the value 1.0524. If Transit < 1.0524, the tree moves to the next most important variable: whether the White share of the community is above 48.3773%. If it is, the tree classifies the county as Donald Trump's. If it is not, the tree splits again on whether Unemployment is above or below 10.4482; below 10.4482 the county is classified as Trump's, otherwise as Clinton's.

The above captures the left side of our pruned tree. On the right side, Transit is again the most important variable after the first split. If Transit > 2.798287, the county is classified as Clinton's. If not, the tree checks the final remaining variable in the pruned tree, Minority: if Minority > 51.8182, the county is classified as Hillary Clinton's, and if it is less, it is Donald Trump's.

The splits tell a story about how different groups in the United States tend to vote. Counties where many people use transit are likely to vote for Hillary Clinton; likewise, communities made up mostly of ethnic minorities are likely to pick Clinton. On the other hand, communities that are majority White or have low unemployment are likely to pick Donald Trump. One explanation for why Transit is so important in determining candidate preference is that transit usage is probably strongly connected to income level, and from the earlier exercises we saw that ethnic minorities tend to have lower incomes than majority-White communities. From the pruned tree, we can see that in the 2016 election Donald Trump captured the hearts of predominantly White communities while Clinton captured the hearts of communities made up predominantly of ethnic minorities.
train.error test.error
tree 0.0798 0.0765
logistic NA NA
lasso NA NA

From the records table, we can see that the decision tree had a train.error of 0.0798 and a test.error of 0.0765. Since logistic regression is well suited to predicting binomial outcomes, we expect logistic regression to perform better than the decision tree in this scenario.

Run a logistic regression to predict the winning candidate in each county. Save training and test errors to records variable. What are the significant variables? Are they consistent with what you saw in decision tree analysis? Interpret the meaning of a couple of the significant coefficients in terms of a unit change in the variables.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -26.2060 9.4767 -2.7653 0.0057
Men 0.0691 0.0477 1.4488 0.1474
White -0.1832 0.0642 -2.8512 0.0044
Citizen 0.1399 0.0279 5.0159 0.0000
Income -0.0001 0.0000 -2.7630 0.0057
IncomeErr 0.0000 0.0001 0.0873 0.9304
IncomePerCap 0.0002 0.0001 3.5709 0.0004
IncomePerCapErr -0.0003 0.0001 -2.1068 0.0351
Poverty 0.0418 0.0405 1.0300 0.3030
ChildPoverty -0.0156 0.0247 -0.6321 0.5273
Professional 0.2886 0.0376 7.6839 0.0000
Service 0.3254 0.0473 6.8825 0.0000
Office 0.0787 0.0448 1.7552 0.0792
Production 0.1528 0.0411 3.7172 0.0002
Drive -0.1935 0.0469 -4.1222 0.0000
Carpool -0.1546 0.0589 -2.6267 0.0086
Transit 0.1097 0.0926 1.1842 0.2363
OtherTransp -0.0557 0.0954 -0.5837 0.5594
WorkAtHome -0.1737 0.0745 -2.3319 0.0197
MeanCommute 0.0431 0.0238 1.8094 0.0704
Employed 0.1987 0.0333 5.9745 0.0000
PrivateWork 0.1252 0.0210 5.9710 0.0000
SelfEmployed 0.0647 0.0461 1.4049 0.1601
FamilyWork -0.8523 0.3932 -2.1674 0.0302
Unemployment 0.2155 0.0397 5.4344 0.0000
Minority -0.0391 0.0617 -0.6334 0.5265

From the table above, we can pinpoint the significant variables in two ways: look for the largest absolute z values, or look for the smallest Pr(>|z|) values. The z value tests whether a coefficient is significantly different from zero, assuming an approximately normal sampling distribution for the estimate. Pr(>|z|) is the corresponding p-value, the level at which the test becomes significant; we want this value to be small, since it is the probability of seeing such a z value by chance alone. From the table, the following variables meet these criteria:

Citizen, IncomePerCap, Professional, Service, Production, Drive, Employed, PrivateWork, Unemployment

Compared to the decision tree analysis, Transit is not nearly as big a factor in deciding which votes go to which candidate. Furthermore, the logistic model places much more weight on variables such as Citizen, Employed, and Unemployment than the decision tree did.

If we take a variable such as Unemployment, we can see that it has a coefficient of 0.2155. This means that for every one-unit increase in Unemployment, the log odds of Hillary Clinton winning the county increase by 0.2155.

Another variable we can look at is PrivateWork, which has a coefficient of 0.1252: for each one-unit increase in PrivateWork, the log odds of Clinton winning increase by 0.1252.

We threshold the predicted probability at 0.50: if it is greater than 0.50, we predict a Hillary Clinton victory, and if it is less, we predict a Donald Trump victory.
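A sketch of that rule, assuming the logistic model is fitted on trn.cl and that the second factor level of candidate corresponds to Hillary Clinton, which is the level predict() returns probabilities for:

```r
glm.fit <- glm(candidate ~ ., data = trn.cl, family = binomial)

# Predicted probability of the second factor level (assumed: Hillary Clinton)
glm.prob <- predict(glm.fit, newdata = tst.cl, type = "response")
glm.pred <- ifelse(glm.prob > 0.5, "Hillary Clinton", "Donald Trump")

mean(glm.pred != tst.cl$candidate)   # test misclassification error
```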
train.error test.error
tree 0.0798 0.0765
logistic 0.0688 0.0651
lasso NA NA

We can see from the records table that logistic regression beats the decision tree on both train.error and test.error, so it is the better and more accurate model here.

You may notice that you get a warning glm.fit: fitted probabilities numerically 0 or 1 occurred. As we discussed in class, this is an indication that we have perfect separation (some linear combination of variables perfectly predicts the winner). This is usually a sign that we are overfitting. One way to control overfitting in logistic regression is through regularization. Use the cv.glmnet function from the glmnet library to run K-fold cross validation and select the best regularization parameter for the logistic regression with LASSO penalty. Reminder: set alpha=1 to run LASSO regression, set lambda = c(1, 5, 10, 50) * 1e-4 in cv.glmnet() function to set pre-defined candidate values for the tuning parameter λ. This is because the default candidate values of λ in cv.glmnet() is relatively too large for our dataset thus we use pre-defined candidate values. What is the optimal value of λ in cross validation? What are the non-zero coefficients in the LASSO regression for the optimal value of λ? How do they compare to the unpenalized logistic regression? Save training and test errors to the records variable.
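A sketch of the cross-validated LASSO fit, assuming trn.cl as before; cv.glmnet() expects a numeric design matrix, hence model.matrix():

```r
library(glmnet)

x.trn <- model.matrix(candidate ~ ., data = trn.cl)[, -1]  # drop intercept column
y.trn <- trn.cl$candidate

lasso.cv <- cv.glmnet(x.trn, y.trn, family = "binomial",
                      alpha = 1, lambda = c(1, 5, 10, 50) * 1e-4)

lasso.cv$lambda.min                      # optimal lambda (5e-04 here)
coef(lasso.cv, s = lasso.cv$lambda.min)  # non-zero coefficients at that lambda
```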

LASSO Regression Non-Zero Coefficients for the Optimal Lambda Value: 5e-04
(Intercept) -28.6032
Men 0.0461
White -0.1351
Citizen 0.1430
Income 0.0000
IncomeErr 0.0000
IncomePerCap 0.0002
IncomePerCapErr -0.0002
Poverty 0.0269
ChildPoverty -0.0008
Professional 0.2604
Service 0.2956
Office 0.0533
Production 0.1208
Drive -0.1630
Carpool -0.1257
Transit 0.1286
OtherTransp -0.0093
WorkAtHome -0.1265
MeanCommute 0.0284
Employed 0.1892
PrivateWork 0.1162
SelfEmployed 0.0384
FamilyWork -0.7028
Unemployment 0.2041

The optimal value of lambda in cross-validation for our model is 5e-04. The table above shows all of the non-zero coefficients in the LASSO regression at this optimal value of lambda.

The estimates of the non-zero coefficients from the LASSO regression are similar to those of the unpenalized logistic regression. We found a total of 24 non-zero coefficients at the optimal lambda; compared to the original coefficients, they break down as follows:

LASSO estimated slightly higher coefficients in these categories: Men, IncomePerCap, Poverty, Professional, Service, Office, Production, Employed, PrivateWork, SelfEmployed, Unemployment

LASSO estimated slightly lower coefficients in these categories: White, Citizen, IncomePerCapErr, ChildPoverty, Carpool, Transit, OtherTransp, WorkAtHome, FamilyWork

LASSO estimated little to no change in these categories: Income, IncomeErr

The only variable present in the original regression but dropped (set to zero) by LASSO is Minority.

Interestingly enough, every variable considered most important in the original regression, with the single exception of Citizen, is estimated higher under LASSO regression.

train.error test.error
tree 0.0798 0.0765
logistic 0.0688 0.0651
lasso 0.0676 0.0668

From our records table, it can be seen that LASSO regression scored the lowest train.error but failed to beat logistic regression's test.error. Both logistic regression and LASSO regression are comfortably superior to the decision tree model in both train.error and test.error.

Compute ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data. Display them on the same plot. Based on your classification results, discuss the pros and cons of the various methods. Are the different classifiers more appropriate for answering different kinds of questions about the election?
Decision Tree Logistic Regression Lasso Regression
AUC Values 0.8662 0.9316 0.9321
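A sketch of how the three curves can be overlaid with the ROCR package, assuming tree.prob, glm.prob, and lasso.prob hold each model's predicted probability of a Clinton win on the test set (names are assumptions):

```r
library(ROCR)

# prediction() treats the second factor level of the labels as the positive class
roc_curve <- function(prob, truth) {
  performance(prediction(prob, truth), measure = "tpr", x.measure = "fpr")
}

truth <- tst.cl$candidate
plot(roc_curve(tree.prob,  truth), col = "red")
plot(roc_curve(glm.prob,   truth), col = "blue",  add = TRUE)
plot(roc_curve(lasso.prob, truth), col = "green", add = TRUE)
legend("bottomright", c("Decision tree", "Logistic", "LASSO"),
       col = c("red", "blue", "green"), lty = 1)
```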

Decision trees are good at separating data into rectangular regions but often overfit, and election results do not partition neatly into rectangles, especially when the prediction reduces to a binary outcome. Logistic regression is well suited to binomial data with only two possible results, although if the classes separate along a clean linear boundary, LDA might do better. Since the election data comes down to a binary outcome of either Trump or Clinton, logistic regression proved a good choice here. LASSO is good for identifying significant variables, but it sets the coefficients of less relevant variables to zero and ignores them, which can discard information that would still help predict the results. In this case, LASSO retained most of the original variables from the logistic regression, so the results did not change drastically between the two.

In an election, a decision tree may be useful for predicting the possible outcomes of the primaries, where multiple candidates are on the ballot and the outcome is not binomial, so logistic regression would suffer greatly compared to a decision tree. For the final vote, however, logistic regression is a much more efficient and accurate model: the outcome is binomial, and thanks to the nature of its logit function it does a much better job of predicting the winner of the election than a decision tree.

Finally, LASSO regression is a good choice if we only want to measure the most important aspects of voters' preferences when choosing a candidate, since it removes predictors that are not significant enough. It is good for a broad overview of what will sway voters the most. However, it may not be the best choice if we are aiming for pinpoint accuracy in the election results, as it may remove too many variables and cause the predictions to be severely underestimated.

Interpret and discuss any overall insights gained in this analysis and possible explanations. Use any tools at your disposal to make your case: visualize errors on the map, discuss what does/doesn’t seems reasonable based on your understanding of these methods, propose possible directions (collecting additional data, domain knowledge, etc).

We picked: exploring additional classification methods: KNN, LDA, QDA, SVM, random forest, boosting, etc. (You may research and use methods beyond those covered in this course.) How do these compare to logistic regression and the tree method?

Goal: Compare methods above with other classification methods such as random forest, boosting, SVM and KNN. What are some common important variables across the models? Finally, which of these models work best for predicting the election results and why?

a) RandomForest

The first model we use is the random forest classification method. To use it, we first load library(randomForest) and train the model with randomForest(). Inspecting the fitted object, we see that the random forest used 500 trees and tried 5 variables at each split. The out-of-bag error was 6.23%, and the two most important variables were White and Transit, both of which were also considered very important in the regular decision tree. A random forest is similar to a decision tree, but it grows many trees on bootstrap samples of the data and aggregates their votes. One drawback of random forests is that they reduce variance rather than bias, hence the need for large, unpruned trees so that the bias stays as low as possible.
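A sketch of the random forest fit described above; the response must be a factor for classification, and mtry = 5 matches the "5 variables tried at each split" reported:

```r
library(randomForest)

rf.fit <- randomForest(candidate ~ ., data = trn.cl,
                       ntree = 500, mtry = 5, importance = TRUE)

rf.fit              # prints the OOB error estimate (~6.23% here)
varImpPlot(rf.fit)  # White and Transit stand out, as noted above
```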
train.error test.error
randomForest 8e-04 0.044
Boosting NA NA
SVM NA NA
KNN NA NA

As we can see, the random forest proves to be quite an effective model for our specific purpose, as both its training and test errors are relatively low.

b) Boosting
var            rel.inf
Transit         40.064
White           31.349
Minority         9.493
Professional     5.461
IncomePerCap     2.621
Men              1.738

The second of our models is classification by boosting. Boosting combines many weak learners into a strong learner, with each new tree learning from the mistakes of the previous ones. The table above shows that the boosting model follows a similar trend, with Transit and White as the most important variables. The partial dependence plots display the marginal effect of selected variables on the response after integrating out the other variables. This makes sense, since the method is similar to a random forest except that each new tree is built using information from the previous trees.
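A sketch of the boosting fit with gbm, recoding the response to 0/1 as the bernoulli loss requires; the tuning values shown are assumptions, not necessarily those used above:

```r
library(gbm)
library(dplyr)

trn.boost <- trn.cl %>%
  mutate(candidate = as.integer(candidate == "Hillary Clinton"))  # 1 = Clinton win

boost.fit <- gbm(candidate ~ ., data = trn.boost,
                 distribution = "bernoulli",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.01)

summary(boost.fit)  # relative influence table: Transit and White dominate
```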

train.error test.error
randomForest 8e-04 0.0440
Boosting 4e-04 0.0016
SVM NA NA
KNN NA NA

Compared to the pure random forest classification method, boosting scores slightly better, with a lower overall training error. The two methods are nonetheless extremely close, as expected, since the two approaches are similar in nature.

c) SVM

The SVM classification method works very differently from the two methods above. SVM does not use decision trees; instead, it tries to separate the possible outcomes by finding a clear boundary between them. Compared to the methods above, SVM is among the most difficult models to interpret because of its complex data transformations and the resulting boundary planes. Unlike the previous methods, it is not easy to tell which variables are most important when using SVM.
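A sketch of the SVM fit with the e1071 package; the radial kernel and cost value are assumptions:

```r
library(e1071)

svm.fit <- svm(candidate ~ ., data = trn.cl,
               kernel = "radial", cost = 1, probability = TRUE)

svm.pred <- predict(svm.fit, newdata = tst.cl)
table(predicted = svm.pred, actual = tst.cl$candidate)  # confusion matrix
mean(svm.pred != tst.cl$candidate)                      # test error
```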

Confusion Matrix for Testing Data Set
Donald Trump Hillary Clinton
Donald Trump 524 24
Hillary Clinton 8 58
Confusion Matrix for Training Data Set
Donald Trump Hillary Clinton
Donald Trump 2056 81
Hillary Clinton 18 301
train.error test.error
randomForest 0.0008 0.0440
Boosting 0.0004 0.0016
SVM 0.0403 0.0521
KNN NA NA

From the confusion matrices, we can see that SVM does not perform as well as the random forest or boosting methods at predicting which candidate will win.

d) KNN

K-nearest neighbors, abbreviated KNN, is a basic and intuitive classification method. Its advantages are that it can be used for both classification and regression and that it is very easy to apply to multi-class problems. There is also a variety of distance metrics to choose from besides Euclidean distance, allowing the model to be fine-tuned and made more flexible for the data at hand. However, KNN is a slow algorithm, and in higher dimensions it struggles to accurately capture the relationships between predictor variables. KNN also has no inherent way of dealing with missing values and is very sensitive to outliers, which often makes it the weakest of these classification methods when compared with more complex models.
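A sketch of the KNN fit with the class package; because KNN is distance-based, the predictors are scaled, and k = 10 is an assumed (not tuned) choice:

```r
library(class)
library(dplyr)

x.trn <- scale(select(trn.cl, -candidate))
x.tst <- scale(select(tst.cl, -candidate),
               center = attr(x.trn, "scaled:center"),
               scale  = attr(x.trn, "scaled:scale"))

knn.pred <- knn(train = x.trn, test = x.tst, cl = trn.cl$candidate, k = 10)

table(predicted = knn.pred, actual = tst.cl$candidate)  # confusion matrix
mean(knn.pred != tst.cl$candidate)                      # test error
```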

Confusion Matrix for Testing Data Set
Donald Trump Hillary Clinton
Donald Trump 507 61
Hillary Clinton 25 21
Confusion Matrix for Training Data Set
Donald Trump Hillary Clinton
Donald Trump 2022 274
Hillary Clinton 52 108
train.error test.error
randomForest 0.0008 0.0440
Boosting 0.0004 0.0016
SVM 0.0403 0.0521
KNN 0.1327 0.1401

From the final confusion matrices and the records table, we can see that KNN performs poorly, with both its test and training error rates much higher than those of the other classification methods. KNN does much better when the data forms well-separated, concentrated clusters, but for predicting the winner of the 2016 election between just two candidates, KNN clearly proves not to be a good model for this scenario.

Conclusion

When evaluating the overall results of the four models above, we can safely conclude that classification methods such as randomForest() or boosting provide the best results for accurately predicting the 2016 election candidate tallies. Important variables common across the models include White and Transit. Finally, we see that it is best to avoid the KNN classification method in this case, as its training and test error rates are significantly higher than those of the other classification methods.