The 2016 US Presidential Election came as a shock to nearly everyone when Donald Trump was declared the winner and set to assume the presidency for the next four years. Even skilled statistician Nate Silver ended up predicting the result incorrectly. Predicting election results is always challenging because many different factors can influence why a voter chooses, or does not choose, a specific candidate. This project aims to recreate some of the machine learning methods Nate Silver used in 2016, applying them to the actual election data to determine how accurately they predict the final results. The methods we used were Principal Component Analysis, Hierarchical Clustering, Decision Trees, Logistic Regression, and Lasso Regularization. In addition, we performed other classification methods such as K-Nearest Neighbors and Random Forest and explored the possibility of Simpson’s Paradox in the dataset used for these algorithms.
Some key variables with a large impact on winning the election in our decision tree were transit, county total, white, and unemployment. For logistic regression, the key variables were citizen, income per capita, professional, service, production, drive, employed, unemployment, and county total. Looking at the records table, logistic regression is the best algorithm, with the lowest test error of 0.0634 (6.34%) among the models. However, the ROC curve and area under the curve (AUC) suggest that the lasso logistic regression (0.9488) is slightly better than the plain logistic regression model (0.9482). Because AUC is a more informative metric for evaluating a classifier's performance, and because the lasso penalty resolves the problem of perfect separation, we prefer the lasso logistic regression to classify the results of the 2016 US Presidential Election.
The primary goal of this project is to predict, analyze, and gain a better understanding of voter behavior in the 2016 presidential election using two datasets, census and election, and machine learning algorithms. Individual voting behavior is complicated and dynamic, which is why it is interesting to analyze how and why voters choose a candidate or change their behavior. This is a central concern for political scientists and electoral candidates devising directions for their campaigns, and for ordinary voters it may help individuals better understand their own behavior. As aspiring data scientists, we are interested in any real-world experience we can get; machine learning is an interesting field, and this project has allowed us to see which domains we would like to work with in the future. The 2016 election was shocking nationwide, and even very skilled statisticians predicted the outcome incorrectly. How could numbers, analysis, and machine learning produce such a wrong prediction in 2016?
The census and election datasets were provided by the instructor of the course, as this is a class project. These datasets are relevant to our questions: the census data contains demographic, economic, and population information for each county, and the election data contains the candidates and vote counts for the counties and states of the United States in 2016.
Most variables in the dataset are self-explanatory; the more complicated variables are listed below.
The results of the 2016 presidential election came as a big surprise to many, and they are a good example that even with current technology we may not be able to accurately predict voting behavior. The writer and statistician Nate Silver, however, was able to accurately predict the results of the 2012 election state by state. This project uses his approaches as a reference in hopes of obtaining a similar or improved result.
What makes voter behavior prediction (and thus election forecasting) a hard problem?
What was unique to Nate Silver’s approach in 2012 that allowed him to achieve good predictions?
Nate Silver’s approach in 2012 was unique in that it included hierarchical modelling, an unsupervised approach that allows information to be shared across levels of the model. He considered multiple stages in the sampling process and acknowledged that voter behavior is dynamic over time. He used a time-series view to capture the variation in voting intention and the extent of its effect, since uncertainty often rises closer to election day. Nate Silver also used the distribution and statistics of his fitted model to generate different results per state, modeling each poll result as: actual percentage + house effects + sampling variation. Instead of only looking at the maximum probability, Silver’s approach used Bayes’ theorem and carried a full range of probabilities through the model that predicts changes in support for each candidate. Finally, he referenced previous election polling results and actual results to estimate possible bias and the extent to which support might deviate.
What went wrong in 2016? What do you think should be done to make future predictions better?
In 2016, many election forecasts were significantly off: the predictions favored Clinton while the actual winner was Donald Trump. Much of the data was collected through phone polls in which voters received calls from a recorded voice instead of a live person, which may have introduced changes in behavior and biases. Some Trump voters were distrustful of institutions and of poll calls, which contributed to the inaccuracy of the polling results. The polls also often failed to capture late deciders and those who had been supporting Gary Johnson.
In future predictions, a possible way to improve polling results is to use weighted sampling procedures to ensure more representative polls. Polling organizations may also consider using a variety of polling methods to encourage participation, including email, text, and web surveys rather than phone calls alone. Future predictions should also account for uncertainty to a greater extent, and the analysis should take into account effects that have an outsized influence on swing states.
Exploratory tools such as numerical summaries, plots, and principal component analysis are used to identify interesting patterns in the data and relationships between the variables. Principal component analysis (PCA) was used for dimension reduction, to extract interesting features and remove unwanted predictors from the dataset.
The machine learning models used in this project are hierarchical clustering, decision trees, logistic regression, and a lasso logistic regression model. The purpose of hierarchical clustering, an unsupervised learning algorithm, is to identify subgroups of the data and to understand what constitutes a grouping in the 2016 election across states. The predictive models, decision tree, logistic regression, and lasso logistic regression, are supervised learning models that identify important variables and predict the winning candidate for each county and state.
The election and census datasets were provided by the course instructor. One concern with using this data is that it limits the variables we can consider as influences on voter behavior to those listed in the dataset. The election dataset contains aggregated vote results for counties and states. The census dataset contains demographic, economic, and population information for people in different counties. Both datasets are aggregated at the county level, so they pose no harm to individuals, as they contain no private information about any individual respondent. Even so, the results of the analysis could pose harm to the public or the candidates, as extreme groups of voters who are unsatisfied with the projected winning candidate might protest or act out. The data also does not reflect voters' ages or levels of education.
State | County | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian |
---|---|---|---|---|---|---|---|---|---|
Alabama | Autauga | 1948 | 940 | 1008 | 0.9 | 87.4 | 7.7 | 0.3 | 0.6 |
Alabama | Autauga | 2156 | 1059 | 1097 | 0.8 | 40.4 | 53.3 | 0.0 | 2.3 |
Alabama | Autauga | 2968 | 1364 | 1604 | 0.0 | 74.5 | 18.6 | 0.5 | 1.4 |
Alabama | Autauga | 4423 | 2172 | 2251 | 10.5 | 82.8 | 3.7 | 1.6 | 0.0 |
Alabama | Autauga | 10763 | 4922 | 5841 | 0.7 | 68.5 | 24.8 | 0.0 | 3.8 |
Alabama | Autauga | 3851 | 1787 | 2064 | 13.1 | 72.9 | 11.9 | 0.0 | 0.0 |
county | fips | candidate | state | votes |
---|---|---|---|---|
NA | US | Donald Trump | US | 62984825 |
NA | US | Hillary Clinton | US | 65853516 |
NA | US | Gary Johnson | US | 4489221 |
NA | US | Jill Stein | US | 1429596 |
NA | US | Evan McMullin | US | 510002 |
NA | US | Darrell Castle | US | 186545 |
The dimension of the raw election data is 18345 observations of 5 variables after removing rows with fips = 2000. The column fips is a code that uniquely identifies the counties in the United States. Observations with fips code 2000 do not correspond to any county in the United States, so they are excluded from our dataset.
From the observations above, it is evident that the election data contains summary rows of votes for each candidate nationally as well as for each candidate by state. To answer our questions of interest, the summary rows are not needed and are removed. From the election dataset, we found that there are 32 presidential candidates in total in the 2016 election. A bar chart of the votes received by each candidate is used to visualize the more popular candidates. The variables county_winner and state_winner are created to identify the candidate with the highest proportion of votes in each county and state, as sketched below.
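A minimal sketch of this wrangling step, assuming the raw data frame is named election.raw and that summary rows have fips equal to "US" (national totals) or a missing county value (state totals):

```r
library(dplyr)

# Drop the unusable fips = 2000 rows, then split summary rows from county-level rows
election.raw     <- filter(election.raw, fips != "2000")
election_federal <- filter(election.raw, fips == "US")                  # national totals
election_state   <- filter(election.raw, is.na(county) & fips != "US")  # state totals
election         <- filter(election.raw, !is.na(county))                # county-level rows

# county_winner: the candidate with the highest share of votes in each county
# (state_winner is built the same way from election_state, grouping by state)
county_winner <- election %>%
  group_by(fips) %>%
  mutate(total = sum(votes), pct = votes / total) %>%
  slice_max(pct, n = 1) %>%
  ungroup()
```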
Using the top 100 counties by population, we created a bubble plot comparing poverty level and income per capita, with bubble size given by citizen percentage and each bubble colored by state.
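A sketch of how such a bubble plot could be built with ggplot2; the county-level table name county.df and the citizen-percentage calculation are assumptions:

```r
library(dplyr)
library(ggplot2)

# Hypothetical county-level table `county.df` with TotalPop, Citizen, IncomePerCap,
# Poverty, and State; keep the 100 most populous counties, express Citizen as a share
top100 <- county.df %>%
  arrange(desc(TotalPop)) %>%
  slice(1:100) %>%
  mutate(CitizenPct = 100 * Citizen / TotalPop)

ggplot(top100, aes(x = IncomePerCap, y = Poverty, size = CitizenPct, color = State)) +
  geom_point(alpha = 0.6) +
  labs(x = "Income per capita", y = "Poverty rate (%)", size = "Citizen (%)")
```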
Based on our observations, there is a strong negative correlation between poverty and income per capita. This makes sense intuitively: if the average income per capita in a county is low, the poverty rate tends to be higher, and vice versa. It is interesting that the citizen rate is generally lower in populations with higher income per capita and a low poverty rate. We would have to look at more resources and datasets covering non-citizen populations and demographics such as poverty rate and income per capita to see whether there is a clear relationship among all of these variables.
The current dataset contains missing data, so we clean it by removing rows with missing values. Variables such as Men, Employed, and Citizen are raw counts, which is inconsistent with the rest of the variables, so we convert these attributes to percentages. We also created a new Minority attribute, which combines the percentages of Hispanic, Black, Native, Asian, and Pacific, to reduce the number of variables in our dataset. Many columns are related, so we deleted columns belonging to groups of variables that add up to 100%, such as Women, Walk, PublicWork, and Construction, as well as the columns with percentages that make up Minority.
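A hedged sketch of this cleaning step; the object name census.del and the exact column list are assumptions, while the column names come from the raw census table:

```r
library(dplyr)

census.del <- census %>%
  na.omit() %>%                                        # drop rows with missing values
  mutate(Men      = 100 * Men / TotalPop,              # counts -> percentages
         Employed = 100 * Employed / TotalPop,
         Citizen  = 100 * Citizen / TotalPop,
         Minority = Hispanic + Black + Native + Asian + Pacific) %>%
  select(-Women, -Walk, -PublicWork, -Construction,    # drop one column per 100% group
         -Hispanic, -Black, -Native, -Asian, -Pacific) # groups folded into Minority
```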
A sub-county dataset is created by grouping on the State and County attributes; we compute CountyTotal and the weight corresponding to each sub-county block within the county. A county dataset is then created from the sub-county dataset by computing the weighted sum of the variables within each county. This weighting matters because the raw dataset does not capture the fact that the Electoral College gives different weights to votes cast in different states. That said, focusing on state population alone may not be the most useful way to determine relative weight, because it cannot capture the difference between the weights the Electoral College assigns to voters and the equal weight every voter receives in a popular vote, which weighs each vote according to total turnout.
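A sketch of the sub-county weights and the county-level weighted aggregation, assuming census.del is the cleaned table from the previous step; the aggregated column list is illustrative, not exhaustive:

```r
library(dplyr)

# Weight of each sub-county block = its share of the county population
census.subct <- census.del %>%
  group_by(State, County) %>%
  mutate(CountyTotal = sum(TotalPop),
         weight      = TotalPop / CountyTotal)

# County-level variables as weighted sums of the sub-county values
census.ct <- census.subct %>%
  summarize(across(c(Men, White, Citizen, Income, IncomePerCap, Poverty, Unemployment),
                   ~ sum(.x * weight)),
            CountyTotal = first(CountyTotal)) %>%
  ungroup()
```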
State | County | Men | White | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty |
---|---|---|---|---|---|---|---|---|---|
Alabama | Autauga | 48.43266 | 75.78823 | 73.74912 | 51696.29 | 7771.009 | 24974.50 | 3433.674 | 12.91231 |
Alabama | Baldwin | 48.84866 | 83.10262 | 75.69406 | 51074.36 | 8745.050 | 27316.84 | 3803.718 | 13.42423 |
Alabama | Barbour | 53.82816 | 46.23159 | 76.91222 | 32959.30 | 6031.065 | 16824.22 | 2430.189 | 26.50563 |
Alabama | Bibb | 53.41090 | 74.49989 | 77.39781 | 38886.63 | 5662.358 | 18430.99 | 3073.599 | 16.60375 |
Alabama | Blount | 49.40565 | 87.85385 | 73.37550 | 46237.97 | 8695.786 | 20532.27 | 2052.055 | 16.72152 |
Alabama | Bullock | 53.00618 | 22.19918 | 75.45420 | 33292.69 | 9000.345 | 17579.57 | 3110.645 | 24.50260 |
Variable | PC1 | PC2 |
---|---|---|
IncomePerCap | -0.3181199 | 0.1679083 |
Professional | -0.3064366 | 0.1416499 |
Poverty | 0.3046886 | 0.0514278 |
Variable | PC1 | PC2 |
---|---|---|
IncomePerCap | -0.3530767 | -0.1389017 |
ChildPoverty | 0.3421530 | -0.0405822 |
Poverty | 0.3405832 | -0.0560238 |
We conducted principal component analysis for both the sub-county data and the county data. We chose to scale the features because, without scaling, the ranges of the features differ dramatically, and we chose to center them so that biases in the original variables, such as differences in units, are removed. After this step all variables have the same standard deviation and the same weight. This process is known as standardization: the variables become unitless and have similar variance. Standardization is important because, without it, PCA would place most of the emphasis on the variables with the highest variances, making it harder to identify the right principal components. The three features with the largest absolute PC1 loadings for the sub-county data are IncomePerCap, Professional, and Poverty, while the three largest for the county data are IncomePerCap, ChildPoverty, and Poverty.
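A sketch of the PCA step for the county data, assuming census.ct holds the county-level features built above; prcomp performs the centering and scaling when asked:

```r
library(dplyr)

# Centered and scaled PCA on the numeric county features
ct.pc <- prcomp(select(ungroup(census.ct), -State, -County),
                center = TRUE, scale. = TRUE)

# PC1/PC2 loadings, ordered by the absolute size of the PC1 loading
loadings <- ct.pc$rotation[, 1:2]
head(loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ], 3)
```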
For the sub-county PC1, IncomePerCap and Professional have negative loadings while Poverty has a positive loading. This indicates that Poverty is positively correlated with PC1, where an increase in one corresponds to an increase in the other, and the positive sign gives the direction Poverty points along this single-dimension vector. The negative loadings for IncomePerCap and Professional indicate that these variables are negatively correlated with PC1, where an increase in one corresponds to a decrease in the other, and the sign gives the negative direction they point along the vector.
For the county PC1, IncomePerCap has a negative loading while ChildPoverty and Poverty have positive loadings. This indicates that ChildPoverty and Poverty are positively correlated with PC1, where an increase in one corresponds to an increase in the other, and the sign gives the positive direction they point along the single-dimension vector. IncomePerCap, on the other hand, is negatively correlated with PC1, so an increase in one corresponds to a decrease in the other, and its sign gives the negative direction it points along the vector.
The minimum number of PCs needed to capture 90% of the variance is 14 for the county data and 17 for the sub-county data.
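This cutoff can be read off the cumulative proportion of variance explained, for example:

```r
# Proportion of variance explained by each PC and the smallest number reaching 90%
pve <- ct.pc$sdev^2 / sum(ct.pc$sdev^2)
min(which(cumsum(pve) >= 0.90))   # 14 for the county data in our run
```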
One major distinction between the two maps is that the entire central region of California falls within Cluster 2 of the first hierarchical clustering but not within Cluster 1 of the second hierarchical clustering, which was run on the PCA scores. Similarly, in the first map most of Nevada is contained in Cluster 2, while in the second map several of those counties are not included in Cluster 1.
Because the central region of California and most of Nevada were contained in Cluster 2 of the first map but not in Cluster 1 of the second, the second clustering moves closer to the actual election results in each county (shown in question 10). Most of Nevada and Central California were classified as voting for Donald Trump in the map built from the actual data, while the Bay Area consistently voted for Hillary Clinton both in the actual election map and in the clusterings, falling in Cluster 2 of the first map and Cluster 1 of the second. From this, we believe the PCA approach put San Mateo County in the right cluster because it removed more of the wrongly grouped counties.
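A sketch of the two clusterings being compared; the complete linkage, the choice of 10 clusters, and the use of the first five principal components for the second run are assumptions:

```r
library(dplyr)

ct.features <- scale(select(ungroup(census.ct), -State, -County))

# Clustering 1: hierarchical clustering on the scaled county features
hc.ct <- hclust(dist(ct.features), method = "complete")
cl.ct <- cutree(hc.ct, k = 10)

# Clustering 2: the same procedure on the first five principal component scores
hc.pc <- hclust(dist(ct.pc$x[, 1:5]), method = "complete")
cl.pc <- cutree(hc.pc, k = 10)

# Compare the cluster assignment of a single county, e.g. San Mateo
cl.ct[census.ct$County == "San Mateo"]
cl.pc[census.ct$County == "San Mateo"]
```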
When the transit rate is less than 1.05 percent, if the percentage of white people in a county is greater than 48.377%, it is 92.72% likely that Donald Trump will win that county. Given that the percentage of white people in a county is less than 48.377%, if the unemployment rate is higher than 10.448%, it is 60.6% likely that Hillary Clinton will win that county. Given that the unemployment rate is less than 10.448%, if the percentage of white people in the county is greater than 23.425%, it is 73.6% likely that Donald Trump will win that county. When the transit rate is greater than 84.5%, if the total county population is over 243,088, there is a 50.9% chance that Hillary Clinton will win that county. Given that the county total is less than 243,088, if the percentage of white people in the county is greater than 92.156%, it is 67.7% likely that Donald Trump will win that county. Given that the percentage of white people in a county is less than 92.156%, if the county has an employment rate greater than 52.307%, it is 61.7% likely that Hillary Clinton will win that county. Given that the employment rate is less than 52.307%, if a county's population is more than 46.136% white, it is 68.3% likely that Donald Trump will win that county.
Model | train.error | test.error |
---|---|---|
unpruned | 0.0647394 | 0.0634146 |
pruned tree | 0.0655537 | 0.0650407 |
Using the pruned tree with the best size from cross-validation, we computed the training and test errors. Although the unpruned tree has slightly lower error rates, the differences are very small, so we choose the pruned tree because it is smaller and less complex.
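A sketch of the pruning step, assuming the tree package and the trn.cl training data that appears in the appendix; the test split name tst.cl is an assumption:

```r
library(tree)

# Fit a classification tree and choose its size by 10-fold cross-validation
tree.cl <- tree(candidate ~ ., data = trn.cl)
cv      <- cv.tree(tree.cl, FUN = prune.misclass, K = 10)
best    <- cv$size[which.min(cv$dev)]

tree.pruned <- prune.misclass(tree.cl, best = best)

# Training and test misclassification errors for the pruned tree
mean(predict(tree.pruned, trn.cl, type = "class") != trn.cl$candidate)
mean(predict(tree.pruned, tst.cl, type = "class") != tst.cl$candidate)
```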
Logistic regression was conducted to predict the winning candidate in each county. Its test error is slightly lower than that of the pruned tree, while its training error is higher; the differences are small, so both models are roughly equally capable of predicting the winning candidate. After fitting the logistic regression, we encountered the problem of complete separation, which occurs when some combination of the predictor variables perfectly separates the outcome variable candidate.
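A sketch of the logistic fit and its test error; the full summary is reproduced in the appendix, while the 0.5 threshold and the factor level coding below are assumptions:

```r
# Logistic regression on the training data (same call as in the appendix)
glm.fit <- glm(candidate ~ ., data = trn.cl, family = "binomial")

# Test error, assuming "Hillary Clinton" is the second factor level (coded as 1)
prob <- predict(glm.fit, tst.cl, type = "response")
pred <- ifelse(prob > 0.5, "Hillary Clinton", "Donald Trump")
mean(pred != tst.cl$candidate)
```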
Model | train.error | test.error |
---|---|---|
tree | 0.0655537 | 0.0650407 |
logistic | 0.0704397 | 0.0634146 |
Citizen, IncomePerCap, Professional, Service, Production, Drive, Employed, PrivateWork, and Unemployment are important predictors, as their significance codes fall between 0 and 0.001. This means the p-values for these variables are much smaller than the significance level, so we reject the null hypothesis that each of their coefficients is 0; we are therefore 99.9% confident that all the predictors listed are important in the logistic model. At the 99% confidence level, the Intercept, Carpool, and Income join the list of important predictors, and at the 95% confidence level Men, White, IncomePerCapErr, WorkAtHome, MeanCommute, and FamilyWork are added as well.
This is not consistent with the decision tree analysis: the first split of the decision tree is on the Transit variable, followed by White and CountyTotal, which are not significant variables in the logistic regression, except that White is significant at the 95% confidence level.
The problem of complete separation in logistic regression is usually a sign of overfitting. One way to control overfitting in logistic regression is through regularization; in this case, a lasso penalty was used to reduce the variance by “shrinking” the coefficients.
Model | train.error | test.error |
---|---|---|
tree | 0.0655537 | 0.0650407 |
logistic | 0.0704397 | 0.0634146 |
lasso | 0.0650407 | 0.0696254 |
The optimal value of λ is 0.0005, taking cv.lasso$lambda.min as the lasso penalty value. This small lambda reflects a preference for a more complex model. Using lambda.1se instead produces a simpler model than lambda.min; however, that model may be less accurate, because it assumes many of the predictors are not relevant for predicting the outcome.
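A sketch of the cross-validated lasso fit with glmnet; the lambda grid is an assumption, chosen so that 0.0005 is among the candidate values:

```r
library(glmnet)

# Model matrix (without the intercept column) and response from the training data
x.trn <- model.matrix(candidate ~ ., data = trn.cl)[, -1]
y.trn <- trn.cl$candidate

cv.lasso <- cv.glmnet(x.trn, y.trn, family = "binomial", alpha = 1,
                      lambda = seq(1, 50) * 1e-4)

cv.lasso$lambda.min                        # optimal penalty (0.0005 in our run)
coef(cv.lasso, s = cv.lasso$lambda.min)    # coefficients at lambda.min
```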
There are a total of 24 non-zero coefficients in the lasso regression at the optimal λ. They are Men, White, Citizen, Income, IncomeErr, IncomePerCap, IncomePerCapErr, Poverty, ChildPoverty, Professional, Service, Office, Production, Drive, Carpool, Transit, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, FamilyWork, Unemployment, and CountyTotal. The zero coefficients are SelfEmployed and Minority.
The lasso regression model has a slightly higher test error of 0.0696 compared to the decision tree (0.0650) and the logistic regression (0.0634).
Metric | Decision Tree | Logistic Regression | Lasso Logistic Regression |
---|---|---|---|
Area Under the Curve | 0.8420769 | 0.9482782 | 0.9488162 |
Based on the classification results, we see that decision trees are very simple to use but do not have the best accuracy. Since they have high variance and tend to overfit, small changes in the data can lead to a completely different tree, and this form of classification only works well if the data can be split into rectangular regions. Logistic regression is well suited to classifying between two values; in this case, we are classifying the election result for each county as either Hillary Clinton or Donald Trump. However, if the classes are completely separated, logistic regression runs into problems. Lasso regression is most useful when some predictors are redundant and can be removed; like other regularization methods, it tends to have lower variance and overfits less. Since it drops non-significant variables, however, we never learn how interesting or uninteresting those variables might be.
Based on the AUC calculations, the decision tree performs relatively poorly with an AUC of only 0.8421, while the logistic and lasso logistic regressions perform nearly identically with values of 0.9483 and 0.9488. We would choose the model whose AUC is closest to 1. Since the election data cannot easily be split into rectangular regions, the decision tree classifier was not the best choice for classifying election results.
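A sketch of the AUC computation with the ROCR package, reusing the fitted models from the earlier sketches; x.tst is an assumed test-set model matrix built the same way as x.trn, and the tree AUC would be computed analogously from its predicted class probabilities:

```r
library(ROCR)

# Helper: AUC from predicted probabilities and true labels
auc_of <- function(prob, truth) {
  pred <- prediction(as.numeric(prob), truth)
  performance(pred, measure = "auc")@y.values[[1]]
}

auc_of(predict(glm.fit, tst.cl, type = "response"), tst.cl$candidate)
auc_of(predict(cv.lasso, newx = x.tst, s = "lambda.min", type = "response"),
       tst.cl$candidate)
```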
In this project, we applied several different machine learning algorithms to the election and census data with the intention of finding the method that best classifies the actual results. FiveThirtyEight statistician Nate Silver achieved a milestone in 2012 by correctly predicting the presidential election, but in 2016 he mispredicted the result. After analyzing trends in voter behavior and reviewing Nate Silver's model, we discussed how, for future predictions, analysts should put more effort into improving polling techniques. Since these predictions are based largely on polling results, it is best to take into account all of the errors, uncertainties, and biases surrounding the different polling methods.
From the decision tree model, we found that key factors with an impact on the election include transit, county total, white, and unemployment. Other important predictors identified by the logistic regression model were citizen, income per capita, professional, service, production, drive, employed, and private work. Among these variables, service and professional have a greater impact on the probability of a candidate winning the election.
Looking at the records table, logistic regression has the lowest test error of 0.0634 (6.34%), compared to 0.0650 (6.50%) for the decision tree and 0.0696 (6.96%) for the lasso logistic regression. However, with logistic regression we encountered the problem of perfect separation, which may have arisen due to overfitting. This issue is corrected through regularization, which relates to the bias-variance tradeoff: the regularization penalty reduces variance in the model by shrinking the coefficients toward zero. Using the ROC curve to compute the area under the curve (AUC), we found the lasso logistic regression (0.9488) to perform slightly better than the logistic regression model (0.9482), since we prefer an AUC closer to 1. Because AUC is a more informative metric for evaluating a classifier's performance, and because the lasso penalty solves the problem of perfect separation, we prefer the lasso logistic regression to classify the results of the 2016 US Presidential Election. Since this model works well with our data, we may infer that our assumption that the election data is “sparse” holds.
The test error for the K-nearest neighbors algorithm with 24 neighbors is 0.1138.
We used knn.cv to perform KNN classification on the training set using LOOCV. After determining that the best value of k is 24, we predicted the election results on both the test and training data. After calculating the error rate, we see that the KNN test error of 0.1138 is much larger than the results for any of the other methods used prior to this exploration. This increase in error rate is likely due to our data being more linear than nonlinear. In addition, since KNN is nonparametric, it is subject to overfitting due to its high variance.
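A sketch of this step using the class package, reusing the x.trn / y.trn matrices from the lasso sketch and an analogous x.tst; the candidate grid of k values is an assumption:

```r
library(class)

# Leave-one-out CV error over a grid of k, then test error at the best k
k.grid <- c(1, 5, 10, 15, 20, 24, 30, 50)
cv.err <- sapply(k.grid, function(k) mean(knn.cv(x.trn, y.trn, k = k) != y.trn))
best.k <- k.grid[which.min(cv.err)]        # 24 in our run

pred.knn <- knn(train = x.trn, test = x.tst, cl = y.trn, k = best.k)
mean(pred.knn != tst.cl$candidate)         # test error (0.1138 in our run)
```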
The random forest, a bagging-style ensemble, is grown by considering 10 randomly chosen predictors at every split of each tree; this number is chosen because it is close to the best size for the pruned tree from cross-validation. A total of 500 trees are grown, and an out-of-bag error rate of 6.07% is obtained. The misclassification rate for counties won by Hillary Clinton is significantly higher (0.2572), more than 10 times the misclassification rate for counties won by Donald Trump (0.0246). This indicates a large number of counties that actually voted for Hillary Clinton but are classified as voting for Donald Trump, which may help explain why many mispredicted the outcome of the 2016 election.
The randomForest model is the best model compared to logistic regression and all the other models created so far: it has the lowest test error, 0.0504065, among the models considered. It also provides additional information about the misclassification rate for each candidate, which is useful for campaign workers, who can use it to decide which counties or swing states to focus their campaigns on.
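The exact fit is reproduced in the appendix; a sketch of it and of the test-error calculation, where the test set tst.cl is an assumption carried over from the earlier models:

```r
library(randomForest)

set.seed(1)  # seed value is illustrative
rf.cl <- randomForest(candidate ~ ., data = new.trn.cl, mtry = 10, importance = TRUE)

rf.cl$confusion                                   # OOB confusion matrix and class errors
mean(predict(rf.cl, tst.cl) != tst.cl$candidate)  # test error (0.0504 in our run)
```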
The correlation coefficient of percentage of Minority against Unemployment: 0.4266908
The correlation coefficient of percentage of Asians against Unemployment: -0.06173159
The correlation coefficient of percentage of Hispanic against Unemployment: 0.4438007
The correlation coefficient of percentage of Blacks against Unemployment: 0.3770913
The correlation coefficient of percentage of Native against Unemployment: 0.002003403
The correlation coefficient of percentage of Pacific against Unemployment: -0.109236
For the last exploratory part of this project, we wanted to see whether there was a sign of Simpson’s paradox when we combine the different racial groups into one large Minority attribute. We chose to look at the relationship between the minority percentage and the unemployment rate in counties because it showed the most promising linear relationship for an interesting analysis. After fitting a linear regression model with Minority as the predictor and Unemployment as the response, we get a correlation coefficient of 0.4267. Although this does not imply a strong linear relationship between the two variables, it was the most interesting combination we found.
To obtain a better visualization, we randomly sampled 200 counties and used that data for plotting and for calculating the correlation between each minority group and the unemployment rate. After doing so, we found that three of the five groups that make up the minority population, Asians, Pacific Islanders, and Native Americans, showed little to no correlation between their percentage of a county's population and the unemployment rate. Conversely, the two remaining minority groups, Hispanics and Blacks, had the highest correlations with unemployment, with correlation coefficients of 0.4438 and 0.3771 respectively. Another interesting observation from the plots is the general curvature of the Hispanic percentage versus unemployment rate: up to a Hispanic percentage of around 75% the plot is fairly flat, while beyond 75% it rises sharply, implying that a very large Hispanic population share is associated with a notable increase in the unemployment rate.
One explanation for this large difference between the correlation values is how much the Hispanic and Black percentages dominate the overall minority population. The Asian, Native American, and Pacific Islander populations are generally much smaller than the two dominant minority groups, so their low correlations may be due to a lack of variability in their population percentages. When all the groups are combined into the single Minority variable of the first scatterplot, the results from these three smaller groups are effectively drowned out. This is a sign of Simpson’s paradox: the relationships look different within the individual groups than they do when the groups are combined.
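A sketch of the group-wise check described above, assuming a county-level table (here called county.groups, a hypothetical name) that still carries the individual group percentages alongside Unemployment and Minority:

```r
set.seed(1)  # illustrative seed for the 200-county sample
samp <- county.groups[sample(nrow(county.groups), 200), ]

# Correlation for the combined Minority variable, then for each group separately
cor(samp$Minority, samp$Unemployment)
sapply(c("Hispanic", "Black", "Asian", "Native", "Pacific"),
       function(v) cor(samp[[v]], samp$Unemployment))
```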
Summary for the logistic regression computed in question 17
##
## Call:
## glm(formula = candidate ~ ., family = "binomial", data = trn.cl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9948 -0.2600 -0.1104 -0.0399 3.5372
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.487e+01 9.452e+00 -2.631 0.00850 **
## Men 9.589e-02 4.824e-02 1.988 0.04682 *
## White -1.689e-01 6.763e-02 -2.497 0.01253 *
## Citizen 1.302e-01 2.796e-02 4.657 3.20e-06 ***
## Income -8.708e-05 2.729e-05 -3.192 0.00142 **
## IncomeErr -3.326e-06 6.365e-05 -0.052 0.95833
## IncomePerCap 2.671e-04 6.699e-05 3.987 6.70e-05 ***
## IncomePerCapErr -3.605e-04 1.715e-04 -2.102 0.03560 *
## Poverty 4.741e-02 4.052e-02 1.170 0.24193
## ChildPoverty -1.582e-02 2.458e-02 -0.644 0.51971
## Professional 2.802e-01 3.870e-02 7.240 4.48e-13 ***
## Service 3.242e-01 4.758e-02 6.814 9.46e-12 ***
## Office 7.590e-02 4.465e-02 1.700 0.08917 .
## Production 1.668e-01 4.093e-02 4.076 4.58e-05 ***
## Drive -2.097e-01 4.617e-02 -4.542 5.58e-06 ***
## Carpool -1.736e-01 5.929e-02 -2.928 0.00342 **
## Transit 7.578e-02 9.424e-02 0.804 0.42134
## OtherTransp -6.258e-02 9.503e-02 -0.659 0.51021
## WorkAtHome -1.657e-01 7.198e-02 -2.302 0.02133 *
## MeanCommute 5.616e-02 2.422e-02 2.319 0.02039 *
## Employed 2.056e-01 3.354e-02 6.132 8.69e-10 ***
## PrivateWork 1.010e-01 2.189e-02 4.615 3.93e-06 ***
## SelfEmployed 1.968e-02 4.678e-02 0.421 0.67394
## FamilyWork -8.873e-01 3.765e-01 -2.357 0.01844 *
## Unemployment 2.073e-01 3.971e-02 5.219 1.80e-07 ***
## Minority -3.053e-02 6.509e-02 -0.469 0.63902
## CountyTotal 3.511e-07 4.299e-07 0.817 0.41405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2119.56 on 2455 degrees of freedom
## Residual deviance: 854.47 on 2429 degrees of freedom
## AIC: 908.47
##
## Number of Fisher Scoring iterations: 7
Coefficients for Lasso Regression computed in question 18
## 27 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.638933e+01
## Men 6.970393e-02
## White -1.295006e-01
## Citizen 1.354001e-01
## Income -6.118765e-05
## IncomeErr -1.300717e-05
## IncomePerCap 2.045610e-04
## IncomePerCapErr -2.547540e-04
## Poverty 3.454654e-02
## ChildPoverty -2.082829e-03
## Professional 2.527687e-01
## Service 2.922577e-01
## Office 5.289099e-02
## Production 1.356508e-01
## Drive -1.799921e-01
## Carpool -1.427167e-01
## Transit 9.333821e-02
## OtherTransp -2.124700e-02
## WorkAtHome -1.238035e-01
## MeanCommute 3.967374e-02
## Employed 1.941281e-01
## PrivateWork 9.284186e-02
## SelfEmployed .
## FamilyWork -7.515485e-01
## Unemployment 1.942268e-01
## Minority .
## CountyTotal 4.057991e-07
Random forest results computed for the exploratory analysis
##
## Call:
## randomForest(formula = candidate ~ ., data = new.trn.cl, mtry = 10, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 6.07%
## Confusion matrix:
## Donald Trump Hillary Clinton class.error
## Donald Trump 2024 51 0.02457831
## Hillary Clinton 98 283 0.25721785