Part 1 - Introduction

The goal of this project is to investigate whether a correlation exists between a voter’s characteristics and his or her ratings of the presidential candidates in the 2016 election. This project focuses on multiple linear regression. For multiple linear regression, we will have at least three variables in our models: one response and two or more predictors. The response variable will be quantitative, but the predictor variables can be quantitative or categorical.
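As a quick illustration of that structure, a minimal sketch using R’s built-in mtcars data (not the election data) fits a model with one quantitative and one categorical predictor:

# Toy example: quantitative response (mpg), one quantitative predictor (wt)
# and one categorical predictor (cyl, treated as a factor)
toy.model <- lm(mpg ~ wt + factor(cyl), data = mtcars)
summary(toy.model)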

First, I’ll investigate what the data set looks like and explore how we can visualize uncomplicated multiple linear regression equations. Next, we will manually build different models throughout this R guide. Then, we’ll dig deeper into the process of creating a multiple linear regression model and its interpretation. We will build a model using both quantitative and categorical variables, discussing how categorical variables are handled in a multiple linear regression setting. Finally, we will demonstrate why and how to include polynomial terms or transformations in a model and interpret the output.

Part 2 - Data

The data set we will use for this project comes from the American National Election Studies (ANES); the data can also be found on my GitHub. This is a cross-sectional data set from a survey conducted by ANES in 2016, containing 1,290 variables relating to the characteristics of the voters surveyed. Though the data set contains 1,290 different variables, our response will be the rating score bothCandidates, which is simply the rating of Trump minus the rating of Clinton. So, a score of \(< 0\) represents a preference for candidate Clinton while \(> 0\) represents a preference for candidate Trump.

Load Data

The raw data has 1,290 variables in total, well more than one could reasonably work with. So, I narrowed the data down to some of the most interesting variables that can help reach the stated goal.

# Load the pipe-delimited raw data, reading all 1,290 columns as character
library(readr)

# To read directly from GitHub instead of a local copy:
# url <- "https://raw.githubusercontent.com/omerozeren/DATA606/master/Project/anes_timeseries_2016_rawdata.txt"
# rawData <- read_delim(url, "|",
#                       col_types = paste(rep("c", 1290), sep = "", collapse = ""))

rawData <- read_delim("anes_timeseries_2016_rawdata.txt", "|",
                      col_types = paste(rep("c", 1290), sep = "", collapse = ""))

Feature Engineering

Data Statistics

First I looked at the distribution of rating scores for the two candidates, since these are the variables we’re trying to find correlations with.

## Trump:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00   30.00   36.95   70.00  100.00      41
## Clinton:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.00   40.00   42.15   70.00  100.00      38
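For reference, below is a minimal sketch of how the bothCandidates score could be derived from the raw thermometer ratings. The column names V161087 (Trump) and V161086 (Clinton) are assumptions on my part; the actual variable names should be checked against the ANES 2016 codebook.

library(dplyr)

# Assumed thermometer columns (verify against the ANES codebook);
# values outside 0-100 are ANES missing/refused codes and become NA
survey <- rawData %>%
  mutate(trump   = as.numeric(V161087),
         clinton = as.numeric(V161086),
         trump   = ifelse(trump   < 0 | trump   > 100, NA, trump),
         clinton = ifelse(clinton < 0 | clinton > 100, NA, clinton),
         bothCandidates = trump - clinton)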

Turning to the explanatory variables, I looked next at how the survey respondents are distributed across the voter characteristics:

The characteristics show a sizeable spread across all 15 variables. For modeling purposes, I will choose the variables with significantly more respondents, as shown in the graph.
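The recoding of these explanatory variables is not shown in this excerpt; the sketch below only illustrates the shape of that step. The RAW_* column names are placeholders, not the actual ANES variable names, and the recoding of answer codes is omitted.

# Placeholder sketch of assembling the modeling data frame; RAW_* names are
# hypothetical stand-ins for the actual ANES columns
survey <- survey %>%
  mutate(high_school    = RAW_EDUCATION_COL,
         heat_to_democ  = RAW_DEM_THERMOMETER_COL,
         votePre        = RAW_VOTED_BEFORE_COL,
         global_warming = RAW_GLOBAL_WARMING_COL,
         health_ins     = RAW_HEALTH_INS_COL) %>%
  select(bothCandidates, high_school, heat_to_democ, votePre,
         global_warming, health_ins)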

Part 4 - Inference

Multiple Linear Regression

Criteria for Comparing All Kinds of Models:

There are many criteria we can use to determine whether one model of a set of data is better than another model of that same set of data. First, when comparing any two models we can use the Akaike Information Criterion (AIC) or Mallows’ Cp. When looking at AIC, we want to weigh the number of parameters in a model and the AIC value together. For example, if we have a model with an AIC value of X and 4 parameters, then in order to prefer a model with more than 4 parameters, we would want the AIC value for that model to be at least 10 units less than X. Generally, we can think of lower AIC values as corresponding to better models. One important note on these comparison tools is that they have no units, so they can only compare models built from the same data set, not models from different data sets.
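As a quick, self-contained illustration of comparing AIC values (using R’s built-in mtcars data rather than the survey data):

# Two candidate models for the same data; the lower AIC suggests the better fit
m.small <- lm(mpg ~ wt, data = mtcars)
m.large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
AIC(m.small, m.large)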

For Comparing Nested Models:

If we are comparing nested models, i.e. one model’s predictors are a subset of the other model’s predictors, we have additional comparison options. For comparing nested models, we can use the p-values from t-tests or partial F-tests, only choosing the more complex model if the p-value of the t-test or F-test is below .05. We can also use adjusted R-squared as a comparison tool; however, I would discourage anyone from using adjusted R-squared as their comparison criteria because the other methods of comparison are more rigorous.
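For example, a partial F-test for two nested models can be run with anova(), again sketched on mtcars:

# reduced's predictors are a subset of full's, so a partial F-test applies;
# a p-value below .05 would favor the more complex model
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp, data = mtcars)
anova(reduced, full)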

Creating the model

Stepwise Selection:

Now, on to the topic of actually building the model. First, we can discuss the idea of building a model in a stepwise method. There are two ways of creating models stepwise, either forward selection or backwards elimination. In forward selection, you start with the simplest desired model and add predictors to the model until your chosen criteria indicate that adding more predictors to the model would actually worsen the model’s abilities. In backwards elimination, you start with the most complex model acceptable and remove predictors from the model until your chosen criteria indicate that removing more predictors from the model would worsen the model’s abilities.

By their nature, these models will always be nested so any of the comparison methods previously mentioned can be used.

Moving forward, I’ll use both the backwards elimination method and the forward selection method to create a model.

Backwards Elimination Model

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
## 
##     survey
## The following object is masked from 'package:dplyr':
## 
##     select
# Drop rows with missing values before fitting
survey <- na.omit(survey)
# Fit the full model with all five chosen predictors
full.model <- lm(bothCandidates ~ high_school + heat_to_democ + votePre +
                   global_warming + health_ins, data = survey)
# Backwards elimination on AIC
step.model <- stepAIC(full.model, direction = "backward",
                      trace = FALSE)
step.model$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming + 
##     health_ins
## 
## Final Model:
## bothCandidates ~ high_school + heat_to_democ + global_warming
## 
## 
##           Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                               1132    1445369 8307.875
## 2 - health_ins  1 1630.937      1133    1447000 8307.180
## 3    - votePre  1 1833.100      1134    1448833 8306.647
step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + global_warming, 
##     data = survey)
summary(step.model)
## 
## Call:
## lm(formula = bothCandidates ~ high_school + heat_to_democ + global_warming, 
##     data = survey)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -122.792  -20.632   -0.388   20.933  124.816 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         17.827      7.264   2.454  0.01427 *  
## high_schoolYes       7.424      2.807   2.645  0.00828 ** 
## high_schoolNo        1.299      5.303   0.245  0.80649    
## heat_to_democ  0    51.898      4.085  12.705  < 2e-16 ***
## heat_to_democ  5    47.681     25.512   1.869  0.06188 .  
## heat_to_democ  8   -20.193     36.327  -0.556  0.57841    
## heat_to_democ 10    33.106     25.614   1.292  0.19646    
## heat_to_democ 15    37.039      4.705   7.872 8.11e-15 ***
## heat_to_democ 20    26.683     36.037   0.740  0.45919    
## heat_to_democ 25    56.255     35.902   1.567  0.11742    
## heat_to_democ 30    15.068      5.127   2.939  0.00336 ** 
## heat_to_democ 40     1.856      5.085   0.365  0.71521    
## heat_to_democ 50   -11.510      5.026  -2.290  0.02221 *  
## heat_to_democ 60   -35.915      4.644  -7.733 2.30e-14 ***
## heat_to_democ 70   -53.560      4.683 -11.437  < 2e-16 ***
## heat_to_democ 75   -49.388     18.169  -2.718  0.00666 ** 
## heat_to_democ 80   -64.553     20.904  -3.088  0.00206 ** 
## heat_to_democ 85   -68.913      4.513 -15.270  < 2e-16 ***
## heat_to_democ 90   -97.457     25.504  -3.821  0.00014 ***
## heat_to_democ 95   -88.745     35.902  -2.472  0.01359 *  
## heat_to_democ100   -83.357      5.356 -15.564  < 2e-16 ***
## heat_to_democ998    41.255     35.902   1.149  0.25076    
## global_warmingYes  -19.082      6.771  -2.818  0.00491 ** 
## global_warmingNo     3.067      7.204   0.426  0.67041    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.74 on 1134 degrees of freedom
## Multiple R-squared:  0.6493, Adjusted R-squared:  0.6422 
## F-statistic: 91.28 on 23 and 1134 DF,  p-value: < 2.2e-16
plot(step.model)

We can see from the output that our final model contains all of the chosen variables except votePre (whether the voter voted in the previous election) and health_ins (the health insurance question).

Next, I will run a forward selection method by changing direction to “forward”. A forward search is normally specified by starting from the simplest model we would deem acceptable and adding predictors; because the code below keeps the full model as the starting point, stepAIC has nothing to add and the final model retains all five predictors. In general, forward selection can produce a different model than backwards elimination because each process takes a different path toward the optimal model. A sketch of the more conventional forward set-up follows for reference.
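This is a minimal sketch only, assuming the same survey data frame and the five candidate predictors; it is not the code whose output is shown below.

# Sketch: forward selection starting from the intercept-only model, with the
# five predictors supplied as the upper scope (shown for reference only)
null.model <- lm(bothCandidates ~ 1, data = survey)
forward.sketch <- stepAIC(null.model, direction = "forward",
                          scope = ~ high_school + heat_to_democ + votePre +
                                    global_warming + health_ins,
                          trace = FALSE)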

Forward Selection Model

library(MASS)
# Rows with missing values were already dropped above
survey <- na.omit(survey)
# Fit the full model again and run stepAIC with direction = "forward"
full.model <- lm(bothCandidates ~ high_school + heat_to_democ + votePre +
                   global_warming + health_ins, data = survey)
step.model <- stepAIC(full.model, direction = "forward",
                      trace = FALSE)
step.model$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming + 
##     health_ins
## 
## Final Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming + 
##     health_ins
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                       1132    1445369 8307.875
step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre + 
##     global_warming + health_ins, data = survey)
summary(step.model)
## 
## Call:
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre + 
##     global_warming + health_ins, data = survey)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -120.20  -20.16   -0.38   21.13  129.75 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         17.200      7.321   2.349 0.018978 *  
## high_schoolYes       7.358      2.824   2.605 0.009294 ** 
## high_schoolNo        1.209      5.381   0.225 0.822251    
## heat_to_democ  0    51.967      4.097  12.684  < 2e-16 ***
## heat_to_democ  5    50.167     25.558   1.963 0.049901 *  
## heat_to_democ  8   -19.793     36.351  -0.544 0.586210    
## heat_to_democ 10    31.905     25.620   1.245 0.213278    
## heat_to_democ 15    36.787      4.707   7.816 1.24e-14 ***
## heat_to_democ 20    27.058     36.040   0.751 0.452946    
## heat_to_democ 25    53.252     35.944   1.482 0.138742    
## heat_to_democ 30    15.192      5.127   2.963 0.003109 ** 
## heat_to_democ 40     1.783      5.085   0.351 0.725871    
## heat_to_democ 50   -11.895      5.033  -2.364 0.018269 *  
## heat_to_democ 60   -36.030      4.645  -7.757 1.93e-14 ***
## heat_to_democ 70   -53.836      4.689 -11.482  < 2e-16 ***
## heat_to_democ 75   -48.910     18.180  -2.690 0.007242 ** 
## heat_to_democ 80   -63.807     20.911  -3.051 0.002331 ** 
## heat_to_democ 85   -68.806      4.513 -15.246  < 2e-16 ***
## heat_to_democ 90   -96.912     25.510  -3.799 0.000153 ***
## heat_to_democ 95   -88.233     35.898  -2.458 0.014125 *  
## heat_to_democ100   -82.852      5.367 -15.436  < 2e-16 ***
## heat_to_democ998    38.252     35.944   1.064 0.287458    
## votePreNo            3.516      2.495   1.409 0.159156    
## global_warmingYes  -18.968      6.773  -2.800 0.005190 ** 
## global_warmingNo     3.384      7.206   0.470 0.638784    
## health_insNo        -4.151      3.673  -1.130 0.258635    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.73 on 1132 degrees of freedom
## Multiple R-squared:  0.6501, Adjusted R-squared:  0.6424 
## F-statistic: 84.14 on 25 and 1132 DF,  p-value: < 2.2e-16
plot(step.model)

The first plot is the residual plot, a comparison of the residuals of our model against the fitted values produced by our model. It is the most important plot because it can tell us about trends in the residuals, evidence of non-constant variability, and possible outliers. The plots for backwards elimination and forward selection indicate that our models systematically underpredict the lower candidate rating scores and systematically overpredict the higher ones. Although the residuals do not seem to be evenly spread around 0 for all fitted values, the range of the residuals at each fitted value appears to be roughly the same, so we can conclude there is no strong evidence of heteroskedasticity. Finally, this plot indicates that there are likely no outliers because there are no points on the plot well separated from the rest.

The next plot is the QQ-plot. Though most of the points seem to fall on the line which indicates that our residuals come from a normal distribution, there are some points that stray from the line in the lower and upper quantiles of the plot. It is possible that these points do not come from a normal distribution, but most of our points seem to come from a normal distribution so there is not a lot to worry about here.

The third plot created is the scale-location plot. This plot is similar to the residual plot, but uses the square root of the standardized residuals instead of the residuals themselves. This makes trends in residuals more evident and, from this plot, we can see that there is likely a U-shaped trend in our residuals.
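Each of these diagnostic plots can also be drawn individually by passing the which argument to plot():

plot(step.model, which = 1)  # residuals vs fitted values
plot(step.model, which = 2)  # normal Q-Q plot of the residuals
plot(step.model, which = 3)  # scale-location plot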

The Regression Equation

From our summary, I can also get the final multiple linear regression equation. Comparing the backward elimination and forward selection results, I will go with the model that has the slightly higher adjusted R-squared, which is the forward selection model (adjusted \(R^2 = 0.6424\)). So our final model equation will be:

step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre + 
##     global_warming + health_ins, data = survey)
step.model$coefficients
##       (Intercept)    high_schoolYes     high_schoolNo  heat_to_democ  0 
##         17.200196          7.358330          1.209036         51.967112 
##  heat_to_democ  5  heat_to_democ  8  heat_to_democ 10  heat_to_democ 15 
##         50.167469        -19.792756         31.905096         36.786716 
##  heat_to_democ 20  heat_to_democ 25  heat_to_democ 30  heat_to_democ 40 
##         27.057950         53.251756         15.191769          1.783248 
##  heat_to_democ 50  heat_to_democ 60  heat_to_democ 70  heat_to_democ 75 
##        -11.895013        -36.030445        -53.836222        -48.909909 
##  heat_to_democ 80  heat_to_democ 85  heat_to_democ 90  heat_to_democ 95 
##        -63.806815        -68.806212        -96.911693        -88.232528 
##  heat_to_democ100  heat_to_democ998         votePreNo global_warmingYes 
##        -82.852403         38.251756          3.515716        -18.967668 
##  global_warmingNo      health_insNo 
##          3.383524         -4.151186
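Written out with the coefficients rounded to two decimals, and with the nineteen heat_to_democ indicator terms abbreviated for brevity, the fitted equation is:

\[
\widehat{\text{bothCandidates}} = 17.20 + 7.36\,\text{high\_schoolYes} + 1.21\,\text{high\_schoolNo} + \cdots + 3.52\,\text{votePreNo} - 18.97\,\text{global\_warmingYes} + 3.38\,\text{global\_warmingNo} - 4.15\,\text{health\_insNo}
\]

where each term is an indicator equal to 1 when the respondent gave that answer and 0 otherwise.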

Part 5 - Conclusions

We look at the model coefficients to understand what each input variable means for the candidate score. As I mentioned earlier, my target variable is the rating score bothCandidates, which is simply the rating of Trump minus the rating of Clinton. So, a score of \(< 0\) represents a preference for candidate Clinton while \(> 0\) represents a preference for candidate Trump. Coefficients with a negative sign decrease the candidate score, and coefficients with a positive sign increase it. Since \(> 0\) represents a preference for Trump, positive coefficients indicate answers associated with favoring Trump, and since \(< 0\) represents a preference for Clinton, negative coefficients indicate answers associated with favoring Clinton.

Let’s try to understand some of our input coefficients:

- The input variable ‘high_schoolYes’ has a positive coefficient (7.36), so respondents who graduated from high school tend to favor Trump.

- The input variable “global_warmingYes” has a negative coefficient (-18.97), so respondents who think global warming is happening tend to favor Clinton.

- The input variable ‘votePreNo’ has a positive coefficient (3.52), so respondents who did not vote in the previous election tend to favor Trump.

- The input variable ‘health_insNo’ has a negative coefficient (-4.15), so respondents who answered “No” to the health insurance question tend to favor Clinton.
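As a small worked example combining these coefficients: relative to the baseline categories, a respondent who graduated from high school (high_schoolYes) and thinks global warming is happening (global_warmingYes) shifts the predicted score by

\[
7.36 - 18.97 \approx -11.61,
\]

a net lean toward Clinton, holding the heat_to_democ rating and the remaining answers fixed.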