The goal of this project is to investigate whether a voter’s characteristics correlate with how they rate the presidential candidates in the 2016 election. This project focuses on multiple linear regression. Our models will have at least three variables: one response and two or more predictors. The response variable will be quantitative, but the predictor variables can be quantitative or categorical.
First, I’ll investigate what the data set looks like and explore how we can visualize uncomplicated multiple linear regression equations. Next, we will manually build different models throughout this R guide. Then, we’ll dig deeper into the process of creating a multiple linear regression model and interpreting its output. We will build a model using both quantitative and categorical variables, discussing how categorical variables behave in a multiple linear regression setting. Finally, we will demonstrate why and how to include polynomial terms or transformations in a model and interpret the output.
The data set we will use for this project comes from the American National Election Studies (ANES); the data can be found on my GitHub. This is a cross-sectional data set from a survey conducted by ANES in 2016, containing 1,290 variables relating to the characteristics of the voters surveyed. Though the data set contains 1,290 different variables, we will only use the rating score bothCandidates, which is simply the rating of Trump minus the rating of Clinton. So, a score of \(< 0\) represents a preference for candidate Clinton while \(> 0\) represents a preference for candidate Trump.
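As a concrete illustration, here is a minimal sketch of how such a score can be constructed, where trump_rating and clinton_rating are placeholder names for the two 0–100 feeling-thermometer columns (not the raw ANES variable codes):
library(dplyr)
# Sketch: build the response as the Trump rating minus the Clinton rating.
# `trump_rating` and `clinton_rating` are placeholder column names.
survey <- survey %>%
  mutate(bothCandidates = as.numeric(trump_rating) - as.numeric(clinton_rating))
# bothCandidates < 0 -> preference for Clinton; > 0 -> preference for Trump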
The raw data has 1,290 variables in total, well more than one could reasonably work with. So, I narrowed the data down to some of the most interesting variables that can help reach the stated goal.
library(readr)
# Load data via a pipe-delimited file, reading all 1,290 columns as character
# The same file can also be pulled straight from GitHub:
# url <- "https://raw.githubusercontent.com/omerozeren/DATA606/master/Project/anes_timeseries_2016_rawdata.txt"
# rawData <- read_delim(url, "|",
#                       col_types = paste(rep("c", 1290), sep = "", collapse = ""))
rawData <- read_delim("anes_timeseries_2016_rawdata.txt", "|",
                      col_types = paste(rep("c", 1290), sep = "", collapse = ""))
First I looked at the distribution of ratings for the two candidates, since these make up the response variable we’re trying to explain.
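Summaries like the ones below can be produced with a sketch along these lines, converting from character to numeric since every column was read in as character (trump_rating and clinton_rating remain placeholder names):
cat("Trump:\n")
summary(as.numeric(survey$trump_rating))
cat("Clinton:\n")
summary(as.numeric(survey$clinton_rating))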
## Trump:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 30.00 36.95 70.00 100.00 41
## Clinton:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 2.00 40.00 42.15 70.00 100.00 38
Turning to the explanatory variables, I looked next at how the respondents’ voter characteristics are distributed:
The characteristics show a substantial spread across all 15 variables. For modeling purposes I will choose the variables with significantly more respondents, as shown in the graph.
Criteria for Comparing All Kinds of Models:
There are many types of criteria that we can use to determine whether one model of a set of data is better than another model of that same data. First, when comparing any two models we can use the Akaike Information Criterion (AIC) or Mallows’ Cp. When looking at AIC, we want to weigh the number of parameters in a model together with the AIC value. For example, if we have a model with an AIC value of X and 4 parameters, then to prefer a model with more than 4 parameters we would want that model’s AIC to be at least 10 units less than X. Generally, lower AIC values correspond to better models. One important note on these comparison tools is that they are unitless, so they can only compare models fit to the same data set, not models from different data sets.
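As a quick sketch, comparing two illustrative (non-nested) models by AIC in R, using the survey variables introduced later in this guide:
m1 <- lm(bothCandidates ~ high_school + global_warming, data = survey)
m2 <- lm(bothCandidates ~ heat_to_democ + health_ins, data = survey)
AIC(m1, m2)  # lower AIC is preferred; a gap of ~10 units is decisive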
For Comparing Nested Models:
If we are comparing nested models, i.e. one model’s predictors are a subset of the other model’s predictors, we have additional comparison options. For comparing nested models, we can use the p-values from t-tests or partial F-tests, only choosing the more complex model if the p-value of the t-test or F-test is below .05. We can also use adjusted R-squared as a comparison tool; however, I would discourage anyone from using adjusted R-squared as their comparison criterion because the other methods of comparison are more rigorous.
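A partial F-test for nested models is a one-liner with anova(); here is a sketch with illustrative formulas, where the smaller model’s predictors are a subset of the larger model’s:
small <- lm(bothCandidates ~ high_school, data = survey)
large <- lm(bothCandidates ~ high_school + global_warming, data = survey)
anova(small, large)  # keep `large` only if the Pr(>F) p-value is below .05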
Stepwise Selection:
Now, on to the topic of actually building the model. First, we can discuss the idea of building a model in a stepwise manner. There are two ways of creating models stepwise: forward selection and backwards elimination. In forward selection, you start with the simplest desired model and add predictors until your chosen criteria indicate that adding more predictors would actually worsen the model. In backwards elimination, you start with the most complex acceptable model and remove predictors until your chosen criteria indicate that removing more predictors would worsen the model.
By their nature, these models will always be nested so any of the comparison methods previously mentioned can be used.
Moving forward, I’ll use both the backwards elimination method and the forward selection method to create a model.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked _by_ '.GlobalEnv':
##
## survey
## The following object is masked from 'package:dplyr':
##
## select
# Drop rows with missing values so every model is fit to the same cases
survey <- na.omit(survey)
# Fit the full model containing every candidate predictor
full.model <- lm(bothCandidates ~ high_school + heat_to_democ + votePre +
                   global_warming + health_ins, data = survey)
step.model <- stepAIC(full.model, direction = "backward",
trace = FALSE)
step.model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming +
## health_ins
##
## Final Model:
## bothCandidates ~ high_school + heat_to_democ + global_warming
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1132 1445369 8307.875
## 2 - health_ins 1 1630.937 1133 1447000 8307.180
## 3 - votePre 1 1833.100 1134 1448833 8306.647
step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + global_warming,
## data = survey)
summary(step.model)
##
## Call:
## lm(formula = bothCandidates ~ high_school + heat_to_democ + global_warming,
## data = survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -122.792 -20.632 -0.388 20.933 124.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.827 7.264 2.454 0.01427 *
## high_schoolYes 7.424 2.807 2.645 0.00828 **
## high_schoolNo 1.299 5.303 0.245 0.80649
## heat_to_democ 0 51.898 4.085 12.705 < 2e-16 ***
## heat_to_democ 5 47.681 25.512 1.869 0.06188 .
## heat_to_democ 8 -20.193 36.327 -0.556 0.57841
## heat_to_democ 10 33.106 25.614 1.292 0.19646
## heat_to_democ 15 37.039 4.705 7.872 8.11e-15 ***
## heat_to_democ 20 26.683 36.037 0.740 0.45919
## heat_to_democ 25 56.255 35.902 1.567 0.11742
## heat_to_democ 30 15.068 5.127 2.939 0.00336 **
## heat_to_democ 40 1.856 5.085 0.365 0.71521
## heat_to_democ 50 -11.510 5.026 -2.290 0.02221 *
## heat_to_democ 60 -35.915 4.644 -7.733 2.30e-14 ***
## heat_to_democ 70 -53.560 4.683 -11.437 < 2e-16 ***
## heat_to_democ 75 -49.388 18.169 -2.718 0.00666 **
## heat_to_democ 80 -64.553 20.904 -3.088 0.00206 **
## heat_to_democ 85 -68.913 4.513 -15.270 < 2e-16 ***
## heat_to_democ 90 -97.457 25.504 -3.821 0.00014 ***
## heat_to_democ 95 -88.745 35.902 -2.472 0.01359 *
## heat_to_democ100 -83.357 5.356 -15.564 < 2e-16 ***
## heat_to_democ998 41.255 35.902 1.149 0.25076
## global_warmingYes -19.082 6.771 -2.818 0.00491 **
## global_warmingNo 3.067 7.204 0.426 0.67041
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.74 on 1134 degrees of freedom
## Multiple R-squared: 0.6493, Adjusted R-squared: 0.6422
## F-statistic: 91.28 on 23 and 1134 DF, p-value: < 2.2e-16
plot(step.model)
We can see from the output that our final model retains all of the chosen variables except votePre (whether the voter voted in the previous election) and health_ins (the voter’s health insurance status).
Next, I will use a forward selection method by changing direction to “forward”. Strictly speaking, forward selection starts from the simplest model we would deem acceptable and adds predictors from there; because the two procedures take different paths through the model space, forward selection can produce a different model than the backwards elimination method.
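For reference, a forward search that genuinely starts from the intercept-only model would look roughly like this sketch (the run below instead hands stepAIC the full model, so it has nothing left to add):
null.model <- lm(bothCandidates ~ 1, data = survey)
fwd.model <- stepAIC(null.model,
                     scope = list(lower = ~ 1,
                                  upper = ~ high_school + heat_to_democ +
                                    votePre + global_warming + health_ins),
                     direction = "forward", trace = FALSE)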
library(MASS)
# Drop rows with missing values (already done above; repeated for completeness)
survey <- na.omit(survey)
# Refit the full model; since stepAIC is again given the full model, with
# direction = "forward" there are no terms left to add
full.model <- lm(bothCandidates ~ high_school + heat_to_democ + votePre +
                   global_warming + health_ins, data = survey)
step.model <- stepAIC(full.model, direction = "forward",
trace = FALSE)
step.model$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming +
## health_ins
##
## Final Model:
## bothCandidates ~ high_school + heat_to_democ + votePre + global_warming +
## health_ins
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 1132 1445369 8307.875
step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre +
## global_warming + health_ins, data = survey)
summary(step.model)
##
## Call:
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre +
## global_warming + health_ins, data = survey)
##
## Residuals:
## Min 1Q Median 3Q Max
## -120.20 -20.16 -0.38 21.13 129.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.200 7.321 2.349 0.018978 *
## high_schoolYes 7.358 2.824 2.605 0.009294 **
## high_schoolNo 1.209 5.381 0.225 0.822251
## heat_to_democ 0 51.967 4.097 12.684 < 2e-16 ***
## heat_to_democ 5 50.167 25.558 1.963 0.049901 *
## heat_to_democ 8 -19.793 36.351 -0.544 0.586210
## heat_to_democ 10 31.905 25.620 1.245 0.213278
## heat_to_democ 15 36.787 4.707 7.816 1.24e-14 ***
## heat_to_democ 20 27.058 36.040 0.751 0.452946
## heat_to_democ 25 53.252 35.944 1.482 0.138742
## heat_to_democ 30 15.192 5.127 2.963 0.003109 **
## heat_to_democ 40 1.783 5.085 0.351 0.725871
## heat_to_democ 50 -11.895 5.033 -2.364 0.018269 *
## heat_to_democ 60 -36.030 4.645 -7.757 1.93e-14 ***
## heat_to_democ 70 -53.836 4.689 -11.482 < 2e-16 ***
## heat_to_democ 75 -48.910 18.180 -2.690 0.007242 **
## heat_to_democ 80 -63.807 20.911 -3.051 0.002331 **
## heat_to_democ 85 -68.806 4.513 -15.246 < 2e-16 ***
## heat_to_democ 90 -96.912 25.510 -3.799 0.000153 ***
## heat_to_democ 95 -88.233 35.898 -2.458 0.014125 *
## heat_to_democ100 -82.852 5.367 -15.436 < 2e-16 ***
## heat_to_democ998 38.252 35.944 1.064 0.287458
## votePreNo 3.516 2.495 1.409 0.159156
## global_warmingYes -18.968 6.773 -2.800 0.005190 **
## global_warmingNo 3.384 7.206 0.470 0.638784
## health_insNo -4.151 3.673 -1.130 0.258635
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.73 on 1132 degrees of freedom
## Multiple R-squared: 0.6501, Adjusted R-squared: 0.6424
## F-statistic: 84.14 on 25 and 1132 DF, p-value: < 2.2e-16
plot(step.model)
The first plot is the residual plot, a comparison of the residuals of our model against the fitted values produced by our model. It is the most important plot because it can reveal trends in the residuals, evidence of non-constant variability, and possible outliers. The plots for backwards elimination and forward selection indicate that our models systematically underpredict the lower candidate ratings and systematically overpredict the higher ones. Although the residuals do not seem to be evenly spread around 0 for all fitted values, the range of the residuals at each fitted value appears to be roughly the same, so we can conclude there is no evidence of heteroskedasticity. Finally, this plot suggests there are likely no outliers, because no points on the plot are well separated from the rest.
The next plot is the QQ-plot. Most of the points fall on the line, which indicates that our residuals come from a normal distribution, but some points stray from the line in the lower and upper quantiles. It is possible that those points do not come from a normal distribution; since most of our points do appear normally distributed, though, there is not much to worry about here.
The third plot created is the scale-location plot. This plot is similar to the residual plot, but uses the square root of the standardized residuals instead of the residuals themselves. This makes trends in residuals more evident and, from this plot, we can see that there is likely a U-shaped trend in our residuals.
From our summary, I can also obtain the final multiple linear regression equation. Comparing the backwards elimination and forward selection results, I will go with the model that has the slightly higher adjusted R-squared, which is the forward selection model (adjusted \(R^2 = 0.6424\)). So our final model equation will be:
step.model$call
## lm(formula = bothCandidates ~ high_school + heat_to_democ + votePre +
## global_warming + health_ins, data = survey)
step.model$coefficients
## (Intercept) high_schoolYes high_schoolNo heat_to_democ 0
## 17.200196 7.358330 1.209036 51.967112
## heat_to_democ 5 heat_to_democ 8 heat_to_democ 10 heat_to_democ 15
## 50.167469 -19.792756 31.905096 36.786716
## heat_to_democ 20 heat_to_democ 25 heat_to_democ 30 heat_to_democ 40
## 27.057950 53.251756 15.191769 1.783248
## heat_to_democ 50 heat_to_democ 60 heat_to_democ 70 heat_to_democ 75
## -11.895013 -36.030445 -53.836222 -48.909909
## heat_to_democ 80 heat_to_democ 85 heat_to_democ 90 heat_to_democ 95
## -63.806815 -68.806212 -96.911693 -88.232528
## heat_to_democ100 heat_to_democ998 votePreNo global_warmingYes
## -82.852403 38.251756 3.515716 -18.967668
## global_warmingNo health_insNo
## 3.383524 -4.151186
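Writing these coefficients out as a fitted equation, with \(\mathbb{1}[\cdot]\) denoting a 0/1 indicator variable and the remaining heat_to_democ dummy terms following the same pattern as the two shown:
\[
\begin{aligned}
\widehat{\text{bothCandidates}} ={}& 17.20 + 7.36\,\mathbb{1}[\text{high\_school}=\text{Yes}] + 1.21\,\mathbb{1}[\text{high\_school}=\text{No}] \\
&+ 51.97\,\mathbb{1}[\text{heat\_to\_democ}=0] + \cdots - 82.85\,\mathbb{1}[\text{heat\_to\_democ}=100] \\
&+ 3.52\,\mathbb{1}[\text{votePre}=\text{No}] - 18.97\,\mathbb{1}[\text{global\_warming}=\text{Yes}] \\
&+ 3.38\,\mathbb{1}[\text{global\_warming}=\text{No}] - 4.15\,\mathbb{1}[\text{health\_ins}=\text{No}]
\end{aligned}
\]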
We look at the model coefficients to understand what each input value means for the candidate score. As I mentioned earlier, my target variable is the rating score bothCandidates, which is simply the rating of Trump minus the rating of Clinton, so a score of \(< 0\) represents a preference for candidate Clinton while \(> 0\) represents a preference for candidate Trump. Coefficients with negative signs decrease the candidate score and coefficients with positive signs increase it; since \(> 0\) means a preference for Trump, positive coefficients favor Trump, and since \(< 0\) means a preference for Clinton, negative coefficients favor Clinton.
Let’s try to understand some of our input coefficients (a prediction sketch follows the list):
-The input variable ‘high_schoolYes’ has a positive coefficient (7.36), so an increase in the number of high school graduates favors Trump.
-The input variable ‘global_warmingYes’ has a negative coefficient (-18.97), so an increase in the number of voters who think global warming is happening favors Clinton.
-The input variable ‘votePreNo’ has a positive coefficient (3.52), so an increase in the number of voters who didn’t vote in the previous election favors Trump.
-The input variable ‘health_insNo’ has a negative coefficient (-4.15), so an increase in the number of voters without health insurance favors Clinton.
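As promised, a small sketch using predict() to see one coefficient in action: two hypothetical voters who differ only in high_school, with every other field copied from the first row of survey (the "Yes"/"No" level codes are assumed to match the coding in the data):
base <- survey[1, c("high_school", "heat_to_democ", "votePre",
                    "global_warming", "health_ins")]
v_yes <- base; v_yes$high_school <- "Yes"  # assumed level code
v_no <- base; v_no$high_school <- "No"     # assumed level code
predict(step.model, newdata = rbind(v_yes, v_no))
# the two predictions differ by 7.36 - 1.21, the gap between the two dummies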