Title: “The 2016 US Presidential Election: Modeling State-Level Margins of Victory”
Synopsis: In this project, I explore the 2016 presidential election outcomes on a state-by-state basis and model Clinton’s margin of victory (in popular votes) using factors such as unemployment, educational attainment, and the percentage of minorities in each state.
Skills: Data cleaning, merging data frames, linear regression, stepwise regression, F-statistics, data visualization, choroplethr, choroplethrMaps
The presidential election outcomes were scraped from the Wikipedia results-by-state table and saved into the following file:
https://en.wikipedia.org/wiki/United_States_presidential_election,_2016#Results_by_state
https://www.dropbox.com/s/jf5rd3lxtuklwgy/pres_election_16.csv?raw=1
Note that the Wikipedia table shows the number of popular votes, percentage of popular votes, and electoral votes garnered by the top 5 presidential candidates. The .csv file contains these values for Clinton and Trump only. The variables contained are:
- state: state (or congressional district) name
- method: ‘WTA’ stands for winner-takes-all (electoral votes allotted to the state are given to the candidate winning the popular vote in that state); ‘CD’ stands for congressional district (electoral votes for a district in the state are given to the candidate winning the popular vote in the district)
- clinton_ct, trump_ct: raw number of popular votes won by the candidate
- clinton_pct, trump_pct: popular votes won by the candidate as a percentage of total votes cast amongst all candidates
- clinton_elec, trump_elec: number of electoral votes won by the candidate
Below, I recompute clinton_pct and trump_pct to reflect the proportion of votes won by Clinton and Trump among voters who chose either Clinton or Trump and not a third-party candidate.
Loading Data
rm(list = ls())
setwd("/Users/Roberta/Desktop/Data Analysis & Exploration/HW7")
list.files()
## [1] "2016 US Presidential Election FINAL.Rmd"
## [2] "2016_US_Presidential_Election_FINAL.Rmd"
## [3] "HW7.html"
## [4] "HW7.pdf"
## [5] "HW7.Rmd"
## [6] "pres_election_16.csv"
## [7] "state_data.csv"
x <- read.csv("pres_election_16.csv", as.is = TRUE)
Modifying clinton_pct and trump_pct
head(x)
## state method clinton_ct clinton_pct clinton_elec trump_ct
## 1 Alabama WTA 729,547 34.36% – 1,318,255
## 2 Alaska WTA 116,454 36.55% – 163,387
## 3 Arizona WTA 1,161,167 45.13% – 1,252,401
## 4 Arkansas WTA 380,494 33.65% – 684,872
## 5 California WTA 8,753,788 61.73% 55 4,483,810
## 6 Colorado WTA 1,338,870 48.16% 9 1,202,484
## trump_pct trump_elec
## 1 62.08% 9
## 2 51.28% 3
## 3 48.67% 11
## 4 60.57% 6
## 5 31.62% –
## 6 43.25% –
#Cleaning up the Maine and Nebraska at-large labels, which cause matching problems later
x$state <- gsub(" (at-lg)", "", x$state, fixed=TRUE)
#The district rows for Maine (1st, 2nd) and Nebraska (1st, 2nd, 3rd) and the US total row
#never match a state name, so the state-level merges later in the analysis drop them
#Cleaning up clinton_ct
clinton_ct_temp <- as.numeric(gsub(",", "", x$clinton_ct)) #strip thousands separators
any(is.na(clinton_ct_temp)) #confirm every entry parsed as a number
## [1] FALSE
x$clinton_ct <- clinton_ct_temp
#Cleaning up trump_ct
trump_ct_temp <- as.numeric(gsub(",", "", x$trump_ct)) #strip thousands separators
any(is.na(trump_ct_temp)) #confirm every entry parsed as a number
## [1] FALSE
x$trump_ct <- trump_ct_temp
#Creating a total count
x$total_ct <- x$clinton_ct + x$trump_ct
#Recomputing clinton_pct as Clinton's share of the two-party vote
x$clinton_pct <- x$clinton_ct/x$total_ct
#Recomputing trump_pct as Trump's share of the two-party vote
x$trump_pct <- x$trump_ct/x$total_ct
head(x)
## state method clinton_ct clinton_pct clinton_elec trump_ct trump_pct
## 1 Alabama WTA 729547 0.3562586 – 1318255 0.6437414
## 2 Alaska WTA 116454 0.4161435 – 163387 0.5838565
## 3 Arizona WTA 1161167 0.4810998 – 1252401 0.5189002
## 4 Arkansas WTA 380494 0.3571486 – 684872 0.6428514
## 5 California WTA 8753788 0.6612822 55 4483810 0.3387178
## 6 Colorado WTA 1338870 0.5268333 9 1202484 0.4731667
## trump_elec total_ct
## 1 9 2047802
## 2 3 279841
## 3 11 2413568
## 4 6 1065366
## 5 – 13237598
## 6 – 2541354
Next, I drop clinton_elec and trump_elec from the data frame, as I want one row per state (plus DC) in order to study the margin of popular-vote victory/loss per state.
x <- subset(x, select = -c(clinton_elec, trump_elec)) #dropping clinton_elec and trump_elec
head(x)
## state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1 Alabama WTA 729547 0.3562586 1318255 0.6437414 2047802
## 2 Alaska WTA 116454 0.4161435 163387 0.5838565 279841
## 3 Arizona WTA 1161167 0.4810998 1252401 0.5189002 2413568
## 4 Arkansas WTA 380494 0.3571486 684872 0.6428514 1065366
## 5 California WTA 8753788 0.6612822 4483810 0.3387178 13237598
## 6 Colorado WTA 1338870 0.5268333 1202484 0.4731667 2541354
I then add the margin-of-victory variable, diff, to the data frame: the percentage of votes won by Clinton minus the percentage of votes won by Trump. A numeric summary and a histogram give a first look at the margin of victory.
x$diff <- x$clinton_pct - x$trump_pct
summary(x$diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.57865 -0.22223 -0.03780 -0.04938 0.11997 0.91390
hist(x$diff)
The histogram is roughly clustered around 0, suggesting an approximately normal distribution, though there is a single large positive outlier.
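That outlier can be identified directly with a quick check (at this point x$state still carries its original capitalization):
#Find the state with the largest pro-Clinton margin
x$state[which.max(x$diff)]
#This is the District of Columbia, whose diff is roughly 0.91 (the Max. in the summary)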
I then use the choroplethr and choroplethrMaps packages to display this diff variable on a map of the United States.
#install.packages("choroplethr")
library(choroplethr)
## Warning: package 'choroplethr' was built under R version 3.4.2
## Loading required package: acs
## Warning: package 'acs' was built under R version 3.4.2
## Loading required package: stringr
## Loading required package: XML
##
## Attaching package: 'acs'
## The following object is masked from 'package:base':
##
## apply
#install.packages("choroplethrMaps")
library(choroplethrMaps)
## Warning: package 'choroplethrMaps' was built under R version 3.4.2
data(df_pop_state)
head(df_pop_state)
## region value
## 1 alabama 4777326
## 2 alaska 711139
## 3 arizona 6410979
## 4 arkansas 2916372
## 5 california 37325068
## 6 colorado 5042853
state_choropleth(df_pop_state,
title = "US 2012 State Population Estimate",
legend = "Population") #This checks out
diff_by_state <- df_pop_state
x$state <- tolower(x$state)
setdiff(x$state, diff_by_state$region)
## [1] "maine, 1st" "maine, 2nd" "nebraska, 1st" "nebraska, 2nd"
## [5] "nebraska, 3rd" "u.s. total"
#Merge the diff scores into the new data frame, matching its region column with the state column of x
diff_by_state <- merge(diff_by_state, x[,c("state", "diff")], by.x = "region", by.y = "state")
#choroplethr expects the plotted variable in a column named 'value'
diff_by_state$value <- diff_by_state$diff
diff_by_state$diff <- NULL
Displaying the diff variable on a map of the United States
state_choropleth(diff_by_state,
title = "Difference in votes attained in 1016 US National Elections",
legend = "Difference (clinton_ct - trump_ct)")
I then modelled the margin-of-victory variable diff against possible predictors that might be associated with it. The possible predictors were drawn from recent Census/Bureau of Labor Statistics data:
https://www.dropbox.com/s/b3ehlep6yqvnl81/state_data.csv?raw=1
- unempl_16: average unemployment rate (2016)
- hs_pct_15: percentage of the population that has completed high school (2015)
- bac_pct_15: percentage of the population that has a Bachelor’s degree (2015)
- adv_pct_15: percentage of the population that has an advanced degree (2015)
- pct_minority_14: percentage of the population that is non-white (2014)
(Note: there are other variables that would do a better job of describing diff, but these were readily available. Inevitably, models built from them almost certainly oversimplify the story of voting patterns in the United States.)
Importing Indicator Data set
list.files()
## [1] "2016 US Presidential Election FINAL.Rmd"
## [2] "2016_US_Presidential_Election_FINAL.Rmd"
## [3] "2016_US_Presidential_Election_FINAL_files"
## [4] "HW7.html"
## [5] "HW7.pdf"
## [6] "HW7.Rmd"
## [7] "pres_election_16.csv"
## [8] "state_data.csv"
y <- read.csv("state_data.csv", as.is = TRUE)
head(y)
## State unempl_16 hs_pct_15 bac_pct_15 adv_pct_15 pct_minority_14
## 1 Alabama 6.0 84.3 23.5 8.7 33.8
## 2 Alaska 6.6 92.1 28.0 10.1 38.1
## 3 Arizona 5.3 86.0 27.5 10.2 43.8
## 4 Arkansas 4.0 84.8 21.1 7.5 26.6
## 5 California 5.4 81.8 31.4 11.6 61.5
## 6 Colorado 3.3 90.7 38.1 14.0 31.0
head(x)
## state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1 alabama WTA 729547 0.3562586 1318255 0.6437414 2047802
## 2 alaska WTA 116454 0.4161435 163387 0.5838565 279841
## 3 arizona WTA 1161167 0.4810998 1252401 0.5189002 2413568
## 4 arkansas WTA 380494 0.3571486 684872 0.6428514 1065366
## 5 california WTA 8753788 0.6612822 4483810 0.3387178 13237598
## 6 colorado WTA 1338870 0.5268333 1202484 0.4731667 2541354
## diff
## 1 -0.28748287
## 2 -0.16771309
## 3 -0.03780047
## 4 -0.28570275
## 5 0.32256441
## 6 0.05366667
#Merging the new variables into the data frame
y$State <- tolower(y$State)
z <- merge(x, y, by.x = "state", by.y = "State")
head(z)
## state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1 alabama WTA 729547 0.3562586 1318255 0.6437414 2047802
## 2 alaska WTA 116454 0.4161435 163387 0.5838565 279841
## 3 arizona WTA 1161167 0.4810998 1252401 0.5189002 2413568
## 4 arkansas WTA 380494 0.3571486 684872 0.6428514 1065366
## 5 california WTA 8753788 0.6612822 4483810 0.3387178 13237598
## 6 colorado WTA 1338870 0.5268333 1202484 0.4731667 2541354
## diff unempl_16 hs_pct_15 bac_pct_15 adv_pct_15 pct_minority_14
## 1 -0.28748287 6.0 84.3 23.5 8.7 33.8
## 2 -0.16771309 6.6 92.1 28.0 10.1 38.1
## 3 -0.03780047 5.3 86.0 27.5 10.2 43.8
## 4 -0.28570275 4.0 84.8 21.1 7.5 26.6
## 5 0.32256441 5.4 81.8 31.4 11.6 61.5
## 6 0.05366667 3.3 90.7 38.1 14.0 31.0
any(is.na(z)) #check for missing values after the merge
## [1] FALSE
dim(z)
## [1] 51 13
I then model diff using forward stepwise regression, starting from an intercept-only model. (The optional flag trace=0 within the step() call would suppress the incremental output; I leave it on here so each step is visible.) A summary of the final model follows the trace.
#Stepwise regression - iteratively adding variables to a linear regression model
m0 <- lm(diff ~ 1, data=z)
mfull <- lm(diff ~ unempl_16 + hs_pct_15 + bac_pct_15 + adv_pct_15 + pct_minority_14, data=z)
mstepf <- step(m0, scope = list(upper=mfull), direction = "forward")
## Start: AIC=-137.96
## diff ~ 1
##
## Df Sum of Sq RSS AIC
## + adv_pct_15 1 2.23925 1.0399 -194.53
## + bac_pct_15 1 2.08397 1.1952 -187.43
## + pct_minority_14 1 0.89552 2.3836 -152.22
## <none> 3.2791 -137.96
## + unempl_16 1 0.02347 3.2557 -136.32
## + hs_pct_15 1 0.01118 3.2680 -136.13
##
## Step: AIC=-194.53
## diff ~ adv_pct_15
##
## Df Sum of Sq RSS AIC
## + pct_minority_14 1 0.237767 0.80212 -205.77
## <none> 1.03989 -194.53
## + hs_pct_15 1 0.032591 1.00730 -194.15
## + bac_pct_15 1 0.017687 1.02220 -193.40
## + unempl_16 1 0.005950 1.03394 -192.82
##
## Step: AIC=-205.77
## diff ~ adv_pct_15 + pct_minority_14
##
## Df Sum of Sq RSS AIC
## + bac_pct_15 1 0.056949 0.74517 -207.52
## <none> 0.80212 -205.77
## + unempl_16 1 0.018988 0.78313 -204.99
## + hs_pct_15 1 0.017435 0.78469 -204.89
##
## Step: AIC=-207.52
## diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15
##
## Df Sum of Sq RSS AIC
## <none> 0.74517 -207.52
## + hs_pct_15 1 0.00085080 0.74432 -205.58
## + unempl_16 1 0.00027484 0.74490 -205.54
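As a cross-check, backward elimination from the full model can be run with trace = 0 to suppress the step-by-step output (a sketch; it may or may not select the same three predictors):
#Backward elimination starting from the full model, with the trace suppressed
mstepb <- step(mfull, direction = "backward", trace = 0)
formula(mstepb)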
Summary of the final model
summary(mstepf)
##
## Call:
## lm(formula = diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15,
## data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.276510 -0.099923 -0.007753 0.098294 0.283421
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.937127 0.127505 -7.350 2.42e-09 ***
## adv_pct_15 0.024961 0.013427 1.859 0.069283 .
## pct_minority_14 0.004990 0.001194 4.180 0.000126 ***
## bac_pct_15 0.015779 0.008326 1.895 0.064222 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1259 on 47 degrees of freedom
## Multiple R-squared: 0.7728, Adjusted R-squared: 0.7582
## F-statistic: 53.27 on 3 and 47 DF, p-value: 3.708e-15
The F-statistic and its associated p-value in the final model are \(F = 53.27\) on 3 and 47 degrees of freedom, with \(p = 3.708 \times 10^{-15}\).
The F-statistic tests the overall significance of the model: \[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\] \[H_a: \text{at least one } \beta_j \neq 0\] Given the extremely small p-value of the F-statistic, \(3.708 \times 10^{-15}\), we can reject the null hypothesis and conclude that this set of predictors fits significantly better than an intercept-only model.
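This overall F-test can be reproduced as a nested comparison against the intercept-only model, or by hand from the residual sums of squares shown in the step trace (a sketch reusing m0 and mstepf from above):
#Nested comparison: intercept-only model vs. the selected three-predictor model
anova(m0, mstepf)
#By hand: RSS falls from ~3.2791 (null) to ~0.74517 (final), using 3 predictors, 47 df left
F_stat <- ((3.2791 - 0.74517)/3) / (0.74517/47)
F_stat #about 53.27, matching the model summary
pf(F_stat, 3, 47, lower.tail = FALSE) #about 3.7e-15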
The predictor with the largest p-value in our previous model is ‘adv_pct_15’, with a p-value of 0.069283.
m2 <- lm(diff ~ pct_minority_14 + bac_pct_15, data=z) #Model 1 in our results below
anova(m2, mstepf) #Model 2 below
## Analysis of Variance Table
##
## Model 1: diff ~ pct_minority_14 + bac_pct_15
## Model 2: diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 48 0.79997
## 2 47 0.74517 1 0.054797 3.4562 0.06928 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Running the anova function performs a nested F-test, which determines whether the extra predictor(s) in the larger model (mstepf) significantly improve the fit. The only extra predictor here is adv_pct_15, and the test's p-value of 0.06928 matches the t-test p-value for adv_pct_15 in the model summary, as it must when the models differ by a single term. The addition of this predictor is therefore not statistically significant at the 0.05 level.
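The equivalence can be verified directly: for models differing by one term, the nested F statistic equals the square of that term's t statistic:
#t = 1.859 for adv_pct_15, and 1.859^2 is about 3.456, the anova F above
summary(mstepf)$coefficients["adv_pct_15", "t value"]^2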
summary(m2)
##
## Call:
## lm(formula = diff ~ pct_minority_14 + bac_pct_15, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.274811 -0.094176 0.006248 0.102881 0.285250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.105670 0.091922 -12.028 4.29e-16 ***
## pct_minority_14 0.005671 0.001165 4.870 1.26e-05 ***
## bac_pct_15 0.030202 0.003098 9.748 5.80e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1291 on 48 degrees of freedom
## Multiple R-squared: 0.756, Adjusted R-squared: 0.7459
## F-statistic: 74.38 on 2 and 48 DF, p-value: 1.975e-15
The coefficients from m2, both of which are statistically significant at \(\alpha = 0.05\), indicate that the percentage of minorities and the percentage of individuals with Bachelor’s degrees are positively associated with the diff variable. That is, states with larger minority populations and more Bachelor’s degree holders tended to vote for Clinton by wider margins.
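For intuition, here is a short sketch of what m2 implies for a hypothetical state; the predictor values below are invented purely for illustration:
#Hypothetical state: 40% minority population, 30% Bachelor's attainment
newstate <- data.frame(pct_minority_14 = 40, bac_pct_15 = 30)
predict(m2, newdata = newstate)
#-1.1057 + 0.005671*40 + 0.030202*30 is about 0.027, a slight Clinton lead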
I now examine the model assumptions and check for unusual points in m2.
plot(resid(m2) ~ fitted(m2))
hist(resid(m2))
In the residuals-vs-fitted plot there is one clear outlier with a fitted value above 0.8, while all other fitted values are below 0.4. Otherwise, the vertical spread of the plot is consistent from left to right, indicating roughly equal variance. The histogram of the residuals appears approximately normal.
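A normal Q-Q plot is a standard complement to the residual histogram (a sketch using base R graphics):
#Residuals should fall near the reference line if they are approximately normal
qqnorm(resid(m2))
qqline(resid(m2))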
#Linearity
plot(diff ~ pct_minority_14, data=z)
Based on the plot above, the relationship between diff and pct_minority_14 appears at least roughly linear, so there is some basis for continuing with a linear model for diff.
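The same visual check can be applied to the other retained predictor:
#Linearity check for the second predictor
plot(diff ~ bac_pct_15, data=z)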
which.max(resid(m2))
## 46
## 46
z$state[46]
## [1] "vermont"
The state with the largest residual is Vermont.
plot(resid(m2) ~ fitted(m2))
I believe that the Vermont outlier is significant. An outlier matters if the model substantially underpredicts its value or if it disproportionately determines the position of the regression line, and the sheer distance of the Vermont point from the otherwise even spread of observations suggests it is pulling on the fit.
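This can be checked more formally with standard influence diagnostics (a sketch; cooks.distance and rstandard are base R):
#Cook's distance flags points that disproportionately move the fitted coefficients
cd <- cooks.distance(m2)
head(sort(cd, decreasing = TRUE), 3) #largest candidates
which(cd > 4/nrow(z)) #a common rule-of-thumb cutoff
rstandard(m2)[z$state == "vermont"] #Vermont's standardized residual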
z$V <- ifelse(z$state == "vermont", 1, 0)
m3 <- lm(diff ~ pct_minority_14 + bac_pct_15 + V, data=z)
coef(m2)
## (Intercept) pct_minority_14 bac_pct_15
## -1.105670101 0.005671286 0.030202380
coef(m3)
## (Intercept) pct_minority_14 bac_pct_15 V
## -1.087736501 0.006404034 0.028620822 0.319489197
The coefficient on V represents the residual for Vermont’s observation under a model fit without Vermont. Its inclusion noticeably changed the values of the intercept, pct_minority_14, and bac_pct_15.
m3b <- lm(diff ~ pct_minority_14 + bac_pct_15, data=z[z$V==0,])
coef(m3b)
## (Intercept) pct_minority_14 bac_pct_15
## -1.087736501 0.006404034 0.028620822
This indicates that the fitted equation for states other than Vermont is \[\widehat{diff} = -1.088 + 0.006 \cdot pct\_minority\_14 + 0.029 \cdot bac\_pct\_15\] while the fitted equation for Vermont is \[\widehat{diff} = -1.088 + 0.006 \cdot pct\_minority\_14 + 0.029 \cdot bac\_pct\_15 + 0.319\]
summary(m3)
##
## Call:
## lm(formula = diff ~ pct_minority_14 + bac_pct_15 + V, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26375 -0.09264 0.01095 0.08586 0.24996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.087737 0.087748 -12.396 < 2e-16 ***
## pct_minority_14 0.006404 0.001147 5.582 1.15e-06 ***
## bac_pct_15 0.028621 0.003017 9.487 1.70e-12 ***
## V 0.319489 0.129969 2.458 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1228 on 47 degrees of freedom
## Multiple R-squared: 0.7838, Adjusted R-squared: 0.77
## F-statistic: 56.81 on 3 and 47 DF, p-value: 1.153e-15
Given that Vermont’s indicator coefficient is 0.319, large relative to the magnitudes of the other coefficients, and statistically significant (p = 0.0177), I conclude that the state’s residual is unusually large and that the point is influential: including it noticeably shifts the other coefficients, as the comparison with the Vermont-excluded fit shows.