Title: "The 2016 US Presidential Election: Modeling State-Level Margins of Victory"

Synopsis: In this project, I explore the 2016 US presidential election outcomes on a state-by-state basis and model Clinton's margin of victory (in popular votes) using factors such as unemployment, educational attainment, and the percentage of minorities in each state.

Skills: Data cleaning, merging data frames, linear regression, stepwise regression, F-statistics, data visualization, choroplethr, choroplethrMaps

Problems

The presidential election outcomes were scraped from Wikipedia (https://en.wikipedia.org/wiki/United_States_presidential_election,_2016#Results_by_state) and saved into the following file: https://www.dropbox.com/s/jf5rd3lxtuklwgy/pres_election_16.csv?raw=1

Note that the Wikipedia table shows the number of popular votes, percentage of popular votes, and electoral votes garnered by the top 5 presidential candidates. The .csv file simply contains these values for Clinton and Trump. The variables contained are: state, method, clinton_ct, clinton_pct, clinton_elec, trump_ct, trump_pct, and trump_elec.

1. Exploring the Election Data

  a. I loaded and cleaned the data, and modified clinton_pct and trump_pct to reflect the proportion of votes won by Clinton and Trump among voters who chose either Clinton or Trump rather than a third-party candidate.

Loading Data

rm(list = ls()) 
setwd("/Users/Roberta/Desktop/Data Analysis & Exploration/HW7")
list.files()
## [1] "2016 US Presidential Election FINAL.Rmd"
## [2] "2016_US_Presidential_Election_FINAL.Rmd"
## [3] "HW7.html"                               
## [4] "HW7.pdf"                                
## [5] "HW7.Rmd"                                
## [6] "pres_election_16.csv"                   
## [7] "state_data.csv"
x <- read.csv("pres_election_16.csv", as.is = TRUE)

Modifying clinton_pct and trump_pct

head(x)
##        state method clinton_ct clinton_pct clinton_elec  trump_ct
## 1    Alabama    WTA    729,547      34.36%            – 1,318,255
## 2     Alaska    WTA    116,454      36.55%            –   163,387
## 3    Arizona    WTA  1,161,167      45.13%            – 1,252,401
## 4   Arkansas    WTA    380,494      33.65%            –   684,872
## 5 California    WTA  8,753,788      61.73%           55 4,483,810
## 6   Colorado    WTA  1,338,870      48.16%            9 1,202,484
##   trump_pct trump_elec
## 1    62.08%          9
## 2    51.28%          3
## 3    48.67%         11
## 4    60.57%          6
## 5    31.62%          –
## 6    43.25%          –
#Stripping the " (at-lg)" suffix from the Maine and Nebraska rows to avoid state-name mismatches later
x$state <- gsub(" (at-lg)", "", x$state, fixed=TRUE)
#Rows 21-22 (Maine's 1st and 2nd districts) and 31-33 (Nebraska's 1st-3rd districts) are
#congressional-district rows; this copy drops them, though they and the U.S. total row (57)
#are ultimately excluded from x itself during the merge below
tempx <- x[-c(21, 22, 31, 32, 33), ]

#Cleaning up clinton_ct: stripping comma separators, then converting to numeric
clinton_ct_temp <- as.numeric(gsub(",", "", x$clinton_ct))
#Checking that no values failed to parse during the conversion
any(is.na(clinton_ct_temp))
## [1] FALSE
x$clinton_ct <- clinton_ct_temp

#Cleaning up trump_ct in the same way
trump_ct_temp <- as.numeric(gsub(",", "", x$trump_ct))
any(is.na(trump_ct_temp))
## [1] FALSE
x$trump_ct <- trump_ct_temp

#Creating a total two-party count
x$total_ct <- x$clinton_ct + x$trump_ct

#Recomputing clinton_pct as Clinton's share of the two-party vote
clinton_pct_temp <- x$clinton_ct/x$total_ct
x$clinton_pct <- clinton_pct_temp

#Recomputing trump_pct as Trump's share of the two-party vote
trump_pct_temp <- x$trump_ct/x$total_ct
x$trump_pct <- trump_pct_temp
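
As a quick sanity check (a minimal sketch, not part of the original run), the two recomputed shares should sum to one in every row:

#Sanity check: the two recomputed shares should sum to 1 in every row
stopifnot(isTRUE(all.equal(x$clinton_pct + x$trump_pct, rep(1, nrow(x)))))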

head(x)
##        state method clinton_ct clinton_pct clinton_elec trump_ct trump_pct
## 1    Alabama    WTA     729547   0.3562586            –  1318255 0.6437414
## 2     Alaska    WTA     116454   0.4161435            –   163387 0.5838565
## 3    Arizona    WTA    1161167   0.4810998            –  1252401 0.5189002
## 4   Arkansas    WTA     380494   0.3571486            –   684872 0.6428514
## 5 California    WTA    8753788   0.6612822           55  4483810 0.3387178
## 6   Colorado    WTA    1338870   0.5268333            9  1202484 0.4731667
##   trump_elec total_ct
## 1          9  2047802
## 2          3   279841
## 3         11  2413568
## 4          6  1065366
## 5          – 13237598
## 6          –  2541354
  b. I also removed the columns ending in _elec from the data frame, since I want to study the margin of popular-vote victory/loss per state and so need one row per state (plus DC); the rows where method corresponds to congressional districts are dropped later, during the merge.
temp <- subset(x, select = -c(clinton_elec, trump_elec)) #dropping clinton_elec (column 5) and trump_elec (column 8)
head(temp)
##        state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1    Alabama    WTA     729547   0.3562586  1318255 0.6437414  2047802
## 2     Alaska    WTA     116454   0.4161435   163387 0.5838565   279841
## 3    Arizona    WTA    1161167   0.4810998  1252401 0.5189002  2413568
## 4   Arkansas    WTA     380494   0.3571486   684872 0.6428514  1065366
## 5 California    WTA    8753788   0.6612822  4483810 0.3387178 13237598
## 6   Colorado    WTA    1338870   0.5268333  1202484 0.4731667  2541354
x <- temp
head(x)
##        state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1    Alabama    WTA     729547   0.3562586  1318255 0.6437414  2047802
## 2     Alaska    WTA     116454   0.4161435   163387 0.5838565   279841
## 3    Arizona    WTA    1161167   0.4810998  1252401 0.5189002  2413568
## 4   Arkansas    WTA     380494   0.3571486   684872 0.6428514  1065366
## 5 California    WTA    8753788   0.6612822  4483810 0.3387178 13237598
## 6   Colorado    WTA    1338870   0.5268333  1202484 0.4731667  2541354
  c. I added a margin-of-victory variable (call it diff) to the data frame: the proportion of votes won by Clinton minus the proportion won by Trump. I then displayed a summary of this diff variable (the five-number summary plus the mean) and a histogram, giving a visual representation of the margin of victory.
x$diff <- x$clinton_pct - x$trump_pct
summary(x$diff)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.57865 -0.22223 -0.03780 -0.04938  0.11997  0.91390
hist(x$diff)

The histogram is roughly centered around 0 and approximately bell-shaped, suggesting a near-normal distribution, though there is a single extreme positive outlier.
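
To pin down which observation produces that outlier (a quick check, not part of the original run; given the maximum diff of 0.914 this should be the District of Columbia):

#Identifying the state with the largest Clinton margin
x$state[which.max(x$diff)]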

  d. I used the choroplethr and choroplethrMaps packages to display the diff variable on a map of the United States.
#install.packages("choroplethr")
library(choroplethr)
## Warning: package 'choroplethr' was built under R version 3.4.2
## Loading required package: acs
## Warning: package 'acs' was built under R version 3.4.2
## Loading required package: stringr
## Loading required package: XML
## 
## Attaching package: 'acs'
## The following object is masked from 'package:base':
## 
##     apply
#install.packages("choroplethrMaps")
library(choroplethrMaps)
## Warning: package 'choroplethrMaps' was built under R version 3.4.2
data(df_pop_state)
head(df_pop_state)
##       region    value
## 1    alabama  4777326
## 2     alaska   711139
## 3    arizona  6410979
## 4   arkansas  2916372
## 5 california 37325068
## 6   colorado  5042853
state_choropleth(df_pop_state,
                 title = "US 2012 State Population Estimate",
                 legend = "Population") #Sanity-checking the package with its bundled example data

#Starting from df_pop_state to reuse its region names; lower-casing x's state names to match
diff_by_state <- df_pop_state
x$state <- tolower(x$state)
#Which rows of x have no matching region? (the congressional-district rows and the U.S. total)
setdiff(x$state, diff_by_state$region)
## [1] "maine, 1st"    "maine, 2nd"    "nebraska, 1st" "nebraska, 2nd"
## [5] "nebraska, 3rd" "u.s. total"
#Merging diff scores into the new data frame: its region column is matched with the state column of x
diff_by_state <- merge(diff_by_state, x[,c("state", "diff")], by.x = "region", by.y = "state")
#choroplethr expects the plotted variable in a column named value
diff_by_state$value <- diff_by_state$diff
diff_by_state$diff <- NULL
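
The same data frame could be built more directly (a sketch; equivalent to the merge above, since every region in df_pop_state has a matching row in x):

#Alternative: keep only the 50 states plus DC and name the columns as choroplethr expects
diff_by_state <- data.frame(region = x$state, value = x$diff)
diff_by_state <- diff_by_state[diff_by_state$region %in% df_pop_state$region, ]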

The diff variable displayed on a map of the United States:

state_choropleth(diff_by_state,
                 title = "Difference in vote share in the 2016 US Presidential Election",
                 legend = "Difference (clinton_pct - trump_pct)")

2. Modeling the Election Data

I then modeled the margin-of-victory variable diff against possible predictors that might be associated with the margin of victory. The predictors were drawn from recent Census/Bureau of Labor Statistics data:

https://www.dropbox.com/s/b3ehlep6yqvnl81/state_data.csv?raw=1

(Note: there are other variables that would do a better job of explaining diff, but these were readily available. Any model built here almost certainly oversimplifies the story of voting patterns in the United States.)

Importing the Indicator Data Set

list.files()
## [1] "2016 US Presidential Election FINAL.Rmd"  
## [2] "2016_US_Presidential_Election_FINAL.Rmd"  
## [3] "2016_US_Presidential_Election_FINAL_files"
## [4] "HW7.html"                                 
## [5] "HW7.pdf"                                  
## [6] "HW7.Rmd"                                  
## [7] "pres_election_16.csv"                     
## [8] "state_data.csv"
y <- read.csv("state_data.csv", as.is = TRUE)
head(y)
##        State unempl_16 hs_pct_15 bac_pct_15 adv_pct_15 pct_minority_14
## 1    Alabama       6.0      84.3       23.5        8.7            33.8
## 2     Alaska       6.6      92.1       28.0       10.1            38.1
## 3    Arizona       5.3      86.0       27.5       10.2            43.8
## 4   Arkansas       4.0      84.8       21.1        7.5            26.6
## 5 California       5.4      81.8       31.4       11.6            61.5
## 6   Colorado       3.3      90.7       38.1       14.0            31.0
head(x)
##        state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1    alabama    WTA     729547   0.3562586  1318255 0.6437414  2047802
## 2     alaska    WTA     116454   0.4161435   163387 0.5838565   279841
## 3    arizona    WTA    1161167   0.4810998  1252401 0.5189002  2413568
## 4   arkansas    WTA     380494   0.3571486   684872 0.6428514  1065366
## 5 california    WTA    8753788   0.6612822  4483810 0.3387178 13237598
## 6   colorado    WTA    1338870   0.5268333  1202484 0.4731667  2541354
##          diff
## 1 -0.28748287
## 2 -0.16771309
## 3 -0.03780047
## 4 -0.28570275
## 5  0.32256441
## 6  0.05366667
  a. Merging this new data frame into the first data frame.
#Merging the new variables into the data frame
y$State <- tolower(y$State)
z <- merge(x, y, by.x = "state", by.y = "State")
head(z)
##        state method clinton_ct clinton_pct trump_ct trump_pct total_ct
## 1    alabama    WTA     729547   0.3562586  1318255 0.6437414  2047802
## 2     alaska    WTA     116454   0.4161435   163387 0.5838565   279841
## 3    arizona    WTA    1161167   0.4810998  1252401 0.5189002  2413568
## 4   arkansas    WTA     380494   0.3571486   684872 0.6428514  1065366
## 5 california    WTA    8753788   0.6612822  4483810 0.3387178 13237598
## 6   colorado    WTA    1338870   0.5268333  1202484 0.4731667  2541354
##          diff unempl_16 hs_pct_15 bac_pct_15 adv_pct_15 pct_minority_14
## 1 -0.28748287       6.0      84.3       23.5        8.7            33.8
## 2 -0.16771309       6.6      92.1       28.0       10.1            38.1
## 3 -0.03780047       5.3      86.0       27.5       10.2            43.8
## 4 -0.28570275       4.0      84.8       21.1        7.5            26.6
## 5  0.32256441       5.4      81.8       31.4       11.6            61.5
## 6  0.05366667       3.3      90.7       38.1       14.0            31.0
#Checking for missing values introduced by the merge
any(is.na(z))
## [1] FALSE
dim(z)
## [1] 51 13
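
Note that merge() performs an inner join by default, so the six non-state rows (the Maine and Nebraska congressional-district rows and the U.S. total) were silently dropped, which is why z has 51 rows. This could be confirmed with a quick check (a sketch; output omitted):

#Rows of x that did not survive the merge
setdiff(x$state, z$state)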
  b. I used forward stepwise regression to predict diff. The incremental output below shows each step of the search (passing the optional flag trace=0 to step() would suppress it); a summary of the final model follows.
#Stepwise regression: iteratively adding variables to a linear regression model
m0 <- lm(diff ~ 1, data=z)
mfull <- lm(diff ~ unempl_16 + hs_pct_15 + bac_pct_15 + adv_pct_15 + pct_minority_14, data=z)

mstepf <- step(m0, scope = list(upper=mfull), direction = "forward")
## Start:  AIC=-137.96
## diff ~ 1
## 
##                   Df Sum of Sq    RSS     AIC
## + adv_pct_15       1   2.23925 1.0399 -194.53
## + bac_pct_15       1   2.08397 1.1952 -187.43
## + pct_minority_14  1   0.89552 2.3836 -152.22
## <none>                         3.2791 -137.96
## + unempl_16        1   0.02347 3.2557 -136.32
## + hs_pct_15        1   0.01118 3.2680 -136.13
## 
## Step:  AIC=-194.53
## diff ~ adv_pct_15
## 
##                   Df Sum of Sq     RSS     AIC
## + pct_minority_14  1  0.237767 0.80212 -205.77
## <none>                         1.03989 -194.53
## + hs_pct_15        1  0.032591 1.00730 -194.15
## + bac_pct_15       1  0.017687 1.02220 -193.40
## + unempl_16        1  0.005950 1.03394 -192.82
## 
## Step:  AIC=-205.77
## diff ~ adv_pct_15 + pct_minority_14
## 
##              Df Sum of Sq     RSS     AIC
## + bac_pct_15  1  0.056949 0.74517 -207.52
## <none>                    0.80212 -205.77
## + unempl_16   1  0.018988 0.78313 -204.99
## + hs_pct_15   1  0.017435 0.78469 -204.89
## 
## Step:  AIC=-207.52
## diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15
## 
##             Df  Sum of Sq     RSS     AIC
## <none>                    0.74517 -207.52
## + hs_pct_15  1 0.00085080 0.74432 -205.58
## + unempl_16  1 0.00027484 0.74490 -205.54
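
For comparison, backward elimination from the full model could be run with the incremental output suppressed (a sketch; its selected model is not shown here):

#Backward stepwise search, with the trace suppressed via trace = 0
mstepb <- step(mfull, direction = "backward", trace = 0)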

Summary of the final model

summary(mstepf)
## 
## Call:
## lm(formula = diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15, 
##     data = z)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.276510 -0.099923 -0.007753  0.098294  0.283421 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -0.937127   0.127505  -7.350 2.42e-09 ***
## adv_pct_15       0.024961   0.013427   1.859 0.069283 .  
## pct_minority_14  0.004990   0.001194   4.180 0.000126 ***
## bac_pct_15       0.015779   0.008326   1.895 0.064222 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1259 on 47 degrees of freedom
## Multiple R-squared:  0.7728, Adjusted R-squared:  0.7582 
## F-statistic: 53.27 on 3 and 47 DF,  p-value: 3.708e-15
  c. What specifically do we learn from the F-statistic (and its associated p-value) in the regression summary in (b)?

The F-statistic and its associated p-value in the final model are

\(F = 53.27\) on 3 and 47 degrees of freedom, with \(p = 3.708 \times 10^{-15}\).

The F-statistic tests the overall significance of the model: \[H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0\] \[H_a: \text{at least one } \beta_j \neq 0\] Given the extremely small p-value of \(3.708 \times 10^{-15}\), we can reject the null hypothesis and conclude that at least one predictor has a nonzero coefficient; the model explains significantly more variation in diff than an intercept-only model.
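
As a sanity check, this F-statistic can be recomputed from the residual sums of squares shown in the stepwise trace above (RSS = 3.2791 for the intercept-only model and 0.74517 for the final model, with k = 3 predictors and n = 51):

#Recomputing the overall F-statistic by hand
rss0 <- sum(resid(m0)^2)      #3.2791
rss1 <- sum(resid(mstepf)^2)  #0.74517
((rss0 - rss1)/3) / (rss1/47) #approximately 53.27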

  d. Now consider a smaller model in which the predictor with the largest p-value in the stepwise final model is removed. I ran a nested F-test between this smaller model and the final model from stepwise regression. What do we learn from the resulting p-value, and where have we previously seen it?

The predictor with the largest p-value in our previous model is ‘adv_pct_15’, with a p-value of 0.069283.

m2 <- lm(diff ~ pct_minority_14 + bac_pct_15, data=z) #Model 1 in our results below
anova(m2, mstepf) #Model 2 below
## Analysis of Variance Table
## 
## Model 1: diff ~ pct_minority_14 + bac_pct_15
## Model 2: diff ~ adv_pct_15 + pct_minority_14 + bac_pct_15
##   Res.Df     RSS Df Sum of Sq      F  Pr(>F)  
## 1     48 0.79997                              
## 2     47 0.74517  1  0.054797 3.4562 0.06928 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Running the anova() function, i.e. a nested F-test, tells us whether the extra predictors in the larger model (mstepf) are jointly significant. The resulting p-value is 0.06928, the same as the p-value for adv_pct_15 in the summary of mstepf. This is no coincidence: adv_pct_15 is the only extra predictor, and when exactly one predictor is added, the nested F-test is equivalent to the t-test on its coefficient (\(F = t^2 = 1.859^2 \approx 3.456\)). The addition of adv_pct_15 is therefore not statistically significant at the 0.05 level.

  e. Interpreting the model
summary(m2)
## 
## Call:
## lm(formula = diff ~ pct_minority_14 + bac_pct_15, data = z)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.274811 -0.094176  0.006248  0.102881  0.285250 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.105670   0.091922 -12.028 4.29e-16 ***
## pct_minority_14  0.005671   0.001165   4.870 1.26e-05 ***
## bac_pct_15       0.030202   0.003098   9.748 5.80e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1291 on 48 degrees of freedom
## Multiple R-squared:  0.756,  Adjusted R-squared:  0.7459 
## F-statistic: 74.38 on 2 and 48 DF,  p-value: 1.975e-15

The coefficients from m2, both of which are statistically significant at \(\alpha = 0.05\), indicate that a state's percentage of minorities and its percentage of residents with bachelor's degrees are positively associated with diff. That is, states with larger minority populations and more bachelor's-degree holders tended to vote for Clinton by wider margins.
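
To make the coefficients concrete, consider a hypothetical state (illustrative values, not drawn from the data) with a 40% minority population and 30% bachelor's-degree attainment; the fitted model predicts a slight Clinton edge:

#Predicted diff: -1.1057 + 0.00567*40 + 0.0302*30, approximately +0.027
predict(m2, newdata = data.frame(pct_minority_14 = 40, bac_pct_15 = 30))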

3. Checking Assumptions

I now examine the model assumptions and the presence of unusual point(s) in the model m2 from 2(d).

  a. I displayed a histogram of the residuals and a residuals vs. fitted values plot to check the assumptions of the linear regression model.
plot(resid(m2) ~ fitted(m2))

hist(resid(m2))

In the residuals vs. fitted plot there is one striking point with a fitted value above 0.8, while all the other fitted values are below 0.4. Otherwise, the vertical spread of the plot is consistent from left to right, indicating roughly equal variance. The histogram displays a roughly normal distribution of residuals.

#Linearity
plot(diff ~ pct_minority_14, data=z)

Based on the plot above, the relationship between diff and at least one of our predictors, pct_minority_14, appears roughly linear, so there is some basis for continuing with a linear model for diff.
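
Further diagnostics could supplement these checks, for instance a normal QQ-plot of the residuals (a sketch; plot not shown):

#QQ-plot to assess normality of the residuals
qqnorm(resid(m2))
qqline(resid(m2))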

  b. There appears to be an unusual observation in the residuals vs. fitted values plot in (a). I identify the state this observation belongs to and display its raw data, then consider whether the observation is an influential outlier based on the residuals vs. fitted values plot.
which.max(resid(m2))
## 46 
## 46
z$state[46]
## [1] "vermont"

The state with the largest residual is Vermont.

plot(resid(m2) ~ fitted(m2))

I believe the Vermont outlier is influential. An outlier matters if the model substantially underpredicts its value or if it disproportionately determines the position of the regression line, and the sheer separation of the Vermont point from the otherwise even spread of the remaining observations suggests it is pulling on the fitted line.
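
A more formal check of influence would use Cook's distance, which combines each point's residual with its leverage (a sketch; results not shown, and note that a high-leverage point such as DC could also score highly):

#Cook's distance for each observation; values far above 4/n are commonly flagged as influential
cd <- cooks.distance(m2)
z$state[which.max(cd)]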

  c. Examining the outlier state, Vermont: does Vermont have an extreme residual, and does it influence the model?
#Indicator variable flagging the Vermont observation
z$V <- ifelse(z$state == "vermont", 1, 0)
m3 <- lm(diff ~ pct_minority_14 + bac_pct_15 + V, data=z)

coef(m2)
##     (Intercept) pct_minority_14      bac_pct_15 
##    -1.105670101     0.005671286     0.030202380
coef(m3)
##     (Intercept) pct_minority_14      bac_pct_15               V 
##    -1.087736501     0.006404034     0.028620822     0.319489197

The coefficient on V equals the residual for Vermont's observation under a model fit without Vermont. Its inclusion shifted the values of the intercept, pct_minority_14, and bac_pct_15 noticeably.

#Refitting m2 with Vermont excluded; the coefficients match the non-V coefficients of m3
m3b <- lm(diff ~ pct_minority_14 + bac_pct_15, data=z[z$V==0,])
coef(m3b)
##     (Intercept) pct_minority_14      bac_pct_15 
##    -1.087736501     0.006404034     0.028620822

This indicates that the fitted equation for states other than Vermont is \[\widehat{\text{diff}} = -1.088 + 0.006\,\text{pct\_minority\_14} + 0.029\,\text{bac\_pct\_15}\] while the fitted equation for Vermont is \[\widehat{\text{diff}} = -1.088 + 0.006\,\text{pct\_minority\_14} + 0.029\,\text{bac\_pct\_15} + 0.319\]
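
The dummy-coefficient interpretation can be verified directly: Vermont's residual under the no-Vermont fit should equal the V coefficient (a sketch):

#Vermont's observed diff minus its prediction from the no-Vermont model; approximately 0.319
vt <- z[z$state == "vermont", ]
vt$diff - predict(m3b, newdata = vt)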

summary(m3)
## 
## Call:
## lm(formula = diff ~ pct_minority_14 + bac_pct_15 + V, data = z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26375 -0.09264  0.01095  0.08586  0.24996 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.087737   0.087748 -12.396  < 2e-16 ***
## pct_minority_14  0.006404   0.001147   5.582 1.15e-06 ***
## bac_pct_15       0.028621   0.003017   9.487 1.70e-12 ***
## V                0.319489   0.129969   2.458   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1228 on 47 degrees of freedom
## Multiple R-squared:  0.7838, Adjusted R-squared:   0.77 
## F-statistic: 56.81 on 3 and 47 DF,  p-value: 1.153e-15

Given that Vermont's indicator coefficient is 0.319, which is large relative to the magnitudes of the other coefficients, and that its p-value is 0.0177, I would say that the state's residual is unusually large and that Vermont is influential on the rest of the model: adding it shifts the other coefficients noticeably compared with the model fit excluding Vermont.