First, we load the polling data.
polling<- read.csv("polldata.csv")
#description of variables
names(polling)
[1] "State" "Year" "Rasmussen" "SurveyUSA" "DiffCount" "PropR"
[7] "Republican"
The data represent polls taken in the months leading up to the 2004, 2008, and 2012 presidential elections. Each row is a state in a particular year.
Republican — 1 if the Republican won the state, 0 if the Democrat won.
Rasmussen — % of voters polled who said they were likely to vote Republican minus % likely to vote Democrat (SurveyUSA is defined the same way).
DiffCount — number of polls that predicted a Republican winner minus the number that predicted a Democrat.
PropR — proportion of all polls that predicted a Republican winner.
#structure
str(polling)
'data.frame': 145 obs. of 7 variables:
$ State : Factor w/ 50 levels "Alabama","Alaska",..: 1 1 2 2 3 3 3 4 4 4 ...
$ Year : int 2004 2008 2004 2008 2004 2008 2012 2004 2008 2012 ...
$ Rasmussen : int 11 21 NA 16 5 5 8 7 10 NA ...
$ SurveyUSA : int 18 25 NA NA 15 NA NA 5 NA NA ...
$ DiffCount : int 5 5 1 6 8 9 4 8 5 2 ...
$ PropR : num 1 1 1 1 1 ...
$ Republican: int 1 1 1 1 1 1 1 1 1 1 ...
We see there are only 145 observations even though there are 50 states and 3 election years, so let's look at the number of observations per year using the table() function.
#table of polling year
table(polling$Year)
2004 2008 2012
50 50 45
In 2012 there are only 45 observations: pollsters were so sure about five states that they did not conduct any polls there.
#summary of polling
summary(polling)
State Year Rasmussen SurveyUSA
Arizona : 3 Min. :2004 Min. :-41.0000 Min. :-33.0000
Arkansas : 3 1st Qu.:2004 1st Qu.: -8.0000 1st Qu.:-11.7500
California : 3 Median :2008 Median : 1.0000 Median : -2.0000
Colorado : 3 Mean :2008 Mean : 0.0404 Mean : -0.8243
Connecticut: 3 3rd Qu.:2012 3rd Qu.: 8.5000 3rd Qu.: 8.0000
Florida : 3 Max. :2012 Max. : 39.0000 Max. : 30.0000
(Other) :127 NA's :46 NA's :71
DiffCount PropR Republican
Min. :-19.000 Min. :0.0000 Min. :0.0000
1st Qu.: -6.000 1st Qu.:0.0000 1st Qu.:0.0000
Median : 1.000 Median :0.6250 Median :1.0000
Mean : -1.269 Mean :0.5259 Mean :0.5103
3rd Qu.: 4.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. : 11.000 Max. :1.0000 Max. :1.0000
The summary shows Rasmussen has 46 missing values and SurveyUSA has 71, so we need a way to handle them. One option is to delete the observations with missing values, but we already have few observations, so we would rather not. Another is to drop the variables with missing data, but we want to retain Rasmussen and SurveyUSA. A third is to fill in the missing entries with average values. A more sophisticated method is multiple imputation, which fills in each missing value based on the non-missing ones. We can do this easily through R's mice package (Multivariate Imputation by Chained Equations).
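For contrast, the simple averaging option mentioned above would be a one-liner per column; fill_mean here is a hypothetical helper, not part of the analysis:
fill_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace NAs with the mean of the observed values
  x
}
fill_mean(polling$Rasmussen)   # (not run on the data; we use mice instead)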
install.packages("mice")
library("mice")
For the multiple imputation to be useful, we must be able to impute the missing poll values without using the outcome variable Republican, so we build a data frame containing only the four polling variables.
simple<- polling[ ,c("Rasmussen", "SurveyUSA","PropR", "DiffCount")]
summary(simple)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.0000 Min. :-33.0000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.0000 1st Qu.:-11.7500 1st Qu.:0.0000 1st Qu.: -6.000
Median : 1.0000 Median : -2.0000 Median :0.6250 Median : 1.000
Mean : 0.0404 Mean : -0.8243 Mean :0.5259 Mean : -1.269
3rd Qu.: 8.5000 3rd Qu.: 8.0000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.0000 Max. : 30.0000 Max. :1.0000 Max. : 11.000
NA's :46 NA's :71
We set a random seed so the imputation is reproducible. mice() generates several imputed data sets (five by default), and complete() extracts one completed version of the data.
set.seed(144)
imputed<- complete(mice(simple))
iter imp variable
1 1 Rasmussen SurveyUSA
1 2 Rasmussen SurveyUSA
1 3 Rasmussen SurveyUSA
1 4 Rasmussen SurveyUSA
1 5 Rasmussen SurveyUSA
2 1 Rasmussen SurveyUSA
2 2 Rasmussen SurveyUSA
2 3 Rasmussen SurveyUSA
2 4 Rasmussen SurveyUSA
2 5 Rasmussen SurveyUSA
3 1 Rasmussen SurveyUSA
3 2 Rasmussen SurveyUSA
3 3 Rasmussen SurveyUSA
3 4 Rasmussen SurveyUSA
3 5 Rasmussen SurveyUSA
4 1 Rasmussen SurveyUSA
4 2 Rasmussen SurveyUSA
4 3 Rasmussen SurveyUSA
4 4 Rasmussen SurveyUSA
4 5 Rasmussen SurveyUSA
5 1 Rasmussen SurveyUSA
5 2 Rasmussen SurveyUSA
5 3 Rasmussen SurveyUSA
5 4 Rasmussen SurveyUSA
5 5 Rasmussen SurveyUSA
summary(imputed)
Rasmussen SurveyUSA PropR DiffCount
Min. :-41.000 Min. :-33.000 Min. :0.0000 Min. :-19.000
1st Qu.: -8.000 1st Qu.:-11.000 1st Qu.:0.0000 1st Qu.: -6.000
Median : 3.000 Median : 1.000 Median :0.6250 Median : 1.000
Mean : 1.731 Mean : 1.517 Mean :0.5259 Mean : -1.269
3rd Qu.: 11.000 3rd Qu.: 18.000 3rd Qu.:1.0000 3rd Qu.: 4.000
Max. : 39.000 Max. : 30.000 Max. :1.0000 Max. : 11.000
polling$Rasmussen<- imputed$Rasmussen
polling$SurveyUSA<- imputed$SurveyUSA
Rasmussen and SurveyUSA now have no more NAs.
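Next we split the data by election year into training and test sets. The split itself is not shown above, so here is a minimal sketch consistent with the counts that follow (100 training observations from 2004 and 2008, 45 test observations from 2012):
train <- subset(polling, Year == 2004 | Year == 2008)
test <- subset(polling, Year == 2012)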
table(train$Republican)
0 1
47 53
Since the Republican won the majority of training-set states, a naive baseline model always predicts Republican, giving an accuracy of 53%. That is a pretty weak model, so we build a smarter baseline to compare against using the sign() function: it returns -1 for negative input, 0 for zero, and 1 for positive input. Passing the Rasmussen variable to sign() therefore returns 1 exactly when the poll showed the Republican ahead in the state.
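A quick check of this behavior:
sign(c(-7, 0, 13))
[1] -1 0 1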
table(sign(train$Rasmussen))
-1 0 1
42 3 55
A -1 means the smart baseline predicts a Democrat win, +1 predicts a Republican win, and 0 means the poll was tied (inconclusive).
#comparison of smart baseline model with basic baseline model
table(train$Republican, sign(train$Rasmussen))
-1 0 1
0 42 2 3
1 0 1 52
The 0 and 1 in the rows are the actual outcomes: a Democrat or a Republican win. In 42 observations the smart baseline correctly predicted a Democrat win and in 52 a Republican win; it made only 3 mistakes and was inconclusive 3 times. This is a much better baseline against which to compare a logistic regression approach.
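As a sanity check, we can compute the smart baseline's training accuracy directly (a sketch; ties count as wrong, and 2*train$Republican - 1 maps the 0/1 outcome onto the -1/+1 scale returned by sign()):
mean(sign(train$Rasmussen) == 2*train$Republican - 1)   # (42 + 52) / 100
[1] 0.94
Before proceeding to logistic regression, we check for multicollinearity among the independent variables.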
cor(train[, c("Rasmussen", "SurveyUSA", "PropR", "DiffCount", "Republican")])
Rasmussen SurveyUSA PropR DiffCount Republican
Rasmussen 1.0000000 0.9194508 0.8404803 0.5124098 0.8021191
SurveyUSA 0.9194508 1.0000000 0.8756581 0.5541816 0.8205806
PropR 0.8404803 0.8756581 1.0000000 0.8273785 0.9484204
DiffCount 0.5124098 0.5541816 0.8273785 1.0000000 0.8092777
Republican 0.8021191 0.8205806 0.9484204 0.8092777 1.0000000
Many of the independent variables are highly correlated with one another, so throwing them all into one model would be redundant. We start with a single variable, the one most highly correlated with the outcome Republican: PropR (correlation 0.94).
mod1<- glm(Republican ~ PropR, data = train, family = "binomial")
summary(mod1)
Call:
glm(formula = Republican ~ PropR, family = "binomial", data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.22880 -0.06541 0.10260 0.10260 1.37392
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.146 1.977 -3.108 0.001882 **
PropR 11.390 3.153 3.613 0.000303 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.269 on 99 degrees of freedom
Residual deviance: 15.772 on 98 degrees of freedom
AIC: 19.772
Number of Fisher Scoring iterations: 8
#predictions on train data
pred1<- predict(mod1, type = "response")
table(train$Republican, pred1>=0.5)
FALSE TRUE
0 45 2
1 2 51
The 0 and 1 in the rows again give the actual outcome, a Democrat or a Republican win respectively; TRUE means the model predicted Republican (probability >= 0.5) and FALSE means Democrat. The model correctly predicts 45 Democrat wins and 51 Republican wins, making 4 mistakes on the training set, about the same as the smart baseline.
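We can also read off the training accuracy directly:
mean((pred1 >= 0.5) == train$Republican)   # (45 + 51) / 100
[1] 0.96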
To see if we can improve on this, we look for a pair of variables that are less correlated with each other, so that together they contribute more independent information than PropR alone. SurveyUSA and DiffCount have a relatively low pairwise correlation (0.55), so we try the two of them together.
mod2<- glm(Republican ~ SurveyUSA+DiffCount , data = train, family = "binomial")
pred2<- predict(mod2, type="response")
table(train$Republican, pred2>=0.5)
FALSE TRUE
0 45 2
1 1 52
summary(mod2)
Call:
glm(formula = Republican ~ SurveyUSA + DiffCount, family = "binomial",
data = train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.01196 -0.00698 0.01005 0.05074 1.54975
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1405 1.2456 -0.916 0.3599
SurveyUSA 0.2976 0.1949 1.527 0.1267
DiffCount 0.7673 0.4188 1.832 0.0669 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 138.269 on 99 degrees of freedom
Residual deviance: 12.439 on 97 degrees of freedom
AIC: 18.439
Number of Fisher Scoring iterations: 10
We make one fewer mistake on the training set (3 instead of 4), which is nothing too impressive, and neither variable is individually significant; that weak significance is itself a symptom of the correlation between them. Still, we carry mod2 forward to the test set.
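Since both models were fit on the same data, we can also compare them by AIC (lower is better); the values match the summaries above:
AIC(mod1)   # 19.772
AIC(mod2)   # 18.439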
Now we evaluate on the test set, the 2012 election. First we use the smart baseline model to predict the outcomes there.
table(test$Republican, sign(test$Rasmussen)) #using smart baseline for predictions
-1 0 1
0 18 2 4
1 0 0 21
testpred<- predict(mod2, newdata= test , type="response")
table(test$Republican, testpred>=0.5)
FALSE TRUE
0 23 1
1 0 21
The smart baseline correctly predicts election results for 18 Democrat wins and 21 Republican wins; it makes 4 mistakes and is inconclusive for 2 states. Our model predicts 44 of the 45 test observations correctly, with only 1 mistake. Let's look at that one mistake:
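And the test-set accuracy of our model, as a quick check:
mean((testpred >= 0.5) == test$Republican)   # (23 + 21) / 45
[1] 0.9777778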
subset(test, testpred>=0.5 & Republican==0)
For this observation, Rasmussen is 2 and DiffCount is 6, both pointing toward a Republican win. In reality, Barack Obama, the Democrat, won the state of Florida in 2012. Overall, though, the model outperforms the smart baseline on the test set, so it is a good model.
A cutoff of 0.5 is fine here: we just want a binary win/lose prediction for each state and have no reason to penalize one type of error more heavily than the other, so we do not need an ROC curve to choose a different threshold.
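If we did want to examine other cutoffs, a minimal sketch with the ROCR package (an assumption: ROCR is not used above and would need to be installed) would be:
library(ROCR)
rocr_pred <- prediction(testpred, test$Republican)   # predicted probabilities vs. actual outcomes
plot(performance(rocr_pred, "tpr", "fpr"))           # ROC curve: true positive rate vs. false positive rate
as.numeric(performance(rocr_pred, "auc")@y.values)   # area under the curve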