The semester project requires you to model a problem using any of the methods covered in this class. You are welcome to work on a problem you are already tackling in your job.
anes_data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
"SPS/master/DATA%20621/anes_timeseries_2016.csv"), header = T)
COLUMN | CODE | NAME | DESCRIPTION |
---|---|---|---|
330 | V161245 | relig_attendance | PRE: Attend religious services how often |
668 | V162002 | television_usage | POST: How many programs about 2016 campaign did Respondent watch on TV |
670 | V162004 | internet_usage | POST: How many times Respondent got info about 2016 campaign on the Internet |
692 | V162018e | internet_activity | POST: DHS: sent a message on Facebook/Twitter about polit iss |
719 | V162031 | voted_presidential | POST: Did Respondent vote for President |
726 | V162034a | presidential_vote | POST: For whom did Respondent vote for President |
727 | V162035 | presidential_intensity | POST: Preference strong for Pres cand for whom Respondent voted |
773 | V162076b | justice_recall | POST: Office recall: US Supreme Ct Chief Justice Roberts |
792 | V162096 | therm_feminists | POST: Feeling thermometer: FEMINISTS |
794 | V162098 | therm_unions | POST: Feeling thermometer: LABOR UNIONS |
796 | V162100 | therm_business | POST: Feeling thermometer: BIG BUSINESS |
799 | V162103 | therm_glbt | POST: Feeling thermometer: GAY MEN AND LESBIANS |
801 | V162105 | therm_rich | POST: Feeling thermometer: RICH PEOPLE |
802 | V162106 | therm_muslims | POST: Feeling thermometer: MUSLIMS |
803 | V162107 | therm_christians | POST: Feeling thermometer: CHRISTIANS |
804 | V162108 | therm_jews | POST: Feeling thermometer: JEWS |
806 | V162110 | therm_police | POST: Feeling thermometer: POLICE |
808 | V162112 | therm_scientists | POST: Feeling thermometer: SCIENTISTS |
822 | V162123 | nationalism | POST: Better if rest of world more like America |
825 | V162125x | flag_pride | POST: SUMMARY- How good/bad does Respondent feel to see American flag |
837 | V162136x | economic_mobility | POST: SUMMARY-Economic mobility easier/harder compared to 20 yrs ago |
840 | V162139 | reducing_debt | POST: Importance of reducing deficit |
841 | V162140 | millionaires_tax | POST: Does Respondent favor or oppose tax on millionaires |
845 | V162144 | obamacare_costs | POST: Health Care Law effect on cost of Respondent’s health care |
865 | V162158 | immig_hurts_jobs | POST: How likely immigration will take away jobs |
867 | V162160 | terror_anxiety | POST: How worried about terrorist attack next 12 months |
873 | V162165 | fin_situation | POST: Worry about financial situation |
899 | V162185 | govt_size | POST: Less govt better OR more that govt should be doing |
931 | V162212 | slavery_hurts | POST: Agree/disagree: past slavery make more diff for blacks |
954 | V162230x | gender_roles | POST: SUMMARY- Better if man works and woman takes care of home |
968 | V162239 | indep_respect | POST: Child trait more important: independence or respect |
985 | V162255x | obama_muslim | POST: SUMMARY- Barack Obama is/isn’t Muslim |
1197 | V999999 | target | POST: Vote intensity (2-candidate) created by combining V162034a and V162035 |
Select the columns of interest and keep only respondents who voted. Refused and don’t-know responses are coded as negative numbers, so all negative values are converted to nulls. The ANES feeling thermometers are evenly spaced interval variables ranging from 0 to 100, so values outside that range are also converted to nulls.
variables <- c(330, 668, 670, 692, 719, 726, 727, 773, 792, 794, 796, 799, 801, 802, 803,
804, 806, 808, 822, 825, 837, 840, 841, 845, 865, 867, 873, 899, 931, 954, 968, 985)
anes_data <- anes_data[ , variables]
anes_data <- subset(anes_data, V162031 == 4) # only look at subset that voted
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162031")]
anes_data[anes_data < 0 | anes_data > 100] <- NA # make invalid values NA
names(anes_data)
## [1] "V161245" "V162002" "V162004" "V162018e" "V162034a" "V162035"
## [7] "V162076b" "V162096" "V162098" "V162100" "V162103" "V162105"
## [13] "V162106" "V162107" "V162108" "V162110" "V162112" "V162123"
## [19] "V162125x" "V162136x" "V162139" "V162140" "V162144" "V162158"
## [25] "V162160" "V162165" "V162185" "V162212" "V162230x" "V162239"
## [31] "V162255x"
Create the target variable, remove the underlying variables used to construct it, specify the factor levels of the target, remove rows with null target values, and rename the columns.
V999999 <- rep(NA, nrow(anes_data))
V999999[anes_data[, "V162034a"] == 1 & anes_data[, "V162035"] == 1] <- 1 # Voter w/ strong pref for Hillary
V999999[anes_data[, "V162034a"] == 1 & anes_data[, "V162035"] == 2] <- 2 # Voter w/o strong pref for Hillary
V999999[anes_data[, "V162034a"] == 2 & anes_data[, "V162035"] == 2] <- 3 # Voter w/o strong pref for Trump
V999999[anes_data[, "V162034a"] == 2 & anes_data[, "V162035"] == 1] <- 4 # Voter w/ strong pref for Trump
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162034a")]
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162035")]
anes_data[, "target"] <- as.factor(V999999)
levels(anes_data[, "target"]) <- c("HRC_Strong", "HRC_Weak", "DJT_Weak", "DJT_Strong")
anes_data <- subset(anes_data, !is.na(target)) # remove rows with null target values
colnames(anes_data) <- c("relig_attendance", "television_usage", "internet_usage",
"internet_activity", "justice_recall", "therm_feminists", "therm_unions",
"therm_business", "therm_glbt", "therm_rich", "therm_muslims", "therm_christians",
"therm_jews", "therm_police", "therm_scientists", "nationalism", "flag_pride",
"economic_mobility", "reducing_debt", "millionaires_tax", "obamacare_costs",
"immig_hurts_jobs", "terror_anxiety", "fin_situation", "govt_size",
"slavery_hurts", "gender_roles", "indep_respect", "obama_muslim", "target")
names(anes_data)
## [1] "relig_attendance" "television_usage" "internet_usage"
## [4] "internet_activity" "justice_recall" "therm_feminists"
## [7] "therm_unions" "therm_business" "therm_glbt"
## [10] "therm_rich" "therm_muslims" "therm_christians"
## [13] "therm_jews" "therm_police" "therm_scientists"
## [16] "nationalism" "flag_pride" "economic_mobility"
## [19] "reducing_debt" "millionaires_tax" "obamacare_costs"
## [22] "immig_hurts_jobs" "terror_anxiety" "fin_situation"
## [25] "govt_size" "slavery_hurts" "gender_roles"
## [28] "indep_respect" "obama_muslim" "target"
Convert unordered categorical variables into ordinal variables. For instance, “yes, no, maybe” is not in an intuitive order, whereas “yes, maybe, no” runs intuitively from positive through neutral to negative.
swap <- function(vector, value1, value2){
  # Swap two codes without collapsing them: flag both positions first,
  # then reassign, so values moved in the first step are not moved back.
  idx1 <- vector == value1 & !is.na(vector)
  idx2 <- vector == value2 & !is.na(vector)
  vector[idx1] <- value2
  vector[idx2] <- value1
  return(vector)
}
anes_data[, "millionaires_tax"] <- swap(anes_data[, "millionaires_tax"], 2, 3)
anes_data[, "obamacare_costs"] <- swap(anes_data[, "obamacare_costs"], 2, 3)
anes_data[, "indep_respect"] <- swap(anes_data[, "indep_respect"], 2, 3)
Create random training and evaluation datasets, specify the \(X\) variables, specify the \(Y\) variable, and specify the variable types that will be examined later in the analysis.
set.seed(2017)
N <- nrow(anes_data)
n <- sample(1:N, 2 * N / 3, replace = F)
The training data set is anes_data[n, ], with anes_data[n, -ncol(anes_data)] representing the \(X\) variables and anes_data[n, ncol(anes_data)] representing the \(Y\) variable. The evaluation data set is anes_data[-n, ], with anes_data[-n, -ncol(anes_data)] as its \(X\) variables.
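For readability, these subsets can also be bound to explicit objects; the names below are hypothetical, and the remaining code simply indexes anes_data directly.
train_X <- anes_data[n, -ncol(anes_data)]   # training predictors (hypothetical name)
train_Y <- anes_data[n, ncol(anes_data)]    # training response
eval_X <- anes_data[-n, -ncol(anes_data)]   # evaluation predictors
eval_Y <- anes_data[-n, ncol(anes_data)]    # evaluation response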
binary <- c(4, 5, 20, 21, 25, 28)
names(anes_data[binary])
## [1] "internet_activity" "justice_recall" "millionaires_tax"
## [4] "obamacare_costs" "govt_size" "indep_respect"
ordinal <- c(1:3, 16:19, 22:24, 26, 27, 29)
names(anes_data[ordinal])
## [1] "relig_attendance" "television_usage" "internet_usage"
## [4] "nationalism" "flag_pride" "economic_mobility"
## [7] "reducing_debt" "immig_hurts_jobs" "terror_anxiety"
## [10] "fin_situation" "slavery_hurts" "gender_roles"
## [13] "obama_muslim"
interval <- grep("^therm", colnames(anes_data))
names(anes_data[interval])
## [1] "therm_feminists" "therm_unions" "therm_business"
## [4] "therm_glbt" "therm_rich" "therm_muslims"
## [7] "therm_christians" "therm_jews" "therm_police"
## [10] "therm_scientists"
summary(anes_data[n, ], digits=3)
## relig_attendance television_usage internet_usage internet_activity
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:1.00
## Median :2.00 Median :3.00 Median :3.00 Median :2.00
## Mean :2.42 Mean :3.04 Mean :3.02 Mean :1.64
## 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:2.00
## Max. :5.00 Max. :4.00 Max. :4.00 Max. :2.00
## NA's :616 NA's :2 NA's :2
## justice_recall therm_feminists therm_unions therm_business
## Min. :0.000 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:0.000 1st Qu.: 40.0 1st Qu.: 40.0 1st Qu.: 40.0
## Median :0.000 Median : 51.0 Median : 60.0 Median : 50.0
## Mean :0.284 Mean : 57.3 Mean : 56.8 Mean : 49.8
## 3rd Qu.:1.000 3rd Qu.: 79.0 3rd Qu.: 71.0 3rd Qu.: 62.0
## Max. :1.000 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :24 NA's :13 NA's :13
## therm_glbt therm_rich therm_muslims therm_christians
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 50.0 1st Qu.: 50.0 1st Qu.: 40.0 1st Qu.: 60.0
## Median : 60.0 Median : 50.0 Median : 50.0 Median : 85.0
## Mean : 61.4 Mean : 55.2 Mean : 55.2 Mean : 76.5
## 3rd Qu.: 85.0 3rd Qu.: 70.0 3rd Qu.: 70.0 3rd Qu.:100.0
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :100.0
## NA's :12 NA's :17 NA's :28 NA's :16
## therm_jews therm_police therm_scientists nationalism
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :1.0
## 1st Qu.: 51.0 1st Qu.: 70.0 1st Qu.: 60.0 1st Qu.:2.0
## Median : 71.0 Median : 85.0 Median : 84.0 Median :3.0
## Mean : 73.2 Mean : 76.9 Mean : 77.0 Mean :3.1
## 3rd Qu.: 86.0 3rd Qu.: 93.0 3rd Qu.: 93.8 3rd Qu.:4.0
## Max. :100.0 Max. :100.0 Max. :100.0 Max. :5.0
## NA's :22 NA's :7 NA's :13 NA's :1
## flag_pride economic_mobility reducing_debt millionaires_tax
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.00 1st Qu.:4.00 1st Qu.:1.00 1st Qu.:1.00
## Median :1.00 Median :6.00 Median :2.00 Median :1.00
## Mean :1.89 Mean :5.48 Mean :1.87 Mean :1.33
## 3rd Qu.:2.00 3rd Qu.:7.00 3rd Qu.:2.00 3rd Qu.:2.00
## Max. :7.00 Max. :7.00 Max. :5.00 Max. :2.00
## NA's :1 NA's :7 NA's :5 NA's :5
## obamacare_costs immig_hurts_jobs terror_anxiety fin_situation
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:3.00
## Median :1.00 Median :3.00 Median :3.00 Median :3.00
## Mean :1.24 Mean :2.75 Mean :2.71 Mean :3.36
## 3rd Qu.:1.00 3rd Qu.:4.00 3rd Qu.:3.00 3rd Qu.:4.00
## Max. :2.00 Max. :4.00 Max. :5.00 Max. :5.00
## NA's :35 NA's :5 NA's :2 NA's :3
## govt_size slavery_hurts gender_roles indep_respect
## Min. :1.00 Min. :1.00 Min. :1.00 Min. :1.00
## 1st Qu.:1.00 1st Qu.:2.00 1st Qu.:2.00 1st Qu.:1.00
## Median :1.00 Median :3.00 Median :4.00 Median :2.00
## Mean :1.49 Mean :2.96 Mean :3.22 Mean :1.74
## 3rd Qu.:2.00 3rd Qu.:4.00 3rd Qu.:4.00 3rd Qu.:2.00
## Max. :2.00 Max. :5.00 Max. :7.00 Max. :2.00
## NA's :11 NA's :4 NA's :23 NA's :7
## obama_muslim target
## Min. : 1.00 HRC_Strong:609
## 1st Qu.: 4.00 HRC_Weak :243
## Median : 8.00 DJT_Weak :205
## Mean : 7.01 DJT_Strong:586
## 3rd Qu.:10.00
## Max. :10.00
## NA's :84
Looking at the data summaries for the training dataset, only the variable relig_attendance has a notably large number of missing values. However, other questions that were possibly uncomfortable for respondents, such as obama_muslim and obamacare_costs, also saw higher-than-average non-response.
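A quick tally of the missing counts per variable (a minimal sketch) makes the same point numerically:
# Count missing values per column in the training rows; show the worst offenders.
head(sort(colSums(is.na(anes_data[n, ])), decreasing = TRUE), 5)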
pie(table(anes_data[n, "target"]), main="Major Party Vote Breakdown",
labels = paste0(levels(anes_data[n, "target"]),
" (", table(anes_data[n, "target"]), ")"),
col=c("blue", "lightblue", "pink", "red"))
par(mfrow = c(3,5))
for (i in 1:(ncol(anes_data)-1)) {
hist(anes_data[n,i], xlab = names(anes_data[i]),
main = names(anes_data[i]), col="grey", ylab="", cex = .5)
}
par(mfrow = c(2,5), cex = .5)
for (i in interval) {
plot(anes_data[n, i], main = names(anes_data[i]), ylab = "", xlab = "")
}
As expected, many respondents chose multiples of ten on these thermometer questions. Feelings are largely warm toward Christians, Jews, police, and scientists, while opinion toward the other groups spans a broader range.
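A small sketch quantifies the heaping on multiples of ten:
# Share of thermometer responses that land exactly on a multiple of ten.
mean(as.matrix(anes_data[n, interval]) %% 10 == 0, na.rm = TRUE)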
par(mfrow = c(2,5))
for (i in interval) {
d <- density(anes_data[n,i], na.rm=T)
plot(d, main = names(anes_data[i]), xlab = "", ylab = "")
polygon(d, col="red")
}
The thermometer questions for the more popular groups show noticeable skew, with responses piled up near the top of the scale.
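The degree of skew can also be checked numerically; the sketch below assumes the e1071 package is available for its skewness function.
# Sample skewness of each thermometer (negative values indicate a pile-up near 100).
round(apply(anes_data[n, interval], 2, e1071::skewness, na.rm = TRUE), 2)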
par(mfrow = c(3,5), cex = .5)
for (i in ordinal) {
smoothScatter(anes_data[n,i], main = names(anes_data[i]), ylab = "",
xlab = "", colramp = colorRampPalette(c("white", "red")))
}
library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(na.omit(sapply(anes_data[n,], as.numeric))))),
aes(x=Var1, y=Var2, fill=value)) +
scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
geom_tile() + labs(title = "Correlation Heatmap") +
theme(axis.title.y = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90, hjust = 1),
plot.title = element_text(hjust = 0.5))
The correlations between the predictor variables and the response variable range from low to moderate, with several interesting patterns surfacing. Some of the strongest correlations in the dataset reflect ideological differences on polarizing social issues: for example, candidate preference correlates with views toward slavery, Muslims, gay men and lesbians, and feminists. The variables reflecting views of one’s financial situation, economic mobility, Jews, and justice recall show only weak correlations with the response variable.
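To read off the values the heatmap encodes, the correlations with the response can be listed directly (a sketch using the same numeric coercion as the plot):
target_cor <- cor(na.omit(sapply(anes_data[n, ], as.numeric)))[, "target"]
round(sort(abs(target_cor[names(target_cor) != "target"]), decreasing = TRUE), 3)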
PCA <- function(X) {
Xpca <- prcomp(na.omit(X), center = T, scale. = T)
M <- as.matrix(na.omit(X)); R <- as.matrix(Xpca$rotation); score <- M %*% R
print(list("Importance of Components" = summary(Xpca)$importance[ ,1:4],
"Correlation between X and PC" = cor(na.omit(X), score)[ ,1:4]))
par(mfrow=c(2,3))
barplot(Xpca$sdev^2, ylab = "Component Variance")
barplot(cor(cbind(X)), ylab = "Correlations")
barplot(Xpca$rotation, ylab = "Loadings")
biplot(Xpca); barplot(M); barplot(score)
}
PCA(anes_data[n, -ncol(anes_data)])
## $`Importance of Components`
## PC1 PC2 PC3 PC4
## Standard deviation 2.289431 1.577647 1.281814 1.272464
## Proportion of Variance 0.180740 0.085830 0.056660 0.055830
## Cumulative Proportion 0.180740 0.266570 0.323220 0.379060
##
## $`Correlation between X and PC`
## PC1 PC2 PC3 PC4
## relig_attendance 0.265338848 -0.02280027 0.34318542 -0.143482872
## television_usage 0.033359822 0.10364935 0.06844106 0.018800255
## internet_usage 0.105157424 -0.01657748 0.21907764 0.064051446
## internet_activity -0.088667801 -0.07507725 -0.14642145 0.008227587
## justice_recall 0.078548773 0.06431838 0.10633371 0.015604102
## therm_feminists 0.809791418 0.48473921 0.28867202 -0.671289226
## therm_unions 0.594001703 0.38144789 -0.16816468 -0.810234883
## therm_business -0.226148341 0.38069274 -0.36364689 0.327125887
## therm_glbt 0.773044137 0.53249547 0.66795435 -0.308026889
## therm_rich -0.096355140 0.58089544 -0.11216237 0.360144314
## therm_muslims 0.741941368 0.57618468 0.22548280 -0.259986999
## therm_christians -0.325055425 0.37969196 -0.35136870 0.043635989
## therm_jews 0.261883605 0.71326233 0.39997866 -0.017413341
## therm_police -0.343015481 0.38333869 0.15572940 0.182520215
## therm_scientists 0.489974257 0.51822977 0.39827632 -0.434782083
## nationalism 0.390678945 0.06181492 0.16733105 -0.148464933
## flag_pride 0.374163774 -0.06795340 0.02797379 -0.200772844
## economic_mobility -0.036974799 -0.02210561 0.16957486 -0.017452311
## reducing_debt 0.361220906 0.01810991 0.10008120 -0.237668711
## millionaires_tax -0.332858748 -0.05252630 -0.11312952 0.388522837
## obamacare_costs 0.297638254 0.05494702 0.05207091 -0.224465940
## immig_hurts_jobs 0.447030145 0.12678283 0.16472569 -0.174125255
## terror_anxiety 0.176923537 0.02024504 -0.08249539 -0.068155614
## fin_situation -0.000116906 0.06990652 -0.14283385 0.125610520
## govt_size 0.370406028 0.10350955 -0.02558070 -0.409500128
## slavery_hurts -0.478707123 -0.14433811 -0.05284779 0.304707633
## gender_roles 0.352030378 0.01584006 0.22952265 -0.226253090
## indep_respect -0.289208353 -0.02279324 -0.18811455 0.110489743
## obama_muslim 0.543650372 0.13055775 0.18700451 -0.244974494
We see again that the main underlying component explaining the variance in the data appears to focus on inflammatory social issues, highlighted by the loadings on therm_feminists, therm_glbt, therm_muslims, and slavery_hurts, the same variables that correlate with the vote-based target. The second, less important, component revolves around sentiment possibly involving anti-Semitic conspiracy theories, highlighted by the loadings on therm_rich and therm_jews.
Impute missing values with means.
library(mice)
library(VIM)
aggr(anes_data[n, -ncol(anes_data)], bars=F, sortVars=T)
##
## Variables sorted by number of missings:
## Variable Count
## relig_attendance 0.3749239197
## obama_muslim 0.0511259890
## obamacare_costs 0.0213024954
## therm_muslims 0.0170419963
## therm_feminists 0.0146074254
## gender_roles 0.0139987827
## therm_jews 0.0133901400
## therm_rich 0.0103469264
## therm_christians 0.0097382836
## therm_unions 0.0079123554
## therm_business 0.0079123554
## therm_scientists 0.0079123554
## therm_glbt 0.0073037127
## govt_size 0.0066950700
## therm_police 0.0042604991
## economic_mobility 0.0042604991
## indep_respect 0.0042604991
## reducing_debt 0.0030432136
## millionaires_tax 0.0030432136
## immig_hurts_jobs 0.0030432136
## slavery_hurts 0.0024345709
## fin_situation 0.0018259282
## internet_usage 0.0012172855
## internet_activity 0.0012172855
## terror_anxiety 0.0012172855
## nationalism 0.0006086427
## flag_pride 0.0006086427
## television_usage 0.0000000000
## justice_recall 0.0000000000
data <- anes_data[n, -ncol(anes_data)]
MICE <- mice(data, predictorMatrix = quickpred(data), method = "mean", printFlag = F)
anes_data[n, -ncol(anes_data)] <- complete(MICE, action = 1)
aggr(anes_data[n, -ncol(anes_data)], bars=F)
data <- anes_data[-n, -ncol(anes_data)]
MICE <- mice(data, predictorMatrix = quickpred(data), method = "mean", printFlag = F)
anes_data[-n, -ncol(anes_data)] <- complete(MICE, action = 1)
library("DT")
display <- function(data) {
datatable(data, options = list(
searching = TRUE,
pageLength = 5,
lengthMenu = c(5, nrow(data))
), rownames = FALSE)
}
library(reshape2)
Corr_XY <- function(X, Y) {
corr <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(corr) <- c("Y", "X", "r","p","<0.05")
for (i in 1:ncol(X)) {
r <- cor.test(as.numeric(Y), X[, i])
corr[i, 1] <- "target"
corr[i, 2] <- names(X[i])
corr[i, 3] <- round(r$estimate, 6)
corr[i, 4] <- round(r$p.value, 6)
corr[i, 5] <- corr[i, 4] < 0.05
}
return(corr)
}
Corr_XX <- function(X, threshold) {
corr <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
colnames(corr) <- c("X1", "X2", "r","p","<0.05"); k = 1
for (i in 1:(ncol(X) - 1)) {
for (j in (i+1):ncol(X)) {
r <- cor.test(X[,i], X[,j])
corr[k, 1] <- names(X[i])
corr[k, 2] <- names(X[j])
corr[k, 3] <- round(r$estimate, 6)
corr[k, 4] <- round(r$p.value, 6)
corr[k, 5] <- corr[k, 4] < 0.05
k = k + 1
}
}
least <- corr[corr[,"<0.05"] == F, ]
most <- corr[abs(corr[,"r"]) >= threshold, ]
result <- list("Correlations" = corr, "Least_Correlated" = least, "Most_Correlated" = most)
return(result)
}
correlations <- Corr_XY(anes_data[n, -ncol(anes_data)], anes_data[n, ncol(anes_data)])
display(correlations)
The predictor variables justice_recall and fin_situation do not have statistically significant correlations with the response variable and are therefore not considered for the model.
remove <- c("justice_recall", "fin_situation")
anes_data <- anes_data[ , -which(names(anes_data) %in% remove)]
correlations <- Corr_XX(anes_data[n, -ncol(anes_data)], 0.40)
The optimal power transformation is found via the Box-Cox Test where:
\[-2\Rightarrow \frac { 1 }{ Y^{ 2 } }, \quad \quad -1\Rightarrow \frac { 1 }{ Y }, \quad \quad -0.5\Rightarrow \frac { 1 }{ \sqrt { Y } }, \quad \quad 0\Rightarrow \log { \left( Y \right) }, \quad \quad 0.5\Rightarrow \sqrt { Y }, \quad \quad 1\Rightarrow Y, \quad \quad 2\Rightarrow { Y }^{ 2 }\]
library(car)
## Warning: package 'car' was built under R version 3.3.2
interval <- grep("^therm", colnames(anes_data))
box.cox.powers <- powerTransform(anes_data[n, interval] + 1e-32, family="bcPower")
summary(box.cox.powers)$result
## Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## therm_feminists 0.2563347 0.006073274 0.2444311 0.2682383
## therm_unions 0.2938270 0.007078411 0.2799533 0.3077007
## therm_business 0.2694247 0.006428981 0.2568239 0.2820255
## therm_christians 0.5218412 0.016743361 0.4890243 0.5546582
## therm_jews 0.5694112 0.021088975 0.5280768 0.6107456
## therm_scientists 0.7311165 0.035138996 0.6622441 0.7999890
The Box-Cox power test results support raising all of the interval variables to powers between roughly 0.25 and 0.75. Values were shifted slightly to the right by a negligible amount (\(+10^{-32}\)) so that the calculation could handle variables containing zeros, since the function only accepts strictly positive (\(\mathbb{R}^+\)) values.
interval <- grep("^therm", colnames(anes_data)); k = 1
for (i in interval) {
var_name <- paste0("bc_", names(anes_data[i]))
anes_data[, var_name] <- anes_data[, i]^(box.cox.powers$lambda[k])
k = k + 1
}
anes_data <- anes_data[, -interval]
\(AIC\) and \(BIC\) will be used to determine potential models. Adjusted \(R^2\) and Mallows \(C_p\) are not appropriate for logistic regressions.
library(MASS)
null <- polr(target ~ 1, anes_data[n, ], Hess=T)
full <- polr(target ~ ., anes_data[n, ], Hess=T)
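As a baseline for the stepwise search, the information criteria of the two boundary models can be compared directly (a small sketch; AIC() and BIC() should work on polr fits through their logLik method):
# Compare the intercept-only and full models on both criteria.
c(AIC_null = AIC(null), AIC_full = AIC(full), BIC_null = BIC(null), BIC_full = BIC(full))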
Stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\).
AIC_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
AIC_steps$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1640 4199.274 4205.274
## 2 + obama_muslim -1 643.801993 1639 3555.472 3563.472
## 3 + govt_size -1 200.687939 1638 3354.784 3364.784
## 4 + bc_therm_feminists -1 119.631397 1637 3235.152 3247.152
## 5 + immig_hurts_jobs -1 90.466570 1636 3144.686 3158.686
## 6 + slavery_hurts -1 66.445149 1635 3078.241 3094.241
## 7 + millionaires_tax -1 42.032839 1634 3036.208 3054.208
## 8 + reducing_debt -1 24.996449 1633 3011.211 3031.211
## 9 + obamacare_costs -1 19.778353 1632 2991.433 3013.433
## 10 + economic_mobility -1 16.315463 1631 2975.118 2999.118
## 11 + flag_pride -1 14.990560 1630 2960.127 2986.127
## 12 + bc_therm_scientists -1 11.633122 1629 2948.494 2976.494
## 13 + relig_attendance -1 6.452458 1628 2942.041 2972.041
## 14 + gender_roles -1 3.461286 1627 2938.580 2970.580
## 15 + terror_anxiety -1 2.285315 1626 2936.295 2970.295
## 16 + nationalism -1 2.639941 1625 2933.655 2969.655
AIC_steps$call$formula
## target ~ obama_muslim + govt_size + bc_therm_feminists + immig_hurts_jobs +
## slavery_hurts + millionaires_tax + reducing_debt + obamacare_costs +
## economic_mobility + flag_pride + bc_therm_scientists + relig_attendance +
## gender_roles + terror_anxiety + nationalism
The above model has the lowest \(AIC\).
AIC_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
AIC_steps$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1617 2925.654 2977.654
## 2 - bc_therm_jews 1 0.1385388 1618 2925.792 2975.792
## 3 - bc_therm_business 1 0.3030515 1619 2926.095 2974.095
## 4 - bc_therm_unions 1 0.6898720 1620 2926.785 2972.785
## 5 - internet_usage 1 0.8280503 1621 2927.613 2971.613
## 6 - indep_respect 1 0.9317921 1622 2928.545 2970.545
## 7 - internet_activity 1 1.5267602 1623 2930.072 2970.072
## 8 - bc_therm_christians 1 1.7219694 1624 2931.794 2969.794
## 9 - television_usage 1 1.8610353 1625 2933.655 2969.655
AIC_steps$call$formula
## target ~ relig_attendance + nationalism + flag_pride + economic_mobility +
## reducing_debt + millionaires_tax + obamacare_costs + immig_hurts_jobs +
## terror_anxiety + govt_size + slavery_hurts + gender_roles +
## obama_muslim + bc_therm_feminists + bc_therm_scientists
The above model has the lowest \(AIC\). Forward selection and backward elimination both arrive at exactly the same set of variables.
AIC <- polr(AIC_steps$call$formula, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(AIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = AIC_steps$df.residual) * 2
cbind(coef_sum, "p value" = round(p, 6))
## Value Std. Error t value p value
## relig_attendance -0.12252747 0.05684933 -2.155302 0.031285
## nationalism -0.08374519 0.05152076 -1.625465 0.104257
## flag_pride -0.16529907 0.05105603 -3.237601 0.001230
## economic_mobility 0.14686348 0.03206333 4.580419 0.000005
## reducing_debt -0.30240516 0.06821625 -4.433037 0.000010
## millionaires_tax 0.69324163 0.12167283 5.697588 0.000000
## obamacare_costs -0.53259762 0.13727545 -3.879773 0.000109
## immig_hurts_jobs -0.37180933 0.06270353 -5.929640 0.000000
## terror_anxiety 0.08422079 0.05138492 1.639018 0.101403
## govt_size -0.84867211 0.11476028 -7.395173 0.000000
## slavery_hurts 0.26019290 0.04350841 5.980290 0.000000
## gender_roles -0.07736955 0.04237509 -1.825826 0.068060
## obama_muslim -0.27473342 0.02294422 -11.973972 0.000000
## bc_therm_feminists -0.69344932 0.12118874 -5.722060 0.000000
## bc_therm_scientists -0.04257146 0.01328927 -3.203446 0.001384
At a significance level of \(\alpha=0.05\), three of the variables are insignificant: gender_roles, terror_anxiety, and nationalism. Removing those three variables yields a model in which every remaining variable is significant.
AIC <- polr(target ~ relig_attendance + flag_pride + economic_mobility + reducing_debt +
millionaires_tax + obamacare_costs + immig_hurts_jobs + govt_size + slavery_hurts +
obama_muslim + bc_therm_feminists + bc_therm_scientists, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(AIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = AIC_steps$df.residual + 3) * 2
cbind(coef_sum, "p value" = round(p, 6))
## Value Std. Error t value p value
## relig_attendance -0.14256807 0.05620478 -2.536583 0.011287
## flag_pride -0.18134007 0.04878500 -3.717128 0.000208
## economic_mobility 0.14397162 0.03192266 4.510014 0.000007
## reducing_debt -0.29402930 0.06711412 -4.381035 0.000013
## millionaires_tax 0.70651459 0.12141525 5.818994 0.000000
## obamacare_costs -0.51566927 0.13697205 -3.764777 0.000173
## immig_hurts_jobs -0.37159820 0.06162243 -6.030243 0.000000
## govt_size -0.85599235 0.11453735 -7.473478 0.000000
## slavery_hurts 0.26580310 0.04317729 6.156086 0.000000
## obama_muslim -0.27952883 0.02281841 -12.250146 0.000000
## bc_therm_feminists -0.73048679 0.12109927 -6.032132 0.000000
## bc_therm_scientists -0.04308651 0.01328436 -3.243401 0.001205
Logistic regression coefficients are in terms of the log odds. The odds ratio is therefore obtained through exponentiation.
t(t(exp(coef_sum[,1])))
## [,1]
## relig_attendance 0.8671285
## flag_pride 0.8341516
## economic_mobility 1.1548513
## reducing_debt 0.7452547
## millionaires_tax 2.0269143
## obamacare_costs 0.5971008
## immig_hurts_jobs 0.6896313
## govt_size 0.4248614
## slavery_hurts 1.3044782
## obama_muslim 0.7561399
## bc_therm_feminists 0.4816745
## bc_therm_scientists 0.9578285
Given an ordered target variable ranging from 1 to 4 (HRC_Strong, HRC_Weak, DJT_Weak, DJT_Strong), each variable’s odds ratio indicates how, holding the other variables in the model constant, a change in that variable shifts a respondent’s odds of falling in a higher target category. Many of the larger odds ratios above, especially believing that economic_mobility is harder and disagreeing that slavery_hurts, increase the odds of being a voter with a strong preference for Trump. Interestingly, opposition to a millionaires_tax has the largest effect on those odds.
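If uncertainty around these odds ratios is of interest, profile-likelihood confidence intervals can be exponentiated as well (an optional sketch; profiling a polr fit can be slow):
# Odds-ratio confidence intervals for the reduced AIC model.
exp(confint(AIC))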
Stepwise subset selection based on \(BIC\). Using a penalty multiplier of \(k = \log(n)\) is sometimes referred to as \(BIC\).
BIC_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = log(length(n)), trace=F)
BIC_steps$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1640 4199.274 4221.487
## 2 + obama_muslim -1 643.80199 1639 3555.472 3585.089
## 3 + govt_size -1 200.68794 1638 3354.784 3391.805
## 4 + bc_therm_feminists -1 119.63140 1637 3235.152 3279.578
## 5 + immig_hurts_jobs -1 90.46657 1636 3144.686 3196.516
## 6 + slavery_hurts -1 66.44515 1635 3078.241 3137.475
## 7 + millionaires_tax -1 42.03284 1634 3036.208 3102.846
## 8 + reducing_debt -1 24.99645 1633 3011.211 3085.254
## 9 + obamacare_costs -1 19.77835 1632 2991.433 3072.880
## 10 + economic_mobility -1 16.31546 1631 2975.118 3063.969
## 11 + flag_pride -1 14.99056 1630 2960.127 3056.383
## 12 + bc_therm_scientists -1 11.63312 1629 2948.494 3052.154
BIC_steps$call$formula
## target ~ obama_muslim + govt_size + bc_therm_feminists + immig_hurts_jobs +
## slavery_hurts + millionaires_tax + reducing_debt + obamacare_costs +
## economic_mobility + flag_pride + bc_therm_scientists
The above model has the lowest \(BIC\).
BIC_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = log(length(n)), trace=F)
BIC_steps$anova
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1617 2925.654 3118.165
## 2 - bc_therm_jews 1 0.1385388 1618 2925.792 3110.899
## 3 - bc_therm_business 1 0.3030515 1619 2926.095 3103.798
## 4 - bc_therm_unions 1 0.6898720 1620 2926.785 3097.084
## 5 - internet_usage 1 0.8280503 1621 2927.613 3090.508
## 6 - indep_respect 1 0.9317921 1622 2928.545 3084.035
## 7 - internet_activity 1 1.5267602 1623 2930.072 3078.158
## 8 - bc_therm_christians 1 1.7219694 1624 2931.794 3072.475
## 9 - television_usage 1 1.8610353 1625 2933.655 3066.932
## 10 - nationalism 1 2.6399410 1626 2936.295 3062.168
## 11 - terror_anxiety 1 2.2853148 1627 2938.580 3057.049
## 12 - gender_roles 1 3.4612857 1628 2942.041 3053.106
## 13 - relig_attendance 1 6.4524580 1629 2948.494 3052.154
BIC_steps$call$formula
## target ~ flag_pride + economic_mobility + reducing_debt + millionaires_tax +
## obamacare_costs + immig_hurts_jobs + govt_size + slavery_hurts +
## obama_muslim + bc_therm_feminists + bc_therm_scientists
The above model has the lowest \(BIC\). Forward selection and backward elimination again arrive at exactly the same set of variables.
BIC <- polr(BIC_steps$call$formula, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(BIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = BIC_steps$df.residual) * 2
cbind(coef_sum, "p value" = round(p, 6))
## Value Std. Error t value p value
## flag_pride -0.18461879 0.04878601 -3.784257 0.000160
## economic_mobility 0.13716700 0.03179330 4.314336 0.000017
## reducing_debt -0.30092538 0.06714031 -4.482037 0.000008
## millionaires_tax 0.71386167 0.12121665 5.889139 0.000000
## obamacare_costs -0.52786842 0.13701421 -3.852654 0.000121
## immig_hurts_jobs -0.37050047 0.06155459 -6.019055 0.000000
## govt_size -0.84488150 0.11439299 -7.385780 0.000000
## slavery_hurts 0.25922013 0.04306699 6.018997 0.000000
## obama_muslim -0.28192779 0.02278314 -12.374403 0.000000
## bc_therm_feminists -0.74243295 0.12124417 -6.123453 0.000000
## bc_therm_scientists -0.04513501 0.01325575 -3.404938 0.000678
At a significance level of \(\alpha=0.05\), all of the variables are significant.
Logistic regression coefficients are in terms of the log odds. The odds ratio is therefore obtained through exponentiation.
t(t(exp(coef_sum[,1])))
## [,1]
## flag_pride 0.8314212
## economic_mobility 1.1470197
## reducing_debt 0.7401330
## millionaires_tax 2.0418611
## obamacare_costs 0.5898610
## immig_hurts_jobs 0.6903887
## govt_size 0.4296083
## slavery_hurts 1.2959190
## obama_muslim 0.7543282
## bc_therm_feminists 0.4759545
## bc_therm_scientists 0.9558684
Given an ordered target variable ranging from 1 (HRC_Strong) to 4 (DJT_Strong), each variable’s odds ratio indicates how, holding the other variables in the model constant, a change in that variable shifts a respondent’s odds of falling in a higher target category. The \(BIC\) odds ratios are very similar to those of the \(AIC\) model, with a slightly larger odds ratio for millionaires_tax. The main difference from the \(AIC\) model is that the \(BIC\) model drops relig_attendance rather than treating its effect as significant.
library(caret)
AIC_Scores <- predict(AIC, anes_data[n, ], type="prob")
AIC_Class <- predict(AIC, anes_data[n, ], type="class")
(cm1 <- confusionMatrix(AIC_Class, anes_data[n, "target"]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction HRC_Strong HRC_Weak DJT_Weak DJT_Strong
## HRC_Strong 521 182 27 51
## HRC_Weak 25 13 22 25
## DJT_Weak 0 0 0 0
## DJT_Strong 63 48 156 510
##
## Overall Statistics
##
## Accuracy : 0.6354
## 95% CI : (0.6116, 0.6587)
## No Information Rate : 0.3707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4369
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: HRC_Strong Class: HRC_Weak Class: DJT_Weak
## Sensitivity 0.8555 0.053498 0.0000
## Specificity 0.7485 0.948571 1.0000
## Pos Pred Value 0.6671 0.152941 NaN
## Neg Pred Value 0.8979 0.852375 0.8752
## Prevalence 0.3707 0.147900 0.1248
## Detection Rate 0.3171 0.007912 0.0000
## Detection Prevalence 0.4753 0.051735 0.0000
## Balanced Accuracy 0.8020 0.501035 0.5000
## Class: DJT_Strong
## Sensitivity 0.8703
## Specificity 0.7474
## Pos Pred Value 0.6564
## Neg Pred Value 0.9122
## Prevalence 0.3567
## Detection Rate 0.3104
## Detection Prevalence 0.4729
## Balanced Accuracy 0.8089
The \(AIC\) model has an Accuracy of 0.635423 and Error Rate of 0.364577.
BIC_Scores <- predict(BIC, anes_data[n, ], type="prob")
BIC_Class <- predict(BIC, anes_data[n, ], type="class")
(cm2 <- confusionMatrix(BIC_Class, anes_data[n, "target"]))
## Confusion Matrix and Statistics
##
## Reference
## Prediction HRC_Strong HRC_Weak DJT_Weak DJT_Strong
## HRC_Strong 524 179 30 51
## HRC_Weak 26 9 19 26
## DJT_Weak 0 0 0 0
## DJT_Strong 59 55 156 509
##
## Overall Statistics
##
## Accuracy : 0.6342
## 95% CI : (0.6104, 0.6575)
## No Information Rate : 0.3707
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4345
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: HRC_Strong Class: HRC_Weak Class: DJT_Weak
## Sensitivity 0.8604 0.037037 0.0000
## Specificity 0.7485 0.949286 1.0000
## Pos Pred Value 0.6684 0.112500 NaN
## Neg Pred Value 0.9010 0.850288 0.8752
## Prevalence 0.3707 0.147900 0.1248
## Detection Rate 0.3189 0.005478 0.0000
## Detection Prevalence 0.4772 0.048691 0.0000
## Balanced Accuracy 0.8045 0.493161 0.5000
## Class: DJT_Strong
## Sensitivity 0.8686
## Specificity 0.7446
## Pos Pred Value 0.6534
## Neg Pred Value 0.9109
## Prevalence 0.3567
## Detection Rate 0.3098
## Detection Prevalence 0.4741
## Balanced Accuracy 0.8066
The \(BIC\) model has an Accuracy of 0.6342057 and Error Rate of 0.3657943.
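A small sketch lines up the headline training-set statistics of the two models side by side:
# Compare overall accuracy and kappa from the two confusion matrices above.
rbind(AIC = cm1$overall[c("Accuracy", "Kappa")],
      BIC = cm2$overall[c("Accuracy", "Kappa")])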
A ROC curve is defined for a binary outcome; with more than two response classes it generalizes to a multi-class AUC, which pROC computes by averaging the pairwise class comparisons (Hand and Till).
library(pROC)
AIC_mcr <- multiclass.roc(anes_data[n, "target"], as.numeric(AIC_Class))
BIC_mcr <- multiclass.roc(anes_data[n, "target"], as.numeric(BIC_Class))
AIC_mcr$auc
## Multi-class area under the curve: 0.7579
BIC_mcr$auc
## Multi-class area under the curve: 0.7541
The AIC model has the larger multi-class AUC, though not by much.
par(mfrow=c(1,2), yaxt="n", cex=0.55)
x <- factor(AIC_mcr$predictor, levels=c(1:4), labels=levels(anes_data$target))
y <- AIC_mcr$response
z <- data.frame(x, y)
plot(z, main = "AIC Predictions", xlab = "", ylab="", col=c("blue", "lightblue", "pink", "red"))
x <- factor(BIC_mcr$predictor, levels=c(1:4), labels=levels(anes_data$target))
y <- BIC_mcr$response
z <- data.frame(x, y)
plot(z, main = "BIC Predictions", xlab = "", ylab="", col=c("blue", "lightblue", "pink", "red"))
The bars represent model predictions and the colors represent the actual data (HRC_Strong, HRC_Weak, DJT_Weak, DJT_Strong). Overall, the AIC model has a slightly higher accuracy and a slightly larger multi-class AUC. The difference in both metrics is minimal, but for the purposes of this analysis the slightly better model is the one chosen.
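Because the evaluation rows retain the true target, the chosen model can also be scored out of sample before comparing class distributions (an optional sketch):
# Out-of-sample confusion matrix for the chosen AIC model.
confusionMatrix(predict(AIC, anes_data[-n, ], type = "class"), anes_data[-n, "target"])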
(predicted <- table(predict(AIC, anes_data[-n, ], type="class")))
##
## HRC_Strong HRC_Weak DJT_Weak DJT_Strong
## 398 39 0 385
actual <- table(anes_data[-n, "target"])
rbind(actual, predicted) / (N - length(n))
## HRC_Strong HRC_Weak DJT_Weak DJT_Strong
## actual 0.4014599 0.12895377 0.1386861 0.3309002
## predicted 0.4841849 0.04744526 0.0000000 0.4683698
chisq.test(actual, predicted)
##
## Pearson's Chi-squared test
##
## data: actual and predicted
## X-squared = 12, df = 9, p-value = 0.2133
The prevalence of each class in the actual and predicted data is shown above. Although the figures differ somewhat, the difference is not significant at \(\alpha=0.05\), as seen in the \(\chi^2\) goodness-of-fit test. The null hypothesis of the \(\chi^2\) goodness-of-fit test for a multinomial distribution is that the observed (predicted) frequency in each category equals the expected (actual) count; in other words, that the data are consistent with the specified distribution. With a \(p\)-value greater than \(\alpha=0.05\), we fail to reject the null hypothesis, which supports the conclusion that the model fits the data reasonably well.
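The same comparison can also be framed as a multinomial goodness-of-fit test, testing the predicted counts directly against the actual class proportions (a sketch of an alternative check):
# Goodness-of-fit form: predicted counts versus actual class proportions.
chisq.test(x = predicted, p = actual / sum(actual))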
p_e <- 0.5 # a priori proportion
N_e <- 235248000 # voting age population
n_e <- length(n) # sample size
a_e <- 1 - 0.95 # confidence level
z_e <- qnorm(1 - a_e / 2) # z-score
(e <- z_e * sqrt((p_e * (1 - p_e) / n_e) * ((N_e - n_e) / (N_e - 1)))) # Margin of Error
## [1] 0.02417674
Using a normal approximation with a finite population correction, the randomly selected data are representative of the voting-age population within a margin of error of about 0.0242 at 95% confidence.