Overview

The semester project requires you to model a problem using any of the methods covered in this class. You are welcome to work on a problem that you are already working on at your work.

The project will account 300 points. This is 30% of your semester grade. There are two components:
1. Project Report (250 points; 25% of grade): This will involve a write up of your problem background, how you plan to model the problem, steps you took to clean up your data, model your data and how the model solves your problem (e.g.; interpret the model, talk about model limitations, etc.). Think of this as an extension of one of your homework assignments, except that now you get to utilize your formal writing skills in the context of statistical/predictive modeling. Along with your modeling ability, writing concisely and clearly will also be considered while grading this project.
2. Project Presentation (50 points; 5% of grade): This is where you create a concise PowerPoint presentation for a broader audience.
This is a group project Each group will submit one project.
Data source: The team gets to decide what data to use. You are free to use something that you are already working on at work, if you are allowed to do so. In case you decide to use any data that may be sensitive in nature, you will be solely responsible for making sure you have the proper permission to use the data and that you have taken necessary steps to protect any privacy concern that may arise. In case you are struggling to find data, you could use data that are publicly available. Some suggestions include: data.gov, usapengin.gov, and kaggle.com.

Data Exploration

anes_data <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                        "SPS/master/DATA%20621/anes_timeseries_2016.csv"), header = T)

VARIABLE	CODE	NAME	DESCRIPTION
330	V161245	relig_attendance	PRE: Attend religious services how often
668	V162002	television_usage	POST: How many programs about 2016 campaign did Respondent watch on TV
670	V162004	internet_usage	POST: How many times Respondent got info about 2016 campaign on the Internet
692	V162018e	internet_activity	POST: DHS: sent a message on Facebook/Twitter about polit iss
719	V162031	voted_presidential	POST: Did Respondent vote for President
726	V162034a	presidential_vote	POST: For whom did Respondent vote for President
727	V162035	presidential_intensity	POST: Preference strong for Pres cand for whom Respondent voted
773	V162076b	justice_recall	POST: Office recall: US Supreme Ct Chief Justice Roberts
792	V162096	therm_feminists	POST: Feeling thermometer: FEMINISTS
794	V162098	therm_unions	POST: Feeling thermometer: LABOR UNIONS
796	V162100	therm_business	POST: Feeling thermometer: BIG BUSINESS
799	V162103	therm_glbt	POST: Feeling thermometer: GAY MEN AND LESBIANS
801	V162105	therm_rich	POST: Feeling thermometer: RICH PEOPLE
802	V162106	therm_muslims	POST: Feeling thermometer: MUSLIMS
803	V162107	therm_christians	POST: Feeling thermometer: CHRISTIANS
804	V162108	therm_jews	POST: Feeling thermometer: JEWS
806	V162110	therm_police	POST: Feeling thermometer: POLICE
808	V162112	therm_scientists	POST: Feeling thermometer: SCIENTISTS
822	V162123	nationalism	POST: Better if rest of world more like America
825	V162125x	flag_pride	POST: SUMMARY- How good/bad does Respondent feel to see American flag
837	V162136x	economic_mobility	POST: SUMMARY-Economic mobility easier/harder compared to 20 yrs ago
840	V162139	reducing_debt	POST: Importance of reducing deficit
841	V162140	millionaires_tax	POST: Does Respondent favor or oppose tax on millionaires
845	V162144	obamacare_costs	POST: Health Care Law effect on cost of Respondent’s health care
865	V162158	immig_hurts_jobs	POST: How likely immigration will take away jobs
867	V162160	terror_anxiety	POST: How worried about terrist attack next 12 months
873	V162165	fin_situation	POST: Worry about financial situation
899	V162185	govt_size	POST: Less govt better OR more that govt should be doing
931	V162212	slavery_hurts	POST: Agree/disagree: past slavery make more diff for blacks
954	V162230x	gender_roles	POST: SUMMARY- Better if man works and woman takes care of home
968	V162239	indep_respect	POST: Child trait more important: independence or respect
985	V162255x	obama_muslim	POST: SUMMARY- Barack Obama is/isn’t Muslim
1197	V999999	target	POST: Vote intensity (2-candidate) created by combining V162034a and V162035

Clean Data

Select columns and only keep respondents who voted. Refused and don’t know type answers are represented by negative numbers. Therefore, all negative numbers are being converted to nulls. ANES Feeling Thermometers are evely spaced interval variables ranging from 0 to 100, therefore all numbers outside this range are also being converted to nulls.

variables <- c(330, 668, 670, 692, 719, 726, 727, 773, 792, 794, 796, 799, 801, 802, 803,
  804, 806, 808, 822, 825, 837, 840, 841, 845, 865, 867, 873, 899, 931, 954, 968, 985)
anes_data <- anes_data[ , variables]
anes_data <- subset(anes_data, V162031 == 4) # only look at subset that voted
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162031")]
anes_data[anes_data < 0 | anes_data > 100] <- NA # make invalid values NA
names(anes_data)

##  [1] "V161245"  "V162002"  "V162004"  "V162018e" "V162034a" "V162035" 
##  [7] "V162076b" "V162096"  "V162098"  "V162100"  "V162103"  "V162105" 
## [13] "V162106"  "V162107"  "V162108"  "V162110"  "V162112"  "V162123" 
## [19] "V162125x" "V162136x" "V162139"  "V162140"  "V162144"  "V162158" 
## [25] "V162160"  "V162165"  "V162185"  "V162212"  "V162230x" "V162239" 
## [31] "V162255x"

Normalize Data

Create target variable, remove underlying variables used to create the target variable, specify factor levels for target variable, remove rows with null values for the target variable, and name columns.

V999999 <- rep(NA, nrow(anes_data))
V999999[anes_data[, "V162034a"] == 1 & anes_data[, "V162035"] == 1] <- 1 # Voter w/ strong pref for Hillary
V999999[anes_data[, "V162034a"] == 1 & anes_data[, "V162035"] == 2] <- 2 # Voter w/o strong pref for Hillary
V999999[anes_data[, "V162034a"] == 2 & anes_data[, "V162035"] == 2] <- 3 # Voter w/o strong pref for Trump
V999999[anes_data[, "V162034a"] == 2 & anes_data[, "V162035"] == 1] <- 4 # Voter w/ strong pref for Trump
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162034a")]
anes_data <- anes_data[ , -which(names(anes_data) %in% "V162035")]
anes_data[, "target"] <- as.factor(V999999)
levels(anes_data[, "target"]) <- c("HRC_Strong", "HRC_Weak", "DJT_Weak", "DJT_Strong")
anes_data <- subset(anes_data, !is.na(target)) # remove rows with null target values

colnames(anes_data) <- c("relig_attendance", "television_usage", "internet_usage", 
  "internet_activity", "justice_recall", "therm_feminists", "therm_unions", 
  "therm_business", "therm_glbt", "therm_rich", "therm_muslims", "therm_christians", 
  "therm_jews", "therm_police", "therm_scientists", "nationalism", "flag_pride", 
  "economic_mobility", "reducing_debt", "millionaires_tax", "obamacare_costs", 
  "immig_hurts_jobs", "terror_anxiety", "fin_situation", "govt_size", 
  "slavery_hurts", "gender_roles", "indep_respect", "obama_muslim", "target")
names(anes_data)

##  [1] "relig_attendance"  "television_usage"  "internet_usage"   
##  [4] "internet_activity" "justice_recall"    "therm_feminists"  
##  [7] "therm_unions"      "therm_business"    "therm_glbt"       
## [10] "therm_rich"        "therm_muslims"     "therm_christians" 
## [13] "therm_jews"        "therm_police"      "therm_scientists" 
## [16] "nationalism"       "flag_pride"        "economic_mobility"
## [19] "reducing_debt"     "millionaires_tax"  "obamacare_costs"  
## [22] "immig_hurts_jobs"  "terror_anxiety"    "fin_situation"    
## [25] "govt_size"         "slavery_hurts"     "gender_roles"     
## [28] "indep_respect"     "obama_muslim"      "target"

Convert unordered categorical variables into ordinal variables. For instance, “yes, no, maybe” is not in an intuitive order. However, “yes, maybe, no” is in an intuitive from positive to neutral to negative.

swap <- function(vector, value1, value2){
  vector <- replace(vector, vector == value1, value2)
  vector <- replace(vector, vector == value2, value1)
  return(vector)
}
anes_data[, "millionaires_tax"] <- swap(anes_data[, "millionaires_tax"], 2, 3)
anes_data[, "obamacare_costs"] <- swap(anes_data[, "obamacare_costs"], 2, 3)
anes_data[, "indep_respect"] <- swap(anes_data[, "indep_respect"], 2, 3)

Create random training and evaluation datasets, specify \(X\) variables, specify \(Y\) variable, and specify varible types that will be examined later in the analysis.

set.seed(2017)
N <- nrow(anes_data)
n <- sample(1:N, 2 * N / 3, replace = F)

The training data set is anes_data[n, ] with anes_data[n, -ncol(anes_data)] representing the \(X\) variables and anes_data[n, ncol(anes_data)] representing the \(Y\) variable. The evaluation data set is anes_data[-n, -ncol(anes_data)].

Binary Variables

binary <- c(4, 5, 20, 21, 25, 28)
names(anes_data[binary])

## [1] "internet_activity" "justice_recall"    "millionaires_tax" 
## [4] "obamacare_costs"   "govt_size"         "indep_respect"

Ordinal Variables

ordinal <- c(1:3, 16:19, 22:24, 26, 27, 29)
names(anes_data[ordinal])

##  [1] "relig_attendance"  "television_usage"  "internet_usage"   
##  [4] "nationalism"       "flag_pride"        "economic_mobility"
##  [7] "reducing_debt"     "immig_hurts_jobs"  "terror_anxiety"   
## [10] "fin_situation"     "slavery_hurts"     "gender_roles"     
## [13] "obama_muslim"

Interval Variables

interval <- grep("^therm", colnames(anes_data))
names(anes_data[interval])

##  [1] "therm_feminists"  "therm_unions"     "therm_business"  
##  [4] "therm_glbt"       "therm_rich"       "therm_muslims"   
##  [7] "therm_christians" "therm_jews"       "therm_police"    
## [10] "therm_scientists"

Numerical Summaries

summary(anes_data[n, ], digits=3)

##  relig_attendance television_usage internet_usage internet_activity
##  Min.   :1.00     Min.   :1.00     Min.   :1.00   Min.   :1.00     
##  1st Qu.:1.00     1st Qu.:2.00     1st Qu.:2.00   1st Qu.:1.00     
##  Median :2.00     Median :3.00     Median :3.00   Median :2.00     
##  Mean   :2.42     Mean   :3.04     Mean   :3.02   Mean   :1.64     
##  3rd Qu.:4.00     3rd Qu.:4.00     3rd Qu.:4.00   3rd Qu.:2.00     
##  Max.   :5.00     Max.   :4.00     Max.   :4.00   Max.   :2.00     
##  NA's   :616                       NA's   :2      NA's   :2        
##  justice_recall  therm_feminists  therm_unions   therm_business 
##  Min.   :0.000   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:0.000   1st Qu.: 40.0   1st Qu.: 40.0   1st Qu.: 40.0  
##  Median :0.000   Median : 51.0   Median : 60.0   Median : 50.0  
##  Mean   :0.284   Mean   : 57.3   Mean   : 56.8   Mean   : 49.8  
##  3rd Qu.:1.000   3rd Qu.: 79.0   3rd Qu.: 71.0   3rd Qu.: 62.0  
##  Max.   :1.000   Max.   :100.0   Max.   :100.0   Max.   :100.0  
##                  NA's   :24      NA's   :13      NA's   :13     
##    therm_glbt      therm_rich    therm_muslims   therm_christians
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0   
##  1st Qu.: 50.0   1st Qu.: 50.0   1st Qu.: 40.0   1st Qu.: 60.0   
##  Median : 60.0   Median : 50.0   Median : 50.0   Median : 85.0   
##  Mean   : 61.4   Mean   : 55.2   Mean   : 55.2   Mean   : 76.5   
##  3rd Qu.: 85.0   3rd Qu.: 70.0   3rd Qu.: 70.0   3rd Qu.:100.0   
##  Max.   :100.0   Max.   :100.0   Max.   :100.0   Max.   :100.0   
##  NA's   :12      NA's   :17      NA's   :28      NA's   :16      
##    therm_jews     therm_police   therm_scientists  nationalism 
##  Min.   :  0.0   Min.   :  0.0   Min.   :  0.0    Min.   :1.0  
##  1st Qu.: 51.0   1st Qu.: 70.0   1st Qu.: 60.0    1st Qu.:2.0  
##  Median : 71.0   Median : 85.0   Median : 84.0    Median :3.0  
##  Mean   : 73.2   Mean   : 76.9   Mean   : 77.0    Mean   :3.1  
##  3rd Qu.: 86.0   3rd Qu.: 93.0   3rd Qu.: 93.8    3rd Qu.:4.0  
##  Max.   :100.0   Max.   :100.0   Max.   :100.0    Max.   :5.0  
##  NA's   :22      NA's   :7       NA's   :13       NA's   :1    
##    flag_pride   economic_mobility reducing_debt  millionaires_tax
##  Min.   :1.00   Min.   :1.00      Min.   :1.00   Min.   :1.00    
##  1st Qu.:1.00   1st Qu.:4.00      1st Qu.:1.00   1st Qu.:1.00    
##  Median :1.00   Median :6.00      Median :2.00   Median :1.00    
##  Mean   :1.89   Mean   :5.48      Mean   :1.87   Mean   :1.33    
##  3rd Qu.:2.00   3rd Qu.:7.00      3rd Qu.:2.00   3rd Qu.:2.00    
##  Max.   :7.00   Max.   :7.00      Max.   :5.00   Max.   :2.00    
##  NA's   :1      NA's   :7         NA's   :5      NA's   :5       
##  obamacare_costs immig_hurts_jobs terror_anxiety fin_situation 
##  Min.   :1.00    Min.   :1.00     Min.   :1.00   Min.   :1.00  
##  1st Qu.:1.00    1st Qu.:2.00     1st Qu.:2.00   1st Qu.:3.00  
##  Median :1.00    Median :3.00     Median :3.00   Median :3.00  
##  Mean   :1.24    Mean   :2.75     Mean   :2.71   Mean   :3.36  
##  3rd Qu.:1.00    3rd Qu.:4.00     3rd Qu.:3.00   3rd Qu.:4.00  
##  Max.   :2.00    Max.   :4.00     Max.   :5.00   Max.   :5.00  
##  NA's   :35      NA's   :5        NA's   :2      NA's   :3     
##    govt_size    slavery_hurts   gender_roles  indep_respect 
##  Min.   :1.00   Min.   :1.00   Min.   :1.00   Min.   :1.00  
##  1st Qu.:1.00   1st Qu.:2.00   1st Qu.:2.00   1st Qu.:1.00  
##  Median :1.00   Median :3.00   Median :4.00   Median :2.00  
##  Mean   :1.49   Mean   :2.96   Mean   :3.22   Mean   :1.74  
##  3rd Qu.:2.00   3rd Qu.:4.00   3rd Qu.:4.00   3rd Qu.:2.00  
##  Max.   :2.00   Max.   :5.00   Max.   :7.00   Max.   :2.00  
##  NA's   :11     NA's   :4      NA's   :23     NA's   :7     
##   obama_muslim          target   
##  Min.   : 1.00   HRC_Strong:609  
##  1st Qu.: 4.00   HRC_Weak  :243  
##  Median : 8.00   DJT_Weak  :205  
##  Mean   : 7.01   DJT_Strong:586  
##  3rd Qu.:10.00                   
##  Max.   :10.00                   
##  NA's   :84

Looking at the data summaries for the training dataset, we can see that only the variable religious_attendance has a notable amount of missing values. However, other questions that were possibly uncomfortable for respondents like obama_muslim and obamacare_costs also saw higher than average non-responses.

Visualizations

Pie Chart

pie(table(anes_data[n, "target"]), main="Major Party Vote Breakdown", 
    labels = paste0(levels(anes_data[n, "target"]), 
    " (", table(anes_data[n, "target"]), ")"),
    col=c("blue", "lightblue", "pink", "red"))

Histograms

par(mfrow = c(3,5))
for (i in 1:(ncol(anes_data)-1)) {
  hist(anes_data[n,i], xlab = names(anes_data[i]), 
    main = names(anes_data[i]), col="grey", ylab="", cex = .5)
}

Scatterplots

par(mfrow = c(2,5), cex = .5)
for (i in interval) {
  plot(anes_data[n, i], main = names(anes_data[i]), ylab = "", xlab = "") 
}

As expected, many respondents chose answers with multiples of ten on these thermometer questions. There’s largly warm feelings for Christians, Jews, police, and scientists, while there is a broader range of opinion towards the other groups.

Density Curves

par(mfrow = c(2,5))
for (i in interval) {
  d <- density(anes_data[n,i], na.rm=T)
  plot(d, main = names(anes_data[i]), xlab = "", ylab = "")
  polygon(d, col="red")
}

The thermometer questions focusing on more popular groups do show some skewing.

Scatterplots with Densities

par(mfrow = c(3,5), cex = .5)
for (i in ordinal) {
  smoothScatter(anes_data[n,i], main = names(anes_data[i]), ylab = "", 
    xlab = "", colramp = colorRampPalette(c("white", "red")))
}

Correlation Heatmap

library(ggplot2)
library(reshape2)
ggplot(data = melt(abs(cor(na.omit(sapply(anes_data[n,], as.numeric))))), 
       aes(x=Var1, y=Var2, fill=value)) +
  scale_fill_gradient(low = 'black', high = 'red', name = "Absolute Value") +
  geom_tile() + labs(title = "Correlation Heatmap") +
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90, hjust = 1),
        plot.title = element_text(hjust = 0.5))

The correlation between different predictor variables and the response variable range from low to moderate with several interesting correlations surfacing. Some of the strongest correlations in the dataset underscore differences in ideological views with respect to polarizing social issues. For example, the candidate preference and views toward slavery, Muslims, LGBT, and feminists. The variables reflecting views toward financial situation, economic mobility, Jews, and justice recall show weak correlations with the response variable in this dataset.

Principal Component Analysis

PCA <- function(X) {
  Xpca <- prcomp(na.omit(X), center = T, scale. = T) 
  M <- as.matrix(na.omit(X)); R <- as.matrix(Xpca$rotation); score <- M %*% R
  print(list("Importance of Components" = summary(Xpca)$importance[ ,1:4], 
             "Correlation between X and PC" = cor(na.omit(X), score)[ ,1:4]))
  par(mfrow=c(2,3))
  barplot(Xpca$sdev^2, ylab = "Component Variance")
  barplot(cor(cbind(X)), ylab = "Correlations")
  barplot(Xpca$rotation, ylab = "Loadings")  
  biplot(Xpca); barplot(M); barplot(score)
}
PCA(anes_data[n, -ncol(anes_data)])

## $`Importance of Components`
##                             PC1      PC2      PC3      PC4
## Standard deviation     2.289431 1.577647 1.281814 1.272464
## Proportion of Variance 0.180740 0.085830 0.056660 0.055830
## Cumulative Proportion  0.180740 0.266570 0.323220 0.379060
## 
## $`Correlation between X and PC`
##                            PC1         PC2         PC3          PC4
## relig_attendance   0.265338848 -0.02280027  0.34318542 -0.143482872
## television_usage   0.033359822  0.10364935  0.06844106  0.018800255
## internet_usage     0.105157424 -0.01657748  0.21907764  0.064051446
## internet_activity -0.088667801 -0.07507725 -0.14642145  0.008227587
## justice_recall     0.078548773  0.06431838  0.10633371  0.015604102
## therm_feminists    0.809791418  0.48473921  0.28867202 -0.671289226
## therm_unions       0.594001703  0.38144789 -0.16816468 -0.810234883
## therm_business    -0.226148341  0.38069274 -0.36364689  0.327125887
## therm_glbt         0.773044137  0.53249547  0.66795435 -0.308026889
## therm_rich        -0.096355140  0.58089544 -0.11216237  0.360144314
## therm_muslims      0.741941368  0.57618468  0.22548280 -0.259986999
## therm_christians  -0.325055425  0.37969196 -0.35136870  0.043635989
## therm_jews         0.261883605  0.71326233  0.39997866 -0.017413341
## therm_police      -0.343015481  0.38333869  0.15572940  0.182520215
## therm_scientists   0.489974257  0.51822977  0.39827632 -0.434782083
## nationalism        0.390678945  0.06181492  0.16733105 -0.148464933
## flag_pride         0.374163774 -0.06795340  0.02797379 -0.200772844
## economic_mobility -0.036974799 -0.02210561  0.16957486 -0.017452311
## reducing_debt      0.361220906  0.01810991  0.10008120 -0.237668711
## millionaires_tax  -0.332858748 -0.05252630 -0.11312952  0.388522837
## obamacare_costs    0.297638254  0.05494702  0.05207091 -0.224465940
## immig_hurts_jobs   0.447030145  0.12678283  0.16472569 -0.174125255
## terror_anxiety     0.176923537  0.02024504 -0.08249539 -0.068155614
## fin_situation     -0.000116906  0.06990652 -0.14283385  0.125610520
## govt_size          0.370406028  0.10350955 -0.02558070 -0.409500128
## slavery_hurts     -0.478707123 -0.14433811 -0.05284779  0.304707633
## gender_roles       0.352030378  0.01584006  0.22952265 -0.226253090
## indep_respect     -0.289208353 -0.02279324 -0.18811455  0.110489743
## obama_muslim       0.543650372  0.13055775  0.18700451 -0.244974494

We see again that the main underlying component explaining the variance of the data appears to focus on inflamatory social issues, highlighted by the importance of therm_feminists, therm_lgbt, therm_muslims, and slavery_hurts, which additionally had correlation with presidential_vote. The second, and less important, pattern revolves around sentiment possibly involving anti-semetic conspiracy theories, highlighted by the importance of therm_rich and therm_jews.

Data Preparation

Impute Using Mice Package

Impute missing values with means.

library(mice)
library(VIM)
aggr(anes_data[n, -ncol(anes_data)], bars=F, sortVars=T)

## 
##  Variables sorted by number of missings: 
##           Variable        Count
##   relig_attendance 0.3749239197
##       obama_muslim 0.0511259890
##    obamacare_costs 0.0213024954
##      therm_muslims 0.0170419963
##    therm_feminists 0.0146074254
##       gender_roles 0.0139987827
##         therm_jews 0.0133901400
##         therm_rich 0.0103469264
##   therm_christians 0.0097382836
##       therm_unions 0.0079123554
##     therm_business 0.0079123554
##   therm_scientists 0.0079123554
##         therm_glbt 0.0073037127
##          govt_size 0.0066950700
##       therm_police 0.0042604991
##  economic_mobility 0.0042604991
##      indep_respect 0.0042604991
##      reducing_debt 0.0030432136
##   millionaires_tax 0.0030432136
##   immig_hurts_jobs 0.0030432136
##      slavery_hurts 0.0024345709
##      fin_situation 0.0018259282
##     internet_usage 0.0012172855
##  internet_activity 0.0012172855
##     terror_anxiety 0.0012172855
##        nationalism 0.0006086427
##         flag_pride 0.0006086427
##   television_usage 0.0000000000
##     justice_recall 0.0000000000

data <- anes_data[n, -ncol(anes_data)]
MICE <- mice(data, predictorMatrix = quickpred(data), method = "mean", printFlag = F)
anes_data[n, -ncol(anes_data)] <- complete(MICE, action = 1)
aggr(anes_data[n, -ncol(anes_data)], bars=F)

data <- anes_data[-n, -ncol(anes_data)]
MICE <- mice(data, predictorMatrix = quickpred(data), method = "mean", printFlag = F)
anes_data[-n, -ncol(anes_data)] <- complete(MICE, action = 1)

Correlation Tests

library("DT")
display <- function(data) {
  datatable(data, options = list(
    searching = TRUE,
    pageLength = 5,
    lengthMenu = c(5, nrow(data))
    ), rownames = FALSE)
}
library(reshape2)
Corr_XY <- function(X, Y) {
  corr <- data.frame(array(NA, dim = c(ncol(X), 5)))
  colnames(corr) <- c("Y", "X", "r","p","<0.05")
  for (i in 1:ncol(X)) {
    r <- cor.test(as.numeric(Y), X[, i])
    corr[i, 1] <- "target"
    corr[i, 2] <- names(X[i])
    corr[i, 3] <- round(r$estimate, 6)
    corr[i, 4] <- round(r$p.value, 6)
    corr[i, 5] <- corr[i, 4] < 0.05
  }
  return(corr)
}
Corr_XX <- function(X, threshold) {
  corr <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
  colnames(corr) <- c("X1", "X2", "r","p","<0.05"); k = 1
  for (i in 1:(ncol(X) - 1)) {
    for (j in (i+1):ncol(X)) {
      r <- cor.test(X[,i], X[,j])
      corr[k, 1] <- names(X[i])
      corr[k, 2] <- names(X[j])
      corr[k, 3] <- round(r$estimate, 6)
      corr[k, 4] <- round(r$p.value, 6)
      corr[k, 5] <- corr[k, 4] < 0.05
      k = k + 1
    }
  }
  least <- corr[corr[,"<0.05"] == F, ]
  most <- corr[abs(corr[,"r"]) >= threshold, ]
  result <- list("Correlations" = corr, "Least_Correlated" = least, "Most_Correlated" = most)
  return(result)
}

Between \(Y\) and \(X\) Variables

correlations <- Corr_XY(anes_data[n, -ncol(anes_data)], anes_data[n, ncol(anes_data)])
display(correlations)

The predictor variables justice_recall and fin_situation do not have statistically significant correlations with the response variable and are therefore not being considered for the model.

remove <- c("justice_recall", "fin_situation")
anes_data <- anes_data[ , -which(names(anes_data) %in% remove)]

Between \(X\) Variables

correlations <- Corr_XX(anes_data[n, -ncol(anes_data)], 0.40)

Least Correlated

display(correlations$Least_Correlated)

Here we see that the behavior related variables of relig_attendance, television_usage, internet_usage, and economic_mobility have weak correlations with many issue focused variables.

Most Correlated

display(correlations$Most_Correlated)

When predictor variables are correlated, the model can vary wildly dependent on which combination of the correlated variables are included. Different combinations can lead to vastly different and less precise regression coefficients, error sum of squares varies, and hypothesis test results. The most correlated variables in this data set are moderately correlated, not highly correlated. Under careful examination, it is possible that some of those moderately correlated variables are measuring the same thing. For example (therm_feminists & therm_glbt), (therm_business & therm_rich), (therm_muslims & obama_muslim), and (therm_police & flag_pride). Therefore, the \(X\) variable in these pairs that is least correlated with \(Y\) will not be considered for the model.

remove <- c("therm_glbt", "therm_rich", "therm_muslims", "therm_police")
anes_data <- anes_data[ , -which(names(anes_data) %in% remove)]

Box-Cox Transformation

Box-Cox Power Test

The optimal power transformation is found via the Box-Cox Test where:

\[-2\Rightarrow \frac { 1 }{ Y^{ 2 } }, \quad \quad -1\Rightarrow \frac { 1 }{ Y }, \quad \quad -0.5\Rightarrow \frac { 1 }{ \sqrt { Y } }, \quad \quad 0\Rightarrow \log { \left( Y \right) }, \quad \quad 0.5\Rightarrow \sqrt { Y }, \quad \quad 1\Rightarrow Y, \quad \quad 2\Rightarrow { Y }^{ 2 }\]

library(car)

## Warning: package 'car' was built under R version 3.3.2

interval <- grep("^therm", colnames(anes_data))
box.cox.powers <- powerTransform(anes_data[n, interval] + 1e-32, family="bcPower")
summary(box.cox.powers)$result

##                  Est.Power    Std.Err. Wald Lower Bound Wald Upper Bound
## therm_feminists  0.2563347 0.006073274        0.2444311        0.2682383
## therm_unions     0.2938270 0.007078411        0.2799533        0.3077007
## therm_business   0.2694247 0.006428981        0.2568239        0.2820255
## therm_christians 0.5218412 0.016743361        0.4890243        0.5546582
## therm_jews       0.5694112 0.021088975        0.5280768        0.6107456
## therm_scientists 0.7311165 0.035138996        0.6622441        0.7999890

The Box-Cox Power Test results provide support for raising all the interval variables to powers greater than 0.25 and less than 0.75. Values were shifted ever so slightly to the right with a negligible value (+\(1^{-32}\)) to allow for the calculations to take place on variables that have zeros since the function can only handle \(\mathbb{R}^+\) values.

Box-Cox Power Transformation

interval <- grep("^therm", colnames(anes_data)); k = 1
for (i in interval) {
  var_name <- paste0("bc_", names(anes_data[i]))
  anes_data[, var_name] <- anes_data[, i]^(box.cox.powers$lambda[k])
  k = k + 1
}
anes_data <- anes_data[, -interval]

Build Models

\(AIC\) and \(BIC\) will be used to determine potential models. Adjusted \(R^2\) and Mallows \(C_p\) are not appropriate for logistic regressions.

library(MASS)
null <- polr(target ~ 1, anes_data[n, ], Hess=T)
full <- polr(target ~ ., anes_data[n, ], Hess=T)

AIC Stepwise Selection

Stepwise subset selection based on \(AIC\). Using \(k = 2\) degrees of freedom for the penalty gives the genuine \(AIC\).

Forward Selection

AIC_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = 2, trace=F)
AIC_steps$anova

##                     Step Df   Deviance Resid. Df Resid. Dev      AIC
## 1                        NA         NA      1640   4199.274 4205.274
## 2         + obama_muslim -1 643.801993      1639   3555.472 3563.472
## 3            + govt_size -1 200.687939      1638   3354.784 3364.784
## 4   + bc_therm_feminists -1 119.631397      1637   3235.152 3247.152
## 5     + immig_hurts_jobs -1  90.466570      1636   3144.686 3158.686
## 6        + slavery_hurts -1  66.445149      1635   3078.241 3094.241
## 7     + millionaires_tax -1  42.032839      1634   3036.208 3054.208
## 8        + reducing_debt -1  24.996449      1633   3011.211 3031.211
## 9      + obamacare_costs -1  19.778353      1632   2991.433 3013.433
## 10   + economic_mobility -1  16.315463      1631   2975.118 2999.118
## 11          + flag_pride -1  14.990560      1630   2960.127 2986.127
## 12 + bc_therm_scientists -1  11.633122      1629   2948.494 2976.494
## 13    + relig_attendance -1   6.452458      1628   2942.041 2972.041
## 14        + gender_roles -1   3.461286      1627   2938.580 2970.580
## 15      + terror_anxiety -1   2.285315      1626   2936.295 2970.295
## 16         + nationalism -1   2.639941      1625   2933.655 2969.655

AIC_steps$call$formula

## target ~ obama_muslim + govt_size + bc_therm_feminists + immig_hurts_jobs + 
##     slavery_hurts + millionaires_tax + reducing_debt + obamacare_costs + 
##     economic_mobility + flag_pride + bc_therm_scientists + relig_attendance + 
##     gender_roles + terror_anxiety + nationalism

The above model has the lowest \(AIC\).

Backward Elimination

AIC_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = 2, trace=F)
AIC_steps$anova

##                    Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                       NA        NA      1617   2925.654 2977.654
## 2       - bc_therm_jews  1 0.1385388      1618   2925.792 2975.792
## 3   - bc_therm_business  1 0.3030515      1619   2926.095 2974.095
## 4     - bc_therm_unions  1 0.6898720      1620   2926.785 2972.785
## 5      - internet_usage  1 0.8280503      1621   2927.613 2971.613
## 6       - indep_respect  1 0.9317921      1622   2928.545 2970.545
## 7   - internet_activity  1 1.5267602      1623   2930.072 2970.072
## 8 - bc_therm_christians  1 1.7219694      1624   2931.794 2969.794
## 9    - television_usage  1 1.8610353      1625   2933.655 2969.655

AIC_steps$call$formula

## target ~ relig_attendance + nationalism + flag_pride + economic_mobility + 
##     reducing_debt + millionaires_tax + obamacare_costs + immig_hurts_jobs + 
##     terror_anxiety + govt_size + slavery_hurts + gender_roles + 
##     obama_muslim + bc_therm_feminists + bc_therm_scientists

The above model has the lowest \(AIC\). These are the same variables obtained through forward selection.

AIC Selected Model

The forward selection and backward elimination stepwise selection methods using \(AIC\) both yield models with the same exact variables.

AIC <- polr(AIC_steps$call$formula, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(AIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = AIC_steps$df.residual) * 2
cbind(coef_sum, "p value" = round(p, 6))

##                           Value Std. Error    t value  p value
## relig_attendance    -0.12252747 0.05684933  -2.155302 0.031285
## nationalism         -0.08374519 0.05152076  -1.625465 0.104257
## flag_pride          -0.16529907 0.05105603  -3.237601 0.001230
## economic_mobility    0.14686348 0.03206333   4.580419 0.000005
## reducing_debt       -0.30240516 0.06821625  -4.433037 0.000010
## millionaires_tax     0.69324163 0.12167283   5.697588 0.000000
## obamacare_costs     -0.53259762 0.13727545  -3.879773 0.000109
## immig_hurts_jobs    -0.37180933 0.06270353  -5.929640 0.000000
## terror_anxiety       0.08422079 0.05138492   1.639018 0.101403
## govt_size           -0.84867211 0.11476028  -7.395173 0.000000
## slavery_hurts        0.26019290 0.04350841   5.980290 0.000000
## gender_roles        -0.07736955 0.04237509  -1.825826 0.068060
## obama_muslim        -0.27473342 0.02294422 -11.973972 0.000000
## bc_therm_feminists  -0.69344932 0.12118874  -5.722060 0.000000
## bc_therm_scientists -0.04257146 0.01328927  -3.203446 0.001384

At a significance level of \(\alpha=0.5\) three of the variables are insignificant: gender_roles, terror_anxiety, and nationalism. Removing those three insignificant variables yields a model with all significant variables.

AIC <- polr(target ~ relig_attendance + flag_pride + economic_mobility + reducing_debt + 
    millionaires_tax + obamacare_costs + immig_hurts_jobs + govt_size + slavery_hurts + 
    obama_muslim + bc_therm_feminists + bc_therm_scientists, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(AIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = AIC_steps$df.residual + 3) * 2
cbind(coef_sum, "p value" = round(p, 6))

##                           Value Std. Error    t value  p value
## relig_attendance    -0.14256807 0.05620478  -2.536583 0.011287
## flag_pride          -0.18134007 0.04878500  -3.717128 0.000208
## economic_mobility    0.14397162 0.03192266   4.510014 0.000007
## reducing_debt       -0.29402930 0.06711412  -4.381035 0.000013
## millionaires_tax     0.70651459 0.12141525   5.818994 0.000000
## obamacare_costs     -0.51566927 0.13697205  -3.764777 0.000173
## immig_hurts_jobs    -0.37159820 0.06162243  -6.030243 0.000000
## govt_size           -0.85599235 0.11453735  -7.473478 0.000000
## slavery_hurts        0.26580310 0.04317729   6.156086 0.000000
## obama_muslim        -0.27952883 0.02281841 -12.250146 0.000000
## bc_therm_feminists  -0.73048679 0.12109927  -6.032132 0.000000
## bc_therm_scientists -0.04308651 0.01328436  -3.243401 0.001205

AIC Odds Ratio

Logistic regression coefficients are in terms of the log odds. The odds ratio is therefore obtained through exponentiation.

t(t(exp(coef_sum[,1])))

##                          [,1]
## relig_attendance    0.8671285
## flag_pride          0.8341516
## economic_mobility   1.1548513
## reducing_debt       0.7452547
## millionaires_tax    2.0269143
## obamacare_costs     0.5971008
## immig_hurts_jobs    0.6896313
## govt_size           0.4248614
## slavery_hurts       1.3044782
## obama_muslim        0.7561399
## bc_therm_feminists  0.4816745
## bc_therm_scientists 0.9578285

Given an ordered target variable that ranges from 1 to 4 (HRC_Strong, HRC_Weak, DJT_Weak, DJT_Strong), each of the variable’s odds indicate how, holding the other variables in the model constant, a change in the given variable would increase a subject’s odds of being in a higher target variable category. Here we see that many of the larger odds above – especially belief that economic_mobility is harder and disagreeing that slavery_hurts– have a larger impact on increasing the odds of being a voter with a strong preference for Trump. Interestingly, opposition to a millionaires_tax has the largest impact increasing the odds of being a voter with a strong preference for Trump.

BIC Stepwise Selection

Stepwise subset selection based on \(BIC\). Using \(k = log(n)\) degrees of freedom for the penalty is sometimes referred to as \(BIC\).

Forward Selection

BIC_steps <- step(null, scope=list(lower=null, upper=full), direction="forward", k = log(length(n)), trace=F)
BIC_steps$anova

##                     Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                        NA        NA      1640   4199.274 4221.487
## 2         + obama_muslim -1 643.80199      1639   3555.472 3585.089
## 3            + govt_size -1 200.68794      1638   3354.784 3391.805
## 4   + bc_therm_feminists -1 119.63140      1637   3235.152 3279.578
## 5     + immig_hurts_jobs -1  90.46657      1636   3144.686 3196.516
## 6        + slavery_hurts -1  66.44515      1635   3078.241 3137.475
## 7     + millionaires_tax -1  42.03284      1634   3036.208 3102.846
## 8        + reducing_debt -1  24.99645      1633   3011.211 3085.254
## 9      + obamacare_costs -1  19.77835      1632   2991.433 3072.880
## 10   + economic_mobility -1  16.31546      1631   2975.118 3063.969
## 11          + flag_pride -1  14.99056      1630   2960.127 3056.383
## 12 + bc_therm_scientists -1  11.63312      1629   2948.494 3052.154

BIC_steps$call$formula

## target ~ obama_muslim + govt_size + bc_therm_feminists + immig_hurts_jobs + 
##     slavery_hurts + millionaires_tax + reducing_debt + obamacare_costs + 
##     economic_mobility + flag_pride + bc_therm_scientists

The above model has the lowest \(BIC\).

Backward Elimination

BIC_steps <- step(full, scope=list(lower=null, upper=full), direction="backward", k = log(length(n)), trace=F)
BIC_steps$anova

##                     Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                        NA        NA      1617   2925.654 3118.165
## 2        - bc_therm_jews  1 0.1385388      1618   2925.792 3110.899
## 3    - bc_therm_business  1 0.3030515      1619   2926.095 3103.798
## 4      - bc_therm_unions  1 0.6898720      1620   2926.785 3097.084
## 5       - internet_usage  1 0.8280503      1621   2927.613 3090.508
## 6        - indep_respect  1 0.9317921      1622   2928.545 3084.035
## 7    - internet_activity  1 1.5267602      1623   2930.072 3078.158
## 8  - bc_therm_christians  1 1.7219694      1624   2931.794 3072.475
## 9     - television_usage  1 1.8610353      1625   2933.655 3066.932
## 10         - nationalism  1 2.6399410      1626   2936.295 3062.168
## 11      - terror_anxiety  1 2.2853148      1627   2938.580 3057.049
## 12        - gender_roles  1 3.4612857      1628   2942.041 3053.106
## 13    - relig_attendance  1 6.4524580      1629   2948.494 3052.154

BIC_steps$call$formula

## target ~ flag_pride + economic_mobility + reducing_debt + millionaires_tax + 
##     obamacare_costs + immig_hurts_jobs + govt_size + slavery_hurts + 
##     obama_muslim + bc_therm_feminists + bc_therm_scientists

The above model has the lowest \(BIC\). These are the same variables obtained through forward selection.

BIC Selected Model

The forward selection and backward elimination stepwise selection methods using \(BIC\) both yield models with the same exact variables.

BIC <- polr(BIC_steps$call$formula, anes_data[n, ], Hess=T)
coef_sum <- head(coef(summary(BIC)), -3)
p <- pt(abs(coef_sum[, "t value"]), lower.tail = F, df = BIC_steps$df.residual) * 2
cbind(coef_sum, "p value" = round(p, 6))

##                           Value Std. Error    t value  p value
## flag_pride          -0.18461879 0.04878601  -3.784257 0.000160
## economic_mobility    0.13716700 0.03179330   4.314336 0.000017
## reducing_debt       -0.30092538 0.06714031  -4.482037 0.000008
## millionaires_tax     0.71386167 0.12121665   5.889139 0.000000
## obamacare_costs     -0.52786842 0.13701421  -3.852654 0.000121
## immig_hurts_jobs    -0.37050047 0.06155459  -6.019055 0.000000
## govt_size           -0.84488150 0.11439299  -7.385780 0.000000
## slavery_hurts        0.25922013 0.04306699   6.018997 0.000000
## obama_muslim        -0.28192779 0.02278314 -12.374403 0.000000
## bc_therm_feminists  -0.74243295 0.12124417  -6.123453 0.000000
## bc_therm_scientists -0.04513501 0.01325575  -3.404938 0.000678

At a significance level of \(\alpha=0.5\) none of the variables are insignificant.

BIC Odds Ratio

Logistic regression coefficients are in terms of the log odds. The odds ratio is therefore obtained through exponentiation.

t(t(exp(coef_sum[,1])))

##                          [,1]
## flag_pride          0.8314212
## economic_mobility   1.1470197
## reducing_debt       0.7401330
## millionaires_tax    2.0418611
## obamacare_costs     0.5898610
## immig_hurts_jobs    0.6903887
## govt_size           0.4296083
## slavery_hurts       1.2959190
## obama_muslim        0.7543282
## bc_therm_feminists  0.4759545
## bc_therm_scientists 0.9558684

Given an ordered target variable that ranges from 1 (Strong Hillary) to 4 (Strong Trump), each of the variable odds indicate how, holding the other variables in the model constant, a change in the variable would increase a subject’s odds of being in a higher target variable category. The \(BIC\) odds ratios are very similar to the \(AIC\) model, with the variable millionaires_tax showing potential for more growth. The main difference from the \(AIC\) model however, is that the \(BIC\) model does not agree that the impact of relig_attendance is significant.

Select Models

Evaluate Models

Confusion Matrix

library(caret)
AIC_Scores <- predict(AIC, anes_data[n, ], type="prob")
AIC_Class <- predict(AIC, anes_data[n, ], type="class")
(cm1 <- confusionMatrix(AIC_Class, anes_data[n, "target"]))

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HRC_Strong HRC_Weak DJT_Weak DJT_Strong
##   HRC_Strong        521      182       27         51
##   HRC_Weak           25       13       22         25
##   DJT_Weak            0        0        0          0
##   DJT_Strong         63       48      156        510
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6354          
##                  95% CI : (0.6116, 0.6587)
##     No Information Rate : 0.3707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4369          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: HRC_Strong Class: HRC_Weak Class: DJT_Weak
## Sensitivity                     0.8555        0.053498          0.0000
## Specificity                     0.7485        0.948571          1.0000
## Pos Pred Value                  0.6671        0.152941             NaN
## Neg Pred Value                  0.8979        0.852375          0.8752
## Prevalence                      0.3707        0.147900          0.1248
## Detection Rate                  0.3171        0.007912          0.0000
## Detection Prevalence            0.4753        0.051735          0.0000
## Balanced Accuracy               0.8020        0.501035          0.5000
##                      Class: DJT_Strong
## Sensitivity                     0.8703
## Specificity                     0.7474
## Pos Pred Value                  0.6564
## Neg Pred Value                  0.9122
## Prevalence                      0.3567
## Detection Rate                  0.3104
## Detection Prevalence            0.4729
## Balanced Accuracy               0.8089

The \(AIC\) model has an Accuracy of 0.635423 and Error Rate of 0.364577.

BIC_Scores <- predict(BIC, anes_data[n, ], type="prob")
BIC_Class <- predict(BIC, anes_data[n, ], type="class")
(cm2 <- confusionMatrix(BIC_Class, anes_data[n, "target"]))

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HRC_Strong HRC_Weak DJT_Weak DJT_Strong
##   HRC_Strong        524      179       30         51
##   HRC_Weak           26        9       19         26
##   DJT_Weak            0        0        0          0
##   DJT_Strong         59       55      156        509
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6342          
##                  95% CI : (0.6104, 0.6575)
##     No Information Rate : 0.3707          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4345          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: HRC_Strong Class: HRC_Weak Class: DJT_Weak
## Sensitivity                     0.8604        0.037037          0.0000
## Specificity                     0.7485        0.949286          1.0000
## Pos Pred Value                  0.6684        0.112500             NaN
## Neg Pred Value                  0.9010        0.850288          0.8752
## Prevalence                      0.3707        0.147900          0.1248
## Detection Rate                  0.3189        0.005478          0.0000
## Detection Prevalence            0.4772        0.048691          0.0000
## Balanced Accuracy               0.8045        0.493161          0.5000
##                      Class: DJT_Strong
## Sensitivity                     0.8686
## Specificity                     0.7446
## Pos Pred Value                  0.6534
## Neg Pred Value                  0.9109
## Prevalence                      0.3567
## Detection Rate                  0.3098
## Detection Prevalence            0.4741
## Balanced Accuracy               0.8066

The \(BIC\) model has an Accuracy of 0.6342057 and Error Rate of 0.3657943.

Area under the Surface

The ROC curve that plots two predictor variables becomes a ROC surface when more variables lead to higher dimensions.

library(pROC)
AIC_mcr <- multiclass.roc(anes_data[n, "target"], as.numeric(AIC_Class))
BIC_mcr <- multiclass.roc(anes_data[n, "target"], as.numeric(BIC_Class))
AIC_mcr$auc

## Multi-class area under the curve: 0.7579

BIC_mcr$auc

## Multi-class area under the curve: 0.7541

The AIC model has a greater area under the surface but not by much.

Performance Visualizations

par(mfrow=c(1,2), yaxt="n", cex=0.55)
x <- factor(AIC_mcr$predictor, levels=c(1:4), labels=levels(anes_data$target))
y <- AIC_mcr$response
z <- data.frame(x, y)
plot(z, main = "AIC Predictions", xlab = "", ylab="", col=c("blue", "lightblue", "pink", "red"))
x <- factor(BIC_mcr$predictor, levels=c(1:4), labels=levels(anes_data$target))
y <- BIC_mcr$response
z <- data.frame(x, y)
plot(z, main = "BIC Predictions", xlab = "", ylab="", col=c("blue", "lightblue", "pink", "red"))

The bars represent model predictions and the colors represent the actual data (HRC_Strong, HRC_Weak, DJT_Weak, DJT_Strong). Overall, the AIC model has a higher accuracy and larger area under the surface. It is worth noting that the difference in both metrics is minimal, but for purposes of this analysis, the slightly better model will be the one chosen.

Predictions

(predicted <- table(predict(AIC, anes_data[-n, ], type="class")))

## 
## HRC_Strong   HRC_Weak   DJT_Weak DJT_Strong 
##        398         39          0        385

Prevalence

actual <- table(anes_data[-n, "target"])
rbind(actual, predicted) / (N - length(n))

##           HRC_Strong   HRC_Weak  DJT_Weak DJT_Strong
## actual     0.4014599 0.12895377 0.1386861  0.3309002
## predicted  0.4841849 0.04744526 0.0000000  0.4683698

chisq.test(actual, predicted)

## 
##  Pearson's Chi-squared test
## 
## data:  actual and predicted
## X-squared = 12, df = 9, p-value = 0.2133

The prevalence of each condition in the acutal and predicted data is above. Although there is some difference in these figures, the difference is not significant at an \(\alpha=0.05\) as can be seen in the \(\chi^2\) goodness of fit test. The null hypothesis for the \(\chi^2\) goodness of fit test for multinomial distribution is that the observed (predicted) frequency is equal to an expected count (actual) in each category. In other words, the data are consistent with a specified distribution. With a \(p\)-value greater than \(\alpha=0.05\), we fail to reject the null hypothesis. The data supports the conclusion that the model is a good fit for the data.

Margin of Error

p_e <- 0.5 # a priori proportion
N_e <- 235248000 # voting age population
n_e <- length(n) # sample size
a_e <- 1 - 0.95 # confidence level
z_e <- qnorm(1 - a_e / 2) # z-score
(e <- z_e * sqrt((p_e * (1 - p_e) / n_e) * ((N_e - n_e) / (N_e - 1)))) # Margin of Error