Introduction:

The purpose of this investigation is to determine whether we can infer a relationship between the education level and financial status of Americans and their party affiliation. The scope of this project does not include a simultaneous analysis of the categorical variable gss$partyid against both education level (gss$educ) and financial status (gss$coninc, i.e. total family income in constant dollars), so we test each numerical variable against party affiliation separately.

Recently a Quoran asked a question (Easton 2015) titled

“How do Republicans explain the fact that higher education has such a high correlation with voting Democrat?”.

Respondents gave mixed answers, and none of them was backed by a statistical analysis that could support or refute the hypothesis. The null hypothesis is, of course, "There is no correlation between education level and party affiliation", and we test whether the data support the alternative hypothesis, "There is a correlation between education level and party affiliation". The nature of that correlation is beyond the scope of this paper, though we agree it would be an interesting study to perform.

We also extend the study to see whether a similar correlation exists with financial status. Here the null hypothesis is “There is no correlation between financial status and party affiliation”, and the alternative hypothesis is that such a correlation exists.
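Stated a little more formally, and assuming each test compares group means as a one-way ANOVA does, both pairs of hypotheses take the same form. Here \(\mu_i\) is a symbol introduced only for illustration: the mean of the response (years of education, or family income) in the \(i\)-th of the 8 party affiliation groups.

\[H_0: \mu_1 = \mu_2 = \cdots = \mu_8\]

\[H_A: \text{at least one } \mu_i \text{ differs from the others}\]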

Data:

Datasets for the project

The dataset was one of the two datasets provided by Duke University as part of the project. It is an extract of the General Social Survey (GSS) Cumulative File 1972-2012, a sociological survey that collects data on the demographic characteristics and attitudes of residents of the United States.

We were only interested in three fields for this project.

The field gss$partyid, party affiliation, is a categorical variable with 8 levels.

> unique(gss$partyid)
[1] Ind,Near Dem       Not Str Democrat   Independent        Strong Democrat   
[5] Not Str Republican Ind,Near Rep       Strong Republican  Other Party       
[9] <NA>              
8 Levels: Strong Democrat Not Str Democrat Ind,Near Dem Independent ... Other Party
> summary(gss$partyid)
   Strong Democrat   Not Str Democrat       Ind,Near Dem        Independent 
              9117              12040               6743               8499 
      Ind,Near Rep Not Str Republican  Strong Republican        Other Party 
              4921               9005               5548                861 
              NA's 
               327 

We can see that every party affiliation level is well represented. The population of the United States was 318.9 million as of 2014, so the sample is far less than 10% of the population and we can reasonably treat the observations as independent. Moreover, the sample is large enough that even notable skewness should have a negligible impact on the study.
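The 10% condition can be checked with a few lines of R. This is only a sketch: the counts are copied from the summary(gss$partyid) output above, and the variable names (counts, n_sample, population, fraction) are ours.

```r
# Counts per partyid level, copied from the summary(gss$partyid) output above.
counts <- c(
  "Strong Democrat" = 9117, "Not Str Democrat" = 12040,
  "Ind,Near Dem" = 6743, "Independent" = 8499,
  "Ind,Near Rep" = 4921, "Not Str Republican" = 9005,
  "Strong Republican" = 5548, "Other Party" = 861,
  "NA" = 327
)

n_sample <- sum(counts)        # total respondents in the extract
population <- 318.9e6          # US population figure quoted in the text
fraction <- n_sample / population

fraction < 0.10                # TRUE when the 10% condition holds
```

The sample is a tiny fraction of the population, so the condition holds with a wide margin.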

gss$educ, education level, is a numeric variable of type integer. The sample distribution of education level is not normal: it is left-skewed and contains outliers. The 164 NA entries are removed before the data is analyzed.

load(url("http://bit.ly/dasi_gss_data"))
library(ggplot2)
summary(gss$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.00   12.00   12.75   15.00   20.00     164
boxplot(na.omit(gss$educ), main = "Boxplot of Education Level")

hist(na.omit(gss$educ), main = "Histogram of Education Level")

Similarly, gss$coninc, financial status (in our case family income in constant dollars), is also a numeric integer variable. Its distribution is right-skewed (the mean, 44500, is well above the median, 35600) and has obvious outliers.

boxplot(na.omit(gss$coninc), main = "Boxplot of Family Income")

summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829
hist(na.omit(gss$coninc), main = "Histogram of Family Income")

Fortunately, we will be using ANOVA, which is not very sensitive to moderate departures from normality. Simulation studies using a variety of non-normal distributions have shown that the false-positive rate is not greatly affected by this violation (Glass, Peckham, and Sanders 1972; Harwell et al. 1992; Lix, Keselman, and Keselman 1996). The reason is that, for sufficiently large random samples from a population, the sample means are approximately normally distributed even when the population itself is not normal.
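The robustness claim can be illustrated with a small simulation. This is a sketch under our own assumptions, not a reproduction of the cited studies: we use three hypothetical groups of 50 observations, all drawn from the same right-skewed exponential distribution, so the null hypothesis is true by construction.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n_sims <- 1000                   # number of simulated experiments
n_per_group <- 50                # hypothetical group size
groups <- factor(rep(c("a", "b", "c"), each = n_per_group))

p_values <- replicate(n_sims, {
  # Every group is drawn from the same right-skewed population,
  # so any rejection of the null hypothesis is a false positive.
  y <- rexp(length(groups), rate = 1)
  anova(lm(y ~ groups))[["Pr(>F)"]][1]
})

# Share of simulations that falsely reject at the 5% level.
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

If ANOVA tolerates this kind of skewness, the printed rate should stay close to the nominal 0.05.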

Exploratory data analysis:

The categorical variable we are considering in this study has 8 levels, so we have to compare means across several groups. Testing every pair of groups separately is impractical and inflates the chance of a false positive, so we opt for analysis of variance (ANOVA) (Diez, Barr, and Cetinkaya-Rundel 2015) and perform the F test to determine whether at least one group mean differs, which in our case is the alternative hypothesis.
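For reference, the F statistic used by one-way ANOVA can be written as below. The notation is an assumption on our part, following the usual textbook convention: \(k\) is the number of groups (8 here), \(n\) the total number of observations, \(SSG\) the between-group sum of squares and \(SSE\) the within-group (residual) sum of squares.

\[F = \frac{MSG}{MSE}, \qquad MSG = \frac{SSG}{k - 1}, \qquad MSE = \frac{SSE}{n - k}\]

Under the null hypothesis, \(F\) follows an F distribution with \(k - 1\) and \(n - k\) degrees of freedom, and a large value of \(F\) is evidence that at least one group mean differs.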

As we see, the box plot of financial status by party affiliation is not convincing enough on its own to reject the null hypothesis in favor of the alternative, so we need further analysis before deciding either way.

plot(gss$coninc ~ gss$partyid, main = "Family Income by Political Affiliation")

The same goes for the education level.

plot(gss$educ ~ gss$partyid, main = "Education Level by Political Affiliation")

We are only interested in the education level, total family income in constant dollars, and party affiliation. So we drop all other columns from the data frame and keep only the three relevant fields, restricted to rows where all three are present.

> data = gss[complete.cases(gss[, c("educ", "coninc", "partyid")]), c("educ", "coninc", "partyid")]
> head(data)
  educ coninc          partyid
1   16  25926     Ind,Near Dem
2   10  33333 Not Str Democrat
3   12  33333      Independent
4   17  41667 Not Str Democrat
5   12  69444  Strong Democrat
6   14  60185     Ind,Near Dem
> tail(data)
      educ coninc          partyid
57056   11    383      Independent
57057   16  14363 Not Str Democrat
57058   13    383  Strong Democrat
57059   13  76600     Ind,Near Rep
57060   12  14363      Independent
57061   12    383     Ind,Near Dem
> summary(data)
      educ           coninc                     partyid     
 Min.   : 0.00   Min.   :   383   Not Str Democrat  :10968  
 1st Qu.:12.00   1st Qu.: 18445   Strong Democrat   : 8221  
 Median :12.00   Median : 35602   Not Str Republican: 8158  
 Mean   :12.82   Mean   : 44547   Independent       : 7213  
 3rd Qu.:15.00   3rd Qu.: 59542   Ind,Near Dem      : 6192  
 Max.   :20.00   Max.   :180386   Strong Republican : 4934  
                                  (Other)           : 5296
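The row filtering above relies on complete.cases. A minimal sketch of the same idea on a made-up toy data frame (the values and the extra age column are hypothetical, not taken from gss) shows which rows survive:

```r
# Toy data frame standing in for gss (values are made up).
toy <- data.frame(
  educ    = c(16, NA, 12, 14, 12),
  age     = c(30, 41, 52, 63, 27),   # an unrelated column to be dropped
  coninc  = c(25926, 33333, NA, 60185, 69444),
  partyid = factor(c("Independent", "Strong Democrat",
                     "Independent", NA, "Strong Democrat"))
)

keep <- c("educ", "coninc", "partyid")

# Keep only rows with no missing value in the three columns of interest.
subset_complete <- toy[complete.cases(toy[, keep]), keep]
subset_complete
```

Only the first and last rows are kept, because each of the others has an NA in one of the three selected columns.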

Inference:

To begin with, we check the conditions required for ANOVA.

Independence: We have already argued that the observations are approximately independent, as the sample is considerably less than 10% of the population.

Approximately normal: We have also given a reasonable justification that the skewness should not affect our analysis. Moreover, every group is large: even the smallest, Other Party (861 respondents in the full extract), has well over 150 observations.

Constant variance: Finally, the variance is about equal from one group to the next, as the side-by-side box plots for the groups showed.
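Reading spread off box plots is subjective, so a numeric check of the constant-variance condition may help. The sketch below uses a hypothetical helper, group_sds, and made-up toy data. With the real data one might call group_sds(gss$educ, gss$partyid), but we have not run that here.

```r
# Hypothetical helper: standard deviation of a numeric response per group.
group_sds <- function(values, groups) {
  tapply(values, groups, sd, na.rm = TRUE)
}

# Made-up toy data, used only to show the mechanics.
toy_values <- c(10, 12, 14, 11, 13, 15, 9, 12, 15)
toy_groups <- factor(rep(c("A", "B", "C"), each = 3))

sds <- group_sds(toy_values, toy_groups)
sds
max(sds) / min(sds)   # a ratio near 1 suggests roughly equal spread
```

How large a ratio is acceptable is a judgment call. We offer it only as a rough complement to the box plots.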

Instead of computing the sums of squares by hand, we used R's lm and anova functions (Yau 2013) to generate the ANOVA table relating education level to party affiliation.

> educvspartyid = lm(gss$educ ~ gss$partyid)
> summary(educvspartyid)

Call:
lm(formula = gss$educ ~ gss$partyid)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.3898  -1.2539  -0.2539   1.8281   7.7537 

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   12.246291   0.033004 371.060  < 2e-16 ***
gss$partyidNot Str Democrat    0.250711   0.043758   5.730 1.01e-08 ***
gss$partyidInd,Near Dem        0.847617   0.050615  16.746  < 2e-16 ***
gss$partyidIndependent         0.007638   0.047543   0.161    0.872    
gss$partyidInd,Near Rep        0.963433   0.055725  17.289  < 2e-16 ***
gss$partyidNot Str Republican  0.925582   0.046809  19.774  < 2e-16 ***
gss$partyidStrong Republican   1.143482   0.053667  21.307  < 2e-16 ***
gss$partyidOther Party         0.964174   0.112310   8.585  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.148 on 56595 degrees of freedom
  (458 observations deleted due to missingness)
Multiple R-squared:  0.0193,    Adjusted R-squared:  0.01918 
F-statistic: 159.1 on 7 and 56595 DF,  p-value: < 2.2e-16

> anova(educvspartyid)
Analysis of Variance Table

Response: gss$educ
               Df Sum Sq Mean Sq F value    Pr(>F)    
gss$partyid     7  11039 1576.98  159.11 < 2.2e-16 ***
Residuals   56595 560911    9.91                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
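As a sanity check, the F value and p-value in the table above can be recomputed from the sums of squares and degrees of freedom. The numbers are copied from the table, and the variable names are ours.

```r
# Values copied from the ANOVA table for education level above.
ss_group <- 11039      # Sum Sq for gss$partyid
ss_resid <- 560911     # Sum Sq for Residuals
df_group <- 7
df_resid <- 56595

ms_group <- ss_group / df_group    # mean square between groups
ms_resid <- ss_resid / df_resid    # mean square within groups
f_stat <- ms_group / ms_resid      # about 159.1, matching the table

# The upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)
c(F = f_stat, p = p_value)
```

The small difference from the printed 159.11 comes from the rounding of the sums of squares in the table.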

Performing comparisons across multiple groups increases the Type 1 error rate, so we need a modified significance level. The Bonferroni correction (Diez, Barr, and Cetinkaya-Rundel 2015) prescribes a more stringent significance level, calculated as

\(\alpha^* = \frac{\alpha}{K}\), where \(K\) is the number of comparisons being considered. For \(k\) groups, \(K = \frac{k(k-1)}{2}\). At a 95% confidence level the significance level \(\alpha\) is \(0.05\), and since the 8 levels of party affiliation give us 8 groups, the modified significance level after the Bonferroni correction is obtained as follows.

\[K = \frac{k(k-1)}{2} = \frac{8(8-1)}{2} = 28\]

Thus the modified significance level is

\[\alpha^* = \frac{\alpha}{K} = \frac{0.05}{28} \approx 0.001786\]
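The arithmetic above can be reproduced in R. This is a small sketch with our own variable names. The last line uses the base R function p.adjust on two made-up p-values to show the equivalent view of the correction.

```r
k <- 8                         # number of party affiliation groups
alpha <- 0.05                  # nominal significance level
K <- choose(k, 2)              # pairwise comparisons: k(k-1)/2 = 28
alpha_star <- alpha / K        # Bonferroni-corrected level
round(alpha_star, 6)           # 0.001786

# Equivalent view: multiply each raw p-value by K (capped at 1)
# and compare the result against the original alpha.
p.adjust(c(0.001, 0.01), method = "bonferroni", n = K)
```

Either form leads to the same decision: comparing a raw p-value with \(\alpha^*\) is the same as comparing the adjusted p-value with \(\alpha\).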

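The p-value calculated via ANOVA for education level versus party affiliation is \(< 2.2 \times 10^{-16}\), far below both the nominal significance level \(0.05\) and the Bonferroni-corrected level \(0.001786\), so we reject the null hypothesis that education level is unrelated to party affiliation.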

On similar grounds, we test the association between the financial status of Americans and their party affiliation.

> conincvspartyid = lm(gss$coninc ~ gss$partyid)
> summary(conincvspartyid)

Call:
lm(formula = gss$coninc ~ gss$partyid)

Residuals:
   Min     1Q Median     3Q    Max 
-54771 -25449  -8549  14533 142416 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    37970.2      390.8  97.159  < 2e-16 ***
gss$partyidNot Str Democrat     3965.9      516.8   7.674 1.70e-14 ***
gss$partyidInd,Near Dem         4856.9      596.2   8.146 3.84e-16 ***
gss$partyidIndependent          1004.1      571.7   1.756    0.079 .  
gss$partyidInd,Near Rep        11213.7      656.2  17.089  < 2e-16 ***
gss$partyidNot Str Republican  13736.9      553.8  24.805  < 2e-16 ***
gss$partyidStrong Republican   17184.1      638.1  26.928  < 2e-16 ***
gss$partyidOther Party          7618.3     1332.3   5.718 1.08e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35460 on 51049 degrees of freedom
  (6004 observations deleted due to missingness)
Multiple R-squared:  0.02649,   Adjusted R-squared:  0.02635 
F-statistic: 198.4 on 7 and 51049 DF,  p-value: < 2.2e-16

> anova(conincvspartyid)
Analysis of Variance Table

Response: gss$coninc
               Df     Sum Sq    Mean Sq F value    Pr(>F)    
gss$partyid     7 1.7462e+12 2.4946e+11  198.42 < 2.2e-16 ***
Residuals   51049 6.4182e+13 1.2573e+09                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, the p-value calculated via ANOVA for financial status versus party affiliation is \(< 2.2 \times 10^{-16}\), far below the corrected significance level \(0.001786\), so we reject this null hypothesis as well.
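With samples this large, a tiny p-value says little about the size of the effect. As a rough sketch, the share of variance explained by party affiliation (the Multiple R-squared that R reports above) can be recomputed from the two ANOVA tables. The helper name eta_squared is ours, and the inputs are the rounded Sum Sq values from the tables.

```r
# Share of total variation attributed to the grouping variable.
eta_squared <- function(ss_group, ss_resid) {
  ss_group / (ss_group + ss_resid)
}

# Sum Sq values copied from the two ANOVA tables above.
eta_educ   <- eta_squared(11039, 560911)          # education level
eta_coninc <- eta_squared(1.7462e12, 6.4182e13)   # family income

round(c(educ = eta_educ, coninc = eta_coninc), 4)
```

Both values agree with the Multiple R-squared figures in the model summaries (0.0193 and 0.02649).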

Conclusion:

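The analysis shows strong evidence of an association between party affiliation and both financial status and education level, though the nature of the relationship was not studied. The association is statistically significant but modest in size: party affiliation accounts for roughly 2% of the variance in education level (R-squared 0.0193) and under 3% of the variance in family income (R-squared 0.02649).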

This is not much of a revelation, as political analysts such as Larry Sabato and news outlets such as Business Insider (Dougherty 2015) have analyzed and predicted a similar trend. In fact, Sabato (Sabato 2015) goes on to explain

“… The higher the education level, the more likely they are to vote Democratic,”.

Appendix:

load(url("http://bit.ly/dasi_gss_data"))
library(knitr)
data = gss[complete.cases(gss[, c("educ", "coninc", "partyid")]), c("educ", "coninc", "partyid")]
kable(data[sample(nrow(data), 30),], format = "markdown")
|       | educ | coninc | partyid            |
|:------|-----:|-------:|:-------------------|
| 19362 |   12 |  57243 | Strong Republican  |
| 44963 |   12 |  63190 | Ind,Near Dem       |
| 33939 |    9 |  36472 | Ind,Near Dem       |
| 3253  |   10 |  21952 | Independent        |
| 45180 |   12 |  77233 | Ind,Near Dem       |
| 21032 |   15 |  67456 | Not Str Republican |
| 37771 |   12 |  12037 | Strong Democrat    |
| 42497 |   14 |   8756 | Independent        |
| 14148 |   13 |  27764 | Not Str Democrat   |
| 55091 |   13 | 107240 | Not Str Republican |
| 27329 |   13 |  20761 | Not Str Republican |
| 49054 |   17 | 123496 | Strong Republican  |
| 26518 |   16 |  57493 | Ind,Near Dem       |
| 49448 |   14 |  59542 | Independent        |
| 9880  |   16 |  59569 | Not Str Democrat   |
| 32564 |   16 | 121461 | Not Str Democrat   |
| 9926  |   14 |  34487 | Ind,Near Dem       |
| 3477  |   10 |  21952 | Strong Democrat    |
| 31598 |    9 |  38141 | Independent        |
| 56849 |   11 |   6894 | Independent        |
| 46529 |   13 |   3969 | Ind,Near Dem       |
| 24341 |    4 |  19233 | Strong Democrat    |
| 24853 |   18 |  62946 | Not Str Democrat   |
| 52081 |   16 |  37373 | Independent        |
| 3523  |   19 |  25329 | Not Str Democrat   |
| 1973  |   12 |  62500 | Strong Republican  |
| 8932  |    3 |   9393 | Ind,Near Rep       |
| 42750 |   14 |  31618 | Not Str Republican |
| 18962 |   12 |  13738 | Strong Democrat    |
| 36693 |   12 |   6955 | Strong Democrat    |

References:

Diez, David M., Christopher D. Barr, and Mine Cetinkaya-Rundel. 2015. OpenIntro Statistics. Duke University: CreateSpace Independent Publishing Platform (26 July 2012).

Dougherty, Michael Brendan. 2015. “Proof: Republicans Really Are Dumber Than Democrats.” http://www.businessinsider.com/proof-republicans-really-are-dumber-than-democrats-2012-5?IR=T.

Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. “Consequences of Failure to Meet Assumptions Underlying Fixed Effects Analyses of Variance and Covariance.” Review of Educational Research 42: 237–88.

Harwell, M.R., E.N. Rubinstein, W.S. Hayes, and C.C. Olds. 1992. “Summarizing Monte Carlo Results in Methodological Research: The One- and Two-Factor Fixed Effects ANOVA Cases.” Journal of Educational Statistics 17: 315–39.

Easton, Adam. 2015. “How Do Republicans Explain the Fact That Higher Education Has Such a High Correlation with Voting Democrat?” https://www.quora.com/How-do-Republicans-explain-the-fact-that-higher-education-has-such-a-high-correlation-with-voting-Democrat. Author profile: https://www.quora.com/profile/Adam-Easton-1.

Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. “Consequences of Assumption Violations Revisited: A Quantitative Review of Alternatives to the One-Way Analysis of Variance F Test.” Review of Educational Research 66: 579–619.

Sabato, Larry. 2015. “White Voters Solidly in for GOP in Georgia.” http://www.ajc.com/news/news/georgias-white-voters-elude-democrats/nSdq4/.

Yau, Chi. 2013. R Tutorial with Bayesian Statistics Using OpenBUGS. r-tutor.com.