The purpose of this investigation is to determine whether the education level and the financial status of Americans are each associated with their party affiliation. Simultaneous analysis of the categorical variable gss$partyid against both education level (gss$educ) and financial status (gss$coninc, i.e. total family income in constant dollars) is outside the scope of this project, so we only draw inferences on each numerical variable separately.
Recently a Quora user asked a question (Easton 2015) titled
“How do Republicans explain the fact that higher education has such a high correlation with voting Democrat?”.
Responses were mixed, and none of them was backed by a statistical analysis that could support or refute the claim. The null hypothesis is, of course, "there is no correlation between education level and party affiliation", which for our purposes means that the mean education level is the same across all party affiliations. We test it against the alternative hypothesis, "there is a correlation between education level and party affiliation". The nature of that correlation is beyond the scope of this paper, though we agree that it would be an interesting study to perform.
We also extend the study to see whether a similar correlation exists with financial status. Here the null hypothesis is “There is no correlation between Financial Status and Party affiliation”, and the alternative hypothesis is that such a correlation exists.
Datasets for the project
The dataset is one of the two provided by Duke University as part of the project. It is an extract of the General Social Survey (GSS) Cumulative File 1972-2012, a sociological survey that collects data on demographic characteristics and attitudes of residents of the United States.
We were only interested in three fields for this project.
The field gss$partyid, party affiliation, is a categorical variable with 8 levels.
> unique(gss$partyid)
[1] Ind,Near Dem Not Str Democrat Independent Strong Democrat
[5] Not Str Republican Ind,Near Rep Strong Republican Other Party
[9] <NA>
8 Levels: Strong Democrat Not Str Democrat Ind,Near Dem Independent ... Other Party
> summary(gss$partyid)
   Strong Democrat   Not Str Democrat       Ind,Near Dem        Independent
              9117              12040               6743               8499
      Ind,Near Rep Not Str Republican  Strong Republican        Other Party
              4921               9005               5548                861
              NA's
               327
We can see that every party affiliation is well represented. The population of the United States was 318.9 million as of 2014, and the sample is far less than 10% of that population, so we can reasonably treat the observations as independent. Moreover, the sample is large enough that even notable skewness should have negligible impact on the study.
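The 10% condition above can be checked with a line of arithmetic. This is only a sketch: the row count of 57,061 is inferred from the output shown in this report (the partyid counts sum to it), and using a single 2014 population figure for a survey pooled over 1972-2012 is a simplifying assumption.

```r
# Figures taken from the text: 57,061 survey rows (inferred from the printed
# output) and 318.9 million residents as of 2014.
n_sample <- 57061
n_population <- 318.9e6

sample_fraction <- n_sample / n_population
cat(sprintf("Sample is %.4f%% of the population\n", 100 * sample_fraction))

# The 10% rule of thumb for treating observations as independent
meets_ten_percent_rule <- sample_fraction < 0.10
print(meets_ten_percent_rule)
```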
gss$educ, education level, is a numeric variable of type integer. Unfortunately, the sample distribution of education level is not normal: it is left-skewed and has outliers. The 164 NA entries are removed before the data are analyzed.
load(url("http://bit.ly/dasi_gss_data"))
library(ggplot2)
summary(gss$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   12.00   12.00   12.75   15.00   20.00     164
boxplot(na.omit(gss$educ), main = "Boxplot of Education Level")
hist(na.omit(gss$educ), main = "Histogram of Education Level")
Similarly, gss$coninc, financial status, which in our case is family income in constant dollars, is also a numeric integer variable. Its distribution is right-skewed (the mean of 44500 exceeds the median of 35600) and has obvious outliers.
summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   18440   35600   44500   59540  180400    5829
boxplot(na.omit(gss$coninc), main = "Boxplot of Family Income")
hist(na.omit(gss$coninc), main = "Histogram of Family Income")
Fortunately, we will be using ANOVA, which is not very sensitive to moderate departures from normality. Simulation studies using a variety of non-normal distributions have shown that the false positive rate is not greatly affected by this violation (Glass, Peckham, and Sanders 1972; Lix, Keselman, and Keselman 1996; Harwell et al. 1992). The reason is that, for sufficiently large random samples from a population, the means of those samples are approximately normally distributed even when the population itself is not normal.
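The claim about sample means can be illustrated with a small simulation. This is a sketch under stated assumptions: the log-normal population, the sample size of 200 and the seed are all made up for illustration and are not derived from the GSS data.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Hypothetical right-skewed, income-like population (log-normal), NOT the GSS data
population <- rlnorm(100000, meanlog = 10, sdlog = 1)

# Draw many samples of size n and keep the mean of each one
n <- 200
sample_means <- replicate(2000, mean(sample(population, n)))

# Simple moment-based skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_population <- skewness(population)
skew_means <- skewness(sample_means)
cat("Skewness of population:  ", round(skew_population, 2), "\n")
cat("Skewness of sample means:", round(skew_means, 2), "\n")
```

The sample means should come out far less skewed than the population they were drawn from, which is the behaviour the ANOVA robustness argument relies on.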
The categorical variable in this study has 8 levels, so we have to compare means across several groups. Running every pairwise comparison is impractical and inflates the chance of a false positive, so we opt for analysis of variance (ANOVA) (Diez, Barr, and Cetinkaya-Rundel 2015) and perform the F test to determine whether at least one mean is different, which in our case is the alternative hypothesis.
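To make the F test concrete, here is a minimal sketch that computes the F statistic by hand and compares it with R's built-in table. The three groups and their means are hypothetical toy data, not the GSS extract.

```r
set.seed(1)

# Hypothetical data: three groups with slightly different means
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y = c(rnorm(50, mean = 12.0, sd = 3),
        rnorm(50, mean = 12.5, sd = 3),
        rnorm(50, mean = 13.5, sd = 3))
)

k <- nlevels(toy$group)
n <- nrow(toy)
grand_mean <- mean(toy$y)
group_means <- tapply(toy$y, toy$group, mean)
group_sizes <- tapply(toy$y, toy$group, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within <- sum((toy$y - group_means[as.character(toy$group)])^2)

# F = mean square between groups / mean square within groups
f_by_hand <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_by_hand <- pf(f_by_hand, k - 1, n - k, lower.tail = FALSE)

# Compare with R's built-in ANOVA table
f_builtin <- anova(lm(y ~ group, data = toy))[["F value"]][1]
cat("F by hand:", f_by_hand, " F from anova():", f_builtin, "\n")
```

A large F means the group means vary more than the within-group spread would suggest, which is evidence that at least one mean differs.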
As the box plot of financial status by party affiliation shows, the visual evidence alone is not convincing enough to reject the null hypothesis in favor of the alternative, so we need further analysis before deciding either way.
plot(gss$coninc ~ gss$partyid, main = "Family Income by Political Affiliation")
The same goes for the education level.
plot(gss$educ ~ gss$partyid, main = "Education Level by Political Affiliation")
We are only interested in the education level, the total family income in constant dollars and the party affiliation. So we drop all other columns from the data frame, keeping the three relevant fields and only the rows where all three are present.
> cols = c("educ", "coninc", "partyid")
> data = gss[complete.cases(gss[, cols]), cols]
> head(data)
  educ coninc          partyid
1   16  25926     Ind,Near Dem
2   10  33333 Not Str Democrat
3   12  33333      Independent
4   17  41667 Not Str Democrat
5   12  69444  Strong Democrat
6   14  60185     Ind,Near Dem
> tail(data)
      educ coninc          partyid
57056   11    383      Independent
57057   16  14363 Not Str Democrat
57058   13    383  Strong Democrat
57059   13  76600     Ind,Near Rep
57060   12  14363      Independent
57061   12    383     Ind,Near Dem
> summary(data)
      educ           coninc                     partyid
 Min.   : 0.00   Min.   :   383   Not Str Democrat  :10968
 1st Qu.:12.00   1st Qu.: 18445   Strong Democrat   : 8221
 Median :12.00   Median : 35602   Not Str Republican: 8158
 Mean   :12.82   Mean   : 44547   Independent       : 7213
 3rd Qu.:15.00   3rd Qu.: 59542   Ind,Near Dem      : 6192
 Max.   :20.00   Max.   :180386   Strong Republican : 4934
                                  (Other)           : 5296
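The row and column selection above can be tried on a tiny example. The five-row data frame below is hypothetical and only stands in for the GSS extract; selecting columns by name rather than by position is a choice made here for readability.

```r
# Hypothetical five-row data frame standing in for the GSS extract
toy <- data.frame(
  educ    = c(16, NA, 12, 17, 12),
  coninc  = c(25926, 33333, NA, 41667, 69444),
  partyid = factor(c("Independent", "Strong Democrat", "Independent",
                     NA, "Strong Republican")),
  other   = 1:5
)

cols <- c("educ", "coninc", "partyid")

# Keep only rows with no NA in the selected columns, and only those columns
clean <- toy[complete.cases(toy[, cols]), cols]
print(clean)
```

Selecting by name keeps the code working even if the column order of the data frame changes.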
So to begin with, we check the conditions that ANOVA requires.
Independence: We have already argued that the observations are nearly independent, as the sample is considerably less than 10% of the population.
Approximately normal: We have also given reasonable justification that the skewness would not affect our analysis. Moreover, there are more than 150 observations in each group.
Constant variance: Finally, the variance is about equal from one group to the next. This was apparent when we plotted the side-by-side boxplots for each of the groups.
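The group-size and constant-variance conditions can also be checked numerically rather than only by eye. The following is a sketch on hypothetical data; the group names, sizes and spreads are invented, and the "largest SD at most about twice the smallest" threshold is a common rule of thumb rather than something stated in this report.

```r
set.seed(7)

# Hypothetical data: years of education for three made-up groups
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(200, 300, 250))),
  educ = c(rnorm(200, 12.2, 3.1), rnorm(300, 12.9, 3.0), rnorm(250, 13.3, 3.2))
)

# Group sizes: each should be comfortably large
group_n <- table(toy$group)

# Group standard deviations: rule of thumb is that the largest should be
# no more than about twice the smallest
group_sd <- tapply(toy$educ, toy$group, sd)
sd_ratio <- max(group_sd) / min(group_sd)

print(group_n)
print(round(group_sd, 2))
cat("Largest / smallest SD:", round(sd_ratio, 2), "\n")

# Bartlett's test is a formal alternative; note it is sensitive to non-normality
print(bartlett.test(educ ~ group, data = toy))
```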
Instead of computing the statistics by hand, we used R's anova function (Yau 2013) to generate the ANOVA table relating education level to party affiliation.
> educvspartyid = lm(gss$educ ~ gss$partyid)
> summary(educvspartyid)
Call:
lm(formula = gss$educ ~ gss$partyid)

Residuals:
     Min       1Q   Median       3Q      Max
-13.3898  -1.2539  -0.2539   1.8281   7.7537

Coefficients:
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                   12.246291   0.033004 371.060  < 2e-16 ***
gss$partyidNot Str Democrat    0.250711   0.043758   5.730 1.01e-08 ***
gss$partyidInd,Near Dem        0.847617   0.050615  16.746  < 2e-16 ***
gss$partyidIndependent         0.007638   0.047543   0.161    0.872
gss$partyidInd,Near Rep        0.963433   0.055725  17.289  < 2e-16 ***
gss$partyidNot Str Republican  0.925582   0.046809  19.774  < 2e-16 ***
gss$partyidStrong Republican   1.143482   0.053667  21.307  < 2e-16 ***
gss$partyidOther Party         0.964174   0.112310   8.585  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.148 on 56595 degrees of freedom
  (458 observations deleted due to missingness)
Multiple R-squared:  0.0193, Adjusted R-squared:  0.01918
F-statistic: 159.1 on 7 and 56595 DF,  p-value: < 2.2e-16
> anova(educvspartyid)
Analysis of Variance Table

Response: gss$educ
               Df Sum Sq Mean Sq F value    Pr(>F)
gss$partyid     7  11039 1576.98  159.11 < 2.2e-16 ***
Residuals   56595 560911    9.91
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Comparing means across multiple pairs of groups increases the Type 1 error rate, so we need a modified significance level. The Bonferroni correction (Diez, Barr, and Cetinkaya-Rundel 2015) suggests a more stringent significance level, calculated as
\(\alpha^* = \frac{\alpha}{K}\), where \(K\) is the number of comparisons being considered. For \(k\) groups this is \(K = \frac{k(k-1)}{2}\). With a 95% confidence level, the significance level \(\alpha\) is \(0.05\). We have 8 groups, one for each of the 8 levels of party affiliation, so
\[K = \frac{k(k-1)}{2} = \frac{8(8-1)}{2} = 28\] and the modified significance level is \[\alpha^* = \frac{\alpha}{K} = \frac{0.05}{28} = 0.001786\]
Strictly speaking, the corrected level applies to the pairwise comparisons that would follow a significant F test; we use it here as a conservative threshold for the F test as well. The p-value calculated via ANOVA for education level vs. party affiliation is \(< 2.2 \times 10^{-16}\), which is considerably less than the modified significance level \(0.001786\).
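The Bonferroni arithmetic, and the pairwise step it is designed for, can be sketched in a few lines. The values of k and alpha come from the text; the four-group data set is hypothetical and only illustrates how pairwise.t.test applies the correction.

```r
k <- 8             # number of party-affiliation levels, from the text
alpha <- 0.05

K <- choose(k, 2)  # number of pairwise comparisons, k(k-1)/2
alpha_star <- alpha / K
cat("Comparisons:", K, " corrected alpha:", signif(alpha_star, 4), "\n")

# Hypothetical data to show the pairwise step; NOT the GSS data
set.seed(3)
toy <- data.frame(
  group = factor(rep(LETTERS[1:4], each = 40)),
  y = rnorm(160, mean = rep(c(12, 12, 13, 14), each = 40), sd = 3)
)

# pairwise.t.test multiplies each p-value by the number of comparisons
# (capped at 1), so the adjusted p-values are compared against the
# original alpha rather than against alpha_star
pw <- pairwise.t.test(toy$y, toy$group, p.adjust.method = "bonferroni")
print(pw$p.value)
```

Dividing alpha by K and multiplying each p-value by K are two equivalent ways of applying the same correction.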
On similar grounds, we test whether the financial status of Americans is associated with their party affiliation.
> conincvspartyid = lm(gss$coninc ~ gss$partyid)
> summary(conincvspartyid)
Call:
lm(formula = gss$coninc ~ gss$partyid)

Residuals:
   Min     1Q Median     3Q    Max
-54771 -25449  -8549  14533 142416

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                    37970.2      390.8  97.159  < 2e-16 ***
gss$partyidNot Str Democrat     3965.9      516.8   7.674 1.70e-14 ***
gss$partyidInd,Near Dem         4856.9      596.2   8.146 3.84e-16 ***
gss$partyidIndependent          1004.1      571.7   1.756    0.079 .
gss$partyidInd,Near Rep        11213.7      656.2  17.089  < 2e-16 ***
gss$partyidNot Str Republican  13736.9      553.8  24.805  < 2e-16 ***
gss$partyidStrong Republican   17184.1      638.1  26.928  < 2e-16 ***
gss$partyidOther Party          7618.3     1332.3   5.718 1.08e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35460 on 51049 degrees of freedom
  (6004 observations deleted due to missingness)
Multiple R-squared:  0.02649, Adjusted R-squared:  0.02635
F-statistic: 198.4 on 7 and 51049 DF,  p-value: < 2.2e-16
> anova(conincvspartyid)
Analysis of Variance Table

Response: gss$coninc
               Df     Sum Sq    Mean Sq F value    Pr(>F)
gss$partyid     7 1.7462e+12 2.4946e+11  198.42 < 2.2e-16 ***
Residuals   51049 6.4182e+13 1.2573e+09
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, the p-value calculated via ANOVA for financial status vs. party affiliation is \(< 2.2 \times 10^{-16}\), which is considerably less than the modified significance level \(0.001786\).
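The analysis shows that there is a statistically significant association of both financial status and education level with party affiliation, though the nature of the relationship was not studied. The R-squared values (0.0193 for education and 0.02649 for income) indicate that party affiliation accounts for only a small share of the variation in either variable.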
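As a sanity check, the F statistics and R-squared values can be recomputed from the sums of squares and degrees of freedom printed in the two ANOVA tables. The helper anova_check is a hypothetical name introduced here; because the inputs are the rounded figures from the tables, the results agree with the reported values only up to rounding.

```r
# Values copied from the two ANOVA tables in the text
educ   <- list(ss_group = 11039,     ss_resid = 560911,
               df_group = 7, df_resid = 56595)
coninc <- list(ss_group = 1.7462e12, ss_resid = 6.4182e13,
               df_group = 7, df_resid = 51049)

# Hypothetical helper: recompute F, R-squared and the p-value
anova_check <- function(a) {
  f_stat <- (a$ss_group / a$df_group) / (a$ss_resid / a$df_resid)
  r_squared <- a$ss_group / (a$ss_group + a$ss_resid)
  p_value <- pf(f_stat, a$df_group, a$df_resid, lower.tail = FALSE)
  c(F = f_stat, R2 = r_squared, p = p_value)
}

res <- rbind(educ = anova_check(educ), coninc = anova_check(coninc))
print(res)
```

The tiny p-values sit alongside R-squared values of roughly 2 to 3 percent, which is typical of very large samples: small differences in group means become statistically detectable.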
This was not quite a revelation, as political analysts such as Larry Sabato and news outlets such as Business Insider (Dougherty 2015) have analyzed and predicted a similar trend. In fact, Sabato (Sabato 2015) goes on to explain
“… The higher the education level, the more likely they are to vote Democratic,”.
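load(url("http://bit.ly/dasi_gss_data"))
library(knitr)
# Select the three fields by name and keep only complete rows
cols = c("educ", "coninc", "partyid")
data = gss[complete.cases(gss[, cols]), cols]
# Print a random sample of 30 rows as a markdown table
kable(data[sample(nrow(data), 30), ], format = "markdown")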
| | educ | coninc | partyid |
|---|---|---|---|
| 19362 | 12 | 57243 | Strong Republican |
| 44963 | 12 | 63190 | Ind,Near Dem |
| 33939 | 9 | 36472 | Ind,Near Dem |
| 3253 | 10 | 21952 | Independent |
| 45180 | 12 | 77233 | Ind,Near Dem |
| 21032 | 15 | 67456 | Not Str Republican |
| 37771 | 12 | 12037 | Strong Democrat |
| 42497 | 14 | 8756 | Independent |
| 14148 | 13 | 27764 | Not Str Democrat |
| 55091 | 13 | 107240 | Not Str Republican |
| 27329 | 13 | 20761 | Not Str Republican |
| 49054 | 17 | 123496 | Strong Republican |
| 26518 | 16 | 57493 | Ind,Near Dem |
| 49448 | 14 | 59542 | Independent |
| 9880 | 16 | 59569 | Not Str Democrat |
| 32564 | 16 | 121461 | Not Str Democrat |
| 9926 | 14 | 34487 | Ind,Near Dem |
| 3477 | 10 | 21952 | Strong Democrat |
| 31598 | 9 | 38141 | Independent |
| 56849 | 11 | 6894 | Independent |
| 46529 | 13 | 3969 | Ind,Near Dem |
| 24341 | 4 | 19233 | Strong Democrat |
| 24853 | 18 | 62946 | Not Str Democrat |
| 52081 | 16 | 37373 | Independent |
| 3523 | 19 | 25329 | Not Str Democrat |
| 1973 | 12 | 62500 | Strong Republican |
| 8932 | 3 | 9393 | Ind,Near Rep |
| 42750 | 14 | 31618 | Not Str Republican |
| 18962 | 12 | 13738 | Strong Democrat |
| 36693 | 12 | 6955 | Strong Democrat |
Diez, David M., Christopher D. Barr, and Mine Cetinkaya-Rundel. 2015. OpenIntro Statistics. Duke University: CreateSpace Independent Publishing Platform (26 July 2012).
Dougherty, Michael Brendan. 2015. “Proof: Republicans Really Are Dumber Than Democrats.” http://www.businessinsider.com/proof-republicans-really-are-dumber-than-democrats-2012-5?IR=T.
Easton, Adam (https://www.quora.com/profile/Adam-Easton-1). 2015. “How Do Republicans Explain the Fact That Higher Education Has Such a High Correlation with Voting Democrat?” https://www.quora.com/How-do-Republicans-explain-the-fact-that-higher-education-has-such-a-high-correlation-with-voting-Democrat.
Glass, G.V., P.D. Peckham, and J.R. Sanders. 1972. “Consequences of Failure to Meet Assumptions Underlying Fixed Effects Analyses of Variance and Covariance.” Review of Educational Research 42: 237–88.
Harwell, M.R., E.N. Rubinstein, and C.C. Olds. 1992. “Summarizing Monte Carlo Results in Methodological Research: The One- and Two-Factor Fixed Effects ANOVA Cases.” Journal of Educational Statistics 19: 315–39.
Lix, L.M., J.C. Keselman, and H.J. Keselman. 1996. “Consequences of Assumption Violations Revisited: A Quantitative Review of Alternatives to the One-Way Analysis of Variance F Test.” Review of Educational Research 66: 579–619.
Sabato, Larry. 2015. “White Voters Solidly in for GOP in Georgia.” http://www.ajc.com/news/news/georgias-white-voters-elude-democrats/nSdq4/.
Yau, Chi. 2013. R Tutorial with Bayesian Statistics Using OpenBUGS. r-tutor.com.