The project studies the relation between a person’s political stand and their family income in constant dollars.
A person’s political stand is influenced by their family background, which is closely tied to family income. This study therefore explores whether family income is related to political stand and, if so, how.
The study uses data from the General Social Survey (GSS) cumulative file, 1972-2012.
General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University).
The R dataset can be downloaded from http://bit.ly/dasi_gss_data.
load(url("http://bit.ly/dasi_gss_data"))
The study spans 40 years, and the collection process was modified nearly every decade (see http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf for details).
The data were collected through household interviews in metropolitan and rural areas of the United States. Multiple levels of stratification by region, race, age, income and sex were employed to obtain a representative random sample. About 1500-2000 cases were collected each year, with a slight increase in recent years.
The cases are adults residing in the United States, interviewed in their households.
Party ID:
Answer to the question: “Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?”.
Type of variable: categorical, ordinal.
summary(gss$partyid)
##    Strong Democrat   Not Str Democrat       Ind,Near Dem
##               9117              12040               6743
##        Independent       Ind,Near Rep Not Str Republican
##               8499               4921               9005
##  Strong Republican        Other Party               NA's
##               5548                861                327
str(gss$partyid)
## Factor w/ 8 levels "Strong Democrat",..: 3 2 4 2 1 3 3 3 1 1 ...
Family Income in Constant Dollars:
Inflation-adjusted family income.
Type of variable: numerical, continuous.
summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   18440   35600   44500   59540  180400    5829
str(gss$coninc)
## int [1:57061] 25926 33333 33333 41667 69444 60185 50926 18519 3704 25926 ...
The study consists of interviews with a random sample of United States residents about their economic condition, working status, health, beliefs, etc. No treatment is assigned to the respondents, so the study is observational.
The population of interest is all adult US residents. Because the study employed random sampling, the results can be generalized to that population.
The study is observational, so we can only establish association but not causal links between the variables of interest.
The dataset, restricted to the partyid and coninc columns and filtered for NA values, has 51057 cases (the 8 groups plus the 51049 residual degrees of freedom reported by the ANOVA).
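The filtering step can be sketched as follows. This is a minimal illustration on a small made-up data frame (`toy` and `sub` are hypothetical names, and the values are not the GSS extract); on the real data the same two steps would be applied to `gss`.

```r
# Hypothetical toy data standing in for gss; only the column names match.
toy <- data.frame(
  partyid = factor(c("Independent", NA, "Strong Democrat", "Other Party", "Independent")),
  coninc  = c(25926L, 33333L, NA, 41667L, 18519L)
)

# Keep the two variables of interest, then drop rows with an NA in either one.
sub <- toy[, c("partyid", "coninc")]
sub <- sub[complete.cases(sub), ]

# Number of cases left after filtering.
print(nrow(sub))
```

Counting rows after `complete.cases()` gives the number of cases that enter any analysis using both variables, which is the figure to compare against the degrees of freedom of a fitted model.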
partyid:
partyid is a categorical variable. We summarize it with a frequency table and a bar plot.
table(gss$partyid)
##
##    Strong Democrat   Not Str Democrat       Ind,Near Dem
##               9117              12040               6743
##        Independent       Ind,Near Rep Not Str Republican
##               8499               4921               9005
##  Strong Republican        Other Party
##               5548                861
plot(gss$partyid)
We can see that Not Str Democrat has the most instances (12040), followed by Strong Democrat (9117) and Not Str Republican (9005).
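To make the ranking explicit, the counts printed by table(gss$partyid) above can be sorted and turned into percentages. The vector names `counts`, `ranked` and `shares` are illustrative; the numbers are copied from that output.

```r
# Counts copied from the table(gss$partyid) output above.
counts <- c(
  "Strong Democrat" = 9117, "Not Str Democrat" = 12040,
  "Ind,Near Dem" = 6743, "Independent" = 8499,
  "Ind,Near Rep" = 4921, "Not Str Republican" = 9005,
  "Strong Republican" = 5548, "Other Party" = 861
)

# Order the groups from most to least frequent.
ranked <- sort(counts, decreasing = TRUE)

# Share of each group among respondents with a recorded party ID, in percent.
shares <- round(100 * ranked / sum(ranked), 1)
print(shares)
```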
Family Income in constant USD:
Family income in constant USD is a continuous numerical variable. We summarize it with summary statistics and a histogram.
summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##     383   18440   35600   44500   59540  180400    5829
hist(gss$coninc)
We can see that the distribution is right skewed: the mean (44500) is well above the median (35600).
To address our research question, we ask whether there is a relation between political stand and family income.
To explore the relationship between a categorical and a numerical variable, we plot family income by party ID with ggplot2.
library(ggplot2)
# stat = "identity" stacks the incomes of all respondents, so each bar is a group total
ggplot(gss, aes(x = partyid, y = coninc)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(labels = c("Str Dem", "Not Str Dem", "Ind, Near Dem", "Ind", "Ind,Near Rep", "Not Str Rep", "Str Rep", "Others", "NA"))
## Warning: Removed 5829 rows containing missing values (position_stack).
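The plot does not show a steadily increasing or decreasing relation between political stand and family income. Not strong Democrats and not strong Republicans have the tallest bars, but because stat="identity" stacks the incomes of all respondents in a group, bar height reflects group size as well as income level, so this is not by itself evidence of higher family income.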
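A per-group mean or median separates income level from group size. The sketch below uses simulated data (the data frame `toy`, its three group labels and the log-normal parameters are all made up for illustration) to show how group totals and group centers can tell different stories.

```r
# Hypothetical simulated data; group sizes and incomes are made up for illustration.
set.seed(1)
toy <- data.frame(
  partyid = factor(rep(c("Dem", "Ind", "Rep"), times = c(300, 100, 200))),
  coninc  = c(rlnorm(300, 10.3, 0.8), rlnorm(100, 10.3, 0.8), rlnorm(200, 10.6, 0.8))
)

# Group totals (what stacked bars show) mostly track group size ...
totals <- tapply(toy$coninc, toy$partyid, sum)

# ... whereas group means and medians describe the typical income per group.
centers <- aggregate(coninc ~ partyid, data = toy,
                     FUN = function(x) c(mean = mean(x), median = median(x)))

print(totals)
print(centers)
```

On the real data the same `aggregate()` call with `data = gss` would give the group means that the ANOVA compares.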
The study explores whether there is a statistically significant difference in the mean family income in constant dollars of United States residents across political stands.
In statistical inference terms, we test a null hypothesis (H0) that mean family income is equal for all political groups against an alternative hypothesis (HA) that at least one group mean differs from the others.
H0: the mean income (µ) is the same in every political group, i.e. µ1 = µ2 = ... = µ8
HA: the average income in constant dollars differs for at least one group
Three conditions need to be checked for the ANOVA test: independence of the observations within and between groups, approximate normality of income within each group, and roughly constant variance across groups. We check normality with a QQ plot for each group and variability with side-by-side boxplots.
# 8 graphs in 2 rows
par(mfrow = c(2, 4))
# Iterate over the 8 groups and draw a QQ plot for each to check normality
political <- c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem", "Independent", "Ind,Near Rep", "Not Str Republican", "Strong Republican", "Other Party")
for (group in political) {
  # which() skips respondents whose party ID is NA
  income <- gss$coninc[which(gss$partyid == group)]
  qqnorm(income, main = group)
  qqline(income)
}
par(mfrow = c(1, 1))
par(mar=c(8,5,4,4))
boxplot(gss$coninc ~ gss$partyid, las= 2, main="Family Income in constant USD by political stand")
Although the conditions on normality and constant variance are not fully met, we will use ANOVA in our hypothesis test and keep this uncertainty in mind when reporting the final results.
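One way to gauge how far the constant-variance condition is off is to compare the group standard deviations, and Welch's one-way test (`oneway.test` with `var.equal = FALSE`) is an alternative that does not assume equal variances. This is not the method used in this study; it is only a sketch of a possible robustness check, run on simulated data with made-up groups "A", "B" and "C".

```r
# Hypothetical simulated data; groups and parameters are made up for illustration.
set.seed(42)
toy <- data.frame(
  partyid = factor(rep(c("A", "B", "C"), each = 200)),
  coninc  = c(rlnorm(200, 10.2, 0.6), rlnorm(200, 10.4, 0.9), rlnorm(200, 10.6, 0.7))
)

# Informal check: compare the standard deviations across groups.
group_sd <- tapply(toy$coninc, toy$partyid, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Welch's one-way test does not assume equal variances across groups.
welch <- oneway.test(coninc ~ partyid, data = toy, var.equal = FALSE)

print(group_sd)
print(sd_ratio)
print(welch)
```

If the Welch test and the classical ANOVA lead to the same decision, the unequal variances are less of a concern for the conclusion.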
Since we are working with a numerical response and a categorical explanatory variable with more than 2 levels, we will use an ANOVA test to check whether the means across the 8 groups are equal. If we can reject the null hypothesis, pairwise comparisons can then be conducted with the Bonferroni method to control the Type I error rate.
ANOVA uses the F statistic, a standardized ratio of the variability between the sample means to the variability within the groups. The larger the variability between group means relative to the variability within groups, the larger F will be, and the stronger the evidence against the null hypothesis.
# ANOVA for the mean income in constant dollars grouped by partyid
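In symbols, the ratio can be sketched as below. The names SSG, SSE, MSG, MSE, k and n are conventional notation assumed here, not taken from the text.

```latex
F = \frac{MSG}{MSE} = \frac{SSG / (k - 1)}{SSE / (n - k)}
```

Here SSG is the sum of squares between groups, SSE the sum of squares within groups, k the number of groups (8 party IDs, so k - 1 = 7) and n the number of cases with both variables recorded.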
anova(lm(coninc ~ partyid, data=gss))
## Analysis of Variance Table
##
## Response: coninc
##              Df     Sum Sq    Mean Sq F value    Pr(>F)
## partyid       7 1.7462e+12 2.4946e+11  198.42 < 2.2e-16 ***
## Residuals 51049 6.4182e+13 1.2573e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA reports an F statistic of 198.42 and a p-value of approximately zero (< 2.2e-16). This means that the probability of observing an F value of 198.42 or higher, if the null hypothesis is true, is very low. So we reject the null hypothesis and conclude that mean family income in constant dollars differs significantly among the groups.
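The F value can be checked by hand from the sums of squares and degrees of freedom printed in the ANOVA table above. The variable names below are illustrative, and the inputs are the rounded values from that table, so the result matches only up to rounding.

```r
# Values copied (rounded) from the ANOVA table above.
ss_group <- 1.7462e12   # Sum Sq for partyid
df_group <- 7           # Df for partyid
ss_resid <- 6.4182e13   # Sum Sq for Residuals
df_resid <- 51049       # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom.
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_group / ms_resid

# Upper-tail probability of the F distribution with (7, 51049) degrees of freedom.
p_value <- pf(f_value, df_group, df_resid, lower.tail = FALSE)

print(f_value)
print(p_value)
```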
Since the null hypothesis has been rejected, we can do pairwise comparisons to find out which groups have different means. For every possible pair of groups (28 pairs), we use a t test of the null hypothesis that the means of the two groups are equal against the alternative hypothesis that they are different.
To avoid inflating the Type I error rate (rejecting a true null hypothesis), we apply a Bonferroni correction: each p-value is multiplied by the number of comparisons. With this correction, the difference between two means has to be larger for the null hypothesis to be rejected.
# Pairwise t test for the mean income in constant dollars grouped by partyid
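The correction itself is simple arithmetic, sketched here with base R's `p.adjust`. The raw p-values in `p_raw` are made up for illustration and are not results from this study.

```r
# Number of pairwise comparisons among 8 groups.
n_pairs <- choose(8, 2)

# Hypothetical unadjusted p-values, made up for illustration.
p_raw <- c(0.0001, 0.004, 0.03, 0.2)

# Bonferroni: multiply each p-value by the number of comparisons, capped at 1.
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_pairs)

# Equivalent view: compare raw p-values with alpha divided by the number of comparisons.
alpha_star <- 0.05 / n_pairs

print(n_pairs)
print(p_adj)
print(p_raw < alpha_star)
```

Comparing adjusted p-values with 0.05 and comparing raw p-values with 0.05 / 28 lead to the same decisions.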
# With Bonferroni correction
pairwise.t.test(gss$coninc, gss$partyid, p.adj="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: gss$coninc and gss$partyid
##
##                    Strong Democrat Not Str Democrat Ind,Near Dem
## Not Str Democrat   4.7e-13         -                -
## Ind,Near Dem       1.1e-14         1.00000          -
## Independent        1.00000         9.8e-07          9.8e-09
## Ind,Near Rep       < 2e-16         < 2e-16          < 2e-16
## Not Str Republican < 2e-16         < 2e-16          < 2e-16
## Strong Republican  < 2e-16         < 2e-16          < 2e-16
## Other Party        3.0e-07         0.15624          1.00000
##                    Independent Ind,Near Rep Not Str Republican
## Not Str Democrat   -           -            -
## Ind,Near Dem       -           -            -
## Independent        -           -            -
## Ind,Near Rep       < 2e-16     -            -
## Not Str Republican < 2e-16     0.00345      -
## Strong Republican  < 2e-16     7.9e-15      1.9e-06
## Other Party        2.3e-05     0.25489      0.00012
##                    Strong Republican
## Not Str Democrat   -
## Ind,Near Dem       -
## Independent        -
## Ind,Near Rep       -
## Not Str Republican -
## Strong Republican  -
## Other Party        8.2e-11
##
## P value adjustment method: bonferroni
We can see that only 5 of the 28 adjusted p-values are not below the significance level of 0.05: Strong Democrat vs Independent, Not Str Democrat vs Ind,Near Dem, Not Str Democrat vs Other Party, Ind,Near Dem vs Other Party, and Ind,Near Rep vs Other Party. For the other 23 pairs, the null hypothesis of equal means is rejected.
Due to the nature of the variables in our study, we used only the ANOVA test, so there are no results from another method to compare against.
The study explored the relationship between people’s political stand and their family income. Although we did not find a clear positive or negative relation between the two variables, hypothesis testing via ANOVA and pairwise comparisons shows that mean family income differs significantly between most pairs of groups. So the two variables appear to be associated. However, the outliers seen during exploratory data analysis, the ANOVA conditions that were not fully met, and the possibility that other variables are strongly correlated with income mean that we need to be cautious in interpreting the results.
Future research could address these shortcomings by analyzing the interaction of other variables and by using more sophisticated statistical techniques whose conditions are fully met.
Citation for the original data:
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1
Example of the data used in the study
head(gss)[,c(27,29)]
##   coninc          partyid
## 1  25926     Ind,Near Dem
## 2  33333 Not Str Democrat
## 3  33333      Independent
## 4  41667 Not Str Democrat
## 5  69444  Strong Democrat
## 6  60185     Ind,Near Dem