1. Introduction

The project studies the relation between one’s political stand and family income in constant dollars.

People’s political stand is influenced by their family background, which is closely tied to family income. This study therefore explores whether family income is related to one’s political stand and, if so, what form that relation takes.

2. Data

The study uses General Social Survey (GSS) data for the years 1972-2012.

General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University).

The R dataset can be downloaded from http://bit.ly/dasi_gss_data.

load(url("http://bit.ly/dasi_gss_data"))

2.1 Data collection

The study spans 40 years and nearly every decade the collection process was modified (see http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf for details).

The data were collected through household interviews in metropolitan and rural areas of the United States. Multiple levels of stratification by region, race, age, income and sex were employed to obtain a representative random sample. About 1500-2000 cases were collected each year, with a slight increase in recent years.

2.2 Cases

The cases are adult residents of the United States, interviewed in their households.

2.3 Variables

Party ID:

Answer to the question: “Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?”.

Type of variable: categorical, ordinal.

summary(gss$partyid)

##    Strong Democrat   Not Str Democrat       Ind,Near Dem 
##               9117              12040               6743 
##        Independent       Ind,Near Rep Not Str Republican 
##               8499               4921               9005 
##  Strong Republican        Other Party               NA's 
##               5548                861                327

str(gss$partyid)

##  Factor w/ 8 levels "Strong Democrat",..: 3 2 4 2 1 3 3 3 1 1 ...

Family Income in Constant Dollars:

Inflation-adjusted family income.

Type of variable: numerical, continuous.

summary(gss$coninc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829

str(gss$coninc)

##  int [1:57061] 25926 33333 33333 41667 69444 60185 50926 18519 3704 25926 ...

2.4. Study

The study consists of interviews with a random sample of United States residents about their economic condition, working status, health, beliefs, etc. No treatment is assigned to the respondents, so the study is observational.

2.5 Scope of inference - generalizability

The population of interest is all adult US residents. The study employed random sampling, so the results can be generalized to that population.

2.6 Scope of inference - causality

The study is observational, so we can only establish association but not causal links between the variables of interest.

3. Exploratory data analysis

The dataset, restricted to the partyid and coninc columns and with NA values removed, has 50393 cases.
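
The filtering step described above can be sketched as follows. This is a minimal illustration on a made-up toy data frame (the object names `toy` and `pair` and all values are hypothetical, not GSS records); the same two lines would apply to gss if it is loaded.

```r
# Hypothetical toy data standing in for the gss data frame (values are made up)
toy <- data.frame(
  partyid = factor(c("Independent", NA, "Strong Democrat", "Independent", "Other Party")),
  coninc  = c(25926L, 33333L, NA, 41667L, 3704L),
  age     = c(30, 41, 52, 63, 29)
)

# Keep only the two variables of interest, then drop rows with an NA in either
pair <- toy[, c("partyid", "coninc")]
pair <- pair[complete.cases(pair), ]

nrow(pair)  # number of complete cases
```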

partyid:

partyid is a categorical variable. We summarize it with a frequency table and a bar plot.

table(gss$partyid)

## 
##    Strong Democrat   Not Str Democrat       Ind,Near Dem 
##               9117              12040               6743 
##        Independent       Ind,Near Rep Not Str Republican 
##               8499               4921               9005 
##  Strong Republican        Other Party 
##               5548                861

plot(gss$partyid)

We can see that Not Str Democrat has the most instances (12040), followed by Strong Democrat (9117) and Not Str Republican (9005).

Family Income in constant USD:

Family Income in constant USD is a numerical continuous variable. We summarize it with summary statistics and a histogram.

summary(gss$coninc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829

hist(gss$coninc)

We can see that the distribution is right skewed.

Our research question is whether there is a relation between political stand and family income. To explore the relationship between a categorical and a numerical variable, we plot family income by party identification with ggplot2.

library(ggplot2)
# Stacked bars: each bar's height is the sum of coninc over all respondents in the group
ggplot(gss, aes(x = partyid, y = coninc)) + geom_bar(stat = "identity") +
  scale_x_discrete(labels = c("Str Dem", "Not Str Dem", "Ind, Near Dem", "Ind", "Ind, Near Rep", "Not Str Rep", "Str Rep", "Others", "NA"))

## Warning: Removed 5829 rows containing missing values (position_stack).

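The plot does not show a clear positive or negative trend in family income across the political spectrum. The bars for not strong democrats and not strong republicans are the tallest, but each bar stacks the incomes of all respondents in a group, so its height reflects group size as well as income level. The plot alone therefore does not show that these groups have higher family incomes.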
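
One way to compare income levels across groups without the influence of group size is to summarize each group by its mean or median. The sketch below uses made-up toy data (the group names A and B and all values are hypothetical, not GSS figures) to show how the sum and the mean can rank groups differently.

```r
# Hypothetical toy data: group A is larger, group B has higher incomes
toy <- data.frame(
  partyid = factor(c(rep("A", 4), rep("B", 2))),
  coninc  = c(10000, 20000, 30000, 40000, 40000, 50000)
)

# Sum per group: what bars built with stat = "identity" add up to
group_sum <- tapply(toy$coninc, toy$partyid, sum, na.rm = TRUE)

# Mean and median per group: comparable across groups of different size
group_mean   <- tapply(toy$coninc, toy$partyid, mean, na.rm = TRUE)
group_median <- tapply(toy$coninc, toy$partyid, median, na.rm = TRUE)

print(rbind(sum = group_sum, mean = group_mean, median = group_median))
```

In this toy example group A has the larger total but the smaller mean, which is the pattern a summed bar chart can hide.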

4. Inference

4.1. State hypothesis

The study explores whether there is a statistically significant difference in the mean family income in constant dollars of United States residents across political stands.

In statistical inference terms, we test a null hypothesis (H0) where mean family income is equal for all political groups, and an alternative hypothesis (HA) where at least one group is different from others.

H0: the mean family income (µ) is the same for every political group, i.e. µ1 = µ2 = … = µ8

HA: the mean family income in constant dollars differs for at least one pair of groups

4.2. Check conditions

Three conditions need to be checked for ANOVA test.

Independence: GSS data come from a random sample of less than 10% of the population, so the observations can be considered independent.
Approximately normal: normal probability plots for each political group are shown below. Each group deviates somewhat from a normal distribution, especially in the upper quantiles.

# 8 graphs in 2 rows
par(mfrow = c(2, 4))
# Draw a normal QQ plot of coninc for each of the 8 partyid groups
political <- levels(gss$partyid)
for (p in political) {
  income <- gss$coninc[which(gss$partyid == p)]  # which() skips rows with NA partyid
  qqnorm(income, main = p)
  qqline(income)
}

Constant variance: we check variability with the boxplot below. The total range and interquartile range of the 8 groups are roughly similar, with some differences between the Democratic and Republican groups.

par(mfrow = c(1, 1))
par(mar=c(8,5,4,4))
boxplot(gss$coninc ~ gss$partyid, las= 2, main="Family Income in constant USD by political stand")
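
The visual check can be complemented with a numerical one. The sketch below computes the standard deviation and interquartile range per group on simulated toy data (the groups A, B, C and the simulated incomes are hypothetical, not GSS values); the same tapply calls would work on gss$coninc and gss$partyid.

```r
# Hypothetical right-skewed toy incomes for three groups (simulated, not GSS data)
set.seed(1)
toy <- data.frame(
  partyid = factor(rep(c("A", "B", "C"), each = 50)),
  coninc  = c(rexp(50, 1 / 30000), rexp(50, 1 / 40000), rexp(50, 1 / 50000))
)

# Spread of income within each group
group_sd  <- tapply(toy$coninc, toy$partyid, sd, na.rm = TRUE)
group_iqr <- tapply(toy$coninc, toy$partyid, IQR, na.rm = TRUE)

# Ratio of the largest to the smallest group SD; values near 1 mean similar spread
sd_ratio <- max(group_sd) / min(group_sd)

print(round(rbind(sd = group_sd, iqr = group_iqr)))
print(sd_ratio)
```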

Although the conditions of normality and constant variance are not fully met, we will use ANOVA for our hypothesis test and note the resulting uncertainty in the final results.

4.3. State the method to be used and why and how

Since we are comparing the mean of a numerical variable across the levels of a categorical variable with more than 2 levels, we use an ANOVA test to check whether the means of the 8 groups are equal. If we can reject the null hypothesis, pairwise comparisons can then be conducted with the Bonferroni method to control the Type I error rate.

ANOVA uses the F statistic, a standardized ratio of the variability between the sample means to the variability within the groups. The larger the variability between sample means relative to the within-group variability, the larger F will be, and the stronger the evidence against the null hypothesis.
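
To make the ratio concrete, the sketch below computes F by hand as the mean square between groups divided by the mean square within groups, and compares it with the value reported by anova(). It runs on simulated toy data (the groups A, B, C, the variable y and all values are hypothetical), not on the GSS data.

```r
# Hypothetical toy data: three groups of unequal size with slightly different means
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), times = c(20, 30, 25))),
  y     = c(rnorm(20, mean = 10), rnorm(30, mean = 11), rnorm(25, mean = 12))
)

k <- nlevels(toy$group)  # number of groups
n <- nrow(toy)           # number of observations

group_mean <- tapply(toy$y, toy$group, mean)
group_n    <- tapply(toy$y, toy$group, length)

# Between-group sum of squares: group means around the grand mean
ssg <- sum(group_n * (group_mean - mean(toy$y))^2)
# Within-group sum of squares: observations around their own group mean
sse <- sum((toy$y - ave(toy$y, toy$group))^2)

# F is the ratio of the two mean squares
f_manual <- (ssg / (k - 1)) / (sse / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Same quantity from the built-in ANOVA table
f_builtin <- anova(lm(y ~ group, data = toy))[["F value"]][1]

print(c(manual = f_manual, builtin = f_builtin))
```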

4.4. Perform inference

# ANOVA for the mean income in constant dollars grouped by partyid
anova(lm(coninc ~ partyid, data=gss))

## Analysis of Variance Table
## 
## Response: coninc
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## partyid       7 1.7462e+12 2.4946e+11  198.42 < 2.2e-16 ***
## Residuals 51049 6.4182e+13 1.2573e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.5. Interpret results

ANOVA reports an F statistic of 198 and a p-value of approximately zero (< 2.2e-16). This means that the probability of observing an F value of 198 or higher, if the null hypothesis is true, is very low. So we can reject the null hypothesis and conclude that mean family income in constant dollars differs significantly among the groups.
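
As a consistency check, the F statistic and its p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table above. The variable names below are illustrative; the numbers are the rounded values from that table, so the result matches the reported F only up to rounding.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_group <- 1.7462e12   # Sum Sq for partyid
ss_resid <- 6.4182e13   # Sum Sq for Residuals
df_group <- 7
df_resid <- 51049

# Mean squares and their ratio
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid
f_stat   <- ms_group / ms_resid

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df1 = df_group, df2 = df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value < 2.2e-16)
```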

Since the null hypothesis has been rejected, we can do pairwise comparisons to find out which groups have different means. For every possible pair of groups (28 pairs), we use a t test to test the null hypothesis that the means of the two groups are equal against the alternative hypothesis that they are different.

To avoid inflating the Type I error rate (rejecting a true null hypothesis), we apply a Bonferroni correction: each p-value is multiplied by the number of comparisons. With this correction, the difference between the means has to be larger to reject the null hypothesis.
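
The correction itself is simple arithmetic, sketched below. The raw p-values are made-up illustrative numbers, not results from this study; the count of 28 comparisons follows from choosing 2 groups out of 8.

```r
# Number of pairwise comparisons among 8 groups
n_groups <- 8
n_pairs  <- choose(n_groups, 2)

# Hypothetical raw (unadjusted) p-values for four of the comparisons
raw_p <- c(0.0001, 0.001, 0.01, 0.04)

# Bonferroni: multiply by the number of comparisons and cap at 1
adj_manual <- pmin(1, raw_p * n_pairs)

# Same adjustment with the built-in function; n is the total number of comparisons
adj_builtin <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)

print(n_pairs)
print(rbind(raw = raw_p, adjusted = adj_manual))
```

A raw p-value of 0.01 would be significant at the 0.05 level on its own, but after multiplication by 28 it is not.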

# Pairwise t test for the mean income in constant dollars grouped by partyid
# With Bonferroni correction
pairwise.t.test(gss$coninc, gss$partyid, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  gss$coninc and gss$partyid 
## 
##                    Strong Democrat Not Str Democrat Ind,Near Dem
## Not Str Democrat   4.7e-13         -                -           
## Ind,Near Dem       1.1e-14         1.00000          -           
## Independent        1.00000         9.8e-07          9.8e-09     
## Ind,Near Rep       < 2e-16         < 2e-16          < 2e-16     
## Not Str Republican < 2e-16         < 2e-16          < 2e-16     
## Strong Republican  < 2e-16         < 2e-16          < 2e-16     
## Other Party        3.0e-07         0.15624          1.00000     
##                    Independent Ind,Near Rep Not Str Republican
## Not Str Democrat   -           -            -                 
## Ind,Near Dem       -           -            -                 
## Independent        -           -            -                 
## Ind,Near Rep       < 2e-16     -            -                 
## Not Str Republican < 2e-16     0.00345      -                 
## Strong Republican  < 2e-16     7.9e-15      1.9e-06           
## Other Party        2.3e-05     0.25489      0.00012           
##                    Strong Republican
## Not Str Democrat   -                
## Ind,Near Dem       -                
## Independent        -                
## Ind,Near Rep       -                
## Not Str Republican -                
## Strong Republican  -                
## Other Party        8.2e-11          
## 
## P value adjustment method: bonferroni

We can see that only 5 of the 28 adjusted p-values are not below the significance level of 0.05, so the null hypothesis of equal means is rejected for the other 23 pairs of groups.
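
Instead of counting by eye, the adjusted p-values can be counted from the object returned by pairwise.t.test, whose p.value component is a lower-triangular matrix with NA in the unused cells. The sketch below runs on simulated toy data with four hypothetical groups (A to D), not on the GSS data.

```r
# Hypothetical toy data: four groups, two of them with the same true mean
set.seed(7)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 40)),
  y     = c(rnorm(40, mean = 0), rnorm(40, mean = 0),
            rnorm(40, mean = 2), rnorm(40, mean = 4))
)

res   <- pairwise.t.test(toy$y, toy$group, p.adjust.method = "bonferroni")
p_mat <- res$p.value  # lower-triangular matrix of adjusted p-values

# Number of comparisons made and number significant at the 0.05 level
n_tested <- sum(!is.na(p_mat))
n_signif <- sum(p_mat < 0.05, na.rm = TRUE)

print(c(tested = n_tested, significant = n_signif))
```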

4.6. If applicable, state whether results from various methods agree

Given the nature of the variables in our study, ANOVA was the only method used, so there are no results from other methods to compare.

5. Conclusion

The study explored the relationship between people’s political stand and their family income. Although we did not find a consistently positive or negative relation between the two variables, hypothesis testing via ANOVA and pairwise comparisons shows that mean family income differs significantly between most pairs of groups (23 of 28). So the two variables appear to be associated. However, the outliers seen during the exploratory data analysis and the fact that some ANOVA conditions are not fully met call for caution in interpreting the results. Because the study is observational, other variables correlated with income may also explain part of the association.

Future research could address these shortcomings by analyzing the interaction of other variables and by using statistical techniques whose conditions are better satisfied by the data.

6. References

Citation for the original data:

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

7. Appendix

Example of the data used in the study

head(gss[, c("coninc", "partyid")])

##   coninc          partyid
## 1  25926     Ind,Near Dem
## 2  33333 Not Str Democrat
## 3  33333      Independent
## 4  41667 Not Str Democrat
## 5  69444  Strong Democrat
## 6  60185     Ind,Near Dem

Data Analysis and Statistical Inference Project Analysis

Cynthia

Tuesday, April 07, 2015