Introduction:

This project analyzes the relationship between a person’s family income and their political affiliation.

A person’s political affiliation can be influenced by a variety of factors, such as their ideology, the party their family supports, their geographical location, and their income. In this project we aim to analyze how a person’s political affiliation relates to their household income.

This analysis will help us determine the characteristics of voters for each political party.

Data:

This study utilizes the General Social Survey (GSS) data.

The General Social Survey (GSS) is a sociological survey used to collect data on the demographic characteristics and attitudes of residents of the United States, covering a variety of topics. The variables are described in the following codebook: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

The data was modified for the Data Analysis and Statistical Inference course (Duke University).

How this data was collected

The survey has been conducted for the past 40 years, and the data collection process has been modified every few years. The detailed collection procedure is described at the following link: http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf

Exploratory data analysis:

We have only considered a subset of the gss data set that contains a person’s political affiliation (partyid) and their family income in constant USD (coninc).
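The subsetting step is not shown in the report, so the following is only a minimal sketch of how it could be done. The data frame `gss_toy` and its values are hypothetical stand-ins for the real `gss` data, and the explicit removal of missing values is an assumption; `lm()` drops incomplete rows by default, so the later ANOVA does not strictly require it.

```r
# Toy stand-in for the real GSS data frame (hypothetical values, illustration only)
gss_toy <- data.frame(
  partyid = factor(c("Strong Democrat", "Independent", NA, "Strong Republican")),
  coninc  = c(25000, NA, 40000, 61000),
  age     = c(34, 51, 27, 45)
)

# Keep only the two variables of interest
gss_sub <- gss_toy[, c("partyid", "coninc")]

# Drop rows where either variable is missing (assumed step, not in the original)
gss_complete <- gss_sub[complete.cases(gss_sub), ]

nrow(gss_complete)  # number of complete rows left in the toy data
```

With the real data, replacing `gss_toy` with `gss` should give the same two-column subset.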

partyid is a categorical variable with 8 levels.

summary(gss$partyid)
##    Strong Democrat   Not Str Democrat       Ind,Near Dem 
##               9117              12040               6743 
##        Independent       Ind,Near Rep Not Str Republican 
##               8499               4921               9005 
##  Strong Republican        Other Party               NA's 
##               5548                861                327
plot(gss$partyid)

coninc, in contrast, is a continuous numerical variable.

summary(gss$coninc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     383   18440   35600   44500   59540  180400    5829
hist(gss$coninc)

We can check whether there is any relation between the two variables by making a box plot of income for each level of partyid.

plot(gss$coninc~gss$partyid)

We can see that there is no clear positive or negative trend across the party categories; however, the median income for a few groups is higher than for the others, and this needs to be evaluated further.

Inference:

We want to find out whether mean household income differs significantly between people who support different parties.

For the null hypothesis, we state that there is no difference in average income across people supporting different political parties, i.e. H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7 = µ8

The alternative hypothesis states that at least one group has an average household income different from the other groups.

Prerequisite conditions for ANOVA hypothesis testing:

  1. Independence: The GSS surveys around 57,000 people, which is definitely less than 10% of the population of the US. Hence the observations can be treated as independent. The groups are also independent of each other, as a person falls into exactly one group and cannot be a member of more than one.

  2. Normality: We can check normality by drawing a normal probability (Q-Q) plot of income for each of the groups.

# Arrange the 8 plots in a 2 x 4 grid (2 rows, 4 plots per row)
par(mfrow = c(2, 4))
# Define the 8 groups and check the normality of each group with a normal Q-Q plot
party <- c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem", "Independent", "Ind,Near Rep", "Not Str Republican", "Strong Republican", "Other Party")
for (i in 1:8) {
  # which() skips rows where partyid is NA
  income_i <- gss$coninc[which(gss$partyid == party[i])]
  qqnorm(income_i, main = party[i])
  qqline(income_i)
}

We can see that the plots are primarily normal; however, there is some deviation from normality near the upper ranges of the plots.

  3. Constant Variance: We can check the variability between the groups by drawing side-by-side box plots.
par(mfrow = c(1, 1))
boxplot(gss$coninc ~ gss$partyid, main="Household Income by Political Affiliation", las= 2)

We can see that the groups have similar variability; however, a few groups have a higher median household income.
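The visual check can be backed by a numeric one. The sketch below compares the standard deviation of each group on hypothetical toy data (`income` and `party` are illustrative names, not part of the original analysis). The idea that a largest-to-smallest ratio near or below 2 is acceptable is a common rule of thumb and is assumed here, not taken from the report.

```r
# Toy data (hypothetical values); with the real data use gss$coninc and gss$partyid
income <- c(10, 20, 30, 15, 25, 35, 40, 60, 80)
party  <- factor(rep(c("A", "B", "C"), each = 3))

# Standard deviation of income within each group, ignoring missing values
group_sd <- tapply(income, party, sd, na.rm = TRUE)

# Ratio of the largest to the smallest group standard deviation
ratio <- max(group_sd) / min(group_sd)

print(group_sd)
print(ratio)
```

A ratio far above 2 would suggest that the constant-variance condition deserves a closer look.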

ANOVA Test:

anova(lm(coninc ~ partyid, data=gss))
## Analysis of Variance Table
## 
## Response: coninc
##              Df     Sum Sq    Mean Sq F value    Pr(>F)    
## partyid       7 1.7462e+12 2.4946e+11  198.42 < 2.2e-16 ***
## Residuals 51049 6.4182e+13 1.2573e+09                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA gives us an F statistic of 198.42 and a very low p value (< 2.2e-16), which means we can reject the null hypothesis and state that mean household income does vary with a person’s political affiliation.
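To show where the F statistic comes from, the sketch below recomputes it from the sums of squares and degrees of freedom printed in the ANOVA table. The variable names are illustrative, and because the table values are rounded, the result only approximately matches the reported 198.42.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_group <- 1.7462e12   # Sum Sq for partyid
df_group <- 7           # Df for partyid
ss_resid <- 6.4182e13   # Sum Sq for Residuals
df_resid <- 51049       # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# The F statistic is the ratio of the two mean squares
f_stat <- ms_group / ms_resid

# Upper-tail probability of the F distribution gives the p value
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```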

We can find out which groups have different means by conducting pairwise comparisons between the groups. For each pair of groups we run a t test of the null hypothesis that the two groups have the same average income.

pairwise.t.test(gss$coninc, gss$partyid)
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  gss$coninc and gss$partyid 
## 
##                    Strong Democrat Not Str Democrat Ind,Near Dem
## Not Str Democrat   2.4e-13         -                -           
## Ind,Near Dem       5.8e-15         0.15805          -           
## Independent        0.15805         3.5e-07          4.2e-09     
## Ind,Near Rep       < 2e-16         < 2e-16          < 2e-16     
## Not Str Republican < 2e-16         < 2e-16          < 2e-16     
## Strong Republican  < 2e-16         < 2e-16          < 2e-16     
## Other Party        1.2e-07         0.02790          0.12285     
##                    Independent Ind,Near Rep Not Str Republican
## Not Str Democrat   -           -            -                 
## Ind,Near Dem       -           -            -                 
## Independent        -           -            -                 
## Ind,Near Rep       < 2e-16     -            -                 
## Not Str Republican < 2e-16     0.00074      -                 
## Strong Republican  < 2e-16     4.5e-15      6.2e-07           
## Other Party        6.4e-06     0.03641      3.1e-05           
##                    Strong Republican
## Not Str Democrat   -                
## Ind,Near Dem       -                
## Independent        -                
## Ind,Near Rep       -                
## Not Str Republican -                
## Strong Republican  -                
## Other Party        3.8e-11          
## 
## P value adjustment method: holm

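We can see that the adjusted p value is lower than 0.05 for all pairs of groups except three: Strong Democrat vs. Independent, Not Str Democrat vs. Ind,Near Dem, and Ind,Near Dem vs. Other Party. Hence, we can reject the null hypothesis of equal mean income for all the other pairs.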
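Rather than reading the non-significant pairs off the printed table, they can be extracted from the object returned by `pairwise.t.test`. The sketch below does this on hypothetical toy data (`toy_income` and `toy_party` are illustrative names); groups A and B are given identical values so that exactly one pair is not significant.

```r
# Toy data (hypothetical): A and B are identical, C is far higher
toy_income <- c(1, 2, 3, 4, 5,  1, 2, 3, 4, 5,  101, 102, 103, 104, 105)
toy_party  <- factor(rep(c("A", "B", "C"), each = 5))

# Pairwise t tests with pooled SD and Holm adjustment (the function's defaults)
res <- pairwise.t.test(toy_income, toy_party)
pm  <- res$p.value  # lower-triangular matrix of adjusted p values

# Positions of pairs whose adjusted p value exceeds 0.05 (NA cells are skipped)
idx <- which(pm > 0.05, arr.ind = TRUE)

not_significant <- data.frame(
  group_1    = rownames(pm)[idx[, "row"]],
  group_2    = colnames(pm)[idx[, "col"]],
  p_adjusted = pm[idx]
)
print(not_significant)
```

With the real data, replacing the toy vectors with `gss$coninc` and `gss$partyid` should list the pairs for which the null hypothesis cannot be rejected.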

Conclusion:

The study analyzed the relationship between a person’s political affiliation and their household income. After conducting an ANOVA test and pairwise comparisons between the groups, we found that there is a relationship between the two variables. However, a few outliers found while checking the conditions for ANOVA mean that we need to be cautious in interpreting the results of the tests. Treating these outliers could further improve the analysis and give more reliable results.

References:

Citation for the original data:

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1 Persistent URL: http://doi.org/10.3886/ICPSR34802.v1