GSS Statistical Analysis

Introduction

This document looks for an association between two variables: political affiliation and income. The analysis examines whether a respondent's political affiliation is related to their income.

Data Assumptions:

Family income is treated as representative of the individual respondent's income.
Respondents who do not strongly favor Republicans or Democrats, or who identify as independent but lean towards Republicans or Democrats, are assumed to support that party during elections.

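The data was collected through random sampling, so the results can be generalized to the population. Because this is an observational survey and not an experiment with random assignment, the analysis can show an association between the dependent and the independent variables but cannot establish a causal relationship.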

library(dplyr)  # provides %>%, select, filter, group_by and mutate used below

load("gss.Rdata")

# Replace missing values in x with the mean of the observed values
impute.mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

gss.sub <- gss %>% select(partyid, coninc)
gss.sub$partyid <- as.character(gss.sub$partyid)

# Analyzing the NAs for partyid. They make up less than 1% of the rows, so they are omitted.
gss.party.na <- gss.sub[is.na(gss.sub$partyid), ]
gss.sub <- gss.sub[!is.na(gss.sub$partyid), ]

# Rows with a missing income; these are imputed below and not dropped
gss.inc.na <- gss.sub[is.na(gss.sub$coninc), ]

# Grouping the respondents into two groups: Democrats or Republicans
gss.sub <- gss.sub %>%
  filter(partyid != "Other Party" & partyid != "Independent") %>%
  group_by(partyid) %>%
  # Replace missing incomes with the mean income of the same partyid group
  mutate(income = impute.mean(coninc)) %>%
  # Fold strong, weak and leaning identifiers into the two major parties
  mutate(party = case_when(
    partyid %in% c("Ind,Near Dem", "Not Str Democrat", "Strong Democrat") ~ "Democrat",
    partyid %in% c("Ind,Near Rep", "Not Str Republican", "Strong Republican") ~ "Republican",
    TRUE ~ partyid
  ))

Data munging:

People who do not lean towards Republicans or Democrats are omitted.
People who are independents but lean towards one major party are grouped with that party.
Missing incomes are replaced with the mean income of the respondent's partyid group.
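The grouping rules can be sanity-checked on a small example. The sketch below uses a made-up data frame (the `toy` rows and the `dem_labels` and `rep_labels` vectors are assumptions for illustration, not GSS records) and base R only, then cross-tabulates the original label against the recoded party to confirm that every remaining label landed where intended.

```r
# Toy data: hypothetical respondents, not taken from the GSS file
toy <- data.frame(
  partyid = c("Strong Democrat", "Ind,Near Dem", "Independent",
              "Ind,Near Rep", "Other Party", "Not Str Republican"),
  stringsAsFactors = FALSE
)

dem_labels <- c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem")
rep_labels <- c("Strong Republican", "Not Str Republican", "Ind,Near Rep")

# Rule 1: drop respondents who lean towards neither major party
toy <- toy[toy$partyid %in% c(dem_labels, rep_labels), , drop = FALSE]

# Rule 2: fold leaners into the major party they lean towards
toy$party <- ifelse(toy$partyid %in% dem_labels, "Democrat", "Republican")

# Cross-tabulate original label against recoded party
print(table(toy$partyid, toy$party))
```

Each row of the table should have a count in exactly one column; a label with counts in an unexpected column would point to a typo in the label vectors.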

Exploratory Analysis - Plotting to see the trend
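One way to look at the trend is a side-by-side boxplot of income by party. The sketch below is an assumption about how such a plot could be drawn, not the plot used for this report: it runs on simulated right-skewed incomes (the `demo` data frame and its parameters are made up) so that it is self-contained. With the real data, `demo` would be replaced by `gss.sub`.

```r
# Simulated stand-in for gss.sub: right-skewed incomes for two groups.
# The numbers are made up for illustration and are not GSS values.
set.seed(1)
demo <- data.frame(
  party = rep(c("Democrat", "Republican"), each = 500),
  income = c(rlnorm(500, meanlog = 10.3, sdlog = 0.8),
             rlnorm(500, meanlog = 10.6, sdlog = 0.8))
)

# Side-by-side boxplots show center, spread and outliers per party
boxplot(income ~ party, data = demo,
        main = "Family income by party",
        xlab = "Party", ylab = "Income")

# Numeric summaries to read alongside the plot
group_means <- tapply(demo$income, demo$party, mean)
group_medians <- tapply(demo$income, demo$party, median)
print(rbind(mean = group_means, median = group_medians))
```

A mean well above the median in a group is a quick numeric hint of right skew, which matters for the test conditions later on.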

Observation

People who support Republicans seem to have a higher income.
There are outliers in the distribution. We will not remove or transform them for now.

To determine whether the difference seen in the sample also holds for the population, a hypothesis test is needed.

Conditions for hypothesis testing:

The observations are independent of each other, as the survey uses random sampling.
The income distribution in each group is not normal; it is slightly right-skewed.
The sample size is large enough for the sampling distribution of the mean to be approximately normal despite the skew.
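The skew and sample-size conditions can be illustrated numerically. The following is a minimal sketch on simulated incomes, not GSS values; the `skewness` helper is a hypothetical function written for this example (base R has no built-in skewness), and the sample size of 200 is an arbitrary choice. It shows that the means of repeated samples are far less skewed than the raw data.

```r
# Sample skewness: third standardized moment (positive means right-skewed)
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(42)
incomes <- rlnorm(2000, meanlog = 10.5, sdlog = 0.8)  # made-up right-skewed incomes

raw_skew <- skewness(incomes)

# Approximate the sampling distribution of the mean by resampling:
# 2000 means, each from a sample of size 200 drawn with replacement
sample_means <- replicate(2000, mean(sample(incomes, 200, replace = TRUE)))
means_skew <- skewness(sample_means)

print(c(raw = raw_skew, sample_means = means_skew))
```

The larger the group, the closer the skewness of the sample means gets to zero, which is why a large sample can compensate for a skewed income distribution.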

Hypothesis testing

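We will run a hypothesis test to see whether party affiliation is associated with income. Since we have two independent groups and are comparing their means, we will use the two-sample t-test. R's t.test applies the Welch version by default, which does not assume equal variances in the two groups.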

Null Hypothesis: The difference in the population mean incomes of Republicans and Democrats is zero.

Alternate Hypothesis: The difference in the population mean incomes of Republicans and Democrats is not zero.

Steps:

Calculate the T score
Calculate the confidence interval
Calculate the P value
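The three steps can be written out as formulas. The notation is an assumption introduced here: subscripts D and R denote the Democrat and Republican groups, with sample means \bar{x}, sample standard deviations s, and sample sizes n; t^{*}_{df} is the critical value of the t distribution for a 95% interval, and t_{df} is a t-distributed random variable with df degrees of freedom.

```latex
\begin{aligned}
SE &= \sqrt{\frac{s_D^2}{n_D} + \frac{s_R^2}{n_R}} \\
T &= \frac{(\bar{x}_D - \bar{x}_R) - 0}{SE} \\
\text{CI}_{95\%} &= (\bar{x}_D - \bar{x}_R) \pm t^{*}_{df} \cdot SE \\
p &= 2 \, P\left(t_{df} \le -|T|\right)
\end{aligned}
```

The factor of 2 in the last line reflects the two-sided alternative: a difference in either direction counts as evidence against the null hypothesis.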

# Income columns for the two parties
group1 <-
  gss.sub %>% ungroup() %>% filter(party == "Democrat") %>% select(one_of("income"))
group2 <-
  gss.sub %>% ungroup() %>% filter(party == "Republican") %>% select(one_of("income"))

# Two-sample t-test of the null hypothesis that the difference in means is 0
t.test(
  group1$income,
  group2$income,
  mu = 0,
  alternative = "two.sided",
  paired = FALSE,
  conf.level = .95
)

## 
##  Welch Two Sample t-test
## 
## data:  group1$income and group2$income
## t = -34.205, df = 37777, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11837.62 -10554.49
## sample estimates:
## mean of x mean of y 
##  40855.48  52051.53

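## Manual p-value calculation

group1_mean <- mean(group1$income)
group2_mean <- mean(group2$income)

group1_sd <- sd(group1$income)
group2_sd <- sd(group2$income)

# Conservative degrees of freedom: the smaller sample size minus one.
# t.test() above uses the Welch approximation, so its df differs.
df1 <- min(length(group1$income) - 1, length(group2$income) - 1)

# Standard error of the difference in means
se <-
  sqrt(((group1_sd) ^ 2 / length(group1$income)) + ((group2_sd) ^ 2 / length(group2$income)))

# Observed difference in sample means (Democrat minus Republican)
mu <- group1_mean - group2_mean

t_score <- mu / se

# Lower-tail probability of the observed t score. t_score is negative here,
# so the two-sided p-value is twice this value: 2 * pt(-abs(t_score), df = df1)
pt(t_score, df = df1)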

## [1] 2.300031e-249
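The manual calculation uses a conservative degrees-of-freedom value, while t.test uses the Welch-Satterthwaite approximation. The sketch below shows how the built-in result could be reproduced by hand. `welch_manual` is a hypothetical helper written for this example, and the data is simulated (not GSS values) so the block runs on its own.

```r
# Welch two-sample t-test computed by hand (illustrative helper)
welch_manual <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  v1 <- var(x) / n1  # squared standard error contributed by x
  v2 <- var(y) / n2  # squared standard error contributed by y
  se <- sqrt(v1 + v2)
  diff_means <- mean(x) - mean(y)
  t_score <- diff_means / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  # Two-sided p-value: area in both tails beyond |t|
  p_value <- 2 * pt(-abs(t_score), df = df)
  # 95% confidence interval for the difference in means
  ci <- diff_means + c(-1, 1) * qt(0.975, df = df) * se
  list(t = t_score, df = df, p.value = p_value, conf.int = ci)
}

set.seed(7)
x <- rnorm(120, mean = 40, sd = 10)  # simulated data, not GSS values
y <- rnorm(150, mean = 45, sd = 14)

manual <- welch_manual(x, y)
builtin <- t.test(x, y)  # Welch is the default (var.equal = FALSE)

# Each comparison should print TRUE
print(all.equal(manual$t, unname(builtin$statistic)))
print(all.equal(manual$df, unname(builtin$parameter)))
print(all.equal(manual$p.value, builtin$p.value))
print(all.equal(manual$conf.int, as.numeric(builtin$conf.int)))
```

With samples as large as the ones in this analysis, the conservative and Welch degrees of freedom both give a t distribution that is nearly normal, so the choice has little effect on the conclusion.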

Conclusion and Inference:

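The p value is far below .05, so we reject the null hypothesis in favor of the alternate hypothesis. The 95% confidence interval puts the mean income of Democrats about 10,554 to 11,838 lower than that of Republicans (in the units of coninc). There is strong evidence of an association between party affiliation and income. Because the data is observational, this result does not establish a causal relationship.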