Introduction
This document examines whether there is an association between two variables: political affiliation and income. Specifically, the analysis asks whether a respondent's political affiliation depends on income.
Data Assumptions:
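The data was collected through random sampling, so results from the sample can be generalized to the population. Because this is an observational survey with no random assignment, the analysis can show an association between the dependent and the independent variables, but it cannot establish a causal relationship.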
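library(dplyr)  # provides %>%, select, filter, group_by, mutate
load("gss.Rdata")
# Replace missing values in x with the mean of its non-missing values
impute.mean <- function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}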
gss.sub <- gss %>% select(partyid, coninc)
gss.sub$partyid <- as.character(gss.sub$partyid)
# Analyzing the NAs for partyid. They make up less than 1% of the rows, so they are omitted.
gss.party.na <- gss.sub[is.na(gss.sub$partyid), ]
gss.sub <- gss.sub[!is.na(gss.sub$partyid), ]
# Rows with a missing income, kept for inspection
gss.inc.na <- gss.sub[is.na(gss.sub$coninc), ]
# Grouping the respondents into two groups: Democrats or Republicans
democrat.ids <- c("Ind,Near Dem", "Not Str Democrat", "Strong Democrat")
republican.ids <- c("Ind,Near Rep", "Not Str Republican", "Strong Republican")
gss.sub <- gss.sub %>%
  filter(partyid != "Other Party" & partyid != "Independent") %>%
  group_by(partyid) %>%
  # Impute missing incomes with the mean income of the respondent's partyid group
  mutate(income = impute.mean(coninc)) %>%
  mutate(party = ifelse(
    partyid %in% democrat.ids, "Democrat",
    ifelse(partyid %in% republican.ids, "Republican", partyid)
  ))
Data munging:
1. People who do not lean towards Republicans or Democrats are omitted.
2. People who are independents but lean towards one major party are grouped with that party.
3. Missing incomes are replaced with the mean income of the respondent's partyid group.
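The grouping rule can be sketched on a small toy vector. The labels mirror the partyid levels used in this analysis, but the sample values and the name `party_lookup` are illustrative assumptions and do not appear in the original code. A named lookup vector is one possible alternative to nested `ifelse()` calls, not the method used above.

```r
# Toy partyid labels (illustrative values, not the real GSS data)
partyid <- c("Strong Democrat", "Ind,Near Rep", "Independent",
             "Not Str Republican", "Other Party", "Ind,Near Dem")

# Hypothetical lookup table: leaners are folded into the major party they lean towards
party_lookup <- c(
  "Strong Democrat"    = "Democrat",
  "Not Str Democrat"   = "Democrat",
  "Ind,Near Dem"       = "Democrat",
  "Strong Republican"  = "Republican",
  "Not Str Republican" = "Republican",
  "Ind,Near Rep"       = "Republican"
)

# Labels missing from the lookup ("Independent", "Other Party") become NA
party <- unname(party_lookup[partyid])

# Omit respondents who do not lean towards either major party
kept <- party[!is.na(party)]
print(table(kept))
```

Keeping the mapping in one table makes it easy to see which labels are merged and which are dropped.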
Observation
In order to draw a conclusion about the population based on the sample, a hypothesis test needs to be done.
Conditions for hypothesis testing:
We will run a hypothesis test to see whether mean income differs by party affiliation. Since we have two independent groups and we are comparing their means, we will use the two-sample t-test.
Null Hypothesis: The difference in the population mean of incomes between Republicans and Democrats is zero.
Alternate Hypothesis: The difference in the population mean of incomes between Republicans and Democrats is not zero.
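As a rough sketch, the hypotheses and the test statistic can be written out as follows. The symbols are assumptions introduced here for illustration: subscript D denotes Democrats and R denotes Republicans, with \bar{x} the sample mean, s the sample standard deviation, and n the group size.

```latex
\begin{align*}
H_0 &: \mu_D - \mu_R = 0 \\
H_A &: \mu_D - \mu_R \neq 0 \\
t &= \frac{(\bar{x}_D - \bar{x}_R) - 0}{\sqrt{\dfrac{s_D^2}{n_D} + \dfrac{s_R^2}{n_R}}}
\end{align*}
```

The denominator is the standard error of the difference in sample means, which does not assume equal variances in the two groups.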
Steps:
group1 <-
gss.sub %>% ungroup(partyid) %>% filter(party == "Democrat") %>% select(one_of("income"))
group2 <-
gss.sub %>% ungroup(partyid) %>% filter(party == "Republican") %>% select(one_of("income"))
t.test(
group1$income,
group2$income,
mu = 0,
alternative = "two.sided",
paired = FALSE,
conf.level = .95
)
##
## Welch Two Sample t-test
##
## data: group1$income and group2$income
## t = -34.205, df = 37777, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11837.62 -10554.49
## sample estimates:
## mean of x mean of y
## 40855.48 52051.53
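## Manual p-value calculation
group1_mean <- mean(group1$income)
group2_mean <- mean(group2$income)
group1_sd <- sd(group1$income)
group2_sd <- sd(group2$income)
# Conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1
# (t.test() above uses the Welch-Satterthwaite approximation instead)
df1 <- min(length(group1$income) - 1, length(group2$income) - 1)
# Standard error of the difference in sample means
se <-
sqrt(((group1_sd) ^ 2 / length(group1$income)) + ((group2_sd) ^ 2 / length(group2$income)))
# Observed difference in sample means (the null value is 0)
mu <- group1_mean - group2_mean
t_score <- mu / se
# Lower-tail probability; t_score is negative here, so the two-sided
# p-value is twice this value
pt(t_score, df = df1)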
## [1] 2.300031e-249
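The manual calculation can be extended to reproduce what `t.test()` reports. This is a minimal sketch under stated assumptions: the data are simulated and are not the GSS incomes, and the helper `welch_test` is a hypothetical name introduced here. It uses the Welch-Satterthwaite degrees of freedom and doubles the tail probability to get a two-sided p-value.

```r
set.seed(42)
# Simulated incomes for two groups (hypothetical values, not GSS data)
x <- rnorm(200, mean = 40000, sd = 30000)
y <- rnorm(150, mean = 52000, sd = 38000)

# Hypothetical helper: Welch two-sample t-test computed by hand
welch_test <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  v1 <- var(x) / n1  # squared standard error of mean(x)
  v2 <- var(y) / n2  # squared standard error of mean(y)
  se <- sqrt(v1 + v2)
  t_score <- (mean(x) - mean(y)) / se
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  # Two-sided p-value: twice the probability in one tail beyond |t|
  p_value <- 2 * pt(-abs(t_score), df = df)
  list(t = t_score, df = df, p.value = p_value)
}

manual <- welch_test(x, y)
builtin <- t.test(x, y)  # Welch test by default (var.equal = FALSE)

# Compare the hand-computed values with the built-in result
print(all.equal(manual$t, unname(builtin$statistic)))
print(all.equal(manual$df, unname(builtin$parameter)))
print(all.equal(manual$p.value, builtin$p.value))
```

With samples as large as the ones in this analysis, the conservative degrees of freedom and the Welch-Satterthwaite value lead to the same conclusion.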
Conclusion and Inference:
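The p-value is far below .05, so we reject the null hypothesis in favor of the alternate hypothesis: the data provide strong evidence that the population mean income differs between Democrats and Republicans. In this sample, Republicans have the higher mean income (52051.53 versus 40855.48 for Democrats). Because the data are observational, this indicates an association between party affiliation and income, not a causal relationship.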