Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Joseph Antony | 08th January 2021

Part 1: Data

According to the GSS website, the General Social Survey (GSS) data are collected from an independently drawn sample of people aged 18 or older, living in non-institutional arrangements within the United States. The data were collected through personal interviews, with respondents chosen by random sampling in each survey year. Hence, the results can be generalized to the adult, non-institutionalized US population.

Because the GSS is an observational survey rather than an experiment with random assignment, it can show association but cannot be used to establish causality.


Part 2: Research question

Is there an association between race and political affiliation?

According to the news media, there are sizable and long-standing racial and ethnic differences in political affiliation. It would be interesting to see how the political leanings of each race have varied over the years in the US, and to test for an association between race and political affiliation.

The variables to be used are:

race: Race of respondent (Categorical Variable)

partyid: Political party affiliation (Categorical Variable)


Part 3: Exploratory data analysis

Let us first cross-tabulate partyid and race to get a sense of the data.

table(gss$partyid, gss$race)
##                     
##                      White Black Other
##   Strong Democrat     5692  3075   350
##   Not Str Democrat    9192  2176   672
##   Ind,Near Dem        5489   895   359
##   Independent         6767   962   770
##   Ind,Near Rep        4517   235   169
##   Not Str Republican  8450   297   258
##   Strong Republican   5276   141   131
##   Other Party          745    71    45

The race variable has 3 categories and the partyid variable has 8 categories. For simplicity, I will group “Strong Democrat”, “Not Str Democrat” and “Ind,Near Dem” as Democrats. Similarly, “Ind,Near Rep”, “Not Str Republican” and “Strong Republican” will be grouped as Republican.

After that, let’s visualize the data using a bar plot, since both variables are categorical.

#Creating a data frame called party that holds the partyid, race and year variables.

party <- gss %>%
  filter(!is.na(partyid), !is.na(race), !is.na(year)) %>%
  select(partyid, race, year)

#Creating the function to group Democrats and Republicans.

party_short <- function(word) {
  #Convert the factor value to its text label before comparing.
  short <- as.character(word)
  if (short %in% c("Strong Democrat", "Not Str Democrat", "Ind,Near Dem")) {
    return("Democrat")
  }
  if (short %in% c("Ind,Near Rep", "Not Str Republican", "Strong Republican")) {
    return("Republican")
  }
  #"Independent" and "Other Party" keep their original labels.
  short
}

party$partyid <- sapply(party$partyid, party_short)
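
The same grouping can also be expressed without an element-by-element function. The sketch below is one possible alternative, not the approach used in this analysis: it recodes the labels through a named lookup vector in base R. The names party_lookup, collapse_party and example_ids are hypothetical, and the input is a small made-up factor rather than the GSS data.

```r
# Hypothetical lookup table: each original partyid label maps to a collapsed group.
party_lookup <- c(
  "Strong Democrat"    = "Democrat",
  "Not Str Democrat"   = "Democrat",
  "Ind,Near Dem"       = "Democrat",
  "Independent"        = "Independent",
  "Ind,Near Rep"       = "Republican",
  "Not Str Republican" = "Republican",
  "Strong Republican"  = "Republican",
  "Other Party"        = "Other Party"
)

# collapse_party() is an illustrative helper, not part of the original analysis.
# Indexing by name recodes the whole vector at once; unknown labels become NA.
collapse_party <- function(x) {
  unname(party_lookup[as.character(x)])
}

# Small made-up input to show the behaviour.
example_ids <- factor(c("Strong Democrat", "Ind,Near Rep", "Independent", "Other Party"))
collapsed <- collapse_party(example_ids)
print(collapsed)
```

A lookup vector keeps the mapping in one place, so a missing or misspelled label shows up as NA instead of silently changing the shape of the result.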

table(party$partyid)
## 
##    Democrat Independent Other Party  Republican 
##       27900        8499         861       19474
table(party$partyid, party$race)
##              
##               White Black Other
##   Democrat    20373  6146  1381
##   Independent  6767   962   770
##   Other Party   745    71    45
##   Republican  18243   673   558
#plotting using ggplot2

ggplot(data = party, aes(x = partyid, fill = race)) + 
  geom_bar(position = "fill") + 
  labs(title = "Race & Political Affiliation", 
       y = "Proportion", x = "Political Party")

In the plot above, the ‘Independent’ and ‘Other Party’ groups show broadly similar racial compositions. A clear difference is noticeable between the ‘Democrat’ and ‘Republican’ groups: Black respondents make up about 22% of Democrats (6,146 of 27,900) but only about 3.5% of Republicans (673 of 19,474), and respondents of other races make up about 4.9% of Democrats versus 2.9% of Republicans.
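
The proportions that the filled bars display can be recomputed directly from the grouped party-by-race table. The following is a minimal base R sketch: the counts are copied from that table, while the object names counts and race_within_party are illustrative only.

```r
# Counts copied from the grouped party-by-race table above.
counts <- matrix(
  c(20373, 6146, 1381,
     6767,  962,  770,
      745,   71,   45,
    18243,  673,  558),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    party = c("Democrat", "Independent", "Other Party", "Republican"),
    race  = c("White", "Black", "Other")
  )
)

# margin = 1 divides each row by its row total, giving the racial
# composition within each party group (each row sums to 1).
race_within_party <- prop.table(counts, margin = 1)
print(round(race_within_party, 3))
```

Using margin = 2 instead would give the party composition within each race, which answers a different question from the one the bar plot shows.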

Let us also visualize how the racial composition of each political group has changed over the years.

#Grouping the variables, summarizing their total counts, and then adding a 
#new column with the proportion of each race within each party and year.
#summarise() drops the race grouping, so sum(n) is the partyid-year total.

b <- party %>% 
  group_by(partyid, year, race) %>% 
  summarise(n = n()) %>% 
  mutate(Prop = n / sum(n))

b
## # A tibble: 325 x 5
## # Groups:   partyid, year [116]
##    partyid   year race      n    Prop
##    <chr>    <int> <fct> <int>   <dbl>
##  1 Democrat  1972 White   704 0.764  
##  2 Democrat  1972 Black   215 0.233  
##  3 Democrat  1972 Other     3 0.00325
##  4 Democrat  1973 White   657 0.815  
##  5 Democrat  1973 Black   144 0.179  
##  6 Democrat  1973 Other     5 0.00620
##  7 Democrat  1974 White   691 0.837  
##  8 Democrat  1974 Black   131 0.159  
##  9 Democrat  1974 Other     4 0.00484
## 10 Democrat  1975 White   683 0.842  
## # ... with 315 more rows
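
The Prop values printed above can be checked by hand: each count is divided by the total for its own party and year. The sketch below reproduces this in base R for the first six rows of the output, using ave() as a stand-in for the grouped mutate. The data frame name counts is illustrative only.

```r
# First six rows of the summary table above (Democrat, 1972 and 1973).
counts <- data.frame(
  partyid = rep("Democrat", 6),
  year    = rep(c(1972, 1973), each = 3),
  race    = rep(c("White", "Black", "Other"), times = 2),
  n       = c(704, 215, 3, 657, 144, 5)
)

# ave() applies the function within each partyid-year group, so every count
# is divided by the total for its own party and year.
counts$Prop <- ave(counts$n, counts$partyid, counts$year,
                   FUN = function(v) v / sum(v))
print(counts)
```
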
#Plotting for b using ggplot2

ggplot(data = b, aes(x = year, y = Prop)) + 
  geom_smooth(aes(fill=race)) + facet_wrap(~partyid)

From the above plot, we can observe the following:

  1. The share of Black and Other respondents among Democrats has increased over the years.

  2. The Independent and Democrat categories show similar trends.

  3. Within the Republican category, the racial composition has hardly changed from 1972 to 2010, and White respondents make up the large majority.

  4. The trend within Other Party is somewhat similar to that of the Republican category.

Next, let’s do a statistical test to verify whether there is any association between race and party.


Part 4: Inference

Null Hypothesis: Race and political affiliation are independent of each other.

Alternative Hypothesis: There is an association between one’s race and political affiliation.

As we have three race groups and four political party groups, we will use the chi-square test of independence. But first, we check whether the conditions for using chi-square tests are satisfied.

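1) Independence: As mentioned earlier, the GSS uses random sampling. In addition, the 56,734 respondents analysed here are far fewer than 10% of the US population, and each respondent contributes to exactly one cell of the table. Hence, the observations can be treated as independent.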

2) Expected Counts

#Checking for expected value of each cell.

chisq.test(party$partyid, party$race)$expected
##              party$race
## party$partyid      White     Black      Other
##   Democrat    22684.3022 3861.3671 1354.33074
##   Independent  6910.1751 1176.2638  412.56118
##   Other Party   700.0424  119.1626   41.79494
##   Republican  15833.4803 2695.2065  945.31315

Every expected count is well above five (the smallest is about 41.8). Hence, this condition is satisfied too.
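
The expected counts come from a simple rule: under independence, each cell's expected count is its row total times its column total, divided by the grand total. The sketch below applies that rule in base R to the observed counts from the grouped table. The object names observed and expected are illustrative only.

```r
# Observed counts copied from the grouped party-by-race table.
observed <- matrix(
  c(20373, 6146, 1381,
     6767,  962,  770,
      745,   71,   45,
    18243,  673,  558),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("Democrat", "Independent", "Other Party", "Republican"),
    c("White", "Black", "Other")
  )
)

# Expected count under independence: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 2))

# The sample-size condition looks at the smallest expected count.
print(min(expected))
```

Checking the minimum is enough here, because if the smallest cell passes the threshold then every other cell does too.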

3) Degrees of Freedom: The degrees of freedom are given by (4-1)*(3-1) = 6, since there are four party groups and three race groups.

All the conditions have been checked. Because the chi-square test of independence compares whole distributions across more than two levels rather than estimating a single parameter, there is no associated confidence interval to report.

Now, let’s continue with the chi-square test.

#Testing at the 5% significance level.

inference(y = partyid, x = race, data = party, type = "ht", 
          statistic = "proportion", method = "theoretical", alternative = "greater")
## Response variable: categorical (4 levels) 
## Explanatory variable: categorical (3 levels) 
## Observed:
##        y
## x       Democrat Independent Other Party Republican
##   White    20373        6767         745      18243
##   Black     6146         962          71        673
##   Other     1381         770          45        558
## 
## Expected:
##        y
## x        Democrat Independent Other Party Republican
##   White 22684.302   6910.1751   700.04244 15833.4803
##   Black  3861.367   1176.2638   119.16262  2695.2065
##   Other  1354.331    412.5612    41.79494   945.3131
## 
## H0: race and partyid are independent
## HA: race and partyid are dependent
## chi_sq = 4004.6596, df = 6, p_value = 0
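
As a cross-check on the output above, the chi-square statistic can be rebuilt by hand: sum (observed - expected)^2 / expected over all twelve cells, then take the upper tail of a chi-square distribution with 6 degrees of freedom. This base R sketch uses the observed counts shown in the output. The object names are illustrative, and base R's chisq.test() is used only as a comparison, not as the method of this analysis.

```r
# Observed counts (party groups in rows, race in columns).
observed <- matrix(
  c(20373, 6146, 1381,
     6767,  962,  770,
      745,   71,   45,
    18243,  673,  558),
  nrow = 4, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: summed squared deviations, scaled by expectation.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of a statistic at least this large under H0.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = deg_free, p_value = p_value))

# Comparison with the built-in test on the same table.
builtin <- chisq.test(observed)
print(all.equal(unname(builtin$statistic), chi_sq))
```

The p-value is so small that it falls below what double-precision arithmetic can represent, which is why it is reported as 0 rather than as a tiny positive number.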

Conclusion

The p-value is reported as 0, which is far below the significance level of 0.05. Therefore, we reject the null hypothesis in favor of the alternative hypothesis: the data provide strong evidence of an association between race and political party affiliation.

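The results match what we saw in the plots above. Comparing the observed and expected counts, Black respondents are heavily over-represented among Democrats (6,146 observed versus about 3,861 expected under independence) and under-represented among Republicans (673 versus about 2,695), while respondents of other races are over-represented among Independents (770 versus about 413).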