Chi-Squared Test for Independence

Research Question

I hypothesize that there is a relationship between ImmigrantNaturalization and PartyIdentification. I will be analyzing responses to the voter data dataset in order to test this hypothesis.

Import Data

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

data <- read_csv("C:/Users/jammi/Desktop/Abbreviated Voter Dataset Labeled.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   NumChildren = col_double(),
##   Immigr_Economy_GiveTake = col_double(),
##   ft_fem_2017 = col_double(),
##   ft_immig_2017 = col_double(),
##   ft_police_2017 = col_double(),
##   ft_dem_2017 = col_double(),
##   ft_rep_2017 = col_double(),
##   ft_evang_2017 = col_double(),
##   ft_muslim_2017 = col_double(),
##   ft_jew_2017 = col_double(),
##   ft_christ_2017 = col_double(),
##   ft_gays_2017 = col_double(),
##   ft_unions_2017 = col_double(),
##   ft_altright_2017 = col_double(),
##   ft_black_2017 = col_double(),
##   ft_white_2017 = col_double(),
##   ft_hisp_2017 = col_double()
## )

## See spec(...) for full column specifications.

head(data)

## # A tibble: 6 x 53
##   gender race  education familyincome children region urbancity Vote2012
##   <chr>  <chr> <chr>     <chr>        <chr>    <chr>  <chr>     <chr>   
## 1 Female White 4-year    Prefer not ~ No       West   Suburb    Barack ~
## 2 Female White Some Col~ $60K-$69,999 No       West   Rural Ar~ Mitt Ro~
## 3 Male   White High Sch~ $50K-$59,999 No       Midwe~ City      Mitt Ro~
## 4 Male   White Some Col~ $70K-$79,999 No       South  City      Barack ~
## 5 Male   White 4-year    $40K-$49,999 No       South  Suburb    Mitt Ro~
## 6 Female White 2-year    $30K-$39,999 No       West   Suburb    Barack ~
## # ... with 45 more variables: Vote2016 <chr>, TrumpSanders <chr>,
## #   PartyRegistration <chr>, PartyIdentification <chr>,
## #   PartyIdentification2 <chr>, PartyIdentification3 <chr>,
## #   NewsPublicAffairs <chr>, DemPrimary <chr>, RepPrimary <chr>,
## #   ImmigrantContributions <chr>, ImmigrantNaturalization <chr>,
## #   ImmigrationShouldBe <chr>, Abortion <chr>, GayMarriage <chr>,
## #   DeathPenalty <chr>, DeathPenaltyFreq <chr>, TaxWealthy <chr>,
## #   Healthcare <chr>, GlobWarmExist <chr>, GlobWarmingSerious <chr>,
## #   AffirmativeAction <chr>, Religion <chr>, ReligiousImportance <chr>,
## #   ChurchAttendance <chr>, PrayerFrequency <chr>, NumChildren <dbl>,
## #   areatype <chr>, GunOwnership <chr>, EconomyBetterWorse <chr>,
## #   Immigr_Economy_GiveTake <dbl>, ft_fem_2017 <dbl>, ft_immig_2017 <dbl>,
## #   ft_police_2017 <dbl>, ft_dem_2017 <dbl>, ft_rep_2017 <dbl>,
## #   ft_evang_2017 <dbl>, ft_muslim_2017 <dbl>, ft_jew_2017 <dbl>,
## #   ft_christ_2017 <dbl>, ft_gays_2017 <dbl>, ft_unions_2017 <dbl>,
## #   ft_altright_2017 <dbl>, ft_black_2017 <dbl>, ft_white_2017 <dbl>,
## #   ft_hisp_2017 <dbl>

Prepare Data

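Select the two variables needed for the analysis and filter to keep only the categories of interest: "Favor" and "Not Sure" for ImmigrantNaturalization, and "Democrat" and "Republican" for PartyIdentification. The prepared data overwrite the original data object.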

data<-data%>%
  select(ImmigrantNaturalization, PartyIdentification)%>%
  filter(ImmigrantNaturalization %in% c("Favor","Not Sure"), PartyIdentification %in% c("Democrat","Republican"))

Null Hypothesis

The table below shows the actual % of responses given for each category of ImmigrantNaturalization.

table(data$ImmigrantNaturalization)%>%
  prop.table()%>%
  round(2)

## 
##    Favor Not Sure 
##     0.68     0.32

The table below shows the actual % of responses given for each category of PartyIdentification.

table(data$PartyIdentification)%>%
  prop.table()%>%
  round(2)

## 
##   Democrat Republican 
##       0.71       0.29

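Below are the values that we would expect to observe in a crosstab if the two variables were completely independent of each other. This is what we might consider the “null hypothesis”. Each expected value is the product of the two corresponding marginal proportions from the tables above.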

Favor (.68) * Democrat (.71) = .4828
Favor (.68) * Republican (.29) = .1972
Not Sure (.32) * Democrat (.71) = .2272
Not Sure (.32) * Republican (.29) = .0928

Both variables have two categories in this analysis, so the crosstab has four cells, and the four expected proportions sum to 1.
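The hand multiplication above can also be sketched in R. This is a minimal illustration that assumes the rounded marginal proportions reported earlier (.68/.32 and .71/.29); the object names `row_share`, `col_share`, and `expected_prop` are made up for this example.

```r
# Rounded marginal proportions copied from the two tables above
# (object names are hypothetical, chosen for this sketch only).
row_share <- c("Favor" = 0.68, "Not Sure" = 0.32)        # ImmigrantNaturalization
col_share <- c("Democrat" = 0.71, "Republican" = 0.29)   # PartyIdentification

# Under independence, each cell proportion is the product of its
# row share and its column share; outer() builds all four at once.
expected_prop <- outer(row_share, col_share)

print(round(expected_prop, 4))
```

Because the inputs are rounded to two decimals, the results match the hand calculations but will differ slightly from expected values computed from the raw counts.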

Actual Observations

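The table below shows the observed counts for each category combination. Note that the percent column is computed within each ImmigrantNaturalization category (a row %), because the data are still grouped by ImmigrantNaturalization when mutate() runs; it is not a table %. Dividing each n by the total of 3,291 respondents gives table percentages of about .54, .14, .17, and .15. These differ noticeably from the values expected under the null hypothesis (.48, .20, .23, and .09).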

data%>%
  group_by(ImmigrantNaturalization,PartyIdentification)%>%
  summarize(n=n())%>%
  mutate(percent=n/sum(n))

## `summarise()` regrouping output by 'ImmigrantNaturalization' (override with `.groups` argument)

## # A tibble: 4 x 4
## # Groups:   ImmigrantNaturalization [2]
##   ImmigrantNaturalization PartyIdentification     n percent
##   <chr>                   <chr>               <int>   <dbl>
## 1 Favor                   Democrat             1782   0.794
## 2 Favor                   Republican            463   0.206
## 3 Not Sure                Democrat              564   0.539
## 4 Not Sure                Republican            482   0.461
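To compare directly against the expected values under the null hypothesis, each count needs to be divided by the grand total rather than by its row total. The sketch below assumes the four counts printed above and rebuilds them as a matrix; the names `obs` and `table_pct` are hypothetical.

```r
# Observed counts copied from the output above
# (rows: ImmigrantNaturalization, columns: PartyIdentification).
obs <- matrix(c(1782, 463,
                 564, 482),
              nrow = 2, byrow = TRUE,
              dimnames = list(ImmigrantNaturalization = c("Favor", "Not Sure"),
                              PartyIdentification = c("Democrat", "Republican")))

# With no margin argument, prop.table() divides every cell by the
# grand total, which gives table percentages.
table_pct <- prop.table(obs)

print(round(table_pct, 2))
```

In the dplyr pipeline, adding ungroup() before mutate() should give the same table percentages, since sum(n) would then be taken over all four rows.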

Relationship of Interest: Table

The table below shows the observed counts, with ImmigrantNaturalization in the rows and PartyIdentification in the columns. Because the percent column in the previous section was computed within each row category, those values are the row % for this table.

table(data$ImmigrantNaturalization,data$PartyIdentification)

##           
##            Democrat Republican
##   Favor        1782        463
##   Not Sure      564        482
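Row percentages can be read straight off the crosstab with prop.table(). The following is a sketch that assumes the counts shown above, re-entered by hand so the example runs on its own; `obs` and `row_pct` are hypothetical names.

```r
# Counts re-entered from the crosstab above so the sketch is self-contained.
obs <- matrix(c(1782, 463,
                 564, 482),
              nrow = 2, byrow = TRUE,
              dimnames = list(ImmigrantNaturalization = c("Favor", "Not Sure"),
                              PartyIdentification = c("Democrat", "Republican")))

# margin = 1 divides each cell by its row total (row %);
# margin = 2 would divide by column totals instead (column %).
row_pct <- prop.table(obs, margin = 1)

print(round(row_pct, 3))
```

With the real data, the same call should work on table(data$ImmigrantNaturalization, data$PartyIdentification) in place of the hand-built matrix.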

Relationship of Interest: Visualization

data%>%
  group_by(ImmigrantNaturalization,PartyIdentification)%>%
  summarize(n=n())%>%
  mutate(percent=n/sum(n))%>%
  ggplot()+geom_col(aes(x=ImmigrantNaturalization,y=percent,fill=PartyIdentification))

## `summarise()` regrouping output by 'ImmigrantNaturalization' (override with `.groups` argument)

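Among respondents who favor naturalization, 79.4% identify as Democrats and 20.6% as Republicans. Among those who are not sure, the split is much closer: 53.9% Democrat and 46.1% Republican. Democrats are the majority in both groups, but their share is about 25 percentage points higher among respondents who favor naturalization.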

Chi-Squared Test

Below are the results of the chi-squared test for independence. This tells us whether there is a statistically significant relationship between ImmigrantNaturalization and PartyIdentification.

chisq.test(data$ImmigrantNaturalization, data$PartyIdentification)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$ImmigrantNaturalization and data$PartyIdentification
## X-squared = 224.66, df = 1, p-value < 2.2e-16

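There is a statistically significant relationship between ImmigrantNaturalization and PartyIdentification (X-squared = 224.66, df = 1, p-value < 2.2e-16), so we reject the null hypothesis that the two variables are independent.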
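As a cross-check, the test can be rerun from the four observed counts alone, without the CSV file. This sketch assumes the counts reported in the crosstab; `obs` and `res` are hypothetical names. chisq.test() accepts a matrix of counts and, for a 2 x 2 table, applies Yates' continuity correction by default.

```r
# Counts re-entered from the crosstab so the sketch runs on its own.
obs <- matrix(c(1782, 463,
                 564, 482),
              nrow = 2, byrow = TRUE,
              dimnames = list(ImmigrantNaturalization = c("Favor", "Not Sure"),
                              PartyIdentification = c("Democrat", "Republican")))

# Pearson's chi-squared test on the count matrix
# (correct = TRUE is the default for 2 x 2 tables).
res <- chisq.test(obs)

print(res$statistic)   # test statistic, to compare with the output above
print(res$parameter)   # degrees of freedom
print(res$p.value)     # p-value
print(res$expected)    # expected counts under independence
```

The expected counts in res$expected are the count-scale version of the expected proportions in the Null Hypothesis section, computed from unrounded marginals.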