Chi-Squared Test for Independence

Research Question

I hypothesize that there is a relationship between the type of neighborhood voters live in (urban, rural, suburb, etc.) and how they view immigrant contributions to society (contribute, drain, neither, or unsure). I will analyze responses in the voter dataset to test this hypothesis.

Import Data

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
voterdata <- read.csv("/Users/Nazija/Downloads/Abbreviated Voter Dataset Labeled.csv")
head(voterdata)

##   gender  race            education      familyincome children  region
## 1 Female White               4-year Prefer not to say       No    West
## 2 Female White         Some College      $60K-$69,999       No    West
## 3   Male White High School Graduate      $50K-$59,999       No Midwest
## 4   Male White         Some College      $70K-$79,999       No   South
## 5   Male White               4-year      $40K-$49,999       No   South
## 6 Female White               2-year      $30K-$39,999       No    West
##    urbancity     Vote2012        Vote2016   TrumpSanders
## 1     Suburb Barack Obama Hillary Clinton Bernie Sanders
## 2 Rural Area  Mitt Romney    Donald Trump   Donald Trump
## 3       City  Mitt Romney Hillary Clinton Bernie Sanders
## 4       City Barack Obama    Gary Johnson Bernie Sanders
## 5     Suburb  Mitt Romney    Donald Trump   Donald Trump
## 6     Suburb Barack Obama Hillary Clinton Bernie Sanders
##              PartyRegistration PartyIdentification     PartyIdentification2
## 1                         <NA>            Democrat Not very strong Democrat
## 2                   Republican          Republican        Strong Republican
## 3                         <NA>          Republican        Strong Republican
## 4 Decline/No Party/Independent         Independent              Independent
## 5                         <NA>          Republican        Strong Republican
## 6                     Democrat            Democrat          Strong Democrat
##   PartyIdentification3 NewsPublicAffairs      DemPrimary   RepPrimary
## 1             Moderate  Most of the time Hillary Clinton         <NA>
## 2         Conservative  Most of the time            <NA> Donald Trump
## 3             Moderate  Most of the time Hillary Clinton         <NA>
## 4             Moderate  Most of the time    Someone Else         <NA>
## 5         Conservative  Most of the time            <NA>  Marco Rubio
## 6         Very Liberal  Most of the time Hillary Clinton         <NA>
##   ImmigrantContributions ImmigrantNaturalization ImmigrationShouldBe
## 1      Mostly Contribute                   Favor     Slightly Easier
## 2         Mostly a Drain                Not Sure           No change
## 3      Mostly Contribute                   Favor         Much Easier
## 4      Mostly Contribute                   Favor         Much Easier
## 5         Mostly a Drain                Not Sure     Slightly Easier
## 6      Mostly Contribute                   Favor     Slightly Harder
##                                    Abortion GayMarriage DeathPenalty
## 1                        Legal in all cases       Favor       Oppose
## 2 Legal in some cases and Illegal in others      Oppose        Favor
## 3                        Legal in all cases       Favor        Favor
## 4 Legal in some cases and Illegal in others       Favor        Favor
## 5                      Illegal in all cases      Oppose        Favor
## 6                        Legal in all cases       Favor     Not Sure
##   DeathPenaltyFreq TaxWealthy Healthcare            GlobWarmExist
## 1        Too Often      Favor        Yes  Definitely is happening
## 2 Not Often Enough     Oppose         No Definitely not happening
## 3 Not Often Enough      Favor        Yes  Definitely is happening
## 4      About Right      Favor        Yes  Definitely is happening
## 5 Not Often Enough     Oppose         No Definitely not happening
## 6         Not Sure      Favor        Yes  Definitely is happening
##   GlobWarmingSerious AffirmativeAction              Religion
## 1       Very Serious             Favor        Roman Catholic
## 2               <NA>            Oppose                Mormon
## 3       Very Serious             Favor              Agnostic
## 4   Somewhat Serious             Favor Nothing in Particular
## 5               <NA>            Oppose                Mormon
## 6       Very Serious             Favor              Agnostic
##    ReligiousImportance      ChurchAttendance     PrayerFrequency NumChildren
## 1   Somewhat Important                Seldom          Once a day           0
## 2       Very Important More than once a week Several times a day           0
## 3 Not at all Important                Seldom               Never           0
## 4 Not at all Important                Seldom A few times a month           0
## 5       Very Important           Once a week  A few times a week           0
## 6 Not at all Important                 Never               Never           0
##     areatype        GunOwnership EconomyBetterWorse Immigr_Economy_GiveTake
## 1     Suburb No Gun in Household     Getting Better                       7
## 2 Rural Area    Gun in Household     About the Same                      10
## 3       City    Gun in Household     Getting Better                       8
## 4       City No Gun in Household     Getting Better                      NA
## 5     Suburb No Gun in Household      Getting Worse                       7
## 6     Suburb    Gun in Household     About the Same                      10
##   ft_fem_2017 ft_immig_2017 ft_police_2017 ft_dem_2017 ft_rep_2017
## 1          99            95             76          88          21
## 2          65            96             95          86          96
## 3          74            77             78          91          20
## 4          NA            NA             NA          NA          NA
## 5          25            91             94          22          83
## 6         100           100             28          99          NA
##   ft_evang_2017 ft_muslim_2017 ft_jew_2017 ft_christ_2017 ft_gays_2017
## 1            50             50          50             50           50
## 2            96             61         100             98           82
## 3             2             49          25             50           77
## 4            NA             NA          NA             NA           NA
## 5            70             80          91             94           71
## 6            NA            100         100             28          100
##   ft_unions_2017 ft_altright_2017 ft_black_2017 ft_white_2017 ft_hisp_2017
## 1             80                1            51            50           79
## 2             62               50            98            90           95
## 3            100                0            87            90           91
## 4             NA               NA            NA            NA           NA
## 5             20               50            90            85           90
## 6            100               NA           100            50          100

Prepare Data

Select the variables necessary for the analysis, filter to keep only the categories of interest, and store the prepared data in a new object.

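voter <- voterdata %>%
  # keep only the two variables used in the test
  select(urbancity, ImmigrantContributions) %>%
  # drop missing responses and the "Other" neighborhood category
  filter(!is.na(ImmigrantContributions), urbancity != "Other")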
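One side effect of this step is worth noting: when urbancity is stored as a factor, removing the "Other" rows does not remove "Other" from the factor's levels, which is why it can still appear as an empty category in table() output. The sketch below illustrates this on a small made-up data frame. The object toy is hypothetical and only stands in for voterdata, and base R subsetting is used in place of filter() so the example runs without extra packages.

```r
# Hypothetical toy data standing in for voterdata (not the real dataset)
toy <- data.frame(
  urbancity = factor(c("City", "Other", "Suburb", "City")),
  ImmigrantContributions = factor(c("Neither", "Not Sure", NA, "Mostly a Drain"))
)

# Same conditions as the filter step: drop missing responses and "Other"
keep_rows <- !is.na(toy$ImmigrantContributions) & toy$urbancity != "Other"
kept <- toy[keep_rows, ]

# "Other" is still listed as a level, with a count of zero
table(kept$urbancity)

# droplevels() removes factor levels that no longer occur in the data
cleaned <- droplevels(kept)
table(cleaned$urbancity)
```

Whether to drop the unused level is a presentation choice. It does not change the counts in the remaining categories.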

Null Hypothesis

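The table below shows the observed share of responses in each category of urbancity, expressed as proportions. "Other" still appears, with a share of 0.00, because it remains a factor level after filtering.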

table(voter$urbancity)%>%
  prop.table()%>%
  round(2)

## 
##       City      Other Rural Area     Suburb       Town 
##       0.29       0.00       0.19       0.38       0.14

The table below shows the observed share of responses in each category of ImmigrantContributions, expressed as proportions.

table(voter$ImmigrantContributions)%>%
  prop.table()%>%
  round(2)

## 
##    Mostly a Drain Mostly Contribute           Neither          Not Sure 
##              0.52              0.27              0.12              0.08

Below are the values that we would expect to observe in a crosstab if the two variables were completely independent of each other. This is what we consider the “null hypothesis”.

Under independence, the expected share for each cell is the product of the two marginal shares:

expected share (category i of urbancity, category j of ImmigrantContributions) = share of category i * share of category j

Here urbancity has 4 categories and ImmigrantContributions has 4 categories, so there are 16 cells.

urbancity * ImmigrantContributions

City * Mostly Contribute = 0.29 * 0.27 = 0.08

City * Neither = 0.29 * 0.12 = 0.03

City * Mostly a Drain = 0.29 * 0.52 = 0.15

City * Not Sure = 0.29 * 0.08 = 0.02

Rural Area * Mostly Contribute = 0.19 * 0.27 = 0.05

Rural Area * Neither = 0.19 * 0.12 = 0.02

Rural Area * Mostly a Drain = 0.19 * 0.52 = 0.10

Rural Area * Not Sure = 0.19 * 0.08 = 0.02

Suburb * Mostly Contribute = 0.38 * 0.27 = 0.10

Suburb * Neither = 0.38 * 0.12 = 0.05

Suburb * Mostly a Drain = 0.38 * 0.52 = 0.20

Suburb * Not Sure = 0.38 * 0.08 = 0.03

Town * Mostly Contribute = 0.14 * 0.27 = 0.04

Town * Neither = 0.14 * 0.12 = 0.02

Town * Mostly a Drain = 0.14 * 0.52 = 0.07

Town * Not Sure = 0.14 * 0.08 = 0.01
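The cell-by-cell arithmetic above amounts to an outer product of the two sets of marginal shares. A minimal sketch, assuming the rounded shares printed in the two marginal tables (the names urban_share, view_share, and expected_share are illustrative and not part of the original analysis):

```r
# Marginal shares copied from the two tables above (rounded to 2 decimals)
urban_share <- c(City = 0.29, "Rural Area" = 0.19, Suburb = 0.38, Town = 0.14)
view_share <- c("Mostly a Drain" = 0.52, "Mostly Contribute" = 0.27,
                Neither = 0.12, "Not Sure" = 0.08)

# outer() multiplies every urbancity share by every ImmigrantContributions share,
# giving the expected share of each cell if the variables were independent
expected_share <- outer(urban_share, view_share)
round(expected_share, 2)
```

Because the inputs are already rounded, the results can differ slightly from expected shares computed from the raw counts.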

Actual Observations

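The crosstab below shows the observed share of all responses in each combination of categories (table percentages). Most cells are within a few percentage points of the values expected under the null hypothesis, but the City and Rural Area rows differ in opposite directions: city voters fall below the expected share for "Mostly a Drain" (0.12 observed vs. 0.15 expected) and above it for "Mostly Contribute" (0.10 vs. 0.08), while rural voters show the reverse (0.12 vs. 0.10 and 0.03 vs. 0.05).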

table(voter$urbancity, voter$ImmigrantContributions)%>%
  prop.table()%>%
  round(2)

##             
##              Mostly a Drain Mostly Contribute Neither Not Sure
##   City                 0.12              0.10    0.03     0.03
##   Other                0.00              0.00    0.00     0.00
##   Rural Area           0.12              0.03    0.02     0.02
##   Suburb               0.20              0.10    0.05     0.03
##   Town                 0.08              0.04    0.02     0.01
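To compare observed and expected shares without rounding error, the gap can be computed directly from the cell counts. The sketch below hard-codes the counts reported in the chi-squared output of this report, and the object names (observed, share_gap, and so on) are illustrative assumptions rather than part of the original analysis.

```r
# Observed counts copied from the $observed output of chisq.test() in this report
observed <- matrix(
  c( 973, 761, 269, 242,
     918, 264, 173, 120,
    1582, 799, 384, 221,
     639, 283, 133,  80),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("City", "Rural Area", "Suburb", "Town"),
    c("Mostly a Drain", "Mostly Contribute", "Neither", "Not Sure")
  )
)

# Share of all respondents in each cell
observed_share <- prop.table(observed)

# Expected share under independence: product of row and column shares
expected_share <- outer(rowSums(observed_share), colSums(observed_share))

# Positive values mean the cell is more common than independence predicts
share_gap <- observed_share - expected_share
round(share_gap, 3)
```

The gaps sum to zero by construction, so a surplus in one cell is always offset by deficits elsewhere in the table.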

Relationship of Interest: Table

The table below shows row percentages to highlight the relationship of interest. Because the independent variable (urbancity) is in the rows, each row sums to 1 and shows how views on immigrant contributions are distributed within one neighborhood type. The "Other" row is blank because no respondents remain in that category after filtering.

table(voter$urbancity, voter$ImmigrantContributions)%>%
  prop.table(1)%>%
  round(2)

##             
##              Mostly a Drain Mostly Contribute Neither Not Sure
##   City                 0.43              0.34    0.12     0.11
##   Other                                                       
##   Rural Area           0.62              0.18    0.12     0.08
##   Suburb               0.53              0.27    0.13     0.07
##   Town                 0.56              0.25    0.12     0.07

voter%>%
  group_by(urbancity, ImmigrantContributions)%>%
  summarize(n = n())%>%
  mutate(percentage = n/sum(n))

## `summarise()` regrouping output by 'urbancity' (override with `.groups` argument)

## # A tibble: 16 x 4
## # Groups:   urbancity [4]
##    urbancity  ImmigrantContributions     n percentage
##    <fct>      <fct>                  <int>      <dbl>
##  1 City       Mostly a Drain           973     0.433 
##  2 City       Mostly Contribute        761     0.339 
##  3 City       Neither                  269     0.120 
##  4 City       Not Sure                 242     0.108 
##  5 Rural Area Mostly a Drain           918     0.622 
##  6 Rural Area Mostly Contribute        264     0.179 
##  7 Rural Area Neither                  173     0.117 
##  8 Rural Area Not Sure                 120     0.0814
##  9 Suburb     Mostly a Drain          1582     0.530 
## 10 Suburb     Mostly Contribute        799     0.268 
## 11 Suburb     Neither                  384     0.129 
## 12 Suburb     Not Sure                 221     0.0740
## 13 Town       Mostly a Drain           639     0.563 
## 14 Town       Mostly Contribute        283     0.249 
## 15 Town       Neither                  133     0.117 
## 16 Town       Not Sure                  80     0.0705

Relationship of Interest: Visualization

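voter %>%
  group_by(urbancity, ImmigrantContributions) %>%
  summarize(n = n()) %>%
  # percentage is each response's share within its neighborhood type
  mutate(percentage = n / sum(n)) %>%
  ggplot() +
  # stacked bars: one bar per neighborhood type, split by response
  geom_col(aes(x = urbancity, y = percentage, fill = ImmigrantContributions))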

## `summarise()` regrouping output by 'urbancity' (override with `.groups` argument)

The table and chart show that the type of neighborhood voters live in is associated with their views on immigrant contributions to society. Voters in cities were more likely than rural voters to say that immigrants mostly contribute (34% vs. 18%), while rural voters were more likely to say that immigrants are mostly a drain (62% vs. 43%). Voters in suburbs and towns held very similar views (53% and 56% "Mostly a Drain", 27% and 25% "Mostly Contribute") and fell between city and rural voters.

Chi-Squared Test

Below are the results of the chi-squared test for independence, which tells us whether the relationship between the two variables is statistically significant.

The results indicate a statistically significant relationship between the type of neighborhood a voter lives in (urbancity) and their views on immigrant contributions (ImmigrantContributions).

chisq.test(voter$urbancity, voter$ImmigrantContributions)["expected"]

## $expected
##                voter$ImmigrantContributions
## voter$urbancity Mostly a Drain Mostly Contribute  Neither  Not Sure
##      City            1177.3294          603.2668 274.5766 189.82719
##      Rural Area       773.5238          396.3557 180.4011 124.71942
##      Suburb          1565.9268          802.3852 365.2052 252.48285
##      Town             595.2200          304.9923 138.8171  95.97054

chisq.test(voter$urbancity, voter$ImmigrantContributions)["observed"]

## $observed
##                voter$ImmigrantContributions
## voter$urbancity Mostly a Drain Mostly Contribute Neither Not Sure
##      City                  973               761     269      242
##      Rural Area            918               264     173      120
##      Suburb               1582               799     384      221
##      Town                  639               283     133       80

chisq.test(voter$urbancity, voter$ImmigrantContributions)["p.value"]

## $p.value
## [1] 4.195969e-33

chisq.test(voter$urbancity, voter$ImmigrantContributions)

## 
##  Pearson's Chi-squared test
## 
## data:  voter$urbancity and voter$ImmigrantContributions
## X-squared = 175.6, df = 9, p-value < 2.2e-16

Since the p-value is well below .05, we reject the null hypothesis of independence: there is a statistically significant relationship between urbancity and ImmigrantContributions (X-squared = 175.6, df = 9).
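To make the test less of a black box, the statistic can be rebuilt by hand from the observed counts. The sketch below hard-codes the counts from the $observed output above and uses the standard Pearson formula, the sum over cells of (observed - expected)^2 / expected. All object names are illustrative assumptions. It should reproduce the X-squared and df values reported by chisq.test() up to rounding.

```r
# Observed counts copied from the $observed output above
observed <- matrix(
  c( 973, 761, 269, 242,
     918, 264, 173, 120,
    1582, 799, 384, 221,
     639, 283, 133,  80),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("City", "Rural Area", "Suburb", "Town"),
    c("Mostly a Drain", "Mostly Contribute", "Neither", "Not Sure")
  )
)

n <- sum(observed)

# Expected count per cell under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Pearson residuals show which cells contribute most to the statistic
pearson_resid <- (observed - expected) / sqrt(expected)

round(chi_sq, 1)
df
p_value
round(pearson_resid, 1)
```

The largest residuals in absolute value sit in the City and Rural Area rows, which is consistent with the pattern described in the interpretation of the row percentages.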