Chi-Squared Test - Testing Relationships Between Categorical Variable Assignment

I will compare respondents who, given a second chance, would vote for Hillary Clinton or Donald Trump with their support for raising taxes on families with incomes over $200,000 per year.

Preliminary Steps

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(ggplot2)

Voter <- read_csv("~/Downloads/Voter Data 2019.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   weight_18_24_2018 = col_logical(),
##   izip_2019 = col_character(),
##   housevote_other_2019 = col_character(),
##   senatevote_other_2019 = col_character(),
##   senatevote2_other_2019 = col_character(),
##   SenCand1Name_2019 = col_character(),
##   SenCand1Party_2019 = col_character(),
##   SenCand2Name_2019 = col_character(),
##   SenCand2Party_2019 = col_character(),
##   SenCand3Name_2019 = col_character(),
##   SenCand3Party_2019 = col_character(),
##   SenCand1Name2_2019 = col_character(),
##   SenCand1Party2_2019 = col_character(),
##   SenCand2Name2_2019 = col_character(),
##   SenCand2Party2_2019 = col_character(),
##   SenCand3Name2_2019 = col_character(),
##   SenCand3Party2_2019 = col_character(),
##   governorvote_other_2019 = col_character(),
##   GovCand1Name_2019 = col_character(),
##   GovCand1Party_2019 = col_character()
##   # ... with 108 more columns
## )

## See spec(...) for full column specifications.

## Warning: 800 parsing failures.
##  row               col           expected           actual                              file
## 2033 weight_18_24_2018 1/0/T/F/TRUE/FALSE .917710168467982 '~/Downloads/Voter Data 2019.csv'
## 2828 weight_18_24_2018 1/0/T/F/TRUE/FALSE 1.41022291345592 '~/Downloads/Voter Data 2019.csv'
## 4511 weight_18_24_2018 1/0/T/F/TRUE/FALSE 1.77501243840922 '~/Downloads/Voter Data 2019.csv'
## 7264 weight_18_24_2018 1/0/T/F/TRUE/FALSE 1.29486870319614 '~/Downloads/Voter Data 2019.csv'
## 7277 weight_18_24_2018 1/0/T/F/TRUE/FALSE 1.44972719707603 '~/Downloads/Voter Data 2019.csv'
## .... ................. .................. ................ .................................
## See problems(...) for more details.
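The parsing failures above happen because readr guessed `weight_18_24_2018` as logical from the early rows, which are all missing, and then met decimal weights further down. A minimal sketch of one way around this is to declare that column as double with `col_types`. The temporary file and its values below are hypothetical stand-ins, not the voter data.

```r
library(readr)

# Hypothetical stand-in for the real file: the weight column is empty for the
# first 1,200 rows and numeric afterwards, so type guessing sees only NAs.
toy <- data.frame(
  id = 1:1500,
  weight_18_24_2018 = c(rep(NA, 1200), seq(0.5, 1.5, length.out = 300))
)
toy_path <- tempfile(fileext = ".csv")
write_csv(toy, toy_path, na = "")

# Declare the problem column explicitly; all other columns are still guessed.
fixed <- read_csv(
  toy_path,
  col_types = cols(weight_18_24_2018 = col_double())
)

class(fixed$weight_18_24_2018)
sum(!is.na(fixed$weight_18_24_2018))
```

The same `col_types` argument could be passed when reading the real CSV. This column is not used in the analysis below, so the warning does not affect the results here.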

head(Voter)

## # A tibble: 6 x 1,282
##   weight_2016 weight_2017 weight_panel_20… weight_latino_2… weight_18_24_20…
##         <dbl>       <dbl>            <dbl>            <dbl> <lgl>           
## 1       0.358       0.438            0.503               NA NA              
## 2       0.563       0.366            0.389               NA NA              
## 3       0.552       0.550            0.684               NA NA              
## 4       0.208      NA               NA                   NA NA              
## 5       0.334       0.346            0.322               NA NA              
## 6       0.207       0.148            0.594               NA NA              
## # … with 1,277 more variables: weight_overall_2018 <dbl>, weight_2019 <dbl>,
## #   weight1_2018 <dbl>, weight1_2019 <dbl>, weight2_2019 <dbl>,
## #   weight3_2019 <dbl>, cassfullcd <dbl>, vote2020_2019 <dbl>,
## #   trumpapp_2019 <dbl>, fav_trump_2019 <dbl>, fav_obama_2019 <dbl>,
## #   fav_hrc_2019 <dbl>, fav_sanders_2019 <dbl>, fav_putin_2019 <dbl>,
## #   fav_schumer_2019 <dbl>, fav_pelosi_2019 <dbl>, fav_comey_2019 <dbl>,
## #   fav_mueller_2019 <dbl>, fav_mcconnell_2019 <dbl>, fav_kavanaugh_2019 <dbl>,
## #   fav_biden_2019 <dbl>, fav_warren_2019 <dbl>, fav_harris_2019 <dbl>,
## #   fav_gillibrand_2019 <dbl>, fav_patrick_2019 <dbl>, fav_booker_2019 <dbl>,
## #   fav_garcetti_2019 <dbl>, fav_klobuchar_2019 <dbl>, fav_gorsuch_2019 <dbl>,
## #   fav_kasich_2019 <dbl>, fav_haley_2019 <dbl>, fav_bloomberg_2019 <dbl>,
## #   fav_holder_2019 <dbl>, fav_avenatti_2019 <dbl>, fav_castro_2019 <dbl>,
## #   fav_landrieu_2019 <dbl>, fav_orourke_2019 <dbl>,
## #   fav_hickenlooper_2019 <dbl>, fav_pence_2019 <dbl>, add_confirm_2019 <dbl>,
## #   izip_2019 <chr>, votereg_2019 <dbl>, votereg_f_2019 <dbl>,
## #   regzip_2019 <dbl>, region_2019 <dbl>, turnout18post_2019 <dbl>,
## #   tsmart_G2018_2019 <dbl>, tsmart_G2018_vote_type_2019 <dbl>,
## #   tsmart_P2018_2019 <dbl>, tsmart_P2018_party_2019 <dbl>,
## #   tsmart_P2018_vote_type_2019 <dbl>, housevote_2019 <dbl>,
## #   housevote_other_2019 <chr>, senatevote_2019 <dbl>,
## #   senatevote_other_2019 <chr>, senatevote2_2019 <dbl>,
## #   senatevote2_other_2019 <chr>, SenCand1Name_2019 <chr>,
## #   SenCand1Party_2019 <chr>, SenCand2Name_2019 <chr>,
## #   SenCand2Party_2019 <chr>, SenCand3Name_2019 <chr>,
## #   SenCand3Party_2019 <chr>, SenCand1Name2_2019 <chr>,
## #   SenCand1Party2_2019 <chr>, SenCand2Name2_2019 <chr>,
## #   SenCand2Party2_2019 <chr>, SenCand3Name2_2019 <chr>,
## #   SenCand3Party2_2019 <chr>, governorvote_2019 <dbl>,
## #   governorvote_other_2019 <chr>, GovCand1Name_2019 <chr>,
## #   GovCand1Party_2019 <chr>, GovCand2Name_2019 <chr>,
## #   GovCand2Party_2019 <chr>, GovCand3Name_2019 <chr>,
## #   GovCand3Party_2019 <chr>, inst_court_2019 <dbl>, inst_media_2019 <dbl>,
## #   inst_congress_2019 <dbl>, inst_justice_2019 <dbl>, inst_FBI_2019 <dbl>,
## #   inst_military_2019 <dbl>, inst_church_2019 <dbl>, inst_business_2019 <dbl>,
## #   Democrats_2019 <dbl>, Republicans_2019 <dbl>, Men_2019 <dbl>,
## #   Women_2019 <dbl>, wm_2019 <dbl>, ww_2019 <dbl>, bm_2019 <dbl>,
## #   bw_2019 <dbl>, hm_2019 <dbl>, hw_2019 <dbl>, rwm_2019 <dbl>,
## #   rww_2019 <dbl>, rbm_2019 <dbl>, rbw_2019 <dbl>, pwm_2019 <dbl>, …

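voter <- Voter %>%
  mutate(
    # Second-chance vote: 1 = Clinton, 2 = Trump, anything else becomes NA
    PartyVoters = ifelse(second_chance_2016 == 1, "Hillary Clinton (Democratic)",
                  ifelse(second_chance_2016 == 2, "Donald Trump (Republican)", NA)),
    # Raise taxes on families earning over $200,000: 1 = Yes, 2 = No, else NA
    TaxWealthy = ifelse(taxdoug_2016 == 1, "Yes",
                 ifelse(taxdoug_2016 == 2, "No", NA))
  ) %>%
  select(PartyVoters, TaxWealthy) %>%
  # Keep only respondents who chose Clinton or Trump
  filter(PartyVoters %in% c("Hillary Clinton (Democratic)", "Donald Trump (Republican)"))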
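The same recoding can be sketched with `dplyr::case_when()`, which some readers find easier to follow than nested `ifelse()` calls. The toy data frame below is hypothetical, and the codes 3 and 8 are invented only to stand for "any other answer". Unlike the code above, this sketch also drops rows where the tax answer is missing, which is an assumption about what is wanted.

```r
library(dplyr)

# Hypothetical toy responses using the same 1/2 codes as the survey columns
toy <- tibble(
  second_chance_2016 = c(1, 2, 3, NA, 1),
  taxdoug_2016       = c(1, 2, 2, 1, 8)
)

recoded <- toy %>%
  mutate(
    PartyVoters = case_when(
      second_chance_2016 == 1 ~ "Hillary Clinton (Democratic)",
      second_chance_2016 == 2 ~ "Donald Trump (Republican)",
      TRUE ~ NA_character_          # every other code, including NA
    ),
    TaxWealthy = case_when(
      taxdoug_2016 == 1 ~ "Yes",
      taxdoug_2016 == 2 ~ "No",
      TRUE ~ NA_character_
    )
  ) %>%
  # Keep only rows with a usable answer on both variables
  filter(!is.na(PartyVoters), !is.na(TaxWealthy))

recoded
```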

1. How do people respond to these two variables, separately?

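Respondents answered each variable separately. For the second-chance vote, they chose Hillary Clinton or Donald Trump. For the tax question, they answered yes or no on whether taxes should be raised on families with incomes over $200,000 per year.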

2. If these variables were completely unrelated to one another, what % of respondents would fit each variable combination?

% of Respondents by Political Party

table(voter$PartyVoters)%>%
  prop.table()%>%
  round(2)

## 
##    Donald Trump (Republican) Hillary Clinton (Democratic) 
##                          0.4                          0.6

% of Respondents by Support for Taxing the Wealthy

table(voter$TaxWealthy)%>%
  prop.table()%>%
  round(2)

## 
##   No  Yes 
## 0.21 0.79

table(voter$PartyVoters, voter$TaxWealthy)%>%
  prop.table(1)%>%
  round(2)

##                               
##                                  No  Yes
##   Donald Trump (Republican)    0.38 0.62
##   Hillary Clinton (Democratic) 0.11 0.89

3. Generate a crosstab to show how respondents actually distribute across variable combinations. How does this compare to the table you generated in step 2?

             Donald Trump (Republican) (40%)   Hillary Clinton (Democratic) (60%)

No (21%)     .40 × .21 = .08                   .60 × .21 = .13

Yes (79%)    .40 × .79 = .32                   .60 × .79 = .47

If the two variables were completely unrelated, we would expect:

- 47% of respondents to be Hillary Clinton voters who favor raising taxes on families with incomes over $200,000.

- 32% of respondents to be Donald Trump voters who favor raising taxes on families with incomes over $200,000.

- 13% of respondents to be Hillary Clinton voters who do not favor raising taxes on families with incomes over $200,000.

- 8% of respondents to be Donald Trump voters who do not favor raising taxes on families with incomes over $200,000.
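The independence calculation can also be sketched in R and set beside the actual shares. This assumes the cell counts reported in the observed-values table of the chi-squared output (33, 55, 15, 126); the object names are mine, chosen for illustration. Because it uses the unrounded marginal shares of those 229 respondents, the results can differ slightly from the hand calculation with 40%/60% and 21%/79%.

```r
# Observed counts, copied from the chi-squared output in this report
observed <- matrix(
  c(33, 55,
    15, 126),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PartyVoters = c("Donald Trump (Republican)", "Hillary Clinton (Democratic)"),
    TaxWealthy  = c("No", "Yes")
  )
)

# Actual share of all respondents in each cell
actual_share <- prop.table(observed)

# Share expected in each cell if the variables were unrelated:
# (row share) x (column share)
expected_share <- outer(rowSums(actual_share), colSums(actual_share))

round(expected_share, 2)
round(actual_share, 2)
```

Setting the two tables side by side shows more Trump voters in the "No" cell, and more Clinton voters in the "Yes" cell, than independence would predict.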

4. Run a chi-squared test to confirm whether or not there is a statistically significant relationship between variables.

chisq.test(voter$PartyVoters,voter$TaxWealthy)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  voter$PartyVoters and voter$TaxWealthy
## X-squared = 22.005, df = 1, p-value = 2.719e-06
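To see where the statistic comes from, here is a minimal sketch that recomputes it by hand. It assumes the observed counts reported later in this report (33, 55, 15, 126) and applies Yates' continuity correction, which `chisq.test()` uses by default for a 2 x 2 table. The object names are illustrative.

```r
# Observed counts, copied from the chi-squared output in this report
observed <- matrix(
  c(33, 55,
    15, 126),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PartyVoters = c("Donald Trump (Republican)", "Hillary Clinton (Democratic)"),
    TaxWealthy  = c("No", "Yes")
  )
)

res <- chisq.test(observed)      # Yates' correction applied by default (2 x 2)
expected <- res$expected

# Yates-corrected statistic: sum over cells of (|O - E| - 0.5)^2 / E
manual_stat <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Upper-tail probability of a chi-squared distribution with df = 1
manual_p <- pchisq(manual_stat, df = 1, lower.tail = FALSE)

c(statistic = manual_stat, p_value = manual_p)
```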

5. Write 2-3 sentences interpreting your findings

Respondents who believe we should tax the wealthy make up a much higher percentage (79%) than those who oppose it (21%). Support differs by vote choice: 89% of Hillary Clinton voters favor the tax increase, compared with 62% of Donald Trump voters. The chi-squared test gives X-squared = 22.005 with df = 1 and a p-value of about 2.719e-06. Because the p-value is far below 0.05, there is a statistically significant relationship between vote choice and support for taxing the wealthy.

6. Extract the Null Hypothesis Table from your chi-squared test output

chisq.test(voter$PartyVoters, voter$TaxWealthy)[7]

## $expected
##                               voter$TaxWealthy
## voter$PartyVoters                    No       Yes
##   Donald Trump (Republican)    18.44541  69.55459
##   Hillary Clinton (Democratic) 29.55459 111.44541
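Each expected count is (row total × column total) / grand total. A minimal sketch of that calculation, assuming the observed counts from this report (33, 55, 15, 126) and using illustrative object names:

```r
# Observed counts, copied from the chi-squared output in this report
observed <- matrix(
  c(33, 55,
    15, 126),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PartyVoters = c("Donald Trump (Republican)", "Hillary Clinton (Democratic)"),
    TaxWealthy  = c("No", "Yes")
  )
)

# Expected count per cell under the null hypothesis of no relationship
expected_manual <- outer(rowSums(observed), colSums(observed)) / sum(observed)

round(expected_manual, 5)
```

The same table is available by name as `chisq.test(...)$expected`, which may be easier to read than indexing the result by position with `[7]`.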

7. Extract the Observed Values Table from your chi-squared test output.

chisq.test(voter$PartyVoters,voter$TaxWealthy)[6]

## $observed
##                               voter$TaxWealthy
## voter$PartyVoters               No Yes
##   Donald Trump (Republican)     33  55
##   Hillary Clinton (Democratic)  15 126

8. Compare & discuss the values observed between the two tables.

Comparing the null hypothesis (expected) table with the observed table, I can see that the values are clearly different. Under the null hypothesis, about 18.45 Donald Trump voters would say no to taxing the wealthy, but in the actual data 33 did. About 69.55 Donald Trump voters were expected to say yes, but only 55 did. For Hillary Clinton voters the pattern is reversed: about 29.55 were expected to say no but only 15 did, and about 111.45 were expected to say yes but 126 did. The expected values have decimal points because they are calculated from the row and column totals, not counted, while the observed values are whole numbers because they count actual respondents.
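The comparison can be made cell by cell in R. The sketch below assumes the observed counts from this report and uses the `observed`, `expected`, and `stdres` components that `chisq.test()` returns. The object names are illustrative.

```r
# Observed counts, copied from the chi-squared output in this report
observed <- matrix(
  c(33, 55,
    15, 126),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PartyVoters = c("Donald Trump (Republican)", "Hillary Clinton (Democratic)"),
    TaxWealthy  = c("No", "Yes")
  )
)

res <- chisq.test(observed)

# How far each observed count sits from its expected count
gap <- res$observed - res$expected
round(gap, 2)

# Standardized residuals: the same gaps on a common scale
round(res$stdres, 2)
```

Positive values mark cells with more respondents than the null hypothesis predicts, and negative values mark cells with fewer.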