I hypothesize that there is a relationship between ImmigrantNaturalization and PartyIdentification. I will be analyzing responses in the voter dataset in order to test this hypothesis.
## Import Data
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
data <- read_csv("C:/Users/jammi/Desktop/Abbreviated Voter Dataset Labeled.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## NumChildren = col_double(),
## Immigr_Economy_GiveTake = col_double(),
## ft_fem_2017 = col_double(),
## ft_immig_2017 = col_double(),
## ft_police_2017 = col_double(),
## ft_dem_2017 = col_double(),
## ft_rep_2017 = col_double(),
## ft_evang_2017 = col_double(),
## ft_muslim_2017 = col_double(),
## ft_jew_2017 = col_double(),
## ft_christ_2017 = col_double(),
## ft_gays_2017 = col_double(),
## ft_unions_2017 = col_double(),
## ft_altright_2017 = col_double(),
## ft_black_2017 = col_double(),
## ft_white_2017 = col_double(),
## ft_hisp_2017 = col_double()
## )
## See spec(...) for full column specifications.
head(data)
## # A tibble: 6 x 53
## gender race education familyincome children region urbancity Vote2012
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Female White 4-year Prefer not ~ No West Suburb Barack ~
## 2 Female White Some Col~ $60K-$69,999 No West Rural Ar~ Mitt Ro~
## 3 Male White High Sch~ $50K-$59,999 No Midwe~ City Mitt Ro~
## 4 Male White Some Col~ $70K-$79,999 No South City Barack ~
## 5 Male White 4-year $40K-$49,999 No South Suburb Mitt Ro~
## 6 Female White 2-year $30K-$39,999 No West Suburb Barack ~
## # ... with 45 more variables: Vote2016 <chr>, TrumpSanders <chr>,
## # PartyRegistration <chr>, PartyIdentification <chr>,
## # PartyIdentification2 <chr>, PartyIdentification3 <chr>,
## # NewsPublicAffairs <chr>, DemPrimary <chr>, RepPrimary <chr>,
## # ImmigrantContributions <chr>, ImmigrantNaturalization <chr>,
## # ImmigrationShouldBe <chr>, Abortion <chr>, GayMarriage <chr>,
## # DeathPenalty <chr>, DeathPenaltyFreq <chr>, TaxWealthy <chr>,
## # Healthcare <chr>, GlobWarmExist <chr>, GlobWarmingSerious <chr>,
## # AffirmativeAction <chr>, Religion <chr>, ReligiousImportance <chr>,
## # ChurchAttendance <chr>, PrayerFrequency <chr>, NumChildren <dbl>,
## # areatype <chr>, GunOwnership <chr>, EconomyBetterWorse <chr>,
## # Immigr_Economy_GiveTake <dbl>, ft_fem_2017 <dbl>, ft_immig_2017 <dbl>,
## # ft_police_2017 <dbl>, ft_dem_2017 <dbl>, ft_rep_2017 <dbl>,
## # ft_evang_2017 <dbl>, ft_muslim_2017 <dbl>, ft_jew_2017 <dbl>,
## # ft_christ_2017 <dbl>, ft_gays_2017 <dbl>, ft_unions_2017 <dbl>,
## # ft_altright_2017 <dbl>, ft_black_2017 <dbl>, ft_white_2017 <dbl>,
## # ft_hisp_2017 <dbl>
Select the variables necessary for the analysis, filter to keep only the categories of interest, and store the prepared data.
data<-data%>%
select(ImmigrantNaturalization, PartyIdentification)%>%
filter(ImmigrantNaturalization %in% c("Favor","Not Sure"), PartyIdentification %in% c("Democrat","Republican"))
The table below shows the actual % of responses given for each category of ImmigrantNaturalization.
table(data$ImmigrantNaturalization)%>%
prop.table()%>%
round(2)
##
## Favor Not Sure
## 0.68 0.32
The table below shows the actual % of responses given for each category of PartyIdentification.
table(data$PartyIdentification)%>%
prop.table()%>%
round(2)
##
## Democrat Republican
## 0.71 0.29
Below are the values that we would expect to observe in a crosstab if the two variables were completely independent of each other. This is what we might consider the “null hypothesis”.
Favor (.68) * Democrat (.71) = .4828
Favor (.68) * Republican (.29) = .1972
Not Sure (.32) * Democrat (.71) = .2272
Not Sure (.32) * Republican (.29) = .0928
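The hand calculation of the expected values can also be sketched in R. This is only an illustration: it assumes the rounded marginal proportions reported in the two tables above, and the object names (immigrant_share, party_share, expected_share) are made up for this example.

```r
# Marginal proportions, rounded as in the two tables above
immigrant_share <- c(Favor = 0.68, "Not Sure" = 0.32)
party_share <- c(Democrat = 0.71, Republican = 0.29)

# Under independence, each cell share is (row share) * (column share).
# outer() multiplies every row share by every column share.
expected_share <- outer(immigrant_share, party_share)
expected_share
```

Rows are the ImmigrantNaturalization categories and columns are the PartyIdentification categories, so the four cells should match the four products listed above.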
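The table below shows the actual number of responses for each category combination. Because the result is still grouped by ImmigrantNaturalization after summarize(), the percent column is the share within each ImmigrantNaturalization category (a row %), not the share of the whole table. Dividing each n by the total number of responses (3291) gives the table %, and these values differ noticeably from the expected observations under the null hypothesis: for example, Favor/Democrat is about .54 observed versus .48 expected, and Not Sure/Republican is about .15 observed versus .09 expected.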
data%>%
group_by(ImmigrantNaturalization,PartyIdentification)%>%
summarize(n=n())%>%
mutate(percent=n/sum(n))
## `summarise()` regrouping output by 'ImmigrantNaturalization' (override with `.groups` argument)
## # A tibble: 4 x 4
## # Groups: ImmigrantNaturalization [2]
## ImmigrantNaturalization PartyIdentification n percent
## <chr> <chr> <int> <dbl>
## 1 Favor Democrat 1782 0.794
## 2 Favor Republican 463 0.206
## 3 Not Sure Democrat 564 0.539
## 4 Not Sure Republican 482 0.461
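To compare directly against the expected values under the null hypothesis, each count needs to be divided by the grand total rather than by its group total. A minimal sketch, assuming the four counts printed above are typed in by hand (the names observed and table_share are hypothetical):

```r
# Counts copied from the n column of the output above
observed <- matrix(
  c(1782, 463,
     564, 482),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Favor", "Not Sure"), c("Democrat", "Republican"))
)

# With no margin argument, prop.table() divides every cell by the grand total,
# which gives the table % for each category combination
table_share <- prop.table(observed)
round(table_share, 3)
```

In the dplyr pipeline, a similar result could probably be obtained by adding ungroup() before the mutate() step, so that sum(n) runs over all four rows.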
The table below shows the number of responses for each combination of ImmigrantNaturalization (rows) and PartyIdentification (columns). Treating ImmigrantNaturalization, the row variable, as the independent variable, row % are the appropriate percentages to highlight the relationship of interest; if the independent variable were in the columns, column % would be used instead.
table(data$ImmigrantNaturalization,data$PartyIdentification)
##
## Democrat Republican
## Favor 1782 463
## Not Sure 564 482
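Row % can be computed from this count table with prop.table(). The sketch below assumes the counts are entered by hand from the table above; the names observed and row_share are hypothetical.

```r
# Counts copied from the crosstab above
observed <- matrix(
  c(1782, 463,
     564, 482),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Favor", "Not Sure"), c("Democrat", "Republican"))
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_share <- prop.table(observed, margin = 1)
round(row_share, 3)
```

With the real data, passing the table() result itself to prop.table(..., margin = 1) should give the same row %; margin = 2 would give column % instead.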
data%>%
group_by(ImmigrantNaturalization,PartyIdentification)%>%
summarize(n=n())%>%
mutate(percent=n/sum(n))%>%
ggplot()+geom_col(aes(x=ImmigrantNaturalization,y=percent,fill=PartyIdentification))
## `summarise()` regrouping output by 'ImmigrantNaturalization' (override with `.groups` argument)
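Among respondents who answered Favor on ImmigrantNaturalization, 79.4% identify as Democrat and 20.6% as Republican. Among those who answered Not Sure, the split is much closer: 53.9% Democrat and 46.1% Republican. Democrats therefore make up a considerably larger share of the Favor group than of the Not Sure group, which suggests that the two variables are related.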
Below are the results of the chi-squared test for independence. This tells us whether there is a statistically significant relationship between the variables.
The results below indicate that there is a statistically significant relationship between ImmigrantNaturalization and PartyIdentification.
chisq.test(data$ImmigrantNaturalization, data$PartyIdentification)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data$ImmigrantNaturalization and data$PartyIdentification
## X-squared = 224.66, df = 1, p-value < 2.2e-16
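Because the p-value (< 2.2e-16) is far below the conventional 0.05 threshold, we reject the null hypothesis of independence: there is a statistically significant relationship between ImmigrantNaturalization and PartyIdentification.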
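As a cross-check, the same test can be run on the 2 x 2 count table alone, which also exposes the expected counts behind the statistic. This is a sketch that assumes the counts from the crosstab earlier in this report are typed in by hand; the names observed and test_result are hypothetical.

```r
# Counts copied from the crosstab reported earlier
observed <- matrix(
  c(1782, 463,
     564, 482),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Favor", "Not Sure"), c("Democrat", "Republican"))
)

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default
test_result <- chisq.test(observed)

test_result$expected   # counts expected if the two variables were independent
test_result$statistic  # the X-squared value
test_result$p.value    # the p-value used for the conclusion
```

The expected counts are the count-scale version of the expected proportions worked out by hand earlier: each one is (row total * column total) / grand total.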