Variable Selection & Research Question
Categorical (Independent) Variable
The independent variable I will be analyzing is GayMarriage, which tells us whether the respondent favors or opposes gay marriage.
Continuous (Dependent) Variable
The dependent variable I will be analyzing is NumChildren, which tells us how many children the respondent has.
Research Question/Hypothesis
I hypothesize that there is a relationship between the number of children a respondent has and whether they support gay marriage.
Data Prep
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
voterdata <- read_csv("/Users/rebeccagibble/Downloads/(Data)Abbreviated Labeled Voter 2017 Dataset.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## NumChildren = col_double(),
## Immigr_Economy_GiveTake = col_double(),
## ft_fem_2017 = col_double(),
## ft_immig_2017 = col_double(),
## ft_police_2017 = col_double(),
## ft_dem_2017 = col_double(),
## ft_rep_2017 = col_double(),
## ft_evang_2017 = col_double(),
## ft_muslim_2017 = col_double(),
## ft_jew_2017 = col_double(),
## ft_christ_2017 = col_double(),
## ft_gays_2017 = col_double(),
## ft_unions_2017 = col_double(),
## ft_altright_2017 = col_double(),
## ft_black_2017 = col_double(),
## ft_white_2017 = col_double(),
## ft_hisp_2017 = col_double()
## )
## See spec(...) for full column specifications.
data<-voterdata%>%
select(NumChildren, GayMarriage)%>%
filter(GayMarriage %in% c("Favor","Oppose"))
Comparison of Means
Table comparing mean of continuous variable between groups
data%>%
select(GayMarriage, NumChildren)%>%
filter(GayMarriage %in% c("Favor","Oppose"))%>%
group_by(GayMarriage)%>%
summarize(Number_Children=mean(NumChildren,na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## GayMarriage Number_Children
## <chr> <dbl>
## 1 Favor 0.326
## 2 Oppose 0.467
Visualization
data%>%
ggplot()+
geom_histogram(aes(x=NumChildren))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 99 rows containing non-finite values (stat_bin).

Interpretation
The table shows that respondents who favor gay marriage have on average 0.33 children, while those who oppose gay marriage have on average 0.47. A respondent cannot have a fraction of a child, so these means are best read as a comparison: respondents who favor gay marriage tend to have fewer children than those who oppose it.
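A mean alone hides how many respondents sit in each group and how spread out the counts are. The sketch below shows one way to report those alongside the mean. The `toy` data frame is simulated and purely hypothetical (it is not the voter survey), and the column names `n`, `mean_children`, `sd_children`, and `share_no_children` are illustrative choices.

```r
library(dplyr)

# Hypothetical stand-in for the filtered `data` object; values are simulated,
# not taken from the voter survey.
set.seed(1)
toy <- data.frame(
  GayMarriage = rep(c("Favor", "Oppose"), times = c(300, 200)),
  NumChildren = c(rpois(300, 0.3), rpois(200, 0.5))
)

# Group size, mean, spread, and share of respondents with no children
group_summary <- toy %>%
  group_by(GayMarriage) %>%
  summarize(
    n = sum(!is.na(NumChildren)),
    mean_children = mean(NumChildren, na.rm = TRUE),
    sd_children = sd(NumChildren, na.rm = TRUE),
    share_no_children = mean(NumChildren == 0, na.rm = TRUE)
  )
group_summary
```

Swapping `toy` for `data` would give the same summary for the survey responses.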
Comparison of Distributions
data%>%
ggplot()+
geom_histogram(aes(x=NumChildren))+
facet_wrap(~GayMarriage)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 99 rows containing non-finite values (stat_bin).

Interpretation
Side by side, the two histograms have very similar shapes. The biggest difference is at 0 children, where more respondents favor gay marriage than oppose it. Because the shapes are so similar, this gap may simply reflect that more respondents favor gay marriage overall, and there may not be a real relationship between number of children and support for gay marriage. The sampling distributions and t-test that follow examine the relationship more closely.
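Because the two groups differ in size, raw counts make it hard to tell whether the shapes differ or one group is just larger. One possible way to separate the two, sketched below on simulated data, is to plot the share of each group rather than the count. The `toy` data frame is hypothetical, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Hypothetical stand-in for the filtered `data` object (simulated values)
set.seed(2)
toy <- data.frame(
  GayMarriage = rep(c("Favor", "Oppose"), times = c(300, 200)),
  NumChildren = c(rpois(300, 0.3), rpois(200, 0.5))
)

# With binwidth = 1, each bar's density equals the share of that panel's
# respondents with that number of children, so panels are comparable
# even though the groups have different sizes.
p <- ggplot(toy) +
  geom_histogram(aes(x = NumChildren, y = after_stat(density)), binwidth = 1) +
  facet_wrap(~GayMarriage) +
  labs(y = "Share of respondents in group")
p
```

Setting `binwidth = 1` also matches the data, since NumChildren only takes whole-number values.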
Sampling Distribution & T-test
Group 1
data1<-data%>%
filter(GayMarriage=="Favor")
Group 2
data2<-data%>%
filter(GayMarriage=="Oppose")
Drawing 10,000 samples of 40 respondents and calculating means of continuous variables for each group with respective sampling distributions
Group 1
favor_data<-replicate(10000,
sample(data1$NumChildren, 40)%>%mean(na.rm=TRUE)
)%>%
data.frame()%>%
rename("mean"=1)
favor_data%>%
ggplot()+
geom_histogram(aes(x=mean),fill="green")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Group 2
oppose_data<-replicate(10000,
sample(data2$NumChildren, 40)%>%mean(na.rm=TRUE)
)%>%
data.frame()%>%
rename("mean"=1)
oppose_data%>%
ggplot()+
geom_histogram(aes(x=mean),fill="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation of these histograms
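Each histogram shows the means of 10,000 samples of 40 respondents, so it is a sampling distribution of the mean. As the central limit theorem predicts, these distributions are far closer to a normal distribution than the distribution of the individual responses.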
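The same theory also predicts how wide a sampling distribution should be: the standard deviation of the sample means should be close to the standard deviation of the individual values divided by the square root of the sample size. The sketch below checks this on simulated data. The vector `children` is a hypothetical skewed count variable, not the survey's NumChildren, and it contains no missing values.

```r
set.seed(3)

# Hypothetical skewed count data standing in for one group's NumChildren
children <- rpois(3000, 0.35)

# 10,000 samples of 40, keeping the mean of each sample
sample_means <- replicate(10000, mean(sample(children, 40)))

# Spread of the sampling distribution vs. the sd / sqrt(n) prediction
observed_se <- sd(sample_means)
predicted_se <- sd(children) / sqrt(40)
c(observed = observed_se, predicted = predicted_se)
```

The two values should be close but not identical, since the observed one comes from random sampling.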
T-Test Set Up
data%>%
summarize(NumChildren=mean(NumChildren,na.rm = TRUE))
## # A tibble: 1 x 1
## NumChildren
## <dbl>
## 1 0.395
If the number of children a respondent has is unrelated to their support for gay marriage, then there should be very little difference between the overall mean of 0.395 and the group-wise means for those who favor and those who oppose.
data%>%
group_by(GayMarriage)%>%
summarize(NumChildren=mean(NumChildren,na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## GayMarriage NumChildren
## <chr> <dbl>
## 1 Favor 0.326
## 2 Oppose 0.467
These averages are approximately the same distance from 0.395, about 0.07, but in opposite directions. Both group means stray a bit from the overall mean.
Hypotheses
Null Hypothesis: There is no difference in the mean number of children between the two groups.
Alt Hypothesis: There is a difference in the mean number of children between the two groups.
T-Test Execution
t.test(NumChildren~GayMarriage, data=data)
##
## Welch Two Sample t-test
##
## data: NumChildren by GayMarriage
## t = -6.4456, df = 6240.5, p-value = 1.237e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.18473685 -0.09857266
## sample estimates:
## mean in group Favor mean in group Oppose
## 0.3258237 0.4674785
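There is a statistically significant difference in the mean number of children between respondents who favor and those who oppose gay marriage. We know this because the p-value (1.237e-10) is far smaller than 0.05, and the 95 percent confidence interval for the difference in means (Favor minus Oppose), -0.185 to -0.099, does not include 0.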
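To see where the t statistic and degrees of freedom in a Welch two-sample test come from, the sketch below computes them by hand and compares them with `t.test()`. It runs on a simulated, hypothetical `toy` data frame rather than the survey data, and the names `t_manual` and `df_manual` are illustrative.

```r
# Hypothetical stand-in for the filtered `data` object (simulated values)
set.seed(4)
toy <- data.frame(
  GayMarriage = rep(c("Favor", "Oppose"), times = c(300, 200)),
  NumChildren = c(rpois(300, 0.3), rpois(200, 0.5))
)

favor <- toy$NumChildren[toy$GayMarriage == "Favor"]
oppose <- toy$NumChildren[toy$GayMarriage == "Oppose"]

# Each group's variance divided by its sample size
v_favor <- var(favor) / length(favor)
v_oppose <- var(oppose) / length(oppose)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(favor) - mean(oppose)) / sqrt(v_favor + v_oppose)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v_favor + v_oppose)^2 /
  (v_favor^2 / (length(favor) - 1) + v_oppose^2 / (length(oppose) - 1))

# Compare with the built-in test
t_builtin <- t.test(NumChildren ~ GayMarriage, data = toy)
c(t_manual = t_manual, t_builtin = unname(t_builtin$statistic))
c(df_manual = df_manual, df_builtin = unname(t_builtin$parameter))
```

The sign of the statistic follows the group order: Favor comes first alphabetically, so a negative t means the Favor group has the lower mean.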