library(ggplot2)
library(dplyr)
library(statsr)
library(tidyverse)
load("gss.Rdata")
The data used in this project come from the GSS (General Social Survey), which NORC (the National Opinion Research Center) at the University of Chicago has conducted since 1972. The GSS records contemporary American society: social trends and constants in attitudes, behaviors, and attributes, as well as the structure and functioning of society. It covers topics such as national spending priorities, marijuana use, crime, intergroup relations, social and economic life, lifestyle, civil liberties, subjective well-being, and confidence in institutions.
The gss data set has 57061 observations and 114 columns. Because it was gathered using random sampling, results can be generalized to the population of the country. Bias may still be present: some survey years are missing since 1972, and some columns have missing entries for certain years. Because the survey is observational and respondents were not randomly assigned to any condition, causal inferences cannot be made from the data.
In the following analysis, I look at how the proportion of U.S. white males holding each view on LGBTQ changed between 2002 and 2012. Here ‘Not Wrong At All’ is defined as a success and ‘Always Wrong’ as a failure (for clarity, only these two “extreme” opinions are considered).
The variables I will be using are race, sex, year, homosex.
summary(gss$race)
## White Black Other
## 46350  7926  2785
summary(gss$sex)
##   Male Female
##  25146  31915
summary(gss$homosex)
##     Always Wrong Almst Always Wrg  Sometimes Wrong Not Wrong At All
##            21601             1581             2243             7282
##            Other             NA's
##               82            24272
summary(gss$year)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1972    1983    1993    1992    2002    2012
Filter our data:
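W_Male <- gss %>%
  # %in% tests membership; == would recycle c(2002, 2012) along the rows
  filter(race == 'White', sex == 'Male', year %in% c(2002, 2012),
         !is.na(homosex), stringr::str_detect(homosex, 'Always Wrong|Not Wrong At All'))
The counts and test results reported in the rest of this two-year comparison were obtained with year == c('2002', '2012') instead of %in%. That comparison recycles the two-element vector along the rows (R warned that the longer object length is not a multiple of the shorter), so only about half of the matching respondents were kept: 189 and 172, rather than the 348 and 363 that appear as row totals in the expected-count table later in this section. Re-running with %in% will therefore change those numbers.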
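Comparing a column with a two-element vector using `==` does not test membership: R recycles the short vector position by position, so a row is kept only if its value happens to match the element lined up with it. The sketch below illustrates the difference from `%in%` on a toy vector; the values are made up for illustration and are not taken from the GSS.

```r
# Toy data (made-up values, not GSS rows).
years <- c(2002, 2012, 2002, 2012, 2012, 2002)

# `==` recycles c(2002, 2012): row 1 is compared with 2002,
# row 2 with 2012, row 3 with 2002, and so on.
keep_eq <- years == c(2002, 2012)

# `%in%` asks, for each row, whether its value is in the set.
keep_in <- years %in% c(2002, 2012)

sum(keep_eq)  # rows kept by ==
sum(keep_in)  # rows kept by %in%
```

Every row here is from 2002 or 2012, yet `==` drops the rows whose position does not line up with the recycled value, which is why `%in%` is the safer choice inside `filter()`.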
Summarize W_Male:
W_Male %>%
  group_by(year) %>%
  summarise(count = n())
## # A tibble: 2 x 2
## year count
## <int> <int>
## 1 2002 189
## 2 2012 172
W_Male %>%
  group_by(homosex) %>%
  summarise(count = n()) %>%
  # divide by the total rather than a hard-coded 361
  mutate(ratio = count / sum(count) * 100)
## # A tibble: 2 x 3
## homosex count ratio
## <fct> <int> <dbl>
## 1 Always Wrong 207 57.3
## 2 Not Wrong At All 154 42.7
Null hypothesis: the proportion of U.S. white males answering ‘Not Wrong At All’ in 2002 is the same as in 2012: \[p_{2002} = p_{2012}\]
Alternative hypothesis: the proportion of U.S. white males answering ‘Not Wrong At All’ in 2002 is not the same as in 2012: \[p_{2002} \ne p_{2012}\]
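Before running a theoretical test on two proportions, it helps to check the success-failure condition using the pooled proportion implied by the null hypothesis. The sketch below uses only the counts printed in the summaries above (189 and 172 respondents, 154 ‘Not Wrong At All’ answers in total); the variable names are illustrative.

```r
# Counts reported in the summaries above.
n <- c(`2002` = 189, `2012` = 172)  # respondents per year
successes_total <- 154              # 'Not Wrong At All' across both years

# Pooled proportion under H0: p_2002 = p_2012.
p_pool <- successes_total / sum(n)

# Expected successes and failures in each year under H0.
expected <- rbind(success = n * p_pool, failure = n * (1 - p_pool))
round(expected, 1)

# Success-failure condition: every expected count should be at least 10.
all(expected >= 10)
```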
inference(y = homosex, x = year, data = W_Male, statistic = 'proportion', type = 'ht',
          method = 'theoretical', success = 'Not Wrong At All', null = 0, alternative = 'twosided')
## Warning: Explanatory variable was numerical, it has been converted
## to categorical. In order to avoid this warning, first convert
## your explanatory variable to a categorical variable using the
## as.factor() function
## Warning: Ignoring null value since it's undefined for chi-square test of
## independence
## Response variable: categorical (2 levels, success: Not Wrong At All)
## Explanatory variable: categorical (5 levels)
## n_2002 = 189, p_hat_2002 = 0.3915
## n_2012 = 172, p_hat_2012 = 0.4651
## H0: p_2002 = p_2012
## HA: p_2002 != p_2012
## z = -1.4118
## p_value = 0.158
The p-value (0.158) is greater than \(\alpha\) (0.05), so we fail to reject the null hypothesis. These data do not provide convincing evidence that the proportion of U.S. white males answering ‘Not Wrong At All’ in 2002 differs from the proportion in 2012.
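The z statistic and p-value printed by `inference()` can be reproduced by hand with a pooled two-proportion z-test. In this minimal sketch the success counts 74 and 80 are an assumption: they are not printed in the output and are inferred here by rounding n × p_hat.

```r
n1 <- 189; n2 <- 172   # sample sizes from the output above
x1 <- 74;  x2 <- 80    # assumed counts: round(189 * 0.3915), round(172 * 0.4651)

p1 <- x1 / n1
p2 <- x2 / n2
p_pool <- (x1 + x2) / (n1 + n2)  # pooled proportion under H0

# Standard error of the difference under H0, then the test statistic.
se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z <- (p1 - p2) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

round(c(z = z, p_value = p_value), 4)
p_value < 0.05  # is the result significant at the 5% level?
```

Under these assumptions the calculation gives z of about -1.41 and a p-value of about 0.158, which is above 0.05.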
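The following is a quick analysis using the Chi-square test of independence, this time across all survey years.
State hypotheses: the null hypothesis is that U.S. white males’ views and the survey year are independent; the alternative hypothesis is that they are dependent.
Check conditions: independence follows from the random sampling described above, and every expected count in the table below is well above 5.
Inference: filter data: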
gss_W_Male <- gss %>%
  filter(race == 'White', sex == 'Male', !is.na(homosex),
         stringr::str_detect(homosex, 'Always Wrong|Not Wrong At All'))
Plot:
gss_W_Male %>%
  ggplot(aes(x = year, fill = homosex)) +
  geom_bar() +  # stacked bars of respondent counts per year
  labs(title = "U.S. white males' views on LGBTQ 1972 - 2012",
       x = 'Year',
       y = 'Count',
       caption = "Data Source: GSS",
       fill = 'Views')
We can see that although “Always Wrong” has been the more common answer throughout, its share has decreased considerably since the 1970s and is almost level with “Not Wrong At All” by 2012.
chisq.test(gss_W_Male$year, gss_W_Male$homosex)$expected
##                gss_W_Male$homosex
## gss_W_Male$year Always Wrong Not Wrong At All
## 1973 367.9182 122.08175
## 1974 370.9217 123.07834
## 1976 363.4131 120.58687
## 1977 374.6759 124.32407
## 1980 379.1811 125.81895
## 1982 358.9080 119.09200
## 1984 330.3756 109.62443
## 1985 388.9421 129.05785
## 1987 349.1469 115.85309
## 1988 220.7509 73.24905
## 1989 256.7919 85.20808
## 1990 219.2492 72.75076
## 1991 244.7783 81.22174
## 1993 252.2868 83.71320
## 1994 470.7852 156.21481
## 1996 439.2493 145.75066
## 1998 392.6964 130.30358
## 2000 403.2084 133.79164
## 2002 261.2970 86.70296
## 2004 212.4916 70.50844
## 2006 391.1947 129.80529
## 2008 307.8500 102.15004
## 2010 288.3278 95.67223
## 2012 272.5598 90.44016
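Expected counts under independence are built so that each year has the same split between the two views, scaled to that year's number of respondents. The sketch below copies three rows from the table above and checks this, along with the usual requirement that expected counts be at least 5; the object names are illustrative.

```r
# Three rows copied from the expected-count table above.
expected <- rbind(
  `1973` = c(367.9182, 122.08175),
  `2002` = c(261.2970, 86.70296),
  `2012` = c(272.5598, 90.44016)
)
colnames(expected) <- c("Always Wrong", "Not Wrong At All")

# Row totals of expected counts equal the observed respondents per year.
row_total <- rowSums(expected)
round(row_total)

# Under independence the 'Always Wrong' share is identical in every year.
share_always_wrong <- expected[, "Always Wrong"] / row_total
round(share_always_wrong, 4)

# Condition for the chi-square test: expected counts of at least 5.
all(expected >= 5)
```

The chi-square statistic then measures how far the observed counts in each year depart from this common split.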
Degrees of freedom: \[df = (R-1) \times (C-1)\]
(24-1)*(2-1)
## [1] 23
chisq.test(gss_W_Male$year, gss_W_Male$homosex)
##
## Pearson's Chi-squared test
##
## data: gss_W_Male$year and gss_W_Male$homosex
## X-squared = 636.53, df = 23, p-value < 2.2e-16
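The reported p-value can be recovered from the test statistic and the degrees of freedom alone. This is a minimal sketch using the printed values 636.53 and 23 with base R's chi-square distribution functions; the variable names are illustrative.

```r
chi_sq <- 636.53            # X-squared from the output above
df <- (24 - 1) * (2 - 1)    # 24 survey years, 2 response levels

# Upper-tail probability of a chi-square variable with df degrees of freedom.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
p_value < 2.2e-16

# Critical value at alpha = 0.05, for comparison with the statistic.
qchisq(0.95, df = df)
```

The statistic is far beyond the 5% critical value (about 35.17 for 23 degrees of freedom), which is why the p-value is reported only as an upper bound.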
The p-value is essentially 0, so we reject the null hypothesis. We have convincing evidence that U.S. white males’ views and the survey year are dependent.