Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

This extract of the General Social Survey (GSS) Cumulative File 1972-2012 provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for students learning statistical reasoning using the R language. We will investigate one question using the GSS dataset. Because this is an observational study that uses random sampling, the results are generalizable to the US population aged 18 years or over. Because there is no random assignment in this study, we cannot draw causal conclusions.

As this study does not employ volunteers, we can also exclude the possibility of voluntary response bias. However, there might be non-response bias, since some variables in the GSS Codebook contain responses coded as “Refused”, as well as convenience bias, since the surveys before 2006 targeted only English-speaking persons while later surveys also included Spanish speakers.

According to the full General Social Survey Cumulative File:

The National Data Program for the Social Sciences is designed as a data diffusion project and a program of social indicator research. The data come from the General Social Surveys, interviews administered to NORC national samples using a standard questionnaire.

The General Social Surveys have been conducted during February, March, and April of 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1980, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1993, 1994, 1996, 1998, 2000, 2002, 2004, 2006, 2008, 2010, and 2012. There are a total of 57,061 completed interviews. Each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons 18 years of age or over, living in non-institutional arrangements within the United States. Starting in 2006 Spanish-speakers were added to the target population. Block quota sampling was used in 1972, 1973, and 1974 surveys and for half of the 1975 and 1976 surveys. Full probability sampling was employed in half of the 1975 and 1976 surveys and the 1977, 1978, 1980, 1982-1991, 1993-1998, 2000, 2002, 2004, 2006, 2008, 2010, and 2012 surveys. Also, the 2004, 2006, 2008, 2010, and 2012 surveys had sub-sampled non-respondents.

The data from the interviews were processed according to standard NORC procedures.


Part 2: Research question

In many areas of the world crime is on the rise, so I believe many people (myself included) wonder whether men are as afraid of walking alone at night as women are. Therefore, I decided to research the following question:

Are women and men from the US equally likely to be afraid to walk alone at night in their neighborhood?


Part 3: Exploratory data analysis

Since our question is about men and women (variable named “sex”), we will exclude the observations coded as “NA” for this variable in the GSS Codebook from our analysis. Likewise, since our question is about their fear of walking alone at night in their neighborhood (variable named “fear”), we will exclude the observations coded as “NA” for this variable in the GSS Codebook.

#filtering out unwanted data (rows where sex or fear is NA)
gss_na <- gss %>%
  filter(!is.na(sex), !is.na(fear)) %>%
  select(sex, fear)
#table of proportions
gss_na %>%
  group_by(sex) %>%
  summarise(count = n(), y_c = sum(fear == "Yes"), n_c = sum(fear == "No"), y_p = sum(fear == "Yes") / n() , n_p = sum(fear == "No") / n() )
## # A tibble: 2 x 6
##   sex    count   y_c   n_c   y_p   n_p
##   <fct>  <int> <int> <int> <dbl> <dbl>
## 1 Male   15178  3419 11759 0.225 0.775
## 2 Female 19117 10591  8526 0.554 0.446
#side-by-side bar plot of fear by sex
ggplot(data = gss_na, aes(x = sex, fill = fear)) + geom_bar(position=position_dodge()) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_fill_manual(values=c("#003300", "#CC0000"))

As we can see from the table, the data in the “gss” dataset suggest that women from the US are more afraid to walk alone at night in their neighborhood than men from the US. The proportion of women who are afraid to walk alone at night in their neighborhood is 0.5540095, while the proportion of men is 0.2252602.

The side-by-side bar plot helps us visualise the distribution and reaffirms the indication above.

As this is only an exploratory data analysis (whose results can be generalised to the general public but are not causal), we cannot draw definite conclusions; it only provides these indications.


Part 4: Inference

In order to answer the research question we need to do inference. As “sex” and “fear” are both categorical variables with two levels each, we will study the difference between the proportions of women and men who answer “Yes” to “fear”, using a confidence interval and a hypothesis test.

Before we do inference though, we need to check if the conditions for both confidence interval and hypothesis testing are met.

Starting with independence, the sampled females and the sampled males can each be assumed to be independent within their group. This is because the sample is random, and both the 19117 women and the 15178 men are less than 10% of all US women and men respectively. Finally, we have no reason to expect the sampled females and males to depend on each other, so there is independence between groups as well.

As for the sample size/skew we need to check if the success-failure conditions for both confidence interval and hypothesis test are met.

#success-failure conditions for confidence interval
15178*0.2252602
## [1] 3418.999
15178*0.7747398
## [1] 11759
19117*0.5540095
## [1] 10591
19117*0.4459905
## [1] 8526
#success-failure conditions for hypothesis test (using the pooled proportion)
ppool <- (3419 + 10591)/(15178 + 19117)
15178*ppool
## [1] 6200.431
15178*(1-ppool)
## [1] 8977.569
19117*ppool
## [1] 7809.569
19117*(1-ppool)
## [1] 11307.43

Since all of the values calculated above are at least 10, the sample size/skew conditions for both the confidence interval and the hypothesis test are met for both groups (men and women). We can therefore assume that the sampling distribution of the difference between the two proportions is nearly normal.
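The same checks can be wrapped in a small helper. This is only an illustrative sketch: the function name check_success_failure and its threshold argument are assumptions introduced here and are not part of statsr, and the counts are copied from the table in Part 3.

```r
# Hypothetical helper (not part of statsr): TRUE when both the expected
# number of successes and the expected number of failures reach the threshold.
check_success_failure <- function(n, p, threshold = 10) {
  (n * p >= threshold) && (n * (1 - p) >= threshold)
}

# Counts taken from the table of proportions above
n_female <- 19117
n_male <- 15178
p_female <- 10591 / n_female
p_male <- 3419 / n_male
p_pool <- (10591 + 3419) / (n_female + n_male)

# Confidence interval: check with the observed proportions
ci_ok <- check_success_failure(n_female, p_female) &&
  check_success_failure(n_male, p_male)

# Hypothesis test: check with the pooled proportion
ht_ok <- check_success_failure(n_female, p_pool) &&
  check_success_failure(n_male, p_pool)

print(c(ci = ci_ok, ht = ht_ok))
```

Both entries should be TRUE for these counts, in line with the manual calculations.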

Therefore, we can go on with doing inference for both confidence interval and hypothesis testing and we will start that by stating the hypotheses:

\(H_0: p_{Female} - p_{Male} = 0\)

\(H_A: p_{Female} - p_{Male} \ne 0\)

Next, we will construct the 95% confidence interval for the difference between the proportion of women and the proportion of men who are afraid. This can be done because the success-failure condition for the confidence interval was met.

#confidence interval
inference(y = fear, x= sex, data = gss_na, statistic = "proportion", type = "ci", method = "theoretical", success = "Yes", order = c("Female","Male"))
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Female = 19117, p_hat_Female = 0.554
## n_Male = 15178, p_hat_Male = 0.2253
## 95% CI (Female - Male): (0.3191 , 0.3384)

Since 0 is not between 0.3191 and 0.3384, we reject our null hypothesis. Consequently, we are 95% confident that the proportion of women from the US who are afraid to walk alone at night in their neighborhood is 31.9 to 33.8 percentage points higher than the proportion of men from the US who are afraid to walk alone at night in their neighborhood.
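As a sanity check, the interval can be recomputed by hand from the counts in Part 3. This is a minimal sketch that assumes the theoretical method is the usual normal-approximation interval with the unpooled standard error; the variable names are illustrative and nothing here relies on statsr.

```r
# Counts from the table of proportions in Part 3
n_female <- 19117; yes_female <- 10591
n_male <- 15178; yes_male <- 3419

p_female <- yes_female / n_female
p_male <- yes_male / n_male
diff_hat <- p_female - p_male

# Unpooled standard error of the difference between two proportions
se <- sqrt(p_female * (1 - p_female) / n_female +
           p_male * (1 - p_male) / n_male)

# 95% confidence interval: point estimate +/- z* x SE
z_star <- qnorm(0.975)
ci <- diff_hat + c(-1, 1) * z_star * se
print(round(ci, 4))
```

Up to rounding, the result should agree with the interval reported by inference().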

We will also conduct a hypothesis test for our research question. This can be done because the success-failure condition for the hypothesis test was met.

#hypothesis test
inference(y = fear, x = sex, data = gss_na, statistic = "proportion", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical", success = "Yes", order = c("Female","Male"))
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Female = 19117, p_hat_Female = 0.554
## n_Male = 15178, p_hat_Male = 0.2253
## H0: p_Female =  p_Male
## HA: p_Female != p_Male
## z = 61.5164
## p_value = < 0.0001

Since the p-value (< 0.0001) is less than the significance level (0.05), we reject the null hypothesis. Hence, we conclude that the proportions of women and men from the US who are afraid of walking alone at night in their neighborhood are different.
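The test statistic can likewise be recomputed by hand. This is a sketch assuming the standard two-proportion z-test, in which the standard error uses the pooled proportion because the null hypothesis states that the two proportions are equal; the variable names are illustrative.

```r
# Counts from the table of proportions in Part 3
n_female <- 19117; yes_female <- 10591
n_male <- 15178; yes_male <- 3419

p_female <- yes_female / n_female
p_male <- yes_male / n_male

# Pooled proportion under H0: p_Female = p_Male
p_pool <- (yes_female + yes_male) / (n_female + n_male)

# Standard error under H0, based on the pooled proportion
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_female + 1 / n_male))

# z statistic and two-sided p-value
z <- (p_female - p_male) / se_pool
p_value <- 2 * pnorm(-abs(z))
print(c(z = z, p_value = p_value))
```

The z value should match the one reported by inference() up to rounding, and the p-value is so small that it may print as 0.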

As we can see, the results from the hypothesis test and the confidence interval agree.
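For an independent cross-check, base R's prop.test() can be run on the counts from Part 3. This is a sketch, not the method used in this report: with correct = FALSE (no continuity correction) the function reports a chi-squared statistic, whose square root is comparable to the z statistic, together with a 95% confidence interval for the difference in proportions.

```r
# Successes ("Yes") and sample sizes, in the order Female, Male
yes_counts <- c(Female = 10591, Male = 3419)
group_sizes <- c(Female = 19117, Male = 15178)

# Two-sample test of equal proportions without continuity correction
res <- prop.test(x = yes_counts, n = group_sizes,
                 alternative = "two.sided", conf.level = 0.95,
                 correct = FALSE)

# The square root of the chi-squared statistic is comparable to |z|
z_from_chisq <- sqrt(unname(res$statistic))
print(z_from_chisq)
print(res$conf.int)
print(res$p.value)
```

If the two approaches agree, the square root of the statistic should be close to 61.5 and the interval close to (0.3191, 0.3384).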

To sum up, the data provide strong evidence that women and men from the US are not equally likely to be afraid to walk alone at night in their neighborhood. I would suggest further research that takes gun ownership into account.