Week 5 - Inferential Statistics Project

Summary

The purpose of this project is to present a study using the General Social Survey (GSS) dataset to demonstrate the skills developed during the Inferential Statistics course.

The project is organized as follows: an initial setup that loads the libraries and the data; a description of the data used; the research questions; an exploratory analysis of the data; and finally the inferences needed to answer the research questions.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)

Load data

load("data.Rdata")

Part 1: Data

The data we are using come from the General Social Surveys, 1972-2018, produced by NORC at the University of Chicago as part of The National Data Program for the Social Sciences. Our sample consists of 57061 observations and 114 variables. To achieve the objective of this project, we consulted Appendix A, “Sampling Design & Weighting”, of the documentation.

As this work aims to test some hypotheses at the sample level, we reviewed what the documentation says about sampling error and about the sampling probabilities described for each decade.

Regarding Sampling Error, it is important to consider that the study includes SAMPCODE (“sampling error code”), VSTRATA (variance stratum), and VPSU (variance primary sampling unit). Information about the use of this code is available from the GSS project staff at NORC.

An important aspect for our work is the Black oversamples, since the number of Black respondents increased over the course of the survey. Because race is central to our research questions, this should be kept in mind when interpreting the results.

Below we present the data that will be used. First, we selected four survey years spread across the history of the survey: 1976, 1986, 1996 and 2006. As the research questions will show, we focus on perceptions of racial policies and on fear of violence, considering the respondent's race. The variables used are described below.

The complete GSS database contains:

dim(gss)

## [1] 57061   114

Data Subset

We will work with a subset containing only the data relevant to our analyses:

df = subset(gss, year == 1976 | 
                 year == 1986 | 
                 year == 1996 |
                 year == 2006, select = c(year, race, natrace, fear))

str(df)

## 'data.frame':    10383 obs. of  4 variables:
##  $ year   : int  1976 1976 1976 1976 1976 1976 1976 1976 1976 1976 ...
##  $ race   : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ natrace: Factor w/ 3 levels "Too Little","About Right",..: 2 3 3 2 3 2 3 2 1 2 ...
##  $ fear   : Factor w/ 2 levels "Yes","No": 1 1 2 2 2 1 1 2 1 1 ...
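The year filter above can also be written more compactly with `%in%`. The sketch below is only an illustration: `toy` is a small made-up data frame standing in for `gss`, and `other_var` is a hypothetical extra column that the `select` argument drops.

```r
# Toy stand-in for gss (hypothetical values, for illustration only)
toy <- data.frame(
  year      = c(1975, 1976, 1986, 1990, 1996, 2006),
  race      = factor(c("White", "Black", "White", "Other", "Black", "White")),
  natrace   = factor(c("Too Little", "About Right", NA, "Too Much", "Too Little", "About Right")),
  fear      = factor(c("Yes", "No", NA, "Yes", "No", "Yes")),
  other_var = 1:6
)

years_used <- c(1976, 1986, 1996, 2006)

# Keep only the selected years and the four variables of interest
toy_sub <- subset(toy, year %in% years_used,
                  select = c(year, race, natrace, fear))
toy_sub
```

Keeping the years in one vector means the selection is stated in a single place if it ever needs to change.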

The ‘year’ variable contains the periods selected for analysis (1976, 1986, 1996 and 2006), with the respective frequencies shown below:

ggplot(df, aes(x=year, colour = year, fill = year)) +
  geom_histogram(bins = 20) +
  labs(x="Years", y="Frequencies", title="Frequencies of Observations - Subset")

Race Variable

The race variable has three categories (White, Black and Other), with the frequencies shown below:

ggplot(df, aes(x=year)) + 
  geom_histogram(bins=20) + 
  labs(x="Years", y="Frequencies", title="Race frequencies by year") +
  facet_grid(~ race)

racefreq = xtabs(~year+race, data = df)
ftable(racefreq)

##      race White Black Other
## year                       
## 1976       1361   129     9
## 1986       1249   184    37
## 1996       2349   402   153
## 2006       3284   634   592
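Because the yearly sample sizes differ considerably, shares within each year can be easier to compare than raw counts. A minimal sketch, with the counts copied from the table above into a matrix (the names `racefreq_m` and `race_share` are introduced here only for illustration):

```r
# Counts copied from the year-by-race table above
racefreq_m <- matrix(
  c(1361, 129,   9,
    1249, 184,  37,
    2349, 402, 153,
    3284, 634, 592),
  nrow = 4, byrow = TRUE,
  dimnames = list(year = c("1976", "1986", "1996", "2006"),
                  race = c("White", "Black", "Other"))
)

# Row proportions: share of each race category within a survey year
race_share <- prop.table(racefreq_m, margin = 1)
round(100 * race_share, 1)
```

With the real data, `prop.table(racefreq, margin = 1)` should give the same shares directly from the `xtabs` object.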

NatRace Variable - Improving Conditions of Blacks

The variable natrace refers to improving the conditions of Blacks. It is part of a broader question asking whether the government is spending too little, about the right amount, or too much on various national problems, among them the conditions of Blacks.

table(df$year, df$natrace)

##       
##        Too Little About Right Too Much
##   1976        409         604      379
##   1986        243         308      115
##   1996        457         558      277
##   2006        521         602      225

Fear Variable - Afraid to Walk at night in neighborhood

Another variable we will use is fear, which records whether the respondent is afraid to walk around the neighborhood at night. The frequencies for each year are shown below; the relationship with other variables is examined in Part 3.

table(df$year, df$fear)

##       
##         Yes   No
##   1976  657  835
##   1986    0    0
##   1996  804 1099
##   2006  719 1274

It is important to note that the table shows no responses to this question in 1986, so we exclude that year from any analysis involving this variable.
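The years that actually have responses can also be identified programmatically rather than by eye. A small sketch, with the counts copied from the table above (`fear_m` and `years_with_fear` are names introduced here for illustration):

```r
# Counts copied from the year-by-fear table above
fear_m <- matrix(
  c(657,  835,
      0,    0,
    804, 1099,
    719, 1274),
  nrow = 4, byrow = TRUE,
  dimnames = list(year = c("1976", "1986", "1996", "2006"),
                  fear = c("Yes", "No"))
)

# Keep only the years with at least one recorded answer
years_with_fear <- rownames(fear_m)[rowSums(fear_m) > 0]
years_with_fear
```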

Part 2: Research question

Broadly, the research questions concern perceptions over time of policies to improve the quality of life of Blacks, and the fear of walking in the neighborhood at night.

With this focus, we will use inference techniques to answer:

Is there a dependency between race and the perception of the Improving Conditions of Blacks policies?
Is there a dependency between race and being Afraid to Walk at night in neighborhood?

Part 3: Exploratory data analysis

Considering the research questions, we separated the exploratory analyses by the variables race, natrace (Improving Conditions of Blacks) and fear (Afraid to Walk at night in neighborhood) over the years. The analysis below shows the behavior of these variables in the years 1976, 1986, 1996 and 2006.

We will disregard the Other category and compare only Black and White respondents.
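Filtering rows on a factor keeps the unused level in the factor definition, so an empty Other row can still appear in later tables. A sketch of one way to drop it, using a made-up data frame (`toy` and `bw` are hypothetical names):

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  race = factor(c("White", "Black", "Other", "White", "Black"),
                levels = c("White", "Black", "Other")),
  fear = factor(c("Yes", "No", "Yes", "No", "Yes"),
                levels = c("Yes", "No"))
)

# Keep White and Black respondents, then remove the now-empty "Other" level
bw <- droplevels(subset(toy, race %in% c("White", "Black")))

levels(bw$race)
table(bw$race, bw$fear)
```

Without `droplevels()`, `table()` would still print a row of zeros for Other.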

Improving Conditions of Blacks

p1 = df %>% filter(year == 1976) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
  ggplot() + 
  geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "About Improving Black Conditions") +
  ggtitle("1976 Survey")

p2 = df %>% filter(year == 1986) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
  ggplot() + 
  geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "About Improving Black Conditions") +
  ggtitle("1986 Survey")

p3 = df %>% filter(year == 1996) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
  ggplot() + 
  geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "About Improving Black Conditions") +
  ggtitle("1996 Survey")

p4 = df %>% filter(year == 2006) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
  ggplot() + 
  geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE, na.rm = TRUE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "About Improving Black Conditions") +
  ggtitle("2006 Survey")

grid.arrange(p1,p2,p3,p4, ncol=2)
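The four plot definitions above differ only in the year, so they could be generated by one function. The following is a sketch under assumptions, not the code used for this report: `plot_natrace` is a hypothetical helper and `toy` is made-up data standing in for `df`.

```r
library(ggplot2)

# Hypothetical helper: bar chart of natrace by race for a single survey year
plot_natrace <- function(data, yr) {
  # Same filters as above: one year, White/Black only, non-missing natrace
  d <- subset(data, year == yr & race %in% c("White", "Black") & !is.na(natrace))
  ggplot(d) +
    geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
    facet_wrap(~race) +
    labs(y = "Frequencies", x = "About Improving Black Conditions") +
    ggtitle(paste(yr, "Survey"))
}

# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  year    = rep(c(1976, 1986), each = 4),
  race    = factor(rep(c("White", "Black", "Other", "White"), times = 2),
                   levels = c("White", "Black", "Other")),
  natrace = factor(rep(c("Too Little", "About Right", "Too Much", NA), times = 2),
                   levels = c("Too Little", "About Right", "Too Much"))
)

# One plot per year, collected in a list
plots <- lapply(c(1976, 1986), plot_natrace, data = toy)
```

The resulting list can then be passed to `grid.arrange(grobs = plots, ncol = 2)` from gridExtra.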

natracefreq = xtabs(~year+race+natrace, data = df)
ftable(natracefreq)

##            natrace Too Little About Right Too Much
## year race                                         
## 1976 White                303         582      374
##      Black                104          18        3
##      Other                  2           4        2
## 1986 White                163         283      111
##      Black                 73          21        3
##      Other                  7           4        1
## 1996 White                269         495      266
##      Black                165          28        2
##      Other                 23          35        9
## 2006 White                292         500      189
##      Black                163          36        7
##      Other                 66          66       29
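To compare the two groups despite their very different sizes, the counts can be turned into within-group percentages. A minimal sketch with the White and Black counts copied from the table above (the data frame `natrace_counts` and its column names are introduced here for illustration):

```r
# White and Black counts copied from the table above
natrace_counts <- data.frame(
  year        = rep(c(1976, 1986, 1996, 2006), each = 2),
  race        = rep(c("White", "Black"), times = 4),
  too_little  = c(303, 104, 163, 73, 269, 165, 292, 163),
  about_right = c(582,  18, 283, 21, 495,  28, 500,  36),
  too_much    = c(374,   3, 111,  3, 266,   2, 189,   7)
)

# Respondents per year and race, and the share answering "Too Little"
natrace_counts$n <- with(natrace_counts, too_little + about_right + too_much)
natrace_counts$pct_too_little <-
  round(100 * natrace_counts$too_little / natrace_counts$n, 1)

natrace_counts[, c("year", "race", "n", "pct_too_little")]
```

In 1976, for example, this gives about 24% of White respondents and about 83% of Black respondents answering "Too Little".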

Afraid to Walk at night in neighborhood

p5 = df %>% filter(year == 1976) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(fear %in% c("Yes", "No")) %>%
  ggplot() + 
  geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
  ggtitle("1976 Survey")

p6 = df %>% filter(year == 1996) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(fear %in% c("Yes", "No")) %>%
  ggplot() + 
  geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
  ggtitle("1996 Survey")

p7 = df %>% filter(year == 2006) %>%
  filter(race %in% c("White", "Black")) %>%
  filter(fear %in% c("Yes", "No")) %>%
  ggplot() + 
  geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
  facet_wrap(~race) +
  labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
  ggtitle("2006 Survey")


grid.arrange(p5,p6,p7, ncol=2)

Comments

fearfreq = xtabs(~year+race+fear, data = df)
ftable(fearfreq)

##            fear Yes  No
## year race              
## 1976 White      591 764
##      Black       62  66
##      Other        4   5
## 1996 White      613 917
##      Black      150 119
##      Other       41  63
## 2006 White      488 978
##      Black      120 150
##      Other      111 146
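The share of respondents who report being afraid can be derived from these counts with `xtabs()` and `prop.table()`. A sketch using the White and Black counts from the table above (`fear_counts`, `fear_tab` and `fear_share` are illustrative names):

```r
# White and Black counts copied from the table above
fear_counts <- data.frame(
  year = rep(c("1976", "1996", "2006"), each = 4),
  race = rep(rep(c("White", "Black"), each = 2), times = 3),
  fear = rep(c("Yes", "No"), times = 6),
  n    = c(591, 764,  62,  66,
           613, 917, 150, 119,
           488, 978, 120, 150)
)

# Rebuild the three-way table from the counts
fear_tab <- xtabs(n ~ year + race + fear, data = fear_counts)

# Proportions within each year-by-race group (margins 1 and 2)
fear_share <- prop.table(fear_tab, margin = c(1, 2))
round(100 * fear_share[, , "Yes"], 1)
```

In each of the three years the share answering "Yes" is higher among Black respondents than among White respondents, with the smallest gap in 1976.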

Part 4: Inference

In Part 4 the objective is to check whether there is a relationship between two categorical variables: first race and natrace (Improving Conditions of Blacks), then race and fear (Afraid to Walk at night in neighborhood). We split the analysis across the four selected years and use the Chi-Square test for each.

The Chi-Square test of independence is a statistical method for determining whether two categorical variables are significantly associated. It examines whether the rows and columns of a contingency table are statistically associated.

We therefore set up a hypothesis test. Null hypothesis (H0): the row and column variables of the contingency table are independent. Alternative hypothesis (H1): the row and column variables are dependent.

In each case we use the Chi-Square test to decide whether the data provide enough evidence to reject H0:
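To make the mechanics of the test concrete, the statistic can be computed by hand: under H0 each expected count is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The sketch below does this for the 1976 race by natrace counts from the table in Part 3 (`obs`, `expected`, `x2` and `dof` are illustrative names):

```r
# 1976 counts of natrace by race, copied from the table in Part 3
obs <- matrix(
  c(303, 582, 374,
    104,  18,   3,
      2,   4,   2),
  nrow = 3, byrow = TRUE,
  dimnames = list(race    = c("White", "Black", "Other"),
                  natrace = c("Too Little", "About Right", "Too Much"))
)

# Expected counts under independence
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson chi-squared statistic and degrees of freedom
x2  <- sum((obs - expected)^2 / expected)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

round(x2, 2)  # 193.16
dof           # 4
```

These values agree with the `chisq.test()` output reported for 1976.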

Hypothesis Testing Race X Improving Conditions of Black

Null hypothesis (H0): The variables reported are independent. Alternative hypothesis (H1): The variables reported are dependent.

If the p-value of the test is below 0.05, we reject H0 and conclude that the variables are dependent (H1). If it is above 0.05, we fail to reject H0: there is no evidence against independence.

1976

df1976 = subset(df, year == 1976)
chisq.test(df1976$race,df1976$natrace)

## 
##  Pearson's Chi-squared test
## 
## data:  df1976$race and df1976$natrace
## X-squared = 193.16, df = 4, p-value < 2.2e-16

1986

df1986 = subset(df, year == 1986)
chisq.test(df1986$race,df1986$natrace)

## Warning in chisq.test(df1986$race, df1986$natrace): Chi-squared approximation
## may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  df1986$race and df1986$natrace
## X-squared = 79.25, df = 4, p-value = 2.511e-16
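The warning above is issued by `chisq.test()` when at least one expected count is below 5. A sketch that inspects the expected counts for 1986, using the counts from the table in Part 3 (`obs_1986` and `expected_1986` are illustrative names):

```r
# 1986 counts of natrace by race, copied from the table in Part 3
obs_1986 <- matrix(
  c(163, 283, 111,
     73,  21,   3,
      7,   4,   1),
  nrow = 3, byrow = TRUE,
  dimnames = list(race    = c("White", "Black", "Other"),
                  natrace = c("Too Little", "About Right", "Too Much"))
)

# Expected counts under independence
expected_1986 <- outer(rowSums(obs_1986), colSums(obs_1986)) / sum(obs_1986)
round(expected_1986, 2)

# Number of cells with an expected count below 5
sum(expected_1986 < 5)
```

Both small cells are in the Other row, which has only 12 respondents. Possible remedies, offered here as suggestions rather than as part of the original analysis, are to drop the Other category or to call `chisq.test()` with `simulate.p.value = TRUE`.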

1996

df1996 = subset(df, year == 1996)
chisq.test(df1996$race,df1996$natrace)

## 
##  Pearson's Chi-squared test
## 
## data:  df1996$race and df1996$natrace
## X-squared = 252.25, df = 4, p-value < 2.2e-16

2006

df2006 = subset(df, year == 2006)
chisq.test(df2006$race,df2006$natrace)

## 
##  Pearson's Chi-squared test
## 
## data:  df2006$race and df2006$natrace
## X-squared = 176.77, df = 4, p-value < 2.2e-16

Results

As the results above show, the p-value is far below 0.05 in all four years (at most 2.511e-16), so we reject H0 in each case: Race and Natrace are dependent. There is strong evidence of an association between a respondent's race and their perception of the policies to improve the lives of Blacks. Two caveats apply: the tests include all three race categories (df = 4), and for 1986 R warns that the chi-squared approximation may be incorrect, which usually indicates small expected counts in some cells.
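Part 3 states that the Other category is disregarded, while the tests above use all three race levels (df = 4). As a sketch of the corresponding White/Black-only test, here is the 1976 case with counts taken from the table in Part 3 (`bw_1976` and `res` are illustrative names):

```r
# 1976 counts for White and Black respondents only (from the table in Part 3)
bw_1976 <- matrix(
  c(303, 582, 374,
    104,  18,   3),
  nrow = 2, byrow = TRUE,
  dimnames = list(race    = c("White", "Black"),
                  natrace = c("Too Little", "About Right", "Too Much"))
)

# Chi-squared test of independence on the 2 x 3 table
res <- chisq.test(bw_1976)

res$parameter       # degrees of freedom: (2 - 1) * (3 - 1) = 2
res$p.value < 0.05  # TRUE means H0 is rejected at the 5% level
```

With the real data, the same test could be run on a subset restricted to these two categories after `droplevels()`.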

Hypothesis testing Race X Afraid to Walk at night in neighborhood

Null hypothesis (H0): The variables reported are independent. Alternative hypothesis (H1): The variables reported are dependent.

If the p-value of the test is below 0.05, we reject H0 and conclude that the variables are dependent (H1). If it is above 0.05, we fail to reject H0: there is no evidence against independence.

1976

chisq.test(df1976$race,df1976$fear)

## 
##  Pearson's Chi-squared test
## 
## data:  df1976$race and df1976$fear
## X-squared = 1.1037, df = 2, p-value = 0.5759

1986 - Not available

1996

df1996 = subset(df, year == 1996)
chisq.test(df1996$race,df1996$fear)

## 
##  Pearson's Chi-squared test
## 
## data:  df1996$race and df1996$fear
## X-squared = 23.462, df = 2, p-value = 8.039e-06

2006

df2006 = subset(df, year == 2006)
chisq.test(df2006$race,df2006$fear)

## 
##  Pearson's Chi-squared test
## 
## data:  df2006$race and df2006$fear
## X-squared = 18.782, df = 2, p-value = 8.347e-05

Results

The results differ by year. For 1976 the p-value is 0.5759, above 0.05, so we fail to reject H0: there is no evidence of an association between race and fear in that year. For 1996 (p-value = 8.039e-06) and 2006 (p-value = 8.347e-05) the p-values are below 0.05, so we reject H0: race and the fear of walking at night are dependent in those years. No data are available for 1986.
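The p-values and the resulting decisions can be recomputed from the reported statistics alone. A short sketch, with the X-squared values and degrees of freedom copied from the outputs above (`fear_tests` and `alpha` are illustrative names):

```r
# Statistics copied from the chisq.test() outputs above
fear_tests <- data.frame(
  year      = c(1976, 1996, 2006),
  x_squared = c(1.1037, 23.462, 18.782),
  df        = 2
)

# Upper-tail probability of the chi-squared distribution
fear_tests$p_value <- pchisq(fear_tests$x_squared,
                             df = fear_tests$df,
                             lower.tail = FALSE)

# Decision at the 5% significance level
alpha <- 0.05
fear_tests$reject_h0 <- fear_tests$p_value < alpha

fear_tests
```

H0 is rejected for 1996 and 2006 but not for 1976.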

Conclusion

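Based on the Chi-squared tests, we conclude that Race and NatRace are dependent in all four years analyzed (1976, 1986, 1996 and 2006). Race and Fear are dependent in 1996 and 2006, while for 1976 we found no evidence against independence. These tests indicate association only; they do not measure its strength or establish causation.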