The purpose of this project is to present a study using the General Social Survey (GSS) dataset to demonstrate the skills developed during the Inferential Statistics course.
The project is organized as follows: an initial setup that loads the libraries and the data, a description of the data used, the research questions, an exploratory analysis of the data, and finally the inferences needed for the analysis of this project.
library(ggplot2)
library(dplyr)
library(statsr)
library(gridExtra)
load("data.Rdata")
The data we are using come from the General Social Surveys, 1972-2018, produced by NORC - University of Chicago as part of The National Data Program for the Social Sciences. Our sample consists of 57061 observations and 114 variables. To support the objective of this project, we consulted APPENDIX A “SAMPLING DESIGN & WEIGHTING”.
As this work aims to test some hypotheses at the sample level, we checked the documentation for its treatment of sampling error and of the selection probabilities used in each decade.
Regarding Sampling Error, it is important to consider that the study includes SAMPCODE (“sampling error code”), VSTRATA (variance stratum), and VPSU (variance primary sampling unit). Information about the use of this code is available from the GSS project staff at NORC.
An important aspect for our work is the Black oversamples: over the course of the survey, the number of Black respondents increased. Since race is central to our research questions, this aspect is worth keeping in mind.
Below we present the data that will be used. First, we made a choice at the annual level: we chose 4 years across the history of the survey, namely 1976, 1986, 1996 and 2006. As we will see in the research questions, we focus on perceptions of racial policies and on fear of violence, considering the race of the respondents. For this purpose, we will use the variables described below.
The complete GSS database contains:
dim(gss)
## [1] 57061 114
We will work only with a subset containing the data relevant to our analyses:
df = subset(gss, year == 1976 |
year == 1986 |
year == 1996 |
year == 2006, select = c(year, race, natrace, fear))
str(df)
## 'data.frame': 10383 obs. of 4 variables:
## $ year : int 1976 1976 1976 1976 1976 1976 1976 1976 1976 1976 ...
## $ race : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ natrace: Factor w/ 3 levels "Too Little","About Right",..: 2 3 3 2 3 2 3 2 1 2 ...
## $ fear : Factor w/ 2 levels "Yes","No": 1 1 2 2 2 1 1 2 1 1 ...
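As a minimal sketch of the same row and column selection, the year filter can also be written with `%in%`, which avoids repeating the comparison for each year. The data frame `gss_toy` and the vector `years_kept` are hypothetical stand-ins introduced only for illustration; they are not part of the GSS data.

```r
# Hypothetical stand-in for the gss data frame (values are illustrative only)
gss_toy <- data.frame(
  year = c(1976, 1977, 1986, 1996, 2000, 2006),
  race = factor(c("White", "Black", "White", "Other", "Black", "White"))
)

# Years selected for the study
years_kept <- c(1976, 1986, 1996, 2006)

# Keep only the rows of the selected years and the columns of interest
df_toy <- subset(gss_toy, year %in% years_kept, select = c(year, race))
print(df_toy)
```

With the real data, the same pattern would be applied to `gss` with the four variables selected above.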
The ‘year’ variable contains the periods selected for analysis, 1976, 1986, 1996 and 2006, with the respective frequencies shown below:
ggplot(df, aes(x=year, colour = year, fill = year)) +
geom_histogram(bins = 20) +
labs(x="Years", y="Frequencies", title="Frequencies of Observations - Subset")
The race variable has three categories, White, Black and Other, with the frequencies shown below:
ggplot(df, aes(x=year)) +
geom_histogram(bins=20) +
labs(x="Years", y="Frequencies", title="Race frequencies by year") +
facet_grid(~ race)
racefreq = xtabs(~year+race, data = df)
ftable(racefreq)
## race White Black Other
## year
## 1976 1361 129 9
## 1986 1249 184 37
## 1996 2349 402 153
## 2006 3284 634 592
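Because the sample size changes considerably from year to year, raw counts are hard to compare. The sketch below re-enters the counts from the table above by hand (the names `race_counts` and `race_share` are introduced here for illustration) and converts them to within-year shares with `prop.table`.

```r
# Counts copied from the year-by-race table above
race_counts <- matrix(
  c(1361, 129,   9,
    1249, 184,  37,
    2349, 402, 153,
    3284, 634, 592),
  nrow = 4, byrow = TRUE,
  dimnames = list(year = c("1976", "1986", "1996", "2006"),
                  race = c("White", "Black", "Other"))
)

# Share of each race category within each year (rows sum to 1)
race_share <- prop.table(race_counts, margin = 1)
print(round(100 * race_share, 1))
```

With the full data frame, the equivalent call would presumably be `prop.table(racefreq, margin = 1)`.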
The variable natrace refers to improving the conditions of Blacks. It is part of a broader question asking whether the government is spending too little, about the right amount, or too much on a set of national problems, among them the conditions of Blacks.
table(df$year, df$natrace)
##
## Too Little About Right Too Much
## 1976 409 604 379
## 1986 243 308 115
## 1996 457 558 277
## 2006 521 602 225
Another variable that we will use is fear, which records whether the respondent is afraid to walk around the neighborhood at night. The frequencies for each year are shown below; the relationship with the other variables is treated in part 3.
table(df$year, df$fear)
##
## Yes No
## 1976 657 835
## 1986 0 0
## 1996 804 1099
## 2006 719 1274
It is important to note that there are no responses to this question in 1986, so we exclude that year from any analysis involving this variable.
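A small sketch of how the empty year can be detected and dropped programmatically rather than by inspection. The counts are copied from the table above; `fear_counts`, `empty_years` and `fear_used` are names introduced here for illustration.

```r
# Counts copied from the year-by-fear table above
fear_counts <- matrix(
  c(657,  835,
      0,    0,
    804, 1099,
    719, 1274),
  nrow = 4, byrow = TRUE,
  dimnames = list(year = c("1976", "1986", "1996", "2006"),
                  fear = c("Yes", "No"))
)

# Years in which the question has no responses at all
empty_years <- rownames(fear_counts)[rowSums(fear_counts) == 0]
print(empty_years)

# Keep only the years that can be analysed
fear_used <- fear_counts[rowSums(fear_counts) > 0, , drop = FALSE]
print(fear_used)
```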
The research questions broadly aim to examine how perceptions of policies to improve the quality of life of Blacks, and the fear of walking in the neighborhood at night, have varied over time.
With this research focus, we will use inference techniques to check:
Is there a dependency between the categorical variable race and the perception of the Improving Conditions of Blacks policies?
Is there a dependency between the categorical variable race and being Afraid to Walk at night in neighborhood?
Given the research questions, we organized the exploratory analyses around the variables race, natrace (Improving Conditions of Blacks) and fear (Afraid to Walk at night in neighborhood) over the years. The analysis below shows the behavior of these variables in the years 1976, 1986, 1996 and 2006.
In the plots we disregard the Other category and compare only Black and White respondents.
p1 = df %>% filter(year == 1976) %>%
filter(race %in% c("White", "Black")) %>%
filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
ggplot() +
geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "About Improving Black Conditions") +
ggtitle("1976 Survey")
p2 = df %>% filter(year == 1986) %>%
filter(race %in% c("White", "Black")) %>%
filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
ggplot() +
geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "About Improving Black Conditions") +
ggtitle("1986 Survey")
p3 = df %>% filter(year == 1996) %>%
filter(race %in% c("White", "Black")) %>%
filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
ggplot() +
geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "About Improving Black Conditions") +
ggtitle("1996 Survey")
p4 = df %>% filter(year == 2006) %>%
filter(race %in% c("White", "Black")) %>%
filter(natrace %in% c("Too Little","About Right","Too Much")) %>%
ggplot() +
geom_bar(aes(x = natrace, fill = natrace), show.legend = FALSE, na.rm = TRUE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "About Improving Black Conditions") +
ggtitle("2006 Survey")
grid.arrange(p1,p2,p3,p4, ncol=2)
natracefreq = xtabs(~year+race+natrace, data = df)
ftable(natracefreq)
## natrace Too Little About Right Too Much
## year race
## 1976 White 303 582 374
## Black 104 18 3
## Other 2 4 2
## 1986 White 163 283 111
## Black 73 21 3
## Other 7 4 1
## 1996 White 269 495 266
## Black 165 28 2
## Other 23 35 9
## 2006 White 292 500 189
## Black 163 36 7
## Other 66 66 29
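Since there are far more White than Black respondents, the counts above are easier to read as within-group percentages. The following sketch re-enters the White and Black rows of the table by hand and computes the share of each answer within each year and race group; `natrace_counts` and `natrace_share` are names introduced here for illustration.

```r
# White and Black counts copied from the year-by-race-by-natrace table above
natrace_counts <- matrix(
  c(303, 582, 374,
    104,  18,   3,
    163, 283, 111,
     73,  21,   3,
    269, 495, 266,
    165,  28,   2,
    292, 500, 189,
    163,  36,   7),
  nrow = 8, byrow = TRUE,
  dimnames = list(
    group = paste(rep(c(1976, 1986, 1996, 2006), each = 2),
                  c("White", "Black")),
    natrace = c("Too Little", "About Right", "Too Much")
  )
)

# Share of each answer within each year/race group (rows sum to 1)
natrace_share <- prop.table(natrace_counts, margin = 1)
print(round(100 * natrace_share, 1))
```

For example, in 1976 the "Too Little" share is 104/125 among Black respondents and 303/1259 among White respondents.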
p5 = df %>% filter(year == 1976) %>%
filter(race %in% c("White", "Black")) %>%
filter(fear %in% c("Yes", "No")) %>%
ggplot() +
geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
ggtitle("1976 Survey")
p6 = df %>% filter(year == 1996) %>%
filter(race %in% c("White", "Black")) %>%
filter(fear %in% c("Yes", "No")) %>%
ggplot() +
geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
ggtitle("1996 Survey")
p7 = df %>% filter(year == 2006) %>%
filter(race %in% c("White", "Black")) %>%
filter(fear %in% c("Yes", "No")) %>%
ggplot() +
geom_bar(aes(x = fear, fill = fear), show.legend = FALSE) +
facet_wrap(~race) +
labs(y="Frequencies", x= "Afraid to Walk at night in neighborhood") +
ggtitle("2006 Survey")
grid.arrange(p5,p6,p7, ncol=2)
fearfreq = xtabs(~year+race+fear, data = df)
ftable(fearfreq)
## fear Yes No
## year race
## 1976 White 591 764
## Black 62 66
## Other 4 5
## 1996 White 613 917
## Black 150 119
## Other 41 63
## 2006 White 488 978
## Black 120 150
## Other 111 146
In this fourth part the objective is to verify whether there is a relationship between two categorical variables. We look at the relationship between race and natrace (Improving Conditions of Blacks), and between race and fear (Afraid to Walk at night in neighborhood). We split the analyses over the four selected periods and use the Chi-Square test to assess the relationship between these variables.
The Chi-Square test is a statistical method to determine whether two categorical variables are significantly associated. It examines whether the rows and columns of a contingency table are statistically independent.
We therefore set up the following hypothesis test:
Null hypothesis (H0): the row and column variables of the contingency table are independent.
Alternative hypothesis (H1): the row and column variables are dependent.
If the p-value of the test is below 0.05 we reject H0 and conclude that the variables are dependent (H1); if it is above 0.05 we fail to reject H0 and have no evidence against independence.
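To make the test concrete, the sketch below computes the Chi-Square statistic by hand for the 1976 race-by-natrace counts shown in the exploratory tables, then applies the 0.05 decision rule. It is an illustration under the assumption that the counts are re-entered manually; the names `obs`, `expected`, `x2`, `dof`, `p_value` and `reject_h0` are introduced here and do not appear in the original analysis.

```r
# 1976 counts copied from the race-by-natrace table (all three race categories)
obs <- matrix(
  c(303, 582, 374,
    104,  18,   3,
      2,   4,   2),
  nrow = 3, byrow = TRUE,
  dimnames = list(race = c("White", "Black", "Other"),
                  natrace = c("Too Little", "About Right", "Too Much"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x2 <- sum((obs - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(x2, df = dof, lower.tail = FALSE)

# Decision rule at the 5% significance level
reject_h0 <- p_value < 0.05
cat("X-squared =", round(x2, 2), " df =", dof, " reject H0:", reject_h0, "\n")
```

The statistic and degrees of freedom computed this way should agree with what `chisq.test` reports for the same table.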
df1976 = subset(df, year == 1976)
chisq.test(df1976$race,df1976$natrace)
##
## Pearson's Chi-squared test
##
## data: df1976$race and df1976$natrace
## X-squared = 193.16, df = 4, p-value < 2.2e-16
df1986 = subset(df, year == 1986)
chisq.test(df1986$race,df1986$natrace)
## Warning in chisq.test(df1986$race, df1986$natrace): Chi-squared approximation
## may be incorrect
##
## Pearson's Chi-squared test
##
## data: df1986$race and df1986$natrace
## X-squared = 79.25, df = 4, p-value = 2.511e-16
df1996 = subset(df, year == 1996)
chisq.test(df1996$race,df1996$natrace)
##
## Pearson's Chi-squared test
##
## data: df1996$race and df1996$natrace
## X-squared = 252.25, df = 4, p-value < 2.2e-16
df2006 = subset(df, year == 2006)
chisq.test(df2006$race,df2006$natrace)
##
## Pearson's Chi-squared test
##
## data: df2006$race and df2006$natrace
## X-squared = 176.77, df = 4, p-value < 2.2e-16
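As the results above show, the p-values are far below 0.05 in all four years (p < 2.2e-16 in 1976, 1996 and 2006; p = 2.511e-16 in 1986), so we reject H0: the variables Race and Natrace are dependent. It is therefore very likely that, in the American population, there is a relationship between the race of citizens and their perception of national policies to improve the lives of Blacks. Two caveats apply: the tests above include all three race categories (hence df = 4), and for 1986 R warns that the Chi-squared approximation may be incorrect, which typically reflects small expected counts (the Other category has only 12 respondents that year).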
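Since the exploratory analysis compares only White and Black respondents, one possible follow-up is to repeat the test on those two categories alone. The sketch below does this from the counts in the exploratory table, entered by hand; the names `wb_counts` and `wb_results` are introduced here for illustration, and this restricted test is a suggestion rather than part of the original analysis.

```r
# White and Black counts per year, copied from the race-by-natrace table
# Columns: Too Little, About Right, Too Much
wb_counts <- list(
  "1976" = rbind(White = c(303, 582, 374), Black = c(104, 18, 3)),
  "1986" = rbind(White = c(163, 283, 111), Black = c(73, 21, 3)),
  "1996" = rbind(White = c(269, 495, 266), Black = c(165, 28, 2)),
  "2006" = rbind(White = c(292, 500, 189), Black = c(163, 36, 7))
)

# Run Pearson's Chi-squared test on each 2 x 3 table
wb_results <- t(sapply(wb_counts, function(m) {
  test <- chisq.test(m)
  c(statistic = unname(test$statistic),
    df        = unname(test$parameter),
    p_value   = test$p.value)
}))
print(wb_results)
```

Each table has 2 rows and 3 columns, so the test has 2 degrees of freedom instead of 4.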
For race and fear, the hypotheses are: Null hypothesis (H0): race and fear are independent. Alternative hypothesis (H1): race and fear are dependent.
If the p-value of the test is below 0.05 we reject H0 and conclude that the variables are dependent (H1); if it is above 0.05 we fail to reject H0.
chisq.test(df1976$race,df1976$fear)
##
## Pearson's Chi-squared test
##
## data: df1976$race and df1976$fear
## X-squared = 1.1037, df = 2, p-value = 0.5759
df1996 = subset(df, year == 1996)
chisq.test(df1996$race,df1996$fear)
##
## Pearson's Chi-squared test
##
## data: df1996$race and df1996$fear
## X-squared = 23.462, df = 2, p-value = 8.039e-06
df2006 = subset(df, year == 2006)
chisq.test(df2006$race,df2006$fear)
##
## Pearson's Chi-squared test
##
## data: df2006$race and df2006$fear
## X-squared = 18.782, df = 2, p-value = 8.347e-05
The results above differ by year. In 1976 the p-value is 0.5759, above 0.05, so we fail to reject H0: there is no evidence of a relationship between race and fear of walking at night in that year. In 1996 (p = 8.039e-06) and 2006 (p = 8.347e-05) the p-values are below 0.05, so we reject H0: race and the fear of walking at night are dependent in those years, considering the American population.
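The three race-by-fear tests can be summarised in one step. The sketch below rebuilds the contingency tables from the counts in the exploratory section (entered by hand) and applies the 0.05 rule to each year; `fear_by_year`, `p_values` and `decision` are names introduced here for illustration. `suppressWarnings` is used because small expected counts in the Other row can trigger R's approximation warning.

```r
# Counts per year copied from the race-by-fear table (columns: Yes, No)
fear_by_year <- list(
  "1976" = rbind(White = c(591, 764), Black = c(62, 66),   Other = c(4, 5)),
  "1996" = rbind(White = c(613, 917), Black = c(150, 119), Other = c(41, 63)),
  "2006" = rbind(White = c(488, 978), Black = c(120, 150), Other = c(111, 146))
)

# p-value of Pearson's Chi-squared test for each year
p_values <- sapply(fear_by_year, function(m) {
  suppressWarnings(chisq.test(m))$p.value
})

# Apply the 5% decision rule to each year
decision <- ifelse(p_values < 0.05, "reject H0", "fail to reject H0")
print(data.frame(p_value = signif(p_values, 4), decision = decision))
```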
We conclude, based on the Chi-squared tests used, that the categorical variables Race and NatRace are dependent in all four analyzed years, while Race and Fear show no evidence of dependence in 1976 but are dependent in 1996 and 2006.