Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")

Part 1: Data

The General Social Survey (GSS) gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes [1].

The data were collected from a random sample of Americans, so any findings can be generalized only to the American population. Moreover, because the survey relies on random sampling alone, with no random assignment of respondents to treatments, the results of this analysis can establish correlation but not causation.

Part 2: Research question

For this project, I have chosen a question about race in America and how people view social differences, i.e., differences in basic access to employment, wealth, and health between white and black people.

Historically, black people have found it harder to obtain the same benefits as white people. Greater difficulty finding a job, lower earnings, and exposure to prejudice are a few examples of the complex environment that can have a profound impact on the lives of people of color.

My question is: “Is a worse social condition due to difficulty in accessing higher levels of education?”

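For this task we will evaluate the gss database and analyze three variables of interest: year (the survey year), race (the respondent's race), and racdif3 (whether the respondent agrees that the differences are due to a lack of access to education).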


Part 3: Preprocessing and Exploratory data analysis

To answer this question, we will compare the opinions of Americans in the first and last polls and check whether the proportion of “Yes” answers to the racdif3 variable changed.

We first select only the three variables of interest. The ‘racdif3’ column has many NAs, so we remove them from the processed dataset. After removing the NAs, we keep only the rows from the first and last years of the poll.

# filter columns of interest
gss <- gss[,c('year','race','racdif3')];

# remove NA's from the racdif3 column
gss <- gss %>% filter(!is.na(racdif3));

# after removing NA's, get only the first and last year from data
year.min <- min(gss$year);
year.max <- max(gss$year);
gssQuestion <- gss %>% filter(year %in% c(year.min, year.max));

# are there any NAs left in the dataset?
any(is.na(gssQuestion))
## [1] FALSE

To get a first idea of the distribution of respondents, let’s tabulate the number of respondents by race in both of these years.

# get a first idea of the distribution of respondents in year.min and year.max
# (large difference between the numbers of black and white respondents)
table(filter(gssQuestion, year == year.min)$race)
## 
## White Black Other 
##  1284     0    14
table(filter(gssQuestion, year == year.max)$race)
## 
## White Black Other 
##   953   194   116

As we can see, the number of respondents differs greatly by race. In the first year of the poll there were no black respondents, so we cannot analyze our question with that year. In 2012 (the last year of the poll data) there is still a large difference, but at least we are able to investigate our research question.

To decide which year to use as our first year, let’s look at the distribution of respondents across all years.

# let's see the overall distribution of respondents through the years
# (at this point gss holds only the rows with a non-missing racdif3)
respondents <- gss %>% group_by(year,race) %>% summarise(n_respondents = n())
ggplot(respondents,aes(year,n_respondents,colour = race)) + geom_line() + ggtitle("Number of Respondents by Race")

The graph confirms that the number of black respondents was indeed much smaller throughout the years. According to the census.gov demographic table, black people make up around 13% of the US population, so this distribution is to be expected [2]. Balancing the need for enough observations to analyze our research question against the wish to keep a considerable interval between the chosen years, I decided to use 1996 as the first year, since the number of black respondents increased greatly in that year. We therefore filter our dataset accordingly.

gssQuestion.new <- gss %>% filter(year %in% c(1996, 2012), race %in% c('Black','White'));
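
The choice of the first year was made by eye from the plot. As a rough sketch, the same decision could be made explicit with a minimum-count threshold. The helper `first_year_with`, the threshold of 100, and the toy data frame below are all hypothetical; the counts are made up for illustration and are not GSS values.

```r
# Hypothetical toy data in the same shape as `respondents` (year, race, count).
# These numbers are invented for illustration and are NOT taken from the GSS.
toy <- data.frame(
  year = c(2001, 2001, 2002, 2002, 2003, 2003),
  race = c("White", "Black", "White", "Black", "White", "Black"),
  n_respondents = c(900, 0, 950, 40, 1500, 270)
)

# Hypothetical helper: earliest year in which `group` has at least `min_n`
# respondents; returns NA if no year reaches the threshold.
first_year_with <- function(df, group, min_n) {
  ok <- df$race == group & df$n_respondents >= min_n
  if (!any(ok)) return(NA_real_)
  min(df$year[ok])
}

first_year_with(toy, "Black", 100)
```

Applied to the real `respondents` table, a call like this would still need a judgment on the threshold and on how far apart the two years should be.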

Part 4: Inference

To assess our research question, we set up three hypothesis tests:

\(H_0: p_{1996,\,\text{White}} = p_{2012,\,\text{White}}\); \(H_A: p_{1996,\,\text{White}} \neq p_{2012,\,\text{White}}\)

\(H_0: p_{1996,\,\text{Black}} = p_{2012,\,\text{Black}}\); \(H_A: p_{1996,\,\text{Black}} \neq p_{2012,\,\text{Black}}\)

\(H_0: p_{2012,\,\text{White}} = p_{2012,\,\text{Black}}\); \(H_A: p_{2012,\,\text{White}} \neq p_{2012,\,\text{Black}}\)

For each of these hypothesis tests, we will use the inference function from the statsr package. Before each test, we will also check the conditions of independence and normality to confirm that our analysis is valid. In addition, we will compute a confidence interval for each test to reinforce our findings.

str(gssQuestion.new)
## 'data.frame':    2944 obs. of  3 variables:
##  $ year   : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
##  $ race   : Factor w/ 3 levels "White","Black",..: 2 1 2 2 2 1 1 1 1 2 ...
##  $ racdif3: Factor w/ 2 levels "Yes","No": 1 1 1 1 2 2 2 1 1 2 ...

Independence: the 2944 observations from the gss dataset are certainly less than 10% of the entire population (all Americans). In addition, nothing indicates that the sampling procedure introduced any dependence between the people interviewed. Therefore, we can assume that our observations are independent.
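
For the normality condition, a common rule of thumb for a two-proportion test is the success-failure check: under the pooled proportion, each group should have at least 10 expected "Yes" and 10 expected "No" answers. The sketch below applies that rule to the "Yes"/"No" counts tabulated in Q1 to Q3 of this report. The helper name `success_failure_ok` and the threshold argument are assumptions for illustration, not part of statsr.

```r
# "Yes"/"No" counts copied from the tables in Q1 to Q3 of this report.
counts <- list(
  white_1996 = c(yes = 678, no = 846),
  white_2012 = c(yes = 411, no = 542),
  black_1996 = c(yes = 146, no = 127),
  black_2012 = c(yes = 89,  no = 105)
)

# Hypothetical helper: success-failure check for a two-proportion test,
# using the pooled proportion as the null hypothesis assumes.
success_failure_ok <- function(g1, g2, threshold = 10) {
  n1 <- sum(g1)
  n2 <- sum(g2)
  p_pool <- unname((g1["yes"] + g2["yes"]) / (n1 + n2))
  expected <- c(n1 * p_pool, n1 * (1 - p_pool),
                n2 * p_pool, n2 * (1 - p_pool))
  all(expected >= threshold)
}

success_failure_ok(counts$white_1996, counts$white_2012)  # Q1
success_failure_ok(counts$black_1996, counts$black_2012)  # Q2
success_failure_ok(counts$black_2012, counts$white_2012)  # Q3
```

The smallest group here has 194 respondents with proportions close to one half, so each expected count is far above 10.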

Q1:

# filter data
q1 <- gssQuestion.new %>% filter(race == 'White')

# check sampling distribution normality
table(filter(q1,year == 1996)$racdif3)
## 
## Yes  No 
## 678 846
table(filter(q1,year == 2012)$racdif3)
## 
## Yes  No 
## 411 542
# hypothesis test
q1$year <- as.factor(q1$year)
inference(y = racdif3, x = year, data = q1, statistic = "proportion", type = "ht",
          null = 0, alternative = 'twosided', method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_1996 = 1524, p_hat_1996 = 0.4449
## n_2012 = 953, p_hat_2012 = 0.4313
## H0: p_1996 =  p_2012
## HA: p_1996 != p_2012
## z = 0.6641
## p_value = 0.5066

# confidence interval
inference(y = racdif3, x = year, data = q1, statistic = "proportion", type = "ci", conf_level = 0.95, method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_1996 = 1524, p_hat_1996 = 0.4449
## n_2012 = 953, p_hat_2012 = 0.4313
## 95% CI (1996 - 2012): (-0.0265 , 0.0538)

In our first analysis, we looked at whether white respondents agree that better access to education could reduce the difference in living conditions between white and black people. The proportions of white respondents in agreement were very similar, 0.4449 in 1996 against 0.4313 in 2012, and with the high p-value of 0.5066 we fail to reject the null hypothesis: we have no evidence that the proportion changed between these years.

To reinforce the result, we also look at the confidence interval, which we can interpret as follows: “We are 95% confident that the difference (1996 minus 2012) between the proportions of white people who agree that the social differences are due to access to education is between -0.0265 and 0.0538”. As the confidence interval contains 0 (the null value), it supports our decision not to reject the null hypothesis of equal proportions for white people.
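
To make the calculation behind these numbers visible, the sketch below recomputes the z statistic, the p-value, and the confidence interval by hand from the Q1 counts. It assumes the usual textbook formulas (pooled standard error for the test, unpooled standard error for the interval); `two_prop_z` is a hypothetical helper, not a statsr function.

```r
# Hypothetical helper (not part of statsr): two-proportion z-test and Wald
# confidence interval computed directly from "Yes" counts and sample sizes.
two_prop_z <- function(yes1, n1, yes2, n2, conf_level = 0.95) {
  p1 <- yes1 / n1
  p2 <- yes2 / n2
  diff <- p1 - p2

  # Hypothesis test: under H0 the proportions are equal, so pool them.
  p_pool <- (yes1 + yes2) / (n1 + n2)
  se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
  z <- diff / se_pool
  p_value <- 2 * pnorm(-abs(z))

  # Confidence interval: use the unpooled standard error.
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z_star <- qnorm(1 - (1 - conf_level) / 2)
  ci <- diff + c(-1, 1) * z_star * se

  list(z = z, p_value = p_value, ci = ci)
}

# White respondents: 678 of 1524 said "Yes" in 1996, 411 of 953 in 2012.
q1_check <- two_prop_z(678, 1524, 411, 953)
print(q1_check)
```

Up to rounding, the result should agree with the z, p-value, and 95% CI reported by inference above.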

Q2:

# filter data
q2 <- gssQuestion.new %>% filter(race == 'Black')

# check sampling distribution normality
table(filter(q2,year == 1996)$racdif3)
## 
## Yes  No 
## 146 127
table(filter(q2,year == 2012)$racdif3)
## 
## Yes  No 
##  89 105
# hypothesis test
q2$year <- as.factor(q2$year)
inference(y = racdif3, x=year, data = q2, statistic = "proportion", type = "ht",
          null = 0, alternative = 'twosided', method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_1996 = 273, p_hat_1996 = 0.5348
## n_2012 = 194, p_hat_2012 = 0.4588
## H0: p_1996 =  p_2012
## HA: p_1996 != p_2012
## z = 1.6195
## p_value = 0.1053

# confidence interval
inference(y = racdif3, x = year, data = q2, statistic = "proportion", type = "ci", conf_level = 0.95, method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_1996 = 273, p_hat_1996 = 0.5348
## n_2012 = 194, p_hat_2012 = 0.4588
## 95% CI (1996 - 2012): (-0.0157 , 0.1678)

In our second analysis, we looked at whether black respondents agree that better access to education could reduce the difference in living conditions between white and black people. The proportions of black respondents in agreement were less similar, 0.5348 in 1996 against 0.4588 in 2012. Although the sample proportions suggest a change in opinion, the p-value of 0.1053 does not let us reject the null hypothesis of equal proportions in these years, so we again do not have enough evidence to state that the proportion changed.

To reinforce the result, we also look at the confidence interval, which we can interpret as follows: “We are 95% confident that the difference (1996 minus 2012) between the proportions of black people who agree that the social differences are due to access to education is between -0.0157 and 0.1678”. As the confidence interval contains 0 (the null value), it supports our decision not to reject the null hypothesis of equal proportions for black people.
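
As an independent cross-check that does not rely on statsr, the same comparison can be run with prop.test from base R's stats package. This is a sketch using the Q2 counts from the tables above; the continuity correction is switched off here so that the result is comparable with the theoretical z-test.

```r
# Black respondents: 146 of 273 said "Yes" in 1996, 89 of 194 in 2012.
yes <- c(146, 89)
n <- c(273, 194)

# correct = FALSE turns off the continuity correction, so the chi-squared
# statistic equals the square of the two-proportion z statistic.
q2_check <- prop.test(x = yes, n = n, alternative = "two.sided",
                      conf.level = 0.95, correct = FALSE)

z_from_chisq <- sqrt(unname(q2_check$statistic))
print(z_from_chisq)
print(q2_check$p.value)
print(q2_check$conf.int)
```

With the default `correct = TRUE`, prop.test would give a slightly larger p-value and a slightly wider interval than the inference output.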

Q3:

# filter data
q3 <- gssQuestion.new %>% filter(year == 2012)

# check sampling distribution normality
table(filter(q3,race == 'White')$racdif3)
## 
## Yes  No 
## 411 542
table(filter(q3,race == 'Black')$racdif3)
## 
## Yes  No 
##  89 105
# hypothesis test
q3$race <- factor(q3$race, levels = c('Black', 'White'))
inference(y = racdif3, x=race, data = q3, statistic = "proportion", type = "ht",
          null = 0, alternative = 'twosided', method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Black = 194, p_hat_Black = 0.4588
## n_White = 953, p_hat_White = 0.4313
## H0: p_Black =  p_White
## HA: p_Black != p_White
## z = 0.7039
## p_value = 0.4815

# confidence interval
inference(y = racdif3, x = race, data = q3, statistic = "proportion", type = "ci", conf_level = 0.95, method = "theoretical", success = "Yes")
## Response variable: categorical (2 levels, success: Yes)
## Explanatory variable: categorical (2 levels) 
## n_Black = 194, p_hat_Black = 0.4588
## n_White = 953, p_hat_White = 0.4313
## 95% CI (Black - White): (-0.0494 , 0.1043)

In our last analysis, we tested whether the proportion of black respondents who agree with the statement differs from the proportion of white respondents who agree. In 2012 the proportion in agreement was 0.4588 for black respondents and 0.4313 for white respondents. With a high p-value of 0.4815, we do not have enough evidence to state that the two proportions differed in 2012.

To reinforce the result, we also look at the confidence interval, which we can interpret as follows: “We are 95% confident that, in 2012, the difference (black minus white) between the proportions of people who agree that the social differences are due to access to education is between -0.0494 and 0.1043”. As the confidence interval contains 0 (the null value), it supports our decision not to reject the null hypothesis of equal proportions for black and white people.

Part 5: Conclusion

From our analysis involving race and opinions on race-related social differences between white and black people, we do not have enough evidence to relate these two variables. Of course, the social conditions of a population are a multifactorial problem, and race alone may not be enough to show evidence of an effect; together with other variables, however, it might better explain the relationship between race and social differences. If a specialist held a strong opinion that this variable is related to our research question, we would look at combinations of racdif3 with other columns to try to explain our problem. On the other hand, if we had no further evidence that racdif3 could have an impact on our research question, we could drop it and look for new models and questions.
