Bao Ngoc Dinh
contact: ngocdinh1410@gmail.com
github: github.com/ngocdinh1410
## Setup
The data file and the R Markdown file must be in the same directory. Once loaded, the data frame is called gss.
Since 1972, the General Social Survey (GSS) has provided politicians, policymakers, and scholars with a clear and unbiased perspective on what Americans think and feel about such issues as national spending priorities, crime and punishment, intergroup relations, and confidence in institutions. The vast majority of GSS data is obtained in face-to-face interviews. Computer-assisted personal interviewing (CAPI) began in the 2002 GSS. Under some conditions, when it has proved difficult to arrange an in-person interview with a sampled respondent, GSS interviews may be conducted by telephone. The target population is adults (18+) in the United States.
According to the appendix, the survey used a quota sampling system designed to closely reflect the population, drawing respondents across race, gender, and employment status. One thing that bothered me, however, is that only English-speaking respondents were sampled. Some people in the US have limited ability to speak English. Spanish speakers were included starting in 2006, but other non-English speakers were not, even though they might be a significant portion of the actual US population. The survey might therefore be biased by the underrepresentation of non-English speakers.
Respondents were randomly sampled and interviewed, so we can make observations and inferences about the population, but that does not establish causality. To test causality we would need more in-depth research with control variables.
Is there a relationship between race and highest year of education?
First, I omitted the missing data (NA). I then calculated the average highest year of education for each race category in every survey year. For every race, the average level of education gradually moved up over time. However, the sample for the ‘Other’ category is rather small compared to Black and White, so it might not accurately reflect the current level of education of other races. Secondly, categorizing people as ‘Other’ is rather misleading: it lumps together smaller racial groups in the United States that we then fail to look at individually.
* * *
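The per-race, per-year averages described in the exploratory analysis can be sketched as follows. This is a minimal illustration, not the original analysis code: the `toy` data frame and its values are made up, and base R's `aggregate()` is an assumed stand-in for whatever grouping code was actually used.

```r
# Hypothetical toy data standing in for the gss data frame
toy <- data.frame(
  year = c(2000, 2000, 2000, 2012, 2012, 2012),
  race = c("White", "White", "Black", "White", "Black", "Black"),
  educ = c(12, 14, NA, 16, 12, 14)
)

# The formula interface drops rows with NA in educ before averaging,
# so a group with only missing values produces no row at all
avg_educ <- aggregate(educ ~ year + race, data = toy, FUN = mean)
print(avg_educ)
```

Plotting `educ` against `year` with one line per `race` would then show the trend for each group.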
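Since the ‘other’ population is rather small, I will develop two inference questions based on the data: whether the mean highest year of education changed between 2000 and 2012 for White respondents, and whether it changed for Black respondents.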
## Warning in year == c(2000, 2012): longer object length is not a multiple of
## shorter object length
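The warning above points at a comparison of the form `year == c(2000, 2012)`. With `==`, R recycles the two-element vector along the column, so each row is compared with only one of the two years and many matching rows are silently dropped. If that happened here, the 2000 and 2012 samples would be smaller than the rows actually available. A minimal sketch with made-up data (the `toy` data frame is hypothetical) shows the difference from `%in%`:

```r
# Hypothetical toy data standing in for the year column of gss
toy <- data.frame(year = c(2000, 2000, 2012, 2012, 1998, 2000, 2012))

# `==` recycles c(2000, 2012): row 1 is compared with 2000, row 2 with 2012,
# row 3 with 2000, and so on (warning suppressed here for a clean run)
recycled <- suppressWarnings(toy$year == c(2000, 2012))

# `%in%` tests every row against the whole set of years
membership <- toy$year %in% c(2000, 2012)

sum(recycled)    # rows kept by the recycled comparison
sum(membership)  # rows kept by set membership
```

In a dplyr pipeline the equivalent fix would be `filter(year %in% c(2000, 2012))`.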
new2_white <- new2_white %>% filter(race == "White") %>% select(year, race, educ)
new2_white$year <- as.factor(new2_white$year)
summary(new2_white)
##    year         race           educ
##  2000:1098   White:1845   Min.   : 0.00
##  2012: 747   Black:   0   1st Qu.:12.00
##              Other:   0   Median :13.00
##                           Mean   :13.51
##                           3rd Qu.:16.00
##                           Max.   :20.00
# Hypothesis test
inference(educ, year, data = new2_white, type = "ht", statistic = "mean", null = 0, method = "theoretical", alternative = "twosided")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_2000 = 1098, y_bar_2000 = 13.3734, s_2000 = 2.8366
## n_2012 = 747, y_bar_2012 = 13.7055, s_2012 = 3.0043
## H0: mu_2000 = mu_2012
## HA: mu_2000 != mu_2012
## t = -2.3835, df = 746
## p_value = 0.0174
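As a check on the output above, the test statistic can be recomputed by hand from the printed summary statistics. This is a sketch under one assumption: the degrees of freedom are taken as the smaller sample size minus one, a conservative choice that reproduces the df = 746 printed above. It is not meant as a description of the package's internal code.

```r
# Summary statistics copied from the output above (White respondents)
n_2000 <- 1098; mean_2000 <- 13.3734; sd_2000 <- 2.8366
n_2012 <- 747;  mean_2012 <- 13.7055; sd_2012 <- 3.0043

# Standard error of the difference between two independent sample means
se <- sqrt(sd_2000^2 / n_2000 + sd_2012^2 / n_2012)

# t statistic for H0: mu_2000 = mu_2012
t_stat <- (mean_2000 - mean_2012) / se

# Conservative degrees of freedom (assumption, matches df = 746 above)
df <- min(n_2000, n_2012) - 1

# Two-sided p-value
p_value <- 2 * pt(-abs(t_stat), df)

round(c(t = t_stat, df = df, p = p_value), 4)
```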
The GSS sample contains 1,098 White respondents for 2000 and 747 for 2012. Each sample is far less than 10% of the US population, so the observations can be treated as independent.
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_2000 = 1098, y_bar_2000 = 13.3734, s_2000 = 2.8366
## n_2012 = 747, y_bar_2012 = 13.7055, s_2012 = 3.0043
## 95% CI (2000 - 2012): (-0.6056 , -0.0586)
The sample mean rose from 13.37 years in 2000 to 13.71 years in 2012. The p-value of 0.0174 is below the 0.05 significance level, so we reject the null hypothesis: there is a change in the average highest year of education for people who identify as White. The 95% confidence interval for the difference (2000 - 2012) is (-0.6056 , -0.0586). It does not contain 0, which agrees with the hypothesis test.
## Warning in year == c(2000, 2012): longer object length is not a multiple of
## shorter object length
black <- black %>% filter(race == "Black") %>% select(year, race, educ)
black$year <- as.factor(black$year)
summary(black)
##    year       race          educ
##  2000:218   White:  0   Min.   : 2.00
##  2012:148   Black:366   1st Qu.:12.00
##             Other:  0   Median :12.00
##                         Mean   :12.75
##                         3rd Qu.:14.00
##                         Max.   :20.00
The summary function provides an overview of the data. There are 366 people in this category, which is less than 10% of the US population and also less than 10% of the US Black population. Years of education range from a minimum of 2 to a maximum of 20.
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_2000 = 218, y_bar_2000 = 12.3486, s_2000 = 2.782
## n_2012 = 148, y_bar_2012 = 13.3378, s_2012 = 2.5032
## H0: mu_2000 = mu_2012
## HA: mu_2000 != mu_2012
## t = -3.5456, df = 147
## p_value = 5e-04
The sample mean for Black respondents is 12.35 years in 2000 and 13.34 years in 2012. The p-value (5e-04) is so small it is close to 0, thus we can reject the null hypothesis.
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_2000 = 218, y_bar_2000 = 12.3486, s_2000 = 2.782
## n_2012 = 148, y_bar_2012 = 13.3378, s_2012 = 2.5032
## 95% CI (2000 - 2012): (-1.5406 , -0.4378)
The 95% confidence interval for the difference in means (2000 - 2012) is (-1.5406 , -0.4378). We are 95% confident that the mean highest year of education for Black respondents was between 0.44 and 1.54 years higher in 2012 than in 2000.
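The interval for Black respondents can be reproduced from the printed summary statistics as the point estimate plus or minus a critical t value times the standard error. This is a sketch that assumes the degrees of freedom are the smaller sample size minus one, which matches the df = 147 reported in the hypothesis test output.

```r
# Summary statistics copied from the output above (Black respondents)
n_2000 <- 218; mean_2000 <- 12.3486; sd_2000 <- 2.782
n_2012 <- 148; mean_2012 <- 13.3378; sd_2012 <- 2.5032

# Point estimate and standard error of the difference (2000 - 2012)
diff_means <- mean_2000 - mean_2012
se <- sqrt(sd_2000^2 / n_2000 + sd_2012^2 / n_2012)

# Critical value for a 95% interval with conservative df (assumption)
df <- min(n_2000, n_2012) - 1
t_star <- qt(0.975, df)

# Lower and upper bounds of the interval
ci <- diff_means + c(-1, 1) * t_star * se
round(ci, 4)
```

Because the whole interval lies below 0, the 2012 mean is estimated to be higher than the 2000 mean, which is consistent with rejecting the null hypothesis.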