Respondents were selected to represent a cross-section of the country, with each respondent representing 50,000 households similar to their own. It’s unclear what the GSS means by “similar household”, but the phrase implies some form of stratification. Within each group of 50,000 similar households, a representative household was randomly sampled, and one adult from that household was randomly selected to participate in the interview. Because respondents were randomly sampled, the results can be generalized to the population of interest. Because the study is observational rather than experimental (e.g., there was no random assignment to control or experimental groups), no causal conclusions can be drawn from relationships between the variables.
How has confidence in education changed over time among the respondents? How does this confidence vary within a year according to the age of the respondent? Since the GSS has data spanning several decades, I’m curious to see how confidence in education has changed as higher education has become more accessible. Because respondents differ widely in age and education, their opinions likely differ as well, so I’m also curious whether there is an overall trend that is consistent across age or degree received.
Let’s first narrow the data frame down to the variables of interest and remove rows with missing values. Then we want to see how the proportion of respondents who have “A Great Deal” of confidence in education changes over the years.
gss_edu <- gss %>%
  select(year, age, sex, race, degree, coneduc) %>%
  filter(!is.na(year), !is.na(age), !is.na(sex), !is.na(race), !is.na(degree), !is.na(coneduc))
#group by year and confidence level and display counts
edu_conf <- gss_edu %>% group_by(year, coneduc) %>% summarize(count = n())
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
#create a vector of the total number of respondents by year
totals_by_year <- (gss_edu %>% group_by(year) %>% summarize(count = n()))$count
## `summarise()` ungrouping output (override with `.groups` argument)
#create a new data frame for just those with great confidence and add a column with the proportion to the table
great_conf <- edu_conf %>% filter(coneduc == "A Great Deal")
great_conf <- data.frame(great_conf, totals_by_year) %>% mutate(prop_great_conf = count/totals_by_year)
great_conf
## year coneduc count totals_by_year prop_great_conf
## 1 1973 A Great Deal 531 1417 0.3747354
## 2 1974 A Great Deal 700 1410 0.4964539
## 3 1975 A Great Deal 441 1417 0.3112209
## 4 1976 A Great Deal 528 1405 0.3758007
## 5 1977 A Great Deal 590 1453 0.4060564
## 6 1978 A Great Deal 420 1461 0.2874743
## 7 1980 A Great Deal 422 1389 0.3038157
## 8 1982 A Great Deal 612 1753 0.3491158
## 9 1983 A Great Deal 448 1529 0.2930020
## 10 1984 A Great Deal 264 930 0.2838710
## 11 1986 A Great Deal 398 1415 0.2812721
## 12 1987 A Great Deal 636 1742 0.3650976
## 13 1988 A Great Deal 286 957 0.2988506
## 14 1989 A Great Deal 307 999 0.3073073
## 15 1990 A Great Deal 238 869 0.2738780
## 16 1991 A Great Deal 296 983 0.3011190
## 17 1993 A Great Deal 229 1017 0.2251721
## 18 1994 A Great Deal 486 1955 0.2485934
## 19 1996 A Great Deal 430 1878 0.2289670
## 20 1998 A Great Deal 502 1858 0.2701830
## 21 2000 A Great Deal 497 1838 0.2704026
## 22 2002 A Great Deal 226 900 0.2511111
## 23 2004 A Great Deal 237 864 0.2743056
## 24 2006 A Great Deal 544 1962 0.2772681
## 25 2008 A Great Deal 385 1335 0.2883895
## 26 2010 A Great Deal 364 1354 0.2688331
## 27 2012 A Great Deal 341 1318 0.2587253
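Pairing `great_conf` with `totals_by_year` by position works here because every year has an “A Great Deal” row. If a year were missing that level, the totals would silently shift onto the wrong rows. The sketch below shows one way to compute each proportion against its own year’s total instead. It assumes dplyr is loaded, and `toy_conf` is a hypothetical stand-in for `edu_conf`, not the real GSS counts.

```r
library(dplyr)

# Hypothetical toy counts standing in for edu_conf (not the real GSS data)
toy_conf <- data.frame(
  year    = c(1973, 1973, 1973, 2012, 2012, 2012),
  coneduc = rep(c("A Great Deal", "Only Some", "Hardly Any"), times = 2),
  count   = c(50, 40, 10, 25, 60, 15)
)

# Divide each count by the total for its own year, so rows cannot be misaligned
toy_props <- toy_conf %>%
  group_by(year) %>%
  mutate(total = sum(count), prop = count / total) %>%
  ungroup()

# Keep only the level of interest, as in the analysis above
toy_great <- toy_props %>% filter(coneduc == "A Great Deal")
toy_great
```

The same pattern applied to `edu_conf` should give the proportions for every confidence level in one step.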
#make a graph of the proportions over time
ggplot(data = great_conf, aes(x = year, y = prop_great_conf)) + geom_line()
The proportion of respondents with great confidence in education has trended downward, from about 37% in 1973 to about 26% in 2012. Now let’s see how the proportion of those with hardly any confidence has changed over time.
#create a new data frame for just those with hardly any confidence and add a column with the proportion to the table
hardly_conf <- edu_conf %>% filter(coneduc == "Hardly Any")
hardly_conf <- data.frame(hardly_conf, totals_by_year) %>% mutate(prop_hardly_conf = count/totals_by_year)
hardly_conf
## year coneduc count totals_by_year prop_hardly_conf
## 1 1973 Hardly Any 118 1417 0.08327452
## 2 1974 Hardly Any 115 1410 0.08156028
## 3 1975 Hardly Any 187 1417 0.13196895
## 4 1976 Hardly Any 226 1405 0.16085409
## 5 1977 Hardly Any 125 1453 0.08602891
## 6 1978 Hardly Any 222 1461 0.15195072
## 7 1980 Hardly Any 177 1389 0.12742981
## 8 1982 Hardly Any 228 1753 0.13006275
## 9 1983 Hardly Any 202 1529 0.13211249
## 10 1984 Hardly Any 100 930 0.10752688
## 11 1986 Hardly Any 153 1415 0.10812721
## 12 1987 Hardly Any 147 1742 0.08438576
## 13 1988 Hardly Any 83 957 0.08672936
## 14 1989 Hardly Any 106 999 0.10610611
## 15 1990 Hardly Any 107 869 0.12313003
## 16 1991 Hardly Any 134 983 0.13631740
## 17 1993 Hardly Any 184 1017 0.18092429
## 18 1994 Hardly Any 344 1955 0.17595908
## 19 1996 Hardly Any 346 1878 0.18423855
## 20 1998 Hardly Any 312 1858 0.16792250
## 21 2000 Hardly Any 290 1838 0.15778020
## 22 2002 Hardly Any 140 900 0.15555556
## 23 2004 Hardly Any 123 864 0.14236111
## 24 2006 Hardly Any 316 1962 0.16106014
## 25 2008 Hardly Any 196 1335 0.14681648
## 26 2010 Hardly Any 200 1354 0.14771049
## 27 2012 Hardly Any 218 1318 0.16540212
#make a graph of the proportions over time
ggplot(data = hardly_conf, aes(x = year, y = prop_hardly_conf)) + geom_line()
The graph appears to trend upward, from about 8% in 1973 to about 17% in 2012, but because the y-axis range is small the change looks more dramatic than it is. Putting this graph on the same scale as the previous one gives a fairer comparison.
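One way to put both series on a common scale is to draw them in a single plot with a fixed y-axis. The sketch below assumes ggplot2 is loaded and uses only a few proportions copied from the two tables above, so the line shapes are illustrative rather than the full series. The names `conf_props` and `same_scale_plot` are hypothetical.

```r
library(ggplot2)

# A few proportions copied from the two tables above (not the full series)
conf_props <- data.frame(
  year  = rep(c(1973, 1993, 2012), times = 2),
  level = rep(c("A Great Deal", "Hardly Any"), each = 3),
  prop  = c(0.3747, 0.2252, 0.2587, 0.0833, 0.1809, 0.1654)
)

# One panel and one y-axis: both lines are drawn against the same 0 to 0.5 range
same_scale_plot <- ggplot(conf_props, aes(x = year, y = prop, linetype = level)) +
  geom_line() +
  ylim(0, 0.5)

same_scale_plot
```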
Let’s see what the distribution of age looks like for the first and last years in the data set.
#the first year in the data set
conf_1973 <- gss_edu %>% filter(year == 1973)
#to graph the distribution of age by the two extreme confidence levels
greatconf_73 <- conf_1973 %>% filter(coneduc == "A Great Deal")
ggplot(data = greatconf_73, aes(x = age)) + geom_histogram(binwidth = 5) + ylim(0,60)
hardlyconf_73 <- conf_1973 %>% filter(coneduc == "Hardly Any")
ggplot(data = hardlyconf_73, aes(x = age)) + geom_histogram(binwidth = 5) + ylim(0,60)
The age distribution of greatly confident respondents in 1973 is almost uniform but slightly right skewed, while that of hardly confident respondents is more strongly right skewed.
#the last year in the data set
conf_2012 <- gss_edu %>% filter(year == 2012)
#to graph the distribution of age by the two extreme confidence levels
greatconf_12 <- conf_2012 %>% filter(coneduc == "A Great Deal")
ggplot(data = greatconf_12, aes(x = age)) + geom_histogram(binwidth = 5) + ylim(0,50)
hardlyconf_12 <- conf_2012 %>% filter(coneduc == "Hardly Any")
ggplot(data = hardlyconf_12, aes(x = age)) + geom_histogram(binwidth = 5) + ylim(0,50)
The age distribution of those who have great confidence in education looks less uniform in 2012 than in 1973 and is more clearly right skewed, with a peak around age 25. The age distribution of those who have hardly any confidence in education looks almost bimodal. These plots do not show any clear relationship between age and confidence.
* * *
In the exploratory data analysis we calculated sample statistics (proportions calculated from the samples). Now we want insight into the population parameters.
The sample meets the conditions for inference. The first condition is independence, which relies on random sampling and on the sample being less than 10% of the population. With about 57,000 entries in total over 40 years, that is around 1,500 per year, far less than 10% of the US population in any given year. As noted earlier, this was a random sample, so we can assume that one American’s confidence in education is independent of another’s. The second condition concerns sample size. For categorical variables and proportions we check it with the success-failure condition: the sample needs at least 10 successes and 10 failures. If responding “A Great Deal” counts as a success, the 2012 sample has 341 successes and 977 failures, so the condition is easily met.
Though the confidence question has three response options, we can reduce it to two by treating “A Great Deal” as a success and anything else as a failure. This lets us compare the proportion of successes between the two levels of a second categorical variable (the survey year, 1973 or 2012).
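The success-failure check can be written out directly from the counts in the `great_conf` table above (531 of 1417 in 1973 and 341 of 1318 in 2012). This is a minimal base R sketch, and the helper name `meets_success_failure` is hypothetical.

```r
# Counts of "A Great Deal" responses and yearly totals from the great_conf table
successes <- c(`1973` = 531, `2012` = 341)
totals    <- c(`1973` = 1417, `2012` = 1318)
failures  <- totals - successes   # "Only Some" or "Hardly Any"

# Hypothetical helper: TRUE when there are at least 10 successes and 10 failures
meets_success_failure <- function(s, f, minimum = 10) {
  s >= minimum & f >= minimum
}

failures
meets_success_failure(successes, failures)
```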
Now let’s construct a 95% confidence interval for the proportion of Americans with great confidence in education in 2012.
#since there are three options on the survey for confidence in education but we only need success and failures, we consider great confidence a success and anything else a failure.
greatconfissuccess12 <- gss_edu %>% filter(year == 2012) %>% mutate(confidence = ifelse(coneduc == "A Great Deal", "success", "fail"))
greatconfissuccess12 %>% group_by(year) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
## year count
## <int> <int>
## 1 2012 1318
## year age sex race degree coneduc confidence
## 1 2012 21 Male White High School Only Some fail
## 2 2012 42 Male Other High School Only Some fail
## 3 2012 49 Female White High School Only Some fail
## 4 2012 70 Female Black Bachelor A Great Deal success
## 5 2012 35 Female White Junior College Hardly Any fail
## 6 2012 24 Female Other Lt High School A Great Deal success
## Single categorical variable, success: success
## n = 1318, p-hat = 0.2587
## 95% CI: (0.2351 , 0.2824)
With a margin of error of 0.0236, we are 95% confident that the proportion of the US population with great confidence in education in 2012 was between 0.2351 and 0.2824.
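The interval and margin of error can be checked by hand with the usual normal-approximation formula for a single proportion. This base R sketch assumes only the 2012 counts reported above (341 successes out of 1318 respondents).

```r
# 2012 counts from the output above
n     <- 1318
p_hat <- 341 / n

# Standard error of a sample proportion and the 95% critical value
se     <- sqrt(p_hat * (1 - p_hat) / n)
z_star <- qnorm(0.975)

# Margin of error and the resulting interval
margin_of_error <- z_star * se
ci <- c(lower = p_hat - margin_of_error, upper = p_hat + margin_of_error)

round(margin_of_error, 4)
round(ci, 4)
```

Rounded to four decimals, these values should agree with the interval printed above.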
Now we want to see whether there is convincing evidence that the proportion of Americans with great confidence in education changed between 1973 and 2012.
The null hypothesis is that the proportion of those with great confidence was the same in 1973 and 2012. The alternative hypothesis is that the two proportions differ.
years <- c(1973, 2012)
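#use %in% to keep every row from either year; year == years would recycle the two-element vector and keep only alternating rows, which is how the halved counts in the output below (709 and 659 rather than 1417 and 1318) arose
greatconfissuccess73and12 <- gss_edu %>% filter(year %in% years) %>% mutate(confidence = ifelse(coneduc == "A Great Deal", "success", "fail"))
greatconfissuccess73and12 %>% group_by(year) %>% summarise(count = n())
## `summarise()` ungrouping output (override with `.groups` argument)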
## # A tibble: 2 x 2
## year count
## <int> <int>
## 1 1973 709
## 2 2012 659
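The counts of 709 and 659 are about half of the 1417 and 1318 respondents listed for those years in the earlier tables. That is the pattern produced when a column is compared against a two-element vector with `==`, because R recycles the shorter vector along the longer one. The base R sketch below uses a hypothetical toy vector, not the GSS data, to show the difference from `%in%`.

```r
# Hypothetical toy vector of survey years (not the real GSS data)
year  <- c(1973, 1973, 1973, 1973, 2012, 2012, 2012, 2012)
years <- c(1973, 2012)

# `==` recycles `years` as 1973, 2012, 1973, 2012, ... so each row is
# compared with only one of the two values
keep_eq <- year == years

# %in% tests each row for membership in the whole set
keep_in <- year %in% years

sum(keep_eq)
sum(keep_in)
```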
## Warning: Explanatory variable was numerical, it has been converted
## to categorical. In order to avoid this warning, first convert
## your explanatory variable to a categorical variable using the
## as.factor() function
## Warning: Missing null value, set to 0
## Response variable: categorical (2 levels, success: success)
## Explanatory variable: categorical (2 levels)
## n_1973 = 709, p_hat_1973 = 0.3794
## n_2012 = 659, p_hat_2012 = 0.2564
## H0: p_1973 = p_2012
## HA: p_1973 != p_2012
## z = 4.8707
## p_value = < 0.0001
We did a two-sided test because the proportion could have gone up or down. Since the p-value is less than 0.0001, we reject the null hypothesis: the data provide convincing evidence that the proportion of Americans who have great confidence in education changed between 1973 and 2012. If the two proportions were truly equal, a difference this large would be very unlikely to arise from sampling variation alone.
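As a cross-check, the two-proportion z statistic can be computed by hand from the full-year counts in the earlier tables (531 of 1417 in 1973 and 341 of 1318 in 2012). This base R sketch uses the standard pooled-proportion formula, which is an assumption about how the printed test was computed. Because it uses every respondent from both years, its z value will differ from the one printed above, though it points to the same conclusion.

```r
# Full-year counts of "A Great Deal" responses and totals from the earlier tables
x <- c(`1973` = 531, `2012` = 341)
n <- c(`1973` = 1417, `2012` = 1318)

p_hat  <- x / n
p_pool <- sum(x) / sum(n)   # pooled proportion under H0: p_1973 = p_2012

# Standard error of the difference under H0, then the z statistic
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z  <- unname((p_hat["1973"] - p_hat["2012"]) / se)

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

z
p_value
```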