Setup

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss.


Part 1: Data

Respondents were selected to represent a cross-section of the country, each representing 50,000 households similar to their own. It’s unclear what the GSS means by “similar household,” but it implies some stratification. Within each group of 50,000 similar households, one household was randomly sampled, and one adult from that household was randomly selected to participate in the interview. Because the sample was random, the results can be generalized to the population of interest. But because the study is observational rather than experimental (e.g., no random assignment to control or experimental groups), no causal conclusions can be drawn from relationships between the variables.


Part 2: Research question

How has confidence in education changed over time among the respondents? How does this confidence vary within a year according to the age of the respondent? Since the GSS has data across many decades, I’m curious to see how confidence in education has changed as higher education has become more accessible. But since the ages and education levels of the respondents vary widely, their opinions likely vary as well. I’m curious to see whether there’s an overall trend that is consistent across age or degree received.


Part 3: Exploratory data analysis

Let’s first narrow the data frame down to the variables of interest and remove the rows with NAs. Then we can see how the proportion of respondents who have “A great deal” of confidence in education changes over the years.

## `summarise()` regrouping output by 'year' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
##    year      coneduc count totals_by_year prop_great_conf
## 1  1973 A Great Deal   531           1417       0.3747354
## 2  1974 A Great Deal   700           1410       0.4964539
## 3  1975 A Great Deal   441           1417       0.3112209
## 4  1976 A Great Deal   528           1405       0.3758007
## 5  1977 A Great Deal   590           1453       0.4060564
## 6  1978 A Great Deal   420           1461       0.2874743
## 7  1980 A Great Deal   422           1389       0.3038157
## 8  1982 A Great Deal   612           1753       0.3491158
## 9  1983 A Great Deal   448           1529       0.2930020
## 10 1984 A Great Deal   264            930       0.2838710
## 11 1986 A Great Deal   398           1415       0.2812721
## 12 1987 A Great Deal   636           1742       0.3650976
## 13 1988 A Great Deal   286            957       0.2988506
## 14 1989 A Great Deal   307            999       0.3073073
## 15 1990 A Great Deal   238            869       0.2738780
## 16 1991 A Great Deal   296            983       0.3011190
## 17 1993 A Great Deal   229           1017       0.2251721
## 18 1994 A Great Deal   486           1955       0.2485934
## 19 1996 A Great Deal   430           1878       0.2289670
## 20 1998 A Great Deal   502           1858       0.2701830
## 21 2000 A Great Deal   497           1838       0.2704026
## 22 2002 A Great Deal   226            900       0.2511111
## 23 2004 A Great Deal   237            864       0.2743056
## 24 2006 A Great Deal   544           1962       0.2772681
## 25 2008 A Great Deal   385           1335       0.2883895
## 26 2010 A Great Deal   364           1354       0.2688331
## 27 2012 A Great Deal   341           1318       0.2587253
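
The code behind this table is not shown in the report. As a rough sketch of the same kind of computation in base R, assuming a data frame with `year` and `coneduc` columns (the `toy` data frame below is hypothetical and its rows are invented, not GSS data):

```r
# Toy stand-in for the gss data frame; these rows are invented, not GSS data.
toy <- data.frame(
  year    = c(1973, 1973, 1973, 1973, 2012, 2012, 2012, 2012, NA),
  coneduc = c("A Great Deal", "Only Some", "A Great Deal", "Hardly Any",
              "Only Some", "A Great Deal", "Hardly Any", NA, "Only Some")
)

# Keep only rows where both variables of interest are present.
toy <- toy[complete.cases(toy[, c("year", "coneduc")]), ]

# Number of usable respondents in each year.
totals_by_year <- table(toy$year)

# Proportion answering "A Great Deal" within each year:
# the mean of a logical vector is the share of TRUE values.
prop_great_conf <- tapply(toy$coneduc == "A Great Deal", toy$year, mean)

print(totals_by_year)
print(prop_great_conf)
```

Swapping `"A Great Deal"` for `"Hardly Any"` would give the analogous proportion for respondents with hardly any confidence.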

The proportion of those with great confidence seems to have trended downward, from about 0.37 in 1973 to about 0.26 in 2012. Now let’s see how the proportion of those with hardly any confidence has changed over time.

##    year    coneduc count totals_by_year prop_hardly_conf
## 1  1973 Hardly Any   118           1417       0.08327452
## 2  1974 Hardly Any   115           1410       0.08156028
## 3  1975 Hardly Any   187           1417       0.13196895
## 4  1976 Hardly Any   226           1405       0.16085409
## 5  1977 Hardly Any   125           1453       0.08602891
## 6  1978 Hardly Any   222           1461       0.15195072
## 7  1980 Hardly Any   177           1389       0.12742981
## 8  1982 Hardly Any   228           1753       0.13006275
## 9  1983 Hardly Any   202           1529       0.13211249
## 10 1984 Hardly Any   100            930       0.10752688
## 11 1986 Hardly Any   153           1415       0.10812721
## 12 1987 Hardly Any   147           1742       0.08438576
## 13 1988 Hardly Any    83            957       0.08672936
## 14 1989 Hardly Any   106            999       0.10610611
## 15 1990 Hardly Any   107            869       0.12313003
## 16 1991 Hardly Any   134            983       0.13631740
## 17 1993 Hardly Any   184           1017       0.18092429
## 18 1994 Hardly Any   344           1955       0.17595908
## 19 1996 Hardly Any   346           1878       0.18423855
## 20 1998 Hardly Any   312           1858       0.16792250
## 21 2000 Hardly Any   290           1838       0.15778020
## 22 2002 Hardly Any   140            900       0.15555556
## 23 2004 Hardly Any   123            864       0.14236111
## 24 2006 Hardly Any   316           1962       0.16106014
## 25 2008 Hardly Any   196           1335       0.14681648
## 26 2010 Hardly Any   200           1354       0.14771049
## 27 2012 Hardly Any   218           1318       0.16540212

The graph appears to trend upward, but because the y-axis range is small, the proportion has not really changed that much. If we put the graph on the same scale as the previous one:

Let’s see what the distribution of age looks like for the first and last years in the data set.

The age distribution of greatly confident respondents in 1973 is almost uniform but slightly right skewed, while that of hardly confident respondents is more strongly right skewed.

The age distribution of those who have great confidence in education looks less uniform in 2012 than in 1973 and is more obviously right skewed, with a peak around age 25. The age distribution of those who have hardly any confidence in education looks almost bimodal. The data don’t really show any relationship between age and confidence.

* * *

Part 4: Inference

In the exploratory data analysis we calculated sample statistics (proportions calculated from the samples). Now we want insight into the population parameters.

The sample meets the conditions for inference. The first condition is independence, which relies on random sampling and on sampling less than 10% of the population. With 57,000 entries in total over 40 years, that’s around 1,500 per year, which is far less than 10% of the US population in any given year. As mentioned earlier, this was a random sample. We can therefore assume that whether one American has confidence in education is independent of whether another does. The second condition concerns sample size. With categorical variables and proportions, we check it with the success-failure condition: we need at least 10 successes and 10 failures in the sample. If responding with “A Great Deal” counts as a success, the 2012 sample has 341 successes and 977 failures, so the condition is easily met.
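
As a quick check of these conditions, here is a minimal sketch using the counts printed in the exploratory table for 1973 and 2012. The population figure is an assumption introduced only for illustration (a deliberately low bound of 200 million for the US in both years); it does not come from the GSS data.

```r
# Counts of "A Great Deal" responses and yearly totals, copied from the
# exploratory table for the first and last survey years.
great  <- c(`1973` = 531, `2012` = 341)
totals <- c(`1973` = 1417, `2012` = 1318)

successes <- great
failures  <- totals - great

# Success-failure condition: at least 10 of each outcome in every sample.
sf_ok <- successes >= 10 & failures >= 10

# 10% condition. ASSUMPTION (not from the source): the US population was
# at least 200 million in each of these years.
assumed_population <- 200e6
ten_percent_ok <- totals < 0.10 * assumed_population

print(rbind(successes, failures))
print(sf_ok)
print(ten_percent_ok)
```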

Though the confidence variable has 3 levels, we can collapse it to two, with “A Great Deal” being a success and anything else being a failure. This way we can compare the proportion of successes between two groups (the same variable measured in 2 different years).
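
A minimal sketch of this recoding in base R, assuming the three response labels that appear in the printed output. The `coneduc` vector below is a made-up example rather than GSS data, and the report's own recoding code is not shown.

```r
# Hypothetical responses using the three levels that appear in the output.
coneduc <- c("Only Some", "A Great Deal", "Hardly Any", "A Great Deal")

# "A Great Deal" counts as a success; every other answer is a fail.
confidence <- ifelse(coneduc == "A Great Deal", "success", "fail")

# Store as a factor with "success" listed first.
confidence <- factor(confidence, levels = c("success", "fail"))

print(table(confidence))
```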

Now let’s construct a 95% confidence interval for the proportion of Americans with great confidence in education in 2012.

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 1 x 2
##    year count
##   <int> <int>
## 1  2012  1318
##   year age    sex  race         degree      coneduc confidence
## 1 2012  21   Male White    High School    Only Some       fail
## 2 2012  42   Male Other    High School    Only Some       fail
## 3 2012  49 Female White    High School    Only Some       fail
## 4 2012  70 Female Black       Bachelor A Great Deal    success
## 5 2012  35 Female White Junior College   Hardly Any       fail
## 6 2012  24 Female Other Lt High School A Great Deal    success
## Single categorical variable, success: success
## n = 1318, p-hat = 0.2587
## 95% CI: (0.2351 , 0.2824)

With a margin of error of 0.0236, we are 95% confident that in 2012 the proportion of the US population that had great confidence in education was between 0.2351 and 0.2824.
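
The interval can be checked by hand with the normal approximation for a single proportion, using n = 1318 from the output above and the 341 “A Great Deal” responses recorded for 2012 in the exploratory table. This is a sketch of the arithmetic, not necessarily how the original inference call computes it internally.

```r
# Sample size and number of successes for 2012, taken from the printed output.
n <- 1318
x <- 341

p_hat <- x / n

# Standard error of a sample proportion.
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a 95% interval under the normal approximation.
z_star <- qnorm(0.975)

margin <- z_star * se
ci <- c(lower = p_hat - margin, upper = p_hat + margin)

print(round(p_hat, 4))
print(round(margin, 4))
print(round(ci, 4))
```

Rounded to four decimals, these values agree with the p-hat, margin of error, and interval reported above.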

Now we want to see whether there is convincing evidence that the proportion of Americans with great confidence in education changed from 1973 to 2012.

The null hypothesis is that the proportion of those with great confidence is the same in 1973 and 2012. The alternative hypothesis is that the two proportions are different.

## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##    year count
##   <int> <int>
## 1  1973   709
## 2  2012   659
## Warning: Explanatory variable was numerical, it has been converted
##               to categorical. In order to avoid this warning, first convert
##               your explanatory variable to a categorical variable using the
##               as.factor() function
## Warning: Missing null value, set to 0
## Response variable: categorical (2 levels, success: success)
## Explanatory variable: categorical (2 levels) 
## n_1973 = 709, p_hat_1973 = 0.3794
## n_2012 = 659, p_hat_2012 = 0.2564
## H0: p_1973 =  p_2012
## HA: p_1973 != p_2012
## z = 4.8707
## p_value = < 0.0001

We used a two-sided test because the proportion could have gone up or down. Since the p-value is less than 0.0001, we reject the null hypothesis: the data provide convincing evidence that the proportion of Americans who have great confidence in education changed between 1973 and 2012. If the two proportions were actually equal, a difference as large as the one observed in the samples would be very unlikely to occur by chance.
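
The test statistic can be reproduced by hand with a pooled two-proportion z-test. The sample sizes come from the output above. The success counts (269 and 169) are not printed there; they are back-calculated here as the only whole numbers that round to the reported p-hat values, so treat them as an assumption.

```r
# Sample sizes from the printed output.
n1 <- 709   # 1973
n2 <- 659   # 2012

# ASSUMPTION: success counts inferred from p_hat_1973 = 0.3794 and
# p_hat_2012 = 0.2564; they are not shown in the original output.
x1 <- 269
x2 <- 169

p1 <- x1 / n1
p2 <- x2 / n2

# Under H0 the two proportions are equal, so pool them for the standard error.
p_pool <- (x1 + x2) / (n1 + n2)
se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

z <- (p1 - p2) / se

# Two-sided p-value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))

print(round(z, 4))
print(p_value)

# Cross-check with base R: without continuity correction, the chi-squared
# statistic from prop.test() equals z squared.
pt <- prop.test(x = c(x1, x2), n = c(n1, n2), correct = FALSE)
print(sqrt(unname(pt$statistic)))
```

The hand-computed z should agree with the reported z = 4.8707 up to rounding, and the p-value falls well below 0.0001.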