Frank D. Evans
Data Analysis and Statistical Inference
Duke University
Research Question: Is there a statistical relationship between the highest level of education attained by a person and the amount of confidence that person has in the scientific community?
The push for broader access to education is a common goal across many facets of modern American society. Additionally, much of America’s innovation has come from advancements in science that make our lives better every day. The purpose of this study is to understand whether there is a statistical relationship between level of education and confidence in the scientific community. Confidence in the scientific community is important because that community is commonly understood to be a primary driver of American innovation and economic activity.
library(ggplot2)
load(url("http://bit.ly/dasi_gss_data"))
The data for this analysis comes from the General Social Survey, which has been administered by the National Opinion Research Center since 1972. The Research Center solicits American residents to take a survey covering a wide range of opinions, along with a number of demographic and descriptive questions about the respondent. The respondents are chosen half by full random sample and half by clustered random sample to provide what is called a full probability sample, although there have been some small changes to sampling techniques over the years of administration. Full documentation on sampling design can be found here: [http://publicdata.norc.org:41000/gss/documents//BOOK/GSS_Codebook_AppendixA.pdf]
For the purpose of this study, only respondents from the 2012 survey are analyzed. Each case or observation in the data is the complete set of responses given by a single respondent. The study is entirely observational, meaning that no causal links can be drawn from analysis of the data (due to lack of experimental control). Additionally, the results are largely generalizable to the American adult (18+ years of age) public because probability sampling techniques were used to select potential respondents.
There are two main variables of interest in this study: the level of attained education (which has been binned into 6 factor categories) and the confidence in the scientific community (which is a categorical response of ordered factors).
test_gss <- gss[,c('caseid','year','educ','consci')]
test_gss <- subset(test_gss, year == 2012)
sum(is.na(test_gss$educ)) / length(test_gss$educ)
## [1] 0.001013
For the analysis the column set of the entire survey is restricted only to the variables of interest. Then we restrict for the year we would like to analyze (2012) and check for any biases that may come from non-response. It appears that only a very small number of people failed to respond with their education level, considerably less than 1%.
sum(is.na(test_gss$consci)) / length(test_gss$consci)
## [1] 0.3597
Considerably more people did not respond to the confidence in the scientific community question, which means we will need to use caution when applying these results to the larger population due to a possible non-response bias in the ~36% of respondents who did not answer this question.
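The two non-response checks above can also be computed for every column in one pass. The following is a minimal sketch on an invented toy data frame (the `toy` values are assumptions for illustration only, not GSS data):

```r
# Toy stand-in for the survey subset; the values are invented for illustration.
toy <- data.frame(
  educ   = c(12, 16, NA, 14, 12, 18),
  consci = c("A Great Deal", NA, "Only Some", NA, "Hardly Any", "Only Some")
)

# is.na() on a data frame returns a logical matrix; colMeans() averages
# TRUE/FALSE as 1/0, giving the share of missing values per column.
na_rate <- colMeans(is.na(toy))
print(round(na_rate, 3))

# Number of rows that would survive a complete-case filter.
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to the real subset, the same `colMeans(is.na(...))` call would report both non-response proportions side by side.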
test_gss <- test_gss[complete.cases(test_gss),]
nrow(test_gss)
## [1] 1264
For the purpose of the study the set is reduced to only cases where the questions were answered, providing a sample size of 1,264 respondents.
# consci is categorical, so geom_bar() (one bar per level) is used;
# geom_histogram() expects a continuous x variable.
ggplot(data = test_gss, aes(x = consci)) +
  geom_bar(fill = 'dark orange', color = 'black', size = 1.25) +
  ggtitle('Histogram of Confidence in Scientific Community (2012 Respondents)') +
  xlab('Confidence Level Response') +
  ylab('Number of Respondents')
The response has three levels of confidence: ‘A Great Deal’ and ‘Only Some’ occur at similar prevalence, while ‘Hardly Any’ is comparatively rare.
ggplot(data = test_gss, aes(x = educ)) +
  geom_histogram(binwidth = 1, fill = 'blue', color = 'black', size = 1.0) +
  ggtitle('Histogram of Education Level Attained (2012 Respondents)') +
  xlab('Grade Level of Education Attained') +
  ylab('Number of Respondents')
The education level distribution is highly left skewed, and contains spikes at year 12 (Graduated High School), year 14 (Associate’s Degree), and year 16 (Bachelor’s Degree).
To better treat education level as a factor category, some ranges are coalesced into easily understandable groupings that provide a better analytical comparison and avoid the very small group sizes seen in the long tail across very low levels of educational attainment. Education levels are then grouped into an ordered factor of 6 categories and the histogram is redrawn.
test_gss$grade_bin <- 'None'
test_gss$grade_bin[test_gss$educ <= 8] <- '8th or Less'
test_gss$grade_bin[(test_gss$educ > 8) & (test_gss$educ < 12)] <- 'Some High School'
test_gss$grade_bin[test_gss$educ == 12] <- 'High School Graduate'
test_gss$grade_bin[(test_gss$educ > 12) & (test_gss$educ < 16)] <- 'Some College'
test_gss$grade_bin[test_gss$educ == 16] <- 'College Graduate'
test_gss$grade_bin[test_gss$educ > 16] <- 'Post Graduate'
grade_levels <- c('8th or Less','Some High School','High School Graduate','Some College','College Graduate','Post Graduate')
test_gss$grade_bin <- ordered(test_gss$grade_bin, levels = grade_levels)
# grade_bin is an ordered factor, so geom_bar() is used to count each level.
ggplot(data = test_gss, aes(x = grade_bin)) +
  geom_bar(fill = 'blue', color = 'black', size = 1.0) +
  ggtitle('Histogram of Education Level Attained (2012 Respondents Binned)') +
  xlab('Factor Level of Education Attained') +
  ylab('Number of Respondents')
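As an alternative to the sequence of indexed assignments used for the binning, the same six groups could be produced in one call with base R's `cut()`. This is only a sketch: the helper name `bin_education` is hypothetical, and it assumes years of education are whole numbers, as they appear in the data sample.

```r
# Same six labels as the ordered factor used in the analysis.
grade_levels <- c('8th or Less', 'Some High School', 'High School Graduate',
                  'Some College', 'College Graduate', 'Post Graduate')

# Hypothetical helper (not part of the original analysis).
# cut() intervals are right-closed by default, so the breaks give:
# (-Inf,8], (8,11], (11,12], (12,15], (15,16], (16,Inf)
bin_education <- function(years) {
  cut(years,
      breaks = c(-Inf, 8, 11, 12, 15, 16, Inf),
      labels = grade_levels,
      ordered_result = TRUE)
}

example_years <- c(0, 8, 9, 11, 12, 13, 15, 16, 17, 20)
binned <- bin_education(example_years)
print(data.frame(years = example_years, grade_bin = binned))
```

Because the breaks sit on whole years, this matches the indexed assignments only for integer inputs; fractional years would need the breaks reconsidered.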
agg_gss <- test_gss
agg_gss$count <- 1
agg_gss <- aggregate(count ~ consci + grade_bin, data = agg_gss, FUN = sum)
ggplot(data = agg_gss) +
  geom_tile(aes(x = grade_bin, y = consci, fill = count)) +
  ggtitle('Tile Plot of Cross-Factor Responses') +
  xlab('Factor Level of Education Attained') +
  ylab('Confidence Level Response')
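The per-cell counts built above with a helper column and `aggregate()` can also be obtained with base R's `table()`. The sketch below uses an invented toy data frame (an assumption for illustration, not GSS data) to show the cross-tabulation and its long form:

```r
# Toy responses; invented for illustration only.
toy <- data.frame(
  consci    = c("A Great Deal", "Only Some", "Only Some",
                "A Great Deal", "Hardly Any", "Only Some"),
  grade_bin = c("College Graduate", "College Graduate", "Some College",
                "College Graduate", "Some College", "Some College")
)

# Two-way contingency table: rows are education groups, columns are confidence levels.
cross_tab <- table(toy$grade_bin, toy$consci)
print(cross_tab)

# Long form (one row per cell), comparable to the aggregate() result.
long_counts <- as.data.frame(cross_tab, stringsAsFactors = FALSE)
names(long_counts) <- c("grade_bin", "consci", "count")
print(long_counts)
```

One difference worth noting: `table()` keeps cells with a count of zero, whereas the `aggregate()` approach only returns combinations that occur in the data.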
Preliminary analysis suggests that there may be a relationship between Confidence Level and Education Attainment for those that responded ‘A Great Deal’ to the Confidence Level question. However, in the tile plot above the other two Confidence Levels look closer to normally distributed across education levels. This seems to indicate that if a relationship does exist, it is more likely to exist among those that place a large amount of confidence in the Scientific Community, who skew much higher in education attainment than the other categories of confidence. To verify that a relationship exists among these categories, it is necessary to compute inferential statistics on the data collected.
The purpose of this inferential analysis is to investigate if there appears to be a relationship between Level of Education Attained and Confidence in the Scientific Community.
Null Hypothesis: Level of Education Attained and Confidence in the Scientific Community are independent; Confidence in the Scientific Community levels do not vary by Level of Education Attained.
Alternative Hypothesis: Level of Education Attained and Confidence in the Scientific Community are dependent; Confidence in the Scientific Community levels do vary by Level of Education Attained.
Evaluation of Conditions. Since the sample observations are randomly sampled, they meet the independence standard. Each person only responds once, which fulfills the requirement of sampling without replacement. The total sample for 2012 being evaluated is 1,264 people, which is considerably less than 10% of the American adult population for that year. Additionally, each person is only able to respond with one value in the candidate categories for both variables, so each observation is only able to contribute to one cell of the cross-table. Lastly, the chi-square test calls for an expected count of at least 5 in each cell. Every cell meets this except ‘8th or Less’ / ‘Hardly Any’, whose expected count (4.877) falls marginally short, so the sample size requirement is treated as approximately met.
Inference Methods. Since the analysis is looking at two categorical variables, with greater than two categories on one side (in this case both variables have more than two levels), a chi-square test of independence will be performed to evaluate the hypotheses. The hypotheses will be tested at both the 5% and 1% significance levels (corresponding to 95% and 99% confidence).
chi_test <- chisq.test(x = test_gss$grade_bin, y = test_gss$consci)
class(chi_test)
## [1] "htest"
To perform a chi-square test, the data is passed into the R function to create a chi-square test object. The resulting object is an ‘htest’ list object with 9 indexed components. To interpret the test, the object is indexed for the relevant components.
chi_test$statistic
## X-squared
## 52.53
This index accesses the computed chi-square statistic for the test.
chi_test$parameter
## df
## 10
The test statistic is evaluated against a chi-square distribution with the appropriate degrees of freedom, accessible in the chi-square object via the ‘parameter’ index.
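The degrees of freedom reported above follow from the dimensions of the cross-table: (rows - 1) × (columns - 1). As a small sketch (variable names are illustrative), the calculation and the corresponding chi-square critical values at the two significance levels look like this:

```r
# Dimensions of the cross-table used in the test.
n_education_levels  <- 6  # rows: binned education groups
n_confidence_levels <- 3  # columns: confidence responses

# Degrees of freedom for a chi-square test of independence.
df <- (n_education_levels - 1) * (n_confidence_levels - 1)
print(df)

# Critical values: the statistic must exceed these to reject the null
# hypothesis at the 5% and 1% significance levels respectively.
critical_05 <- qchisq(0.95, df = df)
critical_01 <- qchisq(0.99, df = df)
print(round(c(critical_05, critical_01), 3))
```

The reported statistic of 52.53 lies well beyond both critical values.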
chi_test$expected
##                        test_gss$consci
## test_gss$grade_bin     A Great Deal Only Some Hardly Any
##   8th or Less                 28.04     34.08      4.877
##   Some High School            56.92     69.18      9.899
##   High School Graduate       134.34    163.29     23.364
##   Some College               140.20    170.42     24.383
##   College Graduate            94.58    114.97     16.449
##   Post Graduate               74.91     91.06     13.028
The test object also stores the expected count for each cell of the cross-tabulation, computed from the row and column totals under the assumption that the null hypothesis is true.
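Under independence, each expected count is (row total × column total) / grand total. The sketch below re-enters the observed counts reported in this analysis as a plain matrix (the matrix construction is for illustration; the analysis itself works from `test_gss`) and reproduces the expected table:

```r
# Observed counts as reported by chi_test$observed in this analysis.
observed <- matrix(
  c( 23,  40,  4,
     49,  73, 14,
    104, 180, 37,
    145, 167, 23,
    100, 118,  8,
    108,  65,  6),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    grade_bin = c('8th or Less', 'Some High School', 'High School Graduate',
                  'Some College', 'College Graduate', 'Post Graduate'),
    consci    = c('A Great Deal', 'Only Some', 'Hardly Any')
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 3))

# Cells whose expected count falls below the usual threshold of 5.
print(which(expected < 5, arr.ind = TRUE))
```

Only the ‘8th or Less’ / ‘Hardly Any’ cell falls below 5, and only narrowly.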
chi_test$observed
##                        test_gss$consci
## test_gss$grade_bin     A Great Deal Only Some Hardly Any
##   8th or Less                    23        40          4
##   Some High School               49        73         14
##   High School Graduate          104       180         37
##   Some College                  145       167         23
##   College Graduate              100       118          8
##   Post Graduate                 108        65          6
These expected counts are compared to the observed counts, displayed earlier in the tile plot.
chi_test$residuals
##                        test_gss$consci
## test_gss$grade_bin     A Great Deal Only Some Hardly Any
##   8th or Less               -0.9519    1.0135    -0.3969
##   Some High School          -1.0495    0.4588     1.3036
##   High School Graduate      -2.6179    1.3074     2.8211
##   Some College               0.4052   -0.2616    -0.2801
##   College Graduate           0.5569    0.2829    -2.0833
##   Post Graduate              3.8227   -2.7307    -1.9472
From the difference between the observed and expected counts in each cell, a matrix of Pearson residuals is calculated: (observed - expected) / sqrt(expected). The sum of the squared residuals is the chi-square statistic.
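To make the link between the residual matrix and the test statistic concrete, the following sketch recomputes both by hand from the observed counts reported in this analysis (re-entered as a plain matrix for illustration):

```r
# Observed counts as reported by chi_test$observed in this analysis.
observed <- matrix(
  c( 23,  40,  4,
     49,  73, 14,
    104, 180, 37,
    145, 167, 23,
    100, 118,  8,
    108,  65,  6),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    grade_bin = c('8th or Less', 'Some High School', 'High School Graduate',
                  'Some College', 'College Graduate', 'Post Graduate'),
    consci    = c('A Great Deal', 'Only Some', 'Hardly Any')
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: signed, standardized distance of each cell from expectation.
residuals <- (observed - expected) / sqrt(expected)
print(round(residuals, 4))

# The chi-square statistic is the sum of the squared Pearson residuals.
statistic <- sum(residuals^2)
print(round(statistic, 2))
```

The largest residuals sit in the ‘Post Graduate’ and ‘High School Graduate’ rows, which is where the observed counts depart most from independence.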
chi_test$p.value
## [1] 9.119e-08
Comparing the chi-square statistic against a chi-square distribution with 10 degrees of freedom yields a p-value for the observed data. The p-value is very small, far less than one tenth of one percent (< 0.1%). This means that the probability of observing data at least as extreme, given that the null hypothesis is true, is less than 0.1%. The null hypothesis is rejected in favor of the alternative hypothesis at both the 5% and 1% significance levels (95% and 99% confidence).
Simulation Analysis. Though the number of observations is sufficient to perform the inferential test without a simulation, a simulation can provide additional perspective on the relationship of the hypotheses with the data being analyzed. A second chi-square test can be performed with a Monte Carlo simulation (Patefield method simulation, Hope method p-value derivation). When the p-value is computed by Monte Carlo simulation, the test no longer relies on the chi-square distribution, so no degrees of freedom are reported.
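The p-value is the upper-tail area of the chi-square distribution beyond the test statistic. A minimal sketch, using the rounded statistic and degrees of freedom reported above (so the result agrees with the reported p-value only to rounding):

```r
# Values reported by the test object above (statistic is rounded).
statistic <- 52.53
df <- 10

# Upper-tail probability: chance of a statistic at least this large
# if the null hypothesis of independence were true.
p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
print(p_value)

# Decision at the 5% and 1% significance levels.
print(c(reject_at_05 = p_value < 0.05, reject_at_01 = p_value < 0.01))
```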
chi_test_sim <- chisq.test(x = test_gss$grade_bin, y = test_gss$consci, simulate.p.value = TRUE)
chi_test_sim$p.value
## [1] 0.0004998
The resulting p-value from the simulation is still below 0.1%. The simulation does not enlarge the data set; it repeatedly generates tables with the same row and column totals under the null hypothesis and counts how often their statistic is at least as large as the observed one. The value 0.0004998 is the smallest p-value the simulation can report with R's default of 2,000 replicates (1/2001), meaning none of the simulated tables produced a statistic as extreme as the observed one. The null hypothesis is again rejected in favor of the alternative at both the 5% and 1% significance levels.
The findings from this study suggest that a statistical relationship between Level of Education Attained and Confidence in the Scientific Community likely exists in the American adult public. The findings do not establish what that relationship is, any increase or decrease in Confidence in the Scientific Community attributable to a specific Level of Education, or any causal links between the two variables analyzed. Such conclusions are outside the parameters of this inferential analysis.
Future Research Questions. One particular point of interest in the data is the considerable number of non-responses to the Confidence in the Scientific Community question from respondents who did answer the Level of Education Attained question. Analyzing possible statistical patterns in this non-response, and testing whether those who skipped the question differ systematically from those who answered it, may be an interesting point of future research. Additionally, future research could consider a longitudinal analysis across multiple years to see whether trends have changed over the last several years.
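The simulated p-value can be reproduced approximately from the observed counts alone. The sketch below re-enters the counts reported in this analysis as a matrix; the seed value and the explicit `B = 2000` (R's default replicate count) are choices made here for illustration:

```r
# Observed counts as reported by chi_test$observed in this analysis.
observed <- matrix(
  c( 23,  40,  4,
     49,  73, 14,
    104, 180, 37,
    145, 167, 23,
    100, 118,  8,
    108,  65,  6),
  nrow = 6, byrow = TRUE
)

set.seed(1)   # arbitrary seed, only to make the sketch repeatable
B <- 2000     # number of simulated tables (chisq.test default)

# Monte Carlo test: tables are simulated with the observed margins held fixed.
sim <- chisq.test(observed, simulate.p.value = TRUE, B = B)
print(sim$p.value)

# Smallest p-value the simulation can report: (1 + 0) / (B + 1).
min_p <- 1 / (B + 1)
print(min_p)

# No degrees of freedom are reported for the simulated test.
print(is.na(sim$parameter))
```

Raising `B` lowers the floor on the reported p-value, at the cost of a longer run time.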
Administration: General Social Survey. National Opinion Research Center. University of Chicago. 1972-2012. [Modified by Duke University for Academic Course Usage].
Data Link: http://doi.org/10.3886/ICPSR34802.v1 (Persistent Link).
Codebook: Available at https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html
Data Sample
test_gss[1:25,]
##       caseid year educ       consci            grade_bin
## 55089  55089 2012   12 A Great Deal High School Graduate
## 55090  55090 2012   12 A Great Deal High School Graduate
## 55092  55092 2012   16 A Great Deal     College Graduate
## 55094  55094 2012   15    Only Some         Some College
## 55095  55095 2012   11    Only Some     Some High School
## 55097  55097 2012   17 A Great Deal        Post Graduate
## 55100  55100 2012   12    Only Some High School Graduate
## 55102  55102 2012    4 A Great Deal          8th or Less
## 55103  55103 2012   13    Only Some         Some College
## 55105  55105 2012   13    Only Some         Some College
## 55106  55106 2012   12    Only Some High School Graduate
## 55108  55108 2012    0   Hardly Any          8th or Less
## 55109  55109 2012   10 A Great Deal     Some High School
## 55110  55110 2012   14    Only Some         Some College
## 55111  55111 2012   16    Only Some     College Graduate
## 55112  55112 2012   12    Only Some High School Graduate
## 55113  55113 2012   17 A Great Deal        Post Graduate
## 55116  55116 2012   16    Only Some     College Graduate
## 55118  55118 2012   16 A Great Deal     College Graduate
## 55119  55119 2012   14 A Great Deal         Some College
## 55120  55120 2012   19 A Great Deal        Post Graduate
## 55121  55121 2012   16 A Great Deal     College Graduate
## 55122  55122 2012   14    Only Some         Some College
## 55123  55123 2012   18 A Great Deal        Post Graduate
## 55125  55125 2012   12    Only Some High School Graduate