library(ggplot2)
library(dplyr)
library(statsr)
load("C:/Users/hlo/Desktop/gss.RData")
dim(gss)
## [1] 57061 114
The General Social Survey (GSS) collects data through interviews administered to national samples using a standard survey questionnaire. The survey, part of the National Data Program for the Social Sciences, has been conducted since 1972 by NORC at the University of Chicago with the support of the National Science Foundation. Each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons aged 18 or over, living in non-institutional arrangements within the United States. Starting in 2006, Spanish speakers were added to the sample.
The GSS asks a representative sample of US adults about their social, political, and economic attitudes, values, self-assessments, and behaviors. It also collects extensive background information on demographic and social characteristics that help explain differences among Americans.
Scope of Study: The population of interest is US adults aged 18 and over. Because respondents are randomly sampled, the results can be generalized to this population, although survey biases (for example, non-response) are possible and should be kept in mind.
Scope of inference - causality: No causal conclusions can be drawn. The GSS is an observational study, not an experiment, so it can show association but not causation.
Is there an association between a respondent's degree and their family income status at age 16?
That is, does this data set show that family income level at age 16 is associated with the respondent's highest educational degree? I chose this as my research question because I am interested in which factors and traits affect the likelihood of pursuing a higher degree. Although this is an observational study and not an experiment, I would like to learn whether the statistics suggest that people from lower-income families have the same opportunity for higher education as people from higher-income families. My research question will not answer this directly, but it will give some insight into the topic.
Variables of Interest: degree and incom16.
degree: Respondent's highest degree. Factor with 5 levels: Lt High School, High School, Junior College, Bachelor, Graduate. Don't know and No answer responses do not appear as levels in the data and seem to be recorded as NA.
incom16: Family income at age 16. Factor with 6 levels: Far Below Average, Below Average, Average, Above Average, Far Above Average, Lived In Institution.
*Because its expected cell counts were below 5, the Lived In Institution level was later dropped. Only 5 levels of incom16 were used in the statistical inference.
# Create a summary of counts for degree #
summary(gss$degree)
## Lt High School    High School Junior College       Bachelor       Graduate
##          11822          29287           3070           8002           3870
##           NA's
##           1010
# Create a summary of counts for family income #
summary(gss$incom16)
##    Far Below Average        Below Average              Average
##                 3725                10692                21941
##        Above Average    Far Above Average Lived In Institution
##                 6575                  796                   10
##                 NA's
##                13322
Summary of counts (highest to lowest): for degree, the most common response is High School, followed by Lt High School, Bachelor, Graduate, and Junior College. For incom16, the most common description of family income at age 16 is Average, followed by Below Average, Above Average, Far Below Average, Far Above Average, and Lived In Institution.
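The ordering of categories can be checked directly instead of by eye. The sketch below is only illustrative: it re-enters the counts printed in the summaries above by hand (the vector names `degree_counts` and `incom16_counts` are made up for this example) and sorts them from most to least frequent.

```r
# Counts copied from the summary() output above (NA's left out).
degree_counts <- c(
  "Lt High School" = 11822,
  "High School"    = 29287,
  "Junior College" = 3070,
  "Bachelor"       = 8002,
  "Graduate"       = 3870
)
incom16_counts <- c(
  "Far Below Average"    = 3725,
  "Below Average"        = 10692,
  "Average"              = 21941,
  "Above Average"        = 6575,
  "Far Above Average"    = 796,
  "Lived In Institution" = 10
)

# Order each variable from the most to the least frequent category.
degree_sorted  <- sort(degree_counts, decreasing = TRUE)
incom16_sorted <- sort(incom16_counts, decreasing = TRUE)

degree_sorted
incom16_sorted
```

With the gss data loaded, the same ordering should be available without retyping anything, for example with sort(table(gss$degree), decreasing = TRUE).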
# Clean the data: drop rows with NA in either variable (no grouping is needed for this) #
gss_clean <- gss %>%
  filter(!is.na(degree), !is.na(incom16))
# Take a look at the counts of degree without NAs #
summary(gss_clean$degree)
## Lt High School    High School Junior College       Bachelor       Graduate
##           9628          22283           2194           5933           2878
# Create a stacked bar chart with degree on the x axis and the proportion of respondents on the y axis, filled by incom16 #
gss_clean <- droplevels(subset(gss_clean, incom16 != "Lived In Institution"))
ggplot(gss_clean, aes(x = degree, fill = incom16)) + geom_bar(position = "fill")
A few things stand out from this chart right away. Moving from left to right (Lt High School to Graduate), the share of respondents in the Far Below Average category shrinks as education level rises. For Above Average and Far Above Average the pattern is broadly reversed: their share grows as education level rises.
In simpler terms, respondents who reported a Far Below Average family income at age 16 most often reported High School or less as their own degree level. Respondents who reported a higher family income at age 16 were much more likely to report a Bachelor or Graduate degree.
new <- table(gss_clean$degree, gss_clean$incom16)
new
##
##                  Far Below Average Below Average Average Above Average
##   Lt High School              1507          2994    4408           583
##   High School                 1531          5229   12062          3145
##   Junior College               140           480    1124           415
##   Bachelor                     220          1108    2777          1635
##   Graduate                     158           610    1250           749
##
##                  Far Above Average
##   Lt High School               131
##   High School                  313
##   Junior College                35
##   Bachelor                     191
##   Graduate                     111
prop.table(new, 2)
##
##                  Far Below Average Below Average    Average Above Average
##   Lt High School        0.42379078    0.28730448 0.20387586    0.08932128
##   High School           0.43053993    0.50177526 0.55788354    0.48184465
##   Junior College        0.03937008    0.04606084 0.05198649    0.06358204
##   Bachelor              0.06186727    0.10632377 0.12843994    0.25049793
##   Graduate              0.04443195    0.05853565 0.05781416    0.11475410
##
##                  Far Above Average
##   Lt High School        0.16773367
##   High School           0.40076825
##   Junior College        0.04481434
##   Bachelor              0.24455826
##   Graduate              0.14212548
The table above shows the exact proportions within each income category, and it supports the pattern seen in the chart. Among respondents who said their family income at age 16 was Far Below Average, about 85% ended up with a High School or Lt High School degree level. This share falls as reported family income rises: about 79% for Below Average, 76% for Average, 57% for Above Average, and 57% for Far Above Average. These are rounded percentages.
In contrast, higher degrees are more common among respondents with higher family income at age 16. A Bachelor or Graduate degree was reported by about 11% of Far Below Average, 16% of Below Average, 19% of Average, 37% of Above Average, and 39% of Far Above Average respondents.
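The combined shares quoted above come from adding two rows of the column-proportion table. As a minimal sketch of that arithmetic, the block below re-enters the observed counts from the contingency table by hand (the matrix name `counts` is an assumption for this example, standing in for the `new` table) and sums the relevant rows within each income category.

```r
# Observed counts copied from the degree-by-incom16 table above.
income_levels <- c("Far Below Average", "Below Average", "Average",
                   "Above Average", "Far Above Average")
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")
counts <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(degree = degree_levels, incom16 = income_levels)
)

# Column proportions: share of each degree level within an income category.
col_props <- prop.table(counts, margin = 2)

# Combined shares per income category, in percent.
hs_or_less   <- 100 * colSums(col_props[c("Lt High School", "High School"), ])
bach_or_grad <- 100 * colSums(col_props[c("Bachelor", "Graduate"), ])

round(rbind(hs_or_less, bach_or_grad), 1)
```

Working from the counts avoids compounding rounding error, which is easy to introduce when adding proportions that have already been rounded.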
The tables suggest an association between the two variables, but this has to be checked with statistical inference.
The chi-square test of independence is used to evaluate the relationship between two categorical variables.
Null hypothesis: Degree and family income level are independent. Degree level does not vary by family income status.
Alternative hypothesis: Degree and family income level are dependent. Degree level does vary by family income status.
Conditions: Respondents are randomly sampled, and the sample is less than 10% of the population, so the independence and random sampling conditions are satisfied. Sample size: each cell must have an expected count of at least 5. All cells meet this requirement once the "Lived In Institution" level is dropped from incom16, which was done for this reason. Below is a table of expected counts.
##
##                  Far Below Average Below Average   Average Above Average
##   Lt High School          797.5432     2337.2322  4849.179     1463.8820
##   High School            1846.5408     5411.3616 11227.238     3389.3059
##   Junior College          181.8362      532.8782  1105.591      333.7584
##   Bachelor                491.5545     1440.5200  2988.723      902.2430
##   Graduate                238.5253      699.0080  1450.269      437.8107
##
##                  Far Above Average
##   Lt High School         175.16345
##   High School            405.55354
##   Junior College          39.93647
##   Bachelor               107.95952
##   Graduate                52.38703
##
##  Pearson's Chi-squared test
##
## data:  new
## X-squared = 2847.1, df = 16, p-value < 2.2e-16
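The code that produced the output above is not shown in the report. One plausible way to obtain it, offered here as a sketch and not as the original code, is base R's chisq.test(). The observed counts are re-entered by hand so the example runs on its own; `counts` and `expected_by_hand` are names introduced only for this illustration.

```r
# Observed counts copied from the degree-by-incom16 table above.
income_levels <- c("Far Below Average", "Below Average", "Average",
                   "Above Average", "Far Above Average")
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")
counts <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(degree = degree_levels, incom16 = income_levels)
)

# Pearson's chi-squared test of independence on the 5 x 5 table.
test <- chisq.test(counts)

# Expected counts under independence; each should be at least 5.
test$expected
all(test$expected >= 5)

# The same expected counts by hand: row total * column total / grand total.
expected_by_hand <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Test statistic, degrees of freedom, and p-value.
test
```

The degrees of freedom follow from the table shape: (5 - 1) * (5 - 1) = 16 for five degree levels and five income levels.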
A large chi-square statistic yields a very small p-value. The p-value is below 2.2 * 10^-16, far less than the significance level of 0.05, so we reject the null hypothesis. The data provide strong evidence of an association between a respondent's degree and the respondent's family income status at age 16. Confidence intervals do not apply to the chi-square test of independence.
In conclusion, the null hypothesis was rejected: the two variables, respondent's degree and family income status at age 16, are associated, meaning that degree level depends on family income status at age 16. Because the study is observational, this association does not show that family income causes the difference in degree level.
The same story is seen in the proportion tables above. Among respondents who said their family income at age 16 was Far Below Average, about 85% ended up with a High School or Lt High School degree level. This share falls as reported family income rises: about 79% for Below Average, 76% for Average, 57% for Above Average, and 57% for Far Above Average. These are rounded percentages.
In contrast, higher degrees are more common among respondents with higher family income at age 16. A Bachelor or Graduate degree was reported by about 11% of Far Below Average, 16% of Below Average, 19% of Average, 37% of Above Average, and 39% of Far Above Average respondents.
Confounding variables may explain part of this association. There may also be bias in how people answered the survey: incom16 is hard to define concretely because it is subjective, asking respondents to rate how their family compared with the "average" family. Respondents can interpret this loosely, according to what they believe average to be.
Nonetheless, it is very interesting that respondents who place their family at a higher income level also tend to report a higher degree level. This ties into the social question of whether access to education is equal for all. It would be interesting to investigate further how family income is associated with a child's eventual degree level, and many other studies of this question are possible.