Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
load("C:/Users/hlo/Desktop/gss.RData")
dim(gss)
## [1] 57061   114

Part 1: Data

The General Social Survey (GSS) collects data through interviews administered to national samples using a standard survey questionnaire. It has been conducted since 1972 by NORC at the University of Chicago, as part of the National Data Program for the Social Sciences, with the support of the National Science Foundation. Each survey from 1972 to 2004 was an independently drawn sample of English-speaking persons 18 years of age or over, living in non-institutional arrangements within the United States. Starting in 2006, Spanish speakers were added to the sample.

The GSS asks a representative sample of US adults about their social, political, and economic attitudes, values, self-assessments, and behaviors. It also collects extensive background information about demographic and social characteristics that predict differences among Americans.

Scope of inference - generalizability: The population of interest is US adults aged 18 and over. Results can be generalized to this population, although survey biases could still be present and should be kept in mind.

Scope of inference - causality: The GSS is an observational study, not an experiment, so no causal conclusions can be drawn. The results can only show association.


Part 2: Research question

Is there an association between a respondent’s degree and the respondent’s family income status at age 16?

That is, do these data show that family income level is associated with the respondent’s educational degree level? I chose this research question because I am interested in which factors affect the likelihood of pursuing higher degree levels. Although this is an observational study and not an experiment, I would like to learn whether the statistics suggest that people from lower-income families have the same opportunity for higher education as people from higher-income families. My research question will not answer this directly, but it will give some insight into the topic.


Part 3: Exploratory data analysis

Variables of interest: degree and incom16.

degree: Respondent’s degree. A factor with 5 levels: Lt High School, High School, Junior College, Bachelor, Graduate. Don’t know and No answer responses appear as NA in the summary below.

incom16: Family income at age 16. A factor with 6 levels: Far Below Average, Below Average, Average, Above Average, Far Above Average, Lived In Institution.

*Because it produced expected cell counts of less than 5, the Lived In Institution level was later dropped. Only 5 levels of incom16 were used in the statistical inference.

# Create a summary of counts for degree #
summary(gss$degree)
## Lt High School    High School Junior College       Bachelor       Graduate 
##          11822          29287           3070           8002           3870 
##           NA's 
##           1010
# Create a summary of counts for family income #
summary(gss$incom16)
##    Far Below Average        Below Average              Average 
##                 3725                10692                21941 
##        Above Average    Far Above Average Lived In Institution 
##                 6575                  796                   10 
##                 NA's 
##                13322

Summary of counts (highest to lowest): For degree, most respondents are High School, followed by Lt High School, Bachelor, Graduate, and Junior College. For incom16, respondents most often described their family income at age 16 as Average, followed by Below Average, Above Average, Far Below Average, Far Above Average, and Lived In Institution.
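A quick way to check this kind of ranking is to sort the counts. The sketch below copies the degree counts from the summary output above into a named vector (the names `degree_counts` and `ranked` are chosen here for illustration and are not part of the original analysis), so it runs without the gss data.

```r
# Degree counts copied from the summary(gss$degree) output above (NA's excluded).
# `degree_counts` is an illustrative name, not an object from the original analysis.
degree_counts <- c(
  "Lt High School" = 11822,
  "High School"    = 29287,
  "Junior College" = 3070,
  "Bachelor"       = 8002,
  "Graduate"       = 3870
)

# Sort from the most common to the least common level
ranked <- sort(degree_counts, decreasing = TRUE)
names(ranked)
```

With the full data loaded, the same idea would apply to `table(gss$degree)`.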

# Clean the data: keep only rows where both degree and incom16 are observed #

gss_clean <- gss %>%
  filter(!is.na(degree), !is.na(incom16))

# Take a look at the counts of degree without NAs #
summary(gss_clean$degree)
## Lt High School    High School Junior College       Bachelor       Graduate 
##           9628          22283           2194           5933           2878
# Drop the "Lived In Institution" level, then create a bar graph with degree on the x axis, #
# filled by incom16, showing the proportion of each income level within each degree #

gss_clean <- droplevels(subset(gss_clean, incom16 != "Lived In Institution"))
ggplot(gss_clean, aes(x=degree, fill = incom16))+geom_bar(position = "fill")

A few interesting things stand out in this chart right away. Looking at the Far Below Average category from left to right (Lt High School to Graduate), its share of each bar decreases as education level rises. Above Average and Far Above Average show the inverse: from left to right, their share increases as education level rises.

In plain terms, respondents who reported Far Below Average family income at age 16 were more likely to report a degree level of high school or less. On the other hand, respondents who reported higher family income at age 16 were more likely to report a Bachelor or Graduate degree.

new <-table( gss_clean$degree, gss_clean$incom16) 
new
##                 
##                  Far Below Average Below Average Average Above Average
##   Lt High School              1507          2994    4408           583
##   High School                 1531          5229   12062          3145
##   Junior College               140           480    1124           415
##   Bachelor                     220          1108    2777          1635
##   Graduate                     158           610    1250           749
##                 
##                  Far Above Average
##   Lt High School               131
##   High School                  313
##   Junior College                35
##   Bachelor                     191
##   Graduate                     111
prop.table(new,2)
##                 
##                  Far Below Average Below Average    Average Above Average
##   Lt High School        0.42379078    0.28730448 0.20387586    0.08932128
##   High School           0.43053993    0.50177526 0.55788354    0.48184465
##   Junior College        0.03937008    0.04606084 0.05198649    0.06358204
##   Bachelor              0.06186727    0.10632377 0.12843994    0.25049793
##   Graduate              0.04443195    0.05853565 0.05781416    0.11475410
##                 
##                  Far Above Average
##   Lt High School        0.16773367
##   High School           0.40076825
##   Junior College        0.04481434
##   Bachelor              0.24455826
##   Graduate              0.14212548

The proportion table above gives the exact column proportions, and it shows the same pattern as the chart. Among respondents who placed their family income at age 16 in the Far Below Average category, about 85% ended up with a High School or Lt High School degree level. This share decreases as reported family income rises: Below Average 79%, Average 76%, Above Average 57%, Far Above Average 57%. These percentages are rounded.

In contrast, the share of respondents with higher degrees rises with family income at age 16. A Bachelor or Graduate degree was reported by about 11% of Far Below Average, 16% of Below Average, 19% of Average, 37% of Above Average, and 39% of Far Above Average respondents.

The tables suggest an association, but it has to be tested with statistical inference first.
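The combined shares can be recomputed directly from the contingency table. The sketch below rebuilds the table from the counts printed above, so it runs without the gss data; the names `counts`, `col_props`, `hs_or_less`, and `ba_or_more` are chosen here for illustration and are not part of the original analysis.

```r
# Observed counts copied from the contingency table above.
# `counts` is an illustrative stand-in for the table called `new` in the analysis.
counts <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate"),
    incom16 = c("Far Below Average", "Below Average", "Average",
                "Above Average", "Far Above Average")
  )
)

# Column proportions: each income level sums to 1 across degree levels
col_props <- prop.table(counts, margin = 2)

# Combined share with High School or less, and with Bachelor or more
hs_or_less <- colSums(col_props[c("Lt High School", "High School"), ])
ba_or_more <- colSums(col_props[c("Bachelor", "Graduate"), ])

# Rounded percentages, one column per income level
round(100 * rbind(hs_or_less, ba_or_more))
```

Summing rows of the column-proportion table keeps the comparison within each income level, which is the direction the research question asks about.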


Part 4: Inference

A chi-square test of independence is used to assess the relationship between two categorical variables.

  1. Hypotheses

Null hypothesis: Degree and family income level are independent. Degree level does not vary by family income status level.

Alternative hypothesis: Degree and family income level are dependent. Degree level does vary by family income status level.

  2. Check Conditions

Independence: The sample data are randomly sampled, and the sample is less than 10% of the population, so the independence and random sampling conditions are satisfied. Sample size: Each cell must have an expected count of at least 5. All cells meet this requirement once the “Lived In Institution” level is removed, which is why that level was dropped from incom16. Below is a table of expected counts.
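As a sketch of where expected counts come from: under independence, each cell's expected count is (row total × column total) / grand total. The snippet below rebuilds the observed table from the counts printed in Part 3 and checks the sample-size condition; the names `counts` and `expected` are chosen here for illustration and are not part of the original analysis.

```r
# Observed counts copied from the contingency table in Part 3.
# `counts` is an illustrative stand-in for the table called `new` in the analysis.
counts <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate"),
    incom16 = c("Far Below Average", "Below Average", "Average",
                "Above Average", "Far Above Average")
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Sample-size condition: every expected count should be at least 5
all(expected >= 5)
min(expected)
```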

  3. Inference
##                 
##                  Far Below Average Below Average   Average Above Average
##   Lt High School          797.5432     2337.2322  4849.179     1463.8820
##   High School            1846.5408     5411.3616 11227.238     3389.3059
##   Junior College          181.8362      532.8782  1105.591      333.7584
##   Bachelor                491.5545     1440.5200  2988.723      902.2430
##   Graduate                238.5253      699.0080  1450.269      437.8107
##                 
##                  Far Above Average
##   Lt High School         175.16345
##   High School            405.55354
##   Junior College          39.93647
##   Bachelor               107.95952
##   Graduate                52.38703
## 
##  Pearson's Chi-squared test
## 
## data:  new
## X-squared = 2847.1, df = 16, p-value < 2.2e-16
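The code that produced this output is not shown. A minimal sketch that should give the same kind of output, assuming the test was run with base R's `chisq.test()` on the contingency table, is below. The table is rebuilt from the counts printed in Part 3 so the snippet runs on its own; the names `counts` and `result` are chosen here for illustration.

```r
# Observed counts copied from the contingency table in Part 3.
# `counts` is an illustrative stand-in for the table called `new` in the analysis.
counts <- matrix(
  c(1507, 2994,  4408,  583, 131,
    1531, 5229, 12062, 3145, 313,
     140,  480,  1124,  415,  35,
     220, 1108,  2777, 1635, 191,
     158,  610,  1250,  749, 111),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree  = c("Lt High School", "High School", "Junior College",
                "Bachelor", "Graduate"),
    incom16 = c("Far Below Average", "Below Average", "Average",
                "Above Average", "Far Above Average")
  )
)

# Pearson's chi-squared test of independence on the 5 x 5 table
result <- chisq.test(counts)

result$expected   # expected counts under the null hypothesis
result$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
result            # test statistic, degrees of freedom, and p-value
```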
  4. Interpretation

The large chi-square statistic (2847.1 with 16 degrees of freedom) yields a very small p-value. The p-value is less than 2.2 × 10^-16, which is far below the significance level of 0.05, so we reject the null hypothesis. The data show an association between the respondent’s degree and the respondent’s family income status at age 16. Confidence intervals do not apply to the chi-square test of independence.
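To make the reported numbers concrete, the sketch below recomputes the degrees of freedom and the upper-tail probability from the statistic in the test output. The variable names are chosen here for illustration. R reports such p-values as "< 2.2e-16" because anything below the machine epsilon is printed as being under that threshold, so the figure is a bound rather than the p-value itself.

```r
# Degrees of freedom for a 5 x 5 table: (rows - 1) * (columns - 1)
df <- (5 - 1) * (5 - 1)

# Upper-tail probability of the chi-square statistic from the test output
p_value <- pchisq(2847.1, df = df, lower.tail = FALSE)
p_value

# Machine epsilon, the threshold below which R prints "< 2.2e-16"
.Machine$double.eps

# Decision at the 0.05 significance level
p_value < 0.05
```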

  5. Conclusion

Based on the inference, the null hypothesis was rejected. There is an association between the two variables, the respondent’s degree and family income status: degree level and family income status at age 16 are dependent.

The same story is seen in the proportion tables above. Among respondents who placed their family income at age 16 in the Far Below Average category, about 85% ended up with a High School or Lt High School degree level. This share decreases as reported family income rises: Below Average 79%, Average 76%, Above Average 57%, Far Above Average 57%. These percentages are rounded.

In contrast, the share of respondents with higher degrees rises with family income at age 16. A Bachelor or Graduate degree was reported by about 11% of Far Below Average, 16% of Below Average, 19% of Average, 37% of Above Average, and 39% of Far Above Average respondents.

Confounding variables may help explain this association. There may also be bias in how people answered the survey. The variable incom16 is hard to define concretely because it is subjective: respondents rate how their family compared to the “average” family, and each respondent may interpret “average” loosely, according to their own beliefs.

Nonetheless, it is very interesting that respondents who place their family at a higher income level also tend to report a higher degree level. This touches on the social question of whether access to education is equal for all. It would be interesting to investigate further how family income is associated with a child’s degree level, and many other studies on this topic are possible.