Social class and degree

Emiliano La Rocca

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("gss.Rdata")
dim(gss)
## [1] 57061   114

Part 1: Data

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting. Respondents are individual persons selected from rural and metropolitan areas: English- or Spanish-speaking adults aged 18 or older, noninstitutionalized and living in the United States.

The dataset consists of 57,061 observations (cases) and 114 variables covering different aspects of the interviewee's social life: work, beliefs, family, community. For further information about the variables, please refer to the GSS Code Book.

For this particular study the following variables will be taken into consideration:

  • degree (highest educational degree achieved): categorical, ordinal variable (5 levels: Less than High School, High School, Associate/Junior College, Bachelor’s, Graduate)

  • class (self-reported social class): categorical variable (5 levels: Lower Class, Working Class, Middle Class, Upper Class, No Class)

The data come from a survey, not from an experiment, so the study is observational: it can establish only an association between the variables examined, not causation. However, because the GSS draws random samples of US residents, the findings can be generalized to the adult US resident population. The analysis uses a subset of the original data frame: first the columns degree and class are selected into dt2, then the rows with NA in either column are removed, as shown below:

# Keep only the two variables of interest
dt2 <- gss[, c("degree", "class")]
# Drop rows with a missing value in either column
dt <- subset(dt2, !is.na(degree) & !is.na(class))
str(dt)
## 'data.frame':    52718 obs. of  2 variables:
##  $ degree: Factor w/ 5 levels "Lt High School",..: 4 1 2 4 2 2 2 4 2 2 ...
##  $ class : Factor w/ 5 levels "Lower Class",..: 3 3 2 3 2 3 3 2 2 2 ...
   summary(dt$class)
##   Lower Class Working Class  Middle Class   Upper Class      No Class 
##          3027         23977         23997          1716             1
   summary(dt$degree)
## Lt High School    High School Junior College       Bachelor       Graduate 
##          11030          27589           2910           7542           3647

Part 2: Research question

My research question is: what is the relationship between social class and level of education? Two sub-questions follow from it:

  • Does social class influence the level of education attained?
  • Do members of the working class and the lower class have the same chance of reaching a high level of education as members of the middle class and the upper class?

Part 3: Exploratory data analysis

# Drop the single "No Class" respondent, then remove the now-empty factor level
data <- dt %>% filter(class != "No Class")
data <- droplevels(data)
table(data)
##                 class
## degree           Lower Class Working Class Middle Class Upper Class
##   Lt High School        1322          5721         3732         255
##   High School           1450         14407        11194         538
##   Junior College         120          1396         1332          62
##   Bachelor               109          1919         5021         493
##   Graduate                26           534         2718         368
# Bar plots of the two categorical variables
par(mar=c(7,4,3,2))
barplot(table(data$degree), las=2, main="Highest Degree", col=rainbow(5))

par(mar=c(7,4,3,2))
barplot(table(data$class), las=2, main="Social Class", col=rainbow(4))

# Column proportions: distribution of degree within each social class
temp <- as.data.frame(prop.table(table(data), 2))
temp
##            degree         class        Freq
## 1  Lt High School   Lower Class 0.436736042
## 2     High School   Lower Class 0.479022134
## 3  Junior College   Lower Class 0.039643211
## 4        Bachelor   Lower Class 0.036009250
## 5        Graduate   Lower Class 0.008589362
## 6  Lt High School Working Class 0.238603662
## 7     High School Working Class 0.600867498
## 8  Junior College Working Class 0.058222463
## 9        Bachelor Working Class 0.080035034
## 10       Graduate Working Class 0.022271343
## 11 Lt High School  Middle Class 0.155519440
## 12    High School  Middle Class 0.466474976
## 13 Junior College  Middle Class 0.055506938
## 14       Bachelor  Middle Class 0.209234488
## 15       Graduate  Middle Class 0.113264158
## 16 Lt High School   Upper Class 0.148601399
## 17    High School   Upper Class 0.313519814
## 18 Junior College   Upper Class 0.036130536
## 19       Bachelor   Upper Class 0.287296037
## 20       Graduate   Upper Class 0.214452214
ggplot(temp, aes(x = class, y = Freq, fill = degree)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Social Class", y = "Relative Frequency", fill = "Degree")

The proportion of respondents with less than a high school education decreases as we move to higher social classes, while the proportion of graduate-degree holders increases. There appears to be a positive association between highest degree and social class.
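The trend described above can be condensed into a single number per class. The following is a minimal sketch, not part of the original analysis: it re-enters the counts printed by table(data) by hand (the name counts is an assumption introduced here) and computes the share of respondents in each class holding a Bachelor's or Graduate degree.

```r
# Counts copied by hand from the table(data) output in this section
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Share of each class with a Bachelor's or Graduate degree
higher <- colSums(counts[c("Bachelor", "Graduate"), ]) / colSums(counts)
round(higher, 3)
```

The share rises at every step from Lower Class to Upper Class, which matches the pattern in the stacked bar chart.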


Part 4: Inference

The null hypothesis

  • H0 : Highest Degree and Social Class are independent of each other

The alternative hypothesis

  • HA : Highest Degree and Social Class are not independent of each other and are related in some way

Since we are dealing with two categorical variables, each with more than two levels, we use the chi-square test of independence to assess whether the two variables are related.

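Conditions for the chi-square test:

Independence : The GSS uses random sampling and the total number of observations is less than 10% of the population. Also, each case contributes to only one cell.

Sample size : Each cell should have at least 5 expected cases. With more than 52,000 observations this holds comfortably: the smallest expected count in the table is about 95.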
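The chi-square test also relies on every cell having a large enough expected count (at least 5 is the usual rule of thumb). A minimal sketch of that check follows, assuming the counts are re-entered by hand from the table printed in Part 3; the names counts and expected are introduced here for illustration only.

```r
# Counts copied by hand from the table(data) output in Part 3
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# The smallest expected count must be at least 5
min(expected)
```

With the original data frame at hand, chisq.test(data$degree, data$class)$expected should return the same matrix.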

chisq.test(data$degree, data$class)
## 
##  Pearson's Chi-squared test
## 
## data:  data$degree and data$class
## X-squared = 5830.3, df = 12, p-value < 2.2e-16

The p-value is below 2.2e-16, effectively zero. Hence we reject the null hypothesis and conclude that there is a relationship between highest degree obtained and social class. However, since no random assignment was used, we cannot infer causation from this study. A confidence interval is not applicable here: the chi-square test of independence on a 5x4 contingency table (df = (5 - 1) x (4 - 1) = 12) does not estimate a single population parameter.
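With a sample this large, even a weak association produces a tiny p-value, so it can help to look at an effect size as well. The sketch below is an addition, not part of the original analysis: it reruns the test on the counts re-entered from Part 3 and computes Cramér's V by hand (the names counts, res and cramers_v are assumptions introduced here).

```r
# Counts copied by hand from the table(data) output in Part 3
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Same Pearson chi-square test, run on the contingency table directly
res <- chisq.test(counts)
res$statistic   # chi-square statistic
res$parameter   # degrees of freedom: (5 - 1) * (4 - 1)

# Cramer's V = sqrt(chi-square / (n * (min(rows, cols) - 1)))
n <- sum(counts)
cramers_v <- sqrt(unname(res$statistic) / (n * (min(dim(counts)) - 1)))
cramers_v
```

Cramér's V ranges from 0 (no association) to 1 (perfect association); here it comes out at roughly 0.19, so the association is clearly present but far from deterministic.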

Our exploratory data analysis suggested a positive association between social class and highest degree. The chi-square test says nothing about direction, so a follow-up analysis testing the direction and strength of that association could complement this one.
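One possible form of such a follow-up is a rank correlation. The sketch below is an assumption-laden illustration rather than the author's method: it treats both variables as ordinal (Lower < Working < Middle < Upper for class, which the variable description does not state), rebuilds one row per respondent from the counts in Part 3, and computes Spearman's rho. All object names are introduced here for illustration.

```r
# Counts copied by hand from the table(data) output in Part 3
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Long format: one row per (degree, class) cell with its frequency
long <- as.data.frame(as.table(counts))

# Repeat each cell Freq times to recover one row per respondent
idx <- rep(seq_len(nrow(long)), long$Freq)

# Integer codes follow the level order given in dimnames (assumed ordinal)
degree_rank <- as.integer(long$degree[idx])
class_rank  <- as.integer(long$class[idx])

# Spearman rank correlation; ties are handled through average ranks
rho <- cor(degree_rank, class_rank, method = "spearman")
rho
```

A positive rho would agree with the pattern seen in the exploratory analysis; because both variables have few levels and many ties, the value is best read as a rough summary.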

Conclusion

This observational study suggests that members of the lower class and the working class in the United States do not have the same opportunity to achieve a higher education as members of the middle class and the upper class.

Is the American Dream only for the middle class and the upper class?

The US government should pursue social policies that help members of the working class and the lower class achieve higher levels of education.

How have US governments pursued policies of this kind for the lower class and the working class over the last 30 years?

References

General Social Survey Cumulative File, 1972-2012 Coursera Extract. Modified for Data Analysis and Statistical Inference course (Duke University).

The R dataset can be downloaded from http://bit.ly/dasi_gss_data.

General Social Survey (GSS) FAQ. URL: http://publicdata.norc.org:41000/gssbeta/faqs.html. Accessed 10/27/2015.