Part 1: Data
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting. The interviewees are individuals selected from rural and metropolitan areas with the following characteristics: English- or Spanish-speaking, 18 years of age or older, noninstitutionalized, and living in the United States.
The dataset is composed of 57,061 observations (cases) and 114 variables covering different aspects of the interviewees' social lives: work, beliefs, family, community. For further information about the variables, please refer to the GSS Code Book.
For this particular study the following variables will be taken into consideration:
DEGREE (highest educational degree achieved): categorical, ordinal variable (5 levels: Less than High School, High School, Associate/Junior College, Bachelor’s, Graduate)
CLASS (social class): categorical variable (5 levels: Lower Class, Working Class, Middle Class, Upper Class, No Class)
The data come from a survey and not from an experiment, so the study is observational: it can establish only an association between the variables examined, not causation. However, GSS data are random samples of US residents, so the study’s findings can be generalized to the US resident population. The analysis uses a subset of the original data frame: first the columns degree and class are selected into dt2, and then the rows with NA values in either column are removed, as shown below:
dt2 <- gss[, c("degree", "class")]
# Drop rows where either variable is missing
dt <- subset(dt2, !is.na(degree) & !is.na(class))
str(dt)
## 'data.frame': 52718 obs. of 2 variables:
## $ degree: Factor w/ 5 levels "Lt High School",..: 4 1 2 4 2 2 2 4 2 2 ...
## $ class : Factor w/ 5 levels "Lower Class",..: 3 3 2 3 2 3 3 2 2 2 ...
##   Lower Class Working Class  Middle Class   Upper Class      No Class
##          3027         23977         23997          1716             1
## Lt High School    High School Junior College       Bachelor       Graduate
##          11030          27589           2910           7542           3647
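The filtering step above depends on how R treats missing values in comparisons. The following is a minimal sketch on a made-up data frame (the `toy` object and its values are hypothetical, not GSS data). It illustrates that comparing against the string "NA" removes missing rows only because `subset()` discards rows where the condition evaluates to `NA`, and that `complete.cases()` states the same intent explicitly:

```r
# Hypothetical toy data, not taken from the GSS
toy <- data.frame(
  degree = factor(c("High School", NA, "Bachelor", "Graduate")),
  class  = factor(c("Middle Class", "Working Class", NA, "Upper Class"))
)

# A comparison involving a missing value yields NA,
# and subset() drops rows whose condition is NA
by_string <- subset(subset(toy, degree != "NA"), class != "NA")

# Explicit alternative: keep rows with no missing value in any column
by_cases <- toy[complete.cases(toy), ]

nrow(by_string)
nrow(by_cases)
```

Both approaches keep the same rows here; the explicit form would also behave sensibly if a level were literally named "NA".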
Part 3: Exploratory data analysis
data<-dt %>% filter(class != "No Class")
data <- droplevels(data)
table(data)
##                 class
## degree           Lower Class Working Class Middle Class Upper Class
##   Lt High School        1322          5721         3732         255
##   High School           1450         14407        11194         538
##   Junior College         120          1396         1332          62
##   Bachelor               109          1919         5021         493
##   Graduate                26           534         2718         368
# Bar plots of each variable's distribution
par(mar=c(7,4,3,2))
barplot(table(data$degree), las=2, main="Highest Degree", col=rainbow(5))

par(mar=c(7,4,3,2))
barplot(table(data$class), las=2, main="Social Class", col=rainbow(4))

temp <- as.data.frame(prop.table(table(data), 2))
temp
## degree class Freq
## 1 Lt High School Lower Class 0.436736042
## 2 High School Lower Class 0.479022134
## 3 Junior College Lower Class 0.039643211
## 4 Bachelor Lower Class 0.036009250
## 5 Graduate Lower Class 0.008589362
## 6 Lt High School Working Class 0.238603662
## 7 High School Working Class 0.600867498
## 8 Junior College Working Class 0.058222463
## 9 Bachelor Working Class 0.080035034
## 10 Graduate Working Class 0.022271343
## 11 Lt High School Middle Class 0.155519440
## 12 High School Middle Class 0.466474976
## 13 Junior College Middle Class 0.055506938
## 14 Bachelor Middle Class 0.209234488
## 15 Graduate Middle Class 0.113264158
## 16 Lt High School Upper Class 0.148601399
## 17 High School Upper Class 0.313519814
## 18 Junior College Upper Class 0.036130536
## 19 Bachelor Upper Class 0.287296037
## 20 Graduate Upper Class 0.214452214
ggplot() + geom_bar(aes(y = Freq, x = class, fill = degree), data = temp,
stat="identity") + coord_flip()+
labs(x ="Social Class", y = "Relative Frequency" , fill = "Degree")

We can see the proportion of respondents with less than a high school degree decreasing as we move to higher social classes, while the proportion of graduates increases. There seems to be a positive relationship between highest degree and social class.
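The trend described above can be checked directly from the counts. This sketch re-enters the degree-by-class counts from the table shown earlier (the object names `counts` and `col_props` are illustrative) and computes the within-class proportions for the lowest and highest degree levels:

```r
# Counts copied from the degree-by-class table shown earlier
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# margin = 2 divides each column by its total, so each class sums to 1
col_props <- prop.table(counts, margin = 2)

# Shares of the lowest and highest degree levels within each class
round(col_props[c("Lt High School", "Graduate"), ], 3)
```

Reading each row from left to right shows how the share of a given degree level changes from Lower Class to Upper Class.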
Part 4: Inference
The null hypothesis
- H0 : Highest Degree and Social Class are independent of each other
The alternative hypothesis
- HA : Highest Degree and Social Class are not independent of each other and are related in some way
Since we are dealing with two multilevel categorical variables, we will use the chi-square test of independence to see whether there is a relationship between them.
Conditions for the chi-square test:
Independence: We have used random sampling, and the total number of observations is less than 10% of the population. Also, each case contributes to only one cell.
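Besides independence, the chi-square test is usually also checked against a sample-size condition: each cell should have an expected count of at least 5. That condition is a common rule of thumb and is not stated in the text above, so the following is only a sketch of how it could be verified. It re-enters the counts from the degree-by-class table (object names are illustrative) and computes the expected counts under independence:

```r
# Counts copied from the degree-by-class table shown earlier
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Rule-of-thumb check (assumed threshold of 5 expected cases per cell)
min(expected)
all(expected >= 5)
```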
chisq.test(data$degree, data$class)
##
## Pearson's Chi-squared test
##
## data: data$degree and data$class
## X-squared = 5830.3, df = 12, p-value < 2.2e-16
The p-value is very close to zero. Hence we reject the null hypothesis and conclude that there is a relationship between highest degree obtained and social class. However, since no random assignment was used, we cannot infer causation from this study. Confidence intervals do not apply to this analysis, because the test assesses association across a 5x4 contingency table (the single "No Class" case was removed) rather than estimating a single parameter.
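To make the test result less of a black box, the statistic can be rebuilt by hand from the contingency table. This sketch uses the counts from the degree-by-class table (object names are illustrative) and applies the usual Pearson formula, the sum over cells of (observed - expected)^2 / expected, with (rows - 1) x (columns - 1) degrees of freedom:

```r
# Counts copied from the degree-by-class table shown earlier
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# Expected counts under independence
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square statistic and its degrees of freedom
stat <- sum((counts - expected)^2 / expected)
df   <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p_value = p_value)
```

The values can be compared with the `chisq.test()` output reported above; the degrees of freedom follow from the table having 5 rows and 4 columns.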
Our exploratory data analysis suggested a positive association between social class and highest degree. A further analysis could test the strength and direction of that association in addition to the independence test performed here.
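One possible form of that follow-up, offered only as a sketch, is a rank correlation. It assumes that both variables can be treated as ordered (Lower < Working < Middle < Upper for class), which the analysis above does not establish for social class. The code expands the counts from the degree-by-class table back into one pair of ranks per respondent and computes Spearman's rho:

```r
# Counts copied from the degree-by-class table shown earlier
counts <- matrix(
  c(1322,  5721,  3732, 255,
    1450, 14407, 11194, 538,
     120,  1396,  1332,  62,
     109,  1919,  5021, 493,
      26,   534,  2718, 368),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    class  = c("Lower Class", "Working Class", "Middle Class", "Upper Class")
  )
)

# One row per cell, with factor levels in the order given above
long <- as.data.frame(as.table(counts))

# Repeat each cell's level codes once per respondent in that cell
deg_rank <- rep(as.integer(long$degree), times = long$Freq)
cls_rank <- rep(as.integer(long$class),  times = long$Freq)

# Spearman rank correlation (assumes both orderings are meaningful)
rho <- cor(deg_rank, cls_rank, method = "spearman")
rho
```

A positive value would be consistent with the pattern seen in the exploratory plots, though it still says nothing about causation.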
Social class and degree
Emiliano La Rocca
Setup
Load packages
Load data
Part 2: Research question
My research question is: what is the relationship between social class and level of education? Several sub-questions to this research question are listed below: