We explore whether there is a significant relationship between gender (male, female) and the highest degree attained, using General Social Survey (GSS) data collected from individuals in the US from 1972 to 2012 (the citation and link for the data are given under References [1] at the end of this analysis). The question matters because, if the two variables are related, we can go on to include confounding variables or design an experiment to understand whether the relation is real and, if so, why one gender does better or worse than the other in education in the USA in general.
The General Social Survey (GSS) surveys English- or Spanish-speaking residents of the United States who are at least 18 years old. The data were collected from 1972 to 2012. Data were collected in three ways:
The cases are individual US residents who are at least 18 years old and speak English or Spanish. We study the relation between gender and highest degree attained. Both variables are categorical: gender is a nominal variable with the values male and female, whereas highest degree attained is an ordinal variable with the values Less Than High School, High School, Junior College, Bachelor and Graduate.
This is an observational study: the documentation for the study states that the data were collected by survey. Stratified samples were taken first, then groups were randomly selected from within each sample, and then within each group an average of 5 individuals were surveyed at random. The sampling design differed across years, but it uses random sampling to select the individuals to survey. The survey used quota sampling in the early years and full probability samples later on. Split-frame sampling was used in many years, taking half of the samples from previous years and half from the current year of the survey. The link to the documentation of the data set is given in the References section [2].
The results are generalizable to the whole population (English/Spanish speakers only) because the sampling was stratified, individuals were selected at random, and the sample is a subset of the whole population. The samples were random and representative. We cannot establish a causal link between these two variables, because that would require an experiment and this is only an observational study; the best we can do is establish a correlation.
The data have been pre-processed to include only the relevant columns from the original gss data, and all rows with missing values were removed. A sample of the data is shown in the Appendix section.
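# Keep only the sex and degree columns (columns 4 and 12 of gss)
data <- gss[, c(4, 12)]
# Drop rows with a missing value in either column
data <- data[complete.cases(data), ]
n <- nrow(data)  # number of complete cases
gender <- data$sex
degree <- data$degree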
We have data from 56051 individuals in this data set for years 1972-2012.
Let us explore gender and degree. We know from the study that quota sampling was intended to represent men and women equally. In this data set, however, there are somewhat more females than males: 44% males and 56% females, as the summary below shows.
## Male Female
## Count 24678 31373
## Percentage 44 56
Let us now explore degree, starting with a table and then a bar plot.
## Lt High School High School Junior College Bachelor Graduate
## Count 11822 29287 3070.0 8002 3870.0
## Percentage 21 52 5.5 14 6.9
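The percentages in this table can be reproduced from the counts alone. The following is a minimal sketch, assuming the counts are typed in by hand rather than taken from the `degree` vector; `degree_counts` and `degree_pct` are illustrative names that do not appear in the original analysis.

```r
# Counts copied from the degree table above (assumed correct as printed).
degree_counts <- c(
  "Lt High School" = 11822,
  "High School"    = 29287,
  "Junior College" = 3070,
  "Bachelor"       = 8002,
  "Graduate"       = 3870
)

# Percentage of respondents at each degree level,
# kept to two significant digits as in the table.
degree_pct <- signif(100 * degree_counts / sum(degree_counts), 2)
print(degree_pct)
```

With the full data set loaded, `table(degree)` would presumably give the same counts without retyping them.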
The statistics above show that about half of the individuals have a high school degree, while fewer hold advanced degrees such as a bachelor's or graduate degree, which is understandable because only a minority go on to higher education.
Finally, let us look at a mosaic plot of degree and gender. The distribution across degree levels is broadly similar for both genders. The same information is given in the table below as a summary of counts.
| | Lt High School | High School | Junior College | Bachelor | Graduate |
|---|---|---|---|---|---|
| Male | 5153 | 12340 | 1272 | 3822 | 2091 |
| Female | 6669 | 16947 | 1798 | 4180 | 1779 |
Although the survey includes more females than males, the mosaic plot shows that a larger share of males than of females completed advanced degrees: 15.5% of males versus 13.3% of females hold a bachelor's degree, and 8.5% versus 5.7% hold a graduate degree. In raw counts, females still outnumber males at the bachelor's level (4180 versus 3822). A larger share of females than of males stopped at high school (54.0% versus 50.0%). The plot therefore suggests differences at the high school, bachelor's and graduate levels, but hypothesis testing is needed to determine whether they are significant. In the table, the counts for the two genders differ by roughly 300 to 4,600 per degree level. There may be a relation between the two variables, but more analysis is needed to establish whether it is significant.
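Because the two genders are not equally represented, comparing shares within each gender is more informative than comparing raw counts. The sketch below is one way to compute those shares, assuming the counts from the table above are entered by hand; `observed` and `row_pct` are illustrative names, not objects from the original analysis.

```r
# Observed counts copied from the gender-by-degree table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Percentage of each degree level within each gender;
# margin = 1 makes every row sum to 100.
row_pct <- round(100 * prop.table(observed, margin = 1), 1)
print(row_pct)
```

Reading across a row gives the degree distribution for one gender, which is what the mosaic plot displays graphically.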
To find out whether there is any relation between gender and degree level attained, we carry out a hypothesis test. The hypotheses are as follows:
\(\sf{H_{0}}\): There is nothing going on. Gender and degree level are independent, and degree level attained does not vary by gender.
\(\sf{H_{A}}\): Something is going on. Gender and degree level are dependent, and degree level attained does vary by gender.
Because both variables are categorical and degree has more than two levels, we would like to use the Chi-Square test of independence. Before doing so, we need to check that the data fulfill the conditions for the test. Below we go through the conditions one by one and check whether our data set fulfills them.
## degree
## gender Lt High School High School Junior College Bachelor Graduate
## Male 5153 12340 1272 3822 2091
## Female 6669 16947 1798 4180 1779
The second condition for the Chi-Square test of independence requires at least 5 expected cases in each cell. Every cell in this table is far above 5, and so is every expected count computed later in this analysis (the smallest is 1352), so this condition is fulfilled as well. Because both conditions are met, we can apply the test, using the theoretical method rather than simulation. Had the sample been small, or had the conditions for the Chi-Square test of independence not been met, we would have used simulation instead. Also note that only hypothesis testing is possible here; we cannot construct a confidence interval, because it is not defined for categorical variables with more than 2 levels.
Now let us add the row and column totals to the gender and degree contingency table so that we can start applying the Chi-Square test. The new table with the total counts is given below.
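The sample-size condition can also be checked programmatically on the expected counts. This is a minimal sketch, assuming the observed counts are entered by hand; it reads the expected counts from base R's `chisq.test()`, and `observed` and `expected` are illustrative names.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Expected counts under independence, as computed by chisq.test().
expected <- chisq.test(observed)$expected

min(expected)        # smallest expected cell count
all(expected >= 5)   # TRUE when the sample-size condition holds
```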
| | Lt High School | High School | Junior College | Bachelor | Graduate | Total |
|---|---|---|---|---|---|---|
| Male | 5153 | 12340 | 1272 | 3822 | 2091 | 24678 |
| Female | 6669 | 16947 | 1798 | 4180 | 1779 | 31373 |
| Total | 11822 | 29287 | 3070 | 8002 | 3870 | 56051 |
So looking at the table the male rate out of the total is 24678/56051 = 0.44.
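The totals and the male rate can be derived from the observed counts in a few lines. The sketch below assumes the counts are entered by hand; `observed`, `with_totals` and `male_rate` are illustrative names rather than objects from the original analysis.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Male", "Female"),
    c("Lt High School", "High School", "Junior College",
      "Bachelor", "Graduate")
  )
)

# Append a Total column (row sums), then a Total row (column sums).
with_totals <- cbind(observed, Total = rowSums(observed))
with_totals <- rbind(with_totals, Total = colSums(with_totals))
print(with_totals)

# Overall share of males in the sample.
male_rate <- with_totals["Male", "Total"] / with_totals["Total", "Total"]
round(male_rate, 2)
```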
Now we need to find the expected count of males and females for each cell if the null hypothesis is in fact true, meaning that the two variables are independent. We calculate the expected male count for each degree level and then take the complement, that is, the column total minus the expected male count, to find the expected female count.
The expected counts are summarized in the table below:
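Written out, the calculation takes the following form. The symbols \(E\), \(n\) and \(C_j\) are notation introduced here for illustration and are not used elsewhere in this analysis: \(C_j\) is the column total for degree level \(j\) and \(n\) is the grand total.

\[
E_{\text{Male},\,j} = \frac{24678}{n} \times C_j,
\qquad
E_{\text{Female},\,j} = C_j - E_{\text{Male},\,j},
\qquad n = 56051
\]

For example, for Lt High School:

\[
E_{\text{Male}} = \frac{24678}{56051} \times 11822 \approx 5205,
\qquad
E_{\text{Female}} = 11822 - 5205 = 6617
\]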
| | Lt High School | High School | Junior College | Bachelor | Graduate | Total |
|---|---|---|---|---|---|---|
| Male | 5205 | 12894 | 1352 | 3523 | 1704 | 24678 |
| Female | 6617 | 16393 | 1718 | 4479 | 2166 | 31373 |
| Total | 11822 | 29287 | 3070 | 8002 | 3870 | 56051 |
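The expected counts in this table can be reproduced with an outer product of the row and column totals. This is a sketch under the assumption that the observed counts are entered by hand; `observed` and `expected` are illustrative names and are not the `genderDegreeTotal` and `expectedCount` objects used in the analysis code.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Male", "Female"),
    c("Lt High School", "High School", "Junior College",
      "Bachelor", "Graduate")
  )
)

n <- sum(observed)  # grand total of respondents

# Under independence: expected = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / n

# Rounded to whole counts for comparison with the table.
round(expected)
```

The unrounded `expected` matrix is the safer input for the test statistic, since rounding each cell first slightly changes the result.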
chitable <- (genderDegreeTotal - expectedCount)^2 / expectedCount
chi <- sum(chitable)
df <- (2 - 1) * (5 - 1)
pvalue <- pchisq(chi, df, lower.tail = FALSE)
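In formula form, the code above computes the following. The symbols \(O_{ij}\) and \(E_{ij}\), for the observed and expected count in row \(i\) and column \(j\), are notation introduced here for illustration.

\[
\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{5} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (2 - 1)(5 - 1) = 4
\]

The p-value is the upper-tail probability \(P(\chi^2_{4} \ge \chi^2)\), which is what `pchisq(chi, df, lower.tail = FALSE)` returns.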
The Chi-Square statistic comes out to 254.29 with 4 degrees of freedom, which gives a p-value of \(7.76 \times 10^{-54}\). This is far below the 5% significance level, so we reject the null hypothesis in favor of the alternative. The test indicates that something is going on: there is a statistically significant relation between gender and degree level, and the two variables are dependent.
Even though the hypothesis test showed that gender and highest degree attained are related and not independent, we can only say that the variables are correlated and nothing more, because this survey was an observational study, not an experiment. We can generalize the result to the whole population, in the sense that we would expect to observe similar data across the USA, but we cannot claim that one gender will do better at attaining higher degrees. For that we would need a controlled experiment with control and experimental groups.
It is interesting that males have a higher percentage of advanced degrees (bachelor's/graduate) than females, whereas, as we already saw, a larger proportion of females than of males stopped at high school. With such a small p-value, the differences between the genders across degree levels are statistically significant. The relation is therefore unlikely to be due to chance alone, and other factors are probably contributing to it. It does not make sense that gender alone determines the proportion of each gender at each degree level; there are likely social, economic and geographic reasons for this relation.
More exploration is also needed to see the effect of confounding variables on gender and degree. It is quite possible that confounding variables such as location, economy, family status and other factors account for the relation between these two variables, and only with logistic regression or other techniques could we make any stronger claim.
We could extend this analysis to look at the trend in education by gender for each year, since the original gss survey contains yearly data.
In short, our data and hypothesis test showed that there is a relation between gender and degree level attained, but we need to explore confounding variables and carry out experiments before claiming that being male leads to higher degrees than being female.
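As a cross-check on the hand calculation, base R's `chisq.test()` runs the same Pearson test directly on the table of observed counts. The sketch below assumes the counts are entered by hand; `observed`, `result` and `chi_manual` are illustrative names. The built-in function works with unrounded expected counts, so its statistic may differ slightly in the decimals from the value reported above.

```r
# Observed counts copied from the contingency table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Pearson chi-square test of independence.
result <- chisq.test(observed)

# The same statistic computed by hand from the expected counts.
chi_manual <- sum((observed - result$expected)^2 / result$expected)

result$statistic   # test statistic
result$parameter   # degrees of freedom
result$p.value     # upper-tail p-value
```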
[1] Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
Link for data set page: http://doi.org/10.3886/ICPSR34802.v1
Direct link for data: http://bit.ly/dasi_gss_data
[2] Documentation/Cookbook for data set: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html
[3] Study design from page 2867 onward http://www.icpsr.umich.edu/cgi-bin/file?comp=none&study=34802&ds=1&file_id=1136502&path=ICPSR
You can view sample rows for the data used for analysis below
## sex degree
## 1 Female Bachelor
## 2 Male Lt High School
## 3 Female High School
## 4 Female Bachelor
## 5 Female High School
## 6 Male High School
## 7 Male High School
## 8 Male Bachelor
## 9 Female High School
## 10 Female High School
## 11 Female High School
## 12 Male Lt High School
## 13 Male Lt High School
## 14 Female Lt High School
## 15 Male Lt High School
## 16 Male High School
## 17 Male High School
## 18 Female Lt High School
## 19 Female Bachelor
## 20 Female High School
## 21 Male High School
## 22 Male High School
## 23 Male High School
## 24 Female High School
## 25 Female Bachelor
## 26 Female High School
## 27 Male High School
## 28 Female High School
## 29 Male High School
## 30 Male Lt High School