Introduction:

We explore whether there is a significant relationship between gender (male, female) and highest degree attained, based on General Social Survey (GSS) data collected from individuals in the US from 1972 to 2012 (the citation and link for the data are given at the end of this analysis under references [1]). This question is important because, if the two variables are related, we can follow up by including confounding variables or designing an experiment to understand whether the relation is real and, if it is, why one gender does better or worse than the other in education in the USA in general.

Data:

The General Social Survey (GSS) surveys US residents who are at least 18 years old and speak English or Spanish. The data were collected from 1972 to 2012. There are three ways of collection of data:

The cases are individual US residents who are at least 18 years old and speak English or Spanish. We study the relation between gender and highest degree attained. Both variables are categorical: gender is a nominal variable with the values male and female, whereas highest degree attained is an ordinal variable with the values Less Than High School, High School, Junior College, Bachelor and Graduate.
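The difference between a nominal and an ordinal variable can be sketched in R with factors. This is only an illustration: `sex_toy` and `degree_toy` are hypothetical toy vectors, not the GSS data, and the level names are taken from the categories listed above.

```r
# Hypothetical toy vectors; level names follow the GSS categories listed above.
degree_levels <- c("Lt High School", "High School", "Junior College",
                   "Bachelor", "Graduate")

# Nominal variable: the levels carry no ordering.
sex_toy <- factor(c("Male", "Female", "Female"),
                  levels = c("Male", "Female"))

# Ordinal variable: the levels are ranked from lowest to highest.
degree_toy <- factor(c("Bachelor", "High School", "Graduate"),
                     levels = degree_levels, ordered = TRUE)

is.ordered(sex_toy)             # FALSE
is.ordered(degree_toy)          # TRUE
degree_toy[1] > degree_toy[2]   # TRUE: Bachelor ranks above High School
```

Comparisons such as `>` are only meaningful for the ordered factor; the Chi-Square test used later in this report treats both variables as plain categories and ignores the ordering.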

This is an observational study: the documentation states that the data were collected by survey. First, stratified samples were taken; then groups were randomly selected within each sample, and within each group an average of 5 individuals were surveyed at random. The sampling design differed between years, but it always relied on random sampling to select the individuals to survey. The survey used quota sampling in its early years and full probability samples later on. Split frame sampling was used in many years, taking half of the samples from previous years and half from the current year of the survey. The link to the documentation of the data set is given in the references section [2].

The results are generalizable to the whole population (English/Spanish speakers only), because the sampling was stratified and individuals were selected at random, so the samples were random and representative. We cannot establish a causal link between these two variables: that would require an experiment, and this is only an observational study, so the best we can do is establish a correlation.

Exploratory data analysis:

The data were pre-processed to include only the relevant columns from the original gss data, and all rows with missing values were removed. You can view a sample of the data in the Appendix section.

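# Keep only the two columns of interest: column 4 (sex) and column 12 (degree)
data <- gss[, c(4, 12)]
# Drop rows with a missing value in either column
data <- data[complete.cases(data), ]
n <- nrow(data)        # number of complete cases
gender <- data$sex
degree <- data$degree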
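
To make the missing-value step concrete, here is a minimal sketch of how `complete.cases` filters rows. The data frame `toy` is hypothetical and stands in for the real gss data, which is not reproduced here.

```r
# Hypothetical toy data frame, not the real gss data.
toy <- data.frame(
  sex    = c("Male", "Female", NA, "Female"),
  degree = c("Bachelor", NA, "High School", "Graduate"),
  stringsAsFactors = FALSE
)

# TRUE only for rows that have no NA in any column.
keep <- complete.cases(toy)
toy_clean <- toy[keep, ]

nrow(toy)         # 4 rows before filtering
nrow(toy_clean)   # 2 rows after filtering
```

A row is dropped if either column is missing, which matches the preprocessing described above.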

We have data from 56051 individuals in this data set for years 1972-2012.

Let us explore gender and degree. According to the study documentation, quota sampling was used so that men and women were represented equally. In this data set, however, there are somewhat more females than males, with 44% males and 56% females, as the table and plot below show.

##             Male Female
## Count      24678  31373
## Percentage    44     56

[Figure: bar plot of gender (chunk GenderBarplot)]

Let us now explore degree, starting with a table and then a bar plot.

##            Lt High School High School Junior College Bachelor Graduate
## Count               11822       29287         3070.0     8002   3870.0
## Percentage             21          52            5.5       14      6.9

[Figure: bar plot of degree (chunk DegreeTable)]
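
The counts and percentages in the two summary tables can be reproduced from the marginal counts alone. The sketch below copies the counts from the tables above; the variable names are chosen for illustration and rounding to two significant digits is assumed to match the printed output.

```r
# Marginal counts copied from the tables above.
gender_counts <- c(Male = 24678, Female = 31373)
degree_counts <- c("Lt High School" = 11822, "High School" = 29287,
                   "Junior College" = 3070, "Bachelor" = 8002,
                   "Graduate" = 3870)

# Percentage of the sample in each category, to two significant digits.
gender_pct <- signif(100 * gender_counts / sum(gender_counts), 2)
degree_pct <- signif(100 * degree_counts / sum(degree_counts), 2)

rbind(Count = gender_counts, Percentage = gender_pct)
rbind(Count = degree_counts, Percentage = degree_pct)
```

Both sets of counts sum to the same total, 56051, which is the number of complete cases.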

The statistics above show that about half of the individuals (52%) have a high school degree as their highest degree, while fewer hold advanced degrees such as a bachelor's (14%) or graduate degree (6.9%), which is understandable as only a minority go on to higher education.

Finally, let us look at a mosaic plot of degree and gender. The distribution across degree levels is broadly similar for both genders. The table below the plot summarizes the same information as counts.

[Figure: mosaic plot of degree by gender (chunk Mosaic)]

        Lt High School  High School  Junior College  Bachelor  Graduate
Male              5153        12340            1272      3822      2091
Female            6669        16947            1798      4180      1779

Even though there are more females than males in the survey, the mosaic plot shows that a larger share of males than of females completed advanced degrees such as bachelor's and graduate degrees (for example, 2091 of 24678 males hold a graduate degree versus 1779 of 31373 females). A larger share of females than of males stopped at high school. The mosaic plot therefore suggests differences between the two genders at the high school, bachelor and graduate levels, but we need a hypothesis test to determine whether the differences are significant. In the table, the raw counts differ between genders by roughly 300 to 4600 depending on the degree, although part of that gap simply reflects the larger number of females in the sample. There may be a relation between the two variables, but more analysis is needed to establish whether it is statistically significant.
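
Because the two genders have different sample sizes, raw counts are hard to compare directly. A minimal sketch, using the counts from the table above, converts each row to percentages within gender; the object names `observed` and `row_pct` are chosen for illustration.

```r
# Observed counts copied from the gender-by-degree table above.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Share of each degree level within each gender (each row sums to 100).
row_pct <- 100 * prop.table(observed, margin = 1)
round(row_pct, 1)
```

With these counts, about 15.5% of males versus 13.3% of females hold a bachelor's degree, and about 8.5% versus 5.7% hold a graduate degree, while high school is the highest degree for about 50% of males and 54% of females.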

Inference:

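To find out whether gender and degree level attained are related, we carry out a hypothesis test. The hypotheses are as follows:

H0 (null hypothesis): Gender and highest degree attained are independent.
HA (alternative hypothesis): Gender and highest degree attained are dependent.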

As both variables are categorical and degree has more than two levels, we would like to use the Chi-Square test of independence. Before we can do that, we need to check that the data meet the conditions for the test. Below we go through the conditions one by one and check whether our data set fulfills each of them.

  1. Independence: Sample observations must be independent.
    • Random sample/assignment for observational study/experiment. As established in the Data section, the original GSS survey used random sampling to select the individuals it interviewed, so the random condition is reasonably met for this study.
    • If sampling without replacement, n < 10% of the population. This condition is also fulfilled, as the sample size of 56051 is far less than 10% of the total US population.
    • Each case contributes to only one cell in the table. Each individual has only one gender and one value for degree attained, so each individual falls in exactly one cell of the contingency table for these two variables.
  2. Sample size: Each cell in the contingency table must have at least 5 cases. To check this, let us look again at the contingency table for gender and degree below.
##         degree
## gender   Lt High School High School Junior College Bachelor Graduate
##   Male             5153       12340           1272     3822     2091
##   Female           6669       16947           1798     4180     1779

As the table shows, each cell has far more than 5 cases, so the second condition for the Chi-Square test of independence is fulfilled as well. Since both conditions are met, we use the theoretical method rather than simulation; had the sample been small or the conditions not been met, we would have used simulation instead. Note also that only hypothesis testing is possible here: we cannot construct a confidence interval, as it is not defined for categorical variables with more than 2 levels.
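
The sample size check can also be done programmatically. The sketch below uses the counts from the contingency table above and checks both the observed counts, as stated in the condition, and the expected counts under independence, which is how the condition is often phrased; treating both as relevant is an assumption of this sketch.

```r
# Observed counts copied from the contingency table above
# (rows: Male, Female; columns: the five degree levels in table order).
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE
)

# Expected count per cell under independence: row total * column total / n.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

min(observed)        # smallest observed cell: 1272
all(observed >= 5)   # TRUE: condition as stated above
all(expected >= 5)   # TRUE: same check on the expected counts
```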

Now let us add the row and column totals to the gender and degree contingency table so we can start applying the Chi-Square test. The new table, with the total counts, is given below.

        Lt High School  High School  Junior College  Bachelor  Graduate  Total
Male              5153        12340            1272      3822      2091  24678
Female            6669        16947            1798      4180      1779  31373
Total            11822        29287            3070      8002      3870  56051

So looking at the table the male rate out of the total is 24678/56051 = 0.44.

Now we need to find the expected count of males and females in each cell if the null hypothesis is in fact true, meaning the two variables are independent. We calculate the expected male count for each degree level by applying the overall male rate to each column total, and take the complement to find the expected female count.
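
The calculation just described can be sketched as follows, using the column totals and the male total from the table above. The variable names are chosen for illustration.

```r
# Column totals copied from the table with totals above.
col_totals <- c("Lt High School" = 11822, "High School" = 29287,
                "Junior College" = 3070, "Bachelor" = 8002,
                "Graduate" = 3870)
n <- sum(col_totals)        # 56051 individuals in total
male_rate <- 24678 / n      # overall share of males, about 0.44

# Under independence, every degree level has the same share of males.
expected_male   <- male_rate * col_totals
expected_female <- col_totals - expected_male   # complement within each column

round(rbind(Male = expected_male, Female = expected_female))
```

The unrounded rate is used in the multiplication; rounding happens only when the table is displayed.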

The expected counts are summarized in the table below:

        Lt High School  High School  Junior College  Bachelor  Graduate  Total
Male              5205        12894            1352      3523      1704  24678
Female            6617        16393            1718      4479      2166  31373
Total            11822        29287            3070      8002      3870  56051

# Contribution of each cell to the statistic: (observed - expected)^2 / expected
chitable <- (genderDegreeTotal - expectedCount)^2 / expectedCount
chi <- sum(chitable)              # Chi-Square test statistic
df <- (2 - 1) * (5 - 1)           # (rows - 1) * (columns - 1)
pvalue <- pchisq(chi, df, lower.tail = FALSE)   # upper-tail probability

The Chi-Square statistic comes out to 254.29 with 4 degrees of freedom, which gives a p-value of 7.76 × 10^-54. This is far below the 5% significance level, so we reject the null hypothesis in favor of the alternative hypothesis. In other words, the test provides strong evidence that gender and degree level are related: the two variables are dependent.
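
As a cross-check on the manual calculation, R's built-in `chisq.test` can be applied to the observed counts from the contingency table. This is a sketch rather than the code used in the report; it may differ from the reported statistic in the last digits, depending on whether the expected counts were rounded before use.

```r
# Observed counts copied from the contingency table in this report.
observed <- matrix(
  c(5153, 12340, 1272, 3822, 2091,
    6669, 16947, 1798, 4180, 1779),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("Male", "Female"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Pearson Chi-Square test of independence. The continuity correction
# only applies to 2x2 tables, so it is not used for this 2x5 table.
test <- chisq.test(observed)

test$statistic   # Chi-Square statistic
test$parameter   # degrees of freedom
test$p.value     # p-value
```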

Conclusion:

Although our hypothesis test showed that gender and highest degree attained are not independent, we can only say that the two variables are associated, nothing more, because this survey was an observational study and not an experiment. We can generalize the result to the whole population, in the sense that we would expect to observe a similar pattern across the USA, but we cannot claim that a certain gender will perform better in attaining a higher degree; for that we would need a controlled experiment with control and treatment groups.

It is interesting that males have a higher percentage of advanced degrees (bachelor's/graduate) than females, whereas, as we already saw, a larger proportion of females than males have high school as their highest degree. With such a small p-value, the differences between the genders across degree levels are statistically significant. This means the relation is very unlikely to be due to chance alone, and other factors are likely contributing to it. It does not make sense that gender alone determines the proportion of each gender at each degree level; there should be other social, economic and geographic reasons for this relation.

Also, more exploration is needed to see the effect of confounding variables on gender and degree. It is quite possible that confounding variables such as location, economy, family status and other factors account for the relation between these two variables, and only with logistic regression or other techniques that adjust for them could we say more.
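
As a starting point for such a follow-up, the sketch below fits a logistic regression to the counts already shown in this report. It rests on assumptions that are not part of the analysis above: the ordinal degree variable is collapsed into a binary outcome (Bachelor or Graduate versus anything lower), and no confounders are included because none are in the two-column data set used here. The names `agg`, `fit` and `odds_ratio` are chosen for illustration.

```r
# Counts derived from the contingency table in this report:
# "advanced" = Bachelor + Graduate, "other" = the remaining three levels.
agg <- data.frame(
  sex      = factor(c("Female", "Male"), levels = c("Female", "Male")),
  advanced = c(4180 + 1779, 3822 + 2091),
  other    = c(6669 + 16947 + 1798, 5153 + 12340 + 1272)
)

# Logistic regression on aggregated counts; Female is the reference level.
fit <- glm(cbind(advanced, other) ~ sex, family = binomial, data = agg)

# Odds of holding an advanced degree for males relative to females.
odds_ratio <- exp(coef(fit)[["sexMale"]])
odds_ratio
```

Confounders such as year or region would enter as additional terms on the right-hand side of the formula once those columns are kept from the original gss data; without them, this model only restates the association found by the Chi-Square test.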

We could also extend this analysis to look at the trend in education by gender over the years, as the original GSS survey provides yearly data.

In short, our data and hypothesis test showed that there is a relation between gender and degree level attained, but we need to explore confounding variables and do experimentation before making any claim that being male leads to higher degrees than being female.

Citation

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1

[1] Link for data set page: http://doi.org/10.3886/ICPSR34802.v1

Direct link for data: http://bit.ly/dasi_gss_data

[2] Documentation/Cookbook for data set: https://d396qusza40orc.cloudfront.net/statistics%2Fproject%2Fgss1.html

[3] Study design from page 2867 onward http://www.icpsr.umich.edu/cgi-bin/file?comp=none&study=34802&ds=1&file_id=1136502&path=ICPSR

Appendix

Sample rows of the data used for the analysis are shown below.

##       sex         degree
## 1  Female       Bachelor
## 2    Male Lt High School
## 3  Female    High School
## 4  Female       Bachelor
## 5  Female    High School
## 6    Male    High School
## 7    Male    High School
## 8    Male       Bachelor
## 9  Female    High School
## 10 Female    High School
## 11 Female    High School
## 12   Male Lt High School
## 13   Male Lt High School
## 14 Female Lt High School
## 15   Male Lt High School
## 16   Male    High School
## 17   Male    High School
## 18 Female Lt High School
## 19 Female       Bachelor
## 20 Female    High School
## 21   Male    High School
## 22   Male    High School
## 23   Male    High School
## 24 Female    High School
## 25 Female       Bachelor
## 26 Female    High School
## 27   Male    High School
## 28 Female    High School
## 29   Male    High School
## 30   Male Lt High School