Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(tidyverse)
library(janitor)

Load data

load("gss.Rdata")

Part 1: Data

Our working data set is an extract from the full General Social Survey Cumulative File, a periodically sociological survey conducted since 1972. According to GSS documentation (https://gss.norc.org/), the samples are drawn by using area-probability frames. This technique uses clusters to randomly sample the target population.

Although it has some disadvantages, like bias prone (if the clusters that represent the entire population were formed under a biased opinion) and a high sampling error (when compared to simple randomization), we can conclude that the results from the survey has some degree of generabizability.

This is an observational study. There is no experimental design in discussion here. No random assignment to experimental and control groups. Therefore, we cannot talk about causality when analysing the relations among the variables presented.

Let’s first take a look at the dimensions of our data frame and, then, get some counts and levels for some variables that we could use on this analyis. We will also count the amount of NA entries.

dim(gss)
## [1] 57061   114

The dataset consists of 114 columns (features/variables) and 57061 rows (people/observations). Let’s now search for some possibile variables to analyze.

gss %>% 
  count(incom16)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))
##                incom16     n   freq
## 1    Far Below Average  3725  6.53%
## 2        Below Average 10692 18.74%
## 3              Average 21941 38.45%
## 4        Above Average  6575 11.52%
## 5    Far Above Average   796  1.39%
## 6 Lived In Institution    10  0.02%
## 7                 <NA> 13322 23.35%
gss %>% 
  count(income06)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))
##              income06     n   freq
## 1        Under $1 000   129  0.23%
## 2     $1 000 To 2 999   112   0.2%
## 3     $3 000 To 3 999    87  0.15%
## 4     $4 000 To 4 999    54  0.09%
## 5     $5 000 To 5 999    83  0.15%
## 6     $6 000 To 6 999   105  0.18%
## 7     $7 000 To 7 999   128  0.22%
## 8     $8 000 To 9 999   192  0.34%
## 9     $10000 To 12499   362  0.63%
## 10    $12500 To 14999   317  0.56%
## 11    $15000 To 17499   308  0.54%
## 12    $17500 To 19999   231   0.4%
## 13    $20000 To 22499   348  0.61%
## 14    $22500 To 24999   333  0.58%
## 15    $25000 To 29999   481  0.84%
## 16    $30000 To 34999   518  0.91%
## 17    $35000 To 39999   493  0.86%
## 18    $40000 To 49999   836  1.47%
## 19    $50000 To 59999   734  1.29%
## 20    $60000 To 74999   891  1.56%
## 21   $75000 To $89999   693  1.21%
## 22  $90000 To $109999   564  0.99%
## 23 $110000 To $129999   379  0.66%
## 24 $130000 To $149999   237  0.42%
## 25    $150000 Or Over   595  1.04%
## 26            Refused   860  1.51%
## 27               <NA> 46991 82.35%

It would be interesting to investigate if is there any relationship between family income when respondent was 16 years old (incom16) and the current respondent’s family income (income06). But, as the data has too many NA entries for this variables, we will give up on them for now.

What about sex and degree?

gss %>% 
  count(sex)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))
##      sex     n   freq
## 1   Male 25146 44.07%
## 2 Female 31915 55.93%
gss %>% 
  count(degree)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))
##           degree     n   freq
## 1 Lt High School 11822 20.72%
## 2    High School 29287 51.33%
## 3 Junior College  3070  5.38%
## 4       Bachelor  8002 14.02%
## 5       Graduate  3870  6.78%
## 6           <NA>  1010  1.77%

Sex and degree have relatively few NAs to deal with, but we will need to remove them before working with this data.

So, here we will consider the correlations between sex and degree, trying to find if being a man or a women is associated with achieving higher or lower education degrees.


Part 2: Research question

This study will address the relationship between two variables: 1) sex - respondent’s sex; and 2) degree - highest respondent degree, if finished 9th-12th grade.

We want to explore this data set and try to find if is there any difference on degrees among men and women. And if that difference is significative. This information could be useful to improve our understanding US social structure.


Part 3: Exploratory data analysis

Is there any difference on degrees among men and women?

First, let’s take a look at the absolute values for our variables.Excluding degrees NA values

gss %>%
  filter(!is.na(degree)) %>% 
  group_by(degree, sex) %>%
  summarise(count = n()) %>%
  pivot_wider(names_from = degree, values_from = count)%>%
  adorn_totals("row")%>%
  adorn_totals("col")
## `summarise()` regrouping output by 'degree' (override with `.groups` argument)
##     sex Lt High School High School Junior College Bachelor Graduate Total
##    Male           5153       12340           1272     3822     2091 24678
##  Female           6669       16947           1798     4180     1779 31373
##   Total          11822       29287           3070     8002     3870 56051

Now, let’s look at the relative frequency for this variables. Also, excluding degrees NA values.

gss %>%
  filter(!is.na(degree)) %>% 
  group_by(degree, sex) %>%
  summarise(count = n()) %>%
  mutate(freq = count / sum(count)) %>% 
  select(!c(count)) %>%
  pivot_wider(names_from = degree, values_from = freq)
## # A tibble: 2 x 6
##   sex    `Lt High School` `High School` `Junior College` Bachelor Graduate
##   <fct>             <dbl>         <dbl>            <dbl>    <dbl>    <dbl>
## 1 Male              0.436         0.421            0.414    0.478    0.540
## 2 Female            0.564         0.579            0.586    0.522    0.460
sexLabels <- c("Women", "Men")
gss %>%
  filter(!is.na(degree)) %>%
  mutate(x = forcats::fct_reorder(sex, as.numeric(degree),  mean)) %>% 
  ggplot(aes(x=x, fill = degree)) +
  geom_bar(position="fill", width = .7) +
  scale_x_discrete(labels= sexLabels) +
  labs(title = "Sex and obtained degree", x = "Sex", y = "Percentage") +
  theme(axis.text.x=element_text(color = "black", size=11, angle= 0, vjust=.8, 
                                 hjust=0.8)) + theme(legend.title=element_blank())

The percentage of maximum degree obtained on the first degree level (Lt High School) seems to be similar for men and women, both with roughly 0.20.

On the other hand, when we observe the next degrees, specially after High School degree, sex and degree appear to be correlated, as it seems that men max degree tend to be higher then the max degree reported by women.

So, it seems that our data shows a correlation (dependence) on sex and degree. Let’s try to check if this assossiation really existes or if it is explained just by random chance.


Part 4: Inference

Our variables are both categorical, and we aim to investigate if there is a dependence relationship between them. One way to do this is to run a chi-squared independence test. On this methodology, the observed data is compared to a theoretical distribution of expected values under the null hypothesis. If the difference between observed and expected values can be explained simply by standard error of the distribution under null hypothesis, then we can’t reject the null hypothesis.

For this test, we he have the following hypothesis:

H0 (nothing going on): sex and degree are independent. Degree does not vary by sex.

H1 (something going on): sex and degree are dependent. Degree does vary by sex.

Now, let’s check for conditions

CONDITION 1. Independence.

To run a chi squared test, sample observations must be independent and we have no reason to think they are not in this sample. According to gss documentation, the samples are randomly chosen and the numbers we have on each sample are represent less than 10% of US population. Finally, each case contributes to only once cell on our frequency table. So, we can conclude that this condition is attended and we have independent sample observations.

CONDITION 2. Sample size.

To attend this condition, each particular scenario (i.e. cell), must have at least 5 expected cases. Due to the great number of observed cases, we probably will find more than 5 expected cases on the null hypothesis distribution. That may happen because, to find the expected cases, we multiply the totals from rows and columns and divide that by the table total. But, to be certain, we will run the chi-squared test and plot the expected observations. If we find that this condition is not attended, then we will start over and search for a more appropriate test.

chisq <- chisq.test(gss$sex, gss$degree)
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  gss$sex and gss$degree
## X-squared = 254.35, df = 4, p-value < 2.2e-16
chisq$observed
##         gss$degree
## gss$sex  Lt High School High School Junior College Bachelor Graduate
##   Male             5153       12340           1272     3822     2091
##   Female           6669       16947           1798     4180     1779
round(chisq$expected,2)
##         gss$degree
## gss$sex  Lt High School High School Junior College Bachelor Graduate
##   Male          5204.96    12894.41        1351.65   3523.1  1703.87
##   Female        6617.04    16392.59        1718.35   4478.9  2166.13

The statistic obtained by this test is very high and the p-value very low. Based on this outcome, we could not reject the null hypothesis. Therefore, the variables are independent and we cannot say that the maximum degree obtained varies by sex.

Also, the command ‘round(chisq$expected,2)’ shows us the expect values. And we can see that the sample size condition is attended on this sample, with each cell having at least 5 expected cases.