Setup

Load packages

library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. Once loaded, the data will be available as an object called gss.

load("gss.Rdata")
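As a quick sanity check, load() returns the names of the objects it restores, which confirms what the file actually contains. The sketch below uses a made-up stand-in (gss_toy, written to a temporary file) rather than the course file, so those names are assumptions for illustration only.

```r
# Hypothetical stand-in for the course data; the real object comes from gss.Rdata.
gss_toy <- data.frame(
  educ   = c(12L, 16L, NA),
  satfin = c("Satisfied", NA, "Not At All Sat")
)

# Save it to a temporary .Rdata file, then remove it from the workspace.
tmp <- tempfile(fileext = ".Rdata")
save(gss_toy, file = tmp)
rm(gss_toy)

# load() restores the saved objects and (invisibly) returns their names.
loaded <- load(tmp)
print(loaded)
print(dim(gss_toy))
```

With the real file, the same pattern would be loaded <- load("gss.Rdata"), after which loaded should contain "gss".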

Part 1: Data

In this project we use the General Social Survey (GSS) dataset provided by Coursera.

Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

This extract of the General Social Survey (GSS) Cumulative File 1972-2012 provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.

The data extract contains 57061 observations of 114 variables.

Random sampling has been used to conduct the survey. The data for the project is generalizable to the entire population of the country. However causal inferences cannot be made from the data as the survey is of observational type.


Part 2: Research question

The impact of respondent’s highest year of school completed on financial satisfaction

In this analysis I want to find out whether respondents’ financial satisfaction is related to their highest year of school completed. My expectation is that financial satisfaction increases with the number of years of education completed.

The variables used in this analysis are as follows:

  1. educ : Respondent’s Highest year of school completed
summary(gss$educ)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.00   12.00   12.75   15.00   20.00     164
str(gss$educ)
##  int [1:57061] 16 10 12 17 12 14 13 16 12 12 ...
  2. satfin : Categorical variable indicating the satisfaction with current financial situation
summary(gss$satfin)
##      Satisfied   More Or Less Not At All Sat           NA's 
##          15344          23176          13934           4607
str(gss$satfin)
##  Factor w/ 3 levels "Satisfied","More Or Less",..: 3 2 1 3 1 2 2 3 2 3 ...

We filter out all NA responses. We also drop the satfin level “More Or Less”, keeping only respondents who gave a definite answer of “Satisfied” or “Not At All Sat”.

gss %>%
  filter(!is.na(educ) &
           !is.na(satfin) &
           !is.na(year) &
           satfin != "More Or Less") %>%
  select(educ,satfin,year) -> gss_satfin

dim(gss_satfin)
## [1] 29202     3
str(gss_satfin)
## 'data.frame':    29202 obs. of  3 variables:
##  $ educ  : int  16 12 17 12 16 12 13 6 9 9 ...
##  $ satfin: Factor w/ 3 levels "Satisfied","More Or Less",..: 3 1 3 1 3 3 3 1 1 1 ...
##  $ year  : int  1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 ...
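Note that the str() output still reports 3 factor levels: filtering rows does not remove the now-empty “More Or Less” level. A minimal sketch of dropping it, using base R on a small made-up data frame (toy and toy_satfin are hypothetical names, and the values are invented):

```r
# Made-up data with the same three satfin levels as the GSS extract.
toy <- data.frame(
  educ   = c(16L, 12L, NA, 10L, 14L, 12L),
  satfin = factor(
    c("Not At All Sat", "Satisfied", "Satisfied",
      "More Or Less", NA, "Not At All Sat"),
    levels = c("Satisfied", "More Or Less", "Not At All Sat")
  ),
  year   = c(1972L, 1972L, 1973L, 1973L, 1974L, 1974L)
)

# Same filtering logic as above, written with base R indexing.
keep <- !is.na(toy$educ) & !is.na(toy$satfin) & toy$satfin != "More Or Less"
toy_satfin <- toy[keep, ]

# The unused level survives the filter ...
print(nlevels(toy_satfin$satfin))

# ... until it is dropped explicitly.
toy_satfin <- droplevels(toy_satfin)
print(levels(toy_satfin$satfin))
```

chisq.test() discards unused levels when given two vectors, but table() and plot() keep them, so dropping the level avoids an empty “More Or Less” column in later tables and plots.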

Part 3: Exploratory data analysis

ggplot(data=gss_satfin,aes(x=year,y=educ)) + geom_smooth(aes(fill=satfin)) 
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Observations:

  1. Throughout the study period, the Satisfied group had a higher mean number of years of schooling than the Not At All Satisfied group.

  2. Interestingly, the mean for both groups increased gradually between 1972 and 2012.

  3. From 2000 onward, the Satisfied group’s mean pulled well ahead of the Not At All Satisfied group’s, which stayed more or less constant.
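The smoothed curves summarize the mean of educ within each satisfaction group over time. The same comparison can be made numerically with group means; the sketch below assumes a tiny made-up data frame (toy, with invented values), not the GSS data.

```r
# Invented values for two survey years and two satisfaction groups.
toy <- data.frame(
  year   = c(1972L, 1972L, 1972L, 1972L, 2012L, 2012L, 2012L, 2012L),
  satfin = c("Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat",
             "Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat"),
  educ   = c(12L, 14L, 10L, 12L, 16L, 18L, 12L, 14L)
)

# Mean years of schooling for every year-by-group combination.
group_means <- aggregate(educ ~ year + satfin, data = toy, FUN = mean)
print(group_means)
```

Applied to gss_satfin, the same aggregate() call would give one mean per survey year and group, which is roughly what the smoother traces.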

For this analysis, I consider the highest year of school completed as a categorical variable.

plot(table(gss_satfin$educ,gss_satfin$satfin))

Observations:

  1. For respondents with fewer than 16 years of schooling, the Satisfied and Not At All Satisfied groups appear in roughly equal proportions. This pattern breaks down for respondents with fewer than 9 years of schooling. Outliers may be at work here, or other factors such as family structure and the number of earners in a household, which could be explored in future work.

  2. When the highest year of school completed is 16 or more, the Satisfied group becomes the clear majority.
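The proportions read off the mosaic plot can also be computed directly. A minimal sketch, assuming a small made-up sample (toy, with invented counts) in which educ is treated as a categorical variable:

```r
# Invented sample: three education levels, two satisfaction groups.
toy <- data.frame(
  educ   = c(8, 8, 12, 12, 12, 12, 16, 16, 16, 16),
  satfin = c("Satisfied", "Not At All Sat",
             "Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat",
             "Satisfied", "Satisfied", "Satisfied", "Not At All Sat")
)

# Treat years of schooling as a factor and cross-tabulate.
counts <- table(educ = factor(toy$educ), satfin = toy$satfin)

# margin = 1 gives proportions within each education level (rows sum to 1).
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 2))
```

Each row of row_props is the split between the two groups at one education level, which corresponds to the relative heights of the tiles in one column of the mosaic plot.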


Part 4: Inference

State Hypothesis

Null hypothesis: The number of years of education and financial satisfaction are independent.

Alternative hypothesis: The number of years of education and financial satisfaction are dependent.

Because we have two satisfaction groups and 21 education levels (0 through 20 years), the appropriate hypothesis test is the chi-square test of independence.

Check Conditions

  1. Independence: The observations are independent. As noted in Part 1, the GSS uses random sampling, and each respondent contributes to exactly one cell of the table.

  2. Expected Counts:

chisq.test(gss_satfin$educ,gss_satfin$satfin)$expected
##                gss_satfin$satfin
## gss_satfin$educ   Satisfied Not At All Sat
##              0    42.968427      39.031573
##              1     7.860078       7.139922
##              2    43.492432      39.507568
##              3    66.024656      59.975344
##              4    84.888843      77.111157
##              5   121.045202     109.954798
##              6   210.650092     191.349908
##              7   244.186426     221.813574
##              8   757.711527     688.288473
##              9   543.393398     493.606602
##              10  757.711527     688.288473
##              11  964.693583     876.306417
##              12 4665.218341    4237.781659
##              13 1236.128279    1122.871721
##              14 1555.247449    1412.752551
##              15  660.770564     600.229436
##              16 1857.598452    1687.401548
##              17  424.968221     386.031779
##              18  524.005205     475.994795
##              19  203.838025     185.161975
##              20  329.599274     299.400726

From the table above, every expected count meets the usual minimum of 5 for a chi-square test. The smallest, about 7.14, is in the educ = 1 row.
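Each expected count is (row total * column total) / n, the count a cell would have if the two variables were independent. The sketch below reproduces that calculation on a small made-up table and compares it with chisq.test()$expected; observed and expected_manual are hypothetical names, and the counts are invented. The last lines redo one cell of the table above, using column totals inferred from the printed expected counts (an assumption, not a figure stated in the analysis).

```r
# Invented 3 x 2 table of observed counts.
observed <- matrix(
  c(30,  20,
    40,  60,
    50, 100),
  nrow = 3, byrow = TRUE,
  dimnames = list(educ   = c("12", "14", "16"),
                  satfin = c("Satisfied", "Not At All Sat"))
)

# Expected count for each cell: row total * column total / grand total.
n <- sum(observed)
expected_manual <- outer(rowSums(observed), colSums(observed)) / n
print(expected_manual)

# chisq.test() computes the same table internally.
expected_r <- chisq.test(observed)$expected
print(all.equal(expected_manual, expected_r, check.attributes = FALSE))

# Condition check: every expected count should be at least 5.
print(all(expected_manual >= 5))

# One cell of the real table: the educ = 1 row sums to 15, n = 29202, and a
# Satisfied column total of 15302 is inferred from the printed table (assumption).
educ1_expected <- 15 * 15302 / 29202
print(educ1_expected)  # compare with the 7.860078 shown for educ = 1, Satisfied
```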

  3. Degrees of Freedom: With 21 education levels and 2 satisfaction groups, the degrees of freedom are (21-1)*(2-1) = 20.

All the conditions to perform chi-square test of independence are satisfied.

Chi-Square test of independence

chisq.test(gss_satfin$educ,gss_satfin$satfin)
## 
##  Pearson's Chi-squared test
## 
## data:  gss_satfin$educ and gss_satfin$satfin
## X-squared = 924.33, df = 20, p-value < 2.2e-16

Findings

With a p-value below 2.2e-16, we reject the null hypothesis. The data provide convincing evidence that the number of years of education and financial satisfaction are dependent in the U.S.
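The reported p-value can be recovered from the test statistic and the degrees of freedom alone. A minimal sketch using the values printed above; the 5% significance level used for the critical value is an assumption, since the analysis does not state one.

```r
# Values taken from the chisq.test() output above.
chi_sq <- 924.33
dof    <- (21 - 1) * (2 - 1)

# Upper-tail probability of a chi-square distribution with 20 degrees of freedom.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
print(p_value)

# Critical value at an assumed 5% significance level (about 31.41 for df = 20).
critical <- qchisq(0.95, df = dof)
print(critical)

# The statistic is far beyond the critical value, so the null is rejected.
print(chi_sq > critical)
```

R prints "p-value < 2.2e-16" because that is the threshold below which it stops reporting exact values, not because the p-value is exactly zero.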

Conclusion

Given the rejection of the null hypothesis, it would be wise for me to weigh how many years of education I plan to complete when thinking about financial satisfaction over my lifetime. Because the data are observational, however, this association does not show that additional schooling causes greater satisfaction. Follow-on research could add further factors (e.g., respondent’s income, number of children, overall life satisfaction) to better understand what allows one to maximize financial and overall well-being while also maximizing years of education.