In this project we use the General Social Survey (GSS) dataset provided by Coursera.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
This extract of the General Social Survey (GSS) Cumulative File 1972-2012 provides a sample of selected indicators in the GSS with the goal of providing a convenient data resource for statistical reasoning using the R language. Unlike the full General Social Survey Cumulative File, we have removed missing values from the responses and created factor variables when appropriate to facilitate analysis using R.
The data extract contains 57061 observations of 114 variables.
The GSS is conducted using random sampling, so the results of this analysis can be generalized to the population of the United States. However, because the survey is observational and involves no random assignment, causal inferences cannot be made from the data.
In this analysis I am interested in finding out if the financial satisfaction of the respondents is impacted by the respondent’s highest year of school completed. My expectation is that financial satisfaction improves in relation with the respondent’s years of education completed.
The variables used in this analysis are as follows:
educ (highest year of school completed):
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00   12.00   12.00   12.75   15.00   20.00     164
##  int [1:57061] 16 10 12 17 12 14 13 16 12 12 ...
satfin (financial satisfaction of the respondent):
##      Satisfied   More Or Less Not At All Sat           NA's
##          15344          23176          13934           4607
##  Factor w/ 3 levels "Satisfied","More Or Less",..: 3 2 1 3 1 2 2 3 2 3 ...
We filter out all NA responses. We also exclude the satfin response “More Or Less”, because we want to consider only respondents who gave a definite answer of “Satisfied” or “Not At All Sat”.
# Keep complete cases with a definite answer, then select the three columns used below
gss_satfin <- gss %>%
  filter(!is.na(educ) &
           !is.na(satfin) &
           !is.na(year) &
           satfin != "More Or Less") %>%
  select(educ, satfin, year)
dim(gss_satfin)
## [1] 29202     3
## 'data.frame': 29202 obs. of 3 variables:
## $ educ : int 16 12 17 12 16 12 13 6 9 9 ...
## $ satfin: Factor w/ 3 levels "Satisfied","More Or Less",..: 3 1 3 1 3 3 3 1 1 1 ...
## $ year : int 1972 1972 1972 1972 1972 1972 1972 1972 1972 1972 ...
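One detail visible in the `str()` output above is that `satfin` still reports 3 levels after filtering, because removing rows does not remove an unused factor level. The following is a minimal base-R sketch of the same filtering step, assuming a small made-up data frame (`toy` and its values are hypothetical, not GSS data), that shows how `droplevels()` discards the empty "More Or Less" level:

```r
# Hypothetical toy data standing in for the gss extract (not real GSS values)
toy <- data.frame(
  educ   = c(16, 12, NA, 12, 9, 14),
  satfin = factor(
    c("Satisfied", "More Or Less", "Not At All Sat", NA,
      "Satisfied", "Not At All Sat"),
    levels = c("Satisfied", "More Or Less", "Not At All Sat")
  ),
  year   = c(1972, 1972, 1973, 1973, 1974, 1974)
)

# Same conditions as the dplyr filter: complete cases with a definite answer
keep <- !is.na(toy$educ) & !is.na(toy$satfin) & !is.na(toy$year) &
  toy$satfin != "More Or Less"
toy_satfin <- toy[keep, c("educ", "satfin", "year")]

levels_before <- nlevels(toy_satfin$satfin)  # unused level is still counted
toy_satfin <- droplevels(toy_satfin)         # drop factor levels with no rows
levels(toy_satfin$satfin)
```

Dropping the unused level keeps later contingency tables from carrying an all-zero "More Or Less" column.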
[Figure: mean highest year of school completed by survey year for each financial satisfaction group, with a smoothed trend from `geom_smooth()`]
Observations:
Throughout the entire study period, the Satisfied group had a higher mean highest year of school completed than the Not At All Satisfied group.
Interestingly, the mean for both the Satisfied group and the Not At All Satisfied group gradually increased between 1972 and 2012.
From 2000 onward, the Satisfied group's mean years of education was much higher than that of the Not At All Satisfied group, whose mean became more or less constant.
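The group means behind these observations can be computed per year and per satisfaction group. The sketch below is one way to do it in base R with `aggregate()`, assuming a small made-up data frame (the `toy` values are hypothetical and do not come from the GSS):

```r
# Hypothetical toy data (not real GSS values)
toy <- data.frame(
  year   = c(1972, 1972, 1972, 1972, 2012, 2012, 2012, 2012),
  satfin = c("Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat",
             "Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat"),
  educ   = c(12, 14, 10, 12, 16, 18, 12, 14)
)

# Mean years of education for each year / satisfaction-group combination
mean_educ <- aggregate(educ ~ year + satfin, data = toy, FUN = mean)
mean_educ
```

Each row of `mean_educ` corresponds to one point that a plot of mean education over time would draw for a given group.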
For this analysis, I consider the highest year of school completed as a categorical variable.
Observations:
When respondents report fewer than 16 years as their highest year of school completed, we observe fairly equal proportions of the Satisfied and Not At All Satisfied groups. This rule of thumb breaks down for respondents with fewer than 9 years of school completed. There may be outlier cases here, or other factors, such as family structure and the number of earning members in a household, that could be explored in future analyses.
When the highest year of school completed is 16 or more, the proportion of Satisfied group becomes the clear majority.
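The proportions compared in these observations are within-level shares: for each number of years of education, the fraction of respondents in each satisfaction group. A minimal sketch, assuming made-up data (the `toy` values are hypothetical), computes them with `table()` and `prop.table()`:

```r
# Hypothetical toy data (not real GSS values)
toy <- data.frame(
  educ   = c(12, 12, 12, 12, 16, 16, 16, 16),
  satfin = c("Satisfied", "Satisfied", "Not At All Sat", "Not At All Sat",
             "Satisfied", "Satisfied", "Satisfied", "Not At All Sat")
)

# Counts of each satisfaction group within each education level
counts <- table(educ = toy$educ, satfin = toy$satfin)

# margin = 1 divides by row totals, so each education level sums to 1
props <- prop.table(counts, margin = 1)
props
```

Using `margin = 1` matters here: without it, `prop.table()` divides by the grand total, which would mix up group share with how common each education level is.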
Null hypothesis: The number of years of education and financial satisfaction are independent.
Alternative hypothesis: The number of years of education and financial satisfaction are dependent.
Since we are comparing two financial satisfaction groups across 21 levels of years of education (0 through 20), the appropriate hypothesis test is the chi-square test of independence.
Independence: The samples are independent as discussed previously in this article.
Expected Counts:
##                 gss_satfin$satfin
## gss_satfin$educ   Satisfied Not At All Sat
## 0                 42.968427      39.031573
## 1                  7.860078       7.139922
## 2                 43.492432      39.507568
## 3                 66.024656      59.975344
## 4                 84.888843      77.111157
## 5                121.045202     109.954798
## 6                210.650092     191.349908
## 7                244.186426     221.813574
## 8                757.711527     688.288473
## 9                543.393398     493.606602
## 10               757.711527     688.288473
## 11               964.693583     876.306417
## 12              4665.218341    4237.781659
## 13              1236.128279    1122.871721
## 14              1555.247449    1412.752551
## 15               660.770564     600.229436
## 16              1857.598452    1687.401548
## 17               424.968221     386.031779
## 18               524.005205     475.994795
## 19               203.838025     185.161975
## 20               329.599274     299.400726
From the above table, every cell has an expected count above the required minimum of 5; the smallest is about 7.14 (1 year of education, Not At All Sat).
All the conditions to perform chi-square test of independence are satisfied.
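For reference, an expected count under independence is the row total times the column total, divided by the grand total. The sketch below, which assumes a small made-up table of observed counts (the `observed` values are hypothetical, not GSS data), computes the expected counts by hand, checks the commonly used rule that each should be at least 5, and compares the result with what `chisq.test()` reports:

```r
# Hypothetical 3 x 2 table of observed counts (not real GSS values)
observed <- matrix(
  c(30, 20,
    40, 40,
    50, 20),
  nrow = 3, byrow = TRUE,
  dimnames = list(educ = c("12", "14", "16"),
                  satfin = c("Satisfied", "Not At All Sat"))
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# Sample-size condition: every expected count should be at least 5
counts_ok <- all(expected >= 5)

# chisq.test() exposes the same expected counts in its $expected component
same_as_test <- isTRUE(all.equal(expected, chisq.test(observed)$expected,
                                 check.attributes = FALSE))
```

Checking the condition on `expected` rather than on the observed counts is the point of the table above: a cell can have few observations and still meet the requirement.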
##
## Pearson's Chi-squared test
##
## data: gss_satfin$educ and gss_satfin$satfin
## X-squared = 924.33, df = 20, p-value < 2.2e-16
With a p-value of almost zero, we have strong evidence to reject the null hypothesis. Hence, we have convincing evidence to state that the number of years of education and the financial satisfaction are dependent in the U.S.
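To make the reported numbers easier to interpret, the sketch below reproduces the pieces of a chi-square test of independence by hand on a small made-up table (the `observed` values are hypothetical, not GSS data) and checks them against `chisq.test()`. The degrees of freedom are (rows - 1) x (columns - 1), which for the 21 education levels and 2 satisfaction groups in this analysis gives (21 - 1) x (2 - 1) = 20, matching the output above.

```r
# Hypothetical 3 x 2 table of observed counts (not real GSS values)
observed <- matrix(c(30, 20,
                     40, 40,
                     50, 20), nrow = 3, byrow = TRUE)

test <- chisq.test(observed)
expected <- test$expected

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
stat <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p.value = p_value)
```

The hand-computed values should agree with `test$statistic`, `test$parameter`, and `test$p.value`; for tables larger than 2 x 2, `chisq.test()` applies no continuity correction, so the comparison is direct.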
Given the rejection of the null hypothesis, it would be wise for me to consider how many years of education I would like to complete when thinking about financial satisfaction over my lifetime. Follow-on research could add further factors (e.g., income of the respondent, number of children, and overall life satisfaction of the respondent) in order to better understand the components that allow one to maximize financial and overall well-being while also maximizing years of education.