John Eugene Driscoll | Week 5 - Data Analysis Project | June 2017 Submission
The purpose of this document is to complete the data analysis project required during week 5 of the Inferential Statistics course offered by Duke University on Coursera.
The background context regarding the assignment can be found at: https://www.coursera.org/learn/inferential-statistics-intro/peer/TRNOq/data-analysis-project
library(ggplot2)
library(dplyr)
library(statsr)
library(lattice)
load("gss.Rdata")
This project uses an extract of the General Social Survey (GSS) Cumulative File 1972-2012 that was provided by Coursera.
Since 1972, the General Social Survey (GSS) has been monitoring societal change and studying the growing complexity of American society. The GSS aims to gather data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies in order to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
The data extract contains 57061 observations of 114 variables.
Unlike the full General Social Survey Cumulative File, the extract has been sanitized: missing values were removed from the responses, and factor variables were created where appropriate to facilitate analysis in R.
Because the GSS draws a large random sample of respondents, results from the sample can be generalized to the adult population of the United States.
This long-term survey is an observational study with no random assignment of respondents to groups. Therefore, it is not possible to make causal inferences from the data.
The impact of family size on financial satisfaction
In this analysis, I am interested in finding out whether the financial satisfaction of respondents changes as the number of children in a family increases from 0. As I grapple with the question of family planning, I am curious to see whether there are any significant findings in the dataset.
The variables used in this analysis are:
summary(gss$childs)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##   0.000   0.000   2.000   1.953   3.000   8.000     181
str(gss$childs)
##  int [1:57061] 0 5 4 0 2 0 2 0 2 4 ...
summary(gss$satfin)
##      Satisfied   More Or Less Not At All Sat           NA's
##          15344          23176          13934           4607
str(gss$satfin)
##  Factor w/ 3 levels "Satisfied","More Or Less",..: 3 2 1 3 1 2 2 3 2 3 ...
This data set will not be filtered by year, as the underlying question would have been applicable even at the start of the survey in 1972. We will remove non-responses (NA), along with middle-of-the-road responses (satfin "More Or Less"), to better understand respondents who had a clear, strong feeling about their financial condition.
# Keep complete cases and drop the middle "More Or Less" response
gss_satfin <- gss %>%
  filter(!is.na(childs) &
         !is.na(satfin) &
         !is.na(year) &
         satfin != "More Or Less") %>%
  select(childs, satfin, year)
dim(gss_satfin)
## [1] 29183     3
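One side effect of filtering a factor is worth a quick illustration: removing the "More Or Less" rows does not remove the "More Or Less" level from satfin, so functions such as table() may still report it as an empty category. The sketch below uses a small made-up data frame (the values are hypothetical, not GSS data) and base R subsetting in place of dplyr to show how droplevels() tidies this up.

```r
# Toy data, made up for illustration only (not GSS values)
toy <- data.frame(
  childs = c(0, 2, 1, 3, 0, 5),
  satfin = factor(
    c("Satisfied", "More Or Less", "Not At All Sat",
      "Satisfied", "More Or Less", "Not At All Sat"),
    levels = c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Same kind of filter as in the report, written with base R subsetting
toy_sub <- toy[!is.na(toy$satfin) & toy$satfin != "More Or Less", ]

# The unused level is still attached to the factor after filtering
levels(toy_sub$satfin)

# droplevels() removes levels that no longer occur in the data
toy_sub$satfin <- droplevels(toy_sub$satfin)
levels(toy_sub$satfin)
```

The expected-count table later in this report shows only two satisfaction columns, so the chi-square test itself does not appear to be affected by the leftover level.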
ggplot(data=gss_satfin,aes(x=year,y=childs)) + geom_smooth(aes(fill=satfin))
## `geom_smooth()` using method = 'gam'
Observations:
The Not At All Satisfied group had a higher mean number of children in the earlier period before 1984 and again from 1998 through the end of the data. This could be related to broader factors such as the macroeconomy and the rising cost of education during those periods.
The Satisfied group had a higher mean than the Not At All Satisfied group for about 8 years (1984-1988 and 1992-1996).
The late 80s and early 90s show a convergence of the two means, but overall, the Not At All Satisfied group has had the higher mean in many more years.
plot(table(gss_satfin$childs,gss_satfin$satfin))
Observations:
When respondents report fewer than 5 children, we observe fairly equal proportions between the Satisfied and Not At All Satisfied groups. This rule of thumb is broken for families with 1 child. There may be outlier cases here, or other factors such as family structure and the number of parents in a household, that could be explored in future analysis.
When the number of children is 5 or more, the Not At All Satisfied group becomes the clear majority.
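The proportions read off the mosaic plot can also be computed directly. The sketch below shows the idea with prop.table() on a small made-up table of counts (hypothetical values, not GSS data); with the real data, the same call could be applied to table(gss_satfin$childs, gss_satfin$satfin).

```r
# Toy counts, made up for illustration only (not GSS values)
toy_counts <- matrix(
  c(40, 60,
    30, 30,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(childs = c("0", "1", "5"),
                  satfin = c("Satisfied", "Not At All Sat"))
)

# margin = 1 divides each cell by its row total, so every row sums to 1
row_props <- prop.table(toy_counts, margin = 1)
round(row_props, 2)
```

Reading across a row then gives the share of each satisfaction group among respondents with that number of children.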
Null hypothesis: The number of children and financial satisfaction are independent.
Alternative hypothesis: The number of children and financial satisfaction are dependent.
As we have two satisfaction groups and nine number-of-children groups (0 through 8), the hypothesis test to be performed is the chi-square test of independence.
Independence: The GSS data come from a random sample survey, so it is reasonable to assume that the records are independent.
Expected Counts:
chisq.test(gss_satfin$childs,gss_satfin$satfin)$expected
##                  gss_satfin$satfin
## gss_satfin$childs Satisfied Not At All Sat
##                 0 4113.9809      3735.0191
##                 1 2380.6474      2161.3526
##                 2 3742.8892      3398.1108
##                 3 2395.8474      2175.1526
##                 4 1276.2828      1158.7172
##                 5  623.2034       565.7966
##                 6  304.5258       276.4742
##                 7  183.4493       166.5507
##                 8  275.1739       249.8261
From the above table, the expected counts are above the minimum required of 5 for each cell.
All the conditions to perform chi-square test of independence are satisfied.
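For reference, each expected count under independence is the row total times the column total, divided by the grand total. The sketch below reproduces that calculation on a small made-up table (hypothetical counts, not GSS data) and checks it against what chisq.test() returns.

```r
# Toy counts, made up for illustration only (not GSS values)
toy_counts <- matrix(
  c(40, 60,
    30, 30,
    10, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(childs = c("0", "1", "5"),
                  satfin = c("Satisfied", "Not At All Sat"))
)

# Expected count for each cell: (row total * column total) / grand total
n <- sum(toy_counts)
expected <- outer(rowSums(toy_counts), colSums(toy_counts)) / n
expected

# Condition check: every expected count should be at least 5
all(expected >= 5)

# The hand calculation should agree with chisq.test()
all.equal(unname(expected), unname(chisq.test(toy_counts)$expected))
```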
chisq.test(gss_satfin$childs,gss_satfin$satfin)
##
## Pearson's Chi-squared test
##
## data: gss_satfin$childs and gss_satfin$satfin
## X-squared = 93.573, df = 8, p-value < 2.2e-16
With a p-value of almost zero, there is strong evidence to reject the null hypothesis. Hence, we have convincing evidence that the number of children in a family and financial satisfaction are dependent in the U.S.
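As a rough cross-check, the p-value can be recovered from the reported statistic and its degrees of freedom. The sketch below takes X-squared = 93.573 from the output above and computes the degrees of freedom as (rows - 1) x (columns - 1) for nine child-count values (0 through 8) and two satisfaction levels; the variable names are my own.

```r
# Statistic as reported by chisq.test() in this report
x_squared <- 93.573

# Degrees of freedom: (9 child-count values - 1) * (2 satisfaction levels - 1)
df <- (9 - 1) * (2 - 1)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)
p_value

# R appears to print "< 2.2e-16" when the p-value falls below machine epsilon
p_value < .Machine$double.eps
```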
Considering the rejection of the null hypothesis, it will be wise for me to consider the overall number of children that I would like to have when it comes to financial satisfaction over a lifetime. It could be beneficial to add additional factors in follow-on research (e.g., income of respondent, age of respondent, overall life satisfaction of respondent) in order to properly understand the components that could allow one to maximize financial and overall well-being while also maximizing family size.