My research question is whether more educated people are more satisfied with their financial situation. I care about whether more education can make people happier. There is growing evidence that people can improve their satisfaction through education, art, or sports, so it is important to find out whether those activities matter for happiness.
I chose the dataset provided by the course instructor, so I follow its citation.
A case in this dataset is the unit of observation: one person who took the survey. Each case has a unique caseid.
I am going to use data from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The codebook below lists all variables, the values they take, and the survey questions associated with them. There are a total of 57,061 cases and 114 variables in this dataset. Note that this is a cumulative data file for surveys conducted between 1972 and 2012 and that not all respondents answered all questions in all years.
GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.
This is an observational study: the data come from a random-sample survey, not from an experiment with random assignment.
The population of interest is the residents of the United States. Since the data were collected by random sampling, the analysis can be generalized to that population.
A potential source of bias is that the sampling is not fully random: some cohorts of the population might be missing from the dataset because they are unwilling to take the survey. If information from those cohorts is missing, our model’s parameter estimates will be biased.
Since this research is not based on a random assignment experiment, it cannot establish a causal effect of education on satisfaction.
We are going to look at two categorical variables: education (degree) and financial satisfaction (satfin).
The summary statistics and bar charts are as follows.
RS HIGHEST DEGREE
If finished 9th-12th grade: Did you ever get a high school diploma or a GED certificate?

VALUE | LABEL |
---|---|
0 | LT HIGH SCHOOL |
1 | HIGH SCHOOL |
2 | JUNIOR COLLEGE |
3 | BACHELOR |
4 | GRADUATE |
NA | IAP |
NA | DK |
NA | NA |

Data type: numeric Missing-data codes: 7,8,9 Record/column: 1/36
library(ggplot2)
library(knitr)
qplot(degree, data = gss, geom = "bar", main = "Distribution of education")
The bar chart shows that the distribution of education in the US is right-skewed: most respondents are in the lower degree categories, and a high school diploma is the most common highest degree.
SATISFACTION WITH FINANCIAL SITUATION
We are interested in how people are getting along financially these days. So far as you and your family are concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied at all?

VALUE | LABEL |
---|---|
NA | IAP |
1 | SATISFIED |
2 | MORE OR LESS |
3 | NOT AT ALL SAT |
NA | DK |
NA | NA |

Data type: numeric Missing-data codes: 0,8,9 Record/column: 1/169
qplot(satfin, data = gss, geom = "bar", main = "Distribution of satisfaction with financial situation")
The bar chart shows that the distribution of financial satisfaction is roughly symmetric, with "more or less satisfied" the most common answer.
This contingency table shows that groups with different education levels have different levels of satisfaction.
kable(addmargins(with(gss, table(degree, satfin))))
Degree | Satisfied | More Or Less | Not At All Sat | Sum |
---|---|---|---|---|
Lt High School | 3065 | 4710 | 3388 | 11163 |
High School | 7162 | 12068 | 7670 | 26900 |
Junior College | 669 | 1332 | 727 | 2728 |
Bachelor | 2669 | 3171 | 1393 | 7233 |
Graduate | 1504 | 1473 | 497 | 3474 |
Sum | 15069 | 22754 | 13675 | 51498 |
This stacked bar chart shows the counts in each education group, split by satisfaction level.
ggplot(gss, aes(x=degree, fill=satfin)) + geom_bar()
This table shows the distribution of satisfaction within each education group, as proportions.
If education and satisfaction were in fact independent, we would expect to see the same proportions across the education groups.
kable(with(gss, prop.table(table(satfin, degree),2)))
Satisfaction | Lt High School | High School | Junior College | Bachelor | Graduate |
---|---|---|---|---|---|
Satisfied | 0.2746 | 0.2662 | 0.2452 | 0.3690 | 0.4329 |
More Or Less | 0.4219 | 0.4486 | 0.4883 | 0.4384 | 0.4240 |
Not At All Sat | 0.3035 | 0.2851 | 0.2665 | 0.1926 | 0.1431 |
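This graph shows that the proportions are not the same across groups. Broadly, the higher the degree, the more satisfied the respondents: the satisfied share is about 25-27% up to junior college but 37% for bachelor's and 43% for graduate degrees, while the not-at-all-satisfied share falls from 30% to 14%.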
barplot(with(gss, prop.table(table(satfin, degree),2)), col=c("#7fc97f","#beaed4","#fdc086"),main = "Degree and Financial Satisfaction", legend = TRUE, args.legend = list(x = "topleft"))
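As a rough check on the independence idea, the sketch below compares each degree group's satisfaction shares with the overall shares that every group would have if the two variables were independent. It is only an illustration: the counts are typed in from the contingency table above instead of being read from `gss`, and the names `counts`, `within_degree`, `overall`, and `gap` are made up for this example.

```r
# Counts copied from the contingency table (rows: degree, columns: satisfaction).
counts <- matrix(
  c(3065,  4710, 3388,
    7162, 12068, 7670,
     669,  1332,  727,
    2669,  3171, 1393,
    1504,  1473,  497),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    satfin = c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Satisfaction shares within each degree group (each row sums to 1).
within_degree <- prop.table(counts, margin = 1)

# Overall satisfaction shares; under independence every group would match these.
overall <- colSums(counts) / sum(counts)

# Gap between each group's shares and the overall shares.
gap <- sweep(within_degree, 2, overall)
round(gap, 3)
```

Positive entries in `gap` mark groups that are over-represented in a satisfaction level relative to the sample as a whole, and negative entries mark groups that are under-represented.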
This mosaic plot shows both the counts and the proportions.
mosaicplot(with(gss, table(degree, satfin)), main = "Degree and Financial Satisfaction", color = c("#7fc97f","#beaed4","#fdc086"), las = 1)
Our pre-estimation findings from the exploratory analysis suggest that education and satisfaction are not independent. We are going to test this hypothesis.
Since we have two categorical variables, we are going to do a hypothesis test: the chi-square test of independence.
H_0: Satisfaction and education are independent.
H_A: Satisfaction and education are dependent.
Conditions for the chi-square test:
1. Independence: sampled observations must be independent.
   - Random sample/assignment: the GSS data are from a random sample.
   - If sampling without replacement, n < 10% of the population: the sample size is less than 10% of the US population.
   - Each case contributes to only one cell in the table: each respondent belongs to exactly one cell.
2. Sample size: each particular scenario (i.e. cell) must have at least 5 expected cases. Every cell here has well over 5 expected cases.
All conditions are satisfied.
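The sample-size condition can be checked by hand: under independence, the expected count in a cell is (row total × column total) / grand total. The sketch below is a minimal illustration of that calculation, assuming the counts are typed in from the contingency table above; `counts` and `expected` are names chosen for this example only.

```r
# Counts copied from the contingency table (rows: degree, columns: satisfaction).
counts <- matrix(
  c(3065,  4710, 3388,
    7162, 12068, 7670,
     669,  1332,  727,
    2669,  3171, 1393,
    1504,  1473,  497),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    satfin = c("Satisfied", "More Or Less", "Not At All Sat")
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 1)

# The condition holds when every expected count is at least 5.
all(expected >= 5)
```

The smallest expected count is in the Junior College / Not At All Sat cell, which is still in the hundreds, so the condition is met with a wide margin.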
In a chi-square test of independence, we have data on two variables (two columns of data) and evaluate the relationship between them to determine whether they are independent or dependent. That is our situation, so this is the test we use.
Since the expected-count condition is met, we first run the chi-square test with the theoretical method.
# load the inference function:
source("http://bit.ly/dasi_inference")
inference(gss$satfin, gss$degree, est = "proportion", type = "ht", method = "theoretical", alternative = "greater", siglevel = 0.05)
## Warning: package 'BHH2' was built under R version 3.0.3
## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
##
## Summary statistics:
## x
## y Lt High School High School Junior College Bachelor
## Satisfied 3065 7162 669 2669
## More Or Less 4710 12068 1332 3171
## Not At All Sat 3388 7670 727 1393
## Sum 11163 26900 2728 7233
## x
## y Graduate Sum
## Satisfied 1504 15069
## More Or Less 1473 22754
## Not At All Sat 497 13675
## Sum 3474 51498
## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
## x
## y Lt High School High School Junior College Bachelor
## Satisfied 3266 7871 798.2 2116
## More Or Less 4932 11886 1205.3 3196
## Not At All Sat 2964 7143 724.4 1921
## x
## y Graduate
## Satisfied 1016.5
## More Or Less 1535.0
## Not At All Sat 922.5
##
## Pearson's Chi-squared test
##
## data: y_table
## X-squared = 944.8, df = 8, p-value < 2.2e-16
The p-value is below 2.2e-16, so we reject the null hypothesis in favor of the alternative hypothesis.
The p-value is P(observed or more extreme outcome | H_0 true). If education and satisfaction were in fact independent, there would be almost no chance of obtaining a random sample of 51,498 respondents (the cases with both variables recorded) whose chi-square statistic is 944.8 or larger.
The results provide convincing evidence that education and satisfaction are dependent.
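To make the reported statistic less of a black box, the sketch below recomputes it by hand as the sum of (observed - expected)^2 / expected over all cells, with (rows - 1) × (columns - 1) degrees of freedom, and compares it with base R's `chisq.test()`. It assumes the counts are typed in from the summary table above; the variable names are illustrative and are not part of the course's `inference()` function.

```r
# Counts copied from the contingency table (rows: degree, columns: satisfaction).
counts <- matrix(
  c(3065,  4710, 3388,
    7162, 12068, 7670,
     669,  1332,  727,
    2669,  3171, 1393,
    1504,  1473,  497),
  nrow = 5, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson chi-square statistic and its degrees of freedom.
stat <- sum((counts - expected)^2 / expected)
df   <- (nrow(counts) - 1) * (ncol(counts) - 1)

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

# Cross-check against base R's built-in test.
check <- chisq.test(counts)
c(by_hand = stat, built_in = unname(check$statistic), df = df)
```

The test is upper-tailed by construction, because any departure from independence, in either direction, makes the statistic larger.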
We also run the test with the simulation method for comparison.
inference(gss$satfin, gss$degree, est = "proportion", type = "ht", method = "simulation", alternative = "greater", siglevel = 0.05, eda_plot = FALSE)
The results from the two methods agree: the simulation p-value of 9.999e-05 is almost zero, so we again reject the null hypothesis.
We conclude from our evidence that education and satisfaction are dependent. However, since this is not a random assignment experiment, we cannot conclude that more education causes more satisfaction. It is possible that a confounding variable, such as IQ or family income, causes education and satisfaction to move in the same direction.
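For readers without access to the course's `inference()` function, a simulation-based p-value can also be sketched with base R. This is a swapped-in technique, not necessarily what `inference()` does internally: `chisq.test()` with `simulate.p.value = TRUE` draws random tables with the same row and column totals and counts how often their statistic is at least as large as the observed one. The counts are typed in from the table above, and the seed and the number of replicates `B` are arbitrary choices for this example.

```r
# Counts copied from the contingency table (rows: degree, columns: satisfaction).
counts <- matrix(
  c(3065,  4710, 3388,
    7162, 12068, 7670,
     669,  1332,  727,
    2669,  3171, 1393,
    1504,  1473,  497),
  nrow = 5, byrow = TRUE
)

set.seed(1)   # arbitrary seed, only for reproducibility
B <- 2000     # number of simulated tables (arbitrary for this sketch)

# Monte Carlo chi-square test with fixed table margins.
sim <- chisq.test(counts, simulate.p.value = TRUE, B = B)
sim$p.value

# Smallest p-value this procedure can report with B replicates.
1 / (B + 1)
```

A simulated p-value can never be exactly zero; it is bounded below by 1 / (B + 1), so a very small reported value mainly reflects the number of replicates used.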
Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1
Persistent URL: http://doi.org/10.3886/ICPSR34802.v1