More Education, More Satisfaction – Data Analysis Project

Introduction:

My research question is whether more educated people are more satisfied with their financial situation. I care about whether people can be happier with more education. There is growing evidence that people can improve their satisfaction through education, art, or sports, so it is important to find out whether those activities really matter for happiness.

1. Data:

I chose the dataset provided by the course instructor, and I follow its citation (see References).

Each case in this dataset is one unit of observation: one person who took the survey. Each case has a unique caseid.

How the data were collected.

I am going to use data from the General Social Survey (GSS), a sociological survey used to collect data on demographic characteristics and attitudes of residents of the United States. The GSS codebook lists all variables, the values they take, and the survey questions associated with them. There are a total of 57,061 cases and 114 variables in this dataset. Note that this is a cumulative data file for surveys conducted between 1972 and 2012, and that not all respondents answered all questions in all years.

GSS questions cover a diverse range of issues including national spending priorities, marijuana use, crime and punishment, race relations, quality of life, confidence in institutions, and sexual behavior.

Study: What is the type of study?

This is an observational study: the data come from a survey based on random sampling, not from an experiment with random assignment.

Scope of inference - generalizability

The population of interest is the residents of the United States. Since the data were collected by random sampling, the results of the analysis can be generalized to that population.

A potential source of bias is that the sampling is not fully random: some groups of the population may be missing from the dataset because they are unwilling to take the survey. If information from those groups is missing from the dataset, our estimates will be biased.

Scope of inference - causality

Since this research is not based on a random assignment experiment, it cannot establish a causal effect of education on satisfaction.

2. Exploratory data analysis:

This section calculates and discusses the relevant descriptive statistics, including summary statistics and visualizations of the data, and addresses what the exploratory data analysis suggests about the research question. We are going to look at two categorical variables: education and financial satisfaction.

The summary statistics are calculated, and the bar charts are as follows.

1. Categorical data : DEGREE

RS HIGHEST DEGREE

If finished 9th-12th grade: Did you ever get a high school diploma or a GED certificate?

VALUE	LABEL
0	LT HIGH SCHOOL
1	HIGH SCHOOL
2	JUNIOR COLLEGE
3	BACHELOR
4	GRADUATE
NA	IAP
NA	DK
NA	NA

Data type: numeric Missing-data codes: 7,8,9 Record/column: 1/36

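library(ggplot2)
library(knitr)
# Assumes the GSS data frame `gss` is already loaded in the session.
# degree is categorical, so a bar chart (geom_bar) is used instead of a histogram.
ggplot(gss, aes(x = degree)) + geom_bar() + ggtitle("Distribution of education")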

[Plot: bar chart of the distribution of education]

The bar chart shows that the distribution of education in the US is right-skewed. For most respondents, the highest degree is a high school diploma.

2. Categorical data : SATFIN

SATISFACTION WITH FINANCIAL SITUATION

We are interested in how people are getting along financially these days. So far as you and your family are concerned, would you say that you are pretty well satisfied with your present financial situation, more or less satisfied, or not satisfied at all?

VALUE	LABEL
NA	IAP
1	SATISFIED
2	MORE OR LESS
3	NOT AT ALL SAT
NA	DK
NA	NA

Data type: numeric Missing-data codes: 0,8,9 Record/column: 1/169

# satfin is categorical, so a bar chart (geom_bar) is used instead of a histogram.
ggplot(gss, aes(x = satfin)) + geom_bar() + ggtitle("Distribution of satisfaction with financial situation")

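[Plot: bar chart of the distribution of satisfaction with financial situation]

The bar chart shows that the distribution of financial satisfaction in the US is roughly symmetric, with its peak in the middle category. The most common answer is "more or less satisfied".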

Visualization: The relationship between education and financial satisfaction

This contingency table shows that groups with different levels of education have different levels of satisfaction.

kable(addmargins(with(gss, table(degree, satfin))))

	Satisfied	More Or Less	Not At All Sat	Sum
Lt High School	3065	4710	3388	11163
High School	7162	12068	7670	26900
Junior College	669	1332	727	2728
Bachelor	2669	3171	1393	7233
Graduate	1504	1473	497	3474
Sum	15069	22754	13675	51498

This stacked bar chart shows the counts in each combination of degree and satisfaction.

ggplot(gss, aes(x=degree,  fill=satfin)) + geom_bar()

[Plot: stacked bar chart of degree by financial satisfaction]

This table shows the proportion of each satisfaction level within each education group.

If education and satisfaction were in fact independent, we would expect to see the same proportions across the education groups.

kable(with(gss, prop.table(table(satfin, degree),2)))

	Lt High School	High School	Junior College	Bachelor	Graduate
Satisfied	0.2746	0.2662	0.2452	0.3690	0.4329
More Or Less	0.4219	0.4486	0.4883	0.4384	0.4240
Not At All Sat	0.3035	0.2851	0.2665	0.1926	0.1431

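This graph shows that the proportions are not the same across groups. The share of "not at all satisfied" falls with each degree level, and the share of "satisfied" is clearly highest among bachelor's and graduate degree holders. Overall, the higher the education, the more satisfied.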
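As a check on the proportions table, the column proportions can be recomputed directly from the counts in the contingency table. This is only a sketch: the counts are typed in by hand rather than taken from the gss data frame, and the object names (`counts`, `col_props`, `overall`) are assumptions made for illustration.

```r
# Counts copied from the contingency table (rows: satfin, columns: degree).
counts <- matrix(
  c(3065,  7162,  669, 2669, 1504,
    4710, 12068, 1332, 3171, 1473,
    3388,  7670,  727, 1393,  497),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin = c("Satisfied", "More Or Less", "Not At All Sat"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Proportion of each satisfaction level within each degree group
# (margin = 2 normalizes by column, so every column sums to 1).
col_props <- prop.table(counts, margin = 2)
print(round(col_props, 4))

# Overall share of each satisfaction level, ignoring degree.
# Under independence every column of col_props would be close to this vector.
overall <- rowSums(counts) / sum(counts)
print(round(overall, 4))
```

Comparing each column of `col_props` with `overall` shows which education groups deviate most from what independence would predict.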

barplot(with(gss, prop.table(table(satfin, degree),2)), col=c("#7fc97f","#beaed4","#fdc086"),main = "Degree and Financial Satisfaction", legend = TRUE, args.legend = list(x = "topleft"))

[Plot: bar plot of the proportions, "Degree and Financial Satisfaction"]

This mosaic plot shows both the counts and the proportions.

mosaicplot(with(gss, table(degree, satfin)), main = "Degree and Financial Satisfaction", color = c("#7fc97f","#beaed4","#fdc086"), las = 1)

[Plot: mosaic plot, "Degree and Financial Satisfaction"]

Pre-estimation Result

Our pre-estimation findings from the exploratory analysis suggest that education and satisfaction are not independent. We are going to test this formally.

3. Inference: a chi-square test of independence

Since we have two categorical variables, we are going to do a hypothesis test: a chi-square test of independence.

Hypotheses

H_0: Satisfaction and education are independent.
H_A: Satisfaction and education are dependent.

Check conditions

Conditions for the chi-square test:

1. Independence: Sampled observations must be independent.
- random sample/assignment: GSS data are from a random sample.
- if sampling without replacement, n < 10% of population: the sample size is less than 10% of the US population.
- each case only contributes to one cell in the table: each subject belongs to only one cell in the table.
2. Sample size: Each particular scenario (i.e. cell) must have at least 5 expected cases.
- Each cell (particular scenario) has more than 5 expected cases.

All conditions are satisfied.
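The sample size condition can be checked directly. Under independence, the expected count in a cell is (row total x column total) / grand total. The sketch below applies this to the counts from the contingency table, which are typed in by hand; the object names are assumptions made for illustration.

```r
# Counts copied from the contingency table (rows: satfin, columns: degree).
counts <- matrix(
  c(3065,  7162,  669, 2669, 1504,
    4710, 12068, 1332, 3171, 1473,
    3388,  7670,  727, 1393,  497),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    satfin = c("Satisfied", "More Or Less", "Not At All Sat"),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Expected count per cell under independence:
# (row total * column total) / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
print(round(expected, 1))

# The condition requires at least 5 expected cases in every cell.
min_expected <- min(expected)
condition_met <- all(expected >= 5)
print(min_expected)
print(condition_met)
```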

The method(s) to be used and why and how

We have data on two categorical variables (two columns of data), and we want to evaluate the relationship between them to determine whether they are independent or dependent. A chi-square test of independence is designed for exactly this, so that is the test we use.
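To make the "how" concrete: the test statistic is the sum over all cells of (observed - expected)^2 / expected, and it is compared against a chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom. The sketch below computes both by hand from the counts in the contingency table. The counts are typed in manually and the object names are assumptions made for illustration.

```r
# Observed counts from the contingency table (rows: satfin, columns: degree).
observed <- matrix(
  c(3065,  7162,  669, 2669, 1504,
    4710, 12068, 1332, 3171, 1473,
    3388,  7670,  727, 1393,  497),
  nrow = 3, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

print(chi_sq)
print(df)
```

A large statistic relative to the degrees of freedom means the observed counts are far from what independence would produce.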

Perform inference

Since the expected sample size condition is met, we first run the chi-square test with the theoretical method.

# load the inference function:
source("http://bit.ly/dasi_inference")
# siglevel is the significance level, so it is set to 0.05 rather than 0.95
inference(gss$satfin, gss$degree, est = "proportion", type = "ht", method = "theoretical", alternative = "greater", siglevel = 0.05)

## Warning: package 'BHH2' was built under R version 3.0.3

## Response variable: categorical, Explanatory variable: categorical
## Chi-square test of independence
## 
## Summary statistics:
##                 x
## y                Lt High School High School Junior College Bachelor
##   Satisfied                3065        7162            669     2669
##   More Or Less             4710       12068           1332     3171
##   Not At All Sat           3388        7670            727     1393
##   Sum                     11163       26900           2728     7233
##                 x
## y                Graduate   Sum
##   Satisfied          1504 15069
##   More Or Less       1473 22754
##   Not At All Sat      497 13675
##   Sum                3474 51498

## H_0: Response and explanatory variable are independent.
## H_A: Response and explanatory variable are dependent.
## Check conditions: expected counts
##                 x
## y                Lt High School High School Junior College Bachelor
##   Satisfied                3266        7871          798.2     2116
##   More Or Less             4932       11886         1205.3     3196
##   Not At All Sat           2964        7143          724.4     1921
##                 x
## y                Graduate
##   Satisfied        1016.5
##   More Or Less     1535.0
##   Not At All Sat    922.5
## 
##  Pearson's Chi-squared test
## 
## data:  y_table
## X-squared = 944.8, df = 8, p-value < 2.2e-16

[Plot: output of the inference function]

Interpret results

The p-value is less than 2.2e-16, so we reject the null hypothesis in favor of the alternative hypothesis.

The p-value is P(observed or more extreme outcome | H0 true). It means there is almost a 0% chance of obtaining a random sample of 51,498 Americans (the respondents with an answer for both variables, out of 57,061 cases) in which education and satisfaction appear at least as strongly associated as in our data, if in fact education and satisfaction are independent.

The results provide convincing evidence that education and satisfaction are dependent.
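The same test can be reproduced with base R's `chisq.test` applied to the table of counts, which is a useful cross-check on the course's inference function. The sketch below types the counts in by hand (the object names are assumptions made for illustration) and also recomputes the p-value as the upper tail of the chi-square distribution.

```r
# Counts copied from the contingency table (rows: satfin, columns: degree).
counts <- matrix(
  c(3065,  7162,  669, 2669, 1504,
    4710, 12068, 1332, 3171, 1473,
    3388,  7670,  727, 1393,  497),
  nrow = 3, byrow = TRUE
)

# Pearson's chi-squared test of independence on the 3 x 5 table.
res <- chisq.test(counts)
print(res)

# The p-value is the upper-tail area of the chi-square distribution
# beyond the observed statistic, with the test's degrees of freedom.
p_manual <- pchisq(unname(res$statistic),
                   df = unname(res$parameter),
                   lower.tail = FALSE)
print(p_manual)
```

R prints very small p-values as "< 2.2e-16"; `res$p.value` holds the underlying number.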

Whether results from various methods agree

We run the chi-square test with the simulation method for comparison.

# siglevel is the significance level, so it is set to 0.05 rather than 0.95
inference(gss$satfin, gss$degree, est = "proportion", type = "ht", method = "simulation", alternative = "greater", siglevel = 0.05, eda_plot = FALSE)

The results from the two methods agree: the simulation p-value is 9.999e-05, which is almost zero, so we again reject the null hypothesis.
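A simulation-based p-value can also be obtained with base R's `chisq.test(..., simulate.p.value = TRUE)`, which is sketched below on the hand-typed counts. This is a stand-in for the course's inference function, not a copy of it. The number of replicates `B` and the seed are arbitrary choices for illustration. With `B` replicates, the smallest p-value this method can report is 1 / (B + 1). With 10,000 replicates that floor is 1 / 10001, about 9.999e-05, which would match the value reported above if the inference function used 10,000 simulations (an assumption; the number is not stated in the output).

```r
# Counts copied from the contingency table (rows: satfin, columns: degree).
counts <- matrix(
  c(3065,  7162,  669, 2669, 1504,
    4710, 12068, 1332, 3171, 1473,
    3388,  7670,  727, 1393,  497),
  nrow = 3, byrow = TRUE
)

set.seed(1)   # arbitrary seed, only for reproducibility
B <- 2000     # number of simulated tables (arbitrary, kept small for speed)

# Monte Carlo chi-square test: tables are simulated with the observed
# row and column totals held fixed, i.e. under independence.
sim <- chisq.test(counts, simulate.p.value = TRUE, B = B)
print(sim$p.value)

# Smallest p-value the simulation can report with B replicates.
p_floor <- 1 / (B + 1)
print(p_floor)
```

Because none of the simulated tables is as extreme as the observed one, the simulated p-value sits at this floor, so it should be read as "smaller than the simulation can resolve" rather than as an exact probability.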

4. Conclusion:

We conclude that education and financial satisfaction are dependent, according to our evidence. However, since this is not a random assignment experiment, we cannot conclude that more education causes more satisfaction. It is possible that a confounding variable, such as IQ or family income, causes education and satisfaction to move in the same direction.

References

Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972-2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut / Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013-09-11. doi:10.3886/ICPSR34802.v1

More Education, More Satisfaction – Data Analysis Project - Snowdj

2014/10/12