Statistical inference with the GSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss. Delete this note before you submit your work.

load("gss.Rdata")

Part 1: Data

This data set is modified for the Coursera Data and Statistical Inference Course, Spring 2014. According to the background note, the data were simplified by removing missing values and creating factor variables where appropriate to facilitate analysis in R. The study was funded by the National Science Foundation, a reliable research foundation, which should limit political influence. The unit of the survey is the individual: all noninstitutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. The survey modes are computer-assisted personal interview (CAPI), face-to-face interview, and telephone interview, which helps reduce inclusion bias. The database contains 57,061 observations, which is less than 10% of the US population, so the observations can be treated as independent of one another. The sample size by itself does not show that the sample is random; that depends on how respondents were selected. Because the survey is observational, the analysis below can show associations but not causation.

Part 2: Research question

The problem of education is very interesting to me. I strongly believe in the transforming and creative power of education. I believe that people with a higher education live more prosperous lives. Now it’s time to check my beliefs.

Is it true that people with the lowest level of education receive welfare more often than those who have graduated from university? Can we conclude that those who have the lowest levels of education tend to spend more hours in front of the TV? Could we assume that those who have the lowest level of education and spend the most time in front of the TV are more likely to find themselves in a life situation where they will receive welfare?

Response variable:
getaid EVER RECEIVED WELFARE?
1 YES
2 NO

tvhours HOURS PER DAY WATCHING TV (numerical data)

Explanatory variable(s):

degree HIGHEST DEGREE
0 LT HIGH SCHOOL
1 HIGH SCHOOL
2 JUNIOR COLLEGE
3 BACHELOR
4 GRADUATE

Part 3: Exploratory data analysis

Let’s find the proportion of people with each level of education in this survey and build a plot.

 level_education<- gss%>%
  select(degree)%>%
  filter(!is.na(degree))

level_education<-level_education%>%
  group_by(degree)%>%
  summarise(count = n())%>%
  mutate(prop = count / sum(count))

level_education

## # A tibble: 5 × 3
##           degree count       prop
##           <fctr> <int>      <dbl>
## 1 Lt High School 11822 0.21091506
## 2    High School 29287 0.52250629
## 3 Junior College  3070 0.05477155
## 4       Bachelor  8002 0.14276284
## 5       Graduate  3870 0.06904426

ggplot(level_education, aes(degree, prop))+geom_bar(stat = "identity")

Here we can see that people with a high school education are the largest group in the survey. We can therefore expect this group to make up the largest share of those who have received welfare.

Before moving forward with the exploration, we should note that 97% of the survey participants have no recorded answer on whether they had ever received welfare. Because 1,460 is still a large number of responses, we can continue the analysis, keeping this limitation in mind.

na_welfare<- gss%>%
  select(degree, getaid)%>%
  filter(is.na(getaid))%>%
  group_by(degree, getaid)%>%
  summarise(count = n())%>%
  mutate(prop = count / 55601)

pr<-55601/57061
pr

## [1] 0.9744133
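
The share above relies on the hard-coded count 55601. As a small sketch, the same figure can be derived from the counts that summary(gss$getaid) reports later in this report, so nothing has to be typed by hand twice. The names getaid_counts, missing_share and answered are illustrative assumptions; with the data loaded, mean(is.na(gss$getaid)) should give the same share.

```r
# Counts copied from the summary(gss$getaid) output in this report
getaid_counts <- c(Yes = 281, No = 1179, Missing = 55601)

# Share of respondents with no recorded answer
missing_share <- getaid_counts[["Missing"]] / sum(getaid_counts)
missing_share

# Number of respondents who did answer
answered <- sum(getaid_counts[c("Yes", "No")])
answered
```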

na_welfare

## Source: local data frame [6 x 4]
## Groups: degree [6]
## 
##           degree getaid count       prop
##           <fctr> <fctr> <int>      <dbl>
## 1 Lt High School     NA 11446 0.20585961
## 2    High School     NA 28543 0.51335408
## 3 Junior College     NA  3022 0.05435154
## 4       Bachelor     NA  7811 0.14048308
## 5       Graduate     NA  3792 0.06820021
## 6             NA     NA   987 0.01775148

 welfare<- gss%>%
  select(degree, getaid)

welfare2<-welfare%>%
filter(!is.na(getaid), !is.na(degree))%>%
group_by(degree, getaid)%>%
  summarise(count = n())%>%
  mutate(prop = count / 1437)

welfare2

## Source: local data frame [10 x 4]
## Groups: degree [5]
## 
##            degree getaid count        prop
##            <fctr> <fctr> <int>       <dbl>
## 1  Lt High School    Yes   115 0.080027836
## 2  Lt High School     No   261 0.181628392
## 3     High School    Yes   134 0.093249826
## 4     High School     No   610 0.424495477
## 5  Junior College    Yes     4 0.002783577
## 6  Junior College     No    44 0.030619346
## 7        Bachelor    Yes    17 0.011830202
## 8        Bachelor     No   174 0.121085595
## 9        Graduate    Yes     6 0.004175365
## 10       Graduate     No    72 0.050104384

ggplot(welfare2, aes(getaid, count, fill=degree))+
  geom_bar(width = 0.5, stat = "identity")+
  xlab("Ever received welfare?")+
  ylab("Number of respondents")

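At this stage we find that the largest group among those who had ever received welfare is respondents with a high school education. About 19% of the respondents with a recorded answer had ever received welfare, and most of them did not go beyond a high school diploma: 9% (High School) plus 8% (Lt High School) of all respondents. These shares depend on group size, so they do not by themselves show which education level has the highest welfare rate.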
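
Because the research question asks how often each education group receives welfare, it may help to look at the rate within each group as well as the share of the whole sample. The sketch below recomputes that rate from the counts in the welfare2 table; the data frame name welfare_counts and its column names are illustrative assumptions.

```r
# Counts copied from the welfare2 table above
welfare_counts <- data.frame(
  degree = c("Lt High School", "High School", "Junior College",
             "Bachelor", "Graduate"),
  yes = c(115, 134, 4, 17, 6),
  no  = c(261, 610, 44, 174, 72)
)

# Share of each education group that has ever received welfare
welfare_counts$rate <- welfare_counts$yes /
  (welfare_counts$yes + welfare_counts$no)

welfare_counts
```

With the data loaded, the same within-group rate could probably be obtained from welfare2 directly with mutate(rate = count / sum(count)), since that table is still grouped by degree.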

1.2

To answer the second part of the question, we need to find how the time spent in front of the TV is distributed among people with different levels of education.

tv_time<-gss%>%
  select(degree, tvhours)%>%
  filter(!is.na(degree), !is.na(tvhours))

tv_time2<-tv_time%>%
  group_by(degree, tvhours)%>%
  summarise(count = n())%>%
  mutate(props = count / 33291)
tv_time2

## Source: local data frame [98 x 4]
## Groups: degree [5]
## 
##            degree tvhours count        props
##            <fctr>   <int> <int>        <dbl>
## 1  Lt High School       0   252 0.0075696134
## 2  Lt High School       1   855 0.0256826169
## 3  Lt High School       2  1464 0.0439758493
## 4  Lt High School       3  1250 0.0375476856
## 5  Lt High School       4  1056 0.0317202848
## 6  Lt High School       5   673 0.0202156739
## 7  Lt High School       6   503 0.0151091887
## 8  Lt High School       7   112 0.0033642726
## 9  Lt High School       8   262 0.0078699949
## 10 Lt High School       9    25 0.0007509537
## # ... with 88 more rows

ggplot(tv_time2, aes(tvhours, props, color = degree))+
  geom_line( stat = "identity", na.rm =TRUE)

According to this table and plot, people with a lower education level appear to spend more time watching TV. To make this clearer, we compute the mean for each group and draw a boxplot:

mean_hours<-gss%>%
filter(!is.na(tvhours), !is.na(degree))%>%
 group_by(degree) %>%
  summarise(mean_hours = mean(tvhours))
mean_hours

## # A tibble: 5 × 2
##           degree mean_hours
##           <fctr>      <dbl>
## 1 Lt High School   3.768033
## 2    High School   3.037413
## 3 Junior College   2.540138
## 4       Bachelor   2.191993
## 5       Graduate   1.896432

gss%>%
 ggplot(aes(degree, tvhours))+geom_boxplot()

## Warning: Removed 23206 rows containing non-finite values (stat_boxplot).

According to these summaries, the higher the education level, the fewer hours a person tends to spend watching TV.

The last part of the exploration applies an extra condition, having received welfare, to this relationship.

tv_time_wf<-gss%>%
  select(degree, tvhours, getaid)%>%
  filter(!is.na(degree), !is.na(tvhours), getaid == "Yes")

tv_time_wf2<-tv_time_wf%>%
  group_by(degree, tvhours, getaid)%>%
  summarise(count = n())%>%
  mutate(prop = count /275)
 
tv_time_wf2

## Source: local data frame [41 x 5]
## Groups: degree, tvhours [41]
## 
##            degree tvhours getaid count        prop
##            <fctr>   <int> <fctr> <int>       <dbl>
## 1  Lt High School       0    Yes     1 0.003636364
## 2  Lt High School       1    Yes    14 0.050909091
## 3  Lt High School       2    Yes    20 0.072727273
## 4  Lt High School       3    Yes    21 0.076363636
## 5  Lt High School       4    Yes    14 0.050909091
## 6  Lt High School       5    Yes    11 0.040000000
## 7  Lt High School       6    Yes     9 0.032727273
## 8  Lt High School       7    Yes     3 0.010909091
## 9  Lt High School       8    Yes    10 0.036363636
## 10 Lt High School      10    Yes     3 0.010909091
## # ... with 31 more rows

ggplot(tv_time_wf2, aes(tvhours, prop, color = degree))+
  geom_line( stat = "identity", na.rm =TRUE)

According to the plot, people with lower education appear to spend more time watching TV, and the welfare condition makes the difference more obvious.

gss_short<-gss%>%
filter(!is.na(tvhours), !is.na(getaid), !is.na(degree))%>%
 group_by(degree, getaid) %>%
  summarise(mean_hours = mean(tvhours))
gss_short

## Source: local data frame [10 x 3]
## Groups: degree [?]
## 
##            degree getaid mean_hours
##            <fctr> <fctr>      <dbl>
## 1  Lt High School    Yes   4.552632
## 2  Lt High School     No   3.542308
## 3     High School    Yes   3.970149
## 4     High School     No   2.901478
## 5  Junior College    Yes   1.500000
## 6  Junior College     No   2.159091
## 7        Bachelor    Yes   2.235294
## 8        Bachelor     No   2.218391
## 9        Graduate    Yes   2.166667
## 10       Graduate     No   1.887324

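At this stage we can see a tendency: in most education groups, people who have ever received welfare spend more time watching TV. Junior College is the exception, but the welfare table above shows only 4 welfare recipients in that group.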

Part 4: Inference

Hypotheses
4.1

The objective of this hypothesis test is to determine whether welfare status is associated with education level.

H0: degree and getaid are independent
HA: degree and getaid are dependent

To answer this, we can use the chi-square test of independence, because we are evaluating the relationship between two categorical variables, one of which has more than two levels.

Conditions

There are two conditions that must be checked before performing a chi-square test:

Independence. Each case that contributes a count to the table must be independent of all the other cases in the table. Each respondent appears in only one cell, and the sample is less than 10% of the population, so we treat the cases as independent.

Sample size / distribution. Each particular scenario (i.e. cell count) must have at least 5 expected cases.

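The summary function shows how many cases fall into each level of the two variables:

summary(gss$degree)

## Lt High School    High School Junior College       Bachelor       Graduate 
##          11822          29287           3070           8002           3870 
##           NA's 
##           1010

summary(gss$getaid)

##   Yes    No  NA's 
##   281  1179 55601

Every level has a large number of cases, and the expected counts in the inference output below are all at least 5 (the smallest is about 9.2), so the conditions are met.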

inference(y = getaid , x = degree, data = gss, statistic = "proportion", type = "ht",  
          alternative = "greater", success = "Graduate", method = "theoretical")

## Response variable: categorical (2 levels) 
## Explanatory variable: categorical (5 levels) 
## Observed:
##                 y
## x                Yes  No
##   Lt High School 115 261
##   High School    134 610
##   Junior College   4  44
##   Bachelor        17 174
##   Graduate         6  72
## 
## Expected:
##                 y
## x                       Yes        No
##   Lt High School  72.217119 303.78288
##   High School    142.897704 601.10230
##   Junior College   9.219207  38.78079
##   Bachelor        36.684760 154.31524
##   Graduate        14.981211  63.01879
## 
## H0: degree and getaid are independent
## HA: degree and getaid are dependent
## chi_sq = 55.4515, df = 4, p_value = 0

Because larger chi-square values correspond to stronger evidence against the null hypothesis, the upper tail is shaded to represent the p-value, which is reported as 0 because it is extremely small. We reject the null hypothesis with such a small p-value. In other words, the data provide convincing evidence of dependence between level of education and welfare status.
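
The expected counts and the test statistic in the output above can be reproduced by hand, which makes the "at least 5 expected cases" condition easy to check. The sketch below uses base R only and the observed counts copied from that output; the object names are illustrative assumptions, and this is not how inference() is implemented internally.

```r
# Observed counts copied from the inference() output above
observed <- matrix(
  c(115, 261,
    134, 610,
      4,  44,
     17, 174,
      6,  72),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    getaid = c("Yes", "No")
  )
)

# Expected count for a cell = row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic and degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail p-value
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

round(expected, 2)
c(chi_sq = chi_sq, df = df, p_value = p_value)
```

The p-value is not exactly zero; it is simply too small to show at the precision used in the report.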

4.2

Hypotheses

The objective of this hypothesis test is to determine whether the number of hours spent watching TV is associated with education level.

H0: tvhours and degree are independent
HA: tvhours and degree are dependent

To answer this, we can use the chi-square test of independence, treating each number of TV hours as a category, because we are evaluating the relationship between two variables handled as categorical.

Conditions

There are two conditions that must be checked before performing a chi-square test:

Independence. Each case that contributes a count to the table must be independent of all the other cases in the table. Each respondent appears in only one cell, and the sample is less than 10% of the population, so we treat the cases as independent.

Sample size / distribution. Each particular scenario (i.e. cell count) must have at least 5 expected cases.

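The summary function shows how many cases fall into each level of degree:

summary(gss$degree)

## Lt High School    High School Junior College       Bachelor       Graduate 
##          11822          29287           3070           8002           3870 
##           NA's 
##           1010

Every degree level is large, but that is not enough: the condition applies to each cell of the tvhours by degree table. Many expected counts for the rarely reported numbers of hours in the output below are under 5, and R warns that the chi-squared approximation may be incorrect. So the sample size condition is only partly met, and the result should be read with caution.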

inference(y = degree , x = tvhours, data = gss, statistic = "proportion", type = "ht",  
          alternative = "greater", success = "Graduate", method = "theoretical")

## Warning: Explanatory variable was numerical, it has been converted
##               to categorical. In order to avoid this warning, first convert
##               your explanatory variable to a categorical variable using the
##               as.factor() function

## Warning in chisq.test(x, y, correct = FALSE): Chi-squared approximation may
## be incorrect

## Response variable: categorical (5 levels) 
## Explanatory variable: categorical (24 levels) 
## Observed:
##     y
## x    Lt High School High School Junior College Bachelor Graduate
##   0             252         720             91      351      219
##   1             855        3080            465     1421      860
##   2            1464        4711            577     1492      664
##   3            1250        3475            343      813      314
##   4            1056        2575            205      401      133
##   5             673        1241             86      141       51
##   6             503         789             61       62       36
##   7             112         180              8       13        4
##   8             262         357             23       30        4
##   9              25          36              1        4        2
##   10            149         143              9       17        5
##   11             15          11              0        3        1
##   12             97         109              5        7        1
##   13              9          13              0        2        0
##   14             21          19              2        1        0
##   15             19          21              1        5        2
##   16             11          15              1        0        0
##   17              1           2              0        0        0
##   18              9           7              2        0        0
##   20              8          16              1        7        1
##   21              1           3              0        0        0
##   22              3           2              0        0        0
##   23              1           0              0        0        0
##   24             11           9              0        1        1
## 
## Expected:
##     y
## x    Lt High School  High School Junior College    Bachelor     Graduate
##   0     333.8989817  860.0829654    92.26736956  234.028506 112.72217716
##   1    1366.0619086 3518.8085068   377.48824006  957.467514 461.17383077
##   2    1821.4158782 4691.7446757   503.31765342 1276.623352 614.89844102
##   3    1266.6896459 3262.8377039   350.02838605  887.817879 427.62638551
##   4     893.5324863 2301.6304707   246.91267910  626.273467 301.65089664
##   5     448.1975309 1154.5020576   123.85185185  314.139918 151.30864198
##   6     296.6855006  764.2255865    81.98404974  207.945721 100.15914211
##   7      64.8168874  166.9603797    17.91105704   45.429906  21.88176985
##   8     138.2215013  356.0416930    38.19518789   96.878916  46.66270163
##   9      13.9039380   35.8148449     3.84211949    9.745216   4.69388123
##   10     66.0437055  170.1205131    18.25006759   46.289778  22.29593584
##   11      6.1340903   15.8006668     1.69505272    4.299360   2.07082995
##   12     44.7788592  115.3448680    12.37388483   31.385329  15.11705866
##   13      4.9072722   12.6405335     1.35604217    3.439488   1.65666396
##   14      8.7921961   22.6476225     2.42957556    6.162416   2.96818960
##   15      9.8145445   25.2810670     2.71208435    6.878976   3.31332793
##   16      5.5206813   14.2206002     1.52554745    3.869424   1.86374696
##   17      0.6134090    1.5800667     0.16950527    0.429936   0.20708300
##   18      3.6804542    9.4804001     1.01703163    2.579616   1.24249797
##   20      6.7474993   17.3807335     1.86455799    4.729296   2.27791295
##   21      0.8178787    2.1067556     0.22600703    0.573248   0.27611066
##   22      1.0223484    2.6334445     0.28250879    0.716560   0.34513833
##   23      0.2044697    0.5266889     0.05650176    0.143312   0.06902767
##   24      4.4983329   11.5871557     1.24303866    3.152864   1.51860863
## 
## H0: tvhours and degree are independent
## HA: tvhours and degree are dependent
## chi_sq = 2690.96, df = 92, p_value = 0

Because larger chi-square values correspond to stronger evidence against the null hypothesis, the upper tail is shaded to represent the p-value, which is reported as 0 because it is extremely small. We reject the null hypothesis with such a small p-value. In other words, the data provide evidence of dependence between hours spent in front of the TV and level of education, with the caveat that several expected cell counts are below 5.
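
One common way to deal with expected counts below 5 is to merge sparse categories before testing. The sketch below is an illustration under assumptions, not part of the original analysis: it rebuilds the observed table from the output above and merges every row from 8 hours upward into one "8+" category. The cut-off of 8 hours and all object names are choices made for this example.

```r
# Observed counts copied from the inference() output above,
# one vector per degree level, rows ordered by hours
hours <- c(0:18, 20:24)
observed <- cbind(
  "Lt High School" = c(252, 855, 1464, 1250, 1056, 673, 503, 112, 262, 25, 149, 15,
                       97, 9, 21, 19, 11, 1, 9, 8, 1, 3, 1, 11),
  "High School"    = c(720, 3080, 4711, 3475, 2575, 1241, 789, 180, 357, 36, 143, 11,
                       109, 13, 19, 21, 15, 2, 7, 16, 3, 2, 0, 9),
  "Junior College" = c(91, 465, 577, 343, 205, 86, 61, 8, 23, 1, 9, 0,
                       5, 0, 2, 1, 1, 0, 2, 1, 0, 0, 0, 0),
  "Bachelor"       = c(351, 1421, 1492, 813, 401, 141, 62, 13, 30, 4, 17, 3,
                       7, 2, 1, 5, 0, 0, 0, 7, 0, 0, 0, 1),
  "Graduate"       = c(219, 860, 664, 314, 133, 51, 36, 4, 4, 2, 5, 1,
                       1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1)
)
rownames(observed) <- hours

# Merge every row from 8 hours upward into a single "8+" category
# (the cut-off is an assumption made for this sketch)
bin <- ifelse(hours >= 8, "8+", as.character(hours))
binned <- rowsum(observed, group = bin, reorder = FALSE)

# Chi-square test of independence on the binned table
binned_test <- chisq.test(binned, correct = FALSE)

min(binned_test$expected)   # smallest expected count after binning
binned_test
```

After binning, every expected count can be compared against the threshold of 5 directly, and the test has far fewer degrees of freedom than the 92 used above.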

4.3

Hypotheses

The objective of this hypothesis test is to determine whether the number of hours spent watching TV is associated with having ever received welfare.

To answer this, we can use a two-sided Student’s t-test, because we are comparing two means (people who have received welfare and people who have not).

H0: mu_Yes = mu_No
HA: mu_Yes != mu_No

Conditions

Independence. The cases must be independent within each group and between the two groups. Each respondent belongs to only one group, and each group is less than 10% of its population, so we treat the cases as independent.

Sample size / distribution. Each group should have at least 30 cases, with no extreme skew relative to the sample size.

The summary function shows the size of each group:

summary(gss$getaid)

##   Yes    No  NA's 
##   281  1179 55601

Both groups are well above 30, so the conditions are met.

inference(y = tvhours, x = getaid, data = gss, statistic = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")

## Response variable: numerical
## Explanatory variable: categorical (2 levels) 
## n_Yes = 280, y_bar_Yes = 4.0571, s_Yes = 3.2775
## n_No = 1176, y_bar_No = 2.8597, s_No = 2.0408
## H0: mu_Yes =  mu_No
## HA: mu_Yes != mu_No
## t = 5.8494, df = 279
## p_value = < 0.0001

Because t values farther from zero correspond to stronger evidence against the null hypothesis, both tails are shaded to represent the p-value, which is less than 0.0001. We reject the null hypothesis with such a small p-value. In other words, the data provide convincing evidence that people who have ever received welfare watch a different average number of TV hours than those who have not; the sample means show that they watch more.
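
The t statistic in the output above can be checked from the reported group sizes, means and standard deviations. The sketch below is a hand calculation in base R under the assumption, consistent with the reported df = 279, that the degrees of freedom are the smaller group size minus 1; the object names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_yes <- 280;  mean_yes <- 4.0571; sd_yes <- 3.2775
n_no  <- 1176; mean_no  <- 2.8597; sd_no  <- 2.0408

# Standard error of the difference between two independent means
se <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# t statistic for H0: mu_Yes = mu_No
t_stat <- (mean_yes - mean_no) / se

# Conservative degrees of freedom: the smaller group size minus 1
df <- min(n_yes, n_no) - 1

# Two-sided p-value: area in both tails
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

c(t = t_stat, df = df, p_value = p_value)
```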

4.3.1

We can also compute a confidence interval for this difference.

inference(y = tvhours, x = getaid, data = gss, statistic = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical")

## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Yes = 280, y_bar_Yes = 4.0571, s_Yes = 3.2775
## n_No = 1176, y_bar_No = 2.8597, s_No = 2.0408
## 95% CI (Yes - No): (0.7945 , 1.6004)

We are 95% confident that people who have ever received welfare watch, on average, 0.8 to 1.6 more hours of TV per day than those who have not.
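
The interval has the form point estimate ± t* × SE. As a rough check, the sketch below rebuilds it from the summary statistics in the output, assuming 279 degrees of freedom (the smaller group size minus 1); the object names are illustrative.

```r
# Summary statistics copied from the inference() output above
n_yes <- 280;  mean_yes <- 4.0571; sd_yes <- 3.2775
n_no  <- 1176; mean_no  <- 2.8597; sd_no  <- 2.0408

# Point estimate and its standard error
point_estimate <- mean_yes - mean_no
se <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# Critical t value for a 95% interval
t_star <- qt(0.975, df = min(n_yes, n_no) - 1)

# Lower and upper bounds of the interval
ci <- point_estimate + c(-1, 1) * t_star * se
round(ci, 4)
```

Because the interval does not contain 0, it agrees with the two-sided test above, which rejected the hypothesis of equal means.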