library(ggplot2)
library(dplyr)
library(statsr)
Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called gss.
load("gss.Rdata")
This data set was modified for the Coursera Data and Statistical Inference course (Spring 2014). According to the background note, the data were simplified by removing missing values and creating factor variables where appropriate, to facilitate analysis in R. The study was funded by the National Science Foundation, a reputable research funder, which reduces the risk of political influence on the results. The unit of observation is the individual: all noninstitutionalized, English- and Spanish-speaking persons 18 years of age or older living in the United States. The survey modes were computer-assisted personal interview (CAPI), face-to-face interview, and telephone interview, which helps limit bias from excluding people who are hard to reach through a single mode. The data set contains 57,061 observations, far fewer than 10% of the US population, so the observations can reasonably be treated as independent of one another; this condition by itself does not show that respondents were randomly sampled.
The topic of education is very interesting to me. I strongly believe in the transformative and creative power of education, and I believe that people with more education live more prosperous lives. Now it is time to check my beliefs.
Response variable:
getaid: EVER RECEIVED WELFARE?
Explanatory variable(s):
Let’s find the proportion of respondents at each level of education in this survey and plot it.
# Count respondents at each education level and convert the counts to proportions
level_education <- gss %>%
  select(degree) %>%
  filter(!is.na(degree)) %>%
  group_by(degree) %>%
  summarise(count = n()) %>%
  mutate(prop = count / sum(count))
level_education
## # A tibble: 5 × 3
## degree count prop
## <fctr> <int> <dbl>
## 1 Lt High School 11822 0.21091506
## 2 High School 29287 0.52250629
## 3 Junior College 3070 0.05477155
## 4 Bachelor 8002 0.14276284
## 5 Graduate 3870 0.06904426
ggplot(level_education, aes(degree, prop)) + geom_bar(stat = "identity")
Here we can see that respondents with a high school education form the largest group in the survey (about 52%). Because the group is so large, we might expect it to also contain the largest number of welfare recipients.
Before moving forward with the exploration, we should note that about 97% of survey participants (55,601 of 57,061) have no recorded answer on whether they had ever received welfare. Because the remaining 1,460 responses are still a large number, we can continue the analysis, keeping in mind that it covers only this subset.
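# Respondents with no recorded welfare answer, broken down by degree
na_welfare <- gss %>%
  select(degree, getaid) %>%
  filter(is.na(getaid)) %>%
  group_by(degree, getaid) %>%
  summarise(count = n()) %>%
  mutate(prop = count / 55601)  # 55601 = number of NA values in getaid
# Share of all 57061 respondents with a missing getaid value
pr <- 55601 / 57061
pr
## [1] 0.9744133
na_welfare
## Source: local data frame [6 x 4]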
## Groups: degree [6]
##
## degree getaid count prop
## <fctr> <fctr> <int> <dbl>
## 1 Lt High School NA 11446 0.20585961
## 2 High School NA 28543 0.51335408
## 3 Junior College NA 3022 0.05435154
## 4 Bachelor NA 7811 0.14048308
## 5 Graduate NA 3792 0.06820021
## 6 NA NA 987 0.01775148
welfare <- gss %>%
  select(degree, getaid)
# Counts by degree and welfare status; 1437 = respondents with both values recorded
welfare2 <- welfare %>%
  filter(!is.na(getaid), !is.na(degree)) %>%
  group_by(degree, getaid) %>%
  summarise(count = n()) %>%
  mutate(prop = count / 1437)
welfare2
## Source: local data frame [10 x 4]
## Groups: degree [5]
##
## degree getaid count prop
## <fctr> <fctr> <int> <dbl>
## 1 Lt High School Yes 115 0.080027836
## 2 Lt High School No 261 0.181628392
## 3 High School Yes 134 0.093249826
## 4 High School No 610 0.424495477
## 5 Junior College Yes 4 0.002783577
## 6 Junior College No 44 0.030619346
## 7 Bachelor Yes 17 0.011830202
## 8 Bachelor No 174 0.121085595
## 9 Graduate Yes 6 0.004175365
## 10 Graduate No 72 0.050104384
ggplot(welfare2, aes(getaid, count, fill = degree)) +
  geom_bar(width = 0.5, stat = "identity") +
  xlab("Ever received welfare?") +
  ylab("Number of respondents")  # the y-axis shows counts; degree is the fill colour
At this stage of the study we found that the largest group among those who had ever received welfare had a high school education. About 19% of respondents had ever received welfare, and most of them had not gone beyond a high school diploma: 9% of all respondents were recipients with a high school education and 8% were recipients with less than high school.
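The shares above are fractions of all 1437 respondents, so they partly reflect how large each education group is. A minimal sketch of the complementary view, the welfare rate within each education level, can be rebuilt from the counts printed in the welfare2 table. The names `counts` and `rate_yes` are introduced here for illustration only and are not part of the original analysis.

```r
# Counts copied from the welfare2 output above (rows: degree, columns: getaid)
counts <- matrix(
  c(115, 261,
    134, 610,
      4,  44,
     17, 174,
      6,  72),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    getaid = c("Yes", "No")
  )
)

# Row-wise proportions: share of each education group that ever received welfare
rate_yes <- prop.table(counts, margin = 1)[, "Yes"]
round(rate_yes, 3)
```

With these counts the rate is highest for respondents with less than high school (roughly 31%) and about 18% for the high school group, even though the high school group has more recipients in absolute terms.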
1.2
To answer the second part of the question, we need to find how TV-watching time is distributed among people with different levels of education.
tv_time <- gss %>%
  select(degree, tvhours) %>%
  filter(!is.na(degree), !is.na(tvhours))
# Count respondents for each degree/tvhours pair; 33291 = rows with both values recorded
tv_time2 <- tv_time %>%
  group_by(degree, tvhours) %>%
  summarise(count = n()) %>%
  mutate(props = count / 33291)
tv_time2
## Source: local data frame [98 x 4]
## Groups: degree [5]
##
## degree tvhours count props
## <fctr> <int> <int> <dbl>
## 1 Lt High School 0 252 0.0075696134
## 2 Lt High School 1 855 0.0256826169
## 3 Lt High School 2 1464 0.0439758493
## 4 Lt High School 3 1250 0.0375476856
## 5 Lt High School 4 1056 0.0317202848
## 6 Lt High School 5 673 0.0202156739
## 7 Lt High School 6 503 0.0151091887
## 8 Lt High School 7 112 0.0033642726
## 9 Lt High School 8 262 0.0078699949
## 10 Lt High School 9 25 0.0007509537
## # ... with 88 more rows
ggplot(tv_time2, aes(tvhours, props, color = degree)) +
  geom_line(stat = "identity", na.rm = TRUE)
From this plot, people with lower education levels appear to spend more time watching TV. To make this clearer, we compute summary statistics for each group and draw a boxplot:
# Mean TV hours for each education level
mean_hours <- gss %>%
  filter(!is.na(tvhours), !is.na(degree)) %>%
  group_by(degree) %>%
  summarise(mean_hours = mean(tvhours))
mean_hours
## # A tibble: 5 × 2
## degree mean_hours
## <fctr> <dbl>
## 1 Lt High School 3.768033
## 2 High School 3.037413
## 3 Junior College 2.540138
## 4 Bachelor 2.191993
## 5 Graduate 1.896432
gss %>%
  ggplot(aes(degree, tvhours)) + geom_boxplot()
## Warning: Removed 23206 rows containing non-finite values (stat_boxplot).
According to these summary statistics, the higher the level of education, the fewer hours a person tends to spend watching TV.
The last part of the exploration is to add a further condition, having ever received welfare, to the relationship found above.
# Keep only respondents who have ever received welfare
tv_time_wf <- gss %>%
  select(degree, tvhours, getaid) %>%
  filter(!is.na(degree), !is.na(tvhours), getaid == "Yes")
# prop: count divided by the hard-coded total 275
tv_time_wf2 <- tv_time_wf %>%
  group_by(degree, tvhours, getaid) %>%
  summarise(count = n()) %>%
  mutate(prop = count / 275)
tv_time_wf2
## Source: local data frame [41 x 5]
## Groups: degree, tvhours [41]
##
## degree tvhours getaid count prop
## <fctr> <int> <fctr> <int> <dbl>
## 1 Lt High School 0 Yes 1 0.003636364
## 2 Lt High School 1 Yes 14 0.050909091
## 3 Lt High School 2 Yes 20 0.072727273
## 4 Lt High School 3 Yes 21 0.076363636
## 5 Lt High School 4 Yes 14 0.050909091
## 6 Lt High School 5 Yes 11 0.040000000
## 7 Lt High School 6 Yes 9 0.032727273
## 8 Lt High School 7 Yes 3 0.010909091
## 9 Lt High School 8 Yes 10 0.036363636
## 10 Lt High School 10 Yes 3 0.010909091
## # ... with 31 more rows
ggplot(tv_time_wf2, aes(tvhours, prop, color = degree)) +
  geom_line(stat = "identity", na.rm = TRUE)
According to the plot, people with lower education tend to spend more time watching TV, and restricting the data to welfare recipients appears to make the difference more pronounced.
# Mean TV hours for each combination of degree and welfare status
gss_short <- gss %>%
  filter(!is.na(tvhours), !is.na(getaid), !is.na(degree)) %>%
  group_by(degree, getaid) %>%
  summarise(mean_hours = mean(tvhours))
gss_short
## Source: local data frame [10 x 3]
## Groups: degree [?]
##
## degree getaid mean_hours
## <fctr> <fctr> <dbl>
## 1 Lt High School Yes 4.552632
## 2 Lt High School No 3.542308
## 3 High School Yes 3.970149
## 4 High School No 2.901478
## 5 Junior College Yes 1.500000
## 6 Junior College No 2.159091
## 7 Bachelor Yes 2.235294
## 8 Bachelor No 2.218391
## 9 Graduate Yes 2.166667
## 10 Graduate No 1.887324
At this stage of the research we can see a tendency for people who have ever received welfare to spend more time watching TV. Junior College is the exception, but that group contains no more than 4 welfare recipients.
4.1
Hypotheses
The goal of this hypothesis test is to determine whether welfare status is associated with level of education.
H0: degree and getaid are independent
HA: degree and getaid are dependent
To answer this, we can use the chi-square test of independence, because we are evaluating the relationship between two categorical variables, one of which (degree) has more than two levels.
Conditions
There are two conditions that must be checked before performing a chi-square test:
Independence. Each case that contributes a count to the table must be independent of all the other cases in the table. Each respondent is counted once, and the sample is far smaller than 10% of the population, so treating the cases as independent is reasonable.
Sample size / distribution. Each particular scenario (i.e. cell count) must have at least 5 expected cases.
summary(gss$degree)
##  Lt High School     High School  Junior College        Bachelor        Graduate
##           11822           29287            3070            8002            3870
##            NA's
##            1010
summary(gss$getaid)
##   Yes    No  NA's
##   281  1179 55601
The smallest expected count in the test output below is about 9.2 (Junior College, Yes), so both conditions are met.
inference(y = getaid, x = degree, data = gss, statistic = "proportion", type = "ht",
          alternative = "greater", success = "Graduate", method = "theoretical")
## Response variable: categorical (2 levels)
## Explanatory variable: categorical (5 levels)
## Observed:
## y
## x Yes No
## Lt High School 115 261
## High School 134 610
## Junior College 4 44
## Bachelor 17 174
## Graduate 6 72
##
## Expected:
## y
## x Yes No
## Lt High School 72.217119 303.78288
## High School 142.897704 601.10230
## Junior College 9.219207 38.78079
## Bachelor 36.684760 154.31524
## Graduate 14.981211 63.01879
##
## H0: degree and getaid are independent
## HA: degree and getaid are dependent
## chi_sq = 55.4515, df = 4, p_value = 0
Because larger chi-square values correspond to stronger evidence against the null hypothesis, the p-value is the area of the upper tail, which is reported as 0 here (that is, smaller than the displayed precision). With such a small p-value we reject the null hypothesis. In other words, the data provide convincing evidence that level of education and welfare status are associated.
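As a cross-check, the same test can be sketched in base R from the observed counts printed above. This is only an illustration under the assumption that the printed table is the full input to the test; it uses `chisq.test()` rather than the `inference()` call from the report, and the names `observed` and `test` are introduced here.

```r
# Observed counts copied from the inference() output above
observed <- matrix(
  c(115, 261,
    134, 610,
      4,  44,
     17, 174,
      6,  72),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate"),
    getaid = c("Yes", "No")
  )
)

# Chi-square test of independence (no continuity correction is applied
# because the table is larger than 2 x 2)
test <- chisq.test(observed)

test$expected           # expected counts under independence
min(test$expected) >= 5 # sample size condition
test$statistic          # chi-square statistic
test$parameter          # degrees of freedom: (5 - 1) * (2 - 1) = 4
test$p.value
```

The expected counts and the statistic should agree with the report's output up to rounding, and the printed p-value shows how small the value reported as 0 actually is.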
4.2
Hypotheses
The goal of this hypothesis test is to determine whether the number of hours spent watching TV is associated with level of education.
H0: tvhours and degree are independent
HA: tvhours and degree are dependent
To answer this, we can use the chi-square test of independence, because tvhours takes a limited set of whole-number values and can be treated as a categorical variable alongside degree.
Conditions
There are two conditions that must be checked before performing a chi-square test:
Independence. Each case that contributes a count to the table must be independent of all the other cases in the table. Each respondent is counted once, and the sample is far smaller than 10% of the population, so treating the cases as independent is reasonable.
Sample size / distribution. Each particular scenario (i.e. cell count) must have at least 5 expected cases.
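summary(gss$degree)
##  Lt High School     High School  Junior College        Bachelor        Graduate
##           11822           29287            3070            8002            3870
##            NA's
##            1010
The sample size condition is only partly met: in the expected counts below, many cells for the larger tvhours values fall under 5, which is why R warns that the chi-squared approximation may be incorrect. The result should therefore be read with caution.
inference(y = degree, x = tvhours, data = gss, statistic = "proportion", type = "ht",
          alternative = "greater", success = "Graduate", method = "theoretical")
## Warning: Explanatory variable was numerical, it has been converted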
## to categorical. In order to avoid this warning, first convert
## your explanatory variable to a categorical variable using the
## as.factor() function
## Warning in chisq.test(x, y, correct = FALSE): Chi-squared approximation may
## be incorrect
## Response variable: categorical (5 levels)
## Explanatory variable: categorical (24 levels)
## Observed:
## y
## x Lt High School High School Junior College Bachelor Graduate
## 0 252 720 91 351 219
## 1 855 3080 465 1421 860
## 2 1464 4711 577 1492 664
## 3 1250 3475 343 813 314
## 4 1056 2575 205 401 133
## 5 673 1241 86 141 51
## 6 503 789 61 62 36
## 7 112 180 8 13 4
## 8 262 357 23 30 4
## 9 25 36 1 4 2
## 10 149 143 9 17 5
## 11 15 11 0 3 1
## 12 97 109 5 7 1
## 13 9 13 0 2 0
## 14 21 19 2 1 0
## 15 19 21 1 5 2
## 16 11 15 1 0 0
## 17 1 2 0 0 0
## 18 9 7 2 0 0
## 20 8 16 1 7 1
## 21 1 3 0 0 0
## 22 3 2 0 0 0
## 23 1 0 0 0 0
## 24 11 9 0 1 1
##
## Expected:
## y
## x Lt High School High School Junior College Bachelor Graduate
## 0 333.8989817 860.0829654 92.26736956 234.028506 112.72217716
## 1 1366.0619086 3518.8085068 377.48824006 957.467514 461.17383077
## 2 1821.4158782 4691.7446757 503.31765342 1276.623352 614.89844102
## 3 1266.6896459 3262.8377039 350.02838605 887.817879 427.62638551
## 4 893.5324863 2301.6304707 246.91267910 626.273467 301.65089664
## 5 448.1975309 1154.5020576 123.85185185 314.139918 151.30864198
## 6 296.6855006 764.2255865 81.98404974 207.945721 100.15914211
## 7 64.8168874 166.9603797 17.91105704 45.429906 21.88176985
## 8 138.2215013 356.0416930 38.19518789 96.878916 46.66270163
## 9 13.9039380 35.8148449 3.84211949 9.745216 4.69388123
## 10 66.0437055 170.1205131 18.25006759 46.289778 22.29593584
## 11 6.1340903 15.8006668 1.69505272 4.299360 2.07082995
## 12 44.7788592 115.3448680 12.37388483 31.385329 15.11705866
## 13 4.9072722 12.6405335 1.35604217 3.439488 1.65666396
## 14 8.7921961 22.6476225 2.42957556 6.162416 2.96818960
## 15 9.8145445 25.2810670 2.71208435 6.878976 3.31332793
## 16 5.5206813 14.2206002 1.52554745 3.869424 1.86374696
## 17 0.6134090 1.5800667 0.16950527 0.429936 0.20708300
## 18 3.6804542 9.4804001 1.01703163 2.579616 1.24249797
## 20 6.7474993 17.3807335 1.86455799 4.729296 2.27791295
## 21 0.8178787 2.1067556 0.22600703 0.573248 0.27611066
## 22 1.0223484 2.6334445 0.28250879 0.716560 0.34513833
## 23 0.2044697 0.5266889 0.05650176 0.143312 0.06902767
## 24 4.4983329 11.5871557 1.24303866 3.152864 1.51860863
##
## H0: tvhours and degree are independent
## HA: tvhours and degree are dependent
## chi_sq = 2690.96, df = 92, p_value = 0
Because larger chi-square values correspond to stronger evidence against the null hypothesis, the p-value is the area of the upper tail, which is reported as 0 here (that is, smaller than the displayed precision). With such a small p-value we reject the null hypothesis. In other words, the data provide convincing evidence that hours spent in front of the TV and level of education are associated.
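Several expected counts in the table above are below 5, so one possible remedy is to pool the sparse high-hour rows before testing. The sketch below is not part of the original analysis: it copies the observed counts from the output above, pools every tvhours value of 8 or more into a single "8+" category (the cut-off is an assumption chosen for illustration), and reruns the test with base R's `chisq.test()`.

```r
# Observed counts copied from the inference() output above
# (rows: tvhours value, columns: degree); 19 hours does not occur in the data
hours <- c(0:18, 20:24)
obs <- matrix(
  c( 252,  720,  91,  351, 219,
     855, 3080, 465, 1421, 860,
    1464, 4711, 577, 1492, 664,
    1250, 3475, 343,  813, 314,
    1056, 2575, 205,  401, 133,
     673, 1241,  86,  141,  51,
     503,  789,  61,   62,  36,
     112,  180,   8,   13,   4,
     262,  357,  23,   30,   4,
      25,   36,   1,    4,   2,
     149,  143,   9,   17,   5,
      15,   11,   0,    3,   1,
      97,  109,   5,    7,   1,
       9,   13,   0,    2,   0,
      21,   19,   2,    1,   0,
      19,   21,   1,    5,   2,
      11,   15,   1,    0,   0,
       1,    2,   0,    0,   0,
       9,    7,   2,    0,   0,
       8,   16,   1,    7,   1,
       1,    3,   0,    0,   0,
       3,    2,   0,    0,   0,
       1,    0,   0,    0,   0,
      11,    9,   0,    1,   1),
  ncol = 5, byrow = TRUE,
  dimnames = list(
    tvhours = as.character(hours),
    degree = c("Lt High School", "High School", "Junior College",
               "Bachelor", "Graduate")
  )
)

# Pool all rows with 8 or more hours into one category (illustrative cut-off)
bin <- factor(ifelse(hours >= 8, "8+", as.character(hours)),
              levels = c(as.character(0:7), "8+"))
collapsed <- rowsum(obs, group = bin)

# Chi-square test of independence on the pooled table
res <- chisq.test(collapsed)
min(res$expected)  # smallest expected count after pooling
res$statistic
res$parameter      # degrees of freedom: (9 - 1) * (5 - 1) = 32
res$p.value
```

After pooling, every expected count is above 5, so the chi-square approximation is on firmer ground. A different cut-off would give a different statistic, which is why the choice is labelled as an assumption.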
4.3
Hypotheses
To answer this, we can use a two-sided Student’s t-test, because we are comparing the mean TV hours of two independent groups: people who have received welfare and people who have not.
H0: mu_Yes = mu_No
HA: mu_Yes != mu_No
Conditions
Independence. Observations must be independent within each group and between the two groups. Each respondent appears once and belongs to only one group, so treating the cases as independent is reasonable.
Sample size / distribution. Each group should have at least 30 observations or a distribution without extreme skew. Both groups are well above 30.
summary(gss$getaid)
##   Yes    No  NA's
##   281  1179 55601
So the conditions are met.
inference(y = tvhours, x = getaid, data = gss, statistic = "mean", type = "ht", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical
## Explanatory variable: categorical (2 levels)
## n_Yes = 280, y_bar_Yes = 4.0571, s_Yes = 3.2775
## n_No = 1176, y_bar_No = 2.8597, s_No = 2.0408
## H0: mu_Yes = mu_No
## HA: mu_Yes != mu_No
## t = 5.8494, df = 279
## p_value = < 0.0001
Because t values farther from zero correspond to stronger evidence against the null hypothesis, the p-value for this two-sided test is the combined area of both tails, here < 0.0001. With such a small p-value we reject the null hypothesis. In other words, the data provide convincing evidence that mean TV time differs between the two groups, with people who have received welfare watching more on average.
4.1.1
We can also compute a confidence interval for this difference.
inference(y = tvhours, x = getaid, data = gss, statistic = "mean", type = "ci", null = 0,
          alternative = "twosided", method = "theoretical")
## Response variable: numerical, Explanatory variable: categorical (2 levels)
## n_Yes = 280, y_bar_Yes = 4.0571, s_Yes = 3.2775
## n_No = 1176, y_bar_No = 2.8597, s_No = 2.0408
## 95% CI (Yes - No): (0.7945 , 1.6004)
We are 95% confident that people who have received welfare watch, on average, 0.79 to 1.60 more hours of TV than those who have not.
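The t statistic and the interval can be reproduced by hand from the summary statistics printed above. This is a minimal sketch, assuming the conservative degrees of freedom min(n_Yes, n_No) - 1, which matches the df = 279 in the output. The variable names are introduced here for illustration.

```r
# Summary statistics copied from the inference() output above
n_yes <- 280;  mean_yes <- 4.0571; sd_yes <- 3.2775
n_no  <- 1176; mean_no  <- 2.8597; sd_no  <- 2.0408

# Point estimate and standard error of the difference in means (Yes - No)
diff_means <- mean_yes - mean_no
se <- sqrt(sd_yes^2 / n_yes + sd_no^2 / n_no)

# Conservative degrees of freedom (assumption: matches df = 279 in the output)
df <- min(n_yes, n_no) - 1

# Two-sided test of H0: mu_Yes = mu_No
t_stat <- diff_means / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval: estimate +/- critical value * standard error
ci <- diff_means + c(-1, 1) * qt(0.975, df = df) * se

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 4)
p_value
```

Because the inputs are rounded to four decimals, the results may differ from the report's output in the last digit.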