First, I loaded the libraries that my R file might need.
require(ggvis)
## Loading required package: ggvis
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(knitr)
## Loading required package: knitr
Next, I read the data from the specified CSV file.
WorkerSatisfaction <- read.csv(file="http://www.personal.psu.edu/dlp/WFED540/pwces.csv", header=TRUE, sep=",")
glimpse(WorkerSatisfaction)
## Observations: 756
## Variables: 2
## $ gender (int) 1, 1, 0, 0, 0, 0, 1, NA, 0, 0, 1, 0, 1, 0, 1, 1, NA, 0...
## $ lifesat (int) 20, 18, 25, 7, 23, 25, 22, NA, 21, 29, 26, 26, 18, 28,...
summary(WorkerSatisfaction)
## gender lifesat
## Min. :0.0000 Min. : 5.0
## 1st Qu.:0.0000 1st Qu.:18.0
## Median :0.0000 Median :23.0
## Mean :0.4111 Mean :21.5
## 3rd Qu.:1.0000 3rd Qu.:25.0
## Max. :1.0000 Max. :30.0
## NA's :70 NA's :78
I then converted the data to a dplyr data frame and filtered it to remove any rows with NA values for the “gender” variable. I ran a summary of the new data frame to ensure I had what I was looking for.
WorkerSatDF <- tbl_df(WorkerSatisfaction)
WorkerSatDF
## Source: local data frame [756 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 7
## 5 0 23
## 6 0 25
## 7 1 22
## 8 NA NA
## 9 0 21
## 10 0 29
## .. ... ...
WorkersSat <- WorkerSatDF %>% filter(!is.na(gender))
WorkersSat
## Source: local data frame [686 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 7
## 5 0 23
## 6 0 25
## 7 1 22
## 8 0 21
## 9 0 29
## 10 1 26
## .. ... ...
summary(WorkersSat)
## gender lifesat
## Min. :0.0000 Min. : 5.00
## 1st Qu.:0.0000 1st Qu.:18.00
## Median :0.0000 Median :23.00
## Mean :0.4111 Mean :21.48
## 3rd Qu.:1.0000 3rd Qu.:25.00
## Max. :1.0000 Max. :30.00
## NA's :11
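The summary still reports missing lifesat values because the filter only looked at gender. A minimal sketch of that behaviour in base R, using a made-up data frame (`toy` and its values are my own illustration, not the pwces.csv data):

```r
# Toy data frame (made-up values, not the pwces.csv data)
toy <- data.frame(gender  = c(1L, 0L, NA, 0L, 1L, NA),
                  lifesat = c(20L, 25L, NA, NA, 22L, 18L))

# A comparison against NA is itself NA, so it can never be TRUE
toy$gender >= 0

# Keep only the rows where gender is recorded
toy_gender <- toy[!is.na(toy$gender), ]
nrow(toy_gender)

# lifesat can still be missing in the rows that were kept
sum(is.na(toy_gender$lifesat))
```

Rows with a missing lifesat stay in the data frame until a condition on lifesat is applied, or until a function such as mean() is told to skip them with na.rm=TRUE.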
Next, I created a table to ensure I didn’t have any hidden values in gender that I needed to account for.
TestForHiddenValues <- table(WorkersSat$gender)
TestForHiddenValues
##
## 0 1
## 404 282
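By default, table() leaves NA values out of its counts. If missing values should show up in this kind of check as well, one option is the useNA argument. A small sketch on made-up codes (the vector `codes` and the stray value 9 are hypothetical):

```r
# Toy gender codes (made up); 9 stands in for a hypothetical stray code
codes <- c(0, 1, 1, 0, NA, 9, 0)

# useNA = "ifany" adds a column for NA whenever missing values are present
counts <- table(codes, useNA = "ifany")
counts
```

Any unexpected code, or any remaining NA, then appears as its own column in the table.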
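Next, I plotted histograms of life satisfaction to get a better feel for the data. I looked first at males (gender == 0) and then at females (gender == 1), once over the full range of lifesat and once restricted to values from 15 to 27.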
WorkersSat %>% filter(gender==0) %>%
ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 1 # range / 25
WorkersSat %>% filter(gender==1) %>%
ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 1 # range / 25
WorkersSat %>% filter(gender==0, lifesat >=15 & lifesat <= 27) %>%
ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 0.5 # range / 24
WorkersSat %>% filter(gender==1, lifesat >=15 & lifesat <= 27) %>%
ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 0.5 # range / 24
The null hypothesis in this example states that there is no difference in mean life satisfaction between men and women. Stated another way, knowing a person’s gender does not help predict life satisfaction, and vice versa. The alternative hypothesis is that there is a difference in mean life satisfaction between men and women. Stated another way, gender and life satisfaction are not independent variables, i.e. one can be used to predict the other. I will set my alpha value, the probability of Type 1 error, equal to 0.05.
I will use a t-test to determine whether I can reject the null hypothesis.
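For reference, here is the test statistic written out, assuming the pooled-variance form of the two-sample t-test (the form that t.test computes when var.equal=TRUE). The symbols are my own notation: \(\bar{x}_0, \bar{x}_1\) are the group means, \(s_0^2, s_1^2\) the group variances, and \(n_0, n_1\) the group sizes for gender 0 and gender 1.

\[
t = \frac{\bar{x}_0 - \bar{x}_1}{s_p \sqrt{\dfrac{1}{n_0} + \dfrac{1}{n_1}}},
\qquad
s_p^2 = \frac{(n_0 - 1)\,s_0^2 + (n_1 - 1)\,s_1^2}{n_0 + n_1 - 2}
\]

Under the null hypothesis, t follows a t distribution with \(n_0 + n_1 - 2\) degrees of freedom, and I reject the null hypothesis when the two-sided p-value is below \(\alpha\) = 0.05.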
First, I computed the mean life satisfaction of men and women using my trimmed data (lifesat between 15 and 27).
WorkersSat %>% group_by(gender) %>%
filter(lifesat >= 15 & lifesat <= 27) %>%
summarize(num=n(), mean_lifesat=mean(lifesat, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## gender num mean_lifesat
## (int) (int) (dbl)
## 1 0 289 21.59862
## 2 1 198 22.22222
I then compared those means against the means from my original data frame.
WorkersSat %>% group_by(gender) %>%
summarize(num=n(), mean_lifesat=mean(lifesat, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## gender num mean_lifesat
## (int) (int) (dbl)
## 1 0 404 21.23869
## 2 1 282 21.83755
Both comparisons show a difference between the group means, but I must run a t-test to tell whether that difference is significant at our acceptable level of Type 1 error. Therefore, I ran a t-test using my trimmed data.
trim_WorkersSat <- WorkersSat %>%
filter(lifesat >= 15 & lifesat <= 27)
trim_WorkersSat
## Source: local data frame [487 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 23
## 5 0 25
## 6 1 22
## 7 0 21
## 8 1 26
## 9 0 26
## 10 1 18
## .. ... ...
t.test(trim_WorkersSat$lifesat ~ trim_WorkersSat$gender, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_WorkersSat$lifesat by trim_WorkersSat$gender
## t = -2.0027, df = 485, p-value = 0.04577
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.23544574 -0.01176687
## sample estimates:
## mean in group 0 mean in group 1
## 21.59862 22.22222
The t-test shows that the estimated difference between the means is just over twice the standard error of that difference (t = -2.0027), and our probability value (p-value = 0.04577) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.235, -0.012], which does not include zero. Therefore, I reject the null hypothesis.
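To make the link between t, the p-value, and the confidence interval concrete, here is a minimal sketch that rebuilds all three by hand and checks them against t.test. The two score vectors are made up for illustration (they are not the pwces.csv data), and the variable names are my own.

```r
# Made-up scores for two groups (hypothetical; not the pwces.csv data)
g0 <- c(18, 21, 23, 20, 25, 19, 22, 24)
g1 <- c(22, 24, 21, 26, 23, 25, 27, 20)

n0 <- length(g0)
n1 <- length(g1)
df <- n0 + n1 - 2

# Pooled variance and the standard error of the difference in means
sp2 <- ((n0 - 1) * var(g0) + (n1 - 1) * var(g1)) / df
se  <- sqrt(sp2 * (1 / n0 + 1 / n1))

# t is the difference in means measured in standard errors
diff_means <- mean(g0) - mean(g1)
t_stat <- diff_means / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval for the difference in means
ci <- diff_means + c(-1, 1) * qt(0.975, df) * se

# The same quantities as reported by t.test() with pooled variance
fit <- t.test(g0, g1, var.equal = TRUE)
all.equal(unname(fit$statistic), t_stat)
all.equal(fit$p.value, p_val)
all.equal(as.numeric(fit$conf.int), ci)
```

Because the interval is the difference plus or minus a critical value times the same standard error, the 95% CI excludes zero exactly when the two-sided p-value is below 0.05.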
However, I also found it interesting that without this substantial trimming of the data, we would fail to reject our null hypothesis. For example, if we use our original data, our t-test produces the following results.
t.test(WorkersSat$lifesat ~ WorkersSat$gender, var.equal=TRUE)
##
## Two Sample t-test
##
## data: WorkersSat$lifesat by WorkersSat$gender
## t = -1.349, df = 673, p-value = 0.1778
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4704615 0.2727582
## sample estimates:
## mean in group 0 mean in group 1
## 21.23869 21.83755
This t-test shows that the estimated difference between the means is only slightly larger than the standard error of that difference (t = -1.349), and our probability value (p-value = 0.1778) is more than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.470, 0.273], which does include zero. Therefore, in this instance we have failed to reject the null hypothesis.
Even a modest change to the trimming limits results in a failure to reject our null hypothesis. For example, lowering the minimum lifesat value from 15 to 10 and raising the maximum from 27 to 30 provides the following results.
trim_WorkersSat1 <- WorkersSat %>%
filter(lifesat >= 10 & lifesat <= 30)
trim_WorkersSat1
## Source: local data frame [655 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 23
## 5 0 25
## 6 1 22
## 7 0 21
## 8 0 29
## 9 1 26
## 10 0 26
## .. ... ...
t.test(trim_WorkersSat1$lifesat ~ trim_WorkersSat1$gender, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_WorkersSat1$lifesat by trim_WorkersSat1$gender
## t = -1.725, df = 653, p-value = 0.08501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.50762112 0.09753974
## sample estimates:
## mean in group 0 mean in group 1
## 21.63824 22.34328
This t-test also shows that the estimated difference between the means is less than twice the standard error of that difference (t = -1.725), and our probability value (p-value = 0.08501) is more than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.508, 0.098], which does include zero. Therefore, in this instance we also have failed to reject the null hypothesis.
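Since the result changes with the trimming limits, it can help to run the same test over several sets of limits in one pass. The sketch below does that on simulated data; the data frame `sim`, the seed, and the helper `trim_p` are all my own hypothetical choices, so its p-values say nothing about the real survey.

```r
set.seed(1)  # arbitrary seed so the simulated example is repeatable

# Simulated data (hypothetical): integer scores clipped to the 5-30 range
n <- 400
sim <- data.frame(
  gender  = rbinom(n, 1, 0.4),
  lifesat = pmin(30, pmax(5, round(rnorm(n, mean = 21.5, sd = 5))))
)

# Helper (my own name): p-value of the pooled t-test after trimming lifesat
trim_p <- function(lo, hi, data = sim) {
  kept <- data[data$lifesat >= lo & data$lifesat <= hi, ]
  t.test(lifesat ~ gender, data = kept, var.equal = TRUE)$p.value
}

# The three sets of limits used in this report, applied to the simulated data
bounds <- data.frame(lo = c(5, 10, 15), hi = c(30, 30, 27))
bounds$p_value <- mapply(trim_p, bounds$lo, bounds$hi)
bounds
```

Swapping `sim` for the real data frame would give a side-by-side view of how much the p-value moves as the limits change.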
By trimming the data to “lifesat” values between 15 and 27, we are able to reject the null hypothesis, which indicates (as our alternative hypothesis states) that gender and life satisfaction are not independent variables, i.e. one can be used to predict the other. Within the trimmed data, therefore, there is a statistically significant difference in mean life satisfaction between genders, although, as the tests above show, the difference is not significant for the original data or for the wider 10 to 30 range.