Getting started

First, I loaded the libraries needed for this analysis.

require(ggvis)
## Loading required package: ggvis
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(knitr)
## Loading required package: knitr

Next, I read the data directly from the specified CSV file.

WorkerSatisfaction <- read.csv(file="http://www.personal.psu.edu/dlp/WFED540/pwces.csv", header=TRUE, sep=",")
glimpse(WorkerSatisfaction)
## Observations: 756
## Variables: 2
## $ gender  (int) 1, 1, 0, 0, 0, 0, 1, NA, 0, 0, 1, 0, 1, 0, 1, 1, NA, 0...
## $ lifesat (int) 20, 18, 25, 7, 23, 25, 22, NA, 21, 29, 26, 26, 18, 28,...
summary(WorkerSatisfaction)
##      gender          lifesat    
##  Min.   :0.0000   Min.   : 5.0  
##  1st Qu.:0.0000   1st Qu.:18.0  
##  Median :0.0000   Median :23.0  
##  Mean   :0.4111   Mean   :21.5  
##  3rd Qu.:1.0000   3rd Qu.:25.0  
##  Max.   :1.0000   Max.   :30.0  
##  NA's   :70       NA's   :78
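
The last row of summary() reports the missing values, but they can also be counted directly. A minimal sketch on a small made-up data frame (the name `toy` and its values are invented for illustration and are not taken from pwces.csv):

```r
# Toy data frame; values are invented for illustration only
toy <- data.frame(
  gender  = c(1, 0, NA, 0, NA),
  lifesat = c(20, NA, NA, 7, 23)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Applied to the real data, the same two calls would show how many rows are lost when either variable is missing.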

I then created a data frame and filtered it to get rid of any NA values for the “gender” variable. I ran a summary of that new data frame to ensure I had what I was looking for.

WorkerSatDF <- tbl_df(WorkerSatisfaction)
WorkerSatDF
## Source: local data frame [756 x 2]
## 
##    gender lifesat
##     (int)   (int)
## 1       1      20
## 2       1      18
## 3       0      25
## 4       0       7
## 5       0      23
## 6       0      25
## 7       1      22
## 8      NA      NA
## 9       0      21
## 10      0      29
## ..    ...     ...
# filter() has no na.rm argument; test for missing values explicitly
WorkersSat <- WorkerSatDF %>% filter(!is.na(gender))
WorkersSat
## Source: local data frame [686 x 2]
## 
##    gender lifesat
##     (int)   (int)
## 1       1      20
## 2       1      18
## 3       0      25
## 4       0       7
## 5       0      23
## 6       0      25
## 7       1      22
## 8       0      21
## 9       0      29
## 10      1      26
## ..    ...     ...
summary(WorkersSat)
##      gender          lifesat     
##  Min.   :0.0000   Min.   : 5.00  
##  1st Qu.:0.0000   1st Qu.:18.00  
##  Median :0.0000   Median :23.00  
##  Mean   :0.4111   Mean   :21.48  
##  3rd Qu.:1.0000   3rd Qu.:25.00  
##  Max.   :1.0000   Max.   :30.00  
##                   NA's   :11

Next, I created a table to ensure I didn’t have any hidden values in gender that I needed to account for.

TestForHiddenValues <- table(WorkersSat$gender)
TestForHiddenValues
## 
##   0   1 
## 404 282
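
One caveat with this kind of check is that table() leaves out NA values by default, so missing values would not appear as a "hidden" category unless requested. A small sketch, using an invented vector rather than the survey data:

```r
# Toy vector; values are invented for illustration only
gender <- c(0, 1, 1, NA, 0, 0, NA)

# Default behaviour: NA values are silently left out of the counts
counts_default <- table(gender)

# useNA = "ifany" adds an <NA> category whenever missing values are present
counts_with_na <- table(gender, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```

Here the default table counts five values while the second counts all seven, which makes the two missing entries visible.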

Next, I ran some plots for my original data to get a better feel for it. First, I looked at males.

WorkersSat %>% filter(gender==0) %>%
  ggvis(~lifesat) %>% layer_histograms() 
## Guessing width = 1 # range / 25

Then I ran the same plot for females.

WorkersSat %>% filter(gender==1) %>%
  ggvis(~lifesat) %>% layer_histograms() 
## Guessing width = 1 # range / 25

These plots seemed to indicate that the majority of the data lay between the values of 15 and 27, with some outliers below and above. So I reran the plots using these new values. Males:

WorkersSat %>% filter(gender==0, lifesat >=15 & lifesat <= 27) %>%
  ggvis(~lifesat) %>% layer_histograms() 
## Guessing width = 0.5 # range / 24

Females:

WorkersSat %>% filter(gender==1, lifesat >=15 & lifesat <= 27) %>%
  ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 0.5 # range / 24

The trimmed data seemed to give a more representative sample with which to test the null hypothesis.

Test the null hypothesis

The null hypothesis in this example states that there is no difference in mean life satisfaction between men and women. Stated another way, knowing a person’s sex does not help predict life satisfaction, and vice versa. The alternative hypothesis is that there is a difference in mean life satisfaction between men and women. Stated another way, gender and life satisfaction are not independent variables, i.e. one can be used to predict the other. I will set my alpha value, the probability of Type 1 error, equal to 0.05.
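
In symbols, writing \(\mu_0\) and \(\mu_1\) for the population mean life satisfaction of the groups coded 0 and 1 (this notation is introduced here for illustration only), the two hypotheses and the significance level can be sketched as:

\[
H_0:\ \mu_0 = \mu_1
\qquad
H_1:\ \mu_0 \neq \mu_1
\qquad
\alpha = 0.05
\]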

I will use a t-test to determine whether I can reject the null hypothesis.

First, I measured the means of life satisfaction of men and women using my trimmed data.

WorkersSat %>% group_by(gender) %>%
  filter(lifesat >= 15 & lifesat <= 27) %>%
  summarize(num=n(), mean_lifesat=mean(lifesat, na.rm=TRUE))
## Source: local data frame [2 x 3]
## 
##   gender   num mean_lifesat
##    (int) (int)        (dbl)
## 1      0   289     21.59862
## 2      1   198     22.22222

I then compared those means against the means from my original data frame.

WorkersSat %>% group_by(gender) %>%
  summarize(num=n(), mean_lifesat=mean(lifesat, na.rm=TRUE))
## Source: local data frame [2 x 3]
## 
##   gender   num mean_lifesat
##    (int) (int)        (dbl)
## 1      0   404     21.23869
## 2      1   282     21.83755

Both comparisons showed a difference between the means, but a t-test is needed to tell whether that difference is significant at our acceptable level of Type 1 error. Therefore, I ran a t-test using my trimmed data.

trim_WorkersSat <- WorkersSat %>%
  filter(lifesat >= 15 & lifesat <= 27)  # filter() already drops rows where lifesat is NA
trim_WorkersSat
## Source: local data frame [487 x 2]
## 
##    gender lifesat
##     (int)   (int)
## 1       1      20
## 2       1      18
## 3       0      25
## 4       0      23
## 5       0      25
## 6       1      22
## 7       0      21
## 8       1      26
## 9       0      26
## 10      1      18
## ..    ...     ...
t.test(trim_WorkersSat$lifesat ~ trim_WorkersSat$gender, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  trim_WorkersSat$lifesat by trim_WorkersSat$gender
## t = -2.0027, df = 485, p-value = 0.04577
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.23544574 -0.01176687
## sample estimates:
## mean in group 0 mean in group 1 
##        21.59862        22.22222

The t-test shows that the estimate of the difference between the means is just over twice the error in estimating that difference (t = -2.0027) and our probability value (p-value = 0.04577) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.235, -0.012], which doesn’t include a zero value. Therefore, I reject the null hypothesis.
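
To make the pieces of that output concrete, here is a sketch that computes the pooled-variance t statistic, p-value, and confidence interval by hand and compares them with t.test(). The data are simulated, and the group sizes, means, and standard deviations are arbitrary assumptions, not values from the survey:

```r
# Simulated data; group sizes, means and SDs are arbitrary assumptions
set.seed(1)
x0 <- rnorm(40, mean = 21, sd = 3)  # stand-in for group 0
x1 <- rnorm(30, mean = 22, sd = 3)  # stand-in for group 1

n0 <- length(x0)
n1 <- length(x1)
dof <- n0 + n1 - 2  # degrees of freedom for the pooled test

# Pooled variance: weighted average of the two sample variances
pooled_var <- ((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / dof

# Standard error of the difference in means
se_diff <- sqrt(pooled_var * (1 / n0 + 1 / n1))

mean_diff <- mean(x0) - mean(x1)
t_stat <- mean_diff / se_diff               # difference divided by its error
p_value <- 2 * pt(-abs(t_stat), dof)        # two-sided p-value
ci <- mean_diff + c(-1, 1) * qt(0.975, dof) * se_diff  # 95% CI

# The same test via t.test(), for comparison
fit <- t.test(x0, x1, var.equal = TRUE)
print(c(t = t_stat, p = p_value))
print(ci)
```

The hand-computed values should agree with `fit$statistic`, `fit$p.value`, and `fit$conf.int`, which shows that the t value is simply the difference in means divided by its standard error.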

Other interesting results

However, I also found it interesting that without trimming the data this heavily, we would fail to reject our null hypothesis. For example, if we use our original data, our t-test would produce the following results.

t.test(WorkersSat$lifesat ~ WorkersSat$gender, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  WorkersSat$lifesat by WorkersSat$gender
## t = -1.349, df = 673, p-value = 0.1778
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4704615  0.2727582
## sample estimates:
## mean in group 0 mean in group 1 
##        21.23869        21.83755

This t-test shows that the estimate of the difference between the means is only slightly more than the error in estimating that difference (t = -1.349) and our probability value (p-value = 0.1778) is more than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.470, 0.273], which does include a zero value. Therefore, in this instance we have failed to reject the null hypothesis.
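
The reported numbers can be tied together by hand. As a rough check, assuming a two-sided critical value of about 1.96 for 673 degrees of freedom, and writing SE for the standard error of the difference (a symbol that does not appear in the output above):

\[
\bar{x}_0 - \bar{x}_1 = 21.23869 - 21.83755 = -0.59886
\]

\[
\mathrm{SE} \approx \frac{\bar{x}_0 - \bar{x}_1}{t} = \frac{-0.59886}{-1.349} \approx 0.444
\]

\[
-0.59886 \pm 1.96 \times 0.444 \approx [-1.47,\ 0.27]
\]

This agrees with the reported interval to two decimal places, and it shows why the interval straddles zero: the difference is smaller in magnitude than roughly twice its standard error.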

Even a slight change to the trimming limits would result in a failure to reject our null hypothesis. For example, lowering the minimum lifesat value to 10 and raising the maximum to 30 would provide the following results.

trim_WorkersSat1 <- WorkersSat %>%
  filter(lifesat >= 10 & lifesat <= 30)  # filter() already drops rows where lifesat is NA
trim_WorkersSat1
## Source: local data frame [655 x 2]
## 
##    gender lifesat
##     (int)   (int)
## 1       1      20
## 2       1      18
## 3       0      25
## 4       0      23
## 5       0      25
## 6       1      22
## 7       0      21
## 8       0      29
## 9       1      26
## 10      0      26
## ..    ...     ...
t.test(trim_WorkersSat1$lifesat ~ trim_WorkersSat1$gender, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  trim_WorkersSat1$lifesat by trim_WorkersSat1$gender
## t = -1.725, df = 653, p-value = 0.08501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.50762112  0.09753974
## sample estimates:
## mean in group 0 mean in group 1 
##        21.63824        22.34328

This t-test also shows that the estimate of the difference between the means is less than twice the error in estimating that difference (t = -1.725) and our probability value (p-value = 0.08501) is more than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-1.508, 0.098], which does include a zero value. Therefore, in this instance we also have failed to reject the null hypothesis.
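
The three comparisons above can be lined up in one table by looping over the trimming limits. The sketch below does this on simulated data. The data frame `sim` and the helper `p_for_bounds()` are hypothetical names introduced for illustration, and the simulated values are assumptions rather than the survey responses:

```r
# Simulated stand-in for the survey data; all values are assumptions
set.seed(42)
sim <- data.frame(
  gender  = rep(c(0, 1), times = c(120, 80)),
  # Scores drawn from a normal distribution, clamped to the 5-30 scale
  lifesat = round(pmin(pmax(rnorm(200, mean = 21.5, sd = 5), 5), 30))
)

# p_for_bounds() is a hypothetical helper, not part of any package:
# keep rows with lifesat in [lo, hi], then run the pooled two-sample t-test
p_for_bounds <- function(data, lo, hi) {
  kept <- data[data$lifesat >= lo & data$lifesat <= hi, ]
  t.test(lifesat ~ gender, data = kept, var.equal = TRUE)$p.value
}

# The three sets of limits used in this write-up
bounds <- data.frame(lo = c(5, 10, 15), hi = c(30, 30, 27))
bounds$p_value <- mapply(
  function(lo, hi) p_for_bounds(sim, lo, hi),
  bounds$lo, bounds$hi
)
print(bounds)
```

Reporting the p-value next to each set of limits makes the dependence of the result on the trimming choice explicit.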

Final result

By modifying the data and choosing appropriate limits on the variable “lifesat” (between 15 and 27) we are able to reject the null hypothesis, which would indicate that (as our alternative hypothesis states) gender and life satisfaction are not independent variables, i.e. one can be used to predict the other. Therefore, there are evident differences in life satisfaction between genders.