The Task

Infer whether gender differences in mean life satisfaction are evident among working professionals.

The Analysis

1. Obtaining the dataset and preparing R

First, I required the R packages I would need for this analysis:

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(ggvis)
## Loading required package: ggvis

Then I loaded the dataset.

career <- read.csv(file = "http://www.personal.psu.edu/dlp/WFED540/pwces.csv", header = TRUE, sep = ",")

Then I took a look at the data to see what I had.

career_df <- tbl_df(career)
career_df
## Source: local data frame [756 x 2]
## 
##    gender lifesat
##     (int)   (int)
## 1       1      20
## 2       1      18
## 3       0      25
## 4       0       7
## 5       0      23
## 6       0      25
## 7       1      22
## 8      NA      NA
## 9       0      21
## 10      0      29
## ..    ...     ...
glimpse(career_df)
## Observations: 756
## Variables: 2
## $ gender  (int) 1, 1, 0, 0, 0, 0, 1, NA, 0, 0, 1, 0, 1, 0, 1, 1, NA, 0...
## $ lifesat (int) 20, 18, 25, 7, 23, 25, 22, NA, 21, 29, 26, 26, 18, 28,...
summary(career_df)
##      gender          lifesat    
##  Min.   :0.0000   Min.   : 5.0  
##  1st Qu.:0.0000   1st Qu.:18.0  
##  Median :0.0000   Median :23.0  
##  Mean   :0.4111   Mean   :21.5  
##  3rd Qu.:1.0000   3rd Qu.:25.0  
##  Max.   :1.0000   Max.   :30.0  
##  NA's   :70       NA's   :78
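
The NA's row of the summary shows that both columns contain missing values. As a minimal sketch of how those counts could be checked directly, the block below uses a small invented data frame (`toy` is hypothetical and only stands in for `career_df`):

```r
# Hypothetical stand-in for career_df; the values are invented for illustration.
toy <- data.frame(
  gender  = c(1, 0, NA, 0, 1),
  lifesat = c(20, NA, NA, 7, 22)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows where both variables are present,
# i.e. rows that could be used in a two-group comparison
n_complete <- sum(complete.cases(toy))
print(n_complete)
```

Running the same two calls on `career_df` should reproduce the NA counts reported by `summary()`.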

2. Cleaning up the dataset

First, I created a histogram of lifesat to see whether there were outliers that might bias my analysis.

career_df %>%
  filter(lifesat>0) %>%
  ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 1 # range / 25

I noticed some outliers below a lifesat score of 10, so I trimmed them off so they would not bias my analysis.

career_df_trim <- career_df %>%
  filter(lifesat>9)
glimpse(career_df_trim)
## Observations: 658
## Variables: 2
## $ gender  (int) 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, ...
## $ lifesat (int) 20, 18, 25, 23, 25, 22, 21, 29, 26, 26, 18, 28, 21, 21...
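
Note that `filter()` keeps only rows where the condition is TRUE, so rows with a missing lifesat are dropped together with the low scores. Of the 756 original rows, 78 had a missing lifesat, and 658 remain after trimming, which implies that 20 rows were removed for having a score below 10. The sketch below illustrates this behavior on a small invented data frame (`toy`, `trimmed`, `n_missing`, and `n_low` are hypothetical names, not part of the original analysis):

```r
library(dplyr)

# Hypothetical data; the values are invented for illustration.
toy <- data.frame(
  gender  = c(1, 0, 0, NA, 1, 0),
  lifesat = c(20, 7, NA, 25, 9, 30)
)

# Same filter as in the analysis: keeps rows where lifesat > 9 is TRUE,
# which silently drops rows where lifesat is NA
trimmed <- toy %>% filter(lifesat > 9)

# Account for every dropped row
n_missing <- sum(is.na(toy$lifesat))          # dropped because lifesat is NA
n_low     <- sum(toy$lifesat <= 9, na.rm = TRUE)  # dropped as low scores

print(nrow(toy) - nrow(trimmed))  # total rows removed
print(n_missing + n_low)          # should match the line above
```

Rows with a missing gender but a valid lifesat survive this filter; they are only excluded later, when a function such as `t.test()` needs both variables.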

Next, I created a new plot to double-check my work.

career_df_trim %>%
  filter(lifesat>0) %>%
  ggvis(~lifesat) %>% layer_histograms()
## Guessing width = 0.5 # range / 40

Yes, the data looks better!

3. Testing the Hypothesis

Now I am ready to assess whether gender differences in mean life satisfaction are evident among working professionals. To do this, I conduct a two-sample t test, which enables me to test the null hypothesis that there is no difference by gender in mean life satisfaction (as measured by “lifesat”):

\(H_{0}: \mu_{malelifesat} - \mu_{femalelifesat} = 0\)

The alternative hypothesis is:

\(H_{1}: \mu_{malelifesat} - \mu_{femalelifesat} \neq 0\)

I set \(\alpha\) = 0.05, which means that if the null hypothesis is true, the probability of my making a Type I error (rejecting it anyway) is 5%. (Note: Male = 0 and Female = 1.)

t.test(lifesat ~ gender, career_df_trim, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  lifesat by gender
## t = -1.725, df = 653, p-value = 0.08501
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.50762112  0.09753974
## sample estimates:
## mean in group 0 mean in group 1 
##        21.63824        22.34328

I found that my p value, .085, is greater than .05 (my \(\alpha\)), so I fail to reject the null hypothesis. In other words, this sample does not provide statistically significant evidence that mean life satisfaction differs between men and women: gender differences in mean life satisfaction are not evident among working professionals at the .05 level. In this sample, men had an average life satisfaction value of 21.6 and women had an average of 22.3, so women reported slightly higher life satisfaction than men, but a difference of this size could plausibly be due to sampling variation. The 95% confidence interval for the true mean difference (men minus women) runs from -1.51 to 0.10 and includes 0, which is consistent with this conclusion.
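
As a quick check of the decision rule, the sketch below applies it to the numbers reported in the t.test() output above. The variable names (`alpha`, `p_value`, `ci`, and so on) are illustrative, and the values are copied by hand from that output rather than recomputed from the data:

```r
# Values copied from the t.test() output above
alpha   <- 0.05
p_value <- 0.08501
ci      <- c(-1.50762112, 0.09753974)  # 95% CI for mean(group 0) - mean(group 1)

# Reject the null hypothesis only if the p value is below alpha
reject_null <- p_value < alpha
print(reject_null)

# An interval that contains 0 is consistent with "no difference"
ci_spans_zero <- ci[1] < 0 && ci[2] > 0
print(ci_spans_zero)

# Difference in sample means (men minus women); it sits at the
# midpoint of the confidence interval
mean_diff <- 21.63824 - 22.34328
print(mean_diff)
print(mean(ci))
```

The two checks agree: the p value is above \(\alpha\), and the confidence interval includes 0, so the null hypothesis is not rejected.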