Data for this assignment came from the NLS Investigator, which can be accessed here.
1.) To begin, I have created a new project in R and have saved the data file from Piazza into my project folder.
test_data <- read.csv(file = "pwces.csv")
I also loaded a few packages that I anticipated needing to work with:
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(ggvis)
## Loading required package: ggvis
To see what I am working with, I printed the data as a table data frame and ran the ‘glimpse’ command to view the variables, their names, and their types:
tbl_df(test_data)
## Source: local data frame [756 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 7
## 5 0 23
## 6 0 25
## 7 1 22
## 8 NA NA
## 9 0 21
## 10 0 29
## .. ... ...
glimpse(test_data)
## Observations: 756
## Variables: 2
## $ gender (int) 1, 1, 0, 0, 0, 0, 1, NA, 0, 0, 1, 0, 1, 0, 1, 1, NA, 0...
## $ lifesat (int) 20, 18, 25, 7, 23, 25, 22, NA, 21, 29, 26, 26, 18, 28,...
2.) Next, I filtered the data to eliminate the missing values, which are designated as “NA” in the .csv file. To do this, I created a subset of the data that includes only ‘gender’ values of ‘0’ or ‘1’ and ‘lifesat’ values of ‘1’ or greater. Rows with NA in either variable are dropped because a comparison with NA evaluates to NA, which subset() treats as false:
trim_data <- subset(test_data, gender >= 0 & lifesat >= 1)
tbl_df(trim_data)
## Source: local data frame [675 x 2]
##
## gender lifesat
## (int) (int)
## 1 1 20
## 2 1 18
## 3 0 25
## 4 0 7
## 5 0 23
## 6 0 25
## 7 1 22
## 8 0 21
## 9 0 29
## 10 1 26
## .. ... ...
This subset includes a sample of 675 respondents. This will be the data used for the remainder of the analysis.
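The same trimming can also be done by testing for missing values directly rather than relying on how subset() handles NA comparisons. The following is a minimal sketch on a small made-up data frame; `toy_data` and its values are hypothetical and are not taken from pwces.csv.

```r
# Hypothetical toy data with the same column names as the assignment file
toy_data <- data.frame(
  gender  = c(1, 0, NA, 0, 1),
  lifesat = c(20, 25, NA, NA, 22)
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy_data))
print(na_counts)

# Keep only rows where both variables are present
toy_trim <- toy_data[complete.cases(toy_data), ]

# The subset() approach keeps the same rows, because a comparison
# with NA is NA and subset() treats NA conditions as FALSE
toy_trim2 <- subset(toy_data, gender >= 0 & lifesat >= 1)

print(nrow(toy_trim))
print(nrow(toy_trim2))
```

Comparing the row counts before and after trimming against the NA counts is a quick check that only incomplete rows were removed.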
Just to be sure that my analysis to this point is on track, I ran a summary of the data to see that the min, max and mean information looked correct, indicating that I have trimmed the data correctly:
summary(trim_data)
## gender lifesat
## Min. :0.0000 Min. : 5.00
## 1st Qu.:0.0000 1st Qu.:18.00
## Median :0.0000 Median :23.00
## Mean :0.4104 Mean :21.48
## 3rd Qu.:1.0000 3rd Qu.:25.00
## Max. :1.0000 Max. :30.00
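I can see that I have a minimum gender value of ‘0’ and a maximum gender value of ‘1’, which is as I would expect. Because gender is coded 0/1, its mean of .41 is simply the proportion of respondents coded ‘1’ (about 41%), which looks about right. The lifesat values look accurate as well, with a minimum of 5 and a maximum of 30, and no NA counts are reported for either variable.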
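The mean of a 0/1 variable is the share of observations coded 1, which can be checked directly alongside the group counts. A small sketch, using a made-up vector (`toy_gender` is hypothetical, not the assignment data):

```r
# Hypothetical 0/1 indicator, for illustration only
toy_gender <- c(1, 1, 0, 0, 0, 0, 1, 0, 0, 0)

# Counts per code show how many observations fall in each group
group_counts <- table(toy_gender)
print(group_counts)

# Proportions per code; the share coded 1 equals the mean of the vector
group_shares <- prop.table(group_counts)
print(group_shares)
print(mean(toy_gender))
```

Running `table(trim_data$gender)` on the trimmed data would give the group sizes that go into the t test.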
With this confidence in my work to this point, I test the null hypothesis that mean life satisfaction among working professionals does not differ by gender. My null and alternative hypotheses are stated below:
Null Hypothesis: There is no difference, by gender, between the mean life satisfaction survey responses.
Alternative Hypothesis: The mean life satisfaction survey responses differ by gender.
To test the null hypothesis, I use a two-sample t test, which evaluates the difference in mean life satisfaction survey responses between males and females. I set alpha to .05, meaning that I am willing to accept a Type I error rate of .05 (a 5% chance of rejecting the null hypothesis when it is true).
Note that when running a t test in R I organize the variables as (dependent variable ~ independent variable), with the ‘~’ meaning ‘depends on’. In other words, the t test helps me evaluate whether the mean of the dependent variable differs between the groups defined by the independent variable.
t.test(lifesat ~ gender, data = trim_data, var.equal = TRUE)
##
## Two Sample t-test
##
## data: lifesat by gender
## t = -1.349, df = 673, p-value = 0.1778
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.4704615 0.2727582
## sample estimates:
## mean in group 0 mean in group 1
## 21.23869 21.83755
The results of the t test indicate a t value of -1.349. The negative sign only reflects the order of subtraction (group 0 minus group 1) and isn’t a concern. The magnitude means that the observed difference in means (about 0.60 points) is only about 1.35 times its standard error, which is too small to distinguish from sampling variability at my chosen alpha. Consistent with this, the 95 percent confidence interval (-1.47 to 0.27) includes 0, and the p-value of .1778 is greater than my established alpha of .05. Given the results of this t test, I fail to reject the null hypothesis.
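To make the link between the t value, the difference in means, and the standard error concrete, the pooled (equal-variance) t statistic can be computed by hand and compared with what t.test() reports. This is a sketch on simulated data; the data frame `sim`, its group sizes, and its distribution parameters are assumptions chosen for illustration and are not the assignment data.

```r
# Simulated data for illustration only; not the assignment data
set.seed(1)
sim <- data.frame(
  gender  = rep(c(0, 1), times = c(40, 30)),
  lifesat = c(rnorm(40, mean = 21, sd = 5), rnorm(30, mean = 22, sd = 5))
)

x0 <- sim$lifesat[sim$gender == 0]
x1 <- sim$lifesat[sim$gender == 1]
n0 <- length(x0)
n1 <- length(x1)

# Pooled variance: weighted average of the two group variances
sp2 <- ((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2)

# Standard error of the difference in means
se <- sqrt(sp2 * (1 / n0 + 1 / n1))

# t statistic: group 0 mean minus group 1 mean, divided by the standard error
t_manual <- (mean(x0) - mean(x1)) / se

# Two-sided p-value from the t distribution with n0 + n1 - 2 degrees of freedom
p_manual <- 2 * pt(-abs(t_manual), df = n0 + n1 - 2)

# Compare with the built-in equal-variance test
fit <- t.test(lifesat ~ gender, data = sim, var.equal = TRUE)
print(all.equal(unname(fit$statistic), t_manual))
print(all.equal(fit$p.value, p_manual))
```

The hand calculation shows why a t value near 1 is weak evidence: the statistic is the difference in means expressed in units of its own standard error.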