For this assignment, I tested hypotheses about variations in work experience by sex and race/ethnicity.
First, I downloaded the necessary data set from NLS, then opened it in RStudio and cleaned it up.
new_data <- read.table('assignment2.dat', sep=' ')
names(new_data) <- c('R0000100','R0536300','R1482600','T6651300','Z9065500','Z9065700')
# Handle missing values
new_data[new_data == -1] <- NA # Refused
new_data[new_data == -2] <- NA # Don't know
new_data[new_data == -3] <- NA # Invalid missing
new_data[new_data == -4] <- NA # Valid missing
new_data[new_data == -5] <- NA # Non-interview
# If there are values not categorized they will be represented as NA
vallabels = function(data) {
data$R0000100 <- cut(data$R0000100, c(0.0,1.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0,9999.0), labels=c("0","1 TO 999","1000 TO 1999","2000 TO 2999","3000 TO 3999","4000 TO 4999","5000 TO 5999","6000 TO 6999","7000 TO 7999","8000 TO 8999","9000 TO 9999"), right=FALSE)
data$R0536300 <- factor(data$R0536300, levels=c(1.0,2.0,0.0), labels=c("Male","Female","No Information"))
data$R1482600 <- factor(data$R1482600, levels=c(1.0,2.0,3.0,4.0), labels=c("Black","Hispanic","Mixed Race (Non-Hispanic)","Non-Black / Non-Hispanic"))
data$T6651300 <- factor(data$T6651300, levels=c(26.0,27.0,28.0,29.0,30.0,31.0,32.0), labels=c("26","27","28","29","30","31","32"))
data$Z9065500 <- cut(data$Z9065500, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
data$Z9065700 <- cut(data$Z9065700, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
return(data)
}
varlabels <- c( "PUBID - YTH ID CODE 1997",
"KEY!SEX (SYMBOL) 1997",
"KEY!RACE_ETHNICITY (SYMBOL) 1997",
"CV_AGE_INT_DATE 2011",
"CVC_HOURS_WK_TEEN",
"CVC_HOURS_WK_ADULT"
)
# Use qnames rather than rnums
qnames = function(data) {
names(data) <- c("ID","Sex","Ethnicity","Interview_age","Cum_Hrs_Wkd_Teen","Cum_Hrs_Wkd_Adult")
return(data)
}
#********************************************************************************************************
# Create a data file called "categories" with value labels.
categories <- vallabels(new_data)
# Rename variables using Qnames instead of Reference Numbers
new_data <- qnames(new_data)
categories <- qnames(categories)
# Produce summaries for the raw (uncategorized) data file
summary(new_data)
## ID Sex Ethnicity Interview_age
## Min. : 1 Min. :1.000 Min. :1.000 Min. :26.00
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:28.00
## Median :4502 Median :1.000 Median :4.000 Median :29.00
## Mean :4504 Mean :1.488 Mean :2.788 Mean :28.79
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:30.00
## Max. :9022 Max. :2.000 Max. :4.000 Max. :32.00
## NA's :1561
## Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
## Min. : 0 Min. : 0
## 1st Qu.: 1255 1st Qu.: 8190
## Median : 2741 Median :16396
## Mean : 3105 Mean :15595
## 3rd Qu.: 4470 3rd Qu.:22418
## Max. :18829 Max. :63722
## NA's :707 NA's :1596
# Remove the '#' before the following line to produce summaries for the "categories" data file.
# ("categories" was already created above; vallabels() expects the original reference-number column names, so it is not re-run here.)
#summary(categories)
#************************************************************************************************************
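The five recoding lines near the top of the script each replace one negative NLS code with NA. As a minimal sketch of the same idea on a made-up data frame (`toy` and `missing_codes` are hypothetical names, not part of the NLS script), the codes can be handled in one pass:

```r
# Toy data: -1, -4 and -5 stand in for NLS missing-value codes (hypothetical values)
toy <- data.frame(id = 1:5, hours = c(1200, -1, -4, 0, -5))

# The five NLS missing codes used in the script above
missing_codes <- c(-1, -2, -3, -4, -5)

# Replace any missing code with NA, column by column, keeping the data frame shape
toy[] <- lapply(toy, function(col) replace(col, col %in% missing_codes, NA))

toy$hours
```

A true zero (no hours worked) is kept as 0 and is not treated as missing, which matters later when zero-hour cases are filtered out on purpose.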
Next, I loaded the R packages I would need for this analysis:
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(ggvis)
## Loading required package: ggvis
Then I took a look at the data and the associated code book to see what I had.
youth <- tbl_df(new_data)
youth
## Source: local data frame [8,984 x 6]
##
## ID Sex Ethnicity Interview_age Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
## (int) (int) (int) (int) (int) (int)
## 1 1 2 4 29 5831 NA
## 2 2 1 2 29 NA 29712
## 3 3 2 2 28 6489 NA
## 4 4 2 2 30 3292 23390
## 5 5 1 2 29 680 28056
## 6 6 2 2 29 NA 18379
## 7 7 1 2 28 1650 NA
## 8 8 2 4 30 2082 18419
## 9 9 1 4 29 864 19274
## 10 10 1 4 NA 0 6849
## .. ... ... ... ... ... ...
glimpse(youth)
## Observations: 8,984
## Variables: 6
## $ ID (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Sex (int) 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2,...
## $ Ethnicity (int) 4, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2,...
## $ Interview_age (int) 29, 29, 28, 30, 29, 29, 28, 30, 29, NA, 29, ...
## $ Cum_Hrs_Wkd_Teen (int) 5831, NA, 6489, 3292, 680, NA, 1650, 2082, 8...
## $ Cum_Hrs_Wkd_Adult (int) NA, 29712, NA, 23390, 28056, 18379, NA, 1841...
summary(youth)
## ID Sex Ethnicity Interview_age
## Min. : 1 Min. :1.000 Min. :1.000 Min. :26.00
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:28.00
## Median :4502 Median :1.000 Median :4.000 Median :29.00
## Mean :4504 Mean :1.488 Mean :2.788 Mean :28.79
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:30.00
## Max. :9022 Max. :2.000 Max. :4.000 Max. :32.00
## NA's :1561
## Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
## Min. : 0 Min. : 0
## 1st Qu.: 1255 1st Qu.: 8190
## Median : 2741 Median :16396
## Mean : 3105 Mean :15595
## 3rd Qu.: 4470 3rd Qu.:22418
## Max. :18829 Max. :63722
## NA's :707 NA's :1596
The code book showed that there were no non-responses or invalid values for Sex or Ethnicity.
To restrict the data to the 30 year olds, I used the first line of code shown below, then I took a “glimpse” of the data to be sure I had the correct variables.
trim_30 <- filter(youth, Interview_age == 30)
glimpse(trim_30)
## Observations: 1,546
## Variables: 6
## $ ID (int) 4, 8, 26, 27, 32, 33, 38, 55, 59, 68, 69, 78...
## $ Sex (int) 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2,...
## $ Ethnicity (int) 2, 4, 1, 1, 4, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4,...
## $ Interview_age (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen (int) 3292, 2082, 760, 1592, 4611, 1862, 0, 1368, ...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 0, 11060, 16981, 26551, NA, 16...
summary(trim_30)
## ID Sex Ethnicity Interview_age
## Min. : 4 Min. :1.000 Min. :1.000 Min. :30
## 1st Qu.:2528 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:30
## Median :4798 Median :2.000 Median :3.000 Median :30
## Mean :4712 Mean :1.504 Mean :2.732 Mean :30
## 3rd Qu.:7014 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:30
## Max. :9009 Max. :2.000 Max. :4.000 Max. :30
##
## Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
## Min. : 0 Min. : 0
## 1st Qu.: 1423 1st Qu.:13189
## Median : 2925 Median :20847
## Mean : 3252 Mean :19444
## 3rd Qu.: 4510 3rd Qu.:25961
## Max. :14334 Max. :63722
## NA's :130 NA's :334
Before I could proceed to question #3, I needed to clean up the data.
I noticed that there was a huge jump between the third-quartile value and the maximum for both Cum_Hrs_Wkd_Teen and Cum_Hrs_Wkd_Adult, which made me think there might be outliers that needed to be trimmed out. I created histograms to give me a better idea of what was going on.
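As a rough numeric cross-check on that jump, the sketch below applies the common 1.5 × IQR upper-fence rule to the quartiles printed by summary(trim_30). This rule is a swapped-in technique for illustration, not the method used in this report, and `upper_fence` is a hypothetical helper name.

```r
# Upper fence under the 1.5 * IQR rule (an assumption, not the report's own cutoff method)
upper_fence <- function(q1, q3, k = 1.5) q3 + k * (q3 - q1)

# Quartiles and maxima copied from summary(trim_30) above
teen_fence  <- upper_fence(q1 = 1423,  q3 = 4510)
adult_fence <- upper_fence(q1 = 13189, q3 = 25961)

teen_max  <- 14334
adult_max <- 63722

# TRUE means the maximum lies beyond the fence, i.e. it would be flagged
c(teen = teen_max > teen_fence, adult = adult_max > adult_fence)
```

Both maxima fall beyond their fences, which agrees with the impression that the upper tails deserve a closer look in a histogram.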
First, I did this for Cum_Hrs_Wkd_Teen to prepare for question #3.
trim_30 %>%
filter(Cum_Hrs_Wkd_Teen>0) %>%
ggvis(~Cum_Hrs_Wkd_Teen) %>% layer_histograms()
## Guessing width = 500 # range / 29
I saw that there were just a small number of data points beyond about 11,500 hours worked that would make the mean seem artificially high. I also wanted to look at only those data points where the individual worked more than 0 hours.
trim_30_teen <- trim_30 %>%
filter(Cum_Hrs_Wkd_Teen>0, Cum_Hrs_Wkd_Teen<11500)
glimpse(trim_30_teen)
## Observations: 1,353
## Variables: 6
## $ ID (int) 4, 8, 26, 27, 32, 33, 55, 59, 68, 69, 78, 80...
## $ Sex (int) 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1,...
## $ Ethnicity (int) 2, 4, 1, 1, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4, 4,...
## $ Interview_age (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen (int) 3292, 2082, 760, 1592, 4611, 1862, 1368, 564...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 0, 11060, 16981, 26551, 16934,...
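The filter above removes more than zeros and large values: rows where Cum_Hrs_Wkd_Teen is NA are dropped too, because the condition evaluates to NA for them. A minimal sketch on made-up values, using base R subset() as a stand-in for filter() (both drop rows whose condition is NA), followed by a count check with the numbers reported above:

```r
# Made-up values: a zero, two ordinary cases, an NA, an outlier, and a boundary value
toy <- data.frame(Cum_Hrs_Wkd_Teen = c(0, 760, NA, 3292, 14334, 11500))

# Same condition as the report; subset() drops NA rows just as filter() does
kept <- subset(toy, Cum_Hrs_Wkd_Teen > 0 & Cum_Hrs_Wkd_Teen < 11500)
kept$Cum_Hrs_Wkd_Teen

# Counts reported above: rows in trim_30, NA's in the teen variable, rows in trim_30_teen
rows_total <- 1546
rows_na    <- 130
rows_kept  <- 1353

# Rows removed for being zero or at/above 11,500 hours
dropped_zero_or_high <- rows_total - rows_na - rows_kept
dropped_zero_or_high
```

Note that the strict `<` also excludes a value of exactly 11,500.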
I was finally ready to run my t test for #3, which would test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 14 through age 19:
\(H_{0}: \mu_{teenfemalecum} - \mu_{teenmalecum} = 0\)
Alternatively
\(H_{1}: \mu_{teenfemalecum} - \mu_{teenmalecum} \neq 0\)
I set \(\alpha\) = 0.05. (Note: Male = 1 and Female = 2.)
t.test(Cum_Hrs_Wkd_Teen ~ Sex, trim_30_teen, var.equal=TRUE)
##
## Two Sample t-test
##
## data: Cum_Hrs_Wkd_Teen by Sex
## t = 3.8084, df = 1351, p-value = 0.0001461
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 217.7845 680.4907
## sample estimates:
## mean in group 1 mean in group 2
## 3543.652 3094.515
I found that my p value, .0001461, is <.05 (my \(\alpha\)), so I rejected the null hypothesis. In other words, the mean cumulative number of hours worked as a teen differs between men and women in this group of 30 year olds. In this sample, men worked an average of 3,544 hours and women worked an average of 3,095 hours as teens, meaning that men worked an average of 449 more hours than women. I am 95% confident that the true mean difference in their cumulative work hours is between 218 and 680 hours.
To prepare for question #4, I next created a histogram to give me a better idea of whether there were outliers I needed to cull from Cum_Hrs_Wkd_Adult.
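The pieces of the t test output above are tied together by simple arithmetic, which makes a quick consistency check possible. The sketch below uses only the printed means, confidence interval, and df; recovering the standard error from the interval's half-width is my own reconstruction, not something t.test() prints.

```r
# Values copied from the t test output above
mean_male   <- 3543.652
mean_female <- 3094.515
ci          <- c(217.7845, 680.4907)
df          <- 1351

# Difference in means (group 1 minus group 2)
mean_diff <- mean_male - mean_female

# A 95% CI is mean_diff +/- qt(0.975, df) * SE, so SE can be backed out of the half-width
half_width <- (ci[2] - ci[1]) / 2
se         <- half_width / qt(0.975, df)

# t statistic and two-sided p value implied by those numbers
t_stat  <- mean_diff / se
p_value <- 2 * pt(-abs(t_stat), df)

round(c(mean_diff = mean_diff, t = t_stat), 3)
```

The recovered t statistic and p value should agree with the printed t = 3.8084 and p = 0.0001461 up to rounding.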
trim_30 %>%
filter(Cum_Hrs_Wkd_Adult>0) %>%
ggvis(~Cum_Hrs_Wkd_Adult) %>% layer_histograms()
## Guessing width = 2000 # range / 32
The resulting histogram shows that there are outliers (above 40,000 hours) that would make the mean artificially high. I also want to look at only those data points where the individual worked more than 0 hours.
trim_30_adult <- trim_30 %>%
filter(Cum_Hrs_Wkd_Adult>0, Cum_Hrs_Wkd_Adult<40000)
glimpse(trim_30_adult)
## Observations: 1,188
## Variables: 6
## $ ID (int) 4, 8, 27, 32, 33, 55, 78, 80, 83, 86, 102, 1...
## $ Sex (int) 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,...
## $ Ethnicity (int) 2, 4, 1, 4, 4, 2, 4, 1, 2, 4, 2, 4, 4, 4, 4,...
## $ Interview_age (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen (int) 3292, 2082, 1592, 4611, 1862, 1368, 365, 312...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 11060, 16981, 26551, 16934, 60...
Then I was ready to run my t test for #4, which would test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 20 and older:
\(H_{0}: \mu_{adultfemalecum} - \mu_{adultmalecum} = 0\)
Alternatively
\(H_{1}: \mu_{adultfemalecum} - \mu_{adultmalecum} \neq 0\)
I set \(\alpha\) = 0.05. (Note: Male = 1 and Female = 2.)
t.test(Cum_Hrs_Wkd_Adult ~ Sex, trim_30_adult, var.equal=TRUE)
##
## Two Sample t-test
##
## data: Cum_Hrs_Wkd_Adult by Sex
## t = 4.969, df = 1186, p-value = 7.717e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1489.491 3433.146
## sample estimates:
## mean in group 1 mean in group 2
## 20893.26 18431.95
I find that my p value, .0000008, is <.05 (my \(\alpha\)), so I reject the null hypothesis. In other words, the mean cumulative number of hours worked as an adult differs between men and women in this group of 30 year olds. In this sample, men worked an average of 20,893 hours as adults and women worked an average of 18,432 hours as adults, meaning that men worked an average of 2,461 more hours than women as adults. I am 95% confident that the true mean difference in their cumulative work hours is between 1,489 and 3,433 hours.
To prepare for questions #5 and #6, I needed to re-code the Ethnicity values 1 through 3 as “1” and value 4 as “0.” This would group the Black, Hispanic, and mixed-race (non-Hispanic) respondents into a single category. I did this by using an ifelse statement that says “if Ethnicity is 4, recode it to 0…if not, recode it to 1.” Then I looked at the recoded variable to make sure it worked. It did!
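Both sex comparisons use var.equal=TRUE, which pools the two group variances. As a minimal sketch on simulated data (the group sizes and standard deviations are made-up assumptions, and `sim` is a hypothetical name), the pooled test can be compared with Welch's test, which t.test() runs by default and which does not assume equal variances:

```r
set.seed(42)

# Simulated stand-in for trim_30_adult; the means echo the report, the sds are invented
sim <- data.frame(
  Sex = rep(c(1, 2), each = 500),
  Cum_Hrs_Wkd_Adult = c(rnorm(500, mean = 20900, sd = 8000),
                        rnorm(500, mean = 18400, sd = 8500))
)

# Pooled-variance test, as in the report
pooled <- t.test(Cum_Hrs_Wkd_Adult ~ Sex, data = sim, var.equal = TRUE)

# Welch test: the default when var.equal is not set
welch <- t.test(Cum_Hrs_Wkd_Adult ~ Sex, data = sim)

# Compare degrees of freedom; Welch's are never larger than the pooled n1 + n2 - 2
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
```

With groups of similar size and spread the two tests usually lead to the same conclusion, so running both is a cheap robustness check.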
table(youth$Ethnicity)
##
## 1 2 3 4
## 2335 1901 83 4665
youth$EthnicityNew <- ifelse(youth$Ethnicity == 4, 0, 1)
table(youth$EthnicityNew)
##
## 0 1
## 4665 4319
summary(youth)
## ID Sex Ethnicity Interview_age
## Min. : 1 Min. :1.000 Min. :1.000 Min. :26.00
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:28.00
## Median :4502 Median :1.000 Median :4.000 Median :29.00
## Mean :4504 Mean :1.488 Mean :2.788 Mean :28.79
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:30.00
## Max. :9022 Max. :2.000 Max. :4.000 Max. :32.00
## NA's :1561
## Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult EthnicityNew
## Min. : 0 Min. : 0 Min. :0.0000
## 1st Qu.: 1255 1st Qu.: 8190 1st Qu.:0.0000
## Median : 2741 Median :16396 Median :0.0000
## Mean : 3105 Mean :15595 Mean :0.4807
## 3rd Qu.: 4470 3rd Qu.:22418 3rd Qu.:1.0000
## Max. :18829 Max. :63722 Max. :1.0000
## NA's :707 NA's :1596
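One way to confirm a recode like this is to cross-tabulate the old and new variables. The sketch below rebuilds an Ethnicity vector from the counts printed by table(youth$Ethnicity) above (the ordering of the rebuilt vector is arbitrary and is an assumption for illustration) and checks that each original code lands in exactly one new category:

```r
# Rebuild a vector with the same category counts as table(youth$Ethnicity)
Ethnicity <- rep(1:4, times = c(2335, 1901, 83, 4665))

# Same recode as the report: 4 becomes 0, everything else becomes 1
EthnicityNew <- ifelse(Ethnicity == 4, 0, 1)

# Rows are original codes, columns are recoded values
xtab <- table(Ethnicity, EthnicityNew)
xtab
```

Codes 1 through 3 should appear only in column 1 (2335 + 1901 + 83 = 4319) and code 4 only in column 0 (4665), matching the table of EthnicityNew above.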
Then I was ready to run my t test for #5, which would test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 14 through age 19:
\(H_{0}: \mu_{teenothercum} - \mu_{teenwhitecum} = 0\)
Alternatively
\(H_{1}: \mu_{teenothercum} - \mu_{teenwhitecum} \neq 0\)
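I set \(\alpha\) = 0.05. (Note: Non-Black, non-Hispanic = 0 and other ethnicity = 1.) I passed trim_30_teen as the data argument, but because the formula below refers to youth$Cum_Hrs_Wkd_Teen and youth$EthnicityNew directly, t.test() takes both variables from the full youth data rather than from trim_30_teen.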
t.test(youth$Cum_Hrs_Wkd_Teen ~ youth$EthnicityNew, trim_30_teen, var.equal=TRUE)
##
## Two Sample t-test
##
## data: youth$Cum_Hrs_Wkd_Teen by youth$EthnicityNew
## t = 14.884, df = 8275, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 669.6223 872.7625
## sample estimates:
## mean in group 0 mean in group 1
## 3474.743 2703.551
I find that my p value, reported as less than 2.2e-16 (a VERY small number), is <.05 (my \(\alpha\)), so I reject the null hypothesis. In other words, the mean cumulative number of hours worked as a teen differs between non-Black, non-Hispanic respondents and respondents of “other” races/ethnicities. The df of 8275 shows that this test ran on all 8,277 respondents with non-missing teen hours (regardless of interview age, with zero-hour cases and outliers included), not on the trimmed group of 30 year olds. In this sample, non-Black, non-Hispanic respondents worked an average of 3,475 hours as teens and respondents of other races/ethnicities worked an average of 2,704 hours as teens, meaning that non-Black, non-Hispanic respondents worked an average of 771 more hours as teens. I am 95% confident that the true mean difference in their cumulative work hours is between 670 and 873 hours.
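The t.test() call for #5 names youth$ columns inside the formula while also passing trim_30_teen as the data. The sketch below, on simulated data with hypothetical names (`youth_toy`, `trim_toy`), shows how the two styles of formula behave: when the formula spells out `youth_toy$...`, the data argument has no effect, whereas bare column names are looked up in the trimmed data. The recoded variable is created before trimming so that the trimmed data carries it.

```r
set.seed(1)
n <- 2000

# Simulated stand-in for youth (all values are made up)
youth_toy <- data.frame(
  Interview_age    = sample(26:32, n, replace = TRUE),
  Ethnicity        = sample(1:4, n, replace = TRUE),
  Cum_Hrs_Wkd_Teen = round(runif(n, 0, 12000))
)

# Recode first, so the column exists in any subset taken afterwards
youth_toy$EthnicityNew <- ifelse(youth_toy$Ethnicity == 4, 0, 1)

# Same trimming rules as the report
trim_toy <- subset(youth_toy,
                   Interview_age == 30 & Cum_Hrs_Wkd_Teen > 0 & Cum_Hrs_Wkd_Teen < 11500)

# Formula names youth_toy$ columns: the data argument is effectively ignored
ignored <- t.test(youth_toy$Cum_Hrs_Wkd_Teen ~ youth_toy$EthnicityNew,
                  trim_toy, var.equal = TRUE)

# Formula uses bare names: columns come from trim_toy
used <- t.test(Cum_Hrs_Wkd_Teen ~ EthnicityNew, data = trim_toy, var.equal = TRUE)

# df = rows used minus 2, which reveals which data each call actually used
c(ignored_df = unname(ignored$parameter), used_df = unname(used$parameter))
```

Checking the printed df against the expected number of rows is a quick way to confirm that a test ran on the intended subset.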
Using my re-coded Ethnicity variable, EthnicityNew, I was ready to run my t test for #6, which would test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 20 and older:
\(H_{0}: \mu_{adultothercum} - \mu_{adultwhitecum} = 0\)
Alternatively
\(H_{1}: \mu_{adultothercum} - \mu_{adultwhitecum} \neq 0\)
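I set \(\alpha\) = 0.05. (Note: Non-Black, non-Hispanic = 0 and other ethnicity = 1.) I passed trim_30_adult as the data argument, but because the formula below refers to youth$Cum_Hrs_Wkd_Adult and youth$EthnicityNew directly, t.test() takes both variables from the full youth data rather than from trim_30_adult.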
t.test(youth$Cum_Hrs_Wkd_Adult ~ youth$EthnicityNew, trim_30_adult, var.equal=TRUE)
##
## Two Sample t-test
##
## data: youth$Cum_Hrs_Wkd_Adult by youth$EthnicityNew
## t = 3.979, df = 7386, p-value = 6.985e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 444.1286 1306.6571
## sample estimates:
## mean in group 0 mean in group 1
## 16012.93 15137.53
I find that my p value, .00006985, is <.05 (my \(\alpha\)), so I reject the null hypothesis. In other words, the mean cumulative number of hours worked as an adult differs between non-Black, non-Hispanic respondents and respondents of “other” races/ethnicities. The df of 7386 shows that this test ran on all 7,388 respondents with non-missing adult hours (regardless of interview age, with zero-hour cases and outliers included), not on the trimmed group of 30 year olds. In this sample, non-Black, non-Hispanic respondents worked an average of 16,013 hours as adults and respondents of other races/ethnicities worked an average of 15,138 hours as adults, meaning that non-Black, non-Hispanic respondents worked an average of 875 more hours as adults. I am 95% confident that the true mean difference in their cumulative work hours is between 444 and 1,307 hours.