The Assignment

For this assignment, I tested hypotheses about variations in work experience by sex and race/ethnicity.

Findings

1. Obtaining the dataset

First, I downloaded the necessary data set from NLS, then opened it in RStudio and cleaned it up.

new_data <- read.table('assignment2.dat', sep=' ')
names(new_data) <- c('R0000100','R0536300','R1482600','T6651300','Z9065500','Z9065700')

# Handle missing values
  new_data[new_data == -1] = NA  # Refused 
  new_data[new_data == -2] = NA  # Don't know 
  new_data[new_data == -3] = NA  # Invalid missing 
  new_data[new_data == -4] = NA  # Valid missing 
  new_data[new_data == -5] = NA  # Non-interview 

# If there are values not categorized they will be represented as NA
vallabels = function(data) {
  data$R0000100 <- cut(data$R0000100, c(0.0,1.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0,9999.0), labels=c("0","1 TO 999","1000 TO 1999","2000 TO 2999","3000 TO 3999","4000 TO 4999","5000 TO 5999","6000 TO 6999","7000 TO 7999","8000 TO 8999","9000 TO 9999"), right=FALSE)
  data$R0536300 <- factor(data$R0536300, levels=c(1.0,2.0,0.0), labels=c("Male","Female","No Information"))
  data$R1482600 <- factor(data$R1482600, levels=c(1.0,2.0,3.0,4.0), labels=c("Black","Hispanic","Mixed Race (Non-Hispanic)","Non-Black / Non-Hispanic"))
  data$T6651300 <- factor(data$T6651300, levels=c(26.0,27.0,28.0,29.0,30.0,31.0,32.0), labels=c("26","27","28","29","30","31","32"))
  data$Z9065500 <- cut(data$Z9065500, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  data$Z9065700 <- cut(data$Z9065700, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  return(data)
}

varlabels <- c(    "PUBID - YTH ID CODE 1997",
    "KEY!SEX (SYMBOL) 1997",
    "KEY!RACE_ETHNICITY (SYMBOL) 1997",
    "CV_AGE_INT_DATE 2011",
    "CVC_HOURS_WK_TEEN",
    "CVC_HOURS_WK_ADULT"
)

# Use qnames rather than rnums
qnames = function(data) {
  names(data) <- c("ID","Sex","Ethnicity","Interview_age","Cum_Hrs_Wkd_Teen","Cum_Hrs_Wkd_Adult")
  return(data)
}

#********************************************************************************************************

# Create a data file called "categories" with value labels.
categories <- vallabels(new_data)

# Rename variables using Qnames instead of Reference Numbers
new_data <- qnames(new_data)
categories <- qnames(categories)

# Produce summaries for the raw (uncategorized) data file
summary(new_data)
##        ID            Sex          Ethnicity     Interview_age  
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :26.00  
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:28.00  
##  Median :4502   Median :1.000   Median :4.000   Median :29.00  
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :28.79  
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:30.00  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :32.00  
##                                                 NA's   :1561   
##  Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 1255    1st Qu.: 8190    
##  Median : 2741    Median :16396    
##  Mean   : 3105    Mean   :15595    
##  3rd Qu.: 4470    3rd Qu.:22418    
##  Max.   :18829    Max.   :63722    
##  NA's   :707      NA's   :1596
# Remove the '#' before the following lines to produce summaries for the "categories" data file.
#categories <- vallabels(new_data)
#summary(categories)

#************************************************************************************************************
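
The five separate missing-value recodes at the top of this script could also be written as a single step. The following is only a minimal sketch in base R; the data frame `toy` and its values are made up for illustration and are not the NLS data.

```r
# Hypothetical toy data using the same negative missing-value codes
toy <- data.frame(age   = c(29, -5, 30, -1),
                  hours = c(1200, -3, -4, 450))

# NLS missing codes: -1 refused, -2 don't know, -3 invalid skip,
# -4 valid skip, -5 non-interview
missing_codes <- c(-1, -2, -3, -4, -5)

# Replace every missing code with NA, one column at a time
toy[] <- lapply(toy, function(col) replace(col, col %in% missing_codes, NA))

print(toy)
```

Keeping the codes in one vector makes it easy to see at a glance which values are being treated as missing.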

Next, I required the R packages I would need for this analysis:

require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
require(ggvis)
## Loading required package: ggvis

Then I took a look at the data and the associated code book to see what I had.

youth <- tbl_df(new_data)
youth
## Source: local data frame [8,984 x 6]
## 
##       ID   Sex Ethnicity Interview_age Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
##    (int) (int)     (int)         (int)            (int)             (int)
## 1      1     2         4            29             5831                NA
## 2      2     1         2            29               NA             29712
## 3      3     2         2            28             6489                NA
## 4      4     2         2            30             3292             23390
## 5      5     1         2            29              680             28056
## 6      6     2         2            29               NA             18379
## 7      7     1         2            28             1650                NA
## 8      8     2         4            30             2082             18419
## 9      9     1         4            29              864             19274
## 10    10     1         4            NA                0              6849
## ..   ...   ...       ...           ...              ...               ...
glimpse(youth)
## Observations: 8,984
## Variables: 6
## $ ID                (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Sex               (int) 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2,...
## $ Ethnicity         (int) 4, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2,...
## $ Interview_age     (int) 29, 29, 28, 30, 29, 29, 28, 30, 29, NA, 29, ...
## $ Cum_Hrs_Wkd_Teen  (int) 5831, NA, 6489, 3292, 680, NA, 1650, 2082, 8...
## $ Cum_Hrs_Wkd_Adult (int) NA, 29712, NA, 23390, 28056, 18379, NA, 1841...
summary(youth)
##        ID            Sex          Ethnicity     Interview_age  
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :26.00  
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:28.00  
##  Median :4502   Median :1.000   Median :4.000   Median :29.00  
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :28.79  
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:30.00  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :32.00  
##                                                 NA's   :1561   
##  Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 1255    1st Qu.: 8190    
##  Median : 2741    Median :16396    
##  Mean   : 3105    Mean   :15595    
##  3rd Qu.: 4470    3rd Qu.:22418    
##  Max.   :18829    Max.   :63722    
##  NA's   :707      NA's   :1596

The code book showed that there were NO non-responses or invalid values for Sex or Ethnicity. For the other variables, the NA counts in the summary match the code book:

  • 1,561 non-interviews (-5) for Interview_age
  • 707 invalid skips (-3) for Cum_Hrs_Wkd_Teen
  • 1,596 invalid skips (-3) for Cum_Hrs_Wkd_Adult
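
One way to cross-check counts like these against the code book is to tally the NA values in each column. This is a small sketch on a made-up data frame (`toy` is hypothetical and only stands in for `youth`).

```r
# Hypothetical toy data frame standing in for `youth`
toy <- data.frame(Sex              = c(1, 2, 2, 1),
                  Interview_age    = c(29, NA, 30, NA),
                  Cum_Hrs_Wkd_Teen = c(5831, NA, 6489, 3292))

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Once the negative codes have been recoded to NA, the reason for each missing value (for example -3 versus -5) is no longer visible, so a breakdown by reason would have to be tabulated before the recode.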

2. Filtering the data to include only people who were 30 years of age at the time of the 2011 interview

To trim the data to look at only the 30 year olds, I used the first line of code shown below, then I took a “glimpse” of the data to be sure I had the correct variables.

trim_30 <- filter(youth, Interview_age == 30) 
glimpse(trim_30)
## Observations: 1,546
## Variables: 6
## $ ID                (int) 4, 8, 26, 27, 32, 33, 38, 55, 59, 68, 69, 78...
## $ Sex               (int) 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2,...
## $ Ethnicity         (int) 2, 4, 1, 1, 4, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4,...
## $ Interview_age     (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen  (int) 3292, 2082, 760, 1592, 4611, 1862, 0, 1368, ...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 0, 11060, 16981, 26551, NA, 16...
summary(trim_30)
##        ID            Sex          Ethnicity     Interview_age
##  Min.   :   4   Min.   :1.000   Min.   :1.000   Min.   :30   
##  1st Qu.:2528   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:30   
##  Median :4798   Median :2.000   Median :3.000   Median :30   
##  Mean   :4712   Mean   :1.504   Mean   :2.732   Mean   :30   
##  3rd Qu.:7014   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:30   
##  Max.   :9009   Max.   :2.000   Max.   :4.000   Max.   :30   
##                                                              
##  Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 1423    1st Qu.:13189    
##  Median : 2925    Median :20847    
##  Mean   : 3252    Mean   :19444    
##  3rd Qu.: 4510    3rd Qu.:25961    
##  Max.   :14334    Max.   :63722    
##  NA's   :130      NA's   :334

Before I could proceed to question #3, I needed to clean up the data.

I noticed that there was a huge jump between the 3Q numbers and the Max number for both Cum_Hrs_Wkd_Teen and Cum_Hrs_Wkd_Adult, which made me think there may have been outliers that needed to be trimmed out. I created histograms to give me a better idea of what was going on.

First, I did this for Cum_Hrs_Wkd_Teen to prepare for question #3.

trim_30 %>%
  filter(Cum_Hrs_Wkd_Teen>0) %>%
  ggvis(~Cum_Hrs_Wkd_Teen) %>% layer_histograms()
## Guessing width = 500 # range / 29

I saw that there were just a small number of data points beyond about 11,500 hours worked that would make the mean seem artificially high. I also wanted to look at only those data points where the individual worked more than 0 hours.

trim_30_teen <- trim_30 %>%
  filter(Cum_Hrs_Wkd_Teen>0, Cum_Hrs_Wkd_Teen<11500)
glimpse(trim_30_teen)
## Observations: 1,353
## Variables: 6
## $ ID                (int) 4, 8, 26, 27, 32, 33, 55, 59, 68, 69, 78, 80...
## $ Sex               (int) 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1,...
## $ Ethnicity         (int) 2, 4, 1, 1, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4, 4,...
## $ Interview_age     (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen  (int) 3292, 2082, 760, 1592, 4611, 1862, 1368, 564...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 0, 11060, 16981, 26551, 16934,...
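
The 11,500-hour cutoff was chosen by eye from the histogram. A common rule-of-thumb alternative, which is not the method used in this analysis, is to drop values above the upper Tukey fence (third quartile plus 1.5 times the interquartile range). The sketch below uses made-up values; `hours` is hypothetical.

```r
# Hypothetical cumulative-hours values (not the NLS data)
hours <- c(500, 1200, 2500, 3000, 3400, 4100, 4600, 5200, 18000)

# First and third quartiles, then the upper fence Q3 + 1.5 * IQR
q <- quantile(hours, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# Keep positive values at or below the fence
kept <- hours[hours > 0 & hours <= upper_fence]

print(upper_fence)
print(kept)
```

A rule like this makes the cutoff reproducible, although whether it is appropriate for skewed data such as work hours is a judgment call.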

3. Test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 14 through age 19

I was finally ready to run my t test for #3, which would test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 14 through age 19:

\(H_{0}: \mu_{teenfemalecum} - \mu_{teenmalecum} = 0\)

Alternatively

\(H_{1}: \mu_{teenfemalecum} - \mu_{teenmalecum} \neq 0\)

I set \(\alpha\) = 0.05. (Note: Male = 1 and Female = 2.)

t.test(Cum_Hrs_Wkd_Teen ~ Sex, trim_30_teen, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Cum_Hrs_Wkd_Teen by Sex
## t = 3.8084, df = 1351, p-value = 0.0001461
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  217.7845 680.4907
## sample estimates:
## mean in group 1 mean in group 2 
##        3543.652        3094.515

I found that my p value, .0001461, is <.05 (my \(\alpha\)), so I rejected the null hypothesis. In other words, the data suggest that the mean cumulative number of hours worked as a teen differs between men and women who were 30 years old at the 2011 interview. In this sample, men worked an average of 3,544 hours and women worked an average of 3,095 hours as teens, so men worked an average of about 449 more hours than women. Because t.test subtracts the mean of group 2 from the mean of group 1, the confidence interval is for men minus women: I am 95% confident that the true mean difference in their cumulative work hours is between 218 and 680 hours.
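
To make the pooled-variance statistic that var.equal=TRUE requests more concrete, the sketch below computes it by hand on made-up numbers and compares it with what t.test reports. The data frame `toy` and its columns are hypothetical.

```r
# Hypothetical hours for two groups (1 = male, 2 = female)
toy <- data.frame(hours = c(3600, 4100, 2900, 3800, 3000, 2700, 3300, 2500),
                  sex   = c(1, 1, 1, 1, 2, 2, 2, 2))

x1 <- toy$hours[toy$sex == 1]
x2 <- toy$hours[toy$sex == 2]
n1 <- length(x1)
n2 <- length(x2)

# Pooled variance and the two-sample t statistic (group 1 minus group 2)
sp2 <- ((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2)
t_manual <- (mean(x1) - mean(x2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# The same test via t.test; degrees of freedom are n1 + n2 - 2
fit <- t.test(hours ~ sex, data = toy, var.equal = TRUE)

print(c(manual = t_manual, t.test = unname(fit$statistic)))
```

The sign of the statistic follows the group order: a positive value means the first group (here, sex = 1) has the larger mean.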

4. Test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 20 and older

To prepare for question #4, I now need to create a histogram to give me a better idea of whether there are data outliers I need to cull from Cum_Hrs_Wkd_Adult.

trim_30 %>%
  filter(Cum_Hrs_Wkd_Adult>0) %>%
  ggvis(~Cum_Hrs_Wkd_Adult) %>% layer_histograms()
## Guessing width = 2000 # range / 32

The resulting histogram showed a small number of outliers above about 40,000 hours that would make the mean artificially high. I also wanted to look at only those data points where the individual worked more than 0 hours.

trim_30_adult <- trim_30 %>%
  filter(Cum_Hrs_Wkd_Adult>0, Cum_Hrs_Wkd_Adult<40000)
glimpse(trim_30_adult)
## Observations: 1,188
## Variables: 6
## $ ID                (int) 4, 8, 27, 32, 33, 55, 78, 80, 83, 86, 102, 1...
## $ Sex               (int) 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,...
## $ Ethnicity         (int) 2, 4, 1, 4, 4, 2, 4, 1, 2, 4, 2, 4, 4, 4, 4,...
## $ Interview_age     (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Cum_Hrs_Wkd_Teen  (int) 3292, 2082, 1592, 4611, 1862, 1368, 365, 312...
## $ Cum_Hrs_Wkd_Adult (int) 23390, 18419, 11060, 16981, 26551, 16934, 60...

Then I was ready to run my t test for #4, which would test the null hypothesis that there is no difference by sex between the mean cumulative hours worked from age 20 and older:

\(H_{0}: \mu_{adultfemalecum} - \mu_{adultmalecum} = 0\)

Alternatively

\(H_{1}: \mu_{adultfemalecum} - \mu_{adultmalecum} \neq 0\)

I set \(\alpha\) = 0.05. (Note: Male = 1 and Female = 2.)

t.test(Cum_Hrs_Wkd_Adult ~ Sex, trim_30_adult, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  Cum_Hrs_Wkd_Adult by Sex
## t = 4.969, df = 1186, p-value = 7.717e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1489.491 3433.146
## sample estimates:
## mean in group 1 mean in group 2 
##        20893.26        18431.95

I find that my p value, 7.717e-07 (about .0000008), is <.05 (my \(\alpha\)), so I reject the null hypothesis. In other words, the data suggest that the mean cumulative number of hours worked as an adult differs between men and women who were 30 years old at the 2011 interview. In this sample, men worked an average of 20,893 hours as adults and women worked an average of 18,432 hours as adults, so men worked an average of about 2,461 more hours than women as adults. Because t.test subtracts the mean of group 2 from the mean of group 1, the confidence interval is for men minus women: I am 95% confident that the true mean difference in their cumulative work hours is between 1,489 and 3,433 hours.
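
Both tests by sex assume equal variances in the two groups (var.equal=TRUE). As a hedged sketch, and not part of the original analysis, that assumption could be examined with an F test of the variances, or sidestepped with Welch's t test, which is the t.test default. The data here are simulated and `toy` is hypothetical.

```r
set.seed(1)  # reproducible made-up data

# Hypothetical adult hours for two groups of 30 respondents each
toy <- data.frame(hours = c(rnorm(30, mean = 20000, sd = 6000),
                            rnorm(30, mean = 18000, sd = 6000)),
                  sex   = rep(c(1, 2), each = 30))

# F test of the ratio of the two group variances
vt <- var.test(hours ~ sex, data = toy)

# Welch's t test (the t.test default) does not assume equal variances
welch  <- t.test(hours ~ sex, data = toy)
pooled <- t.test(hours ~ sex, data = toy, var.equal = TRUE)

print(vt$p.value)
print(c(welch_df  = unname(welch$parameter),
        pooled_df = unname(pooled$parameter)))
```

When the group variances and sizes are similar, the Welch and pooled results are usually close, so comparing the two is a quick sanity check.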

5. Test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 14 through age 19.

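To prepare for questions #5 and #6, I needed to re-code the Ethnicity values 1 through 3 as “1” and value 4 as “0.” This groups the Black, Hispanic, and Mixed Race (Non-Hispanic) categories into a single “other” category and leaves Non-Black / Non-Hispanic as the comparison group. I did this by using an ifelse statement that says “if Ethnicity is 4, recode it to 0; if not, recode it to 1.” Then I tabulated the recoded variable to make sure it worked. It did: the 4,665 respondents coded 4 became 0, and the remaining 4,319 (2,335 + 1,901 + 83) became 1.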

table(youth$Ethnicity)
## 
##    1    2    3    4 
## 2335 1901   83 4665
youth$EthnicityNew <- ifelse(youth$Ethnicity == 4, 0, 1)
table(youth$EthnicityNew)
## 
##    0    1 
## 4665 4319
summary(youth)
##        ID            Sex          Ethnicity     Interview_age  
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :26.00  
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:28.00  
##  Median :4502   Median :1.000   Median :4.000   Median :29.00  
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :28.79  
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:30.00  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :32.00  
##                                                 NA's   :1561   
##  Cum_Hrs_Wkd_Teen Cum_Hrs_Wkd_Adult  EthnicityNew   
##  Min.   :    0    Min.   :    0     Min.   :0.0000  
##  1st Qu.: 1255    1st Qu.: 8190     1st Qu.:0.0000  
##  Median : 2741    Median :16396     Median :0.0000  
##  Mean   : 3105    Mean   :15595     Mean   :0.4807  
##  3rd Qu.: 4470    3rd Qu.:22418     3rd Qu.:1.0000  
##  Max.   :18829    Max.   :63722     Max.   :1.0000  
##  NA's   :707      NA's   :1596
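
The recode step can be sketched on a small made-up vector, which also shows how ifelse treats missing values and how labels could be attached to the result. The names `ethnicity`, `ethnicity_new` and `ethnicity_lab` are hypothetical, and the label "Other" is my own shorthand for the combined group.

```r
# Hypothetical ethnicity codes (1-4 as in the code book, plus one NA)
ethnicity <- c(1, 2, 3, 4, 4, NA)

# 4 = Non-Black / Non-Hispanic -> 0; everything else -> 1; NA stays NA
ethnicity_new <- ifelse(ethnicity == 4, 0, 1)

# Optional: readable labels make later test output easier to interpret
ethnicity_lab <- factor(ethnicity_new, levels = c(0, 1),
                        labels = c("Non-Black / Non-Hispanic", "Other"))

print(table(ethnicity_lab, useNA = "ifany"))
```

With a labelled factor as the grouping variable, t.test would print the group labels in place of "group 0" and "group 1".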

Then I was ready to run my t test for #5, which would test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 14 through age 19:

\(H_{0}: \mu_{teenothercum} - \mu_{teenwhitecum} = 0\)

Alternatively

\(H_{1}: \mu_{teenothercum} - \mu_{teenwhitecum} \neq 0\)

I set \(\alpha\) = 0.05. (Note: Non-black, non-hispanic = 0 and Other ethnicity = 1.) I will use my trim_30_teen for this test.

t.test(youth$Cum_Hrs_Wkd_Teen ~ youth$EthnicityNew, trim_30_teen, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  youth$Cum_Hrs_Wkd_Teen by youth$EthnicityNew
## t = 14.884, df = 8275, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  669.6223 872.7625
## sample estimates:
## mean in group 0 mean in group 1 
##        3474.743        2703.551

I find that my p value, < 2.2e-16 (a VERY small number), is <.05 (my \(\alpha\)), so I reject the null hypothesis. One caution about what was actually tested: although I passed trim_30_teen as the data argument, the formula refers to youth$Cum_Hrs_Wkd_Teen and youth$EthnicityNew directly, so R ignored trim_30_teen and ran the test on every respondent in youth with non-missing teen hours (df = 8275, that is, 8,277 respondents of all interview ages, with zero-hour and outlier values included). The result therefore describes the full sample, not only the 30 year olds. In the full sample, non-black, non-hispanics worked an average of 3,475 hours as teens and other races/ethnicities worked an average of 2,704 hours as teens, so non-black, non-hispanics worked an average of about 771 more hours as teens than other races/ethnicities. I am 95% confident that the true mean difference in their cumulative work hours (non-black, non-hispanic minus other) is between 670 and 873 hours.
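
A formula written with `youth$` prefixes takes its columns from youth regardless of the data argument, which is why the output above reports df = 8275. The sketch below, on made-up data, contrasts that form with bare column names, which are looked up in the data argument. The objects `full` and `sub30` are hypothetical.

```r
# Hypothetical full data: 8 respondents, half of them aged 30
full <- data.frame(age   = c(30, 30, 30, 30, 28, 28, 29, 29),
                   hours = c(3000, 2500, 3400, 2100, 4000, 1500, 3900, 1800),
                   group = c(0, 1, 0, 1, 0, 1, 0, 1))
sub30 <- full[full$age == 30, ]

# Columns written as full$... are read from `full`; `sub30` is ignored
ignored <- t.test(full$hours ~ full$group, sub30, var.equal = TRUE)

# Bare column names are looked up in `data`, so only the subset is tested
intended <- t.test(hours ~ group, data = sub30, var.equal = TRUE)

print(c(ignored_df  = unname(ignored$parameter),
        intended_df = unname(intended$parameter)))
```

To apply the bare-name form to the real data, EthnicityNew would have to exist in the trimmed data frames, which were created before the recode was added to youth. One option, offered only as an assumption about how this could be arranged, is to do the recode before filtering.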

6. Test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 20 and older.

Using my re-coded Ethnicity variable, EthnicityNew, I was ready to run my t test for #6, which would test the null hypothesis that there is no difference by race/ethnicity between the mean cumulative hours worked from age 20 and older:

\(H_{0}: \mu_{adultothercum} - \mu_{adultwhitecum} = 0\)

Alternatively

\(H_{1}: \mu_{adultothercum} - \mu_{adultwhitecum} \neq 0\)

I set \(\alpha\) = 0.05. (Note: Non-black, non-hispanic = 0 and Other ethnicity = 1.) I will use my trim_30_adult for this test.

t.test(youth$Cum_Hrs_Wkd_Adult ~ youth$EthnicityNew, trim_30_adult, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  youth$Cum_Hrs_Wkd_Adult by youth$EthnicityNew
## t = 3.979, df = 7386, p-value = 6.985e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   444.1286 1306.6571
## sample estimates:
## mean in group 0 mean in group 1 
##        16012.93        15137.53

I find that my p value, .00006985, is <.05 (my \(\alpha\)), so I reject the null hypothesis. One caution about what was actually tested: although I passed trim_30_adult as the data argument, the formula refers to youth$Cum_Hrs_Wkd_Adult and youth$EthnicityNew directly, so R ignored trim_30_adult and ran the test on every respondent in youth with non-missing adult hours (df = 7386, that is, 7,388 respondents of all interview ages, with zero-hour and outlier values included). The result therefore describes the full sample, not only the 30 year olds. In the full sample, non-black, non-hispanics worked an average of 16,013 hours as adults and other races/ethnicities worked an average of 15,138 hours as adults, so non-black, non-hispanics worked an average of about 875 more hours as adults than other races/ethnicities. I am 95% confident that the true mean difference in their cumulative work hours (non-black, non-hispanic minus other) is between 444 and 1,307 hours.