WFED540 Assignment 2

Question 1

Before answering the questions in the assignment, I downloaded the appropriate data from NLSY97 and extracted the resulting zipped file. I then created a data set called new_data using the R file provided in that download. This R file prepares the data for easier use in RStudio, and I ran it with the following code:

new_data <- read.table('540assign2.dat', sep=' ')
names(new_data) <- c('R0000100','R0536300','R1482600','T6651300','Z9065500','Z9065700')

# Handle missing values
new_data[new_data == -1] <- NA  # Refused
new_data[new_data == -2] <- NA  # Don't know
new_data[new_data == -3] <- NA  # Invalid missing
new_data[new_data == -4] <- NA  # Valid missing
new_data[new_data == -5] <- NA  # Non-interview

# If there are values not categorized they will be represented as NA
vallabels = function(data) {
  data$R0000100 <- cut(data$R0000100, c(0.0,1.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0,9999.0), labels=c("0","1 TO 999","1000 TO 1999","2000 TO 2999","3000 TO 3999","4000 TO 4999","5000 TO 5999","6000 TO 6999","7000 TO 7999","8000 TO 8999","9000 TO 9999"), right=FALSE)
  data$R0536300 <- factor(data$R0536300, levels=c(1.0,2.0,0.0), labels=c("Male","Female","No Information"))
  data$R1482600 <- factor(data$R1482600, levels=c(1.0,2.0,3.0,4.0), labels=c("Black","Hispanic","Mixed Race (Non-Hispanic)","Non-Black / Non-Hispanic"))
  data$T6651300 <- factor(data$T6651300, levels=c(26.0,27.0,28.0,29.0,30.0,31.0,32.0), labels=c("26","27","28","29","30","31","32"))
  data$Z9065500 <- cut(data$Z9065500, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  data$Z9065700 <- cut(data$Z9065700, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  return(data)
}

varlabels <- c(    "PUBID _ YTH ID CODE 1997",
    "KEY!SEX (SYMBOL) 1997",
    "KEY!RACE_ETHNICITY (SYMBOL) 1997",
    "CV_AGE_INT_DATE 2011",
    "CVC_HOURS_WK_TEEN",
    "CVC_HOURS_WK_ADULT"
)

# Use qnames rather than rnums
qnames = function(data) {
  names(data) <- c("ID","Sex","Race","Age_2011","Total_teen_hours","Total_adult_hours")
  return(data)
}

categories <- vallabels(new_data)
new_data <- qnames(new_data)
categories <- qnames(categories)
summary(new_data)

##        ID            Sex             Race          Age_2011    
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :26.00  
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:28.00  
##  Median :4502   Median :1.000   Median :4.000   Median :29.00  
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :28.79  
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:30.00  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :32.00  
##                                                 NA's   :1561   
##  Total_teen_hours Total_adult_hours
##  Min.   :    0    Min.   :    0    
##  1st Qu.: 1255    1st Qu.: 8190    
##  Median : 2741    Median :16396    
##  Mean   : 3105    Mean   :15595    
##  3rd Qu.: 4470    3rd Qu.:22418    
##  Max.   :18829    Max.   :63722    
##  NA's   :707      NA's   :1596
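
The five missing-value assignments in the script can also be written as a single step. The following is a minimal sketch on a made-up data frame (toy, with hypothetical columns id, hours, and age); it assumes, as the script's comments state, that -1 through -5 are the only missing-value codes.

```r
# Toy data frame standing in for new_data (all values are made up).
toy <- data.frame(id    = c(1, 2, 3, 4),
                  hours = c(120, -4, 0, -1),
                  age   = c(29, -5, 30, 31))

# NLSY97 missing-value codes, as listed in the script's comments.
missing_codes <- c(-1, -2, -3, -4, -5)

# Replace every missing code with NA, one column at a time.
# Assigning into toy[] keeps the object a data frame.
toy[] <- lapply(toy, function(x) replace(x, x %in% missing_codes, NA))

toy
```

A legitimate zero (for example, zero hours worked) is left untouched because only the listed codes are replaced.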

Next I loaded the dplyr and magrittr packages. The dplyr package allows the creation of table data frames, and magrittr provides the pipe operator (%>%) that will be used in later parts of the assignment.

require(dplyr)

## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(magrittr)

## Loading required package: magrittr

I then created a new data frame called new_dataDF from the data set created from the NLSY97 using the tbl_df function. Below I show a glimpse of that data frame.

new_dataDF <- tbl_df(new_data)
glimpse(new_dataDF)

## Observations: 8,984
## Variables: 6
## $ ID                (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Sex               (int) 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2,...
## $ Race              (int) 4, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2,...
## $ Age_2011          (int) 29, 29, 28, 30, 29, 29, 28, 30, 29, NA, 29, ...
## $ Total_teen_hours  (int) 5831, NA, 6489, 3292, 680, NA, 1650, 2082, 8...
## $ Total_adult_hours (int) NA, 29712, NA, 23390, 28056, 18379, NA, 1841...

Question 2

Next I filtered this new data frame to keep only people who were 30 years of age at the time of the 2011 interview, and stored the result in a new data frame called “ThirtyYearOlds”. I used the glimpse function to check that this was done correctly. Then I counted the people in that data frame with the “n” (number) function inside summarize. The same count appears as the number of observations in the glimpse output.

ThirtyYearOlds <- filter (new_dataDF, Age_2011 == 30)
glimpse(ThirtyYearOlds)

## Observations: 1,546
## Variables: 6
## $ ID                (int) 4, 8, 26, 27, 32, 33, 38, 55, 59, 68, 69, 78...
## $ Sex               (int) 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2,...
## $ Race              (int) 2, 4, 1, 1, 4, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4,...
## $ Age_2011          (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Total_teen_hours  (int) 3292, 2082, 760, 1592, 4611, 1862, 0, 1368, ...
## $ Total_adult_hours (int) 23390, 18419, 0, 11060, 16981, 26551, NA, 16...

ThirtyYearOlds %>% summarize(Total_ThirtyYearOlds=n())

## Source: local data frame [1 x 1]
## 
##   Total_ThirtyYearOlds
##                  (int)
## 1                 1546
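
The earlier summary reported 1,561 missing values for Age_2011, and none of those rows appear in ThirtyYearOlds. That is because filter keeps only rows where the condition is TRUE, so a comparison that evaluates to NA drops the row. A small sketch on made-up data (the toy data frame and its values are assumptions for illustration):

```r
library(dplyr)

# Made-up ages; two respondents have a missing age.
toy <- data.frame(ID = 1:5, Age_2011 = c(30, NA, 29, 30, NA))

# NA == 30 evaluates to NA, so those rows are dropped along with age 29.
thirty <- filter(toy, Age_2011 == 30)

nrow(thirty)                        # base R row count
summarize(thirty, total = n())      # the same count via n()
sum(is.na(toy$Age_2011))            # rows with a missing age
```

Both nrow and n() report the same count, so either can be used to confirm the size of the filtered group.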

Question 3

Next, I tested the null hypothesis that there is no difference by sex (Sex) in the mean cumulative hours worked from age 14 through age 19 (Total_teen_hours). Stated another way, knowing a person’s sex does not help predict the number of hours worked during their teen years, and vice versa. The alternative hypothesis is that there is a difference by sex in the mean cumulative hours worked from age 14 through age 19. Stated another way, sex and cumulative hours worked as a teen are not independent variables, i.e. one can be used to predict the other. As per the assignment I will set my alpha value, the probability of Type 1 error, equal to 0.05.

To start, I first looked at the mean total hours worked for each sex in the data which were:

ThirtyYearOlds %>% group_by(Sex) %>%
  summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))

## Source: local data frame [2 x 2]
## 
##     Sex mean_teen_hours
##   (int)           (dbl)
## 1     1        3528.906
## 2     2        2978.293

However, when I considered the data more carefully I realized that these numbers may be skewed by outliers (a few individuals who reported a great number of hours) and individuals who may not have worked at all (zero total hours worked). To determine this, I ran a few plots of the data.

To do this, I first loaded the ggvis package, which I used for plotting.

require(ggvis)

## Loading required package: ggvis

Next I ran a plot for Male teens.

ThirtyYearOlds %>% filter(Sex==1) %>%
  ggvis(~Total_teen_hours) %>% layer_histograms()

## Guessing width = 500 # range / 29

This plot shows that there are a number of individuals above the 8,000 hour mark that might skew my results, and that there are a number of individuals who have not worked at all. Therefore, I decided to re-run the plot using only individuals whose total teen hours worked fell between 500 and 8,000, inclusive.

ThirtyYearOlds %>% filter(Sex==1, Total_teen_hours >=500 & Total_teen_hours <= 8000) %>%
  ggvis(~Total_teen_hours) %>% layer_histograms()

## Guessing width = 200 # range / 38

This plot gives me a more representative group of data.

I then ran the same plots for Female teen workers.

ThirtyYearOlds %>% filter(Sex==2) %>%
  ggvis(~Total_teen_hours) %>% layer_histograms()

## Guessing width = 500 # range / 27

I also ran plots with trimmed data.

ThirtyYearOlds %>% filter(Sex==2, Total_teen_hours >=500 & Total_teen_hours <= 8000) %>%
  ggvis(~Total_teen_hours) %>% layer_histograms()

## Guessing width = 200 # range / 38

Once again, these plots show that trimmed data provide a more accurate representation of teen hours worked for these groups. Therefore, I calculated my means again using these limitations on the data.

ThirtyYearOlds %>% group_by(Sex) %>%
  filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000) %>%
  summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))

## Source: local data frame [2 x 2]
## 
##     Sex mean_teen_hours
##   (int)           (dbl)
## 1     1        3405.310
## 2     2        3162.204

These means seem more reasonable when compared with our original values, which were:

ThirtyYearOlds %>% group_by(Sex) %>%
  summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))

## Source: local data frame [2 x 2]
## 
##     Sex mean_teen_hours
##   (int)           (dbl)
## 1     1        3528.906
## 2     2        2978.293

Now that I’m confident that my trimmed data is a more accurate representation of the group, I’m ready to run a t-test to test the null hypothesis. To do this I needed to create a new data frame for t-test comparison using trimmed data according to the specifications above. I called the new data frame trim_teen_data.

# filter() has no na.rm argument; rows where the condition is NA are dropped automatically
trim_teen_data <- ThirtyYearOlds %>%
  filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000)
trim_teen_data

## Source: local data frame [1,228 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (int)    (int)            (int)             (int)
## 1      4     2     2       30             3292             23390
## 2      8     2     4       30             2082             18419
## 3     26     1     1       30              760                 0
## 4     27     1     1       30             1592             11060
## 5     32     2     4       30             4611             16981
## 6     33     2     4       30             1862             26551
## 7     55     2     2       30             1368             16934
## 8     59     2     1       30             5648                NA
## 9     68     1     1       30             1316                NA
## 10    69     2     1       30             3221                NA
## ..   ...   ...   ...      ...              ...               ...

I then ran a t-test on the trimmed and scrubbed data:

t.test(trim_teen_data$Total_teen_hours ~ trim_teen_data$Sex, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  trim_teen_data$Total_teen_hours by trim_teen_data$Sex
## t = 2.3694, df = 1226, p-value = 0.01797
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   41.8123 444.3999
## sample estimates:
## mean in group 1 mean in group 2 
##        3405.310        3162.204

The t-test shows that the estimate of the difference between the means is more than twice the error in estimating that difference (t = 2.3694) and our probability value (p-value = 0.01797) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [41.8, 444.4], which does not include zero. Therefore, I reject the null hypothesis.
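
The phrase "the difference between the means relative to the error in estimating that difference" can be made concrete by computing the pooled two-sample t statistic by hand and comparing it with t.test. This is a sketch on made-up numbers (the vectors g1 and g2 are hypothetical, not NLSY97 values):

```r
# Made-up hours for two hypothetical groups.
g1 <- c(3100, 3600, 2900, 4100, 3300)
g2 <- c(2800, 3000, 2500, 3400, 2700)

n1 <- length(g1)
n2 <- length(g2)

# Pooled variance: weighted average of the two sample variances.
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)

# Standard error of the difference between the two means.
se <- sqrt(sp2 * (1 / n1 + 1 / n2))

# t statistic and two-sided p-value with n1 + n2 - 2 degrees of freedom.
t_manual <- (mean(g1) - mean(g2)) / se
p_manual <- 2 * pt(-abs(t_manual), df = n1 + n2 - 2)

# The same test through t.test with equal variances assumed.
fit <- t.test(g1, g2, var.equal = TRUE)

all.equal(unname(fit$statistic), t_manual)
all.equal(fit$p.value, p_manual)
```

The degrees of freedom reported in the output above (1226) follow the same rule: 1,228 trimmed observations minus 2.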

Question 4

I can test the null hypothesis in Question 4 using the same procedures I used in Question 3, but I will need to trim the data differently.

The null hypothesis I’m testing states that there is no difference by sex (Sex) in the mean cumulative hours worked from age 20 and older (Total_adult_hours). Stated another way, knowing a person’s sex does not help predict the number of hours worked during their adult years, and vice versa. The alternative hypothesis is that there is a difference by sex in the mean cumulative hours worked from age 20 and older. Stated another way, sex and cumulative hours worked as an adult are not independent variables, i.e. one can be used to predict the other. As per the assignment I will set my alpha value, the probability of Type 1 error, equal to 0.05.

Step 1. Calculate the means without data manipulation.

ThirtyYearOlds %>% group_by(Sex) %>%
  summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##     Sex   num mean_adult_hours
##   (int) (int)            (dbl)
## 1     1   767         20671.98
## 2     2   779         18252.12

Step 2. Create plot for Male Adults vs. the number of hours worked as an adult.

ThirtyYearOlds %>% filter(Sex==1) %>%
  ggvis(~Total_adult_hours) %>% layer_histograms()

## Guessing width = 2000 # range / 32

Step 3. Create plot for Female Adults vs. the number of hours worked as an adult.

ThirtyYearOlds %>% filter(Sex==2) %>%
  ggvis(~Total_adult_hours) %>% layer_histograms()

## Guessing width = 1000 # range / 37

The plots show that there are a number of people who haven’t worked any hours as an adult and that there are a number of outliers at the upper end of the scale. Based on the plots, I decided to filter my data to include only those individuals who have worked at least 5,000 hours and no more than 30,000 hours. I then re-ran the plots using those filters:

ThirtyYearOlds %>% filter(Sex==1, Total_adult_hours >=5000 & Total_adult_hours <= 30000) %>%
  ggvis(~Total_adult_hours) %>% layer_histograms()

## Guessing width = 1000 # range / 25

ThirtyYearOlds %>% filter(Sex==2, Total_adult_hours >=5000 & Total_adult_hours <= 30000) %>%
  ggvis(~Total_adult_hours) %>% layer_histograms()

## Guessing width = 1000 # range / 25

These plots show a more accurate representation of the data on which to run my t-test. Therefore, I move on to my next step.

Step 4. I calculate the group means using trimmed data.

ThirtyYearOlds %>% group_by(Sex) %>%
  filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000) %>%
  summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##     Sex   num mean_adult_hours
##   (int) (int)            (dbl)
## 1     1   446         20396.74
## 2     2   544         18646.75

And then I compare them with the original means.

ThirtyYearOlds %>% group_by(Sex) %>%
  summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##     Sex   num mean_adult_hours
##   (int) (int)            (dbl)
## 1     1   767         20671.98
## 2     2   779         18252.12

I’m now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.

Step 5. First, I create a new data frame with trimmed data.

# filter() has no na.rm argument; rows where the condition is NA are dropped automatically
trim_adult_data <- ThirtyYearOlds %>%
  filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000)
trim_adult_data

## Source: local data frame [990 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (int)    (int)            (int)             (int)
## 1      4     2     2       30             3292             23390
## 2      8     2     4       30             2082             18419
## 3     27     1     1       30             1592             11060
## 4     32     2     4       30             4611             16981
## 5     33     2     4       30             1862             26551
## 6     55     2     2       30             1368             16934
## 7     78     2     4       30              365              6001
## 8     83     1     2       30             1910             14857
## 9     86     2     4       30              310             20959
## 10   102     2     2       30             3488             17674
## ..   ...   ...   ...      ...              ...               ...

Step 6. Run t-test.

t.test(trim_adult_data$Total_adult_hours ~ trim_adult_data$Sex, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  trim_adult_data$Total_adult_hours by trim_adult_data$Sex
## t = 4.214, df = 988, p-value = 2.739e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   935.0507 2564.9255
## sample estimates:
## mean in group 1 mean in group 2 
##        20396.74        18646.75

The t-test shows that the estimate of the difference between the means is more than four times the error in estimating that difference (t = 4.214) and our probability value (p-value = 0.00002739) is much less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [935, 2,565], which does not include zero. Therefore, I reject the null hypothesis.
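
The tests in this assignment set var.equal=TRUE, which pools the two group variances. As a hedged side check, not part of the assignment, the sketch below runs the same comparison with and without that assumption on simulated data. The toy data frame and its parameters are invented for illustration; t.test uses the Welch correction when var.equal is left at its default of FALSE.

```r
set.seed(1)  # simulated data, not NLSY97 values

toy <- data.frame(
  Sex   = rep(c(1, 2), each = 20),
  hours = c(rnorm(20, mean = 20000, sd = 5000),
            rnorm(20, mean = 18500, sd = 7000))
)

# Pooled-variance test, as used in this assignment.
pooled <- t.test(hours ~ Sex, data = toy, var.equal = TRUE)

# Welch test (the t.test default), which does not assume equal variances.
welch <- t.test(hours ~ Sex, data = toy)

pooled$parameter   # degrees of freedom: n1 + n2 - 2
welch$parameter    # Welch degrees of freedom, never larger than the pooled value
c(pooled = pooled$p.value, welch = welch$p.value)
```

If the two versions lead to the same decision at \(\alpha\) = 0.05, the equal-variance assumption is not driving the conclusion.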

Question 5

The null hypothesis I’m testing states that there is no difference by race/ethnicity (Race) in the mean cumulative hours worked from age 14 through age 19 (Total_teen_hours). Stated another way, knowing a person’s race/ethnicity does not help predict the number of hours worked during their teen years, and vice versa. The alternative hypothesis is that there is a difference by race/ethnicity in the mean cumulative hours worked from age 14 through age 19. Stated another way, race/ethnicity and cumulative hours worked as a teen are not independent variables, i.e. one can be used to predict the other. As per the assignment I will set my alpha value, the probability of Type 1 error, equal to 0.05.

As per the assignment, to begin testing the null hypothesis, I must first recode the Race variable in my data set to indicate a “1” for any Non-Black, Non-Hispanic participant and “0” for all other racial categories. I do this through using the “ifelse” command.

First, I take a look at my original data frame:

ThirtyYearOlds

## Source: local data frame [1,546 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (int)    (int)            (int)             (int)
## 1      4     2     2       30             3292             23390
## 2      8     2     4       30             2082             18419
## 3     26     1     1       30              760                 0
## 4     27     1     1       30             1592             11060
## 5     32     2     4       30             4611             16981
## 6     33     2     4       30             1862             26551
## 7     38     2     4       30                0                NA
## 8     55     2     2       30             1368             16934
## 9     59     2     1       30             5648                NA
## 10    68     1     1       30             1316                NA
## ..   ...   ...   ...      ...              ...               ...

To preserve the integrity of my original data frame, I copy it into a new data frame called NewRaceCodes.

NewRaceCodes <- ThirtyYearOlds

Using the “ifelse” command I recode the Race variable according to the requirements of the assignment:

NewRaceCodes$Race <- ifelse(NewRaceCodes$Race==4, 1, 0)
NewRaceCodes

## Source: local data frame [1,546 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (dbl)    (int)            (int)             (int)
## 1      4     2     0       30             3292             23390
## 2      8     2     1       30             2082             18419
## 3     26     1     0       30              760                 0
## 4     27     1     0       30             1592             11060
## 5     32     2     1       30             4611             16981
## 6     33     2     1       30             1862             26551
## 7     38     2     1       30                0                NA
## 8     55     2     0       30             1368             16934
## 9     59     2     0       30             5648                NA
## 10    68     1     0       30             1316                NA
## ..   ...   ...   ...      ...              ...               ...

Comparing this to my original data above, I can see that the racial categories have been recoded properly.

However, to ensure that all the data has been recoded, I can run the “table” command to check if there are any extraneous codes I need to account for.

y1 <- table(NewRaceCodes$Race)
y1

## 
##   0   1 
## 776 770
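
One caveat with this check is that table omits NA values by default, so a missing Race code would not appear as its own column. The sketch below, on a made-up vector (race and its values are hypothetical), shows how ifelse propagates NA and how the useNA argument makes missing values visible:

```r
# Made-up race codes; one value is missing.
race <- c(4, 2, 1, NA, 3, 4)

# Code 4 becomes 1, every other code becomes 0, and NA stays NA.
recoded <- ifelse(race == 4, 1, 0)

# Default table() silently omits the NA.
table(recoded)

# useNA = "ifany" adds an <NA> column when missing values are present.
table(recoded, useNA = "ifany")
```

In this assignment the two counts, 776 and 770, sum to the 1,546 rows in ThirtyYearOlds, which indicates that no Race values were missing.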

This distribution table shows that my data is clean: only the codes 0 and 1 appear, and the counts sum to 1,546, the number of rows in ThirtyYearOlds. I can now run the same trimming and calculations on my new data set that I did earlier in the assignment, but using Race rather than Sex as my grouping (independent) variable.

I will use the same limits I placed on Total_teen_hours in the previous exercise to conduct these calculations:

NewRaceCodes %>% group_by(Race) %>%
  filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000) %>%
  summarize(num=n(), mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##    Race   num mean_teen_hours
##   (dbl) (int)           (dbl)
## 1     0   598        3002.256
## 2     1   630        3547.100

These means seem more reasonable than those from the untrimmed data:

NewRaceCodes %>% group_by(Race) %>%
  summarize(num=n(), mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##    Race   num mean_teen_hours
##   (dbl) (int)           (dbl)
## 1     0   776        2860.372
## 2     1   770        3645.603

I am now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.

First I trim my data. I will call the new data frame trim_teen_race_data.

# filter() has no na.rm argument; rows where the condition is NA are dropped automatically
trim_teen_race_data <- NewRaceCodes %>%
  filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000)
trim_teen_race_data

## Source: local data frame [1,228 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (dbl)    (int)            (int)             (int)
## 1      4     2     0       30             3292             23390
## 2      8     2     1       30             2082             18419
## 3     26     1     0       30              760                 0
## 4     27     1     0       30             1592             11060
## 5     32     2     1       30             4611             16981
## 6     33     2     1       30             1862             26551
## 7     55     2     0       30             1368             16934
## 8     59     2     0       30             5648                NA
## 9     68     1     0       30             1316                NA
## 10    69     2     0       30             3221                NA
## ..   ...   ...   ...      ...              ...               ...

Then I run my t-test.

t.test(trim_teen_race_data$Total_teen_hours ~ trim_teen_race_data$Race, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  trim_teen_race_data$Total_teen_hours by trim_teen_race_data$Race
## t = -5.3588, df = 1226, p-value = 1e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -744.3175 -345.3708
## sample estimates:
## mean in group 0 mean in group 1 
##        3002.256        3547.100

The t-test shows that the estimate of the difference between the means is more than five times the error in estimating that difference (t = -5.3588) and our probability value (p-value = 0.0000001) is much less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-744, -345]. Therefore, the null hypothesis has been rejected.
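
The negative sign on this t statistic and on both ends of the confidence interval reflects only the order of the groups: t.test subtracts the second group's mean from the first (here, group 0 minus group 1). A sketch on made-up numbers (g0 and g1 are hypothetical) shows that swapping the order flips the signs and leaves the p-value unchanged:

```r
# Made-up hours for two hypothetical groups.
g0 <- c(2900, 3100, 2700, 3300, 3000)
g1 <- c(3500, 3700, 3400, 3900, 3600)

# Group 0 first: mean(g0) - mean(g1) is negative.
t01 <- t.test(g0, g1, var.equal = TRUE)

# Group 1 first: the same comparison with the sign reversed.
t10 <- t.test(g1, g0, var.equal = TRUE)

unname(t01$statistic)
unname(t10$statistic)
all.equal(t01$p.value, t10$p.value)
```

The size of the difference and the decision about the null hypothesis are therefore the same either way.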

Question 6

The null hypothesis I’m testing states that there is no difference by race/ethnicity (Race) in the mean cumulative hours worked from age 20 and older (Total_adult_hours). Stated another way, knowing a person’s race/ethnicity does not help predict the number of hours worked during their adult years, and vice versa. The alternative hypothesis is that there is a difference by race/ethnicity in the mean cumulative hours worked from age 20 and older. Stated another way, race/ethnicity and cumulative hours worked as an adult are not independent variables, i.e. one can be used to predict the other. As per the assignment I will set my alpha value, the probability of Type 1 error, equal to 0.05.

I will use the same limits I placed on Total_adult_hours in the previous exercises to conduct these calculations and run my t-test:

NewRaceCodes %>% group_by(Race) %>%
  filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000) %>%
  summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##    Race   num mean_adult_hours
##   (dbl) (int)            (dbl)
## 1     0   479         19072.82
## 2     1   511         19774.75

Once again, these means seem more reasonable than those from the untrimmed data:

NewRaceCodes %>% group_by(Race) %>%
  summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))

## Source: local data frame [2 x 3]
## 
##    Race   num mean_adult_hours
##   (dbl) (int)            (dbl)
## 1     0   776         18321.36
## 2     1   770         20519.65

I’m now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.

I create a new data frame “trim_adult_race_data”.

# filter() has no na.rm argument; rows where the condition is NA are dropped automatically
trim_adult_race_data <- NewRaceCodes %>%
  filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000)
trim_adult_race_data

## Source: local data frame [990 x 6]
## 
##       ID   Sex  Race Age_2011 Total_teen_hours Total_adult_hours
##    (int) (int) (dbl)    (int)            (int)             (int)
## 1      4     2     0       30             3292             23390
## 2      8     2     1       30             2082             18419
## 3     27     1     0       30             1592             11060
## 4     32     2     1       30             4611             16981
## 5     33     2     1       30             1862             26551
## 6     55     2     0       30             1368             16934
## 7     78     2     1       30              365              6001
## 8     83     1     0       30             1910             14857
## 9     86     2     1       30              310             20959
## 10   102     2     0       30             3488             17674
## ..   ...   ...   ...      ...              ...               ...

I run my t-test on my trimmed and scrubbed data.

t.test(trim_adult_race_data$Total_adult_hours ~ trim_adult_race_data$Race, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  trim_adult_race_data$Total_adult_hours by trim_adult_race_data$Race
## t = -1.685, df = 988, p-value = 0.0923
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1519.3668   115.5209
## sample estimates:
## mean in group 0 mean in group 1 
##        19072.82        19774.75

The t-test shows that the estimate of the difference between the means is larger in magnitude than the error in estimating that difference (t = -1.685). However, the probability value (p-value = 0.0923) is more than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). Additionally, our 95% CI [-1,519, 116] spans zero. Therefore, we have failed to reject our null hypothesis using trimmed data.

However, I do find it interesting that when I expand the sample to include all of our original data for 30-year-olds, we can reject our null hypothesis.

t.test(NewRaceCodes$Total_adult_hours ~ NewRaceCodes$Race, var.equal=TRUE)

## 
##  Two Sample t-test
## 
## data:  NewRaceCodes$Total_adult_hours by NewRaceCodes$Race
## t = -4.239, df = 1210, p-value = 2.416e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3215.729 -1180.859
## sample estimates:
## mean in group 0 mean in group 1 
##        18321.36        20519.65

In this case, the t-test shows that the estimate of the difference between the means is more than four times the error in estimating that difference (t = -4.239), and the probability value (p-value = 0.00002416) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-3,216, -1,181], which does not include zero.

This would indicate that a closer examination of the data, particularly in regard to adult racial categories, would be in order to figure out what accounts for this difference. For example, is there a better way to trim our data to give a more accurate sample of our population?
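
One possible alternative, offered only as a sketch and not as the assignment's method, is to set the trimming bounds from percentiles of the data rather than choosing cutoffs by eye from the plots. The example uses simulated hours (the hours vector and the 5th and 95th percentile choice are assumptions for illustration):

```r
set.seed(42)  # simulated data, not NLSY97 values

# Right-skewed hours plus ten people who never worked.
hours <- c(rexp(200, rate = 1 / 15000), rep(0, 10))

# Lower and upper bounds at the 5th and 95th percentiles.
bounds <- quantile(hours, probs = c(0.05, 0.95), na.rm = TRUE)

# Keep only observations inside the bounds.
trimmed <- hours[hours >= bounds[1] & hours <= bounds[2]]

bounds
length(hours)
length(trimmed)
```

Percentile bounds remove the same share of observations from each tail, which makes the rule easier to state and repeat, though it does not by itself settle whether trimming is appropriate for this comparison.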