Before answering the questions in the assignment, I downloaded the appropriate data from NLSY97 and extracted the resulting zip file. I then created a data set called new_data using the R file provided in that download. This R file prepares the data for easier use in RStudio and contains the following code:
new_data <- read.table('540assign2.dat', sep=' ')
names(new_data) <- c('R0000100','R0536300','R1482600','T6651300','Z9065500','Z9065700')
# Handle missing values
new_data[new_data == -1] = NA # Refused
new_data[new_data == -2] = NA # Dont know
new_data[new_data == -3] = NA # Invalid missing
new_data[new_data == -4] = NA # Valid missing
new_data[new_data == -5] = NA # Non-interview
# If there are values not categorized they will be represented as NA
vallabels = function(data) {
data$R0000100 <- cut(data$R0000100, c(0.0,1.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0,9999.0), labels=c("0","1 TO 999","1000 TO 1999","2000 TO 2999","3000 TO 3999","4000 TO 4999","5000 TO 5999","6000 TO 6999","7000 TO 7999","8000 TO 8999","9000 TO 9999"), right=FALSE)
data$R0536300 <- factor(data$R0536300, levels=c(1.0,2.0,0.0), labels=c("Male","Female","No Information"))
data$R1482600 <- factor(data$R1482600, levels=c(1.0,2.0,3.0,4.0), labels=c("Black","Hispanic","Mixed Race (Non-Hispanic)","Non-Black / Non-Hispanic"))
data$T6651300 <- factor(data$T6651300, levels=c(26.0,27.0,28.0,29.0,30.0,31.0,32.0), labels=c("26","27","28","29","30","31","32"))
data$Z9065500 <- cut(data$Z9065500, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
data$Z9065700 <- cut(data$Z9065700, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
return(data)
}
varlabels <- c( "PUBID _ YTH ID CODE 1997",
"KEY!SEX (SYMBOL) 1997",
"KEY!RACE_ETHNICITY (SYMBOL) 1997",
"CV_AGE_INT_DATE 2011",
"CVC_HOURS_WK_TEEN",
"CVC_HOURS_WK_ADULT"
)
# Use qnames rather than rnums
qnames = function(data) {
names(data) <- c("ID","Sex","Race","Age_2011","Total_teen_hours","Total_adult_hours")
return(data)
}
categories <- vallabels(new_data)
new_data <- qnames(new_data)
categories <- qnames(categories)
summary(new_data)
## ID Sex Race Age_2011
## Min. : 1 Min. :1.000 Min. :1.000 Min. :26.00
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:28.00
## Median :4502 Median :1.000 Median :4.000 Median :29.00
## Mean :4504 Mean :1.488 Mean :2.788 Mean :28.79
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:30.00
## Max. :9022 Max. :2.000 Max. :4.000 Max. :32.00
## NA's :1561
## Total_teen_hours Total_adult_hours
## Min. : 0 Min. : 0
## 1st Qu.: 1255 1st Qu.: 8190
## Median : 2741 Median :16396
## Mean : 3105 Mean :15595
## 3rd Qu.: 4470 3rd Qu.:22418
## Max. :18829 Max. :63722
## NA's :707 NA's :1596
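The five recoding statements in the provided script all map a negative NLSY97 code to NA. As a minimal sketch, assuming a toy data frame with made-up values (the names toy and hours are illustrative, not part of the download), the same effect can be had in one pass:

```r
# Toy data with hypothetical values; -1 to -5 stand in for the NLSY97 missing codes.
toy <- data.frame(ID = 1:5,
                  hours = c(1200, -4, 0, -5, 3400))

# Replace any of the five missing-value codes with NA, column by column.
# Assigning into toy[] keeps the result a data frame.
toy[] <- lapply(toy, function(x) replace(x, x %in% -5:-1, NA))

toy
```

A legitimate zero (no hours worked) is left alone, which matters later because zeros and missing values are treated differently when the data are trimmed.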
Next, I loaded the dplyr and magrittr packages. The dplyr package provides table data frames and the data manipulation functions used below, and magrittr provides the pipe operator (%>%) used to chain commands in later parts of the assignment.
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(magrittr)
## Loading required package: magrittr
I then created a new data frame called new_dataDF from the data set created from the NLSY97 using the tbl_df function. Below I show a glimpse of that data frame.
new_dataDF <- tbl_df(new_data)
glimpse(new_dataDF)
## Observations: 8,984
## Variables: 6
## $ ID (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Sex (int) 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2,...
## $ Race (int) 4, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2, 2, 2, 2,...
## $ Age_2011 (int) 29, 29, 28, 30, 29, 29, 28, 30, 29, NA, 29, ...
## $ Total_teen_hours (int) 5831, NA, 6489, 3292, 680, NA, 1650, 2082, 8...
## $ Total_adult_hours (int) NA, 29712, NA, 23390, 28056, 18379, NA, 1841...
Next I filtered this data frame to include only people who were 30 years of age at the time of the 2011 interview, storing the result in a new data frame called “ThirtyYearOlds”. I used the glimpse function to show that this was done correctly. Then I counted the people in that data frame by using the n() function inside summarize. The same count appears as the number of observations in the glimpse output.
ThirtyYearOlds <- filter (new_dataDF, Age_2011 == 30)
glimpse(ThirtyYearOlds)
## Observations: 1,546
## Variables: 6
## $ ID (int) 4, 8, 26, 27, 32, 33, 38, 55, 59, 68, 69, 78...
## $ Sex (int) 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2,...
## $ Race (int) 2, 4, 1, 1, 4, 4, 4, 2, 1, 1, 1, 4, 1, 2, 4,...
## $ Age_2011 (int) 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, ...
## $ Total_teen_hours (int) 3292, 2082, 760, 1592, 4611, 1862, 0, 1368, ...
## $ Total_adult_hours (int) 23390, 18419, 0, 11060, 16981, 26551, NA, 16...
ThirtyYearOlds %>% summarize(Total_ThirtyYearOlds=n())
## Source: local data frame [1 x 1]
##
## Total_ThirtyYearOlds
## (int)
## 1 1546
Next, I tested the null hypothesis that there is no difference by sex (Sex) in the mean cumulative hours worked from age 14 through age 19 (Total_teen_hours). Stated another way, knowing a person’s sex does not help predict the number of hours worked during their teen years, and vice versa. The alternative hypothesis is that there is a difference by sex in the mean cumulative hours worked from age 14 through age 19. Stated another way, sex and cumulative hours worked as a teen are not independent variables, i.e. one can be used to predict the other. As per the assignment, I set my alpha value, the probability of Type 1 error, to 0.05.
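In symbols, and assuming the pooled-variance two-sample t-test that t.test performs with var.equal=TRUE later in this write-up, the hypotheses and test statistic can be written as follows. The notation is my own shorthand rather than anything from the assignment: subscripts 1 and 2 denote the two groups, \(\bar{x}\) a sample mean, \(s^2\) a sample variance, and \(n\) a group size.

```latex
\[
H_0:\ \mu_{1} = \mu_{2}
\qquad
H_1:\ \mu_{1} \neq \mu_{2}
\]
\[
t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
\mathrm{df} = n_1 + n_2 - 2
\]
```

The null hypothesis is rejected at \(\alpha\) = 0.05 when the two-sided p-value for \(t\) on \(\mathrm{df}\) degrees of freedom falls below 0.05.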
To start, I first looked at the mean total hours worked for each sex in the data which were:
ThirtyYearOlds %>% group_by(Sex) %>%
summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))
## Source: local data frame [2 x 2]
##
## Sex mean_teen_hours
## (int) (dbl)
## 1 1 3528.906
## 2 2 2978.293
However, when I considered the data more carefully I realized that these numbers may be skewed by outliers (a few individuals who reported a great number of hours) and individuals who may not have worked at all (zero total hours worked). To determine this, I ran a few plots of the data.
To do this, I first needed to require the ggvis package to run plots.
require(ggvis)
## Loading required package: ggvis
Next I ran a plot for Male teens.
ThirtyYearOlds %>% filter(Sex==1) %>%
ggvis(~Total_teen_hours) %>% layer_histograms()
## Guessing width = 500 # range / 29
This plot shows that there are a number of individuals above the 8,000 hour mark that might skew my results, and that there are a number of individuals who have not worked at all. Therefore, I decided to run the plot using the total number of teen hours worked between 500 and 8,000 total hours worked.
ThirtyYearOlds %>% filter(Sex==1, Total_teen_hours >=500 & Total_teen_hours <= 8000) %>%
ggvis(~Total_teen_hours) %>% layer_histograms()
## Guessing width = 200 # range / 38
This plot gives me a more representative group of data.
I then ran the same plots for Female teen workers.
ThirtyYearOlds %>% filter(Sex==2) %>%
ggvis(~Total_teen_hours) %>% layer_histograms()
## Guessing width = 500 # range / 27
I also ran plots with trimmed data.
ThirtyYearOlds %>% filter(Sex==2, Total_teen_hours >=500 & Total_teen_hours <= 8000) %>%
ggvis(~Total_teen_hours) %>% layer_histograms()
## Guessing width = 200 # range / 38
Once again, these plots show that trimmed data provide a more accurate representation of teen hours worked for these groups. Therefore, I calculated my means again using these limitations on the data.
ThirtyYearOlds %>% group_by(Sex) %>%
filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000) %>%
summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))
## Source: local data frame [2 x 2]
##
## Sex mean_teen_hours
## (int) (dbl)
## 1 1 3405.310
## 2 2 3162.204
These means seem more reasonable when compared with our original values, which were:
ThirtyYearOlds %>% group_by(Sex) %>%
summarize(mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))
## Source: local data frame [2 x 2]
##
## Sex mean_teen_hours
## (int) (dbl)
## 1 1 3528.906
## 2 2 2978.293
Now that I’m confident that my trimmed data is a more accurate representation of the group, I’m ready to run a t-test to test the null hypothesis. To do this, I needed to create a new data frame for the t-test comparison using data trimmed according to the specifications above. I called the new data frame trim_teen_data.
# filter() has no na.rm argument; rows where Total_teen_hours is NA are dropped automatically
trim_teen_data <- ThirtyYearOlds %>%
filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000)
trim_teen_data
## Source: local data frame [1,228 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (int) (int) (int) (int)
## 1 4 2 2 30 3292 23390
## 2 8 2 4 30 2082 18419
## 3 26 1 1 30 760 0
## 4 27 1 1 30 1592 11060
## 5 32 2 4 30 4611 16981
## 6 33 2 4 30 1862 26551
## 7 55 2 2 30 1368 16934
## 8 59 2 1 30 5648 NA
## 9 68 1 1 30 1316 NA
## 10 69 2 1 30 3221 NA
## .. ... ... ... ... ... ...
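dplyr's filter() keeps only rows for which the condition evaluates to TRUE, so people with a missing Total_teen_hours fall out of the trimmed frame on their own. A minimal sketch on a toy data frame with made-up values (toy, hours and kept are illustrative names) shows the behavior:

```r
suppressPackageStartupMessages(library(dplyr))

# Toy data with hypothetical values; one row has a missing hours value.
toy <- data.frame(ID = 1:4, hours = c(300, 2500, NA, 9000))

# For the missing row the comparison evaluates to NA, not TRUE,
# so filter() drops it along with the out-of-range rows.
kept <- toy %>% filter(hours >= 500 & hours <= 8000)

kept
```

Only the row with 2,500 hours survives: 300 and 9,000 are out of range, and the NA row is excluded because its condition is not TRUE.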
I then ran a t-test on the trimmed and scrubbed data:
t.test(trim_teen_data$Total_teen_hours ~ trim_teen_data$Sex, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_teen_data$Total_teen_hours by trim_teen_data$Sex
## t = 2.3694, df = 1226, p-value = 0.01797
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 41.8123 444.3999
## sample estimates:
## mean in group 1 mean in group 2
## 3405.310 3162.204
The t-test shows that the estimate of the difference between the means is more than twice the standard error of that difference (t = 2.3694), and the p-value (0.01797) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). The 95% CI for the difference is [41.8, 444.4], which excludes zero. Therefore, I reject the null hypothesis.
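As a rough check on how these numbers fit together, the confidence interval can be rebuilt from the figures printed in the t.test output. This is a sketch that uses only the rounded values shown above, so small rounding differences from R's own interval are expected; the variable names are my own.

```r
# Values copied from the t.test() output above.
mean_male   <- 3405.310
mean_female <- 3162.204
t_stat      <- 2.3694
df          <- 1226

diff_means <- mean_male - mean_female   # estimated difference in means
se_diff    <- diff_means / t_stat       # standard error implied by the t statistic
t_crit     <- qt(0.975, df)             # two-sided 95% critical value

# 95% confidence interval: estimate plus or minus critical value times standard error.
ci <- diff_means + c(-1, 1) * t_crit * se_diff
ci
```

The result should land very close to the reported interval, which illustrates that the t statistic, the p-value, and the confidence interval are three views of the same estimate and standard error.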
I can test the null hypothesis in Question 4 using the same procedures I used in Question 3, but I will need to trim the data differently.
The null hypothesis I’m testing states that there is no difference by sex (Sex) in the mean cumulative hours worked from age 20 and older (Total_adult_hours). Stated another way, knowing a person’s sex does not help predict the number of hours worked during their adult years, and vice versa. The alternative hypothesis is that there is a difference by sex in the mean cumulative hours worked from age 20 and older. Stated another way, sex and cumulative hours worked as an adult are not independent variables, i.e. one can be used to predict the other. As per the assignment, I set my alpha value, the probability of Type 1 error, to 0.05.
Step 1. Calculate the means without data manipulation.
ThirtyYearOlds %>% group_by(Sex) %>%
summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Sex num mean_adult_hours
## (int) (int) (dbl)
## 1 1 767 20671.98
## 2 2 779 18252.12
Step 2. Create plot for Male Adults vs. the number of hours worked as an adult.
ThirtyYearOlds %>% filter(Sex==1) %>%
ggvis(~Total_adult_hours) %>% layer_histograms()
## Guessing width = 2000 # range / 32
Step 3. Create plot for Female Adults vs. the number of hours worked as an adult.
ThirtyYearOlds %>% filter(Sex==2) %>%
ggvis(~Total_adult_hours) %>% layer_histograms()
## Guessing width = 1000 # range / 37
The plots show that there are a number of people who haven’t worked any hours as an adult and that there are a number of outliers at the upper end of the scale. Based on the plots, I decided to filter my data to include only those individuals who have worked at least 5,000 hours and at most 30,000 hours. I then re-ran the plots using those filters:
ThirtyYearOlds %>% filter(Sex==1, Total_adult_hours >=5000 & Total_adult_hours <= 30000) %>%
ggvis(~Total_adult_hours) %>% layer_histograms()
## Guessing width = 1000 # range / 25
ThirtyYearOlds %>% filter(Sex==2, Total_adult_hours >=5000 & Total_adult_hours <= 30000) %>%
ggvis(~Total_adult_hours) %>% layer_histograms()
## Guessing width = 1000 # range / 25
These plots show a more accurate representation of data by which to run my t-test. Therefore, I move onto my next step.
Step 4. I calculate the group means using trimmed data.
ThirtyYearOlds %>% group_by(Sex) %>%
filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000) %>%
summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Sex num mean_adult_hours
## (int) (int) (dbl)
## 1 1 446 20396.74
## 2 2 544 18646.75
And then I compare them with the original means.
ThirtyYearOlds %>% group_by(Sex) %>%
summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Sex num mean_adult_hours
## (int) (int) (dbl)
## 1 1 767 20671.98
## 2 2 779 18252.12
I’m now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.
Step 5. First, I create new data frame with trimmed data.
# filter() has no na.rm argument; rows where Total_adult_hours is NA are dropped automatically
trim_adult_data <- ThirtyYearOlds %>%
filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000)
trim_adult_data
## Source: local data frame [990 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (int) (int) (int) (int)
## 1 4 2 2 30 3292 23390
## 2 8 2 4 30 2082 18419
## 3 27 1 1 30 1592 11060
## 4 32 2 4 30 4611 16981
## 5 33 2 4 30 1862 26551
## 6 55 2 2 30 1368 16934
## 7 78 2 4 30 365 6001
## 8 83 1 2 30 1910 14857
## 9 86 2 4 30 310 20959
## 10 102 2 2 30 3488 17674
## .. ... ... ... ... ... ...
Step 6. Run t-test.
t.test(trim_adult_data$Total_adult_hours ~ trim_adult_data$Sex, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_adult_data$Total_adult_hours by trim_adult_data$Sex
## t = 4.214, df = 988, p-value = 2.739e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 935.0507 2564.9255
## sample estimates:
## mean in group 1 mean in group 2
## 20396.74 18646.75
The t-test shows that the estimate of the difference between the means is more than four times the standard error of that difference (t = 4.214), and the p-value (0.00002739) is much less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). The 95% CI for the difference is [935, 2,565], which excludes zero. Therefore, I reject the null hypothesis.
The null hypothesis I’m testing states that there is no difference by race/ethnicity (Race) in the mean cumulative hours worked from age 14 through age 19 (Total_teen_hours). Stated another way, knowing a person’s race/ethnicity does not help predict the number of hours worked during their teen years, and vice versa. The alternative hypothesis is that there is a difference by race/ethnicity in the mean cumulative hours worked from age 14 through age 19. Stated another way, race/ethnicity and cumulative hours worked as a teen are not independent variables, i.e. one can be used to predict the other. As per the assignment, I set my alpha value, the probability of Type 1 error, to 0.05.
As per the assignment, to begin testing the null hypothesis, I must first recode the Race variable in my data set to indicate a “1” for any Non-Black, Non-Hispanic participant and “0” for all other racial categories. I do this through using the “ifelse” command.
First, I take a look at my original data frame:
ThirtyYearOlds
## Source: local data frame [1,546 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (int) (int) (int) (int)
## 1 4 2 2 30 3292 23390
## 2 8 2 4 30 2082 18419
## 3 26 1 1 30 760 0
## 4 27 1 1 30 1592 11060
## 5 32 2 4 30 4611 16981
## 6 33 2 4 30 1862 26551
## 7 38 2 4 30 0 NA
## 8 55 2 2 30 1368 16934
## 9 59 2 1 30 5648 NA
## 10 68 1 1 30 1316 NA
## .. ... ... ... ... ... ...
To preserve the integrity of my original data frame, I copy it into a new data frame called NewRaceCodes.
NewRaceCodes <- ThirtyYearOlds
Using the “ifelse” command I recode the Race variable according to the requirements of the assignment:
NewRaceCodes$Race <- ifelse(NewRaceCodes$Race==4, 1, 0)
NewRaceCodes
## Source: local data frame [1,546 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (dbl) (int) (int) (int)
## 1 4 2 0 30 3292 23390
## 2 8 2 1 30 2082 18419
## 3 26 1 0 30 760 0
## 4 27 1 0 30 1592 11060
## 5 32 2 1 30 4611 16981
## 6 33 2 1 30 1862 26551
## 7 38 2 1 30 0 NA
## 8 55 2 0 30 1368 16934
## 9 59 2 0 30 5648 NA
## 10 68 1 0 30 1316 NA
## .. ... ... ... ... ... ...
Comparing this to my original data above, I can see that the racial categories have been recoded properly.
However, to ensure that all the data has been recoded, I can run the “table” command to check whether there are any extraneous codes I need to account for.
y1 <- table(NewRaceCodes$Race)
y1
##
## 0 1
## 776 770
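One caveat with this check: by default, table() leaves missing values out of the counts, and ifelse() passes an NA through unchanged rather than recoding it to 0. The two counts above sum to 1,546, so no Race values are missing here, but a minimal sketch on a made-up vector (race, recoded and counts are illustrative names) shows how to make missing values visible if there were any:

```r
# Hypothetical race codes, including one missing value.
race <- c(1, 2, 3, 4, 4, NA)

# Code 4 becomes 1, every other observed code becomes 0, and NA stays NA.
recoded <- ifelse(race == 4, 1, 0)

# useNA = "ifany" adds a separate NA column whenever missing values are present.
counts <- table(recoded, useNA = "ifany")
counts
```

With useNA = "ifany", a missing value shows up in its own column instead of silently disappearing from the total.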
This frequency table shows that my data is clean. I can now run the same trimming and calculations on my new data set that I ran earlier in the assignment, but using Race rather than Sex as my grouping (independent) variable.
I will use the same limits I placed on Total_teen_hours in the previous exercise to conduct these calculations:
NewRaceCodes %>% group_by(Race) %>%
filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000) %>%
summarize(num=n(), mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Race num mean_teen_hours
## (dbl) (int) (dbl)
## 1 0 598 3002.256
## 2 1 630 3547.100
These means seem more reasonable than those from the untrimmed data:
NewRaceCodes %>% group_by(Race) %>%
summarize(num=n(), mean_teen_hours=mean(Total_teen_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Race num mean_teen_hours
## (dbl) (int) (dbl)
## 1 0 776 2860.372
## 2 1 770 3645.603
I am now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.
First I trim my data. I will call the new data frame trim_teen_race_data.
# filter() has no na.rm argument; rows where Total_teen_hours is NA are dropped automatically
trim_teen_race_data <- NewRaceCodes %>%
filter(Total_teen_hours >= 500 & Total_teen_hours <= 8000)
trim_teen_race_data
## Source: local data frame [1,228 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (dbl) (int) (int) (int)
## 1 4 2 0 30 3292 23390
## 2 8 2 1 30 2082 18419
## 3 26 1 0 30 760 0
## 4 27 1 0 30 1592 11060
## 5 32 2 1 30 4611 16981
## 6 33 2 1 30 1862 26551
## 7 55 2 0 30 1368 16934
## 8 59 2 0 30 5648 NA
## 9 68 1 0 30 1316 NA
## 10 69 2 0 30 3221 NA
## .. ... ... ... ... ... ...
Then I run my t-test.
t.test(trim_teen_race_data$Total_teen_hours ~ trim_teen_race_data$Race, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_teen_race_data$Total_teen_hours by trim_teen_race_data$Race
## t = -5.3588, df = 1226, p-value = 1e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -744.3175 -345.3708
## sample estimates:
## mean in group 0 mean in group 1
## 3002.256 3547.100
The t-test shows that the estimate of the difference between the means is more than five times the error in estimating that difference (t = -5.3588) and our probability value (p-value = 0.0000001) is much less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-744, -345]. Therefore, the null hypothesis has been rejected.
The null hypothesis I’m testing states that there is no difference by race/ethnicity (Race) in the mean cumulative hours worked from age 20 and older (Total_adult_hours). Stated another way, knowing a person’s race/ethnicity does not help predict the number of hours worked during their adult years, and vice versa. The alternative hypothesis is that there is a difference by race/ethnicity in the mean cumulative hours worked from age 20 and older. Stated another way, race/ethnicity and cumulative hours worked as an adult are not independent variables, i.e. one can be used to predict the other. As per the assignment, I set my alpha value, the probability of Type 1 error, to 0.05.
I will use the same limits I placed on Total_adult_hours in the previous exercises to conduct these calculations and run my t-test:
NewRaceCodes %>% group_by(Race) %>%
filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000) %>%
summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Race num mean_adult_hours
## (dbl) (int) (dbl)
## 1 0 479 19072.82
## 2 1 511 19774.75
Once again, these means seem more reasonable than those from the untrimmed data:
NewRaceCodes %>% group_by(Race) %>%
summarize(num=n(), mean_adult_hours=mean(Total_adult_hours, na.rm=TRUE))
## Source: local data frame [2 x 3]
##
## Race num mean_adult_hours
## (dbl) (int) (dbl)
## 1 0 776 18321.36
## 2 1 770 20519.65
I’m now confident that my trimmed data is more representative of the group, and I’m ready to run my t-test.
I create a new data frame “trim_adult_race_data”.
# filter() has no na.rm argument; rows where Total_adult_hours is NA are dropped automatically
trim_adult_race_data <- NewRaceCodes %>%
filter(Total_adult_hours >= 5000 & Total_adult_hours <= 30000)
trim_adult_race_data
## Source: local data frame [990 x 6]
##
## ID Sex Race Age_2011 Total_teen_hours Total_adult_hours
## (int) (int) (dbl) (int) (int) (int)
## 1 4 2 0 30 3292 23390
## 2 8 2 1 30 2082 18419
## 3 27 1 0 30 1592 11060
## 4 32 2 1 30 4611 16981
## 5 33 2 1 30 1862 26551
## 6 55 2 0 30 1368 16934
## 7 78 2 1 30 365 6001
## 8 83 1 0 30 1910 14857
## 9 86 2 1 30 310 20959
## 10 102 2 0 30 3488 17674
## .. ... ... ... ... ... ...
I run my t-test on my trimmed and scrubbed data.
t.test(trim_adult_race_data$Total_adult_hours ~ trim_adult_race_data$Race, var.equal=TRUE)
##
## Two Sample t-test
##
## data: trim_adult_race_data$Total_adult_hours by trim_adult_race_data$Race
## t = -1.685, df = 988, p-value = 0.0923
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1519.3668 115.5209
## sample estimates:
## mean in group 0 mean in group 1
## 19072.82 19774.75
The t-test shows that the estimate of the difference between the means is larger in magnitude than the standard error of that difference (t = -1.685). However, the p-value (0.0923) is greater than our acceptable probability of Type 1 error (\(\alpha\) = 0.05), and the 95% CI [-1,519, 116] spans zero. Therefore, we fail to reject the null hypothesis using trimmed data.
However, I do find it interesting that when I expand the sample to include all of the untrimmed data for 30-year-olds, we can reject the null hypothesis.
t.test(NewRaceCodes$Total_adult_hours ~ NewRaceCodes$Race, var.equal=TRUE)
##
## Two Sample t-test
##
## data: NewRaceCodes$Total_adult_hours by NewRaceCodes$Race
## t = -4.239, df = 1210, p-value = 2.416e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3215.729 -1180.859
## sample estimates:
## mean in group 0 mean in group 1
## 18321.36 20519.65
In this case, the t-test shows that the estimate of the difference between the means is more than four times the standard error of that difference in magnitude (t = -4.239), and the p-value (0.00002416) is less than our acceptable probability of Type 1 error (\(\alpha\) = 0.05). It also shows a 95% CI [-3,216, -1,181], which excludes zero.
This would indicate that a closer examination of the data, particularly in regard to adult racial categories, would be in order to figure out what accounts for this difference. For example, is there a better way to trim our data to give a more accurate sample of our population?
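One way to start that examination is to re-run the same test over several trimming windows and see how much the conclusion moves. The sketch below uses simulated hours, not the NLSY97 data, and every name in it (sim, trimmed_p, bounds) and every trimming bound is a hypothetical choice made only to illustrate the idea.

```r
set.seed(1)

# Simulated hours, NOT the NLSY97 data: two groups with made-up distributions.
sim <- data.frame(
  group = rep(c(0, 1), each = 500),
  hours = c(rnorm(500, mean = 18000, sd = 9000),
            rnorm(500, mean = 20000, sd = 9000))
)
sim$hours <- pmax(sim$hours, 0)  # hours worked cannot be negative

# Hypothetical helper: pooled-variance t-test p-value after trimming to [lower, upper].
trimmed_p <- function(data, lower, upper) {
  kept <- data[data$hours >= lower & data$hours <= upper, ]
  t.test(hours ~ group, data = kept, var.equal = TRUE)$p.value
}

# Three illustrative trimming windows, from no trimming to the 5,000 to 30,000 window.
bounds <- data.frame(lower = c(0, 2500, 5000),
                     upper = c(Inf, 40000, 30000))
bounds$p_value <- mapply(function(lo, hi) trimmed_p(sim, lo, hi),
                         bounds$lower, bounds$upper)
bounds
```

If the p-value swings across \(\alpha\) as the window changes, the conclusion depends on the trimming rule rather than on the data alone, which is a reason to justify the bounds before looking at the test result.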