Create a project in R with these data. Use the dplyr command tbl_df to create a table data frame. Then, show a glimpse of the data. The inclusion of the correct variables in this glimpse is the focal point for assessment of successful completion of this item in the assignment.
# Confirm the working directory (getwd() only reports it; setwd() would change it)
getwd()
## [1] "C:/Users/usere/Desktop/assignment2 (1)"
new_data <- read.table('assignment2.dat', sep=' ')
names(new_data) <- c('R0000100','R0536300','R1482600','T6651300','Z9065500','Z9065700')
# Handle missing values
new_data[new_data == -1] <- NA # Refused
new_data[new_data == -2] <- NA # Don't know
new_data[new_data == -3] <- NA # Invalid missing
new_data[new_data == -4] <- NA # Valid missing
new_data[new_data == -5] <- NA # Non-interview
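The five recodes above can also be expressed as a single pass over the columns. The following is a minimal sketch on a made-up data frame; the names `toy` and `missing_codes` are illustrative assumptions and are not part of the assignment data.

```r
# Toy data for illustration only; not the NLSY97 file.
toy <- data.frame(
  id    = 1:4,
  hours = c(10, -1, -4, 25),
  sex   = c(1, 2, -5, 1)
)

# The five negative codes that stand for the different kinds of missing data
missing_codes <- c(-1, -2, -3, -4, -5)

# Replace every missing code with NA, column by column.
# Assigning into toy[] keeps the object a data frame.
toy[] <- lapply(toy, function(x) replace(x, x %in% missing_codes, NA))

print(toy)
```

Using `%in%` rather than `==` also avoids comparing against values that are already `NA`.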
# If there are values not categorized they will be represented as NA
vallabels <- function(data) {
  data$R0000100 <- cut(data$R0000100, c(0.0,1.0,1000.0,2000.0,3000.0,4000.0,5000.0,6000.0,7000.0,8000.0,9000.0,9999.0), labels=c("0","1 TO 999","1000 TO 1999","2000 TO 2999","3000 TO 3999","4000 TO 4999","5000 TO 5999","6000 TO 6999","7000 TO 7999","8000 TO 8999","9000 TO 9999"), right=FALSE)
  data$R0536300 <- factor(data$R0536300, levels=c(1.0,2.0,0.0), labels=c("Male","Female","No Information"))
  data$R1482600 <- factor(data$R1482600, levels=c(1.0,2.0,3.0,4.0), labels=c("Black","Hispanic","Mixed Race (Non-Hispanic)","Non-Black / Non-Hispanic"))
  data$T6651300 <- factor(data$T6651300, levels=c(26.0,27.0,28.0,29.0,30.0,31.0,32.0), labels=c("26","27","28","29","30","31","32"))
  data$Z9065500 <- cut(data$Z9065500, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  data$Z9065700 <- cut(data$Z9065700, c(0.0,1.0,500.0,1000.0,1500.0,2000.0,2500.0,3000.0,3500.0,4000.0,4500.0,5000.0,9.9999999E7), labels=c("0","1 TO 499","500 TO 999","1000 TO 1499","1500 TO 1999","2000 TO 2499","2500 TO 2999","3000 TO 3499","3500 TO 3999","4000 TO 4499","4500 TO 4999","5000 TO 99999999: 5000+"), right=FALSE)
  return(data)
}
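The value labels rely on `cut()` with `right=FALSE`, which makes each bin closed on the left and open on the right. A small sketch with made-up values (the vectors `hours` and `edge` are illustrative, not assignment data) shows where boundary values land:

```r
# Toy values chosen for illustration; not taken from the assignment data.
hours <- c(0, 1, 499, 500, 5000)
bins <- cut(hours,
            breaks = c(0, 1, 500, 1000, Inf),
            labels = c("0", "1 TO 499", "500 TO 999", "1000+"),
            right = FALSE)
print(data.frame(hours, bins))

# With right = FALSE and the default include.lowest = FALSE, a value equal
# to the final break falls outside every interval and becomes NA.
edge <- cut(9999, breaks = c(0, 9999), right = FALSE)
print(is.na(edge))
```

This is why a value exactly equal to the largest break would be reported as `NA` unless `include.lowest = TRUE` is set or the final break is raised.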
varlabels <- c( "PUBID - YTH ID CODE 1997",
"KEY_SEX (SYMBOL) 1997",
"KEY_RACE_ETHNICITY (SYMBOL) 1997",
"CV_AGE_INT_DATE 2011",
"CVC_HOURS_WK_TEEN",
"CVC_HOURS_WK_ADULT"
)
# Use qnames rather than rnums
qnames <- function(data) {
  names(data) <- c("PUBID_1997","KEY_SEX_1997","KEY_RACE_ETHNICITY_1997","CV_AGE_INT_DATE_2011","CVC_HOURS_WK_TEEN_XRND","CVC_HOURS_WK_ADULT_ALL_XRND")
  return(data)
}
#********************************************************************************************************
# Create a data file called "categories" with value labels.
categories <- vallabels(new_data)
# Rename variables using Qnames instead of Reference Numbers
new_data <- qnames(new_data)
categories <- qnames(categories)
# Produce summaries for the raw (uncategorized) data file
summary(new_data)
## PUBID_1997 KEY_SEX_1997 KEY_RACE_ETHNICITY_1997
## Min. : 1 Min. :1.000 Min. :1.000
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000
## Median :4502 Median :1.000 Median :4.000
## Mean :4504 Mean :1.488 Mean :2.788
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :9022 Max. :2.000 Max. :4.000
##
## CV_AGE_INT_DATE_2011 CVC_HOURS_WK_TEEN_XRND CVC_HOURS_WK_ADULT_ALL_XRND
## Min. :26.00 Min. : 0 Min. : 0
## 1st Qu.:28.00 1st Qu.: 1255 1st Qu.: 8190
## Median :29.00 Median : 2741 Median :16396
## Mean :28.79 Mean : 3105 Mean :15595
## 3rd Qu.:30.00 3rd Qu.: 4470 3rd Qu.:22418
## Max. :32.00 Max. :18829 Max. :63722
## NA's :1561 NA's :707 NA's :1596
# Remove the '#' before the following line to produce summaries for the "categories" data file.
# (categories was already created above, before the variables were renamed.)
#summary(categories)
#************************************************************************************************************
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
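# tbl_df() is required by the assignment; it is deprecated in dplyr >= 1.0.0, where as_tibble() is the equivalent
nls <- tbl_df(new_data)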
glimpse(nls)
## Observations: 8984
## Variables:
## $ PUBID_1997 (int) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,...
## $ KEY_SEX_1997 (int) 2, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1...
## $ KEY_RACE_ETHNICITY_1997 (int) 4, 2, 2, 2, 2, 2, 2, 4, 4, 4, 2, 2...
## $ CV_AGE_INT_DATE_2011 (int) 29, 29, 28, 30, 29, 29, 28, 30, 29...
## $ CVC_HOURS_WK_TEEN_XRND (int) 5831, NA, 6489, 3292, 680, NA, 165...
## $ CVC_HOURS_WK_ADULT_ALL_XRND (int) NA, 29712, NA, 23390, 28056, 18379...
summary(nls)
## PUBID_1997 KEY_SEX_1997 KEY_RACE_ETHNICITY_1997
## Min. : 1 Min. :1.000 Min. :1.000
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000
## Median :4502 Median :1.000 Median :4.000
## Mean :4504 Mean :1.488 Mean :2.788
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000
## Max. :9022 Max. :2.000 Max. :4.000
##
## CV_AGE_INT_DATE_2011 CVC_HOURS_WK_TEEN_XRND CVC_HOURS_WK_ADULT_ALL_XRND
## Min. :26.00 Min. : 0 Min. : 0
## 1st Qu.:28.00 1st Qu.: 1255 1st Qu.: 8190
## Median :29.00 Median : 2741 Median :16396
## Mean :28.79 Mean : 3105 Mean :15595
## 3rd Qu.:30.00 3rd Qu.: 4470 3rd Qu.:22418
## Max. :32.00 Max. :18829 Max. :63722
## NA's :1561 NA's :707 NA's :1596
require(magrittr)
## Loading required package: magrittr
nls %>%
  filter(CV_AGE_INT_DATE_2011 == 30) %>%
  summarize(nls=n())
## Source: local data frame [1 x 1]
##
## nls
## 1 1546
# Keep only respondents who were 30 at the 2011 interview
nls <- nls %>%
  filter(CV_AGE_INT_DATE_2011 == 30)
t.test(nls$CVC_HOURS_WK_TEEN_XRND~nls$KEY_SEX_1997)
##
## Welch Two Sample t-test
##
## data: nls$CVC_HOURS_WK_TEEN_XRND by nls$KEY_SEX_1997
## t = 4.423, df = 1339.6, p-value = 1.052e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 306.4014 794.8262
## sample estimates:
## mean in group 1 mean in group 2
## 3528.906 2978.293
t.test(nls$CVC_HOURS_WK_ADULT_ALL_XRND~nls$KEY_SEX_1997)
##
## Welch Two Sample t-test
##
## data: nls$CVC_HOURS_WK_ADULT_ALL_XRND by nls$KEY_SEX_1997
## t = 4.6599, df = 1146.6, p-value = 3.535e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1400.975 3438.741
## sample estimates:
## mean in group 1 mean in group 2
## 20671.98 18252.12
# Dichotomize race/ethnicity: 1 = Non-Black / Non-Hispanic, 0 = all other groups
nls$race <- ifelse(nls$KEY_RACE_ETHNICITY_1997==4, 1, 0)
t.test(nls$CVC_HOURS_WK_TEEN_XRND~nls$race)
##
## Welch Two Sample t-test
##
## data: nls$CVC_HOURS_WK_TEEN_XRND by nls$race
## t = -6.3593, df = 1412.8, p-value = 2.73e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1027.4485 -543.0118
## sample estimates:
## mean in group 0 mean in group 1
## 2860.372 3645.603
t.test(nls$CVC_HOURS_WK_ADULT_ALL_XRND~nls$race)
##
## Welch Two Sample t-test
##
## data: nls$CVC_HOURS_WK_ADULT_ALL_XRND by nls$race
## t = -4.2344, df = 1199.7, p-value = 2.466e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3216.827 -1179.760
## sample estimates:
## mean in group 0 mean in group 1
## 18321.36 20519.65
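Each quantity printed by `t.test()` can also be read from the object it returns, which avoids retyping numbers from the console. The sketch below uses simulated data, not NLSY97 values, and the names `toy` and `res` are illustrative assumptions.

```r
# Simulated data for illustration only; these are not NLSY97 values.
set.seed(1)
toy <- data.frame(
  hours = c(rnorm(50, mean = 3500, sd = 1500), rnorm(50, mean = 3000, sd = 1500)),
  sex   = rep(c(1, 2), each = 50)
)

# Welch two-sample t-test (the default, var.equal = FALSE)
res <- t.test(hours ~ sex, data = toy)

res$statistic   # t statistic
res$parameter   # Welch-adjusted degrees of freedom
res$p.value     # full-precision p-value
res$conf.int    # 95% CI for the difference in means (group 1 minus group 2)
res$estimate    # the two group means
```

Passing `data = toy` with bare column names is equivalent to the `nls$...` form used above; only the label on the `data:` line of the printed output differs.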
In items (3) through (6), (a) apply a probability of Type I error of 0.05 and (b) state the results of null hypothesis tests. Section 4.44 of the Publication Manual of the American Psychological Association (6th ed.) provides standards for reporting the results of t-tests within text.
The t-test of the null hypothesis that there is no difference by sex in mean cumulative hours worked from age 14 through age 19 was statistically significant, t(1339.6) = 4.42, p < .001, 95% CI [306.4014, 794.8262]. The null hypothesis is rejected at the 0.05 level.
The t-test of the null hypothesis that there is no difference by sex in mean cumulative hours worked from age 20 onward was statistically significant, t(1146.6) = 4.66, p < .001, 95% CI [1400.975, 3438.741]. The null hypothesis is rejected at the 0.05 level.
The t-test of the null hypothesis that there is no difference by race/ethnicity (Non-Black / Non-Hispanic versus all other groups) in mean cumulative hours worked from age 14 through age 19 was statistically significant, t(1412.8) = -6.36, p < .001, 95% CI [-1027.4485, -543.0118]. The null hypothesis is rejected at the 0.05 level.
The t-test of the null hypothesis that there is no difference by race/ethnicity (Non-Black / Non-Hispanic versus all other groups) in mean cumulative hours worked from age 20 onward was statistically significant, t(1199.7) = -4.23, p < .001, 95% CI [-3216.827, -1179.760]. The null hypothesis is rejected at the 0.05 level.
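R prints very small p-values in scientific notation, so `1.052e-05` means 0.00001052, not 1.052. A minimal sketch of the decision rule at the 0.05 level, using the four p-values printed in the t-test output above, follows. The helper `format_p_apa` is a hypothetical name introduced here for illustration, not a function from any package.

```r
# Hypothetical helper (not part of the original analysis): format a p-value
# in APA style, reporting values below .001 as "p < .001".
format_p_apa <- function(p) {
  if (p < 0.001) {
    "p < .001"
  } else {
    # Drop the leading zero, e.g. 0.032 -> ".032"
    paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
  }
}

alpha <- 0.05

# p-values copied from the four Welch t-test outputs
p_values <- c(teen_by_sex   = 1.052e-05,
              adult_by_sex  = 3.535e-06,
              teen_by_race  = 2.73e-10,
              adult_by_race = 2.466e-05)

# Reject the null hypothesis whenever p falls below alpha
reject_null <- p_values < alpha
apa_p <- vapply(p_values, format_p_apa, character(1))

print(data.frame(p = p_values, reject_null = reject_null, apa = apa_p))
```

Comparing the numeric p-value to `alpha` directly, rather than reading the printed digits, guards against overlooking the exponent.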