The following report is an analysis of data from the NLS database.
To begin, I downloaded the appropriate variables from the NLS dataset and loaded the required packages. I also renamed the variables to make them easier to interpret.
## PUBID_1997 SEX RACE VOCATION
## Min. : 1 Min. :1.000 Min. :1.000 Min. :-8.0000
## 1st Qu.:2249 1st Qu.:1.000 1st Qu.:1.000 1st Qu.: 0.0000
## Median :4502 Median :1.000 Median :4.000 Median : 0.0000
## Mean :4504 Mean :1.488 Mean :2.788 Mean : 0.1171
## 3rd Qu.:6758 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.: 1.0000
## Max. :9022 Max. :2.000 Max. :4.000 Max. : 1.0000
## NA's :2752
## Loading required package: ggvis
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Loading required package: magrittr
## Source: local data frame [8,984 x 4]
##
## PUBID_1997 SEX RACE VOCATION
## 1 1 2 4 0
## 2 2 1 2 1
## 3 3 2 2 NA
## 4 4 2 2 1
## 5 5 1 2 0
## 6 6 2 2 0
## 7 7 1 2 NA
## 8 8 2 4 0
## 9 9 1 4 0
## 10 10 1 4 0
## .. ... ... ... ...
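The variables chosen for the chi-square test were sex, race, and vocational concentration.
The summary of the data shows that VOCATION contains negative values (minimum of -8) in addition to 2,752 NA values. I recoded the negative values to NA and then removed all rows containing NA from the dataset.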
contingency$VOCATION <- ifelse(contingency$VOCATION < 0, NA, contingency$VOCATION)
contingency <- na.omit(contingency)
summary(contingency)
## PUBID_1997 SEX RACE VOCATION
## Min. : 1 Min. :1.000 Min. :1.000 Min. :0.0000
## 1st Qu.:2328 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:0.0000
## Median :4554 Median :1.000 Median :4.000 Median :0.0000
## Mean :4481 Mean :1.499 Mean :2.857 Mean :0.3216
## 3rd Qu.:6577 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:1.0000
## Max. :9021 Max. :2.000 Max. :4.000 Max. :1.0000
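The recoding step above can be illustrated on a small made-up data frame. This is only a sketch: the `toy` data frame and its values are hypothetical and are not rows from the NLS data; only the column names match the report.

```r
# Hypothetical toy data with one negative code and one NA in VOCATION
toy <- data.frame(
  PUBID_1997 = 1:6,
  SEX        = c(2, 1, 2, 2, 1, 1),
  RACE       = c(4, 2, 2, 2, 2, 4),
  VOCATION   = c(0, 1, NA, 1, -8, 0)
)

# Recode negative values to NA; which() skips the entries that are already NA
toy$VOCATION <- replace(toy$VOCATION, which(toy$VOCATION < 0), NA)

# Drop every row that has an NA in any column
cleaned <- na.omit(toy)

nrow(cleaned)        # number of rows that remain
summary(cleaned)     # VOCATION should now contain only 0 and 1
```

Note that `na.omit()` drops a row if any column is NA, so it is worth checking that no other variable carries missing values that would silently shrink the sample.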
Hypothesis (null): There is no relationship among sex, race, and vocational concentration.
Hypothesis (alternative): There is a relationship among sex, race, and vocational concentration.
To test the null hypothesis, I created a contingency table and ran a chi-square test.
SEX_RACE_VOCATION <- xtabs(~contingency$SEX + contingency$RACE + contingency$VOCATION)
ftable(SEX_RACE_VOCATION)
## contingency$VOCATION 0 1
## contingency$SEX contingency$RACE
## 1 1 485 220
## 2 412 193
## 3 13 10
## 4 996 704
## 2 1 543 253
## 2 458 124
## 3 23 2
## 4 1180 442
summary(SEX_RACE_VOCATION)
## Call: xtabs(formula = ~contingency$SEX + contingency$RACE + contingency$VOCATION)
## Number of cases in table: 6058
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 133.05, df = 10, p-value = 1.113e-23
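To make the test more concrete, the statistic can be recomputed by hand. The sketch below assumes the test reported by `summary()` is the Pearson chi-square for mutual independence of all three factors, with expected counts built from the three one-way margins. The counts are transcribed from the `ftable` output above; the object names are my own.

```r
# Counts from the ftable output; SEX varies fastest, then RACE, then VOCATION
counts <- array(
  c(485, 543, 412, 458, 13, 23, 996, 1180,   # VOCATION = 0
    220, 253, 193, 124, 10,  2, 704,  442),  # VOCATION = 1
  dim = c(2, 4, 2),
  dimnames = list(SEX = c("1", "2"),
                  RACE = c("1", "2", "3", "4"),
                  VOCATION = c("0", "1"))
)

n <- sum(counts)

# One-way marginal proportions for each factor
p_sex  <- apply(counts, 1, sum) / n
p_race <- apply(counts, 2, sum) / n
p_voc  <- apply(counts, 3, sum) / n

# Expected counts under mutual independence: n * p_sex * p_race * p_voc
expected <- n * outer(outer(p_sex, p_race), p_voc)

# Pearson chi-square statistic and its degrees of freedom
chisq <- sum((counts - expected)^2 / expected)
df    <- prod(dim(counts)) - 1 - sum(dim(counts) - 1)
p_value <- pchisq(chisq, df, lower.tail = FALSE)

c(chisq = chisq, df = df, p_value = p_value)
```

The degrees of freedom work out to 16 - 1 - (1 + 3 + 1) = 10, matching the summary output, and the computed statistic can be compared with the Chisq value reported there.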
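The chi-square test rejects the null hypothesis at the .05 level (Chisq = 133.05, df = 10, p-value = 1.113e-23).
Based on this test, sex, race, and vocational concentration are not mutually independent: there is a relationship among at least some of these variables.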
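The overall test does not say which variables are related to one another. One possible follow-up, sketched below and not part of the original analysis, is to collapse the three-way table into two-way margins and test each pair separately. The counts are transcribed from the `ftable` output above, and the object names are assumptions of mine.

```r
# Counts from the ftable output; SEX varies fastest, then RACE, then VOCATION
counts <- array(
  c(485, 543, 412, 458, 13, 23, 996, 1180,   # VOCATION = 0
    220, 253, 193, 124, 10,  2, 704,  442),  # VOCATION = 1
  dim = c(2, 4, 2),
  dimnames = list(SEX = c("1", "2"),
                  RACE = c("1", "2", "3", "4"),
                  VOCATION = c("0", "1"))
)

# Collapse over the third variable to get each two-way table
sex_voc  <- margin.table(counts, c(1, 3))
race_voc <- margin.table(counts, c(2, 3))
sex_race <- margin.table(counts, c(1, 2))

# Pearson chi-square test of independence for each pair
# (correct = FALSE turns off the continuity correction for the 2 x 2 table)
test_sex_voc  <- chisq.test(sex_voc, correct = FALSE)
test_race_voc <- chisq.test(race_voc)
test_sex_race <- chisq.test(sex_race)

test_sex_voc
test_race_voc
test_sex_race
```

Running three tests on the same data raises the chance of a false positive, so the pairwise p-values are best read with a multiple-comparison adjustment in mind.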