For this assignment, I tested hypotheses about relationships among race, sex, and general health.
First, I downloaded the necessary data set from NLS, then opened it in RStudio and cleaned it up.
##        ID            Sex             Race       General_Health
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000
##  Median :4502   Median :1.000   Median :4.000   Median :2.000
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :2.319
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:3.000
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :5.000
##                                                 NA's   :1853
The summary showed 1,853 NAs in General_Health that I needed to trim from the dataset.
So I filtered the dataset to remove the rows with NAs in General_Health, which left 7,131 observations.
## Source: local data frame [7,131 x 4]
##
## ID Sex Race General_Health
## (int) (int) (int) (int)
## 1 1 2 4 2
## 2 2 1 2 2
## 3 3 2 2 3
## 4 4 2 2 2
## 5 5 1 2 2
## 6 6 2 2 2
## 7 9 1 4 2
## 8 11 2 2 3
## 9 12 1 2 4
## 10 13 1 2 1
## .. ... ... ... ...
##        ID            Sex             Race       General_Health
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :1.000
##  1st Qu.:2356   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000
##  Median :4630   Median :2.000   Median :3.000   Median :2.000
##  Mean   :4592   Mean   :1.504   Mean   :2.733   Mean   :2.319
##  3rd Qu.:6860   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:3.000
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :5.000
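The filtering step can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: the data frame name `youth` is hypothetical, the first three rows are copied from the printed output above, and the two rows with missing General_Health are made up so the example has something to remove.

```r
# Hypothetical stand-in for the raw data; `youth` is an assumed name.
# Rows with ID 7 and 8 are invented to illustrate missing values.
youth <- data.frame(
  ID             = c(1L, 2L, 3L, 7L, 8L),
  Sex            = c(2L, 1L, 2L, 1L, 2L),
  Race           = c(4L, 2L, 2L, 1L, 4L),
  General_Health = c(2L, 2L, 3L, NA, NA)
)

# Count missing values per column before trimming.
na_per_column <- colSums(is.na(youth))
print(na_per_column)

# Keep only rows where General_Health is observed (base R).
# With dplyr loaded, filter(youth, !is.na(General_Health)) does the same thing.
trimyouth <- youth[!is.na(youth$General_Health), ]

# The NA's row should no longer appear in the summary.
summary(trimyouth)
```

Filtering on General_Health alone, rather than calling `na.omit()` on the whole data frame, keeps the trimming limited to the one variable that had missing values.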
I next used the table function to produce a frequency distribution for each variable (Sex, Race, and General_Health, in that order).
##
## 1 2
## 3539 3592
##
## 1 2 3 4
## 1970 1529 67 3565
##
## 1 2 3 4 5
## 1622 2578 2080 735 116
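To show what table does, here is a small sketch that rebuilds the General_Health column from the counts reported above and tabulates it. The object names are assumptions for illustration, and the row order of the rebuilt vector is arbitrary since only the frequencies matter.

```r
# Counts for General_Health categories 1 through 5, taken from the output above.
health_counts <- c(1622, 2578, 2080, 735, 116)

# Rebuild a vector with one entry per respondent.
general_health <- rep(1:5, times = health_counts)

# Absolute frequencies, as produced by table().
freq <- table(general_health)
print(freq)

# Relative frequencies (proportions that sum to 1).
rel_freq <- prop.table(freq)
print(round(rel_freq, 3))
```

Adding `prop.table()` makes it easier to compare categories when the variables have very different numbers of levels.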
I used the xtabs function to cross-tabulate general health by race and sex. First, I created a new object containing this cross tabulation.
## , , trimyouth$General_Health = 1
##
## trimyouth$Sex
## trimyouth$Race 1 2
## 1 269 188
## 2 142 144
## 3 5 10
## 4 451 413
##
## , , trimyouth$General_Health = 2
##
## trimyouth$Sex
## trimyouth$Race 1 2
## 1 298 328
## 2 277 257
## 3 16 9
## 4 707 686
##
## , , trimyouth$General_Health = 3
##
## trimyouth$Sex
## trimyouth$Race 1 2
## 1 254 341
## 2 225 258
## 3 8 10
## 4 509 475
##
## , , trimyouth$General_Health = 4
##
## trimyouth$Sex
## trimyouth$Race 1 2
## 1 104 158
## 2 98 90
## 3 4 3
## 4 126 152
##
## , , trimyouth$General_Health = 5
##
## trimyouth$Sex
## trimyouth$Race 1 2
## 1 14 16
## 2 13 25
## 3 1 1
## 4 18 28
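A cross tabulation like this one can be built with a call along the following lines. This is a sketch under assumptions: it uses only the first ten rows printed earlier, and the names `trimyouth_head` and `health_tab` are hypothetical. It also passes the data frame through the `data` argument instead of writing `trimyouth$...` in the formula, which is a different style from the call shown in my output.

```r
# First ten rows of the trimmed data, copied from the printed output above.
trimyouth_head <- data.frame(
  ID             = c(1, 2, 3, 4, 5, 6, 9, 11, 12, 13),
  Sex            = c(2, 1, 2, 2, 1, 2, 1, 2, 1, 1),
  Race           = c(4, 2, 2, 2, 2, 2, 4, 2, 2, 2),
  General_Health = c(2, 2, 3, 2, 2, 2, 2, 3, 4, 1)
)

# Three-way cross tabulation: one Race-by-Sex slice per General_Health level.
health_tab <- xtabs(~ Race + Sex + General_Health, data = trimyouth_head)
print(health_tab)

# The dimension names come from the formula, so they stay short.
print(names(dimnames(health_tab)))
```

Using `data =` labels the dimensions `Race`, `Sex`, and `General_Health` rather than `trimyouth$Race` and so on, which makes the printed slices easier to read.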
Then I used ftable to present the data in a more readable format.
##                              trimyouth$General_Health   1   2   3   4   5
## trimyouth$Race trimyouth$Sex
## 1              1                                      269 298 254 104  14
##                2                                      188 328 341 158  16
## 2              1                                      142 277 225  98  13
##                2                                      144 257 258  90  25
## 3              1                                        5  16   8   4   1
##                2                                       10   9  10   3   1
## 4              1                                      451 707 509 126  18
##                2                                      413 686 475 152  28
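The flattening step can be reproduced from the cell counts alone. The sketch below rebuilds the three-way table from the numbers reported above and passes it to ftable. The object names `counts`, `health_tab`, and `flat` are assumptions for illustration.

```r
# Cell counts from the cross tabulation above. Arrays fill with the first
# index (Race) varying fastest, then Sex, then General_Health.
counts <- array(
  c(269, 142,  5, 451,  188, 144, 10, 413,   # General_Health = 1
    298, 277, 16, 707,  328, 257,  9, 686,   # General_Health = 2
    254, 225,  8, 509,  341, 258, 10, 475,   # General_Health = 3
    104,  98,  4, 126,  158,  90,  3, 152,   # General_Health = 4
     14,  13,  1,  18,   16,  25,  1,  28),  # General_Health = 5
  dim = c(4, 2, 5),
  dimnames = list(
    Race           = as.character(1:4),
    Sex            = as.character(1:2),
    General_Health = as.character(1:5)
  )
)
health_tab <- as.table(counts)

# Put Race and Sex in the rows and General_Health in the columns.
flat <- ftable(health_tab, row.vars = c("Race", "Sex"))
print(flat)
```

Choosing which variables go in `row.vars` controls the layout, so the same table could instead be flattened with General_Health in the rows if that reads better.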
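Finally, I called “summary” on the cross tabulation to run a chi-square test of independence, with an \(\alpha\) of .05. The null hypothesis is that race, sex, and general health are mutually independent.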
## Call: xtabs(formula = ~trimyouth$Race + trimyouth$Sex + trimyouth$General_Health)
## Number of cases in table: 7131
## Number of factors: 3
## Test for independence of all factors:
## Chisq = 159.34, df = 31, p-value = 3.389e-19
## Chi-squared approximation may be incorrect
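My p-value, 3.389e-19, is less than my \(\alpha\) of .05, so I reject the null hypothesis. In other words, race, sex, and general health are not all independent of one another: there is an association among at least some of these variables. The test does not say which variables are related, so follow-up tests would be needed for that. The warning that the chi-squared approximation may be incorrect appears because some cells have small expected counts (Race category 3 has only 67 cases), so the exact p-value should be read with some caution.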
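As a check on the reasoning, the test can be re-run from the cell counts, and the degrees of freedom and the source of the warning can be worked out by hand. This is a sketch, not my original code: the object names are hypothetical, and the degrees-of-freedom formula assumes the test is for mutual independence of all three factors, as the output states.

```r
# Rebuild the three-way table from the reported cell counts
# (Race varies fastest, then Sex, then General_Health).
counts <- array(
  c(269, 142,  5, 451,  188, 144, 10, 413,
    298, 277, 16, 707,  328, 257,  9, 686,
    254, 225,  8, 509,  341, 258, 10, 475,
    104,  98,  4, 126,  158,  90,  3, 152,
     14,  13,  1,  18,   16,  25,  1,  28),
  dim = c(4, 2, 5),
  dimnames = list(Race = as.character(1:4),
                  Sex = as.character(1:2),
                  General_Health = as.character(1:5))
)
health_tab <- as.table(counts)

# summary() on a table runs the chi-square test of independence of all factors.
res <- summary(health_tab)
print(res)

# Degrees of freedom: number of cells, minus 1, minus the free
# marginal parameters for each factor (levels - 1).
dims <- dim(health_tab)
df_by_hand <- prod(dims) - 1 - sum(dims - 1)

# Expected counts under mutual independence: n * p_race * p_sex * p_health.
n        <- sum(health_tab)
race_n   <- as.vector(margin.table(health_tab, 1))
sex_n    <- as.vector(margin.table(health_tab, 2))
health_n <- as.vector(margin.table(health_tab, 3))
expected <- outer(outer(race_n, sex_n), health_n) / n^2

# Cells with expected counts below 5 are what trigger the warning.
small_cells <- sum(expected < 5)
print(small_cells)

# Decision rule at alpha = .05.
alpha <- 0.05
reject_null <- res$p.value < alpha
print(reject_null)
```

With 4 race categories, 2 sexes, and 5 health categories, the hand calculation gives 40 - 1 - (3 + 1 + 4) = 31 degrees of freedom, which matches the df in the output above.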