The Assignment

For this assignment, I tested hypotheses about relationships among race, sex, and general health.

Findings

1. Obtaining the dataset

First, I downloaded the necessary dataset from the NLS, opened it in RStudio, and cleaned it up.

##        ID            Sex             Race       General_Health 
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2249   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :4502   Median :1.000   Median :4.000   Median :2.000  
##  Mean   :4504   Mean   :1.488   Mean   :2.788   Mean   :2.319  
##  3rd Qu.:6758   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :5.000  
##                                                 NA's   :1853

The summary showed 1,853 NAs in the General_Health variable, which I needed to trim from the dataset.

So I filtered the dataset to remove the rows where General_Health is NA, which left 7,131 observations.

## Source: local data frame [7,131 x 4]
## 
##       ID   Sex  Race General_Health
##    (int) (int) (int)          (int)
## 1      1     2     4              2
## 2      2     1     2              2
## 3      3     2     2              3
## 4      4     2     2              2
## 5      5     1     2              2
## 6      6     2     2              2
## 7      9     1     4              2
## 8     11     2     2              3
## 9     12     1     2              4
## 10    13     1     2              1
## ..   ...   ...   ...            ...
##        ID            Sex             Race       General_Health 
##  Min.   :   1   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2356   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :4630   Median :2.000   Median :3.000   Median :2.000  
##  Mean   :4592   Mean   :1.504   Mean   :2.733   Mean   :2.319  
##  3rd Qu.:6860   3rd Qu.:2.000   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :9022   Max.   :2.000   Max.   :4.000   Max.   :5.000
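
The filtering step can be sketched as follows. This is a minimal illustration in base R, not necessarily the exact call I ran: the data frame name `youth` and its six rows are made up for the example, and only `trimyouth` and the column names come from the output above.

```r
# Hypothetical stand-in for the full NLS extract; `youth` is an assumed name
# and these rows are invented for illustration.
youth <- data.frame(
  ID             = c(1L, 2L, 3L, 4L, 7L, 8L),
  Sex            = c(2L, 1L, 2L, 2L, 1L, 2L),
  Race           = c(4L, 2L, 2L, 2L, 4L, 1L),
  General_Health = c(2L, 2L, 3L, 2L, NA, NA)
)

# Keep only the rows where General_Health is observed.
trimyouth <- youth[!is.na(youth$General_Health), ]

# With dplyr loaded, an equivalent call would be:
#   trimyouth <- filter(youth, !is.na(General_Health))

nrow(trimyouth)                   # rows remaining after the filter
anyNA(trimyouth$General_Health)   # FALSE once the NAs are gone
```

Filtering on one column this way drops the whole row, so the other three variables stay aligned with the remaining General_Health values.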

I next used the table function to produce a frequency distribution for each variable: Sex, Race, and General_Health, in that order.

## 
##    1    2 
## 3539 3592
## 
##    1    2    3    4 
## 1970 1529   67 3565
## 
##    1    2    3    4    5 
## 1622 2578 2080  735  116
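
One way to get labelled frequency distributions in a single call is sketched below. The rows are toy values invented for the example, and `freqs` is a name chosen only for illustration.

```r
# Toy rows standing in for the filtered data; values are made up.
trimyouth <- data.frame(
  Sex            = c(1L, 2L, 2L, 1L, 2L),
  Race           = c(1L, 4L, 4L, 2L, 1L),
  General_Health = c(2L, 2L, 3L, 1L, 2L)
)

# One frequency table per variable, returned as a named list so each
# table prints under the name of the variable it belongs to.
freqs <- lapply(trimyouth[c("Sex", "Race", "General_Health")], table)
freqs

# Relative frequencies for a single variable.
prop.table(freqs$General_Health)
```

Keeping the tables in a named list avoids having to remember which unlabelled block of output belongs to which variable.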

I used the xtabs function to create a cross tabulation of general health by race and sex. First, I stored this cross tabulation in a new table object.

## , , trimyouth$General_Health = 1
## 
##               trimyouth$Sex
## trimyouth$Race   1   2
##              1 269 188
##              2 142 144
##              3   5  10
##              4 451 413
## 
## , , trimyouth$General_Health = 2
## 
##               trimyouth$Sex
## trimyouth$Race   1   2
##              1 298 328
##              2 277 257
##              3  16   9
##              4 707 686
## 
## , , trimyouth$General_Health = 3
## 
##               trimyouth$Sex
## trimyouth$Race   1   2
##              1 254 341
##              2 225 258
##              3   8  10
##              4 509 475
## 
## , , trimyouth$General_Health = 4
## 
##               trimyouth$Sex
## trimyouth$Race   1   2
##              1 104 158
##              2  98  90
##              3   4   3
##              4 126 152
## 
## , , trimyouth$General_Health = 5
## 
##               trimyouth$Sex
## trimyouth$Race   1   2
##              1  14  16
##              2  13  25
##              3   1   1
##              4  18  28
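
The long `trimyouth$Race` style labels in this output come from writing the columns with `$` inside the formula. A sketch of an alternative, assuming the same column names, is to pass the data frame through the `data` argument. The rows below are toy values and `health_tab` is a hypothetical name.

```r
# Toy rows standing in for the filtered data; values are made up.
trimyouth <- data.frame(
  Sex            = c(1L, 2L, 2L, 1L, 2L, 1L),
  Race           = c(1L, 4L, 4L, 2L, 1L, 1L),
  General_Health = c(2L, 2L, 3L, 1L, 2L, 2L)
)

# Naming the columns in the formula and passing `data =` gives the
# dimensions the short labels Race, Sex, and General_Health.
health_tab <- xtabs(~ Race + Sex + General_Health, data = trimyouth)

names(dimnames(health_tab))  # "Race" "Sex" "General_Health"
health_tab["1", "1", "2"]    # count for Race 1, Sex 1, General_Health 2
sum(health_tab)              # total cases; equals the number of rows
```

The counts are the same either way; only the labels on the printed table change.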

Then I used ftable to present the data in a more readable format.

##                              trimyouth$General_Health   1   2   3   4   5
## trimyouth$Race trimyouth$Sex                                             
## 1              1                                      269 298 254 104  14
##                2                                      188 328 341 158  16
## 2              1                                      142 277 225  98  13
##                2                                      144 257 258  90  25
## 3              1                                        5  16   8   4   1
##                2                                       10   9  10   3   1
## 4              1                                      451 707 509 126  18
##                2                                      413 686 475 152  28
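
The flattening step can be reproduced without the raw data by rebuilding the table from the counts printed above. This is a sketch: `counts`, `health_tab`, and `flat` are names chosen for illustration, and the dimension labels are shortened to Race, Sex, and General_Health.

```r
# Counts copied from the cross tabulation above. R fills arrays with the
# first index (Race) varying fastest, then Sex, then General_Health.
counts <- array(
  c(269, 142,  5, 451,   188, 144, 10, 413,   # General_Health = 1
    298, 277, 16, 707,   328, 257,  9, 686,   # General_Health = 2
    254, 225,  8, 509,   341, 258, 10, 475,   # General_Health = 3
    104,  98,  4, 126,   158,  90,  3, 152,   # General_Health = 4
     14,  13,  1,  18,    16,  25,  1,  28),  # General_Health = 5
  dim = c(4, 2, 5),
  dimnames = list(
    Race           = as.character(1:4),
    Sex            = as.character(1:2),
    General_Health = as.character(1:5)
  )
)
health_tab <- as.table(counts)

# Race and Sex index the rows; General_Health spreads across the columns.
flat <- ftable(health_tab, row.vars = c("Race", "Sex"))
flat

# Sanity checks against the frequency distributions reported earlier.
sum(health_tab)              # total number of cases
margin.table(health_tab, 1)  # totals by Race
```

Checking the margins against the one-way frequency tables is a quick way to confirm the counts were entered in the right order.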

Finally, I used summary to run a chi-square test of independence, with an \(\alpha\) of .05. The null hypothesis is that race, sex, and general health are mutually independent.

## Call: xtabs(formula = ~trimyouth$Race + trimyouth$Sex + trimyouth$General_Health)
## Number of cases in table: 7131 
## Number of factors: 3 
## Test for independence of all factors:
##  Chisq = 159.34, df = 31, p-value = 3.389e-19
##  Chi-squared approximation may be incorrect
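
As a sketch of what this test computes, assuming summary applies the standard Pearson statistic for mutual independence of all three factors: write \(n_{ijk}\) for the observed count with race \(i\), sex \(j\), and general health \(k\), and \(N = 7131\) for the total. Under the null hypothesis, the expected count in each cell is the product of the three marginal totals divided by \(N^2\), and the statistic sums the squared deviations over all cells:

\[
E_{ijk} = \frac{n_{i++}\, n_{+j+}\, n_{++k}}{N^{2}},
\qquad
\chi^{2} = \sum_{i,j,k} \frac{(n_{ijk} - E_{ijk})^{2}}{E_{ijk}}
\]

With \(I = 4\) race categories, \(J = 2\) sex categories, and \(K = 5\) health categories, the degrees of freedom agree with the output:

\[
df = IJK - 1 - (I - 1) - (J - 1) - (K - 1) = 40 - 1 - 3 - 1 - 4 = 31
\]

The warning about the approximation is consistent with some expected counts being small. For example, for race 3, sex 1, and general health 5:

\[
E_{3,1,5} = \frac{67 \times 3539 \times 116}{7131^{2}} \approx 0.54
\]

This is well below 5, the usual rule-of-thumb minimum for the chi-squared approximation, and it arises because race category 3 contains only 67 cases.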

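My p-value, 3.389e-19, is less than .05 (my \(\alpha\)), so I reject the null hypothesis. In other words, race, sex, and general health are not mutually independent: there is a statistically significant association among them. This test alone does not show which of the variables drive that association.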