Instructions

Task 0

Create an R code chunk below to load the tidyverse, ggplot2, and the NHANES package. Further, load the NHANES dataset.
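A minimal sketch of such a chunk, assuming the tidyverse and NHANES packages are already installed. Loading the tidyverse also attaches ggplot2, so the separate ggplot2 call is redundant and is kept only because the task names it.

```r
# Assumes install.packages(c("tidyverse", "NHANES")) has been run beforehand
library(tidyverse)  # attaches ggplot2, dplyr, and the %>% pipe
library(ggplot2)    # redundant after tidyverse; kept because the task names it
library(NHANES)     # package that ships the NHANES data frame

data(NHANES)        # load the NHANES dataset into the workspace
```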

Homework question 1

Is there a relationship between an individual’s race (per the Race3 variable) and median household income (HHIncomeMid)?

Task 1

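Write down formal hypotheses to investigate the overall relationship.

\(H_0: \mu_1 = \mu_2 = \dots = \mu_6\), where \(\mu_i\) is the mean household income (HHIncomeMid) of the \(i\)-th Race3 group.

\(H_A\): at least one race group has a different mean household income.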

Task 2

Create a graphical summary and write 3-4 sentences describing the visualization.

# Drop rows missing either variable, then draw one boxplot per race group
NHANES %>%
  filter(!is.na(Race3), !is.na(HHIncomeMid)) %>%
  ggplot(aes(x = Race3, y = HHIncomeMid)) +
  geom_boxplot() +
  labs(
    x = "Race",
    y = "Median Household Income",
    title = "Median Household Income by Race"
  )

Task 3

Fit the model suitable for this analysis in a code chunk below.

# One-way ANOVA of household income on race group
income_aov <- aov(
  HHIncomeMid ~ Race3,
  data = NHANES
)
summary(income_aov)
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## Race3          5 4.088e+11 8.176e+10   79.05 <2e-16 ***
## Residuals   4617 4.775e+12 1.034e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5377 observations deleted due to missingness

Task 4

Using the p-value, write a conclusion for the test we are conducting at 5% level of significance.

The p-value (< 2e-16) is less than 0.05, so we reject the null hypothesis. We have statistically significant evidence that mean household income differs for at least one race group.
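As a check on the ANOVA table, the F statistic and p-value can be approximately recomputed from the printed sums of squares and degrees of freedom. This is a sketch that uses the rounded values from the output, so it matches only up to rounding; the variable names are illustrative.

```r
# Values transcribed from the summary(income_aov) table (rounded as printed)
ss_between <- 4.088e11
df_between <- 5
ss_within  <- 4.775e12
df_within  <- 4617

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within
f_stat     <- ms_between / ms_within

# Upper-tail probability of the F distribution under the null hypothesis
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

alpha       <- 0.05
reject_null <- p_value < alpha
```

The decision rule is the comparison on the last line: the null hypothesis is rejected whenever the p-value falls below the chosen significance level.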

Task 5

What are the conditions that must be satisfied for this analysis to be valid?

The observations are independent of one another, the distribution of residuals within each race group is approximately normal, and the population variances of household income are approximately equal across race groups.
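The normality and equal-variance conditions can be examined from the residuals of a fitted one-way ANOVA. The sketch below defines a hypothetical helper, `check_group_spread()`, and demonstrates it on R's built-in PlantGrowth data as a stand-in so that it runs without the NHANES package. It is not output from the income model.

```r
# Hypothetical helper: residual standard deviation per group for a one-way aov fit
check_group_spread <- function(fit) {
  mf  <- model.frame(fit)               # rows actually used in the fit
  res <- residuals(fit)
  group_sd <- tapply(res, mf[[2]], sd)  # column 2 holds the grouping factor
  list(group_sd = group_sd, sd_ratio = max(group_sd) / min(group_sd))
}

# Stand-in example on built-in data (not the NHANES model)
demo_fit <- aov(weight ~ group, data = PlantGrowth)
spread   <- check_group_spread(demo_fit)

# Shapiro-Wilk test of residual normality (accepts 3 to 5000 observations)
normality <- shapiro.test(residuals(demo_fit))
```

Applied to this homework, the call would be `check_group_spread(income_aov)`. A commonly used rule of thumb treats a largest-to-smallest standard deviation ratio below about 2 as acceptable, though that cutoff is a convention rather than a formal test.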

Task 6

Conduct post-hoc tests on this model.

TukeyHSD(income_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = HHIncomeMid ~ Race3, data = NHANES)
## 
## $Race3
##                        diff        lwr        upr     p adj
## Black-Asian      -23227.189 -30325.518 -16128.861 0.0000000
## Hispanic-Asian   -20658.593 -28429.258 -12887.928 0.0000000
## Mexican-Asian    -24354.050 -31699.225 -17008.874 0.0000000
## White-Asian       -2258.210  -8329.662   3813.243 0.8969901
## Other-Asian      -16969.394 -26440.450  -7498.339 0.0000051
## Hispanic-Black     2568.597  -3966.310   9103.503 0.8730276
## Mexican-Black     -1126.860  -7149.598   4895.877 0.9948262
## White-Black       20968.980  16588.990  25348.969 0.0000000
## Other-Black        6257.795  -2228.822  14744.412 0.2861920
## Mexican-Hispanic  -3695.457 -10497.687   3106.774 0.6324785
## White-Hispanic    18400.383  12998.413  23802.353 0.0000000
## Other-Hispanic     3689.198  -5367.271  12745.668 0.8551701
## White-Mexican     22095.840  17326.182  26865.498 0.0000000
## Other-Mexican      7384.655  -1309.480  16078.791 0.1489372
## Other-White      -14711.185 -22359.644  -7062.725 0.0000007

Task 7

List the pairs of race groups which have significantly different median household incomes at 5% level of significance.

Based on the Tukey-adjusted p-values, the pairs with significantly different mean household incomes at the 5% level of significance are Black-Asian, Hispanic-Asian, Mexican-Asian, Other-Asian, White-Black, White-Hispanic, White-Mexican, and Other-White.
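The significant pairs can be picked out by filtering the adjusted p-values at 0.05. The sketch below transcribes the `p adj` column from the Tukey output into a named vector so that it runs on its own; the vector name `p_adj` is illustrative.

```r
# Adjusted p-values transcribed from the TukeyHSD(income_aov) output
p_adj <- c(
  "Black-Asian"      = 0.0000000,
  "Hispanic-Asian"   = 0.0000000,
  "Mexican-Asian"    = 0.0000000,
  "White-Asian"      = 0.8969901,
  "Other-Asian"      = 0.0000051,
  "Hispanic-Black"   = 0.8730276,
  "Mexican-Black"    = 0.9948262,
  "White-Black"      = 0.0000000,
  "Other-Black"      = 0.2861920,
  "Mexican-Hispanic" = 0.6324785,
  "White-Hispanic"   = 0.0000000,
  "Other-Hispanic"   = 0.8551701,
  "White-Mexican"    = 0.0000000,
  "Other-Mexican"    = 0.1489372,
  "Other-White"      = 0.0000007
)

# Keep only the pairs whose adjusted p-value is below the 5% level
significant_pairs <- names(p_adj)[p_adj < 0.05]
```

With the fitted object available, the same filter can be applied directly to the matrix returned by `TukeyHSD(income_aov)$Race3`, using its `"p adj"` column.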

Homework question 2

Can we explain a person’s direct HDL cholesterol (DirectChol) based on the person’s weight?

Task 1

Create a suitable graphical summary to capture the relationship between the two variables.

# Scatterplot of HDL cholesterol against weight with a fitted least-squares line
NHANES %>%
  filter(!is.na(Weight), !is.na(DirectChol)) %>%
  ggplot(aes(x = Weight, y = DirectChol)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Weight (kg)",
    y = "Direct HDL Cholesterol",
    title = "Direct HDL Cholesterol vs. Weight"
  )
## `geom_smooth()` using formula = 'y ~ x'

Task 2

Write 2-3 lines describing the relationship observed in the plot above, including a line about the strength and direction of the relationship observed.

The scatterplot shows a weak negative linear relationship between weight and direct HDL cholesterol. As weight increases, HDL cholesterol tends to decrease slightly. However, the relationship is not very strong, as there is substantial variability around the fitted line.

Task 3

Fit the model suitable for this analysis in a code chunk below.

chol_model <- lm(DirectChol ~ Weight, data = NHANES)
summary(chol_model)
## 
## Call:
## lm(formula = DirectChol ~ Weight, data = NHANES)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01291 -0.25722 -0.05621  0.20373  2.56954 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.7796789  0.0137402  129.52   <2e-16 ***
## Weight      -0.0053741  0.0001702  -31.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3772 on 8409 degrees of freedom
##   (1589 observations deleted due to missingness)
## Multiple R-squared:  0.106,  Adjusted R-squared:  0.1059 
## F-statistic: 997.3 on 1 and 8409 DF,  p-value: < 2.2e-16

Task 4

Interpret the estimated slope.

The estimated slope is -0.0053741. On average, each additional 1 kg of weight is associated with a decrease of about 0.0054 in direct HDL cholesterol (recorded in mmol/L in the NHANES package). The negative slope suggests that heavier individuals tend to have lower direct HDL cholesterol.
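The slope can be made concrete by comparing fitted values at two weights. This sketch hard-codes the coefficients printed in the model summary; `predict_hdl()` is a hypothetical helper, and the weights 80, 81, and 90 kg are arbitrary illustration values.

```r
# Coefficients transcribed from summary(chol_model)
intercept <- 1.7796789
slope     <- -0.0053741

# Hypothetical helper: fitted mean direct HDL cholesterol at a given weight (kg)
predict_hdl <- function(weight_kg) intercept + slope * weight_kg

# Difference in fitted HDL for a 1 kg and a 10 kg increase in weight
diff_1kg  <- predict_hdl(81) - predict_hdl(80)
diff_10kg <- predict_hdl(90) - predict_hdl(80)
```

The 1 kg difference equals the slope itself, and the 10 kg difference is ten times the slope, regardless of the starting weight.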

Task 5

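Is it practically meaningful to interpret the intercept of this model? Is it statistically meaningful to interpret the intercept of this model?

It is not practically meaningful to interpret the intercept, because no one weighs 0 kg and a weight of 0 lies far outside the range of the data. Statistically, the intercept is estimated at about 1.78 and is significantly different from zero (p-value < 2e-16), but its role is to anchor the fitted line rather than to describe a real individual.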

Task 6

Write the hypotheses for the slope test and conduct it at 1% level of significance a.k.a. decision and conclusion.

\(H_0: \beta_1 = 0\) (no linear relationship between weight and direct HDL cholesterol)

\(H_A: \beta_1 \neq 0\)

Since the p-value for the slope (< 2e-16) is less than 0.01, we reject the null hypothesis. We have statistically significant evidence of a linear relationship between weight and direct HDL cholesterol.

Task 7

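Interpret p-value for the slope test.

The p-value is the probability of observing a slope at least as extreme as the one estimated (-0.0053741) if there were truly no linear relationship between weight and HDL cholesterol. Here that probability is less than 2e-16, so such a slope would be extremely unlikely under the null hypothesis.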
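The slope test can be reproduced approximately from the printed estimate and standard error. This is a sketch using the rounded values in the coefficient table, so the t statistic agrees with the output only up to rounding; the variable names are illustrative.

```r
# Values transcribed from the Weight row of summary(chol_model)
estimate  <- -0.0053741
std_error <- 0.0001702
resid_df  <- 8409

# t statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution with the residual degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = resid_df)

alpha          <- 0.01
reject_at_1pct <- p_value < alpha
```

The two-sided p-value doubles the lower-tail area because the alternative hypothesis allows the slope to differ from zero in either direction.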

Task 8

Interpret the \(R^2\) of this model.

The \(R^2\) value of 0.106 indicates that about 10.6% of the variation in direct HDL cholesterol is explained by its linear relationship with weight. This low \(R^2\) suggests that weight explains only a small fraction of the variability in HDL cholesterol.
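In simple linear regression, \(R^2\) is the square of the correlation between the two variables, and the overall F statistic is the square of the slope's t statistic. The sketch below checks both relationships using rounded values from the model summary; the variable names are illustrative.

```r
# Values transcribed from summary(chol_model)
r_squared <- 0.106
slope     <- -0.0053741
t_stat    <- -31.58

# Correlation takes the sign of the slope and the magnitude sqrt(R^2)
r <- sign(slope) * sqrt(r_squared)

# Percentage of variation in DirectChol explained by Weight
pct_explained <- 100 * r_squared

# With one predictor, the F statistic equals the squared t statistic
f_from_t <- t_stat^2
```

The squared t statistic agrees with the printed F statistic of 997.3 up to rounding.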

Task 9

What are the assumptions underlying this model? Please provide a 2-3 line explanation of what those assumptions are actually saying, not just the acronym.

The model assumes that the relationship between weight and HDL cholesterol is linear, that each observation is independent of the others, that the residuals are approximately normally distributed, and that the variability of the residuals is roughly constant across all values of weight.

Task 10

The p-value of the slope test is low, but so is the \(R^2\) value. Please comment on this dichotomy.

A low p-value indicates that the relationship between weight and HDL cholesterol is statistically significant, while a low \(R^2\) indicates that the relationship is weak. The two answer different questions: the p-value measures the evidence that the slope is not zero, whereas \(R^2\) measures how much of the variation the model explains. Both can be low with large sample sizes, where even small effects become statistically detectable without being practically strong.
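A small simulation can illustrate how a tiny p-value and a low \(R^2\) coexist in a large sample. The sample size, slope, and noise level below are assumed values chosen for illustration and are not taken from the NHANES data.

```r
set.seed(1)  # fix the random numbers so the sketch is reproducible

n <- 8000                # large sample, similar in scale to this analysis
x <- rnorm(n)            # simulated predictor
y <- 0.3 * x + rnorm(n)  # weak true effect buried in substantial noise

fit <- lm(y ~ x)
s   <- summary(fit)

p_value   <- s$coefficients["x", "Pr(>|t|)"]  # p-value of the slope test
r_squared <- s$r.squared                      # proportion of variance explained
```

Because the standard error of the slope shrinks as the sample grows, the weak effect is detected with near certainty even though most of the variation in `y` remains unexplained.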

Task 11

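Based on the R output, determine the number of observations that were used to fit this model.

There were 8,411 observations used to fit the model. The residual degrees of freedom are 8,409 and the model estimates two coefficients (the intercept and the slope), so n = 8,409 + 2 = 8,411. Equivalently, the NHANES data has 10,000 rows and R deleted 1,589 of them due to missingness: 10,000 - 1,589 = 8,411.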
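The sample size follows from the residual degrees of freedom, which equal the number of observations minus the number of estimated coefficients. The sketch below performs that arithmetic with the values printed in the output; the variable names are illustrative.

```r
# Values transcribed from summary(chol_model)
resid_df  <- 8409  # "on 8409 degrees of freedom"
n_coef    <- 2     # intercept and slope
n_deleted <- 1589  # "observations deleted due to missingness"

# Observations used in the fit, and rows in the data before deletion
n_used  <- resid_df + n_coef
n_total <- n_used + n_deleted
```

With the fitted object available, `nobs(chol_model)` reports the number of observations used directly.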

External resources used

If you used any external resources to write editable code or debug code that won’t work, please list them here. Please avoid saying things such as Googled it!, though that’s where you might begin. Please provide specific reference such as StackExchange, R-bloggers, notes from class such and such etc.