Create an R code chunk below to load the tidyverse, ggplot2, and the NHANES package. Further, load the NHANES dataset.
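The setup chunk does not appear in the knitted output, so the following is a minimal sketch of what it might look like, assuming the tidyverse and NHANES packages are already installed. Note that `library(tidyverse)` already attaches ggplot2; the explicit call is kept only because the prompt asks for it.

```r
# Load the packages used throughout this assignment
library(tidyverse)  # dplyr, ggplot2, and the %>% pipe
library(ggplot2)    # redundant after tidyverse, but requested explicitly
library(NHANES)     # provides the NHANES data frame

# Make the NHANES dataset available in the workspace
data(NHANES)

# Quick look at the size of the data
dim(NHANES)
```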
Is there a relationship between an individual’s race (per the Race3 variable) and median household income (HHIncomeMid)?
Write down formal hypotheses to investigate the overall relationship. H_0: \(\mu_1 = \mu_2 = \dots = \mu_6\), that is, the mean household income is the same in all six Race3 groups. H_A: at least one race group has a different mean household income.
Create a graphical summary and write 3-4 sentences describing the visualization.
NHANES %>%
  filter(!is.na(Race3), !is.na(HHIncomeMid)) %>%
  ggplot(aes(x = Race3, y = HHIncomeMid)) +
  geom_boxplot() +
  labs(
    x = "Race",
    y = "Median Household Income",
    title = "Median Household Income by Race"
  )
Fit the model suitable for this analysis in a code chunk below.
income_aov <- aov(
  HHIncomeMid ~ Race3,
  data = NHANES
)
summary(income_aov)
##               Df    Sum Sq   Mean Sq F value Pr(>F)
## Race3          5 4.088e+11 8.176e+10   79.05 <2e-16 ***
## Residuals   4617 4.775e+12 1.034e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 5377 observations deleted due to missingness
Using the p-value, write a conclusion for the test we are conducting at 5% level of significance. The p-value (< 2e-16, from F = 79.05) is less than .05, so we reject the null hypothesis. We have statistically significant evidence that mean household income differs for at least one race group.
What are the conditions that must be satisfied for this analysis to be valid? Observations are independent of one another, the residuals within each race group are approximately normally distributed, and the population variances of household income are approximately equal across race groups.
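These conditions can be examined informally with the standard diagnostic plots. The sketch below is one possible check, not part of the required analysis; it refits the model so the chunk runs on its own, and `group_sd` is a name introduced here for illustration. Independence cannot be checked from plots and has to be argued from how the data were collected.

```r
library(NHANES)  # provides the NHANES data frame

# Refit the one-way ANOVA so this chunk is self-contained
income_aov <- aov(HHIncomeMid ~ Race3, data = NHANES)

# Residuals vs. fitted values: similar vertical spread in each group
# supports the equal-variance condition
plot(income_aov, which = 1)

# Normal Q-Q plot of the residuals: points close to the reference line
# support approximate normality
plot(income_aov, which = 2)

# Standard deviation of income within each race group (missing values dropped)
group_sd <- tapply(NHANES$HHIncomeMid, NHANES$Race3, sd, na.rm = TRUE)
group_sd

# A common rule of thumb compares the largest and smallest group SDs
max(group_sd) / min(group_sd)
```

A ratio well above 2 would suggest the equal-variance condition is doubtful; that threshold is a rule of thumb rather than a formal test.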
Conduct post-hoc tests on this model.
TukeyHSD(income_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = HHIncomeMid ~ Race3, data = NHANES)
##
## $Race3
##                        diff        lwr        upr     p adj
## Black-Asian      -23227.189 -30325.518 -16128.861 0.0000000
## Hispanic-Asian   -20658.593 -28429.258 -12887.928 0.0000000
## Mexican-Asian    -24354.050 -31699.225 -17008.874 0.0000000
## White-Asian       -2258.210  -8329.662   3813.243 0.8969901
## Other-Asian      -16969.394 -26440.450  -7498.339 0.0000051
## Hispanic-Black     2568.597  -3966.310   9103.503 0.8730276
## Mexican-Black     -1126.860  -7149.598   4895.877 0.9948262
## White-Black       20968.980  16588.990  25348.969 0.0000000
## Other-Black        6257.795  -2228.822  14744.412 0.2861920
## Mexican-Hispanic  -3695.457 -10497.687   3106.774 0.6324785
## White-Hispanic    18400.383  12998.413  23802.353 0.0000000
## Other-Hispanic     3689.198  -5367.271  12745.668 0.8551701
## White-Mexican     22095.840  17326.182  26865.498 0.0000000
## Other-Mexican      7384.655  -1309.480  16078.791 0.1489372
## Other-White      -14711.185 -22359.644  -7062.725 0.0000007
List the pairs of race groups which have significantly different median household incomes at 5% level of significance. The pairs with adjusted p-values below .05 are Black-Asian, Hispanic-Asian, Mexican-Asian, Other-Asian, White-Black, White-Hispanic, White-Mexican, and Other-White. Tukey's procedure compares group means of HHIncomeMid, so these are significant differences in mean household income.
Can we explain a person’s direct HDL cholesterol (DirectChol) based on the person’s weight?
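Rather than reading the pairs off the printed table, the significant comparisons can be filtered programmatically. This is a small sketch; `tukey_tbl` and `sig_pairs` are names introduced here, and the model is refit so the chunk runs on its own.

```r
library(NHANES)  # provides the NHANES data frame

# Refit the ANOVA and run Tukey's HSD
income_aov <- aov(HHIncomeMid ~ Race3, data = NHANES)
tukey_tbl <- TukeyHSD(income_aov)$Race3  # matrix: diff, lwr, upr, p adj

# Keep only the comparisons whose adjusted p-value is below 0.05
sig_pairs <- tukey_tbl[tukey_tbl[, "p adj"] < 0.05, , drop = FALSE]

# Names of the significantly different pairs
rownames(sig_pairs)
```

An equivalent check is that the confidence interval (`lwr`, `upr`) for each of these pairs excludes zero.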
Create a suitable graphical summary to capture the relationship between the two variables.
NHANES %>%
  filter(!is.na(Weight), !is.na(DirectChol)) %>%
  ggplot(aes(x = Weight, y = DirectChol)) +
  geom_point(alpha = .5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Weight (kg)",
    y = "Direct HDL Cholesterol",
    title = "Direct HDL Cholesterol vs. Weight"
  )
## `geom_smooth()` using formula = 'y ~ x'
Write 2-3 lines describing the relationship observed in the plot above, including a line about the strength and direction of the relationship observed. The scatterplot shows a weak negative relationship between weight and direct HDL cholesterol. As weight increases, HDL cholesterol tends to decrease slightly. However, the relationship is not very strong, as there is substantial variability around the fitted line.
Fit the model suitable for this analysis in a code chunk below.
chol_model <- lm(DirectChol ~ Weight, data = NHANES)
summary(chol_model)
##
## Call:
## lm(formula = DirectChol ~ Weight, data = NHANES)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.01291 -0.25722 -0.05621  0.20373  2.56954
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  1.7796789  0.0137402  129.52   <2e-16 ***
## Weight      -0.0053741  0.0001702  -31.58   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3772 on 8409 degrees of freedom
## (1589 observations deleted due to missingness)
## Multiple R-squared: 0.106, Adjusted R-squared: 0.1059
## F-statistic: 997.3 on 1 and 8409 DF, p-value: < 2.2e-16
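Interpret the estimated slope. The estimated slope is -0.0054: for each additional kilogram of weight, predicted direct HDL cholesterol decreases by about 0.0054 mmol/L on average. The negative slope suggests that heavier individuals tend to have lower direct HDL cholesterol; it describes an average association, not a guarantee for any one person.
Is it practically meaningful to interpret the intercept of this model? Is it statistically meaningful to interpret the intercept of this model? It is not practically meaningful to interpret the intercept, because it is the predicted HDL cholesterol (about 1.78) for a person weighing 0 kg, and no one weighs 0 kg. Statistically, the intercept is still estimable and is needed to position the fitted line, but interpreting it is an extrapolation far outside the observed weights.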
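To make the size of the slope concrete, the coefficients from the output above can be plugged into the fitted line. The weights 70 kg and 80 kg are arbitrary values chosen for illustration, and the mmol/L unit assumes DirectChol is recorded as described in the NHANES package documentation.

\[
\begin{aligned}
\widehat{\text{DirectChol}} &= 1.7797 - 0.0054 \times \text{Weight} \\
\text{Weight} = 70: \quad \widehat{\text{DirectChol}} &= 1.7796789 - 0.0053741 \times 70 \approx 1.40 \\
\text{Weight} = 80: \quad \widehat{\text{DirectChol}} &= 1.7796789 - 0.0053741 \times 80 \approx 1.35 \\
\text{Difference for } 10 \text{ kg}: \quad & 10 \times (-0.0053741) \approx -0.054 \text{ mmol/L}
\end{aligned}
\]

So even a 10 kg difference in weight corresponds to only a small change in predicted HDL cholesterol.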
Write the hypotheses for the slope test and conduct it at 1% level of significance a.k.a. decision and conclusion. H_0: \(\beta_1 = 0\) (no linear relationship between weight and direct HDL cholesterol). H_A: \(\beta_1 \neq 0\).
Since the p-value for the slope (< 2e-16) is less than .01, we reject the null hypothesis. We have statistically significant evidence of a linear relationship between weight and direct HDL cholesterol.
Interpret p-value for the slope test. The p-value is the probability of observing a slope at least as extreme as the one estimated (-0.0054) if there were truly no linear relationship between weight and HDL cholesterol, that is, if the true slope were 0. Here that probability is less than 2e-16.
Interpret the \(R^2\) of this model. \(R^2\) is the proportion of variation in direct HDL cholesterol that is explained by weight. Here \(R^2 = 0.106\), so weight explains only about 10.6% of the variability in HDL cholesterol.
What are the assumptions underlying this model. Please provide a 2-3 line explanation of what those assumptions are actually saying, not just the acronym. The relationship between weight and HDL cholesterol is linear, each observation is independent of the others, the residuals of the model are approximately normally distributed, and the variability of the residuals is roughly the same at every value of weight.
The p-value of the slope test is low, but so is the \(R^2\) value. Please comment on this dichotomy. A low p-value indicates that the relationship between weight and HDL cholesterol is statistically significant, while a low R^2 indicates that the relationship is weak. This can happen with large sample sizes, where even small effects become statistically detectable but are not practically strong.
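Based on the R output, determine the number of observations that were used to fit this model. There were 8,411 observations used. The residual degrees of freedom are 8,409, and the model estimates two coefficients (intercept and slope), so n = 8,409 + 2 = 8,411. Equivalently, NHANES has 10,000 rows and R deleted 1,589 of them due to missingness, leaving 10,000 - 1,589 = 8,411.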
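The count can also be checked directly from the fitted object instead of the printed summary. The sketch below refits the model so it runs on its own; `n_used`, `n_from_df`, and `n_complete` are names introduced here. All three should agree.

```r
library(NHANES)  # provides the NHANES data frame

# Refit the simple linear regression
chol_model <- lm(DirectChol ~ Weight, data = NHANES)

# Number of observations actually used in the fit
n_used <- nobs(chol_model)

# Same count from residual degrees of freedom plus estimated coefficients
n_from_df <- df.residual(chol_model) + length(coef(chol_model))

# Same count from rows with both variables observed
n_complete <- sum(complete.cases(NHANES[, c("DirectChol", "Weight")]))

c(n_used = n_used, n_from_df = n_from_df, n_complete = n_complete)
```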
If you used any external resources to write editable code or debug code that won’t work, please list them here. Please avoid saying things such as Googled it!, though that’s where you might begin. Please provide specific reference such as StackExchange, R-bloggers, notes from class such and such etc.