Spring 2025 - Gloria Lewis
Reading: Chapter 11
Homework: E5.1-E5.3; R5.1, R5.2
Determine which of the following statements are true and false. For each statement that is false, explain why it is false.
If predictors are collinear, then removing one variable will have no influence on the point estimate of another variable’s coefficient.
False. When predictors are collinear they carry overlapping information, so removing one variable generally changes the point estimate of another variable’s coefficient.
Suppose a numerical variable \(x_1\) has a coefficient of \(b_1\) = 2.5 in the multiple regression model. Suppose also that the first observation has \(x_1\) = 7.2, the second observation has a value of \(x_1\) = 8.2, and these two observations have the same values for all other predictors. Then the predicted value of the second observation will be 2.5 higher than the prediction of the first observation based on the multiple regression model.
True.
If a regression model’s first variable has a coefficient of \(b_1\) = 5.7, then if we are able to influence the data so that an observation will have its \(x_1\) be 1 larger than it would otherwise, the value \(y\) for this observation would increase by 5.7.
False. The statement assumes a causal, deterministic relationship between \(x_1\) and \(y\). The regression model describes an expected association with some uncertainty; it does not guarantee that intervening on \(x_1\) will change \(y\) by 5.7.
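The arithmetic behind the second statement can be written out as a short worked equation. This is a sketch assuming the usual additive form of the model; the symbol \(c\) is introduced here only for illustration and stands for the combined contribution of the intercept and all other predictors, which is the same for both observations.

```latex
\[
\hat{y}_{2} - \hat{y}_{1}
  = \bigl(c + b_1 \cdot 8.2\bigr) - \bigl(c + b_1 \cdot 7.2\bigr)
  = 2.5 \times (8.2 - 7.2)
  = 2.5
\]
```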
The following is a model for predicting baby weight from whether the mom is classified as a mature mom (35 years or older at the time of pregnancy). (ICPSR, 2014)
a. Write the equation of the regression model. Note that the variable mature takes two values, younger mom and older mom, and matureyounger mom indicates the effect of being a younger mom (i.e., older mom is the reference category).
\(\widehat{\text{baby weight}}\) = 7.354 - 0.185 × mature, where mature = 1 for a younger mom and mature = 0 for an older mom.
For a mature (older) mom: \(\widehat{\text{baby weight}}\) = 7.354 - 0.185(0) = 7.354. For a younger mom: \(\widehat{\text{baby weight}}\) = 7.354 - 0.185(1) = 7.169.
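The indicator coding used in this answer can be sketched in a few lines of R. The two coefficients are the ones quoted in the answer above; the variable names (`b0`, `b1`, `mature_younger`) are illustrative and not part of the original data set.

```r
# Coefficients quoted in the answer above
b0 <- 7.354   # intercept: prediction for the reference category (older mom)
b1 <- -0.185  # coefficient on matureyounger mom

# Indicator variable: 1 for a younger mom, 0 for an older mom
# (names are hypothetical labels used only for this illustration)
mature_younger <- c("older mom" = 0, "younger mom" = 1)

# Predicted baby weight for each category
predicted_weight <- b0 + b1 * mature_younger
print(predicted_weight)
```

Because the indicator is 0 for the reference category, the intercept alone is the prediction for older moms, and the coefficient is the difference between the two groups.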
A more realistic approach to modeling baby weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in weeks (weeks), mother’s age in years (mage), the sex of the baby (sex), smoking status of the mother (habit), and the number of hospital visits during pregnancy (visits). Below are three observations from this data set.
The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables presented above.
The National Health and Nutrition Examination Survey (NHANES) is a yearly survey conducted by the US Centers for Disease Control. This question uses the nhanes.samp.adult.500 dataset in the oibiostat package, which consists of information on a subset of 500 individuals ages 21 years and older from the larger NHANES dataset (This data is also in the file nhanes_samp_adult_500.Rda). Poverty is measured as a ratio of family income to poverty guidelines. Smaller numbers indicate more poverty, and ratios of 5 or larger were recorded as 5. Education is reported for individuals ages 20 years or older and indicates the highest level of education achieved: either 8th Grade, 9 - 11th Grade, High School, Some College, or College Grad. The variable HomeOwn records whether a participant rents or owns their home; the levels of the variable are Own, Rent, and Other. We’ll rename the Poverty variable Income so that smaller numbers indicate smaller Income. That is, the numerical values are in the order indicated by the variable label.
# load in the data
load("nhanes_samp_adult_500.Rda")
df <- nhanes.samp.adult.500
library(tidyverse)
df <- mutate(df, Income = Poverty)
# Choose just ONE of the following plots
# (1) base R
# with(df, plot(Income ~ Education))
# (2) tidyverse; notice warning not given using base R
df |>
  ggplot(aes(x = Education, y = Income)) + geom_boxplot()
Warning: Removed 38 rows containing non-finite outside the scale range
(`stat_boxplot()`).
There is a positive association between Income and Education: the higher the level of Education, the higher the Income tends to be.
lm1 <- lm(Income ~ Education, df)
summary(lm1)
Call:
lm(formula = Income ~ Education, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4903 -1.2003  0.0901  1.0497  2.7545

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               1.4555     0.2703   5.384 1.17e-07 ***
Education9 - 11th Grade   0.9931     0.3302   3.008 0.002776 **
EducationHigh School      1.0900     0.3113   3.501 0.000508 ***
EducationSome College     1.4943     0.2976   5.021 7.37e-07 ***
EducationCollege Grad     2.4948     0.2958   8.434 4.45e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.456 on 456 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared:  0.1977, Adjusted R-squared:  0.1906
F-statistic: 28.09 on 4 and 456 DF,  p-value: < 2.2e-16
aggregate(Income ~ Education, df, mean)
       Education   Income
1      8th Grade 1.455517
2 9 - 11th Grade 2.448644
3    High School 2.545506
4   Some College 2.949854
5   College Grad 3.950340
2. Education level overall is associated with Income. The F-statistic is 28.09 on 4 and 456 degrees of freedom with a p-value below 2.2e-16, which is strong evidence that mean Income differs across Education levels. The multiple R-squared is 0.1977, meaning that about 19.77% of the variation in Income is explained by Education level.
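With a single categorical predictor, the intercept is the mean of the reference level (8th Grade) and each other coefficient is the difference between that level's mean and the reference mean. A minimal sketch, using only the rounded coefficients printed in the summary above (the vector names are illustrative), shows that they reproduce the group means from aggregate() up to rounding:

```r
# Rounded coefficients copied from summary(lm1) above
intercept <- 1.4555  # mean Income for the reference level, 8th Grade
offsets <- c("9 - 11th Grade" = 0.9931,
             "High School"    = 1.0900,
             "Some College"   = 1.4943,
             "College Grad"   = 2.4948)

# Group mean = intercept + coefficient for that level (0 for the reference)
group_means <- c("8th Grade" = intercept, intercept + offsets)
print(round(group_means, 4))

# Group means reported by aggregate(Income ~ Education, df, mean) above
reported <- c(1.455517, 2.448644, 2.545506, 2.949854, 3.950340)
print(max(abs(group_means - reported)))  # small; only rounding error remains
```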
# Choose just ONE of the following plots
# (1) base R
# with(df, plot(Income ~ Education))
# (2) tidyverse; notice warning not given using base R
df |>
  ggplot(aes(x = Education, y = Income)) + geom_boxplot()
Warning: Removed 38 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Incomes tend to be higher for people who own their homes than for those who rent.
lm1 <- lm(Income ~ Education + HomeOwn, df)
summary(lm1)
Call:
lm(formula = Income ~ Education + HomeOwn, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-3.3974 -1.0462  0.0826  0.7826  3.2707

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)               1.8125     0.2565   7.067 5.99e-12 ***
Education9 - 11th Grade   0.9506     0.3087   3.080 0.002199 **
EducationHigh School      1.0885     0.2914   3.736 0.000211 ***
EducationSome College     1.5337     0.2782   5.513 5.91e-08 ***
EducationCollege Grad     2.4049     0.2767   8.691  < 2e-16 ***
HomeOwnRent              -1.1717     0.1441  -8.128 4.18e-15 ***
HomeOwnOther             -0.9792     0.4611  -2.123 0.034259 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.361 on 454 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared:  0.3023, Adjusted R-squared:  0.2931
F-statistic: 32.79 on 6 and 454 DF,  p-value: < 2.2e-16
aggregate(Income ~ Education + HomeOwn, df, mean)
        Education HomeOwn   Income
1       8th Grade     Own 1.624000
2  9 - 11th Grade     Own 2.979302
3     High School     Own 2.738710
4    Some College     Own 3.523111
5    College Grad     Own 4.116637
6       8th Grade    Rent 0.935000
7  9 - 11th Grade    Rent 1.072000
8     High School    Rent 2.101852
9    Some College    Rent 1.796047
10   College Grad    Rent 3.421613
11      8th Grade   Other 2.250000
12 9 - 11th Grade   Other 0.280000
13   Some College   Other 2.455000
14   College Grad   Other 3.150000
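This model is better than the model in part b: the adjusted R-squared rises from 0.1906 to 0.2931 and the residual standard error falls from 1.456 to 1.361. It shows that predicted Income is associated with both Education and home ownership. Holding Education fixed, renters have a predicted Income 1.17 lower than owners, and holding home ownership fixed, predicted Income is higher at each Education level than for 8th Grade, with the largest difference for College Grad. The model predicts Income only; it does not by itself describe how Education relates to the chance of owning rather than renting a home.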
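Because this model has no interaction term, each prediction is the intercept plus one Education effect plus one HomeOwn effect. The sketch below builds the full table of predictions from the rounded coefficients printed above (object names are illustrative). Unlike the one-predictor model, these additive predictions do not have to match the observed cell means from aggregate() exactly.

```r
# Rounded coefficients copied from the summary of Income ~ Education + HomeOwn
intercept <- 1.8125  # prediction for 8th Grade, Own (both reference levels)
educ_effect <- c("8th Grade"      = 0,
                 "9 - 11th Grade" = 0.9506,
                 "High School"    = 1.0885,
                 "Some College"   = 1.5337,
                 "College Grad"   = 2.4049)
home_effect <- c(Own = 0, Rent = -1.1717, Other = -0.9792)

# Additive model: prediction = intercept + Education effect + HomeOwn effect
pred <- intercept + outer(educ_effect, home_effect, "+")
print(round(pred, 4))

# Example: College Grad who rents, compared with the observed cell mean above
print(pred["College Grad", "Rent"])  # model prediction
print(3.421613)                      # observed mean from aggregate()
```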
The file vitamin_d.Rdata contains the relevant data from a study on the Vitamin D status among schoolchildren in Thailand. Exposure to sunlight allows the body to produce serum 25(OH)D, which is a marker of Vitamin D status. Vitamin D deficiency is defined as having a serum 25(OH)D level below 50 nmol/L. [1]
The following table provides a list of the variables in the dataset and their description.
Does the association between serum 25(OH)D level and age differ between males and females?
load("vitamin_d.Rdata")
df <- vitamin_d
p <- ggplot(df, aes(x=age, y=vit_d25, color=sex, shape=sex)) + geom_point()
p
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).
For males, the association appears weak (slightly positive at most) and roughly linear, with the points clustered in the middle and a few outliers. For females, the association is negative, the shape is less clearly linear, and there are more outliers.
p + geom_smooth(method=lm, se=FALSE)
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).
Yes, the slopes are consistent with the associations in part a.
vit_d25) and age differs by sex. Interpret the coefficients on the term age and on age:sexM. The coefficient on sexM is negative. Does this indicate higher or lower levels of vit_d25 for males? At what age is this comparison made?
lm2 <- lm(vit_d25 ~ age*sex, df)
summary(lm2)
Call:
lm(formula = vit_d25 ~ age * sex, data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-44.736  -8.428  -1.092   7.810  46.208

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  97.7709     4.7732  20.483  < 2e-16 ***
age          -3.0156     0.4774  -6.317 5.68e-10 ***
sexM        -16.2848     7.0740  -2.302   0.0217 *
age:sexM      2.9369     0.7054   4.163 3.66e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.57 on 525 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.2274, Adjusted R-squared:  0.223
F-statistic: 51.52 on 3 and 525 DF,  p-value: < 2.2e-16
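The coefficient on age (-3.0156) is the estimated change in vit_d25 per additional year of age for females, the reference level of sex. The coefficient on age:sexM (2.9369) is the difference in slopes between males and females, so the slope for males is -3.0156 + 2.9369 = -0.0787. The negative coefficient on sexM (-16.2848) is the estimated male-minus-female difference in vit_d25 at age 0, so it indicates lower levels for males only at that age, which is not a meaningful age for schoolchildren. The two fitted lines cross at about 16.2848 / 2.9369 = 5.5 years; above that age the model predicts higher levels for males.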
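The way the sexM and age:sexM coefficients combine can be checked with a few lines of arithmetic. This sketch uses only the rounded estimates printed in summary(lm2); the helper `diff_at()` and the other object names are hypothetical and introduced here for illustration.

```r
# Rounded estimates copied from summary(lm2) above
b <- c(intercept = 97.7709, age = -3.0156, sexM = -16.2848, age_sexM = 2.9369)

# Slope of vit_d25 on age within each sex
slope_female <- unname(b["age"])
slope_male   <- unname(b["age"] + b["age_sexM"])

# Estimated male-minus-female difference in vit_d25 at a given age
diff_at <- function(age) unname(b["sexM"] + b["age_sexM"] * age)

# Age at which the two fitted lines cross (difference equals zero)
crossing_age <- unname(-b["sexM"] / b["age_sexM"])

print(c(slope_female = slope_female, slope_male = slope_male))
print(c(at_age_0 = diff_at(0), at_age_8 = diff_at(8)))
print(crossing_age)
```

The difference at age 0 is just the sexM coefficient, which is why its sign alone says little about ages actually observed in the study.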
age centered at 8 years. What is the coefficient on sexM now? Interpret the value of this coefficient.
df <- mutate(df,
  age_c8 = age - 8) # center age at 8 years, so age_c8 = 0 corresponds to age 8
lm(vit_d25 ~ age_c8*sex, df) |>
  summary()
Centering only changes the age at which the two groups are compared, so the fitted lines are the same as in lm2. The coefficients on age_c8 and age_c8:sexM remain -3.0156 and 2.9369, and the coefficient on sexM becomes -16.2848 + 8(2.9369) = 7.2104. At age 8, males are estimated to have serum 25(OH)D levels about 7.2 nmol/L higher than females.
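As a sanity check on what centering age at 8 does in an interaction model, here is a sketch on simulated data. Everything in it is hypothetical: the data frame `sim`, the age range of 6 to 14, and the coefficients used to generate `y` are made up for illustration and are not the vitamin D data. The point is only that the centered variable must vary with age (age minus 8); then the slopes are unchanged and the sexM coefficient becomes the group difference at age 8.

```r
set.seed(1)

# Simulated (hypothetical) data, not the vitamin_d data set
n <- 200
sim <- data.frame(
  age = runif(n, 6, 14),
  sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$y <- 98 - 3 * sim$age +
  ifelse(sim$sex == "M", -16 + 3 * sim$age, 0) +
  rnorm(n, sd = 10)

# Centering: subtract 8 from each person's age
sim$age_c8 <- sim$age - 8

fit_raw <- lm(y ~ age * sex, data = sim)
fit_c8  <- lm(y ~ age_c8 * sex, data = sim)
b_raw <- coef(fit_raw)
b_c8  <- coef(fit_c8)

# Slopes are unchanged by centering
same_slope <- all.equal(unname(b_raw["age"]), unname(b_c8["age_c8"]))
same_inter <- all.equal(unname(b_raw["age:sexM"]), unname(b_c8["age_c8:sexM"]))

# The sexM coefficient moves from the difference at age 0 to the difference at age 8
shifted_sexM <- unname(b_raw["sexM"] + 8 * b_raw["age:sexM"])
same_sexM <- all.equal(unname(b_c8["sexM"]), shifted_sexM)

print(c(same_slope, same_inter, same_sexM))
```

If the centered column were a constant instead, it would be collinear with the intercept and its coefficients would be reported as NA.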
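plot(lm2, which=1)
The reason there are so many points above 80 on the fitted-value axis is that many individuals in the dataset have predicted vit_d25 near this value. The fitted slope for males is nearly zero (-3.0156 + 2.9369 = -0.0787), so the predictions for males barely change with age.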
[1] The full study is described in Houghton LA et al., 2014 PLoS One, doi: 10.1371/journal.pone.0104825.