BIOS 7022 - HW 5

Spring 2025 - Gloria Lewis

Reading: Chapter 11
Homework: E5.1-E5.3; R5.1, R5.2

Exercises

E5.1. [5 pts] Multiple regression fact checking

Determine which of the following statements are true and false. For each statement that is false, explain why it is false.

If predictors are collinear, then removing one variable will have no influence on the point estimate of another variable’s coefficient.
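False. Collinear predictors are correlated with each other, and each coefficient is estimated conditional on the other variables in the model, so removing one variable will generally change the point estimate of another variable’s coefficient.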
Suppose a numerical variable x has a coefficient of \(b_1\) = 2.5 in the multiple regression model. Suppose also that the first observation has \(x_1\) = 7.2, the second observation has a value of \(x_1\) = 8.2, and these two observations have the same values for all other predictors. Then the predicted value of the second observation will be 2.5 higher than the prediction of the first observation based on the multiple regression model. True
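If a regression model’s first variable has a coefficient of \(b_1\) = 5.7, then if we are able to influence the data so that an observation will have its \(x_1\) be 1 larger than it would otherwise, the value \(y\) for this observation would increase by 5.7. False. The coefficient describes how the expected value of \(y\) differs between observations whose \(x_1\) differs by 1, holding the other predictors fixed. It is an association estimated with uncertainty, not a deterministic or causal effect, so changing \(x_1\) for one observation is only expected to change \(y\) in this way if the data come from an experiment in which \(x_1\) was controlled.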
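A minimal simulation sketch of the first statement, using made-up data (the variables x1, x2, and y are hypothetical and not part of the homework): when two predictors are strongly correlated, dropping one changes the estimated coefficient of the other.

```r
# Hypothetical simulated data: x2 is built to be strongly correlated with x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

full    <- lm(y ~ x1 + x2)   # both collinear predictors
reduced <- lm(y ~ x1)        # x2 removed

# Coefficient of x1 in each model; in the reduced model x1 also absorbs
# the part of x2's effect that it is correlated with
b_full    <- unname(coef(full)["x1"])
b_reduced <- unname(coef(reduced)["x1"])
c(full = b_full, reduced = b_reduced)
```

The two estimates for x1 should differ noticeably, because x1 stands in for the omitted x2 in the reduced model.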

E5.2. [5 pts] Baby weights and mature moms

The following is a model for predicting baby weight from whether the mom is classified as a mature mom (35 years or older at the time of pregnancy). (ICPSR, 2014)

a. Write the equation of the regression model. Note that the variable mature takes two values, younger mom and older mom, and matureyounger mom indicates the effect of being a younger mom (i.e., older mom is the reference category).
\(\widehat{\text{baby weight}} = 7.354 - 0.185 \times \text{mature}_{\text{younger mom}}\), where \(\text{mature}_{\text{younger mom}}\) is 1 for a younger mom and 0 for an older mom.

Interpret the slope in this context, and calculate the predicted birth weight of babies born to mature and younger mothers.

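Slope: babies born to younger moms are predicted to weigh 0.185 units less, on average, than babies born to mature moms.
Mature mom: baby weight = 7.354 - 0.185(0) = 7.354. Younger mom: baby weight = 7.354 - 0.185(1) = 7.169.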
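The two predictions can be reproduced with a short sketch. The coefficient values are the ones used in the answer above; the object names (b0, b1, younger, pred) are chosen here for illustration.

```r
# Coefficients as used in the worked answer (intercept and matureyounger mom)
b0 <- 7.354
b1 <- -0.185

# Indicator: 1 for a younger mom, 0 for a mature mom (the reference level)
younger <- c(mature_mom = 0, younger_mom = 1)

# Predicted baby weight for each group
pred <- b0 + b1 * younger
pred
```

With a single 0/1 predictor, the intercept is the prediction for the reference group and the slope is the difference between the two group predictions.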

E5.3. [8 pts] Predicting baby weights.

A more realistic approach to modeling baby weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in weeks (weeks), mother’s age in years (mage), the sex of the baby (sex), smoking status of the mother (habit), and the number of hospital (visits) visits during pregnancy. Below are three observations from this data set.

The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables presented above.

Write the equation of the regression model that includes all of the variables.
weight= -3.82 + 0.26(weeks) + 0.02(mage) + 0.37(sexmale) + 0.02(visits) - 0.43(habitsmoker)
Interpret the slopes of weeks and habit in this context.
weeks (0.26): each additional week of pregnancy is associated with an increase of 0.26 units in predicted birth weight, holding all other variables constant. The p-value is < 0.0001, so the effect is statistically significant. habit (-0.43): babies born to mothers who smoke are predicted to weigh 0.43 units less than babies born to mothers who do not smoke, holding all other variables constant. The p-value is 0.0007, so the effect is statistically significant.
If we fit a model predicting baby weight from only habit (whether the mom smokes), we observe a difference in the slope coefficient for habit in this small model and the slope coefficient for habit in the larger model. Why might there be a difference?
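The coefficients differ because the larger model adjusts for weeks, mage, sex, and visits, so its habit coefficient is the difference associated with smoking among babies with the same values of those variables. In the smaller model nothing is held constant, so the habit coefficient also absorbs the effects of any omitted variables that are correlated with smoking status (for example, length of pregnancy), and smoking can appear to have a larger or smaller effect than it does after adjustment.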
Calculate the residual for the first observation in the data set.
weight = -3.82 + (0.26 × 37) + (0.02 × 34) + (0.37 × 1) + (0.02 × 14) + (-0.43 × 0) = 7.13. Residual = 6.96 - 7.13 = -0.17 units
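The same residual calculation as a small sketch. The coefficients are the rounded values from the equation above, and the observation values are the ones used in the worked answer; the vector names are chosen here for illustration.

```r
# Rounded coefficients from the fitted equation above
b <- c(intercept = -3.82, weeks = 0.26, mage = 0.02,
       sexmale = 0.37, visits = 0.02, habitsmoker = -0.43)

# First observation, as used in the worked answer:
# 37 weeks, mother aged 34, male baby, 14 visits, nonsmoker, weight 6.96
x <- c(intercept = 1, weeks = 37, mage = 34,
       sexmale = 1, visits = 14, habitsmoker = 0)
observed <- 6.96

predicted <- sum(b * x)            # fitted value
residual  <- observed - predicted  # observed minus predicted
round(c(predicted = predicted, residual = residual), 2)
```

A negative residual means the model overestimates the weight of this baby.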

R homework

R5.1 [25 pts] Poverty and Education

The National Health and Nutrition Examination Survey (NHANES) is a yearly survey conducted by the US Centers for Disease Control. This question uses the nhanes.samp.adult.500 dataset in the oibiostat package, which consists of information on a subset of 500 individuals ages 21 years and older from the larger NHANES dataset (This data is also in the file nhanes_samp_adult_500.Rda). Poverty is measured as a ratio of family income to poverty guidelines. Smaller numbers indicate more poverty, and ratios of 5 or larger were recorded as 5. Education is reported for individuals ages 20 years or older and indicates the highest level of education achieved: either 8th Grade, 9 - 11th Grade, High School, Some College, or College Grad. The variable HomeOwn records whether a participant rents or owns their home; the levels of the variable are Own, Rent, and Other. We’ll rename the Poverty variable Income so that smaller numbers indicate smaller Income. That is, the numerical values are in the order indicated by the variable label.

Code

# load in the data
load("nhanes_samp_adult_500.Rda")
df <- nhanes.samp.adult.500
library(tidyverse)
df <- mutate(df, Income = Poverty)

[5 pts] Create a plot showing the association between Income and educational level. Describe what you see.

Code

# Choose just ONE of the following plots
# (1) base R
# with(df, plot(Income ~ Education))
# (2) tidyverse; notice warning not given using base R
df |> 
  ggplot(aes(x = Education, y = Income)) + geom_boxplot()

Warning: Removed 38 rows containing non-finite outside the scale range
(`stat_boxplot()`).

There is a positive association between Income and Education: the higher the level of Education, the higher Income tends to be.

[10 pts] Fit a linear model to predict Income from educational level.
1. Interpret the model coefficients and associated p-values.
2. Assess whether educational level, overall, is associated with Income. Be sure to include any relevant numerical evidence as part of your answer.
  1. Intercept: the estimated mean Income for individuals with an 8th Grade education is 1.46 (p < 0.001 for the test that this mean is zero). Each remaining coefficient is the estimated difference in mean Income between that education level and 8th Grade: 9 - 11th Grade, 0.99 higher (p = 0.0028); High School, 1.09 higher (p = 0.0005); Some College, 1.49 higher (p = 7.4e-07); College Grad, 2.49 higher (p = 4.5e-16). All four differences are statistically significant at the 0.05 level.

Code

lm1 <- lm(Income ~ Education, df)
summary(lm1)


Call:
lm(formula = Income ~ Education, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4903 -1.2003  0.0901  1.0497  2.7545 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.4555     0.2703   5.384 1.17e-07 ***
Education9 - 11th Grade   0.9931     0.3302   3.008 0.002776 ** 
EducationHigh School      1.0900     0.3113   3.501 0.000508 ***
EducationSome College     1.4943     0.2976   5.021 7.37e-07 ***
EducationCollege Grad     2.4948     0.2958   8.434 4.45e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.456 on 456 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared:  0.1977,    Adjusted R-squared:  0.1906 
F-statistic: 28.09 on 4 and 456 DF,  p-value: < 2.2e-16

Code

aggregate(Income ~ Education, df, mean)

       Education   Income
1      8th Grade 1.455517
2 9 - 11th Grade 2.448644
3    High School 2.545506
4   Some College 2.949854
5   College Grad 3.950340

2. Education level overall is associated with Income. The overall F-statistic is 28.09 on 4 and 456 degrees of freedom with p-value < 2.2e-16, so there is strong evidence that mean Income differs across at least some education levels. The multiple R-squared is 0.1977, meaning education level explains about 19.8% of the variation in Income.
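In the output above, the intercept (1.4555) equals the 8th Grade mean from aggregate(), and 1.4555 + 0.9931 = 2.4486 matches the 9 - 11th Grade mean. The sketch below shows the same relationship on a small made-up data set (the toy values are hypothetical, not NHANES data), together with the overall F-test.

```r
# Hypothetical toy data: three education groups, four people each
toy <- data.frame(
  Education = factor(rep(c("8th Grade", "High School", "College Grad"), each = 4),
                     levels = c("8th Grade", "High School", "College Grad")),
  Income = c(1.0, 1.5, 2.0, 1.5,   2.0, 3.0, 2.5, 2.5,   4.0, 3.5, 4.5, 4.0)
)

fit   <- lm(Income ~ Education, data = toy)
means <- tapply(toy$Income, toy$Education, mean)

# Intercept = mean of the reference level; other coefficients = differences
# between each level's mean and the reference mean
coef(fit)
means - means[1]

# Overall F-test: does any group mean differ from the others?
anova(fit)
```

With a single categorical predictor, the t-tests compare each level to the reference level, while the F-test assesses the predictor as a whole.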

[5 pts] Create a plot showing the association between poverty and home ownership. Compare the median incomes for those who own a home to those who rent.

Code

# Choose just ONE of the following plots
# (1) base R
# with(df, plot(Income ~ HomeOwn))
# (2) tidyverse; notice warning not given using base R
df |> 
  ggplot(aes(x = HomeOwn, y = Income)) + geom_boxplot()

Warning: Removed 38 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Income tends to be higher for people who own their home than for people who rent: the median Income for owners is higher than the median Income for renters.
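One way to put numbers on the comparison of medians is to compute them by group. This is a sketch on made-up values (the toy data frame is hypothetical and stands in for df).

```r
# Hypothetical toy data standing in for df (not the NHANES values)
toy <- data.frame(
  HomeOwn = factor(c("Own", "Own", "Own", "Rent", "Rent", "Rent", "Other"),
                   levels = c("Own", "Rent", "Other")),
  Income  = c(2.0, 4.1, 5.0, 0.8, NA, 1.9, 1.2)
)

# Median Income by home-ownership group; the formula interface drops
# rows with missing Income by default
med <- aggregate(Income ~ HomeOwn, data = toy, FUN = median)
med
```

With the homework data, the analogous call would be aggregate(Income ~ HomeOwn, df, median).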

[5 pts] Fit a linear model to predict Income from educational level and home ownership. Comment on whether this model is an improvement from the model in part b.

Code

lm1 <- lm(Income ~ Education + HomeOwn, df)
summary(lm1)


Call:
lm(formula = Income ~ Education + HomeOwn, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3974 -1.0462  0.0826  0.7826  3.2707 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)               1.8125     0.2565   7.067 5.99e-12 ***
Education9 - 11th Grade   0.9506     0.3087   3.080 0.002199 ** 
EducationHigh School      1.0885     0.2914   3.736 0.000211 ***
EducationSome College     1.5337     0.2782   5.513 5.91e-08 ***
EducationCollege Grad     2.4049     0.2767   8.691  < 2e-16 ***
HomeOwnRent              -1.1717     0.1441  -8.128 4.18e-15 ***
HomeOwnOther             -0.9792     0.4611  -2.123 0.034259 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.361 on 454 degrees of freedom
  (39 observations deleted due to missingness)
Multiple R-squared:  0.3023,    Adjusted R-squared:  0.2931 
F-statistic: 32.79 on 6 and 454 DF,  p-value: < 2.2e-16

Code

aggregate(Income ~ Education + HomeOwn, df, mean)

        Education HomeOwn   Income
1       8th Grade     Own 1.624000
2  9 - 11th Grade     Own 2.979302
3     High School     Own 2.738710
4    Some College     Own 3.523111
5    College Grad     Own 4.116637
6       8th Grade    Rent 0.935000
7  9 - 11th Grade    Rent 1.072000
8     High School    Rent 2.101852
9    Some College    Rent 1.796047
10   College Grad    Rent 3.421613
11      8th Grade   Other 2.250000
12 9 - 11th Grade   Other 0.280000
13   Some College   Other 2.455000
14   College Grad   Other 3.150000

This model is better than the model in part b. Adjusted R-squared increases from 0.1906 to 0.2931 and the residual standard error drops from 1.456 to 1.361, so adding home ownership explains more of the variation in Income. Holding education level fixed, renters have a predicted Income 1.17 lower than owners (p = 4.2e-15), and the education coefficients remain statistically significant. The group means show the same pattern: at every education level, owners have a higher mean Income than renters.
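Because the education-only model is nested in the larger model, the two can also be compared with a partial F-test. The sketch below uses simulated data with hypothetical variable names (educ, home, income), not the NHANES sample.

```r
# Hypothetical simulated data: two categorical predictors with real effects
set.seed(2)
n <- 300
educ <- factor(sample(c("low", "mid", "high"), n, replace = TRUE),
               levels = c("low", "mid", "high"))
home <- factor(sample(c("Own", "Rent"), n, replace = TRUE))
income <- 2 + 0.8 * (educ == "mid") + 1.6 * (educ == "high") -
  1.0 * (home == "Rent") + rnorm(n)
sim <- data.frame(income, educ, home)

small <- lm(income ~ educ, data = sim)
large <- lm(income ~ educ + home, data = sim)

# Partial F-test for the added home-ownership terms
cmp <- anova(small, large)
cmp

# Adjusted R-squared penalizes the extra parameters
adj <- c(small = summary(small)$adj.r.squared,
         large = summary(large)$adj.r.squared)
adj
```

anova() requires both models to be fit to the same rows. In the homework output both fits report 39 observations deleted due to missingness, which is consistent with that requirement.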

R5.2 [27 pts] Vitamin D

The file vitamin_d.Rdata contains the relevant data from a study on the Vitamin D status among schoolchildren in Thailand. Exposure to sunlight allows the body to produce serum 25(OH)D, which is a marker of Vitamin D status. Vitamin D deficiency is defined as having a serum 25(OH)D level below 50 nmol/L.¹
The following table provides a list of the variables in the dataset and their description.

Does the association between serum 25(OH)D level and age differ between males and females?

[5 pts] Construct a single plot that visualizes the relationship between vit_d25 (response variable) and both age and sex. Describe the association and shape for males. Do the same for females.

Code

load("vitamin_d.Rdata")
df <- vitamin_d
p <- ggplot(df, aes(x=age, y=vit_d25, color=sex, shape=sex)) + geom_point() 
p

Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).

For males, the association between age and vit_d25 is weak: the points form a roughly horizontal band with a few outliers. For females, the association is negative, with vit_d25 tending to decrease as age increases; the pattern is less clearly linear and there are more outliers.

[2 pts] Add the linear model to describe the relationship in your plot above. Are the slopes consistent with your answers for the association in part a)?

Code

p + geom_smooth(method=lm, se=FALSE)

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 8 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 8 rows containing missing values or values outside the scale range
(`geom_point()`).

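The fitted line for females slopes downward, which is consistent with a negative association for females. The fitted line for males is nearly flat, so the association for males is weak at most.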

[10 pts] Fit a linear model to investigate whether the association between serum 25(OH)D level (vit_d25) and age differs by sex. Interpret the coefficients on the term age and on age:sexM. The coefficient on sexM is negative. Does this indicate higher or lower levels of vit_d25 for males? At what age is this comparison made?

Code

lm2 <- lm(vit_d25 ~ age*sex, df)
summary(lm2)


Call:
lm(formula = vit_d25 ~ age * sex, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.736  -8.428  -1.092   7.810  46.208 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  97.7709     4.7732  20.483  < 2e-16 ***
age          -3.0156     0.4774  -6.317 5.68e-10 ***
sexM        -16.2848     7.0740  -2.302   0.0217 *  
age:sexM      2.9369     0.7054   4.163 3.66e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.57 on 525 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared:  0.2274,    Adjusted R-squared:  0.223 
F-statistic: 51.52 on 3 and 525 DF,  p-value: < 2.2e-16

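The coefficient on age (-3.02) is the slope for females, the reference group: each additional year of age is associated with a 3.02 nmol/L lower predicted vit_d25. The coefficient on age:sexM (2.94) is the difference between the male and female slopes, so the slope for males is -3.0156 + 2.9369 = -0.08 nmol/L per year, which is nearly flat. The negative coefficient on sexM (-16.28) indicates lower predicted vit_d25 for males than for females, but this comparison is made at age 0, which is outside the school-age range of the data. The two fitted lines cross at about 16.2848 / 2.9369 = 5.5 years; above that age, males have the higher predicted vit_d25.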
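The sex-specific slopes and the male-minus-female gap at any age follow directly from the four estimates in the summary(lm2) output. A small sketch, with object names (b, sex_gap, crossing_age) chosen here for illustration:

```r
# Estimates copied from the summary(lm2) output above
b <- c(intercept = 97.7709, age = -3.0156, sexM = -16.2848, age_sexM = 2.9369)

slope_female <- unname(b["age"])                  # reference group
slope_male   <- unname(b["age"] + b["age_sexM"])  # age slope plus interaction

# Predicted male-minus-female difference in vit_d25 at a given age
sex_gap <- function(age) unname(b["sexM"] + b["age_sexM"] * age)

crossing_age <- unname(-b["sexM"] / b["age_sexM"])  # age where the gap is zero

c(slope_female = slope_female, slope_male = slope_male,
  gap_at_0 = sex_gap(0), gap_at_8 = sex_gap(8), crossing_age = crossing_age)
```

The gap at age 0 is the sexM coefficient itself, which is why the sign of sexM alone does not describe the comparison at ages that actually occur in the data.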

[5 pts] Refit the above model using age centered at 8 years. What is the coefficient on sexM now? Interpret the value of this coefficient.

Code

df <- mutate(df, 
             age_c8 = age - 8) # age centered at 8 years
lm(vit_d25 ~ age_c8*sex, df) |> 
  summary()

Centering only shifts where age equals zero, so this model has the same fitted values as lm2 and its coefficients follow from the lm2 estimates:

(Intercept) = 97.7709 - 3.0156(8) = 73.65
age_c8 = -3.0156
sexM = -16.2848 + 2.9369(8) = 7.21
age_c8:sexM = 2.9369

The coefficient on sexM is now about 7.21: at age 8, males have a predicted vit_d25 about 7.2 nmol/L higher than females of the same age.
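A sketch of what centering does in an interaction model, using simulated data (the data frame sim and its values are hypothetical, not the study data). The centered variable must be computed from the age column itself, here as age - 8.

```r
# Hypothetical simulated data (not the study data)
set.seed(3)
n   <- 120
age <- runif(n, 5, 14)
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
y   <- 95 - 3 * age + (sex == "M") * (-15 + 2.8 * age) + rnorm(n, sd = 5)
sim <- data.frame(y, age, sex, age_c8 = age - 8)

raw      <- lm(y ~ age * sex, data = sim)
centered <- lm(y ~ age_c8 * sex, data = sim)

# Same fit, different parameterization: slopes are unchanged, and the sexM
# coefficient moves from the gap at age 0 to the gap at age 8
gap_at_8 <- unname(coef(raw)["sexM"] + 8 * coef(raw)["age:sexM"])
c(sexM_raw = unname(coef(raw)["sexM"]),
  sexM_centered = unname(coef(centered)["sexM"]),
  gap_at_8_from_raw = gap_at_8)
```

If the centered variable were a constant instead of a shifted copy of age, it would be perfectly collinear with the intercept and lm() would report its coefficients as NA.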

[5 pts] For the model from part a), plot the residuals versus the fitted values. Why are there so many points with a fitted value above 80?

Code

plot(lm2, which=1)

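There are so many points with fitted values just above 80 because these are the males. The fitted line for males is (97.7709 - 16.2848) + (-3.0156 + 2.9369) age = 81.49 - 0.08 age, which is nearly flat, so every male has a fitted value slightly above 80 regardless of age (the line stays above 80 for any age below about 19). The fitted values for females change by about 3 nmol/L per year of age, so they are spread over a much wider range.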

Footnotes

The full study is described in Houghton LA et al., 2014, PLoS One, doi: 10.1371/journal.pone.0104825.