140.622.02_part2_Modeling Non-linear Relationships with MLR

Modelling Non-linear Relationships with MLR

1. Plot weight against age and calcaulate the sample mean weight

Plot weight against age (use small plotting symbols, e.g. dots, and jitter the points so they can all be seen).and calculate the sample mean weight for each month of age and add the estimated means for each month to the graph (with bold symbols and a connecting line) to highlight the trend. To compute mean weight for each month of age will work because age takes on 13 unique values (the integers 0-12): if age was truly continuous, the following command would fail to work. (Scientific digression: this pattern actually represents the convolution of two biological processes: growth and seasonality. For simplicity, below we will refer to it as “growth.”)

# import the dataset
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

nepalA = read_csv("nepal_anthro.csv")

## Rows: 4386 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): id, age, sex, weight, height, armcirc
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# omit children with ages greater than 12 months.
# also omit all children for whom height, weight or arm circumference is missing.
nepalData = nepalA %>% ### store this dataset under a new name
  filter(age<=12) %>% ### keep only children with age <= 12
  filter(!is.na(height), !is.na(weight), !is.na(armcirc)) ### remove NAs

nepalData = nepalData %>%
  mutate(gender=recode_factor(sex, '1'="Male", '2'="Female"))

# Plot weight against age (use small plotting symbols, e.g. dots, and jitter the points so they can all be seen
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
      ylab="Weight in kg", ylim=c(0,12))

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Calculate the sample mean weight for each month of age and add the estimated means
#To compute mean weight for each month of age will work because age takes on 13 unique values (the integers 0-12)
nepalData %>% group_by(age) %>%summarize(mean=mean(weight))

## # A tibble: 13 × 2
##      age  mean
##    <dbl> <dbl>
##  1     0  2.87
##  2     1  3.26
##  3     2  4.27
##  4     3  4.94
##  5     4  5.47
##  6     5  5.83
##  7     6  5.97
##  8     7  6.25
##  9     8  6.71
## 10     9  6.76
## 11    10  7.03
## 12    11  7.04
## 13    12  6.99

#To add means to previous graph:
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
      ylab="Weight in kg", ylim=c(0,12)) +
  stat_summary(aes(x=age, y=weight), fun.y=mean, geom="line", lwd=2, color="red")

## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

2. Use simple linear regression to regress weight on age and add the least squares line to the plot

The linear regression of weight on age is, E[weight] = 3.67 + 0.33*age

#Use simple linear regression to regress weight on age and add the least squares line to the plot
model4 = lm(weight ~ age, data=nepalData)
summary(model4)

## 
## Call:
## lm(formula = weight ~ age, data = nepalData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4470 -0.7018 -0.0365  0.6946  3.3449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.67920    0.07235   50.85   <2e-16 ***
## age          0.33065    0.01000   33.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.1 on 896 degrees of freedom
## Multiple R-squared:  0.5495, Adjusted R-squared:  0.549 
## F-statistic:  1093 on 1 and 896 DF,  p-value: < 2.2e-16

confint(model4)

##                 2.5 %    97.5 %
## (Intercept) 3.5372042 3.8211918
## age         0.3110189 0.3502817

qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
      ylab="Weight in kg", ylim=c(0,12)) + geom_smooth(method="lm")

## `geom_smooth()` using formula = 'y ~ x'

#OR, a more general way to plot “fitted values” on a plot:
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
      ylab="Weight in kg", ylim=c(0,12)) +
  geom_line(aes(x = age, y=model4$fitted.values), color="red", lwd=2)

3.Calculate the residuals from the simple linear regression

Calculate the residuals from the simple linear regression above and plot them versus age. Add a smooth function using lowess smoothing.

Residuals = observed y – (3.67 + 0.33*age). We observe a curve-linear relationship in the residual analysis, which suggests that the assumption of linearity does not hold.

# calculate residuals from SLR with loess smoother
qplot(y=model4$residuals, x=jitter(age), data=nepalData, ylab="Residuals",
      xlab="Age in months") +
  geom_smooth(method="loess", se=FALSE) + #loess smoother
  geom_hline(yintercept=0, color="red") #horizontal y=0 line

## `geom_smooth()` using formula = 'y ~ x'

4. Use linear regression to regress weight onto the monthly mean ages

# Use linear regression to regress weight onto the monthly mean ages.
model5 = lm(weight ~ as.factor(age), data=nepalData)
summary(model5)

## 
## Call:
## lm(formula = weight ~ as.factor(age), data = nepalData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7899 -0.6267  0.0279  0.6405  3.2405 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.8667     0.1965  14.589  < 2e-16 ***
## as.factor(age)1    0.3928     0.2276   1.726   0.0847 .  
## as.factor(age)2    1.4007     0.2235   6.268 5.71e-10 ***
## as.factor(age)3    2.0727     0.2333   8.886  < 2e-16 ***
## as.factor(age)4    2.6054     0.2323  11.218  < 2e-16 ***
## as.factor(age)5    2.9600     0.2292  12.917  < 2e-16 ***
## as.factor(age)6    3.1075     0.2379  13.063  < 2e-16 ***
## as.factor(age)7    3.3869     0.2309  14.671  < 2e-16 ***
## as.factor(age)8    3.8397     0.2349  16.349  < 2e-16 ***
## as.factor(age)9    3.8928     0.2296  16.957  < 2e-16 ***
## as.factor(age)10   4.1599     0.2276  18.276  < 2e-16 ***
## as.factor(age)11   4.1751     0.2327  17.939  < 2e-16 ***
## as.factor(age)12   4.1232     0.2276  18.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.021 on 885 degrees of freedom
## Multiple R-squared:  0.6163, Adjusted R-squared:  0.6111 
## F-statistic: 118.4 on 12 and 885 DF,  p-value: < 2.2e-16

confint(model5)

##                        2.5 %    97.5 %
## (Intercept)       2.48101330 3.2523201
## as.factor(age)1  -0.05389422 0.8395482
## as.factor(age)2   0.96211642 1.8393328
## as.factor(age)3   1.61493685 2.5305176
## as.factor(age)4   2.14956045 3.0612238
## as.factor(age)5   2.51025469 3.4097453
## as.factor(age)6   2.64060483 3.5743377
## as.factor(age)7   2.93376802 3.8399408
## as.factor(age)8   3.37873854 4.3006265
## as.factor(age)9   3.44224382 4.3433417
## as.factor(age)10  3.71319438 4.6066368
## as.factor(age)11  3.71832689 4.6319218
## as.factor(age)12  3.67648551 4.5699279

5. Create linear splines

Let our linear spline regression model : E(weight) = b0 + b1 X + b2 (X – knot1)+ + b3 (X – knot2)+ + b4 (X – knot3)+ + e

b0 : the mean weight when age 0.
b1, coefficients for age: change in mean weight per month when age <= sp1
b2, coefficients for age_sp1: difference in the slope for ‘age <= sp1’ versus ‘sp1 < age <= sp2’
b3, coefficients for age_sp2: difference in the slope for ‘sp1 < age <= sp2’ versus ‘sp2 < age <= sp3’
b4, coefficients for age_sp3: difference in the slope for ‘sp2 < age <= sp3’ versus ‘sp3 < age’

# Create three new variables:
nepalData = nepalData %>%
  mutate(age_sp1 = ifelse(age > 2, age-2, 0)) %>%
  mutate(age_sp2 = ifelse(age > 4, age-4, 0)) %>%
  mutate(age_sp3 = ifelse(age > 6, age-6, 0))

#Regress weight on age, age_sp1, age_sp2 and age_sp3.
model6 = lm(weight ~ age + age_sp1 + age_sp2 + age_sp3, data=nepalData)
summary(model6)

## 
## Call:
## lm(formula = weight ~ age + age_sp1 + age_sp2 + age_sp3, data = nepalData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9965 -0.6318  0.0424  0.6657  3.3202 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.62533    0.15516  16.920  < 2e-16 ***
## age          0.79912    0.10076   7.931 6.49e-15 ***
## age_sp1     -0.15861    0.15232  -1.041   0.2981    
## age_sp2     -0.31121    0.13191  -2.359   0.0185 *  
## age_sp3     -0.15709    0.08571  -1.833   0.0672 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 893 degrees of freedom
## Multiple R-squared:  0.6097, Adjusted R-squared:  0.608 
## F-statistic: 348.8 on 4 and 893 DF,  p-value: < 2.2e-16

confint(model6)

##                  2.5 %      97.5 %
## (Intercept)  2.3208152  2.92985326
## age          0.6013622  0.99687958
## age_sp1     -0.4575615  0.14035130
## age_sp2     -0.5700990 -0.05232632
## age_sp3     -0.3253102  0.01112148

# Get the predicted values from this regression
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
      ylab="Weight in kg", ylim=c(0,12)) +
  geom_line(aes(x = age, y=model6$fitted.values), color="red", lwd=2)

6. Test the linear relationship against the alternative defined by the linear spline fit by using F-test.

Use an F-test with 3 degrees of freedom in the numerator to test the null hypothesis of a linear relationship against the alternative defined by the linear spline fit in the steps above.

The null hypothesis of F-test is ‘H0: b2 = b3 = b4’. According to the result of the test, we can reject the null hypothesis and conclude there is statistically significant difference in the slope of each age strata (Fobs = 45.984).

To address the question of whether the difference in growth rate per month is changed as the age increasing, we added splines in the original linear regression model. The result of the F test indicates that there is statistically significant difference in the slopes between the age strata. By adding splines in the original linear regression model, we can get more improved fitted model.

7. Comment on the quality of each of the 3 models using the plots

Based on the AIC, linear splines model (Model 6) has the lowest AIC that approacimates the lowest predicted residual sum of square(PRESS) following by model with monthly means (Model 5) and linear model (Model 4). We can find the Model 5 has the lowest residual variance among three models. As a result, we can use linear splines model or model with monthly means to predict the mean of weight precisely, and conclude that the growth rate (the change in mean of weight per unit change in age) decreases. When the model has smaller AIC and smaller residual variance (residual mean squares), the quality of model increases.

#Use an F-test with 3 degrees of freedom in the numerator to test the null hypothesis 
anova(model4, model6)

## Analysis of Variance Table
## 
## Model 1: weight ~ age
## Model 2: weight ~ age + age_sp1 + age_sp2 + age_sp3
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1    896 1083.20                                  
## 2    893  938.26  3    144.94 45.984 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model4)

## Analysis of Variance Table
## 
## Response: weight
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## age         1 1321.0 1321.02  1092.7 < 2.2e-16 ***
## Residuals 896 1083.2    1.21                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model5)

## Analysis of Variance Table
## 
## Response: weight
##                 Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(age)  12 1481.62 123.468  118.44 < 2.2e-16 ***
## Residuals      885  922.61   1.042                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model6)

## Analysis of Variance Table
## 
## Response: weight
##            Df  Sum Sq Mean Sq   F value    Pr(>F)    
## age         1 1321.02 1321.02 1257.2927 < 2.2e-16 ***
## age_sp1     1   88.13   88.13   83.8823 < 2.2e-16 ***
## age_sp2     1   53.28   53.28   50.7102 2.202e-12 ***
## age_sp3     1    3.53    3.53    3.3594   0.06716 .  
## Residuals 893  938.26    1.05                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#using glm commend, make the models again.
model4_glm <- glm(weight ~ age, data = nepalData, family = gaussian)
log_likelihood_model4 <- logLik(model4_glm)
log_likelihood_model4

## 'log Lik.' -1358.399 (df=3)

model5_glm <- glm(weight ~ as.factor(age), data = nepalData, family = gaussian)
log_likelihood_model5 <- logLik(model5_glm)
log_likelihood_model5

## 'log Lik.' -1286.344 (df=14)

model6_glm <- glm(weight ~ age + age_sp1 + age_sp2 + age_sp3, data = nepalData, family = gaussian)
log_likelihood_model6 <- logLik(model6_glm)
log_likelihood_model6

## 'log Lik.' -1293.899 (df=6)

##8
AIC(model4, model5, model6)

##        df      AIC
## model4  3 2722.797
## model5 14 2600.689
## model6  6 2599.798

AIC(model4_glm, model5_glm, model6_glm)

##            df      AIC
## model4_glm  3 2722.797
## model5_glm 14 2600.689
## model6_glm  6 2599.798