Plot weight against age (use small plotting symbols, e.g. dots, and jitter the points so they can all be seen).and calculate the sample mean weight for each month of age and add the estimated means for each month to the graph (with bold symbols and a connecting line) to highlight the trend. To compute mean weight for each month of age will work because age takes on 13 unique values (the integers 0-12): if age was truly continuous, the following command would fail to work. (Scientific digression: this pattern actually represents the convolution of two biological processes: growth and seasonality. For simplicity, below we will refer to it as “growth.”)
# import the dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nepalA = read_csv("nepal_anthro.csv")
## Rows: 4386 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): id, age, sex, weight, height, armcirc
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# omit children with ages greater than 12 months.
# also omit all children for whom height, weight or arm circumference is missing.
nepalData = nepalA %>% ### store this dataset under a new name
filter(age<=12) %>% ### keep only children with age <= 12
filter(!is.na(height), !is.na(weight), !is.na(armcirc)) ### remove NAs
nepalData = nepalData %>%
mutate(gender=recode_factor(sex, '1'="Male", '2'="Female"))
# Plot weight against age (use small plotting symbols, e.g. dots, and jitter the points so they can all be seen
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
ylab="Weight in kg", ylim=c(0,12))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Calculate the sample mean weight for each month of age and add the estimated means
#To compute mean weight for each month of age will work because age takes on 13 unique values (the integers 0-12)
nepalData %>% group_by(age) %>%summarize(mean=mean(weight))
## # A tibble: 13 × 2
## age mean
## <dbl> <dbl>
## 1 0 2.87
## 2 1 3.26
## 3 2 4.27
## 4 3 4.94
## 5 4 5.47
## 6 5 5.83
## 7 6 5.97
## 8 7 6.25
## 9 8 6.71
## 10 9 6.76
## 11 10 7.03
## 12 11 7.04
## 13 12 6.99
#To add means to previous graph:
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
ylab="Weight in kg", ylim=c(0,12)) +
stat_summary(aes(x=age, y=weight), fun.y=mean, geom="line", lwd=2, color="red")
## Warning: The `fun.y` argument of `stat_summary()` is deprecated as of ggplot2 3.3.0.
## ℹ Please use the `fun` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The linear regression of weight on age is, E[weight] = 3.67 + 0.33*age
#Use simple linear regression to regress weight on age and add the least squares line to the plot
model4 = lm(weight ~ age, data=nepalData)
summary(model4)
##
## Call:
## lm(formula = weight ~ age, data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4470 -0.7018 -0.0365 0.6946 3.3449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.67920 0.07235 50.85 <2e-16 ***
## age 0.33065 0.01000 33.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.1 on 896 degrees of freedom
## Multiple R-squared: 0.5495, Adjusted R-squared: 0.549
## F-statistic: 1093 on 1 and 896 DF, p-value: < 2.2e-16
confint(model4)
## 2.5 % 97.5 %
## (Intercept) 3.5372042 3.8211918
## age 0.3110189 0.3502817
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
ylab="Weight in kg", ylim=c(0,12)) + geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'
#OR, a more general way to plot “fitted values” on a plot:
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
ylab="Weight in kg", ylim=c(0,12)) +
geom_line(aes(x = age, y=model4$fitted.values), color="red", lwd=2)
Calculate the residuals from the simple linear regression above and plot them versus age. Add a smooth function using lowess smoothing.
Residuals = observed y – (3.67 + 0.33*age). We observe a curve-linear relationship in the residual analysis, which suggests that the assumption of linearity does not hold.
# calculate residuals from SLR with loess smoother
qplot(y=model4$residuals, x=jitter(age), data=nepalData, ylab="Residuals",
xlab="Age in months") +
geom_smooth(method="loess", se=FALSE) + #loess smoother
geom_hline(yintercept=0, color="red") #horizontal y=0 line
## `geom_smooth()` using formula = 'y ~ x'
# Use linear regression to regress weight onto the monthly mean ages.
model5 = lm(weight ~ as.factor(age), data=nepalData)
summary(model5)
##
## Call:
## lm(formula = weight ~ as.factor(age), data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7899 -0.6267 0.0279 0.6405 3.2405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8667 0.1965 14.589 < 2e-16 ***
## as.factor(age)1 0.3928 0.2276 1.726 0.0847 .
## as.factor(age)2 1.4007 0.2235 6.268 5.71e-10 ***
## as.factor(age)3 2.0727 0.2333 8.886 < 2e-16 ***
## as.factor(age)4 2.6054 0.2323 11.218 < 2e-16 ***
## as.factor(age)5 2.9600 0.2292 12.917 < 2e-16 ***
## as.factor(age)6 3.1075 0.2379 13.063 < 2e-16 ***
## as.factor(age)7 3.3869 0.2309 14.671 < 2e-16 ***
## as.factor(age)8 3.8397 0.2349 16.349 < 2e-16 ***
## as.factor(age)9 3.8928 0.2296 16.957 < 2e-16 ***
## as.factor(age)10 4.1599 0.2276 18.276 < 2e-16 ***
## as.factor(age)11 4.1751 0.2327 17.939 < 2e-16 ***
## as.factor(age)12 4.1232 0.2276 18.115 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.021 on 885 degrees of freedom
## Multiple R-squared: 0.6163, Adjusted R-squared: 0.6111
## F-statistic: 118.4 on 12 and 885 DF, p-value: < 2.2e-16
confint(model5)
## 2.5 % 97.5 %
## (Intercept) 2.48101330 3.2523201
## as.factor(age)1 -0.05389422 0.8395482
## as.factor(age)2 0.96211642 1.8393328
## as.factor(age)3 1.61493685 2.5305176
## as.factor(age)4 2.14956045 3.0612238
## as.factor(age)5 2.51025469 3.4097453
## as.factor(age)6 2.64060483 3.5743377
## as.factor(age)7 2.93376802 3.8399408
## as.factor(age)8 3.37873854 4.3006265
## as.factor(age)9 3.44224382 4.3433417
## as.factor(age)10 3.71319438 4.6066368
## as.factor(age)11 3.71832689 4.6319218
## as.factor(age)12 3.67648551 4.5699279
Let our linear spline regression model : E(weight) = b0 + b1 X + b2 (X – knot1)+ + b3 (X – knot2)+ + b4 (X – knot3)+ + e
# Create three new variables:
nepalData = nepalData %>%
mutate(age_sp1 = ifelse(age > 2, age-2, 0)) %>%
mutate(age_sp2 = ifelse(age > 4, age-4, 0)) %>%
mutate(age_sp3 = ifelse(age > 6, age-6, 0))
#Regress weight on age, age_sp1, age_sp2 and age_sp3.
model6 = lm(weight ~ age + age_sp1 + age_sp2 + age_sp3, data=nepalData)
summary(model6)
##
## Call:
## lm(formula = weight ~ age + age_sp1 + age_sp2 + age_sp3, data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9965 -0.6318 0.0424 0.6657 3.3202
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.62533 0.15516 16.920 < 2e-16 ***
## age 0.79912 0.10076 7.931 6.49e-15 ***
## age_sp1 -0.15861 0.15232 -1.041 0.2981
## age_sp2 -0.31121 0.13191 -2.359 0.0185 *
## age_sp3 -0.15709 0.08571 -1.833 0.0672 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 893 degrees of freedom
## Multiple R-squared: 0.6097, Adjusted R-squared: 0.608
## F-statistic: 348.8 on 4 and 893 DF, p-value: < 2.2e-16
confint(model6)
## 2.5 % 97.5 %
## (Intercept) 2.3208152 2.92985326
## age 0.6013622 0.99687958
## age_sp1 -0.4575615 0.14035130
## age_sp2 -0.5700990 -0.05232632
## age_sp3 -0.3253102 0.01112148
# Get the predicted values from this regression
qplot(x=jitter(age), y=weight, data=nepalData, xlab="Age in months",
ylab="Weight in kg", ylim=c(0,12)) +
geom_line(aes(x = age, y=model6$fitted.values), color="red", lwd=2)
Use an F-test with 3 degrees of freedom in the numerator to test the null hypothesis of a linear relationship against the alternative defined by the linear spline fit in the steps above.
The null hypothesis of F-test is ‘H0: b2 = b3 = b4’. According to the result of the test, we can reject the null hypothesis and conclude there is statistically significant difference in the slope of each age strata (Fobs = 45.984).
To address the question of whether the difference in growth rate per month is changed as the age increasing, we added splines in the original linear regression model. The result of the F test indicates that there is statistically significant difference in the slopes between the age strata. By adding splines in the original linear regression model, we can get more improved fitted model.
7. Comment on the quality of each of the 3 models using the plots
Based on the AIC, linear splines model (Model 6) has the lowest AIC that approacimates the lowest predicted residual sum of square(PRESS) following by model with monthly means (Model 5) and linear model (Model 4). We can find the Model 5 has the lowest residual variance among three models. As a result, we can use linear splines model or model with monthly means to predict the mean of weight precisely, and conclude that the growth rate (the change in mean of weight per unit change in age) decreases. When the model has smaller AIC and smaller residual variance (residual mean squares), the quality of model increases.