Make a scatter-plot of weight against age discriminating boy and girl infants by the plotting color. Recode sex as a 0/1 factor variable named gender with two levels, “Male” and “Female”, where “Male” is the reference group (the 0).
# import the dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
nepalA = read_csv("nepal_anthro.csv")
## Rows: 4386 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): id, age, sex, weight, height, armcirc
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# omit children with ages greater than 12 months.
# also omit all children for whom height, weight or arm circumference is missing.
nepalData = nepalA %>% ### store this dataset under a new name
filter(age<=12) %>% ### keep only children with age <= 12
filter(!is.na(height), !is.na(weight), !is.na(armcirc)) ### remove NAs
nepalData = nepalData %>%
mutate(gender=recode_factor(sex, '1'="Male", '2'="Female"))
**Fit a simple ANCOVA model (Model A) that includes gender, age and gender*age (gender by age interaction). Obtain the fitted values from this model and add them to the graph using the same colors for the boy and girl predicted values as for the original data**
According to the above results, the Model A is made as the following. (X1 is a continuous variable, age, X2 is a dichotomous variable, gender, and X1X2 (agegender) is interaction term): E(weight) = b0 + b1 X1 (age) + b2 X2 (gender) + b3 X1X2 (agegender) + b
For male (X2 = 0), let interpret the regression coefficients, 1) b0 : the expected mean weight when age is 0 for male. 2) b1 : the difference in ‘the expected mean weight per unit change in age (=growth rate)’ for male.
For female (X2 = 1), the interpretation of the regression coefficients is, 3) b0 + b2: the expected mean weight when age is 0 for female. 4) b1 + b3 : the difference in the expected mean weight per unit change in age for male. 5) b2 : the difference in the expected mean weight at age 0 for male versus female. 6) b3 : the difference in ‘the expected mean weight per unit change in age (=growth rate)’ for male versus female.
The regression results suggests that the linear relationship between weight and age may not differ by gender based on the result of t-test for the coefficient for the interaction term b3, which indicates p-value, 0.40. In other words, gender can be a confounder in this model. In the scatter-plot with the fitted line in the previous question, we may find the a difference in the two slopes, however, the difference is not statistically significant.
# Make a scatter-plot of weight against age
## the first level you designate will be the reference!
qplot(x=jitter(age), y=weight, color=gender, shape=gender,
data=nepalData, xlab="Age in months", ylab="Weight in kg")
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
#Fit a simple ANCOVA model (Model A) that includes gender, age and gender*age
model1 = lm(weight ~ age + gender, data=nepalData)
summary(model1)
##
## Call:
## lm(formula = weight ~ age + gender, data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1912 -0.6524 0.0187 0.6989 3.5998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.943245 0.079281 49.738 < 2e-16 ***
## age 0.330324 0.009728 33.955 < 2e-16 ***
## genderFemale -0.515984 0.071379 -7.229 1.04e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.069 on 895 degrees of freedom
## Multiple R-squared: 0.5743, Adjusted R-squared: 0.5734
## F-statistic: 603.7 on 2 and 895 DF, p-value: < 2.2e-16
confint(model1)
## 2.5 % 97.5 %
## (Intercept) 3.7876476 4.0988429
## age 0.3112310 0.3494173
## genderFemale -0.6560741 -0.3758942
model2 = lm(weight ~ age*gender, data=nepalData)
summary(model2)
##
## Call:
## lm(formula = weight ~ age * gender, data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1448 -0.6332 0.0229 0.6999 3.6221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.89218 0.10027 38.817 < 2e-16 ***
## age 0.33849 0.01382 24.488 < 2e-16 ***
## genderFemale -0.41504 0.14077 -2.948 0.00328 **
## age:genderFemale -0.01619 0.01946 -0.832 0.40564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.07 on 894 degrees of freedom
## Multiple R-squared: 0.5746, Adjusted R-squared: 0.5732
## F-statistic: 402.6 on 3 and 894 DF, p-value: < 2.2e-16
confint(model2)
## 2.5 % 97.5 %
## (Intercept) 3.69539225 4.08897693
## age 0.31136399 0.36562074
## genderFemale -0.69132327 -0.13876422
## age:genderFemale -0.05438561 0.02200318
qplot(x=jitter(age), y=weight, color=gender, shape=gender, data=nepalData,
xlab="Age in months", ylab="Weight in kg") +
geom_line(aes(x = age, y=model2$fitted.values, color=gender))
Calculate the residuals from this model and plot them against age using the same colors for boys and girls as were used for the raw data. Comment whether this model, that assumes growth is linear in the first year, is adequate.
There is a tendency that distribution of spots in the range of 0-4 month and 8-12 month below the line = 0. This curve-linear relationship in the following plot suggests that the assumption of linearity does not hold. Furthermore, the spread in residual values for fitted values increases as the age increasing, which suggests the violent in the assumption of equal variance.
#Calculate the residuals from this model and plot them against age
qplot(y=model2$residuals, x=jitter(age), color=gender, shape=gender,
data=nepalData, ylab="Residuals", xlab="Age in months")
Add a linear spline term with a break at 4 months and its interaction with gender to the model above. Call the new model B. Calculate and plot the fitted values from Model B against age. Interpret each of the coefficients in Model B.
Following the previous interpretation I figured out, Model B with a linear spline was made as the following (X2 = (X - knot)+ , If X < knot, X2 = 0, and If X >= knot, X2 = (X - knot)+ = X - knot) ; E(weight) = b0 + b1 X + b2 (X - knot)+ + e
For X < knot, the model is ‘E(weight) = b0 + b1 X + e’ 1) Intercept b0 : expected mean weight at age = 0. 2) Slope b1 : ‘expected change in mean weight per a month(=growth rate)’ at or before 4 months after the birth.
For X >= knot, the model is ‘E(weight) = b0 + b1 X + b2 (X - knot) + e’ 1) Intercept b0 - b2 : extrapolated expected mean weight at age = 0. 2) Slope b1 + b2 : ‘expected change in mean weight per a month(=growth rate)’ after 4 months since the birth.
#Add a linear spline term with a break at 4 months and its interaction with gender to the model
nepalData = nepalData %>% mutate(agesp = ifelse(age > 4, age-4, 0))
# model B
model3 = lm(weight ~ age*gender + agesp*gender, data=nepalData)
summary(model3)
##
## Call:
## lm(formula = weight ~ age * gender + agesp * gender, data = nepalData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7733 -0.5689 0.0299 0.6274 3.6286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.79295 0.15280 18.279 <2e-16 ***
## age 0.78656 0.05101 15.420 <2e-16 ***
## genderFemale -0.20707 0.21160 -0.979 0.328
## agesp -0.58579 0.06454 -9.077 <2e-16 ***
## age:genderFemale -0.09096 0.07102 -1.281 0.201
## genderFemale:agesp 0.09081 0.09041 1.004 0.315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9938 on 892 degrees of freedom
## Multiple R-squared: 0.6336, Adjusted R-squared: 0.6315
## F-statistic: 308.5 on 5 and 892 DF, p-value: < 2.2e-16
confint(model3)
## 2.5 % 97.5 %
## (Intercept) 2.49306592 3.09283098
## age 0.68645128 0.88667102
## genderFemale -0.62237081 0.20822255
## agesp -0.71245089 -0.45912493
## age:genderFemale -0.23035576 0.04843374
## genderFemale:agesp -0.08663958 0.26826081
qplot(x=jitter(age), y=weight, color=gender, shape=gender,
data=nepalData, xlab="Age in months", ylab="Weight in kg") +
geom_line(aes(x = age, y=model3$fitted.values, color=gender))
To test the linearity over the first 12 months, Set the null hypothesis: b2 = 0. Based on the result of f-test, we can reject the null hypothesis with the p-value < 0.05. It means that growth is not linear over the first twelve months.
Calculate the residuals from this model and plot them against age. Comment on any assumptions of the linear regression that still appear to be violated.
The assumption of the linearity can hold with the linear spline regression model. We can see the centering around zero, which suggests linearity. However, the spread in residual values for fitted values still increases as the age increasing, which suggests the violent in the assumption of equal variance.
According to the result of analysis of cavariance (ANCOVA) multiple linear regression (MLR) of weight on age, gender, and age*gender, the weight is on average 0.338 kg greater per each month increase in age (95% CI: 0.31 – 0.37) for boys. For girls, the mean of weight is 0.32 kg greater per each month increase in age. The association between weight and age is not statistically significant different compared to that of boys (estimated difference = - 0.02, 95% CI: -0.05 - 0.02). We can also see the rate of change in monthly weight with age, after versus before, 4 months of age, is statistically significantly different from zero (F-test = 71.74, p-value = 0.00). For children less than 4 months, the mean of growth rate for both genders is 0.79 kg per month (95% CI: 0.69 – 0.89). However, the rate is slightly decrease after 4 months by 0.21 kg per month on average.
#Test the null hypothesis that growth is linear over the first twelve months by testing whether the coefficients of the two new terms in Model B are both zero.
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: weight ~ age * gender
## Model 2: weight ~ age * gender + agesp * gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 894 1022.66
## 2 892 880.95 2 141.71 71.744 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Calculate the residuals from this model and plot them against age.
qplot(y=model3$residuals, x=jitter(age), data=nepalData, ylab="Residuals",
xlab="Age in months") + geom_hline(yintercept=0, color="red")