Name: Dustin Duffy

Linear regression

Boston is the capital of Massachusetts and the largest city in New England. After American independence the city became an important port and manufacturing center, and a center of education and culture as well. However, housing prices in Boston are hard to determine because they depend on a complicated mix of factors. In this question, we will explore Boston housing prices using a linear regression model.

The Boston housing price data are available in the MASS package.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

All variables are shown in the output above.

A1:

medv is the median value of owner-occupied homes in $1000s. We want to investigate which covariates have large impacts on medv. Perform a univariate linear regression for each of the available covariates (other than medv), where univariate indicates that each time, one predictor is included in the regression model. Find the covariates that are significant at the Bonferroni-corrected 5% level.

covariates0 <- names(Boston)
covariates <- setdiff(covariates0, "medv")
pvalue_lms <- numeric(length(covariates))
names(pvalue_lms) <- covariates

for(acov in covariates){
  # regress medv on one covariate at a time
  alm <- lm(Boston$medv ~ Boston[[acov]])
  asumm <- summary(alm)
  # store the slope's p-value for this covariate
  pvalue_lms[acov] <- asumm$coefficients["Boston[[acov]]", "Pr(>|t|)"]
}
covariates[pvalue_lms < 0.05/length(pvalue_lms)]
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"
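
As a cross-check, the same covariates can be identified with p.adjust (a sketch; Bonferroni-adjusted p-values below 0.05 are equivalent to raw p-values below 0.05/13):

padj <- p.adjust(pvalue_lms, method = "bonferroni")  # multiplies each p-value by 13, capped at 1
covariates[padj < 0.05]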

A2:

Fit a linear regression model to the Boston housing price data using medv as outcome variable and all the rest of variables as predictors.

  1. Find the covariate which has the smallest p-value. Print the covariate name.
  2. What is the 99% confidence interval for this variable? (Hint: there is only one variable with the smallest p-value.)
  3. Is there any collinearity issue with the model?
uni <- lm(medv~., Boston)
pv <- summary(uni)$coefficients[,4]
which.min(pv)
## lstat 
##    14
confint(uni, level = .99)
##                     0.5 %       99.5 %
## (Intercept)  23.262664069 49.656312701
## crim         -0.192995595 -0.023027121
## zn            0.010923180  0.081917737
## indus        -0.138460547  0.179577800
## chas          0.458810151  4.914657488
## nox         -27.643929437 -7.889293019
## rm            2.729169475  4.890560938
## age          -0.033466407  0.034850857
## dis          -1.991328644 -0.959805048
## rad           0.134486947  0.477612011
## tax          -0.022058811 -0.002610377
## ptratio      -1.291046761 -0.614447702
## black         0.002366157  0.016257210
## lstat        -0.655900931 -0.393615825
suppressMessages(library(car))  # for vif()
vif(uni)
##     crim       zn    indus     chas      nox       rm      age      dis 
## 1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 
##      rad      tax  ptratio    black    lstat 
## 7.484496 9.008554 1.799084 1.348521 2.941491
### rad and tax have the largest VIFs (9.01 and 7.48), suggesting possible collinearity; refit without them
univif <- lm(medv~crim+zn+indus+chas+nox+rm+age+dis+ptratio+black+lstat, data=Boston)
summary(univif)
## 
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age + 
##     dis + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7734  -2.7928  -0.6504   1.5477  27.9833 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.210710   4.893167   5.970 4.54e-09 ***
## crim         -0.061435   0.030434  -2.019 0.044066 *  
## zn            0.041516   0.013551   3.064 0.002306 ** 
## indus        -0.045831   0.055961  -0.819 0.413190    
## chas          3.085016   0.871868   3.538 0.000441 ***
## nox         -14.594752   3.670109  -3.977 8.03e-05 ***
## rm            4.134272   0.419341   9.859  < 2e-16 ***
## age          -0.004295   0.013415  -0.320 0.748983    
## dis          -1.488851   0.203337  -7.322 9.98e-13 ***
## ptratio      -0.811272   0.121647  -6.669 6.91e-11 ***
## black         0.008205   0.002706   3.032 0.002558 ** 
## lstat        -0.515812   0.051668  -9.983  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.839 on 494 degrees of freedom
## Multiple R-squared:  0.7293, Adjusted R-squared:  0.7232 
## F-statistic:   121 on 11 and 494 DF,  p-value: < 2.2e-16
anova(uni, univif)
## Analysis of Variance Table
## 
## Model 1: medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + black + lstat
## Model 2: medv ~ crim + zn + indus + chas + nox + rm + age + dis + ptratio + 
##     black + lstat
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    492 11079                                  
## 2    494 11565 -2   -486.63 10.805 2.555e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Note: the ANOVA F-test (p = 2.6e-05) shows rad and tax jointly add significant fit, and R-squared actually drops from 0.7406 (full model) to 0.7293 after removing them. The large VIFs flag collinearity between rad and tax, but dropping both costs explanatory power.
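
Besides VIF, the condition number of the scaled predictor matrix is another quick collinearity diagnostic (a sketch; values above roughly 30 are commonly taken as a warning sign):

X <- scale(as.matrix(Boston[, covariates]))  # standardize the 13 predictors
kappa(X, exact = TRUE)                       # condition number of the design matrix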

A3:

Perform model selection via forward-backward selection (both) starting from the full model.

  1. Print coefficient table after model selection.
  2. Do not print out the model selection process (specify trace = F).
  3. How many variables are selected (excluding intercept)?
finalModel <- step(lm(medv ~ ., data = Boston), direction = "both", trace = FALSE)
summary(finalModel)$coef
##                  Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)  36.341145004 5.067492203   7.171426 2.727265e-12
## crim         -0.108413345 0.032779445  -3.307358 1.010438e-03
## zn            0.045844929 0.013522670   3.390228 7.542759e-04
## chas          2.718716303 0.854239823   3.182615 1.551469e-03
## nox         -17.376023429 3.535243066  -4.915086 1.209413e-06
## rm            3.801578840 0.406315906   9.356215 2.889779e-19
## dis          -1.492711460 0.185730779  -8.036963 6.837043e-15
## rad           0.299608454 0.063402104   4.725529 2.996799e-06
## tax          -0.011777973 0.003372332  -3.492531 5.214237e-04
## ptratio      -0.946524570 0.129065618  -7.333669 9.235063e-13
## black         0.009290845 0.002673905   3.474636 5.565743e-04
## lstat        -0.522553457 0.047424359 -11.018672 2.140586e-25

11 variables (excluding the intercept) remain after forward-backward selection.
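
step() uses AIC by default; a BIC-flavored selection can be requested with the k = log(n) penalty (a sketch, not required by the question):

n <- nrow(Boston)
finalModelBIC <- step(lm(medv ~ ., data = Boston), direction = "both",
                      trace = FALSE, k = log(n))  # BIC penalty instead of AIC's default k = 2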

A4:

Use the model after forward-backward selection as your final model. Plot the Q-Q plot of standardized residuals. Do you think the standardized residuals are normally distributed by visual check? Also formally test the normality of standardized residuals using Shapiro-Wilk normality test. (Hint: you may use rstandard function to get standardized residuals).

a5 <- step(uni, direction = "both", trace = FALSE)  # same selected model as finalModel in A3
plot(a5, 2)

### The points deviate clearly from the reference line in the tails, so the standardized residuals do not look normally distributed by visual check.
shapiro.test(residuals(a5))  # rstandard(a5), per the hint, leads to the same conclusion
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(a5)
## W = 0.90131, p-value < 2.2e-16
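
For reference, rstandard() divides each raw residual by its estimated standard deviation \(\hat{\sigma}\sqrt{1-h_{ii}}\); a sketch verifying the equivalence and running the formal test on the standardized residuals:

e <- residuals(a5)
h <- hatvalues(a5)      # leverages
s <- summary(a5)$sigma  # residual standard error
all.equal(unname(rstandard(a5)), unname(e / (s * sqrt(1 - h))))
shapiro.test(rstandard(a5))  # same verdict: normality is strongly rejected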

A5:

Cook’s distance is one way to detect potential influential points. One suggestion is that if Cook’s distance is greater than \(4/n\), the data point is a potential influential point. Visualize Cook’s distance versus observation number. Draw a red line to indicate the boundary for potential influential points. Which observations (observation IDs) are potential influential data? (Hint: the cooks.distance() function can help calculate Cook’s distance.)

cooksd <- cooks.distance(a5)
sample_size <- nrow(Boston)
plot(cooksd, pch = 3, cex = 2, main = "Influential Obs by Cooks distance",
     xlab = "Observation number", ylab = "Cook's distance")
abline(h = 4/sample_size, col = "red")  # boundary for potential influential points

which((cooksd>4/sample_size))
##  65 142 149 162 163 164 167 187 196 204 205 215 226 234 254 263 268 365 366 368 
##  65 142 149 162 163 164 167 187 196 204 205 215 226 234 254 263 268 365 366 368 
## 369 370 371 372 373 374 375 376 381 406 413 415 506 
## 369 370 371 372 373 374 375 376 381 406 413 415 506
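
As a sanity check (a sketch), Cook's distance can be reproduced from the standardized residuals and leverages via \(D_i = r_i^2 \, h_{ii} / \big(p\,(1-h_{ii})\big)\), where \(p\) is the number of model coefficients:

r <- rstandard(a5)
h <- hatvalues(a5)
p_coef <- length(coef(a5))  # named p_coef to avoid clashing with the data frame p used in part B
all.equal(unname(cooksd), unname(r^2 * h / (p_coef * (1 - h))))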

B Generalized Linear Model

In this question, we will use generalized linear model (glm) to analyze count data.

p <- read.csv("https://caleb-huo.github.io/teaching/data/award_count/poisson_sim.csv")

p <- within(p, {
  prog <- factor(prog, levels=1:3, labels=c("General", "Academic", 
                                            "Vocational"))
  id <- factor(id)
})

head(p)
##   id num_awards       prog math
## 1  1          0 Vocational   40
## 2  2          0 Vocational   33
## 3  3          0   Academic   48
## 4  4          1   Academic   41
## 5  5          1   Academic   43
## 6  6          0   Academic   46

There are 200 subjects and 4 columns. id is subject ID; num_awards is the number of awards won by the subject; prog is the program type (General, Academic, or Vocational); math is the math score of the subject. We want to investigate what factors affect the number of awards. Note that, by the way we specified the factor prog, “General” will be the reference group; “Academic” and “Vocational” will be the other categories.
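
If a different reference group were ever wanted, relevel() changes it without re-reading the data (a sketch only; we keep “General” as the reference throughout):

p2 <- p
p2$prog <- relevel(p2$prog, ref = "Academic")  # example: make "Academic" the reference instead
levels(p2$prog)                                # the first level is the reference in model matrices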

B1 histogram of num_awards

The x-axis is the number of awards (0, 1, 2, …). The y-axis is the count (frequency) for each specific number of awards. At each number of awards, the three programs should be placed side by side and shown in different colors (hint: position = “dodge”). Please use the black-and-white theme (e.g., theme_bw). For all element_text, use bold italic font, blue color, and text size 20.

library(ggplot2)
plot4 <- ggplot(data = p) +
  geom_bar(mapping = aes(x = num_awards, fill = prog), position = "dodge") + theme_bw()
plot4 + theme(axis.text.x = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.title.x = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.text.y = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.title.y = element_text(size = 20, color = "blue", face = "bold.italic"),
              plot.title = element_text(size = 20, color = "blue", face = "bold.italic")) +
  ggtitle("Histogram of Number of Awards") + labs(x = "Number of Awards", y = "Count")

B2, draw a scatter plot of math versus number of awards

The x-axis is the math score. The y-axis is the number of awards. Each dot (subject) should be colored based on their program. Also fit three straight lines (one per program) to show the trends in the scatter plot. Please use the black-and-white theme (e.g., theme_bw). For all element_text, use bold italic font, blue color, and text size 20.

plot5 <- ggplot(data = p, aes(x = math, y = num_awards, color = prog)) + 
   geom_point() +
   geom_smooth(method = "lm", se = FALSE)
plot5 + theme(axis.text.x = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.title.x = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.text.y = element_text(size = 20, color = "blue", face = "bold.italic"),
              axis.title.y = element_text(size = 20, color = "blue", face = "bold.italic"),
              plot.title = element_text(size = 20, color = "blue", face = "bold.italic")) +
  ggtitle("Scatter Plot of Math vs. Number of Awards") + labs(x = "Math Score", y = "Number of Awards")
## `geom_smooth()` using formula 'y ~ x'

Question B3, fit Poisson regression model. (5 pts)

Fit the Poisson regression model using num_awards as the outcome variable and prog and math as predictors. What is the 95% confidence interval for the predictor math (the original one, not the exponentiated one)? In words, what is the interpretation of the estimate for math?

summary(g <- glm(num_awards ~ prog + math,family = 'poisson', data = p))
## 
## Call:
## glm(formula = num_awards ~ prog + math, family = "poisson", data = p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2043  -0.8436  -0.5106   0.2558   2.6796  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -5.24712    0.65845  -7.969 1.60e-15 ***
## progAcademic    1.08386    0.35825   3.025  0.00248 ** 
## progVocational  0.36981    0.44107   0.838  0.40179    
## math            0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6
confint(g)
## Waiting for profiling to be done...
##                      2.5 %      97.5 %
## (Intercept)    -6.57106799 -3.98529805
## progAcademic    0.43504394  1.85586906
## progVocational -0.49020399  1.26602301
## math            0.04949674  0.09107698
exp(coef(g))
##    (Intercept)   progAcademic progVocational           math 
##     0.00526263     2.95606545     1.44745846     1.07267164
### The 95% CI for math on the original scale is (0.0495, 0.0911). A one-unit increase in math score multiplies the expected number of awards by exp(0.0702) ≈ 1.07, i.e. about a 7% increase, holding program constant.
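
Because exp() is monotone, exponentiating the endpoints gives the CI on the rate-ratio scale (a sketch):

exp(cbind(rate_ratio = coef(g), confint(g)))  # estimates and 95% CIs as multiplicative effects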

B4, fit negative binomial (NB) model

Fit a negative binomial (NB) regression model using num_awards as the outcome variable and prog and math as predictors. Compared with the General program, which program(s) have a significant impact on the number of awards (p < 0.05), adjusting for math score? Just print the program name(s).

glm_nb <- glm.nb(num_awards ~ prog + math, data = p)
asumm <- summary(glm_nb)$coef
print(names(which(asumm[2:3,"Pr(>|z|)"] < 0.05)))
## [1] "progAcademic"

B5 Model Comparison

Please evaluate the following 4 models. Which is the best in terms of AIC? Print the coefficient table of the best model.

  1. Poisson: num_awards ~ prog + math
  2. Poisson: num_awards ~ prog + math + prog:math interaction
  3. NB: num_awards ~ prog + math
  4. NB: num_awards ~ prog + math + prog:math interaction
glm_poisson <- glm(num_awards ~ prog + math, family = poisson(link="log"), data=p)
glm_nb <- glm.nb(num_awards ~ prog + math, data = p)
glm_poisson_inter <- glm(num_awards ~ prog * math, family = poisson(link="log"), data=p)
glm_nb_inter <- glm.nb(num_awards ~ prog * math, data = p)

AIC(glm_poisson, glm_nb, glm_poisson_inter, glm_nb_inter)
##                   df      AIC
## glm_poisson        4 373.5045
## glm_nb             5 373.8109
## glm_poisson_inter  6 377.1565
## glm_nb_inter       7 377.4078
### The best model by AIC is the Poisson model without interaction (AIC 373.50); its coefficient table is the summary(g) output printed in B3.

B6 zero inflated Poisson

Fit a zero-inflated Poisson model https://caleb-huo.github.io/teaching/2019FALL/lectures/week10_GLM/glm/glm.html#(62), using num_awards as the outcome variable; prog and math as the predictors for the Poisson part; and prog and math as the predictors for the zero-proportion part.

  • What is the coefficient of math from the Poisson part, and what is its interpretation?
  • What is the coefficient of math from the zero proportion part, and what is its interpretation?
suppressMessages(library(pscl))  # for zeroinfl()
summary(m1 <- zeroinfl(num_awards ~ prog + math | prog + math, dist = 'poisson', data = p))
## 
## Call:
## zeroinfl(formula = num_awards ~ prog + math | prog + math, data = p, 
##     dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5076 -0.5599 -0.3825  0.2805  3.3543 
## 
## Count model coefficients (poisson with log link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -4.57716    0.83143  -5.505 3.69e-08 ***
## progAcademic    1.30807    0.39421   3.318 0.000906 ***
## progVocational  0.35044    0.44173   0.793 0.427581    
## math            0.05761    0.01437   4.010 6.06e-05 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -1.94229   80.87217  -0.024    0.981
## progAcademic     7.90777   80.68261   0.098    0.922
## progVocational  -2.99821  346.12583  -0.009    0.993
## math            -0.13903    0.09664  -1.439    0.150
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 69 
## Log-likelihood: -181.8 on 8 Df
exp(coef(m1))
##    count_(Intercept)   count_progAcademic count_progVocational 
##         1.028403e-02         3.699011e+00         1.419695e+00 
##           count_math     zero_(Intercept)    zero_progAcademic 
##         1.059304e+00         1.433750e-01         2.718318e+03 
##  zero_progVocational            zero_math 
##         4.987622e-02         8.701978e-01
## For the Poisson part, a one-unit increase in math multiplies the expected number of awards by exp(0.0576) ≈ 1.059, i.e. about a 6% increase. For the zero-proportion part, a one-unit increase in math multiplies the odds of being in the "certain zero" group by exp(-0.139) ≈ 0.870, i.e. the odds decrease by about 13%.
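
To check whether the zero-inflation component is warranted at all, pscl's Vuong test compares the zero-inflated fit against the plain Poisson model (a sketch; m1 and g as fitted above):

vuong(m1, g)  # non-nested comparison: zero-inflated Poisson vs. standard Poisson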

C Linear Mixed Model

In this question, we will use linear mixed model to account for within subject correlation (or repeated measure, longitudinal outcomes). The dataset calcium is inside lava package.

suppressMessages(library(lava))
data(calcium)
calcium$person <- as.factor(calcium$person)

Bone Mineral Density Data consisting of 112 girls randomized to receive calcium or placebo. Longitudinal measurements of bone mineral density (\(g/cm^2\)) measured approximately every 6th month in 3 years. A data.frame containing 501 (incomplete) observations. The ‘person’ column defines the individual girls of the study with measurements at visiting times ‘visit’, and age in years ‘age’ at the time of visit. The bone mineral density variable is ‘bmd’ (\(g/cm^2\)).

C1, visualize the data using spaghetti plot

Draw a scatter plot of visit (x-axis) versus bmd (y-axis). Use different colors to denote different persons; then facet_wrap the figure by group. For each person, connect the dots of adjacent visits. This type of plot is often referred to as a spaghetti plot. Please use the black-and-white theme, and do not show the legend.

b <- ggplot(data = calcium, aes(x = visit, y = bmd, color=person))
b2 <- b + geom_point()+theme_bw()
b3 <- b2+theme(legend.position = "none")
b3 + facet_wrap( ~ group)+geom_line()+ggtitle("Spaghetti Plot by Group")+labs(x="Visit", y="BMD")

C2, create a subset of the data matrix

At this moment, we are only interested in the treatment (group) effect on bmd change (bmd at last visit - bmd at initial visit). Therefore create a matrix with 112 rows (each row represents a person). For each person, include initial bmd, last bmd, bmd change (last bmd - initial bmd), group information (placebo (P) or calcium (C)), and initial age (age at initial visit) as covariates. Print the head of this data matrix.

c1 <- calcium %>%
   group_by(person) %>%
   arrange(visit) %>%
   slice(1) %>%
   ungroup
c2 <- calcium %>%
   group_by(person) %>%
   arrange(visit) %>%
   slice(n()) %>%
   ungroup
colnames(c1)[colnames(c1)=="bmd"] <- "initial_bmd"
colnames(c1)[colnames(c1)=="age"] <- "initial_age"
colnames(c2)[colnames(c2)=="bmd"] <- "last_bmd"
colnames(c2)[colnames(c2)=="age"] <- "last_age"
c2 <- c2[, c("person", "last_bmd", "last_age")]  # keep only the last-visit columns
c5 <- merge(c1,c2,by="person")
c5$bmd_change <- (c5$last_bmd - c5$initial_bmd)
c5$visit <- NULL  # drop the (first) visit number, which is no longer needed
head(c5)
##   person initial_bmd group initial_age ctime last_bmd last_age bmd_change
## 1    101       0.815     C    10.91307 11078    0.970 12.90897      0.155
## 2    102       0.813     P    10.91307 11078    0.901 12.86516      0.088
## 3    103       0.812     P    10.91581 11079    0.895 12.87064      0.083
## 4    104       0.804     C    10.93498 11086    0.948 13.04860      0.144
## 5    105       0.904     C    10.94867 11091    1.002 12.90349      0.098
## 6    106       0.831     P    10.95140 11092    0.933 13.11978      0.102
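
An equivalent, more compact construction with dplyr::summarize (a sketch; first/last assume rows are sorted by visit within person):

c5_alt <- calcium %>%
  arrange(person, visit) %>%
  group_by(person) %>%
  summarize(initial_bmd = first(bmd), last_bmd = last(bmd),
            bmd_change = last(bmd) - first(bmd),
            group = first(group), initial_age = first(age),
            .groups = "drop")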

C3, fit linear regression model

Fit a linear regression using bmd change as outcome variable; group, initial age, group by initial age interaction, and initial bmd as covariates. What is the p-value and the 95% CI of initial bmd?

lm3 <- lm(bmd_change~group+initial_age+(group*initial_age)+initial_bmd, data=c5)
summary(lm3)
## 
## Call:
## lm(formula = bmd_change ~ group + initial_age + (group * initial_age) + 
##     initial_bmd, data = c5)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.094205 -0.023279  0.004461  0.029479  0.091786 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)         0.404980   0.692291   0.585    0.560
## groupP              0.303212   0.962478   0.315    0.753
## initial_age        -0.027544   0.062135  -0.443    0.658
## initial_bmd        -0.009836   0.065016  -0.151    0.880
## groupP:initial_age -0.028841   0.087010  -0.331    0.741
## 
## Residual standard error: 0.04281 on 107 degrees of freedom
## Multiple R-squared:  0.04344,    Adjusted R-squared:  0.007676 
## F-statistic: 1.215 on 4 and 107 DF,  p-value: 0.3089
confint(lm3)
##                         2.5 %     97.5 %
## (Intercept)        -0.9674054 1.77736592
## groupP             -1.6047880 2.21121120
## initial_age        -0.1507185 0.09563065
## initial_bmd        -0.1387233 0.11905113
## groupP:initial_age -0.2013272 0.14364603
### P-value of initial bmd is 0.880 and 95%CI is (-0.139, 0.119)

C4, fit linear mixed model

Now we should work on the original data matrix. We want to investigate whether time (visit) plays an important role in bmd. We will fit a linear mixed model with bmd as outcome variable and visit as the only predictor. We also want to account for within subject correlation by considering random intercept and random slope. Don’t worry about REML options (both of them are fine). What is the intercept and slope for person 101? Please print out the result.

suppressMessages(library(lmerTest))  # loads lme4; adds p-values and Kenward-Roger tests used below
m12 = lmer(bmd~visit+(0+visit|person) + (1|person), data=calcium, REML = FALSE)
coef(m12)
## $person
##     (Intercept)       visit
## 101   0.7981765 0.035581925
## 102   0.7898536 0.022373282
## 103   0.7805478 0.021094278
## 104   0.7811519 0.033358334
## 105   0.8787255 0.023033949
## 106   0.8081483 0.025170900
## 107   0.7701030 0.012063924
## 108   0.7699399 0.022866404
## 109   0.8025730 0.020774082
## 110   0.7845769 0.025076327
## 111   0.8052647 0.036833506
## 112   0.7775318 0.017039106
## 113   0.8401723 0.018230665
## 114   0.7675427 0.023255941
## 115   0.8083255 0.020978155
## 116   0.8403540 0.020535877
## 117   0.8322047 0.024361384
## 118   0.7375897 0.019764488
## 119   0.7269952 0.025454820
## 120   0.7721074 0.029258347
## 121   0.7707175 0.030457840
## 122   0.8504799 0.024840349
## 123   0.7973023 0.033075120
## 124   0.7818378 0.027394961
## 125   0.7778165 0.028010307
## 126   0.7622461 0.020232330
## 127   0.7647272 0.028800208
## 128   0.7234818 0.013910179
## 129   0.8013714 0.030443753
## 130   0.8106661 0.019059834
## 131   0.8090538 0.028000557
## 132   0.7587934 0.018190287
## 133   0.8507134 0.029762042
## 134   0.7612200 0.012643129
## 135   0.7996230 0.023944033
## 136   0.7846758 0.029109814
## 137   0.7890021 0.009058872
## 201   0.8207340 0.029216558
## 202   0.8461164 0.034970445
## 203   0.8964386 0.020826631
## 204   0.8572455 0.024971299
## 205   0.8564937 0.022223439
## 206   0.8826730 0.027176701
## 207   0.8844005 0.037997317
## 208   0.8263883 0.031920207
## 209   0.8753730 0.025353475
## 210   0.8486712 0.016341111
## 211   0.8543549 0.028470123
## 212   0.9661545 0.023845199
## 213   0.9051210 0.028714713
## 214   0.8673361 0.024320416
## 215   0.8477382 0.025626518
## 301   0.8905065 0.023282838
## 302   0.8449245 0.026170186
## 303   0.8880721 0.024232601
## 305   0.8629842 0.020550679
## 306   0.8714943 0.024864660
## 307   0.9022605 0.019740740
## 308   0.8175217 0.024657957
## 309   0.8618918 0.030813002
## 310   0.8044725 0.024337024
## 311   0.8247928 0.020455331
## 312   0.8074216 0.031214625
## 313   0.8889463 0.026739406
## 314   0.8669133 0.022462208
## 315   0.7976071 0.031706250
## 316   0.8241078 0.014885043
## 317   0.8493001 0.017169351
## 318   0.8109815 0.029729224
## 319   0.8850544 0.025178028
## 320   0.8948074 0.012507036
## 321   0.8494538 0.024582335
## 322   0.8182217 0.020325827
## 323   0.8423938 0.024618928
## 324   0.7999817 0.013576520
## 325   0.7784771 0.031322074
## 326   0.8403763 0.017828907
## 327   0.8163586 0.020150667
## 328   0.8341221 0.017459793
## 329   0.9682809 0.026104438
## 330   0.8513704 0.024606885
## 331   0.9335343 0.023118773
## 401   0.9970295 0.026472689
## 402   0.8531885 0.028731926
## 403   0.9449444 0.034092024
## 404   0.9837529 0.022775207
## 405   0.9940789 0.025746542
## 406   0.9430190 0.015204433
## 407   0.9296214 0.027364401
## 408   0.8857646 0.021473176
## 409   0.9226120 0.016181900
## 410   0.8439756 0.021888416
## 411   0.9469247 0.030961199
## 412   0.9040975 0.022657977
## 413   0.8986889 0.026005281
## 414   0.8757173 0.023576172
## 415   0.9278330 0.028964207
## 416   0.9911932 0.026047710
## 417   0.9064475 0.038403178
## 418   0.9414250 0.016805684
## 419   0.8608094 0.022744977
## 420   0.9197800 0.024065040
## 421   0.9289702 0.041716156
## 422   0.8486475 0.023255301
## 423   0.8367343 0.018850348
## 424   0.8566187 0.023514849
## 425   0.8966443 0.033490999
## 426   0.9053676 0.023232004
## 427   0.8378436 0.039344973
## 428   0.9780675 0.023449829
## 429   0.9342592 0.022075646
## 430   0.9489623 0.025411149
## 
## attr(,"class")
## [1] "coef.mer"
### The intercept and slope for person 101 are 0.798 and 0.0356, respectively.
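
Note that (0 + visit | person) + (1 | person) forces the random intercept and random slope to be independent; a common alternative (a sketch, not required here) lets them correlate:

m12c <- lmer(bmd ~ visit + (1 + visit | person), data = calcium, REML = FALSE)
coef(m12c)$person["101", ]  # person 101's intercept and slope under the correlated model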

C5, p-value for group using likelihood ratio test

Now we should work on the original data matrix. We want to fit a linear mixed model with bmd as outcome variable; visit and group as predictors. We also want to account for within subject correlation by considering random intercept and random slope. Please provide p-value for group using likelihood ratio test.

m13 = lmer(bmd~visit+group+(0+visit|person) + (1|person), data=calcium, REML = FALSE)
anova(m12, m13,test="Chisq")
## Data: calcium
## Models:
## m12: bmd ~ visit + (0 + visit | person) + (1 | person)
## m13: bmd ~ visit + group + (0 + visit | person) + (1 | person)
##     npar     AIC     BIC logLik deviance  Chisq Df Pr(>Chisq)
## m12    5 -2361.8 -2340.7 1185.9  -2371.8                     
## m13    6 -2360.4 -2335.2 1186.2  -2372.4 0.6821  1     0.4089
###p-value is 0.41
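
The same p-value can be computed by hand (a sketch): the likelihood ratio statistic is twice the difference in log-likelihoods, referred to a chi-squared distribution with 1 df.

lrt <- 2 * (as.numeric(logLik(m13)) - as.numeric(logLik(m12)))
pchisq(lrt, df = 1, lower.tail = FALSE)  # matches the anova() p-value of about 0.41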

C6, check if initial age is a significant predictor

Initial age will be the same for the same person: it equals the age at the initial visit. For example, person 101 has 5 visits, and the initial age for all 5 visits is 10.9130732. We want to fit a linear mixed model with bmd as the outcome variable and visit, group, and initial age as predictors. Don’t worry about REML options (both are fine). Please provide the p-value for initial age using any test. Is initial age a significant predictor?

c6 <- calcium %>% group_by(person) %>%
  mutate(initial_age = min(age, na.rm = TRUE))  # age increases with visit, so min(age) equals the age at the first visit
m14 = lmer(bmd~visit+group+initial_age+(0+visit|person) + (1|person), data=c6, REML = TRUE)
anova(m14, ddf = "Kenward-Roger")
## Type III Analysis of Variance Table with Kenward-Roger's method
##               Sum Sq  Mean Sq NumDF  DenDF  F value Pr(>F)    
## visit       0.113759 0.113759     1 100.77 911.7760 <2e-16 ***
## group       0.000082 0.000082     1 109.00   0.6575 0.4192    
## initial_age 0.000064 0.000064     1 108.98   0.5152 0.4744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###Initial age is not a significant predictor

C7, check if initial bmd is a significant predictor

We want to fit a linear mixed model with bmd as the outcome variable and visit, group, initial age, and initial bmd as predictors. Don’t worry about REML options (both are fine). Please provide the p-value for initial bmd using any test. You should find that initial bmd is a very significant predictor. With that being said, when fitting a mixed model to longitudinal outcomes, we generally want to adjust for the baseline outcome as a covariate.

## Creating initial bmd (note: min(bmd) is used as a proxy for baseline bmd; bmd[which.min(visit)] would take the actual first-visit value)
c7 <- c6 %>% group_by(person) %>% mutate(initial_bmd = min(bmd, na.rm = TRUE))
m15 = lmer(bmd~visit+group+initial_age+initial_bmd+(0+visit|person) + (1|person), data=c7, REML = TRUE)
## boundary (singular) fit: see ?isSingular
anova(m15, ddf = "Kenward-Roger")
## Type III Analysis of Variance Table with Kenward-Roger's method
##              Sum Sq Mean Sq NumDF  DenDF   F value Pr(>F)    
## visit       0.14497 0.14497     1 152.53 1268.1175 <2e-16 ***
## group       0.00005 0.00005     1 106.80    0.4755 0.4920    
## initial_age 0.00018 0.00018     1 106.19    1.6150 0.2066    
## initial_bmd 0.48894 0.48894     1 109.04 4277.1121 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Shows initial bmd is very significant (p < 2e-16)

C8, model selection

Comparing the models from Questions 4, 5, 6, and 7, which model is the best in terms of BIC? Make sure the REML option is consistent so that the models are comparable. Print the names of the significant predictors of the selected model. From this question, is there enough statistical evidence that calcium is helpful for increasing bone mineral density in girls?

## Changing previous 2 models to REML = FALSE
m15 = lmer(bmd~visit+group+initial_age+initial_bmd+(0+visit|person) + (1|person), data=c7, REML = FALSE)
## boundary (singular) fit: see ?isSingular
m14 = lmer(bmd~visit+group+initial_age+(0+visit|person) + (1|person), data=c6, REML = FALSE)
BIC(m14, m15, m13, m12)
##     df       BIC
## m14  7 -2329.463
## m15  8 -2794.999
## m13  6 -2335.150
## m12  5 -2340.685
names(which(summary(m15)$coef[,5] < 0.05))
## [1] "visit"       "initial_bmd"
### m15, the model including initial bmd, has the lowest BIC, which makes sense given how significant initial bmd is. Its significant predictors are visit and initial_bmd; group is not significant, so this analysis does not provide enough statistical evidence that calcium increases bone mineral density in girls.