Multiple Regression Model

This dataset is built in R datasets v3.4.3.

load data

head(ChickWeight)

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

summary(ChickWeight)

##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

This data is primarily used for determining how did the four experimental diet received by chickens affect the weight of chikens.
There are four variables in this dataset including: chick & diet are categorical variable, time & weight are continuous variable.
Since the chick variable is an ordered label of final weight within each group, it’s not useful in predicting weights.

Observation

plot(ChickWeight$Time,ChickWeight$weight,xlab = 'time',ylab = 'weight')

cor(ChickWeight$weight,ChickWeight$Time)

## [1] 0.8371017

weight and time variables are highly correlated with a coefficient of 0.8371, the plot also shows a clear trend. However, as time passed, the variation in weights between chicks becomes wider which makes the experiment of diet worth exploring.

plot(ChickWeight$weight~ChickWeight$Diet,xlab = 'diet',ylab = 'weight')

hist(ChickWeight$weight[ChickWeight$Diet=='1'],breaks = 30,main = 'Histogram: Weight for Diet 1')

hist(ChickWeight$weight[ChickWeight$Diet=='2'],breaks = 30,main = 'Histogram: Weight for Diet 2')

hist(ChickWeight$weight[ChickWeight$Diet=='3'],breaks = 30,main = 'Histogram: Weight for Diet 3')

hist(ChickWeight$weight[ChickWeight$Diet=='4'],breaks = 30,main = 'Histogram: Weight for Diet 4')

This categorical variable shows that weight distribution of all diet are skewed to the left. Diet 3 leads to highest weight but interquantile range of weights of all four diets are approximately the same. Further exploration will be neccessary.

plot(ChickWeight$weight~ChickWeight$Chick,xlab = 'ordered label by weight within the same diet',ylab = 'weight')

As mentioned earlier, chick variable is a label created based on the final weight of all chicks. In this case, it would not be worth considering in the linear model since all information were included in the other two variables.

Linear Model

model 0 - all data included: this model ishould not be considered predictive since the chick variable includes final result of weight.

m0 <- lm(weight ~ .,data = ChickWeight)
summary(m0)

## 
## Call:
## lm(formula = weight ~ ., data = ChickWeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.128 -15.976  -2.476  13.844  91.955 
## 
## Coefficients: (3 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.8983     2.2157  12.591  < 2e-16 ***
## Time          8.7152     0.1759  49.538  < 2e-16 ***
## Chick.L     130.2672     9.5424  13.651  < 2e-16 ***
## Chick.Q     -24.9790    10.1028  -2.472 0.013732 *  
## Chick.C     -16.2937    10.4180  -1.564 0.118422    
## Chick^4      16.8300    10.5032   1.602 0.109674    
## Chick^5     -11.8874    10.4181  -1.141 0.254373    
## Chick^6      62.0273    10.2441   6.055 2.67e-09 ***
## Chick^7      -5.7763    10.0555  -0.574 0.565913    
## Chick^8      -2.1871     9.8366  -0.222 0.824137    
## Chick^9      -2.5741     9.5743  -0.269 0.788145    
## Chick^10    -15.8198     9.3250  -1.696 0.090382 .  
## Chick^11    -29.2864     9.0988  -3.219 0.001367 ** 
## Chick^12     12.6115     8.8899   1.419 0.156599    
## Chick^13     45.5732     8.7345   5.218 2.61e-07 ***
## Chick^14      8.4847     8.6499   0.981 0.327089    
## Chick^15    -36.7921     8.5910  -4.283 2.19e-05 ***
## Chick^16    -18.5430     8.5389  -2.172 0.030332 *  
## Chick^17     13.8154     8.5068   1.624 0.104966    
## Chick^18     19.2098     8.4673   2.269 0.023690 *  
## Chick^19    -15.1139     8.3971  -1.800 0.072450 .  
## Chick^20     19.1073     8.3290   2.294 0.022178 *  
## Chick^21     10.3924     8.3016   1.252 0.211177    
## Chick^22    -13.8848     8.2768  -1.678 0.094028 .  
## Chick^23     -2.6091     8.2224  -0.317 0.751133    
## Chick^24      3.7279     8.1876   0.455 0.649070    
## Chick^25    -30.7405     8.2072  -3.746 0.000200 ***
## Chick^26     22.9762     8.2255   2.793 0.005407 ** 
## Chick^27     37.2504     8.1981   4.544 6.86e-06 ***
## Chick^28     11.6756     8.1679   1.429 0.153467    
## Chick^29    -15.8480     8.1893  -1.935 0.053501 .  
## Chick^30     -9.6065     8.2243  -1.168 0.243311    
## Chick^31      2.6487     8.2253   0.322 0.747567    
## Chick^32     15.8318     8.1691   1.938 0.053157 .  
## Chick^33     -4.2143     8.1816  -0.515 0.606700    
## Chick^34     -8.3473     8.1762  -1.021 0.307759    
## Chick^35      0.6771     8.1644   0.083 0.933933    
## Chick^36    -17.3660     8.1675  -2.126 0.033948 *  
## Chick^37     -2.6001     8.1705  -0.318 0.750435    
## Chick^38    -15.5406     8.1727  -1.902 0.057778 .  
## Chick^39     -2.8088     8.2396  -0.341 0.733324    
## Chick^40     39.4063     8.2667   4.767 2.42e-06 ***
## Chick^41    -15.4934     8.2493  -1.878 0.060914 .  
## Chick^42    -27.7127     8.1962  -3.381 0.000775 ***
## Chick^43    -32.3320     8.1715  -3.957 8.64e-05 ***
## Chick^44      9.6068     8.1724   1.176 0.240319    
## Chick^45    -12.8100     8.1665  -1.569 0.117338    
## Chick^46     21.7381     8.1646   2.662 0.007994 ** 
## Chick^47     -6.3382     8.1950  -0.773 0.439623    
## Chick^48    -16.3989     8.1667  -2.008 0.045152 *  
## Chick^49    -25.3853     8.1647  -3.109 0.001978 ** 
## Diet2             NA         NA      NA       NA    
## Diet3             NA         NA      NA       NA    
## Diet4             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.28 on 527 degrees of freedom
## Multiple R-squared:  0.8554, Adjusted R-squared:  0.8416 
## F-statistic: 62.33 on 50 and 527 DF,  p-value: < 2.2e-16

model 1 - Time & Diet: this model is not bad given the standard error of most variables are within 5-10 times the estimate and the p-value of all variables shows that all the factors are considered significant.

m1 <- lm(weight ~ Time + Diet,data = ChickWeight)
summary(m1)

## 
## Call:
## lm(formula = weight ~ Time + Diet, data = ChickWeight)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -136.851  -17.151   -2.595   15.033  141.816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.9244     3.3607   3.251  0.00122 ** 
## Time          8.7505     0.2218  39.451  < 2e-16 ***
## Diet2        16.1661     4.0858   3.957 8.56e-05 ***
## Diet3        36.4994     4.0858   8.933  < 2e-16 ***
## Diet4        30.2335     4.1075   7.361 6.39e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.99 on 573 degrees of freedom
## Multiple R-squared:  0.7453, Adjusted R-squared:  0.7435 
## F-statistic: 419.2 on 4 and 573 DF,  p-value: < 2.2e-16

model 2 - adding a quadratic term (Time\(^2\)): this model has similar advantage has the previous one. Moreover, the p-values this time are even smaller, which make the variables more significant in predicting weight. R-square also increases, but it doesn’t neccessarily garentee that the model is better.

ChickWeight2 <- ChickWeight
ChickWeight2$Time2 <- ChickWeight$Time*ChickWeight$Time

m2 <- lm(weight ~ Time + Diet + Time2, data = ChickWeight2)
summary(m2)

## 
## Call:
## lm(formula = weight ~ Time + Diet + Time2, data = ChickWeight2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -146.372  -16.983   -0.684   15.283  132.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 21.56559    4.18810   5.149 3.60e-07 ***
## Time         5.42189    0.83037   6.530 1.46e-10 ***
## Diet2       16.08443    4.02910   3.992 7.40e-05 ***
## Diet3       36.41776    4.02910   9.039  < 2e-16 ***
## Diet4       30.28713    4.05041   7.478 2.86e-13 ***
## Time2        0.15615    0.03758   4.155 3.74e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.49 on 572 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7506 
## F-statistic: 348.3 on 5 and 572 DF,  p-value: < 2.2e-16

model 3 - interaction between a quadratic term (Time\(^2\)) and a categorical term (Diet): This model has minor improvement in residual standard error and R-square.

m3 <- lm(weight ~ Time + Diet + Time2*Diet, data = ChickWeight2)
summary(m3)

## 
## Call:
## lm(formula = weight ~ Time + Diet + Time2 * Diet, data = ChickWeight2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -143.679  -10.038   -0.696    8.508  125.452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.49227    4.35509   7.920 1.25e-14 ***
## Time         5.53099    0.78421   7.053 5.10e-12 ***
## Diet2        4.23604    5.59678   0.757   0.4494    
## Diet3        2.55418    5.59678   0.456   0.6483    
## Diet4       10.88120    5.60776   1.940   0.0528 .  
## Time2        0.06554    0.03770   1.739   0.0826 .  
## Diet2:Time2  0.07686    0.02541   3.025   0.0026 ** 
## Diet3:Time2  0.21022    0.02541   8.274 9.23e-16 ***
## Diet4:Time2  0.12361    0.02582   4.788 2.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.51 on 569 degrees of freedom
## Multiple R-squared:  0.7807, Adjusted R-squared:  0.7777 
## F-statistic: 253.3 on 8 and 569 DF,  p-value: < 2.2e-16

model 4 - interaction between a quadratic term (Time\(^2\)) and a dichotomous term (Dt): while separating diet into 4 categories is neccessary for determining which diet recipe leads to highest weight, creating a model to predict weight doesn’t neccessarily require this separation. Given the difference in weight as a result of diet 1,2 and diet 3,4 (shown below as boxplot), I’ll try to separate them into Dt A and B.
The improvement of this model is that both ends of the residuals are becoming more symmetrical. Other than that, no obvious improvement was observed.

ChickWeight4 <- ChickWeight2
ChickWeight4$Dt[ChickWeight4$Diet == '1' | ChickWeight4$Diet == '2'] <- 'A'
ChickWeight4$Dt[ChickWeight4$Diet == '3' | ChickWeight4$Diet == '4'] <- 'B'

boxplot(ChickWeight4$weight~ChickWeight4$Dt)

m4 <- lm(weight ~ Time + Diet + Time2 * Dt, data = ChickWeight4)
summary(m4)

## 
## Call:
## lm(formula = weight ~ Time + Diet + Time2 * Dt, data = ChickWeight4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -130.267  -12.371   -0.796   10.607  130.376 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.39325    4.18867   7.256 1.31e-12 ***
## Time         5.47101    0.79500   6.882 1.56e-11 ***
## Diet2       16.65219    3.85814   4.316 1.87e-05 ***
## Diet3       13.84609    4.94763   2.799  0.00531 ** 
## Diet4        8.06980    4.93334   1.636  0.10244    
## Time2        0.09599    0.03691   2.600  0.00955 ** 
## DtB               NA         NA      NA       NA    
## Time2:DtB    0.14017    0.01924   7.285 1.07e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.98 on 571 degrees of freedom
## Multiple R-squared:  0.7738, Adjusted R-squared:  0.7714 
## F-statistic: 325.5 on 6 and 571 DF,  p-value: < 2.2e-16

Residual Analysis

hist(m0$residuals,main = 'Histogram: Residual of Model 0') #nearly normal residuals

qqnorm(m0$residuals) #normal probability plot of the residuals
qqline(m0$residuals) #adds diagonal line to the normal prob plot

hist(m1$residuals,main = 'Histogram: Residual of Model 1')

qqnorm(m1$residuals) #normal probability plot of the residuals
qqline(m1$residuals) #adds diagonal line to the normal prob plot

hist(m2$residuals,main = 'Histogram: Residual of Model 2')

qqnorm(m2$residuals) #normal probability plot of the residuals
qqline(m2$residuals) #adds diagonal line to the normal prob plot

hist(m3$residuals,main = 'Histogram: Residual of Model 3')

qqnorm(m3$residuals) #normal probability plot of the residuals
qqline(m3$residuals) #adds diagonal line to the normal prob plot

hist(m4$residuals,main = 'Histogram: Residual of Model 4')

qqnorm(m4$residuals) #normal probability plot of the residuals
qqline(m4$residuals) #adds diagonal line to the normal prob plot

Conclusion

While indicator and QQ-plot of model 0 performs better than the other 4 models, the model might not be as pragmatic as the others when we are trying to predict the weights using the model.
The QQ-plots of Model 1-4 show that both ends of the graph don’t align with the base line. If the best model has to be chosen, I will pick either model 1 or 2. Else, I would consider all four models unsatisfying and choose to do further investigation (probably more data and/or another type of model) for this dataset.