This dataset is built in R datasets v3.4.3.
load data
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
summary(ChickWeight)
## weight Time Chick Diet
## Min. : 35.0 Min. : 0.00 13 : 12 1:220
## 1st Qu.: 63.0 1st Qu.: 4.00 9 : 12 2:120
## Median :103.0 Median :10.00 20 : 12 3:120
## Mean :121.8 Mean :10.72 10 : 12 4:118
## 3rd Qu.:163.8 3rd Qu.:16.00 17 : 12
## Max. :373.0 Max. :21.00 19 : 12
## (Other):506
This data is primarily used for determining how did the four experimental diet received by chickens affect the weight of chikens.
There are four variables in this dataset including: chick & diet are categorical variable, time & weight are continuous variable.
Since the chick variable is an ordered label of final weight within each group, it’s not useful in predicting weights.
Observation
plot(ChickWeight$Time,ChickWeight$weight,xlab = 'time',ylab = 'weight')
cor(ChickWeight$weight,ChickWeight$Time)
## [1] 0.8371017
weight and time variables are highly correlated with a coefficient of 0.8371, the plot also shows a clear trend. However, as time passed, the variation in weights between chicks becomes wider which makes the experiment of diet worth exploring.
plot(ChickWeight$weight~ChickWeight$Diet,xlab = 'diet',ylab = 'weight')
hist(ChickWeight$weight[ChickWeight$Diet=='1'],breaks = 30,main = 'Histogram: Weight for Diet 1')
hist(ChickWeight$weight[ChickWeight$Diet=='2'],breaks = 30,main = 'Histogram: Weight for Diet 2')
hist(ChickWeight$weight[ChickWeight$Diet=='3'],breaks = 30,main = 'Histogram: Weight for Diet 3')
hist(ChickWeight$weight[ChickWeight$Diet=='4'],breaks = 30,main = 'Histogram: Weight for Diet 4')
This categorical variable shows that weight distribution of all diet are skewed to the left. Diet 3 leads to highest weight but interquantile range of weights of all four diets are approximately the same. Further exploration will be neccessary.
plot(ChickWeight$weight~ChickWeight$Chick,xlab = 'ordered label by weight within the same diet',ylab = 'weight')
As mentioned earlier, chick variable is a label created based on the final weight of all chicks. In this case, it would not be worth considering in the linear model since all information were included in the other two variables.
Linear Model
model 0 - all data included: this model ishould not be considered predictive since the chick variable includes final result of weight.
m0 <- lm(weight ~ .,data = ChickWeight)
summary(m0)
##
## Call:
## lm(formula = weight ~ ., data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.128 -15.976 -2.476 13.844 91.955
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.8983 2.2157 12.591 < 2e-16 ***
## Time 8.7152 0.1759 49.538 < 2e-16 ***
## Chick.L 130.2672 9.5424 13.651 < 2e-16 ***
## Chick.Q -24.9790 10.1028 -2.472 0.013732 *
## Chick.C -16.2937 10.4180 -1.564 0.118422
## Chick^4 16.8300 10.5032 1.602 0.109674
## Chick^5 -11.8874 10.4181 -1.141 0.254373
## Chick^6 62.0273 10.2441 6.055 2.67e-09 ***
## Chick^7 -5.7763 10.0555 -0.574 0.565913
## Chick^8 -2.1871 9.8366 -0.222 0.824137
## Chick^9 -2.5741 9.5743 -0.269 0.788145
## Chick^10 -15.8198 9.3250 -1.696 0.090382 .
## Chick^11 -29.2864 9.0988 -3.219 0.001367 **
## Chick^12 12.6115 8.8899 1.419 0.156599
## Chick^13 45.5732 8.7345 5.218 2.61e-07 ***
## Chick^14 8.4847 8.6499 0.981 0.327089
## Chick^15 -36.7921 8.5910 -4.283 2.19e-05 ***
## Chick^16 -18.5430 8.5389 -2.172 0.030332 *
## Chick^17 13.8154 8.5068 1.624 0.104966
## Chick^18 19.2098 8.4673 2.269 0.023690 *
## Chick^19 -15.1139 8.3971 -1.800 0.072450 .
## Chick^20 19.1073 8.3290 2.294 0.022178 *
## Chick^21 10.3924 8.3016 1.252 0.211177
## Chick^22 -13.8848 8.2768 -1.678 0.094028 .
## Chick^23 -2.6091 8.2224 -0.317 0.751133
## Chick^24 3.7279 8.1876 0.455 0.649070
## Chick^25 -30.7405 8.2072 -3.746 0.000200 ***
## Chick^26 22.9762 8.2255 2.793 0.005407 **
## Chick^27 37.2504 8.1981 4.544 6.86e-06 ***
## Chick^28 11.6756 8.1679 1.429 0.153467
## Chick^29 -15.8480 8.1893 -1.935 0.053501 .
## Chick^30 -9.6065 8.2243 -1.168 0.243311
## Chick^31 2.6487 8.2253 0.322 0.747567
## Chick^32 15.8318 8.1691 1.938 0.053157 .
## Chick^33 -4.2143 8.1816 -0.515 0.606700
## Chick^34 -8.3473 8.1762 -1.021 0.307759
## Chick^35 0.6771 8.1644 0.083 0.933933
## Chick^36 -17.3660 8.1675 -2.126 0.033948 *
## Chick^37 -2.6001 8.1705 -0.318 0.750435
## Chick^38 -15.5406 8.1727 -1.902 0.057778 .
## Chick^39 -2.8088 8.2396 -0.341 0.733324
## Chick^40 39.4063 8.2667 4.767 2.42e-06 ***
## Chick^41 -15.4934 8.2493 -1.878 0.060914 .
## Chick^42 -27.7127 8.1962 -3.381 0.000775 ***
## Chick^43 -32.3320 8.1715 -3.957 8.64e-05 ***
## Chick^44 9.6068 8.1724 1.176 0.240319
## Chick^45 -12.8100 8.1665 -1.569 0.117338
## Chick^46 21.7381 8.1646 2.662 0.007994 **
## Chick^47 -6.3382 8.1950 -0.773 0.439623
## Chick^48 -16.3989 8.1667 -2.008 0.045152 *
## Chick^49 -25.3853 8.1647 -3.109 0.001978 **
## Diet2 NA NA NA NA
## Diet3 NA NA NA NA
## Diet4 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.28 on 527 degrees of freedom
## Multiple R-squared: 0.8554, Adjusted R-squared: 0.8416
## F-statistic: 62.33 on 50 and 527 DF, p-value: < 2.2e-16
model 1 - Time & Diet: this model is not bad given the standard error of most variables are within 5-10 times the estimate and the p-value of all variables shows that all the factors are considered significant.
m1 <- lm(weight ~ Time + Diet,data = ChickWeight)
summary(m1)
##
## Call:
## lm(formula = weight ~ Time + Diet, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -136.851 -17.151 -2.595 15.033 141.816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.9244 3.3607 3.251 0.00122 **
## Time 8.7505 0.2218 39.451 < 2e-16 ***
## Diet2 16.1661 4.0858 3.957 8.56e-05 ***
## Diet3 36.4994 4.0858 8.933 < 2e-16 ***
## Diet4 30.2335 4.1075 7.361 6.39e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.99 on 573 degrees of freedom
## Multiple R-squared: 0.7453, Adjusted R-squared: 0.7435
## F-statistic: 419.2 on 4 and 573 DF, p-value: < 2.2e-16
model 2 - adding a quadratic term (Time\(^2\)): this model has similar advantage has the previous one. Moreover, the p-values this time are even smaller, which make the variables more significant in predicting weight. R-square also increases, but it doesn’t neccessarily garentee that the model is better.
ChickWeight2 <- ChickWeight
ChickWeight2$Time2 <- ChickWeight$Time*ChickWeight$Time
m2 <- lm(weight ~ Time + Diet + Time2, data = ChickWeight2)
summary(m2)
##
## Call:
## lm(formula = weight ~ Time + Diet + Time2, data = ChickWeight2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -146.372 -16.983 -0.684 15.283 132.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.56559 4.18810 5.149 3.60e-07 ***
## Time 5.42189 0.83037 6.530 1.46e-10 ***
## Diet2 16.08443 4.02910 3.992 7.40e-05 ***
## Diet3 36.41776 4.02910 9.039 < 2e-16 ***
## Diet4 30.28713 4.05041 7.478 2.86e-13 ***
## Time2 0.15615 0.03758 4.155 3.74e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.49 on 572 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7506
## F-statistic: 348.3 on 5 and 572 DF, p-value: < 2.2e-16
model 3 - interaction between a quadratic term (Time\(^2\)) and a categorical term (Diet): This model has minor improvement in residual standard error and R-square.
m3 <- lm(weight ~ Time + Diet + Time2*Diet, data = ChickWeight2)
summary(m3)
##
## Call:
## lm(formula = weight ~ Time + Diet + Time2 * Diet, data = ChickWeight2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -143.679 -10.038 -0.696 8.508 125.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.49227 4.35509 7.920 1.25e-14 ***
## Time 5.53099 0.78421 7.053 5.10e-12 ***
## Diet2 4.23604 5.59678 0.757 0.4494
## Diet3 2.55418 5.59678 0.456 0.6483
## Diet4 10.88120 5.60776 1.940 0.0528 .
## Time2 0.06554 0.03770 1.739 0.0826 .
## Diet2:Time2 0.07686 0.02541 3.025 0.0026 **
## Diet3:Time2 0.21022 0.02541 8.274 9.23e-16 ***
## Diet4:Time2 0.12361 0.02582 4.788 2.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.51 on 569 degrees of freedom
## Multiple R-squared: 0.7807, Adjusted R-squared: 0.7777
## F-statistic: 253.3 on 8 and 569 DF, p-value: < 2.2e-16
model 4 - interaction between a quadratic term (Time\(^2\)) and a dichotomous term (Dt): while separating diet into 4 categories is neccessary for determining which diet recipe leads to highest weight, creating a model to predict weight doesn’t neccessarily require this separation. Given the difference in weight as a result of diet 1,2 and diet 3,4 (shown below as boxplot), I’ll try to separate them into Dt A and B.
The improvement of this model is that both ends of the residuals are becoming more symmetrical. Other than that, no obvious improvement was observed.
ChickWeight4 <- ChickWeight2
ChickWeight4$Dt[ChickWeight4$Diet == '1' | ChickWeight4$Diet == '2'] <- 'A'
ChickWeight4$Dt[ChickWeight4$Diet == '3' | ChickWeight4$Diet == '4'] <- 'B'
boxplot(ChickWeight4$weight~ChickWeight4$Dt)
m4 <- lm(weight ~ Time + Diet + Time2 * Dt, data = ChickWeight4)
summary(m4)
##
## Call:
## lm(formula = weight ~ Time + Diet + Time2 * Dt, data = ChickWeight4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130.267 -12.371 -0.796 10.607 130.376
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.39325 4.18867 7.256 1.31e-12 ***
## Time 5.47101 0.79500 6.882 1.56e-11 ***
## Diet2 16.65219 3.85814 4.316 1.87e-05 ***
## Diet3 13.84609 4.94763 2.799 0.00531 **
## Diet4 8.06980 4.93334 1.636 0.10244
## Time2 0.09599 0.03691 2.600 0.00955 **
## DtB NA NA NA NA
## Time2:DtB 0.14017 0.01924 7.285 1.07e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.98 on 571 degrees of freedom
## Multiple R-squared: 0.7738, Adjusted R-squared: 0.7714
## F-statistic: 325.5 on 6 and 571 DF, p-value: < 2.2e-16
Residual Analysis
hist(m0$residuals,main = 'Histogram: Residual of Model 0') #nearly normal residuals
qqnorm(m0$residuals) #normal probability plot of the residuals
qqline(m0$residuals) #adds diagonal line to the normal prob plot
hist(m1$residuals,main = 'Histogram: Residual of Model 1')
qqnorm(m1$residuals) #normal probability plot of the residuals
qqline(m1$residuals) #adds diagonal line to the normal prob plot
hist(m2$residuals,main = 'Histogram: Residual of Model 2')
qqnorm(m2$residuals) #normal probability plot of the residuals
qqline(m2$residuals) #adds diagonal line to the normal prob plot
hist(m3$residuals,main = 'Histogram: Residual of Model 3')
qqnorm(m3$residuals) #normal probability plot of the residuals
qqline(m3$residuals) #adds diagonal line to the normal prob plot
hist(m4$residuals,main = 'Histogram: Residual of Model 4')
qqnorm(m4$residuals) #normal probability plot of the residuals
qqline(m4$residuals) #adds diagonal line to the normal prob plot
Conclusion
While indicator and QQ-plot of model 0 performs better than the other 4 models, the model might not be as pragmatic as the others when we are trying to predict the weights using the model.
The QQ-plots of Model 1-4 show that both ends of the graph don’t align with the base line. If the best model has to be chosen, I will pick either model 1 or 2. Else, I would consider all four models unsatisfying and choose to do further investigation (probably more data and/or another type of model) for this dataset.