A medical researcher is interested in predicting survival (in days) in patients undergoing a particular type of operation. A random selection of 54 patients was available for analysis. The following information was available for each patient.
| Variable | Description |
|---|---|
| X1 | Blood clotting score |
| X2 | Prognostic index |
| X3 | Enzyme function test score |
| X4 | Liver function test score |
| X5 | Age in years |
| X6 | Number of years employed |
| X7 | Gender (0 = male, 1 = female) |
| X8 | Moderate alcohol use (1 = yes, 0 = no) |
| X9 | Severe alcohol use (1 = yes, 0 = no) |
| Y | Survival (days) |
Check for Outliers (O) and influential observations (IO)
The plots below can help us evaluate the statistical assumptions in the regression analysis model and evaluate the overall fit of the model applied to the data set.
Starting from the upper-left corner, we can see the Residuals vs Fitted plot. For the regression assumptions to hold, there should be no relationship between the residuals and the predicted (fitted) values: the model should capture all the systematic variance present in the data, leaving only random noise. The Residuals vs Fitted plot below shows signs of a smooth curve, which suggests a problem with the linearity assumption that we should explore in more depth.
Normal Q-Q: If the dependent variable is normally distributed for a given set of predictor values, then the residuals should be normally distributed with mean 0. In our case, the data would nearly meet the normality assumption were it not for the extreme observation at the top of the line.
Scale-Location: This plot is closely related to the Residuals vs Fitted plot. If the constant-variance (homoscedasticity) assumption is met, the points should form a random band around a horizontal line.
Residuals vs Leverage: This plot provides information about individual observations that we can examine further later.
To examine outliers and influential observations in the data, we can create added-variable plots, which show how individual observations affect each estimated coefficient. For each predictor \(X_k\), plot the residuals from regressing the response variable on the other \(k-1\) predictors against the residuals from regressing \(X_k\) on the other \(k-1\) predictors. The slope of the straight line in each plot is the regression coefficient of the corresponding predictor variable in the full model. We can identify a couple of influential observations to examine further.
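The construction of an added-variable plot can be sketched by hand in base R. This is only an illustration: the built-in `mtcars` data stands in for the patient data, and the variable names (`mpg`, `wt`, `hp`, `qsec`) are assumptions unrelated to the survival study. The point it shows is that the slope of the residual-on-residual regression equals the coefficient from the full model.

```r
# Illustrative data only: mtcars stands in for the patient data set.
full_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Residuals from regressing the response on the other predictors (all but wt)
res_y <- resid(lm(mpg ~ hp + qsec, data = mtcars))

# Residuals from regressing the predictor of interest on the other predictors
res_x <- resid(lm(wt ~ hp + qsec, data = mtcars))

# Slope of the added-variable line versus the coefficient in the full model
av_slope  <- unname(coef(lm(res_y ~ res_x))["res_x"])
full_coef <- unname(coef(full_fit)["wt"])

print(c(av_slope = av_slope, full_coef = full_coef))
```

Plotting `res_x` against `res_y` and adding `abline(lm(res_y ~ res_x))` gives the added-variable plot for that predictor; points far from the line, or far out along the horizontal axis, are the ones worth examining.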
Cook’s distance helps us to identify observations that have a disproportionate impact on the values of the model parameters. Cook’s distance values greater than \(4/(n-k-1)\), where \(n\) is the sample size and \(k\) is the number of predictor variables, indicate influential observations. With this rule we can narrow down to observations 5 and 28, which have a disproportionate impact on the model.
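As a rough sketch of the cutoff rule, the threshold can be computed from the sample size and number of predictors stated in the text (\(n = 54\), \(k = 9\)). The helper `flag_influential()` is a hypothetical name introduced here, and `mtcars` is used only as stand-in data; it relies on the fact that \(n-k-1\) is the residual degrees of freedom of the fit.

```r
# Cutoff for the original data described in the text: n = 54 patients, k = 9 predictors
n <- 54
k <- 9
cutoff <- 4 / (n - k - 1)   # 4 / 44
print(round(cutoff, 4))

# Hypothetical helper: observations whose Cook's distance exceeds 4 / (n - k - 1).
# For a model with an intercept, n - k - 1 is the residual degrees of freedom.
flag_influential <- function(model) {
  d <- cooks.distance(model)
  which(d > 4 / df.residual(model))
}

# Illustration on stand-in data (not the patient data)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
flagged  <- flag_influential(demo_fit)
print(flagged)
```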
After removing two data points (5,28) we need to check for Outliers (O) one more time.
After generating the standard diagnostic plots below one more time we can evaluate the statistical assumptions of a regression analysis model to evaluate the overall fit of the model applied to the new data set without extreme observations. The program highlights other interesting data points to consider, but overall it looks like the new data has influential observations but does not have extreme outliers.
For the new added-variable plots, now that the extreme observations are out, we can investigate new data points. In general, it looks like the residuals are spread evenly around the regression coefficient line for each predictor.
To verify that no other extreme observation is disproportionately impacting the model, we compute the Cook’s distance values once again. The plot identifies other influential observations, but none of them exceeds the 50th percentile of the F distribution.
formula = survival ~ blood + index + enzyme + liver + age + nyears + gender + alcoholm + alcoholh
| | dfb.1_ | dfb.blod | dfb.indx | dfb.enzy | dfb.livr | dfb.age | dfb.nyrs | dfb.gnd1 | dfb.alchlm1 | dfb.alchlh1 | dffit | cov.r | cook.d | hat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | -1.4022 | 0.3153 | 0.9073 | 0.6737 | -0.0096 | 0.7636 | -0.5945 | 0.2854 | -0.9696 | -0.4524 | 2.1183 | 0.1636 | 0.3619 | 0.2883 |
| 36 | -0.1037 | 0.1505 | 0.5444 | -0.0833 | -0.1977 | -0.0808 | 0.0095 | -0.1303 | 0.2275 | 0.1898 | -0.7830 | 1.7336 | 0.0615 | 0.4078 |
| 40 | 0.0220 | -0.0089 | -0.0119 | 0.0217 | -0.0135 | -0.0260 | 0.0286 | -0.0074 | 0.0029 | 0.0311 | 0.0668 | 1.8277 | 0.0005 | 0.3055 |
| 41 | -0.0679 | 0.0761 | -0.0200 | -0.0495 | 0.2619 | -0.0175 | -0.0272 | 0.0105 | 0.0885 | -0.0041 | 0.5397 | 1.7368 | 0.0295 | 0.3554 |
| 46 | 0.0072 | -0.0128 | 0.0281 | 0.0133 | 0.0049 | -0.0261 | 0.0219 | 0.0161 | -0.0013 | 0.0426 | 0.0765 | 1.7600 | 0.0006 | 0.2797 |
The influence plot creates a “bubble” plot of Studentized residuals by hat values with the areas of the circles representing the observations proportional to Cook’s distances. Vertical reference lines are drawn at twice and three times the average hat value; horizontal reference lines are drawn at -2, 0, and 2 on the Studentized-residual scale. This plot is helpful to identify influential observations and their impact on the model.
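The vertical reference lines can be reproduced with a little arithmetic, since the average hat value equals the number of estimated coefficients divided by the number of observations. The sketch below assumes that the influence table above refers to the refitted model with 52 observations and 10 coefficients (intercept plus nine predictors); the hat values are copied from that table.

```r
# Assumed dimensions of the refitted model: 52 observations, 10 coefficients
n <- 52
p <- 10
avg_hat <- p / n

# Hat values copied from the influence table (row labels are observation numbers)
hat <- c(`12` = 0.2883, `36` = 0.4078, `40` = 0.3055, `41` = 0.3554, `46` = 0.2797)

# Observations beyond the reference lines at twice and three times the average
beyond_twice  <- names(hat)[hat > 2 * avg_hat]
beyond_thrice <- names(hat)[hat > 3 * avg_hat]
print(list(twice = beyond_twice, thrice = beyond_thrice))

# The identity mean(hat) = p / n, checked on stand-in data (mtcars)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_avg <- mean(hatvalues(demo_fit))
```

Under these assumptions only one of the listed observations lies beyond the first reference line, and none lies beyond the second.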
##
## Call:
## lm(formula = survival ~ blood + index + enzyme + liver + age +
## nyears + gender + alcoholm + alcoholh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.93 -81.21 12.71 83.05 338.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -665.711 195.180 -3.411 0.00144 **
## blood 26.986 17.504 1.542 0.13065
## index 8.577 1.307 6.564 6.12e-08 ***
## enzyme 7.946 1.262 6.297 1.48e-07 ***
## liver 35.185 31.309 1.124 0.26748
## age -4.205 5.863 -0.717 0.47726
## nyears 2.745 5.541 0.496 0.62283
## gender1 50.018 39.743 1.259 0.21515
## alcoholm1 11.608 43.299 0.268 0.78994
## alcoholh1 193.568 59.336 3.262 0.00220 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared: 0.8072, Adjusted R-squared: 0.7659
## F-statistic: 19.53 on 9 and 42 DF, p-value: 2.484e-12
To help us identify multicollinearity we can use the Variance Inflation Factor (VIF) statistic for each \(\beta\) parameter, \((VIF)_i =\frac{1}{1-R^{2}_i}\), \(i=1,2,...,k\), where \(R^{2}_i\) is the coefficient of determination from regressing the \(i\)-th predictor on the remaining predictors. Variables with a VIF greater than 10 must be examined further, as this is an indication of multicollinearity.
| | vif(fit) |
|---|---|
| blood | 1.7319 |
| index | 1.4204 |
| enzyme | 1.9720 |
| liver | 2.7141 |
| age | 12.3641 |
| nyears | 11.8115 |
| gender | 1.1342 |
| alcoholm | 1.3363 |
| alcoholh | 1.3243 |
From the table above we can identify two large values for the variables \(age\) (Age in years) and \(nyears\) (Number of years employed).
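The VIF definition can be sketched directly in base R. The helper `vif_by_hand()` is a hypothetical name, it handles numeric predictors only, and `mtcars` is stand-in data; it is not the routine that produced the table above. The last line inverts the formula to show what the reported VIF for age implies about \(R^{2}_i\).

```r
# Hypothetical helper: VIF for each column of a data frame of numeric predictors,
# computed as 1 / (1 - R^2) from regressing that column on all the others.
vif_by_hand <- function(X) {
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on stand-in data (not the patient data)
X <- mtcars[, c("wt", "hp", "disp", "qsec")]
vifs <- vif_by_hand(X)
print(round(vifs, 4))

# Inverting the formula for the VIF reported for age in the table (12.3641)
r2_age <- 1 - 1 / 12.3641
print(round(r2_age, 4))
```

Read this way, a VIF above 12 means that roughly 92% of the variation in age is explained by the other predictors.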
To help us understand more about the relationships between predictors, two correlation matrices are shown: one that flags large correlation coefficients with symbols and one with the correlation values for all the variables.
| | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | Y |
|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | | | | | | | | | |
| X2 | | 1 | | | | | | | | |
| X3 | | | 1 | | | | | | | |
| X4 | . | . | . | 1 | | | | | | |
| X5 | | | | | 1 | | | | | |
| X6 | | | | | * | 1 | | | | |
| X7 | | | | . | | | 1 | | | |
| X8 | | | | | | | | 1 | | |
| X9 | | | | | | | | . | 1 | |
| Y | | . | , | , | | | | | | 1 |
Legend: [0: " "], [0.3: "."], [0.6: ","], [0.8: "+"], [0.9: "*"], [0.95: "B"], [1: "1"]
| | blood | index | enzyme | liver | age | nyears | gender | alcoholm | alcoholh | survival(Y) |
|---|---|---|---|---|---|---|---|---|---|---|
| blood | 1.0000 | 0.0443 | -0.2814 | 0.3723 | -0.0503 | -0.0158 | -0.0047 | 0.0008 | 0.0449 | 0.0569 |
| index | 0.0443 | 1.0000 | -0.0380 | 0.3630 | -0.0553 | -0.1075 | 0.1135 | 0.1486 | -0.1310 | 0.5361 |
| enzyme | -0.2814 | -0.0380 | 1.0000 | 0.3816 | 0.0006 | -0.0667 | 0.1673 | -0.0372 | 0.0230 | 0.6007 |
| liver | 0.3723 | 0.3630 | 0.3816 | 1.0000 | -0.2443 | -0.2182 | 0.3006 | 0.0705 | -0.0510 | 0.6301 |
| age | -0.0503 | -0.0553 | 0.0006 | -0.2443 | 1.0000 | 0.9486 | -0.0165 | 0.1532 | -0.1238 | -0.1577 |
| nyears | -0.0158 | -0.1075 | -0.0667 | -0.2182 | 0.9486 | 1.0000 | -0.0352 | 0.1436 | -0.1009 | -0.1991 |
| gender | -0.0047 | 0.1135 | 0.1673 | 0.3006 | -0.0165 | -0.0352 | 1.0000 | 0.0478 | -0.0740 | 0.2692 |
| alcoholm | 0.0008 | 0.1486 | -0.0372 | 0.0705 | 0.1532 | 0.1436 | 0.0478 | 1.0000 | -0.4788 | -0.0407 |
| alcoholh | 0.0449 | -0.1310 | 0.0230 | -0.0510 | -0.1238 | -0.1009 | -0.0740 | -0.4788 | 1.0000 | 0.1912 |
| survival(Y) | 0.0569 | 0.5361 | 0.6007 | 0.6301 | -0.1577 | -0.1991 | 0.2692 | -0.0407 | 0.1912 | 1.0000 |
From the correlation matrix we can confirm what the VIF statistic found: age and nyears are highly correlated and are the cause of multicollinearity in the model, with [age|nyears]=[0.9486]. There are other interesting correlations, such as [blood|liver]=[0.3723], [index|liver]=[0.3630], and [enzyme|liver]=[0.3816].
In stepwise regression, variables are added to or deleted from the model one at a time until adding or deleting variables no longer improves the model. Here we use backward stepwise regression: we start with the complete model (all predictors) and remove predictors one by one, at each step dropping the variable whose removal gives the lowest AIC, until no removal lowers the AIC any further.
## Start: AIC=518.39
## survival ~ blood + index + enzyme + liver + age + nyears + gender +
## alcoholm + alcoholh
##
## Df Sum of Sq RSS AIC
## - alcoholm 1 1293 757155 516.48
## - nyears 1 4419 760280 516.69
## - age 1 9255 765117 517.02
## - liver 1 22728 778590 517.93
## - gender 1 28506 784367 518.31
## <none> 755861 518.39
## - blood 1 42776 798637 519.25
## - alcoholh 1 191522 947384 528.13
## - enzyme 1 713627 1469489 550.96
## - index 1 775503 1531365 553.10
##
## Step: AIC=516.48
## survival ~ blood + index + enzyme + liver + age + nyears + gender +
## alcoholh
##
## Df Sum of Sq RSS AIC
## - nyears 1 4498 761653 514.78
## - age 1 9090 766245 515.10
## - liver 1 23490 780645 516.06
## - gender 1 28423 785578 516.39
## <none> 757155 516.48
## - blood 1 42443 799598 517.31
## - alcoholh 1 222975 980130 527.90
## - enzyme 1 712379 1469534 548.96
## - index 1 782911 1540066 551.40
##
## Step: AIC=514.78
## survival ~ blood + index + enzyme + liver + age + gender + alcoholh
##
## Df Sum of Sq RSS AIC
## - age 1 10801 772454 513.52
## - gender 1 26809 788461 514.58
## <none> 761653 514.78
## - liver 1 32126 793778 514.93
## - blood 1 39567 801219 515.42
## - alcoholh 1 227350 989003 526.37
## - enzyme 1 755930 1517583 548.63
## - index 1 817178 1578831 550.69
##
## Step: AIC=513.52
## survival ~ blood + index + enzyme + liver + gender + alcoholh
##
## Df Sum of Sq RSS AIC
## - gender 1 24783 797237 513.16
## <none> 772454 513.52
## - blood 1 34375 806829 513.78
## - liver 1 49132 821585 514.72
## - alcoholh 1 248033 1020487 526.00
## - enzyme 1 747762 1520216 546.72
## - index 1 807492 1579946 548.73
##
## Step: AIC=513.16
## survival ~ blood + index + enzyme + liver + alcoholh
##
## Df Sum of Sq RSS AIC
## - blood 1 28475 825713 512.98
## <none> 797237 513.16
## - liver 1 72996 870233 515.71
## - alcoholh 1 240332 1037569 524.86
## - enzyme 1 746130 1543367 545.51
## - index 1 803077 1600314 547.39
##
## Step: AIC=512.98
## survival ~ index + enzyme + liver + alcoholh
##
## Df Sum of Sq RSS AIC
## <none> 825713 512.98
## - liver 1 203597 1029309 522.44
## - alcoholh 1 255535 1081248 525.00
## - index 1 776306 1602019 545.45
## - enzyme 1 835130 1660843 547.32
##
## Call:
## lm(formula = survival ~ index + enzyme + liver + alcoholh)
##
## Coefficients:
## (Intercept) index enzyme liver alcoholh1
## -601.070 7.947 6.767 75.779 196.028
The final model selected by the stepwise regression is given by \(lm(formula = survival \text{ ~ } index + enzyme + liver + alcoholh)\).
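The AIC values printed in the trace above can be checked from the residual sums of squares. The sketch assumes the form that R's `extractAIC()` uses for linear models, \(n \log(RSS/n) + 2p\), with \(p\) the number of estimated coefficients, and \(n = 52\) observations after the two removals. The helper name `step_aic()` is hypothetical, and `mtcars` is used only to confirm the formula against `extractAIC()`.

```r
# Hypothetical helper: AIC as reported in the stepwise trace for a linear model
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 52                                   # observations after removing two points
aic_full  <- step_aic(755861, n, 10)      # full model: intercept + 9 predictors
aic_final <- step_aic(825713, n, 5)       # index + enzyme + liver + alcoholh
print(round(c(full = aic_full, final = aic_final), 2))

# Confirm the formula against extractAIC() on stand-in data (mtcars)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_aic <- step_aic(sum(resid(demo_fit)^2), nrow(mtcars), length(coef(demo_fit)))
```

Each removal in the trace trades a small increase in RSS for a penalty reduction of 2, which is why dropping weak predictors lowers the AIC.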
In the all-subsets regression procedure, every possible model is inspected and a best model is suggested.
| No. Predictors | blood | index | enzyme | liver | age | gender | alcoholm | alcoholh | R^2 | adj-R^2 | RSS | Cp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 ( 1 ) | | | | * | | | | | 0.3970 | 0.3850 | 2363607.6 | 85.6812 |
| 2 ( 1 ) | | * | * | | | | | | 0.6737 | 0.6604 | 1279137.8 | 26.3456 |
| 3 ( 1 ) | | * | * | | | | | * | 0.7374 | 0.7210 | 1029309.2 | 14.2158 |
| 4 ( 1 ) | | * | * | * | | | | * | 0.7894 | 0.7714 | 825712.6 | 4.7007 |
| 5 ( 1 ) | * | * | * | * | | | | * | 0.7966 | 0.7745 | 797237.1 | 5.0902 |
| 6 ( 1 ) | * | * | * | * | | * | | * | 0.8029 | 0.7767 | 772453.7 | 5.6885 |
| 7 ( 1 ) | * | * | * | * | * | * | | * | 0.8057 | 0.7748 | 761652.6 | 7.0776 |
| 8 ( 1 ) | * | * | * | * | * | * | * | * | 0.8060 | 0.7700 | 760280.0 | 9.0000 |
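The Cp column can be reproduced from the residual sums of squares tabulated above. The sketch assumes the usual definition of Mallows' Cp, \(RSS_p/\hat{\sigma}^2 - n + 2(p+1)\), with \(\hat{\sigma}^2\) taken from the largest (eight-predictor) model and \(n = 52\).

```r
n   <- 52
p   <- 1:8   # number of predictors in each row of the table
rss <- c(2363607.6, 1279137.8, 1029309.2, 825712.6,
         797237.1, 772453.7, 761652.6, 760280.0)

# Error variance estimated from the largest model (8 predictors plus intercept)
s2_full <- rss[8] / (n - 8 - 1)

# Mallows' Cp for each subset size
cp <- rss / s2_full - n + 2 * (p + 1)
print(round(cp, 4))

# Subset size with the smallest Cp
best_by_cp <- p[which.min(cp)]
print(best_by_cp)
```

By this criterion the four-predictor subset would be preferred, which is the same set the backward stepwise procedure selected, while the adjusted R-squared criterion favors the six-predictor subset.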
This plot shows the “best” models for each subset based on the Adjusted R-Squared.
The “best” model given by the all-subsets regression procedure incorporates the following predictors:
| (Intercept) | blood | index | enzyme | liver | age | gender | alcoholm | alcoholh |
|---|---|---|---|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE | FALSE | TRUE |
Although the model suggested by the all-subsets regression has the largest adjusted R-squared, it does not include the \(age\) variable, which is an important variable when trying to predict survival. Moreover, the adjusted R-squared values of the models without and with the \(age\) variable are very close (0.7767 versus 0.7748).
| No. Predictors | blood | index | enzyme | liver | age | gender | alcoholm | alcoholh | \(R^2\) | adj-\(R^2\) | RSS | Cp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6 ( 1 ) | * | * | * | * | | * | | * | 0.8029 | 0.7767 | 772453.7 | 5.6885 |
| 7 ( 1 ) | * | * | * | * | * | * | | * | 0.8057 | 0.7748 | 761652.6 | 7.0776 |
Hence the \(age\) variable should be kept in the model. The “best” subset of predictors is:
| (Intercept) | blood | index | enzyme | liver | age | gender | alcoholm | alcoholh |
|---|---|---|---|---|---|---|---|---|
| TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | TRUE | FALSE | TRUE |
Using the model determined by the all-subsets regression procedure, \(lm(formula = survival \text{ ~ } blood + index + enzyme + liver + age + gender + alcoholh)\), we now check the regression assumptions.
Interpretation: From the graph above we can determine that the residuals of the model are approximately normally distributed, with a negative skew.
normal probability plot
Interpretation: From the QQ-Plot above we can determine that the residuals of the model are approximately normally distributed. To meet the normality assumption, the points on the QQ-Plot should fall on the straight 45-degree line.
Interpretation: The Component+Residuals Plot (partial residuals plot) helps us to look for trends that are different from our linear model. From the component plus residuals plots we can conclude that the assumptions of linearity and \(E(\epsilon)=0\) hold. The linear model seems to be appropriate for the data.
Interpretation: From the plot above we can spot some trends, but they are not strong enough to discard the model. It does not look like there is a systematic relationship between the residuals and the predicted values. The model seems to capture all of the systematic variance present in the data.
By using the spreadLevelPlot() function in R, we can create a scatter plot of the absolute standardized residuals versus the fitted values, with a best-fit line superimposed. The points should form a random horizontal band around the best-fit line. If the model violated the constant-variance assumption, we would see a nonhorizontal line. In this case, there is a problem with heteroscedasticity: the model appears to have multiplicative errors.
Statistics for Linear Model without variable interactions
##
## Call:
## lm(formula = survival ~ blood + index + enzyme + liver + age +
## nyears + gender + alcoholm + alcoholh)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.93 -81.21 12.71 83.05 338.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -665.711 195.180 -3.411 0.00144 **
## blood 26.986 17.504 1.542 0.13065
## index 8.577 1.307 6.564 6.12e-08 ***
## enzyme 7.946 1.262 6.297 1.48e-07 ***
## liver 35.185 31.309 1.124 0.26748
## age -4.205 5.863 -0.717 0.47726
## nyears 2.745 5.541 0.496 0.62283
## gender1 50.018 39.743 1.259 0.21515
## alcoholm1 11.608 43.299 0.268 0.78994
## alcoholh1 193.568 59.336 3.262 0.00220 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared: 0.8072, Adjusted R-squared: 0.7659
## F-statistic: 19.53 on 9 and 42 DF, p-value: 2.484e-12
The spread-level plot helps us to determine whether the new model with interactions still has a heteroscedasticity problem. By adding interaction terms, the previous issue with heteroscedasticity was resolved.
Statistics for Linear Model with variable interactions and transformation
##
## Call:
## lm(formula = survival ~ age + liver + gender + alcoholh + blood:age +
## blood:index + liver:index + age:enzyme)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.003 -71.814 7.571 71.664 305.779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 798.60655 183.26738 4.358 8.03e-05 ***
## age -18.83532 4.12468 -4.566 4.13e-05 ***
## liver -300.76375 70.73640 -4.252 0.000112 ***
## gender1 63.05109 34.85808 1.809 0.077479 .
## alcoholh1 177.34388 45.95638 3.859 0.000377 ***
## age:blood 1.46590 0.55707 2.631 0.011757 *
## blood:index -0.72409 0.42667 -1.697 0.096909 .
## liver:index 4.79637 0.90870 5.278 4.05e-06 ***
## age:enzyme 0.14341 0.01933 7.418 3.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116.9 on 43 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.8221
## F-statistic: 30.47 on 8 and 43 DF, p-value: 2.607e-15
By comparing the model given by the all-subsets procedure with the interaction model, we can determine which is the better model for predicting survival. The adjusted R-squared increased after we added interactions between the predictor variables. The F-statistic also increased compared with the previous model. At this point, we can compare the models using the Akaike Information Criterion (AIC), which takes into account a model’s statistical fit and the number of parameters needed to achieve this fit. The model with the lower AIC is preferred. In this case, the selected model is the one with the variable interactions.
## df AIC
## fit 11 667.9568
## trfit 10 652.8809
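These AIC values are on a different scale from the ones in the stepwise trace, because `AIC()` uses the full Gaussian log-likelihood and counts the error variance as a parameter (hence df of 11 and 10 rather than 10 and 9). A minimal sketch of that relationship follows, assuming \(n = 52\) and the residual sums of squares implied by the outputs in this report; the helper name `aic_lm()` is hypothetical, and `mtcars` is stand-in data used only to confirm the formula.

```r
# Hypothetical helper: the value AIC() returns for a Gaussian linear model,
# from its RSS, sample size n, and number of regression coefficients p
aic_lm <- function(rss, n, p) {
  n * log(rss / n) + n * (1 + log(2 * pi)) + 2 * (p + 1)
}

aic_fit   <- aic_lm(755861, 52, 10)   # model without interactions
aic_trfit <- aic_lm(587806, 52, 9)    # model with interactions
print(round(c(fit = aic_fit, trfit = aic_trfit), 2))

# Confirm the formula against AIC() on stand-in data (mtcars)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_aic <- aic_lm(sum(resid(demo_fit)^2), nrow(mtcars), length(coef(demo_fit)))
```

Because both models are fitted to the same 52 observations, the constant term cancels and the difference in AIC reflects only the change in RSS and in the number of parameters.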
Original data, with outliers and influential observations.
##
## Call:
## lm(formula = survival0 ~ blood0 + index0 + enzyme0 + liver0 +
## age0 + nyears0 + gender0 + alcoholm0 + alcoholh0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -288.53 -133.68 -9.18 89.64 788.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1132.818 269.581 -4.202 0.000127 ***
## blood0 63.041 25.159 2.506 0.015994 *
## index0 9.055 1.981 4.571 3.92e-05 ***
## enzyme0 9.976 1.866 5.347 3.04e-06 ***
## liver0 48.645 47.123 1.032 0.307570
## age0 -2.131 8.715 -0.245 0.807921
## nyears0 1.181 8.299 0.142 0.887463
## gender01 16.724 59.423 0.281 0.779685
## alcoholm01 7.503 65.692 0.114 0.909582
## alcoholh01 320.283 86.061 3.722 0.000559 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 203.7 on 44 degrees of freedom
## Multiple R-squared: 0.7819, Adjusted R-squared: 0.7373
## F-statistic: 17.53 on 9 and 44 DF, p-value: 7.465e-12
The linear model with all of the predictor variables is acceptable as a first model without any modifications. That being said, we still need to examine the model further to improve its accuracy. The model accounts for 73.73% of the variation in the data (adjusted R-squared).
By looking at the anova table we can see that there are four variables with highly significant p-values and one variable with a slightly significant p-value. The remaining variables are not significant. These insignificant variables do not contribute to the model and can be removed. The potential removal of some variables should be explored further to improve the model.
## Analysis of Variance Table
##
## Response: survival0
## Df Sum Sq Mean Sq F value Pr(>F)
## blood0 1 1005152 1005152 24.2329 1.244e-05 ***
## index0 1 1278496 1278496 30.8229 1.531e-06 ***
## enzyme0 1 3442172 3442172 82.9864 1.093e-11 ***
## liver0 1 57862 57862 1.3950 0.2439108
## age0 1 33032 33032 0.7964 0.3770374
## nyears0 1 2656 2656 0.0640 0.8014041
## gender0 1 37 37 0.0009 0.9763252
## alcoholm0 1 150557 150557 3.6297 0.0633039 .
## alcoholh0 1 574491 574491 13.8503 0.0005588 ***
## Residuals 44 1825065 41479
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
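One caution when reading this table: `anova()` on a single `lm` fit reports sequential sums of squares, so each p-value depends on the order in which the terms enter the formula. That is why the p-value for blood0 here (1.244e-05) differs from its t-test p-value in the coefficient summary (0.015994). The sketch below illustrates the order dependence on stand-in data (`mtcars`, with variable names unrelated to the survival study) and shows `drop1()` as one order-independent alternative.

```r
# Same model, two term orders (stand-in data, not the patient data)
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
fit_b <- lm(mpg ~ hp + wt, data = mtcars)

# Sequential sum of squares for wt: entered first versus entered last
ss_wt_first <- anova(fit_a)["wt", "Sum Sq"]
ss_wt_last  <- anova(fit_b)["wt", "Sum Sq"]
print(c(first = ss_wt_first, last = ss_wt_last))

# Order-independent check: drop each term from the full model in turn
marginal <- drop1(fit_a, test = "F")
print(marginal)
```

The marginal tests from `drop1()` agree with the t-tests in `summary()`, so they are a safer basis for deciding which variables to remove.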
Without outliers and influential observations.
##
## Call:
## lm(formula = survival1 ~ blood1 + index1 + enzyme1 + liver1 +
## age1 + nyears1 + gender1 + alcoholm1 + alcoholh1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -227.93 -81.21 12.71 83.05 338.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -665.711 195.180 -3.411 0.00144 **
## blood1 26.986 17.504 1.542 0.13065
## index1 8.577 1.307 6.564 6.12e-08 ***
## enzyme1 7.946 1.262 6.297 1.48e-07 ***
## liver1 35.185 31.309 1.124 0.26748
## age1 -4.205 5.863 -0.717 0.47726
## nyears1 2.745 5.541 0.496 0.62283
## gender11 50.018 39.743 1.259 0.21515
## alcoholm11 11.608 43.299 0.268 0.78994
## alcoholh11 193.568 59.336 3.262 0.00220 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared: 0.8072, Adjusted R-squared: 0.7659
## F-statistic: 19.53 on 9 and 42 DF, p-value: 2.484e-12
After deleting the outliers and influential observations, the model improved. The linear model with all the predictor variables and without outliers noticeably improved our prediction. There is still room for improvement, as we need to look for multicollinearity and correlation among the predictor variables. The model explains 76.59% of the variation in the data (adjusted R-squared) at this point, an improvement over the previous model.
By looking at the anova table we can see that there are two variables with highly significant p-values, one variable with significant p-value, and one variable with a slightly significant p-value. The remaining variables are not significant. It seems as though after removing the influential values and outliers, one variable is no longer highly significant for this linear model.
## Analysis of Variance Table
##
## Response: survival1
## Df Sum Sq Mean Sq F value Pr(>F)
## blood1 1 12670 12670 0.7040 0.406189
## index1 1 1118269 1118269 62.1374 8.164e-10 ***
## enzyme1 1 1692481 1692481 94.0440 2.783e-12 ***
## liver1 1 58863 58863 3.2707 0.077689 .
## age1 1 28117 28117 1.5623 0.218245
## nyears1 1 6730 6730 0.3740 0.544144
## gender1 1 22592 22592 1.2553 0.268899
## alcoholm1 1 32746 32746 1.8196 0.184591
## alcoholh1 1 191522 191522 10.6421 0.002199 **
## Residuals 42 755861 17997
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Without outliers and influential observations, and with the multicollinearity problem fixed.
##
## Call:
## lm(formula = survival3 ~ blood3 + index3 + enzyme3 + liver3 +
## age3 + gender3 + alcoholh3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -232.62 -76.48 9.67 80.78 319.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -703.469 172.341 -4.082 0.000185 ***
## blood3 25.720 17.012 1.512 0.137719
## index3 8.412 1.224 6.871 1.77e-08 ***
## enzyme3 7.724 1.169 6.608 4.30e-08 ***
## liver3 40.079 29.420 1.362 0.180035
## age3 -1.373 1.738 -0.790 0.433812
## gender31 48.349 38.851 1.244 0.219913
## alcoholh31 187.903 51.849 3.624 0.000748 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131.6 on 44 degrees of freedom
## Multiple R-squared: 0.8057, Adjusted R-squared: 0.7748
## F-statistic: 26.06 on 7 and 44 DF, p-value: 1.098e-13
Eliminating the variable “nyears” (number of years employed), which was causing multicollinearity problems, made a small improvement to the model. The model now accounts for 77% of the variation in the data (adjusted R-squared of 0.7748, up from 0.7659). There is still room for improvement.
By looking at the anova table we can see that there are now three variables with highly significant p-values and one variable with a slightly significant p-value. The remaining three variables are not significant. These insignificant variables do not contribute to the model and can be removed. At this point, we can explore variable transformation and variable interactions.
## Analysis of Variance Table
##
## Response: survival3
## Df Sum Sq Mean Sq F value Pr(>F)
## blood3 1 12670 12670 0.7319 0.396893
## index3 1 1118269 1118269 64.6014 3.564e-10 ***
## enzyme3 1 1692481 1692481 97.7732 9.444e-13 ***
## liver3 1 58863 58863 3.4004 0.071919 .
## age3 1 28117 28117 1.6243 0.209187
## gender3 1 20450 20450 1.1814 0.282999
## alcoholh3 1 227350 227350 13.1338 0.000748 ***
## Residuals 44 761653 17310
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variable transformation and interactions
##
## Call:
## lm(formula = survival ~ age + liver + gender + alcoholh + blood:age +
## blood:index + liver:index + age:enzyme)
##
## Residuals:
## Min 1Q Median 3Q Max
## -193.003 -71.814 7.571 71.664 305.779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 798.60655 183.26738 4.358 8.03e-05 ***
## age -18.83532 4.12468 -4.566 4.13e-05 ***
## liver -300.76375 70.73640 -4.252 0.000112 ***
## gender1 63.05109 34.85808 1.809 0.077479 .
## alcoholh1 177.34388 45.95638 3.859 0.000377 ***
## age:blood 1.46590 0.55707 2.631 0.011757 *
## blood:index -0.72409 0.42667 -1.697 0.096909 .
## liver:index 4.79637 0.90870 5.278 4.05e-06 ***
## age:enzyme 0.14341 0.01933 7.418 3.21e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 116.9 on 43 degrees of freedom
## Multiple R-squared: 0.85, Adjusted R-squared: 0.8221
## F-statistic: 30.47 on 8 and 43 DF, p-value: 2.607e-15
After eliminating the variable “nyears” (number of years employed), which was causing multicollinearity problems, and adding interaction terms, the model improved considerably. The model now accounts for 82.21% of the variation in the data (adjusted R-squared), up from 77% for the model without interactions.
By looking at the ANOVA table we can see that all but one of the terms are significant at the 5% level; only gender has a non-significant p-value. If we take this variable out, the model’s overall fit goes down.
## Analysis of Variance Table
##
## Response: survival
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 97533 97533 7.1349 0.0106314 *
## liver 1 1458772 1458772 106.7140 3.190e-13 ***
## gender 1 27683 27683 2.0251 0.1619307
## alcoholh 1 207678 207678 15.1923 0.0003351 ***
## age:blood 1 143279 143279 10.4813 0.0023252 **
## blood:index 1 375633 375633 27.4788 4.566e-06 ***
## liver:index 1 269271 269271 19.6981 6.218e-05 ***
## age:enzyme 1 752196 752196 55.0256 3.211e-09 ***
## Residuals 43 587806 13670
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1