1 Description of the data

A medical researcher is interested in predicting survival (in days) in patients undergoing a particular type of operation. A random selection of 54 patients was available for analysis. The following information was available for each patient.

Variable Description
X1 Blood clotting score
X2 Prognostic index
X3 Enzyme function test score
X4 Liver function test score
X5 Age in years
X6 Number of years employed
X7 Gender (0 = male, 1 = female)
X8 Moderate alcohol use (1 = yes, 0 = no)
X9 Severe alcohol use (1 = yes, 0 = no)
Y Survival (days)

2 Linear Model

Check for Outliers (O) and influential observations (IO)

2.1 Standard Diagnostic Plots

The plots below can help us evaluate the statistical assumptions in the regression analysis model and evaluate the overall fit of the model applied to the data set.

Starting from the upper-left corner, we can see the Residuals vs Fitted plot. For the regression assumptions to hold, there should be no relationship between the residuals and the predicted (fitted) values: the model should capture all the systematic variance present in the data, leaving only random noise. The plot below shows signs of a smooth curve, which suggests a problem with the linearity assumption that we should explore in more depth.

Normal Q-Q: If the dependent variable is normally distributed for a fixed set of predictor values, then the residual values should also be normally distributed with mean 0, and the points should fall close to the reference line. In our case, the data nearly meet the normality assumption, and would have if it were not for the extreme observation at the top of the line.

Scale-Location: This plot is closely related to the Residuals vs Fitted plot. If the constant-variance assumption is met, the data points should form a random band around the horizontal line.

Residuals vs Leverage: This plot provides information about individual observations that we can examine further later.
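As a minimal sketch of how these four panels can be produced in base R, the example below fits a small model and calls plot() on it. The data are simulated stand-ins (the variable names and values are hypothetical, not the patient data used in this report).

```r
# Simulated stand-in data; names and values are hypothetical
set.seed(1)
n <- 54
dat <- data.frame(index = rnorm(n, 63, 17), enzyme = rnorm(n, 77, 21))
dat$survival <- 100 + 8 * dat$index + 8 * dat$enzyme + rnorm(n, sd = 130)

fit <- lm(survival ~ index + enzyme, data = dat)

# plot() on an lm object draws, by default, Residuals vs Fitted,
# Normal Q-Q, Scale-Location and Residuals vs Leverage
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous plotting layout
```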

2.2 Variable Plots

To examine outliers and influential observations in the data, we can create added-variable plots, which show how individual observations affect each estimated coefficient. For each predictor \(X_k\), plot the residuals from regressing the response variable on the other \(k-1\) predictors against the residuals from regressing \(X_k\) on the same \(k-1\) predictors. The slope of the straight line in each plot equals the regression coefficient of the corresponding predictor variable in the full model. We can identify a couple of influential observations to examine further.
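The construction of an added-variable plot can be sketched by hand in base R. The example uses simulated data with hypothetical predictors x1, x2 and x3, and checks that the slope of the plotted line reproduces the coefficient from the full model.

```r
# Simulated data; x1, x2, x3 and y are hypothetical names
set.seed(2)
n <- 54
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + 0.5 * dat$x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Added-variable plot for x1
e_y <- resid(lm(y ~ x2 + x3, data = dat))   # response adjusted for the other predictors
e_x <- resid(lm(x1 ~ x2 + x3, data = dat))  # x1 adjusted for the other predictors

plot(e_x, e_y, xlab = "x1 | others", ylab = "y | others")
av_line <- lm(e_y ~ e_x)
abline(av_line)

# The slope of the added-variable line matches the coefficient of x1
c(av_slope = unname(coef(av_line)[2]), full_model = unname(coef(fit)["x1"]))
```

Points that sit far from the line, or far out along the horizontal axis, are the ones that pull most strongly on that coefficient.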


2.3 Cook’s Distance

This method helps us to identify observations that have a disproportionate impact on the values of the model parameters. Cook’s Distance values greater than \(4/(n-k-1)\), where \(n\) is the size of the sample and \(k\) is the number of predictor variables, indicate influential observations. Now we can narrow down to the observations \((5,28)\) that are causing a disproportionate impact on the model.
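For the original data, with \(n=54\) and \(k=9\), the cutoff is \(4/44 \approx 0.091\). The sketch below applies the same rule in base R to simulated data in which one influential point has been planted on purpose; the data and the planted point are hypothetical.

```r
# Simulated data with one planted influential observation (hypothetical)
set.seed(3)
n <- 54
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$x1[5] <- 4     # high-leverage predictor value
dat$y[5]  <- -20   # response far from what the model predicts there

fit <- lm(y ~ x1 + x2, data = dat)
k <- length(coef(fit)) - 1          # number of predictors
cutoff <- 4 / (n - k - 1)

cd <- cooks.distance(fit)
flagged <- which(cd > cutoff)       # observations above the cutoff
flagged

plot(cd, type = "h", ylab = "Cook's distance")
abline(h = cutoff, lty = 2)
```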


3 Linear Model - Without Extreme Observations - First Check

After removing two data points (5,28) we need to check for Outliers (O) one more time.

3.1 Standard Diagnostic Plots

After generating the standard diagnostic plots one more time, we can re-evaluate the statistical assumptions of the regression analysis and the overall fit of the model on the new data set without the extreme observations. The plots still label a few data points worth considering, but overall the new data appear to contain influential observations without extreme outliers.

3.2 Variable Plots

In the new added-variable plots, now that the extreme observations are out, we can investigate other data points. In general, the residuals are spread evenly around the regression coefficient line for each predictor.


3.3 Cook’s Distance

To verify that no other extreme observation is disproportionately impacting the model, we compute the Cook’s Distance values once again. The plot identifies other influential observations, but none of them exceeds the 50th percentile of the F distribution.
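A minimal sketch of the percentile check, assuming the common convention of comparing Cook's distance with the median of an F distribution on \(k+1\) and \(n-k-1\) degrees of freedom. The value 0.3619 is the largest Cook's distance listed in the influence table of this report.

```r
n <- 52   # observations remaining after dropping two points
k <- 9    # predictors in the full model

# 50th percentile (median) of F(k + 1, n - k - 1)
cutoff_f <- qf(0.5, df1 = k + 1, df2 = n - k - 1)
cutoff_f

# Largest Cook's distance listed in the influence table of this report
max_cook <- 0.3619
max_cook > cutoff_f   # TRUE would signal a point that needs attention
```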

4 Linear Model - Influential observations

formula = survival ~ blood + index + enzyme + liver + age + nyears + gender + alcoholm + alcoholh

Output Statistics
dfb.1_ dfb.blod dfb.indx dfb.enzy dfb.livr dfb.age dfb.nyrs dfb.gnd1 dfb.alchlm1 dfb.alchlh1 dffit cov.r cook.d hat
12 -1.4022 0.3153 0.9073 0.6737 -0.0096 0.7636 -0.5945 0.2854 -0.9696 -0.4524 2.1183 0.1636 0.3619 0.2883
36 -0.1037 0.1505 0.5444 -0.0833 -0.1977 -0.0808 0.0095 -0.1303 0.2275 0.1898 -0.7830 1.7336 0.0615 0.4078
40 0.0220 -0.0089 -0.0119 0.0217 -0.0135 -0.0260 0.0286 -0.0074 0.0029 0.0311 0.0668 1.8277 0.0005 0.3055
41 -0.0679 0.0761 -0.0200 -0.0495 0.2619 -0.0175 -0.0272 0.0105 0.0885 -0.0041 0.5397 1.7368 0.0295 0.3554
46 0.0072 -0.0128 0.0281 0.0133 0.0049 -0.0261 0.0219 0.0161 -0.0013 0.0426 0.0765 1.7600 0.0006 0.2797

The influence plot creates a “bubble” plot of Studentized residuals by hat values with the areas of the circles representing the observations proportional to Cook’s distances. Vertical reference lines are drawn at twice and three times the average hat value; horizontal reference lines are drawn at -2, 0, and 2 on the Studentized-residual scale. This plot is helpful to identify influential observations and their impact on the model.
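The columns in the table above (dfb, dffit, cov.r, cook.d, hat) match what base R's influence.measures() returns. The sketch below, on simulated data with hypothetical variable names, lists the flagged observations and draws a rough base-R version of the bubble plot with the same reference lines.

```r
# Simulated data; names and values are hypothetical
set.seed(4)
n <- 52
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

im <- influence.measures(fit)
colnames(im$infmat)                           # dfb.*, dffit, cov.r, cook.d, hat
flagged <- which(apply(im$is.inf, 1, any))    # rows flagged on any measure
round(im$infmat[flagged, , drop = FALSE], 4)

# Rough bubble plot: studentized residuals against hat values
h  <- hatvalues(fit)
rs <- rstudent(fit)
cd <- cooks.distance(fit)
plot(h, rs, cex = 10 * sqrt(cd),              # circle area grows with Cook's distance
     xlab = "Hat values", ylab = "Studentized residuals")
abline(v = c(2, 3) * mean(h), lty = 2)        # twice and three times the average hat value
abline(h = c(-2, 0, 2), lty = 2)
```

The average hat value equals \((k+1)/n\), so the vertical reference lines depend only on the number of coefficients and the sample size.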


5 Summary Statistics of the Model

## 
## Call:
## lm(formula = survival ~ blood + index + enzyme + liver + age + 
##     nyears + gender + alcoholm + alcoholh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -227.93  -81.21   12.71   83.05  338.26 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -665.711    195.180  -3.411  0.00144 ** 
## blood         26.986     17.504   1.542  0.13065    
## index          8.577      1.307   6.564 6.12e-08 ***
## enzyme         7.946      1.262   6.297 1.48e-07 ***
## liver         35.185     31.309   1.124  0.26748    
## age           -4.205      5.863  -0.717  0.47726    
## nyears         2.745      5.541   0.496  0.62283    
## gender1       50.018     39.743   1.259  0.21515    
## alcoholm1     11.608     43.299   0.268  0.78994    
## alcoholh1    193.568     59.336   3.262  0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared:  0.8072, Adjusted R-squared:  0.7659 
## F-statistic: 19.53 on 9 and 42 DF,  p-value: 2.484e-12

6 Check for Multicollinearity

6.1 Variance Inflation Factors

To help us identify multicollinearity we can use the Variance Inflation Factor (VIF) of each \(\beta\) parameter, where \((VIF)_i =\frac{1}{1-R^{2}_i}\), \(i=1,2,...,k\), and \(R^{2}_i\) is the coefficient of determination obtained by regressing the \(i\)-th predictor on the remaining predictors. Variables with a VIF greater than 10 must be examined further, as this is an indication of multicollinearity.

VIF Table
vif(fit)
blood 1.7319
index 1.4204
enzyme 1.9720
liver 2.7141
age 12.3641
nyears 11.8115
gender 1.1342
alcoholm 1.3363
alcoholh 1.3243

From the table above we can identify two large values for the variables \(age\) (Age in years) and \(nyears\) (Number of years employed).
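The VIF formula can be applied directly in base R by regressing each predictor on the others. This is a minimal sketch on simulated data in which nyears is built to be nearly a linear function of age; the data are hypothetical and only the variable names echo this report.

```r
# Simulated predictors; nyears is constructed to track age closely
set.seed(5)
n <- 52
dat <- data.frame(age = rnorm(n, 50, 10), enzyme = rnorm(n, 77, 21))
dat$nyears <- dat$age - 22 + rnorm(n, sd = 1)

predictors <- c("age", "nyears", "enzyme")

# VIF_i = 1 / (1 - R^2_i), with R^2_i from regressing predictor i on the rest
vifs <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = dat))$r.squared
  1 / (1 - r2)
})

round(vifs, 2)
names(vifs)[vifs > 10]   # predictors above the usual cutoff of 10
```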


6.2 Correlation Matrix

To help us understand more about the relation between predictors, two correlation matrices are shown: one to identify large correlation coefficients and the other with the correlation values for all the variables.

    X1 X2 X3 X4 X5 X6 X7 X8 X9 Y
X1  1
X2     1
X3        1
X4  .  .  .  1
X5              1
X6              *  1
X7           .        1
X8                       1
X9                       .  1
Y      .  ,  ,                 1

Legend (cutpoints for the absolute correlation): 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Correlation Matrix
blood index enzyme liver age nyears gender alcoholm alcoholh survival(Y)
blood 1.0000 0.0443 -0.2814 0.3723 -0.0503 -0.0158 -0.0047 0.0008 0.0449 0.0569
index 0.0443 1.0000 -0.0380 0.3630 -0.0553 -0.1075 0.1135 0.1486 -0.1310 0.5361
enzyme -0.2814 -0.0380 1.0000 0.3816 0.0006 -0.0667 0.1673 -0.0372 0.0230 0.6007
liver 0.3723 0.3630 0.3816 1.0000 -0.2443 -0.2182 0.3006 0.0705 -0.0510 0.6301
age -0.0503 -0.0553 0.0006 -0.2443 1.0000 0.9486 -0.0165 0.1532 -0.1238 -0.1577
nyears -0.0158 -0.1075 -0.0667 -0.2182 0.9486 1.0000 -0.0352 0.1436 -0.1009 -0.1991
gender -0.0047 0.1135 0.1673 0.3006 -0.0165 -0.0352 1.0000 0.0478 -0.0740 0.2692
alcoholm 0.0008 0.1486 -0.0372 0.0705 0.1532 0.1436 0.0478 1.0000 -0.4788 -0.0407
alcoholh 0.0449 -0.1310 0.0230 -0.0510 -0.1238 -0.1009 -0.0740 -0.4788 1.0000 0.1912
survival(Y) 0.0569 0.5361 0.6007 0.6301 -0.1577 -0.1991 0.2692 -0.0407 0.1912 1.0000

From the correlation matrix we can confirm what the VIF statistic found: age and nyears are highly correlated and are the cause of multicollinearity in the model, with [age|nyears]=[0.9486]. There are other interesting, more moderate correlations, such as [blood|liver]=[0.3723], [index|liver]=[0.3630], and [enzyme|liver]=[0.3816].
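Both matrices can be obtained in base R with cor() and symnum(). The sketch below uses simulated data (hypothetical values; only the variable names follow this report) and also lists the pairs whose absolute correlation exceeds 0.9.

```r
# Simulated data; nyears is constructed to track age
set.seed(6)
n <- 52
dat <- data.frame(age = rnorm(n, 50, 10), liver = rnorm(n, 2.7, 1))
dat$nyears <- dat$age - 22 + rnorm(n, sd = 3)
dat$survival <- 300 + 150 * dat$liver + rnorm(n, sd = 150)

cm <- cor(dat)
round(cm, 4)   # numeric correlation matrix
symnum(cm)     # symbolic version; default cutpoints 0.3, 0.6, 0.8, 0.9, 0.95

# Pairs with absolute correlation above 0.9 (upper triangle only)
high <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, "row"]],
           var2 = colnames(cm)[high[, "col"]],
           r    = cm[high])
```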

7 Summary Statistics of the Model without correlated variables

## 
## Call:
## lm(formula = survival ~ blood + index + enzyme + liver + age + 
##     gender + alcoholm + alcoholh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -236.01  -75.65    6.75   77.32  328.24 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -707.253    174.704  -4.048 0.000211 ***
## blood         25.847     17.200   1.503 0.140205    
## index          8.390      1.240   6.767 2.80e-08 ***
## enzyme         7.743      1.183   6.544 5.91e-08 ***
## liver         39.507     29.804   1.326 0.191981    
## age           -1.438      1.772  -0.812 0.421510    
## gender1       48.439     39.266   1.234 0.224048    
## alcoholm1     11.957     42.912   0.279 0.781866    
## alcoholh1    195.283     58.714   3.326 0.001810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 133 on 43 degrees of freedom
## Multiple R-squared:  0.806,  Adjusted R-squared:   0.77 
## F-statistic: 22.34 on 8 and 43 DF,  p-value: 5.662e-13

The Adjusted R-squared and the F-statistic increased compared to the previous model after we removed the nyears variable. Now we can compare the two models using the Akaike Information Criterion (AIC), which takes into account a model’s statistical fit and the number of parameters needed to achieve this fit. The model with the lower AIC is preferred; here that is fit2, the model without nyears.

##      df      AIC
## fit  11 667.9568
## fit2 10 666.2599
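A comparison of this kind can be reproduced with AIC() in base R. The sketch below uses simulated data (hypothetical values) and also recomputes the criterion by hand from the log-likelihood. The df column counts the regression coefficients plus the error variance, which is why the nine-predictor model above reports df = 11.

```r
# Simulated data; names follow this report, values are hypothetical
set.seed(7)
n <- 52
dat <- data.frame(index = rnorm(n, 63, 17), age = rnorm(n, 50, 10))
dat$nyears <- dat$age - 22 + rnorm(n, sd = 3)
dat$survival <- 100 + 8 * dat$index + rnorm(n, sd = 130)

fit  <- lm(survival ~ index + age + nyears, data = dat)
fit2 <- lm(survival ~ index + age, data = dat)   # nyears dropped

AIC(fit, fit2)   # one row per model: df and AIC

# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
ll <- logLik(fit2)
aic_by_hand <- -2 * as.numeric(ll) + 2 * attr(ll, "df")
aic_by_hand
```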

7.1 Second test of Variance Inflation Factors without correlated variables

There are multiple ways to fix multicollinearity. In our case we remove the nyears variable, as it has greater correlation values than age, and it seems to be unnecessary for this model.

VIF Table
vif(fit2)
blood 1.7020
index 1.3016
enzyme 1.7652
liver 2.5034
age 1.1489
gender 1.1269
alcoholm 1.3359
alcoholh 1.3198

7.2 Second test of correlated variables

No pair of predictors is highly correlated in the model without the nyears variable.

    X1 X2 X3 X4 X5 X7 X8 X9 Y
X1  1
X2     1
X3        1
X4  .  .  .  1
X5              1
X7           .     1
X8                    1
X9                    .  1
Y      .  ,  ,              1

Legend (cutpoints for the absolute correlation): 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

Correlation Matrix
blood index enzyme liver age gender alcoholm alcoholh survival(Y)
blood 1.0000 0.0443 -0.2814 0.3723 -0.0503 -0.0047 0.0008 0.0449 0.0569
index 0.0443 1.0000 -0.0380 0.3630 -0.0553 0.1135 0.1486 -0.1310 0.5361
enzyme -0.2814 -0.0380 1.0000 0.3816 0.0006 0.1673 -0.0372 0.0230 0.6007
liver 0.3723 0.3630 0.3816 1.0000 -0.2443 0.3006 0.0705 -0.0510 0.6301
age -0.0503 -0.0553 0.0006 -0.2443 1.0000 -0.0165 0.1532 -0.1238 -0.1577
gender -0.0047 0.1135 0.1673 0.3006 -0.0165 1.0000 0.0478 -0.0740 0.2692
alcoholm 0.0008 0.1486 -0.0372 0.0705 0.1532 0.0478 1.0000 -0.4788 -0.0407
alcoholh 0.0449 -0.1310 0.0230 -0.0510 -0.1238 -0.0740 -0.4788 1.0000 0.1912
survival(Y) 0.0569 0.5361 0.6007 0.6301 -0.1577 0.2692 -0.0407 0.1912 1.0000

8 Variable Screening


8.1 Stepwise Regression

For the stepwise regression procedure variables are added to the model or deleted one at a time until adding or deleting variables does not improve the model any more. For this case we are using backward stepwise regression, where we start with a complete model (all predictors), and systematically remove one by one until removing variables no longer improves the model.

## Start:  AIC=518.39
## survival ~ blood + index + enzyme + liver + age + nyears + gender + 
##     alcoholm + alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## - alcoholm  1      1293  757155 516.48
## - nyears    1      4419  760280 516.69
## - age       1      9255  765117 517.02
## - liver     1     22728  778590 517.93
## - gender    1     28506  784367 518.31
## <none>                   755861 518.39
## - blood     1     42776  798637 519.25
## - alcoholh  1    191522  947384 528.13
## - enzyme    1    713627 1469489 550.96
## - index     1    775503 1531365 553.10
## 
## Step:  AIC=516.48
## survival ~ blood + index + enzyme + liver + age + nyears + gender + 
##     alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## - nyears    1      4498  761653 514.78
## - age       1      9090  766245 515.10
## - liver     1     23490  780645 516.06
## - gender    1     28423  785578 516.39
## <none>                   757155 516.48
## - blood     1     42443  799598 517.31
## - alcoholh  1    222975  980130 527.90
## - enzyme    1    712379 1469534 548.96
## - index     1    782911 1540066 551.40
## 
## Step:  AIC=514.78
## survival ~ blood + index + enzyme + liver + age + gender + alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## - age       1     10801  772454 513.52
## - gender    1     26809  788461 514.58
## <none>                   761653 514.78
## - liver     1     32126  793778 514.93
## - blood     1     39567  801219 515.42
## - alcoholh  1    227350  989003 526.37
## - enzyme    1    755930 1517583 548.63
## - index     1    817178 1578831 550.69
## 
## Step:  AIC=513.52
## survival ~ blood + index + enzyme + liver + gender + alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## - gender    1     24783  797237 513.16
## <none>                   772454 513.52
## - blood     1     34375  806829 513.78
## - liver     1     49132  821585 514.72
## - alcoholh  1    248033 1020487 526.00
## - enzyme    1    747762 1520216 546.72
## - index     1    807492 1579946 548.73
## 
## Step:  AIC=513.16
## survival ~ blood + index + enzyme + liver + alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## - blood     1     28475  825713 512.98
## <none>                   797237 513.16
## - liver     1     72996  870233 515.71
## - alcoholh  1    240332 1037569 524.86
## - enzyme    1    746130 1543367 545.51
## - index     1    803077 1600314 547.39
## 
## Step:  AIC=512.98
## survival ~ index + enzyme + liver + alcoholh
## 
##            Df Sum of Sq     RSS    AIC
## <none>                   825713 512.98
## - liver     1    203597 1029309 522.44
## - alcoholh  1    255535 1081248 525.00
## - index     1    776306 1602019 545.45
## - enzyme    1    835130 1660843 547.32
## 
## Call:
## lm(formula = survival ~ index + enzyme + liver + alcoholh)
## 
## Coefficients:
## (Intercept)        index       enzyme        liver    alcoholh1  
##    -601.070        7.947        6.767       75.779      196.028

The final model selected by the stepwise regression is given by \(lm(formula = survival \sim index + enzyme + liver + alcoholh)\)
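Backward elimination of this kind is available through step() in base R. The sketch below runs it on simulated data (hypothetical values, with two real effects and two irrelevant predictors). It also shows why the AIC values in the trace are on a different scale from those returned by AIC(): for a linear model, step() ranks models by \(n \log(RSS/n) + 2p\), where \(p\) is the number of coefficients, which omits a constant.

```r
# Simulated data: index and enzyme matter, age and noise do not
set.seed(8)
n <- 52
dat <- data.frame(index = rnorm(n), enzyme = rnorm(n),
                  age = rnorm(n), noise = rnorm(n))
dat$survival <- 5 * dat$index + 5 * dat$enzyme + rnorm(n)

full <- lm(survival ~ index + enzyme + age + noise, data = dat)
sel  <- step(full, direction = "backward", trace = 0)  # trace = 0 hides the log
formula(sel)

# The criterion step() uses for lm objects
rss <- sum(resid(full)^2)
p   <- length(coef(full))
aic_step_scale <- n * log(rss / n) + 2 * p
c(by_hand = aic_step_scale, extractAIC = extractAIC(full)[2])

# Same formula with the starting model of this report:
# n = 52, RSS = 755861, 10 coefficients
report_start_aic <- 52 * log(755861 / 52) + 2 * 10
report_start_aic   # compare with "Start:  AIC=518.39" in the trace
```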


8.2 All Subsets Regression

For this procedure, every possible model is inspected and a best model is suggested.

Best Model Variables
No. Predictors blood index enzyme liver age gender alcoholm alcoholh R^2 adj-R^2 RSS Cp
1 ( 1 ) * 0.3970 0.3850 2363607.6 85.6812
2 ( 1 ) * * 0.6737 0.6604 1279137.8 26.3456
3 ( 1 ) * * * 0.7374 0.7210 1029309.2 14.2158
4 ( 1 ) * * * * 0.7894 0.7714 825712.6 4.7007
5 ( 1 ) * * * * * 0.7966 0.7745 797237.1 5.0902
6 ( 1 ) * * * * * * 0.8029 0.7767 772453.7 5.6885
7 ( 1 ) * * * * * * * 0.8057 0.7748 761652.6 7.0776
8 ( 1 ) * * * * * * * * 0.8060 0.7700 760280.0 9.0000

This plot shows the “best” models for each subset based on the Adjusted R-Squared.

The “best” model given by the all-subsets regression procedure incorporates the following predictors:

(Intercept) blood index enzyme liver age gender alcoholm alcoholh
TRUE TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE

Although the model suggested by the all-subsets regression has the largest Adjusted R-Squared, it does not include the age variable, which is an important variable when trying to predict survival. Moreover, the adjusted R-Squared values of the models without and with the \(age\) variable are very close (0.7767 versus 0.7748).

No. Predictors blood index enzyme liver age gender alcoholm alcoholh \(R^2\) adj-\(R^2\) RSS Cp
6 ( 1 ) * * * * * * 0.8029 0.7767 772453.7 5.6885
7 ( 1 ) * * * * * * * 0.8057 0.7748 761652.6 7.0776

Hence the \(age\) variable should be kept in the model. The “best” subset of predictors are:

(Intercept) blood index enzyme liver age gender alcoholm alcoholh
TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
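An exhaustive search over subsets can be sketched in base R without extra packages by enumerating the predictor combinations with combn(). This is not necessarily the routine used to produce the table above; the data are simulated and hypothetical. Mallows' Cp is computed as \(RSS_p / s^2 - n + 2p\), with \(s^2\) taken from the full model, so the full model always has Cp equal to its number of coefficients (9 for the eight-predictor model above).

```r
# Simulated data: index and enzyme matter, age and gender do not
set.seed(9)
n <- 52
dat <- data.frame(index = rnorm(n), enzyme = rnorm(n),
                  age = rnorm(n), gender = rbinom(n, 1, 0.5))
dat$survival <- 5 * dat$index + 5 * dat$enzyme + rnorm(n)

predictors <- c("index", "enzyme", "age", "gender")
full <- lm(reformulate(predictors, response = "survival"), data = dat)
s2_full <- summary(full)$sigma^2   # error variance estimate from the full model

# Every non-empty subset of the predictors
subsets <- unlist(lapply(seq_along(predictors),
                         function(m) combn(predictors, m, simplify = FALSE)),
                  recursive = FALSE)

results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "survival"), data = dat)
  rss <- sum(resid(fit)^2)
  p   <- length(coef(fit))         # coefficients including the intercept
  data.frame(model  = paste(vars, collapse = " + "),
             size   = length(vars),
             adj_r2 = summary(fit)$adj.r.squared,
             rss    = rss,
             cp     = rss / s2_full - n + 2 * p)
}))

results[order(-results$adj_r2), ][1:3, ]   # top three by adjusted R-squared
```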

9 Error Term Assumptions

The checks in this section use the model determined by the all subsets regression procedure: \(lm(formula = survival \sim blood + index + enzyme + liver + age + gender + alcoholh)\)

9.1 Normality

Interpretation: From the graph above we can determine that the residuals of the model are approximately normally distributed, with a slight negative skew.

normal probability plot

Interpretation: From the QQ-Plot above we can determine that the residuals of the model are approximately normally distributed. To meet this condition, the points on the QQ-Plot should fall on the straight 45-degree line.
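A minimal base-R sketch of these normality checks, on simulated data with hypothetical names. The sample skewness quantifies the direction of any asymmetry, and the Shapiro-Wilk test is added here only as an optional formal check; it is not part of the analysis in this report.

```r
# Simulated data; names and values are hypothetical
set.seed(10)
n <- 52
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)
fit <- lm(y ~ x, data = dat)

r <- rstandard(fit)                  # standardized residuals

hist(r, breaks = 10, main = "Standardized residuals")
qqnorm(r)
qqline(r)                            # reference line for the Q-Q plot

# Sample skewness: negative values indicate a longer left tail
skew <- mean((r - mean(r))^3) / sd(r)^3
sw <- shapiro.test(r)                # optional formal test of normality
c(skewness = skew, shapiro_p = sw$p.value)
```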

9.2 Linearity and \(E(\epsilon) = 0\)

Interpretation: The Component+Residuals Plot (partial residuals plot) helps us to look for trends that are different from our linear model. From the component plus residuals plots we can conclude that the assumptions of linearity and \(E(\epsilon)=0\) hold. The linear model seems to be appropriate for the data.
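Partial residuals are available in base R through residuals(fit, type = "partial"). The sketch below, on simulated data with hypothetical names, draws a component plus residual plot for one predictor and overlays a smooth curve; a smooth that departs clearly from the straight line would point to nonlinearity.

```r
# Simulated data; names and values are hypothetical
set.seed(11)
n <- 52
dat <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - dat$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = dat)

pr <- residuals(fit, type = "partial")   # one column per model term
cr_line <- lm(pr[, "x1"] ~ dat$x1)

plot(dat$x1, pr[, "x1"], xlab = "x1", ylab = "Component + residual")
abline(cr_line)                              # straight line implied by the linear model
lines(lowess(dat$x1, pr[, "x1"]), lty = 2)   # smooth trend for comparison

# The slope of the straight line is the fitted coefficient of x1
slope <- unname(coef(cr_line)[2])
c(slope = slope, coefficient = unname(coef(fit)["x1"]))
```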

9.3 \(Var(\epsilon)\) is constant

Interpretation: From the plot above we can spot some trends, but they are not strong enough to discard the model. It does not look like there is a systematic relationship between the residuals and the predicted values. The model seems to capture all of the systematic variance present in the data.


10 Variable Transformation


10.1 Model without Interactions given by the All Subsets Regression

By using the spreadLevelPlot() function from the car package in R, we can create a scatter plot of the absolute standardized residuals versus the fitted values over the best fit line. The points should form a random horizontal band around the best fit line. If the model violated the regression assumptions, we would see a non-horizontal line. In this case, there is a problem with heteroscedasticity: the model appears to have multiplicative errors.
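The idea behind the spread-level plot can be approximated in base R: regress the log of the absolute studentized residuals on the log of the fitted values, and read one minus the slope as a suggested power transformation of the response. The sketch below is an approximation under stated assumptions, not the car implementation (which, as far as its documentation describes, fits a robust line by default). The data are simulated with multiplicative errors and are hypothetical.

```r
# Simulated data with multiplicative errors: spread grows with the mean
set.seed(12)
n <- 300
x  <- runif(n, 1, 10)
mu <- 100 + 50 * x
y  <- mu * exp(rnorm(n, sd = 0.3))
fit <- lm(y ~ x)

fv <- fitted(fit)
rs <- rstudent(fit)
ok <- fv > 0                           # logs require positive fitted values

# Ordinary least squares line on the log-log scale
sl <- lm(log(abs(rs[ok])) ~ log(fv[ok]))
slope <- unname(coef(sl)[2])
suggested_power <- 1 - slope           # near 0 would suggest a log transformation
c(slope = slope, suggested_power = suggested_power)

plot(log(fv[ok]), log(abs(rs[ok])),
     xlab = "log(fitted values)", ylab = "log(|studentized residuals|)")
abline(sl)                             # a clearly non-horizontal line signals heteroscedasticity
```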

Statistics for Linear Model without variable interactions

## 
## Call:
## lm(formula = survival ~ blood + index + enzyme + liver + age + 
##     nyears + gender + alcoholm + alcoholh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -227.93  -81.21   12.71   83.05  338.26 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -665.711    195.180  -3.411  0.00144 ** 
## blood         26.986     17.504   1.542  0.13065    
## index          8.577      1.307   6.564 6.12e-08 ***
## enzyme         7.946      1.262   6.297 1.48e-07 ***
## liver         35.185     31.309   1.124  0.26748    
## age           -4.205      5.863  -0.717  0.47726    
## nyears         2.745      5.541   0.496  0.62283    
## gender1       50.018     39.743   1.259  0.21515    
## alcoholm1     11.608     43.299   0.268  0.78994    
## alcoholh1    193.568     59.336   3.262  0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared:  0.8072, Adjusted R-squared:  0.7659 
## F-statistic: 19.53 on 9 and 42 DF,  p-value: 2.484e-12

10.2 Model with variable Interactions and transformation

The spread level plot helps us to determine if the new model with interactions still has a heteroscedastic problem. By adding interaction terms, the previous issue with heteroscedasticity was resolved.

Statistics for Linear Model with variable interactions and transformation

## 
## Call:
## lm(formula = survival ~ age + liver + gender + alcoholh + blood:age + 
##     blood:index + liver:index + age:enzyme)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -193.003  -71.814    7.571   71.664  305.779 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  798.60655  183.26738   4.358 8.03e-05 ***
## age          -18.83532    4.12468  -4.566 4.13e-05 ***
## liver       -300.76375   70.73640  -4.252 0.000112 ***
## gender1       63.05109   34.85808   1.809 0.077479 .  
## alcoholh1    177.34388   45.95638   3.859 0.000377 ***
## age:blood      1.46590    0.55707   2.631 0.011757 *  
## blood:index   -0.72409    0.42667  -1.697 0.096909 .  
## liver:index    4.79637    0.90870   5.278 4.05e-06 ***
## age:enzyme     0.14341    0.01933   7.418 3.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.9 on 43 degrees of freedom
## Multiple R-squared:   0.85,  Adjusted R-squared:  0.8221 
## F-statistic: 30.47 on 8 and 43 DF,  p-value: 2.607e-15
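In an R formula, a term such as age:enzyme adds only the product of the two variables, without automatically adding both main effects. The sketch below illustrates this on simulated data in which survival really depends on an age by enzyme product; the values are hypothetical and only the variable names follow this report.

```r
# Simulated data with a true age-by-enzyme interaction
set.seed(13)
n <- 52
dat <- data.frame(age = rnorm(n, 50, 10), enzyme = rnorm(n, 77, 21))
dat$survival <- 200 - 5 * dat$age + 0.5 * dat$age * dat$enzyme +
  rnorm(n, sd = 50)

main_fit <- lm(survival ~ age + enzyme, data = dat)      # main effects only
int_fit  <- lm(survival ~ age + age:enzyme, data = dat)  # ":" adds only the product term

names(coef(int_fit))   # "(Intercept)", "age", "age:enzyme"
c(main        = summary(main_fit)$adj.r.squared,
  interaction = summary(int_fit)$adj.r.squared)
```

Writing age*enzyme instead would expand to age + enzyme + age:enzyme, which keeps both main effects in the model alongside the product.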

10.3 Comparing Models - AIC Test

By comparing the model given by the all subsets procedure with the interaction model, we can determine which is the better model for predicting survival. The adjusted R-squared increased after we added interactions between the predictor variables. The F-statistic also increased compared with the previous model. At this point, we can compare the models using the Akaike Information Criterion (AIC), which takes into account a model’s statistical fit and the number of parameters needed to achieve this fit. The model with the lower AIC is preferred. In this case, the selected model is the one with the variable interactions.

##       df      AIC
## fit   11 667.9568
## trfit 10 652.8809

11 Tests for model overall goodness of fit and individual coefficient significance


11.1 Model 1

Original data, with outliers and influential observations.

## 
## Call:
## lm(formula = survival0 ~ blood0 + index0 + enzyme0 + liver0 + 
##     age0 + nyears0 + gender0 + alcoholm0 + alcoholh0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -288.53 -133.68   -9.18   89.64  788.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1132.818    269.581  -4.202 0.000127 ***
## blood0         63.041     25.159   2.506 0.015994 *  
## index0          9.055      1.981   4.571 3.92e-05 ***
## enzyme0         9.976      1.866   5.347 3.04e-06 ***
## liver0         48.645     47.123   1.032 0.307570    
## age0           -2.131      8.715  -0.245 0.807921    
## nyears0         1.181      8.299   0.142 0.887463    
## gender01       16.724     59.423   0.281 0.779685    
## alcoholm01      7.503     65.692   0.114 0.909582    
## alcoholh01    320.283     86.061   3.722 0.000559 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 203.7 on 44 degrees of freedom
## Multiple R-squared:  0.7819, Adjusted R-squared:  0.7373 
## F-statistic: 17.53 on 9 and 44 DF,  p-value: 7.465e-12

The linear model with all of the predictor variables is acceptable as a first model without any modifications. That being said, we still need to examine the model further to improve its accuracy. Based on the adjusted R-squared, the model accounts for 73.73% of the variation in the data.
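The adjusted R-squared can be recovered from the multiple R-squared as \(1-(1-R^2)\frac{n-1}{n-k-1}\). The sketch below applies this to the figures reported for Model 1 and cross-checks the formula against summary() on simulated data; the helper adj_r2 is a hypothetical name introduced for this illustration.

```r
# Hypothetical helper: adjusted R-squared from R-squared, n and k
adj_r2 <- function(r2, n, k) 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Model 1 of this report: R-squared 0.7819, n = 54 patients, k = 9 predictors
model1_adj <- adj_r2(0.7819, n = 54, k = 9)
round(model1_adj, 4)   # compare with the reported 0.7373

# Cross-check against summary() on simulated data
set.seed(14)
n <- 54
x <- matrix(rnorm(n * 3), n, 3)
y <- drop(x %*% c(1, 2, 0)) + rnorm(n)
s <- summary(lm(y ~ x))
by_hand <- adj_r2(s$r.squared, n = n, k = 3)
c(by_hand = by_hand, summary = s$adj.r.squared)
```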

11.2 Model 1 - Anova Table

By looking at the anova table we can see that there are four variables with highly significant p-values and one variable with a slightly significant p-value. The remaining variables are not significant. These insignificant variables do not contribute to the model and can be removed. The potential removal of some variables should be explored further to improve the model.

## Analysis of Variance Table
## 
## Response: survival0
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## blood0     1 1005152 1005152 24.2329 1.244e-05 ***
## index0     1 1278496 1278496 30.8229 1.531e-06 ***
## enzyme0    1 3442172 3442172 82.9864 1.093e-11 ***
## liver0     1   57862   57862  1.3950 0.2439108    
## age0       1   33032   33032  0.7964 0.3770374    
## nyears0    1    2656    2656  0.0640 0.8014041    
## gender0    1      37      37  0.0009 0.9763252    
## alcoholm0  1  150557  150557  3.6297 0.0633039 .  
## alcoholh0  1  574491  574491 13.8503 0.0005588 ***
## Residuals 44 1825065   41479                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
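One point to keep in mind when reading these tables: anova() on a single lm fit reports sequential sums of squares, so each p-value tests a term given only the terms listed before it. Only the last row matches the corresponding t-test in the coefficient table (for alcoholh0 above, 0.0005588 in both). The sketch below illustrates the order dependence on simulated data with two correlated, hypothetical predictors.

```r
# Simulated data: y depends on x2 only, and x1 is correlated with x2
set.seed(15)
n <- 54
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
y  <- 1 + 2 * x2 + rnorm(n)

a12 <- anova(lm(y ~ x1 + x2))   # x1 entered first
a21 <- anova(lm(y ~ x2 + x1))   # x1 entered last

# The p-value for x1 depends on where it enters the model
c(x1_first = a12["x1", "Pr(>F)"], x1_last = a21["x1", "Pr(>F)"])

# For the last term, the sequential F equals the squared t from summary()
fit <- lm(y ~ x1 + x2)
t_last <- coef(summary(fit))["x2", "t value"]
f_last <- a12["x2", "F value"]
c(t_squared = t_last^2, F = f_last)
```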

11.3 Model 2

Without outliers and influential observations.

## 
## Call:
## lm(formula = survival1 ~ blood1 + index1 + enzyme1 + liver1 + 
##     age1 + nyears1 + gender1 + alcoholm1 + alcoholh1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -227.93  -81.21   12.71   83.05  338.26 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -665.711    195.180  -3.411  0.00144 ** 
## blood1        26.986     17.504   1.542  0.13065    
## index1         8.577      1.307   6.564 6.12e-08 ***
## enzyme1        7.946      1.262   6.297 1.48e-07 ***
## liver1        35.185     31.309   1.124  0.26748    
## age1          -4.205      5.863  -0.717  0.47726    
## nyears1        2.745      5.541   0.496  0.62283    
## gender11      50.018     39.743   1.259  0.21515    
## alcoholm11    11.608     43.299   0.268  0.78994    
## alcoholh11   193.568     59.336   3.262  0.00220 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 134.2 on 42 degrees of freedom
## Multiple R-squared:  0.8072, Adjusted R-squared:  0.7659 
## F-statistic: 19.53 on 9 and 42 DF,  p-value: 2.484e-12

After deleting the outliers and influential values from the data, the model improved. The linear model with all the predictor variables and without outliers improved our prediction. There is still room for improvement, as we need to look for multicollinearity and correlation among the predictor variables. Based on the adjusted R-squared, the model explains 76.59% of the variation in the data at this time, up from 73.73% for the previous model.

11.4 Model 2 - Anova Table

By looking at the anova table we can see that there are two variables with highly significant p-values, one variable with a significant p-value, and one variable with a marginally significant p-value. The remaining variables are not significant. After removing the influential values and outliers, blood is no longer significant for this linear model, and alcoholh moves from highly significant to significant.

## Analysis of Variance Table
## 
## Response: survival1
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## blood1     1   12670   12670  0.7040  0.406189    
## index1     1 1118269 1118269 62.1374 8.164e-10 ***
## enzyme1    1 1692481 1692481 94.0440 2.783e-12 ***
## liver1     1   58863   58863  3.2707  0.077689 .  
## age1       1   28117   28117  1.5623  0.218245    
## nyears1    1    6730    6730  0.3740  0.544144    
## gender1    1   22592   22592  1.2553  0.268899    
## alcoholm1  1   32746   32746  1.8196  0.184591    
## alcoholh1  1  191522  191522 10.6421  0.002199 ** 
## Residuals 42  755861   17997                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

11.5 Model 3

Without outliers and influential observations, and with the multicollinearity problem fixed.

## 
## Call:
## lm(formula = survival3 ~ blood3 + index3 + enzyme3 + liver3 + 
##     age3 + gender3 + alcoholh3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -232.62  -76.48    9.67   80.78  319.05 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -703.469    172.341  -4.082 0.000185 ***
## blood3        25.720     17.012   1.512 0.137719    
## index3         8.412      1.224   6.871 1.77e-08 ***
## enzyme3        7.724      1.169   6.608 4.30e-08 ***
## liver3        40.079     29.420   1.362 0.180035    
## age3          -1.373      1.738  -0.790 0.433812    
## gender31      48.349     38.851   1.244 0.219913    
## alcoholh31   187.903     51.849   3.624 0.000748 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 131.6 on 44 degrees of freedom
## Multiple R-squared:  0.8057, Adjusted R-squared:  0.7748 
## F-statistic: 26.06 on 7 and 44 DF,  p-value: 1.098e-13

Eliminating the variable “nyears” (number of years employed), which was causing multicollinearity problems, made a small improvement to the model. Based on the adjusted R-squared, the model now accounts for 77.48% of the variation in the data, compared with 76.59% for the previous model. There is still room for improvement, which we explore next through variable transformations and interactions.

11.6 Model 3 - Anova Table

By looking at the anova table we can see that there are now three variables with highly significant p-values and one variable with a slightly significant p-value. The remaining three variables are not significant. These insignificant variables do not contribute to the model and can be removed. At this point, we can explore variable transformation and variable interactions.

## Analysis of Variance Table
## 
## Response: survival3
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## blood3     1   12670   12670  0.7319  0.396893    
## index3     1 1118269 1118269 64.6014 3.564e-10 ***
## enzyme3    1 1692481 1692481 97.7732 9.444e-13 ***
## liver3     1   58863   58863  3.4004  0.071919 .  
## age3       1   28117   28117  1.6243  0.209187    
## gender3    1   20450   20450  1.1814  0.282999    
## alcoholh3  1  227350  227350 13.1338  0.000748 ***
## Residuals 44  761653   17310                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

12 Model 4

Variable transformation and interactions

## 
## Call:
## lm(formula = survival ~ age + liver + gender + alcoholh + blood:age + 
##     blood:index + liver:index + age:enzyme)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -193.003  -71.814    7.571   71.664  305.779 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  798.60655  183.26738   4.358 8.03e-05 ***
## age          -18.83532    4.12468  -4.566 4.13e-05 ***
## liver       -300.76375   70.73640  -4.252 0.000112 ***
## gender1       63.05109   34.85808   1.809 0.077479 .  
## alcoholh1    177.34388   45.95638   3.859 0.000377 ***
## age:blood      1.46590    0.55707   2.631 0.011757 *  
## blood:index   -0.72409    0.42667  -1.697 0.096909 .  
## liver:index    4.79637    0.90870   5.278 4.05e-06 ***
## age:enzyme     0.14341    0.01933   7.418 3.21e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 116.9 on 43 degrees of freedom
## Multiple R-squared:   0.85,  Adjusted R-squared:  0.8221 
## F-statistic: 30.47 on 8 and 43 DF,  p-value: 2.607e-15

Starting from the model without “nyears” (number of years employed), which was causing multicollinearity problems, adding interaction terms improved the model further. Based on the adjusted R-squared, the model now accounts for 82.21% of the variation in the data. This is a clear improvement after removing variables with multicollinearity and correlation problems and adding interaction terms.

12.1 Model 4 - Anova Table

By looking at the anova table we can see that all but one of the terms have significant p-values, most of them highly significant; gender is the only term with a non-significant p-value. If we take this variable out, the model’s overall fit goes down.

## Analysis of Variance Table
## 
## Response: survival
##             Df  Sum Sq Mean Sq  F value    Pr(>F)    
## age          1   97533   97533   7.1349 0.0106314 *  
## liver        1 1458772 1458772 106.7140 3.190e-13 ***
## gender       1   27683   27683   2.0251 0.1619307    
## alcoholh     1  207678  207678  15.1923 0.0003351 ***
## age:blood    1  143279  143279  10.4813 0.0023252 ** 
## blood:index  1  375633  375633  27.4788 4.566e-06 ***
## liver:index  1  269271  269271  19.6981 6.218e-05 ***
## age:enzyme   1  752196  752196  55.0256 3.211e-09 ***
## Residuals   43  587806   13670                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1