Introduction

In this short guide I will try to show how the {car} package in R provides plenty of useful tools to verify model assumptions.

Data exploration

Let us consider two wine datasets, one for red and one for white wines, sharing the same variables. EDA is the first compulsory step to investigate the data, understand interesting relationships among the variables and recognize patterns.

Red and white wine correlations

Most relevant plots:

ggplot(r.wine, aes(quality)) +                       # red wine density in red3
  geom_density(col = "red3", alpha = 1) +
  labs(title = "Density plot",
       x = "quality") +
  geom_density(data = w.wine, aes(quality),          # white wine overlaid in green4
               col = "green4", alpha = 0.8)

ggplot(r.wine, aes(density, alcohol)) +    # alcohol & density
  geom_point(color = "tomato2", alpha = 1/8,          # literal colors, not aes()
             position = position_jitter(h = 0), size = 2) +
  geom_point(data = w.wine, color = "yellow3", alpha = 1/8,
             position = position_jitter(h = 0), size = 2) +
  geom_smooth(data = w.wine, aes(density, alcohol), method = 'lm', colour = "yellow3") +
  geom_smooth(method = 'lm', colour = "tomato2") +
  coord_cartesian(xlim = c(min(r.wine$density), 1.005), ylim = c(8, 15))

Some of the tools used and the main results obtained are:

1. Study of basic statistics (mean, median, mode, correlation and distribution), thanks to the tools below; a sketch of the correlation step follows this list:

  • correlation matrix;
  • ggplot.

2. Identification of interesting variables. I decided to focus the analysis on:

  • pH: opposite trends, with red wines scoring better at higher pH and white wines at lower pH;
  • alcohol percentage: negatively correlated with density but positively correlated with wine quality;
  • density: negatively correlated with wine quality;
  • fixed acidity: positively correlated with density, more strongly in red than in white wines.
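
As a sketch of point 1, the correlation matrices can be computed directly with base R; this assumes both data frames contain only numeric columns:

# correlation matrix of every pair of variables, rounded for readability
round(cor(r.wine), 2)
round(cor(w.wine), 2)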

Identifying variables for the regression model

Following the EDA output, we start from those variables to build a regression model. In particular, we will predict the alcohol percentage, which is strongly correlated with wine quality, from the other chemical components. Using a *classic statistical approach*, i.e. assuming that the data are generated by a given stochastic model, we suppose there is a linear relationship between the predictors and the output. I build a multiple linear regression model to estimate the unknown f(x) without verifying the assumptions first, but keeping in mind the questions addressed in the sections below.

After many attempts, these seem to be the best multiple linear regression models for red and white wine alcohol percentage:

# red wine: 
MR <- lm(alcohol~density+ pH+`fixed acidity`+`citric acid`,data=r.wine)
summary(MR)
## 
## Call:
## lm(formula = alcohol ~ density + pH + `fixed acidity` + `citric acid`, 
##     data = r.wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1210 -0.5218 -0.1038  0.4629  2.7679 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      499.96694   13.14893  38.023   <2e-16 ***
## density         -507.56726   13.38598 -37.918   <2e-16 ***
## pH                 3.68371    0.16675  22.091   <2e-16 ***
## `fixed acidity`    0.46272    0.02087  22.170   <2e-16 ***
## `citric acid`      1.19998    0.12887   9.311   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.729 on 1594 degrees of freedom
## Multiple R-squared:  0.5333, Adjusted R-squared:  0.5321 
## F-statistic: 455.3 on 4 and 1594 DF,  p-value: < 2.2e-16
# white wine:
MW <- lm(alcohol~density+`residual sugar`, data=w.wine)
summary(MW)
## 
## Call:
## lm(formula = alcohol ~ density + `residual sugar`, data = w.wine)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0005 -0.4173 -0.0470  0.3549 16.3747 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.648e+02  5.365e+00  105.27   <2e-16 ***
## density          -5.586e+02  5.414e+00 -103.18   <2e-16 ***
## `residual sugar`  1.670e-01  3.193e-03   52.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6167 on 4895 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7489 
## F-statistic:  7303 on 2 and 4895 DF,  p-value: < 2.2e-16

At a first look at the summaries, both models seem to be really significant, because:

  • every coefficient has a p-value below 2e-16;
  • the overall F-statistics are large, with p-values below 2.2e-16;
  • the adjusted R-squared is about 0.53 for red wine and 0.75 for white wine.

However, the best model is not necessarily a linear combination of all of these predictors. The aim is to find the right balance: significant, non-superfluous variables that are not strongly correlated with each other. The output shows that the models explain most of the overall variance; the drawback is that piling on predictors to push that figure up can overfit, increasing the variance of the fit in the classic bias-variance trade-off.
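
One hedged way to explore that balance, sketched here rather than taken from the original analysis, is AIC-based backward selection over all chemical predictors:

# a sketch: start from a model with every chemical predictor (quality
# excluded, since we predict from chemical components) and let AIC-based
# backward selection prune superfluous variables
full <- lm(alcohol ~ . - quality, data = r.wine)
step(full, direction = "backward", trace = FALSE)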

In addition, we still do not have any information about the residuals, the error distribution, or the implicit assumptions of a linear model: additivity and linearity.

A question arises: Are the models definitely reliable?

Investigation on assumptions

The assumptions of linear regression models have to be checked before drawing any conclusion, to avoid potential problems that may occur when fitting the model to a dataset. We will use the {car} R package.

1. Normally distributed errors

Looking at the P-P plot and the Q-Q plot, we can detect whether the residuals follow a normal distribution. The Q-Q plot detects skewness in the tails, while the P-P plot focuses on the central part of the distribution. Tools used: qqPlot and spreadLevelPlot from {car}, and studres from {MASS}.
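
A minimal sketch of this check, assuming the models MR and MW fitted above:

library(car)
library(MASS)

# Q-Q plots of studentized residuals against t quantiles; car::qqPlot()
# labels, and returns the indices of, the most extreme observations
qqPlot(MR, main = "Red wine")
qqPlot(MW, main = "White wine")

# histogram of studentized residuals with a normal curve overlaid
sres <- studres(MR)
hist(sres, freq = FALSE, main = "Red wine: studentized residuals")
curve(dnorm(x, mean(sres), sd(sres)), add = TRUE, col = "red3")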

In both models there is a component that is not captured; it is visible in the tails of the Q-Q plot. qqPlot also flags the most extreme observations in each model (red wine first, then white):

## [1] 379 560

## [1] 1654 2782

2. Correlation of residuals

Residuals should be uncorrelated. Here there is evident autocorrelation of the residuals in the red wine model. Possible reasons:

  • the model is misspecified, i.e. the data are non-linear;
  • a relevant component is not included in the model.

Possible solutions: include quadratic or higher-order terms, or assume a non-linear specification.
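
A sketch of the first remedy, with MR2 as a purely illustrative name:

# add a quadratic term for density to the red wine model and test
# whether it significantly improves the fit
MR2 <- update(MR, . ~ . + I(density^2))
anova(MR, MR2)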

The residuals of the white wine model seem to be uncorrelated, but there is a clear outlier problem, which we will discuss later.

plot(resid(MR), type = "b", main = "Red wine")   # residuals in fitting order
abline(h = 0, lwd = 2, lty = "dashed", col = 2)  # lty, not type, sets the dashing

plot(resid(MW), type = "b", main = "White wine")
abline(h = 0, lwd = 2, lty = "dashed", col = 2)
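
Beyond the eyeball test, {car} also provides a formal check of first-order autocorrelation; a minimal sketch:

# Durbin-Watson test: a statistic near 2 indicates no first-order
# autocorrelation, values well below 2 suggest positive autocorrelation
durbinWatsonTest(MR)
durbinWatsonTest(MW)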

3. Heteroskedasticity of residuals

The assumption of homoskedasticity of the residuals is central to every linear regression model. ncvTest from {car} and spreadLevelPlot, which plots studentized residuals from the linear model against fitted values, show no particular pattern between standardized residuals and fitted values, but they suggest a power transformation.

We can validate the hypothesis of homoskedasticity for the red wine model (p ≈ 0.08), despite the presence of outliers/influential points; for the white wine model, however, ncvTest strongly rejects it (p ≈ 0), a result most likely driven by the outliers discussed below.

ncvTest(MR)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 2.988176    Df = 1     p = 0.08387461
spreadLevelPlot(MR, col= "red",col.lines = 1, main = "Red Wine")  

## 
## Suggested power transformation:  0.1927305
ncvTest(MW)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 12130.43    Df = 1     p = 0
spreadLevelPlot(MW, col= "green3",col.lines = 1, main="White wine")  
## Warning in spreadLevelPlot.lm(MW, col = "green3", col.lines = 1, main = "White wine"): 
## 1 negative fitted value removed

## 
## Suggested power transformation:  1.937443

4. High leverage points and outliers

Focusing on the last two plots, we can detect outliers and high-leverage points; observations whose Cook's distance exceeds 1 are conventionally treated as influential.

Pay attention to the latter, because they can significantly alter the model.

### Red wine: the four base-R diagnostic plots
par(mfrow = c(2, 2))
plot(MR)

### White wine
par(mfrow = c(2, 2))
plot(MW)

They can be detected (with the following {car} function, for example) and then removed or manually corrected, according to the specific situation.

car::outlierTest(MR)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##     rstudent unadjusted p-value Bonferonni p
## 379 3.819524         0.00013884      0.22201
car::outlierTest(MW)
##       rstudent unadjusted p-value Bonferonni p
## 2782 29.516389        2.2840e-176  1.1187e-172
## 1654  5.162931         2.5274e-07   1.2379e-03
## 1664  5.162931         2.5274e-07   1.2379e-03
## 3902  4.867412         1.1662e-06   5.7122e-03
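
For example, a hedged sketch of the "remove" option (not a blanket recommendation to drop data): refit the white wine model without row 2782, the most extreme outlier flagged above, and compare the fits.

# MW2 is an illustrative name; w.wine[-2782, ] drops the flagged row
MW2 <- update(MW, data = w.wine[-2782, ])
summary(MW2)$adj.r.squared   # compare with 0.7489 from the full fit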

5. Collinearity of predictors

In the presence of a quasi-linear relation among the variables, the model's predictions risk being unreliable. Investigating with the correlation matrix and the Variance Inflation Factor (vif in {car}) suggests paying attention to:

  • fixed acidity in the red wine model;
  • both variables in the white wine model.

We are in the presence of a possible multicollinearity problem among the predictors, especially in the white wine regression model: a VIF of about 3.4 on both predictors implies a pairwise correlation of roughly 0.84 between density and residual sugar.

vif(MR)           
##         density              pH `fixed acidity`   `citric acid` 
##        1.919378        1.992986        3.971310        1.895223
sqrt(vif(MR))     
##         density              pH `fixed acidity`   `citric acid` 
##        1.385416        1.411732        1.992815        1.376671
vif(MW)
##          density `residual sugar` 
##         3.376835         3.376835
sqrt(vif(MW))
##          density `residual sugar` 
##         1.837617         1.837617
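
One hedged mitigation, sketched below with the illustrative name MR_noFA, is to drop the highest-VIF predictor from the red wine model and re-check:

# refit without fixed acidity, then see how the remaining VIFs and the
# adjusted R-squared move relative to MR (0.5321)
MR_noFA <- update(MR, . ~ . - `fixed acidity`)
vif(MR_noFA)
summary(MR_noFA)$adj.r.squared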

Conclusion

After the investigation of the assumptions, the models that initially seemed to be significant show some weaknesses. In particular: