Introduction

Multiple linear regression is a widely used statistical method for modelling the relationship between a dependent variable and one or more independent variables. In this analysis, we aim to develop a robust and interpretable linear regression model while addressing common challenges such as multicollinearity, overfitting, and variable selection. Key steps include evaluating the significance of predictors using the F-statistic and p-values, assessing multicollinearity through variance inflation factors, and exploring the impact of reducing the number of variables on model performance. Ready? Let’s get started.

Examination of the dataset

Load the dataset on automobiles and assign it to the variable ‘dati’:

dati <- read.delim("Automobili.mult.txt", header = TRUE, sep = ",")

Let’s examine the structure of the dataset:

str(dati)
## 'data.frame':    195 obs. of  6 variables:
##  $ Cavalli         : int  111 111 154 102 115 110 110 110 140 101 ...
##  $ Miglia.per.litro: int  27 27 26 30 22 25 25 25 20 29 ...
##  $ Lunghezza       : num  169 169 171 177 177 ...
##  $ Tempi.del.motore: num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
##  $ Prezzo          : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...
##  $ Carburante      : chr  "gas" "gas" "gas" "gas" ...

We have 195 observations of 6 variables: cavalli (horsepower), miglia.per.litro (miles per litre), lunghezza (length), tempi.del.motore (engine timing), prezzo (price) and carburante (fuel type). The first five variables are quantitative, while the sixth is categorical.

head(dati$Carburante)
## [1] "gas" "gas" "gas" "gas" "gas" "gas"

As the sixth attribute is categorical but stored as type character, we use the ‘factor’ function to convert it. A factor stores the data as integer codes associated with labelled levels, which is memory efficient and gives the levels a defined order (alphabetical by default), which matters for contrast coding, analysis and plotting.

dati$Carburante <- factor(dati$Carburante)

Now we re-examine the 6th variable:

head(dati$Carburante)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
str(dati)
## 'data.frame':    195 obs. of  6 variables:
##  $ Cavalli         : int  111 111 154 102 115 110 110 110 140 101 ...
##  $ Miglia.per.litro: int  27 27 26 30 22 25 25 25 20 29 ...
##  $ Lunghezza       : num  169 169 171 177 177 ...
##  $ Tempi.del.motore: num  2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
##  $ Prezzo          : int  13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...
##  $ Carburante      : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...

We can see the ‘Carburante’ variable is now a factor with two levels: ‘diesel’ and ‘gas’.
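
To peek at how the factor stores these values internally, we can look at the underlying integer codes and the level labels (a small sketch; the output is not shown here):

head(as.integer(dati$Carburante))  # integer codes: 1 = diesel, 2 = gas
levels(dati$Carburante)            # the labels associated with those codes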

To see an overview of the data, we can use the ‘summary’ function:

summary(dati)
##     Cavalli      Miglia.per.litro   Lunghezza     Tempi.del.motore
##  Min.   : 48.0   Min.   :16.00    Min.   :141.1   Min.   :2.07    
##  1st Qu.: 70.0   1st Qu.:25.00    1st Qu.:166.3   1st Qu.:3.11    
##  Median : 95.0   Median :30.00    Median :173.2   Median :3.29    
##  Mean   :103.3   Mean   :30.84    Mean   :174.3   Mean   :3.25    
##  3rd Qu.:116.0   3rd Qu.:35.00    3rd Qu.:184.1   3rd Qu.:3.41    
##  Max.   :262.0   Max.   :54.00    Max.   :208.1   Max.   :4.17    
##      Prezzo       Carburante 
##  Min.   : 5118   diesel: 20  
##  1st Qu.: 7756   gas   :175  
##  Median :10245               
##  Mean   :13248               
##  3rd Qu.:16509               
##  Max.   :45400

To see a visual overview of the data, we can plot histograms of each numerical variable, first setting up a 3 by 2 plotting frame using the ‘par’ function:

par(mfrow = c(3,2)) 
hist(dati$Cavalli)
hist(dati$Miglia.per.litro) 
hist(dati$Lunghezza)
hist(dati$Tempi.del.motore) 
hist(dati$Prezzo)

We have positively skewed data for most variables, whilst the lunghezza (length) data appears to be normally distributed. Tempi del motore seems to be slightly negatively skewed.
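
To back up this visual impression with numbers, we could compute the sample skewness of each quantitative variable (a minimal sketch using a moment-based formula, so no extra packages are needed; positive values indicate right skew, negative values left skew):

# Moment-based sample skewness for the five numerical columns
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
round(sapply(dati[, 1:5], skewness), 2)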

We can now use the ‘pairs’ function to plot scatterplots between every pair of variables:

pairs(dati)

We can immediately see that, as ‘carburante’ is a factor with only two levels (coded 1 for diesel and 2 for gas), its panels in the scatterplot matrix are not very informative, so it makes sense to exclude it. We can rerun the ‘pairs’ function, eliminating the 6th column with the command ‘pairs(dati[,-6])’. If we had many columns of data and knew we wanted to drop the last one, we could use the ‘dim’ function to identify it:

dim(dati) 
## [1] 195   6
dim(dati)[2]
## [1] 6
pairs(dati[,-dim(dati)[2]])

We can see there is no correlation between engine timing (tempi del motore) and the other variables. There appear to be correlations among the other variables, although the relationships do not always look linear (prezzo and miglia per litro, for example). This non-linearity could potentially cause problems, but we will press on and explore what happens. We can check the ‘r’ values between each pair of variables using the ‘cor’ function:

round(cor(dati[,-6]),2)
##                  Cavalli Miglia.per.litro Lunghezza Tempi.del.motore Prezzo
## Cavalli             1.00            -0.81      0.58             0.10   0.81
## Miglia.per.litro   -0.81             1.00     -0.72            -0.04  -0.72
## Lunghezza           0.58            -0.72      1.00             0.12   0.70
## Tempi.del.motore    0.10            -0.04      0.12             1.00   0.09
## Prezzo              0.81            -0.72      0.70             0.09   1.00

The low values in the ‘tempi del motore’ column confirm the lack of correlation between the 4th variable and the others. All other correlations have magnitude 0.58 or higher.
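
To probe the apparent non-linearity, we could also compare the Pearson correlation with the rank-based Spearman correlation, which only assumes a monotonic relationship (a sketch for the prezzo / miglia per litro pair; a Spearman value noticeably larger in magnitude than the Pearson value would hint at a monotonic but non-linear association):

# Pearson vs Spearman correlation for Prezzo against Miglia.per.litro
cor(dati$Prezzo, dati$Miglia.per.litro, method = "pearson")
cor(dati$Prezzo, dati$Miglia.per.litro, method = "spearman")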

Before performing our linear regression, we should check the contrast coding for the factor carburante:

contrasts(dati$Carburante )
##        gas
## diesel   0
## gas      1

For the carburante variable, the dummy variable ‘Carburantegas’ takes the value 1 for ‘gas’ and 0 for ‘diesel’, so ‘diesel’ is the reference level.
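
If we ever wanted the coding the other way round, the reference level could be changed with the ‘relevel’ function before fitting the model. The sketch below only displays the resulting contrast matrix and does not modify the dataset, since we keep the default alphabetical coding for the rest of the analysis:

# Illustration only: contrast coding if 'gas' were the reference level
contrasts(relevel(dati$Carburante, ref = "gas"))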

COMPLETE MULTIPLE LINEAR REGRESSION MODEL: reg.m

Now we fit the multiple linear regression model, defining prezzo (price) as the response and all the remaining variables as predictors, taking the data from our prepared dataset. We assign the result to ‘reg.m’:

reg.m <- lm(Prezzo ~ ., data = dati) 
summary(reg.m)
## 
## Call:
## lm(formula = Prezzo ~ ., data = dati)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8139.3 -2357.8    27.1  1655.1 16683.9 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -20561.53   10131.20  -2.030   0.0438 *  
## Cavalli             147.62      13.15  11.226  < 2e-16 ***
## Miglia.per.litro    -13.54      91.36  -0.148   0.8823    
## Lunghezza           160.42      38.66   4.149 5.04e-05 ***
## Tempi.del.motore  -1375.50     958.27  -1.435   0.1528    
## Carburantegas     -5015.84    1156.11  -4.339 2.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4004 on 189 degrees of freedom
## Multiple R-squared:  0.7594, Adjusted R-squared:  0.753 
## F-statistic: 119.3 on 5 and 189 DF,  p-value: < 2.2e-16

The large p-values (p > 0.05) in the last column of the ‘Coefficients’ section of the summary tell us that ‘Miglia.per.litro’ and ‘Tempi.del.motore’ do not have statistically significant effects on the response once the other predictors are included. This suggests we may be able to simplify the model by omitting these variables.
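
Before removing both variables at once, we could also ask R to evaluate each single-term deletion from the complete model, using the base ‘drop1’ function with an F test (a quick sketch; each row tests one term’s contribution given all the others):

# Partial F test for dropping each term from the complete model
drop1(reg.m, test = "F")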

REDUCED MULTIPLE LINEAR REGRESSION MODEL: reg.m2

We could try eliminating one of the variables at a time, or both simultaneously; for speed, we will try the latter. Now we fit the reduced multiple linear regression model, again defining prezzo (price) as the response, but this time omitting the input variables ‘Miglia.per.litro’ and ‘Tempi.del.motore’. We assign the result to the variable ‘reg.m2’:

reg.m2 <- lm(dati$Prezzo ~ dati$Cavalli + dati$Lunghezza + dati$Carburante) 
summary(reg.m2)
## 
## Call:
## lm(formula = dati$Prezzo ~ dati$Cavalli + dati$Lunghezza + dati$Carburante)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8150.8 -2165.3    80.2  1827.6 16635.1 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -26682.46    5175.05  -5.156 6.28e-07 ***
## dati$Cavalli          146.84      10.05  14.605  < 2e-16 ***
## dati$Lunghezza        165.48      30.77   5.378 2.19e-07 ***
## dati$Carburantegas  -4536.27    1039.66  -4.363 2.10e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4005 on 191 degrees of freedom
## Multiple R-squared:  0.7567, Adjusted R-squared:  0.7529 
## F-statistic:   198 on 3 and 191 DF,  p-value: < 2.2e-16

All of the p-values are very small, much less than 0.05, so every predictor retained in the reduced model is statistically significant. We can interpret the values in the ‘Estimate’ column: an increase of 10 cavalli would increase the predicted price by about 1468; an increase of 10 units of length would increase it by about 1655; and changing the carburante from 0 (diesel) to 1 (gas) would decrease it by about 4536.

We should also inspect the adjusted R-squared values:

Adjusted R-squared, reg.m: 0.753

Adjusted R-squared, reg.m2: 0.7529

We have very similar, reasonably high adjusted R-squared values, although the reg.m2 model has the advantage of fewer variables. Dropping redundant variables makes it less likely that the remaining predictors are strongly correlated, which would make it difficult to determine each variable’s individual effect; it reduces the risk of overfitting, where the model captures noise or random fluctuations; the model is easier to interpret and explain; computational efficiency improves; and we adhere to Occam’s Razor, the principle that, given competing explanations, the simplest should be preferred. In both models, about 75% of the variance in price is accounted for.
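
Rather than reading these values off the printed summaries, the adjusted R-squared of each model can also be extracted directly (a small sketch):

# Adjusted R-squared values pulled from the summary objects
summary(reg.m)$adj.r.squared
summary(reg.m2)$adj.r.squared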

Is the Reduced Model Better? Hypothesis Testing

H0: the coefficients of miglia.per.litro and tempi.del.motore are zero, that is, the reduced model is adequate.

H1: at least one of these coefficients is NOT zero, that is, the complete model is needed.

We reject H0 if Foss >= F_[q, n-p-1; 1-alpha], where p is the number of predictors in the complete model, q = (p+1) - k is the number of coefficients set to zero (here q = 2, since the reduced model keeps k = 4 parameters), and n-p-1 is the residual degrees of freedom of the complete model.

We can use the ANOVA (ANalysis Of VAriance) procedure to obtain the sums of squares we need. The F-statistic for comparing the two models is based on the ratio of the extra variation explained by the additional variables to the residual variance of the complete model; the higher this ratio, the stronger the evidence against the null hypothesis.

anova.MR <- anova(reg.m2) 
anova.MR
## Analysis of Variance Table
## 
## Response: dati$Prezzo
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## dati$Cavalli      1 8282218500 8282218500 516.326 < 2.2e-16 ***
## dati$Lunghezza    1  940102245  940102245  58.607 9.312e-13 ***
## dati$Carburante   1  305375826  305375826  19.038 2.096e-05 ***
## Residuals       191 3063767615   16040668                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova.MC <- anova(reg.m) 
anova.MC
## Analysis of Variance Table
## 
## Response: Prezzo
##                   Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Cavalli            1 8282218500 8282218500 516.6013 < 2.2e-16 ***
## Miglia.per.litro   1  117642074  117642074   7.3379  0.007372 ** 
## Lunghezza          1  856933134  856933134  53.4510 7.285e-12 ***
## Tempi.del.motore   1    2827469    2827469   0.1764  0.674995    
## Carburante         1  301770606  301770606  18.8229 2.333e-05 ***
## Residuals        189 3030072404   16032129                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To calculate the F statistic, we first extract the residual sums of squares of the two models and define the relevant dimensions:

RSS.MR <- anova.MR$`Sum Sq`[4]   # residual sum of squares, reduced model
RSS.MC <- anova.MC$`Sum Sq`[6]   # residual sum of squares, complete model
p <- dim(dati)[2] - 1            # number of predictors in the complete model (5)
q <- p + 1 - 4                   # number of coefficients set to zero (2)
n <- dim(dati)[1]                # number of observations (195)
d <- n - p - 1                   # residual degrees of freedom of the complete model (189)

Now we can find F_oss (F osservata / observed) using the statistical test F:

Foss <- ((RSS.MR - RSS.MC)/q)/(RSS.MC/d) 
round(Foss,2)
## [1] 1.05

Now we find the cut-off value above which we would reject the null hypothesis. To find the correct quantile of the Fisher (F) distribution, we use the ‘qf’ function, which takes as arguments the required probability (1 - alpha), the numerator degrees of freedom q, and the denominator degrees of freedom d:

f_5 <- qf(0.95, q, d)

Foss >= f_5
## [1] FALSE

Given the FALSE result, we do not reject the null hypothesis: omitting miglia.per.litro and tempi.del.motore does not significantly worsen the fit, so the reduced model is preferred.
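
As a cross-check, the same comparison can be obtained in one step by passing the reduced and complete models to the ‘anova’ function. Because reg.m2 was fitted with the ‘dati$’ notation rather than the ‘data’ argument, the sketch below refits it under the hypothetical name reg.m2b purely for the comparison:

# Refit the reduced model with the data argument so the two fits are directly comparable
reg.m2b <- lm(Prezzo ~ Cavalli + Lunghezza + Carburante, data = dati)
# Partial F test for dropping Miglia.per.litro and Tempi.del.motore
anova(reg.m2b, reg.m)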

Is the Reduced Model Better? Are the Variables in the Reduced Model Correlated?

We can use the ‘vif’ function (variance inflation factors) to detect multicollinearity. The VIF quantifies how much the variance of an estimated regression coefficient is inflated by correlation among the predictors. A VIF of 1 indicates no correlation with the other predictors, values above 1 and up to 5 indicate some multicollinearity, while values above 5 (or above 10, depending on the strictness required) indicate problematic levels of correlation between variables.
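
As a reminder of where these numbers come from, the VIF of a predictor is 1/(1 - R-squared), where the R-squared is obtained by regressing that predictor on all the other predictors. A minimal hand computation for cavalli in the reduced model (a sketch, for comparison with the ‘vif’ output below):

# R-squared from regressing Cavalli on the other predictors of the reduced model
r2.cavalli <- summary(lm(Cavalli ~ Lunghezza + Carburante, data = dati))$r.squared
1 / (1 - r2.cavalli)  # should be close to the VIF reported for Cavalli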

library(car) 
## Loading required package: carData
vif(reg.m2) 
##    dati$Cavalli  dati$Lunghezza dati$Carburante 
##        1.753471        1.782798        1.209475
vif(reg.m)
##          Cavalli Miglia.per.litro        Lunghezza Tempi.del.motore 
##         3.000978         4.710219         2.815907         1.096387 
##       Carburante 
##         1.496387

In both the reduced model and the complete model there are no multicollinearity issues, as all of the VIF values are below 5. In the complete model, Miglia.per.litro has a VIF of about 4.7, which approaches a problematic value, although it is still within the acceptable range.

Confidence Intervals for the Reduced Model Parameters

To obtain confidence intervals for the parameters of our reduced model, we can use the ‘confint’ function, which uses a 95% confidence level by default:

confint(reg.m2)
##                          2.5 %      97.5 %
## (Intercept)        -36890.0486 -16474.8638
## dati$Cavalli          127.0123    166.6771
## dati$Lunghezza        104.7844    226.1817
## dati$Carburantegas  -6586.9679  -2485.5737

The 95% confidence interval is a range of values that we can be 95% confident contains the true value of each coefficient: increasing the vehicle’s engine power by 10 cavalli would most likely lead to a price increase of between about 1270 and 1667.

Increasing the vehicle’s length by 10 units would most likely lead to a price increase of anything from about 1048 to 2262.

Changing the vehicle’s fuel from diesel to gas would most likely reduce its price by anything from 2486 to 6587.
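
These ranges are simply the coefficient intervals rescaled; for example, the interval for a 10-cavalli increase can be obtained directly from the ‘confint’ output (a small sketch):

# 95% interval for the price effect of +10 cavalli: ten times the coefficient interval
round(confint(reg.m2)["dati$Cavalli", ] * 10)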

Checking the Model - Analysis of Residuals

We check the residuals after fitting the model to ensure that the key assumptions behind it (linearity, independence, homoscedasticity, i.e. constant variance, and normality of the errors) are met. Examining the residuals also helps us understand how well the model captures the data, identify outliers and high-leverage points which might be distorting the fit, and spot predictors that are missing or need transforming (using a squared term rather than a linear one, for example).

par(mfrow = c(2,2)) 
plot(reg.m2)

Residuals vs Fitted: given the trend seen in this graph, a better-fitting model might not be purely linear and could include some squared terms.

Q-Q Residuals: the residuals seem to be normally distributed with the exception of the right-most values, where there are a number of outliers.

Residuals vs Leverage: points with standardised residuals outside the range [-2, 2] but with low leverage are vertical outliers (e.g. point 68); points with high leverage and standardised residuals outside [-2, 2] are bad leverage points (e.g. point 122); the point with the highest leverage (the right-most point) has a standardised residual within [-2, 2] and is therefore a good leverage point.
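
The same points can be picked out numerically rather than by eye, using the standardised residuals and the leverage (hat) values (a sketch; the twice-average-leverage threshold is a common rule of thumb, not part of the original analysis):

# Observations with standardised residuals outside [-2, 2] (potential outliers)
which(abs(rstandard(reg.m2)) > 2)
# Observations with leverage above twice the average leverage
which(hatvalues(reg.m2) > 2 * mean(hatvalues(reg.m2)))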

Using the Model to Make Predictions

Given a vehicle with cavalli = 200, lunghezza = 199.8, carburante = diesel, what is the expected price?

First we construct a vector ‘x0’ with four values: intercept, cavalli, lunghezza, and carburante:

x0 <- c(1, 200, 199.8, 0)  # intercept, cavalli, lunghezza, carburante (diesel = 0)

Then we multiply the transposed x0 vector by the vector of estimated coefficients from the reduced model:

t(x0)%*%reg.m2$coefficients
##       [,1]
## [1,] 35750

Using this model, we would expect a vehicle with the above characteristics to have a price of 35750.
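
The same prediction, together with a prediction interval, could also be obtained with the ‘predict’ function. Since reg.m2 was fitted using the ‘dati$’ notation, the sketch below refits the reduced model with the ‘data’ argument (hypothetical name reg.m2b) so that ‘newdata’ works as expected:

# Refit with the data interface, then predict for the new vehicle
reg.m2b <- lm(Prezzo ~ Cavalli + Lunghezza + Carburante, data = dati)
nuovo <- data.frame(Cavalli = 200, Lunghezza = 199.8,
                    Carburante = factor("diesel", levels = levels(dati$Carburante)))
predict(reg.m2b, newdata = nuovo, interval = "prediction")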

Conclusion

The F-statistic was used to compare the reduced (restricted) and complete (unrestricted) models, and the test showed that dropping the two non-significant variables did not significantly worsen the fit. The variance inflation factors indicated no multicollinearity issues. Reducing the number of variables left the model’s adjusted R-squared essentially unchanged while improving its parsimony, robustness and interpretability and reducing risks such as overfitting.

This multiple linear regression model is expected to make fairly accurate predictions as long as the input variables remain within their central range and do not extend to extreme or boundary values. To better test the model’s performance, new data may be required, or the given data could be split into training and testing sets. To further improve the model, squared variables could be considered.
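
A minimal sketch of the train/test evaluation mentioned above (the 70/30 split and the random seed are arbitrary choices, not part of the original analysis):

# Split the data, refit the reduced model on the training set, and measure test error
set.seed(1)
idx   <- sample(seq_len(nrow(dati)), size = round(0.7 * nrow(dati)))
train <- dati[idx, ]
test  <- dati[-idx, ]
fit   <- lm(Prezzo ~ Cavalli + Lunghezza + Carburante, data = train)
rmse  <- sqrt(mean((test$Prezzo - predict(fit, newdata = test))^2))
rmse  # root mean squared prediction error on the held-out data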