Multiple linear regression is a widely used statistical method for modelling the relationship between a dependent variable and one or more independent variables. In this analysis, we aim to develop a robust and interpretable linear regression model while addressing common challenges such as multicollinearity, overfitting, and variable selection. Key steps include evaluating the significance of predictors using the F-statistic and p-values, assessing multicollinearity through variance inflation factors, and exploring the impact of reducing the number of variables on model performance. Ready? Let’s get started.
Load the dataset on automobiles and assign it to the variable ‘dati’:
dati <- read.delim("Automobili.mult.txt", header = TRUE, sep = ",")
Let’s examine the structure of the dataset:
str(dati)
## 'data.frame': 195 obs. of 6 variables:
## $ Cavalli : int 111 111 154 102 115 110 110 110 140 101 ...
## $ Miglia.per.litro: int 27 27 26 30 22 25 25 25 20 29 ...
## $ Lunghezza : num 169 169 171 177 177 ...
## $ Tempi.del.motore: num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
## $ Prezzo : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...
## $ Carburante : chr "gas" "gas" "gas" "gas" ...
We have 195 observations of 6 variables: cavalli (horsepower), miglia.per.litro (miles per litre), lunghezza (length), tempi.del.motore (engine timing), prezzo (price) and carburante (fuel type). The first five variables are quantitative; the sixth is categorical.
head(dati$Carburante)
## [1] "gas" "gas" "gas" "gas" "gas" "gas"
As the 6th attribute is categorical (type: character), we use the ‘factor’ function to convert it into a factor. Factors store the data as integer codes associated with labelled levels, which is memory-efficient, and the levels receive a defined order (alphabetical by default), which can be important for analysis or plotting.
dati$Carburante <- factor(dati$Carburante)
Now we re-examine the 6th variable:
head(dati$Carburante)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
str(dati)
## 'data.frame': 195 obs. of 6 variables:
## $ Cavalli : int 111 111 154 102 115 110 110 110 140 101 ...
## $ Miglia.per.litro: int 27 27 26 30 22 25 25 25 20 29 ...
## $ Lunghezza : num 169 169 171 177 177 ...
## $ Tempi.del.motore: num 2.68 2.68 3.47 3.4 3.4 3.4 3.4 3.4 3.4 2.8 ...
## $ Prezzo : int 13495 16500 16500 13950 17450 15250 17710 18920 23875 16430 ...
## $ Carburante : Factor w/ 2 levels "diesel","gas": 2 2 2 2 2 2 2 2 2 2 ...
We can see the ‘Carburante’ variable is now a factor with two levels: ‘diesel’ and ‘gas’.
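As a quick sanity check on the class balance, we could tabulate the factor; it should show 20 diesel and 175 gas observations, matching the summary further below:
table(dati$Carburante)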
To see an overview of the data, we can use the ‘summary’ function:
summary(dati)
## Cavalli Miglia.per.litro Lunghezza Tempi.del.motore
## Min. : 48.0 Min. :16.00 Min. :141.1 Min. :2.07
## 1st Qu.: 70.0 1st Qu.:25.00 1st Qu.:166.3 1st Qu.:3.11
## Median : 95.0 Median :30.00 Median :173.2 Median :3.29
## Mean :103.3 Mean :30.84 Mean :174.3 Mean :3.25
## 3rd Qu.:116.0 3rd Qu.:35.00 3rd Qu.:184.1 3rd Qu.:3.41
## Max. :262.0 Max. :54.00 Max. :208.1 Max. :4.17
## Prezzo Carburante
## Min. : 5118 diesel: 20
## 1st Qu.: 7756 gas :175
## Median :10245
## Mean :13248
## 3rd Qu.:16509
## Max. :45400
To see a visual overview of the data, we can plot histograms of each numerical variable, first setting up a 3 by 2 plotting frame using the ‘par’ function:
par(mfrow = c(3,2))
hist(dati$Cavalli)
hist(dati$Miglia.per.litro)
hist(dati$Lunghezza)
hist(dati$Tempi.del.motore)
hist(dati$Prezzo)
We have positively skewed data for most variables, whilst the lunghezza (length) data appears to be normally distributed. Tempi del motore seems to be slightly negatively skewed.
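Given the positive skew of prezzo, one option sometimes considered (though not pursued in the models below) is a log transform of the response; a quick visual check:
hist(log(dati$Prezzo), main = "Histogram of log(Prezzo)")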
We can now use the ‘pairs’ function to plot scatterplots between every pair of variables:
pairs(dati)
We can immediately see that as the ‘Carburante’ category is a factor with two levels, coded 1 and 2 (diesel and gas), it does not make sense to include it in the scatterplot matrix. We can rerun the ‘pairs’ function, eliminating the 6th column with the command: ‘pairs(dati[,-6])’. If we had a lot of columns of data, and we knew we wanted to drop the last one, we could use the ‘dim’ function to identify the last column:
dim(dati)
## [1] 195 6
dim(dati)[2]
## [1] 6
pairs(dati[,-dim(dati)[2]])
We can see there is little to no correlation between engine timing (tempi del motore) and the other variables. There appear to be correlations between the other variables, although the relationship does not always seem linear (prezzo and miglia per litro, for example). This non-linearity could potentially cause problems, but we will press on to explore what happens. We can check the ‘r’ values between each pair of variables using the ‘cor’ function:
round(cor(dati[,-6]),2)
## Cavalli Miglia.per.litro Lunghezza Tempi.del.motore Prezzo
## Cavalli 1.00 -0.81 0.58 0.10 0.81
## Miglia.per.litro -0.81 1.00 -0.72 -0.04 -0.72
## Lunghezza 0.58 -0.72 1.00 0.12 0.70
## Tempi.del.motore 0.10 -0.04 0.12 1.00 0.09
## Prezzo 0.81 -0.72 0.70 0.09 1.00
The low values in the ‘tempi del motore’ column confirm the lack of correlation between the 4th variable and the others. All other correlations have magnitude 0.58 or higher.
Before performing our linear regression, we should check the contrast coding for the factor carburante:
contrasts(dati$Carburante )
## gas
## diesel 0
## gas 1
For the carburante variable, R creates a dummy variable ‘Carburantegas’ which is 1 for ‘gas’ and 0 for ‘diesel’, making ‘diesel’ the baseline level.
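If we preferred ‘gas’ as the baseline level instead, we could relevel the factor before fitting; we keep the default alphabetical coding here:
# Not run: would make 'gas' the baseline level
# dati$Carburante <- relevel(dati$Carburante, ref = "gas")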
Now we fit the linear multiple regression model, defining prezzo (price) as the response and all variables as inputs, taking the data from our prepared dataset. We assign the result to ‘reg.m’:
reg.m <- lm(Prezzo ~ ., data = dati)
summary(reg.m)
##
## Call:
## lm(formula = Prezzo ~ ., data = dati)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8139.3 -2357.8 27.1 1655.1 16683.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -20561.53 10131.20 -2.030 0.0438 *
## Cavalli 147.62 13.15 11.226 < 2e-16 ***
## Miglia.per.litro -13.54 91.36 -0.148 0.8823
## Lunghezza 160.42 38.66 4.149 5.04e-05 ***
## Tempi.del.motore -1375.50 958.27 -1.435 0.1528
## Carburantegas -5015.84 1156.11 -4.339 2.33e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4004 on 189 degrees of freedom
## Multiple R-squared: 0.7594, Adjusted R-squared: 0.753
## F-statistic: 119.3 on 5 and 189 DF, p-value: < 2.2e-16
The large p-values (p > 0.05) in the last column of the ‘Coefficients’ section of the summary tell us that ‘Miglia.per.litro’ and ‘Tempi.del.motore’ are not statistically significant predictors once the other variables are in the model. This suggests we may be able to simplify the model by omitting these variables without losing much explanatory power.
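Before dropping variables, we could also quantify multicollinearity with variance inflation factors, as flagged in the introduction. A minimal sketch, assuming the ‘car’ package is installed (output not shown):
library(car)  # assumed available; provides the vif() function
vif(reg.m)    # values well above 5-10 would flag problematic collinearity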
We could try eliminating one of the variables at a time, or both simultaneously. For speed, we will try the latter. Now we fit the reduced linear multiple regression model, again defining prezzo (price) as the response, but this time eliminating the input variables ‘Miglia.per.litro’ and ‘Tempi.del.motore’. We assign the result to the variable ‘reg.m2’:
reg.m2 <- lm(dati$Prezzo ~ dati$Cavalli + dati$Lunghezza + dati$Carburante)
summary(reg.m2)
##
## Call:
## lm(formula = dati$Prezzo ~ dati$Cavalli + dati$Lunghezza + dati$Carburante)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8150.8 -2165.3 80.2 1827.6 16635.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -26682.46 5175.05 -5.156 6.28e-07 ***
## dati$Cavalli 146.84 10.05 14.605 < 2e-16 ***
## dati$Lunghezza 165.48 30.77 5.378 2.19e-07 ***
## dati$Carburantegas -4536.27 1039.66 -4.363 2.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4005 on 191 degrees of freedom
## Multiple R-squared: 0.7567, Adjusted R-squared: 0.7529
## F-statistic: 198 on 3 and 191 DF, p-value: < 2.2e-16
All of the p-values are very small, much less than 0.05, so every predictor retained in the reduced model is statistically significant. We can interpret the values in the ‘Estimate’ column: an increase of 10 cavalli would increase the price by about 1468; an increase of 10 units of length would increase the price by about 1655; and changing the carburante from 0 (diesel) to 1 (gas) would decrease the price by about 4536.
We should also compare the adjusted R-squared values:
Adjusted R-squared, reg.m: 0.753
Adjusted R-squared, reg.m2: 0.7529
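These values can also be extracted programmatically from the model summaries:
summary(reg.m)$adj.r.squared   # complete model
summary(reg.m2)$adj.r.squared  # reduced model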
We have very similar, reasonably high adjusted R-squared values, although the reg.m2 model has the advantage of fewer variables. Fewer variables means:
it is less likely that the predictors are correlated with each other, which would make it difficult to determine each variable’s individual effect;
we avoid overfitting, where the model captures too much noise or random fluctuation;
the model is easier to interpret and explain;
we improve computational efficiency; and
we adhere to Occam’s Razor, that is, given competing explanations, the simplest should be preferred.
In both models, about 75% of the variance in price is accounted for.
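As an aside, this kind of backward elimination can be automated with R’s built-in ‘step’ function, which drops terms by AIC rather than by p-values (a sketch; output not shown):
reg.step <- step(reg.m, direction = "backward")  # backward elimination by AIC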
We can formally test whether dropping the two variables worsens the fit:
H0: the parameters for miglia.per.litro and tempi.del.motore are zero, that is, the reduced model is adequate.
H1: the parameters for miglia.per.litro and tempi.del.motore are NOT zero, that is, the complete model is better.
We reject H0 if Foss >= F(q, d; 1 - alpha), the (1 - alpha) quantile of the Fisher F distribution with q numerator and d denominator degrees of freedom (q and d are defined below).
We can use the ANOVA (ANalysis Of VAriance) tables of the two models to extract the quantities needed for this test. The F-statistic is based on the ratio of the reduction in residual error per dropped parameter to the residual variance of the complete model; the higher this ratio, the stronger the evidence against the null hypothesis.
anova.MR <- anova(reg.m2)
anova.MR
## Analysis of Variance Table
##
## Response: dati$Prezzo
## Df Sum Sq Mean Sq F value Pr(>F)
## dati$Cavalli 1 8282218500 8282218500 516.326 < 2.2e-16 ***
## dati$Lunghezza 1 940102245 940102245 58.607 9.312e-13 ***
## dati$Carburante 1 305375826 305375826 19.038 2.096e-05 ***
## Residuals 191 3063767615 16040668
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova.MC <- anova(reg.m)
anova.MC
## Analysis of Variance Table
##
## Response: Prezzo
## Df Sum Sq Mean Sq F value Pr(>F)
## Cavalli 1 8282218500 8282218500 516.6013 < 2.2e-16 ***
## Miglia.per.litro 1 117642074 117642074 7.3379 0.007372 **
## Lunghezza 1 856933134 856933134 53.4510 7.285e-12 ***
## Tempi.del.motore 1 2827469 2827469 0.1764 0.674995
## Carburante 1 301770606 301770606 18.8229 2.333e-05 ***
## Residuals 189 3030072404 16032129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To calculate the F statistic, we first isolate:
the residual sums of squares (RSS) of each of the models;
q, the number of predictors dropped from the complete model (here 2: Miglia.per.litro and Tempi.del.motore); and
d, the residual degrees of freedom of the complete model, that is, the number of observations less the number of predictors in the complete model, less one.
RSS.MR <- anova.MR$`Sum Sq`[4]  # residual sum of squares, reduced model
RSS.MC <- anova.MC$`Sum Sq`[6]  # residual sum of squares, complete model
n <- dim(dati)[1]               # number of observations
q <- 2                          # predictors dropped from the complete model
d <- n - 5 - 1                  # residual df of the complete model (5 predictors)
Now we can compute Foss (F osservata / observed F) using the test statistic:
Foss <- ((RSS.MR-RSS.MC)/q)/(RSS.MC/d)
round(Foss,2)
## [1] 1.05
Now we find the cut-off value above which we reject the null hypothesis. To find the correct quantile of the Fisher distribution, we use the ‘qf’ function, which takes as arguments: the probability level (1 - alpha), the numerator degrees of freedom q, and the denominator degrees of freedom d:
f_5 <- qf(0.95, q, d)
Foss >= f_5
## [1] FALSE
Given the FALSE result, we do not reject the null hypothesis: there is no evidence that miglia.per.litro and tempi.del.motore improve the fit, so the reduced model that omits them is preferred.
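As a cross-check, R can perform this partial F-test directly by passing both fitted models to ‘anova’; assuming both models were fitted to the same observations, this should reproduce the same F statistic and p-value (output not shown):
anova(reg.m2, reg.m)  # direct comparison of reduced vs complete model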
To identify the confidence intervals for the parameters of our reduced model, we can use the ‘confint’ function, which defaults to a 95% confidence level:
confint(reg.m2)
## 2.5 % 97.5 %
## (Intercept) -36890.0486 -16474.8638
## dati$Cavalli 127.0123 166.6771
## dati$Lunghezza 104.7844 226.1817
## dati$Carburantegas -6586.9679 -2485.5737
The 95% confidence interval is a range of values that, with 95% confidence, contains the true coefficient: increasing the vehicle’s engine power by 10 cavalli would most likely lead to a price increase of between 1270 and 1667.
Increasing the vehicle’s length by 10 units would most likely lead to a price increase of anything from 1048 to 2262.
Changing the vehicle’s fuel from diesel to gas would most likely reduce its price by anything from 2486 to 6587.
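If a more conservative range were wanted, a different confidence level can be requested:
confint(reg.m2, level = 0.99)  # 99% confidence intervals (output not shown)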
We check the residuals after fitting the model to ensure that the key assumptions behind it (linearity, independence, homoscedasticity (constant variance) and normality of errors) are met. Residual plots also help us understand how well the model captures the data, identify outliers and high-leverage points which might be distorting the fit, and spot predictors which are missing or need to be altered (using a squared variable rather than a linear one, for example).
par(mfrow = c(2,2))
plot(reg.m2)
Residuals vs Fitted: Given the trend seen in this graph, it is likely that a better fitting model is not linear but might include some x-squared terms.
Q-Q Residuals: the residuals seem to be normally distributed with the exception of the right-most values, where there are a number of outliers.
Residuals vs Leverage: points with standardised residuals outside the range [-2, 2] but with low leverage are vertical outliers (e.g. point 68); points with high leverage and standardised residuals outside [-2, 2] are bad leverage points (e.g. point 122); the point with the highest leverage (the right-most point) has a standardised residual within [-2, 2], and hence is a good leverage point.
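The points flagged in these plots can also be listed programmatically. A sketch, where the [-2, 2] and twice-mean-leverage thresholds are common rules of thumb rather than part of the original analysis:
which(abs(rstandard(reg.m2)) > 2)  # potential vertical outliers
h <- hatvalues(reg.m2)             # leverage of each observation
which(h > 2 * mean(h))             # high-leverage observations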
Given a vehicle with cavalli = 200, lunghezza = 199.8, carburante = diesel, what is the expected price?
First we construct a vector ‘x0’ with four values: a 1 for the intercept, cavalli (200), lunghezza (199.8), and carburante (0 for diesel):
x0 <- c(1, 200, 199.8, 0)
Then we multiply the transposed x0 vector by the vector of coefficients from the reduced model:
t(x0)%*%reg.m2$coefficients
## [,1]
## [1,] 35750
Using this model, we would expect a vehicle with the above characteristics to have a price of 35750.
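Equivalently, had we fitted the reduced model using the ‘data’ argument form, the ‘predict’ function could compute this for us and also return a prediction interval. A sketch, where ‘reg.m2b’ and ‘nuovo’ are hypothetical names:
reg.m2b <- lm(Prezzo ~ Cavalli + Lunghezza + Carburante, data = dati)  # same model, data-argument form
nuovo <- data.frame(Cavalli = 200, Lunghezza = 199.8, Carburante = "diesel")
predict(reg.m2b, newdata = nuovo, interval = "prediction")  # point estimate ~35750, plus interval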
The F-statistic was used to compare the reduced and complete models, and the p-value showed no statistically significant loss of fit from dropping the two variables. Variance inflation factors (sketched above) offer a further check that multicollinearity is not an issue among the retained predictors. Reducing the number of variables did not significantly change the model’s adjusted R-squared value, but it improved the model’s parsimony (“thriftiness”), robustness and interpretability, while reducing risks like overfitting.
This multiple linear regression model is expected to make fairly accurate predictions as long as the input variables remain within their central range and do not extend to extreme or boundary values. To better test the model’s performance, new data may be required, or the given data could be split into training and testing sets. To further improve the model, squared variables could be considered.