Multiple Linear Regression (MLR) is a statistical technique used to examine the relationship between one dependent variable and two or more independent variables. It extends simple linear regression by allowing researchers to assess the combined effect of multiple predictors on an outcome variable. MLR is widely applied in various fields, including economics, business, healthcare, engineering, and social sciences, to understand complex relationships and make predictions.
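The model is expressed mathematically as:
\[
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
\]
where \(Y\) is the dependent variable, \(X_1, \dots, X_p\) are the independent variables, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the regression coefficients, and \(\varepsilon\) is the random error term.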
In this study, a real-world dataset is analyzed using Multiple Linear Regression to identify the factors that significantly influence the dependent variable. The model estimates how changes in the independent variables affect the outcome while controlling for the effects of other predictors. By fitting a regression model and evaluating its performance, insights can be obtained regarding the strength, direction, and significance of the relationships among the variables.
The analysis involves data exploration, model fitting, interpretation of regression coefficients, assessment of statistical significance, and evaluation of model assumptions. The findings from this study can support data-driven decision-making and provide a deeper understanding of the factors associated with the response variable.
As a worked example, we predict house prices using the Boston Housing dataset, which is included in the MASS package.
library(MASS)  # provides the Boston data set
data(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# Fit a multiple linear regression of medv on five predictors
model <- lm(medv ~ rm + lstat + ptratio + crim + tax,
            data = Boston)
summary(model)
##
## Call:
## lm(formula = medv ~ rm + lstat + ptratio + crim + tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3602 -3.1111 -0.9237 1.6569 30.4116
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.7488084 4.0001180 4.187 3.34e-05 ***
## rm 4.6349234 0.4292367 10.798 < 2e-16 ***
## lstat -0.5280046 0.0480346 -10.992 < 2e-16 ***
## ptratio -0.8731668 0.1251429 -6.977 9.59e-12 ***
## crim -0.0593795 0.0339830 -1.747 0.0812 .
## tax -0.0008196 0.0019328 -0.424 0.6717
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6784
## F-statistic: 214.1 on 5 and 500 DF, p-value: < 2.2e-16
formula(model)
## medv ~ rm + lstat + ptratio + crim + tax
round(coef(model),3)
## (Intercept) rm lstat ptratio crim tax
## 16.749 4.635 -0.528 -0.873 -0.059 -0.001
The estimated multiple linear regression model is:
\[ medv = 16.749 + 4.635\,rm - 0.528\,lstat - 0.873\,ptratio - 0.059\,crim - 0.001\,tax \]
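Holding all other variables constant, a one-unit increase in the average number of rooms (rm) is associated with an increase of 4.635 units in the predicted median house value; because medv is recorded in thousands of dollars, this is roughly $4,635. Conversely, one-unit increases in lstat, ptratio, crim, and tax are associated with decreases of 0.528, 0.873, 0.059, and 0.001 units, respectively. Only rm, lstat, and ptratio are statistically significant at the 5% level. The coefficients for crim (p = 0.081) and tax (p = 0.672) are not, so their estimated effects should be interpreted with caution.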
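The "holding all other variables constant" reading can be checked numerically. The following is a minimal sketch, not part of the original analysis: it builds a hypothetical suburb with every predictor at its sample median (the names `base` and `plus_room` are illustrative assumptions), raises rm by one, and compares the two predictions.

```r
library(MASS)   # provides the Boston data set
data(Boston)

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)

# Hypothetical suburb: every predictor set to its sample median
base <- data.frame(
  rm      = median(Boston$rm),
  lstat   = median(Boston$lstat),
  ptratio = median(Boston$ptratio),
  crim    = median(Boston$crim),
  tax     = median(Boston$tax)
)

# The same suburb with one additional room per dwelling on average
plus_room <- transform(base, rm = rm + 1)

pred_base <- predict(model, newdata = base)
pred_plus <- predict(model, newdata = plus_room)

# The difference between the two predictions equals the rm coefficient
pred_plus - pred_base
coef(model)["rm"]
```

Because the model is linear and has no interaction terms, the difference does not depend on which values are chosen for the other predictors.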
library(ggplot2)
library(MASS)
# Refit the five-predictor model so this chunk runs on its own
model <- lm(medv ~ rm + lstat + ptratio + crim + tax,
            data = Boston)
# Pair each observed medv with its in-sample fitted value
results <- data.frame(
  Actual = Boston$medv,
  Predicted = predict(model)
)
ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.7) +
  # 45-degree line: points on it are predicted exactly
  geom_abline(intercept = 0, slope = 1,
              color = "red", linewidth = 1) +
  labs(title = "Actual vs Predicted Values",
       x = "Actual medv",
       y = "Predicted medv") +
  theme_minimal()
The Actual vs Predicted plot compares the observed median house values (medv) with the values predicted by the multiple linear regression model. The red 45-degree reference line represents perfect predictions, where the predicted values are exactly equal to the actual values.
The scatter points are generally clustered around the reference line, indicating that the model is able to predict house values reasonably well. Points that lie close to the line correspond to observations with small prediction errors, while points farther from the line indicate larger discrepancies between the actual and predicted values.
The spread of points around the reference line suggests that the model captures a substantial portion of the variation in house values, although some prediction errors remain. A systematic pattern, such as points lying consistently above or below the line over some range of actual values, would indicate bias in that range; points scattered randomly around the line would suggest no such bias. The residual summary reports a maximum residual of 30.41, so at least one suburb is underpredicted by roughly $30,400.
Overall, the model provides a reasonable fit: it explains about 68% of the variation in medv (multiple R-squared = 0.6816) with a residual standard error of 5.215, or roughly $5,200, using rm, lstat, ptratio, crim, and tax as predictors.
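The visual impression from the plot can be backed by numeric accuracy measures. The sketch below computes in-sample RMSE, MAE, and R-squared directly from the actual and predicted values; the variable names are illustrative, and the measures are in-sample, so they are likely to be optimistic for new data.

```r
library(MASS)   # provides the Boston data set
data(Boston)

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)

actual    <- Boston$medv
predicted <- predict(model)   # in-sample fitted values

rmse <- sqrt(mean((actual - predicted)^2))   # root mean squared error
mae  <- mean(abs(actual - predicted))        # mean absolute error

# R-squared as 1 - (residual sum of squares / total sum of squares)
r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)

c(RMSE = rmse, MAE = mae, R2 = r2)
```

The R-squared computed this way matches the "Multiple R-squared" line of `summary(model)`. The RMSE is slightly smaller than the residual standard error because it divides by the number of observations rather than the residual degrees of freedom.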
par(mfrow = c(2,2))
plot(model)
The diagnostic plots suggest that the linear regression model is
generally appropriate, but there are signs of slight non-linearity and
heteroscedasticity in the Residuals vs Fitted and Scale-Location plots.
The Q-Q plot shows some deviation from normality at the tails,
indicating possible outliers or non-normal residuals. The Residuals vs
Leverage plot indicates a few influential observations, but none appear
extremely dominant.
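The visual diagnostics can be complemented with a few numeric checks from base R. This is a sketch under stated assumptions: the 4/n cutoff for Cook's distance is a common rule of thumb rather than a formal test, and the correlation between fitted values and absolute residuals is only a crude indicator of non-constant variance.

```r
library(MASS)   # provides the Boston data set
data(Boston)

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)
res <- residuals(model)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
shapiro.test(res)

# Cook's distance: flag observations above the 4/n rule of thumb (assumed cutoff)
cd <- cooks.distance(model)
influential <- which(cd > 4 / nrow(Boston))
head(sort(cd[influential], decreasing = TRUE))

# Crude heteroscedasticity check: do absolute residuals vary with fitted values?
cor(fitted(model), abs(res))
```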
hist(residuals(model),
     xlab = "Residuals",
     ylab = "Frequencies",
     col = "skyblue",
     main = "Residuals")
The histogram shows that the residuals are centered near zero. This is
expected, because least-squares residuals from a model with an intercept
always average to zero, so it is not by itself evidence of an unbiased
model. The distribution is right-skewed: the median residual is -0.92
while the largest is 30.41, indicating a few large positive residuals
(suburbs the model underpredicts). The normality assumption is therefore
not perfectly satisfied, and a few outliers may be affecting the model
fit.
Variable selection is an important process in regression analysis that involves selecting the most relevant independent variables for predicting a response variable. The purpose is to improve model performance, reduce complexity, and make the model easier to interpret. This report demonstrates variable selection methods using the Boston Housing dataset, where the objective is to predict the median value of owner-occupied homes (medv) using several explanatory variables.
The Boston Housing dataset contains information collected from different suburbs of Boston. It includes 506 observations and 14 variables. The response variable is medv, which represents the median value of owner-occupied homes in thousands of dollars. The predictor variables include crime rate, average number of rooms, accessibility to highways, property tax rate, pupil-teacher ratio, and others.
A correlation matrix helps identify variables that are strongly related to house prices (medv).
library(MASS)
library(corrplot)
## corrplot 0.95 loaded
data(Boston)
# Pairwise Pearson correlations between all 14 variables
cor_matrix <- cor(Boston)
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7)
Variables such as \(rm\) tend to have a positive correlation with \(medv\), while \(lstat\) tends to have a strong negative correlation.
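The correlations with medv can also be read off numerically rather than from the colour scale. A minimal sketch, with `cor_with_medv` as an illustrative name:

```r
library(MASS)   # provides the Boston data set
data(Boston)

# Correlation of every variable with medv (the "medv" column of the matrix)
cor_with_medv <- cor(Boston)[, "medv"]

# Drop the trivial correlation of medv with itself
cor_with_medv <- cor_with_medv[names(cor_with_medv) != "medv"]

# Sorted from most negative to most positive
sort(cor_with_medv)
```

Pairwise correlations describe each predictor in isolation; a predictor's coefficient in a multiple regression can differ in size once the other predictors are accounted for.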
library(ggplot2)
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Relationship between LSTAT and MEDV",
       x = "Lower Status Population (%)",
       y = "Median House Value")
## `geom_smooth()` using formula = 'y ~ x'
Forward selection starts with no predictor variables and adds variables one at a time. R's step() function, used here, adds at each step the variable that lowers the Akaike information criterion (AIC) the most, and stops when no addition lowers the AIC further.
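# Upper bound of the search: all 13 predictors
full_model <- lm(medv ~ ., data = Boston)
# Starting point: intercept only
null_model <- lm(medv ~ 1, data = Boston)
forward_model <- step(
  null_model,
  scope = list(lower = null_model,
               upper = full_model),
  direction = "forward",
  trace = 0   # suppress the step-by-step log
)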
summary(forward_model)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas +
## black + zn + crim + rad + tax, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## black 0.009291 0.002674 3.475 0.000557 ***
## zn 0.045845 0.013523 3.390 0.000754 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
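To see what a single forward step looks like, the candidates for the first addition can be compared with `add1()`. This sketch is an illustration rather than part of the original analysis; `first_step` is an assumed name. The row with the lowest AIC is the variable that forward selection adds first, which is consistent with lstat appearing first in the selected formula above.

```r
library(MASS)   # provides the Boston data set
data(Boston)

full_model <- lm(medv ~ ., data = Boston)
null_model <- lm(medv ~ 1, data = Boston)

# AIC of the intercept-only model after adding each candidate predictor in turn
first_step <- add1(null_model, scope = formula(full_model))

# Candidates ordered from best (lowest AIC) to worst
first_step[order(first_step$AIC), ]
```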
Backward elimination starts with all predictor variables included in the model. At each step, step() removes the variable whose removal lowers the AIC the most, and it stops when no removal lowers the AIC further.
backward_model <- step(
  full_model,
  direction = "backward",
  trace = 0
)
summary(backward_model)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Stepwise selection combines forward selection and backward elimination. At each step a variable can be either added or removed, whichever change lowers the AIC the most.
stepwise_model <- step(
  full_model,
  direction = "both",
  trace = 0
)
summary(stepwise_model)
##
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## crim -0.108413 0.032779 -3.307 0.001010 **
## zn 0.045845 0.013523 3.390 0.000754 ***
## chas 2.718716 0.854240 3.183 0.001551 **
## nox -17.376023 3.535243 -4.915 1.21e-06 ***
## rm 3.801579 0.406316 9.356 < 2e-16 ***
## dis -1.492711 0.185731 -8.037 6.84e-15 ***
## rad 0.299608 0.063402 4.726 3.00e-06 ***
## tax -0.011778 0.003372 -3.493 0.000521 ***
## ptratio -0.946525 0.129066 -7.334 9.24e-13 ***
## black 0.009291 0.002674 3.475 0.000557 ***
## lstat -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
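The three summaries above show the same coefficients in a different order. A short sketch can confirm programmatically that the procedures agree and show which predictors were dropped; the helper `kept()` is a hypothetical convenience function, not part of any package.

```r
library(MASS)   # provides the Boston data set
data(Boston)

full_model <- lm(medv ~ ., data = Boston)
null_model <- lm(medv ~ 1, data = Boston)

forward_model <- step(null_model,
                      scope = list(lower = formula(null_model),
                                   upper = formula(full_model)),
                      direction = "forward", trace = 0)
backward_model <- step(full_model, direction = "backward", trace = 0)
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Hypothetical helper: the predictors a model retains, in alphabetical order
kept <- function(m) sort(attr(terms(m), "term.labels"))

# Do the three procedures retain the same predictors?
identical(kept(forward_model), kept(backward_model))
identical(kept(backward_model), kept(stepwise_model))

# Predictors dropped relative to the full model
setdiff(kept(full_model), kept(stepwise_model))

# AIC of the full and selected models (lower is better)
AIC(full_model, stepwise_model)
```

Agreement between the three directions is a property of this dataset, not a general guarantee; on other data the procedures can end at different models.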
All three procedures select the same 11-predictor model, dropping only indus and age from the 13 available predictors. The retained predictors include:
rm (average number of rooms per dwelling)
lstat (percentage of lower-status population)
ptratio (pupil-teacher ratio)
dis (weighted distances to employment centers)
crim (per capita crime rate)
tax (property tax rate)
The other retained predictors are nox, chas, black, zn, and rad. Every retained coefficient is statistically significant at the 1% level.
In the selected model, rm has a positive coefficient (3.80), indicating that suburbs with more rooms per dwelling tend to have higher median home values. In contrast, lstat has a negative coefficient (-0.52), indicating that areas with a higher percentage of lower-status population tend to have lower housing values.
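# Collect the full model's slope coefficients (intercept dropped).
# Raw coefficients depend on each predictor's units, so bar lengths
# are not directly comparable across variables.
coef_df <- data.frame(
  Variable = names(coef(full_model))[-1],
  Coefficient = coef(full_model)[-1]
)
library(ggplot2)
# Horizontal bar chart, ordered by absolute coefficient size
ggplot(coef_df,
       aes(x = reorder(Variable, abs(Coefficient)),
           y = Coefficient)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Regression Coefficients",
       x = "Variables",
       y = "Coefficient")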
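The bar chart of raw coefficients is dominated by predictors measured on small scales (nox, for example, ranges only from 0.385 to 0.871). One possible way to make the bars comparable, sketched here as an assumption rather than as the method used in this report, is to standardize each predictor to mean 0 and standard deviation 1 before fitting; `Boston_std` and `std_model` are illustrative names.

```r
library(MASS)   # provides the Boston data set
data(Boston)

predictors <- setdiff(names(Boston), "medv")

# Standardize every predictor; medv stays in its original units
Boston_std <- Boston
Boston_std[predictors] <- lapply(Boston[predictors],
                                 function(x) as.numeric(scale(x)))

std_model <- lm(medv ~ ., data = Boston_std)

# Change in medv per one-standard-deviation change in each predictor,
# ordered by absolute size
std_coef <- coef(std_model)[-1]
std_coef[order(abs(std_coef), decreasing = TRUE)]
```

Standardizing changes only the scale of the coefficients; the fitted values and R-squared are the same as for the unstandardized full model.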
# In-sample predictions from the full model
pred <- predict(full_model)
ggplot(data.frame(Actual = Boston$medv,
                  Predicted = pred),
       aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1,
              intercept = 0,
              linetype = "dashed") +
  labs(title = "Predicted vs Actual House Prices")
The results show that not all variables contribute equally to predicting housing prices. Variable selection methods help identify the most influential predictors and remove variables with little explanatory power. Forward selection builds the model gradually, while backward elimination simplifies a complete model by removing unimportant variables. Stepwise selection combines both approaches. Here all three procedures arrive at the same model, which drops indus and age and raises the adjusted R-squared from 0.6784 for the five-predictor model to 0.7348.
Using fewer but important variables makes the model easier to interpret and may improve prediction performance when applied to new data. In the Boston Housing dataset, housing prices are strongly influenced by structural characteristics such as the number of rooms and socioeconomic factors such as population status and educational conditions.
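Whether a smaller model actually predicts new data better can only be judged on observations that were not used for fitting. The sketch below illustrates one way to check this with a single random train/test split; the 80/20 proportion, the seed, and the helper `rmse()` are assumptions made for illustration, and a single split gives a noisy estimate.

```r
library(MASS)   # provides the Boston data set
data(Boston)

set.seed(1)   # arbitrary seed, used only for reproducibility
n <- nrow(Boston)
train_idx <- sample(n, size = round(0.8 * n))   # assumed 80/20 split
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]

# Fit the full model and run stepwise selection on the training rows only
full_fit     <- lm(medv ~ ., data = train)
selected_fit <- step(full_fit, direction = "both", trace = 0)

# Hypothetical helper: root mean squared error on a held-out data frame
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$medv - predict(fit, newdata = newdata))^2))
}

c(full = rmse(full_fit, test), selected = rmse(selected_fit, test))
```

Repeating the split with different seeds, or using k-fold cross-validation, would give a more stable comparison between the two models.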
Variable selection is a useful technique for improving regression models by identifying the most important predictors. Using the Boston Housing dataset, forward selection, backward elimination, and stepwise selection can be applied to determine which variables best explain variations in housing prices. The selected variables provide valuable insights into the factors affecting property values and contribute to the development of a more efficient and interpretable predictive model.