Multiple Linear Regression & Variable Selection Methods

1. Multiple Linear Regression

Multiple Linear Regression (MLR) is a statistical technique used to examine the relationship between one dependent variable and two or more independent variables. It extends simple linear regression by allowing researchers to assess the combined effect of multiple predictors on an outcome variable. MLR is widely applied in various fields, including economics, business, healthcare, engineering, and social sciences, to understand complex relationships and make predictions.

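Mathematically, the model is expressed as:
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon \]

where \(Y\) is the dependent variable, \(X_1, \dots, X_p\) are the independent variables, \(\beta_0\) is the intercept, \(\beta_1, \dots, \beta_p\) are the regression coefficients, and \(\varepsilon\) is the random error term.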
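As a minimal sketch of how the coefficients of such a model are estimated, the least-squares solution can be computed directly from the design matrix and compared with lm(). The simulated predictors, the sample size, and the true coefficient values below are assumptions chosen only for illustration; they are not part of the analysis in this report.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least-squares estimate: solve the normal equations (X'X) b = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))
beta_hat

# lm() should return the same estimates up to numerical precision
fit <- lm(y ~ x1 + x2)
max_abs_diff <- max(abs(beta_hat - coef(fit)))
max_abs_diff
```

In practice lm() is preferred because it uses a numerically more stable QR decomposition, but the normal equations make the link to the formula explicit.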

In this study, a real-world dataset is analyzed using Multiple Linear Regression to identify the factors that significantly influence the dependent variable. The model estimates how changes in the independent variables affect the outcome while controlling for the effects of other predictors. By fitting a regression model and evaluating its performance, insights can be obtained regarding the strength, direction, and significance of the relationships among the variables.

The analysis involves data exploration, model fitting, interpretation of regression coefficients, assessment of statistical significance, and evaluation of model assumptions. The findings from this study can support data-driven decision-making and provide a deeper understanding of the factors associated with the response variable.

In this example, house prices are predicted using the built-in Boston Housing dataset from the MASS package.

library(MASS)  # provides the Boston dataset
data(Boston)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Fit a Multiple Linear Regression Model

model <- lm(medv ~ rm + lstat + ptratio + crim + tax,
            data = Boston)

summary(model)

## 
## Call:
## lm(formula = medv ~ rm + lstat + ptratio + crim + tax, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3602  -3.1111  -0.9237   1.6569  30.4116 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.7488084  4.0001180   4.187 3.34e-05 ***
## rm           4.6349234  0.4292367  10.798  < 2e-16 ***
## lstat       -0.5280046  0.0480346 -10.992  < 2e-16 ***
## ptratio     -0.8731668  0.1251429  -6.977 9.59e-12 ***
## crim        -0.0593795  0.0339830  -1.747   0.0812 .  
## tax         -0.0008196  0.0019328  -0.424   0.6717    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6816, Adjusted R-squared:  0.6784 
## F-statistic: 214.1 on 5 and 500 DF,  p-value: < 2.2e-16

formula(model)

## medv ~ rm + lstat + ptratio + crim + tax

round(coef(model),3)

## (Intercept)          rm       lstat     ptratio        crim         tax 
##      16.749       4.635      -0.528      -0.873      -0.059      -0.001

Multiple Linear Regression Model

The estimated multiple linear regression model is:

\[ \widehat{medv} = 16.749 + 4.635\,rm - 0.528\,lstat - 0.873\,ptratio - 0.059\,crim - 0.001\,tax \]

Holding all other variables constant, a one-unit increase in the average number of rooms (rm) is associated with an increase of 4.635 units in the predicted median house value, which is about $4,635 because medv is recorded in thousands of dollars. Conversely, one-unit increases in lstat, ptratio, crim, and tax are associated with decreases of 0.528, 0.873, 0.059, and 0.001 units, respectively. Only rm, lstat, and ptratio are statistically significant at the 5% level; crim (p = 0.081) and tax (p = 0.672) are not, so their estimated effects should be interpreted with caution.
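The "holding all other variables constant" interpretation can be checked numerically. The sketch below predicts medv for a hypothetical census tract and for the same tract with one additional room; the predictor values in base are assumptions made up for illustration, not observations from the data.

```r
library(MASS)  # provides the Boston data frame

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)

# Hypothetical census tract (illustrative values, not taken from the data)
base <- data.frame(rm = 6, lstat = 12, ptratio = 18, crim = 3, tax = 400)
plus_one_room <- transform(base, rm = rm + 1)

# Change in predicted medv when rm increases by one, other predictors fixed
effect_rm <- unname(predict(model, plus_one_room) - predict(model, base))
effect_rm

# 95% confidence intervals for all coefficients
confint(model)
```

The difference between the two predictions equals the rm coefficient, and the confidence intervals show how precisely each effect is estimated.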

library(ggplot2)
library(MASS)

model <- lm(medv ~ rm + lstat + ptratio + crim + tax,
            data = Boston)

results <- data.frame(
  Actual = Boston$medv,
  Predicted = predict(model)
)

ggplot(results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_abline(intercept = 0, slope = 1,
              color = "red", linewidth = 1) +
  labs(title = "Actual vs Predicted Values",
       x = "Actual medv",
       y = "Predicted medv") +
  theme_minimal()

The Actual vs Predicted plot compares the observed median house values (medv) with the values predicted by the multiple linear regression model. The red 45-degree reference line represents perfect predictions, where the predicted values are exactly equal to the actual values.

The scatter points are generally clustered around the reference line, indicating that the model is able to predict house values reasonably well. Points that lie close to the line correspond to observations with small prediction errors, while points farther from the line indicate larger discrepancies between the actual and predicted values.

The spread of points around the reference line suggests that the model captures a substantial portion of the variation in house values, although some prediction errors remain. A systematic pattern, such as points lying consistently above or below the line over some range of actual values, would indicate model bias, whereas points scattered randomly around the line suggest unbiased predictions.

Overall, the plot indicates that the multiple linear regression model provides a satisfactory fit to the data: the selected predictors (rm, lstat, ptratio, crim, and tax) together explain about 68% of the variation in medv (multiple R-squared = 0.6816).
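The visual impression from the plot can be summarised with two numbers. The following sketch computes the in-sample root mean squared error (RMSE) and the squared correlation between actual and predicted values; both are measured on the same data used to fit the model, so they are likely to be optimistic for new data.

```r
library(MASS)  # provides the Boston data frame

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)

actual    <- Boston$medv
predicted <- fitted(model)

# Typical size of a prediction error, in the units of medv ($1000s)
rmse <- sqrt(mean((actual - predicted)^2))

# For least squares with an intercept, this equals the multiple R-squared
r_squared <- cor(actual, predicted)^2

c(RMSE = rmse, R2 = r_squared)
```

The RMSE is slightly smaller than the residual standard error reported by summary(model) because it divides by the number of observations rather than by the residual degrees of freedom.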

Check Model Assumptions

par(mfrow = c(2,2))
plot(model)

The diagnostic plots suggest that the linear regression model is generally appropriate, but there are signs of slight non-linearity and heteroscedasticity in the Residuals vs Fitted and Scale-Location plots. The Q-Q plot shows some deviation from normality at the tails, indicating possible outliers or non-normal residuals. The Residuals vs Leverage plot indicates a few influential observations, but none appear extremely dominant.
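To follow up on the Residuals vs Leverage plot, influential observations can be listed with Cook's distance. This is a minimal sketch; the 4/n cutoff used below is a common rule of thumb and is an assumption here, not a formal test.

```r
library(MASS)  # provides the Boston data frame

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)

n     <- nrow(Boston)
cooks <- cooks.distance(model)

# Observations above the rule-of-thumb cutoff 4/n (assumed threshold)
flagged <- which(cooks > 4 / n)
length(flagged)

# The five observations with the largest Cook's distance
head(sort(cooks, decreasing = TRUE), 5)
```

Flagged observations are candidates for closer inspection rather than automatic removal.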

hist(residuals(model),
     xlab = "Residuals",
     ylab = "Frequencies",
     col = "skyblue",
     main = "Residuals")

The histogram shows that the residuals are centered around zero and slightly right-skewed, indicating some large positive residuals, that is, observations whose actual value lies well above the prediction. Least-squares residuals always average zero when the model includes an intercept, so the centering alone does not show that the model is unbiased. The skew suggests that the normality assumption is not perfectly satisfied and that a few outliers may be affecting the model fit.
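The shape of the histogram can be quantified as well. The sketch below computes the sample skewness of the residuals by hand and runs a Shapiro-Wilk test from base R; it is one possible check, not the only way to assess normality.

```r
library(MASS)  # provides the Boston data frame

model <- lm(medv ~ rm + lstat + ptratio + crim + tax, data = Boston)
res   <- residuals(model)

# Sample skewness: positive values indicate a longer right tail
centered <- res - mean(res)
skewness <- mean(centered^3) / mean(centered^2)^1.5
skewness

# Shapiro-Wilk test: a small p-value is evidence against normality
shapiro.test(res)
```

With about 500 observations the test is sensitive to even mild departures from normality, so its result is best read together with the Q-Q plot.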

2. Variable Selection Methods in Regression Analysis

Variable selection is an important process in regression analysis that involves selecting the most relevant independent variables for predicting a response variable. The purpose is to improve model performance, reduce complexity, and make the model easier to interpret. This report demonstrates variable selection methods using the Boston Housing dataset, where the objective is to predict the median value of owner-occupied homes (medv) using several explanatory variables.

Description of the Boston Dataset

The Boston Housing dataset contains information collected from different suburbs of Boston. It includes 506 observations and 14 variables. The response variable is medv, which represents the median value of owner-occupied homes in thousands of dollars. The predictor variables include crime rate, average number of rooms, accessibility to highways, property tax rate, pupil-teacher ratio, and others.

Variable Selection Methods

Correlation Plot

A correlation plot helps identify the variables that are most strongly related to house prices (medv).

library(MASS)
library(corrplot)

## corrplot 0.95 loaded

data(Boston)

cor_matrix <- cor(Boston)

corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7)

In the plot, \(rm\) is positively correlated with \(medv\), while \(lstat\) shows a strong negative correlation.
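The colours in the correlation plot can be hard to compare by eye, so a short sketch that ranks the predictors by their correlation with medv may help:

```r
library(MASS)  # provides the Boston data frame

# Correlation of every variable with medv, excluding medv itself
cors <- cor(Boston)[, "medv"]
cors <- cors[names(cors) != "medv"]

# Sorted from most negative to most positive
round(sort(cors), 2)
```

These are pairwise correlations only; they do not account for the other predictors, which is what the regression models in the following sections do.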

Scatter Plot of the Most Important Predictor

library(ggplot2)

ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Relationship between LSTAT and MEDV",
       x = "Lower Status Population (%)",
       y = "Median House Value")

## `geom_smooth()` using formula = 'y ~ x'
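The straight line drawn by geom_smooth(method = "lm") corresponds to a simple linear regression of medv on lstat. As a brief sketch, the same line can be fitted explicitly to read off its intercept, slope, and R-squared:

```r
library(MASS)  # provides the Boston data frame

# Simple regression with lstat as the only predictor
simple_fit <- lm(medv ~ lstat, data = Boston)

coef(simple_fit)                # intercept and slope of the plotted line
summary(simple_fit)$r.squared   # share of variance explained by lstat alone
```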

Forward Selection

Forward selection starts with no predictor variables and adds them one at a time, at each step choosing the variable that most improves the model. In R, step() measures improvement with the Akaike Information Criterion (AIC) rather than with p-values, and the search stops when no remaining variable lowers the AIC.

full_model <- lm(medv ~ ., data = Boston)
null_model <- lm(medv ~ 1, data = Boston)

forward_model <- step(
  null_model,
  scope = list(lower = null_model,
               upper = full_model),
  direction = "forward",
  trace = 0
)

summary(forward_model)

## 
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + chas + 
##     black + zn + crim + rad + tax, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## black         0.009291   0.002674   3.475 0.000557 ***
## zn            0.045845   0.013523   3.390 0.000754 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
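To see what a single step of forward selection looks like, the sketch below uses add1() to evaluate every candidate predictor against the intercept-only model. The helper name candidates is introduced here for illustration only.

```r
library(MASS)  # provides the Boston data frame

null_model <- lm(medv ~ 1, data = Boston)

# One-sided formula listing every predictor except the response
candidates <- reformulate(setdiff(names(Boston), "medv"))

# AIC of the model obtained by adding each candidate on its own
first_step <- add1(null_model, scope = candidates)
first_step[order(first_step$AIC), ]

# The candidate with the lowest AIC is the first variable to enter
best_first <- rownames(first_step)[which.min(first_step$AIC)]
best_first
```

step() repeats this comparison after each addition until the `<none>` row has the lowest AIC.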

Backward Elimination

Backward elimination starts with all predictor variables included in the model. At each step, step() removes the variable whose deletion lowers the AIC the most, and it stops when removing any remaining variable would increase the AIC.

backward_model <- step(
  full_model,
  direction = "backward",
  trace = 0
)

summary(backward_model)

## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
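A single step of backward elimination can be inspected in the same spirit with drop1(), which reports the AIC that would result from deleting each term from the full model. This is a sketch of the first step only.

```r
library(MASS)  # provides the Boston data frame

full_model <- lm(medv ~ ., data = Boston)

# AIC after deleting each predictor in turn; <none> is the current model
first_drop <- drop1(full_model)
first_drop[order(first_drop$AIC), ]

# Predictors whose removal would lower the AIC below the current model
none_aic   <- first_drop["<none>", "AIC"]
removable  <- rownames(first_drop)[first_drop$AIC < none_aic]
removable
```

Only predictors listed in removable are candidates for deletion at this step; step() deletes the one with the lowest AIC and then repeats the comparison.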

Stepwise Selection

Stepwise selection combines forward selection and backward elimination: at each step a variable may be added or removed, whichever change lowers the AIC the most.

stepwise_model <- step(
  full_model,
  direction = "both",
  trace = 0
)

summary(stepwise_model)

## 
## Call:
## lm(formula = medv ~ crim + zn + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat, data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171 2.73e-12 ***
## crim         -0.108413   0.032779  -3.307 0.001010 ** 
## zn            0.045845   0.013523   3.390 0.000754 ***
## chas          2.718716   0.854240   3.183 0.001551 ** 
## nox         -17.376023   3.535243  -4.915 1.21e-06 ***
## rm            3.801579   0.406316   9.356  < 2e-16 ***
## dis          -1.492711   0.185731  -8.037 6.84e-15 ***
## rad           0.299608   0.063402   4.726 3.00e-06 ***
## tax          -0.011778   0.003372  -3.493 0.000521 ***
## ptratio      -0.946525   0.129066  -7.334 9.24e-13 ***
## black         0.009291   0.002674   3.475 0.000557 ***
## lstat        -0.522553   0.047424 -11.019  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 2.2e-16
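The three summaries above list the same coefficients in different orders. A short sketch can confirm that the three procedures selected the same set of predictors and show which ones were dropped; the helper terms_of is a hypothetical name introduced for this check.

```r
library(MASS)  # provides the Boston data frame

full_model <- lm(medv ~ ., data = Boston)
null_model <- lm(medv ~ 1, data = Boston)

forward_model <- step(null_model,
                      scope = list(lower = formula(null_model),
                                   upper = formula(full_model)),
                      direction = "forward", trace = 0)
backward_model <- step(full_model, direction = "backward", trace = 0)
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Sorted predictor names of a fitted model (hypothetical helper)
terms_of <- function(fit) sort(attr(terms(fit), "term.labels"))

same_terms <- identical(terms_of(forward_model), terms_of(backward_model)) &&
  identical(terms_of(backward_model), terms_of(stepwise_model))
same_terms

# Predictors in the full model that the selected model leaves out
dropped <- setdiff(terms_of(full_model), terms_of(backward_model))
dropped

# AIC of the full and the selected model (lower is better)
AIC(full_model, backward_model)
```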

Selection Results

In this analysis, forward selection, backward elimination, and stepwise selection all arrive at the same 11-predictor model, which drops only indus and age from the full model. The retained predictors include:

rm (average number of rooms per dwelling)
lstat (percentage of lower-status population)
ptratio (pupil-teacher ratio)
dis (weighted distances to employment centers)
crim (per capita crime rate)
tax (property tax rate)

together with nox, chas, black, zn, and rad. Every retained predictor is statistically significant at the 1% level in the selected model.

In the selected model, rm has a positive coefficient (3.80), indicating that areas with more rooms per dwelling tend to have higher median house values. In contrast, lstat has a negative coefficient (-0.52), suggesting that areas with a higher percentage of lower-status population tend to have lower housing values.

Coefficient Plot

coef_df <- data.frame(
  Variable = names(coef(full_model))[-1],
  Coefficient = coef(full_model)[-1]
)

library(ggplot2)

ggplot(coef_df,
       aes(x = reorder(Variable, abs(Coefficient)),
           y = Coefficient)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Regression Coefficients",
       x = "Variables",
       y = "Coefficient")
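Raw coefficients depend on the units of each predictor, so the bar lengths in a plot of unstandardized coefficients are not directly comparable. One possible remedy, sketched below, is to standardize all variables before fitting; the names Boston_std and std_model are introduced here for illustration, and standardizing the dummy variable chas is a simplification.

```r
library(MASS)  # provides the Boston data frame

# Standardize response and predictors to mean 0 and standard deviation 1
Boston_std <- as.data.frame(scale(Boston))

std_model <- lm(medv ~ ., data = Boston_std)

# Standardized coefficients without the intercept, largest magnitude first
std_coef <- coef(std_model)[-1]
round(std_coef[order(abs(std_coef), decreasing = TRUE)], 3)
```

Each standardized coefficient gives the change in medv, in standard deviations, for a one standard deviation change in the predictor, which puts all predictors on a common scale.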

Predicted vs Actual Values

pred <- predict(full_model)

ggplot(data.frame(Actual = Boston$medv,
                  Predicted = pred),
       aes(x = Actual, y = Predicted)) +
  geom_point() +
  geom_abline(slope = 1,
              intercept = 0,
              linetype = "dashed") +
  labs(title = "Predicted vs Actual House Prices")

Discussion

The results show that not all variables contribute equally to predicting housing prices. Variable selection methods help identify the most influential predictors and remove variables with little explanatory power. Forward selection builds the model gradually, while backward elimination simplifies a complete model by removing unimportant variables. Stepwise selection combines both approaches. In this analysis, all three methods selected the same model, removing only indus and age.

Using fewer but important variables makes the model easier to interpret and may improve prediction performance when applied to new data. In the Boston Housing dataset, housing prices are strongly influenced by structural characteristics such as the number of rooms (rm) and by socioeconomic factors such as the percentage of lower-status population (lstat) and the pupil-teacher ratio (ptratio).
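Whether a smaller model really predicts new data better can be examined with cross-validation. The sketch below compares the full model with the model without indus and age using 5-fold cross-validation; the number of folds, the random seed, and the helper cv_rmse are assumptions made for this illustration. Because the reduced model was chosen using all observations, its cross-validated error is likely to be slightly optimistic.

```r
library(MASS)  # provides the Boston data frame

set.seed(123)  # arbitrary seed so the fold assignment is reproducible
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(Boston)))

full_formula    <- medv ~ .
reduced_formula <- medv ~ . - indus - age

# Average out-of-fold RMSE for a given model formula (hypothetical helper)
cv_rmse <- function(model_formula) {
  fold_rmse <- sapply(1:k, function(i) {
    train <- Boston[folds != i, ]
    test  <- Boston[folds == i, ]
    fit   <- lm(model_formula, data = train)
    sqrt(mean((test$medv - predict(fit, newdata = test))^2))
  })
  mean(fold_rmse)
}

cv_results <- c(full = cv_rmse(full_formula),
                reduced = cv_rmse(reduced_formula))
cv_results
```

A lower value indicates better out-of-sample accuracy; if the two values are close, the reduced model is preferable on grounds of simplicity.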

Conclusion

Variable selection is a useful technique for improving regression models by identifying the most important predictors. Using the Boston Housing dataset, forward selection, backward elimination, and stepwise selection can be applied to determine which variables best explain variations in housing prices. The selected variables provide valuable insights into the factors affecting property values and contribute to the development of a more efficient and interpretable predictive model.

ADVENTIST UNIVERSITY OF CENTRAL AFRICA

MASTERS OF IT IN BIG DATA ANALYTICS

20251MBI055

HABIMANA AIMABLE

2026-06-04