1 Introduction

1.1 Background

The dataset used here contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It was obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston) and has been used extensively throughout the literature to benchmark algorithms. However, these comparisons were primarily done outside of Delve and are thus somewhat suspect. The dataset is small, with only 506 cases.

This report walks through the process of building a linear regression model to predict house prices, where the median value of a home is the quantity to be predicted.

The data was originally published by Harrison, D. and Rubinfeld, D.L., 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, pp. 81-102, 1978.

Therefore, the purpose of the linear regression model is to determine:

  • which variables are significant in predicting the price of a house

  • how well those variables describe the price of a house

1.2 Packages

To build the model we will use the following libraries:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

2 Data Preparation & Explanation

2.1 Data Input & Explanation

house <- read.csv("HousingData.csv")  # load the Boston housing dataset
head(house)                           # preview the first six rows

There are 14 attributes in each case of the dataset. They are:

  • CRIM - per capita crime rate by town

  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.

  • INDUS - proportion of non-retail business acres per town.

  • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)

  • NOX - nitric oxides concentration (parts per 10 million)

  • RM - average number of rooms per dwelling

  • AGE - proportion of owner-occupied units built prior to 1940

  • DIS - weighted distances to five Boston employment centres

  • RAD - index of accessibility to radial highways

  • TAX - full-value property-tax rate per $10,000

  • PTRATIO - pupil-teacher ratio by town

  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

  • LSTAT - % lower status of the population

  • MEDV - Median value of owner-occupied homes in $1000’s; this variable represents the house price

2.2 Data Structure

glimpse(house)
## Rows: 506
## Columns: 14
## $ CRIM    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ ZN      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ INDUS   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ CHAS    <int> 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
## $ NOX     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ RM      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ AGE     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ DIS     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ RAD     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ TAX     <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ PTRATIO <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ B       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ LSTAT   <dbl> 4.98, 9.14, 4.03, 2.94, NA, 5.21, 12.43, 19.15, 29.93, 17.10, …
## $ MEDV    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

The data has 506 rows and 14 columns. The column MEDV, the median value of owner-occupied homes in $1000’s, represents the house price and will be used as the target variable; the other columns will be used as predictors.

As we can see, the variable CHAS holds categorical data (a dummy variable), so we need to change its data type to factor:

house$CHAS <- as.factor(house$CHAS)
glimpse(house)
## Rows: 506
## Columns: 14
## $ CRIM    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ ZN      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ INDUS   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ CHAS    <fct> 0, 0, 0, 0, 0, 0, NA, 0, 0, NA, 0, 0, 0, 0, NA, 0, 0, 0, 0, 0,…
## $ NOX     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ RM      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ AGE     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ DIS     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ RAD     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ TAX     <int> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ PTRATIO <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ B       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ LSTAT   <dbl> 4.98, 9.14, 4.03, 2.94, NA, 5.21, 12.43, 19.15, 29.93, 17.10, …
## $ MEDV    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

2.2.1 Missing Values

colSums(is.na(house))
##    CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX 
##      20      20      20      20       0       0      20       0       0       0 
## PTRATIO       B   LSTAT    MEDV 
##       0       0      20       0

The missing values are confined to a small portion of each affected column (20 of 506 rows), so we will remove the rows that contain them; note that this reduces the dataset from 506 to 394 rows:

cleaned_house <- na.omit(house)
colSums(is.na(cleaned_house))
##    CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX 
##       0       0       0       0       0       0       0       0       0       0 
## PTRATIO       B   LSTAT    MEDV 
##       0       0       0       0
glimpse(cleaned_house)
## Rows: 394
## Columns: 14
## $ CRIM    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.02985, 0.14455, 0.21124,…
## $ ZN      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 0.0, 0…
## $ INDUS   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.87, 8.…
## $ CHAS    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ NOX     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524,…
## $ RM      <dbl> 6.575, 6.421, 7.185, 6.998, 6.430, 6.172, 5.631, 6.377, 6.009,…
## $ AGE     <dbl> 65.2, 78.9, 61.1, 45.8, 58.7, 96.1, 100.0, 94.3, 82.9, 39.0, 6…
## $ DIS     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 5.9505, 6.0821, 6.3467…
## $ RAD     <int> 1, 2, 2, 3, 3, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ TAX     <int> 296, 242, 242, 222, 222, 311, 311, 311, 311, 311, 307, 307, 30…
## $ PTRATIO <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15.2, 21…
## $ B       <dbl> 396.90, 396.90, 392.83, 394.63, 394.12, 396.90, 386.63, 392.52…
## $ LSTAT   <dbl> 4.98, 9.14, 4.03, 2.94, 5.21, 19.15, 29.93, 20.45, 13.27, 15.7…
## $ MEDV    <dbl> 24.0, 21.6, 34.7, 33.4, 28.7, 27.1, 16.5, 15.0, 18.9, 21.7, 20…

3 Exploratory Data Analysis

In this exploratory analysis we will check whether there are any patterns that indicate correlation between variables.

ggcorr(cleaned_house, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
## Warning in ggcorr(cleaned_house, label = TRUE, label_size = 2.9, hjust = 1, :
## data in column(s) 'CHAS' are not numeric and were ignored

From the graphic we can conclude that the only predictor variable with a strong positive correlation (> 0.5) with the target variable MEDV is RM.
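
As a quick numeric check of what the plot shows (a minimal sketch; the value is not printed in this report):

cor(cleaned_house$RM, cleaned_house$MEDV)  # Pearson correlation between RM and MEDV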

Next we will check the distribution of our target variable:

boxplot(cleaned_house$MEDV)

We can see that this variable has outliers.
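
To see exactly which values the boxplot flags, base R can list them (a minimal sketch; output not shown):

boxplot.stats(cleaned_house$MEDV)$out  # values beyond the 1.5 * IQR whiskers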

4 Model Building

4.1 Train-test Splitting

In this section, we split the dataset into a training set and a test set. The training set is used to fit the linear regression model, and the test set is used to evaluate the performance of the fitted model. 80% of the dataset is used for training and the rest for testing.

set.seed(123)  # fix the random seed for reproducibility
samplesize <- round(0.8 * nrow(cleaned_house), 0)                 # 80% of rows
index <- sample(seq_len(nrow(cleaned_house)), size = samplesize)  # random row indices

data_train <- cleaned_house[index, ]   # training set (315 rows)
data_test  <- cleaned_house[-index, ]  # test set (79 rows)

4.2 Linear Regression Model

In this section we will build three models and analyze which one performs best.

4.2.1 Model 1

This model uses only the predictor variable that has a strong correlation with the target variable, namely RM.

model_1 <- lm(MEDV ~ RM,
              data = data_train)

summary(model_1)
## 
## Call:
## lm(formula = MEDV ~ RM, data = data_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.306  -2.222   0.477   3.118  31.848 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -40.6054     3.2039  -12.67   <2e-16 ***
## RM           10.0013     0.5043   19.83   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.161 on 313 degrees of freedom
## Multiple R-squared:  0.5568, Adjusted R-squared:  0.5554 
## F-statistic: 393.2 on 1 and 313 DF,  p-value: < 2.2e-16

With this model we can make the following interpretations:

  • (Intercept): the constant term in the regression equation. The estimate is -40.6054, the expected value of MEDV when RM is zero (an extrapolation far outside the observed data).

  • RM: the coefficient for RM. The estimate is 10.0013, suggesting that a one-unit increase in RM is associated with an increase of about 10.0013 in MEDV (roughly $10,000 in median home value).

  • The “***” symbols indicate that both the intercept and RM have p-values far below 0.05, so both are highly statistically significant.
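
As an illustration, the fitted equation MEDV = -40.6054 + 10.0013 * RM can be applied directly; the value RM = 6 below is a hypothetical example:

# Predicted median home value for a hypothetical tract averaging 6 rooms per dwelling
-40.6054 + 10.0013 * 6  # ~19.40, i.e. about $19,400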

4.2.2 Model ALL

This model will use all of the predictor variables to make the model.

model_all <- lm(MEDV ~.,
                data = data_train)
summary(model_all)
## 
## Call:
## lm(formula = MEDV ~ ., data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.4525  -2.3807  -0.5144   1.6477  27.0785 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.465821   6.277630   4.375 1.67e-05 ***
## CRIM         -0.095562   0.036626  -2.609  0.00953 ** 
## ZN            0.047684   0.015601   3.057  0.00244 ** 
## INDUS         0.005292   0.073265   0.072  0.94246    
## CHAS1         3.442113   1.064293   3.234  0.00136 ** 
## NOX         -19.837100   4.975790  -3.987 8.41e-05 ***
## RM            4.921938   0.557390   8.830  < 2e-16 ***
## AGE          -0.012045   0.016346  -0.737  0.46176    
## DIS          -1.409296   0.231382  -6.091 3.42e-09 ***
## RAD           0.250727   0.076870   3.262  0.00123 ** 
## TAX          -0.011693   0.004329  -2.701  0.00730 ** 
## PTRATIO      -0.855854   0.154587  -5.536 6.73e-08 ***
## B             0.009309   0.003363   2.768  0.00599 ** 
## LSTAT        -0.349823   0.065465  -5.344 1.80e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.478 on 301 degrees of freedom
## Multiple R-squared:  0.7749, Adjusted R-squared:  0.7652 
## F-statistic:  79.7 on 13 and 301 DF,  p-value: < 2.2e-16

With this model we can make the following interpretations:

  • (Intercept): the constant term in the regression equation. The estimate is 27.4658, the expected value of MEDV when all predictors are zero.

  • CRIM, ZN, INDUS, CHAS1, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B, LSTAT: the coefficients for the respective variables. Each estimate represents the expected change in MEDV associated with a one-unit increase in that predictor, assuming all other variables are held constant.

  • Based on the p-values, CRIM, ZN, CHAS1, NOX, RM, DIS, RAD, TAX, PTRATIO, B, and LSTAT have statistically significant relationships with MEDV, while INDUS and AGE do not.

4.2.3 Model Feature Selection

This model selects its predictors with the stepwise method, using backward elimination. Stepwise regression is a family of methods that fit regression models by choosing the predictive variables through an automatic procedure: at each step, one variable is considered for addition to or removal from the current set of variables, and this process repeats until the model cannot be improved further.

Backward elimination starts with all candidate variables, tests the deletion of each variable using a chosen model-fit criterion (here, AIC), removes the variable whose deletion yields the largest reduction in AIC (the biggest improvement in fit), and repeats this process until no further variable can be removed without a loss of fit.

model_none <- lm(MEDV ~ 1, data = data_train)  # intercept-only baseline (lower search bound; unused by pure backward elimination)

model_backward <- step(object = model_all, 
                      direction = "backward",
                      trace = T)
## Start:  AIC=958.11
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT
## 
##           Df Sum of Sq    RSS     AIC
## - INDUS    1      0.10 6034.9  956.12
## - AGE      1     10.89 6045.7  956.68
## <none>                 6034.8  958.11
## - CRIM     1    136.49 6171.3  963.16
## - TAX      1    146.28 6181.1  963.65
## - B        1    153.62 6188.5  964.03
## - ZN       1    187.31 6222.1  965.74
## - CHAS     1    209.71 6244.5  966.87
## - RAD      1    213.30 6248.1  967.05
## - NOX      1    318.66 6353.5  972.32
## - LSTAT    1    572.50 6607.3  984.66
## - PTRATIO  1    614.55 6649.4  986.66
## - DIS      1    743.78 6778.6  992.72
## - RM       1   1563.34 7598.2 1028.67
## 
## Step:  AIC=956.12
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + AGE + DIS + RAD + TAX + 
##     PTRATIO + B + LSTAT
## 
##           Df Sum of Sq    RSS     AIC
## - AGE      1     10.81 6045.7  954.68
## <none>                 6034.9  956.12
## - CRIM     1    136.90 6171.8  961.18
## - B        1    153.55 6188.5  962.03
## - TAX      1    183.75 6218.7  963.56
## - ZN       1    188.27 6223.2  963.79
## - CHAS     1    212.30 6247.2  965.01
## - RAD      1    230.35 6265.3  965.92
## - NOX      1    341.00 6375.9  971.43
## - LSTAT    1    572.67 6607.6  982.67
## - PTRATIO  1    626.01 6661.0  985.21
## - DIS      1    769.54 6804.5  991.92
## - RM       1   1582.33 7617.3 1027.46
## 
## Step:  AIC=954.68
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + 
##     B + LSTAT
## 
##           Df Sum of Sq    RSS     AIC
## <none>                 6045.7  954.68
## - CRIM     1    136.77 6182.5  959.73
## - B        1    149.11 6194.9  960.35
## - TAX      1    185.31 6231.1  962.19
## - ZN       1    201.48 6247.2  963.01
## - CHAS     1    208.75 6254.5  963.37
## - RAD      1    244.87 6290.6  965.19
## - NOX      1    410.60 6456.3  973.38
## - PTRATIO  1    646.59 6692.3  984.69
## - LSTAT    1    737.01 6782.8  988.91
## - DIS      1    789.68 6835.4  991.35
## - RM       1   1605.20 7650.9 1026.85
model_backward
## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT, data = data_train)
## 
## Coefficients:
## (Intercept)         CRIM           ZN        CHAS1          NOX           RM  
##   27.967322    -0.095607     0.048886     3.417237   -20.747782     4.825840  
##         DIS          RAD          TAX      PTRATIO            B        LSTAT  
##   -1.360419     0.255189    -0.011592    -0.864469     0.009146    -0.367898

Using stepwise backward elimination, the predictors have been reduced to CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT; INDUS and AGE were removed.

summary(model_backward)
## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT, data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2737  -2.4146  -0.5658   1.6704  26.8155 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27.967322   6.207329   4.506 9.46e-06 ***
## CRIM         -0.095607   0.036517  -2.618 0.009285 ** 
## ZN            0.048886   0.015384   3.178 0.001638 ** 
## CHAS1         3.417237   1.056498   3.234 0.001353 ** 
## NOX         -20.747782   4.573704  -4.536 8.26e-06 ***
## RM            4.825840   0.538038   8.969  < 2e-16 ***
## DIS          -1.360419   0.216247  -6.291 1.10e-09 ***
## RAD           0.255189   0.072845   3.503 0.000529 ***
## TAX          -0.011592   0.003804  -3.048 0.002510 ** 
## PTRATIO      -0.864469   0.151859  -5.693 2.96e-08 ***
## B             0.009146   0.003346   2.734 0.006631 ** 
## LSTAT        -0.367898   0.060533  -6.078 3.66e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.467 on 303 degrees of freedom
## Multiple R-squared:  0.7745, Adjusted R-squared:  0.7663 
## F-statistic:  94.6 on 11 and 303 DF,  p-value: < 2.2e-16

With this model we can make the following interpretations:

  • (Intercept): the constant term in the regression equation. The estimate is 27.9673, the expected value of MEDV when all predictors are zero.

  • CRIM, ZN, CHAS1, NOX, RM, DIS, RAD, TAX, PTRATIO, B, LSTAT: the coefficients for the respective variables. Each estimate represents the expected change in MEDV associated with a one-unit increase in that predictor, assuming all other variables are held constant.

  • Based on the p-values, every remaining variable has a statistically significant relationship with MEDV, as expected after backward elimination.

5 Model Evaluation

5.1 Model Performance

The performance of our models (how well they predict the target variable) can be measured with the root mean squared error (RMSE). RMSE squares the differences between the actual and the predicted values, so predictions with larger errors are penalized more heavily. Another reason to use RMSE here is that our target variable contains outliers.
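
For reference, RMSE is the square root of the mean of the squared errors. A minimal base-R sketch equivalent to the MLmetrics::RMSE() calls used below:

rmse_manual <- function(y_pred, y_true) {
  # square the errors, average them, then take the square root
  sqrt(mean((y_true - y_pred)^2))
}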

  • RMSE in model_1:

pred_1 <- predict(model_1, newdata = data_test %>% select(-MEDV))
RMSE(y_pred = pred_1, y_true = data_test$MEDV)
## [1] 6.948758

  • RMSE in model_all:

pred_all <- predict(model_all, newdata = data_test %>% select(-MEDV))
RMSE(y_pred = pred_all, y_true = data_test$MEDV)
## [1] 4.643686

  • RMSE in model_backward:

pred_backward <- predict(model_backward, newdata = data_test %>% select(-MEDV))
RMSE(y_pred = pred_backward, y_true = data_test$MEDV)
## [1] 4.648957

A lower RMSE signifies that the predictions are closer to the actual values, indicating a smaller average prediction error and a better fit to the data. Among these models, model_all has the lowest RMSE (4.6437), indicating the best performance of the three, although model_backward is nearly identical (4.6490) while using two fewer predictors.

5.2 Assumptions

We have selected our best model, model_all. However, linear regression is a statistical model that relies on strict assumptions. The following assumptions need to be checked to determine whether the model we constructed qualifies as a Best Linear Unbiased Estimator (BLUE), i.e. a model that can consistently predict new data.

5.2.1 Linearity

Linearity refers to the condition where the target variable and its predictors have a linear relationship, meaning their relationship is characterized by a straight line.

To assess the assumption of linearity in a multiple linear regression model, one can create a plot of residuals versus fitted values. This plot is a scatter plot with the x-axis representing the fitted values (predicted values of the target variable) and the y-axis representing the residuals/errors produced by the model.

plot(model_all,
     which = 1)  # residuals vs. fitted values

The residuals are scattered fairly randomly between -10 and 10, which indicates that our model satisfies the assumption of linearity.

5.2.2 Normality of Residual

In linear regression, the errors are expected to follow a normal distribution centered at zero, meaning most errors should be close to zero.

  • Residual histogram to visualize the distribution:
hist(model_all$residuals)

  • Shapiro-Wilk hypothesis test:

The Shapiro-Wilk hypothesis test can be used to test the assumption of normal distribution for errors.

H0: The errors are normally distributed.

H1: The errors are NOT normally distributed.

H0 is rejected if the p-value is less than 0.05 (alpha).

shapiro.test(model_all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_all$residuals
## W = 0.87463, p-value = 2.42e-15

Based on the Shapiro-Wilk normality test performed on the residuals of the model, the test statistic is W = 0.87463 and the corresponding p-value is 2.42e-15 (very close to zero). Since the p-value is less than the significance level of 0.05 (alpha), we reject the null hypothesis (H0) and conclude that the errors are not normally distributed.

5.2.3 Homoscedasticity

The errors generated by the model are expected to show random, constant variation. When visualized, the errors should not exhibit any specific pattern. This condition is known as homoscedasticity.

  • Visualize the scatter plot of the model’s fitted values vs. residuals:
plot(x = model_all$fitted.values,
     y = model_all$residuals)
abline(h = 0, col = "red")

  • Statistic test using Breusch-Pagan hypothesis test :

H0: The errors exhibit constant variance (homoscedasticity).

H1: The errors do not exhibit constant variance (heteroscedasticity).

bptest(model_all)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_all
## BP = 40.019, df = 13, p-value = 0.0001373

Based on the studentized Breusch-Pagan test performed on the model, the test statistic is BP = 40.019 with 13 degrees of freedom, and the corresponding p-value is 0.0001373.

Since the p-value is less than the significance level of 0.05, we reject the null hypothesis (H0) and conclude that there is evidence of heteroscedasticity. This suggests that the errors do not exhibit constant variance and violate the assumption of homoscedasticity.

5.2.4 Multicollinearity

One statistical tool for assessing multicollinearity is the Variance Inflation Factor (VIF). Put simply, VIF measures the effect of multicollinearity among the predictors in our model.

vif(model_all)
##     CRIM       ZN    INDUS     CHAS      NOX       RM      AGE      DIS 
## 1.764436 2.398748 4.110956 1.058194 4.602251 2.312473 3.270797 3.884844 
##      RAD      TAX  PTRATIO        B    LSTAT 
## 6.987297 8.512164 1.756912 1.334624 3.634574

Generally, a VIF value above 5 or 10 is considered high and may indicate the presence of multicollinearity. In this case, RAD (VIF = 6.99) and TAX (VIF = 8.51) exceed 5, suggesting a potential multicollinearity issue between these variables.
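
For intuition, the VIF of a predictor equals 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A minimal sketch reproducing the TAX value above:

# Regress TAX on every other predictor (MEDV, the response, is excluded)
r2_tax <- summary(lm(TAX ~ . - MEDV, data = data_train))$r.squared
1 / (1 - r2_tax)  # should be close to the 8.51 reported by vif(model_all)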

6 Conclusion

We have created a model to predict the median value of owner-occupied homes, which represents the house price (MEDV). Our best model is model_all, which uses all of the predictor variables. Its adjusted R-squared is 0.7652, which accounts for the number of predictors, and its RMSE of 4.6437 on the test set is the lowest of all the models, indicating the best performance among them. However, the model does not satisfy the statistical assumptions of residual normality and homoscedasticity, and it shows signs of multicollinearity (RAD and TAX), so its coefficient estimates should be interpreted with caution.