In this analysis, we will learn to use linear regression model using the dataset on Boston house prices. We want to investigate the relationship among variables and to predict house prices in Boston based on the historical data.The database used can be accessed from Kaggle: Boston House Prices.
Goals for this analysis:
MDV).MDV).CRIM: per capita crime rate by townZN: proportion of residential land zoned for lots over
25,000 sq.ft.INDUS: proportion of non-retail business acres per
townCHAS: Charles River dummy variable (1 if tract bounds
river; 0 otherwise)NOX: nitric oxides concentration (parts per 10 million)
[parts/10M]RM: average number of rooms per dwellingAGE: proportion of owner-occupied units built prior to
1940DIS: weighted distances to five Boston employment
centresRAD: index of accessibility to radial highwaysTAX: full-value property-tax rate per 10kPTRATIO: pupil-teacher ratio by townB: proportion of blacks by townLSTAT: % lower status of the populationMEDV: Median value of owner-occupied homes in 1k’s## Rows: 506
## Columns: 14
## $ CRIM <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ ZN <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ INDUS <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ CHAS <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ NOX <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ RM <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ AGE <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ DIS <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ RAD <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ TAX <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ PTRATIO <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ B <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ LSTAT <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ MEDV <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
Let’s change the data type for CHAS to factor from
integer, in order for the model to create a dummy variable.
Check for Missing Values
## CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX
## 0 0 0 0 0 0 0 0 0 0
## PTRATIO B LSTAT MEDV
## 0 0 0 0
There are no missing values in the data.
## CRIM ZN INDUS CHAS NOX
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471 Min. :0.3850
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35 1st Qu.:0.4490
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.5380
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.5547
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.6240
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :0.8710
## RM AGE DIS RAD
## Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
## 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
## Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
## Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
## 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
## TAX PTRATIO B LSTAT
## Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :391.44 Median :11.36
## Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97
## MEDV
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
Let’s investigate whether there are any relationships between variables using Pearson Correlation.
## Warning in ggcorr(house_clean, label = T, label_size = 2.9, hjust = 0.7, : data
## in column(s) 'CHAS' are not numeric and were ignored
From the correlation graphic above, we can see that most of the
variables have a moderate to strong negative correlation with our target
variable MEDV. There is one strong positive correlation
between RM and MEDV.
Now let’s look at the correlation between our only categorical data
in the database, CHAS, and our target variable,
MEDV using point-biserial correlation.
## [1] -0.1752602
The relationship between house price and proximity to Charles River
is weak and negative. Since the correlation is weak, it is unlikely that
CHAS or proximity to the river influences the house
prices.
Since the variables with the highest correlation to our target
variable are LSTAT and RM, let’s try making a
model with these two variables.
##
## Call:
## lm(formula = MEDV ~ RM + LSTAT, data = house_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.076 -3.516 -1.010 1.909 28.131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.35827 3.17283 -0.428 0.669
## RM 5.09479 0.44447 11.463 <0.0000000000000002 ***
## LSTAT -0.64236 0.04373 -14.689 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.54 on 503 degrees of freedom
## Multiple R-squared: 0.6386, Adjusted R-squared: 0.6371
## F-statistic: 444.3 on 2 and 503 DF, p-value: < 0.00000000000000022
For the first model with RM and LSTAT, the
goodness of fit for this model produced an adjusted R-squared value of
0.6371. This means that the model can explain 63.71% of variance in the
target variable MEDV (house price).
Let’s also make a model with all variables and compare:
##
## Call:
## lm(formula = MEDV ~ ., data = house_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.4594884 5.1034588 7.144 0.000000000003283 ***
## CRIM -0.1080114 0.0328650 -3.287 0.001087 **
## ZN 0.0464205 0.0137275 3.382 0.000778 ***
## INDUS 0.0205586 0.0614957 0.334 0.738288
## CHAS1 2.6867338 0.8615798 3.118 0.001925 **
## NOX -17.7666112 3.8197437 -4.651 0.000004245643808 ***
## RM 3.8098652 0.4179253 9.116 < 0.0000000000000002 ***
## AGE 0.0006922 0.0132098 0.052 0.958229
## DIS -1.4755668 0.1994547 -7.398 0.000000000000601 ***
## RAD 0.3060495 0.0663464 4.613 0.000005070529023 ***
## TAX -0.0123346 0.0037605 -3.280 0.001112 **
## PTRATIO -0.9527472 0.1308268 -7.283 0.000000000001309 ***
## B 0.0093117 0.0026860 3.467 0.000573 ***
## LSTAT -0.5247584 0.0507153 -10.347 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 0.00000000000000022
For the second model, all the predictor variables were included and
the goodness of fit for this model produced an adjusted R-squared value
of 0.7338. This means that the model can explain 73.38% of variance in
the target variable MEDV (house price). This model improved
by 9.67%! Looking at the Pr(>|t|) column, the variables
are all significant, except INDUS and AGE.
From this observation alone, we can try to remove the insignificant variables to obtain a more optimal model, but we will try to use stepwise regression to automatically determine feature selection with a forward and backward (both) method of elimination.
model_all <-lm(formula = MEDV~., data = house_clean)
step_both <- stats::step(object = model_all, direction = "both")## Start: AIC=1589.64
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD +
## TAX + PTRATIO + B + LSTAT
##
## Df Sum of Sq RSS AIC
## - AGE 1 0.06 11079 1587.7
## - INDUS 1 2.52 11081 1587.8
## <none> 11079 1589.6
## - CHAS 1 218.97 11298 1597.5
## - TAX 1 242.26 11321 1598.6
## - CRIM 1 243.22 11322 1598.6
## - ZN 1 257.49 11336 1599.3
## - B 1 270.63 11349 1599.8
## - RAD 1 479.15 11558 1609.1
## - NOX 1 487.16 11566 1609.4
## - PTRATIO 1 1194.23 12273 1639.4
## - DIS 1 1232.41 12311 1641.0
## - RM 1 1871.32 12950 1666.6
## - LSTAT 1 2410.84 13490 1687.3
##
## Step: AIC=1587.65
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + DIS + RAD + TAX +
## PTRATIO + B + LSTAT
##
## Df Sum of Sq RSS AIC
## - INDUS 1 2.52 11081 1585.8
## <none> 11079 1587.7
## + AGE 1 0.06 11079 1589.6
## - CHAS 1 219.91 11299 1595.6
## - TAX 1 242.24 11321 1596.6
## - CRIM 1 243.20 11322 1596.6
## - ZN 1 260.32 11339 1597.4
## - B 1 272.26 11351 1597.9
## - RAD 1 481.09 11560 1607.2
## - NOX 1 520.87 11600 1608.9
## - PTRATIO 1 1200.23 12279 1637.7
## - DIS 1 1352.26 12431 1643.9
## - RM 1 1959.55 13038 1668.0
## - LSTAT 1 2718.88 13798 1696.7
##
## Step: AIC=1585.76
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO +
## B + LSTAT
##
## Df Sum of Sq RSS AIC
## <none> 11081 1585.8
## + INDUS 1 2.52 11079 1587.7
## + AGE 1 0.06 11081 1587.8
## - CHAS 1 227.21 11309 1594.0
## - CRIM 1 245.37 11327 1594.8
## - ZN 1 257.82 11339 1595.4
## - B 1 270.82 11352 1596.0
## - TAX 1 273.62 11355 1596.1
## - RAD 1 500.92 11582 1606.1
## - NOX 1 541.91 11623 1607.9
## - PTRATIO 1 1206.45 12288 1636.0
## - DIS 1 1448.94 12530 1645.9
## - RM 1 1963.66 13045 1666.3
## - LSTAT 1 2723.48 13805 1695.0
The result from both stepwise regression method took out the
variables: INDUS and AGE, which we also
identified as insignificant in the earlier analysis.
Make a new model based on results from stepwise regression:
model_step <-lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO +
B + LSTAT, data = house_clean)
summary(model_step)##
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD +
## TAX + PTRATIO + B + LSTAT, data = house_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 0.00000000000272727 ***
## CRIM -0.108413 0.032779 -3.307 0.001010 **
## ZN 0.045845 0.013523 3.390 0.000754 ***
## CHAS1 2.718716 0.854240 3.183 0.001551 **
## NOX -17.376023 3.535243 -4.915 0.00000120941304009 ***
## RM 3.801579 0.406316 9.356 < 0.0000000000000002 ***
## DIS -1.492711 0.185731 -8.037 0.00000000000000684 ***
## RAD 0.299608 0.063402 4.726 0.00000299679930933 ***
## TAX -0.011778 0.003372 -3.493 0.000521 ***
## PTRATIO -0.946525 0.129066 -7.334 0.00000000000092351 ***
## B 0.009291 0.002674 3.475 0.000557 ***
## LSTAT -0.522553 0.047424 -11.019 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 0.00000000000000022
Compared to the model with all variables, model_all the
adjusted R-squared of 0.7338, this new model based on step-wise
regression, model_step produced an adjusted R-squared of
0.7348. There is a slight improvement in the new model’s ability to
explain for variance of the target variable by 0.1%.
perform <- compare_performance(model_highcor,
model_all,
model_step)
comparison <- as.data.frame(perform)
comparisonWe will compare the models based on its adjusted R-squared, AIC value, and RMSE, to determine the most optimal model. Note that the most optimal model will have a higher adjusted R-squared value, a lower AIC and RMSE value.
Overall, model_step proved to be a more optimal model
out of the three, so we will use this model for prediction.
Predicted Fitted Values
## 1 2 3 4 5 6
## 30.12428 24.99653 30.53337 28.64799 27.98264 25.29804
## [1] 4.679736
Predicted Range Values
pred_interval <- predict(
object = model_step,
newdata = house_clean,
interval = "prediction",
level = 0.95)
head(pred_interval)## fit lwr upr
## 1 30.12428 20.77174 39.47682
## 2 24.99653 15.65887 34.33418
## 3 30.53337 21.17626 39.89048
## 4 28.64799 19.27267 38.02332
## 5 27.98264 18.60622 37.35906
## 6 25.29804 15.93515 34.66092
Actual Values
## [1] 24.0 21.6 34.7 33.4 36.2 28.7
Let’s interpret the prediction interval, for example, look at row 1 in head(house_clean):
If a house has these characteristics
CRIM(crime rate) of 0.00632ZN(proportion of residential land) of 18CHAS(Charles River proximity) of 0NOX(nitric oxides concentration) of 0.538RM(average number of rooms per dwelling) of
6.575DIS(weighted distances to five Boston employment
centres) of 4.09RAD(index of accessibility to radial highways)
of 1TAX(full-value property-tax rate) of 296PTRATIO(pupil-teacher ratio by town) of
15.3B(proportion of blacks by town) of 396.9LSTAT(% lower status of the population) of
4.98with a level of confidence of 95%, we are 95% certain that the result of our prediction will be in the range between 20.77174 - 39.47682.
Linearity
Let’s plot for fitted values vs residual, to check linearity
Although the line is near y=0, there are noticeable
outliers in the plot, and the values are gathered in the middle instead
of randomly scattered. Thus, the linearity test has not been met, which
means that a linear regression model may not be a suitable model that
represents the data.
Normality
##
## Shapiro-Wilk normality test
##
## data: model_step$residuals
## W = 0.90131, p-value < 0.00000000000000022
Based on the histogram and Shapiro-Wilk test, this model is not normally distributed. Thus, the next step of action could be to remove outliers from the data, transform the target variable before modelling, or using a different model that has no assumptions.
Homoscedacity
plot(model_step$fitted.values, y = model_step$residuals)
abline(h = 0, col = "red") # garis horizontal di angka 0##
## studentized Breusch-Pagan test
##
## data: model_step
## BP = 59.907, df = 11, p-value = 0.000000009647
The plot shows that the model is heteroscedastic, since there is a discernible pattern. The Breusch-Pagan test also presents a value of p<0.05.
Multicollinearity
## CRIM ZN CHAS NOX RM DIS RAD TAX
## 1.789704 2.239229 1.059819 3.778011 1.834806 3.443420 6.861126 7.272386
## PTRATIO B LSTAT
## 1.757681 1.341559 2.581984
There seems to be no multicollinearity in the model since all the VIF values are below 10.
Only one assumption was met for the model_step model,
which was no multicollinearity. Although it is possible to create a
linear regression model with only one assumption met, violating all the
other assumptions can lead to misleading results, thus making the
results of the model possibly unreliable.
Based on the models we made, variables that are useful to describe the
variances in house prices are CRIM, ZN, CHAS, NOX, RM, DIS, RAD, TAX,
PTRATIO, B, LSTAT. This model has an adjusted R-Square of 0.7348 and a
pretty low RMSE score 4.6797, which indicates that the model’s
predictions are close to the actual values.
To further improve our model’s reliability, the next course of action is to try a nonlinear transformation of the predictor variables or try to fit our data with a model without assumptions instead.