1 Introduction

In this analysis, we use a linear regression model on the Boston house prices dataset. We want to investigate the relationships among the variables and to predict house prices in Boston from historical data. The dataset can be accessed from Kaggle: Boston House Prices.

Goals for this analysis:

  • Identify which variables are significant in predicting house prices (MEDV).
  • Measure how well those variables explain the variation in house prices (MEDV).

2 Data Preparation
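
The code below relies on several packages; the following setup is our assumption, inferred from the functions used throughout the analysis:

library(dplyr)        # %>%, mutate(), glimpse()
library(GGally)       # ggcorr() correlation heatmap
library(ltm)          # biserial.cor() point-biserial correlation
library(performance)  # compare_performance() model comparison
library(MLmetrics)    # RMSE()
library(lmtest)       # bptest() Breusch-Pagan test
library(car)          # vif() variance inflation factors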

house <- read.csv("boston.csv")
head(house)

2.1 Variable Description

  • CRIM: per capita crime rate by town
  • ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS: proportion of non-retail business acres per town
  • CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX: nitric oxides concentration (parts per 10 million) [parts/10M]
  • RM: average number of rooms per dwelling
  • AGE: proportion of owner-occupied units built prior to 1940
  • DIS: weighted distances to five Boston employment centres
  • RAD: index of accessibility to radial highways
  • TAX: full-value property-tax rate per $10,000
  • PTRATIO: pupil-teacher ratio by town
  • B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
  • LSTAT: % lower status of the population
  • MEDV: median value of owner-occupied homes in $1000s

2.2 Inspecting Data

glimpse(house)
## Rows: 506
## Columns: 14
## $ CRIM    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ ZN      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ INDUS   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ CHAS    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ NOX     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ RM      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ AGE     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ DIS     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ RAD     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ TAX     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ PTRATIO <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ B       <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ LSTAT   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ MEDV    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…

Let’s change the data type of CHAS from integer to factor, so that the model treats it as a dummy variable.

house_clean <- house %>% mutate(CHAS = as.factor(CHAS))
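
As a quick check (not part of the original code), we can confirm the conversion and see how R will encode the dummy; the treatment coding below is why later model summaries report a coefficient named CHAS1:

# "0" is the reference level; "1" gets its own dummy column in lm().
levels(house_clean$CHAS)
contrasts(house_clean$CHAS)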

Check for Missing Values

colSums(is.na(house))
##    CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX 
##       0       0       0       0       0       0       0       0       0       0 
## PTRATIO       B   LSTAT    MEDV 
##       0       0       0       0

There are no missing values in the data.

3 Exploratory Data Analysis

summary(house_clean)
##       CRIM                ZN             INDUS       CHAS         NOX        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
##        RM             AGE              DIS              RAD        
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
##       TAX           PTRATIO            B              LSTAT      
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73  
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95  
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36  
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65  
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95  
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97  
##       MEDV      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.53  
##  3rd Qu.:25.00  
##  Max.   :50.00

Let’s investigate whether there are any relationships between the numeric variables using Pearson correlation.

ggcorr(house_clean, label = T, label_size = 2.9, hjust = 0.7, layout.exp = 2)
## Warning in ggcorr(house_clean, label = T, label_size = 2.9, hjust = 0.7, : data
## in column(s) 'CHAS' are not numeric and were ignored

From the correlation plot above, we can see that most variables have a weak to moderate negative correlation with our target variable MEDV. The strongest relationships are the negative correlation between LSTAT and MEDV and the one strong positive correlation, between RM and MEDV.
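
To read exact figures rather than heatmap colours, we can sort the Pearson correlations of each numeric column with MEDV (a small added check; CHAS is excluded because it is now a factor):

# Correlation of every numeric variable with the target, sorted ascending.
num_vars <- house_clean[sapply(house_clean, is.numeric)]
sort(cor(num_vars)[, "MEDV"])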

Now let’s look at the relationship between our only categorical variable, CHAS, and the target variable MEDV, using the point-biserial correlation.

biserial.cor(x = house_clean$MEDV, y = house_clean$CHAS)
## [1] -0.1752602
plot(house_clean$CHAS, house_clean$MEDV)

The correlation between house price and bounding the Charles River is weak. Note the sign, though: with its default level = 1, biserial.cor measures the correlation with respect to the first factor level (CHAS = 0), so the -0.175 above corresponds to +0.175 with respect to CHAS = 1; riverside tracts are actually associated with slightly higher prices. Either way, the association is weak, so CHAS on its own is unlikely to explain house prices.
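
A quick sanity check on the sign (an added snippet, using the same ltm package plus base R); both lines should return approximately +0.175:

# level = 2 measures the correlation with respect to CHAS = 1, which
# matches the plain Pearson correlation on the original 0/1 coding.
biserial.cor(x = house_clean$MEDV, y = house_clean$CHAS, level = 2)
cor(house_clean$MEDV, as.numeric(as.character(house_clean$CHAS)))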

4 Modelling

Since the variables with the highest correlation with our target variable are LSTAT and RM, let’s first build a model with just these two predictors.

model_highcor <- lm(formula = MEDV ~ RM + LSTAT, data = house_clean)
summary(model_highcor)
## 
## Call:
## lm(formula = MEDV ~ RM + LSTAT, data = house_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.076  -3.516  -1.010   1.909  28.131 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -1.35827    3.17283  -0.428               0.669    
## RM           5.09479    0.44447  11.463 <0.0000000000000002 ***
## LSTAT       -0.64236    0.04373 -14.689 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.54 on 503 degrees of freedom
## Multiple R-squared:  0.6386, Adjusted R-squared:  0.6371 
## F-statistic: 444.3 on 2 and 503 DF,  p-value: < 0.00000000000000022

The first model, using RM and LSTAT, produced an adjusted R-squared of 0.6371. This means that the model explains 63.71% of the variance in the target variable MEDV (house price).
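
For reference, the adjusted R-squared can be reproduced from the multiple R-squared by penalizing for the number of predictors; a worked check (not part of the original code):

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# here with n = 506 observations and p = 2 predictors.
r2 <- summary(model_highcor)$r.squared
1 - (1 - r2) * (506 - 1) / (506 - 2 - 1)  # ~0.6371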

Let’s also make a model with all variables and compare:

model_all <- lm(formula = MEDV ~ ., data = house_clean)
summary(model_all)
## 
## Call:
## lm(formula = MEDV ~ ., data = house_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##                Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)  36.4594884   5.1034588   7.144    0.000000000003283 ***
## CRIM         -0.1080114   0.0328650  -3.287             0.001087 ** 
## ZN            0.0464205   0.0137275   3.382             0.000778 ***
## INDUS         0.0205586   0.0614957   0.334             0.738288    
## CHAS1         2.6867338   0.8615798   3.118             0.001925 ** 
## NOX         -17.7666112   3.8197437  -4.651    0.000004245643808 ***
## RM            3.8098652   0.4179253   9.116 < 0.0000000000000002 ***
## AGE           0.0006922   0.0132098   0.052             0.958229    
## DIS          -1.4755668   0.1994547  -7.398    0.000000000000601 ***
## RAD           0.3060495   0.0663464   4.613    0.000005070529023 ***
## TAX          -0.0123346   0.0037605  -3.280             0.001112 ** 
## PTRATIO      -0.9527472   0.1308268  -7.283    0.000000000001309 ***
## B             0.0093117   0.0026860   3.467             0.000573 ***
## LSTAT        -0.5247584   0.0507153 -10.347 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 0.00000000000000022

For the second model, all predictor variables were included, producing an adjusted R-squared of 0.7338. This means that the model explains 73.38% of the variance in the target variable MEDV (house price), an improvement of 9.67 percentage points over the first model. Looking at the Pr(>|t|) column, every variable is significant except INDUS and AGE.

Based on this observation alone we could simply drop the insignificant variables, but instead we will use stepwise regression to perform the feature selection automatically, using the combined forward-and-backward ("both") direction, which adds or removes one variable at a time so as to minimize the AIC.

4.1 Stepwise Regression: Feature Selection

step_both <- stats::step(object = model_all, direction = "both")
## Start:  AIC=1589.64
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - AGE      1      0.06 11079 1587.7
## - INDUS    1      2.52 11081 1587.8
## <none>                 11079 1589.6
## - CHAS     1    218.97 11298 1597.5
## - TAX      1    242.26 11321 1598.6
## - CRIM     1    243.22 11322 1598.6
## - ZN       1    257.49 11336 1599.3
## - B        1    270.63 11349 1599.8
## - RAD      1    479.15 11558 1609.1
## - NOX      1    487.16 11566 1609.4
## - PTRATIO  1   1194.23 12273 1639.4
## - DIS      1   1232.41 12311 1641.0
## - RM       1   1871.32 12950 1666.6
## - LSTAT    1   2410.84 13490 1687.3
## 
## Step:  AIC=1587.65
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + DIS + RAD + TAX + 
##     PTRATIO + B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - INDUS    1      2.52 11081 1585.8
## <none>                 11079 1587.7
## + AGE      1      0.06 11079 1589.6
## - CHAS     1    219.91 11299 1595.6
## - TAX      1    242.24 11321 1596.6
## - CRIM     1    243.20 11322 1596.6
## - ZN       1    260.32 11339 1597.4
## - B        1    272.26 11351 1597.9
## - RAD      1    481.09 11560 1607.2
## - NOX      1    520.87 11600 1608.9
## - PTRATIO  1   1200.23 12279 1637.7
## - DIS      1   1352.26 12431 1643.9
## - RM       1   1959.55 13038 1668.0
## - LSTAT    1   2718.88 13798 1696.7
## 
## Step:  AIC=1585.76
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + 
##     B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11081 1585.8
## + INDUS    1      2.52 11079 1587.7
## + AGE      1      0.06 11081 1587.8
## - CHAS     1    227.21 11309 1594.0
## - CRIM     1    245.37 11327 1594.8
## - ZN       1    257.82 11339 1595.4
## - B        1    270.82 11352 1596.0
## - TAX      1    273.62 11355 1596.1
## - RAD      1    500.92 11582 1606.1
## - NOX      1    541.91 11623 1607.9
## - PTRATIO  1   1206.45 12288 1636.0
## - DIS      1   1448.94 12530 1645.9
## - RM       1   1963.66 13045 1666.3
## - LSTAT    1   2723.48 13805 1695.0

The stepwise regression (direction = "both") removed the variables INDUS and AGE, which we also identified as insignificant in the earlier analysis. At each step above, dropping them lowers the AIC, while removing any other variable would raise it.
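
Rather than retyping the selected variables, the chosen formula can also be pulled straight from the step object, since step() returns a fitted lm (a small convenience, not part of the original code):

# The final selected formula, reusable directly in a new lm() call.
formula(step_both)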

Make a new model based on results from stepwise regression:

model_step <- lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO +
                B + LSTAT, data = house_clean)
summary(model_step)
## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + 
##     TAX + PTRATIO + B + LSTAT, data = house_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5984  -2.7386  -0.5046   1.7273  26.2373 
## 
## Coefficients:
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  36.341145   5.067492   7.171  0.00000000000272727 ***
## CRIM         -0.108413   0.032779  -3.307             0.001010 ** 
## ZN            0.045845   0.013523   3.390             0.000754 ***
## CHAS1         2.718716   0.854240   3.183             0.001551 ** 
## NOX         -17.376023   3.535243  -4.915  0.00000120941304009 ***
## RM            3.801579   0.406316   9.356 < 0.0000000000000002 ***
## DIS          -1.492711   0.185731  -8.037  0.00000000000000684 ***
## RAD           0.299608   0.063402   4.726  0.00000299679930933 ***
## TAX          -0.011778   0.003372  -3.493             0.000521 ***
## PTRATIO      -0.946525   0.129066  -7.334  0.00000000000092351 ***
## B             0.009291   0.002674   3.475             0.000557 ***
## LSTAT        -0.522553   0.047424 -11.019 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7348 
## F-statistic: 128.2 on 11 and 494 DF,  p-value: < 0.00000000000000022

Compared to the model with all variables, model_all (adjusted R-squared of 0.7338), the new stepwise model model_step produced an adjusted R-squared of 0.7348: a slight improvement of 0.1 percentage points in explained variance, achieved with two fewer predictors.

5 Model Evaluation

5.1 Model Comparison

perform <- compare_performance(model_highcor,
                               model_all,
                               model_step)

comparison <- as.data.frame(perform)
comparison

We will compare the models based on adjusted R-squared, AIC, and RMSE to determine the best model. Note that the preferred model has a higher adjusted R-squared and lower AIC and RMSE values.

  • Optimal model based on adjusted R-squared: model_step
  • Optimal model based on AIC value: model_step
  • Optimal model based on RMSE value: model_all

Overall, model_step proved the best of the three, winning on two of the three criteria, so we will use this model for prediction.
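
As a cross-check on compare_performance() (an added sketch; RMSE() here is assumed to come from MLmetrics), the same three metrics can be computed by hand:

# Adjusted R-squared, AIC, and in-sample RMSE for each candidate model.
models <- list(model_highcor = model_highcor,
               model_all = model_all,
               model_step = model_step)
sapply(models, function(m) c(adj_r2 = summary(m)$adj.r.squared,
                             AIC = AIC(m),
                             RMSE = RMSE(y_pred = fitted(m),
                                         y_true = house_clean$MEDV)))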

5.2 Model Prediction and Error

Point Predictions (Fitted Values)

pred_step <- predict(object = model_step,
                     newdata = house_clean)

head(pred_step)
##        1        2        3        4        5        6 
## 30.12428 24.99653 30.53337 28.64799 27.98264 25.29804
RMSE(y_pred = pred_step, y_true = house_clean$MEDV)
## [1] 4.679736
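
RMSE is simply the square root of the mean squared difference between predictions and actual values; a one-line check (not in the original code) confirms the figure above:

# Root mean squared error computed directly from the predictions.
sqrt(mean((pred_step - house_clean$MEDV)^2))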

Prediction Intervals

pred_interval <- predict(
  object = model_step,
  newdata = house_clean,
  interval = "prediction",
  level = 0.95)

head(pred_interval)
##        fit      lwr      upr
## 1 30.12428 20.77174 39.47682
## 2 24.99653 15.65887 34.33418
## 3 30.53337 21.17626 39.89048
## 4 28.64799 19.27267 38.02332
## 5 27.98264 18.60622 37.35906
## 6 25.29804 15.93515 34.66092

Actual Values

head(house_clean$MEDV)
## [1] 24.0 21.6 34.7 33.4 36.2 28.7
head(house_clean)

Let’s interpret the prediction interval, taking row 1 of head(house_clean) as an example.

If a house has these characteristics:

  • CRIM (per capita crime rate) of 0.00632
  • ZN (proportion of residential land zoned for large lots) of 18
  • CHAS (Charles River dummy) of 0
  • NOX (nitric oxides concentration) of 0.538
  • RM (average number of rooms per dwelling) of 6.575
  • DIS (weighted distance to five Boston employment centres) of 4.09
  • RAD (index of accessibility to radial highways) of 1
  • TAX (full-value property-tax rate) of 296
  • PTRATIO (pupil-teacher ratio by town) of 15.3
  • B (1000(Bk - 0.63)^2) of 396.9
  • LSTAT (% lower status of the population) of 4.98

then, at a 95% confidence level, we predict that its price (MEDV) will fall in the range 20.77174 to 39.47682 (in $1000s).
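
To reproduce this single-row interval directly (a brief illustration; new_house here is simply row 1 of the data, but any one-row data frame with the same columns would work):

# Prediction interval for a single house.
new_house <- house_clean[1, ]
predict(model_step, newdata = new_house, interval = "prediction", level = 0.95)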

5.3 Assumptions

Linearity

Let’s plot the fitted values against the residuals to check linearity.

plot(model_step, which = 1)

Although the line stays near y = 0, there are noticeable outliers in the plot, and the residuals cluster in the middle rather than scattering randomly. The linearity assumption is therefore not satisfied, which suggests that a plain linear regression may not be a suitable representation of the data.

Normality

hist(model_step$residuals)

shapiro.test(model_step$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_step$residuals
## W = 0.90131, p-value < 0.00000000000000022

Based on the histogram and the Shapiro-Wilk test (p < 0.05, so the null hypothesis of normal residuals is rejected), the residuals are not normally distributed. Possible next steps are to remove outliers from the data, to transform the target variable before modelling, or to use a different model that does not rely on this assumption; a sketch of the transformation option follows.
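
As an illustration of the transformation route (an added sketch, not part of the original analysis), we can refit the same formula on a log-transformed target and re-test the residuals:

# A log target often tames right-skewed price data.
model_log <- lm(log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX +
                  PTRATIO + B + LSTAT, data = house_clean)
shapiro.test(model_log$residuals)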

Homoscedasticity

plot(model_step$fitted.values, y = model_step$residuals)
abline(h = 0, col = "red") # horizontal line at 0

bptest(model_step)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_step
## BP = 59.907, df = 11, p-value = 0.000000009647

The plot shows a discernible pattern in the residuals, indicating that the model is heteroscedastic. The Breusch-Pagan test confirms this: with p < 0.05, we reject the null hypothesis of constant error variance.
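
One common mitigation, beyond the scope of the original write-up, is to keep the coefficients but compute heteroscedasticity-consistent standard errors; a sketch assuming the sandwich package is available:

library(sandwich)
# HC3 robust standard errors: the coefficients are unchanged, but the
# t-tests no longer assume constant error variance.
coeftest(model_step, vcov = vcovHC(model_step, type = "HC3"))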

Multicollinearity

vif(model_step)
##     CRIM       ZN     CHAS      NOX       RM      DIS      RAD      TAX 
## 1.789704 2.239229 1.059819 3.778011 1.834806 3.443420 6.861126 7.272386 
##  PTRATIO        B    LSTAT 
## 1.757681 1.341559 2.581984

There appears to be no severe multicollinearity in the model, since all VIF values are below 10 (RAD and TAX are the highest, at around 7, reflecting their strong mutual correlation).

6 Conclusion

Only one assumption was met for model_step: no multicollinearity. Although it is possible to use a linear regression model with only one assumption met, violating the others can produce misleading inference, so the model’s results should be treated with caution.
Based on the models we built, the variables useful for explaining the variance in house prices are CRIM, ZN, CHAS, NOX, RM, DIS, RAD, TAX, PTRATIO, B, and LSTAT. This model has an adjusted R-squared of 0.7348 and a fairly low RMSE of 4.6797, which indicates that its predictions are reasonably close to the actual values.

To further improve the model’s reliability, the next course of action is to try nonlinear transformations of the predictor (or target) variables, or to fit the data with a model that makes fewer distributional assumptions.