library(tidyverse, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(car, warn.conflicts = FALSE)
library(summarytools, warn.conflicts = FALSE)
library(leaflet, warn.conflicts = FALSE)
Taipei, Taiwan
The following dataset is taken from https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set. The inputs are as follows:
X1 = the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013, etc.)
X2 = the house age (unit: years)
X3 = the distance to the nearest MRT station (unit: meters)
X4 = the number of convenience stores within walking distance (integer)
X5 = the geographic coordinate, latitude (unit: degrees)
X6 = the geographic coordinate, longitude (unit: degrees)
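As an aside, the fractional transaction date can be decoded back into a calendar month. The helper below is a hypothetical sketch, not used in the rest of this project; it assumes the fractional part encodes the month as month/12, consistent with the examples above (how January is encoded is not documented).
# Hypothetical helper: decode the fractional transaction date, assuming the
# fraction encodes the month as month / 12 (e.g. 2013.250 -> March 2013).
decode_transaction_date <- function(x) {
  year  <- floor(x)
  month <- round((x - year) * 12)
  sprintf("%d-%02d", as.integer(year), as.integer(month))
}
decode_transaction_date(c(2013.250, 2013.500))
## [1] "2013-03" "2013-06"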
real_estate <- read_csv("C:/Users/Michele/Desktop/Esami, miei/Real estate.csv", show_col_types = FALSE)
head(real_estate)
## # A tibble: 6 x 8
## No `X1 transaction date` `X2 house age` `X3 distance to t~ `X4 number of c~
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2013. 32 84.9 10
## 2 2 2013. 19.5 307. 9
## 3 3 2014. 13.3 562. 5
## 4 4 2014. 13.3 562. 5
## 5 5 2013. 5 391. 5
## 6 6 2013. 7.1 2175. 3
## # ... with 3 more variables: X5 latitude <dbl>, X6 longitude <dbl>,
## # Y house price of unit area <dbl>
Since we have access to latitude and longitude, let’s visualize where the properties are located.
leaflet(real_estate)%>%
addTiles()%>%
addCircleMarkers(
lat = real_estate$`X5 latitude`,
lng = real_estate$`X6 longitude`,
clusterOptions = markerClusterOptions()
)
Latitude and longitude are useful for visualization, but I will not use them further in this project. Let’s focus on the variables that are most likely to impact the price of a property.
real_estate_clean <- real_estate%>%
dplyr::select(-c(1,2,6,7))
colnames(real_estate_clean) = c("house_age",
"distance_to_the_nearest_MRT_station",
"number_of_convenience_store",
"house_price_of_unit_area")
head(real_estate_clean)
## # A tibble: 6 x 4
## house_age distance_to_the_nearest_~ number_of_convenience~ house_price_of_uni~
## <dbl> <dbl> <dbl> <dbl>
## 1 32 84.9 10 37.9
## 2 19.5 307. 9 42.2
## 3 13.3 562. 5 47.3
## 4 13.3 562. 5 54.8
## 5 5 391. 5 43.1
## 6 7.1 2175. 3 32.1
print(summarytools::dfSummary(real_estate_clean), method = "render")
| No | Variable | Freqs (% of Valid) | Valid | Missing |
|----|----------|--------------------|-------|---------|
| 1 | house_age [numeric] | 236 distinct values | 414 (100.0%) | 0 (0.0%) |
| 2 | distance_to_the_nearest_MRT_station [numeric] | 259 distinct values | 414 (100.0%) | 0 (0.0%) |
| 3 | number_of_convenience_store [numeric] | 11 distinct values | 414 (100.0%) | 0 (0.0%) |
| 4 | house_price_of_unit_area [numeric] | 270 distinct values | 414 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.2)
2022-11-10
Now let’s look at the relationship between each of our predictors and the dependent variable, one at a time.
gg1 <- ggplot(real_estate_clean, aes(x=house_age, y= house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x)+
xlab("House age")+
ylab("Price")
gg2 <- ggplot(real_estate_clean, aes(x=distance_to_the_nearest_MRT_station,
y= house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x)+
xlab("Distance to the nearest MRT station")+
ylab("Price")
gg3 <- ggplot(real_estate_clean, aes(x=number_of_convenience_store, y=house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x )+
xlab("Number of convenience stores")+
ylab("Price")
gg1
gg2
gg3
Let’s inspect each of our variables individually with a simple linear regression (SLR). The first one is house age.
mod1_a <- lm(house_price_of_unit_area~house_age, data = real_estate_clean)
summary(mod1_a)
##
## Call:
## lm(formula = house_price_of_unit_area ~ house_age, data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.113 -10.738 1.626 8.199 77.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.43470 1.21098 35.042 < 2e-16 ***
## house_age -0.25149 0.05752 -4.372 1.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.32 on 412 degrees of freedom
## Multiple R-squared: 0.04434, Adjusted R-squared: 0.04202
## F-statistic: 19.11 on 1 and 412 DF, p-value: 1.56e-05
We can clearly see how poor our adjusted R-squared is. But if we take a look at the earlier scatter plot, the relationship between this predictor and the outcome seems to be non-linear: it looks more like a curved, roughly U-shaped (quadratic) relationship than a straight line. Let’s visualize the plot again, this time without forcing a linear fit.
ggplot(real_estate_clean, aes(house_age, house_price_of_unit_area) ) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Now let’s add the polynomial term to our model and see if it improves.
mod1_b<-lm(house_price_of_unit_area~house_age+I(house_age^2), data = real_estate_clean)
summary(mod1_b)
##
## Call:
## lm(formula = house_price_of_unit_area ~ house_age + I(house_age^2),
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.542 -9.085 -0.445 8.260 79.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.450539 1.651624 32.362 <2e-16 ***
## house_age -1.928877 0.193759 -9.955 <2e-16 ***
## I(house_age^2) 0.042181 0.004689 8.995 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.19 on 411 degrees of freedom
## Multiple R-squared: 0.2015, Adjusted R-squared: 0.1977
## F-statistic: 51.87 on 2 and 411 DF, p-value: < 2.2e-16
Just look at how much our adjusted R-squared has improved: from 0.042 to 0.198. A nested-model F-test, sketched below, gives a formal check of this improvement. Let’s now look at the other predictors.
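Since mod1_a is nested in mod1_b, the two fits can be compared directly with anova(); a quick supplementary check:
# Nested-model F-test: does adding the squared term significantly reduce the
# residual sum of squares?
anova(mod1_a, mod1_b)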
mod2_a <-lm(house_price_of_unit_area~distance_to_the_nearest_MRT_station, data = real_estate_clean)
summary(mod2_a)
##
## Call:
## lm(formula = house_price_of_unit_area ~ distance_to_the_nearest_MRT_station,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.396 -6.007 -1.195 4.831 73.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.8514271 0.6526105 70.26 <2e-16 ***
## distance_to_the_nearest_MRT_station -0.0072621 0.0003925 -18.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 412 degrees of freedom
## Multiple R-squared: 0.4538, Adjusted R-squared: 0.4524
## F-statistic: 342.2 on 1 and 412 DF, p-value: < 2.2e-16
This will likely be the variable with the most impact on our model. Let’s take a look at its distribution.
ggplot(real_estate_clean, aes(distance_to_the_nearest_MRT_station))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram above we can see that this variable is highly right-skewed. What can we do?
Logarithmic transformations are a convenient way to turn skewed variables into ones that are approximately normal; we use natural logarithms. Let’s look at the distribution of the logarithm of distance_to_the_nearest_MRT_station.
ggplot(real_estate_clean, aes(log(distance_to_the_nearest_MRT_station)))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now the distribution seems to be approximately normal. Let’s use this transformation in our model.
mod2_b <- lm(house_price_of_unit_area~log(distance_to_the_nearest_MRT_station), data = real_estate_clean)
summary(mod2_b)
##
## Call:
## lm(formula = house_price_of_unit_area ~ log(distance_to_the_nearest_MRT_station),
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.103 -5.023 -0.827 3.449 71.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.0169 2.6369 36.03 <2e-16
## log(distance_to_the_nearest_MRT_station) -8.9235 0.4064 -21.96 <2e-16
##
## (Intercept) ***
## log(distance_to_the_nearest_MRT_station) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.247 on 412 degrees of freedom
## Multiple R-squared: 0.5393, Adjusted R-squared: 0.5381
## F-statistic: 482.2 on 1 and 412 DF, p-value: < 2.2e-16
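The linear and logarithmic specifications for this predictor are not nested, so instead of an F-test we can compare them with the AIC (lower is better); a quick supplementary check:
# Compare the linear and logarithmic distance models by AIC.
AIC(mod2_a, mod2_b)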
As expected, the adjusted R-squared has improved by performing a log transformation on this variable (from 0.452 to 0.538). Finally, let’s check our last predictor.
mod3<- lm(house_price_of_unit_area~ number_of_convenience_store, data = real_estate_clean)
summary(mod3)
##
## Call:
## lm(formula = house_price_of_unit_area ~ number_of_convenience_store,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.407 -7.341 -1.788 5.984 87.681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.1811 0.9419 28.86 <2e-16 ***
## number_of_convenience_store 2.6377 0.1868 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
## F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
ggplot(real_estate_clean, aes(number_of_convenience_store))+
geom_density(fill = "red", alpha = 0.3)
Now we are satisfied with all of the predictors. Finally, let’s check the distribution of our target variable: house price per unit area.
ggplot(real_estate_clean, aes(house_price_of_unit_area ))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Again, we can likely improve the distribution of this variable by performing a log transformation. After checking it below, we will fit the full model with all of the transformed predictors.
ggplot(real_estate_clean, aes(log(house_price_of_unit_area)))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
final_model<- lm(log(house_price_of_unit_area) ~ house_age +
I(house_age^2)+
log(distance_to_the_nearest_MRT_station)+
number_of_convenience_store, data = real_estate_clean)
summary(final_model)
##
## Call:
## lm(formula = log(house_price_of_unit_area) ~ house_age + I(house_age^2) +
## log(distance_to_the_nearest_MRT_station) + number_of_convenience_store,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6459 -0.1288 0.0060 0.1328 1.0793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9625164 0.1106896 44.833 < 2e-16
## house_age -0.0198488 0.0041843 -4.744 2.90e-06
## I(house_age^2) 0.0003614 0.0001009 3.580 0.000385
## log(distance_to_the_nearest_MRT_station) -0.2037434 0.0155044 -13.141 < 2e-16
## number_of_convenience_store 0.0239011 0.0055903 4.275 2.38e-05
##
## (Intercept) ***
## house_age ***
## I(house_age^2) ***
## log(distance_to_the_nearest_MRT_station) ***
## number_of_convenience_store ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2413 on 409 degrees of freedom
## Multiple R-squared: 0.6255, Adjusted R-squared: 0.6219
## F-statistic: 170.8 on 4 and 409 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(final_model)
Let’s check all of the regression assumptions and see if we can improve our model. We start with component-plus-residual plots (crPlots) to check that each predictor enters the model linearly.
crPlots(final_model)
The first thing to do when checking the normality of residuals is to look at the Q-Q plot.
If the distribution of our residuals is normal, the points should lie approximately along the blue diagonal reference line.
qqPlot(final_model)
## [1] 114 271
It’s clear that we are close to a normal distribution. However, there are some observations that could cause problems: from the plot, the observations with the largest residuals (in absolute value) are 114 and 271. Let’s take a closer look.
real_estate_clean$resid <- final_model$residuals
which(abs(real_estate_clean$resid)>0.5)
## 20 48 56 89 114 127 149 163 229 252 271 276 286 313 331 345 348
## 20 48 56 89 114 127 149 163 229 252 271 276 286 313 331 345 348
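Before removing anything, note that influential observations are often flagged by Cook’s distance rather than by raw residual size. A minimal sketch using the common 4/n rule of thumb (only a supplementary check; below we keep the residual-based selection):
# Flag observations whose Cook's distance exceeds the conventional 4/n cutoff.
cooks_d <- cooks.distance(final_model)
which(cooks_d > 4 / nrow(real_estate_clean))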
real_estate_clean2 <- real_estate_clean[-c(20,48,56,89,114,127,149,229,252,271,276,286,313,331,345,348),]
final_model2 <- lm(log(house_price_of_unit_area) ~ house_age +
I(house_age^2)+
log(distance_to_the_nearest_MRT_station)+
number_of_convenience_store, data = real_estate_clean2)
summary(final_model2)
##
## Call:
## lm(formula = log(house_price_of_unit_area) ~ house_age + I(house_age^2) +
## log(distance_to_the_nearest_MRT_station) + number_of_convenience_store,
## data = real_estate_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57420 -0.11894 0.00891 0.12995 0.49488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.028e+00 9.058e-02 55.515 < 2e-16
## house_age -1.809e-02 3.324e-03 -5.441 9.32e-08
## I(house_age^2) 3.014e-04 8.051e-05 3.744 0.000208
## log(distance_to_the_nearest_MRT_station) -2.155e-01 1.262e-02 -17.073 < 2e-16
## number_of_convenience_store 2.442e-02 4.520e-03 5.402 1.14e-07
##
## (Intercept) ***
## house_age ***
## I(house_age^2) ***
## log(distance_to_the_nearest_MRT_station) ***
## number_of_convenience_store ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1903 on 393 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.7326
## F-statistic: 273 on 4 and 393 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(final_model2)
qqPlot(final_model2)
## [1] 156 214
Our model has improved, as we can see from the adjusted R-squared (from 0.62 to 0.73).
The normality of the residuals also looks better now.
Finally, to verify that we truly meet the normality assumption, we should run a formal test.
shapiro.test(final_model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: final_model2$residuals
## W = 0.99569, p-value = 0.3462
The Shapiro-Wilk test is a normality test of the null hypothesis that a sample comes from a normally distributed population. Since our p-value is greater than the significance level chosen for this test (alpha = 0.01), we fail to reject the null hypothesis.
Lastly, we should check that the mean of the residuals is equal to 0.
mean(final_model2$residuals)
## [1] 8.176754e-18
It is essentially equal to 0 (for OLS with an intercept the residuals sum to zero by construction), so we can proceed.
By looking at the third diagnostic plot (scale-location), the variance of the residuals seems roughly constant, but it is always better to check with a test.
To test this assumption analytically, we will use the Breusch-Pagan test.
The null hypothesis of this test is that the model is homoscedastic (the residuals are distributed with equal variance).
The alternative hypothesis is that the model is heteroscedastic (the residuals are not distributed with equal variance).
lmtest::bptest(final_model2)
##
## studentized Breusch-Pagan test
##
## data: final_model2
## BP = 12.619, df = 4, p-value = 0.0133
At significance level alpha = 0.01 we fail to reject the null hypothesis of homoscedasticity (p = 0.0133).
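Note that at alpha = 0.05 we would instead reject homoscedasticity. In that case, a common remedy is to report heteroscedasticity-robust (HC) standard errors; a minimal sketch, assuming the sandwich package is installed:
# Coefficient tests for final_model2 with HC3 heteroscedasticity-robust
# standard errors.
lmtest::coeftest(final_model2, vcov. = sandwich::vcovHC(final_model2, type = "HC3"))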
Let’s take a first look at the autocorrelation plot of the residuals. Ideally, we want the correlation (y-axis) to drop to values close to zero right after the first vertical line (lag 0).
acf(final_model2$residuals)
By looking at the graph we should be ok, but let’s check, as always, with a test.
lawstat::runs.test(final_model2$residuals)
##
## Runs Test - Two sided
##
## data: final_model2$residuals
## Standardized Runs Statistic = -0.10038, p-value = 0.92
With a near-perfect p-value, we fail to reject the null hypothesis of randomness. An alternative autocorrelation check is sketched below; after that, we move on to the fifth assumption.
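Another test commonly used for first-order autocorrelation in regression residuals is the Durbin-Watson test; a quick supplementary check:
# Durbin-Watson test: the null hypothesis is no first-order autocorrelation
# in the residuals.
lmtest::dwtest(final_model2)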
Let’s take a look at Pearson’s correlation coefficients between each predictor (as it enters the model) and the residuals.
cor.test(real_estate_clean2$house_age, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: real_estate_clean2$house_age and final_model2$residuals
## t = -9.5775e-16, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -4.812893e-17
cor.test((real_estate_clean2$house_age)^2, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: (real_estate_clean2$house_age)^2 and final_model2$residuals
## t = 6.0981e-16, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## 3.06441e-17
cor.test(log(real_estate_clean2$distance_to_the_nearest_MRT_station), final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: log(real_estate_clean2$distance_to_the_nearest_MRT_station) and final_model2$residuals
## t = -2.9527e-15, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -1.483768e-16
cor.test(real_estate_clean2$number_of_convenience_store, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: real_estate_clean2$number_of_convenience_store and final_model2$residuals
## t = -7.5209e-15, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -3.779387e-16
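For reference, the same four checks can be run in one pass; a minimal sketch, using the transformed predictors as they enter the model:
# Correlation of each (transformed) regressor with the residuals of final_model2.
predictors <- with(real_estate_clean2, list(
  house_age          = house_age,
  house_age_squared  = house_age^2,
  log_distance_MRT   = log(distance_to_the_nearest_MRT_station),
  convenience_stores = number_of_convenience_store
))
sapply(predictors, function(x) cor(x, final_model2$residuals))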
The correlation coefficient is essentially zero for all of the predictors. Next, let’s look at the correlations among the predictors themselves to check for multicollinearity.
real_estate_corr <- real_estate_clean2[,1:3]%>%
mutate(house_age_squared = house_age^2,
log_Distance_Mrt_station = log(distance_to_the_nearest_MRT_station))
real_estate_corr<-real_estate_corr%>%
select(-distance_to_the_nearest_MRT_station)
corrplot::corrplot(cor(real_estate_corr), method = "number")
Let’s check with a formal measure: the Variance Inflation Factor (VIF). To keep it simple, it tells us how much of one regressor is explained by the other regressors; the lower, the better. Generally, we want to exclude variables with a VIF higher than 10.
car::vif(final_model2)
## house_age
## 15.496165
## I(house_age^2)
## 15.482576
## log(distance_to_the_nearest_MRT_station)
## 2.106847
## number_of_convenience_store
## 1.917704
We can see that house_age has a high VIF. In this case it does not matter, since we deliberately introduced its square into the model: house_age is inevitably correlated with house_age^2, which inflates both of their VIFs. So we are fine with all of the variables.
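For intuition, the VIF of a regressor can be reproduced by hand: regress it on the other regressors and compute 1 / (1 - R^2). A minimal sketch for house_age, which should match the value reported by car::vif() above up to rounding:
# Auxiliary regression of house_age on the remaining regressors of final_model2.
aux <- lm(house_age ~ I(house_age^2) +
            log(distance_to_the_nearest_MRT_station) +
            number_of_convenience_store,
          data = real_estate_clean2)
# VIF of house_age = 1 / (1 - R^2) of the auxiliary regression.
1 / (1 - summary(aux)$r.squared)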