library(tidyverse, warn.conflicts = FALSE)
library(ggplot2, warn.conflicts = FALSE)
library(car, warn.conflicts = FALSE)
library(summarytools, warn.conflicts = FALSE)
library(leaflet, warn.conflicts = FALSE)
Taipei, Taiwan
The following dataset is taken from https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set. The inputs are as follows:
X1 = the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013, etc.)
X2 = the house age (unit: years)
X3 = the distance to the nearest MRT station (unit: meters)
X4 = the number of convenience stores within walking distance (integer)
X5 = the geographic coordinate, latitude (unit: degrees)
X6 = the geographic coordinate, longitude (unit: degrees)
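As an aside, the fractional transaction date can be decoded back into a calendar month. The helper below is a hypothetical sketch, not used in the rest of this project; it assumes the fractional part encodes the month as month/12, consistent with the examples above (how January is encoded is not documented).
# Hypothetical helper: decode the fractional transaction date, assuming the
# fraction encodes the month as month / 12 (e.g. 2013.250 -> March 2013).
decode_transaction_date <- function(x) {
  year  <- floor(x)
  month <- round((x - year) * 12)
  sprintf("%d-%02d", as.integer(year), as.integer(month))
}
decode_transaction_date(c(2013.250, 2013.500))
## [1] "2013-03" "2013-06"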
real_estate <- read_csv("C:/Users/Michele/Desktop/Esami, miei/Real estate.csv", show_col_types = FALSE)
head(real_estate)
## # A tibble: 6 x 8
## No `X1 transaction date` `X2 house age` `X3 distance to t~ `X4 number of c~
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2013. 32 84.9 10
## 2 2 2013. 19.5 307. 9
## 3 3 2014. 13.3 562. 5
## 4 4 2014. 13.3 562. 5
## 5 5 2013. 5 391. 5
## 6 6 2013. 7.1 2175. 3
## # ... with 3 more variables: X5 latitude <dbl>, X6 longitude <dbl>,
## # Y house price of unit area <dbl>
Since we have access to latitude and longitude, let’s visualize where the properties are located.
leaflet(real_estate)%>%
addTiles()%>%
addCircleMarkers(
lat = real_estate$`X5 latitude`,
lng = real_estate$`X6 longitude`,
clusterOptions = markerClusterOptions()
)
Latitude and longitude are useful for visualization, but I will not use them further in this project. Let’s focus on the variables that are most likely to impact the price of a property.
real_estate_clean <- real_estate%>%
dplyr::select(-c(1,2,6,7))
colnames(real_estate_clean) = c("house_age",
"distance_to_the_nearest_MRT_station",
"number_of_convenience_store",
"house_price_of_unit_area")
head(real_estate_clean)
## # A tibble: 6 x 4
## house_age distance_to_the_nearest_~ number_of_convenience~ house_price_of_uni~
## <dbl> <dbl> <dbl> <dbl>
## 1 32 84.9 10 37.9
## 2 19.5 307. 9 42.2
## 3 13.3 562. 5 47.3
## 4 13.3 562. 5 54.8
## 5 5 391. 5 43.1
## 6 7.1 2175. 3 32.1
print(summarytools::dfSummary(real_estate_clean), method = "render")
| No | Variable | Freqs (% of Valid) | Valid | Missing |
|----|----------|--------------------|-------|---------|
| 1 | house_age [numeric] | 236 distinct values | 414 (100.0%) | 0 (0.0%) |
| 2 | distance_to_the_nearest_MRT_station [numeric] | 259 distinct values | 414 (100.0%) | 0 (0.0%) |
| 3 | number_of_convenience_store [numeric] | 11 distinct values | 414 (100.0%) | 0 (0.0%) |
| 4 | house_price_of_unit_area [numeric] | 270 distinct values | 414 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.2)
2022-11-10
Now let’s look at the relationship between each of our predictors and the dependent variable, one at a time.
gg1 <- ggplot(real_estate_clean, aes(x=house_age, y= house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x)+
xlab("House age")+
ylab("Price")
gg2 <- ggplot(real_estate_clean, aes(x=distance_to_the_nearest_MRT_station,
y= house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x)+
xlab("Distance to the nearest MRT station")+
ylab("Price")
gg3 <- ggplot(real_estate_clean, aes(x=number_of_convenience_store, y=house_price_of_unit_area))+
geom_point()+
stat_smooth(method = lm, formula = y~x )+
xlab("Number of convenience stores")+
ylab("Price")
gg1
gg2
gg3
Let’s inspect each of our variables individually with a simple linear regression (SLR). The first one is house age.
mod1_a <- lm(house_price_of_unit_area~house_age, data = real_estate_clean)
summary(mod1_a)
##
## Call:
## lm(formula = house_price_of_unit_area ~ house_age, data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.113 -10.738 1.626 8.199 77.781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.43470 1.21098 35.042 < 2e-16 ***
## house_age -0.25149 0.05752 -4.372 1.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.32 on 412 degrees of freedom
## Multiple R-squared: 0.04434, Adjusted R-squared: 0.04202
## F-statistic: 19.11 on 1 and 412 DF, p-value: 1.56e-05
We can clearly see how poor our adjusted R-squared is. But if we take a look at the earlier scatter plot, the relationship between this predictor and the outcome seems to be non-linear: it looks more like a curved, roughly U-shaped (quadratic) relationship than a straight line. Let’s visualize the plot again, this time without forcing a linear fit.
ggplot(real_estate_clean, aes(house_age, house_price_of_unit_area) ) +
geom_point() +
stat_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Now let’s add the polynomial term to our model and see if it improves.
mod1_b<-lm(house_price_of_unit_area~house_age+I(house_age^2), data = real_estate_clean)
summary(mod1_b)
##
## Call:
## lm(formula = house_price_of_unit_area ~ house_age + I(house_age^2),
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.542 -9.085 -0.445 8.260 79.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.450539 1.651624 32.362 <2e-16 ***
## house_age -1.928877 0.193759 -9.955 <2e-16 ***
## I(house_age^2) 0.042181 0.004689 8.995 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.19 on 411 degrees of freedom
## Multiple R-squared: 0.2015, Adjusted R-squared: 0.1977
## F-statistic: 51.87 on 2 and 411 DF, p-value: < 2.2e-16
Just look at how much our adjusted R-squared has improved: from 0.042 to 0.198. A nested-model F-test, sketched below, gives a formal check of this improvement. Let’s now look at the other predictors.
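Since mod1_a is nested in mod1_b, the two fits can be compared directly with anova(); a quick supplementary check:
# Nested-model F-test: does adding the squared term significantly reduce the
# residual sum of squares?
anova(mod1_a, mod1_b)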
mod2_a <-lm(house_price_of_unit_area~distance_to_the_nearest_MRT_station, data = real_estate_clean)
summary(mod2_a)
##
## Call:
## lm(formula = house_price_of_unit_area ~ distance_to_the_nearest_MRT_station,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.396 -6.007 -1.195 4.831 73.483
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.8514271 0.6526105 70.26 <2e-16 ***
## distance_to_the_nearest_MRT_station -0.0072621 0.0003925 -18.50 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.07 on 412 degrees of freedom
## Multiple R-squared: 0.4538, Adjusted R-squared: 0.4524
## F-statistic: 342.2 on 1 and 412 DF, p-value: < 2.2e-16
This will likely be the variable with the most impact on our model. Let’s take a look at its distribution.
ggplot(real_estate_clean, aes(distance_to_the_nearest_MRT_station))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram above we can see that this variable is highly right-skewed. What can we do?
Logarithmic transformations are a convenient way to turn skewed variables into ones that are approximately normal; we use natural logarithms. Let’s look at the distribution of the logarithm of distance_to_the_nearest_MRT_station.
ggplot(real_estate_clean, aes(log(distance_to_the_nearest_MRT_station)))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now the distribution seems to be approximately normal. Let’s use this transformation in our model.
mod2_b <- lm(house_price_of_unit_area~log(distance_to_the_nearest_MRT_station), data = real_estate_clean)
summary(mod2_b)
##
## Call:
## lm(formula = house_price_of_unit_area ~ log(distance_to_the_nearest_MRT_station),
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.103 -5.023 -0.827 3.449 71.846
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.0169 2.6369 36.03 <2e-16
## log(distance_to_the_nearest_MRT_station) -8.9235 0.4064 -21.96 <2e-16
##
## (Intercept) ***
## log(distance_to_the_nearest_MRT_station) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.247 on 412 degrees of freedom
## Multiple R-squared: 0.5393, Adjusted R-squared: 0.5381
## F-statistic: 482.2 on 1 and 412 DF, p-value: < 2.2e-16
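The linear and logarithmic specifications for this predictor are not nested, so instead of an F-test we can compare them with the AIC (lower is better); a quick supplementary check:
# Compare the linear and logarithmic distance models by AIC.
AIC(mod2_a, mod2_b)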
As expected, the adjusted R-squared has improved by performing a log transformation on this variable (from 0.452 to 0.538). Finally, let’s check our last predictor.
mod3<- lm(house_price_of_unit_area~ number_of_convenience_store, data = real_estate_clean)
summary(mod3)
##
## Call:
## lm(formula = house_price_of_unit_area ~ number_of_convenience_store,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.407 -7.341 -1.788 5.984 87.681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.1811 0.9419 28.86 <2e-16 ***
## number_of_convenience_store 2.6377 0.1868 14.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
## F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
ggplot(real_estate_clean, aes(number_of_convenience_store))+
geom_density(fill = "red", alpha = 0.3)
Now we are satisfied with all of the predictors. Finally, let’s check the distribution of our target variable: house price per unit area.
ggplot(real_estate_clean, aes(house_price_of_unit_area ))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Again, we can likely improve the distribution of this variable by performing a log transformation. After checking it below, we will fit the full model with all of the transformed predictors.
ggplot(real_estate_clean, aes(log(house_price_of_unit_area)))+
geom_histogram(fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
final_model<- lm(log(house_price_of_unit_area) ~ house_age +
I(house_age^2)+
log(distance_to_the_nearest_MRT_station)+
number_of_convenience_store, data = real_estate_clean)
summary(final_model)
##
## Call:
## lm(formula = log(house_price_of_unit_area) ~ house_age + I(house_age^2) +
## log(distance_to_the_nearest_MRT_station) + number_of_convenience_store,
## data = real_estate_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6459 -0.1288 0.0060 0.1328 1.0793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9625164 0.1106896 44.833 < 2e-16
## house_age -0.0198488 0.0041843 -4.744 2.90e-06
## I(house_age^2) 0.0003614 0.0001009 3.580 0.000385
## log(distance_to_the_nearest_MRT_station) -0.2037434 0.0155044 -13.141 < 2e-16
## number_of_convenience_store 0.0239011 0.0055903 4.275 2.38e-05
##
## (Intercept) ***
## house_age ***
## I(house_age^2) ***
## log(distance_to_the_nearest_MRT_station) ***
## number_of_convenience_store ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2413 on 409 degrees of freedom
## Multiple R-squared: 0.6255, Adjusted R-squared: 0.6219
## F-statistic: 170.8 on 4 and 409 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(final_model)
Let’s check all of the regression assumptions and see if we can improve our model. We start with component-plus-residual plots (crPlots) to check that each predictor enters the model linearly.
crPlots(final_model)
The first thing to do when checking the normality of residuals is to look at the Q-Q plot.
If the distribution of our residuals is normal, the points should lie approximately along the blue diagonal reference line.
qqPlot(final_model)
## [1] 114 271
It’s clear that we are close to a normal distribution. However, there are some observations that could cause problems: from the plot, the observations with the largest residuals (in absolute value) are 114 and 271. Let’s take a closer look.
real_estate_clean$resid <- final_model$residuals
which(abs(real_estate_clean$resid)>0.5)
## 20 48 56 89 114 127 149 163 229 252 271 276 286 313 331 345 348
## 20 48 56 89 114 127 149 163 229 252 271 276 286 313 331 345 348
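Before removing anything, note that influential observations are often flagged by Cook’s distance rather than by raw residual size. A minimal sketch using the common 4/n rule of thumb (only a supplementary check; below we keep the residual-based selection):
# Flag observations whose Cook's distance exceeds the conventional 4/n cutoff.
cooks_d <- cooks.distance(final_model)
which(cooks_d > 4 / nrow(real_estate_clean))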
real_estate_clean2 <- real_estate_clean[-c(20,48,56,89,114,127,149,229,252,271,276,286,313,331,345,348),]
final_model2 <- lm(log(house_price_of_unit_area) ~ house_age +
I(house_age^2)+
log(distance_to_the_nearest_MRT_station)+
number_of_convenience_store, data = real_estate_clean2)
summary(final_model2)
##
## Call:
## lm(formula = log(house_price_of_unit_area) ~ house_age + I(house_age^2) +
## log(distance_to_the_nearest_MRT_station) + number_of_convenience_store,
## data = real_estate_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57420 -0.11894 0.00891 0.12995 0.49488
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.028e+00 9.058e-02 55.515 < 2e-16
## house_age -1.809e-02 3.324e-03 -5.441 9.32e-08
## I(house_age^2) 3.014e-04 8.051e-05 3.744 0.000208
## log(distance_to_the_nearest_MRT_station) -2.155e-01 1.262e-02 -17.073 < 2e-16
## number_of_convenience_store 2.442e-02 4.520e-03 5.402 1.14e-07
##
## (Intercept) ***
## house_age ***
## I(house_age^2) ***
## log(distance_to_the_nearest_MRT_station) ***
## number_of_convenience_store ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1903 on 393 degrees of freedom
## Multiple R-squared: 0.7353, Adjusted R-squared: 0.7326
## F-statistic: 273 on 4 and 393 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(final_model2)
qqPlot(final_model2)
## [1] 156 214
Our model has improved, as we can see from the adjusted R-squared (from 0.62 to 0.73).
The normality of the residuals also looks better now.
Finally, to verify that we truly meet the normality assumption, we should run a formal test.
shapiro.test(final_model2$residuals)
##
## Shapiro-Wilk normality test
##
## data: final_model2$residuals
## W = 0.99569, p-value = 0.3462
The Shapiro-Wilk test is a normality test of the null hypothesis that a sample comes from a normally distributed population. Since our p-value is greater than the significance level chosen for this test (alpha = 0.01), we fail to reject the null hypothesis.
Lastly, we should check that the mean of the residuals is equal to 0.
mean(final_model2$residuals)
## [1] 8.176754e-18
It is essentially equal to 0 (for OLS with an intercept the residuals sum to zero by construction), so we can proceed.
By looking at the third diagnostic plot (scale-location), the variance of the residuals seems roughly constant, but it is always better to check with a test.
To test this assumption analytically, we will use the Breusch-Pagan test.
The null hypothesis of this test is that the model is homoscedastic (the residuals are distributed with equal variance).
The alternative hypothesis is that the model is heteroscedastic (the residuals are not distributed with equal variance).
lmtest::bptest(final_model2)
##
## studentized Breusch-Pagan test
##
## data: final_model2
## BP = 12.619, df = 4, p-value = 0.0133
At significance level alpha = 0.01 we fail to reject the null hypothesis of homoscedasticity (p = 0.0133).
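Note that at alpha = 0.05 we would instead reject homoscedasticity. In that case, a common remedy is to report heteroscedasticity-robust (HC) standard errors; a minimal sketch, assuming the sandwich package is installed:
# Coefficient tests for final_model2 with HC3 heteroscedasticity-robust
# standard errors.
lmtest::coeftest(final_model2, vcov. = sandwich::vcovHC(final_model2, type = "HC3"))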
Let’s take a first look at the autocorrelation plot of the residuals. Ideally, we want the correlation (y-axis) to drop to values close to zero right after the first vertical line (lag 0).
acf(final_model2$residuals)
By looking at the graph we should be ok, but let’s check, as always, with a test.
lawstat::runs.test(final_model2$residuals)
##
## Runs Test - Two sided
##
## data: final_model2$residuals
## Standardized Runs Statistic = -0.10038, p-value = 0.92
With a near-perfect p-value, we fail to reject the null hypothesis of randomness. An alternative autocorrelation check is sketched below; after that, we move on to the fifth assumption.
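Another test commonly used for first-order autocorrelation in regression residuals is the Durbin-Watson test; a quick supplementary check:
# Durbin-Watson test: the null hypothesis is no first-order autocorrelation
# in the residuals.
lmtest::dwtest(final_model2)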
Let’s take a look at Pearson’s correlation coefficients between each predictor (as it enters the model) and the residuals.
cor.test(real_estate_clean2$house_age, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: real_estate_clean2$house_age and final_model2$residuals
## t = -9.5775e-16, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -4.812893e-17
cor.test((real_estate_clean2$house_age)^2, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: (real_estate_clean2$house_age)^2 and final_model2$residuals
## t = 6.0981e-16, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## 3.06441e-17
cor.test(log(real_estate_clean2$distance_to_the_nearest_MRT_station), final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: log(real_estate_clean2$distance_to_the_nearest_MRT_station) and final_model2$residuals
## t = -2.9527e-15, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -1.483768e-16
cor.test(real_estate_clean2$number_of_convenience_store, final_model2$residuals)
##
## Pearson's product-moment correlation
##
## data: real_estate_clean2$number_of_convenience_store and final_model2$residuals
## t = -7.5209e-15, df = 396, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09829804 0.09829804
## sample estimates:
## cor
## -3.779387e-16
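For reference, the same four checks can be run in one pass; a minimal sketch, using the transformed predictors as they enter the model:
# Correlation of each (transformed) regressor with the residuals of final_model2.
predictors <- with(real_estate_clean2, list(
  house_age          = house_age,
  house_age_squared  = house_age^2,
  log_distance_MRT   = log(distance_to_the_nearest_MRT_station),
  convenience_stores = number_of_convenience_store
))
sapply(predictors, function(x) cor(x, final_model2$residuals))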
The correlation coefficient is essentially zero for all of the predictors. Next, let’s look at the correlations among the predictors themselves to check for multicollinearity.
real_estate_corr <- real_estate_clean2[,1:3]%>%
mutate(house_age_squared = house_age^2,
log_Distance_Mrt_station = log(distance_to_the_nearest_MRT_station))
real_estate_corr<-real_estate_corr%>%
select(-distance_to_the_nearest_MRT_station)
corrplot::corrplot(cor(real_estate_corr), method = "number")
Let’s check with a formal measure: the Variance Inflation Factor (VIF). To keep it simple, it tells us how much of one regressor is explained by the other regressors; the lower, the better. Generally, we want to exclude variables with a VIF higher than 10.
car::vif(final_model2)
## house_age
## 15.496165
## I(house_age^2)
## 15.482576
## log(distance_to_the_nearest_MRT_station)
## 2.106847
## number_of_convenience_store
## 1.917704
We can see that house_age has a high VIF. In this case it does not matter, since we deliberately introduced its square into the model: house_age is inevitably correlated with house_age^2, which inflates both of their VIFs. So we are fine with all of the variables.
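For intuition, the VIF of a regressor can be reproduced by hand: regress it on the other regressors and compute 1 / (1 - R^2). A minimal sketch for house_age, which should match the value reported by car::vif() above up to rounding:
# Auxiliary regression of house_age on the remaining regressors of final_model2.
aux <- lm(house_age ~ I(house_age^2) +
            log(distance_to_the_nearest_MRT_station) +
            number_of_convenience_store,
          data = real_estate_clean2)
# VIF of house_age = 1 / (1 - R^2) of the auxiliary regression.
1 / (1 - summary(aux)$r.squared)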