For this section, we focus on the Melbourne housing dataset (Melbourne_housing_FULL.csv). The goal is to build a linear regression model that predicts the Price of a house from the available variables. Let us start by investigating the dataset:
setwd("~/Desktop/FALL_2019/OSCR")
housing <- read.csv("Melbourne_housing_FULL.csv", stringsAsFactors = FALSE)
head(housing)
## Suburb Address Rooms Type Price Method SellerG
## 1 Abbotsford 68 Studley St 2 h NA SS Jellis
## 2 Abbotsford 85 Turner St 2 h 1480000 S Biggin
## 3 Abbotsford 25 Bloomburg St 2 h 1035000 S Biggin
## 4 Abbotsford 18/659 Victoria St 3 u NA VB Rounds
## 5 Abbotsford 5 Charles St 3 h 1465000 SP Biggin
## 6 Abbotsford 40 Federation La 3 h 850000 PI Biggin
## Date Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea
## 1 3/09/2016 2.5 3067 2 1 1 126 NA
## 2 3/12/2016 2.5 3067 2 1 1 202 NA
## 3 4/02/2016 2.5 3067 2 1 0 156 79
## 4 4/02/2016 2.5 3067 3 2 1 0 NA
## 5 4/03/2017 2.5 3067 3 2 0 134 150
## 6 4/03/2017 2.5 3067 3 2 1 94 NA
## YearBuilt CouncilArea Lattitude Longtitude Regionname
## 1 NA Yarra City Council -37.8014 144.9958 Northern Metropolitan
## 2 NA Yarra City Council -37.7996 144.9984 Northern Metropolitan
## 3 1900 Yarra City Council -37.8079 144.9934 Northern Metropolitan
## 4 NA Yarra City Council -37.8114 145.0116 Northern Metropolitan
## 5 1900 Yarra City Council -37.8093 144.9944 Northern Metropolitan
## 6 NA Yarra City Council -37.7969 144.9969 Northern Metropolitan
## Propertycount
## 1 4019
## 2 4019
## 3 4019
## 4 4019
## 5 4019
## 6 4019
colnames(housing)
## [1] "Suburb" "Address" "Rooms" "Type"
## [5] "Price" "Method" "SellerG" "Date"
## [9] "Distance" "Postcode" "Bedroom2" "Bathroom"
## [13] "Car" "Landsize" "BuildingArea" "YearBuilt"
## [17] "CouncilArea" "Lattitude" "Longtitude" "Regionname"
## [21] "Propertycount"
sum(is.na(housing))
## [1] 100964
The dataset has 21 variables, 7 of which we treat as categorical. We apply some basic type conversions:
housing$Suburb <- as.factor(housing$Suburb)
housing$Type <- as.factor(housing$Type)
housing$Method <- as.factor(housing$Method)
housing$SellerG <- as.factor(housing$SellerG)
housing$Date <- as.factor(housing$Date)
housing$Postcode <- as.factor(housing$Postcode)
housing$Regionname <- as.factor(housing$Regionname)
housing$Propertycount <- as.numeric(housing$Propertycount)
## Warning: NAs introduced by coercion
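The coercion warning above suggests that Propertycount was read as text and contains entries that are not valid numbers. Here is a minimal sketch of what happens, using a made-up vector. The "#N/A" marker is an assumption for illustration and has not been confirmed from the file:

```r
# Hypothetical stand-in for a column that was read as character
raw_count <- c("4019", "4019", "#N/A", "2185")

# as.numeric() converts what it can and turns the rest into NA (with a warning)
count <- suppressWarnings(as.numeric(raw_count))
print(count)

# Inspect which original entries failed to convert
bad_entries <- unique(raw_count[is.na(count) & !is.na(raw_count)])
print(bad_entries)
```

If such markers are known in advance, they can be declared when the file is read, through the na.strings argument of read.csv.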
The dataset contains many missing values (100,964 in total). For the purposes of this section, we omit the observations with missing values. (Alternatively, the missing values can be imputed with column means or medians; for details, see the post: Missing values with R.)
housing <- na.omit(housing)
housing <- as.data.frame(housing)
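As an alternative to na.omit, the median imputation mentioned above could look roughly like this. The toy data frame and the helper name impute_median are made up for illustration; this is a sketch, not the approach used in the rest of this section:

```r
# Toy data frame standing in for two numeric housing columns (values are made up)
toy <- data.frame(
  Landsize     = c(126, 202, NA, 134, 94),
  BuildingArea = c(NA, 79, 150, NA, 120)
)

# Replace each NA with the median of the observed values in the same column
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}
toy_imputed <- as.data.frame(lapply(toy, impute_median))

print(toy_imputed)
sum(is.na(toy_imputed))
```

Imputation keeps every row, at the cost of shrinking the variance of the imputed columns, so the choice between omitting and imputing depends on how much data is missing.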
Next, we select a set of numerical variables and display their correlation matrix:
library(corrplot)
## corrplot 0.84 loaded
housing_select <- housing[c("Rooms","Price",
"Bedroom2", "Bathroom", "Car", "Landsize","BuildingArea","Lattitude","Longtitude")]
cormat<-round(cor(housing_select, use="pairwise.complete.obs"),4)
cormat
## Rooms Price Bedroom2 Bathroom Car Landsize BuildingArea
## Rooms 1.0000 0.4751 0.9645 0.6241 0.4014 0.1012 0.6067
## Price 0.4751 1.0000 0.4609 0.4635 0.2095 0.0584 0.5073
## Bedroom2 0.9645 0.4609 1.0000 0.6265 0.4056 0.1010 0.5953
## Bathroom 0.6241 0.4635 0.6265 1.0000 0.3110 0.0759 0.5539
## Car 0.4014 0.2095 0.4056 0.3110 1.0000 0.1235 0.3176
## Landsize 0.1012 0.0584 0.1010 0.0759 0.1235 1.0000 0.0832
## BuildingArea 0.6067 0.5073 0.5953 0.5539 0.3176 0.0832 1.0000
## Lattitude 0.0188 -0.2243 0.0227 -0.0419 0.0151 0.0425 -0.0346
## Longtitude 0.0830 0.2122 0.0827 0.1093 0.0356 -0.0082 0.0976
## Lattitude Longtitude
## Rooms 0.0188 0.0830
## Price -0.2243 0.2122
## Bedroom2 0.0227 0.0827
## Bathroom -0.0419 0.1093
## Car 0.0151 0.0356
## Landsize 0.0425 -0.0082
## BuildingArea -0.0346 0.0976
## Lattitude 1.0000 -0.3458
## Longtitude -0.3458 1.0000
corplot.matrix<-corrplot.mixed(cormat, lower="number",upper="pie",number.digits = 4)
We can see that Price is moderately correlated with BuildingArea (0.51), Rooms (0.48), Bathroom (0.46), and Bedroom2 (0.46), and only weakly correlated with Car and Landsize.
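Rather than reading the matrix by eye, the correlations with Price can be ranked directly. The sketch below copies the Price row from the matrix printed above so that it runs on its own; in the live session the same vector could be taken from cormat instead:

```r
# Correlations with Price, copied from the Price row of the printed matrix.
# In the live session: price_cor <- cormat["Price", colnames(cormat) != "Price"]
price_cor <- c(Rooms = 0.4751, Bedroom2 = 0.4609, Bathroom = 0.4635,
               Car = 0.2095, Landsize = 0.0584, BuildingArea = 0.5073,
               Lattitude = -0.2243, Longtitude = 0.2122)

# Sort by absolute value so that strong negative correlations are not overlooked
ranked <- price_cor[order(abs(price_cor), decreasing = TRUE)]
print(ranked)
```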
The aim of linear regression is to model a continuous variable Y as a mathematical function of one or more X variables, so that the fitted model can be used to predict Y when only X is known. With a single predictor, the model is:
\[\begin{equation} Y_j = \beta_0 + \beta_1 x_j + \epsilon_j, \label{eq:simplelinear} \end{equation}\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_j\) is the error term for observation j. With several predictors, each one gets its own coefficient. In R, linear regression models are fit with the lm function. We start by building a model that uses all of the selected numeric predictors:
model1 <- lm(Price~Rooms+Bedroom2+Bathroom+Car+Landsize+BuildingArea+Lattitude+Longtitude, data = housing)
summary(model1)
##
## Call:
## lm(formula = Price ~ Rooms + Bedroom2 + Bathroom + Car + Landsize +
## BuildingArea + Lattitude + Longtitude, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5309749 -288630 -84943 211154 7830436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.239e+08 7.010e+06 -17.671 < 2e-16 ***
## Rooms 1.739e+05 2.273e+04 7.652 2.19e-14 ***
## Bedroom2 -3.592e+04 2.258e+04 -1.591 0.112
## Bathroom 1.618e+05 1.066e+04 15.180 < 2e-16 ***
## Car -9.420e+03 6.474e+03 -1.455 0.146
## Landsize 7.800e+00 5.435e+00 1.435 0.151
## BuildingArea 2.173e+03 8.556e+01 25.401 < 2e-16 ***
## Lattitude -1.356e+06 6.746e+04 -20.103 < 2e-16 ***
## Longtitude 5.013e+05 5.150e+04 9.735 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 538200 on 8878 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.3725
## F-statistic: 660.3 on 8 and 8878 DF, p-value: < 2.2e-16
From the regression output, we can see that R-squared is 0.373, which means that our predictors explain about 37% of the variation in Price. Not bad, considering that we have not included any categorical variables yet!
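To make "variation explained" concrete, R-squared can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the built-in mtcars data as a stand-in, so it runs without the housing file; the variable names are ours:

```r
# mtcars ships with R; it is used here only as a small stand-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - ss_res / ss_tot

# Compare with the value reported by summary()
r2_reported <- summary(fit)$r.squared
all.equal(r2_manual, r2_reported)
```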
However, a few variables are not significant: Bedroom2, Car, and Landsize. This may seem strange for Bedroom2, because the correlation matrix shows that it is about as correlated with Price (0.46) as Rooms is (0.48). The reason is that Rooms and Bedroom2 are very highly correlated with each other (0.96!). Once Rooms is accounted for, Bedroom2 adds almost no new information, so its coefficient is not significant. Note that R does not drop Bedroom2 automatically; we remove it ourselves. We update our model by excluding the insignificant variables:
model2 <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude, data = housing)
summary(model2)
##
## Call:
## lm(formula = Price ~ Rooms + Bathroom + BuildingArea + Lattitude +
## Longtitude, data = housing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5302086 -288192 -85356 212005 7836007
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.238e+08 7.010e+06 -17.654 <2e-16 ***
## Rooms 1.380e+05 8.301e+03 16.623 <2e-16 ***
## Bathroom 1.592e+05 1.058e+04 15.044 <2e-16 ***
## BuildingArea 2.164e+03 8.526e+01 25.378 <2e-16 ***
## Lattitude -1.357e+06 6.739e+04 -20.137 <2e-16 ***
## Longtitude 5.002e+05 5.150e+04 9.712 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 538300 on 8881 degrees of freedom
## Multiple R-squared: 0.3726, Adjusted R-squared: 0.3722
## F-statistic: 1055 on 5 and 8881 DF, p-value: < 2.2e-16
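The collinearity between Rooms and Bedroom2 seen in the correlation matrix can be quantified with a variance inflation factor (VIF), which measures how much a coefficient's variance is inflated by its correlation with the other predictors. With a pairwise correlation of 0.9645, the VIF of each of the two would be at least 1/(1 - 0.9645^2), roughly 14. Consistent with this, the standard error of Rooms drops from 2.273e+04 in model1 to 8.301e+03 in model2. The sketch below uses a hypothetical helper, vif_manual, and the built-in mtcars data as a stand-in:

```r
# Variance inflation factor for each predictor: VIF_j = 1 / (1 - R2_j),
# where R2_j comes from regressing predictor j on the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on mtcars (ships with R); cyl and disp are strongly correlated
vifs <- vif_manual(mtcars, c("cyl", "disp", "hp"))
print(round(vifs, 2))
```

On the housing data, the same helper could be called with the predictor names from model1 to confirm which variables are redundant.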
Great! model2 has fewer predictors, yet R-squared has barely changed (0.3730 to 0.3726). Next, we incorporate some categorical variables:
model3 <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = housing)
summary(model3)$r.squared
## [1] 0.6578964
summary(model3)$adj.r.squared
## [1] 0.6451579
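Adjusted R-squared penalizes the number of estimated coefficients, which matters once factors with many levels enter the model. Here is a minimal sketch of the standard adjustment formula, checked against the model1 and model2 summaries above. The helper name adj_r2 is ours, and the inputs are the rounded R-squared values from the printed output:

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p
# (p excludes the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# model1 has 8 predictors and 8878 residual degrees of freedom, so n = 8878 + 8 + 1
n <- 8878 + 8 + 1
adj_model1 <- adj_r2(0.3730, n, 8)
adj_model2 <- adj_r2(0.3726, n, 5)
print(c(model1 = adj_model1, model2 = adj_model2))
```

Both values agree with the reported 0.3725 and 0.3722 to within the rounding of the inputs.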
We can see that the three added variables dramatically improved our model's performance, raising R-squared from 0.3726 to 0.6579 (adjusted R-squared rises from 0.3722 to 0.6452). We will look at model evaluation in the next post.