For this section, we focus on the Melbourne housing dataset. The goal is to build a linear regression model that predicts the Price of a house from the available variables. Let us start by investigating the dataset.

Part One: Dataset Evaluation

setwd("~/Desktop/FALL_2019/OSCR")
housing <- read.csv("Melbourne_housing_FULL.csv", stringsAsFactors = FALSE)
head(housing)
##       Suburb            Address Rooms Type   Price Method SellerG
## 1 Abbotsford      68 Studley St     2    h      NA     SS  Jellis
## 2 Abbotsford       85 Turner St     2    h 1480000      S  Biggin
## 3 Abbotsford    25 Bloomburg St     2    h 1035000      S  Biggin
## 4 Abbotsford 18/659 Victoria St     3    u      NA     VB  Rounds
## 5 Abbotsford       5 Charles St     3    h 1465000     SP  Biggin
## 6 Abbotsford   40 Federation La     3    h  850000     PI  Biggin
##        Date Distance Postcode Bedroom2 Bathroom Car Landsize BuildingArea
## 1 3/09/2016      2.5     3067        2        1   1      126           NA
## 2 3/12/2016      2.5     3067        2        1   1      202           NA
## 3 4/02/2016      2.5     3067        2        1   0      156           79
## 4 4/02/2016      2.5     3067        3        2   1        0           NA
## 5 4/03/2017      2.5     3067        3        2   0      134          150
## 6 4/03/2017      2.5     3067        3        2   1       94           NA
##   YearBuilt        CouncilArea Lattitude Longtitude            Regionname
## 1        NA Yarra City Council  -37.8014   144.9958 Northern Metropolitan
## 2        NA Yarra City Council  -37.7996   144.9984 Northern Metropolitan
## 3      1900 Yarra City Council  -37.8079   144.9934 Northern Metropolitan
## 4        NA Yarra City Council  -37.8114   145.0116 Northern Metropolitan
## 5      1900 Yarra City Council  -37.8093   144.9944 Northern Metropolitan
## 6        NA Yarra City Council  -37.7969   144.9969 Northern Metropolitan
##   Propertycount
## 1          4019
## 2          4019
## 3          4019
## 4          4019
## 5          4019
## 6          4019
colnames(housing)
##  [1] "Suburb"        "Address"       "Rooms"         "Type"         
##  [5] "Price"         "Method"        "SellerG"       "Date"         
##  [9] "Distance"      "Postcode"      "Bedroom2"      "Bathroom"     
## [13] "Car"           "Landsize"      "BuildingArea"  "YearBuilt"    
## [17] "CouncilArea"   "Lattitude"     "Longtitude"    "Regionname"   
## [21] "Propertycount"
sum(is.na(housing))
## [1] 100964
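
A single total does not show where the missing values sit. As a minimal sketch, assuming a small made-up data frame (toy) in place of the real housing data, the missing values can be counted per column and the share of incomplete rows computed like this:

```r
# Toy data frame standing in for the housing data (values are made up).
toy <- data.frame(
  Price    = c(NA, 1480000, 1035000, NA, 1465000),
  Rooms    = c(2, 2, 2, 3, 3),
  Landsize = c(126, 202, NA, 0, 134)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Share of rows that contain at least one missing value.
share_incomplete <- mean(!complete.cases(toy))

print(na_per_column)
print(share_incomplete)
```

Running the same two calls on housing would show which columns drive the total and how many rows a later na.omit would remove.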

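From the output, we see that the dataset has 21 variables, 7 of which we treat as categorical. We convert those seven to factors and coerce Propertycount to numeric. The coercion warning below indicates that some Propertycount entries are not valid numbers and become NA.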

housing$Suburb <- as.factor(housing$Suburb)
housing$Type <- as.factor(housing$Type)
housing$Method <- as.factor(housing$Method)
housing$SellerG <- as.factor(housing$SellerG)
housing$Date <- as.factor(housing$Date)
housing$Postcode <- as.factor(housing$Postcode)
housing$Regionname <- as.factor(housing$Regionname)
housing$Propertycount <- as.numeric(housing$Propertycount)
## Warning: NAs introduced by coercion

The dataset contains many missing values (more than 100,000 entries). For the purposes of this section, we omit the observations with missing values. Alternatively, missing values can be imputed with column means or medians; for details, see the post: Missing values with R.

housing <- na.omit(housing)
housing <- as.data.frame(housing)
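
For comparison, here is a minimal sketch of the median-imputation alternative mentioned above. It is not what we use in this section; the toy data and the helper impute_median are made up for illustration.

```r
# Toy numeric columns with gaps (values are made up).
toy <- data.frame(
  Landsize     = c(126, 202, NA, 0, 134),
  BuildingArea = c(NA, NA, 79, NA, 150)
)

# Hypothetical helper: replace NAs in a numeric vector with its median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every column and rebuild the data frame.
toy_imputed <- as.data.frame(lapply(toy, impute_median))
print(toy_imputed)
```

Imputation keeps every row, but it shrinks the variance of the imputed columns, so it is a trade-off rather than a free improvement over na.omit.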

Next, we select a subset of the numerical variables and display their correlation matrix.

library(corrplot)
## corrplot 0.84 loaded
housing_select <- housing[c("Rooms","Price", 
                        "Bedroom2", "Bathroom", "Car", "Landsize","BuildingArea","Lattitude","Longtitude")]
cormat<-round(cor(housing_select, use="pairwise.complete.obs"),4)
cormat
##               Rooms   Price Bedroom2 Bathroom    Car Landsize BuildingArea
## Rooms        1.0000  0.4751   0.9645   0.6241 0.4014   0.1012       0.6067
## Price        0.4751  1.0000   0.4609   0.4635 0.2095   0.0584       0.5073
## Bedroom2     0.9645  0.4609   1.0000   0.6265 0.4056   0.1010       0.5953
## Bathroom     0.6241  0.4635   0.6265   1.0000 0.3110   0.0759       0.5539
## Car          0.4014  0.2095   0.4056   0.3110 1.0000   0.1235       0.3176
## Landsize     0.1012  0.0584   0.1010   0.0759 0.1235   1.0000       0.0832
## BuildingArea 0.6067  0.5073   0.5953   0.5539 0.3176   0.0832       1.0000
## Lattitude    0.0188 -0.2243   0.0227  -0.0419 0.0151   0.0425      -0.0346
## Longtitude   0.0830  0.2122   0.0827   0.1093 0.0356  -0.0082       0.0976
##              Lattitude Longtitude
## Rooms           0.0188     0.0830
## Price          -0.2243     0.2122
## Bedroom2        0.0227     0.0827
## Bathroom       -0.0419     0.1093
## Car             0.0151     0.0356
## Landsize        0.0425    -0.0082
## BuildingArea   -0.0346     0.0976
## Lattitude       1.0000    -0.3458
## Longtitude     -0.3458     1.0000
corplot.matrix<-corrplot.mixed(cormat, lower="number",upper="pie",number.digits = 4)

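We can see that Price is most strongly correlated with BuildingArea (0.51), Rooms (0.48), Bathroom (0.46) and Bedroom2 (0.46). These correlations are moderate rather than high. Note also that Rooms and Bedroom2 are almost perfectly correlated with each other (0.96).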
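
Correlations among the predictors themselves are easy to miss in a large matrix. The sketch below, on simulated toy data with an arbitrary cutoff of 0.8 (both are assumptions for illustration), lists each pair of variables whose absolute correlation exceeds the cutoff:

```r
set.seed(1)
n <- 200
rooms    <- sample(1:5, n, replace = TRUE)
# Bedroom count is nearly a copy of the room count, plus small noise.
bedrooms <- rooms + sample(c(-1, 0, 0, 0, 1), n, replace = TRUE)
car      <- sample(0:3, n, replace = TRUE)   # unrelated to the other two
toy <- data.frame(Rooms = rooms, Bedroom2 = bedrooms, Car = car)

cm <- cor(toy)
threshold <- 0.8   # arbitrary cutoff chosen for this sketch

# Use only the upper triangle so that each pair is reported once.
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(high_pairs)
```

Applied to cormat, the same approach would be expected to flag the Rooms and Bedroom2 pair.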

Part Two: Model Building

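The aim of linear regression is to model a continuous variable Y as a linear function of one or more X variables, so that we can use the model to predict Y when only X is known. For a single predictor, the model is written as follows, where \(\beta_0\) is the intercept, \(\beta_1\) is the slope and \(\epsilon_j\) is the error term for observation j. With several predictors, each one gets its own coefficient.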

\[\begin{equation} Y_j = \beta_0 + \beta_1 x_j + \epsilon_j, \label{eq:simplelinear} \end{equation}\]
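
To make the equation concrete, the following sketch simulates data from a known line and recovers the two coefficients with R's lm function. The values \(\beta_0 = 2\) and \(\beta_1 = 3\), the sample size and the noise level are arbitrary choices for illustration.

```r
set.seed(42)
n   <- 500
x   <- runif(n, 0, 10)
eps <- rnorm(n, sd = 1)        # the error term epsilon_j
y   <- 2 + 3 * x + eps         # beta_0 = 2, beta_1 = 3 (illustrative values)

# Fit Y = beta_0 + beta_1 * x by ordinary least squares.
fit <- lm(y ~ x)
est <- coef(fit)
print(est)
```

The estimates should land close to 2 and 3, with the gap shrinking as n grows or the noise gets smaller.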

In R, a linear regression model is fitted with the command “lm”. We start by building a linear regression model using the numeric predictors selected above:

model1 <- lm(Price~Rooms+Bedroom2+Bathroom+Car+Landsize+BuildingArea+Lattitude+Longtitude, data = housing)
summary(model1)
## 
## Call:
## lm(formula = Price ~ Rooms + Bedroom2 + Bathroom + Car + Landsize + 
##     BuildingArea + Lattitude + Longtitude, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5309749  -288630   -84943   211154  7830436 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.239e+08  7.010e+06 -17.671  < 2e-16 ***
## Rooms         1.739e+05  2.273e+04   7.652 2.19e-14 ***
## Bedroom2     -3.592e+04  2.258e+04  -1.591    0.112    
## Bathroom      1.618e+05  1.066e+04  15.180  < 2e-16 ***
## Car          -9.420e+03  6.474e+03  -1.455    0.146    
## Landsize      7.800e+00  5.435e+00   1.435    0.151    
## BuildingArea  2.173e+03  8.556e+01  25.401  < 2e-16 ***
## Lattitude    -1.356e+06  6.746e+04 -20.103  < 2e-16 ***
## Longtitude    5.013e+05  5.150e+04   9.735  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 538200 on 8878 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.3725 
## F-statistic: 660.3 on 8 and 8878 DF,  p-value: < 2.2e-16

From the regression output, we can see that the R-squared is 0.373, which means that our predictors explain about 37% of the variation in Price. Not bad considering that we have not included any categorical variables yet!

However, a few variables are not significant: Bedroom2, Car and Landsize. This may seem strange, because the correlation matrix showed that Bedroom2 is correlated with Price (0.46). Why is this? Rooms and Bedroom2 are themselves highly correlated (0.96!). Once Rooms is in the model, Bedroom2 carries almost no additional information, so its coefficient is estimated imprecisely and is not significant. We update our model by excluding the insignificant variables.

model2 <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude, data = housing)
summary(model2)
## 
## Call:
## lm(formula = Price ~ Rooms + Bathroom + BuildingArea + Lattitude + 
##     Longtitude, data = housing)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5302086  -288192   -85356   212005  7836007 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.238e+08  7.010e+06 -17.654   <2e-16 ***
## Rooms         1.380e+05  8.301e+03  16.623   <2e-16 ***
## Bathroom      1.592e+05  1.058e+04  15.044   <2e-16 ***
## BuildingArea  2.164e+03  8.526e+01  25.378   <2e-16 ***
## Lattitude    -1.357e+06  6.739e+04 -20.137   <2e-16 ***
## Longtitude    5.002e+05  5.150e+04   9.712   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 538300 on 8881 degrees of freedom
## Multiple R-squared:  0.3726, Adjusted R-squared:  0.3722 
## F-statistic:  1055 on 5 and 8881 DF,  p-value: < 2.2e-16
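
One way to check that dropping predictors costs little is a partial F-test between the nested models. The sketch below uses simulated toy data in which one predictor (noise_var) has no real effect. All names and values are assumptions for illustration.

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
noise_var <- rnorm(n)                      # no real effect on y
y  <- 1 + 2 * x1 - 1.5 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2, noise_var)

full    <- lm(y ~ x1 + x2 + noise_var, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)

# Partial F-test: does the extra term in `full` reduce the residual
# sum of squares by more than chance alone would?
cmp <- anova(reduced, full)
print(cmp)

# R-squared lost by dropping the extra predictor.
r2_drop <- summary(full)$r.squared - summary(reduced)$r.squared
print(r2_drop)
```

On our models the corresponding call would be anova(model2, model1). This works because both were fitted to the same rows after na.omit.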

Great! Now we have fewer predictors, and the R-squared has barely changed (0.3730 versus 0.3726). Next, we incorporate some categorical variables.

model3 <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = housing)
summary(model3)$r.squared
## [1] 0.6578964
summary(model3)$adj.r.squared
## [1] 0.6451579

We can see that the three added variables dramatically improved our model’s performance: the R-squared rose from 0.3726 to 0.6579 (adjusted R-squared: 0.6452). We will look at model evaluation in the next post.
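
It helps to see what lm does with a factor predictor: a factor with k levels is expanded into k - 1 indicator (dummy) columns, each with its own coefficient. This is why the adjusted R-squared, which penalizes extra coefficients, is worth reporting next to the plain R-squared. A minimal sketch on simulated toy data follows; the region names and effect sizes are made up.

```r
set.seed(3)
n <- 120
region <- factor(sample(c("East", "North", "South", "West"), n, replace = TRUE))
area   <- runif(n, 50, 250)
region_effect <- c(East = 0, North = 100, South = -50, West = 200)
price  <- 300 + 2 * area +
  unname(region_effect[as.character(region)]) + rnorm(n, sd = 40)
toy <- data.frame(price, area, region)

# Design matrix: intercept, area, and one dummy column for each
# non-reference level of region.
X <- model.matrix(price ~ area + region, data = toy)
print(colnames(X))

fit_num <- lm(price ~ area, data = toy)            # numeric predictor only
fit_cat <- lm(price ~ area + region, data = toy)   # plus the factor

print(c(r2_num = summary(fit_num)$r.squared,
        r2_cat = summary(fit_cat)$r.squared,
        adj_r2_cat = summary(fit_cat)$adj.r.squared))
```

With factors that have many levels, such as Suburb or Postcode, the number of dummy columns grows quickly, so the gap between R-squared and adjusted R-squared is a useful sanity check.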