Previously, we built a linear model using 8 variables (3 categorical and 5 numeric). In this section, we investigate the validity of that model.
setwd("~/Desktop/FALL_2019/OSCR")
housing <- read.csv("Melbourne_housing_FULL.csv", stringsAsFactors = FALSE)
housing$Suburb <- as.factor(housing$Suburb)
housing$Type <- as.factor(housing$Type)
housing$Method <- as.factor(housing$Method)
housing$SellerG <- as.factor(housing$SellerG)
housing$Date <- as.factor(housing$Date)
housing$Postcode <- as.factor(housing$Postcode)
housing$Regionname <- as.factor(housing$Regionname)
housing$Propertycount <- as.numeric(housing$Propertycount)
## Warning: NAs introduced by coercion
housing <- na.omit(housing)
model <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = housing)
We check the residuals using the command “plot”, which draws the standard diagnostic plots for a fitted linear model.
plot(model)
## Warning: not plotting observations with leverage one:
## 4618, 5140, 5493, 5539, 5641, 5886, 6192, 6371, 6674, 6773, 7564, 7642, 7813, 8294, 8603, 8689, 8775, 8838
## Warning: not plotting observations with leverage one:
## 4618, 5140, 5493, 5539, 5641, 5886, 6192, 6371, 6674, 6773, 7564, 7642, 7813, 8294, 8603, 8689, 8775, 8838
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
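The "leverage one" warnings can be reproduced on a small scale. One common cause, which we assume but have not verified for the housing data, is a factor level that occurs only once: the model then fits that observation exactly. The sketch below uses simulated data, and the names toy, g, fit, h and cd are illustrative only.

```r
set.seed(42)

# Simulated stand-in data (not the housing set): level "c" occurs exactly once
toy <- data.frame(
  g = factor(c(rep("a", 10), rep("b", 10), "c")),
  x = rnorm(21)
)
toy$y <- 1 + 0.5 * toy$x + rnorm(21)

fit <- lm(y ~ x + g, data = toy)

h  <- hatvalues(fit)        # leverage of each observation
cd <- cooks.distance(fit)   # Cook's distance of each observation

# The lone "c" observation (row 21) gets its own coefficient, so it is fitted
# exactly: its residual is zero and its leverage is one (up to rounding).
print(h[21])
print(residuals(fit)[21])

# Cook's distance divides by (1 - leverage), so for this row it is typically
# reported as NaN rather than as a usable number.
print(cd[21])
```

Such points cannot be shown in the residuals vs. leverage plot, which is why plot skips them and warns.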
From the residual plots, we can see that some of the model assumptions are violated, specifically the normality of the errors and the constant variance of the errors. We attempt to remove the bad leverage points, identified as observations whose Cook’s distance exceeds 6/n (where n is the number of rows), and rebuild our model.
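cooksd <- cooks.distance(model)
# Flag rows by position with a logical mask. After na.omit() the row names no
# longer match row positions, so indexing by as.numeric(names(cooksd)) would
# drop the wrong rows. NaN distances (leverage-one points) are not flagged.
influential <- !is.na(cooksd) & cooksd > 6/nrow(housing)
new_housing <- housing[!influential, ]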
model_fixed <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = new_housing)
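The same filter-and-refit step can be tried on simulated data to see what it does. This is a minimal sketch, not the housing analysis: the data are made up, the names toy, fit, cd, flag, toy_clean and fit_clean are illustrative, and the 6/n cutoff is simply carried over from above. Because na.omit keeps the original row names, the sketch selects rows with a logical mask rather than by row name.

```r
set.seed(1)

# Simulated stand-in data with one missing value and one planted outlier
toy <- data.frame(x = c(rnorm(50), NA, 8))
toy$y <- 2 * toy$x + rnorm(52)
toy$y[52] <- -40            # gross outlier at a high-leverage x value
toy <- na.omit(toy)         # drops row 51; row names are now 1..50 and 52

fit <- lm(y ~ x, data = toy)
cd  <- cooks.distance(fit)

# names(cd) are row names, not positions: the 51st element is labelled "52"
print(names(cd)[51])

# Logical mask over the rows of toy; NaN distances count as "not flagged"
flag <- !is.na(cd) & cd > 6 / nrow(toy)

toy_clean <- toy[!flag, ]
fit_clean <- lm(y ~ x, data = toy_clean)

# Compare the slope estimate before and after removing the flagged rows
print(coef(fit)["x"])
print(coef(fit_clean)["x"])
```

Rerunning plot on the refitted model is a reasonable next step, since removing influential points does not by itself guarantee that the normality and constant-variance assumptions now hold.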