Previously, we built a linear model using eight predictors (three categorical and five numeric). In this section, we investigate the validity of that model.

setwd("~/Desktop/FALL_2019/OSCR")
housing <- read.csv("Melbourne_housing_FULL.csv", stringsAsFactors = FALSE)
housing$Suburb <- as.factor(housing$Suburb)
housing$Type <- as.factor(housing$Type)
housing$Method <- as.factor(housing$Method)
housing$SellerG <- as.factor(housing$SellerG)
housing$Date <- as.factor(housing$Date)
housing$Postcode <- as.factor(housing$Postcode)
housing$Regionname <- as.factor(housing$Regionname)
housing$Propertycount <- as.numeric(housing$Propertycount)
## Warning: NAs introduced by coercion
housing <- na.omit(housing)
model <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = housing)

Part One: Checking the Residuals

We check the residuals by calling plot() on the fitted model, which draws R's standard diagnostic plots for an lm object.

plot(model)
## Warning: not plotting observations with leverage one:
##   4618, 5140, 5493, 5539, 5641, 5886, 6192, 6371, 6674, 6773, 7564, 7642, 7813, 8294, 8603, 8689, 8775, 8838

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
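
The leverage-one warnings deserve a brief look. A plausible cause, assuming some factor levels (for example a Suburb) occur only once after na.omit, is that the dummy coefficient for such a level fits its single row exactly, so the hat value is 1 and Cook's distance is not finite. The sketch below uses invented toy data, not the housing data, and all names in it are illustrative.

```r
# Toy data (hypothetical): level "c" of g occurs exactly once.
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 9.0),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "a", "a", "b", "b", "c"))
)
fit <- lm(y ~ x + g, data = toy)

h <- hatvalues(fit)                         # diagonal of the hat matrix
d <- suppressWarnings(cooks.distance(fit))  # Cook's distance per row

# Factor levels that appear only once in the data
singletons <- names(which(table(toy$g) == 1))

print(round(h, 3))   # inspect the leverage of the singleton row
print(d)             # inspect Cook's distance of the singleton row
print(singletons)
```

Running the same table() check on Suburb and Postcode in the housing data would show whether singleton levels explain the rows listed in the warning.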

The residual plots show that some of the model assumptions are violated, specifically normality of the errors and constant error variance. We therefore attempt to remove the bad leverage points, identified by Cook's distance, and rebuild the model.
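
The visual impression from the plots can be backed up with numeric checks. The sketch below is a minimal illustration on simulated data (not the housing data) whose error spread grows with the predictor. It applies shapiro.test for normality and a hand-rolled Breusch-Pagan-style statistic for constant variance; the variable names and the simulated model are assumptions made for the example.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)
# Simulated response: the error standard deviation grows with x
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x^1.5)
fit <- lm(y ~ x)
res <- residuals(fit)

# Normality: Shapiro-Wilk test on the residuals
# (shapiro.test accepts 3 to 5000 values, so larger models need a sample)
sw <- shapiro.test(res)

# Constant variance: regress squared residuals on the fitted values;
# n * R^2 is approximately chi-squared with 1 df under homoscedasticity
aux <- lm(res^2 ~ fitted(fit))
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

cat("Shapiro-Wilk p-value:", sw$p.value, "\n")
cat("Breusch-Pagan-style p-value:", bp_p, "\n")
```

Small p-values point to a violation of the corresponding assumption. For the housing model, the residuals would have to be subsampled before calling shapiro.test if there are more than 5000 of them.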

Part Two: Fixing the Model by Removing Influential Points

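cooksd <- cooks.distance(model)
# Flag rows with a logical mask. After na.omit the row names of housing are
# no longer 1..n, so names(cooksd) must not be used as row positions.
# Rows with NaN distance (leverage one) are treated as not flagged.
influential <- !is.na(cooksd) & cooksd > (6/nrow(housing))
new_housing <- housing[!influential,]
model_fixed <- lm(Price~Rooms+Bathroom+BuildingArea+Lattitude+Longtitude+Suburb+Postcode+Regionname, data = new_housing)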
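
When dropping rows flagged by Cook's distance, note that a data frame keeps its original row names after na.omit, so names(cooksd) are labels rather than positions. The toy example below (invented data, illustrative names) contrasts removal by logical mask with removal that treats the labels as positions, using the same 6/n cutoff as above.

```r
# Toy data (hypothetical) with one missing value so that na.omit drops a row
toy <- data.frame(x = c(1, 2, NA, 4, 5, 6, 7, 8),
                  y = c(1.1, 2.0, 3.0, 4.1, 5.0, 6.2, 6.9, 30))
toy <- na.omit(toy)
print(rownames(toy))   # original labels survive, with "3" missing

fit <- lm(y ~ x, data = toy)
cooksd <- cooks.distance(fit)
flag <- !is.na(cooksd) & cooksd > 6 / nrow(toy)

# Safe: a logical mask is aligned with the rows of toy by construction
by_mask <- toy[!flag, ]

# Risky: treating row labels as positions can drop the wrong row or none
labels_as_pos <- as.numeric(names(cooksd)[flag])
by_position <- toy[-labels_as_pos, ]

print(names(cooksd)[flag])   # label(s) of the flagged row(s)
print(nrow(by_mask))
print(nrow(by_position))
```

Comparing the two row counts, and checking which row names remain, shows whether the flagged observation was actually removed.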