Model Selection using R

Previously we built a linear model using 8 variables (3 categorical and 5 numeric). In this section, we investigate the validity of that model.

Part One: Checking the Residuals

We check the residuals using the plot command:

plot(model)

## Warning: not plotting observations with leverage one:
##   4618, 5140, 5493, 5539, 5641, 5886, 6192, 6371, 6674, 6773, 7564, 7642, 7813, 8294, 8603, 8689, 8775, 8838

## Warning: not plotting observations with leverage one:
##   4618, 5140, 5493, 5539, 5641, 5886, 6192, 6371, 6674, 6773, 7564, 7642, 7813, 8294, 8603, 8689, 8775, 8838

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
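The warnings above report observations whose leverage (hat value) equals one. One common cause is a factor level that occurs only once, for example a suburb with a single sale, because the model then fits that observation exactly. That this is the cause here is an assumption; the output alone does not confirm it. A minimal sketch on hypothetical toy data (the names toy, toy_fit, h and lev_one are illustrative and not part of the original analysis):

```r
set.seed(1)
# Hypothetical toy data: level "c" of the factor g occurs exactly once
toy <- data.frame(
  x = rnorm(9),
  g = factor(c(rep("a", 4), rep("b", 4), "c"))
)
toy$y <- 2 * toy$x + rnorm(9)

toy_fit <- lm(y ~ x + g, data = toy)

# Leverage (diagonal of the hat matrix) for each observation
h <- hatvalues(toy_fit)

# Observations whose leverage is numerically one
lev_one <- which(h > 1 - 1e-8)
names(lev_one)
```

The same check, which(hatvalues(model) > 1 - 1e-8), could be run on the housing model to list the affected rows. Whether they match the labels in the warning has not been verified here.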

The first plot (residuals vs. fitted values) suggests that the linearity assumption is not violated, as the plot shows a near-linear trend.
In the second plot (normal Q-Q), the normality of the errors is violated: the residuals depart from the straight line toward the end of the graph.
In the third plot (scale-location), the constant variance assumption is violated, because the standardized residuals show a downward and then upward sloping trend.
The fourth plot (residuals vs. leverage) shows that there are 3 bad leverage points (labeled 22633, 14427 and 20140). We remove them.
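The four readings above correspond to the default panels that plot() draws for an lm object. As a minimal sketch, the panels can also be drawn one at a time with the which argument, and the quantities behind them can be extracted for closer inspection. The built-in mtcars data and the model fit are stand-ins chosen for illustration, not the housing model.

```r
# Stand-in model on the built-in mtcars data (an assumption for illustration;
# this is not the housing model from the text)
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot(fit, which = 1)  # residuals vs fitted: linearity
plot(fit, which = 2)  # normal Q-Q: normality of errors
plot(fit, which = 3)  # scale-location: constant variance
plot(fit, which = 5)  # residuals vs leverage, with Cook's distance contours

# Quantities behind the panels
std_res  <- rstandard(fit)   # standardized residuals (panels 2, 3 and 5)
leverage <- hatvalues(fit)   # leverage (panel 5)

# Observations with the largest absolute standardized residuals
head(sort(abs(std_res), decreasing = TRUE), 3)
```

Drawing one panel at a time makes it easier to read point labels on a large data set, where the combined display tends to be crowded.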

From the residual plots, we can see that some of the assumptions are violated, specifically the normality of the errors and constant variance. We attempt to remove the bad leverage points based on Cook’s distance and rebuild our model.

Part Two: Fixing the model by removing some influential points

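# Cook's distance for every observation used in the fit
cooksd <- cooks.distance(model)
# Flag observations above the 6/n cutoff; which() skips NA/NaN distances
influential <- names(cooksd)[which(cooksd > 6 / nrow(housing))]
# Drop flagged rows by row name, since the names are labels and not positions
new_housing <- housing[!(rownames(housing) %in% influential), ]
# Refit the same model on the reduced data
model_fixed <- lm(Price ~ Rooms + Bathroom + BuildingArea + Lattitude + Longtitude + Suburb + Postcode + Regionname, data = new_housing)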
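The names returned by cooks.distance() are row labels, which need not equal row positions once a data frame has been subset or filtered. The sketch below, on hypothetical toy data with one planted outlier, applies the same 6/n rule of thumb, removes the flagged rows by name, and compares the residual standard error before and after refitting. All object names (toy, fit, cd, flagged, cleaned, refit) are illustrative assumptions.

```r
set.seed(42)
# Hypothetical toy data with one planted outlier; not the housing data
toy <- data.frame(x = 1:30)
toy$y <- 3 + 2 * toy$x + rnorm(30)
toy$y[30] <- 200                 # planted influential point
rownames(toy) <- 101:130         # labels that differ from row positions

fit <- lm(y ~ x, data = toy)
cd <- cooks.distance(fit)
cutoff <- 6 / nrow(toy)          # same 6/n rule of thumb as in the text
flagged <- names(cd)[which(cd > cutoff)]

# Match on row names, so a label such as "130" is not misread as a position
cleaned <- toy[!(rownames(toy) %in% flagged), ]
refit <- lm(y ~ x, data = cleaned)

# Residual standard error before and after removing the flagged rows
c(before = sigma(fit), after = sigma(refit))
```

After refitting, it is worth running plot() on the new model again, since removing points changes the leverage and Cook's distance of the remaining observations.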

Huizi Yu

10/22/2019
