Home buyers and real estate dealers would both benefit from house price predictions based on the characteristics of a house and its immediate environment. In this report we attempt to predict house prices from several factors, such as the living area in square feet.
The data for house price is originally from here. The description of the columns is as follows.
id: Notation for a house
date: Date house was sold
price: Price of property
bedrooms: Number of bedrooms
sqft_living: Total living area (in square feet)
sqft_lot: Lot area (in square feet)
floors: Number of building floors
waterfront: Does the house have a waterfront? (1 = yes, 0 = no)
view: Has been viewed
condition: Overall condition of the house. 1 indicates a worn-out property and 5 an excellent one.
grade: Overall grade given to the housing unit, based on the King County grading system. 1 is poor, 13 is excellent.
yr_built: Year of construction
yr_renovated: Year of renovation
zipcode: Zip code of house address
lat: Latitude of house address
long: Longitude of house address
sqft_living15: Living room area in 2015
sqft_lot15: Lot area in 2015

library(tidyverse) # read_csv(), glimpse(), dplyr verbs, ggplot2
library(GGally)    # ggcorr()
library(car)       # qqPlot(), vif()
library(lubridate) # mdy()
house <- read_csv("data_input/kc_house_data.csv")
glimpse(house)
#> Rows: 21,597
#> Columns: 21
#> $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
#> $ date <chr> "10/13/2014", "12/9/2014", "2/25/2015", "12/9/2014", "2/…
#> $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
#> $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
#> $ bathrooms <dbl> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, 2, 1, 1,…
#> $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
#> $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
#> $ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
#> $ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
#> $ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
#> $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
#> $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
#> $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
#> $ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ zipcode <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
#> $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
#> $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
#> $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
#> $ sqft_lot15 <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
The data set has 21,597 observations and 21 columns.
The target variable of our model will be the price, while the predictors will be the rest of the columns, after some data wrangling.
We can drop the id column, as it is useful for neither EDA nor prediction. We will also remove the zipcode, lat and long columns to make the predictions independent of area or region. We also need to convert the date column to the Date type. In addition, some columns that appear to be numeric are really categorical and need to be changed into factors.
house1 <- house %>% mutate(date = mdy(date)) %>% select(-c(id,lat,long,zipcode))
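The date conversion above relies on lubridate's mdy(). As a minimal sketch of what that step does, the same month/day/year strings can also be parsed in base R with as.Date(). The sample strings are the first few values from the glimpse() output, and the variable names are only illustrative.

```r
# Sample values copied from the glimpse() output above (illustrative only)
raw_dates <- c("10/13/2014", "12/9/2014", "2/25/2015")

# Base-R counterpart of lubridate::mdy(): month/day/four-digit year
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# A Date is stored as the number of days since 1970-01-01, which is the
# numeric value lm() works with when date is used as a predictor
print(as.numeric(parsed))
```

Because of this day-based storage, a regression coefficient on date is expressed per day.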
# check for any NA's in the table
colSums(is.na(house1))
#> date price bedrooms bathrooms sqft_living
#> 0 0 0 0 0
#> sqft_lot floors waterfront view condition
#> 0 0 0 0 0
#> grade sqft_above sqft_basement yr_built yr_renovated
#> 0 0 0 0 0
#> sqft_living15 sqft_lot15
#> 0 0
There are no NA values in any of the columns. Next, let us check the number of unique values in each column.
house1 %>% summarise_all(n_distinct)
As said above, some columns have very few unique values and are more suitable as factors. These columns are bedrooms, floors, waterfront, view, condition, and grade. Also, yr_built and yr_renovated are better treated as dates than as integers. The zipcode, lat, and long columns were already removed above because we want the prediction to be general and not specific to a location or region.
# check the unique values
unique(house1$bedrooms)
#> [1] 3 2 4 5 1 6 7 8 9 11 10 33
unique(house1$bathrooms)
#> [1] 1 2 3 4 0 5 6 8 7
unique(house1$floors)
#> [1] 1.0 2.0 1.5 3.0 2.5 3.5
unique(house1$waterfront)
#> [1] 0 1
unique(house1$view)
#> [1] 0 3 4 2 1
unique(house1$condition)
#> [1] 3 5 4 1 2
unique(house1$grade)
#> [1] 7 6 8 11 9 5 10 12 4 3 13
Let us check the occurrences of the unique values in the columns with fewer than 100 unique values.
table(house1$bedrooms)
#>
#> 1 2 3 4 5 6 7 8 9 10 11 33
#> 196 2760 9824 6882 1601 272 38 13 6 3 1 1
table(house1$bathrooms)
#>
#> 0 1 2 3 4 5 6 7 8
#> 75 8353 10539 2228 338 48 12 2 2
table(house1$floors)
#>
#> 1 1.5 2 2.5 3 3.5
#> 10673 1910 8235 161 611 7
table(house1$waterfront)
#>
#> 0 1
#> 21434 163
table(house1$view)
#>
#> 0 1 2 3 4
#> 19475 332 961 510 319
table(house1$condition)
#>
#> 1 2 3 4 5
#> 29 170 14020 5677 1701
table(house1$grade)
#>
#> 3 4 5 6 7 8 9 10 11 12 13
#> 1 27 242 2038 8974 6065 2615 1134 399 89 13
The result shows that some values have only 3 or fewer observations in the data frame. Now let us remove the observations whose bathrooms, bedrooms, or grade value occurs 3 times or fewer. Note that each of the three steps below starts again from house1, so only the last one (grade) carries through to house2; this is why the output further down has 21,596 rows and still lists bedrooms and bathrooms as <dbl>.
# filtering bathrooms
bathrooms_f <- house1 %>% count(bathrooms) %>% filter(n>3)
house2 <- house1[house1$bathrooms %in% bathrooms_f$bathrooms,] %>% mutate(bathrooms = factor(bathrooms,ordered = TRUE))
# filtering bedrooms
bedrooms_f <- house1 %>% count(bedrooms) %>% filter(n>3)
house2 <- house1[house1$bedrooms %in% bedrooms_f$bedrooms,] %>% mutate(bedrooms = factor(bedrooms,ordered = TRUE))
# filtering grade
grade_f <- house1 %>% count(grade) %>% filter(n>3)
house2 <- house1[house1$grade %in% grade_f$grade,] %>% mutate(grade = factor(grade,ordered = TRUE))
# converting the remaining four columns to factors (waterfront is unordered)
house2 <- house2 %>%
mutate(
floors = factor(floors, ordered = TRUE),
waterfront = factor(waterfront, ordered = FALSE),
view = factor(view, ordered = TRUE),
condition = factor(condition, ordered = TRUE),
)
glimpse(house2)
#> Rows: 21,596
#> Columns: 17
#> $ date <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
#> $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
#> $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
#> $ bathrooms <dbl> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, 2, 1, 1,…
#> $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
#> $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
#> $ floors <ord> 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1.5, 1, 1.5, 2, 2, 1…
#> $ waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ view <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
#> $ condition <ord> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
#> $ grade <ord> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
#> $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
#> $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
#> $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
#> $ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
#> $ sqft_lot15 <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
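In the filtering code above, every step starts again from house1, so only the grade step ends up in house2. If the intent is to apply all three filters, the steps would need to be chained. Below is a minimal base-R sketch of that idea on a made-up data frame; toy, drop_rare() and its min_n argument are hypothetical names, and applying this to the real data would change the row counts and model outputs reported in the rest of this report.

```r
# Toy data (made up): the values 33 and 8 each occur only once
toy <- data.frame(
  bedrooms  = c(3, 3, 3, 3, 2, 2, 2, 2, 33, 3),
  bathrooms = c(1, 1, 1, 1, 2, 2, 2, 2, 1, 8)
)

# Hypothetical helper: keep only rows whose value in `col` occurs more than
# `min_n` times, then turn that column into an ordered factor
drop_rare <- function(df, col, min_n = 3) {
  counts <- table(df[[col]])
  keep <- names(counts)[counts > min_n]
  df <- df[as.character(df[[col]]) %in% keep, , drop = FALSE]
  df[[col]] <- factor(df[[col]], ordered = TRUE)
  df
}

# Each step takes the result of the previous one, so both filters persist
toy2 <- drop_rare(toy, "bedrooms")
toy2 <- drop_rare(toy2, "bathrooms")

print(nrow(toy2))
print(sapply(toy2, is.ordered))
```

The design choice is simply to pass the output of one filter into the next instead of returning to the unfiltered table each time.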
Let us also look at the proportion of each yr_renovated value.
round(prop.table(table(house1$yr_renovated))*100,2 )
#>
#> 0 1934 1940 1944 1945 1946 1948 1950 1951 1953 1954 1955 1956
#> 95.77 0.00 0.01 0.00 0.01 0.01 0.00 0.01 0.00 0.01 0.00 0.01 0.01
#> 1957 1958 1959 1960 1962 1963 1964 1965 1967 1968 1969 1970 1971
#> 0.01 0.02 0.00 0.02 0.01 0.02 0.02 0.02 0.01 0.04 0.02 0.04 0.01
#> 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
#> 0.02 0.02 0.01 0.03 0.01 0.04 0.03 0.05 0.05 0.02 0.05 0.08 0.08
#> 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997
#> 0.08 0.08 0.08 0.07 0.10 0.12 0.09 0.08 0.09 0.09 0.07 0.07 0.07
#> 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
#> 0.09 0.08 0.16 0.09 0.10 0.17 0.12 0.16 0.11 0.16 0.08 0.10 0.08
#> 2011 2012 2013 2014 2015
#> 0.06 0.05 0.17 0.42 0.07
95.77% of the houses have 0 as the yr_renovated value, which means they were never renovated. Since this is equivalent to about 96% of the values being NA, this column is a candidate for dropping, but for now let’s keep it.
What about the correlation between predictors and target, as well as between predictors?
ggcorr(house2, label=TRUE)
The correlation heatmap shows that factors such as bedrooms and sqft_lot are weakly correlated with price, while factors such as sqft_living and sqft_above are strongly correlated.
I want to check whether any column is an alias of (that is, an exact linear combination of) other columns, using the alias() function.
model <- lm(price ~. , data=house2)
alias(model)
#> Model :
#> price ~ date + bedrooms + bathrooms + sqft_living + sqft_lot +
#> floors + waterfront + view + condition + grade + sqft_above +
#> sqft_basement + yr_built + yr_renovated + sqft_living15 +
#> sqft_lot15
#>
#> Complete :
#> (Intercept) date bedrooms bathrooms sqft_living sqft_lot floors.L
#> sqft_basement 0 0 0 0 1 0 0
#> floors.Q floors.C floors^4 floors^5 waterfront1 view.L view.Q
#> sqft_basement 0 0 0 0 0 0 0
#> view.C view^4 condition.L condition.Q condition.C condition^4
#> sqft_basement 0 0 0 0 0 0
#> grade.L grade.Q grade.C grade^4 grade^5 grade^6 grade^7 grade^8
#> sqft_basement 0 0 0 0 0 0 0 0
#> grade^9 sqft_above yr_built yr_renovated sqft_living15 sqft_lot15
#> sqft_basement 0 -1 0 0 0 0
Checking the alias shows that sqft_basement is an exact linear combination of two other predictors (sqft_basement = sqft_living - sqft_above), so we will remove sqft_basement.
house2 <- house2 %>% select(-sqft_basement)
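To illustrate what alias() reports, here is a minimal sketch on made-up data in which one column is an exact linear combination of two others, mirroring the relation between sqft_basement, sqft_living and sqft_above. The column names living, above and basement and all simulated numbers are assumptions for illustration only.

```r
set.seed(1)
n <- 50

# Made-up areas; living is constructed as above + basement
above    <- round(runif(n, 800, 3000))
basement <- round(runif(n, 0, 1000))
living   <- above + basement
price    <- 100 * living + rnorm(n, sd = 5000)

toy <- data.frame(price, living, above, basement)
fit <- lm(price ~ living + above + basement, data = toy)

# lm() cannot estimate the redundant column, so its coefficient is NA
print(coef(fit))

# alias() expresses the redundant column in terms of the others:
# the basement row should show 1 for living and -1 for above
print(alias(fit)$Complete)
```

Dropping any one of the three columns removes the redundancy; the report drops the basement column.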
# split the dataset into testing and training sets
library(rsample)
set.seed(313)
init <- initial_split(data=house2, prop=0.8, strata = "price")
house2_train <- training(init)
house2_test <- testing(init)
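Passing strata = "price" to initial_split() asks for sampling within price bins, so that the train and test sets have similar price distributions. The following is a rough base-R sketch of that idea on simulated prices. It is not rsample's implementation, and the quartile binning and all names used here are assumptions chosen for illustration.

```r
set.seed(313)
n <- 1000

# Simulated, right-skewed prices (made up for illustration)
toy <- data.frame(price = rlnorm(n, meanlog = 13, sdlog = 0.5))

# Bin prices into quartiles so each bin holds a quarter of the rows
toy$bin <- cut(toy$price,
               breaks = quantile(toy$price, probs = seq(0, 1, 0.25)),
               include.lowest = TRUE)

# Draw 80% of the row indices from every bin
train_idx <- unlist(
  lapply(split(seq_len(n), toy$bin),
         function(idx) sample(idx, size = floor(0.8 * length(idx)))),
  use.names = FALSE
)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Both sets should have the same share of rows per price bin
print(prop.table(table(train$bin)))
print(prop.table(table(test$bin)))
```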
model1 <- lm(price ~. , data=house2_train)
summary(model1)
#>
#> Call:
#> lm(formula = price ~ ., data = house2_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1709097 -104585 -9042 85554 4091480
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4995454.50628 280699.56107 17.796 < 0.0000000000000002 ***
#> date 100.57286 13.97392 7.197 0.000000000000640 ***
#> bedrooms -25119.38445 2199.20491 -11.422 < 0.0000000000000002 ***
#> bathrooms 47978.78855 3339.83579 14.366 < 0.0000000000000002 ***
#> sqft_living 153.33215 4.87466 31.455 < 0.0000000000000002 ***
#> sqft_lot 0.03095 0.05442 0.569 0.569472
#> floors.L 199236.29211 50978.40457 3.908 0.000093322705353 ***
#> floors.Q 44022.37296 46966.51013 0.937 0.348610
#> floors.C -20404.03762 32583.86959 -0.626 0.531192
#> floors^4 12716.42128 18650.87445 0.682 0.495366
#> floors^5 44316.38823 13798.36169 3.212 0.001322 **
#> waterfront1 499796.06316 22912.49924 21.813 < 0.0000000000000002 ***
#> view.L 149221.04419 11811.53804 12.633 < 0.0000000000000002 ***
#> view.Q 48382.81610 10302.68223 4.696 0.000002671687283 ***
#> view.C 83289.67119 11401.85666 7.305 0.000000000000290 ***
#> view^4 -33574.93462 9650.40616 -3.479 0.000504 ***
#> condition.L 93779.16610 27769.04935 3.377 0.000734 ***
#> condition.Q 381.43017 23445.98965 0.016 0.987020
#> condition.C 10770.88356 17572.56350 0.613 0.539926
#> condition^4 3946.99219 10090.95147 0.391 0.695697
#> grade.L 2029337.87605 43795.70662 46.336 < 0.0000000000000002 ***
#> grade.Q 1128916.01697 40539.72790 27.847 < 0.0000000000000002 ***
#> grade.C 535475.01767 35105.16281 15.253 < 0.0000000000000002 ***
#> grade^4 308130.47366 28345.21491 10.871 < 0.0000000000000002 ***
#> grade^5 119561.87994 21733.78313 5.501 0.000000038258219 ***
#> grade^6 64421.03247 15924.89692 4.045 0.000052483003372 ***
#> grade^7 8335.20262 11096.19717 0.751 0.452557
#> grade^8 -6097.65639 7419.49015 -0.822 0.411178
#> grade^9 3906.92825 4816.98812 0.811 0.417335
#> sqft_above -23.75335 5.11273 -4.646 0.000003410552201 ***
#> yr_built -2997.71285 82.67212 -36.260 < 0.0000000000000002 ***
#> yr_renovated 21.88524 4.16452 5.255 0.000000149653794 ***
#> sqft_living15 33.34127 3.93364 8.476 < 0.0000000000000002 ***
#> sqft_lot15 -0.60463 0.08387 -7.209 0.000000000000586 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 207000 on 17246 degrees of freedom
#> Multiple R-squared: 0.6909, Adjusted R-squared: 0.6903
#> F-statistic: 1168 on 33 and 17246 DF, p-value: < 0.00000000000000022
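The .L, .Q, .C and ^4 suffixes in the summary above appear because R encodes ordered factors with orthogonal polynomial contrasts by default (linear, quadratic, cubic, and higher-order terms). A minimal sketch with a made-up five-level ordered factor, similar in shape to condition:

```r
# A made-up ordered factor with five levels, like condition (1 to 5)
cond <- factor(c(3, 3, 5, 4, 1, 2, 3, 4), levels = 1:5, ordered = TRUE)

# Default contrasts for ordered factors are polynomial: one column per
# degree, named .L (linear), .Q (quadratic), .C (cubic), ^4 (quartic)
print(contrasts(cond))

# The same matrix can be generated directly
print(contr.poly(5))
```

Each such coefficient therefore describes a trend across the ordered levels rather than the effect of one particular level.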
Let us try backward stepwise selection, which drops predictors one at a time as long as doing so lowers the AIC, to see whether it improves the model’s R-squared score:
model2 <- step(model1, direction="backward")
#> Start: AIC=423064.8
#> price ~ date + bedrooms + bathrooms + sqft_living + sqft_lot +
#> floors + waterfront + view + condition + grade + sqft_above +
#> yr_built + yr_renovated + sqft_living15 + sqft_lot15
#>
#> Df Sum of Sq RSS AIC
#> - sqft_lot 1 13865044864 738988101163934 423063
#> <none> 738974236119070 423065
#> - sqft_above 1 924880483148 739899116602218 423084
#> - yr_renovated 1 1183350253632 740157586372702 423090
#> - date 1 2219556003492 741193792122562 423115
#> - sqft_lot15 1 2227022384538 741201258503608 423115
#> - sqft_living15 1 3078335629427 742052571748497 423135
#> - condition 4 4415270480013 743389506599083 423160
#> - bedrooms 1 5590202091607 744564438210677 423193
#> - bathrooms 1 8842788599224 747817024718294 423268
#> - floors 5 9460119484972 748434355604042 423275
#> - view 4 14413878805054 753388114924124 423391
#> - waterfront 1 20388329813301 759362565932371 423533
#> - sqft_living 1 42395460179584 781369696298654 424027
#> - yr_built 1 56338180485718 795312416604788 424332
#> - grade 9 168721262807468 907695498926538 426600
#>
#> Step: AIC=423063.1
#> price ~ date + bedrooms + bathrooms + sqft_living + floors +
#> waterfront + view + condition + grade + sqft_above + yr_built +
#> yr_renovated + sqft_living15 + sqft_lot15
#>
#> Df Sum of Sq RSS AIC
#> <none> 738988101163934 423063
#> - sqft_above 1 916383528876 739904484692810 423082
#> - yr_renovated 1 1182742514208 740170843678142 423089
#> - date 1 2221375976004 741209477139938 423113
#> - sqft_living15 1 3066704482622 742054805646555 423133
#> - sqft_lot15 1 3850879807655 742838980971589 423151
#> - condition 4 4411419708253 743399520872187 423158
#> - bedrooms 1 5604865006161 744592966170094 423192
#> - bathrooms 1 8851179645848 747839280809781 423267
#> - floors 5 9447813592308 748435914756242 423273
#> - view 4 14422606629527 753410707793460 423389
#> - waterfront 1 20388792787967 759376893951900 423531
#> - sqft_living 1 42409435908689 781397537072623 424025
#> - yr_built 1 56336734959935 795324836123868 424331
#> - grade 9 168712304006500 907700405170434 426598
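The AIC values printed by step() come from extractAIC(), which for a linear model computes n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. The sketch below recomputes the starting AIC from the RSS in the first <none> row, under the assumption that the training set has n = 17,280 rows (17,246 residual degrees of freedom plus 34 coefficients in model1), and then checks the same formula against extractAIC() on the built-in mtcars data.

```r
# Values taken from the report: RSS of the full model and its dimensions
rss <- 738974236119070      # RSS in the <none> row of the first step
p   <- 34                   # coefficients in model1, including the intercept
n   <- 17246 + p            # residual degrees of freedom + coefficients

aic_start <- n * log(rss / n) + 2 * p
print(round(aic_start, 1))  # compare with "Start: AIC=423064.8"

# Same formula checked against extractAIC() on the built-in mtcars data
fit    <- lm(mpg ~ wt + hp, data = mtcars)
n_fit  <- nrow(mtcars)
p_fit  <- length(coef(fit))
manual <- n_fit * log(sum(residuals(fit)^2) / n_fit) + 2 * p_fit
print(c(manual = manual, extractAIC = extractAIC(fit)[2]))
```

Since the penalty is only 2 per coefficient, removing a predictor lowers the AIC whenever the resulting increase in n * log(RSS / n) is smaller than 2.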
Backward stepwise selection finds that removing sqft_lot decreases the AIC slightly. What does the model summary look like?
summary(model2)
#>
#> Call:
#> lm(formula = price ~ date + bedrooms + bathrooms + sqft_living +
#> floors + waterfront + view + condition + grade + sqft_above +
#> yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = house2_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1709468 -104546 -9057 85495 4090912
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 4994810.83500 280691.77584 17.795 < 0.0000000000000002 ***
#> date 100.61281 13.97347 7.200 0.000000000000626 ***
#> bedrooms -25146.42852 2198.64785 -11.437 < 0.0000000000000002 ***
#> bathrooms 47998.86637 3339.58379 14.373 < 0.0000000000000002 ***
#> sqft_living 153.35307 4.87442 31.461 < 0.0000000000000002 ***
#> floors.L 199048.77009 50976.33902 3.905 0.000094690332617 ***
#> floors.Q 44104.85994 46965.36527 0.939 0.347696
#> floors.C -20344.82830 32583.06437 -0.624 0.532374
#> floors^4 12615.11511 18649.65842 0.676 0.498779
#> floors^5 44246.51134 13797.54430 3.207 0.001344 **
#> waterfront1 499801.69117 22912.04778 21.814 < 0.0000000000000002 ***
#> view.L 149238.79538 11811.26519 12.635 < 0.0000000000000002 ***
#> view.Q 48258.66637 10300.16829 4.685 0.000002817791301 ***
#> view.C 83140.13358 11398.60222 7.294 0.000000000000314 ***
#> view^4 -33617.88230 9649.92158 -3.484 0.000496 ***
#> condition.L 93419.38778 27761.30125 3.365 0.000767 ***
#> condition.Q 484.01608 23444.83629 0.021 0.983529
#> condition.C 10895.91070 17570.84432 0.620 0.535192
#> condition^4 3800.18312 10087.45274 0.377 0.706384
#> grade.L 2028993.68805 43790.66792 46.334 < 0.0000000000000002 ***
#> grade.Q 1128684.85821 40536.89619 27.843 < 0.0000000000000002 ***
#> grade.C 535232.17938 35101.87866 15.248 < 0.0000000000000002 ***
#> grade^4 307834.76669 28339.89188 10.862 < 0.0000000000000002 ***
#> grade^5 119598.27402 21733.26276 5.503 0.000000037868803 ***
#> grade^6 64278.92540 15922.62506 4.037 0.000054383273830 ***
#> grade^7 8416.23199 11095.06523 0.759 0.448128
#> grade^8 -6136.16502 7419.03582 -0.827 0.408201
#> grade^9 3892.89534 4816.83049 0.808 0.418995
#> sqft_above -23.61859 5.10713 -4.625 0.000003779979353 ***
#> yr_built -2997.67333 82.67047 -36.261 < 0.0000000000000002 ***
#> yr_renovated 21.87956 4.16443 5.254 0.000000150669083 ***
#> sqft_living15 33.25133 3.93038 8.460 < 0.0000000000000002 ***
#> sqft_lot15 -0.57146 0.06028 -9.480 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 207000 on 17247 degrees of freedom
#> Multiple R-squared: 0.6909, Adjusted R-squared: 0.6903
#> F-statistic: 1205 on 32 and 17247 DF, p-value: < 0.00000000000000022
The feature selection was done with the step() function from base R's stats package. As can be seen, there is no improvement in the adjusted R-squared score, although there is a slight improvement in the AIC score. The adjusted R-squared shows that our linear model accounts for 69.03% of the variance in price, which means that 30.97% still needs to be accounted for by other factors.
Next, we want to compare the Root Mean Squared Error of our model on the train set and the test set to see how much our model overfits.
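The adjusted R-squared penalises R-squared for the number of predictors: 1 - (1 - R-squared) * (n - 1) / (n - p - 1). Here is a short sketch using the figures printed in summary(model2); n = 17,280 is inferred from the 17,247 residual degrees of freedom plus 33 coefficients, and adj_r2() is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (excluding the intercept)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Figures from summary(model2): R-squared 0.6909, 32 predictors,
# 17247 residual degrees of freedom (so n = 17247 + 32 + 1)
n <- 17247 + 32 + 1
print(round(adj_r2(0.6909, n, 32), 4))
```

With this many observations the penalty is tiny, which is consistent with the adjusted R-squared staying the same after sqft_lot is dropped.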
# RMSE of test set
predicted_price <- predict(object = model2, newdata = house2_test)
actual_price <- house2_test$price
RMSE(predicted_price, actual_price)
#> [1] 200482.2
# RMSE of train set
predicted_price <- predict(object = model2, newdata = house2_train)
actual_price <- house2_train$price
RMSE(predicted_price, actual_price)
#> [1] 206798.2
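RMSE() is not part of base R, so the calls above presumably rely on a package such as caret or MLmetrics. The measure itself is the square root of the mean squared difference between predictions and actual values. In the minimal sketch below, rmse() is a hypothetical helper and the small vectors are made up; the last lines cross-check the train RMSE against the RSS of the final model reported by step(), assuming n = 17,280 training rows.

```r
# Hypothetical helper: root mean squared error
rmse <- function(predicted, actual) sqrt(mean((predicted - actual)^2))

# Made-up example: errors are 1, 0 and -2
print(rmse(c(2, 4, 6), c(1, 4, 8)))   # sqrt(5 / 3)

# Cross-check with the report: on the training set, RMSE = sqrt(RSS / n)
rss_model2 <- 738988101163934         # RSS of the final model from step()
n_train    <- 17247 + 33              # residual df + coefficients
print(sqrt(rss_model2 / n_train))     # compare with the train RMSE above
```

The residual standard error in the model summary divides by the residual degrees of freedom instead of n, which is why it is slightly larger than the train RMSE.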
The RMSE of the test set (200,482) is lower than that of the train set (206,798), so the model shows no sign of overfitting.
Given that the RMSE shows no sign of overfitting and that the adjusted R-squared shows the model can account for approximately 69% of the variance in price, we tentatively conclude that our model performance is good. However, we need to check the model's assumptions to see whether it is valid. The first check is that the residuals are normally distributed. We will use qqPlot from the car library.
qqPlot(model2$residuals)
#> [1] 3107 5775
The plot shows that the residuals are not normally distributed, because many points lie outside the blue confidence envelope.
The second check is whether the error variance is constant or forms a pattern.
data.frame(prediction = model2$fitted.values,
residual = model2$residuals) %>%
ggplot(aes(prediction, residual)) +
geom_hline(yintercept = 0) +
geom_point()
The plot shows that the errors form a cone pattern opening towards the right. So the second assumption, that the errors should not form any pattern, is not fulfilled either.
The third assumption check is multicollinearity, that is, whether multiple predictors are strongly correlated with each other. In a valid linear regression model, multicollinearity should not be present among the predictor variables. We can check this with the vif() function in the car library.
vif(model2)
#> GVIF Df GVIF^(1/(2*Df))
#> date 1.007373 1 1.003680
#> bedrooms 1.692337 1 1.300898
#> bathrooms 2.435783 1 1.560700
#> sqft_living 8.068253 1 2.840467
#> floors 3.094176 5 1.119578
#> waterfront 1.532485 1 1.237936
#> view 1.828723 4 1.078372
#> condition 1.353555 4 1.038567
#> grade 4.478827 9 1.086865
#> sqft_above 7.178657 1 2.679302
#> yr_built 2.385371 1 1.544465
#> yr_renovated 1.162679 1 1.078276
#> sqft_living15 2.916870 1 1.707885
#> sqft_lot15 1.084190 1 1.041245
Multicollinearity is not present according to the VIF test because none of the predictors has a GVIF value at or above 10. So out of the three assumptions (normally distributed residuals, no pattern in the errors, and no multicollinearity), only the absence of multicollinearity is fulfilled.
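For a single numeric predictor, the variance inflation factor is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated data; vif_manual() and the simulated columns are hypothetical, and for multi-level factors car::vif() reports the generalised version (GVIF) shown above instead.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1: strongly collinear
x3 <- rnorm(n)                  # unrelated to x1 and x2
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: regress one predictor on the rest, return 1 / (1 - R^2)
vif_manual <- function(data, target) {
  others <- setdiff(names(data), target)
  fit <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(fit)$r.squared)
}

vifs <- sapply(names(X), function(v) vif_manual(X, v))
print(round(vifs, 2))
```

In the table above, the last column equals the square root of GVIF for every term with Df = 1, for example sqrt(8.068253) is about 2.84 for sqft_living.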
We have used linear regression to try to predict house prices based on their characteristics and the characteristics of their immediate surroundings (lot and waterfront). Although the regression model at first appears to have a fair performance as measured by R-squared, it failed two out of three assumption checks, which shows that the model is not valid and that further data processing needs to be done.