1 Background

Home buyers as well as real estate dealers would benefit from house price prediction based on the characteristics of the house and its immediate environment. In this report we attempt to predict house prices from several such factors, for example the living area in square feet.

2 Data description

The data is the King County (Washington, USA) house sales dataset (kc_house_data.csv). The description of the columns is as follows.

  • id: Identifier of a house
  • date: Date the house was sold
  • price: Sale price of the property
  • bedrooms: Number of bedrooms
  • bathrooms: Number of bathrooms
  • sqft_living: Total living area (in square feet)
  • sqft_lot: Lot area (in square feet)
  • floors: Number of building floors
  • waterfront: Does the house have a waterfront? (1 = yes, 0 = no)
  • view: An index from 0 to 4 of how good the view of the property is
  • condition: Overall condition of the house, from 1 (worn out) to 5 (excellent)
  • grade: Overall grade given to the housing unit, based on the King County grading system, from 1 (poor) to 13 (excellent)
  • sqft_above: Living area above ground level (in square feet)
  • sqft_basement: Basement area (in square feet)
  • yr_built: Year of construction
  • yr_renovated: Year of renovation (0 if never renovated)
  • zipcode: ZIP code of the house address
  • lat: Latitude of the house address
  • long: Longitude of the house address
  • sqft_living15: Living area in 2015 (in square feet)
  • sqft_lot15: Lot area in 2015 (in square feet)

3 Read and Inspect the Data

library(tidyverse) # read_csv(), glimpse(), dplyr verbs, ggplot2
library(lubridate) # mdy() for parsing the date column
library(GGally)    # ggcorr() correlation heatmap
library(car)       # qqPlot() and vif() for assumption checks
library(caret)     # RMSE() (assumed source of the RMSE helper used below)
house <- read_csv("data_input/kc_house_data.csv")
glimpse(house)
#> Rows: 21,597
#> Columns: 21
#> $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
#> $ date          <chr> "10/13/2014", "12/9/2014", "2/25/2015", "12/9/2014", "2/…
#> $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
#> $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
#> $ bathrooms     <dbl> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, 2, 1, 1,…
#> $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
#> $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
#> $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
#> $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
#> $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
#> $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
#> $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
#> $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
#> $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
#> $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
#> $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
#> $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
#> $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
#> $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

We see that we have 21,597 observations of 21 variables.

The target variable of our model will be the price, while the predictors will be the rest of the columns, after some data wrangling.
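
Before any wrangling, it is worth glancing at the distribution of the target. A minimal sketch (its output is not reproduced here; house prices are typically right-skewed, which is worth keeping in mind for a linear model):

# quick look at the spread and skew of the target variable
summary(house$price)
ggplot(house, aes(x = price)) +
  geom_histogram(bins = 50)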

4 Data Wrangling

We can drop the id column, as it is of no use in either EDA or prediction. We will also remove the zipcode, lat, and long columns to make the predictions independent of area or region. We also need to convert the date column to the Date type. Finally, some columns that appear to be numeric are not really numeric and need to be changed into factors.

house1 <- house %>% mutate(date = mdy(date)) %>% select(-c(id,lat,long,zipcode))
# check for any NA's in the table
colSums(is.na(house1))
#>          date         price      bedrooms     bathrooms   sqft_living 
#>             0             0             0             0             0 
#>      sqft_lot        floors    waterfront          view     condition 
#>             0             0             0             0             0 
#>         grade    sqft_above sqft_basement      yr_built  yr_renovated 
#>             0             0             0             0             0 
#> sqft_living15    sqft_lot15 
#>             0             0

There are no NA values in any of the columns. Next, let us check the number of unique values in each column.

house1 %>% summarise_all(n_distinct)

As noted above, some columns have very few unique values and are more suitable as factors: floors, waterfront, view, condition, and grade. The bedrooms and bathrooms columns also have few unique values, but they are counts, so we keep them numeric (their rare extreme values will be filtered out below). The yr_built and yr_renovated columns encode years and could arguably be treated as dates, but we keep them numeric for modeling. The zipcode, lat, and long columns were removed because we want the prediction to be general rather than tied to a location or region.

# check the unique values
unique(house1$bedrooms)
#>  [1]  3  2  4  5  1  6  7  8  9 11 10 33
unique(house1$bathrooms)
#> [1] 1 2 3 4 0 5 6 8 7
unique(house1$floors)
#> [1] 1.0 2.0 1.5 3.0 2.5 3.5
unique(house1$waterfront)
#> [1] 0 1
unique(house1$view)
#> [1] 0 3 4 2 1
unique(house1$condition)
#> [1] 3 5 4 1 2
unique(house1$grade)
#>  [1]  7  6  8 11  9  5 10 12  4  3 13

Let us check the occurrences of the unique values in the columns with fewer than 100 unique values.

table(house1$bedrooms)
#> 
#>    1    2    3    4    5    6    7    8    9   10   11   33 
#>  196 2760 9824 6882 1601  272   38   13    6    3    1    1
table(house1$bathrooms)
#> 
#>     0     1     2     3     4     5     6     7     8 
#>    75  8353 10539  2228   338    48    12     2     2
table(house1$floors)
#> 
#>     1   1.5     2   2.5     3   3.5 
#> 10673  1910  8235   161   611     7
table(house1$waterfront)
#> 
#>     0     1 
#> 21434   163
table(house1$view)
#> 
#>     0     1     2     3     4 
#> 19475   332   961   510   319
table(house1$condition)
#> 
#>     1     2     3     4     5 
#>    29   170 14020  5677  1701
table(house1$grade)
#> 
#>    3    4    5    6    7    8    9   10   11   12   13 
#>    1   27  242 2038 8974 6065 2615 1134  399   89   13

From these results it is clear that some values occur only 3 times or fewer in the data frame. Let us remove the observations whose values occur 3 or fewer times in the bathrooms, bedrooms, and grade columns.

# filtering bathrooms: keep only values with more than 3 occurrences
bathrooms_f <- house1 %>% count(bathrooms) %>% filter(n > 3)
house2 <- house1[house1$bathrooms %in% bathrooms_f$bathrooms,]

# filtering bedrooms, building on the bathroom-filtered data
bedrooms_f <- house2 %>% count(bedrooms) %>% filter(n > 3)
house2 <- house2[house2$bedrooms %in% bedrooms_f$bedrooms,]

# filtering grade, building on the previous step
grade_f <- house2 %>% count(grade) %>% filter(n > 3)
house2 <- house2[house2$grade %in% grade_f$grade,] %>% mutate(grade = factor(grade, ordered = TRUE))

# converting 4 more columns to factors; bedrooms and bathrooms stay
# numeric since they are counts used directly by the regression below
# (waterfront is binary, so it is left unordered)
house2 <- house2 %>% 
  mutate(
         floors = factor(floors, ordered = TRUE),
         waterfront = factor(waterfront, ordered = FALSE),
         view = factor(view, ordered = TRUE),
         condition = factor(condition, ordered = TRUE)
         )
glimpse(house2)
#> Rows: 21,596
#> Columns: 17
#> $ date          <date> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
#> $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1230000, 257500,…
#> $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
#> $ bathrooms     <dbl> 1, 2, 1, 3, 2, 4, 2, 1, 1, 2, 2, 1, 1, 1, 2, 3, 2, 1, 1,…
#> $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
#> $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
#> $ floors        <ord> 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1.5, 1, 1.5, 2, 2, 1…
#> $ waterfront    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ view          <ord> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
#> $ condition     <ord> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
#> $ grade         <ord> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
#> $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
#> $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
#> $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
#> $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
#> $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …

Let us also look at the proportion of yr_renovated values.

round(prop.table(table(house1$yr_renovated))*100,2 )
#> 
#>     0  1934  1940  1944  1945  1946  1948  1950  1951  1953  1954  1955  1956 
#> 95.77  0.00  0.01  0.00  0.01  0.01  0.00  0.01  0.00  0.01  0.00  0.01  0.01 
#>  1957  1958  1959  1960  1962  1963  1964  1965  1967  1968  1969  1970  1971 
#>  0.01  0.02  0.00  0.02  0.01  0.02  0.02  0.02  0.01  0.04  0.02  0.04  0.01 
#>  1972  1973  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984 
#>  0.02  0.02  0.01  0.03  0.01  0.04  0.03  0.05  0.05  0.02  0.05  0.08  0.08 
#>  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997 
#>  0.08  0.08  0.08  0.07  0.10  0.12  0.09  0.08  0.09  0.09  0.07  0.07  0.07 
#>  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010 
#>  0.09  0.08  0.16  0.09  0.10  0.17  0.12  0.16  0.11  0.16  0.08  0.10  0.08 
#>  2011  2012  2013  2014  2015 
#>  0.06  0.05  0.17  0.42  0.07

95.77% of the houses have 0 as the yr_renovated value, which means they were never renovated. Since this is equivalent to roughly 96% of the column being missing, it is a candidate for dropping, but for now let us keep it.
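
If we later act on this, one alternative to dropping the column is recoding the sparse year into a binary flag. A sketch only, not applied to the pipeline (house_alt is a hypothetical name):

# illustration: encode renovation as a yes/no factor instead of a sparse year
# (the main pipeline keeps yr_renovated as-is)
house_alt <- house2 %>%
  mutate(renovated = factor(ifelse(yr_renovated > 0, "yes", "no")))
table(house_alt$renovated)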

What about the correlation between predictors and target, as well as between predictors?

ggcorr(house2, label=TRUE)

The correlation heatmap shows that predictors such as bedrooms and sqft_lot are weakly correlated with price, while predictors such as sqft_living and sqft_above are strongly correlated with it.
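
Note that ggcorr() silently ignores non-numeric columns, so the heatmap above covers only the numeric predictors. An equivalent, explicit version:

# make the numeric-only selection explicit before computing correlations
house2 %>%
  select(where(is.numeric)) %>%
  ggcorr(label = TRUE)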

I want to check whether any column is a linear combination (an alias) of other columns, using the alias() function from base R.

model <- lm(price ~. , data=house2)
alias(model)
#> Model :
#> price ~ date + bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + view + condition + grade + sqft_above + 
#>     sqft_basement + yr_built + yr_renovated + sqft_living15 + 
#>     sqft_lot15
#> 
#> Complete :
#>               (Intercept) date bedrooms bathrooms sqft_living sqft_lot floors.L
#> sqft_basement  0           0    0        0         1           0        0      
#>               floors.Q floors.C floors^4 floors^5 waterfront1 view.L view.Q
#> sqft_basement  0        0        0        0        0           0      0    
#>               view.C view^4 condition.L condition.Q condition.C condition^4
#> sqft_basement  0      0      0           0           0           0         
#>               grade.L grade.Q grade.C grade^4 grade^5 grade^6 grade^7 grade^8
#> sqft_basement  0       0       0       0       0       0       0       0     
#>               grade^9 sqft_above yr_built yr_renovated sqft_living15 sqft_lot15
#> sqft_basement  0      -1          0        0            0             0

Checking the aliases shows that sqft_basement is a linear combination of sqft_living and sqft_above (the coefficients above read sqft_basement = sqft_living - sqft_above), so we will remove sqft_basement.
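
We can sanity-check this relationship directly on the data (a one-line verification; it should return TRUE given the exact dependency found above):

# verify the alias: basement area equals living area minus above-ground area
all(house2$sqft_living - house2$sqft_above == house2$sqft_basement)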

house2 <- house2 %>% select(-sqft_basement)
# split the dataset into testing and training sets
library(rsample)
set.seed(313)
init <- initial_split(data=house2, prop=0.8, strata = "price")
house2_train <- training(init) 
house2_test  <- testing(init)
model1 <- lm(price ~. , data=house2_train)
summary(model1)
#> 
#> Call:
#> lm(formula = price ~ ., data = house2_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1709097  -104585    -9042    85554  4091480 
#> 
#> Coefficients:
#>                    Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept)   4995454.50628  280699.56107  17.796 < 0.0000000000000002 ***
#> date              100.57286      13.97392   7.197    0.000000000000640 ***
#> bedrooms       -25119.38445    2199.20491 -11.422 < 0.0000000000000002 ***
#> bathrooms       47978.78855    3339.83579  14.366 < 0.0000000000000002 ***
#> sqft_living       153.33215       4.87466  31.455 < 0.0000000000000002 ***
#> sqft_lot            0.03095       0.05442   0.569             0.569472    
#> floors.L       199236.29211   50978.40457   3.908    0.000093322705353 ***
#> floors.Q        44022.37296   46966.51013   0.937             0.348610    
#> floors.C       -20404.03762   32583.86959  -0.626             0.531192    
#> floors^4        12716.42128   18650.87445   0.682             0.495366    
#> floors^5        44316.38823   13798.36169   3.212             0.001322 ** 
#> waterfront1    499796.06316   22912.49924  21.813 < 0.0000000000000002 ***
#> view.L         149221.04419   11811.53804  12.633 < 0.0000000000000002 ***
#> view.Q          48382.81610   10302.68223   4.696    0.000002671687283 ***
#> view.C          83289.67119   11401.85666   7.305    0.000000000000290 ***
#> view^4         -33574.93462    9650.40616  -3.479             0.000504 ***
#> condition.L     93779.16610   27769.04935   3.377             0.000734 ***
#> condition.Q       381.43017   23445.98965   0.016             0.987020    
#> condition.C     10770.88356   17572.56350   0.613             0.539926    
#> condition^4      3946.99219   10090.95147   0.391             0.695697    
#> grade.L       2029337.87605   43795.70662  46.336 < 0.0000000000000002 ***
#> grade.Q       1128916.01697   40539.72790  27.847 < 0.0000000000000002 ***
#> grade.C        535475.01767   35105.16281  15.253 < 0.0000000000000002 ***
#> grade^4        308130.47366   28345.21491  10.871 < 0.0000000000000002 ***
#> grade^5        119561.87994   21733.78313   5.501    0.000000038258219 ***
#> grade^6         64421.03247   15924.89692   4.045    0.000052483003372 ***
#> grade^7          8335.20262   11096.19717   0.751             0.452557    
#> grade^8         -6097.65639    7419.49015  -0.822             0.411178    
#> grade^9          3906.92825    4816.98812   0.811             0.417335    
#> sqft_above        -23.75335       5.11273  -4.646    0.000003410552201 ***
#> yr_built        -2997.71285      82.67212 -36.260 < 0.0000000000000002 ***
#> yr_renovated       21.88524       4.16452   5.255    0.000000149653794 ***
#> sqft_living15      33.34127       3.93364   8.476 < 0.0000000000000002 ***
#> sqft_lot15         -0.60463       0.08387  -7.209    0.000000000000586 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 207000 on 17246 degrees of freedom
#> Multiple R-squared:  0.6909, Adjusted R-squared:  0.6903 
#> F-statistic:  1168 on 33 and 17246 DF,  p-value: < 0.00000000000000022

Let us try backward stepwise elimination with step() to see whether we can improve the model; step() selects predictors by minimizing AIC:

model2 <- step(model1, direction="backward")
#> Start:  AIC=423064.8
#> price ~ date + bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + view + condition + grade + sqft_above + 
#>     yr_built + yr_renovated + sqft_living15 + sqft_lot15
#> 
#>                 Df       Sum of Sq             RSS    AIC
#> - sqft_lot       1     13865044864 738988101163934 423063
#> <none>                             738974236119070 423065
#> - sqft_above     1    924880483148 739899116602218 423084
#> - yr_renovated   1   1183350253632 740157586372702 423090
#> - date           1   2219556003492 741193792122562 423115
#> - sqft_lot15     1   2227022384538 741201258503608 423115
#> - sqft_living15  1   3078335629427 742052571748497 423135
#> - condition      4   4415270480013 743389506599083 423160
#> - bedrooms       1   5590202091607 744564438210677 423193
#> - bathrooms      1   8842788599224 747817024718294 423268
#> - floors         5   9460119484972 748434355604042 423275
#> - view           4  14413878805054 753388114924124 423391
#> - waterfront     1  20388329813301 759362565932371 423533
#> - sqft_living    1  42395460179584 781369696298654 424027
#> - yr_built       1  56338180485718 795312416604788 424332
#> - grade          9 168721262807468 907695498926538 426600
#> 
#> Step:  AIC=423063.1
#> price ~ date + bedrooms + bathrooms + sqft_living + floors + 
#>     waterfront + view + condition + grade + sqft_above + yr_built + 
#>     yr_renovated + sqft_living15 + sqft_lot15
#> 
#>                 Df       Sum of Sq             RSS    AIC
#> <none>                             738988101163934 423063
#> - sqft_above     1    916383528876 739904484692810 423082
#> - yr_renovated   1   1182742514208 740170843678142 423089
#> - date           1   2221375976004 741209477139938 423113
#> - sqft_living15  1   3066704482622 742054805646555 423133
#> - sqft_lot15     1   3850879807655 742838980971589 423151
#> - condition      4   4411419708253 743399520872187 423158
#> - bedrooms       1   5604865006161 744592966170094 423192
#> - bathrooms      1   8851179645848 747839280809781 423267
#> - floors         5   9447813592308 748435914756242 423273
#> - view           4  14422606629527 753410707793460 423389
#> - waterfront     1  20388792787967 759376893951900 423531
#> - sqft_living    1  42409435908689 781397537072623 424025
#> - yr_built       1  56336734959935 795324836123868 424331
#> - grade          9 168712304006500 907700405170434 426598

The backward elimination finds that removing sqft_lot decreases the AIC a little. What does the model summary look like?

summary(model2)
#> 
#> Call:
#> lm(formula = price ~ date + bedrooms + bathrooms + sqft_living + 
#>     floors + waterfront + view + condition + grade + sqft_above + 
#>     yr_built + yr_renovated + sqft_living15 + sqft_lot15, data = house2_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1709468  -104546    -9057    85495  4090912 
#> 
#> Coefficients:
#>                    Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept)   4994810.83500  280691.77584  17.795 < 0.0000000000000002 ***
#> date              100.61281      13.97347   7.200    0.000000000000626 ***
#> bedrooms       -25146.42852    2198.64785 -11.437 < 0.0000000000000002 ***
#> bathrooms       47998.86637    3339.58379  14.373 < 0.0000000000000002 ***
#> sqft_living       153.35307       4.87442  31.461 < 0.0000000000000002 ***
#> floors.L       199048.77009   50976.33902   3.905    0.000094690332617 ***
#> floors.Q        44104.85994   46965.36527   0.939             0.347696    
#> floors.C       -20344.82830   32583.06437  -0.624             0.532374    
#> floors^4        12615.11511   18649.65842   0.676             0.498779    
#> floors^5        44246.51134   13797.54430   3.207             0.001344 ** 
#> waterfront1    499801.69117   22912.04778  21.814 < 0.0000000000000002 ***
#> view.L         149238.79538   11811.26519  12.635 < 0.0000000000000002 ***
#> view.Q          48258.66637   10300.16829   4.685    0.000002817791301 ***
#> view.C          83140.13358   11398.60222   7.294    0.000000000000314 ***
#> view^4         -33617.88230    9649.92158  -3.484             0.000496 ***
#> condition.L     93419.38778   27761.30125   3.365             0.000767 ***
#> condition.Q       484.01608   23444.83629   0.021             0.983529    
#> condition.C     10895.91070   17570.84432   0.620             0.535192    
#> condition^4      3800.18312   10087.45274   0.377             0.706384    
#> grade.L       2028993.68805   43790.66792  46.334 < 0.0000000000000002 ***
#> grade.Q       1128684.85821   40536.89619  27.843 < 0.0000000000000002 ***
#> grade.C        535232.17938   35101.87866  15.248 < 0.0000000000000002 ***
#> grade^4        307834.76669   28339.89188  10.862 < 0.0000000000000002 ***
#> grade^5        119598.27402   21733.26276   5.503    0.000000037868803 ***
#> grade^6         64278.92540   15922.62506   4.037    0.000054383273830 ***
#> grade^7          8416.23199   11095.06523   0.759             0.448128    
#> grade^8         -6136.16502    7419.03582  -0.827             0.408201    
#> grade^9          3892.89534    4816.83049   0.808             0.418995    
#> sqft_above        -23.61859       5.10713  -4.625    0.000003779979353 ***
#> yr_built        -2997.67333      82.67047 -36.261 < 0.0000000000000002 ***
#> yr_renovated       21.87956       4.16443   5.254    0.000000150669083 ***
#> sqft_living15      33.25133       3.93038   8.460 < 0.0000000000000002 ***
#> sqft_lot15         -0.57146       0.06028  -9.480 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 207000 on 17247 degrees of freedom
#> Multiple R-squared:  0.6909, Adjusted R-squared:  0.6903 
#> F-statistic:  1205 on 32 and 17247 DF,  p-value: < 0.00000000000000022

The feature selection was done with the step() function from base R. As can be seen, there is no improvement in the adjusted R-squared, although there is a slight improvement in the AIC. The adjusted R-squared shows that our linear model explains 69.03% of the variance in price, which means that 30.97% is still left to be accounted for by other factors.
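
To compare the two fits side by side, the adjusted R-squared values can also be pulled out programmatically:

# extract the adjusted R-squared of both models for a direct comparison
c(model1 = summary(model1)$adj.r.squared,
  model2 = summary(model2)$adj.r.squared)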

Next, we want to compare the Root Mean Squared Error (RMSE) of our model on the train set and the test set to see whether our model overfits.

# RMSE of test set
predicted_price <- predict(object = model2, newdata = house2_test)
actual_price <- house2_test$price
RMSE(predicted_price, actual_price)
#> [1] 200482.2
# RMSE of train set
predicted_price <- predict(object = model2, newdata = house2_train)
actual_price <- house2_train$price
RMSE(predicted_price, actual_price)
#> [1] 206798.2

The RMSE on the test set (200,482) is in fact slightly lower than on the train set (206,798), so the model does not appear to overfit.
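
A more robust way to assess generalization would be k-fold cross-validation. A sketch using caret (loaded above; this was not run in the original report):

# sketch: 5-fold cross-validated RMSE for the reduced model
set.seed(313)
cv_fit <- train(price ~ ., data = select(house2_train, -sqft_lot),
                method = "lm",
                trControl = trainControl(method = "cv", number = 5))
cv_fit$results[, c("RMSE", "Rsquared")]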

5 Testing Model Assumptions

Given that our model shows no sign of overfitting based on the RMSE measure, and that the adjusted R-squared shows the model accounts for approximately 69% of the variance in price, we tentatively conclude that our model performance is good. But we need to do assumption checks on our model to see if it is valid. The first check is that the residuals are normally distributed. We will use qqPlot from the car package.

qqPlot(model2$residuals)

#> [1] 3107 5775

The plot shows that the residuals are not normally distributed, because many points lie outside the blue confidence envelope.

The second check is whether the error variance is constant (homoscedasticity) or the errors instead form a pattern.

data.frame(prediction = model2$fitted.values,
           residual = model2$residuals) %>% 
  ggplot(aes(prediction, residual)) +
  geom_hline(yintercept = 0) +
  geom_point() 

The plot shows that the errors form a cone pattern opening towards the right. So the second assumption, that the errors should not form any pattern, is not fulfilled either.
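
A common remedy for both the non-normal residuals and the cone-shaped spread is to model the logarithm of price, which compresses the right tail and tends to stabilize the variance. A sketch, not evaluated in this report:

# sketch: refit on the log scale and re-inspect the residuals
model_log <- lm(log(price) ~ ., data = house2_train)
qqPlot(model_log$residuals)
plot(model_log$fitted.values, model_log$residuals)
abline(h = 0)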

The third assumption check is for multicollinearity, that is, whether multiple predictors are strongly correlated with each other. In a valid linear regression model, multicollinearity should not be present among the predictor variables. We can check this with the vif() function from the car package.

vif(model2)
#>                   GVIF Df GVIF^(1/(2*Df))
#> date          1.007373  1        1.003680
#> bedrooms      1.692337  1        1.300898
#> bathrooms     2.435783  1        1.560700
#> sqft_living   8.068253  1        2.840467
#> floors        3.094176  5        1.119578
#> waterfront    1.532485  1        1.237936
#> view          1.828723  4        1.078372
#> condition     1.353555  4        1.038567
#> grade         4.478827  9        1.086865
#> sqft_above    7.178657  1        2.679302
#> yr_built      2.385371  1        1.544465
#> yr_renovated  1.162679  1        1.078276
#> sqft_living15 2.916870  1        1.707885
#> sqft_lot15    1.084190  1        1.041245

According to the VIF test, multicollinearity is not present, because none of the predictors has a GVIF value at or above 10. So out of the three assumptions, normality of residuals, no pattern in the errors, and no multicollinearity, only the no-multicollinearity assumption is fulfilled.
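
For predictors with more than one degree of freedom, the comparable rule of thumb is applied to the scaled column: GVIF^(1/(2*Df)) should stay below sqrt(10) ≈ 3.16. A quick programmatic check:

# apply the sqrt(10) rule of thumb to the scaled GVIF column
all(vif(model2)[, "GVIF^(1/(2*Df))"] < sqrt(10))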

6 Conclusion

We have used linear regression to try to predict house prices based on their characteristics and the characteristics of their immediate surroundings (lot size and waterfront). Although the regression model at first appears to have a fair performance as measured by adjusted R-squared, it failed two out of the three assumption checks, which shows that the model is not valid and that further data processing needs to be done.