Using R, build a multiple regression model for data that interests you.  Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis.  Was the linear model appropriate? Why or why not?

The King County House Sales dataset from the ModernDive package contains 1339 observations of 43 variables from home sales data. I’ve subsetted the following 15 variables for this model:

  • ‘price’ (num) (response variable)
  • ‘bedrooms’ (int)
  • ‘bathrooms’ (num)
  • ‘floors’ (num)
  • ‘waterfront’ (num) (0-1)
  • ‘condition’ (factor)
  • ‘grade’ (factor)
  • ‘sqft_above’ (int)
  • ‘sqft_basement’ (int)
  • ‘yr_built’ (int)
  • ‘yr_renovated’ (int)
  • ‘zipcode’ (factor)
  • ‘sqft_living15’ (int)
  • ‘sqft_lot15’ (int)

We’ll build a multiple linear regression model for price as a function of the remaining variables, and analyze the results.

Data Setup and EDA

df = house_prices
head(df) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)  %>%
  scroll_box(width='100%', height = '200px')
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
7129300520 2014-10-13 221900 3 1.00 1180 5650 1 FALSE 0 3 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
6414100192 2014-12-09 538000 3 2.25 2570 7242 2 FALSE 0 3 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
5631500400 2015-02-25 180000 2 1.00 770 10000 1 FALSE 0 3 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
2487200875 2014-12-09 604000 4 3.00 1960 5000 1 FALSE 0 5 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
1954400510 2015-02-18 510000 3 2.00 1680 8080 1 FALSE 0 3 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
7237550310 2014-05-12 1225000 4 4.50 5420 101930 1 FALSE 0 3 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930

Missing Data

apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F) %>%
  scroll_box(width='100%', height = '200px')
x
id 0
date 0
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
lat 0
long 0
sqft_living15 0
sqft_lot15 0

No missing data.

Summary Statistics

summary(df)
##       id                 date                price            bedrooms     
##  Length:21613       Min.   :2014-05-02   Min.   :  75000   Min.   : 0.000  
##  Class :character   1st Qu.:2014-07-22   1st Qu.: 321950   1st Qu.: 3.000  
##  Mode  :character   Median :2014-10-16   Median : 450000   Median : 3.000  
##                     Mean   :2014-10-29   Mean   : 540088   Mean   : 3.371  
##                     3rd Qu.:2015-02-17   3rd Qu.: 645000   3rd Qu.: 4.000  
##                     Max.   :2015-05-27   Max.   :7700000   Max.   :33.000  
##                                                                            
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##                                                                   
##  waterfront           view        condition     grade        sqft_above  
##  Mode :logical   Min.   :0.0000   1:   30   7      :8981   Min.   : 290  
##  FALSE:21450     1st Qu.:0.0000   2:  172   8      :6068   1st Qu.:1190  
##  TRUE :163       Median :0.0000   3:14031   9      :2615   Median :1560  
##                  Mean   :0.2343   4: 5679   6      :2038   Mean   :1788  
##                  3rd Qu.:0.0000   5: 1701   10     :1134   3rd Qu.:2210  
##                  Max.   :4.0000             11     : 399   Max.   :9410  
##                                             (Other): 378                 
##  sqft_basement       yr_built     yr_renovated       zipcode     
##  Min.   :   0.0   Min.   :1900   Min.   :   0.0   98103  :  602  
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0   98038  :  590  
##  Median :   0.0   Median :1975   Median :   0.0   98115  :  583  
##  Mean   : 291.5   Mean   :1971   Mean   :  84.4   98052  :  574  
##  3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0   98117  :  553  
##  Max.   :4820.0   Max.   :2015   Max.   :2015.0   98042  :  548  
##                                                   (Other):18163  
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :47.57   Median :-122.2   Median :1840   Median :  7620  
##  Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12768  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200  
## 

The median price is $450k, with homes built betweeen 1900 and 2015. There is a wide variety of square footage of building and property, as well as number of floors, bedrooms and bathrooms. Information about waterfront views and renovations are also included.

LR Model

Quadratic terms

Inspecting the relationships between each of the numeric variables and the response, bathrooms seems slighly curvilinear - we’ll add a Quadratic term for it in the model, and illustrate it temporarily in the scatterplot matrix below.

df %>%
  select(where(is.numeric)) %>%
  mutate(bathrooms2 = bathrooms ^ 2) %>%
  pivot_longer(cols = -price, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = price)) +
    geom_point(alpha=0.1) +
    facet_wrap(~variable, scales = "free")

Dichotomous terms

The variables waterfront and yr_renovated are both booleans, and good candidates for Dichotomous / dummy variables.

df <- df %>%
  mutate(waterfront = ifelse(waterfront == FALSE,0,1),
         renovated = ifelse(yr_renovated == 0,0,1))

Interaction terms

It may be interesting to understand the relationship between bathrooms and price for homes that underwent renovation, versus those that did not. We’ll add an Interaction term of bathrooms:renovated to the model.

Normalize

# predictors only, keep the scale of the coefficients
df_y <- df %>% select(price)
df_x <- df %>% select(!price)

obj_normalize <- preProcess(df_x, method = c("center", "scale"))
df_xn <- predict(obj_normalize, df_x)

df_n <- bind_cols(df_y,df_xn)

Train model

lm1 <- lm(price ~ bathrooms + I(bathrooms^2) + bedrooms + floors + renovated + sqft_above +
           sqft_basement + sqft_living15 + bathrooms:renovated +
           sqft_lot15 + waterfront + yr_built + yr_renovated, data=df_n)

get_regression_table(lm1) %>%
  arrange(p_value, desc(estimate)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
term estimate std_error statistic p_value lower_ci upper_ci
yr_renovated 1347613.57 202025.796 6.671 0 951628.097 1743599.04
intercept 510395.69 1812.412 281.611 0 506843.231 513948.16
sqft_above 172817.05 3242.388 53.299 0 166461.729 179172.37
sqft_basement 101645.19 2084.618 48.760 0 97559.184 105731.19
sqft_living15 75477.73 2489.686 30.316 0 70597.761 80357.70
waterfront 61281.28 1571.345 38.999 0 58201.330 64361.24
floors 36522.59 2127.364 17.168 0 32352.795 40692.38
bathrooms 31657.12 2865.471 11.048 0 26040.584 37273.65
I(bathrooms^2) 29190.69 954.800 30.573 0 27319.212 31062.17
bathrooms:renovated 10010.56 1414.550 7.077 0 7237.934 12783.18
sqft_lot15 -19280.90 1598.517 -12.062 0 -22414.109 -16147.69
bedrooms -45726.09 1967.926 -23.236 0 -49583.367 -41868.81
yr_built -84255.47 2097.607 -40.167 0 -88366.932 -80144.00
renovated -1342496.98 201970.895 -6.647 0 -1738374.844 -946619.11
get_regression_summaries(lm1) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.618 0.618 51486209417 226905.7 226979.3 2687.769 0 13 21613

All our predictors have statistically significant p-values. The predictors were normalized but response was not, so we can interpret each coefficient as the dollar increase/decrease in price for every standard deviation increase in the predictor. This approach helps us see the relative impact of each predictor on price.

As we might expect, increases in building square footage are associated with increases in price - however, increases in property square footage are associated with decreases in price, as are the number of bedrooms.

Our Quadratic term bathooms^2 had a slightly smaller positive association with price than bathrooms itself, suggesting a weaker curvilinear relationship than linear after all. In practice we’d expect these two variables to be collinear, and would probably drop the weaker Quadratic term.

The Interaction term bathrooms:renovated has a positive relationship with price.

There’s an interesting dichotomy between our renovated Dichotomous variable (strongly negatively associated with price) and the year in which a renovation took place (recent renovations are strongly positively associated with price.) One hypothesis - older homes in need of remodeling tend to command lower prices, but an actual remodel shoots the price right back up.

Residual Analysis

get_regression_points(lm1) %>%
  ggplot(aes(x=price_hat, y=residual)) +
    geom_point(alpha=0.3)

Lots of heteroskedasticity is observed in the residuals! Perhaps a linear regression is not the best model for these data.

The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:

qqnorm(resid(lm1))
qqline(resid(lm1))

Conclusion

Overall this simple LR model seems to define the relationship between price and the remaining variables fairly well with an Adjusted \(R^2\) of 0.62, but the high degree of heteroskedasticity suggests that any predictions made with this model may be unreliable at certain values. Additional steps to address this condition - or application of an alternative regression model - might be advisable to improve performance.