Discussion 12 - Linear Regression

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

The King County House Sales dataset from the ModernDive package contains 1339 observations of 43 variables from home sales data. I’ve subsetted the following 15 variables for this model:

‘price’ (num) (response variable)
‘bedrooms’ (int)
‘bathrooms’ (num)
‘floors’ (num)
‘waterfront’ (num) (0-1)
‘condition’ (factor)
‘grade’ (factor)
‘sqft_above’ (int)
‘sqft_basement’ (int)
‘yr_built’ (int)
‘yr_renovated’ (int)
‘zipcode’ (factor)
‘sqft_living15’ (int)
‘sqft_lot15’ (int)

We’ll build a multiple linear regression model for price as a function of the remaining variables, and analyze the results.

Data Setup and EDA

df = house_prices
head(df) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)  %>%
  scroll_box(width='100%', height = '200px')

id	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	condition	grade	sqft_above	sqft_basement	yr_built	yr_renovated	zipcode	lat	long	sqft_living15	sqft_lot15
7129300520	2014-10-13	221900	3	1.00	1180	5650	1	FALSE	3	7	1180	0	1955	0	98178	47.5112	-122.257	1340	5650
6414100192	2014-12-09	538000	3	2.25	2570	7242	2	FALSE	3	7	2170	400	1951	1991	98125	47.7210	-122.319	1690	7639
5631500400	2015-02-25	180000	2	1.00	770	10000	1	FALSE	3	6	770	0	1933	0	98028	47.7379	-122.233	2720	8062
2487200875	2014-12-09	604000	4	3.00	1960	5000	1	FALSE	5	7	1050	910	1965	0	98136	47.5208	-122.393	1360	5000
1954400510	2015-02-18	510000	3	2.00	1680	8080	1	FALSE	3	8	1680	0	1987	0	98074	47.6168	-122.045	1800	7503
7237550310	2014-05-12	1225000	4	4.50	5420	101930	1	FALSE	3	11	3890	1530	2001	0	98053	47.6561	-122.005	4760	101930

Missing Data

apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F) %>%
  scroll_box(width='100%', height = '200px')

	x
id	0
date	0
price	0
bedrooms	0
bathrooms	0
sqft_living	0
sqft_lot	0
floors	0
waterfront	0
view	0
condition	0
grade	0
sqft_above	0
sqft_basement	0
yr_built	0
yr_renovated	0
zipcode	0
lat	0
long	0
sqft_living15	0
sqft_lot15	0

No missing data.

Summary Statistics

summary(df)

##       id                 date                price            bedrooms     
##  Length:21613       Min.   :2014-05-02   Min.   :  75000   Min.   : 0.000  
##  Class :character   1st Qu.:2014-07-22   1st Qu.: 321950   1st Qu.: 3.000  
##  Mode  :character   Median :2014-10-16   Median : 450000   Median : 3.000  
##                     Mean   :2014-10-29   Mean   : 540088   Mean   : 3.371  
##                     3rd Qu.:2015-02-17   3rd Qu.: 645000   3rd Qu.: 4.000  
##                     Max.   :2015-05-27   Max.   :7700000   Max.   :33.000  
##                                                                            
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##                                                                   
##  waterfront           view        condition     grade        sqft_above  
##  Mode :logical   Min.   :0.0000   1:   30   7      :8981   Min.   : 290  
##  FALSE:21450     1st Qu.:0.0000   2:  172   8      :6068   1st Qu.:1190  
##  TRUE :163       Median :0.0000   3:14031   9      :2615   Median :1560  
##                  Mean   :0.2343   4: 5679   6      :2038   Mean   :1788  
##                  3rd Qu.:0.0000   5: 1701   10     :1134   3rd Qu.:2210  
##                  Max.   :4.0000             11     : 399   Max.   :9410  
##                                             (Other): 378                 
##  sqft_basement       yr_built     yr_renovated       zipcode     
##  Min.   :   0.0   Min.   :1900   Min.   :   0.0   98103  :  602  
##  1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0   98038  :  590  
##  Median :   0.0   Median :1975   Median :   0.0   98115  :  583  
##  Mean   : 291.5   Mean   :1971   Mean   :  84.4   98052  :  574  
##  3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0   98117  :  553  
##  Max.   :4820.0   Max.   :2015   Max.   :2015.0   98042  :  548  
##                                                   (Other):18163  
##       lat             long        sqft_living15    sqft_lot15    
##  Min.   :47.16   Min.   :-122.5   Min.   : 399   Min.   :   651  
##  1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490   1st Qu.:  5100  
##  Median :47.57   Median :-122.2   Median :1840   Median :  7620  
##  Mean   :47.56   Mean   :-122.2   Mean   :1987   Mean   : 12768  
##  3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360   3rd Qu.: 10083  
##  Max.   :47.78   Max.   :-121.3   Max.   :6210   Max.   :871200  
##

The median price is $450k, with homes built betweeen 1900 and 2015. There is a wide variety of square footage of building and property, as well as number of floors, bedrooms and bathrooms. Information about waterfront views and renovations are also included.

LR Model

Quadratic terms

Inspecting the relationships between each of the numeric variables and the response, bathrooms seems slighly curvilinear - we’ll add a Quadratic term for it in the model, and illustrate it temporarily in the scatterplot matrix below.

df %>%
  select(where(is.numeric)) %>%
  mutate(bathrooms2 = bathrooms ^ 2) %>%
  pivot_longer(cols = -price, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = price)) +
    geom_point(alpha=0.1) +
    facet_wrap(~variable, scales = "free")

Dichotomous terms

The variables waterfront and yr_renovated are both booleans, and good candidates for Dichotomous / dummy variables.

df <- df %>%
  mutate(waterfront = ifelse(waterfront == FALSE,0,1),
         renovated = ifelse(yr_renovated == 0,0,1))

Interaction terms

It may be interesting to understand the relationship between bathrooms and price for homes that underwent renovation, versus those that did not. We’ll add an Interaction term of bathrooms:renovated to the model.

Normalize

# predictors only, keep the scale of the coefficients
df_y <- df %>% select(price)
df_x <- df %>% select(!price)

obj_normalize <- preProcess(df_x, method = c("center", "scale"))
df_xn <- predict(obj_normalize, df_x)

df_n <- bind_cols(df_y,df_xn)

Train model

lm1 <- lm(price ~ bathrooms + I(bathrooms^2) + bedrooms + floors + renovated + sqft_above +
           sqft_basement + sqft_living15 + bathrooms:renovated +
           sqft_lot15 + waterfront + yr_built + yr_renovated, data=df_n)

get_regression_table(lm1) %>%
  arrange(p_value, desc(estimate)) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

term	estimate	std_error	statistic	lower_ci	upper_ci
yr_renovated	1347613.57	202025.796	6.671	951628.097	1743599.04
intercept	510395.69	1812.412	281.611	506843.231	513948.16
sqft_above	172817.05	3242.388	53.299	166461.729	179172.37
sqft_basement	101645.19	2084.618	48.760	97559.184	105731.19
sqft_living15	75477.73	2489.686	30.316	70597.761	80357.70
waterfront	61281.28	1571.345	38.999	58201.330	64361.24
floors	36522.59	2127.364	17.168	32352.795	40692.38
bathrooms	31657.12	2865.471	11.048	26040.584	37273.65
I(bathrooms^2)	29190.69	954.800	30.573	27319.212	31062.17
bathrooms:renovated	10010.56	1414.550	7.077	7237.934	12783.18
sqft_lot15	-19280.90	1598.517	-12.062	-22414.109	-16147.69
bedrooms	-45726.09	1967.926	-23.236	-49583.367	-41868.81
yr_built	-84255.47	2097.607	-40.167	-88366.932	-80144.00
renovated	-1342496.98	201970.895	-6.647	-1738374.844	-946619.11

get_regression_summaries(lm1) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.618	0.618	51486209417	226905.7	226979.3	2687.769	0	13	21613

All our predictors have statistically significant p-values. The predictors were normalized but response was not, so we can interpret each coefficient as the dollar increase/decrease in price for every standard deviation increase in the predictor. This approach helps us see the relative impact of each predictor on price.

As we might expect, increases in building square footage are associated with increases in price - however, increases in property square footage are associated with decreases in price, as are the number of bedrooms.

Our Quadratic term bathooms^2 had a slightly smaller positive association with price than bathrooms itself, suggesting a weaker curvilinear relationship than linear after all. In practice we’d expect these two variables to be collinear, and would probably drop the weaker Quadratic term.

The Interaction term bathrooms:renovated has a positive relationship with price.

There’s an interesting dichotomy between our renovated Dichotomous variable (strongly negatively associated with price) and the year in which a renovation took place (recent renovations are strongly positively associated with price.) One hypothesis - older homes in need of remodeling tend to command lower prices, but an actual remodel shoots the price right back up.

Residual Analysis

get_regression_points(lm1) %>%
  ggplot(aes(x=price_hat, y=residual)) +
    geom_point(alpha=0.3)

Lots of heteroskedasticity is observed in the residuals! Perhaps a linear regression is not the best model for these data.

The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:

qqnorm(resid(lm1))
qqline(resid(lm1))

Conclusion

Overall this simple LR model seems to define the relationship between price and the remaining variables fairly well with an Adjusted $R^2$ of 0.62, but the high degree of heteroskedasticity suggests that any predictions made with this model may be unreliable at certain values. Additional steps to address this condition - or application of an alternative regression model - might be advisable to improve performance.