Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis.  Was the linear model appropriate? Why or why not?
The King County House Sales dataset from the ModernDive package
contains 1339 observations of 43 variables from home sales data. I’ve
subsetted the following 15 variables for this model:
We’ll build a multiple linear regression model for price
as a function of the remaining variables, and analyze the results.
df = house_prices
head(df) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F) %>%
scroll_box(width='100%', height = '200px')
| id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7129300520 | 2014-10-13 | 221900 | 3 | 1.00 | 1180 | 5650 | 1 | FALSE | 0 | 3 | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 6414100192 | 2014-12-09 | 538000 | 3 | 2.25 | 2570 | 7242 | 2 | FALSE | 0 | 3 | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.7210 | -122.319 | 1690 | 7639 |
| 5631500400 | 2015-02-25 | 180000 | 2 | 1.00 | 770 | 10000 | 1 | FALSE | 0 | 3 | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 2487200875 | 2014-12-09 | 604000 | 4 | 3.00 | 1960 | 5000 | 1 | FALSE | 0 | 5 | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 1954400510 | 2015-02-18 | 510000 | 3 | 2.00 | 1680 | 8080 | 1 | FALSE | 0 | 3 | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |
| 7237550310 | 2014-05-12 | 1225000 | 4 | 4.50 | 5420 | 101930 | 1 | FALSE | 0 | 3 | 11 | 3890 | 1530 | 2001 | 0 | 98053 | 47.6561 | -122.005 | 4760 | 101930 |
apply(df, 2, function(col) sum(is.na(col))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F) %>%
scroll_box(width='100%', height = '200px')
| x | |
|---|---|
| id | 0 |
| date | 0 |
| price | 0 |
| bedrooms | 0 |
| bathrooms | 0 |
| sqft_living | 0 |
| sqft_lot | 0 |
| floors | 0 |
| waterfront | 0 |
| view | 0 |
| condition | 0 |
| grade | 0 |
| sqft_above | 0 |
| sqft_basement | 0 |
| yr_built | 0 |
| yr_renovated | 0 |
| zipcode | 0 |
| lat | 0 |
| long | 0 |
| sqft_living15 | 0 |
| sqft_lot15 | 0 |
No missing data.
summary(df)
## id date price bedrooms
## Length:21613 Min. :2014-05-02 Min. : 75000 Min. : 0.000
## Class :character 1st Qu.:2014-07-22 1st Qu.: 321950 1st Qu.: 3.000
## Mode :character Median :2014-10-16 Median : 450000 Median : 3.000
## Mean :2014-10-29 Mean : 540088 Mean : 3.371
## 3rd Qu.:2015-02-17 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :2015-05-27 Max. :7700000 Max. :33.000
##
## bathrooms sqft_living sqft_lot floors
## Min. :0.000 Min. : 290 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1427 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.115 Mean : 2080 Mean : 15107 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10688 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade sqft_above
## Mode :logical Min. :0.0000 1: 30 7 :8981 Min. : 290
## FALSE:21450 1st Qu.:0.0000 2: 172 8 :6068 1st Qu.:1190
## TRUE :163 Median :0.0000 3:14031 9 :2615 Median :1560
## Mean :0.2343 4: 5679 6 :2038 Mean :1788
## 3rd Qu.:0.0000 5: 1701 10 :1134 3rd Qu.:2210
## Max. :4.0000 11 : 399 Max. :9410
## (Other): 378
## sqft_basement yr_built yr_renovated zipcode
## Min. : 0.0 Min. :1900 Min. : 0.0 98103 : 602
## 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.0 98038 : 590
## Median : 0.0 Median :1975 Median : 0.0 98115 : 583
## Mean : 291.5 Mean :1971 Mean : 84.4 98052 : 574
## 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.0 98117 : 553
## Max. :4820.0 Max. :2015 Max. :2015.0 98042 : 548
## (Other):18163
## lat long sqft_living15 sqft_lot15
## Min. :47.16 Min. :-122.5 Min. : 399 Min. : 651
## 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490 1st Qu.: 5100
## Median :47.57 Median :-122.2 Median :1840 Median : 7620
## Mean :47.56 Mean :-122.2 Mean :1987 Mean : 12768
## 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360 3rd Qu.: 10083
## Max. :47.78 Max. :-121.3 Max. :6210 Max. :871200
##
The median price is $450k, with homes built betweeen
1900 and 2015. There is a wide variety of square footage of building and
property, as well as number of floors, bedrooms and bathrooms.
Information about waterfront views and renovations are also
included.
Inspecting the relationships between each of the numeric variables
and the response, bathrooms seems slighly curvilinear -
we’ll add a Quadratic term for it in the model, and illustrate it
temporarily in the scatterplot matrix below.
df %>%
select(where(is.numeric)) %>%
mutate(bathrooms2 = bathrooms ^ 2) %>%
pivot_longer(cols = -price, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value, y = price)) +
geom_point(alpha=0.1) +
facet_wrap(~variable, scales = "free")
The variables waterfront and yr_renovated
are both booleans, and good candidates for Dichotomous / dummy
variables.
df <- df %>%
mutate(waterfront = ifelse(waterfront == FALSE,0,1),
renovated = ifelse(yr_renovated == 0,0,1))
It may be interesting to understand the relationship between
bathrooms and price for homes that underwent
renovation, versus those that did not. We’ll add an Interaction term of
bathrooms:renovated to the model.
# predictors only, keep the scale of the coefficients
df_y <- df %>% select(price)
df_x <- df %>% select(!price)
obj_normalize <- preProcess(df_x, method = c("center", "scale"))
df_xn <- predict(obj_normalize, df_x)
df_n <- bind_cols(df_y,df_xn)
lm1 <- lm(price ~ bathrooms + I(bathrooms^2) + bedrooms + floors + renovated + sqft_above +
sqft_basement + sqft_living15 + bathrooms:renovated +
sqft_lot15 + waterfront + yr_built + yr_renovated, data=df_n)
get_regression_table(lm1) %>%
arrange(p_value, desc(estimate)) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| yr_renovated | 1347613.57 | 202025.796 | 6.671 | 0 | 951628.097 | 1743599.04 |
| intercept | 510395.69 | 1812.412 | 281.611 | 0 | 506843.231 | 513948.16 |
| sqft_above | 172817.05 | 3242.388 | 53.299 | 0 | 166461.729 | 179172.37 |
| sqft_basement | 101645.19 | 2084.618 | 48.760 | 0 | 97559.184 | 105731.19 |
| sqft_living15 | 75477.73 | 2489.686 | 30.316 | 0 | 70597.761 | 80357.70 |
| waterfront | 61281.28 | 1571.345 | 38.999 | 0 | 58201.330 | 64361.24 |
| floors | 36522.59 | 2127.364 | 17.168 | 0 | 32352.795 | 40692.38 |
| bathrooms | 31657.12 | 2865.471 | 11.048 | 0 | 26040.584 | 37273.65 |
| I(bathrooms^2) | 29190.69 | 954.800 | 30.573 | 0 | 27319.212 | 31062.17 |
| bathrooms:renovated | 10010.56 | 1414.550 | 7.077 | 0 | 7237.934 | 12783.18 |
| sqft_lot15 | -19280.90 | 1598.517 | -12.062 | 0 | -22414.109 | -16147.69 |
| bedrooms | -45726.09 | 1967.926 | -23.236 | 0 | -49583.367 | -41868.81 |
| yr_built | -84255.47 | 2097.607 | -40.167 | 0 | -88366.932 | -80144.00 |
| renovated | -1342496.98 | 201970.895 | -6.647 | 0 | -1738374.844 | -946619.11 |
get_regression_summaries(lm1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.618 | 0.618 | 51486209417 | 226905.7 | 226979.3 | 2687.769 | 0 | 13 | 21613 |
All our predictors have statistically significant p-values. The
predictors were normalized but response was not, so we can interpret
each coefficient as the dollar increase/decrease in price
for every standard deviation increase in the predictor. This
approach helps us see the relative impact of each predictor on
price.
As we might expect, increases in building square footage are associated with increases in price - however, increases in property square footage are associated with decreases in price, as are the number of bedrooms.
Our Quadratic term bathooms^2 had a slightly smaller
positive association with price than bathrooms
itself, suggesting a weaker curvilinear relationship than linear after
all. In practice we’d expect these two variables to be collinear, and
would probably drop the weaker Quadratic term.
The Interaction term bathrooms:renovated has a positive
relationship with price.
There’s an interesting dichotomy between our renovated
Dichotomous variable (strongly negatively associated with price) and the
year in which a renovation took place (recent renovations are strongly
positively associated with price.) One hypothesis - older homes in need
of remodeling tend to command lower prices, but an actual remodel shoots
the price right back up.
get_regression_points(lm1) %>%
ggplot(aes(x=price_hat, y=residual)) +
geom_point(alpha=0.3)
Lots of heteroskedasticity is observed in the residuals! Perhaps a linear regression is not the best model for these data.
The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:
qqnorm(resid(lm1))
qqline(resid(lm1))
Overall this simple LR model seems to define the relationship between
price and the remaining variables fairly well with an
Adjusted \(R^2\) of 0.62, but the high
degree of heteroskedasticity suggests that any predictions made with
this model may be unreliable at certain values. Additional steps to
address this condition - or application of an alternative regression
model - might be advisable to improve performance.