# libraries
library(tidyverse)
library(moderndive)
library(kableExtra)
library(mice)
library(caret)
# ggplot
theme_set(theme_light())
The Saratoga Houses dataset from the ModernDive package
contains 1057 observations of 8 variables from home sales data:
We’ll build a simple linear regression model for price
as a function of the remaining variables, and analyze the results.
df = saratoga_houses
head(df) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| price | living_area | bathrooms | bedrooms | fireplaces | lot_size | age | fireplace |
|---|---|---|---|---|---|---|---|
| 142212 | 1982 | 1.0 | 3 | 0 | 2.00 | 133 | FALSE |
| 134865 | 1676 | 1.5 | 3 | 1 | 0.38 | 14 | TRUE |
| 118007 | 1694 | 2.0 | 3 | 1 | 0.96 | 15 | TRUE |
| 138297 | 1800 | 1.0 | 2 | 2 | 0.48 | 49 | TRUE |
| 129470 | 2088 | 1.0 | 3 | 1 | 1.84 | 29 | TRUE |
| 206512 | 1456 | 2.0 | 3 | 0 | 0.98 | 10 | FALSE |
apply(df, 2, function(col) sum(is.na(col))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| x | |
|---|---|
| price | 0 |
| living_area | 0 |
| bathrooms | 0 |
| bedrooms | 0 |
| fireplaces | 0 |
| lot_size | 9 |
| age | 0 |
| fireplace | 0 |
There are nine missing values for lot_size, we’ll do a
quick imputation using mice.
obj_impute <- mice(df)
df2 <- complete(obj_impute)
summary(df2)
## price living_area bathrooms bedrooms
## Min. : 16858 Min. : 672 Min. :1.000 Min. :1.000
## 1st Qu.:112400 1st Qu.:1342 1st Qu.:1.500 1st Qu.:3.000
## Median :152404 Median :1675 Median :2.000 Median :3.000
## Mean :167902 Mean :1819 Mean :1.929 Mean :3.184
## 3rd Qu.:206512 3rd Qu.:2223 3rd Qu.:2.500 3rd Qu.:4.000
## Max. :599701 Max. :5228 Max. :4.500 Max. :5.000
## fireplaces lot_size age fireplace
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Mode :logical
## 1st Qu.:0.0000 1st Qu.:0.2100 1st Qu.: 6.00 FALSE:428
## Median :1.0000 Median :0.3900 Median : 18.00 TRUE :629
## Mean :0.6244 Mean :0.5754 Mean : 28.09
## 3rd Qu.:1.0000 3rd Qu.:0.6100 3rd Qu.: 34.00
## Max. :4.0000 Max. :9.0000 Max. :247.00
The median price is $152k, with homes having 1 to 4.5
bathrooms, and 1 to 5 bedrooms. There might be some multicolinnearity
between fireplaces (0 to 4) and the fireplace
logical variable, but for simplicity we’ll leave everything in.
df2_numeric <- df2[,sapply(df2, is.numeric)]
df2_numeric %>%
pivot_longer(cols = everything(),
names_to='variables',
values_to='values') %>%
ggplot(aes(x=values)) +
geom_histogram() +
facet_wrap(~ variables, scales='free')
Many of our variables have a rightward skew. Linear Regression assumes normality of data, so we’ll take an additional step to center and normalize.
# normalize
obj_normalize <- preProcess(df2, method = c("center", "scale"))
df2n <- predict(obj_normalize, df2)
df2n_numeric <- df2n[,sapply(df2n, is.numeric)]
df2n_numeric %>%
pivot_longer(cols = everything(),
names_to='variables',
values_to='values') %>%
ggplot(aes(x=values)) +
geom_histogram() +
facet_wrap(~ variables, scales='free')
lm1 = lm(price ~ ., data=df2n)
get_regression_table(lm1) %>%
arrange(p_value) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| living_area | 0.626 | 0.035 | 17.855 | 0.000 | 0.557 | 0.694 |
| bathrooms | 0.165 | 0.031 | 5.306 | 0.000 | 0.104 | 0.226 |
| age | -0.067 | 0.022 | -3.043 | 0.002 | -0.110 | -0.024 |
| bedrooms | -0.058 | 0.027 | -2.171 | 0.030 | -0.110 | -0.006 |
| fireplaces | 0.112 | 0.057 | 1.957 | 0.051 | 0.000 | 0.225 |
| fireplaceTRUE | -0.103 | 0.115 | -0.889 | 0.374 | -0.329 | 0.124 |
| intercept | 0.061 | 0.071 | 0.855 | 0.393 | -0.079 | 0.201 |
| lot_size | 0.005 | 0.020 | 0.238 | 0.812 | -0.034 | 0.043 |
get_regression_summaries(lm1) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.605 | 0.603 | 0.3944202 | 0.6280288 | 0.63 | 229.726 | 0 | 7 | 1057 |
As you may expect, variables such as living_area,
bathrooms and fireplaces are positively
related to price, while age is negatively
related.
Surprisingly, the number of bedrooms is also negatively
related, and lot_size has a weak relationship (large
p-value) to price.
Removing those variables with large p-values (fireplace
and lot_size) and refitting the model did not have a large
impact, so for simplicity we’ll leave them in.
get_regression_points(lm1) %>%
ggplot(aes(x=price_hat, y=residual)) +
geom_point(alpha=0.3)
The scatterplot shows the residuals to generally randomly distributed around the zero y-axis, but there does appear to be some skew.
The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:
qqnorm(resid(lm1))
qqline(resid(lm1))
Overall this simple LR model seems to define the relationship between
price and the remaining variables fairly well with an
Adjusted \(R^2\) of 0.60, although some
skewness in the distribution of residuals suggests this relationship is
not perfectly described, and there may be other variables at work not
captured by the model.