# libraries
library(tidyverse)
library(moderndive)
library(kableExtra)
library(mice)
library(caret)

# ggplot
theme_set(theme_light())

The Saratoga Houses dataset from the ModernDive package contains 1057 observations of 8 variables from home sales data:

  • price (num)
  • living_area (num)
  • bathrooms (num)
  • bedrooms (num)
  • fireplaces (num)
  • lot_size (num)
  • age (num)
  • fireplace (logical)

We’ll build a simple linear regression model for price as a function of the remaining variables, and analyze the results.

Data Setup and EDA

df = saratoga_houses
head(df) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
price living_area bathrooms bedrooms fireplaces lot_size age fireplace
142212 1982 1.0 3 0 2.00 133 FALSE
134865 1676 1.5 3 1 0.38 14 TRUE
118007 1694 2.0 3 1 0.96 15 TRUE
138297 1800 1.0 2 2 0.48 49 TRUE
129470 2088 1.0 3 1 1.84 29 TRUE
206512 1456 2.0 3 0 0.98 10 FALSE

Missing Data

apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
x
price 0
living_area 0
bathrooms 0
bedrooms 0
fireplaces 0
lot_size 9
age 0
fireplace 0

There are nine missing values for lot_size, we’ll do a quick imputation using mice.

obj_impute <- mice(df)
df2 <- complete(obj_impute)

Summary Statistics

summary(df2)
##      price         living_area     bathrooms        bedrooms    
##  Min.   : 16858   Min.   : 672   Min.   :1.000   Min.   :1.000  
##  1st Qu.:112400   1st Qu.:1342   1st Qu.:1.500   1st Qu.:3.000  
##  Median :152404   Median :1675   Median :2.000   Median :3.000  
##  Mean   :167902   Mean   :1819   Mean   :1.929   Mean   :3.184  
##  3rd Qu.:206512   3rd Qu.:2223   3rd Qu.:2.500   3rd Qu.:4.000  
##  Max.   :599701   Max.   :5228   Max.   :4.500   Max.   :5.000  
##    fireplaces        lot_size           age         fireplace      
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Mode :logical  
##  1st Qu.:0.0000   1st Qu.:0.2100   1st Qu.:  6.00   FALSE:428      
##  Median :1.0000   Median :0.3900   Median : 18.00   TRUE :629      
##  Mean   :0.6244   Mean   :0.5754   Mean   : 28.09                  
##  3rd Qu.:1.0000   3rd Qu.:0.6100   3rd Qu.: 34.00                  
##  Max.   :4.0000   Max.   :9.0000   Max.   :247.00

The median price is $152k, with homes having 1 to 4.5 bathrooms, and 1 to 5 bedrooms. There might be some multicolinnearity between fireplaces (0 to 4) and the fireplace logical variable, but for simplicity we’ll leave everything in.

Normalization

df2_numeric <- df2[,sapply(df2, is.numeric)]

df2_numeric %>%
  pivot_longer(cols = everything(),
               names_to='variables', 
               values_to='values') %>%
  ggplot(aes(x=values)) +
  geom_histogram() +
  facet_wrap(~ variables, scales='free')

Many of our variables have a rightward skew. Linear Regression assumes normality of data, so we’ll take an additional step to center and normalize.

# normalize
obj_normalize <- preProcess(df2, method = c("center", "scale"))
df2n <- predict(obj_normalize, df2)
df2n_numeric <- df2n[,sapply(df2n, is.numeric)]

df2n_numeric %>%
  pivot_longer(cols = everything(),
               names_to='variables', 
               values_to='values') %>%
  ggplot(aes(x=values)) +
  geom_histogram() +
  facet_wrap(~ variables, scales='free')

LR Model

lm1 = lm(price ~ ., data=df2n)

get_regression_table(lm1) %>%
  arrange(p_value) %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
term estimate std_error statistic p_value lower_ci upper_ci
living_area 0.626 0.035 17.855 0.000 0.557 0.694
bathrooms 0.165 0.031 5.306 0.000 0.104 0.226
age -0.067 0.022 -3.043 0.002 -0.110 -0.024
bedrooms -0.058 0.027 -2.171 0.030 -0.110 -0.006
fireplaces 0.112 0.057 1.957 0.051 0.000 0.225
fireplaceTRUE -0.103 0.115 -0.889 0.374 -0.329 0.124
intercept 0.061 0.071 0.855 0.393 -0.079 0.201
lot_size 0.005 0.020 0.238 0.812 -0.034 0.043
get_regression_summaries(lm1) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
r_squared adj_r_squared mse rmse sigma statistic p_value df nobs
0.605 0.603 0.3944202 0.6280288 0.63 229.726 0 7 1057

As you may expect, variables such as living_area, bathrooms and fireplaces are positively related to price, while age is negatively related.

Surprisingly, the number of bedrooms is also negatively related, and lot_size has a weak relationship (large p-value) to price.

Removing those variables with large p-values (fireplace and lot_size) and refitting the model did not have a large impact, so for simplicity we’ll leave them in.

Residual Analysis

get_regression_points(lm1) %>%
  ggplot(aes(x=price_hat, y=residual)) +
    geom_point(alpha=0.3)

The scatterplot shows the residuals to generally randomly distributed around the zero y-axis, but there does appear to be some skew.

The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:

qqnorm(resid(lm1))
qqline(resid(lm1))

Conclusion

Overall this simple LR model seems to define the relationship between price and the remaining variables fairly well with an Adjusted \(R^2\) of 0.60, although some skewness in the distribution of residuals suggests this relationship is not perfectly described, and there may be other variables at work not captured by the model.