Exercise 11 - Linear Regression

# libraries
library(tidyverse)
library(moderndive)
library(kableExtra)
library(mice)
library(caret)

# ggplot
theme_set(theme_light())

The Saratoga Houses dataset from the ModernDive package contains 1057 observations of 8 variables from home sales data:

price (num)
living_area (num)
bathrooms (num)
bedrooms (num)
fireplaces (num)
lot_size (num)
age (num)
fireplace (logical)

We’ll build a simple linear regression model for price as a function of the remaining variables, and analyze the results.

Data Setup and EDA

df = saratoga_houses
head(df) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

price	living_area	bathrooms	bedrooms	fireplaces	lot_size	age	fireplace
142212	1982	1.0	3	0	2.00	133	FALSE
134865	1676	1.5	3	1	0.38	14	TRUE
118007	1694	2.0	3	1	0.96	15	TRUE
138297	1800	1.0	2	2	0.48	49	TRUE
129470	2088	1.0	3	1	1.84	29	TRUE
206512	1456	2.0	3	0	0.98	10	FALSE

Missing Data

apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

	x
price	0
living_area	0
bathrooms	0
bedrooms	0
fireplaces	0
lot_size	9
age	0
fireplace	0

There are nine missing values for lot_size, we’ll do a quick imputation using mice.

obj_impute <- mice(df)
df2 <- complete(obj_impute)

Summary Statistics

summary(df2)

##      price         living_area     bathrooms        bedrooms    
##  Min.   : 16858   Min.   : 672   Min.   :1.000   Min.   :1.000  
##  1st Qu.:112400   1st Qu.:1342   1st Qu.:1.500   1st Qu.:3.000  
##  Median :152404   Median :1675   Median :2.000   Median :3.000  
##  Mean   :167902   Mean   :1819   Mean   :1.929   Mean   :3.184  
##  3rd Qu.:206512   3rd Qu.:2223   3rd Qu.:2.500   3rd Qu.:4.000  
##  Max.   :599701   Max.   :5228   Max.   :4.500   Max.   :5.000  
##    fireplaces        lot_size           age         fireplace      
##  Min.   :0.0000   Min.   :0.0000   Min.   :  0.00   Mode :logical  
##  1st Qu.:0.0000   1st Qu.:0.2100   1st Qu.:  6.00   FALSE:428      
##  Median :1.0000   Median :0.3900   Median : 18.00   TRUE :629      
##  Mean   :0.6244   Mean   :0.5754   Mean   : 28.09                  
##  3rd Qu.:1.0000   3rd Qu.:0.6100   3rd Qu.: 34.00                  
##  Max.   :4.0000   Max.   :9.0000   Max.   :247.00

The median price is $152k, with homes having 1 to 4.5 bathrooms, and 1 to 5 bedrooms. There might be some multicolinnearity between fireplaces (0 to 4) and the fireplace logical variable, but for simplicity we’ll leave everything in.

Normalization

df2_numeric <- df2[,sapply(df2, is.numeric)]

df2_numeric %>%
  pivot_longer(cols = everything(),
               names_to='variables', 
               values_to='values') %>%
  ggplot(aes(x=values)) +
  geom_histogram() +
  facet_wrap(~ variables, scales='free')

Many of our variables have a rightward skew. Linear Regression assumes normality of data, so we’ll take an additional step to center and normalize.

# normalize
obj_normalize <- preProcess(df2, method = c("center", "scale"))
df2n <- predict(obj_normalize, df2)

df2n_numeric <- df2n[,sapply(df2n, is.numeric)]

df2n_numeric %>%
  pivot_longer(cols = everything(),
               names_to='variables', 
               values_to='values') %>%
  ggplot(aes(x=values)) +
  geom_histogram() +
  facet_wrap(~ variables, scales='free')

LR Model

lm1 = lm(price ~ ., data=df2n)

get_regression_table(lm1) %>%
  arrange(p_value) %>%
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
living_area	0.626	0.035	17.855	0.000	0.557	0.694
bathrooms	0.165	0.031	5.306	0.000	0.104	0.226
age	-0.067	0.022	-3.043	0.002	-0.110	-0.024
bedrooms	-0.058	0.027	-2.171	0.030	-0.110	-0.006
fireplaces	0.112	0.057	1.957	0.051	0.000	0.225
fireplaceTRUE	-0.103	0.115	-0.889	0.374	-0.329	0.124
intercept	0.061	0.071	0.855	0.393	-0.079	0.201
lot_size	0.005	0.020	0.238	0.812	-0.034	0.043

get_regression_summaries(lm1) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.605	0.603	0.3944202	0.6280288	0.63	229.726	0	7	1057

As you may expect, variables such as living_area, bathrooms and fireplaces are positively related to price, while age is negatively related.

Surprisingly, the number of bedrooms is also negatively related, and lot_size has a weak relationship (large p-value) to price.

Removing those variables with large p-values (fireplace and lot_size) and refitting the model did not have a large impact, so for simplicity we’ll leave them in.

Residual Analysis

get_regression_points(lm1) %>%
  ggplot(aes(x=price_hat, y=residual)) +
    geom_point(alpha=0.3)

The scatterplot shows the residuals to generally randomly distributed around the zero y-axis, but there does appear to be some skew.

The Q-Q- plot provides a clearer picture of non-normally distributed values towards the extremes, confirming a skew to the distribution of residuals:

qqnorm(resid(lm1))
qqline(resid(lm1))

Conclusion

Overall this simple LR model seems to define the relationship between price and the remaining variables fairly well with an Adjusted $R^2$ of 0.60, although some skewness in the distribution of residuals suggests this relationship is not perfectly described, and there may be other variables at work not captured by the model.