Q: Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?
A: I chose the real estate sales data set from the State of CT that covers the years 2001-2021 and has 1.05 million real estate sale records.
library(readr)
library(lubridate)
library(ggplot2)
library(dplyr)
filename <- "Real_Estate_Sales_2001-2021_GL.csv"
fullFilePath <- file.path(path, filename)
RE_df <- read_csv(fullFilePath)
## Rows: 1054159 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
## dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# converting the Date Recorded column from character to date format (lubridate library)
RE_df$`Date Recorded` <- mdy(RE_df$`Date Recorded`)
realEstateData <- RE_df
Data Exploration and Preparation
# missing values
sum(is.na(realEstateData))
## [1] 4263092
# summary stats
summary(RE_df[,sapply(RE_df, is.numeric)])
## Serial Number List Year Assessed Value Sale Amount
## Min. :0.000e+00 Min. :2001 Min. : 0 Min. :0.000e+00
## 1st Qu.:3.055e+04 1st Qu.:2004 1st Qu.: 88450 1st Qu.:1.422e+05
## Median :8.008e+04 Median :2011 Median : 139580 Median :2.300e+05
## Mean :5.027e+05 Mean :2011 Mean : 279742 Mean :3.990e+05
## 3rd Qu.:1.608e+05 3rd Qu.:2017 3rd Qu.: 227000 3rd Qu.:3.700e+05
## Max. :2.001e+09 Max. :2021 Max. :881510000 Max. :5.000e+09
## Sales Ratio
## Min. : 0.0
## 1st Qu.: 0.5
## Median : 0.6
## Mean : 10.0
## 3rd Qu.: 0.8
## Max. :1226420.0
# year
RE_df$Year <- year(RE_df$`Date Recorded`)
# month
RE_df$Month <- month(RE_df$`Date Recorded`)
# a combined Year-Month for plotting
RE_df$YearMonth <- paste(RE_df$Year, sprintf("%02d", RE_df$Month), sep = "-")
# convert YearMonth to a Date by adding "-01" to each YearMonth value
RE_df$YearMonthDate <- as.Date(paste0(RE_df$YearMonth, "-01"))
RE_df$YearRecorded <- as.numeric(format(as.Date(RE_df$`Date Recorded`, format = "%Y-%m-%d"), "%Y"))
RE_df$`Residential Type` <- as.factor(RE_df$`Residential Type`)
ggplot(RE_df, aes(x = YearMonthDate, y = `Sale Amount`)) +
geom_line(stat = "summary", fun = mean) +
scale_x_date(date_breaks = "1 year", date_labels = "%Y-%m") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "Average Real Estate Sale Prices in CT by Month (2001-2021)",
x = "Year-Month",
y = "Average Sale Price")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_summary()`).
It looks like the average sales price was consistently between
250k-500k, except for a couple of outliers towards the end of 2016 and
mid-2021.
Trend Analysis: The Linear Model
# LM of sale price against year
lm_model1 <- lm(`Sale Amount` ~ Year, data = RE_df)
summary(lm_model1)
##
## Call:
## lm(formula = `Sale Amount` ~ Year, data = RE_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -494435 -258544 -161870 -23384 4999512862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.830e+07 1.565e+06 -11.70 <2e-16 ***
## Year 9.297e+03 7.779e+02 11.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5229000 on 1054155 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.0001355, Adjusted R-squared: 0.0001345
## F-statistic: 142.8 on 1 and 1054155 DF, p-value: < 2.2e-16
The coefficient for Year is 9,297 (or 9.297e+03), indicating that, on average, the sale amount increases by approximately $9,297 with each passing year. This is a significant positive trend, implying that real estate prices have been increasing over the years in CT. The p values are VERY low, indicating strong statistical significance. But there are strangely large sales prices, most likely because the data set is not limited to residential real estate only, but includes everything. So I am going to limit to residential real estate so that it would make more sense.
# Summary of each categorical variable
for(col in c("Property Type", "Residential Type")){
print(paste("Summary for", col))
print(summary(factor(RE_df[[col]])))
cat("\n\n")
}
## [1] "Summary for Property Type"
## Apartments Commercial Condo Four Family Industrial
## 943 4208 105420 2150 533
## Public Utility Residential Single Family Three Family Two Family
## 8 112099 401612 12586 26408
## Vacant Land NA's
## 5746 382446
##
##
## [1] "Summary for Residential Type"
## Condo Four Family Single Family Three Family Two Family
## 128789 2763 480566 15542 32615
## NA's
## 393884
# I could also check for how many sales were in each town and 'Non Use Code" and location: (but I don't need it right)
# for(col in c("Town","Non Use Code", "Assessor Remarks", "OPM remarks", "Location")){
# print(paste("Summary for", col))
# print(summary(factor(RE_df[[col]])))
# cat("\n\n")
#}
ggplot(RE_df, aes(x = `Property Type`)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Distribution of Real Estate Sales by Property Type", x = "Property Type", y = "Count")
RE_df %>%
group_by(`Property Type`) %>%
summarize(AverageSaleAmount = mean(`Sale Amount`, na.rm = TRUE))
## # A tibble: 12 × 2
## `Property Type` AverageSaleAmount
## <chr> <dbl>
## 1 Apartments 8445812.
## 2 Commercial 1750861.
## 3 Condo 260211.
## 4 Four Family 314291.
## 5 Industrial 2236719.
## 6 Public Utility 240756.
## 7 Residential 460993.
## 8 Single Family 388514.
## 9 Three Family 179845.
## 10 Two Family 199045.
## 11 Vacant Land 463155.
## 12 <NA> 413434.
# So clearly, there are outliers.
# So let's focus on residential homes and apartments and ignore the commercial, industrial, and public utility
So clearly, there are outliers. So let’s focus on residential homes and apartments and ignore the commercial, industrial, and public utility
filtered_df <- RE_df %>%
filter(`Property Type` %in% c("Residential", "Apartments", "Single Family", "Two Family", "Three Family"))
summary_stats <- filtered_df %>%
group_by(`Property Type`) %>%
summarise(
Count = n(),
Mean = mean(`Sale Amount`, na.rm = TRUE),
Median = median(`Sale Amount`, na.rm = TRUE),
Min = min(`Sale Amount`, na.rm = TRUE),
Max = max(`Sale Amount`, na.rm = TRUE),
Q1 = quantile(`Sale Amount`, 0.25, na.rm = TRUE),
Q3 = quantile(`Sale Amount`, 0.75, na.rm = TRUE),
IQR = IQR(`Sale Amount`, na.rm = TRUE)
)
summary_stats
## # A tibble: 5 × 9
## `Property Type` Count Mean Median Min Max Q1 Q3 IQR
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Apartments 943 8445812. 461250 17000 5000000000 279250 1232500 9.53e5
## 2 Residential 112099 460993. 300000 2000 318790019 205000 465000 2.6 e5
## 3 Single Family 401612 388514. 250000 0 120000000 166000 395000 2.29e5
## 4 Three Family 12586 179845. 153000 1671 6700000 82000 235000 1.53e5
## 5 Two Family 26408 199045. 162000 0 17900000 92000 241512. 1.50e5
Since the apartments are small in number, yet they are adding extreme values to the data set, I will filter them out as well. Then, I will rebuild the linear model.
A linear model that predicts the sales price by: assessed amount, residential type, list year, data recorded:
filtered_df <- filtered_df %>%
filter(`Property Type` %in% c("Residential", "Single Family", "Two Family", "Three Family"))
# This leaves us with still more than half a million real estate sale records.
# Linear model 2
lm_model2 <- lm(`Sale Amount` ~ `Assessed Value` + `Residential Type` + `List Year` + YearRecorded, data = filtered_df)
summary(lm_model2)
##
## Call:
## lm(formula = `Sale Amount` ~ `Assessed Value` + `Residential Type` +
## `List Year` + YearRecorded, data = filtered_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56028363 -152427 -84404 19123 318381054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.387e+06 5.186e+05 -14.244 < 2e-16 ***
## `Assessed Value` 5.058e-01 1.829e-03 276.590 < 2e-16 ***
## `Residential Type`Four Family 4.701e+04 3.576e+04 1.315 0.189
## `Residential Type`Single Family 5.436e+04 6.065e+03 8.963 < 2e-16 ***
## `Residential Type`Three Family -6.782e+04 9.188e+03 -7.382 1.57e-13 ***
## `Residential Type`Two Family -7.384e+04 7.654e+03 -9.647 < 2e-16 ***
## `List Year` -2.316e+04 2.691e+03 -8.608 < 2e-16 ***
## YearRecorded 2.693e+04 2.680e+03 10.046 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 873900 on 552697 degrees of freedom
## Multiple R-squared: 0.125, Adjusted R-squared: 0.125
## F-statistic: 1.128e+04 on 7 and 552697 DF, p-value: < 2.2e-16
LM interpretation:
I would have liked to create a new variable of the difference between the listing date and the sale date, but the listing date only shows the year without month or day, so it would not be a very good new variable to help. Otherwise, I would expect that properties that did not last long on the market had an interesting relationship with sales price.
Model diagnostics:
# observed vs. predicted values
predicted_values <- predict(lm_model2, filtered_df)
plot(filtered_df$`Sale Amount`, predicted_values, main = "Observed vs. Predicted Values", xlab = "Observed Values", ylab = "Predicted Values")
abline(0, 1)
# residuals vs. predicted values
plot(predicted_values, residuals(lm_model2), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")
plot(predicted_values, residuals(lm_model2), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(residuals(lm_model2))
qqline(residuals(lm_model2), col = "red")
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(lm_model2)
## GVIF Df GVIF^(1/(2*Df))
## `Assessed Value` 1.003546 1 1.001772
## `Residential Type` 1.079069 4 1.009558
## `List Year` 118.227799 1 10.873261
## YearRecorded 118.132236 1 10.868865
par(mfrow = c(2, 2)) # plots into a 2x2 grid
plot(lm_model2)
I can conclude that there is heteroscedasticity in the model and evidence of non-linearity. There also seems to be still outliers.
So I will instead build a simpler model:
lm_model3 <- lm(`Sale Amount` ~ `Assessed Value` + YearRecorded, data = filtered_df)
summary(lm_model3)
##
## Call:
## lm(formula = `Sale Amount` ~ `Assessed Value` + YearRecorded,
## data = filtered_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56383730 -152619 -85097 14625 318402068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.740e+06 4.975e+05 -13.55 <2e-16 ***
## `Assessed Value` 5.092e-01 1.827e-03 278.65 <2e-16 ***
## YearRecorded 3.471e+03 2.469e+02 14.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 874700 on 552702 degrees of freedom
## Multiple R-squared: 0.1234, Adjusted R-squared: 0.1233
## F-statistic: 3.889e+04 on 2 and 552702 DF, p-value: < 2.2e-16
Model output: For each dollar increase in the assessed value, the sale amount increases by approximately 51 cents. Additionally, each additional year is associated with an average increase in sale amount by $3,471. The model explains about 12.34% of the variability in the sale amounts, indicating a modest fit. Both predictors are highly statistically significant.
predicted_values <- predict(lm_model3, filtered_df)
# observed vs. predicted values
plot(filtered_df$`Sale Amount`, predicted_values, main = "Observed vs. Predicted Values", xlab = "Observed Values", ylab = "Predicted Values")
abline(0, 1)
# residuals vs. predicted values
plot(predicted_values, residuals(lm_model3), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")
plot(predicted_values, residuals(lm_model3), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")
qqnorm(residuals(lm_model3))
qqline(residuals(lm_model3), col = "red")
library(car)
vif(lm_model3)
## `Assessed Value` YearRecorded
## 1.000117 1.000117
par(mfrow = c(2, 2)) # plots into a 2x2 grid
plot(lm_model3)
Take away from the plots: