Q: Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

A: I chose the real estate sales data set published by the State of CT, which covers the years 2001-2021 and contains about 1.05 million real estate sale records.

library(readr)
library(lubridate)
library(ggplot2)
library(dplyr)
# `path` should point to the folder that contains the downloaded CSV file
filename <- "Real_Estate_Sales_2001-2021_GL.csv"
fullFilePath <- file.path(path, filename)
RE_df <- read_csv(fullFilePath)
## Rows: 1054159 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
## dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# converting the Date Recorded column from character to date format (lubridate library)
RE_df$`Date Recorded` <- mdy(RE_df$`Date Recorded`)
realEstateData <- RE_df

Data Exploration and Preparation

# missing values
sum(is.na(realEstateData))
## [1] 4263092
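# (sketch) per-column breakdown of the missing values, to see which columns the NAs live in
colSums(is.na(realEstateData))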
# summary stats
summary(RE_df[,sapply(RE_df, is.numeric)])
##  Serial Number         List Year    Assessed Value       Sale Amount       
##  Min.   :0.000e+00   Min.   :2001   Min.   :        0   Min.   :0.000e+00  
##  1st Qu.:3.055e+04   1st Qu.:2004   1st Qu.:    88450   1st Qu.:1.422e+05  
##  Median :8.008e+04   Median :2011   Median :   139580   Median :2.300e+05  
##  Mean   :5.027e+05   Mean   :2011   Mean   :   279742   Mean   :3.990e+05  
##  3rd Qu.:1.608e+05   3rd Qu.:2017   3rd Qu.:   227000   3rd Qu.:3.700e+05  
##  Max.   :2.001e+09   Max.   :2021   Max.   :881510000   Max.   :5.000e+09  
##   Sales Ratio       
##  Min.   :      0.0  
##  1st Qu.:      0.5  
##  Median :      0.6  
##  Mean   :     10.0  
##  3rd Qu.:      0.8  
##  Max.   :1226420.0
# year
RE_df$Year <- year(RE_df$`Date Recorded`)
# month
RE_df$Month <- month(RE_df$`Date Recorded`)
# a combined Year-Month for plotting
RE_df$YearMonth <- paste(RE_df$Year, sprintf("%02d", RE_df$Month), sep = "-")
# convert YearMonth to a Date by adding "-01" to each YearMonth value
RE_df$YearMonthDate <- as.Date(paste0(RE_df$YearMonth, "-01"))

# numeric year of the recorded sale date (used as a predictor in the models below)
RE_df$YearRecorded <- as.numeric(format(RE_df$`Date Recorded`, "%Y"))
RE_df$`Residential Type` <- as.factor(RE_df$`Residential Type`)

ggplot(RE_df, aes(x = YearMonthDate, y = `Sale Amount`)) +
  geom_line(stat = "summary", fun = mean) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y-%m") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Average Real Estate Sale Prices in CT by Month (2001-2021)", 
       x = "Year-Month", 
       y = "Average Sale Price")
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_summary()`).

It looks like the average sale price stayed consistently between roughly $250k and $500k, except for a couple of outlier spikes toward the end of 2016 and in mid-2021.
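
To pin down which months produced those spikes, one option (a quick dplyr sketch; output not shown) is to rank the monthly averages:

RE_df %>%
  group_by(YearMonthDate) %>%
  summarise(AvgSale = mean(`Sale Amount`, na.rm = TRUE)) %>%
  arrange(desc(AvgSale)) %>%
  head(5)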

Trend Analysis: The Linear Model

# LM of sale price against year
lm_model1 <- lm(`Sale Amount` ~ Year, data = RE_df)
summary(lm_model1)
## 
## Call:
## lm(formula = `Sale Amount` ~ Year, data = RE_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##    -494435    -258544    -161870     -23384 4999512862 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.830e+07  1.565e+06  -11.70   <2e-16 ***
## Year         9.297e+03  7.779e+02   11.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5229000 on 1054155 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.0001355,  Adjusted R-squared:  0.0001345 
## F-statistic: 142.8 on 1 and 1054155 DF,  p-value: < 2.2e-16

The coefficient for Year is 9,297 (9.297e+03), indicating that, on average, the sale amount increases by approximately $9,297 with each passing year. This is a statistically significant positive trend (the p-values are extremely small), implying that real estate prices in CT have been rising over the years. That said, the model explains almost none of the variability (R-squared ≈ 0.0001), and there are some strangely large sale prices, most likely because the data set is not limited to residential real estate but includes every property type. So I am going to limit the data to residential real estate, where a price trend makes more sense.
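
A quick way to put a margin of error on that $9,297-per-year estimate (a one-line sketch using base R's confint) is a 95% confidence interval for the Year coefficient:

# 95% confidence interval for the Year trend in lm_model1
confint(lm_model1, "Year", level = 0.95)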

# Summary of each categorical variable
for(col in c("Property Type", "Residential Type")){
  print(paste("Summary for", col))
  print(summary(factor(RE_df[[col]])))
  cat("\n\n")
}
## [1] "Summary for Property Type"
##     Apartments     Commercial          Condo    Four Family     Industrial 
##            943           4208         105420           2150            533 
## Public Utility    Residential  Single Family   Three Family     Two Family 
##              8         112099         401612          12586          26408 
##    Vacant Land           NA's 
##           5746         382446 
## 
## 
## [1] "Summary for Residential Type"
##         Condo   Four Family Single Family  Three Family    Two Family 
##        128789          2763        480566         15542         32615 
##          NA's 
##        393884
# I could also check how many sales there were in each town, 'Non Use Code', and location (but I don't need that right now):
# for(col in c("Town","Non Use Code", "Assessor Remarks", "OPM remarks", "Location")){
#  print(paste("Summary for", col))
#  print(summary(factor(RE_df[[col]])))
#  cat("\n\n")
#}

ggplot(RE_df, aes(x = `Property Type`)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Distribution of Real Estate Sales by Property Type", x = "Property Type", y = "Count")

RE_df %>%
  group_by(`Property Type`) %>%
  summarize(AverageSaleAmount = mean(`Sale Amount`, na.rm = TRUE))
## # A tibble: 12 × 2
##    `Property Type` AverageSaleAmount
##    <chr>                       <dbl>
##  1 Apartments               8445812.
##  2 Commercial               1750861.
##  3 Condo                     260211.
##  4 Four Family               314291.
##  5 Industrial               2236719.
##  6 Public Utility            240756.
##  7 Residential               460993.
##  8 Single Family             388514.
##  9 Three Family              179845.
## 10 Two Family                199045.
## 11 Vacant Land               463155.
## 12 <NA>                      413434.

So clearly, there are outliers. Let's focus on residential homes and apartments and ignore the commercial, industrial, and public-utility properties.

filtered_df <- RE_df %>%
  filter(`Property Type` %in% c("Residential", "Apartments", "Single Family", "Two Family", "Three Family"))

summary_stats <- filtered_df %>%
  group_by(`Property Type`) %>%
  summarise(
    Count = n(),
    Mean = mean(`Sale Amount`, na.rm = TRUE),
    Median = median(`Sale Amount`, na.rm = TRUE),
    Min = min(`Sale Amount`, na.rm = TRUE),
    Max = max(`Sale Amount`, na.rm = TRUE),
    Q1 = quantile(`Sale Amount`, 0.25, na.rm = TRUE),
    Q3 = quantile(`Sale Amount`, 0.75, na.rm = TRUE),
    IQR = IQR(`Sale Amount`, na.rm = TRUE)
  )

summary_stats
## # A tibble: 5 × 9
##   `Property Type`  Count     Mean Median   Min        Max     Q1       Q3    IQR
##   <chr>            <int>    <dbl>  <dbl> <dbl>      <dbl>  <dbl>    <dbl>  <dbl>
## 1 Apartments         943 8445812. 461250 17000 5000000000 279250 1232500  9.53e5
## 2 Residential     112099  460993. 300000  2000  318790019 205000  465000  2.6 e5
## 3 Single Family   401612  388514. 250000     0  120000000 166000  395000  2.29e5
## 4 Three Family     12586  179845. 153000  1671    6700000  82000  235000  1.53e5
## 5 Two Family       26408  199045. 162000     0   17900000  92000  241512. 1.50e5

Since the apartments are few in number yet add extreme values to the data set, I will filter them out as well and then rebuild the linear model.

A linear model that predicts the sale price from the assessed value, residential type, list year, and year recorded:

filtered_df <- filtered_df %>%
  filter(`Property Type` %in% c("Residential", "Single Family", "Two Family", "Three Family"))

# This still leaves us with more than half a million real estate sale records.

# Linear model 2
lm_model2 <- lm(`Sale Amount` ~ `Assessed Value` + `Residential Type` + `List Year` + YearRecorded, data = filtered_df)
summary(lm_model2)
## 
## Call:
## lm(formula = `Sale Amount` ~ `Assessed Value` + `Residential Type` + 
##     `List Year` + YearRecorded, data = filtered_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -56028363   -152427    -84404     19123 318381054 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -7.387e+06  5.186e+05 -14.244  < 2e-16 ***
## `Assessed Value`                 5.058e-01  1.829e-03 276.590  < 2e-16 ***
## `Residential Type`Four Family    4.701e+04  3.576e+04   1.315    0.189    
## `Residential Type`Single Family  5.436e+04  6.065e+03   8.963  < 2e-16 ***
## `Residential Type`Three Family  -6.782e+04  9.188e+03  -7.382 1.57e-13 ***
## `Residential Type`Two Family    -7.384e+04  7.654e+03  -9.647  < 2e-16 ***
## `List Year`                     -2.316e+04  2.691e+03  -8.608  < 2e-16 ***
## YearRecorded                     2.693e+04  2.680e+03  10.046  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 873900 on 552697 degrees of freedom
## Multiple R-squared:  0.125,  Adjusted R-squared:  0.125 
## F-statistic: 1.128e+04 on 7 and 552697 DF,  p-value: < 2.2e-16

LM interpretation (coefficients are relative to the baseline residential type, Condo, holding the other predictors constant; a sketch for pulling these estimates with confidence intervals follows the list):

  1. Single Family: an increase of about $54,360 in sale amount compared to a condo;
  2. Two Family: a decrease of about $73,840;
  3. Three Family: a decrease of about $67,820;
  4. Four Family: the coefficient is not statistically significant (p ≈ 0.19), so there is no clear difference from condos.
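
To read these off with their uncertainty, one option (a small base-R sketch) is to combine the point estimates with 95% confidence intervals and keep only the Residential Type rows:

coefs <- cbind(Estimate = coef(lm_model2), confint(lm_model2))
coefs[grep("Residential Type", rownames(coefs)), ]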

I would have liked to create a new variable for the difference between the listing date and the sale date, but the list date only records the year (no month or day), so the resulting variable would be too coarse to be very useful. Otherwise, I would expect time on the market to have an interesting relationship with sale price; a rough year-level proxy is sketched below.
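
As a rough stand-in, the whole-year gap between the recorded year and the list year can still be computed (a sketch; output not shown), even though it is too coarse to say much:

# coarse proxy for time on market: whole years between list year and recorded year
summary(filtered_df$YearRecorded - filtered_df$`List Year`)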

Model diagnostics:

Linearity and Independence

# observed vs. predicted values
predicted_values <- predict(lm_model2, filtered_df)

plot(filtered_df$`Sale Amount`, predicted_values, main = "Observed vs. Predicted Values", xlab = "Observed Values", ylab = "Predicted Values")
abline(0, 1)

# residuals vs. predicted values
plot(predicted_values, residuals(lm_model2), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Homoscedasticity

# same residuals vs. predicted plot, reused here to check for constant error variance;
# a funnel shape (spread widening with the fitted values) indicates heteroscedasticity
plot(predicted_values, residuals(lm_model2), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")
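
To complement the visual check, a formal test (a sketch assuming the lmtest package is installed) is the Breusch-Pagan test, where a small p-value points to heteroscedasticity:

library(lmtest)
# Breusch-Pagan test: the null hypothesis is constant error variance
bptest(lm_model2)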

Normality of Residuals (normal Q-Q plot)

qqnorm(residuals(lm_model2))
qqline(residuals(lm_model2), col = "red")
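
A numeric check is awkward here because shapiro.test() only accepts between 3 and 5,000 observations, but one option (a sketch) is to test a random sample of the residuals; with this many records even small departures from normality get flagged, so the Q-Q plot remains the more informative check:

# Shapiro-Wilk normality test on a random sample of 5,000 residuals
set.seed(123)
shapiro.test(sample(residuals(lm_model2), 5000))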

Multicollinearity

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(lm_model2)
##                          GVIF Df GVIF^(1/(2*Df))
## `Assessed Value`     1.003546  1        1.001772
## `Residential Type`   1.079069  4        1.009558
## `List Year`        118.227799  1       10.873261
## YearRecorded       118.132236  1       10.868865
par(mfrow = c(2, 2)) # plots into a 2x2 grid
plot(lm_model2)
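
Since the GVIFs for `List Year` and YearRecorded are both around 118, the two year variables appear to carry nearly the same information; a one-line check (a sketch; output not shown) is their correlation:

cor(filtered_df$`List Year`, filtered_df$YearRecorded, use = "complete.obs")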

I can conclude that there is heteroscedasticity in the model and evidence of non-linearity, and there still appear to be influential outliers. The VIF output also shows that `List Year` and YearRecorded are highly collinear, so one of them is largely redundant as a predictor.

So I will instead build a simpler model:

lm_model3 <- lm(`Sale Amount` ~ `Assessed Value` + YearRecorded, data = filtered_df)
summary(lm_model3)
## 
## Call:
## lm(formula = `Sale Amount` ~ `Assessed Value` + YearRecorded, 
##     data = filtered_df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -56383730   -152619    -85097     14625 318402068 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.740e+06  4.975e+05  -13.55   <2e-16 ***
## `Assessed Value`  5.092e-01  1.827e-03  278.65   <2e-16 ***
## YearRecorded      3.471e+03  2.469e+02   14.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 874700 on 552702 degrees of freedom
## Multiple R-squared:  0.1234, Adjusted R-squared:  0.1233 
## F-statistic: 3.889e+04 on 2 and 552702 DF,  p-value: < 2.2e-16

Model output: For each dollar increase in the assessed value, the sale amount increases by approximately 51 cents. Additionally, each additional year is associated with an average increase in sale amount by $3,471. The model explains about 12.34% of the variability in the sale amounts, indicating a modest fit. Both predictors are highly statistically significant.
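
As a concrete illustration (a sketch; the property values below are hypothetical), the fitted model can be used to predict the sale amount for a single made-up case:

# hypothetical property: assessed at $200,000, with the sale recorded in 2020
new_obs <- data.frame(`Assessed Value` = 200000, YearRecorded = 2020, check.names = FALSE)
predict(lm_model3, newdata = new_obs)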

predicted_values <- predict(lm_model3, filtered_df)
# observed vs. predicted values
plot(filtered_df$`Sale Amount`, predicted_values, main = "Observed vs. Predicted Values", xlab = "Observed Values", ylab = "Predicted Values")
abline(0, 1)

# residuals vs. predicted values
plot(predicted_values, residuals(lm_model3), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")

# same residuals vs. predicted plot for lm_model3, reused to check for constant error variance
plot(predicted_values, residuals(lm_model3), main = "Residuals vs. Predicted Values", xlab = "Predicted Values", ylab = "Residuals")
abline(h = 0, col = "red")

qqnorm(residuals(lm_model3))
qqline(residuals(lm_model3), col = "red")

library(car)
vif(lm_model3)
## `Assessed Value`     YearRecorded 
##         1.000117         1.000117
par(mfrow = c(2, 2)) # plots into a 2x2 grid
plot(lm_model3)

Takeaways from the plots: