CASE STUDY

The dataset represents real estate information from the City of Hartford and comprises various fields associated with property transactions. Here’s a brief description based on the available fields:

This dataset encompasses diverse details about property transactions, including sale prices, ownership information, transaction dates, property characteristics, and legal references, among other relevant information within the City of Hartford’s real estate domain.

Objectives

Exploratory Data Analysis

# install the necessary packages
install.packages(c("tidyverse", "ggthemes", "plotly", "Amelia", "caTools", "imputeTS", "knitr"), repos = "https:\\cran.rstudio.com")
## Warning: unable to access index for repository https:\cran.rstudio.com/src/contrib:
##   cannot open URL 'https:\cran.rstudio.com/src/contrib/PACKAGES'
## Warning: packages 'tidyverse', 'ggthemes', 'plotly', 'Amelia', 'caTools', 'imputeTS', 'knitr' are not available for this version of R
## 
## Versions of these packages for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning: unable to access index for repository https:\cran.rstudio.com/bin/windows/contrib/4.3:
##   cannot open URL 'https:\cran.rstudio.com/bin/windows/contrib/4.3/PACKAGES'
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(knitr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr   1.1.4     ✔ readr   2.1.4
## ✔ forcats 1.0.0     ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4     ✔ tibble  3.2.1
## ✔ purrr   1.0.2     ✔ tidyr   1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(Amelia)
## Loading required package: Rcpp
## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(caTools)
library(imputeTS)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
# loading the data 
hartford <- read.csv("real_estate_730.csv")

# checking the structure and head
str(hartford)
## 'data.frame':    4735 obs. of  21 variables:
##  $ index                  : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ PropertyID             : int  30 50 58 66 78 79 94 94 98 98 ...
##  $ xrCompositeLandUseID   : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ xrBuildingTypeID       : int  4 4 6 6 3 4 5 5 6 6 ...
##  $ ParcelID               : chr  "144-698-048" "121-698-073" "121-699-090" "144-699-037" ...
##  $ LocationStartNumber    : int  22 164 87 55 5 237 66 66 82 82 ...
##  $ ApartmentUnitNumber    : chr  "" "" "" "" ...
##  $ StreetNameAndWay       : chr  "BRISTOL ST" "HOLLYWOOD AVE" "BRISTOL ST" "BRISTOL ST" ...
##  $ xrPrimaryNeighborhoodID: int  2 2 2 2 2 2 2 2 2 2 ...
##  $ LandSF                 : num  6000 5650 5850 5900 7788 ...
##  $ TotalFinishedArea      : num  1422 1384 1249 1390 960 ...
##  $ LivingUnits            : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ OwnerLastName          : chr  "KHAN" "GONZALEZ" "SAMPAYO" "CARRILLO PACHECO" ...
##  $ OwnerFirstName         : chr  "BIBI" "LOURDES" "YARLIS" "RICARDO" ...
##  $ PrimaryGrantor         : chr  "ALI MUNTAAZ" "GONZALEZ JOHNETTE" "FLOYD ANNIE J ESTATE" "HARGROVE DAVID L" ...
##  $ SaleDate               : chr  "12/10/2019" "12/18/2019" "5/24/2019" "9/10/2019" ...
##  $ SalePrice              : int  0 179900 174900 140000 0 156750 140000 127972 0 107500 ...
##  $ TotalAppraisedValue    : int  136600 158400 144400 142400 147900 145000 144200 144200 125100 125100 ...
##  $ LegalReference         : chr  "07559-0116" "07561-0225" "07485-0035" "07523-0330" ...
##  $ xrSalesValidityID      : int  5 1 12 1 11 1 9 26 12 26 ...
##  $ xrDeedID               : int  8 10 1 10 7 10 11 11 6 19 ...
head(hartford)
##   index PropertyID xrCompositeLandUseID xrBuildingTypeID    ParcelID
## 1     0         30                    3                4 144-698-048
## 2     1         50                    3                4 121-698-073
## 3     2         58                    3                6 121-699-090
## 4     3         66                    3                6 144-699-037
## 5     4         78                    3                3 144-699-025
## 6     5         79                    3                4 144-699-024
##   LocationStartNumber ApartmentUnitNumber StreetNameAndWay
## 1                  22                           BRISTOL ST
## 2                 164                        HOLLYWOOD AVE
## 3                  87                           BRISTOL ST
## 4                  55                           BRISTOL ST
## 5                   5                           BRISTOL ST
## 6                 237                        NEWINGTON AVE
##   xrPrimaryNeighborhoodID LandSF TotalFinishedArea LivingUnits
## 1                       2   6000            1422.0           1
## 2                       2   5650            1384.0           1
## 3                       2   5850            1249.0           1
## 4                       2   5900            1390.0           1
## 5                       2   7788             960.0           1
## 6                       2   8100            1075.2           1
##               OwnerLastName OwnerFirstName             PrimaryGrantor
## 1                      KHAN           BIBI                ALI MUNTAAZ
## 2                  GONZALEZ        LOURDES          GONZALEZ JOHNETTE
## 3                   SAMPAYO         YARLIS       FLOYD ANNIE J ESTATE
## 4          CARRILLO PACHECO        RICARDO           HARGROVE DAVID L
## 5 VARJABEDIAN ROSE M (L.U.)                VARJABEDIAN ANTHONY ( L U)
## 6                   BENITEZ       MICHELLE               CRANICK PAUL
##     SaleDate SalePrice TotalAppraisedValue LegalReference xrSalesValidityID
## 1 12/10/2019         0              136600     07559-0116                 5
## 2 12/18/2019    179900              158400     07561-0225                 1
## 3  5/24/2019    174900              144400     07485-0035                12
## 4  9/10/2019    140000              142400     07523-0330                 1
## 5  5/14/2020         0              147900     07616-0064                11
## 6   6/9/2020    156750              145000     07622-0277                 1
##   xrDeedID
## 1        8
## 2       10
## 3        1
## 4       10
## 5        7
## 6       10
# Searching for Null Values within the Dataset
any(is.na(hartford)) # there are null values
## [1] TRUE
# let's look for column wise null's
col_na <- colSums(is.na(hartford)) 
col_na[col_na > 0]
##    xrBuildingTypeID LocationStartNumber              LandSF   TotalFinishedArea 
##                  16                   1                 570                 199 
##         LivingUnits 
##                  16
# using Amelia package to visually see the null values
missmap(hartford, main = "Missing Map", col = c("yellow", "black"), legend = FALSE)

# imputting nulls with mean value
hartford <- na_mean(hartford)

Focusing on SalePrice, TotalAppraisedValue, SaleDate, LocationStartNumber, StreetNameAndWay columns.

colnames(hartford)
##  [1] "index"                   "PropertyID"             
##  [3] "xrCompositeLandUseID"    "xrBuildingTypeID"       
##  [5] "ParcelID"                "LocationStartNumber"    
##  [7] "ApartmentUnitNumber"     "StreetNameAndWay"       
##  [9] "xrPrimaryNeighborhoodID" "LandSF"                 
## [11] "TotalFinishedArea"       "LivingUnits"            
## [13] "OwnerLastName"           "OwnerFirstName"         
## [15] "PrimaryGrantor"          "SaleDate"               
## [17] "SalePrice"               "TotalAppraisedValue"    
## [19] "LegalReference"          "xrSalesValidityID"      
## [21] "xrDeedID"
# Creating Street Addresses by Combining Location Start Number and StreetNameAndWay Columns
hartford <- hartford %>% 
  mutate(fullAddress = paste(LocationStartNumber, StreetNameAndWay, sep = " "))
    

# converting SaleDate to a proper date format
hartford$SaleDate <- mdy(hartford$SaleDate)
 
# for analysis creating a df with required columns
sale_tracker <- hartford %>% 
                select(SalePrice, TotalAppraisedValue, SaleDate, fullAddress, xrPrimaryNeighborhoodID)

summary(sale_tracker)
##    SalePrice        TotalAppraisedValue    SaleDate          fullAddress       
##  Min.   :       0   Min.   :       0    Min.   :2018-08-27   Length:4735       
##  1st Qu.:       0   1st Qu.:   52100    1st Qu.:2019-01-31   Class :character  
##  Median :  107000   Median :  132000    Median :2019-08-02   Mode  :character  
##  Mean   :  733117   Mean   :  230927    Mean   :2019-07-30                     
##  3rd Qu.:  305000   3rd Qu.:  175800    3rd Qu.:2019-12-17                     
##  Max.   :70500000   Max.   :64087000    Max.   :2020-08-21                     
##  xrPrimaryNeighborhoodID
##  Min.   :  1.0          
##  1st Qu.: 46.0          
##  Median :151.0          
##  Mean   :182.5          
##  3rd Qu.:280.0          
##  Max.   :496.0
# extracting year and month components
sale_tracker <- sale_tracker %>% 
                mutate(SaleYear = year(SaleDate),
                       SaleMonth = month(SaleDate, label = TRUE, abbr = TRUE))

Analysis

theme_set(theme_solarized())

# visual representation of total sale price value vs total appraised value
total_values <- sale_tracker %>% 
                summarise(SP = sum(SalePrice),
                          AV = sum(TotalAppraisedValue))

# Converting the summarized data to long format for plotting
total_values_long <- pivot_longer(total_values, cols = everything(),
                                  names_to = "Metric",
                                  values_to = "Value")

ggplot(total_values_long, aes(x = Metric, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  labs(title = "Sales Price Total vs Appraised Value Total") +
  theme(axis.text.x = element_text(size = 8),
        plot.title = element_text(size = 10, face = "bold")) +
  theme(plot.title = element_text(color = "#3498DB", face = "bold")) +
  theme(legend.text = element_text(size = 8)) 

# Sales Price and Appraised Value trend over time
trend <- sale_tracker %>% 
  group_by(SaleDate) %>% 
  summarise(Avg_SP = mean(SalePrice),
            Avg_AV = mean(TotalAppraisedValue))

plotly1 <- ggplot(trend, aes(x = SaleDate)) +
  geom_line(aes(y = Avg_SP, color = "Sale Price")) +
  geom_line(aes(y = Avg_AV, color = "Appraised Value")) +
  labs(title = "Trend in Sales Price and Appraised Value Over Time", 
       x = "Date", y = "Average Value", color = "Type") +
  scale_color_manual(values = c("Sale Price" = "blue", "Appraised Value" = "red")) +
  theme(plot.title = element_text(size = 10, color = "#3498DB", face = "bold")) +
  theme(legend.text = element_text(size = 8)) +
  theme(legend.title = element_text(size = 10))
      
ggplotly(plotly1)

The above visualization is interactive—explore the data with your cursor.

Significant Discrepancies Between Actual Sale Price and Appraised Value, Peaking in May and July for Maximum Sale Prices.

sale_month <- sale_tracker %>% 
  group_by(SaleMonth) %>% 
  summarise(Sales_Count = n())

ggplot(sale_month, aes(x = SaleMonth, y = Sales_Count)) +
  geom_col(color = "black", aes(fill = Sales_Count)) +
  labs(title = "Month wise Property Sales Count", x = " ", 
       y = "Sale Count", subtitle = "December saw the highest number of sales, with January following closely behind.") +
  guides(fill = FALSE) +
  theme(plot.title = element_text(size = 13, face = "bold")) +
  theme(plot.subtitle = element_text(size = 8, color = "black"))
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Checking for correlation between Appraised Value and Sale Price
cor(sale_tracker$SalePrice, sale_tracker$TotalAppraisedValue)
## [1] 0.5998602

There is a moderate positive correlation of 0.5998602 between Sale Price and Appraised Value.

# Building a model
set.seed(456)

sample <- sample.split(hartford$SalePrice, SplitRatio = 0.70)

train <- subset(hartford, sample == TRUE)

test <- subset(hartford, sample == FALSE)


model <- lm(SalePrice ~ TotalAppraisedValue + SaleDate + TotalFinishedArea + LandSF + LivingUnits
            , data = train)

summary(model)
## 
## Call:
## lm(formula = SalePrice ~ TotalAppraisedValue + SaleDate + TotalFinishedArea + 
##     LandSF + LivingUnits, data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -14011575   -659111   -554133   -370679  16781526 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.001e+06  2.666e+06   1.126    0.260    
## TotalAppraisedValue  1.498e+00  5.583e-02  26.822   <2e-16 ***
## SaleDate            -1.320e+02  1.472e+02  -0.897    0.370    
## TotalFinishedArea   -3.598e+01  2.809e+00 -12.811   <2e-16 ***
## LandSF              -7.696e+00  8.642e-01  -8.906   <2e-16 ***
## LivingUnits          1.199e+03  3.414e+03   0.351    0.725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1686000 on 3398 degrees of freedom
## Multiple R-squared:  0.3225, Adjusted R-squared:  0.3215 
## F-statistic: 323.5 on 5 and 3398 DF,  p-value: < 2.2e-16

Model Summary Breakdown: Factors Influencing Sale Price

The linear regression model was employed to understand the factors influencing Sale Price. The model was constructed using variables such as Total Appraised Value, Sale Date, Total Finished Area, Land Square Footage (LandSF), and Living Units.

The coefficients derived from the model offer insights into the impact of each variable on Sale Price. Notably, Total Appraised Value exhibited a significant positive correlation (Coefficient Estimate: 1.498e+00, p < 0.001, t-value: 26.822), indicating that an increase in the Appraised Value leads to a proportional rise in the Sale Price. Additionally, Total Finished Area and LandSF showcased negative correlations, implying that as these values increase, the Sale Price tends to decrease (Total Finished Area: Coefficient Estimate: -3.598e+01, p < 0.001, t-value: -12.811; LandSF: Coefficient Estimate: -7.696e+00, p < 0.001, t-value: -8.906).

Sale Date and Living Units, however, did not exhibit statistically significant relationships with Sale Price (Sale Date: Coefficient Estimate: -1.320e+02, p = 0.370; Living Units: Coefficient Estimate: 1.199e+03, p = 0.725).

Conclusion

The analysis of Hartford’s real estate market, encompassing 4735 observations and 21 variables, provides valuable insights into property transactions within the city. Exploratory Data Analysis (EDA) revealed trends and correlations, allowing for a deeper understanding of the market dynamics.

Visual representations displayed key metrics such as total sales prices versus total appraised values, highlighting variations and trends over time. Notable discrepancies between actual sale prices and appraised values were identified, particularly during peak periods in May and July, coinciding with spikes in sale prices.

The analysis uncovered a moderate positive correlation (0.5998602) between Sale Price and Appraised Value, indicating a relationship between these variables. Furthermore, predictive modeling utilizing linear regression showcased Total Appraised Value, Total Finished Area, and Land SF as influential predictors for Sale Price.

To complement these findings, interactive visualizations were developed using Plotly with ggplot, enhancing the accessibility and engagement of the analysis. Additionally, a Tableau visualization was created to provide further insights into forecasted sales prices and other relevant visuals, offering a comprehensive overview of Hartford’s real estate trends.

This analysis serves as a valuable resource for stakeholders, offering a holistic view of the real estate landscape in Hartford. The correlations, trends, and predictive models derived from this analysis empower informed decision-making, aiding in identifying investment opportunities and navigating the dynamic real estate market landscape effectively.

The integration of interactive visualization tools such as Plotly and Tableau adds depth and clarity to the analysis, enabling stakeholders to interactively explore and understand the nuances of Hartford’s real estate market dynamics.

Link to Tableau Dashboard