The dataset represents real estate information from the City of Hartford and comprises various fields associated with property transactions. Here’s a brief description based on the available fields:
Index: An index or unique identifier for each record.
PropertyID: Identification number for the property.
xrCompositeLandUseID: Composite land use identification number.
xrBuildingTypeID: Building type identification number.
ParcelID: Identification number for the parcel of land.
LocationStartNumber: Starting number of the property’s location.
ApartmentUnitNumber: Unit number or apartment details.
StreetNameAndWay: Name of the street where the property is located.
xrPrimaryNeighborhoodID: Identification number for the primary neighborhood.
LandSF: Land area measured in square feet.
TotalFinishedArea: Total finished area of the property.
LivingUnits: Number of living units on the property.
OwnerLastName: Last name of the property owner.
OwnerFirstName: First name of the property owner.
PrimaryGrantor: Primary grantor or seller involved in the transaction.
SaleDate: Date of the property sale transaction.
SalePrice: Price at which the property was sold.
TotalAppraisedValue: Total appraised value of the property.
LegalReference: Legal reference associated with the property transaction.
xrSalesValidityID: Sales validity identification number.
xrDeedID: Deed identification number related to the property.
This dataset encompasses diverse details about property transactions, including sale prices, ownership information, transaction dates, property characteristics, and legal references, among other relevant information within the City of Hartford’s real estate domain.
Analyze market trends in the City of Hartford’s real estate industry by tracking sale prices and appraised values over time to identify regions who are being under or overvalued.
Conduct a predictive analysis to predict future sales prices, annual appreciation rates, and key features associated with residential properties such as total finished area and living units for investment purposes.
# install the necessary packages
install.packages(c("tidyverse", "ggthemes", "plotly", "Amelia", "caTools", "imputeTS", "knitr"), repos = "https:\\cran.rstudio.com")## Warning: unable to access index for repository https:\cran.rstudio.com/src/contrib:
## cannot open URL 'https:\cran.rstudio.com/src/contrib/PACKAGES'
## Warning: packages 'tidyverse', 'ggthemes', 'plotly', 'Amelia', 'caTools', 'imputeTS', 'knitr' are not available for this version of R
##
## Versions of these packages for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning: unable to access index for repository https:\cran.rstudio.com/bin/windows/contrib/4.3:
## cannot open URL 'https:\cran.rstudio.com/bin/windows/contrib/4.3/PACKAGES'
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.1, built: 2022-11-18)
## ## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# loading the data
hartford <- read.csv("real_estate_730.csv")
# checking the structure and head
str(hartford)## 'data.frame': 4735 obs. of 21 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ PropertyID : int 30 50 58 66 78 79 94 94 98 98 ...
## $ xrCompositeLandUseID : int 3 3 3 3 3 3 3 3 3 3 ...
## $ xrBuildingTypeID : int 4 4 6 6 3 4 5 5 6 6 ...
## $ ParcelID : chr "144-698-048" "121-698-073" "121-699-090" "144-699-037" ...
## $ LocationStartNumber : int 22 164 87 55 5 237 66 66 82 82 ...
## $ ApartmentUnitNumber : chr "" "" "" "" ...
## $ StreetNameAndWay : chr "BRISTOL ST" "HOLLYWOOD AVE" "BRISTOL ST" "BRISTOL ST" ...
## $ xrPrimaryNeighborhoodID: int 2 2 2 2 2 2 2 2 2 2 ...
## $ LandSF : num 6000 5650 5850 5900 7788 ...
## $ TotalFinishedArea : num 1422 1384 1249 1390 960 ...
## $ LivingUnits : int 1 1 1 1 1 1 1 1 1 1 ...
## $ OwnerLastName : chr "KHAN" "GONZALEZ" "SAMPAYO" "CARRILLO PACHECO" ...
## $ OwnerFirstName : chr "BIBI" "LOURDES" "YARLIS" "RICARDO" ...
## $ PrimaryGrantor : chr "ALI MUNTAAZ" "GONZALEZ JOHNETTE" "FLOYD ANNIE J ESTATE" "HARGROVE DAVID L" ...
## $ SaleDate : chr "12/10/2019" "12/18/2019" "5/24/2019" "9/10/2019" ...
## $ SalePrice : int 0 179900 174900 140000 0 156750 140000 127972 0 107500 ...
## $ TotalAppraisedValue : int 136600 158400 144400 142400 147900 145000 144200 144200 125100 125100 ...
## $ LegalReference : chr "07559-0116" "07561-0225" "07485-0035" "07523-0330" ...
## $ xrSalesValidityID : int 5 1 12 1 11 1 9 26 12 26 ...
## $ xrDeedID : int 8 10 1 10 7 10 11 11 6 19 ...
## index PropertyID xrCompositeLandUseID xrBuildingTypeID ParcelID
## 1 0 30 3 4 144-698-048
## 2 1 50 3 4 121-698-073
## 3 2 58 3 6 121-699-090
## 4 3 66 3 6 144-699-037
## 5 4 78 3 3 144-699-025
## 6 5 79 3 4 144-699-024
## LocationStartNumber ApartmentUnitNumber StreetNameAndWay
## 1 22 BRISTOL ST
## 2 164 HOLLYWOOD AVE
## 3 87 BRISTOL ST
## 4 55 BRISTOL ST
## 5 5 BRISTOL ST
## 6 237 NEWINGTON AVE
## xrPrimaryNeighborhoodID LandSF TotalFinishedArea LivingUnits
## 1 2 6000 1422.0 1
## 2 2 5650 1384.0 1
## 3 2 5850 1249.0 1
## 4 2 5900 1390.0 1
## 5 2 7788 960.0 1
## 6 2 8100 1075.2 1
## OwnerLastName OwnerFirstName PrimaryGrantor
## 1 KHAN BIBI ALI MUNTAAZ
## 2 GONZALEZ LOURDES GONZALEZ JOHNETTE
## 3 SAMPAYO YARLIS FLOYD ANNIE J ESTATE
## 4 CARRILLO PACHECO RICARDO HARGROVE DAVID L
## 5 VARJABEDIAN ROSE M (L.U.) VARJABEDIAN ANTHONY ( L U)
## 6 BENITEZ MICHELLE CRANICK PAUL
## SaleDate SalePrice TotalAppraisedValue LegalReference xrSalesValidityID
## 1 12/10/2019 0 136600 07559-0116 5
## 2 12/18/2019 179900 158400 07561-0225 1
## 3 5/24/2019 174900 144400 07485-0035 12
## 4 9/10/2019 140000 142400 07523-0330 1
## 5 5/14/2020 0 147900 07616-0064 11
## 6 6/9/2020 156750 145000 07622-0277 1
## xrDeedID
## 1 8
## 2 10
## 3 1
## 4 10
## 5 7
## 6 10
## [1] TRUE
## xrBuildingTypeID LocationStartNumber LandSF TotalFinishedArea
## 16 1 570 199
## LivingUnits
## 16
# using Amelia package to visually see the null values
missmap(hartford, main = "Missing Map", col = c("yellow", "black"), legend = FALSE)Focusing on SalePrice, TotalAppraisedValue, SaleDate, LocationStartNumber, StreetNameAndWay columns.
## [1] "index" "PropertyID"
## [3] "xrCompositeLandUseID" "xrBuildingTypeID"
## [5] "ParcelID" "LocationStartNumber"
## [7] "ApartmentUnitNumber" "StreetNameAndWay"
## [9] "xrPrimaryNeighborhoodID" "LandSF"
## [11] "TotalFinishedArea" "LivingUnits"
## [13] "OwnerLastName" "OwnerFirstName"
## [15] "PrimaryGrantor" "SaleDate"
## [17] "SalePrice" "TotalAppraisedValue"
## [19] "LegalReference" "xrSalesValidityID"
## [21] "xrDeedID"
# Creating Street Addresses by Combining Location Start Number and StreetNameAndWay Columns
hartford <- hartford %>%
mutate(fullAddress = paste(LocationStartNumber, StreetNameAndWay, sep = " "))
# converting SaleDate to a proper date format
hartford$SaleDate <- mdy(hartford$SaleDate)
# for analysis creating a df with required columns
sale_tracker <- hartford %>%
select(SalePrice, TotalAppraisedValue, SaleDate, fullAddress, xrPrimaryNeighborhoodID)
summary(sale_tracker)## SalePrice TotalAppraisedValue SaleDate fullAddress
## Min. : 0 Min. : 0 Min. :2018-08-27 Length:4735
## 1st Qu.: 0 1st Qu.: 52100 1st Qu.:2019-01-31 Class :character
## Median : 107000 Median : 132000 Median :2019-08-02 Mode :character
## Mean : 733117 Mean : 230927 Mean :2019-07-30
## 3rd Qu.: 305000 3rd Qu.: 175800 3rd Qu.:2019-12-17
## Max. :70500000 Max. :64087000 Max. :2020-08-21
## xrPrimaryNeighborhoodID
## Min. : 1.0
## 1st Qu.: 46.0
## Median :151.0
## Mean :182.5
## 3rd Qu.:280.0
## Max. :496.0
theme_set(theme_solarized())
# visual representation of total sale price value vs total appraised value
total_values <- sale_tracker %>%
summarise(SP = sum(SalePrice),
AV = sum(TotalAppraisedValue))
# Converting the summarized data to long format for plotting
total_values_long <- pivot_longer(total_values, cols = everything(),
names_to = "Metric",
values_to = "Value")
ggplot(total_values_long, aes(x = Metric, y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge", width = 0.5) +
labs(title = "Sales Price Total vs Appraised Value Total") +
theme(axis.text.x = element_text(size = 8),
plot.title = element_text(size = 10, face = "bold")) +
theme(plot.title = element_text(color = "#3498DB", face = "bold")) +
theme(legend.text = element_text(size = 8)) # Sales Price and Appraised Value trend over time
trend <- sale_tracker %>%
group_by(SaleDate) %>%
summarise(Avg_SP = mean(SalePrice),
Avg_AV = mean(TotalAppraisedValue))
plotly1 <- ggplot(trend, aes(x = SaleDate)) +
geom_line(aes(y = Avg_SP, color = "Sale Price")) +
geom_line(aes(y = Avg_AV, color = "Appraised Value")) +
labs(title = "Trend in Sales Price and Appraised Value Over Time",
x = "Date", y = "Average Value", color = "Type") +
scale_color_manual(values = c("Sale Price" = "blue", "Appraised Value" = "red")) +
theme(plot.title = element_text(size = 10, color = "#3498DB", face = "bold")) +
theme(legend.text = element_text(size = 8)) +
theme(legend.title = element_text(size = 10))
ggplotly(plotly1)The above visualization is interactive—explore the data with your cursor.
Significant Discrepancies Between Actual Sale Price and Appraised Value, Peaking in May and July for Maximum Sale Prices.
sale_month <- sale_tracker %>%
group_by(SaleMonth) %>%
summarise(Sales_Count = n())
ggplot(sale_month, aes(x = SaleMonth, y = Sales_Count)) +
geom_col(color = "black", aes(fill = Sales_Count)) +
labs(title = "Month wise Property Sales Count", x = " ",
y = "Sale Count", subtitle = "December saw the highest number of sales, with January following closely behind.") +
guides(fill = FALSE) +
theme(plot.title = element_text(size = 13, face = "bold")) +
theme(plot.subtitle = element_text(size = 8, color = "black"))## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Checking for correlation between Appraised Value and Sale Price
cor(sale_tracker$SalePrice, sale_tracker$TotalAppraisedValue)## [1] 0.5998602
There is a moderate positive correlation of 0.5998602 between Sale Price and Appraised Value.
# Building a model
set.seed(456)
sample <- sample.split(hartford$SalePrice, SplitRatio = 0.70)
train <- subset(hartford, sample == TRUE)
test <- subset(hartford, sample == FALSE)
model <- lm(SalePrice ~ TotalAppraisedValue + SaleDate + TotalFinishedArea + LandSF + LivingUnits
, data = train)
summary(model)##
## Call:
## lm(formula = SalePrice ~ TotalAppraisedValue + SaleDate + TotalFinishedArea +
## LandSF + LivingUnits, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14011575 -659111 -554133 -370679 16781526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.001e+06 2.666e+06 1.126 0.260
## TotalAppraisedValue 1.498e+00 5.583e-02 26.822 <2e-16 ***
## SaleDate -1.320e+02 1.472e+02 -0.897 0.370
## TotalFinishedArea -3.598e+01 2.809e+00 -12.811 <2e-16 ***
## LandSF -7.696e+00 8.642e-01 -8.906 <2e-16 ***
## LivingUnits 1.199e+03 3.414e+03 0.351 0.725
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1686000 on 3398 degrees of freedom
## Multiple R-squared: 0.3225, Adjusted R-squared: 0.3215
## F-statistic: 323.5 on 5 and 3398 DF, p-value: < 2.2e-16
Model Summary Breakdown: Factors Influencing Sale Price
The linear regression model was employed to understand the factors influencing Sale Price. The model was constructed using variables such as Total Appraised Value, Sale Date, Total Finished Area, Land Square Footage (LandSF), and Living Units.
The coefficients derived from the model offer insights into the impact of each variable on Sale Price. Notably, Total Appraised Value exhibited a significant positive correlation (Coefficient Estimate: 1.498e+00, p < 0.001, t-value: 26.822), indicating that an increase in the Appraised Value leads to a proportional rise in the Sale Price. Additionally, Total Finished Area and LandSF showcased negative correlations, implying that as these values increase, the Sale Price tends to decrease (Total Finished Area: Coefficient Estimate: -3.598e+01, p < 0.001, t-value: -12.811; LandSF: Coefficient Estimate: -7.696e+00, p < 0.001, t-value: -8.906).
Sale Date and Living Units, however, did not exhibit statistically significant relationships with Sale Price (Sale Date: Coefficient Estimate: -1.320e+02, p = 0.370; Living Units: Coefficient Estimate: 1.199e+03, p = 0.725).
The analysis of Hartford’s real estate market, encompassing 4735 observations and 21 variables, provides valuable insights into property transactions within the city. Exploratory Data Analysis (EDA) revealed trends and correlations, allowing for a deeper understanding of the market dynamics.
Visual representations displayed key metrics such as total sales prices versus total appraised values, highlighting variations and trends over time. Notable discrepancies between actual sale prices and appraised values were identified, particularly during peak periods in May and July, coinciding with spikes in sale prices.
The analysis uncovered a moderate positive correlation (0.5998602) between Sale Price and Appraised Value, indicating a relationship between these variables. Furthermore, predictive modeling utilizing linear regression showcased Total Appraised Value, Total Finished Area, and Land SF as influential predictors for Sale Price.
To complement these findings, interactive visualizations were developed using Plotly with ggplot, enhancing the accessibility and engagement of the analysis. Additionally, a Tableau visualization was created to provide further insights into forecasted sales prices and other relevant visuals, offering a comprehensive overview of Hartford’s real estate trends.
This analysis serves as a valuable resource for stakeholders, offering a holistic view of the real estate landscape in Hartford. The correlations, trends, and predictive models derived from this analysis empower informed decision-making, aiding in identifying investment opportunities and navigating the dynamic real estate market landscape effectively.
The integration of interactive visualization tools such as Plotly and Tableau adds depth and clarity to the analysis, enabling stakeholders to interactively explore and understand the nuances of Hartford’s real estate market dynamics.