Real Estate

Author

Kittim

I am analyzing a comprehensive real estate dataset covering the years 2001 to 2020. The data, obtained from data.cvt.gov and curated by Connecticut, provides crucial insights into residential real estate sales trends over two decades. The key variable, ‘list_year,’ indicates the year of sale recording. The dataset also includes information on property types, assessed values, sale amount , and towns in Connecticut. My analysis aims to understand the yearly distribution of residential sales, shedding light on the evolution of Connecticut’s real estate market. Also, I intend to analyze the role of Assessed Values in predicting the Sale Amount for residential properties in Connecticut and how the sale of property types compare across a selected number of towns.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(RColorBrewer)

getwd()

[1] "C:/Users/mutho/Desktop/Fall 2023/Data 110/DATASETS/Project 3"

real_estate_sales <- read_csv("Real_Estate_Sales_2001-2020_GL.csv")

Rows: 997213 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(real_estate_sales)

# A tibble: 6 × 14
  `Serial Number` `List Year` `Date Recorded` Town    Address   `Assessed Value`
            <dbl>       <dbl> <chr>           <chr>   <chr>                <dbl>
1         2020348        2020 09/13/2021      Ansonia 230 WAKE…           150500
2           20002        2020 10/02/2020      Ashford 390 TURN…           253000
3          200212        2020 03/09/2021      Avon    5 CHESTN…           130400
4          200243        2020 04/13/2021      Avon    111 NORT…           619290
5          200377        2020 07/02/2021      Avon    70 FAR H…           862330
6          200109        2020 12/09/2020      Avon    57 FAR H…           847520
# ℹ 8 more variables: `Sale Amount` <dbl>, `Sales Ratio` <dbl>,
#   `Property Type` <chr>, `Residential Type` <chr>, `Non Use Code` <chr>,
#   `Assessor Remarks` <chr>, `OPM remarks` <chr>, Location <chr>

# Eliminating blank variables from the dataset
real_estate <- real_estate_sales |>
  select(-`Non Use Code`, -`Assessor Remarks`, -`OPM remarks`, -Location) 
head(real_estate)

# A tibble: 6 × 10
  `Serial Number` `List Year` `Date Recorded` Town    Address   `Assessed Value`
            <dbl>       <dbl> <chr>           <chr>   <chr>                <dbl>
1         2020348        2020 09/13/2021      Ansonia 230 WAKE…           150500
2           20002        2020 10/02/2020      Ashford 390 TURN…           253000
3          200212        2020 03/09/2021      Avon    5 CHESTN…           130400
4          200243        2020 04/13/2021      Avon    111 NORT…           619290
5          200377        2020 07/02/2021      Avon    70 FAR H…           862330
6          200109        2020 12/09/2020      Avon    57 FAR H…           847520
# ℹ 4 more variables: `Sale Amount` <dbl>, `Sales Ratio` <dbl>,
#   `Property Type` <chr>, `Residential Type` <chr>

# Change variable names to lowercase
colnames(real_estate) <- tolower(colnames(real_estate))

#Replace the spaces with underscores in variable names
colnames(real_estate) <- gsub(" ", "_", colnames(real_estate))
head(real_estate)

# A tibble: 6 × 10
  serial_number list_year date_recorded town  address assessed_value sale_amount
          <dbl>     <dbl> <chr>         <chr> <chr>            <dbl>       <dbl>
1       2020348      2020 09/13/2021    Anso… 230 WA…         150500      325000
2         20002      2020 10/02/2020    Ashf… 390 TU…         253000      430000
3        200212      2020 03/09/2021    Avon  5 CHES…         130400      179900
4        200243      2020 04/13/2021    Avon  111 NO…         619290      890000
5        200377      2020 07/02/2021    Avon  70 FAR…         862330     1447500
6        200109      2020 12/09/2020    Avon  57 FAR…         847520     1250000
# ℹ 3 more variables: sales_ratio <dbl>, property_type <chr>,
#   residential_type <chr>

# creating another variable sale_proportion and arranging the variable in the order of usage for this project
real_estate <- real_estate |>
  mutate(sale_proportion = sales_ratio * sale_amount)

head(real_estate)

# A tibble: 6 × 11
  serial_number list_year date_recorded town  address assessed_value sale_amount
          <dbl>     <dbl> <chr>         <chr> <chr>            <dbl>       <dbl>
1       2020348      2020 09/13/2021    Anso… 230 WA…         150500      325000
2         20002      2020 10/02/2020    Ashf… 390 TU…         253000      430000
3        200212      2020 03/09/2021    Avon  5 CHES…         130400      179900
4        200243      2020 04/13/2021    Avon  111 NO…         619290      890000
5        200377      2020 07/02/2021    Avon  70 FAR…         862330     1447500
6        200109      2020 12/09/2020    Avon  57 FAR…         847520     1250000
# ℹ 4 more variables: sales_ratio <dbl>, property_type <chr>,
#   residential_type <chr>, sale_proportion <dbl>

glimpse(real_estate)

Rows: 997,213
Columns: 11
$ serial_number    <dbl> 2020348, 20002, 200212, 200243, 200377, 200109, 20201…
$ list_year        <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
$ date_recorded    <chr> "09/13/2021", "10/02/2020", "03/09/2021", "04/13/2021…
$ town             <chr> "Ansonia", "Ashford", "Avon", "Avon", "Avon", "Avon",…
$ address          <chr> "230 WAKELEE AVE", "390 TURNPIKE RD", "5 CHESTNUT DRI…
$ assessed_value   <dbl> 150500, 253000, 130400, 619290, 862330, 847520, 23420…
$ sale_amount      <dbl> 325000, 430000, 179900, 890000, 1447500, 1250000, 130…
$ sales_ratio      <dbl> 0.4630000, 0.5883000, 0.7248000, 0.6958000, 0.5957000…
$ property_type    <chr> "Commercial", "Residential", "Residential", "Resident…
$ residential_type <chr> NA, "Single Family", "Condo", "Single Family", "Singl…
$ sale_proportion  <dbl> 150475.0, 252969.0, 130391.5, 619262.0, 862275.8, 847…

The visualization shows a dashboard containing two line charts. The top line chart describes the trend of property sale on average in Connecticut over the years from 2001 to 2020. The visualizations helps create a forecast and predict the average sale amount on all residential properties in the next four years.

The second visual is categorized on residential type which defines the type of property on sale. The visual is important in helping understand the dynamics behind every sale and the type of properties that did well over the years. For example, a single family home seems to have done quite well in generating total sales over the years. For investors interested in real estate market in Connecticut, the visual can help predict what is likely going to happen in terms of sales for the next four years.

# Filter data for the years 2014, 2015, 2017, 2018, and 2019
filtered_data <- real_estate[real_estate$list_year %in% c(2014, 2015, 2017, 2018, 2019), ]

# Scatterplot with size and color aesthetics using ColorBrewer Set 1 palette
s <- ggplot(filtered_data, aes(x = assessed_value / 1000, y = sale_amount / 1000)) + 
  geom_point(aes(color = factor(list_year), size = sale_amount)) +
  labs(title = "Correlation between Sale Amount vs Assessed Value",
       x = "Assessed Value (in 1000 USD)",
       y = "Sale Amount (in 1000 USD)",  
       color = "List Year") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal()

s

# Calculating the correlation coefficient 
correlation_coefficient <- cor(real_estate$assessed_value, real_estate$sale_amount)

cat("Correlation Coefficient:", correlation_coefficient, "\n")

Correlation Coefficient: 0.1109615

The primary focus of this analysis is to examine the correlation between Assessed Value and Sale Amount, aiming to assess whether Assessed Value alone can reliably predict or determine the Sale Amount of properties in Connecticut. The observations reveal that the majority of data points cluster near the dependent variable, represented by the Sale Amount on the y-axis. This clustering pattern suggests that the correlation between Assessed Value and Sale Amount is not particularly strong.

This outcome implies that variations in Assessed Value alone may not consistently and proportionally account for the variability in Sale Amount. Other influential factors, such as property location, prevailing market conditions, or specific features of the properties, likely contribute significantly to the determination of Sale Amount. Also, the correlation coefficient lies towards 0 (0.1109615) instead of towards 1 which would have indicated a stronger correlation.

Notably, the decision I made to exclude data from the years 2016 and 2020 was due to the presence of outliers, possibly attributed to unusually large Sale Amounts recorded in those specific years. By excluding these outliers, I was able to analyse a more representative and reliable assessment of the correlation between Assessed Value and Sale Amount in the remaining years.

# Remove rows with any NA values
real_estate <- na.omit(real_estate)

# Identify the top 10 towns based on average sale_proportion
top_towns <- real_estate %>%
  group_by(town) %>%
  summarize(avg_sale_proportion = mean(sale_proportion)) %>%
  top_n(10) %>%
  select(town)

Selecting by avg_sale_proportion

# Filter the dataset for the top 10 towns
filtered_data <- real_estate %>%
  filter(town %in% top_towns$town)

# Divide sale_proportion by 1000
filtered_data <- mutate(filtered_data, sale_proportion = sale_proportion / 1000)

# Bar Chart
bar_chart <- ggplot(filtered_data, aes(x = town, y = sale_proportion, fill = property_type)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  labs(title = "Average Sale Proportion in Top 10 Towns by Property Type in Connecticut",
       x = "Town",
       y = "Average Sale Proportion (in 1000)",
       fill = "Property Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(bar_chart)

The bar chart visually represents the average sale proportion in the top 10 towns of Connecticut, categorized by property type. Each bar corresponds to a town, and the height of the bar indicates the average sale proportion (divided by 1000) in that town. The bars are grouped by property type, distinguished by different colors. This visualization allows for a quick comparison of average sale proportions across towns and property types in a straightforward and clear manner.