Project 1: Property Sales History in Connecticut

Author

Kittim

For this project, I used a comprehensive real estate dataset spanning 2001 to 2020, sourced from data.cvt.gov and curated by Connecticut. This dataset provides vital insights into residential real estate sales trends over two decades, with the key variable being ‘list_year,’ denoting the sale recording year. It also includes data on property types, assessed values, sale prices, and towns in Connecticut.

My analysis focuses on understanding the yearly distribution of residential sales, revealing Connecticut’s real estate market evolution. Additionally, I’ll explore sales across property types, unveiling housing preference trends. Furthermore, I’ll spotlight the top ten towns with the highest house prices, showcasing the most expensive areas for home buyers.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2
getwd()
[1] "C:/Users/mutho/Desktop/Fall 2023/Data 110/DATASETS/Project 1"
real_estate_sales <- read_csv("Real_Estate_Sales_2001-2020_GL.csv")
Rows: 997213 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(real_estate_sales)
# A tibble: 6 × 14
  `Serial Number` `List Year` `Date Recorded` Town    Address   `Assessed Value`
            <dbl>       <dbl> <chr>           <chr>   <chr>                <dbl>
1         2020348        2020 09/13/2021      Ansonia 230 WAKE…           150500
2           20002        2020 10/02/2020      Ashford 390 TURN…           253000
3          200212        2020 03/09/2021      Avon    5 CHESTN…           130400
4          200243        2020 04/13/2021      Avon    111 NORT…           619290
5          200377        2020 07/02/2021      Avon    70 FAR H…           862330
6          200109        2020 12/09/2020      Avon    57 FAR H…           847520
# ℹ 8 more variables: `Sale Amount` <dbl>, `Sales Ratio` <dbl>,
#   `Property Type` <chr>, `Residential Type` <chr>, `Non Use Code` <chr>,
#   `Assessor Remarks` <chr>, `OPM remarks` <chr>, Location <chr>
# Change variable names to lowercase
colnames(real_estate_sales) <- tolower(colnames(real_estate_sales))

# Replace spaces with underscores in variable names
colnames(real_estate_sales) <- gsub(" ", "_", colnames(real_estate_sales))
head(real_estate_sales)
# A tibble: 6 × 14
  serial_number list_year date_recorded town  address assessed_value sale_amount
          <dbl>     <dbl> <chr>         <chr> <chr>            <dbl>       <dbl>
1       2020348      2020 09/13/2021    Anso… 230 WA…         150500      325000
2         20002      2020 10/02/2020    Ashf… 390 TU…         253000      430000
3        200212      2020 03/09/2021    Avon  5 CHES…         130400      179900
4        200243      2020 04/13/2021    Avon  111 NO…         619290      890000
5        200377      2020 07/02/2021    Avon  70 FAR…         862330     1447500
6        200109      2020 12/09/2020    Avon  57 FAR…         847520     1250000
# ℹ 7 more variables: sales_ratio <dbl>, property_type <chr>,
#   residential_type <chr>, non_use_code <chr>, assessor_remarks <chr>,
#   opm_remarks <chr>, location <chr>
glimpse(real_estate_sales)
Rows: 997,213
Columns: 14
$ serial_number    <dbl> 2020348, 20002, 200212, 200243, 200377, 200109, 20201…
$ list_year        <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,…
$ date_recorded    <chr> "09/13/2021", "10/02/2020", "03/09/2021", "04/13/2021…
$ town             <chr> "Ansonia", "Ashford", "Avon", "Avon", "Avon", "Avon",…
$ address          <chr> "230 WAKELEE AVE", "390 TURNPIKE RD", "5 CHESTNUT DRI…
$ assessed_value   <dbl> 150500, 253000, 130400, 619290, 862330, 847520, 23420…
$ sale_amount      <dbl> 325000, 430000, 179900, 890000, 1447500, 1250000, 130…
$ sales_ratio      <dbl> 0.4630000, 0.5883000, 0.7248000, 0.6958000, 0.5957000…
$ property_type    <chr> "Commercial", "Residential", "Residential", "Resident…
$ residential_type <chr> NA, "Single Family", "Condo", "Single Family", "Singl…
$ non_use_code     <chr> NA, NA, NA, NA, NA, NA, "08 - Part Interest", NA, NA,…
$ assessor_remarks <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "WATE…
$ opm_remarks      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ location         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "POINT (-73.1…
residential_per_year <- real_estate_sales |> 
  group_by(list_year) |> 
  summarise( n = n())


ggplot(data = residential_per_year, aes(x = list_year)) +
  geom_col(aes(y = n), position = 'dodge') +
  labs(title = 'Distribution of The Number of Residential Sales per Year',
       x = NULL,
       y = 'Frequency of Sales') +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5))

residential_type <- real_estate_sales |> 
  group_by(residential_type) |> 
  summarise(n = n()) |> 
  arrange(-n)

ggplot(residential_type, aes( x = n )) +
  geom_col(aes(y = reorder(residential_type,n)))+
   labs( title = 'Distribution of The Number of Sales per Residential Type',
        x = 'Frequency of Sales',
        y = NULL
        ) +
  theme_minimal() +
  theme(legend.position = "none",
            plot.title = element_text(hjust = 0.5))

town_price <- real_estate_sales |> 
  group_by(town) |> 
  summarise(price = median(sale_amount)) |> 
  arrange(-price)

head(town_price)
# A tibble: 6 × 2
  town         price
  <chr>        <dbl>
1 Greenwich  1270000
2 New Canaan 1250000
3 Darien     1220000
4 Westport   1000000
5 Weston      800000
6 Wilton      760000
library(viridis)
Loading required package: viridisLite
top_10_towns <- head(town_price, 10)

ggplot(top_10_towns, aes(x = reorder(town, -price), y = price, fill = town)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = price), vjust = -0.5, size = 4) +  # Add text labels
  labs(
    title = "Top Ten Expensive Towns to Acquire a House in Connecticut",
    x = "Town",
    y = "Median Sale Price"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_fill_viridis(discrete = TRUE)

I cleaned the dataset by converting all variable names to lowercase, ensuring consistency and readability. Additionally, I replaced spaces with underscores in the variable names, maintaining a uniform naming convention and enhancing compatibility with data analysis tools and programming languages.

Regarding the visualization, the first plot illustrates the distribution of the number of residential sales per year, offering insights into the changing dynamics of Connecticut’s real estate market. It is clear that the number of sales has gradually increased over the years. However, the most striking pattern emerges in the year 2020, where a substantial spike in sales occurred, potentially influenced by market forces like the COVID-19 pandemic.

The second plot delves into the distribution of sales across different residential property types. An interesting revelation is that ‘single-family’ homes dominate the market, comprising the majority of sales. This aligns with conventional housing preferences, which tend to favor single-family homes. The visualization showcases an inherent pattern in housing choices.

The third plot beautifully encapsulates the top ten expensive towns in Connecticut for acquiring a house. It employs a visually engaging bar chart that not only ranks these towns by median sale price but also effectively communicates this information through the inclusion of text labels on each bar, making it easier for the audience to interpret the data. The vibrant color palette helps add a modern touch to the visualization. The x-axis is angled to accommodate the town labels without overcrowding, enhancing readability. The absence of a legend streamlines the plot’s clarity, ensuring that viewers can effortlessly focus on the crucial information. This plot serves as an essential resource for those interested in the Connecticut real estate market, pinpointing the areas where housing comes at a premium.