For this project, I used a comprehensive real estate dataset spanning 2001 to 2020, sourced from data.cvt.gov and curated by Connecticut. This dataset provides vital insights into residential real estate sales trends over two decades, with the key variable being ‘list_year,’ denoting the sale recording year. It also includes data on property types, assessed values, sale prices, and towns in Connecticut.
My analysis focuses on understanding the yearly distribution of residential sales, revealing Connecticut’s real estate market evolution. Additionally, I’ll explore sales across property types, unveiling housing preference trends. Furthermore, I’ll spotlight the top ten towns with the highest house prices, showcasing the most expensive areas for home buyers.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
Rows: 997213 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
residential_per_year <- real_estate_sales |>group_by(list_year) |>summarise( n =n())ggplot(data = residential_per_year, aes(x = list_year)) +geom_col(aes(y = n), position ='dodge') +labs(title ='Distribution of The Number of Residential Sales per Year',x =NULL,y ='Frequency of Sales') +theme_minimal() +theme(legend.position ="none",plot.title =element_text(hjust =0.5))
residential_type <- real_estate_sales |>group_by(residential_type) |>summarise(n =n()) |>arrange(-n)ggplot(residential_type, aes( x = n )) +geom_col(aes(y =reorder(residential_type,n)))+labs( title ='Distribution of The Number of Sales per Residential Type',x ='Frequency of Sales',y =NULL ) +theme_minimal() +theme(legend.position ="none",plot.title =element_text(hjust =0.5))
# A tibble: 6 × 2
town price
<chr> <dbl>
1 Greenwich 1270000
2 New Canaan 1250000
3 Darien 1220000
4 Westport 1000000
5 Weston 800000
6 Wilton 760000
library(viridis)
Loading required package: viridisLite
top_10_towns <-head(town_price, 10)ggplot(top_10_towns, aes(x =reorder(town, -price), y = price, fill = town)) +geom_bar(stat ="identity") +geom_text(aes(label = price), vjust =-0.5, size =4) +# Add text labelslabs(title ="Top Ten Expensive Towns to Acquire a House in Connecticut",x ="Town",y ="Median Sale Price" ) +theme_minimal() +theme(legend.position ="none",plot.title =element_text(hjust =0.5),axis.text.x =element_text(angle =45, hjust =1) ) +scale_fill_viridis(discrete =TRUE)
I cleaned the dataset by converting all variable names to lowercase, ensuring consistency and readability. Additionally, I replaced spaces with underscores in the variable names, maintaining a uniform naming convention and enhancing compatibility with data analysis tools and programming languages.
Regarding the visualization, the first plot illustrates the distribution of the number of residential sales per year, offering insights into the changing dynamics of Connecticut’s real estate market. It is clear that the number of sales has gradually increased over the years. However, the most striking pattern emerges in the year 2020, where a substantial spike in sales occurred, potentially influenced by market forces like the COVID-19 pandemic.
The second plot delves into the distribution of sales across different residential property types. An interesting revelation is that ‘single-family’ homes dominate the market, comprising the majority of sales. This aligns with conventional housing preferences, which tend to favor single-family homes. The visualization showcases an inherent pattern in housing choices.
The third plot beautifully encapsulates the top ten expensive towns in Connecticut for acquiring a house. It employs a visually engaging bar chart that not only ranks these towns by median sale price but also effectively communicates this information through the inclusion of text labels on each bar, making it easier for the audience to interpret the data. The vibrant color palette helps add a modern touch to the visualization. The x-axis is angled to accommodate the town labels without overcrowding, enhancing readability. The absence of a legend streamlines the plot’s clarity, ensuring that viewers can effortlessly focus on the crucial information. This plot serves as an essential resource for those interested in the Connecticut real estate market, pinpointing the areas where housing comes at a premium.