I am analyzing a comprehensive real estate dataset covering the years 2001 to 2020. The data, obtained from data.cvt.gov and curated by Connecticut, provides crucial insights into residential real estate sales trends over two decades. The key variable, ‘list_year,’ indicates the year of sale recording. The dataset also includes information on property types, assessed values, sale amount , and towns in Connecticut. My analysis aims to understand the yearly distribution of residential sales, shedding light on the evolution of Connecticut’s real estate market. Also, I intend to analyze the role of Assessed Values in predicting the Sale Amount for residential properties in Connecticut and how the sale of property types compare across a selected number of towns.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 997213 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Change variable names to lowercasecolnames(real_estate) <-tolower(colnames(real_estate))#Replace the spaces with underscores in variable namescolnames(real_estate) <-gsub(" ", "_", colnames(real_estate))head(real_estate)
# creating another variable sale_proportion and arranging the variable in the order of usage for this projectreal_estate <- real_estate |>mutate(sale_proportion = sales_ratio * sale_amount)head(real_estate)
The visualization shows a dashboard containing two line charts. The top line chart describes the trend of property sale on average in Connecticut over the years from 2001 to 2020. The visualizations helps create a forecast and predict the average sale amount on all residential properties in the next four years.
The second visual is categorized on residential type which defines the type of property on sale. The visual is important in helping understand the dynamics behind every sale and the type of properties that did well over the years. For example, a single family home seems to have done quite well in generating total sales over the years. For investors interested in real estate market in Connecticut, the visual can help predict what is likely going to happen in terms of sales for the next four years.
# Filter data for the years 2014, 2015, 2017, 2018, and 2019filtered_data <- real_estate[real_estate$list_year %in%c(2014, 2015, 2017, 2018, 2019), ]# Scatterplot with size and color aesthetics using ColorBrewer Set 1 palettes <-ggplot(filtered_data, aes(x = assessed_value /1000, y = sale_amount /1000)) +geom_point(aes(color =factor(list_year), size = sale_amount)) +labs(title ="Correlation between Sale Amount vs Assessed Value",x ="Assessed Value (in 1000 USD)",y ="Sale Amount (in 1000 USD)", color ="List Year") +scale_color_brewer(palette ="Set1") +theme_minimal()s
The primary focus of this analysis is to examine the correlation between Assessed Value and Sale Amount, aiming to assess whether Assessed Value alone can reliably predict or determine the Sale Amount of properties in Connecticut. The observations reveal that the majority of data points cluster near the dependent variable, represented by the Sale Amount on the y-axis. This clustering pattern suggests that the correlation between Assessed Value and Sale Amount is not particularly strong.
This outcome implies that variations in Assessed Value alone may not consistently and proportionally account for the variability in Sale Amount. Other influential factors, such as property location, prevailing market conditions, or specific features of the properties, likely contribute significantly to the determination of Sale Amount. Also, the correlation coefficient lies towards 0 (0.1109615) instead of towards 1 which would have indicated a stronger correlation.
Notably, the decision I made to exclude data from the years 2016 and 2020 was due to the presence of outliers, possibly attributed to unusually large Sale Amounts recorded in those specific years. By excluding these outliers, I was able to analyse a more representative and reliable assessment of the correlation between Assessed Value and Sale Amount in the remaining years.
# Remove rows with any NA valuesreal_estate <-na.omit(real_estate)# Identify the top 10 towns based on average sale_proportiontop_towns <- real_estate %>%group_by(town) %>%summarize(avg_sale_proportion =mean(sale_proportion)) %>%top_n(10) %>%select(town)
Selecting by avg_sale_proportion
# Filter the dataset for the top 10 townsfiltered_data <- real_estate %>%filter(town %in% top_towns$town)# Divide sale_proportion by 1000filtered_data <-mutate(filtered_data, sale_proportion = sale_proportion /1000)# Bar Chartbar_chart <-ggplot(filtered_data, aes(x = town, y = sale_proportion, fill = property_type)) +geom_bar(stat ="summary", fun ="mean", position ="dodge") +labs(title ="Average Sale Proportion in Top 10 Towns by Property Type in Connecticut",x ="Town",y ="Average Sale Proportion (in 1000)",fill ="Property Type") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))print(bar_chart)
The bar chart visually represents the average sale proportion in the top 10 towns of Connecticut, categorized by property type. Each bar corresponds to a town, and the height of the bar indicates the average sale proportion (divided by 1000) in that town. The bars are grouped by property type, distinguished by different colors. This visualization allows for a quick comparison of average sale proportions across towns and property types in a straightforward and clear manner.