For this project I am exploring the Real Estate Sales 2001-2021 data set for the state of Connecticut. The data set includes the serial number assigned when the house was sold, the year it was listed, the date the sale was recorded, the town where the house is located, the address of the house, the value it was assessed at before it was sold, the price it sold for, the type of property, and the residential type (condo, single family, two family, etc.). I plan to explore several areas, such as how price relates to town, and since the data set goes back to 2001, I also want to see how much the market has changed by 2021.
Loading the necessary libraries for the project
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # dplyr is used for data manipulation. It provides a set of functions for efficiently filtering, selecting, grouping, and summarizing data.
library(ggplot2) # ggplot2 is used for data visualization. It allows for the creation of complex plots from data in a dataframe with a high level of customization.
library(scales) # scales is used for formatting visualization scales. It provides functions to customize the appearance of plot axes and legends, such as changing number formats.
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(reshape2) # reshape2 is used for reshaping data. It provides functions to melt and cast dataframes, making it easier to transform data between wide and long formats.
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Loading the data set we are going to explore, Real_Estate_Sales_2001-2021_GL
Real_Estate_Sales <- read_csv("Real_Estate_Sales_2001-2021_GL.csv") # Loading the dataset we are going to use
Rows: 1054159 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Getting a quick look at what variables we are working with
head(Real_Estate_Sales) # Taking a look at the different variables we are working with
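As an optional extra check (not part of the original steps), dplyr's glimpse() gives a compact overview of every column and its type, and a quick count of missing values can flag columns that will need na.rm later; a small sketch using functions already loaded above:
glimpse(Real_Estate_Sales) # Shows every column, its type, and a preview of its values
colSums(is.na(Real_Estate_Sales)) # Counts missing values in each column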
What we are going to explore first in this data set is how much the housing market has changed since 2001. Before we do that, I am going to make a new data set that makes it easier for us to use only the information we need.
Now we are going to pick the top 5 towns with the highest number of single-family house sales and store them as a value
top_towns <- Real_Estate_Sales |>
  filter(`Residential Type` == "Single Family") |> # Selecting just single family houses
  count(Town, sort = TRUE) |> # Sorts all of the towns in descending order so the top 5 towns with the most sales come first
  top_n(5) |> # Picks the top 5 towns from the sorted list that we performed above and removes the rest
  pull(Town) # Extracting the Town column that now has the top 5 towns. (I used this website to better understand how and what the pull function is: https://www.rdocumentation.org/packages/lplyr/versions/0.1.6/topics/pull)
Selecting by n
Now we are going to make a dot plot of sale amount against assessed value for single-family houses in those top 5 towns
dot_plot <- Real_Estate_Sales |>
  filter(Town %in% top_towns, `Residential Type` == "Single Family") # I am filtering the main data set so it only has the towns that are included in top_towns, then we are filtering the residential type to only show single family homes.

# Custom colors for towns
custom_colors <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e")

# This makes it so the numbers on the y and x axes do not look weird and are actually readable. (I asked ChatGPT how to make this; I couldn't find out how to do it by myself and nothing online was helping.)
label_formatter_million <- function(x) {
  paste0(format(round(x / 1000000, 1), nsmall = 1), "M")
}

# Making the dot plot
ggplot(dot_plot, aes(x = `Assessed Value`, y = `Sale Amount`, color = Town)) +
  geom_point(alpha = 0.7, size = 2.5) + # Setting the transparency and how big the dots are
  scale_color_manual(values = custom_colors) + # Setting what colors to use
  # Naming the title, subtitle, x-axis, y-axis, caption, and color
  labs(title = "Sale Amount vs. Assessed Value for Single-Family Homes",
       subtitle = "Top 5 towns by transaction volume",
       x = "Assessed Value",
       y = "Sale Amount",
       caption = "Data source: Real Estate Sales 2001-2021 Dataset",
       color = "Town") +
  # Using the converter I made above so the axes will be easier to read
  scale_y_continuous(labels = label_formatter_million) +
  scale_x_continuous(labels = label_formatter_million) +
  theme_light(base_size = 4) + # Setting the base text size used by the theme
  theme(legend.position = "bottom", # Sets where the legend is
        # All of this sets the size for the specific text; you can also set the text to be bold
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 10),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
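Since the scales package is already loaded for exactly this kind of formatting, the same "millions" labels could likely also be built with one of its helpers instead of a hand-written function. This is only a sketch of an alternative, assuming a scales version that provides label_number() (1.0+); it is not what the plot above uses:
# Possible alternative to the hand-written formatter, using scales (already loaded)
label_millions <- label_number(scale = 1e-6, suffix = "M") # rescales values into millions and appends "M"
# It would be passed the same way, e.g. scale_y_continuous(labels = label_millions)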
Now I am going to make another data set that takes the top 5 towns we used in the previous visualization and adds up the total sales each year for each town
sales_data <- Real_Estate_Sales |>
  group_by(Town, `List Year`) |> # Groups the rows by Town and List Year so the next step is calculated for each town and year
  summarise(Total.Sale.Amount = sum(`Sale Amount`, na.rm = TRUE)) |> # Adds a new column with the sum of that year's sale amounts
  filter(Town %in% top_towns) |> # Makes sure only the top 5 towns that are in top_towns are in this data set
  ungroup() # This is used to ungroup the data we grouped up since we do not need it grouped anymore; we grouped it so we could perform the calculation, and after that we can ungroup it
`summarise()` has grouped output by 'Town'. You can override using the
`.groups` argument.
Here we are going to spread out the data so we can use it for our heatmap. This step transforms the aggregated data into a wide format, where each column represents a different year: 'List Year' becomes the column headers, and 'Total.Sale.Amount' fills the matrix.
# Spread the data for the heatmap; turns the data set into a wide data set
sales_data_wide <- spread(sales_data, key = `List Year`, value = Total.Sale.Amount)
Melt the data back to long format for ggplot. Here, we convert the wide format back into a long format where each row represents a town-year combination. This format is necessary for ggplot2 to correctly map the aesthetics for the heatmap.
# Melt the data back to long format for ggplot; we turn it back into a long data set
sales_data_long <- melt(sales_data_wide, id.vars = "Town")
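As a side note, the same wide-then-long round trip could likely be done with tidyr alone, since spread() and reshape2::melt() have modern equivalents in pivot_wider() and pivot_longer(). A small sketch under that assumption; the names_to and values_to choices mirror melt()'s default "variable" and "value" column names so the heatmap code below would still work:
# Hypothetical tidyr-only version of the spread/melt round trip above
sales_data_wide_alt <- pivot_wider(sales_data,
                                   names_from = `List Year`,
                                   values_from = Total.Sale.Amount)
sales_data_long_alt <- pivot_longer(sales_data_wide_alt,
                                    cols = -Town,
                                    names_to = "variable",
                                    values_to = "value")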
Here we are going to create a heatmap of the total sale amount by town and year
# Define a custom label formatter for displaying values in millions (M)
# This function takes a numeric value and converts it into a string with an 'M' suffix, indicating millions
label_formatter_million <- function(x) {
  paste0(format(round(x / 1000000, 1), nsmall = 1), "M")
}

# Generate the heatmap
ggplot(sales_data_long, aes(x = Town, y = variable, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red", na.value = "white", name = "Total Sale\nAmount ($)") +
  labs(title = "Heatmap of Total Sale Amount by Town and Year",
       x = "Town",
       y = "Year",
       caption = "Data source: Real Estate Sales 2001-2021 Dataset") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # Adjusts the angle of the town labels
        legend.position = "right",
        plot.title = element_text(size = 14, face = "bold"), # Sets the size of the title and bolds it
        plot.caption = element_text(size = 10)) # Sets the size of the caption
After Thoughts:
How you cleaned the dataset up?
I cleaned this project up by getting rid of any unneeded data sets I had made. There were a lot of data sets I created at first that I thought I was going to use, but my plans changed or I decided to do something else with my visualizations. I also tried to find ways to make the code simpler so it does not look clumped up and stays clean.
What the visualization represents, any interesting patterns or surprises that arise within the visualization.
I noticed with the first visualization that houses are not cheap, and also that Fairfield sells more on the higher side whereas West Hartford sells more on the lower side: West Hartford does not have a house that sold for over $5 million, and while the other towns are more spread out, West Hartford is clustered in one spot. For the second visualization I noticed that Stamford has a higher total sale amount than the other towns.
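To double-check that reading of the heatmap, a quick sketch like the following (not part of the original analysis, but it only uses the sales_data summary built earlier) could compare the towns directly:
# Hypothetical check of the heatmap reading: overall and average yearly sale amount per top town
sales_data |>
  group_by(Town) |>
  summarise(Overall.Total = sum(Total.Sale.Amount),
            Yearly.Average = mean(Total.Sale.Amount)) |>
  arrange(desc(Overall.Total))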
Anything that you might have shown that you could not get to work or that you wished you could have included.
On the second visualization I could not get the money formatter from the first visualization to work, so you can't really understand the legend without doing some math. I also had trouble with the top_towns section; I spent a lot of time on it, but in the end it ended up working.
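For reference, one likely fix (a sketch, not something run above) would be to pass label_formatter_million to the fill scale's labels argument, the same way it is passed to the axis scales in the first plot:
# Possible fix for the heatmap legend: reuse label_formatter_million as the labels function of the fill scale
scale_fill_gradient(low = "blue", high = "red", na.value = "white",
                    name = "Total Sale\nAmount ($)",
                    labels = label_formatter_million)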