For this project I am exploring the Real Estate Sales 2001-2021 data set for the state of Connecticut. The data set includes the serial number assigned when the house was sold, the year it was listed, the date the sale was recorded, the town where the house is located, the address of the house, the value it was assessed at before it was sold, the price it sold for, the type of property, and the residential type (condo, single family, two family, etc.). I plan to explore several areas, such as how price relates to town, and since the data set goes back to 2001, I also want to see how much the market has changed by 2021.
Loading the necessary libraries for the project
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr) # dplyr is used for data manipulation. It provides a set of functions for efficiently filtering, selecting, grouping, and summarizing data.
library(ggplot2) # ggplot2 is used for data visualization. It allows for the creation of complex plots from data in a dataframe with a high level of customization.
library(scales) # scales is used for formatting visualization scales. It provides functions to customize the appearance of plot axes and legends, such as changing number formats.
Attaching package: 'scales'
The following object is masked from 'package:purrr':
discard
The following object is masked from 'package:readr':
col_factor
library(reshape2) # reshape2 is used for reshaping data. It provides functions to melt and cast dataframes, making it easier to transform data between wide and long formats.
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
Loading the data set we are going to explore, Real_Estate_Sales_2001-2021_GL
Real_Estate_Sales <- read_csv("Real_Estate_Sales_2001-2021_GL.csv") # Loading the dataset we are going to use
Rows: 1054159 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Date Recorded, Town, Address, Property Type, Residential Type, Non ...
dbl (5): Serial Number, List Year, Assessed Value, Sale Amount, Sales Ratio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Getting a quick look at what variables we are working with
head(Real_Estate_Sales) # Taking a look at the different variables we are working with
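As an optional extra check (not part of the original steps), dplyr's glimpse() gives a compact overview of every column and its type, and a quick count of missing values can flag columns that will need na.rm later; a small sketch using functions already loaded above:
glimpse(Real_Estate_Sales) # Shows every column, its type, and a preview of its values
colSums(is.na(Real_Estate_Sales)) # Counts missing values in each column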
What we are going to explore first in this data set is how much the housing market has changed since 2001. Before we do that, I am going to make a new data set that makes it easier for us to use only the information we need.
Now we are going to pick the top 5 towns with the highest number of single-family house sales and store them as a value
top_towns <- Real_Estate_Sales |>
  filter(`Residential Type` == "Single Family") |> # Selecting just single family houses
  count(Town, sort = TRUE) |> # Sorts all of the towns in descending order so the top 5 towns with the most sales come first
  top_n(5) |> # Picks the top 5 towns from the sorted list that we performed above and removes the rest
  pull(Town) # Extracting the Town column that now has the top 5 towns. (I used this website to better understand how and what the pull function is: https://www.rdocumentation.org/packages/lplyr/versions/0.1.6/topics/pull)
Selecting by n
Now we are going to make a dot plot of sale amount against assessed value for single-family houses in those top 5 towns
dot_plot <- Real_Estate_Sales |>
  filter(Town %in% top_towns, `Residential Type` == "Single Family") # I am filtering the main data set so it only has the towns that are included in top_towns, then we are filtering the residential type to only show single family homes.

# Custom colors for towns
custom_colors <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#66a61e")

# This makes it so the numbers on the y and x axes do not look weird and are actually readable. (I asked ChatGPT how to make this; I couldn't find out how to do it by myself and nothing online was helping.)
label_formatter_million <- function(x) {
  paste0(format(round(x / 1000000, 1), nsmall = 1), "M")
}

# Making the dot plot
ggplot(dot_plot, aes(x = `Assessed Value`, y = `Sale Amount`, color = Town)) +
  geom_point(alpha = 0.7, size = 2.5) + # Setting the transparency and how big the dots are
  scale_color_manual(values = custom_colors) + # Setting what colors to use
  # Naming the title, subtitle, x-axis, y-axis, caption, and color
  labs(title = "Sale Amount vs. Assessed Value for Single-Family Homes",
       subtitle = "Top 5 towns by transaction volume",
       x = "Assessed Value",
       y = "Sale Amount",
       caption = "Data source: Real Estate Sales 2001-2021 Dataset",
       color = "Town") +
  # Using the converter I made above so the axes will be easier to read
  scale_y_continuous(labels = label_formatter_million) +
  scale_x_continuous(labels = label_formatter_million) +
  theme_light(base_size = 4) + # Setting the base text size used by the theme
  theme(legend.position = "bottom", # Sets where the legend is
        # All of this sets the size for the specific text; you can also set the text to be bold
        plot.title = element_text(size = 16, face = "bold"),
        plot.subtitle = element_text(size = 12),
        plot.caption = element_text(size = 10),
        legend.title = element_text(size = 12),
        legend.text = element_text(size = 10),
        axis.title = element_text(size = 14),
        axis.text = element_text(size = 12))
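Since the scales package is already loaded for exactly this kind of formatting, the same "millions" labels could likely also be built with one of its helpers instead of a hand-written function. This is only a sketch of an alternative, assuming a scales version that provides label_number() (1.0+); it is not what the plot above uses:
# Possible alternative to the hand-written formatter, using scales (already loaded)
label_millions <- label_number(scale = 1e-6, suffix = "M") # rescales values into millions and appends "M"
# It would be passed the same way, e.g. scale_y_continuous(labels = label_millions)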
Now I am going to make another data set that takes the top 5 towns we used in the previous visualization and adds up the total sales each year for each town
sales_data <- Real_Estate_Sales |>
  group_by(Town, `List Year`) |> # Groups the rows by Town and List Year so the next step is calculated for each town and year
  summarise(Total.Sale.Amount = sum(`Sale Amount`, na.rm = TRUE)) |> # Adds a new column with the sum of that year's sale amounts
  filter(Town %in% top_towns) |> # Makes sure only the top 5 towns that are in top_towns are in this data set
  ungroup() # This is used to ungroup the data we grouped up since we do not need it grouped anymore; we grouped it so we could perform the calculation, and after that we can ungroup it
`summarise()` has grouped output by 'Town'. You can override using the
`.groups` argument.
Here we are going to spread out the data so we can use it for our heatmap. This step transforms the aggregated data into a wide format, where each column represents a different year: 'List Year' becomes the column headers, and 'Total.Sale.Amount' fills the matrix.
# Spread the data for the heatmap; turns the data set into a wide data set
sales_data_wide <- spread(sales_data, key = `List Year`, value = Total.Sale.Amount)
Melt the data back to long format for ggplot. Here, we convert the wide format back into a long format where each row represents a town-year combination. This format is necessary for ggplot2 to correctly map the aesthetics for the heatmap.
# Melt the data back to long format for ggplot; we turn it back into a long data set
sales_data_long <- melt(sales_data_wide, id.vars = "Town")
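As a side note, the same wide-then-long round trip could likely be done with tidyr alone, since spread() and reshape2::melt() have modern equivalents in pivot_wider() and pivot_longer(). A small sketch under that assumption; the names_to and values_to choices mirror melt()'s default "variable" and "value" column names so the heatmap code below would still work:
# Hypothetical tidyr-only version of the spread/melt round trip above
sales_data_wide_alt <- pivot_wider(sales_data,
                                   names_from = `List Year`,
                                   values_from = Total.Sale.Amount)
sales_data_long_alt <- pivot_longer(sales_data_wide_alt,
                                    cols = -Town,
                                    names_to = "variable",
                                    values_to = "value")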
Here we are going to create a heatmap of the total sale amount by town and year
# Define a custom label formatter for displaying values in millions (M)
# This function takes a numeric value and converts it into a string with an 'M' suffix, indicating millions
label_formatter_million <- function(x) {
  paste0(format(round(x / 1000000, 1), nsmall = 1), "M")
}

# Generate the heatmap
ggplot(sales_data_long, aes(x = Town, y = variable, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "red", na.value = "white", name = "Total Sale\nAmount ($)") +
  labs(title = "Heatmap of Total Sale Amount by Town and Year",
       x = "Town",
       y = "Year",
       caption = "Data source: Real Estate Sales 2001-2021 Dataset") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), # Adjusts the angle of the town labels
        legend.position = "right",
        plot.title = element_text(size = 14, face = "bold"), # Sets the size of the title and bolds it
        plot.caption = element_text(size = 10)) # Sets the size of the caption
After Thoughts:
How you cleaned the dataset up?
I cleaned this project up by getting rid of any unneeded data sets I had made. There were a lot of data sets I created at first that I thought I was going to use, but my plans changed or I decided to do something else with my visualizations. I also tried to find ways to make the code simpler so it does not look clumped up and stays clean.
What the visualization represents, any interesting patterns or surprises that arise within the visualization.
I noticed with the first visualization that houses are not cheap, and also that Fairfield sells more on the higher side whereas West Hartford sells more on the lower side: West Hartford does not have a house that sold for over $5 million, and while the other towns are more spread out, West Hartford is clustered in one spot. For the second visualization I noticed that Stamford has a higher total sale amount than the other towns.
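To double-check that reading of the heatmap, a quick sketch like the following (not part of the original analysis, but it only uses the sales_data summary built earlier) could compare the towns directly:
# Hypothetical check of the heatmap reading: overall and average yearly sale amount per top town
sales_data |>
  group_by(Town) |>
  summarise(Overall.Total = sum(Total.Sale.Amount),
            Yearly.Average = mean(Total.Sale.Amount)) |>
  arrange(desc(Overall.Total))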
Anything that you might have shown that you could not get to work or that you wished you could have included.
On the second visualization I could not get the money formatter from the first visualization to work, so you can't really understand the legend without doing some math. I also had trouble with the top_towns section; I spent a lot of time on it, but in the end it ended up working.
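For reference, one likely fix (a sketch, not something run above) would be to pass label_formatter_million to the fill scale's labels argument, the same way it is passed to the axis scales in the first plot:
# Possible fix for the heatmap legend: reuse label_formatter_million as the labels function of the fill scale
scale_fill_gradient(low = "blue", high = "red", na.value = "white",
                    name = "Total Sale\nAmount ($)",
                    labels = label_formatter_million)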