The dataset used for this project is Real Estate Sales 2001-2022. The data was collected and maintained by The Office of Policy and Management for the state of Connecticut and includes recorded data of all real estate sales in Connecticut from 1999 through 2022.
https://data.ct.gov/api/views/5mzw-sjtu/rows.csv?accessType=DOWNLOAD
Using the data in this data set I was able to find the towns with the highest number of sales per year, the highest average of real estate sales, and observe years and months where resl estate sales fluctuated.
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C:/ProgramData/Anaconda3/Library/plugins/platforms'
import wget
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
URL = "https://data.ct.gov/api/views/5mzw-sjtu/rows.csv?accessType=DOWNLOAD"
path = "//apporto.com/dfs/LOYOLA/Users/smbond_loyola/Documents/Data Sets/Python Datasets/"
wget.download(URL, path + 'RealEstate_Sales.csv')
filename = path + "RealEstate_Sales.csv"
df = pd.read_csv(filename)
df = pd.read_csv(filename, usecols = ['List Year', 'Date Recorded', 'Town', 'Address', 'Assessed Value', 'Sale Amount', 'Sales Ratio', 'Property Type', 'Residential Type'])
df['Date Recorded'] = pd.to_datetime(df['Date Recorded'], format = '%m/%d/%Y')
df['Year'] = df['Date Recorded'].dt.year
df['Month'] = df['Date Recorded'].dt.month
df['Day'] = df['Date Recorded'].dt.day
df['MonthName'] = df['Date Recorded'].dt.strftime('%b')
df['DayName'] = df['Date Recorded'].dt.strftime('%a')
df = df[df['Property Type'].notna() & df['Residential Type'].notna() & df['Address'].notna()]
df = df[df['Month'] != 0]
x = df.groupby(['Year', 'MonthName', 'Month'])['Year'].count().reset_index(name='Count')
x = pd.DataFrame(x)
x['Month'] = x['Month'].astype('int')
x['Year'] = x['Year'].astype('int')
x['count_hundreds'] = round(x['Count']/100, 0)
x = x.reset_index(drop=True)
x[ (x['Month'] == 12) & (x['Year'] == 2022)]
x['Year'].value_counts()
del x['count_hundreds']
x2 = x.groupby(['Year', 'Month'])['Count'].sum().reset_index()
x2 = pd.DataFrame(x2)
x2['count_hundreds'] = round(x2['Count']/100, 0)
This Scatterplot shows the amount of real estate sales per year, by month. Here we can observe that sales are consistantly lower in the winter months and highest in the summer. Sales began to increase in the state in 2015, peaking in 2020 and 2021. Sales appear to be settling in 2022 and 2023 back down to the levels they were at in 2015 though 2019. Real Estate sales appear to be consistantly higher in the months of June, July, and August.
plt.figure()
plt.scatter(x2['Month'], x2['Year'], marker='8', cmap='viridis',
c=x2['count_hundreds'], s=x2['count_hundreds'], edgecolor='black')
plt.title('Real Estate Sales', fontsize=18)
plt.xlabel('Months of the Year', fontsize=14)
plt.ylabel('Year', fontsize=16)
cbar = plt.colorbar()
cbar.set_label('Number of Real Estate Sales', rotation=270, fontsize=14, color='black', labelpad=20)
my_colorbar_ticks = [*range(10, int(x2['count_hundreds'].max()), 10 )]
cbar.set_ticks(my_colorbar_ticks)
my_colorbar_tick_labels = [*range(1000, int(x2['Count'].max()), 1000) ]
my_colorbar_tick_labels = [ '{:,}'.format(each) for each in my_colorbar_tick_labels]
cbar.set_ticklabels(my_colorbar_tick_labels)
my_x_ticks = [*range(x2['Month'].min(), x2['Month'].max()+1, 1 )]
plt.xticks(my_x_ticks, fontsize=14, color='black')
my_y_ticks = [*range(x2['Year'].min(), x2['Year'].max()+1, 1 )]
plt.yticks(my_y_ticks, fontsize=10, color='black')
plt.show()
This plot shows the top ten towns for real estate sales. These towns out perform all others in Connecticut by a significant margin in terms of sales per year. These towns are likely more populated than others and are more popular locations to those looking for a home to purchase in Connecticut. These towns are likely to have more real estate inventory, better access to jobs and amenities, and have a higher number of schools and hospitals.
df = pd.read_csv(filename, usecols = ['Sale Amount', 'Property Type', 'Date Recorded', 'Town'])
df = df[df['Property Type'].notna() & df['Date Recorded'].notna()]
df['Date Recorded'] = pd.to_datetime(df['Date Recorded'], format = '%m/%d/%Y')
x4 = df.groupby(['Town']).agg({'Town':['count'], 'Sale Amount':['sum', 'mean']}).reset_index()
x4.columns = ['Town','Count','Total Sales','Average Price']
x5 = x4.sort_values('Count', ascending=False)[0:10]
plt.figure(figsize=(18,10))
plt.bar(x5.loc[15:138, 'Town'], x5.loc[15:138, 'Count'], label='Real Estate Sales Count')
plt.legend(loc='upper right', fontsize=14)
plt.show()
Here we can see the average rest estate prices by town and overall average for the state. We see that the top 10 sales prices is significantly higher than the majority of the towns in the state. The average appears to be skewed higher due to the disparity in prices in the top 10.
def pick_colors_according_to_mean_count(this_data):
colors=[]
avg = this_data.Count.mean()
for each in this_data.Count:
if each > avg*1.01:
colors.append('lightcoral')
elif each < avg*0.99:
colors.append('green')
else:
colors.append('black')
return colors
x6 = x4.sort_values('Count', ascending=False)[0:100]
import matplotlib.patches as mpatches
bottom1 = 15
top1 = 116
d1 = x6.loc[bottom1:top1]
my_colors1 = pick_colors_according_to_mean_count(d1)
bottom2 = 15
top2 = 138
d2 = x6.loc[bottom2:top2]
my_colors2 = pick_colors_according_to_mean_count(d2)
Above = mpatches.Patch(color='lightcoral', label='Above Average')
At = mpatches.Patch(color='black', label='Within 1% of the Average')
Below = mpatches.Patch(color='green', label='Below Average')
fig = plt.figure(figsize=(18,16))
fig.suptitle('Connecticut Real Estate Averages by Town and Top 10 Town Real Estate', fontsize=20, fontweight='bold')
ax1 = fig.add_subplot(2, 1, 1)
ax1.bar(d1.Town, d1.Count, label='Count', color=my_colors1)
ax1.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d1.Count.mean(), color= 'black', linestyle='dashed')
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
ax1.axes.xaxis.set_visible(False)
ax1.set_title( 'Top 100 Connecticut Towns Real Estate Sales', size=20)
ax1.text(top1-10, d1.Count.mean()+5, 'Mean = ' + str(d1.Count.mean()), fontsize=14)
ax2 = fig.add_subplot(2, 1, 2)
ax2.bar(d2.Town, d2.Count, label='Count', color=my_colors2)
ax2.legend(handles=[Above, At, Below], fontsize=14)
plt.axhline(d2.Count.mean(), color= 'black', linestyle='solid')
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.set_title( 'Top 10 Connecticut Towns Real Estate Sales', size=20)
ax2.text(top2-1, d2.Count.mean()+5, 'Mean = ' + str(d2.Count.mean()), fontsize=14)
fig.subplots_adjust(hspace = 0.30)
plt.show()
The majority of towns in Connecticut sit well below the average rest estate prices and the Top 10 significantly outpace the bottom 10.
bottom3 = 15
top3 = 138
d3 = x6.loc[bottom3:top3]
d3 = d3.sort_values('Count', ascending=True)
d3.reset_index(inplace=True, drop=True)
my_colors3 = pick_colors_according_to_mean_count(d3)
fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1, 1, 1)
ax1.barh(d3.Town, d3.Count, color=my_colors3)
for row_counter, value_at_row_counter in enumerate(d3.Count):
if value_at_row_counter > d3.Count.mean()*1.01:
color= 'lightcoral'
elif value_at_row_counter < d3.Count.mean()*0.99:
color = 'green'
else:
color = 'black'
ax1.text(value_at_row_counter+2, row_counter, str(value_at_row_counter), color=color, size=12, fontweight = 'bold',
ha='left', va='center', backgroundcolor='white')
plt.xlim(0, d3.Count.max()*1.1)
ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14)
plt.axvline(d3.Count.mean(), color='black', linestyle='dashed')
ax1.text(d3.Count.mean()+2, 0, 'Mean=' +str(d3.Count.mean()), fontsize=14)
ax1.set_title('Top 10 Towns Real Estate')
ax1.set_xlabel('Average Real Estate Prices', fontsize=16)
ax1.set_ylabel('Town Name', fontsize=16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()
Ranking of Real Estate Sales by Month and by Year shows that sales are increasing over the last twenty years and sales are higher in the summer. Summer is a popular time to both sell and purchase a home due to children being out of school and better weather for moving. There was a large jump in sales during the summer of 2020, possibility due to people moving from cities to more rural areas in large numbers during the height of Covid-19. 2020 and 2021 rank highly in the sales ranking, pointing to the hot real estate market at the time and lower interest rates that were being offered. Work from home also allowed many people to move further away from their place of employment making purchasing a home a good option.
df['Year'] = df['Date Recorded'].dt.year
df['MonthName'] = df['Date Recorded'].dt.strftime('%b')
df['DayName'] = df['Date Recorded'].dt.strftime('%a')
Sales_df = df.groupby(['Year','MonthName'])['Sale Amount'].sum().reset_index(name='Total Sales')
bump_df = df.groupby(['Year', 'MonthName'])['Sale Amount'].sum().reset_index(name='TotalSales')
bump_df = bump_df.pivot(index='Year', columns='MonthName', values = 'TotalSales')
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
bump_df = bump_df.reindex(columns=month_order)
bump_df = bump_df.dropna()
bump_df_ranked = bump_df.rank(0, ascending=False, method='min')
bump_df_ranked = bump_df_ranked.T
fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)
bump_df_ranked.plot(kind='line', ax=ax, marker='o', markeredgewidth=1, linewidth=6,
markersize=44,
markerfacecolor='white')
ax.invert_yaxis()
num_rows = bump_df_ranked.shape[0]
num_cols = bump_df_ranked.shape[1]
plt.ylabel('Monthly Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Real Estate Sales by Month and by Year\n Bump Chart', fontsize=18, pad=20)
plt.xticks(np.arange(num_rows), month_order, fontsize=14)
plt.yticks(range(1, num_cols+1, 1), fontsize=14)
ax.set_xlabel('Month', fontsize=18)
handles, labels = ax.get_legend_handles_labels()
handles = [ handles[15], handles[14], handles[13], handles[12], handles[11], handles[10], handles[9], handles[8], handles[7], handles[6], handles[5], handles[4], handles[3], handles[2], handles[1], handles[0] ]
labels = [ labels [15], labels [14], labels [13], labels [12], labels [11], labels [10], labels [9], labels [8], labels [7], labels [6], labels [5], labels [4], labels [3], labels [2], labels [1], labels [0] ]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
labelspacing = 1,
markerscale = .4,
borderpad = 1,
handletextpad = 0.8)
i = 0
j = 0
for eachcol in bump_df_ranked.columns:
for eachrow in bump_df_ranked.index:
this_rank = bump_df_ranked.iloc[i, j]
ax.text(i, this_rank, '$' + str(round(bump_df.iloc[j, i]/1e6,1)) + 'M', ha='center', va='center', fontsize=12)
1+1
j+=1
i=0
plt.show
Between the years of 1999 through 2022 (with partial recording of 2023) The state of Connecticut recorded the real estate purchases of 170 towns. Using this data we were able to observe buying and selling trends over the course of 23 years including two years that were particularly affected by the Covid-19 global pandemic. Some of the interesting observations include the significant jump in real estate purchases that occured suring the summer of 2020 lasting through to the end of 2021. This tells us that the combination of Covid-19 anxiety and lower interest rates spurned a large amount of home purchases and movement.
The data also highlights areas that are likely to have larger populations due to the high number of sales in the town as well as the large disparity between average real estate prices in some towns verses others. Using this we can identify larger towns and cities that have higher populations and we can identify afluent areas compared to less economically strong areas. This data provides a snapshot into how the states population is moving and where the wealth may be concentrated.