Introduction

These visualizations from the Connecticut Real Estate Sales data set provides a comprehensive analysis of real estate market trends from 2012 to 2022, focusing on different aspects of residential property transactions across various towns and residential types.

knitr::include_graphics("F://Higher studies//2024//Fall 2024//DS736 - Data Visualization Decision Making//Assignments//Python//Sale Image.jpg")
Real Estate Sales - Connecticut

Real Estate Sales - Connecticut

Dataset

The original dataset was sourced from a public data archive of the state of Connecticut website.

Link: https://data.ct.gov/Housing-and-Development/Real-Estate-Sales-2001-2022-GL/5mzw-sjtu/about_data

In the initial and unaltered form of the dataset, there are around 1 million records, each with 14 variables . Each row represents the real estate sales details in the state of Connecticut.

The data set includes demographic data like Town, Address, Latitude/Longitude. Also, it includes financial data like Sale Amount, Assessed Value, and few other variables like Property type, Residential type, Assessor Remarks.

The original data set contains records dating back to 1999 and extending up to 2023.

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'C://Users//vijay//anaconda3//Library//plugins//platforms'

import urllib.request
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
import warnings
import seaborn as sns

pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')
dst = 'F://Higher studies//2024//Fall 2024//DS736 - Data Visualization Decision Making//Datasets//Real_Estate_Sales_2001-2022_GL_20241006.csv'
df = pd.read_csv(dst)

Findings

The following tabs offer comprehensive details about Real Estate Sales in Connecticut from 1999 to 2023. They highlight the towns with highest sales, total sales count by Month and Year, total sales by Quarter and Residential Types with the highest sales over the years.

1. Top 20 Towns by Total sales in the State of Connecticut

This bar chart shows the top 20 towns in Connecticut ranked by total sales in billions. Here’s a breakdown of the key insights:

X-Axis (Total Sales (in Billions)): The x-axis shows the sales figures in billions of dollars. Each town has a bar representing its total sales.

Y-Axis (Towns): The chart lists top 20 towns having the highest sale values.

Key Observations:

1. Sales Leaders:

Greenwich leads with the highest total sales, at $38.67 billion, followed by Stamford with $33.09 billion. These towns significantly outperform others.

2. Sales Below Average:

Towns like Danbury, New Haven, and Milford have lower total sales compared to the average.

3. Sales Distribution:

The chart reveals a considerable disparity between the top-performing towns (e.g., Greenwich and Stamford) and those with lower sales (e.g., Glastonbury and Hamden).

This visualization shows that many towns have sales well below the average, indicating a skewed distribution of real estate sales in Connecticut.

## Prepare data for Horizontal bar chart
# Create new columns for Year, Month, Day

df['Date Recorded'] = pd.to_datetime(df['Date Recorded'], format = '%m/%d/%Y')
# Drop rows with NA
df.dropna(subset=['Date Recorded'], inplace=True)

# Create new columns for Year, Month, Day
df['Year'] = df['Date Recorded'].dt.year
df['Day'] = df['Date Recorded'].dt.day
df['Month'] = df['Date Recorded'].dt.month
df['MonthName'] = df['Date Recorded'].dt.strftime('%b')

# Convert data type for Year, Day and Month columns to integer
df['Year'] = df['Year'].astype(int)
df['Day'] = df['Day'].astype(int)
df['Month'] = df['Month'].astype(int)

# Create new data frame x with selective columns: Town, Count, Total Sales, Average Sales

x = df.groupby(['Town']).agg({'Town':['count'], 'Sale Amount':['sum', 'mean']}).reset_index()
x.columns = ['Town', 'Count', 'TotalSales', 'AvgSales']
x = x.sort_values('TotalSales', ascending=False).reset_index()
x['TotalSales_in_Billions'] = round(x['TotalSales']/1000000000, 2)
x['AvgSales_in_Millions'] = round(x['AvgSales']/1000000, 2)


# Create a function to pick 3 different colors for bar chart based on mean value of Total sales

def pick_colors_according_to_mean_sales(this_data):
    colors = []
    avg = this_data.TotalSales_in_Billions.mean()
    for each in this_data.TotalSales_in_Billions:
        if each > avg*1.01:
            colors.append('lightcoral')
        elif each < avg*0.99:
            colors.append('dodgerblue')
        else:
            colors.append('black')
    return colors



  
# Create new data frame using top 20 towns having the highest total sales
bottom = 0
top = 19
d = x.loc[bottom:top]
d = d.sort_values('TotalSales_in_Billions', ascending=True)
d.reset_index(inplace = True, drop = True)


my_colors = pick_colors_according_to_mean_sales(d)

# Add colors to legends
Above = mpatches.Patch(color = 'lightcoral', label='Above Average')
At = mpatches.Patch(color = 'black', label='Within 1% of the Average')
Below = mpatches.Patch(color = 'dodgerblue', label='Below Average')

fig = plt.figure(figsize = (18,12))

# axis 1
ax1 = fig.add_subplot(1,1,1);
ax1.barh(d['Town'], d.TotalSales_in_Billions, label='TotalSales_in_Billions', color = my_colors)


for row_counter, value_at_row_counter in enumerate(d.TotalSales_in_Billions):
    if value_at_row_counter > d.TotalSales_in_Billions.mean()*1.01:
        color = 'lightcoral'
    elif value_at_row_counter < d.TotalSales_in_Billions.max()*0.99:
        color = 'dodgerblue'
    else:
        color = 'black'
        
    ax1.text(value_at_row_counter + 0.5, row_counter, '$' + str(value_at_row_counter) + 'B',
    color = color, size=12, fontweight = 'bold',
    ha='left', va='center', backgroundcolor='white')
            
plt.xlim(0, d.TotalSales_in_Billions.max()*1.1)
# Format the x-axis label to show in billions
ax1.xaxis.set_major_formatter( FuncFormatter( lambda x, loc: f'${x:.0f}B'))

# Add legend
ax1.legend(loc='lower right', handles = [Above, At, Below], fontsize = 14)

# Add mean line
plt.axvline(d.TotalSales_in_Billions.mean(), color = 'black', linestyle='dashed')
ax1.text(d.TotalSales_in_Billions.mean()+0.5, 0, 'Mean = ' + '$' + str(round(d.TotalSales_in_Billions.mean(),2)) + 'B', rotation = 0, fontsize = 14)

# Add title
ax1.set_title('Top ' + str(top+1) +' Towns by Total Sales in the State of Connecticut', size=20)

# Add x and y labels
ax1.set_xlabel('Total Sales (in Billions)', fontsize=16)
ax1.set_ylabel('Town', fontsize=16)

plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

2. Total Sales Count by Month and Year

Below Scatterplot represents the total sales count by month and year from 2012 to 2022. Here’s a breakdown of the key insights:

X-Axis: Represents the months of the year (January to December).

Y-Axis: Represents the years ranging from 2012 to 2022.

Key Observations:

  1. In recent years (2020–2022), there are more yellow and orange dots, indicating higher sales counts in these periods compared to earlier years.

  2. The summer months (e.g., June, July, August) show more frequent instances of high sales counts (orange/yellow) for many years, suggesting a seasonal trend where sales peak during this period.

  3. But few years back especially from 2012 to 2014, there were consistently lower sales counts throughout the year.

This plot suggests that sales activity has increased over time, with the highest in recent years, particularly in summer months. The year 2021, for example, stands out with consistently high sales across several months. This could be because of lower interest rates during pandemic.

# Data preparation for Scatter plot

x = df.groupby(['Year', 'MonthName'])['Year'].count().reset_index(name = 'count')
x = pd.DataFrame(x)

# Keep only records of year for the past 10 years
x = x.loc[x['Year'].isin(range(2012,2023))]
x['count_tens'] = round(x['count']/10, 0)

# Sort data frame by the order of month
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
x['MonthName'] = pd.Categorical(x['MonthName'], categories=month_order, ordered=True)
x.sort_values(by='MonthName', inplace=True)

x = x.reset_index(drop = True)



 
plt.figure(figsize=(18,15))

# Scatter plot - Sales count by month and year
plt.scatter(x['MonthName'], x['Year'], marker = '8', cmap = 'plasma', c=x['count_tens'], s=x['count_tens'], edgecolors = 'black', vmin=1, vmax=700)

# Add title, x and y-axis label
plt.title('Total Sales Count by Month and Year', fontsize = 18, color = 'blue')
plt.xlabel('Months of the Year', fontsize = 14)
plt.ylabel('Year', fontsize = 14)

# Create color bar using sales count 
cbar = plt.colorbar()
cbar.set_label('Sales count', rotation = 270, fontsize = 14, color = 'black', labelpad = 30)

my_colorbar_ticks = [*range(100, int(x['count_tens'].max()), 100)]
cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = [*range(1000, int(x['count'].max()), 1000)]
my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_tick_labels]
cbar.set_ticklabels(my_colorbar_tick_labels)

# Add x and y ticks
plt.xticks(month_order, fontsize=14, color = 'black', rotation=30, ha = 'right')
my_y_ticks = [*range(x['Year'].min(), x['Year'].max()+1, 1)]
plt.yticks(my_y_ticks, fontsize=14, color = 'black')
plt.show()

3. Total Assessed Value for Single Family Homes by Year (Top 5 towns)

This line plot shows the Total Assessed Value for Single Family Homes over the time from 2012 to 2022 for the top 5 towns in Connecticut: Greenwich, Norwalk, Stamford, Stratford, and Bridgeport. Here’s a breakdown of the key insights:

X-Axis: Represents the years ranging from 2012 to 2022.

Y-Axis: Represents the total assessed value (in billions of dollars).

Key Observations:

  1. Greenwich (red line) consistently has the highest total assessed values compared to the other towns, but there is a notable fluctuations. For example, assessed values start around $3 billion and dropped significantly through 2016, rising again through 2017 with a peak value of $3.8 billion in 2021.

  2. Most of the towns show a slight dip around 2018, indicating a possible regional factor or event that impacted property values during that period.

The plot illustrates disparities in the real estate markets among these top towns, with Greenwich being the most well-known but also the most volatile. This visualization is useful for understanding trends in property assessments and potential factors affecting real estate valuations over time.

# Identify top 5 towns having the highest Assessed Value
x = df.groupby(['Town']).agg({'Town':['count'], 'Assessed Value':['sum']}).reset_index()
x.columns = ['Town', 'Count', 'TotalAssessedValue']
x = x.sort_values('TotalAssessedValue', ascending=False).reset_index()
toptowns = x.loc[0:4,'Town']
year_range = range(2012,2023)

# Create new data frame with Residential Type 'Single Family' and top 5 towns having the highest Assessed value

newdf = df[df['Residential Type'] == 'Single Family'] 
newdf = df[df['Town'].isin (toptowns)]
newdf = newdf[newdf['Year'].isin (year_range)]
assess_df = newdf.groupby(['Town', 'Year'])['Assessed Value'].sum().reset_index(name = 'TotalAssessedValue')
assess_df = assess_df.sort_values('Year', ascending=True)
assess_df.reset_index(inplace = True, drop = True)




fig = plt.figure(figsize = (18, 10))

ax = fig.add_subplot(1, 1, 1)

# Create dictionary for legends color coding
my_colors = {'Bridgeport': 'lime',
            'Greenwich': 'red',
            'Stratford': 'blue',
            'Norwalk': 'magenta',
            'Stamford': 'green',
            }

# Create x and y coordinates for annotation to point the highest total assessed value in the plot
max_AssessedValue = assess_df['TotalAssessedValue'].max()
max_AssessedValue_x = assess_df.loc[assess_df['TotalAssessedValue'] == max_AssessedValue, 'Year']
max_AssessedValue_y = assess_df.loc[assess_df['TotalAssessedValue'] == max_AssessedValue, 'TotalAssessedValue']
max_AssessedValue = '$'+str(round(max_AssessedValue*1e-9,1))+'B'

for key, grp in assess_df.groupby(['Town']):
    grp.plot(ax = ax, kind = 'line', x = 'Year', y = 'TotalAssessedValue', color = my_colors[key[0]], label = key, marker = '8')

# Point to the highest total assessed value 
ax.annotate(max_AssessedValue, xy =(max_AssessedValue_x.iloc[0], max_AssessedValue_y.iloc[0]), 
                xytext =(max_AssessedValue_x.iloc[0]-1, max_AssessedValue_y.iloc[0]),  
                arrowprops = dict(facecolor ='green', 
                                  shrink = 0.05),) 

# Add title and x, y-axis labels
plt.title('Total Assessed Value for Single Family Homes by Year (Top 5 towns)', fontsize=18, color = 'blue')
ax.set_xlabel('Year', fontsize=18)
ax.set_ylabel('Total Assessed Value ($B)', fontsize=18)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

year_min = assess_df['Year'].min()
year_max = assess_df['Year'].max()+1

year_list = range(year_min,year_max, 1)
ax.set_xticks(year_list)

ax.set_xticklabels(year_list, rotation=40, ha='right')

# Capture the current handle and label name on the legend
handles, labels = ax.get_legend_handles_labels()

# Sort the handle and label name to where it should be
handles = [ handles[1], handles[2], handles[3], handles[4], handles[0]]
labels = [ labels[1][1:-2], labels[2][1:-2], labels[3][1:-2], labels[4][1:-2], labels[0][1:-2]]

plt.legend(handles, labels, loc='best', fontsize=14, ncol=1)

# Format the y-axis label to show in billions
ax.yaxis.set_major_formatter( FuncFormatter( lambda x, pos:('$%1.1fB')%(x*1e-9)))

plt.show()

4. Total Sales by Quarter and Month

This Donut chart represents the total sales of $444.89 billion broken down by quarters and months. The chart is segmented into four quarters, with each section further divided to show the sales contribution of each month within the quarter. Here’s a breakdown of the key insights:

Key Observations:

  1. Quarter 3 is the most profitable quarter, accounting for nearly one-third (30.47%) of the total annual sales. Quarter 2 follows closely, with a share of 26.23%.

  2. January and February contribute the least to overall sales, making Quarter 1 the least productive quarter at 18.21%.

  3. The chart highlights a clear seasonal trend where sales peak in the middle of the year, specifically in Quarter 3, and are lower in the early months of the year (Quarter 1).

This visualization suggests that sales are seasonal, with a noticeable surge in the summer months (July and August) and a decline in the winter months (January and February).

#Create Quarter column for the outer circle in nested pie chart

df['Quarter'] = 'Quarter ' + df['Date Recorded'].dt.quarter.astype('string')
pie_df = df.groupby(['Quarter', 'MonthName', 'Month'])['Sale Amount'].sum().reset_index(name = 'TotalSaleAmount')
pie_df.sort_values(by = ['Month'], inplace=True)
pie_df.reset_index(inplace=True, drop=True)
del pie_df['Month']





# Set up inside and outside reference numbers for colors - reference numbers draw from 

# Outside circle
number_outside_colors = len(pie_df.Quarter.unique())

# Choose totally contrast colors by multiplying by 4
outside_color_ref_number = np.arange(number_outside_colors)*4

# Inside circle
number_inside_colors = len(pie_df.MonthName.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)
        
        
        
# Better keep it in square shape
fig = plt.figure(figsize = (7,7))
ax = fig.add_subplot(1, 1, 1)

# Perform outer circle
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

total_sales = pie_df.TotalSaleAmount.sum()

pie_df.groupby(['Quarter'])['TotalSaleAmount'].sum().plot(
    kind='pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor = 'white'), textprops= {'fontsize':13},
    autopct = lambda p: '{:.2f}%\n(${:.1f}B)'.format(p,(p/100)*total_sales/1e+9),
    startangle=90)

# Perform inner circle
inner_colors = colormap(inside_color_ref_number)
pie_df.TotalSaleAmount.plot(
    kind='pie', radius = 0.7, colors = inner_colors, pctdistance = 0.55, labeldistance = 0.8,
    wedgeprops = dict(edgecolor = 'white'), textprops= {'fontsize':11},
    labels = pie_df.MonthName, 
    autopct = '%1.2f%%',
    startangle=90)

# Add a hole in the middle
hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Sales by Quarter and Month', fontsize = 18)

# Add text to center circle
ax.text(0, 0, 'Total Sales\n' + '$' + str(round(total_sales/1e+9,2)) + 'B', size = 18, ha = 'center', va = 'center')

# Add labels inside each pie to appear in middle/stay away from border
ax.axis('equal')
plt.tight_layout()

plt.show()

5. Heat map of the Sales Count by Year and Residential Type

This heatmap shows the total sales count categorized by residential type and year from 2012 to 2022. Here’s a breakdown of the key insights:

X-Axis: Represents the residential type, which includes Two Family, Three Family, Single Family, Four Family, and Condo.

Y-Axis: Represents the years from 2012 to 2022.

Key Observations:

  1. Single-family homes are the most popular type, with substantial growth in sales, especially during 2020 and 2021, potentially reflecting a surge in the housing market during those years.

  2. Condos also saw an increase over the years, indicating a rising interest in this type of property.

  3. Multi-family homes (Two, Three, and Four Family) remain less common, with relatively stable sales counts.

  4. The sales for all residential types peaked in 2020 and 2021, suggesting that it could be because of lower interest rates due to the pandemic.

This heatmap effectively highlights the variations in housing sales trends across different types of residential properties over time.

x1 = df.groupby(['Year', 'Residential Type'])['Year'].count().reset_index(name = 'count')
x1 = pd.DataFrame(x1)

# Keep only records of year for the past 10 years
x1 = x1.loc[x1['Year'].isin(range(2012,2023))]

# Sort data frame 

type_order = ['Condo', 'Four Family', 'Three Family', 'Two Family', 'Single Family']

x1['Residential Type'] = pd.Categorical(x1['Residential Type'], categories=type_order, ordered=True)

x1.sort_values(by='Residential Type', inplace=True)

x1 = x1.reset_index(drop = True)

# Create new dataframe using pivot table for heat map
hm_df = pd.pivot_table(x1, index='Residential Type', columns = 'Year', values='count')



fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))

ax = sns.heatmap(hm_df, linewidth = 0.2, annot = True, cmap = 'Spectral_r', fmt = ',.0f',
                 square=True, annot_kws = {'size': 11}, 
                 cbar_kws = {'format': comma_fmt, 'orientation': 'vertical'})

# Set title, x and y labels
plt.title('Heatmap of the Sales count by Year and Residential type', fontsize=18,pad=15)
plt.xlabel('Year', fontsize=18, labelpad=10)
plt.ylabel('Residential type', fontsize=18, labelpad=10)
plt.yticks(rotation = 0, size=14)
plt.xticks(size=14)
ax.invert_yaxis()

# Format colorbar
cbar = ax.collections[0].colorbar

max_count = int(hm_df.to_numpy().max())

my_colorbar_ticks = [*range(5000, max_count, 5000)]

cbar.set_ticks(my_colorbar_ticks)

my_colorbar_tick_labels = ['{:,}'.format(each) for each in my_colorbar_ticks]

cbar.set_ticklabels(my_colorbar_tick_labels)

cbar.set_label('Sales Count', rotation=270, fontsize = 14, color = 'black', labelpad = 20)

plt.show()

6. Town Ranking by Sales Count for the year 2015 - 2020

This bump chart represents the town rankings by sales count for the years 2015 to 2020 among five towns: Stamford, Waterbury, Norwalk, Bridgeport, and Danbury. The ranking is visualized from 1 (highest sales count) to 5 (lowest sales count). Here’s a breakdown of the key insights:

X-Axis: Represents the years.

Y-Axis: Represents the town rankings. The ranking is visualized from 1 (highest sales count) to 5 (lowest sales count)

Key Observations:

  1. Stamford consistently showed strong performance, regaining the 1st position in 2020 after a dip.

  2. Bridgeport showed a significant shift from being at the lowest to the top-ranked town from 2016 to 2019, and falling again to the bottom ranks by 2019.

  3. Norwalk exhibited volatility, peaking at 2nd place in 2018 and dropping to the lowest rank in 2019

  4. Waterbury showed a decline in its ranking after being at 2nd place early in the period.

  5. Danbury showed a little improvement over time, moving up from 5th to 4th place by the end of 2020.

This chart highlights shifts in sales performance among these towns over time.

# Identify top 5 towns having the highest Assessed Value
bump_df = df.loc[df['Year'].isin(range(2015,2021))].groupby(['Town'])['Town'].count().reset_index(name='count')
bump_df.sort_values(by = 'count', ascending=False, inplace=True) 
top5  = bump_df['Town'].iloc[:5,].tolist()

bump_df = df.loc[df['Year'].isin(range(2015,2021))].groupby(['Year','Town'])['Year'].count().reset_index(name = 'SaleCount')
bump_df = bump_df.loc[bump_df['Town'].isin(top5)]
bump_df.columns = ['Year','Town', 'SaleCount']
bump_df = bump_df.sort_values(by = ['Town','Year'], ascending=False).reset_index()

bump_df = bump_df.pivot(index = 'Town', columns = 'Year', values = 'SaleCount')

bump_df = bump_df.dropna()

bump_df_ranked = bump_df.rank(0, ascending=False, method='min')

bump_df_ranked = bump_df_ranked.T





# Plotting bump chart

fig = plt.figure(figsize = (18, 10))
ax = fig.add_subplot(1, 1, 1)

bump_df_ranked.plot(kind = 'line', ax=ax, marker = 'o', markeredgewidth = 1, 
                   linewidth = 6, markersize=18, markerfacecolor='white'
                   )

ax.invert_yaxis()

x_ticks = np.unique(bump_df_ranked.index)
y_ticks = bump_df_ranked.shape[1]

plt.ylabel('Town Ranking (by Sales count)', fontsize=18, labelpad=10)
plt.title('Town Ranking by Sales count \n Year 2015-2020', fontsize=18, pad=15)
plt.xticks(x_ticks, fontsize=14)
plt.yticks(range(1, y_ticks+1, 1), fontsize=14)
ax.set_xlabel('Year', fontsize=18)

handles, labels = ax.get_legend_handles_labels()
handles = [handles[3], handles[4], handles[2], handles[0], handles[1]]
labels = [labels[3], labels[4], labels[2], labels[0], labels[1]]
ax.legend(handles, labels, bbox_to_anchor=(1.01, 1.01), fontsize=14,
         labelspacing = 1, 
         markerscale = .5,
         borderpad = 1,
         handletextpad = 0.8)

plt.tight_layout()

plt.show()

Conclusion

These visualizations present a comprehensive picture of the residential real estate market using the data about sales price, assessed values, transaction volumes, and the dynamics of different towns and housing types.

When taken as a whole, these visualizations help stakeholders—including investors, legislators, and real estate experts—understand market dynamics, spot trends, and make informed judgments.