House Sales Data Visualization

import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'c:/ProgramData/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter
import plotly.graph_objects as go
import plotly.io as pio
import warnings

warnings.filterwarnings("ignore")
path = "U:/"
filename = path + 'Real_Estate_Sales_2001-2018_GL.csv'
house_df = pd.read_csv(filename)

house_df['DateRecorded'] = pd.to_datetime(house_df['DateRecorded'], format = '%m/%d/%y')

house_df['Day'] = house_df.DateRecorded.dt.day
house_df['Month'] = house_df.DateRecorded.dt.month
house_df['Year'] = house_df.DateRecorded.dt.year
house_df['WeekDay'] = house_df.DateRecorded.dt.strftime('%a')
house_df['MonthName'] = house_df.DateRecorded.dt.strftime('%b')

Introduction

For the data visualization assignment, I wanted to find a dataset related to housing because I thought that there would be some interesting visualizations which could be made from the data. Also, I have an interest in real estate and thought that a dataset on the topic from somewhere in the United States could provide some interesting insights and serve as a learning experience. As a result, I searched for a dataset that contained information on real estate sales. The dataset that I chose contains information on home sales for the state of Connecticut from the year 2001 to the year 2018. As a result, the following visualizations provide some valuable information relating to the housing market conditions in the state of Connecticut from that year range.

Dataset

Between October 1 and September 30 of each year, the Office of Policy and Management keeps track of all real estate transactions with a selling price of $2,000 or more. This file contains information about the town, the property address, the date of sale, the property type (residential, apartment, commercial, manufacturing, or vacant land), the purchase price, and the property assessment for each sale record for each house sold in Connecticut from the year 2001 to the year 2018.

Findings

Top 20 Towns

For my first visualization, I wanted to find out the top towns in which houses were sold in the dataset so that I could get a better grasp on the housing market in the state of Connecticut. As a result, my first visualization focuses on a count of the top twenty towns in which houses were sold from the year 2001 to the year 2018. It revealed that the top three towns were Bridgeport, Stamford, and Waterbury. After creating the visualization, it sparked new questions such as why some areas had been selling more houses than others, whether or not the population in these towns was increasing or decreasing, and whether or not construction is creating such a boom in these towns.


x = house_df.groupby(['Town']).agg({'Town':['count'], 'SaleAmount':['sum', 'mean']}).reset_index()

x.columns = ['Town', 'Count', 'TotalSales', 'AverSales']

x = x.sort_values('Count', ascending=False)

x.reset_index(inplace=True, drop=True)

def pick_colors_according_to_mean_count(this_data):
    colors=[]
    avg = this_data.Count.mean()
    for each in this_data.Count:
        if each > avg*1.01:
            colors.append('green')
        elif each < avg*0.99:
            colors.append('gold')
        else:
            colors.append('black')
    return colors

bottom1 = 0
top1 = 19
d1 = x.loc[bottom1:top1]
d1 = d1.sort_values('Count', ascending=True)
d1.reset_index(inplace=True, drop=True)
my_colors1 = pick_colors_according_to_mean_count(d1)

Above = mpatches.Patch(color='green', label = 'Above Average')
At = mpatches.Patch(color='black', label = 'Within 1% of the Average')
Below = mpatches.Patch(color='gold', label = 'Below Average')

fig = plt.figure(figsize=(18, 12))
ax1 = fig.add_subplot(1,1,1)
ax1.barh(d1.Town, d1.Count, color=my_colors1)

## <BarContainer object of 20 artists>

for row_counter, value_at_row_counter in enumerate(d1.Count):
    if value_at_row_counter > d1.Count.mean()*1.01:
        color= 'green'
    elif value_at_row_counter < d1.Count.mean()*0.99:
        color='gold'
    else:
        color='black'
    ax1.text(value_at_row_counter+2, row_counter, str(value_at_row_counter), color=color, size=12, fontweight='bold',
          ha='left', va='center', backgroundcolor='white')
plt.xlim(0, d1.Count.max()*1.1)

## (0.0, 35618.0)

ax1.legend(loc='lower right', handles=[Above, At, Below], fontsize=14)
plt.axvline(d1.Count.mean(), color='black', linestyle='dashed')
ax1.text(d1.Count.mean() + 2, 0, '  Mean = ' + str(d1.Count.mean()), rotation=0, fontsize=14)

ax1.set_title('Top ' + str(top1+1) + ' Towns', size=20)
ax1.set_xlabel('Town Count', fontsize=16)
ax1.set_ylabel('Town', fontsize=16)
plt.xticks(fontsize=14)

## (array([    0.,  5000., 10000., 15000., 20000., 25000., 30000., 35000.,
##        40000.]), [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])

plt.yticks(fontsize=14)

## ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])

plt.show()

Total Sales by Year and by Day

For my second graph I wanted to visualize whether a trend could be determined in terms of the year and day data points for house sales. Mainly, I wanted to see if the recession of 2008 had a large impact on the amount of houses sold in Connecticut and also see if weekdays also had an impact. The results of the graph were not very surprising as the housing market never truly recovered from the 2008 recession. The most houses sold in Connecticut was back in 2005. Despite the USA having a larger population, the housing market did not fully bounce back. However, in terms of the day datapoint, Tuesday was a day for the most amount of house closings. Again new questions arose from this visualization such as whether or not fear has an impact on the housing market, why Tuesday is considered the most popular day to close on a house, and why, despite a lower unemployment rate in 2017 from 2016, were house sales down.


sale_df = house_df.groupby(['Year', 'WeekDay'])['SaleAmount'].sum().reset_index(name='TotalSales')

stacked_df = house_df.groupby(['Year', 'WeekDay'])['SaleAmount'].sum().reset_index(name='TotalSales')
stacked_df = stacked_df.pivot(index='Year', columns='WeekDay', values='TotalSales')

day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
stacked_df = stacked_df.reindex(columns=reversed(day_order))

fig = plt.figure(figsize=(18, 10))
ax = fig.add_subplot(1, 1, 1)
stacked_df.plot(kind='bar', stacked=True, ax=ax)

plt.ylabel('Total Sales', fontsize=18, labelpad=10)
plt.title('Total Sales by Year and by Day \n Stacked Bar Plot')
plt.xticks(rotation=0, horizontalalignment='center', fontsize=14)

## (array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
##        17, 18]), [Text(0, 0, '2001'), Text(1, 0, '2002'), Text(2, 0, '2003'), Text(3, 0, '2004'), Text(4, 0, '2005'), Text(5, 0, '2006'), Text(6, 0, '2007'), Text(7, 0, '2008'), Text(8, 0, '2009'), Text(9, 0, '2010'), Text(10, 0, '2011'), Text(11, 0, '2012'), Text(12, 0, '2013'), Text(13, 0, '2014'), Text(14, 0, '2015'), Text(15, 0, '2016'), Text(16, 0, '2017'), Text(17, 0, '2018'), Text(18, 0, '2019')])

plt.yticks(fontsize=14)

## (array([0.0e+00, 5.0e+09, 1.0e+10, 1.5e+10, 2.0e+10, 2.5e+10, 3.0e+10,
##        3.5e+10]), [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])

ax.set_xlabel('Year Sold', fontsize=18)

handles, labels = ax.get_legend_handles_labels()
handles = [ handles[6], handles[5], handles[4], handles[3], handles[2], handles[1], handles[0] ]
labels = [labels[6], labels[5], labels[4], labels[3], labels[2], labels[1], labels[0]]
plt.legend(handles, labels, loc='best', fontsize=14)

ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('$%1.1fB')%(x*1e-9)))


plt.show()

Total Sales by Quarter and Month

For my third graph, I broke down the housing market by both quarter and month. I wanted to visualize if a seasonal effect had any impact on house sales. My personal thought was that the most houses sold would correlate to months associated with spring. In the spring, cleaning and thoughts of warmer climates can cause more houses to be available on the market. The results from my pie chart, indicated that a seasonal effect has a visible impact on the housing market. The Spring proved to be the second most successful in terms of house sales following the Summer months. This makes sense considering the fact that many in the field of real estate mention that the summer months are the times when most people are interested in buying a house. This did reveal questions of the nature of most of the sales in the summer months, such as whether or not a lot of those sales were short term sales, such as a vacation home.

house_df['Quarter'] = 'Quarter ' + house_df.DateRecorded.dt.quarter.astype('string')

pie_df = house_df.groupby(['Quarter', 'MonthName', 'Month'])['SaleAmount'].sum().reset_index(name='TotalSales')

pie_df.sort_values(by=['Month'], inplace=True)

pie_df.reset_index(inplace=True, drop=True)

del pie_df['Month']

number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_outside_colors = len(pie_df.Quarter.unique())
outside_color_ref_number = np.arange(number_outside_colors)*4

number_inside_colors = len(pie_df.MonthName.unique())
all_color_ref_number = np.arange(number_outside_colors + number_inside_colors)

inside_color_ref_number = []
for each in all_color_ref_number:
    if each not in outside_color_ref_number:
        inside_color_ref_number.append(each)

fig = plt.figure(figsize = (10, 10))
ax = fig.add_subplot(1, 1, 1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(outside_color_ref_number)

all_sales = pie_df.TotalSales.sum()

pie_df.groupby(['Quarter'])['TotalSales'].sum().plot(
    kind = 'pie', radius = 1, colors = outer_colors, pctdistance = 0.85, labeldistance = 1.1,
    wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':18},
    autopct = lambda p: '{:.2f}%\n(${:.1f}B)'.format(p,(p/100)*all_sales/1e+9),
    startangle = 90)

inner_colors = colormap(inside_color_ref_number)

pie_df.TotalSales.plot(
    kind = 'pie', radius = 0.7, colors = inner_colors, pctdistance = 0.55, labeldistance = 0.8,
    wedgeprops = dict(edgecolor = 'w'), textprops = {'fontsize':11},
    labels = pie_df.MonthName,
    autopct = '%1.2f%%',
    startangle = 90)

## <AxesSubplot:ylabel='TotalSales'>

hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Sales by Quarter and Month', fontsize=18)

ax.text(0, 0, 'Total Sales\n' + '$' + str(round(all_sales/1e9, 2)) + "B", size = 18, ha = 'center', va = 'center')

ax.axis('equal')

## (-1.100781016756437, 1.1020800045603707, -1.1124991916752727, 1.1005952776595949)

plt.tight_layout()

plt.show()

Deviation between Actual and Breakeven Monthly Sales in 2017

For the waterfall graph, I wanted to analyze the deviation from a pretend breakeven point for house sales in Connecticut on a monthly basis in the year 2017. Based on the visualization, with a pretend breakeven point of $1,400,000,000, the months of January, February, March, April, May, and the beginning of June were under the breakeven point. However, the sales started to pick up in middle to late June, only to begin to decline in December. This graph proved to be very valuable because it builds on the idea that the summer months are the most profitable. This is a point which is clearly visualized by the graph, since the biggest climbs in revenue away from the breakeven point appeared in the summer months.


x = house_df.groupby(['AssessedValue', 'ListYear'])['AssessedValue'].count().reset_index(name='count')
x = pd.DataFrame(x)

x['count_hundreds'] = round(x['count']/100, 0)

wf_df = house_df[house_df['Year'] == 2017].groupby(['MonthName'])['SaleAmount'].sum().reset_index(name='TotalSales')
wf_df['Breakeven'] = 1.4e9
wf_df['Deviation'] = wf_df.TotalSales - wf_df.Breakeven

wf_df.loc[wf_df.index.max()+1] = ['Total',
                                  wf_df.TotalSales.sum(),
                                  wf_df.Breakeven.sum(),
                                  wf_df.TotalSales.sum() - wf_df.Breakeven.sum()]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec', 'Total']

wf_df.MonthName = pd.Categorical(wf_df.MonthName, categories = months, ordered = True)

wf_df.sort_values(by='MonthName', inplace=True)

wf_df.reset_index(inplace = True, drop = True)

if wf_df.loc[12, 'Deviation'] > 0:
    end_color = 'black'
elif wf_df.loc[12, 'Deviation'] < 0:
    end_color = 'red'
else: end_color = 'blue'

fig = go.Figure(go.Waterfall(name='', orientation = 'v', x = wf_df['MonthName'], textposition='outside',
                             measure = ['relative', 'relative', 'relative', 'relative', 'relative', 'relative',
                                        'relative', 'relative', 'relative', 'relative', 'relative', 'relative', 'total'],
                             y = wf_df['Deviation']/1e9,
                             text = ['${:.2f}B'.format(each/1e9) for each in wf_df['TotalSales']],
                             decreasing = {'marker':{'color':'red'}},
                             increasing = {'marker':{'color':'green'}},
                             totals = {'marker':{'color': end_color}},
                             hovertemplate = 'Cumulative Deviation to Date:  ' + '$%{y:,.2f}B' + '<br>' +
                                             'Total Fines in %{x}:  %{text}' 
            ))
fig.layout = go.Layout(yaxis=dict(tickformat ='.1f'))

fig.update_xaxes(title_text='Months', title_font = {'size': 18})

fig.update_yaxes(title_text='Total Sales (Running Total $B)', title_font = {'size':18}, 
                dtick=0.5, tickprefix = '$', ticksuffix = 'B', zeroline=True)

fig.update_layout(title = dict(text = 'Deviation between Actual and Breakeven Monthly Sales in 2017 (Waterfall Diagram)<br>' +
                              'Surpluses appear in Green, Deficits appear in Red',
                              font = dict(family='Arial', size=18, color='black')),
                  template='simple_white',
                  title_x = 0.5,
                  showlegend = False,
                  autosize=True,
                  margin = dict(l=30, r=30, t=60, b=30)
                 )
#fig.show()

pio.write_html(fig, "U:/plotly_result_ohl.html", auto_open=False)

Count of Residential Type by Top 10 Towns

For my fifth graph, I wanted to analyze the type of residential house that was sold in each of the top ten towns and to visualize if a specific town had a most popular housing option. I based these values for the graph based on the assessed value of each house, instead of on the amount the house was sold for in order to get a better view of the housing market from the perspective of perceived house value in each area. Based on the visualization, it is evident that single family residential type houses were the most popular among the top ten towns, which ultimately makes sense considering single family homes are the most popular homes that people purchase.

towns = house_df[house_df.PropertyType == 'Residential']

sales = towns[['Town', 'PropertyType', 'ResidentialType', 'SalesRatio']]

sales = sales.dropna()

round(sales.groupby('Town')['SalesRatio'].mean(), 2)

## Town
## Andover          0.92
## Ansonia          1.07
## Ashford          1.03
## Avon             1.05
## Barkhamsted      0.99
##                  ... 
## Windsor Locks    0.74
## Wolcott          0.91
## Woodbridge       1.11
## Woodbury         0.98
## Woodstock        0.98
## Name: SalesRatio, Length: 169, dtype: float64

round(sales.groupby('ResidentialType')['SalesRatio'].mean(), 2)

## ResidentialType
## Condo            0.87
## Four Family      1.49
## Single Family    8.94
## Three Family     1.53
## Two Family       1.26
## Name: SalesRatio, dtype: float64

house = towns[['Town', 'ResidentialType', 'SalesRatio']]

residents_df = house.groupby('Town')['ResidentialType'].count()
residents_df = residents_df.reset_index()

residents_df = residents_df.sort_values(['ResidentialType'], ascending = False)

houses_sold_df = house.groupby(['ResidentialType', 'Town']).count()
houses_sold_df = houses_sold_df.reset_index()

houses_sold_df = houses_sold_df.sort_values(['Town'], ascending = True)
houses_sold_df.columns = ['ResidentialType', 'Town', 'Count']

houses_sold_df = houses_sold_df.dropna()

houses_sold_df = houses_sold_df.sort_values(['Count'], ascending = False)

house_hm_df = pd.pivot_table(houses_sold_df, index='Town', columns = 'ResidentialType', values = 'Count')

fig = plt.figure(figsize=(20, 15))
ax = fig.add_subplot(1, 1, 1)

comma_fmt = FuncFormatter(lambda x, p: format(int(x), ','))

ax = sns.heatmap(house_hm_df.head(10), linewidth = 0.5, annot = True, cmap = 'coolwarm', fmt=',.0f',
                square = True, annot_kws={'size':11},
                cbar_kws = {'format': comma_fmt, 'orientation': 'vertical'})

plt.show()

Conclusion

Based on the visualizations I created in order to analyze the housing market in the state of Connecticut from the year 2001 to the year 2018, I gained some interesting insights. These insights included things that I had not known, such as the top twenty towns in which houses were sold in the state of Connecticut, the year and day in which the most number of houses were sold, the quarter and months in which the most number of houses were sold, the months in which there was the biggest rise in sales from a pretend breakeven point, which also confirmed the months in which the most amount of house sales occurred, and the type of residential house that sells the most for the top ten towns in the state of Connecticut based on assessed market value of houses. Ultimately, I thoroughly enjoyed making these visualizations and was able to gain a lot of useful input. I would like to compare these results with results from other housing sales of other states in the United States from the same time range in order to see if my insights exist in more than just a single state.