Airbnb NYC

Introduction

Airbnb, Inc. is an American company that operates an online marketplace where people rent out their homes or apartments to travelers seeking temporary accommodations. Airbnb began in 2007, when two ‘Hosts’ welcomed three guests to their San Francisco home, and has since grown to over 4 million ‘Hosts’ who have welcomed 1.4 billion guest stays around the globe. This data visualization analysis looks at Airbnb listings in New York City across its five neighborhood groups: Manhattan, Queens, Brooklyn, Bronx, and Staten Island. We will look at Airbnb counts across all 225 NYC neighborhoods, then focus on the 20 neighborhoods with the most Airbnbs, which of the five neighborhood groups have the most Airbnbs, differences in the average price for a night’s stay, differences in the average minimum nights per stay, the median price per night by neighborhood group and construction year, and lastly the number of reviews a residence has, given its neighborhood group and construction year. Hopefully, these visualizations will help guide Airbnb users in deciding where to lodge in NYC.

Dataset

The dataset used in this analysis was taken from Kaggle and has 102,599 rows and 26 columns, including fields such as neighborhood group, neighborhood, price, construction year, host id, and room type. Many of the columns had a few NA values, so I dropped any rows that had NAs in the fields used in my analysis, including ‘price’, ‘construction year’, ‘neighborhood’, ‘neighborhood group’, and ‘minimum nights’. In addition, I had to reformat the ‘price’ field, which was stored as text in the format ‘$0,000’, into an integer for my analysis. Luckily, other than the NAs and the price field, this dataset was in very good shape.
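
As a minimal sketch of the cleaning described above, the snippet below drops rows with missing values in the required fields and parses ‘$0,000’-style prices into integers. The toy rows and the helper name `clean_price` are assumptions made for illustration; they are not taken from the Kaggle file.

```python
import pandas as pd

# Toy rows standing in for the Kaggle file; the values are made up for illustration.
toy = pd.DataFrame({
    "price": ["$966", "$1,142", None, "$204"],
    "minimum nights": [10, 30, 3, None],
})

def clean_price(frame, required):
    """Drop rows missing any required field, then parse '$1,142' style prices."""
    out = frame.dropna(subset=required).copy()
    out["price"] = (
        out["price"]
        .str.replace("$", "", regex=False)   # regex=False: treat '$' as a literal
        .str.replace(",", "", regex=False)
        .astype(int)
    )
    return out

cleaned = clean_price(toy, ["price", "minimum nights"])
print(cleaned["price"].tolist())
```

Only the first two toy rows survive, because the third has no price and the fourth has no minimum-nights value.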

Findings

From a high-level perspective, this analysis produced several findings when comparing Airbnbs across NYC neighborhoods and neighborhood groups. Across all 225 neighborhoods, the range of Airbnb counts is quite large, with the largest neighborhood having nearly 8,000 Airbnbs and the smallest having 1. The neighborhood with the most Airbnbs is Bedford-Stuyvesant, and the mean Airbnb count in the 20 most Airbnb-populated neighborhoods is 3,252 Airbnbs per neighborhood. When comparing the five neighborhood groups, I found that Manhattan had the most Airbnbs with 43,792 residences, with Brooklyn following closely behind at 41,842 residences. What is interesting is that Manhattan and Brooklyn also had the largest average minimum night stays, although the average price per night remained consistent across all five neighborhood groups. Lastly, I found that Staten Island and the Bronx had the most variance in their median Airbnb prices by construction year. In Staten Island, for example, the median price for homes built in 2008 was $737 a night, while the median price for homes built in 2015 was only $423 a night. Overall, this analysis gave useful insight into Airbnb’s presence in NYC.

Airbnb Count Across all 225 Neighborhoods

This visualization is a bar chart that looks at the count of Airbnbs in each of NYC’s 225 neighborhoods. As you can see, the range from the most Airbnb-populated neighborhoods to the least populated is quite large, with the most populated having nearly 8,000 Airbnbs and the least populated appearing to have 1. This visualization is color-coded relative to the average count of Airbnbs per neighborhood, which is 455, as shown by the mean line placed in the visual. If a neighborhood’s count is more than 1% above 455, the bar is blue; if it is more than 1% below 455, the bar is pink; and if it is within 1% of 455, the bar is green.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import warnings
import seaborn as sns
from matplotlib.ticker import FuncFormatter

warnings.filterwarnings('ignore')

filename = "/Users/allisonkeck/Desktop/IS460 Data Visualization/Python Files/data/Airbnb_Open_Data.csv"
df = pd.read_csv(filename)
#assign the result back instead of using inplace on a column selection
df['neighbourhood'] = df['neighbourhood'].fillna("Unknown")

def pick_colors(my_data):
    colors = []
    #get the avg count
    avg = my_data.Count.mean()
    for each in my_data.Count:
        #if more than 1% above the mean, make the bar light blue
        if each > avg*1.01:
            colors.append('lightblue')
        #if more than 1% below the mean, make the bar light pink
        elif each < avg*0.99:
            colors.append('lightpink')
        #if within 1% of the mean, make the bar green
        else:
            colors.append('green')
    return colors
  
####VISUALIZATION #1: Look at Frequency of Airbnb's In All 225 NYC Neighborhoods
#color-coded by the avg number of airbnbs

#use the entire dataframe of 225 neighborhoods
x = df.groupby(['neighbourhood']).agg({'neighbourhood':['count']}).reset_index()
x.columns = ['neigbourhood', 'Count']
x = x.sort_values('Count', ascending=False)
x.reset_index(inplace=True, drop=True)

bottom1 = 0
top1 = 224
d1= x.loc[bottom1:top1]
my_colors1 = pick_colors(d1)
#create legend colors
above = mpatches.Patch(color='lightblue', label='Above Average')
at = mpatches.Patch(color='green', label='Within 1% of the Average')
below = mpatches.Patch(color='lightpink', label='Below Average')

#create new figure and add title
fig = plt.figure(figsize=(18,12))

#create ax1 AKA row 1
ax1 = fig.add_subplot(1,1,1)
ax1.bar(d1['neigbourhood'], d1.Count, label='Count', color=my_colors1)
#ax1.legend(fontsize=14)
#add a legend for above, at, and below

ax1.legend(handles=[above, at, below],fontsize=14)
#add avg line
plt.axhline(d1.Count.mean(), color="black", linestyle='dashed')
#get rid of right and top borders
ax1.spines['right'].set_visible(False)
ax1.spines['top'].set_visible(False)
#hide ticks and labels on x axis
ax1.axes.xaxis.set_visible(False)
#set y label
ax1.set_ylabel('Airbnb Count', fontsize=16)
#format y tick labels with thousands separators; a formatter stays in sync if the ticks change
ax1.yaxis.set_major_formatter(FuncFormatter(lambda v, pos: '{:,}'.format(int(v))))
#add title
ax1.set_title("Frequency Analysis of All Airbnbs in NYC's "+ str(top1+1) + ' Neighborhoods', size=20, fontweight='bold')
#include text to show the user the average; the first two arguments give the location of this text
ax1.text(top1-20, d1.Count.mean()+50, 'Mean = ' + str(int(d1.Count.mean())), rotation=0, fontsize=14)

plt.show()
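
The color rule used above can also be written without an explicit loop. The sketch below is one possible alternative, not the code that produced the chart; the function name `pick_colors_vectorized` and the `tolerance` parameter are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def pick_colors_vectorized(counts, tolerance=0.01):
    """Return one color per count: above, below, or within `tolerance` of the mean."""
    counts = pd.Series(counts, dtype=float)
    avg = counts.mean()
    conditions = [
        counts > avg * (1 + tolerance),   # more than 1% above the mean
        counts < avg * (1 - tolerance),   # more than 1% below the mean
    ]
    # Anything matching neither condition is within 1% of the mean.
    return np.select(conditions, ["lightblue", "lightpink"], default="green").tolist()

# Made-up counts with a mean of 200, used only to exercise the three branches.
example = pick_colors_vectorized([100, 200, 300])
print(example)
```

With these made-up counts the three branches give pink, green, and blue in that order.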

Top 20 Most Airbnb Populated Neighborhoods

This visualization is a horizontal bar chart that looks at the 20 most Airbnb-populated neighborhoods. Bedford-Stuyvesant has the most Airbnbs with 7,937 residences, and Williamsburg trails closely behind with 7,775. Like the graph of all 225 neighborhoods, this chart is color-coded relative to the average count of Airbnbs in these 20 neighborhoods. As the mean line shows, that average is 3,283 Airbnbs per neighborhood, so a bar is blue if the neighborhood’s count is more than 1% above the average, pink if it is more than 1% below, and green if it is within 1% of the average.

#select the top 20 neighborhoods with the most airbnbs
#####HORIZONTAL BAR CHART#### visualization 2
def pick_colors(my_data):
    colors = []
    #get the avg count
    avg = my_data.Count.mean()
    for each in my_data.Count:
        #if more than 1% above the mean, make the bar light blue
        if each > avg*1.01:
            colors.append('lightblue')
        #if more than 1% below the mean, make the bar light pink
        elif each < avg*0.99:
            colors.append('lightpink')
        #if within 1% of the mean, make the bar green
        else:
            colors.append('green')
    return colors

x = df.groupby(['neighbourhood']).agg({'neighbourhood':['count']}).reset_index()
x.columns = ['neigbourhood', 'Count']
x = x.sort_values('Count', ascending=False)
x.reset_index(inplace=True, drop=True)

bottom2 = 0
top2 = 19
d2 = x.loc[bottom2:top2]
d2 = d2.sort_values('Count', ascending = True)
d2.reset_index(inplace=True, drop=True)
my_colors2 = pick_colors(d2)

#create legend colors
above = mpatches.Patch(color='lightblue', label='Above Average')
at = mpatches.Patch(color='green', label='Within 1% of the Average')
below = mpatches.Patch(color='lightpink', label='Below Average')

#create figure
fig = plt.figure(figsize=(18,12))
#create plot
ax1 = fig.add_subplot(1,1,1)
ax1.barh(d2['neigbourhood'], d2.Count, color=my_colors2)

#set chart title

ax1.set_title('Top ' + str(top2+1) +  ' NYC Neighborhoods with Most Airbnbs', size=20, fontweight='bold')
#set x label
ax1.set_xlabel('Airbnb Count', fontsize=16)
#set y label
ax1.set_ylabel('Neighborhood Name', fontsize=16)
#set ticks to a new size
plt.xticks(fontsize=14)

plt.yticks(fontsize=14)

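#format x tick labels with thousands separators; a formatter stays correct after xlim changes below
ax1.xaxis.set_major_formatter(FuncFormatter(lambda v, pos: '{:,}'.format(int(v))))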

#add labels to the bars
#loop over the index of each row and the value at that index
for row_counter, value_at_row_counter in enumerate(d2.Count):
    #change color of labels based on bar color
    if value_at_row_counter > d2.Count.mean()*1.01:
        color = 'lightblue'
    elif value_at_row_counter < d2.Count.mean()*0.99:
        color='lightpink'
    else:
        color='green'
    #actually add the labels
    ax1.text(value_at_row_counter+65, row_counter, str('{:,}'.format(int(value_at_row_counter))), color=color, size=12, fontweight='bold',
             ha='left',va='center', backgroundcolor='white')
    
#extend x axis for the labels
plt.xlim(0, d2.Count.max()*1.1)

#add the legend

ax1.legend(loc='lower right', handles=[above, at, below], fontsize=14)

#add mean line through the graph
plt.axvline(d2.Count.mean(), color='black', linestyle='dashed')
ax1.text(d2.Count.mean()+50, 0, 'Mean = ' + str('{:,}'.format(int(round(d2.Count.mean(),0)))), rotation=0, fontsize=14)

plt.show()

Airbnbs by Five Main Neighborhood Groups

This visualization is a pie chart that looks at the Airbnb count in each of the five NYC neighborhood groups: Brooklyn, Manhattan, Bronx, Staten Island, and Queens. The total number of Airbnbs in NYC is 102,568, with Manhattan making up 42.70% of that with 43,792 Airbnbs. Brooklyn, the neighborhood group with the second-most Airbnbs, makes up 40.79% of the total with 41,842 Airbnbs. Staten Island and the Bronx have the fewest Airbnbs, each making up less than 5% of the total.

#drop rows where neighborhood group is misspelled
df = df[(df['neighbourhood group'] != 'brookln')]
df = df[(df['neighbourhood group'] != 'manhatan')]
#VISUALIZATION 4 - PIE CHART - looking at count of airbnbs by neighbourhood group
pie_df = df.groupby(['neighbourhood group'])['neighbourhood group'].count().reset_index(name='TotalAirbnb')
#count how many unique colors we need, one per neighbourhood group
number_outside_colors = len(pie_df['neighbourhood group'].unique())
#multiply by 4 so there's a big range of colors
outside_color_ref_number = np.arange(number_outside_colors)*4
#Donut Chart
#pick a square shape
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(1,1,1)

#gets the color map
colormap = plt.get_cmap("tab20c")
#picks out colors 0,4,8,12,16 (one per neighbourhood group)
outer_colors = colormap(outside_color_ref_number)

total_airbnb = pie_df.TotalAirbnb.sum()


pie_df.groupby(['neighbourhood group'])['TotalAirbnb'].sum().plot(
    kind='pie', radius=1, colors=outer_colors, pctdistance = 0.85, 
    labeldistance=1.1, wedgeprops = dict(edgecolor='White'), textprops = {'fontsize':16},
    #label only slices above 5%; return an empty string (not None) for the small slices
    autopct = lambda p: '{:.2f}%\n({:,.0f})'.format(p,(p/100)*total_airbnb) if p > 5 else '',
    startangle=90)

hole = plt.Circle((0,0), 0.3, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)

#remove y axis
ax.yaxis.set_visible(False)
plt.title('Airbnbs in NYC by Neighborhood Group', fontsize=18, fontweight='bold')

#add data to the white center
ax.text(0, 0, 'Total Airbnbs\n'+str('{:,}'.format((round(pie_df.TotalAirbnb.sum())))),ha='center',va='center',size=15)

#make the scale of both axes equal so the pie is a circle
ax.axis('equal')

#fix labels that are outside the circumference

plt.tight_layout()

plt.show()
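
The shares quoted for this chart can be checked with a few lines of arithmetic. This is a small sketch that uses only the counts stated in the text; the other three neighborhood groups are not itemized there, so they are left out.

```python
# Counts quoted in the text for this chart.
total = 102_568
counts = {"Manhattan": 43_792, "Brooklyn": 41_842}

# Share of the NYC total for each group, as a percentage rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

# Combined share of the two largest groups.
combined = round(100 * sum(counts.values()) / total, 2)

print(shares, combined)
```

The two shares round to 42.70% and 40.79%, and together the two groups account for roughly 83.5% of all listings.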

Avg Price & Avg Min Night Stay by Neighborhood Group

This visualization is a clustered bar chart that compares the average minimum night stay and average price per night across the five NYC neighborhood groups. Manhattan has the largest average minimum night stay at 10 nights. This is much higher than the other four neighborhood groups, whose average minimum night stays range from 5 to 7 nights, with the Bronx having the lowest at 5 nights.

In addition to minimum night stays, this chart looks at the average price per night between all five neighborhood groups. The prices seem to remain similar, regardless of average minimum night stays and neighborhood groups. The neighborhood group with the least expensive price per night is Manhattan with $622.44, and the most expensive is Queens with $630.21 per night.

####DUAL AXIS BAR PLOT - visualization 3 aka clustered bar chart looking at avg night stay and avg price per night by neighbourhood group
#drop rows with no neighbourhood group (assign the result back so the drop takes effect)
df = df.dropna(subset=['neighbourhood group'])
#drop rows where neighborhood group is misspelled

df = df[(df['neighbourhood group'] != 'brookln')]
df = df[(df['neighbourhood group'] != 'manhatan')]
#remove nas in price; copy so the column assignments below act on an independent frame
df = df.dropna(subset=['price']).copy()
#remove the $ and commas from the price column (regex=False treats '$' as a literal character)
df['price']= df['price'].str.replace('$','', regex=False)
df['price']= df['price'].str.replace(',','', regex=False)
df['price']= df['price'].astype(int)

x = df.groupby(['neighbourhood group']).agg({'minimum nights':['mean'], 'price':['mean']}).reset_index()
x.columns = ['neigbourhood group', 'mean stay', 'mean price']
#sort x by mean price in ascending order
x = x.sort_values('mean price', ascending=True)
x.reset_index(inplace=True, drop=True)

def autolabel(these_bars, this_ax, place_of_decimals, symbol):
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_ax.text(each_bar.get_x()+each_bar.get_width()/2, height*1.01, symbol+format(height, place_of_decimals), fontsize=11, color='black', ha='center', va='bottom')
        
fig = plt.figure(figsize=(18,10))
#create ax1
ax1 = fig.add_subplot(1,1,1)
#create ax2 to show that it shares the same x axis as ax1
ax2 = ax1.twinx()
#set barwidth, do less than 0.5 so the clustered bars have some space in between them
bar_width = 0.4

#returns evenly spaced values within a given interval
x_pos = np.arange(5)

#create the bars for ax1, position them at the lower 50% of bar_width to have them on the left of the x_pos
stay_bars = ax1.bar(x_pos-(0.5*bar_width), x['mean stay'], bar_width, color='lightblue', edgecolor='black', label='Average Minimum Night Stay')

#create bars for ax2
price_bars = ax2.bar(x_pos+(0.5*bar_width), x['mean price'], bar_width, color='lightpink', edgecolor='black', label='Average Price')

#setting x label for ax1 sets it for both because they are twin axes
ax1.set_xlabel('Neighborhood Group', fontsize=18, labelpad=20)
#set y label for ax1 on the left
ax1.set_ylabel('Average Minimum Night Stay', fontsize=18, labelpad=20)
#set y label for ax2 on the right
ax2.set_ylabel('Average Price Per Night', fontsize=18, rotation=270, labelpad=20)
#make the tick labels on both y axes the same size
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)

#set the title
plt.title("Average Minimum Night Stay and Average Price Analysis\n by NYC's 5 Neighborhood Groups", fontsize=18, fontweight='bold')
#set the x ticks to x_pos, which holds 5 evenly spaced positions (one per neighbourhood group)
ax1.set_xticks(x_pos)
#change the x ticks to be the tag name
ax1.set_xticklabels(x['neigbourhood group'], fontsize=12)

#create legend
stay_color, stay_label = ax1.get_legend_handles_labels()
price_color, price_label = ax2.get_legend_handles_labels()
legend = ax1.legend(stay_color + price_color, stay_label + price_label, loc='best', frameon=True, ncol=1,
                    borderpad=1, fontsize=14)

#change y axis for ax1 so the legend appears not on top of a bar
ax1.set_ylim(0,x['mean stay'].max()*1.50)

ax2.set_ylim(0,x['mean price'].max()*1.25)

#set labels using the autolabel function from above

autolabel(price_bars, ax2, '.2f', '$')
autolabel(stay_bars, ax1, '.0f','')
ax2.yaxis.set_major_formatter(FuncFormatter(lambda x, pos:('$%1.0f')%(x)))

plt.show()
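
To put the two comparisons on the same footing, the sketch below computes how far apart the extremes are for each measure. It uses only the figures quoted in the text for this chart; the variable names are illustrative.

```python
# Figures quoted in the text for this chart.
price_low, price_high = 622.44, 630.21   # Manhattan, Queens (average price per night)
stay_low, stay_high = 5, 10              # Bronx, Manhattan (average minimum nights)

# Relative gap between the cheapest and most expensive group, in percent.
price_spread_pct = 100 * (price_high - price_low) / price_low

# How many times longer the longest average minimum stay is than the shortest.
stay_ratio = stay_high / stay_low

print(f"price spread: {price_spread_pct:.2f}%, stay ratio: {stay_ratio:.1f}x")
```

The price gap between the extremes is a little over 1%, while the longest average minimum stay is twice the shortest, which is why the stay bars differ visibly and the price bars do not.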

Median Airbnb Price by Neighborhood Group & Construction Year

This visualization is a heatmap that looks at the median Airbnb price in each neighborhood group by the residence’s construction year. As you can see, median prices remain similar across construction years for Queens, Manhattan, and Brooklyn, while there is much more variance in the median price per night in Staten Island and the Bronx. In Staten Island, for example, residences built in 2015 have a median price per night of $423, while residences built in 2008 have a median price per night of $737. This is quite a large range compared with the other neighborhood groups. It is interesting to see that more recently built homes do not necessarily have a higher median price, which I expected would be the case.

#median price by neighborhood group and construction year
df = df.dropna(subset=['Construction year'])
df['Construction year'] = df['Construction year'].astype(int)
x = df.groupby(['neighbourhood group','Construction year']).agg({'price': 'median'}).reset_index()
x.columns = ['neighbourhood group', 'Construction year', 'median price']
hm_df = pd.pivot_table(x, index='neighbourhood group', columns='Construction year', values='median price')


symbol = '$'
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)


comma_fmt = FuncFormatter(lambda x, p: '${:,.0f}'.format(x))
ax = sns.heatmap(hm_df, linewidth = 0.2, annot=True, cmap='PiYG',
                 fmt= '.2f', square=True, annot_kws={'size':10},
                 cbar_kws={'format':comma_fmt,"shrink": 0.5, 'orientation':'vertical'}          
)

for t in ax.texts:
    t.set_text('${:.0f}'.format(float(t.get_text())))


plt.title('Heatmap of Median Airbnb Price by Neighborhood Group and Construction Year', fontsize=18, pad=20, fontweight='bold')
plt.xlabel('Airbnb Construction Year', fontsize=14, labelpad=10)
plt.ylabel('Airbnb Neighborhood Group', fontsize=14, labelpad=10)
plt.yticks(rotation=0, size=14)

plt.xticks(size=14)

ax.invert_yaxis()

cbar = ax.collections[0].colorbar
cbar.set_label("Median Cost Per Night ", rotation=270, fontsize=14, color='black', labelpad=20)

plt.show()

Number of Airbnb Reviews by Neighborhood Group & Construction Year

This visualization is a line graph looking at how the total number of reviews differs between the neighborhood groups, given the construction year of the Airbnb. I thought that older homes might have more reviews because they have been around longer, but this was not the case. The number of reviews seems to remain fairly steady regardless of the home’s construction year. The most inconsistent neighborhood group seems to be Brooklyn, where reviews are low for some construction years, such as 2013, while homes built in 2015 have a very large number of reviews. The Bronx and Staten Island have far fewer reviews than Brooklyn and Manhattan because, as shown in the pie chart, they have significantly fewer Airbnbs. Overall, construction year does not seem to significantly influence the number of reviews a home has.

df = df.dropna(subset=['number of reviews'])
df['Construction year'] = df['Construction year'].astype(str)
price_df = df.groupby(['Construction year','neighbourhood group'])['number of reviews'].sum().reset_index(name='TotalReviews')

from matplotlib.ticker import FuncFormatter
fig = plt.figure(figsize = (24,12))
ax = fig.add_subplot(1,1,1)

my_colors = {"Bronx":"blue",
             "Brooklyn":"red",
             "Manhattan":"green",
             "Queens":"gray",
             "Staten Island":"purple"}

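#group by a single column name (not a one-item list) so each key is a plain string for the color lookup
for key, grp in price_df.groupby('neighbourhood group'):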
    grp.plot(ax=ax, kind="line", x='Construction year', y= 'TotalReviews', color=my_colors[key], label=key, marker='8')

plt.title("Total Number of Reviews by Construction Year and Neighborhood Group", fontsize=18, fontweight='bold')
ax.set_xlabel("Construction Year", fontsize=18)
ax.set_ylabel("Number of Reviews Given by Guests", fontsize=18, labelpad=20)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14, rotation=0)

#set 20 tick positions on the x axis, one per construction year (2003-2022)
x = np.arange(20)
ax.set_xticks(x,['2003','2004','2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020','2021','2022'])


#rearrange legend
handles, labels = ax.get_legend_handles_labels()
#reorder the handles using indexing from current handles, same with labels
handles = [handles[1], handles[2],handles[3],handles[0],handles[4]] 
labels = [labels[1], labels[2],labels[3],labels[0],labels[4]]
plt.legend(handles, labels, fontsize=14, ncol=1,bbox_to_anchor=(1.01,1.01), labelspacing=1,
          markerscale=.4, borderpad=1, handletextpad=0.8)


#format y axis
ax.get_yaxis().set_major_formatter(FuncFormatter(lambda x, p: format(int(x), ',')))
plt.show()
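
Because the chart plots total reviews, groups with more listings naturally sit higher. One way to separate the two effects, sketched below under assumed toy data, is to compare the total with the per-listing average. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; they are not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "neighbourhood group": ["Brooklyn", "Brooklyn", "Brooklyn", "Bronx"],
    "number of reviews": [10, 20, 30, 25],
})

# Total reviews, number of listings, and average reviews per listing for each group.
summary = (
    toy.groupby("neighbourhood group")["number of reviews"]
    .agg(total="sum", listings="count", per_listing="mean")
)
print(summary)
```

In this toy example Brooklyn has the larger total only because it has more listings; its per-listing average is lower than the Bronx's.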

Conclusion

This analysis dove deep into the Airbnb presence in NYC, specifically looking at how it differs across all 225 neighborhoods and the five main NYC neighborhood groups. For someone looking for an Airbnb in New York City, there are listings across 225 neighborhoods, with some neighborhoods having many more Airbnb residences than others. In terms of neighborhood groups, I would recommend looking in Brooklyn or Manhattan, as those two neighborhood groups together make up over 80% of all Airbnb presence in New York. Airbnb pricing remains consistent across all five neighborhood groups, as do minimum night stays, with the exception of Manhattan, which has a higher average minimum night stay than the other four neighborhood groups. Lastly, the price per night and the number of reviews a home has seem to have little to do with the construction year of the Airbnb residence, so if you are looking for an Airbnb in any of the five neighborhood groups, construction year should not be too large a factor in your lodging decision.