Data Visualization - Python (Tabs & TOC)

Introduction

This report analyzes the Airbnb NYC 2019 dataset, which contains information on Airbnb listings in New York City as of 2019. Sourced from Kaggle, the dataset includes over 48,000 listings with variables such as price, room type, neighbourhood group, availability, number of reviews, and more. The goal of this analysis is to explore pricing, availability, and demand patterns across NYC’s neighbourhood groups through descriptive statistics and five visualizations. Each visualization highlights a different aspect of the Airbnb market, providing insights into how these factors vary by location and over time.

Dataset

The Airbnb NYC 2019 dataset was obtained from Kaggle (https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data). It includes 48,895 listings with 16 variables, such as:

- price: The nightly price of the listing in USD.
- neighbourhood_group: The borough of NYC (e.g., Manhattan, Brooklyn).
- room_type: The type of listing (e.g., entire home/apt, private room, shared room).
- availability_365: Number of days the listing is available per year.
- number_of_reviews: Total number of reviews received by the listing.
- last_review: Date of the most recent review.

The dataset provides a comprehensive snapshot of Airbnb’s presence in NYC, allowing us to analyze market dynamics across different boroughs.

Findings

This section presents the findings of our analysis through descriptive statistics and five visualizations. The descriptive statistics provide an overview of the dataset’s key variables, while the visualizations explore pricing, room type distribution, availability, and demand across neighbourhood groups. Each visualization is presented in a separate tab to keep the report organized and visually appealing.

Descriptive Statistics

Before exploring the visualizations, let’s examine the dataset’s descriptive statistics to understand its structure and key characteristics.

import pandas as pd
dataset_path = "C:/Users/Conrad/Downloads/pythonwork/AB_NYC_2019.csv"
df = pd.read_csv(dataset_path)

print("Descriptive Statistics for Numerical Columns:")

## Descriptive Statistics for Numerical Columns:

numerical_stats = df[['price', 'minimum_nights', 'number_of_reviews', 'calculated_host_listings_count', 'availability_365']].describe()
print(numerical_stats)

##               price  ...  availability_365
## count  48895.000000  ...      48895.000000
## mean     152.720687  ...        112.781327
## std      240.154170  ...        131.622289
## min        0.000000  ...          0.000000
## 25%       69.000000  ...          0.000000
## 50%      106.000000  ...         45.000000
## 75%      175.000000  ...        227.000000
## max    10000.000000  ...        365.000000
## 
## [8 rows x 5 columns]

print("\nSummary of Categorical Columns:")

## 
## Summary of Categorical Columns:

print("Neighbourhood Group Distribution:")

## Neighbourhood Group Distribution:

print(df['neighbourhood_group'].value_counts())

## neighbourhood_group
## Manhattan        21661
## Brooklyn         20104
## Queens            5666
## Bronx             1091
## Staten Island      373
## Name: count, dtype: int64

print("\nRoom Type Distribution:")

## 
## Room Type Distribution:

print(df['room_type'].value_counts())

## room_type
## Entire home/apt    25409
## Private room       22326
## Shared room         1160
## Name: count, dtype: int64

Interpretation: The average price is $152.72, but the median is $106, indicating a right-skewed distribution with some high-priced outliers (the maximum price is $10,000). The minimum price of $0 also suggests that a few listings carry missing or invalid prices. Minimum nights are skewed as well, with a mean of 7.03 but a median of 3. Listings have an average of 23 reviews, with a maximum of 629, suggesting widely varying levels of demand. Availability ranges from 0 to 365 days, with a mean of 112.8 days (less than a third of the year) and a median of only 45 days; at least a quarter of listings show 0 available days. Manhattan and Brooklyn dominate the listings, with 21,661 and 20,104 listings, respectively, while Staten Island has the fewest (373). In terms of room types, entire homes/apartments (25,409) and private rooms (22,326) make up the majority, with shared rooms being rare (1,160).
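The mean-versus-median comparison used above can be turned into a small reusable check. The sketch below is illustrative only: the `prices` values are made up rather than taken from the dataset, and `skew_summary` is a hypothetical helper name.

```python
import pandas as pd


def skew_summary(values: pd.Series) -> dict:
    """Return the mean, the median, and a flag for right skew (mean > median)."""
    mean = float(values.mean())
    median = float(values.median())
    return {"mean": mean, "median": median, "right_skewed": mean > median}


# Made-up nightly prices with one extreme outlier, for illustration only.
prices = pd.Series([60, 80, 100, 120, 150, 2000])
summary = skew_summary(prices)
print(summary)
```

On the real data, the same helper could be called as `skew_summary(df['price'])`, assuming `df` has been loaded as shown earlier.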

Average Price by Neighbourhood Group

This bar chart shows the average price of listings in each neighbourhood group, providing an initial understanding of pricing variations across NYC.

import pandas as pd
import matplotlib.pyplot as plt



avg_price_by_neighbourhood = df.groupby('neighbourhood_group')['price'].mean().reset_index()
plt.figure(figsize=(10, 6))
plt.bar(avg_price_by_neighbourhood['neighbourhood_group'], avg_price_by_neighbourhood['price'], color='skyblue')
plt.title('Average Price by Neighbourhood Group', fontsize=16, pad=15)
plt.xlabel('Neighbourhood Group', fontsize=12)
plt.ylabel('Average Price ($)', fontsize=12)
plt.xticks(rotation=45, ha='right')

plt.tight_layout()
plt.show()

Explanation: Manhattan has the highest average price, around $200, reflecting its status as a prime tourist and business destination with high demand. Brooklyn follows with an average price of about $125, indicating a moderately expensive market. Queens, Staten Island, and the Bronx have lower average prices, roughly $90–$115, suggesting more affordable options in these areas. Because prices are right-skewed, these means are pulled upward by a small number of very expensive listings, so they overstate what a typical listing costs. Even so, the variation highlights the impact of location on pricing, with central and high-demand areas like Manhattan commanding premium prices. This sets the foundation for understanding pricing dynamics in the subsequent visualizations.
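Since the averages above are sensitive to outliers, it can help to report the median and the group size next to the mean. This is a minimal sketch on made-up rows (the `sample` frame is hypothetical and stands in for the real `df`), not the method used for the chart.

```python
import pandas as pd

# Made-up listings for illustration; the real frame is `df` loaded earlier.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"],
    "price": [150, 200, 1000, 80, 100],
})

# Mean, median, and listing count per neighbourhood group, ordered by median price.
price_summary = (
    sample.groupby("neighbourhood_group")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)
print(price_summary)
```

In this toy example, a single expensive listing lifts the Manhattan mean well above its median, which is the effect to watch for in the real data.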

Room Type Distribution by Neighbourhood Group

This grouped bar chart shows the distribution of room types (entire home/apt, private room, shared room) within each neighbourhood group, providing insight into the composition of listings.

import pandas as pd
import matplotlib.pyplot as plt


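# Count listings for each (neighbourhood group, room type) pair; rows are groups, columns are room types.
room_type_by_borough = df.groupby(['neighbourhood_group', 'room_type']).size().unstack(fill_value=0)
# DataFrame.plot creates its own figure, so a separate plt.figure() call would only leave an empty one behind.
room_type_by_borough.plot(kind='bar', stacked=False, figsize=(10, 6), color=['#1f77b4', '#ff7f0e', '#2ca02c'])
plt.title('Room Type Distribution by Neighbourhood Group', fontsize=16, pad=15)
plt.xlabel('Neighbourhood Group', fontsize=12)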
plt.ylabel('Number of Listings', fontsize=12)
plt.xticks(rotation=45, ha='right')

plt.legend(title='Room Type')
plt.tight_layout()
plt.show()

Explanation: Manhattan has the highest number of entire homes/apartments (about 13,200, well ahead of its roughly 8,000 private rooms), catering to tourists seeking full accommodations in a central location. Brooklyn has a balanced mix, with roughly 9,600 entire homes and 10,100 private rooms, reflecting its diverse appeal. Queens has fewer listings and leans toward private rooms (about 3,400) over entire homes (about 2,100). The Bronx and Staten Island have the fewest listings; private rooms outnumber entire homes in the Bronx, while Staten Island is split almost evenly, and both have very few shared rooms. Shared rooms are rare across all areas, with fewer than 500 listings in each borough, indicating limited supply of, and possibly limited demand for, this room type. This distribution suggests that room type offerings are influenced by local demand and space availability.
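Raw counts make the small boroughs hard to compare with Manhattan and Brooklyn, so one option is to look at the share of each room type within a neighbourhood group. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up rows; the `sample` frame is hypothetical.

```python
import pandas as pd

# Made-up listings for illustration only.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Queens"] * 4,
    "room_type": (
        ["Entire home/apt"] * 3 + ["Private room"]
        + ["Entire home/apt"] + ["Private room"] * 3
    ),
})

# normalize="index" divides each row by its total, so every row sums to 1.
room_share = pd.crosstab(
    sample["neighbourhood_group"], sample["room_type"], normalize="index"
)
print(room_share.round(2))
```

Applied to the real data, the two arguments would be `df['neighbourhood_group']` and `df['room_type']`.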

Price vs. Availability by Neighbourhood Group

This scatterplot examines the relationship between price and availability (days per year a listing is available), with points colored by neighbourhood group.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



plt.figure(figsize=(10, 6))
# s sets a fixed marker size (value chosen for legibility); size= maps a variable to marker size instead.
sns.scatterplot(data=df, x='price', y='availability_365', hue='neighbourhood_group', alpha=0.3, s=10)
plt.title('Price vs. Availability by Neighbourhood Group', fontsize=16, pad=15)
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Availability (Days per Year)', fontsize=12)
plt.legend(title='Neighbourhood Group')
plt.tight_layout()
plt.show()

Explanation: Most listings sit at the low end of the price axis (three quarters are priced at $175 or less), so the points concentrate well below $2,000, though outliers extend to $10,000, particularly in Manhattan (orange points). Availability varies widely, with many listings either fully available (365 days) or unavailable (0 days). Manhattan listings cluster at higher prices but show varied availability, suggesting that high-priced listings aren’t necessarily less available, possibly because high demand encourages hosts to keep listings open. Brooklyn (blue points) has more listings at lower prices (mostly under $1,000) with similar availability patterns. Queens, Staten Island, and the Bronx have fewer high-priced listings, with availability scattered across the range. The lack of a clear visual relationship between price and availability suggests that other factors, such as host preferences or seasonal demand, may influence availability more than price.
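The visual impression of "no clear correlation" can be backed by a number. Given the heavy price outliers, a rank-based measure such as Spearman's rho is a reasonable choice; it equals the Pearson correlation of the ranks, which is how the sketch below computes it. The `sample` values are made up for illustration.

```python
import pandas as pd

# Made-up (price, availability) pairs for illustration only.
sample = pd.DataFrame({
    "price": [50, 80, 120, 200, 500],
    "availability_365": [0, 365, 90, 10, 180],
})

# Spearman's rho computed as the Pearson correlation of the ranks.
rho = sample["price"].rank().corr(sample["availability_365"].rank())
print(round(rho, 2))
```

On the real data, the same expression with `df` in place of `sample` would give one overall coefficient; computing it per neighbourhood group would be a natural extension.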

Average Price and Number of Reviews by Neighbourhood Group

This dual-axis bar chart compares the average price and average number of reviews per listing in each neighbourhood group, providing insight into pricing and demand.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


agg_data = df.groupby('neighbourhood_group')[['price', 'number_of_reviews']].mean().reset_index()
fig, ax1 = plt.subplots(figsize=(10, 6))
width = 0.4
x = np.arange(len(agg_data))
ax1.bar(x - width/2, agg_data['price'], width, label='Average Price', color='skyblue', alpha=0.8)
ax1.set_xlabel('Neighbourhood Group', fontsize=12)
ax1.set_ylabel('Average Price ($)', fontsize=12, color='skyblue')
ax1.tick_params(axis='y', labelcolor='skyblue')
ax1.set_xticks(x)
ax1.set_xticklabels(agg_data['neighbourhood_group'], rotation=45, ha='right')
ax2 = ax1.twinx()
ax2.bar(x + width/2, agg_data['number_of_reviews'], width, label='Average Number of Reviews', color='orange', alpha=0.8)
ax2.set_ylabel('Average Number of Reviews', fontsize=12, color='orange')
ax2.tick_params(axis='y', labelcolor='orange')
plt.title('Average Price and Number of Reviews by Neighbourhood Group', fontsize=16, pad=15)
fig.legend(loc='upper center', bbox_to_anchor=(0.5, -0.05), ncol=2)
plt.tight_layout()
plt.show()

Explanation: Manhattan has the highest average price (~$200) but the lowest average number of reviews (~21), suggesting that, although it is expensive, each listing may be booked less often. The Bronx has the lowest average price (~$90) and a higher average number of reviews (~26), which may reflect affordability attracting more guests. Brooklyn (~$125, ~24 reviews) and Queens (~$100, ~28 reviews) sit in between on price, and Staten Island combines a moderate price (~$115) with the highest review count (~31). Review counts are only a rough proxy for bookings, since totals accumulate over a listing's lifetime. With that caveat, the chart suggests that lower-priced areas may see more bookings per listing (and thus more reviews), while high-priced areas like Manhattan may have fewer bookings per listing despite overall high demand. This connects to the scatterplot by showing that high prices in Manhattan don’t necessarily translate to high booking frequency.
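Because total review counts grow with a listing's age, a rate-based measure can serve as a cross-check on the demand proxy. The sketch below assumes the dataset's `reviews_per_month` column, which is not discussed in this report, and uses made-up rows in a hypothetical `sample` frame.

```python
import pandas as pd

# Made-up listings; reviews_per_month is missing for listings with no reviews.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Bronx", "Bronx"],
    "number_of_reviews": [40, 0, 30, 20],
    "reviews_per_month": [0.5, None, 2.0, 1.0],
})

# Average both proxies per neighbourhood group; missing values are skipped by mean.
demand = sample.groupby("neighbourhood_group").agg(
    avg_total_reviews=("number_of_reviews", "mean"),
    avg_reviews_per_month=("reviews_per_month", "mean"),
)
print(demand)
```

If both proxies rank the neighbourhood groups the same way on the real data, the reading of reviews as demand is on firmer ground.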

Average Availability Over Time by Neighbourhood Group

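This line plot shows average availability by year of last review for each neighbourhood group. Because availability_365 is a single snapshot value per listing, the year axis groups listings by when they were last reviewed; the plot therefore offers a rough temporal perspective rather than a true history of availability.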

import pandas as pd
import matplotlib.pyplot as plt



# Year of the most recent review; listings with no reviews have no date and yield NaN.
df['year'] = pd.to_datetime(df['last_review']).dt.year
# Drop never-reviewed listings and store the year as an integer for a clean axis.
df_time = df.dropna(subset=['year']).astype({'year': int})
# Mean availability for each (year, neighbourhood group) pair.
df_time = df_time.groupby(['year', 'neighbourhood_group'])['availability_365'].mean().reset_index()
plt.figure(figsize=(10, 6))
# Draw one line per neighbourhood group.
for neighbourhood in df_time['neighbourhood_group'].unique():
    subset = df_time[df_time['neighbourhood_group'] == neighbourhood]
    plt.plot(subset['year'], subset['availability_365'], marker='o', label=neighbourhood)
plt.title('Average Availability Over Time by Neighbourhood Group', fontsize=16, pad=15)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Average Availability (Days per Year)', fontsize=12)
plt.legend(title='Neighbourhood Group')
plt.grid(True)
plt.tight_layout()
plt.show()

Explanation: Availability trends vary across neighbourhood groups and over time. Staten Island (purple line) consistently has the highest availability, peaking at over 300 days in 2015 and stabilizing around 200 days by 2019, likely due to lower demand. Queens (green line) also shows high availability, peaking around 2012 at 200 days and settling around 150 days by 2019. Brooklyn (blue line) and Manhattan (orange line) have lower availability, generally below 150 days, with Manhattan consistently the lowest (around 100 days), reflecting high demand that leads hosts to limit availability. The Bronx (red line) shows a sharp dip in 2014 (below 50 days) but recovers to around 150 days by 2019. Overall, availability increased from 2011 to 2015 across most groups, possibly due to Airbnb’s growing popularity, but stabilized or decreased slightly afterward, reflecting a maturing market. This temporal perspective aligns with the scatterplot’s findings, where Manhattan’s high prices and varied availability suggest high demand.
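Sharp swings in a single year can come from very small groups, so it is worth checking how many listings sit behind each point on the line plot. The sketch below adds a listing count next to the mean; the rows in `sample` are made up, and `errors="coerce"` is an added safeguard not present in the original code.

```python
import pandas as pd

# Made-up listings for illustration; None stands for a listing with no reviews.
sample = pd.DataFrame({
    "last_review": ["2011-05-01", "2019-06-20", "2019-07-01", None],
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan", "Manhattan"],
    "availability_365": [300, 100, 50, 0],
})

# Parse dates; missing or unparseable values become NaT and are dropped below.
sample["year"] = pd.to_datetime(sample["last_review"], errors="coerce").dt.year
reviewed = sample.dropna(subset=["year"]).astype({"year": int})

# Mean availability and number of listings for each (year, group) pair.
by_year = (
    reviewed.groupby(["year", "neighbourhood_group"])["availability_365"]
    .agg(mean_availability="mean", n_listings="count")
    .reset_index()
)
print(by_year)
```

Points backed by only a handful of listings are better read as noise than as a trend.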

Conclusion

This analysis of the Airbnb NYC 2019 dataset provides several key insights into the market. Pricing varies significantly across neighbourhood groups, with Manhattan being the most expensive and the Bronx the most affordable. Room type distribution reflects local demand, with Manhattan favoring entire homes and Brooklyn offering a balanced mix. The relationship between price and availability is complex, with high-priced areas like Manhattan showing varied availability patterns. Demand (proxied by number of reviews) is higher in more affordable areas like the Bronx, which is consistent with price influencing booking frequency. Finally, average availability by year of last review has leveled off in recent years, with Manhattan consistently showing lower availability, which may reflect high demand. These findings highlight the interplay between pricing, availability, and demand in NYC’s Airbnb market. Future analyses could explore additional factors, such as the impact of local regulations or seasonal trends, to further understand these dynamics.