March 24th, 2024
The dataset examined here covers the New York real estate market and records attributes useful for understanding housing trends in the region. Its key features include the broker’s title, property type, listed price, number of bedrooms and bathrooms, and square footage. Together these attributes describe the range of properties on the market and can support decisions by prospective buyers, sellers, and investors.
Although the dataset has many features, I removed those that overlapped with one another and kept ten: Broker Title, Type, Price, Beds, Bath, Property Sqft, Locality, Sublocality, Latitude, and Longitude.
Here is the link to the dataset: https://www.kaggle.com/datasets/nelgiriyewithana/new-york-housing-market/data?select=NY-House-Dataset.csv
# Importing necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import numpy as np
import warnings
import matplotlib.patches as mpatches
import seaborn as sns
warnings.filterwarnings("ignore")
# Load the dataset
file_path = "NYHousing.csv"
# Read the CSV file into a DataFrame
df = pd.read_csv(file_path)
# Drop the unnamed columns
df = df.drop(columns=['Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13'])
# Strip leading and trailing whitespace from all column names
df = df.rename(columns={col: col.strip() for col in df.columns})
# Display the first few rows of the DataFrame to inspect the data
print(df.head())
BROKERTITLE TYPE
0 Brokered by Douglas Elliman -111 Fifth Ave Condo for sale
1 Brokered by Serhant Condo for sale
2 Brokered by Sowae Corp House for sale
3 Brokered by COMPASS Condo for sale
4 Brokered by Sotheby’s International Realty - E… Townhouse for sale
PRICE BEDS BATH PROPERTYSQFT LOCALITY SUBLOCALITY
0 $315,000.00 2 2 1400 New York Manhattan
1 $195,000,000.00 7 10 17545 New York New York County
2 $260,000.00 4 2 2015 New York Richmond County
3 $69,000.00 3 1 445 New York New York County
4 $55,000,000.00 7 2 14175 New York New York County
LATITUDE LONGITUDE
0 40.761255 -73.974483
1 40.766393 -73.980991
2 40.541805 -74.196109
3 40.761398 -73.974613
4 40.767224 -73.969856
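The PRICE column in the output above is text (a dollar sign plus thousands separators), so it cannot be averaged or plotted as a number without conversion. The following is a minimal sketch of one way to convert it, assuming every value follows the `$315,000.00` pattern shown; `parse_price` is a hypothetical helper name, and the sample values are the five prices printed above.

```python
def parse_price(text):
    """Convert a price string such as '$315,000.00' to a float.

    Assumes the only non-numeric characters are '$', ',' and whitespace.
    """
    return float(str(text).replace("$", "").replace(",", "").strip())


# The five PRICE values from the df.head() output above
sample_prices = [
    "$315,000.00",
    "$195,000,000.00",
    "$260,000.00",
    "$69,000.00",
    "$55,000,000.00",
]

prices = [parse_price(p) for p in sample_prices]
print(prices)
```

On the full DataFrame, something like `df['PRICE'] = df['PRICE'].map(parse_price)` should apply the same conversion, provided no rows deviate from this format.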
One of the most sought-after places to live in New York City is Manhattan, which is filled with activities. The pie chart shows that the Manhattan listings are primarily condos for sale.
# Filter rows where SUBLOCALITY is Manhattan
manhattan_df = df[df['SUBLOCALITY'] == 'Manhattan']
# Group by TYPE and count occurrences
house_type_counts = manhattan_df['TYPE'].value_counts()
# Plotting the pie chart
plt.figure(figsize=(8, 8))
plt.pie(house_type_counts, labels=house_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of House Types in Manhattan')
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
The bar chart shows that most places for sale have three or fewer beds. This suggests that finding a home for a large family in NY would not be easy.
# Count how many listings have each number of bedrooms
type_counts = df['BEDS'].value_counts()
# Sort the Series by index (number of bedrooms)
type_counts = type_counts.sort_index().reindex(range(1, 16), fill_value=0)
# Plotting the bar chart
plt.figure(figsize=(10, 6))
type_counts.plot(kind='bar', color='skyblue')
plt.title('Count of Bed Amounts')
plt.xlabel('Bed Amounts')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.grid(axis='y') # Add grid lines along the y-axis
plt.tight_layout() # Adjust layout to prevent clipping of labels
plt.show()
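The "three or fewer beds" observation can also be checked numerically rather than read off the chart. This is a rough sketch: `share_at_most` is a hypothetical helper, and the sample list holds only the BEDS values from the five rows printed earlier, so the percentage it prints says nothing about the full dataset.

```python
def share_at_most(values, limit):
    """Return the fraction of values that are less than or equal to limit."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    return sum(1 for v in values if v <= limit) / len(values)


# BEDS values from the five rows shown by df.head()
beds = [2, 7, 4, 3, 7]

share = share_at_most(beds, 3)
print(f"{share:.0%} of these sample listings have three or fewer beds")
```

Because a pandas Series is iterable, `share_at_most(df['BEDS'], 3)` should give the share for the whole dataset.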
Places to live in NYC often lack space, so bedrooms are often small or a home cannot house many people. The scatter plot shows that many homes are 5,000 square feet or less and also have five or fewer beds.
# Define custom colormap
color_map = ListedColormap(['blue', 'purple', 'green', 'red', 'yellow', 'orange'])
# Create Scatter Plot
plt.figure(figsize=(12, 8))
plt.scatter(df['PROPERTYSQFT'], df['BEDS'], marker='o', c=df['BATH'], cmap=color_map,
s=df['BATH']*20, edgecolors='black', alpha=0.7)
plt.title('New York Housing Data Scatter Plot')
plt.xlabel('Property Square Footage')
plt.ylabel('Number of Bedrooms')
plt.colorbar(label='Number of Bathrooms')
plt.grid(True)
plt.xlim(0,30000)
plt.tight_layout()
plt.show()
This plot displays house locations based on their latitude and longitude. The colors are assigned by latitude rather than longitude, so each color band is a north-to-south slice of the city. In terms of the five boroughs: Bronx (orange), Manhattan and Queens (top of red and yellow), Brooklyn (bottom of red and green), Staten Island (blue).
# Define custom colormap
cmap = ListedColormap(['blue', 'green', 'red', 'yellow', 'orange'])
# Create scatter plot
plt.figure(figsize=(10, 8))
plt.scatter(df['LONGITUDE'], df['LATITUDE'], c=df['LATITUDE'], cmap=cmap, alpha=0.6)
plt.colorbar(label='Latitude')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Scatter Plot with Color Map based on Latitude')
plt.grid(True)
plt.tight_layout()
plt.show()
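Since latitude bands only approximate the boroughs (Manhattan and Queens share a band, for example), an alternative is to label each listing by borough from its SUBLOCALITY value. The sketch below assumes SUBLOCALITY holds either a borough name or a county name; only "Manhattan", "New York County", and "Richmond County" appear in the rows printed earlier, so the remaining keys are assumptions about what the column contains. `BOROUGH_LOOKUP` and `to_borough` are hypothetical names.

```python
# Each NYC borough is coextensive with one county; other spellings are assumed.
BOROUGH_LOOKUP = {
    "New York County": "Manhattan",
    "Kings County": "Brooklyn",
    "Queens County": "Queens",
    "Bronx County": "Bronx",
    "Richmond County": "Staten Island",
    "Manhattan": "Manhattan",
    "Brooklyn": "Brooklyn",
    "Queens": "Queens",
    "Bronx": "Bronx",
    "The Bronx": "Bronx",
    "Staten Island": "Staten Island",
}


def to_borough(sublocality):
    """Map a SUBLOCALITY value to a borough name, or 'Other' if unrecognized."""
    return BOROUGH_LOOKUP.get(str(sublocality).strip(), "Other")


# SUBLOCALITY values from the five rows shown by df.head()
sample = ["Manhattan", "New York County", "Richmond County",
          "New York County", "New York County"]

boroughs = [to_borough(s) for s in sample]
print(boroughs)
```

With the DataFrame, `df['SUBLOCALITY'].map(to_borough)` would produce a borough column that could drive the scatter plot colors instead of latitude. The same mapping would also let a Manhattan filter pick up rows labeled "New York County".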
The data set was created to provide insight for buyers and sellers of homes in NY. I limited the property square footage range to 0 to 4,000 because that is a fairly standard size for raising a family. I also included only property types with more than 200 homes for sale, given the size of NY's population. The histogram shows that Multi-Family homes come in a wide range of square footage.
# Group the data by property type
grouped = df.groupby('TYPE')
colors = sns.color_palette('hsv', len(grouped))
# Plot a histogram for each property type with more than 200 listings
plt.figure(figsize=(12, 8))
for i, (name, group) in enumerate(grouped):
    if len(group) > 200:
        plt.hist(group['PROPERTYSQFT'], bins=20, alpha=0.5, label=name, color=colors[i])
plt.title('Histogram of Property Square Footage by Type of Place')
plt.xlabel('Property Square Footage')
plt.ylabel('Frequency')
plt.legend(title='Type')
plt.grid(True)
plt.xlim(0,4000)
plt.tight_layout()
plt.show()
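To see which property types clear the listing threshold before plotting, the counts can be tallied directly. This is a small sketch with a hypothetical helper, `types_above`; the sample list is the TYPE column from the five rows printed earlier, so a threshold of 1 is used here purely for illustration in place of 200.

```python
from collections import Counter


def types_above(types, min_count):
    """Return {type: count} for types that occur more than min_count times."""
    counts = Counter(types)
    return {name: n for name, n in counts.items() if n > min_count}


# TYPE values from the five rows shown by df.head()
sample_types = [
    "Condo for sale",
    "Condo for sale",
    "House for sale",
    "Condo for sale",
    "Townhouse for sale",
]

frequent = types_above(sample_types, 1)
print(frequent)
```

On the full data, `types_above(df['TYPE'], 200)` should list the types that end up in the histogram.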