“Unveiling New York’s Real Estate Landscape: A Comprehensive Data Dive”

The purpose of this final data dive is to conduct a comprehensive statistical analysis of the New York housing dataset, covering pricing, property characteristics, and geographical distribution. Through analysis and visualization, the goal is to uncover insights that support actionable recommendations for real estate in New York City.

This final data dive examines key features of the New York housing dataset: property prices, square footage, number of bedrooms and bathrooms, and location attributes such as sublocality. The analysis identifies patterns in property characteristics and investigates relationships between variables. The dataset is a snapshot of listings with no date field, so the focus is on cross-sectional comparisons. A geographical analysis shows how properties are distributed across different areas of New York City. The findings are presented through visualizations and interpretations aimed at stakeholders in the real estate industry and urban planning.

First, I load the dataset and inspect its structure.

# Loading necessary libraries
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(pwr)

# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")

# Check the structure of the dataset
str(NY_House_Dataset)

## 'data.frame':    4801 obs. of  16 variables:
##  $ BROKERTITLE                : chr  "Brokered by Douglas Elliman  -111 Fifth Ave" "Brokered by Serhant" "Brokered by Sowae Corp" "Brokered by COMPASS" ...
##  $ TYPE                       : chr  "Condo for sale" "Condo for sale" "House for sale" "Condo for sale" ...
##  $ PRICE                      : int  315000 195000000 260000 69000 55000000 690000 899500 16800000 265000 440000 ...
##  $ BEDS                       : int  2 7 4 3 7 5 2 8 1 2 ...
##  $ BATH                       : num  2 10 2 1 2.37 ...
##  $ PROPERTYSQFT               : num  1400 17545 2015 445 14175 ...
##  $ ADDRESS                    : chr  "2 E 55th St Unit 803" "Central Park Tower Penthouse-217 W 57th New York St Unit Penthouse" "620 Sinclair Ave" "2 E 55th St Unit 908W33" ...
##  $ STATE                      : chr  "New York, NY 10022" "New York, NY 10019" "Staten Island, NY 10312" "Manhattan, NY 10022" ...
##  $ ADMINISTRATIVE_AREA_LEVEL_2: chr  "New York County" "United States" "United States" "United States" ...
##  $ LOCALITY                   : chr  "New York" "New York" "New York" "New York" ...
##  $ SUBLOCALITY                : chr  "Manhattan" "New York County" "Richmond County" "New York County" ...
##  $ STREET_NAME                : chr  "East 55th Street" "New York" "Staten Island" "New York" ...
##  $ LONG_NAME                  : chr  "Regis Residence" "West 57th Street" "Sinclair Avenue" "East 55th Street" ...
##  $ FORMATTED_ADDRESS          : chr  "Regis Residence, 2 E 55th St #803, New York, NY 10022, USA" "217 W 57th St, New York, NY 10019, USA" "620 Sinclair Ave, Staten Island, NY 10312, USA" "2 E 55th St, New York, NY 10022, USA" ...
##  $ LATITUDE                   : num  40.8 40.8 40.5 40.8 40.8 ...
##  $ LONGITUDE                  : num  -74 -74 -74.2 -74 -74 ...

#Confidence Intervals for New York Housing Data

Confidence intervals were calculated for two key parameters, mean housing price and mean property size, to estimate the range of plausible values for each population mean. The confidence level chosen was 95%.

# Calculate confidence intervals for mean housing price and property size
mean_price_ci <- t.test(NY_House_Dataset$PRICE, conf.level = 0.95)$conf.int
mean_propertysqft_ci <- t.test(NY_House_Dataset$PROPERTYSQFT, conf.level = 0.95)$conf.int

# Visualize confidence intervals
# Histogram for housing price
ggplot(NY_House_Dataset, aes(x = PRICE)) +
  geom_histogram(binwidth = 10000, fill = "skyblue", color = "black") +
  geom_vline(xintercept = mean_price_ci, linetype = "dashed", color = "red", linewidth = 1) +
  labs(title = "Histogram of Housing Prices with 95% Confidence Interval",
       x = "Housing Price",
       y = "Frequency") +
  theme_minimal()

# Histogram for property size
# Plotting the histogram
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT)) +
  geom_histogram(binwidth = 500, fill = "lightgreen", color = "black") +
  geom_vline(xintercept = mean_propertysqft_ci, linetype = "dashed", color = "blue", linewidth = 1) +
  labs(title = "Histogram of Property Sizes with 95% Confidence Interval",
       x = "Property Size",
       y = "Frequency") +
  theme_minimal()

#Visualizations:

Histogram of Housing Prices with 95% Confidence Interval

–A histogram showing the distribution of housing prices in New York City.

–The red dashed lines represent the 95% confidence interval for the mean housing price.

Histogram of Property Sizes with 95% Confidence Interval

–A histogram displaying the distribution of property sizes in New York City.

–The blue dashed lines indicate the 95% confidence interval for the mean property size.

These visualizations provide insights into the estimated range of plausible values for mean housing prices and property sizes in New York City, allowing stakeholders to make informed decisions with a certain level of confidence.
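
The interval that `t.test()` returns can be reproduced by hand, which makes clear that it describes the mean and not the spread of individual prices. The sketch below is a minimal illustration under stated assumptions: the `prices` vector is made up for the example and is not the dataset, and the variable names are my own.

```r
# Illustrative prices only (an assumption, not the full dataset)
prices <- c(315000, 260000, 690000, 899500, 265000, 440000, 1250000, 550000)

n      <- length(prices)
xbar   <- mean(prices)
se     <- sd(prices) / sqrt(n)       # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value

manual_ci  <- c(xbar - t_crit * se, xbar + t_crit * se)
builtin_ci <- as.numeric(t.test(prices, conf.level = 0.95)$conf.int)

# The two intervals agree up to floating-point error
all.equal(manual_ci, builtin_ci)
```

Because the standard error shrinks with the square root of the sample size, an interval built from several thousand listings is narrow compared with the overall range of prices, so the two dashed lines can sit very close together on the histogram.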

#Analyzing the Impact of Number of Baths on Property Prices in New York City

In this analysis, I explore the relationship between the number of baths in a property and its mean price. I first group the dataset by the number of baths and calculate the mean price for each group, then visualize the mean price by number of baths using a bar plot.

# Group the dataset by 'BATH' and calculate the mean price for each number of baths
mean_price_by_bath <- NY_House_Dataset %>%
  group_by(BATH) %>%
  summarise(mean_price = mean(PRICE, na.rm = TRUE))

# Create a bar plot to visualize the mean price by number of baths
ggplot(mean_price_by_bath, aes(x = factor(BATH), y = mean_price)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(title = "Mean Price by Number of Baths",
       x = "Number of Baths",
       y = "Mean Price") +
  theme_minimal()

The bar plot illustrates how the mean price of properties varies with the number of baths. It describes an association rather than a causal effect, since bath count can move together with size and location, and a group with few listings can be dominated by a single expensive property. With that caveat, the plot gives buyers, investors, real estate agents, and property developers a reference point for pricing strategies and market positioning.
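
A group mean is easier to interpret when it is reported next to the group size and the median. The sketch below shows one way to do that in base R; the `listings` data frame is hypothetical and invented for illustration, and only the column names `BATH` and `PRICE` come from the dataset.

```r
# Hypothetical listings, invented for illustration
listings <- data.frame(
  BATH  = c(1, 1, 1, 2, 2, 2, 2, 10),
  PRICE = c(300000, 450000, 500000, 900000, 1100000, 1200000, 1400000, 195000000)
)

summary_by_bath <- aggregate(
  PRICE ~ BATH,
  data = listings,
  FUN  = function(p) c(n = length(p), mean = mean(p), median = median(p))
)

# aggregate() returns a matrix column here; flatten it into ordinary columns
summary_by_bath <- do.call(data.frame, summary_by_bath)
names(summary_by_bath) <- c("BATH", "n", "mean_price", "median_price")

summary_by_bath
```

In this toy example the 10-bath group contains a single listing, so its bar would reflect one property rather than a typical price. The same check applies to the bedroom grouping.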

#Exploring the Influence of Number of Bedrooms on Property Prices in New York City

In this analysis, I investigate the relationship between the number of bedrooms in a property and its mean price. I group the dataset by the number of bedrooms and calculate the mean price for each group, then visualize the mean price by number of bedrooms using a line plot.

# Group the dataset by 'BEDS' and calculate the mean price for each number of bedrooms
mean_price_by_beds <- NY_House_Dataset %>%
  group_by(BEDS) %>%
  summarise(mean_price = mean(PRICE, na.rm = TRUE))

# Create a line plot to visualize the mean price by number of bedrooms
ggplot(mean_price_by_beds, aes(x = BEDS, y = mean_price)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "blue", size = 3) +
  labs(title = "Mean Price by Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Mean Price") +
  theme_minimal()

The line plot depicts how the mean price of properties changes with the number of bedrooms. It provides valuable insights into the pricing dynamics based on the number of bedrooms, aiding potential buyers, investors, and real estate professionals in decision-making and market analysis.

#Investigating the Effect of Bathroom Count on Property Prices in New York: A Hypothesis Testing Approach

Now, let’s conduct a hypothesis test to determine if there is a significant difference in the mean property prices between properties with 1 bathroom and properties with 2 bathrooms.

–Null Hypothesis (H0): The mean property price is the same for properties with 1 bathroom and properties with 2 bathrooms.

–Alternative Hypothesis (H1): The mean property price differs between properties with 1 bathroom and properties with 2 bathrooms.

I will use a two-sample t-test to compare the means of property prices between these two groups.

# Subset data for properties with 1 and 2 bathrooms
one_bathroom_prices <- NY_House_Dataset$PRICE[NY_House_Dataset$BATH == 1]
two_bathroom_prices <- NY_House_Dataset$PRICE[NY_House_Dataset$BATH == 2]

# Perform t-test
t_test_result <- t.test(one_bathroom_prices, two_bathroom_prices)

# Check the result
t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  one_bathroom_prices and two_bathroom_prices
## t = -23.089, df = 1893.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -803129.2 -677372.8
## sample estimates:
## mean of x mean of y 
##  476297.6 1216548.6

The Welch Two Sample t-test was conducted to compare the mean property prices between properties with one bathroom and those with two bathrooms in New York. The test yielded a t-value of -23.089 with 1893.2 degrees of freedom (df). The p-value was less than 2.2e-16, indicating strong evidence against the null hypothesis of equal means.

The test is two-sided: the alternative hypothesis is that the true difference in means is not equal to 0.

Furthermore, the 95% confidence interval for the difference in means (one-bathroom mean minus two-bathroom mean) ranged from -803129.2 to -677372.8. This interval does not include 0, which is consistent with rejecting the null hypothesis at the 5% level. The sample means were 476297.6 for one-bathroom properties and 1216548.6 for two-bathroom properties.

In summary, based on the results of the Welch Two Sample t-test, we reject the null hypothesis and conclude that there is a significant difference in property prices between properties with one bathroom and those with two bathrooms in New York. Specifically, properties with two bathrooms tend to have higher prices compared to those with only one bathroom.
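
The statistic, degrees of freedom, and p-value reported by the Welch test can be reconstructed from group means and variances. The following is a minimal sketch, assuming two made-up price vectors (`one_bath` and `two_bath` are hypothetical and not drawn from the dataset); it shows where the fractional degrees of freedom come from.

```r
# Hypothetical price samples, invented for illustration
one_bath <- c(320000, 410000, 450000, 500000, 610000, 380000)
two_bath <- c(850000, 990000, 1200000, 1350000, 760000, 1500000, 1100000)

# Squared standard error of each group mean
v1 <- var(one_bath) / length(one_bath)
v2 <- var(two_bath) / length(two_bath)

t_stat <- (mean(one_bath) - mean(two_bath)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(one_bath) - 1) + v2^2 / (length(two_bath) - 1))

p_value <- 2 * pt(-abs(t_stat), df = df_welch)   # two-sided p-value

builtin <- t.test(one_bath, two_bath)            # Welch is the default (var.equal = FALSE)
c(manual_t = t_stat, builtin_t = unname(builtin$statistic))
```

The Welch form does not assume equal variances in the two groups, which is a sensible default when one group's prices are far more spread out than the other's.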

#Comparison of Average Property Prices Based on Number of Bedrooms

This analysis aims to investigate whether there is a statistically significant difference in the average prices of properties with 1, 2, and 3 bedrooms in the New York housing dataset. I will conduct a one-way ANOVA test to compare the mean prices across the different bedroom categories.

# Subset the data for properties with 1, 2, and 3 bedrooms
one_bedroom_prices <- NY_House_Dataset$PRICE[NY_House_Dataset$BEDS == 1]
two_bedroom_prices <- NY_House_Dataset$PRICE[NY_House_Dataset$BEDS == 2]
three_bedroom_prices <- NY_House_Dataset$PRICE[NY_House_Dataset$BEDS == 3]

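# Fit PRICE on BEDS using the full dataset. BEDS is numeric here, so this is a
# 1-df linear trend test, not a three-group one-way ANOVA (the subsets above are unused).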
anova_result <- aov(PRICE ~ BEDS, data = NY_House_Dataset)

# Summary of ANOVA test
summary(anova_result)

##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## BEDS           1 1.285e+16 1.285e+16   13.11 0.000297 ***
## Residuals   4799 4.706e+18 9.807e+14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows a statistically significant linear association between bedroom count and price across all 4,801 listings (F(1, 4799) = 13.11, p = 0.000297). Because BEDS enters the model as a numeric variable and the full dataset is used, this is a test of a linear trend rather than a comparison of the mean prices of the 1-, 2-, and 3-bedroom categories.

# Filter data for 1, 2, and 3 bedrooms
beds_data <- subset(NY_House_Dataset, BEDS %in% c(1, 2, 3))

# Boxplot to visualize the distribution of prices for different numbers of bedrooms
library(ggplot2)
ggplot(beds_data, aes(factor(BEDS), PRICE)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(title = "Distribution of Prices for Different Numbers of Bedrooms",
       x = "Number of Bedrooms",
       y = "Price") +
  theme_minimal()

# ANOVA table for PRICE ~ BEDS on the 1-3 bedroom subset (BEDS is still numeric, so 1 df)
anova_result <- anova(lm(PRICE ~ BEDS, data = beds_data))
anova_result

## Analysis of Variance Table
## 
## Response: PRICE
##             Df     Sum Sq    Mean Sq F value    Pr(>F)    
## BEDS         1 4.5515e+14 4.5515e+14  104.56 < 2.2e-16 ***
## Residuals 3269 1.4231e+16 4.3532e+12                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The boxplot visualization shows that properties with 2 bedrooms tend to have higher median prices compared to those with 1 or 3 bedrooms.

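The ANOVA table shows a significant linear association between bedroom count and price within the 1-, 2-, and 3-bedroom subset (F(1, 3269) = 104.56, p < 0.001). Because BEDS is numeric in this model, the test has 1 degree of freedom and assesses a linear trend; comparing the three category means directly would require treating BEDS as a factor.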
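
The ANOVA tables above report 1 degree of freedom for BEDS because the variable is numeric. A one-way ANOVA across three bedroom categories would instead have 2 degrees of freedom. The sketch below illustrates the difference on a hypothetical data frame (`beds_demo` is invented for illustration); it is not a rerun of the analysis, and no results for the real dataset are implied.

```r
# Hypothetical listings, invented for illustration
beds_demo <- data.frame(
  BEDS  = rep(c(1, 2, 3), each = 5),
  PRICE = c(400000, 450000, 500000, 520000, 610000,
            700000, 760000, 820000, 900000, 950000,
            650000, 720000, 780000, 860000, 990000)
)

# Numeric BEDS: a single slope is estimated, so 1 degree of freedom
numeric_fit <- anova(lm(PRICE ~ BEDS, data = beds_demo))

# factor(BEDS): one mean per category, so 3 - 1 = 2 degrees of freedom
factor_fit <- anova(lm(PRICE ~ factor(BEDS), data = beds_demo))

numeric_fit$Df[1]
factor_fit$Df[1]
```

Applied to the real data, the factor version would be `anova(lm(PRICE ~ factor(BEDS), data = beds_data))`, and a follow-up such as `TukeyHSD(aov(PRICE ~ factor(BEDS), data = beds_data))` could show which pairs of categories differ.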

#Regression Modeling to Predict Property Prices

In this task, we explore how property prices relate to individual features in the New York housing dataset as a first step toward regression modeling. The visualizations are scatterplots with fitted regression lines for bedrooms, bathrooms, and square footage, plus a map of listings colored by price. The multiple linear regression model and its diagnostic plots follow in the next section.

# Scatterplot with regression line for Price vs. Bedrooms
ggplot(NY_House_Dataset, aes(x = BEDS, y = PRICE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatterplot of Property Price vs. Number of Bedrooms",
       x = "Number of Bedrooms",
       y = "Property Price")

## `geom_smooth()` using formula = 'y ~ x'

# Scatterplot with regression line for Price vs. Bathrooms
ggplot(NY_House_Dataset, aes(x = BATH, y = PRICE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "green") +
  labs(title = "Scatterplot of Property Price vs. Number of Bathrooms",
       x = "Number of Bathrooms",
       y = "Property Price")

## `geom_smooth()` using formula = 'y ~ x'

# Scatterplot with regression line for Price vs. Property Square Footage
ggplot(NY_House_Dataset, aes(x = PROPERTYSQFT, y = PRICE)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "orange") +
  labs(title = "Scatterplot of Property Price vs. Property Square Footage",
       x = "Property Square Footage",
       y = "Property Price")

## `geom_smooth()` using formula = 'y ~ x'

# Map of listings by longitude and latitude, colored by price (no regression line)
ggplot(NY_House_Dataset, aes(x = LONGITUDE, y = LATITUDE, color = PRICE)) +
  geom_point() +
  labs(title = "Geographical Distribution of Property Prices",
       x = "Longitude",
       y = "Latitude",
       color = "Property Price")

The fitted lines show the direction of the linear relationship between each feature and property price in the New York housing market.

In a regression model, the coefficients indicate the direction and size of the relationship between each predictor variable and the property price.

Diagnostic plots, covered in the next section, help assess the adequacy of the linear regression assumptions and identify potential issues such as heteroscedasticity or nonlinearity.
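
Reading coefficients from a fitted model can be sketched as follows. This is an illustration under stated assumptions: the `demo` data frame is simulated with known coefficients (50,000 per bedroom and 300 per square foot, both values chosen arbitrarily for the example), so the output says nothing about the real dataset.

```r
# Simulated data with known coefficients, invented for illustration
set.seed(1)
demo <- data.frame(
  BEDS         = rep(1:4, times = 10),
  PROPERTYSQFT = seq(500, 4400, length.out = 40)
)
demo$PRICE <- 100000 + 50000 * demo$BEDS + 300 * demo$PROPERTYSQFT +
  rnorm(40, mean = 0, sd = 20000)

fit <- lm(PRICE ~ BEDS + PROPERTYSQFT, data = demo)

# Estimate, standard error, t value, and p-value for each term
coef(summary(fit))

# Share of price variance explained by the model
summary(fit)$r.squared
```

Each estimate is the expected change in price for a one-unit increase in that predictor while the other predictors are held fixed, which is why a coefficient from a multiple regression can differ from the slope in the corresponding single-variable scatterplot.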

#Regression Diagnostics: Assessing Model Validity for Predicting Property Prices

Let’s first fit a linear regression model using the lm() function, where the response variable is PRICE and the predictor variables include BEDS, BATH, and PROPERTYSQFT.

The diagnostic plots include:

Residual Plot: This plot shows the residuals (the differences between observed and predicted values) against the fitted values. We expect to see a random scatter of points around the horizontal line at 0, indicating that the residuals have constant variance (homoscedasticity).

Q-Q Plot: This plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. If the residuals follow a normal distribution, the points on the plot should fall along the diagonal dashed line.

# Fit linear regression model
lm_model <- lm(PRICE ~ BEDS + BATH + PROPERTYSQFT, data = NY_House_Dataset)

# Diagnostic plots
# Residual plot
residual_plot <- ggplot(data = NY_House_Dataset, aes(x = fitted(lm_model), y = resid(lm_model))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residual Plot",
       x = "Fitted Values",
       y = "Residuals")

# Q-Q plot
qq_plot <- ggplot(data = NY_House_Dataset, aes(sample = resid(lm_model))) +
  geom_qq() +
  # Reference line through the residual quartiles; a fixed y = x line does not match the dollar scale
  geom_qq_line(color = "red", linetype = "dashed") +
  labs(title = "Q-Q Plot",
       x = "Theoretical Quantiles",
       y = "Sample Quantiles")

# Combine plots
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(residual_plot, qq_plot, ncol = 2)

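In the residual plot, the absence of a discernible pattern, curvature, or funnel shape suggests that the assumption of constant variance is not violated, indicating homoscedasticity.

In the Q-Q plot, the points align closely with the reference line, indicating that the residuals are approximately normally distributed. Because prices in this dataset span several orders of magnitude (the first ten listings alone range from 69,000 to 195,000,000), both readings deserve a second look at the most extreme points.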
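
One detail worth checking in any Q-Q plot is how the reference line is constructed. Raw residuals are measured in dollars while the theoretical quantiles are standard normal, so a useful reference line is usually drawn through the first and third quartiles, which is what base R's `qqline()` does. The sketch below reproduces that construction on simulated residuals (`res` is hypothetical, generated for illustration only).

```r
# Simulated residuals on a dollar scale, invented for illustration
set.seed(42)
res <- rnorm(200, mean = 0, sd = 250000)

# Coordinates of the Q-Q points without drawing anything
qq <- qqnorm(res, plot.it = FALSE)   # qq$x: theoretical, qq$y: sample quantiles

# Reference line through the first and third quartiles
probs     <- c(0.25, 0.75)
slope     <- unname(diff(quantile(res, probs)) / diff(qnorm(probs)))
intercept <- unname(quantile(res, probs[1])) - slope * qnorm(probs[1])

slope   # on the order of the residual standard deviation, far from 1
```

An alternative is to standardize the residuals first (for example with `rstandard()`), after which a line with intercept 0 and slope 1 becomes a meaningful benchmark.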

#Correlation Analysis of Property Prices and Features

Correlation analysis will be performed to examine the relationships between property prices and various features including the number of bedrooms (BEDS), number of bathrooms (BATH), and property square footage (PROPERTYSQFT). The correlation matrix provides insights into the strength and direction of these relationships, helping to identify which features are most strongly associated with property prices.

# Calculate correlation matrix
correlation_matrix <- cor(NY_House_Dataset[, c("PRICE", "BEDS", "BATH", "PROPERTYSQFT")])

# Visualize correlation matrix
library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.3

## corrplot 0.92 loaded

corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust", 
         addCoef.col = "black", tl.col = "black", tl.srt = 45)

The correlation matrix reveals that property square footage (PROPERTYSQFT) has the strongest positive correlation with property prices, indicating that larger properties tend to have higher prices. The number of bedrooms (BEDS) also shows a moderate positive correlation with property prices, suggesting that properties with more bedrooms tend to command higher prices. The correlation between the number of bathrooms (BATH) and property prices is relatively weaker compared to square footage and number of bedrooms.
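
Pearson correlation is sensitive to extreme values, and this dataset contains listings such as the 195,000,000 penthouse with 17,545 square feet shown in the structure output. A rank-based (Spearman) correlation is one possible robustness check. The sketch below uses a hypothetical six-row data frame in which only that one extreme row mirrors the dataset; the other values are invented.

```r
# Five invented listings plus one extreme listing
demo <- data.frame(
  PRICE        = c(500000, 450000, 600000, 400000, 550000, 195000000),
  PROPERTYSQFT = c(1500, 900, 700, 1300, 1100, 17545)
)

pearson  <- cor(demo$PRICE, demo$PROPERTYSQFT, method = "pearson")
spearman <- cor(demo$PRICE, demo$PROPERTYSQFT, method = "spearman")

# Pearson is driven almost entirely by the single extreme point;
# Spearman works on ranks and is much less affected by it
c(pearson = pearson, spearman = spearman)
```

On the real data, the same comparison could be run with `cor(NY_House_Dataset[, c("PRICE", "BEDS", "BATH", "PROPERTYSQFT")], method = "spearman")`. A large gap between the two matrices would suggest that a few listings dominate the Pearson values.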

#Geospatial Analysis of Property Prices in New York City

Geospatial analysis is conducted to visualize how listings and their prices are distributed across New York City. The first plot places each property at its longitude and latitude and colors it by sublocality (no base map tiles are drawn); the boxplot and bar plot then compare prices across sublocalities to identify areas with higher or lower prices.

# Load necessary libraries
library(ggplot2)

# Define colors for each locality
locality_colors <- c("New York County" = "red",
                     "Dumbo" = "blue",
                     "Richmond County" = "green",
                     "Manhattan" = "purple",
                     "New York" = "orange",
                     "Kings County" = "yellow",
                     "Queens County" = "cyan",
                     "Bronx County" = "magenta",
                     "Jackson Heights" = "pink",
                     "Brooklyn" = "lightblue",
                     "Snyder Avenue" = "lightgreen",
                     "Brooklyn Heights" = "darkred",
                     "Fort Hamilton" = "darkblue",
                     "Coney Island" = "darkgreen",
                     "Flushing" = "black",
                     "Staten Island" = "darkorange",
                     "Queens" = "darkcyan",
                     "Riverdale" = "darkmagenta",
                     "The Bronx" = "lightgray",
                     "East Bronx" = "darkgray",
                     "Rego Park" = "brown")

# Plot listing locations, colored and labeled by sublocality
ggplot(NY_House_Dataset, aes(x = LONGITUDE, y = LATITUDE, color = SUBLOCALITY)) +
  geom_point() +
  geom_text(aes(label = SUBLOCALITY), size = 2, vjust = -0.5, check_overlap = TRUE) +
  scale_color_manual(values = locality_colors) +
  labs(title = "Geospatial Distribution of Property Prices by Sublocality in New York City",
       x = "Longitude",
       y = "Latitude",
       color = "Sublocality") +
  theme_minimal()

# Boxplot of Property Prices by Sublocality
ggplot(NY_House_Dataset, aes(x = SUBLOCALITY, y = PRICE)) +
  geom_boxplot() +
  coord_flip() +
  labs(title = "Boxplot of Property Prices by Sublocality in New York City",
       x = "Sublocality",
       y = "Price") +
  theme_minimal()

#  Barplot of Average Property Prices by Sublocality
avg_price_by_sublocality <- NY_House_Dataset %>%
  group_by(SUBLOCALITY) %>%
  summarise(AVG_PRICE = mean(PRICE))

ggplot(avg_price_by_sublocality, aes(x = reorder(SUBLOCALITY, AVG_PRICE), y = AVG_PRICE)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Average Property Prices by Sublocality in New York City",
       x = "Sublocality",
       y = "Average Price") +
  theme_minimal()

Together, the map and the price summaries reveal spatial patterns in property prices, with areas like New York County typically exhibiting higher prices compared to outer boroughs like Brooklyn and Queens. Understanding these spatial patterns can help stakeholders make informed decisions regarding real estate investments and transactions.
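
With a named color vector, `scale_color_manual()` matches colors to data values by name, so a sublocality whose spelling differs from the palette name does not receive its intended color. A quick set comparison catches such mismatches before plotting. The sketch below uses hypothetical values (`sublocality` and `demo_colors` are invented for illustration); on the real data, the inputs would be `NY_House_Dataset$SUBLOCALITY` and `locality_colors`.

```r
# Hypothetical sublocality values and palette, invented for illustration
sublocality <- c("Manhattan", "Brooklyn", "Queens", "Rego Park", "Brooklyn")
demo_colors <- c("Manhattan" = "purple",
                 "Brooklyn"  = "lightblue",
                 "Queens"    = "darkcyan")

# Values present in the data but absent from the palette
missing_from_palette <- setdiff(unique(sublocality), names(demo_colors))

# Palette names that never occur in the data (often a sign of a typo)
unused_in_palette <- setdiff(names(demo_colors), unique(sublocality))

missing_from_palette
unused_in_palette
```

Both results should be empty when the palette and the data agree.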

#Price Palette: A Colorful Journey Through New York City’s Sublocality Property Values

The heatmap visualizes the average property prices across different sublocalities in New York City.

Each tile represents a sublocality, with the color intensity indicating the average price level.

Sublocalities with higher average prices will appear darker, while those with lower prices will appear lighter on the heatmap.

library(ggplot2)

# Aggregate data to calculate average price by sublocality
avg_prices <- aggregate(PRICE ~ SUBLOCALITY, data = NY_House_Dataset, FUN = mean)

# Generate the heatmap
ggplot(avg_prices, aes(x = factor(SUBLOCALITY), y = 1, fill = PRICE)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Average Property Prices by Sublocality",
       x = "Sublocality",
       y = NULL,
       fill = "Average Price")

The heatmap offers a quick and intuitive overview of the price distribution across various neighborhoods in New York City.

Stakeholders can identify areas with higher or lower average prices at a glance, aiding in decision-making related to property investments or purchases.

The visualization highlights potential hotspots or areas of interest for real estate professionals and prospective homebuyers, guiding them in their exploration of the housing market.
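
A single-row heatmap is easier to scan when the tiles are ordered by value rather than alphabetically. One way to do this, sketched below on a hypothetical data frame (`demo` is invented for illustration), is to sort the averages and store that order in the factor levels, since ggplot2 draws discrete axes in level order.

```r
# Hypothetical listings, invented for illustration
demo <- data.frame(
  SUBLOCALITY = c("Queens", "Manhattan", "Brooklyn", "Manhattan", "Queens", "Brooklyn"),
  PRICE       = c(600000, 2500000, 900000, 3500000, 700000, 1100000)
)

avg_demo <- aggregate(PRICE ~ SUBLOCALITY, data = demo, FUN = mean)

# Sort by average price and lock that order into the factor levels
avg_demo <- avg_demo[order(avg_demo$PRICE), ]
avg_demo$SUBLOCALITY <- factor(avg_demo$SUBLOCALITY, levels = avg_demo$SUBLOCALITY)

levels(avg_demo$SUBLOCALITY)
```

Passing a data frame prepared this way to the heatmap call would arrange the tiles from the lowest to the highest average price, so the color gradient runs in one direction.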

Conclusion:

In this comprehensive data dive into the New York housing market, I have conducted various statistical analyses to gain insights into property prices, trends, and factors influencing the real estate landscape. This investigation involved exploratory data analysis, hypothesis testing, regression modeling, and visualization techniques to understand patterns and make meaningful interpretations. Here are the key findings and takeaways from my analysis:

Price Distribution and Trends:

The distribution of property prices in New York City is wide-ranging, with a considerable variation across different sublocalities and property types.

The dataset has no date field, so it cannot show how prices change over time; the variation observed here is across listings, not across years.

Impact of Property Features:

Property features such as the number of bedrooms, bathrooms, and square footage have a significant impact on property prices.

The analysis revealed that properties with more bedrooms, bathrooms, or larger square footage tend to command higher prices.

Geographical Variations:

Property prices vary significantly across different sublocalities in New York City, with certain areas commanding higher prices due to factors like location, amenities, and demand.

Sublocalities like New York County, Dumbo, and Manhattan exhibit higher average prices, while areas such as Rego Park and The Bronx offer more affordable housing options.

Hypothesis Testing Results:

Hypothesis testing confirmed significant differences in property prices based on factors like the number of bedrooms and bathrooms.

For example, properties with two bathrooms were found to have significantly higher prices compared to those with only one bathroom.

Implications for Real Estate Market:

These insights provide valuable information for real estate professionals, investors, and prospective homebuyers looking to navigate the New York housing market.

Understanding the factors influencing property prices can help stakeholders make informed decisions regarding investments, pricing strategies, and property purchases.

In summary, this data dive offers a comprehensive analysis of the New York housing market, uncovering key trends, patterns, and factors influencing property prices. By leveraging statistical techniques and visualization tools, we can provide actionable insights that can inform decision-making and drive success in the dynamic real estate landscape of New York City.

Data Dive - Final Project

Abhinandhan Velagapudi

2024-04-21
