# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load your data
data <- read.csv("AB_NYC_2019.csv") # Replace with your actual data file
str(data)
## 'data.frame': 48895 obs. of 16 variables:
## $ id : int 2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
## $ name : chr "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
## $ host_id : int 2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
## $ host_name : chr "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
## $ neighbourhood_group : chr "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
## $ neighbourhood : chr "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
## $ latitude : num 40.6 40.8 40.8 40.7 40.8 ...
## $ longitude : num -74 -74 -73.9 -74 -73.9 ...
## $ room_type : chr "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
## $ price : int 149 225 150 89 80 200 60 79 79 150 ...
## $ minimum_nights : int 1 1 3 1 10 3 45 2 2 1 ...
## $ number_of_reviews : int 9 45 0 270 9 74 49 430 118 160 ...
## $ last_review : chr "2018-10-19" "2019-05-21" "" "2019-07-05" ...
## $ reviews_per_month : num 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
## $ calculated_host_listings_count: int 6 2 1 1 1 1 1 1 1 4 ...
## $ availability_365 : int 365 355 365 194 0 129 0 220 0 188 ...
head(data)
## id name host_id host_name
## 1 2539 Clean & quiet apt home by the park 2787 John
## 2 2595 Skylit Midtown Castle 2845 Jennifer
## 3 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth
## 4 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura
## 6 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris
## neighbourhood_group neighbourhood latitude longitude room_type price
## 1 Brooklyn Kensington 40.64749 -73.97237 Private room 149
## 2 Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225
## 3 Manhattan Harlem 40.80902 -73.94190 Private room 150
## 4 Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89
## 5 Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80
## 6 Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200
## minimum_nights number_of_reviews last_review reviews_per_month
## 1 1 9 2018-10-19 0.21
## 2 1 45 2019-05-21 0.38
## 3 3 0 NA
## 4 1 270 2019-07-05 4.64
## 5 10 9 2018-11-19 0.10
## 6 3 74 2019-06-22 0.59
## calculated_host_listings_count availability_365
## 1 6 365
## 2 2 355
## 3 1 365
## 4 1 194
## 5 1 0
## 6 1 129
# dplyr is already attached by library(tidyverse) above
data <- data %>%
  mutate(
    room_type = as.factor(room_type),
    neighbourhood_group = as.factor(neighbourhood_group),
    High_Price = ifelse(price > 150, 1, 0) # Assuming 150 is the threshold for High_Price
  )
I’m going to model a binary column that is a reasonable fit for logistic regression. One interesting binary variable is High_Price, which splits listings into two groups by price:
High_Price = 1 if price > 150
High_Price = 0 if price ≤ 150
This is a relevant variable to model because price is perhaps the most critical determinant of listing outcomes such as bookings, visibility, and attractiveness. By splitting listings into high-priced and not high-priced, we can explore which factors are associated with a listing falling into the high-price category.
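Before fitting a model it can help to confirm how the threshold splits the data. The sketch below applies the same rule to a small toy data frame; the prices and the names toy, class_counts, and class_share are made up for illustration and are not taken from AB_NYC_2019.csv.

```r
# Toy prices (hypothetical values, not from the real listing data)
toy <- data.frame(price = c(80, 150, 151, 225, 60, 300))

# Same rule as above: strictly greater than 150 counts as high-priced,
# so a listing priced at exactly 150 gets 0
toy$High_Price <- ifelse(toy$price > 150, 1, 0)

# Counts and shares of each class
class_counts <- table(toy$High_Price)
class_share <- prop.table(class_counts)
print(class_counts)
print(class_share)
```

Running table() and prop.table() on data$High_Price in the same way would show how balanced the two classes are in the full dataset.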
# Build logistic regression model
log_model <- glm(
  High_Price ~ room_type + number_of_reviews + neighbourhood_group + availability_365,
  data = data,
  family = binomial
)
# View summary of model to see coefficients
summary(log_model)
##
## Call:
## glm(formula = High_Price ~ room_type + number_of_reviews + neighbourhood_group +
## availability_365, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.906e+00 1.256e-01 -15.177 < 2e-16 ***
## room_typePrivate room -2.824e+00 3.144e-02 -89.818 < 2e-16 ***
## room_typeShared room -3.145e+00 1.284e-01 -24.503 < 2e-16 ***
## number_of_reviews -3.974e-03 3.007e-04 -13.215 < 2e-16 ***
## neighbourhood_groupBrooklyn 1.315e+00 1.259e-01 10.439 < 2e-16 ***
## neighbourhood_groupManhattan 2.389e+00 1.256e-01 19.027 < 2e-16 ***
## neighbourhood_groupQueens 6.692e-01 1.320e-01 5.069 3.99e-07 ***
## neighbourhood_groupStaten Island 3.540e-01 2.076e-01 1.706 0.0881 .
## availability_365 2.992e-03 9.605e-05 31.150 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60186 on 48894 degrees of freedom
## Residual deviance: 42084 on 48886 degrees of freedom
## AIC: 42102
##
## Number of Fisher Scoring iterations: 5
The logistic regression models High_Price using explanatory variables that could plausibly account for whether a listing is high-priced or not:
room_type: Entire home/apt, Private room, and Shared room listings are likely to sit in different price ranges.
neighbourhood_group: Prices may differ by borough. Manhattan is expected to be more expensive than the other boroughs.
number_of_reviews: Listings with more reviews might indicate higher demand and thus higher prices.
availability_365: Listings available on more days of the year may charge higher prices if that availability reflects stronger demand.
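Because room_type and neighbourhood_group are factors, glm() reports each of their coefficients relative to a reference level, which by default is the first factor level. The sketch below shows how to check that level and how to change it. It uses a toy factor rather than the real data, and the names reference_level and room_type_alt are illustrative only.

```r
# Toy factor with the same three room types as the listing data
room_type <- as.factor(c("Private room", "Entire home/apt", "Shared room", "Private room"))

# as.factor() sorts levels alphabetically; the first level is the reference
print(levels(room_type))
reference_level <- levels(room_type)[1]

# relevel() selects a different baseline if another comparison is wanted
room_type_alt <- relevel(room_type, ref = "Private room")
print(levels(room_type_alt))
```

With the default ordering, "Entire home/apt" is the reference, which is why the model summary lists coefficients only for "Private room" and "Shared room".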
coef_estimate <- coef(log_model)["room_typePrivate room"]
standard_error <- summary(log_model)$coefficients["room_typePrivate room", "Std. Error"]
# Calculate the 95% confidence interval
z_value <- 1.96 # Critical value for 95% confidence
CI_lower <- coef_estimate - z_value * standard_error
CI_upper <- coef_estimate + z_value * standard_error
# Print the results
cat("Coefficient for 'room_typePrivate room':", coef_estimate, "\n")
## Coefficient for 'room_typePrivate room': -2.823969
cat("Standard Error for 'room_typePrivate room':", standard_error, "\n")
## Standard Error for 'room_typePrivate room': 0.03144105
cat("95% Confidence Interval for 'room_typePrivate room': [", CI_lower, ",", CI_upper, "]\n")
## 95% Confidence Interval for 'room_typePrivate room': [ -2.885593 , -2.762344 ]
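The manual calculation above is a Wald interval (estimate plus or minus a normal critical value times the standard error). As a cross-check, confint.default() from the stats package computes the same kind of interval directly. The sketch below compares the two on simulated data; the data frame sim, its columns, and the assumed true coefficients are all hypothetical and unrelated to the listing data.

```r
set.seed(1) # simulated data, for illustration only
n <- 500
sim <- data.frame(
  private = rbinom(n, 1, 0.5), # hypothetical 0/1 predictor
  reviews = rpois(n, 20)       # hypothetical count predictor
)

# Assumed true log-odds used to simulate the binary outcome
eta <- 0.5 - 2 * sim$private - 0.01 * sim$reviews
sim$high <- rbinom(n, 1, plogis(eta))

fit <- glm(high ~ private + reviews, data = sim, family = binomial)

# Manual Wald interval, using the exact critical value instead of 1.96
est <- coef(fit)["private"]
se <- summary(fit)$coefficients["private", "Std. Error"]
z <- qnorm(0.975)
manual_ci <- c(est - z * se, est + z * se)

# Built-in Wald interval for the same coefficient
builtin_ci <- confint.default(fit)["private", ]

print(manual_ci)
print(builtin_ci)
```

Using qnorm(0.975) rather than the rounded 1.96 makes the two intervals agree to numerical precision. Note that confint() on a glm object may instead return profile-likelihood intervals, which can differ slightly from the Wald interval.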
Explanation of the Results
Coefficient for ‘room_typePrivate room’: The coefficient of -2.824 indicates that, holding the other predictors fixed, “Private room” listings are significantly less likely to be high-priced than listings in the baseline category, “Entire home/apt” (the room type that has no coefficient of its own in the summary). The negative sign means lower log-odds of being high-priced for private rooms.
Standard Error for ‘room_typePrivate room’: The standard error of 0.031 measures the uncertainty in the estimate. Because it is small relative to the coefficient, the estimate of -2.824 is fairly precise.
95% Confidence Interval for ‘room_typePrivate room’: The interval [-2.886, -2.762] is the range of plausible values, at the 95% confidence level, for the true effect of “Private room” on the log-odds of being high-priced. Because this interval does not contain zero, the effect of this room type is statistically significant at the 5% level.
Likelihood of High Price: Private rooms are less likely to fall into the high-price category than entire homes/apartments, the reference room type. Room type is therefore a strong signal for pricing.
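Log-odds are hard to read directly, so it can help to exponentiate the coefficient and its interval to get an odds ratio. The sketch below uses the values printed earlier in this report; odds_ratio and odds_ratio_ci are names introduced here for illustration.

```r
# Values copied from the printed output above
coef_estimate <- -2.823969
CI_lower <- -2.885593
CI_upper <- -2.762344

# exp() converts log-odds to an odds ratio relative to the reference room type
odds_ratio <- exp(coef_estimate)
odds_ratio_ci <- exp(c(CI_lower, CI_upper))

print(round(c(odds_ratio, odds_ratio_ci), 3))
# → 0.059 0.056 0.063
```

Under this model, the odds that a private room is high-priced are roughly 6% of the odds for an otherwise comparable entire home/apartment.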