week10

# Load necessary libraries
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load your data
data <- read.csv("AB_NYC_2019.csv")  # Replace with your actual data file
str(data)

## 'data.frame':    48895 obs. of  16 variables:
##  $ id                            : int  2539 2595 3647 3831 5022 5099 5121 5178 5203 5238 ...
##  $ name                          : chr  "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : int  2787 2845 4632 4869 7192 7322 7356 8967 7490 7549 ...
##  $ host_name                     : chr  "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr  "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr  "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num  40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num  -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr  "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : int  149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : int  1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : int  9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : chr  "2018-10-19" "2019-05-21" "" "2019-07-05" ...
##  $ reviews_per_month             : num  0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: int  6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : int  365 355 365 194 0 129 0 220 0 188 ...

head(data)

##     id                                             name host_id   host_name
## 1 2539               Clean & quiet apt home by the park    2787        John
## 2 2595                            Skylit Midtown Castle    2845    Jennifer
## 3 3647              THE VILLAGE OF HARLEM....NEW YORK !    4632   Elisabeth
## 4 3831                  Cozy Entire Floor of Brownstone    4869 LisaRoxanne
## 5 5022 Entire Apt: Spacious Studio/Loft by central park    7192       Laura
## 6 5099        Large Cozy 1 BR Apartment In Midtown East    7322       Chris
##   neighbourhood_group neighbourhood latitude longitude       room_type price
## 1            Brooklyn    Kensington 40.64749 -73.97237    Private room   149
## 2           Manhattan       Midtown 40.75362 -73.98377 Entire home/apt   225
## 3           Manhattan        Harlem 40.80902 -73.94190    Private room   150
## 4            Brooklyn  Clinton Hill 40.68514 -73.95976 Entire home/apt    89
## 5           Manhattan   East Harlem 40.79851 -73.94399 Entire home/apt    80
## 6           Manhattan   Murray Hill 40.74767 -73.97500 Entire home/apt   200
##   minimum_nights number_of_reviews last_review reviews_per_month
## 1              1                 9  2018-10-19              0.21
## 2              1                45  2019-05-21              0.38
## 3              3                 0                            NA
## 4              1               270  2019-07-05              4.64
## 5             10                 9  2018-11-19              0.10
## 6              3                74  2019-06-22              0.59
##   calculated_host_listings_count availability_365
## 1                              6              365
## 2                              2              355
## 3                              1              365
## 4                              1              194
## 5                              1                0
## 6                              1              129

Select an interesting binary column of data

library(dplyr)


data <- data %>%
  mutate(
    room_type = as.factor(room_type),
    neighbourhood_group = as.factor(neighbourhood_group),
    High_Price = ifelse(price > 150, 1, 0)  # 1 if price exceeds the assumed threshold of 150, else 0
  )

I’m going to select a binary column that can reasonably be modeled with logistic regression. One interesting binary variable is High_Price, which splits listings into two groups by price:

High_Price = 1 if the price is greater than 150
High_Price = 0 if the price is 150 or less

This is a relevant variable to model because price is likely one of the most important determinants of listing outcomes such as bookings, visibility, and attractiveness. By splitting listings into high-priced versus not, we can explore which factors drive whether a listing falls into the high-price category.
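As a quick check of the definition above, the sketch below applies the same rule to the six prices printed by head(data). The vector is typed in by hand for illustration; it is not read from the data file. Note that a price of exactly 150 falls into the 0 group because the comparison is strict.

```r
# Prices of the first six listings, copied by hand from the head(data) output
price <- c(149, 225, 150, 89, 80, 200)

# 1 when price is strictly greater than the threshold, 0 otherwise
threshold <- 150
high_price <- as.integer(price > threshold)

print(high_price)

# Share of these listings flagged as high-priced
print(mean(high_price))
```

On the full data set, `table(data$High_Price)` would show how balanced the two classes are before the model is fitted.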

Building a Logistic Regression Model

# Build logistic regression model
log_model <- glm(High_Price ~ room_type + number_of_reviews + neighbourhood_group + availability_365,
                      data = data,
                      family = binomial)

# View summary of model to see coefficients
summary(log_model)

## 
## Call:
## glm(formula = High_Price ~ room_type + number_of_reviews + neighbourhood_group + 
##     availability_365, family = binomial, data = data)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -1.906e+00  1.256e-01 -15.177  < 2e-16 ***
## room_typePrivate room            -2.824e+00  3.144e-02 -89.818  < 2e-16 ***
## room_typeShared room             -3.145e+00  1.284e-01 -24.503  < 2e-16 ***
## number_of_reviews                -3.974e-03  3.007e-04 -13.215  < 2e-16 ***
## neighbourhood_groupBrooklyn       1.315e+00  1.259e-01  10.439  < 2e-16 ***
## neighbourhood_groupManhattan      2.389e+00  1.256e-01  19.027  < 2e-16 ***
## neighbourhood_groupQueens         6.692e-01  1.320e-01   5.069 3.99e-07 ***
## neighbourhood_groupStaten Island  3.540e-01  2.076e-01   1.706   0.0881 .  
## availability_365                  2.992e-03  9.605e-05  31.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 60186  on 48894  degrees of freedom
## Residual deviance: 42084  on 48886  degrees of freedom
## AIC: 42102
## 
## Number of Fisher Scoring iterations: 5

Interpret the coefficients

I model High_Price with logistic regression, using explanatory variables that could plausibly account for whether a listing is high-priced. The explanatory variables are:

room_type: Prices are likely to differ across room types (Entire home/apt, Private room, and Shared room).
neighbourhood_group: Prices may differ between boroughs. Manhattan tends to be the most expensive.
number_of_reviews: Listings with more reviews might indicate higher demand and thus higher prices.
availability_365: Listings that are available for more of the year may charge a higher price, possibly reflecting a different level of demand.
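One way to compare these expectations with the fitted model is to exponentiate the coefficients, which turns log-odds into odds ratios. The sketch below hard-codes the estimates from the summary output above so that it runs on its own; with the fitted model available, `exp(coef(log_model))` would give the same quantities. The vector name `coefs` is just an illustrative choice.

```r
# Estimates copied by hand from the summary(log_model) output above
coefs <- c(
  "room_typePrivate room"            = -2.824,
  "room_typeShared room"             = -3.145,
  "number_of_reviews"                = -3.974e-03,
  "neighbourhood_groupBrooklyn"      =  1.315,
  "neighbourhood_groupManhattan"     =  2.389,
  "neighbourhood_groupQueens"        =  0.6692,
  "neighbourhood_groupStaten Island" =  0.3540,
  "availability_365"                 =  2.992e-03
)

# An odds ratio above 1 raises the odds of High_Price = 1, below 1 lowers them
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))
```

Read this way, the Manhattan odds ratio is roughly 11 relative to the omitted reference borough, while the odds ratio for number_of_reviews is slightly below 1, which runs against the expectation that more reviews go with higher prices.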

coef_estimate <- coef(log_model)["room_typePrivate room"]
standard_error <- summary(log_model)$coefficients["room_typePrivate room", "Std. Error"]

# Calculate the 95% confidence interval
z_value <- 1.96  # Critical value for 95% confidence
CI_lower <- coef_estimate - z_value * standard_error
CI_upper <- coef_estimate + z_value * standard_error

# Print the results
cat("Coefficient for 'room_typePrivate room':", coef_estimate, "\n")

## Coefficient for 'room_typePrivate room': -2.823969

cat("Standard Error for 'room_typePrivate room':", standard_error, "\n")

## Standard Error for 'room_typePrivate room': 0.03144105

cat("95% Confidence Interval for 'room_typePrivate room': [", CI_lower, ",", CI_upper, "]\n")

## 95% Confidence Interval for 'room_typePrivate room': [ -2.885593 , -2.762344 ]

Explanation of the Results

Coefficient for ‘room_typePrivate room’: The coefficient of -2.824 is on the log-odds scale and is measured relative to the baseline category, “Entire home/apt” (the first factor level, which is why it has no row in the summary). Holding the other predictors constant, the odds that a “Private room” listing is high-priced are exp(-2.824) ≈ 0.06 times the odds for an entire home or apartment, so private rooms are far less likely to be high-priced.
Standard Error for ‘room_typePrivate room’: The standard error of 0.031 measures the sampling variability of the estimate. It is small relative to the coefficient, which suggests that the estimate of -2.824 is fairly precise.
95% Confidence Interval for ‘room_typePrivate room’: The interval [-2.886, -2.762] is the range of log-odds effects for “Private room” that is consistent with the data at the 95% confidence level. Because the interval does not contain zero, the effect of “Private room” on the High_Price classification is statistically significant at the 5% level.
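Because log-odds are hard to read directly, it can help to move the estimate and its interval onto the odds-ratio scale. The sketch below reuses the estimate and standard error printed above as hard-coded numbers, and swaps the rounded 1.96 for `qnorm(0.975)`; the names `ci_log_odds` and `ci_odds_ratio` are illustrative.

```r
# Estimate and standard error as printed above for 'room_typePrivate room'
coef_estimate  <- -2.823969
standard_error <- 0.03144105

# qnorm(0.975) is the normal critical value that 1.96 approximates
z_value <- qnorm(0.975)
ci_log_odds <- coef_estimate + c(-1, 1) * z_value * standard_error

# Exponentiate to move from log-odds to odds ratios
odds_ratio    <- exp(coef_estimate)
ci_odds_ratio <- exp(ci_log_odds)

cat("Odds ratio:", round(odds_ratio, 3), "\n")
cat("95% CI for odds ratio: [", round(ci_odds_ratio, 3), "]\n")
```

The whole interval lies well below 1, which is the odds-ratio equivalent of the log-odds interval excluding zero. With the fitted model in memory, `exp(confint.default(log_model))` should produce the same kind of Wald interval for every coefficient at once.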

Inferences Drawn

Likelihood of High Price: Private rooms are much less likely than entire homes or apartments (the reference room type) to fall into the high-price category, holding the other predictors constant.
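To make this concrete, the coefficients can be turned into predicted probabilities for two otherwise identical listings. The sketch below uses estimates copied from the summary output and a hypothetical listing profile (Manhattan, 10 reviews, available 180 days a year) chosen only for illustration.

```r
# Estimates copied by hand from the summary(log_model) output above
intercept      <- -1.906
b_private_room <- -2.824
b_manhattan    <-  2.389
b_reviews      <- -3.974e-03
b_availability <-  2.992e-03

# Hypothetical listing profile (an assumption, not a row from the data)
reviews      <- 10
availability <- 180

# Linear predictor for an entire home/apt (the reference room type)
log_odds_entire <- intercept + b_manhattan +
  b_reviews * reviews + b_availability * availability

# Same listing offered as a private room
log_odds_private <- log_odds_entire + b_private_room

# plogis() is the inverse logit: it maps log-odds to probabilities
p_entire  <- plogis(log_odds_entire)
p_private <- plogis(log_odds_private)

cat("P(High_Price) entire home/apt:", round(p_entire, 2), "\n")
cat("P(High_Price) private room:   ", round(p_private, 2), "\n")
```

For this profile the predicted probability drops from about 0.73 for an entire home to about 0.14 for a private room. With the fitted model, `predict(log_model, newdata = ..., type = "response")` would be the usual way to get these values.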

Significance

Interpretation: This result is useful to both hosts and guests. Hosts should expect that a Private room listing will usually sit in a lower price tier, which may attract guests on a tight budget.

Further Questions

Other Room Types: It could be informative to conduct similar analyses on other room types to determine if similar patterns hold.
Interaction Effects: Are there interactions between room type and other factors (e.g., neighborhood, season) that might influence the likelihood of a high-price classification?
Threshold Sensitivity: Would moving the high-price threshold change this relationship between room type and price classification?
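The threshold question could be explored by refitting the model at several cut-offs and tracking the Private room coefficient. The sketch below does this on simulated prices, not on the real Airbnb file, so the numbers it prints say nothing about New York listings; the data frame `sim`, the price distribution, and the thresholds 100, 150, and 200 are all assumptions made for illustration.

```r
set.seed(42)

# Simulated stand-in for the listings data (NOT the real Airbnb file)
n <- 2000
room_type <- factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE))

# Log-normal prices, with entire homes priced higher on average
log_mean <- ifelse(room_type == "Entire home/apt", 5.2, 4.4)
price <- round(exp(rnorm(n, mean = log_mean, sd = 0.5)))
sim <- data.frame(room_type, price)

thresholds <- c(100, 150, 200)

# Refit the model at each threshold and keep the Private room coefficient
private_coef <- sapply(thresholds, function(th) {
  sim$High_Price <- as.integer(sim$price > th)
  fit <- glm(High_Price ~ room_type, data = sim, family = binomial)
  unname(coef(fit)["room_typePrivate room"])
})

print(data.frame(threshold = thresholds, private_room_coef = private_coef))
```

Running the same loop on the real data, with the full set of predictors, would show whether the sign and rough size of the Private room effect survive a change in threshold.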