The goal of this homework is hands-on practice with data exploration and cleaning, classification, and model selection. You will:
The Assignment
The data in the accompanying file “dc_airbnb_file.csv” (posted on Canvas) contains data about 4936 Airbnb.com listings from the Washington, DC area. This is a small subset of the data that you will eventually use for the class project. The variables in this dataset should be self-explanatory but we have also provided a data dictionary.
library(readr)
airbnb <- read_csv(“C:\Users\dell\Downloads\dc_airbnb_hw1.csv”)
Your task is to develop models to predict the target variable “high_booking_rate”, which labels whether a listing is popular (i.e. spends most of the time booked) or not.
Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.
RUBRIC: To receive a passing score on this assignment, you must do the following:
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.
airbnb <- read_csv("dc_airbnb_hw1.csv") #read the dataset in R
## Rows: 4936 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, bed_type, cancellation_policy, cleaning_fee, price, property_...
## dbl (5): accommodates, bedrooms, beds, host_total_listings_count, high_booki...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(airbnb) #variables used in dataset
## [1] "name" "accommodates"
## [3] "bed_type" "bedrooms"
## [5] "beds" "cancellation_policy"
## [7] "cleaning_fee" "host_total_listings_count"
## [9] "price" "property_type"
## [11] "room_type" "high_booking_rate"
What is the mean of the accommodates variable?
ANSWER: The mean number of people that can be accommodated in a listing in this dataset is 3.522893.
age_mean <- airbnb %>%
summarise(mean_accommodates = mean(accommodates))
#a) Clean up and process some of the variables by doing the following (I recommend you do these in the order in which they are listed):
#- For cancellation_policy, group “strict” and “super_strict_30” into “strict” #- Convert cleaning_fee and price into numbers #- Replace NAs in cleaning_fee and price with 0 #- Replace NAs in other numerical variables with their mean
#Hint: there are lots of ways to deal with cleaning_fee and price. I like the parse_number function from the readr library.
#ANSWER TO QUESTION 1a HERE:
#PUT QUESTION 1a CODE HERE
airbnb <- airbnb %>%
mutate(cancellation_policy = if_else(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy)) %>%
mutate(across(c(cleaning_fee, price), ~parse_number(.x) %>% replace_na(0))) %>%
mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm = TRUE))))
view(airbnb)
price_per_person is the nightly price per accommodates
has_cleaning_fee is YES if there is a cleaning fee, and NO otherwise
bed_category is “bed” if the bed_type is Real Bed and “other” otherwise
property_category has the following values:
make sure to convert property_category to a factor!
ppp_ind is 1 if the price_per_person is greater than the median for the property_category, and 0 otherwise
Hint: there are lots of ways to do the property_type variable, but case_when might be a useful function that’s part of the tidyverse.
ANSWER TO QUESTION 1b HERE:
airbnb <- airbnb %>%
mutate(
price_per_person = price / accommodates,
has_cleaning_fee = if_else(cleaning_fee > 0, "YES", "NO"),
bed_category = if_else(bed_type == "Real Bed", "bed", "other"),
property_category = case_when(
property_type %in% c("Apartment", "Serviced apartment", "Loft") ~ "apartment",
property_type %in% c("Bed & Breakfast", "Boutique hotel", "Hostel") ~ "hotel",
property_type %in% c("Townhouse", "Condominium") ~ "condo",
property_type %in% c("Bungalow", "House") ~ "house",
TRUE ~ "other"
)
) %>%
group_by(property_category) %>%
mutate(ppp_ind = as.integer(price_per_person > median(price_per_person, na.rm = TRUE))) %>%
ungroup() %>%
mutate(property_category = as.factor(property_category))
summary(airbnb$price_per_person)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6875 29.0000 40.0000 47.3192 55.0000 2575.0000
table(airbnb$has_cleaning_fee)
##
## NO YES
## 1208 3728
table(airbnb$bed_category)
##
## bed other
## 4786 150
table(airbnb$property_category)
##
## apartment condo hotel house other
## 2685 558 62 1561 70
summary(airbnb$ppp_ind)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4899 1.0000 1.0000
airbnb <- airbnb %>%
mutate(
bed_type = as.factor(bed_type),
cancellation_policy = as.factor(cancellation_policy),
room_type = as.factor(room_type)
)
ANSWER TO QUESTION 1c HERE:
library(ggplot2)
airbnb <- airbnb %>%
mutate(high_booking_rate = as.factor(high_booking_rate))
# Bplot: price_per_person by high_booking_rate
ggplot(airbnb, aes(x = high_booking_rate, y = price_per_person)) +
geom_boxplot() +
labs(title = "Boxplot of Price Per Person by High Booking Rate",
x = "High Booking Rate",
y = "Price Per Person")
# Bplot: log(price_per_person) by high_booking_rate
ggplot(airbnb, aes(x = high_booking_rate, y = log(price_per_person))) +
geom_boxplot() +
labs(title = "Boxplot of Log(Price Per Person) by High Booking Rate",
x = "High Booking Rate",
y = "Log(Price Per Person)")
(hint: prop.table might be especially helpful here)
ANSWER TO QUESTION 1e HERE:
install.packages("dplyr")
## Warning: package 'dplyr' is in use and will not be installed
two_way_table <- table(airbnb$property_category, airbnb$high_booking_rate)
prop.table(two_way_table, margin = 1)
##
## 0 1
## apartment 0.7497207 0.2502793
## condo 0.6935484 0.3064516
## hotel 0.8064516 0.1935484
## house 0.7674568 0.2325432
## other 0.7142857 0.2857143
ANSWER TO QUESTION 2a HERE:Multiple R-squared: 0.07898, Adjusted R-squared: 0.07617 0.079 indicates that the model doesn’t fit the data very well, as it is less than 8% of the variance in high_booking_rate.
airbnb$high_booking_rate_numeric <- as.numeric(airbnb$high_booking_rate) - 1
model <- lm(high_booking_rate_numeric ~ cancellation_policy + cleaning_fee + price_per_person +
ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
beds + host_total_listings_count + property_category, data = airbnb)
model_summary <- summary(model)
print(model_summary)
##
## Call:
## lm(formula = high_booking_rate_numeric ~ cancellation_policy +
## cleaning_fee + price_per_person + ppp_ind + has_cleaning_fee +
## accommodates + bed_category + bedrooms + beds + host_total_listings_count +
## property_category, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0026 -0.2862 -0.1748 0.4317 0.9956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.909e-01 2.112e-02 9.039 < 2e-16 ***
## cancellation_policymoderate 8.553e-02 1.581e-02 5.410 6.59e-08 ***
## cancellation_policystrict 6.877e-02 1.594e-02 4.316 1.62e-05 ***
## cleaning_fee -1.229e-03 1.899e-04 -6.470 1.08e-10 ***
## price_per_person -4.609e-04 1.404e-04 -3.283 0.001034 **
## ppp_ind -9.495e-02 1.387e-02 -6.847 8.47e-12 ***
## has_cleaning_feeYES 1.458e-01 1.732e-02 8.417 < 2e-16 ***
## accommodates 1.228e-02 5.994e-03 2.049 0.040545 *
## bed_categoryother -3.760e-02 3.488e-02 -1.078 0.281049
## bedrooms -8.041e-02 1.190e-02 -6.758 1.56e-11 ***
## beds 4.345e-02 8.964e-03 4.848 1.29e-06 ***
## host_total_listings_count -2.230e-04 7.144e-05 -3.122 0.001809 **
## property_categorycondo 6.576e-02 1.981e-02 3.320 0.000907 ***
## property_categoryhotel -1.661e-02 5.405e-02 -0.307 0.758573
## property_categoryhouse 4.977e-03 1.433e-02 0.347 0.728283
## property_categoryother -3.811e-03 5.109e-02 -0.075 0.940545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4167 on 4920 degrees of freedom
## Multiple R-squared: 0.07898, Adjusted R-squared: 0.07617
## F-statistic: 28.13 on 15 and 4920 DF, p-value: < 2.2e-16
airbnb$high_booking_rate <- as.factor(airbnb$high_booking_rate)
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
beds + host_total_listings_count + property_category,
family = binomial(), data = airbnb)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee +
## price_per_person + ppp_ind + has_cleaning_fee + accommodates +
## bed_category + bedrooms + beds + host_total_listings_count +
## property_category, family = binomial(), data = airbnb)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.978980 0.160907 -6.084 1.17e-09 ***
## cancellation_policymoderate 0.510296 0.093378 5.465 4.63e-08 ***
## cancellation_policystrict 0.444980 0.095816 4.644 3.42e-06 ***
## cleaning_fee -0.006897 0.001299 -5.308 1.11e-07 ***
## price_per_person -0.019105 0.002891 -6.608 3.91e-11 ***
## ppp_ind -0.053136 0.108611 -0.489 0.6247
## has_cleaning_feeYES 0.841786 0.107402 7.838 4.59e-15 ***
## accommodates 0.040726 0.035167 1.158 0.2468
## bed_categoryother -0.264513 0.208151 -1.271 0.2038
## bedrooms -0.399557 0.069183 -5.775 7.68e-09 ***
## beds 0.223325 0.050891 4.388 1.14e-05 ***
## host_total_listings_count -0.007824 0.003194 -2.449 0.0143 *
## property_categorycondo 0.349404 0.109380 3.194 0.0014 **
## property_categoryhotel 0.100804 0.352798 0.286 0.7751
## property_categoryhouse -0.052321 0.083447 -0.627 0.5307
## property_categoryother -0.140461 0.291836 -0.481 0.6303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5560.1 on 4935 degrees of freedom
## Residual deviance: 5082.6 on 4920 degrees of freedom
## AIC: 5114.6
##
## Number of Fisher Scoring iterations: 8
New Listing:
Hint: you will have to do the same processing on this new listing as you applied to the main dataset, but feel free to do it by hand rather than using the same code as before.
ANSWER TO QUESTION 2b HERE:
new_listing <- data.frame(#new listing data
cancellation_policy = "strict",
cleaning_fee = 30,
price_per_person = 200 / 4,
ppp_ind = as.integer((200 / 4) > median(airbnb$price_per_person[airbnb$property_category == "condo"], na.rm = TRUE)),
has_cleaning_fee = "YES",
accommodates = 4,
bed_category = "bed",
bedrooms = 3,
beds = 4,
host_total_listings_count = 1,
property_category = "condo"
)
predicted_prob <- predict(logistic_model, newdata = new_listing, type = "response")
predicted_prob
## 1
## 0.3299597
ANSWER TO QUESTION 2c HERE:0.33 probability indicates ambiguity towards its booking rate. Since the probability is lessthan 0.5 that the model lacks strong confidence in its prediction, thus seek a higher level of certainty before making a decision based on this prediction.
Hint: don’t worry if you get a warning here…
ANSWER TO QUESTION 3a HERE:5114.554
AIC(logistic_model)
## [1] 5114.554
ANSWER TO QUESTION 3b HERE: -0.019105. To interpret this coefficient in terms of odds, we can exponentiate it: 0.981
The coefficient of -0.019105 for price per person indicates that as the price per person increases by one unit (e.g., one dollar), the odds of having a high booking rate decrease by about 1.9%. This suggests that more expensive listings per person are slightly less likely to be highly booked.
ANSWER TO QUESTION 3c HERE: 0.349404. in terms of odds: 1.418 = 41.8% higher odds of a high booking rate compared to apartments, when all other factors are held constant. Condos are thus more likely to be highly booked.
ANSWER TO QUESTION 3d HERE:0.3299597
predicted_probability <- predict(logistic_model, newdata = new_listing, type = "response")
predicted_probability
## 1
## 0.3299597
ANSWER TO QUESTION 3e HERE: Logistic regression model would be the more better choice.
As it is specifically designed for binary dependent variables, like high_booking_rate, which can take on only 2 values (e.g., 0 or 1, representing low or high booking rates). It models the probability of the target variable being in one of the two classes, which is more suitable for interpreting relationships in a classification context.
The coefficients in a logistic regression model represent the change in the log odds of the dependent variable for a one-unit change in the predictor variable, holding all other variables constant. This interpretation is more meaningful for binary outcomes than the interpretation of coefficients in a linear regression model, which would imply a continuous change in a binary variable.
Logistic regression can capture the non-linear relationship between the features & the log odds of the target variable, which is often more realistic for binary outcomes.
ANSWER TO QUESTION 4a HERE:
set.seed(12345)
split_index <- sample(nrow(airbnb), size = 0.7 * nrow(airbnb))
train_data <- airbnb[split_index, ]
validation_data <- airbnb[-split_index, ]
train_data$high_booking_rate_numeric <- as.numeric(train_data$high_booking_rate) - 1
linear_model_train <- lm(high_booking_rate_numeric ~ cancellation_policy + cleaning_fee + price_per_person + ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
beds + host_total_listings_count + property_category, data = train_data)
summary(linear_model_train)$r.squared
## [1] 0.08106124
train_data$high_booking_rate <- as.factor(train_data$high_booking_rate)
validation_data$high_booking_rate <- as.factor(validation_data$high_booking_rate)
logistic_model_train <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
beds + host_total_listings_count + property_category,
family = binomial(), data = train_data)
AIC(logistic_model_train)
## [1] 3585.261
ANSWER TO QUESTION 4b HERE:0.4250136 the validation RMSE is higher than the training RMSE due to overfitting
train_predictions <- predict(linear_model_train, newdata = train_data)
train_rmse <- sqrt(mean((train_data$high_booking_rate_numeric - train_predictions)^2))
validation_predictions <- predict(linear_model_train, newdata = validation_data)
validation_rmse <- sqrt(mean((validation_data$high_booking_rate_numeric - validation_predictions)^2))
cat("Training RMSE:", train_rmse, "\n")
## Training RMSE: 0.4148109
cat("Validation RMSE:", validation_rmse, "\n")
## Validation RMSE: 0.4250136
ANSWER TO QUESTION 4c HERE:
validation_predictions_linear <- predict(linear_model_train, newdata = validation_data)
predicted_classes_linear <- ifelse(validation_predictions_linear > 0.5, 1, 0)
conf_matrix_linear <- table(Predicted = predicted_classes_linear, Actual = validation_data$high_booking_rate_numeric)
cat("Confusion Matrix for Linear Regression Model:\n")
## Confusion Matrix for Linear Regression Model:
print(conf_matrix_linear)
## Actual
## Predicted 0 1
## 0 1100 373
## 1 5 3
validation_predictions_logistic <- predict(logistic_model_train, newdata = validation_data, type = "response")
predicted_classes_logistic <- ifelse(validation_predictions_logistic > 0.5, 1, 0)
conf_matrix_logistic <- table(Predicted = predicted_classes_logistic, Actual = validation_data$high_booking_rate_numeric)
cat("Confusion Matrix for Logistic Regression Model:\n")
## Confusion Matrix for Logistic Regression Model:
print(conf_matrix_logistic)
## Actual
## Predicted 0 1
## 0 1081 348
## 1 24 28
ANSWER TO QUESTION 4d HERE:
linear_predictions_binary <- ifelse(validation_predictions >= 0.5, 1, 0)
linear_confusion_matrix <- table(Actual = validation_data$high_booking_rate_numeric, Predicted = linear_predictions_binary)
print(linear_confusion_matrix)
## Predicted
## Actual 0 1
## 0 1100 5
## 1 373 3
linear_accuracy <- sum(diag(linear_confusion_matrix)) / sum(linear_confusion_matrix)
linear_tpr <- linear_confusion_matrix[2, 2] / sum(linear_confusion_matrix[2, ])
linear_fpr <- linear_confusion_matrix[1, 2] / sum(linear_confusion_matrix[1, ])
cat("Linear Model Accuracy:", linear_accuracy, "\n")
## Linear Model Accuracy: 0.744767
cat("Linear Model TPR:", linear_tpr, "\n")
## Linear Model TPR: 0.007978723
cat("Linear Model FPR:", linear_fpr, "\n")
## Linear Model FPR: 0.004524887
logistic_predictions_prob <- predict(logistic_model_train, newdata = validation_data, type = "response")
logistic_predictions_binary <- ifelse(logistic_predictions_prob >= 0.5, 1, 0)
logistic_confusion_matrix <- table(Actual = validation_data$high_booking_rate_numeric, Predicted = logistic_predictions_binary)
print(logistic_confusion_matrix)
## Predicted
## Actual 0 1
## 0 1081 24
## 1 348 28
logistic_accuracy <- sum(diag(logistic_confusion_matrix)) / sum(logistic_confusion_matrix)
logistic_tpr <- logistic_confusion_matrix[2, 2] / sum(logistic_confusion_matrix[2, ])
logistic_fpr <- logistic_confusion_matrix[1, 2] / sum(logistic_confusion_matrix[1, ])
cat("Logistic Model Accuracy:", logistic_accuracy, "\n")
## Logistic Model Accuracy: 0.7488184
cat("Logistic Model TPR:", logistic_tpr, "\n")
## Logistic Model TPR: 0.07446809
cat("Logistic Model FPR:", logistic_fpr, "\n")
## Logistic Model FPR: 0.02171946
# Function to generate a confusion matrix and calculate metrics
evaluate_model <- function(predictions, actual) {
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
conf_matrix <- table(Predicted = predicted_classes, Actual = actual)
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
tpr <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
fpr <- conf_matrix[1, 2] / sum(conf_matrix[1, ])
list(conf_matrix = conf_matrix, accuracy = accuracy, tpr = tpr, fpr = fpr)
}
linear_results <- evaluate_model(validation_predictions_linear, validation_data$high_booking_rate_numeric)
cat("Linear Model:\n")
## Linear Model:
print(linear_results$conf_matrix)
## Actual
## Predicted 0 1
## 0 1100 373
## 1 5 3
cat("Accuracy:", linear_results$accuracy, "TPR:", linear_results$tpr, "FPR:", linear_results$fpr, "\n")
## Accuracy: 0.744767 TPR: 0.375 FPR: 0.2532247
logistic_results <- evaluate_model(validation_predictions_logistic, validation_data$high_booking_rate_numeric)
cat("Logistic Model:\n")
## Logistic Model:
print(logistic_results$conf_matrix)
## Actual
## Predicted 0 1
## 0 1081 348
## 1 24 28
cat("Accuracy:", logistic_results$accuracy, "TPR:", logistic_results$tpr, "FPR:", logistic_results$fpr, "\n")
## Accuracy: 0.7488184 TPR: 0.5384615 FPR: 0.2435269
most_accurate <- ifelse(linear_results$accuracy > logistic_results$accuracy, "Linear", "Logistic")
cat("The most accurate model is the", most_accurate, "model.\n")
## The most accurate model is the Logistic model.
ANSWER TO QUESTION 4e HERE:
compute_metrics <- function(predictions, actual, cutoff) {
predictions_binary <- ifelse(predictions >= cutoff, 1, 0)
confusion_matrix <- table(Actual = actual, Predicted = predictions_binary)
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
tpr <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
return(list(Accuracy = accuracy, TPR = tpr, FPR = fpr))
}
metrics_cutoff_0_4 <- compute_metrics(logistic_predictions_prob, validation_data$high_booking_rate_numeric, 0.4)
cat("Metrics for Cutoff 0.4:\n")
## Metrics for Cutoff 0.4:
print(metrics_cutoff_0_4)
## $Accuracy
## [1] 0.7359892
##
## $TPR
## [1] 0.2659574
##
## $FPR
## [1] 0.1040724
metrics_cutoff_0_6 <- compute_metrics(logistic_predictions_prob, validation_data$high_booking_rate_numeric, 0.6)
cat("\nMetrics for Cutoff 0.6:\n")
##
## Metrics for Cutoff 0.6:
print(metrics_cutoff_0_6)
## $Accuracy
## [1] 0.7461175
##
## $TPR
## [1] 0.002659574
##
## $FPR
## [1] 0.0009049774
ANSWER TO QUESTION 4f HERE: Application, presay a cutoff of 0.5 yields balanced results, provides a higher True Positive Rate (TPR) than 0.6, indicating better identification of actual high booking rates, while maintaining a lower False Positive Rate (FPR) than 0.4, indicating fewer incorrect high booking rate predictions. The middle ground supports a more reliable classification without excessively favoring either sensitivity or specificity.