The goal of this homework is hands-on practice with data exploration and cleaning, classification, and model selection. You will:
The Assignment
The data in the accompanying file “dc_airbnb_file.csv” (posted on Canvas) contains data about 4936 Airbnb.com listings from the Washington, DC area. This is a small subset of the data that you will eventually use for the class project. The variables in this dataset should be self-explanatory but we have also provided a data dictionary.
Your task is to develop models to predict the target variable “high_booking_rate”, which labels whether a listing is popular (i.e. spends most of the time booked) or not.
Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.
RUBRIC: To receive a passing score on this assignment, you must do the following:
Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.
airbnb <- read_csv("dc_airbnb_hw1.csv") #read the dataset in R
## Rows: 4936 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, bed_type, cancellation_policy, cleaning_fee, price, property_...
## dbl (5): accommodates, bedrooms, beds, host_total_listings_count, high_booki...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(airbnb) #variables used in dataset
## [1] "name" "accommodates"
## [3] "bed_type" "bedrooms"
## [5] "beds" "cancellation_policy"
## [7] "cleaning_fee" "host_total_listings_count"
## [9] "price" "property_type"
## [11] "room_type" "high_booking_rate"
What is the mean of the accommodates variable?
ANSWER: The mean number of people that can be accommodated in a listing in this dataset is 3.522893.
age_mean <- airbnb %>%
summarise(mean_accommodates = mean(accommodates))
Hint: there are lots of ways to deal with cleaning_fee and price. I like the parse_number function from the readr library.
ANSWER TO QUESTION 1a HERE:
#PUT QUESTION 1a CODE HERE
airbnb <- airbnb %>%
mutate(cancellation_policy = ifelse(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy))
airbnb <- airbnb %>%
mutate(
cleaning_fee = parse_number(as.character(cleaning_fee)),
price = parse_number(as.character(price))
)
airbnb$cleaning_fee[is.na(airbnb$cleaning_fee)] <- 0
airbnb$price[is.na(airbnb$price)] <- 0
airbnb <- airbnb %>%
mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm=TRUE))))# For cancellation_policy, group "strict" and "super_strict_30" into "strict"
airbnb <- airbnb %>%
mutate(cancellation_policy = ifelse(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy))
airbnb <- airbnb %>%
mutate(
cleaning_fee = parse_number(as.character(cleaning_fee)),
price = parse_number(as.character(price))
)
airbnb$cleaning_fee[is.na(airbnb$cleaning_fee)] <- 0
airbnb$price[is.na(airbnb$price)] <- 0
airbnb <- airbnb %>%
mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm=TRUE))))
summary(airbnb)
## name accommodates bed_type bedrooms
## Length:4936 Min. : 1.000 Length:4936 Min. :0.000
## Class :character 1st Qu.: 2.000 Class :character 1st Qu.:1.000
## Mode :character Median : 3.000 Mode :character Median :1.000
## Mean : 3.523 Mean :1.297
## 3rd Qu.: 4.000 3rd Qu.:2.000
## Max. :16.000 Max. :8.000
## beds cancellation_policy cleaning_fee
## Min. : 1.000 Length:4936 Min. : 0.00
## 1st Qu.: 1.000 Class :character 1st Qu.: 5.00
## Median : 1.000 Mode :character Median : 35.00
## Mean : 1.821 Mean : 48.47
## 3rd Qu.: 2.000 3rd Qu.: 75.00
## Max. :16.000 Max. :500.00
## host_total_listings_count price property_type
## Min. : 0.00 Min. : 10.0 Length:4936
## 1st Qu.: 1.00 1st Qu.: 80.0 Class :character
## Median : 1.00 Median : 115.0 Mode :character
## Mean : 15.57 Mean : 154.3
## 3rd Qu.: 3.00 3rd Qu.: 175.0
## Max. :879.00 Max. :5150.0
## room_type high_booking_rate
## Length:4936 Min. :0.0000
## Class :character 1st Qu.:0.0000
## Mode :character Median :0.0000
## Mean :0.2508
## 3rd Qu.:1.0000
## Max. :1.0000
price_per_person is the nightly price per accommodates
has_cleaning_fee is YES if there is a cleaning fee, and NO otherwise
bed_category is “bed” if the bed_type is Real Bed and “other” otherwise
property_category has the following values:
make sure to convert property_category to a factor!
ppp_ind is 1 if the price_per_person is greater than the median for the property_category, and 0 otherwise
Hint: there are lots of ways to do the property_type variable, but case_when might be a useful function that’s part of the tidyverse.
ANSWER TO QUESTION 1b HERE:
library(dplyr)
airbnb <- airbnb %>%
mutate(price_per_person = price / accommodates)
airbnb <- airbnb %>%
mutate(has_cleaning_fee = if_else(cleaning_fee > 0, "YES", "NO"))
airbnb <- airbnb %>%
mutate(bed_category = if_else(bed_type == "Real Bed", "bed", "other"))
airbnb <- airbnb %>%
mutate(property_category = case_when(
property_type %in% c("Apartment", "Serviced apartment", "Loft") ~ "apartment",
property_type %in% c("Bed & Breakfast", "Boutique hotel", "Hostel") ~ "hotel",
property_type %in% c("Townhouse", "Condominium") ~ "condo",
property_type %in% c("Bungalow", "House") ~ "house",
TRUE ~ "other"
))
airbnb$property_category <- as.factor(airbnb$property_category)
airbnb <- airbnb %>%
group_by(property_category) %>%
mutate(ppp_ind = if_else(price_per_person > median(price_per_person, na.rm = TRUE), 1, 0))
summary(airbnb)
## name accommodates bed_type bedrooms
## Length:4936 Min. : 1.000 Length:4936 Min. :0.000
## Class :character 1st Qu.: 2.000 Class :character 1st Qu.:1.000
## Mode :character Median : 3.000 Mode :character Median :1.000
## Mean : 3.523 Mean :1.297
## 3rd Qu.: 4.000 3rd Qu.:2.000
## Max. :16.000 Max. :8.000
## beds cancellation_policy cleaning_fee
## Min. : 1.000 Length:4936 Min. : 0.00
## 1st Qu.: 1.000 Class :character 1st Qu.: 5.00
## Median : 1.000 Mode :character Median : 35.00
## Mean : 1.821 Mean : 48.47
## 3rd Qu.: 2.000 3rd Qu.: 75.00
## Max. :16.000 Max. :500.00
## host_total_listings_count price property_type
## Min. : 0.00 Min. : 10.0 Length:4936
## 1st Qu.: 1.00 1st Qu.: 80.0 Class :character
## Median : 1.00 Median : 115.0 Mode :character
## Mean : 15.57 Mean : 154.3
## 3rd Qu.: 3.00 3rd Qu.: 175.0
## Max. :879.00 Max. :5150.0
## room_type high_booking_rate price_per_person has_cleaning_fee
## Length:4936 Min. :0.0000 Min. : 0.6875 Length:4936
## Class :character 1st Qu.:0.0000 1st Qu.: 29.0000 Class :character
## Mode :character Median :0.0000 Median : 40.0000 Mode :character
## Mean :0.2508 Mean : 47.3192
## 3rd Qu.:1.0000 3rd Qu.: 55.0000
## Max. :1.0000 Max. :2575.0000
## bed_category property_category ppp_ind
## Length:4936 apartment:2685 Min. :0.0000
## Class :character condo : 558 1st Qu.:0.0000
## Mode :character hotel : 62 Median :0.0000
## house :1561 Mean :0.4899
## other : 70 3rd Qu.:1.0000
## Max. :1.0000
airbnb$bed_type <- as.factor(airbnb$bed_type)
airbnb$cancellation_policy <- as.factor(airbnb$cancellation_policy)
airbnb$room_type <- as.factor(airbnb$room_type)
summary(airbnb)
## name accommodates bed_type bedrooms
## Length:4936 Min. : 1.000 Airbed : 31 Min. :0.000
## Class :character 1st Qu.: 2.000 Couch : 16 1st Qu.:1.000
## Mode :character Median : 3.000 Futon : 38 Median :1.000
## Mean : 3.523 Pull-out Sofa: 65 Mean :1.297
## 3rd Qu.: 4.000 Real Bed :4786 3rd Qu.:2.000
## Max. :16.000 Max. :8.000
## beds cancellation_policy cleaning_fee
## Min. : 1.000 flexible:1351 Min. : 0.00
## 1st Qu.: 1.000 moderate:1602 1st Qu.: 5.00
## Median : 1.000 strict :1983 Median : 35.00
## Mean : 1.821 Mean : 48.47
## 3rd Qu.: 2.000 3rd Qu.: 75.00
## Max. :16.000 Max. :500.00
## host_total_listings_count price property_type
## Min. : 0.00 Min. : 10.0 Length:4936
## 1st Qu.: 1.00 1st Qu.: 80.0 Class :character
## Median : 1.00 Median : 115.0 Mode :character
## Mean : 15.57 Mean : 154.3
## 3rd Qu.: 3.00 3rd Qu.: 175.0
## Max. :879.00 Max. :5150.0
## room_type high_booking_rate price_per_person
## Entire home/apt:3326 Min. :0.0000 Min. : 0.6875
## Private room :1472 1st Qu.:0.0000 1st Qu.: 29.0000
## Shared room : 138 Median :0.0000 Median : 40.0000
## Mean :0.2508 Mean : 47.3192
## 3rd Qu.:1.0000 3rd Qu.: 55.0000
## Max. :1.0000 Max. :2575.0000
## has_cleaning_fee bed_category property_category ppp_ind
## Length:4936 Length:4936 apartment:2685 Min. :0.0000
## Class :character Class :character condo : 558 1st Qu.:0.0000
## Mode :character Mode :character hotel : 62 Median :0.0000
## house :1561 Mean :0.4899
## other : 70 3rd Qu.:1.0000
## Max. :1.0000
ANSWER TO QUESTION 1c HERE: There is not much difference in the prices. The bookings with cheaper prices attracts more people and place with higher booking has more people with cheaper prices.
library(ggplot2)
airbnb <- airbnb %>%
mutate(log_price_per_person = log(price_per_person))
ggplot(airbnb, aes(x = as.factor(high_booking_rate), y = price_per_person)) +
geom_boxplot() +
labs(title = "Boxplot of price_per_person by high_booking_rate",
x = "High Booking Rate",
y = "Price per Person")
ggplot(airbnb, aes(x = as.factor(high_booking_rate), y = log_price_per_person)) +
geom_boxplot() +
labs(title = "Boxplot of log(price_per_person) by high_booking_rate",
x = "High Booking Rate",
y = "Log(Price_per_Person)")
(hint: prop.table might be especially helpful here)
ANSWER TO QUESTION 1e HERE:
booking_rate_table <- table(airbnb$high_booking_rate, airbnb$property_category)
prop_booking_rate_table <- prop.table(booking_rate_table, margin = 1)
print(prop_booking_rate_table)
##
## apartment condo hotel house other
## 0 0.544348296 0.104651163 0.013520822 0.323958897 0.013520822
## 1 0.542810985 0.138126010 0.009693053 0.293214863 0.016155089
ANSWER TO QUESTION 2a HERE:
airbnb$high_booking_rate <- as.numeric(as.character(airbnb$high_booking_rate))
model <- lm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
ppp_ind + has_cleaning_fee + accommodates + bed_category +
bedrooms + beds + host_total_listings_count + property_category, data = airbnb)
summary(model)
##
## Call:
## lm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee +
## price_per_person + ppp_ind + has_cleaning_fee + accommodates +
## bed_category + bedrooms + beds + host_total_listings_count +
## property_category, data = airbnb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0026 -0.2862 -0.1748 0.4317 0.9956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.909e-01 2.112e-02 9.039 < 2e-16 ***
## cancellation_policymoderate 8.553e-02 1.581e-02 5.410 6.59e-08 ***
## cancellation_policystrict 6.877e-02 1.594e-02 4.316 1.62e-05 ***
## cleaning_fee -1.229e-03 1.899e-04 -6.470 1.08e-10 ***
## price_per_person -4.609e-04 1.404e-04 -3.283 0.001034 **
## ppp_ind -9.495e-02 1.387e-02 -6.847 8.47e-12 ***
## has_cleaning_feeYES 1.458e-01 1.732e-02 8.417 < 2e-16 ***
## accommodates 1.228e-02 5.994e-03 2.049 0.040545 *
## bed_categoryother -3.760e-02 3.488e-02 -1.078 0.281049
## bedrooms -8.041e-02 1.190e-02 -6.758 1.56e-11 ***
## beds 4.345e-02 8.964e-03 4.848 1.29e-06 ***
## host_total_listings_count -2.230e-04 7.144e-05 -3.122 0.001809 **
## property_categorycondo 6.576e-02 1.981e-02 3.320 0.000907 ***
## property_categoryhotel -1.661e-02 5.405e-02 -0.307 0.758573
## property_categoryhouse 4.977e-03 1.433e-02 0.347 0.728283
## property_categoryother -3.811e-03 5.109e-02 -0.075 0.940545
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4167 on 4920 degrees of freedom
## Multiple R-squared: 0.07898, Adjusted R-squared: 0.07617
## F-statistic: 28.13 on 15 and 4920 DF, p-value: < 2.2e-16
New Listing:
Hint: you will have to do the same processing on this new listing as you applied to the main dataset, but feel free to do it by hand rather than using the same code as before.
ANSWER TO QUESTION 2b HERE:
library(dplyr)
median_condo <- median(airbnb$price_per_person[airbnb$property_category == "condo"], na.rm = TRUE)
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
price_per_person + ppp_ind + has_cleaning_fee + accommodates +
bed_category + bedrooms + beds + host_total_listings_count +
property_category, data = airbnb, family=binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predicted_prob <- data.frame(
cancellation_policy = "strict",
cleaning_fee = 30,
price_per_person = 200 / 4,
accommodates = 4,
bed_type = "Real Bed",
bedrooms = 3,
beds = 4,
host_total_listings_count = 1,
property_category = "condo"
) %>%
mutate(
ppp_ind = as.integer(price_per_person > median_condo),
has_cleaning_fee = "YES",
bed_category = "bed",
cancellation_policy = as.factor(cancellation_policy),
bed_category = as.factor(bed_category),
property_category = as.factor(property_category)
) %>%
{predict(logistic_model, newdata = ., type = "response")}
predicted_prob
## 1
## 0.3299597
ANSWER TO QUESTION 2c HERE: The predicted value is 0.3299, which is ultimately lower than 0.5. Hence, we cannot trust the predicted value for model analysis. So, we need more values to analyze the model.
Hint: don’t worry if you get a warning here…
ANSWER TO QUESTION 3a HERE:
airbnb$high_booking_rate <- as.numeric(as.character(airbnb$high_booking_rate))
airbnb$cancellation_policy <- as.factor(airbnb$cancellation_policy)
airbnb$bed_category <- as.factor(airbnb$bed_category)
airbnb$property_category <- as.factor(airbnb$property_category)
airbnb$has_cleaning_fee <- as.factor(airbnb$has_cleaning_fee)
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
price_per_person + ppp_ind + has_cleaning_fee + accommodates +
bed_category + bedrooms + beds + host_total_listings_count +
property_category, data = airbnb, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic_model)
##
## Call:
## glm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee +
## price_per_person + ppp_ind + has_cleaning_fee + accommodates +
## bed_category + bedrooms + beds + host_total_listings_count +
## property_category, family = binomial, data = airbnb)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.978980 0.160907 -6.084 1.17e-09 ***
## cancellation_policymoderate 0.510296 0.093378 5.465 4.63e-08 ***
## cancellation_policystrict 0.444980 0.095816 4.644 3.42e-06 ***
## cleaning_fee -0.006897 0.001299 -5.308 1.11e-07 ***
## price_per_person -0.019105 0.002891 -6.608 3.91e-11 ***
## ppp_ind -0.053136 0.108611 -0.489 0.6247
## has_cleaning_feeYES 0.841786 0.107402 7.838 4.59e-15 ***
## accommodates 0.040726 0.035167 1.158 0.2468
## bed_categoryother -0.264513 0.208151 -1.271 0.2038
## bedrooms -0.399557 0.069183 -5.775 7.68e-09 ***
## beds 0.223325 0.050891 4.388 1.14e-05 ***
## host_total_listings_count -0.007824 0.003194 -2.449 0.0143 *
## property_categorycondo 0.349404 0.109380 3.194 0.0014 **
## property_categoryhotel 0.100804 0.352798 0.286 0.7751
## property_categoryhouse -0.052321 0.083447 -0.627 0.5307
## property_categoryother -0.140461 0.291836 -0.481 0.6303
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5560.1 on 4935 degrees of freedom
## Residual deviance: 5082.6 on 4920 degrees of freedom
## AIC: 5114.6
##
## Number of Fisher Scoring iterations: 8
ANSWER TO QUESTION 3b HERE: The coefficient for price_per_person in the logistic regression model is approximately -0.0191. As the price_per_person increases, the model predicts a proportional decrease in having a high booking rate.
ANSWER TO QUESTION 3c HERE: The coefficient for property_category = condo in the logistic regression model is approximately 0.349404.The model predicts a proportional increase in the odds of having a high booking rate for listings categorized as condos compared to the reference category.
ANSWER TO QUESTION 3d HERE:
ANSWER TO QUESTION 3e HERE: We would choose logistic model as it is more appropriate for explaining the relationship between features and the target. It allows for a better interpretation of the data as its value lies between 0 and 1 and would not exceed 1.
ANSWER TO QUESTION 4a HERE:
set.seed(12345)
# Assuming 'airbnb' is your dataset
# Split the data into 70% training and 30% validation sets
n_rows <- nrow(airbnb)
train_indices <- sample(1:n_rows, 0.7 * n_rows)
train_data <- airbnb[train_indices, ]
valid_data <- airbnb[-train_indices, ]
# Retrain linear regression model on training data
linear_model <- lm(high_booking_rate ~ cancellation_policy + cleaning_fee +
price_per_person + ppp_ind + has_cleaning_fee + accommodates +
bed_category + bedrooms + beds + host_total_listings_count +
property_category, data = train_data)
# Retrain logistic regression model on training data
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
price_per_person + ppp_ind + has_cleaning_fee + accommodates +
bed_category + bedrooms + beds + host_total_listings_count +
property_category, data = train_data, family = binomial)
# Report R^2 for linear regression model on validation set
linear_r_squared <- summary(linear_model)$r.squared
cat("Linear Regression R^2 on Validation Set:", linear_r_squared, "\n")
## Linear Regression R^2 on Validation Set: 0.08106124
# Report AIC for logistic regression model on validation set
logistic_aic <- AIC(logistic_model)
cat("Logistic Regression AIC on Validation Set:", logistic_aic, "\n")
## Logistic Regression AIC on Validation Set: 3585.261
ANSWER TO QUESTION 4b HERE: RMSE on the validation set is slightly higher than the RMSE on the training set which suggests that the model’s performance might be slightly worse on the new data set when compared to the data it was trained on.It is common and not much concern unless the difference is more significant.
train_predictions <- predict(linear_model, newdata = train_data)
train_rmse <- sqrt(mean((train_data$high_booking_rate - train_predictions)^2))
valid_predictions <- predict(linear_model, newdata = valid_data)
valid_rmse <- sqrt(mean((valid_data$high_booking_rate - valid_predictions)^2))
cat("RMSE on Training Set:", train_rmse, "\n")
## RMSE on Training Set: 0.4148109
cat("RMSE on Validation Set:", valid_rmse, "\n")
## RMSE on Validation Set: 0.4250136
ANSWER TO QUESTION 4c HERE:
cutoff <- 0.5
valid_linear_predictions <- predict(linear_model, newdata = valid_data)
valid_linear_predictions_binary <- ifelse(valid_linear_predictions >= cutoff, 1, 0)
conf_matrix_linear <- table(Actual = valid_data$high_booking_rate, Predicted = valid_linear_predictions_binary)
conf_matrix_linear
## Predicted
## Actual 0 1
## 0 1100 5
## 1 373 3
valid_logistic_predictions <- predict(logistic_model, newdata = valid_data, type = "response")
valid_logistic_predictions_binary <- ifelse(valid_logistic_predictions >= cutoff, 1, 0)
conf_matrix_logistic <- table(Actual = valid_data$high_booking_rate, Predicted = valid_logistic_predictions_binary)
conf_matrix_logistic
## Predicted
## Actual 0 1
## 0 1081 24
## 1 348 28
ANSWER TO QUESTION 4d HERE:
calculate_metrics <- function(conf_matrix) {
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
tpr <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
fpr <- conf_matrix[1, 2] / sum(conf_matrix[1, ])
return(c(Accuracy = accuracy, TPR = tpr, FPR = fpr))
}
metrics_linear <- calculate_metrics(conf_matrix_linear)
cat("Linear Regression Metrics:\n")
## Linear Regression Metrics:
print(metrics_linear)
## Accuracy TPR FPR
## 0.744767049 0.007978723 0.004524887
metrics_logistic <- calculate_metrics(conf_matrix_logistic)
cat("\nLogistic Regression Metrics:\n")
##
## Logistic Regression Metrics:
print(metrics_logistic)
## Accuracy TPR FPR
## 0.74881837 0.07446809 0.02171946
ANSWER TO QUESTION 4e HERE:
cutoffs <- c(0.4, 0.6)
calculate_metrics_cutoff <- function(predictions, actual, cutoff) {
predictions_binary <- ifelse(predictions >= cutoff, 1, 0)
conf_matrix <- table(Actual = actual, Predicted = predictions_binary)
metrics <- calculate_metrics(conf_matrix)
return(c(Cutoff = cutoff, metrics))
}
metrics_linear_04 <- calculate_metrics_cutoff(valid_linear_predictions, valid_data$high_booking_rate, 0.4)
metrics_linear_06 <- calculate_metrics_cutoff(valid_linear_predictions, valid_data$high_booking_rate, 0.6)
cat("Linear Regression Metrics (Cutoff = 0.4):\n")
## Linear Regression Metrics (Cutoff = 0.4):
print(metrics_linear_04)
## Cutoff Accuracy TPR FPR
## 0.40000000 0.74544227 0.21010638 0.07239819
cat("\nLinear Regression Metrics (Cutoff = 0.6):\n")
##
## Linear Regression Metrics (Cutoff = 0.6):
print(metrics_linear_06)
## Cutoff Accuracy TPR FPR
## 0.6000000000 0.7461174882 0.0026595745 0.0009049774
metrics_logistic_04 <- calculate_metrics_cutoff(valid_logistic_predictions, valid_data$high_booking_rate, 0.4)
metrics_logistic_06 <- calculate_metrics_cutoff(valid_logistic_predictions, valid_data$high_booking_rate, 0.6)
cat("\nLogistic Regression Metrics (Cutoff = 0.4):\n")
##
## Logistic Regression Metrics (Cutoff = 0.4):
print(metrics_logistic_04)
## Cutoff Accuracy TPR FPR
## 0.4000000 0.7359892 0.2659574 0.1040724
cat("\nLogistic Regression Metrics (Cutoff = 0.6):\n")
##
## Logistic Regression Metrics (Cutoff = 0.6):
print(metrics_logistic_06)
## Cutoff Accuracy TPR FPR
## 0.6000000000 0.7461174882 0.0026595745 0.0009049774
ANSWER TO QUESTION 4f HERE: .5 cutoff might yield the best results as it provides more true positives when compared to other cuoffs where they have more probability for false positives. So it might be better to choose .5 cutoff.