Read the dataset

airbnb <- read_csv(“C:\Users\dell\Downloads\dc_airbnb_hw1.csv”)

Your task is to develop models to predict the target variable “high_booking_rate”, which labels whether a listing is popular (i.e. spends most of the time booked) or not.

Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.

RUBRIC: To receive a passing score on this assignment, you must do the following:

Turn in a well-formatted compiled HTML document using R markdown.
Provide clear answers to the questions and the correct R commands as necessary, in the appropriate places. You may answer up to three sub-questions incorrectly (i.e. incorrect R command; missing answer to question) and still receive a P on this assignment (for example, 1(a) counts as one sub-question).
The entire document must be clear, concise, and readable.

Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.

airbnb <- read_csv("dc_airbnb_hw1.csv")  #read the dataset in R

## Rows: 4936 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, bed_type, cancellation_policy, cleaning_fee, price, property_...
## dbl (5): accommodates, bedrooms, beds, host_total_listings_count, high_booki...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(airbnb)                       #variables used in dataset

##  [1] "name"                      "accommodates"             
##  [3] "bed_type"                  "bedrooms"                 
##  [5] "beds"                      "cancellation_policy"      
##  [7] "cleaning_fee"              "host_total_listings_count"
##  [9] "price"                     "property_type"            
## [11] "room_type"                 "high_booking_rate"

0: Example answer

What is the mean of the accommodates variable?

ANSWER: The mean number of people that can be accommodated in a listing in this dataset is 3.522893.

age_mean <- airbnb %>%
  summarise(mean_accommodates = mean(accommodates))

1: EDA and Data Cleaning

#a) Clean up and process some of the variables by doing the following (I recommend you do these in the order in which they are listed):

#- For cancellation_policy, group “strict” and “super_strict_30” into “strict” #- Convert cleaning_fee and price into numbers #- Replace NAs in cleaning_fee and price with 0 #- Replace NAs in other numerical variables with their mean

#Hint: there are lots of ways to deal with cleaning_fee and price. I like the parse_number function from the readr library.

#ANSWER TO QUESTION 1a HERE:

#PUT QUESTION 1a CODE HERE
airbnb <- airbnb %>%
  mutate(cancellation_policy = if_else(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy)) %>%
  mutate(across(c(cleaning_fee, price), ~parse_number(.x) %>% replace_na(0))) %>%
  mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm = TRUE))))
view(airbnb)

Create five new variables:

price_per_person is the nightly price per accommodates
has_cleaning_fee is YES if there is a cleaning fee, and NO otherwise
bed_category is “bed” if the bed_type is Real Bed and “other” otherwise
property_category has the following values:
- apartment if property_type is Apartment, Serviced apartment, Loft.
- hotel if property_type is Bed & Breakfast, Boutique hotel, Hostel.
- condo if property_type is Townhouse, Condominium.
- house if property_type is Bungalow, House.
- other, otherwise
make sure to convert property_category to a factor!
ppp_ind is 1 if the price_per_person is greater than the median for the property_category, and 0 otherwise

Hint: there are lots of ways to do the property_type variable, but case_when might be a useful function that’s part of the tidyverse.

ANSWER TO QUESTION 1b HERE:

airbnb <- airbnb %>%
  mutate(
    price_per_person = price / accommodates,
    has_cleaning_fee = if_else(cleaning_fee > 0, "YES", "NO"),
    bed_category = if_else(bed_type == "Real Bed", "bed", "other"),
    property_category = case_when(
      property_type %in% c("Apartment", "Serviced apartment", "Loft") ~ "apartment",
      property_type %in% c("Bed & Breakfast", "Boutique hotel", "Hostel") ~ "hotel",
      property_type %in% c("Townhouse", "Condominium") ~ "condo",
      property_type %in% c("Bungalow", "House") ~ "house",
      TRUE ~ "other"
    )
  ) %>%
  group_by(property_category) %>%
  mutate(ppp_ind = as.integer(price_per_person > median(price_per_person, na.rm = TRUE))) %>%
  ungroup() %>%
  mutate(property_category = as.factor(property_category))

summary(airbnb$price_per_person)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.6875   29.0000   40.0000   47.3192   55.0000 2575.0000

table(airbnb$has_cleaning_fee)

## 
##   NO  YES 
## 1208 3728

table(airbnb$bed_category)

## 
##   bed other 
##  4786   150

table(airbnb$property_category)

## 
## apartment     condo     hotel     house     other 
##      2685       558        62      1561        70

summary(airbnb$ppp_ind)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4899  1.0000  1.0000

Convert the remaining character variables to factors

bed_type
cancellation_policy
room_type

airbnb <- airbnb %>%
  mutate(
    bed_type = as.factor(bed_type),
    cancellation_policy = as.factor(cancellation_policy),
    room_type = as.factor(room_type)
  )

Construct and report boxplots of price_per_person and log(price_per_person), broken up by values of high_booking_rate. Does it appear that there is a relationship between price_per_person and high_booking_rate? If so, what is the relationship?

ANSWER TO QUESTION 1c HERE:Cheaper prices per person seem to lead to more bookings. Places with higher rates of bookings usually charge less per person, and their prices don’t change much. In short, good deals might get you more bookings.

library(ggplot2)

airbnb <- airbnb %>%
  mutate(high_booking_rate = as.factor(high_booking_rate))

# Bplot: price_per_person by high_booking_rate
ggplot(airbnb, aes(x = high_booking_rate, y = price_per_person)) +
  geom_boxplot() +
  labs(title = "Boxplot of Price Per Person by High Booking Rate",
       x = "High Booking Rate",
       y = "Price Per Person")

# Bplot: log(price_per_person) by high_booking_rate
ggplot(airbnb, aes(x = high_booking_rate, y = log(price_per_person))) +
  geom_boxplot() +
  labs(title = "Boxplot of Log(Price Per Person) by High Booking Rate",
       x = "High Booking Rate",
       y = "Log(Price Per Person)")

Construct a two-way table of high_booking_rate by property_category. Does it appear that any property categories are particularly bad or good in terms of high booking rates?

(hint: prop.table might be especially helpful here)

ANSWER TO QUESTION 1e HERE:Condos have a higher high booking rate (around 31%). Apartments and “other” categories show moderate high booking rates (around 25% and 29% respectively). Houses are slightly better with a high booking rate of approximately 23%. Hotels have the lowest high booking rate at about 19%.

install.packages("dplyr")

## Warning: package 'dplyr' is in use and will not be installed

two_way_table <- table(airbnb$property_category, airbnb$high_booking_rate)
prop.table(two_way_table, margin = 1)

##            
##                     0         1
##   apartment 0.7497207 0.2502793
##   condo     0.6935484 0.3064516
##   hotel     0.8064516 0.1935484
##   house     0.7674568 0.2325432
##   other     0.7142857 0.2857143

2: Linear Regression

Train a linear regression to predict high_booking_rate using the variables listed below. Report the R^2.

cancellation_policy
cleaning_fee
price_per_person
ppp_ind
has_cleaning_fee
accommodates
bed_category
bedrooms
beds
host_total_listings_count
property_category

ANSWER TO QUESTION 2a HERE:Multiple R-squared: 0.07898, Adjusted R-squared: 0.07617 0.079 indicates that the model doesn’t fit the data very well, as it is less than 8% of the variance in high_booking_rate.

airbnb$high_booking_rate_numeric <- as.numeric(airbnb$high_booking_rate) - 1

model <- lm(high_booking_rate_numeric ~ cancellation_policy + cleaning_fee + price_per_person +
            ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
            beds + host_total_listings_count + property_category, data = airbnb)
model_summary <- summary(model)
print(model_summary)

## 
## Call:
## lm(formula = high_booking_rate_numeric ~ cancellation_policy + 
##     cleaning_fee + price_per_person + ppp_ind + has_cleaning_fee + 
##     accommodates + bed_category + bedrooms + beds + host_total_listings_count + 
##     property_category, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0026 -0.2862 -0.1748  0.4317  0.9956 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.909e-01  2.112e-02   9.039  < 2e-16 ***
## cancellation_policymoderate  8.553e-02  1.581e-02   5.410 6.59e-08 ***
## cancellation_policystrict    6.877e-02  1.594e-02   4.316 1.62e-05 ***
## cleaning_fee                -1.229e-03  1.899e-04  -6.470 1.08e-10 ***
## price_per_person            -4.609e-04  1.404e-04  -3.283 0.001034 ** 
## ppp_ind                     -9.495e-02  1.387e-02  -6.847 8.47e-12 ***
## has_cleaning_feeYES          1.458e-01  1.732e-02   8.417  < 2e-16 ***
## accommodates                 1.228e-02  5.994e-03   2.049 0.040545 *  
## bed_categoryother           -3.760e-02  3.488e-02  -1.078 0.281049    
## bedrooms                    -8.041e-02  1.190e-02  -6.758 1.56e-11 ***
## beds                         4.345e-02  8.964e-03   4.848 1.29e-06 ***
## host_total_listings_count   -2.230e-04  7.144e-05  -3.122 0.001809 ** 
## property_categorycondo       6.576e-02  1.981e-02   3.320 0.000907 ***
## property_categoryhotel      -1.661e-02  5.405e-02  -0.307 0.758573    
## property_categoryhouse       4.977e-03  1.433e-02   0.347 0.728283    
## property_categoryother      -3.811e-03  5.109e-02  -0.075 0.940545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4167 on 4920 degrees of freedom
## Multiple R-squared:  0.07898,    Adjusted R-squared:  0.07617 
## F-statistic: 28.13 on 15 and 4920 DF,  p-value: < 2.2e-16

airbnb$high_booking_rate <- as.factor(airbnb$high_booking_rate)


logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
                      ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
                      beds + host_total_listings_count + property_category, 
                      family = binomial(), data = airbnb)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model)

## 
## Call:
## glm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee + 
##     price_per_person + ppp_ind + has_cleaning_fee + accommodates + 
##     bed_category + bedrooms + beds + host_total_listings_count + 
##     property_category, family = binomial(), data = airbnb)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.978980   0.160907  -6.084 1.17e-09 ***
## cancellation_policymoderate  0.510296   0.093378   5.465 4.63e-08 ***
## cancellation_policystrict    0.444980   0.095816   4.644 3.42e-06 ***
## cleaning_fee                -0.006897   0.001299  -5.308 1.11e-07 ***
## price_per_person            -0.019105   0.002891  -6.608 3.91e-11 ***
## ppp_ind                     -0.053136   0.108611  -0.489   0.6247    
## has_cleaning_feeYES          0.841786   0.107402   7.838 4.59e-15 ***
## accommodates                 0.040726   0.035167   1.158   0.2468    
## bed_categoryother           -0.264513   0.208151  -1.271   0.2038    
## bedrooms                    -0.399557   0.069183  -5.775 7.68e-09 ***
## beds                         0.223325   0.050891   4.388 1.14e-05 ***
## host_total_listings_count   -0.007824   0.003194  -2.449   0.0143 *  
## property_categorycondo       0.349404   0.109380   3.194   0.0014 ** 
## property_categoryhotel       0.100804   0.352798   0.286   0.7751    
## property_categoryhouse      -0.052321   0.083447  -0.627   0.5307    
## property_categoryother      -0.140461   0.291836  -0.481   0.6303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5560.1  on 4935  degrees of freedom
## Residual deviance: 5082.6  on 4920  degrees of freedom
## AIC: 5114.6
## 
## Number of Fisher Scoring iterations: 8

What is the predicted value of high_booking_rate for a listing with the following characteristics, and how would you interpret this prediction?

New Listing:

cancellation_policy = super_strict_30
cleaning_fee = $30
price = $200
accommodates = 4
bed_type = Real Bed
bedrooms = 3
beds = 4
host_total_listings_count = 1
property_type = townhouse

Hint: you will have to do the same processing on this new listing as you applied to the main dataset, but feel free to do it by hand rather than using the same code as before.

ANSWER TO QUESTION 2b HERE:probability of 0.33 indicates thathis newly listed condo has approx a 33% probability of experiencing a high booking rate. Thus moderate chance of being popular or frequently booked, although it’s not guaranteed.

new_listing <- data.frame(#new listing data
  cancellation_policy = "strict",
  cleaning_fee = 30,
  price_per_person = 200 / 4,
  ppp_ind = as.integer((200 / 4) > median(airbnb$price_per_person[airbnb$property_category == "condo"], na.rm = TRUE)),
  has_cleaning_fee = "YES",
  accommodates = 4,
  bed_category = "bed",
  bedrooms = 3,
  beds = 4,
  host_total_listings_count = 1,
  property_category = "condo"
)
predicted_prob <- predict(logistic_model, newdata = new_listing, type = "response")
predicted_prob

##         1 
## 0.3299597

Do you have any reservations about this predicted high_booking_rate? That is, would you feel sufficiently comfortable with this prediction in order to take some business action based on it? Why or why not?

ANSWER TO QUESTION 2c HERE:0.33 probability indicates ambiguity towards its booking rate. Since the probability is less han 0.5 that the model lacks strong confidence in its prediction, thus seek a higher level of certainty before making a decision based on this prediction.Factors like market trends, competition, and guest preferences should also be considered.Thus prediction should be considered but should not be the sole basis for action.

3: Logistic Regression

Train a Logistic Regression model using the same variables as in 2a. Report the AIC of your model.

Hint: don’t worry if you get a warning here…

ANSWER TO QUESTION 3a HERE:AIC: 5114.554

airbnb$high_booking_rate <- as.factor(airbnb$high_booking_rate)


logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
                      ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
                      beds + host_total_listings_count + property_category, 
                      family = binomial(), data = airbnb)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model)

## 
## Call:
## glm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee + 
##     price_per_person + ppp_ind + has_cleaning_fee + accommodates + 
##     bed_category + bedrooms + beds + host_total_listings_count + 
##     property_category, family = binomial(), data = airbnb)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.978980   0.160907  -6.084 1.17e-09 ***
## cancellation_policymoderate  0.510296   0.093378   5.465 4.63e-08 ***
## cancellation_policystrict    0.444980   0.095816   4.644 3.42e-06 ***
## cleaning_fee                -0.006897   0.001299  -5.308 1.11e-07 ***
## price_per_person            -0.019105   0.002891  -6.608 3.91e-11 ***
## ppp_ind                     -0.053136   0.108611  -0.489   0.6247    
## has_cleaning_feeYES          0.841786   0.107402   7.838 4.59e-15 ***
## accommodates                 0.040726   0.035167   1.158   0.2468    
## bed_categoryother           -0.264513   0.208151  -1.271   0.2038    
## bedrooms                    -0.399557   0.069183  -5.775 7.68e-09 ***
## beds                         0.223325   0.050891   4.388 1.14e-05 ***
## host_total_listings_count   -0.007824   0.003194  -2.449   0.0143 *  
## property_categorycondo       0.349404   0.109380   3.194   0.0014 ** 
## property_categoryhotel       0.100804   0.352798   0.286   0.7751    
## property_categoryhouse      -0.052321   0.083447  -0.627   0.5307    
## property_categoryother      -0.140461   0.291836  -0.481   0.6303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5560.1  on 4935  degrees of freedom
## Residual deviance: 5082.6  on 4920  degrees of freedom
## AIC: 5114.6
## 
## Number of Fisher Scoring iterations: 8

What is the coefficient for price_per_person? Provide a precise (numerical) interpretation of the coefficient.

ANSWER TO QUESTION 3b HERE: -0.019105. To interpret this coefficient in terms of odds, we can exponentiate it: 0.981

coefficient of -0.019105 for price per person indicates that as the price per person increases by one unit (e.g., one dollar), the odds of having a high booking rate decrease by about 1.9%. This suggests that more expensive listings per person are slightly less likely to be highly booked.

What is the coefficient for property_category = condo? Provide a precise (numerical) interpretation of this coefficient.

ANSWER TO QUESTION 3c HERE: 0.349404. in terms of odds: 1.418 = 41.8% higher odds of a high booking rate compared to apartments, when all other factors are held constant. Condos are thus more likely to be highly booked.

Compute the predicted probability that the same listing as in #2b will have high_booking_rate = 1. Hint: you should use the predict function, but you need to specify type = “response” when predicting probabilities from logistic regression (otherwise, it will predict the value of logit). For example: predict(mymodel, newdata = mydata, type = “response”).

ANSWER TO QUESTION 3d HERE:0.3299597

predicted_probability <- predict(logistic_model, newdata = new_listing, type = "response")
predicted_probability

##         1 
## 0.3299597

If you were to pick one model to use for the purposes of explaining the relationship between the features and the target variable, which would it be, and why?

ANSWER TO QUESTION 3e HERE: Logistic regression is ideal for binary outcomes like high_booking_rate, which can be either 0 or 1. It gives the probability of an outcome, making it easier to understand in a classification context. model’s coefficients show the impact of each feature on the odds of a high booking rate and can handle non-linear relationships between features and the outcome, which is often more realistic.

4: Classification and Evaluation

Set the random seed to 12345. Split the data into 70% training and 30% validation sets, retrain the linear and logistic regression models using the training data only, and report the resulting R^2 and AIC, respectively.

ANSWER TO QUESTION 4a HERE:r.squared[1] 0.08106124 AIC(logistic_model_train)[1] 3585.261

set.seed(12345)
split_index <- sample(nrow(airbnb), size = 0.7 * nrow(airbnb))
train_data <- airbnb[split_index, ]
validation_data <- airbnb[-split_index, ]

train_data$high_booking_rate_numeric <- as.numeric(train_data$high_booking_rate) - 1

linear_model_train <- lm(high_booking_rate_numeric ~ cancellation_policy + cleaning_fee + price_per_person + ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
beds + host_total_listings_count + property_category, data = train_data)
summary(linear_model_train)$r.squared

## [1] 0.08106124

train_data$high_booking_rate <- as.factor(train_data$high_booking_rate)
validation_data$high_booking_rate <- as.factor(validation_data$high_booking_rate)

logistic_model_train <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
                            ppp_ind + has_cleaning_fee + accommodates + bed_category + bedrooms +
                            beds + host_total_listings_count + property_category, 
                            family = binomial(), data = train_data)

AIC(logistic_model_train)

## [1] 3585.261

Compute the RMSE in the training and validation sets for the linear model (do not do the classifications, just use the predicted score). Which RMSE is higher (training or validation), and does this make sense? Why or why not?

ANSWER TO QUESTION 4b HERE:0.4250136 the validation RMSE is higher than the training RMSE due to overfitting

The RMSE for the validation set is slightly higher than for the training set. This is normal because models often perform a bit worse on new data compared to the data they learned from, representing a good signthat the model isn’t just memorizing the training data.

train_predictions <- predict(linear_model_train, newdata = train_data)
train_rmse <- sqrt(mean((train_data$high_booking_rate_numeric - train_predictions)^2))
validation_predictions <- predict(linear_model_train, newdata = validation_data)
validation_rmse <- sqrt(mean((validation_data$high_booking_rate_numeric - validation_predictions)^2))
cat("Training RMSE:", train_rmse, "\n")

## Training RMSE: 0.4148109

cat("Validation RMSE:", validation_rmse, "\n")

## Validation RMSE: 0.4250136

For each model (both the linear and logistic regression models learned in the training data), display the confusion matrix resulting from using a cutoff of 0.5 to do the classifications in the validation data set.

ANSWER TO QUESTION 4c HERE:

validation_predictions_linear <- predict(linear_model_train, newdata = validation_data)

predicted_classes_linear <- ifelse(validation_predictions_linear > 0.5, 1, 0)

conf_matrix_linear <- table(Predicted = predicted_classes_linear, Actual = validation_data$high_booking_rate_numeric)
cat("Confusion Matrix for Linear Regression Model:\n")

## Confusion Matrix for Linear Regression Model:

print(conf_matrix_linear)

##          Actual
## Predicted    0    1
##         0 1100  373
##         1    5    3

validation_predictions_logistic <- predict(logistic_model_train, newdata = validation_data, type = "response")

predicted_classes_logistic <- ifelse(validation_predictions_logistic > 0.5, 1, 0)

conf_matrix_logistic <- table(Predicted = predicted_classes_logistic, Actual = validation_data$high_booking_rate_numeric)
cat("Confusion Matrix for Logistic Regression Model:\n")

## Confusion Matrix for Logistic Regression Model:

print(conf_matrix_logistic)

##          Actual
## Predicted    0    1
##         0 1081  348
##         1   24   28

Report the accuracy, TPR, and FPR using the two confusion matrices in 4c. Which model is the most accurate?

ANSWER TO QUESTION 4d HERE:According to matrices, the logistic regression model is more accurate . The logistic model has an accuracy of 0.7488 compared to the linear model’s accuracy of 0.7448. This difference indicates that the logistic model is slightly better at predicting the high booking rate for the validation data set thus more accurate.

linear_predictions_binary <- ifelse(validation_predictions >= 0.5, 1, 0)

linear_confusion_matrix <- table(Actual = validation_data$high_booking_rate_numeric, Predicted = linear_predictions_binary)
print(linear_confusion_matrix)

##       Predicted
## Actual    0    1
##      0 1100    5
##      1  373    3

linear_accuracy <- sum(diag(linear_confusion_matrix)) / sum(linear_confusion_matrix)
linear_tpr <- linear_confusion_matrix[2, 2] / sum(linear_confusion_matrix[2, ])
linear_fpr <- linear_confusion_matrix[1, 2] / sum(linear_confusion_matrix[1, ])

cat("Linear Model Accuracy:", linear_accuracy, "\n")

## Linear Model Accuracy: 0.744767

cat("Linear Model TPR:", linear_tpr, "\n")

## Linear Model TPR: 0.007978723

cat("Linear Model FPR:", linear_fpr, "\n")

## Linear Model FPR: 0.004524887

logistic_predictions_prob <- predict(logistic_model_train, newdata = validation_data, type = "response")

logistic_predictions_binary <- ifelse(logistic_predictions_prob >= 0.5, 1, 0)

logistic_confusion_matrix <- table(Actual = validation_data$high_booking_rate_numeric, Predicted = logistic_predictions_binary)
print(logistic_confusion_matrix)

##       Predicted
## Actual    0    1
##      0 1081   24
##      1  348   28

logistic_accuracy <- sum(diag(logistic_confusion_matrix)) / sum(logistic_confusion_matrix)
logistic_tpr <- logistic_confusion_matrix[2, 2] / sum(logistic_confusion_matrix[2, ])
logistic_fpr <- logistic_confusion_matrix[1, 2] / sum(logistic_confusion_matrix[1, ])

cat("Logistic Model Accuracy:", logistic_accuracy, "\n")

## Logistic Model Accuracy: 0.7488184

cat("Logistic Model TPR:", logistic_tpr, "\n")

## Logistic Model TPR: 0.07446809

cat("Logistic Model FPR:", logistic_fpr, "\n")

## Logistic Model FPR: 0.02171946

#extra
# Function to generate a confusion matrix and calculate metrics
evaluate_model <- function(predictions, actual) {
  predicted_classes <- ifelse(predictions > 0.5, 1, 0)
  conf_matrix <- table(Predicted = predicted_classes, Actual = actual)
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  tpr <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
  fpr <- conf_matrix[1, 2] / sum(conf_matrix[1, ])
  
  list(conf_matrix = conf_matrix, accuracy = accuracy, tpr = tpr, fpr = fpr)
}

linear_results <- evaluate_model(validation_predictions_linear, validation_data$high_booking_rate_numeric)
cat("Linear Model:\n")

## Linear Model:

print(linear_results$conf_matrix)

##          Actual
## Predicted    0    1
##         0 1100  373
##         1    5    3

cat("Accuracy:", linear_results$accuracy, "TPR:", linear_results$tpr, "FPR:", linear_results$fpr, "\n")

## Accuracy: 0.744767 TPR: 0.375 FPR: 0.2532247

logistic_results <- evaluate_model(validation_predictions_logistic, validation_data$high_booking_rate_numeric)
cat("Logistic Model:\n")

## Logistic Model:

print(logistic_results$conf_matrix)

##          Actual
## Predicted    0    1
##         0 1081  348
##         1   24   28

cat("Accuracy:", logistic_results$accuracy, "TPR:", logistic_results$tpr, "FPR:", logistic_results$fpr, "\n")

## Accuracy: 0.7488184 TPR: 0.5384615 FPR: 0.2435269

most_accurate <- ifelse(linear_results$accuracy > logistic_results$accuracy, "Linear", "Logistic")
cat("The most accurate model is the", most_accurate, "model.\n")

## The most accurate model is the Logistic model.

For the more accurate model, compute the accuracy, TPR, and FPR using cutoffs of .4 and .6 in the validation data. Which cutoff has the highest accuracy, highest TPR, and highest FPR?

ANSWER TO QUESTION 4e HERE:

compute_metrics <- function(predictions, actual, cutoff) {
  predictions_binary <- ifelse(predictions >= cutoff, 1, 0)
  confusion_matrix <- table(Actual = actual, Predicted = predictions_binary)
  
  accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
  tpr <- confusion_matrix[2, 2] / sum(confusion_matrix[2, ])
  fpr <- confusion_matrix[1, 2] / sum(confusion_matrix[1, ])
  
  return(list(Accuracy = accuracy, TPR = tpr, FPR = fpr))
}

metrics_cutoff_0_4 <- compute_metrics(logistic_predictions_prob, validation_data$high_booking_rate_numeric, 0.4)
cat("Metrics for Cutoff 0.4:\n")

## Metrics for Cutoff 0.4:

print(metrics_cutoff_0_4)

## $Accuracy
## [1] 0.7359892
## 
## $TPR
## [1] 0.2659574
## 
## $FPR
## [1] 0.1040724

metrics_cutoff_0_6 <- compute_metrics(logistic_predictions_prob, validation_data$high_booking_rate_numeric, 0.6)
cat("\nMetrics for Cutoff 0.6:\n")

## 
## Metrics for Cutoff 0.6:

print(metrics_cutoff_0_6)

## $Accuracy
## [1] 0.7461175
## 
## $TPR
## [1] 0.002659574
## 
## $FPR
## [1] 0.0009049774

In your opinion, which cutoff of the three (.4, .5, .6) yields the best results for this application? Explain your reasoning.

ANSWER TO QUESTION 4f HERE: Application, pre say a cutoff of 0.5 yields balanced results, cutoff of 0.4 has a higher chance of correctly identifying high booking rates (TPR) but also a higher chance of false alarms (FPR). cutoff of 0.6 is more cautious, with lesser false alarms but also fewer correct identifications. Thusbest cutoff depends on what’s more important for your application: catching more true positives (go for 0.4) or having fewer false positives (go for 0.6). The middle ground supports a more reliable classification without excessively favoring either sensitivity or specificity.

Assignment 1 File