Assignment 1 File

Due February 18, 2023

Problem Overview

The goal of this homework is hands-on practice with data exploration and cleaning, classification, and model selection. You will:

Conduct basic exploratory analysis of a data set
Develop linear and logistic regression models
Interpret your models
Partition your dataset and evaluate your models in terms of classification performance

The Assignment

The data in the accompanying file “dc_airbnb_file.csv” (posted on Canvas) contains data about 4936 Airbnb.com listings from the Washington, DC area. This is a small subset of the data that you will eventually use for the class project. The variables in this dataset should be self-explanatory but we have also provided a data dictionary.

Your task is to develop models to predict the target variable “high_booking_rate”, which labels whether a listing is popular (i.e. spends most of the time booked) or not.

Please answer the questions below clearly and concisely, providing tables or plots where applicable. Turn in a well-formatted compiled HTML document using R Markdown, containing clear answers to the questions and R code in the appropriate places.

RUBRIC: To receive a passing score on this assignment, you must do the following:

Turn in a well-formatted compiled HTML document using R markdown.
Provide clear answers to the questions and the correct R commands as necessary, in the appropriate places. You may answer up to three sub-questions incorrectly (i.e. incorrect R command; missing answer to question) and still receive a P on this assignment (for example, 1(a) counts as one sub-question).
The entire document must be clear, concise, and readable.

Note that this assignment is somewhat open-ended and there are many ways to answer these questions. I don’t require that we have exactly the same answers in order for you to receive full credit.

airbnb <- read_csv("dc_airbnb_hw1.csv")  #read the dataset in R

## Rows: 4936 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, bed_type, cancellation_policy, cleaning_fee, price, property_...
## dbl (5): accommodates, bedrooms, beds, host_total_listings_count, high_booki...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

names(airbnb)                       #variables used in dataset

##  [1] "name"                      "accommodates"             
##  [3] "bed_type"                  "bedrooms"                 
##  [5] "beds"                      "cancellation_policy"      
##  [7] "cleaning_fee"              "host_total_listings_count"
##  [9] "price"                     "property_type"            
## [11] "room_type"                 "high_booking_rate"

0: Example answer

What is the mean of the accommodates variable?

ANSWER: The mean number of people that can be accommodated in a listing in this dataset is 3.522893.

age_mean <- airbnb %>%
  summarise(mean_accommodates = mean(accommodates))

1: EDA and Data Cleaning

Clean up and process some of the variables by doing the following (I recommend you do these in the order in which they are listed):

For cancellation_policy, group “strict” and “super_strict_30” into “strict”
Convert cleaning_fee and price into numbers
Replace NAs in cleaning_fee and price with 0
Replace NAs in other numerical variables with their mean

Hint: there are lots of ways to deal with cleaning_fee and price. I like the parse_number function from the readr library.

ANSWER TO QUESTION 1a HERE:

#PUT QUESTION 1a CODE HERE
airbnb <- airbnb %>%
  mutate(cancellation_policy = ifelse(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy))
airbnb <- airbnb %>%
  mutate(
    cleaning_fee = parse_number(as.character(cleaning_fee)),
    price = parse_number(as.character(price))
  )

airbnb$cleaning_fee[is.na(airbnb$cleaning_fee)] <- 0
airbnb$price[is.na(airbnb$price)] <- 0

airbnb <- airbnb %>%
  mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm=TRUE))))# For cancellation_policy, group "strict" and "super_strict_30" into "strict"
airbnb <- airbnb %>%
  mutate(cancellation_policy = ifelse(cancellation_policy %in% c("strict", "super_strict_30"), "strict", cancellation_policy))

airbnb <- airbnb %>%
  mutate(
    cleaning_fee = parse_number(as.character(cleaning_fee)),
    price = parse_number(as.character(price))
  )

airbnb$cleaning_fee[is.na(airbnb$cleaning_fee)] <- 0
airbnb$price[is.na(airbnb$price)] <- 0

airbnb <- airbnb %>%
  mutate(across(c("accommodates", "bedrooms", "beds", "host_total_listings_count"), ~replace_na(.x, mean(.x, na.rm=TRUE))))
summary(airbnb)

##      name            accommodates      bed_type            bedrooms    
##  Length:4936        Min.   : 1.000   Length:4936        Min.   :0.000  
##  Class :character   1st Qu.: 2.000   Class :character   1st Qu.:1.000  
##  Mode  :character   Median : 3.000   Mode  :character   Median :1.000  
##                     Mean   : 3.523                      Mean   :1.297  
##                     3rd Qu.: 4.000                      3rd Qu.:2.000  
##                     Max.   :16.000                      Max.   :8.000  
##       beds        cancellation_policy  cleaning_fee   
##  Min.   : 1.000   Length:4936         Min.   :  0.00  
##  1st Qu.: 1.000   Class :character    1st Qu.:  5.00  
##  Median : 1.000   Mode  :character    Median : 35.00  
##  Mean   : 1.821                       Mean   : 48.47  
##  3rd Qu.: 2.000                       3rd Qu.: 75.00  
##  Max.   :16.000                       Max.   :500.00  
##  host_total_listings_count     price        property_type     
##  Min.   :  0.00            Min.   :  10.0   Length:4936       
##  1st Qu.:  1.00            1st Qu.:  80.0   Class :character  
##  Median :  1.00            Median : 115.0   Mode  :character  
##  Mean   : 15.57            Mean   : 154.3                     
##  3rd Qu.:  3.00            3rd Qu.: 175.0                     
##  Max.   :879.00            Max.   :5150.0                     
##   room_type         high_booking_rate
##  Length:4936        Min.   :0.0000   
##  Class :character   1st Qu.:0.0000   
##  Mode  :character   Median :0.0000   
##                     Mean   :0.2508   
##                     3rd Qu.:1.0000   
##                     Max.   :1.0000

Create five new variables:

price_per_person is the nightly price per accommodates
has_cleaning_fee is YES if there is a cleaning fee, and NO otherwise
bed_category is “bed” if the bed_type is Real Bed and “other” otherwise
property_category has the following values:
- apartment if property_type is Apartment, Serviced apartment, Loft.
- hotel if property_type is Bed & Breakfast, Boutique hotel, Hostel.
- condo if property_type is Townhouse, Condominium.
- house if property_type is Bungalow, House.
- other, otherwise
make sure to convert property_category to a factor!
ppp_ind is 1 if the price_per_person is greater than the median for the property_category, and 0 otherwise

Hint: there are lots of ways to do the property_type variable, but case_when might be a useful function that’s part of the tidyverse.

ANSWER TO QUESTION 1b HERE:

library(dplyr)
airbnb <- airbnb %>%
  mutate(price_per_person = price / accommodates)
airbnb <- airbnb %>%
  mutate(has_cleaning_fee = if_else(cleaning_fee > 0, "YES", "NO"))
airbnb <- airbnb %>%
  mutate(bed_category = if_else(bed_type == "Real Bed", "bed", "other"))
airbnb <- airbnb %>%
  mutate(property_category = case_when(
    property_type %in% c("Apartment", "Serviced apartment", "Loft") ~ "apartment",
    property_type %in% c("Bed & Breakfast", "Boutique hotel", "Hostel") ~ "hotel",
    property_type %in% c("Townhouse", "Condominium") ~ "condo",
    property_type %in% c("Bungalow", "House") ~ "house",
    TRUE ~ "other"
  ))
airbnb$property_category <- as.factor(airbnb$property_category)

airbnb <- airbnb %>%
  group_by(property_category) %>%
  mutate(ppp_ind = if_else(price_per_person > median(price_per_person, na.rm = TRUE), 1, 0))
summary(airbnb)

##      name            accommodates      bed_type            bedrooms    
##  Length:4936        Min.   : 1.000   Length:4936        Min.   :0.000  
##  Class :character   1st Qu.: 2.000   Class :character   1st Qu.:1.000  
##  Mode  :character   Median : 3.000   Mode  :character   Median :1.000  
##                     Mean   : 3.523                      Mean   :1.297  
##                     3rd Qu.: 4.000                      3rd Qu.:2.000  
##                     Max.   :16.000                      Max.   :8.000  
##       beds        cancellation_policy  cleaning_fee   
##  Min.   : 1.000   Length:4936         Min.   :  0.00  
##  1st Qu.: 1.000   Class :character    1st Qu.:  5.00  
##  Median : 1.000   Mode  :character    Median : 35.00  
##  Mean   : 1.821                       Mean   : 48.47  
##  3rd Qu.: 2.000                       3rd Qu.: 75.00  
##  Max.   :16.000                       Max.   :500.00  
##  host_total_listings_count     price        property_type     
##  Min.   :  0.00            Min.   :  10.0   Length:4936       
##  1st Qu.:  1.00            1st Qu.:  80.0   Class :character  
##  Median :  1.00            Median : 115.0   Mode  :character  
##  Mean   : 15.57            Mean   : 154.3                     
##  3rd Qu.:  3.00            3rd Qu.: 175.0                     
##  Max.   :879.00            Max.   :5150.0                     
##   room_type         high_booking_rate price_per_person    has_cleaning_fee  
##  Length:4936        Min.   :0.0000    Min.   :   0.6875   Length:4936       
##  Class :character   1st Qu.:0.0000    1st Qu.:  29.0000   Class :character  
##  Mode  :character   Median :0.0000    Median :  40.0000   Mode  :character  
##                     Mean   :0.2508    Mean   :  47.3192                     
##                     3rd Qu.:1.0000    3rd Qu.:  55.0000                     
##                     Max.   :1.0000    Max.   :2575.0000                     
##  bed_category       property_category    ppp_ind      
##  Length:4936        apartment:2685    Min.   :0.0000  
##  Class :character   condo    : 558    1st Qu.:0.0000  
##  Mode  :character   hotel    :  62    Median :0.0000  
##                     house    :1561    Mean   :0.4899  
##                     other    :  70    3rd Qu.:1.0000  
##                                       Max.   :1.0000

Convert the remaining character variables to factors

bed_type
cancellation_policy
room_type

airbnb$bed_type <- as.factor(airbnb$bed_type)
airbnb$cancellation_policy <- as.factor(airbnb$cancellation_policy)
airbnb$room_type <- as.factor(airbnb$room_type)
summary(airbnb)

##      name            accommodates             bed_type       bedrooms    
##  Length:4936        Min.   : 1.000   Airbed       :  31   Min.   :0.000  
##  Class :character   1st Qu.: 2.000   Couch        :  16   1st Qu.:1.000  
##  Mode  :character   Median : 3.000   Futon        :  38   Median :1.000  
##                     Mean   : 3.523   Pull-out Sofa:  65   Mean   :1.297  
##                     3rd Qu.: 4.000   Real Bed     :4786   3rd Qu.:2.000  
##                     Max.   :16.000                        Max.   :8.000  
##       beds        cancellation_policy  cleaning_fee   
##  Min.   : 1.000   flexible:1351       Min.   :  0.00  
##  1st Qu.: 1.000   moderate:1602       1st Qu.:  5.00  
##  Median : 1.000   strict  :1983       Median : 35.00  
##  Mean   : 1.821                       Mean   : 48.47  
##  3rd Qu.: 2.000                       3rd Qu.: 75.00  
##  Max.   :16.000                       Max.   :500.00  
##  host_total_listings_count     price        property_type     
##  Min.   :  0.00            Min.   :  10.0   Length:4936       
##  1st Qu.:  1.00            1st Qu.:  80.0   Class :character  
##  Median :  1.00            Median : 115.0   Mode  :character  
##  Mean   : 15.57            Mean   : 154.3                     
##  3rd Qu.:  3.00            3rd Qu.: 175.0                     
##  Max.   :879.00            Max.   :5150.0                     
##            room_type    high_booking_rate price_per_person   
##  Entire home/apt:3326   Min.   :0.0000    Min.   :   0.6875  
##  Private room   :1472   1st Qu.:0.0000    1st Qu.:  29.0000  
##  Shared room    : 138   Median :0.0000    Median :  40.0000  
##                         Mean   :0.2508    Mean   :  47.3192  
##                         3rd Qu.:1.0000    3rd Qu.:  55.0000  
##                         Max.   :1.0000    Max.   :2575.0000  
##  has_cleaning_fee   bed_category       property_category    ppp_ind      
##  Length:4936        Length:4936        apartment:2685    Min.   :0.0000  
##  Class :character   Class :character   condo    : 558    1st Qu.:0.0000  
##  Mode  :character   Mode  :character   hotel    :  62    Median :0.0000  
##                                        house    :1561    Mean   :0.4899  
##                                        other    :  70    3rd Qu.:1.0000  
##                                                          Max.   :1.0000

Construct and report boxplots of price_per_person and log(price_per_person), broken up by values of high_booking_rate. Does it appear that there is a relationship between price_per_person and high_booking_rate? If so, what is the relationship?

ANSWER TO QUESTION 1c HERE: There is not much difference in the prices. The bookings with cheaper prices attracts more people and place with higher booking has more people with cheaper prices.

library(ggplot2)
airbnb <- airbnb %>%
  mutate(log_price_per_person = log(price_per_person))
ggplot(airbnb, aes(x = as.factor(high_booking_rate), y = price_per_person)) +
  geom_boxplot() +
  labs(title = "Boxplot of price_per_person by high_booking_rate",
       x = "High Booking Rate",
       y = "Price per Person")

ggplot(airbnb, aes(x = as.factor(high_booking_rate), y = log_price_per_person)) +
  geom_boxplot() +
  labs(title = "Boxplot of log(price_per_person) by high_booking_rate",
       x = "High Booking Rate",
       y = "Log(Price_per_Person)")

Construct a two-way table of high_booking_rate by property_category. Does it appear that any property categories are particularly bad or good in terms of high booking rates?

(hint: prop.table might be especially helpful here)

ANSWER TO QUESTION 1e HERE:

booking_rate_table <- table(airbnb$high_booking_rate, airbnb$property_category)
prop_booking_rate_table <- prop.table(booking_rate_table, margin = 1)
print(prop_booking_rate_table)

##    
##       apartment       condo       hotel       house       other
##   0 0.544348296 0.104651163 0.013520822 0.323958897 0.013520822
##   1 0.542810985 0.138126010 0.009693053 0.293214863 0.016155089

2: Linear Regression

Train a linear regression to predict high_booking_rate using the variables listed below. Report the R^2.

cancellation_policy
cleaning_fee
price_per_person
ppp_ind
has_cleaning_fee
accommodates
bed_category
bedrooms
beds
host_total_listings_count
property_category

ANSWER TO QUESTION 2a HERE:

airbnb$high_booking_rate <- as.numeric(as.character(airbnb$high_booking_rate))
model <- lm(high_booking_rate ~ cancellation_policy + cleaning_fee + price_per_person +
              ppp_ind + has_cleaning_fee + accommodates + bed_category +
              bedrooms + beds + host_total_listings_count + property_category, data = airbnb)
summary(model)

## 
## Call:
## lm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee + 
##     price_per_person + ppp_ind + has_cleaning_fee + accommodates + 
##     bed_category + bedrooms + beds + host_total_listings_count + 
##     property_category, data = airbnb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0026 -0.2862 -0.1748  0.4317  0.9956 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.909e-01  2.112e-02   9.039  < 2e-16 ***
## cancellation_policymoderate  8.553e-02  1.581e-02   5.410 6.59e-08 ***
## cancellation_policystrict    6.877e-02  1.594e-02   4.316 1.62e-05 ***
## cleaning_fee                -1.229e-03  1.899e-04  -6.470 1.08e-10 ***
## price_per_person            -4.609e-04  1.404e-04  -3.283 0.001034 ** 
## ppp_ind                     -9.495e-02  1.387e-02  -6.847 8.47e-12 ***
## has_cleaning_feeYES          1.458e-01  1.732e-02   8.417  < 2e-16 ***
## accommodates                 1.228e-02  5.994e-03   2.049 0.040545 *  
## bed_categoryother           -3.760e-02  3.488e-02  -1.078 0.281049    
## bedrooms                    -8.041e-02  1.190e-02  -6.758 1.56e-11 ***
## beds                         4.345e-02  8.964e-03   4.848 1.29e-06 ***
## host_total_listings_count   -2.230e-04  7.144e-05  -3.122 0.001809 ** 
## property_categorycondo       6.576e-02  1.981e-02   3.320 0.000907 ***
## property_categoryhotel      -1.661e-02  5.405e-02  -0.307 0.758573    
## property_categoryhouse       4.977e-03  1.433e-02   0.347 0.728283    
## property_categoryother      -3.811e-03  5.109e-02  -0.075 0.940545    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4167 on 4920 degrees of freedom
## Multiple R-squared:  0.07898,    Adjusted R-squared:  0.07617 
## F-statistic: 28.13 on 15 and 4920 DF,  p-value: < 2.2e-16

What is the predicted value of high_booking_rate for a listing with the following characteristics, and how would you interpret this prediction?

New Listing:

cancellation_policy = super_strict_30
cleaning_fee = $30
price = $200
accommodates = 4
bed_type = Real Bed
bedrooms = 3
beds = 4
host_total_listings_count = 1
property_type = townhouse

Hint: you will have to do the same processing on this new listing as you applied to the main dataset, but feel free to do it by hand rather than using the same code as before.

ANSWER TO QUESTION 2b HERE:

library(dplyr)

median_condo <- median(airbnb$price_per_person[airbnb$property_category == "condo"], na.rm = TRUE)
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
                       price_per_person + ppp_ind + has_cleaning_fee + accommodates +
                       bed_category + bedrooms + beds + host_total_listings_count +
                       property_category, data = airbnb, family=binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

predicted_prob <- data.frame(
  cancellation_policy = "strict",
  cleaning_fee = 30,
  price_per_person = 200 / 4,
  accommodates = 4,
  bed_type = "Real Bed",
  bedrooms = 3,
  beds = 4,
  host_total_listings_count = 1,
  property_category = "condo"
) %>%
  mutate(
    ppp_ind = as.integer(price_per_person > median_condo),
    has_cleaning_fee = "YES",
    bed_category = "bed",
    cancellation_policy = as.factor(cancellation_policy),
    bed_category = as.factor(bed_category),
    property_category = as.factor(property_category)
  ) %>%
  {predict(logistic_model, newdata = ., type = "response")}

predicted_prob

##         1 
## 0.3299597

Do you have any reservations about this predicted high_booking_rate? That is, would you feel sufficiently comfortable with this prediction in order to take some business action based on it? Why or why not?

ANSWER TO QUESTION 2c HERE: The predicted value is 0.3299, which is ultimately lower than 0.5. Hence, we cannot trust the predicted value for model analysis. So, we need more values to analyze the model.

3: Logistic Regression

Train a Logistic Regression model using the same variables as in 2a. Report the AIC of your model.

Hint: don’t worry if you get a warning here…

ANSWER TO QUESTION 3a HERE:

airbnb$high_booking_rate <- as.numeric(as.character(airbnb$high_booking_rate))

airbnb$cancellation_policy <- as.factor(airbnb$cancellation_policy)
airbnb$bed_category <- as.factor(airbnb$bed_category)
airbnb$property_category <- as.factor(airbnb$property_category)
airbnb$has_cleaning_fee <- as.factor(airbnb$has_cleaning_fee)

logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
                       price_per_person + ppp_ind + has_cleaning_fee + accommodates +
                       bed_category + bedrooms + beds + host_total_listings_count +
                       property_category, data = airbnb, family = binomial)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(logistic_model)

## 
## Call:
## glm(formula = high_booking_rate ~ cancellation_policy + cleaning_fee + 
##     price_per_person + ppp_ind + has_cleaning_fee + accommodates + 
##     bed_category + bedrooms + beds + host_total_listings_count + 
##     property_category, family = binomial, data = airbnb)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.978980   0.160907  -6.084 1.17e-09 ***
## cancellation_policymoderate  0.510296   0.093378   5.465 4.63e-08 ***
## cancellation_policystrict    0.444980   0.095816   4.644 3.42e-06 ***
## cleaning_fee                -0.006897   0.001299  -5.308 1.11e-07 ***
## price_per_person            -0.019105   0.002891  -6.608 3.91e-11 ***
## ppp_ind                     -0.053136   0.108611  -0.489   0.6247    
## has_cleaning_feeYES          0.841786   0.107402   7.838 4.59e-15 ***
## accommodates                 0.040726   0.035167   1.158   0.2468    
## bed_categoryother           -0.264513   0.208151  -1.271   0.2038    
## bedrooms                    -0.399557   0.069183  -5.775 7.68e-09 ***
## beds                         0.223325   0.050891   4.388 1.14e-05 ***
## host_total_listings_count   -0.007824   0.003194  -2.449   0.0143 *  
## property_categorycondo       0.349404   0.109380   3.194   0.0014 ** 
## property_categoryhotel       0.100804   0.352798   0.286   0.7751    
## property_categoryhouse      -0.052321   0.083447  -0.627   0.5307    
## property_categoryother      -0.140461   0.291836  -0.481   0.6303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5560.1  on 4935  degrees of freedom
## Residual deviance: 5082.6  on 4920  degrees of freedom
## AIC: 5114.6
## 
## Number of Fisher Scoring iterations: 8

What is the coefficient for price_per_person? Provide a precise (numerical) interpretation of the coefficient.

ANSWER TO QUESTION 3b HERE: The coefficient for price_per_person in the logistic regression model is approximately -0.0191. As the price_per_person increases, the model predicts a proportional decrease in having a high booking rate.

What is the coefficient for property_category = condo? Provide a precise (numerical) interpretation of this coefficient.

ANSWER TO QUESTION 3c HERE: The coefficient for property_category = condo in the logistic regression model is approximately 0.349404.The model predicts a proportional increase in the odds of having a high booking rate for listings categorized as condos compared to the reference category.

Compute the predicted probability that the same listing as in #2b will have high_booking_rate = 1. Hint: you should use the predict function, but you need to specify type = “response” when predicting probabilities from logistic regression (otherwise, it will predict the value of logit). For example: predict(mymodel, newdata = mydata, type = “response”).

ANSWER TO QUESTION 3d HERE:

If you were to pick one model to use for the purposes of explaining the relationship between the features and the target variable, which would it be, and why?

ANSWER TO QUESTION 3e HERE: We would choose logistic model as it is more appropriate for explaining the relationship between features and the target. It allows for a better interpretation of the data as its value lies between 0 and 1 and would not exceed 1.

4: Classification and Evaluation

Set the random seed to 12345. Split the data into 70% training and 30% validation sets, retrain the linear and logistic regression models using the training data only, and report the resulting R^2 and AIC, respectively.

ANSWER TO QUESTION 4a HERE:

set.seed(12345)

# Assuming 'airbnb' is your dataset
# Split the data into 70% training and 30% validation sets
n_rows <- nrow(airbnb)
train_indices <- sample(1:n_rows, 0.7 * n_rows)
train_data <- airbnb[train_indices, ]
valid_data <- airbnb[-train_indices, ]

# Retrain linear regression model on training data
linear_model <- lm(high_booking_rate ~ cancellation_policy + cleaning_fee +
                      price_per_person + ppp_ind + has_cleaning_fee + accommodates +
                      bed_category + bedrooms + beds + host_total_listings_count +
                      property_category, data = train_data)

# Retrain logistic regression model on training data
logistic_model <- glm(high_booking_rate ~ cancellation_policy + cleaning_fee +
                       price_per_person + ppp_ind + has_cleaning_fee + accommodates +
                       bed_category + bedrooms + beds + host_total_listings_count +
                       property_category, data = train_data, family = binomial)

# Report R^2 for linear regression model on validation set
linear_r_squared <- summary(linear_model)$r.squared
cat("Linear Regression R^2 on Validation Set:", linear_r_squared, "\n")

## Linear Regression R^2 on Validation Set: 0.08106124

# Report AIC for logistic regression model on validation set
logistic_aic <- AIC(logistic_model)
cat("Logistic Regression AIC on Validation Set:", logistic_aic, "\n")

## Logistic Regression AIC on Validation Set: 3585.261

Compute the RMSE in the training and validation sets for the linear model (do not do the classifications, just use the predicted score). Which RMSE is higher (training or validation), and does this make sense? Why or why not?

ANSWER TO QUESTION 4b HERE: RMSE on the validation set is slightly higher than the RMSE on the training set which suggests that the model’s performance might be slightly worse on the new data set when compared to the data it was trained on.It is common and not much concern unless the difference is more significant.

train_predictions <- predict(linear_model, newdata = train_data)

train_rmse <- sqrt(mean((train_data$high_booking_rate - train_predictions)^2))

valid_predictions <- predict(linear_model, newdata = valid_data)

valid_rmse <- sqrt(mean((valid_data$high_booking_rate - valid_predictions)^2))

cat("RMSE on Training Set:", train_rmse, "\n")

## RMSE on Training Set: 0.4148109

cat("RMSE on Validation Set:", valid_rmse, "\n")

## RMSE on Validation Set: 0.4250136

For each model (both the linear and logistic regression models learned in the training data), display the confusion matrix resulting from using a cutoff of 0.5 to do the classifications in the validation data set.

ANSWER TO QUESTION 4c HERE:

cutoff <- 0.5

valid_linear_predictions <- predict(linear_model, newdata = valid_data)

valid_linear_predictions_binary <- ifelse(valid_linear_predictions >= cutoff, 1, 0)

conf_matrix_linear <- table(Actual = valid_data$high_booking_rate, Predicted = valid_linear_predictions_binary)
conf_matrix_linear

##       Predicted
## Actual    0    1
##      0 1100    5
##      1  373    3

valid_logistic_predictions <- predict(logistic_model, newdata = valid_data, type = "response")

valid_logistic_predictions_binary <- ifelse(valid_logistic_predictions >= cutoff, 1, 0)

conf_matrix_logistic <- table(Actual = valid_data$high_booking_rate, Predicted = valid_logistic_predictions_binary)
conf_matrix_logistic

##       Predicted
## Actual    0    1
##      0 1081   24
##      1  348   28

Report the accuracy, TPR, and FPR using the two confusion matrices in 4c. Which model is the most accurate?

ANSWER TO QUESTION 4d HERE:

calculate_metrics <- function(conf_matrix) {
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  tpr <- conf_matrix[2, 2] / sum(conf_matrix[2, ])
  fpr <- conf_matrix[1, 2] / sum(conf_matrix[1, ])
  
  return(c(Accuracy = accuracy, TPR = tpr, FPR = fpr))
}
metrics_linear <- calculate_metrics(conf_matrix_linear)
cat("Linear Regression Metrics:\n")

## Linear Regression Metrics:

print(metrics_linear)

##    Accuracy         TPR         FPR 
## 0.744767049 0.007978723 0.004524887

metrics_logistic <- calculate_metrics(conf_matrix_logistic)
cat("\nLogistic Regression Metrics:\n")

## 
## Logistic Regression Metrics:

print(metrics_logistic)

##   Accuracy        TPR        FPR 
## 0.74881837 0.07446809 0.02171946

For the more accurate model, compute the accuracy, TPR, and FPR using cutoffs of .4 and .6 in the validation data. Which cutoff has the highest accuracy, highest TPR, and highest FPR?

ANSWER TO QUESTION 4e HERE:

cutoffs <- c(0.4, 0.6)

calculate_metrics_cutoff <- function(predictions, actual, cutoff) {
  predictions_binary <- ifelse(predictions >= cutoff, 1, 0)
  conf_matrix <- table(Actual = actual, Predicted = predictions_binary)
  metrics <- calculate_metrics(conf_matrix)
  return(c(Cutoff = cutoff, metrics))
}

metrics_linear_04 <- calculate_metrics_cutoff(valid_linear_predictions, valid_data$high_booking_rate, 0.4)
metrics_linear_06 <- calculate_metrics_cutoff(valid_linear_predictions, valid_data$high_booking_rate, 0.6)

cat("Linear Regression Metrics (Cutoff = 0.4):\n")

## Linear Regression Metrics (Cutoff = 0.4):

print(metrics_linear_04)

##     Cutoff   Accuracy        TPR        FPR 
## 0.40000000 0.74544227 0.21010638 0.07239819

cat("\nLinear Regression Metrics (Cutoff = 0.6):\n")

## 
## Linear Regression Metrics (Cutoff = 0.6):

print(metrics_linear_06)

##       Cutoff     Accuracy          TPR          FPR 
## 0.6000000000 0.7461174882 0.0026595745 0.0009049774

metrics_logistic_04 <- calculate_metrics_cutoff(valid_logistic_predictions, valid_data$high_booking_rate, 0.4)
metrics_logistic_06 <- calculate_metrics_cutoff(valid_logistic_predictions, valid_data$high_booking_rate, 0.6)

cat("\nLogistic Regression Metrics (Cutoff = 0.4):\n")

## 
## Logistic Regression Metrics (Cutoff = 0.4):

print(metrics_logistic_04)

##    Cutoff  Accuracy       TPR       FPR 
## 0.4000000 0.7359892 0.2659574 0.1040724

cat("\nLogistic Regression Metrics (Cutoff = 0.6):\n")

## 
## Logistic Regression Metrics (Cutoff = 0.6):

print(metrics_logistic_06)

##       Cutoff     Accuracy          TPR          FPR 
## 0.6000000000 0.7461174882 0.0026595745 0.0009049774

In your opinion, which cutoff of the three (.4, .5, .6) yields the best results for this application? Explain your reasoning.

ANSWER TO QUESTION 4f HERE: .5 cutoff might yield the best results as it provides more true positives when compared to other cuoffs where they have more probability for false positives. So it might be better to choose .5 cutoff.