Homework Assignment:

Exploring Linear Regression and KNN in R, but with your mid-term project data

Objective and Dataset:

This homework helps you prepare for the mid-term project by redoing homework 3 on the mid-term project data you chose. If you have multiple data tables in your dataset, just choose the main table, usually the largest one. Choose one variable to predict; I will call this variable the target from now on.

Instructions:

  1. Data Exploration

 Load the mid-term project dataset.

library(readr)
## Warning: package 'readr' was built under R version 4.3.1
hotel <- read_csv("C:/Users/cynth/OneDrive/Desktop/hotel.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): hotel, arrival_date_month, meal, country, market_segment, distribu...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numbe...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hotel)
## spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ hotel                         : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
##  $ is_canceled                   : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
##  $ lead_time                     : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
##  $ arrival_date_year             : num [1:119390] 2015 2015 2015 2015 2015 ...
##  $ arrival_date_month            : chr [1:119390] "July" "July" "July" "July" ...
##  $ arrival_date_week_number      : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
##  $ arrival_date_day_of_month     : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
##  $ stays_in_weekend_nights       : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stays_in_week_nights          : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
##  $ adults                        : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
##  $ children                      : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ babies                        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ meal                          : chr [1:119390] "BB" "BB" "BB" "BB" ...
##  $ country                       : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
##  $ market_segment                : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ distribution_channel          : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
##  $ is_repeated_guest             : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_cancellations        : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ reserved_room_type            : chr [1:119390] "C" "C" "A" "A" ...
##  $ assigned_room_type            : chr [1:119390] "C" "C" "C" "A" ...
##  $ booking_changes               : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
##  $ deposit_type                  : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
##  $ agent                         : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
##  $ company                       : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
##  $ days_in_waiting_list          : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ customer_type                 : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
##  $ adr                           : num [1:119390] 0 0 75 75 98 ...
##  $ required_car_parking_spaces   : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
##  $ total_of_special_requests     : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
##  $ reservation_status            : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
##  $ reservation_status_date       : chr [1:119390] "7/1/2015" "7/1/2015" "7/2/2015" "7/2/2015" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   hotel = col_character(),
##   ..   is_canceled = col_double(),
##   ..   lead_time = col_double(),
##   ..   arrival_date_year = col_double(),
##   ..   arrival_date_month = col_character(),
##   ..   arrival_date_week_number = col_double(),
##   ..   arrival_date_day_of_month = col_double(),
##   ..   stays_in_weekend_nights = col_double(),
##   ..   stays_in_week_nights = col_double(),
##   ..   adults = col_double(),
##   ..   children = col_double(),
##   ..   babies = col_double(),
##   ..   meal = col_character(),
##   ..   country = col_character(),
##   ..   market_segment = col_character(),
##   ..   distribution_channel = col_character(),
##   ..   is_repeated_guest = col_double(),
##   ..   previous_cancellations = col_double(),
##   ..   previous_bookings_not_canceled = col_double(),
##   ..   reserved_room_type = col_character(),
##   ..   assigned_room_type = col_character(),
##   ..   booking_changes = col_double(),
##   ..   deposit_type = col_character(),
##   ..   agent = col_character(),
##   ..   company = col_character(),
##   ..   days_in_waiting_list = col_double(),
##   ..   customer_type = col_character(),
##   ..   adr = col_double(),
##   ..   required_car_parking_spaces = col_double(),
##   ..   total_of_special_requests = col_double(),
##   ..   reservation_status = col_character(),
##   ..   reservation_status_date = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(hotel)
##     hotel            is_canceled       lead_time   arrival_date_year
##  Length:119390      Min.   :0.0000   Min.   :  0   Min.   :2015     
##  Class :character   1st Qu.:0.0000   1st Qu.: 18   1st Qu.:2016     
##  Mode  :character   Median :0.0000   Median : 69   Median :2016     
##                     Mean   :0.3704   Mean   :104   Mean   :2016     
##                     3rd Qu.:1.0000   3rd Qu.:160   3rd Qu.:2017     
##                     Max.   :1.0000   Max.   :737   Max.   :2017     
##                                                                     
##  arrival_date_month arrival_date_week_number arrival_date_day_of_month
##  Length:119390      Min.   : 1.00            Min.   : 1.0             
##  Class :character   1st Qu.:16.00            1st Qu.: 8.0             
##  Mode  :character   Median :28.00            Median :16.0             
##                     Mean   :27.17            Mean   :15.8             
##                     3rd Qu.:38.00            3rd Qu.:23.0             
##                     Max.   :53.00            Max.   :31.0             
##                                                                       
##  stays_in_weekend_nights stays_in_week_nights     adults      
##  Min.   : 0.0000         Min.   : 0.0         Min.   : 0.000  
##  1st Qu.: 0.0000         1st Qu.: 1.0         1st Qu.: 2.000  
##  Median : 1.0000         Median : 2.0         Median : 2.000  
##  Mean   : 0.9276         Mean   : 2.5         Mean   : 1.856  
##  3rd Qu.: 2.0000         3rd Qu.: 3.0         3rd Qu.: 2.000  
##  Max.   :19.0000         Max.   :50.0         Max.   :55.000  
##                                                               
##     children           babies              meal             country         
##  Min.   : 0.0000   Min.   : 0.000000   Length:119390      Length:119390     
##  1st Qu.: 0.0000   1st Qu.: 0.000000   Class :character   Class :character  
##  Median : 0.0000   Median : 0.000000   Mode  :character   Mode  :character  
##  Mean   : 0.1039   Mean   : 0.007949                                        
##  3rd Qu.: 0.0000   3rd Qu.: 0.000000                                        
##  Max.   :10.0000   Max.   :10.000000                                        
##  NA's   :4                                                                  
##  market_segment     distribution_channel is_repeated_guest
##  Length:119390      Length:119390        Min.   :0.00000  
##  Class :character   Class :character     1st Qu.:0.00000  
##  Mode  :character   Mode  :character     Median :0.00000  
##                                          Mean   :0.03191  
##                                          3rd Qu.:0.00000  
##                                          Max.   :1.00000  
##                                                           
##  previous_cancellations previous_bookings_not_canceled reserved_room_type
##  Min.   : 0.00000       Min.   : 0.0000                Length:119390     
##  1st Qu.: 0.00000       1st Qu.: 0.0000                Class :character  
##  Median : 0.00000       Median : 0.0000                Mode  :character  
##  Mean   : 0.08712       Mean   : 0.1371                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.0000                                  
##  Max.   :26.00000       Max.   :72.0000                                  
##                                                                          
##  assigned_room_type booking_changes   deposit_type          agent          
##  Length:119390      Min.   : 0.0000   Length:119390      Length:119390     
##  Class :character   1st Qu.: 0.0000   Class :character   Class :character  
##  Mode  :character   Median : 0.0000   Mode  :character   Mode  :character  
##                     Mean   : 0.2211                                        
##                     3rd Qu.: 0.0000                                        
##                     Max.   :21.0000                                        
##                                                                            
##    company          days_in_waiting_list customer_type           adr         
##  Length:119390      Min.   :  0.000      Length:119390      Min.   :  -6.38  
##  Class :character   1st Qu.:  0.000      Class :character   1st Qu.:  69.29  
##  Mode  :character   Median :  0.000      Mode  :character   Median :  94.58  
##                     Mean   :  2.321                         Mean   : 101.83  
##                     3rd Qu.:  0.000                         3rd Qu.: 126.00  
##                     Max.   :391.000                         Max.   :5400.00  
##                                                                              
##  required_car_parking_spaces total_of_special_requests reservation_status
##  Min.   :0.00000             Min.   :0.0000            Length:119390     
##  1st Qu.:0.00000             1st Qu.:0.0000            Class :character  
##  Median :0.00000             Median :0.0000            Mode  :character  
##  Mean   :0.06252             Mean   :0.5714                              
##  3rd Qu.:0.00000             3rd Qu.:1.0000                              
##  Max.   :8.00000             Max.   :5.0000                              
##                                                                          
##  reservation_status_date
##  Length:119390          
##  Class :character       
##  Mode  :character       
##                         
##                         
##                         
## 

 Use summary statistics and visualizations to understand the dataset.  Identify any trends, correlations, or patterns.

## Print the first 5 rows of hotel
head(hotel, 5)
## # A tibble: 5 × 32
##   hotel        is_canceled lead_time arrival_date_year arrival_date_month
##   <chr>              <dbl>     <dbl>             <dbl> <chr>             
## 1 Resort Hotel           0       342              2015 July              
## 2 Resort Hotel           0       737              2015 July              
## 3 Resort Hotel           0         7              2015 July              
## 4 Resort Hotel           0        13              2015 July              
## 5 Resort Hotel           0        14              2015 July              
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## #   arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## #   stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, …
colnames(hotel) 
##  [1] "hotel"                          "is_canceled"                   
##  [3] "lead_time"                      "arrival_date_year"             
##  [5] "arrival_date_month"             "arrival_date_week_number"      
##  [7] "arrival_date_day_of_month"      "stays_in_weekend_nights"       
##  [9] "stays_in_week_nights"           "adults"                        
## [11] "children"                       "babies"                        
## [13] "meal"                           "country"                       
## [15] "market_segment"                 "distribution_channel"          
## [17] "is_repeated_guest"              "previous_cancellations"        
## [19] "previous_bookings_not_canceled" "reserved_room_type"            
## [21] "assigned_room_type"             "booking_changes"               
## [23] "deposit_type"                   "agent"                         
## [25] "company"                        "days_in_waiting_list"          
## [27] "customer_type"                  "adr"                           
## [29] "required_car_parking_spaces"    "total_of_special_requests"     
## [31] "reservation_status"             "reservation_status_date"
###rownames(hotel)
  2. Data Preprocessing

 If necessary, handle missing or inconsistent data.  Split the data into 70% training and 30% testing.

## I'm loading the package for data visualization:
library(ggplot2)

#ggplot(numeric_data_melted, aes(y = value)) +
 # geom_boxplot() +
  #facet_wrap(~variable, scales = 'free', ncol = 2) +
  ##labs(x = '', y = '') +
  #theme_minimal()  
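
The preprocessing step asks for missing-value handling and a 70/30 split. Here is a minimal sketch of one way to do both in base R. It runs on a small synthetic stand-in called `toy` (hypothetical values, not the hotel data); the names `train_data` and `test_data` are chosen to match the ones used in the commented-out KNN code, and dropping incomplete rows is only one possible choice (imputation is another).

```r
# Synthetic stand-in for the hotel table (hypothetical values, for illustration only)
set.seed(42)
toy <- data.frame(
  adr       = round(runif(200, 40, 250), 2),
  lead_time = sample(0:365, 200, replace = TRUE),
  children  = sample(c(0, 1, 2, NA), 200, replace = TRUE,
                     prob = c(0.80, 0.10, 0.07, 0.03))
)

# Handle missing data: keep only rows without any NA
toy <- toy[complete.cases(toy), ]

# 70% training / 30% testing split using a random sample of row indices
train_idx  <- sample(seq_len(nrow(toy)), size = floor(0.7 * nrow(toy)))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Row counts of the two parts
c(train = nrow(train_data), test = nrow(test_data))
```

Setting the seed before sampling keeps the split reproducible, so training and testing MSE can be compared across runs.
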
  3. Linear Regression using lm

 Create a linear regression model to predict target based on other variables in the dataset, using the training data set.

clean_hotel <- hotel 
summary(hotel$adr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -6.38   69.29   94.58  101.83  126.00 5400.00
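# Upper cap for adr: mean + 1.5 * IQR (the usual Tukey fence starts from the 3rd quartile, 126.00)
bench_adr <- 101.83 + 1.5 * IQR(hotel$adr)
clean_hotel$adr[hotel$adr > bench_adr] <- bench_adr
# Filter clean_hotel (not hotel) so that the capped adr values are kept
clean_hotel <- clean_hotel[clean_hotel$arrival_date_year != 0, ]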
summary(clean_hotel$arrival_date_month)
##    Length     Class      Mode 
##    119390 character character
summary(hotel$adults)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   2.000   1.856   2.000  55.000
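# Note: the mean of adults is 1.856 (1.865 appears to be a typo) and IQR(adults) is 0,
# so every booking with 2 or more adults is capped at 1.865
bench_adults <- 1.865 + 1.5 * IQR(clean_hotel$adults)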
clean_hotel$adults[clean_hotel$adults > bench_adults] <- bench_adults
clean_hotel <- clean_hotel[clean_hotel$adults != 0, ]
summary(clean_hotel$adults)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.865   1.865   1.698   1.865   1.865
summary(hotel$stays_in_week_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     1.0     2.0     2.5     3.0    50.0
bench_siwn <- 2.5 + 1.5 * IQR(clean_hotel$stays_in_week_nights)
clean_hotel$stays_in_week_nights[clean_hotel$stays_in_week_nights > bench_siwn] <- bench_siwn
clean_hotel <- clean_hotel[clean_hotel$stays_in_week_nights != 0, ]
summary(clean_hotel$stays_in_week_nights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.547   3.000   5.500
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.1
## corrplot 0.92 loaded
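# Use only complete rows so that missing values (summary(hotel) shows 4 NAs in children)
# do not turn a variable's whole row and column into NA
corr_matrix <- cor(clean_hotel[sapply(clean_hotel, is.numeric)], use = "complete.obs")
# Any remaining non-finite entries (e.g. from a constant column) are set to 0 for plotting
corr_matrix[!is.finite(corr_matrix)] <- 0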
corrplot(corr_matrix, method = "circle", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
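
The capping steps above repeat the same pattern for each column. As a hedged sketch, the helper below (a hypothetical function named `winsorize_iqr`, not part of any package) caps values at the conventional Tukey fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, which is a swapped-in rule and not the mean-based bound used above. It is shown on synthetic data, followed by one way to rank correlations with a target column.

```r
# Cap values outside the Tukey fences (Q1 - mult*IQR, Q3 + mult*IQR) at the fences
winsorize_iqr <- function(x, mult = 1.5) {
  q     <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- mult * (q[2] - q[1])
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}

# Synthetic right-skewed prices with one extreme value (hypothetical data)
set.seed(1)
price  <- c(rlnorm(500, meanlog = 4.5, sdlog = 0.4), 5400)
capped <- winsorize_iqr(price)

range(price)   # before capping
range(capped)  # after capping

# Correlation of every numeric column with the target, using complete rows only
toy <- data.frame(
  target = capped,
  nights = rpois(501, 3),
  guests = c(NA, rpois(500, 2))
)
target_cor <- sort(cor(toy, use = "complete.obs")[, "target"], decreasing = TRUE)
target_cor
```

Because the helper returns a vector of the same length, it can be applied column by column without dropping rows.
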

 Interpret the coefficients.  Evaluate the model using training MSE (Mean Square Error) and testing MSE.
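
Training and testing MSE for a linear model can be computed as in the sketch below. It is only an illustration on synthetic data: the data frame `toy`, its coefficients, and the helper `mse` are assumptions made up for the example, with column names borrowed from the hotel table for readability.

```r
# Synthetic data with a known linear relationship (hypothetical values)
set.seed(123)
n   <- 300
toy <- data.frame(
  lead_time = runif(n, 0, 365),
  adults    = sample(1:3, n, replace = TRUE)
)
toy$adr <- 60 + 0.05 * toy$lead_time + 20 * toy$adults + rnorm(n, sd = 15)

# 70/30 split
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Fit on the training rows only
fit <- lm(adr ~ lead_time + adults, data = train_data)

# Each slope is the expected change in adr for a one-unit increase
# in that predictor, holding the other predictor fixed
coef(fit)

# Mean squared error helper
mse <- function(actual, predicted) mean((actual - predicted)^2)

train_mse <- mse(train_data$adr, predict(fit, newdata = train_data))
test_mse  <- mse(test_data$adr,  predict(fit, newdata = test_data))
c(train = train_mse, test = test_mse)
```

A testing MSE far above the training MSE would suggest the model does not generalize well to unseen rows.
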

  4. KNN using kknn

 Use k-Nearest Neighbors to also predict target.  Try different values of k.  Evaluate the model using training MSE (Mean Square Error) and testing MSE.

#library(kknn)
#k <- 5 # Number of neighbors
#knn_fit_adr <- kknn(adr ~ ., hotel, hotel, k=k)
#knn_fit <- kknn(adr ~ ., hotel, hotel, k=k, scale = FALSE)
#mse_df <- data.frame(k = integer(), MSE = numeric())

#for (k in c(1, 3, 5, 7, 9, 11)) {
  
# Fit the k-NN model using kknn function
#knn_model <- kknn(adr ~ ., train = train_data, test = test_data, k = k)
  
# Calculate the testing MSE against the test-set target
#mse <-  mean((knn_model$fitted.values - test_data$adr)^2)
#mse_df <- rbind(mse_df, data.frame(k = k, MSE = mse))}
# Show the MSE data frame
#print(mse_df)
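
The loop above is commented out, so here is a runnable sketch of the same idea on synthetic data. It assumes the `kknn` package is installed; the data frame `toy`, the columns `x1`, `x2`, `y`, and the choice of `kernel = "rectangular"` (plain unweighted neighbors instead of the package default) are assumptions for illustration.

```r
library(kknn)

# Synthetic non-linear data (hypothetical values)
set.seed(2024)
n   <- 400
toy <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
toy$y <- sin(toy$x1) * 10 + toy$x2 + rnorm(n, sd = 2)

# 70/30 split
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

mse_df <- data.frame(k = integer(), train_mse = numeric(), test_mse = numeric())

for (k in c(1, 3, 5, 7, 9, 11)) {
  # Predictions for the training rows and for the held-out rows
  fit_train <- kknn(y ~ ., train = train_data, test = train_data,
                    k = k, kernel = "rectangular")
  fit_test  <- kknn(y ~ ., train = train_data, test = test_data,
                    k = k, kernel = "rectangular")

  mse_df <- rbind(mse_df, data.frame(
    k         = k,
    train_mse = mean((fit_train$fitted.values - train_data$y)^2),
    test_mse  = mean((fit_test$fitted.values  - test_data$y)^2)
  ))
}

# One row per k, with training and testing MSE side by side
print(mse_df)
```

Keeping both MSE columns in one table makes it easy to see where training error keeps falling while testing error starts to rise.
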


#hotel_lm <- lm(adr ~ ., data = hotel)
#summary(hotel_lm)
#par(mfrow =c(2, 2))
#plot(hotel_lm)

#knn_std_mse <- mean((knn_fit_adr$fitted.values - hotel$adr)^2)
#print(paste("Mean Squared Error for stdKNN:", round(knn_std_mse, 2)))

#knn_mse <- mean((knn_fit$fitted.values - hotel$adr)^2)
#print(paste("Mean Squared Error for KNN:", round(knn_mse, 2)))
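
The commented lines above compare a standardized KNN fit with an unscaled one. The sketch below shows that comparison on synthetic data, again assuming the `kknn` package is installed. The columns `signal` and `noise`, the helper `knn_test_mse`, and the fixed `k = 5` are hypothetical choices made so that the effect of the `scale` argument is easy to see.

```r
library(kknn)

# One informative predictor on a small scale, one irrelevant predictor on a large scale
set.seed(99)
n   <- 400
toy <- data.frame(
  signal = runif(n, 0, 1),     # drives the target
  noise  = runif(n, 0, 1000)   # unrelated to the target
)
toy$y <- 50 * toy$signal + rnorm(n, sd = 2)

# 70/30 split
train_idx  <- sample(seq_len(n), size = floor(0.7 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Testing MSE of a 5-nearest-neighbor fit, with or without standardization
knn_test_mse <- function(do_scale) {
  fit <- kknn(y ~ ., train = train_data, test = test_data, k = 5,
              kernel = "rectangular", scale = do_scale)
  mean((fit$fitted.values - test_data$y)^2)
}

mse_scaled   <- knn_test_mse(TRUE)
mse_unscaled <- knn_test_mse(FALSE)
c(scaled = mse_scaled, unscaled = mse_unscaled)
```

Without scaling, the large-range column dominates the distance, so the neighbors are chosen almost entirely by a variable that carries no information about the target.
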
  5. Comparison and Insights

 Compare the performance of the linear regression model and the KNN model. Focus more on testing MSE.  Discuss the advantages and disadvantages of both methods on your data.  Explain which model you would recommend and why.

Make sure you answer these questions clearly with figure/table as evidence to support your arguments:

  1. What variables are most strongly correlated with target? In the correlation plot, adults and children have a high correlation with adr, while required_car_parking_spaces and total_of_special_requests are both negatively correlated with is_canceled.

  2. How does the value of k in KNN affect the model’s performance (in terms of training MSE and testing MSE)? When k = 1, the model is highly influenced by noise and outliers: training MSE is close to zero, but testing MSE tends to be high because the model overfits. Larger values of k lead to overly smoothed predictions that do not necessarily capture the patterns in the data, so the model oversimplifies and both training and testing MSE can rise.

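  3. What assumptions are being made when we use linear regression? Are they met in this dataset? Just describe what you observe from the diagnostic plots. Linear regression assumes that the relationship between the predictors and the target is linear, that the residuals are independent and approximately normally distributed, and that the residual variance is equal across the regression line (homoscedasticity).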

  4. Try adding interaction terms to your linear regression model. Try to find at least one interaction term that has a statistically significant coefficient. Report the interaction term, check how it influences the model’s performance in terms of R^2, and explain how you interpret the new model.

  5. Is there any outlier in the dataset? If yes, apply truncation or winsorization techniques to handle outliers. Compare the performance of the models before and after applying these techniques. What differences do you observe? Yes, there were many outliers in the dataset: for example, adr reaches 5400 while its third quartile is 126, and adults reaches 55. Winsorization was applied to handle them, capping adr, adults, and stays_in_week_nights at an upper bound.

  6. How could feature scaling (standardization) affect the KNN model? Scaling has a large impact on KNN because KNN relies on distance metrics to find the closest neighbors used for prediction, hence the name. Without scaling, variables with a large range (such as lead_time, which runs from 0 to 737) dominate the distance. In addition, normalization or robust scaling can reduce the range of the variables and the influence of outliers in the dataset, much as truncation does.

  7. What insights can you derive from comparing the linear regression and KNN models? Which model would you recommend the most and why? After comparing the linear regression and KNN models, I see how flexible KNN can be with small k values and how sensitive the linear regression model is to outliers.
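
For question 4, one way to test an interaction term is sketched below. The data are synthetic and built so that an interaction really exists; the columns `nights`, `resort`, and `price` and all coefficients are hypothetical, so the numbers say nothing about the hotel data.

```r
# Synthetic data where the effect of nights depends on hotel type (hypothetical values)
set.seed(7)
n   <- 500
toy <- data.frame(
  nights = sample(1:7, n, replace = TRUE),
  resort = sample(c(0, 1), n, replace = TRUE)
)
toy$price <- 80 + 5 * toy$nights + 10 * toy$resort +
  8 * toy$nights * toy$resort + rnorm(n, sd = 10)

# Main-effects model versus model with the nights:resort interaction
fit_main <- lm(price ~ nights + resort, data = toy)
fit_int  <- lm(price ~ nights * resort, data = toy)

# p-value of the interaction coefficient and R^2 of both models
p_int <- summary(fit_int)$coefficients["nights:resort", "Pr(>|t|)"]
r2    <- c(main        = summary(fit_main)$r.squared,
           interaction = summary(fit_int)$r.squared)
p_int
r2

# F-test: does adding the interaction improve the fit?
anova(fit_main, fit_int)
```

The interaction coefficient is read as the additional change in the slope of `nights` when `resort` is 1 instead of 0. Adjusted R^2 is a useful companion here, since plain R^2 never decreases when a term is added.
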