Exploring Linear Regression and KNN in R, but with your mid-term project data
This homework helps you prepare for the mid-term project by
redoing homework 3 on the mid-term project data you chose. If
you have multiple data tables in your dataset, just choose the main
table, usually the largest one. Choose one variable to predict; I will
call this variable the target from now on.
Load the mid-term project dataset.
library(readr)
## Warning: package 'readr' was built under R version 4.3.1
hotel <- read_csv("C:/Users/cynth/OneDrive/Desktop/hotel.csv")
## Rows: 119390 Columns: 32
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): hotel, arrival_date_month, meal, country, market_segment, distribu...
## dbl (18): is_canceled, lead_time, arrival_date_year, arrival_date_week_numbe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(hotel)
## spc_tbl_ [119,390 × 32] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ hotel : chr [1:119390] "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : num [1:119390] 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : num [1:119390] 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : num [1:119390] 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr [1:119390] "July" "July" "July" "July" ...
## $ arrival_date_week_number : num [1:119390] 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : num [1:119390] 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : num [1:119390] 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : num [1:119390] 2 2 1 1 2 2 2 2 2 2 ...
## $ children : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr [1:119390] "BB" "BB" "BB" "BB" ...
## $ country : chr [1:119390] "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr [1:119390] "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr [1:119390] "C" "C" "A" "A" ...
## $ assigned_room_type : chr [1:119390] "C" "C" "C" "A" ...
## $ booking_changes : num [1:119390] 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr [1:119390] "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr [1:119390] "NULL" "NULL" "NULL" "304" ...
## $ company : chr [1:119390] "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr [1:119390] "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num [1:119390] 0 0 75 75 98 ...
## $ required_car_parking_spaces : num [1:119390] 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : num [1:119390] 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr [1:119390] "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr [1:119390] "7/1/2015" "7/1/2015" "7/2/2015" "7/2/2015" ...
## - attr(*, "spec")=
## .. cols(
## .. hotel = col_character(),
## .. is_canceled = col_double(),
## .. lead_time = col_double(),
## .. arrival_date_year = col_double(),
## .. arrival_date_month = col_character(),
## .. arrival_date_week_number = col_double(),
## .. arrival_date_day_of_month = col_double(),
## .. stays_in_weekend_nights = col_double(),
## .. stays_in_week_nights = col_double(),
## .. adults = col_double(),
## .. children = col_double(),
## .. babies = col_double(),
## .. meal = col_character(),
## .. country = col_character(),
## .. market_segment = col_character(),
## .. distribution_channel = col_character(),
## .. is_repeated_guest = col_double(),
## .. previous_cancellations = col_double(),
## .. previous_bookings_not_canceled = col_double(),
## .. reserved_room_type = col_character(),
## .. assigned_room_type = col_character(),
## .. booking_changes = col_double(),
## .. deposit_type = col_character(),
## .. agent = col_character(),
## .. company = col_character(),
## .. days_in_waiting_list = col_double(),
## .. customer_type = col_character(),
## .. adr = col_double(),
## .. required_car_parking_spaces = col_double(),
## .. total_of_special_requests = col_double(),
## .. reservation_status = col_character(),
## .. reservation_status_date = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(hotel)
## hotel is_canceled lead_time arrival_date_year
## Length:119390 Min. :0.0000 Min. : 0 Min. :2015
## Class :character 1st Qu.:0.0000 1st Qu.: 18 1st Qu.:2016
## Mode :character Median :0.0000 Median : 69 Median :2016
## Mean :0.3704 Mean :104 Mean :2016
## 3rd Qu.:1.0000 3rd Qu.:160 3rd Qu.:2017
## Max. :1.0000 Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 Length:119390 Length:119390
## 1st Qu.: 0.0000 1st Qu.: 0.000000 Class :character Class :character
## Median : 0.0000 Median : 0.000000 Mode :character Mode :character
## Mean : 0.1039 Mean : 0.007949
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000
## Max. :10.0000 Max. :10.000000
## NA's :4
## market_segment distribution_channel is_repeated_guest
## Length:119390 Length:119390 Min. :0.00000
## Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Median :0.00000
## Mean :0.03191
## 3rd Qu.:0.00000
## Max. :1.00000
##
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 Length:119390
## 1st Qu.: 0.00000 1st Qu.: 0.0000 Class :character
## Median : 0.00000 Median : 0.0000 Mode :character
## Mean : 0.08712 Mean : 0.1371
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000
## Max. :26.00000 Max. :72.0000
##
## assigned_room_type booking_changes deposit_type agent
## Length:119390 Min. : 0.0000 Length:119390 Length:119390
## Class :character 1st Qu.: 0.0000 Class :character Class :character
## Mode :character Median : 0.0000 Mode :character Mode :character
## Mean : 0.2211
## 3rd Qu.: 0.0000
## Max. :21.0000
##
## company days_in_waiting_list customer_type adr
## Length:119390 Min. : 0.000 Length:119390 Min. : -6.38
## Class :character 1st Qu.: 0.000 Class :character 1st Qu.: 69.29
## Mode :character Median : 0.000 Mode :character Median : 94.58
## Mean : 2.321 Mean : 101.83
## 3rd Qu.: 0.000 3rd Qu.: 126.00
## Max. :391.000 Max. :5400.00
##
## required_car_parking_spaces total_of_special_requests reservation_status
## Min. :0.00000 Min. :0.0000 Length:119390
## 1st Qu.:0.00000 1st Qu.:0.0000 Class :character
## Median :0.00000 Median :0.0000 Mode :character
## Mean :0.06252 Mean :0.5714
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :8.00000 Max. :5.0000
##
## reservation_status_date
## Length:119390
## Class :character
## Mode :character
##
##
##
##
Use summary statistics and visualizations to understand the dataset. Identify any trends, correlations, or patterns.
## print the first 5 rows of hotel
head(hotel, 5)
## # A tibble: 5 × 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## # ℹ 27 more variables: arrival_date_week_number <dbl>,
## # arrival_date_day_of_month <dbl>, stays_in_weekend_nights <dbl>,
## # stays_in_week_nights <dbl>, adults <dbl>, children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, …
colnames(hotel)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
###rownames(hotel)
library(ggplot2)
#ggplot(numeric_data_melted, aes(y = value)) +
# geom_boxplot() +
#facet_wrap(~variable, scales = 'free', ncol = 2) +
##labs(x = '', y = '') +
#theme_minimal()
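The commented-out plot above depends on a long-format table, numeric_data_melted, that is never created in this report. Below is a minimal sketch of one way to build it with base R's stack(). The small `demo` data frame is synthetic stand-in data (an assumption for illustration only); in the report it would be replaced by the numeric columns of hotel.

```r
library(ggplot2)

# Synthetic stand-in for a few numeric hotel columns (assumed values, not real data)
set.seed(1)
demo <- data.frame(
  adr = c(rnorm(98, mean = 100, sd = 30), 900, 5400),
  adults = c(rep(2, 90), rep(1, 9), 55),
  stays_in_week_nights = c(rpois(99, 2), 50)
)

# Keep numeric columns only, then reshape wide -> long: one row per (value, variable)
numeric_data <- demo[sapply(demo, is.numeric)]
numeric_data_melted <- stack(numeric_data)
names(numeric_data_melted) <- c("value", "variable")

# One boxplot per variable, each panel with its own y-axis scale
p <- ggplot(numeric_data_melted, aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free", ncol = 2) +
  labs(x = "", y = "") +
  theme_minimal()
print(p)
```

Free scales matter here because adr, adults, and stays_in_week_nights live on very different ranges; with a shared axis the smaller variables would be flattened.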
clean_hotel <- hotel
summary(hotel$adr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.38 69.29 94.58 101.83 126.00 5400.00
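# Upper cap for adr: mean (101.83) + 1.5 * IQR; the usual Tukey fence starts from Q3 instead
bench_adr <- 101.83 + 1.5 * IQR(hotel$adr)
clean_hotel$adr[hotel$adr > bench_adr] <- bench_adr
# Filter clean_hotel (not hotel) so the capped adr values are not overwritten
clean_hotel <- clean_hotel[clean_hotel$arrival_date_year != 0, ]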
summary(clean_hotel$arrival_date_month)
## Length Class Mode
## 119390 character character
summary(hotel$adults)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 2.000 1.856 2.000 55.000
# Cap starts from 1.865 (the mean printed above is 1.856); IQR(adults) is 0, so the cap is 1.865
bench_adults <- 1.865 + 1.5 * IQR(clean_hotel$adults)
clean_hotel$adults[clean_hotel$adults > bench_adults] <- bench_adults
clean_hotel <- clean_hotel[clean_hotel$adults != 0, ]
summary(clean_hotel$adults)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.865 1.865 1.698 1.865 1.865
summary(hotel$stays_in_week_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 1.0 2.0 2.5 3.0 50.0
bench_siwn <- 2.5 + 1.5 * IQR(clean_hotel$stays_in_week_nights)
clean_hotel$stays_in_week_nights[clean_hotel$stays_in_week_nights > bench_siwn] <- bench_siwn
clean_hotel <- clean_hotel[clean_hotel$stays_in_week_nights != 0, ]
summary(clean_hotel$stays_in_week_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.547 3.000 5.500
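The caps above start from a hard-coded mean and add 1.5 × IQR. A more common convention is the upper Tukey fence, Q3 + 1.5 × IQR. The sketch below shows that variant; `winsorize_upper` is a hypothetical helper name, the worked adr fence uses the rounded quartiles printed by summary(hotel$adr), and the vector `x` is made-up data for illustration.

```r
# Hypothetical helper: cap values above the upper Tukey fence (Q3 + k * IQR)
winsorize_upper <- function(x, k = 1.5) {
  q3 <- quantile(x, 0.75, na.rm = TRUE, names = FALSE)
  fence <- q3 + k * IQR(x, na.rm = TRUE)
  pmin(x, fence)  # values above the fence are pulled down to it; NAs stay NA
}

# Worked example with the adr quartiles from summary(hotel$adr):
# Q1 = 69.29, Q3 = 126.00, so IQR = 56.71 and the fence is 126.00 + 1.5 * 56.71
fence_adr <- 126.00 + 1.5 * (126.00 - 69.29)
print(fence_adr)

# Small synthetic vector (assumed values) to show the effect on one extreme value
x <- c(60, 70, 80, 90, 100, 110, 120, 130, 5400)
print(winsorize_upper(x))
```

Because the fence is anchored at Q3 rather than the mean, it is not itself pulled upward by the extreme values it is meant to cap.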
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.1
## corrplot 0.92 loaded
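# Use pairwise-complete observations so the 4 NAs in children do not turn its
# whole row and column into NA (which the old complete.cases() step then zeroed out)
corr_matrix <- cor(clean_hotel[sapply(clean_hotel, is.numeric)], use = "pairwise.complete.obs")
corr_matrix[!is.finite(corr_matrix)] <- 0  # guard against any remaining NA/NaN
corrplot(corr_matrix, method = "circle", type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)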
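To read the correlation matrix as a table rather than a plot, the target's column can be extracted and sorted by absolute value. The sketch below assumes adr is the target and uses a synthetic `demo` data frame (made-up values, including some NAs in children) in place of clean_hotel.

```r
# Synthetic stand-in data (assumed values); replace `demo` with clean_hotel in the report
set.seed(42)
n <- 200
demo <- data.frame(adults = sample(1:3, n, replace = TRUE),
                   children = sample(c(0, 1, 2, NA), n, replace = TRUE),
                   lead_time = rpois(n, 100))
demo$adr <- 40 * demo$adults +
  25 * ifelse(is.na(demo$children), 0, demo$children) +
  rnorm(n, sd = 10)

# Pairwise-complete correlations tolerate the NAs in children
corr_matrix <- cor(demo[sapply(demo, is.numeric)], use = "pairwise.complete.obs")

# Correlations of every other variable with the target, strongest first
target <- "adr"
target_corr <- corr_matrix[, target]
target_corr <- target_corr[names(target_corr) != target]
target_corr <- target_corr[order(abs(target_corr), decreasing = TRUE)]
print(round(target_corr, 2))
```

Sorting on the absolute value keeps strong negative correlations near the top, where a plain descending sort would push them to the bottom.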
Interpret the coefficients. Evaluate the model using training MSE
(Mean Square Error) and testing MSE.
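The evaluation step described above can be sketched as follows. This is a minimal example under stated assumptions: `demo` is synthetic stand-in data with made-up coefficients, the 80/20 split ratio is an arbitrary choice, and `mse` is a hypothetical helper. With the real data, clean_hotel and the chosen predictors would take the place of `demo`.

```r
# Synthetic stand-in for the hotel data (assumed values); swap in clean_hotel in the report
set.seed(123)
n <- 500
demo <- data.frame(adults = sample(1:3, n, replace = TRUE),
                   children = sample(0:2, n, replace = TRUE),
                   lead_time = rpois(n, 100))
demo$adr <- 30 + 35 * demo$adults + 20 * demo$children + rnorm(n, sd = 15)

# 80/20 train/test split
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- demo[train_idx, ]
test_data <- demo[-train_idx, ]

# Fit on the training rows only, then inspect the coefficient table
hotel_lm <- lm(adr ~ adults + children + lead_time, data = train_data)
print(coef(summary(hotel_lm)))

# Training MSE uses the rows the model saw; testing MSE uses the held-out rows
mse <- function(actual, predicted) mean((actual - predicted)^2)
train_mse <- mse(train_data$adr, predict(hotel_lm, newdata = train_data))
test_mse <- mse(test_data$adr, predict(hotel_lm, newdata = test_data))
print(c(train = train_mse, test = test_mse))
```

Each slope is read as the expected change in adr for a one-unit increase in that predictor with the others held fixed. A testing MSE far above the training MSE would point to overfitting.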
#library(kknn)
#k <- 5 # Number of neighbors
# Standardized KNN (kknn scales the features by default)
#knn_fit_adr <- kknn(adr ~ ., hotel, hotel, k = k)
# Unscaled KNN for comparison
#knn_fit <- kknn(adr ~ ., hotel, hotel, k = k, scale = FALSE)
#mse_df <- data.frame(k = integer(), MSE = numeric())
#for (k in c(1, 3, 5, 7, 9, 11)) {
# Fit the k-NN model using kknn function (train_data and test_data must come from a prior split)
#knn_model <- kknn(adr ~ ., train = train_data, test = test_data, k = k)
# Calculate the testing MSE against the test-set target
#mse <- mean((knn_model$fitted.values - test_data$adr)^2)
#mse_df <- rbind(mse_df, data.frame(k = k, MSE = mse))}
# Show the MSE data frame
#print(mse_df)
#hotel_lm <- lm(adr ~ ., data = hotel)
#summary(hotel_lm)
#par(mfrow = c(2, 2))
#plot(hotel_lm)
#knn_std_mse <- mean((knn_fit_adr$fitted.values - hotel$adr)^2)
#print(paste("Mean Squared Error for stdKNN:", round(knn_std_mse, 2)))
#knn_mse <- mean((knn_fit$fitted.values - hotel$adr)^2)
#print(paste("Mean Squared Error for KNN:", round(knn_mse, 2)))
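The commented-out KNN code above never ran, so here is a minimal sketch of the k sweep it was aiming for. To stay dependency-free it uses a hand-written, unweighted KNN regression in base R rather than kknn, so its numbers would not match kknn output exactly. `knn_predict` is a hypothetical helper, and `demo` is synthetic stand-in data with made-up values.

```r
# Hypothetical helper: unweighted KNN regression with Euclidean distance
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))  # distance to every training row
    mean(train_y[order(d)[seq_len(k)]])       # average the k nearest targets
  })
}

# Synthetic stand-in data (assumed values); lead_time has a much larger range than adults
set.seed(7)
n <- 300
demo <- data.frame(lead_time = runif(n, 0, 400),
                   adults = sample(1:3, n, replace = TRUE))
demo$adr <- 50 + 30 * demo$adults + rnorm(n, sd = 10)
train_idx <- sample(seq_len(n), size = 200)
features <- c("lead_time", "adults")

# Standardize with the training means and SDs so no feature dominates the distance
train_x <- scale(demo[train_idx, features])
test_x <- scale(demo[-train_idx, features],
                center = attr(train_x, "scaled:center"),
                scale = attr(train_x, "scaled:scale"))
train_y <- demo$adr[train_idx]
test_y <- demo$adr[-train_idx]

# Training and testing MSE for each k
mse_df <- data.frame(k = integer(), train_MSE = numeric(), test_MSE = numeric())
for (k in c(1, 3, 5, 7, 9, 11)) {
  train_pred <- knn_predict(train_x, train_y, train_x, k)
  test_pred <- knn_predict(train_x, train_y, test_x, k)
  mse_df <- rbind(mse_df, data.frame(k = k,
                                     train_MSE = mean((train_pred - train_y)^2),
                                     test_MSE = mean((test_pred - test_y)^2)))
}
print(mse_df)
```

With k = 1 every training point is its own nearest neighbor, so the training MSE is zero while the testing MSE is not; that gap is the overfitting signal to look for when choosing k.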
Make sure you answer these questions clearly, with figures or tables as evidence to support your arguments:
What variables are most strongly correlated with the target? In the correlation plot, adults and children have a high correlation with adr, while required_car_parking_spaces and total_of_special_requests are both negatively correlated with is_canceled.
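How does the value of k in KNN affect the model’s performance (in terms of training MSE and testing MSE)? When k = 1, the model is highly influenced by noise and outliers: each training point is its own nearest neighbor, so training MSE is close to zero while testing MSE tends to be high. Larger values of k smooth the predictions; if k is too large, the model is oversmoothed and no longer captures the real patterns in the data, so both training and testing MSE rise.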
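What assumptions are being made when we use linear regression? Are they met in this dataset? Just describe what you observe from the diagnostic plots. Linear regression assumes that the relationship between the predictors and adr is linear and that the error variance is equal along the regression line (homoscedasticity); it also assumes independent, approximately normally distributed errors.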
Try adding interaction terms to your linear regression model. Try to find at least one interaction term that has a statistically significant coefficient. Report the interaction term, check how it influences the model’s performance in terms of R^2, and explain how you interpret the new model.
Are there any outliers in the dataset? If yes, apply truncation or winsorization techniques to handle them. Compare the performance of the models before and after applying these techniques. What differences do you observe? There were many outliers in the dataset (for example, adr reaches 5400 and adults reaches 55). Winsorization was applied to cap adr, adults, and stays_in_week_nights.
How could feature scaling (standardization) affect the KNN model? Scaling matters because KNN relies on a distance metric to find the closest neighbors (hence the name), so variables with large ranges dominate the distance unless the features are standardized. In addition, normalization or robust scaling can reduce the influence of outliers on the range and scale of a variable, much as truncation does.
What insights can you derive from comparing the linear regression and KNN models? After comparing the linear regression and KNN models, I see how sensitive to noise KNN can be with small k values, and how sensitive the linear regression model is to outliers. Which model would you recommend the most, and why?
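For the interaction-term question above, a minimal sketch of the comparison is shown here. It rests on assumptions: `demo` is synthetic data generated with a built-in interaction effect of 4 between adults and stays_in_week_nights, so the significant term is guaranteed by construction. With the real hotel data, which interaction (if any) is significant has to be found by trying candidates.

```r
# Synthetic stand-in data (assumed values) with a true interaction effect of 4
set.seed(2024)
n <- 1000
demo <- data.frame(adults = sample(1:3, n, replace = TRUE),
                   stays_in_week_nights = sample(1:5, n, replace = TRUE))
demo$adr <- 40 + 20 * demo$adults + 5 * demo$stays_in_week_nights +
  4 * demo$adults * demo$stays_in_week_nights + rnorm(n, sd = 10)

# Baseline model with main effects only
base_lm <- lm(adr ~ adults + stays_in_week_nights, data = demo)
# adults * stays_in_week_nights expands to both main effects plus adults:stays_in_week_nights
inter_lm <- lm(adr ~ adults * stays_in_week_nights, data = demo)

# Estimate, standard error, t value, and p-value of the interaction coefficient
print(coef(summary(inter_lm))["adults:stays_in_week_nights", ])

# R^2 before and after adding the interaction
print(c(base_R2 = summary(base_lm)$r.squared,
        interaction_R2 = summary(inter_lm)$r.squared))

# F-test for whether the extra term improves the fit
print(anova(base_lm, inter_lm))
```

The interaction coefficient is read as the amount by which the slope of adults changes for each additional week night (and vice versa). R^2 can never decrease when a term is added, so the p-value or the F-test is the better evidence that the term is worth keeping.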