Initial Exploration

The data set contains 41,330 observations (selected rentals) and 91 variables associated with each observation. Many of these variables are essentially unique to each observation, such as id, summary, and description, and carry no predictive value, so to start I excluded any variable with more than 250 unique levels.

Moreover, based on my personal experience, I expect the following types of variables to lead to useful predictions: location, room type, number of rooms, guest capacity, and minimum nights per booking.

I also noticed that missing values are concentrated in certain variables, so I excluded variables with a high percentage (more than 90%) of missing data. For the remaining missing values, rather than deleting the rows, I made them consistent by parsing the character variables and imputing the numeric variables.

* Combine Scoring Data & Analysis Data to Start the Data Cleaning

#setwd("~/Desktop/Columbia/Courses/5200/Project")
data <- read.csv('analysisData.csv')
scoringData <- read.csv('scoringData.csv')
library(dplyr)
# Tag each data set so the rows can be told apart after combining
data <- data %>% mutate(type = "analysis")
scoringData <- scoringData %>% mutate(type = "scoring")
combinedData <- bind_rows(data, scoringData)

* Get Rid of Variables that Have More than 250 Unique Levels

str(data)
# Convert all character columns to factors so their levels can be counted
combinedData[, sapply(combinedData, is.character)] <- 
  lapply(combinedData[, sapply(combinedData, is.character)], as.factor)

# Inspect how many unique values each variable has
sapply(combinedData, function(col) length(unique(col)))

# Drop constant variables and factors with 250 or more levels
combinedData <- combinedData %>%
  select_if(function(col) length(unique(col)) > 1) %>%
  select_if(function(col) length(levels(col)) < 250)

* Get Rid of Variables that Have More than 90% of Missing Data

# Percentage of missing values per variable (use nrow() rather than a hard-coded count)
sapply(combinedData, function(col) sum(is.na(col)) / nrow(combinedData) * 100)

#subset new combinedData data frame, dropping the high-missingness variables
#(and the type helper column)
combinedData <- subset(combinedData, select = -c(reviews_per_month, type, square_feet,
                                                 weekly_price, monthly_price))
dim(combinedData)

* Data Cleaning

####Missing Data####

library(dplyr)

# Recode empty strings and "N/A" entries to real NAs
char2na <- function(x) {
  x <- as.character(x)
  return(
    case_when(
      x == "" ~ NA_character_,
      x == "N/A" ~ NA_character_,
      TRUE ~ x
    )
  )
}
char2na(c("FOO", "", "N/A"))  # quick sanity check

# Apply char2na across the factor columns (they were converted from character above)
combinedData <- combinedData %>%
  mutate_if(is.factor, ~ as.factor(char2na(.)))

sum(is.na(combinedData))

####Zip codes####

## modify zip+4: truncate to the first five digits
combinedData$zipcode <- substr(combinedData$zipcode, 1, 5)
combinedData$zipcode <- as.factor(combinedData$zipcode)
# Keep the 40 most common zip codes; lump the rest into "Other"
combinedData$zipcode <- forcats::fct_lump_n(combinedData$zipcode, 40)

####Numeric variables median imputation####

library(caret)  # preProcess() comes from caret
# All numeric columns except the target
numeric_predictors <- which(colnames(combinedData) != "price" & 
                              sapply(combinedData, is.numeric))

# Impute remaining numeric NAs with each column's median
imp_model_med <- preProcess(combinedData[, numeric_predictors], method = 'medianImpute')
combinedData[, numeric_predictors] <- predict(imp_model_med,
                                              newdata = combinedData[, numeric_predictors])

* Further Variable Reduction

library(caret)
# Drop near-zero-variance predictors
zero_var_table <- nearZeroVar(combinedData, saveMetrics = TRUE)
combinedData <- combinedData[, !zero_var_table$nzv]
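
Before any models are fit, the combined frame has to be split back into the analysis rows (used as train below) and the scoring rows. Since the type helper column was dropped above, here is a minimal sketch that instead relies on bind_rows() preserving row order (analysis rows first):

# bind_rows() keeps the analysis rows first, so split by the original row count
n_analysis <- nrow(data)                    # 41,330 analysis observations
train <- combinedData[1:n_analysis, ]
scoring <- combinedData[-(1:n_analysis), ]  # rows to predict for Kaggle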

Feature Selection & Modeling

After testing different feature selection approaches (best subset, LASSO, dimension reduction), LASSO and subset selection (by features) proved to be the most time-efficient and practical methods, and both work for numeric and factor variables.

FEATURE SELECTION 1: Two Subset Selections & One LASSO

###(Rscript LASSOSELECTLINEAR.R)

* Subset measure: see LASSOSELECTLINEAR.R, lines 100-128

library(leaps)
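
Since the subset-selection code lives in the external script, the following is only an illustrative sketch of a best-subset search with leaps (the formula, nvmax, and selection criterion here are assumptions, not the script's exact call):

# Illustrative best-subset search; nvmax caps the model size considered
subsets <- regsubsets(price ~ ., data = train, nvmax = 20, really.big = TRUE)
subset_summary <- summary(subsets)
best_size <- which.min(subset_summary$bic)  # smallest BIC picks the model size
coef(subsets, best_size)                    # features in the chosen model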

* LASSO measure: see LASSOSELECTLINEAR.R, lines 126-140

library(glmnet)
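
Again only a sketch of the elided code: cv.glmnet() needs a numeric design matrix, so factors are expanded with model.matrix(), and the features with non-zero coefficients at the cross-validated lambda are kept (the details are assumptions, not the script's exact call):

# Expand factors into dummy columns; glmnet requires a numeric matrix
x <- model.matrix(price ~ . - 1, data = train)
y <- train$price
set.seed(617)
cv_lasso <- cv.glmnet(x, y, alpha = 1)          # alpha = 1 is the LASSO penalty
coefs <- coef(cv_lasso, s = "lambda.min")
selected <- rownames(coefs)[which(coefs != 0)]  # surviving features
selected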

FEATURE SELECTION 2: 2 x LASSO (the first LASSO's result was further shrunk by a second LASSO)

###(Rscript 2XLASSOFOREST.R)

* LASSO measure: see 2XLASSOFOREST.R, lines 113-132

library(glmnet)
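
As a sketch of the 2 x LASSO idea, continuing from the single-LASSO sketch above (assumed, not the script's exact code): the second LASSO is fit only on the columns the first one kept, shrinking the set further:

# Second pass: refit the LASSO on only the columns the first pass kept
x2 <- x[, setdiff(selected, "(Intercept)")]
set.seed(617)
cv_lasso2 <- cv.glmnet(x2, train$price, alpha = 1)
coefs2 <- coef(cv_lasso2, s = "lambda.min")
rownames(coefs2)[which(coefs2 != 0)]  # the further-shrunk feature set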

MODEL 1: LINEAR REGRESSION WITH FEATURES FROM FEATURE SELECTION 1 STARTS HERE

modelL1 <- lm(price ~ host_response_time + host_acceptance_rate + 
                host_is_superhost +
                neighbourhood_group_cleansed + zipcode + 
                is_location_exact + room_type + accommodates + bathrooms + 
                bedrooms + security_deposit + cleaning_fee + guests_included+ 
                extra_people + minimum_nights + minimum_minimum_nights + 
                minimum_maximum_nights+
                maximum_maximum_nights + minimum_nights_avg_ntm + 
                availability_90 + availability_365 + number_of_reviews + 
                number_of_reviews_ltm + review_scores_rating + review_scores_cleanliness +
                review_scores_location+ instant_bookable + cancellation_policy+
                calculated_host_listings_count_private_rooms,data=train)

MODEL 2: RANDOM FOREST WITH FEATURES FROM FEATURE SELECTION 1 STARTS HERE

library(randomForest)
set.seed(617)
forest = randomForest(price ~ host_response_time + 
                        host_is_superhost +
                        neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + security_deposit + cleaning_fee + guests_included+ 
                        extra_people + minimum_nights + minimum_minimum_nights + 
                        availability_60 + review_scores_cleanliness +
                        review_scores_location+ instant_bookable + cancellation_policy+
                        calculated_host_listings_count_private_rooms, 
                      data=train, 
                      ntree = 100)

MODEL 3: RANDOM FOREST WITH FEATURES FROM FEATURE SELECTION 2 STARTS HERE

library(randomForest)
set.seed(617)
forest = randomForest(price ~  
                        host_response_time+neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + guests_included+ 
                        minimum_minimum_nights + cleaning_fee+
                        review_scores_cleanliness + availability_60+
                        review_scores_location,data=train,ntree = 100)

MODEL 4: RANDOM FOREST WITH FEATURES FROM FEATURE SELECTION 2 (200 TREES) STARTS HERE

library(randomForest)
set.seed(617)
forest = randomForest(price ~  
                        host_response_time+neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + guests_included+ 
                        minimum_minimum_nights + cleaning_fee+
                        review_scores_cleanliness + availability_60+
                        review_scores_location,data=train,ntree = 200)
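
With the final forest fit, predictions for the scoring rows go into the Kaggle submission file. A minimal sketch, assuming the train/scoring split above and that the original scoringData still carries its id column:

# Predict prices for the scoring rows and write the Kaggle submission
pred_scoring <- predict(forest, newdata = scoring)
submissionFile <- data.frame(id = scoringData$id, price = pred_scoring)
write.csv(submissionFile, 'submission.csv', row.names = FALSE)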

MODELING AND FEATURE SELECTION ENDS HERE

Model Comparison

The combination that worked best turned out to be 2 x LASSO feature selection paired with the random forest model (200 trees), which gave the best RMSE score of 63.38 on Kaggle.

| Model from previous section | RMSE (test holdout or CV) | RMSE on Kaggle | Other notes |
| --- | --- | --- | --- |
| Model 1: Linear regression | 78.27 | 71.12 | |
| Model 2: Random forest | 69.02 | 63.38 | Used the features selected for Model 1 |
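
The holdout RMSE values above come from comparing predictions against held-out prices; a minimal sketch, assuming a test split was carved out of the analysis rows earlier:

# Root mean squared error on a holdout set
pred_test <- predict(forest, newdata = test)
sqrt(mean((pred_test - test$price)^2))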

Discussion

The 2 x LASSO approach further reduced the variable set, leaving the features that best explain price. The fact that the tree-based model outperforms the linear model also suggests that, in this analysis, price is largely explained by a few categorical variables (room type and location, for example, are strongly associated with price).

Future Directions

With more time, I would have done more initial data exploration. It would have been more efficient to hand-pick and filter out some features before running the feature selection methods. For example, I did not realize until later experiments that some variables are nearly identical and one of each pair could have been excluded from the start (e.g., 'neighbourhood_group_cleansed' and 'neighbourhood'). More manual exploration would have left more time to focus on the meaningful variables, to understand each useful variable individually, and to build a more in-depth model on top of them.