Abstract
In this project, I built both regression and decision tree models with several feature selections to predict rental prices in New York. My most accurate model is the decision tree, which achieves an RMSE of 63.38.
There are 41,330 observations (selected rentals) in the data set, with 91 variables associated with each observation. Many of these variables have values unique to each observation, such as ID, summary, and description, which carry no predictive value, so to start I excluded any variable with more than 250 unique levels.
Moreover, based on my personal experience, I expect the following types of variables to lead to useful predictions: location, room type, number of rooms, guest capacity, and minimum nights of booking.
I also noticed some missing values, but they are concentrated in certain variables, so I exclude variables with a high percentage (more than 90%) of missing data. For the remaining missing values, I make them consistent by parsing the character variables and imputing the numeric variables instead of deleting them.
#setwd("~/Desktop/Columbia/Courses/5200/Project")
data <- read.csv('analysisData.csv')
scoringData <- read.csv('scoringData.csv')
library(dplyr)
data <- data %>% mutate(type="analysis")
scoringData <- scoringData %>% 
  mutate(type="scoring")
combinedData <- bind_rows(data, scoringData)
str(data)
combinedData[, sapply(combinedData, is.character)] <- lapply(combinedData[, sapply(combinedData, is.character)], as.factor)
sapply(combinedData, function(col) length(unique(col)))
data$type
combinedData<-combinedData %>%
select_if(function(col) length(unique(col))>1)%>%
select_if(function(col) length(levels(col)) < 250)
sapply(combinedData, function(col) sum(is.na(col))/51663*100)  # 51,663 rows total: % missing per variable
#subset new combinedData data frame
combinedData<-subset(combinedData, select=-c(reviews_per_month,type,square_feet,
                                             weekly_price,monthly_price ))
dim(combinedData)
####Missing Data####
library(dplyr)
?case_when
char2na <- function(x) {
  x <- as.character(x)
  return(
    case_when(
      x == "" ~ NA_character_,
      x == "N/A" ~ NA_character_,
      TRUE ~ x
    )
  )
}
char2na(c("FOO", "", "N/A"))
?mutate_if
combinedData %>%
  mutate_if(is.character, char2na)   # note: character columns were converted to factors above
sum(is.na(data))
####Zip codes####
library(readr)
## modify zip+4
combinedData$zipcode <- substr(combinedData$zipcode, 1, 5)
combinedData$zipcode <- as.factor(combinedData$zipcode)
combinedData$zipcode <- forcats::fct_lump_n(combinedData$zipcode, 40)
####Numeric variables median imputation####
library(caret)   # preProcess() with method = 'medianImpute' comes from caret
numeric_predictors <- which(colnames(combinedData) != "price" & 
                              sapply(combinedData, is.numeric))
imp_model_med <- preProcess(combinedData[,numeric_predictors], method = 'medianImpute')
combinedData[,numeric_predictors] <- predict(imp_model_med, newdata=combinedData[,numeric_predictors])
library(ggplot2)
library(caret)
zero_var_table <- nearZeroVar(combinedData, saveMetrics= TRUE)
combinedData <- combinedData[, !zero_var_table$nzv]
After testing different feature-selection methods (best subset, LASSO, dimension reduction), LASSO and subset selection (by features) turned out to be the most time-efficient and practical methods that work for both numeric and factor variables.
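The models below are fit on a `train` object that is not created in the code above. Here is a minimal sketch of the split I assume: the combined data is separated back into analysis and scoring rows (the `type` flag was dropped earlier, so the row order from bind_rows is used), and 20% of the analysis rows are held out for validation. The 80/20 proportion is an assumption.
# Sketch only: recover the analysis/scoring rows and create a train/holdout split
library(caret)
set.seed(617)
analysis <- combinedData[1:nrow(data), ]
scoring  <- combinedData[(nrow(data) + 1):nrow(combinedData), ]
split <- createDataPartition(y = analysis$price, p = 0.8, list = FALSE)
train <- analysis[split, ]
test  <- analysis[-split, ]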
###(Rscript LASSOSELECTLINEAR.R)
library(leaps)
names(combinedData)
library(glmnet)
###(Rscript 2XLASSOFOREST.R)
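The two LASSO scripts referenced above are not reproduced here; below is a minimal sketch of how a glmnet-based LASSO selection could look, using the `train` object from the split sketched earlier (the actual LASSOSELECTLINEAR.R may differ).
# Sketch of LASSO feature selection with cv.glmnet: dummy-encode the predictors,
# pick the penalty by cross-validation, and keep coefficients not shrunk to zero
library(glmnet)
train_cc <- train[complete.cases(train), ]   # guard against any leftover NAs
x <- model.matrix(price ~ . - 1, data = train_cc)
y <- train_cc$price
set.seed(617)
cv_lasso <- cv.glmnet(x, y, alpha = 1)
lasso_coef <- coef(cv_lasso, s = "lambda.min")
selected <- rownames(lasso_coef)[which(lasso_coef[, 1] != 0)]
selected   # dummy-level names of the surviving features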
library(glmnet)
modelL1 <- lm(price ~ host_response_time + host_acceptance_rate + 
                host_is_superhost +
                neighbourhood_group_cleansed + zipcode + 
                is_location_exact + room_type + accommodates + bathrooms + 
                bedrooms + security_deposit + cleaning_fee + guests_included+ 
                extra_people + minimum_nights + minimum_minimum_nights + 
                minimum_maximum_nights+
                maximum_maximum_nights + minimum_nights_avg_ntm + 
                availability_90 + availability_365 + number_of_reviews + 
                number_of_reviews_ltm + review_scores_rating + review_scores_cleanliness +
                review_scores_location+ instant_bookable + cancellation_policy+
                calculated_host_listings_count_private_rooms, data=train)
library(randomForest)
forest = randomForest(price ~ host_response_time  + 
                        host_is_superhost +
                        neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + security_deposit + cleaning_fee + guests_included+ 
                        extra_people + minimum_nights + minimum_minimum_nights + 
                        availability_60 + review_scores_cleanliness +
                        review_scores_location+ instant_bookable + cancellation_policy+
                        calculated_host_listings_count_private_rooms, 
                      data=train, 
                      ntree = 100)
library(randomForest)
set.seed(617)
forest = randomForest(price ~  
                        host_response_time+neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + guests_included+ 
                        minimum_minimum_nights + cleaning_fee+
                        review_scores_cleanliness + availability_60+
                        review_scores_location, data=train, ntree = 100)
library(randomForest)
set.seed(617)
forest = randomForest(price ~  
                        host_response_time+neighbourhood_group_cleansed + zipcode + 
                        is_location_exact + room_type + accommodates + bathrooms + 
                        bedrooms + guests_included+ 
                        minimum_minimum_nights + cleaning_fee+
                        review_scores_cleanliness + availability_60+
                        review_scores_location, data=train, ntree = 200)
My best combination of feature selection turned out to be 2xLASSO (applying LASSO twice); paired with my decision tree model (200 trees), it gives my best RMSE score of 63.38.
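For completeness, here is a sketch of how the holdout RMSE and the Kaggle scoring predictions could be produced from the final forest; the `test` and `scoring` objects come from the split sketched earlier, and the submission column names and file name are assumptions.
# Holdout RMSE for the final forest (complete cases only, in case any NAs remain)
test_cc <- test[complete.cases(test), ]
pred_test <- predict(forest, newdata = test_cc)
sqrt(mean((pred_test - test_cc$price)^2))
# Scoring predictions in a Kaggle-style submission file
# (assumes the scoring predictors have no remaining NAs and scoringData has an id column)
pred_scoring <- predict(forest, newdata = scoring)
submission <- data.frame(id = scoringData$id, price = pred_scoring)
write.csv(submission, 'submission.csv', row.names = FALSE)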
| Model from previous section | RMSE for test data holdout or CV | RMSE on Kaggle | Other notes |
|---|---|---|---|
| Model 1: Linear Regression | 78.27 | 71.12 | |
| Model 2: Decision tree | 69.02 | 63.38 | Used features selected for model 1 |
Applying LASSO a second time further reduces the variables, leaving the features that best explain the prices. That the tree model works better than the linear model also indicates that, in this analysis, prices are largely explained by a few categorical variables (room type and location, for example, are strongly associated with price).
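As a rough illustration of applying LASSO twice, the second pass can rerun cv.glmnet on only the columns that survived the first pass, reusing `x`, `y`, and `selected` from the earlier sketch (the actual 2XLASSOFOREST.R may differ).
# Second LASSO pass: restrict the design matrix to the first-pass survivors
# and keep whatever remains nonzero at the cross-validated penalty
keep <- setdiff(selected, "(Intercept)")
x2 <- x[, keep, drop = FALSE]
set.seed(617)
cv_lasso2 <- cv.glmnet(x2, y, alpha = 1)
coef2 <- coef(cv_lasso2, s = "lambda.min")
rownames(coef2)[which(coef2[, 1] != 0)]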
I would have done more initial data exploration if I had more time. It would have been more efficient to hand-pick and filter out some features before running the feature-selection methods. For example, I did not realize until later experiments that some variables are nearly duplicates and one of each pair could have been excluded from the start, such as 'neighbourhood_group_cleansed' and 'neighbourhood'. More manual exploration would have left me more time to focus on the meaningful variables, to understand the useful variables individually, and to build more in-depth models based on them.