by Yuting Gong, December 2018

Introduction

This report records my work on "How much for your Airbnb?", an in-class Kaggle competition hosted by the course Applied Analytics Framework and Methods, Fall 2018. The competition provides data sets of Airbnb housing listings and asks students to predict the price of each listing. Accuracy of the predictions is measured by RMSE.
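For reference, RMSE (root mean squared error) is the square root of the average squared difference between predicted and actual prices. A small helper like the one below (my own illustration, not part of the competition code) computes it on a validation split:

#root mean squared error between actual and predicted prices
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}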

Exploring the data

After importing the train data as data and the test data as scoringData into R, I started my data exploration with the steps below:
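As a rough sketch of that setup (the file names analysisData.csv and scoringData.csv are assumptions, and all is the train and test data combined for exploration):

library(ggplot2)       #plotting
library(dplyr)         #data manipulation (arrange, bind_rows)
library(corrplot)      #correlation plot below
library(cowplot)       #plot_grid below
library(randomForest)  #random forest models later on

data <- read.csv('analysisData.csv')        #train data; file name assumed
scoringData <- read.csv('scoringData.csv')  #test data; file name assumed
all <- bind_rows(data, scoringData)         #scoringData has no price column, so bind_rows fills it with NA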

ggplot(data = all, aes(x = price)) +
  geom_histogram(bins = 30)   #distribution of price

numericVars <- which(sapply(all, is.numeric)) #index numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on

all_numVar <- all[, numericVars]
all_numVar$id <- NULL   #remove id
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables
#sort on decreasing correlations with price
cor_sorted <- as.matrix(sort(cor_numVar[,'price'], decreasing = TRUE))
#select only high correlations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.2)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")

weekly_price: as shown in the graph below, "weekly_price" and "price" have a strong linear relationship

g1 = ggplot(data = all, aes(x = weekly_price)) +
  geom_histogram(bins = 30)
c1 = ggplot(data = all, aes(x = weekly_price, y = price)) +
  geom_point()

plot_grid(g1, c1, labels = "AUTO")

cleaning_fee: the relationship between "cleaning_fee" and "price" is not purely linear. In the model fitting, I wanted to try both a linear model and other models such as random forest.

accommodates: I examined its distribution and its relationship with "price" in the same way as above.

Preparing the Data for Analysis

I spent a lot of time cleaning the data and imputing missing values. Below are a few of the steps I took:

total = colSums(is.na(data))
percentage = total/nrow(data)
col_name = colnames(data)
missing_data = data.frame(name= col_name,total = total, percentage = percentage)
arrange(missing_data, desc(total))
### we will fix NAs in scoringData at the same time
#cleaning_fee: impute 0 for missing cleaning fees
data[is.na(data$cleaning_fee),]$cleaning_fee <- 0
scoringData[is.na(scoringData$cleaning_fee),]$cleaning_fee <- 0

#beds: impute the median for missing beds
data[is.na(data$beds),]$beds <- median(data$beds, na.rm = TRUE)
scoringData[is.na(scoringData$beds),]$beds <- median(scoringData$beds, na.rm = TRUE)

There are NAs in categorical variables that were not detected in the previous NA table. I imputed values for them as I found them:

#scoringData: fill NA in reviews_per_month; since the listing only has 1 review, I fill in 0 for its reviews per month
#in the actual code, I did this after "fix factor levels", so I use "scoringData2" here
scoringData2[is.na(scoringData2$reviews_per_month),]$reviews_per_month <- 0
scoringData$price <- NA   #add a price column so train and test can be combined
scoringData$zipcode = as.factor(scoringData$zipcode)
fulldata <- rbind(data, scoringData)   #combining aligns factor levels across both sets
fulldata$property_type = as.factor(fulldata$property_type)
##36428 rows in total: train has 29142 rows, scoringData has 7286 rows
data2 = fulldata[1:29142,]
scoringData2 = fulldata[29143:36428,]
##there are property types in the test data that do not appear in train: Hut and Cottage. We map them to "Other"
scoringData2[scoringData2$property_type == "Hut",]$property_type = "Other"
scoringData2[scoringData2$property_type == "Cottage",]$property_type = "Other"

Another example of fixing factor levels is neighbourhood_cleansed. The test data has more factor levels than the train data, so I imputed them with nearby neighbourhoods that are in the train data:

#neighbourhood_cleansed
scoringData2[scoringData2$neighbourhood_cleansed == "Hollis Hills",]$neighbourhood_cleansed = "Bayside"
scoringData2[scoringData2$neighbourhood_cleansed == "Westerleigh",]$neighbourhood_cleansed = "Castleton Corners"
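The same pattern repeats for every unseen level, so a small helper along these lines could generalize it (my own sketch; remap and its parameters are illustrative names, and it assumes fallback is already a level of the factor):

#map any levels of a factor column that appear in test but not in train to a single fallback level
remap = function(test, train, col, fallback) {
  unseen = setdiff(unique(test[[col]]), unique(train[[col]]))
  test[test[[col]] %in% unseen, col] = fallback
  test
}
#example: send all unseen property types to "Other"
#scoringData2 = remap(scoringData2, data2, "property_type", "Other")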

Okay. Our data is clean now!

Modeling Techniques

I fitted different models including a linear model, random forest, random forest with cross validation, and boosting with cross validation. In the end I decided to stick with random forest because it gave me the most accurate result without taking too long to run. But I had to give up variables such as "neighbourhood_cleansed" because it has too many factor levels (>52) for the randomForest package. I also used Google Cloud to run RStudio because it is faster than my local machine.
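As an illustration of how such high-cardinality factors can be spotted up front (a sketch of my own, not from the original code):

#count factor levels per column; randomForest cannot handle factors with too many levels
level_counts = sapply(data2, function(x) if (is.factor(x)) nlevels(x) else NA)
sort(level_counts[!is.na(level_counts)], decreasing = TRUE)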

I will briefly describe the techniques that worked for me in the linear model and the randomForest model:

nrow(data2[data2$price == 0,])  #25 rows have price = 0. An Airbnb price of 0 does not make sense, so let's remove them
data2 = data2[data2$price > 0,]
data2$price = log(data2$price)  #log transformation of price

#model
model9 = lm(price~extra_people+property_type+neighbourhood_cleansed+room_type+bedrooms+bathrooms+bed_type+review_scores_rating+minimum_nights+number_of_reviews+instant_bookable+is_business_travel_ready+beds+review_scores_location+review_scores_value+review_scores_cleanliness+review_scores_accuracy+cancellation_policy+guests_included+host_response_time+availability_30+availability_60+availability_90+availability_365+calculated_host_listings_count+accommodates+cleaning_fee+reviews_per_month+latitude+longitude+host_response_rate+host_has_profile_pic+host_identity_verified+require_guest_phone_verification+require_guest_profile_picture, data = data2)
pred9 = predict(model9, newdata = scoringData2)
submissionFile = data.frame(id = scoringData2$id, price = exp(pred9))  #back-transform from the log scale
write.csv(submissionFile, 'submission.csv', row.names = FALSE)  #output file name is illustrative

After continued experimentation, I decided to leverage "weekly_price" by splitting the train data into two sets, one with "weekly_price" and one without, and training two models accordingly. I also split my test data the same way and predicted the two parts separately.

Last but not least, even though the log transformation worked well in the linear model, I eventually realized that it did not work well in the random forest, so I removed it there. My random forest model got me my best score of 53.5.

###split train data based on weekly_price
data3 <- data2[is.na(data2$weekly_price),]
data4 <- data2[!is.na(data2$weekly_price),]

###split test data based on weekly_price. 
scoringData2_1 <- scoringData2[is.na(scoringData2$weekly_price),]
scoringData2_2 <- scoringData2[!is.na(scoringData2$weekly_price),]

##random forest models, one per split
rf1 <- randomForest(price~host_response_time+host_is_superhost+host_listings_count+host_total_listings_count+host_has_profile_pic+host_identity_verified+neighbourhood_group_cleansed+latitude+longitude+is_location_exact+property_type+room_type+accommodates+bathrooms+bedrooms+beds+bed_type+cleaning_fee+guests_included+extra_people+minimum_nights+maximum_nights+availability_30+availability_60+availability_90+availability_365+number_of_reviews+review_scores_rating+review_scores_accuracy+review_scores_cleanliness+review_scores_checkin+review_scores_communication+review_scores_location+review_scores_value+instant_bookable+is_business_travel_ready+cancellation_policy+require_guest_profile_picture+require_guest_phone_verification+calculated_host_listings_count+reviews_per_month,
                    data=data3, ntree = 500)
rf2 <- randomForest(price~weekly_price+host_response_time+host_is_superhost+host_listings_count+host_total_listings_count+host_has_profile_pic+host_identity_verified+neighbourhood_group_cleansed+latitude+longitude+is_location_exact+property_type+room_type+accommodates+bathrooms+bedrooms+beds+bed_type+cleaning_fee+guests_included+extra_people+minimum_nights+maximum_nights+availability_30+availability_60+availability_90+availability_365+number_of_reviews+review_scores_rating+review_scores_accuracy+review_scores_cleanliness+review_scores_checkin+review_scores_communication+review_scores_location+review_scores_value+instant_bookable+is_business_travel_ready+cancellation_policy+require_guest_profile_picture+require_guest_phone_verification+calculated_host_listings_count+reviews_per_month,
                    data=data4, ntree = 500)
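For completeness, here is a sketch of how the two models' predictions could be recombined into a single submission (the output file name is an assumption):

pred_rf1 = predict(rf1, newdata = scoringData2_1)
pred_rf2 = predict(rf2, newdata = scoringData2_2)

#stack the two parts back into one submission file
submission = rbind(data.frame(id = scoringData2_1$id, price = pred_rf1),
                   data.frame(id = scoringData2_2$id, price = pred_rf2))
write.csv(submission, 'submission_rf.csv', row.names = FALSE)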

Result Summary

My best result came from the random forest with ntree = 500. The RMSE is 53.51.

Learning and discussion

What I did right:
  • Before I proceeded to feature selection, I made sure I understood all the variables, their meanings, and their potential relationships with the "price" variable, so I was able to use my business acumen to select good features for my model.

  • I carefully cleaned the data and imputed missing values. I also searched for data transformation/cleaning methods online and read Kernels from other similar competitions to learn more problem-solving techniques.

  • I fitted many different models and did not give up easily.

What I did wrong:

  • Random forest cross validation took me 12 hours to finish. In the middle of the wait, I didn't know whether the program was still running or had already crashed without notification. I should have experimented with a small data set first to make sure it worked, and then applied it to the full data set.

  • I didn't remove the log transformation on price until the very last step. This was because I built every next step on my previous data cleaning and did not review all my features again before applying new models.

If I do it again, I will:

  • Work on feature reduction. My model now has over 40 variables, so there is room to cut features; with more time, reducing the number of features might mitigate overfitting and improve the RMSE (a starting point is sketched below).
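As a sketch of where that reduction could start, the randomForest package reports variable importance, which can be used to drop the least useful predictors (rf1 is the model fitted earlier):

imp = importance(rf1)  #IncNodePurity is the default importance measure for regression
imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]
varImpPlot(rf1)        #visual check of the same ranking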

Thank you for reading my report! I have attached my full code separately in an R file.