Kaggle Report

by Yuting Gong, December 2018

Introduction

This report records my work on solving “How much for your Airbnb?”, an in-class Kaggle competition hosted by course Applied Analytics Framework and Methods Fall 2018. The Kaggle competitions gives data sets of Airbnb housing listing and asks students to predict the price of the listings. Accruancy of the prediction is measured by RMSE.

Exploring the data

After importing both train data as data and test data as scoringData into R, I started my data exploration in below steps:

I first used str() function to have an overview of the data structure. I got to learn that the data set contains 96 variables and more than 29,000 observations.
Since there are too many variables to shown in str() summary, I decided to open the data in R and read through each column to understand what the data means.
I summarized and mapped out distribution of our target “price” variable. I noticed that the “price” is a bit right skewed - I will bear this in mind. At this stage, I combined train and test data sets together so that I have a holistic view of the data distribution:

ggplot(data = all, aes(x=price)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then I used ggplot to map out numeric data that has high correlations (correlation > 0.3) with “price” variable, ranked by decreasing correlations. I found that “weekly_price”, “cleaning_fee”, and “accommodates” have highest correlations with “price”:

numericVars <- which(sapply(all, is.numeric)) #index numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on

all_numVar <- all[, numericVars]
all_numVar$id <- NULL   #remove id
cor_numVar <- cor(all_numVar, use="pairwise.complete.obs") #correlations of all numeric variables
#sort on decreasing correlations with price
cor_sorted <- as.matrix(sort(cor_numVar[,'price'], decreasing = TRUE))
#select only high corelations
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.2)))
cor_numVar <- cor_numVar[CorHigh, CorHigh]
corrplot.mixed(cor_numVar, tl.col="black", tl.pos = "lt")

I visualized the distribution of the the top 3 correlated variables and their relationships with “price” using ggplot2.

weekly_price: as shown in below graph, “weekly_price” and “price” has high linear relationship

g1 = ggplot(data=all, aes(x=weekly_price)) +
  geom_histogram()
c1 = ggplot(data=all, aes(x=weekly_price, y=price)) +
  geom_point()

plot_grid(g1, c1, labels = "AUTO")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

cleaning_fee: as the relationship between “cleaning_fee” and “price” is not pure linear. In the model fitting, I would like to try both linear model and other models such as random forest.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

accommodates:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Preparing the Data for Analysis

I spent a lot of time cleaning data and imputing missing values. Below are a few steps that I took:

Review Missing Data: List out variables that have missing data First, I printed out a table that lists all the variables ranked by their total number of N/A, and percentage of N/A. If a variable has too many missing data, it is useless and let’s not use it in the model. I would not use data that has over 40% missing value, which are “thumbnail_url”, “medium_url”, “xl_picture_url”, “license”, “monthly_price” and ”square_feet”, “security_deposit”. One exception is “weekly_price”. Interestingly, I noticed that “weekly_price” has high percentage of missing data (86%), but its correlations with “price” is also high as shown in the first section. I did not want to throw it away, so I decided to use it to build two separate models to predict price, which I will explain further in the Model Techniques section. For the rest of the missing data, I decided to impute value for N/A.

total = colSums(is.na(data))
percentage = total/nrow(data)
col_name = colnames(data)
missing_data = data.frame(name= col_name,total = total, percentage = percentage)
arrange(missing_data, desc(total))

Feature selection:
- I used the correlation map and my business acumen to manually select features. Most of the variables that I selected are numeric and categorical data. I decided to leave out a lot of text values such as “name”, “summary” and “amenities” simply because I didn’t not know how to deal with text variables. If I know natural language processing, I would consider working on them. I also applied Lasso once to select features, but its RMSE result was not as good as my manual pick. So I decided to keep my manual selection.
Impute missing data:
- After knowing what features I’d like to include, I prioritized to impute their missing values. I impute 0 or median or a categorical value depends on my understanding of the data. Below are some examples:

### we will fix NA in scoringdata at the same time
data[is.na(data$cleaning_fee),]$cleaning_fee <- 0
scoringData[is.na(scoringData$cleaning_fee),]$cleaning_fee <- 0

#beds, impute medium for missing beds 
data[is.na(data$beds),]$beds <- median(data$beds, na.rm = TRUE)
scoringData[is.na(scoringData$beds),]$beds <- median(scoringData$beds, na.rm = TRUE)

There are NA in categorical values that are not detected in the previous NA table. I imputed values for them when the I found them:

#scoringData, fill NA in review per month, since it only has 1 review, ill fill in 0 for its review per month
#in actual code, I did this after "fix factor levels". So I have "scoringData2" here
scoringData2[is.na(scoringData2$reviews_per_month),]$reviews_per_month <- 0

Fix factor levels:
- “property_type”’s factor levels in train and test data sets are different. It caused a problem for random forest, so I fixed factor levels to ensure they have the same levels of factors. My method is to combine them together, and then split:

scoringData$price <- NA
scoringData$zipcode=as.factor(scoringData$zipcode)
fulldata <- rbind(data, scoringData)
fulldata$property_type = as.factor(fulldata$property_type)
##36428, train has 29142 rows, scoringData has 7286 rows
data2= fulldata[1:29142,]
scoringData2=fulldata[29143:36428,]
##there are  property type that are not in train. we want to fix that, which is hut and, cottage
scoringData2[scoringData2$property_type == "Hut",]$property_type = "Other"
scoringData2[scoringData2$property_type == "Cottage",]$property_type = "Other"

Another example of fixing factor levels is neighbourhood_cleasned. Test data has more factors levels than train data. I imputed them with nearby neighbourhood that’s in the train data:

#neighbourhood_cleasned
scoringData2[scoringData$neighbourhood_cleansed == "Hollis Hills",]$neighbourhood_cleansed = "Bayside"
scoringData2[scoringData$neighbourhood_cleansed == "Westerleigh",]$neighbourhood_cleansed = "Castleton Corners"

Okay. Our data is clean now!

Modeling Techniques:

I fitted different models including Linear Model, Random Forest, Random Forest with Cross Validation, Boosting with Cross Validation. In the end I decided to stick with Random Forest because it gave me the most accurate result without taking too much time to run the model. But I had to give up variables such as “neighbourhood_cleansed” because it had too many factor levels (>52). I also used google cloud to run R studio because it is quicker than my local R studio.

I will talk briefly about what model techniques work for me in the LM and randomForest model:

In Linear Model stage, I used the findings that “price” variables are right skewed. I decided to log transformation “price” so that it became a normal distribution. To use the log transformation result of “price”, I then used exp() function in the result to convert “price” back to normal price. This technique improved my result RMSE to 59, which is my best MRSE using LM.

  nrow(data[data$price == 0,])  #25 rows have price = 0. Airbnb price is 0, does not make sense. Let’s remove them
data = data[data$price >0,] 
data$price = log(data$price) #log transformation price

#model
model9 = lm(price~extra_people+property_type+neighbourhood_cleansed+room_type+bedrooms+bathrooms+bed_type+review_scores_rating+minimum_nights+number_of_reviews+instant_bookable+is_business_travel_ready+beds+extra_people+review_scores_location+review_scores_value+review_scores_cleanliness+review_scores_accuracy+cancellation_policy+guests_included+host_response_time+extra_people+availability_30+ availability_60+ availability_90+ availability_365+calculated_host_listings_count+accommodates+cleaning_fee+reviews_per_month+latitude+longitude+host_response_rate+host_has_profile_pic+host_identity_verified+require_guest_phone_verification+require_guest_profile_picture, data2)
pred9 = predict(model9, newdata = scoringData2)
submissionFile = data.frame(id=scoringData2$id, price=exp(pred))

In the random forest, I started Random Forest with a small tree number of 32, and then apply it to big tree number of 1000. This strategy saved me a lot of time because along the process, I kept meeting new data problems such as different factor levels, and had go back to fix data. Random Forest boosted my RMSE to 56.39.

After continuous experimenting, I decided to leverage “weekly_price” by splitting data into 2 sets: one with “weekly_price” and another one without “weekly_price”, and trained two models accordingly. I also separated my test data the same way, and predicted them separately.

Last but not least, even though Log transformation worked well in Linear Model, I finally realized that log transformation did not work well in the Random Forest. I removed the log transformation in the Random Forest. My Random Forest model got me the best score of 53.5.

###split train data based on weekly_price
data3 <- data2[is.na(data2$weekly_price),]
data4 <- data2[!is.na(data2$weekly_price),]

###split test data based on weekly_price. 
scoringData2_1 <- scoringData2[is.na(scoringData2$weekly_price),]
scoringData2_2 <- scoringData2[!is.na(scoringData2$weekly_price),]

##random forest model
rf1 <- randomForest(price~host_response_time+host_is_superhost+host_listings_count+host_total_listings_count+host_has_profile_pic+host_identity_verified+neighbourhood_group_cleansed+latitude+longitude+is_location_exact+property_type+room_type+accommodates+bathrooms+bedrooms+beds+bed_type+cleaning_fee+guests_included+extra_people+minimum_nights+maximum_nights+availability_30+ availability_60+ availability_90+ availability_365+number_of_reviews+review_scores_rating+review_scores_accuracy+review_scores_cleanliness+review_scores_checkin+review_scores_communication+review_scores_location+review_scores_value+instant_bookable+is_business_travel_ready+cancellation_policy+require_guest_profile_picture+require_guest_phone_verification+calculated_host_listings_count+reviews_per_month+property_type,
                    data=data3, num.trees = 500)
rf2 <- randomForest(price~weekly_price+host_response_time+host_is_superhost+host_listings_count+host_total_listings_count+host_has_profile_pic+host_identity_verified+neighbourhood_group_cleansed+latitude+longitude+is_location_exact+property_type+room_type+accommodates+bathrooms+bedrooms+beds+bed_type+cleaning_fee+guests_included+extra_people+minimum_nights+maximum_nights+availability_30+ availability_60+ availability_90+ availability_365+number_of_reviews+review_scores_rating+review_scores_accuracy+review_scores_cleanliness+review_scores_checkin+review_scores_communication+review_scores_location+review_scores_value+instant_bookable+is_business_travel_ready+cancellation_policy+require_guest_profile_picture+require_guest_phone_verification+calculated_host_listings_count+reviews_per_month+property_type,
                    data=data4, ntree = 500)

Result Summary

My best result is with Random Forest, ntree= 500. The RMSE is 53.51.

Learning and discussion

What I did right:

Before I proceeded to feature selection, I well understood all the data, their meaning, and their potential relationships with “price” variable. So I was able to use my business acumen to select good features to include in my model.
I well cleaned and imputed missing data by imputation. I also searched data transformation/cleaning methods online and read Kernals on other similar competitions to learn more techniques to solve problems.
I fitted many different models and did not give up easily.

What I did wrong:

Random Forest Cross Validation took me 12 hours to finish. In the middle of the waiting, I didn’t know if the program was still running or it had already break down without a notification. I should have experimented it with a small data set first to make sure it works, and then apply it to the full data set.
I didn’t remove log transformation on price until the very last step. This was because I based all my next steps on previous data cleaning, and did not review all my features again before I applied new models.

If I do it again, I will:

Now my model has over 40 variables. There is space for feature reduction. If I have more time, I would work on reducing number of features, which might help with overfitting problem and improve RMSE.