INTRODUCTION

Problem Statement

A real estate company specializes in purchasing properties in New York City to rent out as short-term rentals. The company has engaged my firm to build a data product and provide conclusions that help it understand which zip codes would generate the most profit on short-term rentals within New York City. Based on its own research, the company has concluded that two-bedroom properties are the most profitable. As a consultant representing my firm, I should recommend the zip codes that are best to invest in, thereby maximising profit.

Data Source

Data is provided from two sources.

Zillow data - median cost of two-bedroom properties by zipcode.

Airbnb data - detailed information about property listings, including location, number of bedrooms, reviews, price, availability, and description.

Assumptions

The following basic assumptions are made for the analysis:

  1. The investor will pay for the property in cash (i.e. no mortgage/interest rate will need to be accounted for).
  2. The time value of money discount rate is 0% (i.e. $1 today is worth the same 100 years from now).
  3. All properties and all square feet within each locale can be assumed to be homogeneous (i.e. a 1000 square foot property in a locale such as Bronx or Manhattan generates twice the revenue and costs twice as much as any other 500 square foot property within that same locale.)
  4. Occupancy rate of 75% throughout the year for Airbnb properties.
  5. 30% of the cleaning fee is treated as revenue when calculating revenues.
  6. Airbnb charges a convenience fee for listing a property on its website. (These assumptions feed into the revenue sketch below.)
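
As a rough illustration of how these assumptions could translate into an annual revenue figure for a single listing, the sketch below combines them. The helper name, the number of stays per year, and the example price and cleaning fee are made-up values for illustration only; they are not taken from the data.

# Minimal sketch (assumptions only): annual revenue for one listing under
# 75% occupancy, with 30% of the cleaning fee counted as revenue
estimate_annual_revenue <- function(nightly_price, cleaning_fee,
                                    occupancy = 0.75,       # assumption 4
                                    cleaning_share = 0.30,  # assumption 5
                                    stays_per_year = 50) {  # illustrative guess, not from the data
  booked_nights <- 365 * occupancy
  nightly_price * booked_nights + cleaning_share * cleaning_fee * stays_per_year
}

# Example: a $150/night listing with a $60 cleaning fee
estimate_annual_revenue(nightly_price = 150, cleaning_fee = 60)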

PACKAGES REQUIRED

The code below checks whether the required packages are already installed on the machine. If a package is already installed, it is not reinstalled, thereby saving time. It then loads each package that is not already loaded.

packages <- c(
  "ggplot2",    # for awesome graphics
  "GGally",     # for data sets and plotFit() functions
  "data.table", # for fread(data reading) 
  "plyr",       # for data cleaning
  "dplyr",      # for data manipulation
  "DT",         # for effective display of data tables
  "sqldf",      # for sql queries
  "DMwR",       # for imputing missing values using KNN
  "tibble",     # for tibble data frame
  "plotly",     # for visualisations
  "class"       # for KNN model
)
# Install required (CRAN) packages

ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
ipak(packages)
##    ggplot2     GGally data.table       plyr      dplyr         DT 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
##      sqldf       DMwR     tibble     plotly      class 
##       TRUE       TRUE       TRUE       TRUE       TRUE

READING DATASETS

# Reading the listings dataset
air_bnb <- fread("C:/Users/avina/Desktop/Assignments/Projects/Airbnb/listings.csv",na.strings = "",stringsAsFactors = T)
dim(air_bnb)
## [1] 40753    95
# Splitting the data into separate numerical and categorical columns
num_var <- select_if(air_bnb, is.numeric)
cat_var <- select_if(air_bnb,is.factor)

The dataset contains 40753 observations and 95 columns, of which 29 are numerical, 64 are character variables, and the remaining 2 are logical variables, according to the initial loading of the dataset. This initial inspection is done before any variable conversions.

Upon initial inspection of the variables, the columns that contain the "$" symbol are read in as character variables. So, we need to convert these dollar-value columns into numerical columns. These conversions are done after selecting the final variables.
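
As a quick illustration of the conversion applied later, a dollar-formatted string can be stripped of the "$" and "," characters and then cast to numeric (the value below is a made-up example):

# Illustrative only: convert a dollar-formatted string to a number
as.numeric(gsub("[$,]", "", "$1,250.00"))  # returns 1250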

Domain Understanding

Going by real estate industry knowledge, I have considered the following fields to be important for further analysis:

neighbourhood_group_cleansed, city, state, zipcode, latitude, longitude, bedrooms, price, monthly price, cleaning fee, availability within 365 days, availability within 30, square feet

#Selecting the important columns for initial analysis
air_bnb_cols <- air_bnb %>%
                      select("neighbourhood_group_cleansed", "city",
                             "state", "zipcode", "latitude", "longitude",
                             "bedrooms", "price", "monthly_price",
                             "cleaning_fee", "availability_365",
                             "availability_30", "square_feet")

DATA UNDERSTANDING

First, let us look at the percentage of missing values in the selected columns, so that we can remove some columns if the percentage of missing values is too high.

# Plotting the proportion of missing values using bar graphs.

missing_data <- as.data.frame(sort(sapply(air_bnb_cols, function(x) sum(is.na(x))),decreasing = T))
missing_data <- (missing_data/nrow(air_bnb_cols))*100

colnames(missing_data)[1] <- "Missing_values_Percentage"
datatable(missing_data)
missing_data$features <- rownames(missing_data)

ggplot(missing_data[missing_data$Missing_values_Percentage > 0,],
       aes(reorder(features,-Missing_values_Percentage),
           Missing_values_Percentage,fill= Missing_values_Percentage)) +
  geom_bar(stat="identity") +theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), legend.position = "none") + 
  ylab("Percentage of missingvalues") +
  xlab("Feature") + ggtitle("Understanding Missing Data")

From the above plot, we can see that square_feet has 98% missing values, monthly_price has 82%, cleaning_fee has 31%, zipcode has 1.4%, and bedrooms has 0.17%.

As the missing-value percentages are very high in the square_feet and monthly_price columns, we can remove those columns. Since we are concerned with two-bedroom apartments and their zip codes, the missing values in bedrooms, zipcode, and cleaning_fee need to be imputed.

Before imputing and confirming that there are no missing values left in the data, check whether any columns contain special symbols or abnormal values.
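
One quick check, for example, is the length of each zipcode value; a minimal sketch (assuming the air_bnb_cols data from above) is:

# Tabulate the length of each zipcode string to spot abnormal entries
table(nchar(as.character(air_bnb_cols$zipcode)), useNA = "ifany")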

In the zipcode column there are values longer than 5 characters. Since a standard zip code has 5 digits, we have to inspect these values and decide whether to keep only the first 5 digits. On checking them against Google, the codes longer than 5 characters appear to be valid extended codes whose first 5 digits are the standard zip code. So we need to trim them from position 1 to position 5.

# Checking for zipcodes with length greater than 5 using a SQL query

zip_abnormal <- sqldf("select * from air_bnb_cols where length(zipcode) > 5")
datatable(zip_abnormal, options = list(scrollX = TRUE))

DATA CLEANING

From the above, we can see that there are zip codes with more than 5 digits. On investigating further, the first 5 digits of these abnormal zip codes are valid, so we trim each zip code to its first 5 digits.

*** Zipcode ***

#Trim the length of the zipcode to 5.

air_bnb_cols$zipcode <- substr(air_bnb_cols$zipcode, start=1, stop=5)

Checking data types and removing columns that have a high percentage of missing values

# Removing columns monthly_price and square_feet as both contain more than 80% missing values
air_bnb_upd <- air_bnb_cols %>%
                            select(-c(monthly_price,square_feet))
str(air_bnb_upd)
## Classes 'data.table' and 'data.frame':   40753 obs. of  11 variables:
##  $ neighbourhood_group_cleansed: Factor w/ 5 levels "Bronx","Brooklyn",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ city                        : Factor w/ 250 levels ""," Crown Heights,NY",..: 36 36 36 63 36 36 63 36 36 36 ...
##  $ state                       : Factor w/ 6 levels "MP","NJ","NY",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ zipcode                     : chr  "10464" "10464" "10464" "10464" ...
##  $ latitude                    : num  40.9 40.9 40.8 40.8 40.9 ...
##  $ longitude                   : num  -73.8 -73.8 -73.8 -73.8 -73.8 ...
##  $ bedrooms                    : int  1 1 3 1 1 0 1 1 1 1 ...
##  $ price                       : Factor w/ 583 levels "$1,000.00","$1,021.00",..: 580 151 258 60 483 60 535 331 570 60 ...
##  $ cleaning_fee                : Factor w/ 171 levels "$0.00","$10.00",..: NA NA 3 147 46 98 92 63 106 120 ...
##  $ availability_365            : int  170 180 365 335 352 129 306 306 144 106 ...
##  $ availability_30             : int  24 30 30 8 17 23 15 5 17 12 ...
##  - attr(*, ".internal.selfref")=<externalptr>

From the above data types, we need to convert the following attributes.

zipcode is to be converted to a factor.

price and cleaning_fee are to be converted to numeric data types. This requires stripping the "$" and "," symbols from the values first.

*** Price and Cleaning Fee ***

# Removing $ symbol from the dollar value columns
# Instead of writing separate lines of code for each column, apply() strips the dollar symbol and commas from both price and cleaning_fee at once.

air_bnb_upd[,c("price","cleaning_fee")] <- as.data.frame(apply(air_bnb_upd[,c("price","cleaning_fee")], 2,
                                                               function(x) gsub("\\$", "", x)))

air_bnb_upd[,c("price","cleaning_fee")] <- as.data.frame(apply(air_bnb_upd[,c("price","cleaning_fee")], 2,
                                                               function(x) gsub(",", "", x)))

Now, convert price and cleaning_fee to numeric data types, as these columns are stored as factors in the dataset, and convert zipcode to a factor data type.

You need to use as.character before as.numeric. This is because factors are stored internally as integer codes with a lookup table that gives the factor level labels. Using as.numeric alone would return those internal integer codes rather than the actual values.
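
A small demonstration of this pitfall (the values are made up):

# Factors store integer level codes, not the numeric values of their labels
f <- factor(c("150", "80", "150"))
as.numeric(f)                # internal codes: 1 2 1
as.numeric(as.character(f))  # actual values: 150 80 150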

# Converting into factor and numeric data types
fact_data <- air_bnb_upd %>%
                            select(neighbourhood_group_cleansed:zipcode) %>%
                            mutate_if(is.character,as.factor)
num_data <- air_bnb_upd %>%
                            select(latitude:availability_30)

num_data$price <- as.numeric(as.character(num_data$price))
num_data$cleaning_fee <- as.numeric(as.character(num_data$cleaning_fee))

air_bnb_final <- cbind(num_data,fact_data)

SUMMARY STATISTICS

#Summary stats for numerical columns
summary(num_data)
##     latitude       longitude         bedrooms          price        
##  Min.   :40.50   Min.   :-74.24   Min.   : 0.000   Min.   :   10.0  
##  1st Qu.:40.69   1st Qu.:-73.98   1st Qu.: 1.000   1st Qu.:   70.0  
##  Median :40.72   Median :-73.96   Median : 1.000   Median :  100.0  
##  Mean   :40.73   Mean   :-73.96   Mean   : 1.149   Mean   :  145.3  
##  3rd Qu.:40.76   3rd Qu.:-73.94   3rd Qu.: 1.000   3rd Qu.:  170.0  
##  Max.   :40.91   Max.   :-73.71   Max.   :10.000   Max.   :10000.0  
##                                   NA's   :69                        
##   cleaning_fee   availability_365 availability_30 
##  Min.   :  0.0   Min.   :  0.0    Min.   : 0.000  
##  1st Qu.: 25.0   1st Qu.:  0.0    1st Qu.: 0.000  
##  Median : 50.0   Median : 84.0    Median : 2.000  
##  Mean   : 59.9   Mean   :134.1    Mean   : 7.344  
##  3rd Qu.: 80.0   3rd Qu.:277.0    3rd Qu.:12.000  
##  Max.   :600.0   Max.   :365.0    Max.   :30.000  
##  NA's   :12692

From the above summary statistics, we can infer the following: the zipcode, bedrooms, and cleaning_fee columns need missing-value treatment, and the price column should be investigated further, as a maximum price of $10,000 per night looks doubtful.
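
To follow up on the suspicious maximum, one possible check (a sketch using the num_data frame above; the $5,000 cut-off is arbitrary) is to look at the upper tail of price:

# Upper tail of the nightly price distribution and the count of very expensive listings
quantile(num_data$price, probs = c(0.95, 0.99, 0.999, 1))
sum(num_data$price > 5000)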

MISSING VALUES IMPUTATION

Now, all the columns are representing the correct data types. The next step is to deal with missing values.

Zipcode has 567 missing values, bedrooms has 69 missing values, cleaning_fee has 12692 missing values.

As we have latitude and longitude in the data, we can impute the missing zip codes from them. The reason for choosing latitude and longitude is that the coordinates of listings within a particular zipcode will be very close to each other.
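
A rough way to sanity-check this assumption (a sketch using only the rows of air_bnb_final where zipcode is not missing) is to look at how widely the coordinates spread within each zip code:

# Median within-zipcode spread of coordinates; small values support the assumption
air_bnb_final %>%
  filter(!is.na(zipcode)) %>%
  group_by(zipcode) %>%
  summarise(lat_range = max(latitude) - min(latitude),
            lon_range = max(longitude) - min(longitude)) %>%
  summarise(median_lat_range = median(lat_range),
            median_lon_range = median(lon_range))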

KNN Imputation

To impute the missing values, I used KNN imputation. The k-nearest-neighbors (KNN) algorithm can impute missing data by finding the k closest neighbors of an observation with missing values and imputing them from the non-missing values of those neighbors. I used a hold-out validation set to find the optimum k value.

To verify that latitude and longitude are suitable predictors for imputing zip codes, I built a k-nearest-neighbor model on the rows with non-missing zip codes, predicting zipcode from latitude and longitude.

Split the data into 80% training and 20% testing and evaluate the model's performance on the testing data.

# Selected the non missing rows for model building
imp <- air_bnb_final %>%
       select(zipcode,longitude,latitude) %>%
       filter(!is.na(zipcode))
# 80% of the row indices are sampled for training; the rest form the test set
ran <- sample(1:nrow(imp), 0.8 * nrow(imp))

#Normalising the data
nor <- function(x) { 
  (x -min(x))/(max(x)-min(x))
  }
imp_norm <- as.data.frame(lapply(imp[,c("latitude","longitude")], nor))

##extract training set
imp_train <- imp_norm[ran,] 
##extract testing set
imp_test <- imp_norm[-ran,] 
 
# Extract the zipcode labels as vectors to pass to knn() as the class argument
target_category <- imp$zipcode[ran]
test_category <- imp$zipcode[-ran]

*** Different K-values ***

averageaccuracy = vector()

#pick how many k to try
k = 1:15

for (i in 1:length(k)) {
  #build a model for the i-th k-value
  knn.pred <- knn(train = imp_train, test = imp_test, cl = target_category, k = i)
  
  #calculate the accuracy
  averageaccuracy[i] = mean(knn.pred==test_category)
}
averageaccuracy <- averageaccuracy*100
bestk = which.max(averageaccuracy)
bestaccuracy = max(averageaccuracy)

p <- tibble(k = k, accuracy = averageaccuracy) %>%
     ggplot() +
     geom_line(aes(x = k, y = accuracy)) +
     xlab("Number of k") + 
     ylab("Accuracy (%)") +
     scale_x_continuous(breaks = seq(0, length(k), 5)) + 
     ggtitle("Test Set Accuracy as k Increases") + 
     theme(plot.title = element_text(hjust = 0.5))

ggplotly(p, dynamicTicks = F)
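
For reference, the best-performing k from the search above can be printed directly (the imputation below uses k = 7):

# Report the k value with the highest hold-out accuracy
cat("Best k:", bestk, "with accuracy:", round(bestaccuracy, 2), "%\n")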

*** Imputing the missing values for zipcode with 7 nearest neighbors ***

#Imputing the missing values for zipcode with 7 nearest neighbors
mis_data <- air_bnb_final %>%
            select(c("zipcode","latitude","longitude"))
knn_impute <- knnImputation(mis_data, k = 7)
anyNA(knn_impute)
## [1] FALSE
#Update the zip code column of air_bnb_final with non missing values of zipcode from knn_impute data frame
air_bnb_final$zipcode <- knn_impute$zipcode

Now impute the missing values for cleaning_fee. I imputed them using the price and zipcode columns as references, on the assumption that cleaning fees in similar areas are roughly similar and also depend on the price of the stay.
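
A quick check of that assumption (a sketch, using only the rows where cleaning_fee is present) is the correlation between price and cleaning fee:

# Correlation between nightly price and cleaning fee on complete cases only
cor(air_bnb_final$price, air_bnb_final$cleaning_fee, use = "complete.obs")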

*** Imputing the missing values for cleaning fee with 5 nearest neighbors ***

#Imputing the missing values for cleaning_fee
mis_data_fee <- air_bnb_final %>% 
                select(c("zipcode","cleaning_fee","price"))

knn_impute_cl_fee <- knnImputation(mis_data_fee, k = 5)
anyNA(knn_impute_cl_fee)
## [1] FALSE
#Update the cleaning_fee column of air_bnb_final with non missing values of cleaning_fee from knn_impute data frame
air_bnb_final$cleaning_fee <- knn_impute_cl_fee$cleaning_fee
#Reorder columns for easy visualisations
air_bnb_final <- air_bnb_final %>% 
                      select(price:availability_30,bedrooms,latitude:longitude,
                             neighbourhood_group_cleansed:zipcode)

The final cleaned Airbnb dataset contains 40753 observations and 11 variables. Before calculating revenue, let us look at the distribution of all the variables, so that we can find an appropriate way to calculate revenue.

VISUALISATIONS

Correlation Plots

# Correlation plots for numerical variables.
corr_num <- air_bnb_final %>%
            select(price:bedrooms) # After reordering, the numeric columns run contiguously from price to bedrooms

names(corr_num) <- c("price","clean_fee","avail_365","avail_30","bedrooms")

# Correlations
ggcorr(corr_num, palette = "RdBu", label = TRUE)

Histograms

# Histograms
library(tidyr)
p <- corr_num %>%                             # Keep only numeric columns
     gather() %>%                             # Convert to key-value pairs
     ggplot(aes(value)) +                     # Plot the values
     facet_wrap(~ key, scales = "free") +     # In separate panels
     geom_histogram(aes(y =..count..),
                    col="darkblue", 
                    fill = "lightblue") +
     ggtitle("Histograms of numerical variables")
ggplotly(p)

Box-Plots

# Boxplots of all numerical variables by neighbourhood
# This loop takes each numerical variable and plots it against neighbourhood_group
plt <- htmltools::tagList()
for (i in 1:ncol(corr_num)){
  plt[[i]] <- ggplotly(ggplot(air_bnb_final,aes(neighbourhood_group_cleansed, air_bnb_final[[i]],
                                          fill=neighbourhood_group_cleansed))
        + geom_boxplot(outlier.colour="black",outlier.shape=1,outlier.size=1,notch=FALSE)
        + xlab("Neighbourhood")
        + ylab(colnames(air_bnb_final)[i])
        + labs(title = paste("Neighbourhood by",colnames(air_bnb_final)[i]))
        + guides(fill=guide_legend(title=NULL)))
}
plt

As the box plot for the price column is hard to read because of extreme values, I applied a log transformation to the price column.

*** Log Price ***

#Box plot for Log(price) vs Neighbourhood
p <- ggplot(air_bnb_final, aes(x=neighbourhood_group_cleansed,
                               y=log(price),fill=neighbourhood_group_cleansed)) +
     geom_boxplot() +
     xlab("Neighbourhood") +
     ylab("log(price)") +
     labs(title="Log(Price) by Neighbourhood") +
     guides(fill=guide_legend(title=NULL))
ggplotly(p)

*** Count by Neighbourhood ***

# Visualisation by Neighbourhood
p_count <- ggplot(air_bnb_final, aes(x=neighbourhood_group_cleansed,
                                     fill=neighbourhood_group_cleansed)) +
           geom_bar(stat="count", width=0.7) +
           theme_minimal() +
           theme(legend.position="right") +
           labs(title="Distribution of Neighbourhood") +
           guides(fill=guide_legend(title=NULL))
ggplotly(p_count)

Density Plots

*** Availability for year by Neighbourhood ***

# Density plots for availability over 365 days by neighbourhood
p_avail <- ggplot(air_bnb_final, aes(availability_365, fill=neighbourhood_group_cleansed)) +
           geom_density(alpha=0.4) + 
           labs(title="Availability Density Plot by Neighbourhood", x="availability_365", y="Density") +
           guides(fill=guide_legend(title=NULL))
ggplotly(p_avail)

*** Price by Neighbourhood ***

# Density plots for price by neighbourhood
p_avail <- ggplot(air_bnb_final, aes(price, fill=neighbourhood_group_cleansed)) +
           geom_density(alpha=0.4) +
           labs(title="Price by Neighbourhood", x="price", y="Density") +
           guides(fill=guide_legend(title=NULL)) +
           xlim(0, 1000)
ggplotly(p_avail)

*** Cleaning Fee by Neighbourhood ***

# Density plots for cleaning fee by neighbourhood
p_avail <- ggplot(air_bnb_final, aes(cleaning_fee, fill=neighbourhood_group_cleansed)) +
           geom_density(alpha=0.4) +
           labs(title="Cleaning Fee by Neighbourhood", x="cleaning_fee", y="Density") +
           guides(fill=guide_legend(title=NULL)) +
           xlim(0, 1000)
ggplotly(p_avail)

Scatter Plots

# Scatter plots for numerical variables
var_list = combn(names(air_bnb_final)[1:4], 2, simplify=FALSE)

# Make plots.
plot_list <- htmltools::tagList()
for (i in 1:length(var_list)) {
    p <- ggplot(air_bnb_final, aes_string(x=var_list[[i]][1], y=var_list[[i]][2])) +
         geom_point(size=3, aes(colour=neighbourhood_group_cleansed),
                    position = position_jitter(w = 0.05, h = 0)) +
         labs(title = paste(var_list[[i]][1],"vs",var_list[[i]][2]))
    plot_list[[i]] <- ggplotly(p)
}
plot_list
# As our main interest is two-bedroom properties, select only rows that have two bedrooms.
air_bnb_final_2bhk <- air_bnb_final %>%
                      filter(bedrooms==2)