Data description

The following information is copied and pasted from the Kaggle website at this link.

Context

Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

Content

This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions.

Acknowledgements

This public dataset is part of Airbnb, and the original source can be found on this website.

Inspiration

What can we learn about different hosts and areas?
What can we learn from predictions? (ex: locations, prices, reviews, etc)
Which hosts are the busiest and why?
Is there any noticeable difference of traffic among different areas and what could be the reason for it?

Data quality

Checking missing data

variable	nº missing values
id	0
name	16
host_id	0
host_name	21
neighbourhood_group	0
neighbourhood	0
latitude	0
longitude	0
room_type	0
price	0
minimum_nights	0
number_of_reviews	0
last_review	10052
reviews_per_month	10052
calculated_host_listings_count	0
availability_365	0

We can already detect missing values in some of the columns. There are three things we can do here:

Exclude the column
Exclude the row
Fill them

We will exclude the columns last_review and reviews_per_month, each of which has 10052 missing values. For name and host_name, which have only a few missing values, we will fill the gaps with a placeholder string such as "unknown".
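The original code for this step is not shown, so what follows is only a minimal sketch of the idea in plain Python (standard library only). The toy rows, the helper names `count_missing` and `clean`, and the placeholder value `"unknown"` are assumptions made for illustration.

```python
# Toy rows standing in for the listings table; None marks a missing value.
listings = [
    {"id": 1, "name": "Cozy loft", "host_name": "Ana",
     "last_review": "2019-05-01", "reviews_per_month": 0.4},
    {"id": 2, "name": None, "host_name": "Ben",
     "last_review": None, "reviews_per_month": None},
    {"id": 3, "name": "Sunny room", "host_name": None,
     "last_review": None, "reviews_per_month": None},
]


def count_missing(rows):
    """Return {column: number of missing values} over all rows."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column, value in row.items():
            if value is None:
                counts[column] += 1
    return counts


def clean(rows, drop=("last_review", "reviews_per_month"),
          fill=("name", "host_name"), placeholder="unknown"):
    """Drop the `drop` columns and fill gaps in the `fill` columns."""
    cleaned = []
    for row in rows:
        kept = {c: v for c, v in row.items() if c not in drop}
        for column in fill:
            if kept[column] is None:
                kept[column] = placeholder  # assumed placeholder value
        cleaned.append(kept)
    return cleaned


missing_before = count_missing(listings)
cleaned_listings = clean(listings)
missing_after = count_missing(cleaned_listings)
```

After cleaning, `missing_after` should report zero missing values for every remaining column.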

Analysing hosts and areas

Hosts analysis

How many unique hosts do we have?

Number of unique hosts:
37457

It seems there are hosts that have more than one apartment (listing ID). Let’s find out which hosts have the most apartments:

There are 4 host IDs with more than 100 unique listing IDs. They probably belong to companies (as, probably, do all host IDs with more than 25 listings). Let’s find out their names:
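As a rough illustration of the counting behind this step, the sketch below tallies listings per host in plain Python. The toy data, the threshold of 1 (the text above uses 100 and 25) and the helper name `hosts_above` are assumptions; the original code is not shown.

```python
from collections import Counter

# Toy (listing_id, host_id, host_name) records, invented for illustration.
listings = [
    (1, 10, "Ana"), (2, 10, "Ana"), (3, 10, "Ana"),
    (4, 20, "Ben"),
    (5, 30, "Cleo"), (6, 30, "Cleo"),
]

# Number of listings attached to each host ID.
listings_per_host = Counter(host_id for _, host_id, _ in listings)
n_unique_hosts = len(listings_per_host)

# Map each host ID to its name so the busiest hosts can be reported by name.
host_names = {host_id: name for _, host_id, name in listings}


def hosts_above(threshold):
    """Return (host_name, n_listings) for hosts with more than `threshold` listings."""
    return [
        (host_names[host_id], count)
        for host_id, count in listings_per_host.most_common()
        if count > threshold
    ]


big_hosts = hosts_above(1)
```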

The questions that follow:

Should I exclude these hosts from our analysis? Why?
Can I find more information about them? Maybe I can place them on a map to see where they are.

Let’s find out where Sonder (NYC) has most of its apartments (NOTE: it may be related to the Sonder host in 4th position; we could check that by comparing where their listings are located, but that will be a future step):

It seems this host is focused on only one small part of New York. We can check whether this pattern is also common among the hosts with more than 50 listing IDs:

They are probably located in the most expensive area of New York. To check that, we will create a heat map:

The hypothesis seems to be correct. I would bet that if we look at the places with higher prices, we will see a clear relation to this part of New York. To be able to compare prices between different zones, we will distinguish the room types by color:

That graph does not tell us enough about the distribution of the data by room type and neighbourhood group. We will use violin plots to see it:

Outlier extraction

High prices have a clear effect on our data. We will filter out all the apartments with a price higher than $500, which represent 2.53% of the data.
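The filter itself is simple; a minimal Python sketch is shown here with made-up prices. Only the $500 cap comes from the text above, and the variable names are hypothetical.

```python
# Toy nightly prices in dollars, invented for illustration.
prices = [60, 85, 120, 150, 199, 250, 480, 500, 750, 2000]

PRICE_CAP = 500  # cap taken from the text: drop listings priced above $500

# Keep listings at or below the cap ("higher than 500" is what gets removed).
kept = [price for price in prices if price <= PRICE_CAP]

# Share of the data removed by the filter, as a percentage.
removed_share = 100 * (len(prices) - len(kept)) / len(prices)
```

Reporting `removed_share` alongside the filter makes it easy to see how much data the cap costs.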

Check bias in the data

Before we start with the model creation, we would like to know if we have enough information from all the areas to create a model representative enough for all of them.

The map above shows us the distribution of the data by neighbourhood group, but a table makes it much clearer:

neighbourhood_group	Number of observations	Perc. of the data
Manhattan	20756	44 %
Brooklyn	19825	42 %
Queens	5630	12 %
Bronx	1082	2 %
Staten Island	367	1 %

We can conclude that there is a clear bias in the data. The Bronx and Staten Island are heavily underrepresented, while Manhattan and Brooklyn are overrepresented. We will have to check whether we need to apply some technique to avoid bias in our model.
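The percentages in the table above can be reproduced from the observation counts. The short Python sketch below uses the counts exactly as reported in the table; the variable names are illustrative.

```python
# Observation counts per neighbourhood group, copied from the table above.
counts = {
    "Manhattan": 20756,
    "Brooklyn": 19825,
    "Queens": 5630,
    "Bronx": 1082,
    "Staten Island": 367,
}

total = sum(counts.values())

# Share of the data per group, rounded to whole percentages as in the table.
shares = {group: round(100 * n / total) for group, n in counts.items()}
```

Because each share is rounded independently, the rounded percentages do not have to add up to exactly 100.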

Creating the models

Price prediction

Since the variable minimum_nights gives the minimum number of nights a guest has to stay, we will create a new variable called total_price that takes it into account:
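The exact definition of total_price is not given in the text. One plausible reading, assumed here purely for illustration, is the nightly price multiplied by minimum_nights, that is, the minimum amount a guest has to pay for a stay. A Python sketch under that assumption:

```python
# Toy listings, invented for illustration.
listings = [
    {"id": 1, "price": 100, "minimum_nights": 1},
    {"id": 2, "price": 80, "minimum_nights": 3},
    {"id": 3, "price": 150, "minimum_nights": 30},
]


def add_total_price(rows):
    """Return copies of the rows with an added total_price column.

    ASSUMPTION: total_price = price * minimum_nights. The source does not
    state the formula, so this is only one possible definition.
    """
    return [
        {**row, "total_price": row["price"] * row["minimum_nights"]}
        for row in rows
    ]


with_total = add_total_price(listings)
```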

We can already say that location and room type have a strong influence on price. We are going to create a linear model to predict the prices:

##       RMSE   Rsquared        MAE 
## 66.6346235  0.3866963 45.8405994
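For reference, the three metrics reported above can be computed as in the Python sketch below. It assumes that Rsquared is the squared Pearson correlation between observed and predicted values, which the source does not confirm. The toy numbers are invented and are not the model's actual predictions.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Squared Pearson correlation (ASSUMED definition of Rsquared)."""
    mean_o = sum(observed) / len(observed)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Toy observed and predicted prices, invented for illustration.
observed = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

metrics = {
    "RMSE": rmse(observed, predicted),
    "Rsquared": r_squared(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two usually points to a few badly predicted listings.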

Conclusion: bad results. The model explains only about 39% of the variance in price (Rsquared of 0.39).

Why? Let’s do an error check:

Okay, we can draw several conclusions from this graph:

We need to find a numeric variable that helps explain a bigger part of the price.
Our model tends to underpredict the price. We will have to check the distribution of our errors with a Q-Q plot.
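Both checks can be sketched in a few lines of Python. Residuals are taken as observed minus predicted, so a positive mean residual indicates underprediction. The Q-Q points pair standardized residuals with standard normal quantiles. The toy values and the plotting positions `(i + 0.5) / n` are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev

# Toy observed and predicted prices, invented for illustration.
observed = [80, 120, 150, 300, 450]
predicted = [90, 110, 140, 220, 300]

# Residual > 0 means the model predicted less than the real price.
residuals = [o - p for o, p in zip(observed, predicted)]
mean_residual = mean(residuals)
share_underpredicted = sum(r > 0 for r in residuals) / len(residuals)


def qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(values)
    mu, sigma = mean(values), stdev(values)
    sample = sorted((v - mu) / sigma for v in values)
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


points = qq_points(residuals)
```

If the errors were normally distributed, the points would fall close to the diagonal; systematic departures in the upper tail would be consistent with underpredicted expensive listings.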

2nd round: which groups am I not explaining?

We will use a correlation matrix to find the variables correlated with the price:
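The price column of such a matrix can be sketched in plain Python as the Pearson correlation of each numeric variable with price. The toy values below are invented; only the column names come from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy numeric columns, invented for illustration.
columns = {
    "price": [60, 120, 90, 300, 150],
    "minimum_nights": [1, 3, 2, 30, 1],
    "number_of_reviews": [45, 2, 10, 0, 7],
    "availability_365": [365, 20, 180, 90, 0],
}

# Correlation of every other numeric variable with price.
correlations = {
    name: pearson(values, columns["price"])
    for name, values in columns.items()
    if name != "price"
}

# Variables ordered from strongest to weakest absolute correlation.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

Pearson correlation only captures linear relationships, so a low value does not rule out a nonlinear effect on price.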

There is no linear correlation between the price and our numerical variables. It’s time to mine our data to extract as much information as possible. We will start with the names of the apartments.

Mining the names to extract insights

First, we will find the most used words in the names to create new features to predict the price (we have used the following post to extract them):

We will keep all the words that appear more than 50 times, each as a new variable:
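A minimal Python sketch of this kind of word mining is shown below. The tokenizer, the small stop-word list and the toy names are assumptions. The threshold is lowered to 1 so the toy data produces a result, whereas the text above uses 50.

```python
import re
from collections import Counter

# Toy listing names, invented for illustration.
names = [
    "Cozy room in Brooklyn",
    "Cozy loft near park",
    "Luxury loft in Manhattan",
    "Sunny room",
]

STOP_WORDS = {"in", "near", "the", "a"}  # hypothetical stop-word list
MIN_COUNT = 1  # the text above keeps words appearing more than 50 times


def tokenize(name):
    """Lowercase a name and split it into alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", name.lower()) if w not in STOP_WORDS]


# Count how often each word appears across all names.
word_counts = Counter(word for name in names for word in tokenize(name))

# Keep only the words that appear more than MIN_COUNT times.
kept_words = sorted(w for w, n in word_counts.items() if n > MIN_COUNT)

# One dummy (0/1) feature per kept word and per listing.
dummies = [
    {word: int(word in tokenize(name)) for word in kept_words}
    for name in names
]
```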

Let’s see whether a correlation matrix can help us find the most relevant variables:

## # A tibble: 1 x 2
##   price var  
##   <dbl> <chr>
## 1     1 price

Creating the 2nd model with kNN

We can conclude that no word in the name variable clearly helps us improve our model. By adding all these new features, we are probably causing our model to overfit.
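The model code is not shown, so the sketch below only illustrates the general idea of k-nearest-neighbours regression in plain Python: the predicted price is the mean price of the k closest training listings. The two-dimensional toy features, the Euclidean distance and `k=3` are assumptions, not the author's actual setup.

```python
import math


def knn_predict(train_x, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its target.
    distances = sorted(
        (math.dist(features, query), target)
        for features, target in zip(train_x, train_y)
    )
    nearest = distances[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy feature vectors and prices, invented for illustration.
train_x = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
train_y = [100, 120, 110, 400]

prediction_near_cluster = knn_predict(train_x, train_y, (0.1, 0.1), k=3)
prediction_far_point = knn_predict(train_x, train_y, (5.0, 5.0), k=1)
```

Because kNN relies on distances, features on very different scales (for example coordinates and 0/1 word dummies) usually need to be rescaled first, and many weakly informative dummies can dilute the distance, which is consistent with the overfitting concern above.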

Let’s plot the errors of our model with dummy variables:

The performance of our model has not improved much. The next steps would be to create models for relevant price groups and see whether we can classify the different groups correctly.
