Content
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.
This data file includes all the information needed to find out more about hosts, geographical availability, and the metrics necessary to make predictions and draw conclusions.
This public dataset is part of Airbnb, and the original source can be found on this website.
variable | nº missing values |
---|---|
id | 0 |
name | 16 |
host_id | 0 |
host_name | 21 |
neighbourhood_group | 0 |
neighbourhood | 0 |
latitude | 0 |
longitude | 0 |
room_type | 0 |
price | 0 |
minimum_nights | 0 |
number_of_reviews | 0 |
last_review | 10052 |
reviews_per_month | 10052 |
calculated_host_listings_count | 0 |
availability_365 | 0 |
We can already detect missing values in some of the columns. We will handle them in two ways: the columns last_review and reviews_per_month will be excluded, while name and host_name will be filled with an "unknown" string.
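A minimal sketch of this step with dplyr, assuming the CSV has already been read into a data frame called airbnb (the name is ours, not from the original notebook):

```r
library(dplyr)

# Drop the two columns with ~10,000 missing values and fill the
# missing names with an "unknown" placeholder
airbnb <- airbnb %>%
  select(-last_review, -reviews_per_month) %>%
  mutate(
    name      = ifelse(is.na(name), "unknown", name),
    host_name = ifelse(is.na(host_name), "unknown", host_name)
  )
```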
How many unique hosts do we have?
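For example, with the same assumed airbnb data frame:

```r
library(dplyr)

# Number of distinct host IDs in the dataset
n_distinct(airbnb$host_id)
```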
Number of unique hosts: |
---|
37457 |
It seems there are hosts that have more than one apartment (listing ID). Let’s find out which hosts have the most apartments:
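One way to do this, counting listings per host (sketch, same assumed data frame):

```r
library(dplyr)

# Number of listings per host, largest first
airbnb %>%
  count(host_id, sort = TRUE) %>%
  head(10)
```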
There are 4 host IDs with more than 100 listings each; they are probably companies (and likely all hosts with more than 25 listings are too). Let’s find out their names:
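Counting by host ID and host name at the same time recovers the names (a sketch; the original code is not shown):

```r
library(dplyr)

# Hosts with more than 100 listings, together with their names
airbnb %>%
  count(host_id, host_name, sort = TRUE) %>%
  filter(n > 100)
```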
The next question: where does Sonder (NYC) have most of its apartments? (*NOTE: this may be related to the Sonder in 4th position; we could check whether their locations match, but that will be a future step.)
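A plausible way to plot this with ggplot2, filtering on host_name (the original may have filtered on host_id instead):

```r
library(dplyr)
library(ggplot2)

# All listings whose host is Sonder (NYC), plotted by coordinates
sonder <- airbnb %>% filter(host_name == "Sonder (NYC)")

ggplot(sonder, aes(x = longitude, y = latitude)) +
  geom_point(alpha = 0.6, colour = "steelblue") +
  labs(title = "Sonder (NYC) listings in New York")
```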
It seems this host is focused on only one small part of New York. We can check whether this pattern is also common for the hosts with more than 50 listing IDs:
They will probably be placed in the most expensive area of New York. To check that, we will create a heat map:
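One reading of this step is a 2-D density heat map of the listings owned by the large hosts; a sketch under that assumption:

```r
library(dplyr)
library(ggplot2)

# Density "heat map" of listings owned by hosts with more than 50 listings
big_hosts <- airbnb %>%
  add_count(host_id, name = "n_listings") %>%
  filter(n_listings > 50)

ggplot(big_hosts, aes(x = longitude, y = latitude)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
  scale_fill_viridis_c() +
  labs(title = "Where the largest hosts concentrate their listings")
```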
The hypothesis seems to be correct. I would bet that if we look at the places with higher prices, we will see a clear relation to this part of New York. To be able to compare prices between different zones, we will differentiate the room types by colour:
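A minimal version of such a plot, colouring each listing's location by its room type (the original chart may differ):

```r
library(ggplot2)

# Listings coloured by room type across the city
ggplot(airbnb, aes(x = longitude, y = latitude, colour = room_type)) +
  geom_point(alpha = 0.3, size = 0.7) +
  labs(title = "Room types across NYC", colour = "Room type")
```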
In that graph we are missing information about the distribution of the data by room type and neighbourhood group. We will use violin plots to see it:
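A sketch of the violin plots; the y-axis cut-off is our assumption, added only so the extreme prices do not flatten the shapes:

```r
library(ggplot2)

# Price distribution per neighbourhood group, split by room type
ggplot(airbnb, aes(x = neighbourhood_group, y = price, fill = room_type)) +
  geom_violin() +
  coord_cartesian(ylim = c(0, 1000)) +
  labs(x = "Neighbourhood group", y = "Price ($)", fill = "Room type")
```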
There is a clear effect of very high prices in our data. We will filter out all the apartments with a price higher than $500, which represents 2.53% of the data.
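For example:

```r
library(dplyr)

# Keep only listings priced at $500 or less (removes ~2.5% of the rows)
airbnb <- airbnb %>% filter(price <= 500)
```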
Before we start building the model, we would like to know whether we have enough information from all areas to create a model that is representative of all of them.
The map above shows us the distribution of the data by neighbourhood group, but a table makes it much clearer:
neighbourhood_group | Number of observations | Perc. of the data |
---|---|---|
Manhattan | 20756 | 44 % |
Brooklyn | 19825 | 42 % |
Queens | 5630 | 12 % |
Bronx | 1082 | 2 % |
Staten Island | 367 | 1 % |
We can conclude there is a clear bias in the data: the Bronx and Staten Island are completely underrepresented, while Manhattan and Brooklyn are overrepresented. We will have to check whether we need to apply some technique to avoid this bias in our model.
As there is a variable indicating the minimum number of nights you have to stay, we will create a new variable called total_price that takes that into account:
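One plausible definition (the exact formula is not shown in the original): the nightly price multiplied by the minimum number of nights.

```r
library(dplyr)

# Total price of a stay: nightly price times the minimum number of nights
airbnb <- airbnb %>%
  mutate(total_price = price * minimum_nights)
```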
We can already say that location and room type have a strong influence on price. We are going to create a linear model to predict prices:
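The metrics below look like caret output, so a sketch with caret follows; the exact predictors and the resampling scheme are assumptions, not the original notebook's choices:

```r
library(caret)

set.seed(123)

# Linear model for price with 5-fold cross-validation
fit <- train(
  price ~ neighbourhood_group + room_type + minimum_nights + availability_365,
  data      = airbnb,
  method    = "lm",
  trControl = trainControl(method = "cv", number = 5)
)

fit$results[, c("RMSE", "Rsquared", "MAE")]
```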
## RMSE Rsquared MAE
## 66.6346235 0.3866963 45.8405994
Conclusion: bad results.
Why? Let’s do an error check:
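A simple error check is to plot predicted against observed prices (a sketch, reusing the caret fit above):

```r
library(ggplot2)

# Compare predicted and observed prices to see where the model fails
airbnb$pred_price <- predict(fit, newdata = airbnb)

ggplot(airbnb, aes(x = price, y = pred_price)) +
  geom_point(alpha = 0.2) +
  geom_abline(colour = "red") +
  labs(x = "Observed price ($)", y = "Predicted price ($)")
```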
Okay, we can extract several conclusions from this graph:
We will use a correlation matrix to find the variables correlated with price:
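For example, correlating price with the remaining numerical columns:

```r
library(dplyr)

# Correlation between price and the numerical variables
airbnb %>%
  select(price, minimum_nights, number_of_reviews,
         calculated_host_listings_count, availability_365) %>%
  cor(use = "complete.obs") %>%
  round(2)
```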
There is no linear correlation with our numerical variables. It’s time to mine our data to extract the maximum amount of information. We will start with the names of the apartments.
First, we will find the most used words in the names to create new features to predict the price (we have used the following post to extract them:
We will keep all the words that appear more than 50 times:
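A sketch of this feature-engineering step using tidytext (the original post's exact approach is not shown; the word_ prefix for the dummy columns is our naming):

```r
library(dplyr)
library(tidytext)

# Words used more than 50 times in listing names (stop words removed)
frequent_words <- airbnb %>%
  select(id, name) %>%
  unnest_tokens(word, name) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  filter(n > 50)

# One dummy column per frequent word: 1 if the listing name contains it
for (w in frequent_words$word) {
  airbnb[[paste0("word_", w)]] <-
    as.integer(grepl(w, tolower(airbnb$name), fixed = TRUE))
}
```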
Let’s find out whether a correlation matrix can help us identify the most relevant of these new variables:
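One way to produce a result like the tibble below; the 0.3 cut-off is an arbitrary assumption on our part:

```r
library(dplyr)
library(tibble)

# Correlation of each word dummy (and price itself) with price,
# keeping only variables above an arbitrary threshold
airbnb %>%
  select(price, starts_with("word_")) %>%
  cor() %>%
  as_tibble(rownames = "var") %>%
  select(price, var) %>%
  filter(abs(price) > 0.3) %>%
  arrange(desc(abs(price)))
```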
## # A tibble: 1 x 2
## price var
## <dbl> <chr>
## 1 1 price
We can conclude there is no clear word in the name variable that helps us improve our model. By adding all these new features we are probably causing overfitting in our model.
Let’s plot the errors of our model with dummy variables:
The performance of our model has not improved much; next steps would be to create models for relevant price ranges and see whether we can classify listings into those groups correctly.