In this project we chose to write about Airbnb, one of the most successful leaders in C2C e-commerce. We investigate what actually affects the market price when you rent out your apartment. We find this interesting because Airbnb is still a young but growing company. Furthermore, the concept raises questions about trust, since you as a host let people into your personal space. Opening your door to people you don’t know forces you to push your boundaries. Trust between host and guest is of the utmost importance for Airbnb to exist and evolve. We want to ask: what should you take into consideration when deciding the price of your apartment on Airbnb? Can we predict what price you should set to fit the market? Also, do user reviews influence the price setting, and if so, which words do you want to find in people’s recommendations, and which words do you not want to see? We look for information on the hosts who rent out their apartments and rooms in Copenhagen. Luckily for us, Tom Slee from Ontario, Canada, has already collected a lot of data from Airbnb, which he shares on his webpage tomslee.net. He has collected Airbnb data from several cities, such as New York, Rome, Tokyo, Sydney and many more. He also collected data from Copenhagen in June 2016, which is the data we use for this project. The second part of the data used in this project was generated by ourselves, by scraping user reviews for different apartments in Copenhagen as text data.
Tom Slee’s dataset covers Copenhagen as a municipality. This means that the municipality of Frederiksberg is not included, even though it lies within the city of Copenhagen. The dataset includes a long list of variables, such as room id and host id, as well as information about the accommodation, such as room type and neighborhood. It also has variables for the number of bedrooms and bathrooms, the maximum occupancy of the apartment and the minimum rental period. The rental price is given in USD. Finally, the data include longitude and latitude, which give us GPS coordinates for all apartments. We are aware that using specific GPS coordinates can raise ethical questions about each host’s privacy. On the other hand, we question how precise the coordinates are, as some observations seem to be located in the lakes of the city. Also, our dataset does not include addresses or host names for any of the observations. This information is also hidden if you choose to look closer at a specific observation by searching for the room id or host id on Airbnb. To get the address and contact information of the host, you have to rent the apartment. Therefore we do not consider the use of the GPS coordinates a major intrusion into people’s private lives.
The data we gather ourselves contain text reviews written by previous visitors. In order to investigate the most popular apartments, we chose to scrape reviews for apartments with at least 40 reviews, which amounts to slightly more than 4 percent of the dataset. A high review count tells us that these hosts rent out their apartments often, so we treat them as more popular than other Airbnb hosts. Since Airbnb uses JavaScript, retrieving the guest review data requires some additional text mining techniques. We first save the entire webpage source for each of the 498 apartments, then loop through all source files to extract the guests’ comments. We then treat each word as an observation and remove all stop words and names. Finally, we count the frequency of each word and calculate the average star score given.
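The word counting step can be sketched in base R roughly as follows. This is a minimal illustration and not our actual scraping code: the three reviews, their scores and the short stop-word list are made up for the example, and a real run would use a fuller stop-word list plus the host names.

```r
# Hypothetical toy input: one row per review, with the listing's star score.
reviews <- data.frame(
  text  = c("Gorgeous flat, great location",
            "Basic room and some noise",
            "Great location, stylish design"),
  score = c(5, 4, 5),
  stringsAsFactors = FALSE
)

# Illustrative stop-word list (an assumption, far shorter than a real one).
stop_words <- c("and", "some", "the", "a")

# Lower-case the text, replace punctuation with spaces, split into words.
tokens <- strsplit(tolower(gsub("[^A-Za-z ]", " ", reviews$text)), " +")

# One row per (word, score) pair, so that each word is an observation.
words <- data.frame(
  word  = unlist(tokens),
  score = rep(reviews$score, lengths(tokens)),
  stringsAsFactors = FALSE
)
words <- words[nzchar(words$word) & !(words$word %in% stop_words), ]

# Frequency and mean star score per word (both sorted alphabetically by word).
word_counts <- table(words$word)
word_stats <- data.frame(
  freq       = as.vector(word_counts),
  mean_score = as.vector(tapply(words$score, words$word, mean)),
  row.names  = names(word_counts)
)
```

Each row of `word_stats` then holds one word with its frequency and the average score of the reviews in which it appears.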
The number of apartments available for rent on Airbnb in Copenhagen has increased over the past years. In June 2016 almost 15,000 homes of different types were available. The average price of an apartment was close to 117 USD per day, but it varied a lot with room type and neighborhood. The largest and most common room type is “entire homes”, which represents 82.6 percent of all observations. Here you rent a complete apartment. The next group is “private rooms”, where you rent a room in an apartment in which the host also lives. This group accounts for 16.9 percent of the dataset. Finally, “shared rooms” account for 0.5 percent of the data. To show the differences in prices we made a map using the mapDK package in R, based on the GPS coordinates in our dataset. The result is shown in graph 1a. The dark green dots represent only the 10 percent highest-priced apartments. As mentioned in section 1.1, we do not have data on the municipality of Frederiksberg. This is the reason why our map has a big gap.
It is clear that the popularity of Airbnb is not just centered in the middle of Copenhagen. In fact, the observations are spread out all over the city. However, as stated earlier, the average price of an apartment in Copenhagen can vary significantly depending on the neighborhood. In the center of the city we see a dense cluster of apartments with prices in the top 10 percent. Not surprisingly, this tells us that the location of your apartment has a large influence on the price. From graph 1b we can see that the city center (Indre By) is by far the most expensive place to rent a room or home. Amager Vest and Vesterbro complete the top three most expensive neighborhoods. However, the prices are very close once you disregard the city center. The cheapest parts of the city are all in the outer parts of Copenhagen.
But more than just location matters when you look at the prices on Airbnb. We try to interpret movements in the prices by looking at the number of reviews for each apartment. Graph 2 shows the relationship between price and number of reviews for each room type. We find that apartments with very few reviews tend to be priced higher. One reason for this could be that very expensive apartments are visited less often, as people don’t want to spend a lot of money, which results in fewer written reviews. A second reason could be that people mostly tend to write reviews when they have been satisfied with their stay. We can see this from the star ratings in our dataset, where the average score given is 4.7 out of 5. If your stay was very expensive, you are less likely to write a review for the apartment, which again results in fewer reviews. We also find that apartments with a lot of reviews seem to be priced lower. However, the trend is not linear, and it levels off when the number of reviews gets above 50, where the price is close to the average price. The picture is clear: a large number of reviews does not lead to higher prices. There could be two reasons for this. Firstly, hosts might not be aware that they have many more reviews than others and that they could potentially raise their prices. Secondly, apartments priced close to the average will most likely be very popular and hence get a lot of reviews. This would mean that a key condition for getting a lot of reviews is having fair and low prices.
In order to get better knowledge of what influences the Airbnb market in Copenhagen, we want to perform tests instead of just interpreting graphs. The ultimate goal is to find a complete model for predicting the price of apartments. Hence, we first want to find the variables that have a significant effect on the price. Relying on regression analysis, we want to reveal which of these factors remain significant when the effects of all predictors are taken into account simultaneously. As our dataset has some observations with values shown as NA, we decided to remove all rooms that have one or more NA values. By doing this, we are able to test on the same number of observations when we remove insignificant variables from the model. Before presenting the best-fitting model for our data, we briefly describe the steps needed to create it. We built a fixed-effects-only model, running a forward stepwise selection algorithm and cross-verifying the results by means of bootstrapping. The standard diagnostic tests reveal no significant issues with the resulting model.
##
## Call:
## lm(formula = price ~ room_type + neighborhood + accommodates +
##     bedrooms + bathrooms + minstay, data = Juni16)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
##   -164.1    -23.6     -4.7     14.4   9949.4
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     21.776      9.254   2.353 0.018644 *
## room_typePrivate room          -32.847      4.126  -7.962 1.95e-15 ***
## room_typeShared room           -52.917     22.828  -2.318 0.020473 *
## neighborhoodAmager Vest         12.295      7.460   1.648 0.099375 .
## neighborhoodBispebjerg         -12.288      8.641  -1.422 0.155044
## neighborhoodBroenshoej-Husum   -24.601     15.815  -1.556 0.119865
## neighborhoodIndre By            58.654      6.499   9.025  < 2e-16 ***
## neighborhoodNoerrebro            7.031      6.273   1.121 0.262385
## neighborhoodoesterbro            6.165      7.166   0.860 0.389639
## neighborhoodValby              -12.305      9.570  -1.286 0.198574
## neighborhoodVanloese           -17.311     11.912  -1.453 0.146216
## neighborhoodVesterbro-Kongens   19.306      6.461   2.988 0.002817 **
## accommodates                    14.677      1.458  10.064  < 2e-16 ***
## bedrooms                        10.467      2.722   3.846 0.000121 ***
## bathrooms                       22.706      6.630   3.425 0.000619 ***
## minstay                         -1.857      0.865  -2.147 0.031844 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 124.1 on 7399 degrees of freedom
## Multiple R-squared: 0.1103, Adjusted R-squared: 0.1085
## F-statistic: 61.15 on 15 and 7399 DF, p-value: < 2.2e-16
The table above presents the effects of the predictors. The second column shows the parameter estimates: a negative estimate means a negative impact on the price, and a positive estimate means a positive impact. The final column indicates the significance of the effect: the more stars, the more significant the effect (*** for p < 0.001; ** for p < 0.01; * for p < 0.05). When testing for significance we use a simple two-sided t-test of whether each parameter b equals 0, that is, H0: b = 0 against HA: b != 0. Since we have a large sample size, the test statistic is approximately normally distributed. The two-sided critical value of the normal distribution at the 95% confidence level is 1.96.
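As a worked example of this test, the t-value of a coefficient is its estimate divided by its standard error. The sketch below uses the estimate and standard error printed for “room_typePrivate room” in the table above and relies on the normal approximation; the variable names are chosen only for this illustration.

```r
# Estimate and standard error for "room_typePrivate room",
# copied from the regression table above.
estimate  <- -32.847
std_error <- 4.126

# t statistic for H0: b = 0.
t_value <- estimate / std_error

# Two-sided critical value at the 95% level (normal approximation).
critical <- qnorm(0.975)

# Two-sided p-value under the same approximation.
p_value <- 2 * pnorm(-abs(t_value))

# H0 is rejected when the t-value is more extreme than the critical value.
reject_h0 <- abs(t_value) > critical
```

The resulting t-value is about -7.96, far beyond the critical value of about 1.96, so H0 is rejected for this coefficient.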
The first significant variable of importance is the room type. As the room type “entire home/apartment” is included in the intercept, we only need to check that “private room” and “shared room” are significant. With t-values of -7.96 and -2.32 and a critical value of -1.96, both estimates are extreme. Hence we reject the null hypothesis, and the room type does matter when setting the price. The intercept represents an entire home in Amager Øst, since that neighborhood is also part of the intercept. Relative to this baseline, an entire home is priced highest: classifying your listing as a private room has a negative effect on the price, and a shared room has an even more negative effect.
The second variable of interest is neighborhood. From the estimate for the city center (Indre By) we can see that the variable is significant. With a t-value of 9.03 and a p-value very close to zero, we reject the H0 hypothesis, and hence neighborhood is significant. The effect on the price varies from neighborhood to neighborhood, as we also saw in graph 1b. Some neighborhoods affect the price positively and some negatively relative to Amager Øst, which is included in the intercept.
Not surprisingly, the occupancy of an apartment has a significant positive effect on the price. The more people who can stay in the apartment, the more expensive it should be, as the visitors will probably split the cost. Also, the accommodates variable could be positively correlated with the size of the apartment, which would also lead to higher prices. The variable minimum stay is only weakly significant, with a t-value of -2.15 and a p-value of 0.032. However, it affects the price in a negative direction. The intuition behind this is that when the minimum required length of a stay increases, it becomes harder to get customers, which eventually lowers the price.
Both bathrooms and bedrooms are highly significant, and both variables have a positive effect on the price. This means that when the number of rooms in the apartment grows, so does the price. This is probably also closely correlated with the actual size of the apartment.
The final two variables, overall satisfaction and number of reviews, do not appear in the table above and do not show any significant influence on the price. With t-values of 0.7 and 0.2 respectively, we cannot reject the H0 hypothesis of no effect on the price. The result is interesting, because one might believe that a better star score and a high number of reviews should result in higher prices. Maybe this is not the case if hosts only set their prices once, when they put the room or apartment online.
Above, we found that neither overall satisfaction nor the number of reviews plays an important role in the price setting. We find this interesting and somewhat surprising. In order to understand the connection between overall satisfaction and the reviews, we decided to scrape close to 3,500 reviews from some of the most visited apartments in Copenhagen and analyse the text. The scraping process is described in more detail at the end of section 1.1. We filter out all host names and all words with a frequency below 20. Our work results in graph 3, which shows the relationship between the frequency of a word and the satisfaction score of the apartments in whose reviews the word appears. We find that most words do not have a large impact on the overall satisfaction, since most words appear in reviews of apartments with a score very close to the average of 4.7. Some words that are used especially often, such as “stay”, “Copenhagen”, “city”, “nice” and “location”, still do not affect the satisfaction score. The explanation could be that these words are used in general comments about the city, such as “Copenhagen is a nice city”, or in statements of fact, such as “I have stayed at this host for 3 nights”. These words are not linked to the evaluation itself, so they are used often but are not related to the satisfaction score. The most positive words are “gorgeous”, “stylish”, “beautifully” and “design”. The first three are terms of praise in most cases, and it is easy to understand that people use them to express their satisfaction. Unexpectedly, people also use the word “design” as a very positive word. Looking at individual cases, we find that the word combinations “stylish design” and “modern design” are commonly used in good reviews. The most negative words are “share”, “noise” and “basic”. An interesting observation here is that people try to avoid derogatory terms by using neutral words.
In this section we want to use what we learned in section 3 to build a model that can predict the best price for an apartment. As the variable we want to predict is continuous and roughly Gaussian distributed, we use a linear regression model instead of a logistic regression model. The results from section 3 give us the idea that the best model is the one where we include all significant variables and remove all the insignificant ones. To test this, we created five different models, as seen below.
Model 1: price = b0 + b1*room_type + b2*neighborhood + b3*overall_satisfaction + b4*accommodations + b5*bathrooms + b6*bedrooms + b7*minstay + b8*reviews + error term
Model 2: price = b0 + b1*room_type + b2*neighborhood + b3*accommodations + b4*bathrooms + b5*bedrooms + b6*minstay + error term
Model 3: price = b0 + b1*overall_satisfaction + b2*reviews + error term
Model 4: price = b0 + b1*room_type + b2*neighborhood + b3*accommodations + b4*bathrooms + b5*bedrooms + error term
Model 5: price = b0 + b1*neighborhood + b2*accommodations + b3*bathrooms + b4*bedrooms + error term
The first model includes all variables available in the dataset. The second model is our preferred one, where we have only taken the significant variables found in section 3. The third model only includes the insignificant variables and must be expected to be the poorest model in terms of results. Model 4 includes all significant variables except “minstay”. The variable “minstay” was only borderline significant and hence might not play a large role for the price. The fifth model has all the significant variables but “minstay” and “neighborhood”. By removing “neighborhood” we take away one of the more important variables when deciding on the price.
In order to make certain that our models do not predict on already known data, we decided to use cross-validation. Basically, we divide our dataset into two random groups. The first group is a training dataset, on which the model is fitted. The second group is a test dataset, where the model tries to predict the price based on the knowledge gained from the training part. We chose to assign 50 percent of the original dataset to each of the two groups. The most important point here is to make sure that both groups have a large enough sample size to avoid statistical noise. In order to test the models against one another, we calculate and compare each model’s RMSE (root mean squared error), which is a measure of the difference between the true values and the values predicted by a model. Each time we run the predictions, the training and test data will be different, as the samples are chosen randomly. When the data change, our predictions also change, which means our RMSE will be slightly different. This means that if we run the simulation only once, the variance will be high, and we will not be able to tell which model is the best one. A solution to this problem is to replicate the process many times, so that the variance of the mean RMSE gets smaller. We then take the mean for each model over all the replications. Graph 4 shows the mean RMSE for each model on the test and training datasets, where we have replicated the process 100 times. Not surprisingly, model 1 has the lowest RMSE on the training data. In general, a model always fits the training data at least as well when more explanatory variables are added. However, this is not always the case when we try to predict, since a model can become too complicated. This means that even very small changes in the input can result in huge changes in the output. We can also see that the best model for predicting is model 2, our preferred model, where only the significant variables are included. This coincides well with our expectations.
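The replication procedure described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are synthetic, and the two candidate formulas are hypothetical stand-ins rather than our five models.

```r
set.seed(1)

# Synthetic stand-in for the listings data (all values are made up).
n <- 400
listings <- data.frame(
  accommodates = sample(1:6, n, replace = TRUE),
  bedrooms     = sample(1:3, n, replace = TRUE),
  reviews      = sample(0:80, n, replace = TRUE)
)
listings$price <- 20 + 15 * listings$accommodates +
  10 * listings$bedrooms + rnorm(n, sd = 25)

# Two hypothetical candidate models, not the five models from the text.
formulas <- list(
  structural = price ~ accommodates + bedrooms,
  reviews    = price ~ reviews
)

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# One replication: random 50/50 split, fit on the training half,
# then compute the RMSE on both halves.
one_split <- function(f, data) {
  idx   <- sample(nrow(data), size = nrow(data) %/% 2)
  train <- data[idx, ]
  test  <- data[-idx, ]
  fit   <- lm(f, data = train)
  c(train = rmse(train$price, predict(fit, newdata = train)),
    test  = rmse(test$price,  predict(fit, newdata = test)))
}

# Mean RMSE over 100 replications: rows are train/test, columns are models.
mean_rmse <- sapply(formulas, function(f) {
  rowMeans(replicate(100, one_split(f, listings)))
})
```

The model with the lowest value in the `test` row of `mean_rmse` is the one that predicts best on unseen data in this setup.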
As we want to analyse the precision of model 2’s predictions further, we held the predictions up against the actual values. The result is shown in graph 5 below. We find that our model predicts within a range of 20 USD of the actual price 49.6 percent of the time. It can be discussed whether or not this is a great result, but with a linear regression it might be difficult to achieve higher accuracy. Using other machine learning techniques, we might be able to find models with higher precision. However, we have neither the time nor the space to perform such an analysis, but it would be a good idea for further study on the subject.
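The share of predictions within 20 USD of the actual price is simply the fraction of absolute errors that do not exceed 20. A minimal sketch, with hypothetical prices that are not taken from our dataset:

```r
# Hypothetical actual and predicted prices in USD (illustrative values only).
actual    <- c(100, 80, 150, 60, 200)
predicted <- c(110, 105, 140, 75, 150)

# Fraction of predictions whose absolute error is at most 20 USD.
within_20 <- mean(abs(actual - predicted) <= 20)
```

For these five made-up listings, three of the five errors are at most 20 USD, so `within_20` is 0.6.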
We also went on to test the model against current apartments on Airbnb. We find that our model suggests lower prices than those shown on the website. There could be several reasons for this. Firstly, our dataset is built on data from June 2016. The price can change from month to month, so we cannot be certain that the price level is the same in August 2016. Unlike the website, our dataset does not include many apartments with prices above 200 USD. This means that our model will almost certainly predict prices lower than this. A stronger model would be based on data points covering many months. Another reason could be that our model needs more variables to be more accurate. As Airbnb has a strong community spirit, which is completely different from a normal OTA (online travel agency), a good direction would be to analyse the profiles of hosts on Airbnb and how they personalize their accommodations. Variables related to personal information, such as “marital status”, “age” and “social group”, could provide even more information about this non-traditional OTA. Another reason could be that the model does not know the quality and condition of an accommodation just from knowing how many bedrooms it has. That is something you can only judge when you see the apartment yourself. Possibly, we could do more text mining to classify the apartments based on the property name and reviews. Finally, we find that Airbnb prices seem to differ a lot from apartment to apartment, even when the variable values are close to equal. This makes it difficult for our model to predict correctly every time.
We have performed a graphical and statistical analysis of Airbnb in Copenhagen. Our findings show that several variables play a role when hosts set the price of their homes. The following variables have a significant effect on the price: “room type”, “neighborhood”, “accommodations”, “bathrooms”, “bedrooms” and “minstay”. We also found that the number of reviews and the stars given in the reviews do not have any significant effect on the price. However, words such as “share”, “noise” and “basic” seem to appear in reviews with a low star score. The most positive words are found to be “gorgeous”, “stylish”, “beautifully” and “design”. Finally, we tried to build a model to predict the prices. By comparing five different linear regression models, we found that the model including all significant variables showed the best results. Put to the test, our model predicts within a range of 20 USD of the actual price 49.6 percent of the time. We also tested the model against live data from Airbnb in Copenhagen. Our model seemed to predict lower prices than those we can find on the website. There could be several reasons for this. First of all, the prices from June could be different from the prices in August. In general, our dataset does not have many observations with prices above 200 USD, so it is difficult for our model to predict such high prices. Secondly, our model might need more variables to increase the precision of its predictions. Thirdly, our model does not know the quality of the rooms in the apartments. Finally, Airbnb prices seem very volatile, which makes it difficult for our model to make correct predictions.