“Every one soon or late comes round by Rome.” - Robert Browning

Rome has beckoned travelers from afar for quite a few decades now. If Italy represents romance, Rome stands for intimacy. Intimacy between its glorious past and urban present. Intimacy between its spellbinding art and inspiring culture. No matter how many trips you take, there will always be more to Rome. Needless to say, Rome receives millions of tourists each year: it is the 11th most visited city in the world and the 3rd most visited in Europe.

   

   

Introduction

Problem Statement

The purpose of this project is to perform exploratory analysis on the Airbnb rental data for Rome and understand the following:

  • What are the best and most affordable neighborhoods for a traveler, based on transit, availability, reviews etc.?
  • What factors affect the price of an Airbnb rental?

After all, who doesn’t want to travel the world with minimum cost, if not for free?

 

Brief Overview

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in Rome, Italy and is sourced from publicly available information from the Airbnb site.

Compiled till 8th May, 2017, the following Airbnb activity is included in this Rome dataset:

  • Listings - Detailed Listings Data for Rome, including full descriptions and average review score.
  • Reviews - Detailed Review Data for listings in Rome, including unique id for each reviewer and detailed comments.
  • Calendar - Detailed Calendar Data for listings in Rome, including listing id and the price and availability for that day.

 

Approach

R will be used to perform data analysis and visualization to explore and identify factors which affect prices, by grouping the neighborhoods based on availability, location advantage, affordability, reviews etc. Statistical analysis techniques like regression will also be performed to understand the significance of the relationship between prices and other variables.
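
As a rough illustration of the regression step, the sketch below fits a linear model with lm() on a small made-up data frame. The data frame toy and all of its values are invented for illustration; only the column names price, bedrooms, and accommodates mirror columns in the listings data, and the choice of predictors is an assumption rather than the final model.

```r
# Toy data invented for illustration; not taken from the Airbnb dataset
toy <- data.frame(
  price        = c(60, 85, 120, 150, 200, 240),
  bedrooms     = c(1, 1, 2, 2, 3, 4),
  accommodates = c(2, 3, 4, 4, 6, 8)
)

# Fit price as a linear function of the two predictors
fit <- lm(price ~ bedrooms + accommodates, data = toy)

# Coefficient table: estimates, standard errors, t values and p-values
coef_table <- summary(fit)$coefficients
print(coef_table)
```

The p-value column of this table is what would be read to judge whether a variable has a significant relationship with price.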

 

What’s in it for you?

If you are a traveler or planning your next vacation to this exotic city, it will help you find the best neighborhoods to stay in. If not, it might just inspire you to pack your bags!
If you are a resident of this city and plan to rent out your place, this analysis will help you predict the optimum rent for your home.
And finally, if you are part of none of these groups, the analysis might give new insights about the pricing of rentals!

   

Arsenal in my toolkit

To perform the analysis to the best of my abilities, I will be using the following R packages:

  • tidyverse - A collection of packages for data manipulation and visualization that can be installed and loaded together.
  • dplyr - A fast, consistent tool for working with data frame like objects, both in memory and out of memory.
  • stringr - Simple, Consistent Wrappers for Common String Operations.
  • knitr - Provides a general-purpose tool for dynamic report generation in R.
library(tidyverse)
library(dplyr)
library(stringr)
library(knitr)

More packages will be loaded as and when required, as I proceed through my analysis.

       

Let’s get our hands dirty

Background of data

Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.

By analyzing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so you can see how Airbnb is being used to compete with the residential housing market.

The data behind the Inside Airbnb site is sourced from publicly available information from the Airbnb site.

Link to the data is available here

 

Loading the datasets

listings <- read.csv("data/listings.csv", na.strings=c(""," ","NA"))
calendar <- read.csv("data/calendar.csv")
reviews <- read.csv("data/reviews.csv")

   

First look of the data

Dimensions of the datasets

listings_dim <- dim(listings)
calendar_dim <- dim(calendar)
reviews_dim <- dim(reviews)
  • Listings has 3585 rows and 95 columns.
  • Calendar has 1308890 rows and 4 columns.
  • Reviews has 68275 rows and 6 columns.

 

Column names of the datasets

names(listings)
##  [1] "id"                               "listing_url"                     
##  [3] "scrape_id"                        "last_scraped"                    
##  [5] "name"                             "summary"                         
##  [7] "space"                            "description"                     
##  [9] "experiences_offered"              "neighborhood_overview"           
## [11] "notes"                            "transit"                         
## [13] "access"                           "interaction"                     
## [15] "house_rules"                      "thumbnail_url"                   
## [17] "medium_url"                       "picture_url"                     
## [19] "xl_picture_url"                   "host_id"                         
## [21] "host_url"                         "host_name"                       
## [23] "host_since"                       "host_location"                   
## [25] "host_about"                       "host_response_time"              
## [27] "host_response_rate"               "host_acceptance_rate"            
## [29] "host_is_superhost"                "host_thumbnail_url"              
## [31] "host_picture_url"                 "host_neighbourhood"              
## [33] "host_listings_count"              "host_total_listings_count"       
## [35] "host_verifications"               "host_has_profile_pic"            
## [37] "host_identity_verified"           "street"                          
## [39] "neighbourhood"                    "neighbourhood_cleansed"          
## [41] "neighbourhood_group_cleansed"     "city"                            
## [43] "state"                            "zipcode"                         
## [45] "market"                           "smart_location"                  
## [47] "country_code"                     "country"                         
## [49] "latitude"                         "longitude"                       
## [51] "is_location_exact"                "property_type"                   
## [53] "room_type"                        "accommodates"                    
## [55] "bathrooms"                        "bedrooms"                        
## [57] "beds"                             "bed_type"                        
## [59] "amenities"                        "square_feet"                     
## [61] "price"                            "weekly_price"                    
## [63] "monthly_price"                    "security_deposit"                
## [65] "cleaning_fee"                     "guests_included"                 
## [67] "extra_people"                     "minimum_nights"                  
## [69] "maximum_nights"                   "calendar_updated"                
## [71] "has_availability"                 "availability_30"                 
## [73] "availability_60"                  "availability_90"                 
## [75] "availability_365"                 "calendar_last_scraped"           
## [77] "number_of_reviews"                "first_review"                    
## [79] "last_review"                      "review_scores_rating"            
## [81] "review_scores_accuracy"           "review_scores_cleanliness"       
## [83] "review_scores_checkin"            "review_scores_communication"     
## [85] "review_scores_location"           "review_scores_value"             
## [87] "requires_license"                 "license"                         
## [89] "jurisdiction_names"               "instant_bookable"                
## [91] "cancellation_policy"              "require_guest_profile_picture"   
## [93] "require_guest_phone_verification" "calculated_host_listings_count"  
## [95] "reviews_per_month"
names(calendar)
## [1] "listing_id" "date"       "available"  "price"
names(reviews)
## [1] "listing_id"    "id"            "date"          "reviewer_id"  
## [5] "reviewer_name" "comments"

 

Univariate analysis of variables

Price

Since we are interested in the price of the listings, we start with some initial analysis and manipulation of this variable.

Checking for null values

sum(is.na(listings$price))
## [1] 0

We see that there are no null values present for this variable.
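
The same check can be extended to every column at once, which helps when deciding which variables are complete enough to keep. The sketch below uses a tiny invented data frame named toy (its values are assumptions, only the column names come from the listings data) and counts missing values per column.

```r
# Toy data invented for illustration; not taken from the Airbnb dataset
toy <- data.frame(
  price       = c("$80.00", "$120.00", "$95.00"),
  square_feet = c(NA, 450, NA),
  bedrooms    = c(1, NA, 2)
)

# Number of missing values in every column, most incomplete first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_counts)
```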

 

Changing the type of the variable

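First we need to fix up the price variable, which is given to us as a string containing dollar signs, dots, and commas. read.csv() has loaded it as a factor, so calling as.numeric() on it directly would return the internal level codes (1 to 324) rather than the prices. We strip the "$" and "," characters and convert through character instead.

listings$price[1:5]
## [1] $250.00 $65.00  $65.00  $75.00  $79.00 
## 324 Levels: $1,000.00 $1,235.00 $1,250.00 $1,275.00 $1,300.00 ... $999.00
# Remove "$" and thousands separators, then convert the text to numeric
listings$price <- as.numeric(gsub("[$,]", "", as.character(listings$price)))

Summary of the variable

summary(listings$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    70.0   144.0   159.8   265.0   324.0

These recorded figures run from 1 to 324, the same as the number of factor levels, so they reflect level codes rather than dollar amounts. Rerun summary() after the conversion above to see the actual price distribution.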
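
Because price is stored as a factor, the way it is converted to a number matters. The sketch below contrasts the two conversions on three values that appear in the price column above; the variable names raw, codes, and prices are made up for this example.

```r
# Three values that appear among the levels of the price column
raw <- factor(c("$250.00", "$65.00", "$1,000.00"))

# Direct conversion returns the factor's internal level codes
codes <- as.numeric(raw)

# Stripping "$" and "," from the text first recovers the amounts
prices <- as.numeric(gsub("[$,]", "", as.character(raw)))

print(codes)
print(prices)
```

Going through as.character() also keeps the conversion safe on newer R versions, where read.csv() returns character columns instead of factors by default.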

 

Initial plots for price

par(mfrow=c(1,2))
boxplot(listings$price,main="Boxplot of price")
hist(listings$price)

 

Finding the costliest rental in Rome

listings %>% 
  filter(price == max(listings$price)) %>% 
  select(neighbourhood_cleansed ,property_type,price)
##   neighbourhood_cleansed property_type price
## 1               Brighton     Apartment   324

 

The interesting thing about this rental is that it has no reviews yet. Here is the url to it.

   

Listing the neighbourhoods that have rentals priced between the 25th and 75th percentiles

# quantile() defaults to 0%, 25%, 50%, 75%, 100%, so request the two bounds explicitly
listings %>% 
  filter(price >= quantile(price, 0.25) & price <= quantile(price, 0.75)) %>% 
  distinct(neighbourhood_cleansed)
##     neighbourhood_cleansed
## 1               Roslindale
## 2            Jamaica Plain
## 3             Mission Hill
## 4              Bay Village
## 5         Leather District
## 6                Chinatown
## 7                North End
## 8                  Roxbury
## 9                South End
## 10                Back Bay
## 11             East Boston
## 12             Charlestown
## 13                West End
## 14             Beacon Hill
## 15                Downtown
## 16                  Fenway
## 17                Brighton
## 18            West Roxbury
## 19               Hyde Park
## 20                Mattapan
## 21              Dorchester
## 22 South Boston Waterfront
## 23            South Boston
## 24                 Allston

   

The dataset describes the listing activity of homestays in Rome, Italy. It contains 3585 rows in total. Each row is the full description of a listing and has 95 columns in all, including date, location, review, and availability information.

At first glance, the dataset contains many irrelevant and redundant columns that we won’t want to use in our analysis. Columns such as “host picture url” and “host name” are unlikely to have any effect on our predictions of “price”, whereas columns such as “bedrooms” and “square feet” are very likely to influence prices. Given a dataset with so many columns, it is necessary to do some exploratory analysis first.

     

Exploratory Data Analysis

  • Reducing the number of variables would be one of the important steps in my analysis.

  • Understanding the various neighbourhoods which have higher price rentals and grouping them by other variables to get interesting insights.

  • Studying the relationships of the other variables with price, by plotting them using boxplots, scatterplots, and histograms.

  • Answering more interesting questions like:
    • How many nights is a dwelling rented per year?
    • How many rooms are rented in a building on average?
    • Is the listing licensed?
    • How much are hosts making from renting to tourists (compared to long-term rentals)?
    • Which neighbourhoods have the most listings?
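
A couple of these steps can be sketched on a toy stand-in for the listings data. The data frame toy and its values are invented for illustration, with only the column names taken from the real dataset; base R is used here so the snippet runs without extra packages, though the same steps could be written with dplyr.

```r
# Toy stand-in for the listings data; values are invented
toy <- data.frame(
  id = 1:6,
  host_name = c("A", "B", "C", "D", "E", "F"),
  neighbourhood_cleansed = c("X", "Y", "X", "Z", "X", "Y"),
  price = c(80, 120, 95, 60, 150, 110)
)

# Drop a column that is unlikely to help explain price
toy <- toy[, setdiff(names(toy), "host_name")]

# Count listings per neighbourhood, largest first
counts <- sort(table(toy$neighbourhood_cleansed), decreasing = TRUE)
print(counts)

# Median price per neighbourhood
median_price <- tapply(toy$price, toy$neighbourhood_cleansed, median)
print(median_price)
```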