“Every one soon or late comes round by Rome.” - Robert Browning
Rome has beckoned travelers from afar for centuries. If Italy represents romance, Rome stands for intimacy. Intimacy between its glorious past and urban present. Intimacy between its spellbinding art and inspiring culture. No matter how many trips you take, there will always be more to Rome. Needless to say, Rome receives millions of tourists each year. It is the 11th most visited city in the world and the 3rd most visited in Europe.
The purpose of this project is to perform exploratory analysis on the Airbnb rental data for Rome and understand the following:
After all, who doesn’t want to travel the world with minimum cost, if not for free?
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in Rome, Italy and is sourced from publicly available information from the Airbnb site.
Compiled through 8 May 2017, this Rome dataset includes the following Airbnb activity:
R will be used for data analysis and visualization to explore and identify the factors that affect prices, grouping the neighborhoods by availability, location advantage, affordability, reviews, and so on. Statistical techniques such as regression will also be applied to assess the significance of the relationship between price and other variables.
If you are a traveler or planning your next vacation to this exotic city, it will help you find the best neighborhoods to stay in. If not, it might just inspire you to pack your bags!
If you are a resident of this city and plan to rent out your place, this analysis will help you predict the optimum rent for your home.
And finally, if you are part of none of these groups, the analysis might give new insights about the pricing of rentals!
To perform the analysis to the best of my abilities, I will be using the following R packages:
library(tidyverse)
library(dplyr)
library(stringr)
library(knitr)
More packages will be loaded as and when required, as I proceed through my analysis.
Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.
By analyzing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so you can see how Airbnb is being used to compete with the residential housing market.
The data behind the Inside Airbnb site is sourced from publicly available information from the Airbnb site.
Link to the data is available here
Loading the datasets
listings <- read.csv("data/listings.csv", na.strings=c(""," ","NA"))
calendar <- read.csv("data/calendar.csv")
reviews <- read.csv("data/reviews.csv")
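The `na.strings` argument tells `read.csv` which raw strings to treat as missing. As a minimal sketch of its effect, the snippet below reads a small inline CSV; the column names and values are made up for illustration and are not taken from the listings file.

```r
# Hypothetical three-row CSV: row 2 has an empty name, row 3 a literal "NA" price
csv_text <- "id,name,price
1,Loft,$65.00
2,,$75.00
3,Studio,NA"

toy <- read.csv(text = csv_text,
                na.strings = c("", " ", "NA"),
                stringsAsFactors = FALSE)

# Count the missing values in each column
colSums(is.na(toy))
```

Without `""` in `na.strings`, the empty name would be read as an empty string rather than as `NA`.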
Dimensions of the datasets
listings_dim <- dim(listings)
calendar_dim <- dim(calendar)
reviews_dim <- dim(reviews)
Column names of the datasets
names(listings)
## [1] "id" "listing_url"
## [3] "scrape_id" "last_scraped"
## [5] "name" "summary"
## [7] "space" "description"
## [9] "experiences_offered" "neighborhood_overview"
## [11] "notes" "transit"
## [13] "access" "interaction"
## [15] "house_rules" "thumbnail_url"
## [17] "medium_url" "picture_url"
## [19] "xl_picture_url" "host_id"
## [21] "host_url" "host_name"
## [23] "host_since" "host_location"
## [25] "host_about" "host_response_time"
## [27] "host_response_rate" "host_acceptance_rate"
## [29] "host_is_superhost" "host_thumbnail_url"
## [31] "host_picture_url" "host_neighbourhood"
## [33] "host_listings_count" "host_total_listings_count"
## [35] "host_verifications" "host_has_profile_pic"
## [37] "host_identity_verified" "street"
## [39] "neighbourhood" "neighbourhood_cleansed"
## [41] "neighbourhood_group_cleansed" "city"
## [43] "state" "zipcode"
## [45] "market" "smart_location"
## [47] "country_code" "country"
## [49] "latitude" "longitude"
## [51] "is_location_exact" "property_type"
## [53] "room_type" "accommodates"
## [55] "bathrooms" "bedrooms"
## [57] "beds" "bed_type"
## [59] "amenities" "square_feet"
## [61] "price" "weekly_price"
## [63] "monthly_price" "security_deposit"
## [65] "cleaning_fee" "guests_included"
## [67] "extra_people" "minimum_nights"
## [69] "maximum_nights" "calendar_updated"
## [71] "has_availability" "availability_30"
## [73] "availability_60" "availability_90"
## [75] "availability_365" "calendar_last_scraped"
## [77] "number_of_reviews" "first_review"
## [79] "last_review" "review_scores_rating"
## [81] "review_scores_accuracy" "review_scores_cleanliness"
## [83] "review_scores_checkin" "review_scores_communication"
## [85] "review_scores_location" "review_scores_value"
## [87] "requires_license" "license"
## [89] "jurisdiction_names" "instant_bookable"
## [91] "cancellation_policy" "require_guest_profile_picture"
## [93] "require_guest_phone_verification" "calculated_host_listings_count"
## [95] "reviews_per_month"
names(calendar)
## [1] "listing_id" "date" "available" "price"
names(reviews)
## [1] "listing_id" "id" "date" "reviewer_id"
## [5] "reviewer_name" "comments"
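The column names suggest that `listing_id` in the calendar and reviews tables refers to `id` in the listings table; this is an assumption based on the names alone. Under that assumption, a minimal base R sketch of attaching a review count to each listing could look like this, using hypothetical miniature tables:

```r
# Hypothetical miniature versions of the listings and reviews tables
listings_toy <- data.frame(id = c(1, 2, 3),
                           room_type = c("Entire home/apt", "Private room", "Private room"),
                           stringsAsFactors = FALSE)
reviews_toy <- data.frame(listing_id = c(1, 1, 3),
                          comments = c("Great", "Clean", "Central"),
                          stringsAsFactors = FALSE)

# Count the reviews for each listing
counts <- aggregate(comments ~ listing_id, data = reviews_toy, FUN = length)
names(counts)[2] <- "n_reviews"

# Left join: keep every listing, even those without reviews (n_reviews is NA)
joined <- merge(listings_toy, counts,
                by.x = "id", by.y = "listing_id", all.x = TRUE)
joined
```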
Since we are interested in the price of the listings, we begin with an initial analysis and manipulation of this variable.
Checking for null values
sum(is.na(listings$price))
## [1] 0
We see that there are no missing values for this variable.
Changing the type of the variable
First we need to fix up the price variable, which is given to us as a string containing dollar signs, dots, and commas.
listings$price[1:5]
## [1] $250.00 $65.00 $65.00 $75.00 $79.00
## 324 Levels: $1,000.00 $1,235.00 $1,250.00 $1,275.00 $1,300.00 ... $999.00
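# Strip "$" and "," from the text before converting;
# as.numeric() applied directly to a factor returns its level codes
listings$price <- as.numeric(gsub("[$,]", "", as.character(listings$price)))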
Summary of the variable
summary(listings$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 70.0 144.0 159.8 265.0 324.0
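A caveat on the summary above: its maximum, 324, equals the number of factor levels printed for `price`. That is what one would see if the conversion returned factor level codes instead of dollar amounts. The sketch below contrasts the two conversions on a hypothetical three-element factor.

```r
# Hypothetical price strings stored as a factor
price_raw <- factor(c("$250.00", "$65.00", "$1,000.00"))

# Converting the factor directly yields integer level codes, not prices
codes <- as.numeric(price_raw)

# Convert to character, strip "$" and ",", then convert to numeric
price_num <- as.numeric(gsub("[$,]", "", as.character(price_raw)))

codes      # small integers between 1 and the number of levels
price_num  # the dollar amounts
```

If the codes are what was summarised, the quartiles and maximum describe the sort order of the price strings rather than the prices themselves.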
Initial plots for price
par(mfrow=c(1,2))
boxplot(listings$price,main="Boxplot of price")
hist(listings$price)
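A boxplot draws as outliers the points lying more than 1.5 times the interquartile range beyond the hinges. As a minimal sketch, those points can also be listed numerically with `boxplot.stats()`; the price vector here is hypothetical.

```r
# Hypothetical nightly prices with one extreme value
prices <- c(50, 60, 65, 70, 75, 80, 90, 400)

# boxplot.stats() returns the five-number summary and the flagged outliers
stats <- boxplot.stats(prices)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points drawn individually on the boxplot (here: 400)
```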
Finding the costliest rental in Rome
listings %>%
filter(price == max(listings$price)) %>%
select(neighbourhood_cleansed ,property_type,price)
## neighbourhood_cleansed property_type price
## 1 Brighton Apartment 324
The interesting thing about this rental is that it has no reviews yet. Here is the url to it.
Listing the neighbourhoods that have listings priced between the 25th percentile and the median (the second and third values returned by quantile())
listings %>%
filter(price >= quantile(listings$price)[2] & price <= quantile(listings$price)[3]) %>%
distinct(neighbourhood_cleansed)
## neighbourhood_cleansed
## 1 Roslindale
## 2 Jamaica Plain
## 3 Mission Hill
## 4 Bay Village
## 5 Leather District
## 6 Chinatown
## 7 North End
## 8 Roxbury
## 9 South End
## 10 Back Bay
## 11 East Boston
## 12 Charlestown
## 13 West End
## 14 Beacon Hill
## 15 Downtown
## 16 Fenway
## 17 Brighton
## 18 West Roxbury
## 19 Hyde Park
## 20 Mattapan
## 21 Dorchester
## 22 South Boston Waterfront
## 23 South Boston
## 24 Allston
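With default arguments, `quantile()` returns the 0th, 25th, 50th, 75th and 100th percentiles, so elements `[2]` and `[3]` in the filter above are the 25th percentile and the median. To select the full interquartile range instead, one option is to name the probabilities explicitly. This is a minimal base R sketch on hypothetical data:

```r
# Hypothetical prices and neighbourhood labels
toy <- data.frame(
  neighbourhood = c("A", "B", "C", "D", "E"),
  price = c(40, 60, 80, 100, 300),
  stringsAsFactors = FALSE
)

# Request the 25th and 75th percentiles by probability rather than by position
q <- quantile(toy$price, probs = c(0.25, 0.75))

# Keep the rows whose price falls inside the interquartile range
mid_range <- toy[toy$price >= q[1] & toy$price <= q[2], ]
unique(mid_range$neighbourhood)
```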
The dataset describes the listing activity of homestays in Rome, Italy. It contains 3585 rows in total. Each row is the full description of a listing and has 95 columns in all, including date, location, review, and availability information.
At first glance, the dataset contains many irrelevant and redundant columns that we will not want to use in our analysis. Columns such as “host picture url” and “host name” will almost certainly have no effect on our predictions of “price”, whereas columns such as “bedrooms” and “square feet” are very likely to influence prices. Given a dataset with so many columns, it is necessary to do some exploratory analysis first.
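One way to start trimming the columns is to drop the URL fields by name and then drop any column that is mostly empty. This is a minimal sketch on a hypothetical miniature table; the 50% missingness threshold is an assumption chosen for illustration, not a rule taken from this analysis.

```r
# Hypothetical miniature listings table
toy <- data.frame(
  id = 1:4,
  host_picture_url = c("u1", "u2", "u3", "u4"),
  bedrooms = c(1, 2, NA, 3),
  square_feet = c(NA, NA, NA, 800),
  price = c(65, 120, 90, 250),
  stringsAsFactors = FALSE
)

# Drop the columns whose names end in "_url"
reduced <- toy[, !grepl("_url$", names(toy)), drop = FALSE]

# Drop the columns with more than half of their values missing (assumed threshold)
na_share <- colMeans(is.na(reduced))
reduced <- reduced[, na_share <= 0.5, drop = FALSE]

names(reduced)
```

A column that is plausibly informative, such as square feet, can still fall below the threshold if it is rarely filled in, so the cutoff deserves a look before it is applied to the real data.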
Reducing the number of variables would be one of the important steps in my analysis.
Understanding the various neighbourhoods which have higher price rentals and grouping them by other variables to get interesting insights.
Studying the relationships of the other variables with price, by plotting them using boxplots, scatterplots, and histograms
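As a minimal sketch of how the relationship between price and another variable could be tested, the snippet below fits a simple linear regression. The data are simulated, and the true slope of 40 per bedroom is an arbitrary assumption, so the numbers say nothing about the real listings.

```r
# Simulated data: price rises with the number of bedrooms, plus random noise
set.seed(1)
toy <- data.frame(bedrooms = rep(1:4, each = 10))
toy$price <- 50 + 40 * toy$bedrooms + rnorm(nrow(toy), sd = 10)

# Fit a simple linear regression of price on bedrooms
fit <- lm(price ~ bedrooms, data = toy)

# Coefficient table: estimates, standard errors, t values and p-values
coef(summary(fit))
```

The p-value in the `bedrooms` row indicates whether the estimated slope is distinguishable from zero, which is the sense of "significance" used in the introduction.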