Sleeping in Seattle: Airbnb data analysis

About Airbnb:

Airbnb is an online marketplace and hospitality service, enabling people to lease or rent short-term lodging including vacation rentals, apartment rentals, homestays, hostel beds, or hotel rooms.

1. Introduction

Problem Statement:

To analyse the Airbnb Seattle listings and reviews datasets.

To come up with certain questions that can be answered with the data provided and answer those using sound methodology.

To use text analytics and sentiment analysis techniques to understand customer reviews and listing descriptions.

Implementation:

The datasets were imported from CSV files obtained from the Kaggle website, then cleaned and manipulated for the analysis. The AFINN lexicon was used to assign each word a sentiment score, ranging from -5 to -1 for negative words and from +1 to +5 for positive words.

2. Packages Required

tidytext = allows conversion of text to and from tidy formats

DT = HTML display of data

tidyverse = Allows for data manipulation and works in harmony with other packages as well

stringr = String operations

leaflet = interactive Leaflet maps in R

tm = for text mining

wordcloud = for word cloud generator

ggmap = visualization by combining the spatial information of static maps from Google Maps

library(tidytext)
library(DT)
library(tm)
library(wordcloud)
library(tidyverse)
library(stringr)
library(leaflet)
library(ggmap)

3. Data Preparation

3.1 Data Source

Original Source of data: Kaggle Datasets

Explanation of data source:

The data was released by Airbnb itself on 04 January, 2016 to show people how Airbnb is really being used and how it is affecting their neighbourhoods.
The datasets were then uploaded to Kaggle to improve their visibility to the data science community.
The original reviews dataset had 84849 rows and 6 variables (columns), the listings table had 3818 rows and 92 columns, and the calendar table had 1393570 rows and 4 columns.

3.2 Original Datasets

reviews dataset

review <- read_csv("reviews.csv")
datatable(head(review, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

listings dataset

#listings <- read_csv("listings.csv")
listings <- read_csv("https://raw.githubusercontent.com/AkhileshAgnihotri/r4ds/master/listings.csv")

datatable(head(listings, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

3.3 Cleaning Dataset

The first step after importing the datasets is to check the formatting of the columns and confirm that all rows have been imported from the source CSV. All variables have appropriate data types, as seen in the structure of the tables, and the dimensions of source and destination match.

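#getting to know the review data
dim(review) # 84849 obs, 6 variables
#to check whether the combination of listing_id,date is a primary key
#(group_by() alone keeps every row, so count the distinct combinations instead)
review_aggregated <- distinct(review, listing_id, date)
nrow(review_aggregated) == nrow(review) # TRUE only if the combination is unique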

If review_aggregated and the review table have the same number of rows, the combination of listing_id and date identifies a unique record. Next, we remove the ‘$’ sign from the price column, convert it to numeric, convert the categorical variables into factors, and convert the descriptions to character.
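
A minimal sketch of such a uniqueness check on toy data follows. The table `toy_review` and its values are hypothetical and are not taken from reviews.csv; the idea is to list any key combination that occurs more than once.

```r
library(dplyr)

# Hypothetical toy reviews table; the real `review` data is not used here
toy_review <- tibble(
  listing_id = c(1, 1, 2, 2),
  date       = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02", "2016-01-03")),
  id         = 101:104
)

# Key combinations that occur more than once
duplicate_keys <- toy_review %>%
  count(listing_id, date, name = "n_rows") %>%
  filter(n_rows > 1)

# TRUE only when (listing_id, date) uniquely identifies every row
key_is_unique <- nrow(distinct(toy_review, listing_id, date)) == nrow(toy_review)
```

An empty `duplicate_keys` table would indicate that the pair can serve as a primary key; any rows in it point to the exact combinations that are repeated.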

#cleaning the listing data
dim(listings) # 3818 obs, 92 variables
#note: only the "$" is stripped, so prices with a thousands separator (e.g. "$1,000.00") become NA
listings$price <- as.numeric(sub("\\$","", listings$price))
listings$description <- as.character(listings$description)
listings$neighbourhood_cleansed <- factor(listings$neighbourhood_cleansed)
listings$host_is_superhost <- factor(listings$host_is_superhost)
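
The effect of the price conversion can be illustrated on a few strings. This is a sketch only: the values in `raw_price` are hypothetical and merely assume a "$1,000.00"-style format with a thousands separator.

```r
# Hypothetical raw price strings; the "$1,000.00" format is an assumption
raw_price <- c("$85.00", "$1,000.00", "$150.00", NA)

# Removing only "$" leaves the comma in place, so as.numeric() yields NA
price_dollar_only <- suppressWarnings(as.numeric(sub("\\$", "", raw_price)))

# Removing both "$" and "," keeps four-digit prices
price_clean <- as.numeric(gsub("[$,]", "", raw_price))
```

Comparing `sum(is.na(price_dollar_only))` with `sum(is.na(price_clean))` shows how many prices would be lost by the simpler pattern.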

We want to classify the homestays into two categories, affordable and pricey, based on the price column. To do that, we first extract the 50th percentile (median) price.

quantile(listings$price,probs = seq(0, 1, 0.02),na.rm=TRUE)

##     0%     2%     4%     6%     8%    10%    12%    14%    16%    18% 
##  20.00  38.00  40.00  46.96  50.00  55.00  56.00  60.00  64.00  65.00 
##    20%    22%    24%    26%    28%    30%    32%    34%    36%    38% 
##  69.00  70.00  74.00  75.00  79.00  80.00  82.00  85.00  89.00  90.00 
##    40%    42%    44%    46%    48%    50%    52%    54%    56%    58% 
##  92.00  95.00  98.00  99.00 100.00 100.00 105.00 109.00 110.00 115.00 
##    60%    62%    64%    66%    68%    70%    72%    74%    76%    78% 
## 120.00 124.92 125.00 129.00 135.00 140.00 149.00 150.00 150.00 159.00 
##    80%    82%    84%    86%    88%    90%    92%    94%    96%    98% 
## 169.00 175.00 187.00 199.00 200.00 225.00 250.00 275.00 320.00 399.00 
##   100% 
## 999.00
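
Once the median is known, the two-way classification could be sketched as below. The `price` vector is hypothetical, and treating a price equal to the median as affordable is an assumption made for this illustration.

```r
# Hypothetical nightly prices; NA marks a price that failed to parse
price <- c(45, 80, 100, 100, 135, 250, NA)

# 50th percentile, ignoring missing prices
median_price <- quantile(price, probs = 0.5, na.rm = TRUE)

# "pricey" = strictly above the median, "affordable" = at or below it
price_class <- ifelse(price > median_price, "pricey", "affordable")
table(price_class, useNA = "ifany")
```

Listings with a missing price stay unclassified (NA) rather than being silently assigned to either group.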

3.4 Cleaned Datasets

reviews dataset

datatable(head(review, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

listings dataset

datatable(head(listings, 50), options = list(
  initComplete = JS(
    "function(settings, json) {",
    "$(this.api().table().header()).css({'background-color': '#000', 'color': '#fff'});",
    "}")
))

3.5 Summary of Variables

Although the two datasets contain nearly 100 variables between them (6 in reviews and 92 in listings), only a few are used for the analysis.

A brief description of these variables is given below:

Date: Date of review

Reviewer_Name: Name of the reviewer

Comments: Review of the stay

Neighbourhood_Cleansed: Neighbourhood of the property

host_is_superhost: Whether the host holds Airbnb's Superhost status, which is granted to experienced, highly rated hosts

Price: Price of the property per night

Zipcode: Zipcode of the property

Longitude and Latitude: Co-ordinates of the property

Description: Description of the property

Summary: Summary about the place posted by the host

listing_id: Unique number used to denote an AirBnB homestay

reviewer_id: Unique number representing someone who has stayed in a Seattle Airbnb homestay and has provided a review at least once

4. Exploratory Data Analysis

4.1 Common Keywords in Listings and Reviews

We want to find the most common words across listings and reviews. The most common words in listings are the attributes of the Airbnb homestay that owners want to showcase. The most common words in reviews reflect the factors that customers care about most in an Airbnb. Ideally, then, the frequent review words should also appear in listing descriptions, since those attributes are what customers look for.

Top 20 Keywords in Listings

#We need to use the unnest_tokens function to obtain one-row-per-term-per-listing-description
listings_words <- listings %>%
  select(id, description, price, review_scores_accuracy, review_scores_rating) %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

#plot the graph
common_listings <- listings_words %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  top_n(n = 20, wt = count) %>%
  ggplot() +
  geom_bar(mapping = aes(x=reorder(word, count),
                         y=count),
           stat="identity", fill = "brown") +
  labs(title="Top 20 words described in listings",
       y="Word count", x="Most common Words") +
  coord_flip() +
  theme_minimal()
  
print(common_listings)

Here, we observe that many trivial words such as Seattle, bed, living and located are present in the result, but we also get some useful ones such as neighbourhood and restaurants.
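
One possible way to suppress such trivial words is to add a domain-specific stop-word list on top of the standard one. The sketch below uses toy descriptions; `custom_stop_words` and its contents are assumptions chosen for illustration, not part of the original analysis.

```r
library(dplyr)
library(tidytext)

# Hypothetical listing descriptions
toy_listings <- tibble(
  id = 1:2,
  description = c("Cozy bed in Seattle near restaurants",
                  "Seattle home located close to restaurants and parks")
)

# Assumed domain-specific stop words; the list is illustrative only
custom_stop_words <- tibble(word = c("seattle", "bed", "living", "located"))

toy_words <- toy_listings %>%
  unnest_tokens(word, description) %>%          # one row per word, lower-cased
  anti_join(stop_words, by = "word") %>%        # standard English stop words
  anti_join(custom_stop_words, by = "word") %>% # domain-specific additions
  count(word, sort = TRUE)
```

Which words count as trivial is a judgment call, so the custom list would need to be reviewed against the actual top-20 output.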

Top 20 Keywords in Reviews

#We need to use the unnest_tokens function to obtain one-row-per-term-per-review
review_words <- review %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

#plot the graph
common_reviews <- review_words %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  top_n(n = 20, wt = count) %>%
  ggplot() +
  geom_bar(mapping = aes(x=reorder(word, count), y=count),
           stat="identity", fill = "light green") +
  coord_flip() +
  labs(title="Top 20 words described in Reviews",
       y="Word count", x="Words") +
  theme_minimal()

print(common_reviews)

The top 20 keywords in reviews tell us that comfort, cleanliness, location and space are some of the important factors that matter to customers, and some of these are rightly advertised in listing descriptions.
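
The comparison between the two keyword lists can be made explicit with set operations. In this sketch both vectors are hypothetical stand-ins for the top words computed from listings and reviews.

```r
# Hypothetical top keywords; in the analysis these would come from
# the word counts computed for listings and for reviews
top_listing_words <- c("seattle", "bedroom", "kitchen", "neighborhood", "restaurants")
top_review_words  <- c("clean", "comfortable", "location", "neighborhood", "restaurants")

# Words that appear in both lists
shared_words <- intersect(top_listing_words, top_review_words)

# Words guests mention that listings do not emphasise
review_only <- setdiff(top_review_words, top_listing_words)

# Share of review keywords that listings also use
overlap_share <- length(shared_words) / length(top_review_words)
```

The `review_only` words are the candidates a host might consider adding to a listing description.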

4.2 Pricey vs Affordable listings

The plot below shows the locations of pricey homestays, defined as listings priced above the 50th percentile ($100, the median found above), with the rest labelled affordable. We use it to check whether particular areas of Seattle have more expensive stays than others.

Note: Green are affordable and red are expensive homestays.

leaflet(data = listings) %>%  addProviderTiles("Stamen.Watercolor") %>%
 addProviderTiles("Stamen.TonerHybrid") %>%
  addCircleMarkers(~longitude, ~latitude, radius = ifelse(listings$price > 100, 2, 0.2),
                   color = ifelse(listings$price > 100, "red", "green"),
                   fillOpacity = 0.4)

We observe that, in general, prices of homestays are higher in the center of the city.
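
The visual impression could be backed by a per-neighbourhood summary. The sketch below uses a toy table with made-up prices; only the column names `neighbourhood_cleansed` and `price` follow the listings dataset.

```r
library(dplyr)

# Hypothetical listings with made-up prices
toy_listings <- tibble(
  neighbourhood_cleansed = c("Belltown", "Belltown", "Ballard", "Ballard", "Ballard"),
  price = c(150, 250, 80, 100, NA)
)

price_by_hood <- toy_listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(
    n_listings   = n(),
    median_price = median(price, na.rm = TRUE),
    share_pricey = mean(price > 100, na.rm = TRUE)  # share above the $100 cut-off
  ) %>%
  arrange(desc(median_price))
```

Sorting by median price puts the most expensive neighbourhoods first, which makes it easy to see whether central areas lead the table.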

4.3 Listings by neighbourhood

Listing density across each neighbourhood is shown below -

# Create a palette that maps factor levels to colors
factpal <- colorFactor(topo.colors(25), listings$neighbourhood_cleansed)

popup <- paste0("<strong>'hood: </strong>", listings$neighbourhood_cleansed)

leaflet(listings) %>% addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(
    color = ~factpal(neighbourhood_cleansed),
    stroke = FALSE, fillOpacity = 0.5, radius = 1.2,
    popup = ~popup
  )

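It seems strange that the neighbourhoods coloured green and blue are spread all over the city, which could be due to a data quality issue: the neighbourhood_cleansed column may still not be very accurate. Another possible cause is the palette: topo.colors(25) provides only 25 base colours, so if there are more than 25 neighbourhoods, several of them receive nearly identical shades.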
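
One way to probe whether scattered colours come from the data or from the colour mapping is to compare the number of neighbourhoods with the number of palette colours, and to count listings per neighbourhood. The sketch below uses hypothetical labels, and `palette_size` is a stand-in for the 25 passed to topo.colors().

```r
# Hypothetical neighbourhood labels
hood <- factor(c("Ballard", "Belltown", "Ballard", "Fremont", "Belltown", "Ballard"))

palette_size <- 2            # stands in for the 25 in topo.colors(25)
n_hoods      <- nlevels(hood)

# TRUE when some neighbourhoods cannot get a distinct base colour
palette_too_small <- n_hoods > palette_size

# Listings per neighbourhood, largest first
listings_per_hood <- sort(table(hood), decreasing = TRUE)
```

If `palette_too_small` is TRUE on the real data, similar colours in different parts of the map would not by themselves indicate mislabelled neighbourhoods.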

4.4 Sentiment Analysis

Do superhosts always get the best reviews?

This question has been tackled here with following methodology:

The AFINN lexicon is used to assign every word a score from -5 to +5, where +5 is the most positive sentiment a word can convey and -5 the most negative.

Then the average of these scores is calculated for each listing, and information about whether the listing is offered by a superhost is retrieved.

Finally, we depict the information using a histogram, as shown below -

review_words <- review %>%
  select(listing_id, reviewer_id, reviewer_name, comments) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

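# Recent tidytext versions expose AFINN via get_sentiments() with a `value` column;
# older versions used sentiments %>% filter(lexicon == "AFINN") with a `score` column
AFINN <- get_sentiments("afinn") %>%
  select(word, afinn_score = value)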

reviews_sentiment <- review_words %>%
  inner_join(AFINN, by = "word") %>%
  group_by(listing_id ) %>%
  summarize(sentiment = mean(afinn_score))

reviews_sentiment2 <- reviews_sentiment %>% 
  inner_join(listings,by=c("listing_id"="id")) %>%
               select(host_is_superhost, listing_id, sentiment)

ggplot(reviews_sentiment2, aes(x=sentiment))+
  geom_histogram(binwidth = 0.05, aes(fill = host_is_superhost))+
  labs(title="Histogram of AFINN lexicon sentiment score",
       x="Mean AFINN Score", y="Count") +
  theme_minimal()

It turns out that superhost status does not guarantee that customers will have the best experience.

Most review words captured by the AFINN lexicon are positive: the histogram of mean AFINN scores is approximately normal, with its mean at about +2.
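
The scoring steps can be traced on a small example, and the histogram can be complemented with a numeric summary per host group. Everything below is hypothetical: `toy_lexicon` only imitates the AFINN format, its scores are not taken from the real lexicon, and the "t"/"f" coding of host_is_superhost is an assumption.

```r
library(dplyr)

# Hypothetical mini-lexicon in the AFINN style (integer scores from -5 to +5);
# the scores are illustrative and not taken from the real lexicon
toy_lexicon <- tibble(word = c("great", "clean", "dirty", "awful"),
                      afinn_score = c(3, 2, -2, -3))

toy_review_words <- tibble(
  listing_id = c(1, 1, 1, 2, 2, 3),
  word       = c("great", "clean", "stay", "dirty", "great", "awful")
)
toy_listings <- tibble(id = c(1, 2, 3), host_is_superhost = c("t", "f", "f"))

# Mean word score per listing, then attach the superhost flag
listing_sentiment <- toy_review_words %>%
  inner_join(toy_lexicon, by = "word") %>%   # drops unscored words such as "stay"
  group_by(listing_id) %>%
  summarise(sentiment = mean(afinn_score)) %>%
  inner_join(toy_listings, by = c("listing_id" = "id"))

# Numeric comparison of the two host groups
sentiment_by_host <- listing_sentiment %>%
  group_by(host_is_superhost) %>%
  summarise(n_listings       = n(),
            mean_sentiment   = mean(sentiment),
            median_sentiment = median(sentiment))
```

Note that words missing from the lexicon do not count towards a listing's mean, so a listing's score reflects only its scored words.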

4.5 Word Cloud

We can also explore the most common listing words through a word cloud, as depicted below -

#making a data frame of words and its frequency
cloud <- as.data.frame(listings_words %>% 
                         group_by(word) %>%
                         summarise(no_rows = length(word)))

#building the word cloud (brewer.pal() requires n >= 3)
wordcloud(words = cloud$word, freq = cloud$no_rows, min.freq = 5,
          max.words=150, random.order=FALSE, random.color=FALSE, rot.per=0.33, 
          colors=brewer.pal(3, "Dark2"))

5. Summary

The following questions have been attempted:

What are the most common words mentioned in the listing descriptions?

Where are affordable listings located in the city?

Do listing descriptions showcase the same attributes which customers really care about?

How are the listings distributed across neighbourhood?

Implementation summary:

The data was imported from CSV files and manipulated for the analysis. It was then reviewed graphically using histograms, a word cloud, and horizontal bar charts, and sentiment analysis was applied to the reviews. Reviews given to superhosts on Airbnb were also analysed. In addition, leaflet maps were used to classify the Seattle listings by price and by neighbourhood.

Insights obtained:

The top 20 keywords in reviews tell us that comfort, cleanliness, location and space are some of the important factors that matter to customers, and some of these are rightly advertised in listing descriptions.
We observe that, in general, prices of homestays are higher in the center of the city.
$100 is the median price of an Airbnb homestay in Seattle.
It turns out that superhost status does not guarantee that customers will have the best experience.
Most review words captured by the AFINN lexicon are positive: the histogram of mean AFINN scores is approximately normal, with its mean at about +2.
There might be a data quality issue in the neighbourhood_cleansed column.

Data Wrangling in R Project

Akhilesh Agnihotri

03 December, 2017