Project Proposal: Chicago Airbnb Sentiment Analysis

Airbnb Introduction

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This analysis gives an overview of Airbnb homestays in Chicago, IL, and the trends in their reviews and prices.

1. Introduction

Summary:

The purpose of this report is to analyse customer reviews of Airbnb properties in Chicago, identify the words that appear most frequently in those reviews, and see how pricey listings are spread across the city.

The following analyses were carried out to achieve this goal:

Sentiment analysis of the 5 most reviewed neighbourhoods in Chicago, IL - to capture the general vibe in the most reviewed neighbourhoods
Word cloud of the words in customer reviews - to separate the most frequently occurring words from the less frequent ones
List of the top 20 most frequently occurring words - to understand the words that first come to customers' minds when they think about an Airbnb stay in Chicago
Map of Chicago displaying pricey listings - to locate pricey listings across the city

Implementation:

The data was obtained from Inside Airbnb and prepared for the analysis. It was then reviewed to determine the general vibe in each neighbourhood.

This analysis helps its readers understand which words customers use to summarise their stay in an Airbnb Chicago property. The company can use the positive words in its branding and marketing, and investigate the factors behind the negative words.

2. Packages Required

rmarkdown: For publishing HTML report of this analysis

tidyverse: For graphs, plotting and other general functions

readr: For reading data files

tidytext: For splitting the reviews into words (tokenizing) and scoring them by sentiment

stringr: For common string operations

wordcloud: For displaying the word cloud

leaflet: For preparing the interactive map

DT: For better data view

#loading packages
library(readr)
library(tidyverse)
library(tidytext)
library(stringr)
library(wordcloud)
library(leaflet)
library(DT)

3. Data Preparation

3.1 Data Source

Original Source of data: InsideAirbnb

3.2 Explanation of Source Data

Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.

Explanation of data source: The original purpose of the data was to show people how Airbnb is really being used and how it affects their neighbourhood. By analyzing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so people can see how Airbnb is being used to compete with the residential housing market. The data was posted on their website on 10th May 2017. The original reviews data set had 132353 rows and 6 variables (columns), and the listing table had 5207 rows and 16 columns. These tables have been combined for ease of analysis.

3.3 Original Data Set

3.4 Cleaning Dataset

The first step is to read the reviews and listing data from source and check the formatting of the columns. All the variables have appropriate data type as seen in structure of the table.

#reading listing data
listing <- read_csv("C:/Users/jatin/Desktop/R_FINAL_PROJECT/listings.csv")

#reading reviews data (assumption: reviews.csv sits next to listings.csv; adjust the path if not)
review <- read_csv("C:/Users/jatin/Desktop/R_FINAL_PROJECT/reviews.csv")

names(listing)

#sorting listings by id (bare column name; no need for listing$id inside arrange)
listing <- arrange(listing, id)

Only a few variables from the two datasets were needed for the analysis. To make the analysis easier, they were combined in a single table: left_join attaches the listing columns to each row of the review table, matching the review's listing_id to the listing's id.

#Getting prices and neighbourhood in reviews table
review <- review %>% left_join(listing, by=c("listing_id"="id"))
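As a minimal sketch of what this join does, the toy tables below stand in for the review and listing data (all values are made up). It assumes the same key names as above, `listing_id` in the reviews and `id` in the listings.

```r
library(dplyr)

# Toy stand-ins for the two Inside Airbnb tables (values are made up)
toy_review <- tibble(
  listing_id = c(1, 1, 2, 3),
  id         = c(101, 102, 103, 104),
  comments   = c("Great stay", "Very clean", NA, "Quiet street")
)
toy_listing <- tibble(
  id            = c(1, 2),
  neighbourhood = c("Lake View", "West Town"),
  price         = c(95, 150)
)

# Every review keeps its row; listing columns are attached where the key matches
toy_joined <- toy_review %>%
  left_join(toy_listing, by = c("listing_id" = "id"))

# Reviews whose listing is absent from the listing table get NA in the joined columns
toy_joined %>% filter(is.na(neighbourhood))
```

Because the listing `id` is used as the join key, the `id` column left in the result is the review's own id, which matches the column order seen in the missing-value summary further down.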

The next step is to handle NA values in the combined data. The neighbourhood_group column has no values at all, so we remove it from the final dataset and then check for missing values in the remaining columns.

#Cleaning the data

review <- subset(review,select=-(neighbourhood_group))
colSums(is.na(review))

##                     listing_id                             id 
##                              0                              0 
##                           date                    reviewer_id 
##                              0                              0 
##                  reviewer_name                       comments 
##                              0                            194 
##                           name                        host_id 
##                              4                              0 
##                      host_name                  neighbourhood 
##                              0                              0 
##                       latitude                      longitude 
##                              0                              0 
##                      room_type                          price 
##                              0                              0 
##                 minimum_nights              number_of_reviews 
##                              0                              0 
##                    last_review              reviews_per_month 
##                              0                              0 
## calculated_host_listings_count               availability_365 
##                              0                              0

Summing the NA values by column shows how many values are missing in each one: 194 in the comments variable and 4 in the name variable. Since the analysis is about reviews, rows without any comments can be removed. To retain as many reviews as possible, reviews with a missing property name are kept.

review <- filter(review,!is.na(comments))

3.5 Cleaned Dataset

Exactly 194 rows with missing comments were found. Removing them leaves 132159 reviews.
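A quick way to confirm a figure like this is to compare row counts before and after the filter. The sketch below uses a made-up five-row table, and `drop_missing_comments` is a hypothetical helper, not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: drop rows with missing comments and report how many were removed
drop_missing_comments <- function(df) {
  cleaned <- filter(df, !is.na(comments))
  message(nrow(df) - nrow(cleaned), " rows removed, ", nrow(cleaned), " rows kept")
  cleaned
}

# Toy data with two missing comments
toy <- tibble(id = 1:5, comments = c("nice", NA, "clean", NA, "quiet"))
toy_clean <- drop_missing_comments(toy)
```

Applied to the real review table, the same comparison should report 194 rows removed if the counts above hold.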

3.6 Summary of Variables

Below is a summary of the variables of interest. Of the 21 variables in the combined table, only a handful are used in this analysis.

Date: Date of review

Reviewer_Name: Name of the reviewer

Comments: Review of the stay

Neighbourhood: Neighbourhood of the property. We identified 71 different neighbourhoods of Chicago in the dataset.

Price: Price of the property per night. Prices range from $0 to $5000, with a mean of $110.6. A price of zero may appear for properties that were listed on Airbnb in the past but are no longer listed. Prices of $130 and above fall in the upper quantile range.

Latitude, Longitude: Latitude and longitude of the property

Room_type: Type of room: private room, entire home/apartment, or shared room

Minimum_nights: Minimum number of nights required for a stay

Number_of_reviews: Total number of reviews

Last_review: Date of the last review

Availability_365: Number of days the listing is available out of 365
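The price figures quoted above (range, mean, upper quantile cut-off) are the kind of values that `summary()` and `quantile()` return. The sketch below runs them on made-up prices and assumes that "upper quantile" refers to the 75th percentile, which the report does not state explicitly.

```r
# Made-up nightly prices standing in for review$price
toy_price <- c(0, 45, 60, 80, 95, 110, 130, 150, 400, 5000)

# Minimum, quartiles, mean and maximum in one call
summary(toy_price)

# Assumed definition of the upper quantile cut-off: the 75th percentile
upper_cut <- quantile(toy_price, 0.75)

# Number of zero-price entries, worth checking before interpreting the mean
sum(toy_price == 0)
```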

4. Exploratory Data Analysis

4.1 Sentiment Analysis by Neighbourhood

#Breaking customer reviews into words, dropping stop words and non-alphabetic tokens
neighbourhood_words <- review %>%
  select(comments, neighbourhood) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

#NRC lexicon (recent tidytext versions fetch it through the textdata package)
nrc <- get_sentiments("nrc") %>% select(word, sentiment)

#Total words per neighbourhood, keeping the 10 neighbourhoods with the most words
prop_tot_words <- neighbourhood_words %>%
  count(neighbourhood, name = "total_words") %>%
  arrange(desc(total_words)) %>%
  top_n(10, wt = total_words)

#Identify 5 most reviewed neighbourhoods
most_reviews_5 <- review %>%
  group_by(neighbourhood) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(5, wt = count)

prop_tot_words <- most_reviews_5 %>% left_join(prop_tot_words, by = "neighbourhood")

#count words associated with each type of sentiment in 5 most reviewed neighbourhoods
by_prop_sentiment <- neighbourhood_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, neighbourhood, name = "sentiment_words") %>%
  inner_join(prop_tot_words, by = "neighbourhood") %>%
  #proportion = sentiment word count / total word count, as a percentage
  #(the joined "count" column is the number of reviews, so it is not used here)
  mutate(prop = round(sentiment_words / total_words * 100, digits = 1))

#Plotting GGPlot for showing sentiments
ggplot(data = by_prop_sentiment) +
  geom_bar(mapping = aes(x = neighbourhood,
                       y = prop),
           stat = "identity", fill = "blue") +
  facet_wrap( ~ sentiment) +
  labs(title = "Sentiment Analysis in 5 Most Reviewed Neighbourhoods",
       x ="Neighbourhood", y="Proportion \n (sentiment word count / total word count)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The graph above makes it hard to draw any conclusion, because the word frequencies for each sentiment are nearly the same.
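The y-axis of the plot is labelled as sentiment word count divided by total word count. A toy sketch of that calculation is given below. The words, neighbourhood labels and the four-word lexicon are made up and stand in for `neighbourhood_words` and the NRC lexicon.

```r
library(dplyr)

# Toy tokens and a toy lexicon (both made up)
toy_words <- tibble(
  neighbourhood = c("A", "A", "A", "A", "B", "B", "B", "B"),
  word          = c("clean", "dirty", "walk", "lovely", "noisy", "clean", "train", "bed")
)
toy_lexicon <- tibble(
  word      = c("clean", "lovely", "dirty", "noisy"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Denominator: total words per neighbourhood
toy_totals <- count(toy_words, neighbourhood, name = "total_words")

# Numerator: sentiment words per neighbourhood, then the percentage
toy_prop <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(neighbourhood, sentiment, name = "sentiment_words") %>%
  inner_join(toy_totals, by = "neighbourhood") %>%
  mutate(prop = round(sentiment_words / total_words * 100, digits = 1))

toy_prop
```

Keeping the numerator and denominator in clearly named columns makes it easy to check that each bar in the faceted plot reflects a sentiment-specific count rather than a per-neighbourhood total.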

4.2 Word Cloud - All Reviews

In this section, we build a word cloud in which the size of each word reflects how often it appears in the reviews.

listings_words <- review %>%
  select(id, comments) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

cloud <- as.data.frame(listings_words %>% 
                         group_by(word) %>%
                         summarise(no_rows = length(word)))

#building the word cloud
wordcloud(words = cloud$word, freq = cloud$no_rows, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.1, 
          colors=brewer.pal(8, "Dark2"))

The word cloud shows that the most frequently appearing words in the reviews are generally positive, such as “perfect”, “easy”, “quiet”, “nice”, “comfortable” and “clean”.

4.3 Frequently Occurring Words

Now we identify the 20 most frequently appearing words that customers associate with Airbnb Chicago properties.

#We need to use the unnest_tokens function to obtain one row per word per review
listings_words <- review %>%
  select(id, comments, price) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

#plot the 20 most frequent words
common_listings <- listings_words %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  top_n(n = 20, wt = count) %>%
  ggplot() +
  geom_bar(mapping = aes(x=reorder(word, count),
                         y=count),
           stat="identity", fill = "light grey") +
  coord_flip() +
  #labels follow the aesthetics: x is the word, y is its count (axes are flipped for display)
  labs(title="Top 20 Words in Reviews",
       x="Words", y="Word count") +
  theme_minimal()

print(common_listings)
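The same ranking can also be inspected as a table before plotting. The sketch below uses three made-up reviews and keeps the top 2 words instead of the top 20. `count(sort = TRUE)` and `slice_head()` are an alternative to the `group_by`/`summarise`/`top_n` chain, not the approach used above.

```r
library(dplyr)
library(tidytext)

# Three made-up reviews
toy_reviews <- tibble(
  id = 1:3,
  comments = c("Clean and quiet apartment",
               "Very clean, great location",
               "Quiet street, clean room")
)

# Tokenize, drop stop words, count and keep the most frequent words
toy_top <- toy_reviews %>%
  unnest_tokens(word, comments) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 2)

toy_top
```

Unlike `top_n`, `slice_head` after a sorted count returns exactly the requested number of rows even when several words are tied at the cut-off.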

4.4 Upper Quantile Places in Chicago

We now plot the pricey Airbnb homestays (price per night greater than $130, the upper quantile range) by latitude and longitude, drawn as white circles on a dark base map, to help identify the pricey locations and neighbourhoods of Chicago.

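#keep one row per pricey listing so each property is drawn once rather than once per review
review %>% filter(price>130) %>% distinct(listing_id, .keep_all = TRUE) %>% leaflet::leaflet() %>% 
  leaflet::addProviderTiles("CartoDB.DarkMatter") %>%
  leaflet::addCircleMarkers(~longitude, ~latitude, radius = 2, color = "white", fillOpacity = 0.3)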
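If the remaining listings should appear on the same map for contrast, one option is to colour the markers by price band. The sketch below is only an illustration: the coordinates and prices are made up, and the `band` column, the `band_pal` palette and the yellow/white colour choice are assumptions rather than part of the analysis above.

```r
library(dplyr)
library(leaflet)

# Made-up listings; coordinates and prices are illustrative only
toy_listings <- tibble(
  latitude  = c(41.94, 41.92, 41.90, 41.89),
  longitude = c(-87.65, -87.64, -87.67, -87.63),
  price     = c(80, 200, 120, 350)
) %>%
  mutate(band = if_else(price > 130, "pricey", "other"))

# Map each price band to a colour: "other" to yellow, "pricey" to white
band_pal <- colorFactor(palette = c("yellow", "white"),
                        domain = c("other", "pricey"))

# Build the map widget; printing price_map renders it
price_map <- leaflet(toy_listings) %>%
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(~longitude, ~latitude, radius = 2,
                   color = ~band_pal(band), fillOpacity = 0.3)
```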

5. Summary

5.1 Summarizing the problem statement: This report analysed the general vibe of Airbnb homestays in Chicago, IL. The analysis covers sentiment by neighbourhood, a word cloud of the most frequent words in customer reviews, and the top words that describe the listings.

5.2 Summarizing the implementation: The data was obtained from Inside Airbnb and prepared for the analysis. It was then reviewed graphically to determine the general vibe in each neighbourhood. The sentiment analysis was presented with faceted vertical bar graphs, a word cloud, and a horizontal bar chart. A leaflet map was also used to locate the pricey listings of Chicago.

5.3 Summary/Insights: The results showed a mixed vibe in Chicago neighbourhoods, with nearly equal positive and negative emotions. The word cloud and bar chart together, however, show the positive words that customers frequently associate with their Airbnb stay.

Some extra insights from the analysis are listed below:

Clean, Nice, Quiet, Comfortable, Easy and Perfect - customers associate Airbnb homestays with positive words like these. Airbnb can use this to understand how branding built around these words could help create a buzz.
The upper quantile price of the listings starts from $130
The 5 most reviewed neighbourhoods are Lake View, Lincoln Park, Logan Square, Near North Side and West Town