AirBnB Introduction
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This analysis describes the trends and overview of homestays in Chicago, IL.
Summary:
The purpose of this report is to analyse customer reviews of Airbnb properties in Chicago, identify the words that appear most often in those reviews, and see how pricey listings are spread across the city.
I carried out the following analyses to achieve this goal:
Sentiment analysis of the 5 most reviewed neighbourhoods in Chicago, IL - to capture the general vibe of the most reviewed neighbourhoods
Word cloud of the words in customer reviews - to separate frequently occurring words from rarely occurring ones
List of the 20 most frequent words in customer reviews - to understand the words that first come to customers' minds when they think about their Airbnb stay in Chicago
Map of Chicago displaying pricey listings - to locate pricey listings across the city
Implementation:
The data was scraped and prepared for the analysis, then reviewed to determine the general vibe in each neighbourhood.
This analysis helps its readers understand which words customers use to summarise their stay in one of our Airbnb Chicago properties. The company can use the positive words in its branding and marketing, and investigate the factors behind the negative words.
rmarkdown: For publishing HTML report of this analysis
tidyverse: For graphs, plotting and other general functions
readr: For reading data files
tidytext: For splitting the reviews into words and scoring them based on sentiments
stringr: For common string operations
wordcloud: For displaying wordcloud
leaflet: For preparing the interactive map
DT: For better data view
#loading packages
library(readr)
library(tidyverse)
library(tidytext)
library(stringr)
library(wordcloud)
library(leaflet)
library(DT)
Original Source of data: InsideAirbnb
Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.
Explanation of data source: The original purpose of the data was to show people how Airbnb is really being used and how it is affecting their neighbourhoods. By analyzing publicly available information about a city's Airbnb listings, Inside Airbnb provides filters and key metrics so people can see how Airbnb is being used to compete with the residential housing market. The data was posted on their website on 10th May 2017. The original reviews dataset had 132353 rows and 6 variables (columns), and the listing table had 5207 rows and 16 columns. These tables have been combined for ease of data analysis.
The first step is to read the review and listing data from the source files and check the formatting of the columns. All the variables have an appropriate data type, as seen in the structure of the tables.
#reading listing data
listing <- read_csv("C:/Users/jatin/Desktop/R_FINAL_PROJECT/listings.csv")
names(listing)
listing <- arrange(listing, id)
#reading review data (this step is not shown in the report; the file name reviews.csv is an assumption)
review <- read_csv("C:/Users/jatin/Desktop/R_FINAL_PROJECT/reviews.csv")
For the analysis, only a few variables from the datasets were used. These variables were combined in a single table to ease analysis. In brief, I used left_join to attach the listing table to the review table, matching listing_id in the reviews to id in the listings.
#Getting prices and neighbourhood in reviews table
review <- review %>% left_join(listing, by=c("listing_id"="id"))
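To make the join above easier to follow, here is a minimal sketch on made-up data. The tables review_toy and listing_toy and all their values are invented for illustration; only the column names follow the real data.

```r
library(dplyr)

# Toy stand-ins for the two tables (all values are invented)
review_toy <- tibble(
  listing_id = c(101, 101, 102, 999),
  id         = c(1, 2, 3, 4),   # review id, not the listing id
  comments   = c("Great stay", "Very clean", NA, "Nice host")
)
listing_toy <- tibble(
  id            = c(101, 102, 103),
  neighbourhood = c("Lake View", "West Town", "Loop"),
  price         = c(95, 150, 210)
)

# Match listing_id in the reviews against id in the listings;
# the review table keeps its own id column
joined <- review_toy %>%
  left_join(listing_toy, by = c("listing_id" = "id"))

# left_join keeps every review, so the row count is unchanged
nrow(joined) == nrow(review_toy)

# Reviews whose listing_id has no match get NA in the listing columns
sum(is.na(joined$price))
```

A left join is a reasonable choice here because the reviews are the unit of analysis: no review is dropped, and any review without a matching listing shows up as NA in the listing columns.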
Removing NA values from the combined data is our next step. In the data, we see that neighbourhood_group has no values. Hence we remove it from the final dataset and check for missing values in the rest of the data.
#Cleaning the data
review <- subset(review,select=-(neighbourhood_group))
colSums(is.na(review))
##                     listing_id                             id
##                              0                              0
##                           date                    reviewer_id
##                              0                              0
##                  reviewer_name                       comments
##                              0                            194
##                           name                        host_id
##                              4                              0
##                      host_name                  neighbourhood
##                              0                              0
##                       latitude                      longitude
##                              0                              0
##                      room_type                          price
##                              0                              0
##                 minimum_nights              number_of_reviews
##                              0                              0
##                    last_review              reviews_per_month
##                              0                              0
## calculated_host_listings_count               availability_365
##                              0                              0
Summing the NA values by column shows where data is missing. There are 194 missing values in the comments variable and 4 in the name variable. To analyse reviews, we can remove the rows which do not have any comments. I want to retain as many reviews as I can, therefore I am not removing reviews that are only missing the property name.
review <- filter(review,!is.na(comments))
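The cleaning step can be checked on a small example. The sketch below uses base R on an invented data frame (reviews_toy is hypothetical). It counts missing values per column, drops only the rows without a comment, and confirms that the number of rows removed equals the NA count for comments.

```r
# Toy data frame; all values are invented for illustration
reviews_toy <- data.frame(
  id       = 1:5,
  comments = c("Great stay", NA, "Very clean", NA, "Nice host"),
  name     = c("Loft", "Studio", NA, "Condo", "Flat"),
  stringsAsFactors = FALSE
)

# Missing values per column, as colSums(is.na(review)) reports for the real data
na_counts <- colSums(is.na(reviews_toy))

# Keep rows that have a comment; rows that only lack a name are retained
cleaned <- reviews_toy[!is.na(reviews_toy$comments), ]

# The number of rows dropped should equal the NA count for comments
rows_dropped <- nrow(reviews_toy) - nrow(cleaned)
stopifnot(rows_dropped == na_counts[["comments"]])
```

The same check on the real data would compare the row count before and after filtering against the 194 missing comments.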
Cleaned Dataset
Exactly 194 missing values were found. These rows were removed, leaving 132159 entries. Below is a summary of the variables relevant to this analysis, out of the 20 columns in the cleaned table.
Date: Date of review
Reviewer_Name: Name of the reviewer
Comments: Review of the stay
Neighbourhood: Neighbourhood of the property. We identified 71 different neighbourhoods of Chicago in the dataset.
Price: Price of the property per night. Prices range from $0 to $5000, with a mean of $110.6. A price of zero may appear for properties that were on Airbnb in the past but are no longer listed. Prices of $130 and above fall in the upper quantile range.
Latitude Longitude: Latitude and Longitude of the property
Room_type: Whether the listing is an entire home/apartment, a private room, or a shared room
Minimum_nights: Required minimum number of nights per stay
Number_of_reviews: Total number of reviews
last_review: Date of the last review
availability_365: Number of days out of 365 that the listing is available
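The report does not show how the price figures were obtained. The sketch below shows one way such figures could be computed, assuming that "upper quantile" refers to the 75th percentile. The vector price_toy is invented; with the real data it would be replaced by the price column.

```r
# Hypothetical nightly prices; replace with the real price column
price_toy <- c(0, 45, 60, 75, 90, 110, 130, 150, 400, 5000)

# Minimum, quartiles, mean and maximum in one call
price_summary <- summary(price_toy)

# Assumption: "upper quantile" means the 75th percentile (R's default type 7)
upper_cut <- quantile(price_toy, probs = 0.75)

# Listings priced above that cut-off would count as pricey
pricey <- price_toy[price_toy > upper_cut]
```

Because the mean is pulled upwards by a few very expensive listings, a percentile cut-off is a more stable definition of "pricey" than a multiple of the mean.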
#Breaking customer reviews in words
neighbourhood_words <- review %>%
  select(comments, neighbourhood) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))
#NRC lexicon via get_sentiments(); in newer tidytext the sentiments table has no lexicon column
nrc <- get_sentiments("nrc") %>%
  select(word, sentiment)
#Total words per neighbourhood (denominator for the proportions)
prop_tot_words <- count(neighbourhood_words, neighbourhood, name = "total_words")
#Identify 5 most reviewed neighbourhoods
most_reviews_5 <- review %>%
  group_by(neighbourhood) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(5, wt = count)
prop_tot_words <- most_reviews_5 %>%
  left_join(prop_tot_words, by = "neighbourhood")
#count words associated with each type of sentiment in 5 most reviewed neighbourhoods
by_prop_sentiment <- neighbourhood_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, neighbourhood) %>%
  inner_join(prop_tot_words, by = "neighbourhood") %>%
  #n is the sentiment word count from count(); count is the number of reviews
  mutate(prop = round(n / total_words * 100, digits = 1))
#Plotting GGPlot for showing sentiments
ggplot(data = by_prop_sentiment) +
  geom_bar(mapping = aes(x = neighbourhood,
                         y = prop),
           stat = "identity", fill = "blue") +
  facet_wrap( ~ sentiment) +
  labs(title = "Sentiment Analysis in 5 Most Reviewed Neighbourhoods",
       x = "Neighbourhood", y = "Proportion \n (sentiment word count / total word count)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
From the graph above it is hard to draw any conclusion, as the frequency of words is nearly the same for each sentiment.
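The proportion on the y-axis (sentiment word count / total word count) can be traced by hand on a tiny example. In the sketch below, reviews_toy and the four-word lexicon_toy are invented stand-ins for the real reviews and the NRC lexicon, so that it runs without downloading a lexicon. The stop word filter is left out to keep the toy word counts easy to verify.

```r
library(dplyr)
library(tidytext)

# Invented reviews and a hand-made lexicon (stand-ins for the real data and NRC)
reviews_toy <- tibble(
  neighbourhood = c("Lake View", "Lake View", "West Town"),
  comments      = c("Clean and quiet room", "Noisy street but clean", "Quiet, clean, perfect!")
)
lexicon_toy <- tibble(
  word      = c("clean", "quiet", "perfect", "noisy"),
  sentiment = c("positive", "positive", "positive", "negative")
)

# One row per word; unnest_tokens lower-cases and strips punctuation
words_toy <- reviews_toy %>%
  unnest_tokens(word, comments)

# Denominator: total number of words per neighbourhood
totals_toy <- count(words_toy, neighbourhood, name = "total_words")

# Numerator: words per sentiment, then the share of all words in percent
props_toy <- words_toy %>%
  inner_join(lexicon_toy, by = "word") %>%
  count(neighbourhood, sentiment, name = "sentiment_words") %>%
  inner_join(totals_toy, by = "neighbourhood") %>%
  mutate(prop = round(sentiment_words / total_words * 100, digits = 1))
```

Keeping the numerator and the denominator in separately named columns (sentiment_words and total_words) makes it harder to divide by the wrong count when several count columns meet in one join.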
In this section, we build a word cloud in which the size of each word reflects how often that word appears in the reviews.
listings_words <- review %>%
  select(id, comments) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))
#number of times each word appears across all reviews
cloud <- listings_words %>%
  count(word, name = "no_rows") %>%
  as.data.frame()
#building the word cloud
wordcloud(words = cloud$word, freq = cloud$no_rows, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.1,
          colors = brewer.pal(8, "Dark2"))
This step showed that the most frequently appearing words in the reviews are generally positive, such as “perfect”, “easy”, “quiet”, “nice”, “comfortable” and “clean”.
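The word filter str_detect(word, "^[a-z']+$") decides which tokens reach the word cloud, so it is worth seeing what it keeps. The short sketch below applies the same pattern to a few invented tokens.

```r
library(stringr)

# Invented example tokens, as they might look after unnest_tokens
tokens <- c("clean", "host's", "10min", "caf\u00e9", "wi-fi", "5")

# Keep tokens made up only of lower-case a-z and straight apostrophes
keep <- str_detect(tokens, "^[a-z']+$")
kept_tokens <- tokens[keep]
dropped_tokens <- tokens[!keep]
```

Numbers, hyphenated words and words with accented letters are dropped by this pattern, which is a sensible default for English reviews but would remove many words from reviews written in other languages.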
Next, we identify the 20 most frequent words that customers associate with Airbnb Chicago properties.
#We need to use the unnest_tokens function to obtain one-row-per-word-per-review
listings_words <- review %>%
  select(id, comments, price) %>%
  unnest_tokens(word, comments) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))
#plot the graph
common_listings <- listings_words %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  top_n(n = 20, wt = count) %>%
  ggplot() +
  geom_bar(mapping = aes(x = reorder(word, count),
                         y = count),
           stat = "identity", fill = "light grey") +
  coord_flip() +
  #labs() names the aesthetics: x holds the words, drawn on the vertical axis after coord_flip()
  labs(title = "Top 20 Words in Reviews",
       x = "Words", y = "Word count") +
  theme_minimal()
print(common_listings)
We now plot the Airbnb homestays by latitude and longitude, marking pricey listings (price per night greater than $130, the upper quantile range) with white circles to help us identify the pricey locations and neighbourhoods of Chicago.
#one marker per listing rather than one per review
review %>% distinct(listing_id, .keep_all = TRUE) %>% filter(price > 130) %>% leaflet::leaflet() %>%
  leaflet::addProviderTiles("CartoDB.DarkMatter") %>%
  leaflet::addCircleMarkers(~longitude, ~latitude, radius = 2, color = "white", fillOpacity = 0.3)
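The map above draws only the listings priced above $130. A variant that also shows the remaining listings in a second colour could be sketched as follows. This is an illustration rather than the code used for the report: listings_toy, its coordinates and prices, and the marker_colour column are all invented.

```r
library(leaflet)

# Hypothetical listings around Chicago; coordinates and prices are invented
listings_toy <- data.frame(
  latitude  = c(41.94, 41.92, 41.90, 41.88),
  longitude = c(-87.65, -87.64, -87.67, -87.63),
  price     = c(80, 145, 200, 60)
)

# Threshold quoted in the report
price_cut <- 130

# White for listings above the threshold, yellow for the rest
listings_toy$marker_colour <- ifelse(listings_toy$price > price_cut, "white", "yellow")

# The formula ~marker_colour picks the colour row by row
price_map <- leaflet(listings_toy) %>%
  addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(~longitude, ~latitude, radius = 2,
                   color = ~marker_colour, fillOpacity = 0.3)
```

Printing price_map in an interactive session or an R Markdown chunk renders the widget. Storing the colour in a column keeps the threshold in one place if the cut-off changes.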
5.1 Summarizing the problem statement: Analysed the vibes of Airbnb homestays in Chicago, IL. The analysis includes sentiment results by neighbourhood, a word cloud of the most frequent words in customer reviews, and the top words that describe the listings.
5.2 Summarizing the implementation: The data was scraped and prepared for the analysis, then reviewed graphically to determine the general vibe in each neighbourhood. The results were presented using faceted vertical bar graphs for the sentiment analysis, a word cloud, and a horizontal bar chart. A leaflet map was also used to locate the pricey listings of Chicago.
5.3 Summary/Insights: The sentiment results showed a mixed vibe in Chicago neighbourhoods, with nearly equal positive and negative emotions. The word cloud and the bar chart together, however, show the positive words that customers frequently associate with their Airbnb stay.
Some extra insights from the analysis:
Clean, Nice, Quiet, Comfortable, Easy and Perfect - customers associate Airbnb homestays with positive words like these. Airbnb can use this to understand how branding built around these words could help create a buzz.
The upper quantile of listing prices starts at $130
The 5 most reviewed neighbourhoods are Lake View, Lincoln Park, Logan Square, Near North Side and West Town