Final Project Data Wrangling in R

AirBnB Analysis

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This analysis gives an overview of homestays in Boston, MA, and the trends among them.

1. Synopsis

Problem Statement:

The goal is to analyse the vibes of homestays listed on AirBnB in Boston, MA. This includes gauging neighbourhood sentiment, analysing listed properties by type, examining the Airbnb concept of a Superhost, and finding the words that describe expensive listings.

Implementation:

The data was scraped and manipulated as needed for the analysis. It was then reviewed graphically to determine the general vibe in the neighbourhoods.

Summary:

The analysis shows that, overall, there is a positive vibe from the listings in Boston, MA. Other detailed insights are summarised in the last section.

2. Packages Required

The following packages are required, listed with their use:

tidytext = converts text to and from tidy formats
DT = HTML display of data
tidyverse = data manipulation; works in harmony with the other packages
stringr = string operations
magrittr = pipe operator in R
leaflet = interactive leaflet maps in R
ggplot2 = graphical representation in R
dplyr = data manipulation in R
tm = text mining
wordcloud = word cloud generator
ggmap = visualization combining spatial information with static maps from Google Maps

library(tidytext)
library(DT)
library(tm)
library(wordcloud)
library(tidyverse)
library(stringr)
library(magrittr)
library(leaflet)
library(ggplot2)
library(ggmap)
library(dplyr)

3. Data Preparation

3.1 Data Source

Original Data Source AirBnB website: Original Data Set

3.2 Explanation of source data

Explanation of data source: The data was originally published to show people how AirBnB is really being used and how it affects their neighbourhoods. By analyzing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so people can see how Airbnb is being used to compete with the residential housing market. The data was posted on their website on 7th September 2016. The original data set has 3585 rows and 95 variables (columns). It contains a fair number of missing values, which have been left as they are rather than imputed. A missing value appears either as NA or as a blank (empty or single-space) string; the blank strings are converted to NA during data cleaning.

Original Dataset

Below is the data in a scrollable HTML table. Because the text in the character columns is long, the table needs to be scrolled left-right and up-down to view everything. Every row and every column is present. Columns we don’t want to see can be hidden using the clickable “Column Visibility” button, and a search bar is provided at the top of the table.

myfile <- 'https://raw.githubusercontent.com/ishantnayer/Rfiles/master/listings.csv'
listing_original<- read.csv(myfile)
datatable(listing_original ,extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

3.3 Data Cleaning

The first step was to clean the data as soon as it was read with the read.csv function.

listings<- read.csv(myfile, na.strings=c(""," ","NA"))

The code above reads every blank string as NA.
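To see what the na.strings argument does, here is a minimal sketch on a made-up three-row CSV. The column names and values are hypothetical and are not taken from listings.csv. Empty and single-space fields in a character column are read as NA instead of being kept as blank strings.

```r
# Toy CSV text; all values are invented for illustration only.
csv_text <- "id,name,price\n1,Loft,$100.00\n2, ,$85.00\n3,,$60.00\n"

# Default behaviour: blank strings in a character column are kept as-is.
toy_default <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# With na.strings: "", " " and "NA" are all treated as missing values.
toy_clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                      na.strings = c("", " ", "NA"))

# Compare the number of missing names under the two settings.
sum(is.na(toy_default$name))
sum(is.na(toy_clean$name))
```

Reading the file this way means later checks only have to look for NA, not for several kinds of blank string.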

Only a few variables from the original dataset were used for the analysis. Some of them contain missing values, either NAs or blank strings, and a few had to be converted to another data type. For example, price should be numeric and description should be character. The code below changes the types of four variables: neighbourhood_cleansed and host_is_superhost become factors, price becomes numeric, and description becomes character.

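#strip the leading dollar sign, then convert price to numeric
listings$price <- as.numeric(sub("\\$","", listings$price))
#free text is kept as character; grouping variables become factors
listings$description <- as.character(listings$description)
listings$neighbourhood_cleansed <- factor(listings$neighbourhood_cleansed)
listings$host_is_superhost <- factor(listings$host_is_superhost)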

Many missing values were found, but most of the variables that contain them are not required for the analysis. The main variables used throughout the analysis, such as description, neighbourhood_cleansed and property_type, do not contain any missing values.
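One quick way to check which columns contain missing values is to count the NAs per column. The sketch below uses a small made-up data frame (the notes column and all values are hypothetical); with the real data, the same call would be applied to listings.

```r
# Toy data frame; values are invented for illustration only.
toy <- data.frame(
  description   = c("Cozy room", "Sunny loft", "Quiet studio"),
  property_type = c("Apartment", "House", "Apartment"),
  notes         = c(NA, "Near subway", NA),
  stringsAsFactors = FALSE
)

# Count missing values in every column.
na_counts <- colSums(is.na(toy))

# Show only the columns that actually have missing values.
na_counts[na_counts > 0]
```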

When price was converted to numeric, 12 values were coerced to NA. These rows are not removed, because every row is needed for the analysis: the dataset has only 3585 rows, and dropping some would affect the analysis of other variables. For example, the most important variable, description, which conveys the vibe of the neighbourhood, has to be analysed in full. Every word and description counts, and removing any observation would reduce the word count available for the sentiment analysis (can’t take that risk!).
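A likely cause of the coerced NAs, stated here as an assumption that has not been checked against the data, is that prices of $1,000 or more contain a thousands separator. Removing only the dollar sign leaves the comma in place, and as.numeric then returns NA. The sketch below shows the effect on made-up price strings and one possible way to strip both characters.

```r
# Made-up price strings; the last one has a thousands separator.
raw_price <- c("$85.00", "$150.00", "$1,250.00")

# Removing only the dollar sign: the comma makes the last value NA.
dollar_only <- suppressWarnings(as.numeric(sub("\\$", "", raw_price)))

# Removing dollar signs and commas: every value converts cleanly.
dollar_and_comma <- as.numeric(gsub("[$,]", "", raw_price))

dollar_only
dollar_and_comma
```

If this is indeed the cause, the 12 NA rows would be the most expensive listings, which would matter for any statistic describing the top of the price range.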

Price is used in later parts of the analysis, so this data had to stay usable. Removing the NA rows permanently might have affected or skewed the final results, so alternatives were applied instead. For example, the code below computes the quantiles with the argument na.rm = TRUE, which excludes the missing values only for that calculation. Code of this kind was used throughout to work with price without permanently removing observations:

quantile(listings$price,
        probs = seq(0, 1, 0.01),
        na.rm=TRUE)

##     0%     1%     2%     3%     4%     5%     6%     7%     8%     9% 
##  10.00  32.00  40.00  45.00  45.00  50.00  50.00  55.00  56.00  60.00 
##    10%    11%    12%    13%    14%    15%    16%    17%    18%    19% 
##  60.00  60.00  65.00  65.00  67.00  69.00  70.00  70.00  75.00  75.00 
##    20%    21%    22%    23%    24%    25%    26%    27%    28%    29% 
##  75.00  78.00  79.84  80.00  83.00  85.00  87.00  89.00  90.00  95.00 
##    30%    31%    32%    33%    34%    35%    36%    37%    38%    39% 
##  97.00  99.00  99.00 100.00 100.00 100.00 107.92 110.00 115.00 119.00 
##    40%    41%    42%    43%    44%    45%    46%    47%    48%    49% 
## 120.00 124.52 125.00 126.00 130.00 133.00 135.00 140.00 144.00 147.00 
##    50%    51%    52%    53%    54%    55%    56%    57%    58%    59% 
## 150.00 150.00 150.00 150.00 155.00 159.00 160.00 165.00 169.00 169.48 
##    60%    61%    62%    63%    64%    65%    66%    67%    68%    69% 
## 175.00 175.00 175.00 180.00 185.00 189.00 195.00 199.00 199.00 199.00 
##    70%    71%    72%    73%    74%    75%    76%    77%    78%    79% 
## 200.00 200.00 200.00 209.00 215.00 220.00 225.00 228.44 232.00 240.00 
##    80%    81%    82%    83%    84%    85%    86%    87%    88%    89% 
## 249.00 250.00 250.00 250.00 265.00 275.00 279.00 288.28 295.00 299.00 
##    90%    91%    92%    93%    94%    95%    96%    97%    98%    99% 
## 300.00 315.00 319.24 333.84 350.00 373.80 395.48 425.00 500.00 600.00 
##   100% 
## 999.00

RESULTS: The median price is $150, and the 75th percentile (the start of the upper quartile) is $220.

3.4 Cleaned Dataset

The HTML table below is the final cleaned dataset with all the necessary changes. Because the text in the character columns is long, scroll left-right and up-down to view the data.

datatable(listings ,extensions = 'Buttons', options = list(dom = 'Bfrtip', buttons = I('colvis')))

3.5 Summary of Variables

Below is a summary of the variables of interest. Of the 95 variables in the dataset, only a few matter here. A brief description of what the variables represent:

Name: Name of the place where a guest stays
Summary: Summary of the place
Description: Detailed description of the listing
Space: About the spaciousness of the place
Neighbourhood_cleansed: Neighbourhood’s name
Property_type: Type of property that is being listed
Transit: How to locate and reach the place
host_is_superhost: Whether the host is a superhost, i.e. has a lengthy history on Airbnb
Variables that give the host’s summary
Variables with URL links for the place, such as location and pictures
Variables for location specifications
Variables for amenities
Variables for price
Variables for availability and calendar
Variables for reviews

The most important variables are of character type, as the analysis mainly needed descriptive text and its “hit words”. Price was the only numeric variable used. Some other variables appear in “select” statements but were used only temporarily. Summary statistics of price:

summary(listings$price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0    85.0   150.0   169.1   220.0   999.0      12

4. Exploratory Data Analysis

This is the main analysis of the project: a sentiment analysis of Airbnb listing descriptions, along with other techniques. The findings are presented graphically, and each graph is followed by a summary of what the analysis shows.

Format: The code always comes first, followed by the graph and its summary/explanation.
As set out in the rubric, this analysis:
1. Tries to uncover new information in the data that is not self-evident
2. Presents findings as plots rather than tables
3. Formats the plots
4. Highlights important features of the analysis in a summary
5. Provides an explanation

A. Sentiment Analysis by Neighbourhood

#getting the top 5 neighbourhoods, based on number of listings
top_5 <- listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(5, wt = count)

#filtering listings in these neighbourhoods
listings_5 <- listings %>%
  filter(neighbourhood_cleansed %in% top_5$neighbourhood_cleansed)

#unnesting indiv. words for these listings
top_words <- listings_5 %>%
  select(id, description, neighbourhood_cleansed, review_scores_rating) %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% stop_words$word, str_detect(word, "^[a-z']+$"))

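#get word-sentiment lexicon
#(recent tidytext releases no longer include a lexicon column in sentiments; there the equivalent call is get_sentiments("nrc"))
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)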

#count total words in each neighbourhood
total_of_words <- top_words %>%
  group_by(neighbourhood_cleansed) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, neighbourhood_cleansed, total_words)

#count words assoc. with each type of sentiment in each neighbourhood
sentiment <- top_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(total_of_words, by = "id") %>%
  group_by(neighbourhood_cleansed, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  mutate(prop = round(words / total_words * 100, digits=1)) %>%
  ungroup()

main_analysis <- ggplot(data=sentiment) +
  geom_bar(mapping=aes(x=neighbourhood_cleansed,
                       y=prop),
           stat="identity",  fill = "light green") +
  facet_wrap( ~ sentiment) +
  labs(title="Sentiment Analysis in Top 5 Neighbourhoods",
       x="Neighbourhood", y="Proportion \n (sentiment word count / total word count)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(main_analysis)

Summary/Explanation: This faceted bar graph shows the sentiment in the top 5 neighbourhoods. The x-axis shows the top 5 neighbourhoods in Boston, and the y-axis shows the sentiment word count as a percentage of the total word count in the description column. The filter lexicon == ‘nrc’ is the key step: it selects the NRC lexicon, which links words to sentiments. Overall, positive words occur more frequently than negative ones. Moreover, the positive sentiments Anticipation, Joy and Trust all have higher proportions than negative ones such as Anger, Disgust and Fear.

In other words, these neighbourhoods may be in the top 5 partly because they have such a positive vibe and customers seek to stay in these areas (other, as yet unknown, reasons surely play a part too!). The green colour was chosen deliberately to represent the overall positive vibe!
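The proportion on the y-axis can be illustrated with a much smaller sketch. It is a simplified version of the idea, not the pipeline above: the three descriptions and the four-word lexicon are made up, stop words are not removed, and toy_lexicon stands in for the NRC lexicon.

```r
library(dplyr)
library(tidytext)

# Made-up listings; neighbourhood names and descriptions are hypothetical.
toy <- tibble(
  id = 1:3,
  neighbourhood = c("A", "A", "B"),
  description = c("Lovely quiet home",
                  "Noisy street but lovely view",
                  "Quiet and safe")
)

# Hypothetical stand-in for the NRC word-sentiment lexicon.
toy_lexicon <- tibble(
  word = c("lovely", "quiet", "safe", "noisy"),
  sentiment = c("positive", "positive", "positive", "negative")
)

# One row per word; unnest_tokens() also lowercases the text.
toy_words <- toy %>%
  unnest_tokens(word, description)

# Sentiment word count as a percentage of all words in each neighbourhood.
toy_props <- toy_words %>%
  add_count(neighbourhood, name = "total_words") %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(neighbourhood, sentiment, total_words, name = "words") %>%
  mutate(prop = round(words / total_words * 100, digits = 1))

toy_props
```

Neighbourhood A has 8 words, 3 of them positive, so its positive proportion is 37.5. Because the denominator is the total word count, neighbourhoods with longer descriptions are not favoured automatically.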

B. Sentiment Analysis by Property_type

#Trying to represent sentiment by property_type.

property_words <- listings %>%
  select(id, description, property_type, review_scores_rating) %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% stop_words$word, str_detect(word, "^[a-z']+$"))

#get word-sentiment lexicon
nrc <- sentiments %>%
  filter(lexicon == "nrc") %>%
  dplyr::select(word, sentiment)

#count total words in each property type
prop_tot_words <- property_words %>%
  group_by(property_type) %>%
  mutate(total_words = n()) %>%
  ungroup() %>%
  distinct(id, property_type, total_words)

#count words assoc. with each type of sentiment in each property type
by_prop_sentiment <- property_words %>%
  inner_join(nrc, by = "word") %>%
  count(sentiment, id) %>%
  ungroup() %>%
  complete(sentiment, id, fill = list(n = 0)) %>%
  inner_join(prop_tot_words, by = "id") %>%
  group_by(property_type, sentiment, total_words) %>%
  summarize(words = sum(n)) %>%
  mutate(prop = round(words / total_words * 100, digits=1)) %>%
  ungroup()

#Filter missing values (na.omit ignores extra arguments, so it drops every row that contains an NA)
senti_prop <- as.data.frame(na.omit(by_prop_sentiment))

#Plotting the analysis
property_types <- ggplot(senti_prop) +
  geom_bar(mapping=aes(x=property_type,
                       y=prop),
           stat="identity",  fill = "pink") +
  facet_wrap( ~ sentiment) +
  labs(title="Sentiment by Property type",
       x="Property_Type", y="Proportion \n (sentiment word count / total word count)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

print(property_types)

Summary/Explanation: This plot represents sentiment by property_type. It is the same kind of graphical analysis with a different grouping factor. The plot is insightful and could help hosts of different property types improve their customer experience. For example, the property type Dorm has the lowest share of positive words and the lowest share of trust words. Camper/RV has the highest shares of anticipation, joy and positive words among all types, which fits the kind of international trend seen nowadays. The graph can be read in many ways across all the sentiments. (Quite informative!)

C. Super-host location in Boston

This time superhosts are located on a map using the leaflet package.
Superhosts are AirBnB hosts with a particularly lengthy history on the platform. The benefit to a renter of staying with a superhost is the knowledge that this person has built up a strong positive reputation.

#making the map
leaflet(data = listings) %>% addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(~longitude, ~latitude, radius = ifelse(listings$host_is_superhost == "t", 2, 0.2),
                   color = ifelse(listings$host_is_superhost == "t", "white", "green"),
                   fillOpacity = 0.5)

#Checking the average price by host_is_superhost
avg <- listings %>% select(host_is_superhost, price)
aggregate(price~host_is_superhost, avg, mean)

##   host_is_superhost    price
## 1                 f 168.4837
## 2                 t 173.6675

Zoom in and out of the map to take a closer look!

Summary/Explanation: As seen above, the average price charged by a superhost ($173) is almost the same as the average price charged by a non-superhost ($168). The general hypothesis is that superhosts would charge a premium because of their special status, but surprisingly that does not show up here. The gap between the two means is only about $5, although no formal significance test was run. Since superhosts represent a positive vibe, locating them on the map gives a picture of where the neighbourhood vibe is positive in general.

Interesting observation: Keep in mind that this analysis naively looks at one variable (price) in isolation. It’s possible that there is an effect that just doesn’t show up because superhosts tend to rent out units (e.g. single bedrooms) that are cheaper than those rented out by the general population (e.g. entire houses or apartments, especially the heavily legally debated vacation rentals). The analysis does not dive that far in, but this is an interesting (lack of an) effect.
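One way to put the comparison of the two group means on a firmer footing would be a Welch two-sample t-test. The sketch below is not part of the original analysis and runs on made-up prices; with the real data, listings would be passed as the data argument and the actual result could differ in either direction.

```r
# Made-up prices for non-superhosts ("f") and superhosts ("t").
toy <- data.frame(
  host_is_superhost = factor(c("f", "f", "f", "f", "t", "t", "t", "t")),
  price = c(120, 150, 200, 95, 130, 160, 210, 110)
)

# Group means, computed the same way as in the analysis above.
group_means <- aggregate(price ~ host_is_superhost, data = toy, FUN = mean)

# Welch two-sample t-test (t.test does not assume equal variances by default).
welch <- t.test(price ~ host_is_superhost, data = toy)

group_means
welch$p.value
```

A large p-value would be consistent with the observation that superhosts do not charge a clear premium, but it would not rule out the room-type effect described above.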

D. Common Words in Pricey Listings

#adding an 'upper quartile by price vs. the rest' variable. From the quantile function, the 75th percentile of price is $220 and the median is $150.
listings$price_uq <- ifelse(listings$price >= 220, "Upper Quartile", "The Rest")

#We need to use the unnest_tokens function to obtain one-row-per-term-per-listing-description
listings_words <- listings %>%
  select(id, description, price, price_uq, review_scores_accuracy, review_scores_rating) %>%
  unnest_tokens(word, description) %>%
  filter(!word %in% stop_words$word,
         str_detect(word, "^[a-z']+$"))

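#plot the graph
#NOTE: no filter on price_uq is applied here, so the counts cover all listings;
#adding filter(price_uq == "Upper Quartile") before group_by would restrict them to the pricey ones
common_listings <- listings_words %>%
  group_by(word) %>%
  summarise(count = n()) %>%
  top_n(n = 20, wt = count) %>%
  ggplot() +
  geom_bar(mapping = aes(x=reorder(word, count),
                         y=count),
           stat="identity", fill = "light green") +
  coord_flip() +
  #labels follow the aesthetics, so x is the word and y is the count even after coord_flip
  labs(title="Top 20 words described in Upper quartile listings",
       x="Words", y="Word count") +
  theme_minimal()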

print(common_listings)

Upper quartile = top 25% of the listings sorted by price. Here the 75th percentile value is $220.

Summary/Explanation: This analysis was done to show the top 20 words mentioned in the descriptions of the upper-quartile listings, i.e. the listings priced at $220 or more. These words represent how expensive listings are most commonly described. For example, the property type Apartment is the 2nd most common word used after Boston in expensive listings. On a general level, apartments have a high price range in Boston AirBnB and show a positive vibe (confirmed in the second sentiment analysis). One caveat: the plotting code never filters on price_uq, so as written the counts are taken over all listings rather than only the upper quartile.
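Restricting the word counts to the upper quartile takes one extra filter step. The sketch below shows the idea on made-up data, with the cutoff derived from the data instead of being hard-coded; toy_words stands in for listings_words and all values are hypothetical.

```r
library(dplyr)

# Made-up data: one row per (listing, word) pair, as unnest_tokens() would give.
toy_words <- tibble(
  id    = c(1, 1, 2, 2, 3, 3, 4, 4),
  price = c(90, 90, 150, 150, 400, 400, NA, NA),
  word  = c("boston", "room", "boston", "cozy",
            "boston", "luxury", "apartment", "luxury")
)

# 75th percentile of price, computed once per listing and ignoring NAs.
listing_prices <- distinct(toy_words, id, price)
cutoff <- quantile(listing_prices$price, probs = 0.75, na.rm = TRUE)

# Flag each row, keep only upper-quartile rows, then count the words.
# Rows with a missing price get an NA flag and are dropped by filter().
pricey_counts <- toy_words %>%
  mutate(price_uq = ifelse(price >= cutoff, "Upper Quartile", "The Rest")) %>%
  filter(price_uq == "Upper Quartile") %>%
  count(word, sort = TRUE, name = "count")

pricey_counts
```

In this toy example only the listing priced at 400 lies above the cutoff, so only its words are counted.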

E. Word Cloud-Pricey Listings

#making a data frame of words and its frequency
cloud <- as.data.frame(listings_words %>% 
  group_by(word) %>%
  summarise(no_rows = length(word)))

#building the word cloud
wordcloud(words = cloud$word, freq = cloud$no_rows, min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35, 
          colors=brewer.pal(8, "Dark2"))

Summary/Explanation: This is the same analysis as in part (D); only the representation is different.

F. Upper-Quartile places in Boston

Here, leaflet was again used to represent data on a map, an idea which was picked up from the AirBnB website itself!

leaflet(data = listings) %>% addProviderTiles("CartoDB.DarkMatter") %>%
  addCircleMarkers(~longitude, ~latitude, radius = ifelse(listings$price > 220, 2, 0.2),
                   color = ifelse(listings$price > 220, "white", "orange"),
                   fillOpacity = 0.5)

Zoom in and out of the map to take a closer look!

Summary/Explanation: This leaflet map locates the upper-quartile places in Boston. The white filled circles are the listings with upper-quartile prices and the rest are orange. The upper-quartile places are mostly apartments with a positive vibe. With further analysis, it would be possible to show the exact locations with positive/negative vibes on a map. (A similar analysis was already performed in the first sentiment analysis, without using a map.)

5. Summary

5.1 Summarizing the problem statement: Analysed the vibes of homestays listed on AirBnB in Boston, MA. The analysis includes results for the sentiments by neighbourhood, sentiments by property type, an analysis of superhosts, and the top words that describe the pricey listings.

5.2 Summarizing the implementation: The data was scraped and manipulated as needed for the analysis. It was then reviewed graphically to determine the general vibe in the neighbourhoods. Sentiment analysis was presented using faceted vertical bar graphs, a word cloud, and horizontal bar charts. The Airbnb superhost concept was also analysed, and superhosts were located on a map. A leaflet map was also used to locate the pricey listings of Boston.

5.3 Summary/Insights: The various results showed that there is a POSITIVE vibe from the Boston neighbourhoods. Whether it is the sentiment analysis, the word cloud, or the bar charts, they all ultimately represent the same thing in different shapes and sizes. Some extra insights from the analysis:
* The positive sentiments Anticipation, Joy and Trust all have higher proportions than negative ones such as Anger, Disgust and Fear.
* Back Bay, Dorchester, Fenway, Jamaica Plain and South End are the top 5 neighbourhoods.
* The average price charged by a superhost ($173) is almost the same as the average price charged by a non-superhost ($168). The general hypothesis is that superhosts would charge a premium because of their special status, but surprisingly that does not show up here.
* The property type Dorm has the lowest share of positive words and the lowest share of trust words.
* Camper/RV has the highest shares of anticipation, joy and positive words among all types, which fits the kind of international trend seen nowadays.
* The property type Apartment is the 2nd most common word used after Boston in expensive listings.
* The upper quartile of listing prices starts at $220.

Ishant Nayer

12/8/2016
