Chicago: Airbnb Sentiment Analysis

Airbnb Introduction

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This analysis presents an overview of Airbnb homestays in Chicago, IL, and the trends among them.

1. Synopsis

Problem Statement:

I plan to analyse the following:

What is the vibe in each Chicago neighbourhood for Airbnb homestays: positive or negative?
Prepare a sentiment analysis by property type.
Prepare a word cloud of the words appearing in reviews of pricey listings.
Try to uncover information in the data that is not self-evident.
List the upper-quantile homestays of Chicago and plot them with ggmap.
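
The first question above can be sketched on toy data. This is only an illustration under stated assumptions: the reviews, the five-word lexicon, and the names toy_reviews, toy_lexicon and neighbourhood_sentiment are all made up here, and a real analysis would use a published sentiment lexicon on the full review table.

```r
library(dplyr)

# Toy reviews (invented for illustration, not taken from the dataset).
toy_reviews <- tibble(
  neighbourhood = c("Loop", "Loop", "Pilsen", "Uptown"),
  comments = c("Great location and clean room",
               "Noisy street at night",
               "Dirty bathroom, noisy street",
               "Lovely host, great stay")
)

# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words.
toy_lexicon <- tibble(
  word  = c("great", "clean", "lovely", "dirty", "noisy"),
  score = c(1, 1, 1, -1, -1)
)

# Split each comment into lower-case words, one row per word.
word_list <- strsplit(tolower(toy_reviews$comments), "[^a-z]+")
words <- tibble(
  neighbourhood = rep(toy_reviews$neighbourhood, lengths(word_list)),
  word = unlist(word_list)
)

# Keep only words found in the lexicon and sum their scores per neighbourhood.
# A net score of zero is labelled Positive in this sketch (an arbitrary choice).
neighbourhood_sentiment <- words %>%
  inner_join(toy_lexicon, by = "word") %>%
  group_by(neighbourhood) %>%
  summarise(net_score = sum(score)) %>%
  mutate(vibe = ifelse(net_score >= 0, "Positive", "Negative"))

print(neighbourhood_sentiment)
```

Summing word scores is the simplest possible scoring rule; it ignores negation ("not clean") and review length, so it should be read as a starting point only.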

Implementation:

The data was scraped and reshaped for the analysis, then reviewed to determine the general vibe in each neighbourhood.

Summary: Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.

2. Packages Required

library(rmarkdown)
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)

3. Data Preparation

3.1 Data Source

Original source of data: Inside Airbnb

3.2 Explanation of Source Data

Explanation of data source: The original purpose of the data was to show people how Airbnb is really being used and how it is affecting their neighbourhoods. By analyzing publicly available information about a city’s Airbnb listings, Inside Airbnb provides filters and key metrics so people can see how Airbnb is being used to compete with the residential housing market. The data was posted on their website on 10 May 2017. The original reviews dataset had 132,353 rows and 6 variables (columns), and the listings table had 5,207 rows and 16 columns. These tables have been combined for ease of analysis.

3.3 Original Data Set

## # A tibble: 132,353 x 6
##    listing_id        id       date reviewer_id reviewer_name
##         <int>     <int>     <date>       <int>         <chr>
##  1    1027405  35183788 2015-06-15    28799386      Seo Hyun
##  2    1027405  35746894 2015-06-21    36002247       Colleen
##  3    1027405  37359380 2015-07-06    35422627         Susan
##  4    1027405  45254342 2015-09-01    35829599      Victoria
##  5    1027405  46872479 2015-09-14    17346096      Brittney
##  6    1027405  90333200 2016-07-31    65398549         Sarah
##  7    7725012  47857055 2015-09-21    10874627       Cameron
##  8    7725012  47900649 2015-09-21    43904307          Andy
##  9    7725012  99756424 2016-09-05     9797637       Michele
## 10    7725012 102581502 2016-09-18    72687862           Liz
## # ... with 132,343 more rows, and 1 more variables: comments <chr>

3.4 Cleaning Dataset

The first step is to read the reviews and listings data from the source files and check the formatting of the columns. All variables have an appropriate data type, as the structure of the table shows.

# Read the reviews data
review <- read_csv("C:/Users/jatin/Desktop/UCHW/R_FINAL_PROJECT/reviews.csv")
dim(review)      # number of rows and columns
str(review)      # column types
names(review)
head(review, 1)
# Sort by review id
review <- arrange(review, id)

# Read the listings data
listing <- read_csv("C:/Users/jatin/Desktop/UCHW/R_FINAL_PROJECT/listings.csv")
names(listing)
# Sort by listing id
listing <- arrange(listing, id)
head(listing, 1)

The analysis uses only a few variables from the two datasets. These were combined into a single table to ease analysis: in brief, I used left_join to join the listing table onto the review table, matching listing_id in review to id in listing.

# Getting prices and neighbourhood in the reviews table
# These typed NA placeholders share names with listing columns, so after the
# join they appear as neighbourhood.x / price.x (all NA), while the listing
# values arrive as neighbourhood.y / price.y.
review$neighbourhood <- NA_character_
review$price <- NA_real_

review <- review %>% left_join(listing, by = c("listing_id" = "id"))
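
The join above leaves duplicate neighbourhood and price columns with .x and .y suffixes. A minimal sketch of an alternative, assuming only neighbourhood and price are needed from listing: select those columns before joining and skip the NA placeholder columns. The tables toy_review and toy_listing and their values are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for the review and listing tables (values are invented).
toy_review <- tibble(
  listing_id = c(1L, 1L, 2L),
  id         = c(10L, 11L, 12L),
  comments   = c("Nice place", "Fine", NA)
)
toy_listing <- tibble(
  id            = c(1L, 2L),
  neighbourhood = c("Loop", "Uptown"),
  price         = c(120L, 85L),
  host_name     = c("A", "B")   # a column the analysis does not need
)

# Keep only the listing columns of interest, then join on the listing key.
# No placeholder columns exist in toy_review, so no .x/.y suffixes are created.
toy_joined <- toy_review %>%
  left_join(
    select(toy_listing, id, neighbourhood, price),
    by = c("listing_id" = "id")
  )

names(toy_joined)
```

Because the listing key id is consumed by the join, the review table's own id column keeps its name, and every review row is retained even when its comment is missing.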

The next step is to remove NA values from the combined data.

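# Cleaning the data
colSums(is.na(review))                                   # missing values per column
review <- subset(review, select = -neighbourhood_group)  # drop the neighbourhood_group column
sum(is.na(review))                                       # total missing values remaining
review <- filter(review, !is.na(comments))               # keep only reviews with comment text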
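
The same cleaning steps can be traced on a small example. This is a sketch on invented data: the table toy is hypothetical, and neighbourhood_group is assumed here to be an entirely empty column, which is one plausible reason for dropping it.

```r
library(dplyr)

# Toy table with two missing comments and one all-NA column (invented data).
toy <- tibble(
  id = 1:4,
  comments = c("Great stay", NA, "Too noisy", NA),
  neighbourhood_group = NA
)

# Count missing values per column to see where the gaps are.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Drop the empty column, then drop rows that have no comment text.
toy_clean <- toy %>%
  select(-neighbourhood_group) %>%
  filter(!is.na(comments))

# Number of rows lost to the comment filter.
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```

Comparing row counts before and after the filter is a quick check that only the rows with missing comments were removed.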

3.5 Cleaned Dataset

## # A tibble: 132,159 x 22
##    listing_id    id       date reviewer_id reviewer_name
##         <int> <int>     <date>       <int>         <chr>
##  1       4505   849 2009-03-06        8913          Lisa
##  2       4505   925 2009-03-15        6976       Patrick
##  3       4505  1263 2009-04-07        8513         Emily
##  4       4505  1423 2009-04-12        9468         Ebony
##  5       4505  1527 2009-04-15       11819  Paul & Susan
##  6       4505  1763 2009-04-24       11853      Benjamin
##  7       4505  2957 2009-05-22        5887        Bianca
##  8       4505  3355 2009-05-31       14051         Kelly
##  9       6715  3677 2009-06-07       16850          Ynes
## 10       4505  3837 2009-06-10       17738       Jessica
## # ... with 132,149 more rows, and 17 more variables: comments <chr>,
## #   neighbourhood.x <chr>, price.x <dbl>, name <chr>, host_id <int>,
## #   host_name <chr>, neighbourhood.y <chr>, latitude <dbl>,
## #   longitude <dbl>, room_type <chr>, price.y <int>, minimum_nights <int>,
## #   number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <int>, availability_365 <int>

Exactly 194 rows with missing comments were found (132,353 − 132,159). These rows were removed, leaving 132,159 entries.

3.6 Summary of Variables

Below is a summary of the variables of interest. Out of 8, only 5 are of concern here.

Date: Date of review

Reviewer_Name: Name of the reviewer

Comments: Review of the stay

Neighbourhood: Neighbourhood of the property

Price: Price of the property per night
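
A minimal sketch of pulling these five variables out of the cleaned table, assuming the neighbourhood and price values live in the .y columns shown in the cleaned dataset printout. The table toy_cleaned and its values are invented, and analysis_vars is a hypothetical name.

```r
library(dplyr)

# Toy stand-in for the cleaned table (invented values, subset of columns).
toy_cleaned <- tibble(
  date            = as.Date(c("2015-06-15", "2016-07-31")),
  reviewer_name   = c("A", "B"),
  comments        = c("Great stay", "Too noisy"),
  neighbourhood.x = NA_character_,
  price.x         = NA_real_,
  neighbourhood.y = c("Loop", "Uptown"),
  price.y         = c(120L, 85L)
)

# Select the five variables of interest, renaming the .y columns on the way.
analysis_vars <- toy_cleaned %>%
  select(date, reviewer_name, comments,
         neighbourhood = neighbourhood.y,
         price = price.y)

names(analysis_vars)
```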

Project Proposal

Jatin Saini

04 November, 2017