AirBnB Introduction
Since 2008, guests and hosts have used Airbnb to travel in a more personalized way. This analysis gives an overview of homestay trends in Chicago, IL.
Problem Statement:
I plan to analyse the following:
What is the vibe in each Chicago neighbourhood for Airbnb homestays: positive or negative?
Prepare a sentiment analysis by property type.
Prepare a word cloud of the words appearing in reviews of pricey listings.
Try to uncover new information in the data which is not self-evident.
List upper quantile homestays of Chicago and prepare a ggmap for the same.
Implementation:
The data was scraped and prepared for the analysis, then reviewed to determine the general vibe in each neighbourhood.
Summary: Inside Airbnb is an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world.
library(rmarkdown)
library(tidyverse)
library(ggplot2)
library(readr)
library(dplyr)
Original Source of data: InsideAirbnb
Explanation of data source: The original purpose of the data was to show people how Airbnb is really being used and how it is affecting their neighbourhood. By analyzing publicly available information about a city's Airbnb listings, Inside Airbnb provides filters and key metrics so people can see how Airbnb is being used to compete with the residential housing market. The data was posted on their website on 10th May 2017. The original reviews dataset had 132,353 rows and 6 variables (columns), and the listings table had 5,207 rows and 16 columns. These tables were combined for ease of analysis.
## # A tibble: 132,353 x 6
## listing_id id date reviewer_id reviewer_name
## <int> <int> <date> <int> <chr>
## 1 1027405 35183788 2015-06-15 28799386 Seo Hyun
## 2 1027405 35746894 2015-06-21 36002247 Colleen
## 3 1027405 37359380 2015-07-06 35422627 Susan
## 4 1027405 45254342 2015-09-01 35829599 Victoria
## 5 1027405 46872479 2015-09-14 17346096 Brittney
## 6 1027405 90333200 2016-07-31 65398549 Sarah
## 7 7725012 47857055 2015-09-21 10874627 Cameron
## 8 7725012 47900649 2015-09-21 43904307 Andy
## 9 7725012 99756424 2016-09-05 9797637 Michele
## 10 7725012 102581502 2016-09-18 72687862 Liz
## # ... with 132,343 more rows, and 1 more variables: comments <chr>
The first step is to read the reviews and listings data from the source and check the formatting of the columns. All the variables have an appropriate data type, as seen in the structure of the table.
# Read the reviews data
review <- read_csv("C:/Users/jatin/Desktop/UCHW/R_FINAL_PROJECT/reviews.csv")
dim(review)
str(review)
names(review)
head(review, 1)
# Sort by review id (arrange() takes the bare column name)
review <- arrange(review, id)
# Read the listings data
listing <- read_csv("C:/Users/jatin/Desktop/UCHW/R_FINAL_PROJECT/listings.csv")
names(listing)
listing <- arrange(listing, id)
head(listing, 1)
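The column-type check described above can also be made explicit at read time. The following is a minimal sketch, assuming a small temporary CSV with the same six column names as the reviews table; the row values are made up for illustration and are not taken from the dataset.

```r
library(readr)

# Write a tiny stand-in file (hypothetical values, not real reviews)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "listing_id,id,date,reviewer_id,reviewer_name,comments",
  "101,1001,2016-01-15,501,Alex,Nice place",
  "101,1002,2016-02-20,502,Sam,"
), csv_path)

# Declare the expected type of every column instead of relying on guessing
review_toy <- read_csv(
  csv_path,
  col_types = cols(
    listing_id    = col_integer(),
    id            = col_integer(),
    date          = col_date(),
    reviewer_id   = col_integer(),
    reviewer_name = col_character(),
    comments      = col_character()
  )
)

# Inspect the resulting class of each column
sapply(review_toy, class)
```

Declaring types up front means a malformed column surfaces as a parsing problem when the file is read rather than later in the analysis.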
For the analysis, only a few variables from the datasets were used. These variables were combined into a single table to ease analysis. In brief, I used left_join to attach the matching listing columns to each row of the review table.
# Getting prices and neighbourhood into the reviews table
# Placeholder columns; listing has columns with the same names, so the
# join below suffixes them as neighbourhood.x/.y and price.x/.y
review$neighbourhood <- NA_character_
review$price <- NA_real_
# Match each review to its listing via the listing id
review <- review %>% left_join(listing, by = c("listing_id" = "id"))
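As an alternative sketch (not the approach used in this analysis), the placeholder columns can be skipped and only the needed listing columns joined, which avoids the `.x`/`.y` suffixes. The toy tables and their values below are hypothetical stand-ins for `review` and `listing`.

```r
library(dplyr)

# Toy stand-ins for the real tables (hypothetical values)
review_toy <- tibble(
  listing_id = c(1L, 1L, 2L),
  id         = c(10L, 11L, 12L),
  comments   = c("Great stay", NA, "Too noisy")
)
listing_toy <- tibble(
  id            = c(1L, 2L),
  neighbourhood = c("Loop", "Lake View"),
  price         = c(120L, 85L),
  room_type     = c("Entire home/apt", "Private room")
)

# Keep only the listing columns needed, then join on the listing key.
# review_toy has no neighbourhood/price columns of its own, so the
# joined columns keep their plain names.
joined_toy <- review_toy %>%
  left_join(
    select(listing_toy, id, neighbourhood, price),
    by = c("listing_id" = "id")
  )

names(joined_toy)
```

Every review row is kept by the left join, and each one gains the neighbourhood and price of its listing.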
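The next step is to remove NA values from the combined data.
# Cleaning the data
# Count missing values per column
colSums(is.na(review))
# Drop the neighbourhood_group column
review <- subset(review, select = -neighbourhood_group)
# Total number of missing values remaining
sum(is.na(review))
# Keep only reviews that have a comment
review <- filter(review, !is.na(comments))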
Cleaned Dataset
## # A tibble: 132,159 x 22
## listing_id id date reviewer_id reviewer_name
## <int> <int> <date> <int> <chr>
## 1 4505 849 2009-03-06 8913 Lisa
## 2 4505 925 2009-03-15 6976 Patrick
## 3 4505 1263 2009-04-07 8513 Emily
## 4 4505 1423 2009-04-12 9468 Ebony
## 5 4505 1527 2009-04-15 11819 Paul & Susan
## 6 4505 1763 2009-04-24 11853 Benjamin
## 7 4505 2957 2009-05-22 5887 Bianca
## 8 4505 3355 2009-05-31 14051 Kelly
## 9 6715 3677 2009-06-07 16850 Ynes
## 10 4505 3837 2009-06-10 17738 Jessica
## # ... with 132,149 more rows, and 17 more variables: comments <chr>,
## # neighbourhood.x <chr>, price.x <dbl>, name <chr>, host_id <int>,
## # host_name <chr>, neighbourhood.y <chr>, latitude <dbl>,
## # longitude <dbl>, room_type <chr>, price.y <int>, minimum_nights <int>,
## # number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>,
## # calculated_host_listings_count <int>, availability_365 <int>
Exactly 194 rows with missing comments were found. These rows were removed, leaving 132,159 entries.
Below is a summary of the variables of interest. Out of 8, only 5 are relevant here.
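The number of removed rows can be checked by comparing row counts before and after filtering. This is a minimal sketch on a toy table with made-up values; `review_toy` and `cleaned_toy` are illustrative names, not objects from the analysis.

```r
library(dplyr)

# Toy reviews table with two missing comments (hypothetical values)
review_toy <- tibble(
  listing_id = c(1L, 1L, 2L, 2L),
  comments   = c("Great stay", NA, "Too noisy", NA)
)

# Missing values per column
na_per_column <- colSums(is.na(review_toy))

# Drop rows whose comment is missing and count how many were removed
cleaned_toy  <- filter(review_toy, !is.na(comments))
rows_removed <- nrow(review_toy) - nrow(cleaned_toy)

na_per_column
rows_removed
```

Applied to the real data, the same comparison gives 132,353 - 132,159 = 194.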
Date: Date of review
Reviewer_Name: Name of the reviewer
Comments: Review of the stay
Neighbourhood: Neighbourhood of the property
Price: Price of the property per night
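A working table holding just these five variables could be built as sketched below. The sketch assumes the suffixed column names shown in the cleaned dataset output (`neighbourhood.y`, `price.y`) and uses a toy table with made-up values; `analysis_toy` is a hypothetical name.

```r
library(dplyr)

# Toy table mimicking the joined column names (hypothetical values)
joined_toy <- tibble(
  date            = as.Date(c("2016-01-15", "2016-02-20")),
  reviewer_name   = c("Alex", "Sam"),
  comments        = c("Nice place", "Too noisy"),
  neighbourhood.x = NA_character_,
  price.x         = NA_real_,
  neighbourhood.y = c("Loop", "Lake View"),
  price.y         = c(120L, 85L),
  room_type       = c("Entire home/apt", "Private room")
)

# Keep the five variables of interest, renaming the listing-side
# columns back to plain names while selecting
analysis_toy <- joined_toy %>%
  select(
    date, reviewer_name, comments,
    neighbourhood = neighbourhood.y,
    price         = price.y
  )

names(analysis_toy)
```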