While we initially showed code sporadically, we took the advice in our classmates' comments and decided to hide all code, which made our final presentation more polished.

1. Visualising AirBnB Occupancies and Revenue Generated Over Time

For the initial data exploration part of our project, we used calendar heatmaps, which are often used to discern patterns, trends, and anomalies over time in a calendar-like interface. In this case, they can be used to understand how AirBnB occupancies and revenue generated changed over the period of 2019 to 2020.
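
As a rough illustration of how a calendar heatmap is laid out, the sketch below places each day in a grid of week (x) by weekday (y) and colours the tile by a daily value. The `daily` data frame and its `bookings` column are random placeholders, not our project data, and the colour scale is just one possible choice.

```r
library(ggplot2)

# Hypothetical daily values; `bookings` is random placeholder data
set.seed(1)
daily <- data.frame(date = seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day"))
daily$bookings <- sample(100:500, nrow(daily), replace = TRUE)

# Calendar coordinates: one column per week, one row per weekday
daily$week    <- as.integer(format(daily$date, "%U"))  # week of year, weeks start on Sunday
daily$weekday <- as.integer(format(daily$date, "%w"))  # 0 = Sunday ... 6 = Saturday
daily$weekday <- factor(daily$weekday, levels = 6:0,
                        labels = c("Sat", "Fri", "Thu", "Wed", "Tue", "Mon", "Sun"))

# One tile per day, coloured by the daily value
heatmap_plot <- ggplot(daily, aes(x = week, y = weekday, fill = bookings)) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Daily Bookings, 2019 (placeholder data)", x = "Week of Year", y = NULL) +
  theme_minimal()

heatmap_plot
```

Reversing the weekday levels puts Sunday at the top of the grid, so each column reads downwards like a week on a wall calendar.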

a. Total Occupancies in 2019 and 2020

We can see that total occupancies in 2019 were generally at least twice those in 2020.

b. Occupancy Rates in 2019 and 2020

Occupancy rates were much lower in 2020 than in 2019.

c. Total Daily Revenue in 2019 and 2020

Total daily revenue decreased from about 6 million dollars a day in 2019 to 2 or 3 million dollars a day in 2020.

d. Average Daily Revenue in 2019 and 2020

Average daily revenues are also lower in 2020.

2. Seasonal Changes in AirBnB Occupancy and Revenue Generated

We can view the same patterns using line graphs instead.

3. Data Tables on Occupancy and Revenue of AirBnBs in New York City

The data tables allow users to see the number of bookings, listings, percentage of listings booked, total revenue, and average revenue for any day that they select.
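
To make the columns of these tables concrete, here is a minimal sketch of how such a daily summary could be computed. The `calendar` data frame is a toy stand-in, its column names and values are assumptions, and we assume here that an unavailable night counts as a booking and that average revenue means revenue per booked listing.

```r
# Toy stand-in for calendar.csv; column names and values are assumptions
calendar <- data.frame(
  listing_id = c(1, 2, 3, 1, 2, 3),
  date       = as.Date(c("2019-07-01", "2019-07-01", "2019-07-01",
                         "2019-07-02", "2019-07-02", "2019-07-02")),
  available  = c("f", "f", "t", "f", "t", "t"),  # "f" = not available, treated as booked
  price      = c("$100.00", "$250.00", "$80.00", "$100.00", "$250.00", "$80.00"),
  stringsAsFactors = FALSE
)

# Flag booked nights and turn "$100.00" into 100
calendar$booked  <- calendar$available == "f"
calendar$price_n <- as.numeric(gsub("[$,]", "", calendar$price))
calendar$revenue <- ifelse(calendar$booked, calendar$price_n, 0)

# One row per day: bookings, listings, and total revenue
daily_summary <- data.frame(
  date          = sort(unique(calendar$date)),
  bookings      = as.vector(tapply(calendar$booked, calendar$date, sum)),
  listings      = as.vector(tapply(calendar$listing_id, calendar$date,
                                   function(x) length(unique(x)))),
  total_revenue = as.vector(tapply(calendar$revenue, calendar$date, sum))
)

# Derived columns: percentage of listings booked and revenue per booking
daily_summary$pct_booked  <- 100 * daily_summary$bookings / daily_summary$listings
daily_summary$avg_revenue <- daily_summary$total_revenue / daily_summary$bookings

daily_summary
```

A day with no bookings would give `NaN` for `avg_revenue` under this definition, so a real table would need to handle that case.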

5. What Makes a Host a Superhost?

We wanted to know what makes a host a superhost, and suspected that response and acceptance rates are closely linked to superhost status.

library(ggplot2)

## bar chart of the average response rate for superhosts vs. non-superhosts
ggplot(data = superhost_summary, aes(x = host_is_superhost, y = avgResponse, fill = host_is_superhost)) +
  geom_col(width = 0.5) +
  labs(title = 'Average Response Rates of Superhosts and Non-Superhosts') +
  ylab('Response Rates (%)') +
  theme(plot.title = element_text(hjust = 0.5),  # centre the title
        legend.position = 'none',                 # the fill colour repeats the x axis
        axis.title.x = element_blank())
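
The plot relies on a `superhost_summary` table that is not shown. A minimal sketch of how a table of that shape could be built is given here, assuming the rates are stored as text such as "98%" or "N/A" and that superhost status is coded "t"/"f". The `listings_toy` data and the `avgAcceptance` column name are made up for illustration; only `host_is_superhost` and `avgResponse` appear in our plotting code.

```r
# Toy stand-in for the listings data; all values are made up
listings_toy <- data.frame(
  host_id              = c(1, 2, 3, 4, 5),
  host_is_superhost    = c("t", "t", "f", "f", "f"),
  host_response_rate   = c("100%", "98%", "80%", "N/A", "60%"),
  host_acceptance_rate = c("95%", "99%", "70%", "50%", "N/A"),
  stringsAsFactors = FALSE
)

# Turn "98%" into 98 and anything unparseable (such as "N/A") into NA
pct_to_num <- function(x) suppressWarnings(as.numeric(sub("%", "", x, fixed = TRUE)))

listings_toy$response   <- pct_to_num(listings_toy$host_response_rate)
listings_toy$acceptance <- pct_to_num(listings_toy$host_acceptance_rate)

# Average each rate within the superhost and non-superhost groups, ignoring NAs
superhost_summary <- aggregate(
  listings_toy[, c("response", "acceptance")],
  by  = list(host_is_superhost = listings_toy$host_is_superhost),
  FUN = mean, na.rm = TRUE
)
names(superhost_summary) <- c("host_is_superhost", "avgResponse", "avgAcceptance")

superhost_summary
```

The toy data has one row per host. With real listings data, a host with several listings would appear several times, so the rows would need to be reduced to one per `host_id` first.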

6. Let’s do some text analysis!!

Here we cleaned and preprocessed text data collected from AirBNB reviews. We then performed a sentiment analysis, and created word clouds to better understand which words come up more often in positive and negative reviews.

library(tidyverse)

## install.packages("textdata")  # needed once so that get_sentiments("afinn") can fetch the lexicon

reviews <- read.csv("/Users/armaanahmed/Desktop/Classes/Data Visualization/Data Viz AirBNB Data/Su/data2/reviews.csv")
listings <- read.csv("/Users/armaanahmed/Desktop/Classes/Data Visualization/Data Viz AirBNB Data/Su/data2/listings.csv")

## attach listing details to every review; listings without reviews are dropped
airbnb <- inner_join(listings, reviews, by = c("id" = "listing_id"))

## keep reviews dated 2019-01-01 through 2020-12-31
## (dates are ISO-formatted strings, so string comparison orders them correctly)
airbnb <- airbnb %>% filter(date >= "2019-01-01", date <= "2020-12-31")

## How many properties does a host own?
## airbnb has one row per review, so count distinct listings per host
airbnb2 <- airbnb %>%
  distinct(host_id, id) %>%
  count(host_id, sort = TRUE)

table(airbnb2$n)
## 
##     1     2     3     4     5     6     7     8     9    10    11    12    13 
## 12132  1546   413   195    89    48    34    31    18     7     8     6     2 
##    14    15    16    17    18    20    21    22    23    24    26    29    30 
##     5     1     4     1     2     1     3     2     2     1     2     3     1 
##    31    32    34    35    36    37    40    78    91    98 
##     1     2     1     1     1     1     2     1     1     1

Cleaning and PreProcessing Text

library(tm)
library(quanteda)
## drop comments containing non-ASCII characters (a rough proxy for non-English text)
airbnb3 <- airbnb[!grepl("[^\x01-\x7F]", airbnb$comments), ]


## remove stop words
airbnb3$comments <- removeWords(airbnb3$comments, stopwords(language = "en", source = "stopwords-iso"))
airbnb3$comments <- removeWords(airbnb3$comments, stopwords(language = "en", source = "marimo"))

## remove numbers, whitespace, punctuation
airbnb3$comments <- removeNumbers(airbnb3$comments)
airbnb3$comments <- stripWhitespace(airbnb3$comments)
airbnb3$comments <- removePunctuation(airbnb3$comments)

## tolower
airbnb3$comments <- tolower(airbnb3$comments)
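
One thing to keep in mind about the order of these steps: as far as we understand, stop-word removal of this kind is case-sensitive, so a capitalised "The" survives if lower-casing happens afterwards. The sketch below shows an alternative ordering in base R only, with a tiny hand-made stop-word list (`stop_toy`) and a hypothetical helper (`clean_text`). It is not the code we ran for the results in this report.

```r
# Tiny hand-made stop-word list, for illustration only
stop_toy <- c("the", "was", "and", "a", "is")

clean_text <- function(x, stop_words) {
  x <- tolower(x)                                   # lower-case first so "The" matches "the"
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole-word stop words
  x <- gsub("[[:digit:]]+", "", x)                  # remove numbers
  x <- gsub("[[:punct:]]+", "", x)                  # remove punctuation
  x <- gsub("\\s+", " ", x)                         # collapse whitespace last
  trimws(x)
}

clean_text("The room was CLEAN, and 5 minutes from a subway!", stop_toy)
## [1] "room clean minutes from subway"
```

Collapsing whitespace at the end matters because every earlier removal can leave extra spaces behind.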

Sentiment Analysis

##install.packages("tidytext")
library(tidytext)

tidy_ab  <- unnest_tokens(airbnb3,  output = word, input = comments) %>%
  anti_join(stop_words, by = "word")

afinn <- get_sentiments("afinn")
tidy_ab_sent <- inner_join(tidy_ab,  afinn, by = "word")

## average AFINN score per host-reviewer pair, used as the review-level sentiment
tidy_ab_sent <- tidy_ab_sent %>%
  group_by(host_id, reviewer_id) %>%
  mutate(rev_sent = mean(value))

## price is read in as text, so summary() on the raw column only reports its class;
## parse it to a number before summarising
summary(parse_number(tidy_ab_sent$price))
## 75% of the properties are cheaper than $145 per night

tidy_ab_sent$sentiment_fac <- cut(tidy_ab_sent$rev_sent, breaks = -5:5)
table(tidy_ab_sent$sentiment_fac)
## 
## (-5,-4] (-4,-3] (-3,-2] (-2,-1]  (-1,0]   (0,1]   (1,2]   (2,3]   (3,4]   (4,5] 
##      18     638    2334    5194   13835   49593  230525  318297   31337     509
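
To show what the scoring and binning steps do on a small scale, here is a self-contained sketch in base R. The lexicon `lexicon_toy` is a made-up miniature in the same shape as AFINN (a word and a score between -5 and 5); its scores, the `reviews_toy` data, and the `review_id` column are all invented for illustration.

```r
# Made-up miniature lexicon in the same shape as AFINN (word, value)
lexicon_toy <- data.frame(
  word  = c("clean", "great", "nice", "dirty", "noisy", "terrible"),
  value = c(2, 3, 3, -2, -1, -3),
  stringsAsFactors = FALSE
)

# Made-up cleaned reviews
reviews_toy <- data.frame(
  review_id = c(1, 2, 3),
  comments  = c("great clean room nice host",
                "dirty noisy street",
                "nice location terrible noisy neighbours"),
  stringsAsFactors = FALSE
)

# One row per (review, word), similar in spirit to unnest_tokens()
tokens <- do.call(rbind, lapply(seq_len(nrow(reviews_toy)), function(i) {
  data.frame(review_id = reviews_toy$review_id[i],
             word      = strsplit(reviews_toy$comments[i], " ", fixed = TRUE)[[1]],
             stringsAsFactors = FALSE)
}))

# Keep only words that have a score (an inner join), then average per review
scored   <- merge(tokens, lexicon_toy, by = "word")
rev_sent <- aggregate(value ~ review_id, data = scored, FUN = mean)
names(rev_sent)[2] <- "rev_sent"

# Bin the averages with the same breaks as above
rev_sent$sentiment_fac <- cut(rev_sent$rev_sent, breaks = -5:5)

rev_sent
```

Words without a score, such as "room" or "host", drop out at the join, so each review's sentiment depends only on its scored words. With these breaks an average of exactly -5 falls outside every bin and becomes `NA` unless `include.lowest = TRUE` is passed to `cut()`.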

## <<DocumentTermMatrix (documents: 2, terms: 1528)>>
## Non-/sparse entries: 1528/1528
## Sparsity           : 50%
## Maximal term length: 17
## Weighting          : term frequency (tf)

Good Comments Word Cloud

What are the key words that are found in good comments?

Words like clean, nice, and recommend all come up! It seems like cleanliness, aesthetics, and social cues (like recommend) are the most important aspects of a good review.

Bad Comments Word Cloud

What are the key words that are found in bad comments?

Words like noisy, bad, dirty, block, hard stops come up in bad reviews! People want a nice, quiet, clean place to stay!
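
Both word clouds rest on how often each word appears in good versus bad comments. The sketch below builds such a words-by-group count matrix in base R from made-up comments (`good_comments` and `bad_comments` are placeholders). As far as we understand, a matrix of this shape, with one row per word and one column per group, is what comparison-style word cloud functions such as `comparison.cloud()` in the wordcloud package expect, but that call is not shown here.

```r
# Made-up cleaned comments, already split into "good" and "bad" groups
good_comments <- c("clean nice room", "nice host recommend", "clean quiet recommend")
bad_comments  <- c("noisy dirty room", "bad noisy street")

# Split comments into individual words
split_words <- function(x) unlist(strsplit(x, " ", fixed = TRUE))
good_words  <- split_words(good_comments)
bad_words   <- split_words(bad_comments)
all_words   <- sort(unique(c(good_words, bad_words)))

# One row per word, one column per group; words absent from a group count as 0
term_matrix <- cbind(
  good = as.vector(table(factor(good_words, levels = all_words))),
  bad  = as.vector(table(factor(bad_words,  levels = all_words)))
)
rownames(term_matrix) <- all_words

term_matrix
```

Words that appear in only one column, like "noisy" or "recommend" here, are the ones that separate the two groups most clearly.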

Dissimilar Words Word Cloud

7. How does the distribution of AirBNB locations look throughout NYC?


8. How do prices vary by borough/neighborhood for AirBNBs?

We decided to use this as our last plot because it was the easiest for actual consumers/users to engage with.

Our choropleth maps are attached separately because the HTML files were too large. In 2019, AirBNB had more rooms and vacancies throughout NYC, while in 2020 the number of rooms and vacancies decreased due to COVID-19. The files are plotly_num_airbnb19.html and plotly_num_airbnb20.html, and they are also on our GitHub!

Thanks for a great semester!!