To start: load the trusty tidyverse package and import the Airbnb data.
library(tidyverse)
library(ggmap)
library(RColorBrewer)
setwd("C:/Users/monica/Documents/Courses/Q7_Fall2018/GEOG 208/week4_project")
untidy <- read_csv("listings.csv")
Parsed with column specification:
cols(
id = col_integer(),
name = col_character(),
host_id = col_integer(),
host_name = col_character(),
neighbourhood_group = col_character(),
neighbourhood = col_character(),
latitude = col_double(),
longitude = col_double(),
room_type = col_character(),
price = col_integer(),
minimum_nights = col_integer(),
number_of_reviews = col_integer(),
last_review = col_date(format = ""),
reviews_per_month = col_double(),
calculated_host_listings_count = col_integer(),
availability_365 = col_integer()
)
head(untidy)
# A tibble: 6 x 16
id name host_id host_name neighbourhood_g~ neighbourhood latitude longitude room_type price minimum_nights
<int> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <chr> <int> <int>
1 109 Amaz~ 521 Paolo <NA> Culver City 34.0 -118. Entire h~ 122 7
2 344 Fami~ 767 Melissa <NA> Burbank 34.2 -118. Entire h~ 168 2
3 25445 Down~ 105868 Namhau <NA> Culver City 34.0 -118. Private ~ 95 1
4 2404 dele~ 2633 Jjjj <NA> Del Rey 34.0 -118. Shared r~ 85 1
5 25670 Char~ 107370 Sandra <NA> West Los Ang~ 34.0 -118. Entire h~ 90 3
6 25672 Mode~ 107473 Claudia <NA> East Hollywo~ 34.1 -118. Entire h~ 130 2
# ... with 5 more variables: number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>,
# calculated_host_listings_count <int>, availability_365 <int>
sapply(untidy, function(x) length(unique(x)))
id name host_id
43763 42918 26730
host_name neighbourhood_group neighbourhood
9053 1 260
latitude longitude room_type
43763 43763 3
price minimum_nights number_of_reviews
836 92 424
last_review reviews_per_month calculated_host_listings_count
1366 1042 49
availability_365
366
This doesn’t actually look too messy! I’m going to look at the price variable a little more closely, since I think I’ll be using it a lot.
summary(untidy$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 69.0 105.0 194.2 180.0 25000.0
That’s quite a range of values! The 25,000 seems like a real outlier, and I don’t really want to look at listings with a price of 0 USD (or anything under 10 USD, for that matter), so I’ll keep only listings priced at 10 USD or more.
airbnb_data <- untidy %>% filter(price >= 10)
summary(airbnb_data$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.0 69.0 105.0 194.3 180.0 25000.0
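It can be reassuring to check how many rows a filter like this removes. Here is a minimal sketch on a made-up table; `toy`, `kept` and `n_dropped` are illustrative names and the values are not from listings.csv:

```r
library(dplyr)

# Toy stand-in for the listings table (made-up prices, not the real data)
toy <- tibble(
  id    = 1:6,
  price = c(0, 5, 10, 105, 180, 25000)
)

# Same rule as above: keep listings priced at 10 USD or more
kept <- toy %>% filter(price >= 10)

# Number of rows the filter dropped
n_dropped <- nrow(toy) - nrow(kept)
n_dropped
```

Run against the real table, the same subtraction would show how many very cheap listings were left out.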
1. Listings by Neighborhood
I want to start off by mapping the price and number of listings by neighborhood, faceted by room type. We’ll need a new dataset that includes the average price and number of listings for each room type in each neighborhood, as well as a single set of coordinates for each of those groups. The quick and not-so-accurate way I’m going to get those coordinates is by simply averaging the latitudes and longitudes of the listings in each group. Lastly, I’ll arrange the table by price so that the higher values are drawn last.
nbhood_mean <- airbnb_data %>%
group_by(neighbourhood, room_type) %>%
summarise(
latitude = mean(latitude, na.rm = TRUE),
longitude = mean(longitude, na.rm = TRUE),
number_of_reviews = sum(number_of_reviews, na.rm = TRUE),
last_review = mean(last_review, na.rm = TRUE),
reviews_per_month = mean(reviews_per_month, na.rm = TRUE),
listings = n(),
price = mean(price, na.rm = TRUE)) %>%
ungroup() %>%
arrange(price)
head(nbhood_mean)
# A tibble: 6 x 9
neighbourhood room_type latitude longitude number_of_reviews last_review reviews_per_month listings price
<chr> <chr> <dbl> <dbl> <int> <date> <dbl> <int> <dbl>
1 La Habra Heights Shared room 34.0 -118. 0 NA NaN 1 10
2 Green Meadows Shared room 33.9 -118. 32 2018-08-29 3.29 3 18.7
3 Cudahy Shared room 34.0 -118. 0 NA NaN 1 20
4 Pico-Union Shared room 34.0 -118. 2215 2018-05-01 1.12 177 21.3
5 Monrovia Shared room 34.1 -118. 3 2016-07-21 0.11 1 22
6 Watts Shared room 33.9 -118. 165 2018-09-18 2.00 17 23.4
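To make the averaging step easier to follow, here is the same pattern on a tiny made-up table. The neighbourhood names "A" and "B" and all values are invented for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later:

```r
library(dplyr)

# Four made-up listings in two hypothetical neighbourhoods
toy <- tibble(
  neighbourhood = c("A", "A", "A", "B"),
  room_type     = c("Shared room", "Shared room", "Private room", "Shared room"),
  latitude      = c(34.0, 34.2, 34.1, 33.9),
  longitude     = c(-118.2, -118.4, -118.3, -118.1),
  price         = c(20, 40, 90, 25)
)

toy_mean <- toy %>%
  group_by(neighbourhood, room_type) %>%
  summarise(
    latitude  = mean(latitude),   # rough "centroid" of the group's listings
    longitude = mean(longitude),
    listings  = n(),              # number of listings in the group
    price     = mean(price),
    .groups   = "drop"            # returns an ungrouped table, like ungroup()
  ) %>%
  arrange(price)                  # cheapest first, so pricier points draw last

toy_mean
```

The averaged coordinates are only as good as the spread of listings: a group with a single listing simply inherits that listing's location.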
Next we’ll set the extent of the map…by liberally borrowing this snippet from r-bloggers.com:
LAheight <- max(airbnb_data$latitude) - min(airbnb_data$latitude)
LAwidth <- max(airbnb_data$longitude) - min(airbnb_data$longitude)
LA_borders <- c(bottom = min(airbnb_data$latitude) - 0.05 * LAheight,
top = max(airbnb_data$latitude) + 0.05 * LAheight,
left = min(airbnb_data$longitude) - 0.05 * LAwidth,
right = max(airbnb_data$longitude) + 0.05 * LAwidth)
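The same padding logic could be wrapped in a small function so that the 5% margin becomes a parameter. This is only a sketch: `padded_bbox()` is a hypothetical helper of mine, not something from ggmap or the borrowed snippet.

```r
# Hypothetical helper: bounding box around a set of points, padded on each
# side by a fraction (`pad`) of the data's height and width
padded_bbox <- function(lat, lon, pad = 0.05) {
  height <- max(lat) - min(lat)
  width  <- max(lon) - min(lon)
  c(bottom = min(lat) - pad * height,
    top    = max(lat) + pad * height,
    left   = min(lon) - pad * width,
    right  = max(lon) + pad * width)
}

# Made-up extent: 2 degrees tall and 2 degrees wide, with the default 5% padding
box <- padded_bbox(lat = c(33, 35), lon = c(-119, -117))
box
```

With the real data, `padded_bbox(airbnb_data$latitude, airbnb_data$longitude)` should give the same four values as `LA_borders`.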
I don’t want any labels cluttering up the background, so I’m using the “toner-background” Stamen Map:
LA_map <- get_stamenmap(LA_borders, zoom = 8, maptype = "toner-background")
And now the maps! I don’t think it’s necessary to show the x-/y-axis labels or titles, so I’ll remove those.
ggmap(LA_map) +
geom_point(data = nbhood_mean, mapping = aes(x = longitude, y = latitude, color = price, size = listings)) +
scale_color_distiller(palette = "RdYlBu", direction = -1, trans = "log10", name = "Price (USD)") +
scale_size(breaks = c(0,1,10,100,1000), labels = c(">0",">=1",">=10",">=100",">=1000"), range = c(1,9), name = "No. of Listings") +
theme(panel.background = element_rect(fill = "white")) +
facet_wrap(. ~ room_type) +
labs(title = "Listings by Neighborhood", x = "", y = "") +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
plot.title=element_text(vjust = 4, face = "bold"),
strip.text.x = element_text(size = 10.5, vjust = 3, face = "bold"),
strip.background = element_rect(fill = NA))
As we might have expected, the most affordable options are shared rooms in Downtown/Central LA. The priciest listings are on the Westside and in the Hollywood Hills, where there are fewer shared room options.
2. East-West Distribution of Listings…aka, the Gap Instinct?
There seem to be so many listings on the Westside…and more expensive ones at that. Is there an uneven distribution of listings in LA, from East to West? Is there…a GAP???
ggplot(airbnb_data, aes(x = longitude, fill = room_type)) +
geom_histogram(bins = 40) +
scale_fill_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), name = "Room Type") +
labs(title = "East-West Distribution of Listings", x = "Longitude", y = "Number of listings") +
theme(plot.title=element_text(vjust=2, face = "bold"),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"))
Not really. The x-centroid of mainland LA County should be around -118.2, so while the bulk of listings sits slightly to the west (left) of that, I don’t think we can identify a gap here.
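One way to put a number on that impression is to compute the share of listings west of the centroid. A minimal sketch with made-up longitudes; -118.2 is the approximate figure quoted above, and `toy_lon` is illustrative only:

```r
# Made-up longitudes standing in for airbnb_data$longitude
toy_lon <- c(-118.5, -118.4, -118.3, -118.25, -118.1, -118.0)

# Approximate x-centroid quoted in the text
centroid_lon <- -118.2

# More negative longitude means further west; the mean of a logical
# vector is the proportion of TRUE values
west_share <- mean(toy_lon < centroid_lon)
west_share
```

A share close to 0.5 would support the "no gap" reading, while a value far from it would suggest a real imbalance.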
3. Number and Type of Listings < $1k
I’d like to look at the distribution of prices per room type a little more. But first I’m going to check on potential outliers again…
airbnb_data <- airbnb_data %>% arrange(desc(price))
head(airbnb_data)
# A tibble: 6 x 16
id name host_id host_name neighbourhood_g~ neighbourhood latitude longitude room_type price minimum_nights
<int> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <chr> <int> <int>
1 1.92e7 Hist~ 1.34e8 Lorenzo <NA> Hollywood Hi~ 34.1 -118. Entire h~ 25000 2
2 2.18e7 Beve~ 1.59e8 Daniel <NA> Beverly Crest 34.1 -118. Private ~ 15000 20
3 2.95e6 Mali~ 5.76e6 Yun <NA> Malibu 34.0 -119. Entire h~ 10000 1
4 3.07e6 Mali~ 5.76e6 Yun <NA> Unincorporat~ 34.1 -119. Entire h~ 10000 1
5 4.03e6 UNBE~ 1.02e7 Robert <NA> Beverly Crest 34.1 -118. Entire h~ 10000 3
6 1.14e7 Cape~ 4.82e7 Mary <NA> Malibu 34.0 -119. Entire h~ 10000 30
# ... with 5 more variables: number_of_reviews <int>, last_review <date>, reviews_per_month <dbl>,
# calculated_host_listings_count <int>, availability_365 <int>
I tried excluding the $25,000 listing and setting the plotted limit to the next highest price ($15k), but that didn’t produce very interesting results either. I think I’d like to focus on the non-luxury listings for now, a.k.a. listings under the modest amount of $1,000 for one night’s humble stay.
price_1k <- airbnb_data %>% filter(price <= 1000)
ggplot(price_1k, aes(x = price, fill = room_type)) +
geom_histogram(position = "dodge") +
scale_fill_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), name = "Room Type") +
labs(title = "Number and Type of Listings under 1,000 USD", x = "Price per night (USD)", y = "Number of listings") +
theme(plot.title=element_text(vjust=2, face = "bold"),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"))
The most affordable options, as we could’ve expected, are shared and private rooms.
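A small summary table can back up what the histogram shows. This sketch uses invented prices; on the real data the same pipeline would start from `price_1k`:

```r
library(dplyr)

# Made-up prices for the three room types
toy <- tibble(
  room_type = c("Entire home/apt", "Entire home/apt",
                "Private room", "Private room",
                "Shared room", "Shared room"),
  price     = c(150, 250, 60, 80, 25, 35)
)

# Count and median price per room type, cheapest first
price_summary <- toy %>%
  group_by(room_type) %>%
  summarise(
    listings     = n(),
    median_price = median(price),
    .groups      = "drop"
  ) %>%
  arrange(median_price)

price_summary
```

I used the median rather than the mean here because the price distribution is so skewed.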
4. Listing Price vs. Last Date Reviewed
There are some pretty crazy listing prices in LA - I’m curious if those actually get rented! Let’s take a look at the “last_review” variable.
summary(airbnb_data$last_review)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
"2010-03-28" "2018-06-07" "2018-09-02" "2018-05-13" "2018-09-23" "2018-10-05" "9101"
Most listings appear to have been rented in the last year, but there are some outliers going back to 2010 that I’m not particularly interested in. We’ll say good-bye to those.
I had trouble filtering on the “last_review” variable, so I made sure it is stored as a Date and compared it against a Date cutoff rather than a plain string:
airbnb_data$last_review <- as.Date(airbnb_data$last_review, "%Y-%m-%d")
rented <- airbnb_data %>%
filter(last_review >= as.Date("2013-01-01")) %>%
arrange(number_of_reviews)
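To see what this kind of date filter does to old and missing values, here is a sketch on four made-up rows. `toy`, `cutoff` and `rented_toy` are illustrative names:

```r
library(dplyr)

# Made-up review dates, including one missing value
toy <- tibble(
  id          = 1:4,
  last_review = as.Date(c("2010-03-28", "2013-01-02", NA, "2018-10-05"))
)

# Compare Date with Date
cutoff <- as.Date("2013-01-01")

# filter() keeps rows where the condition is TRUE, so rows where the
# comparison evaluates to NA are dropped along with the old dates
rented_toy <- toy %>% filter(last_review >= cutoff)
rented_toy
```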
summary(rented$last_review)
Min. 1st Qu. Median Mean 3rd Qu. Max.
"2013-01-02" "2018-06-07" "2018-09-02" "2018-05-13" "2018-09-23" "2018-10-05"
That also gets rid of the NA values, since filter() drops rows where the condition evaluates to NA. Now we can look at the listings that have been reviewed, and so presumably rented, at some point since the start of 2013:
ggplot(rented, aes(x = last_review, y = price, color = room_type, size = number_of_reviews)) +
geom_jitter(shape = 1, stroke = 1.2, alpha = 0.4) +
scale_color_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), guide = "none") +
scale_size(breaks = c(1,10,50,100), labels = c(">=1",">=10",">=50",">=100"), range = c(2,12), name = "No. of Reviews") +
facet_grid(. ~ room_type) +
labs(title = "Listing Price vs. Last Date Reviewed", x = "Last Review", y = "Price/night (USD)") +
theme_minimal() +
theme(plot.title=element_text(vjust=3, face = "bold"),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"),
strip.text.x = element_text(size = 10, vjust = 2),
strip.background = element_rect(color = NA))
So, $10,000 private rooms aren’t rented so often, but people really do pay that for entire homes and apartments! Listings in the $1,000-5,000 range, however, aren’t as consistently rented.
5. Most Reviewed Neighbourhoods - the Size Instinct
In which neighbourhoods are listings more likely to be rented? I want to see which neighbourhood’s listings are most popular, in some sense. There’s no data for the number of times each listing has been rented, but we can look at the number of reviews.
To combat the Size Instinct, Rosling encourages us to “get things in proportion.” So let’s see if the neighbourhoods with the highest total number of reviews are also the neighbourhoods that get the most reviews per month.
We’ll revisit our nbhood_mean dataset and focus on the columns for total number of reviews (which we summed by room type and neighbourhood) and average number of reviews per month (averaged by the same variables).
First, I need a dataset of the 5 neighbourhoods, in each room type, with the highest total reviews. I’m sure there is a more efficient way to do this than to create 3 separate tables that we then bind together, but here we are for now. I’ll do the same for reviews per month.
reviews_total_entire <- nbhood_mean %>%
filter(room_type == "Entire home/apt") %>%
arrange(desc(number_of_reviews)) %>%
slice(1:5)
reviews_total_private <- nbhood_mean %>%
filter(room_type == "Private room") %>%
arrange(desc(number_of_reviews)) %>%
slice(1:5)
reviews_total_shared <- nbhood_mean %>%
filter(room_type == "Shared room") %>%
arrange(desc(number_of_reviews)) %>%
slice(1:5)
reviews_top5_total <- rbind(reviews_total_entire, reviews_total_private, reviews_total_shared)
rm(reviews_total_entire, reviews_total_private, reviews_total_shared)
reviews_monthly_entire <- nbhood_mean %>%
filter(room_type == "Entire home/apt") %>%
arrange(desc(reviews_per_month)) %>%
slice(1:5)
reviews_monthly_private <- nbhood_mean %>%
filter(room_type == "Private room") %>%
arrange(desc(reviews_per_month)) %>%
slice(1:5)
reviews_monthly_shared <- nbhood_mean %>%
filter(room_type == "Shared room") %>%
arrange(desc(reviews_per_month)) %>%
slice(1:5)
reviews_top5_monthly <- rbind(reviews_monthly_entire, reviews_monthly_private, reviews_monthly_shared)
rm(reviews_monthly_entire, reviews_monthly_private, reviews_monthly_shared)
head(reviews_top5_total)
# A tibble: 6 x 9
neighbourhood room_type latitude longitude number_of_reviews last_review reviews_per_month listings price
<chr> <chr> <dbl> <dbl> <int> <date> <dbl> <int> <dbl>
1 Venice Entire home/apt 34.0 -118. 109153 2018-05-22 2.16 2168 254.
2 Hollywood Entire home/apt 34.1 -118. 58221 2018-06-10 2.18 2001 181.
3 Downtown Entire home/apt 34.0 -118. 36128 2018-06-14 2.31 1300 206.
4 Silver Lake Entire home/apt 34.1 -118. 31760 2018-05-05 2.02 689 163.
5 Hollywood Hills Entire home/apt 34.1 -118. 29296 2018-05-10 1.86 857 273.
6 Venice Private room 34.0 -118. 23085 2018-03-07 2.21 511 105.
head(reviews_top5_monthly)
# A tibble: 6 x 9
neighbourhood room_type latitude longitude number_of_revie~ last_review reviews_per_mon~ listings price
<chr> <chr> <dbl> <dbl> <int> <date> <dbl> <int> <dbl>
1 Artesia Entire home/~ 33.9 -118. 71 2018-09-23 7.02 4 144.
2 Unincorporated Santa Susa~ Entire home/~ 34.4 -119. 34 2018-09-29 6.62 2 142
3 Paramount Entire home/~ 33.9 -118. 324 2018-09-11 5.57 2 112.
4 Del Aire Entire home/~ 33.9 -118. 784 2018-06-18 5.23 14 110.
5 Bellflower Entire home/~ 33.9 -118. 184 2018-09-28 5.16 6 127.
6 Manchester Square Private room 34.0 -118. 43 2018-10-02 7.44 5 63
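The three-tables-and-bind approach above can probably be collapsed into a single grouped step. A sketch of one option, assuming dplyr 1.0.0 or later for `slice_max()`; the table is made up and keeps the top 2 instead of the top 5 so it stays small:

```r
library(dplyr)

# Made-up review totals for three hypothetical neighbourhoods
toy <- tibble(
  neighbourhood     = c("A", "B", "C", "A", "B", "C"),
  room_type         = rep(c("Entire home/apt", "Private room"), each = 3),
  number_of_reviews = c(300, 100, 200, 10, 30, 20)
)

# Top n rows per room type in one pass
top2 <- toy %>%
  group_by(room_type) %>%
  slice_max(number_of_reviews, n = 2, with_ties = FALSE) %>%
  ungroup()

top2
```

Swapping `number_of_reviews` for `reviews_per_month` and setting `n = 5` on `nbhood_mean` should give tables along the lines of `reviews_top5_total` and `reviews_top5_monthly`, though ties and NaN values may be handled differently than by `arrange()` plus `slice()`.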
I’m a little concerned about the longer neighbourhood names, so I’ll make some adjustments before plotting. I want to sub in line breaks in place of spaces - but not all the spaces, and not all of the labels, so it’s going to be a very silly, manual process until I learn how to be better:
reviews_top5_monthly$neighbourhood <- sub("Unincorporated Santa Susana ", "Santa Susana\n", reviews_top5_monthly$neighbourhood)
reviews_top5_monthly$neighbourhood <- sub("Hills Estates", "Hills\nEstates", reviews_top5_monthly$neighbourhood)
reviews_top5_monthly$neighbourhood <- sub("Hills/Crenshaw", "Hills/\nCrenshaw", reviews_top5_monthly$neighbourhood)
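A less manual option might be `str_wrap()` from stringr (part of the tidyverse), which inserts line breaks at spaces once a label passes a target width. It would not shorten names or break after a slash, so it is not a drop-in replacement for the substitutions above. The labels and the width of 15 below are illustrative choices:

```r
library(stringr)

# Illustrative labels: one short, one long
labels <- c("Venice", "Rolling Hills Estates")

# Wrap at roughly 15 characters; short labels are left untouched
wrapped <- str_wrap(labels, width = 15)

# cat() prints the embedded line breaks as actual new lines
cat(wrapped, sep = "\n---\n")
```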
And the two plots:
ggplot(reviews_top5_total, aes(x = neighbourhood, y = number_of_reviews, color = room_type)) +
geom_point(size = 4) +
geom_segment(aes(x = neighbourhood, xend = neighbourhood, y = 0, yend = number_of_reviews)) +
coord_flip() +
facet_grid(room_type ~ ., scales = "free_y") +
scale_color_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), guide = "none") +
labs(title = "Most Reviewed Neighbourhoods v1", subtitle = "Based on total number of reviews", x = "", y = "Total Reviews") +
theme_minimal() +
theme(plot.title=element_text(vjust=2, face = "bold"),
plot.subtitle=element_text(color = "gray60"),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"),
strip.text.y = element_text(size = 10.5, face = "bold"))
ggplot(reviews_top5_monthly, aes(x = neighbourhood, y = reviews_per_month, color = room_type)) +
geom_point(size = 4) +
geom_segment(aes(x = neighbourhood, xend = neighbourhood, y = 0, yend = reviews_per_month)) +
coord_flip() +
facet_grid(room_type ~ ., scales = "free_y") +
scale_color_manual(values = c("#efa35c", "#4ab8b8", "#1b3764"), guide = "none") +
labs(title = "Most Reviewed Neighbourhoods v2", subtitle = "Based on reviews per month (average of listings)", x = "", y = "Average Reviews per Month") +
theme_minimal() +
theme(plot.title=element_text(vjust=2, face = "bold"),
plot.subtitle=element_text(color = "gray60"),
axis.title.x=element_text(vjust=-1, face = "bold"),
axis.title.y=element_text(vjust=4, face = "bold"),
strip.text.y = element_text(size = 10.5, face = "bold"))
The top neighbourhoods are totally different using the two different metrics! There are certainly more listings in Venice, Hollywood, & co., and thus more total reviews, but when we look at the review rate in version 2, a different set of neighbourhoods comes out on top, such as Artesia and Paramount, which have only a few listings each. Version 1 also suggests that entire home/apt options get way more reviews than private and shared rooms, but room type doesn’t appear to make much of a difference when we look at reviews per month.
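To check the "totally different" claim without eyeballing the plots, the two top lists can be compared directly. The sketch below hard-codes the entire home/apt names printed in the two tables above; the neighbourhood whose name is truncated in that output is left out instead of guessed:

```r
# Entire home/apt neighbourhoods from the two printed tables
top_total   <- c("Venice", "Hollywood", "Downtown", "Silver Lake", "Hollywood Hills")
top_monthly <- c("Artesia", "Paramount", "Del Aire", "Bellflower")  # truncated name omitted

# Neighbourhoods that appear in both rankings
shared <- intersect(top_total, top_monthly)
length(shared)
```

On the full tables, `intersect(reviews_top5_total$neighbourhood, reviews_top5_monthly$neighbourhood)` would do the same across all room types at once.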