A prop-tech company wants to understand what is so appealing about the short-term housing market in Lagos. For the sake of confidentiality, the name of the company and CEO have been altered. I will focus my analysis on the prices and occupancy of these units while using data visualizations to illustrate my findings.
There are four parts to my script as follows:
Firstly, all the necessary packages and libraries for data cleaning, analysis and visualization need to be installed and loaded. The libraries used are tidyverse, here, skimr, janitor, dplyr, lubridate,ggplot2 and ggmap.
#Load packages
library("tidyverse") # General analysis
library("here") # data cleaning
library("skimr") # data cleaning
library("janitor") # data cleaning
library("dplyr") # data manipulation
library("lubridate") # dates
library("ggmap") # maps
library("ggplot2") # visualizations
library("scales") # visualizations
Now that the packages are loaded, let’s read the data in and take a peek.
# Load the data
ga_listings <- read_csv("listings.csv")
ga_calendar <- read_csv("calendar.csv")
Now that we have read in the file, we can view what fields are available in the data. I chose not to combine the dataframes because I decided on joining them when I am visualising.
names(ga_listings)
## [1] "id"
## [2] "listing_url"
## [3] "scrape_id"
## [4] "last_scraped"
## [5] "name"
## [6] "description"
## [7] "neighborhood_overview"
## [8] "picture_url"
## [9] "host_id"
## [10] "host_url"
## [11] "host_name"
## [12] "host_since"
## [13] "host_location"
## [14] "host_about"
## [15] "host_response_time"
## [16] "host_response_rate"
## [17] "host_acceptance_rate"
## [18] "host_is_superhost"
## [19] "host_thumbnail_url"
## [20] "host_picture_url"
## [21] "host_neighbourhood"
## [22] "host_listings_count"
## [23] "host_total_listings_count"
## [24] "host_verifications"
## [25] "host_has_profile_pic"
## [26] "host_identity_verified"
## [27] "neighbourhood"
## [28] "neighbourhood_cleansed"
## [29] "neighbourhood_group_cleansed"
## [30] "latitude"
## [31] "longitude"
## [32] "property_type"
## [33] "room_type"
## [34] "accommodates"
## [35] "bathrooms"
## [36] "bathrooms_text"
## [37] "bedrooms"
## [38] "beds"
## [39] "amenities"
## [40] "price"
## [41] "minimum_nights"
## [42] "maximum_nights"
## [43] "minimum_minimum_nights"
## [44] "maximum_minimum_nights"
## [45] "minimum_maximum_nights"
## [46] "maximum_maximum_nights"
## [47] "minimum_nights_avg_ntm"
## [48] "maximum_nights_avg_ntm"
## [49] "calendar_updated"
## [50] "has_availability"
## [51] "availability_30"
## [52] "availability_60"
## [53] "availability_90"
## [54] "availability_365"
## [55] "calendar_last_scraped"
## [56] "number_of_reviews"
## [57] "number_of_reviews_ltm"
## [58] "number_of_reviews_l30d"
## [59] "first_review"
## [60] "last_review"
## [61] "review_scores_rating"
## [62] "review_scores_accuracy"
## [63] "review_scores_cleanliness"
## [64] "review_scores_checkin"
## [65] "review_scores_communication"
## [66] "review_scores_location"
## [67] "review_scores_value"
## [68] "license"
## [69] "instant_bookable"
## [70] "calculated_host_listings_count"
## [71] "calculated_host_listings_count_entire_homes"
## [72] "calculated_host_listings_count_private_rooms"
## [73] "calculated_host_listings_count_shared_rooms"
## [74] "reviews_per_month"
names(ga_calendar)
## [1] "listing_id" "date" "available" "price"
## [5] "adjusted_price" "minimum_nights" "maximum_nights"
The ga_listings contains 12530 listings (=number of rows). The data is organized in a form where each row is a listing and the columns contain information about the listing. While ga_calendar contains 4,573,499 rows. Next, it would be best to get a summary of the data.
# check data
ga_listings %>%
glimpse()
## Rows: 12,530
## Columns: 74
## $ id <dbl> 18536, 18769, 18828, 1884…
## $ listing_url <chr> "https://www.airbnb.com/r…
## $ scrape_id <dbl> 2.021103e+13, 2.021103e+1…
## $ last_scraped <date> 2021-10-29, 2021-10-29, …
## $ name <chr> "Cottage on our property"…
## $ description <chr> "<b>The space</b><br />Pe…
## $ neighborhood_overview <chr> NA, NA, NA, NA, NA, NA, N…
## $ picture_url <chr> "https://a0.muscache.com/…
## $ host_id <dbl> 71321, 72045, 72240, 7224…
## $ host_url <chr> "https://www.airbnb.com/u…
## $ host_name <chr> "Robin", "Firhana", "Miss…
## $ host_since <date> 2010-01-18, 2010-01-20, …
## $ host_location <chr> "Johannesburg, Gauteng, S…
## $ host_about <chr> "We have 5 children livin…
## $ host_response_time <chr> "N/A", "N/A", "N/A", "N/A…
## $ host_response_rate <chr> "N/A", "N/A", "N/A", "N/A…
## $ host_acceptance_rate <chr> "N/A", "N/A", "0%", "N/A"…
## $ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALS…
## $ host_thumbnail_url <chr> "https://a0.muscache.com/…
## $ host_picture_url <chr> "https://a0.muscache.com/…
## $ host_neighbourhood <chr> NA, NA, NA, NA, NA, NA, N…
## $ host_listings_count <dbl> 1, 1, 1, 4, 1, 1, 1, 1, 1…
## $ host_total_listings_count <dbl> 1, 1, 1, 4, 1, 1, 1, 1, 1…
## $ host_verifications <chr> "['phone']", "['email', '…
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, FALSE, …
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, FALS…
## $ neighbourhood <chr> NA, NA, NA, NA, NA, NA, N…
## $ neighbourhood_cleansed <chr> "City of Johannesburg", "…
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
## $ latitude <dbl> -26.18090, -26.24933, -25…
## $ longitude <dbl> 28.07699, 28.00559, 28.18…
## $ property_type <chr> "Entire cabin", "Private …
## $ room_type <chr> "Entire home/apt", "Priva…
## $ accommodates <dbl> 2, 2, 3, 3, 6, 6, 4, 7, 2…
## $ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
## $ bathrooms_text <chr> "1 bath", "1.5 baths", "1…
## $ bedrooms <dbl> NA, 1, NA, 1, NA, NA, 1, …
## $ beds <dbl> 2, 1, 3, 3, 4, 4, 4, 7, 1…
## $ amenities <chr> "[\"Breakfast\", \"Kitche…
## $ price <chr> "$1,269.00", "$756.00", "…
## $ minimum_nights <dbl> 1, 1, 5, 1, 14, 14, 5, 7,…
## $ maximum_nights <dbl> 730, 14, 30, 30, 30, 730,…
## $ minimum_minimum_nights <dbl> 1, 1, 5, 1, 14, 14, 5, 7,…
## $ maximum_minimum_nights <dbl> 1, 1, 5, 1, 14, 14, 5, 7,…
## $ minimum_maximum_nights <dbl> 730, 14, 30, 30, 30, 730,…
## $ maximum_maximum_nights <dbl> 730, 14, 30, 30, 30, 730,…
## $ minimum_nights_avg_ntm <dbl> 1, 1, 5, 1, 14, 14, 5, 7,…
## $ maximum_nights_avg_ntm <dbl> 730, 14, 30, 30, 30, 730,…
## $ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T…
## $ availability_30 <dbl> 30, 30, 29, 29, 30, 30, 3…
## $ availability_60 <dbl> 60, 60, 59, 59, 60, 60, 6…
## $ availability_90 <dbl> 90, 90, 89, 89, 90, 90, 9…
## $ availability_365 <dbl> 365, 365, 364, 364, 365, …
## $ calendar_last_scraped <date> 2021-10-29, 2021-10-29, …
## $ number_of_reviews <dbl> 0, 5, 0, 0, 0, 0, 0, 0, 0…
## $ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ first_review <date> NA, 2010-06-24, NA, NA, …
## $ last_review <date> NA, 2019-04-30, NA, NA, …
## $ review_scores_rating <dbl> NA, 4.20, NA, NA, NA, NA,…
## $ review_scores_accuracy <dbl> NA, 4.33, NA, NA, NA, NA,…
## $ review_scores_cleanliness <dbl> NA, 4.67, NA, NA, NA, NA,…
## $ review_scores_checkin <dbl> NA, 5.00, NA, NA, NA, NA,…
## $ review_scores_communication <dbl> NA, 5.00, NA, NA, NA, NA,…
## $ review_scores_location <dbl> NA, 4.67, NA, NA, NA, NA,…
## $ review_scores_value <dbl> NA, 4.00, NA, NA, NA, NA,…
## $ license <lgl> NA, NA, NA, NA, NA, NA, N…
## $ instant_bookable <lgl> FALSE, FALSE, FALSE, FALS…
## $ calculated_host_listings_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ calculated_host_listings_count_entire_homes <dbl> 1, 0, 1, 0, 1, 1, 0, 0, 1…
## $ calculated_host_listings_count_private_rooms <dbl> 0, 1, 0, 1, 0, 0, 1, 0, 0…
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ reviews_per_month <dbl> NA, 0.04, NA, NA, NA, NA,…
# check data
ga_calendar %>%
glimpse()
## Rows: 4,573,499
## Columns: 7
## $ listing_id <dbl> 1860607, 18536, 18536, 18536, 18536, 18536, 18536, 1853…
## $ date <date> 2021-10-29, 2021-10-29, 2021-10-30, 2021-10-31, 2021-1…
## $ available <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, …
## $ price <chr> "$427.00", "$1,269.00", "$1,269.00", "$1,269.00", "$1,2…
## $ adjusted_price <chr> "$427.00", "$1,269.00", "$1,269.00", "$1,269.00", "$1,2…
## $ minimum_nights <dbl> 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ maximum_nights <dbl> 1125, 730, 730, 730, 730, 730, 730, 730, 730, 730, 730,…
Now that we have an overview of the data, we can proceed to perform some data cleaning. Woah! 74 columns is a lot of data points. Though it’s impressive it’s important we only work with what’s relevant to the business task to be efficient.
To do that, a new data frame ga_listings_main will be created. Only attributes that are relevant to the business task will be taken from ga_listings and be added to ga_listings_main.
On the other hand, ga_calendar contains only 7 attributes. However, adjusted prices is redundant and removed since it’s the same as price.
ga_listings_main <- ga_listings %>%
select(
id,host_id,host_is_superhost,neighbourhood_cleansed,latitude,longitude,property_type,room_type,accommodates,
bathrooms_text,bedrooms,beds,price,minimum_nights,maximum_nights
)
ga_calendar_main <- ga_calendar %>%
select(-adjusted_price)
Now, let’s take a look at the new data frame.
ga_listings_main %>%
skim_without_charts()
| Name | Piped data |
| Number of rows | 12530 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| logical | 1 |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| neighbourhood_cleansed | 0 | 1.00 | 6 | 20 | 0 | 9 | 0 |
| property_type | 0 | 1.00 | 3 | 35 | 0 | 89 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 77 | 0.99 | 6 | 17 | 0 | 79 | 0 |
| price | 0 | 1.00 | 7 | 10 | 0 | 1541 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 6 | 1 | 0.18 | FAL: 10226, TRU: 2298 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 34456765.76 | 13177410.44 | 18536.00 | 24217198.50 | 37390007.50 | 45728593.25 | 53039202.00 |
| host_id | 0 | 1.00 | 177835125.26 | 120064382.87 | 64744.00 | 70138093.75 | 161496958.50 | 274968798.00 | 429226726.00 |
| latitude | 0 | 1.00 | -26.04 | 0.21 | -26.91 | -26.15 | -26.09 | -25.90 | -25.34 |
| longitude | 0 | 1.00 | 28.10 | 0.16 | 27.27 | 28.02 | 28.07 | 28.22 | 28.92 |
| accommodates | 0 | 1.00 | 3.22 | 2.54 | 1.00 | 2.00 | 2.00 | 4.00 | 16.00 |
| bedrooms | 906 | 0.93 | 1.74 | 2.05 | 1.00 | 1.00 | 1.00 | 2.00 | 50.00 |
| beds | 848 | 0.93 | 2.13 | 2.97 | 1.00 | 1.00 | 1.00 | 2.00 | 118.00 |
| minimum_nights | 0 | 1.00 | 3.79 | 20.78 | 1.00 | 1.00 | 1.00 | 2.00 | 1000.00 |
| maximum_nights | 0 | 1.00 | 1555.08 | 89334.61 | 1.00 | 90.00 | 1125.00 | 1125.00 | 9999999.00 |
ga_calendar_main %>%
skim_without_charts()
| Name | Piped data |
| Number of rows | 4573499 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| logical | 1 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 7 | 10 | 0 | 3198 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2021-10-29 | 2022-10-30 | 2022-04-29 | 367 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| available | 0 | 1 | 0.73 | TRU: 3335795, FAL: 1237704 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| listing_id | 0 | 1 | 34456809.77 | 13176869.52 | 18536 | 24210720 | 37391924 | 45729119 | 53039202 |
| minimum_nights | 13 | 1 | 3.73 | 20.17 | 1 | 1 | 1 | 2 | 1125 |
| maximum_nights | 13 | 1 | 1697.29 | 89329.40 | 1 | 1125 | 1125 | 1125 | 9999999 |
Now that is done we can focus on renaming and standardizing column names as well as removing duplicate and empty rows. Standardizing column names and filtering.
ga_listings_main <- ga_listings_main%>%
remove_empty(which = c("rows","cols")) %>%
distinct() %>%
rename(
neighbourhood = neighbourhood_cleansed,
bathrooms = bathrooms_text
) %>%
filter(
maximum_nights<= 180
) %>%
clean_names()
ga_calendar_main <- ga_calendar_main %>%
remove_empty(which = c("rows","cols")) %>%
distinct() %>%
filter(
maximum_nights<= 180
) %>%
clean_names()
Out of 12,350 listings in Gauteng only 3478 qualify as short lets (maximum stay less than six months). Now that is done focus can now shift to individual columns. The price attribute for ga_listings_main and ga_calendar_main shouldn’t be categorized as a character variable, it should be numeric. Also the $ and , need to be removed.
ga_listings_main$price = as.numeric(gsub("[\\$,]", "", ga_listings_main$price))
#To confirm
is.numeric(ga_listings_main$price)
## [1] TRUE
ga_calendar_main$price = as.numeric(gsub("[\\$,]", "", ga_calendar_main$price))
#To confirm
is.numeric(ga_calendar_main$price)
## [1] TRUE
Next, removing all text in the bathroom_text column. But first, all fields in the column containing Half-bath need to be replaced with 0.5. Then all text is removed from the column and bathrooms is converter to numeric.
ga_listings_main$bathrooms[ga_listings_main$bathrooms=="Half-bath"] <- "0.5"
ga_listings_main$bathrooms = as.numeric(gsub("[\\privatesharedbaths]", "", ga_listings_main$bathrooms))
#To confirm
is.numeric(ga_listings_main$bathrooms)
## [1] TRUE
That about wraps it up for data cleaning. Let’s head into the analysis!
The first thing is to perform some descriptive analysis to get a feel of the data. Lets’ take a look at host_is_superhost to figure out home many of the hosts are super hosts.
#Count of super hosts
ga_listings_main %>%
group_by(host_is_superhost) %>%
drop_na() %>%
summarise(number_of_hosts = n(), average_price = mean(price)) %>%
mutate(percentage = (number_of_hosts / sum(number_of_hosts))*100) %>%
arrange(desc(number_of_hosts))
In the last year, Less than 20% of available short term rental property has been listed by super hosts. Let’s look at where tenants love to stay most and how much it costs to stay in those place on average.
ga_listings_main %>%
group_by(neighbourhood) %>%
summarise(number_of_listings = n(), average_price = mean(price)) %>%
mutate(percentage = (number_of_listings / sum(number_of_listings))*100) %>%
arrange(desc(number_of_listings))
TheCity of Johannesburg has the highest supply of listings, probably because of high demand. It’s likely but not certain and can’t tell from the data. However, it’s certain that theCity of Johannesburg has twice the number of listings as the City of Tshwane, the location with the second highest number of listings. Let’s examine the most popular types of property and rooms next.
ga_listings_main %>%
group_by(neighbourhood, room_type) %>%
summarise(
number_of_listings = n(), average_accommodation = round(mean(accommodates)), average_price = mean(price)
) %>%
arrange(desc(number_of_listings)) %>%
head()
It appears that entire rental units that can accommodate at least 3 people are the most popular type of property listings in both the City of Johannesburg and the City of Tshwane. It is at least twice as popular as a private room in residential home.
I decided to export the necessary files as Excel files so visualisations can be executed in Tableau. This was done so a comprehensive dashboard can be produced.
library("writexl")
## Warning: package 'writexl' was built under R version 4.0.2
write_xlsx(ga_listings_main,"listings.xlsx")
write_xlsx(ga_calendar_main,"calendar.xlsx")
The final dashboard produced can be view on my Tableau profile. Thank you for reaching the end of my analysis. Please I welcome any feedback and I am very open to collaborating!