airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): name, host_name, room_type
## dbl (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl (2): neighbourhood_group, license
## date (1): last_review
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
column_headers <- colnames(airbnb)
print(column_headers)
## [1] "id" "name"
## [3] "host_id" "host_name"
## [5] "neighbourhood_group" "neighbourhood"
## [7] "latitude" "longitude"
## [9] "room_type" "price"
## [11] "minimum_nights" "number_of_reviews"
## [13] "last_review" "reviews_per_month"
## [15] "calculated_host_listings_count" "availability_365"
## [17] "number_of_reviews_ltm" "license"
neighbourhood_group
The column is empty, but according to the data dictionary, neighbourhood_group represents the neighbourhood group as geocoded using latitude and longitude against neighbourhoods defined by open or public digital shapefiles.
The neighbourhood_group column is derived by matching each listing's latitude and longitude against public maps, which gives a broader regional classification that is useful for grouping listings into districts or larger city areas.
Without the documentation, I might assume it is built the same way as neighbourhood (which holds zip codes), when it actually relies on geospatial data.
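A quick way to confirm which columns are completely empty is to count the non-missing values per column. This is a minimal sketch on a made-up data frame (`toy` and its values are assumptions for illustration, not rows from the real file); the same expression could be applied to airbnb.

```r
# Toy stand-in for the real data; the values are invented for illustration.
toy <- data.frame(
  neighbourhood_group = c(NA, NA, NA),
  neighbourhood       = c(78704, 78741, 78704),
  license             = c(NA, NA, NA)
)

# A column is "empty" when it has zero non-missing values.
empty_cols <- names(toy)[colSums(!is.na(toy)) == 0]
empty_cols
```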
availability_365
The column contains values ranging from 0 to 365. The data dictionary explains that availability_365 represents the days a listing is available for booking in the next 365 days. A listing might not be available because it has been booked by a guest or blocked by the host.
Encoding availability as a numeric value (0–365) makes it easier to analyse booking trends and to use the column in forecasting calculations.
Without the documentation, I might assume a value of 0 always means the listing is fully booked, when it can also mean the host has blocked the calendar, leading to incorrect conclusions about listing availability.
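One rough way to probe what a value of 0 means is to check whether zero-availability listings received any reviews in the last 12 months. The sketch below uses invented values in a hypothetical `toy` data frame, and treating recent reviews as a sign of guest activity is my own assumption, not something the data dictionary states.

```r
# Toy data; values are invented for illustration only.
toy <- data.frame(
  id                    = 1:4,
  availability_365      = c(0, 0, 120, 365),
  number_of_reviews_ltm = c(14, 0, 6, 2)
)

# Listings with no available days in the next year
zero_avail <- toy[toy$availability_365 == 0, ]

# Heuristic (assumption): recent reviews suggest the listing is active and
# booked; no recent reviews suggests it may be blocked or inactive.
table(recently_reviewed = zero_avail$number_of_reviews_ltm > 0)
```

This split is only a heuristic: a listing can be booked without being reviewed, so the result narrows down the possibilities without settling them.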
number_of_reviews_ltm
The column contains numeric values, and the data dictionary explains that number_of_reviews_ltm represents the number of reviews a listing has in the last 12 months.
Using a separate column for reviews in the last 12 months (instead of total reviews) helps analyze recent activity and popularity trends.
Without the documentation, I might assume this column represents the total number of reviews, leading to an incorrect analysis of listing popularity over time.
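Because the last-12-months count is a subset of the total, it should never exceed number_of_reviews, and the ratio of the two gives a rough sense of how recent a listing's activity is. A minimal sketch on invented values (`toy` and `recent_share` are hypothetical names):

```r
# Toy data; values are invented for illustration only.
toy <- data.frame(
  number_of_reviews     = c(120, 8, 0, 45),
  number_of_reviews_ltm = c(10, 8, 0, 0)
)

# Sanity check: recent reviews cannot exceed total reviews.
consistent <- all(toy$number_of_reviews_ltm <= toy$number_of_reviews)

# Share of all reviews written in the last 12 months
# (left as NA when a listing has no reviews at all).
recent_share <- ifelse(
  toy$number_of_reviews > 0,
  toy$number_of_reviews_ltm / toy$number_of_reviews,
  NA_real_
)
recent_share
```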
One element of the data that remains unclear even after reading the documentation is the license column. The documentation states that this field contains the “licence/permit/registration number,” but it does not clarify whether the license follows a specific format across all listings or varies by city or country. Moreover, the documentation does not explain how to interpret a listing that has no value in the license field. In this dataset, the column is empty.
sum(is.na(airbnb$host_name))
## [1] 2
sum(is.na(airbnb$room_type))
## [1] 0
airbnb[is.na(airbnb$host_name), ]
## # A tibble: 2 x 18
## id name host_id host_name neighbourhood_group neighbourhood latitude
## <dbl> <chr> <dbl> <chr> <lgl> <dbl> <dbl>
## 1 4356661 1 BR 1 B~ 2.18e7 <NA> NA 78704 30.3
## 2 8214182 Private ~ 2.16e7 <NA> NA 78741 30.2
## # i 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <lgl>
airbnb[is.na(airbnb$room_type), ]
## # A tibble: 0 x 18
## # i 18 variables: id <dbl>, name <chr>, host_id <dbl>, host_name <chr>,
## # neighbourhood_group <lgl>, neighbourhood <dbl>, latitude <dbl>,
## # longitude <dbl>, room_type <chr>, price <dbl>, minimum_nights <dbl>,
## # number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
## # calculated_host_listings_count <dbl>, availability_365 <dbl>,
## # number_of_reviews_ltm <dbl>, license <lgl>
There are two explicitly missing values in host_name but none in room_type.
Looking through the data, I couldn’t find an implicitly missing row, but if there were one, it would most likely involve neighbourhood. To investigate this, I would need a list of all zip codes in Austin and compare it with the unique zip codes (the neighbourhood values) in my data to determine which zip codes are missing. I could also assume that the zip codes for an area follow a sequential pattern and look for gaps, but that assumption does not hold for every region.
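The comparison described above comes down to a set difference. In this sketch, `reference_zips` is a short placeholder vector, not an authoritative list of Austin zip codes, and `observed_zips` stands in for unique(airbnb$neighbourhood).

```r
# Placeholder reference list (assumption): a real check would use a complete,
# authoritative list of zip codes for the city.
reference_zips <- c(78701, 78702, 78703, 78704, 78705)

# Stand-in for unique(airbnb$neighbourhood); values are invented.
observed_zips <- unique(c(78704, 78701, 78704, 78705))

# Zip codes in the reference list that never appear in the data
missing_zips <- setdiff(reference_zips, observed_zips)
missing_zips
```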
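# Expand to every neighbourhood x room_type pair (unobserved pairs get NA rows)
complete_combinations <- airbnb |>
  complete(neighbourhood, room_type)
# Keep only the pairs that never occur in the original data
missing_combinations <- complete_combinations |>
  anti_join(airbnb, by = c("neighbourhood", "room_type"))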
missing_combinations
## # A tibble: 56 x 18
## neighbourhood room_type id name host_id host_name neighbourhood_group
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <lgl>
## 1 78703 Hotel room NA <NA> NA <NA> NA
## 2 78712 Entire home/~ NA <NA> NA <NA> NA
## 3 78712 Hotel room NA <NA> NA <NA> NA
## 4 78712 Shared room NA <NA> NA <NA> NA
## 5 78717 Hotel room NA <NA> NA <NA> NA
## 6 78717 Shared room NA <NA> NA <NA> NA
## 7 78719 Hotel room NA <NA> NA <NA> NA
## 8 78719 Shared room NA <NA> NA <NA> NA
## 9 78721 Hotel room NA <NA> NA <NA> NA
## 10 78721 Shared room NA <NA> NA <NA> NA
## # i 46 more rows
## # i 11 more variables: latitude <dbl>, longitude <dbl>, price <dbl>,
## # minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## # reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## # availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <lgl>
These are the room types that have no listings in particular neighbourhoods. In some neighbourhoods there may be little to no demand for a particular room type; in tourist-heavy or business districts, strong demand for specific room types may make the others less common.
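The same idea can be cross-checked with a plain contingency table: any cell with a count of zero is a neighbourhood and room type pair that never occurs. This is a base-R sketch on invented rows (`toy`, `counts`, and `missing_pairs` are hypothetical names), offered as an alternative view and not as a replacement for the complete() and anti_join() approach above.

```r
# Toy data; rows are invented for illustration only.
toy <- data.frame(
  neighbourhood = c(78703, 78703, 78712, 78704, 78704, 78704),
  room_type     = c("Entire home/apt", "Private room", "Private room",
                    "Entire home/apt", "Private room", "Hotel room")
)

# Cross-tabulate; unobserved pairs show up as zero counts.
counts <- table(toy$neighbourhood, toy$room_type)

# Convert to long form and keep only the zero-count pairs.
missing_pairs <- as.data.frame(counts, stringsAsFactors = FALSE)
names(missing_pairs) <- c("neighbourhood", "room_type", "n")
missing_pairs <- missing_pairs[missing_pairs$n == 0,
                               c("neighbourhood", "room_type")]
missing_pairs
```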
neighbourhood_counts <- airbnb |>
  group_by(neighbourhood) |>
  summarise(listing_count = n()) |>
  arrange(desc(listing_count))
top_neighbourhood <- neighbourhood_counts |>
  slice(1) |>
  pull(neighbourhood)
top_neighbourhood
## [1] 78704
Investigating the data as a whole for outliers might not be ideal, because a value that looks like an outlier in the overall data might be normal for a particular neighbourhood. For example, a luxury neighbourhood might have a small number of high-priced listings that appear as outliers only when the city is examined as a whole. So I decided to investigate the neighbourhood with the most listings, since a high number of listings could indicate a buyer’s market, where supply exceeds demand. This might lead to lower prices or more room for negotiation.
top_neighbourhood_data <- airbnb |>
  filter(neighbourhood == top_neighbourhood)
ggplot(top_neighbourhood_data, aes(x = room_type, y = price)) +
  geom_boxplot(outlier.color = "darkred", outlier.shape = 16, outlier.size = 2) +
  labs(title = paste("Price Distribution by Room Type in", top_neighbourhood, "neighbourhood"),
       x = "Room Type",
       y = "Price") +
  theme_minimal()
## Warning: Removed 667 rows containing non-finite values (stat_boxplot).
Outliers in this data are listings whose prices are significantly higher or lower than most prices for the same room type in the neighbourhood. They can indicate unusual listings, data errors, or special cases. For example, a luxury villa might have a very high price, and a long-term rental with an unusually high minimum_nights requirement (e.g., 365 nights) might also carry a very high listed price.
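To list the outlying rows instead of only seeing them as points, the conventional 1.5 × IQR rule can be applied within each room type. This is a sketch on invented prices; `toy` and `is_outlier` are hypothetical names, and the quartiles from quantile() may differ slightly from the hinges a boxplot draws, so treat the flags as approximate. Missing prices are never flagged, in line with the plot dropping non-finite values.

```r
# Toy data; prices are invented for illustration only.
toy <- data.frame(
  room_type = rep(c("Entire home/apt", "Private room"), each = 6),
  price     = c(100, 120, 130, 140, 150, 900,
                40, 45, 50, 55, 60, NA)
)

# Flag values more than 1.5 * IQR beyond the quartiles; NA is never flagged.
is_outlier <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  !is.na(x) & (x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
}

# Apply the rule separately within each room type.
# ave() returns numeric 0/1 here, so compare with 1 to get a logical column.
toy$outlier <- ave(toy$price, toy$room_type, FUN = is_outlier) == 1

toy[toy$outlier, ]
```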