Loading Data

airbnb <- read_delim("./airbnb_austin.csv", delim = ",")
## Rows: 15244 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (3): name, host_name, room_type
## dbl  (12): id, host_id, neighbourhood, latitude, longitude, price, minimum_n...
## lgl   (2): neighbourhood_group, license
## date  (1): last_review
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
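
The message above suggests specifying the column types. A minimal sketch of what that could look like follows, using a small made-up CSV string in place of airbnb_austin.csv. Reading neighbourhood as character is my own assumption (zip codes act as labels rather than quantities); it is not something the data dictionary requires.

```r
library(readr)

# Toy CSV text standing in for airbnb_austin.csv (values are made up).
csv_text <- I("id,name,neighbourhood,price,last_review
1,Loft,78704,120,2023-01-05
2,Room,78741,,2022-11-30
")

# Supplying col_types avoids type guessing and the column specification message.
toy <- read_delim(
  csv_text,
  delim = ",",
  col_types = cols(
    id            = col_double(),
    name          = col_character(),
    neighbourhood = col_character(),  # assumption: zip codes are labels, not numbers
    price         = col_double(),
    last_review   = col_date()
  )
)
```

With explicit types, a column that happens to be entirely empty would also keep its declared type instead of being guessed as logical.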

1. Unclear Columns

List of column headers

column_headers <- colnames(airbnb)
print(column_headers)
##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"              
## [17] "number_of_reviews_ltm"          "license"

1. neighbourhood_group

Column details

The column is empty, but according to the data dictionary, neighbourhood_group represents the neighbourhood group as geocoded using latitude and longitude against neighbourhoods defined by open or public digital shapefiles.

Why Encoded This Way?

The neighbourhood_group column is encoded using latitude and longitude matched to public maps to provide a broader regional classification, making it helpful in grouping listings into districts or larger city areas.

What could have happened if I didn’t read the documentation?

Without the documentation, I might assume it works the same way as neighbourhood (which uses zip codes), when it actually relies on geospatial data.

2. availability_365

Column details

The column contains values ranging from 0 to 365. The data dictionary explains that availability_365 represents the days a listing is available for booking in the next 365 days. A listing might not be available because it has been booked by a guest or blocked by the host.

Why Encoded This Way?

Encoding availability as a numeric value (0–365) simplifies the analysis of booking trends and makes forecasting calculations straightforward.

What could have happened if I didn’t read the documentation?

Without the documentation, I might assume a value of 0 always means the listing is fully booked, when the host may simply have blocked those days, leading to incorrect conclusions about listing availability.
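
A quick way to sanity-check this column could look like the sketch below. The values are made up and stand in for airbnb$availability_365; the variable names are my own.

```r
# Toy values standing in for airbnb$availability_365 (made up for illustration).
availability <- c(0, 0, 45, 365, 120, 0, 300)

# Confirm the documented 0 to 365 range holds.
in_range <- all(availability >= 0 & availability <= 365)

# Share of listings with no available days. This number alone cannot say
# whether those days were booked by guests or blocked by the host.
share_zero <- mean(availability == 0)
```

The share of zeros is worth reporting alongside any availability summary, since it mixes two different situations.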

3. number_of_reviews_ltm

Column details

The column contains numeric values and the data dictionary explains that number_of_reviews_ltm represents the number of reviews a listing has in the last 12 months.

Why Encoded This Way?

Using a separate column for reviews in the last 12 months (instead of total reviews) helps analyze recent activity and popularity trends.

What could have happened if I didn’t read the documentation?

Without the documentation, I might assume this column represents the total number of reviews, leading to an incorrect analysis of listing popularity over time.
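
If the two review columns mean what the data dictionary says, the 12-month count should never exceed the total count. The sketch below checks that on a small made-up tibble standing in for airbnb; the fourth row is deliberately inconsistent to show what the check catches.

```r
library(dplyr)

# Toy rows standing in for airbnb (values are made up).
toy <- tibble(
  id = 1:4,
  number_of_reviews = c(120, 8, 0, 35),
  number_of_reviews_ltm = c(10, 8, 0, 40)
)

# Rows where the last-12-months count is larger than the all-time count,
# which would contradict the documented meaning of the two columns.
inconsistent <- toy |>
  filter(number_of_reviews_ltm > number_of_reviews)
```

An empty result would support reading number_of_reviews_ltm as a subset of number_of_reviews.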

2. Unclear Column even after documentation

One element of the data that remains unclear even after reading the documentation is the license column. The documentation states that this field contains the “licence/permit/registration number,” but it does not clarify whether the license follows a specific format across all listings or varies by city or country. Moreover, the documentation doesn’t explain what it means when a listing has no value in the license field.

3. Visualization

The license column is empty in this dataset, so there is nothing to visualize.
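
One way to confirm which columns are entirely empty before attempting a plot is sketched below. It uses a made-up data frame in place of airbnb and only base R.

```r
# Toy data frame standing in for airbnb (values are made up).
toy <- data.frame(
  neighbourhood_group = c(NA, NA, NA),
  license = c(NA, NA, NA),
  room_type = c("Private room", "Hotel room", "Shared room")
)

# For each column, test whether every value is NA, then keep the names
# of the columns where that is true.
empty_cols <- names(toy)[vapply(toy, function(col) all(is.na(col)), logical(1))]
```

Run against the real data, I would expect this to return neighbourhood_group and license, the two columns readr parsed as logical.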

4. Two categorical columns

Explicitly missing rows

sum(is.na(airbnb$host_name)) 
## [1] 2
sum(is.na(airbnb$room_type))
## [1] 0
airbnb[is.na(airbnb$host_name), ]
## # A tibble: 2 x 18
##        id name      host_id host_name neighbourhood_group neighbourhood latitude
##     <dbl> <chr>       <dbl> <chr>     <lgl>                       <dbl>    <dbl>
## 1 4356661 1 BR 1 B~  2.18e7 <NA>      NA                          78704     30.3
## 2 8214182 Private ~  2.16e7 <NA>      NA                          78741     30.2
## # i 11 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <lgl>
airbnb[is.na(airbnb$room_type), ]
## # A tibble: 0 x 18
## # i 18 variables: id <dbl>, name <chr>, host_id <dbl>, host_name <chr>,
## #   neighbourhood_group <lgl>, neighbourhood <dbl>, latitude <dbl>,
## #   longitude <dbl>, room_type <chr>, price <dbl>, minimum_nights <dbl>,
## #   number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
## #   calculated_host_listings_count <dbl>, availability_365 <dbl>,
## #   number_of_reviews_ltm <dbl>, license <lgl>

host_name has two explicitly missing values (NA), while room_type has none.

Implicitly missing rows

Looking through the data, I couldn’t find an implicitly missing row, but if one exists, it is most likely a neighbourhood with no listings at all. To investigate this, I would need a list of all zip codes in Austin and compare it with the unique zip codes (the neighbourhood column) in my data to determine which ones are absent. I could also assume that the zip codes for an area follow a consecutive sequence and look for gaps, but that assumption does not hold for every region.
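
The comparison described above could be sketched as follows. The reference vector is hypothetical and deliberately short; a real check would need the full list of Austin zip codes from an external source.

```r
# Hypothetical reference list (not the real set of Austin zip codes).
reference_zips <- c(78701, 78702, 78703, 78704, 78705)

# Toy stand-in for unique(airbnb$neighbourhood).
observed_zips <- unique(c(78704, 78701, 78704, 78705))

# Zip codes in the reference list that never appear in the data:
# these would be the implicitly missing neighbourhoods.
missing_zips <- setdiff(reference_zips, observed_zips)
```

In this toy case missing_zips contains 78702 and 78703, the two reference codes with no listings.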

Empty Group

complete_combinations <- airbnb |>
  complete(neighbourhood, room_type)

missing_combinations <- complete_combinations |>
  anti_join(airbnb, by = c("neighbourhood", "room_type"))

missing_combinations
## # A tibble: 56 x 18
##    neighbourhood room_type        id name  host_id host_name neighbourhood_group
##            <dbl> <chr>         <dbl> <chr>   <dbl> <chr>     <lgl>              
##  1         78703 Hotel room       NA <NA>       NA <NA>      NA                 
##  2         78712 Entire home/~    NA <NA>       NA <NA>      NA                 
##  3         78712 Hotel room       NA <NA>       NA <NA>      NA                 
##  4         78712 Shared room      NA <NA>       NA <NA>      NA                 
##  5         78717 Hotel room       NA <NA>       NA <NA>      NA                 
##  6         78717 Shared room      NA <NA>       NA <NA>      NA                 
##  7         78719 Hotel room       NA <NA>       NA <NA>      NA                 
##  8         78719 Shared room      NA <NA>       NA <NA>      NA                 
##  9         78721 Hotel room       NA <NA>       NA <NA>      NA                 
## 10         78721 Shared room      NA <NA>       NA <NA>      NA                 
## # i 46 more rows
## # i 11 more variables: latitude <dbl>, longitude <dbl>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <lgl>

The table lists 56 combinations of neighbourhood and room type that have no listings. In some neighbourhoods there may be little to no demand for particular room types; in tourist-heavy or business districts, high demand for specific room types may make the others less common.
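
An alternative view of the same empty groups is to count listings per combination and fill the absent ones with zero, which makes it easy to see which room types are most often missing. This is only a sketch on made-up rows standing in for airbnb; the column name neighbourhoods_without is my own.

```r
library(dplyr)
library(tidyr)

# Toy rows standing in for airbnb (values are made up).
toy <- tibble(
  neighbourhood = c(78703, 78703, 78712, 78704, 78704, 78704),
  room_type = c("Entire home/apt", "Private room", "Private room",
                "Entire home/apt", "Private room", "Hotel room")
)

# Count listings per observed combination, then add the unobserved
# combinations with an explicit count of zero.
combo_counts <- toy |>
  count(neighbourhood, room_type) |>
  complete(neighbourhood, room_type, fill = list(n = 0L))

# For each room type, how many neighbourhoods have no listing of that type.
empty_by_room_type <- combo_counts |>
  filter(n == 0) |>
  count(room_type, name = "neighbourhoods_without")
```

Keeping the zero counts in one table avoids the anti_join step and keeps the empty groups visible in later summaries.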

5. Outliers

Neighbourhood with the most listings:

neighbourhood_counts <- airbnb |>
  group_by(neighbourhood) |>
  summarise(listing_count = n()) |>
  arrange(desc(listing_count))

top_neighbourhood <- neighbourhood_counts |>
  slice(1) |>
  pull(neighbourhood)

top_neighbourhood
## [1] 78704

Investigating the data as a whole for outliers might not be ideal, because a value that looks like an outlier in the overall data might be normal for a particular neighbourhood. For example, a luxury area might have only a few listings, all with high prices, and these would appear as outliers when the city is examined as a whole. So, I decided to investigate the neighbourhood with the most listings (78704), because a high number of listings could indicate a buyer’s market, where there is more supply than demand. This might lead to lower prices or more room for negotiation.

Boxplot

top_neighbourhood_data <- airbnb |>
  filter(neighbourhood == top_neighbourhood)

ggplot(top_neighbourhood_data, aes(x = room_type, y = price)) +
  geom_boxplot(outlier.color = "darkred", outlier.shape = 16, outlier.size = 2) +
  labs(title = paste("Price Distribution by Room Type in", top_neighbourhood, "neighbourhood"),
       x = "Room Type",
       y = "Price") +
  theme_minimal()
## Warning: Removed 667 rows containing non-finite values (stat_boxplot).

Outliers in this plot are listings whose prices are far higher or lower than most prices for the same room type in the neighbourhood. They can indicate unusual listings, data errors, or special cases. For example, a luxury villa might have a very high price, and a long-term rental with an unusually high minimum_nights requirement (e.g., 365 nights) might also carry a very high listed price. Note that, as the warning above reports, the plot excludes 667 listings in this neighbourhood that have no finite price value.
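
To list the outlying listings rather than only see them as points, the conventional 1.5 × IQR boxplot rule can be applied per room type. The sketch below does this on made-up prices standing in for top_neighbourhood_data; the helper columns q1, q3 and is_outlier are my own names.

```r
library(dplyr)

# Toy rows standing in for top_neighbourhood_data (values are made up).
toy <- tibble(
  room_type = c(rep("Entire home/apt", 6), rep("Private room", 5)),
  price = c(100, 120, 130, 150, 160, 900, 40, 50, 55, 60, NA)
)

# Rows with a missing price are the ones a boxplot would silently drop.
n_missing_price <- sum(is.na(toy$price))

# Flag prices beyond 1.5 * IQR from the quartiles, within each room type.
flagged <- toy |>
  filter(!is.na(price)) |>
  group_by(room_type) |>
  mutate(
    q1 = quantile(price, 0.25, names = FALSE),
    q3 = quantile(price, 0.75, names = FALSE),
    is_outlier = price < q1 - 1.5 * (q3 - q1) | price > q3 + 1.5 * (q3 - q1)
  ) |>
  ungroup()

outliers <- filter(flagged, is_outlier)
```

Grouping by room type before computing the quartiles matters here: the 900 listing is extreme for its own room type, and the fences for one room type do not affect the other.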