library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim("./AB_NYC_2019.csv", delim = ",")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(data)
## spc_tbl_ [48,895 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                            : num [1:48895] 2539 2595 3647 3831 5022 ...
##  $ name                          : chr [1:48895] "Clean & quiet apt home by the park" "Skylit Midtown Castle" "THE VILLAGE OF HARLEM....NEW YORK !" "Cozy Entire Floor of Brownstone" ...
##  $ host_id                       : num [1:48895] 2787 2845 4632 4869 7192 ...
##  $ host_name                     : chr [1:48895] "John" "Jennifer" "Elisabeth" "LisaRoxanne" ...
##  $ neighbourhood_group           : chr [1:48895] "Brooklyn" "Manhattan" "Manhattan" "Brooklyn" ...
##  $ neighbourhood                 : chr [1:48895] "Kensington" "Midtown" "Harlem" "Clinton Hill" ...
##  $ latitude                      : num [1:48895] 40.6 40.8 40.8 40.7 40.8 ...
##  $ longitude                     : num [1:48895] -74 -74 -73.9 -74 -73.9 ...
##  $ room_type                     : chr [1:48895] "Private room" "Entire home/apt" "Private room" "Entire home/apt" ...
##  $ price                         : num [1:48895] 149 225 150 89 80 200 60 79 79 150 ...
##  $ minimum_nights                : num [1:48895] 1 1 3 1 10 3 45 2 2 1 ...
##  $ number_of_reviews             : num [1:48895] 9 45 0 270 9 74 49 430 118 160 ...
##  $ last_review                   : Date[1:48895], format: "2018-10-19" "2019-05-21" ...
##  $ reviews_per_month             : num [1:48895] 0.21 0.38 NA 4.64 0.1 0.59 0.4 3.47 0.99 1.33 ...
##  $ calculated_host_listings_count: num [1:48895] 6 2 1 1 1 1 1 1 1 4 ...
##  $ availability_365              : num [1:48895] 365 355 365 194 0 129 0 220 0 188 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   name = col_character(),
##   ..   host_id = col_double(),
##   ..   host_name = col_character(),
##   ..   neighbourhood_group = col_character(),
##   ..   neighbourhood = col_character(),
##   ..   latitude = col_double(),
##   ..   longitude = col_double(),
##   ..   room_type = col_character(),
##   ..   price = col_double(),
##   ..   minimum_nights = col_double(),
##   ..   number_of_reviews = col_double(),
##   ..   last_review = col_date(format = ""),
##   ..   reviews_per_month = col_double(),
##   ..   calculated_host_listings_count = col_double(),
##   ..   availability_365 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(data)
## # A tibble: 6 × 16
##      id name        host_id host_name neighbourhood_group neighbourhood latitude
##   <dbl> <chr>         <dbl> <chr>     <chr>               <chr>            <dbl>
## 1  2539 Clean & qu…    2787 John      Brooklyn            Kensington        40.6
## 2  2595 Skylit Mid…    2845 Jennifer  Manhattan           Midtown           40.8
## 3  3647 THE VILLAG…    4632 Elisabeth Manhattan           Harlem            40.8
## 4  3831 Cozy Entir…    4869 LisaRoxa… Brooklyn            Clinton Hill      40.7
## 5  5022 Entire Apt…    7192 Laura     Manhattan           East Harlem       40.8
## 6  5099 Large Cozy…    7322 Chris     Manhattan           Murray Hill       40.7
## # ℹ 9 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
## #   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
## #   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
## #   availability_365 <dbl>


1. Unclear Columns in the Dataset

Columns of Interest

neighbourhood_group:

  • Description: This column categorizes the listings into broader geographic areas (e.g., Manhattan, Brooklyn, etc.).

  • The Significance of Encoding: Without reading the documentation, we may miss the fact that this column groups multiple neighborhoods into a single area.

  • Potential Consequences of Ignorance: If we didn’t check the documentation, we might treat neighborhoods that belong to the same group as separate markets, which could result in incorrect market segmentation or pricing analysis.
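
The nesting of neighbourhood inside neighbourhood_group can be checked directly. The sketch below is a minimal base R illustration, not the definitive check: the `listings` data frame is a stand-in built from the six rows printed by head(data), and the object names (`listings`, `pairs`, `ambiguous`) are assumptions introduced here. On the full dataset the same steps would be applied to `data`.

```r
# Stand-in rows copied from the head(data) output above
listings <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan",
                          "Brooklyn", "Manhattan", "Manhattan"),
  neighbourhood       = c("Kensington", "Midtown", "Harlem",
                          "Clinton Hill", "East Harlem", "Murray Hill"),
  stringsAsFactors = FALSE
)

# Distinct (neighbourhood, neighbourhood_group) pairs
pairs <- unique(listings[, c("neighbourhood", "neighbourhood_group")])

# How many groups each neighbourhood appears under
groups_per_neighbourhood <- table(pairs$neighbourhood)

# A neighbourhood listed under more than one group would be a labelling problem
ambiguous <- names(groups_per_neighbourhood)[groups_per_neighbourhood > 1]

# How many neighbourhoods each group contains
neighbourhoods_per_group <- table(pairs$neighbourhood_group)
print(neighbourhoods_per_group)
```

An empty `ambiguous` vector would suggest that every neighbourhood maps to exactly one group, which is the assumption a group-level analysis relies on.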

room_type:

  • Description: Specifies the type of room offered: an entire home or apartment, a private room, or a shared room.

  • The Significance of Encoding: Different room types can significantly affect pricing, occupancy, and demand. If this column is misunderstood, it may lead to incorrect conclusions about the profitability of particular room types.

  • Potential Consequences of Ignorance: If we do not understand the distribution of room types, we might misinterpret how room type affects booking frequency, income potential, and customer preferences.
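
Before comparing prices across room types, it helps to look at how the listings are split among them. The following is a rough base R sketch: `listings` is a stand-in built from the first four rows of the str(data) output, so the numbers only illustrate the mechanics, and the object names are assumptions introduced here.

```r
# Stand-in rows copied from the first four listings shown by str(data)
listings <- data.frame(
  room_type = c("Private room", "Entire home/apt",
                "Private room", "Entire home/apt"),
  price     = c(149, 225, 150, 89),
  stringsAsFactors = FALSE
)

# Count and share of listings per room type
counts <- table(listings$room_type)
shares <- prop.table(counts)

# Median price per room type (the median is less sensitive to extreme prices)
median_price <- tapply(listings$price, listings$room_type, median)

print(shares)
print(median_price)
```

With only four rows the medians say nothing about the market; the point is that price comparisons should be read next to the share of listings in each room type.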

availability_365:

  • Description: The number of days per year that the listing is available.

  • The Significance of Encoding: This column indicates availability, but it may include irregular listings (those not consistently available).

  • Potential Consequences of Ignorance: If we don’t account for properties that aren’t regularly available, they may distort our analysis of market demand and availability patterns.
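
One way to see how many listings are irregularly available is to band availability_365. This is only a sketch: the vector holds the ten values printed by str(data), and the band boundaries (0, 1-90, 91-364, 365) are arbitrary cut-offs chosen here for illustration, not thresholds from the documentation.

```r
# First ten availability_365 values as printed by str(data)
availability_365 <- c(365, 355, 365, 194, 0, 129, 0, 220, 0, 188)

# Listings that are never available during the year
never_available <- availability_365 == 0
share_never <- mean(never_available)

# Hypothetical availability bands; the cut-offs are illustrative only
bands <- cut(
  availability_365,
  breaks = c(-Inf, 0, 90, 364, Inf),
  labels = c("0 days", "1-90", "91-364", "365")
)
band_counts <- table(bands)

print(share_never)
print(band_counts)
```

A large "0 days" band on the full data would be a reason to analyze those listings separately rather than alongside regularly available ones.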

2. Unclear Element Despite Documentation

Unclear Element: last_review

The last_review column remains ambiguous even after reading the documentation. It gives the date of the most recent review, but the documentation doesn’t explain why the last_review value is missing for some properties.

Issues:

  • Potential Confusion: Does a missing value in last_review indicate that the listing hasn’t been reviewed yet, is new, or is inactive?

  • Further Investigation Is Required: Listings with a missing last_review value may be in different states. Some might be inactive but still listed, while others might have been added recently and not yet reviewed.

Questions for Further Investigation:

  • What percentage of the listings missing last_review are new, and what percentage are inactive?

  • Is there a pattern among listings with no reviews?
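
A first pattern to test is whether a missing last_review simply means the listing has no reviews at all. The str(data) output hints at this: the third listing has number_of_reviews equal to 0 and an NA in reviews_per_month. The sketch below shows the check in base R on made-up rows; the values in `listings` are hypothetical and only the column names follow the dataset.

```r
# Hypothetical toy rows (made-up values); column names follow the dataset
listings <- data.frame(
  number_of_reviews = c(9, 45, 0, 270, 0),
  last_review = as.Date(c("2018-10-19", "2019-05-21", NA, "2019-07-05", NA))
)

missing_review <- is.na(listings$last_review)
no_reviews     <- listings$number_of_reviews == 0

# Cross-tabulate the two conditions
cross <- table(missing_review, no_reviews)
print(cross)

# TRUE when the missing dates are exactly the listings with zero reviews
same_rows <- identical(missing_review, no_reviews)
print(same_rows)
```

If the two conditions coincide on the full data, the missing dates are structural (there is no review to date) rather than a data-quality gap, although that alone would not distinguish new listings from inactive ones.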

3. Visualization Highlighting the Issue

Visualization: Impact of Missing last_review on Listings

We will create a scatter plot to visualize the relationship between number_of_reviews and last_review, highlighting the listings with missing last_review values.

null_ <- sum(is.na(data$last_review))
null_
## [1] 10052
# Flag listings that have no last_review date
data$review_missing <- is.na(data$last_review)
# Put the placeholder date ("2021-01-01") in a separate plotting column so that
# the original last_review values, including the NAs, stay intact
data$last_review_plot <- replace_na(data$last_review, as.Date("2021-01-01"))

# Plot the data (ggplot2 is already attached by library(tidyverse))
ggplot(data, aes(x = number_of_reviews, y = last_review_plot)) +
  geom_point(aes(color = review_missing), size = 3) +
  # Named values tie each color to its logical level; labels follow FALSE, TRUE order
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red"),
                     labels = c("Reviewed", "Missing Review")) +
  labs(title = "Missing Last Review on NYC Airbnb Listings",
       x = "Number of Reviews",
       y = "Last Review Date",
       color = "Review Status") +
  theme_minimal() +
  # hjust = 0 left-aligns the label at x = 30 so it is not cut off at the axis
  annotate("text", x = 30, y = as.Date("2023-06-01"), hjust = 0,
           label = "Missing reviews may indicate inactive or new listings",
           color = "red", size = 4)

Explanation of the Visualization

  • What’s On Display: The scatter plot displays the relationship between the last_review date and the number of reviews. Listings with missing review dates are shown as red points at the placeholder date (2021-01-01), and listings that have been reviewed are shown as blue points.

  • Highlighting the Problem: Listings missing a last_review date are highlighted in red. These could be either recently added properties or properties that have been inactive for a while and haven’t been reviewed.

Insights and Risks

  • Insight Gathered: The properties with missing last_review dates probably include both recently listed and inactive properties. Without more context, we are unable to determine the exact cause of the missing data.

  • Significance: Misinterpreting listings that are missing reviews may lead to false market assumptions, such as underestimating supply or incorrectly assessing demand in particular neighborhoods.

  • Risks and Mitigation:

  1. Risk of Misleading Conclusions: Listings missing reviews may be overlooked or incorrectly classified, which may skew revenue projections or occupancy rates.

  2. Mitigation: Further investigation should be done to determine the state of each listing (e.g., active or inactive), for example by comparing it to other indicators like availability_365 or by contacting the hosts.
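
The comparison with availability_365 suggested in the mitigation step could look roughly like this. It is a heuristic sketch in base R under stated assumptions: the rows in `listings` are made up, the `status` labels are hypothetical names introduced here, and zero availability is only a proxy for inactivity, not a confirmed definition. On the full dataset, the `review_missing` flag created earlier can take the place of `is.na(last_review)`.

```r
# Hypothetical toy rows (made-up values); column names follow the dataset
listings <- data.frame(
  last_review = as.Date(c("2018-10-19", NA, NA, "2019-07-05", NA)),
  availability_365 = c(365, 0, 120, 194, 0)
)

review_missing <- is.na(listings$last_review)

# Heuristic labels: zero availability is treated as a sign of possible inactivity
listings$status <- ifelse(
  !review_missing, "reviewed",
  ifelse(listings$availability_365 == 0,
         "unreviewed, not bookable",
         "unreviewed, bookable")
)

status_counts <- table(listings$status)
print(status_counts)
```

Unreviewed listings that are still bookable are the more plausible candidates for "new", while unreviewed listings with no availability are the more plausible candidates for "inactive". Neither label is certain without information from the hosts.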

Conclusion


This analysis highlights how crucial it is to carefully review the data and its supporting documentation. The ambiguity surrounding certain fields, especially last_review, poses a risk to the accuracy of our results if it is not handled properly. By identifying and resolving these ambiguities, we can ensure reliable insights and data-driven choices. To make sound decisions in the future, it might be necessary to cross-reference with more data or collect more context for any missing fields.

Questions for Further Investigation:

  • Can more information be provided about the pricing structure, including whether taxes and fees are included?

  • Are listings without reviews genuine, active listings, or were they simply added recently?

  • When properties are compared within different neighborhoods or price ranges, taking into consideration missing data, what trends show up?
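
The last question can be approached by summarizing the missing-review flag within each neighbourhood_group. The base R sketch below uses made-up rows, so the numbers carry no meaning; the object names are assumptions introduced here, and on the full dataset the existing `review_missing` column can be used directly.

```r
# Hypothetical toy rows (made-up values); column names follow the dataset
listings <- data.frame(
  neighbourhood_group = c("Brooklyn", "Manhattan", "Manhattan",
                          "Brooklyn", "Manhattan", "Queens"),
  price = c(149, 225, 150, 89, 80, 60),
  last_review = as.Date(c("2018-10-19", "2019-05-21", NA,
                          "2019-07-05", NA, "2019-06-01")),
  stringsAsFactors = FALSE
)
listings$review_missing <- is.na(listings$last_review)

# Share of listings without a review date in each group
share_missing <- tapply(listings$review_missing,
                        listings$neighbourhood_group, mean)
print(share_missing)

# Median price split by group and by whether the review date is missing
median_price <- aggregate(price ~ neighbourhood_group + review_missing,
                          data = listings, FUN = median)
print(median_price)
```

Large differences in `share_missing` between groups, or in median price between reviewed and unreviewed listings, would indicate that dropping unreviewed listings could bias group-level comparisons.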