# Load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(tsibble)
## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union
library(fable)
## Warning: package 'fable' was built under R version 4.4.2
## Loading required package: fabletools
## Warning: package 'fabletools' was built under R version 4.4.2
library(ggplot2)

# Load the dataset

data <- read_csv("AB_NYC_2019.csv")
## Rows: 48895 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): name, host_name, neighbourhood_group, neighbourhood, room_type
## dbl  (10): id, host_id, latitude, longitude, price, minimum_nights, number_o...
## date  (1): last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Step 1: Convert last_review column to Date

data <- data %>%
  mutate(last_date = as.Date(last_review, format = "%Y-%m-%d"))

Insight: read_csv() already parsed last_review as a date (see the column specification above), so as.Date() leaves the values unchanged and simply copies them into a new last_date column. With a proper Date column in place, the data can be ordered and explored as a time series.

Significance: Proper management of dates is essential for time series analysis, as it guarantees the correct ordering and interpretation of temporal patterns. The capability to handle dates as Date objects allows for the utilization of specialized time series functions, such as forecasting and trend detection.

Further Investigation: Are there any missing or improperly formatted date entries that could impact the analysis? Additional cleaning or validation may be required before proceeding with the analysis.
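One way to answer the question above is to separate values that were missing in the source from values that were present but failed to parse. The sketch below uses a small hypothetical vector of raw strings in place of last_review (the real column is already a date after read_csv(), so this applies mainly when the file is read as text).

```r
# Hypothetical raw values standing in for last_review before parsing
raw_dates <- c("2019-05-21", "2018-11-03", NA, "21/05/2019", "2019-13-01")

# as.Date() with an explicit format returns NA for anything it cannot parse
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

date_check <- data.frame(
  raw               = raw_dates,
  parsed            = parsed,
  missing_in_source = is.na(raw_dates),
  failed_to_parse   = !is.na(raw_dates) & is.na(parsed)
)

print(date_check)
print(range(parsed, na.rm = TRUE))  # quick sanity check on the date range
```

Rows flagged failed_to_parse point to formatting problems, while missing_in_source rows had no date to begin with.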

Step 2: Filter Data

data_time <- data %>%
  filter(!is.na(last_date)) %>%
  select(last_date, price) %>%
  filter(price > 0)  # Remove rows with invalid prices

Insight: By removing rows with missing dates and invalid prices, the dataset is streamlined and prepared for analysis. It ensures that only relevant and valid data is taken into account.

Significance: Invalid or missing data can skew any time series analysis, resulting in inaccurate conclusions. Zero or negative prices are invalid entries rather than genuine outliers, and removing them keeps the analysis focused on real listings with a meaningful price.

Further Investigation: Are there additional anomalies or outliers in the price data that require attention? Investigating further filtering techniques may help confirm that the data is thoroughly cleansed.
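As a possible starting point for that investigation, the sketch below flags unusually high prices with the common 1.5 × IQR rule. The price vector is made up for illustration, and the 1.5 multiplier is a convention rather than something taken from this analysis.

```r
# Hypothetical prices; the last value mimics an extreme listing
prices <- c(45, 60, 75, 80, 95, 110, 120, 150, 200, 4500)

# Quartiles and the upper fence at Q3 + 1.5 * IQR
q <- quantile(prices, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
upper_fence <- q[[2]] + 1.5 * iqr

# Flag rather than delete, so the extreme values can be inspected first
flagged <- prices[prices > upper_fence]
print(upper_fence)
print(flagged)
```

Flagging before filtering makes it easier to decide whether the extreme values are luxury listings or entry errors.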

Step 3: Check for Duplicates

duplicates <- duplicates(data_time, index = last_date)

Insight: This step determines the presence of duplicate rows with identical last_date, which, if unaddressed, may distort the analysis.

Significance: Duplicates in time series data can produce misleading results, such as artificially inflated trends or erroneous seasonal patterns. Identifying and addressing duplicates ensures that the time series is accurately structured for dependable analysis.

Further Investigation: After managing duplicates, do other features (such as price) reveal patterns of duplication? Are there any systematic issues with data entry or collection that lead to duplicates?
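To see how widespread repeated dates are, it can help to count rows per date. This is a minimal base R sketch on a hypothetical data frame named listings; it complements, and does not replace, the duplicates() call above.

```r
# Hypothetical listings with several rows sharing the same last_date
listings <- data.frame(
  last_date = as.Date(c("2019-06-01", "2019-06-01", "2019-06-02",
                        "2019-06-03", "2019-06-03", "2019-06-03")),
  price = c(100, 150, 80, 60, 90, 120)
)

# Number of rows per date, then keep only dates that occur more than once
n_per_date <- table(listings$last_date)
repeated <- n_per_date[n_per_date > 1]
print(repeated)
```

In listing data, many rows per date are expected, since different listings can share a last-review date; they are not necessarily entry errors.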

Step 4: Aggregate Data by Date

data_time <- data_time %>%
  group_by(last_date) %>%
  summarize(price = mean(price, na.rm = TRUE), .groups = "drop")

Insight: Aggregating by date enables the calculation of the daily mean price, effectively eliminating any potential duplicates. This step enhances the data’s smoothness, rendering it appropriate for trend analysis.

Significance: Aggregation is crucial for time series data when multiple observations share the same time point. Averaging yields a single, consistent price value for each date, although the mean is sensitive to extreme prices, so a few very expensive listings can dominate a date with few observations.

Further Investigation: Are there alternative aggregation methods (e.g., median, mode, or weighted average) that could potentially provide a more accurate representation of the data based on the price distribution?
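To compare aggregation choices, the mean and median can be computed side by side. The sketch below uses a hypothetical listings data frame in which one date contains an extreme price.

```r
# Hypothetical listings; 2019-06-01 includes one extreme price
listings <- data.frame(
  last_date = as.Date(c("2019-06-01", "2019-06-01", "2019-06-01",
                        "2019-06-01", "2019-06-02", "2019-06-02")),
  price = c(80, 100, 120, 5000, 90, 110)
)

# Aggregate the same data two ways
daily_mean   <- tapply(listings$price, listings$last_date, mean)
daily_median <- tapply(listings$price, listings$last_date, median)

comparison <- data.frame(
  last_date    = as.Date(names(daily_mean)),
  mean_price   = as.numeric(daily_mean),
  median_price = as.numeric(daily_median)
)
print(comparison)
```

On the first date the mean is pulled far above the typical listing, while the median stays close to it.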

Step 5: Convert to tsibble

ts_timedata <- data_time %>%
  as_tsibble(index = last_date)

Insight: The data_time has been converted into a tsibble, a structure specifically designed for time series analysis in R. This facilitates convenient manipulation and access to specialized time series functions.

Significance: The tsibble structure guarantees that the data is ordered by the time index and enables efficient time-based operations. This step is crucial for performing advanced time series analysis such as forecasting, seasonal decomposition, and anomaly detection.

Further Investigation: Are there additional variables or features that should be incorporated as a key (e.g., room_type, location)? Would a multi-dimensional time series (e.g., using room type as a key) provide more insights?
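A keyed series could be sketched as follows. The data frame is hypothetical, the aggregation uses base R, and the as_tsibble() call is guarded so the example still runs when tsibble is not installed.

```r
# Hypothetical listings with a room_type column
listings <- data.frame(
  last_date = as.Date(c("2019-06-01", "2019-06-01", "2019-06-01",
                        "2019-06-02", "2019-06-02")),
  room_type = c("Private room", "Private room", "Entire home/apt",
                "Entire home/apt", "Private room"),
  price = c(60, 80, 200, 180, 70)
)

# One mean price per (date, room type) pair, so each key has unique dates
keyed <- aggregate(price ~ last_date + room_type, data = listings, FUN = mean)
print(keyed)

# Build a tsibble with room_type as the key, if the package is available
if (requireNamespace("tsibble", quietly = TRUE)) {
  ts_by_room <- tsibble::as_tsibble(keyed, key = room_type, index = last_date)
  print(ts_by_room)
}
```

Each room type then forms its own series, which allows trends to be compared across submarkets.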

Step 6: Plot the Time Series

ggplot(ts_timedata, aes(x = last_date, y = price)) +
  geom_line(color = "blue") +
  labs(
    title = "Airbnb Prices Over Time",
    x = "Date",
    y = "Average Price"
  ) +
  theme_minimal()

Insight: This time series plot shows the daily average price on the vertical axis against the date on the horizontal axis. Each point is the mean listed price of the listings whose most recent review fell on that date, so the line reflects how prices vary with last-review date rather than a booking-by-booking price history. Periods of increase, stability, or decrease can be read off the line, and sharp spikes or drops stand out clearly.

Observations:

  1. Price Spikes:
  1. General Trend:
  1. Possible Seasonality:

Significance: Visualizing the data allows us to comprehend overall trends and anomalies. The plot can highlight key elements such as pricing strategies, seasonal demand, or economic events affecting price behavior.

Further Investigation: Are there certain times when price fluctuations are especially pronounced? Could these trends be associated with external factors like holidays, local events, or economic conditions?
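A simple first check for calendar effects is to average prices by month of the year. The daily data frame below is hypothetical and stands in for the aggregated daily series.

```r
# Hypothetical daily averages spanning two years
daily <- data.frame(
  last_date = as.Date(c("2018-07-04", "2018-12-24", "2019-01-15",
                        "2019-07-10", "2019-12-30")),
  price = c(180, 220, 120, 200, 260)
)

# Extract the calendar month and average across years
daily$month <- format(daily$last_date, "%m")
by_month <- aggregate(price ~ month, data = daily, FUN = mean)
print(by_month)
```

Months that stand out consistently across years would be candidates for holiday or peak-season effects.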

Step 7: Perform Linear Regression

ggplot(data_time, aes(x = last_date, y = price)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, color = "red") +
  labs(
    title = "Linear Trend of Airbnb Prices Over Time",
    x = "Date",
    y = "Average Price"
  ) +
  theme_minimal()

In the visualization depicting the linear trend of Airbnb prices over time, the red line signifies the linear regression trend line fitted to the data points. Upon reviewing the trend line:

Slope of the Trend Line:

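The slope of the red line appears nearly flat (close to zero), which suggests no pronounced overall increase or decrease in average Airbnb prices during the observed time period. This is a visual reading only; confirming it would require inspecting the fitted coefficient and its uncertainty.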

Interpretation:

The near-zero slope suggests that, on average, Airbnb prices have remained relatively stable over time, without a strong upward or downward trend. Although individual data points exhibit noticeable spikes and variability, these do not substantially impact the general pricing trend over the years.

Significance:

A stable trend might imply that pricing strategies or market conditions have been consistent over time. The absence of a strong trend could also suggest that any observed fluctuations are more likely attributed to outliers, special events, or short-term variations rather than long-term trends.

Further Investigation:

Examine the price spikes visible in the scatter plot, which may reflect anomalies, special events, or data errors. Look for seasonal or cyclical patterns that a straight-line model cannot capture; for instance, prices may vary during holidays or peak seasons. Consider splitting the data by geographic area or property type to discern trends in specific submarkets.
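The visual impression of a flat line can be checked numerically by fitting the same model with lm(). This sketch uses a small hypothetical daily series; on the real data, the price and last_date columns of data_time would take its place.

```r
# Hypothetical daily averages over ten consecutive days
daily <- data.frame(
  last_date = seq(as.Date("2019-01-01"), by = "day", length.out = 10),
  price = c(150, 148, 153, 149, 151, 152, 147, 150, 154, 149)
)

# Dates are converted to days since 1970-01-01, so the slope is in price per day
fit <- lm(price ~ as.numeric(last_date), data = daily)
slope_per_day <- coef(fit)[[2]]

print(slope_per_day)
print(summary(fit)$coefficients)  # estimate, standard error, t value, p-value
print(confint(fit))               # 95% confidence intervals
```

A confidence interval for the slope that comfortably includes zero would support the reading of a stable trend.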

Step 8: Apply Smoothing

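data_time <- data_time %>%
  arrange(last_date) %>%
  # Centered moving average over 30 consecutive rows (dates with data);
  # as.numeric() drops the ts class that stats::filter() returns
  mutate(smoothed_price = as.numeric(stats::filter(price, rep(1/30, 30), sides = 2)))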

# Remove rows with NA values in the smoothed data
data_time <- data_time %>%
  filter(!is.na(smoothed_price))

# Plot the smoothed time series
ggplot(data_time, aes(x = last_date)) +
  geom_line(aes(y = price), alpha = 0.4, color = "blue") +
  geom_line(aes(y = smoothed_price), color = "red", linewidth = 1) +
  labs(
    title = "Smoothed Airbnb Price Trends Over Time",
    x = "Date",
    y = "Average Price"
  ) +
  theme_minimal()

Insight: Smoothed Price Trends

Key Observations: The blue line shows the raw daily average prices of Airbnb listings, whereas the red line shows a centered moving average over 30 consecutive observations. If some calendar dates are absent from the data, this window spans more than 30 days. Several outliers are evident in the raw data, where prices surge to very high values (e.g., above $4000). These may indicate either luxury listings or data entry errors. The smoothed trend (red line) remains fairly stable with minor fluctuations over time, suggesting that while individual spikes occur, the overall price level does not exhibit a pronounced long-term increase or decrease.

Significance: The moving average offers a clearer view of trends by minimizing noise from extreme outliers or temporary price fluctuations. The absence of a notable upward or downward trend in the smoothed line indicates that average Airbnb prices have remained relatively stable during the observed period, with no consistent inflation or deflation in listing costs.

Further Questions: What triggered the extreme price spikes? Could these be linked to special events, data quality issues, or a specific category of luxury listings?

Does the stability in average prices persist across different locations? The dataset covers New York City only, so price trends may still differ considerably between boroughs (neighbourhood_group) or individual neighbourhoods.

How do external factors such as demand (e.g., peak tourist seasons) or supply (new Airbnb listings) influence these trends?
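To make the smoothing step concrete, the sketch below applies the same stats::filter() approach to a short hypothetical series with a 3-point window, and compares it with a running median from stats::runmed(), which is less affected by a single spike. The window size and data are illustrative only.

```r
# Hypothetical daily averages with one extreme spike
price <- c(100, 120, 110, 400, 115, 105, 125)

# Centered 3-point moving average; the ends are NA because the window is incomplete
moving_avg <- as.numeric(stats::filter(price, rep(1/3, 3), sides = 2))

# Running median over the same window size, as a more robust alternative
rolling_median <- as.numeric(stats::runmed(price, k = 3))

print(data.frame(price, moving_avg, rolling_median))
```

The moving average spreads the spike over the neighbouring points, whereas the running median largely ignores it.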

Step 9: Seasonality Analysis using ACF and PACF

acf(data_time$price, main = "ACF of Airbnb Prices")

pacf(data_time$price, main = "PACF of Airbnb Prices")

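The Autocorrelation Function (ACF) plot for Airbnb prices shows the correlation of the series with its lagged values. Because acf() is applied to the price column row by row, each lag is one observation (one date with data) rather than necessarily one calendar day. Here’s a summary of the plot: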

  1. Key Observations:
  1. Interpretation:
  1. Significance of Blue Lines (Confidence Bounds):
  1. Implications:
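The blue dashed lines that acf() draws by default are approximate 95% bounds at ±qnorm(0.975)/sqrt(n). The sketch below reproduces them numerically for a hypothetical alternating series, so that individual lags can be compared against the bound without reading the plot.

```r
# Hypothetical series that alternates between two price levels
price <- rep(c(100, 150), times = 6)

# Compute the autocorrelations without drawing the plot
acf_out <- acf(price, lag.max = 5, plot = FALSE)

# Approximate 95% bound used for the dashed lines in the default plot
bound <- qnorm(0.975) / sqrt(acf_out$n.used)

lags <- drop(acf_out$lag)[-1]   # drop lag 0, which is always 1
vals <- drop(acf_out$acf)[-1]
outside_bounds <- abs(vals) > bound

print(data.frame(lag = lags, acf = vals, outside_bounds))
```

Lags whose autocorrelation falls outside the bound are the ones that would extend past the blue lines in the plot.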

The Partial Autocorrelation Function (PACF) plot for Airbnb prices indicates the level of correlation between the current value and its lagged values, after controlling for correlations at intervening lags.

Key Observations:

  1. Initial Lags:
  1. Later Lags:
  1. Confidence Bounds (Blue Lines):

Interpretation:

Implications for Modeling:

  1. Stationarity:
  1. Model Selection:
  1. Next Steps:
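One possible next step, offered here as an assumption rather than part of the original analysis, is to place the series on a regular daily grid before computing autocorrelations, so that each lag corresponds to one calendar day. The sketch below does this in base R on a hypothetical series with two missing dates.

```r
# Hypothetical daily averages with 2019-06-04 and 2019-06-08 missing
daily <- data.frame(
  last_date = as.Date(c("2019-06-01", "2019-06-02", "2019-06-03", "2019-06-05",
                        "2019-06-06", "2019-06-07", "2019-06-09", "2019-06-10")),
  price = c(100, 110, 105, 120, 115, 125, 130, 128)
)

# Full calendar grid from the first to the last observed date
full_grid <- data.frame(
  last_date = seq(min(daily$last_date), max(daily$last_date), by = "day")
)

# Left join: dates without data get NA prices instead of being skipped
regular <- merge(full_grid, daily, by = "last_date", all.x = TRUE)
n_gaps <- sum(is.na(regular$price))
print(n_gaps)

# na.pass lets acf() work around the missing values
acf_out <- acf(regular$price, lag.max = 3, na.action = na.pass, plot = FALSE)
print(acf_out)
```

With the gaps made explicit, a lag of 7 would mean one week, which makes weekly or yearly seasonality easier to interpret.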