Loading necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
# Load the data from the CSV file
NY_House_Dataset <- read.csv("C:\\Users\\velag\\Downloads\\NY-House-Dataset.csv")
Since my dataset does not have any time-based columns, I will instead build a time series from Wikipedia page views. I will first identify a Wikipedia page related to New York housing or real estate, then extract the time series of page views for that page using the Wikipedia page views website or an appropriate R package, and finally analyze the time series data following the provided instructions.
I will proceed with the following steps:
–Identify a Wikipedia page related to New York housing or real estate.
–Extract the time series of page views for that page.
–Analyze the time series data using the provided instructions.
My selected Wikipedia page is “New York City,” and I will extract its time series of page views using the pageviews R package (which queries the Wikimedia Pageviews API).
In this analysis, I will explore the page view time series for the Wikipedia page “New York City.” The goal is to gain insight into the popularity of, and trends related to, this topic over time.
Step 1: Extracting Time Series Data
–Load necessary libraries
–Install and load the devtools package
–Install and load the pageviews package (a brief setup sketch is shown below)
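Both packages only need to be installed once, so the setup sketch below is commented out; note that pageviews is available from CRAN, so devtools is only needed if installing a development version.
# One-time setup (run manually if the packages are not yet installed)
# install.packages("devtools")
# install.packages("pageviews")  # pageviews is on CRAN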
library(pageviews)
## Warning: package 'pageviews' was built under R version 4.3.3
# Set the parameters for fetching pageviews data
article <- "New_York_City" # Wikipedia article name
start_date <- as.Date("2019-01-01") # Start date in YYYY-MM-DD format
end_date <- as.Date("2022-01-01") # End date in YYYY-MM-DD format
# Fetch pageviews data for the specified article
pageviews_data <- article_pageviews(article, project = "en.wikipedia", start = start_date, end = end_date)
# Print the first few rows of the pageviews data
head(pageviews_data)
## project language article access agent granularity date
## 1 wikipedia en New_York_City all-access all-agents daily 2019-01-01
## 2 wikipedia en New_York_City all-access all-agents daily 2019-01-02
## 3 wikipedia en New_York_City all-access all-agents daily 2019-01-03
## 4 wikipedia en New_York_City all-access all-agents daily 2019-01-04
## 5 wikipedia en New_York_City all-access all-agents daily 2019-01-05
## 6 wikipedia en New_York_City all-access all-agents daily 2019-01-06
## views
## 1 20535
## 2 20786
## 3 19481
## 4 18614
## 5 18503
## 6 20124
Insights :
The page views data for the Wikipedia page “New York City” was successfully retrieved via the pageviews package.
This step sets the foundation for analyzing the popularity and trends related to New York City over time.
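Before moving on, a few optional sanity checks can confirm that the download covers the requested span; this is just a sketch using the pageviews_data object created above.
# Optional sanity checks on the fetched data
range(pageviews_data$date)        # confirm the requested date span
nrow(pageviews_data)              # expect roughly one row per day
sum(is.na(pageviews_data$views))  # check for missing view counts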
Step 2: Creating a tsibble Object
# Load the tsibble package
library(tsibble)
## Warning: package 'tsibble' was built under R version 4.3.3
##
## Attaching package: 'tsibble'
## The following objects are masked from 'package:base':
##
## intersect, setdiff, union
# Convert 'date' column to Date class
pageviews_data$date <- as.Date(pageviews_data$date)
# Convert data to tsibble object
ts_data <- as_tsibble(pageviews_data, index = date)
Insights :
The retrieved page views data was converted into a tsibble object, a data structure designed for time series.
This conversion enables further analysis and visualization of the time series data.
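A tsibble also makes it easy to detect implicit gaps (missing days) in the index; a minimal sketch using the ts_data object above:
# Check the daily index for implicit gaps
has_gaps(ts_data)    # one-row summary with a .gaps flag
# fill_gaps(ts_data) # would insert explicit NA rows for any missing days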
Step 3: Plotting the Time Series Data
# Plotting the time series data
ggplot(ts_data, aes(x = date, y = views)) +
geom_line() +
labs(title = "Page Views of New York City Wikipedia Page Over Time",
x = "Date",
y = "Page Views")
Insights:
The time series data of page views for the Wikipedia page “New York City” was plotted over time.
The plot provides a visual representation of the popularity trend of New York City on Wikipedia.
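To separate the long-run movement from day-to-day noise, a smoother can be overlaid on the same plot; this is an optional sketch using ggplot2's geom_smooth() with a LOESS fit.
# Same series with a LOESS-smoothed trend overlaid
ggplot(ts_data, aes(x = date, y = views)) +
  geom_line(colour = "grey60") +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Page Views with Smoothed Trend",
       x = "Date",
       y = "Page Views")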
Step 4: Linear Regression for Trend Detection
# Fit linear regression model
lm_model <- lm(views ~ as.numeric(date), data = ts_data)
# Summary of the linear regression model
summary(lm_model)
##
## Call:
## lm(formula = views ~ as.numeric(date), data = ts_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7585 -2453 -648 1217 34801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.267e+04 7.927e+03 -9.168 <2e-16 ***
## as.numeric(date) 5.078e+00 4.297e-01 11.818 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4507 on 1095 degrees of freedom
## Multiple R-squared: 0.1131, Adjusted R-squared: 0.1123
## F-statistic: 139.7 on 1 and 1095 DF, p-value: < 2.2e-16
Insights:
– A linear regression model was fitted to the time series data to detect any upward or downward trends.
The summary of the linear regression model reveals the following:
The intercept of approximately -72,670 corresponds to the numeric date origin (1970-01-01), far outside the observed range, so it has no direct interpretation; it simply anchors the fitted line.
The coefficient for the date variable indicates that for each additional day, the estimated page views increase by approximately 5.08 (roughly 1,850 views per year).
The p-values for both coefficients are extremely low, indicating that they are statistically significant.
The adjusted R-squared value of 0.1123 suggests that approximately 11.23% of the variability in page views can be explained by the linear relationship with date.
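To see the fitted trend against the data, the regression line can be overlaid on the observed series; a minimal sketch using the lm_model object fitted above:
# Overlay the fitted linear trend on the observed page views
ts_data %>%
  mutate(trend = predict(lm_model)) %>%
  ggplot(aes(x = date)) +
  geom_line(aes(y = views), colour = "grey60") +
  geom_line(aes(y = trend), colour = "blue") +
  labs(title = "Observed Page Views with Fitted Linear Trend",
       x = "Date",
       y = "Page Views")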
Step 5: Seasonal Smoothing
–Aggregate daily page views to weekly
–Convert aggregated views to a time series object
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:tsibble':
##
## interval
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Aggregate the daily tsibble to weekly totals (index_by() groups over the time index)
ts_data_weekly <- ts_data %>%
  index_by(date_week = floor_date(date, "week")) %>%
  summarise(views = sum(views))
ts_views <- ts(ts_data_weekly$views, frequency = 52)
# Perform seasonal decomposition
decomposed <- decompose(ts_views, "multiplicative")
# Convert the seasonal component to a data frame, pairing it with the weekly dates
seasonal_df <- data.frame(date = ts_data_weekly$date_week,
                          seasonal = as.numeric(decomposed$seasonal))
# Plot the seasonal component
ggplot(seasonal_df, aes(x = date, y = seasonal)) +
geom_line() +
labs(title = "Seasonal Component of New York City Wikipedia Page Views",
x = "Date",
y = "Seasonal Component")
Insights :
An attempt was made to perform seasonal decomposition on the aggregated views data to identify seasonal patterns.
The seasonal component of the time series data represents the systematic, repeating patterns that occur over time, such as seasonal fluctuations or cycles.
The multiplicative decomposition method was used, which assumes that the seasonal component varies proportionally with the level of the time series.
The plot of the seasonal component displays the variation in page views attributed to seasonal effects over time.
Overall, the regular, spike-like pattern in the seasonal component provides a visual representation of the rhythmic fluctuations in Wikipedia page views, offering insight into the temporal dynamics of public interest in topics related to New York City.
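For a fuller picture, the observed, trend, seasonal, and random components can all be plotted at once; a minimal sketch using the decomposed object above, with stl() noted as a common alternative:
# Plot all four components of the classical decomposition
plot(decomposed)
# A robust STL decomposition could be compared against it:
# plot(stl(ts_views, s.window = "periodic"))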
Step 6: Seasonality Analysis
–ACF plot to illustrate seasonality
–Load the forecast package
library(forecast)
## Warning: package 'forecast' was built under R version 4.3.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
# Create ACF plot
ggAcf(ts_data$views)
Insights :
The autocorrelation function (ACF) plot of the New York City Wikipedia page views provides insights into the temporal dependencies or correlations within the time series data. Here’s the analysis based on the ACF plot:
Lag Structure: The ACF plot displays autocorrelation coefficients at different lags, indicating the correlation between the series and its lagged values. Each lag represents a specific time interval between observations.
Significance Levels: The ACF plot typically includes horizontal dashed lines representing the significance levels. Observations outside these lines suggest significant autocorrelation values that may be indicative of patterns or trends in the data.
Decay Pattern: The decay pattern of autocorrelation coefficients as the lag increases provides insights into the persistence of patterns or trends in the data. A slow decay suggests long-term dependencies, while a rapid decay indicates short-term correlations.
Interpretation: By examining the ACF plot, one can identify significant spikes or patterns at specific lags, indicating recurring cycles or seasonal effects. These insights can inform forecasting models and help identify appropriate lag orders for autoregressive integrated moving average (ARIMA) models.
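Following up on the ARIMA remark above, forecast::auto.arima() can choose lag orders automatically. This is only a sketch, applied to the weekly series ts_views from Step 5:
# Let auto.arima() select ARIMA orders for the weekly series
arima_fit <- auto.arima(ts_views)
summary(arima_fit)
# A short-horizon forecast from the fitted model:
# plot(forecast(arima_fit, h = 12))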
Further Questions:
Seasonality Identification: Are there any specific seasonal patterns or trends evident in the Wikipedia page views data for New York City? Further investigation into the nature and drivers of these seasonal variations could provide valuable insights.
Model Adequacy: How well does the current linear regression model capture the underlying trends in the data? Further diagnostics, such as residual analysis and model validation, can help assess the adequacy of the model and identify any potential areas for improvement.
Forecasting Accuracy: How accurate are the forecasts generated using the current modeling approach? Conducting out-of-sample forecasting and evaluating forecast accuracy metrics can provide a better understanding of the model’s predictive performance.
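As a starting point for the model-adequacy and forecasting questions above, the residuals of the linear trend model can be examined; a minimal sketch using lm_model from Step 4:
# Residual diagnostics for the linear trend model
lm_resid <- residuals(lm_model)
ggAcf(lm_resid)  # strong autocorrelation here means structure the trend does not capture
plot(fitted(lm_model), lm_resid,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")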