Week 12

Converting Transaction Date to Year-Month-Day Format

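library(dplyr)      # %>%, mutate(), group_by(), summarise()
library(lubridate)  # month()
library(ggplot2)    # plotting
# Convert transaction_date to Date
data <- data %>%
  mutate(transaction_date = as.Date(transaction_date, format = "%Y-%m-%d"))  # Adjust format if necessary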
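A quick check after the conversion can catch values that did not match the expected format, since `as.Date()` returns `NA` for them instead of raising an error. The sketch below uses a small hypothetical vector (`raw_dates`) in place of the real column, so the values are illustrative only.

```r
# Hypothetical sample values; the real column is data$transaction_date
raw_dates <- c("2024-01-15", "15/01/2024", "January 15, 2024")

# Same conversion as above: strings that do not match the format become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Count the values that failed to parse and show which ones they were
n_failed <- sum(is.na(parsed))
print(n_failed)
print(raw_dates[is.na(parsed)])
```

If `n_failed` is greater than zero on the real data, the `format` argument probably needs adjusting before any aggregation is done.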

Aggregating transactions by Day and by Month

daily_data <- data %>%
  group_by(transaction_date) %>%
  summarize(num_transactions = n())  # Count transactions per day

# Aggregate transactions by month
monthly_data <- data %>%
  mutate(year_month = format(transaction_date, "%Y-%m")) %>%  # Extract year and month
  group_by(year_month) %>%
  summarize(num_transactions = n())
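One thing to keep in mind with the daily aggregation: a day with no transactions produces no row at all, rather than a row with a count of zero. The sketch below shows one way to make the missing days explicit, using base R and a small hypothetical vector of dates (`tx_dates`); whether the real data has such gaps is not known from this section.

```r
# Hypothetical transaction dates; 2024-01-02 has no transactions
tx_dates <- as.Date(c("2024-01-01", "2024-01-01", "2024-01-03", "2024-01-04"))

# Every calendar day between the first and last transaction
all_days <- seq(min(tx_dates), max(tx_dates), by = "day")

# Using all days as factor levels makes table() report zero counts too
counts <- table(factor(as.character(tx_dates), levels = as.character(all_days)))

daily_complete <- data.frame(
  transaction_date = all_days,
  num_transactions = as.integer(counts)
)
print(daily_complete)
```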

Counting transactions by day of the week and Visualizing

daily_counts <- data %>%
  mutate(day_of_week = factor(weekdays(transaction_date), # Extract day of the week (names depend on the system locale)
                              levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>% # Factor levels fix the Monday-to-Sunday order, including on the plot axis
  group_by(day_of_week) %>%
  summarise(transaction_count = n()) %>%
  arrange(day_of_week) # Sort rows by the factor levels

# Plot daily transaction counts
ggplot(daily_counts, aes(x = day_of_week, y = transaction_count)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  theme_minimal() +
  labs(title = "Transactions by Day of the Week",
       x = "Day of Week", y = "Transaction Count")

Observation: Wednesday appears to have the highest number of transactions, while the other days are fairly uniform and slightly lower.

Possible reasons:
Midweek activity: People often handle payments, bookings, or online transactions in the middle of the week, when they are at work or actively managing tasks.
Routine schedules: Businesses might send bills or process payroll midweek, increasing transaction volume.
Cultural/market trends: Some industries (e.g., fast food chains such as KFC and McDonald's that offer Wednesday discounts, grocery, bill payments) might see peak activity on Wednesdays due to midweek promotions or habits.
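Whether the Wednesday peak is larger than chance variation can be checked with a chi-square goodness-of-fit test against equal counts for every weekday. The sketch below uses hypothetical counts (`weekday_counts`), not the real results, and assumes each weekday occurs about equally often in the sample period.

```r
# Hypothetical weekday counts for illustration only, not the real results
weekday_counts <- c(Monday = 330, Tuesday = 325, Wednesday = 370,
                    Thursday = 328, Friday = 335, Saturday = 322, Sunday = 326)

# With a single vector, chisq.test() tests against equal probabilities
weekday_test <- chisq.test(weekday_counts)
print(weekday_test)

# Standardized residuals show which days deviate most from the expected count
print(round(weekday_test$stdres, 2))
```

On the real data, `daily_counts$transaction_count` could be passed in the same way. A small p-value would support a genuine weekday effect, while a large one would suggest the Wednesday bar is within normal variation.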

Counting transactions by Month and Visualizing

# Count transactions by month
monthly_counts <- data %>%
  mutate(month = month(transaction_date, label = TRUE)) %>% # Extract month as a labeled factor (e.g., "Jan", "Feb")
  group_by(month) %>%
  summarise(transaction_count = n()) %>%
  arrange(month) # Ensures the months are ordered properly (Jan to Dec)

# Plot monthly transaction counts
ggplot(monthly_counts, aes(x = month, y = transaction_count)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  theme_minimal() +
  labs(title = "Transactions by Month",
       x = "Month", y = "Transaction Count")

Observation: May, March, and December have higher transaction counts, while other months, such as April, February, and October, have slightly lower activity.

Possible reasons:
Seasonality: May, March, and December are popular months for activities like vacations, weddings, and summer shopping, leading to more transactions for travel, gifts, or preparations.
February: As the shortest month, February may inherently have fewer transactions, coupled with reduced spending after the New Year.
April: April is often associated with tax filing deadlines (April 15 in the U.S.), so people might limit discretionary spending to pay taxes or save for potential payments.
October: October might represent a pre-holiday lull, in which people pause major spending ahead of the November/December holidays.
Promotions and sales: Sales events such as mid-year sales in May/June and "Back-to-School" in late summer might boost transaction activity.
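The point about February being shorter can be checked by dividing each month's count by its number of days, which gives a per-day rate that is comparable across months. The sketch below uses hypothetical monthly counts (`monthly`) for three consecutive months; the numbers are chosen only to illustrate the arithmetic.

```r
# Hypothetical counts for three consecutive months (illustration only)
monthly <- data.frame(
  month_start = as.Date(c("2024-01-01", "2024-02-01", "2024-03-01")),
  transaction_count = c(310, 290, 310)
)

# First day of the following month for each row (assumes consecutive months)
next_month <- seq(monthly$month_start[1], by = "month",
                  length.out = nrow(monthly) + 1)[-1]

# Days in each month, then transactions per day
monthly$days <- as.integer(next_month - monthly$month_start)
monthly$per_day <- monthly$transaction_count / monthly$days
print(monthly)
```

In this made-up example February has the lowest raw count but the same daily rate as its neighbors, so the apparent dip is explained entirely by month length.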

Count the number of transactions for each date

daily_aggregated_data <- data %>%
  count(transaction_date, name = "num_transactions") # Create a column `num_transactions`
library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr

## 
## Attaching package: 'tsibble'

## The following object is masked from 'package:lubridate':
## 
##     interval

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

# Ensure transaction_date is in Date format
daily_aggregated_data <- daily_aggregated_data %>%
  mutate(transaction_date = as.Date(transaction_date))

# Convert to tsibble
daily_tsibble <- daily_aggregated_data %>%
  as_tsibble(index = transaction_date)
library(ggplot2)

# Plot daily transactions over time

# Smoothed line plot with adjustable smoothness
ggplot(daily_tsibble, aes(x = transaction_date, y = num_transactions)) +
  geom_line(color = "blue") + # Line for original data
  geom_smooth(method = "loess", span = 0.3, color = "darkgreen", se = FALSE) + # Smoothed line (adjust span for smoothness)
  labs(title = "Daily Transactions Over Time (With Smoothed Trend)",
       x = "Date",
       y = "Number of Transactions") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Observation: The daily transaction count fluctuates from day to day, and the smoothed line suggests a slight upward trend over time.

Possible reasons:
Increasing adoption of services: A gradual increase could reflect more people using the platform over time.
Holidays or specific events: Spikes may align with festivals, holiday shopping, or events that encourage spending (e.g., Black Friday, Cyber Monday).
Trend smoothing: Day-to-day fluctuations may come from variable factors such as paydays, billing cycles, or one-off events, while the smoothed trend captures the slower, steadier change in adoption or usage.
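The smoothed line above comes from LOESS. A simpler alternative that makes the smoothing explicit is a centered 7-day moving average, which also averages out any day-of-week effect. The sketch below is a minimal illustration on a hypothetical vector of daily counts; on the real data the input would be `daily_tsibble$num_transactions`.

```r
# Hypothetical daily counts for illustration only
num_transactions <- c(12, 15, 14, 20, 13, 11, 16, 14, 17, 15, 21, 12, 13, 15)

# Centered 7-day moving average; the stats:: prefix avoids the clash
# with dplyr::filter() when dplyr is attached
ma7 <- stats::filter(num_transactions, rep(1 / 7, 7), sides = 2)

# The first and last three values are NA because the window is incomplete there
print(round(as.numeric(ma7), 2))
```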

Extract month from transaction_date

# Extract month from transaction_date
library(lubridate)
library(dplyr)

monthly_counts <- data %>%
  mutate(month = month(transaction_date, label = TRUE)) %>% # Get month as a labeled factor (e.g., "Jan", "Feb")
  group_by(month) %>%
  summarise(transaction_count = n()) # Count transactions for each month
library(ggplot2)



# Create a month-year variable
monthly_trend <- data %>%
  mutate(month_year = format(as.Date(transaction_date), "%Y-%m")) %>% # Extract year and month (e.g., "2024-01")
  group_by(month_year) %>%
  summarise(transaction_count = n())

# Plot transactions over time (month-year)
ggplot(monthly_trend, aes(x = as.Date(paste0(month_year, "-01")), y = transaction_count)) +
  geom_line(color = "blue") +
  theme_minimal() +
  labs(title = "Monthly Transaction Trend",
       x = "Month-Year", y = "Transaction Count")

Fit a linear regression model

linear_model <- lm(num_transactions ~ as.numeric(transaction_date), data = daily_tsibble)

# Add trend line to the plot
ggplot(daily_tsibble, aes(x = transaction_date, y = num_transactions)) +
  geom_line(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Transactions Over Time with Trend Line",
       x = "Date",
       y = "Number of Transactions") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Summary of the linear model
summary(linear_model)

## 
## Call:
## lm(formula = num_transactions ~ as.numeric(transaction_date), 
##     data = daily_tsibble)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9727  -2.5955   0.0666   2.2855  13.5905 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  -20.081398  38.262630  -0.525    0.600
## as.numeric(transaction_date)   0.001707   0.001935   0.882    0.378
## 
## Residual standard error: 3.912 on 364 degrees of freedom
## Multiple R-squared:  0.002132,   Adjusted R-squared:  -0.0006094 
## F-statistic: 0.7777 on 1 and 364 DF,  p-value: 0.3784

# Subset data based on a specific time period (example: pre-2024)
subset_data <- daily_tsibble %>%
  filter(transaction_date < as.Date("2024-01-01"))

# Fit linear regression on the subset
subset_model <- lm(num_transactions ~ as.numeric(transaction_date), data = subset_data)

# Add trend line to the plot
ggplot(subset_data, aes(x = transaction_date, y = num_transactions)) +
  geom_line(color = "blue") +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Subset: Transactions Over Time with Trend Line",
       x = "Date",
       y = "Number of Transactions") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Summary of the subset model
summary(subset_model)

## 
## Call:
## lm(formula = num_transactions ~ as.numeric(transaction_date), 
##     data = subset_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6517 -2.5497 -0.5084  1.4784 13.3781 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   36.114329 160.511717   0.225    0.822
## as.numeric(transaction_date)  -0.001146   0.008166  -0.140    0.889
## 
## Residual standard error: 3.698 on 133 degrees of freedom
## Multiple R-squared:  0.0001481,  Adjusted R-squared:  -0.00737 
## F-statistic: 0.0197 on 1 and 133 DF,  p-value: 0.8886

Neither linear regression model shows a significant upward or downward trend: the slope p-values are not significant (0.378 for the full data, 0.889 for the subset), and the R-squared values are near zero (0.0021 and 0.0001). Subsetting the data might help detect localized trends, but overall the trends are extremely weak.
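To put the full-data slope in more interpretable terms, the estimate and standard error reported in the summary output can be turned into a 95% confidence interval and a per-year change. The sketch below copies the numbers from the output above by hand; the variable names are illustrative.

```r
# Values copied from the full-data model summary above
slope <- 0.001707   # change in daily transactions per day
se    <- 0.001935   # standard error of the slope
df    <- 364        # residual degrees of freedom

# t statistic and 95% confidence interval for the slope
t_value <- slope / se
ci <- slope + c(-1, 1) * qt(0.975, df) * se

# The same slope expressed as change in daily transactions per year
slope_per_year <- slope * 365

print(round(t_value, 3))
print(signif(ci, 3))
print(round(slope_per_year, 3))
```

The interval spans zero, which is consistent with the non-significant p-value. With a fitted model object available, `confint(linear_model)` gives the interval directly.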

# Apply smoothing using LOESS
ggplot(daily_tsibble, aes(x = transaction_date, y = num_transactions)) +
  geom_line(color = "blue") +
  geom_smooth(method = "loess", color = "red", se = FALSE) +
  labs(title = "Smoothing to Detect Seasonality",
       x = "Date",
       y = "Number of Transactions") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The plot overlays a LOESS smooth on the daily counts to look for trend and seasonality in the transaction data from October 2023 to July 2024. The red smoothed line indicates a mostly flat or slightly declining overall trend in transactions. The blue line shows daily fluctuations, with variability that appears to increase around early 2024. Further decomposition or periodic smoothing is required to confirm any seasonal pattern, which might be driven by events or cycles.
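One way to do the decomposition mentioned above is STL from base R's `stats` package, treating the series as having a weekly period. The sketch below runs on simulated data with a built-in midweek bump, so the series, the `weekly_pattern` values, and the choice of a 7-day period are all assumptions for illustration rather than properties of the real transactions.

```r
set.seed(42)

# Simulated daily counts: baseline 15, a hypothetical bump on day 3 of each week
n <- 366
day_index <- seq_len(n)
weekly_pattern <- c(0, 0, 3, 0, 0, -1, -1)
y <- 15 + weekly_pattern[(day_index - 1) %% 7 + 1] + rnorm(n, sd = 2)

# A ts object with frequency 7 tells stl() to look for a weekly cycle
y_ts <- ts(y, frequency = 7)
fit <- stl(y_ts, s.window = "periodic")

# Columns of the result: seasonal, trend, remainder
seasonal <- fit$time.series[, "seasonal"]
print(round(seasonal[1:7], 2))
plot(fit)
```

On the real data, `daily_tsibble$num_transactions` would replace `y`, provided there are no missing days. A seasonal component that is small relative to the remainder would support the conclusion that there is no clear weekly pattern.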

# ACF plot to detect seasonality
acf(daily_tsibble$num_transactions, main = "ACF of Daily Transactions")

# PACF plot to check for partial autocorrelations
pacf(daily_tsibble$num_transactions, main = "PACF of Daily Transactions")

Insights from ACF and PACF

ACF: A high lag-1 correlation indicates dependency on the previous day, but the absence of periodic spikes (e.g., at lags 7 and 14) suggests no clear seasonality.
PACF: Significant early lags (1–3) highlight short-term dependencies, with limited contributions from higher-order lags.
Seasonality: There are no evident weekly or monthly patterns; aggregation (e.g., weekly) might uncover hidden trends.
Reason: The lack of seasonality could stem from random variation or from the nature of the transactions.
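Reading periodic spikes off the ACF plot by eye can be backed up by pulling the autocorrelations at the weekly lags and comparing them with the approximate 95% bounds that the plot draws. The sketch below uses simulated Poisson counts with no seasonality as a stand-in for the real series, so its printed values say nothing about the actual data.

```r
set.seed(123)

# Simulated daily counts with no seasonality (stand-in for the real series)
y <- rpois(366, lambda = 15)

# Compute the ACF without plotting
acf_out <- acf(y, lag.max = 28, plot = FALSE)

# Approximate 95% bounds under the white-noise assumption
bound <- qnorm(0.975) / sqrt(length(y))

# Autocorrelations at weekly lags; index 1 holds lag 0, hence the + 1
seasonal_lags <- c(7, 14, 21, 28)
acf_at_lags <- acf_out$acf[seasonal_lags + 1]

print(data.frame(
  lag = seasonal_lags,
  acf = round(acf_at_lags, 3),
  exceeds_bound = abs(acf_at_lags) > bound
))
```

On the real data, `daily_tsibble$num_transactions` would replace `y`. Consistent exceedances at lags 7, 14, 21, and 28 would point to a weekly pattern.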