Exploring Time Series Analysis Techniques Using Washdash Data

Introduction:
Time series analysis is a core technique in fields such as finance, economics, and meteorology, where it is used to understand and predict how data change over time. In this case study, I use the R programming language to work through several techniques for analyzing time series data: generating synthetic data, visualizing trends, detecting seasonality, and fitting linear regressions. My aim is to gain insight into the patterns and dynamics in the data.

Summary of Data:
This dataset contains eight columns: type, region, residence type, service type, year, coverage, population, and service level. Each column offers a different perspective on the data, which helps me understand its breadth and depth. Summarizing the dataset reports descriptive statistics (minimum, maximum, mean, median, and quartiles) for the numeric variables Year, Coverage, and Population, and the length and class of the character variables.
Significance:
Summarizing the data provides an overview of its distribution and central tendencies, helping to understand the characteristics of the variables.
Further questions:
Further investigation could include exploring any outliers or missing values in the dataset and understanding their impact on the analysis.
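As a starting point for that check, here is a minimal sketch in base R. The `coverage` vector is a hypothetical stand-in for the Coverage column, with one missing value and one extreme value planted on purpose, and the 1.5 * IQR rule is just one common convention for flagging outliers.

```r
# Hypothetical stand-in for the Coverage column; the NA and the 250 are planted
coverage <- c(5.4, 0, 0, 94.6, 12.1, NA, 34.2, 2.5, 250)

# Count missing values
n_missing <- sum(is.na(coverage))

# Flag values more than 1.5 * IQR beyond the quartiles (Tukey's rule)
q <- unname(quantile(coverage, probs = c(0.25, 0.75), na.rm = TRUE))
iqr <- q[2] - q[1]
is_outlier <- !is.na(coverage) &
  (coverage < q[1] - 1.5 * iqr | coverage > q[2] + 1.5 * iqr)

n_missing
coverage[is_outlier]
```

Applied to the real data, the same steps would run on `data$Coverage` or `data$Population`. Whether a flagged value is an error or a genuine extreme still needs a manual look.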

Structure of Data:
Examining the structure of the data involves understanding the data types of each variable, the number of observations, and the overall structure of the dataset.
Significance:
Understanding the structure of the data is crucial for data manipulation, preprocessing, and modeling tasks.
Further questions:
Further investigation could involve checking for inconsistencies or anomalies in the data structure that might affect subsequent analysis steps.

Generating Synthetic Time Series Data:
Generating synthetic time series data allows for the creation of a simulated dataset with known characteristics, such as trends, seasonality, and noise.
Significance:
Synthetic data can be used for testing models, evaluating algorithms, and understanding the behavior of time series analysis techniques.
Further questions:
Further investigation could involve experimenting with different parameters for generating synthetic data to explore various time series patterns and their impact on analysis outcomes.
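One way to run such experiments is to wrap the generator in a small function. The sketch below is an assumption on my part: `simulate_series` and its parameters (`trend_slope`, `amplitude`, `period`, `noise_sd`) are hypothetical names, not part of any package.

```r
# Hypothetical helper: all parameter names are illustrative
simulate_series <- function(n = 365, trend_slope = 0, amplitude = 1,
                            period = 365, noise_sd = 0.2, seed = 123) {
  set.seed(seed)
  t <- seq_len(n)
  trend <- trend_slope * t                          # linear trend
  seasonal <- amplitude * sin(2 * pi * t / period)  # sine-wave seasonality
  noise <- rnorm(n, mean = 0, sd = noise_sd)        # Gaussian noise
  data.frame(t = t, trend = trend, seasonal = seasonal,
             response = trend + seasonal + noise)
}

# A weekly cycle on top of a gentle upward trend
weekly <- simulate_series(n = 140, trend_slope = 0.01, period = 7, noise_sd = 0.1)
head(weekly)
```

Because the trend and seasonal components are returned alongside the response, the output of any later analysis can be compared against the known truth.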

Plotting Time Series Data:
Plotting time series data visually represents the patterns and trends present in the dataset over time.
Significance:
Visualizing time series data helps identify patterns, trends, seasonality, and anomalies, facilitating better understanding and interpretation of the data.
Further questions:
Further investigation could involve exploring different visualization techniques and adjusting plot parameters to enhance the clarity and readability of the time series plots.

Linear Regression to Detect Trends:
Performing linear regression helps identify trends or patterns in the time series data by fitting a linear model to the relationship between the response variable and time.
Significance:
Detecting trends is essential for understanding long-term patterns in the data and making predictions about future behavior.
Further questions:
I'm keen to assess the strength of the detected trend and to explore more complex regression models, such as polynomial regression or generalized additive models, to capture non-linear trends in the data.
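A rough sketch of the polynomial option, using base R only. The series is regenerated here on a plain day index so the block runs on its own, and the cubic degree is an arbitrary choice for illustration.

```r
set.seed(123)
day <- 1:365
response <- sin(2 * pi * day / 365) + rnorm(365, mean = 0, sd = 0.2)

# Straight-line fit versus a cubic polynomial in time
linear_fit <- lm(response ~ day)
cubic_fit <- lm(response ~ poly(day, 3))

# Compare the share of variance explained by each model
summary(linear_fit)$r.squared
summary(cubic_fit)$r.squared

# F-test for whether the extra polynomial terms improve on the linear fit
anova(linear_fit, cubic_fit)
```

The two models are nested, so the `anova()` comparison is a valid test of the added terms. A generalized additive model would need an extra package such as mgcv and is not shown here.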

Subset Analysis for Multiple Trends:
Conducting subset analysis involves splitting the time series data into multiple subsets and performing trend analysis on each subset independently.
Significance:
Subset analysis helps identify different trends or patterns in different time periods, providing insights into temporal variations in the data.
Further questions:
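I'm curious about the reasons behind the observed differences in trends between subsets and about their implications for the overall dataset. In the output below, for example, the full-year fit reports a significant negative slope, while neither half-year subset shows a significant slope (p = 0.87 and p = 0.91).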
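One possible way to compare subset trends formally is to fit a single model with an interaction between time and subset, rather than two separate regressions. This is a sketch under my own assumptions: the split at day 181 mirrors a June 30 cutoff, and `half` is a hypothetical grouping variable.

```r
set.seed(123)
day <- 1:365
response <- sin(2 * pi * day / 365) + rnorm(365, mean = 0, sd = 0.2)

# Label each observation by the half of the year it falls in
half <- factor(ifelse(day <= 181, "first", "second"))

# One model with a separate intercept and slope for each half
interaction_fit <- lm(response ~ day * half)

# The day:halfsecond row estimates the difference in slope between the halves
coef(summary(interaction_fit))["day:halfsecond", ]
```

The p-value on the interaction term indicates whether the two slopes differ by more than noise would explain.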

Smoothing to Detect Seasonality:
Smoothing techniques, such as seasonal decomposition or a regression on trend and Fourier terms, help identify seasonal patterns or variations in the time series data.
Significance:
Detecting seasonality is crucial for understanding recurring patterns or cycles in the data and adjusting for them in forecasting or modeling tasks.
Further questions:
Further investigation could involve exploring alternative smoothing techniques, such as moving averages or exponential smoothing, and comparing their effectiveness in detecting seasonality.
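Both alternatives are available in base R, so a first comparison could look like the sketch below. The 31-day window is an arbitrary assumption, and the series is regenerated so the block is self-contained.

```r
set.seed(123)
day <- 1:365
response <- sin(2 * pi * day / 365) + rnorm(365, mean = 0, sd = 0.2)

# Centered 31-day moving average. stats::filter is named explicitly because
# dplyr::filter masks it when dplyr is attached.
window <- 31
moving_avg <- stats::filter(response, rep(1 / window, window), sides = 2)

# Simple exponential smoothing: HoltWinters with the trend and seasonal
# components switched off; the smoothing weight alpha is estimated from the data
ses_fit <- HoltWinters(response, beta = FALSE, gamma = FALSE)
exp_smooth <- ses_fit$fitted[, "xhat"]

# Overlay both smoothers on the raw series
plot(day, response, type = "l", col = "grey")
lines(day, moving_avg, col = "blue")
lines(exp_smooth, col = "red")
```

The moving average has missing values at both ends, where the window does not fit, and the exponentially smoothed series lags the data slightly. Both effects are worth keeping in mind when comparing the two.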

Autocorrelation Function (ACF) Analysis:
ACF and PACF plots help assess the autocorrelation and partial autocorrelation of the time series data, respectively, revealing the presence of seasonality or other temporal dependencies. The ACF plot displays the autocorrelation of the time series data at different lags. In this plot, the correlation between the time series values at various lag intervals is depicted. For instance, if there is a significant positive autocorrelation at lag 1, it suggests that the value at time t is correlated with the value at time t-1. Similarly, if there are significant autocorrelations at higher lags, it indicates longer-term dependencies in the data.
Partial Autocorrelation Function (PACF) Analysis:
The PACF plot shows the partial autocorrelation of the time series data at different lags, while controlling for the correlations at shorter lags. It helps to identify the direct relationship between the observations at different time points. For example, a significant spike at lag k in the PACF plot suggests that there is a direct correlation between the observations k time units apart, controlling for the correlations at shorter lags.
By examining the ACF and PACF plots, we can gain insights into the underlying structure of the time series data, such as identifying seasonality, trend, and other patterns. These plots serve as diagnostic tools to guide the selection of appropriate models for forecasting or analyzing the time series data.
Significance:
Analyzing ACF and PACF plots provides insights into the underlying structure of the time series data and helps identify the appropriate parameters for time series modeling, such as lag values.
Further questions:
Further investigation could involve interpreting the ACF and PACF plots to determine the order of seasonal and non-seasonal components in time series models like ARIMA or SARIMA.
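To make the lag-1 example concrete, the sketch below computes a sample autocorrelation by hand and compares it with the value from `acf()`. The helper name `sample_acf` is hypothetical, and the significance bound is the usual white-noise approximation.

```r
set.seed(123)
x <- sin(2 * pi * (1:365) / 365) + rnorm(365, mean = 0, sd = 0.2)

# Sample autocorrelation at lag k: the lagged cross-product of deviations from
# the overall mean, divided by the lag-0 sum of squares
sample_acf <- function(x, k) {
  n <- length(x)
  x_centered <- x - mean(x)
  sum(x_centered[(k + 1):n] * x_centered[1:(n - k)]) / sum(x_centered^2)
}

manual_lag1 <- sample_acf(x, 1)
builtin_lag1 <- acf(x, lag.max = 1, plot = FALSE)$acf[2]

# Approximate 95% bound under the white-noise null
bound <- qnorm(0.975) / sqrt(length(x))

c(manual = manual_lag1, builtin = builtin_lag1, bound = bound)
```

A lag-1 value far above the bound indicates that consecutive observations are strongly related, which is expected for a slowly varying sine wave.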

Conclusion:
In conclusion, my exploration of time series data has provided valuable insights into the underlying patterns and dynamics present within the dataset. Through descriptive statistics, visualization, and statistical modeling, I identified trends, seasonality, and temporal dependencies, which helped me better understand the behavior of the data over time.
Moving forward, further investigation could involve refining these analytical methods, exploring alternative modeling techniques, and validating the findings using real-world data. Additionally, applying time series forecasting models to predict future trends and patterns could offer valuable insights for decision-making and planning.
Overall, the analysis of time series data offers a powerful framework for understanding temporal dynamics, identifying patterns, and making informed predictions. By continuing to explore and analyze time series data, we can unlock new insights, drive innovation, and make data-driven decisions in diverse fields and industries.

# Load required libraries
library(tsibble)

## Warning: package 'tsibble' was built under R version 4.3.3

## 
## Attaching package: 'tsibble'

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(fable)

## Warning: package 'fable' was built under R version 4.3.3

## Loading required package: fabletools

## Warning: package 'fabletools' was built under R version 4.3.3

# Read the CSV file
data <- read.csv("C:\\Users\\am790\\Downloads\\washdash-download (1).csv")

# Summary of the data
summary(data)

##      Type              Region          Residence.Type     Service.Type      
##  Length:3367        Length:3367        Length:3367        Length:3367       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       Year         Coverage         Population        Service.level     
##  Min.   :2010   Min.   :  0.000   Min.   :0.000e+00   Length:3367       
##  1st Qu.:2013   1st Qu.:  2.486   1st Qu.:4.366e+06   Class :character  
##  Median :2016   Median : 12.110   Median :3.306e+07   Mode  :character  
##  Mean   :2016   Mean   : 22.447   Mean   :1.497e+08                     
##  3rd Qu.:2019   3rd Qu.: 34.190   3rd Qu.:1.755e+08                     
##  Max.   :2022   Max.   :100.000   Max.   :2.173e+09

# Structure of the data
str(data)

## 'data.frame':   3367 obs. of  8 variables:
##  $ Type          : chr  "sdg" "sdg" "sdg" "sdg" ...
##  $ Region        : chr  "Australia and New Zealand" "Australia and New Zealand" "Australia and New Zealand" "Australia and New Zealand" ...
##  $ Residence.Type: chr  "total" "total" "total" "total" ...
##  $ Service.Type  : chr  "Sanitation" "Sanitation" "Sanitation" "Sanitation" ...
##  $ Year          : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ Coverage      : num  5.40789 0 0 94.58855 0.00356 ...
##  $ Population    : num  1425817 0 0 24938751 938 ...
##  $ Service.level : chr  "Basic service" "Limited service" "Open defecation" "Safely managed service" ...

# Generate synthetic time series data
set.seed(123)
date <- seq.Date(from = as.Date("2022-01-01"), to = as.Date("2022-12-31"), by = "day")
response <- sin(2 * pi * seq_along(date) / 365) + rnorm(length(date), mean = 0, sd = 0.2)

# Create a tsibble object
ts_data <- tsibble(Date = date, Response = response) %>%
  mutate(Response = as.numeric(Response))

## Using `Date` as index variable.

# Plot the data over time
ggplot(data = ts_data, aes(x = Date, y = Response)) +
  geom_line() +
  labs(title = "Time Series Plot of Response Variable",
       x = "Date", y = "Response")

# Linear regression to detect trends
lm_model <- lm(Response ~ as.numeric(Date), data = ts_data)
summary(lm_model)

## 
## Call:
## lm(formula = Response ~ as.numeric(Date), data = ts_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04430 -0.41988  0.02245  0.39424  1.33909 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      99.3625642  4.5584635    21.8   <2e-16 ***
## as.numeric(Date) -0.0051815  0.0002377   -21.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4785 on 363 degrees of freedom
## Multiple R-squared:  0.5669, Adjusted R-squared:  0.5657 
## F-statistic: 475.1 on 1 and 363 DF,  p-value: < 2.2e-16

# Subset the data for multiple trends
ts_data_subset1 <- filter(ts_data, Date <= as.Date("2022-06-30"))
ts_data_subset2 <- filter(ts_data, Date > as.Date("2022-06-30"))

# Linear regression for subset 1
lm_model_subset1 <- lm(Response ~ as.numeric(Date), data = ts_data_subset1)
summary(lm_model_subset1)

## 
## Call:
## lm(formula = Response ~ as.numeric(Date), data = ts_data_subset1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8212 -0.3014  0.0571  0.2734  0.7913 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       2.204e+00  9.665e+00   0.228    0.820
## as.numeric(Date) -8.187e-05  5.065e-04  -0.162    0.872
## 
## Residual standard error: 0.356 on 179 degrees of freedom
## Multiple R-squared:  0.0001459,  Adjusted R-squared:  -0.00544 
## F-statistic: 0.02613 on 1 and 179 DF,  p-value: 0.8718

# Linear regression for subset 2
lm_model_subset2 <- lm(Response ~ as.numeric(Date), data = ts_data_subset2)
summary(lm_model_subset2)

## 
## Call:
## lm(formula = Response ~ as.numeric(Date), data = ts_data_subset2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72746 -0.29810 -0.02922  0.24276  1.04180 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -1.765e+00  9.769e+00  -0.181    0.857
## as.numeric(Date)  5.953e-05  5.071e-04   0.117    0.907
## 
## Residual standard error: 0.3653 on 182 degrees of freedom
## Multiple R-squared:  7.572e-05,  Adjusted R-squared:  -0.005418 
## F-statistic: 0.01378 on 1 and 182 DF,  p-value: 0.9067

# Smoothing to detect seasonality
# Fit a regression with a linear trend and Fourier (harmonic) terms.
# The period is set explicitly to one year to match the annual sine cycle
# in the synthetic data, rather than relying on the default period.
seasonal_fit <- ts_data %>%
  model(season = TSLM(Response ~ trend() + fourier(period = "year", K = 2)))

# Extract the fitted values (trend plus Fourier terms) as a tsibble
# with a .fitted column
seasonal_tsibble <- fitted(seasonal_fit)

# Plot the fitted seasonal pattern
autoplot(seasonal_tsibble, .fitted) +
  labs(title = "Fitted Trend and Seasonal Pattern of Time Series",
       x = "Date", y = "Fitted Response")

# Autocorrelation Function (ACF) plot
acf(ts_data$Response, main = "Autocorrelation Function (ACF)")

# Partial Autocorrelation Function (PACF) plot
pacf(ts_data$Response, main = "Partial Autocorrelation Function (PACF)")