Week-12: Data Dive Notebook

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(xts)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## ######################### Warning from 'xts' package ##########################
## #                                                                             #
## # The dplyr lag() function breaks how base R's lag() function is supposed to  #
## # work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
## # source() into this session won't work correctly.                            #
## #                                                                             #
## # Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
## # conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
## # dplyr from breaking base R's lag() function.                                #
## #                                                                             #
## # Code in packages is not affected. It's protected by R's namespace mechanism #
## # Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
## #                                                                             #
## ###############################################################################
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(tsibble)

## Registered S3 method overwritten by 'tsibble':
##   method               from 
##   as_tibble.grouped_df dplyr
## 
## Attaching package: 'tsibble'
## 
## The following object is masked from 'package:zoo':
## 
##     index
## 
## The following object is masked from 'package:lubridate':
## 
##     interval
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, union

theme_set(theme_minimal())
options(scipen = 6)

Loading Time Series Data:

CVD <- read.csv("/Users/rupeshswarnakar/Downloads/pageviews-20191028-20241117.csv")
head(CVD)

Above is the time series data for the Cardiovascular diseases extracted from the Wikipedia page view. The Date column represent the different dates for number of Cardiovascular disease reported. The data represented above includes the time of 5 years from 2019 to 2024.

Clean Dataset:

The dataset for time series can be cleaned by performing various operations such as changing date format, renaming columns, transforming into tidy data format etc. Some of the operations are given below.

Rename Column:

CVD_clean <- CVD |>   
  select(Date, Cardiovascular.disease) |>          
  rename(date = Date)             

# Check the cleaned dataset
head(CVD_clean)

Just for the sake of easiness, the Date column is renamed as date as shown above.

Tidy Data Format:

CVD_cleaner <- CVD_clean |> 
  mutate(date = as.Date(date)) |>  
  select(date, Cardiovascular.disease)  

CVD_ts <- CVD_cleaner |> 
  as_tsibble(index = date)

print(CVD_ts)

## # A tsibble: 1,848 x 2 [1D]
##    date       Cardiovascular.disease
##    <date>                      <int>
##  1 2019-10-28                   1708
##  2 2019-10-29                   1689
##  3 2019-10-30                   1766
##  4 2019-10-31                   1711
##  5 2019-11-01                   1541
##  6 2019-11-02                   1468
##  7 2019-11-03                   1534
##  8 2019-11-04                   1885
##  9 2019-11-05                   1852
## 10 2019-11-06                   1875
## # ℹ 1,838 more rows

The date is converted into a compatibe Date format which then is transformed into a tidy data format so as to be able to work with ‘tsibble’ R package for time series data.

Visualization:

Let’s visualize Cardiovascular disease over time period. For the sake of deeper analysis, let’s create two visualization, one for 2 months period and another one for 6 months period.

a. Plot A: For 2 months

CVD_ts |>
  filter(date >= as.Date("2024-10-01")) |>  
  ggplot() +
  geom_line(mapping = aes(x = date, y = Cardiovascular.disease), linewidth=0.3) +
  labs(title = '"Cardiovascular Disease" Page Views on Wikipedia') +
  theme_hc()

b. Plot B: For 6 months

CVD_ts |>
  filter(date >= as.Date("2024-07-01")) |>   
  ggplot() +
  geom_line(mapping = aes(x = date, y = Cardiovascular.disease), linewidth=0.3) +
  labs(title = '"Cardiovascular Disease" Page Views on Wikipedia') +
  theme_hc()

Interpretation:

From the above Plot A and Plot B, there are few information that can be extracted at a glance. Some of them are listed below.

a. These plots exhibit some kind of trends.

b. More specifically, Plot A shows a downward trend while Plot B shows an upward trend.

c. There are presence of spikes in the plot, indicating some sort of seasonality in the trend.

The trend and the seasonality can be further analyzed in the below sections.

Linear Regression:

Let’s further explore the trend by using linear regression model. Also, let’s explore the strength of the trends interpreting the summary of the linear regression model.

Since we already know the trend differs by the time period, let’s create three linear regression model for 6 months, 1 year and entire 5 years as shown below.

a. Model A: For 6 months

trend_model_1 <- lm(Cardiovascular.disease ~ date, data = CVD_ts |> filter(date >= as.Date("2024-07-01")))

summary(trend_model_1)

## 
## Call:
## lm(formula = Cardiovascular.disease ~ date, data = filter(CVD_ts, 
##     date >= as.Date("2024-07-01")))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -261.498  -78.696   -3.685   51.846  299.551 
## 
## Coefficients:
##                Estimate  Std. Error t value      Pr(>|t|)    
## (Intercept) -23648.9458   3960.9589  -5.971 0.00000001898 ***
## date             1.2363      0.1983   6.234 0.00000000518 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.82 on 138 degrees of freedom
## Multiple R-squared:  0.2198, Adjusted R-squared:  0.2141 
## F-statistic: 38.87 on 1 and 138 DF,  p-value: 0.000000005176

CVD_ts |> 
  filter(date >= as.Date("2024-07-01")) |>   
  ggplot() +
  geom_line(mapping = aes(x = date, y = Cardiovascular.disease), linewidth=0.3) +  
  geom_smooth(mapping = aes(x = date, y = Cardiovascular.disease), method = "lm", se = FALSE, color = "red") +  # Trend line
  labs(title = '"Cardiovascular Disease" Page Views on Wikipedia') +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

From the above plot we can see that the trend is going upward as time increases. This shows a positive correlation between Cardiovascular disease with time period. Looking at the summary of the model, we see that cardiovascular disease increases by 1.2363 unit as date increases by 1 unit. And, this is statistically very significant as it is evidenced by very low p-value of 0.00000000518.

Also, The R-squared value of 0.2198 shows that 22% of the variance in the Cardiovascular disease by time period is captured in this linear regression model. This value is not very high indicating the rest of the variance explained by other attributes could be captured by other models or seasonality. Hence the trend is not very strong.

b. Model B: For 12 months

trend_model_2 <- lm(Cardiovascular.disease ~ date, data = CVD_ts |> filter(date >= as.Date("2024-01-01")))

summary(trend_model_2)

## 
## Call:
## lm(formula = Cardiovascular.disease ~ date, data = filter(CVD_ts, 
##     date >= as.Date("2024-01-01")))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -315.19 -117.83  -16.53   89.95 1315.65 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15094.5449  2056.0100   7.342 1.76e-12 ***
## date           -0.7046     0.1034  -6.814 4.71e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 172.5 on 320 degrees of freedom
## Multiple R-squared:  0.1267, Adjusted R-squared:  0.124 
## F-statistic: 46.43 on 1 and 320 DF,  p-value: 4.713e-11

CVD_ts |> 
  filter(date >= as.Date("2024-01-01")) |>   
  ggplot() +
  geom_line(mapping = aes(x = date, y = Cardiovascular.disease), linewidth=0.3) + 
  geom_smooth(mapping = aes(x = date, y = Cardiovascular.disease), method = "lm", se = FALSE, color = "red") +  # Trend line
  labs(title = '"Cardiovascular Disease" Page Views on Wikipedia') +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

From the above plot we can see that the trend is going downward as time increases. This shows a negative correlation between Cardiovascular disease with time period. Looking at the summary of the model, we see that cardiovascular disease decreases by 0.7046 unit as date increases by 1 unit. And, this is also statistically very significant as it is evidenced by very low p-value of 4.71e-11.

Also, The R-squared value of 0.1267 shows that 12.67% of the variance in the Cardiovascular disease by time period is captured in this linear regression model. This value is low which indicates the rest of the variance explained by other attributes could be captured by other models or seasonality. This is also a weak trend overall.

b. Model C: For 5 years

trend_model <- lm(Cardiovascular.disease ~ date, data = CVD_ts |> filter(date >= as.Date("2019-01-01")))

summary(trend_model)

## 
## Call:
## lm(formula = Cardiovascular.disease ~ date, data = filter(CVD_ts, 
##     date >= as.Date("2019-01-01")))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -722.6 -217.3  -57.8  104.9 7161.0 
## 
## Coefficients:
##                Estimate  Std. Error t value Pr(>|t|)    
## (Intercept) 10442.84153   362.61517   28.80   <2e-16 ***
## date           -0.47406     0.01896  -25.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 434.8 on 1846 degrees of freedom
## Multiple R-squared:  0.253,  Adjusted R-squared:  0.2526 
## F-statistic: 625.3 on 1 and 1846 DF,  p-value: < 2.2e-16

CVD_ts |> 
  filter(date >= as.Date("2019-01-01")) |>   
  ggplot() +
  geom_line(mapping = aes(x = date, y = Cardiovascular.disease), linewidth=0.3) +  
  geom_smooth(mapping = aes(x = date, y = Cardiovascular.disease), method = "lm", se = FALSE, color = "red") +  # Trend line
  labs(title = '"Cardiovascular Disease" Page Views on Wikipedia') +
  theme_hc()

## `geom_smooth()` using formula = 'y ~ x'

This plot is basically a representation of entire 5 years of time series data for Cardiovascular disease. The trend overall is downward indicating a negative correlation between Cardiovascular disease and time period.

Looking at the summary of the model, we see that cardiovascular disease decreases by 0.47406 unit as date increases by 1 unit. This is not a huge depreciation of heart disease count. However, this negative relation is statistically very significant as it is evidenced by very low p-value of <2e-16.

Overall, the R-squared value is 0.253 which means 25.3% variance occurred in the Cardiovascular disease is defined by the simple linear model. It can be better modeled by using seasonality model as described below.

Seasonality:

Since, a simple linear regression model does not capture the entire variance of the time series data as evidenced by the R-squared value above, seasonality of data can be explored to better define the time series data of Cardiovascular disease as below.

# Filtering the data to analyze between june to september of 2024
CVD_ts_filtered_year <- CVD_ts |> 
  filter(date >= as.Date("2024-06-01") & date <= as.Date("2024-08-31"))

# Applying Loess smoothing
CVD_ts_filtered_year$loess_smoothed <- loess(Cardiovascular.disease ~ as.numeric(date), data = CVD_ts_filtered_year, span = 0.1)$fitted

ggplot(CVD_ts_filtered_year, aes(x = date)) +
  geom_line(aes(y = Cardiovascular.disease), color = "gray") +
  geom_line(aes(y = loess_smoothed), color = "red") +
  labs(title = "Loess Smoothed Cardiovascular Disease Page Views (2024)") +
  theme_minimal()

From the above Loess smoothing, we can observe some kind of pattern over time. Specifically, we can observe that there are 4 spikes every month if we have to analyze the data from June to September of 2024, for instance. This means the time series indicates weekly seasonality.

Autocorrelation:

Let’s use Autocorrelation to further illustrate the seasonality. Let’s use the lag value as 7 to study the relationship between the observed data on each day along with the data lagging 7 days prior.

After then, the new data frame using lag 7 can be used to plot ACF vs lag. The acf plot can be studied to indicate the seasonality of time series data for Cardiovascular disease.

CVD_clean |>
  mutate(Cardiovascular.disease_lag7 = lag(Cardiovascular.disease, 7)) |>
  drop_na()

Here, the data frame is mutated to add number of cardiovascular disease lagged by 7 days so as to study the correlation between data with each day along side the data lagged by 7 days.

ACF Plot:

Let’s create a ACF plot to find the seasonality of the time series data for Cardiovascular disease.

acf(CVD_clean$Cardiovascular.disease, ci = 0.95, na.action = na.exclude)

From the above ACF plot, we can clearly observe that there is occurrence of regular pattern of peak on the 7 days interval. Specifically, we can see a spike on 6th, 14th, 21st, 28th days. This clearly indicate the presence of weekly seasonality of the time series data for Cardiovascular disease.

Summary:

The time series data for Cardiovascular disease as overall shows a negative trend over time. With that, the data also highlight the presence of weekly seasonality. These statements are statistically strong supported by a very low p-values.

Looking from the real world perspective, the above conclusions obtained from the analysis seems intuitive. The Cardiovascular disease worldwide seems to exhibit a declining trend. This is true because now we have access to many different health care solutions. We are more aware than before regarding heart diseases. And, we are well educated on taking preventive measures to mitigate heart diseases by modifying lifestyles and eating behaviors.

Also, the presence of weekly seasonality in heart diseease pattern seems true. This is because most of the time people seems to appoint health check-up on free days/weekends. Not to mention, people enjoy their life, drink beverages, party, etc. during weekends which shows a spike due to reported heart related cases on weekends. With that, there are reported cases on week days also, as seen in the above ACF plot. However, the weekend spike is higher than that of weekdays. Hence, the time series data for Cardiovascular disease indicates weekly seasonality.