Week 12 Data Dive

Variables Selected

For this Week 12 data dive, Year was selected as the time variable because it directly represents the time dimension in the dataset and can be converted into a proper Date in R for time-based analysis. Deaths was selected as the response variable because it is a numeric count that can be meaningfully summed by year and tracked over time. To keep the yearly totals interpretable and avoid double counting, the aggregate rows for United States (in State) and All causes (in Cause.Name) were excluded before summarizing. The yearly total is therefore the sum of deaths across the individual states and individual causes that remain in the data.
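The double-counting concern can be illustrated with a small made-up table. The counts below are hypothetical, chosen only so the arithmetic is easy to follow; the column names mirror the ones used in this data dive, and `toy`, `kept`, `naive_total`, and `filtered_total` are names introduced here for illustration.

```r
# Hypothetical toy rows: two states, two causes, plus the aggregate rows
toy <- data.frame(
  State      = rep(c("A", "B", "United States"), each = 3),
  Cause.Name = rep(c("Heart disease", "Cancer", "All causes"), times = 3),
  Deaths     = c(10,  5, 20,   # state A
                 20, 10, 40,   # state B
                 30, 15, 60)   # national rows = A + B
)

# Naive total: every death is counted more than once
naive_total <- sum(toy$Deaths)

# Keep only individual states and individual causes
kept <- subset(toy, State != "United States" & Cause.Name != "All causes")
filtered_total <- sum(kept$Deaths)

naive_total     # 210
filtered_total  # 45
```

In this toy table the filtered total (45) is smaller than the national All causes figure (60), because deaths from causes that are not listed individually appear only in the All causes rows. Whether the same gap exists in the real data depends on which causes the dataset lists.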

Data Preparation

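# Load the packages used below: tidyverse (dplyr, ggplot2, tibble) and tsibble
library(tidyverse)
library(tsibble)

# Summarize the data so there is one total deaths value for each year, excluding United States and All causes to avoid double counting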
data_yearly <- data |>
  filter(State != "United States", Cause.Name != "All causes") |>
  group_by(Year) |>
  summarize(
    total_deaths = sum(Deaths, na.rm = TRUE)
  ) |>
  mutate(
    Date = as.Date(paste0(Year, "-01-01"))
  )

Creating tsibble

# Create a tsibble using the Date column as the time index and keep only the date and total deaths variable
data_yearly_tsibble <- data_yearly |>
  select(Date, total_deaths) |>
  as_tsibble(index = Date)
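One detail worth checking is the interval that tsibble infers from the index. A minimal sketch, assuming tsibble's usual interval rules: a Date index is measured in days, so yearly dates spaced 365 or 366 days apart may be read as a daily series with gaps, while an integer year index is read as yearly. The values below are placeholders rather than the real totals, and `toy_yearly`, `ts_date`, and `ts_year` are hypothetical names.

```r
library(tsibble)

# Placeholder values only; the real totals live in data_yearly
toy_yearly <- data.frame(
  Year  = 2000:2017,
  value = seq_len(18)
)
toy_yearly$Date <- as.Date(paste0(toy_yearly$Year, "-01-01"))

# Date index: the interval is inferred in days
ts_date <- as_tsibble(toy_yearly[, c("Date", "value")], index = Date)

# Integer year index: the interval is inferred in years
ts_year <- as_tsibble(toy_yearly[, c("Year", "value")], index = Year)

# Compare the inferred intervals and whether implicit gaps are reported
interval(ts_date)
interval(ts_year)
has_gaps(ts_date)
has_gaps(ts_year)
```

The plots in this data dive are unaffected either way, since ggplot only needs the Date values. The distinction would matter mainly for functions that rely on the interval, such as gap filling or forecasting models.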

Visualizations of Trend Over Time

# Plot the total deaths over time to visualize the overall trend across years after excluding aggregate rows
ggplot(data_yearly_tsibble, aes(x = Date, y = total_deaths / 1000000)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Total Deaths Over Time",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  coord_cartesian(ylim = c(1.7, 2.3))

# Plot a narrower time window to examine the later years more closely for any changes in pattern
data_yearly_tsibble |>
  filter(Date >= as.Date("2008-01-01")) |>
  ggplot(aes(x = Date, y = total_deaths / 1000000)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Total Deaths Over Time (2008-2017)",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  coord_cartesian(ylim = c(1.7, 2.3))

Analysis of Immediate Stand-Out

The first thing that stands out is that total deaths in the filtered dataset remain fairly stable in the earlier years, with moderate fluctuations from 2000 through about 2009. After that, the pattern begins to shift upward, and the increase becomes more noticeable in the later years, especially from about 2013 through 2017. The second, narrower time window makes this later upward movement easier to see, suggesting that the most recent years may follow a different trend than the earlier part of the series.

Linear Regression Models

# Fit a linear regression model for the full time period to estimate the overall trend in total deaths over time
lm_full <- lm(total_deaths ~ Year, data = data_yearly)

summary(lm_full)

## 
## Call:
## lm(formula = total_deaths ~ Year, data = data_yearly)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -74800 -48741 -12031  51050 110752 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -12520771    5286033  -2.369   0.0308 *
## Year             7185       2632   2.730   0.0148 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57930 on 16 degrees of freedom
## Multiple R-squared:  0.3178, Adjusted R-squared:  0.2751 
## F-statistic: 7.453 on 1 and 16 DF,  p-value: 0.01483

# Fit a linear regression model for the earlier years to estimate the trend from 2000 through 2009
lm_early <- data_yearly |>
  filter(Year <= 2009) |>
  lm(total_deaths ~ Year, data = _)

summary(lm_early)

## 
## Call:
## lm(formula = total_deaths ~ Year, data = filter(data_yearly, 
##     Year <= 2009))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -19618 -13356  -7142  18299  20684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 17274532    3856816   4.479  0.00206 **
## Year           -7680       1924  -3.992  0.00400 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17480 on 8 degrees of freedom
## Multiple R-squared:  0.6657, Adjusted R-squared:  0.6239 
## F-statistic: 15.93 on 1 and 8 DF,  p-value: 0.003997

# Fit a linear regression model for the later years to estimate the trend from 2010 through 2017
lm_late <- data_yearly |>
  filter(Year >= 2010) |>
  lm(total_deaths ~ Year, data = _)

summary(lm_late)

## 
## Call:
## lm(formula = total_deaths ~ Year, data = filter(data_yearly, 
##     Year >= 2010))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -25605 -19293   4764  14960  24796 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -66743309    6388120  -10.45 4.51e-05 ***
## Year            34115       3173   10.75 3.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20560 on 6 degrees of freedom
## Multiple R-squared:  0.9507, Adjusted R-squared:  0.9424 
## F-statistic: 115.6 on 1 and 6 DF,  p-value: 3.823e-05

Linear Regression Trend Summary and Explanation

Subsetting the data was necessary since the earlier and later years showed different trend directions, and a single regression line across the full period would have masked that change.

Linear regression showed that the full 2000–2017 period has an overall upward trend in total deaths: the coefficient for Year was positive (7185, or roughly 7,200 additional deaths per year) and statistically significant (p = 0.0148). However, this model explains only about 32% of the variation in yearly totals (\(R^2\) = 0.3178), so a single straight line does not describe the full series well. Since the visualizations suggested a change in pattern over time, the data were split into an earlier and a later period.

For 2000–2009, the coefficient for Year was negative (-7680) and statistically significant (p = 0.00400), indicating a downward trend in total deaths during the earlier years. This trend was fairly strong, with an \(R^2\) of 0.6657. For 2010–2017, the coefficient for Year was positive (34115) and highly significant (p = 3.82e-05), showing a strong upward trend in the later years. This later model fit best of the three, with an \(R^2\) of 0.9507. The two slopes have opposite signs, and the residual standard error drops from 57930 in the full model to 17480 and 20560 in the subset models, which supports fitting the two periods separately. With only 10 and 8 yearly observations, however, both subset estimates should be read with caution.
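To make the slopes concrete, the printed coefficients can be plugged back into each fitted line. This is a rough check only: the coefficients are copied from the rounded summary() output above, so the results are approximate, and `coefs` and `fitted_at` are names introduced here for illustration.

```r
# Coefficients copied from the printed summary() output (rounded as shown)
coefs <- data.frame(
  model     = c("full", "early", "late"),
  intercept = c(-12520771, 17274532, -66743309),
  slope     = c(7185, -7680, 34115)
)

# Hypothetical helper: evaluate intercept + slope * year for one model
fitted_at <- function(model, year) {
  row <- coefs[coefs$model == model, ]
  row$intercept + row$slope * year
}

fitted_at("early", c(2000, 2009))  # 1914532 1845412
fitted_at("late",  c(2010, 2017))  # 1827841 2066646
fitted_at("full",  c(2000, 2017))  # 1849229 1971374
```

Under these rounded coefficients, the early line falls by about 69,000 deaths across 2000–2009 (9 × 7680 = 69120), while the late line rises by about 239,000 across 2010–2017 (7 × 34115 = 238805).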

Smoothing, Seasonality, and Autocorrelation

# Plot total deaths over time with a smooth curve to highlight the broader long-term pattern in the yearly series
ggplot(data_yearly_tsibble, aes(x = Date, y = total_deaths / 1000000)) +
  geom_line() +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(
    title = "Smoothed Total Deaths Over Time",
    x = "Year",
    y = "Total Deaths (Millions)"
  ) +
  coord_cartesian(ylim = c(1.7, 2.3))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Plot the autocorrelation function for yearly total deaths to examine repeated dependence across yearly lags
acf(data_yearly$total_deaths, main = "ACF of Total Deaths")

# Plot the partial autocorrelation function for yearly total deaths to examine direct dependence at different yearly lags
pacf(data_yearly$total_deaths, main = "PACF of Total Deaths")

Smoothing, Seasonality, and Autocorrelation Summary and Analysis

The smoothing plot highlights the broader pattern in total deaths over time after excluding the aggregate rows for United States and All causes. The smoothed curve shows a slight decline in the earlier years, followed by a clearer upward movement in the later years, especially after about 2010. This supports the earlier regression results, which suggested that the full period does not follow one single simple trend and that the later years behave differently from the earlier years.

To investigate possible seasonality, smoothing was used along with the ACF and PACF plots. However, an important limitation of this dataset is that the time series is measured only at the year level. Since there is only one observation per year, the data do not contain the kind of repeated within-year pattern that is normally needed to identify true seasonality, such as monthly or quarterly cycles. For that reason, the smoothing results are better interpreted as showing a long-term pattern or multi-year structure rather than strong traditional seasonality.
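The limitation can be seen directly in base R, where classical decomposition requires more than one observation per period. The sketch below uses made-up placeholder values, not the mortality totals, and simply shows that a series with frequency = 1 has no seasonal component to estimate.

```r
# Placeholder annual series: 18 made-up values, one observation per year
annual <- ts(
  c(5, 7, 6, 8, 7, 9, 8, 10, 9, 11, 10, 12, 11, 13, 12, 14, 13, 15),
  start = 2000, frequency = 1
)

# decompose() needs a seasonal period longer than one observation,
# so the call is wrapped to capture the error message instead of stopping
result <- tryCatch(
  decompose(annual),
  error = function(e) conditionMessage(e)
)

result  # an error message rather than a trend/seasonal/remainder split
```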

The ACF plot shows positive autocorrelation at the smallest lags, especially at lag 1, which suggests that total deaths in one year are related to values in nearby years. After the first few lags, the autocorrelations weaken and do not show a clear repeating cycle, so the plot does not provide strong evidence of a regular seasonal pattern. The PACF plot also shows one clear significant spike at lag 1, which suggests that the strongest direct relationship is between a given year and the immediately previous year rather than a repeated seasonal interval.
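When reading these plots it helps to know how wide the significance bounds are. With the default 95% setting, acf() and pacf() draw their dashed lines at approximately ±qnorm(0.975) / sqrt(n). The sketch below evaluates that formula for the 18 yearly observations (16 residual degrees of freedom plus 2 coefficients in the full model) and, for comparison, for a generic year of daily data; `acf_bound` is a helper name introduced here.

```r
# Approximate 95% bounds drawn by acf()/pacf(): +/- qnorm(0.975) / sqrt(n)
acf_bound <- function(n) qnorm(0.975) / sqrt(n)

n_annual <- 18   # yearly observations, 2000 through 2017
n_daily  <- 366  # for comparison: one leap year of daily observations

round(acf_bound(n_annual), 3)  # 0.462
round(acf_bound(n_daily), 3)   # 0.102

# Default number of lags shown by acf() for a single series
floor(10 * log10(n_annual))    # 12
floor(10 * log10(n_daily))     # 25
```

With only 18 points, an autocorrelation has to exceed roughly 0.46 in absolute value to cross the bounds, so only fairly strong dependence can register as significant in the yearly series.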

Overall, the smoothing and autocorrelation results suggest that the yearly series has persistence and a changing long-term trend, but not strong traditional seasonality. This is still meaningful since it shows that the series is not random from year to year, while also making clear that annual data limit how confidently seasonality can be evaluated.

Supplementary Wikipedia Time Series Example

To better understand how smoothing, ACF, and PACF work when a dataset has a finer time scale, I also explored a daily Wikipedia pageviews series for Heart disease, which is directly related to one of the causes represented in the mortality dataset.

# Retrieve daily Wikipedia pageviews for the "Heart disease" article
wiki_url <- paste0(
  "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/",
  "en.wikipedia/all-access/user/Heart_disease/daily/20240101/20241231"
)

wiki_raw <- jsonlite::fromJSON(wiki_url)

wiki_data <- tibble(
  date = as.Date(substr(wiki_raw$items$timestamp, 1, 8), format = "%Y%m%d"),
  views = wiki_raw$items$views
)

# Convert the Wikipedia pageviews data into a tsibble
wiki_tsibble <- wiki_data |>
  as_tsibble(index = date)

# Plot daily Wikipedia pageviews with a smooth curve to show the broader pattern
ggplot(wiki_tsibble, aes(x = date, y = views)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  labs(
    title = 'Daily Wikipedia Pageviews for "Heart disease"',
    x = "Date",
    y = "Views"
  )

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

# Plot the autocorrelation function to examine repeated dependence across daily lags
acf(wiki_tsibble$views, main = 'ACF of Wikipedia Pageviews for "Heart disease"')

# Plot the partial autocorrelation function to examine direct lag relationships
pacf(wiki_tsibble$views, main = 'PACF of Wikipedia Pageviews for "Heart disease"')

Supplementary Wikipedia Example Summary and Analysis

The supplementary series consists of daily Wikipedia pageviews for the article “Heart disease” during 2024. The article's topic is directly related to one of the causes in the mortality dataset, and daily data provide a more appropriate setting for exploring short-term temporal dependence than the annual mortality data alone.

The time series plot shows that daily pageviews were relatively stable for much of 2024, with moderate fluctuations around a fairly consistent level, followed by a noticeable increase later in the year. Near the end of the series, there are several sharp spikes in pageviews, suggesting that interest in the topic increased substantially during that period. The smoothing curve reinforces this interpretation by showing a broader upward trend over time, especially in the final months of the year. Rather than indicating a clean repeating seasonal cycle, the smoothed pattern appears to reflect changing overall interest and several large short-term surges.

The ACF plot shows strong positive autocorrelation at small lags that gradually declines as the lag increases. This means that pageview counts on a given day are strongly related to pageview counts on nearby previous days, and that this dependence persists across multiple lags. The slow decay in the autocorrelation suggests a sustained temporal structure in the series rather than random day-to-day noise. However, the ACF does not show a distinct repeating spike pattern that would clearly indicate a strong seasonal cycle over the lags displayed.
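For contrast, the sketch below shows what a clear weekly cycle would look like in an ACF. The series is simulated, not Wikipedia data: a constant level with a dip on the last two days of each week plus random noise. All names and parameter values are assumptions chosen for illustration.

```r
set.seed(1)

# Simulated daily series: 52 weeks with a dip on days 6 and 7 of each week
n_weeks <- 52
day_in_week <- rep(1:7, times = n_weeks)
weekly_effect <- ifelse(day_in_week >= 6, -30, 0)
sim_views <- 100 + weekly_effect + rnorm(7 * n_weeks, sd = 5)

# Compute the ACF without plotting; element k + 1 holds lag k
sim_acf <- acf(sim_views, lag.max = 21, plot = FALSE)

# Autocorrelations at lags 3, 7, 14, and 21
round(sim_acf$acf[c(4, 8, 15, 22)], 2)
```

In a series like this one, the autocorrelations at lags 7, 14, and 21 stand well above those at the lags in between. That repeating spike pattern is the signature the pageview ACF was checked for.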

The PACF plot shows one especially large spike at the first lag, while most later lags are much smaller and closer to the significance bounds. This suggests that the strongest direct relationship is between a day’s pageviews and the immediately preceding day’s pageviews. After accounting for that first lag, the additional direct contribution of later lags appears to be much weaker. In practical terms, this means the series shows short-term persistence, where recent values help explain current values, but it does not provide strong evidence of a simple repeating seasonal pattern in the period shown.
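A gradually decaying ACF combined with a single large PACF spike at lag 1 is the pattern typically produced by a first-order autoregressive process. The sketch below simulates such a process to show the pattern in isolation. It is not a claim that the pageview series is AR(1); the coefficient 0.8 and the series length are arbitrary choices.

```r
set.seed(42)

# Simulated AR(1) series: each value is 0.8 times the previous value plus noise
sim_ar1 <- arima.sim(model = list(ar = 0.8), n = 1000)

# Compute both functions without plotting
ar1_acf  <- acf(sim_ar1,  lag.max = 10, plot = FALSE)
ar1_pacf <- pacf(sim_ar1, lag.max = 10, plot = FALSE)

# acf() output starts at lag 0, pacf() output starts at lag 1
round(ar1_acf$acf[2:6], 2)   # lags 1-5: gradual decay expected
round(ar1_pacf$acf[1:5], 2)  # lags 1-5: large at lag 1, near zero afterwards
```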

Overall, this supplementary example demonstrates why smoothing, ACF, and PACF are more informative for a daily time series than for my original annual mortality dataset. The Wikipedia pageviews data clearly show short-term dependence and changing trend over time, whereas the annual mortality data are too coarse to meaningfully assess true seasonality. For that reason, the mortality dataset is better interpreted in terms of long-term trend and persistence rather than within-year seasonal structure.

Data Dive Summary

This Week 12 data dive examined how total deaths changed over time after converting Year into a proper Date and filtering out the aggregate rows for United States and All causes to avoid double counting. The time-series visualizations showed that deaths stayed within a fairly narrow range in the earlier years and then rose more clearly in the later years, especially after about 2010. Linear regression showed that a single line for the full period fit poorly (\(R^2\) of 0.3178), and that splitting the series into earlier and later periods gave a much clearer picture: a modest downward trend from 2000 through 2009 followed by a strong upward trend from 2010 through 2017.

The smoothing and autocorrelation results added to this interpretation by showing that the yearly series has a persistent long-term structure rather than random year-to-year movement. However, since the data are annual, the analysis could not assess traditional seasonality in the way monthly or quarterly data might allow. To work around this limitation, a daily Wikipedia pageviews time series for “Heart disease” was also explored to practice interpreting smoothing, ACF, and PACF on a higher-frequency dataset. That example showed short-term dependence and a changing trend over time, illustrating how these tools are more informative when the data have finer time resolution. Overall, this data dive suggests that the time pattern in deaths is meaningful but not uniform across the full period, and that a useful next step would be to investigate what broader demographic, reporting, or health-related changes may help explain the stronger rise seen in the later years.