library(tidyverse) # for data manipulation
library(forecast) # for time series analysis
library(lubridate) # for date manipulation
library(mlba) # for the datasets used in these exercises
Two different models were fit to the same time series. The first 100 time periods were used for the training set and the last 12 periods were treated as a hold-out set. Assume that both models make sense practically and fit the data pretty well. Below are the RMSE values for each of the models:
Model | Training RMSE | Validation RMSE |
---|---|---|
Model A | 543 | 690 |
Model B | 669 | 675 |
Which model appears more useful for explaining the different components of this time series? Why?
Which model appears to be more useful for forecasting purposes? Why?
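For reference, the sketch below shows how training and validation RMSE values like those in the table are typically obtained with the forecast package; the series y.ts and the ets() model are placeholders for illustration, not Models A or B from the exercise.
# Hypothetical sketch: obtaining training/validation RMSE (y.ts and ets() are assumptions)
nValid <- 12
nTrain <- length(y.ts) - nValid
train.ts <- window(y.ts, end = time(y.ts)[nTrain]) # first nTrain periods
valid.ts <- window(y.ts, start = time(y.ts)[nTrain + 1]) # last 12 periods
fit <- ets(train.ts) # any model fit to the training set only
fc <- forecast(fit, h = nValid)
accuracy(fc, valid.ts) # RMSE appears in the "Training set" and "Test set" rows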
The file DepartmentStoreSales.csv contains data on the quarterly sales for a department store over a 6-year period (data courtesy of Chris Albright).
Create a well-formatted time plot of the data.
# Load the data
data("DepartmentStoreSales") # Load the data to DepartmentStoreSales
# Create a time series object
sales.ts <- ts(DepartmentStoreSales$Sales,
start = c(2000, 1),
# end = c(2005, 4),
frequency = 4)
# Create a time plot
sales.ts |>
autoplot(col = "skyblue") +
labs(title = "Department Store Sales",
x = "Year",
y = "Sales") +
theme_minimal()
# plot using the plot function
plot(sales.ts,
xlab = "Year",
ylab = "Sales",
main = "Department Store Sales")
Which of the four components (level, trend, seasonality, noise) seem to be present in this series?
# Decompose the time series
decompose(sales.ts) |> autoplot()
The file ApplianceShipments.csv contains the series of quarterly shipments (in million $) of US household appliances between 1985 and 1989 (data courtesy of Ken Black).
Create a well-formatted time plot of the data.
# Load the data
data("ApplianceShipments") # Load the data to ApplianceShipments
# Create a time series object
appliance.ts <- ts(ApplianceShipments$Shipments,
start = c(1985, 1),
# end = c(1989, 4),
frequency = 4)
# Create a time plot
appliance.ts |>
autoplot(col = "skyblue") +
labs(title = "Appliance Shipments",
x = "Year",
y = "Shipments") +
theme_minimal()
### b.
Which of the four components (level, trend, seasonality, noise) seem to be present in this series?
# Decompose the time series
decompose(appliance.ts) |> autoplot()
The time plot in Figure 16.5 describes the average annual number of weekly hours spent by Canadian manufacturing workers (data are available in CanadianWorkHours.csv — thanks to Ken Black for the data).
Reproduce the time plot.
# Load the data
data("CanadianWorkHours") # Load the data to CanadianWorkHours
# Create a time series object
workhours.ts <- ts(CanadianWorkHours$Hours,
start = 1966,
end = 2000,
freq = 1)
# Create a time plot
workhours.ts |>
autoplot(col = "skyblue") +
labs(title = "Canadian Manufacturing Workers Workhours",
x = "Year",
y = "Hours Per Week") +
scale_x_continuous(breaks = seq(1965, 2000, 5)) + # Change the x-axis major breaks
scale_y_continuous(breaks = seq(30, 40, 0.5)) + # Change the y-axis major breaks
# remove background
theme(panel.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.background = element_blank(),
panel.border = element_blank(),
axis.line = element_line(colour = "black", linewidth = 0.1))
plot(workhours.ts,
xlab = "Year",
ylab = "Hours Per Week",
main = "Canadian Manufacturing Workers Workhours")
The file SouvenirSales.csv contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, between 1995 and 2001. (Source: Hyndman, R.J., Time Series Data Library, http://data.is/TSDLdemo. Accessed on 07/25/15.)
Back in 2001, the store wanted to use the data to forecast sales for the next 12 months (year 2002). They hired an analyst to generate forecasts. The analyst first partitioned the data into training and validation sets, with the validation set containing the last 12 months of data (year 2001). She then fit a regression model to sales, using the training set.
Create a well-formatted time plot of the data.
# Load the data
data("SouvenirSales") # Load the data to SouvenirSales
# Create a time series object
souvenir.ts <- ts(SouvenirSales$Sales,
start = c(1995, 1),
# end = c(2001, 12),
frequency = 12)
# Create a time plot
souvenir.ts |>
autoplot(col = "skyblue") +
labs(title = "Souvenir Sales",
x = "Year",
y = "Sales") +
theme_minimal()
### b.
Change the scale on the x-axis, or on the y-axis, or on both to log-scale in order to achieve a linear relationship. Select the time plot that seems most linear.
# Load the library
library(scales)
# Create a time plot with natural logarithm scale
souvenir.ts |>
autoplot(col = "skyblue") +
labs(title = "Souvenir Sales",
x = "Year",
y = "Sales (ln)") +
# Change the scale to natural logarithm
scale_y_continuous(trans = trans_new("ln", transform = log, inverse = exp)) +
theme_minimal()
# Create a time plot with log10 scale
souvenir.ts |>
autoplot(col = "skyblue") +
labs(title = "Souvenir Sales",
x = "Year",
y = "Sales (log10)") +
# Change the scale to log10
scale_y_log10() +
theme_minimal()
# plot using the plot function, with both axes on a log10 scale
plot(souvenir.ts,
xlab = "Year",
ylab = "Sales",
main = "Souvenir Sales",
log='xy')
Comparing the two time plots, what can be said about the type of trend in the data?
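One way to make the comparison concrete is to overlay a fitted linear trend and a fitted exponential (log-linear) trend on the series; a minimal base-R sketch, assuming souvenir.ts created above.
# Sketch: overlay a linear and an exponential trend (assumes souvenir.ts from above)
t <- as.numeric(time(souvenir.ts))
y <- as.numeric(souvenir.ts)
linear.fit <- lm(y ~ t) # linear trend on the original scale
expo.fit <- lm(log(y) ~ t) # linear trend on the log scale = exponential trend
plot(t, y, type = "l", xlab = "Year", ylab = "Sales", main = "Linear vs. Exponential Trend")
lines(t, fitted(linear.fit), col = "red", lty = 2)
lines(t, exp(fitted(expo.fit)), col = "blue", lty = 2) # back-transform to the original scale
legend("topleft", legend = c("linear trend", "exponential trend"),
col = c("red", "blue"), lty = 2, bty = "n")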
Why were the data partitioned? Partition the data into training and validation sets as explained above.
Answer: In forecasting, the data are partitioned into training and validation sets so that model performance can be evaluated on data not used for fitting. Because the observations are ordered in time, the split is temporal: the most recent 12 months (year 2001) form the validation set and all earlier months form the training set.
# Partition the data into training and validation sets
nValid <- 12
nTrain <- length(souvenir.ts) - nValid
train.ts <- window(souvenir.ts, start = c(1995, 1), end = c(1995, nTrain))
valid.ts <- window(souvenir.ts, start = c(1995, nTrain + 1), end = c(1995, nTrain + nValid))
naive.pred <- naive(train.ts, h = nValid)
snaive.pred <- snaive(train.ts, h = nValid)
# plot the logarithmic series (training set)
plot(train.ts,
ylab = "Sales",
xlab = "Time",
bty = "l",
xaxt = "n",
xlim = c(1995,2002),
main = "Time plot log of monthly sales for a souvenir shop\n(training set)",
log='xy'
)
axis(1, at = seq(1995, 2002, 1), labels = format(seq(1995, 2002, 1)))
lines(snaive.pred$mean, lwd = 1, col = "skyblue", lty = 1)
lines(valid.ts, col = "black", lty = 3)
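The regression model mentioned in the problem statement is not shown above. Below is a minimal sketch of one plausible choice, a regression with an exponential trend and monthly seasonality (tslm() with lambda = 0), together with validation-period accuracy for it and for the naive benchmarks; the model form is an assumption, not necessarily the analyst's model.
# Sketch: regression with exponential trend and seasonality, fit to the training set
# (the model form is an assumption, not necessarily the analyst's model)
souvenir.lm <- tslm(train.ts ~ trend + season, lambda = 0)
souvenir.lm.pred <- forecast(souvenir.lm, h = nValid)
# Compare forecasts against the validation set
accuracy(souvenir.lm.pred, valid.ts)
accuracy(naive.pred, valid.ts)
accuracy(snaive.pred, valid.ts)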
The file ShampooSales.csv contains data on the monthly sales of a certain shampoo over a 3-year period. (Source: Hyndman, R.J., Time Series Data Library, http://data.is/TSDLdemo. Accessed on 07/25/15.)
### a.
Create a well-formatted time plot of the data.
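The time plot code for this part is missing above; the sketch below creates the time series object and plot, assuming the ShampooSales data frame stores the sales figures in its last column and that the monthly series starts in January 1995 (both are assumptions; adjust to the actual file).
# Load the data
data("ShampooSales") # Load the data to ShampooSales
# Create a time series object (column position and start date are assumptions)
shampoo.ts <- ts(ShampooSales[[ncol(ShampooSales)]], # assumed: sales are in the last column
start = c(1995, 1), # assumed start year; the series covers 3 years
frequency = 12)
# Create a time plot
shampoo.ts |>
autoplot(col = "skyblue") +
labs(title = "Shampoo Sales",
x = "Year",
y = "Sales") +
theme_minimal()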
# Decompose the time series
decompose(shampoo.ts) |> autoplot()
# Check the seasonality statistically (optional)
# isSeasonal(shampoo.ts) # requires the seastests package
Do you expect to see seasonality in sales of shampoo? Why?
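A season plot can support the answer visually by overlaying the months of each year; a minimal sketch, assuming shampoo.ts from the sketch above (ggseasonplot() is from the forecast package).
# Sketch: visual check for seasonality (assumes shampoo.ts defined above)
ggseasonplot(shampoo.ts, year.labels = TRUE) +
labs(title = "Seasonal Plot: Shampoo Sales") +
theme_minimal()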
If the goal is forecasting sales in future months, which of the following steps should be taken?
# Install and load required packages
# install.packages("GGally")
library(GGally)
library(ggplot2)
# Load the diamonds dataset from ggplot2
data(diamonds)
# Create the pair plot with correlation and color by cut
ggpairs(
diamonds,
columns = c("carat", "depth", "table", "price"),
aes(color = cut, alpha = 0.3),
lower = list(continuous = wrap("smooth", se = FALSE, alpha = 0.3)),
diag = list(continuous = wrap("densityDiag", alpha = 0.3)),
upper = list(continuous = wrap("cor", size = 3)),
title = "Pair Plot" # ggmatrix objects take their title here rather than via labs()
) +
theme_minimal()