package

Chapter 2

2.1

Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.

Use ? (or help()) to find out about the data in each series. What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.

bricks

aus_production %>% autoplot(.vars = Bricks) + 
                   labs(title = "Brick production", x = "Time", y = "production")
## Warning: Removed 20 rows containing missing values (`geom_line()`).

based on the x axis, interval is one quarter

lynx

autoplot(pelt, Lynx) + ggtitle("Canadaian Lync pelts traded 1845-1945")

based on the x axis, interval is one year

Close

gafa_stock %>% autoplot(.vars = Close) +
               labs(title = "Close price in the stock market", x = "date", y = "price")

based on the x axis, interval is one day

Demand

vic_elec %>% autoplot(.vars = Demand)  +  
             labs(title = "Electricity Demand Over Time", x = "Time", y = "Demand")

based on the x axis, interval is every 30 min

2.2

Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.

unique(gafa_stock$Symbol)
## [1] "AAPL" "AMZN" "FB"   "GOOG"
gafa_stock %>% group_by(Symbol) %>%
               filter(Close == max(Close)) %>%
               select(Symbol, Date)
## # A tsibble: 4 x 2 [!]
## # Key:       Symbol [4]
## # Groups:    Symbol [4]
##   Symbol Date      
##   <chr>  <date>    
## 1 AAPL   2018-10-03
## 2 AMZN   2018-09-04
## 3 FB     2018-07-25
## 4 GOOG   2018-07-26

It looks like 2018-07-25 is a big day. Maybe U.S. President Donald Trump and European leaders agree to halt their trade war over vehicles and say they will open talks on removing trade barriers between the United States and the European Union.

2.3

Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.

data_url <- "https://bit.ly/fpptute1"
tute1 <- read.csv(data_url)
# View the first few rows of the dataframe
head(tute1)
##      Quarter  Sales AdBudget   GDP
## 1 1981-03-01 1020.2    659.2 251.8
## 2 1981-06-01  889.2    589.0 290.9
## 3 1981-09-01  795.0    512.5 290.8
## 4 1981-12-01 1003.9    614.1 292.4
## 5 1982-03-01 1057.7    647.2 279.1
## 6 1982-06-01  944.4    602.0 254.0
tute1ts <- tute1 |> mutate(Quarter = yearquarter(Quarter)) |>
                         as_tsibble(index = Quarter)
tute1ts
## # A tsibble: 100 x 4 [1Q]
##    Quarter Sales AdBudget   GDP
##      <qtr> <dbl>    <dbl> <dbl>
##  1 1981 Q1 1020.     659.  252.
##  2 1981 Q2  889.     589   291.
##  3 1981 Q3  795      512.  291.
##  4 1981 Q4 1004.     614.  292.
##  5 1982 Q1 1058.     647.  279.
##  6 1982 Q2  944.     602   254 
##  7 1982 Q3  778.     531.  296.
##  8 1982 Q4  932.     608.  272.
##  9 1983 Q1  996.     638.  260.
## 10 1983 Q2  908.     582.  280.
## # ℹ 90 more rows
tute1ts |>   pivot_longer(-Quarter)
## # A tsibble: 300 x 3 [1Q]
## # Key:       name [3]
##    Quarter name     value
##      <qtr> <chr>    <dbl>
##  1 1981 Q1 Sales    1020.
##  2 1981 Q1 AdBudget  659.
##  3 1981 Q1 GDP       252.
##  4 1981 Q2 Sales     889.
##  5 1981 Q2 AdBudget  589 
##  6 1981 Q2 GDP       291.
##  7 1981 Q3 Sales     795 
##  8 1981 Q3 AdBudget  512.
##  9 1981 Q3 GDP       291.
## 10 1981 Q4 Sales    1004.
## # ℹ 290 more rows
tute1ts |>   pivot_longer(-Quarter) |> ggplot(aes(x = Quarter, y = value, colour = name)) +
                                       geom_line() +
                                       facet_grid(name ~ ., scales = "free_y")

tute1ts |>   pivot_longer(-Quarter) |> ggplot(aes(x = Quarter, y = value, colour = name)) +
                                       geom_line()

if there is no face_grid, all of the variable using the same x and y axis here.

2.4 greedy MA Gov

The USgas package contains data on the demand for natural gas in the US.

Install the USgas package. Create a tsibble from us_total with year as the index and state as the key. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).

??USgas
## starting httpd help server ... done
data("usgas")
us_total_ts <- us_total  %>%  as_tsibble(key = state, index = year) 
us_total_ts_state <- us_total_ts[us_total_ts$state %in% c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island'), ]  %>% ggplot(aes(x = year, y = y, color = state))  + geom_line() + facet_grid(state ~ ., scales = "free_y") + labs(title = "annual natural gas consumption by state for the New England area", subtitle = "STUPID MASSACHUSSETS GOVERNMENT CHARGED US SO MUCH, This mopho charged me 150$ one month", y = "gas consumption")

us_total_ts_state

mean MA gov

2.5

Download tourism.xlsx from the book website and read it into R using readxl::read_excel(). Create a tsibble which is identical to the tourism tsibble from the tsibble package. Find what combination of Region and Purpose had the maximum number of overnight trips on average. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

It looks complex to download the data and you need to create a template first

download

file_url <- "https://bit.ly/fpptourism"

# Specify a temporary file path to download the file 
temp_file <- tempfile(fileext = ".xlsx")

# Download the file from the URL to the temporary file
download.file(file_url, temp_file, mode = "wb")

# Now read the Excel file from the downloaded temporary file
tourism <- readxl::read_excel(temp_file)

check identification

tourism %>% mutate(Quarter = yearquarter(Quarter)) %>% 
            as_tsibble(key = c(Region, State, Purpose), index = Quarter)
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows
tsibble::tourism # this is for comparison
## # A tsibble: 24,320 x 5 [1Q]
## # Key:       Region, State, Purpose [304]
##    Quarter Region   State           Purpose  Trips
##      <qtr> <chr>    <chr>           <chr>    <dbl>
##  1 1998 Q1 Adelaide South Australia Business  135.
##  2 1998 Q2 Adelaide South Australia Business  110.
##  3 1998 Q3 Adelaide South Australia Business  166.
##  4 1998 Q4 Adelaide South Australia Business  127.
##  5 1999 Q1 Adelaide South Australia Business  137.
##  6 1999 Q2 Adelaide South Australia Business  200.
##  7 1999 Q3 Adelaide South Australia Business  169.
##  8 1999 Q4 Adelaide South Australia Business  134.
##  9 2000 Q1 Adelaide South Australia Business  154.
## 10 2000 Q2 Adelaide South Australia Business  169.
## # ℹ 24,310 more rows

maximum combination

tourism %>% group_by(Region, Purpose) %>% mutate(avg_reg_pur = mean(Trips)) %>% ungroup() %>%
            filter(avg_reg_pur == max(avg_reg_pur)) %>% distinct(Region, Purpose)
## # A tibble: 1 × 2
##   Region Purpose 
##   <chr>  <chr>   
## 1 Sydney Visiting
# no way this code is so efficient

new tsibble Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.

tourism %>% group_by(Quarter, State) %>% mutate(Quarter = yearquarter(Quarter), Trips = sum(Trips)) %>%
            ungroup() %>% select(Quarter, State, Trips) %>% distinct() %>% 
            as_tsibble(key = c(State, Trips), index = Quarter)
## # A tsibble: 640 x 3 [1Q]
## # Key:       State, Trips [640]
##    Quarter State Trips
##      <qtr> <chr> <dbl>
##  1 2003 Q1 ACT    378.
##  2 1999 Q1 ACT    379.
##  3 2011 Q1 ACT    381.
##  4 2001 Q1 ACT    391.
##  5 2009 Q2 ACT    397.
##  6 2009 Q3 ACT    410.
##  7 1998 Q2 ACT    416.
##  8 2005 Q4 ACT    423.
##  9 2004 Q3 ACT    424.
## 10 2009 Q4 ACT    428.
## # ℹ 630 more rows

2.6

The aus_arrivals data set comprises quarterly international arrivals to Australia from Japan, New Zealand, UK and the US.

Use autoplot(), gg_season() and gg_subseries() to compare the differences between the arrivals from these four countries. Can you identify any unusual observations?

# I want to write a for-loop in R


# Get the unique list of countries
countries <- unique(aus_arrivals$Origin)

# Loop through each country and generate plots
for (country in countries) {
  cat("Country: ", country, "\n") # Print the country name to the console
  
  # Filter data for the current country
  country_data <- aus_arrivals %>% filter(Origin == country)
  
  # autoplot with titles and axis labels
  p1 <- country_data %>%
    autoplot(Arrivals) +
    labs(title = paste("Arrivals from", country),
         x = "Time",
         y = "Number of Arrivals")
  print(p1) # Print the autoplot
  
  # gg_season with titles and axis labels
  p2 <- country_data %>%
    gg_season(Arrivals) +
    labs(title = paste("Seasonal Plot of Arrivals from", country),
         x = "Season",
         y = "Number of Arrivals")
  print(p2) # Print the seasonal plot
  
  # gg_subseries with titles and axis labels
  p3 <- country_data %>%
    gg_subseries(Arrivals) +
    labs(title = paste("Subseries Plot of Arrivals from", country),
         x = "Year",
         y = "Number of Arrivals")
  print(p3) # Print the subseries plot
  
  cat("\n") # Print a newline for better separation between countries
}
## Country:  Japan

## 
## Country:  NZ

## 
## Country:  UK

## 
## Country:  US

another way for 2.6

aus_arrivals |> autoplot(Arrivals)

Generally the number of arrivals to Australia is increasing over the entire series, with the exception of Japanese visitors which begin to decline after 1997 because of the finish of the migration.

aus_arrivals |> gg_season(Arrivals, labels = "both")

Seasonal patterns in arrivals appear to differ from country to country. In particular, arrivals from the UK appear to be lowest in the second and third quarters, while increasing significantly in the fourth and first quarters. For New Zealand tourists, arrivals were lowest in the first quarter and highest in the third quarter. Similar changes exist in Japan and the United States.

aus_arrivals |> gg_subseries(Arrivals)

explanation for unusual spike 2000 Q3: Spikes from the US (Sydney Olympics arrivals) 2001 Q3-Q4 are unusual for US (9/11 effect) 1997 Japanese Migration to Australia in 1997 2007 New South Wales migration to Australia

Chapter 3

3.5

For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

understanding box-cox first

then we can easily find the value of lambda when we are using this:

Tobacco from aus_production

step 1: trying the log first

aus_production %>%  autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values (`geom_line()`).

aus_production %>%  autoplot(log(Tobacco))
## Warning: Removed 24 rows containing missing values (`geom_line()`).

step 2: calculate the lambda using guerrero feature here is the utility of calculating the box-cox lambda

aus_production  %>%  features(Tobacco, features = guerrero)
## # A tibble: 1 × 1
##   lambda_guerrero
##             <dbl>
## 1           0.926
aus_production %>%   autoplot(box_cox(Tobacco, 0.926))
## Warning: Removed 24 rows containing missing values (`geom_line()`).

also based on the lambda value which is closed to 1, we can find there is actually very small transformation here.

Economy class passengers between Melbourne and Sydney from ansett

ansett %>% filter(Airports == "MEL-SYD", Class == "Economy") %>% autoplot(Passengers) +
           labs(title = "Economy passengers", subtitle = "crazy strike! should we reelly adjust this like that?")

ansett_melsyd <- ansett %>% filter(Airports == "MEL-SYD", Class == "Economy")

Can I transform this data in two period??? I do not think transformation is good in this example. Or anyway, I can do some permutation or interpolation

Pedestrian counts at Southern Cross Station from pedestrian

since there are lots of 0, so it is not good to use the log here. also hard to adjust because of the distribution then we decided to use the log1p here

pedestrian %>% filter(Sensor == "Southern Cross Station") %>% autoplot(Count)

pedestrian %>% filter(Sensor == "Southern Cross Station") %>% autoplot(log1p(Count))

now it looks better now!!!

3.7

Consider the last five years of the Gas data from aus_production.

Plot the time series. Can you identify seasonal fluctuations and/or a trend-cycle?

gas <- tail(aus_production, 5*4) %>% select(Gas) 
gas %>% autoplot(Gas)

There is some strong seasonality and a trend.

Use classical_decomposition with type=multiplicative to calculate the trend-cycle and seasonal indices.

gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>%
        components() %>% autoplot() + labs(title = "multiplicative")
## Warning: Removed 2 rows containing missing values (`geom_line()`).

gas %>% model(decomp = classical_decomposition(Gas, type = "additive")) %>%
        components() %>% autoplot() + labs(title = "additive")
## Warning: Removed 2 rows containing missing values (`geom_line()`).

Do the results support the graphical interpretation from part a?

It supports the seasonality described in a.

Compute and plot the seasonally adjusted data.

gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
        autoplot(season_adjust)

#that is weird, why I need an as_tsibble here first??? because the best use of the autoplot?
gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
        as_tsibble(index = Quarter) %>% autoplot(season_adjust)

Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

gas %>% mutate(Gas = if_else(Quarter == yearquarter("2008Q1"), Gas + 300, Gas)) %>% 
        model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
        autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

#the only one outlier makes my seasonal plot super different

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

gas %>% mutate(Gas = if_else(Quarter == yearquarter("2005Q3"), Gas + 300, Gas)) %>% 
        model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

gas %>% mutate(Gas = if_else(Quarter == yearquarter("2005Q3"), Gas + 300, Gas)) %>% 
        model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

gas %>% mutate(Gas = if_else(Quarter == yearquarter("2010Q2"), Gas + 300, Gas)) %>% 
        model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

There is no big difference but make sure it is the end! even though I changed the 2010Q2 to 2010Q1 like :

gas %>% mutate(Gas = if_else(Quarter == yearquarter("2010Q1"), Gas + 300, Gas)) %>% 
        model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).

you can find the top of the graph changed explicitly

3.9

Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.

Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

  1. This is a typical STL decomposition.
  2. the trend is the most significant part for the decomposition
  3. seasonality is super stable even other decomposition changed a lot
  4. Dec and March usually have the most labor force, Jan and Aug usually have the least.

Is the recession of 1991/1992 visible in the estimated components?

Yes. I think the remainder shows a substantial drop during 1991 and 1992 coinciding with the recession.


this is the finish of coding part ****

some free trials

set.seed(42)  # For reproducible results
data <- tibble(
  Date = rep(seq(as.Date('2020-01-01'), by = "month", length.out = 12), times = 3*4), #this is meaningless
  Sales = rnorm(144, mean = 1000, sd = 250),
  Product = rep(c("Product A", "Product B", "Product C"), each = 48),
  Region = rep(c("North", "South", "East", "West"), each = 12, times = 3)
)
data$Year <- format(data$Date, "%Y")

ggplot(data, aes(x = Date, y = Sales)) +
  geom_line() +  # Assuming we're interested in sales trends over time
  facet_grid(Product ~ Region, scales = "free_y") +
  labs(title = "Monthly Sales by Product and Region",
       x = "Date", y = "Sales") +
  theme_minimal()

ggplot(data, aes(x = Date, y = Sales)) +
  geom_line() +  # Assuming we're interested in sales trends over time
  facet_grid(Product ~ ., scales = "free_y") +
  labs(title = "Monthly Sales by Product and Region",
       x = "Date", y = "Sales") +
  theme_minimal()

tourism %>%
  group_by(Region, Purpose) %>%
  summarise(Total_Trips = sum(Trips), .groups = 'drop')
## # A tibble: 304 × 3
##    Region         Purpose  Total_Trips
##    <chr>          <chr>          <dbl>
##  1 Adelaide       Business      12442.
##  2 Adelaide       Holiday       12523.
##  3 Adelaide       Other          4525.
##  4 Adelaide       Visiting      16415.
##  5 Adelaide Hills Business        213.
##  6 Adelaide Hills Holiday         837.
##  7 Adelaide Hills Other           112.
##  8 Adelaide Hills Visiting       1136.
##  9 Alice Springs  Business       1166.
## 10 Alice Springs  Holiday        2551.
## # ℹ 294 more rows

comment

topic

Discuss the risks associated with an overreliance on historical data in statistical forecasting. How can such dependence lead to inaccurate predictions, especially in rapidly changing markets or environments? Provide examples of situations where historical data may not be a reliable indicator of future outcomes and suggest methods to mitigate these risks.

doc

In the realm of statistical forecasting, an overreliance on historical data presents a myriad of challenges, especially in fast-evolving sectors where the past may no longer serve as a reliable proxy for the future. This concern is magnified in the current era of accelerated Artificial Intelligence (AI) advancement. Recent developments, such as the unveiling of the Sora model and Google's Gemma model, alongside Elon Musk's insightful commentary on the advancement towards Artificial General Intelligence (AGI), highlight the brisk pace at which AI is evolving. These milestones not only underscore the transformative potential of AI across various industries but also signal a paradigm shift in how forecasting needs to adapt in anticipation of a future shaped by the capabilities of AGI. All in all, overreliance will make us underestimate the power of technology updating.

However, historical data is still invaluable for understanding trends, patterns, and relationships within a dataset, it has limitations. Here are some key risks and strategies for mitigation:
  1. Leading to Inaccurate Prediction

    For the first thing, unpredictable, rare events (like pandemics, natural disasters, or financial crises) can dramatically change the landscape in ways that historical data cannot anticipate. In the light of my data science background, I have a very deep understand on how bad machine learning model act on the deficient dataset and no performance on the prediction out of the range for dataset.

    Second, the whole economic environment changed quickly. Markets and environments often undergo structural changes due to technological advancements, regulatory changes, new competition, or shifts in consumer behavior. Historical data may not capture these shifts, leading to forecasts that are based on outdated assumptions.

    Third problem is the stationarity Issues. Many statistical forecasting models assume that the processes generating the data are stationary, meaning their statistical properties (mean, variance) are constant over time. In reality, many processes are non-stationary, and their underlying dynamics change.

    The last is an easy understanding situation: overfitting on the historical dataset, so I jumped this line :).

  2. Examples Where Historical Data May Be Misleading

    Like I have said in the first paragraph, the disruption from technology. Historical data would have been a poor predictor of the artificial intelligence revolution and the rapid decline of the out-dated model and software usage due to the time sensitivity in the AI industry.

    Then they are some general examples everyone would use:

    Financial Markets: The 2008 financial crisis is an example where models based on historical mortgage default rates failed to predict the widespread defaults due to underlying changes in lending practices and economic conditions.

    Pandemic Response: The COVID-19 pandemic has shown how quickly consumer behavior, supply chains, and business operations can change, rendering many historical-based forecasts irrelevant.

  3. Ways to Mitigate the Risks(To be continued)

    Incorporate External Data: Use external indicators that capture potential changes in the environment, such as economic indicators, consumer sentiment indices, or technological trend analyses, to adjust forecasts accordingly.

    Scenario Analysis: Instead of relying on a single forecast, use scenario analysis to explore a range of possible futures based on different assumptions about how current trends might evolve or new events might emerge. I think this is poplular in the industry now.

    Constant Model Re-evaluation: Regularly update and re-evaluate forecasting models to incorporate new data and check for structural breaks or changes in relationships within the data.

    Machine Learning and Adaptive Models: Use machine learning models that can adapt to changes over time and are capable of capturing non-linear relationships and interactions that traditional models might miss.

    Expert Judgment: Combine statistical forecasts with expert judgment to incorporate insights that are not reflected in the data. This can be particularly useful for anticipating potential future disruptions or innovations.