Explore the following four time series: Bricks from aus_production, Lynx from pelt, Close from gafa_stock, Demand from vic_elec.
Use ? (or help()) to find out about the data in each series. What is the time interval of each series? Use autoplot() to produce a time plot of each series. For the last plot, modify the axis labels and title.
aus_production %>% autoplot(.vars = Bricks) +
labs(title = "Brick production", x = "Time", y = "production")
## Warning: Removed 20 rows containing missing values (`geom_line()`).
based on the x axis, interval is one quarter
autoplot(pelt, Lynx) + ggtitle("Canadaian Lync pelts traded 1845-1945")
based on the x axis, interval is one year
gafa_stock %>% autoplot(.vars = Close) +
labs(title = "Close price in the stock market", x = "date", y = "price")
based on the x axis, interval is one day
vic_elec %>% autoplot(.vars = Demand) +
labs(title = "Electricity Demand Over Time", x = "Time", y = "Demand")
based on the x axis, interval is every 30 min
Use filter() to find what days corresponded to the peak closing price for each of the four stocks in gafa_stock.
unique(gafa_stock$Symbol)
## [1] "AAPL" "AMZN" "FB" "GOOG"
gafa_stock %>% group_by(Symbol) %>%
filter(Close == max(Close)) %>%
select(Symbol, Date)
## # A tsibble: 4 x 2 [!]
## # Key: Symbol [4]
## # Groups: Symbol [4]
## Symbol Date
## <chr> <date>
## 1 AAPL 2018-10-03
## 2 AMZN 2018-09-04
## 3 FB 2018-07-25
## 4 GOOG 2018-07-26
It looks like 2018-07-25 is a big day. Maybe U.S. President Donald Trump and European leaders agree to halt their trade war over vehicles and say they will open talks on removing trade barriers between the United States and the European Union.
Download the file tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for inflation.
data_url <- "https://bit.ly/fpptute1"
tute1 <- read.csv(data_url)
# View the first few rows of the dataframe
head(tute1)
## Quarter Sales AdBudget GDP
## 1 1981-03-01 1020.2 659.2 251.8
## 2 1981-06-01 889.2 589.0 290.9
## 3 1981-09-01 795.0 512.5 290.8
## 4 1981-12-01 1003.9 614.1 292.4
## 5 1982-03-01 1057.7 647.2 279.1
## 6 1982-06-01 944.4 602.0 254.0
tute1ts <- tute1 |> mutate(Quarter = yearquarter(Quarter)) |>
as_tsibble(index = Quarter)
tute1ts
## # A tsibble: 100 x 4 [1Q]
## Quarter Sales AdBudget GDP
## <qtr> <dbl> <dbl> <dbl>
## 1 1981 Q1 1020. 659. 252.
## 2 1981 Q2 889. 589 291.
## 3 1981 Q3 795 512. 291.
## 4 1981 Q4 1004. 614. 292.
## 5 1982 Q1 1058. 647. 279.
## 6 1982 Q2 944. 602 254
## 7 1982 Q3 778. 531. 296.
## 8 1982 Q4 932. 608. 272.
## 9 1983 Q1 996. 638. 260.
## 10 1983 Q2 908. 582. 280.
## # ℹ 90 more rows
tute1ts |> pivot_longer(-Quarter)
## # A tsibble: 300 x 3 [1Q]
## # Key: name [3]
## Quarter name value
## <qtr> <chr> <dbl>
## 1 1981 Q1 Sales 1020.
## 2 1981 Q1 AdBudget 659.
## 3 1981 Q1 GDP 252.
## 4 1981 Q2 Sales 889.
## 5 1981 Q2 AdBudget 589
## 6 1981 Q2 GDP 291.
## 7 1981 Q3 Sales 795
## 8 1981 Q3 AdBudget 512.
## 9 1981 Q3 GDP 291.
## 10 1981 Q4 Sales 1004.
## # ℹ 290 more rows
tute1ts |> pivot_longer(-Quarter) |> ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line() +
facet_grid(name ~ ., scales = "free_y")
tute1ts |> pivot_longer(-Quarter) |> ggplot(aes(x = Quarter, y = value, colour = name)) +
geom_line()
if there is no face_grid, all of the variable using the same x and y axis here.
The USgas package contains data on the demand for natural gas in the US.
Install the USgas package. Create a tsibble from us_total with year as the index and state as the key. Plot the annual natural gas consumption by state for the New England area (comprising the states of Maine, Vermont, New Hampshire, Massachusetts, Connecticut and Rhode Island).
??USgas
## starting httpd help server ... done
data("usgas")
us_total_ts <- us_total %>% as_tsibble(key = state, index = year)
us_total_ts_state <- us_total_ts[us_total_ts$state %in% c('Maine', 'Vermont', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Rhode Island'), ] %>% ggplot(aes(x = year, y = y, color = state)) + geom_line() + facet_grid(state ~ ., scales = "free_y") + labs(title = "annual natural gas consumption by state for the New England area", subtitle = "STUPID MASSACHUSSETS GOVERNMENT CHARGED US SO MUCH, This mopho charged me 150$ one month", y = "gas consumption")
us_total_ts_state
mean MA gov
Download tourism.xlsx from the book website and read it into R using readxl::read_excel(). Create a tsibble which is identical to the tourism tsibble from the tsibble package. Find what combination of Region and Purpose had the maximum number of overnight trips on average. Create a new tsibble which combines the Purposes and Regions, and just has total trips by State.
It looks complex to download the data and you need to create a template first
file_url <- "https://bit.ly/fpptourism"
# Specify a temporary file path to download the file
temp_file <- tempfile(fileext = ".xlsx")
# Download the file from the URL to the temporary file
download.file(file_url, temp_file, mode = "wb")
# Now read the Excel file from the downloaded temporary file
tourism <- readxl::read_excel(temp_file)
tourism %>% mutate(Quarter = yearquarter(Quarter)) %>%
as_tsibble(key = c(Region, State, Purpose), index = Quarter)
## # A tsibble: 24,320 x 5 [1Q]
## # Key: Region, State, Purpose [304]
## Quarter Region State Purpose Trips
## <qtr> <chr> <chr> <chr> <dbl>
## 1 1998 Q1 Adelaide South Australia Business 135.
## 2 1998 Q2 Adelaide South Australia Business 110.
## 3 1998 Q3 Adelaide South Australia Business 166.
## 4 1998 Q4 Adelaide South Australia Business 127.
## 5 1999 Q1 Adelaide South Australia Business 137.
## 6 1999 Q2 Adelaide South Australia Business 200.
## 7 1999 Q3 Adelaide South Australia Business 169.
## 8 1999 Q4 Adelaide South Australia Business 134.
## 9 2000 Q1 Adelaide South Australia Business 154.
## 10 2000 Q2 Adelaide South Australia Business 169.
## # ℹ 24,310 more rows
tsibble::tourism # this is for comparison
## # A tsibble: 24,320 x 5 [1Q]
## # Key: Region, State, Purpose [304]
## Quarter Region State Purpose Trips
## <qtr> <chr> <chr> <chr> <dbl>
## 1 1998 Q1 Adelaide South Australia Business 135.
## 2 1998 Q2 Adelaide South Australia Business 110.
## 3 1998 Q3 Adelaide South Australia Business 166.
## 4 1998 Q4 Adelaide South Australia Business 127.
## 5 1999 Q1 Adelaide South Australia Business 137.
## 6 1999 Q2 Adelaide South Australia Business 200.
## 7 1999 Q3 Adelaide South Australia Business 169.
## 8 1999 Q4 Adelaide South Australia Business 134.
## 9 2000 Q1 Adelaide South Australia Business 154.
## 10 2000 Q2 Adelaide South Australia Business 169.
## # ℹ 24,310 more rows
tourism %>% group_by(Region, Purpose) %>% mutate(avg_reg_pur = mean(Trips)) %>% ungroup() %>%
filter(avg_reg_pur == max(avg_reg_pur)) %>% distinct(Region, Purpose)
## # A tibble: 1 × 2
## Region Purpose
## <chr> <chr>
## 1 Sydney Visiting
# no way this code is so efficient
tourism %>% group_by(Quarter, State) %>% mutate(Quarter = yearquarter(Quarter), Trips = sum(Trips)) %>%
ungroup() %>% select(Quarter, State, Trips) %>% distinct() %>%
as_tsibble(key = c(State, Trips), index = Quarter)
## # A tsibble: 640 x 3 [1Q]
## # Key: State, Trips [640]
## Quarter State Trips
## <qtr> <chr> <dbl>
## 1 2003 Q1 ACT 378.
## 2 1999 Q1 ACT 379.
## 3 2011 Q1 ACT 381.
## 4 2001 Q1 ACT 391.
## 5 2009 Q2 ACT 397.
## 6 2009 Q3 ACT 410.
## 7 1998 Q2 ACT 416.
## 8 2005 Q4 ACT 423.
## 9 2004 Q3 ACT 424.
## 10 2009 Q4 ACT 428.
## # ℹ 630 more rows
The aus_arrivals data set comprises quarterly international arrivals to Australia from Japan, New Zealand, UK and the US.
Use autoplot(), gg_season() and gg_subseries() to compare the differences between the arrivals from these four countries. Can you identify any unusual observations?
# I want to write a for-loop in R
# Get the unique list of countries
countries <- unique(aus_arrivals$Origin)
# Loop through each country and generate plots
for (country in countries) {
cat("Country: ", country, "\n") # Print the country name to the console
# Filter data for the current country
country_data <- aus_arrivals %>% filter(Origin == country)
# autoplot with titles and axis labels
p1 <- country_data %>%
autoplot(Arrivals) +
labs(title = paste("Arrivals from", country),
x = "Time",
y = "Number of Arrivals")
print(p1) # Print the autoplot
# gg_season with titles and axis labels
p2 <- country_data %>%
gg_season(Arrivals) +
labs(title = paste("Seasonal Plot of Arrivals from", country),
x = "Season",
y = "Number of Arrivals")
print(p2) # Print the seasonal plot
# gg_subseries with titles and axis labels
p3 <- country_data %>%
gg_subseries(Arrivals) +
labs(title = paste("Subseries Plot of Arrivals from", country),
x = "Year",
y = "Number of Arrivals")
print(p3) # Print the subseries plot
cat("\n") # Print a newline for better separation between countries
}
## Country: Japan
##
## Country: NZ
##
## Country: UK
##
## Country: US
another way for 2.6
aus_arrivals |> autoplot(Arrivals)
Generally the number of arrivals to Australia is increasing over the entire series, with the exception of Japanese visitors which begin to decline after 1997 because of the finish of the migration.
aus_arrivals |> gg_season(Arrivals, labels = "both")
Seasonal patterns in arrivals appear to differ from country to country. In particular, arrivals from the UK appear to be lowest in the second and third quarters, while increasing significantly in the fourth and first quarters. For New Zealand tourists, arrivals were lowest in the first quarter and highest in the third quarter. Similar changes exist in Japan and the United States.
aus_arrivals |> gg_subseries(Arrivals)
explanation for unusual spike 2000 Q3: Spikes from the US (Sydney Olympics arrivals) 2001 Q3-Q4 are unusual for US (9/11 effect) 1997 Japanese Migration to Australia in 1997 2007 New South Wales migration to Australia
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.
then we can easily find the value of lambda when we are using this:
step 1: trying the log first
aus_production %>% autoplot(Tobacco)
## Warning: Removed 24 rows containing missing values (`geom_line()`).
aus_production %>% autoplot(log(Tobacco))
## Warning: Removed 24 rows containing missing values (`geom_line()`).
step 2: calculate the lambda using guerrero feature here is the utility of calculating the box-cox lambda
aus_production %>% features(Tobacco, features = guerrero)
## # A tibble: 1 × 1
## lambda_guerrero
## <dbl>
## 1 0.926
aus_production %>% autoplot(box_cox(Tobacco, 0.926))
## Warning: Removed 24 rows containing missing values (`geom_line()`).
also based on the lambda value which is closed to 1, we can find there is actually very small transformation here.
ansett %>% filter(Airports == "MEL-SYD", Class == "Economy") %>% autoplot(Passengers) +
labs(title = "Economy passengers", subtitle = "crazy strike! should we reelly adjust this like that?")
ansett_melsyd <- ansett %>% filter(Airports == "MEL-SYD", Class == "Economy")
Can I transform this data in two period??? I do not think transformation is good in this example. Or anyway, I can do some permutation or interpolation
since there are lots of 0, so it is not good to use the log here. also hard to adjust because of the distribution then we decided to use the log1p here
pedestrian %>% filter(Sensor == "Southern Cross Station") %>% autoplot(Count)
pedestrian %>% filter(Sensor == "Southern Cross Station") %>% autoplot(log1p(Count))
now it looks better now!!!
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) %>% select(Gas)
gas %>% autoplot(Gas)
There is some strong seasonality and a trend.
gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>% autoplot() + labs(title = "multiplicative")
## Warning: Removed 2 rows containing missing values (`geom_line()`).
gas %>% model(decomp = classical_decomposition(Gas, type = "additive")) %>%
components() %>% autoplot() + labs(title = "additive")
## Warning: Removed 2 rows containing missing values (`geom_line()`).
It supports the seasonality described in a.
gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
autoplot(season_adjust)
#that is weird, why I need an as_tsibble here first??? because the best use of the autoplot?
gas %>% model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
as_tsibble(index = Quarter) %>% autoplot(season_adjust)
gas %>% mutate(Gas = if_else(Quarter == yearquarter("2008Q1"), Gas + 300, Gas)) %>%
model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>%
autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
#the only one outlier makes my seasonal plot super different
gas %>% mutate(Gas = if_else(Quarter == yearquarter("2005Q3"), Gas + 300, Gas)) %>%
model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
gas %>% mutate(Gas = if_else(Quarter == yearquarter("2005Q3"), Gas + 300, Gas)) %>%
model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
gas %>% mutate(Gas = if_else(Quarter == yearquarter("2010Q2"), Gas + 300, Gas)) %>%
model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
There is no big difference but make sure it is the end! even though I changed the 2010Q2 to 2010Q1 like :
gas %>% mutate(Gas = if_else(Quarter == yearquarter("2010Q1"), Gas + 300, Gas)) %>%
model(decomp = classical_decomposition(Gas, type = "multiplicative")) %>% components() %>% autoplot()
## Warning: Removed 2 rows containing missing values (`geom_line()`).
you can find the top of the graph changed explicitly
Figures 3.19 and 3.20 show the result of decomposing the number of persons in the civilian labour force in Australia each month from February 1978 to August 1995.
Yes. I think the remainder shows a substantial drop during 1991 and 1992 coinciding with the recession.
this is the finish of coding part ****
set.seed(42) # For reproducible results
data <- tibble(
Date = rep(seq(as.Date('2020-01-01'), by = "month", length.out = 12), times = 3*4), #this is meaningless
Sales = rnorm(144, mean = 1000, sd = 250),
Product = rep(c("Product A", "Product B", "Product C"), each = 48),
Region = rep(c("North", "South", "East", "West"), each = 12, times = 3)
)
data$Year <- format(data$Date, "%Y")
ggplot(data, aes(x = Date, y = Sales)) +
geom_line() + # Assuming we're interested in sales trends over time
facet_grid(Product ~ Region, scales = "free_y") +
labs(title = "Monthly Sales by Product and Region",
x = "Date", y = "Sales") +
theme_minimal()
ggplot(data, aes(x = Date, y = Sales)) +
geom_line() + # Assuming we're interested in sales trends over time
facet_grid(Product ~ ., scales = "free_y") +
labs(title = "Monthly Sales by Product and Region",
x = "Date", y = "Sales") +
theme_minimal()
tourism %>%
group_by(Region, Purpose) %>%
summarise(Total_Trips = sum(Trips), .groups = 'drop')
## # A tibble: 304 × 3
## Region Purpose Total_Trips
## <chr> <chr> <dbl>
## 1 Adelaide Business 12442.
## 2 Adelaide Holiday 12523.
## 3 Adelaide Other 4525.
## 4 Adelaide Visiting 16415.
## 5 Adelaide Hills Business 213.
## 6 Adelaide Hills Holiday 837.
## 7 Adelaide Hills Other 112.
## 8 Adelaide Hills Visiting 1136.
## 9 Alice Springs Business 1166.
## 10 Alice Springs Holiday 2551.
## # ℹ 294 more rows
Discuss the risks associated with an overreliance on historical data in statistical forecasting. How can such dependence lead to inaccurate predictions, especially in rapidly changing markets or environments? Provide examples of situations where historical data may not be a reliable indicator of future outcomes and suggest methods to mitigate these risks.
In the realm of statistical forecasting, an overreliance on historical data presents a myriad of challenges, especially in fast-evolving sectors where the past may no longer serve as a reliable proxy for the future. This concern is magnified in the current era of accelerated Artificial Intelligence (AI) advancement. Recent developments, such as the unveiling of the Sora model and Google's Gemma model, alongside Elon Musk's insightful commentary on the advancement towards Artificial General Intelligence (AGI), highlight the brisk pace at which AI is evolving. These milestones not only underscore the transformative potential of AI across various industries but also signal a paradigm shift in how forecasting needs to adapt in anticipation of a future shaped by the capabilities of AGI. All in all, overreliance will make us underestimate the power of technology updating.
However, historical data is still invaluable for understanding trends, patterns, and relationships within a dataset, it has limitations. Here are some key risks and strategies for mitigation:
Leading to Inaccurate Prediction
For the first thing, unpredictable, rare events (like pandemics, natural disasters, or financial crises) can dramatically change the landscape in ways that historical data cannot anticipate. In the light of my data science background, I have a very deep understand on how bad machine learning model act on the deficient dataset and no performance on the prediction out of the range for dataset.
Second, the whole economic environment changed quickly. Markets and environments often undergo structural changes due to technological advancements, regulatory changes, new competition, or shifts in consumer behavior. Historical data may not capture these shifts, leading to forecasts that are based on outdated assumptions.
Third problem is the stationarity Issues. Many statistical forecasting models assume that the processes generating the data are stationary, meaning their statistical properties (mean, variance) are constant over time. In reality, many processes are non-stationary, and their underlying dynamics change.
The last is an easy understanding situation: overfitting on the historical dataset, so I jumped this line :).
Examples Where Historical Data May Be Misleading
Like I have said in the first paragraph, the disruption from technology. Historical data would have been a poor predictor of the artificial intelligence revolution and the rapid decline of the out-dated model and software usage due to the time sensitivity in the AI industry.
Then they are some general examples everyone would use:
Financial Markets: The 2008 financial crisis is an example where models based on historical mortgage default rates failed to predict the widespread defaults due to underlying changes in lending practices and economic conditions.
Pandemic Response: The COVID-19 pandemic has shown how quickly consumer behavior, supply chains, and business operations can change, rendering many historical-based forecasts irrelevant.
Ways to Mitigate the Risks(To be continued)
Incorporate External Data: Use external indicators that capture potential changes in the environment, such as economic indicators, consumer sentiment indices, or technological trend analyses, to adjust forecasts accordingly.
Scenario Analysis: Instead of relying on a single forecast, use scenario analysis to explore a range of possible futures based on different assumptions about how current trends might evolve or new events might emerge. I think this is poplular in the industry now.
Constant Model Re-evaluation: Regularly update and re-evaluate forecasting models to incorporate new data and check for structural breaks or changes in relationships within the data.
Machine Learning and Adaptive Models: Use machine learning models that can adapt to changes over time and are capable of capturing non-linear relationships and interactions that traditional models might miss.
Expert Judgment: Combine statistical forecasts with expert judgment to incorporate insights that are not reflected in the data. This can be particularly useful for anticipating potential future disruptions or innovations.