Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?
#check dataset for analysis
head(global_economy)
#Plot GDP per capita after adding a GDPperCapita column to global_economy.
data("global_economy")
global_economy$GDPperCapita <- global_economy$GDP/global_economy$Population
global_economy %>% autoplot(GDPperCapita, show.legend = FALSE) +
labs(title= "GDP per Capita", x= "Year", y = "USD")
## Warning: Removed 3242 rows containing missing values or values outside the scale range
## (`geom_line()`).
#Filter data to get country and GDP per capita
global_Cap<- global_economy %>%
index_by(Year)%>%
filter(GDPperCapita == max(GDPperCapita, na.rm = TRUE)) %>%
select(Country, GDPperCapita) %>%
arrange(Year)
## Adding missing grouping variables: `Year`
#Show the five highest GDP per capita values with head()
head(global_Cap %>%
arrange(desc(GDPperCapita)), n = 5)
#Create a bar chart with ggplot showing the three variables needed to answer the question (Country, Year and GDP per capita)
ggplot(global_Cap, aes(x = Year, y = GDPperCapita, fill = Country)) +
geom_bar(stat = "identity") +
labs(title = "GDP Per Capita from 1960 to 2017", x = "Year", y = "GDP per Capita", fill = "Country") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
scale_x_continuous(breaks = seq(1960, 2017, by = 10))
Monaco has the highest GDP per capita in this dataset, and the graph shows that its GDP per capita peaked in 2014. The chart also shows that in earlier years the United States and Kuwait had the highest GDP per capita; in recent years, however, Monaco has been the most consistent leader, along with Luxembourg and Liechtenstein.
For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.
United States GDP from global_economy. Slaughter of Victorian “Bulls, bullocks and steers” in aus_livestock. Victorian Electricity Demand from vic_elec. Gas production from aus_production.
#Check dataset for variables
data("global_economy")
head(global_economy)
# Plot GDP with autoplot after filtering for the United States
global_economy %>% filter(Country =='United States') %>%
autoplot(GDP) +
labs(title = "United States GDP", x= "Year", y = "Dollars")
Transformation does not seem necessary for this dataset.
#Check dataset for variables
data("aus_livestock")
head(aus_livestock)
aus_livestock %>% filter(Animal == 'Bulls, bullocks and steers') %>%
filter(State == 'Victoria') %>%
autoplot(Count) +
labs(title = "Slaughter of Victorian Bulls, bullocks and steers", x = "Month", y = 'Count')
Transformation does not seem necessary for this dataset.
#Check dataset for variables
data("vic_elec")
head(vic_elec)
vic_elec %>% autoplot(Demand) +
labs(title = "Victorian Electricity Demand", x = "Time (half-hourly)", y = "MWh")
yearly <- vic_elec %>%
mutate(Year = year(Date)) %>%
index_by(Year) %>%
summarise(Demand = sum(Demand))
yearly %>% autoplot(Demand) +
labs(title = "Victorian Electricity Demand yearly", x = "Year", y = 'MWh')
A transformation might be necessary for this dataset: the first plot
was messy and hard to read. After aggregating the half-hourly values
into yearly totals, the plot is easier to understand.
#Check dataset for variables
data("aus_production")
head(aus_production)
aus_production %>% autoplot(Gas) +
labs(title = "Gas Production", x = "Quarter", y = "Count")
aus_production %>% autoplot(sqrt(Gas)) +
labs(title = "Gas Production (square root)", x = "Quarter", y = "sqrt(Gas)")
I applied a square-root transformation to the data, and I think the
series looks much better that way.
Why is a Box-Cox transformation unhelpful for the canadian_gas data?
canadian_gas %>%
autoplot() + labs(title = "Canadian gas production by month", x = "Months", y = "Count")
## Plot variable not specified, automatically selected `.vars = Volume`
lambda3 <- canadian_gas %>%
features(Volume,features = guerrero) %>%
pull(lambda_guerrero)
canadian_gas %>%
autoplot(box_cox(Volume,lambda3)) +
labs(title = "Canadian gas production by month", x = "Months", y = "Count")
After transforming the canadian_gas data with the box_cox() function
and comparing the two plots, I concluded that a Box-Cox transformation
is not useful here. The variance is still increasing in both the
original and the transformed series, and the transformation is only
worthwhile when it stabilizes the variance of the data.
What Box-Cox transformation would you select for your retail data (from Exercise 7 in Section 2.10)?
set.seed(122779)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
autoplot() +
labs(title = "Australian Retail Trade Turnover", x = "Months", y = "Turnover")
## Plot variable not specified, automatically selected `.vars = Turnover`
lambda4 <- myseries %>%
features(Turnover,features = guerrero) %>%
pull(lambda_guerrero)
lambda4
## [1] 0.1571524
myseries %>%
autoplot(box_cox(Turnover,lambda4)) +
labs(title = "Australian Retail Trade Turnover", x = "Months", y = "Turnover")
The Guerrero method selects a lambda of 0.1571524 for the retail data. Since this is close to 0, it is approximately a log transformation, which makes forecasting simpler.
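To illustrate why a lambda near 0 behaves like a log transformation, here is a minimal base R sketch of the textbook Box-Cox formula. The helper `box_cox_sketch()` and the values in `y` are made up for illustration; this is not the implementation behind `box_cox()` in fpp3.

```r
# Hypothetical helper (not the fabletools implementation): the textbook
# Box-Cox transform for positive data.
box_cox_sketch <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(10, 50, 100, 500)  # made-up positive values, for illustration only

log_y   <- box_cox_sketch(y, 0)      # lambda = 0 is defined as the natural log
near_0  <- box_cox_sketch(y, 0.001)  # a lambda very close to 0
lam_one <- box_cox_sketch(y, 1)      # lambda = 1 only shifts the data by 1

# Side-by-side comparison of the three transformations
print(cbind(y, log_y, near_0, lam_one))
```

As lambda moves away from 0 the transformed values drift away from the log, so treating a small lambda as "approximately log" is a judgment call rather than an exact equivalence.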
For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.
aus_production %>% autoplot(Tobacco) +
labs(title = "Tobacco Production", x = "Quarter", y = "Count")
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
lambda5 <- aus_production %>%
features(Tobacco, features = guerrero)%>%
pull(lambda_guerrero)
aus_production %>% autoplot(box_cox(Tobacco,lambda5)) +
labs(title = paste("Transformed Tobacco Production with Box-Cox =", round(lambda5, 4)))
## Warning: Removed 24 rows containing missing values or values outside the scale range
## (`geom_line()`).
The Guerrero method gives a lambda of 0.9265 for this series. Because
that value is close to 1, the transformation changes the data only slightly.
class_eco <- ansett %>%
filter(Class == "Economy",
Airports == "MEL-SYD")
autoplot(class_eco, Passengers) +
labs(title = "Passengers in Economy Class Between Melbourne and Sydney Flights", x = "Weeks")
lambda6 <- class_eco %>%
features(Passengers, features = guerrero) %>%
pull(lambda_guerrero)
class_eco %>%
autoplot(box_cox(Passengers, lambda6)) +
labs(title = paste("Transformed Economy Class Passengers Count =", round(lambda6, 5)))
For this dataset, the Box-Cox lambda is 1.9993 (rounded to 5 decimal
places), and the transformed series displays the variation better.
pedestrian %>% filter(Sensor =='Southern Cross Station') %>% autoplot(Count)+
labs(title = "Pedestrian Count")
monthly <- pedestrian %>%
mutate(Month = yearmonth(Date)) %>%
index_by(Month) %>%
summarise(Count = sum(Count))
monthly %>% autoplot(Count)+
labs(title = "Pedestrian Count by month")
lambda7 <- monthly %>%
features(Count, features = guerrero) %>%
pull(lambda_guerrero)
monthly %>% autoplot(box_cox(Count,lambda7)) +
labs(title = paste("Transformed Pedestrian Count by month =", round(lambda7, 6)))
I aggregated the data from hourly to monthly to make it easier to
read. The Guerrero method then gave a Box-Cox lambda of 1.999927
(rounded to 6 decimal places).
Consider the last five years of the Gas data from aus_production.
gas <- tail(aus_production, 5*4) |> select(Gas)
gas %>% autoplot(Gas) +
labs(title = "Australia Gas Production")
There seem to be variations between quarters, with peaks in the second quarter and decreases in the fourth quarter.
gas %>% model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
autoplot() +
labs(title = "Classical multiplicative decomposition")
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
I believe the decomposition supports the graphical interpretation from part A: there is a positive trend with seasonal variation during the year.
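As a rough illustration of what a classical multiplicative decomposition computes, the sketch below uses base R's `decompose()` on a made-up quarterly series (not the aus_production data). It assumes the model data = trend × seasonal × remainder, so the seasonally adjusted series is the data divided by the seasonal component.

```r
# Made-up quarterly series (illustration only, not the aus_production data):
# a rising trend multiplied by a fixed seasonal pattern.
trend_part    <- seq(200, 238, length.out = 20)
seasonal_part <- rep(c(0.90, 1.05, 1.15, 0.90), times = 5)
x <- ts(trend_part * seasonal_part, start = c(2006, 1), frequency = 4)

dcmp <- decompose(x, type = "multiplicative")

# Seasonally adjusted series: divide out the seasonal component.
season_adjust <- x / dcmp$seasonal

# Where the trend is defined, trend * seasonal * remainder rebuilds the data.
# The moving-average trend is NA for the first and last two quarters.
rebuilt <- dcmp$trend * dcmp$seasonal * dcmp$random
ok <- !is.na(rebuilt)
print(all.equal(as.numeric(rebuilt[ok]), as.numeric(x[ok])))

# One seasonal index per quarter
print(round(dcmp$figure, 3))
```

The missing trend values at both ends are also why the decomposition plot above warns about removed rows.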
seasonally <- gas %>% model(classical_decomposition(Gas, type = "multiplicative"))
components(seasonally) %>%
as_tsibble() %>%
autoplot(Gas, colour = "red") +
geom_line(aes(y=season_adjust), colour = "blue") +
labs(title = "Seasonally Adjusted Gas Production")
gas$Gas[gas$Gas == 171] <- gas$Gas[gas$Gas == 171] + 300
gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, colour = "red") +
geom_line(aes(y=season_adjust), colour = "blue") +
labs(title = "Seasonally Adjusted Data with an Outlier")
The outlier has a large effect that we can clearly see in both the original data and the seasonally adjusted data, and it also distorts the trend of the overall series.
gas$Gas[gas$Gas == 471] <- gas$Gas[gas$Gas == 471] - 300
gas$Gas[gas$Gas == 245] <- gas$Gas[gas$Gas == 245] + 300
gas %>%
model(classical_decomposition(Gas, type = "multiplicative")) %>%
components() %>%
as_tsibble() %>%
autoplot(Gas, colour = "yellow") +
geom_line(aes(y=season_adjust), colour = "red") +
labs(title = "Seasonally Adjusted Data with a Middle Outlier")
When the outlier is in the middle of the series it also affects the
trend, and the size of the effect seems to depend on where the outlier is placed.
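One way to compare the two placements more directly is to measure how much the seasonally adjusted values change at the quarters that were not modified. The sketch below does this in base R on a made-up series; the helper `adjust_with_outlier()` and the positions 10 and 20 are hypothetical choices for illustration, not part of the original analysis.

```r
# Made-up quarterly series (illustration only, not the aus_production data).
base_x <- seq(200, 238, length.out = 20) * rep(c(0.90, 1.05, 1.15, 0.90), 5)

# Hypothetical helper: add `size` at position `pos` (if given), then return
# the seasonally adjusted series from a classical multiplicative decomposition.
adjust_with_outlier <- function(values, pos = NULL, size = 300) {
  if (!is.null(pos)) values[pos] <- values[pos] + size
  x <- ts(values, frequency = 4)
  x / decompose(x, type = "multiplicative")$seasonal
}

sa_base   <- adjust_with_outlier(base_x)
sa_middle <- adjust_with_outlier(base_x, pos = 10)  # outlier mid-series
sa_end    <- adjust_with_outlier(base_x, pos = 20)  # outlier at the end

# Largest change in the seasonally adjusted values at the positions that
# were not themselves modified.
change_middle <- max(abs(sa_middle[-10] - sa_base[-10]))
change_end    <- max(abs(sa_end[-20] - sa_base[-20]))
print(c(middle = change_middle, end = change_end))
```

Any difference here comes only from the changed seasonal indices, since the unmodified observations themselves are identical across the three runs.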
Recall your retail time series data (from Exercise 7 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?
set.seed(12345678)
myseries <- aus_retail |>
filter(`Series ID` == sample(aus_retail$`Series ID`,1))
myseries %>%
autoplot() + labs(title = "Australian retail trade turnover")
## Plot variable not specified, automatically selected `.vars = Turnover`
myseries %>%
gg_season() +labs(title = "Australian retail trade turnover")
## Plot variable not specified, automatically selected `y = Turnover`
myseries %>%
gg_subseries() + labs(title = "Australian retail trade turnover")
## Plot variable not specified, automatically selected `y = Turnover`
The figures for this question show a positive trend that increases over time. The seasonal pattern shows three peaks per year in hiring the civilian labor force, around March, September and December.
In the seasonal component of the decomposition we can see a sharp decrease from March to August in the early 1990s. Figure 3.19 shows the STL decomposition, where we can see a large decrease in the remainder (noise) component during the period from 1990 to 1991.
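The exercise asks for an X-11 decomposition, which could be sketched as follows. This assumes the seasonal package is installed alongside fpp3, since `X_13ARIMA_SEATS()` relies on it, and it repeats the seed and series selection used above so that the block runs on its own.

```r
library(fpp3)
library(seasonal)  # assumed to be installed; provides the X-13ARIMA-SEATS backend

# Same series selection as above, repeated so this block is self-contained
set.seed(12345678)
myseries <- aus_retail |>
  filter(`Series ID` == sample(aus_retail$`Series ID`, 1))

# X-11 decomposition of the retail turnover series
x11_dcmp <- myseries |>
  model(x11 = X_13ARIMA_SEATS(Turnover ~ x11())) |>
  components()

autoplot(x11_dcmp) +
  labs(title = "X-11 decomposition of Australian retail trade turnover")
```

Outliers, if any, would typically show up as isolated spikes in the irregular panel of this plot.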