Assignment 1

R Markdown

#1.Consider the GDP information in global_economy. Plot the GDP per capita for each country over time. Which country has the highest GDP per capita? How has this changed over time?

ge<-global_economy
ge$gdppc<-ge$GDP/ge$Population
ge %>%
  tsibble(key = Code, index = Year)%>%
  autoplot(gdppc, show.legend =  FALSE) +
  labs(y = "GDP Per Capita")

## Warning: Removed 3242 row(s) containing missing values (geom_path).

In the most recent year, Luxembourg had the highest GDP per capita. In year 2000, Monaco took the prize. Going back to 1965, the petro-state of Kuwait was the highest, with the United States in second. The United States is somewhat unique to appear at the top of these rankigns, with a significantly larger populaton than the Luxembourgh, Kuwait, Monaco, Lichtenstein, and UAE, other top-ranking countries.

ge_2017 <- ge[ge$Year == 2017,]
ge_2017[order(-ge_2017$gdppc),]

## # A tsibble: 262 x 10 [1Y]
## # Key:       Country [262]
##    Country    Code   Year     GDP Growth   CPI Imports Exports Population  gdppc
##    <fct>      <fct> <dbl>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>  <dbl>
##  1 Luxembourg LUX    2017 6.24e10   2.30 111.    194.    230.      599449 1.04e5
##  2 Macao SAR… MAC    2017 5.04e10   9.10 136.     32.0    79.4     622567 8.09e4
##  3 Switzerla… CHE    2017 6.79e11   1.09  98.3    53.9    65.0    8466017 8.02e4
##  4 Norway     NOR    2017 3.99e11   1.92 115.     33.1    35.5    5282223 7.55e4
##  5 Iceland    ISL    2017 2.39e10   3.64 122.     42.8    47.0     341284 7.01e4
##  6 Ireland    IRL    2017 3.34e11   7.80 105.     87.9   120.     4813608 6.93e4
##  7 Qatar      QAT    2017 1.67e11   1.58 116.     37.3    51.0    2639211 6.32e4
##  8 United St… USA    2017 1.94e13   2.27 112.     NA      NA    325719178 5.95e4
##  9 North Ame… NAC    2017 2.10e13   2.35  NA      NA      NA    362492702 5.81e4
## 10 Singapore  SGP    2017 3.24e11   3.62 113.    149.    173.     5612253 5.77e4
## # … with 252 more rows

ge_2000 <- ge[ge$Year == 2000,]
ge_2000[order(-ge_2000$gdppc),]

## # A tsibble: 263 x 10 [1Y]
## # Key:       Country [263]
##    Country    Code   Year     GDP Growth   CPI Imports Exports Population  gdppc
##    <fct>      <fct> <dbl>   <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl>  <dbl>
##  1 Monaco     MCO    2000 2.65e 9   3.91  NA     NA       NA        32082 82535.
##  2 Liechtens… LIE    2000 2.48e 9   3.22  NA     NA       NA        33286 74625.
##  3 Bermuda    BMU    2000 3.48e 9   9.32  NA     NA       NA        61833 56284.
##  4 Luxembourg LUX    2000 2.13e10   8.24  80.1  123.     148.      436300 48736.
##  5 Channel I… CHI    2000 6.44e 9   5.83  NA     NA       NA       148725 43299.
##  6 San Marino SMR    2000 1.10e 9   2.21  NA     NA       NA        27418 40189.
##  7 Japan      JPN    2000 4.89e12   2.78 103.     9.20    10.6  126843000 38532.
##  8 Norway     NOR    2000 1.71e11   3.21  81.9   28.9     45.7    4490967 38147.
##  9 Switzerla… CHE    2000 2.72e11   3.94  91.7   46.0     52.1    7184250 37868.
## 10 United St… USA    2000 1.03e13   4.09  79.0   14.3     10.7  282162411 36450.
## # … with 253 more rows

ge_1965 <- ge[ge$Year == 1965,]
ge_1965[order(-ge_1965$gdppc),]

## # A tsibble: 260 x 10 [1Y]
## # Key:       Country [260]
##    Country  Code   Year      GDP Growth     CPI Imports Exports Population gdppc
##    <fct>    <fct> <dbl>    <dbl>  <dbl>   <dbl>   <dbl>   <dbl>      <dbl> <dbl>
##  1 Kuwait   KWT    1965  2.10e 9 NA     NA        23.1    67.7      473554 4429.
##  2 United … USA    1965  7.44e11  6.40  14.4       4.24    4.99  194303000 3828.
##  3 North A… NAC    1965  7.98e11  6.42  NA         5.54    6.22  214031100 3727.
##  4 Sweden   SWE    1965  2.33e10  3.82  11.0      22.5    21.9     7733853 3008.
##  5 Luxembo… LUX    1965  9.22e 8 -0.735 19.0      75.1    83.0      331500 2780.
##  6 Canada   CAN    1965  5.39e10  6.64  14.4      18.7    18.5    19678000 2740.
##  7 Iceland  ISL    1965  5.24e 8  7.37   0.0699   35.0    36.2      192286 2724.
##  8 Switzer… CHE    1965  1.53e10 NA     28.2      27.0    27.5     5856472 2620.
##  9 Bermuda  BMU    1965  1.14e 8  4.76  NA        NA      NA         50100 2282.
## 10 Austral… AUS    1965  2.59e10  5.98   8.69     15.3    13.2    11388000 2277.
## # … with 250 more rows

2.For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

In order to better portray the US GDP data, I opted to scale it simply into Billions, which keeps the same shape but better scales the data.

us<- ge[ge$Country == 'United States',]
us %>%
  tsibble(index = Year)%>%
  autoplot(GDP/1000000000, show.legend =  FALSE) +
  labs(y = "GDP in Billions($)")

I opted for no transformation on the Bulls, bullocks, and steers in Victoria, since the data clearly show variation across time and a seasonality effect.

aus<-aus_livestock
aus_bbc<- aus[aus$Animal == 'Bulls, bullocks and steers',]
aus_bbc_v<-aus_bbc[aus_bbc$State == 'Victoria',]
aus_bbc_v %>%
  tsibble(index = Month)%>%
  autoplot(Count, show.legend =  FALSE) +
  labs(title = "Slaughter of Bulls, Bollucks, and Steers in Victoria, Australia",
    y = "Total Number",
    x= "Month")

Due to the difficulty of viewing the plot,I considering a log transformation and a box-cox transformation, but neither made it easier to view.

elec<-vic_elec
elec

## # A tsibble: 52,608 x 5 [30m] <Australia/Melbourne>
##    Time                Demand Temperature Date       Holiday
##    <dttm>               <dbl>       <dbl> <date>     <lgl>  
##  1 2012-01-01 00:00:00  4383.        21.4 2012-01-01 TRUE   
##  2 2012-01-01 00:30:00  4263.        21.0 2012-01-01 TRUE   
##  3 2012-01-01 01:00:00  4049.        20.7 2012-01-01 TRUE   
##  4 2012-01-01 01:30:00  3878.        20.6 2012-01-01 TRUE   
##  5 2012-01-01 02:00:00  4036.        20.4 2012-01-01 TRUE   
##  6 2012-01-01 02:30:00  3866.        20.2 2012-01-01 TRUE   
##  7 2012-01-01 03:00:00  3694.        20.1 2012-01-01 TRUE   
##  8 2012-01-01 03:30:00  3562.        19.6 2012-01-01 TRUE   
##  9 2012-01-01 04:00:00  3433.        19.1 2012-01-01 TRUE   
## 10 2012-01-01 04:30:00  3359.        19.0 2012-01-01 TRUE   
## # … with 52,598 more rows

elec$logd<-log(elec$Demand)
elec %>%
  tsibble(index = Time) %>%
  autoplot(Demand, show.legend =  FALSE) +
  labs(title = "Victorian Electricity Demand",
    y = "KW Used",
    x= "Time")

lambda <- elec %>%
  features(Demand, features = guerrero) %>%
  pull(lambda_guerrero)

lambda

## [1] 0.09993089

elec %>%
  autoplot(box_cox(Demand, lambda)) +
  labs(title = "Victorian Electricity Demand",
    y = "KW Used",
    x= "Time")

No Change was considered because the data shows clear seasonality and variation.

aus_p<-aus_production
aus_production %>% autoplot(Gas)

##3. Why is a Box-Cox transformation unhelpful for the canadian_gas data?

The box-cox transformation converts the plot to a near-normal distribution. In this case, since our variance is not continuous over the period, a boxcox will not be appropriate, and we can see in the two plots below, that the boxcox does not improve the graph.

cg<-canadian_gas
cg %>% autoplot(Volume)

lambda2 <- cg %>%
  features(Volume, features = guerrero) %>%
  pull(lambda_guerrero)

cg %>% autoplot(box_cox(Volume, lambda2))

##4. What Box-Cox transformation would you select for your retail data (from Exercise 8 in Section 2.10)?

Below are the plots for turnover in Australian retail. I chose to use to guerrero feature to determine the lambda, which assigned a value of -.04. A good value of lambda “is one which makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler”(Hyndman 3.1), which this achieves.

set.seed(45678)
austs <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

austs %>% autoplot(Turnover)

lambda3 <- austs %>%
  features(Turnover, features = guerrero) %>%
  pull(lambda_guerrero)
lambda3

## [1] -0.04252935

austs %>% autoplot(box_cox(Turnover, lambda3))

##5. For the following series, find an appropriate Box-Cox transformation in order to stabilise the variance. Tobacco from aus_production, Economy class passengers between Melbourne and Sydney from ansett, and Pedestrian counts at Southern Cross Station from pedestrian.

ap<-aus_production
ap %>% autoplot(Tobacco)

## Warning: Removed 24 row(s) containing missing values (geom_path).

lambda4 <- ap %>%
  features(Tobacco, features = guerrero) %>%
  pull(lambda_guerrero)
ap %>% autoplot(box_cox(Tobacco, lambda4))

## Warning: Removed 24 row(s) containing missing values (geom_path).

ans<-ansett
ansettms<- ans[ans$Airports == 'MEL-SYD',]
anseco<- ansettms[ansettms$Class == 'Economy',]
anseco %>% autoplot(Passengers)

lambda5 <- anseco%>%
  features(Passengers, features = guerrero) %>%
  pull(lambda_guerrero)
anseco %>% autoplot(box_cox(Passengers, lambda5))

peds<-pedestrian
pedsx<- peds[peds$Sensor == 'Southern Cross Station',]
pedsx %>% autoplot(Count)

lambda6 <- pedsx%>%
  features(Count, features = guerrero) %>%
  pull(lambda_guerrero)
pedsx %>% autoplot(box_cox(Count, lambda6))

##6.Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.

A 3x5 MA averages the first five terms and then obtaining three observation set. Since a 3X5 MA will lead to a 1/15(y + 2y + 3y + 3y + 3y + 2y + y), it follows the weighted average of 1/15, 2/15, 3/15, 3/15 which are equivalent to the decimal values in the prompt.

##7. Consider the last five years of the Gas data from aus_production. The plot below shows the seasonal fluctuations and a general upwards trend.

gas <- tail(aus_production, 5*4) %>% select(Gas)
gas %>% autoplot(Gas)

gts<- ts(gas$Gas, frequency = 4, start = c(2005))
plot(gts)

The decomposition below confirms the overall trend, with a clear upwards trend and clear seasonal fluctuations, as one would expect with gas usage.

dec1<-decompose(gts, type="multiplicative")
plot(dec1)

#borrowed this code from Juan Esteban Rincon Poveda
gas2<-gas%>% model(classical_decomposition(Gas, type="multiplicative")) %>% components()
gas2 %>%
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Gas, colour = "Gas")) +
  geom_line(aes(y = season_adjust)) +
  geom_line(aes(y = trend, colour = "Trend"))

## Warning: Removed 4 row(s) containing missing values (geom_path).

e. Change one observation to be an outlier (e.g., add 300 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

The outlier has a notible disruptive impact on the trend, but does very little to the seasonal interpretation.

gas3<-gas
gas3$Gas[14]<-gas$Gas[14]+300

gas3 %>%
  model(classical_decomposition(Gas,type = "multiplicative")) %>%
  components() %>%
  autoplot()

## Warning: Removed 2 row(s) containing missing values (geom_path).

Does it make any difference if the outlier is near the end rather than in the middle of the time series?

Yes. If the added quantity is near the end of the time series, it will be in line with the overall trend and just make the effect more drastic. Alternatively, if it is at the beginning of the series, then it will disrupt the overall trend.

##Recall your retail time series data (from Exercise 8 in Section 2.10). Decompose the series using X-11. Does it reveal any outliers, or unusual features that you had not noticed previously?

There are clearly some bumps from outliers that mess with the oeverall trend,. That said, there is clearly a seasonal trend here.

set.seed(3)
myseries <- aus_retail %>%
  filter(`Series ID` == sample(aus_retail$`Series ID`,1))

myseries %>%
  model(classical_decomposition(Turnover ,type = "multiplicative")) %>%
  components() %>%
  autoplot()

## Warning: Removed 6 row(s) containing missing values (geom_path).

##9. a. Write about 3–5 sentences describing the results of the decomposition. Pay particular attention to the scales of the graphs in making your interpretation.

The decomposition in figures 3.19 and 3.20 show the decomposition of the monthly civilian labor force from 1979 to 1995. The data show a clear upward trend, with just a small change from near-perfect linearity from 1990 to 1993ish, but then there is a clear bump up. There is a clear season trend, as it humps up in the summer (our winter) and goes down in the winter.
###b. Is the recession of 1991/1992 visible in the estimated components? The recession of 1991-1992 is visible only in the ‘remainder,’ and causes a slight but noticeable slowing of growth in the trend line. THere is nothing discernible inhte seasonality about the recession, which makes sense.

11. This exercise uses the canadian_gas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).

a. Plot the data using autoplot(), gg_subseries() and gg_season() to look at the effect of the changing seasonality over time

cg <- canadian_gas
cg %>%
  autoplot(Volume)

cg %>%
  gg_season(Volume)

cg %>%
  gg_subseries(Volume)

#Chapter 7 Excercises

##1. Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.

jan14_vic_elec <- vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )
jan14_vic_elec

## # A tsibble: 31 x 3 [1D]
##    Date        Demand Temperature
##    <date>       <dbl>       <dbl>
##  1 2014-01-01 175185.        26  
##  2 2014-01-02 188351.        23  
##  3 2014-01-03 189086.        22.2
##  4 2014-01-04 173798.        20.3
##  5 2014-01-05 169733.        26.1
##  6 2014-01-06 195241.        19.6
##  7 2014-01-07 199770.        20  
##  8 2014-01-08 205339.        27.4
##  9 2014-01-09 227334.        32.4
## 10 2014-01-10 258111.        34  
## # … with 21 more rows

Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?

There is a positive relationship because the average temperature in Victoria in January is about 80 degrees, so as the heat goes up, demand for AC goes up, hence an increase in demand for electricity.

mod1<- lm(Demand ~ Temperature, jan14_vic_elec)
summary(mod1)

## 
## Call:
## lm(formula = Demand ~ Temperature, data = jan14_vic_elec)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49978 -10219   -121  18533  35441 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391         0.00203 ** 
## Temperature   6154.3      601.3  10.235 0.0000000000389 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared:  0.7832, Adjusted R-squared:  0.7757 
## F-statistic: 104.7 on 1 and 29 DF,  p-value: 0.0000000000389

plot(jan14_vic_elec$Temperature,jan14_vic_elec$Demand)

b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

As shown below,from inspection there is not a discernable trend between temperature and residuals. The residuals are skewed right, but positive and negative residuals exist at both ends up the spectrum.

res.mod1<-resid(mod1)
plot(jan14_vic_elec$Temperature, res.mod1,
  ylab="Residuals", xlab="Temperature")

c. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 C and compare it with the forecast if the with maximum temperature was 35. Do you believe these forecasts?

I do not trust these forecasts. This forecast is not very reliable since the input data is only for Jan 15.

jan14_vic_elec %>%
  model(TSLM(Demand ~ Temperature)) %>%
  forecast(
    new_data(jan14_vic_elec, 1) %>%
      mutate(Temperature = 15)
  ) %>%

  autoplot(jan14_vic_elec)

jan14_vic_elec %>%
  model(TSLM(Demand ~ Temperature)) %>%
  forecast(
    new_data(jan14_vic_elec, 1) %>%
      mutate(Temperature = 35)
  ) %>%
  
  autoplot(jan14_vic_elec)

d. Give prediction intervals for your forecasts.

Upper and lower levels printed below

temp<- data.frame(Temperature=c(15,35))
mod2<- forecast(mod1, temp)
mod2

##           Point Forecast    Lo 80    Hi 80     Lo 95    Hi 95
## 151398.35       151398.4 117127.2 185669.5  97951.22 204845.5
## 274484.25       274484.2 241333.0 307635.5 222783.69 326184.8

mod2$lower[,1]

##        1        2 
## 117127.2 241333.0

mod2$upper[,1]

##        1        2 
## 185669.5 307635.5

Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?

The model shows that the data does not follow a linear trend. Thus, our linear model may not be a good approximation of electricity demand. This likely stems from the interchanging relationship between cold and hot weather and the need for electricity. For example, higher temps does not mean more electricity unless it is the hotter months.

ve <- vic_elec 
plot(ve$Temperature,ve$Demand,
  xlab="Temp",ylab="Demand")

##2. Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.

Plot the winning time against the year for each event. Describe the main features of the plot. The primary feature of the plot below is that finishing time is decreasing for distance events, but increasing slightly or unchanged for sprint events.

or<-olympic_running
or %>% autoplot(Time, show.legend =  TRUE)

Fit a regression line to the data for each event. Obviously the winning times have been decreasing, but at what average rate per year?

According to the model, the average times are decreasing by an average of 4.82 seconds per year on average. This does little to show us about the sprint models because the impact of the larger models is so much larger.

orts<-ts(or, frequency = 1)
ormod1 <- tslm(Time ~ trend + Year, data = orts)
summary(ormod1)

## 
## Call:
## tslm(formula = Time ~ trend + Year, data = orts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -479.91 -291.19   -9.71  212.57  893.18 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 2236.9636  1261.4729   1.773              0.0773 .  
## trend          4.8262     0.2384  20.246 <0.0000000000000002 ***
## Year          -1.3415     0.6440  -2.083              0.0382 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 354.9 on 278 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.5986, Adjusted R-squared:  0.5957 
## F-statistic: 207.3 on 2 and 278 DF,  p-value: < 0.00000000000000022

Plot the residuals against the year. What does this indicate about the suitability of the fitted lines?

The relationship between residuals and year is a bit odd. Mostly, the relationship seems to be constant with some oddity at the higher residua;ls

resormod1<-resid(ormod1)

plot(or$Year, resormod1,
  ylab="Residuals", xlab="Year")

#olympicforecast<- forecast(tslm, newdata=data.frame(year=2020))

An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) as a percent change in the forecast variable.

##4. The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

The plot below shows the time plot for souvenier sales. Two oddities exist: In general the data follow an upward but seasonal trend, with the exception of January 1991 where there is a lesser increase. Additionally, the growth in 1993 and 1994 exceeds that of prior years.

souv<-souvenirs
souv %>% autoplot(Sales)

b. Explain why it is necessary to take logarithms of these data before fitting a model.

Logarithms accomplish useful transformation for these data. First, it allows us to scale the data to reduce the role of major fluctuations. Additionally, when we run a regression, we get a % on % interpretation which helps with general interpretability and statistic and practical significance interpretations.

c.Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

sts<- ts(souv$Sales,start=c(1987,1), end=c(1993,6), frequency=12)
#borrowed from Juan ERP

logsales = log(sts)

sts

##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 1987  1664.81  2397.53  2840.71  3547.29  3752.96  3714.74  4349.61  3566.34
## 1988  2499.81  5198.24  7225.14  4806.03  5900.88  4951.34  6179.12  4752.15
## 1989  4717.02  5702.63  9957.58  5304.78  6492.43  6630.80  7349.62  8176.62
## 1990  5921.10  5814.58 12421.25  6369.77  7609.12  7224.75  8121.22  7979.25
## 1991  4826.64  6470.23  9638.77  8821.17  8722.37 10209.48 11276.55 12552.22
## 1992  7615.03  9849.69 14558.40 11587.33  9332.56 13082.09 16732.78 19888.61
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53                  
##           Sep      Oct      Nov      Dec
## 1987  5021.82  6423.48  7600.60 19756.21
## 1988  5496.43  5835.10 12600.08 28541.72
## 1989  8573.17  9690.50 15151.84 34061.01
## 1990  8093.06  8476.70 17914.66 30114.41
## 1991 11637.39 13606.89 21822.11 45060.69
## 1992 23933.38 25391.35 36024.80 80721.71
## 1993

surfing_festival <- c()
for(i in 1:length(sts)){
  month <- round(12*(sts[i] - floor(sts[i]))) + 1
  year <- floor(sts[i])
  if(year >= 1988 & month == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}
#borrowed code from Joseph Sansevero (previous class rpub posting)

surfmodel = tslm(logsales ~ trend + season + surfing_festival)
summary(surfmodel)

## 
## Call:
## tslm(formula = logsales ~ trend + season + surfing_festival)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45174 -0.11922 -0.00074  0.10175  0.35933 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      7.6464607  0.0773528  98.852 < 0.0000000000000002 ***
## trend            0.0212957  0.0009292  22.919 < 0.0000000000000002 ***
## season2          0.2521410  0.0979991   2.573             0.012414 *  
## season3          0.6932048  0.0990785   6.997        0.00000000189 ***
## season4          0.3820296  0.0991135   3.854             0.000271 ***
## season5          0.4123836  0.0980652   4.205        0.00008271735 ***
## season6          0.4524491  0.0981048   4.612        0.00001963521 ***
## season7          0.5767690  0.1019961   5.655        0.00000039077 ***
## season8          0.5352140  0.1033731   5.177        0.00000242988 ***
## season9          0.6237063  0.1034004   6.032        0.00000008934 ***
## season10         0.7223515  0.1020342   7.080        0.00000000135 ***
## season11         1.1936391  0.1020638  11.695 < 0.0000000000000002 ***
## season12         1.9447112  0.1021019  19.047 < 0.0000000000000002 ***
## surfing_festival 0.0293746  0.1001947   0.293             0.770337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1833 on 64 degrees of freedom
## Multiple R-squared:  0.9444, Adjusted R-squared:  0.9331 
## F-statistic: 83.67 on 13 and 64 DF,  p-value: < 0.00000000000000022

Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

The residuals show a clear decrease during the recession of 1991. Additionally, there is a clear pattern suggesting a relationship between residuals and time.

autoplot(surfmodel$residuals)

Do boxplots of the residuals for each month. Does this reveal any problems with the model?

The larger residuals in March suggest that the significance around the surf model may be more variable than the model can predict.

boxplot(residuals(surfmodel)~cycle(residuals(surfmodel)))

f. What do the values of the coefficients tell you about each variable? The model suggests that November and December cause the biggest jump in sales – over 1% each. Additionally, even when accounting for the surfing festival, March causes a .6% increase in sales, while, the surfing festival accounts for another .3% increase.

What does the Ljung-Box test tell you about your model? The Ljung-Box test shows that the residuals have a value of just .01216, demonstrating that there is a pattern in the residuals that is statistically significant. In other words, these are not just random.

checkresiduals(surfmodel)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 32.739, df = 17, p-value = 0.01216

Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

surffuture<- rep(0, 36)
for(i in 1:36){
  if(i %% 12 == 3){
    surffuture[i] <- 1
  }
}
surffuture<-ts(data=surffuture,start=1994,
                end=c(1996,12),
                frequency=12)

surfforecast<-forecast(surfmodel, newdata=surffuture)

## Warning in forecast.lm(surfmodel, newdata = surffuture): newdata column names
## not specified, defaulting to first variable required.

autoplot(surfforecast)

Predicton intervalus below

surfforecast$upper

##               [,1]      [,2]
## Jul 1993 10.166941 10.308788
## Aug 1993 10.147163 10.289272
## Sep 1993 10.309227 10.463765
## Oct 1993 10.376410 10.518258
## Nov 1993 10.868994 11.010841
## Dec 1993 11.641361 11.783209
## Jan 1994  9.716872  9.858137
## Feb 1994  9.990309 10.131574
## Mar 1994 10.452922 10.594325
## Apr 1994 10.163042 10.304445
## May 1994 10.214438 10.355703
## Jun 1994 10.275800 10.417064
## Jul 1994 10.425660 10.569229
## Aug 1994 10.405760 10.549524
## Sep 1994 10.568224 10.724635
## Oct 1994 10.635130 10.778699
## Nov 1994 11.127713 11.271282
## Dec 1994 11.900081 12.043650
## Jan 1995  9.976000 10.119208
## Feb 1995 10.249437 10.392644
## Mar 1995 10.711946 10.855236
## Apr 1995 10.422067 10.565356
## May 1995 10.473566 10.616774
## Jun 1995 10.534927 10.678135
## Jul 1995 10.685120 10.830812
## Aug 1995 10.665100 10.810922
## Sep 1995 10.827895 10.986543
## Oct 1995 10.894590 11.040282
## Nov 1995 11.387173 11.532865
## Dec 1995 12.159541 12.305233
## Jan 1996 10.235858 10.381406
## Feb 1996 10.509295 10.654842
## Mar 1996 10.971704 11.117278
## Apr 1996 10.681824 10.827398
## May 1996 10.733425 10.878972
## Jun 1996 10.794786 10.940333

surfforecast$lower

##               [,1]      [,2]
## Jul 1993  9.644235  9.502387
## Aug 1993  9.623495  9.481386
## Sep 1993  9.739756  9.585217
## Oct 1993  9.853705  9.711857
## Nov 1993 10.346288 10.204440
## Dec 1993 11.118656 10.976808
## Jan 1994  9.196314  9.055049
## Feb 1994  9.469751  9.328486
## Mar 1994  9.931856  9.790454
## Apr 1994  9.641977  9.500574
## May 1994  9.693880  9.552615
## Jun 1994  9.755241  9.613976
## Jul 1994  9.896612  9.753043
## Aug 1994  9.875993  9.732229
## Sep 1994  9.991855  9.835444
## Oct 1994 10.106081  9.962512
## Nov 1994 10.598664 10.455096
## Dec 1994 11.371032 11.227463
## Jan 1995  9.448282  9.305074
## Feb 1995  9.721719  9.578511
## Mar 1995 10.183928 10.040639
## Apr 1995  9.894049  9.750759
## May 1995  9.945849  9.802641
## Jun 1995 10.007210  9.864002
## Jul 1995 10.148248 10.002556
## Aug 1995 10.127750  9.981928
## Sep 1995 10.243280 10.084632
## Oct 1995 10.357718 10.212026
## Nov 1995 10.850301 10.704609
## Dec 1995 11.622669 11.476977
## Jan 1996  9.699520  9.553973
## Feb 1996  9.972957  9.827409
## Mar 1996 10.435267 10.289693
## Apr 1996 10.145387  9.999813
## May 1996 10.197086 10.051539
## Jun 1996 10.258447 10.112900

How could you improve these predictions by modifying the model?

The model could be improved by incuding other factors that account for sales. For example, average temperature in a given month, the overeall state of the economy, and other features.

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

#Could not get fourier series to run properly
#usgas <- ts(us_gasoline$Barrels, start=(1991-1), end=(2017-1),frequency = 52)
#ugh<-us_gasoline

#fourier_gas<- recent_barrells %>%
  #model(TSLM(Barrels~ trend() + fourier(K = 2)))
#report(fourier_beer)

##6. The annual population of Afghanistan is available in the global_economy data set.

Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?

The population of Afghanistan shrunk after 1980 (soviet-afghan war) but appears to have grown at a continuous rate since 1990.

ge_afg<-filter(ge, Code=='AFG')
ge_afg%>%
autoplot(Population)

bFit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989. c.Generate forecasts from these two models for the five years after the end of the data, and comment on the results.

The following graphs show the predicted values for both piecewise and linear functions. By applying a piecewise function, the predicted values are much closer to the recent trend, compared to the linear model which is too heavily influenced by the losses from an earlier war.

afg_trends <- ge_afg %>%
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
  )
afg_forecast <- afg_trends %>% forecast(h = 5)

ge_afg %>%
  autoplot(Population) +
  geom_line(data = fitted(afg_trends),
            aes(y = .fitted, colour = .model)) +
  autolayer(afg_forecast, alpha = 0.5, level = 95) +
  labs(y = "Population",
       title = "Afghan Population")

Assignment 1

Lucas Kaplan

2022-03-30

R Markdown

2.For each of the following series, make a graph of the data. If transforming seems appropriate, do so and describe the effect.

11. This exercise uses the canadian_gas data (monthly Canadian gas production in billions of cubic metres, January 1960 – February 2005).

a. Plot the data using autoplot(), gg_subseries() and gg_season() to look at the effect of the changing seasonality over time