Forecasting Section 2.8: #1 through #4.

1. For each of the following series (from the fma package), make a graph of the data. If transforming seems appropriate, do so and describe the effect.

  • FMA Datasets:
    • dole - Unemployment benefits in Australia
    • usdeaths - Accidental deaths in USA
    • bricksq - Quarterly clay brick production

a. Monthly total of people on unemployment benefits (dole)

library(fma)
summary(dole)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4742   20400   56760  221200  392600  856500
par(mfrow=c(2,2))
plot(dole, main="Monthly unemployment benefits")
plot(log(dole), main="Log transformed unemployment benefits")
lambda <- BoxCox.lambda(dole)
lambda
## [1] 0.3290922
plot(BoxCox(dole,lambda), main="Box-Cox transformed unemployment benefits")

Transformations: Both the log and Box-Cox transformations make the variation more even across the series. This may be subjective, but the log transformation appears to be a slightly better transformation than the Box-Cox.
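For reference, a minimal hand-rolled version of the transform (a sketch; bc is a made-up helper, and the forecast package's BoxCox() is what should be used in practice): the Box-Cox family is (y^lambda - 1)/lambda for lambda != 0 and log(y) at lambda = 0, so the fitted lambda of 0.33 sits between the log (lambda = 0) and square-root (lambda = 0.5) transforms.

bc <- function(y, lambda) {
  # Box-Cox family: converges to log(y) as lambda -> 0
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
plot(bc(dole, lambda), main="Hand-rolled Box-Cox, lambda = 0.33")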


b. Monthly total of accidental deaths in the United States (January 1973-December 1978).

summary(usdeaths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6892    8089    8728    8788    9323   11320
par(mfrow=c(2,2))
plot(usdeaths, main="Monthly Accidental Deaths in the US")
plot(log(usdeaths), main="Log Transformed Accidental Deaths")
lambda2 <- BoxCox.lambda(usdeaths)
plot(BoxCox(usdeaths,lambda2), main = "Box-Cox transformed Accidental Deaths")
plot(sqrt(usdeaths), main = "SQRT transformed Accidental Deaths")

lambda2
## [1] -0.03363775

Transformations: The transformed series show very similar variation in the accidental deaths; however, the ranges of the plots differ dramatically. I would compare all the transformations to see which produces the better results.
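Since lambda2 is approximately -0.03, which is nearly zero, the Box-Cox series should be essentially a rescaled log series; a one-line check (a sketch):

cor(BoxCox(usdeaths, lambda2), log(usdeaths))  # very close to 1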


c. Quarterly production of bricks (in millions of units) at Portland, Australia (March 1956-September 1994).

summary(bricksq)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   187.0   338.5   428.0   408.8   490.5   589.0
par(mfrow=c(2,2))
plot(bricksq, main="Quarterly Brick Production")
plot(log(bricksq), main = "Log Transformed Brick Production")
lambda3 <- BoxCox.lambda(bricksq)
lambda3
## [1] 0.2548929
plot(BoxCox(bricksq,lambda3), main = "Box-Cox Brick Production")
plot(sqrt(bricksq), main = "SQRT transformed Brick Production")


Transformations: The transformations describe the brick production in Portland, Australia very similarly, and, as with the accidental deaths, the ranges of the plots vary dramatically. The transformations would need to be compared to see which produces the better results.


2. Use the Dow Jones index (data set dowjones) to do the following:

a. Produce a time plot of the series.

par(mfrow=c(2,2))
plot(dowjones, main="Dow Jones Index")
plot(log(dowjones), main = "Log DJI")
plot(sqrt(dowjones), main = "SQRT")

b. Produce forecasts using the drift method and plot them.

dj_drift <- rwf(dowjones , h=10, drift=TRUE)
dj_drift_log <- rwf(log(dowjones), h = 10, drift = TRUE)
dj_drift_sqrt <- rwf(sqrt(dowjones), h =10, drift = TRUE)
par(mfrow=c(2,2))

plot(dj_drift,plot.conf=FALSE,main="Drift Method Dow Jones", ylab="Index",xlab="Year")
legend("topleft",lty=1, col=c(4),legend=c("Drift"))
plot(dj_drift_log,plot.conf=FALSE,main="Drift Method, Log Dow Jones", ylab="log(Index)",xlab="Year")
legend("topleft",lty=1, col=c(4),legend=c("Log Drift"))
plot(dj_drift_sqrt,plot.conf=FALSE,main="Drift Method, SQRT Dow Jones", ylab="sqrt(Index)",xlab="Year")
legend("topleft",lty=1, col=c(4),legend=c("SQRT Drift"))

c. Show that the graphed forecasts are identical to extending the line drawn between the first and last observations.

dj2 <- window(dowjones, end=65)  # first 65 of the 78 observations
dj2a <- meanf(dj2,h=12)
dj2b <- rwf(dj2,h=12)
dj2c <- rwf(dj2,h=12, drift = TRUE)
plot(dj2a, plot.conf=FALSE, main="Dow Jones Index", xlim=c(1,78))
lines(dj2b$mean,col=2)
lines(dj2c$mean,col=3)
lines(dowjones)
legend("topleft", lty=1, col=c(4,2,3), legend=c("Mean ","Naive","Drifit"))


Forecasts: The drift forecasts continue the straight line drawn between the first and last observations of the training window, whereas the mean and naive forecasts are flat; which method looks reasonable depends on where the forecast begins and on the upward/downward trend that is occurring.
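A direct check of the drift property (a sketch): the drift forecast is, by construction, the line through the first and last observations extended forward, so overlaying that line on the forecast shows the two coincide.

# The drift forecast equals the line through the first and last
# observations, extended h steps ahead.
n <- length(dowjones)
slope <- (dowjones[n] - dowjones[1]) / (n - 1)
plot(rwf(dowjones, h=12, drift=TRUE), main="Drift forecast vs. first-to-last line")
abline(a = dowjones[1] - slope, b = slope, col=2, lty=2)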


d. Try some of the other benchmark functions to forecast the same data set. Which do you think is best? Why?

dj_drift <- rwf(dowjones, h=42, drift=TRUE)
dj_mean <- meanf(dowjones, h=42)
dj_naive <- naive(dowjones, h=42)

plot(dj_drift,plot.conf=FALSE,main="Drift Method Dow Jones", ylab="Index",xlab="Year")

lines(dj_mean$mean, col=2)
lines(dj_naive$mean, col=3)
legend("topleft",lty=1, col=c(4,2,3),legend=c("Mean Method","Naive Method","Drift"))


Forecasts: The drift method is a large improvement over the mean and naive methods because it captures the upward trend in the series, whereas the other two produce flat forecasts.
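To move beyond the visual comparison, a holdout check can score the three benchmarks with accuracy() from the forecast package (a sketch; the 60/18 split and the dj_train/dj_test names are ad hoc choices for the example):

# Refit each benchmark on the first 60 observations and compare the
# "Test set" rows of the accuracy output on the remaining 18.
dj_train <- window(dowjones, end=60)
dj_test <- window(dowjones, start=61)
accuracy(meanf(dj_train, h=18), dj_test)
accuracy(naive(dj_train, h=18), dj_test)
accuracy(rwf(dj_train, drift=TRUE, h=18), dj_test)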


3. Consider the daily closing IBM stock prices (data set ibmclose).

a. Produce some plots of the data in order to become familiar with it.

head(ibmclose)
## [1] 460 457 452 459 462 459
summary(ibmclose)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   306.0   387.0   494.0   478.5   549.0   603.0
par(mfrow=c(2,2))
plot(ibmclose)
qqnorm(ibmclose)
qqline(ibmclose)
plot(log(ibmclose))
plot(sqrt(ibmclose))

b. Split the data into a training set of 300 observations and a test set of 69 observations.

ibmclose_train <- window(ibmclose ,end=300)
ibmclose_test <- window(ibmclose ,start=301)

c. Try various benchmark methods to forecast the training set and compare the results on the test set. Which method did best?

ibm_avg <- meanf(ibmclose_train, h=69)$mean
ibm_naive <- naive(ibmclose_train, h=69)$mean
ibm_drift <- rwf(ibmclose_train, drift=TRUE, h=69)$mean

plot(ibmclose_train,main="IBM Close Prices",xlab="Day",ylab="Price",xlim=c(1,369))

lines(ibm_naive,col=2)
lines(ibm_avg,col=4)
lines(ibm_drift,col=3)
lines(ibmclose_test,col=8)

legend("topleft",lty=1,col=c(4,2,3),
  legend=c("Mean Method","Naive Method","Drift Method"))

d. Zooming In

plot(ibmclose_train,main="IBM Close Prices", ylab="Price",xlab="Day", xlim=c(250,369), ylim=c(300,425))
lines(ibm_naive,col=2)
lines(ibm_avg,col=4)
lines(ibm_drift,col=3)
lines(ibmclose_test,col=8)

legend("topleft",lty=1,col=c(4,2,3),
  legend=c("Mean Method","Naive Method","Drift Method"))


Forecasts: The drift method is a large improvement over the mean and naive methods because it captures the downward trend at the end of the training period, whereas the naive and mean forecasts are static.
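The plots can be backed up numerically (a sketch): accuracy() reports the test-set error measures directly, which is the cleaner way to answer "which method did best?"

# Test-set error measures for each benchmark; compare the "Test set" rows.
accuracy(meanf(ibmclose_train, h=69), ibmclose_test)
accuracy(naive(ibmclose_train, h=69), ibmclose_test)
accuracy(rwf(ibmclose_train, drift=TRUE, h=69), ibmclose_test)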


4. Consider the sales of new one-family houses in the USA, Jan 1973 - Nov 1995 (data set hsales).

a. Produce some plots of the data in order to become familiar with it.

head(hsales)
## [1] 55 60 68 63 65 61
summary(hsales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   44.00   53.00   52.29   59.00   89.00
par(mfrow=c(2,2))
plot(hsales)
qqnorm(hsales)
qqline(hsales)
plot(log(hsales))
acf(hsales)

b. Split the hsales data set into a training set and a test set, where the test set is the last two years of data.

hsales_ts <- ts(hsales,start=1,end=275)
hsales_train <- window(hsales_ts,end=251)
hsales_test <- window(hsales_ts,start=252)

c. Try various benchmark methods to forecast the training set and compare the results on the test set. Which method did best?

hsales_avg <- meanf(hsales_train,h=24)$mean
hsales_naive <- naive(hsales_train,h=24)$mean
hsales_drift <- rwf(hsales_train,drift=TRUE,h=24)$mean

plot(hsales_train,main="House Sales",xlab="Month",ylab="Sales",xlim=c(1,275))

lines(hsales_naive,col=2)
lines(hsales_avg,col=4)
lines(hsales_drift,col=3)
lines(hsales_test,col=8)

legend("topleft",lty=1,col=c(4,2,3),
  legend=c("Mean Method","Naive Method","Drift Method"))

d. Zooming In

plot(hsales_train,main="House sales", ylab="House Sales",xlab="Month", xlim=c(240,275), ylim=c(35,75))

lines(hsales_naive,col=2)
lines(hsales_avg,col=4)
lines(hsales_drift,col=3)
lines(hsales_test,col=8)

legend("topleft",lty=1,col=c(4,2,3), legend=c("Mean Method","Naive Method","Drift Method"))


Forecasts: The mean method is worse than the drift and naive methods. The variability in the data keeps any of these forecasts from being reliable, and the drift and naive methods produce nearly indistinguishable results.
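Since hsales is monthly and strongly seasonal (visible in the ACF above), the seasonal naive method is a natural fourth benchmark. A sketch using the original monthly series, which keeps the frequency information that the ts(..., start=1) re-indexing discards (hs_train/hs_test are ad hoc names):

# Seasonal naive benchmark on the original monthly series.
# Training runs through Nov 1993; the test set is the last two years.
hs_train <- window(hsales, end=c(1993,11))
hs_test <- window(hsales, start=c(1993,12))
accuracy(snaive(hs_train, h=24), hs_test)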


Forecasting Section 4.10: #1 through #3.

1. Electricity consumption was recorded for a small town on 12 randomly chosen days. The following maximum temperatures (degrees Celsius) and consumption (megawatt-hours) were recorded for each day.

econsumption
##     Mwh temp
## 1  16.3 29.3
## 2  16.8 21.7
## 3  15.5 23.7
## 4  18.2 10.4
## 5  15.2 29.7
## 6  17.5 11.9
## 7  19.8  9.0
## 8  19.0 23.4
## 9  17.5 17.8
## 10 16.0 30.0
## 11 19.6  8.6
## 12 18.0 11.8

a. Plot the data and find the regression model for Mwh with temperature as an explanatory variable. Why is there a negative relationship?

plot(Mwh ~ temp, data = econsumption, main = "Econsumption")
fit = lm(formula = Mwh  ~ temp, data = econsumption)
abline(fit, col=5)

summary(fit)
## 
## Call:
## lm(formula = Mwh ~ temp, data = econsumption)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2593 -0.5395 -0.1827  0.4274  2.1972 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.19952    0.73040   27.66 8.86e-11 ***
## temp        -0.14516    0.03549   -4.09  0.00218 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9888 on 10 degrees of freedom
## Multiple R-squared:  0.6258, Adjusted R-squared:  0.5884 
## F-statistic: 16.73 on 1 and 10 DF,  p-value: 0.00218

Negative Relationship: As the maximum temperature increases, electricity consumption decreases. This suggests heating dominates in this town: on colder days households run electric heaters, so consumption rises as the temperature falls. (If air conditioning dominated, we would expect the opposite sign, with hotter days driving consumption up.)


b. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

par(mfrow=c(2,2))
plot(fit)


Adequacy and Outliers: The model appears to fit the data well. Based on the Q-Q plot and the residuals-vs-fitted plot, observation 8 appears to be an outlier whose electricity use does not follow the temperature trend. Apart from that point, the observations satisfy the usual assumptions reasonably well: the Q-Q plot is close to linear and the residuals-vs-fitted plot shows roughly constant variance. However, with fewer than 30 observations (n = 12), these diagnostics should be treated with caution.
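Influence can also be checked numerically rather than by eye (a sketch; the 4/n cutoff is a common rule of thumb, not a hard threshold):

# Cook's distance flags influential observations; points above 4/n
# deserve a second look.
cd <- cooks.distance(fit)
which(cd > 4 / length(cd))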


c. Use the model to predict the electricity consumption that you would expect for a day with maximum temperature 10 and a day with maximum temperature 35. Do you believe these predictions?

coeffs = coefficients(fit)
pred_temp = c(10, 35)
p_temp = coeffs[1] + coeffs[2]*pred_temp 
p_temp
## [1] 18.74795 15.11902

Predictions: The prediction for 10 degrees sits inside the range of observed temperatures (8.6 to 30), so it seems believable. The prediction for 35 degrees extrapolates beyond the warmest day in the data, so it is less trustworthy; the relationship could even reverse at high temperatures as air conditioning kicks in. With fewer than 30 data points there may also be issues with the normality assumption.


d. Give prediction intervals for your forecasts.

par(mfrow=c(1,2))
fcast <- forecast(fit, newdata=data.frame(temp=10))
plot(fcast, xlab="temp", ylab="Mwh")
fcast2 <- forecast(fit, newdata=data.frame(temp=35))
plot(fcast2, xlab="temp", ylab="Mwh")

temp10 = data.frame(temp=10)
temp35 = data.frame(temp=35)
predict(fit, temp10, interval="predict") 
##        fit      lwr      upr
## 1 18.74795 16.34824 21.14766
predict(fit, temp35, interval="predict") 
##        fit      lwr      upr
## 1 15.11902 12.49768 17.74035

2. The following table gives the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2012 (data set olympic).

a. Update the data set olympic to include the winning times from the last few Olympics.

olympic1 <- matrix(c(1896, 54.2, 1900, 49.4, 1904, 49.2, 1908, 50.0, 1912, 48.2,
                     1920, 49.6, 1924, 47.6, 1928, 47.8, 1932, 46.2, 1936, 46.5,
                     1948, 46.2, 1952, 45.9, 1956, 46.7, 1960, 44.9, 1964, 45.1,
                     1968, 43.8, 1972, 44.66, 1976, 44.26, 1980, 44.6, 1984, 44.27,
                     1988, 43.87, 1992, 43.5, 1996, 43.49, 2000, 43.84, 2004, 44.0,
                     2008, 43.75, 2012, 43.94, 2016, 43.03), ncol=2, byrow=TRUE)
colnames(olympic1) <- c("Year","time")
olympic_ts <- ts(olympic1,start=1,end=28)

b. Plot the winning time against the year. Describe the main features of the scatterplot.

plot(time ~ Year, data = olympic_ts, main = "Olympic Gold Medal Times")


Scatterplot: The first observation appears to be an outlier: the 1896 winner ran the 400m in 54.2 seconds, well above every later time. The times trend steadily downward until about 1992, after which the winning times plateau around 43.65 seconds.


c. Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?

fit1 = lm(formula = time  ~ Year, data = olympic_ts)
plot(time ~ Year, data = olympic_ts, main = "Olympic Gold Medal Times")
abline(fit1, col=5)

summary(fit1)
## 
## Call:
## lm(formula = time ~ Year, data = olympic_ts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6002 -0.5747 -0.2858  0.5751  4.1505 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 172.481477  11.487522   15.02 2.52e-14 ***
## Year         -0.064574   0.005865  -11.01 2.75e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.136 on 26 degrees of freedom
## Multiple R-squared:  0.8234, Adjusted R-squared:  0.8166 
## F-statistic: 121.2 on 1 and 26 DF,  p-value: 2.752e-11

Decreasing Rate: The winning 400m time is decreasing at an average rate of about 0.065 seconds per year, i.e. roughly 0.26 seconds per four-year Olympiad.
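The per-Olympiad figure follows directly from the fitted slope, which is in seconds per year:

# Average change per Games: seconds-per-year slope times 4 years.
coef(fit1)["Year"] * 4  # 4 * -0.064574, approximately -0.258 seconds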


d. Plot the residuals against the year. What does this indicate about the suitability of the fitted line?

par(mfrow=c(2,2))
plot(fit1)


Suitability: The fitted line describes the data well overall. Based on the Q-Q plot and the residuals-vs-fitted plot, observation 1 (the 1896 race) is an outlier whose time does not follow the downward trend. Apart from that point, the residuals look reasonably consistent with the assumptions: the Q-Q plot is close to linear and the residuals-vs-fitted plot shows roughly constant variance. With only 28 observations, though, these diagnostics should be read cautiously.


e. Predict the winning time for the men’s 400 meters final in the 2000, 2004, 2008 and 2012 Olympics. Give a prediction interval for each of your forecasts. What assumptions have you made in these calculations?

coeffs1 = coefficients(fit1)
pred_time = c(2000, 2004, 2008, 2012)
p_time = coeffs1[1] + coeffs1[2]*pred_time
p_time
## [1] 43.33379 43.07549 42.81720 42.55890
par(mfrow=c(2,2))
fcast3 <- forecast(fit1, newdata=data.frame(Year=2000))
plot(fcast3, xlab="Year", ylab="time")
fcast4 <- forecast(fit1, newdata=data.frame(Year=2004))
plot(fcast4, xlab="Year", ylab="time")
fcast5 <- forecast(fit1, newdata=data.frame(Year=2008))
plot(fcast5, xlab="Year", ylab="time")
fcast6 <- forecast(fit1, newdata=data.frame(Year=2012))
plot(fcast6, xlab="Year", ylab="time")

time2000 = data.frame(Year=2000)
time2004 = data.frame(Year=2004)
time2008 = data.frame(Year=2008)
time2012 = data.frame(Year=2012)
predict(fit1, time2000, interval="predict") 
##        fit      lwr      upr
## 1 43.33379 40.90529 45.76229
predict(fit1, time2004, interval="predict") 
##        fit      lwr     upr
## 1 43.07549 40.63659 45.5144
predict(fit1, time2008, interval="predict") 
##       fit      lwr      upr
## 1 42.8172 40.36698 45.26741
predict(fit1, time2012, interval="predict") 
##       fit      lwr      upr
## 1 42.5589 40.09648 45.02132

Assumptions: These calculations assume the linear downward trend continues outside the fitted period and that the residuals are independent and normally distributed with constant variance. The 95% prediction interval then gives a range within which the actual winning times should fall.


f. Find out the actual winning times for these Olympics (see www.databaseolympics.com). How good were your forecasts and prediction intervals?

p_time
## [1] 43.33379 43.07549 42.81720 42.55890
olympic_ts[24:27,2]
## [1] 43.84 44.00 43.75 43.94

Forecast/Prediction Intervals: After 2000 the point forecasts become increasingly inaccurate, because the model cannot account for the plateau around 43.65 seconds. All of the actual winning times nevertheless fall within the prediction intervals.
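The point-forecast errors can be read off directly (a one-line sketch):

olympic_ts[24:27, 2] - p_time  # actual minus predicted, 2000-2012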


3. An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx)×(x/y). Consider the log-log model, log y = B0 + B1 log x + e.

Express y as a function of x and show that the coefficient B1 is the elasticity coefficient.

Starting from the log-log model,

log y = B0 + B1 log x + e.

Exponentiating both sides expresses y as a function of x:

y = e^(B0 + e) × x^B1.

Taking the differential of the log form (holding the error term constant) gives

dy/y = B1 (dx/x),

so that

B1 = (dy/y) / (dx/x) = (dy/dx) × (x/y),

which is exactly the definition of elasticity: the change in y relative to y, divided by the change in x relative to x. B1 is therefore the (partial) elasticity of the dependent variable with respect to the independent variable, i.e. the percentage change in y associated with a one-percent change in x. Multiplying through by 100 makes this explicit:

100 (dy/y) = B1 × 100 (dx/x),  i.e.  % change in y = B1 × % change in x.
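A quick numerical illustration (a sketch with simulated data; the names and parameters are made up for the example): fitting a log-log regression recovers the elasticity as the slope.

# Simulate y = 2 * x^1.5 with multiplicative noise; true elasticity is 1.5.
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 * x^1.5 * exp(rnorm(100, sd=0.1))
coef(lm(log(y) ~ log(x)))["log(x)"]  # estimated B1, close to 1.5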