Format: Group effort; the group representative will turn in your assignment. No conversations across groups regarding this project.
DUE: 6/27/20 by midnight ET
Submission: Via email – scott.burk@sps.cuny.edu
Submission: Word-readable document for the report (all in one); Excel-readable workbook (all in one, separate sheets)
File convention: Group#_Project1_Summer624, example Group1_Project1_Summer624
GRADE: 70% Report, 30% Forecast Accuracy
Your data is a de-identified Excel spreadsheet. Your assignment is to perform the appropriate analysis to forecast several series for 140 periods. You will have 1622 periods for your analysis. See Requirement #2 for more details.
You will turn in a written report. Write the report as if it will be routed around an office to personnel of very different backgrounds: it needs to reach readers with no data science background as well as full-fledged data scientists. So explain what you did, why, and how, in both layman's and technical terms. Do not write it with only me in mind. Visuals and output are expected, but it is not necessary to include every bit of analysis; in fact, a terse report in simple terminology is preferred to throwing everything into a long, ad nauseam report. Storytelling is really taking off in data science, so please flex your muscles here. The report is part 1 of the 2 requirements.
NOTE: We have covered a lot of material. I don’t want you to try every method to create your forecast. But you should perform fundamentals, visualize your data, and perform exploratory data analysis as appropriate.
Your second requirement is to produce forecasts. This workbook will contain at least 6 sheets where I will calculate your error rates. There will be one sheet (tab) for each group – S01, S02, S03, S04, S05, S06. Order each sheet by the variable SeriesInd (low to high). Your source data is sorted this way, except all 6 groups are present in one sheet, which you must break out into 6 tabs. You will submit the data I sent AND the forward forecast for 140 periods. I want you to forecast the following:
S01 – Forecast Var01, Var02
S02 – Forecast Var02, Var03
S03 – Forecast Var05, Var07
S04 – Forecast Var01, Var02
S05 – Forecast Var02, Var03
S06 – Forecast Var05, Var07
This group's assignment, S05, covers the variables Var02 and Var03. Our goal is to find the best forecast for Var02 and Var03 in S05. To that end, we first process the dataset to handle missing values and outliers. After some statistical analysis, we apply several models and check their accuracy.
library(fpp2)
## Warning: package 'fpp2' was built under R version 3.6.3
## Loading required package: ggplot2
## Loading required package: forecast
## Warning: package 'forecast' was built under R version 3.6.3
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Loading required package: fma
## Warning: package 'fma' was built under R version 3.6.3
## Loading required package: expsmooth
## Warning: package 'expsmooth' was built under R version 3.6.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(imputeTS)
## Warning: package 'imputeTS' was built under R version 3.6.3
full_data <- readxl::read_excel("Data Set for Class.xls")
head(full_data)
## # A tibble: 6 x 7
## SeriesInd group Var01 Var02 Var03 Var05 Var07
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 40669 S03 30.6 123432400 30.3 30.5 30.6
## 2 40669 S02 10.3 60855800 10.0 10.2 10.3
## 3 40669 S01 26.6 10369300 25.9 26.2 26.0
## 4 40669 S06 27.5 39335700 26.8 27.0 27.3
## 5 40669 S05 69.3 27809100 68.2 68.7 69.2
## 6 40669 S04 17.2 16587400 16.9 16.9 17.1
dt_s05 <- subset(full_data, group == 'S05', select = c(SeriesInd, Var02, Var03)) %>%
mutate(date = as.Date(SeriesInd, origin = "1900-01-01"))
summary(dt_s05)
## SeriesInd Var02 Var03 date
## Min. :40669 Min. : 4156600 Min. : 55.94 Min. :2011-05-08
## 1st Qu.:41304 1st Qu.: 11201200 1st Qu.: 77.21 1st Qu.:2013-01-31
## Median :41946 Median : 14578800 Median : 84.87 Median :2014-11-05
## Mean :41945 Mean : 16791350 Mean : 82.97 Mean :2014-11-03
## 3rd Qu.:42586 3rd Qu.: 20021200 3rd Qu.: 89.74 3rd Qu.:2016-08-06
## Max. :43221 Max. :118023500 Max. :103.95 Max. :2018-05-03
## NA's :141 NA's :145
str(dt_s05)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1762 obs. of 4 variables:
## $ SeriesInd: num 40669 40670 40671 40672 40673 ...
## $ Var02 : num 27809100 30174700 35044700 27192100 24891800 ...
## $ Var03 : num 68.2 68.8 69.3 69.4 69.2 ...
## $ date : Date, format: "2011-05-08" "2011-05-09" ...
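A note on the date conversion above: SeriesInd behaves like an Excel serial day number, and R's usual origin for Excel (Windows) serial dates is "1899-12-30" rather than "1900-01-01", so the dates above are shifted forward by two days. This only affects the plot axes, not the modelling, but a quick check of the two origins on the first SeriesInd value makes the difference visible:
as.Date(40669, origin = "1900-01-01") # 2011-05-08, the convention used above
as.Date(40669, origin = "1899-12-30") # 2011-05-06, R's usual Excel origin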
# Keep only the analysis window; SeriesInd already tops out at 43221, so this filter drops nothing and serves as a safeguard
dt_s05 <- filter(dt_s05, SeriesInd <= 43221)
summary(dt_s05)
## SeriesInd Var02 Var03 date
## Min. :40669 Min. : 4156600 Min. : 55.94 Min. :2011-05-08
## 1st Qu.:41304 1st Qu.: 11201200 1st Qu.: 77.21 1st Qu.:2013-01-31
## Median :41946 Median : 14578800 Median : 84.87 Median :2014-11-05
## Mean :41945 Mean : 16791350 Mean : 82.97 Mean :2014-11-03
## 3rd Qu.:42586 3rd Qu.: 20021200 3rd Qu.: 89.74 3rd Qu.:2016-08-06
## Max. :43221 Max. :118023500 Max. :103.95 Max. :2018-05-03
## NA's :141 NA's :145
str(dt_s05)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1762 obs. of 4 variables:
## $ SeriesInd: num 40669 40670 40671 40672 40673 ...
## $ Var02 : num 27809100 30174700 35044700 27192100 24891800 ...
## $ Var03 : num 68.2 68.8 69.3 69.4 69.2 ...
## $ date : Date, format: "2011-05-08" "2011-05-09" ...
dt_s05_v2 <- dt_s05 %>% select(Var02)
dt_s05_v2 <- dt_s05_v2[1:1622,]
dt_s05_v3 <- dt_s05 %>% select(Var03)
dt_s05_v3 <- dt_s05_v3[1:1622,]
ggplot(dt_s05, aes(x=date, y=Var02))+geom_line()+ xlab("Time") + ylab("S05_V02")
## Warning: Removed 140 rows containing missing values (geom_path).
ggplot(dt_s05, aes(x=date, y=Var03))+geom_line()+ xlab("Time") + ylab("S05_V03")
## Warning: Removed 140 rows containing missing values (geom_path).
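Both warnings point to the same thing: the final 140 periods of each series are empty placeholders for the forward forecast. A quick sanity check, assuming the missing values sit at the end of the group:
# The last 140 rows are the forecast horizon, so both variables should be all NA there
sum(is.na(tail(dt_s05, 140)$Var02)) # expected: 140
sum(is.na(tail(dt_s05, 140)$Var03)) # expected: 140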
Explore the Var02 subset
summary(dt_s05_v2)
## Var02
## Min. : 4156600
## 1st Qu.: 11201200
## Median : 14578800
## Mean : 16791350
## 3rd Qu.: 20021200
## Max. :118023500
## NA's :1
str(dt_s05_v2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1622 obs. of 1 variable:
## $ Var02: num 27809100 30174700 35044700 27192100 24891800 ...
Var02 has one missing value among the 1622 analysis periods.
Explore the Var03 subset
summary(dt_s05_v3)
## Var03
## Min. : 55.94
## 1st Qu.: 77.21
## Median : 84.87
## Mean : 82.97
## 3rd Qu.: 89.74
## Max. :103.95
## NA's :5
str(dt_s05_v3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1622 obs. of 1 variable:
## $ Var03: num 68.2 68.8 69.3 69.4 69.2 ...
Var03 has five missing values among the 1622 analysis periods. The median (84.87) and mean (82.97) are close to each other, so the distribution is not badly skewed.
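Before interpolating, it is worth confirming where the gaps fall; with so few, isolated gaps, linear interpolation is a safe choice. A quick sketch:
# Positions of the missing observations in each 1622-period subset
which(is.na(dt_s05_v2$Var02))
which(is.na(dt_s05_v3$Var03))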
dt_s05_v2 <- na_interpolation(dt_s05_v2)
summary(dt_s05_v2)
## Var02
## Min. : 4156600
## 1st Qu.: 11206550
## Median : 14575700
## Mean : 16788872
## 3rd Qu.: 20014325
## Max. :118023500
str(dt_s05_v2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1622 obs. of 1 variable:
## $ Var02: num 27809100 30174700 35044700 27192100 24891800 ...
dt_s05_v3 <- na_interpolation(dt_s05_v3)
summary(dt_s05_v3)
## Var03
## Min. : 55.94
## 1st Qu.: 77.25
## Median : 84.88
## Mean : 82.97
## 3rd Qu.: 89.71
## Max. :103.95
dt_s05_v2 <- ts(dt_s05_v2)
str(dt_s05_v2)
## Time-Series [1:1622, 1] from 1 to 1622: 27809100 30174700 35044700 27192100 24891800 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Var02"
dt_s05_v3 <- ts(dt_s05_v3)
str(dt_s05_v3)
## Time-Series [1:1622, 1] from 1 to 1622: 68.2 68.8 69.3 69.4 69.2 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "Var03"
autoplot(dt_s05_v2) + ggtitle("Time Series S05-Var02")
A trended, non-seasonal time series.
autoplot(dt_s05_v3) + ggtitle("Time Series S05-Var03")
Also a trended, non-seasonal time series; it may be cyclic as well.
par(mfrow= c(1,2))
hist(dt_s05_v2)
boxplot(dt_s05_v2)
Var02 is right-skewed. We can suppress the most extreme outliers that skew the distribution and then transform the data.
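As an illustration only (the modelling below relies on a Box-Cox transformation chosen by BoxCox.lambda instead), a log transform pulls in Var02's long right tail:
# Var02 is far more symmetric on the log scale
hist(log(dt_s05_v2), main = "Histogram of log(Var02)")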
par(mfrow= c(1,2))
hist(dt_s05_v3)
boxplot(dt_s05_v3)
Var03 is nearly normally distributed and has outliers on the left.
# check outlier(s)
dt_s05_v2_out <- tsoutliers(dt_s05_v2)
dt_s05_v3_out <- tsoutliers(dt_s05_v3)
There are 31 extreme values in the Var02 series and only 1 in the Var03 series. We use tsclean() to replace the outliers and clean both time series.
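These counts come straight from the tsoutliers() output; to reproduce them:
# tsoutliers() returns the positions ($index) and suggested replacements ($replacements) of the flagged points
length(dt_s05_v2_out$index) # 31
length(dt_s05_v3_out$index) # 1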
dt_s05_v2 <- tsclean(dt_s05_v2)
dt_s05_v3 <- tsclean(dt_s05_v3)
autoplot(dt_s05_v2)
autoplot(dt_s05_v3)
There is a moderately strong negative linear relationship between Var02 and Var03, as the correlation test below confirms.
cor.test(dt_s05_v2, dt_s05_v3)
##
## Pearson's product-moment correlation
##
## data: dt_s05_v2 and dt_s05_v3
## t = -40.859, df = 1620, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7355716 -0.6875738
## sample estimates:
## cor
## -0.7124049
lmodel <- lm(dt_s05_v2 ~ dt_s05_v3)
plot(x = dt_s05_v3, y = dt_s05_v2)
abline(lmodel)
(cor(dt_s05_v2,dt_s05_v3))^2
## Var03
## Var02 0.5075207
With R-squared of about 0.51, each variable explains roughly 50.75% of the variance in the other.
To confirm that S05-Var02 is not stationary, we plot its ACF and then the ACF of its first difference.
ggAcf(dt_s05_v2)
autoplot(diff(dt_s05_v2))
ggAcf(diff(dt_s05_v2))
There is first-order autocorrelation, and the data show no seasonality.
ACF of the Var03 difference
autoplot(diff(dt_s05_v3))
ggAcf(diff(dt_s05_v3))
The ACF again shows first-order correlation; seasonality is insignificant.
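A formal unit-root check backs up this visual impression. A sketch using forecast's ndiffs(), which applies a KPSS test by default and reports how many differences are needed for stationarity (likely 1 here, matching the d = 1 that auto.arima selects further below):
ndiffs(dt_s05_v2)
ndiffs(dt_s05_v3)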
dt_s05_v2_train <- window(dt_s05_v2, end = as.integer(length(dt_s05_v2)*0.8))
dt_s05_v3_train <- window(dt_s05_v3, end = as.integer(length(dt_s05_v3)*0.8))
lambda2 <- BoxCox.lambda(dt_s05_v2)
lambda3 <- BoxCox.lambda(dt_s05_v3)
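For reference, the selected Box-Cox parameters (their values appear in the model summaries at the end of the report):
lambda2 # about -0.094, close to a log transform for the large, skewed Var02
lambda3 # about 1.945, a mild power transform for Var03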
The forecast horizon here is the length of the test set.
h_test <- length(dt_s05_v2)- as.integer(length(dt_s05_v2)*0.8)
h_test
## [1] 325
Fit the ARIMA model for Var02
dt_s05_v2_farima_fit <- dt_s05_v2_train %>% auto.arima(lambda = lambda2, stepwise = FALSE) %>% forecast(h = h_test)
Fit the ARIMA model for Var03
dt_s05_v3_farima_fit <- dt_s05_v3_train %>% auto.arima(lambda = lambda3, stepwise = FALSE) %>% forecast(h = h_test)
Fit the ETS model for Var02
dt_s05_v2_fets_fit <- dt_s05_v2_train %>% ets(lambda = lambda2) %>% forecast(h = h_test)
Fit the ETS model for Var03
dt_s05_v3_fets_fit <- dt_s05_v3_train %>% ets(lambda = lambda3) %>% forecast(h = h_test)
Naive method for Var02
dt_s05_v2_naive_fit <- naive(dt_s05_v2_train, h=h_test)
Naive method for Var03
dt_s05_v3_naive_fit <- naive(dt_s05_v3_train, h=h_test)
arima model
s05_v2_farima_ac <- accuracy(dt_s05_v2_farima_fit, dt_s05_v2)["Test set", "MAPE"]
ets model
s05_v2_ets_ac <- accuracy(dt_s05_v2_fets_fit, dt_s05_v2)["Test set", "MAPE"]
naive model
s05_v2_naive_ac <- accuracy(dt_s05_v2_naive_fit, dt_s05_v2)["Test set", "MAPE"]
Comparing the MAPEs on the test set, the ETS model for Var02 performs best (see the table below), with the naive method close behind.
arima model
s05_v3_farima_ac <- accuracy(dt_s05_v3_farima_fit, dt_s05_v3)["Test set", "MAPE"]
ets model
s05_v3_ets_ac <- accuracy(dt_s05_v3_fets_fit, dt_s05_v3)["Test set", "MAPE"]
naive method
s05_v3_naive_ac <- accuracy(dt_s05_v3_naive_fit, dt_s05_v3)["Test set", "MAPE"]
Using MAPE for accuracy, the naive method for Var03 beats the ARIMA model on the same data, and the ETS model is best by a very small margin.

### MAPE ACCURACY
s05_v2_MAPE <- c(s05_v2_farima_ac, s05_v2_ets_ac, s05_v2_naive_ac)
s05_v3_MAPE <- c(s05_v3_farima_ac, s05_v3_ets_ac, s05_v3_naive_ac)
s05_MAPE <- matrix(rbind(s05_v2_MAPE, s05_v3_MAPE), nrow = 2 )
rownames(s05_MAPE) <- c("S05_V02", "S05_V03")
colnames(s05_MAPE) <- c("Arima", "ETS", "Naive")
data.frame(s05_MAPE)
## Arima ETS Naive
## S05_V02 29.847124 26.074663 26.807647
## S05_V03 9.104011 9.057531 9.057559
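A programmatic read of the table confirms the choice (sketch):
# Model with the lowest test-set MAPE for each series; ETS wins on both
apply(s05_MAPE, 1, function(x) colnames(s05_MAPE)[which.min(x)])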
dt_s05_v2_farima <- dt_s05_v2 %>% auto.arima(lambda = lambda2, stepwise = FALSE) %>% forecast(h = 140)
dt_s05_v3_farima <- dt_s05_v3 %>% auto.arima(lambda = lambda3, stepwise = FALSE) %>% forecast(h = 140)
##### Var02 forecast
autoplot(dt_s05_v2_farima)+autolayer(dt_s05_v2_farima_fit, alpha = 0.65)
##### Var03 forecast
autoplot(dt_s05_v3_farima) + autolayer(dt_s05_v3_farima_fit, alpha = 0.65)
checkresiduals(dt_s05_v2_farima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,2)
## Q* = 10.064, df = 7, p-value = 0.185
##
## Model df: 3. Total lags used: 10
With a p-value greater than 0.05, the Ljung-Box test gives no evidence against white noise, so the Var02 residuals are consistent with white noise. The ACF shows the residuals are uncorrelated, and the histogram shows they are approximately normally distributed.
checkresiduals(dt_s05_v3_farima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,3)
## Q* = 5.2616, df = 7, p-value = 0.6281
##
## Model df: 3. Total lags used: 10
With a p-value greater than 0.05, the Ljung-Box test gives no evidence against white noise, so the Var03 residuals are consistent with white noise. The ACF shows the residuals are uncorrelated, and the histogram shows they are approximately normally distributed.
dt_s05_v2_fets <- dt_s05_v2 %>% ets(lambda = lambda2) %>% forecast(h = 140)
dt_s05_v3_fets <- dt_s05_v3 %>% ets(lambda = lambda3) %>% forecast(h = 140)
autoplot(dt_s05_v2_fets) + autolayer(dt_s05_v2_fets_fit, alpha = 0.65)
autoplot(dt_s05_v3_fets) + autolayer(dt_s05_v3_fets_fit, alpha = 0.60)
checkresiduals(dt_s05_v2_fets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 60.114, df = 8, p-value = 4.427e-10
##
## Model df: 2. Total lags used: 10
checkresiduals(dt_s05_v3_fets)
##
## Ljung-Box test
##
## data: Residuals from ETS(A,N,N)
## Q* = 36.383, df = 8, p-value = 1.494e-05
##
## Model df: 2. Total lags used: 10
The p-values for the ETS model residuals are less than 0.05, so those residuals are not white noise and the ETS prediction intervals are too wide. Even so, the ETS models achieved the best accuracy on the test set.
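To quantify "too wide", one can compare the 95% interval widths of the ETS and ARIMA forecasts at the end of the 140-period horizon (sketch):
# Width of the final 95% prediction interval for Var02, ETS vs ARIMA
tail(dt_s05_v2_fets$upper[, "95%"] - dt_s05_v2_fets$lower[, "95%"], 1)
tail(dt_s05_v2_farima$upper[, "95%"] - dt_s05_v2_farima$lower[, "95%"], 1)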
Below, each model is refit on the entire dataset for S05-Var02 and S05-Var03; the summaries report the training-set error measures, including MAPE, for both the ARIMA and ETS models.
summary(dt_s05_v2 %>% auto.arima(lambda = lambda2, stepwise = FALSE))
## Series: .
## ARIMA(1,1,2)
## Box Cox transformation: lambda= -0.09436074
##
## Coefficients:
## ar1 ma1 ma2
## 0.7080 -1.2921 0.3270
## s.e. 0.0552 0.0669 0.0584
##
## sigma^2 estimated as 0.002298: log likelihood=2625.1
## AIC=-5242.21 AICc=-5242.18 BIC=-5220.64
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 302903.7 3891882 2834839 -2.938515 17.72102 0.8743672
## ACF1
## Training set 0.01981952
summary(dt_s05_v2 %>% ets(lambda = lambda2))
## ETS(A,N,N)
##
## Call:
## ets(y = ., lambda = lambda2)
##
## Box-Cox transformation: lambda= -0.0944
##
## Smoothing parameters:
## alpha = 0.376
##
## Initial states:
## l = 8.5044
##
## sigma: 0.0492
##
## AIC AICc BIC
## 2219.804 2219.819 2235.978
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 263534.8 3965352 2890727 -2.773874 18.10502 0.891605
## ACF1
## Training set 0.08579925
summary(dt_s05_v3 %>% auto.arima(lambda = lambda3, stepwise = FALSE))
## Series: .
## ARIMA(0,1,3)
## Box Cox transformation: lambda= 1.944948
##
## Coefficients:
## ma1 ma2 ma3
## 0.1260 -0.0278 -0.0566
## s.e. 0.0248 0.0254 0.0256
##
## sigma^2 estimated as 3190: log likelihood=-8837.51
## AIC=17683.02 AICc=17683.04 BIC=17704.58
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0129888 0.8849026 0.6428464 0.01046983 0.7956071 0.9915353
## ACF1
## Training set -0.006240098
summary(dt_s05_v3 %>% ets(lambda = lambda3))
## ETS(A,N,N)
##
## Call:
## ets(y = ., lambda = lambda3)
##
## Box-Cox transformation: lambda= 1.9449
##
## Smoothing parameters:
## alpha = 0.9999
##
## Initial states:
## l = 1943.0963
##
## sigma: 57.0135
##
## AIC AICc BIC
## 25109.30 25109.32 25125.48
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.01270981 0.8929011 0.6484972 0.00960707 0.8029787 1.000251
## ACF1
## Training set 0.1155351
# RMSE
print(paste0("RMSE for S05 Var02 is ", MLmetrics::RMSE(dt_s05_v2_farima_fit$fitted, dt_s05_v2)))
## [1] "RMSE for S05 Var02 is 4014883.09671773"
print(paste0("RMSE for S05 Var03 is ", MLmetrics::RMSE(dt_s05_v3_fets_fit$fitted, dt_s05_v3)))
## [1] "RMSE for S05 Var03 is 0.877534828336508"
# MAPE (MLmetrics::MAPE returns a proportion; multiply by 100 to compare with accuracy()'s percentage scale)
print(paste0("MAPE for S05 Var02 is ", MLmetrics::MAPE(dt_s05_v2_farima_fit$fitted, dt_s05_v2)))
## [1] "MAPE for S05 Var02 is 0.1752854110183"
print(paste0("MAPE for S05 Var03 is ", MLmetrics::MAPE(dt_s05_v3_farima_fit$fitted, dt_s05_v3)))
## [1] "MAPE for S05 Var03 is 0.00770590795726415"
# The forward forecasts are the $mean component of the forecast objects, not the in-sample $fitted values
# library(xlsx)
# write.xlsx(as.data.frame(dt_s05_v2_fets$mean), "c:\\DATA624\\S05_V02.xlsx")
# write.xlsx(as.data.frame(dt_s05_v3_fets$mean), "c:\\DATA624\\S05_V03.xlsx")
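A fuller export sketch, assuming the 140 forward SeriesInd keys are simply rows 1623 to 1762 of the group data (file name hypothetical):
# Pair each forward forecast with its SeriesInd key, as the submission format requires
s05_forecast <- data.frame(
  SeriesInd = dt_s05$SeriesInd[1623:1762],
  Var02 = as.numeric(dt_s05_v2_fets$mean),
  Var03 = as.numeric(dt_s05_v3_fets$mean))
# writexl::write_xlsx(s05_forecast, "S05_forecast.xlsx") # writexl avoids xlsx's Java dependency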