Preparation: Loading Libraries

Q1 Part A

In Part A, I want you to forecast how much cash is taken out of 4 different ATM machines for May 2010. The data is given in a single file. The variable ‘Cash’ is provided in hundreds of dollars; other than that it is straightforward. I am being somewhat ambiguous on purpose to give this more of a business feeling. Explain and demonstrate your process, techniques used and not used, and your actual forecast. I am giving you data via an Excel file; please provide your written report on your findings, visuals, discussion, and your R code via an RPubs link along with the actual .rmd file. Also please submit the forecast, which you will put in an Excel readable file.

Part 1: Loading the data and fixing the format

First of all, we load the Excel file as a data set and remove the NA rows.

Then we check the summary to see whether removing the NAs makes any difference.

As we can see, the descriptive statistics for “Cash” are unchanged while the observation count drops from 1474 to 1455, matching the 19 NA values the column originally contained.

atm <- read_excel("ATM624Data.xlsx")
atmwithoutna<-na.omit(atm)

summary(atm)
##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1474        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40118   Mode  :character   Median :   73.0  
##  Mean   :40118                      Mean   :  155.6  
##  3rd Qu.:40210                      3rd Qu.:  114.0  
##  Max.   :40312                      Max.   :10919.8  
##                                     NA's   :19
summary(atmwithoutna)
##       DATE           ATM                 Cash        
##  Min.   :39934   Length:1455        Min.   :    0.0  
##  1st Qu.:40026   Class :character   1st Qu.:    0.5  
##  Median :40117   Mode  :character   Median :   73.0  
##  Mean   :40116                      Mean   :  155.6  
##  3rd Qu.:40208                      3rd Qu.:  114.0  
##  Max.   :40298                      Max.   :10919.8
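Alternatively, we can count the missing values per column directly rather than reading them off summary(); a quick check on the atm tibble loaded above:

# Count NAs per column; only Cash should report missing values (19)
colSums(is.na(atm))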

Next we inspect the loaded data set. One thing we can see is that the “DATE” column is not in a date format (it holds Excel serial day numbers); this is fixed in the following steps.

head(atmwithoutna)
## # A tibble: 6 × 3
##    DATE ATM    Cash
##   <dbl> <chr> <dbl>
## 1 39934 ATM1     96
## 2 39934 ATM2    107
## 3 39935 ATM1     82
## 4 39935 ATM2     89
## 5 39936 ATM1     85
## 6 39936 ATM2     90

Then we convert it to the correct Date format. Let's check again!

# Excel serial dates count days from an origin of 1899-12-30
atmwithoutna$DATE <- as.Date(atmwithoutna$DATE, origin = "1899-12-30")
atmtouse <- atmwithoutna
head(atmtouse)
## # A tibble: 6 × 3
##   DATE       ATM    Cash
##   <date>     <chr> <dbl>
## 1 2009-05-01 ATM1     96
## 2 2009-05-01 ATM2    107
## 3 2009-05-02 ATM1     82
## 4 2009-05-02 ATM2     89
## 5 2009-05-03 ATM1     85
## 6 2009-05-03 ATM2     90

Exploring data

I used as_tsibble() to create a tsibble object; otherwise autoplot() throws an error, as I encountered in the previous assignment.

In the first autoplot we can see an extreme outlier in ATM4, but it makes the other ATMs hard to read, so I create a second, faceted autoplot.

ATM1 and ATM2 show seasonality, and ATM3 is not active until the very end of the timeline.

Let's take a look.

atmtouse <- as_tsibble(atmtouse,key = ATM, index = DATE)

#1st autoplot
autoplot(atmtouse)+
  labs(title = "ATM Data with Outlier")
## Plot variable not specified, automatically selected `.vars = Cash`

#2nd autoplot
atmtouse %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Data with Outlier")

# Drop the single extreme ATM4 observation
atmtouse = atmtouse[atmtouse$Cash < 9500,]

atmtouse %>%
  autoplot(Cash) +
  facet_wrap(~ATM, scales = "free", nrow = 4) +
  labs(title = "ATM Data after Outlier Removal")

summary(atmtouse)
##       DATE                ATM                 Cash        
##  Min.   :2009-05-01   Length:1454        Min.   :   0.00  
##  1st Qu.:2009-08-01   Class :character   1st Qu.:   0.25  
##  Median :2009-10-30   Mode  :character   Median :  73.00  
##  Mean   :2009-10-30                      Mean   : 148.18  
##  3rd Qu.:2010-01-29                      3rd Qu.: 114.00  
##  Max.   :2010-04-30                      Max.   :1712.08
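The hard-coded 9500 cutoff works for this one extreme point, but a less ad hoc alternative is tsoutliers() from the forecast package, which flags outliers and proposes replacements. A sketch, run on the pre-filter data (atmwithoutna), since atmtouse no longer contains the outlier:

# Flag outliers in the raw ATM4 series and show suggested replacements
atm4_raw <- ts(subset(atmwithoutna, ATM == "ATM4")$Cash)
out <- tsoutliers(atm4_raw)
out$index         # positions of flagged observations
out$replacements  # interpolated values tsoutliers() would substitute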

Let’s split ATM1, 2, 3, and 4 into separate data sets. We also check the minimum date, which confirms the one-year series starts on May 1, 2009.

atm1 <- subset(atmtouse,ATM == "ATM1")
atm2 <- subset(atmtouse,ATM == "ATM2")
atm3 <- subset(atmtouse,ATM == "ATM3")
atm4 <- subset(atmtouse,ATM == "ATM4")

min(atmtouse$DATE)
## [1] "2009-05-01"
# May 1, 2009 is the 121st day of the year, hence start = c(2009, 121)
ATM1_TS<-ts(atm1$Cash, frequency = 365, start=c(2009,121))
ATM2_TS<-ts(atm2$Cash, frequency = 365, start=c(2009,121))
ATM3_TS<-ts(atm3$Cash, frequency = 365, start=c(2009,121))
ATM4_TS<-ts(atm4$Cash, frequency = 365, start=c(2009,121))

Comparing the four ATMs: ATM1 and ATM2 show seasonal variation, ATM3 has only a few days of data, and ATM4 appears stationary but highly variable.

autoplot(ATM1_TS)+ xlab("Time") + ylab("Cash (in hundreds of dollars)") + ggtitle("ATM 1 Time Series")

autoplot(ATM2_TS)+ xlab("Time") + ylab("Cash (in hundreds of dollars)") + ggtitle("ATM 2 Time Series")

autoplot(ATM3_TS)+ xlab("Time") + ylab("Cash (in hundreds of dollars)") + ggtitle("ATM 3 Time Series")

autoplot(ATM4_TS)+ xlab("Time") + ylab("Cash (in hundreds of dollars)") + ggtitle("ATM 4 Time Series")

ATM 1

We can see the PACF shows a spike at lag 7, which indicates weekly seasonality. It is clearer if we change the frequency to 7.

ggtsdisplay(ATM1_TS, points = FALSE)

atm1_ts2 <- ts(ATM1_TS, frequency = 7)
ggseasonplot(atm1_ts2)

ggtsdisplay(atm1_ts2, points = FALSE)

From the plots, the data look roughly normally distributed and stationary.

checkresiduals(ATM1_TS)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

ndiffs(ATM1_TS)
## [1] 0
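ndiffs() returning 0 suggests no differencing is needed. As a cross-check, a KPSS test (a sketch using ur.kpss() from the urca package, which we also use in Part B) should yield a test statistic below the critical values:

library(urca)

# KPSS null hypothesis: the series is level-stationary. A test statistic
# below the 5% critical value is consistent with ndiffs() == 0.
summary(ur.kpss(ATM1_TS))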
fit_ATM1 <- auto.arima(ATM1_TS, seasonal=FALSE,
  stepwise=FALSE, approximation=FALSE)
fit_ATM1
## Series: ATM1_TS 
## ARIMA(0,0,5) with non-zero mean 
## 
## Coefficients:
##          ma1      ma2      ma3     ma4     ma5     mean
##       0.1930  -0.4207  -0.3307  0.1460  0.4235  83.7885
## s.e.  0.0512   0.0487   0.0610  0.0449  0.0546   1.8154
## 
## sigma^2 = 1193:  log likelihood = -1793.58
## AIC=3601.15   AICc=3601.47   BIC=3628.39

We then use the model to forecast May 2010; since May has 31 days, h = 31.

autoplot(forecast(fit_ATM1, h=31))

ATM 2

We do the same thing. Again the PACF shows a spike at lag 7, indicating weekly seasonality, which is clearer with the frequency changed to 7.

ggtsdisplay(ATM2_TS, points = FALSE)

atm2_ts2 <- ts(ATM2_TS, frequency = 7)
ggseasonplot(atm2_ts2)

ggtsdisplay(atm2_ts2, points = FALSE)

From the plots, the data do not look normally distributed or stationary (ndiffs() below returns 1), so we try a Box-Cox transformation. Note in the output below that BoxCox.lambda() returns 1, so the transformation amounts to a simple shift of -1 and changes very little.

We then fit a model and forecast the 31 days of May 2010.

checkresiduals(ATM2_TS)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

ndiffs(ATM2_TS)
## [1] 1
lambda <- BoxCox.lambda(ATM2_TS)
print(lambda)
## [1] 1
ATM2_TS_BC <- BoxCox(ATM2_TS, lambda)
autoplot(BoxCox(ATM2_TS, lambda)) +
  ggtitle("BoxCox Transformation for ATM 2")

checkresiduals(ATM2_TS_BC)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

ndiffs(ATM2_TS_BC)
## [1] 1
fit_ATM2<-auto.arima(ATM2_TS_BC, seasonal=FALSE,
  stepwise=FALSE, approximation=FALSE)
fit_ATM2
## Series: ATM2_TS_BC 
## ARIMA(0,1,5) with drift 
## 
## Coefficients:
##           ma1      ma2     ma3     ma4      ma5    drift
##       -1.0175  -0.3751  0.0672  0.6416  -0.3048  -0.0600
## s.e.   0.0499   0.0608  0.0700  0.0826   0.0544   0.0249
## 
## sigma^2 = 1186:  log likelihood = -1794.51
## AIC=3603.01   AICc=3603.33   BIC=3630.26
autoplot(forecast(fit_ATM2, h=31))

ATM 3

Unlike ATM1 and ATM2, ggtsdisplay() tells us very little here. ATM3 has only a handful of non-zero observations at the very end of the series, so I do not think a model is warranted for such an immaterial amount of data.

ggtsdisplay(ATM3_TS, points = FALSE)
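If a May 2010 number is still required for ATM3, one defensible fallback is a flat mean forecast over the few days the machine was actually active; a sketch (restricting to the non-zero observations is my assumption):

# ATM3 dispensed cash on only a handful of days, so use the mean of
# those non-zero observations as a flat 31-day forecast for May 2010
atm3_active <- atm3$Cash[atm3$Cash > 0]
fc_atm3 <- meanf(ts(atm3_active), h = 31)
fc_atm3$mean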

ATM 4

ggtsdisplay(ATM4_TS, points = FALSE)

atm4_ts2 <- ts(ATM4_TS, frequency = 7)
ggseasonplot(atm4_ts2)

ggtsdisplay(atm4_ts2, points = FALSE)

checkresiduals(ATM4_TS)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

ndiffs(ATM4_TS)
## [1] 0
lambda <- BoxCox.lambda(ATM4_TS)
print(lambda)
## [1] 1
ATM4_TS_BC <- BoxCox(ATM4_TS, lambda)
autoplot(BoxCox(ATM4_TS, lambda)) +
  ggtitle("BoxCox Transformation for ATM 4")

checkresiduals(ATM4_TS_BC)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

ndiffs(ATM4_TS_BC)
## [1] 0
fit_ATM4<-auto.arima(ATM4_TS_BC, seasonal=FALSE,
  stepwise=FALSE, approximation=FALSE)
fit_ATM4
## Series: ATM4_TS_BC 
## ARIMA(3,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      mean
##       0.7059  -0.9121  0.0234  -0.6448  0.9554  444.2817
## s.e.  0.0631   0.0816  0.0568   0.0351  0.0513   19.9944
## 
## sigma^2 = 120566:  log likelihood = -2643.2
## AIC=5300.41   AICc=5300.72   BIC=5327.69
autoplot(forecast(fit_ATM4, h=31))

Of the three fitted models, ATM1's has the lowest AIC and BIC, though strictly speaking AIC and BIC are only comparable between models fitted to the same series, so this is a rough indication rather than a formal ranking.

summary(fit_ATM1)
## Series: ATM1_TS 
## ARIMA(0,0,5) with non-zero mean 
## 
## Coefficients:
##          ma1      ma2      ma3     ma4     ma5     mean
##       0.1930  -0.4207  -0.3307  0.1460  0.4235  83.7885
## s.e.  0.0512   0.0487   0.0610  0.0449  0.0546   1.8154
## 
## sigma^2 = 1193:  log likelihood = -1793.58
## AIC=3601.15   AICc=3601.47   BIC=3628.39
## 
## Training set error measures:
##                      ME     RMSE      MAE       MPE    MAPE MASE        ACF1
## Training set 0.04080169 34.25518 27.38142 -122.1883 146.046  NaN -0.07748216
summary(fit_ATM2)
## Series: ATM2_TS_BC 
## ARIMA(0,1,5) with drift 
## 
## Coefficients:
##           ma1      ma2     ma3     ma4      ma5    drift
##       -1.0175  -0.3751  0.0672  0.6416  -0.3048  -0.0600
## s.e.   0.0499   0.0608  0.0700  0.0826   0.0544   0.0249
## 
## sigma^2 = 1186:  log likelihood = -1794.51
## AIC=3603.01   AICc=3603.33   BIC=3630.26
## 
## Training set error measures:
##                       ME     RMSE      MAE  MPE MAPE MASE         ACF1
## Training set -0.04142577 34.10204 28.38481 -Inf  Inf  NaN -0.005533458
summary(fit_ATM4)
## Series: ATM4_TS_BC 
## ARIMA(3,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2      mean
##       0.7059  -0.9121  0.0234  -0.6448  0.9554  444.2817
## s.e.  0.0631   0.0816  0.0568   0.0351  0.0513   19.9944
## 
## sigma^2 = 120566:  log likelihood = -2643.2
## AIC=5300.41   AICc=5300.72   BIC=5327.69
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE MASE        ACF1
## Training set 0.319843 344.3522 284.6138 -822.4449 855.2923  NaN 0.001934766
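Finally, the assignment asks for the forecasts in an Excel-readable file. A minimal sketch (the file name is my own choice; because lambda = 1 for ATM2 and ATM4, InvBoxCox() just shifts the transformed forecasts back by +1):

# Collect the May 2010 point forecasts on the original scale
# (hundreds of dollars) and write them to a CSV that Excel can open
fc1 <- forecast(fit_ATM1, h = 31)$mean
fc2 <- InvBoxCox(forecast(fit_ATM2, h = 31)$mean, lambda = 1)
fc4 <- InvBoxCox(forecast(fit_ATM4, h = 31)$mean, lambda = 1)
may2010 <- seq(as.Date("2010-05-01"), as.Date("2010-05-31"), by = "day")
atm_fc <- data.frame(DATE = may2010, ATM1 = as.numeric(fc1),
                     ATM2 = as.numeric(fc2), ATM4 = as.numeric(fc4))
write.csv(atm_fc, "ATM624_forecast.csv", row.names = FALSE)

ATM3's flat mean forecast from the sketch above could be bound in as a fourth column.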

Q2 Part B

Part B consists of a simple dataset of residential power usage from January 1998 until December 2013. Your assignment is to model these data and provide a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in kilowatt hours; the rest is straightforward. Add this to your existing files above.

Part 1: Loading the data and fixing the format

We load the data and check the head and summary.

However, notice that min() on the date column does not give the correct answer.

RCFload <- read_excel("ResidentialCustomerForecastLoad-624.xlsx")
head(RCFload)
## # A tibble: 6 × 3
##   CaseSequence `YYYY-MMM`     KWH
##          <dbl> <chr>        <dbl>
## 1          733 1998-Jan   6862583
## 2          734 1998-Feb   5838198
## 3          735 1998-Mar   5420658
## 4          736 1998-Apr   5010364
## 5          737 1998-May   4665377
## 6          738 1998-Jun   6467147
min(RCFload$`YYYY-MMM`) # not working: compares text alphabetically
## [1] "1998-Apr"
summary(RCFload)
##   CaseSequence     YYYY-MMM              KWH          
##  Min.   :733.0   Length:192         Min.   :  770523  
##  1st Qu.:780.8   Class :character   1st Qu.: 5429912  
##  Median :828.5   Mode  :character   Median : 6283324  
##  Mean   :828.5                      Mean   : 6502475  
##  3rd Qu.:876.2                      3rd Qu.: 7620524  
##  Max.   :924.0                      Max.   :10655730  
##                                     NA's   :1
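min() misbehaves because YYYY-MMM is stored as text, so the comparison is alphabetical and “1998-Apr” sorts before “1998-Jan”. A sketch of parsing the strings into proper Dates (the Month column name is my choice, and %b assumes an English locale):

# Append a day-of-month and parse "1998-Jan"-style strings into Dates
RCFload$Month <- as.Date(paste0(RCFload$`YYYY-MMM`, "-01"),
                         format = "%Y-%b-%d")
min(RCFload$Month)  # now returns the true earliest month, 1998-01-01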

Then we create a monthly ts starting in January 1998, per the head() output above, and verify it with head().

RCFload_TS<-ts(RCFload[,"KWH"], frequency = 12, start=c(1998,1))
autoplot(RCFload_TS)+ xlab("Time") + ylab("KWH") + ggtitle("Monthly Power Usage (KWH)")

head(RCFload_TS)
##          Jan     Feb     Mar     Apr     May     Jun
## 1998 6862583 5838198 5420658 5010364 4665377 6467147

Exploring data

Two things stand out in the plot: there is one NA/extreme outlier, and KWH increases year over year.

ggseasonplot(RCFload_TS)

We replace the outlier (the minimum value) with the median and re-plot; the data will also need a Box-Cox transformation, as before.

# Replace the extreme minimum with the series median
RCFload_TS[which.min(RCFload_TS)]<-median(RCFload_TS, na.rm = TRUE)
ggseasonplot(RCFload_TS)

ggtsdisplay(RCFload_TS)
## Warning: Removed 1 rows containing missing values (`geom_point()`).
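The warning above confirms one NA is still in the series. A sketch of imputing it with na.interp() from the forecast package, assigned to a new object so the pipeline below is unchanged:

# na.interp() uses seasonally aware interpolation on seasonal series
RCFload_TS_imp <- na.interp(RCFload_TS)
sum(is.na(RCFload_TS_imp))  # 0 missing values after imputation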

checkresiduals(RCFload_TS)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

Next we apply a Box-Cox transformation and run a KPSS test. Note that the test statistic below (1.59) exceeds even the 1% critical value, so the transformed series is not level-stationary; auto.arima() handles this by differencing, and indeed it selects a seasonal difference.

lambda <- BoxCox.lambda(RCFload_TS)
print(lambda)
## [1] -0.2023101
RCFload_TS_BC <- BoxCox(RCFload_TS, lambda)
autoplot(BoxCox(RCFload_TS, lambda)) +
  ggtitle("BoxCox Transformation")

RCFload_TS_BC2<-ur.kpss(RCFload_TS_BC)
summary(RCFload_TS_BC2)
## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 4 lags. 
## 
## Value of test-statistic is: 1.5857 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739
forecast_RCFload <- auto.arima(RCFload_TS_BC)
autoplot(forecast(forecast_RCFload,12), main="Prediction for 2014 with ARIMA")

summary(forecast_RCFload)
## Series: RCFload_TS_BC 
## ARIMA(0,0,1)(2,1,0)[12] with drift 
## 
## Coefficients:
##          ma1     sar1     sar2  drift
##       0.2377  -0.7467  -0.3759  1e-04
## s.e.  0.0802   0.0738   0.0749  1e-04
## 
## sigma^2 = 1.683e-05:  log likelihood = 736.27
## AIC=-1462.54   AICc=-1462.19   BIC=-1446.57
## 
## Training set error measures:
##                        ME        RMSE         MAE         MPE       MAPE
## Training set 0.0002399358 0.003927156 0.003111997 0.005021188 0.06571964
##                   MASE       ACF1
## Training set 0.7794575 0.08523054
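One caveat: the forecast above is on the Box-Cox scale, not in KWH. auto.arima() accepts the lambda directly, in which case forecast() back-transforms to the original units; a sketch, reusing the NA-imputed series from the earlier sketch, with the output file name my own choice:

# Refit with the Box-Cox transformation handled inside the model;
# forecasts then come out directly in KWH
fit_kwh <- auto.arima(RCFload_TS_imp, lambda = lambda)
fc_kwh <- forecast(fit_kwh, h = 12)
autoplot(fc_kwh) + ylab("KWH")

# Excel-readable export of the 2014 monthly point forecasts
kwh_out <- data.frame(Month = seq(as.Date("2014-01-01"), by = "month",
                                  length.out = 12),
                      KWH = as.numeric(fc_kwh$mean))
write.csv(kwh_out, "KWH_forecast_2014.csv", row.names = FALSE)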

Q3 Part C

Part C consists of two data sets. These are simple two-column sets; however, they have different time stamps. Your optional assignment is to time-base sequence the data and aggregate based on hour (an example of what this looks like follows). Note: for multiple recordings within an hour, take the mean. Then determine if the data is stationary and whether it can be forecast. If so, provide a week-forward forecast and present results via RPubs and .rmd, and the forecast in an Excel readable file.

First of all, we combine the two tables, then extract the hour and date from the timestamp; however, rounding the timestamps to the hour makes more sense, so we do that afterwards.

Then we build a grouped-by-hour data set, a ts with frequency 24, and fit a model.

However, I am not confident the ts construction and model are correct.

library(datetimeutils)
## 
## Attaching package: 'datetimeutils'
## The following objects are masked from 'package:lubridate':
## 
##     mday, mday<-, month, year
WF1 <- read_excel("Waterflow_Pipe1.xlsx")
WF2 <- read_excel("Waterflow_Pipe2.xlsx")

WF1$`Date Time` <- convert_date(WF1$`Date Time`, type = "Excel", fraction = TRUE)
WF2$`Date Time` <- convert_date(WF2$`Date Time`, type = "Excel", fraction = TRUE)

WFtouse <-rbind(WF1,WF2)
summary(WFtouse)
##    Date Time                     WaterFlow     
##  Min.   :2015-10-23 00:24:06   Min.   : 1.067  
##  1st Qu.:2015-10-25 11:21:35   1st Qu.:13.683  
##  Median :2015-10-27 20:07:30   Median :19.880  
##  Mean   :2015-10-27 20:53:49   Mean   :19.897  
##  3rd Qu.:2015-10-30 08:24:50   3rd Qu.:26.159  
##  Max.   :2015-11-01 23:35:43   Max.   :38.913
# Extract the hour and calendar date as helper columns (not used
# downstream), then round each timestamp to the nearest hour so the
# readings can be grouped and averaged by hour
WFtouse$hours <- format(as.POSIXct(WFtouse$`Date Time`), format = "%H")
WFtouse$Date <- as.Date(WFtouse$`Date Time`)
WFtouse$`Date Time` <- round_date(WFtouse$`Date Time`, "hours")

# Aggregate to hourly resolution, taking the mean within each hour
WFtouseG <- WFtouse %>%
  group_by(`Date Time`) %>%
  summarise(WaterFlow = mean(WaterFlow))

WFtouseG <- as_tsibble(WFtouseG)
## Using `Date Time` as index variable.
gg_tsdisplay(WFtouseG, plot_type = "partial")
## Plot variable not specified, automatically selected `y = WaterFlow`

# Hourly observations with daily seasonality, so frequency = 24;
# ts() has no hourly calendar, so we keep the default numeric start
WFtouseG_ts <- ts(WFtouseG$WaterFlow, frequency = 24)
WFtouseG_ts
## Time Series:
## Start = c(1, 1) 
## End = c(11, 2) 
## Frequency = 24 
##   [1] 23.369599 20.594952 24.108974 22.096586 16.297601 20.622674 22.315469
##   [8] 21.486459 18.670595 19.332527 21.118805 16.518302 23.969097 23.042386
##  [15] 21.985536 24.190958 25.201762 21.358023 21.084532 13.270244 21.075784
##  [22] 19.065998 17.490423 18.344710 18.419928 14.469826 26.529643 23.182193
##  [29] 23.576826 31.655559 19.393744 21.937817 23.566852 20.051771 18.732876
##  [36]  5.678149 11.941820 19.324775 13.185921 11.844761 15.171121 19.925632
##  [43] 17.671231 20.050264 18.761564 15.714068 19.018135 15.614098 23.273438
##  [50] 25.034591 23.369905 20.635865 17.995699 17.213137 20.854437 17.633172
##  [57] 21.645167 25.512180  9.937115 18.998701 10.924172 24.607450 16.878789
##  [64] 21.850332 12.718461 25.788332 24.678980 21.543160 13.367861 20.331131
##  [71] 16.234728 30.364928 22.584131 24.169753 17.681047 17.228764 15.557495
##  [78] 13.755156 18.378993 17.348098 22.767248 22.208604 16.768433 27.174478
##  [85] 20.868027 16.441789 18.509076 14.787636 26.339307 19.628814 18.593013
##  [92] 24.965091 22.618109 12.099414 22.612621 21.166484 18.389080 23.560504
##  [99] 23.996702 18.370869 20.618550 25.888403 13.276622 19.954973 17.896071
## [106] 10.719391 14.956229 14.538538 23.706073 20.588211 21.670380 28.049444
## [113] 19.306950 25.930345 12.977001 25.161390 24.392668 18.382795 18.111355
## [120]  6.037540 17.341505 26.199455 26.775857 15.122721 26.450472 19.379246
## [127] 20.309258 14.976784 17.948196  8.530415 21.318989 25.120167 15.530935
## [134] 14.509383 22.184786 22.363508 26.027013 18.337608 22.467321 18.007411
## [141] 16.440147 23.807593 22.097793 19.319161 21.477792 24.942278 13.196225
## [148] 14.627345 12.728187 24.441168 16.295212 21.175908 18.380724 15.500801
## [155] 11.375779 23.474667 19.452844 23.102799 23.957343 19.087196 15.035653
## [162] 25.729928 22.436649 17.092190 29.020963 21.669312 15.521515 21.943044
## [169] 22.111881 25.313116 17.412591 14.490768 20.704842 19.138894 26.741802
## [176] 21.646633 19.718219 18.253964 17.307226 16.852787 26.074076  7.759069
## [183] 29.381336 22.040600 29.470806 21.986151 12.308062 19.556140 23.534997
## [190] 20.046244 12.529729 14.853327 11.337785 19.581081 25.211067 24.390009
## [197] 24.264212 19.143730 26.931825 23.545106 36.082648 16.700396 14.575057
## [204] 23.963497 19.533826 19.802439 19.431547 17.289728 20.493131 22.200491
## [211] 24.385576 15.962480 20.248071 22.786641 11.857143 29.106091 14.506164
## [218] 23.629227 26.975276 16.537810 20.923563 17.328540 22.610985 16.788841
## [225] 29.918730 14.139590 28.550303 18.399540 20.705043 29.616021 20.858675
## [232] 22.743092 18.795715 22.596592 21.139837 19.288250 18.821119 15.292248
## [239] 16.929837 22.358282 22.844104 18.715206
ggseasonplot(WFtouseG_ts)

checkresiduals(WFtouseG_ts)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.
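The assignment also asks whether the data is stationary; a sketch of a formal check with ur.kpss() (from the urca package used in Part B) before fitting:

# KPSS null: level-stationary. A test statistic below the critical
# values means we cannot reject stationarity, so forecasting is sensible.
summary(ur.kpss(WFtouseG_ts))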

# Fit a non-seasonal ARIMA and forecast one week ahead:
# hourly data, so a week is 7 * 24 = 168 steps
Fit_WFtouseG_ts<-auto.arima(WFtouseG_ts, seasonal=FALSE,
  stepwise=FALSE, approximation=FALSE)

autoplot(forecast(Fit_WFtouseG_ts, h=168))
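As with Parts A and B, the forecast should go into an Excel-readable file; a minimal sketch, with the file name my own choice:

# Export the week-ahead (168-hour) point forecast to CSV
fc_wf <- forecast(Fit_WFtouseG_ts, h = 168)
wf_out <- data.frame(Hour = seq_along(fc_wf$mean),
                     WaterFlow = as.numeric(fc_wf$mean))
write.csv(wf_out, "WaterFlow_forecast.csv", row.names = FALSE)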