Analyze the attached time series data on average monthly value of a company stock. Forecast one year into the future and report the projected average value and low and high ends using a 95% confidence interval.

Based on the forecast would you advise a client to buy the stock?

And if so, what is the most they should pay and why?

library(readxl)   
## Warning: package 'readxl' was built under R version 3.5.2
library(forecast)  
## Warning: package 'forecast' was built under R version 3.5.2
library(tseries)
## Warning: package 'tseries' was built under R version 3.5.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
Q1_data <- read_excel("/Users/pallavisaitu/Downloads/Final Exam 525 - Analysis Question A.xlsx")
head(Q1_data)
## # A tibble: 6 x 2
##   Month Value
##   <dbl> <dbl>
## 1     1  450.
## 2     2  446.
## 3     3  436.
## 4     4  425.
## 5     5  419 
## 6     6  423.

Checking Assumptions Visually

If data appears to meet the assumptions of stationarity that means the mean, variance and autocorrelation structure do not change over time

mean(Q1_data$Value)
## [1] 438.7495
plot(Q1_data$Value)
lines(Q1_data$Value)

The mean does not appear to be particularly stable.Thus, we cannot assume stationarity here.

Fixing Outliers and Missing Values

Using tsclean() to deal with missing values and extreme outliers.Setting FALSE so that missing values are not imputed.

Q1_data$Value_Cleaned = as.data.frame(tsclean(Q1_data$Value))
Q1_data$Value_Cleaned = as.data.frame(tsclean(Q1_data$Value, replace.missing = FALSE)) 
Q1_data$Value_Cleaned
##    tsclean(Q1_data$Value, replace.missing = FALSE)
## 1                                         449.7143
## 2                                         446.4286
## 3                                         435.7143
## 4                                         424.7500
## 5                                         419.0000
## 6                                         423.3750
## 7                                         417.5000
## 8                                         408.7500
## 9                                         406.8333
## 10                                        416.5000
## 11                                        409.8333
## 12                                        411.1429
## 13                                        409.3750
## 14                                        408.2000
## 15                                        409.7143
## 16                                        434.3333
## 17                                        435.6250
## 18                                        437.8333
## 19                                        439.0000
## 20                                        432.8750
## 21                                        425.5000
## 22                                        419.0000
## 23                                        419.3333
## 24                                        420.1429
## 25                                        419.0000
## 26                                        414.2857
## 27                                        428.0000
## 28                                        441.0000
## 29                                        459.5000
## 30                                        461.0000
## 31                                        452.6250
## 32                                        455.8571
## 33                                        472.7143
## 34                                        500.0000
## 35                                        530.0000
## 36                                        517.8750
## 37                                        521.4000

The data is in pretty good shape (no outliers or missing values) as such there is no difference between Consumption and cleaned data.

Identifying Components

Recall when we look at time series data we are interested in long term and seasonal trends and the random component: ŷ = St + Tt + Et

When we first look at our data, we can try to get a handle on each of these components. We can not use “periodic” since we need univatiate series for it

We see a seasonal component and long-term trend.

The bottom spikes help us determine if we have auto- regressive errors.

Checking Stationality

Remember ARIMA requires stationarity.

We can use an augmented Dickey-Fuller (ADF) test for this (it looks for a linear and lagged component - if the linear contributes but the lagged does not we don’t have stationarity).

ctrends = ts(na.omit(Q1_data$Value_Cleaned), frequency=6)
adf.test(ctrends, alternative = "stationary")
## 
##  Augmented Dickey-Fuller Test
## 
## data:  ctrends
## Dickey-Fuller = -0.23969, Lag order = 3, p-value = 0.9878
## alternative hypothesis: stationary

Another way to break down the data into components is:

components.ts = decompose(ts(Q1_data$Value_Cleaned, freq = 6)) 
plot(components.ts)

So, this means we fail the requirement of stationary data.

Let us see the degree of auto-correlation to have a better insight.

#The ACF helps us to survey our q parameter
acf(Q1_data$Value_Cleaned)

#The PCF helps us to survey our p parameter
pacf(Q1_data$Value_Cleaned)

We see a spike at lag 1,2,3,4 & 5 that is significant in ACF & 1 in PCF.

Removing Non-Stationality

Recall our d parameter is the differencing parameter which seeks to remove non-stationality.

We could move forward building our ARIMA Model setting d = 1 and that should work.

Fitting and Forecasting

R allows us to select our own p, d, q values, or it can find them for us.

There is little reason to set our own, unless we think we can beat the algorithm R employs or if we find the results of the auto fitting to be odd.

testns <- auto.arima(ts(data= Q1_data$Value_Cleaned)) # fit the ARIMA
summary(testns)
## Series: ts(data = Q1_data$Value_Cleaned) 
## ARIMA(0,2,2) 
## 
## Coefficients:
##           ma1      ma2
##       -0.5789  -0.2724
## s.e.   0.1589   0.1488
## 
## sigma^2 estimated as 109.7:  log likelihood=-131.41
## AIC=268.83   AICc=269.6   BIC=273.49
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 2.035183 9.889485 7.313185 0.4523609 1.620453 0.9465853
##                     ACF1
## Training set -0.05102664
fns <- forecast(testns) # forecast out based on model
plot(fns) # plot the forecast

Let’s refit the model allowing for a seasonal trend. Meaning ARIME(p,d,q)(P,D,Q)m (with m = 6)

tests <- auto.arima(ts(data= Q1_data$Value_Cleaned, freq = 6)) 
summary(tests)
## Series: ts(data = Q1_data$Value_Cleaned, freq = 6) 
## ARIMA(1,2,1)(0,0,1)[6] 
## 
## Coefficients:
##          ar1      ma1    sma1
##       0.3927  -0.9648  0.6618
## s.e.  0.1890   0.1289  0.2451
## 
## sigma^2 estimated as 86.26:  log likelihood=-128.31
## AIC=264.61   AICc=265.94   BIC=270.83
## 
## Training set error measures:
##                    ME     RMSE     MAE       MPE     MAPE      MASE
## Training set 1.620841 8.637098 6.55866 0.3647722 1.467482 0.2395169
##                     ACF1
## Training set -0.03698714
fs <- forecast(tests)
plot(fs)

testsp <- auto.arima(ts(data= Q1_data$Value_Cleaned, freq = 6), xreg= Q1_data$Month)
summary(testsp)
## Series: ts(data = Q1_data$Value_Cleaned, freq = 6) 
## Regression with ARIMA(2,0,0)(0,0,1)[6] errors 
## 
## Coefficients:
##          ar1      ar2    sma1  intercept    xreg
##       1.3336  -0.4625  0.8066   410.1358  2.2131
## s.e.  0.1489   0.1604  0.3588    27.6570  1.1589
## 
## sigma^2 estimated as 73.94:  log likelihood=-133.63
## AIC=279.26   AICc=282.06   BIC=288.93
## 
## Training set error measures:
##                      ME     RMSE      MAE        MPE     MAPE     MASE
## Training set -0.6642965 7.997002 6.589387 -0.1935169 1.476333 0.240639
##                     ACF1
## Training set -0.07909074

Forecasting Controllable Variables

If we wanted to forecast, we would need to include price in our forecast.

Price could be whatever values we want; we just need to ensure it lines up with how far in the future we want to predict.

We want to predict 1 year in the future will need a Stock value column with 12 (months) values in it.

fs <- forecast(testsp, xreg = Q1_data$Value, h = 12) 
plot(fs)

How good is our model?

Looking at the residuals (errors in prediction) that our model makes is one way to know how well our model does.

What we are looking for is no significant remaining auto-correlation.

tsdisplay(residuals(testsp))

tsdisplay(residuals(testsp), lag.max = 54)

We don’t see any significant remaining spikes in the plot above. Also, we could simply compare our fitted model with the observed data.

plot(testsp$x,col="red")
lines(fitted(testsp),col="blue")

Here the red lines are for the raw data and blue lines are the predictions from the model. The model doen’t seem to be so far off. The model would get better if we had more data as input.

Furthermore, - Test Approach:The test approach I considered here was building an ARIMA model to forcast the stock value for the next 1 year. This approach is considered since we are predicting on a time series data. I have used the ARIMA model that is one of the central model types used to forecast using time series data

Expected output:The expected output was forecasting of the stock values for the next 1 year so that I could suggest the clients to buy stocks and at what price.

Summary:The time series ARIMA model was good enough since I checked by plotting it against the original raw data and there were not any alarming flags. The predictions from this model helps to answer the questions:

  1. Based on the forecast would you advise a client to buy the stock? Answer: Yes, based on the stocks forcast I would advise the client to but the stock.

  2. And if so, what is the most they should pay and why? Answer: They should not be paying more than 525 and the reason is that my model shows the highest stock value between the months of June (6) and July (7) and the predicted value of the stock.