My last name is Williams, so I decided to use the the Walmart Inc. (WMT) adjusted closing price for the stock I will attempt to forecast. Let’s see what the structure of the data I downloaded from here looks like.

df <- read.csv("C:/Users/jawilliams/Desktop/Forecasting/Data/WMT.csv")
str(df)
## 'data.frame':    252 obs. of  7 variables:
##  $ Date     : Factor w/ 252 levels "2019-07-01","2019-07-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Open     : num  111 111 111 112 112 ...
##  $ High     : num  112 112 112 112 113 ...
##  $ Low      : num  110 110 111 111 112 ...
##  $ Close    : num  111 112 112 112 113 ...
##  $ Adj.Close: num  109 110 110 110 111 ...
##  $ Volume   : int  5514700 4062900 3207300 3579400 4715700 5423400 4579000 3896600 3743300 3346000 ...

Now let’s assign the Date var class as ‘Date’ for the sake of that is what it actually is. After that is accomplished, We are going to want to figure out when the start/ end dates of this data are, and the length of the dataset so we can get a better understanding of the frequency we are dealing with is WRT to time. We can also load some packages we are going to need moving forward as well.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)
library(forecast)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df$Date <- as.Date(df$Date)

print(paste(
  "This Data contains",
  length(df$Date),
  "observations",
  sep = " ")
  )
## [1] "This Data contains 252 observations"
print(paste(
  "The first date in the data is",
  head(df$Date, 1), 
  "and the last date is",
  tail(df$Date, 1),
  sep = " ")
)
## [1] "The first date in the data is 2019-07-01 and the last date is 2020-06-29"

So we have 252 observations, for roughly a year’s worth of data. Cool. Clearly we are missing some days of the week. Those must be holidays / weekends. Let’s see what days of the week comprise this data.

table(weekdays(df$Date), useNA = "a")
## 
##    Friday    Monday  Thursday   Tuesday Wednesday      <NA> 
##        51        49        50        52        50         0

Are there difference in the means of the Adjusted Closing price of the WMT stock?

df %>% 
  mutate(Weekday = weekdays(Date)) %>% 
  group_by(Weekday) %>% 
  summarise(Mean_Adj.Close = mean(Adj.Close))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   Weekday   Mean_Adj.Close
##   <chr>              <dbl>
## 1 Friday              117.
## 2 Monday              117.
## 3 Thursday            117.
## 4 Tuesday             117.
## 5 Wednesday           117.

Probably not enough variation here worth stat testing among the groups. that’s OK. Let’s visualize our data. We are going to make it a ts object. In order to do that, we need to know

  1. What the frequency of time we are dealing w/ Stocks are traded daily, so 365.
  2. What day of the calendar year (out of 365). We could count on our fingers for this, but R can do it for us
close_price_ts <- ts(df$Adj.Close,
                     start = c(2019, 
                             as.numeric(format(as.Date("2019-07-01"), "%j"))),
                     frequency = 365)

## How to know what day of year date x is

print(as.numeric(format(as.Date("2019-07-01"), "%j")))
## [1] 182

Now that we have our data stored as a ts object, let’s plot:

ggplot(df, aes(x = Adj.Close)) +
  geom_histogram()+
  xlab("WMT Adjusted Closing Price Distribution")+
  ylab("Frequency")+
  ggthemes::theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Now lets plot the trend.

ggplot(df, aes(x =as.POSIXct(Date) , y = Adj.Close)) +
  geom_line() +
  xlab("Day") +
  ylab("Adjusted Closing Price $") +
  ggtitle("WMT Daily Adjusted Closing Price NYSE") +
  ggthemes::theme_economist()

Let’s split the data into two sets, the data that will be used to train out model i.e. the training set, and the set used to test our predictions, i.e. the testing set.

train_ts <- window(close_price_ts, end = c(2020, 16))

test_ts <- window(close_price_ts, start = c(2020, 17))

print(paste("There are",
            length(train_ts),
            "observations in the training set",
            sep = " "))
## [1] "There are 200 observations in the training set"
print(paste("There are",
            length(test_ts),
            "observations in the testing set",
            sep = " "))
## [1] "There are 52 observations in the testing set"

Now, let’s fit our linear model of the training set against the trend, and see what our results yield

x <- tslm(train_ts ~ trend)

print(summary(x))
## 
## Call:
## tslm(formula = train_ts ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2754  -2.0770   0.5941   2.7241  10.3105 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.117e+02  5.202e-01 214.728  < 2e-16 ***
## trend       3.233e-02  4.488e-03   7.203  1.2e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.664 on 198 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.2036 
## F-statistic: 51.89 on 1 and 198 DF,  p-value: 1.201e-11
x_forecasts <- forecast(x, h = 52)

print(accuracy(x_forecasts, test_ts))
##                         ME     RMSE      MAE        MPE     MAPE MASE      ACF1
## Training set -4.275746e-16 3.645889 2.850789 -0.1026099 2.501322  NaN 0.7991014
## Test set      4.427215e+00 5.895752 4.725075  3.5017220 3.753699  NaN 0.8660365
##              Theil's U
## Training set        NA
## Test set      3.977182

So we have some reasonable metrics WRT to accuracy. Clearly there is some sort of variation we are missing in explaining the training set, since the \[R^2\] is only 0.20. Let’s see how the residuals are distributed.

 hist(x$residuals, col = "blue")

There is a bit of a left-skew present, like I said, earlier, there is clearly some variation in the Adjusted Closing price that our model does not account for.

Finally, let’s plot our forecasts

autoplot(x_forecasts)+
  ylab("WMT Adjusted Closing Price $")

Now lets forecast the next 30 days in the data.

x2 <- tslm(close_price_ts ~ trend)

x2_forecasts <- forecast(x2, h = 30)

autoplot(x2_forecasts)+
  ylab("WMT Adjusted Closing Price $")