Basic time series analysis using techniques in R that I have learned. This project involves finding two dataset and making datasets stationary to later use for forecasting

Description of datasets:
1. The first one has to do with monthly sales if US houses between 1965 and 1975.
2. The second dataset has to with rental vacancies between 1980 and 2016, by quarter

#Load libraries and set working directory
library(readxl)
library(tseries)
library(forecast)
library(MTS) #Used for the archTest()
#Dataset #1
rentalVac <- read_excel("RentalVacancies_1980-2016.xls")
rentalVac <- na.omit(rentalVac)

#Dataset #2
monthlyHouseSales <- read_excel("monthly-sales-of-us-houses-thous.xls")
monthlyHouseSales <- monthlyHouseSales[-c(1:14),]
monthlyHouseSales <- na.omit(monthlyHouseSales)

1) Plotting the time series variable to visually determine whether the data set possibly has constant mean and constant variance.

  1. Rental Vacancies from 1980-2016: By looking at the graph it appears that the time series data set is NOT stationary in terms of constant mean and constant variance. There appears to be a trend upwards until about 2010 followed by downward trend.

  2. Monthly Sales of US Houses in the thousands: By looking at the graph it appears that the time series data set is NOT stationary in terms of constant variance. However, the mean does appear to be fairly constant with the exception of the few points between 1972 and 1974. Does not appear to be too much trend.

#create the time series variable
#1
rentalVacTS <- ts(rentalVac, frequency = 4, start=c(1980,1)) #there are two columns in the dataset. Second column is the time series
plot.ts(rentalVacTS[,2], main="Original Rental Vacancy Dataset") 

#2
monthlyHouseSalesTS <- ts(monthlyHouseSales, frequency = 12, start = c(1965,1))
par(mfrow=c(1,1))
plot.ts(monthlyHouseSalesTS[,2], main="Original Monthly House Sales")

2) Plot the ACF for the time series to see if there may be trend or non-constant mean for each time series

  1. Rental Vacancies from 1980-2016: The ACF below looks like a classic ACF where it is evident that the data is a non-stationary time series because it drops very slowly.
  2. Monthly Sales of US Houses in the thousands: By looking at the graph it appears that the data is possibly stationary after a certain lag about 0.5. The ACF initially drops fairly fast but then raises for a short time before it drops rapidly towards zero.
#1
par(mfrow=c(1,1))
acf(na.omit(rentalVacTS[,2]),main="ACF - Original Rental Vacancies")

#2
par(mfrow=c(1,1))
acf(na.omit(monthlyHouseSalesTS[,2]),main="ACF - Orig Monthly House Sales")

3) Use root tests, KPSS and Augmented Dickey Fuller, to see if the test suggest if there is a constant mean or not.

  1. Rental Vacancies from 1980-2016: Both the KPSS and Augmented Dickey Fuller test concur that this dataset is non-stationary.
  2. Monthly Sales of US Houses in the thousands: Both the KPSS and Augmented Dickey Fuller test concur that this dataset is non-stationary, but both are marginally close to the cut-off points of p-value = 0.05.
#Rental Vacancies
kpss.test(na.omit(rentalVacTS[,2])) #both give same result that data is non-stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  na.omit(rentalVacTS[, 2])
## KPSS Level = 2.9421, Truncation lag parameter = 2, p-value = 0.01
adf.test(na.omit(rentalVacTS[,2]))
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(rentalVacTS[, 2])
## Dickey-Fuller = -0.71386, Lag order = 5, p-value = 0.9672
## alternative hypothesis: stationary
#Monthly sales
kpss.test(na.omit(monthlyHouseSalesTS[,2])) #p-value=0.022 which is significant which is non-stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  na.omit(monthlyHouseSalesTS[, 2])
## KPSS Level = 0.59702, Truncation lag parameter = 2, p-value =
## 0.02291
adf.test(na.omit(monthlyHouseSalesTS[,2]))  #p-value=0.067 which is non-significant which is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(monthlyHouseSalesTS[, 2])
## Dickey-Fuller = -3.3385, Lag order = 5, p-value = 0.0679
## alternative hypothesis: stationary

4) If the KPSS and Augmented Dickey Fuller steps from above suggest there is non-constant mean then use differencing to create a new variable

Based on the results from the KPSS and Augmented Dickey Fuller, it appears that possibly both datasets have non-constant variance and therefore will try differencing for both datasets. Will first use the ndiffs() and nsdiff() in R to determine the number of difference to try on the datasets.

  1. Rental Vacancies from 1980-2016: Based on the results, R states to take 2 differences on this dataset. Below, first graph is when differenced once and second plot is when differenced twice. From the graph of the differenced twice it appears that the data is stationary and that the differencing got rid of the trend and non-constant mean. When take the KPSS and the ADF from the difference once plot, both results contradict each other meaning the KPSS states that the data is non-stationary but for the ADF states that it is stationary. When take difference twice, both tests concur stating the dataset is stationary data with constant mean.

  2. Monthly Sales of US Houses in the thousands: Based on the results, R states to take 0 differences on this dataset. After trying to take one difference and looking at results, the plot does not look like it has constant mean like the original dataset. The ACF looks better but the KPSS and ADF contradict each other. When take a second difference the dataset also contradicts itself when comparing the KPSS and ADF test. Therefore, I decided to leave this dataset in its original set. Below is data differenced once. The original data has at least fairly constant mean but the differenced does not. However, this plot did remove some of the seasonality from the original data but not enough. Similar results when differenced twice.

#Rental Vacancy
nonSeas = ndiffs(na.omit(rentalVacTS[,2])) #R says 2 which is greater than 0

#Is differenced data by 1 stationary?
#Both contradict each other
rentDiff <- diff(na.omit(rentalVacTS[,2]), differences = 1)
plot.ts(rentDiff, main="Rental Vacancies Differenced = 2")

kpss.test(rentDiff) #p-value=0.04 which is significant which states data is stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  rentDiff
## KPSS Level = 0.48592, Truncation lag parameter = 2, p-value =
## 0.04484
adf.test(rentDiff)  #p-value=0.01 which is significant which states Data is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  rentDiff
## Dickey-Fuller = -4.1008, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
par(mfrow=c(1,2))
acf(rentDiff,main="Rental Vacancies Differenced=2")
pacf(rentDiff)

#Is differenced data by 2 stationary?
#The ACF looks like the differencing got
#rid of the trend and the non-constant mean because points drop to zero
#very rapidly but many of the points exceed the significance boundaries but
#not by much so this is something I will have to keep an eye on.

rentDiff <- diff(na.omit(rentalVacTS[,2]), differences = nonSeas)
plot.ts(rentDiff, main="Rental Vacancies Differenced = 2")
kpss.test(rentDiff) #p-value=0.1 which is not significant which states data is stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  rentDiff
## KPSS Level = 0.014818, Truncation lag parameter = 2, p-value = 0.1
adf.test(rentDiff)  #p-value=0.01 which is significant which states data is stationary
## 
##  Augmented Dickey-Fuller Test
## 
## data:  rentDiff
## Dickey-Fuller = -7.775, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
#Monthly house sales
#Since I decided to not
#difference this dataset, the ACF is the same as in #3. It is harder to interpret
#but I feel that keeping this data as is will be the best option. The second
#wave around 1.0 shows that there is possibly seasonality.
nsdiffs(na.omit(monthlyHouseSalesTS[,2])) #value is zero so no differencing
## [1] 0

5) Checking for constant variance using the ARCH test

Used the archTest() in the MTS package in R.

  1. Rental Vacancies from 1980-2016: The rank series of the differenced time series is used to test the conditional heteroscedasticity. Based on the results, the p-value is 0.133 which is non-significant therefore, the different rental vacancy time series is good.

  2. Monthly Sales of US Houses in the thousands: The rank series of the differenced time series is used to test the conditional heteroscedasticity. Based on the results, the pvalue is 0, which means it is statistically significant and that there might be an issue with constant variance. https://cran.r-project.org/web/packages/MTS/MTS.pdf

#Rental vacancy
archTest(rentDiff, lag=10)
## Q(m) of squared series(LM test):  
## Test statistic:  26.95053  p-value:  0.002651682 
## Rank-based Test:  
## Test statistic:  14.97264  p-value:  0.1330626
#Monthly house sales
archTest(monthlyHouseSalesTS, lag=10) #This is significant
## Q(m) of squared series(LM test):  
## Test statistic:  1074.015  p-value:  0 
## Rank-based Test:  
## Test statistic:  1082.45  p-value:  0

6) Plot PACF (with ACF) to determine Autoregressive and/or Moving Average

  1. Rental Vacancies from 1980-2016: Based on the ACF and PACF, I might consider an autoregressive of 1 since tail off to zero after lag 1 but it is a little bit complicated because there are other points that cross the significance boundaries later on. For the moving average might consider 3 but for lag 1 and 2, these lags are within the significance boundaries so this is also tricky. Will use the auto.arima() in R which states what can possibly be the better arima model. Based on the auto.arima model, the best arima model is (1,2,1). This is fairly close to what I had interpreted based on looking at the graphs below

  2. Monthly Sales of US Houses in the thousands: The PACF and ACF for this model are much harder to interpret, therefore for this model I chose to go straight to the auto.arima() to determine what the a good candidate for the arima model would be. Based on the results, a good arima model is (2,0,0)

#Rental Vacancy
par(mfrow=c(1,2))
acf(rentDiff,main="Rent Vacancies Diff=2")
pacf(rentDiff, main="")

#Based on the ACF and PACF, they seem to be a bit complicated and therefore I chose to let R decide which ARIMA model I should use.
auto.arima(na.omit(rentalVacTS[,2]), seasonal = FALSE, ic="bic")#states to use ARIMA(1,2,1)
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,2,1)                    
## 
## Coefficients:
##           ar1      ma1
##       -0.3064  -0.9443
## s.e.   0.0823   0.0360
## 
## sigma^2 estimated as 0.09293:  log likelihood=-34.13
## AIC=74.27   AICc=74.44   BIC=83.22
#Monthly House Sales
par(mfrow=c(1,2))
acf(na.omit(monthlyHouseSalesTS[,2]),main="Monthly House Sales")
pacf(na.omit(monthlyHouseSalesTS[,2]),main="")

7) Comparing ARIMA models

Used the archTest() in the MTS package in R. When comparing different models, used AIC to compare which is possibly a better model

  1. Rental Vacancies from 1980-2016:
  1. Different Models:
  2. Arima(1,2,1) - AIC: 74.27
  1. Arima(1,2,3) - AIC: 75.78
  2. Arima(1,1,1) - AIC: 57.23 (Though I get a possible convergence warning with this model)
  3. Arima(1,1,3) - AIC:70.57
  1. Observed vs Fitted: The fitted forecast seems to agree pretty well with the observed values, though not exactly. Although there is a bit of a lag of the fitted compared to the observed. (Black line: observed; Red line: fitted)
  2. Favorite Model: I am inclined to using the Arima(1,2,1) model. Though it is not the one with the lowest AIC, it was the model that the auto.Arima() chose. In addition, by looking at the ACF and PACF plot, it is unclear as to whether a moving average of 1 or 3 will be best. Due to possible penalties if having extra parameters, I decided to stay with the (1,2,1) model. In addition, though the (1,1,1) model has a better AIC, it is unclear how the possible convergence warning can affect the model
  1. Monthly Sales of US Houses in the thousands:
  1. Different Models: For this dataset decided to also look at the BIC because the data is not as interpretable as the previous dataset.
  2. Arima(2,0,0) - AIC: 815.02 BIC: 826.55
  1. Arima(0,1,0) - AIC: 814.91 BIC: 817.79
  2. Arima(1,0,3) - AIC: 803.1 BIC: 820.4
  3. Arima(1,0,2) - AIC: 804.86 BIC: 819.28
  1. Observed vs Fitted: The fitted follows the observed perfectly with the exception of a lag. Not sure if it should follow this closely. The previous dataset had the general trend but didn’t match the observed as much as this one.
  2. Favorite Model: I strongly believe that this model does not need or should have any differencing because it doesn’t really appear to have trend but does have seasonality. Based on the AIC and BIC and the number of parameters, the model I am choosing is the Arima(1,0,2) which is different from what the auto.Arima() suggested. The AIC for this Arima is one of the lowest from the four models tried and BIC is also one of the lowest. In addition, it only has 3 parameters.
#Rental vacancy
auto.arima(na.omit(rentalVacTS[,2]), seasonal = FALSE, ic="bic")#states to use ARIMA(1,2,1)
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,2,1)                    
## 
## Coefficients:
##           ar1      ma1
##       -0.3064  -0.9443
## s.e.   0.0823   0.0360
## 
## sigma^2 estimated as 0.09293:  log likelihood=-34.13
## AIC=74.27   AICc=74.44   BIC=83.22
rentArima <- Arima(na.omit(rentalVacTS[,2]), order=c(1,2,1)) #AIC: 74.27
rentArima
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,2,1)                    
## 
## Coefficients:
##           ar1      ma1
##       -0.3064  -0.9443
## s.e.   0.0823   0.0360
## 
## sigma^2 estimated as 0.09293:  log likelihood=-34.13
## AIC=74.27   AICc=74.44   BIC=83.22
rentArima <- Arima(na.omit(rentalVacTS[,2]), order=c(1,2,3)) #AIC 75.78
rentArima
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,2,3)                    
## 
## Coefficients:
##          ar1      ma1     ma2      ma3
##       0.8090  -2.1098  1.4596  -0.3454
## s.e.  0.2104   0.2142  0.2797   0.0959
## 
## sigma^2 estimated as 0.09239:  log likelihood=-32.89
## AIC=75.78   AICc=76.21   BIC=90.7
rentArima <- Arima(na.omit(rentalVacTS[,2]), order=c(1,1,1)) #AIC 57.23
rentArima
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,1,1)                    
## 
## Coefficients:
##           ar1     ma1
##       -0.9998  0.9921
## s.e.   0.0005  0.0124
## 
## sigma^2 estimated as 0.0827:  log likelihood=-25.61
## AIC=57.23   AICc=57.4   BIC=66.2
rentArima <- Arima(na.omit(rentalVacTS[,2]), order=c(1,1,3)) #AIC 70.57
rentArima
## Series: na.omit(rentalVacTS[, 2]) 
## ARIMA(1,1,3)                    
## 
## Coefficients:
##          ar1      ma1     ma2     ma3
##       0.7734  -1.0815  0.3032  0.0486
## s.e.  0.1917   0.1988  0.1255  0.0881
## 
## sigma^2 estimated as 0.09072:  log likelihood=-30.28
## AIC=70.57   AICc=70.99   BIC=85.52
#8B
#Black line is the original time series
#Red line is the forecasted/fitted values 
rentForecast <- HoltWinters(na.omit(rentalVacTS[,2], gamma=FALSE))
rentForecast
## Holt-Winters exponential smoothing with trend and additive seasonal component.
## 
## Call:
## HoltWinters(x = na.omit(rentalVacTS[, 2], gamma = FALSE))
## 
## Smoothing parameters:
##  alpha: 0.6857901
##  beta : 0.1046679
##  gamma: 0.2924761
## 
## Coefficients:
##           [,1]
## a   6.83196142
## b  -0.08555574
## s1  0.04438373
## s2 -0.12099100
## s3  0.13035629
## s4 -0.01765244
plot(rentForecast) #The forecasts agree pretty well with the observed values. Although tend to lag a bit behind the observed values. 

rentForecast2 <- forecast.HoltWinters(rentForecast, h=19)
Box.test(rentForecast2$residuals)
## 
##  Box-Pierce test
## 
## data:  rentForecast2$residuals
## X-squared = 0.7424, df = 1, p-value = 0.3889
plot.forecast(rentForecast2)

#8D Favorite model: Forecast next 6 time periods and plot the timeseries with the forecast data
rentArima <- Arima(na.omit(rentalVacTS[,2]), order=c(1,2,1))
rentArimaForecast <- forecast.Arima(rentArima, h=6)
rentArimaForecast
##         Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 2017 Q2       6.767140 6.376458 7.157823 6.169644 7.364637
## 2017 Q3       6.705620 6.217415 7.193826 5.958974 7.452266
## 2017 Q4       6.622244 6.024707 7.219782 5.708389 7.536099
## 2018 Q1       6.545564 5.852839 7.238288 5.486133 7.604995
## 2018 Q2       6.466832 5.681292 7.252373 5.265452 7.668213
## 2018 Q3       6.388729 5.513443 7.264016 5.050095 7.727364
plot.forecast(rentArimaForecast)

acf(rentArimaForecast$residuals, lag.max = 20, main="Forecast ACF for Rental Vacancies (1,2,1)")

Box.test(rentArimaForecast$residuals, lag=15, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  rentArimaForecast$residuals
## X-squared = 24.084, df = 15, p-value = 0.06368
#Monthly house sales
#8A
auto.arima(na.omit(monthlyHouseSalesTS[,2]),seasonal = TRUE) #Try ARIMA(2,0,0)
## Series: na.omit(monthlyHouseSalesTS[, 2]) 
## ARIMA(0,1,0)(2,0,0)[12]                    
## 
## Coefficients:
##         sar1    sar2
##       0.4480  0.2677
## s.e.  0.0871  0.0941
## 
## sigma^2 estimated as 17.59:  log likelihood=-376.37
## AIC=758.74   AICc=758.93   BIC=767.37
houseArima <- Arima(na.omit(monthlyHouseSalesTS[,2]), order=c(2,0,0)) #AIC:815.02 BIC: 826.55
houseArima  
## Series: na.omit(monthlyHouseSalesTS[, 2]) 
## ARIMA(2,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2     mean
##       1.0071  -0.1666  44.8242
## s.e.  0.0862   0.0861   2.7128
## 
## sigma^2 estimated as 26.79:  log likelihood=-403.51
## AIC=815.02   AICc=815.34   BIC=826.55
houseArima <- Arima(na.omit(monthlyHouseSalesTS[,2]), order=c(0,1,0)) #AIC:814.91 BIC:817.79
houseArima
## Series: na.omit(monthlyHouseSalesTS[, 2]) 
## ARIMA(0,1,0)                    
## 
## sigma^2 estimated as 29.01:  log likelihood=-406.46
## AIC=814.91   AICc=814.94   BIC=817.79
houseArima <- Arima(na.omit(monthlyHouseSalesTS[,2]), order=c(1,0,2)) #AIC:803.1 BIC:820.4
houseArima
## Series: na.omit(monthlyHouseSalesTS[, 2]) 
## ARIMA(1,0,2) with non-zero mean 
## 
## Coefficients:
##          ar1     ma1     ma2     mean
##       0.7478  0.1606  0.4129  44.7767
## s.e.  0.0678  0.0871  0.0974   2.5848
## 
## sigma^2 estimated as 24.55:  log likelihood=-397.43
## AIC=804.86   AICc=805.34   BIC=819.28
#8B
houseForecast <- HoltWinters(na.omit(monthlyHouseSalesTS[,2]), gamma=FALSE)
houseForecast
## Holt-Winters exponential smoothing with trend and without seasonal component.
## 
## Call:
## HoltWinters(x = na.omit(monthlyHouseSalesTS[, 2]), gamma = FALSE)
## 
## Smoothing parameters:
##  alpha: 1
##  beta : 0.08860311
##  gamma: FALSE
## 
## Coefficients:
##         [,1]
## a 38.0000000
## b -0.6603704
plot(houseForecast) #The forecasts agree pretty well with the observed values. Although tend to lag a bit behind the observed values. 

houseForecast2 <- forecast.HoltWinters(houseForecast, h=19)
Box.test(houseForecast2$residuals)
## 
##  Box-Pierce test
## 
## data:  houseForecast2$residuals
## X-squared = 1.4245, df = 1, p-value = 0.2327
plot.forecast(houseForecast2)

#8D
houseArima <- Arima(na.omit(monthlyHouseSalesTS[,2]), order=c(1,0,2))
houseArimaForecast <- forecast.Arima(houseArima, h=6)
houseArimaForecast
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Jan 1976       39.70576 33.35538 46.05614 29.99369 49.41782
## Feb 1976       38.42964 29.85027 47.00901 25.30862 51.55066
## Mar 1976       40.03061 28.99838 51.06283 23.15827 56.90294
## Apr 1976       41.22775 29.03728 53.41822 22.58403 59.87147
## May 1976       42.12293 29.33045 54.91541 22.55853 61.68734
## Jun 1976       42.79231 29.67527 55.90936 22.73152 62.85311
plot.forecast(houseArimaForecast)

acf(houseArimaForecast$residuals, lag.max = 20, main="Forecast ACF for Home Sales (1,2,1)")

Box.test(houseArimaForecast$residuals, lag=15, type = "Ljung-Box")
## 
##  Box-Ljung test
## 
## data:  houseArimaForecast$residuals
## X-squared = 41.222, df = 15, p-value = 0.0002958