The following data set contains the sales prices of new one-family houses sold in the USA from 1963 through 2016. (United States Census Bureau)
https://www.census.gov/construction/nrs/historical_data/index.html #click ‘Median and Average Sales Price of House Sold by Region’
Apply time series analysis to analyze the given data set. Suggested commands: plot.ts(), stl(), acf(), decompose(), Box.test(), auto.arima(), HoltWinters(), forecast.HoltWinters() or predict().
Interpret the results from each command, make a forecast then summarize your findings.
What are your insights and/or suggested actionable decision?
To begin, we need to read in the source data, which comes in the form of an xls document. At this time, I will also load the packages we are going to need before we get started. It was also advised that we look at quarterly data, which appears on sheet 2. The first few steps are about reading in our data; after that is done, we need to get it into a time series object before we can begin to evaluate different models.
library(xlsx)
library(ggplot2)
library(forecast)
setwd('C:\\Users\\JP\\Downloads')
list.files()
[1] "Associate +.pdf" "desktop.ini" "pricereg_cust.xls" "Week 5_msg.pdf"
data<-read.xlsx(file='pricereg_cust.xls',2,as.data.frame = T,startRow = 5,endRow = 221)
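If the xlsx package gives trouble (it depends on Java), the readxl package offers an alternative for this step. A sketch, assuming the sheet layout matches what read.xlsx saw above (skip and n_max mirror startRow and endRow; note readxl returns character columns rather than factors):
library(readxl)
# Read sheet 2, treating row 5 as the header row and taking 216 data rows.
data<-as.data.frame(read_excel('pricereg_cust.xls',sheet=2,skip=4,n_max=216))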
names(data)<-c('YearQtr','MedianUS','MedianNorthEast','MedianMidwest','MedianSouth','MedianWest','AverageUS','AverageNorthEast','AverageMidwest','AverageSouth','AverageWest')
head(data)
str(data)
'data.frame': 216 obs. of 11 variables:
$ YearQtr : Factor w/ 216 levels "1963Q1","1963Q2",..: 1 2 3 4 5 6 7 8 9 10 ...
$ MedianUS : num 17800 18000 17900 18500 18500 18900 18900 19400 20200 19800 ...
$ MedianNorthEast : num 20800 20600 19600 20600 20300 19800 20200 21400 21000 21900 ...
$ MedianMidwest : num 17500 17700 17800 19100 18700 19800 18900 20800 21900 20800 ...
$ MedianSouth : num 16800 15800 15900 15800 16500 16800 16800 16700 17400 16400 ...
$ MedianWest : num 18000 18900 19000 19500 19600 20100 20600 21500 21600 22100 ...
$ AverageUS : num 19300 19400 19200 19600 19600 20200 20500 20900 21500 21000 ...
$ AverageNorthEast : Factor w/ 164 levels "(NA)","102000",..: 1 1 1 1 1 1 1 1 1 1 ...
$ AverageMidwest : Factor w/ 165 levels "(NA)","102500",..: 1 1 1 1 1 1 1 1 1 1 ...
$ AverageSouth : Factor w/ 163 levels "(NA)","102300",..: 1 1 1 1 1 1 1 1 1 1 ...
$ AverageWest : Factor w/ 161 levels "(NA)","103400",..: 1 1 1 1 1 1 1 1 1 1 ...
We now have a nice, digestible data frame to work with. The first few columns come in as numeric, while the last few are read as factors because of their "(NA)" values. Let's move our data into a time series format.
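We will not need the average-price columns for the median analysis below, but if we did, the "(NA)" strings would have to be cleaned first. A minimal sketch, not part of the original workflow (the column names are the ones assigned above):
# Convert the factor columns to numeric; "(NA)" entries become true NAs (with a coercion warning).
avgCols<-c('AverageNorthEast','AverageMidwest','AverageSouth','AverageWest')
data[avgCols]<-lapply(data[avgCols],function(x) as.numeric(as.character(x)))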
data$qtr<-substr(data$YearQtr,5,6)    # quarter label, e.g. "Q1" (kept for reference; not used below)
data$Year<-substr(data$YearQtr,1,4)   # four-digit year
MedianTS<-ts(data[,2:6],frequency=4,start=c(1963,1))   # all five median series as a quarterly ts
MedianTSUS<-ts(data[,2],frequency=4,start=c(1963,1))   # US-wide median only
We got the data into time series objects by reworking it a little bit to meet our needs. We will start by looking at just the median, but may revisit the other series a little further ahead. Our goal is to inform a decision via a forecast. First, let's cover the basics of the recommended functions.
plot.ts(MedianTS)
plot.ts(MedianTSUS)
Comp<-decompose(MedianTSUS)                  # split the series into trend, seasonal, and random parts
MedianTSUSAdjust<-MedianTSUS-Comp$seasonal   # seasonally adjusted series
MedianTSUSAdjust2<-MedianTSUS-Comp$trend     # detrended series
MedianTSUSAdjust3<-MedianTSUS-Comp$random    # series with the noise removed
MedianTSUSAdjust4<-MedianTSUS-Comp$random-Comp$trend-Comp$seasonal  # all components removed; essentially zero where the trend is defined
plot.ts(MedianTSUS)
plot.ts(MedianTSUSAdjust)
plot.ts(MedianTSUSAdjust2)
plot.ts(MedianTSUSAdjust3)
plot.ts(MedianTSUSAdjust4)
You can see from some of these plots that the seasonal variation increases over time. This indicates our time series is not well represented by an additive model; it is better described as multiplicative. Taking the log of the data converts the multiplicative structure into an additive one that the standard tools handle.
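The assignment also suggests stl(), which decompose() largely duplicates here; as a quick cross-check, a sketch of an stl() decomposition of the logged series:
# LOESS-based decomposition of the log series into seasonal, trend, and remainder.
plot(stl(log(MedianTSUS),s.window="periodic"))
With the transformation settled, we fit a Holt-Winters model to the logged series and forecast from it.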
MedianTSUSL<-log(MedianTSUS)
MedianTSUSLF<-HoltWinters(MedianTSUSL)
MedianTSUSLF
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = MedianTSUSL)
Smoothing parameters:
alpha: 0.7065824
beta : 0.192128
gamma: 0.07708463
Coefficients:
[,1]
a 12.647443947
b 0.007036414
s1 0.002180491
s2 0.005069479
s3 -0.009596301
s4 0.001477106
Here we get the smoothing parameters that tell us how the model weights our data:
+ Alpha (0.71) measures how much weight is put on recent observations when estimating the level; a high value means the level adapts quickly.
+ Beta (0.19) is the smoothing parameter for the trend (slope).
+ Gamma (0.077) tells us how quickly the seasonal component updates over time; this low value means the seasonal estimates draw on observations going very far back.
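As a worked example of reading these coefficients: b is the estimated per-quarter change on the log scale, so it implies an annual growth rate in the median price. A back-of-the-envelope calculation, not output from the original session:
b<-0.007036414   # trend per quarter on the log scale, from the coefficients above
exp(4*b)-1       # ~0.029, i.e. roughly 2.9% implied annual growth in the median price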
Now that we have a model, we can look at how it performs against the actuals, and then make some predictions.
plot(MedianTSUSLF)
MedianTSUSLF2<-forecast(MedianTSUSLF,h=40)  # forecast.HoltWinters() in the original; current versions of the forecast package expose this through the forecast() generic
plot(MedianTSUSLF2)
At this point, we should determine whether there are autocorrelations left in the residuals. We have a forecast, but it is likely that it can still be improved, and we have not covered all of the recommended functions yet, so this gives us a good chance to do so.
The first we will look at is acf(), to make sure autocorrelation stays low across the lagged values.
acf(window(MedianTSUSLF2$residuals,start = c(1964,1)),lag.max = 80)
Box.test(window(MedianTSUSLF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSUSLF2$residuals, start = c(1964, 1))
X-squared = 77.389, df = 80, p-value = 0.5619
With a p-value of roughly 0.56, the Box-Pierce test gives no evidence of remaining autocorrelation in the residuals, so we cannot conclude that autocorrelation is causing the model to misrepresent the data; this suggests we may not need to keep refining it. (Box.test() defaults to the Box-Pierce statistic; type = "Ljung-Box" is a common alternative.)
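One caveat: because the model was fit to logged prices, the forecast is on the log scale. Exponentiating brings it back to dollars; a sketch (note that exp() of a log-scale point forecast approximates the median, not the mean, of the original-scale distribution):
exp(MedianTSUSLF2$mean)    # point forecasts back in dollars
exp(MedianTSUSLF2$lower)   # lower prediction-interval bounds
exp(MedianTSUSLF2$upper)   # upper prediction-interval bounds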
Now that the US median is covered and we have a trustworthy forecast for the next 10 years, we can take a peek at the other regions to see how they compare.
MedianTSNE<-ts(data[,3],frequency=4,start=c(1963,1))
MedianTSMW<-ts(data[,4],frequency=4,start=c(1963,1))
MedianTSS<-ts(data[,5],frequency=4,start=c(1963,1))
MedianTSW<-ts(data[,6],frequency=4,start=c(1963,1))
plot.ts(MedianTSNE)
plot.ts(MedianTSMW)
plot.ts(MedianTSS)
plot.ts(MedianTSW)
Each region appears to be a little different, but we can follow steps similar to those above, starting with the Northeast region.
Comp2<-decompose(MedianTSNE)
MedianTSNEAdjust<-MedianTSNE-Comp2$seasonal
MedianTSNEAdjust2<-MedianTSNE-Comp2$trend
MedianTSNEAdjust3<-MedianTSNE-Comp2$random
MedianTSNEAdjust4<-MedianTSNE-Comp2$random-Comp2$trend-Comp2$seasonal
plot.ts(MedianTSNE)
plot.ts(MedianTSNEAdjust)
plot.ts(MedianTSNEAdjust2)
plot.ts(MedianTSNEAdjust3)
plot.ts(MedianTSNEAdjust4)
The last plot for the Northeast shows an even more pronounced increase in seasonal variation as time goes on. As before, we can log the data and continue on to evaluate a forecast.
MedianTSNEL<-log(MedianTSNE)
MedianTSNELF<-HoltWinters(MedianTSNEL)
MedianTSNELF
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = MedianTSNEL)
Smoothing parameters:
alpha: 0.4033972
beta : 0.0470212
gamma: 0.08138937
Coefficients:
[,1]
a 12.9640991341
b 0.0088600463
s1 0.0006493396
s2 0.0002255062
s3 -0.0022004558
s4 0.0248030571
plot(MedianTSNELF)
MedianTSNELF2<-forecast(MedianTSNELF,h=40)
plot(MedianTSNELF2)
acf(window(MedianTSNELF2$residuals,start = c(1964,1)),lag.max = 80)
Box.test(window(MedianTSNELF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSNELF2$residuals, start = c(1964, 1))
X-squared = 70.98, df = 80, p-value = 0.7544
For this region, the evidence again supports residual autocorrelations of essentially zero. This can be seen in the ACF plot, where few points cross the significance boundaries, and in the p-value of the Box test.
While this model will suffice, it would be interesting to see a different type of model, so we take this region one step further with an ARIMA model.
To begin, we already know that our data is not stationary, so making it stationary is the first thing to do in this scenario.
MedianTSNELD1<-diff(MedianTSNEL, differences = 1)   # first difference of the logged Northeast series
plot.ts(MedianTSNEL)
plot.ts(MedianTSNELD1)
You can see the difference is quite significant between the two plots when applying one order of differencing. The next step is to determine the remaining parameters for an ARIMA model; to do this, we can use the auto.arima() function.
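As a cross-check on that visual impression (not part of the original session), the forecast package can estimate the required amount of differencing directly:
ndiffs(MedianTSNEL)    # estimated order of first differencing; expected to be 1 for this trending series
nsdiffs(MedianTSNEL)   # seasonal differencing check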
auto.arima(MedianTSNEL)
Series: MedianTSNEL
ARIMA(0,1,1) with drift
Coefficients:
ma1 drift
-0.5769 0.0143
s.e. 0.0510 0.0021
sigma^2 estimated as 0.005454: log likelihood=255.96
AIC=-505.91 AICc=-505.8 BIC=-495.8
auto.arima(MedianTSNELD1)
Series: MedianTSNELD1
ARIMA(0,0,1) with non-zero mean
Coefficients:
ma1 mean
-0.5769 0.0143
s.e. 0.0510 0.0021
sigma^2 estimated as 0.005454: log likelihood=255.96
AIC=-505.91 AICc=-505.8 BIC=-495.8
Interestingly, you can see that the differencing was correct: auto.arima() run on the already-differenced series selects 0 for the middle parameter (d), while on the original data it selects 1. One note before fitting manually: auto.arima() included a drift term, which stats::arima() cannot include when d = 1, so the coefficients of the manual fit below differ slightly. Now that we have the optimal parameters for the ARIMA, we can forecast for the Northeast.
MedianTSNELF<-arima(MedianTSNEL,order = c(0,1,1))
MedianTSNELF
Call:
arima(x = MedianTSNEL, order = c(0, 1, 1))
Coefficients:
ma1
-0.4225
s.e. 0.0514
sigma^2 estimated as 0.006183: log likelihood = 241.57, aic = -479.14
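If we wanted to reproduce the auto.arima() result exactly, forecast::Arima() can include the drift term that stats::arima() drops. A sketch (the object name is hypothetical):
# Refit ARIMA(0,1,1) with drift; should recover ma1 = -0.5769 and drift = 0.0143 from above.
MedianTSNELFdrift<-Arima(MedianTSNEL,order=c(0,1,1),include.drift=TRUE)
MedianTSNELFdrift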
MedianTSNELF2<-forecast(MedianTSNELF,h=40)  # forecast.Arima() in the original; the forecast() generic dispatches on the Arima object
plot(MedianTSNELF2)
acf(MedianTSNELF2$residuals,lag.max=80)
Box.test(window(MedianTSNELF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSNELF2$residuals, start = c(1964, 1))
X-squared = 78.367, df = 80, p-value = 0.5307
Above we can see the forecast in action, and the diagnostics again support a valid model with little residual autocorrelation differing from 0. It does, however, come with a lower p-value and a larger X-squared than the Holt-Winters fit.
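Because the Holt-Winters objects for this region were overwritten when their names were reused for the ARIMA fit, a direct comparison needs a refit. A sketch using forecast::accuracy() (the hwNE and arimaNE names are hypothetical):
hwNE<-forecast(HoltWinters(MedianTSNEL),h=40)
arimaNE<-forecast(Arima(MedianTSNEL,order=c(0,1,1),include.drift=TRUE),h=40)
accuracy(hwNE)      # in-sample error measures on the log scale
accuracy(arimaNE)   # lower RMSE/MAE would favor the ARIMA specification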
Let’s move to the Midwest and begin with the basic plots to see which type of model will suffice here.
plot.ts(MedianTSMW)
Comp3<-decompose(MedianTSMW)
MedianTSMWAdjust<-MedianTSMW-Comp3$seasonal
MedianTSMWAdjust2<-MedianTSMW-Comp3$trend
MedianTSMWAdjust3<-MedianTSMW-Comp3$random
MedianTSMWAdjust4<-MedianTSMW-Comp3$random-Comp3$trend-Comp3$seasonal
plot.ts(MedianTSMW)
plot.ts(MedianTSMWAdjust)
plot.ts(MedianTSMWAdjust2)
plot.ts(MedianTSMWAdjust3)
plot.ts(MedianTSMWAdjust4)
MedianTSMWL<-log(MedianTSMW)
MedianTSMWLF<-HoltWinters(MedianTSMWL)
MedianTSMWLF
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = MedianTSMWL)
Smoothing parameters:
alpha: 0.5326104
beta : 0.02161117
gamma: 0.1180292
Coefficients:
[,1]
a 12.596292713
b 0.010316285
s1 -0.008215130
s2 -0.007651337
s3 -0.022767595
s4 -0.011331350
plot(MedianTSMWLF)
MedianTSMWLF2<-forecast(MedianTSMWLF,h=40)
plot(MedianTSMWLF2)
acf(window(MedianTSMWLF2$residuals,start = c(1964,1)),lag.max = 80)
Box.test(window(MedianTSMWLF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSMWLF2$residuals, start = c(1964, 1))
X-squared = 77.777, df = 80, p-value = 0.5496
The same approach works here as well, with a p-value of about 0.55 showing no problematic residual autocorrelation. Next, the South.
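Since this is now the third pass through an identical pipeline, it is worth noting that the whole thing could be wrapped in a small helper function. A sketch using only the packages already loaded (the explicit per-region steps are kept below, as in the original):
# Run the full log -> Holt-Winters -> forecast -> residual-diagnostics pipeline for one region.
evalRegion<-function(x,h=40,lag=80){
  fc<-forecast(HoltWinters(log(x)),h=h)   # log stabilizes the growing seasonal variance
  plot(fc)
  res<-window(fc$residuals,start=c(1964,1))
  acf(res,lag.max=lag)
  print(Box.test(res,lag=lag))
  invisible(fc)
}
# e.g. MedianTSSLF2<-evalRegion(MedianTSS)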
plot.ts(MedianTSS)
Comp4<-decompose(MedianTSS)
MedianTSSAdjust<-MedianTSS-Comp4$seasonal
MedianTSSAdjust2<-MedianTSS-Comp4$trend
MedianTSSAdjust3<-MedianTSS-Comp4$random
MedianTSSAdjust4<-MedianTSS-Comp4$random-Comp4$trend-Comp4$seasonal
plot.ts(MedianTSS)
plot.ts(MedianTSSAdjust)
plot.ts(MedianTSSAdjust2)
plot.ts(MedianTSSAdjust3)
plot.ts(MedianTSSAdjust4)
MedianTSSL<-log(MedianTSS)
MedianTSSLF<-HoltWinters(MedianTSSL)
MedianTSSLF
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = MedianTSSL)
Smoothing parameters:
alpha: 0.6638147
beta : 0.02750619
gamma: 0.1997318
Coefficients:
[,1]
a 12.553872501
b 0.010409675
s1 0.002614273
s2 0.002203794
s3 -0.007477704
s4 -0.001678296
plot(MedianTSSLF)
MedianTSSLF2<-forecast(MedianTSSLF,h=40)
plot(MedianTSSLF2)
acf(window(MedianTSSLF2$residuals,start = c(1964,1)),lag.max = 80)
Box.test(window(MedianTSSLF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSSLF2$residuals, start = c(1964, 1))
X-squared = 64.896, df = 80, p-value = 0.8897
All clear once again (p-value 0.89). The last region to cover is the West before we make any comparisons.
plot.ts(MedianTSW)
Comp5<-decompose(MedianTSW)
MedianTSWAdjust<-MedianTSW-Comp5$seasonal
MedianTSWAdjust2<-MedianTSW-Comp5$trend
MedianTSWAdjust3<-MedianTSW-Comp5$random
MedianTSWAdjust4<-MedianTSW-Comp5$random-Comp5$trend-Comp5$seasonal
plot.ts(MedianTSW)
plot.ts(MedianTSWAdjust)
plot.ts(MedianTSWAdjust2)
plot.ts(MedianTSWAdjust3)
plot.ts(MedianTSWAdjust4)
MedianTSWL<-log(MedianTSW)
MedianTSWLF<-HoltWinters(MedianTSWL)
MedianTSWLF
Holt-Winters exponential smoothing with trend and additive seasonal component.
Call:
HoltWinters(x = MedianTSWL)
Smoothing parameters:
alpha: 0.6554139
beta : 0.2325247
gamma: 0.0342411
Coefficients:
[,1]
a 12.8212789654
b 0.0042214341
s1 -0.0019541029
s2 0.0004703746
s3 -0.0047414771
s4 0.0050565842
plot(MedianTSWLF)
MedianTSWLF2<-forecast(MedianTSWLF,h=40)
plot(MedianTSWLF2)
acf(window(MedianTSWLF2$residuals,start = c(1964,1)),lag.max = 80)
Box.test(window(MedianTSWLF2$residuals,start = c(1964,1)),lag = 80)
Box-Pierce test
data: window(MedianTSWLF2$residuals, start = c(1964, 1))
X-squared = 50.271, df = 80, p-value = 0.9962
We now have projections for all of the regions and a forecast for the US as a whole. We can now get these five series on the same plot to evaluate them together (note that the Northeast objects were overwritten by the ARIMA fit above, so its line reflects that model).
ts.plot(MedianTSWLF2$fitted, MedianTSSLF2$fitted, MedianTSNELF2$fitted, MedianTSMWLF2$fitted, MedianTSUSLF2$fitted,
col=c("red","green","blue","black","purple"),
gpars=list(xlab="Year", ylab="Logged Value of Median House Sales", lty=c(1:5)))
legend("topleft",c("West","South","Northeast","Midwest","US"),col=c("red","green","blue","black","purple"),lty=1:5)  # legend added so the series can be told apart
The Northeast and the West appear to have jumped ahead in median sale prices over the past 20 years, and the forecasts suggest this will continue, with the median price rising overall. As someone looking to make an investment that would gain value over time, I would recommend looking at the Northeast and the West. Keep in mind that this plot does not show dollar prices: we are looking at the logged values, which helps us see how the markets perform relative to each other. Should we continue, we could evaluate the average price of homes by region and look at actual dollar costs after back-transforming the forecasts.
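As a last step toward concrete dollar figures, the end-of-horizon point forecasts can be back-transformed. A sketch (the exact values depend on the fits above, and the Northeast entry comes from the ARIMA model):
# 10-years-ahead (h = 40) point forecasts, converted from the log scale back to dollars.
sapply(list(US=MedianTSUSLF2,NE=MedianTSNELF2,MW=MedianTSMWLF2,S=MedianTSSLF2,W=MedianTSWLF2),
       function(f) round(exp(tail(f$mean,1))))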
I need to acknowledge Avril Coghlan's work in the Time Series 0.2 documentation. Her work inspired much of this analysis and was very helpful in learning these concepts in R.