This dataset contains the number of new single-family houses sold each month, in thousands, in the United States from January 1963 through September 2023, for a total of 729 monthly observations.
library(forecast); library(knitr)   ## forecasting functions (meanf, naive, snaive, rwf) and kable()
new.houses = na.omit(read.csv("C:/Users/qinfa/Desktop/school/STA 321/newhousessold.csv")[,-1])   ## drop the first column and any missing values, keeping the monthly counts
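A quick sanity check on the imported vector is worthwhile before modeling (a minimal sketch; it assumes the column kept above holds the monthly counts):
length(new.houses)    ## should be 729 monthly observations
summary(new.houses)   ## range of monthly sales, in thousands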
We begin by splitting the data into training and testing sets, holding back the last 10 months to measure forecasting error.
training = new.houses[1:719]
testing = new.houses[720:729]
## convert the training vector to a monthly time series starting in January 1963
new.houses.ts = ts(training, frequency = 12, start = c(1963, 1))
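A quick check that the series object has the intended structure (a minimal sketch; tsp() reports the start, end, and frequency of a ts object):
tsp(new.houses.ts)      ## should show start 1963, frequency 12
length(new.houses.ts)   ## 719 training observations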
Using the first 719 observations, we forecast the next 10 months with four baseline methods: the mean (the average of all training observations), naive, seasonal naive, and drift (random walk with drift) forecasts.
pred.mv = meanf(new.houses.ts, h=10)$mean              ## mean of all training observations
pred.naive = naive(new.houses.ts, h=10)$mean           ## last observed value carried forward
pred.snaive = snaive(new.houses.ts, h=10)$mean         ## value from the same month one year earlier
pred.rwf = rwf(new.houses.ts, h=10, drift = TRUE)$mean ## last value plus the average historical change
## collect the four sets of point forecasts into one table
pred.table = cbind(pred.mv = pred.mv,
                   pred.naive = pred.naive,
                   pred.snaive = pred.snaive,
                   pred.rwf = pred.rwf)
kable(pred.table, caption = "Point forecasts for the 10 hold-out months")
| pred.mv | pred.naive | pred.snaive | pred.rwf |
|---|---|---|---|
| 54.59805 | 41 | 61 | 40.99861 |
| 54.59805 | 41 | 70 | 40.99721 |
| 54.59805 | 41 | 71 | 40.99582 |
| 54.59805 | 41 | 68 | 40.99443 |
| 54.59805 | 41 | 56 | 40.99304 |
| 54.59805 | 41 | 58 | 40.99164 |
| 54.59805 | 41 | 48 | 40.99025 |
| 54.59805 | 41 | 44 | 40.98886 |
| 54.59805 | 41 | 51 | 40.98747 |
| 54.59805 | 41 | 44 | 40.98607 |
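The table above keeps only the point forecasts (the $mean component of each result). The objects returned by meanf(), naive(), snaive(), and rwf() also carry 80% and 95% prediction intervals by default; a minimal sketch for the drift forecast:
fc.rwf = rwf(new.houses.ts, h = 10, drift = TRUE)
fc.rwf$lower   ## lower 80% and 95% interval limits
fc.rwf$upper   ## upper 80% and 95% interval limits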
We now make a time series plot of the last 30 observations (#700-#729, the last 10 of which are the held-out test values) together with the 10 forecasts from each of the four models.
plot(700:729, new.houses[700:729], type="l", xlim=c(700,729), ylim=c(20, 100),
xlab = "observation sequence",
ylab = "New Single-Family Houses Sold (in thousands)",
main = "Monthly houses sold and forecasting")
points(700:729, new.houses[700:729],pch=20)
## overlay the forecasted points
points(720:729, pred.mv, pch=15, col = "red")
points(720:729, pred.naive, pch=16, col = "blue")
points(720:729, pred.rwf, pch=18, col = "navy")
points(720:729, pred.snaive, pch=17, col = "purple")
## connect the forecasts with dashed lines
lines(720:729, pred.mv, lty=2, col = "red")
lines(720:729, pred.snaive, lty=2, col = "purple")
lines(720:729, pred.naive, lty=2, col = "blue")
lines(720:729, pred.rwf, lty=2, col = "navy")
##
legend("topright", c("moving average", "naive", "drift", "seasonal naive"),
col=c("red", "blue", "navy", "purple"), pch=15:18, lty=rep(2,4),
bty="n", cex = 0.8)
The naive and drift forecasts are nearly identical, and both track the test data poorly, as does the seasonal naive forecast. The mean forecast appears to be the best of the four. The table below compares the methods numerically.
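Three standard accuracy measures are computed over the $n = 10$ hold-out months, where $y_t$ is the observed value and $\hat{y}_t$ the forecast:
$$
\text{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left|\frac{y_t - \hat{y}_t}{y_t}\right|, \qquad
\text{MAD} = \frac{1}{n}\sum_{t=1}^{n}\left|y_t - \hat{y}_t\right|, \qquad
\text{MSE} = \frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2.
$$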
true.value = testing   ## the 10 held-out observations
PE.mv = 100*(true.value - pred.mv)/true.value
PE.naive = 100*(true.value - pred.naive)/true.value
PE.snaive = 100*(true.value - pred.snaive)/true.value
PE.rwf = 100*(true.value - pred.rwf)/true.value
##
MAPE.mv = mean(abs(PE.mv))
MAPE.naive = mean(abs(PE.naive))
MAPE.snaive = mean(abs(PE.snaive))
MAPE.rwf = mean(abs(PE.rwf))
##
MAPE = c(MAPE.mv, MAPE.naive, MAPE.snaive, MAPE.rwf)
## forecast errors (observed minus predicted)
e.mv = true.value - pred.mv
e.naive = true.value - pred.naive
e.snaive = true.value - pred.snaive
e.rwf = true.value - pred.rwf
## MAD (mean absolute deviation)
MAD.mv = mean(abs(e.mv))
MAD.naive = mean(abs(e.naive))
MAD.snaive = mean(abs(e.snaive))
MAD.rwf = mean(abs(e.rwf))
MAD = c(MAD.mv, MAD.naive, MAD.snaive, MAD.rwf)
## MSE
MSE.mv = mean((e.mv)^2)
MSE.naive = mean((e.naive)^2)
MSE.snaive = mean((e.snaive)^2)
MSE.rwf = mean((e.rwf)^2)
MSE = c(MSE.mv, MSE.naive, MSE.snaive, MSE.rwf)
##
accuracy.table = cbind(MAPE = MAPE, MAD = MAD, MSE = MSE)
row.names(accuracy.table) = c("Mean", "Naive", "Seasonal Naive", "Drift")
kable(accuracy.table, caption = "Overall performance of the four forecasting methods")
| Method | MAPE | MAD | MSE |
|---|---|---|---|
| Mean | 8.532377 | 5.041168 | 37.97324 |
| Naive | 28.718779 | 17.000000 | 315.40000 |
| Seasonal Naive | 19.095505 | 10.900000 | 143.30000 |
| Drift | 28.731785 | 17.007660 | 315.67863 |
Based on this table, the mean forecast performed best on all three measures, followed by the seasonal naive forecast. The drift and naive methods performed almost identically and did the worst against the test data.
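As a cross-check on these hand-computed measures, the forecast package's accuracy() function reports comparable statistics (ME, RMSE, MAE, MAPE, among others) when given a forecast object and the held-out values; a minimal sketch for the mean forecast (its MAE corresponds to our MAD, and RMSE is the square root of MSE):
accuracy(meanf(new.houses.ts, h = 10), testing)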