1 Description of the Dataset

This dataset contains the monthly counts of new single-family houses sold, in thousands, in the United States from 1963 to 2023. After dropping the date column and removing missing values, 729 monthly observations remain.

library(forecast)   # meanf(), naive(), snaive(), rwf()
library(knitr)      # kable()
## read the data, drop the first (date) column, and remove missing values
new.houses = na.omit(read.csv("C:/Users/qinfa/Desktop/school/STA 321/newhousessold.csv")[,-1])
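
As a quick sanity check (a minimal sketch, assuming the remaining column is a plain numeric vector after na.omit()), we can confirm the size and completeness of the series:

length(new.houses)      # expect 729 observations
head(new.houses)        # first few monthly counts, in thousands
sum(is.na(new.houses))  # expect 0 after na.omit()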

1.1 Training and Testing Data

We begin by splitting the data into training and testing sets, holding back the last 10 months to measure forecasting error.

training = new.houses[1:719]    # first 719 months, used to fit the models
testing = new.houses[720:729]   # last 10 months, held out for evaluation
##
new.houses.ts = ts(training, frequency = 12, start = c(1963, 1))  # monthly series starting January 1963
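
Given the January 1963 start, observation 719 falls in November 2022, so the held-out months run from December 2022 through September 2023. A quick check:

end(new.houses.ts)   # expect c(2022, 11): training ends in November 2022
length(testing)      # the 10 held-out months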

1.2 Building Four Baseline Forecasting Models

Using the first 719 observations, we generate 10-month-ahead forecasts from four baseline methods: the mean forecast (meanf), the naive method (naive), the seasonal naive method (snaive), and the random walk with drift (rwf with drift = TRUE).

pred.mv = meanf(new.houses.ts, h=10)$mean       # mean of all training values
pred.naive = naive(new.houses.ts, h=10)$mean    # last observed value, repeated
pred.snaive = snaive(new.houses.ts, h=10)$mean  # value from the same month one year earlier
pred.rwf = rwf(new.houses.ts, h=10, drift = TRUE)$mean  # last value plus average historical change
###
###
pred.table = cbind( pred.mv = pred.mv,
                    pred.naive = pred.naive,
                    pred.snaive = pred.snaive,
                    pred.rwf = pred.rwf)
kable(pred.table, caption = "Forecasting Table")
Forecasting Table

  pred.mv   pred.naive   pred.snaive   pred.rwf
 54.59805           41            61   40.99861
 54.59805           41            70   40.99721
 54.59805           41            71   40.99582
 54.59805           41            68   40.99443
 54.59805           41            56   40.99304
 54.59805           41            58   40.99164
 54.59805           41            48   40.99025
 54.59805           41            44   40.98886
 54.59805           41            51   40.98747
 54.59805           41            44   40.98607
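
To make explicit what each method is doing, the point forecasts can be reproduced by hand from the training vector (a minimal sketch of the standard textbook formulas these functions implement, not the forecast package's internal code):

n = length(training)   # 719
h = 1:10               # forecast horizons
rep(mean(training), 10)                       # mean forecast: average of all training values
rep(training[n], 10)                          # naive: repeat the last observed value
training[n - 12 + h]                          # seasonal naive: same month one year earlier
training[n] + h * (training[n] - training[1]) / (n - 1)  # drift: last value plus the average historical change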

1.3 Visualizations

We now plot the last 30 observations (#700 through #729) along with the 10 forecasts from each of the four models.

plot(700:729, new.houses[700:729], type="l", xlim=c(700,729), ylim=c(20, 100),
     xlab = "observation sequence",
     ylab = "New Single-Family Houses Sold (in thousands)",
     main = "Monthly houses sold and forecasting")
points(700:729, new.houses[700:729],pch=20)
##
points(720:729, pred.mv, pch=15, col = "red")
points(720:729, pred.naive, pch=16, col = "blue")
points(720:729, pred.rwf, pch=18, col = "navy")
points(720:729, pred.snaive, pch=17, col = "purple")
##
lines(720:729, pred.mv, lty=2, col = "red")
lines(720:729, pred.snaive, lty=2, col = "purple")
lines(720:729, pred.naive, lty=2, col = "blue")
lines(720:729, pred.rwf, lty=2, col = "navy")
## 
legend("topright", c("moving average", "naive", "drift", "seasonal naive"),
       col=c("red", "blue", "navy", "purple"), pch=15:18, lty=rep(2,4),
       bty="n", cex = 0.8)

The naive and drift methods produced nearly identical forecasts, and both tracked the test data poorly, as did the seasonal naive method. The mean forecast appears to be the best of these baseline models. (Note that meanf() produces the mean of all training values, not a moving average.)

1.4 Accuracy Metrics

We compare the models using three accuracy metrics: the mean absolute percentage error (MAPE), the mean absolute deviation (MAD), and the mean squared error (MSE).
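
For the h = 10 held-out months, writing $y_t$ for the actual values and $\hat{y}_t$ for the forecasts, the metrics are

$$
\mathrm{MAPE} = \frac{100}{h}\sum_{t=1}^{h}\left|\frac{y_t - \hat{y}_t}{y_t}\right|, \qquad
\mathrm{MAD} = \frac{1}{h}\sum_{t=1}^{h}\left|y_t - \hat{y}_t\right|, \qquad
\mathrm{MSE} = \frac{1}{h}\sum_{t=1}^{h}\left(y_t - \hat{y}_t\right)^2.
$$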

true.value = new.houses[720:729]
PE.mv =  100*(true.value - pred.mv)/true.value
PE.naive =  100*(true.value - pred.naive)/true.value
PE.snaive =  100*(true.value - pred.snaive)/true.value
PE.rwf =  100*(true.value - pred.rwf)/true.value
##
MAPE.mv = mean(abs(PE.mv))
MAPE.naive = mean(abs(PE.naive))
MAPE.snaive = mean(abs(PE.snaive))
MAPE.rwf = mean(abs(PE.rwf))
##
MAPE = c(MAPE.mv, MAPE.naive, MAPE.snaive, MAPE.rwf)
## forecast errors on the test set
e.mv = true.value - pred.mv
e.naive = true.value - pred.naive
e.snaive = true.value - pred.snaive
e.rwf = true.value - pred.rwf
## MAD (mean absolute deviation)
MAD.mv = mean(abs(e.mv))
MAD.naive = mean(abs(e.naive))
MAD.snaive = mean(abs(e.snaive))
MAD.rwf = mean(abs(e.rwf))
MAD = c(MAD.mv, MAD.naive, MAD.snaive, MAD.rwf)
## MSE
MSE.mv = mean((e.mv)^2)
MSE.naive = mean((e.naive)^2)
MSE.snaive = mean((e.snaive)^2)
MSE.rwf = mean((e.rwf)^2)
MSE = c(MSE.mv, MSE.naive, MSE.snaive, MSE.rwf)
##
accuracy.table = cbind(MAPE = MAPE, MAD = MAD, MSE = MSE)
row.names(accuracy.table) = c("Mean", "Naive", "Seasonal Naive", "Drift")
kable(accuracy.table, caption ="Overall performance of the four forecasting methods")
Overall performance of the four forecasting methods

                     MAPE        MAD        MSE
Mean             8.532377   5.041168   37.97324
Naive           28.718779  17.000000  315.40000
Seasonal Naive  19.095505  10.900000  143.30000
Drift           28.731785  17.007660  315.67863

Based on this table, the mean forecast had the best performance on all three metrics, followed by the seasonal naive method. The drift and naive methods performed almost identically and fared worst against the test data.
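
As an optional cross-check, the forecast package's built-in accuracy() function reports related metrics (including RMSE, MAE, and MAPE) for each model against the held-out data; its test-set MAE matches the MAD above, and squaring its RMSE recovers the MSE:

accuracy(meanf(new.houses.ts, h=10), testing)   # test-set row gives MAE and MAPE for the mean forecast
accuracy(naive(new.houses.ts, h=10), testing)
accuracy(snaive(new.houses.ts, h=10), testing)
accuracy(rwf(new.houses.ts, h=10, drift = TRUE), testing)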