1 Description of the Dataset

This dataset contains the monthly counts of new single-family houses sold, in thousands, in the United States from 1963 to 2023. After dropping the date column and removing missing values, 729 monthly observations remain.

library(forecast)   # meanf(), naive(), snaive(), rwf()
library(knitr)      # kable()
## read the data, drop the first (date) column, and remove missing values
new.houses = na.omit(read.csv("C:/Users/qinfa/Desktop/school/STA 321/newhousessold.csv")[,-1])
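
As a quick sanity check (a minimal sketch, assuming the remaining column is a plain numeric vector after na.omit()), we can confirm the size and completeness of the series:

length(new.houses)      # expect 729 observations
head(new.houses)        # first few monthly counts, in thousands
sum(is.na(new.houses))  # expect 0 after na.omit()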

1.1 Training and Testing Data

We begin by splitting the data into training and testing sets, holding back the last 10 months to measure forecasting error.

training = new.houses[1:719]    # first 719 months, used to fit the models
testing = new.houses[720:729]   # last 10 months, held out for evaluation
##
new.houses.ts = ts(training, frequency = 12, start = c(1963, 1))  # monthly series starting January 1963
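
Given the January 1963 start, observation 719 falls in November 2022, so the held-out months run from December 2022 through September 2023. A quick check:

end(new.houses.ts)   # expect c(2022, 11): training ends in November 2022
length(testing)      # the 10 held-out months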

1.2 Building Four Baseline Forecasting Models

Using the first 719 observations, we generate 10-month-ahead forecasts from four baseline methods: the mean forecast (meanf), the naive method (naive), the seasonal naive method (snaive), and the random walk with drift (rwf with drift = TRUE).

pred.mv = meanf(new.houses.ts, h=10)$mean       # mean of all training values
pred.naive = naive(new.houses.ts, h=10)$mean    # last observed value, repeated
pred.snaive = snaive(new.houses.ts, h=10)$mean  # value from the same month one year earlier
pred.rwf = rwf(new.houses.ts, h=10, drift = TRUE)$mean  # last value plus average historical change
###
###
pred.table = cbind( pred.mv = pred.mv,
                    pred.naive = pred.naive,
                    pred.snaive = pred.snaive,
                    pred.rwf = pred.rwf)
kable(pred.table, caption = "Forecasting Table")
Forecasting Table

  pred.mv   pred.naive   pred.snaive   pred.rwf
 54.59805           41            61   40.99861
 54.59805           41            70   40.99721
 54.59805           41            71   40.99582
 54.59805           41            68   40.99443
 54.59805           41            56   40.99304
 54.59805           41            58   40.99164
 54.59805           41            48   40.99025
 54.59805           41            44   40.98886
 54.59805           41            51   40.98747
 54.59805           41            44   40.98607
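
To make explicit what each method is doing, the point forecasts can be reproduced by hand from the training vector (a minimal sketch of the standard textbook formulas these functions implement, not the forecast package's internal code):

n = length(training)   # 719
h = 1:10               # forecast horizons
rep(mean(training), 10)                       # mean forecast: average of all training values
rep(training[n], 10)                          # naive: repeat the last observed value
training[n - 12 + h]                          # seasonal naive: same month one year earlier
training[n] + h * (training[n] - training[1]) / (n - 1)  # drift: last value plus the average historical change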

1.3 Visualizations

We now plot the last 30 observations (#700 through #729) along with the 10 forecasts from each of the four models.

plot(700:729, new.houses[700:729], type="l", xlim=c(700,729), ylim=c(20, 100),
     xlab = "observation sequence",
     ylab = "New Single-Family Houses Sold (in thousands)",
     main = "Monthly houses sold and forecasting")
points(700:729, new.houses[700:729],pch=20)
##
points(720:729, pred.mv, pch=15, col = "red")
points(720:729, pred.naive, pch=16, col = "blue")
points(720:729, pred.rwf, pch=18, col = "navy")
points(720:729, pred.snaive, pch=17, col = "purple")
##
lines(720:729, pred.mv, lty=2, col = "red")
lines(720:729, pred.snaive, lty=2, col = "purple")
lines(720:729, pred.naive, lty=2, col = "blue")
lines(720:729, pred.rwf, lty=2, col = "navy")
## 
legend("topright", c("moving average", "naive", "drift", "seasonal naive"),
       col=c("red", "blue", "navy", "purple"), pch=15:18, lty=rep(2,4),
       bty="n", cex = 0.8)

The naive and drift methods produced nearly identical forecasts, and both tracked the test data poorly, as did the seasonal naive method. The mean forecast appears to be the best of these baseline models. (Note that meanf() produces the mean of all training values, not a moving average.)

1.4 Accuracy Metrics

We compare the models using three accuracy metrics: the mean absolute percentage error (MAPE), the mean absolute deviation (MAD), and the mean squared error (MSE).
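
For the h = 10 held-out months, writing $y_t$ for the actual values and $\hat{y}_t$ for the forecasts, the metrics are

$$
\mathrm{MAPE} = \frac{100}{h}\sum_{t=1}^{h}\left|\frac{y_t - \hat{y}_t}{y_t}\right|, \qquad
\mathrm{MAD} = \frac{1}{h}\sum_{t=1}^{h}\left|y_t - \hat{y}_t\right|, \qquad
\mathrm{MSE} = \frac{1}{h}\sum_{t=1}^{h}\left(y_t - \hat{y}_t\right)^2.
$$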

true.value = new.houses[720:729]
PE.mv =  100*(true.value - pred.mv)/true.value
PE.naive =  100*(true.value - pred.naive)/true.value
PE.snaive =  100*(true.value - pred.snaive)/true.value
PE.rwf =  100*(true.value - pred.rwf)/true.value
##
MAPE.mv = mean(abs(PE.mv))
MAPE.naive = mean(abs(PE.naive))
MAPE.snaive = mean(abs(PE.snaive))
MAPE.rwf = mean(abs(PE.rwf))
##
MAPE = c(MAPE.mv, MAPE.naive, MAPE.snaive, MAPE.rwf)
## forecast errors on the test set
e.mv = true.value - pred.mv
e.naive = true.value - pred.naive
e.snaive = true.value - pred.snaive
e.rwf = true.value - pred.rwf
## MAD (mean absolute deviation)
MAD.mv = mean(abs(e.mv))
MAD.naive = mean(abs(e.naive))
MAD.snaive = mean(abs(e.snaive))
MAD.rwf = mean(abs(e.rwf))
MAD = c(MAD.mv, MAD.naive, MAD.snaive, MAD.rwf)
## MSE
MSE.mv = mean((e.mv)^2)
MSE.naive = mean((e.naive)^2)
MSE.snaive = mean((e.snaive)^2)
MSE.rwf = mean((e.rwf)^2)
MSE = c(MSE.mv, MSE.naive, MSE.snaive, MSE.rwf)
##
accuracy.table = cbind(MAPE = MAPE, MAD = MAD, MSE = MSE)
row.names(accuracy.table) = c("Mean", "Naive", "Seasonal Naive", "Drift")
kable(accuracy.table, caption ="Overall performance of the four forecasting methods")
Overall performance of the four forecasting methods

                     MAPE        MAD        MSE
Mean             8.532377   5.041168   37.97324
Naive           28.718779  17.000000  315.40000
Seasonal Naive  19.095505  10.900000  143.30000
Drift           28.731785  17.007660  315.67863

Based on this table, the mean forecast had the best performance on all three metrics, followed by the seasonal naive method. The drift and naive methods performed almost identically and fared worst against the test data.
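
As an optional cross-check, the forecast package's built-in accuracy() function reports related metrics (including RMSE, MAE, and MAPE) for each model against the held-out data; its test-set MAE matches the MAD above, and squaring its RMSE recovers the MSE:

accuracy(meanf(new.houses.ts, h=10), testing)   # test-set row gives MAE and MAPE for the mean forecast
accuracy(naive(new.houses.ts, h=10), testing)
accuracy(snaive(new.houses.ts, h=10), testing)
accuracy(rwf(new.houses.ts, h=10, drift = TRUE), testing)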