This dataset contains monthly weather summaries for Houston, Texas from January 1947 through December 2014. For this analysis we focus on the 150 most recent observations, which span July 2002 through December 2014.
Variables: Month (of year), Year, LowTemp, HighTemp, WarmestMin, ColdestHigh, AveMin, AveMax, meanTemp, TotPrecip, TotSnow, Max24hrPrecip
# Packages used later for forecasting (forecast) and table output (kable)
library(forecast)
library(knitr)
Weather <- read.csv("https://raw.githubusercontent.com/TylerBattaglini/STA-321/refs/heads/main/houston_weather.csv", header = TRUE)
# Build a Date column from Year and Month, then sort chronologically
Weather$Date <- as.Date(paste(Weather$Year, Weather$Month, "01", sep = "-"), format = "%Y-%m-%d")
Weather1 <- Weather[order(Weather$Date), ]
# Keep the 150 most recent monthly observations
most_recent_150 <- tail(Weather1, 150)
head(most_recent_150)
Month Year LowTemp HighTemp WarmestMin ColdestHigh AveMin AveMax meanTemp
667 7 2002 71 95 79 82 75.5 90.9 83.2
668 8 2002 70 97 79 76 75.4 91.3 83.3
669 9 2002 64 96 77 79 72.2 87.9 80.1
670 10 2002 50 93 75 67 65.5 79.0 72.3
671 11 2002 38 84 69 51 49.8 69.5 59.7
672 12 2002 35 77 66 43 46.7 64.8 55.7
TotPrecip Max24hrPrecip Date
667 8.55 2.29 2002-07-01
668 8.39 5.52 2002-08-01
669 3.53 0.80 2002-09-01
670 9.99 2.84 2002-10-01
671 2.89 1.22 2002-11-01
672 4.84 1.33 2002-12-01
Since the data are monthly, we set the frequency to 12 when defining the time series object.
# The 150-month window begins in July 2002, so start = c(2002, 7)
houstonweather.ts <- ts(most_recent_150$meanTemp, frequency = 12, start = c(2002, 7))
par(mar = c(4, 4, 2, 2))
plot(houstonweather.ts,
main = "Houston Average Monthly Temperatures (2008-2019)",
ylab = "Mean Temperature (°F)",
xlab = "Year")
To get a better visual understanding of the series, we decompose it into trend, seasonal, and remainder components using two methods: classical decomposition and STL.
cls.decomp = decompose(houstonweather.ts)  # classical moving-average decomposition
par(mar=c(2,2,2,2))
plot(cls.decomp, xlab="")
stl.decomp = stl(houstonweather.ts, s.window = 12)  # STL (loess-based) decomposition
par(mar=c(2,2,2,2))
plot(stl.decomp)
The second decomposition (STL) does a better job of visualizing the trend. Its trend panel clearly shows a minimum around 2009 and a maximum around 2011, and where the classical trend is rough, the STL trend is smoother and easier to read.
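As a quick check on these visual impressions, we can pull the trend component out of the STL fit and locate its extremes. This is a minimal sketch using the stl.decomp object defined above:

# Extract the smoothed trend from the STL fit and find its extremes
trend <- stl.decomp$time.series[, "trend"]
time(trend)[which.min(trend)]  # time of the trend minimum
time(trend)[which.max(trend)]  # time of the trend maximum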
To show the effect of training-set size on the forecasts, we fit the model on four training sets of different lengths. Given the index ranges below, the four sizes are 143, 107, 71, and 47 observations. The last 7 observations are held out as a test set for computing prediction errors.
ini.data = most_recent_150$meanTemp
n0 = length(ini.data)  # 150 observations
# Four nested training sets, all ending 7 observations before the end of the series
train.data01 = ini.data[1:(n0-7)]   # 143 observations
train.data02 = ini.data[37:(n0-7)]  # 107 observations
train.data03 = ini.data[73:(n0-7)]  # 71 observations
train.data04 = ini.data[97:(n0-7)]  # 47 observations
test.data = ini.data[(n0-6):n0]     # last 7 observations held out
# The full window begins July 2002; the shorter training sets start 36, 72,
# and 96 months later (July 2005, July 2008, July 2010)
train01.ts = ts(train.data01, frequency = 12, start = c(2002, 7))
train02.ts = ts(train.data02, frequency = 12, start = c(2005, 7))
train03.ts = ts(train.data03, frequency = 12, start = c(2008, 7))
train04.ts = ts(train.data04, frequency = 12, start = c(2010, 7))
# Fit STL to each training set, then forecast 7 months ahead; forecast()
# extrapolates the seasonally adjusted series with a naive forecast and
# adds the seasonal component back
stl01 = stl(train01.ts, s.window = 12)
stl02 = stl(train02.ts, s.window = 12)
stl03 = stl(train03.ts, s.window = 12)
stl04 = stl(train04.ts, s.window = 12)
fcst01 = forecast(stl01, h = 7, method = "naive")
fcst02 = forecast(stl02, h = 7, method = "naive")
fcst03 = forecast(stl03, h = 7, method = "naive")
fcst04 = forecast(stl04, h = 7, method = "naive")
# Relative (percent) errors; note the denominator is the forecast value
PE01 = (test.data - fcst01$mean) / fcst01$mean
PE02 = (test.data - fcst02$mean) / fcst02$mean
PE03 = (test.data - fcst03$mean) / fcst03$mean
PE04 = (test.data - fcst04$mean) / fcst04$mean
# Mean absolute percent error (MAPE) for each training size
MAPE1 = mean(abs(PE01))
MAPE2 = mean(abs(PE02))
MAPE3 = mean(abs(PE03))
MAPE4 = mean(abs(PE04))
# Raw forecast errors and mean squared error (MSE) for each training size
E1 = test.data - fcst01$mean
E2 = test.data - fcst02$mean
E3 = test.data - fcst03$mean
E4 = test.data - fcst04$mean
MSE1 = mean(E1^2)
MSE2 = mean(E2^2)
MSE3 = mean(E3^2)
MSE4 = mean(E4^2)
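As a sanity check, the two error measures can be wrapped in small helper functions (hypothetical, not part of the original script); applied to the first forecast they should reproduce MAPE1 and MSE1:

# Hypothetical helpers mirroring the calculations above; the percent error
# is taken relative to the forecast, not the observed value
mape <- function(actual, fc) mean(abs((actual - fc) / fc))
mse <- function(actual, fc) mean((actual - fc)^2)
mape(test.data, fcst01$mean)  # should match MAPE1
mse(test.data, fcst01$mean)   # should match MSE1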
MSE = c(MSE1, MSE2, MSE3, MSE4)
MAPE = c(MAPE1, MAPE2, MAPE3, MAPE4)
accuracy = cbind(MSE = MSE, MAPE = MAPE)
row.names(accuracy) = c("n.143", "n.107", "n.71", "n.47")  # actual training sizes
kable(accuracy, caption = "Error comparison between forecast results with different sample sizes")
|       |       MSE |      MAPE |
|:------|----------:|----------:|
| n.143 | 13.212112 | 0.0477541 |
| n.107 | 13.274332 | 0.0473071 |
| n.71  | 12.067922 | 0.0440394 |
| n.47  |  8.515213 | 0.0365059 |
par(mfrow = c(1, 2))
plot(1:4, MSE, type = "b", col = "darkred", ylab = "Error", xlab = "",
main = "MSE", axes = FALSE)
labs = c("n=144", "n=109", "n=73", "n=48")
axis(1, at = 1:4, labels = labs)
axis(2)
text(1:4, MSE - 0.03, as.character(round(MSE, 4)), col = "darkred", cex = 0.7)
plot(1:4, MAPE, type = "b", col = "blue", ylab = "Error", xlab = "",
main = "MAPE", axes = FALSE)
axis(1, at = 1:4, labels = labs)
axis(2)
text(1:4, MAPE + 0.03, as.character(round(MAPE, 4)), col = "blue", cex = 0.7)
legend("topright", legend = c("MSE", "MAPE"), col = c("darkred", "blue"),
lty = 1, bty = "n", cex = 0.8)
The 47-observation training set gives the lowest MSE (8.515) and the lowest MAPE (0.0365), and therefore the best overall forecast accuracy here. This suggests that the model built on the smallest, most recent window outperforms the larger ones. However, small samples can overfit recent behavior, which may limit how well the forecasts generalize. The 71-observation set is also a reasonable choice: among the remaining sizes it has the lowest MSE and MAPE. If we are confident that a short recent window represents the series well, the 47-observation set is the best choice; if we are concerned about generalizability or overfitting, the 71-observation set provides solid performance with less risk.
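One way to probe the overfitting concern beyond this single train/test split would be a rolling-origin evaluation, scoring each window size on several different 7-month test sets. The sketch below is not part of the original analysis; roll.mape is a hypothetical helper that reuses the same STL-plus-naive forecasting approach:

# Hypothetical rolling-origin check: score each training-window size on
# several 7-month test sets instead of a single one
roll.mape <- function(x, size, h = 7, n.eval = 12) {
  n.eval <- min(n.eval, length(x) - h - size + 1)  # origins that fit in the data
  errs <- sapply(seq_len(n.eval), function(i) {
    end <- length(x) - h - (i - 1)
    train <- ts(x[(end - size + 1):end], frequency = 12)
    fc <- forecast(stl(train, s.window = 12), h = h, method = "naive")
    actual <- x[(end + 1):(end + h)]
    mean(abs((actual - fc$mean) / fc$mean))  # same MAPE definition as above
  })
  mean(errs)
}
sapply(c(143, 107, 71, 47), function(s) roll.mape(ini.data, s))

For the 143-observation window only one origin fits, so its score matches the single-split MAPE above; the shorter windows are averaged over up to 12 origins.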