We use ARIMA to forecast the temperature of New York City. This report includes 5 more parts: Preprocess, Methodology (explaining why we chose ARIMA), Analyzing, Predicting and Evaluating, and Conclusion.
The initial data contains readings recorded at 5 points in time each day, and we need only the data for New York City.
Find and handle NA/missing values in the dataset.
Filter the data to keep only the Date, Location and Temperature of New York City. Because the initial data were recorded at 5 points in time each day, we then group by Date and take the average Temperature.
Use only data recorded before 14/10/2024.
Create Day, Month, Year and Week columns for the dataset.
## Date Location Temperature Humidity WindSpeed
## 0 0 0 0 0
## Precipitation CloudCover Pressure RainTomorrow
## 0 0 0 0
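The per-column counts above are the kind of result that `colSums(is.na(...))` returns. The call that produced them is not shown in the report, so the following is only a minimal sketch on a small made-up data frame (`toy`, `na_counts`, `toy_clean` and the values are hypothetical, not the USArain data):

```r
# Hypothetical stand-in for the raw data; values are invented for illustration
toy <- data.frame(
  Date        = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02")),
  Location    = c("New York", "New York", "New York"),
  Temperature = c(58.8, NA, 71.6)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# One simple way to handle missing values: drop incomplete rows
toy_clean <- toy[complete.cases(toy), ]
```

When every count is zero, as in the output above, no rows need to be dropped or imputed.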
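library(dplyr)
library(lubridate)

# Keep New York City rows recorded before 14/10/2024
NewYork_temp <- USArain |>
  filter(Location == "New York" & Date < "2024-10-14") |>
  select(Date, Temperature)

# Average the 5 daily readings into one value per date
NewYork_temp <- NewYork_temp |>
  group_by(Date) |>
  summarise(Avg_Temperature = mean(Temperature))

# Derive calendar columns from the date
time <- as.Date(NewYork_temp$Date)
NewYork_temp$Day <- as.numeric(format(time, "%d"))
NewYork_temp$Month <- as.numeric(format(time, "%m"))
NewYork_temp$Year <- as.numeric(format(time, "%Y"))
NewYork_temp$Week <- week(time)
## Date Avg_Temperature Day Month Year Week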
## 1 2024-01-01 58.82717 1 1 2024 1
## 2 2024-01-02 71.57452 2 1 2024 1
## 3 2024-01-03 82.42953 3 1 2024 1
## 4 2024-01-04 63.57733 4 1 2024 1
## 5 2024-01-05 68.95104 5 1 2024 1
## 6 2024-01-06 48.90668 6 1 2024 1
## Date Avg_Temperature Day Month Year Week
## 282 2024-10-08 49.27937 8 10 2024 41
## 283 2024-10-09 59.80298 9 10 2024 41
## 284 2024-10-10 46.79601 10 10 2024 41
## 285 2024-10-11 68.91663 11 10 2024 41
## 286 2024-10-12 53.77048 12 10 2024 41
## 287 2024-10-13 70.48573 13 10 2024 41
The initial dataset contains predicted Temperature values from the beginning of 2024 to the end of 2025. However, for our analysis we used only the data from before the date this report was written (14/10/2024). The dataset has no missing values. After preprocessing, the data consist of the Date (with derived Day, Month, Year and Week columns) and the daily average Temperature of New York City from 01/01/2024 to 13/10/2024. The processed dataset has 287 observations.
weekly_avg_temp <- NewYork_temp |>
group_by(Week) |>
summarise(Weekly_Temperature = mean(Avg_Temperature))
head(weekly_avg_temp)
## # A tibble: 6 × 2
## Week Weekly_Temperature
## <dbl> <dbl>
## 1 1 64.3
## 2 2 62.9
## 3 3 73.9
## 4 4 63.4
## 5 5 61.8
## 6 6 65.3
## # A tibble: 6 × 2
## Week Weekly_Temperature
## <dbl> <dbl>
## 1 36 64.7
## 2 37 68.4
## 3 38 66.6
## 4 39 68.7
## 5 40 66.2
## 6 41 57.3
library(ggplot2)
ggplot(weekly_avg_temp, aes(x = Week, y = Weekly_Temperature)) +
  geom_line(color = "#0000AA") +
  geom_smooth(method = "lm", color = "maroon") +
  labs(title = "Temperature Over Week", x = "Week",
       y = "Temperature") +
  xlim(0, 41) + ylim(55, 75)
X axis: Weeks, ranging from 1 to 41.
Y axis: Average Temperature, fluctuating between 55 and 75 degrees.
Blue Line: Represents the change in average temperature over weeks, with considerable fluctuations from week to week.
Red Line: A linear regression line indicating the predicted trend of average temperature. The line has a slight negative slope, suggesting a minor decrease in average temperature over time.
Gray Shaded Area: Shows the 95% confidence interval of the regression line. For a linear fit this band is narrowest near the middle of the week range and widens toward both ends, where the fitted line is less certain.
Overall Trend: There is a slight downward trend in the average temperature over time, but the change is not very pronounced.
High Variability: The data shows a lot of week-to-week fluctuations, especially between weeks 5 and 25.
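The slope of the red line can also be read off numerically with `lm()`. The sketch below uses only the twelve weekly means printed earlier (weeks 1 to 6 and 36 to 41), so its estimates are illustrative and will differ from a fit on all 41 weeks; `weekly_subset` and `trend_fit` are names introduced here for the example:

```r
# Only the twelve weekly means shown in the head/tail output, not the full series
weekly_subset <- data.frame(
  Week = c(1:6, 36:41),
  Weekly_Temperature = c(64.3, 62.9, 73.9, 63.4, 61.8, 65.3,
                         64.7, 68.4, 66.6, 68.7, 66.2, 57.3)
)

# Simple linear trend of temperature on week number
trend_fit <- lm(Weekly_Temperature ~ Week, data = weekly_subset)

# Estimated change in temperature per week and its 95% confidence interval
slope <- coef(trend_fit)[["Week"]]
slope_ci <- confint(trend_fit, "Week", level = 0.95)
print(slope)
print(slope_ci)
```

If the interval contains zero, the trend is not distinguishable from a flat line at the 5% level.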
The data show significant fluctuations. In this case, ARIMA (AutoRegressive Integrated Moving Average) has the potential to provide short-term forecasts that are superior to those of more theoretically satisfying regression models. Therefore, we decided to use ARIMA for time series forecasting.
library(tseries)
library(forecast)
adf_result <- adf.test(NewYork_temp[,"Avg_Temperature"])
print(adf_result)
##
## Augmented Dickey-Fuller Test
##
## data: NewYork_temp[, "Avg_Temperature"]
## Dickey-Fuller = -5.8363, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
Conclusion: The p-value is 0.01, which is less than 0.05, so we reject the null hypothesis that the series has a unit root (is non-stationary) in favour of the alternative that it is stationary. The series can therefore be modelled without differencing (d = 0), and the dataset is suitable for ARIMA modeling.
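The call that created the `fit` object summarised next is not shown in the report. The output format matches the forecast package, so one plausible reconstruction, offered as an assumption and not as the authors' actual code, fits the same ARIMA(1,0,0) specification with `Arima()`. The series here is simulated (`sim_temp` and `sim_fit` are hypothetical names standing in for the daily averages and the fitted model) so that the block runs on its own:

```r
library(forecast)

set.seed(1)
# Simulated stand-in: 287 daily values from an AR(1) process around 64.7
sim_temp <- 64.7 + arima.sim(model = list(ar = 0.22), n = 287, sd = 8.7)

# Fit an AR(1) with a non-zero mean, i.e. ARIMA(1,0,0)
sim_fit <- Arima(sim_temp, order = c(1, 0, 0), include.mean = TRUE)

# Coefficients, information criteria and training-set error measures
summary(sim_fit)
```

`auto.arima()` would instead choose the order automatically; which of the two produced the output in this report is not stated.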
## Series: NewYork_temp[, "Avg_Temperature"]
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## 0.2193 64.7181
## s.e. 0.0575 0.6578
##
## sigma^2 = 76.37: log likelihood = -1028.41
## AIC=2062.82 AICc=2062.91 BIC=2073.8
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.005009769 8.708393 7.078628 -1.882391 11.30654 0.7826644
## ACF1
## Training set 0.003203218
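library(forecast)
# Forecast the next 10 days
forecasted_values <- forecast(fit, h = 10)
# The last observation is 2024-10-13, so the first forecast is for 2024-10-14
start_date <- as.Date("2024-10-14")
# One calendar date per forecast step
forecasted_values$Date <-
  seq.Date(start_date, by = "day",
           length.out = length(forecasted_values$mean))
print(forecasted_values)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95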
## 288 65.98272 54.78338 77.18207 48.85480 83.11064
## 289 64.99536 53.52995 76.46077 47.46053 82.53019
## 290 64.77886 53.30081 76.25690 47.22470 82.33302
## 291 64.73139 53.25273 76.21004 47.17630 82.28647
## 292 64.72098 53.24229 76.19966 47.16585 82.27611
## 293 64.71869 53.24001 76.19738 47.16356 82.27383
## 294 64.71819 53.23951 76.19688 47.16306 82.27333
## 295 64.71808 53.23940 76.19677 47.16295 82.27322
## 296 64.71806 53.23938 76.19675 47.16293 82.27319
## 297 64.71806 53.23937 76.19674 47.16292 82.27319
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,0,0) with non-zero mean
## Q* = 3.4267, df = 9, p-value = 0.945
##
## Model df: 1. Total lags used: 10
## ME RMSE MAE MPE MAPE MASE
## Training set 0.005009769 8.708393 7.078628 -1.882391 11.30654 0.7826644
## ACF1
## Training set 0.003203218
The ACF (Autocorrelation Function) plot shows the autocorrelation of residuals across different lags. Most of the residuals’ autocorrelation values are within the confidence bounds (dashed blue lines), indicating no significant autocorrelation between the residuals at different lags.
Conclusion: There is no significant autocorrelation in the residuals, which suggests that the ARIMA(1,0,0) model has adequately captured the data’s structure.
The histogram shows the distribution of the residuals centered around zero, and the red line represents a normal distribution curve. The residuals seem to follow a normal distribution well, as the curve fits the histogram closely.
Conclusion: The residuals are approximately normally distributed, a good sign indicating that the model fits the data well.
Q* = 3.4267, df = 9, p-value = 0.945. The Ljung-Box test is used to check for autocorrelation in the residuals. Here, the p-value is greater than 0.05 (specifically 0.945), meaning there is no statistical evidence of autocorrelation in the residuals.
Conclusion: The model captures the data’s structure well, with no significant autocorrelation remaining in the residuals.
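The same kind of check can be run directly with base R's `Box.test()`. The sketch below is an illustration on simulated data, not the report's series; `sim_x`, `ar1_fit` and `lb` are hypothetical names. It uses 10 lags and subtracts 1 degree of freedom for the fitted AR coefficient, mirroring the "Model df: 1. Total lags used: 10" line in the test output shown earlier:

```r
set.seed(42)
# Simulated stand-in for the daily temperature series
sim_x <- 64.7 + arima.sim(model = list(ar = 0.22), n = 287, sd = 8.7)

# AR(1) with a mean, fitted with base R's arima()
ar1_fit <- arima(sim_x, order = c(1, 0, 0))

# Ljung-Box test on the residuals: 10 lags, 1 estimated AR parameter
lb <- Box.test(residuals(ar1_fit), lag = 10, type = "Ljung-Box", fitdf = 1)
print(lb)
```

A large p-value here means the residuals show no detectable autocorrelation up to lag 10.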
ME (Mean Error): 0.0050
The mean error is close to 0, indicating that the model does not have a significant bias.
RMSE (Root Mean Squared Error): 8.708
This metric measures the average deviation of predictions from actual values, with the model having an average error of about 8.7 units.
MAE (Mean Absolute Error): 7.079
The mean absolute error shows that the model’s average absolute prediction error is about 7.1 units.
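MPE (Mean Percentage Error): -1.88%
The mean percentage error is small and negative. Because errors are computed as actual minus fitted values, this means the fitted values are slightly higher than the actual values on average, i.e. the model tends to slightly overestimate.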
MAPE (Mean Absolute Percentage Error): 11.31%
The model’s average prediction error is approximately 11.3% when compared to actual values.
ACF1: 0.003
The autocorrelation at lag 1 is almost zero, confirming no significant autocorrelation in the residuals at this lag.
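For reference, the measures above follow the standard definitions, with the error taken as actual minus fitted value (the convention used by `accuracy()` in the forecast package). A minimal sketch on two made-up observations, where `actual`, `fitted_vals` and `metrics` are hypothetical names and MASE and ACF1 are left out for brevity:

```r
# Hypothetical actual and fitted values, for illustration only
actual <- c(10, 20)
fitted_vals <- c(12, 19)

err <- actual - fitted_vals      # error = actual minus fitted
pct_err <- 100 * err / actual    # percentage error

metrics <- c(
  ME   = mean(err),              # mean error (bias)
  RMSE = sqrt(mean(err^2)),      # root mean squared error
  MAE  = mean(abs(err)),         # mean absolute error
  MPE  = mean(pct_err),          # mean percentage error
  MAPE = mean(abs(pct_err))      # mean absolute percentage error
)
print(round(metrics, 3))
```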
Overall Conclusion: The ARIMA(1,0,0) model performs well. The residuals show no significant autocorrelation and are normally distributed. Both the Ljung-Box test and the performance metrics support the conclusion that the model is appropriate for the data.
Prediction accuracy: The training-set RMSE (8.71) and MAE (7.08) are reasonably low. The MAPE of about 11.3% is acceptable but indicates some room for improvement in forecasting accuracy.
Prediction output
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 288 65.98272 54.78338 77.18207 48.85480 83.11064
## 289 64.99536 53.52995 76.46077 47.46053 82.53019
## 290 64.77886 53.30081 76.25690 47.22470 82.33302
## 291 64.73139 53.25273 76.21004 47.17630 82.28647
## 292 64.72098 53.24229 76.19966 47.16585 82.27611
## 293 64.71869 53.24001 76.19738 47.16356 82.27383
## 294 64.71819 53.23951 76.19688 47.16306 82.27333
## 295 64.71808 53.23940 76.19677 47.16295 82.27322
## 296 64.71806 53.23938 76.19675 47.16293 82.27319
## 297 64.71806 53.23937 76.19674 47.16292 82.27319
Point Forecast: The predicted values for each specific point in time.
Lo 80 and Hi 80: The lower and upper bounds of the 80% prediction interval. Under the model's assumptions, there is an 80% chance that the actual value will lie between these bounds.
Lo 95 and Hi 95: The lower and upper bounds of the 95% prediction interval. This is a wider, more conservative range, indicating a 95% chance that the actual value will fall within these bounds.
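The table can be reproduced by hand from the fitted coefficients, which also shows why the point forecasts settle at the mean of about 64.72 within a few days. For an AR(1) model with mean mu and coefficient phi, the h-step forecast is mu + phi^h * (last value - mu), and its variance is sigma^2 * (1 - phi^(2h)) / (1 - phi^2). The sketch below plugs in the rounded values printed in the model summary (ar1 = 0.2193, mean = 64.7181, sigma^2 = 76.37) and the last observation (70.48573), so it agrees with the table only up to rounding; all variable names are introduced here for illustration:

```r
phi      <- 0.2193     # ar1 coefficient from the model summary
mu       <- 64.7181    # estimated mean from the model summary
sigma2   <- 76.37      # innovation variance from the model summary
last_obs <- 70.48573   # average temperature on 2024-10-13

h <- 1:10

# Point forecast: the deviation from the mean shrinks by a factor phi each day
point <- mu + phi^h * (last_obs - mu)

# Forecast standard error for an AR(1) model
se <- sqrt(sigma2 * (1 - phi^(2 * h)) / (1 - phi^2))

# Interval bounds under a normal approximation
ar1_table <- data.frame(
  h     = h,
  point = point,
  lo80  = point - qnorm(0.90)  * se,
  hi80  = point + qnorm(0.90)  * se,
  lo95  = point - qnorm(0.975) * se,
  hi95  = point + qnorm(0.975) * se
)
print(round(ar1_table, 2))
```

Because phi is small (about 0.22), phi^h is close to zero after three or four steps, so both the point forecast and the interval width stop changing.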