1. Purpose

We use ARIMA to forecast the temperature in New York City. The rest of this report has five parts: Preprocess the Data, Methodology (which explains why we chose ARIMA), Analyzing, Predicting and Evaluating, and Conclusion.

2. Preprocess The Data

The initial dataset contains readings taken at 5 points of time each day, and we need only the data for New York City.

Our goals in preprocessing are:

  • Find and handle NA/missing values in the dataset

  • Filter the data to New York City and keep only Date and Temperature. Because the initial data were recorded at 5 points of time each day, we then average Temperature by date.

  • Use only data recorded before 14/10/2024

  • Create Day, Month, Year, Week columns for the dataset

Code to preprocess:

  • Load data into workspace
library(readr)
USArain <- read_csv("C:/Users/manhphi2811/Downloads/USArain.csv")
  • Check for NA values
colSums(is.na(USArain))
##          Date      Location   Temperature      Humidity     WindSpeed 
##             0             0             0             0             0 
## Precipitation    CloudCover      Pressure  RainTomorrow 
##             0             0             0             0
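The dataset has no missing values, so no handling step was needed. For reference, a minimal sketch of one way gaps could be filled if they did appear, assuming linear interpolation is acceptable for a daily series. The vector `temp` is hypothetical and is not taken from USArain.

```r
# Hypothetical daily series with two gaps (values made up for illustration)
temp <- c(60, NA, 64, 66, NA, 70)

# Locate the missing entries
missing <- is.na(temp)

# Linear interpolation between the nearest observed neighbours (base R)
filled <- approx(x = which(!missing), y = temp[!missing],
                 xout = seq_along(temp))$y
filled
```

Interpolation is only one option; dropping the affected days or carrying the last value forward may be more appropriate depending on how many values are missing.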
  • Filter to New York City data before 14/10/2024, keep Date and Temperature, then average the 5 daily readings for each date
library(dplyr)

NewYork_temp <- USArain  |>
  filter(Location == "New York" 
         & Date <"2024-10-14") |>
  select(Date, Temperature )

NewYork_temp <- NewYork_temp |>
  group_by(Date) |>
  summarise(Avg_Temperature =
              mean(Temperature))
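Before averaging, it can be worth confirming that each date really has the expected number of readings. The following is a small sketch on a hypothetical stand-in for USArain (the data frame `toy` and its values are made up), using the same dplyr verbs as above plus `n()` to count rows per date.

```r
library(dplyr)

# Hypothetical stand-in for USArain: two dates, five readings each
toy <- data.frame(
  Date = rep(as.Date(c("2024-01-01", "2024-01-02")), each = 5),
  Location = "New York",
  Temperature = c(50, 55, 60, 65, 70, 40, 45, 50, 55, 60)
)

# Count the readings per date and average them in one pass
daily <- toy |>
  filter(Location == "New York") |>
  group_by(Date) |>
  summarise(Readings = n(),
            Avg_Temperature = mean(Temperature))

# Stop early if any date does not have exactly five readings
stopifnot(all(daily$Readings == 5))
daily
```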
  • Convert into date format
time <- as.Date(NewYork_temp$Date)
  • Create distinct columns for Day, Month, Year and Week
library(lubridate)
Year <- as.numeric(format(time,'%Y'))
Month <- as.numeric(format(time,'%m'))
Day <- as.numeric(format(time,'%d'))
Week <- week(time)
  • Add Day, Month, Year, Week into NewYork_temp
NewYork_temp <-cbind(NewYork_temp, Day, Month, Year, Week)
head(NewYork_temp) 
##         Date Avg_Temperature Day Month Year Week
## 1 2024-01-01        58.82717   1     1 2024    1
## 2 2024-01-02        71.57452   2     1 2024    1
## 3 2024-01-03        82.42953   3     1 2024    1
## 4 2024-01-04        63.57733   4     1 2024    1
## 5 2024-01-05        68.95104   5     1 2024    1
## 6 2024-01-06        48.90668   6     1 2024    1
tail(NewYork_temp)
##           Date Avg_Temperature Day Month Year Week
## 282 2024-10-08        49.27937   8    10 2024   41
## 283 2024-10-09        59.80298   9    10 2024   41
## 284 2024-10-10        46.79601  10    10 2024   41
## 285 2024-10-11        68.91663  11    10 2024   41
## 286 2024-10-12        53.77048  12    10 2024   41
## 287 2024-10-13        70.48573  13    10 2024   41
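The Week column comes from lubridate's `week()`, which counts complete seven-day periods since January 1 and adds one. It is therefore not the ISO week number. The sketch below reproduces that rule in base R for a few dates from the table above; the variable names are illustrative only.

```r
# A few dates from the processed dataset
dates <- as.Date(c("2024-01-01", "2024-01-07", "2024-01-08", "2024-10-13"))

# Day of the year, 1-based (POSIXlt stores it 0-based)
day_of_year <- as.POSIXlt(dates)$yday + 1

# Complete seven-day periods since January 1, plus one
week_number <- (day_of_year - 1) %/% 7 + 1
week_number  # 1 1 2 41
```

Under this rule, week 41 covers 07/10/2024 to 13/10/2024, so the last week in the dataset is a full seven days.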

Overview and conclusion about our data

The initial dataset contains predicted Temperature values from the beginning of 2024 to the end of 2025. However, for our analysis we used only the data from before the date this report was written (14/10/2024). The dataset has no missing values. After preprocessing, we have the daily average Temperature for New York City, together with Date, Day, Month, Year and Week columns, from 01/01/2024 to 13/10/2024. The processed dataset has 287 observations.

3. Methodology

Create a new dataset based on the weekly average

weekly_avg_temp <- NewYork_temp |>
  group_by(Week)  |>
  summarise(Weekly_Temperature = mean(Avg_Temperature))
head(weekly_avg_temp)
## # A tibble: 6 × 2
##    Week Weekly_Temperature
##   <dbl>              <dbl>
## 1     1               64.3
## 2     2               62.9
## 3     3               73.9
## 4     4               63.4
## 5     5               61.8
## 6     6               65.3
tail(weekly_avg_temp)
## # A tibble: 6 × 2
##    Week Weekly_Temperature
##   <dbl>              <dbl>
## 1    36               64.7
## 2    37               68.4
## 3    38               66.6
## 4    39               68.7
## 5    40               66.2
## 6    41               57.3

Observe the pattern of the Temperature in New York

library(ggplot2)
library(lubridate)
library(dplyr)

ggplot(weekly_avg_temp, aes(x = Week, y = Weekly_Temperature)) +
  geom_line(color = "#0000AA") +
  geom_smooth(method = "lm", color = "maroon") +
  labs(title = "Temperature Over Week", x = "Week", 
       y = "Temperature") +
  xlim(0,41) + ylim(55,75)

Summary and conclusion about this pattern

  • Summary

X axis: Weeks, ranging from 1 to 41.

Y axis: Average Temperature, fluctuating between 55 and 75 degrees.

Blue Line: Represents the change in average temperature over weeks, with considerable fluctuations from week to week.

Red Line: A linear regression line indicating the predicted trend of average temperature. The line has a slight negative slope, suggesting a minor decrease in average temperature over time.

Gray Shaded Area: Shows the 95% confidence interval of the regression line. For a linear fit, this band is narrowest near the middle of the week range and widens toward both ends, where the fitted line is less certain.

  • Conclusion

Overall Trend: There is a slight downward trend in the average temperature over time, but the change is not very pronounced.

High Variability: The data shows a lot of week-to-week fluctuations, especially between weeks 5 and 25.
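The trend line and shaded band drawn by `geom_smooth(method = "lm")` can also be computed directly with `lm()` and `predict()`. The sketch below does this on a simulated weekly series, because the full weekly values are not listed in this report. The data frame `toy` and its numbers are hypothetical.

```r
# Simulated stand-in for weekly_avg_temp (not the report's data)
set.seed(1)
toy <- data.frame(Week = 1:41)
toy$Weekly_Temperature <- 66 - 0.05 * toy$Week + rnorm(41, sd = 3)

# Linear trend, the same model geom_smooth(method = "lm") fits
trend <- lm(Weekly_Temperature ~ Week, data = toy)
slope <- coef(trend)["Week"]

# 95% confidence band for the fitted line at each week
band <- predict(trend, newdata = toy, interval = "confidence", level = 0.95)
width <- band[, "upr"] - band[, "lwr"]

# The band is narrowest at the mean of Week (week 21 here)
narrowest_week <- toy$Week[which.min(width)]
narrowest_week
```

Running `summary(trend)` on the real weekly data would also give a p-value for the slope, which is a more direct check of whether the downward trend is distinguishable from noise.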

Reasons for the chosen time series model

The data shows substantial fluctuations. In this case, ARIMA (AutoRegressive Integrated Moving Average) has the potential to provide short-term forecasts that are superior to those of more theoretically satisfying regression models. Therefore, we decided to use ARIMA for time series forecasting.

4. Analyzing

Assumption Testing

library(tseries)
library(forecast)

adf_result <- adf.test(NewYork_temp[,"Avg_Temperature"])
print(adf_result)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  NewYork_temp[, "Avg_Temperature"]
## Dickey-Fuller = -5.8363, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary

Conclusion: The p-value is 0.01, which is less than 0.05, so we reject the null hypothesis that the series has a unit root (is non-stationary) in favor of the alternative that it is stationary. The series is therefore suitable for ARIMA modeling without differencing.
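To see how the ADF test separates the two cases, the sketch below applies `adf.test()` to two simulated series of the same length as ours: a stationary AR(1) series and a random walk. Both series are hypothetical, and the exact p-values depend on the seed.

```r
library(tseries)

set.seed(42)
n <- 287

# Stationary AR(1) series around a mean of 65 (simulated)
stationary <- as.numeric(arima.sim(model = list(ar = 0.2), n = n)) + 65

# Random walk: non-stationary by construction (simulated)
random_walk <- cumsum(rnorm(n))

# adf.test() warns when the p-value falls outside its table range,
# so the warnings are suppressed here
adf_stationary <- suppressWarnings(adf.test(stationary))
adf_walk <- suppressWarnings(adf.test(random_walk))

c(stationary = adf_stationary$p.value, random_walk = adf_walk$p.value)
```

A stationary series like the first one typically gives a small p-value, as our temperature series does, while a random walk typically does not.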

Forecasting model

library(forecast)

fit <- auto.arima(NewYork_temp[,"Avg_Temperature"])
summary(fit)
## Series: NewYork_temp[, "Avg_Temperature"] 
## ARIMA(1,0,0) with non-zero mean 
## 
## Coefficients:
##          ar1     mean
##       0.2193  64.7181
## s.e.  0.0575   0.6578
## 
## sigma^2 = 76.37:  log likelihood = -1028.41
## AIC=2062.82   AICc=2062.91   BIC=2073.8
## 
## Training set error measures:
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.005009769 8.708393 7.078628 -1.882391 11.30654 0.7826644
##                     ACF1
## Training set 0.003203218
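The information criteria in the summary can be reproduced from the reported log likelihood. This sketch uses the standard AIC, AICc and BIC formulas and assumes a parameter count of k = 3 (ar1, the mean, and the innovation variance) with n = 287 observations.

```r
# Values copied from the model summary above
loglik <- -1028.41
n <- 287

# Assumed parameter count: ar1, mean, and sigma^2
k <- 3

aic  <- -2 * loglik + 2 * k
aicc <- aic + 2 * k * (k + 1) / (n - k - 1)
bic  <- -2 * loglik + k * log(n)

round(c(AIC = aic, AICc = aicc, BIC = bic), 2)
```

These values agree with the summary to two decimal places, which confirms how the three criteria penalize the same fit differently.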

5. Predicting and Evaluating

Predicting

library(forecast)
forecasted_values <- forecast(fit, h = 10)
# Forecasts start the day after the last observation (2024-10-13)
start_date <- as.Date("2024-10-14")
forecasted_values$Date <- 
  seq.Date(start_date, by = "day",
           along.with = forecasted_values$mean)

print(forecasted_values)
##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 288       65.98272 54.78338 77.18207 48.85480 83.11064
## 289       64.99536 53.52995 76.46077 47.46053 82.53019
## 290       64.77886 53.30081 76.25690 47.22470 82.33302
## 291       64.73139 53.25273 76.21004 47.17630 82.28647
## 292       64.72098 53.24229 76.19966 47.16585 82.27611
## 293       64.71869 53.24001 76.19738 47.16356 82.27383
## 294       64.71819 53.23951 76.19688 47.16306 82.27333
## 295       64.71808 53.23940 76.19677 47.16295 82.27322
## 296       64.71806 53.23938 76.19675 47.16293 82.27319
## 297       64.71806 53.23937 76.19674 47.16292 82.27319
plot(forecasted_values, col= "maroon")
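The point forecasts flatten out after a few steps because an AR(1) forecast decays geometrically toward the mean: the h-step forecast is mean + ar1^h × (last value − mean). The sketch below applies this with the rounded coefficients from the summary and the last observed value (70.48573 on 2024-10-13), so it matches the printed forecasts only up to rounding.

```r
# Rounded coefficients from the ARIMA(1,0,0) summary
phi <- 0.2193
mu  <- 64.7181

# Last observed daily average (2024-10-13)
y_last <- 70.48573

# h-step-ahead point forecasts for h = 1..10
h <- 1:10
point_forecast <- mu + phi^h * (y_last - mu)
round(point_forecast, 2)
```

With ar1 at about 0.22, the influence of the last observation falls below 0.1 degrees by the third step, so the model has little to say beyond the next day or two.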

Evaluating

library(forecast)

checkresiduals(fit, color ="maroon")

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,0,0) with non-zero mean
## Q* = 3.4267, df = 9, p-value = 0.945
## 
## Model df: 1.   Total lags used: 10
accuracy(fit)
##                       ME     RMSE      MAE       MPE     MAPE      MASE
## Training set 0.005009769 8.708393 7.078628 -1.882391 11.30654 0.7826644
##                     ACF1
## Training set 0.003203218

6. Conclusion

1. ACF Plot of Residuals:

The ACF (Autocorrelation Function) plot shows the autocorrelation of residuals across different lags. Most of the residuals’ autocorrelation values are within the confidence bounds (dashed blue lines), indicating no significant autocorrelation between the residuals at different lags.

Conclusion: There is no significant autocorrelation in the residuals, which suggests that the ARIMA(1,0,0) model has adequately captured the data’s structure.

2. Residuals Distribution Plot:

The histogram shows the distribution of the residuals centered around zero, and the red line represents a normal distribution curve. The residuals seem to follow a normal distribution well, as the curve fits the histogram closely.

Conclusion: The residuals are approximately normally distributed, a good sign indicating that the model fits the data well.

3. Ljung-Box Test:

Q* = 3.4267, df = 9, p-value = 0.945. The Ljung-Box test is used to check for autocorrelation in the residuals. Here, the p-value (0.945) is well above 0.05, meaning there is no statistical evidence of autocorrelation in the residuals.

Conclusion: The model captures the data’s structure well, with no significant autocorrelation remaining in the residuals.
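The test reported by `checkresiduals()` can be approximated with base R's `Box.test()`. The sketch below runs it on a simulated AR(1) fit (not our data) with 10 lags and one fitted AR coefficient, giving the same 9 degrees of freedom as the output above, and then recomputes the p-value from the reported Q* statistic.

```r
# Simulated AR(1) series and fit (stand-in for the temperature model)
set.seed(123)
y <- arima.sim(model = list(ar = 0.2), n = 287) + 65
toy_fit <- arima(y, order = c(1, 0, 0))

# Ljung-Box test on the residuals: 10 lags, 1 fitted AR coefficient
lb <- Box.test(residuals(toy_fit), lag = 10,
               type = "Ljung-Box", fitdf = 1)
lb$parameter  # degrees of freedom: 10 - 1 = 9

# p-value implied by the reported statistic Q* = 3.4267 with df = 9
p_reported <- pchisq(3.4267, df = 9, lower.tail = FALSE)
round(p_reported, 3)
```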

4. Performance Metrics (accuracy):

ME (Mean Error): 0.0050

The mean error is close to 0, indicating that the model does not have a significant bias.

RMSE (Root Mean Squared Error): 8.708

RMSE is the square root of the mean squared error, so it weights large errors more heavily than MAE does. The model’s typical error by this measure is about 8.7 units.

MAE (Mean Absolute Error): 7.078

The mean absolute error shows that the model’s average absolute prediction error is about 7 units.

MPE (Mean Percentage Error): -1.88

Errors are computed as actual minus fitted, so the small negative percentage error means the model tends to slightly overestimate the actual values in relative terms.

MAPE (Mean Absolute Percentage Error): 11.31%

The model’s average prediction error is approximately 11.3% when compared to actual values.

ACF1: 0.003

The autocorrelation at lag 1 is almost zero, confirming no significant autocorrelation in the residuals at this lag.
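For reference, the error measures above can be computed by hand. The sketch below uses four made-up actual and fitted values and defines the error as actual minus fitted, which is the convention `accuracy()` uses.

```r
# Hypothetical actual and fitted values (for illustration only)
actual <- c(60, 70, 50, 80)
fitted_values <- c(62, 66, 55, 79)

# Error = actual - fitted; negative means the fit was too high
err <- actual - fitted_values
pct_err <- 100 * err / actual

metrics <- c(
  ME   = mean(err),
  RMSE = sqrt(mean(err^2)),
  MAE  = mean(abs(err)),
  MPE  = mean(pct_err),
  MAPE = mean(abs(pct_err))
)
round(metrics, 3)
```

MASE is omitted here because it also needs the in-sample errors of a naive forecast as a scaling factor.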

Overall Conclusion: The ARIMA(1,0,0) model performs well. The residuals show no significant autocorrelation and are normally distributed. Both the Ljung-Box test and the performance metrics support the conclusion that the model is appropriate for the data.

Prediction accuracy: RMSE and MAE are reasonably low, although the MAPE suggests an average prediction error of about 11.3%, which is acceptable but indicates some room for improvement in forecasting accuracy.

Prediction output

##     Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 288       65.98272 54.78338 77.18207 48.85480 83.11064
## 289       64.99536 53.52995 76.46077 47.46053 82.53019
## 290       64.77886 53.30081 76.25690 47.22470 82.33302
## 291       64.73139 53.25273 76.21004 47.17630 82.28647
## 292       64.72098 53.24229 76.19966 47.16585 82.27611
## 293       64.71869 53.24001 76.19738 47.16356 82.27383
## 294       64.71819 53.23951 76.19688 47.16306 82.27333
## 295       64.71808 53.23940 76.19677 47.16295 82.27322
## 296       64.71806 53.23938 76.19675 47.16293 82.27319
## 297       64.71806 53.23937 76.19674 47.16292 82.27319
  • Point Forecast: The predicted values for each specific point in time.

  • Lo 80 and Hi 80: The lower and upper bounds of the 80% prediction interval. Under the model’s assumptions, there is an 80% chance that the actual value will lie between these bounds.

  • Lo 95 and Hi 95: The lower and upper bounds of the 95% prediction interval. This interval is wider: under the same assumptions, there is a 95% chance that the actual value will fall within these bounds.
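The interval widths in the table follow from the fitted model. For an AR(1) process, the h-step forecast variance is sigma^2 × (1 + ar1^2 + ... + ar1^(2(h−1))), and the bounds are the point forecast plus or minus a normal quantile times the square root of that variance. The sketch below uses the rounded ar1 and sigma^2 from the summary, so it matches the table only up to rounding.

```r
# Rounded estimates from the ARIMA(1,0,0) summary
phi <- 0.2193
sigma2 <- 76.37

# Forecast variance for horizons 1..10
h <- 1:10
fc_var <- sigma2 * cumsum(phi^(2 * (h - 1)))

# Half-widths of the 80% and 95% prediction intervals
half80 <- qnorm(0.90) * sqrt(fc_var)
half95 <- qnorm(0.975) * sqrt(fc_var)

round(data.frame(h, half80, half95), 2)
```

The half-widths stabilize after about three steps at roughly 11.5 and 17.6 degrees, which is why the bounds in the table stop changing once the point forecast reaches the mean.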