Hello, in this project by Shauna Le, Aaron Chien, Yoojin Min, and Doris Wu, we will be exploring the Climate Change in Delhi. We would be trying to model the temperature in Delhi and trying to predict climate change. Weather data was collected in the city of Delhi from the period of 4 years (from 2013 to 2017). Our data can be found here: Daily Climate time series data. Our variables are \(date\) in the MM/DD/YYYY format, \(meantemp\) for the average temperature of that day from three hour intervals in Celsius, \(humidity\) is the humidity levels (units are grams of water vapor per cubic meter volume of air), \(wind\_speed\) is the wind speed measured in kmph, and \(meanpressure\) is the average pressure in atm. Here is a glance at our data:
climate_change <- read.csv("DailyDelhiClimateTrain.csv")
climate_change <- as.data.frame(ts(climate_change))
head(climate_change)
## date meantemp humidity wind_speed meanpressure
## 1 1 10.000000 84.50000 0.000000 1015.667
## 2 2 7.400000 92.00000 2.980000 1017.800
## 3 3 7.166667 87.00000 4.633333 1018.667
## 4 4 8.666667 71.33333 1.233333 1017.167
## 5 5 6.000000 86.83333 3.700000 1016.500
## 6 6 7.000000 82.80000 1.480000 1018.000
# seeing if there are any NA values
lapply(climate_change, function(x) {any(is.na(x))})
## $date
## [1] FALSE
##
## $meantemp
## [1] FALSE
##
## $humidity
## [1] FALSE
##
## $wind_speed
## [1] FALSE
##
## $meanpressure
## [1] FALSE
FALSE means that there are no NA values, since we used the function
any(is.na(x)) in which it is looking for any values that
are NA and outputs TRUE if any values are NA and FALSE otherwise, since
all the outputs are FALSE, we can assume they have no NA values.
par(mfrow = c(2, 2))
hist_boxplot(climate_change$meantemp, freq = FALSE, main = "Histogram of Mean Temperature",
xlab = "Mean Temperature")
lines(density(climate_change$meantemp))
hist_boxplot(climate_change$humidity, freq = FALSE, main = "Histogram of Humidity Levels",
xlab = "Humidity Levels")
lines(density(climate_change$humidity))
hist_boxplot(climate_change$wind_speed, freq = FALSE, main = "Histogram of Wind Speed",
xlab = "Wind Speed")
lines(density(climate_change$wind_speed))
hist_boxplot(climate_change$meanpressure, freq = FALSE, main = "Histogram of Mean Pressure",
xlab = "Mean Pressure")
lines(density(climate_change$meanpressure))
Here, we display a histogram and box plot for all of the variables as well as their fitted distribution line, we see that the mean temperature is very widely spread and peaking to the right, making it left skewed. The boxplot confirms the skewness of the data as the mean is toward the right and there are not outliers present. The humidity levels is somewhat centered but it does indeed have very fat tails. Looking at its boxplot, it is slightly skewed to the left and has some outliers. The wind speed is heavily skewed to the right and the boxplot shows many outliers in the data, this is problematic as this can interfere with the forecast of our data. Mean pressure is the most interesting as it is the most skewed and contains many outliers, this will be problematic in our forecast so we should remove outliers in the data
meantemp_bp <- climate_change %>%
ggplot(mapping = aes(x = meantemp)) +
geom_boxplot() +
ggtitle("Boxplot of Mean Temperature") +
labs(x = "Mean Temperature")
humidity_bp <- climate_change %>%
ggplot(mapping = aes(x = humidity)) +
geom_boxplot() +
ggtitle("Boxplot of Humidity Levels") +
labs(x = "Humidity Levels")
windspeed_bp <- climate_change %>%
ggplot(mapping = aes(x = wind_speed)) +
geom_boxplot() +
ggtitle("Boxplot of Wind Speed") +
labs(x = "Wind Speed")
meanpressure_bp <- climate_change %>%
ggplot(mapping = aes(x = meanpressure)) +
geom_boxplot() +
ggtitle("Boxplot of Mean Pressure") +
labs(x = "Mean Pressure")
plot_grid(meantemp_bp, humidity_bp, windspeed_bp, meanpressure_bp, ncol = 2, nrow = 2)
Here are the boxplots to give a closer look at the outliers, we can see the boxplot of mean pressure has many outliers and is heavily clustered, which is extremely alarming and using this data will cause our forecast to be wrong, so we should remove the outliers.
climate_change_f <- climate_change[
-c(which(climate_change$meanpressure %in% boxplot.stats(climate_change$meanpressure)$out),
which(climate_change$wind_speed %in% boxplot.stats(climate_change$wind_speed)$out),
which(climate_change$humidity %in% boxplot.stats(climate_change$humidity)$out)), ]
nrow(climate_change_f)
## [1] 1422
After removing the outliers, we still have a large dataset to work with with \(1,422\) observations, much larger than \(30\), so we can safely move on.
temp <- climate_change_f %>%
ggplot(mapping = aes(x = meantemp)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Mean Temperature") +
labs(x = "Mean Temperature") +
theme(plot.title = element_text(size = 9))
humid <- climate_change_f %>%
ggplot(mapping = aes(x = humidity)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Humidity Levels") +
labs(x = "Humidity Levels") +
theme(plot.title = element_text(size = 9))
wind <- climate_change_f %>%
ggplot(mapping = aes(x = wind_speed)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Wind Speed") +
labs(x = "Wind Speed") +
theme(plot.title = element_text(size = 9))
pressure <- climate_change_f %>%
ggplot(mapping = aes(x = meanpressure)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Mean Pressure") +
labs(x = "Mean Pressure") +
theme(plot.title = element_text(size = 9))
plot_grid(temp, humid, wind, pressure, ncol = 2, nrow = 2)
Plotting the histograms for our variables after removing the outliers, we can see that they have a better spread, but skewness is still an issue and mean pressure is now bimodal, meaning that it has two peaks so there are two points where the data is clustered. This can present different problems since we want our data to follow a normal distribution (with one peak and even spread) for better forecasting and so our standard errors can be more accurate. Because of this, we decided to differentiate the data as differencing the data helped make it unimodal and follow a normal distribution.
corr_mroz <- climate_change_f[, (names(climate_change_f) %in%
c('meantemp', 'humidity', 'wind_speed', 'meanpressure'))]
res <- cor(corr_mroz)
round(res, 2)
## meantemp humidity wind_speed meanpressure
## meantemp 1.00 -0.57 0.35 -0.88
## humidity -0.57 1.00 -0.41 0.33
## wind_speed 0.35 -0.41 1.00 -0.34
## meanpressure -0.88 0.33 -0.34 1.00
corrplot(res, method = 'number')
Even though we plan to differentiate the data, we believe it’s still important to examine the correlation between the variables to see if there are any strong correlation between the variables. Examining the correlation plot between the variables, we see that \(meanpressure\) and \(meantemp\) has a correlation of \(-0.88\), indicating that they are highly correlated and much of the variation in \(meantemp\) is explained by \(meanpressure\). The correlation between \(meantemp\) and \(humidity\) is \(-0.57\) so some of the variation in \(meantemp\) is explained by \(humidity\). The correlation between \(meantemp\) and \(wind\_speed\) is only \(0.35\) so they do have a weaker correlation.
climate_change_f <-
climate_change_f %>% mutate(meantemp_df = meantemp - lag(meantemp),
humidity_df = humidity - lag(humidity),
windspeed_df = wind_speed - lag(wind_speed),
meanpressure_df = meanpressure - lag(meanpressure))
climate_change_f$meantemp_df[1] <- 0
climate_change_f$humidity_df[1] <- 0
climate_change_f$windspeed_df[1] <- 0
climate_change_f$meanpressure_df[1] <- 0
head(climate_change_f)
## date meantemp humidity wind_speed meanpressure meantemp_df humidity_df
## 1 1 10.000000 84.50000 0.000000 1015.667 0.0000000 0.000000
## 2 2 7.400000 92.00000 2.980000 1017.800 -2.6000000 7.500000
## 3 3 7.166667 87.00000 4.633333 1018.667 -0.2333333 -5.000000
## 4 4 8.666667 71.33333 1.233333 1017.167 1.5000000 -15.666667
## 5 5 6.000000 86.83333 3.700000 1016.500 -2.6666667 15.500000
## 6 6 7.000000 82.80000 1.480000 1018.000 1.0000000 -4.033333
## windspeed_df meanpressure_df
## 1 0.000000 0.0000000
## 2 2.980000 2.1333333
## 3 1.653333 0.8666667
## 4 -3.400000 -1.5000000
## 5 2.466667 -0.6666667
## 6 -2.220000 1.5000000
Now, we have 4 new variables where \(meantemp\_df\) is the \(meantemp\) variable differentiated once on its lag, \(humidity\_df\) is the \(humidity\) variable differentiated on its lag, \(windspeed\_df\) is the \(wind\_speed\) variable differentiated on its lag, and \(meanpressure\_df\) is the \(meanpressure\) differentiated on its lag.
temp_df <- climate_change_f %>%
ggplot(mapping = aes(x = meantemp_df)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Mean Temperature Diff") +
labs(x = "Mean Temperature Diff") +
theme(plot.title = element_text(size = 9))
humid_df <- climate_change_f %>%
ggplot(mapping = aes(x = humidity_df)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Humidity Levels Diff") +
labs(x = "Humidity Levels Diff") +
theme(plot.title = element_text(size = 9))
wind_df <- climate_change_f %>%
ggplot(mapping = aes(x = windspeed_df)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Wind Speed Diff") +
labs(x = "Wind Speed Diff") +
theme(plot.title = element_text(size = 9))
pressure_df <- climate_change_f %>%
ggplot(mapping = aes(x = meanpressure_df)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_density(alpha= .2, fill="#FF6666") +
ggtitle("Histogram and Distribution of Mean Pressure Diff") +
labs(x = "Mean Pressure Diff") +
theme(plot.title = element_text(size = 9))
plot_grid(temp_df, humid_df, wind_df, pressure_df, ncol = 2, nrow = 2)
These are the histograms of our dataset after we took out the outliers and differentiated them as indicated by “Diff” in the labels:
corr_mroz_diff <- climate_change_f[, (names(climate_change_f) %in%
c('meantemp_df', 'humidity_df',
'windspeed_df', 'meanpressure_df'))]
res_diff <- cor(corr_mroz_diff)
round(res_diff, 2)
## meantemp_df humidity_df windspeed_df meanpressure_df
## meantemp_df 1.00 -0.66 0.06 -0.32
## humidity_df -0.66 1.00 -0.23 0.16
## windspeed_df 0.06 -0.23 1.00 -0.03
## meanpressure_df -0.32 0.16 -0.03 1.00
corrplot(res_diff, method = 'number')
Examining the correlation plot between the differentiated data, we see that \(humidity\_df\) and \(meantemp\_df\) has a correlation of \(-0.66\), indicating that they are highly correlated and much of the variation in \(meantemp\_df\) is explained by \(humidity\_df\). The correlation between the other variables are lighter in color, indicating a weaker correlation. Comparing this to the non-differentiated data, \(humidity\_df\) is now a stronger candidate for correlation.
lapply(climate_change_f, summary)[-1]
## $meantemp
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 18.75 27.62 25.42 31.25 38.71
##
## $humidity
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.47 50.70 62.81 61.01 72.38 100.00
##
## $wind_speed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.437 6.025 6.460 8.921 17.908
##
## $meanpressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 991.4 1001.6 1008.7 1008.3 1014.9 1023.0
##
## $meantemp_df
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10.6250 -0.8750 0.0625 0.0000 1.0000 6.6667
##
## $humidity_df
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -37.3333 -4.8089 -0.2500 0.0109 4.0870 47.2500
##
## $windspeed_df
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -16.20000 -2.31250 -0.05058 0.00000 2.31250 16.21250
##
## $meanpressure_df
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.583333 -1.119167 0.000000 0.000234 1.000000 7.750000
These are the 5-number summary data for each of the variables:
We also made the 5-number summary data for each of the variables after we differentiated them:
Before diving into our differenced data, we can examine the variables to see if un-differenced variables are stationary are not in order to see if they can be used as a proper model to forecast change in mean temperature.
For the structure of the time series, the lag order here is the
number of lags used in the regression and the code is
trunc((length(x)-1)^(1/3)) which corresponds to the
suggested upper bound on the rate at which the number of lags should be
made to grow with the sample size. It is basically just the number of
lags used to test if it’s stationary.
The observations appear to follow each other. We are using seasonal data, so the observations follow a trend to go up and down. The variance for all of these variables does not seem to be increasing; it looks pretty constant. Also, all of our graphs have seasonality. The existence of stationary for each of our variables is indicated below:
ggtsdisplay(climate_change_f$meantemp, main = "TS Display of Mean Temperature")
Looking at the ggtsdisplay of mean temperature, we can
see that the time display at the top is reverberating, but it doesn’t
seem to be reverberating enough as the mean can change if you take
different subsets of the data. The data also shows that this is cyclical
as the rise and fall trend of the data is clear and they always seem to
go in the same places. Mean temperature also seems to follow an \(AR(4)\) model as the PACF of the model
shows significance at lag \(4\). The
ACF is also declining, but it is declining very slowly, indicating that
it is non-stationary.
ggtsdisplay(climate_change_f$humidity, main = "TS Display of Humidity Levels")
For the ggtsdisplay of humidity, we see that this
follows an \(AR(4)\) model as the ACF
graph shows declining lags, and the PACF graph shows significance at lag
\(4\). The time display at the top
seems to be reverberating, and there seems to be a pattern of large and
small ups and downs. Also, the variance does not seem to be constant and
not increasing. Although it is reverberating, it is not reverberating
enough to indicate that it is stationary. Moreover, the ACF graph is
slowly declining, indicating that humidity may be non-stationary.
ggtsdisplay(climate_change_f$wind_speed, main = "TS Display of Wind Speed")
We used the ggtsdisplay method to generate ACF and PACF
plots for mean temperature, wind speed, mean pressure, and humidity. The
time series for wind speed displays seasonality as we see variations
occur at specific intervals. The oscillations seem too pronounced to
describe the data as mean-riveting. Wind speed seems to follow an \(AR(18)\) model since the lags tail off in
the ACF plot and cut off at lag \(18\).
Furthermore, the ACF plot for wind speed shows a sharp decline early on,
however, it stops decaying soon after. With this in mind, wind speed
appears to be non-stationary.
ggtsdisplay(climate_change_f$meanpressure, main = "TS Display of Mean Pressure")
Looking at the ggtsdisplay for mean pressure, we see
that this follows an \(AR(2)\) model as
the ACF graph shows declining lags and the PACF shows significance at
lag \(2\). The time display at the top
is also reverberating, and the variance does seem to be constant, but we
do think that it’s not reverberating enough to indicate that it is
stationary. However, the ACF graph is slowly declining, indicating that
mean pressure may be non-stationary.
The ACF graphs are taking a significant amount of time to decay, but they all do indeed look like AR models, moving forward, I believe using the differentiated series would be best. All variables have their ACF that is taking a long time to decay, so I believe approaching this and using the differentiated series would help immensely.
Here is the adf.test for each of the variables to test
for stationary.
adf.test(climate_change_f$meantemp)
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$meantemp
## Dickey-Fuller = -1.8886, Lag order = 11, p-value = 0.6255
## alternative hypothesis: stationary
adf.test(climate_change_f$humidity)
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$humidity
## Dickey-Fuller = -3.7309, Lag order = 11, p-value = 0.02245
## alternative hypothesis: stationary
adf.test(climate_change_f$wind_speed)
## Warning in adf.test(climate_change_f$wind_speed): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$wind_speed
## Dickey-Fuller = -6.5266, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
adf.test(climate_change_f$meanpressure)
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$meanpressure
## Dickey-Fuller = -2.1196, Lag order = 11, p-value = 0.5277
## alternative hypothesis: stationary
Since the p value for \(meantemp\), \(humidity\), and \(meanpressure\) are all above 0.05, we fail to reject the null that it is non-stationary, so there is significance that the variable is non-stationary. The variable \(wind\_speed\) is indicated to be stationary, but the ACF graph of it is slowly decaying, which is alarming to us so we decided to continue with the differentiated data.
ndiffs(climate_change_f$meantemp)
## [1] 1
ndiffs(climate_change_f$humidity)
## [1] 0
ndiffs(climate_change_f$wind_speed)
## [1] 0
ndiffs(climate_change_f$meanpressure)
## [1] 0
By using the ndiffs function, we see that \(meantemp\) should be differentiated once in
order to make it stationary but everything else is 0 which is still
alarming as the ACF of \(humidity\),
and \(meanpressure\) indicates
non-stationary and the adf.test() also indicates that they
are non-stationary, so there should be at least some differencing to
make it stationary.
ndiffs(climate_change_f$meantemp_df)
## [1] 0
ndiffs(climate_change_f$humidity_df)
## [1] 0
ndiffs(climate_change_f$windspeed_df)
## [1] 0
ndiffs(climate_change_f$meanpressure_df)
## [1] 0
If we look at the differentiated data, it indicates that no
differencing is needed, this may be because we already differentiated it
once and now they should be stationary. We can use the
adf.test to examine if they are stationary.
adf.test(climate_change_f$meantemp_df)
## Warning in adf.test(climate_change_f$meantemp_df): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$meantemp_df
## Dickey-Fuller = -13.192, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
adf.test(climate_change_f$humidity_df)
## Warning in adf.test(climate_change_f$humidity_df): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$humidity_df
## Dickey-Fuller = -15.124, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
adf.test(climate_change_f$windspeed_df)
## Warning in adf.test(climate_change_f$windspeed_df): p-value smaller than printed
## p-value
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$windspeed_df
## Dickey-Fuller = -17.038, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
adf.test(climate_change_f$meanpressure_df)
## Warning in adf.test(climate_change_f$meanpressure_df): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: climate_change_f$meanpressure_df
## Dickey-Fuller = -14.401, Lag order = 11, p-value = 0.01
## alternative hypothesis: stationary
Since now all the p values are lower than \(0.05\) now, we reject the null that it is non-stationary.
ggtsdisplay(climate_change_f$meantemp_df, main = "TS display of Mean Temp Differences")
Examining the Mean Temp after differencing shows that for the time display, there doesn’t seem to be a trend and the mean and variance are quite constant. It also follows more of a MA trend as the PACF is gradually declining and the ACF looks like an AR’s PACF model. If this was the case, this would be \(MA(19)\) as the 19th lag in the ACF model is the furthest and is also significant.
ggtsdisplay(climate_change_f$humidity_df, main = "TS display of Humidity Differences")
After differencing, the TS display of humidity differences seems to reverberate much more than before differencing, and the variance seems not to change widely than before. The graph overall does not have seasonal variations nor trends, thus indicating stationary. Also, for the ACF model after differencing, a \(MA(29)\) model appears to be a better fit for humidity than an \(AR(13)\) model since there is a drastic decline in the lags in the PACF graph while the ACF graph cuts off after lag \(29\). Also, the PACF graph is gradually declining while the ACF graph looks like an AR’s PACF model.
ggtsdisplay(climate_change_f$windspeed_df, main = "TS display of Wind Speed Differences")
Looking at the ggtsdisplay for wind speed after
first-differencing it, the TS display appears to be mean reverting as
there are no longer any seasonal variations or trends, indicating
stationary. After differencing wind speed, a MA model appears to be a
better fit for wind speed than an \(AR(18)\) model since we see a geometric
decline in the lags in the PACF plot, while the ACF plot cuts off after
lag \(18\).
ggtsdisplay(climate_change_f$meanpressure_df, main = "TS display of Mean Pressure Differences")
Examining the ggtsdisplay for the mean pressure
differences, we see that the time display is now more reverberating and
that the mean and variance do not seem to change widely. There seems to
be more trend in the time display which indicates stationary. Examining
the ACF and PACF, this mean pressure differences seems to resemble a MA
model instead as the PACF has gradually declining lags and the ACF seems
to look like the PACF of an AR model. If this was a MA model, then it
would be \(MA(2)\).
Base on the tsdisplay of the \(meantemp\_df\) variable, we shall create three different AR models of it. Base on the PACF, we saw that lags \(5\), \(9\), and \(19\) were all significant so we shall be building \(AR(5)\), \(AR(9)\), and \(AR(19)\) where the \(meantemp\_df\) is regressed against itself.
ar_5 <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) + lag(meantemp_df, 3) +
lag(meantemp_df, 4) + lag(meantemp_df, 5), data = climate_change_f)
ar_9 <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) + lag(meantemp_df, 3) +
lag(meantemp_df, 4) + lag(meantemp_df, 5) + lag(meantemp_df, 6) +
lag(meantemp_df, 7) + lag(meantemp_df, 8) + lag(meantemp_df, 9),
data = climate_change_f)
ar_19 <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) + lag(meantemp_df, 3) +
lag(meantemp_df, 4) + lag(meantemp_df, 5) + lag(meantemp_df, 6) +
lag(meantemp_df, 7) + lag(meantemp_df, 8) + lag(meantemp_df, 9) +
lag(meantemp_df, 10) + lag(meantemp_df, 11) + lag(meantemp_df, 12) +
lag(meantemp_df, 13) + lag(meantemp_df, 14) + lag(meantemp_df, 15) +
lag(meantemp_df, 16) + lag(meantemp_df, 17) + lag(meantemp_df, 18) +
lag(meantemp_df, 19), data = climate_change_f)
model_score <- data.frame("Model" = paste("ar", c(5, 9, 19), sep = ""),
"BIC" = c(BIC(ar_5), BIC(ar_9), BIC(ar_19)),
"AIC" = c(AIC(ar_5), AIC(ar_9), AIC(ar_19)))
sorted_model <- model_score[order(model_score$BIC), ]
sorted_model
## Model BIC AIC
## 2 ar9 5439.657 5381.868
## 3 ar19 5445.055 5334.882
## 1 ar5 5445.897 5409.103
The BIC for ar9, or \(AR(9)\) is the lowest and since the BIC penalizes the model more for more variables, we believe following the BIC would be better. The AIC is still a good measure of the model, but our model for \(AR(9)\) has a AIC score of \(5381.868\), which is the second smallest out of the tree, \(50\) off the smallest \(AIC\) and \(18\) away from the biggest AIC. However, the BIC for ar19, \(AR(19)\) and ar5, \(AR(5)\) is extremely close together so that is a warning that both of these models similarly fit the data.
summary(ar_9)
##
## Time series regression with "numeric" data:
## Start = 1, End = 1413
##
## Call:
## dynlm(formula = meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df,
## 2) + lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df,
## 5) + lag(meantemp_df, 6) + lag(meantemp_df, 7) + lag(meantemp_df,
## 8) + lag(meantemp_df, 9), data = climate_change_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4174 -0.9075 0.0812 1.0233 6.8010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.002859 0.043052 0.066 0.947070
## lag(meantemp_df, 1) -0.220341 0.026655 -8.266 3.17e-16 ***
## lag(meantemp_df, 2) -0.175410 0.027262 -6.434 1.70e-10 ***
## lag(meantemp_df, 3) -0.212424 0.027608 -7.694 2.66e-14 ***
## lag(meantemp_df, 4) -0.111346 0.028147 -3.956 8.01e-05 ***
## lag(meantemp_df, 5) -0.096996 0.028173 -3.443 0.000593 ***
## lag(meantemp_df, 6) -0.067893 0.028122 -2.414 0.015895 *
## lag(meantemp_df, 7) -0.064001 0.027619 -2.317 0.020631 *
## lag(meantemp_df, 8) -0.030701 0.027268 -1.126 0.260404
## lag(meantemp_df, 9) -0.056283 0.026674 -2.110 0.035036 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.618 on 1403 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.08197, Adjusted R-squared: 0.07608
## F-statistic: 13.92 on 9 and 1403 DF, p-value: < 2.2e-16
Our \(AR(9)\) model would be \(AR(9) : meantemp\_df_{t} = 0.002859 + \\ -0.220341meantemp\_df_{t-1} + -0.175410meantemp\_df_{t-2} + \\ -0.212424meantemp\_df_{t-3} + -0.111346meantemp\_df_{t-4} + \\ -0.096996meantemp\_df_{t-5} + -0.067893meantemp\_df_{t-6} + \\ -0.064001meantemp\_df_{t-7} + -0.030701meantemp\_df_{t-8} + \\ -0.056283meantemp\_df_{t-9}\)
What this model is telling us is that it is predicting the change in mean temperature at time \(t\) base on \(9\) previous lags. The immediate multiplier is \(-0.220341\) while the total effect at time \(q\) is \(\sum_{s=0}^{q}B_{s}\) where \(B\) represents the coefficients of the \(AR(9)\) model excluding the intercept, \(0.002859\). Examining the model further, the intercept has an extremely large p value that is almost close to \(1\), indicating that the intercept is very not significant. Lags \(6\), \(7\), and \(9\) also show moderate significance as it’s significant at the \(1%\) level. Lag \(8\), however, is shown not to be significant enough. The \(R^{2}\) is also very low, around \(0.08197\) so much of the variation in \(meantemp\_df\) is not explained by itself. Interpreting the coefficients, the intercept is \(0.002859\), which does not mean the \(meantemp\_df\) start at \(0.002859\), but rather it is used as a baseline for the model to predict future forecasts. We can estimate the change at certain lags using these coefficients, for example, at \(t-1\), the change in \(meantemp\_df\) is \(-0.220341\) with respect to \(meantemp\_df_{t-1}\), which means for every unit of \(meantemp\_df_{t-1}\), there is a change of \(-0.220341\) in \(meantemp\_df\).
acf(ar_9$residuals, main = "Mean Temperature AR(9) model serial correlation of errors")
Examining the residuals to see if any errors show any sign of serial correlation, we see that in the correlogram, they seem to show little significance. Lag \(19\) and \(29\) are a bit alarming as they reach over the interval where it is statistically significance that serial correlation is present, but since they only reach over it by such a little amount, we decided to ignore them.
In order to create an ARDL model, we need to choose one of the three predictor variables: \(humidity\_df\), \(windspeed\_df\), and \(meanpressure\_df\). In order to choose the best predictor variables, we shall use the Boruta algorithmn, VIF model, and Mallows’ CP to choose for feature selection.
library(Boruta)
bor_res <- Boruta(meantemp_df ~., data = climate_change_f[, 6:9], doTrace = 1)
## After 9 iterations, +13 secs:
## confirmed 3 attributes: humidity_df, meanpressure_df, windspeed_df;
## no more attributes left.
attStats(bor_res)
## meanImp medianImp minImp maxImp normHits decision
## humidity_df 103.876673 102.892632 98.798947 114.02582 1 Confirmed
## windspeed_df 5.023095 4.958657 4.139015 5.86304 1 Confirmed
## meanpressure_df 24.021673 24.197846 19.249203 26.95287 1 Confirmed
To measure the importance of the variables we would choose for the ARDL model, we first used the boruta algorithm in which it duplicates the dataset and shuffles each column and uses individual decision trees to spit out a class prediction and the higher the votes (or score), the more important the prediction. Here, we see that \(humidity\_df\) has the highest prediction with a score of \(103\) and \(meanpressure\_df\) with a score of \(24\) and \(windspeed\_df\) at \(6\).
vif_model <- dynlm(meantemp_df ~ humidity_df + windspeed_df + meanpressure_df, data = climate_change_f[, 6:9])
vif(vif_model)
## humidity_df windspeed_df meanpressure_df
## 1.081704 1.053914 1.027539
The VIF model here estimates for collinearity between the independent and the explanatory variable and looks for multicollinearity. The rule of thumb is that as long as the score of below \(5\), the variable does not shown signs of multicollinearity and is acceptable. So since all of the potential predicator variables are very much below \(5\), they are good predictors.
ss <- regsubsets(meantemp_df ~ humidity_df + windspeed_df + meanpressure_df, method = c("exhaustive"), nbest = 3, data = climate_change_f)
subsets(ss, statistic = "cp", legend = F, main = "Mallows CP")
## Abbreviation
## humidity_df h
## windspeed_df w
## meanpressure_df m
We decided to use Mallow’s CP in order to find which predictor is best used to explain the explanatory variable. It basically creates a full model and compares it with a smaller model with a certain number of parameters and determines how much error is left unexplained by the partial model. The rule of thumb is that the smaller the score, the better the fit. Here, we see \(humidity\_df\) is yet again the best fit, so our ARDL model would be using the \(humidity\_df\) variable. We also want to point out that h-w-m, which means all three predictors, is lower than just h, so a VAR model containing all three predictors may better fit the forecast.
demo1 <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) +
lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df, 5) +
lag(meantemp_df, 6) + lag(meantemp_df, 7) + lag(meantemp_df, 8) +
lag(meantemp_df, 9) + humidity_df + lag(humidity, 1) +
lag(humidity_df, 2) + lag(humidity_df, 3) + lag(humidity_df, 4) +
lag(humidity_df, 5), data = climate_change_f)
summary(demo1)
##
## Time series regression with "numeric" data:
## Start = 1, End = 1413
##
## Call:
## dynlm(formula = meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df,
## 2) + lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df,
## 5) + lag(meantemp_df, 6) + lag(meantemp_df, 7) + lag(meantemp_df,
## 8) + lag(meantemp_df, 9) + humidity_df + lag(humidity, 1) +
## lag(humidity_df, 2) + lag(humidity_df, 3) + lag(humidity_df,
## 4) + lag(humidity_df, 5), data = climate_change_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4609 -0.6562 0.0710 0.7313 6.1530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3377376 0.1380353 2.447 0.01454 *
## lag(meantemp_df, 1) -0.0900247 0.0214245 -4.202 2.81e-05 ***
## lag(meantemp_df, 2) -0.1864019 0.0268699 -6.937 6.10e-12 ***
## lag(meantemp_df, 3) -0.1975406 0.0270822 -7.294 5.02e-13 ***
## lag(meantemp_df, 4) -0.0875840 0.0273909 -3.198 0.00142 **
## lag(meantemp_df, 5) -0.0753320 0.0273946 -2.750 0.00604 **
## lag(meantemp_df, 6) -0.0448693 0.0217711 -2.061 0.03949 *
## lag(meantemp_df, 7) -0.0303219 0.0211831 -1.431 0.15254
## lag(meantemp_df, 8) -0.0005086 0.0208239 -0.024 0.98052
## lag(meantemp_df, 9) -0.0285684 0.0203076 -1.407 0.15972
## humidity_df -0.1360437 0.0042039 -32.361 < 2e-16 ***
## lag(humidity, 1) -0.0054473 0.0022011 -2.475 0.01345 *
## lag(humidity_df, 2) -0.0280023 0.0056464 -4.959 7.94e-07 ***
## lag(humidity_df, 3) -0.0280821 0.0054373 -5.165 2.76e-07 ***
## lag(humidity_df, 4) -0.0150141 0.0053774 -2.792 0.00531 **
## lag(humidity_df, 5) -0.0028055 0.0054915 -0.511 0.60952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.224 on 1397 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.4773, Adjusted R-squared: 0.4717
## F-statistic: 85.05 on 15 and 1397 DF, p-value: < 2.2e-16
In our first model, we decided to go with an \(ARDL(9, 5)\) model since the \(AR(9)\) model was the best fit for \(meantemp\_df\) and the PACF for the \(humidity\_df\) model has lags \(5\), \(10\), and \(13\) that are significant. Observing our first model, we see that \(meantemp\_df\) at lag \(7\) is where it stops being significant and for \(humidity\_df\), it stops at lag \(5\). We should re-estimate this ARDL with \(ARDL(6, 4)\).
demo2 <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) +
lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df, 5) +
lag(meantemp_df, 6) + humidity_df + lag(humidity, 1) +
lag(humidity_df, 2) + lag(humidity_df, 3) + lag(humidity_df, 4),
data = climate_change_f)
summary(demo2)
##
## Time series regression with "numeric" data:
## Start = 1, End = 1416
##
## Call:
## dynlm(formula = meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df,
## 2) + lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df,
## 5) + lag(meantemp_df, 6) + humidity_df + lag(humidity, 1) +
## lag(humidity_df, 2) + lag(humidity_df, 3) + lag(humidity_df,
## 4), data = climate_change_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5138 -0.6718 0.0627 0.7191 5.9899
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.312477 0.136046 2.297 0.02177 *
## lag(meantemp_df, 1) -0.085419 0.021272 -4.016 6.25e-05 ***
## lag(meantemp_df, 2) -0.180303 0.026556 -6.790 1.66e-11 ***
## lag(meantemp_df, 3) -0.190938 0.026762 -7.135 1.55e-12 ***
## lag(meantemp_df, 4) -0.079535 0.026873 -2.960 0.00313 **
## lag(meantemp_df, 5) -0.057730 0.020805 -2.775 0.00560 **
## lag(meantemp_df, 6) -0.034357 0.020295 -1.693 0.09071 .
## humidity_df -0.136707 0.004184 -32.671 < 2e-16 ***
## lag(humidity, 1) -0.005012 0.002167 -2.313 0.02089 *
## lag(humidity_df, 2) -0.027974 0.005513 -5.075 4.40e-07 ***
## lag(humidity_df, 3) -0.027872 0.005307 -5.252 1.73e-07 ***
## lag(humidity_df, 4) -0.014402 0.005371 -2.681 0.00742 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.226 on 1404 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.4761, Adjusted R-squared: 0.472
## F-statistic: 116 on 11 and 1404 DF, p-value: < 2.2e-16
Here, we see that the lag at \(6\) for \(humididty\_df\) is only significant at the \(10\)% level, so we re-estimate this with \(ARDL(5, 4)\).
mean_humid <- dynlm(meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df, 2) +
lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df, 5) +
humidity_df + lag(humidity, 1) + lag(humidity_df, 2) +
lag(humidity_df, 3) + lag(humidity_df, 4), data = climate_change_f)
summary(mean_humid)
##
## Time series regression with "numeric" data:
## Start = 1, End = 1417
##
## Call:
## dynlm(formula = meantemp_df ~ lag(meantemp_df, 1) + lag(meantemp_df,
## 2) + lag(meantemp_df, 3) + lag(meantemp_df, 4) + lag(meantemp_df,
## 5) + humidity_df + lag(humidity, 1) + lag(humidity_df, 2) +
## lag(humidity_df, 3) + lag(humidity_df, 4), data = climate_change_f)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5300 -0.6660 0.0768 0.7261 5.9757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.286003 0.135080 2.117 0.034411 *
## lag(meantemp_df, 1) -0.081782 0.021164 -3.864 0.000116 ***
## lag(meantemp_df, 2) -0.178333 0.026534 -6.721 2.62e-11 ***
## lag(meantemp_df, 3) -0.183980 0.026453 -6.955 5.38e-12 ***
## lag(meantemp_df, 4) -0.074193 0.026681 -2.781 0.005497 **
## lag(meantemp_df, 5) -0.049862 0.020286 -2.458 0.014094 *
## humidity_df -0.136723 0.004186 -32.665 < 2e-16 ***
## lag(humidity, 1) -0.004584 0.002151 -2.131 0.033246 *
## lag(humidity_df, 2) -0.028610 0.005493 -5.209 2.19e-07 ***
## lag(humidity_df, 3) -0.028125 0.005304 -5.302 1.33e-07 ***
## lag(humidity_df, 4) -0.014755 0.005368 -2.749 0.006058 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.227 on 1406 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.4752, Adjusted R-squared: 0.4715
## F-statistic: 127.3 on 10 and 1406 DF, p-value: < 2.2e-16
We see that all variables are relatively significant, so \(ARDL(5, 4)\) is our best fit.
Our \(ARDL(5, 4)\) model is: \(meantemp\_df_{t} = 0.286003 + \\ -0.081782meantemp\_df_{t-1} + -0.178333meantemp\_df_{t-2} + \\ -0.183980meantemp\_df_{t-3} + -0.074193meantemp\_df_{t-4} + \\ -0.049862meantemp\_df_{t-5} + -0.136723humidity\_df_{t} + \\ -0.004584humidity\_df_{t-1} + -0.028610humidity\_df_{t-2} + \\ -0.028125humidity\_df_{t-3} + -0.014755humidity\_df_{t-4}\)
which shows that the change in mean temperature is a model of past values of \(meantemp\_df\) and \(humidity\_df\).
The impact multiplier, \(B_{0}\) would be \(-0.136723\) and the first delay multiplier, \(B_{1}\) would be \(-0.00458 + -0.136723 \times -0.081782\) and each multiplier after would be \(B_{j} = B_{j-1} \times -0.081782\). The \(R^{2}\) is also \(0.4752\), which isn’t very high but it does hold more significant coefficients so we shall continue with this model. Interpreting the coefficients, the intercept is \(0.286003\) which does not mean that the \(meantemp\_df\) starts at \(0.286003\), it’s just the way the equation is forecasted. We see that for the coefficients, most of them are negative implying a negative relationship and we can also estimate the change at certain lags. For example, at lag \(t-1\), the change in \(meantemp\_df\) would be \(-0.081782\) for every unit of \(meantemp\_df_{t-1}\) and \(-0.004584\) with for every unit of \(humidity\_df_{t-1}\).
acf(mean_humid$residuals, main = "ACF of the residuals of ARDL(5, 4)")
We see that there are barely any significant correlations in the errors for our model and although lag 11 does show that there is correlation there, we decided to ignore it since it is at a long lag and is barely passes the test of serial error correlation.
AIC(ar_9)
## [1] 5381.868
BIC(ar_9)
## [1] 5439.657
plot(fitted(ar_9), resid(ar_9), main = "Residual Plot for AR(9) Mean Temp Difference",
xlab = "Mean Temp Diff Fitted Values", ylab = "Residuals ")
abline(a = 0, b = 0, col = "red")
Examining the residual plot for the \(AR(9)\) model, we see there there is no pattern, but there is a lot of noise here, which is a good sign it fits relatively well. The residuals are not increasing, meaning that there is no heteroskedasticity present.
AIC(mean_humid)
## [1] 4613.535
BIC(mean_humid)
## [1] 4676.61
plot(fitted(mean_humid), resid(mean_humid), main = "Residual Plot for ARDL(5, 4) Mean Temp Diff and Humidity Level Diff", xlab = "fitted values", ylab = "residuals")
abline(a = 0, b = 0, col = "red")
Examining the residual plot for the \(ARDL(5, 4)\) model, we also see that there there is no pattern and there is a lot of white noise, which is a good sign it fits relatively well. There are also no patterns in the residuals as well, meaning that there is no heteroskedasticity present.
Comparing our AIC and BIC models, we see that the AIC score for our \(AR(9)\) \(meantemp\_df\) is 5381.868414 and the BIC is 5439.6565882. As for our \(ARDL(5, 4)\) with \(meantemp\_df\) and \(humidity\_df\), the AIC is 4613.5349275 and the BIC is 4676.6104944. Since both AIC and BIC score for \(ARDL(5, 4)\) is lower than the \(AR(9)\) model, we can say that the \(ARDL(5, 4)\) is a better fit.
test <- read.csv("DailyDelhiClimateTest.csv")
test$meantemp_df <- c(0, diff(test$meantemp, lag = 1))
test$humidity_df <- c(0, diff(test$humidity, lag = 1))
predicted_values <- function(a, b){
results <-
0.286003 + -0.081782*lag(a, 1) + -0.178333*lag(a, 2) + -0.183980*lag(a, 3) +
-0.074193*lag(a, 4) + -0.049862*lag(a, 5) + -0.136723*b + -0.004584*lag(b, 1) +
-0.028610*lag(b, 2) + -0.028125*lag(b, 3) + -0.014755*lag(b, 4)
results
}
test$model_results <- predicted_values(test$meantemp_df, test$humidity_df)
test_results <- test[, c(6, 8)][-(1:5), ]
test_results$t <- 1:nrow(test_results)
test_results %>%
ggplot() +
geom_line(aes(x = t, y = meantemp_df, color = "Real Mean Temperature Difference")) +
geom_line(aes(x = t, y = model_results, color = "Estimated Mean Temperature Difference")) +
xlab("Time") +
ylab("Mean Temperature Difference") +
scale_color_manual(name = "Mean Temperature Difference",
breaks=c("Real Mean Temperature Difference",
"Estimated Mean Temperature Difference"),
values=c("Real Mean Temperature Difference" = "red",
"Estimated Mean Temperature Difference" = "blue"))
Results of the model compared with the actual model.
In observing what we can do to improve the model, we are suggesting
that an MA model for the differentiated data would help immensely as
observing the PACF and ACF graphs in the ggtsdisplay, we
saw that it followed the pattern of a MA model. The ACF had sharp
cut-off and the PACF should have a gradual decay.
We also observed in the Mallows’ CP that h-w-m, which
stands for \(humidity\_df\), \(windspeed\_df\), and \(meanpressure\_df\), had the lowest score so
we can see that a VAR(p) model can also work and may fit the data
better.
VARselect(climate_change_f[, 6:9], lag.max = 12)
## $selection
## AIC(n) HQ(n) SC(n) FPE(n)
## 11 5 3 11
##
## $criteria
## 1 2 3 4 5 6
## AIC(n) 8.101387 7.881151 7.772935 7.741076 7.710533 7.701100
## HQ(n) 8.129223 7.931254 7.845307 7.835715 7.827441 7.840275
## SC(n) 8.175875 8.015228 7.966602 7.994332 8.023379 8.073535
## FPE(n) 3299.042523 2646.919579 2375.438684 2300.957297 2231.751721 2210.810340
## 7 8 9 10 11 12
## AIC(n) 7.690369 7.690773 7.671788 7.664307 7.652709 7.655193
## HQ(n) 7.851812 7.874485 7.877767 7.892554 7.903225 7.927977
## SC(n) 8.122394 8.182388 8.222993 8.275101 8.323093 8.385167
## FPE(n) 2187.231669 2188.140202 2147.019257 2131.054452 2106.526729 2111.818675
Using a VAR model, we see that \(VAR(11)\) has the lowest AIC score at \(7.652709\), so a \(VAR(11)\) would be best.
VAR_11 <- VAR(climate_change_f[, 6:9], p = 11)
VAR_11summ <- summary(VAR_11)
VAR_11summ$varresult$meantemp_df
##
## Call:
## lm(formula = y ~ -1 + ., data = datamat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3758 -0.8613 0.0874 1.0230 6.2086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## meantemp_df.l1 -0.1918028 0.0381586 -5.026 5.66e-07 ***
## humidity_df.l1 0.0088528 0.0077772 1.138 0.25519
## windspeed_df.l1 -0.0292010 0.0139211 -2.098 0.03612 *
## meanpressure_df.l1 -0.0254545 0.0295611 -0.861 0.38935
## meantemp_df.l2 -0.2059749 0.0384834 -5.352 1.02e-07 ***
## humidity_df.l2 -0.0046072 0.0078592 -0.586 0.55783
## windspeed_df.l2 -0.0311424 0.0160005 -1.946 0.05182 .
## meanpressure_df.l2 -0.0388271 0.0295571 -1.314 0.18919
## meantemp_df.l3 -0.2028746 0.0393374 -5.157 2.87e-07 ***
## humidity_df.l3 0.0006753 0.0080608 0.084 0.93325
## windspeed_df.l3 -0.0156044 0.0174376 -0.895 0.37101
## meanpressure_df.l3 0.0132273 0.0311600 0.424 0.67127
## meantemp_df.l4 -0.1247149 0.0399659 -3.121 0.00184 **
## humidity_df.l4 -0.0019369 0.0082850 -0.234 0.81519
## windspeed_df.l4 -0.0222179 0.0183620 -1.210 0.22649
## meanpressure_df.l4 -0.0312341 0.0313025 -0.998 0.31855
## meantemp_df.l5 -0.0723127 0.0401744 -1.800 0.07209 .
## humidity_df.l5 0.0089191 0.0083651 1.066 0.28651
## windspeed_df.l5 -0.0297808 0.0188368 -1.581 0.11411
## meanpressure_df.l5 -0.0063451 0.0313213 -0.203 0.83949
## meantemp_df.l6 -0.0746924 0.0402371 -1.856 0.06363 .
## humidity_df.l6 -0.0028000 0.0083531 -0.335 0.73752
## windspeed_df.l6 0.0108347 0.0190617 0.568 0.56986
## meanpressure_df.l6 0.0080565 0.0313501 0.257 0.79723
## meantemp_df.l7 -0.0404370 0.0401491 -1.007 0.31403
## humidity_df.l7 0.0083489 0.0083543 0.999 0.31780
## windspeed_df.l7 -0.0080107 0.0188646 -0.425 0.67117
## meanpressure_df.l7 0.0048370 0.0312518 0.155 0.87702
## meantemp_df.l8 -0.0719847 0.0399012 -1.804 0.07144 .
## humidity_df.l8 -0.0123895 0.0082200 -1.507 0.13198
## windspeed_df.l8 0.0094975 0.0183807 0.517 0.60544
## meanpressure_df.l8 0.0280097 0.0311818 0.898 0.36920
## meantemp_df.l9 -0.0625861 0.0391615 -1.598 0.11024
## humidity_df.l9 0.0017949 0.0080146 0.224 0.82283
## windspeed_df.l9 0.0077417 0.0174389 0.444 0.65716
## meanpressure_df.l9 -0.0129408 0.0309030 -0.419 0.67546
## meantemp_df.l10 -0.0423848 0.0383238 -1.106 0.26894
## humidity_df.l10 0.0001174 0.0078311 0.015 0.98804
## windspeed_df.l10 0.0029343 0.0159064 0.184 0.85367
## meanpressure_df.l10 -0.0041647 0.0294295 -0.142 0.88749
## meantemp_df.l11 0.0463401 0.0379003 1.223 0.22166
## humidity_df.l11 0.0102829 0.0076660 1.341 0.18003
## windspeed_df.l11 0.0054985 0.0137375 0.400 0.68903
## meanpressure_df.l11 0.0054660 0.0293846 0.186 0.85246
## const 0.0004090 0.0429168 0.010 0.99240
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.612 on 1366 degrees of freedom
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.07762
## F-statistic: 3.697 on 44 and 1366 DF, p-value: 1.92e-14
Looking at the \(VAR(11)\) model, we can see the summary statistics but the \(R^{2}\) is relatively low at \(0.1064\) and many of the coefficients is insignificant. We would have to do more testing and modeling in order to see which model is truly the best model.
But first, we test for reverse causality, where our explanatory variable is actually caused by our dependent variable.
grangertest(climate_change_f$humidity_df ~ climate_change_f$meantemp_df, order = 11)
## Granger causality test
##
## Model 1: climate_change_f$humidity_df ~ Lags(climate_change_f$humidity_df, 1:11) + Lags(climate_change_f$meantemp_df, 1:11)
## Model 2: climate_change_f$humidity_df ~ Lags(climate_change_f$humidity_df, 1:11)
## Res.Df Df F Pr(>F)
## 1 1388
## 2 1399 -11 1.9202 0.03305 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, we reject the null that \(meantemp\_df\) does not granger-cause \(humidity\_df\) since the p value is \(0.03305\), much lower than the significant level at \(0.05\), so it is significant that \(meantemp\_df\) explains \(humidity\_df\) so perhaps a model trying to predict \(meantemp\_df\) from \(humidity\_df\) may not be ideal, but rather \(humidity\_df\) is better predicted by \(meantemp\_df\).
grangertest(climate_change_f$meantemp_df ~ climate_change_f$humidity_df, order = 11)
## Granger causality test
##
## Model 1: climate_change_f$meantemp_df ~ Lags(climate_change_f$meantemp_df, 1:11) + Lags(climate_change_f$humidity_df, 1:11)
## Model 2: climate_change_f$meantemp_df ~ Lags(climate_change_f$meantemp_df, 1:11)
## Res.Df Df F Pr(>F)
## 1 1388
## 2 1399 -11 1.3897 0.1712
Since our p value is \(0.1712\), which is above the significance level of \(0.05\), we fail to reject the null \(humidity\_df\) does not granger-cause \(meantemp\_df\). Therefore, our model could be improved by predicting \(humidity\_df\) from \(meantemp\_df\) and use that model to reverse calculate the average temperature change.
Base on all of our resultse, we believe we can safely say that our \(ARDL(5, 4)\) regressed on the \(meantemp\_df\) lags and the \(humidity\_df\) lags may not be the best model in predicting the difference in \(meantemp\) in the next period and we may want to either consider a MA or \(VAR(11)\) model. We also need to test a reverse casuality model where \(humidity\_df\) is predicted from \(meantemp\_df\).