Australia in recent months has seen a great increase in rainfall, saucing widespread flooding across Northern NSW. This study investigates potential effects causing high probabilities of rainfall and differing seasons and weather patterns. Variables included in this study are rain (mm), wind gust direction, average temperature (), average relative humidity (%), wind gust speed (km/h), average mean sea level pressure (hpa), yearly seasons and the weather pattern La Nina. Through the study, a model was created to produced a probability of rain through a binary outcome, and investigate correlations to the outcome from the variables mentioned. The analysis indicated relationships to the response from predictors, covariate relationships between predictors, and interaction effects from seasons and Patterns. Noteworthy interactions include seasons and average mean sea level pressure and wind gust speed and average mean sea level pressure. The logistic model with wind gust direction as an explanatory variable was compared to that without it, using measures like AIC deviance, accuracy and AUC to determine model performance and which would be used. Both models misclassified todays (Monday, 16th of May) rainfall, but the modelwith wind gust direction has better performance. This study also indicates high potential for overfitting through analysis of interaction plot, and stepAIC including non-significant variables.
Due to the high amounts of flooding in Northern NSW this year, rainfall prediction is essential for emergency services, industries such as agriculture and construction and also day-to-day living. Precipitation, is the process of condensation where air is cooled at higher levels of the atmosphere at forms clouds. Starting by evaporation of warm waters off the ocean, it is transferred by the difference in air pressure to the mainland of Australia (Barry, 2003). La Nina was declared from the Australian Bureau of Meteorology as as of the end of November 2021. This weather pattern brings deep, cooler ocean water to the surface from South America, by trade-winds. Causing rapid heating of ocean temperatures off the equatorial plane and climate drivers respond by above average rainfall and extreme weather risks such as flooding and East Coast Low events (Climate Council). This study measures and quantifies the effects of climatic variables of temperate, relative humidity, wind gust speeds and directions, and mean sea level pressures to predict a likelihood of rainfall. Explaining the complexities involved in weather and climatic changes from seasons and weather patterns, and challenges faced when modeling using statistical learning. Once the data is collected, the analysis will be performed using a generalized linear model with a binomial response variable, and logit link function. Then, testing the model with weather forecasts from WeatherZone of Monday the 16th of May, 2022. Due to the high covariate relationships of climatic variables, the results could vary, but complexity in the model will be visualized with quantified model performance.
The aim of the study is to determine, quantify and visualize complexities in rainfall prediction using statistical learning. This allows a better understanding of the physical process driving weather from static measurements of dynamic systems.
Weather observations are recorded from the station at Williamstown RAAF base from May 1st, 2021, until the 13th of May, 2022. Using Monday the 16th of May as a test for the model and compared to a forecast from Weatherzone and a generally observation outside. The observations could be related to each due pressure systems lasting for periods of time, having influence on how other observations behave. Generally, the samples are not from a random sample of weather stations, as many further north are not in operation due to the past extreme weather events.
This study examines two different models, one including wind gust direction and one without, with investigation into overfitting and assumption failures through diagnostic plots. It also asseses models based on AIC deviance, model performance measure and observations.
To start the analysis, data collected off the BoM daily weather observations site was copy and pasted into an excel spreadsheet, where two variables containing all NA values were omitted (Cold and Solar Evaporation). The data collected contained a number of NA values for variables needed in the analysis but not too many. Roughly 7% of the 374 daily observations captured going back to May last 2021 to the 10th of May 2022. Some restrictions on the data are that La Nina is only a probability that it occurred, which was greater then 70% for Summer 2021 and 2022, and greater than 65% for Autumn 2022. So the dataset is contains a slightly bias sample of observations of a neutral weather pattern containing 3 out of 5 seasons, this could have effected interaction effects investigated as being non-significant. The wrangling process involved omitting the 26 observations from the dataset, then setting character variables types into factors. Next, wind gust directions were mapped 1 to 1 into their main direction components to reduce the amount of factors and dimesionality for the model. New variables in weather of average temperature which is an average of all the temperature variables collected in the dataset, the same with relative humidity, and mean sea level pressure. The binary response variable is classified as a 1 if the rainfall for the day was equal or above 0.1mm, and the half the observations 50.084% recorded rainfall on that day.
The average rainfall throughout the data is 3.754mm and median of 0.200mm indicating some extreme right skewness of low or no rainfall in the data. Wind gust direction, Wind gust speed with a mean of 42.03 and median of 39.0. Weather pattern, La Nina with 152 total observations, and neutral with 202 observations. Years of 2021 with 231 observations and 2022 with 123 (data is more influenced by the year of 2021). Season, Autumn with 95 observations, Spring with 88, Summer with 87 and Winter with 84. The average temperature of the minimum, maximum, 9am and 3pm, with a mean of 19.02 and median of 19.36. Average relative humidity(%) calculated by the average of 9am and 3pm relative humitidities, with mean of 68.23 and median 67.75. Average mean sea level pressure (hpa), calculated the same as relative humidity, with a mean of 1016.3 and median of 10.16.3. These last three variables all have symmetrical distributions given by the close values of the mean and medians.
## `geom_smooth()` using formula 'y ~ x'
From the plot above we see higher amounts of observations above 30mm belonging to the El Nina (circle) weather pattern, with more the season spring having the most frequency. The relationship between wind gust speed and rain is hard to determine if it’s linear but a non-linear relationship seems more likely. The linear trend line shows that rainfall is more prominent as the wind speed increases, it shows that more circles belonging to Autumn (red) and and summer blue appear more frequently in the mid-levels of rainfall, Spring (green) is also highly frequent in the scatter plot with the highest observation of rainfall.
## `geom_smooth()` using formula 'y ~ x'
The linear trend in this plot is not prominent, but there is a clear distinction with the average temperatures among the seasons. Autumn (red) with El Nina (circles) and Summer (blue) with El Nina appear to be the most frequent observations of rainfall among the sample. With Autumn of El Nina having the second highest amount of rainfall from the sample.
#Rainfall and RH scatterplot
g3 + ggtitle("Rain (mm) vs Average Daily Relative Humidity (%)")
## `geom_smooth()` using formula 'y ~ x'
The trend in the plot above shows a an increase in the amount of rain as relative humidity increase to completely saturated air, although the relationship does not seem to be linear. There looks to be a higher frequency frequency of La Nina in rain in the mid sections of the plot, with many spring (green) neutral patterns (triangles) towards the saturated air end of plot.
## `geom_smooth()` using formula 'y ~ x'
This plot shows much randomness happening between the seasons and weather patterns and not much trend between the daily mean sea level pressure and rainfall.
## `geom_smooth()` using formula 'y ~ x'
This plot is interesting as it showcases one of the relationships that are shared between variables in weather, that being the decrease in temperature for an increase in mean sea level pressure. The temperature difference shown by the colors could be that summer was recorded as El Nina this year and last year which explains why it is higher then that of a neutral weather pattern.
## Warning: Ignoring unknown parameters: binwidth, bins, pad
The histogram above shows the rainfall of each season recorded in the years 2021 and 2022 with colors of El Nina. Noting a significant increase of rainfaill in the season Autumn of 2022 compared to that of 2021.
The mosaic plot above shows the rainfall of each weather pattern and wind direction. We see in the El Nina pattern that rain decreases as the winds turn from easterly to south then west, with barely any rain from the north east. For the neutral pattern this is revers with more rain coming from the west. The most significant amount of rain coming from a Southerly wind direction.
Again the largest amount of rainfall came from the south wind direction, followed by east. Autumn had the highest amount of rainfall, possibly due to the data being recorded with 2 seasons of Autumn from 2021 and 2022. There is a likely interaction from wind gust direction and seasons but the complexity of interpretation of this plot is hard to visualize it. But the key identifier is significantly larger amounts of rainfall in summer, spring and autumn with wind directions from the east, south east and south.
The response of rainfall in each observation will be of a binary outcome. That is if the rainfall for that day is greater or equal to 1mm, it is classified as a 1, and 0 for no rain. The response follows a binomial distribution and to model the binomial distribution a logistic regression will be used. The binomial distribution is a member of the exponential family, therefore a generalized linear model which contains a logistic link function. Thus, the model formula for the regression is: \[logit(\pi_i)=log(\frac{\pi_i}{1-\pi_i})=\beta_0+\beta_1X_1+\beta_2X2+...+\beta_nX_n+\epsilon\] Where
\(\pi_i\) is the probability of rain
\(\beta_0\) is the intercept
\(\beta_i\) is the coefficient for each predictor variable, \(i=1:n\)
\(X_i\) is each predictor variable
The final model will include interaction terms, depending on how or if any will occur, the model will be represented by:
\[logit(\pi_i)=log(\frac{\pi_i}{1-\pi_i})=\beta_0+\beta_1X_1+\beta_2X2+\beta_3X_1X_2+...+\beta_nX_nX_m+\epsilon\] Where n and m are the number of interactions with predictor variables.
## Hypothesis Test \[H_0: \beta_i=0\] I understand to go through each predictor variable and test hypothesis independently, but for this case the number of predictors is 15 and would be very repetitive. In substitute, the null hypothesis states all of the 15 predictor variables coefficients equal zero. Implying no relationship to rainfall.
\[H_A: \beta_i \neq0\] At least one of the 15 predictor variables coefficients does not equal zero and implies a relationship to rainfall.
##
## Call:
## glm(formula = rain.class ~ W.G.Dir + W.G.Spd + av.temp + av.MSLP +
## av.RH + Season + Pattern, family = binomial(), data = weather)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2083 -0.8535 0.2357 0.8497 2.2228
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.78832 28.88991 0.893 0.3720
## W.G.DirNorthEast -3.25693 1.29828 -2.509 0.0121 *
## W.G.DirEast -2.01241 1.18683 -1.696 0.0900 .
## W.G.DirSouthEast -1.39138 1.20480 -1.155 0.2481
## W.G.DirSouth -1.68482 1.19534 -1.409 0.1587
## W.G.DirSouthWest -1.95133 1.39383 -1.400 0.1615
## W.G.DirWest -2.35035 1.20340 -1.953 0.0508 .
## W.G.DirNorthWest -1.12711 1.25559 -0.898 0.3694
## W.G.Spd 0.02322 0.01160 2.001 0.0454 *
## av.temp -0.06418 0.06551 -0.980 0.3272
## av.MSLP -0.02819 0.02735 -1.031 0.3025
## av.RH 0.08056 0.01208 6.671 2.53e-11 ***
## SeasonSpring 1.21415 0.59591 2.037 0.0416 *
## SeasonSummer -0.48652 0.46549 -1.045 0.2959
## SeasonWinter 0.73809 0.58002 1.273 0.2032
## PatternNeutral -1.32600 0.61463 -2.157 0.0310 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 374.01 on 338 degrees of freedom
## AIC: 406.01
##
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: rain.class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 353 490.65
## W.G.Dir 7 34.852 346 455.79 1.192e-05 ***
## W.G.Spd 1 0.137 345 455.66 0.711137
## av.temp 1 2.606 344 453.05 0.106475
## av.MSLP 1 8.922 343 444.13 0.002818 **
## av.RH 1 63.424 342 380.71 1.667e-15 ***
## Season 3 1.986 339 378.72 0.575235
## Pattern 1 4.705 338 374.01 0.030068 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a 5% significance level, wind gust direction is statistically significant (p=1.192e-051), average mean sea level pressure shows significance (p=0.002818), average relative humidity strongly significant (p=2.53e-11). Neutral weather pattern is significant (p=0.030068).
Two models were created, one without a significant intercept term, without wind gust direction which is used and one without this variable. Both models have insignificant intercept terms which I’m not quite sure how to interpret or to include.
It is observed from the table above that that wind gusts from the north, decreases the log odd of rainfall by -3.25693. Wind gusts from the west decrease the log odds by -2.35035. Wind gust speed increases log odds of rainfall by 0.02322. Average relative humidity increases log odds of rainfall by 0.08056. Season of spring increase the log odds of rainfall by 1.21415, and the neutral weather pattern decreased the log odds by -1.32600. From these results the formula for the logistic model is:
\[logit(\pi_i)=25.78832-3.25693X_1+0.02322X_2+0.08056X_3+1.21415X_5-1.32600X_6+\epsilon\] Where, \[X_1=W.G.Dir; X_2=W.G.Spd; X_3=av.temp\] \[X_4=av.RH;X_5=Season; X_6=Pattern\]
The next model will investigate interaction effects using the stepAIC function. The function uses a measure of deviance, the akaike information criterion to determine significant variables and variable interactions to include in the model. By an iterative forwards approach it identifies whether the variable or interaction selected qualifies to be in the model by decreasing residual deviance and AIC deviance. If the variable adds deviance it is not selected and the algorithm proceeds to the next until no more deviance can be decreased and the model starts to overfit. There is potential for the stepAIC to overfit non-significant predictor variables within the logistic model, and there is a high chance of this occurring. Due to the colinear relationships between predictor variables in rainfall. Basically, the variables ARE related to one another as found in the exploratory analysis, almost like a chain reaction starting from sea water temperature and evaporation (Variables not measured in data). With this in mind, we will proceed to build a model and identify potential issues with colinearity and/or overfitting in the model.
Same approach from before, testing the hypothesis will not be done separately to save repetitiveness, instead test them as a collective under the same hypothesis. \[H_O: \alpha_i\beta_i = 0\] Implies no interactions from all weather variables.
\[H_A:\alpha_i\beta_i\neq0\] Implies at least one interaction effect from weather variables.
##
## Call:
## glm(formula = rain.class ~ W.G.Dir + W.G.Spd + av.temp + av.MSLP +
## av.RH + Season + Pattern + av.temp:Season + W.G.Spd:av.MSLP +
## W.G.Dir:av.temp + W.G.Dir:av.RH + av.temp:av.MSLP + av.RH:Pattern,
## family = binomial(), data = weather)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2004 -0.6997 0.0000 0.6979 2.2010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.773e+01 1.514e+02 0.315 0.752490
## W.G.DirNorthEast -5.975e+03 3.629e+05 -0.016 0.986861
## W.G.DirEast -4.234e+00 1.126e+01 -0.376 0.707031
## W.G.DirSouthEast -9.250e+00 1.148e+01 -0.806 0.420215
## W.G.DirSouth -4.394e+00 1.111e+01 -0.395 0.692484
## W.G.DirSouthWest -1.171e+02 2.731e+04 -0.004 0.996580
## W.G.DirWest -5.043e-02 1.100e+01 -0.005 0.996342
## W.G.DirNorthWest -5.707e+00 1.169e+01 -0.488 0.625259
## W.G.Spd 5.371e+00 1.799e+00 2.986 0.002825 **
## av.temp -1.522e+01 7.570e+00 -2.010 0.044419 *
## av.MSLP -5.462e-02 1.474e-01 -0.370 0.711026
## av.RH 1.427e-01 2.276e-01 0.627 0.530536
## SeasonSpring 1.313e+01 3.790e+00 3.463 0.000533 ***
## SeasonSummer 1.177e+01 4.702e+00 2.504 0.012266 *
## SeasonWinter 3.592e+00 3.935e+00 0.913 0.361380
## PatternNeutral -4.237e+00 2.385e+00 -1.777 0.075622 .
## av.temp:SeasonSpring -6.326e-01 2.049e-01 -3.087 0.002024 **
## av.temp:SeasonSummer -5.722e-01 2.164e-01 -2.644 0.008203 **
## av.temp:SeasonWinter -9.639e-02 2.463e-01 -0.391 0.695487
## W.G.Spd:av.MSLP -5.290e-03 1.775e-03 -2.979 0.002888 **
## W.G.DirNorthEast:av.temp 5.408e+01 3.600e+03 0.015 0.988014
## W.G.DirEast:av.temp 3.493e-01 1.011e+00 0.346 0.729650
## W.G.DirSouthEast:av.temp 6.511e-01 1.010e+00 0.645 0.519243
## W.G.DirSouth:av.temp 3.342e-01 1.009e+00 0.331 0.740527
## W.G.DirSouthWest:av.temp -4.893e+01 7.157e+03 -0.007 0.994545
## W.G.DirWest:av.temp 3.434e-01 1.006e+00 0.341 0.732909
## W.G.DirNorthWest:av.temp 5.485e-01 1.016e+00 0.540 0.589254
## W.G.DirNorthEast:av.RH 6.165e+01 3.688e+03 0.017 0.986663
## W.G.DirEast:av.RH -7.157e-02 2.279e-01 -0.314 0.753485
## W.G.DirSouthEast:av.RH -7.930e-02 2.294e-01 -0.346 0.729590
## W.G.DirSouth:av.RH -5.578e-02 2.278e-01 -0.245 0.806562
## W.G.DirSouthWest:av.RH 1.213e+01 1.809e+03 0.007 0.994651
## W.G.DirWest:av.RH -1.348e-01 2.275e-01 -0.593 0.553390
## W.G.DirNorthWest:av.RH -8.834e-02 2.301e-01 -0.384 0.701014
## av.temp:av.MSLP 1.499e-02 7.394e-03 2.027 0.042701 *
## av.RH:PatternNeutral 6.006e-02 3.007e-02 1.997 0.045776 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 307.61 on 318 degrees of freedom
## AIC: 379.61
##
## Number of Fisher Scoring iterations: 22
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: rain.class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 353 490.65
## W.G.Dir 7 34.852 346 455.79 1.192e-05 ***
## W.G.Spd 1 0.137 345 455.66 0.7111369
## av.temp 1 2.606 344 453.05 0.1064753
## av.MSLP 1 8.922 343 444.13 0.0028183 **
## av.RH 1 63.424 342 380.71 1.667e-15 ***
## Season 3 1.986 339 378.72 0.5752350
## Pattern 1 4.705 338 374.01 0.0300680 *
## av.temp:Season 3 17.205 335 356.81 0.0006413 ***
## W.G.Spd:av.MSLP 1 5.610 334 351.20 0.0178552 *
## W.G.Dir:av.temp 7 16.714 327 334.48 0.0193355 *
## W.G.Dir:av.RH 7 18.360 320 316.12 0.0104466 *
## av.temp:av.MSLP 1 4.330 319 311.79 0.0374395 *
## av.RH:Pattern 1 4.188 318 307.61 0.0407032 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Using a 5% significance level, the anova output above shows significant variables of wind gust direction (p=1.192e-05). Average mean sea level pressure (p=0.0028183). Average relative humidity (p=1.667e-15), and weather pattern (p=0.0006413). Significant interaction effects from average temperature and season (p=0.0006413). Wind gust speed and average mean sea level pressure (p=0178552). Wind gust direction and average temperature (p=0.0193355). Wind gust direction and average relative humidity (p=0.0104466). Average temperature and average mean sea level pressure (p=0.0374395). Average relative humidity and weather pattern (p=0.0407032). The number of significant interactions really shows the complexity of precipitation and why it is hard to predict.
The logistic regression summary above shows the the coefficients for the generalized linear model for their log odds and again no significant slope intercept found. Wind gust speed is expected to increase the log odds of rainfall by 5.371e+00. Average temperature decreases the log odds by -1.522e+01. Seasons spring and summer increase the log odds by 1.313e+01 and 1.177e+01. Interaction effects from the predictor variables to the log odds of rainfall, average temperature and season spring decrease log odds by -6.326e-01. Average temperature and season summer decrease log odds by -5.722e-01.Wind gust speed and average mean sea level pressure decrease the log odds by -5.290e-03. Average temperature and average relative humidity increase the log odds by 1.499e-02. Finally, average relative humidity and neutral weather pattern increase the log odds by 6.006e-02. The multiple logistic regression with interactions is complex with effects and hard to interpret into an equation because of the amount of factors involved. Instead the explanation of interaction effects will be used to summarize the model.
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Residuals vs fitted plot shows a large spike indicating some potential issues with the models variance or heteroscadicity within the model, this needs to be addressed. The Q-Q looks fairly symmetrical with majority of the point on the line of best fit, some dispersion happening at the tail ends of the plot, but other than that it looks fine. Scale-Location showing the same as the plot above it, indicating the same potential problem. Final plot indicates a number of residuals having leverage or influence on the data by lying outside the cooks distance. The does not pass assumptions and is questionable compared to the model without wind gust direction as seen in the appendix.
The model with wind gust direction is showing significant spikes in residual vs fitted and scale location plots, and with outliers having significant influence on the results. The highly colinear relationships between weather variables are likely to be causing this and wind gust direction to be the main culprit. Weather works in an extremely complex physical process and rainfall is one of the final outcomes, changes in initial conditions of weather that are not captured in the datasat which alter results. Rainfall levels are not recorded, so there is no distinction whether, say 1mm of rain has been spread across a whole day or across a 2 minute period, where other predictor variables measured have changed. Another questionable part of the model, is the non-significant intercept term given, perhaps you could shine some light on this issue as I have never seen this or know how to interpret this. Putting the non-significant term down to noise in the data seems like a likely causation of the problem.
Interaction plot are created by the allEffects function from the effects package as parameters in the base R library function plot. Setting multiline equal to true, for a easier interpretation and each plot a separate selection to avoid confusion.
The first interaction plot shows the average temperature and seasons effect. As the average temperature increases the log odds of rainfall increase, with Autumn and Winter increase the highest, where Spring and Summer increase at roughly the same gradient.
This interaction effect plot shows the effect of wind gust speed and average mean sea level pressure. As the average mean sea level pressure increases with an increase of wind gust speed, the log odds of rainfall is decreasing but the change of gradient slopes. The highest log odds of rainfall in this plot is when average mean sea level pressure = 995 with a peak wind speeds over 100km/h, which could be considered east coast low or tropical cyclone events.
This interaction plot is quiet complex, wind gusts from the north east across all average temperatures decrease the log odds of rainfall. Same as south west wind gust direction, excluding a temperature of 9.2 degrees celsius, where there is an increase of log odds in rainfall.
We see as similar interaction to the effects plot previously. In this plot when relative humidity rises to completely saturated air and wind gusts are from the north east, there is an increase in the log odds of rainfall, and similarly that for the south west. No other interaction effects are observed
The plot above shows the log odds of rainfall increases as average mean sea level pressure increase with the increase of average temperature. Air pressure with the highest temperature has the highest log odds and high air pressure with low temperature has the lowest log odds of rainfall.
Not sure about this interaction effect as both weather patterns seem identical, could this be from the disproportion of observations collected in sample belonging to the La Nina Pattern, after all it has only been half a year roughly.
This section uses the predict function with the logistic regression object created in the previous section, new data as the weather data set, and type of prediction the response. This gives each observation a probability of rainfall from the log odds or fitted values created from the glm function. The pred_class variable is then added to the dataframe, with the predicted values being greater then 0.5 a 1, below a 0. Predicted classifications are stored into an xtabs object as a confusion matrix to conduct performance evaluations on the model.
The performance measures for the model include accuracy which is the sum of true predictions divided by the sum of all predictions. Error, which is the accuracy subtracted from one. Sensitivity, the true positives divided by the sum of true positives and false negatives. Sensitivity, true negatives divided by the sum of true negatives and false positives. The model performed reasonably well, with 78% accuracy, 22% error, 78% sensitivity, and 77% specificity.
## pred_class1
## rain.class 0 1
## 0 134 40
## 1 37 143
\[Accuracy = \frac{TP+TN}{(TrueP+TN+FN+FP)}\] \[=\frac{134+143}{134+143+37+40}\] \[=0.78%\] \[Error= 1-Accuracy\] \[=0.22\] \[Sensitivity =\frac{TP}{(TP+FN)}\] \[=\frac{143}{(143+37)}\] \[=0.78\] \[Specificity = \frac{TN}{TN+FP}\] \[=\frac{134}{(134+40)}\] \[=0.77\]
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = weather$rain.class, predictor = rain.eff2$fitted.values, smooth = TRUE, ci = TRUE, plot = TRUE, legacy.axes = TRUE, percennt = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate", print.auc = TRUE, auc.polygon = TRUE, auc.polygon.col = "lightblue")
##
## Data: rain.eff2$fitted.values in 174 controls (weather$rain.class 0) < 180 cases (weather$rain.class 1).
## Smoothing: binormal
## Area under the curve: 0.8811
## 95% CI: 0.8426-0.9113 (2000 stratified bootstrap replicates)
The ROC curve, is the performance measure of the true positive rate over the false positive rate. As the number of true positives increases the curve goes straight up, when a false positive is detected, it brings the line down, into a curve, between te limits of 0 and 1 for both x and y axis. The performance is indicated by the integral of the curve, a high performing model will have an area under the curve close to 1 (more rectangle in shape). The AUC is described as this area of curve and, the weather model with wind gust direction performed to an AUC value of 88% with the peak of the confidence interval pushing up to 91% and on the lower end 84%. The model shows good performance throughout the tests. The AUC was performed using the weather data’s rain binomial response variable, to the stepAIC’s models fitted values, which was using the predictor variable of wind gust direction.
## Analysis of Deviance Table
##
## Model 1: rain.class ~ W.G.Spd + Season + av.temp + av.RH + av.MSLP + Season:av.temp +
## W.G.Spd:av.MSLP
## Model 2: rain.class ~ W.G.Dir + W.G.Spd + av.temp + av.MSLP + av.RH +
## Season + Pattern + av.temp:Season + W.G.Spd:av.MSLP + W.G.Dir:av.temp +
## W.G.Dir:av.RH + av.temp:av.MSLP + av.RH:Pattern
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 342 366.12
## 2 318 307.61 24 58.519 0.0001031 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The anova model produced indicates the model with wind gust direction is of better quality, with a p-value of 0.0001031 and a reduction in deviance of 58.519. There is a chance the model is overfit, with possibly included variables and interactions with no significance from the step AIC function. A solution to this problem in the future could be to cross-validate the model with training data.
#Testing tomorrows forecast
weather_test <- read_excel("weather_test.xlsx")
#Model with wind gust direction
predict(rain.eff2, newdata = weather_test, type = "response")
## 1
## 0.6409845
#Model without wind gust direction
predict(rain.eff1, newdata = weather_test, type = "response")
## 1
## 0.5869697
Using the both models to predict the probability of rain for the 16th of May 2022, we see both models predict a log odds of over 50%, thus classifying rain = 1, and a misclassification. The model without wind gust direction had a lower log odds and was closer to the actually amount of rainfall. The models created are extremely complex with colinearity and high amounts of noise in the data from categorical variables with many factors. In understanding weather prediction, it would be best to use a dynamic model, one that can change with initial conditions and predict timings of changes. Logistic regression uses static measurements or ones that don’t change, predict levels of rainfall, or timings of events, and so is probably not the best method of rainfall prediction without a supercomputer.
For interpretabilty, the model in the appendix would be better for showing interactions with fewer effects without wind gust direction, the complexity of the model is reduced and makes it easier to understand. The stepAIC function could have induced some issues with model overfitting The model above has better predictive performance given by the performance tests above, so with more data and optimization techniques to reduce complexity like k-fold cross validation or bootstrapping, we could see the model’s accuracy increase.
Throughout this study a number of complexities have arisen. The effects of average temperature and average relative humidity show the most significant effects to seasons compared to other variables. Wind gust direction has high covariate relationships to many other predictor variables that could be influencing results and causing issues with model assumptions in diagnostic plots. The significant interaction between mean sea level pressure and wind gust speed increasing indicated an effect predicting extreme rain events such as east coast lows. The model without wind gust direction had a higher AIC deviance, and fewer interactions capturing the covariate relationships, but almost perfect diagnostic plots. The model with wind gust direction, had lowed AIC deviance, and higher model performance given in the tests above and compared to that in the appendix.
Overall, it is observed that seasons do have a significant interaction to the likelihood of rainfall, with interactions to temperature (expected). With mean sea level pressure also having strong interactions to temperature, which effect the likelihood of rainfall. Basically, both models misclassified the rainfall today, so I deduct this test as a failure, potentially from not collecting enough data, disproportion of seasons and weather patterns, and the dynamic relationships the variables have to initial weather conditions. Logistic regression could be used to model rainfall, but further data collection is needed. Possibly from other weather stations, in say Queensland to investigate weather effects impacting Newcastle.
In summary this project has shown the complexities involved in atmospheric conditions in their relationships to a number of components. Further studies and data collection for this will be made to improve results of the weather model, aiding to reduce misclassification. Ordinal regression could also be performed to signify extreme weather events with the interaction of wind gust speed and mean sea level pressure. All in all, machine learning models in early days will not always classify correctly the first time, it is a building process which is in it’s infancy.
#Model 1's outputs
summary(rain.eff1)
##
## Call:
## glm(formula = rain.class ~ W.G.Spd + Season + av.temp + av.RH +
## av.MSLP + Season:av.temp + W.G.Spd:av.MSLP, family = binomial(link = "logit"),
## data = weather)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1135 -0.8443 0.2042 0.8523 2.1252
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.353e+02 6.676e+01 -2.027 0.042672 *
## W.G.Spd 3.415e+00 1.405e+00 2.430 0.015091 *
## SeasonSpring 1.199e+01 2.954e+00 4.058 4.95e-05 ***
## SeasonSummer 1.245e+01 3.555e+00 3.503 0.000460 ***
## SeasonWinter 1.767e+00 2.826e+00 0.625 0.531752
## av.temp 3.038e-01 1.112e-01 2.732 0.006293 **
## av.RH 8.507e-02 1.162e-02 7.317 2.53e-13 ***
## av.MSLP 1.211e-01 6.505e-02 1.861 0.062709 .
## SeasonSpring:av.temp -6.059e-01 1.522e-01 -3.981 6.85e-05 ***
## SeasonSummer:av.temp -5.827e-01 1.614e-01 -3.610 0.000306 ***
## SeasonWinter:av.temp -6.988e-03 1.732e-01 -0.040 0.967827
## W.G.Spd:av.MSLP -3.349e-03 1.387e-03 -2.416 0.015713 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 490.65 on 353 degrees of freedom
## Residual deviance: 366.12 on 342 degrees of freedom
## AIC: 390.12
##
## Number of Fisher Scoring iterations: 5
anova(rain.eff1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: rain.class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 353 490.65
## W.G.Spd 1 0.330 352 490.32 0.56589
## Season 3 5.420 349 484.90 0.14351
## av.temp 1 0.022 348 484.87 0.88164
## av.RH 1 85.605 347 399.27 < 2.2e-16 ***
## av.MSLP 1 0.250 346 399.02 0.61686
## Season:av.temp 3 26.379 343 372.64 7.945e-06 ***
## W.G.Spd:av.MSLP 1 6.517 342 366.12 0.01069 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Diagnostic plots
par(mfrow = c(2,2))
plot(rain.eff1)
pred_values <- rain.eff1 %>%
predict(newdata = weather, type = "response")
weather$pred_class <- ifelse(pred_values > 0.5, 1, 0)
# Model 1 confusion matrix
tab <- xtabs(~rain.class+pred_class, data = weather)
tab
## pred_class
## rain.class 0 1
## 0 122 52
## 1 48 132
#Model 1's performance tests
accuracy <- sum(diag(tab))/sum(tab)
accuracy
## [1] 0.7175141
error = 1 - accuracy
error
## [1] 0.2824859
sensitivty1 <- tab[4]/(tab[4]+tab[3])
sensitivty1
## [1] 0.7173913
specificity1 <- tab[1]/(tab[1]+tab[3])
specificity1
## [1] 0.7011494
roc(weather$rain.class, rain.eff1$fitted.values, plot=TRUE, legacy.axes = TRUE,
percennt = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate",
print.auc = TRUE, auc.polygon = TRUE, auc.polygon.col = "lightblue",
ci = TRUE, smooth = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = weather$rain.class, predictor = rain.eff1$fitted.values, smooth = TRUE, ci = TRUE, plot = TRUE, legacy.axes = TRUE, percennt = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate", print.auc = TRUE, auc.polygon = TRUE, auc.polygon.col = "lightblue")
##
## Data: rain.eff1$fitted.values in 174 controls (weather$rain.class 0) < 180 cases (weather$rain.class 1).
## Smoothing: binormal
## Area under the curve: 0.8245
## 95% CI: 0.7814-0.8623 (2000 stratified bootstrap replicates)
roc(weather$rain.class, pred_values, plot=TRUE, legacy.axes = TRUE,
percennt = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate",
print.auc = TRUE, auc.polygon = TRUE, auc.polygon.col = "lightgreen",
ci = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##
## Call:
## roc.default(response = weather$rain.class, predictor = pred_values, ci = TRUE, plot = TRUE, legacy.axes = TRUE, percennt = TRUE, xlab = "False Positive Rate", ylab = "True Positive Rate", print.auc = TRUE, auc.polygon = TRUE, auc.polygon.col = "lightgreen")
##
## Data: pred_values in 174 controls (weather$rain.class 0) < 180 cases (weather$rain.class 1).
## Area under the curve: 0.8181
## 95% CI: 0.7753-0.8609 (DeLong)
# Model 1 interaction effect plots
plot(allEffects(rain.eff1), multiline=TRUE, selection = 1)
plot(allEffects(rain.eff1), multiline=TRUE, selection = 2)
plot(allEffects(rain.eff1), multiline=TRUE, selection = 3)