Ch.8: Q.1-5
Q.1
\(\textit{In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?}\)
The two predictors (maximum temperature and relative humidity) must be available at the time of prediction.
Q.2
\(\textit{What an equation for the model fitted by the researches in the form of equation (8.1). Use predictor names instead of x notation.}\)
Let \(p\) be the probability of the event occuring (the agricultural epidemic of powdery mildew in mango erupting in a certain year). The \(odds\) of this event is then defined to be \[odds = \frac{p}{1-p} \] Then the logistic regression model is given by \[\log(odds) = \beta_0 + \beta_1 \cdot (\text{Max temperature}) + \beta_2 \cdot (\text{Relative humidity}) \]
Q.3
\(\textit{Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers.}\)
\(\textit{Does there appear to be a relationship between epidemic status and the two predictors?}\)
First plot Relative Humidity vs. Max Temperature. Note: to get the 0-1 coding for the Outbreak event status, I converted the Outbreak vector to logicals then multiplied by 1.
mildew <- read.csv("PowderyMildewEpidemic.csv")
#str(mildew)
#head(mildew)
#tail(mildew)
mildew$Outbreak <- mildew$Outbreak == "Yes"
mildew$Outbreak <- mildew$Outbreak * 1
# plot Relative Humidity vs. Max Temperature
plot(mildew$MaxTemp ~ mildew$RelHumidity, xlab="Relative Humidity", ylab="Max Temperature", bty="l", col= mildew$Outbreak + 1,pch=15)
legend(85,33,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)
Now plot Max Temperature vs. Relative Humidity:
# plot Max Temperature vs. Relative Humidity
plot(mildew$RelHumidity ~ mildew$MaxTemp, xlab="Max Temperature", ylab="Relative Humidity", bty="l", col= mildew$Outbreak + 1,pch=15)
legend(31.5,90,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)
We first note that there are 5 occurences of the outbreak. Out of the top 4 highest Relative Humidity, we notice that an outbreak occurred at 3 of them. Also out of the top 3 highest Max Temperatures, an outbreak occurred at 2. This might suggest that outbreaks tend to occur at either high relative humidities or high max temperatures.
Q.4
\(\textit{Compute naive forecasts of epidemic status for years 1995-1997 using a roll-forward naive forecast.}\)
\(\textit{What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.}\)
naiveForecasts <- mildew$Outbreak[(length(mildew$Outbreak)-1-3):(length(mildew$Outbreak)-1)]
naiveForecasts
## [1] 1 0 1 0
So the roll-forward naive forecasts are:
\[ \begin{array}{c c} \textbf{Year} & \textbf{Outbreak?} \\ \hline 1995 & \text{Yes} \\ 1996 & \text{No} \\ 1997 & \text{Yes} \\ 2000 & \text{No} \end{array} \]
The classification matrix is
library(caret)
confusionMatrix(naiveForecasts, mildew$Outbreak[(length(mildew$Outbreak)-3):length(mildew$Outbreak)], positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
We only predict 1 out of the 4 years correctly (25% accuracy) and the naive forecast for the year 2000 is “No Outbreak”.
Q.5
\(\textit{Partition the data into training and validation periods, so that years 1987-1994 are the training period.}\)
\(\textit{Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995}\)
\(\textit{(use a threshold of 0.5).}\)
The summary of the logistic regression model is
trainOut <- mildew[1:8,]
#trainOut
outLogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=trainOut,family="binomial")
summary(outLogReg)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial",
## data = trainOut)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
The actual model is then given by (\(p\) is the probability of event occurring) \[ \log\left(\frac{p}{1-p}\right) = -56.1543 + 1.3849\cdot \text{MaxTemp} + 0.1877\cdot \text{RelHumidity} \]
Note that both predictors do not achieve statistical significance at the \(\alpha =0.05\) critical value. The outbreak probabilities and classification matrix for the validation period are
predictions <- predict(outLogReg,mildew[9:12,],type="response")
predictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
confusionMatrix(ifelse(predictions > 0.5, 1, 0), mildew[9:12,]$Outbreak, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
So the outbreak probability for 1995 is roughly 0.112. Using a threshold of 0.5, we do not forecast an outbreak to occur in 1995. Note that the forecasts from the logistic regression model is an improvement over the roll-foward naive forecasts, as the accuracy went up to 75%.