Ch.8: Q.1-5

Q.1

\(\textit{In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?}\)

The two predictors (maximum temperature and relative humidity) must be available at the time of prediction.

Q.2

\(\textit{What an equation for the model fitted by the researches in the form of equation (8.1). Use predictor names instead of x notation.}\)

Let \(p\) be the probability of the event occuring (the agricultural epidemic of powdery mildew in mango erupting in a certain year). The \(odds\) of this event is then defined to be \[odds = \frac{p}{1-p} \] Then the logistic regression model is given by \[\log(odds) = \beta_0 + \beta_1 \cdot (\text{Max temperature}) + \beta_2 \cdot (\text{Relative humidity}) \]

Q.3

\(\textit{Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers.}\)

\(\textit{Does there appear to be a relationship between epidemic status and the two predictors?}\)

First plot Relative Humidity vs. Max Temperature. Note: to get the 0-1 coding for the Outbreak event status, I converted the Outbreak vector to logicals then multiplied by 1.

mildew <- read.csv("PowderyMildewEpidemic.csv")
#str(mildew)
#head(mildew)
#tail(mildew)
mildew$Outbreak <- mildew$Outbreak == "Yes"
mildew$Outbreak <- mildew$Outbreak * 1
# plot Relative Humidity vs. Max Temperature
plot(mildew$MaxTemp ~ mildew$RelHumidity, xlab="Relative Humidity", ylab="Max Temperature", bty="l", col= mildew$Outbreak + 1,pch=15)
legend(85,33,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)

Now plot Max Temperature vs. Relative Humidity:

# plot Max Temperature vs. Relative Humidity
plot(mildew$RelHumidity ~ mildew$MaxTemp, xlab="Max Temperature", ylab="Relative Humidity", bty="l", col= mildew$Outbreak + 1,pch=15)
legend(31.5,90,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)

We first note that there are 5 occurences of the outbreak. Out of the top 4 highest Relative Humidity, we notice that an outbreak occurred at 3 of them. Also out of the top 3 highest Max Temperatures, an outbreak occurred at 2. This might suggest that outbreaks tend to occur at either high relative humidities or high max temperatures.

Q.4

\(\textit{Compute naive forecasts of epidemic status for years 1995-1997 using a roll-forward naive forecast.}\)

\(\textit{What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.}\)

naiveForecasts <- mildew$Outbreak[(length(mildew$Outbreak)-1-3):(length(mildew$Outbreak)-1)]
naiveForecasts
## [1] 1 0 1 0

So the roll-forward naive forecasts are:

\[ \begin{array}{c c} \textbf{Year} & \textbf{Outbreak?} \\ \hline 1995 & \text{Yes} \\ 1996 & \text{No} \\ 1997 & \text{Yes} \\ 2000 & \text{No} \end{array} \]

The classification matrix is

library(caret)
confusionMatrix(naiveForecasts, mildew$Outbreak[(length(mildew$Outbreak)-3):length(mildew$Outbreak)], positive=c("1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
## 

We only predict 1 out of the 4 years correctly (25% accuracy) and the naive forecast for the year 2000 is “No Outbreak”.

Q.5

\(\textit{Partition the data into training and validation periods, so that years 1987-1994 are the training period.}\)

\(\textit{Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995}\)

\(\textit{(use a threshold of 0.5).}\)

The summary of the logistic regression model is

trainOut <- mildew[1:8,]
#trainOut
outLogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=trainOut,family="binomial")
summary(outLogReg)
## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = trainOut)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

The actual model is then given by (\(p\) is the probability of event occurring) \[ \log\left(\frac{p}{1-p}\right) = -56.1543 + 1.3849\cdot \text{MaxTemp} + 0.1877\cdot \text{RelHumidity} \]

Note that both predictors do not achieve statistical significance at the \(\alpha =0.05\) critical value. The outbreak probabilities and classification matrix for the validation period are

predictions <- predict(outLogReg,mildew[9:12,],type="response")
predictions
##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790
confusionMatrix(ifelse(predictions > 0.5, 1, 0), mildew[9:12,]$Outbreak, positive=c("1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
## 

So the outbreak probability for 1995 is roughly 0.112. Using a threshold of 0.5, we do not forecast an outbreak to occur in 1995. Note that the forecasts from the logistic regression model is an improvement over the roll-foward naive forecasts, as the accuracy went up to 75%.