For the model to serve as a forewarning for farmers, the data used in the forecast, the maximum temperature and the relative humidity, would need to be available in a timely manner.
The equation that could be used for this forecast, in the format of equation 8.1, would be as follows:
\[ \log(\text{odds}) = \beta_0 + \beta_1(\text{Max temperature}) + \beta_2(\text{Relative humidity}) \]
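To read the model as a probability rather than as log-odds, the relationship can be inverted. The step below is the standard logistic transformation; the shorthand \(x_1\) for maximum temperature, \(x_2\) for relative humidity, and \(p\) for the outbreak probability is introduced here for illustration and is not notation from equation 8.1.

\[ \text{odds} = \frac{p}{1-p}, \qquad \log(\text{odds}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \quad\Longrightarrow\quad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}} \]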
For question 3, I began working with the data set by reading in the CSV file. I then created a scatter plot of the two predictors, maximum temperature and relative humidity. I converted the Outbreak column to a numeric indicator so that each data point could be color-coded as an outbreak or not (epidemic or non-epidemic).
mildew <- read.csv("~/MBA678/PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
mildew$Outbreak <- mildew$Outbreak == "Yes"
mildew$Outbreak <- mildew$Outbreak * 1
plot(mildew$RelHumidity ~ mildew$MaxTemp, xlab="Maximum Temperature", ylab="Relative Humidity", bty="l", col=mildew$Outbreak+1, pch=15)
legend(27, 70, c("No Outbreak", "Outbreak"), col=1:2, pch=15)
The predictors seem reasonably distributed across the scatterplot, but there is a slight trend of outbreaks occurring only towards the higher humidity and temperature values. The lowest recorded humidities and three of the four lowest temperatures did not have an outbreak. I also did a quick plot with the axes swapped to see if that yielded any further insight.
plot(mildew$MaxTemp ~ mildew$RelHumidity, ylab="Maximum Temperature", xlab="Relative Humidity", bty="l", col=mildew$Outbreak+1, pch=15)
legend(60, 28, c("No Outbreak", "Outbreak"), col=1:2, pch=15)
I generated the naive forecast, in which each year's forecast is the outbreak status of the previous recorded year, and then used the caret package to create the classification matrix.
n <- nrow(mildew)
# Rows 8 to 11 (1994 to 1997) are the naive forecasts for rows 9 to 12 (1995 to 2000)
naiveForecast <- mildew$Outbreak[(n - 4):(n - 1)]
naiveForecast
## [1] 1 0 1 0
mildew
## Year Outbreak MaxTemp RelHumidity
## 1 1987 1 30.14 82.86
## 2 1988 0 30.66 79.57
## 3 1989 0 26.31 89.14
## 4 1990 1 28.43 91.00
## 5 1991 0 29.57 80.57
## 6 1992 1 31.25 67.82
## 7 1993 0 30.35 61.76
## 8 1994 1 30.71 81.14
## 9 1995 0 30.71 61.57
## 10 1996 1 33.07 59.76
## 11 1997 0 31.50 68.29
## 12 2000 0 29.50 79.14
| Year | Naive forecast: outbreak? |
|---|---|
| 1995 | Yes |
| 1996 | No |
| 1997 | Yes |
The naive forecast for 2000 is 0 (no outbreak), carried forward from 1997, the last recorded year before it.
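The naive forecasts above can be reproduced without the CSV file. The following is a minimal sketch, assuming the outbreak values are typed in from the printed data; the names `years`, `outbreak`, and `naive` are illustrative and do not appear in the original code.

```r
# Outbreak status (1 = outbreak, 0 = no outbreak), copied from the printed data.
years    <- c(1987:1997, 2000)
outbreak <- c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)

# Naive forecast: shift the series by one position, so each year's forecast
# is the status of the previous recorded year (2000 follows 1997 in this data).
naive <- c(NA, head(outbreak, -1))
names(naive) <- years

# Forecasts for the validation years.
naive[c("1995", "1996", "1997", "2000")]
```

The first year has no forecast (`NA`) because no earlier observation exists.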
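library(caret)  # provides confusionMatrix(); recent caret versions require factor inputs
confusionMatrix(factor(naiveForecast, levels = 0:1), factor(mildew$Outbreak[(length(mildew$Outbreak)-3):length(mildew$Outbreak)], levels = 0:1), positive = "1")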
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
As shown above, the naive forecast is only 25% accurate on the validation period: it is correct for just one of the four years.
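The 25% figure can be checked by hand with base R. This is a small sketch, assuming the actual and forecast values are typed in from the output above; `actual`, `naive`, `confusion`, and `accuracy` are illustrative names.

```r
actual <- c(0, 1, 0, 0)  # observed status for 1995, 1996, 1997, 2000
naive  <- c(1, 0, 1, 0)  # naive forecasts for the same years

# Cross-tabulate forecasts against observations, keeping both classes as levels.
confusion <- table(Prediction = factor(naive, levels = c(0, 1)),
                   Reference  = factor(actual, levels = c(0, 1)))
confusion

# Accuracy is the share of cases on the diagonal of the table.
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy  # 0.25
```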
I separated out the training period (the first eight years, 1987 to 1994) and then ran the logistic regression on that training period.
training <- mildew[1:8,]
LogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=training,family="binomial")
summary(LogReg)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial",
## data = training)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
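As shown by the results, neither Maximum Temperature (p = 0.225) nor Relative Humidity (p = 0.234) is statistically significant at the conventional 5% level. With only eight training observations, the standard errors are large relative to the estimates.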
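One way to see what the coefficients imply, and why they fall short of significance, is to convert them to odds ratios and approximate Wald intervals. The sketch below assumes the estimates and standard errors are copied by hand from the summary output; the variable names are illustrative, and the normal-approximation interval is a rough guide at best with a sample this small.

```r
# Estimates and standard errors copied from the summary(LogReg) output.
estimate  <- c(MaxTemp = 1.3849, RelHumidity = 0.1877)
std_error <- c(MaxTemp = 1.1406, RelHumidity = 0.1578)

# Odds ratio: multiplicative change in the odds of an outbreak for a
# one-unit increase in the predictor, holding the other predictor fixed.
odds_ratio <- exp(estimate)

# Approximate 95% Wald interval on the log-odds scale.
z <- qnorm(0.975)
wald_ci <- cbind(lower = estimate - z * std_error,
                 upper = estimate + z * std_error)

round(odds_ratio, 2)
round(wald_ci, 3)
```

Both intervals include zero, which is consistent with the p-values above 0.2 in the summary.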
I then used the logistic regression to generate predictions for the validation period (1995 to 1997 and 2000). From those predictions I created the confusion matrix, classifying a year as an outbreak when its predicted probability exceeds 0.5.
predictions <- predict(LogReg, mildew[9:12,], type="response")
predictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
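As a sanity check, the first prediction (1995) can be recomputed by hand from the fitted equation. This sketch assumes the rounded coefficients from the summary output and the 1995 predictor values from the printed data, so it only approximately matches the value returned by `predict()`; `b0`, `b1`, `b2`, `log_odds_1995`, and `p_1995` are illustrative names.

```r
# Rounded coefficients from summary(LogReg).
b0 <- -56.1543   # intercept
b1 <- 1.3849     # MaxTemp
b2 <- 0.1877     # RelHumidity

# 1995 predictor values from the printed data: MaxTemp 30.71, RelHumidity 61.57.
log_odds_1995 <- b0 + b1 * 30.71 + b2 * 61.57

# plogis() is the inverse logit: 1 / (1 + exp(-x)).
p_1995 <- plogis(log_odds_1995)
round(p_1995, 3)  # about 0.112
```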
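# Recent caret versions require factor inputs with matching levels
confusionMatrix(factor(ifelse(predictions > 0.5, 1, 0), levels = 0:1), factor(mildew[9:12,]$Outbreak, levels = 0:1), positive = "1")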
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
The first value in predictions corresponds to 1995, so the model estimates an 11.2% chance of an outbreak that year (shown in the output from predictions). The confusion matrix shows 75% accuracy on the validation period, which is better than the 25% from the naive forecast. However, it only equals the no-information rate of 75%, and with just four validation years the comparison should be read cautiously.
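The 75% figure depends on the 0.5 cutoff used to turn probabilities into classes. The sketch below, assuming the predicted probabilities and actual values are typed in from the output above, shows how accuracy changes at a few cutoffs; `accuracy_at` and the cutoff values are illustrative choices, and tuning a cutoff on four validation points would overfit, so this is for illustration only.

```r
# Predicted probabilities and observed status for the validation years.
predicted <- c("1995" = 0.1119407, "1996" = 0.7021411,
               "1997" = 0.5705413, "2000" = 0.3894790)
actual <- c(0, 1, 0, 0)

# Share of years classified correctly when predicting an outbreak
# whenever the probability exceeds the cutoff.
accuracy_at <- function(cutoff) mean(as.numeric(predicted > cutoff) == actual)

cutoffs <- c(0.3, 0.5, 0.6)
data.frame(cutoff = cutoffs, accuracy = sapply(cutoffs, accuracy_at))
```

At 0.5 this reproduces the 75% accuracy reported by the confusion matrix; the single misclassified year is 1997, whose predicted probability of 0.57 sits just above the cutoff.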