The two predictors — maximum temperature and relative humidity — must be available at the time of prediction.
log(odds) = B˅0 + B˅1(maximum temperature) + B˅2(relative humidity)
#Imported and previewed
epidemic <- read.csv("PowderyMildewEpidemic.csv")
str(epidemic)
## 'data.frame': 12 obs. of 4 variables:
## $ Year : int 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
## $ Outbreak : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 1 2 1 2 ...
## $ MaxTemp : num 30.1 30.7 26.3 28.4 29.6 ...
## $ RelHumidity: num 82.9 79.6 89.1 91 80.6 ...
#Plotted with axes, made points red and black
plot(epidemic$RelHumidity ~ epidemic$MaxTemp, xlab = "Maximum temperature", ylab = "Relative humidity", bty="l", col = epidemic$Outbreak, pch = 15)
#Legend, title
legend(31, 90, c("No epidemic", "Epidemic"), col = 1:2, pch = 15)
title("Cases of powdery mildew epidemic")
#Plotted the other way around to see if new relationships found
plot(epidemic$MaxTemp ~ epidemic$RelHumidity, xlab = "Relative humidity", ylab="Maximum temperature", bty = "l", col = epidemic$Outbreak, pch = 15)
#Legend, title
legend(82, 33, c("No epidemic", "Epidemic"), col = 1:2, pch = 15)
title("Cases of powdery mildew epidemic")
There is a strong relationship between outbreaks and high maximum temperatures. However, there seems to be no relationship between outbreaks and relative humidity with outbreaks at both high and low levels with no consistency in between.
#Use shortcut in lecture notes. We're expecting yes, no, yes, no for the four years, since they rely on the previous years' forecasts.
naiveForecasts <- epidemic$Outbreak[(length(epidemic$Outbreak)-1-3):(length(epidemic$Outbreak)-1)]
naiveForecasts
## [1] Yes No Yes No
## Levels: No Yes
The forecast for 2000 is no epidemic, as expected.
#Created matrix
confusionMatrix(naiveForecasts, epidemic$Outbreak[(length(epidemic$Outbreak)-1-2):(length(epidemic$Outbreak)-0)], positive=c("Yes"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1 1
## Yes 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : Yes
##
The matrix shows a 25 percent accuracy rate, which is valid when we go back to the data. Only 2000 is correct in the four years.
#Split into training period for first 8 years
epidTrain <- epidemic[1:8, ]
#Fit logistic regression model with two predictors
trainLog<- glm(Outbreak ~ MaxTemp + RelHumidity, data=epidTrain, family="binomial")
#Examined
summary(trainLog)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial",
## data = epidTrain)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
#Generated predictions
logPredictions <- predict(trainLog, epidemic[9:12,], type="response")
#Printed
logPredictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
The outbreak likelihood for 1995 (or year 9) is just over 11 percent. Using a threshold of 0.5, we don’t expect an outbreak.
#Generated matrix for years starting with 1995
#Using "1, 0" led to an error for me; "yes" and "no" worked
confusionMatrix(ifelse(logPredictions > 0.5, "Yes", "No"), epidemic[9:12,]$Outbreak, positive=c("Yes"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2 0
## Yes 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : Yes
##
We see that the logistic model has improved our forecasts. Now, there’s an accuracy rate of 75 percent over four years.