Assignment 7

1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?

The two predictors — maximum temperature and relative humidity — must be available at the time of prediction.

2. Write an equation for the model fitted by the researchers in the form of equation 8.1. Use predictor names instead of x notation.

log(odds) = B˅0 + B˅1(maximum temperature) + B˅2(relative humidity)

3. Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers. Does there appears to be a relationship between epidemic status and the two predictors?

#Imported and previewed
epidemic <- read.csv("PowderyMildewEpidemic.csv")
str(epidemic)

## 'data.frame':    12 obs. of  4 variables:
##  $ Year       : int  1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
##  $ Outbreak   : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 1 2 1 2 ...
##  $ MaxTemp    : num  30.1 30.7 26.3 28.4 29.6 ...
##  $ RelHumidity: num  82.9 79.6 89.1 91 80.6 ...

#Plotted with axes, made points red and black
plot(epidemic$RelHumidity ~ epidemic$MaxTemp, xlab = "Maximum temperature", ylab = "Relative humidity", bty="l", col = epidemic$Outbreak, pch = 15)

#Legend, title
legend(31, 90, c("No epidemic", "Epidemic"), col = 1:2, pch = 15)
title("Cases of powdery mildew epidemic")

#Plotted the other way around to see if new relationships found
plot(epidemic$MaxTemp ~ epidemic$RelHumidity, xlab = "Relative humidity", ylab="Maximum temperature", bty = "l", col = epidemic$Outbreak, pch = 15)

#Legend, title
legend(82, 33, c("No epidemic", "Epidemic"), col = 1:2, pch = 15)
title("Cases of powdery mildew epidemic")

There is a strong relationship between outbreaks and high maximum temperatures. However, there seems to be no relationship between outbreaks and relative humidity with outbreaks at both high and low levels with no consistency in between.

4. Compute naive forecasts of epidemic status for years 1995-1997 using next-year forecasts. What is the naive forecast for the year 2000? Summarize the results for these four years in a classification matrix.

#Use shortcut in lecture notes. We're expecting yes, no, yes, no for the four years, since they rely on the previous years' forecasts.
naiveForecasts <- epidemic$Outbreak[(length(epidemic$Outbreak)-1-3):(length(epidemic$Outbreak)-1)]
naiveForecasts

## [1] Yes No  Yes No 
## Levels: No Yes

The forecast for 2000 is no epidemic, as expected.

#Created matrix
confusionMatrix(naiveForecasts, epidemic$Outbreak[(length(epidemic$Outbreak)-1-2):(length(epidemic$Outbreak)-0)], positive=c("Yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No   1   1
##        Yes  2   0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : Yes             
##

The matrix shows a 25 percent accuracy rate, which is valid when we go back to the data. Only 2000 is correct in the four years.

5. Partition the data into training and validation periods, so that years 1987-1994 are the training period. Fit a logistic regression to the training period using the two predictors and report the outbreak probability as well as a forecast for year 1995. (Use a threshold of 0.5.)

#Split into training period for first 8 years
epidTrain <- epidemic[1:8, ]

#Fit logistic regression model with two predictors
trainLog<- glm(Outbreak ~ MaxTemp + RelHumidity, data=epidTrain, family="binomial")
#Examined
summary(trainLog)

## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = epidTrain)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

#Generated predictions
logPredictions <- predict(trainLog, epidemic[9:12,], type="response")
#Printed
logPredictions

##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

The outbreak likelihood for 1995 (or year 9) is just over 11 percent. Using a threshold of 0.5, we don’t expect an outbreak.

#Generated matrix for years starting with 1995
#Using "1, 0" led to an error for me; "yes" and "no" worked
confusionMatrix(ifelse(logPredictions > 0.5, "Yes", "No"), epidemic[9:12,]$Outbreak, positive=c("Yes"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No   2   0
##        Yes  1   1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : Yes             
##

We see that the logistic model has improved our forecasts. Now, there’s an accuracy rate of 75 percent over four years.