Assignment 7

Chapter 8

Question 1

For the model to serve as a forewarning for farmers, the predictor data used in the forecast (the maximum temperature and the relative humidity) must be available in time, before the period being forecast.

Question 2

The equation for this forecast, in the same format as equation 8.1, is as follows:

\[ \log(\text{odds}) = \beta_0 + \beta_1(\text{Max temperature}) + \beta_2(\text{Relative humidity}) \]
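Since the equation is stated on the log-odds scale, it may help to see how a log-odds value maps back to a probability of an outbreak. The following is a minimal sketch; `log_odds_to_prob` is a hypothetical helper name introduced here for illustration, not something from the textbook.

```r
# Hypothetical helper: convert log-odds to a probability with the logistic function
log_odds_to_prob <- function(log_odds) {
  1 / (1 + exp(-log_odds))
}

# Log-odds of 0 correspond to even odds, i.e. a probability of 0.5
log_odds_to_prob(0)

# Base R's plogis() computes the same transformation
all.equal(log_odds_to_prob(1.5), plogis(1.5))
```

Larger positive log-odds push the probability towards 1, and larger negative log-odds push it towards 0.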

Question 3

For question 3, I needed to begin working with the data set, so I pulled in the CSV file. I then created a scatter plot of the two predictors, maximum temperature and relative humidity. I had to convert the Outbreak column to a numeric identifier in order to color code whether each data point was an outbreak or not (epidemic or non-epidemic).

mildew <- read.csv("~/MBA678/PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
mildew$Outbreak <- mildew$Outbreak == "Yes"
mildew$Outbreak <- mildew$Outbreak * 1

plot(mildew$RelHumidity ~ mildew$MaxTemp, xlab="Maximum Temperature", ylab="Relative Humidity", bty="l", col=mildew$Outbreak+1, pch=15)

legend(27, 70, c("No Outbreak", "Outbreak"), col=1:2, pch=15)

The predictors seem reasonably distributed across the scatterplot, but there is a slight trend of outbreaks occurring only towards the higher humidity and temperature values. The lowest recorded humidities and three of the four lowest temperatures did not have an outbreak. I also did a quick plot with the variables on opposite axes to see if that yielded any further insight.

plot(mildew$MaxTemp ~ mildew$RelHumidity, ylab="Maximum Temperature", xlab="Relative Humidity", bty="l", col=mildew$Outbreak+1, pch=15)

legend(60, 28, c("No Outbreak", "Outbreak"), col=1:2, pch=15)

Question 4

I generated the naive forecast and then used the caret package to create the classification matrix.

naiveForecast<- mildew$Outbreak[(length(mildew$Outbreak)-1-3):(length(mildew$Outbreak)-1)]

naiveForecast

## [1] 1 0 1 0

mildew

##    Year Outbreak MaxTemp RelHumidity
## 1  1987        1   30.14       82.86
## 2  1988        0   30.66       79.57
## 3  1989        0   26.31       89.14
## 4  1990        1   28.43       91.00
## 5  1991        0   29.57       80.57
## 6  1992        1   31.25       67.82
## 7  1993        0   30.35       61.76
## 8  1994        1   30.71       81.14
## 9  1995        0   30.71       61.57
## 10 1996        1   33.07       59.76
## 11 1997        0   31.50       68.29
## 12 2000        0   29.50       79.14

Year	Naive forecast: outbreak?
1995	Yes
1996	No
1997	Yes

The naive forecast for 2000 is 0 (no outbreak), because the most recent observed year, 1997, had no outbreak.
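The naive forecast simply carries the previous observed value forward. As a minimal sketch in base R, with the Outbreak values and years copied by hand from the data printed above (the variable names here are illustrative):

```r
# Years and outbreak indicator copied from the printed data (1987-1997 and 2000)
years    <- c(1987:1997, 2000)
outbreak <- c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)
n <- length(outbreak)

# Naive forecast: the forecast for each row is the value observed in the previous row
naive <- outbreak[(n - 4):(n - 1)]

# Pair each forecast with the year it applies to (the last four rows)
data.frame(Year = years[(n - 3):n], NaiveForecast = naive)
```

The last row of the resulting data frame is the forecast for 2000, which is taken from the 1997 observation.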

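library(caret)  # provides confusionMatrix(); recent versions require factor inputs
confusionMatrix(factor(naiveForecast, levels=0:1), factor(mildew$Outbreak[(length(mildew$Outbreak)-3):length(mildew$Outbreak)], levels=0:1), positive="1")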

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
##

As shown above, the naive forecast is only 25% accurate: it classifies just one of the four years correctly.
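The headline statistics in the caret output can be reproduced by hand from the four cell counts. This is a minimal sketch in base R, with the counts read off the confusion matrix above and 1 (outbreak) treated as the positive class:

```r
# Cell counts read from the naive-forecast confusion matrix (positive class = 1)
tp <- 0  # predicted outbreak, outbreak occurred
fn <- 1  # predicted no outbreak, outbreak occurred
fp <- 2  # predicted outbreak, no outbreak occurred
tn <- 1  # predicted no outbreak, no outbreak occurred

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all years classified correctly
sensitivity <- tp / (tp + fn)                   # share of actual outbreaks that were caught
specificity <- tn / (tn + fp)                   # share of non-outbreak years that were caught

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

These values agree with the Accuracy, Sensitivity, and Specificity lines reported by `confusionMatrix`.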

Question 5

I separate out the training period (the first eight rows, 1987 to 1994) and then run the logistic regression on that training period.

training <- mildew[1:8,]

LogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=training,family="binomial")
summary(LogReg)

## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = training)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

As shown by the results, the p-values for Maximum Temperature (0.225) and Relative Humidity (0.234) are well above the conventional 0.05 threshold, so neither predictor is statistically significant.
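To see where those p-values come from, the z values and two-sided p-values can be recomputed from the estimates and standard errors. This is a minimal sketch using the rounded figures copied from the summary output, so the results should agree with the table only to about three decimal places:

```r
# Estimates and standard errors copied from the summary(LogReg) output above
est <- c(MaxTemp = 1.3849, RelHumidity = 0.1877)
se  <- c(MaxTemp = 1.1406, RelHumidity = 0.1578)

z <- est / se            # Wald z statistic: estimate divided by its standard error
p <- 2 * pnorm(-abs(z))  # two-sided p-value from the standard normal distribution

round(cbind(z = z, p = p), 3)
```

Both z values are well inside the range of about plus or minus 1.96 that corresponds to the 0.05 level, which is why neither coefficient is flagged as significant.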

I then use the logistic regression to generate predicted probabilities for the validation period (rows 9 to 12). I classify a year as an outbreak when its predicted probability exceeds 0.5 and build the confusion matrix from those classifications.

predictions <- predict(LogReg, mildew[9:12,], type="response")
predictions

##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

confusionMatrix(factor(ifelse(predictions > 0.5, 1, 0), levels=0:1), factor(mildew[9:12,]$Outbreak, levels=0:1), positive="1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
##

The first value in predictions corresponds to 1995, so the model gives an 11.2% chance of an outbreak in that year, which the 0.5 cutoff classifies as no outbreak. As the confusion matrix shows, the logistic regression is 75% accurate on the validation period, which is better than the 25% from the naive forecast, although it only matches the no-information rate of 0.75 reported in the output.
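The 1995 probability can be checked by hand by plugging that year's predictor values into the fitted equation. This is a minimal sketch using the rounded coefficients from the summary output and the 1995 row of the data (MaxTemp 30.71, RelHumidity 61.57), so it should match the `predict` output only approximately:

```r
# Rounded coefficients copied from the summary(LogReg) output
b0     <- -56.1543
b_temp <- 1.3849
b_hum  <- 0.1877

# 1995 predictor values copied from the printed data
max_temp     <- 30.71
rel_humidity <- 61.57

log_odds <- b0 + b_temp * max_temp + b_hum * rel_humidity  # linear predictor
prob     <- plogis(log_odds)                               # convert log-odds to probability

prob        # roughly 0.11
prob > 0.5  # FALSE, so 1995 is classified as no outbreak
```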