Problems 1, 2, 3, 4, & 5 Pgs. 187-188

setwd("C:/Users/larms.LA-INSP5559/Documents/R/win-library/3.3/17_0421assignment7")
mildew<- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
library(forecast)
library(knitr)
library(caret)
head(mildew)
##   Year Outbreak Max.temp Rel.humidity Outbreak10
## 1 1987      Yes    30.14        82.86          1
## 2 1988       No    30.66        79.57          0
## 3 1989       No    26.31        89.14          0
## 4 1990      Yes    28.43        91.00          1
## 5 1991       No    29.57        80.57          0
## 6 1992      Yes    31.25        67.82          1
str(mildew)
## 'data.frame':    12 obs. of  5 variables:
##  $ Year        : int  1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
##  $ Outbreak    : chr  "Yes" "No" "No" "Yes" ...
##  $ Max.temp    : num  30.1 30.7 26.3 28.4 29.6 ...
##  $ Rel.humidity: num  82.9 79.6 89.1 91 80.6 ...
##  $ Outbreak10  : int  1 0 0 1 0 1 0 1 0 1 ...
kable(mildew)
Year  Outbreak  Max.temp  Rel.humidity  Outbreak10
1987  Yes       30.14     82.86         1
1988  No        30.66     79.57         0
1989  No        26.31     89.14         0
1990  Yes       28.43     91.00         1
1991  No        29.57     80.57         0
1992  Yes       31.25     67.82         1
1993  No        30.35     61.76         0
1994  Yes       30.71     81.14         1
1995  No        30.71     61.57         0
1996  Yes       33.07     59.76         1
1997  No        31.50     68.29         0
2000  No        29.50     79.14         0

1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?

Because the authors used a logistic regression model with two weather predictors to forecast an outbreak, the values of both predictors must be available at the time the forecast is made, early enough for farmers to act on it. The data requirements are therefore the maximum temperature and relative humidity for each time period, together with a record of whether an outbreak occurred. The outbreak record must be coded as zero or one, where zero represents “No” and one represents “Yes”, so that logistic regression can model the relationship between the odds of an outbreak and the predictors. With these data, the model can give farmers a better understanding of the years in which powdery mildew is likely to break out.

2. Write an equation for the model fitted by the researchers in the form of equation (8.1). Use predictor names instead of x notation.

log(odds(Outbreak)_t) = B0 + B1*(Max Temperature)_t + B2*(Relative Humidity)_t

3. Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers. Does there appear to be a relationship between epidemic status and the two predictors?

plot(mildew$Max.temp ~ mildew$Rel.humidity, xlab = "Relative Humidity", ylab = "Maximum Temperature", bty="l", col=mildew$Outbreak10+1, pch=15, main = "Scatter Plot of Max. Temp & Relative Humidity")
legend(60,28, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")

plot(mildew$Rel.humidity ~ mildew$Max.temp, xlab = "Maximum Temperature", ylab = "Relative Humidity", bty="l", col=mildew$Outbreak10+1, pch=15, main = "Scatter Plot of Relative Humidity & Max Temp")
legend(27,70, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")

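There doesn’t appear to be a clear relationship between epidemic status and the two predictors: the outbreak and non-outbreak markers overlap across the observed ranges of both maximum temperature and relative humidity.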
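As a rough numeric check on the scatter plots, the predictor means can be compared by outbreak status. The sketch below rebuilds the twelve rows from the table printed earlier so that it runs without the CSV file; the object names `mildew_check` and `group_means` are made up for this illustration.

```r
# Rebuild the data from the printed table (assumption: values copied by hand)
mildew_check <- data.frame(
  Year = c(1987:1997, 2000),
  Max.temp = c(30.14, 30.66, 26.31, 28.43, 29.57, 31.25,
               30.35, 30.71, 30.71, 33.07, 31.50, 29.50),
  Rel.humidity = c(82.86, 79.57, 89.14, 91.00, 80.57, 67.82,
                   61.76, 81.14, 61.57, 59.76, 68.29, 79.14),
  Outbreak10 = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)
)

# Mean of each predictor within the non-outbreak (0) and outbreak (1) years
group_means <- aggregate(cbind(Max.temp, Rel.humidity) ~ Outbreak10,
                         data = mildew_check, FUN = mean)
print(group_means)
```

The group means differ by about one degree in maximum temperature and about two points in relative humidity, which is small next to the spread within each group and consistent with the lack of a clear pattern in the plots.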

4. Compute naive forecasts of epidemic status for years 1995-1997 using next-year forecasts (Ft+1 = Ft). What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.

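# Naive forecast F(t+1) = F(t): the outcomes for 1994-1997 (rows 8-11) serve as
# the forecasts for 1995, 1996, 1997, and 2000; the last value is the forecast for 2000
naivemildew <- mildew$Outbreak10[(length(mildew$Outbreak10)-1-3):(length(mildew$Outbreak10)-1)]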
naivemildew
## [1] 1 0 1 0
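# Recent versions of caret require factors with identical levels, so both vectors are converted
confusionMatrix(factor(naivemildew, levels = c(0, 1)), factor(mildew$Outbreak10[(length(mildew$Outbreak10)-3) : length(mildew$Outbreak10)], levels = c(0, 1)), positive = "1")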
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
## 
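To make the alignment explicit, the naive forecasts can be laid out next to the actual outcomes by year. This is a small sketch with the outcomes typed in from the data table; `outcome`, `naive_tbl`, `naive_cm`, and `naive_accuracy` are names chosen for this illustration.

```r
# Outcomes for 1994-2000 copied from the data table (1 = outbreak, 0 = no outbreak)
outcome <- c("1994" = 1, "1995" = 0, "1996" = 1, "1997" = 0, "2000" = 0)

# Naive forecast: each year's forecast is the previous available year's outcome
naive_tbl <- data.frame(
  Year     = c(1995, 1996, 1997, 2000),
  Forecast = unname(outcome[1:4]),
  Actual   = unname(outcome[2:5])
)
print(naive_tbl)

# Classification matrix with base R: rows = forecast, columns = actual
naive_cm <- table(
  Forecast = factor(naive_tbl$Forecast, levels = c(0, 1)),
  Actual   = factor(naive_tbl$Actual, levels = c(0, 1))
)
print(naive_cm)

# Share of the four years that the naive forecast classified correctly
naive_accuracy <- mean(naive_tbl$Forecast == naive_tbl$Actual)
print(naive_accuracy)
```

The last row gives the naive forecast for 2000: no outbreak, carried forward from 1997 because the data contain no rows for 1998 or 1999. Only that forecast is correct, which agrees with the accuracy of 0.25 in the caret output.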

5. Partition the data into training and validation periods, so that years 1987-1994 are the training period. Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 (use a threshold of 0.5).

#partition training data
trainmildewout<- mildew[1:8, ]
kable(trainmildewout)
Year  Outbreak  Max.temp  Rel.humidity  Outbreak10
1987  Yes       30.14     82.86         1
1988  No        30.66     79.57         0
1989  No        26.31     89.14         0
1990  Yes       28.43     91.00         1
1991  No        29.57     80.57         0
1992  Yes       31.25     67.82         1
1993  No        30.35     61.76         0
1994  Yes       30.71     81.14         1
#logistic regression model fitted to the training period
logregout<- glm(Outbreak10 ~ Max.temp + Rel.humidity, data = trainmildewout, family = "binomial")
summary(logregout)
## 
## Call:
## glm(formula = Outbreak10 ~ Max.temp + Rel.humidity, family = "binomial", 
##     data = trainmildewout)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -56.1543    44.4573  -1.263    0.207
## Max.temp       1.3849     1.1406   1.214    0.225
## Rel.humidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5
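#predicted outbreak probabilities for the validation years (rows 9-12: 1995, 1996, 1997, 2000); row 9 is 1995
predictmildew1<- predict(logregout, newdata = mildew[9:12, ], type = "response")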
predictmildew1
##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790
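The 1995 probability can be reproduced by hand from the coefficient estimates in the summary output. The sketch below plugs the 1995 predictor values into the fitted logit and converts it to a probability with `plogis()`. Because the coefficients are copied at four decimals, the result agrees with the `predict()` output only approximately; the variable names are chosen for this illustration.

```r
# Coefficient estimates copied from summary(logregout), rounded to four decimals
b0     <- -56.1543
b_temp <- 1.3849
b_hum  <- 0.1877

# 1995 predictor values from the data table
temp_1995 <- 30.71
hum_1995  <- 61.57

# Fitted log-odds, then the inverse logit: p = 1 / (1 + exp(-logit))
logit_1995 <- b0 + b_temp * temp_1995 + b_hum * hum_1995
p_1995 <- plogis(logit_1995)

# Apply the 0.5 threshold from the exercise
forecast_1995 <- ifelse(p_1995 > 0.5, "Yes", "No")
print(round(p_1995, 3))
print(forecast_1995)
```

With an outbreak probability of roughly 0.11, well below the 0.5 threshold, the forecast for 1995 is no outbreak.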
#cutoff of 0.5
confusionMatrix(factor(ifelse(predictmildew1 > 0.5, 1, 0), levels = c(0, 1)), factor(mildew[9:12, ]$Outbreak10, levels = c(0, 1)), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
## 
#cutoff of 0.7
confusionMatrix(factor(ifelse(predictmildew1 > 0.7, 1, 0), levels = c(0, 1)), factor(mildew[9:12, ]$Outbreak10, levels = c(0, 1)), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 0
##          1 0 1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.3976, 1)
##     No Information Rate : 0.75       
##     P-Value [Acc > NIR] : 0.3164     
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.25       
##          Detection Rate : 0.25       
##    Detection Prevalence : 0.25       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : 1          
##
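The effect of the cutoff can also be seen without caret by classifying the four validation probabilities at each threshold. This sketch hard-codes the probabilities and actual outcomes printed above; `classify_at` is a helper name invented for this illustration.

```r
# Validation probabilities from predict() and the actual outcomes, copied from the output
probs  <- c("1995" = 0.1119407, "1996" = 0.7021411, "1997" = 0.5705413, "2000" = 0.3894790)
actual <- c(0, 1, 0, 0)

# Hypothetical helper: classify as outbreak (1) when the probability exceeds the cutoff
classify_at <- function(p, cutoff) as.integer(p > cutoff)

# Compare the classifications and accuracy at the two cutoffs used above
for (cutoff in c(0.5, 0.7)) {
  pred <- classify_at(probs, cutoff)
  cat("cutoff", cutoff, "predictions", pred, "accuracy", mean(pred == actual), "\n")
}

acc_05 <- mean(classify_at(probs, 0.5) == actual)
acc_07 <- mean(classify_at(probs, 0.7) == actual)
```

Only the 1997 probability (about 0.57) falls between the two cutoffs, so it is the single case whose classification changes. With four validation years, and the 1996 probability sitting just above 0.7, the comparison is too small to say which threshold would work better in general.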