Binary Outcomes - Assignment 7

Problems 1, 2, 3, 4, & 5 Pgs. 187-188

setwd("C:/Users/larms.LA-INSP5559/Documents/R/win-library/3.3/17_0421assignment7")
mildew <- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors = FALSE)
library(forecast)
library(knitr)
library(caret)
head(mildew)

##   Year Outbreak Max.temp Rel.humidity Outbreak10
## 1 1987      Yes    30.14        82.86          1
## 2 1988       No    30.66        79.57          0
## 3 1989       No    26.31        89.14          0
## 4 1990      Yes    28.43        91.00          1
## 5 1991       No    29.57        80.57          0
## 6 1992      Yes    31.25        67.82          1

str(mildew)

## 'data.frame':    12 obs. of  5 variables:
##  $ Year        : int  1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
##  $ Outbreak    : chr  "Yes" "No" "No" "Yes" ...
##  $ Max.temp    : num  30.1 30.7 26.3 28.4 29.6 ...
##  $ Rel.humidity: num  82.9 79.6 89.1 91 80.6 ...
##  $ Outbreak10  : int  1 0 0 1 0 1 0 1 0 1 ...

kable(mildew)

Year	Outbreak	Max.temp	Rel.humidity	Outbreak10
1987	Yes	30.14	82.86	1
1988	No	30.66	79.57	0
1989	No	26.31	89.14	0
1990	Yes	28.43	91.00	1
1991	No	29.57	80.57	0
1992	Yes	31.25	67.82	1
1993	No	30.35	61.76	0
1994	Yes	30.71	81.14	1
1995	No	30.71	61.57	0
1996	Yes	33.07	59.76	1
1997	No	31.50	68.29	0
2000	No	29.50	79.14	0

1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?

Because the authors used a logistic regression model with two weather predictors to forecast an outbreak, all of the predictor information must be available at the time the forecast is made. The data required are the maximum temperature and relative humidity for each time period, together with confirmation of whether or not an outbreak occurred. The outbreak confirmation must be converted to zero or one, where zero represents “No” and one represents “Yes”. This allows the analysts to use logistic regression to model the relationship between the odds of the event and the predictors. It will give farmers a better understanding of the years in which powdery mildew is likely to break out.
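
The zero/one recoding described above can be sketched in base R. This is only an illustration: the small data frame is typed in from the first four rows of the table, and the column name Outbreak01 is hypothetical (the CSV already provides an Outbreak10 column).

```r
# First four years from the data table, typed in by hand for illustration
mildew_demo <- data.frame(
  Year     = c(1987, 1988, 1989, 1990),
  Outbreak = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Recode "Yes"/"No" to 1/0 so the outcome can be modeled with glm()
# (Outbreak01 is a hypothetical name, not a column in the CSV)
mildew_demo$Outbreak01 <- ifelse(mildew_demo$Outbreak == "Yes", 1L, 0L)
mildew_demo
```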

2. Write an equation for the model fitted by the researchers in the form of equation (8.1). Use predictor names instead of x notation.

log(odds(Outbreak)t) = B0 + B1(Max.temp)t + B2(Rel.humidity)t

3. Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markers. Does there appear to be a relationship between epidemic status and the two predictors?

plot(mildew$Max.temp ~ mildew$Rel.humidity, xlab = "Relative Humidity", ylab = "Maximum Temperature", bty="l", col=mildew$Outbreak10+1, pch=15, main = "Scatter Plot of Max. Temp & Relative Humidity")
legend(60,28, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")

plot(mildew$Rel.humidity ~ mildew$Max.temp, xlab = "Maximum Temperature", ylab = "Relative Humidity", bty="l", col=mildew$Outbreak10+1, pch=15, main = "Scatter Plot of Relative Humidity & Max Temp")
legend(27,70, c("No Outbreak", "Outbreak"), col = 1:2, pch = 15, bty = "l")

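There does not appear to be a clear relationship between epidemic status and the two predictors: outbreak and non-outbreak years overlap across most of the range of both maximum temperature and relative humidity.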
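
As a rough numeric companion to the scatter plots, the predictor means can be compared by outbreak status. This is a sketch, not part of the assigned problem; the data frame mildew_demo is typed in from the table above and the name is hypothetical.

```r
# Predictors and outbreak indicator typed in from the data table
mildew_demo <- data.frame(
  Max.temp     = c(30.14, 30.66, 26.31, 28.43, 29.57, 31.25,
                   30.35, 30.71, 30.71, 33.07, 31.50, 29.50),
  Rel.humidity = c(82.86, 79.57, 89.14, 91.00, 80.57, 67.82,
                   61.76, 81.14, 61.57, 59.76, 68.29, 79.14),
  Outbreak10   = c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)
)

# Mean of each predictor within the non-outbreak (0) and outbreak (1) years
group_means <- aggregate(cbind(Max.temp, Rel.humidity) ~ Outbreak10,
                         data = mildew_demo, FUN = mean)
group_means
```

With only twelve years of data, the group means differ by roughly one degree and two humidity points, which is small relative to the spread within each group.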

4. Compute naive forecasts of epidemic status for years 1995-1997 using next-year forecasts (Ft+1 = Ft). What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.

naivemildew <- mildew$Outbreak10[(length(mildew$Outbreak10)-1-3):(length(mildew$Outbreak10)-1)]
naivemildew

## [1] 1 0 1 0

The naive forecasts for 1995, 1996, 1997, and 2000 are 1, 0, 1, and 0. Because the data contain no observations for 1998 and 1999, the forecast for 2000 carries forward the 1997 value, which is 0 (no outbreak).

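#recent versions of caret require factors with the same levels
confusionMatrix(factor(naivemildew, levels = c(0, 1)), factor(mildew$Outbreak10[(length(mildew$Outbreak10)-3) : length(mildew$Outbreak10)], levels = c(0, 1)), positive = "1")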

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
##
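
The naive forecast and its classification matrix can also be reproduced with base R alone. This is a minimal sketch that assumes the outbreak indicator is typed in from the table; the names naive_all, naive_fc, and class_matrix are hypothetical.

```r
# Outbreak indicator by year, typed in from the data table
years    <- c(1987:1997, 2000)
outbreak <- c(1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0)

# Naive forecast: each forecast is the previous observed value (Ft+1 = Ft)
naive_all <- c(NA, head(outbreak, -1))

# Keep the four evaluation years
eval_idx <- years %in% c(1995, 1996, 1997, 2000)
naive_fc <- naive_all[eval_idx]
actual   <- outbreak[eval_idx]

# Classification matrix (rows = forecast, columns = actual) and accuracy
class_matrix <- table(Prediction = factor(naive_fc, levels = c(0, 1)),
                      Reference  = factor(actual,   levels = c(0, 1)))
accuracy <- mean(naive_fc == actual)
class_matrix
accuracy
```

Fixing the factor levels to 0 and 1 keeps the matrix 2 x 2 even if one class never appears in a short evaluation window.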

5. Partition the data into training and validation periods, so that years 1987-1994 are the training period. Fit a logistic regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 (use a threshold of 0.5).

#partition the data: the training period is 1987-1994 (rows 1 to 8)
trainmildewout <- mildew[1:8, ]
kable(trainmildewout)

Year	Outbreak	Max.temp	Rel.humidity	Outbreak10
1987	Yes	30.14	82.86	1
1988	No	30.66	79.57	0
1989	No	26.31	89.14	0
1990	Yes	28.43	91.00	1
1991	No	29.57	80.57	0
1992	Yes	31.25	67.82	1
1993	No	30.35	61.76	0
1994	Yes	30.71	81.14	1

#logistic regression model fitted to the training period
logregout <- glm(Outbreak10 ~ Max.temp + Rel.humidity, data = trainmildewout, family = "binomial")
summary(logregout)

## 
## Call:
## glm(formula = Outbreak10 ~ Max.temp + Rel.humidity, family = "binomial", 
##     data = trainmildewout)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -56.1543    44.4573  -1.263    0.207
## Max.temp       1.3849     1.1406   1.214    0.225
## Rel.humidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5
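
To show how the fitted coefficients turn into an outbreak probability, the 1995 calculation can be done by hand. This is a sketch under the assumption that the rounded coefficients printed in the summary are used, so the result agrees with predict() only to about three decimal places; the variable names are hypothetical.

```r
# Coefficients copied (rounded) from the summary output
b0     <- -56.1543
b_temp <- 1.3849
b_hum  <- 0.1877

# 1995 predictor values from the data table
max_temp_1995 <- 30.71
rel_hum_1995  <- 61.57

# Log odds from the fitted equation, then the inverse logit for the probability
log_odds_1995 <- b0 + b_temp * max_temp_1995 + b_hum * rel_hum_1995
prob_1995     <- plogis(log_odds_1995)

# Apply the 0.5 threshold: 1 = outbreak, 0 = no outbreak
forecast_1995 <- ifelse(prob_1995 > 0.5, 1, 0)
prob_1995
forecast_1995
```

The probability comes out at roughly 0.11, so the forecast for 1995 at a 0.5 threshold is no outbreak.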

#predicted outbreak probabilities for the validation years (1995-1997 and 2000)
predictmildew1 <- predict(logregout, newdata = mildew[9:12, ], type = "response")
predictmildew1

##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

For 1995 (row 9) the estimated outbreak probability is about 0.11. This is below the 0.5 threshold, so the forecast for 1995 is no outbreak.

#cutoff of 0.5
confusionMatrix(factor(ifelse(predictmildew1 > 0.5, 1, 0), levels = c(0, 1)), factor(mildew[9:12, ]$Outbreak10, levels = c(0, 1)), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
##

#cutoff of 0.7
confusionMatrix(factor(ifelse(predictmildew1 > 0.7, 1, 0), levels = c(0, 1)), factor(mildew[9:12, ]$Outbreak10, levels = c(0, 1)), positive = "1")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 0
##          1 0 1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.3976, 1)
##     No Information Rate : 0.75       
##     P-Value [Acc > NIR] : 0.3164     
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.00       
##             Specificity : 1.00       
##          Pos Pred Value : 1.00       
##          Neg Pred Value : 1.00       
##              Prevalence : 0.25       
##          Detection Rate : 0.25       
##    Detection Prevalence : 0.25       
##       Balanced Accuracy : 1.00       
##                                      
##        'Positive' Class : 1          
##
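
The effect of the two cutoffs can be compared directly from the four predicted probabilities. This is a sketch that types in the probabilities printed above rather than refitting the model; the helper names classify_at and accuracy_at are hypothetical.

```r
# Predicted probabilities for 1995, 1996, 1997, 2000, copied from the output
pred_prob <- c(0.1119407, 0.7021411, 0.5705413, 0.3894790)

# Actual outbreak status for the same four years
actual <- c(0, 1, 0, 0)

# Classify as outbreak (1) when the probability exceeds the threshold
classify_at <- function(prob, threshold) as.integer(prob > threshold)

# Share of the four years classified correctly at a given threshold
accuracy_at <- function(threshold) {
  mean(classify_at(pred_prob, threshold) == actual)
}

acc <- sapply(c(0.5, 0.7), accuracy_at)
names(acc) <- c("0.5", "0.7")
acc
```

The higher accuracy at 0.7 rests on a single year (1997, probability 0.57), and the cutoff is being chosen on the same four observations used to judge it, so it should be read with caution.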

Lexi Armstrong

April 23, 2017