Libraries were loaded and the csv file was read in as a data frame.

#setwd("~/wd678/Unit 6")
#install.packages("caret")
#install.packages("pander")
#install.packages("dplyer")
#install.packages("forecast")
#install.packages("e1071", dependencies=TRUE)
library(caret, quietly=TRUE, warn.conflicts=FALSE)
## Warning: package 'caret' was built under R version 3.3.3
library(pander)
## Warning: package 'pander' was built under R version 3.3.3
library(dplyr, quietly=TRUE, warn.conflicts=FALSE)
library(forecast)
## Warning: package 'forecast' was built under R version 3.3.3
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.3
df.mildew <- read.csv("PowderyMildewEpidemic.csv", stringsAsFactors=FALSE, header=TRUE)
df.mildew
##    Year Outbreak Max.temp Rel.humidity  X X.1
## 1  1987      Yes    30.14        82.86 NA  NA
## 2  1988       No    30.66        79.57 NA  NA
## 3  1989       No    26.31        89.14 NA  NA
## 4  1990      Yes    28.43        91.00 NA  NA
## 5  1991       No    29.57        80.57 NA  NA
## 6  1992      Yes    31.25        67.82 NA  NA
## 7  1993       No    30.35        61.76 NA  NA
## 8  1994      Yes    30.71        81.14 NA  NA
## 9  1995       No    30.71        61.57 NA  NA
## 10 1996      Yes    33.07        59.76 NA  NA
## 11 1997       No    31.50        68.29 NA  NA
## 12 2000       No    29.50        79.14 NA  NA

Question 1.

In general, all predictors should be available at the time of forecast.

The equation for the model is below:

log(odds(mildew)) = beta0 + beta1xMaxTemp + beta2xRelativeHumidity

Question 3.

A vector was added to the outbreak data frame to change yes/no elements in the Outbreak? vector to a binary dummy variable, 1/0, in a vector named OutbreakNum. Yes=1(epidemic) and No=0(non-epidemic)

df.mildew$OutbreakNum <- ifelse( df.mildew$Outbreak=="Yes", 1,0)
df.mildew$OutbreakNum
##  [1] 1 0 0 1 0 1 0 1 0 1 0 0
plot(df.mildew$Max.temp, df.mildew$Rel.humidity, bty="l", pch=2, col=ifelse(df.mildew$OutbreakNum==1, "red", "blue"), main="Two weather predictors for a mildew epidemic")
legend(31.5, 92, c("epidemic", "non-epidemic"), pch=c(2,2), col=c("red", "blue"))

Question 4.

The 1995-97 roll-forward naive forecasts for a mildew outbreak (epidemic) were charted in the table below (1=epidemic, 0=non-epidemic):

mildew <- df.mildew$OutbreakNum[(length(df.mildew$OutbreakNum)-4):(length(df.mildew$OutbreakNum)-2)]
naiveForecasts <- as.data.frame(mildew)
naiveForecasts$mildew
## [1] 1 0 1
naiveForecasts$Year <- df.mildew$Year[(length(df.mildew$Year)-3):(length(df.mildew$Year)-1)]

Here is the roll-forward naive forecast table for mildew in 1995-97

forecastChart <-head(naiveForecasts %>% select(Year, mildew))
panderOptions("round",2)
set.caption("1=epidemic, 0=non-epidemic")
pander(forecastChart)
1=epidemic, 0=non-epidemic
Year mildew
1995 1
1996 0
1997 1

Since the actual mildew forecasts for years 1998 & 1999 were not given, the naive forecast for year 2000 could either: (1) be obtained using fixed training and validation periods, or (2) the 1997 actual outcome (no outbreak) could be rolled forward to forecast 1998, which could be rolled forward to forecast 1999, which finally could be rolled forward to forecast 2000.

Below, the year 2000 point forecast was computed using fixed training and validation intervals.

Point forecasts for years 1998, 1999, and 2000 are below:

fixed.nValid <- 1
fixed.nTrain <- length(df.mildew$Year) - fixed.nValid

ts <- ts(df.mildew$OutbreakNum, start=c(1987,1), end = c(2000,1), freq=1)
train <-window(ts, start=c(1987, 1), end=c(1987, fixed.nTrain))
valid <- window(ts, start=c(1987, fixed.nTrain+1), end=c(1987, fixed.nTrain + fixed.nValid))
naive.pred <- naive(train, h=3)
naive.pred
##      Point Forecast     Lo 80    Hi 80     Lo 95    Hi 95
## 1998              0 -1.215787 1.215787 -1.859385 1.859385
## 1999              0 -1.719382 1.719382 -2.629568 2.629568
## 2000              0 -2.105804 2.105804 -3.220549 3.220549

Using a more direct approach to forecast mildew for the year 2000, since the given data frame had 12 rows, the naive forecast for the year 2000 would be the actual forecast from the preceeding year, in the 11th row:

mildew.11 <- df.mildew$OutbreakNum[(length(df.mildew$OutbreakNum)-1)]
mildew.11
## [1] 0

Combine the vectors for the three forecasts in 1995-97 with the single forecast in 2000.

four.YearsForecasted <- c(naiveForecasts$mildew, naive.pred$mean[3])
four.YearsForecasted
## [1] 1 0 1 0

Combine the vectors for the three actual outcomes in 1995-97 with the single actual outcome in 2000.

four.YearsActual <- c(df.mildew$OutbreakNum[(length(df.mildew$OutbreakNum)-3):(length(df.mildew$OutbreakNum)-1)], df.mildew$OutbreakNum[12])
four.YearsActual
## [1] 0 1 0 0

The classification matrix below showed a naive forecast accuracy rate of 1:4 or 25%.

confusionMatrix(four.YearsForecasted, four.YearsActual, positive=c("1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
## 

Question 5

The data was partitioned into training and validation periods.

mildewTrain <- df.mildew[1:8, ]
mildewTrain
##   Year Outbreak Max.temp Rel.humidity  X X.1 OutbreakNum
## 1 1987      Yes    30.14        82.86 NA  NA           1
## 2 1988       No    30.66        79.57 NA  NA           0
## 3 1989       No    26.31        89.14 NA  NA           0
## 4 1990      Yes    28.43        91.00 NA  NA           1
## 5 1991       No    29.57        80.57 NA  NA           0
## 6 1992      Yes    31.25        67.82 NA  NA           1
## 7 1993       No    30.35        61.76 NA  NA           0
## 8 1994      Yes    30.71        81.14 NA  NA           1
mildewValid <- df.mildew[9:12, ]
mildewValid
##    Year Outbreak Max.temp Rel.humidity  X X.1 OutbreakNum
## 9  1995       No    30.71        61.57 NA  NA           0
## 10 1996      Yes    33.07        59.76 NA  NA           1
## 11 1997       No    31.50        68.29 NA  NA           0
## 12 2000       No    29.50        79.14 NA  NA           0

A logistic regression model was fitted to the training period.

mildewLogReg <- glm(OutbreakNum ~ Max.temp + Rel.humidity, data=mildewTrain, family="binomial")
summary(mildewLogReg)
## 
## Call:
## glm(formula = OutbreakNum ~ Max.temp + Rel.humidity, family = "binomial", 
##     data = mildewTrain)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -56.1543    44.4573  -1.263    0.207
## Max.temp       1.3849     1.1406   1.214    0.225
## Rel.humidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

The probability of mildew in 1995 (time index 9) was 0.1119407 or 11.19%. The computation is below.

The probability of mildew in 1995 (0.1119407) was lower than the cut-off of 0.5. Since the probability didn’t meet the 0.5 threshold, the forecast was that there would not be a mildew outbreak in 1995. Below are the outbreak probabilities for the years in the validation period, 1995-2000.

predictions <- predict(mildewLogReg, df.mildew[9:12, ], type="response")
predictions
##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

The mildew outbreak probability for the year 1995 alone was computed below:

predictions95 <- predict(mildewLogReg, df.mildew[9, ], type="response")
predictions95
##         9 
## 0.1119407

The confusion matrix for years 1995-97 & 2000 is below. The acccuracy was 75%. Since the actual outcome for mildew in 1995 was “No”, which agreed with the forecast, 1995 was tallied as an accurate negative prediction in the 2x2 table under Prediction 0 and Reference 0. The confusion matrix for year 1995 alone would have a 100% accuracy.

confusionMatrix(ifelse(predictions > 0.5, 1, 0), df.mildew[9:12, ]$OutbreakNum, positive=c("1"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
##