\(\textit{Q1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?}\)
For the model to work as a forewarning system for the farmers, we must the true responses of whether there has been an outbreak or not. The farmers themselves are the best judge for that. We may also need number of explanatory variables along with the long series of the data.
=========================== =========================== ======================= ============================
\(\textit{Q2. Write an equation for the model fitted by the researches in the form of equation (8.1). Use predictor names instead of x notation.}\)
#Import the data into R
setwd("/Users/yusufsultan")
PowderyMildew <- read.csv("PowderyMildewEpidemic.csv",stringsAsFactors = FALSE)
str(PowderyMildew)
## 'data.frame': 12 obs. of 4 variables:
## $ Year : int 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
## $ Outbreak : chr "Yes" "No" "No" "Yes" ...
## $ MaxTemp : num 30.1 30.7 26.3 28.4 29.6 ...
## $ RelHumidity: num 82.9 79.6 89.1 91 80.6 ...
head(PowderyMildew)
## Year Outbreak MaxTemp RelHumidity
## 1 1987 Yes 30.14 82.86
## 2 1988 No 30.66 79.57
## 3 1989 No 26.31 89.14
## 4 1990 Yes 28.43 91.00
## 5 1991 No 29.57 80.57
## 6 1992 Yes 31.25 67.82
tail(PowderyMildew)
## Year Outbreak MaxTemp RelHumidity
## 7 1993 No 30.35 61.76
## 8 1994 Yes 30.71 81.14
## 9 1995 No 30.71 61.57
## 10 1996 Yes 33.07 59.76
## 11 1997 No 31.50 68.29
## 12 2000 No 29.50 79.14
#Performing the logistic regression
PowderyMildew$Outbreak <- PowderyMildew$Outbreak == "Yes"
PowderyMildew$Outbreak <- PowderyMildew$Outbreak * 1
Model= glm(Outbreak ~ MaxTemp + RelHumidity, family=binomial,data = PowderyMildew)
summary(Model)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = binomial,
## data = PowderyMildew)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7358 -0.7758 -0.1523 0.7823 1.4634
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.1554 53.4380 -1.519 0.129
## MaxTemp 2.0054 1.3232 1.515 0.130
## RelHumidity 0.2630 0.1841 1.428 0.153
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16.301 on 11 degrees of freedom
## Residual deviance: 11.196 on 9 degrees of freedom
## AIC: 17.196
##
## Number of Fisher Scoring iterations: 5
#put the coefficients and confidence intervals onto a useful scale
exp(Model$coefficients)
## (Intercept) MaxTemp RelHumidity
## 5.683987e-36 7.428793e+00 1.300825e+00
\(\textit{The relationship between the odds of an event and the probability of an event is:}\)
\[\\{odds} = \frac{p}{1-p}\] \(\textit{or equivalenty}\) \[\\{odds} = \frac{odds}{1-odds}\] \(\textit{Logistic equation would be}\)
\[\\log(odds)= \beta_0+\beta_1⋅(Max temperature)+\beta_2⋅(Relative humidity)\]
\(\textit{So,based on the results}\)
\[\\log(odds)= -81.1554 + 2.0054⋅(Max temperature)+ 0.2630⋅(Relative humidity)\] \(\textit{where log(odds) is the probability of Response equal to 1 (Yes to outbreak)and 0 (No to outbreak)}\) \(\textit{Since p-values for both the variables are high, we conclude that the variables are not statistically significant.}\)
=========================== =========================== ======================= ============================
\(\textit{Q3. Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markersDoes there appear to be a }\)\(\textit{relationship between epidemic status and the two predictors?}\)
# plot Relative Humidity vs. Max Temperature
plot(PowderyMildew$MaxTemp ~ PowderyMildew$RelHumidity, xlab="Relative Humidity", ylab="Max Temperature", bty="l", col= PowderyMildew$Outbreak + 1,pch=15)
legend(84,33,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)
# plot Max Temperature vs. Relative Humidity
plot(PowderyMildew$RelHumidity ~ PowderyMildew$MaxTemp, xlab="Max Temperature", ylab="Relative Humidity", bty="l", col= PowderyMildew$Outbreak + 1,pch=15)
legend(31.6,90,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)
=========================== =========================== ======================= ============================
\(\textit{Q4.Compute naive forecasts of epidemic status for years 1995-1997 using a roll-forward naive forecast}\) \(\\(F_{t+1}=F_t)\).\(\textit{What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.}\)
naiveForecasts <- PowderyMildew$Outbreak[(length(PowderyMildew$Outbreak)-1-3):(length(PowderyMildew$Outbreak)-1)]
naiveForecasts
## [1] 1 0 1 0
\(\textit{The roll-forward naive forecasts are:}\)
|
Year |
Outbreak |
Response |
Max Tempreture |
Relative Humidity |
Naïve Forecast (Ft+1=Ft) |
|
1987 |
Yes |
1 |
30.14 |
82.86 |
|
|
1988 |
No |
0 |
30.66 |
79.57 |
|
|
1989 |
No |
0 |
26.31 |
89.14 |
|
|
1990 |
Yes |
1 |
28.43 |
91 |
|
|
1991 |
No |
0 |
29.57 |
80.57 |
|
|
1992 |
Yes |
1 |
31.25 |
67.82 |
|
|
1993 |
No |
0 |
30.35 |
61.76 |
|
|
1994 |
Yes |
1 |
30.71 |
81.14 |
|
|
1995 |
No |
0 |
30.71 |
61.57 |
Yes |
|
1996 |
Yes |
1 |
33.07 |
59.76 |
No |
|
1997 |
No |
0 |
31.5 |
68.29 |
Yes |
|
2000 |
No |
0 |
29.5 |
79.14 |
No |
======
\(\textit{The classification matrix is:}\)
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.2.5
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.5
confusionMatrix(naiveForecasts,PowderyMildew$Outbreak[(length(PowderyMildew$Outbreak)-3):length(PowderyMildew$Outbreak)], positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 2 0
##
## Accuracy : 0.25
## 95% CI : (0.0063, 0.8059)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.9961
##
## Kappa : -0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.0000
## Specificity : 0.3333
## Pos Pred Value : 0.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.2500
## Detection Rate : 0.0000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.1667
##
## 'Positive' Class : 1
##
\(\textit{The logistic regression's predictive accuracy for the four years(1005-1997,2000)in classifiction matrix.}\)
|
Year |
Outbreak |
Response |
Max Tempreture |
Relative Humidity |
Naïve Forecast (Ft+1=Ft) |
Forcecast using Logistic Regression (log likehoods) |
p/(1-p) |
p (probability of Outbreak) |
|
1987 |
Yes |
1 |
30.14 |
82.86 |
||||
|
1988 |
No |
0 |
30.66 |
79.57 |
||||
|
1989 |
No |
0 |
26.31 |
89.14 |
||||
|
1990 |
Yes |
1 |
28.43 |
91 |
||||
|
1991 |
No |
0 |
29.57 |
80.57 |
||||
|
1992 |
Yes |
1 |
31.25 |
67.82 |
||||
|
1993 |
No |
0 |
30.35 |
61.76 |
||||
|
1994 |
Yes |
1 |
30.71 |
81.14 |
||||
|
1995 |
No |
0 |
30.71 |
61.57 |
Yes |
-3.376656 |
0.034162 |
0.033033042 |
|
1996 |
Yes |
1 |
33.07 |
59.76 |
No |
0.880058 |
2.41104 |
0.70683424 |
|
1997 |
No |
0 |
31.5 |
68.29 |
Yes |
-0.02503 |
0.975281 |
0.493742827 |
|
2000 |
No |
0 |
29.5 |
79.14 |
No |
-1.18228 |
0.306579 |
0.234642494 |
\(\textit{The Naïve forecast is correct just once out of the four cases. The logistic regression output p gives less than 0.5 probability in cases where}\)
\(\textit{ there have been no outbreaks (in actual). Hence we can say that Logistic regression gives a better forecast than the naïve forecast despite}\)
\(\textit{ being not so good model because of high p-values of explanatory variables.}\)
=========================== =========================== ======================= ============================
\(\textit{Q5.Partition the data into training and validation periods, so that years 1987-1994 are the training period.Fit a logistic regression to }\)
\(\textit{ the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 Fit a logistic }\)
\(\textit{ regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995}\)
\(\textit{(use a threshold of 0.5).}\)
trainOut <- PowderyMildew[1:8,]
#trainOut
outLogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=trainOut,family="binomial")
summary(outLogReg)
##
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial",
## data = trainOut)
##
## Deviance Residuals:
## 1 2 3 4 5 6 7 8
## 0.7466 -1.7276 -0.3132 1.0552 -1.1419 1.2419 -0.3908 0.6060
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543 44.4573 -1.263 0.207
## MaxTemp 1.3849 1.1406 1.214 0.225
## RelHumidity 0.1877 0.1578 1.189 0.234
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11.0904 on 7 degrees of freedom
## Residual deviance: 8.1198 on 5 degrees of freedom
## AIC: 14.12
##
## Number of Fisher Scoring iterations: 5
\(\textit{Logistic equation would be:}\) \(\textit{The relationship between the odds of an event and the probability of an event is:}\)
\[\\{odds} = \frac{p}{1-p}\] \(\textit{or equivalenty}\) \[\\{odds} = \frac{odds}{1-odds}\] \(\textit{Logistic equation would be}\)
\[\\log(odds)= \beta_0+\beta_1⋅(Max temperature)+\beta_2⋅(Relative humidity)\]
\(\textit{So,based on the results}\)
\[\\log(odds)= -56.1543 + 1.3849⋅(Max temperature)+ 0.1877⋅(Relative humidity)\]
\(\textit{where log(odds) is the probability of Response equal to 1 (Yes to outbreak)}\) \(\textit{Since p-values for both the variables are still high, we conclude that the variables are not statistically significant.}\)
predictions <- predict(outLogReg,PowderyMildew[9:12,],type="response")
predictions
## 9 10 11 12
## 0.1119407 0.7021411 0.5705413 0.3894790
confusionMatrix(ifelse(predictions > 0.5, 1, 0), PowderyMildew[9:12,]$Outbreak, positive=c("1"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2 0
## 1 1 1
##
## Accuracy : 0.75
## 95% CI : (0.1941, 0.9937)
## No Information Rate : 0.75
## P-Value [Acc > NIR] : 0.7383
##
## Kappa : 0.5
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.6667
## Pos Pred Value : 0.5000
## Neg Pred Value : 1.0000
## Prevalence : 0.2500
## Detection Rate : 0.2500
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.8333
##
## 'Positive' Class : 1
##
\(\textit{Since p-values for both the variables are still high, we conclude that the variables are not statistically significant}\)
\(\textit{Forecast for the year 1995}\)
\(\textit{We put the values of Max.Tempreture and Relative. Humidity in the above logistic regression equation and get p-( probability of outbreak)=0.234}\)