assignment_7-MBA678

\(\textit{Q1. In order for the model to serve as a forewarning system for farmers, what requirements must be satisfied regarding data availability?}\)

For the model to work as a forewarning system for the farmers, we must the true responses of whether there has been an outbreak or not. The farmers themselves are the best judge for that. We may also need number of explanatory variables along with the long series of the data.

=========================== =========================== ======================= ============================

\(\textit{Q2. Write an equation for the model fitted by the researches in the form of equation (8.1). Use predictor names instead of x notation.}\)

#Import the data into R
setwd("/Users/yusufsultan")
PowderyMildew <- read.csv("PowderyMildewEpidemic.csv",stringsAsFactors = FALSE)
str(PowderyMildew)

## 'data.frame':    12 obs. of  4 variables:
##  $ Year       : int  1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 ...
##  $ Outbreak   : chr  "Yes" "No" "No" "Yes" ...
##  $ MaxTemp    : num  30.1 30.7 26.3 28.4 29.6 ...
##  $ RelHumidity: num  82.9 79.6 89.1 91 80.6 ...

head(PowderyMildew)

##   Year Outbreak MaxTemp RelHumidity
## 1 1987      Yes   30.14       82.86
## 2 1988       No   30.66       79.57
## 3 1989       No   26.31       89.14
## 4 1990      Yes   28.43       91.00
## 5 1991       No   29.57       80.57
## 6 1992      Yes   31.25       67.82

tail(PowderyMildew)

##    Year Outbreak MaxTemp RelHumidity
## 7  1993       No   30.35       61.76
## 8  1994      Yes   30.71       81.14
## 9  1995       No   30.71       61.57
## 10 1996      Yes   33.07       59.76
## 11 1997       No   31.50       68.29
## 12 2000       No   29.50       79.14

#Performing the logistic regression
PowderyMildew$Outbreak <- PowderyMildew$Outbreak == "Yes"
PowderyMildew$Outbreak <- PowderyMildew$Outbreak * 1
Model= glm(Outbreak ~ MaxTemp + RelHumidity, family=binomial,data = PowderyMildew)
summary(Model)

## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = binomial, 
##     data = PowderyMildew)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7358  -0.7758  -0.1523   0.7823   1.4634  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.1554    53.4380  -1.519    0.129
## MaxTemp       2.0054     1.3232   1.515    0.130
## RelHumidity   0.2630     0.1841   1.428    0.153
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16.301  on 11  degrees of freedom
## Residual deviance: 11.196  on  9  degrees of freedom
## AIC: 17.196
## 
## Number of Fisher Scoring iterations: 5

#put the coefficients and confidence intervals onto a useful scale
exp(Model$coefficients)

##  (Intercept)      MaxTemp  RelHumidity 
## 5.683987e-36 7.428793e+00 1.300825e+00

\(\textit{The relationship between the odds of an event and the probability of an event is:}\)

\[\\{odds} = \frac{p}{1-p}\] \(\textit{or equivalenty}\) \[\\{odds} = \frac{odds}{1-odds}\] \(\textit{Logistic equation would be}\)

\[\\log(odds)= \beta_0+\beta_1⋅(Max temperature)+\beta_2⋅(Relative humidity)\]

\(\textit{So,based on the results}\)

\[\\log(odds)= -81.1554 + 2.0054⋅(Max temperature)+ 0.2630⋅(Relative humidity)\] \(\textit{where log(odds) is the probability of Response equal to 1 (Yes to outbreak)and 0 (No to outbreak)}\) \(\textit{Since p-values for both the variables are high, we conclude that the variables are not statistically significant.}\)

=========================== =========================== ======================= ============================

\(\textit{Q3. Create a scatter plot of the two predictors, using different hue for epidemic and non-epidemic markersDoes there appear to be a }\)\(\textit{relationship between epidemic status and the two predictors?}\)

# plot Relative Humidity vs. Max Temperature
plot(PowderyMildew$MaxTemp ~ PowderyMildew$RelHumidity, xlab="Relative Humidity", ylab="Max Temperature", bty="l", col= PowderyMildew$Outbreak + 1,pch=15)
legend(84,33,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)

# plot Max Temperature vs. Relative Humidity
plot(PowderyMildew$RelHumidity ~ PowderyMildew$MaxTemp, xlab="Max Temperature", ylab="Relative Humidity", bty="l", col= PowderyMildew$Outbreak + 1,pch=15)
legend(31.6,90,c("No Outbreak", "Outbreak"),col=1:2,pch=15,cex=0.9,seg.len=2)

=========================== =========================== ======================= ============================

\(\textit{Q4.Compute naive forecasts of epidemic status for years 1995-1997 using a roll-forward naive forecast}\) \(\\(F_{t+1}=F_t)\).\(\textit{What is the naive forecast for year 2000? Summarize the results for these four years in a classification matrix.}\)

naiveForecasts <- PowderyMildew$Outbreak[(length(PowderyMildew$Outbreak)-1-3):(length(PowderyMildew$Outbreak)-1)]
naiveForecasts

## [1] 1 0 1 0

\(\textit{The roll-forward naive forecasts are:}\)

*Year*	*Outbreak*	*Response*	*Max Tempreture*	*Relative Humidity*	*Naïve Forecast (Ft+1=Ft)*
1987	Yes	1	30.14	82.86
1988	No	0	30.66	79.57
1989	No	0	26.31	89.14
1990	Yes	1	28.43	91
1991	No	0	29.57	80.57
1992	Yes	1	31.25	67.82
1993	No	0	30.35	61.76
1994	Yes	1	30.71	81.14
1995	No	0	30.71	61.57	Yes
1996	Yes	1	33.07	59.76	No
1997	No	0	31.5	68.29	Yes
2000	No	0	29.5	79.14	No

======

\(\textit{The classification matrix is:}\)

library(caret)

## Warning: package 'caret' was built under R version 3.2.5

## Loading required package: lattice

## Warning: package 'lattice' was built under R version 3.2.5

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.5

confusionMatrix(naiveForecasts,PowderyMildew$Outbreak[(length(PowderyMildew$Outbreak)-3):length(PowderyMildew$Outbreak)], positive=c("1"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 1 1
##          1 2 0
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0063, 0.8059)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.9961          
##                                           
##                   Kappa : -0.5            
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 0.3333          
##          Pos Pred Value : 0.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.1667          
##                                           
##        'Positive' Class : 1               
##

\(\textit{The logistic regression's predictive accuracy for the four years(1005-1997,2000)in classifiction matrix.}\)

*Year*	*Outbreak*	*Response*	*Max Tempreture*	*Relative Humidity*	*Naïve Forecast (Ft+1=Ft)*	*Forcecast using Logistic Regression (log likehoods)*	*p/(1-p)*	*p (probability of Outbreak)*
1987	Yes	1	30.14	82.86
1988	No	0	30.66	79.57
1989	No	0	26.31	89.14
1990	Yes	1	28.43	91
1991	No	0	29.57	80.57
1992	Yes	1	31.25	67.82
1993	No	0	30.35	61.76
1994	Yes	1	30.71	81.14
1995	No	0	30.71	61.57	Yes	-3.376656	0.034162	0.033033042
1996	Yes	1	33.07	59.76	No	0.880058	2.41104	0.70683424
1997	No	0	31.5	68.29	Yes	-0.02503	0.975281	0.493742827
2000	No	0	29.5	79.14	No	-1.18228	0.306579	0.234642494

\(\textit{The Naïve forecast is correct just once out of the four cases. The logistic regression output p gives less than 0.5 probability in cases where}\)

\(\textit{ there have been no outbreaks (in actual). Hence we can say that Logistic regression gives a better forecast than the naïve forecast despite}\)

\(\textit{ being not so good model because of high p-values of explanatory variables.}\)

=========================== =========================== ======================= ============================

\(\textit{Q5.Partition the data into training and validation periods, so that years 1987-1994 are the training period.Fit a logistic regression to }\)

\(\textit{ the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995 Fit a logistic }\)

\(\textit{ regression to the training period using the two predictors, and report the outbreak probability as well as a forecast for year 1995}\)

\(\textit{(use a threshold of 0.5).}\)

trainOut <- PowderyMildew[1:8,]
#trainOut
outLogReg <- glm(Outbreak ~ MaxTemp + RelHumidity,data=trainOut,family="binomial")
summary(outLogReg)

## 
## Call:
## glm(formula = Outbreak ~ MaxTemp + RelHumidity, family = "binomial", 
##     data = trainOut)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6        7        8  
##  0.7466  -1.7276  -0.3132   1.0552  -1.1419   1.2419  -0.3908   0.6060  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -56.1543    44.4573  -1.263    0.207
## MaxTemp       1.3849     1.1406   1.214    0.225
## RelHumidity   0.1877     0.1578   1.189    0.234
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11.0904  on 7  degrees of freedom
## Residual deviance:  8.1198  on 5  degrees of freedom
## AIC: 14.12
## 
## Number of Fisher Scoring iterations: 5

\(\textit{Logistic equation would be:}\) \(\textit{The relationship between the odds of an event and the probability of an event is:}\)

\[\\{odds} = \frac{p}{1-p}\] \(\textit{or equivalenty}\) \[\\{odds} = \frac{odds}{1-odds}\] \(\textit{Logistic equation would be}\)

\[\\log(odds)= \beta_0+\beta_1⋅(Max temperature)+\beta_2⋅(Relative humidity)\]

\(\textit{So,based on the results}\)

\[\\log(odds)= -56.1543 + 1.3849⋅(Max temperature)+ 0.1877⋅(Relative humidity)\]

\(\textit{where log(odds) is the probability of Response equal to 1 (Yes to outbreak)}\) \(\textit{Since p-values for both the variables are still high, we conclude that the variables are not statistically significant.}\)

predictions <- predict(outLogReg,PowderyMildew[9:12,],type="response")
predictions

##         9        10        11        12 
## 0.1119407 0.7021411 0.5705413 0.3894790

confusionMatrix(ifelse(predictions > 0.5, 1, 0), PowderyMildew[9:12,]$Outbreak, positive=c("1"))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 2 0
##          1 1 1
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.1941, 0.9937)
##     No Information Rate : 0.75            
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : 0.5             
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.6667          
##          Pos Pred Value : 0.5000          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.2500          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
##

\(\textit{Since p-values for both the variables are still high, we conclude that the variables are not statistically significant}\)

\(\textit{Forecast for the year 1995}\)

\(\textit{We put the values of Max.Tempreture and Relative. Humidity in the above logistic regression equation and get p-( probability of outbreak)=0.234}\)

assignment_7-MBA678

Yusuf Sultan

4/19/2017