Logistic Regression : Definitions

What is Logistic Regression?



Definition

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we use the log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.

What is the Logistic Function?



This is the equation used in logistic regression: log(p/(1-p)) = β0 + β1x1 + … + βkxk. Here p/(1-p) is the odds of success. Whenever the log of the odds is positive, the probability of success is more than 50%. A typical logistic model plot is shown below; you can see the probability never goes below 0 or above 1.
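As a quick illustration (a minimal sketch, not part of the original case study), the logistic function squeezes any log-odds value into the interval (0, 1):

```r
# Logistic (sigmoid) function: maps any real-valued log-odds to (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

z <- seq(-6, 6, by = 0.1)
plot(z, logistic(z), type = "l",
     xlab = "log-odds (z)", ylab = "P(success)",
     main = "The logistic function stays between 0 and 1")
abline(h = c(0, 1), lty = 2)   # probability bounds
```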



Performance of Logistic Regression Model

AIC (Akaike Information Criterion): The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.

Null Deviance and Residual Deviance: Null deviance indicates how well the response is predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates how well the response is predicted by a model after adding the independent variables; again, the lower the value, the better the model.

Confusion Matrix: This is simply a tabular representation of actual vs. predicted values. It helps us find the accuracy of the model and avoid overfitting. It looks like this:



You can calculate various evaluation measures such as accuracy, balanced accuracy, sensitivity, specificity, recall, precision and F1 score; a small sketch follows.
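As a minimal sketch (the counts below are made up for illustration), the common measures can be computed directly from the four cells of a 2x2 confusion matrix:

```r
# Hypothetical confusion-matrix counts (illustrative numbers only)
TP <- 90; FP <- 10; FN <- 15; TN <- 85

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)            # recall / true positive rate
specificity <- TN / (TN + FP)            # true negative rate
precision   <- TP / (TP + FP)            # positive predictive value
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced    <- (sensitivity + specificity) / 2

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
  precision = precision, F1 = f1, balanced_accuracy = balanced)
```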

ASSIGNMENT: Prepare a PPT describing all these measures derived from confusion matrix.

Specificity and sensitivity play a crucial role in deriving the ROC curve.

ROC Curve: The Receiver Operating Characteristic (ROC) curve summarizes the model’s performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity).

The ROC curve is traced out by varying the probability cutoff used to classify a prediction as a success and plotting the resulting true positive rate against the false positive rate. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a widely used performance metric for the ROC curve: the higher the area under the curve, the better the predictive power of the model. Below is a sample ROC curve. The ROC of a perfect predictive model has a true positive rate of 1 and a false positive rate of 0; such a curve touches the top left corner of the graph.
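A minimal sketch with the pROC package (the response and probabilities below are random stand-ins, not the seeds data) shows how an ROC curve and its AUC are typically computed:

```r
library(pROC)

# Stand-in data: a binary outcome and some predicted probabilities
set.seed(1)
actual <- factor(sample(c("no", "yes"), 100, replace = TRUE))
probs  <- runif(100)

roc_obj <- roc(actual, probs)   # response first, predictor second
auc(roc_obj)                    # area under the curve
plot(roc_obj, main = "Sample ROC curve")
```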

Logistic Regression Example


Transforming the “seeds” case study



In the “seeds” case study we had “length_of_kernel_groove” as the dependent variable. Its summary statistics are shown below:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.519   5.045   5.224   5.409   5.877   6.550 



Let’s break the “length_of_kernel_groove” variable into two categories based on its median value of 5.224. Seeds with “length_of_kernel_groove” less than or equal to 5.224 are labeled “Small_Grain” and seeds with “length_of_kernel_groove” greater than 5.224 are labeled “Large_Grain”. These categories are saved in a new variable called “Grain_Type”. This is executed in the next code chunk, sketched here:
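A sketch mirroring the chunk in the embedded source below:

```r
# Split length_of_kernel_groove at its median (5.224) into two grain types,
# then drop the original continuous variable
seeds$Grain_Type <- as.factor(
  ifelse(seeds$length_of_kernel_groove <= 5.224, "Small_Grain", "Large_Grain")
)
seeds$length_of_kernel_groove <- NULL
table(seeds$Grain_Type)
```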


Large_Grain Small_Grain 
        104         105 

We no longer need the variable “length_of_kernel_groove”, so it is deleted from the dataset. Now our dependent variable is “Grain_Type”.

Overview of Dataset

Visualizations

Logistic Model


Call:
glm(formula = Grain_Type ~ ., family = "binomial", data = seeds[, 
    -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59319  -0.07393   0.00945   0.28840   2.35912  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)   
(Intercept)            -1.6604   108.8871  -0.015  0.98783   
A                      -5.4049     4.4534  -1.214  0.22488   
P                       3.9496     8.5199   0.464  0.64295   
C                       2.7440    66.8418   0.041  0.96725   
length_of_kernel       -9.9103     5.5305  -1.792  0.07314 . 
width_of_kernel        24.3433     9.7713   2.491  0.01273 * 
asymmetry_coefficient  -0.7992     0.2555  -3.128  0.00176 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  88.903  on 202  degrees of freedom
AIC: 102.9

Number of Fisher Scoring iterations: 8


Let us make predictions with the logistic model (a sketch of the call is shown below):
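This mirrors the chunk in the embedded source: predicted probabilities from the fitted model, classified with a 0.5 cutoff.

```r
# Predicted probabilities on the training data, then a 0.5 cutoff
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "Large_Grain", "Small_Grain")
head(predicted.classes)
```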

            1             2             3             4             5 
"Large_Grain" "Large_Grain" "Large_Grain" "Large_Grain" "Large_Grain" 
            6 
"Small_Grain" 

Assumption of Logistic Model



Linearity assumption

Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.
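A sketch mirroring the source chunk: bind the logit of the fitted probabilities to the numeric predictors and plot each predictor against it.

```r
# Plot each numeric predictor against the logit of the fitted probabilities
library(dplyr)
library(tidyr)
library(ggplot2)

mydata <- seeds[, 1:6] %>%
  mutate(logit = log(probabilities / (1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")
```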



The smoothed scatter plots show that all numeric variables are quite linearly associated with the “Grain_Type” outcome on the logit scale.

If any variable is not found to be linearly associated, it might need a transformation. If the scatter plot shows non-linearity, you need other methods to build the model, such as including quadratic or cubic terms, fractional polynomials or spline functions.

Influential values

Influential values are extreme individual data points that can alter the quality of the logistic regression model. The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values:
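A sketch mirroring the source chunks: the Cook’s distance plot with the three most extreme points labelled, followed by the corresponding rows obtained with broom::augment().

```r
library(broom)
library(dplyr)

# Cook's distance plot, labelling the 3 most extreme observations
plot(model, which = 4, id.n = 3)

# Augment the model data and pull the 3 rows with the largest Cook's distance
model.data <- augment(model) %>% mutate(index = 1:n())
model.data %>% top_n(3, .cooksd)
```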



The data for the top 3 largest values, according to Cook’s distance, can be displayed as follows:

# A tibble: 3 x 15
  Grain_Type     A     P     C length_of_kernel width_of_kernel
  <fct>      <dbl> <dbl> <dbl>            <dbl>           <dbl>
1 Small_Gra~  15.8  14.9 0.892             5.67            3.43
2 Large_Gra~  15.4  14.7 0.899             5.48            3.46
3 Small_Gra~  12.1  13.7 0.808             5.39            2.74
# ... with 9 more variables: asymmetry_coefficient <dbl>, .fitted <dbl>,
#   .se.fit <dbl>, .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>, index <int>


Filtering Potential Outliers

# A tibble: 0 x 15
# ... with 15 variables: Grain_Type <fct>, A <dbl>, P <dbl>, C <dbl>,
#   length_of_kernel <dbl>, width_of_kernel <dbl>,
#   asymmetry_coefficient <dbl>, .fitted <dbl>, .se.fit <dbl>,
#   .resid <dbl>, .hat <dbl>, .sigma <dbl>, .cooksd <dbl>,
#   .std.resid <dbl>, index <int>



When you have outliers in a continuous predictor, potential solutions include:

Removing the concerned records
Transforming the data to a log scale
Using non-parametric methods

Multicollinearity

Multicollinearity corresponds to a situation where the data contain highly correlated predictor variables. Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors:
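The source computes these with a one-liner on the fitted model:

```r
# Variance inflation factors for the fitted logistic model (car package)
car::vif(model)
```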

                    A                     P                     C 
           739.764653            466.759220             39.647943 
     length_of_kernel       width_of_kernel asymmetry_coefficient 
            14.917927             97.535047              2.274646 


As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity; when there is no collinearity, all variables have VIF values well below 5. Here several predictors (A, P and width_of_kernel in particular) have very large VIF values, so the predictors are strongly collinear.

Let us check its classification accuracy.
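A sketch mirroring the source chunk (note that the call passes the actual classes as the first argument and the 0.5-cutoff predictions as the second):

```r
# Confusion matrix for the 0.5-cutoff predictions
caret::confusionMatrix(as.factor(seeds$Grain_Type),
                       as.factor(predicted.classes),
                       positive = "Large_Grain")
```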

Confusion Matrix and Statistics

             Reference
Prediction    Large_Grain Small_Grain
  Large_Grain          13          91
  Small_Grain          97           8
                                          
               Accuracy : 0.1005          
                 95% CI : (0.0633, 0.1495)
    No Information Rate : 0.5263          
    P-Value [Acc > NIR] : 1.0000          
                                          
                  Kappa : -0.7986         
                                          
 Mcnemar's Test P-Value : 0.7154          
                                          
            Sensitivity : 0.11818         
            Specificity : 0.08081         
         Pos Pred Value : 0.12500         
         Neg Pred Value : 0.07619         
             Prevalence : 0.52632         
         Detection Rate : 0.06220         
   Detection Prevalence : 0.49761         
      Balanced Accuracy : 0.09949         
                                          
       'Positive' Class : Large_Grain     
                                          


ROC Curve


Call:
roc.default(response = seeds$Grain_Type, predictor = probabilities)

Data: probabilities in 104 controls (seeds$Grain_Type Large_Grain) < 105 cases (seeds$Grain_Type Small_Grain).
Area under the curve: 0.9723



ROC Curve of Logistic Model

ROC Curve of Logistic Model



Plot of ROC Curve

Stepwise Regression


What is Stepwise Regression



Definition

In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.

Typical Workflow of Stepwise Regression is as follows—



Main approaches of Stepwise Regression

The main approaches are:

Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.

Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.

Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.

Drawback of Stepwise Regression

Overfitting Data

One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in sample than it does on new out-of-sample data.

Solution 1: This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough.

Solution 2: A way to test for errors in models created by step-wise regression is to not rely on the model’s F-statistic, significance, or multiple R, but instead assess the model against a set of data that was not used to create the model. This is often done by building a model based on a sample of the dataset available (e.g., 70%) – the “training set” – and using the remainder of the dataset (e.g., 30%) as a validation set to assess the accuracy of the model. Accuracy is then often measured as the actual standard error (SE), MAPE (Mean absolute percentage error), or mean error between the predicted value and the actual value in the hold-out sample.[15] This method is particularly valuable when data are collected in different settings (e.g., different times, social vs. solitary situations) or when models are assumed to be generalizable. A sketch of such a hold-out check follows.
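A minimal sketch of the hold-out idea, assuming a data frame dat with a two-level factor outcome y (both names are hypothetical, not from the seeds case study):

```r
# 70/30 train-validation split, stepwise selection on the training set only
set.seed(42)
idx   <- sample(seq_len(nrow(dat)), size = floor(0.7 * nrow(dat)))
train <- dat[idx, ]
test  <- dat[-idx, ]

fit   <- step(glm(y ~ ., data = train, family = "binomial"), trace = 0)
probs <- predict(fit, newdata = test, type = "response")

# Hold-out accuracy with a 0.5 cutoff (second factor level = "success")
pred  <- ifelse(probs > 0.5, levels(test$y)[2], levels(test$y)[1])
mean(pred == test$y)
```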

Stepwise Regression Example



We saw in the logistic regression case study (MENU = Logistic Regression Example) how we classified wheat seed type as Large or Small. In the model diagnostics we could see that the model’s classification accuracy was very low.

With stepwise regression, we will attempt to improve the model accuracy. This can be done with the step() command in R; we select the variable selection method “both”, which means it chooses the best model from those obtained by forward selection and backward elimination (sketch below).
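A sketch mirroring the embedded source chunk (step() takes the selection method via its direction argument; the summary printed afterwards is of the original full model object):

```r
# Bidirectional stepwise selection starting from the full logistic model
model_step <- step(model, direction = "both")

# Summary of the original full model object, as in the source
summary(model)
```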

Start:  AIC=102.9
Grain_Type ~ A + P + C + length_of_kernel + width_of_kernel + 
    asymmetry_coefficient

                        Df Deviance    AIC
- C                      1   88.905 100.91
- P                      1   89.124 101.12
- A                      1   90.577 102.58
<none>                       88.903 102.90
- length_of_kernel       1   92.402 104.40
- width_of_kernel        1   95.900 107.90
- asymmetry_coefficient  1  101.090 113.09

Step:  AIC=100.91
Grain_Type ~ A + P + length_of_kernel + width_of_kernel + asymmetry_coefficient

                        Df Deviance     AIC
- P                      1   89.285  99.285
<none>                       88.905 100.905
- A                      1   91.458 101.458
- length_of_kernel       1   92.537 102.537
- width_of_kernel        1   97.742 107.742
- asymmetry_coefficient  1  103.431 113.431

Step:  AIC=99.28
Grain_Type ~ A + length_of_kernel + width_of_kernel + asymmetry_coefficient

                        Df Deviance     AIC
<none>                       89.285  99.285
- length_of_kernel       1   93.225 101.225
- A                      1   95.353 103.353
- width_of_kernel        1  100.170 108.170
- asymmetry_coefficient  1  103.435 111.435

Call:
glm(formula = Grain_Type ~ ., family = "binomial", data = seeds[, 
    -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59319  -0.07393   0.00945   0.28840   2.35912  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)   
(Intercept)            -1.6604   108.8871  -0.015  0.98783   
A                      -5.4049     4.4534  -1.214  0.22488   
P                       3.9496     8.5199   0.464  0.64295   
C                       2.7440    66.8418   0.041  0.96725   
length_of_kernel       -9.9103     5.5305  -1.792  0.07314 . 
width_of_kernel        24.3433     9.7713   2.491  0.01273 * 
asymmetry_coefficient  -0.7992     0.2555  -3.128  0.00176 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  88.903  on 202  degrees of freedom
AIC: 102.9

Number of Fisher Scoring iterations: 8



We can observe that the model has not improved, as the null deviance, residual deviance and AIC are the same as in the previous model.

P-Value Based Filtering of Variables

As the automated method does not help here, we will do the stepwise selection manually.

0. Check whether any predictor(s) in the model have a corresponding p-value above the 0.05 significance level. Ignore the intercept.

1. We will develop a model with all predictor variables.

2. Now select the variable which has the highest p-value for its regression coefficient and remove it from the model.

3. Rerun the model with the remaining variables and repeat steps 1, 2 and 3 until all remaining predictor variables have p-values below the 0.05 significance level. A sketch of one round of this loop is shown below.
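One round of the loop, sketched with update() as a shorthand for the explicit glm(Grain_Type ~ . - C, ...) refit used in the embedded source:

```r
# Fit the full model, inspect p-values, drop the worst predictor, refit
model <- glm(Grain_Type ~ ., family = "binomial", data = seeds[, -7])
summary(model)                      # C has the largest p-value here

model <- update(model, . ~ . - C)   # drop C and refit
summary(model)                      # repeat until all p-values are below 0.05
```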


Call:
glm(formula = Grain_Type ~ ., family = "binomial", data = seeds[, 
    -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59319  -0.07393   0.00945   0.28840   2.35912  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)   
(Intercept)            -1.6604   108.8871  -0.015  0.98783   
A                      -5.4049     4.4534  -1.214  0.22488   
P                       3.9496     8.5199   0.464  0.64295   
C                       2.7440    66.8418   0.041  0.96725   
length_of_kernel       -9.9103     5.5305  -1.792  0.07314 . 
width_of_kernel        24.3433     9.7713   2.491  0.01273 * 
asymmetry_coefficient  -0.7992     0.2555  -3.128  0.00176 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  88.903  on 202  degrees of freedom
AIC: 102.9

Number of Fisher Scoring iterations: 8

We can observe that predictor C has a p-value of 0.96, much higher than the significance level, so we will remove it from the model and rerun it.


Call:
glm(formula = Grain_Type ~ . - C, family = "binomial", data = seeds[, 
    -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.59826  -0.07421   0.00941   0.28678   2.36052  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)             2.3954    45.5585   0.053 0.958067    
A                      -5.2875     3.4006  -1.555 0.119972    
P                       3.7016     5.9903   0.618 0.536617    
length_of_kernel       -9.9567     5.4184  -1.838 0.066125 .  
width_of_kernel        24.4936     9.0601   2.703 0.006862 ** 
asymmetry_coefficient  -0.8026     0.2424  -3.311 0.000929 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  88.905  on 203  degrees of freedom
AIC: 100.91

Number of Fisher Scoring iterations: 8

We can observe that predictor P has a p-value of 0.53, much higher than the significance level, so we will remove it from the model and rerun it.


Call:
glm(formula = Grain_Type ~ . - C - P, family = "binomial", data = seeds[, 
    -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.58688  -0.08462   0.01013   0.28961   2.29300  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)            27.4648    20.9791   1.309 0.190484    
A                      -3.4265     1.4815  -2.313 0.020729 *  
length_of_kernel       -7.8068     4.0866  -1.910 0.056088 .  
width_of_kernel        21.1553     7.0734   2.991 0.002782 ** 
asymmetry_coefficient  -0.7689     0.2316  -3.320 0.000901 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  89.285  on 204  degrees of freedom
AIC: 99.285

Number of Fisher Scoring iterations: 8

We can observe that predictor length_of_kernel has a p-value of 0.056, higher than the significance level, so we will remove it from the model and rerun it.


Call:
glm(formula = Grain_Type ~ . - C - P - length_of_kernel, family = "binomial", 
    data = seeds[, -7])

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.47195  -0.07417   0.01464   0.35263   2.51428  

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -11.7515     4.9101  -2.393 0.016696 *  
A                      -5.7512     0.9973  -5.767 8.08e-09 ***
width_of_kernel        30.2629     5.6153   5.389 7.07e-08 ***
asymmetry_coefficient  -0.8110     0.2279  -3.559 0.000373 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 289.731  on 208  degrees of freedom
Residual deviance:  93.225  on 205  degrees of freedom
AIC: 101.22

Number of Fisher Scoring iterations: 8

The final model has only three predictor variables, all with p-values below the 0.05 significance level.

Predictions



First, we calculate the predicted probabilities for the binary outcome. The probabilities range between 0 and 1 (sketch below).
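A sketch mirroring the source chunk, using the reduced three-predictor model:

```r
# Predicted probabilities from the reduced model
# (columns 7 and 8 hold Variety_of_wheat and Grain_Type, not predictors)
probs <- predict(model, newdata = seeds[, -c(7, 8)], type = "response")
head(probs)
```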

        1         2         3         4         5         6 
0.9377376 0.9923330 0.9998860 0.8930782 0.9777116 0.3735170 



Now we form two categories from the probabilities using a 0.5 cutoff: probabilities greater than 0.5 are classified as “Large_Grain” and probabilities less than or equal to 0.5 as “Small_Grain”, as sketched below.
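Mirroring the source chunk:

```r
# Classify with the 0.5 cutoff
preds <- as.factor(ifelse(probs > 0.5, "Large_Grain", "Small_Grain"))
```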

Confusion Matrix

Now all remaining predictor variables (A, width_of_kernel and asymmetry_coefficient) have corresponding p-values below 0.05. We can take this model and evaluate its performance through confusionMatrix (sketch below).
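Mirroring the source chunk:

```r
# Confusion matrix for the reduced model's 0.5-cutoff predictions
caret::confusionMatrix(preds, as.factor(seeds$Grain_Type),
                       positive = "Large_Grain")
```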

Confusion Matrix and Statistics

             Reference
Prediction    Large_Grain Small_Grain
  Large_Grain          14          97
  Small_Grain          90           8
                                         
               Accuracy : 0.1053         
                 95% CI : (0.0672, 0.155)
    No Information Rate : 0.5024         
    P-Value [Acc > NIR] : 1.0000         
                                         
                  Kappa : -0.7889        
                                         
 Mcnemar's Test P-Value : 0.6608         
                                         
            Sensitivity : 0.13462        
            Specificity : 0.07619        
         Pos Pred Value : 0.12613        
         Neg Pred Value : 0.08163        
             Prevalence : 0.49761        
         Detection Rate : 0.06699        
   Detection Prevalence : 0.53110        
      Balanced Accuracy : 0.10540        
                                         
       'Positive' Class : Large_Grain    
                                         



The classification accuracy is abysmally low at about 0.10. We need to find a better cutoff point for classifying the probabilities.

ROC Curve

Let us calculate the AUC and draw the ROC curve to find an optimal cutoff point for prediction (sketch below).
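A sketch mirroring the source chunk:

```r
# AUC and ROC curve for the reduced model's probabilities
library(pROC)
auc(seeds$Grain_Type, probs)
plot(roc(seeds$Grain_Type, probs, direction = "<", percent = TRUE, ci = TRUE),
     col = "blue", lwd = 2, main = "AUC Curve")
```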

Area under the curve: 0.9702

ConfusionMatrix with New Cutoff

So let us try a new cutoff of 0.97 and see if the confusion-matrix performance improves (sketch below).
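Mirroring the source chunk (note that with this rule, probabilities below 0.97 are labelled Large_Grain):

```r
# Reclassify with the 0.97 cutoff and re-evaluate
preds <- as.factor(ifelse(probs < 0.97, "Large_Grain", "Small_Grain"))
caret::confusionMatrix(preds, as.factor(seeds$Grain_Type),
                       positive = "Large_Grain")
```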

Confusion Matrix and Statistics

             Reference
Prediction    Large_Grain Small_Grain
  Large_Grain         104          67
  Small_Grain           0          38
                                          
               Accuracy : 0.6794          
                 95% CI : (0.6115, 0.7421)
    No Information Rate : 0.5024          
    P-Value [Acc > NIR] : 1.662e-07       
                                          
                  Kappa : 0.3608          
                                          
 Mcnemar's Test P-Value : 7.433e-16       
                                          
            Sensitivity : 1.0000          
            Specificity : 0.3619          
         Pos Pred Value : 0.6082          
         Neg Pred Value : 1.0000          
             Prevalence : 0.4976          
         Detection Rate : 0.4976          
   Detection Prevalence : 0.8182          
      Balanced Accuracy : 0.6810          
                                          
       'Positive' Class : Large_Grain     
                                          



Wow, the balanced accuracy has jumped from 0.10 to 0.68. We can try many other cutoff points to see if classification accuracy improves.

---
title: "Logistic Regression & Stepwise Regression"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    social : ["facebook","twitter","menu"]
    source_code : embed
---

```{r setup, include=FALSE}
library(flexdashboard)
```

Logistic Regression : Definitions {data-navmenu="MENU"}
====================================

### What is Logistic Regression?


Definition

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable.

You can also think of logistic regression as a special case of linear regression when the outcome variable is categorical, where we use the log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function.

What is the Logistic Function?



This is the equation used in logistic regression: log(p/(1-p)) = β0 + β1x1 + … + βkxk. Here p/(1-p) is the odds of success. Whenever the log of the odds is positive, the probability of success is more than 50%. A typical logistic model plot is shown below; you can see the probability never goes below 0 or above 1.



Performance of Logistic Regression Model

AIC (Akaike Information Criterion): The analogue of adjusted R² in logistic regression is AIC. AIC is a measure of fit that penalizes the model for the number of model coefficients. Therefore, we always prefer the model with the minimum AIC value.

Null Deviance and Residual Deviance: Null deviance indicates how well the response is predicted by a model with nothing but an intercept; the lower the value, the better the model. Residual deviance indicates how well the response is predicted by a model after adding the independent variables; again, the lower the value, the better the model.

Confusion Matrix: This is simply a tabular representation of actual vs. predicted values. It helps us find the accuracy of the model and avoid overfitting. It looks like this:



You can calculate various evaluation measures such as accuracy, balanced accuracy, sensitivity, specificity, recall, precision and F1 score.

ASSIGNMENT: Prepare a PPT describing all these measures derived from confusion matrix.

Specificity and sensitivity play a crucial role in deriving the ROC curve.

ROC Curve: The Receiver Operating Characteristic (ROC) curve summarizes the model's performance by evaluating the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity).

The ROC curve is traced out by varying the probability cutoff used to classify a prediction as a success and plotting the resulting true positive rate against the false positive rate. The area under the curve (AUC), also referred to as the index of accuracy (A) or concordance index, is a widely used performance metric for the ROC curve: the higher the area under the curve, the better the predictive power of the model. Below is a sample ROC curve. The ROC of a perfect predictive model has a true positive rate of 1 and a false positive rate of 0; such a curve touches the top left corner of the graph.

Logistic Regression Example {data-navmenu="MENU"}
====================================

Column {.tabset}
---------------------------------------

### Transforming the "seeds" case study

In the "seeds" case study we had "length_of_kernel_groove" as the dependent variable. Its summary statistics are shown below:

```{r echo=TRUE}
seeds=read.delim("seeds_dataset.txt")
names(seeds)=c("A","P","C","length_of_kernel","width_of_kernel","asymmetry_coefficient","length_of_kernel_groove","Variety_of_wheat")
summary(seeds$length_of_kernel_groove)
```

Let's break the "length_of_kernel_groove" variable into two categories based on its median value of 5.224. Seeds with "length_of_kernel_groove" less than or equal to 5.224 are labeled "Small_Grain" and seeds with "length_of_kernel_groove" greater than 5.224 are labeled "Large_Grain". These categories are saved in a new variable called "Grain_Type". This is executed in the next code chunk---

```{r echo=TRUE}
seeds$Grain_Type=as.factor(ifelse(seeds$length_of_kernel_groove<=5.224,"Small_Grain","Large_Grain"))
seeds$length_of_kernel_groove=NULL
table(seeds$Grain_Type)
```
We no longer need the variable "length_of_kernel_groove", so it is deleted from the dataset. Now our dependent variable is "Grain_Type".

### Overview of Dataset

```{r}
DT::datatable(seeds, filter="top")
```

### Visualizations

```{r}
library(ggplot2)
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = A)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = P)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = C)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = length_of_kernel)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = width_of_kernel)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
ggplot(data = seeds, mapping = aes(x = Grain_Type, y = asymmetry_coefficient)) +
  geom_boxplot(alpha = 0) + geom_jitter(alpha = 0.9, color = "tomato")
```

### Logistic Model

```{r echo=FALSE}
model=glm(Grain_Type~.,data=seeds[,-7], family="binomial")
summary(model)
```

Let us make predictions with the logistic model:

```{r echo=TRUE}
probabilities <- predict(model, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, "Large_Grain", "Small_Grain")
head(predicted.classes)
```

### Assumption of Logistic Model

Linearity assumption

Here, we’ll check the linear relationship between continuous predictor variables and the logit of the outcome. This can be done by visually inspecting the scatter plot between each predictor and the logit values.

```{r echo=TRUE}
seeds$Variety_of_wheat=as.factor((seeds$Variety_of_wheat))
library(tidyr)
library(dplyr)
mydata=seeds[,1:6]
predictors <- colnames(mydata)
# Bind the logit and tidy the data for plotting
mydata <- mydata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")
```

The smoothed scatter plots show that all numeric variables are quite linearly associated with the "Grain_Type" outcome on the logit scale.

If any variable is not found to be linearly associated, it might need a transformation. If the scatter plot shows non-linearity, you need other methods to build the model, such as including quadratic or cubic terms, fractional polynomials or spline functions.

Influential values

Influential values are extreme individual data points that can alter the quality of the logistic regression model. The most extreme values in the data can be examined by visualizing the Cook’s distance values. Here we label the top 3 largest values:

```{r echo=TRUE}
plot(model, which = 4, id.n = 3)
```

The data for the top 3 largest values, according to Cook’s distance, can be displayed as follows:

```{r echo=TRUE}
library(broom)
model.data <- augment(model) %>% mutate(index = 1:n())
model.data %>% top_n(3, .cooksd)
# Plot the standardized residuals:
ggplot(model.data, aes(index, .std.resid)) +
  geom_point(aes(color = Grain_Type), alpha = .5) +
  theme_bw()
```

Filtering Potential Outliers

```{r echo=TRUE}
model.data %>% filter(abs(.std.resid) > 3)
```

When you have outliers in a continuous predictor, potential solutions include:

Removing the concerned records
Transforming the data to a log scale
Using non-parametric methods

Multicollinearity

Multicollinearity corresponds to a situation where the data contain highly correlated predictor variables. Multicollinearity is an important issue in regression analysis and should be fixed by removing the concerned variables. It can be assessed using the R function vif() [car package], which computes the variance inflation factors:

```{r echo=TRUE}
car::vif(model)
```

As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity; when there is no collinearity, all variables have VIF values well below 5. Here several predictors (A, P and width_of_kernel in particular) have very large VIF values, so the predictors are strongly collinear.

Let us check its classification accuracy.

```{r echo=TRUE}
caret::confusionMatrix(as.factor(seeds$Grain_Type), as.factor(predicted.classes), positive="Large_Grain")
```

ROC Curve

```{r echo=TRUE}
library(pROC)
roc_obj=roc(seeds$Grain_Type, probabilities)
roc_obj
```

ROC Curve of Logistic Model {data-navmenu="MENU"}
============================================

### ROC Curve of Logistic Model

Plot of ROC Curve

```{r}
plot(roc(seeds$Grain_Type, probabilities, direction="<", percent = TRUE, ci=TRUE),
     col="blue", lwd=2, main="AUC Curve")
```

Stepwise Regression {data-navmenu="MENU"}
====================================================

Column {.tabset}
-------------------------------------------------

### What is Stepwise Regression

Definition

In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some prespecified criterion.

Typical Workflow of Stepwise Regression is as follows---



Main approaches of Stepwise Regression

The main approaches are:

Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model fit criterion, adding the variable (if any) whose inclusion gives the most statistically significant improvement of the fit, and repeating this process until none improves the model to a statistically significant extent.

Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion, deleting the variable (if any) whose loss gives the most statistically insignificant deterioration of the model fit, and repeating this process until no further variables can be deleted without a statistically significant loss of fit.

Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.

Drawback of Stepwise Regression

Overfitting Data

One of the main issues with stepwise regression is that it searches a large space of possible models. Hence it is prone to overfitting the data. In other words, stepwise regression will often fit much better in sample than it does on new out-of-sample data.

Solution 1: This problem can be mitigated if the criterion for adding (or deleting) a variable is stiff enough.

Solution 2: A way to test for errors in models created by step-wise regression is to not rely on the model's F-statistic, significance, or multiple R, but instead assess the model against a set of data that was not used to create the model. This is often done by building a model based on a sample of the dataset available (e.g., 70%) – the “training set” – and using the remainder of the dataset (e.g., 30%) as a validation set to assess the accuracy of the model. Accuracy is then often measured as the actual standard error (SE), MAPE (Mean absolute percentage error), or mean error between the predicted value and the actual value in the hold-out sample.[15] This method is particularly valuable when data are collected in different settings (e.g., different times, social vs. solitary situations) or when models are assumed to be generalizable.

### Stepwise Regression Example

We saw in the logistic regression case study (MENU = Logistic Regression Example) how we classified wheat seed type as Large or Small. In the model diagnostics we could see that the model's classification accuracy was very low.

With stepwise regression, we will attempt to improve the model accuracy. This can be done with the step() command in R; we select the variable selection method "both", which means it chooses the best model from those obtained by forward selection and backward elimination.

```{r echo=TRUE}
library(car)
model_step=step(model, direction="both")
summary(model)
```

We can observe that the model has not improved, as the null deviance, residual deviance and AIC are the same as in the previous model.

P-Value Based Filtering of Variables

As the automated method does not help here, we will do the stepwise selection manually.

0. Check whether any predictor(s) in the model have a corresponding p-value above the 0.05 significance level. Ignore the intercept.

1. We will develop a model with all predictor variables.

2. Now select the variable which has the highest p-value for its regression coefficient and remove it from the model.

3. Rerun the model with the remaining variables and repeat steps 1, 2 and 3 until all remaining predictor variables have p-values below the 0.05 significance level.

```{r echo=TRUE}
summary(model)
```

We can observe that predictor C has a p-value of 0.96, much higher than the significance level, so we will remove it from the model and rerun it.

```{r echo=TRUE}
model=glm(Grain_Type ~ .-C, family = "binomial", data = seeds[, -7])
summary(model)
```

We can observe that predictor P has a p-value of 0.53, much higher than the significance level, so we will remove it from the model and rerun it.

```{r echo=TRUE}
model=glm(Grain_Type ~ .-C-P, family = "binomial", data = seeds[, -7])
summary(model)
```

We can observe that predictor length_of_kernel has a p-value of 0.056, higher than the significance level, so we will remove it from the model and rerun it.

```{r echo=TRUE}
model=glm(Grain_Type ~ .-C-P-length_of_kernel, family = "binomial", data = seeds[, -7])
summary(model)
```

The final model has only three predictor variables, all with p-values below the 0.05 significance level.

### Predictions

First, we calculate the predicted probabilities for the binary outcome. The probabilities range between 0 and 1.

```{r echo=TRUE}
probs=predict(model, newdata = seeds[,-c(7,8)], type="response")
head(probs)
```

Now we form two categories from the probabilities using a 0.5 cutoff: probabilities greater than 0.5 are classified as "Large_Grain" and probabilities less than or equal to 0.5 as "Small_Grain".

```{r echo=FALSE}
preds=as.factor(ifelse(probs>0.5, "Large_Grain","Small_Grain"))
```

### Confusion Matrix

Now all remaining predictor variables (A, width_of_kernel and asymmetry_coefficient) have corresponding p-values below 0.05. We can take this model and evaluate its performance through confusionMatrix.

```{r echo=TRUE}
caret::confusionMatrix(preds,as.factor(seeds$Grain_Type), positive="Large_Grain")
```

The classification accuracy is abysmally low at about 0.10. We need to find a better cutoff point for classifying the probabilities.

### ROC Curve

Let us calculate the AUC and draw the ROC curve to find an optimal cutoff point for prediction.

```{r}
library(pROC)
probs=predict(model, newdata = seeds[,-c(7,8)], type="response")
preds=as.factor(ifelse(probs>0.5, "Large_Grain","Small_Grain"))
auc(seeds$Grain_Type, probs)
plot(roc(seeds$Grain_Type, probs, direction="<", percent = TRUE, ci=TRUE),
     col="blue", lwd=2, main="AUC Curve")
```

### ConfusionMatrix with New Cutoff

So let us try a new cutoff of 0.97 and see if the confusion-matrix performance improves.

```{r echo=TRUE}
preds=as.factor(ifelse(probs<0.97, "Large_Grain","Small_Grain"))
caret::confusionMatrix(preds,as.factor(seeds$Grain_Type), positive="Large_Grain")
```

Wow, the balanced accuracy has jumped from 0.10 to 0.68. We can try many other cutoff points to see if classification accuracy improves.