Concept

Discriminant analysis and logistic regression both use a nonmetric (categorical) dependent variable. Logistic regression represents the two groups of interest as a binary variable with values 0 and 1, whereas discriminant analysis can also handle a dependent variable with more than two categories. Logistic regression may be preferred for two reasons:

Comparison with multiple regression

Multiple Regression             Logistic Regression
Total sum of squares            -2LL of base model
Error sum of squares            -2LL of proposed model
Regression sum of squares       Difference in -2LL between base and proposed models
F test of model fit             Chi-square test of the -2LL difference
Coefficient of determination    "Pseudo" R2 measures

Estimating the Logistic Regression Model

Similar to multiple regression, logistic regression has a single variate composed of estimated coefficients for each independent variable, but this variate is estimated in a different manner. In multiple regression we estimate the linear relationship that best fits the data, typically by least squares. In logistic regression we likewise predict the dependent variable from a variate composed of logistic coefficients, but the coefficients are estimated by maximum likelihood and the variate is passed through the logistic transformation. As a result, the predicted values of a logistic regression can never fall outside the range of 0 to 1.

Estimating the Coefficients

\[ Logit_i = \ln \left( \frac{prob_{event}}{1 - prob_{event}} \right) = b_0 + b_1X_1 + \dots + b_nX_n \]

\[ Odds_i = \left( \frac{prob_{event}}{1 - prob_{event}} \right) = e^{b_0 + b_1X_1 + \dots + b_nX_n} \]
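The two equations can be inverted to recover the probability itself, which is why fitted values stay between 0 and 1. The following is a minimal sketch in base R; the values in `z` are arbitrary illustrative inputs for the variate, not estimates from the hbat data.

```r
# Arbitrary values of the variate b0 + b1*X1 + ... (illustrative only)
z <- c(-5, -2, 0, 2, 5)

odds <- exp(z)               # odds implied by each logit
prob <- odds / (1 + odds)    # probability implied by each odds value

# plogis() is base R's inverse logit; qlogis() is the logit itself
stopifnot(isTRUE(all.equal(prob, plogis(z))))
stopifnot(isTRUE(all.equal(qlogis(prob), z)))

# However large or small the logit, the probability stays inside (0, 1)
stopifnot(all(prob > 0 & prob < 1))
print(round(prob, 4))
```

A logit of 0 corresponds to odds of 1 and a probability of 0.5; negative logits give probabilities below 0.5 and positive logits give probabilities above it.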

R Tutorial

Import Data

Import data from your local drive

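# On Windows, replace "\" with "/" in the file path
# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0)
hbat <- read.csv("C:/Users/asus/Google Drive/Ilham Fadhil/Tutor/Advanced Statistics/2017-Advanced_statistic/Week 2/hbat.csv", header = TRUE, stringsAsFactors = TRUE)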

Data Examination

str(hbat)
## 'data.frame':    100 obs. of  26 variables:
##  $ X  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ x1 : Factor w/ 3 levels "1 to 5 years",..: 1 3 3 2 1 2 2 1 1 2 ...
##  $ x2 : Factor w/ 2 levels "Magazine industry",..: 1 2 1 2 1 2 2 1 2 1 ...
##  $ x3 : Factor w/ 2 levels "Large (500+)",..: 1 2 1 1 1 2 1 1 1 1 ...
##  $ x4 : Factor w/ 2 levels "Outside North America",..: 1 2 1 1 2 1 1 1 1 1 ...
##  $ x5 : Factor w/ 2 levels "Direct to customer",..: 1 2 1 2 1 2 2 2 2 2 ...
##  $ x6 : num  8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
##  $ x7 : num  3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
##  $ x8 : num  2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
##  $ x9 : num  5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
##  $ x10: num  4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
##  $ x11: num  4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
##  $ x12: num  6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
##  $ x13: num  6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
##  $ x14: num  4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
##  $ x15: num  4.3 4 4.6 3.6 4.5 9.5 2.5 4.8 4.4 5.3 ...
##  $ x16: num  5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
##  $ x17: num  5.1 4.3 4 4.1 3.5 4.7 4.2 6.3 6.1 5.8 ...
##  $ x18: num  3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
##  $ x19: num  8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
##  $ x20: num  8 6.5 8.4 6 6.6 6.3 7.8 5.8 7.5 5.9 ...
##  $ x21: num  8.4 7.5 9 7.2 9 6.1 7.2 7.7 8.2 6.7 ...
##  $ x22: num  65.1 67.1 72.1 40.1 57.1 50.1 41.1 56.1 56.1 59.1 ...
##  $ x23: Factor w/ 2 levels "No, would not consider",..: 2 1 2 1 1 1 1 1 2 1 ...
##  $ X24: int  2 1 3 1 2 3 1 3 3 2 ...
summary(hbat)
##        X                id                        x1    
##  Min.   :  1.00   Min.   :  1.00   1 to 5 years    :35  
##  1st Qu.: 25.75   1st Qu.: 25.75   Less than 1 year:32  
##  Median : 50.50   Median : 50.50   Over 5 years    :33  
##  Mean   : 50.50   Mean   : 50.50                        
##  3rd Qu.: 75.25   3rd Qu.: 75.25                        
##  Max.   :100.00   Max.   :100.00                        
##                   x2                    x3                         x4    
##  Magazine industry :52   Large (500+)    :51   Outside North America:61  
##  Newsprint industry:48   Small (0 to 499):49   USA/North America    :39  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                        x5           x6               x7       
##  Direct to customer     :43   Min.   : 5.000   Min.   :2.200  
##  Indirect through broker:57   1st Qu.: 6.575   1st Qu.:3.275  
##                               Median : 8.000   Median :3.600  
##                               Mean   : 7.810   Mean   :3.672  
##                               3rd Qu.: 9.100   3rd Qu.:3.925  
##                               Max.   :10.000   Max.   :5.700  
##        x8              x9             x10             x11       
##  Min.   :1.300   Min.   :2.600   Min.   :1.900   Min.   :2.300  
##  1st Qu.:4.250   1st Qu.:4.600   1st Qu.:3.175   1st Qu.:4.700  
##  Median :5.400   Median :5.450   Median :4.000   Median :5.750  
##  Mean   :5.365   Mean   :5.442   Mean   :4.010   Mean   :5.805  
##  3rd Qu.:6.625   3rd Qu.:6.325   3rd Qu.:4.800   3rd Qu.:6.800  
##  Max.   :8.500   Max.   :7.800   Max.   :6.500   Max.   :8.400  
##       x12             x13             x14             x15      
##  Min.   :2.900   Min.   :3.700   Min.   :4.100   Min.   :1.70  
##  1st Qu.:4.500   1st Qu.:5.875   1st Qu.:5.400   1st Qu.:4.10  
##  Median :4.900   Median :7.100   Median :6.100   Median :5.00  
##  Mean   :5.123   Mean   :6.974   Mean   :6.043   Mean   :5.15  
##  3rd Qu.:5.800   3rd Qu.:8.400   3rd Qu.:6.600   3rd Qu.:6.30  
##  Max.   :8.200   Max.   :9.900   Max.   :8.100   Max.   :9.50  
##       x16             x17            x18             x19       
##  Min.   :2.000   Min.   :2.60   Min.   :1.600   Min.   :4.700  
##  1st Qu.:3.700   1st Qu.:3.70   1st Qu.:3.400   1st Qu.:6.000  
##  Median :4.400   Median :4.35   Median :3.900   Median :7.050  
##  Mean   :4.278   Mean   :4.61   Mean   :3.886   Mean   :6.918  
##  3rd Qu.:4.800   3rd Qu.:5.60   3rd Qu.:4.425   3rd Qu.:7.625  
##  Max.   :6.700   Max.   :7.30   Max.   :5.500   Max.   :9.900  
##       x20            x21             x22       
##  Min.   :4.60   Min.   :5.500   Min.   :37.10  
##  1st Qu.:6.30   1st Qu.:7.100   1st Qu.:51.10  
##  Median :7.00   Median :7.700   Median :58.60  
##  Mean   :7.02   Mean   :7.713   Mean   :58.40  
##  3rd Qu.:7.60   3rd Qu.:8.400   3rd Qu.:65.35  
##  Max.   :9.90   Max.   :9.900   Max.   :77.10  
##                      x23          X24       
##  No, would not consider:55   Min.   : 1.00  
##  Yes, would consider   :45   1st Qu.: 1.00  
##                              Median : 2.00  
##                              Mean   : 2.74  
##                              3rd Qu.: 3.00  
##                              Max.   :10.00

The hbat data frame has 26 columns: a row index (X), an identifier (id), and 24 variables (6 nonmetric and 18 metric). For logistic regression, the dependent variable needs to be dichotomous, so we cannot use \(X_1\), which has three categories. In this analysis we will use \(X_4\) (region of the customer location). The analysis will determine which variables best distinguish between the two regions (USA/North America versus outside North America).

colSums(is.na(hbat))
##   X  id  x1  x2  x3  x4  x5  x6  x7  x8  x9 x10 x11 x12 x13 x14 x15 x16 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
## x17 x18 x19 x20 x21 x22 x23 X24 
##   0   0   0   0   0   0   0   0

The hbat data contain no missing values, so we can continue the analysis without removing or imputing data.

Logistic Regression

For this exercise we will use \(X_{13}\) (competitive pricing) and \(X_{17}\) (price flexibility) to classify and predict the region of the customer location. These variables were selected after extensive discussion with the marketing team, and we now want to build and validate the model. The first step is to build the logistic model in R.

mylogit <- glm(x4 ~ x13 + x17, data = hbat, family = binomial)
summary(mylogit)
## 
## Call:
## glm(formula = x4 ~ x13 + x17, family = binomial, data = hbat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4354  -0.4569  -0.1785   0.6381   1.9908  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  10.1888     1.9583   5.203 1.96e-07 ***
## x13          -0.6332     0.2162  -2.929   0.0034 ** 
## x17          -1.4755     0.3783  -3.900 9.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 133.750  on 99  degrees of freedom
## Residual deviance:  74.927  on 97  degrees of freedom
## AIC: 80.927
## 
## Number of Fisher Scoring iterations: 6

Logistic equation:

\[ Logit_{x4} = \ln \left( \frac{prob_{\text{USA/North America}}}{1 - prob_{\text{USA/North America}}} \right) = 10.1888 - 0.6332X_{13} - 1.4755X_{17} \]

The model output shows the coefficients, their standard errors, and the z-statistics (often called Wald z-statistics). Because x4 is a factor, glm treats its first level ("Outside North America") as the reference category and models the probability of the second level ("USA/North America"). Each logistic regression coefficient gives the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the other predictor constant.

  • For every one-unit increase in competitive pricing (X13), the log odds of USA/North America (versus outside North America) decrease by 0.6332

  • For every one-unit increase in price flexibility (X17), the log odds of USA/North America (versus outside North America) decrease by 1.4755

\[ Odds_{X4} = \left( \frac{prob_{\text{USA/North America}}}{1 - prob_{\text{USA/North America}}} \right) = e^{10.1888 - 0.6332X_{13} - 1.4755X_{17}}\]
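Exponentiating a coefficient turns a change in log odds into an odds ratio, and the full equation can be turned into a predicted probability. The sketch below uses the coefficients printed by summary(mylogit); the customer ratings `x13_new` and `x17_new` are hypothetical values chosen near the sample means. Note that with a factor response glm models the second level of x4, "USA/North America".

```r
# Coefficients copied from the summary(mylogit) output
b <- c(intercept = 10.1888, x13 = -0.6332, x17 = -1.4755)

# Odds ratios: multiplicative change in the odds per one-unit increase
odds_ratio <- exp(b[c("x13", "x17")])
print(round(odds_ratio, 3))

# Hypothetical customer (assumed ratings, close to the sample means)
x13_new <- 7.0
x17_new <- 4.6

# Logit for this customer, then the inverse logit to get a probability
logit_new <- b[["intercept"]] + b[["x13"]] * x13_new + b[["x17"]] * x17_new
prob_new  <- plogis(logit_new)   # predicted probability of "USA/North America"
print(round(prob_new, 3))
```

Both odds ratios are below 1 (about 0.53 for X13 and 0.23 for X17), so higher ratings on either variable lower the odds that a customer is in USA/North America. With the fitted object available, exp(coef(mylogit)) and predict(mylogit, newdata = data.frame(x13 = 7, x17 = 4.6), type = "response") should give the same quantities without retyping the coefficients.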

Explanation of the logistic regression model summary:

  • Null deviance is the -2 log likelihood (-2LL) of the base model, which contains only the intercept

  • Residual deviance is the -2LL of the proposed model, which contains x13 and x17

  • AIC is the Akaike information criterion, a measure of the relative quality of statistical models for a given set of data; lower values indicate a better model
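The three quantities are linked by simple arithmetic, which can be checked by hand. This is a small sketch using only the numbers printed in the model summary; the variable names are illustrative.

```r
# Values copied from the summary(mylogit) output
null_deviance  <- 133.750   # -2LL of the intercept-only (base) model
resid_deviance <- 74.927    # -2LL of the model with x13 and x17
k <- 3                      # estimated parameters: intercept, x13, x17

# AIC = -2LL + 2k
aic <- resid_deviance + 2 * k
print(aic)

# Log likelihoods implied by the deviances (binary 0/1 response)
loglik_base     <- -null_deviance / 2
loglik_proposed <- -resid_deviance / 2
print(c(base = loglik_base, proposed = loglik_proposed))
```

The computed AIC matches the 80.927 reported in the summary. With the fitted object available, logLik(mylogit) and AIC(mylogit) return the same quantities directly.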

Logistic regression plot

In general there are two approaches to building a logistic regression model. The first is the stepwise method, which starts from a simple model.

Odds and Probabilities

Odds represent the ratio of the probability that an event occurs to the probability of its complement (the event not occurring). Probability, on the other hand, is the likelihood that an event will occur. The difference between these two terms must be clearly understood in order to interpret logistic regression results.

\[ Odds = \frac{prob_{event}}{1 - prob_{event}} \]

\[ Probability = \frac{\text{number of outcomes in the event}}{\text{number of outcomes in the sample space}} \]
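As a concrete illustration, the summary(hbat) output lists 39 of the 100 customers under "USA/North America". A minimal sketch of the conversion between probability and odds, using those counts (the variable names are illustrative):

```r
# Counts from summary(hbat): 39 of 100 customers are in "USA/North America"
n_event <- 39
n_total <- 100

prob <- n_event / n_total      # probability of the event
odds <- prob / (1 - prob)      # odds of the event, the same as 39 / 61

# Converting the odds back to a probability
prob_from_odds <- odds / (1 + odds)

print(c(probability = prob, odds = round(odds, 3)))
```

A probability of 0.39 corresponds to odds of about 0.64, that is, roughly 64 customers in USA/North America for every 100 outside it. The two scales carry the same information but are not interchangeable numerically.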

Model Assessment

Pseudo \(R^2\) is commonly assessed with the Nagelkerke and Cox & Snell measures.

library(modEvA)
RsqGLM(mylogit)
## $CoxSnell
## [1] 0.444687
## 
## $Nagelkerke
## [1] 0.6029672
## 
## $McFadden
## [1] 0.4397945
## 
## $Tjur
## [1] 0.5001888
## 
## $sqPearson
## [1] 0.4965371

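The model yields a Nagelkerke \(R^2\) of 0.603 and a Cox & Snell \(R^2\) of 0.445. These values can be loosely interpreted as the model explaining roughly 44.5% to 60.3% of the variation in the dependent variable, although pseudo \(R^2\) measures are not a literal proportion of variance explained.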
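Three of the reported measures can be reproduced from the two deviances and the sample size alone. The sketch below applies the usual McFadden, Cox & Snell, and Nagelkerke formulas to the rounded values printed by summary(mylogit), so small differences in the last decimals are expected.

```r
# Values copied from the summary(mylogit) output
null_deviance  <- 133.750   # -2LL of the base model
resid_deviance <- 74.927    # -2LL of the proposed model
n <- 100                    # number of observations

# McFadden: proportional reduction in -2LL
mcfadden <- 1 - resid_deviance / null_deviance

# Cox & Snell: based on the likelihood ratio, scaled by sample size
cox_snell <- 1 - exp((resid_deviance - null_deviance) / n)

# Nagelkerke: Cox & Snell rescaled so that its maximum is 1
nagelkerke <- cox_snell / (1 - exp(-null_deviance / n))

print(round(c(McFadden = mcfadden, CoxSnell = cox_snell,
              Nagelkerke = nagelkerke), 4))
```

The results agree with the RsqGLM output to about four decimal places, which shows that these pseudo \(R^2\) values are simply different rescalings of the drop in -2LL.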

Result
-2LL of base model                                     133.750
-2LL of proposed model                                  74.927
Difference in -2LL between base and proposed models     58.823
anova(mylogit, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: x4
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    99    133.750              
## x13   1   33.188        98    100.562 8.367e-09 ***
## x17   1   25.634        97     74.927 4.126e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
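The analysis of deviance table tests each term as it is added. The overall test of the proposed model against the base model uses the total drop in -2LL, with degrees of freedom equal to the number of added predictors. A minimal sketch using the printed deviances (variable names are illustrative):

```r
# Values copied from the model output
null_deviance  <- 133.750   # -2LL of the base model, 99 residual df
resid_deviance <- 74.927    # -2LL of the proposed model, 97 residual df

# Likelihood ratio statistic and its degrees of freedom
lr_stat <- null_deviance - resid_deviance
df      <- 99 - 97

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = df, lower.tail = FALSE)

print(lr_stat)
print(p_value < 0.001)
```

The statistic of 58.823 equals, up to rounding, the sum of the two sequential deviances in the table (33.188 + 25.634), and its p-value is far below 0.001, so the two predictors together improve significantly on the intercept-only model.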
HLfit(model = mylogit, bin.method = "quantiles")

## $bins.table
##      BinCenter NBin    BinObs     BinPred BinObsCIlower BinObsCIupper
## 1  0.006313888   10 0.0000000 0.007359984    0.00000000     0.3084971
## 2  0.021917919   10 0.0000000 0.021015431    0.00000000     0.3084971
## 3  0.052878424    9 0.0000000 0.051440760    0.00000000     0.3362671
## 4  0.096768023   11 0.0000000 0.095874737    0.00000000     0.2849142
## 5  0.174170827   10 0.3000000 0.171581105    0.06673951     0.6524529
## 6  0.414853510   10 0.4000000 0.434686872    0.12155226     0.7376219
## 7  0.667799911    9 0.8888889 0.655954883    0.51750349     0.9971909
## 8  0.701750940   10 0.7000000 0.706767287    0.34754715     0.9332605
## 9  0.815911730   10 0.7000000 0.812164678    0.34754715     0.9332605
## 10 0.914020546   11 0.9090909 0.913005777    0.58722008     0.9977010
## 
## $chi.sq
## [1] 6.14535
## 
## $DF
## [1] 8
## 
## $p.value
## [1] 0.6309544
## 
## $RMSE
## [1] 0.9383523
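The Hosmer-Lemeshow p-value can be reproduced from the reported chi-square statistic and degrees of freedom, and the bins table can be used to confirm which outcome the model treats as the event. The sketch below retypes numbers from the HLfit output; the variable names are illustrative.

```r
# Statistic and degrees of freedom copied from the HLfit output
chi_sq <- 6.14535
df     <- 8

# Upper-tail chi-square probability
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
print(round(p_value, 4))

# Bin sizes and observed event proportions from $bins.table
n_bin   <- c(10, 10, 9, 11, 10, 10, 9, 10, 10, 11)
bin_obs <- c(0, 0, 0, 0, 0.3, 0.4, 0.8888889, 0.7, 0.7, 0.9090909)

# Total number of observed events across all bins
events <- round(sum(n_bin * bin_obs))
print(c(observations = sum(n_bin), events = events))
```

A p-value of about 0.63 is well above 0.05, so the test gives no evidence that predicted and observed proportions differ, which is usually read as adequate fit. The bins contain 39 events in total, the same as the number of "USA/North America" customers in summary(hbat), which is consistent with that level being the modeled outcome.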