Discriminant analysis and logistic regression both use a nonmetric (categorical) variable as the dependent variable. Logistic regression represents the two groups of interest as a binary variable with values 0 and 1, whereas discriminant analysis can handle a dependent variable with more than two categories. When the dependent variable has only two groups, logistic regression may be preferred for two reasons:
Discriminant analysis relies on strictly meeting the assumptions of multivariate normality and equal variance-covariance matrices across groups. Logistic regression does not require these assumptions.
Even if the assumptions are met, many researchers prefer logistic regression because it is similar to multiple regression.
| Multiple Regression | Logistic Regression |
|---|---|
| Total sum of squares | -2LL of base model |
| Error sum of squares | -2LL of proposed model |
| Regression sum of squares | Difference in -2LL between base and proposed model |
| F test of model fit | Chi-square test of -2LL difference |
| Coefficient of determination | “Pseudo” R2 measure |
Similar to multiple regression, logistic regression has a single variate composed of estimated coefficients for each independent variable, but this variate is estimated in a different manner. In multiple regression we estimate the linear relationship that best fits the data. In logistic regression we follow the same process of predicting the dependent variable from a variate composed of the logistic coefficients, but the variate is expressed on the logit (log odds) scale. As a result, the predicted probabilities can never fall outside the range of 0 to 1.
\[ Logit_i = \ln \left( \frac{prob_{event}}{1 - prob_{event}} \right) = b_0 + b_1X_1 + \cdots + b_nX_n \]
\[ Odds_i = \frac{prob_{event}}{1 - prob_{event}} = e^{b_0 + b_1X_1 + \cdots + b_nX_n} \]
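The relationship between probability, odds, and logit in the two equations above can be checked with a small sketch in base R. The probability of 0.8 is a hypothetical value chosen only for illustration; `qlogis()` and `plogis()` are the base R logit and inverse-logit functions.

```r
# Hypothetical probability of the event, chosen only for illustration.
prob_event <- 0.8

# Odds: probability of the event relative to the probability of no event.
odds <- prob_event / (1 - prob_event)

# Logit: natural log of the odds; qlogis() computes the same quantity.
logit <- log(odds)
stopifnot(isTRUE(all.equal(logit, qlogis(prob_event))))

# Going back: the inverse logit maps any real number into (0, 1).
prob_back <- plogis(logit)

# Even extreme logit values stay strictly inside the 0 to 1 range.
extreme <- plogis(c(-10, 0, 10))

print(c(odds = odds, logit = logit, prob_back = prob_back))
print(extreme)
```

Because the inverse logit is bounded, any value of the variate, however large or small, translates into a valid probability.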
Import data from your local drive
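# Use forward slashes ("/") instead of backslashes ("\") in the file path
# stringsAsFactors = TRUE keeps text columns as factors (the default changed in R 4.0.0)
hbat <- read.csv("C:/Users/asus/Google Drive/Ilham Fadhil/Tutor/Advanced Statistics/2017-Advanced_statistic/Week 2/hbat.csv", header = TRUE, stringsAsFactors = TRUE)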
str(hbat)
## 'data.frame': 100 obs. of 26 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ x1 : Factor w/ 3 levels "1 to 5 years",..: 1 3 3 2 1 2 2 1 1 2 ...
## $ x2 : Factor w/ 2 levels "Magazine industry",..: 1 2 1 2 1 2 2 1 2 1 ...
## $ x3 : Factor w/ 2 levels "Large (500+)",..: 1 2 1 1 1 2 1 1 1 1 ...
## $ x4 : Factor w/ 2 levels "Outside North America",..: 1 2 1 1 2 1 1 1 1 1 ...
## $ x5 : Factor w/ 2 levels "Direct to customer",..: 1 2 1 2 1 2 2 2 2 2 ...
## $ x6 : num 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
## $ x7 : num 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
## $ x8 : num 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
## $ x9 : num 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
## $ x10: num 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
## $ x11: num 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
## $ x12: num 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
## $ x13: num 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
## $ x14: num 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
## $ x15: num 4.3 4 4.6 3.6 4.5 9.5 2.5 4.8 4.4 5.3 ...
## $ x16: num 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
## $ x17: num 5.1 4.3 4 4.1 3.5 4.7 4.2 6.3 6.1 5.8 ...
## $ x18: num 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
## $ x19: num 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
## $ x20: num 8 6.5 8.4 6 6.6 6.3 7.8 5.8 7.5 5.9 ...
## $ x21: num 8.4 7.5 9 7.2 9 6.1 7.2 7.7 8.2 6.7 ...
## $ x22: num 65.1 67.1 72.1 40.1 57.1 50.1 41.1 56.1 56.1 59.1 ...
## $ x23: Factor w/ 2 levels "No, would not consider",..: 2 1 2 1 1 1 1 1 2 1 ...
## $ X24: int 2 1 3 1 2 3 1 3 3 2 ...
summary(hbat)
## X id x1
## Min. : 1.00 Min. : 1.00 1 to 5 years :35
## 1st Qu.: 25.75 1st Qu.: 25.75 Less than 1 year:32
## Median : 50.50 Median : 50.50 Over 5 years :33
## Mean : 50.50 Mean : 50.50
## 3rd Qu.: 75.25 3rd Qu.: 75.25
## Max. :100.00 Max. :100.00
## x2 x3 x4
## Magazine industry :52 Large (500+) :51 Outside North America:61
## Newsprint industry:48 Small (0 to 499):49 USA/North America :39
##
##
##
##
## x5 x6 x7
## Direct to customer :43 Min. : 5.000 Min. :2.200
## Indirect through broker:57 1st Qu.: 6.575 1st Qu.:3.275
## Median : 8.000 Median :3.600
## Mean : 7.810 Mean :3.672
## 3rd Qu.: 9.100 3rd Qu.:3.925
## Max. :10.000 Max. :5.700
## x8 x9 x10 x11
## Min. :1.300 Min. :2.600 Min. :1.900 Min. :2.300
## 1st Qu.:4.250 1st Qu.:4.600 1st Qu.:3.175 1st Qu.:4.700
## Median :5.400 Median :5.450 Median :4.000 Median :5.750
## Mean :5.365 Mean :5.442 Mean :4.010 Mean :5.805
## 3rd Qu.:6.625 3rd Qu.:6.325 3rd Qu.:4.800 3rd Qu.:6.800
## Max. :8.500 Max. :7.800 Max. :6.500 Max. :8.400
## x12 x13 x14 x15
## Min. :2.900 Min. :3.700 Min. :4.100 Min. :1.70
## 1st Qu.:4.500 1st Qu.:5.875 1st Qu.:5.400 1st Qu.:4.10
## Median :4.900 Median :7.100 Median :6.100 Median :5.00
## Mean :5.123 Mean :6.974 Mean :6.043 Mean :5.15
## 3rd Qu.:5.800 3rd Qu.:8.400 3rd Qu.:6.600 3rd Qu.:6.30
## Max. :8.200 Max. :9.900 Max. :8.100 Max. :9.50
## x16 x17 x18 x19
## Min. :2.000 Min. :2.60 Min. :1.600 Min. :4.700
## 1st Qu.:3.700 1st Qu.:3.70 1st Qu.:3.400 1st Qu.:6.000
## Median :4.400 Median :4.35 Median :3.900 Median :7.050
## Mean :4.278 Mean :4.61 Mean :3.886 Mean :6.918
## 3rd Qu.:4.800 3rd Qu.:5.60 3rd Qu.:4.425 3rd Qu.:7.625
## Max. :6.700 Max. :7.30 Max. :5.500 Max. :9.900
## x20 x21 x22
## Min. :4.60 Min. :5.500 Min. :37.10
## 1st Qu.:6.30 1st Qu.:7.100 1st Qu.:51.10
## Median :7.00 Median :7.700 Median :58.60
## Mean :7.02 Mean :7.713 Mean :58.40
## 3rd Qu.:7.60 3rd Qu.:8.400 3rd Qu.:65.35
## Max. :9.90 Max. :9.900 Max. :77.10
## x23 X24
## No, would not consider:55 Min. : 1.00
## Yes, would consider :45 1st Qu.: 1.00
## Median : 2.00
## Mean : 2.74
## 3rd Qu.: 3.00
## Max. :10.00
Our hbat data consist of 24 variables (6 nonmetric and 18 metric). For logistic regression, the dependent variable needs to be dichotomous, so we cannot use \(X_1\), which has three categories. In this analysis we will use \(X_4\) (region of the customer location). The analysis will determine which variables best distinguish between the two regions (USA/North America versus outside North America).
colSums(is.na(hbat))
## X id x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## x17 x18 x19 x20 x21 x22 x23 X24
## 0 0 0 0 0 0 0 0
The hbat data show no missing data, thus we can continue our analysis without data removal or imputation.
For this exercise we will use variables \(X_{13}\) (competitive pricing) and \(X_{17}\) (price flexibility) to classify and predict the region of the customer location. These variables were selected after extensive discussion with the marketing team, and we want to build and validate the model. The first step is to build the logistic model in R.
mylogit <- glm(x4 ~ x13 + x17, data = hbat, family = binomial)
summary(mylogit)
##
## Call:
## glm(formula = x4 ~ x13 + x17, family = binomial, data = hbat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4354 -0.4569 -0.1785 0.6381 1.9908
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 10.1888 1.9583 5.203 1.96e-07 ***
## x13 -0.6332 0.2162 -2.929 0.0034 **
## x17 -1.4755 0.3783 -3.900 9.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 133.750 on 99 degrees of freedom
## Residual deviance: 74.927 on 97 degrees of freedom
## AIC: 80.927
##
## Number of Fisher Scoring iterations: 6
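Logistic equation:
\[ Logit_{X4} = \ln \left( \frac{prob_{\text{USA/North America}}}{1 - prob_{\text{USA/North America}}} \right) = 10.1888 - 0.6332X_{13} - 1.4755X_{17} \]
Note which outcome is being modeled: when the response is a factor, glm() treats the first level as failure and the remaining level as success. The first level of x4 is "Outside North America", so the equation gives the log odds that a customer is located in USA/North America.
The output of the logistic model shows the coefficients, their standard errors, and the z-statistics (often called Wald z-statistics) with their p-values. Each logistic regression coefficient gives the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the other predictor constant.
For every one-unit increase in competitive pricing (X13), the log odds of USA/North America (versus outside North America) decrease by 0.6332.
For every one-unit increase in price flexibility (X17), the log odds of USA/North America (versus outside North America) decrease by 1.4755.
Equivalently, higher scores on either variable make a customer more likely to be located outside North America.
\[ Odds_{X4} = \frac{prob_{\text{USA/North America}}}{1 - prob_{\text{USA/North America}}} = e^{10.1888 - 0.6332X_{13} - 1.4755X_{17}} \]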
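As a rough illustration of how the fitted equation is used, the sketch below turns the coefficients into odds ratios and a predicted probability. It assumes, as R's glm() does for a factor response, that the modeled event is the second level of x4 (USA/North America). The coefficients are copied from the summary output; the customer scores `x13_new` and `x17_new` are hypothetical values.

```r
# Coefficients copied from the summary(mylogit) output above.
b <- c(intercept = 10.1888, x13 = -0.6332, x17 = -1.4755)

# exp(coefficient) is the multiplicative change in the odds of
# USA/North America for a one-unit increase in the predictor.
odds_ratio <- exp(b[c("x13", "x17")])

# Hypothetical customer (scores chosen for illustration only).
x13_new <- 7.0
x17_new <- 4.5

# Logit, odds, and probability for that customer.
logit_new <- b[["intercept"]] + b[["x13"]] * x13_new + b[["x17"]] * x17_new
odds_new  <- exp(logit_new)
prob_new  <- plogis(logit_new)  # same as odds_new / (1 + odds_new)

print(round(odds_ratio, 3))
print(round(c(logit = logit_new, odds = odds_new, prob = prob_new), 3))
```

Both odds ratios are below 1, which is another way of saying that higher scores lower the odds of USA/North America. With the fitted object available, `predict(mylogit, newdata = ..., type = "response")` should give the same probability without retyping the coefficients.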
Explanation of the logistic regression model summary:
Null deviance is the -2 log likelihood (-2LL) of the base model, which contains only the intercept
Residual deviance is the -2LL of the proposed model, which contains the intercept, X13, and X17
AIC is the Akaike information criterion, a measure of the relative quality of statistical models for a given set of data; lower values indicate a better trade-off between fit and complexity
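The three quantities are linked by simple arithmetic, which the following sketch reproduces from the printed summary. It relies on the fact that, for a binary (0/1) response, the deviance equals -2 times the log likelihood; the variable names are illustrative.

```r
# Values copied from the summary(mylogit) output above.
null_deviance     <- 133.750  # -2LL of the intercept-only model
residual_deviance <- 74.927   # -2LL of the model with x13 and x17
n_parameters      <- 3        # intercept + x13 + x17

# For a binary response the deviance equals -2 * log likelihood,
# so each log likelihood is recovered by dividing by -2.
loglik_null     <- -null_deviance / 2
loglik_proposed <- -residual_deviance / 2

# AIC = -2 * log likelihood + 2 * number of estimated parameters.
aic <- -2 * loglik_proposed + 2 * n_parameters

print(c(loglik_null = loglik_null, loglik_proposed = loglik_proposed, AIC = aic))
```

The result agrees with the AIC of 80.927 reported in the summary.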
In general, there are two approaches to building a logistic regression model. The first is the stepwise method, which starts from a simple model.
Odds represent the ratio of the probability that an event occurs to the probability that it does not occur (its complement). Probability, on the other hand, is the likelihood that an event will occur. The difference between these two terms must be clearly understood in order to interpret logistic regression results. In the formula below, \(Y\) denotes the probability of the event.
\[Odds = \frac{Y}{1-Y}\]
\[Probability = \frac{number \ of \ events}{size \ of \ sample \ space}\]
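To make the distinction concrete, the sketch below computes both quantities for the region variable, using the group counts shown in the `summary(hbat)` output (39 customers in USA/North America, 61 outside). The variable names are illustrative.

```r
# Counts copied from the summary(hbat) output for x4.
n_usa     <- 39  # USA/North America
n_outside <- 61  # Outside North America
n_total   <- n_usa + n_outside

# Probability: share of all customers located in USA/North America.
prob_usa <- n_usa / n_total

# Odds: customers in USA/North America relative to those outside.
odds_usa <- prob_usa / (1 - prob_usa)  # equivalently n_usa / n_outside

# Log odds: the scale on which logistic regression coefficients are expressed.
log_odds_usa <- log(odds_usa)

print(round(c(probability = prob_usa, odds = odds_usa, log_odds = log_odds_usa), 3))
```

A probability of 0.39 corresponds to odds of roughly 0.64, so the two numbers are not interchangeable even though they describe the same event.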
Pseudo R2 measures are commonly examined through the Nagelkerke and Cox & Snell R2.
library(modEvA)
RsqGLM(mylogit)
## $CoxSnell
## [1] 0.444687
##
## $Nagelkerke
## [1] 0.6029672
##
## $McFadden
## [1] 0.4397945
##
## $Tjur
## [1] 0.5001888
##
## $sqPearson
## [1] 0.4965371
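This model yields a Nagelkerke R2 of 0.603 and a Cox & Snell R2 of 0.445. These values can be loosely interpreted as the model accounting for roughly 44.5% to 60.3% of the variation in the dependent variable, although pseudo R2 measures are only analogues of the coefficient of determination, not literal proportions of variance explained.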
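Three of these measures can be reproduced by hand from the deviances in the model summary, which may help show what they capture. The sketch below uses the standard formulas for the McFadden, Cox & Snell, and Nagelkerke measures; small differences from the `RsqGLM()` output are expected because the deviances are rounded to three decimals.

```r
# Values copied from the summary(mylogit) output above.
null_deviance     <- 133.750  # -2LL of the base model
residual_deviance <- 74.927   # -2LL of the proposed model
n                 <- 100      # number of observations

# McFadden: proportional reduction in -2LL.
mcfadden <- 1 - residual_deviance / null_deviance

# Cox & Snell: based on the likelihood ratio, scaled by sample size.
cox_snell <- 1 - exp(-(null_deviance - residual_deviance) / n)

# Nagelkerke: Cox & Snell divided by its maximum attainable value,
# so that the measure can reach 1.
nagelkerke <- cox_snell / (1 - exp(-null_deviance / n))

print(round(c(McFadden = mcfadden, CoxSnell = cox_snell, Nagelkerke = nagelkerke), 4))
```

The Nagelkerke value is always at least as large as the Cox & Snell value, which explains why the two define the lower and upper ends of the range quoted above.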
| Measure | Value |
|---|---|
| -2LL of base model | 133.750 |
| -2LL of proposed model | 74.927 |
| Difference in -2LL between base and proposed model | 58.823 |
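The difference in -2LL is the statistic for the chi-square test of overall model fit, the counterpart of the F test in multiple regression. A minimal sketch of that test, using the deviances and degrees of freedom printed in the model summary (variable names are illustrative):

```r
# Values copied from the summary(mylogit) output above.
null_deviance     <- 133.750
residual_deviance <- 74.927
df_null           <- 99
df_residual       <- 97

# Likelihood-ratio chi-square: reduction in -2LL due to x13 and x17 together.
lr_chisq <- null_deviance - residual_deviance
lr_df    <- df_null - df_residual

# Upper-tail probability of the chi-square distribution.
p_value <- pchisq(lr_chisq, df = lr_df, lower.tail = FALSE)

print(c(chisq = lr_chisq, df = lr_df, p_value = p_value))
```

A very small p-value indicates that the two predictors together improve on the intercept-only model.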
anova(mylogit, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: x4
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 99 133.750
## x13 1 33.188 98 100.562 8.367e-09 ***
## x17 1 25.634 97 74.927 4.126e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
HLfit(model = mylogit, bin.method = "quantiles")
## $bins.table
## BinCenter NBin BinObs BinPred BinObsCIlower BinObsCIupper
## 1 0.006313888 10 0.0000000 0.007359984 0.00000000 0.3084971
## 2 0.021917919 10 0.0000000 0.021015431 0.00000000 0.3084971
## 3 0.052878424 9 0.0000000 0.051440760 0.00000000 0.3362671
## 4 0.096768023 11 0.0000000 0.095874737 0.00000000 0.2849142
## 5 0.174170827 10 0.3000000 0.171581105 0.06673951 0.6524529
## 6 0.414853510 10 0.4000000 0.434686872 0.12155226 0.7376219
## 7 0.667799911 9 0.8888889 0.655954883 0.51750349 0.9971909
## 8 0.701750940 10 0.7000000 0.706767287 0.34754715 0.9332605
## 9 0.815911730 10 0.7000000 0.812164678 0.34754715 0.9332605
## 10 0.914020546 11 0.9090909 0.913005777 0.58722008 0.9977010
##
## $chi.sq
## [1] 6.14535
##
## $DF
## [1] 8
##
## $p.value
## [1] 0.6309544
##
## $RMSE
## [1] 0.9383523
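The Hosmer-Lemeshow p-value in this output can be checked from the reported statistic and degrees of freedom. The sketch below assumes the p-value is the upper-tail chi-square probability, which is the conventional definition; the 5% threshold is likewise a conventional choice, not something fixed by the output.

```r
# Values copied from the HLfit() output above.
hl_chisq <- 6.14535
hl_df    <- 8

# Upper-tail chi-square probability for the Hosmer-Lemeshow statistic.
hl_p <- pchisq(hl_chisq, df = hl_df, lower.tail = FALSE)
print(hl_p)

# Conventional reading at the 5% level (the threshold is an assumption).
alpha <- 0.05
lack_of_fit_detected <- hl_p < alpha
print(lack_of_fit_detected)
```

Under this reading, a p-value well above 0.05 gives no evidence that the predicted probabilities deviate from the observed proportions across the ten bins.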