1. logistic regression model for predicting vote=1

a. logistic regression equation

ln(odds) = -6.3896 + 0.0266 * conserv - 0.0208 * age + 1.0790 * knowledge

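For reference, coefficients like these come from fitting a binomial glm. A minimal sketch, assuming a hypothetical data frame vote_dat with columns vote, conserv, age, and knowledge (the data frame and column names are assumptions, not part of the original output):

# hypothetical data frame and column names; only the form of the call matters here
vote_model <- glm(vote ~ conserv + age + knowledge,
                  family = binomial(link = "logit"), data = vote_dat)
summary(vote_model)   # the Estimate column corresponds to the equation above
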
b. predicted probability of a person

logodds <- -6.3896 + 0.0266*10 + 1.0790*5   # age term omitted: its coefficient is not statistically significant (see part c)
logodds
## [1] -0.7286
x <- exp(logodds)
probability <- x/(1+x)
probability
## [1] 0.325502

The probability of voting 1 for a 69-year-old person with a conservativeness level of 10 and a knowledge score of 5 is 32.55% (the non-significant age term is dropped from the prediction).
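
The same conversion from log odds to probability is available directly through R's inverse-logit function:

plogis(-0.7286)   # equivalent to exp(logodds)/(1 + exp(logodds)); about 0.3255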

c. interpreting the outcomes

  1. Conservativeness level When a person’s conservativeness level increases by 1 unit, the log odds of voting 1 increase by .0266.

  2. Age Age does not have a statistically significant effect on the log odds of voting 1.

  3. Knowledge When a person’s knowledge level increases by 1 unit, the log odds of voting 1 increase by 1.0790.

d. calculating the odds ratio and the 95% CI for conservativeness

odds_ratio <- exp(0.0266)            # exponentiate the coefficient to get the odds ratio
odds_ratio
## [1] 1.026957
lower_CI <- 0.0266 - 1.96*0.00894    # Wald 95% CI bounds on the log-odds scale
upper_CI <- 0.0266 + 1.96*0.00894
exp(lower_CI)
## [1] 1.009119
exp(upper_CI)
## [1] 1.04511

The odds ratio for conservativeness is 1.026957, with a 95% confidence interval of [1.009119, 1.04511].
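
With the fitted model object available (hypothetically named vote_model, as in the sketch under part a; the name and the predictor label conserv are assumptions), the same quantities can be pulled out directly:

exp(coef(vote_model))["conserv"]    # odds ratio for the conservativeness coefficient
exp(confint.default(vote_model))    # Wald 95% CIs for all coefficients, on the odds-ratio scale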

2. restaurant closure and city

a. The odds of Ulsan

pU <- 145/(145+47)   # probability of closure in Ulsan
odds <- pU/(1-pU)    # convert probability to odds
odds
## [1] 3.085106

The odds of restaurant closure in Ulsan are 3.0851.

b. The odds of Busan

pB <- 89/(89+15)
odds <- pB/(1-pB)
odds
## [1] 5.933333

The odds of restaurant closure in Busan are 5.9333.

c. The odds ratio

The odds ratio is a measure of association between an exposure and an outcome: it compares the odds of the outcome occurring with a particular exposure to the odds of the outcome occurring without that exposure.

pU <- 145/(145+47)                       # probability of closure in Ulsan
pB <- 89/(89+15)                         # probability of closure in Busan
odds_ratio <- (pU/(1-pU))/(pB/(1-pB))    # ratio of the two odds, not of the two probabilities
odds_ratio
## [1] 0.5199617

The odds ratio of restaurant closure for Ulsan relative to Busan is 0.52, which means the odds of closure in Ulsan are about 48% lower than the odds in Busan.
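
The same odds ratio can be read directly off the 2x2 table as the cross-product ratio; a sketch:

# closure counts: Ulsan 145 closed / 47 open, Busan 89 closed / 15 open
tab <- matrix(c(145, 47, 89, 15), nrow = 2, byrow = TRUE,
              dimnames = list(city = c("Ulsan", "Busan"),
                              status = c("closed", "open")))
(tab["Ulsan", "closed"] * tab["Busan", "open"]) /
  (tab["Ulsan", "open"] * tab["Busan", "closed"])   # cross-product odds ratio, about 0.52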

3. binary logistic regression

a. A result of the regression model

hb <- read.csv("hbat_200.csv")
perception <- hb$X23
customer_type <- as.factor(hb$X1)
distribution_system <- as.factor(hb$X5)
e_commerce <- hb$X7
new_products <- hb$X15
hb_model <- glm(perception ~ customer_type + distribution_system + e_commerce + new_products)
summary(hb_model)
## 
## Call:
## glm(formula = perception ~ customer_type + distribution_system + 
##     e_commerce + new_products)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.94674  -0.32568  -0.00016   0.30571   0.78254  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.599364   0.176627  -3.393 0.000837 ***
## customer_type2        0.387044   0.072824   5.315 2.92e-07 ***
## customer_type3        0.558696   0.072077   7.751 4.99e-13 ***
## distribution_system1  0.137172   0.062248   2.204 0.028727 *  
## e_commerce            0.160231   0.038293   4.184 4.33e-05 ***
## new_products          0.009439   0.019234   0.491 0.624147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.156117)
## 
##     Null deviance: 49.020  on 199  degrees of freedom
## Residual deviance: 30.287  on 194  degrees of freedom
## AIC: 204.05
## 
## Number of Fisher Scoring iterations: 2
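
Since the call above did not specify a family, glm() used its gaussian default (as the dispersion line in the summary shows). A binary logistic specification would state the family explicitly; a minimal sketch, assuming X23 is coded 0/1 (its estimates would differ from the output above):

# sketch only: same predictors with an explicit logistic link; estimates and
# standard errors would differ from the gaussian fit summarized above
hb_logit <- glm(perception ~ customer_type + distribution_system + e_commerce + new_products,
                family = binomial(link = "logit"))
summary(hb_logit)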

b. interpretations of all available coefficients.

exp(coef(hb_model))
##          (Intercept)       customer_type2       customer_type3 
##             0.549161             1.472622             1.748392 
## distribution_system1           e_commerce         new_products 
##             1.147025             1.173782             1.009484

  1. Customer type Customers who have been buying products from HBAT for between 1 and 5 years are 1.47 times more likely to buy from HBAT in the future than those who have been buying for less than 1 year (p < .01). Customers who have been buying from HBAT for more than 5 years are 1.75 times more likely to buy from HBAT in the future than those who have been buying for less than 1 year (p < .01).

  2. Distribution system When paper products are sold to customers directly, the likelihood of future purchase from HBAT is about 1.15 times higher than when paper products are sold indirectly through a broker (p = .03).

  3. E-commerce activities and Web site When the overall image of HBAT’s Web site, especially its user-friendliness, increases by one unit, the likelihood of future purchase from HBAT increases by a factor of 1.17 (p < .01).

  4. New products New products do not have a significant effect on future purchase of products from HBAT (p = .62).

c. statistical difference of the categorical levels for X1.

perception <- as.numeric(hb$X23)
diff <- aov(perception ~ customer_type)   # one-way ANOVA of perception across the three customer types
summary(diff)
##                Df Sum Sq Mean Sq F value   Pr(>F)    
## customer_type   2  13.94   6.968   39.12 4.92e-15 ***
## Residuals     197  35.08   0.178                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Considering the result of the ANOVA, yes, the three categorical levels are significantly different (F = 39.12, p < .01).

pairwise.t.test(perception, customer_type, "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  perception and customer_type 
## 
##   1       2    
## 2 8.4e-09 -    
## 3 1.1e-14 0.092
## 
## P value adjustment method: bonferroni

I examined whether the three categorical levels differ significantly from one another using pairwise t-tests with a Bonferroni correction. As the result shows, customer groups 1 and 2 are significantly different from each other (p < .01), and likewise groups 1 and 3 are significantly different from each other (p < .01). However, groups 2 and 3 are not significantly different from each other (p = .09).
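
An alternative to the Bonferroni-corrected pairwise t-tests is Tukey's HSD on the same aov fit, which also reports the estimated mean differences:

TukeyHSD(diff)   # pairwise mean differences with family-wise 95% confidence intervals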

d. Pseudo R squared

Pseudo R squared is a way to evaluate the goodness of fit of a logistic regression model. Several pseudo R squared measures have been developed. They are called ‘pseudo’ R squared because they resemble R squared in the sense that they are on a similar scale, ranging from 0 to 1 with higher values indicating better model fit, but they cannot be interpreted as one would interpret an OLS R squared.

LLF <- hb_model$deviance          # residual deviance of the fitted model
LL0 <- hb_model$null.deviance     # deviance of the intercept-only model
Pseudo_R2 <- as.vector(1 - (LLF/LL0))
Pseudo_R2
## [1] 0.3821562

In logistic regression, the pseudo R squared is defined as 1 - LLF/LL0, where LL0 is the log likelihood of the “constant-only” model and LLF is the log likelihood of the full model with constant and predictors (McFadden). The value computed above, using the model and null deviances in the same ratio, is .3821.
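
McFadden’s formula can also be applied to log likelihoods directly via logLik(); a sketch (the result will generally differ from the deviance-based ratio above unless the deviance equals -2 times the log likelihood):

llf <- as.numeric(logLik(hb_model))               # log likelihood of the fitted model
ll0 <- as.numeric(logLik(glm(perception ~ 1)))    # log likelihood of the intercept-only model
1 - llf/ll0                                       # McFadden's pseudo R squared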

e. plot and compare the fitted values to the observed values

yhat <- predict(hb_model, newdata=hb, type="response")   # fitted values
logits <- log(yhat/(1-yhat))                             # fitted values on the logit scale
plot(logits, perception, col="red")                      # observed values in red
points(logits, yhat, col="blue")                         # fitted values in blue

mean((perception-yhat)^2)
## [1] 0.1514335

The mean squared error between the fitted and observed values is .1514.
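
If X23 is coded 0/1, as the logit transformation above assumes, a simple classification table complements the MSE:

predicted_class <- as.numeric(yhat > 0.5)                      # classify at a 0.5 cutoff
table(observed = perception, predicted = predicted_class)      # observed vs. predicted classes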

f. Whether X6 should be added to the model.

product_quality <- hb$X6
hb_model2 <- glm(perception ~ customer_type + distribution_system + product_quality + e_commerce + new_products)
summary(hb_model2)
## 
## Call:
## glm(formula = perception ~ customer_type + distribution_system + 
##     product_quality + e_commerce + new_products)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.96753  -0.32068   0.01801   0.34194   0.72566  
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.074600   0.301656  -3.562 0.000463 ***
## customer_type2        0.398525   0.072556   5.493 1.24e-07 ***
## customer_type3        0.451031   0.090632   4.977 1.43e-06 ***
## distribution_system1  0.076204   0.069369   1.099 0.273338    
## product_quality       0.060376   0.031180   1.936 0.054285 .  
## e_commerce            0.175944   0.038881   4.525 1.05e-05 ***
## new_products          0.009525   0.019099   0.499 0.618532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1539354)
## 
##     Null deviance: 49.02  on 199  degrees of freedom
## Residual deviance: 29.71  on 193  degrees of freedom
## AIC: 202.21
## 
## Number of Fisher Scoring iterations: 2

The AIC of the first model (which excludes X6) is 204.05, while the AIC of the second model (which includes X6) is 202.21. The difference between the two models is less than 2, so the simpler model should be preferred. The first model has four significant predictors (customer type 2, customer type 3, distribution system 1, e-commerce), while the second model has three (customer type 2, customer type 3, e-commerce). Consequently, the simpler model, which excludes X6, is chosen.
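
The comparison can also be run programmatically, putting the two AICs side by side and testing the nested models formally; a sketch:

AIC(hb_model, hb_model2)                 # lower AIC indicates the better fit/complexity trade-off
anova(hb_model, hb_model2, test = "F")   # does adding X6 significantly reduce the deviance?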

4. Multinomial logistic problem

customer_type <- as.factor(hb$X1)
firm_size <- as.factor(hb$X3)
distribution_system <- as.factor(hb$X5)
technical_support <- hb$X8
price_flexibility <- hb$X17
delivery_speed <- hb$X18

a. A result of the multinomial logistic regression

library(nnet)

multi_hb <- multinom(customer_type ~ firm_size + distribution_system + technical_support + price_flexibility + delivery_speed)
## # weights:  21 (12 variable)
## initial  value 219.722458 
## iter  10 value 125.281776
## iter  20 value 116.891137
## final  value 116.891131 
## converged
summary(multi_hb)
## Call:
## multinom(formula = customer_type ~ firm_size + distribution_system + 
##     technical_support + price_flexibility + delivery_speed)
## 
## Coefficients:
##   (Intercept)   firm_size1 distribution_system1 technical_support
## 2   -12.77404 -0.003256774             1.462977        0.09371343
## 3   -12.39581  0.218619653            -1.219667        0.14772130
##   price_flexibility delivery_speed
## 2        0.09139939       3.051464
## 3       -2.28367864       5.873471
## 
## Std. Errors:
##   (Intercept) firm_size1 distribution_system1 technical_support
## 2    2.101651  0.5650909            0.5911559         0.1509950
## 3    2.268935  0.6105969            0.6902757         0.1697809
##   price_flexibility delivery_speed
## 2         0.3548687      0.7316407
## 3         0.4475277      0.8762036
## 
## Residual Deviance: 233.7823 
## AIC: 257.7823

calculating p-values for the multinomial coefficients

library(AER)

coeftest(multi_hb)
## 
## z test of coefficients:
## 
##                           Estimate  Std. Error z value  Pr(>|z|)    
## 2:(Intercept)          -12.7740370   2.1016508 -6.0781 1.216e-09 ***
## 2:firm_size1            -0.0032568   0.5650909 -0.0058   0.99540    
## 2:distribution_system1   1.4629770   0.5911559  2.4748   0.01333 *  
## 2:technical_support      0.0937134   0.1509950  0.6206   0.53484    
## 2:price_flexibility      0.0913994   0.3548687  0.2576   0.79675    
## 2:delivery_speed         3.0514641   0.7316407  4.1707 3.036e-05 ***
## 3:(Intercept)          -12.3958071   2.2689354 -5.4633 4.674e-08 ***
## 3:firm_size1             0.2186197   0.6105969  0.3580   0.72031    
## 3:distribution_system1  -1.2196671   0.6902757 -1.7669   0.07724 .  
## 3:technical_support      0.1477213   0.1697809  0.8701   0.38426    
## 3:price_flexibility     -2.2836786   0.4475277 -5.1029 3.345e-07 ***
## 3:delivery_speed         5.8734710   0.8762036  6.7033 2.037e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
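
The same Wald z statistics and two-sided p-values can be recovered by hand from the multinom summary components, which is useful when coeftest() is not available; a sketch:

z <- summary(multi_hb)$coefficients / summary(multi_hb)$standard.errors   # Wald z statistics
p <- 2 * pnorm(abs(z), lower.tail = FALSE)                                # two-sided p-values
p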

b. interpretations of all available coefficients.

exp(coef(multi_hb))
##    (Intercept) firm_size1 distribution_system1 technical_support
## 2 2.833389e-06  0.9967485            4.3187974          1.098245
## 3 4.135894e-06  1.2443579            0.2953285          1.159190
##   price_flexibility delivery_speed
## 2         1.0957065       21.14628
## 3         0.1019086      355.48070

  1. Firm size Firm size does not have a significant effect on the odds of being a customer who has bought from HBAT for between 1 and 5 years rather than less than 1 year (p = .99), nor on the odds of being a customer who has bought from HBAT for more than 5 years rather than less than 1 year (p = .72).

  2. Distribution system Moving from selling paper products indirectly through brokers to selling them directly multiplies the odds of being a customer of 1 to 5 years, rather than less than 1 year, by 4.32 (p = .01). However, the distribution system has no significant effect on the odds of being a customer of more than 5 years (p = .08).

  3. Technical support Technical support does not have a significant effect on the odds of being a customer of 1 to 5 years rather than less than 1 year (p = .53), nor on the odds of being a customer of more than 5 years rather than less than 1 year (p = .38).

  4. Price flexibility Price flexibility does not have a significant effect on the odds of being a customer of 1 to 5 years rather than less than 1 year (p = .80). However, a one-unit increase in price flexibility is associated with an 89.8% decrease in the odds of being a customer of more than 5 years rather than less than 1 year (p < .01).

  5. Delivery speed A one-unit increase in delivery speed multiplies the odds of being a customer of 1 to 5 years, rather than less than 1 year, by about 21.15 (p < .01), and multiplies the odds of being a customer of more than 5 years, rather than less than 1 year, by about 355.48 (p < .01).
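
To see how these odds ratios translate into actual class membership, per-observation predicted probabilities can be obtained from the fitted multinom model:

head(predict(multi_hb, type = "probs"))   # predicted probability of each customer type, one row per observation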

c. model significance

require(rms)
modfp <- lrm(customer_type ~ firm_size + distribution_system + technical_support + price_flexibility + delivery_speed)
modfp
## 
## Logistic Regression Model
## 
## lrm(formula = customer_type ~ firm_size + distribution_system + 
##     technical_support + price_flexibility + delivery_speed)
##                       Model Likelihood     Discrimination    Rank Discrim.    
##                          Ratio Test            Indexes          Indexes       
## Obs           200    LR chi2     140.46    R2       0.568    C       0.862    
##  1             68    d.f.             5    g        2.450    Dxy     0.723    
##  2             64    Pr(> chi2) <0.0001    gr      11.584    gamma   0.725    
##  3             68                          gp       0.343    tau-a   0.484    
## max |deriv| 2e-08                          Brier    0.122                     
## 
##                       Coef    S.E.   Wald Z Pr(>|Z|)
## y>=2                  -6.1336 1.0327 -5.94  <0.0001 
## y>=3                  -8.5305 1.1448 -7.45  <0.0001 
## firm_size=1           -0.1396 0.3416 -0.41  0.6829  
## distribution_system=1 -0.5343 0.3621 -1.48  0.1401  
## technical_support      0.0782 0.0969  0.81  0.4196  
## price_flexibility     -1.3305 0.1989 -6.69  <0.0001 
## delivery_speed         3.4452 0.3892  8.85  <0.0001

The model is significant according to the likelihood ratio test (chi-square = 140.46, df = 5, p < .01).

require(VGAM)
vglmfit <- vglm(customer_type ~ firm_size + distribution_system + technical_support + price_flexibility + delivery_speed, family=multinomial, data=hb)
vglm0 <- vglm(customer_type ~ 1, family=multinomial(refLevel=1), data=hb)
LLf   <- VGAM::logLik(vglmfit)
LL0   <- VGAM::logLik(vglm0)
as.vector(1 - (LLf / LL0))
## [1] 0.4678104

As explained in 3-d, McFadden’s pseudo R squared is defined as 1 - LLf/LL0, where LL0 is the log likelihood of the constant-only model and LLf is the log likelihood of the full model with constant and predictors. The pseudo R squared for the multinomial model, calculated with McFadden’s method, is .4678.
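
For completeness, the overall likelihood-ratio test for the multinomial model itself (rather than the proportional-odds lrm fit above) can be reconstructed from the same two vglm fits; a sketch:

lr_stat <- as.numeric(-2 * (LL0 - LLf))                 # likelihood-ratio chi-square statistic
df_diff <- length(coef(vglmfit)) - length(coef(vglm0))  # extra parameters in the full model
pchisq(lr_stat, df = df_diff, lower.tail = FALSE)       # p-value for the overall model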