URL: https://www.youtube.com/watch?v=-Cp_KP9mq94&t=15s
Users picked one of the following as their preferred method of fishing and provided other stats such as income:
multinomial_fishing1 <- read.csv("~/CSPS work/621 - Analytics and Data Mining/multinomial_fishing1.csv", header=TRUE)
#View(multinomial_fishing1)
attach(multinomial_fishing1)
table(mode)
## mode
## beach charter pier private
## 134 452 178 418
Reshape the data from Long to Wide so each person has 4 lines verses only 1; True would be their choice while also generating three false entries for the choices they did not chose.
Using this, in the conditional logic model since price and catch as they vary across all the options.
mldata <- mlogit.data(multinomial_fishing1, varying = 4:15, choice = "mode", shape = "wide")
mldata[1:20,]
## ~~~~~~~
## first 10 observations out of 20
## ~~~~~~~
## mode price catchrate income mode1 alt d catch chid idx
## 1 FALSE 157.930 0.5391 7.083332 4 beach 0 0.0678 1 1:each
## 2 TRUE 182.930 0.5391 7.083332 4 charter 1 0.5391 1 1:rter
## 3 FALSE 157.930 0.5391 7.083332 4 pier 0 0.0503 1 1:pier
## 4 FALSE 157.930 0.5391 7.083332 4 private 0 0.2601 1 1:vate
## 5 FALSE 15.114 0.4671 1.250000 4 beach 0 0.1049 2 2:each
## 6 TRUE 34.534 0.4671 1.250000 4 charter 1 0.4671 2 2:rter
## 7 FALSE 15.114 0.4671 1.250000 4 pier 0 0.0451 2 2:pier
## 8 FALSE 10.534 0.4671 1.250000 4 private 0 0.1574 2 2:vate
## 9 FALSE 161.874 0.2413 3.750000 3 beach 0 0.5333 3 3:each
## 10 FALSE 59.334 0.2413 3.750000 3 charter 0 1.0266 3 3:rter
##
## ~~~ indexes ~~~~
## chid alt
## 1 1 beach
## 2 1 charter
## 3 1 pier
## 4 1 private
## 5 2 beach
## 6 2 charter
## 7 2 pier
## 8 2 private
## 9 3 beach
## 10 3 charter
## indexes: 1, 2
Mode is the dependent variable and we will not have any variable that varies with the alternative so we put 1 in the constant. Income is the variable which does not vary with the alternative. Reference level is Charter creating the base variable.
mlogit.model1 <- mlogit(mode ~ 1 | income, data=mldata, reflevel="charter")
summary(mlogit.model1)
##
## Call:
## mlogit(formula = mode ~ 1 | income, data = mldata, reflevel = "charter",
## method = "nr")
##
## Frequencies of alternatives:choice
## charter beach pier private
## 0.38240 0.11337 0.15059 0.35364
##
## nr method
## 4 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.32E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):beach -1.341291 0.194517 -6.8955 5.367e-12 ***
## (Intercept):pier -0.527141 0.177784 -2.9651 0.003026 **
## (Intercept):private -0.602371 0.136096 -4.4261 9.597e-06 ***
## income:beach 0.031640 0.041846 0.7561 0.449591
## income:pier -0.111763 0.043979 -2.5413 0.011046 *
## income:private 0.123546 0.027911 4.4265 9.577e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1477.2
## McFadden R^2: 0.013736
## Likelihood ratio test : chisq = 41.145 (p.value = 6.0931e-09)
Since charter is the base variable, it is not displayed in the coefficients results (normalized to zero). The others are in comparison to charter.
Analysis: As income increases,
Changing the reference to Pier, how does that change the coefficients? It was the only value with a negative effect earlier.
mlogit.model2 <- mlogit(mode ~ 1 | income, data = mldata, reflevel="pier")
summary(mlogit.model2)
##
## Call:
## mlogit(formula = mode ~ 1 | income, data = mldata, reflevel = "pier",
## method = "nr")
##
## Frequencies of alternatives:choice
## pier beach charter private
## 0.15059 0.11337 0.38240 0.35364
##
## nr method
## 4 iterations, 0h:0m:0s
## g'(-H)^-1g = 8.32E-07
## gradient close to zero
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):beach -0.814150 0.228632 -3.5610 0.0003695 ***
## (Intercept):charter 0.527141 0.177784 2.9651 0.0030262 **
## (Intercept):private -0.075229 0.183240 -0.4106 0.6814007
## income:beach 0.143403 0.053288 2.6911 0.0071223 **
## income:charter 0.111763 0.043979 2.5413 0.0110455 *
## income:private 0.235309 0.043668 5.3886 7.101e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1477.2
## McFadden R^2: 0.013736
## Likelihood ratio test : chisq = 41.145 (p.value = 6.0931e-09)
Analysis: As income increases,
The Coefficients do not have any correspondence with the other model in magnitude or sign.
Taking the exponents of the Multinomial logit model to create the odds ratio.
exp(coef(mlogit.model1))
## (Intercept):beach (Intercept):pier (Intercept):private
## 0.2615077 0.5902901 0.5475121
## income:beach income:pier income:private
## 1.0321457 0.8942561 1.1315023
An odds ratio of over one make it more likely than the other outcomes and likewise less than one makes the outcome less likely.
As income increases,
In the conditional logit model, we have 2 or more variables in as alternative specific variables.
clogit.model1 <- mlogit(mode ~ price+catch |income, data = mldata, reflevel="charter")
summary(clogit.model1)
##
## Call:
## mlogit(formula = mode ~ price + catch | income, data = mldata,
## reflevel = "charter", method = "nr")
##
## Frequencies of alternatives:choice
## charter beach pier private
## 0.38240 0.11337 0.15059 0.35364
##
## nr method
## 7 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.37E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):beach -1.6943657 0.2240506 -7.5624 3.952e-14 ***
## (Intercept):pier -0.9164063 0.2072648 -4.4214 9.805e-06 ***
## (Intercept):private -1.1670869 0.1590475 -7.3380 2.169e-13 ***
## price -0.0251166 0.0017317 -14.5042 < 2.2e-16 ***
## catch 0.3577820 0.1097733 3.2593 0.001117 **
## income:beach 0.0332917 0.0503409 0.6613 0.508403
## income:pier -0.0942854 0.0500600 -1.8834 0.059640 .
## income:private 0.1227315 0.0286306 4.2867 1.813e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1215.1
## McFadden R^2: 0.18868
## Likelihood ratio test : chisq = 565.17 (p.value = < 2.22e-16)
Notice we only have one set of coefficients for price or catch but we have multiple sets for income.
Analysis:
In comparison to Charter, when incomes increases: * Beach becomes more likely * Pier becomes less likely * Private becomes more likely
Again, running the model with reference level of Pier base category
clogit.model2 <- mlogit(mode ~ price+catch | income, data = mldata, reflevel="pier")
summary(clogit.model2)
##
## Call:
## mlogit(formula = mode ~ price + catch | income, data = mldata,
## reflevel = "pier", method = "nr")
##
## Frequencies of alternatives:choice
## pier beach charter private
## 0.15059 0.11337 0.38240 0.35364
##
## nr method
## 7 iterations, 0h:0m:0s
## g'(-H)^-1g = 1.37E-05
## successive function values within tolerance limits
##
## Coefficients :
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept):beach -0.7779594 0.2204939 -3.5283 0.0004183 ***
## (Intercept):charter 0.9164063 0.2072648 4.4214 9.805e-06 ***
## (Intercept):private -0.2506806 0.2039395 -1.2292 0.2190004
## price -0.0251166 0.0017317 -14.5042 < 2.2e-16 ***
## catch 0.3577820 0.1097733 3.2593 0.0011170 **
## income:beach 0.1275771 0.0506395 2.5193 0.0117582 *
## income:charter 0.0942854 0.0500600 1.8834 0.0596396 .
## income:private 0.2170169 0.0500582 4.3353 1.456e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Log-Likelihood: -1215.1
## McFadden R^2: 0.18868
## Likelihood ratio test : chisq = 565.17 (p.value = < 2.22e-16)
Analysis:
In comparison to Pier, when incomes increases: * Beach becomes more likely * Pier becomes more likely * Private becomes more likely
Mean for the income, Price and catch across all variables, using the reference level of Charter.
m <- mlogit(mode ~ price+catch |income, data = mldata, reflevel="charter")
z <- with(mldata, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean), income = mean(income)))
We use the mean values to calculate the marginal effects.
The data becomes the z data from earlier only presenting the means.
effects(mlogit.model1, covariate = "income", data = z)
## charter beach pier private
## -1.201367e-02 7.495845e-05 -2.065980e-02 3.259851e-02
effects(clogit.model1, covariate = "price", data = z)
## charter beach pier private
## charter -0.0062430047 6.091542e-04 7.642235e-04 0.0048696270
## beach 0.0006091541 -1.249124e-03 8.681094e-05 0.0005531588
## pier 0.0007642234 8.681094e-05 -1.545008e-03 0.0006939736
## private 0.0048696270 5.531588e-04 6.939737e-04 -0.0061167595
effects(clogit.model1, covariate = "catch", data = z)
## charter beach pier private
## charter 0.088930726 -0.008677316 -0.010886256 -0.069367154
## beach -0.008677329 0.017793621 -0.001236612 -0.007879681
## pier -0.010886272 -0.001236612 0.022008455 -0.009885571
## private -0.069367164 -0.007879671 -0.009885559 0.087132394
Notice across the marginal effects for Price, we have negative values. This suggests as the price increases, each variable becomes less likely with Pier becoming the most unlikely to be chosen.
If catch increases for a given option, the likelihood of of it being chosen increases.
Multinomial probit model coefficients
#mprobit.model1 <- mlogit(mode ~ 1 | income, data = mldata, reflevel="charter", probit=TRUE)
#summary(mprobit.model1)
Hauseman-McFadden test of independence of irrelevant alternatives
First, we estimate the multinominal logit model using a single reference level, in this case Beach and income being our independent variable. Second, we estimate the same model were we have the alternative subset of three of the four options, dropping Charter
m1<- mlogit(mode ~ 1 | income, data = mldata, reflevel="beach")
m2<- mlogit(mode ~ 1 | income, data = mldata, reflevel="beach", alt.subset=c("beach", "pier", "private"))
hmftest(m1, m2)
##
## Hausman-McFadden test
##
## data: mldata
## chisq = 14.701, df = 4, p-value = 0.005363
## alternative hypothesis: IIA is rejected
P values of less than 0.05 so the IIA is rejected.
source: https://stats.idre.ucla.edu/r/dae/multinomial-logistic-regression/
HSB Demo data set contains variables on 200 students where the output variable is prog (program type).
The Predictor variables are:
hsbdemo <- read.csv("~/CSPS work/621 - Analytics and Data Mining/hsbdemo.csv", header = TRUE)
with(hsbdemo, table(ses,prog))
## prog
## ses academic general vocation
## high 42 9 7
## low 19 16 12
## middle 44 20 31
with(hsbdemo, do.call(rbind, tapply(write, prog, function(x) c(M=mean(x), SD = sd(x)))))
## M SD
## academic 56.25714 7.943343
## general 51.33333 9.397775
## vocation 46.76000 9.318754
This method uses the multinom function from the nnet package. multinom does not require the data to be reshaped as with the mlogit package.
library (nnet)
## Warning: package 'nnet' was built under R version 3.6.3
hsbdemo$prog2 <- relevel(hsbdemo$prog, ref = "academic")
model1 <- multinom(prog2 ~ ses+write, data = hsbdemo)
## # weights: 15 (8 variable)
## initial value 219.722458
## iter 10 value 179.983731
## final value 179.981726
## converged
Running the model produces a negative log-likelihood of 179.98. When this value is multiplied by 2, its seen in the residual Deviance and can be used when comparing nested models.
The model summary output has a block of coefficients and a block of standard errors. Each of these blocks has one row of values corresponding to a model equation. Focusing on the block of coefficients, we can look at the first row comparing prog = “general” to our baseline prog = “academic” and the second row comparing prog = “vocation” to our baseline prog = “academic”.
summary(model1)
## Call:
## multinom(formula = prog2 ~ ses + write, data = hsbdemo)
##
## Coefficients:
## (Intercept) seslow sesmiddle write
## general 1.689478 1.1628411 0.6295638 -0.05793086
## vocation 4.235574 0.9827182 1.2740985 -0.11360389
##
## Std. Errors:
## (Intercept) seslow sesmiddle write
## general 1.226939 0.5142211 0.4650289 0.02141101
## vocation 1.204690 0.5955688 0.5111119 0.02222000
##
## Residual Deviance: 359.9635
## AIC: 375.9635
z <- summary(model1)$coefficients/summary(model1)$standard.errors
z
## (Intercept) seslow sesmiddle write
## general 1.376987 2.261364 1.353816 -2.705658
## vocation 3.515904 1.650050 2.492798 -5.112687
# creating the 2 tailed Z test
p <- (1-pnorm(abs(z),0,1))*2
p
## (Intercept) seslow sesmiddle write
## general 0.1685163893 0.02373673 0.1757949 6.816914e-03
## vocation 0.0004382601 0.09893276 0.0126741 3.176088e-07
Analysis: * A one unit increase in the write variable is associated with a decrease in the likelihood of: * Being in the General vs Academic Program by .058 * Being in the Vocation vs Academic Program by .1136 * Moving from SES Low to SES High the log odds of being in the in the Vocation vs Academic Program increase by 1.162 * Moving from SES Middle to SES High the log odds of being in the in the Vocation vs Academic Program increase by .6295
The Relative risk is the right-hand side linear equation exponentiated, leading to the fact that the exponentiated regression coefficients are relative risk ratios for a unit change in the predictor variable. We can exponentiate the coefficients from our model to see these risk ratios.
exp(coef(model1))
## (Intercept) seslow sesmiddle write
## general 5.416653 3.199009 1.876792 0.9437152
## vocation 69.101326 2.671709 3.575477 0.8926115
Using the fitted function, we can calculate the predicted probabilities for each outcome level.
head(pp <- fitted(model1))
## academic general vocation
## 1 0.2777676 0.3762093 0.34602312
## 2 0.2331247 0.2204013 0.54647393
## 3 0.6596815 0.2646928 0.07562571
## 4 0.3865745 0.3698511 0.24357436
## 5 0.2141360 0.3656570 0.42020707
## 6 0.2291956 0.3693443 0.40146013
We want to examine the changes in predicted probability associated with one of our two variables, we create a small dataset varying one variable while holding the other constant, in this case write. Holding Write at it’s mean, we can examine the predicted probabilities for each level of SES
dses <- data.frame(ses = c("low", "middle", "high"), write = mean(hsbdemo$write))
predict(model1, newdata = dses, "probs")
## academic general vocation
## 1 0.4396813 0.3581915 0.2021272
## 2 0.4777451 0.2283359 0.2939190
## 3 0.7009046 0.1784928 0.1206026
We can also understand the model using the predicted probabilities to look at the at the averaged probabilities for different continuous predictor values, in this case different values of write within each level of SES.
dwrite <- data.frame(ses = rep(c("low", "middle", "high"), each = 41), write = rep(c(30:70),
3))
## store the predicted probabilities for each value of ses and write
pp.write <- cbind(dwrite, predict(model1, newdata = dwrite, type = "probs", se = TRUE))
## calculate the mean probabilities within each level of ses
by(pp.write[, 3:5], pp.write$ses, colMeans)
## pp.write$ses: high
## academic general vocation
## 0.6164348 0.1808049 0.2027603
## --------------------------------------------------------
## pp.write$ses: low
## academic general vocation
## 0.3972955 0.3278180 0.2748864
## --------------------------------------------------------
## pp.write$ses: middle
## academic general vocation
## 0.4256172 0.2010877 0.3732951
Lastly, we are a visual minded people and plotting the data conveys the information cleanly. Using the predictions generated so far, we plot the predicted probabilities against the write score by the level of SES for different levels of the outcome variable.
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## melt data set to long for ggplot2
lpp <- melt(pp.write, id.vars = c("ses", "write"), value.name = "probability")
head(lpp) # view first few rows
## ses write variable probability
## 1 low 30 academic 0.09843258
## 2 low 31 academic 0.10716517
## 3 low 32 academic 0.11650018
## 4 low 33 academic 0.12645441
## 5 low 34 academic 0.13704163
## 6 low 35 academic 0.14827211
## plot predicted probabilities across write values for each level of ses
## facetted by program type
ggplot(lpp, aes(x = write, y = probability, colour = ses)) + geom_line() + facet_grid(variable ~
., scales = "free")