Sungji Peter Shin
3.24.2019
data(titanic)
titanic <- titanic %>%
mutate(surv = as.integer(survived)) %>%
mutate(survival = sjmisc::rec(surv, rec = '2=0; 1=1')) %>%
select(pclass, survived, survival, everything())
# z_tit <- zelig(survival ~ age + sex * pclass + fare, model = 'logit', data = titanic, cite = F)
# summary(z_tit)
z5 = zlogit$new()
z5$zelig(survival ~ age + sex * pclass + fare, data = titanic)
summary(z5)
## Model:
##
## Call:
## z5$zelig(formula = survival ~ age + sex * pclass + fare, data = titanic)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0634 -0.6641 -0.4943 0.4336 2.4941
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.8978215 0.6131092 7.988 1.37e-15
## age -0.0387245 0.0068044 -5.691 1.26e-08
## sexmale -3.8996177 0.5015659 -7.775 7.55e-15
## pclass2nd -1.5923247 0.6024844 -2.643 0.00822
## pclass3rd -4.1382715 0.5601819 -7.387 1.50e-13
## fare -0.0009058 0.0020509 -0.442 0.65874
## sexmale:pclass2nd -0.0600076 0.6372949 -0.094 0.92498
## sexmale:pclass3rd 2.5019110 0.5479901 4.566 4.98e-06
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1409.99 on 1042 degrees of freedom
## Residual deviance: 931.45 on 1035 degrees of freedom
## AIC: 947.45
##
## Number of Fisher Scoring iterations: 5
##
## Next step: Use 'setx' method
a.range = min(titanic$age):max(titanic$age)
# x <- setx(z_tit, age = a.range)
# s <- sim(z_tit, x = x)
# ci.plot(s)
z5$setrange(age = a.range)
z5$sim()
z5$graph()
Expected chance of survival decreases as age increases. Of the same confidence interval, younger subjects have a wider range of expected values compared to their older counterparts. Other variables included in the same model are set as their default values: the mode of the categorical variables and the median of continuous variables.
f.range = min(titanic$fare):max(titanic$fare)
# x <- setx(z_tit, fare = f.range)
# s <- sim(z_tit, x = x)
# ci.plot(s)
z5$setrange(fare = f.range)
z5$sim()
z5$graph()
Expected chance of survival slightly decreases as fare increases when other variables are assigned the default values. However, of the same confidence interval, those who paid more have a greatly wider range of expected values compared to their counterparts who paid less.
# x <- setx(z_tit, sex = 'male')
# x1 <- setx(z_tit, sex = 'female')
# s <- sim(z_tit, x = x, x1 = x1)
# summary(s)
z5 = zlogit$new()
z5$zelig(survival ~ age + sex * pclass + fare, data = titanic)
z5$setx(sex = 'male')
z5$setx1(sex = 'female')
z5$sim()
summary(z5)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.1406613 0.02022649 0.1395005 0.1042109 0.1832741
## pv
## 0 1
## [1,] 0.859 0.141
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.3956063 0.04267911 0.3946235 0.3131149 0.4856286
## pv
## 0 1
## [1,] 0.59 0.41
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 0.254945 0.04509406 0.2548875 0.1668672 0.3477729
Statistics under ‘sim x’ represents information regarding only males; while those under ‘sim x1’ describes female information. Statistics under ‘fd’ represents information regarding the first difference.
fd <- z5$get_qi(xvalue = 'x1', qi = 'fd')
summary(fd)
## V1
## Min. :0.1130
## 1st Qu.:0.2238
## Median :0.2549
## Mean :0.2549
## 3rd Qu.:0.2837
## Max. :0.4073
# plot(s)
par(mar = c(2,2,2,2))
z5$graph()
Females have a better chance of survival than males, shown in both predicted and expected categories. Difference between the means of the expected values for females and males is 0.2545.
# c1x <- setx(z_tit, sex = 'male', pclass = '1st')
# c1x1 <- setx(z_tit, sex = 'female', pclass = '1st')
# c1s <- sim(z_tit, x = c1x, x1 = c1x1)
c1 = zlogit$new()
c1$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c1$setx(sex = 'male', pclass = '1st')
c1$setx1(sex = 'female', pclass = '1st')
c1$sim()
# c2x <- setx(z_tit, sex = 'male', pclass = '2nd')
# c2x1 <- setx(z_tit, sex = 'female', pclass = '2nd')
# c2s <- sim(z_tit, x = c2x, x1 = c2x1)
c2 = zlogit$new()
c2$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c2$setx(sex = 'male', pclass = '2nd')
c2$setx1(sex = 'female', pclass = '2nd')
c2$sim()
# c3x <- setx(z_tit, sex = 'male', pclass = '3rd')
# c3x1 <- setx(z_tit, sex = 'female', pclass = '3rd')
# c3s <- sim(z_tit, x = c3x, x1 = c3x1)
c3 = zlogit$new()
c3$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c3$setx(sex = 'male', pclass = '3rd')
c3$setx1(sex = 'female', pclass = '3rd')
c3$sim()
# plot(c1s)
par(mar = c(2,2,2,2))
c1$graph()
For first class, it seems that almost every female guest have a chance to survive and that nearly half of male guests have a chance to survive (i.e. the expected chances of survival for females and males are 0.9741 and 0.4522, respectively). Thus, the first difference of the mean expected survival is about 0.5219.
# plot(c2s)
par(mar = c(2,2,2,2))
c2$graph()
For second class, the mean expected chances of survival for female guests is 0.8877; and for male guests is 0.1353. Thus, the difference between the mean expected survival between females and males is 0.7494. Since the expected chance of survival for the second class males decreases greatly compared to the first class males, the gender difference of the expected survival increases greatly as well.
# plot(c3s)
par(mar= c(2,2,2,2))
c3$graph()
The expected chance of survival for the third class females decrease greatly compared to the females in first and second class. Mean expected chances of survival for females and males in the third class are respectively 0.3930 and 0.1398. A gender difference in the expected chance of survival in the third class is lowest among other classes.
d1 <- c1$get_qi(xvalue = 'x1', qi = 'fd')
d2 <- c2$get_qi(xvalue = 'x1', qi = 'fd')
d3 <- c3$get_qi(xvalue = 'x1', qi = 'fd')
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)
For each class, the values of the first difference in each class are put into a new dataframe, ‘dfd’; thus, V1, V2, and V3 represent the values of the first difference in the first, second, and third class, respectively.
library(tidyr)
tidd <- dfd %>%
gather(class, simv, 1:3)
head(tidd)
Instead of putting each class in different columns or fields, a new dataframe, ‘tidd’ has a long format, gathered by class (V1, V2, and V3).
library(dplyr)
tidd %>%
group_by(class) %>%
summarise(mean = mean(simv), sd = sd(simv))
Above are the mean and standard deviation of the values of first difference in each class.
library(ggplot2)
ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~class)
As stated above, the expected first difference in the third class is lowest and that in the second class is highest.
I have found this dataset from the kaggle website and will explore the dataset to predict the customers’ overall satisfaction level (dichotomous dependent variable: 1 = satisfied and 0 = neutral or dissatisfied) based on a number of variables: age, gender, class (Business, Eco Plus, and Eco), the flight distance of the journey, satisfaction level of the inflight wifi service (0 = Not applicable; 1 to 5), and satisfaction level of inflight entertainment. I expect to see higher level of satisfaction among the guests in Business class compared to those in lower class and and no gender difference in satisfaction level from the same class.
library(readr)
usair <- read_csv("C:/Users/jw/Desktop/customer-satisfaction/satisfaction_2015.csv", col_names = TRUE)
## Parsed with column specification:
## cols(
## .default = col_double(),
## satisfaction_v2 = col_character(),
## Gender = col_character(),
## `Customer Type` = col_character(),
## `Type of Travel` = col_character(),
## Class = col_character()
## )
## See spec(...) for full column specifications.
head(usair)
table(usair$satisfaction_v2)
##
## neutral or dissatisfied satisfied
## 73452 56428
usair <- usair %>%
select(id, satisfaction_v2, Gender, Age, Class, `Flight Distance`, `Inflight wifi service`, `Inflight entertainment`) %>%
mutate(
satisfaction = factor(satisfaction_v2),
gender = factor(Gender),
age = Age,
class = factor(Class),
distance = `Flight Distance`,
wifi = factor(`Inflight wifi service`),
entertainment = factor(`Inflight entertainment`))
usair <- usair %>%
mutate(satisfy = sjmisc::rec(usair$satisfaction, rec = 'satisfied = 1; else = 0')) %>%
mutate(satisfy = as.double(satisfy)) %>%
select(id, satisfaction_v2, satisfy, gender, age, class, distance, wifi, entertainment, everything())
usair <- usair %>%
mutate(satisfy = sjmisc::rec(usair$satisfy, rec = '2=1; 1=0'))
head(usair, 20)
library(pander)
## Warning: package 'pander' was built under R version 3.5.3
us5 = zlogit$new()
us5$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
table1 <- htmlreg(us5, digits = 3, doctype = FALSE)
pander(table1)
| Model 1 | ||
|---|---|---|
| (Intercept) | -14.758 | |
| (817.932) | ||
| genderMale | 0.033 | |
| (0.022) | ||
| classEco | -2.807*** | |
| (0.030) | ||
| classEco Plus | -2.613*** | |
| (0.056) | ||
| age | 0.017*** | |
| (0.001) | ||
| distance | 0.000*** | |
| (0.000) | ||
| wifi1 | -19.965 | |
| (50.627) | ||
| wifi2 | -20.286 | |
| (50.627) | ||
| wifi3 | -20.336 | |
| (50.627) | ||
| wifi4 | -18.169 | |
| (50.627) | ||
| wifi5 | -13.208 | |
| (50.627) | ||
| entertainment1 | 31.878 | |
| (819.498) | ||
| entertainment2 | 32.758 | |
| (819.498) | ||
| entertainment3 | 33.215 | |
| (819.498) | ||
| entertainment4 | 34.687 | |
| (819.498) | ||
| entertainment5 | 34.791 | |
| (819.498) | ||
| genderMale:classEco | -0.005 | |
| (0.039) | ||
| genderMale:classEco Plus | 0.073 | |
| (0.079) | ||
| AIC | 81842.607 | |
| BIC | 82018.546 | |
| Log Likelihood | -40903.304 | |
| Deviance | 81806.607 | |
| Num. obs. | 129880 | |
| p < 0.001, p < 0.01, p < 0.05 | ||
Class, age, and the flight distance have statistically significant effect on the level of satisfaction. As age and the flight distance increase, the chance of customers’ rating the flight as satisfied increases. Compared to the Business class customers, the customers in Eco Plus and Eco significantly lower chance of rating the flight as satisfied.
age.range = min(usair$age):max(usair$age)
us5$setrange(age = age.range)
us5$sim()
us5$graph()
When other features of the model are assigned with default values, as age increases, the chance of rating the flight as satisfied also increases. This may be due to that the older customers are more patient and generous.
dist.range = min(usair$distance):max(usair$distance)
us5$setrange(distance = dist.range)
us5$sim()
us5$graph()
When other features of the model are assigned with default values, as the flight distance increases, the chance of rating the flight as satisfied also increases. This may be due to that people more likely attempt to sleep when the distance and flight hours are long.
us5 = zlogit$new()
us5$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
us5$setx(gender = 'Male')
us5$setx1(gender = 'Female')
us5$sim()
summary(us5)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.6718425 0.00575847 0.6720366 0.6600978 0.6827078
## pv
## 0 1
## [1,] 0.346 0.654
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.6644864 0.005657732 0.6643842 0.6537264 0.6747172
## pv
## 0 1
## [1,] 0.325 0.675
## fd
## mean sd 50% 2.5% 97.5%
## [1,] -0.007356083 0.005009147 -0.007408826 -0.01676757 0.002074251
The mean expected value for male customers is 0.6711 and for female customers is 0.6640. Thus the first difference is only about 0.0072.
firstd <- us5$get_qi(xvalue = 'x1', qi = 'fd')
summary(firstd)
## V1
## Min. :-0.022450
## 1st Qu.:-0.011097
## Median :-0.007409
## Mean :-0.007356
## 3rd Qu.:-0.003729
## Max. : 0.006944
par(mar = c(2,2,2,2))
us5$graph()
cl1 = zlogit$new()
cl1$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cl1$setx(gender = 'Male', class = 'Business')
cl1$setx1(gender = 'Female', class = 'Business')
cl1$sim()
par(mar = c(2,2,2,2))
cl1$graph()
cl2 = zlogit$new()
cl2$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cl2$setx(gender = 'Male', class = 'Eco Plus')
cl2$setx1(gender = 'Female', class = 'Eco Plus')
cl2$sim()
par(mar = c(2,2,2,2))
cl2$graph()
cl3 = zlogit$new()
cl3$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
cl3$setx(gender = 'Male', class = 'Eco')
cl3$setx1(gender = 'Female', class = 'Eco')
cl3$sim()
par(mar= c(2,2,2,2))
cl3$graph()
As we can see the above figures, Business class customers (both males and females) have the highest odd of rating the flight as satisfied compared to other class customers of both males and females. For all three classes, males have slightly higher chance of rating the flight as satisfied compared to their respective female counterparts. However, there is only minimal value of the first difference thus there seems no gender difference in the level of satisfaction if they are in the same class.
df1 <- cl1$get_qi(xvalue = 'x1', qi = 'fd')
df2 <- cl2$get_qi(xvalue = 'x1', qi = 'fd')
df3 <- cl3$get_qi(xvalue = 'x1', qi = 'fd')
dfd1 <- as.data.frame(cbind(df1, df2, df3))
head(dfd1)
dfirstd <- dfd1 %>%
gather(class, simv, 1:3)
head(dfirstd)
dfirstd %>%
group_by(class) %>%
summarise(mean = mean(simv), sd = sd(simv))
ggplot(dfirstd, aes(simv)) + geom_histogram() + facet_grid(~class)
This figure represents the distribution of the first differences in each class. The numbers on the x-axis are very small thus support the above statement that there is no gender difference in the level of satisfaction if they are in the same class.
This model predicted that the increased age and flight distance increases the chance of customers’ rating the flight as satisfied. As expected, the customers in higher class such as Business, have higher odds of rating it as satisfied compared to their counterparts in lower classes (Eco Plus and Eco). However, there was no gender difference (only minimal) in satisfaction level when they are in the same class. For the further analysis, instead of classifying it based on the class, the gender difference in different age groups (teenagers, young adults, adults and seniors) would render more intersting results.