Sungji Peter Shin
3.24.2019

Re-implementation of the computations from the lecture slides using the Zelig 5 syntax:

5.2: Interaction model:

data(titanic)
titanic <- titanic %>%
  mutate(surv = as.integer(survived)) %>%
  mutate(survival = sjmisc::rec(surv, rec = '2=0; 1=1')) %>%
  select(pclass, survived, survival, everything())
# z_tit <- zelig(survival ~ age + sex * pclass + fare, model = 'logit', data = titanic, cite = F)
# summary(z_tit)
z5 = zlogit$new()
z5$zelig(survival ~ age + sex * pclass + fare, data = titanic)
summary(z5)

## Model: 
## 
## Call:
## z5$zelig(formula = survival ~ age + sex * pclass + fare, data = titanic)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0634  -0.6641  -0.4943   0.4336   2.4941  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        4.8978215  0.6131092   7.988 1.37e-15
## age               -0.0387245  0.0068044  -5.691 1.26e-08
## sexmale           -3.8996177  0.5015659  -7.775 7.55e-15
## pclass2nd         -1.5923247  0.6024844  -2.643  0.00822
## pclass3rd         -4.1382715  0.5601819  -7.387 1.50e-13
## fare              -0.0009058  0.0020509  -0.442  0.65874
## sexmale:pclass2nd -0.0600076  0.6372949  -0.094  0.92498
## sexmale:pclass3rd  2.5019110  0.5479901   4.566 4.98e-06
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1409.99  on 1042  degrees of freedom
## Residual deviance:  931.45  on 1035  degrees of freedom
## AIC: 947.45
## 
## Number of Fisher Scoring iterations: 5
## 
## Next step: Use 'setx' method

5.3: Age effect:

a.range = min(titanic$age):max(titanic$age)
# x <- setx(z_tit, age = a.range)
# s <- sim(z_tit, x = x)
# ci.plot(s)
z5$setrange(age = a.range)
z5$sim()
z5$graph()

Expected chance of survival decreases as age increases. Of the same confidence interval, younger subjects have a wider range of expected values compared to their older counterparts. Other variables included in the same model are set as their default values: the mode of the categorical variables and the median of continuous variables.

5.5: Fare effect:

f.range = min(titanic$fare):max(titanic$fare)
# x <- setx(z_tit, fare = f.range)
# s <- sim(z_tit, x = x)
# ci.plot(s)
z5$setrange(fare = f.range)
z5$sim()
z5$graph()

Expected chance of survival slightly decreases as fare increases when other variables are assigned the default values. However, of the same confidence interval, those who paid more have a greatly wider range of expected values compared to their counterparts who paid less.

5.7: Sex difference:

# x <- setx(z_tit, sex = 'male')
# x1 <- setx(z_tit, sex = 'female')
# s <- sim(z_tit, x = x, x1 = x1)
# summary(s)
z5 = zlogit$new()
z5$zelig(survival ~ age + sex * pclass + fare, data = titanic)
z5$setx(sex = 'male')
z5$setx1(sex = 'female')
z5$sim()
summary(z5)

## 
##  sim x :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.1406613 0.02022649 0.1395005 0.1042109 0.1832741
## pv
##          0     1
## [1,] 0.859 0.141
## 
##  sim x1 :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.3956063 0.04267911 0.3946235 0.3131149 0.4856286
## pv
##         0    1
## [1,] 0.59 0.41
## fd
##          mean         sd       50%      2.5%     97.5%
## [1,] 0.254945 0.04509406 0.2548875 0.1668672 0.3477729

Statistics under ‘sim x’ represents information regarding only males; while those under ‘sim x1’ describes female information. Statistics under ‘fd’ represents information regarding the first difference.

5.8: only simulating first difference:

fd <- z5$get_qi(xvalue = 'x1', qi = 'fd')
summary(fd)

##        V1        
##  Min.   :0.1130  
##  1st Qu.:0.2238  
##  Median :0.2549  
##  Mean   :0.2549  
##  3rd Qu.:0.2837  
##  Max.   :0.4073

5.9: plotting:

# plot(s)
par(mar = c(2,2,2,2))
z5$graph()

Females have a better chance of survival than males, shown in both predicted and expected categories. Difference between the means of the expected values for females and males is 0.2545.

5.10: testing the class variations in sex difference:

First class:

# c1x <- setx(z_tit, sex = 'male', pclass = '1st')
# c1x1 <- setx(z_tit, sex = 'female', pclass = '1st')
# c1s <- sim(z_tit, x = c1x, x1 = c1x1)
c1 = zlogit$new()
c1$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c1$setx(sex = 'male', pclass = '1st')
c1$setx1(sex = 'female', pclass = '1st')
c1$sim()

Second class:

# c2x <- setx(z_tit, sex = 'male', pclass = '2nd')
# c2x1 <- setx(z_tit, sex = 'female', pclass = '2nd')
# c2s <- sim(z_tit, x = c2x, x1 = c2x1)
c2 = zlogit$new()
c2$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c2$setx(sex = 'male', pclass = '2nd')
c2$setx1(sex = 'female', pclass = '2nd')
c2$sim()

Third class:

# c3x <- setx(z_tit, sex = 'male', pclass = '3rd')
# c3x1 <- setx(z_tit, sex = 'female', pclass = '3rd')
# c3s <- sim(z_tit, x = c3x, x1 = c3x1)
c3 = zlogit$new()
c3$zelig(survival ~ age + sex * pclass + fare, data = titanic)
c3$setx(sex = 'male', pclass = '3rd')
c3$setx1(sex = 'female', pclass = '3rd')
c3$sim()

5.12-14: Plotting:

# plot(c1s)
par(mar = c(2,2,2,2))
c1$graph()

For first class, it seems that almost every female guest have a chance to survive and that nearly half of male guests have a chance to survive (i.e. the expected chances of survival for females and males are 0.9741 and 0.4522, respectively). Thus, the first difference of the mean expected survival is about 0.5219.

# plot(c2s)
par(mar = c(2,2,2,2))
c2$graph()

For second class, the mean expected chances of survival for female guests is 0.8877; and for male guests is 0.1353. Thus, the difference between the mean expected survival between females and males is 0.7494. Since the expected chance of survival for the second class males decreases greatly compared to the first class males, the gender difference of the expected survival increases greatly as well.

# plot(c3s)
par(mar= c(2,2,2,2))
c3$graph()

The expected chance of survival for the third class females decrease greatly compared to the females in first and second class. Mean expected chances of survival for females and males in the third class are respectively 0.3930 and 0.1398. A gender difference in the expected chance of survival in the third class is lowest among other classes.

5.15: Putting them in one place:

d1 <- c1$get_qi(xvalue = 'x1', qi = 'fd')
d2 <- c2$get_qi(xvalue = 'x1', qi = 'fd')
d3 <- c3$get_qi(xvalue = 'x1', qi = 'fd')
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)

For each class, the values of the first difference in each class are put into a new dataframe, ‘dfd’; thus, V1, V2, and V3 represent the values of the first difference in the first, second, and third class, respectively.

5.16:

library(tidyr)

tidd <- dfd %>%
  gather(class, simv, 1:3)
head(tidd)

Instead of putting each class in different columns or fields, a new dataframe, ‘tidd’ has a long format, gathered by class (V1, V2, and V3).

5.17:

library(dplyr)

tidd %>%
  group_by(class) %>%
  summarise(mean = mean(simv), sd = sd(simv))

Above are the mean and standard deviation of the values of first difference in each class.

5.18:

library(ggplot2)

ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~class)

As stated above, the expected first difference in the third class is lowest and that in the second class is highest.

Conducting a regression analysis using Zelig and my own dataset of interst:

Analyzing US Airline passenger satisfaction survey in 2015

I have found this dataset from the kaggle website and will explore the dataset to predict the customers’ overall satisfaction level (dichotomous dependent variable: 1 = satisfied and 0 = neutral or dissatisfied) based on a number of variables: age, gender, class (Business, Eco Plus, and Eco), the flight distance of the journey, satisfaction level of the inflight wifi service (0 = Not applicable; 1 to 5), and satisfaction level of inflight entertainment. I expect to see higher level of satisfaction among the guests in Business class compared to those in lower class and and no gender difference in satisfaction level from the same class.

library(readr)
usair <- read_csv("C:/Users/jw/Desktop/customer-satisfaction/satisfaction_2015.csv", col_names = TRUE)

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   satisfaction_v2 = col_character(),
##   Gender = col_character(),
##   `Customer Type` = col_character(),
##   `Type of Travel` = col_character(),
##   Class = col_character()
## )

## See spec(...) for full column specifications.

head(usair)

table(usair$satisfaction_v2)

## 
## neutral or dissatisfied               satisfied 
##                   73452                   56428

usair <- usair %>%
  select(id, satisfaction_v2, Gender, Age, Class, `Flight Distance`, `Inflight wifi service`, `Inflight entertainment`) %>%
  mutate(
    satisfaction = factor(satisfaction_v2),
    gender = factor(Gender),
    age = Age,
    class = factor(Class),
    distance = `Flight Distance`,
    wifi = factor(`Inflight wifi service`),
    entertainment = factor(`Inflight entertainment`))
usair <- usair %>%
  mutate(satisfy = sjmisc::rec(usair$satisfaction, rec = 'satisfied = 1; else = 0')) %>%
  mutate(satisfy = as.double(satisfy)) %>%
  select(id, satisfaction_v2, satisfy, gender, age, class, distance, wifi, entertainment, everything())
usair <- usair %>%
  mutate(satisfy = sjmisc::rec(usair$satisfy, rec = '2=1; 1=0'))
head(usair, 20)

Running a model using Zelig 5 syntax:

library(pander)

## Warning: package 'pander' was built under R version 3.5.3

us5 = zlogit$new()
us5$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

table1 <- htmlreg(us5, digits = 3, doctype = FALSE)
pander(table1)

Statistical models
	Model 1
(Intercept)	-14.758
	(817.932)
genderMale	0.033
	(0.022)
classEco	-2.807^***
	(0.030)
classEco Plus	-2.613^***
	(0.056)
age	0.017^***
	(0.001)
distance	0.000^***
	(0.000)
wifi1	-19.965
	(50.627)
wifi2	-20.286
	(50.627)
wifi3	-20.336
	(50.627)
wifi4	-18.169
	(50.627)
wifi5	-13.208
	(50.627)
entertainment1	31.878
	(819.498)
entertainment2	32.758
	(819.498)
entertainment3	33.215
	(819.498)
entertainment4	34.687
	(819.498)
entertainment5	34.791
	(819.498)
genderMale:classEco	-0.005
	(0.039)
genderMale:classEco Plus	0.073
	(0.079)
AIC	81842.607
BIC	82018.546
Log Likelihood	-40903.304
Deviance	81806.607
Num. obs.	129880
p < 0.001, p < 0.01, p < 0.05

Brief interpretation:

Class, age, and the flight distance have statistically significant effect on the level of satisfaction. As age and the flight distance increase, the chance of customers’ rating the flight as satisfied increases. Compared to the Business class customers, the customers in Eco Plus and Eco significantly lower chance of rating the flight as satisfied.

Effect of age in satisfaction level:

age.range = min(usair$age):max(usair$age)
us5$setrange(age = age.range)
us5$sim()
us5$graph()

When other features of the model are assigned with default values, as age increases, the chance of rating the flight as satisfied also increases. This may be due to that the older customers are more patient and generous.

Effect of the flight distance in satisfaction level:

dist.range = min(usair$distance):max(usair$distance)
us5$setrange(distance = dist.range)
us5$sim()
us5$graph()

When other features of the model are assigned with default values, as the flight distance increases, the chance of rating the flight as satisfied also increases. This may be due to that people more likely attempt to sleep when the distance and flight hours are long.

Looking at gender difference in satisfaction level:

us5 = zlogit$new()
us5$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

us5$setx(gender = 'Male')
us5$setx1(gender = 'Female')
us5$sim()
summary(us5)

## 
##  sim x :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.6718425 0.00575847 0.6720366 0.6600978 0.6827078
## pv
##          0     1
## [1,] 0.346 0.654
## 
##  sim x1 :
##  -----
## ev
##           mean          sd       50%      2.5%     97.5%
## [1,] 0.6644864 0.005657732 0.6643842 0.6537264 0.6747172
## pv
##          0     1
## [1,] 0.325 0.675
## fd
##              mean          sd          50%        2.5%       97.5%
## [1,] -0.007356083 0.005009147 -0.007408826 -0.01676757 0.002074251

The mean expected value for male customers is 0.6711 and for female customers is 0.6640. Thus the first difference is only about 0.0072.

firstd <- us5$get_qi(xvalue = 'x1', qi = 'fd')
summary(firstd)

##        V1           
##  Min.   :-0.022450  
##  1st Qu.:-0.011097  
##  Median :-0.007409  
##  Mean   :-0.007356  
##  3rd Qu.:-0.003729  
##  Max.   : 0.006944

par(mar = c(2,2,2,2))
us5$graph()

Class variations (Business, Eco Plus, and Eco) in gender difference:

Business class customers and their satisfaction level:

cl1 = zlogit$new()
cl1$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cl1$setx(gender = 'Male', class = 'Business')
cl1$setx1(gender = 'Female', class = 'Business')
cl1$sim()
par(mar = c(2,2,2,2))
cl1$graph()

Eco Plus class customers and their satisfaction level:

cl2 = zlogit$new()
cl2$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cl2$setx(gender = 'Male', class = 'Eco Plus')
cl2$setx1(gender = 'Female', class = 'Eco Plus')
cl2$sim()
par(mar = c(2,2,2,2))
cl2$graph()

Eco class customers and their satisfaction level:

cl3 = zlogit$new()
cl3$zelig(satisfy ~ gender * class + age + distance + wifi + entertainment, data = usair)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

cl3$setx(gender = 'Male', class = 'Eco')
cl3$setx1(gender = 'Female', class = 'Eco')
cl3$sim()
par(mar= c(2,2,2,2))
cl3$graph()

As we can see the above figures, Business class customers (both males and females) have the highest odd of rating the flight as satisfied compared to other class customers of both males and females. For all three classes, males have slightly higher chance of rating the flight as satisfied compared to their respective female counterparts. However, there is only minimal value of the first difference thus there seems no gender difference in the level of satisfaction if they are in the same class.

Putting them together:

df1 <- cl1$get_qi(xvalue = 'x1', qi = 'fd')
df2 <- cl2$get_qi(xvalue = 'x1', qi = 'fd')
df3 <- cl3$get_qi(xvalue = 'x1', qi = 'fd')
dfd1 <- as.data.frame(cbind(df1, df2, df3))
head(dfd1)

dfirstd <- dfd1 %>%
  gather(class, simv, 1:3)
head(dfirstd)

dfirstd %>%
  group_by(class) %>%
  summarise(mean = mean(simv), sd = sd(simv))

ggplot(dfirstd, aes(simv)) + geom_histogram() + facet_grid(~class)

This figure represents the distribution of the first differences in each class. The numbers on the x-axis are very small thus support the above statement that there is no gender difference in the level of satisfaction if they are in the same class.

Conclusion:

This model predicted that the increased age and flight distance increases the chance of customers’ rating the flight as satisfied. As expected, the customers in higher class such as Business, have higher odds of rating it as satisfied compared to their counterparts in lower classes (Eco Plus and Eco). However, there was no gender difference (only minimal) in satisfaction level when they are in the same class. For the further analysis, instead of classifying it based on the class, the gender difference in different age groups (teenagers, young adults, adults and seniors) would render more intersting results.

SOC712 Assignment #6