Interpretation by simulation:

PART-1: Surviving the Titanic with Zelig5

Let’s load packages.

library(radiant.data)
data(titanic)

Recoding the Variables:

I am recoding the Data for further analysis.

newtitanic <- titanic %>% 
  mutate(survived1 = as.integer(survived)) %>%
  mutate (age = as.integer(age)) %>%
  mutate(survival = sjmisc::rec(survived1, rec = "2=0; 1=1")) %>%
  select(survived, survival, age, sex, pclass, fare)

Creating logistic regresion model using the Zelig 5 package

library(Zelig)
library(texreg)
z5<- zlogit$new()
z5$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
htmlreg(z5)

Statistical models
	Model 1
(Intercept)	4.90^***
	(0.61)
age	-0.04^***
	(0.01)
sexmale	-3.90^***
	(0.50)
pclass2nd	-1.59^**
	(0.60)
pclass3rd	-4.14^***
	(0.56)
fare	-0.00
	(0.00)
sexmale:pclass2nd	-0.06
	(0.64)
sexmale:pclass3rd	2.50^***
	(0.55)
AIC	947.42
BIC	987.02
Log Likelihood	-465.71
Deviance	931.42
Num. obs.	1043
p < 0.001, p < 0.01, p < 0.05

INTERPRETATION:
On an average there is less probability of survival by 0.04 with an increase in age, result is significant. On an average males have less probability of survival by 3.90, result is significant. On an average 2nd and 3rd class passengers have less probability of survival by 1.59 and 4.14 respectively where 3rd class passengers have much lower survival probability and result is significant. Compared to 1st class an effect of 2nd class on survival in males on an average is -0.06 where an effect of 3rd class in males on an average is 2.50 and result is significant. Males in 3rd class have higher chances of survival.

The simulation with the continues variable (age) using Zelig 5 package:

Age Effect

a.range = min(newtitanic$age):max(newtitanic$age)
z5$setrange(age=a.range)
z5$sim()
z5$graph()

INTERPRETATION:
Here I can see the relationship between age and survival rate. The expected value of survival in the titanic decreases as age increases. Infants at age “0” have high rate of survival of 0.35 than older age of 25 has low survival rate of 0.2. High uncertainty of survival found in infant (low as 0.3 and high as 0.4) than age 80 and at age 80 there is a very low chance of survival.

The simulation with the continues variable (fare) using Zelig 5 package:

Fare Effect:

f.range = min(newtitanic$fare):max(newtitanic$fare)
z5$setrange(fare=f.range)
z5$sim()
z5$graph()

INTERPRETATION:
Now I can see the relationship between fare and survival rate. The expected value of surviving as fare increases seems to remain almost constant. There is a small linear downward sloping trend. Passengers paid high fare, chances of survival is 0.10, slightly lower than who paid low fare and chances of survival is 0.15 and uncertainty of survival is much high in high fare. Fare really dose not influence survival chance.

The simulation with the categorical variable (sex) using Zelig 5 package

Sex difference:

z5.sex <- zlogit$new()
z5.sex$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
  z5.sex$setx(sex = "male") 
  z5.sex$setx1(sex = "female")
  z5.sex$sim()
  summary(z5.sex)

## 
##  sim x :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.1407784 0.01932916 0.1398707 0.1065382 0.1823852
## pv
##          0     1
## [1,] 0.844 0.156
## 
##  sim x1 :
##  -----
## ev
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.3955748 0.04342181 0.3948876 0.3105055 0.4828405
## pv
##          0     1
## [1,] 0.628 0.372
## fd
##           mean         sd       50%      2.5%     97.5%
## [1,] 0.2547964 0.04447915 0.2537797 0.1667706 0.3447734

INTERPRETATION:
In simulation for sex, there are two counter differences, one for male and other is for female i.e one model name and two counter factual names. Above syntax tells me that i want to compare predicted values based on this estimated model between two counter factual situations.
In results fd is the first difference between two expected values. On an average females have 0.255 high chance of survival than males, it can be low as 0.168 and high as 0.338. (due to simulation could be slight differnce in results)

Summary of the simulated first difference (fd) only

fd <- z5.sex$get_qi (xvalue = "x1", qi = "fd")
summary(fd)

##        V1        
##  Min.   :0.1322  
##  1st Qu.:0.2256  
##  Median :0.2538  
##  Mean   :0.2548  
##  3rd Qu.:0.2854  
##  Max.   :0.4033

INTERPRETATION:
Above syntax gave me a vector or a list of numbers that simulated fd between females and males.

plot(z5.sex)

INTERPRETATION:
Looking at the expected value and predicted value of survival between male and female, I see a greater chance of survival for females in both categories. Females have about an expected value that is greater by 0.25 as E(Y|Male) is about 0.13 and E(Y|Female) is about 0.38.

Testing the class variation in sex difference using Zelig 5

Class difference in gender Difference. Now I want to see the cause variations in gender difference in survival chance. I want to see the difference in differences i.e difference in each class (1st, 2nd & 3rd)

##FIRST CLASS
z5.c1s <- zlogit$new()
z5.c1s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c1s$setx(sex = "male", pclass = "1st") 
z5.c1s$setx1(sex = "female", pclass = "1st")
z5.c1s$sim()
plot(z5.c1s)

INTERPRETATION:
For 1st class, the difference in expected value of survival between female and male is about .53. This is very interesting as women in this class have an almost 100% expected survival chance.

##SECOND CLASS
z5.c2s <- zlogit$new()
z5.c2s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c2s$setx(sex = "male", pclass = "2nd")
z5.c2s$setx1(sex = "female", pclass = "2nd")
z5.c2s$sim()
plot(z5.c2s)

INTERPRETATION:
For 2nd class, male survival rate decrease to about 0.13 and the woman survival rate decrease to about 0.9. The first difference increases to about 0.76. This is due to the large decrease in survival chances for males between first and 2nd class.

##THIRD CLASS
z5.c3s <- zlogit$new()
z5.c3s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c3s$setx(sex = "male", pclass = "3rd")
z5.c3s$setx1(sex = "female", pclass = "3rd")
z5.c3s$sim()
plot(z5.c3s)

INTERPRETATION:
For 3rd class, the difference between female and male survival rate drops to about 0.25 as women’s expected survival decreases as we move down the class ranks

Putting the results of sex difference in all 3 classes together:

d1 <- z5.c1s$get_qi(xvalue="x1", qi="fd")
d2 <- z5.c2s$get_qi(xvalue="x1", qi="fd")
d3 <- z5.c3s$get_qi(xvalue="x1", qi="fd")
  
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)

##          V1        V2        V3
## 1 0.5339081 0.7194479 0.2447289
## 2 0.4487429 0.7897231 0.2845484
## 3 0.5322188 0.6603001 0.2415290
## 4 0.5286412 0.7463147 0.2495381
## 5 0.5056157 0.6702471 0.2939323
## 6 0.5250327 0.7078285 0.2657865

INTERPRETATION:
Now I am extracting the 1st difference of each classes and combine them to create a dataset. V1 represents the gender difference in simulated survival rate among 1st class. V2 represents the gender difference in simulated survival rate among 2nd class. V3 represents the gender difference in simulated survival rate among 3rd class.

Segregaring the sex difference by class:

tidd <- dfd %>% 
  gather(class, simv, 1:3)
head(tidd)

##   class      simv
## 1    V1 0.5339081
## 2    V1 0.4487429
## 3    V1 0.5322188
## 4    V1 0.5286412
## 5    V1 0.5056157
## 6    V1 0.5250327

INTERPRETATION:
I used gather command provided by tidyr package to convert my dataset from wide range format to long format so that i can use my data for plotting.

Class Differences in the Gender Difference the Chance of Titanic Survival:

tidd %>% 
  group_by(class) %>% 
  summarise(mean = mean(simv), sd = sd(simv))

## # A tibble: 3 x 3
##   class  mean     sd
##   <chr> <dbl>  <dbl>
## 1 V1    0.522 0.0519
## 2 V2    0.748 0.0431
## 3 V3    0.254 0.0431

INTERPRETATION:
I am grouping the results now. Above results are showing the mean value for sex difference in every class (V1 = first class, V2 = second class, V3 = third class) and sd for each class. In V1 female has on an average 0.518 more survival rate. In V2 female has on an average 0.747 more survival rate and in V3 female has on an average 0.254 more survival rate. (Slight difference can be observed due to simulation)

Plotting the graph using ggplot2 package:

The sex difference among 1st, 2nd, and 3rd class in Titanic

ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~class)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

INTERPRETATION:

In the above graph i can view the expected 1st difference between female and male passengers across all 3 classes. The first difference is the lowest in 3rd class, where male and female passengers have closer expected survival rates.

PART-II: #Conducting regression analysis using Zelig 5 on my own data set, Telco Customers:

INTRODUCTION:
I choose data from kaggle for Telco customers to predict their churn behavior. My Binary dependent variable is “Churn” (0=Not churn, 1=Churn) and I want to study how independent variables like tenure, Internet Service ,Senior Citizen, Dependents, Phone.Service, MonthlyCharges and TotalCharges, age, gender etc. have impact on my binary variable.
Let’s load the data first.

library(readr)
library(tidyverse)
Telco<-read_csv("/Users/kanwallatif/Documents/Telco-Customer.csv", col_names = TRUE) # Importing dataset file.
na.fail(Telco) #Confirming dataset for na values.

## # A tibble: 7,032 x 21
##    customerID gender SeniorCitizen Partner Dependents tenure PhoneService
##    <chr>      <chr>          <dbl> <chr>   <chr>       <dbl> <chr>       
##  1 7590-VHVEG Female             0 Yes     No              1 No          
##  2 5575-GNVDE Male               0 No      No             34 Yes         
##  3 3668-QPYBK Male               0 No      No              2 Yes         
##  4 7795-CFOCW Male               0 No      No             45 No          
##  5 9237-HQITU Female             0 No      No              2 Yes         
##  6 9305-CDSKC Female             0 No      No              8 Yes         
##  7 1452-KIOVK Male               0 No      Yes            22 Yes         
##  8 6713-OKOMC Female             0 No      No             10 No          
##  9 7892-POOKP Female             0 Yes     No             28 Yes         
## 10 6388-TABGU Male               0 No      Yes            62 Yes         
## # … with 7,022 more rows, and 14 more variables: MultipleLines <chr>,
## #   InternetService <chr>, OnlineSecurity <chr>, OnlineBackup <chr>,
## #   DeviceProtection <chr>, TechSupport <chr>, StreamingTV <chr>,
## #   StreamingMovies <chr>, Contract <chr>, PaperlessBilling <chr>,
## #   PaymentMethod <chr>, MonthlyCharges <dbl>, TotalCharges <dbl>,
## #   Churn <chr>

I am recoding the Data for further analysis:

library(Zelig)
Telco <- Telco %>% 
  mutate(CustomerChurn1 = sjmisc::rec(Churn, rec = "No=0; Yes=1"),
  CustomerChurn = as.integer(CustomerChurn1) ,
  gender2=factor(gender),
  Internet.Service=factor(InternetService),
  Senior.Citizen=factor(SeniorCitizen),
  Have.Dependents=factor(Dependents),
  Phone.Service=factor(PhoneService),
  Signed.Contact=factor(Contract)) %>%
  select(customerID, gender2, tenure, Internet.Service, Senior.Citizen, Have.Dependents, Phone.Service, Signed.Contact, everything())

Creating logistic regresion model using the Zelig 5 package

library(Zelig)
library(texreg)
z5<- zlogit$new()
z5$zelig(CustomerChurn ~ tenure+ gender2*Signed.Contact + Internet.Service +Senior.Citizen+ Have.Dependents + Phone.Service+ MonthlyCharges+ TotalCharges, data = Telco)
htmlreg(z5)

Statistical models
	Model 1
(Intercept)	0.11
	(0.20)
tenure	-0.06^***
	(0.01)
gender2Male	-0.07
	(0.07)
Signed.ContactOne year	-1.01^***
	(0.15)
Signed.ContactTwo year	-1.85^***
	(0.24)
Internet.ServiceFiber optic	0.93^***
	(0.13)
Internet.ServiceNo	-0.50^**
	(0.19)
Senior.Citizen1	0.33^***
	(0.08)
Have.DependentsYes	-0.20^*
	(0.08)
Phone.ServiceYes	-0.80^***
	(0.14)
MonthlyCharges	0.01
	(0.00)
TotalCharges	0.00^***
	(0.00)
gender2Male:Signed.ContactOne year	0.33
	(0.19)
gender2Male:Signed.ContactTwo year	0.28
	(0.31)
AIC	6003.79
BIC	6099.80
Log Likelihood	-2987.89
Deviance	5975.79
Num. obs.	7032
p < 0.001, p < 0.01, p < 0.05

INTERPRETATION:
On an average there is less probability to churn by 0.06 with increase in tenure, result is significant. On an average males have less probability to churn by 0.07. Compared to month-to-month signed contract on an average one and two year signed contract have less probability to churn by 1.01 and 1.85 respectively and result is significant. On an average there is a probability to churn by 0.93 who are using internet service (fiber optics) and less probability on an average by 0.50 to churn those who are not using Internet service. Senior citizens have a probability to churn on average by 0.33. Those customer who have dependents have less probability to churn on an average by 0.20. Those who are using phone service have less probability to churn on an average by 0.80 and results is significant. An effect of one year signed contract to churn in males on an average is 0.33 and an effect of two year signed contract in males on an average is 0.28.

The simulation with the continues variable (tenure) using Zelig 5 package:

Tenure Effect:

t.range = min(Telco$tenure):max(Telco$tenure)
z5$setrange(tenure=t.range)
z5$sim()
z5$graph()

INTERPRETATION:
Here I can see the relationship between tenure and customer churn rate. The expected value of churn in Telco decreases as tenure increases. New customers at 0 tenure have high rate of churn of 0.78 than those who have long tenure of 70 months has low churn rate of 0.05. High uncertainty of churn found in 0 month tenure (low as 0.75 and high as 0.85) than tenure 70 months and at tenure 70 there is a very low chance to churn.

The simulation with the continues variable (Total Charges) using Zelig 5 package:

Total Monthy Charges Effect:

TC.range = min(Telco$TotalCharges):max(Telco$TotalCharges)
z5$setrange(TotalCharges=TC.range)
z5$sim()
z5$graph()

INTERPRETATION:
Now I can see the relationship between Total charges and churn rate. The expected value of churn increases as total charges increases. Customers who are spending more on total charges, chances of churn is almost 0.79 and uncertainty of churn is much high in high monthly rate than who are paying less. Those who are spending less on monthly charges, chance to churn is 0.21.

The simulation with the categorical variable (sex) using Zelig 5 package

Sex difference:

z5.gender2 <- zlogit$new()
z5.gender2$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
  z5.gender2$setx(gender2 = "Male") 
  z5.gender2$setx1(gender2 = "Female")
  z5.gender2$sim()
  summary(z5.gender2)

## 
##  sim x :
##  -----
## ev
##           mean         sd       50%     2.5%     97.5%
## [1,] 0.2735423 0.01272044 0.2733798 0.249177 0.2990469
## pv
##          0     1
## [1,] 0.715 0.285
## 
##  sim x1 :
##  -----
## ev
##         mean         sd       50%      2.5%     97.5%
## [1,] 0.29345 0.01348717 0.2938858 0.2666414 0.3201162
## pv
##          0     1
## [1,] 0.691 0.309
## fd
##            mean         sd        50%         2.5%      97.5%
## [1,] 0.01990776 0.01379255 0.01991094 -0.005706276 0.04691583

INTERPRETATION:
In simulation for sex, there are two counter differences, one for male and other is for female. In results fd is the first difference between two expected values. On an average females have 0.201 high chance to churn than males, it can be low as -0.0076 and high as 0.047 (due to simulation could be slight differnce in results)

Summary of the simulated first difference (fd) only

fd <- z5.gender2$get_qi (xvalue = "x1", qi = "fd")
summary(fd)

##        V1          
##  Min.   :-0.02686  
##  1st Qu.: 0.01075  
##  Median : 0.01991  
##  Mean   : 0.01991  
##  3rd Qu.: 0.02883  
##  Max.   : 0.06018

INTERPRETATION:
Above syntax gave me a vector or a list of numbers that simulated fd between females and males.

plot(z5.gender2)

INTERPRETATION:
Looking at the expected value and predicted value of churn between male and female, I see a greater chance of churn for females. Females have about an expected value E(Y|Female) of 0.29 as E(Y|Male) is about 0.27.

Testing the Contract variation in gender difference using Zelig 5

Now I want to see the cause variations in gender difference in churn chance. I want to see the difference in differences i.e difference in each signed contract category (Month to month, One year contract, Two year Contract)

##Month-to-month Contract
z5.c1s <- zlogit$new()
z5.c1s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c1s$setx(gender2 = "Male", Signed.Contact = "Month-to-month") 
z5.c1s$setx1(gender2 = "Female", Signed.Contact = "Month-to-month")
z5.c1s$sim()
plot(z5.c1s)

INTERPRETATION:
For month to month contract, male E(Y|X) is 0.27 and female E(Y|X1) is 0.29 which is higher than males.

##One Year Contract
z5.c2s <- zlogit$new()
z5.c2s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c2s$setx(gender2 = "Male", Signed.Contact = "One year") 
z5.c2s$setx1(gender2 = "Female", Signed.Contact = "One year")
z5.c2s$sim()
plot(z5.c2s)

INTERPRETATION:
For one year contract signed ,male churn rate decrease to about 0.11 and the woman churn rate decrease to about 0.09. The first difference decreases to about -0.02.

##Two Year Contract
z5.c3s <- zlogit$new()
z5.c3s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c3s$setx(gender2 = "Male", Signed.Contact = "Two year") 
z5.c3s$setx1(gender2 = "Female", Signed.Contact = "Two year")
z5.c3s$sim()
plot(z5.c3s)

INTERPRETATION:
For two year signed contract, male churn rate decreases to about 0.038 and female churn rate also decreases to 0.028 and the difference between female and male churn rate is about -0.01. It seems with longer tenure with Telco, the lower churn rate of customers.

Putting the results of Gender difference in all 3 signed contracts together:

d1 <- z5.c1s$get_qi(xvalue="x1", qi="fd")
d2 <- z5.c2s$get_qi(xvalue="x1", qi="fd")
d3 <- z5.c3s$get_qi(xvalue="x1", qi="fd")
  
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)

##            V1           V2           V3
## 1 0.011953498 -0.009057096 -0.005850344
## 2 0.004068416 -0.015537111 -0.017882110
## 3 0.025619901 -0.021529639 -0.007160881
## 4 0.011589884 -0.014196100 -0.005172986
## 5 0.023363660 -0.061702062 -0.010401298
## 6 0.008308261 -0.022917396  0.010964745

INTERPRETATION:
Now I am extracting the 1st difference of each contract categories and combine them to create a dataset. V1 represents the gender difference in simulated churn rate among month to month signed contract. V2 represents the gender difference in simulated churn rate among one year signed contract. V3 represents the gender difference in simulated churn rate among two year signed contract.

Segregaring the sex difference by class:

tidd <- dfd %>% 
  gather(Signed.Contact, simv, 1:3)
head(tidd)

##   Signed.Contact        simv
## 1             V1 0.011953498
## 2             V1 0.004068416
## 3             V1 0.025619901
## 4             V1 0.011589884
## 5             V1 0.023363660
## 6             V1 0.008308261

INTERPRETATION:
I used gather command provided by tidyr package to convert my dataset from wide range format to long format so that i can use my data for plotting.

Signed contract categories Differences in the Gender Difference the Chance of Telco Churn:

tidd %>% 
  group_by(Signed.Contact) %>% 
  summarise(mean = mean(simv), sd = sd(simv))

## # A tibble: 3 x 3
##   Signed.Contact     mean      sd
##   <chr>             <dbl>   <dbl>
## 1 V1              0.0199  0.0131 
## 2 V2             -0.0230  0.0155 
## 3 V3             -0.00598 0.00981

INTERPRETATION:
I am grouping the results now. Above results are showing the mean value for sex difference in every signed contract category (V1 = Month-to-month, V2 = One year, V3 = Two year and sd for each category. In V1 female has on an average 0.02 more churn rate. In V2 female has on an average -0.233 churn rate and in V3 female has on an average -0.006 less churn rate than males for categories one and two year signed contract. (Could be slight difference in results due to simulation)

Plotting the graph using ggplot2 package:

The gender difference among Month to month, one year and two year contracts in Telco

ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~Signed.Contact)

INTERPRETATION:

In the above graph I can view the expected 1st difference between female and male passengers across all 3 signed contract categories. The first difference is the lowest in 2nd categories i.e One year contract, where male and female passengers have closer expected churn rates.

CONCLUSION:

In above analysis I wanted to study customer behavior for Telco customers and how various factors contributing to it. I have seen that on an average there is a less probability to churn with an increase in tenure. I have observed that longer the tenure, lower chances to churn. Those who have have dependents and have phone service are less likely to churn. Senior citizens are likely to churn. In contract variation in gender difference I have seen that there is a slight difference between both gender to churn but churn rate decreases in both gender as signed contract increases.

HomeWork # 6

Kanwal Latif

3/19/2019

Interpretation by simulation:

PART-1: Surviving the Titanic with Zelig5

Recoding the Variables:

Creating logistic regresion model using the Zelig 5 package

The simulation with the continues variable (age) using Zelig 5 package:

The simulation with the continues variable (fare) using Zelig 5 package:

The simulation with the categorical variable (sex) using Zelig 5 package

Summary of the simulated first difference (fd) only

Testing the class variation in sex difference using Zelig 5

Putting the results of sex difference in all 3 classes together:

Segregaring the sex difference by class:

Class Differences in the Gender Difference the Chance of Titanic Survival:

Plotting the graph using ggplot2 package:

INTERPRETATION:

PART-II: #Conducting regression analysis using Zelig 5 on my own data set, Telco Customers:

I am recoding the Data for further analysis:

Creating logistic regresion model using the Zelig 5 package

The simulation with the continues variable (tenure) using Zelig 5 package:

The simulation with the continues variable (Total Charges) using Zelig 5 package:

The simulation with the categorical variable (sex) using Zelig 5 package

Summary of the simulated first difference (fd) only

Testing the Contract variation in gender difference using Zelig 5

Putting the results of Gender difference in all 3 signed contracts together:

Segregaring the sex difference by class:

Signed contract categories Differences in the Gender Difference the Chance of Telco Churn:

Plotting the graph using ggplot2 package:

INTERPRETATION:

CONCLUSION: