Let’s load packages.
library(radiant.data)
data(titanic)
I am recoding the Data for further analysis.
newtitanic <- titanic %>%
mutate(survived1 = as.integer(survived)) %>%
mutate (age = as.integer(age)) %>%
mutate(survival = sjmisc::rec(survived1, rec = "2=0; 1=1")) %>%
select(survived, survival, age, sex, pclass, fare)
library(Zelig)
library(texreg)
z5<- zlogit$new()
z5$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
htmlreg(z5)
| Model 1 | ||
|---|---|---|
| (Intercept) | 4.90*** | |
| (0.61) | ||
| age | -0.04*** | |
| (0.01) | ||
| sexmale | -3.90*** | |
| (0.50) | ||
| pclass2nd | -1.59** | |
| (0.60) | ||
| pclass3rd | -4.14*** | |
| (0.56) | ||
| fare | -0.00 | |
| (0.00) | ||
| sexmale:pclass2nd | -0.06 | |
| (0.64) | ||
| sexmale:pclass3rd | 2.50*** | |
| (0.55) | ||
| AIC | 947.42 | |
| BIC | 987.02 | |
| Log Likelihood | -465.71 | |
| Deviance | 931.42 | |
| Num. obs. | 1043 | |
| p < 0.001, p < 0.01, p < 0.05 | ||
INTERPRETATION:
On an average there is less probability of survival by 0.04 with an increase in age, result is significant. On an average males have less probability of survival by 3.90, result is significant. On an average 2nd and 3rd class passengers have less probability of survival by 1.59 and 4.14 respectively where 3rd class passengers have much lower survival probability and result is significant. Compared to 1st class an effect of 2nd class on survival in males on an average is -0.06 where an effect of 3rd class in males on an average is 2.50 and result is significant. Males in 3rd class have higher chances of survival.
Age Effect
a.range = min(newtitanic$age):max(newtitanic$age)
z5$setrange(age=a.range)
z5$sim()
z5$graph()
INTERPRETATION:
Here I can see the relationship between age and survival rate. The expected value of survival in the titanic decreases as age increases. Infants at age “0” have high rate of survival of 0.35 than older age of 25 has low survival rate of 0.2. High uncertainty of survival found in infant (low as 0.3 and high as 0.4) than age 80 and at age 80 there is a very low chance of survival.
Fare Effect:
f.range = min(newtitanic$fare):max(newtitanic$fare)
z5$setrange(fare=f.range)
z5$sim()
z5$graph()
INTERPRETATION:
Now I can see the relationship between fare and survival rate. The expected value of surviving as fare increases seems to remain almost constant. There is a small linear downward sloping trend. Passengers paid high fare, chances of survival is 0.10, slightly lower than who paid low fare and chances of survival is 0.15 and uncertainty of survival is much high in high fare. Fare really dose not influence survival chance.
Sex difference:
z5.sex <- zlogit$new()
z5.sex$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.sex$setx(sex = "male")
z5.sex$setx1(sex = "female")
z5.sex$sim()
summary(z5.sex)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.1407784 0.01932916 0.1398707 0.1065382 0.1823852
## pv
## 0 1
## [1,] 0.844 0.156
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.3955748 0.04342181 0.3948876 0.3105055 0.4828405
## pv
## 0 1
## [1,] 0.628 0.372
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 0.2547964 0.04447915 0.2537797 0.1667706 0.3447734
INTERPRETATION:
In simulation for sex, there are two counter differences, one for male and other is for female i.e one model name and two counter factual names. Above syntax tells me that i want to compare predicted values based on this estimated model between two counter factual situations.
In results fd is the first difference between two expected values. On an average females have 0.255 high chance of survival than males, it can be low as 0.168 and high as 0.338. (due to simulation could be slight differnce in results)
fd <- z5.sex$get_qi (xvalue = "x1", qi = "fd")
summary(fd)
## V1
## Min. :0.1322
## 1st Qu.:0.2256
## Median :0.2538
## Mean :0.2548
## 3rd Qu.:0.2854
## Max. :0.4033
INTERPRETATION:
Above syntax gave me a vector or a list of numbers that simulated fd between females and males.
plot(z5.sex)
INTERPRETATION:
Looking at the expected value and predicted value of survival between male and female, I see a greater chance of survival for females in both categories. Females have about an expected value that is greater by 0.25 as E(Y|Male) is about 0.13 and E(Y|Female) is about 0.38.
Class difference in gender Difference. Now I want to see the cause variations in gender difference in survival chance. I want to see the difference in differences i.e difference in each class (1st, 2nd & 3rd)
##FIRST CLASS
z5.c1s <- zlogit$new()
z5.c1s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c1s$setx(sex = "male", pclass = "1st")
z5.c1s$setx1(sex = "female", pclass = "1st")
z5.c1s$sim()
plot(z5.c1s)
INTERPRETATION:
For 1st class, the difference in expected value of survival between female and male is about .53. This is very interesting as women in this class have an almost 100% expected survival chance.
##SECOND CLASS
z5.c2s <- zlogit$new()
z5.c2s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c2s$setx(sex = "male", pclass = "2nd")
z5.c2s$setx1(sex = "female", pclass = "2nd")
z5.c2s$sim()
plot(z5.c2s)
INTERPRETATION:
For 2nd class, male survival rate decrease to about 0.13 and the woman survival rate decrease to about 0.9. The first difference increases to about 0.76. This is due to the large decrease in survival chances for males between first and 2nd class.
##THIRD CLASS
z5.c3s <- zlogit$new()
z5.c3s$zelig(survival ~ age + sex*pclass + fare, data = newtitanic)
z5.c3s$setx(sex = "male", pclass = "3rd")
z5.c3s$setx1(sex = "female", pclass = "3rd")
z5.c3s$sim()
plot(z5.c3s)
INTERPRETATION:
For 3rd class, the difference between female and male survival rate drops to about 0.25 as women’s expected survival decreases as we move down the class ranks
d1 <- z5.c1s$get_qi(xvalue="x1", qi="fd")
d2 <- z5.c2s$get_qi(xvalue="x1", qi="fd")
d3 <- z5.c3s$get_qi(xvalue="x1", qi="fd")
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)
## V1 V2 V3
## 1 0.5339081 0.7194479 0.2447289
## 2 0.4487429 0.7897231 0.2845484
## 3 0.5322188 0.6603001 0.2415290
## 4 0.5286412 0.7463147 0.2495381
## 5 0.5056157 0.6702471 0.2939323
## 6 0.5250327 0.7078285 0.2657865
INTERPRETATION:
Now I am extracting the 1st difference of each classes and combine them to create a dataset. V1 represents the gender difference in simulated survival rate among 1st class. V2 represents the gender difference in simulated survival rate among 2nd class. V3 represents the gender difference in simulated survival rate among 3rd class.
tidd <- dfd %>%
gather(class, simv, 1:3)
head(tidd)
## class simv
## 1 V1 0.5339081
## 2 V1 0.4487429
## 3 V1 0.5322188
## 4 V1 0.5286412
## 5 V1 0.5056157
## 6 V1 0.5250327
INTERPRETATION:
I used gather command provided by tidyr package to convert my dataset from wide range format to long format so that i can use my data for plotting.
tidd %>%
group_by(class) %>%
summarise(mean = mean(simv), sd = sd(simv))
## # A tibble: 3 x 3
## class mean sd
## <chr> <dbl> <dbl>
## 1 V1 0.522 0.0519
## 2 V2 0.748 0.0431
## 3 V3 0.254 0.0431
INTERPRETATION:
I am grouping the results now. Above results are showing the mean value for sex difference in every class (V1 = first class, V2 = second class, V3 = third class) and sd for each class. In V1 female has on an average 0.518 more survival rate. In V2 female has on an average 0.747 more survival rate and in V3 female has on an average 0.254 more survival rate. (Slight difference can be observed due to simulation)
The sex difference among 1st, 2nd, and 3rd class in Titanic
ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~class)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the above graph i can view the expected 1st difference between female and male passengers across all 3 classes. The first difference is the lowest in 3rd class, where male and female passengers have closer expected survival rates.
INTRODUCTION:
I choose data from kaggle for Telco customers to predict their churn behavior. My Binary dependent variable is “Churn” (0=Not churn, 1=Churn) and I want to study how independent variables like tenure, Internet Service ,Senior Citizen, Dependents, Phone.Service, MonthlyCharges and TotalCharges, age, gender etc. have impact on my binary variable.
Let’s load the data first.
library(readr)
library(tidyverse)
Telco<-read_csv("/Users/kanwallatif/Documents/Telco-Customer.csv", col_names = TRUE) # Importing dataset file.
na.fail(Telco) #Confirming dataset for na values.
## # A tibble: 7,032 x 21
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## 7 1452-KIOVK Male 0 No Yes 22 Yes
## 8 6713-OKOMC Female 0 No No 10 No
## 9 7892-POOKP Female 0 Yes No 28 Yes
## 10 6388-TABGU Male 0 No Yes 62 Yes
## # … with 7,022 more rows, and 14 more variables: MultipleLines <chr>,
## # InternetService <chr>, OnlineSecurity <chr>, OnlineBackup <chr>,
## # DeviceProtection <chr>, TechSupport <chr>, StreamingTV <chr>,
## # StreamingMovies <chr>, Contract <chr>, PaperlessBilling <chr>,
## # PaymentMethod <chr>, MonthlyCharges <dbl>, TotalCharges <dbl>,
## # Churn <chr>
library(Zelig)
Telco <- Telco %>%
mutate(CustomerChurn1 = sjmisc::rec(Churn, rec = "No=0; Yes=1"),
CustomerChurn = as.integer(CustomerChurn1) ,
gender2=factor(gender),
Internet.Service=factor(InternetService),
Senior.Citizen=factor(SeniorCitizen),
Have.Dependents=factor(Dependents),
Phone.Service=factor(PhoneService),
Signed.Contact=factor(Contract)) %>%
select(customerID, gender2, tenure, Internet.Service, Senior.Citizen, Have.Dependents, Phone.Service, Signed.Contact, everything())
library(Zelig)
library(texreg)
z5<- zlogit$new()
z5$zelig(CustomerChurn ~ tenure+ gender2*Signed.Contact + Internet.Service +Senior.Citizen+ Have.Dependents + Phone.Service+ MonthlyCharges+ TotalCharges, data = Telco)
htmlreg(z5)
| Model 1 | ||
|---|---|---|
| (Intercept) | 0.11 | |
| (0.20) | ||
| tenure | -0.06*** | |
| (0.01) | ||
| gender2Male | -0.07 | |
| (0.07) | ||
| Signed.ContactOne year | -1.01*** | |
| (0.15) | ||
| Signed.ContactTwo year | -1.85*** | |
| (0.24) | ||
| Internet.ServiceFiber optic | 0.93*** | |
| (0.13) | ||
| Internet.ServiceNo | -0.50** | |
| (0.19) | ||
| Senior.Citizen1 | 0.33*** | |
| (0.08) | ||
| Have.DependentsYes | -0.20* | |
| (0.08) | ||
| Phone.ServiceYes | -0.80*** | |
| (0.14) | ||
| MonthlyCharges | 0.01 | |
| (0.00) | ||
| TotalCharges | 0.00*** | |
| (0.00) | ||
| gender2Male:Signed.ContactOne year | 0.33 | |
| (0.19) | ||
| gender2Male:Signed.ContactTwo year | 0.28 | |
| (0.31) | ||
| AIC | 6003.79 | |
| BIC | 6099.80 | |
| Log Likelihood | -2987.89 | |
| Deviance | 5975.79 | |
| Num. obs. | 7032 | |
| p < 0.001, p < 0.01, p < 0.05 | ||
INTERPRETATION:
On an average there is less probability to churn by 0.06 with increase in tenure, result is significant. On an average males have less probability to churn by 0.07. Compared to month-to-month signed contract on an average one and two year signed contract have less probability to churn by 1.01 and 1.85 respectively and result is significant. On an average there is a probability to churn by 0.93 who are using internet service (fiber optics) and less probability on an average by 0.50 to churn those who are not using Internet service. Senior citizens have a probability to churn on average by 0.33. Those customer who have dependents have less probability to churn on an average by 0.20. Those who are using phone service have less probability to churn on an average by 0.80 and results is significant. An effect of one year signed contract to churn in males on an average is 0.33 and an effect of two year signed contract in males on an average is 0.28.
Tenure Effect:
t.range = min(Telco$tenure):max(Telco$tenure)
z5$setrange(tenure=t.range)
z5$sim()
z5$graph()
INTERPRETATION:
Here I can see the relationship between tenure and customer churn rate. The expected value of churn in Telco decreases as tenure increases. New customers at 0 tenure have high rate of churn of 0.78 than those who have long tenure of 70 months has low churn rate of 0.05. High uncertainty of churn found in 0 month tenure (low as 0.75 and high as 0.85) than tenure 70 months and at tenure 70 there is a very low chance to churn.
Total Monthy Charges Effect:
TC.range = min(Telco$TotalCharges):max(Telco$TotalCharges)
z5$setrange(TotalCharges=TC.range)
z5$sim()
z5$graph()
INTERPRETATION:
Now I can see the relationship between Total charges and churn rate. The expected value of churn increases as total charges increases. Customers who are spending more on total charges, chances of churn is almost 0.79 and uncertainty of churn is much high in high monthly rate than who are paying less. Those who are spending less on monthly charges, chance to churn is 0.21.
Sex difference:
z5.gender2 <- zlogit$new()
z5.gender2$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.gender2$setx(gender2 = "Male")
z5.gender2$setx1(gender2 = "Female")
z5.gender2$sim()
summary(z5.gender2)
##
## sim x :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.2735423 0.01272044 0.2733798 0.249177 0.2990469
## pv
## 0 1
## [1,] 0.715 0.285
##
## sim x1 :
## -----
## ev
## mean sd 50% 2.5% 97.5%
## [1,] 0.29345 0.01348717 0.2938858 0.2666414 0.3201162
## pv
## 0 1
## [1,] 0.691 0.309
## fd
## mean sd 50% 2.5% 97.5%
## [1,] 0.01990776 0.01379255 0.01991094 -0.005706276 0.04691583
INTERPRETATION:
In simulation for sex, there are two counter differences, one for male and other is for female. In results fd is the first difference between two expected values. On an average females have 0.201 high chance to churn than males, it can be low as -0.0076 and high as 0.047 (due to simulation could be slight differnce in results)
fd <- z5.gender2$get_qi (xvalue = "x1", qi = "fd")
summary(fd)
## V1
## Min. :-0.02686
## 1st Qu.: 0.01075
## Median : 0.01991
## Mean : 0.01991
## 3rd Qu.: 0.02883
## Max. : 0.06018
INTERPRETATION:
Above syntax gave me a vector or a list of numbers that simulated fd between females and males.
plot(z5.gender2)
INTERPRETATION:
Looking at the expected value and predicted value of churn between male and female, I see a greater chance of churn for females. Females have about an expected value E(Y|Female) of 0.29 as E(Y|Male) is about 0.27.
Now I want to see the cause variations in gender difference in churn chance. I want to see the difference in differences i.e difference in each signed contract category (Month to month, One year contract, Two year Contract)
##Month-to-month Contract
z5.c1s <- zlogit$new()
z5.c1s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c1s$setx(gender2 = "Male", Signed.Contact = "Month-to-month")
z5.c1s$setx1(gender2 = "Female", Signed.Contact = "Month-to-month")
z5.c1s$sim()
plot(z5.c1s)
INTERPRETATION:
For month to month contract, male E(Y|X) is 0.27 and female E(Y|X1) is 0.29 which is higher than males.
##One Year Contract
z5.c2s <- zlogit$new()
z5.c2s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c2s$setx(gender2 = "Male", Signed.Contact = "One year")
z5.c2s$setx1(gender2 = "Female", Signed.Contact = "One year")
z5.c2s$sim()
plot(z5.c2s)
INTERPRETATION:
For one year contract signed ,male churn rate decrease to about 0.11 and the woman churn rate decrease to about 0.09. The first difference decreases to about -0.02.
##Two Year Contract
z5.c3s <- zlogit$new()
z5.c3s$zelig(CustomerChurn ~ tenure + gender2*Signed.Contact + TotalCharges, data = Telco)
z5.c3s$setx(gender2 = "Male", Signed.Contact = "Two year")
z5.c3s$setx1(gender2 = "Female", Signed.Contact = "Two year")
z5.c3s$sim()
plot(z5.c3s)
INTERPRETATION:
For two year signed contract, male churn rate decreases to about 0.038 and female churn rate also decreases to 0.028 and the difference between female and male churn rate is about -0.01. It seems with longer tenure with Telco, the lower churn rate of customers.
d1 <- z5.c1s$get_qi(xvalue="x1", qi="fd")
d2 <- z5.c2s$get_qi(xvalue="x1", qi="fd")
d3 <- z5.c3s$get_qi(xvalue="x1", qi="fd")
dfd <- as.data.frame(cbind(d1, d2, d3))
head(dfd)
## V1 V2 V3
## 1 0.011953498 -0.009057096 -0.005850344
## 2 0.004068416 -0.015537111 -0.017882110
## 3 0.025619901 -0.021529639 -0.007160881
## 4 0.011589884 -0.014196100 -0.005172986
## 5 0.023363660 -0.061702062 -0.010401298
## 6 0.008308261 -0.022917396 0.010964745
INTERPRETATION:
Now I am extracting the 1st difference of each contract categories and combine them to create a dataset. V1 represents the gender difference in simulated churn rate among month to month signed contract. V2 represents the gender difference in simulated churn rate among one year signed contract. V3 represents the gender difference in simulated churn rate among two year signed contract.
tidd <- dfd %>%
gather(Signed.Contact, simv, 1:3)
head(tidd)
## Signed.Contact simv
## 1 V1 0.011953498
## 2 V1 0.004068416
## 3 V1 0.025619901
## 4 V1 0.011589884
## 5 V1 0.023363660
## 6 V1 0.008308261
INTERPRETATION:
I used gather command provided by tidyr package to convert my dataset from wide range format to long format so that i can use my data for plotting.
tidd %>%
group_by(Signed.Contact) %>%
summarise(mean = mean(simv), sd = sd(simv))
## # A tibble: 3 x 3
## Signed.Contact mean sd
## <chr> <dbl> <dbl>
## 1 V1 0.0199 0.0131
## 2 V2 -0.0230 0.0155
## 3 V3 -0.00598 0.00981
INTERPRETATION:
I am grouping the results now. Above results are showing the mean value for sex difference in every signed contract category (V1 = Month-to-month, V2 = One year, V3 = Two year and sd for each category. In V1 female has on an average 0.02 more churn rate. In V2 female has on an average -0.233 churn rate and in V3 female has on an average -0.006 less churn rate than males for categories one and two year signed contract. (Could be slight difference in results due to simulation)
The gender difference among Month to month, one year and two year contracts in Telco
ggplot(tidd, aes(simv)) + geom_histogram() + facet_grid(~Signed.Contact)
In the above graph I can view the expected 1st difference between female and male passengers across all 3 signed contract categories. The first difference is the lowest in 2nd categories i.e One year contract, where male and female passengers have closer expected churn rates.
In above analysis I wanted to study customer behavior for Telco customers and how various factors contributing to it. I have seen that on an average there is a less probability to churn with an increase in tenure. I have observed that longer the tenure, lower chances to churn. Those who have have dependents and have phone service are less likely to churn. Senior citizens are likely to churn. In contract variation in gender difference I have seen that there is a slight difference between both gender to churn but churn rate decreases in both gender as signed contract increases.