1 Reading DATA

First thing we are going to do is to load our data and have a global visualization from it.

customer<-read.csv("Churn_Modelling.csv")

We called the dataset customer

We want to know how is our data structured and see what we can do with it

str(customer)

## 'data.frame':    10000 obs. of  14 variables:
##  $ RowNumber      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CustomerId     : int  15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
##  $ Surname        : chr  "Hargrave" "Hill" "Onio" "Boni" ...
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : chr  "France" "Spain" "France" "France" ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : int  1 0 1 0 0 1 0 1 0 0 ...

summary(customer)

##    RowNumber       CustomerId         Surname           CreditScore   
##  Min.   :    1   Min.   :15565701   Length:10000       Min.   :350.0  
##  1st Qu.: 2501   1st Qu.:15628528   Class :character   1st Qu.:584.0  
##  Median : 5000   Median :15690738   Mode  :character   Median :652.0  
##  Mean   : 5000   Mean   :15690941                      Mean   :650.5  
##  3rd Qu.: 7500   3rd Qu.:15753234                      3rd Qu.:718.0  
##  Max.   :10000   Max.   :15815690                      Max.   :850.0  
##   Geography            Gender               Age            Tenure      
##  Length:10000       Length:10000       Min.   :18.00   Min.   : 0.000  
##  Class :character   Class :character   1st Qu.:32.00   1st Qu.: 3.000  
##  Mode  :character   Mode  :character   Median :37.00   Median : 5.000  
##                                        Mean   :38.92   Mean   : 5.013  
##                                        3rd Qu.:44.00   3rd Qu.: 7.000  
##                                        Max.   :92.00   Max.   :10.000  
##     Balance       NumOfProducts    HasCrCard      IsActiveMember  
##  Min.   :     0   Min.   :1.00   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     0   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 97199   Median :1.00   Median :1.0000   Median :1.0000  
##  Mean   : 76486   Mean   :1.53   Mean   :0.7055   Mean   :0.5151  
##  3rd Qu.:127644   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :250898   Max.   :4.00   Max.   :1.0000   Max.   :1.0000  
##  EstimatedSalary         Exited      
##  Min.   :    11.58   Min.   :0.0000  
##  1st Qu.: 51002.11   1st Qu.:0.0000  
##  Median :100193.91   Median :0.0000  
##  Mean   :100090.24   Mean   :0.2037  
##  3rd Qu.:149388.25   3rd Qu.:0.0000  
##  Max.   :199992.48   Max.   :1.0000

For this project, the target variable will be Exited. We want to predict if a Customer will leave the bank based on his data.

2 Data Cleaning

As we can see in the previous step, we can see that we have some variables that are not helpful for our prediction. This columns are:

RowNumber
CustomerId
Surname

customer<-dplyr::select(customer,-c("RowNumber","CustomerId","Surname"))
head(customer,10)

##    CreditScore Geography Gender Age Tenure   Balance NumOfProducts HasCrCard
## 1          619    France Female  42      2      0.00             1         1
## 2          608     Spain Female  41      1  83807.86             1         0
## 3          502    France Female  42      8 159660.80             3         1
## 4          699    France Female  39      1      0.00             2         0
## 5          850     Spain Female  43      2 125510.82             1         1
## 6          645     Spain   Male  44      8 113755.78             2         1
## 7          822    France   Male  50      7      0.00             2         1
## 8          376   Germany Female  29      4 115046.74             4         1
## 9          501    France   Male  44      4 142051.07             2         0
## 10         684    France   Male  27      2 134603.88             1         1
##    IsActiveMember EstimatedSalary Exited
## 1               1       101348.88      1
## 2               1       112542.58      0
## 3               0       113931.57      1
## 4               0        93826.63      0
## 5               1        79084.10      0
## 6               0       149756.71      1
## 7               1        10062.80      0
## 8               0       119346.88      1
## 9               1        74940.50      0
## 10              1        71725.73      0

With the columns dropped, now we have to verify that all of the data is in the right data type. We are going to do two changes here:

Exited: Since we are planning to use a random forest, this variable has to be a factor, so the model can run properly
Geography: We want to change this variable to a factor, since we have different values for country and we want to add it to the model.

customer$Exited<-as.factor(customer$Exited)
customer$Geography<-as.factor(customer$Geography)

After all the cleaning this how our data will look

str(customer)

## 'data.frame':    10000 obs. of  11 variables:
##  $ CreditScore    : int  619 608 502 699 850 645 822 376 501 684 ...
##  $ Geography      : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
##  $ Gender         : chr  "Female" "Female" "Female" "Female" ...
##  $ Age            : int  42 41 42 39 43 44 50 29 44 27 ...
##  $ Tenure         : int  2 1 8 1 2 8 7 4 4 2 ...
##  $ Balance        : num  0 83808 159661 0 125511 ...
##  $ NumOfProducts  : int  1 1 3 2 1 2 2 4 2 1 ...
##  $ HasCrCard      : int  1 0 1 0 1 1 1 1 0 1 ...
##  $ IsActiveMember : int  1 1 0 0 1 0 1 0 1 1 ...
##  $ EstimatedSalary: num  101349 112543 113932 93827 79084 ...
##  $ Exited         : Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 1 ...

summary(customer)

##   CreditScore      Geography       Gender               Age       
##  Min.   :350.0   France :5014   Length:10000       Min.   :18.00  
##  1st Qu.:584.0   Germany:2509   Class :character   1st Qu.:32.00  
##  Median :652.0   Spain  :2477   Mode  :character   Median :37.00  
##  Mean   :650.5                                     Mean   :38.92  
##  3rd Qu.:718.0                                     3rd Qu.:44.00  
##  Max.   :850.0                                     Max.   :92.00  
##      Tenure          Balance       NumOfProducts    HasCrCard     
##  Min.   : 0.000   Min.   :     0   Min.   :1.00   Min.   :0.0000  
##  1st Qu.: 3.000   1st Qu.:     0   1st Qu.:1.00   1st Qu.:0.0000  
##  Median : 5.000   Median : 97199   Median :1.00   Median :1.0000  
##  Mean   : 5.013   Mean   : 76486   Mean   :1.53   Mean   :0.7055  
##  3rd Qu.: 7.000   3rd Qu.:127644   3rd Qu.:2.00   3rd Qu.:1.0000  
##  Max.   :10.000   Max.   :250898   Max.   :4.00   Max.   :1.0000  
##  IsActiveMember   EstimatedSalary     Exited  
##  Min.   :0.0000   Min.   :    11.58   0:7963  
##  1st Qu.:0.0000   1st Qu.: 51002.11   1:2037  
##  Median :1.0000   Median :100193.91           
##  Mean   :0.5151   Mean   :100090.24           
##  3rd Qu.:1.0000   3rd Qu.:149388.25           
##  Max.   :1.0000   Max.   :199992.48

3 Data Visualization

In order to get some insights about our data before build the models, it is important to look at some visualizations of our data.

3.1 Active Members vs Exited

ggplot(data=customer)+
  geom_bar(mapping = aes(x=IsActiveMember,fill=Exited),position="fill")+
  labs(x="Customer is Active",y="Percentage")+
  ggtitle("Customer Is Active vs Exited")+
  scale_fill_viridis_d(labels = c("Not Exited", "Exited"))

It was important to see this plot, because we didn’t want to have complete separation. It would be easier to say taht every Customer that is not active, will be exited, but this is not the case.

3.2 Balance

ggplot(data=customer)+
  geom_histogram(aes(x=Balance,fill=Exited),position = "fill",bins=30)+
  labs(x="Account Balance",y="Proportion")+
  ggtitle("Distribution of Account Balance by Exit Status")+
  scale_fill_viridis_d(labels = c("Not Exited", "Exited"))

3.3 Demographic Data

ggplot(data = customer, aes(x = Geography, y = Age, fill = Gender)) +
  geom_boxplot() +
  labs(x = "Geography", y = "Age", fill = "Gender") +
  scale_fill_viridis_d(name = "Gender") + 
  facet_wrap(~Exited, scales = "free", labeller = labeller(Exited = c("0" = "Not Exited", "1" = "Exited"))) +
  ggtitle("Distribution of Age by Gender and Geography")

The boxplot for age is noticeably higher among customers who have exited. The behaviour between the countries is similar

4 Random Forest

Fist thing we need to do is split our data in train and test set

set.seed(2356)
dim<-dim(customer)
train.idx<-sample(x=1:nrow(customer),size=floor(0.8*nrow(customer)))
train.df<-customer[train.idx,]
test.df<-customer[-train.idx,]

4.1 Baseline Forest

We are going to fit a baseline forest, and the proceed with the tuning

forest_base<-randomForest(Exited~.,
                          data=train.df,
                          ntree=1000,
                          mtry=4)
forest_base

## 
## Call:
##  randomForest(formula = Exited ~ ., data = train.df, ntree = 1000,      mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 13.8%
## Confusion matrix:
##      0   1 class.error
## 0 6103 257  0.04040881
## 1  847 793  0.51646341

4.2 Tuning the Forest

Now it is time to tuning or forest. In this case we are going to tune the the number of variables to randomly sample as candidates at each split mtry of our forest

mtry<-c(1:(ncol(customer)-1))

We are going to create a loop where we are going to try the different values for mtry. We are going to save the oob_error_rate in a new dataset called keeps

keeps<- data.frame(
  m= rep(NA,length(mtry)),
  oob_error_rate=rep(NA,length(mtry))
)

We are going to set a loop so we can see which value of mtry gives us the best result

#loop over each element of mtry
for(i in 1:length(mtry)){
  print(paste0("Fitting m=",mtry[i]))
  temp_forest<-randomForest(Exited~.,
                            data = train.df,
                            ntree=1000,
                            mtry=mtry[i]) #dinamically changing mtry value
  #record the results
  keeps[i,"m"]<-mtry[i]
  keeps[i,"oob_error_rate"]<-mean(predict(temp_forest)!=train.df$Exited)
}

## [1] "Fitting m=1"
## [1] "Fitting m=2"
## [1] "Fitting m=3"
## [1] "Fitting m=4"
## [1] "Fitting m=5"
## [1] "Fitting m=6"
## [1] "Fitting m=7"
## [1] "Fitting m=8"
## [1] "Fitting m=9"
## [1] "Fitting m=10"

ggplot(data=keeps)+
  geom_line(aes(x=m,y=oob_error_rate))+
  scale_x_continuous(breaks=c(1:length(mtry)))

The best value for m is 3with an obb_value of 0.138375

Now we are going to set our final forest

final_forest<-randomForest(Exited~.,
                           data=train.df,
                           ntree=1000,
                           mtry=subset(keeps, oob_error_rate == min(oob_error_rate))$m,
                           importance=TRUE) #based on tuning excercise

final_forest

## 
## Call:
##  randomForest(formula = Exited ~ ., data = train.df, ntree = 1000,      mtry = subset(keeps, oob_error_rate == min(oob_error_rate))$m,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 13.84%
## Confusion matrix:
##      0   1 class.error
## 0 6131 229  0.03600629
## 1  878 762  0.53536585

# Calculate predicted probabilities ('pi_hat') using the 'final_forest' model
pi_hat <- predict(final_forest, test.df, type = 'prob')[, '1']

# Create an ROC curve using the 'pROC' package
rocCurve <- roc(response = test.df$Exited,
                predictor = pi_hat,
                levels = c("0", "1"))

## Setting direction: controls < cases

# Plot the ROC curve with additional information
plot(rocCurve, print.thres = TRUE, print.auc = TRUE)

Based on our ROC curve, the pi* for prediction should 0.2795

# Calculate the threshold for classification based on ROC curve
pi_star <- coords(rocCurve, "best", ret = "threshold")$threshold

# Add a new column 'forest_pred' to 'test.df' based on the threshold
# If 'pi_hat' is greater than 'pi_star', set to 1, else set to 0
test.df$forest_pred <- ifelse(pi_hat > pi_star, 1, 0)

head(test.df)

##    CreditScore Geography Gender Age Tenure   Balance NumOfProducts HasCrCard
## 5          850     Spain Female  43      2 125510.82             1         1
## 20         726    France Female  24      6      0.00             2         1
## 21         732    France   Male  41      8      0.00             2         1
## 30         411    France   Male  29      0  59697.17             2         1
## 39         850    France   Male  36      7      0.00             1         1
## 43         556    France Female  61      2 117419.35             1         1
##    IsActiveMember EstimatedSalary Exited forest_pred
## 5               1        79084.10      0           0
## 20              1        54724.03      0           0
## 21              1       170886.17      0           0
## 30              1        53483.21      0           0
## 39              1        40812.90      0           0
## 43              1        94153.83      0           1

5 Logistic Model

We decided to usea logistic model because we have a binary target variable Exited, so we have a Bernoulli distribution where the response for our Y variables is 1 or 0.

# Add a new column 'Exited_bin' to 'customer'
# If 'Exited' is equal to 1, set 'Exited_bin' to 1; otherwise, set to 0
customer$Exited_numeric <- ifelse(customer$Exited == 1, 1, 0)
customer$IsActiveMember<-as.factor(customer$IsActiveMember)
customer$HasCrCard<-as.factor(customer$HasCrCard)
head(customer)

##   CreditScore Geography Gender Age Tenure   Balance NumOfProducts HasCrCard
## 1         619    France Female  42      2      0.00             1         1
## 2         608     Spain Female  41      1  83807.86             1         0
## 3         502    France Female  42      8 159660.80             3         1
## 4         699    France Female  39      1      0.00             2         0
## 5         850     Spain Female  43      2 125510.82             1         1
## 6         645     Spain   Male  44      8 113755.78             2         1
##   IsActiveMember EstimatedSalary Exited Exited_numeric
## 1              1       101348.88      1              1
## 2              1       112542.58      0              0
## 3              0       113931.57      1              1
## 4              0        93826.63      0              0
## 5              1        79084.10      0              0
## 6              0       149756.71      1              1

# Create a data frame containing variable importance values
vi <- as.data.frame(varImpPlot(final_forest, type = 1))

# Add a column 'Variable' to store variable names
vi$Variable <- rownames(vi)

# Create a bar plot of variable importance with ggplot2
ggplot(data = vi) +
  geom_bar(aes(x = reorder(Variable, MeanDecreaseAccuracy), weight = MeanDecreaseAccuracy),
           position = "identity") +
  
  # Flip the coordinates for a horizontal bar plot
  coord_flip() +
  
  # Label the axes
  labs(x = "Variable Name", y = "Importance")+
   # Add a title to the graph
  ggtitle("Variable Importance Analysis")

# Arrange the 'vi' data frame in descending order based on 'MeanDecreaseAccuracy'
# Select only the 'Variable' column
variables <- vi %>% 
  arrange(desc(MeanDecreaseAccuracy)) %>% 
  dplyr::select(Variable)

variables

##                        Variable
## Age                         Age
## NumOfProducts     NumOfProducts
## IsActiveMember   IsActiveMember
## Balance                 Balance
## Geography             Geography
## Gender                   Gender
## CreditScore         CreditScore
## Tenure                   Tenure
## EstimatedSalary EstimatedSalary
## HasCrCard             HasCrCard

We create a new data frame in order to save the AIC and the BIC of the models we are going to create

We create a function in order to fit a logistic model with a log link.

# Function to fit a logistic regression model and calculate AIC, BIC
# Arguments:
#   - data: DataFrame containing the dataset
#   - variables: Vector of predictor variables for the logistic regression
fitting_model <- function(data, variables) {
  
  # Construct the formula for logistic regression
  formula <- as.formula(paste("Exited_numeric ~", paste(variables, collapse = "+")))
  
  # Fit the logistic regression model
  model <- glm(formula, 
                data = data, 
                family = binomial(link = "logit"))  # Specify binomial family and logit link function
  
  # Calculate AIC and BIC
  aic <- AIC(model)
  bic <- BIC(model)
  
  # Create a summary vector with relevant information
  final <- c(num_variables = length(variables),
             AIC = round(aic, 2),
             BIC = round(bic, 2),
             Variables = paste(variables, collapse = ", " ))
  
  # Return the summary vector
  return(final)
}

# Loop through each row of the 'variables' data frame
for (i in 1:nrow(variables)) {
  
  # Select the first i columns from the 'variables' data frame
  columns <- variables[1:i, ]
  
  # Call the fitting_model function to fit a logistic regression model
  metrics <- fitting_model(customer, columns)
  
  # Create a data frame with the results and append it to the 'results' data frame
  results <- data.frame(rbind(results, metrics))
  
}

##    Num_variables     AIC     BIC
## 1              1 9355.05 9369.47
## 2              2 9340.96 9362.60
## 3              3 8946.91 8975.76
## 4              4 8828.37 8864.42
## 5              5 8680.84 8731.31
## 6              6 8587.55 8645.23
## 7              7 8583.87 8648.76
## 8              8 8582.95 8655.06
## 9              9 8583.92 8663.23
## 10            10 8585.36 8671.88
##                                                                                                          Variables
## 1                                                                                                              Age
## 2                                                                                               Age, NumOfProducts
## 3                                                                               Age, NumOfProducts, IsActiveMember
## 4                                                                      Age, NumOfProducts, IsActiveMember, Balance
## 5                                                           Age, NumOfProducts, IsActiveMember, Balance, Geography
## 6                                                   Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender
## 7                                      Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore
## 8                              Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure
## 9             Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary
## 10 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary, HasCrCard

Final Plot

ggplot(data = results) +
  geom_point(aes(x = Num_variables, y = AIC, fill = "AIC"), shape = 21, color = "black") +
  geom_line(aes(x = Num_variables, y = AIC, group = 1, color = "AIC"), linetype = "solid") +
  geom_point(aes(x = Num_variables, y = BIC, fill = "BIC"), shape = 21, color = "black") +
  geom_line(aes(x = Num_variables, y = BIC, group = 1, color = "BIC"), linetype = "solid") +
  scale_fill_viridis_d(option = "plasma") +
  scale_color_viridis_d(option = "plasma") +
  labs(title = "AIC and BIC comparition",
       x = "Number of variables",
       y = "Value") +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  theme_minimal()

results%>%arrange(AIC,BIC)

##    Num_variables     AIC     BIC
## 1              8 8582.95 8655.06
## 2              7 8583.87 8648.76
## 3              9 8583.92 8663.23
## 4             10 8585.36 8671.88
## 5              6 8587.55 8645.23
## 6              5 8680.84 8731.31
## 7              4 8828.37 8864.42
## 8              3 8946.91 8975.76
## 9              2 9340.96 9362.60
## 10             1 9355.05 9369.47
##                                                                                                          Variables
## 1                              Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure
## 2                                      Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore
## 3             Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary
## 4  Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary, HasCrCard
## 5                                                   Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender
## 6                                                           Age, NumOfProducts, IsActiveMember, Balance, Geography
## 7                                                                      Age, NumOfProducts, IsActiveMember, Balance
## 8                                                                               Age, NumOfProducts, IsActiveMember
## 9                                                                                               Age, NumOfProducts
## 10                                                                                                             Age

Based on our graph we will take the model that has as independent variables the following: Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender. With this, we can run our final GLM model

## 
## Call:  glm(formula = Exited_numeric ~ Age + NumOfProducts + IsActiveMember + 
##     Balance + Geography + Gender, family = binomial(link = "logit"), 
##     data = customer)
## 
## Coefficients:
##      (Intercept)               Age     NumOfProducts   IsActiveMember1  
##       -3.885e+00         7.266e-02        -1.019e-01        -1.075e+00  
##          Balance  GeographyGermany    GeographySpain        GenderMale  
##        2.654e-06         7.715e-01         3.458e-02        -5.293e-01  
## 
## Degrees of Freedom: 9999 Total (i.e. Null);  9992 Residual
## Null Deviance:       10110 
## Residual Deviance: 8572  AIC: 8588

6 GLM Insights

## 
## Call:
## glm(formula = Exited_numeric ~ Age + NumOfProducts + IsActiveMember + 
##     Balance + Geography + Gender, family = binomial(link = "logit"), 
##     data = customer)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.885e+00  1.465e-01 -26.515  < 2e-16 ***
## Age               7.266e-02  2.573e-03  28.233  < 2e-16 ***
## NumOfProducts    -1.019e-01  4.708e-02  -2.165   0.0304 *  
## IsActiveMember1  -1.075e+00  5.759e-02 -18.675  < 2e-16 ***
## Balance           2.654e-06  5.137e-07   5.166 2.39e-07 ***
## GeographyGermany  7.715e-01  6.759e-02  11.414  < 2e-16 ***
## GeographySpain    3.458e-02  7.060e-02   0.490   0.6242    
## GenderMale       -5.293e-01  5.444e-02  -9.722  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10109.8  on 9999  degrees of freedom
## Residual deviance:  8571.5  on 9992  degrees of freedom
## AIC: 8587.5
## 
## Number of Fisher Scoring iterations: 5

With our final model called final_glm we can do the following statements about our target variable:

Age: If we have customers with similar characteristics, for a year of increase in the customer age, the odds of the customer to exit the bank change by a factor of 1.0753628 which is the exponential of the Age Coefficient 0.0726581
NumOfProducts: If we have customers with similar characteristics, for every increase in the number of products the customer has, the odds of the customer to exit the bank change by a factor of 0.9031001 which is the exponential of the Age Coefficient -0.1019219
IsActiveMember: If we have customers with similar characteristics, if the customer is an active member, the odds of the customer to exit the bank change by a factor of 0.3411374 which is the exponential of the Age Coefficient -1.07547. In case the customer is not active, the odds change by a factor of 2.9313703
Balance: If we have customers with similar characteristics, an increase in 1000 dollars in dollars, the odds of the customer to exit the bank change by a factor of 1.0026573 which is the exponential of the Age Coefficient 1000 X 2.653776^{-6}
Geography: If we have customers with similar characteristics, a customer that is from Germany, the odds of exiting are 2.1629295times than a customer that is from France and
2.0894113 times than a customer that is from Spain. In case with an spanish customer, the odds of exting change by a factor of 1.0351861 compared to one that is from France
GenderMale: If we have customers with similar characteristics, if the customer is an active member, the odds of the customer to exit the bank change by a factor of 0.5890427 which is the exponential of the Age Coefficient -0.5292566. In case the customer is not male, the odds change by a factor of 1.6976698

## Waiting for profiling to be done...

##           Variable   Lower_CI   Upper_CI
## 1      (Intercept) 0.01539049 0.02733679
## 2              Age 1.06997988 1.08083004
## 3    NumOfProducts 0.82321915 0.99011645
## 4  IsActiveMember1 0.30455465 0.38169850
## 5          Balance 1.00000165 1.00000366
## 6 GeographyGermany 1.89490486 2.46983146
## 7   GeographySpain 0.90089037 1.18817158
## 8       GenderMale 0.52934417 0.65528003

Confidence Intervals:
- Age: In the best case scenario, the odds change by a factor of 1.081, and in the worst case scenario, the odds change by a factor of 1.07.
- NumOfProducts: In the best case scenario, the odds change by a factor of 0.99, and in the worst case scenario, the odds change by a factor of 0.823.
- IsActiveMember: In the best case scenario, the odds change by a factor of 0.382, and in the worst case scenario, the odds change by a factor of 0.305. In case the customer is not active, the odds change by a factor of 3.283in best case scenario and 2.62 in the best scenario.
- Balance: In the best case scenario, the odds change by a factor of 1.004, and in the worst case scenario, the odds change by a factor of 1.002.
- Geography: In the best case scenario, the odds change by a factor of 2.47 times for a customer from Germany compared to France, and 2.742 times compared to Spain. In the worst case scenario, for a customer from Spain, the odds of exiting change by a factor of 0.901 compared to one that comes from France.
- GenderMale: In the best case scenario, the odds change by a factor of 0.655, and in the worst case scenario, the odds change by a factor of 0.529. In case the customer is not male, the odds change by a factor of 1.889 in the worst case scenario and the best case scenario 1.526 .

Final_Project

Jose Gabriel Usina Mogro

2023-11-27