First thing we are going to do is to load our data and have a global visualization from it.
We called the dataset customer
We want to know how is our data structured and see what we can do with it
## 'data.frame': 10000 obs. of 14 variables:
## $ RowNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CustomerId : int 15634602 15647311 15619304 15701354 15737888 15574012 15592531 15656148 15792365 15592389 ...
## $ Surname : chr "Hargrave" "Hill" "Onio" "Boni" ...
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : chr "France" "Spain" "France" "France" ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : int 1 0 1 0 0 1 0 1 0 0 ...
## RowNumber CustomerId Surname CreditScore
## Min. : 1 Min. :15565701 Length:10000 Min. :350.0
## 1st Qu.: 2501 1st Qu.:15628528 Class :character 1st Qu.:584.0
## Median : 5000 Median :15690738 Mode :character Median :652.0
## Mean : 5000 Mean :15690941 Mean :650.5
## 3rd Qu.: 7500 3rd Qu.:15753234 3rd Qu.:718.0
## Max. :10000 Max. :15815690 Max. :850.0
## Geography Gender Age Tenure
## Length:10000 Length:10000 Min. :18.00 Min. : 0.000
## Class :character Class :character 1st Qu.:32.00 1st Qu.: 3.000
## Mode :character Mode :character Median :37.00 Median : 5.000
## Mean :38.92 Mean : 5.013
## 3rd Qu.:44.00 3rd Qu.: 7.000
## Max. :92.00 Max. :10.000
## Balance NumOfProducts HasCrCard IsActiveMember
## Min. : 0 Min. :1.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 97199 Median :1.00 Median :1.0000 Median :1.0000
## Mean : 76486 Mean :1.53 Mean :0.7055 Mean :0.5151
## 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :250898 Max. :4.00 Max. :1.0000 Max. :1.0000
## EstimatedSalary Exited
## Min. : 11.58 Min. :0.0000
## 1st Qu.: 51002.11 1st Qu.:0.0000
## Median :100193.91 Median :0.0000
## Mean :100090.24 Mean :0.2037
## 3rd Qu.:149388.25 3rd Qu.:0.0000
## Max. :199992.48 Max. :1.0000
For this project, the target variable will be Exited. We want to predict if a Customer will leave the bank based on his data.
As we can see in the previous step, we can see that we have some variables that are not helpful for our prediction. This columns are:
RowNumberCustomerIdSurname## CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 1 619 France Female 42 2 0.00 1 1
## 2 608 Spain Female 41 1 83807.86 1 0
## 3 502 France Female 42 8 159660.80 3 1
## 4 699 France Female 39 1 0.00 2 0
## 5 850 Spain Female 43 2 125510.82 1 1
## 6 645 Spain Male 44 8 113755.78 2 1
## 7 822 France Male 50 7 0.00 2 1
## 8 376 Germany Female 29 4 115046.74 4 1
## 9 501 France Male 44 4 142051.07 2 0
## 10 684 France Male 27 2 134603.88 1 1
## IsActiveMember EstimatedSalary Exited
## 1 1 101348.88 1
## 2 1 112542.58 0
## 3 0 113931.57 1
## 4 0 93826.63 0
## 5 1 79084.10 0
## 6 0 149756.71 1
## 7 1 10062.80 0
## 8 0 119346.88 1
## 9 1 74940.50 0
## 10 1 71725.73 0
With the columns dropped, now we have to verify that all of the data is in the right data type. We are going to do two changes here:
Exited: Since we are planning to use a random
forest, this variable has to be a factor, so the model can run
properly
Geography: We want to change this variable to a
factor, since we have different values for country and we want to add it
to the model.
After all the cleaning this how our data will look
## 'data.frame': 10000 obs. of 11 variables:
## $ CreditScore : int 619 608 502 699 850 645 822 376 501 684 ...
## $ Geography : Factor w/ 3 levels "France","Germany",..: 1 3 1 1 3 3 1 2 1 1 ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Age : int 42 41 42 39 43 44 50 29 44 27 ...
## $ Tenure : int 2 1 8 1 2 8 7 4 4 2 ...
## $ Balance : num 0 83808 159661 0 125511 ...
## $ NumOfProducts : int 1 1 3 2 1 2 2 4 2 1 ...
## $ HasCrCard : int 1 0 1 0 1 1 1 1 0 1 ...
## $ IsActiveMember : int 1 1 0 0 1 0 1 0 1 1 ...
## $ EstimatedSalary: num 101349 112543 113932 93827 79084 ...
## $ Exited : Factor w/ 2 levels "0","1": 2 1 2 1 1 2 1 2 1 1 ...
## CreditScore Geography Gender Age
## Min. :350.0 France :5014 Length:10000 Min. :18.00
## 1st Qu.:584.0 Germany:2509 Class :character 1st Qu.:32.00
## Median :652.0 Spain :2477 Mode :character Median :37.00
## Mean :650.5 Mean :38.92
## 3rd Qu.:718.0 3rd Qu.:44.00
## Max. :850.0 Max. :92.00
## Tenure Balance NumOfProducts HasCrCard
## Min. : 0.000 Min. : 0 Min. :1.00 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.: 0 1st Qu.:1.00 1st Qu.:0.0000
## Median : 5.000 Median : 97199 Median :1.00 Median :1.0000
## Mean : 5.013 Mean : 76486 Mean :1.53 Mean :0.7055
## 3rd Qu.: 7.000 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :10.000 Max. :250898 Max. :4.00 Max. :1.0000
## IsActiveMember EstimatedSalary Exited
## Min. :0.0000 Min. : 11.58 0:7963
## 1st Qu.:0.0000 1st Qu.: 51002.11 1:2037
## Median :1.0000 Median :100193.91
## Mean :0.5151 Mean :100090.24
## 3rd Qu.:1.0000 3rd Qu.:149388.25
## Max. :1.0000 Max. :199992.48
In order to get some insights about our data before build the models, it is important to look at some visualizations of our data.
ggplot(data=customer)+
geom_bar(mapping = aes(x=IsActiveMember,fill=Exited),position="fill")+
labs(x="Customer is Active",y="Percentage")+
ggtitle("Customer Is Active vs Exited")+
scale_fill_viridis_d(labels = c("Not Exited", "Exited"))It was important to see this plot, because we didn’t want to have complete separation. It would be easier to say taht every Customer that is not active, will be exited, but this is not the case.
ggplot(data=customer)+
geom_histogram(aes(x=Balance,fill=Exited),position = "fill",bins=30)+
labs(x="Account Balance",y="Proportion")+
ggtitle("Distribution of Account Balance by Exit Status")+
scale_fill_viridis_d(labels = c("Not Exited", "Exited"))ggplot(data = customer, aes(x = Geography, y = Age, fill = Gender)) +
geom_boxplot() +
labs(x = "Geography", y = "Age", fill = "Gender") +
scale_fill_viridis_d(name = "Gender") +
facet_wrap(~Exited, scales = "free", labeller = labeller(Exited = c("0" = "Not Exited", "1" = "Exited"))) +
ggtitle("Distribution of Age by Gender and Geography")The boxplot for age is noticeably higher among customers who have exited. The behaviour between the countries is similar
Fist thing we need to do is split our data in train and test set
set.seed(2356)
dim<-dim(customer)
train.idx<-sample(x=1:nrow(customer),size=floor(0.8*nrow(customer)))
train.df<-customer[train.idx,]
test.df<-customer[-train.idx,]We are going to fit a baseline forest, and the proceed with the tuning
##
## Call:
## randomForest(formula = Exited ~ ., data = train.df, ntree = 1000, mtry = 4)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.8%
## Confusion matrix:
## 0 1 class.error
## 0 6103 257 0.04040881
## 1 847 793 0.51646341
Now it is time to tuning or forest. In this case we are going to tune
the the number of variables to randomly sample as candidates at each
split mtry of our forest
We are going to create a loop where we are going to try the different
values for mtry. We are going to save the
oob_error_rate in a new dataset called
keeps
We are going to set a loop so we can see which value of mtry gives us the best result
#loop over each element of mtry
for(i in 1:length(mtry)){
print(paste0("Fitting m=",mtry[i]))
temp_forest<-randomForest(Exited~.,
data = train.df,
ntree=1000,
mtry=mtry[i]) #dinamically changing mtry value
#record the results
keeps[i,"m"]<-mtry[i]
keeps[i,"oob_error_rate"]<-mean(predict(temp_forest)!=train.df$Exited)
}## [1] "Fitting m=1"
## [1] "Fitting m=2"
## [1] "Fitting m=3"
## [1] "Fitting m=4"
## [1] "Fitting m=5"
## [1] "Fitting m=6"
## [1] "Fitting m=7"
## [1] "Fitting m=8"
## [1] "Fitting m=9"
## [1] "Fitting m=10"
ggplot(data=keeps)+
geom_line(aes(x=m,y=oob_error_rate))+
scale_x_continuous(breaks=c(1:length(mtry)))The best value for m is 3with an obb_value of 0.138375
Now we are going to set our final forest
final_forest<-randomForest(Exited~.,
data=train.df,
ntree=1000,
mtry=subset(keeps, oob_error_rate == min(oob_error_rate))$m,
importance=TRUE) #based on tuning excercise
final_forest##
## Call:
## randomForest(formula = Exited ~ ., data = train.df, ntree = 1000, mtry = subset(keeps, oob_error_rate == min(oob_error_rate))$m, importance = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 13.84%
## Confusion matrix:
## 0 1 class.error
## 0 6131 229 0.03600629
## 1 878 762 0.53536585
# Calculate predicted probabilities ('pi_hat') using the 'final_forest' model
pi_hat <- predict(final_forest, test.df, type = 'prob')[, '1']
# Create an ROC curve using the 'pROC' package
rocCurve <- roc(response = test.df$Exited,
predictor = pi_hat,
levels = c("0", "1"))## Setting direction: controls < cases
# Plot the ROC curve with additional information
plot(rocCurve, print.thres = TRUE, print.auc = TRUE)Based on our ROC curve, the pi* for prediction should 0.2795
# Calculate the threshold for classification based on ROC curve
pi_star <- coords(rocCurve, "best", ret = "threshold")$threshold
# Add a new column 'forest_pred' to 'test.df' based on the threshold
# If 'pi_hat' is greater than 'pi_star', set to 1, else set to 0
test.df$forest_pred <- ifelse(pi_hat > pi_star, 1, 0)
head(test.df)## CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 5 850 Spain Female 43 2 125510.82 1 1
## 20 726 France Female 24 6 0.00 2 1
## 21 732 France Male 41 8 0.00 2 1
## 30 411 France Male 29 0 59697.17 2 1
## 39 850 France Male 36 7 0.00 1 1
## 43 556 France Female 61 2 117419.35 1 1
## IsActiveMember EstimatedSalary Exited forest_pred
## 5 1 79084.10 0 0
## 20 1 54724.03 0 0
## 21 1 170886.17 0 0
## 30 1 53483.21 0 0
## 39 1 40812.90 0 0
## 43 1 94153.83 0 1
We decided to usea logistic model because we have a binary target
variable Exited, so we have a Bernoulli distribution where
the response for our Y variables is 1 or 0.
# Add a new column 'Exited_bin' to 'customer'
# If 'Exited' is equal to 1, set 'Exited_bin' to 1; otherwise, set to 0
customer$Exited_numeric <- ifelse(customer$Exited == 1, 1, 0)
customer$IsActiveMember<-as.factor(customer$IsActiveMember)
customer$HasCrCard<-as.factor(customer$HasCrCard)
head(customer)## CreditScore Geography Gender Age Tenure Balance NumOfProducts HasCrCard
## 1 619 France Female 42 2 0.00 1 1
## 2 608 Spain Female 41 1 83807.86 1 0
## 3 502 France Female 42 8 159660.80 3 1
## 4 699 France Female 39 1 0.00 2 0
## 5 850 Spain Female 43 2 125510.82 1 1
## 6 645 Spain Male 44 8 113755.78 2 1
## IsActiveMember EstimatedSalary Exited Exited_numeric
## 1 1 101348.88 1 1
## 2 1 112542.58 0 0
## 3 0 113931.57 1 1
## 4 0 93826.63 0 0
## 5 1 79084.10 0 0
## 6 0 149756.71 1 1
# Create a data frame containing variable importance values
vi <- as.data.frame(varImpPlot(final_forest, type = 1))# Add a column 'Variable' to store variable names
vi$Variable <- rownames(vi)
# Create a bar plot of variable importance with ggplot2
ggplot(data = vi) +
geom_bar(aes(x = reorder(Variable, MeanDecreaseAccuracy), weight = MeanDecreaseAccuracy),
position = "identity") +
# Flip the coordinates for a horizontal bar plot
coord_flip() +
# Label the axes
labs(x = "Variable Name", y = "Importance")+
# Add a title to the graph
ggtitle("Variable Importance Analysis")# Arrange the 'vi' data frame in descending order based on 'MeanDecreaseAccuracy'
# Select only the 'Variable' column
variables <- vi %>%
arrange(desc(MeanDecreaseAccuracy)) %>%
dplyr::select(Variable)
variables## Variable
## Age Age
## NumOfProducts NumOfProducts
## IsActiveMember IsActiveMember
## Balance Balance
## Geography Geography
## Gender Gender
## CreditScore CreditScore
## Tenure Tenure
## EstimatedSalary EstimatedSalary
## HasCrCard HasCrCard
We create a new data frame in order to save the AIC and the BIC of the models we are going to create
We create a function in order to fit a logistic model with a log link.
# Function to fit a logistic regression model and calculate AIC, BIC
# Arguments:
# - data: DataFrame containing the dataset
# - variables: Vector of predictor variables for the logistic regression
fitting_model <- function(data, variables) {
# Construct the formula for logistic regression
formula <- as.formula(paste("Exited_numeric ~", paste(variables, collapse = "+")))
# Fit the logistic regression model
model <- glm(formula,
data = data,
family = binomial(link = "logit")) # Specify binomial family and logit link function
# Calculate AIC and BIC
aic <- AIC(model)
bic <- BIC(model)
# Create a summary vector with relevant information
final <- c(num_variables = length(variables),
AIC = round(aic, 2),
BIC = round(bic, 2),
Variables = paste(variables, collapse = ", " ))
# Return the summary vector
return(final)
}# Loop through each row of the 'variables' data frame
for (i in 1:nrow(variables)) {
# Select the first i columns from the 'variables' data frame
columns <- variables[1:i, ]
# Call the fitting_model function to fit a logistic regression model
metrics <- fitting_model(customer, columns)
# Create a data frame with the results and append it to the 'results' data frame
results <- data.frame(rbind(results, metrics))
}## Num_variables AIC BIC
## 1 1 9355.05 9369.47
## 2 2 9340.96 9362.60
## 3 3 8946.91 8975.76
## 4 4 8828.37 8864.42
## 5 5 8680.84 8731.31
## 6 6 8587.55 8645.23
## 7 7 8583.87 8648.76
## 8 8 8582.95 8655.06
## 9 9 8583.92 8663.23
## 10 10 8585.36 8671.88
## Variables
## 1 Age
## 2 Age, NumOfProducts
## 3 Age, NumOfProducts, IsActiveMember
## 4 Age, NumOfProducts, IsActiveMember, Balance
## 5 Age, NumOfProducts, IsActiveMember, Balance, Geography
## 6 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender
## 7 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore
## 8 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure
## 9 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary
## 10 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary, HasCrCard
Final Plot
ggplot(data = results) +
geom_point(aes(x = Num_variables, y = AIC, fill = "AIC"), shape = 21, color = "black") +
geom_line(aes(x = Num_variables, y = AIC, group = 1, color = "AIC"), linetype = "solid") +
geom_point(aes(x = Num_variables, y = BIC, fill = "BIC"), shape = 21, color = "black") +
geom_line(aes(x = Num_variables, y = BIC, group = 1, color = "BIC"), linetype = "solid") +
scale_fill_viridis_d(option = "plasma") +
scale_color_viridis_d(option = "plasma") +
labs(title = "AIC and BIC comparition",
x = "Number of variables",
y = "Value") +
scale_x_continuous(breaks = seq(1, 10, 1)) +
theme_minimal()## Num_variables AIC BIC
## 1 8 8582.95 8655.06
## 2 7 8583.87 8648.76
## 3 9 8583.92 8663.23
## 4 10 8585.36 8671.88
## 5 6 8587.55 8645.23
## 6 5 8680.84 8731.31
## 7 4 8828.37 8864.42
## 8 3 8946.91 8975.76
## 9 2 9340.96 9362.60
## 10 1 9355.05 9369.47
## Variables
## 1 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure
## 2 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore
## 3 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary
## 4 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender, CreditScore, Tenure, EstimatedSalary, HasCrCard
## 5 Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender
## 6 Age, NumOfProducts, IsActiveMember, Balance, Geography
## 7 Age, NumOfProducts, IsActiveMember, Balance
## 8 Age, NumOfProducts, IsActiveMember
## 9 Age, NumOfProducts
## 10 Age
Based on our graph we will take the model that has as independent variables the following: Age, NumOfProducts, IsActiveMember, Balance, Geography, Gender. With this, we can run our final GLM model
##
## Call: glm(formula = Exited_numeric ~ Age + NumOfProducts + IsActiveMember +
## Balance + Geography + Gender, family = binomial(link = "logit"),
## data = customer)
##
## Coefficients:
## (Intercept) Age NumOfProducts IsActiveMember1
## -3.885e+00 7.266e-02 -1.019e-01 -1.075e+00
## Balance GeographyGermany GeographySpain GenderMale
## 2.654e-06 7.715e-01 3.458e-02 -5.293e-01
##
## Degrees of Freedom: 9999 Total (i.e. Null); 9992 Residual
## Null Deviance: 10110
## Residual Deviance: 8572 AIC: 8588
##
## Call:
## glm(formula = Exited_numeric ~ Age + NumOfProducts + IsActiveMember +
## Balance + Geography + Gender, family = binomial(link = "logit"),
## data = customer)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.885e+00 1.465e-01 -26.515 < 2e-16 ***
## Age 7.266e-02 2.573e-03 28.233 < 2e-16 ***
## NumOfProducts -1.019e-01 4.708e-02 -2.165 0.0304 *
## IsActiveMember1 -1.075e+00 5.759e-02 -18.675 < 2e-16 ***
## Balance 2.654e-06 5.137e-07 5.166 2.39e-07 ***
## GeographyGermany 7.715e-01 6.759e-02 11.414 < 2e-16 ***
## GeographySpain 3.458e-02 7.060e-02 0.490 0.6242
## GenderMale -5.293e-01 5.444e-02 -9.722 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10109.8 on 9999 degrees of freedom
## Residual deviance: 8571.5 on 9992 degrees of freedom
## AIC: 8587.5
##
## Number of Fisher Scoring iterations: 5
With our final model called final_glm we can do the
following statements about our target variable:
Age: If we have customers with similar characteristics, for a year of increase in the customer age, the odds of the customer to exit the bank change by a factor of 1.0753628 which is the exponential of the Age Coefficient 0.0726581
NumOfProducts: If we have customers with similar characteristics, for every increase in the number of products the customer has, the odds of the customer to exit the bank change by a factor of 0.9031001 which is the exponential of the Age Coefficient -0.1019219
IsActiveMember: If we have customers with similar characteristics, if the customer is an active member, the odds of the customer to exit the bank change by a factor of 0.3411374 which is the exponential of the Age Coefficient -1.07547. In case the customer is not active, the odds change by a factor of 2.9313703
Balance: If we have customers with similar characteristics, an increase in 1000 dollars in dollars, the odds of the customer to exit the bank change by a factor of 1.0026573 which is the exponential of the Age Coefficient 1000 X 2.653776^{-6}
Geography: If we have customers with similar
characteristics, a customer that is from Germany, the odds of exiting
are 2.1629295times than a customer that is from France and
2.0894113 times than a customer that is from Spain. In case with an
spanish customer, the odds of exting change by a factor of 1.0351861
compared to one that is from France
GenderMale: If we have customers with similar characteristics, if the customer is an active member, the odds of the customer to exit the bank change by a factor of 0.5890427 which is the exponential of the Age Coefficient -0.5292566. In case the customer is not male, the odds change by a factor of 1.6976698
## Waiting for profiling to be done...
## Variable Lower_CI Upper_CI
## 1 (Intercept) 0.01539049 0.02733679
## 2 Age 1.06997988 1.08083004
## 3 NumOfProducts 0.82321915 0.99011645
## 4 IsActiveMember1 0.30455465 0.38169850
## 5 Balance 1.00000165 1.00000366
## 6 GeographyGermany 1.89490486 2.46983146
## 7 GeographySpain 0.90089037 1.18817158
## 8 GenderMale 0.52934417 0.65528003
Age: In the best case scenario, the odds change by a factor of 1.081, and in the worst case scenario, the odds change by a factor of 1.07.
NumOfProducts: In the best case scenario, the odds change by a factor of 0.99, and in the worst case scenario, the odds change by a factor of 0.823.
IsActiveMember: In the best case scenario, the odds change by a factor of 0.382, and in the worst case scenario, the odds change by a factor of 0.305. In case the customer is not active, the odds change by a factor of 3.283in best case scenario and 2.62 in the best scenario.
Balance: In the best case scenario, the odds change by a factor of 1.004, and in the worst case scenario, the odds change by a factor of 1.002.
Geography: In the best case scenario, the odds change by a factor of 2.47 times for a customer from Germany compared to France, and 2.742 times compared to Spain. In the worst case scenario, for a customer from Spain, the odds of exiting change by a factor of 0.901 compared to one that comes from France.
GenderMale: In the best case scenario, the odds change by a factor of 0.655, and in the worst case scenario, the odds change by a factor of 0.529. In case the customer is not male, the odds change by a factor of 1.889 in the worst case scenario and the best case scenario 1.526 .