1. Background

The Bank Customer Churn data set, publicly available on Kaggle, originates from a multinational financial institution.[1] The data were collected to analyze customer behavior and predict the likelihood of customer attrition, commonly known as churn. Churn is important for a bank to analyze because it directly impacts profitability and long-term business growth.[2] Retaining existing customers is generally more cost-effective than acquiring new ones, so understanding why customers leave allows banks to implement targeted retention strategies. By analyzing churn, banks can identify patterns and risk factors, such as low engagement, poor customer service, or uncompetitive product offerings, that may lead to customer dissatisfaction. This supports the development of personalized services, improves the customer experience, and ultimately reduces turnover, which enhances customer lifetime value and strengthens the bank’s competitive position.

2. Exploratory Data Analysis

Data Structure

The data set contains 10,000 records and 11 variables that provide detailed information about each customer. The binary response variable, Churn, indicates whether the customer left the bank. The predictor variables include demographic information such as each customer’s Country of residence, Gender, and Age. Relevant financial attributes that may affect whether a customer stays with the bank are also included. The data set records each customer’s Estimated Salary as well as their Credit Score, a common measure of creditworthiness used by lenders to evaluate the potential risk of lending money to consumers. The Balance variable reflects the amount of money in each customer’s existing bank account. Additional variables include Tenure (the number of years the customer has been a bank client), Products Number (the number of products a customer purchased from the bank), and two binary predictors: whether each customer has a Credit Card issued by the bank and whether they are an Active Member (as opposed to a passive client who does not engage bank resources regularly).

Data Summary

Frequency Table for Churn, Country, and Gender
Variable   Value     Frequency
churn      0               151
churn      1               149
country    France          100
country    Germany         100
country    Spain           100
gender     Male            156
gender     Female          144

Per the recommendation of the instructor, we randomly selected a subset of 300 observations to ensure the categorical variables (Churn, Country, and Gender) are relatively balanced. This helps prevent bias in exploratory analysis and model training by providing a representative sample rather than one skewed by class imbalance.
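
As a reference, a minimal sketch of one way to draw such a balanced subset (the exact sampling scheme and seed are assumptions; bank_data denotes the full data set):

# Draw roughly 50 customers per churn-by-country cell (6 cells ~ 300 rows)
library(dplyr)

set.seed(42)                              # hypothetical seed for reproducibility
bank_sample <- bank_data %>%
  group_by(churn, country) %>%
  slice_sample(n = 50) %>%
  ungroup()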

Bank Customer Churn Data Summary
 credit_score       age            tenure          balance
 Min.   :350.0   Min.   :18.00   Min.   : 0.000   Min.   :     0
 1st Qu.:584.0   1st Qu.:32.00   1st Qu.: 3.000   1st Qu.:     0
 Median :652.0   Median :37.00   Median : 5.000   Median : 97199
 Mean   :650.5   Mean   :38.92   Mean   : 5.013   Mean   : 76486
 3rd Qu.:718.0   3rd Qu.:44.00   3rd Qu.: 7.000   3rd Qu.:127644
 Max.   :850.0   Max.   :92.00   Max.   :10.000   Max.   :250898
 products_number   credit_card      active_member    estimated_salary
 Min.   :1.00     Min.   :0.0000   Min.   :0.0000   Min.   :    11.58
 1st Qu.:1.00     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 51002.11
 Median :1.00     Median :1.0000   Median :1.0000   Median :100193.91
 Mean   :1.53     Mean   :0.7055   Mean   :0.5151   Mean   :100090.24
 3rd Qu.:2.00     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:149388.25
 Max.   :4.00     Max.   :1.0000   Max.   :1.0000   Max.   :199992.48

Data Pre-processing

The data set does not include any missing values or duplicates; however, some variables were stored with an incorrect class (for example, categorical variables coded as numeric) and were converted before analysis. Next, we analyzed the relationship between the response variable Churn and various predictor variables.
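
A sketch of the class corrections (the exact columns and labels are assumptions based on the factor levels visible in the appendix output):

# Convert misclassified numeric codes to factors with readable labels
bank_sample$churn         <- factor(bank_sample$churn, levels = c(0, 1),
                                    labels = c("Not Churned", "Churned"))
bank_sample$country       <- factor(bank_sample$country)
bank_sample$gender        <- factor(bank_sample$gender)
bank_sample$credit_card   <- factor(bank_sample$credit_card, levels = c(0, 1),
                                    labels = c("No Credit Card", "Credit Card"))
bank_sample$active_member <- factor(bank_sample$active_member, levels = c(0, 1),
                                    labels = c("Inactive", "Active"))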

The figure above shows that Germany has the highest customer churn rate (61%), followed by France (45%) and Spain (43%). Germany’s higher churn rate indicates a potential issue in customer retention strategies specific to Germany.
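
A sketch of how a stacked-proportion chart like this could be reproduced (assuming ggplot2 and the factor labels defined above):

# Churn proportion within each country
library(ggplot2)

ggplot(bank_sample, aes(x = country, fill = churn)) +
  geom_bar(position = "fill") +
  labs(x = "Country", y = "Proportion of customers", fill = "Churn")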

The figure above shows that inactive members in the sample have higher churn rates compared to active members, particularly in France and Germany. This suggests that inactive membership status may be a significant risk factor for customer churn across all three countries, with the issue being most pronounced in Germany.

The figure above shows churn distribution by gender across France, Germany, and Spain. France and Spain show similar gender-based churn patterns, with females churning at a higher rate than males; however, this trend is reversed in Germany.

The figure above shows churn rates based on membership status and credit card ownership. Inactive members have much higher churn rates than active members. Having a credit card appears to modestly reduce the churn rate for both active and inactive members. The sample data suggest that focusing on activating inactive members could be more effective for reducing churn than promoting credit card adoption.

The correlation plot shows that most variables have weak correlations (close to 0) with each other. The strongest relationship is between Churn and Age (0.33), suggesting older customers are somewhat more likely to churn. Additionally, Balance has a modest positive correlation with Churn (0.21), indicating that higher account balances are associated with a somewhat higher churn rate. Products Number has a moderate negative relationship with Balance (-0.22).
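
A sketch of how such a correlation plot can be produced (corrplot is one option, an assumption about the package used; churn is converted back to 0/1 so it can be correlated with the numeric predictors):

# Correlation matrix of churn and the numeric predictors
library(corrplot)

num_vars <- data.frame(
  churn = as.numeric(bank_sample$churn) - 1,     # factor back to 0/1
  bank_sample[, c("credit_score", "age", "tenure", "balance",
                  "products_number", "estimated_salary")]
)
corrplot(cor(num_vars), method = "number")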

3. Data Methodology

We utilized the createDataPartition() function to split our sample into a 70% training subset and a 30% testing subset, so we could train the model and then evaluate how well it generalizes to new data. This helps mitigate overfitting and ensures the model is able to make accurate predictions on unseen observations.
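
A minimal sketch of the split (the seed is an assumption; stratifying on the response keeps the class balance similar in both subsets):

# 70/30 stratified train/test split on the response
library(caret)

set.seed(42)                                      # hypothetical seed
train_idx  <- createDataPartition(bank_sample$churn, p = 0.7, list = FALSE)
train_data <- bank_sample[train_idx, ]
test_data  <- bank_sample[-train_idx, ]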

Model Selection

Logistic Regression Full Model

We fit a logistic regression model on the training data and found that Age, Balance, and Active member status were significant predictors of Churn. Older customers and those with higher balances were more likely to churn, while active members were less likely. The model had 68.5% accuracy and an AUC of 0.804, showing solid performance in identifying customers at risk of churning. We also checked the Variance Inflation Factor (VIF) for the predictors to measure multicollinearity in the model, which can lead to unreliable estimates and inflated errors. All VIF values were near 1, indicating no multicollinearity.
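
A sketch of the full-model fit and the multicollinearity check (assuming churn ~ . uses all available predictors):

# Full logistic regression model and VIF check
log_model <- glm(churn ~ ., data = train_data, family = binomial())
summary(log_model)

library(car)
vif(log_model)    # values near 1 indicate no multicollinearity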

We performed a Hosmer–Lemeshow goodness-of-fit test to evaluate how well the logistic regression model fits the data. Both the “C” and “H” statistics had high p-values (0.6415 and 0.5888, respectively), so we fail to reject \(H_{0}\) at the \(\alpha = 0.05\) level, indicating that the logistic regression model fits the data well.
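
One package that reports both the “C” and “H” versions of the Hosmer–Lemeshow statistic is MKmisc (an assumption about the package actually used); a sketch of the test:

# Hosmer-Lemeshow goodness-of-fit test (C and H statistics)
library(MKmisc)

HLgof.test(fit = fitted(log_model),
           obs = as.numeric(train_data$churn) - 1)   # factor back to 0/1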

Model performance on the test data subset showed an accuracy of 0.6854, precision of 0.6739, recall of 0.7045, and an F1 score of 0.6889, indicating a reasonable balance between identifying true churn cases and limiting false positives. The area under the ROC curve (AUC) was 0.804, suggesting good classification capability. The Matthews Correlation Coefficient (MCC) was 0.3714, indicating a moderate positive correlation between predicted and actual outcomes.
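
A sketch of how these test-set metrics can be computed from the confusion matrix (the 0.5 classification threshold is an assumption):

# Predict on the test set and derive the evaluation metrics
library(pROC)

pred_prob  <- predict(log_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_prob > 0.5, "Churned", "Not Churned")

cm <- table(Predicted = pred_class, Actual = test_data$churn)
TP <- cm["Churned", "Churned"];      TN <- cm["Not Churned", "Not Churned"]
FP <- cm["Churned", "Not Churned"];  FN <- cm["Not Churned", "Churned"]

precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
mcc       <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
auc(roc(test_data$churn, pred_prob))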

Logistic Regression with Polynomial Terms Full Model

Next, we fit a second-degree polynomial logistic regression model using all numeric predictors and categorical variables. Model performance on the test set showed an accuracy of 0.7303, precision of 0.7083, recall of 0.7727, and an F1 score of 0.7391, indicating a strong balance between identifying true churn cases and limiting false positives. The area under the ROC curve (AUC) was 0.8495, suggesting good classification capability. The MCC was 0.4630, indicating a moderate positive correlation between predicted and actual outcomes. Among the predictors, polynomial terms for age, balance, and number of products were statistically significant, while the contributions of other variables such as gender and country were more limited. These results indicate that the model captures important nonlinear relationships in the data.
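
The model formula, reconstructed from the appendix output, applies second-degree orthogonal polynomials to every numeric predictor:

# Second-degree polynomial logistic regression (full model)
full_formula <- churn ~ poly(credit_score, 2) + poly(age, 2) + poly(tenure, 2) +
  poly(balance, 2) + poly(products_number, 2) + poly(estimated_salary, 2) +
  country + gender + credit_card + active_member

log_model_poly <- glm(full_formula, data = train_data, family = binomial())
summary(log_model_poly)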

Logistic Regression Stepwise-Reduced Model

To simplify the model, we used stepwise variable selection with AIC as the selection criterion. The reduced model retained only three predictors: Age, Balance, and Active Member status, suggesting that older customers, customers with higher balances, and customers who are not active members are more likely to churn. The model was evaluated on the test set, and metrics such as precision, recall, F1 score, AUC, and MCC indicated that the simplified model still performed reasonably well. This highlighted a few key customer characteristics that can flag potential churners while keeping the model interpretable.
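
A sketch of the stepwise search (the direction setting is an assumption; step() uses AIC by default):

# Stepwise selection on the full logistic model, using AIC
log_model_stepwise <- step(log_model, direction = "both", trace = 0)
summary(log_model_stepwise)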

Logistic Regression LASSO-Reduced Model

In order to limit overfitting and enhance accuracy, we used the LASSO regularization technique. After cross-validation, the optimal lambda minimized the model’s error and resulted in only a few non-zero coefficients: Age, Balance, Credit Card status, and Active Member status. Among these, older age and higher account balance were positively associated with churn, while owning a credit card and being an active member were associated with a lower likelihood of churn. The LASSO model thus automatically filtered out less relevant features, enhancing model interpretability and focusing on the strongest signals related to churn behavior.
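
A sketch of the LASSO fit (glmnet is one standard choice, an assumption about the package used; the seed is also an assumption):

# Cross-validated LASSO logistic regression
library(glmnet)

x <- model.matrix(churn ~ ., data = train_data)[, -1]      # drop intercept column
y <- train_data$churn

set.seed(42)                                               # hypothetical seed
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 is LASSO
coef(cv_fit, s = "lambda.min")   # non-zero coefficients at the optimal lambda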

The LASSO-reduced logistic regression model has an accuracy of 0.6404, the lowest among the candidate models. The F1 score of 0.7288 reflects a reasonable balance between precision and recall, although it is driven largely by the model’s very high recall (0.9773) at the cost of precision (0.5811). The MCC of 0.3852 suggests a moderate positive correlation between predicted and actual churn outcomes, and the AUC of 0.7985 shows that the model still distinguishes reasonably well between churned and non-churned customers. The model performs acceptably overall, but some areas need improvement.

Logistic Regression with Polynomial Terms Reduced Model

The reduced polynomial logistic regression model performs well in predicting churn, with an accuracy of 0.7303, indicating it correctly classifies churn and non-churn cases in most instances. The F1 score of 0.7447 reflects a balance between precision and recall, showing that the model is effective at identifying churned customers while minimizing false positives. Additionally, the MCC of 0.4657 suggests a moderate positive correlation between predicted and actual outcomes, highlighting the model’s ability to correctly predict both churned and non-churned customers. The model also has an AUC of 0.8101, indicating good performance in distinguishing between churned and non-churned customers. Together, these metrics demonstrate that the model is robust and provides meaningful predictive insights.

4. Results

ANOVA

ANOVA - Logistic Full vs. Logistic Full w/Polynomial Terms
  Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
        199        260.2   NA         NA         NA
        193        208.3    6      51.95   1.91e-09

ANOVA - Logistic Full vs. Logistic Stepwise Reduced
  Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
        207        263.4   NA         NA         NA
        199        260.2    8       3.16     0.9239

ANOVA - Logistic Full vs. Logistic LASSO Reduced
  Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
        206        261.5   NA         NA         NA
        199        260.2    7      1.244     0.9899

ANOVA - Logistic Full w/Polynomial Terms vs. Logistic Reduced w/Polynomial Terms
  Resid. Df   Resid. Dev   Df   Deviance   Pr(>Chi)
        204        255.6   NA         NA         NA
        193        208.3   11      47.34   1.87e-06

The ANOVA comparisons indicate clear differences among the models. In the first test, the significant deviance reduction (p = 1.91e-09) shows that adding polynomial terms to the full logistic model markedly improves the fit. In contrast, the comparisons between the full model and both the stepwise- and LASSO-reduced versions have high p-values, indicating that the dropped predictors do not significantly improve the fit of the linear model. Likewise, within the polynomial framework, the full model fits significantly better than its reduced counterpart (p = 1.87e-06).
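
These comparisons are likelihood-ratio (deviance) tests between nested models; a sketch of how they are produced:

# Deviance tests: reduced model listed first, larger model second
anova(log_model, log_model_poly, test = "Chisq")
anova(log_model_stepwise, log_model, test = "Chisq")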

Model Comparison

Method                Accuracy   Precision   Recall     F1       MCC      AUC
Log. Full               0.6854      0.6739   0.7045   0.6889   0.3714   0.8040
Log. w/Poly Full        0.7303      0.7083   0.7727   0.7391   0.4630   0.8495
Stepwise Reduced        0.7303      0.7083   0.7727   0.7391   0.4630   0.8273
LASSO Reduced           0.6404      0.5811   0.9773   0.7288   0.3852   0.7985
Log. w/Poly Reduced     0.7303      0.7000   0.7955   0.7447   0.4657   0.8101

Summary

We tested several models to predict customer churn, including logistic regression, stepwise logistic regression, LASSO, and second-degree polynomial logistic regression. The full logistic regression model had an accuracy of 0.6854 and an AUC of 0.804, identifying Age, Balance, and Active Member status as key predictors. The stepwise-reduced logistic regression model refined this further by selecting only Age, Balance, and Active Member status, showing that a simpler model can still perform reasonably well. The LASSO-reduced logistic regression model improved interpretability by selecting a sparse set of features (Age, Balance, Credit Card ownership, and Active Member status) while maintaining predictive strength. The full polynomial logistic regression model captured nonlinear relationships and had the strongest performance overall, with an accuracy of 0.7303, an F1 score of 0.7391, and an AUC of 0.8495, highlighting the importance of nonlinear effects. Across models, evaluation metrics such as the F1 score, Matthews Correlation Coefficient (MCC), and AUC consistently confirmed the predictive value of these features, especially Age and Balance, with Active Member status acting as a protective factor against churn.

5. Conclusion

We utilized various logistic regression models to predict customer churn using demographic and financial data from a bank’s client base. Across models—including basic logistic regression, stepwise selection, LASSO, and polynomial logistic regression—we consistently found that age, balance, and active membership status were key drivers of churn. Models with added complexity, like polynomial logistic regression, performed slightly better in terms of accuracy and AUC, but the simpler models also delivered meaningful insights. These findings indicate that a customer’s activity level (as shown by the Active Member status) and financial factors such as balance are key indicators of potential churn. By identifying these patterns, banks can better prioritize customer retention strategies and tailor their outreach to at-risk clients.

Key Insights

  1. Age is the most consistent significant predictor: older customers are more likely to churn.
  2. Balance is also positively associated with Churn (higher balances do not necessarily indicate retention).
  3. Active Members are less likely to churn, emphasizing the importance of consistent engagement.
  4. Polynomial modeling highlighted that simple linear assumptions miss important behavior patterns: nonlinear effects matter.

Recommendations

  1. Prioritize retention efforts for older clients, who appear more likely to churn. Consider loyalty incentives or tailored outreach.
  2. Monitor high-balance customers closely—flag those with low engagement and intervene with personalized services.
  3. Encourage active engagement through regular check-ins, usage-based rewards, or bundled services that promote more frequent interactions.

6. References

  1. Topre, Gaurav. “Bank Customer Churn Dataset.” Kaggle. Accessed May 7, 2025. https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset.

  2. “Customer Attrition.” Wikipedia, The Free Encyclopedia. Accessed May 4, 2025. https://en.wikipedia.org/w/index.php?title=Customer_attrition&oldid=1277920624.

Appendix: Logistic Regression Results

## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
## Setting levels: control = Not Churned, case = Churned
## Setting direction: controls < cases
## Precision:  0.7083333
## Recall:  0.7727273
## F1 Score:  0.7391304
## MCC:  0.4630214
## AUC:  0.8494949

Check for Influential Observations

# Do we have any influential observations? If so, how many?

# List of the fitted models to check
models <- list(
  log_model = log_model,
  log_model_poly = log_model_poly,
  log_model_stepwise = log_model_stepwise,
  log_model_poly_simple = log_model_poly_simple,
  log_model_simplified = log_model_simplified
)

# Loop through each model and check Cook's distance > 0.05
for (model_name in names(models)) {
  cat("\n--- Influential Observations for", model_name, "---\n")
  
  # Get Cook's distance
  cooks_d <- cooks.distance(models[[model_name]])
  
  # Find indices with Cook's D > 0.05
  influential_obs <- which(cooks_d > 0.05)
  
  # Print results
  if (length(influential_obs) > 0) {
    print(data.frame(
      Observation = influential_obs,
      Cooks_Distance = cooks_d[influential_obs]
    ))
  } else {
    cat("No influential observations with Cook's distance > 0.05\n")
  }
}
## 
## --- Influential Observations for log_model ---
## No influential observations with Cook's distance > 0.05
## 
## --- Influential Observations for log_model_poly ---
##      Observation Cooks_Distance
## 1148         109     0.08648228
## 3603         199     0.06238607
## 
## --- Influential Observations for log_model_stepwise ---
##      Observation Cooks_Distance
## 4645         191     0.07829124
## 3603         199     0.07277637
## 
## --- Influential Observations for log_model_poly_simple ---
##      Observation Cooks_Distance
## 5937         148     0.05310929
## 4645         191     0.07020027
## 
## --- Influential Observations for log_model_simplified ---
##      Observation Cooks_Distance
## 4645         191     0.07150868
## 3603         199     0.05622319

Using Cook’s distance, we identified four influential observations (191, 199, 109, and 148) with Cook’s distance greater than the 0.05 cutoff, with observations 191 and 199 appearing in the majority of our models.

# Remove influential observation 191 and re-fit the best model
influential_ids <- c(191)
new_data <- train_data[-influential_ids, ]

# Original polynomial logistic regression model (refit for comparison)
log_model_poly <- glm(full_formula, data = train_data, family = binomial())

# Refit the model
log_model_refit <- glm(full_formula, data = new_data, family = binomial())

# Compare with original model
summary(log_model_poly)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
summary(log_model_refit)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = new_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.22471    0.48832   2.508  0.01214 *  
## poly(credit_score, 2)1      0.09555    2.66114   0.036  0.97136    
## poly(credit_score, 2)2      4.38256    2.85584   1.535  0.12488    
## poly(age, 2)1              10.18792    2.59261   3.930 8.51e-05 ***
## poly(age, 2)2              -2.44031    2.42085  -1.008  0.31344    
## poly(tenure, 2)1           -1.03137    2.58858  -0.398  0.69031    
## poly(tenure, 2)2            1.79999    2.48417   0.725  0.46871    
## poly(balance, 2)1           6.34006    2.88356   2.199  0.02790 *  
## poly(balance, 2)2           2.53000    2.63373   0.961  0.33675    
## poly(products_number, 2)1   6.97934    4.54615   1.535  0.12473    
## poly(products_number, 2)2  25.10151    5.37358   4.671 2.99e-06 ***
## poly(estimated_salary, 2)1 -0.14777    2.52055  -0.059  0.95325    
## poly(estimated_salary, 2)2  3.76043    2.60353   1.444  0.14864    
## countryGermany             -0.11450    0.47347  -0.242  0.80891    
## countrySpain               -0.17280    0.43320  -0.399  0.68998    
## genderFemale               -0.02742    0.35354  -0.078  0.93818    
## credit_cardCredit Card     -0.66968    0.39693  -1.687  0.09158 .  
## active_memberActive        -1.00498    0.36808  -2.730  0.00633 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.12  on 209  degrees of freedom
## Residual deviance: 206.87  on 192  degrees of freedom
## AIC: 242.87
## 
## Number of Fisher Scoring iterations: 6
# Remove influential observation 199 and re-fit the best model
influential_ids <- c(199)
new_data <- train_data[-influential_ids, ]

# Original polynomial logistic regression model (refit for comparison)
log_model_poly <- glm(full_formula, data = train_data, family = binomial())

# Refit the model
log_model_refit <- glm(full_formula, data = new_data, family = binomial())

# Compare with original model
summary(log_model_poly)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
summary(log_model_refit)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = new_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.15651    0.48624   2.378  0.01739 *  
## poly(credit_score, 2)1      0.03904    2.67648   0.015  0.98836    
## poly(credit_score, 2)2      4.51884    2.87478   1.572  0.11598    
## poly(age, 2)1              10.54447    2.63112   4.008 6.13e-05 ***
## poly(age, 2)2              -1.63544    2.42866  -0.673  0.50070    
## poly(tenure, 2)1           -1.45578    2.58472  -0.563  0.57328    
## poly(tenure, 2)2            2.22500    2.50287   0.889  0.37401    
## poly(balance, 2)1           6.08765    2.90120   2.098  0.03588 *  
## poly(balance, 2)2           2.67599    2.65161   1.009  0.31288    
## poly(products_number, 2)1   6.57823    4.52907   1.452  0.14638    
## poly(products_number, 2)2  25.73978    5.40384   4.763 1.91e-06 ***
## poly(estimated_salary, 2)1 -0.80874    2.54111  -0.318  0.75028    
## poly(estimated_salary, 2)2  3.92333    2.61266   1.502  0.13319    
## countryGermany             -0.11213    0.47479  -0.236  0.81331    
## countrySpain               -0.19860    0.43226  -0.459  0.64591    
## genderFemale               -0.06064    0.35601  -0.170  0.86475    
## credit_cardCredit Card     -0.58737    0.39553  -1.485  0.13754    
## active_memberActive        -0.95211    0.36845  -2.584  0.00976 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.12  on 209  degrees of freedom
## Residual deviance: 206.13  on 192  degrees of freedom
## AIC: 242.13
## 
## Number of Fisher Scoring iterations: 6
# Remove influential observation 109 and re-fit the best model
influential_ids <- c(109)
new_data <- train_data[-influential_ids, ]

# Original polynomial logistic regression model (refit for comparison)
log_model_poly <- glm(full_formula, data = train_data, family = binomial())

# Refit the model
log_model_refit <- glm(full_formula, data = new_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Compare with original model
summary(log_model_poly)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
summary(log_model_refit)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = new_data)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.70011  100.60660   0.027  0.97859    
## poly(credit_score, 2)1       -0.40813    2.68882  -0.152  0.87935    
## poly(credit_score, 2)2        4.19860    2.87615   1.460  0.14434    
## poly(age, 2)1                 9.18038    2.56761   3.575  0.00035 ***
## poly(age, 2)2                -3.11379    2.36299  -1.318  0.18759    
## poly(tenure, 2)1             -0.99252    2.61843  -0.379  0.70465    
## poly(tenure, 2)2              1.33321    2.51040   0.531  0.59537    
## poly(balance, 2)1             6.80092    2.92103   2.328  0.01990 *  
## poly(balance, 2)2             2.00346    2.62318   0.764  0.44501    
## poly(products_number, 2)1    63.06275 3671.98735   0.017  0.98630    
## poly(products_number, 2)2    86.34263 3968.20247   0.022  0.98264    
## poly(estimated_salary, 2)1   -0.09103    2.52330  -0.036  0.97122    
## poly(estimated_salary, 2)2    3.60948    2.61771   1.379  0.16793    
## countryGermany               -0.05977    0.47456  -0.126  0.89978    
## countrySpain                 -0.24079    0.43514  -0.553  0.58002    
## genderFemale                  0.05267    0.35690   0.148  0.88268    
## credit_cardCredit Card       -0.62829    0.39742  -1.581  0.11389    
## active_memberActive          -1.04433    0.37336  -2.797  0.00516 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.12  on 209  degrees of freedom
## Residual deviance: 202.06  on 192  degrees of freedom
## AIC: 238.06
## 
## Number of Fisher Scoring iterations: 17
# Remove influential observation 148 and re-fit the best model
influential_ids <- c(148)
new_data <- train_data[-influential_ids, ]

# Original polynomial logistic regression model (refit for comparison)
log_model_poly <- glm(full_formula, data = train_data, family = binomial())

# Refit the model
log_model_refit <- glm(full_formula, data = new_data, family = binomial())

# Compare with original model
summary(log_model_poly)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
summary(log_model_refit)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = new_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.29077    0.49692   2.598  0.00939 ** 
## poly(credit_score, 2)1      0.44150    2.70520   0.163  0.87036    
## poly(credit_score, 2)2      4.55684    2.91772   1.562  0.11834    
## poly(age, 2)1              10.60274    2.66550   3.978 6.96e-05 ***
## poly(age, 2)2              -4.82676    2.40757  -2.005  0.04498 *  
## poly(tenure, 2)1           -2.11668    2.62664  -0.806  0.42033    
## poly(tenure, 2)2            1.86020    2.50528   0.743  0.45778    
## poly(balance, 2)1           6.70129    2.93963   2.280  0.02263 *  
## poly(balance, 2)2           3.06491    2.67962   1.144  0.25271    
## poly(products_number, 2)1   6.94155    4.56067   1.522  0.12800    
## poly(products_number, 2)2  25.76260    5.46938   4.710 2.47e-06 ***
## poly(estimated_salary, 2)1 -1.07834    2.56784  -0.420  0.67453    
## poly(estimated_salary, 2)2  3.85864    2.64298   1.460  0.14430    
## countryGermany             -0.14714    0.48074  -0.306  0.75954    
## countrySpain               -0.32319    0.44237  -0.731  0.46503    
## genderFemale               -0.06694    0.35893  -0.186  0.85206    
## credit_cardCredit Card     -0.64906    0.40012  -1.622  0.10477    
## active_memberActive        -1.07507    0.37620  -2.858  0.00427 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291.10  on 209  degrees of freedom
## Residual deviance: 202.13  on 192  degrees of freedom
## AIC: 238.13
## 
## Number of Fisher Scoring iterations: 6

Based on the R output above, we can examine what happens when each influential observation (191, 199, 109, and 148) is removed individually from the polynomial logistic regression model:

Observation 191: Removal slightly strengthens the age (1st degree) effect (from 9.66 to 10.19) and the active_member effect (from -0.98 to -1.00). The overall model improves, with AIC decreasing from 244.27 to 242.87.

Observation 199: Removal strengthens the age effect (from 9.66 to 10.54) and the products_number effect (from 25.21 to 25.74). AIC improves to 242.13.

Observation 109: Removal causes numerical issues (warning about fitted probabilities of 0 or 1), with extremely large coefficient estimates and standard errors for products_number variables. Despite this, AIC still improves to 238.06.

Observation 148: Removal strengthens several effects, including age (1st degree) becoming stronger (from 9.66 to 10.60) and age (2nd degree) becoming significant. AIC improves to 238.13.

Conclusion: Since removing individual influential observations consistently improves our model fit (lower AIC) and generally maintains the same significant predictors, it would be reasonable to remove these influential observations from the final model. Observation 109 in particular appears highly influential, causing numerical instability when removed alone. The final model would benefit from removing these observations, particularly 109 and 148, which show the largest AIC improvements.

# Remove all influential observations and re-fit the best model
influential_ids <- c(199, 191, 109, 148)
new_data <- train_data[-influential_ids, ]

# Original polynomial logistic regression model (refit for comparison)
log_model_poly <- glm(full_formula, data = train_data, family = binomial())

# Refit the model
log_model_refit <- glm(full_formula, data = new_data, family = binomial())
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Compare with original model
summary(log_model_poly)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = train_data)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 1.17136    0.48523   2.414 0.015778 *  
## poly(credit_score, 2)1      0.05481    2.66238   0.021 0.983574    
## poly(credit_score, 2)2      4.30254    2.85913   1.505 0.132365    
## poly(age, 2)1               9.66414    2.58509   3.738 0.000185 ***
## poly(age, 2)2              -3.70410    2.37496  -1.560 0.118843    
## poly(tenure, 2)1           -1.24055    2.58070  -0.481 0.630726    
## poly(tenure, 2)2            1.95072    2.47994   0.787 0.431518    
## poly(balance, 2)1           6.52801    2.89340   2.256 0.024060 *  
## poly(balance, 2)2           2.38204    2.62857   0.906 0.364823    
## poly(products_number, 2)1   6.77944    4.53416   1.495 0.134864    
## poly(products_number, 2)2  25.20724    5.38312   4.683 2.83e-06 ***
## poly(estimated_salary, 2)1 -0.35953    2.51745  -0.143 0.886435    
## poly(estimated_salary, 2)2  3.76055    2.60089   1.446 0.148213    
## countryGermany             -0.14241    0.47237  -0.301 0.763058    
## countrySpain               -0.21345    0.43232  -0.494 0.621489    
## genderFemale               -0.01009    0.35290  -0.029 0.977185    
## credit_cardCredit Card     -0.61807    0.39350  -1.571 0.116249    
## active_memberActive        -0.98461    0.36741  -2.680 0.007365 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 292.50  on 210  degrees of freedom
## Residual deviance: 208.27  on 193  degrees of freedom
## AIC: 244.27
## 
## Number of Fisher Scoring iterations: 6
summary(log_model_refit)
## 
## Call:
## glm(formula = full_formula, family = binomial(), data = new_data)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.89907  101.47008   0.029  0.97721    
## poly(credit_score, 2)1       -0.02682    2.75323  -0.010  0.99223    
## poly(credit_score, 2)2        4.88761    2.94908   1.657  0.09745 .  
## poly(age, 2)1                11.73218    2.74030   4.281 1.86e-05 ***
## poly(age, 2)2                 0.06938    2.56277   0.027  0.97840    
## poly(tenure, 2)1             -1.88312    2.68849  -0.700  0.48365    
## poly(tenure, 2)2              1.41264    2.57445   0.549  0.58320    
## poly(balance, 2)1             6.13792    2.97113   2.066  0.03884 *  
## poly(balance, 2)2             3.27540    2.71845   1.205  0.22825    
## poly(products_number, 2)1    63.22355 3659.30481   0.017  0.98622    
## poly(products_number, 2)2    87.18886 3933.13523   0.022  0.98231    
## poly(estimated_salary, 2)1   -1.08210    2.60415  -0.416  0.67775    
## poly(estimated_salary, 2)2    3.97251    2.68252   1.481  0.13864    
## countryGermany                0.02264    0.48964   0.046  0.96312    
## countrySpain                 -0.25577    0.44716  -0.572  0.56733    
## genderFemale                 -0.09725    0.36795  -0.264  0.79155    
## credit_cardCredit Card       -0.70538    0.40975  -1.721  0.08516 .  
## active_memberActive          -1.13725    0.38621  -2.945  0.00323 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 286.96  on 206  degrees of freedom
## Residual deviance: 191.85  on 189  degrees of freedom
## AIC: 227.85
## 
## Number of Fisher Scoring iterations: 17

When all four influential observations (199, 191, 109, 148) are removed simultaneously:

The model shows a substantial improvement, with AIC decreasing from 244.27 to 227.85. We still see the warning about fitted probabilities of 0 or 1, and the extremely large coefficients and standard errors for the products_number terms point to quasi-complete separation once these observations are removed. The key significant predictors remain consistent (age, balance, active_member), with credit_score (2nd degree) becoming marginally significant. The coefficient for age (1st degree) strengthens considerably, from 9.66 to 11.73.

Conclusion: Removing all four influential observations substantially improves model fit and clarifies variable effects. Despite the numerical issues affecting the products_number terms, this approach is recommended for the final model, as it provides better predictive performance and more stable coefficient estimates for the remaining variables.