knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(fastDummies)
library(rpart)
library(rpart.plot)
library(pROC)
library(randomForest)
library(moderndive)


bank = read_rds("/Users/rochellerafn/RStudio Files/BankChurners.rds")

Multiple Regression

Step 0

No points or answer needed, but take a couple of minutes to look at the dataset, column names, etc.

Fit the model (5pts)

Use the ‘lm’ function to run a linear regression model with the natural log of Total_Trans_Amt as the dependent variable and both Customer Age and Gender as predictors. Use the entire dataset (no need to split into test and training).

Answer

bank <- bank %>%
  mutate(lTotal_Trans_Amt = log(Total_Trans_Amt))

m1 <- lm(lTotal_Trans_Amt ~ Customer_Age + Gender, data = bank)
m2 <- lm(lTotal_Trans_Amt ~ Customer_Age * Gender, data = bank)

Interpretation (5pts)

Explain the effect of both customer age and gender on total transaction amount. Use language that a non-technical manager can understand–ie. make sure you transform the coefficients so the effects are more intuitive.

Answer In this example we are looking at two separate variables as they relate to the log total transaction amount. Since we logged the transaction amount, that changes the transaction amount to a percentage. So, the total transaction amount goes down 0.5% as the customer age increases. Also, the total transaction amount is 7% less when gender = male. This is Customer_Age + Gender. It didn’t necessarily ask for it in the question… but I also added an interaction variable which is the relation of log transaction amount to age * gender, resulting in if both are true. So, in this example the total transaction amount decreases by 2% as age increases in people identifying as male.

get_regression_table(m1)

term	estimate	std_error	statistic	lower_ci	upper_ci
intercept	8.432	0.039	218.342	8.356	8.507
Customer_Age	-0.005	0.001	-6.233	-0.007	-0.003
Gender: M	-0.070	0.013	-5.421	-0.096	-0.045

get_regression_summaries(m1)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.007	0.006	0.4254496	0.652265	0.652	33.544	0	2	10127

get_regression_table(m2)

term	estimate	std_error	statistic	p_value	lower_ci	upper_ci
intercept	8.387	0.053	158.656	0.000	8.283	8.490
Customer_Age	-0.004	0.001	-3.632	0.000	-0.006	-0.002
Gender: M	0.023	0.076	0.301	0.763	-0.126	0.172
Customer_Age:GenderM	-0.002	0.002	-1.245	0.213	-0.005	0.001

get_regression_summaries(m2)

r_squared	adj_r_squared	mse	rmse	sigma	statistic	p_value	df	nobs
0.007	0.006	0.4253844	0.652215	0.652	22.881	0	3	10127

Decision Trees

Run the model (5pts)

Set the random seed to 504, and split the data 60/40 into train and test. Use the train function from the Caret library to run a basic decision tree model to determine whether a customer is likely to churn or not (i.e. the Churn column). Use the training dataset and a fixed complexity parameter of 0.03.

Note: Use ‘trControl = trainControl(method = “boot”, number = 1)’ for subsampling

Answer

set.seed(504)
train_index_bdt <- createDataPartition(bank$Churn, p = 0.60, list = FALSE)
train_bdt <- bank[train_index_bdt, ]
test_bdt <- bank[-train_index_bdt, ]



fit_bdt <- train(Churn ~.,
                  data = train_bdt,
                  method = "rpart",
                  metric = "Kappa",
                  trControl = trainControl(method = "boot", number =1))

fit_bdt

## CART 
## 
## 6077 samples
##   20 predictor
##    2 classes: 'no', 'yes' 
## 
## No pre-processing
## Resampling: Bootstrapped (1 reps) 
## Summary of sample sizes: 6077 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.03684749  0.8989626  0.6300442
##   0.06755374  0.8872350  0.5629426
##   0.17195496  0.8218313  0.0000000
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03684749.

Visualize (2pts)

Plot the resulting tree using the rpart.plot function.

Answer

rpart.plot(fit_bdt$finalModel, type = 2)

Interpretation (3pts)

Use the plot of your decision tree to construct a coherent sentence that explains the data in the right-most, bottom leaf of the tree.

Answer: So, first we can see that the leaves on the tree have been color coded, blue for churn = “no” and green for churn = “yes”. From the top, if the age of the customer is >=55 we split to the left and end up at a “no” as the most common outcome 66% of the time with an accuracy of 5%.

Out of sample performance (5pts)

Show the confusion matrix on the test data. Why might Accuracy not be the most meaningful metric for this dataset?

Answer Kappa gives us another measurement and equation to evaluate a type of “accuracy”. Accuracy is not great is there is a data set that is skewed to one variable over the other. Such as the fraud example we’ve used in class. Maybe there’s only 1 fraud event out of 100 cases… so if we say there is no fraud, we would be 99% accurate… but we would be missing the fraud. Kappa compares observed accuracy wtih expected accuracy giving us another way to look at it.

pred_bdt <- predict(fit_bdt, newdata = test_bdt)
confusionMatrix(factor(pred_bdt), factor(test_bdt$Churn))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   no  yes
##        no  3264  260
##        yes  136  390
##                                           
##                Accuracy : 0.9022          
##                  95% CI : (0.8927, 0.9112)
##     No Information Rate : 0.8395          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6068          
##                                           
##  Mcnemar's Test P-Value : 6.37e-10        
##                                           
##             Sensitivity : 0.9600          
##             Specificity : 0.6000          
##          Pos Pred Value : 0.9262          
##          Neg Pred Value : 0.7414          
##              Prevalence : 0.8395          
##          Detection Rate : 0.8059          
##    Detection Prevalence : 0.8701          
##       Balanced Accuracy : 0.7800          
##                                           
##        'Positive' Class : no              
##

Logistic Regression

Setup (2pt)

Starting from the original, full dataset, create a new variable called ‘Churned’ that is 1 if a customer churned and 0 otherwise, then delete the Churn column. Now redo the 60/40 train test split on the dataset after these steps.

Answer

bank_2  <- bank %>%
  mutate(Churned = ifelse(bank$Churn == "yes", 1, 0)) %>%
  select(-Churn)

str(bank_2)

## tibble [10,127 × 21] (S3: tbl_df/tbl/data.frame)
##  $ Customer_Age            : num [1:10127] 45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : chr [1:10127] "M" "F" "M" "F" ...
##  $ Dependent_count         : num [1:10127] 3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : chr [1:10127] "High School" "Graduate" "Graduate" "High School" ...
##  $ Marital_Status          : chr [1:10127] "Married" "Single" "Married" "Unknown" ...
##  $ Income_Category         : chr [1:10127] "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
##  $ Card_Category           : chr [1:10127] "Blue" "Blue" "Blue" "Blue" ...
##  $ Months_on_book          : num [1:10127] 39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: num [1:10127] 5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : num [1:10127] 1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : num [1:10127] 3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num [1:10127] 12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : num [1:10127] 777 864 0 2517 0 ...
##  $ Avg_Open_To_Buy         : num [1:10127] 11914 7392 3418 796 4716 ...
##  $ Total_Amt_Chng_Q4_Q1    : num [1:10127] 1.33 1.54 2.59 1.41 2.17 ...
##  $ Total_Trans_Amt         : num [1:10127] 1144 1291 1887 1171 816 ...
##  $ Total_Trans_Ct          : num [1:10127] 42 33 20 20 28 24 31 36 24 32 ...
##  $ Total_Ct_Chng_Q4_Q1     : num [1:10127] 1.62 3.71 2.33 2.33 2.5 ...
##  $ Avg_Utilization_Ratio   : num [1:10127] 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
##  $ lTotal_Trans_Amt        : num [1:10127] 7.04 7.16 7.54 7.07 6.7 ...
##  $ Churned                 : num [1:10127] 0 0 0 0 0 0 0 0 0 0 ...

table(bank_2$Churned)

## 
##    0    1 
## 8500 1627

Run the model (5pts)

Use the train function from Caret to run a logistic regression model on the training data with ‘Churned’ as the dependent variable and all other variables as predictors.

Note: Use ‘trControl = trainControl(method = “boot”, number = 1)’ for subsampling

Answer

set.seed(504)
train_index_blr <- createDataPartition(bank_2$Churned, p = 0.60, list = FALSE)
train_blr <- bank_2[train_index_blr, ]
test_blr <- bank_2[-train_index_blr, ]

fit_blr <- train(Churned ~.,
                  data = train_blr,
                  method = "glm",
                  family = "binomial",
                  trControl = trainControl(method = "boot", number =1))

fit_blr

## Generalized Linear Model 
## 
## 6077 samples
##   20 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (1 reps) 
## Summary of sample sizes: 6077 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.2493697  0.4994979  0.1250461

Interpretation (5pts)

Which predictor (other than the intercept) is the most predictive? Just use odds to explain your answer. You do not need to explain what the predictor is or speculate on why it affects the outcome.

Answer The lTotal_Trans_Amt (log total transactoing amount) has significantly higher odds than the rest according to the odds_ratio table below.

odds_ratio <- exp(coef(fit_blr$finalModel))
data.frame(name = names(odds_ratio), odds_ratio = odds_ratio) %>%  
  arrange(desc(odds_ratio)) %>% 
  head()

	name	odds_ratio
lTotal_Trans_Amt	lTotal_Trans_Amt	43.160723
Card_CategoryPlatinum	Card_CategoryPlatinum	2.944081
Card_CategoryGold	Card_CategoryGold	2.667152
Contacts_Count_12_mon	Contacts_Count_12_mon	1.691137
Months_Inactive_12_mon	Months_Inactive_12_mon	1.630036
Card_CategorySilver	Card_CategorySilver	1.557646

Conceptual

Overfitting or leakage? (5pts)

A. Leakage is one common problem encountered in machine learning. How will it manifest itself in your training performance metrics (accuracy, kappa, etc.), and in your test performance metrics?

B. Overfitting is another common problem. How will overfitting manifest itself in your training performance metrics (accuracy, kappa, etc.), and in your test performance metrics?

Answer

A. So, like the example we talked about in class regarding the wine tasters, the words they use and close proximity to certain regions. This is an example of leakage. Having a very high kappa is often an indicator that something might be “too good to be true”. This could be created in this scenario by using the taster name as a feature, since we can figure out where they are likely from and tasting wine from.

B. In my experience… overfitting the data resulted in poorer results. I put a lot of thought into analyzing the top 10 key words of the dataset and continuing to remove words that appears more than once, so that each region had it’s own, pure top 10 or 20. When I tried to use that concept it seemed like it performed really poorly. I tried both filter out the words that were duplicates and throw away words (like wine, flavors, vineyard, etc.) as well as trying to filter the set so that only the top 20 words for each region were usable. It was too restrictive and gave me poor results with nauseatingly low kappas (like 0.08).

Random Forests (5pts)

A coworker has fit a random forest model to some data and is concerned about overfitting. What advice can you give on what hyperparameters you might tune to decrease the risk of overfitting? Please discuss at least two, including what the parameter means and why it might help. You can but do not need to specify the package-specific names for these parameters.

Answer Instead of overfitting with too many features, it would be good to keep it a little more simple. You definitely don’t want to choose so many variables and lock it in to where the model is memorizing the data set and not predicting anymore. Limit the complexity. Like in class… the example of if you have 100 features total, maybe each tree has 10 features, so they don’t fixate and rely on one feature. Each tree will find it’s own feature. Also, similary with ntrees, there is more accuracy with more trees growing, but both remind me of Goldy Locks and the Three Bears… Not too big, not too small, just right. In trainControl you could also make sure to set the cross fold validation to number = 10, and the repeat = 3… which will slow down the search process, but is intended to help reduce overfitting.

CV (5pts)

You use cross-validation and find that the individual kappas from the various folds for a given hyperparameter are all over the place: some are good and others are terrible. What might be causing this and why?

Answer Perhaps the predictions in the model are unreliable. If someone decides on a predictor that perhaps also shows up frequently among the different variables, it probably would have a hard time determining which one it goes to if they are all related to the chosen predictor. This could create a lot of confusion in the model and results that vary largely. I’m thinking of things like “cherry” or “wine” that showed up in the wine dataset. It’s such a common term that I could see it being unreliable. This also depends on the ratio of the different variables in question. If there is a large variance between each variable, it will skew… Anyway, I’m not that I’m exactly answering this question properly, but these are the thoughts I had.

Ethics (5pts)

Suppose you built a model for work that identifies households for mailing out a promotional discount; your job was to determine which local households are most likely to buy your employer’s nonsense product. After you build your model and generate the list of addresses, you notice that you have identified many households in some neighborhoods and almost none in others. Is this a problem? What are some potential ethical concerns and how might you address them?

Answer Oof, yeah… So obviously we all know about the redlining case. When humans creating models are basing their predictors on demographic data, such as income, that can definitely unintentionally alienate or discriminate against certain groups of people. The problem is that the model is only look at the hard data and does not yet have the ability to include more human oriented data. Given this example… since it is a flier for coupons on a product, I’m not sure this is unethical. It is totally possible that the demographic for the consumers of this specific product live in a specific and concentrated area. Like, what if they are marketing denture cream and the model shows the location is basically the largest retirement village in the city. That seems fine to me. I think in every situation with these models we need to review our findings and really think about the context and actual human impact as a result of those findings. If this was an example of a model that helps predict whether or not a person should be considered for a loan based on their neighborhood, or the redlining insurance issue… that is a problem.

Mid-term

03/10/2022

Multiple Regression

Step 0

Fit the model (5pts)

Interpretation (5pts)

Decision Trees

Run the model (5pts)

Visualize (2pts)

Interpretation (3pts)

Out of sample performance (5pts)

Logistic Regression

Setup (2pt)

Run the model (5pts)

Interpretation (5pts)

Conceptual

Overfitting or leakage? (5pts)

Random Forests (5pts)

CV (5pts)

Ethics (5pts)