knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(caret)
library(fastDummies)
library(rpart)
library(rpart.plot)
library(pROC)
library(randomForest)
library(moderndive)
bank = read_rds("/Users/rochellerafn/RStudio Files/BankChurners.rds")
No points or answer needed, but take a couple of minutes to look at the dataset, column names, etc.
Use the ‘lm’ function to run a linear regression model with the natural log of Total_Trans_Amt as the dependent variable and both Customer Age and Gender as predictors. Use the entire dataset (no need to split into test and training).
Answer
bank <- bank %>%
mutate(lTotal_Trans_Amt = log(Total_Trans_Amt))
m1 <- lm(lTotal_Trans_Amt ~ Customer_Age + Gender, data = bank)
m2 <- lm(lTotal_Trans_Amt ~ Customer_Age * Gender, data = bank)
Explain the effect of both customer age and gender on total transaction amount. Use language that a non-technical manager can understand–ie. make sure you transform the coefficients so the effects are more intuitive.
Answer In this example we are looking at two separate variables as they relate to the log total transaction amount. Since we logged the transaction amount, that changes the transaction amount to a percentage. So, the total transaction amount goes down 0.5% as the customer age increases. Also, the total transaction amount is 7% less when gender = male. This is Customer_Age + Gender. It didn’t necessarily ask for it in the question… but I also added an interaction variable which is the relation of log transaction amount to age * gender, resulting in if both are true. So, in this example the total transaction amount decreases by 2% as age increases in people identifying as male.
get_regression_table(m1)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 8.432 | 0.039 | 218.342 | 0 | 8.356 | 8.507 |
| Customer_Age | -0.005 | 0.001 | -6.233 | 0 | -0.007 | -0.003 |
| Gender: M | -0.070 | 0.013 | -5.421 | 0 | -0.096 | -0.045 |
get_regression_summaries(m1)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.007 | 0.006 | 0.4254496 | 0.652265 | 0.652 | 33.544 | 0 | 2 | 10127 |
get_regression_table(m2)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 8.387 | 0.053 | 158.656 | 0.000 | 8.283 | 8.490 |
| Customer_Age | -0.004 | 0.001 | -3.632 | 0.000 | -0.006 | -0.002 |
| Gender: M | 0.023 | 0.076 | 0.301 | 0.763 | -0.126 | 0.172 |
| Customer_Age:GenderM | -0.002 | 0.002 | -1.245 | 0.213 | -0.005 | 0.001 |
get_regression_summaries(m2)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.007 | 0.006 | 0.4253844 | 0.652215 | 0.652 | 22.881 | 0 | 3 | 10127 |
Set the random seed to 504, and split the data 60/40 into train and test. Use the train function from the Caret library to run a basic decision tree model to determine whether a customer is likely to churn or not (i.e. the Churn column). Use the training dataset and a fixed complexity parameter of 0.03.
Note: Use ‘trControl = trainControl(method = “boot”, number = 1)’ for subsampling
Answer
set.seed(504)
train_index_bdt <- createDataPartition(bank$Churn, p = 0.60, list = FALSE)
train_bdt <- bank[train_index_bdt, ]
test_bdt <- bank[-train_index_bdt, ]
fit_bdt <- train(Churn ~.,
data = train_bdt,
method = "rpart",
metric = "Kappa",
trControl = trainControl(method = "boot", number =1))
fit_bdt
## CART
##
## 6077 samples
## 20 predictor
## 2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Bootstrapped (1 reps)
## Summary of sample sizes: 6077
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03684749 0.8989626 0.6300442
## 0.06755374 0.8872350 0.5629426
## 0.17195496 0.8218313 0.0000000
##
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03684749.
Plot the resulting tree using the rpart.plot function.
Answer
rpart.plot(fit_bdt$finalModel, type = 2)
Use the plot of your decision tree to construct a coherent sentence that explains the data in the right-most, bottom leaf of the tree.
Answer: So, first we can see that the leaves on the tree have been color coded, blue for churn = “no” and green for churn = “yes”. From the top, if the age of the customer is >=55 we split to the left and end up at a “no” as the most common outcome 66% of the time with an accuracy of 5%.
Show the confusion matrix on the test data. Why might Accuracy not be the most meaningful metric for this dataset?
Answer Kappa gives us another measurement and equation to evaluate a type of “accuracy”. Accuracy is not great is there is a data set that is skewed to one variable over the other. Such as the fraud example we’ve used in class. Maybe there’s only 1 fraud event out of 100 cases… so if we say there is no fraud, we would be 99% accurate… but we would be missing the fraud. Kappa compares observed accuracy wtih expected accuracy giving us another way to look at it.
pred_bdt <- predict(fit_bdt, newdata = test_bdt)
confusionMatrix(factor(pred_bdt), factor(test_bdt$Churn))
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 3264 260
## yes 136 390
##
## Accuracy : 0.9022
## 95% CI : (0.8927, 0.9112)
## No Information Rate : 0.8395
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6068
##
## Mcnemar's Test P-Value : 6.37e-10
##
## Sensitivity : 0.9600
## Specificity : 0.6000
## Pos Pred Value : 0.9262
## Neg Pred Value : 0.7414
## Prevalence : 0.8395
## Detection Rate : 0.8059
## Detection Prevalence : 0.8701
## Balanced Accuracy : 0.7800
##
## 'Positive' Class : no
##
Starting from the original, full dataset, create a new variable called ‘Churned’ that is 1 if a customer churned and 0 otherwise, then delete the Churn column. Now redo the 60/40 train test split on the dataset after these steps.
Answer
bank_2 <- bank %>%
mutate(Churned = ifelse(bank$Churn == "yes", 1, 0)) %>%
select(-Churn)
str(bank_2)
## tibble [10,127 × 21] (S3: tbl_df/tbl/data.frame)
## $ Customer_Age : num [1:10127] 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : chr [1:10127] "M" "F" "M" "F" ...
## $ Dependent_count : num [1:10127] 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : chr [1:10127] "High School" "Graduate" "Graduate" "High School" ...
## $ Marital_Status : chr [1:10127] "Married" "Single" "Married" "Unknown" ...
## $ Income_Category : chr [1:10127] "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
## $ Card_Category : chr [1:10127] "Blue" "Blue" "Blue" "Blue" ...
## $ Months_on_book : num [1:10127] 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: num [1:10127] 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : num [1:10127] 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : num [1:10127] 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num [1:10127] 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : num [1:10127] 777 864 0 2517 0 ...
## $ Avg_Open_To_Buy : num [1:10127] 11914 7392 3418 796 4716 ...
## $ Total_Amt_Chng_Q4_Q1 : num [1:10127] 1.33 1.54 2.59 1.41 2.17 ...
## $ Total_Trans_Amt : num [1:10127] 1144 1291 1887 1171 816 ...
## $ Total_Trans_Ct : num [1:10127] 42 33 20 20 28 24 31 36 24 32 ...
## $ Total_Ct_Chng_Q4_Q1 : num [1:10127] 1.62 3.71 2.33 2.33 2.5 ...
## $ Avg_Utilization_Ratio : num [1:10127] 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
## $ lTotal_Trans_Amt : num [1:10127] 7.04 7.16 7.54 7.07 6.7 ...
## $ Churned : num [1:10127] 0 0 0 0 0 0 0 0 0 0 ...
table(bank_2$Churned)
##
## 0 1
## 8500 1627
Use the train function from Caret to run a logistic regression model on the training data with ‘Churned’ as the dependent variable and all other variables as predictors.
Note: Use ‘trControl = trainControl(method = “boot”, number = 1)’ for subsampling
Answer
set.seed(504)
train_index_blr <- createDataPartition(bank_2$Churned, p = 0.60, list = FALSE)
train_blr <- bank_2[train_index_blr, ]
test_blr <- bank_2[-train_index_blr, ]
fit_blr <- train(Churned ~.,
data = train_blr,
method = "glm",
family = "binomial",
trControl = trainControl(method = "boot", number =1))
fit_blr
## Generalized Linear Model
##
## 6077 samples
## 20 predictor
##
## No pre-processing
## Resampling: Bootstrapped (1 reps)
## Summary of sample sizes: 6077
## Resampling results:
##
## RMSE Rsquared MAE
## 0.2493697 0.4994979 0.1250461
Which predictor (other than the intercept) is the most predictive? Just use odds to explain your answer. You do not need to explain what the predictor is or speculate on why it affects the outcome.
Answer The lTotal_Trans_Amt (log total transactoing amount) has significantly higher odds than the rest according to the odds_ratio table below.
odds_ratio <- exp(coef(fit_blr$finalModel))
data.frame(name = names(odds_ratio), odds_ratio = odds_ratio) %>%
arrange(desc(odds_ratio)) %>%
head()
| name | odds_ratio | |
|---|---|---|
| lTotal_Trans_Amt | lTotal_Trans_Amt | 43.160723 |
| Card_CategoryPlatinum | Card_CategoryPlatinum | 2.944081 |
| Card_CategoryGold | Card_CategoryGold | 2.667152 |
| Contacts_Count_12_mon | Contacts_Count_12_mon | 1.691137 |
| Months_Inactive_12_mon | Months_Inactive_12_mon | 1.630036 |
| Card_CategorySilver | Card_CategorySilver | 1.557646 |
A. Leakage is one common problem encountered in machine learning. How will it manifest itself in your training performance metrics (accuracy, kappa, etc.), and in your test performance metrics?
B. Overfitting is another common problem. How will overfitting manifest itself in your training performance metrics (accuracy, kappa, etc.), and in your test performance metrics?
Answer
A. So, like the example we talked about in class regarding the wine tasters, the words they use and close proximity to certain regions. This is an example of leakage. Having a very high kappa is often an indicator that something might be “too good to be true”. This could be created in this scenario by using the taster name as a feature, since we can figure out where they are likely from and tasting wine from.
B. In my experience… overfitting the data resulted in poorer results. I put a lot of thought into analyzing the top 10 key words of the dataset and continuing to remove words that appears more than once, so that each region had it’s own, pure top 10 or 20. When I tried to use that concept it seemed like it performed really poorly. I tried both filter out the words that were duplicates and throw away words (like wine, flavors, vineyard, etc.) as well as trying to filter the set so that only the top 20 words for each region were usable. It was too restrictive and gave me poor results with nauseatingly low kappas (like 0.08).
A coworker has fit a random forest model to some data and is concerned about overfitting. What advice can you give on what hyperparameters you might tune to decrease the risk of overfitting? Please discuss at least two, including what the parameter means and why it might help. You can but do not need to specify the package-specific names for these parameters.
Answer Instead of overfitting with too many features, it would be good to keep it a little more simple. You definitely don’t want to choose so many variables and lock it in to where the model is memorizing the data set and not predicting anymore. Limit the complexity. Like in class… the example of if you have 100 features total, maybe each tree has 10 features, so they don’t fixate and rely on one feature. Each tree will find it’s own feature. Also, similary with ntrees, there is more accuracy with more trees growing, but both remind me of Goldy Locks and the Three Bears… Not too big, not too small, just right. In trainControl you could also make sure to set the cross fold validation to number = 10, and the repeat = 3… which will slow down the search process, but is intended to help reduce overfitting.
You use cross-validation and find that the individual kappas from the various folds for a given hyperparameter are all over the place: some are good and others are terrible. What might be causing this and why?
Answer Perhaps the predictions in the model are unreliable. If someone decides on a predictor that perhaps also shows up frequently among the different variables, it probably would have a hard time determining which one it goes to if they are all related to the chosen predictor. This could create a lot of confusion in the model and results that vary largely. I’m thinking of things like “cherry” or “wine” that showed up in the wine dataset. It’s such a common term that I could see it being unreliable. This also depends on the ratio of the different variables in question. If there is a large variance between each variable, it will skew… Anyway, I’m not that I’m exactly answering this question properly, but these are the thoughts I had.
Suppose you built a model for work that identifies households for mailing out a promotional discount; your job was to determine which local households are most likely to buy your employer’s nonsense product. After you build your model and generate the list of addresses, you notice that you have identified many households in some neighborhoods and almost none in others. Is this a problem? What are some potential ethical concerns and how might you address them?
Answer Oof, yeah… So obviously we all know about the redlining case. When humans creating models are basing their predictors on demographic data, such as income, that can definitely unintentionally alienate or discriminate against certain groups of people. The problem is that the model is only look at the hard data and does not yet have the ability to include more human oriented data. Given this example… since it is a flier for coupons on a product, I’m not sure this is unethical. It is totally possible that the demographic for the consumers of this specific product live in a specific and concentrated area. Like, what if they are marketing denture cream and the model shows the location is basically the largest retirement village in the city. That seems fine to me. I think in every situation with these models we need to review our findings and really think about the context and actual human impact as a result of those findings. If this was an example of a model that helps predict whether or not a person should be considered for a loan based on their neighborhood, or the redlining insurance issue… that is a problem.