GenAI Statement

Responsible and Ethical Use of GenAI Tools in the Business School

Within the Business School, we support the responsible and ethical use of GenAI tools, and we seek to develop your ability to use these tools to help you study and learn. An important part of this process is being transparent about how you have used GenAI tools during the preparation of your assignments.

Information about GenAI can be found [here] and guidance on the responsible use of GenAI tools can be found [here].

The declaration below is intended to guide transparency in the use of GenAI tools and to assist you in ensuring the appropriate citation of those tools within your work.

GenAI Declaration

I have (delete as appropriate) used GenAI tools in the production of this work.

The following GenAI tools have been used:

[please specify] ……………………………………

Declaration of Citation

[x] I declare that I have referenced the use of GenAI tools and outputs within my assessment in line with the University guidelines for referencing GenAI in academic work.

0. Data

We’re going to use a mail response data set from a real direct marketing campaign located in mailing.csv. Each record represents an individual who was targeted with a direct marketing offer. The offer was a solicitation to make a charitable donation. This data was provided by the authors of our textbook, and I’m not sure of the original source.

The columns (features) are:

Income       household income
Firstdate    date associated with the first gift by this individual
Lastdate     date associated with the most recent gift
Amount       average amount given by this individual over all periods
rfaf2        frequency code
rfaa2        donation amount code
pepstrfl     flag indicating a star donor
glast        amount of last gift
gavr         amount of average gift
class        outcome variable, 1 if they gave a donation

The target variable is class; it equals one if they gave in this campaign and zero otherwise.

Load in the RData object.

JUST IN CASE: The csv files are saved inside the folder mailing_data in case you need to read them in manually. There is one for the training set and one for the test set - if you read them in, call them mailing_train and mailing_test respectively. You will need to convert the class variable to a factor using as.factor.
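If the manual route is needed, the read-and-convert step might look like the sketch below. The CSV file names inside mailing_data are not stated above, so the sketch writes a tiny stand-in file to a temporary path and uses the hypothetical name demo_train; both are assumptions for illustration only. Swap in the real path and the names mailing_train and mailing_test.

```r
# Stand-in CSV (assumption): the real files live in the mailing_data folder
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Income,Amount,class",
             "6,0.30,0",
             "7,0.12,1",
             "1,0.17,0"),
           csv_path)

# Read the file, then convert the outcome to a factor so that
# a classification model (not a regression) is fitted later
demo_train <- read.csv(csv_path)
demo_train$class <- as.factor(demo_train$class)

str(demo_train)
```

Without the as.factor step, class would be read as an integer column, and a model fitted to it would treat the outcome as numeric.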

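library(tidyverse); library(randomForest)  # glimpse(), %>%, uncount(), randomForest()
library(caret); library(pROC)              # confusionMatrix(), roc(), auc()
load('./mailing_balanced_train_test.RData')
glimpse(mailing_train)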

## Rows: 4,000
## Columns: 14
## $ Income     <dbl> 6, 7, 1, 0, 3, 0, 0, 1, 1, 5, 5, 7, 3, 2, 7, 0, 1, 3, 5, 0,…
## $ Firstdate  <dbl> 8703, 9508, 9409, 9502, 8803, 8809, 9310, 9409, 9502, 8809,…
## $ Lastdate   <dbl> 9504, 9602, 9507, 9511, 9510, 9603, 9507, 9512, 9603, 9601,…
## $ Amount     <dbl> 0.30, 0.12, 0.17, 0.09, 0.26, 0.25, 0.11, 0.08, 0.20, 0.40,…
## $ rfaf2      <dbl> 2, 2, 3, 1, 1, 2, 2, 1, 3, 4, 1, 4, 1, 1, 3, 1, 1, 1, 1, 3,…
## $ glast      <dbl> 14, 20, 10, 15, 10, 5, 15, 15, 6, 7, 20, 7, 15, 18, 10, 20,…
## $ gavr       <dbl> 8.52, 15.00, 10.00, 15.00, 8.73, 5.10, 11.50, 12.50, 5.80, …
## $ class      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_D    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_E    <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ rfaa2_F    <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ rfaa2_G    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pepstrfl_0 <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,…
## $ pepstrfl_X <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…

1. Fit a Random Forest model

Our outcome variable is class. Provide a quick look at the distribution of class for the training data and test data.

table(mailing_train$class)

## 
##    0    1 
## 2000 2000

table(mailing_test$class)

## 
##   0   1 
## 963  37

Comparing the distribution of the outcome variable in training and test, do they look balanced?

Write a sentence HERE about the balance of outcomes in the data. The distribution of class in the training set is balanced, with 2000 “0”s and 2000 “1”s, but the test set is heavily unbalanced, with 963 “0”s and 37 “1”s (3.7% donors), reflecting the low donation rate in real campaign data.
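The balance can also be expressed as proportions. The sketch below uses only the counts printed by table() above; the names train_counts, test_counts, train_share and test_share are hypothetical.

```r
# Counts copied from the table() output above
train_counts <- c("0" = 2000, "1" = 2000)
test_counts  <- c("0" = 963,  "1" = 37)

# Share of each class within each set
train_share <- prop.table(train_counts)
test_share  <- prop.table(test_counts)

round(100 * train_share, 1)
round(100 * test_share, 1)
```

With the data loaded, prop.table(table(mailing_test$class)) should give the same shares directly.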

Fit a random forest model to predict class.

set.seed(35)

rf_model <- randomForest(class ~ Income + Amount + rfaf2 + glast + gavr + rfaa2_D + rfaa2_E + rfaa2_F + rfaa2_G + pepstrfl_0 + pepstrfl_X, data = mailing_train)

Examine the results.

print(rf_model)

## 
## Call:
##  randomForest(formula = class ~ Income + Amount + rfaf2 + glast +      gavr + rfaa2_D + rfaa2_E + rfaa2_F + rfaa2_G + pepstrfl_0 +      pepstrfl_X, data = mailing_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 43.03%
## Confusion matrix:
##      0    1 class.error
## 0 1215  785      0.3925
## 1  936 1064      0.4680

What is the out-of-bag error rate?

Write HERE. The out-of-bag (OOB) error rate is 43.03%, which is poor for a balanced dataset, as it is close to random guessing (50%).
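As a check, the OOB error rate can be recomputed from the confusion matrix printed by print(rf_model). This is a minimal sketch using the printed counts, assuming rows are the actual classes (the class.error column is computed per row, which is consistent with that reading). The name oob_confusion is hypothetical.

```r
# Counts copied from the printed confusion matrix (filled column by column)
oob_confusion <- matrix(c(1215, 936, 785, 1064), nrow = 2,
                        dimnames = list(actual = c("0", "1"),
                                        predicted = c("0", "1")))

# Overall OOB error: share of cases off the diagonal
oob_error <- 1 - sum(diag(oob_confusion)) / sum(oob_confusion)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_confusion) / rowSums(oob_confusion)

round(100 * oob_error, 2)   # 43.03, matching the printed OOB estimate
class_error                 # 0.3925 and 0.4680, matching class.error
```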

2. Compute and Compare Predictive Performance

Use the confusionMatrix function to compute several metrics of predictive performance.

rf_model$confusion[, -3] %>% as.table %>% 
  confusionMatrix()

## Confusion Matrix and Statistics
## 
##      0    1
## 0 1215  785
## 1  936 1064
##                                           
##                Accuracy : 0.5698          
##                  95% CI : (0.5542, 0.5852)
##     No Information Rate : 0.5378          
##     P-Value [Acc > NIR] : 2.551e-05       
##                                           
##                   Kappa : 0.1395          
##                                           
##  Mcnemar's Test P-Value : 0.0002995       
##                                           
##             Sensitivity : 0.5649          
##             Specificity : 0.5754          
##          Pos Pred Value : 0.6075          
##          Neg Pred Value : 0.5320          
##              Prevalence : 0.5377          
##          Detection Rate : 0.3038          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.5701          
##                                           
##        'Positive' Class : 0               
##

How would you describe the performance of this model?

Write HERE The model performs poorly, with an accuracy of 56.98% and a kappa of 0.1395, indicating only slight improvement over random guessing. Because the positive class is 0, the reported sensitivity (56.5%) describes non-donors and the reported specificity (57.5%) describes donors. With a positive predictive value of 60.8%, a negative predictive value of 53.2%, and an error rate of about 43%, the model is not yet reliable for practical donation prediction.
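One caution when reading these metrics: the class.error column in print(rf_model) is computed per row, which suggests the rows of rf_model$confusion are the actual classes, whereas caret's confusionMatrix, as I understand its documentation, reads a table with predictions in rows and the reference in columns. The reported prevalence of 0.5377 (rather than 0.5 for a balanced set) is consistent with the columns having been treated as the reference. Assuming both conventions hold, the sketch below reproduces the reported values from the printed counts and shows what they would be with rows as the reference. All object names are hypothetical.

```r
# Counts from print(rf_model); rows assumed to be actual classes
cm <- matrix(c(1215, 936, 785, 1064), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("0", "1")))

# Accuracy does not depend on the orientation of the table
accuracy <- sum(diag(cm)) / sum(cm)

# As reported above: columns treated as the reference, "0" as positive
sens_reported <- cm["0", "0"] / sum(cm[, "0"])
spec_reported <- cm["1", "1"] / sum(cm[, "1"])

# Rows treated as the reference instead (what passing t(cm) would give)
sens_by_row <- cm["0", "0"] / sum(cm["0", ])
spec_by_row <- cm["1", "1"] / sum(cm["1", ])

round(c(accuracy = accuracy,
        sens_reported = sens_reported, spec_reported = spec_reported,
        sens_by_row = sens_by_row, spec_by_row = spec_by_row), 4)
```

Under the row-as-reference reading, the model identifies 60.75% of non-donors and 53.2% of donors; these are the values reported above as the positive and negative predictive values. Accuracy and kappa are the same either way.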

The confusion matrix assumed a 0.50 cutoff for prediction. Now create a ROC plot and compute the AUC for the training set.

p <- predict(rf_model, newdata = mailing_train, type = "prob")[,2] %>% as.numeric

mailing_train_roc <-roc(mailing_train$class, p)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(mailing_train_roc)

## Area under the curve: 0.902

plot(mailing_train_roc)

Now create a ROC chart for the test set and compute the test AUC.

p2 <- predict(rf_model, newdata = mailing_test, type = "prob")[,2] %>% as.numeric

mailing_test_roc <-roc(mailing_test$class, p2)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(mailing_test_roc)

## Area under the curve: 0.5672

plot(mailing_test_roc)

Is the model underfit, overfit, or correctly fit to the data?

Write a couple sentences HERE about this model fit. The model is overfit to the data: the training AUC (0.902) is much higher than the test AUC (0.5672), which is barely better than chance (0.5), indicating poor generalization.
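To make the AUC values easier to interpret: AUC equals the probability that a randomly chosen donor receives a higher predicted score than a randomly chosen non-donor. The sketch below computes it from the rank-sum (Mann-Whitney) identity on made-up toy data. The helper auc_rank and the toy vectors are hypothetical and are not part of pROC.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity
auc_rank <- function(labels, scores) {
  is_pos <- labels == 1
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(scores)  # tied scores receive average ranks
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (made up): two non-donors, two donors
toy_labels <- c(0, 0, 1, 1)
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

# 3 of the 4 donor/non-donor pairs are ordered correctly, so AUC = 0.75
auc_rank(toy_labels, toy_scores)
```

The training AUC above is computed from predictions on the same rows the trees were grown on, which inflates it. Assuming rf_model$votes holds the out-of-bag vote fractions, as the randomForest documentation describes, roc(mailing_train$class, rf_model$votes[, 2]) would be one way to get a less optimistic training-side figure.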

3. Examine the model

Examine the variable importance of the model.

varImpPlot(rf_model)

Make some individual predictions of the model. Choose one case from the data and see what the model predicts for that one person.

#set.seed(42)
#d <- sample_n(mailing_train, 1)
d <- mailing_train [56,]

predict(rf_model, newdata = d, type = "prob")

##      0    1
## 1 0.62 0.38
## attr(,"class")
## [1] "matrix" "array"  "votes"

Using the most important variable from the plot above, change the value of that variable to something new and make a new prediction for that one case (e.g. set the value to something very small, or very large). How does the prediction change?

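Write HERE Lowering Income may increase the probability of class 0 (no donation) and correspondingly decrease the probability of class 1 (donation). In the output shown, however, both modification lines are commented out, so the case is unchanged and the prediction stays at 0.62 for class 0 and 0.38 for class 1.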

# d$Income <- 4   # numeric, to match the <dbl> column; a quoted "4" would be a character
# d$Amount <- 1
predict(rf_model, newdata = d, type = "prob")

##      0    1
## 1 0.62 0.38
## attr(,"class")
## [1] "matrix" "array"  "votes"

Let’s now try to predict the outcome for this case if that important variable were changed from its minimum value to its maximum value.

Create a grid of at least 100 points from the minimum value to the maximum value.
Duplicate the case you used above the same number of times
Add the grid of points to the data
Predict the outcome using this new fake data and save the predicted probability in the dataset.

N <- 100
grid <- seq(min(mailing_train$Amount),
            max(mailing_train$Amount),
            length.out = N)

d_n <- d %>% mutate(n = N) %>%
  uncount(n) %>% mutate(Amount=grid)

d_n$pred <- predict(rf_model, newdata = d_n, type = "prob")[,2]

Now plot the results of that sequential grid against the predicted probability. How do you see the probability of responding to the mailer change in response to the variable? Tip: You can use coord_cartesian to change the x-limits to focus on specific areas of detail if you want. How do the predictions change:

Write HERE The probability of responding to the mailer fluctuates significantly, between 0 and 0.5, for Amount between 0 and 0.3, with sharp peaks and drops indicating high sensitivity. It then stabilizes around 0.15 to 0.2 for Amount above 0.3, suggesting a plateau where further increases in Amount have minimal impact.

ggplot(d_n, aes(x=Amount,y=pred)) +
    geom_line()

Use IML to create a PDP+ICE chart

We have only looked at varying one observation. What if we vary our most important feature for all observations?

Use the iml package to create an individual conditional expectation (ICE) chart combined with a partial dependence plot (PDP).

It’s highly recommended that you only give the prediction object a subset of the data. I’m not sure the RStudio Cloud instance can take the full data, and it will take a very long time. Use coord_cartesian to restrict the y-axis range so the chart focuses on the region where you can observe the effect (it’s tiny!).

What is the plot showing you?

Write HERE The plot shows the predicted probability of responding to the mailer (class 1) for multiple cases across a range of Amount values (0 to 0.5). Each black line represents a different case (its ICE curve), and the yellow line is the partial dependence curve, i.e. the average across cases. The left panel shows high variability, with predicted probabilities peaking around 0.4-0.6, while the right panel reveals erratic fluctuations and a general trend of decreasing probability with increasing Amount after an initial peak.

library(iml)

# sub-sample your dataset for only 200
sample <- sample_n(mailing_train, 200)

yt_pred <- Predictor$new(model = rf_model, data = sample, y = sample$class, type = "prob")

yt_effect <- FeatureEffect$new(yt_pred,
                               grid.size = 100,
                               method = "pdp+ice",
                               feature = "Amount")

yt_plot <- plot(yt_effect) +
  coord_cartesian(ylim = c(0.25, 0.75))

yt_plot
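To show what a PDP+ICE chart computes, here is a minimal base-R sketch under stated assumptions: predict_fn is a made-up stand-in for a fitted model's predicted probability of class 1, and toy is made-up data. Each ICE curve replaces Amount with every grid value for one case while holding that case's other features fixed; the partial dependence curve is the average of the ICE curves at each grid value.

```r
# Hypothetical stand-in for a model's predicted probability of class 1
predict_fn <- function(data) plogis(-1 + 2 * data$Amount + 0.1 * data$Income)

# Made-up cases and a grid over the feature of interest
toy <- data.frame(Amount = c(0.1, 0.2, 0.4), Income = c(1, 3, 5))
amount_grid <- seq(0, 0.5, length.out = 6)

# ICE: one row per case, one column per grid value
ice <- sapply(amount_grid, function(g) {
  modified <- toy
  modified$Amount <- g   # overwrite Amount for every case
  predict_fn(modified)
})

# PDP: average the ICE curves at each grid value
pdp <- colMeans(ice)

dim(ice)   # 3 cases by 6 grid values
pdp
```

This is the same idea as the single-case grid earlier, repeated for every case in the sample and then averaged.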

4. Summarise

What are your thoughts on the impact of the different features on the likelihood for a person to respond to our donation requests? (100-200 words)

Thoughts HERE The random forest model suggests that features differ in their impact on the likelihood of responding to donation requests. The PDP+ICE plot for Amount shows high variability in predicted probabilities (0.3 to 0.7) for small Amount values (0 to 0.1), with a decline toward 0.4 as Amount increases, indicating sensitivity at small values but diminishing influence at larger ones. Income may also matter, but the single-case check shown left the prediction unchanged at 0.62 for class 0, so its effect is unconfirmed. Features such as rfaf2 and glast are also in the model and likely contribute, with glast (last gift amount) potentially reinforcing Amount’s effect. The high training AUC (0.902) versus the low test AUC (0.5672), together with erratic probability shifts (e.g., 0.05 to 0.5 for Amount between 0 and 0.3), suggests overfitting and noise, so the apparent effects of Amount and Income are inconsistent. Firstdate and Lastdate capture recency, which may boost response likelihood, but they were not included in the model formula, so their impact requires further analysis.

Things you could write about:

What changes would you make to the modeling approach to improve the predictions?
What recommendations would you make to the organisation?
What other kinds of data would be useful in improving these predictions?

Summative Assessment 1

BEM2031

2025-03-14