GenAI Statement

Responsible and Ethical Use of GenAI Tools in the Business School

Within the Business School, we support the responsible and ethical use of GenAI tools, and we seek to develop your ability to use these tools to help you study and learn. An important part of this process is being transparent about how you have used GenAI tools during the preparation of your assignments.

Information about GenAI can be found [here] and guidance on the responsible use of GenAI tools can be found [here].

The declaration below is intended to guide transparency in the use of GenAI tools and to assist you in ensuring the appropriate citation of those tools within your work.


GenAI Declaration

I **have** used GenAI tools in the production of this work.

The following GenAI tools have been used:

CoPilot

I have used GenAI tools for the following purposes:

  • [Yes ] I have used GenAI tools to assist with research or gathering information.
  • [No] I have used GenAI tools to help me understand key theories and concepts.
  • [Yes] I have used GenAI tools to help me analyse data.
  • [No] I have used GenAI tools to create code.
  • [No] I have used GenAI tools to suggest a plan or structure of my assessment.
  • [No] I have used GenAI tools to give me feedback on a draft.
  • [No] I have used GenAI tools to generate images, figures, or diagrams.
  • [No] I have used GenAI tools to generate creative content for my work.
  • [No] I have used GenAI tools to proofread and correct grammar or spelling errors.
  • [] Other (please specify) …………………………………………………………………………………………………………………

Declaration of Citation

  • [Yes ] I declare that I have referenced the use of GenAI tools and outputs within my assessment in line with the University guidelines for referencing GenAI in academic work.

0. Data

We’re going to use a mail response data set from a real direct marketing campaign located in mailing.csv. Each record represents an individual who was targeted with a direct marketing offer. The offer was a solicitation to make a charitable donation. This data was provided by the authors of our textbook, and I’m not sure of the original source.

The columns (features) are:

Income       household income
Firstdate    date associated with the first gift by this individual
Lastdate     date associated with the most recent gift
Amount       average donation amount by this individual over all periods
rfaf2        frequency code
rfaa2        donation amount code
pepstrfl     flag indicating a star donor
glast        amount of the last gift
gavr         amount of the average gift
class        outcome variable, 1 if they gave a donation

The target variable is class; it equals one if the individual gave in this campaign and zero otherwise.

Load in the .RData object.

JUST IN CASE: The CSV files are saved inside the folder mailing_data in case you need to read them in manually. There is one for the training set and one for the test set - if you read them in, call them mailing_train and mailing_test respectively. You will need to convert the class variable to a factor using as.factor, as in the sketch below.
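
A minimal sketch of that manual read-in (the exact file names inside mailing_data are assumed here; adjust them to whatever the folder actually contains):

# Fallback: read the CSVs directly -- file names are assumptions
mailing_train <- read.csv('./mailing_data/mailing_train.csv')
mailing_test  <- read.csv('./mailing_data/mailing_test.csv')

# randomForest needs a factor outcome to run a classification
mailing_train$class <- as.factor(mailing_train$class)
mailing_test$class  <- as.factor(mailing_test$class)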

load('./mailing_balanced_train_test.RData')
glimpse(mailing_train)
## Rows: 4,000
## Columns: 14
## $ Income     <dbl> 6, 7, 1, 0, 3, 0, 0, 1, 1, 5, 5, 7, 3, 2, 7, 0, 1, 3, 5, 0,…
## $ Firstdate  <dbl> 8703, 9508, 9409, 9502, 8803, 8809, 9310, 9409, 9502, 8809,…
## $ Lastdate   <dbl> 9504, 9602, 9507, 9511, 9510, 9603, 9507, 9512, 9603, 9601,…
## $ Amount     <dbl> 0.30, 0.12, 0.17, 0.09, 0.26, 0.25, 0.11, 0.08, 0.20, 0.40,…
## $ rfaf2      <dbl> 2, 2, 3, 1, 1, 2, 2, 1, 3, 4, 1, 4, 1, 1, 3, 1, 1, 1, 1, 3,…
## $ glast      <dbl> 14, 20, 10, 15, 10, 5, 15, 15, 6, 7, 20, 7, 15, 18, 10, 20,…
## $ gavr       <dbl> 8.52, 15.00, 10.00, 15.00, 8.73, 5.10, 11.50, 12.50, 5.80, …
## $ class      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_D    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_E    <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ rfaa2_F    <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ rfaa2_G    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pepstrfl_0 <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,…
## $ pepstrfl_X <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…

1. Fit a Random Forest model

Our outcome variable is class. Provide a quick look at the distribution of class for the training data and test data.

# your code here 
count(mailing_train,class)
## # A tibble: 2 × 2
##   class     n
##   <fct> <int>
## 1 0      2000
## 2 1      2000
count(mailing_test, class)
## # A tibble: 2 × 2
##   class     n
##   <fct> <int>
## 1 0      2000
## 2 1      2000

Comparing the distribution of the outcome variable in training and test, do they look balanced?

Yes, they look balanced: both the training and test sets contain an equal number of class 0 and class 1 instances.

Fit a random forest model to predict class.

> The loaded data sets appear complete. However, when I inspected the CSV in Excel, pepstrfl_X seemed to have a lot of missing data, so I tried to remove that column. Doing so broke the code and I could not get it to work no matter what I tried, so I kept all the variables.

# your code here
set.seed(35)
rg <- randomForest(class ~ ., data = mailing_train)

Examine the results.

#  your code here
rg
## 
## Call:
##  randomForest(formula = class ~ ., data = mailing_train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 43.33%
## Confusion matrix:
##      0    1 class.error
## 0 1202  798      0.3990
## 1  935 1065      0.4675
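
If you prefer to pull the OOB error rate out of the fitted object rather than reading it off the print-out, a minimal sketch (using the rg object from above):

# Final-row OOB error from the per-tree error-rate matrix
rg$err.rate[rg$ntree, "OOB"]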

What is the out-of-bag (OOB) error rate?

Write HERE. The OOB error rate is 43.33%. This is very high: almost half of the out-of-bag predictions were wrong, which suggests the model cannot accurately predict this particular data set. However, there are other metrics to look at before making a final judgement.

2. Compute and Compare Predictive Performance

Use the confusionMatrix function to compute several metrics of predictive performance.

# your code here
# Feed the OOB confusion matrix (dropping the class.error column) into confusionMatrix
rg$confusion[, -3] %>%
  as.table() %>%
  confusionMatrix()
## Confusion Matrix and Statistics
## 
##      0    1
## 0 1202  798
## 1  935 1065
##                                           
##                Accuracy : 0.5668          
##                  95% CI : (0.5512, 0.5822)
##     No Information Rate : 0.5342          
##     P-Value [Acc > NIR] : 1.962e-05       
##                                           
##                   Kappa : 0.1335          
##                                           
##  Mcnemar's Test P-Value : 0.001087        
##                                           
##             Sensitivity : 0.5625          
##             Specificity : 0.5717          
##          Pos Pred Value : 0.6010          
##          Neg Pred Value : 0.5325          
##              Prevalence : 0.5343          
##          Detection Rate : 0.3005          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.5671          
##                                           
##        'Positive' Class : 0               
## 
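
For comparison, a sketch of the more usual route of feeding explicit class predictions into confusionMatrix (here on the test set; these metrics are illustrative and will differ from the OOB figures above):

# Predict hard class labels on the test set and compare them to the truth
test_class <- predict(rg, newdata = mailing_test, type = "response")
confusionMatrix(test_class, mailing_test$class, positive = "1")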

How would you describe the performance of this model?

Write HERE. The accuracy (0.567) is above 50 percent, which might be considered a positive, but being wrong almost half of the time does not inspire confidence. Secondly, the model is slightly better at catching the negative class than the positive class (note that the 'positive' class here is 0), as can be seen from specificity (0.572) being larger than sensitivity (0.563).

The confusion matrix assumed a 0.50 cutoff for prediction. Now create a ROC plot and compute the AUC for the training set.

# your code here
p <- predict(rg, newdata = mailing_train, type = "prob")[, 2] %>% as.numeric()
train_mailing_roc <- roc(mailing_train$class, p)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(train_mailing_roc)
## Area under the curve: 0.9899
plot(train_mailing_roc)

Now create a ROC chart for the test set and compute the test AUC.

# your code here
p <- predict(rg, newdata = mailing_test, type = "prob")[, 2] %>% as.numeric()
mailing_test_roc <- roc(mailing_test$class, p)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(mailing_test_roc)
## Area under the curve: 0.5828
plot(mailing_test_roc)

Is the model underfit, overfit, or correctly fit to the data?

Write a couple sentences HERE about this model fit. The model is overfit. Comparing the two ROC curves side by side, the training curve sits much higher than the test curve (training AUC of about 0.99 versus test AUC of about 0.58), so the model fits the training data almost perfectly but generalises poorly to new data.
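
A small sketch that overlays the two ROC curves from above on a single plot, which makes the train/test gap easier to see (the colours and legend placement are just choices):

# Overlay the train and test ROC curves to visualise the overfitting gap
plot(train_mailing_roc, col = "steelblue")
plot(mailing_test_roc, col = "firebrick", add = TRUE)
legend("bottomright", legend = c("train", "test"),
       col = c("steelblue", "firebrick"), lwd = 2)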

3. Examine the model

Examine the variable importance of the model.

# your code here
varImpPlot(rg)

Make some individual predictions with the model. Choose one case from the data and see what the model predicts for that one person.

# your code here
# set.seed(42)  # uncomment for a reproducible draw
d <- sample_n(mailing_train, 1)

predict(rg, newdata = d, type="prob")
##      0    1
## 1 0.08 0.92
## attr(,"class")
## [1] "matrix" "array"  "votes"

Using the most important variable from the plot above, change the value of that variable to something new and make a new prediction for that one case (i.e. set the value to something very small, or very large). How does the prediction change?

Write HERE. The model is inaccurate as it is, so my assumption was that changing the most important variable would make the prediction even less reliable. In the event, setting Amount to 10 moved the predicted probability of class 1 for this case from 0.92 down to 0.64.

# your code here  

d$Amount <- 10
predict(rg, newdata = d, type = "prob")
##      0    1
## 1 0.36 0.64
## attr(,"class")
## [1] "matrix" "array"  "votes"

Let’s now try to predict the outcome for this case if that important variable is changed from its minimum value to its maximum value.

  1. Create a grid of at least 100 points from the minimum value to the maximum value.
  2. Duplicate the case you used above the same number of times
  3. Add the grid of points to the data
  4. Predict the outcome using this new fake data and save the predicted probability in the dataset.

I had to use Amount instead of gavr because, for some reason, RStudio gave me errors whenever I tried to use gavr and I could not get it to work. Amount is the third most important variable in the plot and, to me, it is also the more interpretable choice anyway.

# your code here
N <- 100
grid <- seq(min(mailing_train$Amount),
            max(mailing_train$Amount),
            length.out = N)

d_n <- d %>%
  mutate(n = N) %>%
  uncount(n) %>%
  mutate(Amount = grid)

d_n$pred <- predict(rg, newdata = d_n, type = "prob")[, 2]

Now plot the results of that sequential grid against the predicted probability. How does the probability of responding to the mailer change in response to the variable? Tip: You can use coord_cartesian to change the x-limits to focus on specific areas of detail if you want. How do the predictions change:

Write HERE. Interestingly, as Amount gets bigger the predictions do not get closer to 1; the middle of the range produces the highest predictions. My initial expectation was that the largest values would get the highest predictions.

# your code here
ggplot(d_n, aes(x = Amount, y = pred)) +
  geom_line()

Use IML to create a PDP+ICE chart

We have only looked at varying one observation. What if we vary our most important feature for all observations?

Use the iml package to create an individual conditional expectations combined with partial dependence chart.

It’s highly recommended that you only provide the prediction object a subset of the data. I’m not sure the RStudio Cloud instance can take the full data and it will take a very long time. Use coord_cartesian to focus the y-limit range to focus the chart to a region where you can observe the effect (it’s tiny!).

What is the plot showing you?

Write HERE. In the class 0 panel the yellow line starts at about 0.5, dips, and then stabilises at around 0.45; in the class 1 panel it starts at around 0.55 and stabilises at around 0.58. The black lines are the individual conditional expectation (ICE) curves and the yellow curve is the averaged partial dependence curve. The majority of the data is clustered near the start of the Amount range (roughly 0.1 to 0.3), which is also where the curves show the most variation.

# your code here
library(iml)
## Warning: package 'iml' was built under R version 4.4.3
samp <- sample_n(mailing_train, 200)  # subset; also avoids shadowing base::sample
yt_pred <- Predictor$new(model = rg, data = samp, y = samp$class, type = "prob")
yt_effect <- FeatureEffect$new(yt_pred, grid.size = 100, method = "pdp+ice", feature = "Amount")
yt_effect$plot() + coord_cartesian(ylim = c(0.25, 0.75))

4. Summarise

What are your thoughts on the impact of the different features on the likelihood for a person to respond to our donation requests? (100-200 words)

The model's predictive accuracy is not great, and the training ROC curve is very different from the test curve, so the AUC values are far apart and the model is overfit. From the PDP I can say that at low values of Amount a donation is not likely. To improve, I would try a simpler, less complex model or increase the amount of training data. I would also recommend that the organisation collect more relevant data and give more weight to features such as income and amount. New columns could improve predictions, for example how long it has been since the last gift. The least relevant columns should be removed to reduce complexity, and dividing gavr or glast by income would create a column showing how giving relates to households in different economic situations. Finally, extreme values could be removed from the numerical columns such as income and amount. A hypothetical sketch of two such engineered features is shown below.
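
A hypothetical sketch of the kind of engineered features suggested above (the recency definition and the +1 guard against a zero income band are assumptions for illustration only):

# Hypothetical feature-engineering sketch -- not part of the fitted model above
mailing_train_fe <- mailing_train %>%
  mutate(recency    = max(Lastdate) - Lastdate,   # rough recency of the most recent gift
         gift_ratio = gavr / (Income + 1))        # average gift relative to income band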