Within the Business School, we support the responsible and ethical use of GenAI tools, and we seek to develop your ability to use these tools to help you study and learn. An important part of this process is being transparent about how you have used GenAI tools during the preparation of your assignments.
Information about GenAI can be found [here] and guidance on the responsible use of GenAI tools can be found [here].
The declaration below is intended to guide transparency in the use of GenAI tools and to assist you in ensuring the appropriate citation of those tools within your work.
I **have** used GenAI tools in the production of this work.
[please specify] CoPilot
We’re going to use a mail response data set from a real direct marketing campaign located in mailing.csv. Each record represents an individual who was targeted with a direct marketing offer. The offer was a solicitation to make a charitable donation. This data was provided by the authors of our textbook, and I’m not sure of the original source.
The columns (features) are:
- `income`: household income
- `Firstdate`: date associated with the first gift by this individual
- `Lastdate`: date associated with the most recent gift
- `Amount`: average gift amount by this individual over all periods
- `rfaf2`: frequency code
- `rfaa2`: donation amount code
- `pepstrfl`: flag indicating a "star" donor
- `glast`: amount of the last gift
- `gavr`: amount of the average gift
- `class`: outcome variable, 1 if they gave a donation
The target variable is `class`, which is equal to one if they gave in this campaign and zero otherwise.
Load in the .RData object.

JUST IN CASE: The CSV files are saved inside the folder `mailing_data` in case you need to read them in manually. There is one for the training set and one for the test set - if you read them in, call them `mailing_train` and `mailing_test` respectively. You will need to convert the `class` variable to a factor using `as.factor`.
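Just in case that fallback is ever needed, here is a minimal sketch of the manual read, assuming the files inside `mailing_data` are named `mailing_train.csv` and `mailing_test.csv` (the actual file names may differ):

```r
# Fallback only if the .RData object is unavailable.
# File names are an assumption; adjust to whatever is inside mailing_data/.
library(readr)
library(dplyr)

mailing_train <- read_csv("./mailing_data/mailing_train.csv") %>%
  mutate(class = as.factor(class))
mailing_test  <- read_csv("./mailing_data/mailing_test.csv") %>%
  mutate(class = as.factor(class))
```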
load('./mailing_balanced_train_test.RData')
glimpse(mailing_train)
## Rows: 4,000
## Columns: 14
## $ Income <dbl> 6, 7, 1, 0, 3, 0, 0, 1, 1, 5, 5, 7, 3, 2, 7, 0, 1, 3, 5, 0,…
## $ Firstdate <dbl> 8703, 9508, 9409, 9502, 8803, 8809, 9310, 9409, 9502, 8809,…
## $ Lastdate <dbl> 9504, 9602, 9507, 9511, 9510, 9603, 9507, 9512, 9603, 9601,…
## $ Amount <dbl> 0.30, 0.12, 0.17, 0.09, 0.26, 0.25, 0.11, 0.08, 0.20, 0.40,…
## $ rfaf2 <dbl> 2, 2, 3, 1, 1, 2, 2, 1, 3, 4, 1, 4, 1, 1, 3, 1, 1, 1, 1, 3,…
## $ glast <dbl> 14, 20, 10, 15, 10, 5, 15, 15, 6, 7, 20, 7, 15, 18, 10, 20,…
## $ gavr <dbl> 8.52, 15.00, 10.00, 15.00, 8.73, 5.10, 11.50, 12.50, 5.80, …
## $ class <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_D <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ rfaa2_E <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ rfaa2_F <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,…
## $ rfaa2_G <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pepstrfl_0 <int> 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,…
## $ pepstrfl_X <int> 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,…
Our outcome variable is `class`. Provide a quick look at the distribution of `class` for the training data and test data.
# your code here
count(mailing_train, class)
## # A tibble: 2 × 2
## class n
## <fct> <int>
## 1 0 2000
## 2 1 2000
count(mailing_test, class)
## # A tibble: 2 × 2
## class n
## <fct> <int>
## 1 0 2000
## 2 1 2000
Comparing the distribution of the outcome variable in training and test, do they look balanced?
Yes, there is balance: both the training and test sets contain an equal number of instances in each class.
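A quick way to double-check this as proportions rather than raw counts (a small sketch reusing the data already loaded):

```r
# Class proportions in the training and test sets; each should be ~0.5 if balanced.
prop.table(table(mailing_train$class))
prop.table(table(mailing_test$class))
```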
Fit a random forest model to predict `class`.

> All the data sets seem complete in this case, but when I inspected the CSV in Excel I noticed `pepstrfl_X` appeared to have a lot of missing data, so I chose to remove that column. However, doing this broke the code and I could not get it to work no matter what I tried, so I kept the full set of features.
# your code here
library(randomForest)
set.seed(35)
rg <- randomForest(class ~ ., data = mailing_train)
Examine the results.
# your code here
rg
##
## Call:
## randomForest(formula = class ~ ., data = mailing_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 43.33%
## Confusion matrix:
## 0 1 class.error
## 0 1202 798 0.3990
## 1 935 1065 0.4675
What is the out-of-bag (OOB) error rate?
Write HERE. The OOB error rate is 43.33% (from the confusion matrix above: (798 + 935) / 4000). This is very high: almost half of the out-of-bag predictions were wrong, so it is unlikely that this model can accurately predict this particular data set. However, there are other metrics to look at before making a final judgement.

# 2. Compute and Compare Predictive Performance
Use the confusionMatrix function to compute several
metrics of predictive performance.
# your code here
library(caret)  # provides confusionMatrix()
rg$confusion[, -3] %>%
  as.table() %>%
  confusionMatrix()
## Confusion Matrix and Statistics
##
## 0 1
## 0 1202 798
## 1 935 1065
##
## Accuracy : 0.5668
## 95% CI : (0.5512, 0.5822)
## No Information Rate : 0.5342
## P-Value [Acc > NIR] : 1.962e-05
##
## Kappa : 0.1335
##
## Mcnemar's Test P-Value : 0.001087
##
## Sensitivity : 0.5625
## Specificity : 0.5717
## Pos Pred Value : 0.6010
## Neg Pred Value : 0.5325
## Prevalence : 0.5343
## Detection Rate : 0.3005
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.5671
##
## 'Positive' Class : 0
##
How would you describe the performance of this model?
Write HERE. The accuracy is above 50 percent, which might be considered a positive, but being wrong almost half of the time does not inspire confidence. Secondly, the model is slightly better at catching negatives than positives (here the 'positive' class is 0), as can be seen from the specificity (0.5717) being larger than the sensitivity (0.5625).
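As a quick sanity check, those two numbers can be recomputed by hand from the counts in the table above (caret treats class 0 as the positive class and the columns as the reference):

```r
# Recompute sensitivity and specificity from the confusion matrix counts.
sens <- 1202 / (1202 + 935)   # actual 0s correctly predicted as 0
spec <- 1065 / (1065 + 798)   # actual 1s correctly predicted as 1
round(c(sensitivity = sens, specificity = spec), 4)
```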
The confusion matrix assumed a 0.50 cutoff for prediction. Now create a ROC plot and compute the AUC for the training set.
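As an aside, a small sketch of what that 0.50 cutoff means in code (the 0.30 alternative is only there for comparison, not a recommended threshold):

```r
# Class predictions are just probability predictions thresholded at 0.5.
p_train <- predict(rg, newdata = mailing_train, type = "prob")[, 2]
pred_05 <- ifelse(p_train >= 0.5, 1, 0)  # essentially what predict(..., type = "response") does
pred_03 <- ifelse(p_train >= 0.3, 1, 0)  # a lower cutoff flags more cases as likely donors
table(cutoff_05 = pred_05, cutoff_03 = pred_03)
```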
# your code here
library(pROC)  # roc() and auc()
p <- predict(rg, newdata = mailing_train, type = "prob")[, 2] %>% as.numeric()
train_mailing_roc <- roc(mailing_train$class, p)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(train_mailing_roc)
## Area under the curve: 0.9899
plot(train_mailing_roc)
Now create a ROC chart for the test set and compute the test AUC.
# your code here
p <- predict(rg, newdata = mailing_test, type = "prob")[, 2] %>% as.numeric()
mailing_test_roc <- roc(mailing_test$class, p)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(mailing_test_roc)
## Area under the curve: 0.5828
plot(mailing_test_roc)
Is the model underfit, overfit, or correctly fit to the data?
Write a couple sentences HERE about this model fit. The model is overfit. Observing the two ROC curves side by side, the training curve sits much higher than the test curve (AUC 0.99 versus 0.58), so the model has essentially memorised the training data but generalises poorly to the test set.
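One way to see that gap directly is to overlay the two ROC curves built above (a small sketch; the colours and legend are just suggestions):

```r
# Overlay the training and test ROC curves to visualise the train/test gap.
plot(train_mailing_roc, col = "steelblue", legacy.axes = TRUE)
plot(mailing_test_roc, col = "firebrick", add = TRUE)
legend("bottomright",
       legend = c("train (AUC ~ 0.99)", "test (AUC ~ 0.58)"),
       col = c("steelblue", "firebrick"), lwd = 2)
```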
Examine the variable importance of the model.
# your code here
varImpPlot(rg)
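If the raw numbers are easier to read than the plot, the same information can be pulled out as a sorted table (a small sketch; for a default classification forest the only measure is MeanDecreaseGini):

```r
# Variable importance as a sorted data frame instead of a plot.
importance(rg) %>%
  as.data.frame() %>%
  tibble::rownames_to_column("feature") %>%
  arrange(desc(MeanDecreaseGini))
```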
Make some individual predictions of the model. Choose one case from the data and see what the model predicts for that one person.
# your code here
set.seed(42)  # note: the base R function is set.seed(); set_seed() does not exist
d <- sample_n(mailing_train, 1)
predict(rg, newdata = d, type = "prob")
## 0 1
## 1 0.08 0.92
## attr(,"class")
## [1] "matrix" "array" "votes"
Using the most important variable from the plot above, change the value of that variable to something new and make a new prediction for that one case (i.e. set the value to something very small, or very large). How does the prediction change?
Write HERE. The model is inaccurate as it is, so I assume that changing the most important variable will make the prediction wrong even more often.
# your code here
d$Amount <- 10
predict(rg, newdata = d, type = "prob")
## 0 1
## 1 0.36 0.64
## attr(,"class")
## [1] "matrix" "array" "votes"
Let’s now try and predict the outcome for this case if that important variable was changed from its minimum value to its maximum value.
I had to choose `Amount` instead of `gavr` because for some reason RStudio would not let me use `gavr`; it kept giving errors. `Amount` is the third most important variable in the plot and seems like a reasonable choice anyway.
# your code here
N <- 100
grid <- seq(min(mailing_train$Amount),
            max(mailing_train$Amount),
            length.out = N)
# uncount() comes from tidyr: replicate the single case N times, then vary Amount
d_n <- d %>%
  mutate(n = N) %>%
  uncount(n) %>%
  mutate(Amount = grid)
d_n$pred <- predict(rg, newdata = d_n, type = "prob")[, 2]
Now plot the results of that sequential grid against the predicted probability. How does the probability of responding to the mailer change in response to the variable? Tip: You can use `coord_cartesian` to change the x-limits to focus on specific areas of detail if you want. How do the predictions change:
Write HERE. Interestingly, as `Amount` gets bigger the predictions do not get closer to 1; instead the highest predicted probabilities occur in the middle of the range. My initial expectation was that the largest values would get the highest predictions.
# your code here
ggplot(d_n, aes(x = Amount, y = pred)) +
  geom_line()
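Following the `coord_cartesian` tip above, the same curve can also be zoomed in on the low end of `Amount`, where most of the observations sit (the xlim values here are just an assumed range of interest):

```r
# Same plot, zoomed in on small Amount values where most of the data lies.
ggplot(d_n, aes(x = Amount, y = pred)) +
  geom_line() +
  coord_cartesian(xlim = c(0, 0.5))
```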
We have only looked at varying one observation. What if we vary our most important feature for all observations?

Use the `iml` package to create an individual conditional expectation (ICE) chart combined with a partial dependence plot. It’s highly recommended that you only give the prediction object a subset of the data; I’m not sure the RStudio Cloud instance can handle the full data, and it would take a very long time. Use `coord_cartesian` to narrow the y-limit range so the chart focuses on a region where you can observe the effect (it’s tiny!).
What is the plot showing you?
Write HERE. In the "0" panel the yellow line starts around 0.5, dips, and stabilises around 0.45; in the "1" panel it starts around 0.55 and stabilises around 0.58. The black lines are the individual conditional expectation (ICE) curves and the yellow curve is their average, i.e. the partial dependence curve. The majority of the data is clustered near the start of the range, i.e. `Amount` values of roughly 0.1 to 0.3, which is also where the curves show the most variation.
# your code here
library(iml)
## Warning: package 'iml' was built under R version 4.4.3
samp <- sample_n(mailing_train, 200)  # use a subset so the chart computes quickly (and avoid masking base::sample)
yt_pred <- Predictor$new(model = rg, data = samp, y = samp$class, type = "prob")
yt_effect <- FeatureEffect$new(yt_pred, grid.size = 100, method = "pdp+ice", feature = "Amount")
yt_effect$plot() + coord_cartesian(ylim = c(0.25, 0.75))
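To complement the ICE/PDP view, `iml` can also compute permutation-based feature importance from the same Predictor object (a small sketch; using classification error as the loss is an assumption):

```r
# Permutation feature importance: how much does shuffling each feature hurt the model?
yt_imp <- FeatureImp$new(yt_pred, loss = "ce")
plot(yt_imp)
```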
What are your thoughts on the impact of the different features on the likelihood for a person to respond to our donation requests? (100-200 words)

For making predictions the model’s accuracy is not great. Secondly, the ROC curve I found for the training set is very different from the test set, so the AUC values are not close. From the PDP chart I can say that at low values of `Amount` a donation is not likely. As for changes, I would try a new, less complex model, or the amount of training data should be increased. I would recommend the organisation collect more relevant data and give more weight to features like the income and amount columns. To improve predictions, new columns could be added, e.g. how long it has been since the last gift. The least relevant columns should be removed to reduce complexity. Dividing `gavr` and `glast` by income could be added as a column to see how giving relates across economically different households. Finally, extreme values could be removed from numerical columns such as income and amount.
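As a hedged sketch of one of those feature-engineering ideas, the gift-to-income ratio could look something like this (`giving_ratio` and `last_gift_ratio` are made-up column names, and adding 1 to `Income` just avoids dividing by the zero incomes present in the data):

```r
# Illustrative only: relate average and last gift size to household income.
mailing_train_fe <- mailing_train %>%
  mutate(giving_ratio    = gavr  / (Income + 1),
         last_gift_ratio = glast / (Income + 1))
```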