library(ISLR2)
library(MASS)
library(tidyverse)
library(caret)
library(readr)

Introduction

Background, directly from the assignment pdf:

A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is 13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs 0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Lets understand the problem a little:

13*.051
## [1] 0.663

Each mailer costs an average of 68 cents and gains an average of 66.3 cents. I don’t know how they’ve sustained their fundraiser so far! The purpose of this exercise is to help the national veteran’s organization better predict who will and who will not be a donor; this way they can create a better ratio of donations per mailer, and reach a sustainable operation.

First, lets read the dataset into R that we’ll use for training and testing the model. Later we’ll read in the dataset used for prediction.

fundraising.orig = read_rds("C:/Users/Owner/Documents/School stuff/intro to business analysis/R stuff/Data/fundraising.rds")

And now lets explore it a little.

head(fundraising.orig, 10)
## # A tibble: 10 × 21
##    zipcon…¹ zipco…² zipco…³ zipco…⁴ homeo…⁵ num_c…⁶ income female wealth home_…⁷
##    <fct>    <fct>   <fct>   <fct>   <fct>     <dbl>  <dbl> <fct>   <dbl>   <dbl>
##  1 Yes      No      No      No      Yes           1      1 No          7     698
##  2 No       No      No      Yes     No            2      5 Yes         8     828
##  3 No       No      No      Yes     Yes           1      3 No          4    1471
##  4 No       Yes     No      No      Yes           1      4 No          8     547
##  5 No       Yes     No      No      Yes           1      4 Yes         8     482
##  6 No       No      No      Yes     Yes           1      4 Yes         8     857
##  7 No       No      Yes     No      Yes           1      4 No          5     505
##  8 Yes      No      No      No      Yes           1      4 Yes         8    1438
##  9 No       No      No      Yes     Yes           1      4 Yes         8    1316
## 10 Yes      No      No      No      Yes           1      1 Yes         5     428
## # … with 11 more variables: med_fam_inc <dbl>, avg_fam_inc <dbl>,
## #   pct_lt15k <dbl>, num_prom <dbl>, lifetime_gifts <dbl>, largest_gift <dbl>,
## #   last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## #   target <fct>, and abbreviated variable names ¹​zipconvert2, ²​zipconvert3,
## #   ³​zipconvert4, ⁴​zipconvert5, ⁵​homeowner, ⁶​num_child, ⁷​home_value

Initial Data Exploration

These variables are described in the PDF as well. An interesting observation: In the top 10 rows, everyone has an average gift value (avg_gift), regardless if they are a donor or No Donor for target. This may be worth exploring further in the future. For now, I’ll assume they donated at some point, but yet were a “No Donor” on this latest mailer. If my assumption is correct, does it factor in 0s when there is no gift?

summary(fundraising.orig)
##  zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner    num_child    
##  No :2352    Yes: 551    No :2357    No :1846    Yes:2312   Min.   :1.000  
##  Yes: 648    No :2449    Yes: 643    Yes:1154    No : 688   1st Qu.:1.000  
##                                                             Median :1.000  
##                                                             Mean   :1.069  
##                                                             3rd Qu.:1.000  
##                                                             Max.   :5.000  
##      income      female         wealth        home_value      med_fam_inc    
##  Min.   :1.000   Yes:1831   Min.   :0.000   Min.   :   0.0   Min.   :   0.0  
##  1st Qu.:3.000   No :1169   1st Qu.:5.000   1st Qu.: 554.8   1st Qu.: 278.0  
##  Median :4.000              Median :8.000   Median : 816.5   Median : 355.0  
##  Mean   :3.899              Mean   :6.396   Mean   :1143.3   Mean   : 388.4  
##  3rd Qu.:5.000              3rd Qu.:8.000   3rd Qu.:1341.2   3rd Qu.: 465.0  
##  Max.   :7.000              Max.   :9.000   Max.   :5945.0   Max.   :1500.0  
##   avg_fam_inc       pct_lt15k        num_prom      lifetime_gifts  
##  Min.   :   0.0   Min.   : 0.00   Min.   : 11.00   Min.   :  15.0  
##  1st Qu.: 318.0   1st Qu.: 5.00   1st Qu.: 29.00   1st Qu.:  45.0  
##  Median : 396.0   Median :12.00   Median : 48.00   Median :  81.0  
##  Mean   : 432.3   Mean   :14.71   Mean   : 49.14   Mean   : 110.7  
##  3rd Qu.: 516.0   3rd Qu.:21.00   3rd Qu.: 65.00   3rd Qu.: 135.0  
##  Max.   :1331.0   Max.   :90.00   Max.   :157.00   Max.   :5674.9  
##   largest_gift       last_gift      months_since_donate    time_lag     
##  Min.   :   5.00   Min.   :  0.00   Min.   :17.00       Min.   : 0.000  
##  1st Qu.:  10.00   1st Qu.:  7.00   1st Qu.:29.00       1st Qu.: 3.000  
##  Median :  15.00   Median : 10.00   Median :31.00       Median : 5.000  
##  Mean   :  16.65   Mean   : 13.48   Mean   :31.13       Mean   : 6.876  
##  3rd Qu.:  20.00   3rd Qu.: 16.00   3rd Qu.:34.00       3rd Qu.: 9.000  
##  Max.   :1000.00   Max.   :219.00   Max.   :37.00       Max.   :77.000  
##     avg_gift            target    
##  Min.   :  2.139   Donor   :1499  
##  1st Qu.:  6.333   No Donor:1501  
##  Median :  9.000                  
##  Mean   : 10.669                  
##  3rd Qu.: 12.800                  
##  Max.   :122.167

Of note: Our PDF says that there is 22 variables, and describes 22 variables. Yet here we only have 21. zipconvert1 appears to be the missing variable. Assuming the dataset zipcodes is clean, all of the zipconverts should add up to the same value of “yes” and “no”, and all the “yes” across all 5 variables should also equal this value. Assuming they all add up, we can use this information to create our missing variable.

2352+648
## [1] 3000
551+2449
## [1] 3000
2357+643
## [1] 3000
1846+1154
## [1] 3000
2312+688
## [1] 3000

Good news! Looks like our method works. Lets see how many “Yes” there should be in the last zip variable.

3000-(648+551+643+1154)
## [1] 4

Only 4 people with Yes in zipconvert1; Either they are less charitable in New York than I expected, or there is something wrong with our methodology or data. With only 4 values, it will unlikely be useful to include it. However, lets remake our dataset to include and we can test it both ways.

new.var = ifelse(fundraising.orig$zipconvert2 == "No" & fundraising.orig$zipconvert3 == "No" & fundraising.orig$zipconvert4 == "No" & fundraising.orig$zipconvert5 == "No", "Yes", "No")

table(new.var)
## new.var
##   No  Yes 
## 2996    4

Now lets slot this new variable into a modified table

fundraising.alt1=fundraising.orig
fundraising.alt1$zipconvert1 = new.var
fundraising.alt1[fundraising.alt1$zipconvert1 == "Yes", ]
## # A tibble: 4 × 22
##   zipconv…¹ zipco…² zipco…³ zipco…⁴ homeo…⁵ num_c…⁶ income female wealth home_…⁷
##   <fct>     <fct>   <fct>   <fct>   <fct>     <dbl>  <dbl> <fct>   <dbl>   <dbl>
## 1 No        No      No      No      No            1      1 Yes         8     564
## 2 No        No      No      No      No            1      2 No          8    4607
## 3 No        No      No      No      Yes           1      4 No          2    1269
## 4 No        No      No      No      Yes           1      4 No          1    1273
## # … with 12 more variables: med_fam_inc <dbl>, avg_fam_inc <dbl>,
## #   pct_lt15k <dbl>, num_prom <dbl>, lifetime_gifts <dbl>, largest_gift <dbl>,
## #   last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## #   target <fct>, zipconvert1 <chr>, and abbreviated variable names
## #   ¹​zipconvert2, ²​zipconvert3, ³​zipconvert4, ⁴​zipconvert5, ⁵​homeowner,
## #   ⁶​num_child, ⁷​home_value
glimpse(fundraising.alt1)
## Rows: 3,000
## Columns: 22
## $ zipconvert2         <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Yes…
## $ zipconvert3         <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No, …
## $ zipconvert4         <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No, …
## $ zipconvert5         <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, No…
## $ homeowner           <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y…
## $ num_child           <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ income              <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2,…
## $ female              <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes,…
## $ wealth              <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7,…
## $ home_value          <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 42…
## $ med_fam_inc         <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203, …
## $ avg_fam_inc         <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271, …
## $ pct_lt15k           <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13,…
## $ num_prom            <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 27…
## $ lifetime_gifts      <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 40…
## $ largest_gift        <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20, …
## $ last_gift           <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7, …
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 37…
## $ time_lag            <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3, …
## $ avg_gift            <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000, …
## $ target              <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, Do…
## $ zipconvert1         <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…

Now lets look for nulls. Scanning the summary function above, I do not see any nulls.

Before we split our dataset into training and test, lets examine the predictors more closely to evaluate their association with the response variable, determine which are good candidate predictors, and look for collinear relationships. Note: This is “Step 2: Model Building” from the assignment; however, I’m doing this before Step 1: Partitioning. When looking at the relationship between the variables, I do not seem the harm in looking at them all together. This warrants more investigation, and if Step 1 must occur first, we’ll have to circle back and do this step again.

I’m going to skip pairs for now due to the large number of categorical variables; we can always factor them and check them later; in the mean time, lets check some graphs for variables we may be interested in:

par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$income, ylab = "income")
plot(fundraising.alt1$income, fundraising.alt1$target, xlab = "income")

plot(factor(fundraising.alt1$target), fundraising.alt1$home_value, ylab = "home_value")
plot(fundraising.alt1$home_value, fundraising.alt1$target, xlab = "home_value")

home_value appears to have very little differentiation between the values of our target variable. income appears to have some differentiation between the values of our target variable.

par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$wealth, ylab = "wealth")
plot(fundraising.alt1$wealth, fundraising.alt1$target, xlab = "wealth")

plot(factor(fundraising.alt1$target), fundraising.alt1$num_prom, ylab = "num_prom")
plot(fundraising.alt1$num_prom, fundraising.alt1$target, xlab = "num_prom")

par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$avg_gift, ylab = "avg_gift")
plot(fundraising.alt1$avg_gift, fundraising.alt1$target, xlab = "avg_gift")

plot(factor(fundraising.alt1$target), fundraising.alt1$lifetime_gifts, ylab = "lifetime_gifts")
plot(fundraising.alt1$lifetime_gifts, fundraising.alt1$target, xlab = "lifetime_gifts")

Variable Transformations and cleaning

In the caret videos, the variables were checked for skewness; while I don’t recall doing that in other assignments, it makes sense I’ll try to replicate here:

fundraising.alt1[,-2] %>%
  keep(is.numeric) %>%
  gather() %>%
ggplot(aes(value)) + 
    facet_wrap(~ key, scales = "free") + 
    geom_density()

Some of these variables should likely use the log to make the distribution more normal.

fundraising.alt1[,-2] %>%
  keep(is.numeric) %>%
  gather() %>%
ggplot(aes(log(value))) + 
    facet_wrap(~ key, scales = "free") + 
    geom_density()

I plan to create new variables for avg_fam_inc, avg_gift, home_value, largest_gift, last_gift, lifetime_gifts, med_fam_inc, num_prom, pct_It15k, and time_lag. NOTE: Later on, when doing correlation analysis, I noticed a lot of NaN in the correlations from my log transformed variables. This is likely due to 0s in variable getting log transformed. After researching this problem online, I found a common solution is to just add 1. This is the solution I will use, and only on variables I was getting NaN.

avg_fam_inc_log = log(fundraising.alt1$avg_fam_inc + 1)
avg_gift_log = log(fundraising.alt1$avg_gift)
home_value_log = log(fundraising.alt1$home_value + 1)
largest_gift_log = log(fundraising.alt1$largest_gift)
last_gift_log = log(fundraising.alt1$last_gift + 1)
lifetime_gifts_log = log(fundraising.alt1$lifetime_gifts)
med_fam_inc_log = log(fundraising.alt1$med_fam_inc + 1)
num_prom_log = log(fundraising.alt1$num_prom)
pct_lt15k_log = log(fundraising.alt1$pct_lt15k + 1)
time_lag_log = log(fundraising.alt1$time_lag + 1)

fundraising.alt1$avg_fam_inc_log = avg_fam_inc_log
fundraising.alt1$avg_gift_log = avg_gift_log
fundraising.alt1$home_value_log = home_value_log
fundraising.alt1$largest_gift_log = largest_gift_log
fundraising.alt1$last_gift_log = last_gift_log
fundraising.alt1$lifetime_gifts_log = lifetime_gifts_log
fundraising.alt1$med_fam_inc_log = med_fam_inc_log
fundraising.alt1$num_prom_log = num_prom_log
fundraising.alt1$pct_lt15k_log = pct_lt15k_log
fundraising.alt1$time_lag_log = time_lag_log

Its time to start modifying existing variables to scale them and to remove redundant and unnecessary variables from creating the log transformations. So first, lets build a clean dataset.

fundraising.alt1.clean = fundraising.alt1

Gift values and income level variables will have a larger impact on our model than some of the other variables due to scaling. For now, we’ll let the log transformation be scaling, and we may want to revisit other scaling techniques.

In most cases, we don’t need the log and non-log versions of the same variable. Note from future self: rerunning all the models later in this document including the non-transformed variables when there isn’t collinearity did not enhance the accuracy.

fundraising.alt1.clean=subset(fundraising.alt1.clean, select=-c(avg_fam_inc, avg_gift, home_value, largest_gift, last_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag))

How is our dataset looking now?

str(fundraising.alt1.clean)
## tibble [3,000 × 22] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
##  $ zipconvert1        : chr [1:3000] "No" "No" "No" "No" ...
##  $ avg_fam_inc_log    : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
##  $ avg_gift_log       : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
##  $ home_value_log     : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
##  $ largest_gift_log   : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
##  $ last_gift_log      : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
##  $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
##  $ med_fam_inc_log    : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
##  $ num_prom_log       : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
##  $ pct_lt15k_log      : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
##  $ time_lag_log       : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...

It looks like our new variable is character, so lets factor that one. Also, wealth looks to be categorical and possibly even income. These would be ordinal as the write up indicates it goes from low to high. For now, I’ll leave them as - is, but we may need to circle back and make them factors.

fundraising.alt1$zipconvert1 = as_factor(fundraising.alt1$zipconvert1)

Lets verify the income variable is usable from the description above.

fundraising.alt1 %>% ggplot(aes(x=income)) + geom_histogram()

Now we’ll check for correlation and collinearity.

str(fundraising.alt1.clean)
## tibble [3,000 × 22] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
##  $ zipconvert1        : chr [1:3000] "No" "No" "No" "No" ...
##  $ avg_fam_inc_log    : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
##  $ avg_gift_log       : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
##  $ home_value_log     : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
##  $ largest_gift_log   : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
##  $ last_gift_log      : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
##  $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
##  $ med_fam_inc_log    : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
##  $ num_prom_log       : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
##  $ pct_lt15k_log      : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
##  $ time_lag_log       : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...
temp = fundraising.alt1.clean[, c(6,7,9,10,13,14,15,16,17,18,19,20,21, 22)]
correlation = cor(temp)
round(correlation, 2)
##                     num_child income wealth months_since_donate avg_fam_inc_log
## num_child                1.00   0.09   0.06               -0.01            0.04
## income                   0.09   1.00   0.21                0.08            0.22
## wealth                   0.06   0.21   1.00                0.03            0.27
## months_since_donate     -0.01   0.08   0.03                1.00            0.02
## avg_fam_inc_log          0.04   0.22   0.27                0.02            1.00
## avg_gift_log            -0.02   0.14   0.10                0.24            0.07
## home_value_log           0.00   0.25   0.24                0.01            0.80
## largest_gift_log        -0.03   0.12   0.01                0.17            0.05
## last_gift_log           -0.02   0.13   0.07                0.29            0.07
## lifetime_gifts_log      -0.08  -0.03  -0.40               -0.23           -0.03
## med_fam_inc_log          0.04   0.24   0.29                0.02            0.99
## num_prom_log            -0.08  -0.07  -0.43               -0.26           -0.04
## pct_lt15k_log           -0.05  -0.31  -0.36               -0.02           -0.33
## time_lag_log            -0.01   0.00  -0.06                0.06            0.03
##                     avg_gift_log home_value_log largest_gift_log last_gift_log
## num_child                  -0.02           0.00            -0.03         -0.02
## income                      0.14           0.25             0.12          0.13
## wealth                      0.10           0.24             0.01          0.07
## months_since_donate         0.24           0.01             0.17          0.29
## avg_fam_inc_log             0.07           0.80             0.05          0.07
## avg_gift_log                1.00           0.12             0.87          0.84
## home_value_log              0.12           1.00             0.09          0.12
## largest_gift_log            0.87           0.09             1.00          0.82
## last_gift_log               0.84           0.12             0.82          1.00
## lifetime_gifts_log          0.09          -0.02             0.30          0.12
## med_fam_inc_log             0.08           0.79             0.05          0.08
## num_prom_log               -0.24          -0.05            -0.01         -0.14
## pct_lt15k_log              -0.09          -0.30            -0.08         -0.09
## time_lag_log                0.11           0.02             0.10          0.12
##                     lifetime_gifts_log med_fam_inc_log num_prom_log
## num_child                        -0.08            0.04        -0.08
## income                           -0.03            0.24        -0.07
## wealth                           -0.40            0.29        -0.43
## months_since_donate              -0.23            0.02        -0.26
## avg_fam_inc_log                  -0.03            0.99        -0.04
## avg_gift_log                      0.09            0.08        -0.24
## home_value_log                   -0.02            0.79        -0.05
## largest_gift_log                  0.30            0.05        -0.01
## last_gift_log                     0.12            0.08        -0.14
## lifetime_gifts_log                1.00           -0.03         0.85
## med_fam_inc_log                  -0.03            1.00        -0.04
## num_prom_log                      0.85           -0.04         1.00
## pct_lt15k_log                     0.05           -0.37         0.05
## time_lag_log                      0.03            0.02         0.12
##                     pct_lt15k_log time_lag_log
## num_child                   -0.05        -0.01
## income                      -0.31         0.00
## wealth                      -0.36        -0.06
## months_since_donate         -0.02         0.06
## avg_fam_inc_log             -0.33         0.03
## avg_gift_log                -0.09         0.11
## home_value_log              -0.30         0.02
## largest_gift_log            -0.08         0.10
## last_gift_log               -0.09         0.12
## lifetime_gifts_log           0.05         0.03
## med_fam_inc_log             -0.37         0.02
## num_prom_log                 0.05         0.12
## pct_lt15k_log                1.00         0.00
## time_lag_log                 0.00         1.00

Here are some variables that have correlation above absolute value of .8:
avg_gift_log and largest_gift_log at .87
avg_gift_log and last_gift_log at .84
largest_gift_log and last_gift_log at .82
num_prom_log and lifetime_gifts_log at .85
avg_fam_inc_log and med_fam_inc_log at .99
home_value_log and med_fam_inc_log at .79 (close enough)
home_value_log and avg_fam_inc_log at .80

Time to split our dataset

set.seed(12345)
train_index <- sample(1:nrow(fundraising.alt1.clean), nrow(fundraising.alt1.clean)*.8)

train = fundraising.alt1.clean[train_index, ]
test = fundraising.alt1.clean[-train_index, ]

nrow(train)/nrow(fundraising.alt1.clean)
## [1] 0.8
nrow(test)/nrow(fundraising.alt1.clean)
## [1] 0.2

First attempt we’ll use cross validation with caret package. repeatedcv seems to be the most popular method. (Note from future self: Caret made model fitting and evaluation standardized and easy and this was employed through the remainder of the document)

train_control = trainControl(method='repeatedcv', number=10,repeats=3)

I saw what appeared to be a simple way for variable selection done online - run a random forest model and allow it to do best variable selection. I’ll do that. Alternatively, I could have employed methods from chapter 5, and may circle back and try to get better results that way pending time.

rf.fit = train(target~., data=fundraising.alt1.clean,trControl=train_control,method='rf')

Further Variable Selection

rf.fit$besttune
## NULL
varImp(rf.fit)
## rf variable importance
## 
##   only 20 most important variables shown (out of 21)
## 
##                     Overall
## avg_gift_log         100.00
## home_value_log        99.40
## med_fam_inc_log       96.04
## lifetime_gifts_log    94.81
## avg_fam_inc_log       94.34
## num_prom_log          89.93
## pct_lt15k_log         83.00
## time_lag_log          73.88
## months_since_donate   71.93
## largest_gift_log      66.31
## last_gift_log         65.97
## income                51.66
## wealth                41.80
## femaleNo              17.22
## homeownerNo           14.47
## zipconvert5Yes        13.34
## zipconvert2Yes        12.05
## zipconvert4Yes        12.04
## zipconvert3No         11.12
## num_child             10.56
plot(varImp(rf.fit))

pred.rf<-predict(rf.fit,test)
confusionMatrix(pred.rf,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      290        0
##   No Donor     0      310
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9939, 1)
##     No Information Rate : 0.5167     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.4833     
##          Detection Rate : 0.4833     
##    Detection Prevalence : 0.4833     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : Donor      
## 

Here are the variables we’ll try to use at first (we should use another method to verify variable use). We came to these by removing collinear relationships and poor performers.

avg_gift_log, home_value_log, lifetime_gifts_log, pct_lt15k_log, time_lag_log, months_since_donate, income, wealth, female, homeowner

For simplicity, lets do all this in caret.

lda.fit.try2=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "lda", data = train, trControl = train_control)

pred.lda.try2=predict(lda.fit.try2,test)
confusionMatrix(pred.lda.try2,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      177      160
##   No Donor   113      150
##                                           
##                Accuracy : 0.545           
##                  95% CI : (0.5042, 0.5854)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.088750        
##                                           
##                   Kappa : 0.0937          
##                                           
##  Mcnemar's Test P-Value : 0.005369        
##                                           
##             Sensitivity : 0.6103          
##             Specificity : 0.4839          
##          Pos Pred Value : 0.5252          
##          Neg Pred Value : 0.5703          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2950          
##    Detection Prevalence : 0.5617          
##       Balanced Accuracy : 0.5471          
##                                           
##        'Positive' Class : Donor           
## 

LDA results: 54.4% test accuracy. Well, its better than 50% at least. Hopefully we’ll find better, but in many of the exercises of class, LDA performed the best.

qda.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "qda", data = train, trControl = train_control)

pred.qda=predict(qda.fit,test)
confusionMatrix(pred.qda,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      182      173
##   No Donor   108      137
##                                           
##                Accuracy : 0.5317          
##                  95% CI : (0.4908, 0.5722)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.2438027       
##                                           
##                   Kappa : 0.069           
##                                           
##  Mcnemar's Test P-Value : 0.0001346       
##                                           
##             Sensitivity : 0.6276          
##             Specificity : 0.4419          
##          Pos Pred Value : 0.5127          
##          Neg Pred Value : 0.5592          
##              Prevalence : 0.4833          
##          Detection Rate : 0.3033          
##    Detection Prevalence : 0.5917          
##       Balanced Accuracy : 0.5348          
##                                           
##        'Positive' Class : Donor           
## 

QDA Results: 53.17%. As hypothesized, QDA did not do better. However, of note, it did positively identify more donors. The best bang for our buck is when our model positively identifies donors. As this competition seeks to have the best accuracy, LDA is winning out. However, to see which model results in the best bottom line, we would want to multiply what we gained from the positive donors and subtract out where we sent direct mailers to non-donors.

glm.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "glm", data = train, trControl = train_control, family = "binomial")

pred.glm=predict(glm.fit,test)
confusionMatrix(pred.glm,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      177      160
##   No Donor   113      150
##                                           
##                Accuracy : 0.545           
##                  95% CI : (0.5042, 0.5854)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.088750        
##                                           
##                   Kappa : 0.0937          
##                                           
##  Mcnemar's Test P-Value : 0.005369        
##                                           
##             Sensitivity : 0.6103          
##             Specificity : 0.4839          
##          Pos Pred Value : 0.5252          
##          Neg Pred Value : 0.5703          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2950          
##    Detection Prevalence : 0.5617          
##       Balanced Accuracy : 0.5471          
##                                           
##        'Positive' Class : Donor           
## 

GLM Results: GLM had the exact same performance as LDA. 54.5% with the same confusion matrix

knn.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "knn", data = train, trControl = train_control, tuneLength=40)

pred.knn=predict(knn.fit,test)
confusionMatrix(pred.knn,test$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      158      161
##   No Donor   132      149
##                                           
##                Accuracy : 0.5117          
##                  95% CI : (0.4709, 0.5524)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.6127          
##                                           
##                   Kappa : 0.0254          
##                                           
##  Mcnemar's Test P-Value : 0.1019          
##                                           
##             Sensitivity : 0.5448          
##             Specificity : 0.4806          
##          Pos Pred Value : 0.4953          
##          Neg Pred Value : 0.5302          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2633          
##    Detection Prevalence : 0.5317          
##       Balanced Accuracy : 0.5127          
##                                           
##        'Positive' Class : Donor           
## 

KNN Results: 51.17%. KNN produced considerably worse results. At first I ran KNN with a tunelength of 10; when I got 51.17, I decided to check a greater range of values - note, changing it to 40 did not change the results.

knn.fit
## k-Nearest Neighbors 
## 
## 2400 samples
##   10 predictor
##    2 classes: 'Donor', 'No Donor' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 2160, 2160, 2160, 2161, 2160, 2160, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa     
##    5  0.5211118  0.04222767
##    7  0.5188872  0.03764644
##    9  0.5224960  0.04486498
##   11  0.5213855  0.04254076
##   13  0.5187541  0.03719135
##   15  0.5247298  0.04913618
##   17  0.5344515  0.06852477
##   19  0.5284717  0.05657842
##   21  0.5286158  0.05682272
##   23  0.5233403  0.04626945
##   25  0.5232026  0.04588015
##   27  0.5243079  0.04804429
##   29  0.5240371  0.04749713
##   31  0.5262582  0.05177853
##   33  0.5234798  0.04620541
##   35  0.5190319  0.03730082
##   37  0.5191720  0.03757268
##   39  0.5195880  0.03845239
##   41  0.5148664  0.02906859
##   43  0.5137541  0.02685167
##   45  0.5130649  0.02538581
##   47  0.5145920  0.02832949
##   49  0.5159821  0.03110561
##   51  0.5154283  0.02997422
##   53  0.5140405  0.02721635
##   55  0.5195932  0.03831245
##   57  0.5198710  0.03887911
##   59  0.5219589  0.04300056
##   61  0.5247402  0.04855957
##   63  0.5240446  0.04711425
##   65  0.5243241  0.04771024
##   67  0.5209885  0.04099105
##   69  0.5236279  0.04624499
##   71  0.5248756  0.04877770
##   73  0.5265411  0.05205397
##   75  0.5265371  0.05193790
##   77  0.5258473  0.05056209
##   79  0.5251494  0.04910220
##   81  0.5275070  0.05378805
##   83  0.5316749  0.06212756
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.

Out of curiosity, I checked k in our KNN model. Optimal solution is at 17, showing 53.3%. I don’t know why it produced something better than our results showed, but still even when examining KNN at the optimal value its still coming in a little under LDA and GLM.

plot(knn.fit)

Maybe we can tweak the variables involved and revisit KNN. With K values these large, we might be introducing a bit of bias for our purposes, so lets move on.

Final Results: With GLM and LDA being tied, I’ll go with LDA; However, it is worth running everything again to see if I can do better variable selection.

Try 2!

glimpse(fundraising.alt1)
## Rows: 3,000
## Columns: 32
## $ zipconvert2         <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Yes…
## $ zipconvert3         <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No, …
## $ zipconvert4         <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No, …
## $ zipconvert5         <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, No…
## $ homeowner           <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y…
## $ num_child           <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ income              <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2,…
## $ female              <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes,…
## $ wealth              <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7,…
## $ home_value          <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 42…
## $ med_fam_inc         <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203, …
## $ avg_fam_inc         <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271, …
## $ pct_lt15k           <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13,…
## $ num_prom            <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 27…
## $ lifetime_gifts      <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 40…
## $ largest_gift        <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20, …
## $ last_gift           <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7, …
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 37…
## $ time_lag            <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3, …
## $ avg_gift            <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000, …
## $ target              <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, Do…
## $ zipconvert1         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ avg_fam_inc_log     <dbl> 6.139885, 5.932245, 6.304449, 6.070738, 5.620401, …
## $ avg_gift_log        <dbl> 2.240710, 1.455287, 1.957274, 2.036882, 1.987874, …
## $ home_value_log      <dbl> 6.549651, 6.720220, 7.294377, 6.306275, 6.180017, …
## $ largest_gift_log    <dbl> 2.484907, 2.302585, 2.302585, 2.397895, 2.302585, …
## $ last_gift_log       <dbl> 2.564949, 1.791759, 2.197225, 2.484907, 2.397895, …
## $ lifetime_gifts_log  <dbl> 4.543295, 3.401197, 5.176150, 3.135494, 4.290459, …
## $ med_fam_inc_log     <dbl> 6.047372, 5.883322, 6.184149, 5.958425, 5.493061, …
## $ num_prom_log        <dbl> 3.828641, 3.465736, 4.543295, 2.995732, 3.637586, …
## $ pct_lt15k_log       <dbl> 1.6094379, 2.6390573, 1.6094379, 2.0794415, 3.3672…
## $ time_lag_log        <dbl> 1.9459101, 2.0794415, 1.3862944, 1.9459101, 1.3862…

Lets try using the full set that also had the log transformation in it with lda.

First to build a new training and test set

set.seed(12345)
train_index2 <- sample(1:nrow(fundraising.alt1), nrow(fundraising.alt1)*.8)

train2 = fundraising.alt1[train_index2, ]
test2 = fundraising.alt1[-train_index2, ]

nrow(train)/nrow(fundraising.alt1)
## [1] 0.8
nrow(test)/nrow(fundraising.alt1)
## [1] 0.2
lda.fit.try3=train(target~ ., method = "lda", data = train2, trControl = train_control)

pred.lda.try3=predict(lda.fit.try3,test2)
confusionMatrix(pred.lda.try3,test2$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      169      154
##   No Donor   121      156
##                                           
##                Accuracy : 0.5417          
##                  95% CI : (0.5008, 0.5821)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.11804         
##                                           
##                   Kappa : 0.0857          
##                                           
##  Mcnemar's Test P-Value : 0.05365         
##                                           
##             Sensitivity : 0.5828          
##             Specificity : 0.5032          
##          Pos Pred Value : 0.5232          
##          Neg Pred Value : 0.5632          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2817          
##    Detection Prevalence : 0.5383          
##       Balanced Accuracy : 0.5430          
##                                           
##        'Positive' Class : Donor           
## 

results: Without removing any collinear relationships we get a warning about it, but we get nearly the same results.

Lets remove the potential collinear relationships but leave in all variables otherwise.

I don’t want to corrupt my past work, so I’ll make a new dataset.

fundraising.alt2 = fundraising.alt1
str(fundraising.alt2)
## tibble [3,000 × 32] (S3: tbl_df/tbl/data.frame)
##  $ zipconvert2        : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
##  $ zipconvert3        : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
##  $ zipconvert4        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
##  $ zipconvert5        : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ homeowner          : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
##  $ num_child          : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
##  $ income             : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
##  $ female             : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
##  $ wealth             : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
##  $ home_value         : num [1:3000] 698 828 1471 547 482 ...
##  $ med_fam_inc        : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
##  $ avg_fam_inc        : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
##  $ pct_lt15k          : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
##  $ num_prom           : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
##  $ lifetime_gifts     : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
##  $ largest_gift       : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
##  $ last_gift          : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
##  $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
##  $ time_lag           : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
##  $ avg_gift           : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
##  $ target             : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
##  $ zipconvert1        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ avg_fam_inc_log    : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
##  $ avg_gift_log       : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
##  $ home_value_log     : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
##  $ largest_gift_log   : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
##  $ last_gift_log      : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
##  $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
##  $ med_fam_inc_log    : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
##  $ num_prom_log       : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
##  $ pct_lt15k_log      : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
##  $ time_lag_log       : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...
temp2 = fundraising.alt2[, c(6,7,9,10,11,12,13,14,15,16,17,18,19,20, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)]
correlation2 = cor(temp2)
round(correlation2, 2)
##                     num_child income wealth home_value med_fam_inc avg_fam_inc
## num_child                1.00   0.09   0.06      -0.01        0.05        0.05
## income                   0.09   1.00   0.21       0.29        0.37        0.38
## wealth                   0.06   0.21   1.00       0.26        0.38        0.39
## home_value              -0.01   0.29   0.26       1.00        0.74        0.75
## med_fam_inc              0.05   0.37   0.38       0.74        1.00        0.97
## avg_fam_inc              0.05   0.38   0.39       0.75        0.97        1.00
## pct_lt15k               -0.03  -0.28  -0.38      -0.40       -0.67       -0.68
## num_prom                -0.09  -0.07  -0.41      -0.06       -0.05       -0.06
## lifetime_gifts          -0.05  -0.02  -0.23      -0.02       -0.04       -0.04
## largest_gift            -0.02   0.03  -0.03       0.06        0.05        0.04
## last_gift               -0.01   0.11   0.05       0.16        0.14        0.13
## months_since_donate     -0.01   0.08   0.03       0.02        0.03        0.03
## time_lag                -0.01   0.00  -0.07       0.00        0.02        0.02
## avg_gift                -0.02   0.12   0.09       0.17        0.14        0.13
## avg_fam_inc_log          0.04   0.22   0.27       0.49        0.69        0.73
## avg_gift_log            -0.02   0.14   0.10       0.16        0.14        0.14
## home_value_log           0.00   0.25   0.24       0.75        0.63        0.68
## largest_gift_log        -0.03   0.12   0.01       0.12        0.12        0.11
## last_gift_log           -0.02   0.13   0.07       0.14        0.14        0.13
## lifetime_gifts_log      -0.08  -0.03  -0.40      -0.03       -0.03       -0.04
## med_fam_inc_log          0.04   0.24   0.29       0.50        0.74        0.76
## num_prom_log            -0.08  -0.07  -0.43      -0.07       -0.06       -0.06
## pct_lt15k_log           -0.05  -0.31  -0.36      -0.43       -0.69       -0.70
## time_lag_log            -0.01   0.00  -0.06       0.00        0.01        0.02
##                     pct_lt15k num_prom lifetime_gifts largest_gift last_gift
## num_child               -0.03    -0.09          -0.05        -0.02     -0.01
## income                  -0.28    -0.07          -0.02         0.03      0.11
## wealth                  -0.38    -0.41          -0.23        -0.03      0.05
## home_value              -0.40    -0.06          -0.02         0.06      0.16
## med_fam_inc             -0.67    -0.05          -0.04         0.05      0.14
## avg_fam_inc             -0.68    -0.06          -0.04         0.04      0.13
## pct_lt15k                1.00     0.04           0.06        -0.01     -0.06
## num_prom                 0.04     1.00           0.54         0.11     -0.06
## lifetime_gifts           0.06     0.54           1.00         0.51      0.20
## largest_gift            -0.01     0.11           0.51         1.00      0.45
## last_gift               -0.06    -0.06           0.20         0.45      1.00
## months_since_donate     -0.01    -0.28          -0.14         0.02      0.19
## time_lag                -0.02     0.12           0.04         0.04      0.08
## avg_gift                -0.06    -0.15           0.18         0.47      0.87
## avg_fam_inc_log         -0.43    -0.04          -0.03         0.02      0.07
## avg_gift_log            -0.08    -0.19           0.17         0.42      0.75
## home_value_log          -0.36    -0.05          -0.02         0.04      0.12
## largest_gift_log        -0.07     0.04           0.35         0.59      0.76
## last_gift_log           -0.08    -0.10           0.16         0.35      0.85
## lifetime_gifts_log       0.04     0.84           0.66         0.25      0.17
## med_fam_inc_log         -0.48    -0.04          -0.03         0.02      0.07
## num_prom_log             0.04     0.96           0.47         0.06     -0.08
## pct_lt15k_log            0.87     0.05           0.05        -0.02     -0.07
## time_lag_log            -0.01     0.09          -0.03         0.03      0.09
##                     months_since_donate time_lag avg_gift avg_fam_inc_log
## num_child                         -0.01    -0.01    -0.02            0.04
## income                             0.08     0.00     0.12            0.22
## wealth                             0.03    -0.07     0.09            0.27
## home_value                         0.02     0.00     0.17            0.49
## med_fam_inc                        0.03     0.02     0.14            0.69
## avg_fam_inc                        0.03     0.02     0.13            0.73
## pct_lt15k                         -0.01    -0.02    -0.06           -0.43
## num_prom                          -0.28     0.12    -0.15           -0.04
## lifetime_gifts                    -0.14     0.04     0.18           -0.03
## largest_gift                       0.02     0.04     0.47            0.02
## last_gift                          0.19     0.08     0.87            0.07
## months_since_donate                1.00     0.02     0.19            0.02
## time_lag                           0.02     1.00     0.07            0.03
## avg_gift                           0.19     0.07     1.00            0.06
## avg_fam_inc_log                    0.02     0.03     0.06            1.00
## avg_gift_log                       0.24     0.09     0.87            0.07
## home_value_log                     0.01     0.02     0.12            0.80
## largest_gift_log                   0.17     0.10     0.76            0.05
## last_gift_log                      0.29     0.09     0.74            0.07
## lifetime_gifts_log                -0.23     0.08     0.11           -0.03
## med_fam_inc_log                    0.02     0.02     0.07            0.99
## num_prom_log                      -0.26     0.16    -0.20           -0.04
## pct_lt15k_log                     -0.02    -0.01    -0.07           -0.33
## time_lag_log                       0.06     0.89     0.08            0.03
##                     avg_gift_log home_value_log largest_gift_log last_gift_log
## num_child                  -0.02           0.00            -0.03         -0.02
## income                      0.14           0.25             0.12          0.13
## wealth                      0.10           0.24             0.01          0.07
## home_value                  0.16           0.75             0.12          0.14
## med_fam_inc                 0.14           0.63             0.12          0.14
## avg_fam_inc                 0.14           0.68             0.11          0.13
## pct_lt15k                  -0.08          -0.36            -0.07         -0.08
## num_prom                   -0.19          -0.05             0.04         -0.10
## lifetime_gifts              0.17          -0.02             0.35          0.16
## largest_gift                0.42           0.04             0.59          0.35
## last_gift                   0.75           0.12             0.76          0.85
## months_since_donate         0.24           0.01             0.17          0.29
## time_lag                    0.09           0.02             0.10          0.09
## avg_gift                    0.87           0.12             0.76          0.74
## avg_fam_inc_log             0.07           0.80             0.05          0.07
## avg_gift_log                1.00           0.12             0.87          0.84
## home_value_log              0.12           1.00             0.09          0.12
## largest_gift_log            0.87           0.09             1.00          0.82
## last_gift_log               0.84           0.12             0.82          1.00
## lifetime_gifts_log          0.09          -0.02             0.30          0.12
## med_fam_inc_log             0.08           0.79             0.05          0.08
## num_prom_log               -0.24          -0.05            -0.01         -0.14
## pct_lt15k_log              -0.09          -0.30            -0.08         -0.09
## time_lag_log                0.11           0.02             0.10          0.12
##                     lifetime_gifts_log med_fam_inc_log num_prom_log
## num_child                        -0.08            0.04        -0.08
## income                           -0.03            0.24        -0.07
## wealth                           -0.40            0.29        -0.43
## home_value                       -0.03            0.50        -0.07
## med_fam_inc                      -0.03            0.74        -0.06
## avg_fam_inc                      -0.04            0.76        -0.06
## pct_lt15k                         0.04           -0.48         0.04
## num_prom                          0.84           -0.04         0.96
## lifetime_gifts                    0.66           -0.03         0.47
## largest_gift                      0.25            0.02         0.06
## last_gift                         0.17            0.07        -0.08
## months_since_donate              -0.23            0.02        -0.26
## time_lag                          0.08            0.02         0.16
## avg_gift                          0.11            0.07        -0.20
## avg_fam_inc_log                  -0.03            0.99        -0.04
## avg_gift_log                      0.09            0.08        -0.24
## home_value_log                   -0.02            0.79        -0.05
## largest_gift_log                  0.30            0.05        -0.01
## last_gift_log                     0.12            0.08        -0.14
## lifetime_gifts_log                1.00           -0.03         0.85
## med_fam_inc_log                  -0.03            1.00        -0.04
## num_prom_log                      0.85           -0.04         1.00
## pct_lt15k_log                     0.05           -0.37         0.05
## time_lag_log                      0.03            0.02         0.12
##                     pct_lt15k_log time_lag_log
## num_child                   -0.05        -0.01
## income                      -0.31         0.00
## wealth                      -0.36        -0.06
## home_value                  -0.43         0.00
## med_fam_inc                 -0.69         0.01
## avg_fam_inc                 -0.70         0.02
## pct_lt15k                    0.87        -0.01
## num_prom                     0.05         0.09
## lifetime_gifts               0.05        -0.03
## largest_gift                -0.02         0.03
## last_gift                   -0.07         0.09
## months_since_donate         -0.02         0.06
## time_lag                    -0.01         0.89
## avg_gift                    -0.07         0.08
## avg_fam_inc_log             -0.33         0.03
## avg_gift_log                -0.09         0.11
## home_value_log              -0.30         0.02
## largest_gift_log            -0.08         0.10
## last_gift_log               -0.09         0.12
## lifetime_gifts_log           0.05         0.03
## med_fam_inc_log             -0.37         0.02
## num_prom_log                 0.05         0.12
## pct_lt15k_log                1.00         0.00
## time_lag_log                 0.00         1.00

Reexamining collinearity on the full dataset, I’ve decided to remove: Med_fam_inc, home_value, pct_lt15k, num_prom, num_prom_log, avg_gift, last_gift_log, time_lag, avg_gift, avg_fam_inc_log, and largest_gift_log (and that zip1 variable we made that doesn’t seem to do anything)

fundraising.alt2.clean=subset(fundraising.alt2, select=-c(med_fam_inc, home_value, pct_lt15k, num_prom, num_prom_log, avg_gift, last_gift_log, time_lag, avg_gift, avg_fam_inc_log, zipconvert1))

Doesn’t hurt, lets chop it up a 3rd time.

set.seed(12345)
train_index3 <- sample(1:nrow(fundraising.alt2.clean), nrow(fundraising.alt2.clean)*.8)

train3 = fundraising.alt2.clean[train_index3, ]
test3 = fundraising.alt2.clean[-train_index3, ]

nrow(train3)/nrow(fundraising.alt2.clean)
## [1] 0.8
nrow(test3)/nrow(fundraising.alt2.clean)
## [1] 0.2

Ok one more time for good measure, lets see how some of the models perform using all remaining variables.

train_control2 = trainControl(method='repeatedcv', number=10,repeats=3)
lda.fit.try4=train(target~ ., method = "lda", data = train3, trControl = train_control2)

pred.lda.try4=predict(lda.fit.try4,test3)
confusionMatrix(pred.lda.try3,test3$target)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Donor No Donor
##   Donor      169      154
##   No Donor   121      156
##                                           
##                Accuracy : 0.5417          
##                  95% CI : (0.5008, 0.5821)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : 0.11804         
##                                           
##                   Kappa : 0.0857          
##                                           
##  Mcnemar's Test P-Value : 0.05365         
##                                           
##             Sensitivity : 0.5828          
##             Specificity : 0.5032          
##          Pos Pred Value : 0.5232          
##          Neg Pred Value : 0.5632          
##              Prevalence : 0.4833          
##          Detection Rate : 0.2817          
##    Detection Prevalence : 0.5383          
##       Balanced Accuracy : 0.5430          
##                                           
##        'Positive' Class : Donor           
## 

Third LDA results: Again, not as good as our first pass. I’ve tried a lot of different things to beat our original LDA figures, and none have helped. I think we have our winner.

Now lets get the prediction test set and get those predictions:

future_fundraising = read_rds("C:/Users/Owner/Documents/School stuff/intro to business analysis/R stuff/Data/future_fundraising.rds")

Now we need to fit the model and write the output to a csv on predicting donor or no donor.

First, we need to do the same preparations of the data.

future.clean = future_fundraising

avg_fam_inc_log = log(future.clean$avg_fam_inc + 1)
avg_gift_log = log(future.clean$avg_gift)
home_value_log = log(future.clean$home_value + 1)
largest_gift_log = log(future.clean$largest_gift)
last_gift_log = log(future.clean$last_gift + 1)
lifetime_gifts_log = log(future.clean$lifetime_gifts)
med_fam_inc_log = log(future.clean$med_fam_inc + 1)
num_prom_log = log(future.clean$num_prom)
pct_lt15k_log = log(future.clean$pct_lt15k + 1)
time_lag_log = log(future.clean$time_lag + 1)

future.clean$avg_fam_inc_log = avg_fam_inc_log
future.clean$avg_gift_log = avg_gift_log
future.clean$home_value_log = home_value_log
future.clean$largest_gift_log = largest_gift_log
future.clean$last_gift_log = last_gift_log
future.clean$lifetime_gifts_log = lifetime_gifts_log
future.clean$med_fam_inc_log = med_fam_inc_log
future.clean$num_prom_log = num_prom_log
future.clean$pct_lt15k_log = pct_lt15k_log
future.clean$time_lag_log = time_lag_log

future.clean=subset(future.clean, select=-c(avg_fam_inc, avg_gift, home_value, largest_gift, last_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag))

In retrospect, if I did it all over again, I wouldn’t transform the variables within my dataset, but rather transform them within the code fitting the model. Live and learn.

Now lets fit the model.

#pred.lda.final = predict(lda.fit.try2, future_fundraising)

#write.table(pred.lda.final, file = "predictions_lda_final.csv", col.names = c("value"), row.names = FALSE)