library(ISLR2)
library(MASS)
library(tidyverse)
library(caret)
library(readr)
Background, directly from the assignment pdf:
A national veterans’ organization wishes to develop a predictive model to improve the cost effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct-mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is 13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs 0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling was used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.
Lets understand the problem a little:
13*.051
## [1] 0.663
Each mailer costs an average of 68 cents and gains an average of 66.3 cents. I don’t know how they’ve sustained their fundraiser so far! The purpose of this exercise is to help the national veteran’s organization better predict who will and who will not be a donor; this way they can create a better ratio of donations per mailer, and reach a sustainable operation.
First, lets read the dataset into R that we’ll use for training and testing the model. Later we’ll read in the dataset used for prediction.
fundraising.orig = read_rds("C:/Users/Owner/Documents/School stuff/intro to business analysis/R stuff/Data/fundraising.rds")
And now lets explore it a little.
head(fundraising.orig, 10)
## # A tibble: 10 × 21
## zipcon…¹ zipco…² zipco…³ zipco…⁴ homeo…⁵ num_c…⁶ income female wealth home_…⁷
## <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 Yes No No No Yes 1 1 No 7 698
## 2 No No No Yes No 2 5 Yes 8 828
## 3 No No No Yes Yes 1 3 No 4 1471
## 4 No Yes No No Yes 1 4 No 8 547
## 5 No Yes No No Yes 1 4 Yes 8 482
## 6 No No No Yes Yes 1 4 Yes 8 857
## 7 No No Yes No Yes 1 4 No 5 505
## 8 Yes No No No Yes 1 4 Yes 8 1438
## 9 No No No Yes Yes 1 4 Yes 8 1316
## 10 Yes No No No Yes 1 1 Yes 5 428
## # … with 11 more variables: med_fam_inc <dbl>, avg_fam_inc <dbl>,
## # pct_lt15k <dbl>, num_prom <dbl>, lifetime_gifts <dbl>, largest_gift <dbl>,
## # last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## # target <fct>, and abbreviated variable names ¹zipconvert2, ²zipconvert3,
## # ³zipconvert4, ⁴zipconvert5, ⁵homeowner, ⁶num_child, ⁷home_value
These variables are described in the PDF as well. An interesting
observation: In the top 10 rows, everyone has an average gift value
(avg_gift), regardless if they are a donor or No Donor for
target. This may be worth exploring further in the future.
For now, I’ll assume they donated at some point, but yet were a “No
Donor” on this latest mailer. If my assumption is correct, does it
factor in 0s when there is no gift?
summary(fundraising.orig)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Donor :1499
## 1st Qu.: 6.333 No Donor:1501
## Median : 9.000
## Mean : 10.669
## 3rd Qu.: 12.800
## Max. :122.167
Of note: Our PDF says that there is 22 variables, and describes 22
variables. Yet here we only have 21. zipconvert1 appears to
be the missing variable. Assuming the dataset zipcodes is clean, all of
the zipconverts should add up to the same value of “yes” and “no”, and
all the “yes” across all 5 variables should also equal this value.
Assuming they all add up, we can use this information to create our
missing variable.
2352+648
## [1] 3000
551+2449
## [1] 3000
2357+643
## [1] 3000
1846+1154
## [1] 3000
2312+688
## [1] 3000
Good news! Looks like our method works. Lets see how many “Yes” there should be in the last zip variable.
3000-(648+551+643+1154)
## [1] 4
Only 4 people with Yes in zipconvert1; Either they are less charitable in New York than I expected, or there is something wrong with our methodology or data. With only 4 values, it will unlikely be useful to include it. However, lets remake our dataset to include and we can test it both ways.
new.var = ifelse(fundraising.orig$zipconvert2 == "No" & fundraising.orig$zipconvert3 == "No" & fundraising.orig$zipconvert4 == "No" & fundraising.orig$zipconvert5 == "No", "Yes", "No")
table(new.var)
## new.var
## No Yes
## 2996 4
Now lets slot this new variable into a modified table
fundraising.alt1=fundraising.orig
fundraising.alt1$zipconvert1 = new.var
fundraising.alt1[fundraising.alt1$zipconvert1 == "Yes", ]
## # A tibble: 4 × 22
## zipconv…¹ zipco…² zipco…³ zipco…⁴ homeo…⁵ num_c…⁶ income female wealth home_…⁷
## <fct> <fct> <fct> <fct> <fct> <dbl> <dbl> <fct> <dbl> <dbl>
## 1 No No No No No 1 1 Yes 8 564
## 2 No No No No No 1 2 No 8 4607
## 3 No No No No Yes 1 4 No 2 1269
## 4 No No No No Yes 1 4 No 1 1273
## # … with 12 more variables: med_fam_inc <dbl>, avg_fam_inc <dbl>,
## # pct_lt15k <dbl>, num_prom <dbl>, lifetime_gifts <dbl>, largest_gift <dbl>,
## # last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## # target <fct>, zipconvert1 <chr>, and abbreviated variable names
## # ¹zipconvert2, ²zipconvert3, ³zipconvert4, ⁴zipconvert5, ⁵homeowner,
## # ⁶num_child, ⁷home_value
glimpse(fundraising.alt1)
## Rows: 3,000
## Columns: 22
## $ zipconvert2 <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Yes…
## $ zipconvert3 <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No, …
## $ zipconvert4 <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No, …
## $ zipconvert5 <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, No…
## $ homeowner <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y…
## $ num_child <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ income <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2,…
## $ female <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes,…
## $ wealth <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7,…
## $ home_value <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 42…
## $ med_fam_inc <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203, …
## $ avg_fam_inc <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271, …
## $ pct_lt15k <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13,…
## $ num_prom <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 27…
## $ lifetime_gifts <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 40…
## $ largest_gift <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20, …
## $ last_gift <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7, …
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 37…
## $ time_lag <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3, …
## $ avg_gift <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000, …
## $ target <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, Do…
## $ zipconvert1 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "N…
Now lets look for nulls. Scanning the summary function above, I do not see any nulls.
Before we split our dataset into training and test, lets examine the predictors more closely to evaluate their association with the response variable, determine which are good candidate predictors, and look for collinear relationships. Note: This is “Step 2: Model Building” from the assignment; however, I’m doing this before Step 1: Partitioning. When looking at the relationship between the variables, I do not seem the harm in looking at them all together. This warrants more investigation, and if Step 1 must occur first, we’ll have to circle back and do this step again.
I’m going to skip pairs for now due to the large number of categorical variables; we can always factor them and check them later; in the mean time, lets check some graphs for variables we may be interested in:
par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$income, ylab = "income")
plot(fundraising.alt1$income, fundraising.alt1$target, xlab = "income")
plot(factor(fundraising.alt1$target), fundraising.alt1$home_value, ylab = "home_value")
plot(fundraising.alt1$home_value, fundraising.alt1$target, xlab = "home_value")
home_value appears to have very little differentiation
between the values of our target variable.
income appears to have some differentiation between the
values of our target variable.
par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$wealth, ylab = "wealth")
plot(fundraising.alt1$wealth, fundraising.alt1$target, xlab = "wealth")
plot(factor(fundraising.alt1$target), fundraising.alt1$num_prom, ylab = "num_prom")
plot(fundraising.alt1$num_prom, fundraising.alt1$target, xlab = "num_prom")
par(mfrow = c(2,2))
plot(factor(fundraising.alt1$target), fundraising.alt1$avg_gift, ylab = "avg_gift")
plot(fundraising.alt1$avg_gift, fundraising.alt1$target, xlab = "avg_gift")
plot(factor(fundraising.alt1$target), fundraising.alt1$lifetime_gifts, ylab = "lifetime_gifts")
plot(fundraising.alt1$lifetime_gifts, fundraising.alt1$target, xlab = "lifetime_gifts")
In the caret videos, the variables were checked for skewness; while I don’t recall doing that in other assignments, it makes sense I’ll try to replicate here:
fundraising.alt1[,-2] %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density()
Some of these variables should likely use the log to make the distribution more normal.
fundraising.alt1[,-2] %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(log(value))) +
facet_wrap(~ key, scales = "free") +
geom_density()
I plan to create new variables for avg_fam_inc,
avg_gift, home_value,
largest_gift, last_gift,
lifetime_gifts, med_fam_inc,
num_prom, pct_It15k, and
time_lag. NOTE: Later on, when doing correlation analysis,
I noticed a lot of NaN in the correlations from my log transformed
variables. This is likely due to 0s in variable getting log transformed.
After researching this problem online, I found a common solution is to
just add 1. This is the solution I will use, and only on variables I was
getting NaN.
avg_fam_inc_log = log(fundraising.alt1$avg_fam_inc + 1)
avg_gift_log = log(fundraising.alt1$avg_gift)
home_value_log = log(fundraising.alt1$home_value + 1)
largest_gift_log = log(fundraising.alt1$largest_gift)
last_gift_log = log(fundraising.alt1$last_gift + 1)
lifetime_gifts_log = log(fundraising.alt1$lifetime_gifts)
med_fam_inc_log = log(fundraising.alt1$med_fam_inc + 1)
num_prom_log = log(fundraising.alt1$num_prom)
pct_lt15k_log = log(fundraising.alt1$pct_lt15k + 1)
time_lag_log = log(fundraising.alt1$time_lag + 1)
fundraising.alt1$avg_fam_inc_log = avg_fam_inc_log
fundraising.alt1$avg_gift_log = avg_gift_log
fundraising.alt1$home_value_log = home_value_log
fundraising.alt1$largest_gift_log = largest_gift_log
fundraising.alt1$last_gift_log = last_gift_log
fundraising.alt1$lifetime_gifts_log = lifetime_gifts_log
fundraising.alt1$med_fam_inc_log = med_fam_inc_log
fundraising.alt1$num_prom_log = num_prom_log
fundraising.alt1$pct_lt15k_log = pct_lt15k_log
fundraising.alt1$time_lag_log = time_lag_log
Its time to start modifying existing variables to scale them and to remove redundant and unnecessary variables from creating the log transformations. So first, lets build a clean dataset.
fundraising.alt1.clean = fundraising.alt1
Gift values and income level variables will have a larger impact on our model than some of the other variables due to scaling. For now, we’ll let the log transformation be scaling, and we may want to revisit other scaling techniques.
In most cases, we don’t need the log and non-log versions of the same variable. Note from future self: rerunning all the models later in this document including the non-transformed variables when there isn’t collinearity did not enhance the accuracy.
fundraising.alt1.clean=subset(fundraising.alt1.clean, select=-c(avg_fam_inc, avg_gift, home_value, largest_gift, last_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag))
How is our dataset looking now?
str(fundraising.alt1.clean)
## tibble [3,000 × 22] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
## $ zipconvert1 : chr [1:3000] "No" "No" "No" "No" ...
## $ avg_fam_inc_log : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
## $ avg_gift_log : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
## $ home_value_log : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
## $ largest_gift_log : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
## $ last_gift_log : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
## $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
## $ med_fam_inc_log : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
## $ num_prom_log : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
## $ pct_lt15k_log : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
## $ time_lag_log : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...
It looks like our new variable is character, so lets factor that one. Also, wealth looks to be categorical and possibly even income. These would be ordinal as the write up indicates it goes from low to high. For now, I’ll leave them as - is, but we may need to circle back and make them factors.
fundraising.alt1$zipconvert1 = as_factor(fundraising.alt1$zipconvert1)
Lets verify the income variable is usable from the description above.
fundraising.alt1 %>% ggplot(aes(x=income)) + geom_histogram()
str(fundraising.alt1.clean)
## tibble [3,000 × 22] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
## $ zipconvert1 : chr [1:3000] "No" "No" "No" "No" ...
## $ avg_fam_inc_log : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
## $ avg_gift_log : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
## $ home_value_log : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
## $ largest_gift_log : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
## $ last_gift_log : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
## $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
## $ med_fam_inc_log : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
## $ num_prom_log : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
## $ pct_lt15k_log : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
## $ time_lag_log : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...
temp = fundraising.alt1.clean[, c(6,7,9,10,13,14,15,16,17,18,19,20,21, 22)]
correlation = cor(temp)
round(correlation, 2)
## num_child income wealth months_since_donate avg_fam_inc_log
## num_child 1.00 0.09 0.06 -0.01 0.04
## income 0.09 1.00 0.21 0.08 0.22
## wealth 0.06 0.21 1.00 0.03 0.27
## months_since_donate -0.01 0.08 0.03 1.00 0.02
## avg_fam_inc_log 0.04 0.22 0.27 0.02 1.00
## avg_gift_log -0.02 0.14 0.10 0.24 0.07
## home_value_log 0.00 0.25 0.24 0.01 0.80
## largest_gift_log -0.03 0.12 0.01 0.17 0.05
## last_gift_log -0.02 0.13 0.07 0.29 0.07
## lifetime_gifts_log -0.08 -0.03 -0.40 -0.23 -0.03
## med_fam_inc_log 0.04 0.24 0.29 0.02 0.99
## num_prom_log -0.08 -0.07 -0.43 -0.26 -0.04
## pct_lt15k_log -0.05 -0.31 -0.36 -0.02 -0.33
## time_lag_log -0.01 0.00 -0.06 0.06 0.03
## avg_gift_log home_value_log largest_gift_log last_gift_log
## num_child -0.02 0.00 -0.03 -0.02
## income 0.14 0.25 0.12 0.13
## wealth 0.10 0.24 0.01 0.07
## months_since_donate 0.24 0.01 0.17 0.29
## avg_fam_inc_log 0.07 0.80 0.05 0.07
## avg_gift_log 1.00 0.12 0.87 0.84
## home_value_log 0.12 1.00 0.09 0.12
## largest_gift_log 0.87 0.09 1.00 0.82
## last_gift_log 0.84 0.12 0.82 1.00
## lifetime_gifts_log 0.09 -0.02 0.30 0.12
## med_fam_inc_log 0.08 0.79 0.05 0.08
## num_prom_log -0.24 -0.05 -0.01 -0.14
## pct_lt15k_log -0.09 -0.30 -0.08 -0.09
## time_lag_log 0.11 0.02 0.10 0.12
## lifetime_gifts_log med_fam_inc_log num_prom_log
## num_child -0.08 0.04 -0.08
## income -0.03 0.24 -0.07
## wealth -0.40 0.29 -0.43
## months_since_donate -0.23 0.02 -0.26
## avg_fam_inc_log -0.03 0.99 -0.04
## avg_gift_log 0.09 0.08 -0.24
## home_value_log -0.02 0.79 -0.05
## largest_gift_log 0.30 0.05 -0.01
## last_gift_log 0.12 0.08 -0.14
## lifetime_gifts_log 1.00 -0.03 0.85
## med_fam_inc_log -0.03 1.00 -0.04
## num_prom_log 0.85 -0.04 1.00
## pct_lt15k_log 0.05 -0.37 0.05
## time_lag_log 0.03 0.02 0.12
## pct_lt15k_log time_lag_log
## num_child -0.05 -0.01
## income -0.31 0.00
## wealth -0.36 -0.06
## months_since_donate -0.02 0.06
## avg_fam_inc_log -0.33 0.03
## avg_gift_log -0.09 0.11
## home_value_log -0.30 0.02
## largest_gift_log -0.08 0.10
## last_gift_log -0.09 0.12
## lifetime_gifts_log 0.05 0.03
## med_fam_inc_log -0.37 0.02
## num_prom_log 0.05 0.12
## pct_lt15k_log 1.00 0.00
## time_lag_log 0.00 1.00
Here are some variables that have correlation above absolute value of
.8:
avg_gift_log and largest_gift_log at .87
avg_gift_log and last_gift_log at .84
largest_gift_log and last_gift_log at
.82
num_prom_log and lifetime_gifts_log at
.85
avg_fam_inc_log and med_fam_inc_log at
.99
home_value_log and med_fam_inc_log at .79
(close enough)
home_value_log and avg_fam_inc_log at .80
set.seed(12345)
train_index <- sample(1:nrow(fundraising.alt1.clean), nrow(fundraising.alt1.clean)*.8)
train = fundraising.alt1.clean[train_index, ]
test = fundraising.alt1.clean[-train_index, ]
nrow(train)/nrow(fundraising.alt1.clean)
## [1] 0.8
nrow(test)/nrow(fundraising.alt1.clean)
## [1] 0.2
First attempt we’ll use cross validation with caret package. repeatedcv seems to be the most popular method. (Note from future self: Caret made model fitting and evaluation standardized and easy and this was employed through the remainder of the document)
train_control = trainControl(method='repeatedcv', number=10,repeats=3)
I saw what appeared to be a simple way for variable selection done online - run a random forest model and allow it to do best variable selection. I’ll do that. Alternatively, I could have employed methods from chapter 5, and may circle back and try to get better results that way pending time.
rf.fit = train(target~., data=fundraising.alt1.clean,trControl=train_control,method='rf')
rf.fit$besttune
## NULL
varImp(rf.fit)
## rf variable importance
##
## only 20 most important variables shown (out of 21)
##
## Overall
## avg_gift_log 100.00
## home_value_log 99.40
## med_fam_inc_log 96.04
## lifetime_gifts_log 94.81
## avg_fam_inc_log 94.34
## num_prom_log 89.93
## pct_lt15k_log 83.00
## time_lag_log 73.88
## months_since_donate 71.93
## largest_gift_log 66.31
## last_gift_log 65.97
## income 51.66
## wealth 41.80
## femaleNo 17.22
## homeownerNo 14.47
## zipconvert5Yes 13.34
## zipconvert2Yes 12.05
## zipconvert4Yes 12.04
## zipconvert3No 11.12
## num_child 10.56
plot(varImp(rf.fit))
pred.rf<-predict(rf.fit,test)
confusionMatrix(pred.rf,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 290 0
## No Donor 0 310
##
## Accuracy : 1
## 95% CI : (0.9939, 1)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.4833
## Detection Rate : 0.4833
## Detection Prevalence : 0.4833
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : Donor
##
Here are the variables we’ll try to use at first (we should use another method to verify variable use). We came to these by removing collinear relationships and poor performers.
avg_gift_log, home_value_log, lifetime_gifts_log, pct_lt15k_log, time_lag_log, months_since_donate, income, wealth, female, homeowner
For simplicity, lets do all this in caret.
lda.fit.try2=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "lda", data = train, trControl = train_control)
pred.lda.try2=predict(lda.fit.try2,test)
confusionMatrix(pred.lda.try2,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 177 160
## No Donor 113 150
##
## Accuracy : 0.545
## 95% CI : (0.5042, 0.5854)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.088750
##
## Kappa : 0.0937
##
## Mcnemar's Test P-Value : 0.005369
##
## Sensitivity : 0.6103
## Specificity : 0.4839
## Pos Pred Value : 0.5252
## Neg Pred Value : 0.5703
## Prevalence : 0.4833
## Detection Rate : 0.2950
## Detection Prevalence : 0.5617
## Balanced Accuracy : 0.5471
##
## 'Positive' Class : Donor
##
LDA results: 54.4% test accuracy. Well, its better than 50% at least. Hopefully we’ll find better, but in many of the exercises of class, LDA performed the best.
qda.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "qda", data = train, trControl = train_control)
pred.qda=predict(qda.fit,test)
confusionMatrix(pred.qda,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 182 173
## No Donor 108 137
##
## Accuracy : 0.5317
## 95% CI : (0.4908, 0.5722)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.2438027
##
## Kappa : 0.069
##
## Mcnemar's Test P-Value : 0.0001346
##
## Sensitivity : 0.6276
## Specificity : 0.4419
## Pos Pred Value : 0.5127
## Neg Pred Value : 0.5592
## Prevalence : 0.4833
## Detection Rate : 0.3033
## Detection Prevalence : 0.5917
## Balanced Accuracy : 0.5348
##
## 'Positive' Class : Donor
##
QDA Results: 53.17%. As hypothesized, QDA did not do better. However, of note, it did positively identify more donors. The best bang for our buck is when our model positively identifies donors. As this competition seeks to have the best accuracy, LDA is winning out. However, to see which model results in the best bottom line, we would want to multiply what we gained from the positive donors and subtract out where we sent direct mailers to non-donors.
glm.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "glm", data = train, trControl = train_control, family = "binomial")
pred.glm=predict(glm.fit,test)
confusionMatrix(pred.glm,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 177 160
## No Donor 113 150
##
## Accuracy : 0.545
## 95% CI : (0.5042, 0.5854)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.088750
##
## Kappa : 0.0937
##
## Mcnemar's Test P-Value : 0.005369
##
## Sensitivity : 0.6103
## Specificity : 0.4839
## Pos Pred Value : 0.5252
## Neg Pred Value : 0.5703
## Prevalence : 0.4833
## Detection Rate : 0.2950
## Detection Prevalence : 0.5617
## Balanced Accuracy : 0.5471
##
## 'Positive' Class : Donor
##
GLM Results: GLM had the exact same performance as LDA. 54.5% with the same confusion matrix
knn.fit=train(target~ avg_gift_log + home_value_log + lifetime_gifts_log + pct_lt15k_log + time_lag_log + months_since_donate + income + wealth + female + homeowner, method = "knn", data = train, trControl = train_control, tuneLength=40)
pred.knn=predict(knn.fit,test)
confusionMatrix(pred.knn,test$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 158 161
## No Donor 132 149
##
## Accuracy : 0.5117
## 95% CI : (0.4709, 0.5524)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.6127
##
## Kappa : 0.0254
##
## Mcnemar's Test P-Value : 0.1019
##
## Sensitivity : 0.5448
## Specificity : 0.4806
## Pos Pred Value : 0.4953
## Neg Pred Value : 0.5302
## Prevalence : 0.4833
## Detection Rate : 0.2633
## Detection Prevalence : 0.5317
## Balanced Accuracy : 0.5127
##
## 'Positive' Class : Donor
##
KNN Results: 51.17%. KNN produced considerably worse results. At first I ran KNN with a tunelength of 10; when I got 51.17, I decided to check a greater range of values - note, changing it to 40 did not change the results.
knn.fit
## k-Nearest Neighbors
##
## 2400 samples
## 10 predictor
## 2 classes: 'Donor', 'No Donor'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 2160, 2160, 2160, 2161, 2160, 2160, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.5211118 0.04222767
## 7 0.5188872 0.03764644
## 9 0.5224960 0.04486498
## 11 0.5213855 0.04254076
## 13 0.5187541 0.03719135
## 15 0.5247298 0.04913618
## 17 0.5344515 0.06852477
## 19 0.5284717 0.05657842
## 21 0.5286158 0.05682272
## 23 0.5233403 0.04626945
## 25 0.5232026 0.04588015
## 27 0.5243079 0.04804429
## 29 0.5240371 0.04749713
## 31 0.5262582 0.05177853
## 33 0.5234798 0.04620541
## 35 0.5190319 0.03730082
## 37 0.5191720 0.03757268
## 39 0.5195880 0.03845239
## 41 0.5148664 0.02906859
## 43 0.5137541 0.02685167
## 45 0.5130649 0.02538581
## 47 0.5145920 0.02832949
## 49 0.5159821 0.03110561
## 51 0.5154283 0.02997422
## 53 0.5140405 0.02721635
## 55 0.5195932 0.03831245
## 57 0.5198710 0.03887911
## 59 0.5219589 0.04300056
## 61 0.5247402 0.04855957
## 63 0.5240446 0.04711425
## 65 0.5243241 0.04771024
## 67 0.5209885 0.04099105
## 69 0.5236279 0.04624499
## 71 0.5248756 0.04877770
## 73 0.5265411 0.05205397
## 75 0.5265371 0.05193790
## 77 0.5258473 0.05056209
## 79 0.5251494 0.04910220
## 81 0.5275070 0.05378805
## 83 0.5316749 0.06212756
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 17.
Out of curiosity, I checked k in our KNN model. Optimal solution is at 17, showing 53.3%. I don’t know why it produced something better than our results showed, but still even when examining KNN at the optimal value its still coming in a little under LDA and GLM.
plot(knn.fit)
Maybe we can tweak the variables involved and revisit KNN. With K values
these large, we might be introducing a bit of bias for our purposes, so
lets move on.
Final Results: With GLM and LDA being tied, I’ll go with LDA; However, it is worth running everything again to see if I can do better variable selection.
glimpse(fundraising.alt1)
## Rows: 3,000
## Columns: 32
## $ zipconvert2 <fct> Yes, No, No, No, No, No, No, Yes, No, Yes, No, Yes…
## $ zipconvert3 <fct> No, No, No, Yes, Yes, No, No, No, No, No, No, No, …
## $ zipconvert4 <fct> No, No, No, No, No, No, Yes, No, No, No, Yes, No, …
## $ zipconvert5 <fct> No, Yes, Yes, No, No, Yes, No, No, Yes, No, No, No…
## $ homeowner <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Y…
## $ num_child <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ income <dbl> 1, 5, 3, 4, 4, 4, 4, 4, 4, 1, 4, 5, 2, 3, 4, 4, 2,…
## $ female <fct> No, Yes, No, No, Yes, Yes, No, Yes, Yes, Yes, Yes,…
## $ wealth <dbl> 7, 8, 4, 8, 8, 8, 5, 8, 8, 5, 5, 8, 8, 5, 6, 9, 7,…
## $ home_value <dbl> 698, 828, 1471, 547, 482, 857, 505, 1438, 1316, 42…
## $ med_fam_inc <dbl> 422, 358, 484, 386, 242, 450, 333, 458, 541, 203, …
## $ avg_fam_inc <dbl> 463, 376, 546, 432, 275, 498, 388, 533, 575, 271, …
## $ pct_lt15k <dbl> 4, 13, 4, 7, 28, 5, 16, 8, 11, 39, 6, 8, 5, 3, 13,…
## $ num_prom <dbl> 46, 32, 94, 20, 38, 47, 51, 21, 66, 73, 59, 25, 27…
## $ lifetime_gifts <dbl> 94, 30, 177, 23, 73, 139, 63, 26, 108, 161, 84, 40…
## $ largest_gift <dbl> 12, 10, 10, 11, 10, 20, 15, 16, 12, 6, 5, 10, 20, …
## $ last_gift <dbl> 12, 5, 8, 11, 10, 20, 10, 16, 7, 3, 3, 10, 20, 7, …
## $ months_since_donate <dbl> 34, 29, 30, 30, 31, 37, 37, 30, 31, 32, 30, 32, 37…
## $ time_lag <dbl> 6, 7, 3, 6, 3, 3, 8, 6, 1, 7, 12, 2, 7, 1, 10, 3, …
## $ avg_gift <dbl> 9.400000, 4.285714, 7.080000, 7.666667, 7.300000, …
## $ target <fct> Donor, Donor, No Donor, No Donor, Donor, Donor, Do…
## $ zipconvert1 <fct> No, No, No, No, No, No, No, No, No, No, No, No, No…
## $ avg_fam_inc_log <dbl> 6.139885, 5.932245, 6.304449, 6.070738, 5.620401, …
## $ avg_gift_log <dbl> 2.240710, 1.455287, 1.957274, 2.036882, 1.987874, …
## $ home_value_log <dbl> 6.549651, 6.720220, 7.294377, 6.306275, 6.180017, …
## $ largest_gift_log <dbl> 2.484907, 2.302585, 2.302585, 2.397895, 2.302585, …
## $ last_gift_log <dbl> 2.564949, 1.791759, 2.197225, 2.484907, 2.397895, …
## $ lifetime_gifts_log <dbl> 4.543295, 3.401197, 5.176150, 3.135494, 4.290459, …
## $ med_fam_inc_log <dbl> 6.047372, 5.883322, 6.184149, 5.958425, 5.493061, …
## $ num_prom_log <dbl> 3.828641, 3.465736, 4.543295, 2.995732, 3.637586, …
## $ pct_lt15k_log <dbl> 1.6094379, 2.6390573, 1.6094379, 2.0794415, 3.3672…
## $ time_lag_log <dbl> 1.9459101, 2.0794415, 1.3862944, 1.9459101, 1.3862…
Lets try using the full set that also had the log transformation in it with lda.
First to build a new training and test set
set.seed(12345)
train_index2 <- sample(1:nrow(fundraising.alt1), nrow(fundraising.alt1)*.8)
train2 = fundraising.alt1[train_index2, ]
test2 = fundraising.alt1[-train_index2, ]
nrow(train)/nrow(fundraising.alt1)
## [1] 0.8
nrow(test)/nrow(fundraising.alt1)
## [1] 0.2
lda.fit.try3=train(target~ ., method = "lda", data = train2, trControl = train_control)
pred.lda.try3=predict(lda.fit.try3,test2)
confusionMatrix(pred.lda.try3,test2$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 169 154
## No Donor 121 156
##
## Accuracy : 0.5417
## 95% CI : (0.5008, 0.5821)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.11804
##
## Kappa : 0.0857
##
## Mcnemar's Test P-Value : 0.05365
##
## Sensitivity : 0.5828
## Specificity : 0.5032
## Pos Pred Value : 0.5232
## Neg Pred Value : 0.5632
## Prevalence : 0.4833
## Detection Rate : 0.2817
## Detection Prevalence : 0.5383
## Balanced Accuracy : 0.5430
##
## 'Positive' Class : Donor
##
results: Without removing any collinear relationships we get a warning about it, but we get nearly the same results.
Lets remove the potential collinear relationships but leave in all variables otherwise.
I don’t want to corrupt my past work, so I’ll make a new dataset.
fundraising.alt2 = fundraising.alt1
str(fundraising.alt2)
## tibble [3,000 × 32] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 2 1 2 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:3000] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:3000] 1 5 3 4 4 4 4 4 4 1 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 1 2 1 1 1 ...
## $ wealth : num [1:3000] 7 8 4 8 8 8 5 8 8 5 ...
## $ home_value : num [1:3000] 698 828 1471 547 482 ...
## $ med_fam_inc : num [1:3000] 422 358 484 386 242 450 333 458 541 203 ...
## $ avg_fam_inc : num [1:3000] 463 376 546 432 275 498 388 533 575 271 ...
## $ pct_lt15k : num [1:3000] 4 13 4 7 28 5 16 8 11 39 ...
## $ num_prom : num [1:3000] 46 32 94 20 38 47 51 21 66 73 ...
## $ lifetime_gifts : num [1:3000] 94 30 177 23 73 139 63 26 108 161 ...
## $ largest_gift : num [1:3000] 12 10 10 11 10 20 15 16 12 6 ...
## $ last_gift : num [1:3000] 12 5 8 11 10 20 10 16 7 3 ...
## $ months_since_donate: num [1:3000] 34 29 30 30 31 37 37 30 31 32 ...
## $ time_lag : num [1:3000] 6 7 3 6 3 3 8 6 1 7 ...
## $ avg_gift : num [1:3000] 9.4 4.29 7.08 7.67 7.3 ...
## $ target : Factor w/ 2 levels "Donor","No Donor": 1 1 2 2 1 1 1 2 1 1 ...
## $ zipconvert1 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ avg_fam_inc_log : num [1:3000] 6.14 5.93 6.3 6.07 5.62 ...
## $ avg_gift_log : num [1:3000] 2.24 1.46 1.96 2.04 1.99 ...
## $ home_value_log : num [1:3000] 6.55 6.72 7.29 6.31 6.18 ...
## $ largest_gift_log : num [1:3000] 2.48 2.3 2.3 2.4 2.3 ...
## $ last_gift_log : num [1:3000] 2.56 1.79 2.2 2.48 2.4 ...
## $ lifetime_gifts_log : num [1:3000] 4.54 3.4 5.18 3.14 4.29 ...
## $ med_fam_inc_log : num [1:3000] 6.05 5.88 6.18 5.96 5.49 ...
## $ num_prom_log : num [1:3000] 3.83 3.47 4.54 3 3.64 ...
## $ pct_lt15k_log : num [1:3000] 1.61 2.64 1.61 2.08 3.37 ...
## $ time_lag_log : num [1:3000] 1.95 2.08 1.39 1.95 1.39 ...
temp2 = fundraising.alt2[, c(6,7,9,10,11,12,13,14,15,16,17,18,19,20, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32)]
correlation2 = cor(temp2)
round(correlation2, 2)
## num_child income wealth home_value med_fam_inc avg_fam_inc
## num_child 1.00 0.09 0.06 -0.01 0.05 0.05
## income 0.09 1.00 0.21 0.29 0.37 0.38
## wealth 0.06 0.21 1.00 0.26 0.38 0.39
## home_value -0.01 0.29 0.26 1.00 0.74 0.75
## med_fam_inc 0.05 0.37 0.38 0.74 1.00 0.97
## avg_fam_inc 0.05 0.38 0.39 0.75 0.97 1.00
## pct_lt15k -0.03 -0.28 -0.38 -0.40 -0.67 -0.68
## num_prom -0.09 -0.07 -0.41 -0.06 -0.05 -0.06
## lifetime_gifts -0.05 -0.02 -0.23 -0.02 -0.04 -0.04
## largest_gift -0.02 0.03 -0.03 0.06 0.05 0.04
## last_gift -0.01 0.11 0.05 0.16 0.14 0.13
## months_since_donate -0.01 0.08 0.03 0.02 0.03 0.03
## time_lag -0.01 0.00 -0.07 0.00 0.02 0.02
## avg_gift -0.02 0.12 0.09 0.17 0.14 0.13
## avg_fam_inc_log 0.04 0.22 0.27 0.49 0.69 0.73
## avg_gift_log -0.02 0.14 0.10 0.16 0.14 0.14
## home_value_log 0.00 0.25 0.24 0.75 0.63 0.68
## largest_gift_log -0.03 0.12 0.01 0.12 0.12 0.11
## last_gift_log -0.02 0.13 0.07 0.14 0.14 0.13
## lifetime_gifts_log -0.08 -0.03 -0.40 -0.03 -0.03 -0.04
## med_fam_inc_log 0.04 0.24 0.29 0.50 0.74 0.76
## num_prom_log -0.08 -0.07 -0.43 -0.07 -0.06 -0.06
## pct_lt15k_log -0.05 -0.31 -0.36 -0.43 -0.69 -0.70
## time_lag_log -0.01 0.00 -0.06 0.00 0.01 0.02
## pct_lt15k num_prom lifetime_gifts largest_gift last_gift
## num_child -0.03 -0.09 -0.05 -0.02 -0.01
## income -0.28 -0.07 -0.02 0.03 0.11
## wealth -0.38 -0.41 -0.23 -0.03 0.05
## home_value -0.40 -0.06 -0.02 0.06 0.16
## med_fam_inc -0.67 -0.05 -0.04 0.05 0.14
## avg_fam_inc -0.68 -0.06 -0.04 0.04 0.13
## pct_lt15k 1.00 0.04 0.06 -0.01 -0.06
## num_prom 0.04 1.00 0.54 0.11 -0.06
## lifetime_gifts 0.06 0.54 1.00 0.51 0.20
## largest_gift -0.01 0.11 0.51 1.00 0.45
## last_gift -0.06 -0.06 0.20 0.45 1.00
## months_since_donate -0.01 -0.28 -0.14 0.02 0.19
## time_lag -0.02 0.12 0.04 0.04 0.08
## avg_gift -0.06 -0.15 0.18 0.47 0.87
## avg_fam_inc_log -0.43 -0.04 -0.03 0.02 0.07
## avg_gift_log -0.08 -0.19 0.17 0.42 0.75
## home_value_log -0.36 -0.05 -0.02 0.04 0.12
## largest_gift_log -0.07 0.04 0.35 0.59 0.76
## last_gift_log -0.08 -0.10 0.16 0.35 0.85
## lifetime_gifts_log 0.04 0.84 0.66 0.25 0.17
## med_fam_inc_log -0.48 -0.04 -0.03 0.02 0.07
## num_prom_log 0.04 0.96 0.47 0.06 -0.08
## pct_lt15k_log 0.87 0.05 0.05 -0.02 -0.07
## time_lag_log -0.01 0.09 -0.03 0.03 0.09
## months_since_donate time_lag avg_gift avg_fam_inc_log
## num_child -0.01 -0.01 -0.02 0.04
## income 0.08 0.00 0.12 0.22
## wealth 0.03 -0.07 0.09 0.27
## home_value 0.02 0.00 0.17 0.49
## med_fam_inc 0.03 0.02 0.14 0.69
## avg_fam_inc 0.03 0.02 0.13 0.73
## pct_lt15k -0.01 -0.02 -0.06 -0.43
## num_prom -0.28 0.12 -0.15 -0.04
## lifetime_gifts -0.14 0.04 0.18 -0.03
## largest_gift 0.02 0.04 0.47 0.02
## last_gift 0.19 0.08 0.87 0.07
## months_since_donate 1.00 0.02 0.19 0.02
## time_lag 0.02 1.00 0.07 0.03
## avg_gift 0.19 0.07 1.00 0.06
## avg_fam_inc_log 0.02 0.03 0.06 1.00
## avg_gift_log 0.24 0.09 0.87 0.07
## home_value_log 0.01 0.02 0.12 0.80
## largest_gift_log 0.17 0.10 0.76 0.05
## last_gift_log 0.29 0.09 0.74 0.07
## lifetime_gifts_log -0.23 0.08 0.11 -0.03
## med_fam_inc_log 0.02 0.02 0.07 0.99
## num_prom_log -0.26 0.16 -0.20 -0.04
## pct_lt15k_log -0.02 -0.01 -0.07 -0.33
## time_lag_log 0.06 0.89 0.08 0.03
## avg_gift_log home_value_log largest_gift_log last_gift_log
## num_child -0.02 0.00 -0.03 -0.02
## income 0.14 0.25 0.12 0.13
## wealth 0.10 0.24 0.01 0.07
## home_value 0.16 0.75 0.12 0.14
## med_fam_inc 0.14 0.63 0.12 0.14
## avg_fam_inc 0.14 0.68 0.11 0.13
## pct_lt15k -0.08 -0.36 -0.07 -0.08
## num_prom -0.19 -0.05 0.04 -0.10
## lifetime_gifts 0.17 -0.02 0.35 0.16
## largest_gift 0.42 0.04 0.59 0.35
## last_gift 0.75 0.12 0.76 0.85
## months_since_donate 0.24 0.01 0.17 0.29
## time_lag 0.09 0.02 0.10 0.09
## avg_gift 0.87 0.12 0.76 0.74
## avg_fam_inc_log 0.07 0.80 0.05 0.07
## avg_gift_log 1.00 0.12 0.87 0.84
## home_value_log 0.12 1.00 0.09 0.12
## largest_gift_log 0.87 0.09 1.00 0.82
## last_gift_log 0.84 0.12 0.82 1.00
## lifetime_gifts_log 0.09 -0.02 0.30 0.12
## med_fam_inc_log 0.08 0.79 0.05 0.08
## num_prom_log -0.24 -0.05 -0.01 -0.14
## pct_lt15k_log -0.09 -0.30 -0.08 -0.09
## time_lag_log 0.11 0.02 0.10 0.12
## lifetime_gifts_log med_fam_inc_log num_prom_log
## num_child -0.08 0.04 -0.08
## income -0.03 0.24 -0.07
## wealth -0.40 0.29 -0.43
## home_value -0.03 0.50 -0.07
## med_fam_inc -0.03 0.74 -0.06
## avg_fam_inc -0.04 0.76 -0.06
## pct_lt15k 0.04 -0.48 0.04
## num_prom 0.84 -0.04 0.96
## lifetime_gifts 0.66 -0.03 0.47
## largest_gift 0.25 0.02 0.06
## last_gift 0.17 0.07 -0.08
## months_since_donate -0.23 0.02 -0.26
## time_lag 0.08 0.02 0.16
## avg_gift 0.11 0.07 -0.20
## avg_fam_inc_log -0.03 0.99 -0.04
## avg_gift_log 0.09 0.08 -0.24
## home_value_log -0.02 0.79 -0.05
## largest_gift_log 0.30 0.05 -0.01
## last_gift_log 0.12 0.08 -0.14
## lifetime_gifts_log 1.00 -0.03 0.85
## med_fam_inc_log -0.03 1.00 -0.04
## num_prom_log 0.85 -0.04 1.00
## pct_lt15k_log 0.05 -0.37 0.05
## time_lag_log 0.03 0.02 0.12
## pct_lt15k_log time_lag_log
## num_child -0.05 -0.01
## income -0.31 0.00
## wealth -0.36 -0.06
## home_value -0.43 0.00
## med_fam_inc -0.69 0.01
## avg_fam_inc -0.70 0.02
## pct_lt15k 0.87 -0.01
## num_prom 0.05 0.09
## lifetime_gifts 0.05 -0.03
## largest_gift -0.02 0.03
## last_gift -0.07 0.09
## months_since_donate -0.02 0.06
## time_lag -0.01 0.89
## avg_gift -0.07 0.08
## avg_fam_inc_log -0.33 0.03
## avg_gift_log -0.09 0.11
## home_value_log -0.30 0.02
## largest_gift_log -0.08 0.10
## last_gift_log -0.09 0.12
## lifetime_gifts_log 0.05 0.03
## med_fam_inc_log -0.37 0.02
## num_prom_log 0.05 0.12
## pct_lt15k_log 1.00 0.00
## time_lag_log 0.00 1.00
Reexamining collinearity on the full dataset, I’ve decided to remove: Med_fam_inc, home_value, pct_lt15k, num_prom, num_prom_log, avg_gift, last_gift_log, time_lag, avg_gift, avg_fam_inc_log, and largest_gift_log (and that zip1 variable we made that doesn’t seem to do anything)
fundraising.alt2.clean=subset(fundraising.alt2, select=-c(med_fam_inc, home_value, pct_lt15k, num_prom, num_prom_log, avg_gift, last_gift_log, time_lag, avg_gift, avg_fam_inc_log, zipconvert1))
Doesn’t hurt, lets chop it up a 3rd time.
set.seed(12345)
train_index3 <- sample(1:nrow(fundraising.alt2.clean), nrow(fundraising.alt2.clean)*.8)
train3 = fundraising.alt2.clean[train_index3, ]
test3 = fundraising.alt2.clean[-train_index3, ]
nrow(train3)/nrow(fundraising.alt2.clean)
## [1] 0.8
nrow(test3)/nrow(fundraising.alt2.clean)
## [1] 0.2
Ok one more time for good measure, lets see how some of the models perform using all remaining variables.
train_control2 = trainControl(method='repeatedcv', number=10,repeats=3)
lda.fit.try4=train(target~ ., method = "lda", data = train3, trControl = train_control2)
pred.lda.try4=predict(lda.fit.try4,test3)
confusionMatrix(pred.lda.try3,test3$target)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Donor No Donor
## Donor 169 154
## No Donor 121 156
##
## Accuracy : 0.5417
## 95% CI : (0.5008, 0.5821)
## No Information Rate : 0.5167
## P-Value [Acc > NIR] : 0.11804
##
## Kappa : 0.0857
##
## Mcnemar's Test P-Value : 0.05365
##
## Sensitivity : 0.5828
## Specificity : 0.5032
## Pos Pred Value : 0.5232
## Neg Pred Value : 0.5632
## Prevalence : 0.4833
## Detection Rate : 0.2817
## Detection Prevalence : 0.5383
## Balanced Accuracy : 0.5430
##
## 'Positive' Class : Donor
##
Third LDA results: Again, not as good as our first pass. I’ve tried a lot of different things to beat our original LDA figures, and none have helped. I think we have our winner.
Now lets get the prediction test set and get those predictions:
future_fundraising = read_rds("C:/Users/Owner/Documents/School stuff/intro to business analysis/R stuff/Data/future_fundraising.rds")
Now we need to fit the model and write the output to a csv on predicting donor or no donor.
First, we need to do the same preparations of the data.
future.clean = future_fundraising
avg_fam_inc_log = log(future.clean$avg_fam_inc + 1)
avg_gift_log = log(future.clean$avg_gift)
home_value_log = log(future.clean$home_value + 1)
largest_gift_log = log(future.clean$largest_gift)
last_gift_log = log(future.clean$last_gift + 1)
lifetime_gifts_log = log(future.clean$lifetime_gifts)
med_fam_inc_log = log(future.clean$med_fam_inc + 1)
num_prom_log = log(future.clean$num_prom)
pct_lt15k_log = log(future.clean$pct_lt15k + 1)
time_lag_log = log(future.clean$time_lag + 1)
future.clean$avg_fam_inc_log = avg_fam_inc_log
future.clean$avg_gift_log = avg_gift_log
future.clean$home_value_log = home_value_log
future.clean$largest_gift_log = largest_gift_log
future.clean$last_gift_log = last_gift_log
future.clean$lifetime_gifts_log = lifetime_gifts_log
future.clean$med_fam_inc_log = med_fam_inc_log
future.clean$num_prom_log = num_prom_log
future.clean$pct_lt15k_log = pct_lt15k_log
future.clean$time_lag_log = time_lag_log
future.clean=subset(future.clean, select=-c(avg_fam_inc, avg_gift, home_value, largest_gift, last_gift, lifetime_gifts, med_fam_inc, num_prom, pct_lt15k, time_lag))
In retrospect, if I did it all over again, I wouldn’t transform the variables within my dataset, but rather transform them within the code fitting the model. Live and learn.
Now lets fit the model.
#pred.lda.final = predict(lda.fit.try2, future_fundraising)
#write.table(pred.lda.final, file = "predictions_lda_final.csv", col.names = c("value"), row.names = FALSE)