library(readr)
library(ISLR)
library(MASS)
library(class)
library(dplyr)
library(caret)
library(ggplot2)
library(PerformanceAnalytics)
library(regclass)
fundraising <- readRDS("C:/Users/ocanc/Downloads/fundraising (4).rds")
future_fundraising <- readRDS("C:/Users/ocanc/Downloads/future_fundraising (5).rds")
Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
set.seed(12345)
# target is recoded from a two-level factor to numeric 1/2 for the models below
fundraising$target <- as.numeric(fundraising$target)
# 80% training / 20% validation split
sample <- createDataPartition(y = fundraising$target, p = .80, list = FALSE)
train <- fundraising[sample, ]
test <- fundraising[-sample, ]
# repeated 10-fold cross-validation control (3 repeats), for use with caret::train
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
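As a hedged sketch of the cross-validation alternative (not something run for the results below), the train_control object could be passed to caret::train. caret expects a factor outcome for classification, so target is converted back to a factor in a copy of the training data; the two predictors named here are just an illustrative choice.
train_cv <- train
train_cv$target <- factor(train_cv$target)
cv_fit <- train(target ~ largest_gift + months_since_donate, data = train_cv,
                method = "glm", trControl = train_control)
cv_fit$results  # mean Accuracy and Kappa over the 10 folds x 3 repeats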
Step 2: Model Building. Follow these steps to build, evaluate, and choose a model.
str(test)
## tibble [600 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 1 2 2 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 1 2 1 1 1 1 2 1 ...
## $ num_child : num [1:600] 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num [1:600] 4 4 3 1 4 7 4 4 5 2 ...
## $ female : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 2 1 1 2 1 ...
## $ wealth : num [1:600] 8 8 8 8 5 8 1 3 8 0 ...
## $ home_value : num [1:600] 482 411 296 1000 971 ...
## $ med_fam_inc : num [1:600] 242 236 280 468 347 521 250 347 464 239 ...
## $ avg_fam_inc : num [1:600] 275 273 315 506 375 576 276 354 506 288 ...
## $ pct_lt15k : num [1:600] 28 21 17 6 12 1 32 15 6 36 ...
## $ num_prom : num [1:600] 38 15 64 28 50 30 55 76 43 69 ...
## $ lifetime_gifts : num [1:600] 73 15 52 39 84 66 91 119 119 304 ...
## $ largest_gift : num [1:600] 10 15 5 10 10 12 10 15 30 50 ...
## $ last_gift : num [1:600] 10 15 2 5 7 12 5 10 22 50 ...
## $ months_since_donate: num [1:600] 31 35 20 29 30 28 28 31 23 37 ...
## $ time_lag : num [1:600] 3 5 7 3 6 4 4 10 1 4 ...
## $ avg_gift : num [1:600] 7.3 15 2.26 6.5 6.46 ...
## $ target : num [1:600] 1 1 1 2 2 1 2 1 1 2 ...
First we look at the structure of the data to identify the available predictors.
str(train)
## tibble [2,400 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 2 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 2 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 1 1 2 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:2400] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:2400] 1 5 3 4 4 4 4 4 1 4 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 2 1 1 1 1 ...
## $ wealth : num [1:2400] 7 8 4 8 8 5 8 8 5 5 ...
## $ home_value : num [1:2400] 698 828 1471 547 857 ...
## $ med_fam_inc : num [1:2400] 422 358 484 386 450 333 458 541 203 434 ...
## $ avg_fam_inc : num [1:2400] 463 376 546 432 498 388 533 575 271 472 ...
## $ pct_lt15k : num [1:2400] 4 13 4 7 5 16 8 11 39 6 ...
## $ num_prom : num [1:2400] 46 32 94 20 47 51 21 66 73 59 ...
## $ lifetime_gifts : num [1:2400] 94 30 177 23 139 63 26 108 161 84 ...
## $ largest_gift : num [1:2400] 12 10 10 11 20 15 16 12 6 5 ...
## $ last_gift : num [1:2400] 12 5 8 11 20 10 16 7 3 3 ...
## $ months_since_donate: num [1:2400] 34 29 30 30 37 37 30 31 32 30 ...
## $ time_lag : num [1:2400] 6 7 3 6 3 8 6 1 7 12 ...
## $ avg_gift : num [1:2400] 9.4 4.29 7.08 7.67 10.69 ...
## $ target : num [1:2400] 1 1 2 2 1 1 2 1 1 2 ...
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Min. :1.0
## 1st Qu.: 6.333 1st Qu.:1.0
## Median : 9.000 Median :2.0
## Mean : 10.669 Mean :1.5
## 3rd Qu.: 12.800 3rd Qu.:2.0
## Max. :122.167 Max. :2.0
plot(train)
I then decided to look at the correlations between the numeric variables, along with their scatterplots.
# numeric columns (plus the 1/2-coded target) used for the correlation matrix
testcor <- test[, c(6, 7, 9:21)]
head(testcor, 10)
## # A tibble: 10 x 15
## num_child income wealth home_value med_fam_inc avg_fam_inc pct_lt15k num_prom
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 8 482 242 275 28 38
## 2 1 4 8 411 236 273 21 15
## 3 1 3 8 296 280 315 17 64
## 4 1 1 8 1000 468 506 6 28
## 5 1 4 5 971 347 375 12 50
## 6 1 7 8 1046 521 576 1 30
## 7 1 4 1 1366 250 276 32 55
## 8 1 4 3 806 347 354 15 76
## 9 1 5 8 2194 464 506 6 43
## 10 1 2 0 913 239 288 36 69
## # ... with 7 more variables: lifetime_gifts <dbl>, largest_gift <dbl>,
## # last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## # target <dbl>
cormat <- cor(testcor)
round(cormat, 5)
## num_child income wealth home_value med_fam_inc
## num_child 1.00000 0.12918 -0.00552 0.00062 0.07040
## income 0.12918 1.00000 0.17976 0.34294 0.37025
## wealth -0.00552 0.17976 1.00000 0.23769 0.33829
## home_value 0.00062 0.34294 0.23769 1.00000 0.73582
## med_fam_inc 0.07040 0.37025 0.33829 0.73582 1.00000
## avg_fam_inc 0.08294 0.39449 0.35116 0.74283 0.97034
## pct_lt15k -0.03584 -0.26513 -0.34150 -0.39312 -0.66540
## num_prom -0.05622 -0.11424 -0.39654 -0.05858 -0.02204
## lifetime_gifts -0.07707 -0.02510 -0.26851 0.02508 0.02700
## largest_gift -0.05637 0.14504 0.04088 0.18475 0.16293
## last_gift -0.03962 0.14807 0.07428 0.22678 0.19614
## months_since_donate 0.01983 0.11951 -0.02478 0.09585 0.06328
## time_lag -0.05725 -0.00578 -0.01106 0.02378 0.04536
## avg_gift -0.05295 0.14170 0.08288 0.22074 0.17177
## target 0.00839 -0.05853 -0.02543 -0.03787 -0.03126
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child 0.08294 -0.03584 -0.05622 -0.07707 -0.05637
## income 0.39449 -0.26513 -0.11424 -0.02510 0.14504
## wealth 0.35116 -0.34150 -0.39654 -0.26851 0.04088
## home_value 0.74283 -0.39312 -0.05858 0.02508 0.18475
## med_fam_inc 0.97034 -0.66540 -0.02204 0.02700 0.16293
## avg_fam_inc 1.00000 -0.70938 -0.01771 0.02425 0.15295
## pct_lt15k -0.70938 1.00000 0.02113 0.04221 -0.03750
## num_prom -0.01771 0.02113 1.00000 0.70507 0.03718
## lifetime_gifts 0.02425 0.04221 0.70507 1.00000 0.46329
## largest_gift 0.15295 -0.03750 0.03718 0.46329 1.00000
## last_gift 0.18316 -0.05796 -0.04413 0.34876 0.93734
## months_since_donate 0.05889 -0.03965 -0.24384 -0.19860 0.02917
## time_lag 0.05050 -0.06480 0.10717 0.00973 0.00228
## avg_gift 0.16057 -0.03702 -0.10143 0.34600 0.91383
## target -0.02833 0.04105 -0.07790 -0.07646 0.02210
## last_gift months_since_donate time_lag avg_gift target
## num_child -0.03962 0.01983 -0.05725 -0.05295 0.00839
## income 0.14807 0.11951 -0.00578 0.14170 -0.05853
## wealth 0.07428 -0.02478 -0.01106 0.08288 -0.02543
## home_value 0.22678 0.09585 0.02378 0.22074 -0.03787
## med_fam_inc 0.19614 0.06328 0.04536 0.17177 -0.03126
## avg_fam_inc 0.18316 0.05889 0.05050 0.16057 -0.02833
## pct_lt15k -0.05796 -0.03965 -0.06480 -0.03702 0.04105
## num_prom -0.04413 -0.24384 0.10717 -0.10143 -0.07790
## lifetime_gifts 0.34876 -0.19860 0.00973 0.34600 -0.07646
## largest_gift 0.93734 0.02917 0.00228 0.91383 0.02210
## last_gift 1.00000 0.08204 0.00046 0.92070 0.03849
## months_since_donate 0.08204 1.00000 -0.01327 0.09609 0.15989
## time_lag 0.00046 -0.01327 1.00000 0.01291 -0.06692
## avg_gift 0.92070 0.09609 0.01291 1.00000 0.04651
## target 0.03849 0.15989 -0.06692 0.04651 1.00000
chart.Correlation(testcor, histogram=FALSE, pch=19)
I used chart.Correlation() to get a better overall picture of the pairwise relationships.
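As a small added sketch (a convenience, not part of the original analysis), the correlation matrix above can also be used to rank the predictors by their absolute correlation with the 1/2-coded target:
target_cor <- cormat[, "target"]
sort(abs(target_cor[names(target_cor) != "target"]), decreasing = TRUE)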
The first model I wanted to try was an LDA model.
set.seed(2)
ldafund = lda(target~., data= train)
summary(ldafund)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 40 -none- numeric
## scaling 20 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 6 -none- list
lda.pred = predict(ldafund, newdata=test, type="response")
lda.class = lda.pred$class
table(lda.class, test$target)
##
## lda.class 1 2
## 1 175 159
## 2 115 151
mean(lda.class==test$target)
## [1] 0.5433333
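A brief sketch of something predict() on an lda fit also returns but that was not used above: the posterior probabilities, which allow cutoffs other than the default most-likely class. The 0.6 cutoff below is only an illustration.
head(lda.pred$posterior)                        # P(class | x) for each test row
lda.custom <- ifelse(lda.pred$posterior[, "2"] > 0.6, 2, 1)
mean(lda.custom == test$target)                 # classification rate at the 0.6 cutoff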
The second model I wanted to try was a GLM.
set.seed(4)
# note: with no family argument, glm() defaults to gaussian, so this fits a linear
# model on the 1/2-coded target rather than a logistic regression
fundglm3 <- glm(target ~ ., data = train)
fund.probs <- predict(fundglm3, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
The value reported by mean() here is the test-set classification rate (not an MSE). I decided to use the GLM as my main model for the remainder of the project.
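As a hedged alternative (an assumption on my part, not what produced the results above): since glm() without a family fits a linear model on the 1/2 coding, a logistic-regression version would recode the target to 0/1 and set family = binomial.
train01 <- train; train01$target <- train01$target - 1    # recode 1/2 to 0/1
test01 <- test;   test01$target <- test01$target - 1
logit_fit <- glm(target ~ ., data = train01, family = binomial)
logit_prob <- predict(logit_fit, test01, type = "response")   # P(target = 1 | x)
logit_pred <- ifelse(logit_prob > 0.5, 1, 0)
mean(logit_pred == test01$target)                             # test classification rate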
set.seed(5)
fundglm <- glm(target ~ largest_gift + months_since_donate, data = train)
fund.probs <- predict(fundglm, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
In this model I used two of the stronger predictors, largest_gift and months_since_donate. The test classification rate matched the full model's, so these two predictors seem to be a reasonable choice on their own.
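A quick supplementary sketch (not in the original run): the coefficient table of this two-predictor fit shows how strong each predictor looks.
summary(fundglm)$coefficients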
set.seed(6)
fundglm <- glm(target ~ wealth + income, data = train)
fund.probs <- predict(fundglm, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
I tried wealth and income to see how they would compare, but they did not improve the classification rate, so I will stick with the two predictors used above.
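As another supplementary sketch (an addition, not part of the original analysis), the candidate predictor sets can also be compared by AIC on the training fits:
AIC(fundglm3,
    glm(target ~ largest_gift + months_since_donate, data = train),
    glm(target ~ wealth + income, data = train))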
set.seed(7)
fundglm1 <- glm(target ~ largest_gift + months_since_donate + num_child, data = train)
fund.probs <- predict(fundglm1, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
In this model I added one more predictor to see whether it would help, but num_child did not change the classification rate, so it does not appear to be a useful factor here.
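A small added sketch: since the two-predictor and three-predictor models are nested, an F test on the training fits is another way to check whether num_child adds anything (base_fit and child_fit are hypothetical names introduced here).
base_fit <- glm(target ~ largest_gift + months_since_donate, data = train)
child_fit <- glm(target ~ largest_gift + months_since_donate + num_child, data = train)
anova(base_fit, child_fit, test = "F")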
set.seed(7)
par(mfrow = c(1, 2))
boxplot(lifetime_gifts ~ target, data = test)
boxplot(largest_gift ~ target, data = test)
Using a weighted sample helps smooth out the data and gives better overall results in our models. It effectively creates an equal number of donors and non-donors, which makes the data easier to work with and lets us find the best predictors. If we used the original data set or a simple random sample, we would most likely end up with skewed results that would not let us identify the best predictors. With the data balanced this way, I can produce these box plots to see how the two selected predictors behave across the two classes.
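A one-line sketch confirming the balance the weighted sample gives (the summary above shows a mean of 1.5 on the 1/2 coding, i.e. roughly 50/50):
prop.table(table(fundraising$target))   # share of each class in the full sample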
After picking the candidate predictors with the GLM, we can see which ones are the best to use. To do that we look at the test classification rate and keep the predictors that produce the best rate, which gives some confirmation that they fit our data. I could also use a different model to estimate the rate at which classifications would be wrong, but to me the GLM is the most efficient with the factors I used. The classification rates for these models can be seen above in the mean() outputs, and a side-by-side view is sketched below.
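For reference (a convenience, not part of the original run), the test classification rates of the two candidate models computed above, side by side:
c(lda = mean(lda.class == test$target),
  glm = mean(fundglm.pred == test$target))   # fundglm.pred holds the most recent GLM's test predictions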
Looking at the previous outputs, to me the GLM using the largest_gift and months_since_donate predictors is the best model for selecting the targets we want.
Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.
finalfundglm <- glm(target ~ wealth + months_since_donate, data = train)
finalprob <- predict(finalfundglm, type = "response")
finalpred <- rep("Donor", length(finalprob))
# 1.5 is the midpoint of the 1/2 coding of target
finalpred[finalprob > 1.5] <- "No Donor"
future_value <- predict(finalfundglm, future_fundraising)
Value <- c("value", as.character(future_value))
Value <- if_else(Value > 1.5, "Donor", "No Donor")
write.table(Value, file = "attempt1.csv", col.names = c("value"), sep = ",", row.names = F)
After uploading it to the Data Algorithms II modeling competition, it came back at 0.5206612, which put me at 32.
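For completeness, a cleaner hedged sketch of the submission step (assumptions: the competition wants a single column headed "value" with Donor / No Donor labels, predictions above 1.5 on the 1/2 coding map to "Donor" as in the if_else above, and the file name attempt2.csv is hypothetical):
future_pred <- predict(finalfundglm, future_fundraising)
future_label <- if_else(future_pred > 1.5, "Donor", "No Donor")   # assumed label mapping
write.table(data.frame(value = future_label), file = "attempt2.csv",
            sep = ",", row.names = FALSE)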