library(readr)
library(ISLR)
library(MASS)
library(class)
library(dplyr)
library(caret)
library(ggplot2)
library(PerformanceAnalytics)
library(regclass)
fundraising <- readRDS("C:/Users/ocanc/Downloads/fundraising (4).rds")
future_fundraising <- readRDS("C:/Users/ocanc/Downloads/future_fundraising (5).rds")
Step 1: Partitioning. You might think about how to estimate the out of sample error. Either partition the dataset into 80% training and 20% validation or use cross validation (set the seed to 12345).
set.seed(12345)
# target is recoded from a two-level factor to numeric 1/2 for the models below
fundraising$target <- as.numeric(fundraising$target)
# 80% training / 20% validation split
sample <- createDataPartition(y = fundraising$target, p = .80, list = FALSE)
train <- fundraising[sample, ]
test <- fundraising[-sample, ]
# repeated 10-fold cross-validation control (3 repeats), for use with caret::train
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
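As a hedged sketch of the cross-validation alternative (not something run for the results below), the train_control object could be passed to caret::train. caret expects a factor outcome for classification, so target is converted back to a factor in a copy of the training data; the two predictors named here are just an illustrative choice.
train_cv <- train
train_cv$target <- factor(train_cv$target)
cv_fit <- train(target ~ largest_gift + months_since_donate, data = train_cv,
                method = "glm", trControl = train_control)
cv_fit$results  # mean Accuracy and Kappa over the 10 folds x 3 repeats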
Step 2: Model Building. Follow these steps to build, evaluate, and choose a model.
str(test)
## tibble [600 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 1 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 2 1 2 2 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 1 1 2 1 1 1 1 2 1 ...
## $ num_child : num [1:600] 1 1 1 1 1 1 1 1 1 1 ...
## $ income : num [1:600] 4 4 3 1 4 7 4 4 5 2 ...
## $ female : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 2 1 1 2 1 ...
## $ wealth : num [1:600] 8 8 8 8 5 8 1 3 8 0 ...
## $ home_value : num [1:600] 482 411 296 1000 971 ...
## $ med_fam_inc : num [1:600] 242 236 280 468 347 521 250 347 464 239 ...
## $ avg_fam_inc : num [1:600] 275 273 315 506 375 576 276 354 506 288 ...
## $ pct_lt15k : num [1:600] 28 21 17 6 12 1 32 15 6 36 ...
## $ num_prom : num [1:600] 38 15 64 28 50 30 55 76 43 69 ...
## $ lifetime_gifts : num [1:600] 73 15 52 39 84 66 91 119 119 304 ...
## $ largest_gift : num [1:600] 10 15 5 10 10 12 10 15 30 50 ...
## $ last_gift : num [1:600] 10 15 2 5 7 12 5 10 22 50 ...
## $ months_since_donate: num [1:600] 31 35 20 29 30 28 28 31 23 37 ...
## $ time_lag : num [1:600] 3 5 7 3 6 4 4 10 1 4 ...
## $ avg_gift : num [1:600] 7.3 15 2.26 6.5 6.46 ...
## $ target : num [1:600] 1 1 1 2 2 1 2 1 1 2 ...
First we look at the structure of the data to identify the available predictors.
str(train)
## tibble [2,400 x 21] (S3: tbl_df/tbl/data.frame)
## $ zipconvert2 : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 2 1 2 1 ...
## $ zipconvert3 : Factor w/ 2 levels "Yes","No": 2 2 2 1 2 2 2 2 2 2 ...
## $ zipconvert4 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 2 ...
## $ zipconvert5 : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 1 1 2 1 1 ...
## $ homeowner : Factor w/ 2 levels "Yes","No": 1 2 1 1 1 1 1 1 1 1 ...
## $ num_child : num [1:2400] 1 2 1 1 1 1 1 1 1 1 ...
## $ income : num [1:2400] 1 5 3 4 4 4 4 4 1 4 ...
## $ female : Factor w/ 2 levels "Yes","No": 2 1 2 2 1 2 1 1 1 1 ...
## $ wealth : num [1:2400] 7 8 4 8 8 5 8 8 5 5 ...
## $ home_value : num [1:2400] 698 828 1471 547 857 ...
## $ med_fam_inc : num [1:2400] 422 358 484 386 450 333 458 541 203 434 ...
## $ avg_fam_inc : num [1:2400] 463 376 546 432 498 388 533 575 271 472 ...
## $ pct_lt15k : num [1:2400] 4 13 4 7 5 16 8 11 39 6 ...
## $ num_prom : num [1:2400] 46 32 94 20 47 51 21 66 73 59 ...
## $ lifetime_gifts : num [1:2400] 94 30 177 23 139 63 26 108 161 84 ...
## $ largest_gift : num [1:2400] 12 10 10 11 20 15 16 12 6 5 ...
## $ last_gift : num [1:2400] 12 5 8 11 20 10 16 7 3 3 ...
## $ months_since_donate: num [1:2400] 34 29 30 30 37 37 30 31 32 30 ...
## $ time_lag : num [1:2400] 6 7 3 6 3 8 6 1 7 12 ...
## $ avg_gift : num [1:2400] 9.4 4.29 7.08 7.67 10.69 ...
## $ target : num [1:2400] 1 1 2 2 1 1 2 1 1 2 ...
summary(fundraising)
## zipconvert2 zipconvert3 zipconvert4 zipconvert5 homeowner num_child
## No :2352 Yes: 551 No :2357 No :1846 Yes:2312 Min. :1.000
## Yes: 648 No :2449 Yes: 643 Yes:1154 No : 688 1st Qu.:1.000
## Median :1.000
## Mean :1.069
## 3rd Qu.:1.000
## Max. :5.000
## income female wealth home_value med_fam_inc
## Min. :1.000 Yes:1831 Min. :0.000 Min. : 0.0 Min. : 0.0
## 1st Qu.:3.000 No :1169 1st Qu.:5.000 1st Qu.: 554.8 1st Qu.: 278.0
## Median :4.000 Median :8.000 Median : 816.5 Median : 355.0
## Mean :3.899 Mean :6.396 Mean :1143.3 Mean : 388.4
## 3rd Qu.:5.000 3rd Qu.:8.000 3rd Qu.:1341.2 3rd Qu.: 465.0
## Max. :7.000 Max. :9.000 Max. :5945.0 Max. :1500.0
## avg_fam_inc pct_lt15k num_prom lifetime_gifts
## Min. : 0.0 Min. : 0.00 Min. : 11.00 Min. : 15.0
## 1st Qu.: 318.0 1st Qu.: 5.00 1st Qu.: 29.00 1st Qu.: 45.0
## Median : 396.0 Median :12.00 Median : 48.00 Median : 81.0
## Mean : 432.3 Mean :14.71 Mean : 49.14 Mean : 110.7
## 3rd Qu.: 516.0 3rd Qu.:21.00 3rd Qu.: 65.00 3rd Qu.: 135.0
## Max. :1331.0 Max. :90.00 Max. :157.00 Max. :5674.9
## largest_gift last_gift months_since_donate time_lag
## Min. : 5.00 Min. : 0.00 Min. :17.00 Min. : 0.000
## 1st Qu.: 10.00 1st Qu.: 7.00 1st Qu.:29.00 1st Qu.: 3.000
## Median : 15.00 Median : 10.00 Median :31.00 Median : 5.000
## Mean : 16.65 Mean : 13.48 Mean :31.13 Mean : 6.876
## 3rd Qu.: 20.00 3rd Qu.: 16.00 3rd Qu.:34.00 3rd Qu.: 9.000
## Max. :1000.00 Max. :219.00 Max. :37.00 Max. :77.000
## avg_gift target
## Min. : 2.139 Min. :1.0
## 1st Qu.: 6.333 1st Qu.:1.0
## Median : 9.000 Median :2.0
## Mean : 10.669 Mean :1.5
## 3rd Qu.: 12.800 3rd Qu.:2.0
## Max. :122.167 Max. :2.0
plot(train)
I then decided to look at the correlations between the numeric variables, along with their scatterplots.
# numeric columns (plus the 1/2-coded target) used for the correlation matrix
testcor <- test[, c(6, 7, 9:21)]
head(testcor, 10)
## # A tibble: 10 x 15
## num_child income wealth home_value med_fam_inc avg_fam_inc pct_lt15k num_prom
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 4 8 482 242 275 28 38
## 2 1 4 8 411 236 273 21 15
## 3 1 3 8 296 280 315 17 64
## 4 1 1 8 1000 468 506 6 28
## 5 1 4 5 971 347 375 12 50
## 6 1 7 8 1046 521 576 1 30
## 7 1 4 1 1366 250 276 32 55
## 8 1 4 3 806 347 354 15 76
## 9 1 5 8 2194 464 506 6 43
## 10 1 2 0 913 239 288 36 69
## # ... with 7 more variables: lifetime_gifts <dbl>, largest_gift <dbl>,
## # last_gift <dbl>, months_since_donate <dbl>, time_lag <dbl>, avg_gift <dbl>,
## # target <dbl>
cormat <- cor(testcor)
round(cormat, 5)
## num_child income wealth home_value med_fam_inc
## num_child 1.00000 0.12918 -0.00552 0.00062 0.07040
## income 0.12918 1.00000 0.17976 0.34294 0.37025
## wealth -0.00552 0.17976 1.00000 0.23769 0.33829
## home_value 0.00062 0.34294 0.23769 1.00000 0.73582
## med_fam_inc 0.07040 0.37025 0.33829 0.73582 1.00000
## avg_fam_inc 0.08294 0.39449 0.35116 0.74283 0.97034
## pct_lt15k -0.03584 -0.26513 -0.34150 -0.39312 -0.66540
## num_prom -0.05622 -0.11424 -0.39654 -0.05858 -0.02204
## lifetime_gifts -0.07707 -0.02510 -0.26851 0.02508 0.02700
## largest_gift -0.05637 0.14504 0.04088 0.18475 0.16293
## last_gift -0.03962 0.14807 0.07428 0.22678 0.19614
## months_since_donate 0.01983 0.11951 -0.02478 0.09585 0.06328
## time_lag -0.05725 -0.00578 -0.01106 0.02378 0.04536
## avg_gift -0.05295 0.14170 0.08288 0.22074 0.17177
## target 0.00839 -0.05853 -0.02543 -0.03787 -0.03126
## avg_fam_inc pct_lt15k num_prom lifetime_gifts largest_gift
## num_child 0.08294 -0.03584 -0.05622 -0.07707 -0.05637
## income 0.39449 -0.26513 -0.11424 -0.02510 0.14504
## wealth 0.35116 -0.34150 -0.39654 -0.26851 0.04088
## home_value 0.74283 -0.39312 -0.05858 0.02508 0.18475
## med_fam_inc 0.97034 -0.66540 -0.02204 0.02700 0.16293
## avg_fam_inc 1.00000 -0.70938 -0.01771 0.02425 0.15295
## pct_lt15k -0.70938 1.00000 0.02113 0.04221 -0.03750
## num_prom -0.01771 0.02113 1.00000 0.70507 0.03718
## lifetime_gifts 0.02425 0.04221 0.70507 1.00000 0.46329
## largest_gift 0.15295 -0.03750 0.03718 0.46329 1.00000
## last_gift 0.18316 -0.05796 -0.04413 0.34876 0.93734
## months_since_donate 0.05889 -0.03965 -0.24384 -0.19860 0.02917
## time_lag 0.05050 -0.06480 0.10717 0.00973 0.00228
## avg_gift 0.16057 -0.03702 -0.10143 0.34600 0.91383
## target -0.02833 0.04105 -0.07790 -0.07646 0.02210
## last_gift months_since_donate time_lag avg_gift target
## num_child -0.03962 0.01983 -0.05725 -0.05295 0.00839
## income 0.14807 0.11951 -0.00578 0.14170 -0.05853
## wealth 0.07428 -0.02478 -0.01106 0.08288 -0.02543
## home_value 0.22678 0.09585 0.02378 0.22074 -0.03787
## med_fam_inc 0.19614 0.06328 0.04536 0.17177 -0.03126
## avg_fam_inc 0.18316 0.05889 0.05050 0.16057 -0.02833
## pct_lt15k -0.05796 -0.03965 -0.06480 -0.03702 0.04105
## num_prom -0.04413 -0.24384 0.10717 -0.10143 -0.07790
## lifetime_gifts 0.34876 -0.19860 0.00973 0.34600 -0.07646
## largest_gift 0.93734 0.02917 0.00228 0.91383 0.02210
## last_gift 1.00000 0.08204 0.00046 0.92070 0.03849
## months_since_donate 0.08204 1.00000 -0.01327 0.09609 0.15989
## time_lag 0.00046 -0.01327 1.00000 0.01291 -0.06692
## avg_gift 0.92070 0.09609 0.01291 1.00000 0.04651
## target 0.03849 0.15989 -0.06692 0.04651 1.00000
chart.Correlation(testcor, histogram=FALSE, pch=19)
I used chart.Correlation() to get a better overall picture of the pairwise relationships.
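As a small added sketch (a convenience, not part of the original analysis), the correlation matrix above can also be used to rank the predictors by their absolute correlation with the 1/2-coded target:
target_cor <- cormat[, "target"]
sort(abs(target_cor[names(target_cor) != "target"]), decreasing = TRUE)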
The first model I wanted to try was an LDA model.
set.seed(2)
ldafund = lda(target~., data= train)
summary(ldafund)
## Length Class Mode
## prior 2 -none- numeric
## counts 2 -none- numeric
## means 40 -none- numeric
## scaling 20 -none- numeric
## lev 2 -none- character
## svd 1 -none- numeric
## N 1 -none- numeric
## call 3 -none- call
## terms 3 terms call
## xlevels 6 -none- list
lda.pred = predict(ldafund, newdata=test, type="response")
lda.class = lda.pred$class
table(lda.class, test$target)
##
## lda.class 1 2
## 1 175 159
## 2 115 151
mean(lda.class==test$target)
## [1] 0.5433333
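A brief sketch of something predict() on an lda fit also returns but that was not used above: the posterior probabilities, which allow cutoffs other than the default most-likely class. The 0.6 cutoff below is only an illustration.
head(lda.pred$posterior)                        # P(class | x) for each test row
lda.custom <- ifelse(lda.pred$posterior[, "2"] > 0.6, 2, 1)
mean(lda.custom == test$target)                 # classification rate at the 0.6 cutoff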
The second model I wanted to try was a GLM.
set.seed(4)
# note: with no family argument, glm() defaults to gaussian, so this fits a linear
# model on the 1/2-coded target rather than a logistic regression
fundglm3 <- glm(target ~ ., data = train)
fund.probs <- predict(fundglm3, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
The value reported by mean() here is the test-set classification rate (not an MSE). I decided to use the GLM as my main model for the remainder of the project.
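As a hedged alternative (an assumption on my part, not what produced the results above): since glm() without a family fits a linear model on the 1/2 coding, a logistic-regression version would recode the target to 0/1 and set family = binomial.
train01 <- train; train01$target <- train01$target - 1    # recode 1/2 to 0/1
test01 <- test;   test01$target <- test01$target - 1
logit_fit <- glm(target ~ ., data = train01, family = binomial)
logit_prob <- predict(logit_fit, test01, type = "response")   # P(target = 1 | x)
logit_pred <- ifelse(logit_prob > 0.5, 1, 0)
mean(logit_pred == test01$target)                             # test classification rate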
set.seed(5)
fundglm <- glm(target ~ largest_gift + months_since_donate, data = train)
fund.probs <- predict(fundglm, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
In this model I used two of the stronger predictors, largest_gift and months_since_donate. The test classification rate matched the full model's, so these two predictors seem to be a reasonable choice on their own.
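A quick supplementary sketch (not in the original run): the coefficient table of this two-predictor fit shows how strong each predictor looks.
summary(fundglm)$coefficients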
set.seed(6)
fundglm <- glm(target ~ wealth + income, data = train)
fund.probs <- predict(fundglm, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
I tried wealth and income to see how they would compare, but they did not improve the classification rate, so I will stick with the two predictors used above.
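As another supplementary sketch (an addition, not part of the original analysis), the candidate predictor sets can also be compared by AIC on the training fits:
AIC(fundglm3,
    glm(target ~ largest_gift + months_since_donate, data = train),
    glm(target ~ wealth + income, data = train))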
set.seed(7)
fundglm1 <- glm(target ~ largest_gift + months_since_donate + num_child, data = train)
fund.probs <- predict(fundglm1, test, type = "response")
fundglm.pred <- rep(0, length(fund.probs))
fundglm.pred[fund.probs > 0.5] <- 1
table(fundglm.pred, test$target)
##
## fundglm.pred 1 2
## 1 290 310
mean(fundglm.pred == test$target)
## [1] 0.4833333
In this model I added one more predictor to see whether it would help, but num_child did not change the classification rate, so it does not appear to be a useful factor here.
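A small added sketch: since the two-predictor and three-predictor models are nested, an F test on the training fits is another way to check whether num_child adds anything (base_fit and child_fit are hypothetical names introduced here).
base_fit <- glm(target ~ largest_gift + months_since_donate, data = train)
child_fit <- glm(target ~ largest_gift + months_since_donate + num_child, data = train)
anova(base_fit, child_fit, test = "F")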
set.seed(7)
par(mfrow = c(1, 2))
boxplot(lifetime_gifts ~ target, data = test)
boxplot(largest_gift ~ target, data = test)
Using a weighted sample helps smooth out the data and gives better overall results in our models. It effectively creates an equal number of donors and non-donors, which makes the data easier to work with and lets us find the best predictors. If we used the original data set or a simple random sample, we would most likely end up with skewed results that would not let us identify the best predictors. With the data balanced this way, I can produce these box plots to see how the two selected predictors behave across the two classes.
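A one-line sketch confirming the balance the weighted sample gives (the summary above shows a mean of 1.5 on the 1/2 coding, i.e. roughly 50/50):
prop.table(table(fundraising$target))   # share of each class in the full sample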
After picking the candidate predictors with the GLM, we can see which ones are the best to use. To do that we look at the test classification rate and keep the predictors that produce the best rate, which gives some confirmation that they fit our data. I could also use a different model to estimate the rate at which classifications would be wrong, but to me the GLM is the most efficient with the factors I used. The classification rates for these models can be seen above in the mean() outputs, and a side-by-side view is sketched below.
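For reference (a convenience, not part of the original run), the test classification rates of the two candidate models computed above, side by side:
c(lda = mean(lda.class == test$target),
  glm = mean(fundglm.pred == test$target))   # fundglm.pred holds the most recent GLM's test predictions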
Looking at the previous outputs, to me the GLM using the largest_gift and months_since_donate predictors is the best model for selecting the targets we want.
Step 3: Testing. The file FutureFundraising.csv contains the attributes for future mailing candidates.
finalfundglm <- glm(target ~ wealth + months_since_donate, data = train)
finalprob <- predict(finalfundglm, type = "response")
finalpred <- rep("Donor", length(finalprob))
# 1.5 is the midpoint of the 1/2 coding of target
finalpred[finalprob > 1.5] <- "No Donor"
future_value <- predict(finalfundglm, future_fundraising)
Value <- c("value", as.character(future_value))
Value <- if_else(Value > 1.5, "Donor", "No Donor")
write.table(Value, file = "attempt1.csv", col.names = c("value"), sep = ",", row.names = F)
After uploading it to the Data Algorithms II modeling competition, it came back at 0.5206612, which put me at 32.
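For completeness, a cleaner hedged sketch of the submission step (assumptions: the competition wants a single column headed "value" with Donor / No Donor labels, predictions above 1.5 on the 1/2 coding map to "Donor" as in the if_else above, and the file name attempt2.csv is hypothetical):
future_pred <- predict(finalfundglm, future_fundraising)
future_label <- if_else(future_pred > 1.5, "Donor", "No Donor")   # assumed label mapping
write.table(data.frame(value = future_label), file = "attempt2.csv",
            sep = ",", row.names = FALSE)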