Background

Tayko is a software catalog firm that sells games and education software. They are planning on sending out a new catalog. Tayko is wanting to expand its customer base, so it has recently joined a consortium of similar catalog firms. Tayko, as a member of the consortium, is allowed to pull 200,000 names from the entire 5,000,000 names. Tayko would like to use predictive modeling in order to find the best names from the list.

Business Problem: Choose the names that will have high probability of performing the best

Analytics Problem: Create a predictive model that classifies whether a person is a purchasers or a non-purchaser and then modeling that predicts the amount the purchasers would spend.

Estimated Profit Without Modeling

Each Catalog costs approximately $2 to mail (including printing, postage, and mailing costs). In order to make a predictive model we randomly set 20,000 catalogs to customers in order to help make a model. If we randomly selected the remaining 180,000 customers to send a new catalog we’d expect that the following would be true.

Based off of the random selection people who purchased from the catalog, the people who purchased items spent on average the following:

## [1] 205.04

The estimated gross profit per individual would be about:

## [1] 8.87

And the overall profit would be:

## [1] 1596082

In conclusion, if you randomly pick a person from the 180000 people your expected profit would be roughly $8.87 or if you sent a catalog to all of the remaining 180000 people at random the total profit would be about $1596082

Create A Model For Classifying A Customer As A Purchser Or A Non-Purchaser

Below is a list of the variables.

##  [1] "US"                   "source_a"             "source_c"            
##  [4] "source_b"             "source_d"             "source_e"            
##  [7] "source_m"             "source_o"             "source_h"            
## [10] "source_r"             "source_s"             "source_t"            
## [13] "source_u"             "source_p"             "source_x"            
## [16] "source_w"             "Freq"                 "last_update_days_ago"
## [19] "1st_update_days_ago"  "Web order"            "Gender=male"         
## [22] "Address_is_res"       "Purchase"             "Spending"

The predictors of this list are everything, but the response variable: Purchase

Below are listed first 6 records of the dataset.

## # A tibble: 6 × 8
##      US source_a source_c source_b source_d source_e source_m source_o
##   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1     1        0        0        1        0        0        0        0
## 2     1        0        0        0        0        1        0        0
## 3     1        0        0        0        0        0        0        0
## 4     1        0        1        0        0        0        0        0
## 5     1        0        1        0        0        0        0        0
## 6     1        0        0        0        0        0        0        0
## # A tibble: 6 × 8
##   source_h source_r source_s source_t source_u source_p source_x source_w
##      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1        0        0        0        0        0        0        0        0
## 2        0        0        0        0        0        0        0        0
## 3        0        0        0        1        0        0        0        0
## 4        0        0        0        0        0        0        0        0
## 5        0        0        0        0        0        0        0        0
## 6        0        1        0        0        0        0        0        0
## # A tibble: 6 × 8
##    Freq last_update_days_ago 1st_updat…¹ Web o…² Gende…³ Addre…⁴ Purch…⁵ Spend…⁶
##   <dbl>                <dbl>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     2                 3662        3662       1       0       1       1     128
## 2     0                 2900        2900       1       1       0       0       0
## 3     2                 3883        3914       0       0       0       1     127
## 4     1                  829         829       0       1       0       0       0
## 5     1                  869         869       0       0       0       0       0
## 6     1                 1995        2002       0       0       1       0       0
## # … with abbreviated variable names ¹​`1st_update_days_ago`, ²​`Web order`,
## #   ³​`Gender=male`, ⁴​Address_is_res, ⁵​Purchase, ⁶​Spending

Partition

Partition the data randomly into a training set (800 records), validation set (700 records), and test set (500 records).

Verifying that training set is 800, validation set is 700, and test set is 500.

## [1] 800
## [1] 700
## [1] 500

Logistic Regression Model Using Backwards Elimination

Below is the process of deleting the least significant variable step by step until we get the best AIC.

## 
## Call:
## glm(formula = Purchase ~ . - Spending, family = binomial, data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3620  -0.5757   0.0019   0.6610   2.2437  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.509e+00  7.126e-01  -4.924 8.46e-07 ***
## US                     4.476e-01  2.700e-01   1.658  0.09735 .  
## source_a               1.602e+00  6.751e-01   2.373  0.01762 *  
## source_c              -5.834e-01  7.038e-01  -0.829  0.40713    
## source_b              -9.072e-01  8.411e-01  -1.079  0.28076    
## source_d              -3.416e-01  7.463e-01  -0.458  0.64713    
## source_e               4.960e-02  6.451e-01   0.077  0.93872    
## source_m               2.206e-01  1.044e+00   0.211  0.83265    
## source_o               8.907e-01  8.829e-01   1.009  0.31308    
## source_h              -3.805e+00  9.204e-01  -4.134 3.57e-05 ***
## source_r               6.066e-01  6.938e-01   0.874  0.38192    
## source_s               4.531e-02  7.505e-01   0.060  0.95186    
## source_t               6.539e-01  8.105e-01   0.807  0.41980    
## source_u               1.391e+00  6.557e-01   2.122  0.03387 *  
## source_p               1.447e+01  7.068e+02   0.020  0.98367    
## source_x               1.451e+00  9.503e-01   1.526  0.12690    
## source_w               7.009e-01  6.651e-01   1.054  0.29195    
## Freq                   2.609e+00  2.382e-01  10.949  < 2e-16 ***
## last_update_days_ago   3.066e-04  2.449e-04   1.252  0.21056    
## `1st_update_days_ago` -3.012e-04  2.608e-04  -1.155  0.24816    
## `Web order`            5.990e-01  1.997e-01   2.999  0.00271 ** 
## `Gender=male`         -3.354e-01  2.011e-01  -1.668  0.09532 .  
## Address_is_res        -8.731e-01  2.783e-01  -3.138  0.00170 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1108.32  on 799  degrees of freedom
## Residual deviance:  640.68  on 777  degrees of freedom
## AIC: 686.68
## 
## Number of Fisher Scoring iterations: 14
## Start:  AIC=686.68
## Purchase ~ (US + source_a + source_c + source_b + source_d + 
##     source_e + source_m + source_o + source_h + source_r + source_s + 
##     source_t + source_u + source_p + source_x + source_w + Freq + 
##     last_update_days_ago + `1st_update_days_ago` + `Web order` + 
##     `Gender=male` + Address_is_res + Spending) - Spending
## 
##                         Df Deviance    AIC
## - source_s               1   640.69 684.69
## - source_e               1   640.69 684.69
## - source_m               1   640.73 684.73
## - source_d               1   640.89 684.89
## - source_c               1   641.35 685.35
## - source_t               1   641.35 685.35
## - source_r               1   641.47 685.47
## - source_o               1   641.69 685.69
## - source_b               1   641.84 685.84
## - source_w               1   641.84 685.84
## - `1st_update_days_ago`  1   641.98 685.98
## - last_update_days_ago   1   642.20 686.20
## - source_p               1   642.37 686.37
## <none>                       640.68 686.68
## - source_x               1   643.10 687.10
## - US                     1   643.46 687.46
## - `Gender=male`          1   643.48 687.48
## - source_u               1   645.68 689.68
## - source_a               1   647.01 691.01
## - `Web order`            1   649.72 693.72
## - Address_is_res         1   650.77 694.77
## - source_h               1   657.92 701.92
## - Freq                   1   878.70 922.70
## 
## Step:  AIC=684.69
## Purchase ~ US + source_a + source_c + source_b + source_d + source_e + 
##     source_m + source_o + source_h + source_r + source_t + source_u + 
##     source_p + source_x + source_w + Freq + last_update_days_ago + 
##     `1st_update_days_ago` + `Web order` + `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - source_e               1   640.69 682.69
## - source_m               1   640.73 682.73
## - source_d               1   641.09 683.09
## - source_t               1   641.59 683.59
## - source_o               1   641.90 683.90
## - source_r               1   641.96 683.96
## - `1st_update_days_ago`  1   642.02 684.02
## - source_c               1   642.05 684.05
## - last_update_days_ago   1   642.25 684.25
## - source_p               1   642.41 684.41
## - source_b               1   642.55 684.55
## <none>                       640.69 684.69
## - source_w               1   642.75 684.75
## - US                     1   643.49 685.49
## - `Gender=male`          1   643.49 685.49
## - source_x               1   643.71 685.71
## - `Web order`            1   649.76 691.76
## - source_u               1   649.88 691.88
## - Address_is_res         1   650.79 692.79
## - source_a               1   651.62 693.62
## - source_h               1   669.91 711.91
## - Freq                   1   880.60 922.60
## 
## Step:  AIC=682.69
## Purchase ~ US + source_a + source_c + source_b + source_d + source_m + 
##     source_o + source_h + source_r + source_t + source_u + source_p + 
##     source_x + source_w + Freq + last_update_days_ago + `1st_update_days_ago` + 
##     `Web order` + `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - source_m               1   640.73 680.73
## - source_d               1   641.33 681.33
## - source_t               1   641.81 681.81
## - `1st_update_days_ago`  1   642.04 682.04
## - source_o               1   642.04 682.04
## - last_update_days_ago   1   642.29 682.29
## - source_p               1   642.44 682.44
## - source_r               1   642.55 682.55
## <none>                       640.69 682.69
## - source_c               1   642.87 682.87
## - source_b               1   643.25 683.25
## - `Gender=male`          1   643.49 683.49
## - US                     1   643.51 683.51
## - source_w               1   644.00 684.00
## - source_x               1   644.11 684.11
## - `Web order`            1   649.76 689.76
## - Address_is_res         1   650.81 690.81
## - source_u               1   657.47 697.47
## - source_a               1   658.98 698.98
## - source_h               1   684.04 724.04
## - Freq                   1   891.06 931.06
## 
## Step:  AIC=680.73
## Purchase ~ US + source_a + source_c + source_b + source_d + source_o + 
##     source_h + source_r + source_t + source_u + source_p + source_x + 
##     source_w + Freq + last_update_days_ago + `1st_update_days_ago` + 
##     `Web order` + `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - source_d               1   641.41 679.41
## - source_t               1   641.83 679.83
## - source_o               1   642.07 680.07
## - `1st_update_days_ago`  1   642.13 680.13
## - last_update_days_ago   1   642.34 680.34
## - source_p               1   642.47 680.47
## - source_r               1   642.55 680.55
## <none>                       640.73 680.73
## - source_c               1   643.05 681.05
## - source_b               1   643.33 681.33
## - `Gender=male`          1   643.54 681.54
## - US                     1   643.60 681.60
## - source_w               1   644.02 682.02
## - source_x               1   644.12 682.12
## - `Web order`            1   649.98 687.98
## - Address_is_res         1   650.88 688.88
## - source_u               1   657.57 695.57
## - source_a               1   659.02 697.02
## - source_h               1   685.12 723.12
## - Freq                   1   893.18 931.18
## 
## Step:  AIC=679.41
## Purchase ~ US + source_a + source_c + source_b + source_o + source_h + 
##     source_r + source_t + source_u + source_p + source_x + source_w + 
##     Freq + last_update_days_ago + `1st_update_days_ago` + `Web order` + 
##     `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - source_t               1   642.86 678.86
## - `1st_update_days_ago`  1   642.96 678.96
## - last_update_days_ago   1   642.98 678.98
## - source_o               1   643.05 679.05
## - source_p               1   643.22 679.22
## - source_c               1   643.39 679.39
## <none>                       641.41 679.41
## - source_b               1   643.62 679.62
## - source_r               1   643.79 679.79
## - US                     1   644.03 680.03
## - `Gender=male`          1   644.38 680.38
## - source_w               1   645.11 681.11
## - source_x               1   645.16 681.16
## - `Web order`            1   650.68 686.68
## - Address_is_res         1   651.61 687.61
## - source_u               1   661.41 697.41
## - source_a               1   663.43 699.43
## - source_h               1   685.22 721.22
## - Freq                   1   894.47 930.47
## 
## Step:  AIC=678.86
## Purchase ~ US + source_a + source_c + source_b + source_o + source_h + 
##     source_r + source_u + source_p + source_x + source_w + Freq + 
##     last_update_days_ago + `1st_update_days_ago` + `Web order` + 
##     `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - source_o               1   644.25 678.25
## - `1st_update_days_ago`  1   644.34 678.34
## - last_update_days_ago   1   644.47 678.47
## - source_p               1   644.59 678.59
## - source_r               1   644.80 678.80
## <none>                       642.86 678.86
## - source_c               1   645.28 679.28
## - US                     1   645.49 679.49
## - source_b               1   645.53 679.53
## - `Gender=male`          1   645.61 679.61
## - source_w               1   646.07 680.07
## - source_x               1   646.30 680.30
## - Address_is_res         1   652.75 686.75
## - `Web order`            1   652.84 686.84
## - source_u               1   661.58 695.58
## - source_a               1   663.56 697.56
## - source_h               1   688.76 722.76
## - Freq                   1   899.46 933.46
## 
## Step:  AIC=678.25
## Purchase ~ US + source_a + source_c + source_b + source_h + source_r + 
##     source_u + source_p + source_x + source_w + Freq + last_update_days_ago + 
##     `1st_update_days_ago` + `Web order` + `Gender=male` + Address_is_res
## 
##                         Df Deviance    AIC
## - `1st_update_days_ago`  1   644.97 676.97
## - last_update_days_ago   1   645.17 677.17
## - source_r               1   645.85 677.85
## - source_p               1   645.92 677.92
## <none>                       644.25 678.25
## - US                     1   646.76 678.76
## - source_c               1   646.82 678.82
## - `Gender=male`          1   646.94 678.94
## - source_b               1   647.42 679.42
## - source_w               1   647.47 679.47
## - source_x               1   647.49 679.49
## - Address_is_res         1   653.46 685.46
## - `Web order`            1   654.28 686.28
## - source_u               1   661.96 693.96
## - source_a               1   663.70 695.70
## - source_h               1   689.96 721.96
## - Freq                   1   904.64 936.64
## 
## Step:  AIC=676.97
## Purchase ~ US + source_a + source_c + source_b + source_h + source_r + 
##     source_u + source_p + source_x + source_w + Freq + last_update_days_ago + 
##     `Web order` + `Gender=male` + Address_is_res
## 
##                        Df Deviance    AIC
## - last_update_days_ago  1   645.17 675.17
## - source_r              1   646.55 676.55
## - source_p              1   646.64 676.64
## <none>                      644.97 676.97
## - source_c              1   647.28 677.28
## - US                    1   647.42 677.42
## - `Gender=male`         1   647.86 677.86
## - source_b              1   648.35 678.35
## - source_x              1   648.37 678.37
## - source_w              1   649.35 679.35
## - Address_is_res        1   654.34 684.34
## - `Web order`           1   654.95 684.95
## - source_u              1   663.01 693.01
## - source_a              1   663.92 693.92
## - source_h              1   690.32 720.32
## - Freq                  1   950.88 980.88
## 
## Step:  AIC=675.17
## Purchase ~ US + source_a + source_c + source_b + source_h + source_r + 
##     source_u + source_p + source_x + source_w + Freq + `Web order` + 
##     `Gender=male` + Address_is_res
## 
##                  Df Deviance    AIC
## - source_r        1   646.69 674.69
## - source_p        1   646.87 674.87
## <none>                645.17 675.17
## - US              1   647.64 675.64
## - `Gender=male`   1   647.98 675.98
## - source_c        1   648.04 676.04
## - source_b        1   648.36 676.36
## - source_x        1   648.53 676.53
## - source_w        1   649.82 677.82
## - Address_is_res  1   654.98 682.98
## - `Web order`     1   655.06 683.06
## - source_u        1   663.31 691.31
## - source_a        1   665.29 693.29
## - source_h        1   691.23 719.23
## - Freq            1   970.64 998.64
## 
## Step:  AIC=674.69
## Purchase ~ US + source_a + source_c + source_b + source_h + source_u + 
##     source_p + source_x + source_w + Freq + `Web order` + `Gender=male` + 
##     Address_is_res
## 
##                  Df Deviance    AIC
## - source_p        1   648.30 674.30
## <none>                646.69 674.69
## - US              1   649.22 675.22
## - source_x        1   649.66 675.66
## - `Gender=male`   1   649.79 675.79
## - source_w        1   650.41 676.41
## - source_c        1   650.43 676.43
## - source_b        1   650.51 676.51
## - Address_is_res  1   655.83 681.83
## - `Web order`     1   656.78 682.78
## - source_u        1   663.41 689.41
## - source_a        1   665.42 691.42
## - source_h        1   697.05 723.05
## - Freq            1   972.05 998.05
## 
## Step:  AIC=674.3
## Purchase ~ US + source_a + source_c + source_b + source_h + source_u + 
##     source_x + source_w + Freq + `Web order` + `Gender=male` + 
##     Address_is_res
## 
##                  Df Deviance     AIC
## <none>                648.30  674.30
## - US              1   650.51  674.51
## - source_x        1   651.25  675.25
## - `Gender=male`   1   651.63  675.63
## - source_w        1   651.77  675.77
## - source_c        1   652.14  676.14
## - source_b        1   652.24  676.24
## - Address_is_res  1   657.57  681.57
## - `Web order`     1   658.90  682.90
## - source_u        1   664.70  688.70
## - source_a        1   666.80  690.80
## - source_h        1   699.05  723.05
## - Freq            1   976.35 1000.35
## 
## Call:
## glm(formula = Purchase ~ US + source_a + source_c + source_b + 
##     source_h + source_u + source_x + source_w + Freq + `Web order` + 
##     `Gender=male` + Address_is_res, family = binomial, data = train.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4144  -0.6148   0.0041   0.6795   2.1585  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -3.1664     0.3539  -8.948  < 2e-16 ***
## US               0.3865     0.2615   1.478  0.13940    
## source_a         1.3923     0.3373   4.128 3.67e-05 ***
## source_c        -0.7658     0.4003  -1.913  0.05571 .  
## source_b        -1.1034     0.5905  -1.869  0.06169 .  
## source_h        -3.8783     0.6710  -5.779 7.49e-09 ***
## source_u         1.2098     0.3082   3.926 8.65e-05 ***
## source_x         1.2693     0.7489   1.695  0.09011 .  
## source_w         0.5254     0.2837   1.852  0.06404 .  
## Freq             2.4721     0.2099  11.780  < 2e-16 ***
## `Web order`      0.6371     0.1965   3.242  0.00119 ** 
## `Gender=male`   -0.3594     0.1979  -1.816  0.06938 .  
## Address_is_res  -0.8160     0.2714  -3.007  0.00264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1108.3  on 799  degrees of freedom
## Residual deviance:  648.3  on 787  degrees of freedom
## AIC: 674.3
## 
## Number of Fisher Scoring iterations: 6
## 
## Call:  glm(formula = Purchase ~ US + source_a + source_c + source_b + 
##     source_h + source_u + source_x + source_w + Freq + `Web order` + 
##     `Gender=male` + Address_is_res, family = binomial, data = train.df)
## 
## Coefficients:
##    (Intercept)              US        source_a        source_c        source_b  
##        -3.1664          0.3865          1.3923         -0.7658         -1.1034  
##       source_h        source_u        source_x        source_w            Freq  
##        -3.8783          1.2098          1.2693          0.5254          2.4721  
##    `Web order`   `Gender=male`  Address_is_res  
##         0.6371         -0.3594         -0.8160  
## 
## Degrees of Freedom: 799 Total (i.e. Null);  787 Residual
## Null Deviance:       1108 
## Residual Deviance: 648.3     AIC: 674.3

The lowest AIC backwards elimination is 674.3 with the coefficients and variables listed directly above.

Below is the confusion matrix of the logistic regression model and its corresponding lift chart. This is about 82% accurate compared to the actual data.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 315  65
##          1  59 261
##                                           
##                Accuracy : 0.8229          
##                  95% CI : (0.7925, 0.8504)
##     No Information Rate : 0.5343          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6436          
##                                           
##  Mcnemar's Test P-Value : 0.6534          
##                                           
##             Sensitivity : 0.8006          
##             Specificity : 0.8422          
##          Pos Pred Value : 0.8156          
##          Neg Pred Value : 0.8289          
##              Prevalence : 0.4657          
##          Detection Rate : 0.3729          
##    Detection Prevalence : 0.4571          
##       Balanced Accuracy : 0.8214          
##                                           
##        'Positive' Class : 1               
## 

Notice that the model does a good job for the first 51% of the data, with a lift greater than 1.

Develop A Model For Predicting Spending Among The Purchasers.

We first partition this dataset into the training and validation records.

Multiple Linear Regression Model (using stepwise regression)

The below model was created by using step wise regression. The plot compared the actual spending vs. the predicted spending of the validation data.

## 
## Call:
## lm(formula = Spending ~ ., data = p_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -339.39  -82.57  -18.25   47.89 1181.31 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.800e+01  7.065e+01   0.255   0.7990    
## US                     1.153e+01  2.515e+01   0.458   0.6470    
## source_a               8.570e+01  6.865e+01   1.248   0.2126    
## source_c              -1.762e+00  7.489e+01  -0.024   0.9812    
## source_b              -2.294e+01  7.989e+01  -0.287   0.7741    
## source_d               2.949e+01  7.578e+01   0.389   0.6973    
## source_e               5.745e+01  6.917e+01   0.831   0.4067    
## source_m               5.592e+00  8.921e+01   0.063   0.9501    
## source_o               6.642e+01  9.279e+01   0.716   0.4745    
## source_h              -1.287e+02  9.772e+01  -1.317   0.1885    
## source_r               1.530e+02  7.205e+01   2.124   0.0343 *  
## source_s               4.444e+01  8.336e+01   0.533   0.5943    
## source_t               1.204e+01  7.993e+01   0.151   0.8804    
## source_u               7.691e+01  6.826e+01   1.127   0.2606    
## source_p               1.026e+02  1.191e+02   0.862   0.3893    
## source_x               2.749e+01  9.574e+01   0.287   0.7741    
## source_w               9.305e+01  6.975e+01   1.334   0.1830    
## Freq                   8.785e+01  7.255e+00  12.108  < 2e-16 ***
## last_update_days_ago  -1.525e-02  1.480e-02  -1.030   0.3034    
## `1st_update_days_ago` -3.309e-03  1.742e-02  -0.190   0.8494    
## `Web order`            8.479e-01  1.708e+01   0.050   0.9604    
## `Gender=male`         -1.047e+01  1.729e+01  -0.606   0.5449    
## Address_is_res        -8.509e+01  2.073e+01  -4.105 4.94e-05 ***
## Purchase                      NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 169.2 on 389 degrees of freedom
## Multiple R-squared:  0.4617, Adjusted R-squared:  0.4312 
## F-statistic: 15.16 on 22 and 389 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = Spending ~ source_a + source_e + source_h + source_r + 
##     source_u + source_w + Freq + last_update_days_ago + Address_is_res, 
##     data = p_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -352.10  -85.28  -18.69   48.69 1179.87 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.585e+01  3.110e+01   1.153  0.24968    
## source_a              6.612e+01  2.494e+01   2.651  0.00834 ** 
## source_e              4.213e+01  2.741e+01   1.537  0.12504    
## source_h             -1.388e+02  7.197e+01  -1.929  0.05448 .  
## source_r              1.356e+02  3.352e+01   4.047 6.23e-05 ***
## source_u              5.883e+01  2.559e+01   2.299  0.02200 *  
## source_w              7.757e+01  2.980e+01   2.603  0.00959 ** 
## Freq                  8.689e+01  5.725e+00  15.179  < 2e-16 ***
## last_update_days_ago -1.637e-02  8.906e-03  -1.838  0.06682 .  
## Address_is_res       -8.422e+01  2.018e+01  -4.173 3.69e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 167.2 on 402 degrees of freedom
## Multiple R-squared:  0.457,  Adjusted R-squared:  0.4448 
## F-statistic: 37.59 on 9 and 402 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = Spending ~ source_a + source_e + source_h + source_r + 
##     source_u + source_w + Freq + last_update_days_ago + Address_is_res, 
##     data = p_train)
## 
## Coefficients:
##          (Intercept)              source_a              source_e  
##             35.85157              66.12141              42.12693  
##             source_h              source_r              source_u  
##           -138.81542             135.64186              58.83304  
##             source_w                  Freq  last_update_days_ago  
##             77.56666              86.89006              -0.01637  
##       Address_is_res  
##            -84.22450

Regression Tree Models

Below are two classification tree models. The 2nd model uses a pruning method to get the best complexity parameter i.e. the smallest cp vs. a controlled max depth of 3.

d. 

Choose one model on the basis of its performance on the validation data.

##                 ME     RMSE      MAE       MPE     MAPE
## Test set -12.46933 183.2119 116.6166 -94.95069 121.1143
##                 ME     RMSE      MAE       MPE     MAPE
## Test set -10.86196 202.8882 116.1196 -95.99186 117.4464

Based off of the RMSE we can see that the Multiple Regression Model worked the best.

Score Analysis of Data

Use the original test data partition. We will create a new data frame called Score Analysis that contains the test data portion of this dataset. Add a column to the data frame with the predicted scores from the logistic regression. Add another column with the predicted spending amount from the prediction model chosen. Add a column for “adjusted probability of purchase” by multiplying “predicted probability of purchase” by .107. This is to adjust for over-sampling the purchasers. Add a column for expected spending: adjusted probability of purchase X predicted spending.

Lift Charts For Expected Spending

Estimated Gross Profit Base on Lift Charts

Using the results from the lift curves estimate the gross profit from mailing to remaining the 180,000 names on the basis of your data mining models.

## [1] 1776600

Thus using the model we would only need to use half of the 180000 people to get a total profit of $177660 vs. the $1596082, which is about a increase of $180518. Note: we still could send out catalogs to the remaining 90000 slots we have and could do more intense modeling to see whether or not those would produce a profit.