Develop a model for predicting the conversion rate of a customer.
library(dplyr)
library(ISLR)
library(knitr)
library(glmnet)
library(caret)
library(rpart)
library(randomForest)
library(pROC)
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
data <- read.csv("/Users/bear/Desktop/capstone.csv")
Using the gam library, we fit an additive model. Each f_j can take any of several forms, such as a linear term, a polynomial, a step function, a degree-k spline, or a natural cubic spline. The model is flexible and smooth: the response may depend linearly or non-linearly on each predictor X_j, so the fit can capture non-linear relationships between the response and the predictors.
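In symbols, the additive model described above can be written as follows. The notation (n observations, p predictors, intercept \beta_0, error term \varepsilon_i) is the usual textbook form and is an assumption here, not taken from the source.

```latex
y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \varepsilon_i, \qquad i = 1, \dots, n
```

A linear model is the special case in which every f_j(x_{ij}) = \beta_j x_{ij}.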
gender_fac <- factor(data$gender, levels=c("M","F"), labels=c(0,1))
# time_old <- substring(data$InvoiceDate,10,11)
# InvoiceDate uses two-digit years (e.g. "12/01/10 8:26"), so parse with %y rather than %Y
time <- format(as.POSIXct(data$InvoiceDate,format='%m/%d/%y %H:%M'),format='%m%d%H')
data$time <- time
data$hour <- substring(data$time,5,6)
head(data, n=15)
## ad_id xyz_campaign_id fb_campaign_id age gender interest Impressions
## 1 708746 916 103916 30-34 M 15 7350
## 2 708749 916 103917 30-34 M 16 17861
## 3 708771 916 103920 30-34 M 20 693
## 4 708815 916 103928 30-34 M 28 4259
## 5 708818 916 103928 30-34 M 28 4133
## 6 708820 916 103929 30-34 M 29 1915
## 7 708889 916 103940 30-34 M 15 15615
## 8 708895 916 103941 30-34 M 16 10951
## 9 708953 916 103951 30-34 M 27 2355
## 10 708958 916 103952 30-34 M 28 9502
## 11 708979 916 103955 30-34 M 31 1224
## 12 709023 916 103962 30-34 M 7 735
## 13 709038 916 103965 30-34 M 16 5117
## 14 709040 916 103965 30-34 M 16 5120
## 15 709059 916 103968 30-34 M 20 14669
## Clicks Spent Total_Conversion Approved_Conversion InvoiceDate InvoiceNo
## 1 1 1.43 2 1 12/01/10 8:26 536365
## 2 2 1.82 2 0 12/01/10 8:26 536365
## 3 0 0.00 1 0 12/01/10 8:26 536365
## 4 1 1.25 1 0 12/01/10 8:26 536365
## 5 1 1.29 1 1 12/01/10 8:26 536365
## 6 0 0.00 1 1 12/01/10 8:26 536365
## 7 3 4.77 1 0 12/01/10 8:26 536365
## 8 1 1.27 1 1 12/01/10 8:28 536366
## 9 1 1.50 1 0 12/01/10 8:28 536366
## 10 3 3.16 1 0 12/01/10 8:34 536367
## 11 0 0.00 1 0 12/01/10 8:34 536367
## 12 0 0.00 1 0 12/01/10 8:34 536367
## 13 0 0.00 1 0 12/01/10 8:34 536367
## 14 0 0.00 1 0 12/01/10 8:34 536367
## 15 7 10.28 1 1 12/01/10 8:34 536367
## time hour
## 1 120108 08
## 2 120108 08
## 3 120108 08
## 4 120108 08
## 5 120108 08
## 6 120108 08
## 7 120108 08
## 8 120108 08
## 9 120108 08
## 10 120108 08
## 11 120108 08
## 12 120108 08
## 13 120108 08
## 14 120108 08
## 15 120108 08
Model trials:
library(ggplot2)
# As originally run, the spline layer used method = "lm" with s(x, 6) and was not drawn:
# stat_smooth() warned 'could not find function "s"' (and likewise "ns").
# method = "gam" fits the smooth through mgcv, where s() is understood;
# k = 6 is mgcv's basis dimension, not identical to df = 6 in gam::s().
qplot(data = data, x = interest, y = Approved_Conversion,
      xlab = "interest", ylab = "Approved_Conversion",
      colour = I(cbPalette[1]), alpha = I(0.9)) +
  stat_smooth(method = "lm", formula = y ~ x, aes(colour = "Linear regression"), lwd = 1.25) +
  stat_smooth(method = "gam", formula = y ~ s(x, k = 6), aes(colour = "Smoothing spline"), lwd = 1.25) +
  scale_colour_discrete("Model") +
  theme_bw()
Additive model
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
## Call:
## gam(formula = Approved_Conversion ~ +s(interest, 6) + Impressions +
## gender_fac + age + Spent + cut(Clicks, breaks = c(0, 100,
## 200, 300, Inf)) + Total_Conversion + hour, data = data)
##
## Degrees of Freedom: 935 total; 914.0001 Residual
## 207 observations deleted due to missingness
## Residual Deviance: 769.4782
##
## Call: gam(formula = Approved_Conversion ~ +s(interest, 6) + Impressions +
## gender_fac + age + Spent + cut(Clicks, breaks = c(0, 100,
## 200, 300, Inf)) + Total_Conversion + hour, data = data)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.4632 -0.4453 -0.1774 0.5367 5.8579
##
## (Dispersion Parameter for gaussian family taken to be 0.8419)
##
## Null Deviance: 3306.204 on 935 degrees of freedom
## Residual Deviance: 769.4782 on 914.0001 degrees of freedom
## AIC: 2518.888
## 207 observations deleted due to missingness
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value
## s(interest, 6) 1 5.89 5.89 6.9949
## Impressions 1 1557.54 1557.54 1850.0705
## gender_fac 1 17.57 17.57 20.8735
## age 3 83.27 27.76 32.9716
## Spent 1 222.14 222.14 263.8658
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf)) 3 12.68 4.23 5.0188
## Total_Conversion 1 668.91 668.91 794.5486
## hour 5 2.34 0.47 0.5552
## Residuals 914 769.48 0.84
## Pr(>F)
## s(interest, 6) 0.008314 **
## Impressions < 2.2e-16 ***
## gender_fac 5.579e-06 ***
## age < 2.2e-16 ***
## Spent < 2.2e-16 ***
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf)) 0.001867 **
## Total_Conversion < 2.2e-16 ***
## hour 0.734422
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(interest, 6) 5 3.1035 0.008738 **
## Impressions
## gender_fac
## age
## Spent
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf))
## Total_Conversion
## hour
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Intercept)
## 2.004944e-01
## s(interest, 6)
## -4.674917e-03
## Impressions
## 1.993109e-06
## gender_fac1
## -1.270960e-01
## age35-39
## 5.373149e-02
## age40-44
## 6.521062e-02
## age45-49
## 9.766814e-02
## Spent
## -8.346374e-03
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf))(100,200]
## 1.630005e-01
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf))(200,300]
## 5.518351e-01
## cut(Clicks, breaks = c(0, 100, 200, 300, Inf))(300,Inf]
## -4.772973e-01
## Total_Conversion
## 3.357723e-01
## hour09
## 4.647636e-02
## hour10
## -1.344962e-01
## hour11
## -5.363080e-02
## hour12
## 2.279031e-02
## hour13
## 1.079674e-01
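The "207 observations deleted due to missingness" in the output above appear to come from the `cut(Clicks, ...)` term rather than from missing data in the file. With breaks starting at 0, the first interval is (0,100], which excludes 0, so every ad with zero clicks gets NA and is dropped from the fit. This is consistent with the `head(data)` listing, where rows 3 and 6 have Clicks = 0 and their Impressions values (693 and 1915) are absent from the vector in the later plot warning. A minimal sketch with hypothetical click counts:

```r
# Hypothetical click counts, including a zero
clicks <- c(0, 1, 100, 150, 300, 421)
breaks <- c(0, 100, 200, 300, Inf)

# Default intervals are (0,100], (100,200], ...: a value of exactly 0 matches none
as_run <- cut(clicks, breaks = breaks)
is.na(as_run)

# include.lowest = TRUE closes the first interval at 0, giving [0,100]
fixed <- cut(clicks, breaks = breaks, include.lowest = TRUE)
table(fixed, useNA = "ifany")
```

Adding `include.lowest = TRUE` to the model formula would keep the zero-click rows; it would also rename the first level to `[0,100]`, so the coefficient labels would change accordingly.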
Effect of each predictor on the number of approved conversions
## Warning in gplot.numeric(x = c(7350L, 17861L, 4259L, 4133L, 15615L, 10951L, :
## Residuals do not match x in "partial for Impressions" preplot object
## Warning in gplot.numeric(x = c(1.42999995, 1.82000002, 1.25, 1.28999996, :
## Residuals do not match x in "partial for Spent" preplot object
## Warning in gplot.numeric(x = c(2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, :
## Residuals do not match x in "partial for Total_Conversion" preplot object
## Warning in gplot.default(x = c("08", "08", "08", "08", "08", "08", "08", : The
## "x" component of "partial for hour" has class "character"; no gplot() methods
## available
## [1] 769.4782
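The last warning above says that `hour` has class "character", so no partial-effect panel could be drawn for it. One possible remedy is to convert `hour` to a factor before fitting. The sketch below assumes the CRAN gam package is installed and uses a simulated stand-in data frame (`sim`, with made-up values and only a subset of the original columns), not the capstone data.

```r
library(gam)  # assumes the CRAN 'gam' package is installed

set.seed(1)
n <- 400
# Simulated stand-in for the capstone data (hypothetical values)
sim <- data.frame(
  interest         = sample(2:114, n, replace = TRUE),
  Impressions      = rpois(n, 5000),
  Total_Conversion = rpois(n, 2),
  hour             = sprintf("%02d", sample(8:13, n, replace = TRUE)),
  stringsAsFactors = FALSE
)
sim$Approved_Conversion <- 0.3 * sim$Total_Conversion +
  sin(sim$interest / 20) + rnorm(n, sd = 0.5)

# Character -> factor, so the plot method has a term type it can draw
sim$hour <- factor(sim$hour)

fit <- gam(Approved_Conversion ~ s(interest, 6) + Impressions +
             Total_Conversion + hour, data = sim)

par(mfrow = c(2, 2))
plot(fit, se = TRUE)   # one partial-effect panel per term, with standard-error bands
deviance(fit)          # residual deviance of the fitted model
```

On the real data the equivalent step would be `data$hour <- factor(data$hour)` before the call to `gam()`.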
Generalized additive models extend linear models by replacing some of the linear terms with smooth, flexible non-linear functions of the predictors, which lets them capture non-linear relationships in the data. Because each predictor's contribution is estimated as a separate term, the fitted model remains interpretable.
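To illustrate the point about non-linear terms, here is a minimal sketch on simulated data. It uses a natural cubic spline inside `lm()` (base R plus the bundled splines package) rather than `gam::gam()`, which is a simpler stand-in for the additive fit used in this report; the variables `x1`, `x2`, and `y` are hypothetical.

```r
library(splines)  # bundled with R

set.seed(42)
n  <- 500
x1 <- runif(n, 0, 10)   # predictor with a non-linear effect
x2 <- runif(n, 0, 10)   # predictor with a linear effect
y  <- sin(x1) + 0.5 * x2 + rnorm(n, sd = 0.3)

# Purely linear model versus an additive model with a spline term for x1
linear_fit   <- lm(y ~ x1 + x2)
additive_fit <- lm(y ~ ns(x1, df = 6) + x2)

# F-test for whether the spline term improves on the linear term
anova(linear_fit, additive_fit)

# Residual sums of squares for the two fits
c(linear = deviance(linear_fit), additive = deviance(additive_fit))
```

The coefficient on `x2` keeps its usual linear interpretation in both fits, while the effect of `x1` is read from its fitted curve.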