My company is spending a lot of money on Digital Marketing and I am the main person to purchase impressions and clicks from online media. While I am using an agency to purchase for us, I would like to find out the relationship between click and demography of users. I need to build a predictor model or build a basic recommender. It will help me to understand how to target audience for online advertisement purchase.
Initially, I was going to use the following as mentioned in my proposal: https://labs.criteo.com/2014/02/download-dataset/. The dataset doesn’t have column names and I don’t know what it means.
I download another dataset from Taobao/Alimama which is a subsidiary of Alibaba.
OSEMN Process 1. Obtain Data 2. Scrub Data 3. Explore Data 4. Model Data 5. Interpret Results
#Create a function to extract and read CSV. Download to local drive method.
exCSV <- function(theUrl){
fname <- str_extract(theUrl, "[^/]+(?=\\.tar.gz$)")
untar(theUrl, files=fname, exdir="C:/Users/dansa/Documents/")
read_csv(file=fname, na = c("", "NA"))
}
#Set the working directory from which the files will be read from
setwd("C:/Users/dansa/Documents/")
theUrl1 <- "C:/Users/dansa/Documents/user_profile.csv.tar.gz"
theUrl2 <- "C:/Users/dansa/Documents/ad_feature.csv.tar.gz"
theUrl3 <- "C:/Users/dansa/Documents/raw_sample.csv.tar.gz"
userProfile <- exCSV(theUrl1)
adFeature <- exCSV(theUrl2)
rawSample <- exCSV(theUrl3)
#Rename columns according to the definition of columns
userProfile1 <- userProfile %>%
dplyr:: rename("user"=userid,
"cms_seg"= cms_segid,
"cms_group"=cms_group_id,
"gender"=final_gender_code,
"age_group"=age_level,
"purchase_level"=pvalue_level,
"college_graduate"=occupation,
"city_type"=new_user_class_level)%>%
select(-starts_with("cms"))
adFeature1 <- adFeature %>%
dplyr:: rename("ad"=adgroup_id,
"category"=cate_id,
"campaign"=campaign_id,
"ad_customer"=customer,
) #The price for the product, a product represented by an ad, a product belongs to a cateogry and a brand
rawSample1 <- rawSample %>%
dplyr:: rename("ad"=adgroup_id,
"ad_location"=pid,
"click"=clk
) %>%
mutate("date" = as.Date(as_datetime(time_stamp))) %>%
select(-nonclk, -time_stamp) #nonclk and clk are complement column. nonclk is removed.#The price for the product, a product represented by an ad, a product belongs to a cateogry and a brand
# mutate(total_click = nonclk + clk)
#Even though I tried to make use of all the data, my computer keeps giving me error message on file size. I decides to scale down the dataset by just pick up one tenth of the data.
set.seed(1001)
randRawSam <- rawSample1[sample(nrow(rawSample1), 2655000),]
#The adfeature and raw sample dataframe are joined together
rawad <- randRawSam %>%
left_join(adFeature1, by="ad")
#The user profile and raw sample dataframe are joined together
rawuser <- randRawSam %>%
left_join(userProfile1, by="user")%>%
select(-user, -ad_location, -date)%>%
na.omit()
#Create functions to change datasets and to plot top 20 set case by product category, brand, campaign and ad customer
sum20 <- function(xvar){
df<-rawad%>%
mutate(xvar=sapply(xvar, as.factor))%>%
group_by(xvar)%>%
summarize(n=n())%>%
arrange(desc(n))%>%
top_n(20)
return(df)
}
plotsum20 <- function(dfsum20){
ggplot(dfsum20)+
geom_col(mapping=aes(x=fct_reorder(xvar,n), y=n))+
coord_flip()+
ggtitle(paste("Top 20 "))
}
r <- vector(mode = "list", length = 4)
for(i in 1:4){
r[[i]] <- sum20(rawad[,(i+5)])
}
## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n
plotsum20(r[[1]])
plotsum20(r[[2]])
plotsum20(r[[3]])
plotsum20(r[[4]])
#Calculate the click %
mean(rawad$click)
## [1] 0.05144595
The click rate is about 5% which is high against the industrial benchmark of 1% as we normal purchase.
##Both Regression and classification approaches are implemented to model and predict results. Create model and predictor by Random Forest and Logistic Regression.
#Prepare matrix for recommender
# split data into testing & training
set.seed(1234)
# 80-20 train/test split
training_indexs <- createDataPartition(rawuser$click, p = .2, list = F)
training <- rawuser[training_indexs, ]
testing <- rawuser[-training_indexs, ]
predictors <- training %>% select(-click) %>% as.matrix()
output <- training$click %>% as.factor()
str(output)
## Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
class(output)
## [1] "factor"
model <- randomForest(x = predictors, y = output,
ntree = 50) # number of trees
# check out the details
model
##
## Call:
## randomForest(x = predictors, y = output, ntree = 50)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.09%
## Confusion matrix:
## 0 1 class.error
## 0 194051 0 0
## 1 10406 0 1
#rmse(predict(model, testing), testing$click)
tuned_model <- train(x = predictors, y = output,
ntree = 5, # number of trees (passed ot random forest)
method = "rf") # random forests
print(tuned_model)
## Random Forest
##
## 204457 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 204457, 204457, 204457, 204457, 204457, 204457, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9491393 -5.828845e-06
## 4 0.9458146 -4.072602e-04
## 7 0.9241460 2.415971e-04
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
ggplot(tuned_model)
modelGLM <- glm(click ~.,family=binomial(link='logit'),data=training)
#By using function summary() we obtain the results of our model:
summary(modelGLM)
##
## Call:
## glm(formula = click ~ ., family = binomial(link = "logit"), data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3579 -0.3308 -0.3259 -0.3117 2.5234
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.999e+00 1.527e-01 -19.639 < 2e-16 ***
## ad 1.166e-07 4.742e-08 2.460 0.0139 *
## gender 1.303e-01 2.441e-02 5.339 9.36e-08 ***
## age_group -4.947e-03 9.450e-03 -0.523 0.6006
## purchase_level 1.305e-03 1.790e-02 0.073 0.9419
## shopping_level -6.370e-02 4.597e-02 -1.386 0.1658
## college_graduate -5.813e-02 4.232e-02 -1.374 0.1696
## city_type -4.019e-03 1.097e-02 -0.367 0.7139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 82251 on 204456 degrees of freedom
## Residual deviance: 82208 on 204449 degrees of freedom
## AIC: 82224
##
## Number of Fisher Scoring iterations: 5
anova(modelGLM, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: click
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 204456 82251
## ad 1 10.2581 204455 82241 0.001361 **
## gender 1 28.7841 204454 82212 8.091e-08 ***
## age_group 1 0.0105 204453 82212 0.918484
## purchase_level 1 0.0079 204452 82212 0.929000
## shopping_level 1 1.9264 204451 82210 0.165148
## college_graduate 1 1.8730 204450 82208 0.171130
## city_type 1 0.1344 204449 82208 0.713926
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is very interesting to find out Only gender and college educated have significant impact to click.
##Building a model by the Recommender Packages
m <- randRawSam %>%
select(user, ad, click)
m_matrix <- as.matrix(m)
m_R <- as(m_matrix, "realRatingMatrix")
mm_R <- Recommender(m_R[1:1000],method="Popular")
Prediction by Recommender Packages as classification problem
recommm_R <- predict(mm_R, m_R[111:120], type="ratings")
recommm_R
## 10 x 3 rating matrix of class 'realRatingMatrix' with 0 ratings.
as(recommm_R, "matrix")[,1:2]
## user ad
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
## [7,] NA NA
## [8,] NA NA
## [9,] NA NA
## [10,] NA NA
A dataset for online advertising was used from Taobao which is an online shopping mall similar to Amazon.com. Both regression and classification was used fo analysis. Only gender and college educated are signigicant factors to determine if a click will happen based on the dataset.