Introduction

My company is spending a lot of money on Digital Marketing and I am the main person to purchase impressions and clicks from online media. While I am using an agency to purchase for us, I would like to find out the relationship between click and demography of users. I need to build a predictor model or build a basic recommender. It will help me to understand how to target audience for online advertisement purchase.

Initially, I was going to use the following as mentioned in my proposal: https://labs.criteo.com/2014/02/download-dataset/. The dataset doesn’t have column names and I don’t know what it means.

I download another dataset from Taobao/Alimama which is a subsidiary of Alibaba.

Overall Process

OSEMN Process 1. Obtain Data 2. Scrub Data 3. Explore Data 4. Model Data 5. Interpret Results

Data gathering

#Create a function to extract and read CSV. Download to local drive method.

exCSV <- function(theUrl){
  fname <- str_extract(theUrl, "[^/]+(?=\\.tar.gz$)")
  untar(theUrl, files=fname, exdir="C:/Users/dansa/Documents/")
  read_csv(file=fname, na = c("", "NA"))
}


#Set the working directory from which the files will be read from
setwd("C:/Users/dansa/Documents/")

theUrl1 <- "C:/Users/dansa/Documents/user_profile.csv.tar.gz"
theUrl2 <- "C:/Users/dansa/Documents/ad_feature.csv.tar.gz"
theUrl3 <- "C:/Users/dansa/Documents/raw_sample.csv.tar.gz"

userProfile <- exCSV(theUrl1)
adFeature <- exCSV(theUrl2)
rawSample <- exCSV(theUrl3)

Scrub and tidy up the data for analysis

#Rename columns according to the definition of columns

userProfile1 <- userProfile %>%
  dplyr:: rename("user"=userid,        
                 "cms_seg"= cms_segid,
                 "cms_group"=cms_group_id, 
                 "gender"=final_gender_code,
                 "age_group"=age_level,
                 "purchase_level"=pvalue_level,
                 "college_graduate"=occupation,
                 "city_type"=new_user_class_level)%>%
  select(-starts_with("cms"))

adFeature1 <- adFeature %>%
  dplyr:: rename("ad"=adgroup_id,        
                 "category"=cate_id,
                 "campaign"=campaign_id, 
                 "ad_customer"=customer,
                 ) #The price for the product, a product represented by an ad, a product belongs to a cateogry and a brand

rawSample1 <- rawSample %>%
  dplyr:: rename("ad"=adgroup_id,        
                 "ad_location"=pid,
                 "click"=clk
                ) %>%
  mutate("date" = as.Date(as_datetime(time_stamp))) %>%
  select(-nonclk, -time_stamp) #nonclk and clk are complement column. nonclk is removed.#The price for the product, a product represented by an ad, a product belongs to a cateogry and a brand
#  mutate(total_click = nonclk + clk)

Exploring the dataset

#Even though I tried to make use of all the data, my computer keeps giving me error message on file size. I decides to scale down the dataset by just pick up one tenth of the data. 
set.seed(1001)
randRawSam <- rawSample1[sample(nrow(rawSample1), 2655000),] 

#The adfeature and raw sample dataframe are joined together
rawad <- randRawSam %>%
  left_join(adFeature1, by="ad")

#The user profile and raw sample dataframe are joined together
rawuser <- randRawSam %>%
  left_join(userProfile1, by="user")%>%
  select(-user, -ad_location, -date)%>%
  na.omit()

#Create functions to change datasets and to plot top 20 set case by product category, brand, campaign and ad customer

sum20 <- function(xvar){ 
df<-rawad%>%
  mutate(xvar=sapply(xvar, as.factor))%>%
  group_by(xvar)%>%
  summarize(n=n())%>%
  arrange(desc(n))%>%
  top_n(20)
return(df)
}
  
plotsum20 <- function(dfsum20){
ggplot(dfsum20)+
  geom_col(mapping=aes(x=fct_reorder(xvar,n), y=n))+
  coord_flip()+
  ggtitle(paste("Top 20 "))
}

r <- vector(mode = "list", length = 4)

for(i in 1:4){
 r[[i]] <- sum20(rawad[,(i+5)])
}

## Selecting by n
## Selecting by n
## Selecting by n
## Selecting by n

plotsum20(r[[1]])

plotsum20(r[[2]])

plotsum20(r[[3]])

plotsum20(r[[4]])

#Calculate the click %
mean(rawad$click)

## [1] 0.05144595

The click rate is about 5% which is high against the industrial benchmark of 1% as we normal purchase.

Analysis Section.

##Both Regression and classification approaches are implemented to model and predict results. Create model and predictor by Random Forest and Logistic Regression.

#Prepare matrix for recommender
# split data into testing & training
set.seed(1234)

# 80-20 train/test split 
training_indexs <- createDataPartition(rawuser$click, p = .2, list = F)
training <- rawuser[training_indexs, ]
testing  <- rawuser[-training_indexs, ]
predictors <- training %>% select(-click) %>% as.matrix()
output <- training$click %>% as.factor()
str(output)

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

class(output)

## [1] "factor"

model <- randomForest(x = predictors, y = output,
                      ntree = 50) # number of trees

# check out the details
model

## 
## Call:
##  randomForest(x = predictors, y = output, ntree = 50) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 5.09%
## Confusion matrix:
##        0 1 class.error
## 0 194051 0           0
## 1  10406 0           1

#rmse(predict(model, testing), testing$click)

tuned_model <- train(x = predictors, y = output,
                     ntree = 5, # number of trees (passed ot random forest)
                     method = "rf") # random forests

print(tuned_model)

## Random Forest 
## 
## 204457 samples
##      7 predictor
##      2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 204457, 204457, 204457, 204457, 204457, 204457, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa        
##   2     0.9491393  -5.828845e-06
##   4     0.9458146  -4.072602e-04
##   7     0.9241460   2.415971e-04
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

ggplot(tuned_model)

modelGLM <- glm(click ~.,family=binomial(link='logit'),data=training)

#By using function summary() we obtain the results of our model:

summary(modelGLM)

## 
## Call:
## glm(formula = click ~ ., family = binomial(link = "logit"), data = training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3579  -0.3308  -0.3259  -0.3117   2.5234  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.999e+00  1.527e-01 -19.639  < 2e-16 ***
## ad                1.166e-07  4.742e-08   2.460   0.0139 *  
## gender            1.303e-01  2.441e-02   5.339 9.36e-08 ***
## age_group        -4.947e-03  9.450e-03  -0.523   0.6006    
## purchase_level    1.305e-03  1.790e-02   0.073   0.9419    
## shopping_level   -6.370e-02  4.597e-02  -1.386   0.1658    
## college_graduate -5.813e-02  4.232e-02  -1.374   0.1696    
## city_type        -4.019e-03  1.097e-02  -0.367   0.7139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82251  on 204456  degrees of freedom
## Residual deviance: 82208  on 204449  degrees of freedom
## AIC: 82224
## 
## Number of Fisher Scoring iterations: 5

anova(modelGLM, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: click
## 
## Terms added sequentially (first to last)
## 
## 
##                  Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                            204456      82251              
## ad                1  10.2581    204455      82241  0.001361 ** 
## gender            1  28.7841    204454      82212 8.091e-08 ***
## age_group         1   0.0105    204453      82212  0.918484    
## purchase_level    1   0.0079    204452      82212  0.929000    
## shopping_level    1   1.9264    204451      82210  0.165148    
## college_graduate  1   1.8730    204450      82208  0.171130    
## city_type         1   0.1344    204449      82208  0.713926    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It is very interesting to find out Only gender and college educated have significant impact to click.

##Building a model by the Recommender Packages

m <- randRawSam %>%
 select(user, ad, click)
m_matrix <- as.matrix(m)
m_R <- as(m_matrix, "realRatingMatrix")

mm_R <- Recommender(m_R[1:1000],method="Popular")

Prediction by Recommender Packages as classification problem

recommm_R <- predict(mm_R, m_R[111:120], type="ratings")
recommm_R

## 10 x 3 rating matrix of class 'realRatingMatrix' with 0 ratings.

as(recommm_R, "matrix")[,1:2]

##       user ad
##  [1,]   NA NA
##  [2,]   NA NA
##  [3,]   NA NA
##  [4,]   NA NA
##  [5,]   NA NA
##  [6,]   NA NA
##  [7,]   NA NA
##  [8,]   NA NA
##  [9,]   NA NA
## [10,]   NA NA

Conclusion

A dataset for online advertising was used from Taobao which is an online shopping mall similar to Amazon.com. Both regression and classification was used fo analysis. Only gender and college educated are signigicant factors to determine if a click will happen based on the dataset.

DATA 607: Final Project

Chunsan Yip