Display Advertising Challenge

(https://www.kaggle.com/c/criteo-display-ad-challenge/data)

The data was no longer available at the speciefied link

I was able to find it from Criteo. (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/)

I suspect it is not the exact data set used for the challenge because of the number, oder, name of the variables. It consists of more than 46M rows. I used random sampling to obtain 300000 rows of data. /with respect to my available resources/

About the data Data fields Label - Target variable that indicates if an ad was clicked (1) or not (0). I1-I13 - A total of 13 columns of integer features (mostly count features). C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. The semantic of the features is undisclosed.

When a value is missing, the field is empty.

Until I know more about how to deal with the hashed values I would leave them out from the analysis. /“Hash functions may output the same value for different inputs” - To me that doesn’t speak accuracy in the calculations. But as I said. I will pay attention to that later.

I plan to use Logistic regression for this task.

library(readr)
# load data
train300 <- read_csv("C://Users/Public/Documents/Tiharis R/train300.csv", col_names = FALSE)[,1:13]


for (i in c(2:ncol(train300))) {
      train300[,i] <- as.numeric(as.integer(as.character(unlist(train300[,i]))))
}

# Removing allegedly a categorical variable that left.
train300$X10 <- NULL

# train300$X1 <- as.factor(train300$X1)

EDA

summary(train300)

##        X1               X2                 X3                 X4          
##  Min.   :0.0000   Min.   :   -2.00   Min.   :      -2   Min.   :       0  
##  1st Qu.:0.0000   1st Qu.:    0.00   1st Qu.:       1   1st Qu.:       2  
##  Median :0.0000   Median :    1.00   Median :       5   Median :       6  
##  Mean   :0.2561   Mean   :   44.82   Mean   :    3867   Mean   :    8550  
##  3rd Qu.:1.0000   3rd Qu.:    7.00   3rd Qu.:      36   3rd Qu.:      27  
##  Max.   :1.0000   Max.   :10469.00   Max.   :87552397   Max.   :87552397  
##                                      NA's   :145        NA's   :984       
##        X5                 X6                  X7          
##  Min.   :       0   Min.   :        0   Min.   :       0  
##  1st Qu.:       3   1st Qu.:        2   1st Qu.:       2  
##  Median :      14   Median :       26   Median :       8  
##  Mean   :   27224   Mean   :    22350   Mean   :   24135  
##  3rd Qu.:    2839   3rd Qu.:      391   3rd Qu.:      30  
##  Max.   :98237733   Max.   :511156000   Max.   :98237733  
##  NA's   :3884       NA's   :6391        NA's   :8526      
##        X8                  X9                 X11           
##  Min.   :        0   Min.   :        0   Min.   :        0  
##  1st Qu.:        2   1st Qu.:        1   1st Qu.:        0  
##  Median :        7   Median :       10   Median :        1  
##  Mean   :    76526   Mean   :   133534   Mean   :   495323  
##  3rd Qu.:       22   3rd Qu.:       33   3rd Qu.:        2  
##  Max.   :511156000   Max.   :961290000   Max.   :511156000  
##  NA's   :17255       NA's   :32082       NA's   :94325      
##       X12                 X13           
##  Min.   :        0   Min.   :        0  
##  1st Qu.:        1   1st Qu.:        0  
##  Median :        2   Median :        3  
##  Mean   :   820474   Mean   :   865780  
##  3rd Qu.:        5   3rd Qu.:        9  
##  Max.   :705740000   Max.   :511156000  
##  NA's   :144992      NA's   :172115

From the summary I see that all the variables are heavily skewed to the right. Mean, Median difference is huge. I see many missing values, so I’ll plot them to see whether there is a pattern or something that might speak up. The data resides above the 3rd quartile meaning that 75% of the data is less than the number corespoding to the 3rd quartile. In context of the variable I can’t speak, so I can’t take any well-argumented decisions. I may treat these extreme high data points as are or bin the variable. I also think about the NA’s - for some variables missing values may make sense. For that they should be treated accordingly. We’ll see. What I also notice is the same extreme value ‘511156000’ in variables X6, X8, X11 and X13. The same is with X3 and X4 ‘87552397’ and X5 and X7 ‘98237733’. There is a class bias for the response varbiable X1. I may sample more balanced data for the model but we’ll see.

# I will get in numerical form the skewness and the kurtosis
library(psych)
describe(train300)

##     vars      n      mean         sd median trimmed   mad min       max
## X1     1 300000      0.26       0.44      0    0.20  0.00   0         1
## X2     2 300000     44.82     244.52      1    4.52  1.48  -2     10469
## X3     3 299855   3867.12  278537.78      5   28.36  7.41  -2  87552397
## X4     4 299016   8549.67  621113.54      6   59.24  7.41   0  87552397
## X5     5 296116  27223.75 1205105.10     14 1580.13 20.76   0  98237733
## X6     6 293609  22350.18 1591223.37     26  272.60 38.55   0 511156000
## X7     7 291474  24135.33 1373172.85      8   16.37 10.38   0  98237733
## X8     8 282745  76526.25 2609410.78      7   12.22  8.90   0 511156000
## X9     9 267918 133533.54 3746332.25     10   16.75 14.83   0 961290000
## X11   10 205675 495323.35 6383570.38      1    1.25  1.48   0 511156000
## X12   11 155008 820474.47 8458144.95      2    3.19  1.48   0 705740000
## X13   12 127885 865779.51 8096012.92      3    4.93  4.45   0 511156000
##         range   skew kurtosis       se
## X1          1   1.12    -0.75     0.00
## X2      10471   9.88   124.17     0.45
## X3   87552399 310.70 97646.35   508.66
## X4   87552397 140.47 19794.77  1135.86
## X5   98237733  71.95  5220.82  2214.59
## X6  511156000 148.07 38186.34  2936.61
## X7   98237733  61.02  3819.79  2543.46
## X8  511156000  54.38  6118.15  4907.33
## X9  961290000  80.19 16565.39  7237.78
## X11 511156000  15.58   386.37 14075.79
## X12 705740000  19.63   960.43 21483.16
## X13 511156000  14.76   460.31 22639.22

I plan to log transform the data in order to reduce the skew. Too much skewness will mess up the statistical techinques.

library(Amelia)
# plot NA's
missmap(train300)

# missing values in percentage
round(sapply(train300, function(x) sum(length(which(is.na(x))))) / nrow(train300),3)*100

##   X1   X2   X3   X4   X5   X6   X7   X8   X9  X11  X12  X13 
##  0.0  0.0  0.0  0.3  1.3  2.1  2.8  5.8 10.7 31.4 48.3 57.4

After seeing the missing NA’s plot I decide to remove my attention from the variables X12 and X13. Imputing any numbers would lead to creating data insead of invastigating it. I see monotone missing data pattern - when a value is missing in a column it continuos to be missing in the next column. That might be a bit misleading because of the condensed plot. Probably X11 will go away as well but first I will investigate it further. The imputation I’m considering is k-NN /k nearest neighbor/.But we’ll see.

# Copy data
train300_copy <-  train300

Histograms

par(mfrow = c(6, 2))
par(cex = 0.6)
# I will do log transformation of the variables in order to see their distribution a little bit better.
hist(log(train300_copy$X13+1),breaks=80)
rug(jitter(log(train300_copy$X13+1)))
hist(log(train300_copy$X12+1),breaks=80)
rug(jitter(log(train300_copy$X12+1)))
hist(log(train300_copy$X11+1),breaks=80)
rug(jitter(log(train300_copy$X11+1)))
hist(log(train300_copy$X9+1),breaks=80)
rug(jitter(log(train300_copy$X9+1)))
hist(log(train300_copy$X8+1),breaks=80)
rug(jitter(log(train300_copy$X8+1)))
hist(log(train300_copy$X8+1),breaks=80)
rug(jitter(log(train300_copy$X8+1)))
hist(log(train300_copy$X6+1),breaks=80)
rug(jitter(log(train300_copy$X6+1)))
hist(log(train300_copy$X5+1),breaks=80)
rug(jitter(log(train300_copy$X5+1)))
hist(log(train300_copy$X4+1),breaks=80)
rug(jitter(log(train300_copy$X4+1)))
hist(log(train300_copy$X3+3),breaks=80)
rug(jitter(log(train300_copy$X3+1)))
hist(log(train300_copy$X2+3),breaks=80)
rug(jitter(log(train300_copy$X2+3)))

Boxplots

# yet another way of seeing the distribution and the variability of the variables
boxplot(train300, main = "Boxplot of not transformed variables")

boxplot(log(train300[,-1]+3),main = "Boxplot of log-transformed variables")

Schould I treat the extreme observations as outliers or not? They represent 25% of the data. That perhaps shouldn’t be ignored.

# Removing the last 3 variables from the analysis because of too much missing data 
train300_copy$X13 <- NULL
train300_copy$X12 <- NULL
train300_copy$X11 <- NULL

# Find the mode for every variable
MODE <- function(dataframe){
            DF <- as.data.frame(dataframe)
            
            MODE2 <- function(x){      
                  if (is.numeric(x) == FALSE){
                        df <- as.data.frame(table(x))  
                        df <- df[order(df$Freq), ]         
                        m <- max(df$Freq)        
                        MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))
                        
                        if (sum(df$Freq)/length(df$Freq)==1){
                              warning("No Mode: Frequency of all values is 1", call. = FALSE)
                        }else{
                              return(MODE1)
                        }
                        
                  }else{ 
                        df <- as.data.frame(table(x))  
                        df <- df[order(df$Freq), ]         
                        m <- max(df$Freq)        
                        MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))
                        
                        if (sum(df$Freq)/length(df$Freq)==1){
                              warning("No Mode: Frequency of all values is 1", call. = FALSE)
                        }else{
                              return(MODE1)
                        }
                  }
            }
            
            return(as.vector(lapply(DF, MODE2)))
      }
# Display the mode of every variable
as.data.frame(MODE(train300_copy))

##   X1 X2 X3 X4 X5 X6 X7 X8 X9
## 1  0  0  1  1  0  0  0  1  0

I am considering an option to separate the variables at their mode and thus making them categorical ones. I can’t yet decide whether to do it with 2 levels or 3 or 4.. or leave them as they are

# How many missing values for each variable 
sapply(train300_copy,function(x) sum(is.na(x)))

##    X1    X2    X3    X4    X5    X6    X7    X8    X9 
##     0     0   145   984  3884  6391  8526 17255 32082

# how many unique values for each variable
sapply(train300_copy, function(x) length(unique(x)))

##    X1    X2    X3    X4    X5    X6    X7    X8    X9 
##     2  2850 16042 19127 32116 11580  1892  1296  1510

I will idetify the veriables with zero variance
freqCut: looks at the ratio of the most common value to the second most common value uniqueCut: percentage of distinct values out of the number of total samples
the percent of unique values, the number of unique values divided by the total number of samples (times 100)

library(caret) nearZeroVar(train300_copy, names = TRUE, freqCut = 19, uniqueCut = 10) nearZeroVar(train300_copy, names = TRUE, freqCut = 2, uniqueCut = 20) X1, X6, X9

I’ll keep those variables in mind.
# Imputation I may also try the default setting for the number of multiple imputations m, which is 5.
maxit - the number of iterations which is again 5. But for large sample size 10 and up may give better results. I will do it for now with low settings

# Imputing data with predictive mean matching
library(mice)
imputedData300 <- mice(train300_copy, m=2, maxit = 5, method = 'pmm', seed = 500)

## 
##  iter imp variable
##   1   1  X3  X4  X5  X6  X7  X8  X9
##   1   2  X3  X4  X5  X6  X7  X8  X9
##   2   1  X3  X4  X5  X6  X7  X8  X9
##   2   2  X3  X4  X5  X6  X7  X8  X9
##   3   1  X3  X4  X5  X6  X7  X8  X9
##   3   2  X3  X4  X5  X6  X7  X8  X9
##   4   1  X3  X4  X5  X6  X7  X8  X9
##   4   2  X3  X4  X5  X6  X7  X8  X9
##   5   1  X3  X4  X5  X6  X7  X8  X9
##   5   2  X3  X4  X5  X6  X7  X8  X9

# I will chose the second dataset and combine it with the original data 
train300_copy_imputed <- complete(imputedData300, 2)
# Check for NA's 
anyNA(train300_copy_imputed)

## [1] FALSE

# Compare summary statistics
summary(train300_copy)

##        X1               X2                 X3                 X4          
##  Min.   :0.0000   Min.   :   -2.00   Min.   :      -2   Min.   :       0  
##  1st Qu.:0.0000   1st Qu.:    0.00   1st Qu.:       1   1st Qu.:       2  
##  Median :0.0000   Median :    1.00   Median :       5   Median :       6  
##  Mean   :0.2561   Mean   :   44.82   Mean   :    3867   Mean   :    8550  
##  3rd Qu.:1.0000   3rd Qu.:    7.00   3rd Qu.:      36   3rd Qu.:      27  
##  Max.   :1.0000   Max.   :10469.00   Max.   :87552397   Max.   :87552397  
##                                      NA's   :145        NA's   :984       
##        X5                 X6                  X7          
##  Min.   :       0   Min.   :        0   Min.   :       0  
##  1st Qu.:       3   1st Qu.:        2   1st Qu.:       2  
##  Median :      14   Median :       26   Median :       8  
##  Mean   :   27224   Mean   :    22350   Mean   :   24135  
##  3rd Qu.:    2839   3rd Qu.:      391   3rd Qu.:      30  
##  Max.   :98237733   Max.   :511156000   Max.   :98237733  
##  NA's   :3884       NA's   :6391        NA's   :8526      
##        X8                  X9           
##  Min.   :        0   Min.   :        0  
##  1st Qu.:        2   1st Qu.:        1  
##  Median :        7   Median :       10  
##  Mean   :    76526   Mean   :   133534  
##  3rd Qu.:       22   3rd Qu.:       33  
##  Max.   :511156000   Max.   :961290000  
##  NA's   :17255       NA's   :32082

summary(train300_copy_imputed)

##        X1               X2                 X3                 X4          
##  Min.   :0.0000   Min.   :   -2.00   Min.   :      -2   Min.   :       0  
##  1st Qu.:0.0000   1st Qu.:    0.00   1st Qu.:       1   1st Qu.:       2  
##  Median :0.0000   Median :    1.00   Median :       5   Median :       6  
##  Mean   :0.2561   Mean   :   44.82   Mean   :    3869   Mean   :    8842  
##  3rd Qu.:1.0000   3rd Qu.:    7.00   3rd Qu.:      36   3rd Qu.:      27  
##  Max.   :1.0000   Max.   :10469.00   Max.   :87552397   Max.   :87552397  
##        X5                 X6                  X7          
##  Min.   :       0   Min.   :        0   Min.   :       0  
##  1st Qu.:       3   1st Qu.:        2   1st Qu.:       2  
##  Median :      14   Median :       25   Median :       8  
##  Mean   :   33876   Mean   :    50121   Mean   :   61858  
##  3rd Qu.:    2789   3rd Qu.:      367   3rd Qu.:      30  
##  Max.   :98237733   Max.   :511156000   Max.   :98237733  
##        X8                  X9           
##  Min.   :        0   Min.   :        0  
##  1st Qu.:        2   1st Qu.:        1  
##  Median :        7   Median :        9  
##  Mean   :   268595   Mean   :   476578  
##  3rd Qu.:       22   3rd Qu.:       32  
##  Max.   :511156000   Max.   :961290000

The summary comparison reveals to what extent I influenced the variables. Some of them remain solid, some are influenced a bit, In my oppinion it is quite survivable. I plan to use both imputed data sets to compare their performance. # Log transform

# Copy the imputed data
train300_copy_imputed_log <- train300_copy_imputed
# Transform every variable but the response variable
train300_copy_imputed_log[,-1] <- log(train300_copy_imputed_log[, -1] + 3)
summary(train300_copy_imputed_log)

##        X1               X2              X3               X4        
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.000   Min.   : 1.099  
##  1st Qu.:0.0000   1st Qu.:1.099   1st Qu.: 1.386   1st Qu.: 1.609  
##  Median :0.0000   Median :1.386   Median : 2.079   Median : 2.197  
##  Mean   :0.2561   Mean   :1.978   Mean   : 2.965   Mean   : 3.170  
##  3rd Qu.:1.0000   3rd Qu.:2.303   3rd Qu.: 3.664   3rd Qu.: 3.401  
##  Max.   :1.0000   Max.   :9.256   Max.   :18.288   Max.   :18.288  
##        X5               X6               X7               X8        
##  Min.   : 1.099   Min.   : 1.099   Min.   : 1.099   Min.   : 1.099  
##  1st Qu.: 1.792   1st Qu.: 1.609   1st Qu.: 1.609   1st Qu.: 1.609  
##  Median : 2.833   Median : 3.332   Median : 2.398   Median : 2.303  
##  Mean   : 4.493   Mean   : 3.923   Mean   : 2.694   Mean   : 2.587  
##  3rd Qu.: 7.935   3rd Qu.: 5.914   3rd Qu.: 3.497   3rd Qu.: 3.219  
##  Max.   :18.403   Max.   :20.052   Max.   :18.403   Max.   :20.052  
##        X9        
##  Min.   : 1.099  
##  1st Qu.: 1.386  
##  Median : 2.485  
##  Mean   : 2.762  
##  3rd Qu.: 3.555  
##  Max.   :20.684

# Change names accordingly
names(train300_copy_imputed_log) <- paste0(names(train300_copy_imputed_log), "_log")
names(train300_copy_imputed_log)[names(train300_copy_imputed_log) == "X1_log"] <- "X1"

Split data

I will randomly reorder the data and then split it into train and test sets.

# Shuffle rows
shuf <- sample(nrow(train300_copy_imputed_log))
# Randomly order data
train300_copy_imputed_log <- train300_copy_imputed_log[shuf,]
# Identify row to split on
split <- round(nrow(train300_copy_imputed_log) * .80)
# Create train
train_train300_copy_imputed_log <- train300_copy_imputed_log[1:split,]
# Create test
test_train300_copy_imputed_log <- train300_copy_imputed_log[(split + 1):nrow(train300_copy_imputed_log), ]
# check splitting
nrow(train_train300_copy_imputed_log)/nrow(train300_copy_imputed_log)

## [1] 0.8

str(train_train300_copy_imputed_log)

## 'data.frame':    240000 obs. of  9 variables:
##  $ X1    : int  1 1 0 0 1 0 0 0 1 0 ...
##  $ X2_log: num  2.56 1.95 1.39 3.89 1.1 ...
##  $ X3_log: num  4.17 2.48 8.41 2.94 3.26 ...
##  $ X4_log: num  1.79 2.4 1.1 7.6 1.39 ...
##  $ X5_log: num  2.4 8.3 1.1 3.64 1.39 ...
##  $ X6_log: num  6.3 5.24 3 2.3 7.81 ...
##  $ X7_log: num  2.4 3.43 1.1 3.53 2.71 ...
##  $ X8_log: num  4.64 3.3 1.95 3.64 2.3 ...
##  $ X9_log: num  2.94 5.67 1.39 1.39 1.39 ...

Model building

# 1 glm model
LogModel1 <- glm(X1 ~ ., data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel1)

## 
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5427  -0.7853  -0.6694   1.2642   3.4178  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.327849   0.022618  14.495  < 2e-16 ***
## X2_log      -0.001664   0.003773  -0.441    0.659    
## X3_log      -0.034692   0.002132 -16.275  < 2e-16 ***
## X4_log      -0.081383   0.002117 -38.444  < 2e-16 ***
## X5_log      -0.128464   0.001655 -77.633  < 2e-16 ***
## X6_log      -0.082325   0.002306 -35.708  < 2e-16 ***
## X7_log      -0.126601   0.004067 -31.130  < 2e-16 ***
## X8_log       0.053021   0.003028  17.511  < 2e-16 ***
## X9_log       0.011844   0.002441   4.852 1.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 264812  on 239991  degrees of freedom
## AIC: 264830
## 
## Number of Fisher Scoring iterations: 4

# 2 glm model
LogModel2 <- glm(X1 ~ . - X2_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel2)

## 
## Call:
## glm(formula = X1 ~ . - X2_log, family = binomial, data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5420  -0.7853  -0.6694   1.2641   3.4171  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.324051   0.020914  15.495  < 2e-16 ***
## X3_log      -0.034643   0.002129 -16.274  < 2e-16 ***
## X4_log      -0.081385   0.002117 -38.447  < 2e-16 ***
## X5_log      -0.128535   0.001647 -78.043  < 2e-16 ***
## X6_log      -0.082155   0.002273 -36.145  < 2e-16 ***
## X7_log      -0.126508   0.004061 -31.150  < 2e-16 ***
## X8_log       0.052958   0.003024  17.511  < 2e-16 ***
## X9_log       0.011821   0.002440   4.844 1.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 264812  on 239992  degrees of freedom
## AIC: 264828
## 
## Number of Fisher Scoring iterations: 4

# 3 glm model /whitout the near zero variance variables one at a time/
LogModel3 <- glm(X1 ~ . - X2_log - X9_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel3)

## 
## Call:
## glm(formula = X1 ~ . - X2_log - X9_log, family = binomial, data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5045  -0.7857  -0.6698   1.2631   3.4348  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.333534   0.020831   16.01   <2e-16 ***
## X3_log      -0.032616   0.002085  -15.64   <2e-16 ***
## X4_log      -0.082059   0.002113  -38.84   <2e-16 ***
## X5_log      -0.127765   0.001639  -77.94   <2e-16 ***
## X6_log      -0.080386   0.002244  -35.83   <2e-16 ***
## X7_log      -0.125820   0.004064  -30.96   <2e-16 ***
## X8_log       0.055640   0.002972   18.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 264836  on 239993  degrees of freedom
## AIC: 264850
## 
## Number of Fisher Scoring iterations: 4

# 4 glm model /whitout the near zero variance variables one at a time/
LogModel4 <- glm(X1 ~ . - X2_log - X9_log -X6_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel4)

## 
## Call:
## glm(formula = X1 ~ . - X2_log - X9_log - X6_log, family = binomial, 
##     data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4255  -0.7941  -0.6768   1.3276   3.3884  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.027749   0.019232   1.443    0.149    
## X3_log      -0.016831   0.002052  -8.202 2.36e-16 ***
## X4_log      -0.059662   0.002027 -29.434  < 2e-16 ***
## X5_log      -0.115618   0.001599 -72.290  < 2e-16 ***
## X7_log      -0.188458   0.003732 -50.500  < 2e-16 ***
## X8_log       0.052585   0.002991  17.578  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 266132  on 239994  degrees of freedom
## AIC: 266144
## 
## Number of Fisher Scoring iterations: 4

The interpretation of the coefficients will be a challenge for me, because the glm model gives the coefficients in the log odds scale and I already took the natural log of them.but I’ll give it a try. For variables X3_log to X7_log I see decrease from The log odds ratio associated with a 2.72 fold change in the predictor. By just looking at the sign of the estimates I may say that which one has positive or negative affect on the response variable X1 which represents whether the add was clicked on or not. I will show the LogModel2 in odds scale for hopefully easier interpretation. For variables X3_log to X7_log I see assosiated decrease in the odds of Click. for example: X3_log would be interpreted as: each additional log unit of variable X3_log is assosiated with 3.4% decrease in the odds of Click. The difference between the null deviance and the residual deviance is not big. It would mean/when there is a bigger gap/ that variables add value to the model.

# Coefficients and their CI of LogModel2 in the ddds scale
exp(cbind(Odds = coef(LogModel2), confint(LogModel2)))

##                  Odds     2.5 %    97.5 %
## (Intercept) 1.3827174 1.3271997 1.4405878
## X3_log      0.9659497 0.9619244 0.9699851
## X4_log      0.9218387 0.9180171 0.9256664
## X5_log      0.8793826 0.8765459 0.8822232
## X6_log      0.9211291 0.9170334 0.9252405
## X7_log      0.8811671 0.8741696 0.8881977
## X8_log      1.0543851 1.0481535 1.0606541
## X9_log      1.0118908 1.0070520 1.0167324

Here I see that no coefficients in the odds scale crosses zero which indicates to me that all of the variables have to some extent, even small one/ influence one the response variable. Apart from X8_log and X9_log that have a tiny positive influence all other remaining variables have a negative one.
# AIC comparison Comparing models by their Akaike Information Criterion - evaluating the quality of the model

# AIC comparison
AIC(LogModel1,LogModel2,LogModel3,LogModel4)

##           df      AIC
## LogModel1  9 264830.2
## LogModel2  8 264828.4
## LogModel3  7 264849.6
## LogModel4  6 266143.5

Model 2 is better. I will continue with it.

Predictions

# prediction
p <- predict(LogModel2, test_train300_copy_imputed_log, type = "response")
# set prediction probability for the model
p_class <- ifelse(p > 0.2, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)

##          actual
## predicted     0     1
##         0 14366  2878
##         1 30362 12394

# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)

## [1] 0.446

If I set the cutoff value to be 0.20 I would get more clicks but with less certanty than if I set it to 0.6.

# set prediction probability for the model
p_class <- ifelse(p > 0.6, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)

##          actual
## predicted     0     1
##         0 44686 15268
##         1    42     4

# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)

## [1] 0.7448333

ROC curve and AUC

I will draw a ROC curve and see the possible cutoff point as welll as the overall performance of the model across these thresholds.

library(caTools)
# Draw a ROC curve
colAUC(p, test_train300_copy_imputed_log$X1, plotROC = T)

##              [,1]
## 0 vs. 1 0.6229923

I would choose to set the threshold a bit over the average probability of success

# Check once again the average prop of clicks
mean(test_train300_copy_imputed_log$X1)

## [1] 0.2545333

# I decide to leave it at 0.26
p_class <- ifelse(p > 0.26, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)

##          actual
## predicted     0     1
##         0 28578  6922
##         1 16150  8350

# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)

## [1] 0.6154667

Performance

sensitivity <-  8350 / (8350 + 16150 )
specificity <-  28578 / (28578 + 6922)
sensitivity

## [1] 0.3408163

specificity

## [1] 0.8050141

The model correctly identifies 34% of the clicks The model correctly identifies 81% of no clicks

# McFadden R2 index can be used to assess the model fit
library(pscl)
pR2(LogModel2)

##           llh       llhNull            G2      McFadden          r2ML 
## -1.324062e+05 -1.366378e+05  8.463247e+03  3.096964e-02  3.464901e-02 
##          r2CU 
##  5.097329e-02

Well, now I am confused. # Second imputed dataset model

# GLM for both of the imputed data set
ModelIIset <- with(imputedData300, glm(X1 ~ ., data = train_train300_copy_imputed_log, family = binomial))
summary(ModelIIset)

## 
##  ## summary of imputation 1 :
## 
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5427  -0.7853  -0.6694   1.2642   3.4178  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.327849   0.022618  14.495  < 2e-16 ***
## X2_log      -0.001664   0.003773  -0.441    0.659    
## X3_log      -0.034692   0.002132 -16.275  < 2e-16 ***
## X4_log      -0.081383   0.002117 -38.444  < 2e-16 ***
## X5_log      -0.128464   0.001655 -77.633  < 2e-16 ***
## X6_log      -0.082325   0.002306 -35.708  < 2e-16 ***
## X7_log      -0.126601   0.004067 -31.130  < 2e-16 ***
## X8_log       0.053021   0.003028  17.511  < 2e-16 ***
## X9_log       0.011844   0.002441   4.852 1.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 264812  on 239991  degrees of freedom
## AIC: 264830
## 
## Number of Fisher Scoring iterations: 4
## 
## 
##  ## summary of imputation 2 :
## 
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5427  -0.7853  -0.6694   1.2642   3.4178  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.327849   0.022618  14.495  < 2e-16 ***
## X2_log      -0.001664   0.003773  -0.441    0.659    
## X3_log      -0.034692   0.002132 -16.275  < 2e-16 ***
## X4_log      -0.081383   0.002117 -38.444  < 2e-16 ***
## X5_log      -0.128464   0.001655 -77.633  < 2e-16 ***
## X6_log      -0.082325   0.002306 -35.708  < 2e-16 ***
## X7_log      -0.126601   0.004067 -31.130  < 2e-16 ***
## X8_log       0.053021   0.003028  17.511  < 2e-16 ***
## X9_log       0.011844   0.002441   4.852 1.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 273276  on 239999  degrees of freedom
## Residual deviance: 264812  on 239991  degrees of freedom
## AIC: 264830
## 
## Number of Fisher Scoring iterations: 4

Both of the imputed datasets yield the same results in respect of the significance of the variables.

Cor plot

# correlation
library(corrplot)
correlations_o <- round(cor(train300_copy, use = "complete.obs",method = "spearman"),2)
corrplot(correlations_o, method = "number")

correlations_i <- round(cor(train300_copy_imputed, method = "spearman"),2)
corrplot(correlations_i, method = "number")

correlations_il <- round(cor(log(train300_copy_imputed_log),method = "spearman"),2)
corrplot(correlations_il, method = "number")

pl <- plot(train300_copy_imputed_log, col=rgb(0,0,0,40,maxColorValue=255) , pch = 16)

pl

## NULL

Caret package

I would like to train the model using the caret package as well.

# Copy data
caret300 <- train300_copy
# Delete variables with many NA's
caret300$X13 <- NULL
caret300$X12 <- NULL
caret300$X11 <- NULL
# Rename response variable
names(caret300)[names(caret300) == 'X1'] <- 'Class'
# Make the response variabel a factor
caret300$Class <- as.factor(caret300$Class)
# Renane
library(plyr)
caret300$Class <- mapvalues(caret300$Class, from = c("0", "1"), to = c("N", "C"))
# Create more balanced data in order the model to learn better.
library(dplyr)
caret300_b <- caret300 %>%
              group_by(Class) %>%
              sample_n(75000)

caret300_b[,-1] <- log(caret300_b[,-1] + 3)

str(caret300_b)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  150000 obs. of  9 variables:
##  $ Class: Factor w/ 2 levels "N","C": 1 1 1 1 1 1 1 1 1 1 ...
##  $ X2   : num  1.1 5.15 1.1 1.1 6.07 ...
##  $ X3   : num  3.89 4.5 1.39 2.08 8.87 ...
##  $ X4   : num  2.56 1.1 8.66 8.38 1.1 ...
##  $ X5   : num  10.6 8.66 1.1 1.1 1.1 ...
##  $ X6   : num  6.28 1.1 1.61 1.1 2.89 ...
##  $ X7   : num  1.1 2.48 1.79 2.08 1.1 ...
##  $ X8   : num  2.64 3.97 1.1 1.1 NA ...
##  $ X9   : num  2.64 1.1 NA NA NA ...
##  - attr(*, "vars")= chr "Class"
##  - attr(*, "drop")= logi TRUE
##  - attr(*, "indices")=List of 2
##   ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
##   ..$ : int  75000 75001 75002 75003 75004 75005 75006 75007 75008 75009 ...
##  - attr(*, "group_sizes")= int  75000 75000
##  - attr(*, "biggest_group_size")= int 75000
##  - attr(*, "labels")='data.frame':   2 obs. of  1 variable:
##   ..$ Class: Factor w/ 2 levels "N","C": 1 2
##   ..- attr(*, "vars")= chr "Class"
##   ..- attr(*, "drop")= logi TRUE
##   ..- attr(*, "indices")=List of 2
##   .. ..$ : int  0 1 2 3 4 5 6 7 8 9 ...
##   .. ..$ : int  75000 75001 75002 75003 75004 75005 75006 75007 75008 75009 ...
##   ..- attr(*, "group_sizes")= int  75000 75000
##   ..- attr(*, "biggest_group_size")= int 75000
##   ..- attr(*, "labels")='data.frame':    2 obs. of  1 variable:
##   .. ..$ Class: Factor w/ 2 levels "N","C": 1 2
##   .. ..- attr(*, "vars")= chr "Class"
##   .. ..- attr(*, "drop")= logi TRUE

# Shuffle data
rows <- sample(nrow(caret300_b))

# Randomly order data
caret300_b <- caret300_b[rows,]

# Identify row to split on: split
split <- round(nrow(caret300_b) * .80)

# Create train
train <- caret300_b[1:split,]

# Create test
test <- caret300_b[(split + 1):nrow(caret300_b), ]

nrow(train)/nrow(caret300_b)

## [1] 0.8

library(caret)
myControl <- trainControl(
  method = "cv",
  number = 5,  # with respect to my RAM resources
  summaryFunction = twoClassSummary,
  classProbs = T,
  verboseIter = TRUE
)

The variables are more or less within the same scale, or it just seems to me so. But I would apply scaling within the train function as well as centering. I will as well perform cross-validation since I don’t work with all the data but just a small part of it. I hope this simulation would be good enough to give an overall accurate representation.

model <- train(Class ~., train,
  metric = "ROC",
  method = "glm",
  na.action = na.pass,
  preProcess = c("knnImpute", "center", "scale"),
  trControl = myControl)

## + Fold1: parameter=none 
## - Fold1: parameter=none 
## + Fold2: parameter=none 
## - Fold2: parameter=none 
## + Fold3: parameter=none 
## - Fold3: parameter=none 
## + Fold4: parameter=none 
## - Fold4: parameter=none 
## + Fold5: parameter=none 
## - Fold5: parameter=none 
## Aggregating results
## Fitting final model on full training set

summary(model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2929  -1.1133  -0.4503   1.1436   2.2408  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.005001   0.005925  -0.844    0.399    
## X2          -0.004279   0.006176  -0.693    0.488    
## X3          -0.069728   0.006506 -10.717  < 2e-16 ***
## X4          -0.215273   0.006520 -33.018  < 2e-16 ***
## X5          -0.430261   0.006713 -64.095  < 2e-16 ***
## X6          -0.217107   0.007413 -29.286  < 2e-16 ***
## X7          -0.160279   0.006951 -23.057  < 2e-16 ***
## X8           0.134343   0.006293  21.349  < 2e-16 ***
## X9           0.042023   0.005458   7.699 1.37e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 166355  on 119999  degrees of freedom
## Residual deviance: 160551  on 119991  degrees of freedom
## AIC: 160569
## 
## Number of Fisher Scoring iterations: 4

# Dispalying the variables importance for the model. Again X2 is not of a much use.
varImp(model)

## glm variable importance
## 
##    Overall
## X5  100.00
## X4   50.98
## X6   45.10
## X7   35.27
## X8   32.58
## X3   15.81
## X9   11.05
## X2    0.00

# Save predictions
ppp <- predict(model, test,na.action = na.pass)

# Construction a confusion metrix
confusionMatrix(ppp, test$Class,positive = "C")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    C
##          N 9412 6591
##          C 5584 8413
##                                           
##                Accuracy : 0.5942          
##                  95% CI : (0.5886, 0.5997)
##     No Information Rate : 0.5001          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1883          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5607          
##             Specificity : 0.6276          
##          Pos Pred Value : 0.6011          
##          Neg Pred Value : 0.5881          
##              Prevalence : 0.5001          
##          Detection Rate : 0.2804          
##    Detection Prevalence : 0.4666          
##       Balanced Accuracy : 0.5942          
##                                           
##        'Positive' Class : C               
##

By centering and scaling the variables before training the model /despite my suspecions they being on the same scale/ the initial model has lower AIC and better accuracy and increased sensitivity. I would leave out the X2 variable for the custom scaling centering and imputation of the data and again build the model with the glm() in R.

Custom model

I would impute with the median here It is not my first choice for such a skewed data , for sure, but..

caret300_b_m <- caret300_b
caret300_b_m$X3[is.na(caret300_b_m$X3)] <- median(caret300_b_m$X3, na.rm=T)
caret300_b_m$X4[is.na(caret300_b_m$X4)] <- median(caret300_b_m$X4, na.rm=T)
caret300_b_m$X5[is.na(caret300_b_m$X5)] <- median(caret300_b_m$X5, na.rm=T)
caret300_b_m$X6[is.na(caret300_b_m$X6)] <- median(caret300_b_m$X6, na.rm=T)
caret300_b_m$X7[is.na(caret300_b_m$X7)] <- median(caret300_b_m$X7, na.rm=T)
caret300_b_m$X8[is.na(caret300_b_m$X8)] <- median(caret300_b_m$X8, na.rm=T)
caret300_b_m$X9[is.na(caret300_b_m$X9)] <- median(caret300_b_m$X9, na.rm=T)

for (i in c(2:ncol(caret300_b_m))) {
      caret300_b_m[,i] <- scale(caret300_b_m[,i], center = T, scale = T)
}

# shuffle
rowsII <- sample(nrow(caret300_b_m))
# reorder
caret300_b_m <- caret300_b_m[rowsII,]
# split 
splitII <- round(nrow(caret300_b_m) * .80)
# create train
trainII <- caret300_b_m[1:splitII,]
# create test
testII <- caret300_b_m[(splitII + 1):nrow(caret300_b_m), ]

# model
LogKnnCSmodel <- glm(Class ~ . - X2, data = trainII, family = binomial)
summary(LogKnnCSmodel)

## 
## Call:
## glm(formula = Class ~ . - X2, family = binomial, data = trainII)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.580  -1.108  -0.644   1.145   2.740  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  7.326e-07  5.924e-03   0.000        1    
## X3          -6.108e-02  6.289e-03  -9.711   <2e-16 ***
## X4          -2.109e-01  6.534e-03 -32.278   <2e-16 ***
## X5          -4.493e-01  6.804e-03 -66.030   <2e-16 ***
## X6          -2.366e-01  7.372e-03 -32.100   <2e-16 ***
## X7          -1.752e-01  6.912e-03 -25.352   <2e-16 ***
## X8           1.346e-01  6.291e-03  21.389   <2e-16 ***
## X9           1.321e-01  6.755e-03  19.558   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 166355  on 119999  degrees of freedom
## Residual deviance: 160265  on 119992  degrees of freedom
## AIC: 160281
## 
## Number of Fisher Scoring iterations: 4

# prediction
pII <- predict(LogKnnCSmodel, testII, type = "response")
# set prediction probability for the model
p_classII <- ifelse(pII > 0.5, "C", "N")
# create confusion matrix
confusionMatrix(factor(p_classII), testII$Class, positive = "C")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    N    C
##          N 9564 6718
##          C 5372 8346
##                                           
##                Accuracy : 0.597           
##                  95% CI : (0.5914, 0.6026)
##     No Information Rate : 0.5021          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.1943          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.5540          
##             Specificity : 0.6403          
##          Pos Pred Value : 0.6084          
##          Neg Pred Value : 0.5874          
##              Prevalence : 0.5021          
##          Detection Rate : 0.2782          
##    Detection Prevalence : 0.4573          
##       Balanced Accuracy : 0.5972          
##                                           
##        'Positive' Class : C               
##

# Draw a ROC curve
library(caTools)
colAUC(pII, testII$Class, plotROC = T)

##              [,1]
## N vs. C 0.6316014