(https://www.kaggle.com/c/criteo-display-ad-challenge/data)
The data was no longer available at the speciefied link
I was able to find it from Criteo. (http://labs.criteo.com/2014/02/kaggle-display-advertising-challenge-dataset/)
I suspect it is not the exact data set used for the challenge because of the number, oder, name of the variables. It consists of more than 46M rows. I used random sampling to obtain 300000 rows of data. /with respect to my available resources/
About the data Data fields Label - Target variable that indicates if an ad was clicked (1) or not (0). I1-I13 - A total of 13 columns of integer features (mostly count features). C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes. The semantic of the features is undisclosed.
When a value is missing, the field is empty.
Until I know more about how to deal with the hashed values I would leave them out from the analysis. /“Hash functions may output the same value for different inputs” - To me that doesn’t speak accuracy in the calculations. But as I said. I will pay attention to that later.
I plan to use Logistic regression for this task.
library(readr)
# load data
train300 <- read_csv("C://Users/Public/Documents/Tiharis R/train300.csv", col_names = FALSE)[,1:13]
for (i in c(2:ncol(train300))) {
train300[,i] <- as.numeric(as.integer(as.character(unlist(train300[,i]))))
}
# Removing allegedly a categorical variable that left.
train300$X10 <- NULL
# train300$X1 <- as.factor(train300$X1)
summary(train300)
## X1 X2 X3 X4
## Min. :0.0000 Min. : -2.00 Min. : -2 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 1 1st Qu.: 2
## Median :0.0000 Median : 1.00 Median : 5 Median : 6
## Mean :0.2561 Mean : 44.82 Mean : 3867 Mean : 8550
## 3rd Qu.:1.0000 3rd Qu.: 7.00 3rd Qu.: 36 3rd Qu.: 27
## Max. :1.0000 Max. :10469.00 Max. :87552397 Max. :87552397
## NA's :145 NA's :984
## X5 X6 X7
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3 1st Qu.: 2 1st Qu.: 2
## Median : 14 Median : 26 Median : 8
## Mean : 27224 Mean : 22350 Mean : 24135
## 3rd Qu.: 2839 3rd Qu.: 391 3rd Qu.: 30
## Max. :98237733 Max. :511156000 Max. :98237733
## NA's :3884 NA's :6391 NA's :8526
## X8 X9 X11
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 2 1st Qu.: 1 1st Qu.: 0
## Median : 7 Median : 10 Median : 1
## Mean : 76526 Mean : 133534 Mean : 495323
## 3rd Qu.: 22 3rd Qu.: 33 3rd Qu.: 2
## Max. :511156000 Max. :961290000 Max. :511156000
## NA's :17255 NA's :32082 NA's :94325
## X12 X13
## Min. : 0 Min. : 0
## 1st Qu.: 1 1st Qu.: 0
## Median : 2 Median : 3
## Mean : 820474 Mean : 865780
## 3rd Qu.: 5 3rd Qu.: 9
## Max. :705740000 Max. :511156000
## NA's :144992 NA's :172115
From the summary I see that all the variables are heavily skewed to the right. Mean, Median difference is huge. I see many missing values, so I’ll plot them to see whether there is a pattern or something that might speak up. The data resides above the 3rd quartile meaning that 75% of the data is less than the number corespoding to the 3rd quartile. In context of the variable I can’t speak, so I can’t take any well-argumented decisions. I may treat these extreme high data points as are or bin the variable. I also think about the NA’s - for some variables missing values may make sense. For that they should be treated accordingly. We’ll see. What I also notice is the same extreme value ‘511156000’ in variables X6, X8, X11 and X13. The same is with X3 and X4 ‘87552397’ and X5 and X7 ‘98237733’. There is a class bias for the response varbiable X1. I may sample more balanced data for the model but we’ll see.
# I will get in numerical form the skewness and the kurtosis
library(psych)
describe(train300)
## vars n mean sd median trimmed mad min max
## X1 1 300000 0.26 0.44 0 0.20 0.00 0 1
## X2 2 300000 44.82 244.52 1 4.52 1.48 -2 10469
## X3 3 299855 3867.12 278537.78 5 28.36 7.41 -2 87552397
## X4 4 299016 8549.67 621113.54 6 59.24 7.41 0 87552397
## X5 5 296116 27223.75 1205105.10 14 1580.13 20.76 0 98237733
## X6 6 293609 22350.18 1591223.37 26 272.60 38.55 0 511156000
## X7 7 291474 24135.33 1373172.85 8 16.37 10.38 0 98237733
## X8 8 282745 76526.25 2609410.78 7 12.22 8.90 0 511156000
## X9 9 267918 133533.54 3746332.25 10 16.75 14.83 0 961290000
## X11 10 205675 495323.35 6383570.38 1 1.25 1.48 0 511156000
## X12 11 155008 820474.47 8458144.95 2 3.19 1.48 0 705740000
## X13 12 127885 865779.51 8096012.92 3 4.93 4.45 0 511156000
## range skew kurtosis se
## X1 1 1.12 -0.75 0.00
## X2 10471 9.88 124.17 0.45
## X3 87552399 310.70 97646.35 508.66
## X4 87552397 140.47 19794.77 1135.86
## X5 98237733 71.95 5220.82 2214.59
## X6 511156000 148.07 38186.34 2936.61
## X7 98237733 61.02 3819.79 2543.46
## X8 511156000 54.38 6118.15 4907.33
## X9 961290000 80.19 16565.39 7237.78
## X11 511156000 15.58 386.37 14075.79
## X12 705740000 19.63 960.43 21483.16
## X13 511156000 14.76 460.31 22639.22
I plan to log transform the data in order to reduce the skew. Too much skewness will mess up the statistical techinques.
library(Amelia)
# plot NA's
missmap(train300)
# missing values in percentage
round(sapply(train300, function(x) sum(length(which(is.na(x))))) / nrow(train300),3)*100
## X1 X2 X3 X4 X5 X6 X7 X8 X9 X11 X12 X13
## 0.0 0.0 0.0 0.3 1.3 2.1 2.8 5.8 10.7 31.4 48.3 57.4
After seeing the missing NA’s plot I decide to remove my attention from the variables X12 and X13. Imputing any numbers would lead to creating data insead of invastigating it. I see monotone missing data pattern - when a value is missing in a column it continuos to be missing in the next column. That might be a bit misleading because of the condensed plot. Probably X11 will go away as well but first I will investigate it further. The imputation I’m considering is k-NN /k nearest neighbor/.But we’ll see.
# Copy data
train300_copy <- train300
par(mfrow = c(6, 2))
par(cex = 0.6)
# I will do log transformation of the variables in order to see their distribution a little bit better.
hist(log(train300_copy$X13+1),breaks=80)
rug(jitter(log(train300_copy$X13+1)))
hist(log(train300_copy$X12+1),breaks=80)
rug(jitter(log(train300_copy$X12+1)))
hist(log(train300_copy$X11+1),breaks=80)
rug(jitter(log(train300_copy$X11+1)))
hist(log(train300_copy$X9+1),breaks=80)
rug(jitter(log(train300_copy$X9+1)))
hist(log(train300_copy$X8+1),breaks=80)
rug(jitter(log(train300_copy$X8+1)))
hist(log(train300_copy$X8+1),breaks=80)
rug(jitter(log(train300_copy$X8+1)))
hist(log(train300_copy$X6+1),breaks=80)
rug(jitter(log(train300_copy$X6+1)))
hist(log(train300_copy$X5+1),breaks=80)
rug(jitter(log(train300_copy$X5+1)))
hist(log(train300_copy$X4+1),breaks=80)
rug(jitter(log(train300_copy$X4+1)))
hist(log(train300_copy$X3+3),breaks=80)
rug(jitter(log(train300_copy$X3+1)))
hist(log(train300_copy$X2+3),breaks=80)
rug(jitter(log(train300_copy$X2+3)))
# yet another way of seeing the distribution and the variability of the variables
boxplot(train300, main = "Boxplot of not transformed variables")
boxplot(log(train300[,-1]+3),main = "Boxplot of log-transformed variables")
Schould I treat the extreme observations as outliers or not? They represent 25% of the data. That perhaps shouldn’t be ignored.
# Removing the last 3 variables from the analysis because of too much missing data
train300_copy$X13 <- NULL
train300_copy$X12 <- NULL
train300_copy$X11 <- NULL
# Find the mode for every variable
MODE <- function(dataframe){
DF <- as.data.frame(dataframe)
MODE2 <- function(x){
if (is.numeric(x) == FALSE){
df <- as.data.frame(table(x))
df <- df[order(df$Freq), ]
m <- max(df$Freq)
MODE1 <- as.vector(as.character(subset(df, Freq == m)[, 1]))
if (sum(df$Freq)/length(df$Freq)==1){
warning("No Mode: Frequency of all values is 1", call. = FALSE)
}else{
return(MODE1)
}
}else{
df <- as.data.frame(table(x))
df <- df[order(df$Freq), ]
m <- max(df$Freq)
MODE1 <- as.vector(as.numeric(as.character(subset(df, Freq == m)[, 1])))
if (sum(df$Freq)/length(df$Freq)==1){
warning("No Mode: Frequency of all values is 1", call. = FALSE)
}else{
return(MODE1)
}
}
}
return(as.vector(lapply(DF, MODE2)))
}
# Display the mode of every variable
as.data.frame(MODE(train300_copy))
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 1 0 0 1 1 0 0 0 1 0
I am considering an option to separate the variables at their mode and thus making them categorical ones. I can’t yet decide whether to do it with 2 levels or 3 or 4.. or leave them as they are
# How many missing values for each variable
sapply(train300_copy,function(x) sum(is.na(x)))
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 0 0 145 984 3884 6391 8526 17255 32082
# how many unique values for each variable
sapply(train300_copy, function(x) length(unique(x)))
## X1 X2 X3 X4 X5 X6 X7 X8 X9
## 2 2850 16042 19127 32116 11580 1892 1296 1510
I will idetify the veriables with zero variance
freqCut: looks at the ratio of the most common value to the second most common value uniqueCut: percentage of distinct values out of the number of total samples
the percent of unique values, the number of unique values divided by the total number of samples (times 100)
library(caret) nearZeroVar(train300_copy, names = TRUE, freqCut = 19, uniqueCut = 10) nearZeroVar(train300_copy, names = TRUE, freqCut = 2, uniqueCut = 20) X1, X6, X9
I’ll keep those variables in mind.
# Imputation I may also try the default setting for the number of multiple imputations m
, which is 5.
maxit
- the number of iterations which is again 5. But for large sample size 10 and up may give better results. I will do it for now with low settings
# Imputing data with predictive mean matching
library(mice)
imputedData300 <- mice(train300_copy, m=2, maxit = 5, method = 'pmm', seed = 500)
##
## iter imp variable
## 1 1 X3 X4 X5 X6 X7 X8 X9
## 1 2 X3 X4 X5 X6 X7 X8 X9
## 2 1 X3 X4 X5 X6 X7 X8 X9
## 2 2 X3 X4 X5 X6 X7 X8 X9
## 3 1 X3 X4 X5 X6 X7 X8 X9
## 3 2 X3 X4 X5 X6 X7 X8 X9
## 4 1 X3 X4 X5 X6 X7 X8 X9
## 4 2 X3 X4 X5 X6 X7 X8 X9
## 5 1 X3 X4 X5 X6 X7 X8 X9
## 5 2 X3 X4 X5 X6 X7 X8 X9
# I will chose the second dataset and combine it with the original data
train300_copy_imputed <- complete(imputedData300, 2)
# Check for NA's
anyNA(train300_copy_imputed)
## [1] FALSE
# Compare summary statistics
summary(train300_copy)
## X1 X2 X3 X4
## Min. :0.0000 Min. : -2.00 Min. : -2 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 1 1st Qu.: 2
## Median :0.0000 Median : 1.00 Median : 5 Median : 6
## Mean :0.2561 Mean : 44.82 Mean : 3867 Mean : 8550
## 3rd Qu.:1.0000 3rd Qu.: 7.00 3rd Qu.: 36 3rd Qu.: 27
## Max. :1.0000 Max. :10469.00 Max. :87552397 Max. :87552397
## NA's :145 NA's :984
## X5 X6 X7
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3 1st Qu.: 2 1st Qu.: 2
## Median : 14 Median : 26 Median : 8
## Mean : 27224 Mean : 22350 Mean : 24135
## 3rd Qu.: 2839 3rd Qu.: 391 3rd Qu.: 30
## Max. :98237733 Max. :511156000 Max. :98237733
## NA's :3884 NA's :6391 NA's :8526
## X8 X9
## Min. : 0 Min. : 0
## 1st Qu.: 2 1st Qu.: 1
## Median : 7 Median : 10
## Mean : 76526 Mean : 133534
## 3rd Qu.: 22 3rd Qu.: 33
## Max. :511156000 Max. :961290000
## NA's :17255 NA's :32082
summary(train300_copy_imputed)
## X1 X2 X3 X4
## Min. :0.0000 Min. : -2.00 Min. : -2 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 1 1st Qu.: 2
## Median :0.0000 Median : 1.00 Median : 5 Median : 6
## Mean :0.2561 Mean : 44.82 Mean : 3869 Mean : 8842
## 3rd Qu.:1.0000 3rd Qu.: 7.00 3rd Qu.: 36 3rd Qu.: 27
## Max. :1.0000 Max. :10469.00 Max. :87552397 Max. :87552397
## X5 X6 X7
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 3 1st Qu.: 2 1st Qu.: 2
## Median : 14 Median : 25 Median : 8
## Mean : 33876 Mean : 50121 Mean : 61858
## 3rd Qu.: 2789 3rd Qu.: 367 3rd Qu.: 30
## Max. :98237733 Max. :511156000 Max. :98237733
## X8 X9
## Min. : 0 Min. : 0
## 1st Qu.: 2 1st Qu.: 1
## Median : 7 Median : 9
## Mean : 268595 Mean : 476578
## 3rd Qu.: 22 3rd Qu.: 32
## Max. :511156000 Max. :961290000
The summary comparison reveals to what extent I influenced the variables. Some of them remain solid, some are influenced a bit, In my oppinion it is quite survivable. I plan to use both imputed data sets to compare their performance. # Log transform
# Copy the imputed data
train300_copy_imputed_log <- train300_copy_imputed
# Transform every variable but the response variable
train300_copy_imputed_log[,-1] <- log(train300_copy_imputed_log[, -1] + 3)
summary(train300_copy_imputed_log)
## X1 X2 X3 X4
## Min. :0.0000 Min. :0.000 Min. : 0.000 Min. : 1.099
## 1st Qu.:0.0000 1st Qu.:1.099 1st Qu.: 1.386 1st Qu.: 1.609
## Median :0.0000 Median :1.386 Median : 2.079 Median : 2.197
## Mean :0.2561 Mean :1.978 Mean : 2.965 Mean : 3.170
## 3rd Qu.:1.0000 3rd Qu.:2.303 3rd Qu.: 3.664 3rd Qu.: 3.401
## Max. :1.0000 Max. :9.256 Max. :18.288 Max. :18.288
## X5 X6 X7 X8
## Min. : 1.099 Min. : 1.099 Min. : 1.099 Min. : 1.099
## 1st Qu.: 1.792 1st Qu.: 1.609 1st Qu.: 1.609 1st Qu.: 1.609
## Median : 2.833 Median : 3.332 Median : 2.398 Median : 2.303
## Mean : 4.493 Mean : 3.923 Mean : 2.694 Mean : 2.587
## 3rd Qu.: 7.935 3rd Qu.: 5.914 3rd Qu.: 3.497 3rd Qu.: 3.219
## Max. :18.403 Max. :20.052 Max. :18.403 Max. :20.052
## X9
## Min. : 1.099
## 1st Qu.: 1.386
## Median : 2.485
## Mean : 2.762
## 3rd Qu.: 3.555
## Max. :20.684
# Change names accordingly
names(train300_copy_imputed_log) <- paste0(names(train300_copy_imputed_log), "_log")
names(train300_copy_imputed_log)[names(train300_copy_imputed_log) == "X1_log"] <- "X1"
I will randomly reorder the data and then split it into train and test sets.
# Shuffle rows
shuf <- sample(nrow(train300_copy_imputed_log))
# Randomly order data
train300_copy_imputed_log <- train300_copy_imputed_log[shuf,]
# Identify row to split on
split <- round(nrow(train300_copy_imputed_log) * .80)
# Create train
train_train300_copy_imputed_log <- train300_copy_imputed_log[1:split,]
# Create test
test_train300_copy_imputed_log <- train300_copy_imputed_log[(split + 1):nrow(train300_copy_imputed_log), ]
# check splitting
nrow(train_train300_copy_imputed_log)/nrow(train300_copy_imputed_log)
## [1] 0.8
str(train_train300_copy_imputed_log)
## 'data.frame': 240000 obs. of 9 variables:
## $ X1 : int 1 1 0 0 1 0 0 0 1 0 ...
## $ X2_log: num 2.56 1.95 1.39 3.89 1.1 ...
## $ X3_log: num 4.17 2.48 8.41 2.94 3.26 ...
## $ X4_log: num 1.79 2.4 1.1 7.6 1.39 ...
## $ X5_log: num 2.4 8.3 1.1 3.64 1.39 ...
## $ X6_log: num 6.3 5.24 3 2.3 7.81 ...
## $ X7_log: num 2.4 3.43 1.1 3.53 2.71 ...
## $ X8_log: num 4.64 3.3 1.95 3.64 2.3 ...
## $ X9_log: num 2.94 5.67 1.39 1.39 1.39 ...
# 1 glm model
LogModel1 <- glm(X1 ~ ., data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel1)
##
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5427 -0.7853 -0.6694 1.2642 3.4178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.327849 0.022618 14.495 < 2e-16 ***
## X2_log -0.001664 0.003773 -0.441 0.659
## X3_log -0.034692 0.002132 -16.275 < 2e-16 ***
## X4_log -0.081383 0.002117 -38.444 < 2e-16 ***
## X5_log -0.128464 0.001655 -77.633 < 2e-16 ***
## X6_log -0.082325 0.002306 -35.708 < 2e-16 ***
## X7_log -0.126601 0.004067 -31.130 < 2e-16 ***
## X8_log 0.053021 0.003028 17.511 < 2e-16 ***
## X9_log 0.011844 0.002441 4.852 1.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 264812 on 239991 degrees of freedom
## AIC: 264830
##
## Number of Fisher Scoring iterations: 4
# 2 glm model
LogModel2 <- glm(X1 ~ . - X2_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel2)
##
## Call:
## glm(formula = X1 ~ . - X2_log, family = binomial, data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5420 -0.7853 -0.6694 1.2641 3.4171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.324051 0.020914 15.495 < 2e-16 ***
## X3_log -0.034643 0.002129 -16.274 < 2e-16 ***
## X4_log -0.081385 0.002117 -38.447 < 2e-16 ***
## X5_log -0.128535 0.001647 -78.043 < 2e-16 ***
## X6_log -0.082155 0.002273 -36.145 < 2e-16 ***
## X7_log -0.126508 0.004061 -31.150 < 2e-16 ***
## X8_log 0.052958 0.003024 17.511 < 2e-16 ***
## X9_log 0.011821 0.002440 4.844 1.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 264812 on 239992 degrees of freedom
## AIC: 264828
##
## Number of Fisher Scoring iterations: 4
# 3 glm model /whitout the near zero variance variables one at a time/
LogModel3 <- glm(X1 ~ . - X2_log - X9_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel3)
##
## Call:
## glm(formula = X1 ~ . - X2_log - X9_log, family = binomial, data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5045 -0.7857 -0.6698 1.2631 3.4348
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.333534 0.020831 16.01 <2e-16 ***
## X3_log -0.032616 0.002085 -15.64 <2e-16 ***
## X4_log -0.082059 0.002113 -38.84 <2e-16 ***
## X5_log -0.127765 0.001639 -77.94 <2e-16 ***
## X6_log -0.080386 0.002244 -35.83 <2e-16 ***
## X7_log -0.125820 0.004064 -30.96 <2e-16 ***
## X8_log 0.055640 0.002972 18.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 264836 on 239993 degrees of freedom
## AIC: 264850
##
## Number of Fisher Scoring iterations: 4
# 4 glm model /whitout the near zero variance variables one at a time/
LogModel4 <- glm(X1 ~ . - X2_log - X9_log -X6_log, data = train_train300_copy_imputed_log, family = binomial)
summary(LogModel4)
##
## Call:
## glm(formula = X1 ~ . - X2_log - X9_log - X6_log, family = binomial,
## data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4255 -0.7941 -0.6768 1.3276 3.3884
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.027749 0.019232 1.443 0.149
## X3_log -0.016831 0.002052 -8.202 2.36e-16 ***
## X4_log -0.059662 0.002027 -29.434 < 2e-16 ***
## X5_log -0.115618 0.001599 -72.290 < 2e-16 ***
## X7_log -0.188458 0.003732 -50.500 < 2e-16 ***
## X8_log 0.052585 0.002991 17.578 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 266132 on 239994 degrees of freedom
## AIC: 266144
##
## Number of Fisher Scoring iterations: 4
The interpretation of the coefficients will be a challenge for me, because the glm model gives the coefficients in the log odds scale and I already took the natural log of them.but I’ll give it a try. For variables X3_log to X7_log I see decrease from The log odds ratio associated with a 2.72 fold change in the predictor. By just looking at the sign of the estimates I may say that which one has positive or negative affect on the response variable X1 which represents whether the add was clicked on or not. I will show the LogModel2 in odds scale for hopefully easier interpretation. For variables X3_log to X7_log I see assosiated decrease in the odds of Click. for example: X3_log would be interpreted as: each additional log unit of variable X3_log is assosiated with 3.4% decrease in the odds of Click. The difference between the null deviance and the residual deviance is not big. It would mean/when there is a bigger gap/ that variables add value to the model.
# Coefficients and their CI of LogModel2 in the ddds scale
exp(cbind(Odds = coef(LogModel2), confint(LogModel2)))
## Odds 2.5 % 97.5 %
## (Intercept) 1.3827174 1.3271997 1.4405878
## X3_log 0.9659497 0.9619244 0.9699851
## X4_log 0.9218387 0.9180171 0.9256664
## X5_log 0.8793826 0.8765459 0.8822232
## X6_log 0.9211291 0.9170334 0.9252405
## X7_log 0.8811671 0.8741696 0.8881977
## X8_log 1.0543851 1.0481535 1.0606541
## X9_log 1.0118908 1.0070520 1.0167324
Here I see that no coefficients in the odds scale crosses zero which indicates to me that all of the variables have to some extent, even small one/ influence one the response variable. Apart from X8_log and X9_log that have a tiny positive influence all other remaining variables have a negative one.
# AIC comparison Comparing models by their Akaike Information Criterion - evaluating the quality of the model
# AIC comparison
AIC(LogModel1,LogModel2,LogModel3,LogModel4)
## df AIC
## LogModel1 9 264830.2
## LogModel2 8 264828.4
## LogModel3 7 264849.6
## LogModel4 6 266143.5
Model 2 is better. I will continue with it.
# prediction
p <- predict(LogModel2, test_train300_copy_imputed_log, type = "response")
# set prediction probability for the model
p_class <- ifelse(p > 0.2, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)
## actual
## predicted 0 1
## 0 14366 2878
## 1 30362 12394
# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)
## [1] 0.446
If I set the cutoff value to be 0.20 I would get more clicks but with less certanty than if I set it to 0.6.
# set prediction probability for the model
p_class <- ifelse(p > 0.6, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)
## actual
## predicted 0 1
## 0 44686 15268
## 1 42 4
# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)
## [1] 0.7448333
I will draw a ROC curve and see the possible cutoff point as welll as the overall performance of the model across these thresholds.
library(caTools)
# Draw a ROC curve
colAUC(p, test_train300_copy_imputed_log$X1, plotROC = T)
## [,1]
## 0 vs. 1 0.6229923
I would choose to set the threshold a bit over the average probability of success
# Check once again the average prop of clicks
mean(test_train300_copy_imputed_log$X1)
## [1] 0.2545333
# I decide to leave it at 0.26
p_class <- ifelse(p > 0.26, 1, 0)
# create confusion matrix
table(predicted = p_class, actual = test_train300_copy_imputed_log$X1)
## actual
## predicted 0 1
## 0 28578 6922
## 1 16150 8350
# estimate the accuracy of the model
mean(p_class == test_train300_copy_imputed_log$X1)
## [1] 0.6154667
sensitivity <- 8350 / (8350 + 16150 )
specificity <- 28578 / (28578 + 6922)
sensitivity
## [1] 0.3408163
specificity
## [1] 0.8050141
The model correctly identifies 34% of the clicks The model correctly identifies 81% of no clicks
# McFadden R2 index can be used to assess the model fit
library(pscl)
pR2(LogModel2)
## llh llhNull G2 McFadden r2ML
## -1.324062e+05 -1.366378e+05 8.463247e+03 3.096964e-02 3.464901e-02
## r2CU
## 5.097329e-02
Well, now I am confused. # Second imputed dataset model
# GLM for both of the imputed data set
ModelIIset <- with(imputedData300, glm(X1 ~ ., data = train_train300_copy_imputed_log, family = binomial))
summary(ModelIIset)
##
## ## summary of imputation 1 :
##
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5427 -0.7853 -0.6694 1.2642 3.4178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.327849 0.022618 14.495 < 2e-16 ***
## X2_log -0.001664 0.003773 -0.441 0.659
## X3_log -0.034692 0.002132 -16.275 < 2e-16 ***
## X4_log -0.081383 0.002117 -38.444 < 2e-16 ***
## X5_log -0.128464 0.001655 -77.633 < 2e-16 ***
## X6_log -0.082325 0.002306 -35.708 < 2e-16 ***
## X7_log -0.126601 0.004067 -31.130 < 2e-16 ***
## X8_log 0.053021 0.003028 17.511 < 2e-16 ***
## X9_log 0.011844 0.002441 4.852 1.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 264812 on 239991 degrees of freedom
## AIC: 264830
##
## Number of Fisher Scoring iterations: 4
##
##
## ## summary of imputation 2 :
##
## Call:
## glm(formula = X1 ~ ., family = binomial, data = train_train300_copy_imputed_log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5427 -0.7853 -0.6694 1.2642 3.4178
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.327849 0.022618 14.495 < 2e-16 ***
## X2_log -0.001664 0.003773 -0.441 0.659
## X3_log -0.034692 0.002132 -16.275 < 2e-16 ***
## X4_log -0.081383 0.002117 -38.444 < 2e-16 ***
## X5_log -0.128464 0.001655 -77.633 < 2e-16 ***
## X6_log -0.082325 0.002306 -35.708 < 2e-16 ***
## X7_log -0.126601 0.004067 -31.130 < 2e-16 ***
## X8_log 0.053021 0.003028 17.511 < 2e-16 ***
## X9_log 0.011844 0.002441 4.852 1.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 273276 on 239999 degrees of freedom
## Residual deviance: 264812 on 239991 degrees of freedom
## AIC: 264830
##
## Number of Fisher Scoring iterations: 4
Both of the imputed datasets yield the same results in respect of the significance of the variables.
# correlation
library(corrplot)
correlations_o <- round(cor(train300_copy, use = "complete.obs",method = "spearman"),2)
corrplot(correlations_o, method = "number")
correlations_i <- round(cor(train300_copy_imputed, method = "spearman"),2)
corrplot(correlations_i, method = "number")
correlations_il <- round(cor(log(train300_copy_imputed_log),method = "spearman"),2)
corrplot(correlations_il, method = "number")
pl <- plot(train300_copy_imputed_log, col=rgb(0,0,0,40,maxColorValue=255) , pch = 16)
pl
## NULL
I would like to train the model using the caret
package as well.
# Copy data
caret300 <- train300_copy
# Delete variables with many NA's
caret300$X13 <- NULL
caret300$X12 <- NULL
caret300$X11 <- NULL
# Rename response variable
names(caret300)[names(caret300) == 'X1'] <- 'Class'
# Make the response variabel a factor
caret300$Class <- as.factor(caret300$Class)
# Renane
library(plyr)
caret300$Class <- mapvalues(caret300$Class, from = c("0", "1"), to = c("N", "C"))
# Create more balanced data in order the model to learn better.
library(dplyr)
caret300_b <- caret300 %>%
group_by(Class) %>%
sample_n(75000)
caret300_b[,-1] <- log(caret300_b[,-1] + 3)
str(caret300_b)
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 150000 obs. of 9 variables:
## $ Class: Factor w/ 2 levels "N","C": 1 1 1 1 1 1 1 1 1 1 ...
## $ X2 : num 1.1 5.15 1.1 1.1 6.07 ...
## $ X3 : num 3.89 4.5 1.39 2.08 8.87 ...
## $ X4 : num 2.56 1.1 8.66 8.38 1.1 ...
## $ X5 : num 10.6 8.66 1.1 1.1 1.1 ...
## $ X6 : num 6.28 1.1 1.61 1.1 2.89 ...
## $ X7 : num 1.1 2.48 1.79 2.08 1.1 ...
## $ X8 : num 2.64 3.97 1.1 1.1 NA ...
## $ X9 : num 2.64 1.1 NA NA NA ...
## - attr(*, "vars")= chr "Class"
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 2
## ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
## ..$ : int 75000 75001 75002 75003 75004 75005 75006 75007 75008 75009 ...
## - attr(*, "group_sizes")= int 75000 75000
## - attr(*, "biggest_group_size")= int 75000
## - attr(*, "labels")='data.frame': 2 obs. of 1 variable:
## ..$ Class: Factor w/ 2 levels "N","C": 1 2
## ..- attr(*, "vars")= chr "Class"
## ..- attr(*, "drop")= logi TRUE
## ..- attr(*, "indices")=List of 2
## .. ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
## .. ..$ : int 75000 75001 75002 75003 75004 75005 75006 75007 75008 75009 ...
## ..- attr(*, "group_sizes")= int 75000 75000
## ..- attr(*, "biggest_group_size")= int 75000
## ..- attr(*, "labels")='data.frame': 2 obs. of 1 variable:
## .. ..$ Class: Factor w/ 2 levels "N","C": 1 2
## .. ..- attr(*, "vars")= chr "Class"
## .. ..- attr(*, "drop")= logi TRUE
# Shuffle data
rows <- sample(nrow(caret300_b))
# Randomly order data
caret300_b <- caret300_b[rows,]
# Identify row to split on: split
split <- round(nrow(caret300_b) * .80)
# Create train
train <- caret300_b[1:split,]
# Create test
test <- caret300_b[(split + 1):nrow(caret300_b), ]
nrow(train)/nrow(caret300_b)
## [1] 0.8
library(caret)
myControl <- trainControl(
method = "cv",
number = 5, # with respect to my RAM resources
summaryFunction = twoClassSummary,
classProbs = T,
verboseIter = TRUE
)
The variables are more or less within the same scale, or it just seems to me so. But I would apply scaling within the train function as well as centering. I will as well perform cross-validation since I don’t work with all the data but just a small part of it. I hope this simulation would be good enough to give an overall accurate representation.
model <- train(Class ~., train,
metric = "ROC",
method = "glm",
na.action = na.pass,
preProcess = c("knnImpute", "center", "scale"),
trControl = myControl)
## + Fold1: parameter=none
## - Fold1: parameter=none
## + Fold2: parameter=none
## - Fold2: parameter=none
## + Fold3: parameter=none
## - Fold3: parameter=none
## + Fold4: parameter=none
## - Fold4: parameter=none
## + Fold5: parameter=none
## - Fold5: parameter=none
## Aggregating results
## Fitting final model on full training set
summary(model)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2929 -1.1133 -0.4503 1.1436 2.2408
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.005001 0.005925 -0.844 0.399
## X2 -0.004279 0.006176 -0.693 0.488
## X3 -0.069728 0.006506 -10.717 < 2e-16 ***
## X4 -0.215273 0.006520 -33.018 < 2e-16 ***
## X5 -0.430261 0.006713 -64.095 < 2e-16 ***
## X6 -0.217107 0.007413 -29.286 < 2e-16 ***
## X7 -0.160279 0.006951 -23.057 < 2e-16 ***
## X8 0.134343 0.006293 21.349 < 2e-16 ***
## X9 0.042023 0.005458 7.699 1.37e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 166355 on 119999 degrees of freedom
## Residual deviance: 160551 on 119991 degrees of freedom
## AIC: 160569
##
## Number of Fisher Scoring iterations: 4
# Dispalying the variables importance for the model. Again X2 is not of a much use.
varImp(model)
## glm variable importance
##
## Overall
## X5 100.00
## X4 50.98
## X6 45.10
## X7 35.27
## X8 32.58
## X3 15.81
## X9 11.05
## X2 0.00
# Save predictions
ppp <- predict(model, test,na.action = na.pass)
# Construction a confusion metrix
confusionMatrix(ppp, test$Class,positive = "C")
## Confusion Matrix and Statistics
##
## Reference
## Prediction N C
## N 9412 6591
## C 5584 8413
##
## Accuracy : 0.5942
## 95% CI : (0.5886, 0.5997)
## No Information Rate : 0.5001
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1883
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5607
## Specificity : 0.6276
## Pos Pred Value : 0.6011
## Neg Pred Value : 0.5881
## Prevalence : 0.5001
## Detection Rate : 0.2804
## Detection Prevalence : 0.4666
## Balanced Accuracy : 0.5942
##
## 'Positive' Class : C
##
By centering and scaling the variables before training the model /despite my suspecions they being on the same scale/ the initial model has lower AIC and better accuracy and increased sensitivity. I would leave out the X2 variable for the custom scaling centering and imputation of the data and again build the model with the glm() in R.
I would impute with the median here It is not my first choice for such a skewed data , for sure, but..
caret300_b_m <- caret300_b
caret300_b_m$X3[is.na(caret300_b_m$X3)] <- median(caret300_b_m$X3, na.rm=T)
caret300_b_m$X4[is.na(caret300_b_m$X4)] <- median(caret300_b_m$X4, na.rm=T)
caret300_b_m$X5[is.na(caret300_b_m$X5)] <- median(caret300_b_m$X5, na.rm=T)
caret300_b_m$X6[is.na(caret300_b_m$X6)] <- median(caret300_b_m$X6, na.rm=T)
caret300_b_m$X7[is.na(caret300_b_m$X7)] <- median(caret300_b_m$X7, na.rm=T)
caret300_b_m$X8[is.na(caret300_b_m$X8)] <- median(caret300_b_m$X8, na.rm=T)
caret300_b_m$X9[is.na(caret300_b_m$X9)] <- median(caret300_b_m$X9, na.rm=T)
for (i in c(2:ncol(caret300_b_m))) {
caret300_b_m[,i] <- scale(caret300_b_m[,i], center = T, scale = T)
}
# shuffle
rowsII <- sample(nrow(caret300_b_m))
# reorder
caret300_b_m <- caret300_b_m[rowsII,]
# split
splitII <- round(nrow(caret300_b_m) * .80)
# create train
trainII <- caret300_b_m[1:splitII,]
# create test
testII <- caret300_b_m[(splitII + 1):nrow(caret300_b_m), ]
# model
LogKnnCSmodel <- glm(Class ~ . - X2, data = trainII, family = binomial)
summary(LogKnnCSmodel)
##
## Call:
## glm(formula = Class ~ . - X2, family = binomial, data = trainII)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.580 -1.108 -0.644 1.145 2.740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.326e-07 5.924e-03 0.000 1
## X3 -6.108e-02 6.289e-03 -9.711 <2e-16 ***
## X4 -2.109e-01 6.534e-03 -32.278 <2e-16 ***
## X5 -4.493e-01 6.804e-03 -66.030 <2e-16 ***
## X6 -2.366e-01 7.372e-03 -32.100 <2e-16 ***
## X7 -1.752e-01 6.912e-03 -25.352 <2e-16 ***
## X8 1.346e-01 6.291e-03 21.389 <2e-16 ***
## X9 1.321e-01 6.755e-03 19.558 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 166355 on 119999 degrees of freedom
## Residual deviance: 160265 on 119992 degrees of freedom
## AIC: 160281
##
## Number of Fisher Scoring iterations: 4
# prediction
pII <- predict(LogKnnCSmodel, testII, type = "response")
# set prediction probability for the model
p_classII <- ifelse(pII > 0.5, "C", "N")
# create confusion matrix
confusionMatrix(factor(p_classII), testII$Class, positive = "C")
## Confusion Matrix and Statistics
##
## Reference
## Prediction N C
## N 9564 6718
## C 5372 8346
##
## Accuracy : 0.597
## 95% CI : (0.5914, 0.6026)
## No Information Rate : 0.5021
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.1943
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5540
## Specificity : 0.6403
## Pos Pred Value : 0.6084
## Neg Pred Value : 0.5874
## Prevalence : 0.5021
## Detection Rate : 0.2782
## Detection Prevalence : 0.4573
## Balanced Accuracy : 0.5972
##
## 'Positive' Class : C
##
# Draw a ROC curve
library(caTools)
colAUC(pII, testII$Class, plotROC = T)
## [,1]
## N vs. C 0.6316014