I’ll get started by setting up my working directory and reading in the data.
setwd("C:/Users/Nathan/Desktop/Working Directory")
ad.df <- read.csv("Analyst_HW.csv", header = T, stringsAsFactors = F)
What is the structure of the data?
str(ad.df) #find out the structure at a glance
## 'data.frame': 14905865 obs. of 7 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 36 73 30 49 47 47 0 46 16 52 ...
## $ Gender : int 0 1 0 1 1 0 0 0 0 0 ...
## $ Impressions: int 3 3 3 3 11 11 7 5 3 4 ...
## $ Clicks : int 0 0 0 0 0 1 1 0 0 0 ...
## $ Signed_In : int 1 1 1 1 1 1 0 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
summary(ad.df) #basic summary statistics
## X Age Gender Impressions
## Min. : 1 Min. : 0.00 Min. :0.0000 Min. : 0
## 1st Qu.: 3726467 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 3
## Median : 7452933 Median : 26.00 Median :0.0000 Median : 5
## Mean : 7452933 Mean : 26.24 Mean :0.3231 Mean : 5
## 3rd Qu.:11179399 3rd Qu.: 46.00 3rd Qu.:1.0000 3rd Qu.: 6
## Max. :14905865 Max. :115.00 Max. :1.0000 Max. :21
## Clicks Signed_In day
## Min. :0.00000 Min. :0.0000 Min. : 1.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.: 8.00
## Median :0.00000 Median :1.0000 Median :16.00
## Mean :0.09773 Mean :0.6234 Mean :15.98
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:24.00
## Max. :6.00000 Max. :1.0000 Max. :31.00
From the summary stats I saw that the max age was 115, so I’ll set the upper bound of my age cuts just above it (116, since the intervals are left-closed).
breaks <- c(0,18,25,35,45,55,65,116)
labels <- c('<18','18-24','25-34','35-44', '45-54','55-64','65+')
ad.df$age_group <- cut(ad.df$Age,
breaks = breaks,
labels = labels,
right = F, ordered_result = T)
# Let's check to see if it's right (there should be no NA's)
summary(ad.df$age_group) #No NA values
## <18 18-24 25-34 35-44 45-54 55-64 65+
## 6016115 1176595 1673650 2044613 1859487 1299303 836102
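As a quick sanity check on the boundaries (a minimal sketch; with right = FALSE each interval is closed on the left), a few edge ages should land where expected:
# bins are [0,18), [18,25), ..., [65,116): an 18-year-old falls in '18-24'
# and the max age of 115 falls in '65+'
cut(c(17, 18, 64, 65, 115), breaks = breaks, labels = labels,
    right = F, ordered_result = T)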
# (CTR=# clicks/# impressions)
ad.df.day <- subset(ad.df, day == 10) # I'll pick the 10th, since it's my birthday
Let’s look at the impressions and CTRs for the seven age groups.
library(ggplot2)
library(plyr)
ggplot(ad.df.day, aes(x=Impressions, colour=age_group)) + geom_density()
They look approximately normally distributed, and the averages appear to be around 5 for each age group. It’s hard to see differences with the overlapping curves, though. Let’s plot the means for each.
#this we can use to plot the means of the different groups for comparison
mimp <- ddply(ad.df.day, "age_group", summarise, imp.mean=mean(Impressions))
mimp
## age_group imp.mean
## 1 <18 5.008970
## 2 18-24 5.010811
## 3 25-34 4.988482
## 4 35-44 4.998832
## 5 45-54 4.984716
## 6 55-64 4.989324
## 7 65+ 4.982199
ggplot(ad.df.day, aes(x=Impressions)) + geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(age_group ~ .) +
geom_vline(data=mimp, aes(xintercept=imp.mean),
linetype="dashed", size=1, colour="red")
Now we can see that the ‘<18’ group accounts for the most impression records, with the ‘35-44’ group second.
Now let’s look at the CTR. To analyze / visualize CTR we need to first create a CTR variable.
ad.df.day$ctr <- (ad.df.day$Clicks)/(ad.df.day$Impressions) # there were 3020 NA's due to '0/0';
ad.df.day$ctr[is.na(ad.df.day$ctr)] <- 0 # replace the NA's with '0'
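(For reference, 0/0 in R yields NaN, and is.na() is TRUE for NaN, which is why the replacement above catches the zero-impression records.)
0/0 # NaN
is.na(0/0) # TRUE, so the assignment above zeroes these records out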
mctr <- ddply(ad.df.day, "age_group", summarise, ctr.mean=mean(ctr))
mctr
## age_group ctr.mean
## 1 <18 0.028138911
## 2 18-24 0.011232519
## 3 25-34 0.010139070
## 4 35-44 0.010132852
## 5 45-54 0.009574999
## 6 55-64 0.020482843
## 7 65+ 0.029169597
ggplot(ad.df.day, aes(x=ctr, colour=age_group)) + geom_density() # again, very difficult to see
ggplot(ad.df.day, aes(x=ctr)) + geom_histogram(binwidth=.5, colour="black", fill="white") +
facet_grid(age_group ~ .) +
geom_vline(data=mctr, aes(xintercept=ctr.mean),
linetype="dashed", size=1, colour="red")
The means for the ‘65+’ and ‘<18’ are visibly higher than the other groups. We can see this better if we just graph the CTR means by age group.
ggplot(data=mctr, aes(x=age_group, y=ctr.mean, fill=age_group)) +
geom_bar(stat="identity")
Maybe we could segment users based on click status…
library(lattice)
summary(ad.df.day$Clicks) #Overall, not very many clicks
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.09216 0.00000 4.00000
prop.table(table(ad.df.day$Clicks))*100 #91.3% of click data is '0', only 8.1% being 1 click.
##
## 0 1 2 3 4
## 91.343210400 8.131131755 0.494736796 0.028712403 0.002208646
I’ll create a grouping variable ‘have_clicked’; 1= have 1+ clicks, 0= no clicks
ad.df.day$have_clicked <- ifelse(ad.df.day$Clicks>=1, 1, 0)
prop.table(table(as.factor(ad.df.day$have_clicked))) #see proportions of new groups
##
## 0 1
## 0.9134321 0.0865679
histogram(~as.factor(ad.df.day$have_clicked) | ad.df.day$age_group,
data=ad.df.day, type = "count",layout=c(7,1),
col=c("light blue", "orange"), xlab= "Clicks by Age group, 0=No clicks, 1=At least 1 click",
ylab ="Count of records")
The data show an interesting pattern: a large share of users fall in the ‘<18’ age group, and overall far more people do not click than do. The ‘0’ series follows a fairly normal curve, suggesting the customer age-group distribution is roughly normal for day 10 if we set aside the ‘<18’ group.
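A raw cross-tabulation gives the same picture numerically (a quick sketch, using the have_clicked flag defined above):
with(ad.df.day, table(age_group, have_clicked)) # counts behind the histogram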
Let’s look at just those who have clicks - is their data normally distributed?
ClkA <- prop.table(table(ad.df.day$age_group, ad.df.day$have_clicked==1)) #'ClkA'...sounds like a rapper!
ClkA <- data.frame(ClkA)
ClkA <- subset(ClkA, Var2==TRUE) #keep only the rows for users who clicked
ClkA$Var2 <- NULL
s <- sum(ClkA$Freq)
ClkA$RelFrq <- ClkA$Freq/s #calculate relative frequency
library(data.table)
labl <- data.table(ClkA)[, per := sprintf("%.1f%%", RelFrq*100), by = Var1]
ggplot(ClkA, aes(x=Var1, y=RelFrq, fill=Var1))+geom_bar(stat="identity")+
labs(title="1+ Clicks by Age Group", x= "Age Groups",
y= "Relative Frequency of Clicks")+ guides(fill=FALSE)+
geom_text(data = labl, aes(x = Var1, y = RelFrq, label = per), vjust= -0.5)
Looks like the data for those with 1+ clicks are not normally distributed. The ‘<18’ age segment makes up almost half of all clicks for day 10, followed by the older groups (55-64 and 65+), which make up 21.1% of 1+ clicks.
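For a formal check that click status and age group are associated, one option (a hedged sketch; with this many rows even trivial differences will test as “significant”) is a chi-squared test of independence:
chisq.test(with(ad.df.day, table(age_group, have_clicked)))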
means <- with(ad.df.day, aggregate(x=list(Y=Impressions),
by=list(A=age_group, B=Gender),mean))
with(means, interaction.plot(x.factor=A, trace.factor=B, response=Y, type='b',
main = "Impressions by gender and age group, day 10",
xlab = "Age groups", ylab= "Average Impressions"))
This shows an interesting result: generally, the average number of impressions decreases as you move up in age group. Women (coded ‘0’) appear to average more impressions in every age group.
prop.table(table(ad.df.day$Gender)) #looks like there's more women than men in this data
##
## 0 1
## 0.6371879 0.3628121
# test to see if there are significant (with alpha = 0.05) differences between age groups and genders
TukeyHSD(aov(Y~(as.factor(A))+(as.factor(B)), means))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Y ~ (as.factor(A)) + (as.factor(B)), data = means)
##
## $`as.factor(A)`
## diff lwr upr p adj
## 18-24-<18 0.0023022891 -0.016064247 0.0206688256 0.9972371
## 25-34-<18 -0.0198334524 -0.038199989 -0.0014669159 0.0357225
## 35-44-<18 -0.0098559333 -0.028222470 0.0085106033 0.3923595
## 45-54-<18 -0.0237776653 -0.042144202 -0.0054111288 0.0152988
## 55-64-<18 -0.0191677647 -0.037534301 -0.0008012281 0.0415547
## 65+-<18 -0.0287184923 -0.047085029 -0.0103519558 0.0059187
## 25-34-18-24 -0.0221357415 -0.040502278 -0.0037692050 0.0215580
## 35-44-18-24 -0.0121582224 -0.030524759 0.0062083141 0.2265793
## 45-54-18-24 -0.0260799544 -0.044446491 -0.0077134179 0.0096830
## 55-64-18-24 -0.0214700538 -0.039836590 -0.0031035172 0.0248753
## 65+-18-24 -0.0310207814 -0.049387318 -0.0126542449 0.0039526
## 35-44-25-34 0.0099775191 -0.008389017 0.0283440556 0.3816150
## 45-54-25-34 -0.0039442129 -0.022310749 0.0144223236 0.9604502
## 55-64-25-34 0.0006656877 -0.017700849 0.0190322243 0.9999978
## 65+-25-34 -0.0088850399 -0.027251576 0.0094814966 0.4859155
## 45-54-35-44 -0.0139217320 -0.032288269 0.0044448045 0.1464780
## 55-64-35-44 -0.0093118314 -0.027678368 0.0090547052 0.4431463
## 65+-35-44 -0.0188625590 -0.037229096 -0.0004960225 0.0445721
## 55-64-45-54 0.0046099006 -0.013756636 0.0229764372 0.9238380
## 65+-45-54 -0.0049408270 -0.023307364 0.0134257095 0.9002983
## 65+-55-64 -0.0095507277 -0.027917264 0.0088158089 0.4203114
##
## $`as.factor(B)`
## diff lwr upr p adj
## 1-0 -0.01026019 -0.0160228 -0.004497566 0.0047878
It looks like there are significant differences between several groups, especially between 65+ and <18 (surprising?) and between 65+ and 18-24, among others. The genders are significantly different at the 0.05 alpha level.
Let’s check out if there are differences with gender and age group with clicks…
means1 <- with(ad.df.day, aggregate(x=list(Y=Clicks),
by=list(A=age_group, B=Gender),mean))
with(means1, interaction.plot(x.factor=A, trace.factor=B, response=Y, type='b',
main = "Clicks by gender and age group, day 10",
xlab = "Age groups", ylab= "Average Clicks"))
TukeyHSD(aov(Y~(as.factor(A))+(as.factor(B)), means1))
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Y ~ (as.factor(A)) + (as.factor(B)), data = means1)
##
## $`as.factor(A)`
## diff lwr upr p adj
## 18-24-<18 -0.0921921597 -0.11331052 -0.07107379 0.0000161
## 25-34-<18 -0.0976283786 -0.11874674 -0.07651001 0.0000108
## 35-44-<18 -0.0979647600 -0.11908312 -0.07684640 0.0000106
## 45-54-<18 -0.1008209662 -0.12193933 -0.07970260 0.0000087
## 55-64-<18 -0.0462250931 -0.06734346 -0.02510673 0.0009664
## 65+-<18 -0.0001628079 -0.02128117 0.02095556 1.0000000
## 25-34-18-24 -0.0054362189 -0.02655458 0.01568215 0.9158447
## 35-44-18-24 -0.0057726003 -0.02689097 0.01534576 0.8941317
## 45-54-18-24 -0.0086288066 -0.02974717 0.01248956 0.6374858
## 55-64-18-24 0.0459670665 0.02484870 0.06708543 0.0009969
## 65+-18-24 0.0920293517 0.07091099 0.11314772 0.0000163
## 35-44-25-34 -0.0003363814 -0.02145475 0.02078198 1.0000000
## 45-54-25-34 -0.0031925877 -0.02431095 0.01792578 0.9926247
## 55-64-25-34 0.0514032854 0.03028492 0.07252165 0.0005333
## 65+-25-34 0.0974655706 0.07634721 0.11858394 0.0000109
## 45-54-35-44 -0.0028562062 -0.02397457 0.01826216 0.9958620
## 55-64-35-44 0.0517396669 0.03062130 0.07285803 0.0005140
## 65+-35-44 0.0978019520 0.07668359 0.11892032 0.0000107
## 55-64-45-54 0.0545958731 0.03347751 0.07571424 0.0003795
## 65+-45-54 0.1006581583 0.07953979 0.12177652 0.0000088
## 65+-55-64 0.0460622852 0.02494392 0.06718065 0.0009855
##
## $`as.factor(B)`
## diff lwr upr p adj
## 1-0 0.000898213 -0.00572781 0.007524236 0.7513962
This shows visually that average clicks differ little between genders, while there are big differences between age groups. The Tukey HSD confirms that there is no significant difference between genders and shows which age groups are statistically different from one another.
For the correlation analysis, let’s look at just the numeric data…
library(corrplot)
library(gplots)
new.df <- as.data.frame(sapply(ad.df.day[,c(2:6,8,10)], as.numeric)) #if needed coerce to numeric
corrplot.mixed(corr=cor(new.df), upper = "ellipse", tl.pos = "lt",
col= colorpanel(50, "red", "gray50", "blue4"))
This correlation plot shows that, beyond obvious correlations like Age with age_group, there is an 82% correlation between Age and Signed_In, suggesting a strong linear relationship between these variables, even though this is simulated data. If these were real data, I’d expect a stronger correlation between Impressions and Clicks than just 13%, among other things.
Along with the correlation matrix, it is useful to see the quantile breakdown for our numeric variables.
quant <- function(x) {quantile(x,probs=0:10/10)}
apply(new.df[,1:dim(new.df)[2]],2,quant)
## Age Gender Impressions Clicks Signed_In age_group have_clicked
## 0% 0 0 0 0 0 1 0
## 10% 0 0 2 0 0 1 0
## 20% 0 0 3 0 0 1 0
## 30% 0 0 4 0 0 1 0
## 40% 23 0 4 0 1 2 0
## 50% 31 0 5 0 1 3 0
## 60% 38 0 5 0 1 4 0
## 70% 45 1 6 0 1 5 0
## 80% 52 1 7 0 1 5 0
## 90% 61 1 8 0 1 6 0
## 100% 106 1 20 4 1 7 1
With the data in numeric form, I’ll build a matrix for k-means clustering (an experiment, since it mixes binary and interval data).
library(cluster)
set.seed(84105) # my zip code
new.df.seg <- kmeans(new.df, centers = 3) #choosing 3 clusters
# summary of clusters
seg.summ <- function(data, groups){
aggregate(data, list(groups), function(x) mean(as.numeric(x)))
} #thanks to "R for Marketing Research and Analytics" by Chapman and Feit for this function!
#Now we can see some differences in the clusters in the data
seg.summ(new.df, new.df.seg$cluster)
## Group.1 Age Gender Impressions Clicks Signed_In
## 1 1 0.9550272 0.04514474 5.008762 0.14134898 0.06567445
## 2 2 31.2184120 0.53113028 4.998626 0.05466881 1.00000000
## 3 3 57.6833979 0.49046132 4.985688 0.08670597 1.00000000
## age_group have_clicked
## 1 1.000000 0.12995275
## 2 3.121837 0.05303699
## 3 5.744824 0.08235087
From the k-means we got 3 groups. I notice the greatest differences in Age and age_group. We can visualize the distributions of different metrics of our clusters with boxplots.
boxplot(new.df$Impressions~ new.df.seg$cluster, xlab ="Clusters", ylab = "Impressions") #not very different
boxplot(new.df$Age~ new.df.seg$cluster, xlab ="Clusters", ylab = "Age") #differs by age
boxplot(new.df$have_clicked~ new.df.seg$cluster, xlab ="Clusters", ylab = "Have clicks") #not different
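One caveat on the binary/interval mix noted earlier: k-means is distance-based, so wide-range columns (especially Age, 0-115) dominate the 0/1 columns. A hedged alternative sketch is to standardize before clustering:
new.df.scaled <- scale(new.df) # z-score each column so no single scale dominates
seg.scaled <- kmeans(new.df.scaled, centers = 3) # same k, standardized inputs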
We can also plot the clusters, but for a k-means fit of this size (and my computer) it’s computationally prohibitive.
#plot of clusters (not run)
clusplot(new.df, new.df.seg$cluster, color =T, shade = T,
labels= 4, lines = 0, main = "K-means clusters for 10th day")
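A workaround (a sketch, assuming a random subsample is representative) is to plot the clusters for a sample of rows, which keeps clusplot’s distance computations manageable:
idx <- sample(nrow(new.df), 5000) # hypothetical 5k-row sample
clusplot(new.df[idx,], new.df.seg$cluster[idx], color = T, shade = T,
         labels = 4, lines = 0, main = "K-means clusters, 5k-row sample")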
Since these data represent May 2012 we can figure out with the help of a calendar exactly what weekdays are represented.
ad.df$weekday <- paste('2012','05',ad.df$day, sep = '-') #using the $day variable to build the date
ad.df$weekday <- weekdays(as.Date(ad.df$weekday)) #convert from date to weekday (takes a long time)
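An aside: since there are only 31 distinct days, an equivalent but much faster approach (a sketch) is to convert each unique day once and index into the result:
day.map <- weekdays(as.Date(paste('2012', '05', 1:31, sep = '-')))
# ad.df$weekday <- day.map[ad.df$day] # same result as the line above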
d.order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday") # I want the weekdays in this order
dt.imp <- aggregate(ad.df$Impressions, by=list(weekday=ad.df$weekday),sum) # Weekday impressions
dt.clk <- aggregate(ad.df$Clicks, by=list(weekday=ad.df$weekday),sum) # Weekday clicks
dt.sgn <- aggregate(ad.df$Signed_In, by=list(weekday=ad.df$weekday),sum) # Weekday sign-ins
mt.imp <- aggregate(ad.df$Impressions, by=list(weekday=ad.df$day),sum) # Daily impressions
mt.clk <- aggregate(ad.df$Clicks, by=list(weekday=ad.df$day),sum) # Daily clicks
mt.sgn <- aggregate(ad.df$Signed_In, by=list(weekday=ad.df$day),sum) # Daily sign-ins
dt.imp$weekday <- factor(as.character(dt.imp$weekday), levels = d.order) # order data by correct weekday order
dt.imp <- dt.imp[order(dt.imp$weekday),]
dt.clk$weekday <- factor(as.character(dt.clk$weekday), levels = d.order)
dt.clk <- dt.clk[order(dt.clk$weekday),]
dt.sgn$weekday <- factor(as.character(dt.sgn$weekday), levels = d.order)
dt.sgn <- dt.sgn[order(dt.sgn$weekday),]
I made this simple plotting function to help generate visuals quickly.
vizr <- function(data, ylab, title){
require(ggplot2)
p<-ggplot(data, aes(weekday, x, group=1)) + geom_line() +
labs(x="Days", y= ylab ,title = title) + geom_smooth(method=lm)
return(p)
}
vizr(dt.imp, "# of Impressions", "Impressions by day of week")
vizr(dt.clk, "# of Clicks", "Clicks by day of week")
vizr(dt.sgn, "# of Sign-ins", "Sign-ins by day of week")
vizr(mt.imp, "# of Impressions", "Impressions by day of May")
vizr(mt.clk, "# of Clicks", "Clicks by day of May")
vizr(mt.sgn, "# of Sign-ins", "Sign-ins by day of May")
There is a definite cyclical pattern in all metrics (Impressions, Clicks, and Sign-ins), with peaks on Sundays and troughs on Saturdays. Impressions show no linear trend, remaining relatively constant for the month. Clicks trend upward toward the end of the month. Sign-ins trend downward, with a distinct drop over the last two weeks of the month of more than 100,000 on peak days. Across the week, Sunday is the big day; the rest of the week trends downward to a low on Saturday, with Sign-ins seeing a local peak on Thursdays.
At first I tried a basic OLS linear model regressing Clicks against Age, Impressions, and Signed_In status. I was skeptical about the results, expecting at best a weak R-squared. The approach was worth a try, but in the end every predictor turned out to be statistically significant (alpha = 0.05) in predicting Clicks. From what I learned about the correlations with Clicks, Age was the only mildly correlated variable, and with data this big I decided that the OLS approach was not going to give me an idea of what’s really going on.
# First model (not run)
model <- lm(Clicks ~ Impressions + Signed_In + Age, data = ad.df.day)
summary(model)
This makes sense, but due to the sheer volume of data every predictor comes out highly significant, so the model doesn’t really tell us anything; it’s full of what amount to way too many “false positives”.
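One common sanity check (a sketch, assuming a simple random subsample is representative) is to refit on a modest sample so the p-values aren’t driven purely by n:
set.seed(84105)
sub <- ad.df.day[sample(nrow(ad.df.day), 5000),] # hypothetical 5k-row sample
summary(lm(Clicks ~ Impressions + Signed_In + Age, data = sub))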
Instead of using regression analysis to find drivers for clicks, I decided to use the k-nearest-neighbors (kNN) algorithm to classify observations on the binary ‘Signed_In’ variable. By feeding new observations to the algorithm, one can “predict” sign-in behavior.
library(car)
library(class)
library(scales)
set.seed(841052) # seed before sampling so the subset is reproducible
nDT <- some(ad.df, 100000) # Let's work with a subset of the data (i.e. 100k rows of random data)
nDT$rand <- runif(nrow(nDT))
After running the algorithm I found that, with almost any k, the kNN had too many ties. This seems to happen because there are so many similar records that the algorithm can’t “break the ties”. To help, the ‘rand’ column of random noise created above breaks up the ties.
nDT.new <- as.data.frame(sapply(nDT[,c(2,4,5,10)], as.numeric))
#normalize continuous data for better performance with KNN
normy <- function(x){
return((x - min(x)) / (max(x) - min(x)))
}
nDT.new <- data.frame(apply(nDT.new, 2, normy))
summary(nDT.new)
## Age Impressions Clicks rand
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.1500 1st Qu.:0.00000 1st Qu.:0.2476
## Median :0.2524 Median :0.2500 Median :0.00000 Median :0.5010
## Mean :0.2545 Mean :0.2501 Mean :0.02462 Mean :0.5006
## 3rd Qu.:0.4466 3rd Qu.:0.3000 3rd Qu.:0.00000 3rd Qu.:0.7520
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
Now I’ll set up my training and testing sets; I like to keep 10% for testing.
train <- nDT.new[1:90000,]
test <- nDT.new[90001:nrow(nDT.new),]
train_target <- nDT[1:90000,6]
test_target <- nDT[90001:nrow(nDT),6] # the 6th column has the actual sign-in data
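(Because nDT was drawn randomly by car::some(), this head/tail split is effectively random; with ordered data you’d want an explicit shuffle. An equivalent index-based sketch, not run here:)
# idx <- sample(nrow(nDT.new), 0.9 * nrow(nDT.new))
# train <- nDT.new[idx,]; test <- nDT.new[-idx,]
# train_target <- nDT[idx, 6]; test_target <- nDT[-idx, 6]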
# KNN model
k <- sqrt(nrow(nDT)) # choosing k as sqrt of length of observations
k # I'll round up to the next odd number
## [1] 316.2278
model1 <- knn(train=train, test=test, cl=train_target, k=317)
Awesome! So, how accurate was the model? I’ll use a basic confusion-matrix approach.
mod1 <-table(test_target, model1)
mod1
## model1
## test_target 0 1
## 0 4095 0
## 1 24 5881
sum(diag(mod1))/sum(mod1) #overall model accuracy
## [1] 0.9976
sum(mod1[1,1])/sum(mod1[,1]) #model precision
## [1] 0.9941733
sum(mod1[1,1])/sum(mod1[1,]) #model recall
## [1] 1
sum(mod1[2,1])/sum(mod1[2,]) #model false positive rate
## [1] 0.004064352
Overall this model did a great job of classifying ‘Signed_In’ behavior. When I ran it, it incorrectly categorized 24 records as ‘0’ (not signed in), for a false positive rate of 0.4%. It also achieved 100% recall (it found all of the ‘0’ records).
SELECT SupplierID,
       COUNT(*) AS count
FROM BikeSupplier JOIN Inventory ON (BikeSupplier.BikeID = Inventory.BikeID)
GROUP BY SupplierID
SELECT CustID
FROM Rentals
WHERE DURATION >= 5
SELECT name, COUNT(*) AS order_count
FROM Orders
WHERE name LIKE 'N%'
GROUP BY name
HAVING COUNT(*) > 4
SELECT * FROM (SELECT BikeID FROM Inventory) s1
FULL OUTER JOIN (SELECT BikeID, BIKE_NAME FROM Bikes) s2
ON s1.BikeID = s2.BikeID
WHERE s1.BikeID IS NULL