This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
library(readr) # for reading csv files
library(ggplot2) # for visualization
library(dplyr) # for data manipulation
library(stringr) # for working with strings
library(knitr) # for more appealing tables
library(ggrepel) # for better point-labelling
library(gplots) # for balloonplot
library(lubridate) # for manipulation of date and time
library(caTools) # for random splitting
library(forecast) # for timeseries prediction
library(tseries) # for stationary hypothesis testing
#library(tsne) # for dimension-reduction
#library(cluster) # for Gower dissimilarity
This report is based on the content and data provided by ExoClick as part of the application process for a data science position.
I have tried to prepare the report as a showcase of my technical abilities and understanding, while keeping it relevant and faithful to the structure suggested by the application process.
In practice, this report is too lengthy for a short-turnaround analysis, i.e. its scope should be reduced to meet the time-limit requirements. On the other hand, it would also be possible to dig deeper into some parts of the analysis.
The next sections cover Reading Data, Making Sense of the Data (another name for exploratory data analysis), Analysis of eCPM, Temporal Analysis, and Prediction of future eCPM. At the end, some recommendations are gathered in the Suggestions section.
First, we need to read the data.
# readr::read_csv() is faster and more capable than base::read.csv()
exo_data <- read_csv("/Users/Shaahin/Downloads/exoclick challenge/ExoClick_challenge.csv")
Having read the data, I calculate basic statistics of the dataset, and refine the variables.
#summarization of the dataset
summary(exo_data)
## dattime client country ad_type
## Length:96223 Min. : 1.00 Length:96223 Min. : 2.000
## Class :character 1st Qu.: 4.00 Class :character 1st Qu.: 2.000
## Mode :character Median :13.00 Mode :character Median : 2.000
## Mean :16.67 Mean : 6.642
## 3rd Qu.:25.00 3rd Qu.:10.000
## Max. :73.00 Max. :17.000
## total_impressions total_value
## Min. : 1.0 Min. : 0.0000
## 1st Qu.: 7.0 1st Qu.: 0.0201
## Median : 42.0 Median : 0.1400
## Mean : 4045.5 Mean : 4.0022
## 3rd Qu.: 378.5 3rd Qu.: 1.1045
## Max. :941289.0 Max. :846.0000
# a glimpse of the data
glimpse(exo_data)
## Observations: 96,223
## Variables: 6
## $ dattime <chr> "2016-06-06*06:00:00", "2016-06-12*22:00:00"...
## $ client <int> 1, 2, 1, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,...
## $ country <chr> "EST", "NCL", "DZA", "EST", "DZA", "NCL", "N...
## $ ad_type <int> 10, 10, 10, 17, 10, 10, 2, 2, 2, 17, 2, 2, 1...
## $ total_impressions <int> 137, 4424, 271, 1, 11701, 1496, 16, 1, 8, 1,...
## $ total_value <dbl> 0.13700, 91.72066, 0.27100, 0.00100, 4.87190...
So there are 6 columns/variables and 96,223 records/observations in the dataset.
Some variables need their classes changed. For instance, country and ad_type should be factor variables. More importantly, the dattime variable must be converted to a date or POSIXct class. Some data cleaning and manipulation are needed here.
# for coercing the country variable into factor
exo_data$country <- factor(exo_data$country)
exo_data$ad_type <- factor(exo_data$ad_type)
exo_data$client <- factor(exo_data$client)
# extraction of dates
exo_data$date<- str_split_fixed(string =exo_data$dattime ,
pattern = "\\*" ,
n = 2 )[,1]
# extraction of time
exo_data$time <- str_split_fixed(string =exo_data$dattime ,
pattern = "\\*" ,
n = 2 )[,2]
# coercing the new variables into proper class
exo_data$date <- ymd(exo_data$date)
exo_data$time <- hms(exo_data$time)
# removing the data*time variable
exo_data$dattime <- NULL
# glimpse of the data
glimpse(exo_data)
## Observations: 96,223
## Variables: 7
## $ client <fctr> 1, 2, 1, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ country <fctr> EST, NCL, DZA, EST, DZA, NCL, NCL, NCL, DZA...
## $ ad_type <fctr> 10, 10, 10, 17, 10, 10, 2, 2, 2, 17, 2, 2, ...
## $ total_impressions <int> 137, 4424, 271, 1, 11701, 1496, 16, 1, 8, 1,...
## $ total_value <dbl> 0.13700, 91.72066, 0.27100, 0.00100, 4.87190...
## $ date <date> 2016-06-06, 2016-06-12, 2016-06-05, 2016-06...
## $ time <S4: Period> 6H 0M 0S, 22H 0M 0S, 20H 0M 0S, 17H 0...
# how many missing values in the dataset?
sum(is.na(exo_data))
## [1] 0
As can be seen above, the data is now in the expected shape. Having prepared the data for analysis, the next step is making sense of it.
Making sense of the data is another name for exploratory data analysis (EDA). I evaluate the dataset at the univariate, bivariate, and trivariate levels, and it is possible to go further into higher dimensions.
Exploratory data analysis is the first step, and the cornerstone, of the analysis. I start with an examination of single variables and then move beyond single variables towards a more holistic view of the data.
#number of clients
exo_data %>%
select(client) %>%
distinct() %>%
summarise(number_of_clients=n())
## # A tibble: 1 x 1
## number_of_clients
## <int>
## 1 73
#clients with the most records in our data
exo_data %>%
select(client) %>%
group_by(client) %>%
mutate(order_number = n()) %>%
distinct() %>%
arrange(desc(order_number)) %>%
head()
## # A tibble: 6 x 2
## # Groups: client [6]
## client order_number
## <fctr> <int>
## 1 4 8672
## 2 3 7353
## 3 8 7325
## 4 1 7319
## 5 10 4178
## 6 14 2774
exo_data %>%
select(client) %>%
ggplot() +
geom_bar(aes(x = client)) +
theme_linedraw() +
ggtitle("Clients vs. Ad Appearance Count")
Clients 4, 3, 8, and 1 have the highest number of records in our dataset. However, this does not necessarily mean that they have the highest impressions or, consequently, that they generated the highest revenue. A quick aggregation, sketched below, can check this directly.
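The following is a minimal sketch using the already-loaded dplyr verbs; it simply compares record counts with total impressions and total value per client.
# sketch: compare record counts with total impressions and total revenue
# per client, to see whether many records imply high impressions/value
exo_data %>%
  group_by(client) %>%
  summarise(records = n(),
            sum_impressions = sum(total_impressions),
            sum_value = sum(total_value)) %>%
  arrange(desc(sum_value)) %>%
  head()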
exo_data %>%
select(country) %>%
ggplot() +
geom_bar(aes(x = country), fill = "blue") +
theme_linedraw() +
ggtitle("Country vs. Ad Appearance Count")
There are only four countries in which the ads are shown/viewed. The number of records for the country “EST” is the highest, but that does not necessarily mean its impressions or revenue are the highest.
exo_data %>%
select(ad_type)%>%
ggplot() +
geom_bar(aes(x = ad_type), fill = "green") +
theme_linedraw() +
ggtitle("Ad type vs. Appearance times")
The figure above shows that we have four ad types. The most common ad_type is type 2, followed by 10, 17, and 15. Ad_type 15 appears to be the rarest type of ad shown to visitors.
exo_data %>%
select(date) %>%
summary()
## date
## Min. :2016-06-01
## 1st Qu.:2016-06-08
## Median :2016-06-16
## Mean :2016-06-15
## 3rd Qu.:2016-06-23
## Max. :2016-06-29
exo_data %>%
select(date) %>%
ggplot() +
geom_bar(aes(x=date), fill = "gray") +
theme_linedraw() +
ggtitle("Ad appearance times for each day")
So all ads are shown in June 2016, and the graph shows that the number of appearances per day is relatively higher at the end of the month.
It is possible to check the weekday of the appearances.
exo_data$wday <- wday(x = exo_data$date, label = TRUE)
exo_data %>%
select(wday) %>%
ggplot() +
geom_bar(aes(x = wday)) +
theme_linedraw() +
ggtitle("Ad appearance times for each weekday")
It is interesting that while the number of appearances over the weekdays is more or less equal, Wednesdays have a much higher appearance count in our records. In other words, the ads have been viewed on Wednesdays more than on any other weekday.
exo_data$hour <- hour(exo_data$time)
exo_data %>%
select(hour) %>%
ggplot() +
geom_bar(aes(x = hour )) +
theme_linedraw() +
ggtitle("Ad appearance times per each hour")
Appearances are almost uniformly distributed over the 24 hours of the day, except for the hours around 12:00, during which the ads appeared relatively less often.
exo_data %>%
select(total_impressions) %>%
group_by(total_impressions) %>%
mutate(freq = n()) %>%
distinct() %>%
arrange(desc(freq))
## # A tibble: 10,790 x 2
## # Groups: total_impressions [10,790]
## total_impressions freq
## <int> <int>
## 1 1 6594
## 2 2 4893
## 3 3 3500
## 4 4 2978
## 5 5 2371
## 6 6 2316
## 7 7 1881
## 8 8 1701
## 9 9 1514
## 10 10 1355
## # ... with 10,780 more rows
exo_data %>%
select(total_impressions) %>%
arrange(desc(total_impressions))
## # A tibble: 96,223 x 1
## total_impressions
## <int>
## 1 941289
## 2 924823
## 3 911728
## 4 910885
## 5 908618
## 6 895867
## 7 887437
## 8 886572
## 9 886478
## 10 874973
## # ... with 96,213 more rows
exo_data %>%
select(total_impressions) %>%
ggplot() +
geom_histogram(aes(x =total_impressions ), bins = 100) +
theme_linedraw() +
ggtitle("Distribution of total impressions")
kable(quantile(x = exo_data$total_impressions , probs = seq(0,1,0.1)))
|      |         x |
|:-----|----------:|
| 0% | 1.0 |
| 10% | 2.0 |
| 20% | 5.0 |
| 30% | 10.0 |
| 40% | 21.0 |
| 50% | 42.0 |
| 60% | 92.0 |
| 70% | 218.0 |
| 80% | 731.0 |
| 90% | 3346.8 |
| 100% | 941289.0 |
The distribution of total_impressions is hugely skewed; while most appearances receive between 1 and 10 impressions, there are appearances with around 1,000,000 impressions. The figure shows the extremely right-skewed distribution of total_impressions.
The last table shows the percentiles of total_impressions, to look at this skewed distribution from another perspective. More than 60% of the appearances receive fewer than 100 impressions, while some have extremely high counts. It is interesting to delve deeper into the latter group; a log-scale histogram, sketched below, also makes the skew easier to inspect.
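A minimal sketch: the same histogram on a log10 x-axis, which is possible here because the minimum impression count is 1, so no observations are lost.
# sketch: a log10 x-axis spreads out the right-skewed impression counts and
# makes the bulk of the distribution visible
exo_data %>%
  ggplot() +
  geom_histogram(aes(x = total_impressions), bins = 100) +
  scale_x_log10() +
  theme_linedraw() +
  ggtitle("Distribution of total impressions (log10 scale)")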
exo_data %>%
select(total_value) %>%
summary()
## total_value
## Min. : 0.0000
## 1st Qu.: 0.0201
## Median : 0.1400
## Mean : 4.0022
## 3rd Qu.: 1.1045
## Max. :846.0000
kable(quantile(x = exo_data$total_value , probs = seq(0,1,0.1)))
|      |           x |
|:-----|------------:|
| 0% | 0.0000000 |
| 10% | 0.0040000 |
| 20% | 0.0130000 |
| 30% | 0.0305158 |
| 40% | 0.0700000 |
| 50% | 0.1400000 |
| 60% | 0.2800000 |
| 70% | 0.6550000 |
| 80% | 1.9500000 |
| 90% | 6.6819140 |
| 100% | 846.0000000 |
exo_data %>%
select(total_value) %>%
ggplot() +
geom_histogram(aes(x = total_value)) +
theme_linedraw() +
ggtitle("Distribution of total value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As expected from the total impressions, total value is also extremely skewed. The summary statistics and quantiles show that half of the appearances yield less than 0.15 units of revenue.
One important thing to evaluate is the value generation of each client. I will investigate this idea in the last graphs of this subsection.
exo_data %>%
select(client,total_value) %>%
group_by(client) %>%
mutate(freq = n() ,
avg_value = sum(total_value)/freq) %>%
select(client, avg_value) %>%
distinct() %>%
filter(avg_value > 10 ) %>%
#arrange(desc(avg_value)) %>%
ggplot() +
geom_col(aes(y = avg_value , x = reorder(client,avg_value)),
fill = "darkgreen") +
coord_flip() +
theme_linedraw() +
xlab(label = "Clients") +
ylab(label = "Average Value per Appearance") +
ggtitle("Clients with avg values above 10 value unit")
exo_data %>%
select(client,total_value) %>%
group_by(client) %>%
mutate(freq = n() ,
avg_value = sum(total_value)/freq) %>%
select(client, avg_value) %>%
distinct() %>%
filter(avg_value < 0.5 ) %>%
#arrange(desc(avg_value)) %>%
ggplot() +
geom_col(aes(y = avg_value , x = reorder(client,avg_value)),
fill = "orange") +
coord_flip() +
theme_linedraw() +
xlab(label = "Clients") +
ylab(label = "Average Value per Appearance") +
ggtitle("Clients with avg values below 0.5 value unit")
It is important to know the value-generating clients, as well as the clients with very low performance. This idea will be revisited in the later sections, after the introduction of eCPM.
Examining each variable in isolation, i.e. regardless of its relationships with the other variables, is informative and useful but not sufficient. More holistic approaches, which consider several variables simultaneously, are needed to understand a dataset.
In this section I examine the relationships between pairs of variables.
# coercing the client-country contingency table to a data.frame
client_country_data <- as.data.frame.matrix(table(exo_data$client , exo_data$country))
# generating a PCA-Biplot out of the above data.frame
## standardization of the variables
data_scaled <- scale(client_country_data)
data_pca <- prcomp(data_scaled)
# PCA
components_load<- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)
# setting up the coordinates and vectors datasets
coordinates <- data.frame(clients = 1:73, data_pca$x[,1:2])
vectors <- data.frame(Variables = colnames(client_country_data),
data_pca$rotation[, 1:2])
ggplot() +
geom_point(data = coordinates, aes(x = PC1 , y = PC2),
size = 4,
color = "black",
alpha = 0.5) +
geom_text(data = coordinates,
aes(x = PC1 , y = PC2, label = clients),
color = "cyan",
size = 2) +
geom_segment(data = vectors , aes(x = PC1*4 , y = PC2*4, xend=0, yend=0),
color = "red",
alpha =1) +
geom_text_repel(data = vectors ,
aes(PC1*4 , PC2*4 , label = Variables ),
color = "red",
alpha = 0.5,
size = 2)+
xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
theme_linedraw() +
coord_fixed(ratio = 1) +
ggtitle("Client-Country Contingency table PCA-Biplot")
It seems that while most clients have a similar country profile, i.e. the number of times their ads have been shown in each country, there are two distinct clusters with different profiles. One includes clients 2, 31, 27, 5, 32, and 20, which get most of their views from NCL; the other contains clients 4, 3, 1, and 8, which get most of their views from countries other than NCL, but with much higher frequency than the majority of clients.
# coercing the client-ad_type contingency table to a data.frame
client_adtype_data <- as.data.frame.matrix(table(exo_data$client ,
exo_data$ad_type))
# generating a PCA-Biplot out of the above data.frame
## standardization of the variables
data_scaled <- scale(client_adtype_data)
data_pca <- prcomp(data_scaled)
# PCA
components_load<- summary(data_pca)$importance[2,1:2]
components_load <- round(components_load,2)
# setting up the coordinates and vectors datasets
coordinates <- data.frame(clients = 1:73, data_pca$x[,1:2])
vectors <- data.frame(Variables = colnames(client_adtype_data),
data_pca$rotation[, 1:2])
ggplot() +
geom_point(data = coordinates, aes(x = PC1 , y = PC2),
size = 4,
color = "black",
alpha = 0.5) +
geom_text(data = coordinates,
aes(x = PC1 , y = PC2, label = clients),
color = "cyan",
size = 2) +
geom_segment(data = vectors , aes(x = PC1*4 , y = PC2*4, xend=0, yend=0),
color = "red",
alpha =1) +
geom_text_repel(data = vectors ,
aes(PC1*4 , PC2*4 , label = Variables ),
color = "red",
alpha = 0.5,
size = 4)+
xlab(label = paste0("PC1(",components_load[1]*100,"%)"))+
ylab(label = paste0("PC2(",components_load[2]*100,"%)"))+
theme_linedraw() +
coord_fixed(ratio = 1) +
ggtitle("Client-Adtype Contingency table PCA-Biplot")
Some outliers are visible. Client 1 has used ad_type 15 heavily, while clients 4, 3, and 8 have used ad_types 17, 2, and 10 far more than the others. Ad_type 15 appears to be very rare in general, and client 1 seems to be a huge customer of our ads.
exo_data %>%
group_by(ad_type) %>%
summarise(sum_impression = sum(total_impressions),
count = n(),
avg_impression = sum_impression/count) %>%
arrange(desc(avg_impression))
## # A tibble: 4 x 4
## ad_type sum_impression count avg_impression
## <fctr> <int> <int> <dbl>
## 1 2 263716764 53170 4959.87895
## 2 10 125036302 27766 4503.21624
## 3 17 448198 12916 34.70099
## 4 15 69652 2371 29.37663
exo_data %>%
ggplot(aes(x = ad_type , y = total_impressions)) +
geom_boxplot(outlier.colour = "red" , outlier.shape = 1, alpha = 0.2) +
theme_linedraw() +
ggtitle("Boxplots of total impressions for each ad type")
It can clearly be seen that the number of impressions per ad occurrence is much higher for ad types 2 and 10. In other words, ad types 2 and 10 probably receive far more impressions than 17 and 15. This is only a correlation, and we need to evaluate such a hypothesis in the presence of other variables.
exo_data %>%
group_by(country) %>%
dplyr::summarise(sum_impression = sum(total_impressions),
count = n(),
avg_impression = sum_impression/count) %>%
arrange(desc(avg_impression))
## # A tibble: 4 x 4
## country sum_impression count avg_impression
## <fctr> <int> <int> <dbl>
## 1 DZA 309647885 23475 13190.5382
## 2 EST 45229803 28333 1596.3648
## 3 ISL 22124193 23652 935.4047
## 4 NCL 12269035 20763 590.9086
The average number of impressions for DZA is much higher than for the other countries, and this is not driven by DZA's ad_type mix, since the contingency table later in this section shows that the countries have similar proportions of each ad_type.
Why not boxplots? That is a fair question here. We have a categorical variable and a continuous one, and the goal is to compare the observations belonging to each category, so boxplots seem like the first choice. However, since the distributions are so heavily skewed, the boxplots are distorted to the point of being nearly useless.
exo_data %>%
ggplot(aes(x = country , y = total_impressions)) +
geom_boxplot(outlier.colour = "red" , outlier.shape = 1, alpha = 0.2) +
theme_linedraw() +
ggtitle("Boxplots of total impressions over each country")
As can be seen in the graph, the boxes of the boxplots are compressed into horizontal lines because the IQR is tiny compared to the full range of the distribution. The red dots are outliers. A log-scale version, sketched below, makes the boxes visible again.
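A minimal sketch of the workaround: the same boxplots on a log10 y-axis; since total_impressions has a minimum of 1, no observations are dropped.
# sketch: log10 y-axis version of the country boxplots, so the boxes are no
# longer flattened by the extreme outliers
exo_data %>%
  ggplot(aes(x = country, y = total_impressions)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 1, alpha = 0.2) +
  scale_y_log10() +
  theme_linedraw() +
  ggtitle("Boxplots of total impressions over each country (log10 scale)")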
exo_data %>%
group_by(country) %>%
dplyr::summarise(sum_value = sum(total_value),
count = n(),
avg_value = sum_value/count) %>%
arrange(desc(avg_value))
## # A tibble: 4 x 4
## country sum_value count avg_value
## <fctr> <dbl> <int> <dbl>
## 1 NCL 117898.0 20763 5.678276
## 2 DZA 113720.3 23475 4.844317
## 3 EST 126952.4 28333 4.480727
## 4 ISL 26529.1 23652 1.121643
exo_data %>%
ggplot(aes(x = country , y = total_value)) +
geom_boxplot(outlier.colour = "red" , outlier.shape = 1, alpha = 0.2) +
theme_linedraw() +
ggtitle("Boxplots of total value over each country")
The average value per appearance of the ads seen in NCL is higher than in the other countries. The boxplots, however, show that the IQRs of the categories are apparently similar; the outliers make the difference.
exo_data %>%
group_by(ad_type) %>%
dplyr::summarise(sum_value = sum(total_value),
count = n(),
avg_value = sum_value/count) %>%
arrange(desc(avg_value))
## # A tibble: 4 x 4
## ad_type sum_value count avg_value
## <fctr> <dbl> <int> <dbl>
## 1 10 153460.70736 27766 5.52692888
## 2 2 231066.61482 53170 4.34580807
## 3 15 95.28606 2371 0.04018813
## 4 17 477.31888 12916 0.03695563
exo_data %>%
ggplot(aes(x = ad_type , y = total_value)) +
geom_boxplot(outlier.colour = "red" , outlier.shape = 1, alpha = 0.2) +
theme_linedraw() +
ggtitle("Boxplots of total value over each ad type")
Ad types 10 and 2 generate a much higher average value per appearance than the other two ad types. Is this enough evidence to focus on these two types and improve 15 and 17? The boxplots show that the IQRs are apparently similar, but the outliers make the difference.
# in order to evaluate proportion of ad_types in each country
country_type_table<-round(prop.table(table(exo_data$country , exo_data$ad_type),margin = 1)*100,2)
country_type_table
##
## 2 10 15 17
## DZA 52.71 26.83 4.49 15.97
## EST 56.31 28.19 1.86 13.64
## ISL 56.49 28.04 1.57 13.90
## NCL 55.29 32.99 2.02 9.70
balloonplot(x = t(country_type_table), label = FALSE, main = "Country vs Type: row")
# in order to evaluate distribution of ad_type over countries
country_type_table<-round(prop.table(table(exo_data$country , exo_data$ad_type),margin = 2)*100,2)
country_type_table
##
## 2 10 15 17
## DZA 23.27 22.68 44.41 29.03
## EST 30.01 28.77 22.27 29.92
## ISL 25.13 23.88 15.65 25.46
## NCL 21.59 24.67 17.67 15.59
balloonplot(x = t(country_type_table), label = FALSE, main = "Country vs Type: column")
So ad_types 2 and 10 result in more impressions, and similarly the countries DZA and EST yield more impressions, especially DZA.
exo_data %>%
dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value) ,
alpha =0.5, col = "skyblue") +
theme_linedraw() +
ggtitle("Total impressions vs total value")
According to the graph above, it seems that the points follow several different linear patterns: one on the left, in which total_value seems unrelated to total_impressions, and, in contrast, one on the right, in which an increase in total_impressions improves total_value, but to a much lesser extent.
Here I try to find relationships and possible irregularities among three chosen variables of the dataset. It is possible to approach this goal using either facets or scatterplots with color-mapping or shape-mapping.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value) ,
alpha =0.5, col = "skyblue") +
theme_linedraw() +
ggtitle("Total impressions vs total value on ad_type facets") +
facet_grid(.~ad_type)
The graphs above show that for ad_types 15 and 17, both total value and total impressions are uniformly low across all occurrences. The behaviours of ad_types 10 and 2 are somewhat similar. There are outliers in both, in which very high values are gained from very low impressions. Some clear linear relationships are also visible, which may imply the influence of another variable.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value) ,
alpha =0.5, col = "skyblue") +
theme_linedraw() +
ggtitle("Total impressions vs total value on country facets") +
facet_grid(.~country)
Similar to the previous facet plot, total_value and total_impressions are plotted on country facets. The resulting graphs are interesting. NCL shows no relationship between total_impressions and total_value, which suggests that other factors are decisive for total_value in the ads related to this country.
ISL ads have attracted low impressions and generated low value as well. DZA shows several linear patterns with outliers.
This set of plots suggests that the behaviour of the ads differs between countries, so it is better not to lump them together for analysis.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value) ,
alpha =0.5, col = "skyblue") +
theme_linedraw() +
ggtitle("Total impressions vs total value on weekday facets") +
facet_grid(.~wday)
The value outliers occur on Wednesdays. Should we remove them before the final analysis?
While the facet sets of graphs are useful, true trivariate visualization calls for a single graph of three variables. Here I explore a few such graphs.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value, color = country) ,
alpha =0.5) +
theme_linedraw() +
ggtitle("Total impressions vs total value, country color-mapped")
Here it is clearer that for countries such as NCL and EST, total impressions are not a good predictor of total value (why?). The outliers come from DZA.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value, color = ad_type) ,
alpha =0.5) +
theme_linedraw() +
ggtitle("Total impressions vs total value, ad_type color-mapped")
The ads are dominated by types 2 and 10. There is no single pattern for these two types; however, it is clear that the occurrences come from two different structures. In one, residing on the vertical lines to the left, total impressions are not a decisive factor for total value; in the other group, there are some linear relationships between the two variables.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value, color = client) ,
alpha =0.5) +
theme_linedraw() +
ggtitle("Total impressions vs total value, client color-mapped")
The number of clients is too high, which makes the legend of the plot cluttered, and the colors are not distinguishable in many cases. Still, it is interesting that all the outliers come from a single color, possibly client 51. Does that ring any bell? A quick check, sketched below, would show which client the extreme values belong to.
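A minimal sketch of that check: list the largest total_value records together with the client, country, and ad_type they belong to.
# sketch: inspect the records with the highest total_value
exo_data %>%
  arrange(desc(total_value)) %>%
  dplyr::select(client, country, ad_type, total_value) %>%
  head(10)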
While trivariate visualization is useful, the emergent properties and structures of the data are revealed in visualizations of higher dimensions.
We can still use either the facet approach or the single-graph approach; I explore one graph of each.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value) ,
alpha =0.5, col = "skyblue") +
theme_linedraw() +
ggtitle("Total impressions vs total value: country and type facets") +
facet_grid(country~ad_type)
The graph above shows that types 15 and 17 do not generate value and do not receive impressions in any country. This is a point for improvement.
The graph also suggests that the behaviours of the ad_types within a single country are similar, while the same ad_type can behave very differently across countries, especially for ad_types 2 and 10.
It is also possible to present four variables in one graph.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions,
y = total_value,
color = country,
shape = ad_type) ,
alpha =0.5) +
theme_linedraw() +
ggtitle("Impressions vs Value: country and ad_tye")
This graph is not very informative, since the shapes are hardly discernible. It is also possible to map stroke or size as additional aesthetics. In general, the more information is added to a graph, the harder it becomes to understand.
While there is no universal definition, a dataset with more than four variables can be considered high-dimensional. High-dimensional visualization includes dimension-reduction methods such as PCA, MDS, SOM, and t-SNE. It is also still possible to use facets, depending on the dataset at hand.
exo_data %>%
#dplyr::select(total_impressions, total_value) %>%
ggplot() +
geom_point(aes(x = total_impressions, y = total_value, color = client) ,
alpha =0.5) +
theme_linedraw() +
ggtitle("Total impressions vs total value: country and type facets") +
facet_grid(country~ad_type)
The above graph is not very informative though.
The final and, in my humble opinion, the most informative visualization is a high-dimensional plot that allows a holistic view of the data. For instance, it is possible to compute the Gower distance between the records of the dataset and then visualize the result using the t-SNE algorithm. However, my machine does not have enough RAM to do so.
What I tried instead was randomly sampling the dataset, clustering the sample, and then assigning the remaining records to the clusters; there seems to be no other way with such a large dissimilarity matrix and such a demanding algorithm. A sketch of this idea follows the commented-out code below.
However, even with a sample of half of the dataset, my machine still cannot run this algorithm, so I skip it and move on to the next parts of the report.
# It is possible to run the following on a random sample of the data
# (not executed here due to memory limits)
# set.seed(7)
# split_vec <- sample.split(Y = exo_data$ad_type , SplitRatio = 0.6 )
#
# dissimilarity <- exo_data %>%
# dplyr::select(country , ad_type , total_impressions , total_value ) %>%
# filter(split_vec) %>%
# daisy(metric = "gower", stand = TRUE )
# tsne_coo <- tsne(X = dissimilarity , k = 2 )
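For completeness, here is a rough sketch of the sample-then-assign idea on a small 5% sample, assuming the cluster package is installed; the sample fraction and the number of clusters are arbitrary illustrative choices.
# sketch of the sample-then-cluster idea; a 5% sample keeps the Gower
# dissimilarity matrix small enough to fit in memory
library(cluster)
set.seed(7)
sample_idx <- sample.split(Y = exo_data$ad_type, SplitRatio = 0.05)
sample_data <- exo_data %>%
  dplyr::select(country, ad_type, total_impressions, total_value) %>%
  filter(sample_idx)
# Gower dissimilarity handles the mix of factor and numeric variables
gower_dist <- daisy(sample_data, metric = "gower", stand = TRUE)
# partition the sample around k medoids (k = 4 is an arbitrary choice)
pam_fit <- pam(gower_dist, diss = TRUE, k = 4)
# the medoid rows could then be used to assign the remaining records to
# their nearest cluster
medoids <- sample_data[pam_fit$id.med, ]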
From the provided link:
CPM : Cost per mille, or cost per thousand, refers to the price a network charges for one thousand ad impressions
The eCPM formula is: eCPM = (total_value / total_impressions) × 1000.
So we need to divide total_value by total_impressions and multiply by the constant 1,000.
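To make the formula concrete, a small helper can compute it for any slice of the data; this is a sketch, and the ecpm() name is mine, not part of the challenge.
# sketch: illustrative helper implementing the eCPM formula
ecpm <- function(value, impressions) {
  1000 * sum(value) / sum(impressions)
}
# overall eCPM of the whole dataset
ecpm(exo_data$total_value, exo_data$total_impressions)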
The eCPM of each ad_type and country cross-section is calculated as follows. The table is sorted in descending order of eCPM.
exo_data %>%
dplyr::select(country,ad_type , total_impressions , total_value) %>%
group_by(country,ad_type) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
dplyr::select(country,ad_type,eCPM) %>%
distinct() %>%
arrange(desc(eCPM))
## # A tibble: 16 x 3
## # Groups: country, ad_type [16]
## country ad_type eCPM
## <fctr> <fctr> <dbl>
## 1 NCL 10 11.4542916
## 2 NCL 2 8.5654526
## 3 NCL 15 3.0135937
## 4 EST 2 2.8112432
## 5 EST 10 2.8027883
## 6 NCL 17 2.4223456
## 7 EST 17 2.3860749
## 8 ISL 10 1.8966165
## 9 ISL 17 1.7974555
## 10 DZA 15 1.3346723
## 11 ISL 15 1.2966387
## 12 EST 15 1.0375066
## 13 ISL 2 0.9138850
## 14 DZA 10 0.4933913
## 15 DZA 17 0.4522772
## 16 DZA 2 0.3070799
The table shows that ad_type 10 shown in NCL generates the highest eCPM. In the exploratory section we found evidence that ad_types 10 and 2 are the top value generators per appearance. Moreover, NCL and DZA are the countries with the highest average value per appearance.
exo_data %>%
dplyr::select(country,ad_type , total_impressions , total_value) %>%
group_by(country,ad_type) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
dplyr::select(country,ad_type,eCPM) %>%
distinct() %>%
arrange(desc(eCPM)) %>%
ggplot(aes(x = eCPM,y = 1)) +
geom_point(col = "red") +
facet_grid(country~ad_type) +
theme_linedraw()
exo_data %>%
dplyr::select(country,ad_type , total_impressions , total_value) %>%
group_by(country,ad_type) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
dplyr::select(country,ad_type,eCPM) %>%
distinct() %>%
arrange(desc(eCPM)) %>%
ggplot(aes(x = eCPM,y = 1)) +
geom_point(col = "red") +
facet_grid(ad_type~country) +
theme_linedraw()
exo_data %>%
dplyr::select(country,ad_type , total_impressions , total_value) %>%
group_by(country,ad_type) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
dplyr::select(country,ad_type,eCPM) %>%
distinct() %>%
arrange(desc(eCPM)) %>%
ggplot(aes(x = eCPM,y = 1)) +
geom_point(aes(col = ad_type)) +
facet_grid(country ~ .) +
theme_linedraw()
The three graphs above show the eCPM values of the different countries and ad_types. In the first, we can compare the position of the eCPM value vertically (within one ad type) or across the different cross-sections. It is clear that eCPM is higher for every ad_type in NCL compared to the other countries; the margin is very large for ad types 2 and 10.
To compare the eCPM of the ad_types within each country, the second graph is generated, which is basically the first graph rotated 90 degrees. For DZA, ad_type 15 has the highest eCPM; for EST, all types are similar except ad_type 15, which is lower than the others; in ISL, all ad_types perform similarly; and in NCL, ad_types 10 and 2 have higher values by a wide margin.
The third graph tries to capture the best of the two previous ones: not only can we see which ad_type has the highest eCPM in each country, we can also compare the eCPM of each ad_type across countries.
Here we would like to assess eCPM trends over time.
What is asked here is the daily time series of the eCPM for the NCL country:
Compute the eCPM for the country NCL for each day and ad type.
My interpretation is to segment, i.e. group, the data by day and ad_type, and then calculate the eCPM for all the ads of each type on that specific date. In other words, a total eCPM for each day, computed as the sum of total_value over the sum of total_impressions for the NCL country. Then:
Present these data in a line chart with the date in the x-axis to see the evolution of the eCPM for each ad type separately.
exo_data %>%
filter(country == "NCL") %>%
group_by(ad_type,date) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
ungroup() %>%
ggplot() +
geom_line(aes(x = date , y = eCPM, col = ad_type)) +
theme_linedraw() +
ggtitle("eCPM timeseries for each ad type")
The interesting point is the capricious behaviour of ad_type 15: this ad type swings from the lowest eCPM to the highest very quickly.
Are these time series stationary? This will be checked in the next section.
Based on the historical temporal data build a model that can predict the expected eCPM for the next day of the series (30th of June)
Since it is not clear which part of the data is meant to be used here, I continue with the data for the NCL country, separated by ad type. It could be made more specific, e.g. the eCPM of each client in a specific country, separated by ad type.
# time series of the ad_type2 in NCL
adtype2_ds <- exo_data %>%
filter(country == "NCL") %>%
group_by(ad_type,date) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
filter(ad_type == 2 ) %>%
dplyr::select(eCPM,date) %>%
distinct() %>%
arrange(date)
#--- for timeseries of a specific client
# client10_ds <- exo_data %>%
# filter( client==10 & country == "EST" & ad_type == 2 ) %>%
# group_by(date) %>%
# mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
# dplyr::select(eCPM,date) %>%
# distinct() %>%
# arrange(date)
Since the goal of this report is to demonstrate methods rather than obtain specific results, I choose to work on two time series: one is the eCPM of ad_type 2 in the NCL country, the other is the eCPM of ad_type 2 in the EST country for client 10.
In the following chunk, the time series are defined from the data.
adtype2_ts <- ts(data = adtype2_ds$eCPM,
start = 1,
end = 29,
frequency = 1)
# client10_ts <- ts(data = client10_ds$eCPM,
# start = 1,
# end = 29,
# frequency = 1)
Now that we have the time series, we have to check whether they are stationary. I perform this evaluation using the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test, both from the tseries package.
plot(adtype2_ts)
#lines(ma(adtype2_ts,order = 5), col = "red")
adf.test(x = adtype2_ts)
##
## Augmented Dickey-Fuller Test
##
## data: adtype2_ts
## Dickey-Fuller = -2.9841, Lag order = 3, p-value = 0.1953
## alternative hypothesis: stationary
kpss.test(x = adtype2_ts)
## Warning in kpss.test(x = adtype2_ts): p-value greater than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: adtype2_ts
## KPSS Level = 0.069686, Truncation lag parameter = 1, p-value = 0.1
From the plot we can hardly see any trend or even seasonality. But what do the tests say? The ADF test yields a p-value of 0.19, which means we fail to reject its null hypothesis that “the time series is non-stationary”. The KPSS test points the other way, since its null hypothesis is “the time series is stationary” and it is not rejected either. Given the ambiguity, I prefer to be conservative and treat the time series as non-stationary.
What to do now? I try to make the time series stationary by differencing.
adf.test(diff(adtype2_ts,differences = 2))
## Warning in adf.test(diff(adtype2_ts, differences = 2)): p-value smaller
## than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(adtype2_ts, differences = 2)
## Dickey-Fuller = -4.6908, Lag order = 2, p-value = 0.01
## alternative hypothesis: stationary
kpss.test(diff(adtype2_ts,differences = 2))
## Warning in kpss.test(diff(adtype2_ts, differences = 2)): p-value greater
## than printed p-value
##
## KPSS Test for Level Stationarity
##
## data: diff(adtype2_ts, differences = 2)
## KPSS Level = 0.067934, Truncation lag parameter = 1, p-value = 0.1
adtypes2_diff <- diff(adtype2_ts,differences = 2)
plot(adtypes2_diff)
With differencing of order 2, the ADF test indicates that the resulting time series is stationary, and the KPSS test agrees, so we can continue with the stationarity assumption. (The original time series also did not seem to have seasonality or trend; stats::decompose(), forecast::stl(), and forecast::ma() can help check this.) A quick cross-check of the differencing order is sketched below.
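As a sketch of that cross-check, forecast::ndiffs() estimates how many differences are needed for stationarity, applying the same unit-root tests internally.
# sketch: estimate the required order of differencing with both tests
ndiffs(adtype2_ts, test = "kpss")
ndiffs(adtype2_ts, test = "adf")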
I start with some basic methods and end up with ARIMA.
mean_model <- meanf(y = adtype2_ts, h = 1)
plot(mean_model)
#for forecasting the 30th day
forecast(object = mean_model ,h = 1)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 30 8.581628 8.030162 9.133093 7.720977 9.442278
#for evaluation of the accuracy of the model
accuracy(mean_model)
## ME RMSE MAE MPE MAPE
## Training set -4.287608e-16 0.4059091 0.3326476 -0.2172775 3.842548
## MASE ACF1
## Training set 0.7828553 0.1426178
The mean model is simply based on the average value of the time series. Nothing elaborate, but it can serve as a benchmark.
Accuracy can be measured by RMSE, for instance; here it is computed on the training data, so the pitfall of overfitting is close at hand.
naive_model <- naive(y = adtype2_ts, h = 1 )
plot(naive_model )
forecast(naive_model , h = 1 )
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 30 9.15306 8.488729 9.81739 8.137053 10.16907
accuracy(naive_model)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.04131527 0.51838 0.4249158 0.306316 4.882729 1
## ACF1
## Training set -0.4371021
The naive model is basically a random walk with drift in which the drift constant is 0. As a result, the mean of the series stays constant while the variance depends on time. In the naive model, the forecast for the 30th observation equals the actual value of the 29th observation of the series. The RMSE is even worse than for the mean approach. A drift variant, sketched below, could also be compared.
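A minimal sketch of that comparison, using forecast::rwf() to fit a random walk with a non-zero drift term.
# sketch: random walk with drift, to compare against the plain naive model
drift_model <- rwf(adtype2_ts, drift = TRUE, h = 1)
accuracy(drift_model)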
ar_model <- ar(x = adtype2_ts )
f <- forecast(object = ar_model , h = 1 )
plot(adtype2_ts, xlim = c(1,31))
points(x = 30 , y = f$mean[1], col = "blue" , pch = 19)
points(x = 30 , y = f$lower[2] , col = "gray", pch = 24)
points(x = 30 , y = f$upper[2] , col = "gray", pch = 25)
text(x = 30 , y = f$lower[2] , pos = 1 , labels = "lower 95%", cex = 0.5)
text(x = 30 , y = f$upper[2] , pos = 1 , labels = "upper 95%", cex = 0.5)
ar_model
##
## Call:
## ar(x = adtype2_ts)
##
## Coefficients:
## 1 2 3
## 0.1489 0.0813 -0.4201
##
## Order selected 3 sigma^2 estimated as 0.1541
accuracy(f)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.03602334 0.3552722 0.2898368 0.2493715 3.324556 0.6821041
## ACF1
## Training set 0.02344178
#CI
c(f$lower[2],f$upper[2])
## [1] 8.126400 9.665268
#forecast of 30th day
(f$mean[1])
## [1] 8.895834
The AR model is based on the assumption that the current value is a function of the previous value(s); how many previous values are used is determined by the model through a tuning process. As can be seen, the model has order 3, which means the value at time t depends on the values at times t-1, t-2, and t-3. The RMSE is better than that of the previous models.
Another model we could test is the moving average (MA) model, which assumes that the value of the series at time t depends on the white-noise terms of previous times. I do not use this method on its own here, since I want to finish with the ARIMA method, which includes both AR and MA components. (A quick standalone MA(1) fit is sketched below for comparison.)
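A sketch of that standalone fit, using forecast::Arima() with order (0, 0, 1), i.e. no AR terms, no differencing, and one MA term.
# sketch: a pure MA(1) fit for comparison with the AR model above
ma_model <- Arima(adtype2_ts, order = c(0, 0, 1))
accuracy(forecast(ma_model, h = 1))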
arima_model <- auto.arima(x = adtype2_ts)
arima_model
## Series:
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 8.5816
## s.e. 0.0754
##
## sigma^2 estimated as 0.1706: log likelihood=-15
## AIC=34 AICc=34.47 BIC=36.74
arima_f <- (forecast(arima_model, h = 1 ) )
accuracy(arima_f)
## ME RMSE MAE MPE MAPE
## Training set 1.347596e-15 0.4059091 0.3326476 -0.2172775 3.842548
## MASE ACF1
## Training set 0.7828553 0.1426178
plot(adtype2_ts, xlim = c(1,31), ylim = c(7.5,9.5))
points(x = 30 , y = arima_f$mean[1], col = "blue" , pch = 19)
points(x = 30 , y = arima_f$lower[2] , col = "gray", pch = 24)
points(x = 30 , y = arima_f$upper[2] , col = "gray", pch = 25)
text(x = 30 , y = arima_f$lower[2] , pos = 1 , labels = "lower 95%", cex = 0.5)
text(x = 30 , y = arima_f$upper[2] , pos = 1 , labels = "upper 95%", cex = 0.5)
#--------
arima_model2 <- arima(x = adtype2_ts , order = c(3,2,1))
arima_model2
##
## Call:
## arima(x = adtype2_ts, order = c(3, 2, 1))
##
## Coefficients:
## ar1 ar2 ar3 ma1
## -0.3560 -0.0456 -0.3335 -1.0000
## s.e. 0.1868 0.2228 0.2093 0.1105
##
## sigma^2 estimated as 0.196: log likelihood = -18.78, aic = 47.56
accuracy(forecast(arima_model2, h = 1))
## ME RMSE MAE MPE MAPE MASE
## Training set -0.03901056 0.4271951 0.3276154 -0.6081202 3.776461 0.7710125
## ACF1
## Training set -0.08768319
# write_csv(x = exo_data, path = "/Users/Shaahin/Downloads/exoclick\ challenge/ExoClick_data_refined.csv")
Two ARIMA models are trained, one using auto.arima() and the other by specifying the AR order, the degree of differencing, and the MA order. The second ARIMA has slightly better MAE and MAPE on the training set (though a slightly worse RMSE), but the hazard of overfitting remains.
The main hazard here is overfitting. It is better not to evaluate the models on the training data, and to use at least a train/test split of the dataset based on the date variable; a time-based holdout is sketched below.
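A minimal sketch of such a holdout, keeping the last five days of the series for testing; the 24/5 split point is an arbitrary illustrative choice.
# sketch: time-based holdout, training on days 1-24 and testing on days 25-29
train_ts <- window(adtype2_ts, end = 24)
test_ts  <- window(adtype2_ts, start = 25)
holdout_model <- auto.arima(train_ts)
# accuracy() then reports both training-set and test-set error measures
accuracy(forecast(holdout_model, h = length(test_ts)), test_ts)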
Moreover, removing outliers may help produce better forecasts; one simple, illustrative filter is sketched below.
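The 99.9th-percentile threshold here is an assumption for illustration, not something derived from the analysis.
# sketch: drop records above the 99.9th percentile of total_value before
# re-fitting the models
value_cap <- quantile(exo_data$total_value, probs = 0.999)
exo_trimmed <- exo_data %>% filter(total_value <= value_cap)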
It is also possible to come back to EDA, in order to incorporate eCPM in our analysis.
exo_data %>%
#select(client,total_value) %>%
group_by(client) %>%
mutate(eCPM = 1000*sum(total_value)/sum(total_impressions)) %>%
select(client, eCPM) %>%
distinct() %>%
filter(eCPM > 10 ) %>%
#arrange(desc(avg_value)) %>%
ggplot() +
geom_col(aes(y = eCPM , x = reorder(client,eCPM)),
fill = "darkgreen") +
coord_flip() +
theme_linedraw() +
xlab(label = "Clients") +
ylab(label = "eCPM") +
ggtitle("Clients with eCPM above 10 units")
The graph above is an example of coming back to the EDA. Is client 70 an outlier? We should investigate further. Such a plot also helps identify the clients with high eCPM and those with low eCPM; both are important.
Regarding the EDA, it is possible to revisit it and incorporate the eCPM variable into the visualizations. However, the main remaining problem is visualizing all the important variables together. The computational problem could be addressed with a cloud service or by searching for better algorithms. The outliers also demand further investigation.
Regarding the low-value countries and ad_types, we should either stop investing in the countries and ad types with low revenue, or optimize them using, for instance, A/B testing, if that is within our authority. Nevertheless, the success of the ad types is not identical across countries, and this has to be taken into account, as it may point to specific tastes of our audience in those countries.
Regarding the time-series forecasting, it is better not to evaluate our models on the training data, so separate train/test sets are the minimum requirement here. There are also other models, such as the generalized autoregressive conditional heteroscedasticity (GARCH) model, that could be used; a rough sketch follows. More data is needed, and the data should be used more intelligently. Here I only wanted to demonstrate the technical aspects, rather than search for answers to a specific problem; in the latter case, I would segment the feature space further and forecast each relatively small segment separately.
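A very rough sketch of a GARCH(1, 1) fit on the twice-differenced eCPM series using tseries::garch(); with fewer than 30 observations this is purely illustrative.
# sketch: GARCH(1, 1) on the differenced eCPM series; a serious fit would
# require a much longer series
garch_model <- garch(x = adtypes2_diff, order = c(1, 1))
summary(garch_model)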