CTR : Modeling & PCA

Our goal is to predict whether a user clicked on an ad. Because the ad data is less enriched, we use each user's feed data as the predictors.

# Source Domain :

library(readr)

train_data_feeds <- read_csv("Downloads/train_data_feeds.csv")

source_domain <- train_data_feeds

Investigating Observations :

source_domain : Appearance Ct

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [5]
##   u_userId Appeared_ct
##   <chr>          <int>
## 1 100001             6
## 2 100002             3
## 3 100003             6
## 4 100005             5
## 5 100006            26
## 6 100007             4

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   Appeared_ct     n
##         <int> <int>
## 1           1 45833
## 2           2 26836
## 3           3 16012
## 4           4 10966
## 5           5  7888
## 6           6  6036
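The two tables above can be read as a per-user row count followed by a count of those counts. The code that produced them is not shown, so the following is only a minimal sketch of one way to build them with dplyr. `toy_source` is a hypothetical stand-in for `source_domain`; only the column names `u_userId` and `Appeared_ct` come from the output above.

```r
library(dplyr)

# Hypothetical toy stand-in for source_domain: one row per feed impression.
toy_source <- data.frame(
  u_userId = c("100001", "100001", "100002", "100003", "100003")
)

# Rows per user (same shape as the first table above).
appearances <- toy_source %>%
  count(u_userId, name = "Appeared_ct")

# Number of users sharing each appearance count (same shape as the second table).
appearance_freq <- appearances %>%
  count(Appeared_ct, name = "n")

print(appearances)
print(appearance_freq)
```

In the toy data two users appear twice and one appears once, so `appearance_freq` has one row per distinct count.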

target_domain : Appearance Ct

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   user_id Appeared_ct
##   <chr>         <int>
## 1 100002            5
## 2 100005            6
## 3 100006           23
## 4 100008           15
## 5 100009           52
## 6 100010           90

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   Appeared_ct     n
##         <int> <int>
## 1           1   288
## 2           2  2719
## 3           3  2686
## 4           4  2323
## 5           5  2015
## 6           6  1720

Data Manipulation :

## [1] "intersect(source_domain$u_userId, target_domain$user_id)"
## [1] "length(common_user_ids)/n_distinct(source_domain$u_userId) :  0.362513393625467"
## [1] "length(common_user_ids)/n_distinct(target_domain$user_id) :  1"

Interpretation : Every user in the target (ad) domain also appears in the source (feed) domain, so restricting to the common user IDs keeps all target users and drops about 64% of the source-domain users.
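The printed expressions above suggest the subsetting step is an intersection of user IDs followed by a filter on each domain. This is a minimal sketch under that assumption. The toy data frames and the columns `feed_x` and `label` are hypothetical; `common_user_ids`, `source_subset` and `target_subset` are names that appear in this report's output.

```r
library(dplyr)

# Hypothetical toy data: feed rows (source) and ad rows (target).
toy_source <- data.frame(u_userId = c("1", "1", "2", "3", "4"), feed_x = 1:5)
toy_target <- data.frame(user_id = c("2", "2", "3"), label = c(0, 1, 0))

# Users present in both domains.
common_user_ids <- intersect(toy_source$u_userId, toy_target$user_id)

# Keep only rows belonging to those users.
source_subset <- toy_source %>% filter(u_userId %in% common_user_ids)
target_subset <- toy_target %>% filter(user_id %in% common_user_ids)

# Share of distinct users retained in each domain.
source_share <- length(common_user_ids) / n_distinct(toy_source$u_userId)
target_share <- length(common_user_ids) / n_distinct(toy_target$user_id)
```

Here `source_share` is 0.5 and `target_share` is 1, mirroring the pattern in the real data where every target user is retained but only part of the source is.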

how_much_everyone_appears appearance frequencies

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   u_userId Appeared_ct
##   <chr>          <int>
## 1 100002             3
## 2 100005             5
## 3 100006            26
## 4 100008             6
## 5 100009            38
## 6 100010           253

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   Appeared_ct     n
##         <int> <int>
## 1           1  5999
## 2           2  4791
## 3           3  3592
## 4           4  2927
## 5           5  2402
## 6           6  1963

target_subset appearance frequencies

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   user_id Appeared_ct
##   <chr>         <int>
## 1 100002            5
## 2 100005            6
## 3 100006           23
## 4 100008           15
## 5 100009           52
## 6 100010           90

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   Appeared_ct     n
##         <int> <int>
## 1           1   288
## 2           2  2719
## 3           3  2686
## 4           4  2323
## 5           5  2015
## 6           6  1720

Random Sampling :

## Warning in left_join(target_subset, source_subset, by = c(user_id = "u_userId"), : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 150718 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
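The warning arises because a user can have several rows in both tables, so the join pairs every target row with every source row for that user and the result grows multiplicatively. The sketch below illustrates the effect on toy data and shows one possible way to avoid it: sampling a single source row per user before joining. This is an illustration only, not necessarily the sampling scheme used in this report; `feed_val` and the toy data frames are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with repeated users on both sides.
target_toy <- data.frame(user_id = c("a", "a", "b"), label_target = c(0, 1, 0))
source_toy <- data.frame(u_userId = c("a", "a", "a", "b"),
                         feed_val = c(10, 11, 12, 20))

# Many-to-many join: 2 target rows x 3 source rows for "a", plus 1 row for "b".
joined_all <- left_join(target_toy, source_toy,
                        by = c(user_id = "u_userId"),
                        relationship = "many-to-many")

# Alternative (assumption): keep one randomly sampled source row per user.
set.seed(1)
source_one <- source_toy %>%
  group_by(u_userId) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Each target row now matches at most one source row.
joined_one <- left_join(target_toy, source_one,
                        by = c(user_id = "u_userId"),
                        relationship = "many-to-one")

nrow(joined_all)
nrow(joined_one)
```

Declaring `relationship` explicitly documents the intent: `"many-to-many"` silences the warning when the expansion is wanted, while `"many-to-one"` makes dplyr raise an error if the source side unexpectedly contains duplicates.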

Modeling

##           Reference
## Prediction      0      1
##          0 0.8920 0.0075
##          1 0.0996 0.0009

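Even with a low cutoff of 0.009, the model predicts ad clicks poorly. Of the 10.05% of observations predicted as clicks, only 0.09% are actual clicks, a precision of about 0.9%, which is barely above the 0.84% base click rate. The model also catches only about 11% of actual clicks (0.0009 of 0.0084). Ultimately this is what we would like to optimize.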
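To make the table easier to interpret, the proportions can be turned into precision, recall and accuracy. This is a small base R sketch that re-enters the four cell values shown above by hand; the object names are illustrative.

```r
# Confusion matrix proportions copied from the table above
# (rows = Prediction, columns = Reference; matrix() fills column by column).
cm <- matrix(c(0.8920, 0.0996, 0.0075, 0.0009), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

tn <- cm["0", "0"]  # predicted no click, no click
fn <- cm["0", "1"]  # predicted no click, actual click
fp <- cm["1", "0"]  # predicted click, no click
tp <- cm["1", "1"]  # predicted click, actual click

precision  <- tp / (tp + fp)  # share of predicted clicks that are real
recall     <- tp / (tp + fn)  # share of real clicks that are caught
accuracy   <- tp + tn         # cells are proportions, so they sum directly
prevalence <- tp + fn         # base click rate

round(c(precision = precision, recall = recall,
        accuracy = accuracy, prevalence = prevalence), 4)
```

Accuracy looks high only because clicks are rare; precision and recall are the more informative quantities for a class imbalance this strong.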

Frequency of levels per variable :

combined : Appearance Ct

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   user_id Appeared_ct
##   <chr>         <int>
## 1 100006           26
## 2 100009           38
## 3 100019          178
## 4 100022          684
## 5 100024         5373
## 6 100025          180

## # A tibble: 6 × 2
## # Groups:   Appeared_ct [6]
##   Appeared_ct     n
##         <int> <int>
## 1           1   843
## 2           2   933
## 3           3   670
## 4           4   740
## 5           5   524
## 6           6   594

Modeling workflow :

## # A tibble: 18,267,978 × 61
##    log_id label_target   age gender residence  city city_rank series_dev
##     <dbl> <fct>        <dbl>  <dbl>     <dbl> <dbl>     <dbl>      <dbl>
##  1 131360 0                3      2        41   135         5         11
##  2 131360 0                3      2        41   135         5         11
##  3 131360 0                3      2        41   135         5         11
##  4 131360 0                3      2        41   135         5         11
##  5 131360 0                3      2        41   135         5         11
##  6 131360 0                3      2        41   135         5         11
##  7 131360 0                3      2        41   135         5         11
##  8 131360 0                3      2        41   135         5         11
##  9 131360 0                3      2        41   135         5         11
## 10 131360 0                3      2        41   135         5         11
## # ℹ 18,267,968 more rows
## # ℹ 53 more variables: series_group <dbl>, emui_dev <dbl>, device_name <dbl>,
## #   device_size <dbl>, net_type <dbl>, task_id <dbl>, adv_id <dbl>,
## #   creat_type_cd <dbl>, adv_prim_id <dbl>, inter_type_cd <dbl>, slot_id <dbl>,
## #   site_id <dbl>, spread_app_id <dbl>, hispace_app_tags <dbl>,
## #   app_second_class <dbl>, app_score <dbl>, ad_click_list_v001 <chr>,
## #   ad_click_list_v002 <chr>, ad_click_list_v003 <chr>, …