Our goal is to predict whether a user clicked on an ad. Because the ad data is less enriched, we use each user's feed data to make that prediction.
# Source Domain :
train_data_feeds <- read_csv("Downloads/train_data_feeds.csv")
source_domain <- train_data_feeds
## # A tibble: 6 × 2
## # Groups: Appeared_ct [5]
## u_userId Appeared_ct
## <chr> <int>
## 1 100001 6
## 2 100002 3
## 3 100003 6
## 4 100005 5
## 5 100006 26
## 6 100007 4
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## Appeared_ct n
## <int> <int>
## 1 1 45833
## 2 2 26836
## 3 3 16012
## 4 4 10966
## 5 5 7888
## 6 6 6036
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## user_id Appeared_ct
## <chr> <int>
## 1 100002 5
## 2 100005 6
## 3 100006 23
## 4 100008 15
## 5 100009 52
## 6 100010 90
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## Appeared_ct n
## <int> <int>
## 1 1 288
## 2 2 2719
## 3 3 2686
## 4 4 2323
## 5 5 2015
## 6 6 1720
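The four tables above read as per-user row counts (`Appeared_ct`) followed by the number of users at each count, first for the source domain and then for the target domain. A minimal sketch of how such tables could be produced, assuming one row per impression and using hypothetical toy IDs rather than the real data:

```r
library(dplyr)

# Hypothetical toy feed log: one row per feed impression.
source_domain <- tibble(
  u_userId = c("100001", "100001", "100002", "100003",
               "100003", "100003", "100005")
)

# Rows per user: how many times each user appears in the log.
appeared <- source_domain %>%
  count(u_userId, name = "Appeared_ct")

# Users per frequency: how many users appear once, twice, and so on.
freq <- appeared %>%
  count(Appeared_ct)

print(appeared)
print(freq)
```

The same two steps would apply to the target domain with `user_id` in place of `u_userId`.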
## [1] "intersect(source_domain$u_userId, target_domain$user_id)"
## [1] "length(common_user_ids)/n_distinct(source_domain$u_userId) : 0.362513393625467"
## [1] "length(common_user_ids)/n_distinct(target_domain$user_id) : 1"
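Interpretation: every user in the target (ad) domain also appears in the source (feed) domain, so restricting both data sets to the common users keeps all target users. Only about 36% of source users are shared, so the same filter drops about 64% of the users in our source data.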
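The overlap ratios printed above can be reproduced on a small scale. The sketch below uses hypothetical toy IDs. The names `share_of_source` and `share_of_target` are ours, and the `filter()` step is an assumption about how the subsets were built.

```r
library(dplyr)

# Hypothetical toy data standing in for the real feed and ad logs.
source_domain <- tibble(
  u_userId = c("100001", "100002", "100002", "100005", "100006")
)
target_domain <- tibble(
  user_id = c("100002", "100005", "100005")
)

# Users present in both domains.
common_user_ids <- intersect(source_domain$u_userId, target_domain$user_id)

# Share of distinct users in each domain that survive the restriction.
share_of_source <- length(common_user_ids) / n_distinct(source_domain$u_userId)
share_of_target <- length(common_user_ids) / n_distinct(target_domain$user_id)

# Keep only rows belonging to common users (assumed filtering step).
source_subset <- source_domain %>% filter(u_userId %in% common_user_ids)
target_subset <- target_domain %>% filter(user_id %in% common_user_ids)
```

Note that both ratios count distinct users, not rows, so the share of rows removed from the source data can differ from the share of users removed.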
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## u_userId Appeared_ct
## <chr> <int>
## 1 100002 3
## 2 100005 5
## 3 100006 26
## 4 100008 6
## 5 100009 38
## 6 100010 253
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## Appeared_ct n
## <int> <int>
## 1 1 5999
## 2 2 4791
## 3 3 3592
## 4 4 2927
## 5 5 2402
## 6 6 1963
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## user_id Appeared_ct
## <chr> <int>
## 1 100002 5
## 2 100005 6
## 3 100006 23
## 4 100008 15
## 5 100009 52
## 6 100010 90
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## Appeared_ct n
## <int> <int>
## 1 1 288
## 2 2 2719
## 3 3 2686
## 4 4 2323
## 5 5 2015
## 6 6 1720
## Warning in left_join(target_subset, source_subset, by = c(user_id = "u_userId"), : Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 1 of `x` matches multiple rows in `y`.
## ℹ Row 150718 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
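The warning above means that a user can have several rows on both sides of the join, so every ad row is paired with every feed row for that user. If one row of feed features per user is enough for the model, one option is to aggregate the feed data first so the join becomes many-to-one. This is a sketch on hypothetical toy data, not necessarily the approach used here. The columns `feed_clicks`, `feed_rows` and `feed_click_rate` are illustrative, and the `relationship` argument requires dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data; only the key columns mirror the real tables.
target_subset <- tibble(
  user_id      = c("100002", "100002", "100005"),
  label_target = c(0, 1, 0)
)
source_subset <- tibble(
  u_userId    = c("100002", "100002", "100002", "100005"),
  feed_clicks = c(1, 0, 1, 0)  # illustrative feed feature
)

# Collapse the feed data to one row per user before joining.
source_per_user <- source_subset %>%
  group_by(u_userId) %>%
  summarise(
    feed_rows       = n(),
    feed_click_rate = mean(feed_clicks),
    .groups = "drop"
  )

# Declaring the expected relationship makes dplyr raise an error
# if the right-hand side ever has duplicate keys again.
joined <- left_join(
  target_subset, source_per_user,
  by = c(user_id = "u_userId"),
  relationship = "many-to-one"
)
```

With this shape the joined table has exactly as many rows as `target_subset`. If per-impression feed rows are genuinely needed, setting `relationship = "many-to-many"` silences the warning but keeps the row multiplication.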
##           Reference
## Prediction      0      1
##          0 0.8920 0.0075
##          1 0.0996 0.0009
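As we can see, even with a low cutoff of 0.009, our model still predicts ad clicks poorly. Actual clicks make up about 0.84% of the rows (0.0075 + 0.0009), and the model flags only about 11% of them (0.0009 of 0.0084). Of the rows it does flag as clicks, fewer than 1% are real clicks (0.0009 of 0.1005). Ultimately this is what we would like to optimize.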
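To make the problem concrete, we can compute recall and precision for the click class directly from the proportions in the confusion matrix above. This sketch retypes the four printed (rounded) values by hand, so the results are approximate.

```r
# Confusion matrix as proportions of all rows, copied from the output above.
# matrix() fills column by column, so each pair below is one Reference column.
cm <- matrix(
  c(0.8920, 0.0996,   # Reference = 0: predicted 0, predicted 1
    0.0075, 0.0009),  # Reference = 1: predicted 0, predicted 1
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Recall: share of actual clicks that the model predicts as clicks.
recall <- cm["1", "1"] / sum(cm[, "1"])

# Precision: share of predicted clicks that are actual clicks.
precision <- cm["1", "1"] / sum(cm["1", ])

# Accuracy: share of rows classified correctly.
accuracy <- sum(diag(cm))

round(c(recall = recall, precision = precision, accuracy = accuracy), 4)
```

Accuracy looks high only because non-clicks dominate the data, which is why recall and precision on the click class are the more useful numbers to track here.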
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## user_id Appeared_ct
## <chr> <int>
## 1 100006 26
## 2 100009 38
## 3 100019 178
## 4 100022 684
## 5 100024 5373
## 6 100025 180
## # A tibble: 6 × 2
## # Groups: Appeared_ct [6]
## Appeared_ct n
## <int> <int>
## 1 1 843
## 2 2 933
## 3 3 670
## 4 4 740
## 5 5 524
## 6 6 594
## # A tibble: 18,267,978 × 61
## log_id label_target age gender residence city city_rank series_dev
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 131360 0 3 2 41 135 5 11
## 2 131360 0 3 2 41 135 5 11
## 3 131360 0 3 2 41 135 5 11
## 4 131360 0 3 2 41 135 5 11
## 5 131360 0 3 2 41 135 5 11
## 6 131360 0 3 2 41 135 5 11
## 7 131360 0 3 2 41 135 5 11
## 8 131360 0 3 2 41 135 5 11
## 9 131360 0 3 2 41 135 5 11
## 10 131360 0 3 2 41 135 5 11
## # ℹ 18,267,968 more rows
## # ℹ 53 more variables: series_group <dbl>, emui_dev <dbl>, device_name <dbl>,
## # device_size <dbl>, net_type <dbl>, task_id <dbl>, adv_id <dbl>,
## # creat_type_cd <dbl>, adv_prim_id <dbl>, inter_type_cd <dbl>, slot_id <dbl>,
## # site_id <dbl>, spread_app_id <dbl>, hispace_app_tags <dbl>,
## # app_second_class <dbl>, app_score <dbl>, ad_click_list_v001 <chr>,
## # ad_click_list_v002 <chr>, ad_click_list_v003 <chr>, …