I’m curious on the questions like :
What products and brands were the hottest in 2020?
What’s the pricing for different brands?
What categories did each hottest brand provide? etc
1/11/2022
I’m curious on the questions like :
What products and brands were the hottest in 2020?
What’s the pricing for different brands?
What categories did each hottest brand provide? etc
Here’s a dataset including eCommerce Purchase History from Eletronics Stores, the details for each column as below:
event_time: Time when event happened at (in UTC).
event_type: Only one kind of event: purchase.
product_id: ID of a product.
category_id: Product’s category ID.
category_code: Product’s category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories.
brand: Downcased string of brand name. Can be missed.
price: US Dollar
user_id:Permanent user ID.
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5 ✓ purrr 0.3.4 ✓ tibble 3.1.3 ✓ dplyr 1.0.7 ✓ tidyr 1.1.3 ✓ stringr 1.4.0 ✓ readr 2.0.0 ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag()
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
f1 <- 'eCommerce Purchase from eStores.csv' purchase_history <- read_csv(f1,show_col_types = FALSE) purchase_history %>% glimpse
Rows: 2,633,521 Columns: 8
Warning: One or more parsing issues, see `problems()` for details
$ event_time <chr> "2020-04-24 11:50:39 UTC", "2020-04-24 11:50:39 UTC", "2… $ order_id <dbl> 2.294360e+18, 2.294360e+18, 2.294444e+18, 2.294444e+18, … $ product_id <dbl> 1.515966e+18, 1.515966e+18, 2.273948e+18, 2.273948e+18, … $ category_id <dbl> 2.268105e+18, 2.268105e+18, 2.268105e+18, 2.268105e+18, … $ category_code <chr> "electronics.tablet", "electronics.tablet", "electronics… $ brand <chr> "samsung", "samsung", "huawei", "huawei", "karcher", "ma… $ price <dbl> 162.01, 162.01, 77.52, 77.52, 217.57, 39.33, 1387.01, 13… $ user_id <dbl> 1.515916e+18, 1.515916e+18, 1.515916e+18, 1.515916e+18, …
purchase_history$day<- as.Date(purchase_history$event_time,'%Y-%m-%d')
summary(purchase_history[c('event_time','day')])
event_time day
Length:2633521 Min. :1970-01-01
Class :character 1st Qu.:2020-03-05
Mode :character Median :2020-06-08
Mean :2020-01-16
3rd Qu.:2020-08-24
Max. :2020-11-21
purchase_history <- purchase_history %>% subset(day!='1970-01-01')
# A tibble: 509 × 2 category_code n <chr> <int> 1 electronics.smartphone 354848 2 appliances.kitchen.refrigerators 77023 3 16.18 72321 4 computers.notebook 71103 5 electronics.video.tv 70995 6 appliances.environment.vacuum 65983 7 electronics.audio.headphone 65650 8 appliances.kitchen.kettle 62309 9 appliances.kitchen.washer 55892 10 furniture.kitchen.table 55331 # … with 499 more rows
brand [1,] "samsung" [2,] "huawei" [3,] "apple" [4,] "oppo" [5,] "xiaomi" [6,] "honor"
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Using brand as id variables
# A tibble: 6 × 3 brand price_median price_mean <chr> <dbl> <dbl> 1 apple 764. 768. 2 honor 116. 130. 3 huawei 162. 209. 4 oppo 208. 201. 5 samsung 162. 252. 6 xiaomi 185. 187.
# A tibble: 11 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 45604 768. 2 electronics.audio.headphone 9478 219. 3 electronics.clocks 5736 427. 4 computers.notebook 3906 1394. 5 electronics.tablet 2571 684. 6 construction.tools.screw 1245 30.1 7 computers.components.hdd 724 202. 8 computers.peripherals.mouse 255 140. 9 computers.components.cooler 214 48.1 10 computers.desktop 174 2103. 11 computers.peripherals.keyboard 127 292.
# A tibble: 3 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 11356 130. 2 electronics.audio.headphone 572 23.8 3 electronics.clocks 83 98.1
# A tibble: 8 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 46351 209. 2 electronics.audio.headphone 3214 67.7 3 electronics.clocks 1846 187. 4 electronics.tablet 1161 177. 5 construction.tools.screw 218 31.0 6 appliances.personal.scales 181 69.4 7 computers.network.router 37 37.0 8 computers.notebook 1 1146.
# A tibble: 3 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 35981 201. 2 electronics.audio.headphone 77 62.4 3 electronics.clocks 6 324.
# A tibble: 23 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 177952 252. 2 appliances.environment.vacuum 33034 107. 3 electronics.video.tv 22489 559. 4 appliances.kitchen.washer 21271 398. 5 electronics.audio.headphone 13078 49.5 6 electronics.tablet 12553 243. 7 appliances.kitchen.refrigerators 11674 730. 8 appliances.kitchen.microwave 8809 133. 9 appliances.kitchen.hood 5586 210. 10 electronics.clocks 4418 285. # … with 13 more rows
# A tibble: 23 × 3 category_code n price <chr> <int> <dbl> 1 electronics.smartphone 14215 187. 2 electronics.audio.headphone 3434 28.2 3 electronics.video.tv 1887 297. 4 kids.skates 925 437. 5 appliances.environment.vacuum 589 351. 6 stationery.battery 498 5.79 7 auto.accessories.videoregister 248 68.5 8 electronics.clocks 244 73.7 9 electronics.tablet 197 28.2 10 stationery.stapler 129 25.7 # … with 13 more rows
# A tibble: 6 × 2 category_code sales <chr> <dbl> 1 electronics.smartphone 101935464. 2 appliances.kitchen.refrigerators 31359708. 3 computers.notebook 31251211. 4 electronics.video.tv 29017988. 5 appliances.kitchen.washer 18346288. 6 appliances.environment.vacuum 9174349