1/11/2022

Context

I’m curious on the questions like :

  • What products and brands were the hottest in 2020?

  • What’s the pricing for different brands?

  • What categories did each hottest brand provide? etc

Content

Here’s a dataset including eCommerce Purchase History from Eletronics Stores, the details for each column as below:

  • event_time: Time when event happened at (in UTC).

  • event_type: Only one kind of event: purchase.

  • product_id: ID of a product.

  • category_id: Product’s category ID.

  • category_code: Product’s category taxonomy (code name) if it was possible to make it. Usually present for meaningful categories and skipped for different kinds of accessories.

  • brand: Downcased string of brand name. Can be missed.

  • price: US Dollar

  • user_id:Permanent user ID.

1. Load the libraries

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✓ ggplot2 3.3.5     ✓ purrr   0.3.4
✓ tibble  3.1.3     ✓ dplyr   1.0.7
✓ tidyr   1.1.3     ✓ stringr 1.4.0
✓ readr   2.0.0     ✓ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':

    smiths

2. Read the dataset

f1 <- 'eCommerce Purchase from eStores.csv'
purchase_history <- read_csv(f1,show_col_types = FALSE) 
purchase_history %>% glimpse
Rows: 2,633,521
Columns: 8
Warning: One or more parsing issues, see `problems()` for details
$ event_time    <chr> "2020-04-24 11:50:39 UTC", "2020-04-24 11:50:39 UTC", "2…
$ order_id      <dbl> 2.294360e+18, 2.294360e+18, 2.294444e+18, 2.294444e+18, …
$ product_id    <dbl> 1.515966e+18, 1.515966e+18, 2.273948e+18, 2.273948e+18, …
$ category_id   <dbl> 2.268105e+18, 2.268105e+18, 2.268105e+18, 2.268105e+18, …
$ category_code <chr> "electronics.tablet", "electronics.tablet", "electronics…
$ brand         <chr> "samsung", "samsung", "huawei", "huawei", "karcher", "ma…
$ price         <dbl> 162.01, 162.01, 77.52, 77.52, 217.57, 39.33, 1387.01, 13…
$ user_id       <dbl> 1.515916e+18, 1.515916e+18, 1.515916e+18, 1.515916e+18, …

3. Clean the data

3.1 Convert date

purchase_history$day<- as.Date(purchase_history$event_time,'%Y-%m-%d')

3.2 Summary dataset

summary(purchase_history[c('event_time','day')])
  event_time             day            
 Length:2633521     Min.   :1970-01-01  
 Class :character   1st Qu.:2020-03-05  
 Mode  :character   Median :2020-06-08  
                    Mean   :2020-01-16  
                    3rd Qu.:2020-08-24  
                    Max.   :2020-11-21  

3.3 Exclude data of 1970 as we’d like to focus on 2020

purchase_history <- purchase_history %>% subset(day!='1970-01-01')

3.4 Check the amount of products sold by category

# A tibble: 509 × 2
   category_code                         n
   <chr>                             <int>
 1 electronics.smartphone           354848
 2 appliances.kitchen.refrigerators  77023
 3 16.18                             72321
 4 computers.notebook                71103
 5 electronics.video.tv              70995
 6 appliances.environment.vacuum     65983
 7 electronics.audio.headphone       65650
 8 appliances.kitchen.kettle         62309
 9 appliances.kitchen.washer         55892
10 furniture.kitchen.table           55331
# … with 499 more rows

4. Q1: Which product brand was sold best?

Q2: Which product category was sold best?

Q3: Which smartphone was the best seller?

Q4: How’s price distribution by top6 smartphone brands?

     brand    
[1,] "samsung"
[2,] "huawei" 
[3,] "apple"  
[4,] "oppo"   
[5,] "xiaomi" 
[6,] "honor"  

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Using brand as id variables

# A tibble: 6 × 3
  brand   price_median price_mean
  <chr>          <dbl>      <dbl>
1 apple           764.       768.
2 honor           116.       130.
3 huawei          162.       209.
4 oppo            208.       201.
5 samsung         162.       252.
6 xiaomi          185.       187.

Let’s check the categories and corresponding price for each brand.

1. apple

# A tibble: 11 × 3
   category_code                      n  price
   <chr>                          <int>  <dbl>
 1 electronics.smartphone         45604  768. 
 2 electronics.audio.headphone     9478  219. 
 3 electronics.clocks              5736  427. 
 4 computers.notebook              3906 1394. 
 5 electronics.tablet              2571  684. 
 6 construction.tools.screw        1245   30.1
 7 computers.components.hdd         724  202. 
 8 computers.peripherals.mouse      255  140. 
 9 computers.components.cooler      214   48.1
10 computers.desktop                174 2103. 
11 computers.peripherals.keyboard   127  292. 

2. honor

# A tibble: 3 × 3
  category_code                   n price
  <chr>                       <int> <dbl>
1 electronics.smartphone      11356 130. 
2 electronics.audio.headphone   572  23.8
3 electronics.clocks             83  98.1

3. huawei

# A tibble: 8 × 3
  category_code                   n  price
  <chr>                       <int>  <dbl>
1 electronics.smartphone      46351  209. 
2 electronics.audio.headphone  3214   67.7
3 electronics.clocks           1846  187. 
4 electronics.tablet           1161  177. 
5 construction.tools.screw      218   31.0
6 appliances.personal.scales    181   69.4
7 computers.network.router       37   37.0
8 computers.notebook              1 1146. 

4. oppo

# A tibble: 3 × 3
  category_code                   n price
  <chr>                       <int> <dbl>
1 electronics.smartphone      35981 201. 
2 electronics.audio.headphone    77  62.4
3 electronics.clocks              6 324. 

5. samsung

# A tibble: 23 × 3
   category_code                         n price
   <chr>                             <int> <dbl>
 1 electronics.smartphone           177952 252. 
 2 appliances.environment.vacuum     33034 107. 
 3 electronics.video.tv              22489 559. 
 4 appliances.kitchen.washer         21271 398. 
 5 electronics.audio.headphone       13078  49.5
 6 electronics.tablet                12553 243. 
 7 appliances.kitchen.refrigerators  11674 730. 
 8 appliances.kitchen.microwave       8809 133. 
 9 appliances.kitchen.hood            5586 210. 
10 electronics.clocks                 4418 285. 
# … with 13 more rows

6. xiaomi

# A tibble: 23 × 3
   category_code                      n  price
   <chr>                          <int>  <dbl>
 1 electronics.smartphone         14215 187.  
 2 electronics.audio.headphone     3434  28.2 
 3 electronics.video.tv            1887 297.  
 4 kids.skates                      925 437.  
 5 appliances.environment.vacuum    589 351.  
 6 stationery.battery               498   5.79
 7 auto.accessories.videoregister   248  68.5 
 8 electronics.clocks               244  73.7 
 9 electronics.tablet               197  28.2 
10 stationery.stapler               129  25.7 
# … with 13 more rows

Q5: What’s the top6 sales?

# A tibble: 6 × 2
  category_code                         sales
  <chr>                                 <dbl>
1 electronics.smartphone           101935464.
2 appliances.kitchen.refrigerators  31359708.
3 computers.notebook                31251211.
4 electronics.video.tv              29017988.
5 appliances.kitchen.washer         18346288.
6 appliances.environment.vacuum      9174349