Problem Research

Background Problem

In general, there are three things that buyers consider in buying goods, namely in terms of the seller, the specifications of the goods, and other buyer reviews of the goods.

Considerations for product specifications include product categories, product prices, product brands, product variations, price discounts, number of likes, comments, and views, product ratings, product stock, number of products sold, availability on flash sale, lowest price guarantee feature, deal features bundles, wholesale features, and free shipping features.

Considerations for buyer reviews include product quality, product accommodations, packaging conditions, delivery speed, buyers’ desire to order again, product defects, whether the product was delivered or not, and how similar the product is to the picture. These buyer reviews will affect the product rating.

Store considerations include the number of items, number of followers, response time, response rate, store location, number of buyers who rate the store, store ratings, and store labels (verified shop/official shop).

One of the focuses of e-commerce is ensuring the availability of products to meet customer needs. For this reason, e-commerce does not only pay attention to customer loyalty but also pays attention to the loyalty of shops/sellers by increasing buyer confidence in the store and increasing the number of transactions.

One of the programs used to increase the number of product transactions in stores is flash sales. Usually, the target of the flash sale program is a store that does not yet have a label or a store that has just joined e-commerce. The products offered in the flash sale program should also be products that are in high demand by customers so that the chances of buying them are greater.

Business Impact

The author will analyze three things that are considered by buyers in e-commerce transactions. This includes making machine learning regarding store classification for shopee verified and official shops. In addition, the program offers such as flash sales can be used to collaborate with stores in the program according to the items available at the store and items that are in high demand. The results of the analysis will be displayed on the dashboard which will be used as a consideration in compiling a program or determining the right target store to be included in a program run by e-commerce.

Target User

The target user of this project is e-commerce with the following benefits:

  1. increasing user and store loyalty at the same time in one program
  2. maining new insights in innovation and improvisation programs
  3. marketing effectiveness through selecting the right target stores

Implementation to Similar Businesses

In addition to e-commerce, this project can also be implemented in the FMCG industry or other industries that have stores or branch offices. This project can also be used by shops joining e-commerce and buyers in choosing products.

Outputs of The Project

The output of this project is a dashboard that displays:

  1. exploratory data analysis (EDA)
  2. shop recommendations that can be offered by the shopee verified and official shop programs using machine learning
  3. shop recommendations that can be offered to join the flash sale program

Data Source

The growing popularity of Korean inspired fashion has taken over e-retailing in the Philippines. Entrepreneurs are catching up to this growing trend by selling their Korean fashion inspired products on different online platforms.

The data is secondary data taken by the author from one of the contributors to Kaggle, namely Jaye Cabrera. The dataset contains:

  1. Product Data
  2. Shop Data

Data Preparation

Load The Packages

library(dplyr)
library(lubridate)
library(ggplot2)

Product Data

Read Data

product <- read.csv("data_input/2021June-July_product_data.csv", stringsAsFactors = T)
head(product)

Data Glossary

  1. pk_product: Primary key product
  2. date_collected: Date collected (2021-06-13, 2021-06-21, 2021-06-28, and 2021-07-06)
  3. product_itemid: Product item ID
  4. product_shopid: Product shop ID
  5. product_category: Product category (Crop Top, Long Sleeves, Short Sleeves, Top)
  6. product_name: Product name
  7. product_price: Product price
  8. product_price_min: Product price minimum
  9. product_price_max: Product price maximum
  10. product_discount: Product discount
  11. product_brand: Product brand (No Brand, Huilishi, Bsco, Dnemnl, etc)
  12. product_like_count: Product like count
  13. product_comment_count: Product comment count
  14. product_views: Product views
  15. product_rate_star_0: The number of users who give a 0 star rating to the product
  16. product_rate_star_1: The number of users who gave a 1 star rating to the product
  17. product_rate_star_2: The number of users who gave a 2 star rating to the product
  18. product_rate_star_3: The number of users who gave a 3 star rating to the product
  19. product_rate_star_4: The number of users who gave a 4 star rating to the product
  20. product_rate_star_5: The number of users who gave a 5 star rating to the product
  21. product_total_rating: Average of the total product star rating
  22. stock: Product stock
  23. units_sold: Number of units sold
  24. status: Avalilability status of the product
  25. shop_location: Shop location (Mainland China, Metro Manila, etc)
  26. shop_is_on_flash_sale: Product availability on flash sale
  27. shop_is_preferred_plus_seller: Shop preferences for plus sellers
  28. feature_lowest_price_guarantee: Lowest price guarantee feature availability
  29. feature_can_use_bundle_deal: Bundle deal feature availability
  30. feature_can_use_cod: COD (Cash on Delivery) feature availability
  31. feature_can_use_wholesale: Wholesale feature availability
  32. feature_show_free_shipping: Free shipping availability
  33. product_variation_count: Number of product variations

Adjust Data Type

glimpse(product)
#> Rows: 1,583
#> Columns: 33
#> $ pk_product                     <dbl> 202106137919342720, 202106137140015488,~
#> $ date_collected                 <fct> 2021-06-13, 2021-06-13, 2021-06-13, 202~
#> $ product_itemid                 <dbl> 7919342724, 7140015494, 4843903203, 114~
#> $ product_shopid                 <int> 1083822, 130651949, 55748694, 396605392~
#> $ product_category               <fct> Short Sleeves, Short Sleeves, Short Sle~
#> $ product_name                   <fct> "NKD Korean Fashion Erich Basic Daily P~
#> $ product_price                  <dbl> 300, NA, 200, 391, 300, 150, 150, 603, ~
#> $ product_price_min              <int> 135, 149, 75, 294, 55, 63, 58, 423, 58,~
#> $ product_price_max              <int> 138, 149, 75, 294, 55, 63, 58, 423, 64,~
#> $ product_discount               <int> 55, 0, 63, 25, 82, 58, 61, 30, 61, 51, ~
#> $ product_brand                  <fct> No Brand, No Brand, No Brand, No Brand,~
#> $ product_like_count             <int> 4504, 293, 8184, 13, 5999, 518, 24, 4, ~
#> $ product_comment_count          <int> 3696, 158, 15912, 0, 11825, 1544, 83, 0~
#> $ product_views                  <int> 12815, 2189, 19575, 317, 27205, 1029, 1~
#> $ prod_rate_star_0               <int> 3447, 148, 13890, 0, 10138, 1422, 80, 0~
#> $ prod_rate_star_1               <int> 146, 3, 984, 0, 778, 59, 2, 0, 1293, 49~
#> $ prod_rate_star_2               <int> 70, 3, 606, 0, 459, 34, 0, 0, 726, 26, ~
#> $ prod_rate_star_3               <int> 9, 2, 178, 0, 200, 12, 0, 0, 185, 6, 0,~
#> $ prod_rate_star_4               <int> 24, 0, 251, 0, 249, 17, 0, 0, 309, 9, 0~
#> $ prod_rate_star_5               <int> 3696, 156, 15909, 0, 11824, 1544, 82, 0~
#> $ product_total_rating           <dbl> 4.89, 4.90, 4.77, 0.00, 4.72, 4.85, 4.9~
#> $ stock                          <int> 134411, 84, 7721, 10799, 3358, 49297, 8~
#> $ units_sold                     <int> 10154, 336, 68171, 0, 35672, 4333, 189,~
#> $ status                         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ shop_location                  <fct> "Taytay, Rizal", "San Nicolas, Metro Ma~
#> $ shop_is_on_flash_sale          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ shop_is_preferred_plus_seller  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_lowest_price_guarantee <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, ~
#> $ feature_can_use_bundle_deal    <int> 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, ~
#> $ feature_can_use_cod            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_can_use_wholesale      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_show_free_shipping     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ product_variation_count        <int> 20, 9, 14, 3, 10, 7, 9, 3, 14, 9, 12, 3~

We need to adjust data type of several variable: - pk_product, product_itemid, product_shopid, product_name as character - product_price as integer

product <- product %>%
  mutate_at(c("pk_product", "product_itemid", "product_shopid", "product_name"), as.character) %>%
  mutate_at("product_price", as.integer)

NA Value Check

colSums(is.na(product))
#>                     pk_product                 date_collected 
#>                              0                              0 
#>                 product_itemid                 product_shopid 
#>                              0                              0 
#>               product_category                   product_name 
#>                              0                              0 
#>                  product_price              product_price_min 
#>                            198                              0 
#>              product_price_max               product_discount 
#>                              0                              0 
#>                  product_brand             product_like_count 
#>                              0                              0 
#>          product_comment_count                  product_views 
#>                              0                              0 
#>               prod_rate_star_0               prod_rate_star_1 
#>                              0                              0 
#>               prod_rate_star_2               prod_rate_star_3 
#>                              0                              0 
#>               prod_rate_star_4               prod_rate_star_5 
#>                              0                              0 
#>           product_total_rating                          stock 
#>                              0                              0 
#>                     units_sold                         status 
#>                              0                              0 
#>                  shop_location          shop_is_on_flash_sale 
#>                              0                              0 
#>  shop_is_preferred_plus_seller feature_lowest_price_guarantee 
#>                              0                              0 
#>    feature_can_use_bundle_deal            feature_can_use_cod 
#>                              0                              0 
#>      feature_can_use_wholesale     feature_show_free_shipping 
#>                              0                              0 
#>        product_variation_count 
#>                              0

Several obervations in product_price variables have NA value.

product %>%
  filter(is.na(product_price))

Products that have NA value are products with no discount. So, we can fill NA value with product price minimum.

product$product_price[is.na(product$product_price)] = product$product_price_min[is.na(product$product_price)]

anyNA(product)
#> [1] FALSE

Remove Unused Variables

product <- product %>%
  select(-c(pk_product, date_collected, status))

head(product)

Shop Data

Read Data

shop <- read.csv("data_input/2021June-July_shop_data.csv", stringsAsFactors = T)
head(shop)

Data Glossary

  1. pk_shop: Primary key shop
  2. date_collected: Date collected (2021-06-13, 2021-06-21, 2021-06-28, and 2021-07-06)
  3. shopid: Shop ID
  4. name: Shop name
  5. join_month: Shop month join
  6. join_day: Shop day join
  7. join_year: Shop year join (2015-2021)
  8. item_count: Number of items sold
  9. follower_count: Number of followers
  10. response_time: Response time
  11. response_rate: Response rate (range 1-100)
  12. shop_location: Shop location (Pasay City, Taytay, Binondo, etc)
  13. rating_bad: The number of users who give bad rating to the shop
  14. rating_good: The number of users who give good rating to the shop
  15. rating_normal: The number of users who give normal rating to the shop
  16. rating_star: Shop rating star
  17. is_shopee_verified: Is the shop shopee verified or not
  18. is_official_shop: Is the shop official shop or not

Adjust Data Type

glimpse(shop)
#> Rows: 746
#> Columns: 18
#> $ pk_shop            <dbl> 20210613254711736, 20210613314198144, 2021061310377~
#> $ date_collected     <fct> 2021-06-13, 2021-06-13, 2021-06-13, 2021-06-13, 202~
#> $ shopid             <int> 254711738, 314198143, 10377223, 21407329, 11187885,~
#> $ name               <fct> "hello.may.ph", "love.women.shop", "R&O RedOrange",~
#> $ join_month         <fct> April, September, September, April, October, April,~
#> $ join_day           <int> 27, 25, 27, 30, 18, 26, 17, 12, 22, 30, 14, 26, 17,~
#> $ join_year          <int> 2020, 2020, 2016, 2017, 2016, 2021, 2020, 2019, 202~
#> $ item_count         <int> 446, 263, 209, 165, 467, 115, 219, 1444, 286, 509, ~
#> $ follower_count     <int> 17487, 8574, 660417, 18787, 67371, 41, 30480, 72704~
#> $ response_time      <fct>  10:13:55,  10:43:05,  08:32:07,  09:10:38,  11:40:~
#> $ response_rate      <int> 96, 73, 90, 94, 89, 29, 92, 95, 76, 85, 77, 10, 61,~
#> $ shop_location      <fct> "", "San Nicolas,Metro Manila", "Pasay City,Metro M~
#> $ rating_bad         <int> 181, 78, 4647, 44, 680, 0, 87, 354, 679, 4524, 21, ~
#> $ rating_good        <int> 17510, 2656, 1071829, 18959, 39252, 2, 16460, 22350~
#> $ rating_normal      <int> 729, 198, 18073, 215, 2383, 0, 408, 962, 2027, 1007~
#> $ rating_star        <dbl> 4.81, 4.68, 4.91, 4.94, 4.73, 5.00, 4.87, 4.80, 4.7~
#> $ is_shopee_verified <int> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
#> $ is_official_shop   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

We need to adjust data type of several variables: - pk_shop, shopid, name as character - response_time as time

shop <- shop %>%
  mutate_at(c("pk_shop", "shopid", "name"), as.character) %>%
  mutate(response_time = hms(response_time))

NA Value Check

colSums(is.na(shop))
#>            pk_shop     date_collected             shopid               name 
#>                  0                  0                  0                  0 
#>         join_month           join_day          join_year         item_count 
#>                  0                  0                  0                  0 
#>     follower_count      response_time      response_rate      shop_location 
#>                  0                  0                  0                  0 
#>         rating_bad        rating_good      rating_normal        rating_star 
#>                  0                  0                  0                  6 
#> is_shopee_verified   is_official_shop 
#>                  0                  0

There ara several observations that has NA value in rating_star variable.

shop %>%
  filter(is.na(rating_star))

These data have NA value because the shop has no transaction yet. So, we can fill the NA value with 0.

shop$rating_star[is.na(shop$rating_star)] = 0

anyNA(shop)
#> [1] FALSE

Remove Unused Variables

shop <- shop %>%
  select(-c(pk_shop, date_collected))

head(product)

Machine Learning

Target variables:

  1. is_shopee_verified
  2. is_official_shop

Predictor variables:

  1. join_date
  2. item_count
  3. follower_count
  4. response_time
  5. response_rate
  6. shop_location
  7. rating_bad
  8. rating_good
  9. rating_normal
  10. rating_star

Data Exploration

Product Data

product %>%
  group_by(product_category) %>%
  summarise(total_product = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = total_product, y = reorder(product_category, total_product))) +
  geom_col(aes(fill = total_product), color = "darkred") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Product",) +
  geom_text(aes(label = total_product), nudge_x = -12) +
  labs( title = "Number of Products with a Rating of 4.5 and Above in Each Category",
        x = "Number of Product",
        y = NULL
      )

Based on the chart above, most of the products have a rating of 4.5 and above. The crop top category has the most products with a rating of 4.5 and above.

The following are the most in-demand product categories with a rating of 4.5 and above at each store location.

product %>%
  group_by(shop_location, product_category) %>%
  summarise(total_product = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = total_product, y = reorder(shop_location, total_product)))+
  geom_col(aes(fill = product_category), position = "fill")+
  labs(x = "Category",
       y = NULL,
       fill = NULL)+
  scale_fill_brewer(palette = "Set2")+
  theme_minimal()+
  theme(legend.position = "top")

From the graph, it can be seen that each location has different product demands.

product %>%
  group_by(product_category) %>%
  summarise(likes = round(mean(product_like_count))) %>%
  ggplot(aes(x = product_category, y = likes))+
  geom_col(aes(fill = likes), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Likes Count",) +
  geom_text(aes(label = likes), nudge_y = -80) +
  labs( title = "Most in-demand Product Categories",
        subtitle = "Based on Likes Count",
        x = NULL,
        y = "Likes Count"
      )

product %>%
  group_by(product_category) %>%
  summarise(views = round(mean(product_views))) %>%
  ggplot(aes(x = product_category, y = views))+
  geom_col(aes(fill = views), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Views",) +
  geom_text(aes(label = views), nudge_y = -1000) +
  labs( title = "Most in-demand Product Categories",
        subtitle = "Based on Views",
        x = NULL,
        y = "Views"
      )

The crop top is the most popular product category both in terms of the number of likes and the number of views.

product %>%
  mutate(feature_lowest_price_guarantee = as.factor(feature_lowest_price_guarantee)) %>%
  group_by(feature_lowest_price_guarantee) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_lowest_price_guarantee, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Lowest Price Guarantee",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_bundle_deal = as.factor(feature_can_use_bundle_deal)) %>%
  group_by(feature_can_use_bundle_deal) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_bundle_deal, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Bundle Deal",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_cod = as.factor(feature_can_use_cod)) %>%
  group_by(feature_can_use_cod) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_cod, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Cash on Delivery",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_wholesale = as.factor(feature_can_use_wholesale)) %>%
  group_by(feature_can_use_wholesale) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_wholesale, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Wholesale",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_show_free_shipping = as.factor(feature_show_free_shipping)) %>%
  group_by(feature_show_free_shipping) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_show_free_shipping, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Show Free Shipping",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

Based on 5 graphs of feature availability on products that have a rating of 4.5 and above, it can be concluded that feature availability does not affect product ratings.

Shop Data

Here are the shops that have not yet been rated or have a 0 rating.

shop %>%
  filter(rating_star == 0)

These shops can be used as recommendations for collaboration in several programs to increase shop ratings, for example participating in flash sales programs.

shop %>%
  mutate(is_shopee_verified = as.factor(is_shopee_verified)) %>%
  group_by(is_shopee_verified) %>%
  summarise(shops = sum(table(shopid))) %>%
  ggplot(aes(x = is_shopee_verified, y = shops))+
  geom_col(aes(fill = shops), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = shops), nudge_y = -12) +
  labs( title = "Shopee Verified",
        x = NULL,
        y = "Number of Shops"
      )

Most of the shops are not shopee verified.

shop %>%
  mutate(is_official_shop = as.factor(is_official_shop)) %>%
  group_by(is_official_shop) %>%
  summarise(shops = sum(table(shopid))) %>%
  ggplot(aes(x = is_official_shop, y = shops))+
  geom_col(aes(fill = shops), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = shops), nudge_y = -12) +
  labs( title = "Official Shop",
        x = NULL,
        y = "Number of Shops"
      )

Almost all shops are not official shops.

Shops that have been verified by Shopee and official stores can increase buyer confidence in transactions and increase the chances of transactions with buyers. Buyers who have transacted will rate the store and product they bought. This is also a consideration for buyers to transact again and can attract other buyers to transact with stores that have good ratings and good products.