Problem Research

Background Problem

In general, there are three things that buyers consider in buying goods, namely in terms of the seller, the specifications of the goods, and other buyer reviews of the goods.

Considerations for product specifications include product categories, product prices, product brands, product variations, price discounts, number of likes, comments, and views, product ratings, product stock, number of products sold, availability on flash sale, lowest price guarantee feature, deal features bundles, wholesale features, and free shipping features.

Considerations for buyer reviews include product quality, product accommodations, packaging conditions, delivery speed, buyers’ desire to order again, product defects, whether the product was delivered or not, and how similar the product is to the picture. These buyer reviews will affect the product rating.

Store considerations include the number of items, number of followers, response time, response rate, store location, number of buyers who rate the store, store ratings, and store labels (verified shop/official shop).

One of the focuses of e-commerce is ensuring the availability of products to meet customer needs. For this reason, e-commerce does not only pay attention to customer loyalty but also pays attention to the loyalty of shops/sellers by increasing buyer confidence in the store and increasing the number of transactions.

One of the programs used to increase the number of product transactions in stores is flash sales. Usually, the target of the flash sale program is a store that does not yet have a label or a store that has just joined e-commerce. The products offered in the flash sale program should also be products that are in high demand by customers so that the chances of buying them are greater.

Business Impact

The author will analyze three things that are considered by buyers in e-commerce transactions. This includes making machine learning regarding store classification for shopee verified and official shops. In addition, the program offers such as flash sales can be used to collaborate with stores in the program according to the items available at the store and items that are in high demand. The results of the analysis will be displayed on the dashboard which will be used as a consideration in compiling a program or determining the right target store to be included in a program run by e-commerce.

Target User

The target user of this project is e-commerce with the following benefits:

increasing user and store loyalty at the same time in one program
maining new insights in innovation and improvisation programs
marketing effectiveness through selecting the right target stores

Implementation to Similar Businesses

In addition to e-commerce, this project can also be implemented in the FMCG industry or other industries that have stores or branch offices. This project can also be used by shops joining e-commerce and buyers in choosing products.

Outputs of The Project

The output of this project is a dashboard that displays:

exploratory data analysis (EDA)
shop recommendations that can be offered by the shopee verified and official shop programs using machine learning
shop recommendations that can be offered to join the flash sale program

Data Source

The growing popularity of Korean inspired fashion has taken over e-retailing in the Philippines. Entrepreneurs are catching up to this growing trend by selling their Korean fashion inspired products on different online platforms.

The data is secondary data taken by the author from one of the contributors to Kaggle, namely Jaye Cabrera. The dataset contains:

Product Data
Shop Data

Data Preparation

Load The Packages

library(dplyr)
library(lubridate)
library(ggplot2)

Product Data

Read Data

product <- read.csv("data_input/2021June-July_product_data.csv", stringsAsFactors = T)
head(product)

Data Glossary

pk_product: Primary key product
date_collected: Date collected (2021-06-13, 2021-06-21, 2021-06-28, and 2021-07-06)
product_itemid: Product item ID
product_shopid: Product shop ID
product_category: Product category (Crop Top, Long Sleeves, Short Sleeves, Top)
product_name: Product name
product_price: Product price
product_price_min: Product price minimum
product_price_max: Product price maximum
product_discount: Product discount
product_brand: Product brand (No Brand, Huilishi, Bsco, Dnemnl, etc)
product_like_count: Product like count
product_comment_count: Product comment count
product_views: Product views
product_rate_star_0: The number of users who give a 0 star rating to the product
product_rate_star_1: The number of users who gave a 1 star rating to the product
product_rate_star_2: The number of users who gave a 2 star rating to the product
product_rate_star_3: The number of users who gave a 3 star rating to the product
product_rate_star_4: The number of users who gave a 4 star rating to the product
product_rate_star_5: The number of users who gave a 5 star rating to the product
product_total_rating: Average of the total product star rating
stock: Product stock
units_sold: Number of units sold
status: Avalilability status of the product
shop_location: Shop location (Mainland China, Metro Manila, etc)
shop_is_on_flash_sale: Product availability on flash sale
shop_is_preferred_plus_seller: Shop preferences for plus sellers
feature_lowest_price_guarantee: Lowest price guarantee feature availability
feature_can_use_bundle_deal: Bundle deal feature availability
feature_can_use_cod: COD (Cash on Delivery) feature availability
feature_can_use_wholesale: Wholesale feature availability
feature_show_free_shipping: Free shipping availability
product_variation_count: Number of product variations

Adjust Data Type

glimpse(product)

#> Rows: 1,583
#> Columns: 33
#> $ pk_product                     <dbl> 202106137919342720, 202106137140015488,~
#> $ date_collected                 <fct> 2021-06-13, 2021-06-13, 2021-06-13, 202~
#> $ product_itemid                 <dbl> 7919342724, 7140015494, 4843903203, 114~
#> $ product_shopid                 <int> 1083822, 130651949, 55748694, 396605392~
#> $ product_category               <fct> Short Sleeves, Short Sleeves, Short Sle~
#> $ product_name                   <fct> "NKD Korean Fashion Erich Basic Daily P~
#> $ product_price                  <dbl> 300, NA, 200, 391, 300, 150, 150, 603, ~
#> $ product_price_min              <int> 135, 149, 75, 294, 55, 63, 58, 423, 58,~
#> $ product_price_max              <int> 138, 149, 75, 294, 55, 63, 58, 423, 64,~
#> $ product_discount               <int> 55, 0, 63, 25, 82, 58, 61, 30, 61, 51, ~
#> $ product_brand                  <fct> No Brand, No Brand, No Brand, No Brand,~
#> $ product_like_count             <int> 4504, 293, 8184, 13, 5999, 518, 24, 4, ~
#> $ product_comment_count          <int> 3696, 158, 15912, 0, 11825, 1544, 83, 0~
#> $ product_views                  <int> 12815, 2189, 19575, 317, 27205, 1029, 1~
#> $ prod_rate_star_0               <int> 3447, 148, 13890, 0, 10138, 1422, 80, 0~
#> $ prod_rate_star_1               <int> 146, 3, 984, 0, 778, 59, 2, 0, 1293, 49~
#> $ prod_rate_star_2               <int> 70, 3, 606, 0, 459, 34, 0, 0, 726, 26, ~
#> $ prod_rate_star_3               <int> 9, 2, 178, 0, 200, 12, 0, 0, 185, 6, 0,~
#> $ prod_rate_star_4               <int> 24, 0, 251, 0, 249, 17, 0, 0, 309, 9, 0~
#> $ prod_rate_star_5               <int> 3696, 156, 15909, 0, 11824, 1544, 82, 0~
#> $ product_total_rating           <dbl> 4.89, 4.90, 4.77, 0.00, 4.72, 4.85, 4.9~
#> $ stock                          <int> 134411, 84, 7721, 10799, 3358, 49297, 8~
#> $ units_sold                     <int> 10154, 336, 68171, 0, 35672, 4333, 189,~
#> $ status                         <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ shop_location                  <fct> "Taytay, Rizal", "San Nicolas, Metro Ma~
#> $ shop_is_on_flash_sale          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ shop_is_preferred_plus_seller  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_lowest_price_guarantee <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, ~
#> $ feature_can_use_bundle_deal    <int> 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, ~
#> $ feature_can_use_cod            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_can_use_wholesale      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ feature_show_free_shipping     <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ product_variation_count        <int> 20, 9, 14, 3, 10, 7, 9, 3, 14, 9, 12, 3~

We need to adjust data type of several variable: - pk_product, product_itemid, product_shopid, product_name as character - product_price as integer

product <- product %>%
  mutate_at(c("pk_product", "product_itemid", "product_shopid", "product_name"), as.character) %>%
  mutate_at("product_price", as.integer)

NA Value Check

colSums(is.na(product))

#>                     pk_product                 date_collected 
#>                              0                              0 
#>                 product_itemid                 product_shopid 
#>                              0                              0 
#>               product_category                   product_name 
#>                              0                              0 
#>                  product_price              product_price_min 
#>                            198                              0 
#>              product_price_max               product_discount 
#>                              0                              0 
#>                  product_brand             product_like_count 
#>                              0                              0 
#>          product_comment_count                  product_views 
#>                              0                              0 
#>               prod_rate_star_0               prod_rate_star_1 
#>                              0                              0 
#>               prod_rate_star_2               prod_rate_star_3 
#>                              0                              0 
#>               prod_rate_star_4               prod_rate_star_5 
#>                              0                              0 
#>           product_total_rating                          stock 
#>                              0                              0 
#>                     units_sold                         status 
#>                              0                              0 
#>                  shop_location          shop_is_on_flash_sale 
#>                              0                              0 
#>  shop_is_preferred_plus_seller feature_lowest_price_guarantee 
#>                              0                              0 
#>    feature_can_use_bundle_deal            feature_can_use_cod 
#>                              0                              0 
#>      feature_can_use_wholesale     feature_show_free_shipping 
#>                              0                              0 
#>        product_variation_count 
#>                              0

Several obervations in product_price variables have NA value.

product %>%
  filter(is.na(product_price))

Products that have NA value are products with no discount. So, we can fill NA value with product price minimum.

product$product_price[is.na(product$product_price)] = product$product_price_min[is.na(product$product_price)]

anyNA(product)

#> [1] FALSE

Remove Unused Variables

product <- product %>%
  select(-c(pk_product, date_collected, status))

head(product)

Shop Data

Read Data

shop <- read.csv("data_input/2021June-July_shop_data.csv", stringsAsFactors = T)
head(shop)

Data Glossary

pk_shop: Primary key shop
date_collected: Date collected (2021-06-13, 2021-06-21, 2021-06-28, and 2021-07-06)
shopid: Shop ID
name: Shop name
join_month: Shop month join
join_day: Shop day join
join_year: Shop year join (2015-2021)
item_count: Number of items sold
follower_count: Number of followers
response_time: Response time
response_rate: Response rate (range 1-100)
shop_location: Shop location (Pasay City, Taytay, Binondo, etc)
rating_bad: The number of users who give bad rating to the shop
rating_good: The number of users who give good rating to the shop
rating_normal: The number of users who give normal rating to the shop
rating_star: Shop rating star
is_shopee_verified: Is the shop shopee verified or not
is_official_shop: Is the shop official shop or not

Adjust Data Type

glimpse(shop)

#> Rows: 746
#> Columns: 18
#> $ pk_shop            <dbl> 20210613254711736, 20210613314198144, 2021061310377~
#> $ date_collected     <fct> 2021-06-13, 2021-06-13, 2021-06-13, 2021-06-13, 202~
#> $ shopid             <int> 254711738, 314198143, 10377223, 21407329, 11187885,~
#> $ name               <fct> "hello.may.ph", "love.women.shop", "R&O RedOrange",~
#> $ join_month         <fct> April, September, September, April, October, April,~
#> $ join_day           <int> 27, 25, 27, 30, 18, 26, 17, 12, 22, 30, 14, 26, 17,~
#> $ join_year          <int> 2020, 2020, 2016, 2017, 2016, 2021, 2020, 2019, 202~
#> $ item_count         <int> 446, 263, 209, 165, 467, 115, 219, 1444, 286, 509, ~
#> $ follower_count     <int> 17487, 8574, 660417, 18787, 67371, 41, 30480, 72704~
#> $ response_time      <fct>  10:13:55,  10:43:05,  08:32:07,  09:10:38,  11:40:~
#> $ response_rate      <int> 96, 73, 90, 94, 89, 29, 92, 95, 76, 85, 77, 10, 61,~
#> $ shop_location      <fct> "", "San Nicolas,Metro Manila", "Pasay City,Metro M~
#> $ rating_bad         <int> 181, 78, 4647, 44, 680, 0, 87, 354, 679, 4524, 21, ~
#> $ rating_good        <int> 17510, 2656, 1071829, 18959, 39252, 2, 16460, 22350~
#> $ rating_normal      <int> 729, 198, 18073, 215, 2383, 0, 408, 962, 2027, 1007~
#> $ rating_star        <dbl> 4.81, 4.68, 4.91, 4.94, 4.73, 5.00, 4.87, 4.80, 4.7~
#> $ is_shopee_verified <int> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, ~
#> $ is_official_shop   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~

We need to adjust data type of several variables: - pk_shop, shopid, name as character - response_time as time

shop <- shop %>%
  mutate_at(c("pk_shop", "shopid", "name"), as.character) %>%
  mutate(response_time = hms(response_time))

NA Value Check

colSums(is.na(shop))

#>            pk_shop     date_collected             shopid               name 
#>                  0                  0                  0                  0 
#>         join_month           join_day          join_year         item_count 
#>                  0                  0                  0                  0 
#>     follower_count      response_time      response_rate      shop_location 
#>                  0                  0                  0                  0 
#>         rating_bad        rating_good      rating_normal        rating_star 
#>                  0                  0                  0                  6 
#> is_shopee_verified   is_official_shop 
#>                  0                  0

There ara several observations that has NA value in rating_star variable.

shop %>%
  filter(is.na(rating_star))

These data have NA value because the shop has no transaction yet. So, we can fill the NA value with 0.

shop$rating_star[is.na(shop$rating_star)] = 0

anyNA(shop)

#> [1] FALSE

Remove Unused Variables

shop <- shop %>%
  select(-c(pk_shop, date_collected))

head(product)

Machine Learning

Target variables:

is_shopee_verified
is_official_shop

Predictor variables:

join_date
item_count
follower_count
response_time
response_rate
shop_location
rating_bad
rating_good
rating_normal
rating_star

Data Exploration

Product Data

product %>%
  group_by(product_category) %>%
  summarise(total_product = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = total_product, y = reorder(product_category, total_product))) +
  geom_col(aes(fill = total_product), color = "darkred") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Product",) +
  geom_text(aes(label = total_product), nudge_x = -12) +
  labs( title = "Number of Products with a Rating of 4.5 and Above in Each Category",
        x = "Number of Product",
        y = NULL
      )

Based on the chart above, most of the products have a rating of 4.5 and above. The crop top category has the most products with a rating of 4.5 and above.

The following are the most in-demand product categories with a rating of 4.5 and above at each store location.

product %>%
  group_by(shop_location, product_category) %>%
  summarise(total_product = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = total_product, y = reorder(shop_location, total_product)))+
  geom_col(aes(fill = product_category), position = "fill")+
  labs(x = "Category",
       y = NULL,
       fill = NULL)+
  scale_fill_brewer(palette = "Set2")+
  theme_minimal()+
  theme(legend.position = "top")

From the graph, it can be seen that each location has different product demands.

product %>%
  group_by(product_category) %>%
  summarise(likes = round(mean(product_like_count))) %>%
  ggplot(aes(x = product_category, y = likes))+
  geom_col(aes(fill = likes), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Likes Count",) +
  geom_text(aes(label = likes), nudge_y = -80) +
  labs( title = "Most in-demand Product Categories",
        subtitle = "Based on Likes Count",
        x = NULL,
        y = "Likes Count"
      )

product %>%
  group_by(product_category) %>%
  summarise(views = round(mean(product_views))) %>%
  ggplot(aes(x = product_category, y = views))+
  geom_col(aes(fill = views), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Views",) +
  geom_text(aes(label = views), nudge_y = -1000) +
  labs( title = "Most in-demand Product Categories",
        subtitle = "Based on Views",
        x = NULL,
        y = "Views"
      )

The crop top is the most popular product category both in terms of the number of likes and the number of views.

product %>%
  mutate(feature_lowest_price_guarantee = as.factor(feature_lowest_price_guarantee)) %>%
  group_by(feature_lowest_price_guarantee) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_lowest_price_guarantee, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Lowest Price Guarantee",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_bundle_deal = as.factor(feature_can_use_bundle_deal)) %>%
  group_by(feature_can_use_bundle_deal) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_bundle_deal, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Bundle Deal",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_cod = as.factor(feature_can_use_cod)) %>%
  group_by(feature_can_use_cod) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_cod, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Cash on Delivery",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_can_use_wholesale = as.factor(feature_can_use_wholesale)) %>%
  group_by(feature_can_use_wholesale) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_can_use_wholesale, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Use Wholesale",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

product %>%
  mutate(feature_show_free_shipping = as.factor(feature_show_free_shipping)) %>%
  group_by(feature_show_free_shipping) %>%
  summarise(products = sum(table(product_total_rating[product_total_rating >= 4.5]))) %>%
  ggplot(aes(x = feature_show_free_shipping, y = products))+
  geom_col(aes(fill = products), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = products), nudge_y = -80) +
  labs( title = "Availability Feature Show Free Shipping",
        subtitle = "on Products with Rating 4.5 and Above",
        x = NULL,
        y = "Number of Products"
      )

Based on 5 graphs of feature availability on products that have a rating of 4.5 and above, it can be concluded that feature availability does not affect product ratings.

Shop Data

Here are the shops that have not yet been rated or have a 0 rating.

shop %>%
  filter(rating_star == 0)

These shops can be used as recommendations for collaboration in several programs to increase shop ratings, for example participating in flash sales programs.

shop %>%
  mutate(is_shopee_verified = as.factor(is_shopee_verified)) %>%
  group_by(is_shopee_verified) %>%
  summarise(shops = sum(table(shopid))) %>%
  ggplot(aes(x = is_shopee_verified, y = shops))+
  geom_col(aes(fill = shops), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = shops), nudge_y = -12) +
  labs( title = "Shopee Verified",
        x = NULL,
        y = "Number of Shops"
      )

Most of the shops are not shopee verified.

shop %>%
  mutate(is_official_shop = as.factor(is_official_shop)) %>%
  group_by(is_official_shop) %>%
  summarise(shops = sum(table(shopid))) %>%
  ggplot(aes(x = is_official_shop, y = shops))+
  geom_col(aes(fill = shops), position = "dodge") +
  scale_fill_gradient(low = "white", high = "salmon", name = "Number of Products") +
  geom_text(aes(label = shops), nudge_y = -12) +
  labs( title = "Official Shop",
        x = NULL,
        y = "Number of Shops"
      )

Almost all shops are not official shops.

Shops that have been verified by Shopee and official stores can increase buyer confidence in transactions and increase the chances of transactions with buyers. Buyers who have transacted will rate the store and product they bought. This is also a consideration for buyers to transact again and can attract other buyers to transact with stores that have good ratings and good products.

Increasing Online Shop Transactions Through Programs Hold by e-commerce

Karina Sasmitha

1/21/2022

Problem Research

Background Problem

Business Impact

Target User

Implementation to Similar Businesses

Outputs of The Project

Data Source

Data Preparation

Load The Packages

Product Data

Read Data

Adjust Data Type

NA Value Check

Remove Unused Variables

Shop Data

Read Data

Adjust Data Type

NA Value Check

Remove Unused Variables

Machine Learning

Data Exploration

Product Data

Shop Data