Olist a Brazilian e-commerce

Olist’s Business Model

Olist is a Brazilian department store (marketplace) that operates in e-commerce segment, but is not an e-commerce itself as she says. It operates as a SaaS (Software as a Service) technology company since 2015. It offers a marketplace solution (of e-commerce segment) to shopkeepers of all sizes (aand for most segements) to increase their sales whether they have online presence or not.

Olist’s Solution

Olist’s solution consists of three aspects: Software, Contracts with the main marketplaces and Reputation sharing.

What Olist says?

Olist says she

… is a large department store within marketplaces
… is connected to the main e-commerce of Brazil
… does not buy products
… does not keep products in stock
… does not carry out shipping of any products offered in its store
All products are sold and shipped by the thousands of shopkeepers (registered on Olist) who sell through Olist
Her strength lies in union of all participating shopkeepers, who are selling physical products
Participant shopkeeper is responsible for separating, packing, and taking products to the logistics operator

Please note Olist’s perspective ( a supply chain preview):: she prescribes there are many factors that can influence the sales of a shopkeeper (e.g: type of product, demand, seasonality, competitive pricing, terms, inventory etc)

Load packages and determine working directory

Load datasets

Customers <- read.csv("olist_customers_dataset.csv", header = TRUE)
Geolocation <- read.csv("olist_geolocation_dataset.csv", header = TRUE)
Items <- read.csv("olist_order_items_dataset.csv", header = TRUE)
Payments <- read.csv("olist_order_payments_dataset.csv", header = TRUE)
Reviews <- read.csv("olist_order_reviews_dataset.csv", header = TRUE)
Orders <- read.csv("olist_orders_dataset.csv", header = TRUE)
Products <- read.csv("olist_products_dataset.csv", header = TRUE)
Sellers <- read.csv("olist_sellers_dataset.csv", header = TRUE)
Pdttrans <- read.csv("product_category_name_translation.csv", header = TRUE) %>%
               rename("product_category_name"  = "ï..product_category_name")

Data Model

On Kaggle, she shared its data of 100k orders from 2016 to 2018. There are 8+1 datasets to play with and explore. The data model has been described in image below and it is organised and normalised for each category.

Data Model

Principal Dataset

Orders dataset

Order ID
Customer ID
Order Status
Order Purchase Timestamp
Order Approved at
Order Delivered Carrier date
Order Delivered Customer date
Order Estimated delivery date

kable(head(Orders), format = 'markdown', caption = 'Orders dataset')

Orders dataset
order_id	customer_id	order_status	order_purchase_timestamp	order_approved_at	order_delivered_carrier_date	order_delivered_customer_date	order_estimated_delivery_date
e481f51cbdc54678b7cc49136f2d6af7	9ef432eb6251297304e76186b10a928d	delivered	2017-10-02 10:56:33	2017-10-02 11:07:15	2017-10-04 19:55:00	2017-10-10 21:25:13	2017-10-18 00:00:00
53cdb2fc8bc7dce0b6741e2150273451	b0830fb4747a6c6d20dea0b8c802d7ef	delivered	2018-07-24 20:41:37	2018-07-26 03:24:27	2018-07-26 14:31:00	2018-08-07 15:27:45	2018-08-13 00:00:00
47770eb9100c2d0c44946d9cf07ec65d	41ce2a54c0b03bf3443c3d931a367089	delivered	2018-08-08 08:38:49	2018-08-08 08:55:23	2018-08-08 13:50:00	2018-08-17 18:06:29	2018-09-04 00:00:00
949d5b44dbf5de918fe9c16f97b45f8a	f88197465ea7920adcdbec7375364d82	delivered	2017-11-18 19:28:06	2017-11-18 19:45:59	2017-11-22 13:39:59	2017-12-02 00:28:42	2017-12-15 00:00:00
ad21c59c0840e6cb83a9ceb5573f8159	8ab97904e6daea8866dbdbc4fb7aad2c	delivered	2018-02-13 21:18:39	2018-02-13 22:20:29	2018-02-14 19:46:34	2018-02-16 18:17:02	2018-02-26 00:00:00
a4591c265e18cb1dcee52889e2d8acc3	503740e9ca751ccdda7ba28e9ab8f608	delivered	2017-07-09 21:57:05	2017-07-09 22:10:13	2017-07-11 14:58:04	2017-07-26 10:57:55	2017-08-01 00:00:00

Others Datasets

Customers dataset

Customer ID
Customer Unique ID
Customer Zip Code prefix
Customer City
Customer State

Customers dataset
customer_id	customer_unique_id	customer_zip_code_prefix	customer_city	customer_state
06b8999e2fba1a1fbc88172c00ba8bc7	861eff4711a542e4b93843c6dd7febb0	14409	franca	SP
18955e83d337fd6b2def6b18a428ac77	290c77bc529b7ac935b93aa66c333dc3	9790	sao bernardo do campo	SP
4e7b3e00288586ebd08712fdd0374a03	060e732b5b29e8181a18229c7b0b2b5e	1151	sao paulo	SP
b2b6027bc5c5109e529d4dc6358b12c3	259dac757896d24d7702b9acbbff3f3c	8775	mogi das cruzes	SP
4f2d8ab171c80ec8364f7c12e35b23ad	345ecd01c38d18a9036ed96c73b8d066	13056	campinas	SP
879864dab9bc3047522c92c82e1212b8	4c93744516667ad3b8f1fb645a3116a4	89254	jaragua do sul	SC

Geolocation dataset

Geolocation Zip Code Prefix
Geolocation Lat.
Geolocation Lng.
Geolocation City
Geolocation State

kable(head(Geolocation), format = 'markdown', caption = 'Geolocation dataset')

Geolocation dataset
geolocation_zip_code_prefix	geolocation_lat	geolocation_lng	geolocation_city	geolocation_state
1037	-23.54562	-46.63929	sao paulo	SP
1046	-23.54608	-46.64482	sao paulo	SP
1046	-23.54613	-46.64295	sao paulo	SP
1041	-23.54439	-46.63950	sao paulo	SP
1035	-23.54158	-46.64161	sao paulo	SP
1012	-23.54776	-46.63536	sÃ£o paulo	SP

Items dataset

Order ID
Order Item ID
Seller ID
Shipping Limit Date
Price
Freight Value

Items dataset
order_id	order_item_id	product_id	seller_id	shipping_limit_date	price	freight_value
00010242fe8c5a6d1ba2dd792cb16214	1	4244733e06e7ecb4970a6e2683c13e61	48436dade18ac8b2bce089ec2a041202	2017-09-19 09:45:35	58.90	13.29
00018f77f2f0320c557190d7a144bdd3	1	e5f2d52b802189ee658865ca93d83a8f	dd7ddc04e1b6c2c614352b383efe2d36	2017-05-03 11:05:13	239.90	19.93
000229ec398224ef6ca0657da4fc703e	1	c777355d18b72b67abbeef9df44fd0fd	5b51032eddd242adc84c38acab88f23d	2018-01-18 14:48:30	199.00	17.87
00024acbcdf0a6daa1e931b038114c75	1	7634da152a4610f1595efa32f14722fc	9d7a1d34a5052409006425275ba1c2b4	2018-08-15 10:10:18	12.99	12.79
00042b26cf59d7ce69dfabb4e55b4fd9	1	ac6c3623068f30de03045865e4e10089	df560393f3a51e74553ab94004ba5c87	2017-02-13 13:57:51	199.90	18.14
00048cc3ae777c65dbb7d2a0634bc1ea	1	ef92defde845ab8450f9d70c526ef70f	6426d21aca402a131fc0a5d0960a3c90	2017-05-23 03:55:27	21.90	12.69

Payments dataset

Order ID
Payment Sequential
Payment Type
Payment Installments
Payment Value

kable(head(Payments), format = 'markdown', caption = 'Payments dataset')

Payments dataset
order_id	payment_sequential	payment_type	payment_installments	payment_value
b81ef226f3fe1789b1e8b2acac839d17	1	credit_card	8	99.33
a9810da82917af2d9aefd1278f1dcfa0	1	credit_card	1	24.39
25e8ea4e93396b6fa0d3dd708e76c1bd	1	credit_card	1	65.71
ba78997921bbcdc1373bb41e913ab953	1	credit_card	8	107.78
42fdf880ba16b47b59251dd489d4441a	1	credit_card	2	128.45
298fcdf1f73eb413e4d26d01b25bc1cd	1	credit_card	2	96.12

Products dataset

Product ID
Product Category name
Product Name length
Product Photos (quantity)
Product Weight (grams)
Product Length (cm)
Product Height (cm)
Product Width (cm)

Products dataset
product_id	product_category_name	product_name_lenght	product_description_lenght	product_photos_qty	product_weight_g	product_length_cm	product_height_cm	product_width_cm
1e9e8ef04dbcff4541ed26657ea517e5	perfumaria	40	287	1	225	16	10	14
3aa071139cb16b67ca9e5dea641aaa2f	artes	44	276	1	1000	30	18	20
96bd76ec8810374ed1b65e291975717f	esporte_lazer	46	250	1	154	18	9	15
cef67bcfe19066a932b7673e239eb23d	bebes	27	261	1	371	26	4	26
9dc1a7de274444849c219cff195d0b71	utilidades_domesticas	37	402	4	625	20	17	13
41d3672d4792049fa1779bb35283ed13	instrumentos_musicais	60	745	1	200	38	5	11

Reviews dataset

Review ID
Order ID
Review Score
Review Comment title
Review Comment message
Review Creation date
Review Answer timestamp

Reviews dataset
review_id	order_id	review_score	review_comment_message	review_creation_date	review_answer_timestamp
7bc2406110b926393aa56f80a40eba40	73fc7af87114b39712e6da79b0a377eb	4		2018-01-18 00:00:00	2018-01-18 21:46:59
80e641a11e56f04c1ad469d5645fdfde	a548910a1c6147796b98fdf73dbeba33	5		2018-03-10 00:00:00	2018-03-11 03:05:13
228ce5500dc1d8e020d8d1322874b6f0	f9e4b658b201a9f2ecdecbb34bed034b	5		2018-02-17 00:00:00	2018-02-18 14:36:24
e64fb393e7b32834bb789ff8bb30750e	658677c97b385a9be170737859d3511b	5	Recebi bem antes do prazo estipulado.	2017-04-21 00:00:00	2017-04-21 22:02:06
f7c4243c7fe1938f181bec41a392bdeb	8e6bfb81e283fa7e4f11123a3fb894f1	5	ParabÃ©ns lojas lannister adorei comprar pela Internet seguro e prÃ¡tico ParabÃ©ns a todos feliz PÃ¡scoa	2018-03-01 00:00:00	2018-03-02 10:26:53
15197aa66ff4d0650b5434f1b46cda19	b18dcdf73be66366873cd26c5724d1dc	1		2018-04-13 00:00:00	2018-04-16 00:39:37

Sellers dataset

Seller ID
Seller Zip Code prefix
Seller City
Seller State

Sellers dataset
seller_id	seller_zip_code_prefix	seller_city	seller_state
3442f8959a84dea7ee197c632cb2df15	13023	campinas	SP
d1b65fc7debc3361ea86b5f14c68d2e2	13844	mogi guacu	SP
ce3ad9de960102d0677a81f5d0bb7b2d	20031	rio de janeiro	RJ
c0f3eea2e14555b6faeea3dd58c1b1c3	4195	sao paulo	SP
51a04a8a6bdcb23deccc82b0b80742cf	12914	braganca paulista	SP
c240c4061717ac1806ae6ee72be3533b	20920	rio de janeiro	RJ

Product Category Translation dataset

Product Category name (Portuguese)
Product Category name (English)

kable(head(Pdttrans), format = 'markdown', caption = 'Translation dataset')

Translation dataset
product_category_name	product_category_name_english
beleza_saude	health_beauty
informatica_acessorios	computers_accessories
automotivo	auto
cama_mesa_banho	bed_bath_table
moveis_decoracao	furniture_decor
esporte_lazer	sports_leisure

Simulation as f(Customers, Product Categories, Sellers)

Olist has been simulate addition of Customers, Product Categories, and Sellers. As per the data, there has been a steady stream of customers and sellers getting registered Olist’s platform. This progress, in conjunction with regular increase in number of new product categories being offered has maintainedOlist’s growth momentum.

Trend of New Customer’s registration

The graph shows, in 2017, there has been a positive trend line in number of New Customers [Customer Unique Identity] getting registered with Olist. In 2018, morethan 6000 were getting registered every month.

Galvanisation by Product Categories

In my opinion, addition of + 50 (approximate) New Product Categories per month - as an offer - has galvanised the spurt in new customers getting registered. Consistency in this phenomenon was Olist’s byword for 20 months i.e from January 2017 to August 2018.

Sellers’s Engagement

Sellers too maintained the similar trend as that of Customers, and within the time frame of this data, Olist ended up having + 3095 registered sellers on it’s platform

Seller.Gr <- inner_join(Orders, 
                        left_join(Items, Sellers, by = 'seller_id'), 
                        by = 'order_id') 

Seller.Gr %>% select_at(vars(purchase_year, purchase_month, seller_id)) %>%
               distinct_at(vars(purchase_year, purchase_month, seller_id)) %>% 
               arrange_at(vars(purchase_year, purchase_month)) %>%
               count(purchase_year, purchase_month) %>% 
               ggplot(aes(x = factor(purchase_month), y =n)) +
               geom_point(color = 'red', alpha = 0.63, shape = 21, aes(size = n), fill = alpha('cornsilk', 0.3)) + 
               geom_segment(aes(x = factor(purchase_month), xend = factor(purchase_month),
                                y = 0, yend = n), alpha = 0.45, linetyle = 3, color = 'red') + 
               coord_polar(start = 2) +
               facet_wrap(.~purchase_year) +
               theme_minimal() +
               theme(axis.text = element_text(size = 8), panel.grid.major = element_line(),
                     legend.position = 'bottom') +
               labs(title = 'Increase in number of sellers per month', 
                    x = 'Month', y = 'Number of Sellers')

## Warning: Ignoring unknown parameters: linetyle

Fast Moving product categories

Buoyancy of Product Categories can be gauged from the fact how frequently a Product Category has been purchased by customers? OR which are the customer driven “FAST MOVING” product categories? This kernel of information may act as one of the parameters which a company (and in this case Olist) may probably use for portfolio building.

Purchase made multiple times

Lets first start with the case where all Product Categories were purchased multiple times. The graph below shows which categories are high frequency categories.

Please Note: It excludes all the categories that were purchased as: one product category per month per year

Six sub - sections

One Order per Product Category

Case where only one order was placed for one product category in a month and in a given year

Purchased “1 < n < 6” times

Case where orders were placed for various product categories and each product category was purchased more than once but less than six times. Plus, list below provides some most frequently ordered categories

Construction Tools
Cine Photo
DVDs Blue Ray
Fashion Female Clothing
Music

Case where only one order was placed for one product category in a month and in a given year

Purchased “5 < n < 21” times

Case where orders were placed for various product categories and each product category was purchased more than five but less than twenty one times. Plus, list below provides some most frequently ordered categories.

Air Conditioning
Construction tools Garden
Fixed Telephony
Home Appliances 2
Market Place

Purchased “20 < n < 93” times

Case where orders were placed for various product categories and each product category was purchased more than twenty but less than ninety three times. Plus, list below provides some most frequently ordered categories.

Books general interest
Console games
Home appliances
Luggage accessories
Musical instruments
Small appliances

Purchased “92 < n < 251” times

Case where orders were placed for various product categories and each product category was purchased more than ninety two but less than two hundred and fifty one times. Plus, list below provides some most frequently ordered categories.

Baby
Cool stuff
Garden tools
Perfumery
Toys

Purchased “250 < n” times

Case where orders were placed for various product categories and each product category was purchased more than two hundred and fifty times. Plus, list below provides some most frequently ordered categories

Bed bath table
Computer accessories
Furniture decor
Health beauty
Housewares
Sports leisure

Significant features, relations & equations

k means clustering has been used to create clusters and to discover significant features. This helped me look into; how those features interacted and what were the relations between them? The clusters were created to look into data in tabular form and to discover the polynomial equations.

k - means CLustering

Actual Lead Time (days):: Number of days taken to deliver a product to the customer, from the moment an order has been placed in the system.
Order ID: Identification of order
Approval time (mins): Time taken by Olist’s system to approve an order after an order has been placed and before it’s intimated to supplier/carrier
Order Item ID: Number of Items/Order
Price: Price/Item, charged from a customer
Freight Value: Delivery Cost/Item, charged from a customer
Product Weight (gms): Weight of the product in grams

Please Note: Price & Freight Value is amount charged per item and is not per order. Plus, an order may have more than one item

Visualize these clusters

#assignments <- unnest(kclusts, augmented)

#p1 <- assignments %>% head(10000) %>% 
               #ggplot(aes(x = price, y = freight_value)) +
               #geom_point(aes(color))

Plot the clusters with total within

clusterings %>% ggplot(aes(x = k, y = tot.withinss)) +
               geom_line() + geom_point()

Cluster 1’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 3.207, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 1} \approx 2.830 - 0.037x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 3.984. The \(\beta = -0.037\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.037.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
Number of Items per order
Price
Estimated lead time
Actaul lead time
Product’s Weight (g)
Product’s Length (cm)
Product’s Height (cm)
Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG31 <- glm(freight_value ~ order_item_id, data = Tot.Grkm1, family="poisson or quasipoisson")
##summary(TG31)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 1} \approx 16.22 - 96.55x - 20.80x^3 + 29.69x^4 - 20.78x^5 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Cluster 2’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 3.528, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 2} \approx 2.867 - 0.018x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 4.204. The \(\beta = -0.018\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.018.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
Number of Items per order
Price
Estimated lead time
Actaul lead time
Product’s Weight (g)
Product’s Length (cm)
~~Product’s Height (cm)~~
~~Product’s Width (cm)~~

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG32 <- glm(freight_value ~ order_item_id, data = Tot.Grkm2, family="poisson or quasipoisson")
##summary(TG32)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 2} \approx 17.14 - 34.91x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Cluster 3’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 35.237, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 3} \approx 4.525 - 0.032x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 38.974. The \(\beta = -0.032\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.032.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
~~Number of Items per order~~
Price
Estimated lead time
~~Actaul lead time~~
~~Product’s Weight (g)~~
Product’s Length (cm)
Product’s Height (cm)
Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG33 <- glm(freight_value ~ order_item_id, data = Tot.Grkm3, family="poisson or quasipoisson")
##summary(TG33)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 3} \approx 89.06 + \epsilon \]

\[ \epsilon \Rightarrow \text{error} \]

Cluster 4’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 18.649, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 4} \approx 3.964 - 0.028x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 23.213. The \(\beta = -0.028\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.028.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
~~Number of Items per order~~
Price
Estimated lead time
~~Actaul lead time~~
Product’s Weight (g)
Product’s Length (cm)
Product’s Height (cm)
Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG34 <- glm(freight_value ~ order_item_id, data = Tot.Grkm4, family="poisson or quasipoisson")
##summary(TG34)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 4} \approx 50.88 + 98.48x^2 - 90.17x^3 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Cluster 5’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 4.810, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 5} \approx 3.169 - 0.068x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 5.761. The \(\beta = -0.068\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.068.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
Number of Items per order
Price
Estimated lead time
Actaul lead time
Product’s Weight (g)
Product’s Length (cm)
Product’s Height (cm)
Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG35 <- glm(freight_value ~ order_item_id, data = Tot.Grkm5, family="poisson or quasipoisson")
##summary(TG35)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 5} \approx 22.00 - 95.53x + 27.75x^2 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Cluster 6’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 9.993, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 6} \approx 3.461 - 0.012x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 12.102. The \(\beta = -0.012\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.012.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

Intercept
~~Number of Items per order~~
Price
Estimated lead time
Actaul lead time
Product’s Weight (g)
Product’s Length (cm)
Product’s Height (cm)
Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - ~~strikethrough~~ are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG36 <- glm(freight_value ~ order_item_id, data = Tot.Grkm6, family="poisson or quasipoisson")
##summary(TG36)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 6} \approx 31.41 + 49.12x^2 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Sellers and Customers spread across states

The graph below shows how the density of customers and sellers are spread across various states of Brazil. It is a stagnant presentation of graph “Customers.Sellers.States.gif”, that is pasted below along with it’s code.

Customer Order Cycle impression

Graph is the first pictorial insight into: how Customer Order Cycle (the time to deliver goods) has changed as Olist started growing. This graph is the stagnant presentation of the graph “Lead.Times.gif”, which has been pasted below along with it’s code.

Conclusion

Thanks for giving your time to this long notebook. Should you like it, please do upvote.

Appendix

Appendix 01

The code of “Customers.Sellers.States.gif” graph has been provided below. The animated graph had been created with gganimate package but due to API version incompatibility the actual code has been provided below for reference.

#CS.states %>% 
#ggplot(aes(y = customer_state, x = seller_state))+
  #geom_jitter(fill = alpha("blue", 0.001), color = "white", alpha = 0.36, size = 2.1, shape = 21)+
  #coord_polar(start = 0)+
  #theme_minimal()+
  #labs(title ="Order Status' Density :: ...{closest_state}...", 
  #x = "Seller's State", y = "Customer's State", 
  #subtitle = "States :: Sellers vs. Customers")+
#transition_states(order_status, transition_length = 5, state_length = 5)

State

Appendix 02

The actual graph had been created with gganimate package but due to API version incompatibility the code of “Lead.Times.gif” graph has been provided below for reference.

Average Lead Times per Quarter per Year

#Lead.Time %>% ggplot(aes(x = purchase_quart, y =Est.Lead.t))+
               #geom_line(color = "red", alpha = 0.81, linetype = "longdash")+
               #geom_segment(aes(xend = 2018.4, yend = Est.Lead.t), 
                            #linetype = 2, colour = 'grey51')+
               #geom_text(aes(x = 2018.3, label = Quart.Y, hjust = 1))+
               #geom_point(size=3, color="red", fill=alpha("orange", 0.63), 
                          #alpha=0.54, shape=21, stroke=1)+
               #facet_wrap(~order_status)+
               #coord_polar(start = 0)+
               #transition_reveal(purchase_quart)+
               #theme(axis.text.x = element_blank(),
                     #axis.ticks = element_blank(),
                     #panel.background = element_blank(),
                     #panel.grid.major.x = element_line(color = "coral", linetype = 3),
                     #panel.grid.major.y = element_line(color = "coral", linetype = 3),
                     #strip.background = element_blank())+
               #labs(title = "Change in proposed average lead times per Quarter/Year",
                    #subtitle = "Proposed average number of days to delivery", 
                    #x = "Purchasing Quarter/Year", 
                    #y= " Estimated average lead time [... in Days]", 
                    #caption = "derived from Thomas Lin Pedersen's Dec 2018 graph: ~github.com/thomasp85")

gif

Olist e-commerce Analytics

Nguyen_LSCM

3/14/2021

Olist’s Business Model

Olist’s Solution

What Olist says?

Data Model

Principal Dataset

Others Datasets

Customers dataset

Geolocation dataset

Items dataset

Payments dataset

Products dataset

Reviews dataset

Sellers dataset

Product Category Translation dataset

Simulation as f(Customers, Product Categories, Sellers)

Trend of New Customer’s registration

Galvanisation by Product Categories

Sellers’s Engagement

Fast Moving product categories

Purchase made multiple times

Six sub - sections

One Order per Product Category

Purchased “1 < n < 6” times

Purchased “5 < n < 21” times

Purchased “20 < n < 93” times

Purchased “92 < n < 251” times

Purchased “250 < n” times

Significant features, relations & equations

k - means CLustering

Visualize these clusters

Plot the clusters with total within

Cluster 1’s Equations

Cluster 2’s Equations

Cluster 3’s Equations

Cluster 4’s Equations

Cluster 5’s Equations

Cluster 6’s Equations

Sellers and Customers spread across states

Customer Order Cycle impression

Conclusion

Appendix