Olist a Brazilian e-commerce

Olist’s Business Model

Olist is a Brazilian department store (marketplace) that operates in e-commerce segment, but is not an e-commerce itself as she says. It operates as a SaaS (Software as a Service) technology company since 2015. It offers a marketplace solution (of e-commerce segment) to shopkeepers of all sizes (aand for most segements) to increase their sales whether they have online presence or not.

Olist’s Solution

Olist’s solution consists of three aspects: Software, Contracts with the main marketplaces and Reputation sharing.

What Olist says?

Olist says she

  1. … is a large department store within marketplaces
  2. … is connected to the main e-commerce of Brazil
  3. … does not buy products
  4. … does not keep products in stock
  5. … does not carry out shipping of any products offered in its store
  6. All products are sold and shipped by the thousands of shopkeepers (registered on Olist) who sell through Olist
  7. Her strength lies in union of all participating shopkeepers, who are selling physical products
  8. Participant shopkeeper is responsible for separating, packing, and taking products to the logistics operator

Please note Olist’s perspective ( a supply chain preview):: she prescribes there are many factors that can influence the sales of a shopkeeper (e.g: type of product, demand, seasonality, competitive pricing, terms, inventory etc)

Load packages and determine working directory

Load datasets

Customers <- read.csv("olist_customers_dataset.csv", header = TRUE)
Geolocation <- read.csv("olist_geolocation_dataset.csv", header = TRUE)
Items <- read.csv("olist_order_items_dataset.csv", header = TRUE)
Payments <- read.csv("olist_order_payments_dataset.csv", header = TRUE)
Reviews <- read.csv("olist_order_reviews_dataset.csv", header = TRUE)
Orders <- read.csv("olist_orders_dataset.csv", header = TRUE)
Products <- read.csv("olist_products_dataset.csv", header = TRUE)
Sellers <- read.csv("olist_sellers_dataset.csv", header = TRUE)
Pdttrans <- read.csv("product_category_name_translation.csv", header = TRUE) %>%
               rename("product_category_name"  = "ï..product_category_name")

Data Model

On Kaggle, she shared its data of 100k orders from 2016 to 2018. There are 8+1 datasets to play with and explore. The data model has been described in image below and it is organised and normalised for each category.

Data Model

Data Model

Principal Dataset

Orders dataset

  • Order ID
  • Customer ID
  • Order Status
  • Order Purchase Timestamp
  • Order Approved at
  • Order Delivered Carrier date
  • Order Delivered Customer date
  • Order Estimated delivery date
kable(head(Orders), format = 'markdown', caption = 'Orders dataset')
Orders dataset
order_id customer_id order_status order_purchase_timestamp order_approved_at order_delivered_carrier_date order_delivered_customer_date order_estimated_delivery_date
e481f51cbdc54678b7cc49136f2d6af7 9ef432eb6251297304e76186b10a928d delivered 2017-10-02 10:56:33 2017-10-02 11:07:15 2017-10-04 19:55:00 2017-10-10 21:25:13 2017-10-18 00:00:00
53cdb2fc8bc7dce0b6741e2150273451 b0830fb4747a6c6d20dea0b8c802d7ef delivered 2018-07-24 20:41:37 2018-07-26 03:24:27 2018-07-26 14:31:00 2018-08-07 15:27:45 2018-08-13 00:00:00
47770eb9100c2d0c44946d9cf07ec65d 41ce2a54c0b03bf3443c3d931a367089 delivered 2018-08-08 08:38:49 2018-08-08 08:55:23 2018-08-08 13:50:00 2018-08-17 18:06:29 2018-09-04 00:00:00
949d5b44dbf5de918fe9c16f97b45f8a f88197465ea7920adcdbec7375364d82 delivered 2017-11-18 19:28:06 2017-11-18 19:45:59 2017-11-22 13:39:59 2017-12-02 00:28:42 2017-12-15 00:00:00
ad21c59c0840e6cb83a9ceb5573f8159 8ab97904e6daea8866dbdbc4fb7aad2c delivered 2018-02-13 21:18:39 2018-02-13 22:20:29 2018-02-14 19:46:34 2018-02-16 18:17:02 2018-02-26 00:00:00
a4591c265e18cb1dcee52889e2d8acc3 503740e9ca751ccdda7ba28e9ab8f608 delivered 2017-07-09 21:57:05 2017-07-09 22:10:13 2017-07-11 14:58:04 2017-07-26 10:57:55 2017-08-01 00:00:00

Others Datasets

Customers dataset

  • Customer ID
  • Customer Unique ID
  • Customer Zip Code prefix
  • Customer City
  • Customer State
Customers dataset
customer_id customer_unique_id customer_zip_code_prefix customer_city customer_state
06b8999e2fba1a1fbc88172c00ba8bc7 861eff4711a542e4b93843c6dd7febb0 14409 franca SP
18955e83d337fd6b2def6b18a428ac77 290c77bc529b7ac935b93aa66c333dc3 9790 sao bernardo do campo SP
4e7b3e00288586ebd08712fdd0374a03 060e732b5b29e8181a18229c7b0b2b5e 1151 sao paulo SP
b2b6027bc5c5109e529d4dc6358b12c3 259dac757896d24d7702b9acbbff3f3c 8775 mogi das cruzes SP
4f2d8ab171c80ec8364f7c12e35b23ad 345ecd01c38d18a9036ed96c73b8d066 13056 campinas SP
879864dab9bc3047522c92c82e1212b8 4c93744516667ad3b8f1fb645a3116a4 89254 jaragua do sul SC

Geolocation dataset

  • Geolocation Zip Code Prefix
  • Geolocation Lat.
  • Geolocation Lng.
  • Geolocation City
  • Geolocation State
kable(head(Geolocation), format = 'markdown', caption = 'Geolocation dataset')
Geolocation dataset
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
1037 -23.54562 -46.63929 sao paulo SP
1046 -23.54608 -46.64482 sao paulo SP
1046 -23.54613 -46.64295 sao paulo SP
1041 -23.54439 -46.63950 sao paulo SP
1035 -23.54158 -46.64161 sao paulo SP
1012 -23.54776 -46.63536 são paulo SP

Items dataset

  • Order ID
  • Order Item ID
  • Seller ID
  • Shipping Limit Date
  • Price
  • Freight Value
Items dataset
order_id order_item_id product_id seller_id shipping_limit_date price freight_value
00010242fe8c5a6d1ba2dd792cb16214 1 4244733e06e7ecb4970a6e2683c13e61 48436dade18ac8b2bce089ec2a041202 2017-09-19 09:45:35 58.90 13.29
00018f77f2f0320c557190d7a144bdd3 1 e5f2d52b802189ee658865ca93d83a8f dd7ddc04e1b6c2c614352b383efe2d36 2017-05-03 11:05:13 239.90 19.93
000229ec398224ef6ca0657da4fc703e 1 c777355d18b72b67abbeef9df44fd0fd 5b51032eddd242adc84c38acab88f23d 2018-01-18 14:48:30 199.00 17.87
00024acbcdf0a6daa1e931b038114c75 1 7634da152a4610f1595efa32f14722fc 9d7a1d34a5052409006425275ba1c2b4 2018-08-15 10:10:18 12.99 12.79
00042b26cf59d7ce69dfabb4e55b4fd9 1 ac6c3623068f30de03045865e4e10089 df560393f3a51e74553ab94004ba5c87 2017-02-13 13:57:51 199.90 18.14
00048cc3ae777c65dbb7d2a0634bc1ea 1 ef92defde845ab8450f9d70c526ef70f 6426d21aca402a131fc0a5d0960a3c90 2017-05-23 03:55:27 21.90 12.69

Payments dataset

  • Order ID
  • Payment Sequential
  • Payment Type
  • Payment Installments
  • Payment Value
kable(head(Payments), format = 'markdown', caption = 'Payments dataset')
Payments dataset
order_id payment_sequential payment_type payment_installments payment_value
b81ef226f3fe1789b1e8b2acac839d17 1 credit_card 8 99.33
a9810da82917af2d9aefd1278f1dcfa0 1 credit_card 1 24.39
25e8ea4e93396b6fa0d3dd708e76c1bd 1 credit_card 1 65.71
ba78997921bbcdc1373bb41e913ab953 1 credit_card 8 107.78
42fdf880ba16b47b59251dd489d4441a 1 credit_card 2 128.45
298fcdf1f73eb413e4d26d01b25bc1cd 1 credit_card 2 96.12

Products dataset

  • Product ID
  • Product Category name
  • Product Name length
  • Product Photos (quantity)
  • Product Weight (grams)
  • Product Length (cm)
  • Product Height (cm)
  • Product Width (cm)
Products dataset
product_id product_category_name product_name_lenght product_description_lenght product_photos_qty product_weight_g product_length_cm product_height_cm product_width_cm
1e9e8ef04dbcff4541ed26657ea517e5 perfumaria 40 287 1 225 16 10 14
3aa071139cb16b67ca9e5dea641aaa2f artes 44 276 1 1000 30 18 20
96bd76ec8810374ed1b65e291975717f esporte_lazer 46 250 1 154 18 9 15
cef67bcfe19066a932b7673e239eb23d bebes 27 261 1 371 26 4 26
9dc1a7de274444849c219cff195d0b71 utilidades_domesticas 37 402 4 625 20 17 13
41d3672d4792049fa1779bb35283ed13 instrumentos_musicais 60 745 1 200 38 5 11

Reviews dataset

  • Review ID
  • Order ID
  • Review Score
  • Review Comment title
  • Review Comment message
  • Review Creation date
  • Review Answer timestamp
Reviews dataset
review_id order_id review_score review_comment_title review_comment_message review_creation_date review_answer_timestamp
7bc2406110b926393aa56f80a40eba40 73fc7af87114b39712e6da79b0a377eb 4 2018-01-18 00:00:00 2018-01-18 21:46:59
80e641a11e56f04c1ad469d5645fdfde a548910a1c6147796b98fdf73dbeba33 5 2018-03-10 00:00:00 2018-03-11 03:05:13
228ce5500dc1d8e020d8d1322874b6f0 f9e4b658b201a9f2ecdecbb34bed034b 5 2018-02-17 00:00:00 2018-02-18 14:36:24
e64fb393e7b32834bb789ff8bb30750e 658677c97b385a9be170737859d3511b 5 Recebi bem antes do prazo estipulado. 2017-04-21 00:00:00 2017-04-21 22:02:06
f7c4243c7fe1938f181bec41a392bdeb 8e6bfb81e283fa7e4f11123a3fb894f1 5 Parabéns lojas lannister adorei comprar pela Internet seguro e prático Parabéns a todos feliz Páscoa 2018-03-01 00:00:00 2018-03-02 10:26:53
15197aa66ff4d0650b5434f1b46cda19 b18dcdf73be66366873cd26c5724d1dc 1 2018-04-13 00:00:00 2018-04-16 00:39:37

Sellers dataset

  • Seller ID
  • Seller Zip Code prefix
  • Seller City
  • Seller State
Sellers dataset
seller_id seller_zip_code_prefix seller_city seller_state
3442f8959a84dea7ee197c632cb2df15 13023 campinas SP
d1b65fc7debc3361ea86b5f14c68d2e2 13844 mogi guacu SP
ce3ad9de960102d0677a81f5d0bb7b2d 20031 rio de janeiro RJ
c0f3eea2e14555b6faeea3dd58c1b1c3 4195 sao paulo SP
51a04a8a6bdcb23deccc82b0b80742cf 12914 braganca paulista SP
c240c4061717ac1806ae6ee72be3533b 20920 rio de janeiro RJ

Product Category Translation dataset

  • Product Category name (Portuguese)
  • Product Category name (English)
kable(head(Pdttrans), format = 'markdown', caption = 'Translation dataset')
Translation dataset
product_category_name product_category_name_english
beleza_saude health_beauty
informatica_acessorios computers_accessories
automotivo auto
cama_mesa_banho bed_bath_table
moveis_decoracao furniture_decor
esporte_lazer sports_leisure

Simulation as f(Customers, Product Categories, Sellers)

Olist has been simulate addition of Customers, Product Categories, and Sellers. As per the data, there has been a steady stream of customers and sellers getting registered Olist’s platform. This progress, in conjunction with regular increase in number of new product categories being offered has maintainedOlist’s growth momentum.

Trend of New Customer’s registration

The graph shows, in 2017, there has been a positive trend line in number of New Customers [Customer Unique Identity] getting registered with Olist. In 2018, morethan 6000 were getting registered every month.

Galvanisation by Product Categories

In my opinion, addition of + 50 (approximate) New Product Categories per month - as an offer - has galvanised the spurt in new customers getting registered. Consistency in this phenomenon was Olist’s byword for 20 months i.e from January 2017 to August 2018.

Sellers’s Engagement

Sellers too maintained the similar trend as that of Customers, and within the time frame of this data, Olist ended up having + 3095 registered sellers on it’s platform

Seller.Gr <- inner_join(Orders, 
                        left_join(Items, Sellers, by = 'seller_id'), 
                        by = 'order_id') 

Seller.Gr %>% select_at(vars(purchase_year, purchase_month, seller_id)) %>%
               distinct_at(vars(purchase_year, purchase_month, seller_id)) %>% 
               arrange_at(vars(purchase_year, purchase_month)) %>%
               count(purchase_year, purchase_month) %>% 
               ggplot(aes(x = factor(purchase_month), y =n)) +
               geom_point(color = 'red', alpha = 0.63, shape = 21, aes(size = n), fill = alpha('cornsilk', 0.3)) + 
               geom_segment(aes(x = factor(purchase_month), xend = factor(purchase_month),
                                y = 0, yend = n), alpha = 0.45, linetyle = 3, color = 'red') + 
               coord_polar(start = 2) +
               facet_wrap(.~purchase_year) +
               theme_minimal() +
               theme(axis.text = element_text(size = 8), panel.grid.major = element_line(),
                     legend.position = 'bottom') +
               labs(title = 'Increase in number of sellers per month', 
                    x = 'Month', y = 'Number of Sellers')
## Warning: Ignoring unknown parameters: linetyle

Fast Moving product categories

Buoyancy of Product Categories can be gauged from the fact how frequently a Product Category has been purchased by customers? OR which are the customer driven “FAST MOVING” product categories? This kernel of information may act as one of the parameters which a company (and in this case Olist) may probably use for portfolio building.

Purchase made multiple times

Lets first start with the case where all Product Categories were purchased multiple times. The graph below shows which categories are high frequency categories.

Please Note: It excludes all the categories that were purchased as: one product category per month per year

Six sub - sections

One Order per Product Category

Case where only one order was placed for one product category in a month and in a given year

Purchased “1 < n < 6” times

Case where orders were placed for various product categories and each product category was purchased more than once but less than six times. Plus, list below provides some most frequently ordered categories

  • Construction Tools
  • Cine Photo
  • DVDs Blue Ray
  • Fashion Female Clothing
  • Music

Case where only one order was placed for one product category in a month and in a given year

Purchased “5 < n < 21” times

Case where orders were placed for various product categories and each product category was purchased more than five but less than twenty one times. Plus, list below provides some most frequently ordered categories.

  • Air Conditioning
  • Construction tools Garden
  • Fixed Telephony
  • Home Appliances 2
  • Market Place

Purchased “20 < n < 93” times

Case where orders were placed for various product categories and each product category was purchased more than twenty but less than ninety three times. Plus, list below provides some most frequently ordered categories.

  • Books general interest
  • Console games
  • Home appliances
  • Luggage accessories
  • Musical instruments
  • Small appliances

Purchased “92 < n < 251” times

Case where orders were placed for various product categories and each product category was purchased more than ninety two but less than two hundred and fifty one times. Plus, list below provides some most frequently ordered categories.

  • Baby
  • Cool stuff
  • Garden tools
  • Perfumery
  • Toys

Purchased “250 < n” times

Case where orders were placed for various product categories and each product category was purchased more than two hundred and fifty times. Plus, list below provides some most frequently ordered categories

  • Bed bath table
  • Computer accessories
  • Furniture decor
  • Health beauty
  • Housewares
  • Sports leisure

Significant features, relations & equations

k means clustering has been used to create clusters and to discover significant features. This helped me look into; how those features interacted and what were the relations between them? The clusters were created to look into data in tabular form and to discover the polynomial equations.

k - means CLustering

  • Actual Lead Time (days):: Number of days taken to deliver a product to the customer, from the moment an order has been placed in the system.
  • Order ID: Identification of order
  • Approval time (mins): Time taken by Olist’s system to approve an order after an order has been placed and before it’s intimated to supplier/carrier
  • Order Item ID: Number of Items/Order
  • Price: Price/Item, charged from a customer
  • Freight Value: Delivery Cost/Item, charged from a customer
  • Product Weight (gms): Weight of the product in grams

Please Note: Price & Freight Value is amount charged per item and is not per order. Plus, an order may have more than one item

Visualize these clusters

#assignments <- unnest(kclusts, augmented)

#p1 <- assignments %>% head(10000) %>% 
               #ggplot(aes(x = price, y = freight_value)) +
               #geom_point(aes(color))

Plot the clusters with total within

clusterings %>% ggplot(aes(x = k, y = tot.withinss)) +
               geom_line() + geom_point()


Cluster 1’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 3.207, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 1} \approx 2.830 - 0.037x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 3.984. The \(\beta = -0.037\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.037.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG31 <- glm(freight_value ~ order_item_id, data = Tot.Grkm1, family="poisson or quasipoisson")
##summary(TG31)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 1} \approx 16.22 - 96.55x - 20.80x^3 + 29.69x^4 - 20.78x^5 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]


Cluster 2’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 3.528, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 2} \approx 2.867 - 0.018x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 4.204. The \(\beta = -0.018\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.018.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG32 <- glm(freight_value ~ order_item_id, data = Tot.Grkm2, family="poisson or quasipoisson")
##summary(TG32)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 2} \approx 17.14 - 34.91x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]


Cluster 3’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 35.237, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 3} \approx 4.525 - 0.032x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 38.974. The \(\beta = -0.032\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.032.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG33 <- glm(freight_value ~ order_item_id, data = Tot.Grkm3, family="poisson or quasipoisson")
##summary(TG33)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 3} \approx 89.06 + \epsilon \]

\[ \epsilon \Rightarrow \text{error} \]


Cluster 4’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 18.649, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 4} \approx 3.964 - 0.028x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 23.213. The \(\beta = -0.028\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.028.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG34 <- glm(freight_value ~ order_item_id, data = Tot.Grkm4, family="poisson or quasipoisson")
##summary(TG34)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 4} \approx 50.88 + 98.48x^2 - 90.17x^3 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]


Cluster 5’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 4.810, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 5} \approx 3.169 - 0.068x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 5.761. The \(\beta = -0.068\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.068.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG35 <- glm(freight_value ~ order_item_id, data = Tot.Grkm5, family="poisson or quasipoisson")
##summary(TG35)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 5} \approx 22.00 - 95.53x + 27.75x^2 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]


Cluster 6’s Equations

The \(\frac{\text{Residual Deviance}}{\text{Degrees of freedom}} > 1 \text{ or} \text{ } \frac{\text{Variance}}{\text{Mean}} \neq 1\) for Poisson Regression is at 9.993, so probably there is overdispersion. We adjust to take into account overdispersion and use Quasi-Poisson Regression equation.

Quasi-Poisson: regression equation

\[\text{Freight Value}_\text{ Cluster 6} \approx 3.461 - 0.012x + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]

Interpretation: The predictor is significant as p-value \(\approx\) ZERO, the Dispersion parameter for Quasi-Poisson family is taken to be at 12.102. The \(\beta = -0.012\) or \(\beta < 0\) implies with every unit increase in items per order the freight value will decrease and will be multiplied by -0.012.

IF we take 8 different parameters to determine the Freight Value then using Quasi-Poisson regression - following predictors are significant with p-value \(\approx\) ZERO:

  • Intercept
  • Number of Items per order
  • Price
  • Estimated lead time
  • Actaul lead time
  • Product’s Weight (g)
  • Product’s Length (cm)
  • Product’s Height (cm)
  • Product’s Width (cm)

…and lastly the distance is yet to be taken into account.

NB: - strikethrough are non-significant parameters. Codes can be had from ‘Codes’ section.

##TG36 <- glm(freight_value ~ order_item_id, data = Tot.Grkm6, family="poisson or quasipoisson")
##summary(TG36)

Polynomial: regression equation

\[\text{Freight Value}_\text{ Cluster 6} \approx 31.41 + 49.12x^2 + \epsilon \]

\[ x \Rightarrow \text{Number of Items per Order}, \text{ } \epsilon \Rightarrow \text{error}\]


Sellers and Customers spread across states

The graph below shows how the density of customers and sellers are spread across various states of Brazil. It is a stagnant presentation of graph “Customers.Sellers.States.gif”, that is pasted below along with it’s code.

Customer Order Cycle impression

Graph is the first pictorial insight into: how Customer Order Cycle (the time to deliver goods) has changed as Olist started growing. This graph is the stagnant presentation of the graph “Lead.Times.gif”, which has been pasted below along with it’s code.

Conclusion

Thanks for giving your time to this long notebook. Should you like it, please do upvote.

Appendix

Appendix 01

The code of “Customers.Sellers.States.gif” graph has been provided below. The animated graph had been created with gganimate package but due to API version incompatibility the actual code has been provided below for reference.

#CS.states %>% 
#ggplot(aes(y = customer_state, x = seller_state))+
  #geom_jitter(fill = alpha("blue", 0.001), color = "white", alpha = 0.36, size = 2.1, shape = 21)+
  #coord_polar(start = 0)+
  #theme_minimal()+
  #labs(title ="Order Status' Density :: ...{closest_state}...", 
  #x = "Seller's State", y = "Customer's State", 
  #subtitle = "States :: Sellers vs. Customers")+
#transition_states(order_status, transition_length = 5, state_length = 5)
State

State


Appendix 02

The actual graph had been created with gganimate package but due to API version incompatibility the code of “Lead.Times.gif” graph has been provided below for reference.

Average Lead Times per Quarter per Year

Average Lead Times per Quarter per Year

#Lead.Time %>% ggplot(aes(x = purchase_quart, y =Est.Lead.t))+
               #geom_line(color = "red", alpha = 0.81, linetype = "longdash")+
               #geom_segment(aes(xend = 2018.4, yend = Est.Lead.t), 
                            #linetype = 2, colour = 'grey51')+
               #geom_text(aes(x = 2018.3, label = Quart.Y, hjust = 1))+
               #geom_point(size=3, color="red", fill=alpha("orange", 0.63), 
                          #alpha=0.54, shape=21, stroke=1)+
               #facet_wrap(~order_status)+
               #coord_polar(start = 0)+
               #transition_reveal(purchase_quart)+
               #theme(axis.text.x = element_blank(),
                     #axis.ticks = element_blank(),
                     #panel.background = element_blank(),
                     #panel.grid.major.x = element_line(color = "coral", linetype = 3),
                     #panel.grid.major.y = element_line(color = "coral", linetype = 3),
                     #strip.background = element_blank())+
               #labs(title = "Change in proposed average lead times per Quarter/Year",
                    #subtitle = "Proposed average number of days to delivery", 
                    #x = "Purchasing Quarter/Year", 
                    #y= " Estimated average lead time [... in Days]", 
                    #caption = "derived from Thomas Lin Pedersen's Dec 2018 graph: ~github.com/thomasp85")
gif

gif