Prepare and set up


Customer Master

We can see that São Paulo and Rio de Janeiro have the most customers, after which the counts fall off gradually. Olist customers are clearly concentrated in large cities, although the overall range of locations is quite wide.


The geom_col chart suggests that the state-level picture matches the city-level one. The top states are SP, RJ, and MG. MG in particular might have many smaller cities, whereas SP and RJ are more centralized.
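The state-level count behind a chart like this can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy `customer` table and the column name `customer_state` are assumptions made for the example.

```r
library(dplyr)
library(forcats)
library(ggplot2)

# Hypothetical toy data standing in for the customer table;
# the column name customer_state is an assumption.
customer <- tibble(
  customer_id    = paste0("c", 1:8),
  customer_state = c("SP", "SP", "SP", "SP", "RJ", "RJ", "MG", "BA")
)

# Count customers per state and sort from largest to smallest.
state_counts <- customer %>%
  count(customer_state, name = "n_customers") %>%
  arrange(desc(n_customers))

# Bar chart with states ordered by customer count.
gg_state <- ggplot(
  state_counts,
  aes(x = fct_reorder(customer_state, n_customers), y = n_customers)
) +
  geom_col(aes(fill = customer_state)) +
  coord_flip() +
  labs(x = "Customer state", y = "Number of customers")

print(state_counts)
```

Running the same count on `customer_city` instead of `customer_state` (again an assumed column name) would allow a direct comparison of the city and state views.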

Geolocation

Users are spread all over Brazil. Three states (MG, RJ, SP) lie close to one another, so logistics hubs are probably clustered at very high density in this area.

From this interactive graph, we can easily see the following.

By size, there are four groups: small, medium, large, and very large. By location, there are four clusters: a very large + large cluster, a medium + large cluster, two medium points in a central cluster, and three small points.

So we conclude that logistics services play a key role in e-commerce coverage. Probably because of cost, many points sit in one place. In some areas, however, there are exceptions. This is a signal that there is competition in the e-commerce market, so Olist keeps expanding its coverage across Brazil.

Order datasets

Order status

"Each customerid has only one orderid in this order datasets
Orderid and CustomerId are both unique; neither appears more than once."
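The uniqueness claim above can be checked directly. The sketch below uses base R on a small made-up table; `order_toy` and the column names `order_id` and `customer_id` are assumptions for illustration, and the same checks would be run on the real order table.

```r
# Hypothetical toy data; column names order_id and customer_id are assumptions.
order_toy <- data.frame(
  order_id    = c("o1", "o2", "o3", "o4"),
  customer_id = c("c1", "c2", "c3", "c4"),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when no value repeats.
ids_unique <- anyDuplicated(order_toy$order_id) == 0 &&
  anyDuplicated(order_toy$customer_id) == 0

# The largest number of orders recorded for any single customer.
max_orders_per_customer <- max(table(order_toy$customer_id))

print(ids_unique)
print(max_orders_per_customer)
```

If `ids_unique` is `TRUE` and `max_orders_per_customer` is 1, every customer id maps to exactly one order id in the table.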
# Count orders per status; log10 keeps rare and common statuses visible together.
gg <- order %>%
  count(order_status) %>%
  mutate(
    log_scale = log10(n),
    order_status = fct_reorder(str_to_upper(order_status), log_scale)
  ) %>%
  ggplot(aes(x = order_status, y = log_scale, fill = order_status, label = n)) +
  # A constant outline colour belongs outside aes(); inside it is treated as data.
  geom_col(color = "white") +
  scale_fill_manual(values = RColorBrewer::brewer.pal(8, "Reds")) +
  # guides() needs the aesthetic name for the reversed legend to apply.
  guides(fill = guide_legend(reverse = TRUE)) +
  xlab("Order Status") + ylab("Log10 Scale") +
  labs(title = "The statistics of order status") +
  theme(legend.title = element_blank())

ggplotly(gg)

Leadtime

Leadtime in total

The most common delivery time from carrier to customer is 12 days. The intervals between purchase, approval, and handover to the carrier are mostly zero. The expected delivery time is higher, and its distribution looks more like a bell. The total process time is most commonly 7 days, with a distribution that leans to the right.
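Lead times of this kind can be derived from the order timestamps. The following is a rough sketch in base R on two made-up rows; the table `order_toy`, the helper `days_between`, and all timestamp column names are assumptions for illustration and may not match the notebook's own code.

```r
# Hypothetical toy rows; the timestamp column names are assumptions.
ts <- function(x) as.POSIXct(x, tz = "UTC")
order_toy <- data.frame(
  order_purchase_timestamp      = ts(c("2018-01-01 10:00:00", "2018-01-05 09:00:00")),
  order_approved_at             = ts(c("2018-01-01 10:00:00", "2018-01-05 09:00:00")),
  order_delivered_carrier_date  = ts(c("2018-01-02 10:00:00", "2018-01-06 09:00:00")),
  order_delivered_customer_date = ts(c("2018-01-14 10:00:00", "2018-01-12 09:00:00")),
  order_estimated_delivery_date = ts(c("2018-01-20 10:00:00", "2018-01-25 09:00:00"))
)

# Hypothetical helper: elapsed time between two timestamps, in days.
days_between <- function(start, end) {
  as.numeric(difftime(end, start, units = "days"))
}

# One lead-time column per stage of the process.
lead <- with(order_toy, data.frame(
  purchase_to_approval = days_between(order_purchase_timestamp, order_approved_at),
  approval_to_carrier  = days_between(order_approved_at, order_delivered_carrier_date),
  carrier_to_customer  = days_between(order_delivered_carrier_date, order_delivered_customer_date),
  total                = days_between(order_purchase_timestamp, order_delivered_customer_date),
  expected             = days_between(order_purchase_timestamp, order_estimated_delivery_date)
))

print(lead)
```

Plotting a histogram or density of each column would then show the most common value and the shape of each distribution.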

Products

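# Reshape the product dimension columns to long form and compare them with boxplots.
pro_dt[, -c(1:5)] %>%
  pivot_longer(starts_with("product_"), names_to = "Product_dimension", values_to = "value") %>%
  ggplot(aes(x = Product_dimension, y = value, fill = Product_dimension)) +
  geom_boxplot()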
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).

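# Number of payments per payment type, most frequent first.
payment %>%
  group_by(payment_type) %>%
  count() %>%
  arrange(desc(n)) %>%
  datatable()

# Distribution of payment value per payment type.
payment %>%
  ggplot(aes(x = payment_type, y = payment_value, fill = payment_type)) +
  geom_boxplot()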

# Number of sellers per state, with bars ordered by count.
gg <- seller %>%
  group_by(seller_state) %>%
  count() %>%
  arrange(desc(n)) %>%
  ggplot(aes(x = fct_reorder(seller_state, n), y = n)) +
  geom_col(aes(fill = seller_state))

ggplotly(gg)

The product dimensions differ widely from one another.

Most payments are made by credit card, boleto, and debit card. Payment values differ widely among payment types. The states with many customers also have more sellers.