setwd('C:/Users/DellPC/Desktop/Corner/Py_source_code/Project/Olist')

htmltools::includeHTML('Olist_Visualization_Py.html')
Olist_Visualization_Py

Olist dataset

Schema

In [297]:
import os
import pandas as pd
import numpy as np 

In [298]:
geo = pd.read_csv('olist_geolocation_dataset.csv', dtype = {'geolocation_zip_code_prefix' : str})
In [299]:
geo
Out[299]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state
0 01037 -23.545621 -46.639292 sao paulo SP
1 01046 -23.546081 -46.644820 sao paulo SP
2 01046 -23.546129 -46.642951 sao paulo SP
3 01041 -23.544392 -46.639499 sao paulo SP
4 01035 -23.541578 -46.641607 sao paulo SP
... ... ... ... ... ...
1000158 99950 -28.068639 -52.010705 tapejara RS
1000159 99900 -27.877125 -52.224882 getulio vargas RS
1000160 99950 -28.071855 -52.014716 tapejara RS
1000161 99980 -28.388932 -51.846871 david canabarro RS
1000162 99950 -28.070104 -52.018658 tapejara RS

1000163 rows × 5 columns

In [300]:
# Get the first three and four first digits of zip codes, 
# we will explore his further to understand how zup codes works

geo['geolocation_zip_code_prefix_1_digits'] = geo['geolocation_zip_code_prefix'].str[0:1]
geo['geolocation_zip_code_prefix_2_digits'] = geo['geolocation_zip_code_prefix'].str[0:2]
geo['geolocation_zip_code_prefix_3_digits'] = geo['geolocation_zip_code_prefix'].str[0:3]
geo['geolocation_zip_code_prefix_4_digits'] = geo['geolocation_zip_code_prefix'].str[0:4]
In [301]:
geo.head(10)
Out[301]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_3_digits geolocation_zip_code_prefix_4_digits
0 01037 -23.545621 -46.639292 sao paulo SP 0 01 010 0103
1 01046 -23.546081 -46.644820 sao paulo SP 0 01 010 0104
2 01046 -23.546129 -46.642951 sao paulo SP 0 01 010 0104
3 01041 -23.544392 -46.639499 sao paulo SP 0 01 010 0104
4 01035 -23.541578 -46.641607 sao paulo SP 0 01 010 0103
5 01012 -23.547762 -46.635361 são paulo SP 0 01 010 0101
6 01047 -23.546273 -46.641225 sao paulo SP 0 01 010 0104
7 01013 -23.546923 -46.634264 sao paulo SP 0 01 010 0101
8 01029 -23.543769 -46.634278 sao paulo SP 0 01 010 0102
9 01011 -23.547640 -46.636032 sao paulo SP 0 01 010 0101
In [302]:
 geo['geolocation_zip_code_prefix'].nunique()
(r, c) = geo.shape
r/ geo['geolocation_zip_code_prefix'].nunique()
Out[302]:
52.598632658427555
There are 18.051 different zip_code_prefix. On average there are 52.6 coordinates for each prefix. But there is one prefix with 1146 coordinates availan;e/
In [303]:
geo['geolocation_zip_code_prefix'].value_counts(ascending = True).to_frame().describe()
Out[303]:
geolocation_zip_code_prefix
count 19015.000000
mean 52.598633
std 72.057907
min 1.000000
25% 10.000000
50% 29.000000
75% 66.500000
max 1146.000000
There are some outliers coordinates in the dataset that are outside of brazilian territory. Lets guarantee that all coordinates are within a rectangle delimited by the limits of Brazil.
In [304]:
# Removing some outliers 
geo = geo[geo.geolocation_lat <= 5.27438888]
geo = geo[geo.geolocation_lng >= -73.98283055]
geo = geo[geo.geolocation_lat >= -33.75116944]
geo = geo[geo.geolocation_lng <= -34.79314722]
In [305]:
from datashader.utils import lnglat_to_meters as webm

x, y = webm(geo.geolocation_lng, geo.geolocation_lat)
geo['x'] = pd.Series(x)
geo['y'] = pd.Series(y)
Then we treat the latitude and longitude coordinates and transform them to Mercator x/y Coordinates/
In [306]:
geo.head(3)
Out[306]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_3_digits geolocation_zip_code_prefix_4_digits x y
0 01037 -23.545621 -46.639292 sao paulo SP 0 01 010 0103 -5.191862e+06 -2.698137e+06
1 01046 -23.546081 -46.644820 sao paulo SP 0 01 010 0104 -5.192478e+06 -2.698193e+06
2 01046 -23.546129 -46.642951 sao paulo SP 0 01 010 0104 -5.192270e+06 -2.698199e+06

Zip Codes in Brazil

Finnally ploting the coordinates on a map. We see there is a relationship between the zip code prefix and the location of that zip code. They start in Sao Paulo, with prefix 01001, and then increase counterclockwise finishing in Rio Grande do Sul (south of Brazil) with prefix 99990.
In [307]:
# transforming the prefixes to int for plotting purposes
geo['geolocation_zip_code_prefix'] = geo['geolocation_zip_code_prefix'].astype(int)
geo['geolocation_zip_code_prefix_1_digits'] = geo['geolocation_zip_code_prefix_1_digits'].astype(int)
geo['geolocation_zip_code_prefix_2_digits'] = geo['geolocation_zip_code_prefix_2_digits'].astype(int)
geo['geolocation_zip_code_prefix_3_digits'] = geo['geolocation_zip_code_prefix_3_digits'].astype(int)
geo['geolocation_zip_code_prefix_4_digits'] = geo['geolocation_zip_code_prefix_4_digits'].astype(int)
In [308]:
brazil = geo 
agg_name = 'geolocation_zip_code_prefix'
brazil[agg_name].describe().to_frame()
Out[308]:
geolocation_zip_code_prefix
count 1.000121e+06
mean 3.657332e+04
std 3.054939e+04
min 1.001000e+03
25% 1.107500e+04
50% 2.653000e+04
75% 6.350400e+04
max 9.999000e+04
In [309]:
import holoviews as hv
import geoviews as gv
import datashader as ds
from colorcet import fire, rainbow, bgy, bjy, bkr, kb, kr
from datashader.colors import colormap_select, Greys9
from holoviews.streams import RangeXY
from holoviews.operation.datashader import datashade, dynspread, rasterize
from bokeh.io import push_notebook, show, output_notebook
output_notebook()
hv.extension('bokeh')
%opts Overlay [width=800 height=600 toolbar='above' xaxis=None yaxis=None]
%opts QuadMesh [tools=['hover'] colorbar=True] (alpha=0 hover_alpha=0.2)
Loading BokehJS ...
In [314]:
import holoviews as hv
import geoviews as gv
import datashader as ds
from colorcet import fire, rainbow, bgy, bjy, bkr, kb, kr, gwv
from datashader.colors import colormap_select, Greys9
from holoviews.streams import RangeXY
from holoviews.operation.datashader import datashade, dynspread, rasterize
from bokeh.io import push_notebook, show, output_notebook
output_notebook()
hv.extension('bokeh')

%opts Overlay [width=800 height=600 toolbar='above' xaxis=None yaxis=None]
%opts QuadMesh [tools=['hover'] colorbar=True] (alpha=0 hover_alpha=0.2)

T = 0.05
PX = 1

def plot_map(data, label, agg_data, agg_name, cmap):
    url="http://server.arcgisonline.com/ArcGIS/rest/services/Canvas/World_Dark_Gray_Base/MapServer/tile/{Z}/{Y}/{X}.png"
    geomap = gv.WMTS(url)
    points = hv.Points(gv.Dataset(data, kdims=['x', 'y'], vdims=[agg_name]))
    agg = datashade(points, element_type=gv.Image, aggregator=agg_data, cmap=cmap)
    zip_codes = dynspread(agg, threshold=T, max_px=PX)
    hover = hv.util.Dynamic(rasterize(points, aggregator=agg_data, width=50, height=25, streams=[RangeXY]), operation=hv.QuadMesh)
    hover = hover.options(cmap=cmap)
    img = geomap * zip_codes * hover
    img = img.relabel(label)
    return img
Loading BokehJS ...
In [311]:
brazil.head(2)
agg_name
Out[311]:
'geolocation_zip_code_prefix'
In [315]:
plot_map(brazil, 'Zip Codes in Brazil', ds.min(agg_name), agg_name, cmap = gwv)
WARNING:param.OverlayPlot103663: No plotting class for WMTS type and bokeh backend found. 
Out[315]:
In [295]:
import datashader as ds
from datashader import transfer_functions as tf
from functools import partial
from datashader.utils import export_image
from IPython.core.display import HTML, display
from colorcet import fire, rainbow, bgy, bjy, bkr, kb, kr

background = "black"
cm = partial(colormap_select, reverse=(background!="black"))
export = partial(export_image, background = background, export_path="export")
display(HTML("<style>.container { width:100% !important; }</style>"))
W = 700 

def create_map(data, cmap, data_agg, export_name='img'):
    pad = (data.x.max() - data.x.min())/50
    x_range, y_range = ((data.x.min() - pad, data.x.max() + pad), 
                             (data.y.min() - pad, data.y.max() + pad))

    ratio = (y_range[1] - y_range[0]) / (x_range[1] - x_range[0])

    plot_width  = int(W)
    plot_height = int(plot_width * ratio)
    if ratio > 1.5:
        plot_height = 550
        plot_width = int(plot_height / ratio)
        
    cvs = ds.Canvas(plot_width=plot_width, plot_height=plot_height, x_range=x_range, y_range=y_range)

    agg = cvs.points(data, 'x', 'y', data_agg)
    img = tf.shade(agg, cmap=cmap, how='eq_hist')
    return export(img, export_name)
In [316]:
create_map(brazil, gwv, ds.mean(agg_name),'brazil_zip_codes')
Out[316]:

Zip Codes in States

Lets look at the state of Sao Paulo (SP) to see how zip code prefixes works in a regional level. We see that: - zip code prefixes in Sao Paulo state ranges from 01001 to 19990 - zip codes starting with 0 are in the Sao Paulo meto region - zip codes starting with 1 are in the interior of the state
In [130]:
geo.head(2)
Out[130]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_3_digits geolocation_zip_code_prefix_4_digits x y
0 1037 -23.545621 -46.639292 sao paulo SP 0 1 10 103 -5.191862e+06 -2.698137e+06
1 1046 -23.546081 -46.644820 sao paulo SP 0 1 10 104 -5.192478e+06 -2.698193e+06
In [128]:
def filter_data(level, name):
    df = geo[geo[level] == name]
    #remove outliers
    df = df[(df.x <= df.x.quantile(0.999)) & (df.x >= df.x.quantile(0.001))]
    df = df[(df.y <= df.y.quantile(0.999)) & (df.y >= df.y.quantile(0.001))]
    return df
In [135]:
sp = filter_data('geolocation_state', 'SP')
agg_name = 'geolocation_zip_code_prefix'
sp[agg_name].describe().to_frame()
Out[135]:
geolocation_zip_code_prefix
count 402651.000000
mean 9030.831782
std 5101.941134
min 1001.000000
25% 4661.000000
50% 8257.000000
75% 13330.000000
max 19990.000000
In [136]:
plot_map(sp, 'Zip Codes in Sao Paulo State', ds.min(agg_name), agg_name, cmap = rainbow)
WARNING:param.OverlayPlot69111: No plotting class for WMTS type and bokeh backend found. 
Out[136]:
In [137]:
create_map(sp, rainbow, ds.mean(agg_name), 'sp_zip_codes')
Out[137]:

Zip Codes in Large Cities

Lets look at the city of Sao Paulo to see how zip code prefixes works in a city level. We see that: - zip code prefixes in Sao Paulo city ranges from 01001 to 09540 - zip code prefixes are somehow related to neighborhoods or city districts
In [158]:
geo.head(3)
Out[158]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_3_digits geolocation_zip_code_prefix_4_digits x y
0 1037 -23.545621 -46.639292 sao paulo SP 0 1 10 103 -5.191862e+06 -2.698137e+06
1 1046 -23.546081 -46.644820 sao paulo SP 0 1 10 104 -5.192478e+06 -2.698193e+06
2 1046 -23.546129 -46.642951 sao paulo SP 0 1 10 104 -5.192270e+06 -2.698199e+06
In [159]:
saopaulo = filter_data('geolocation_city', 'sao paulo')
agg_name = 'geolocation_zip_code_prefix'
saopaulo[agg_name].describe().to_frame()
Out[159]:
geolocation_zip_code_prefix
count 135256.000000
mean 4137.454996
std 1804.189554
min 1001.000000
25% 2941.000000
50% 4166.000000
75% 5158.000000
max 9540.000000
In [163]:
plot_map(saopaulo, 'Zip Codes in Sao Paulo City', ds.min(agg_name), agg_name, cmap = rainbow)
WARNING:param.OverlayPlot71685: No plotting class for WMTS type and bokeh backend found. 
Out[163]:
In [161]:
create_map(saopaulo, rainbow, ds.mean(agg_name), 'sao_paulo_zip_codes')
Out[161]:

Zip Codes in Small Cities

Lets look at the city of Atibaia to see how zip code prefixes works in a city level. We see that: - zip code prefix of Atibaia city is between 12940 and 12954 - but there are other neighbor cities with the same zip code prefix - to have more detail and go down to a city level we would probably need more zip code digits (the 4th and 5th digit)
In [164]:
atibaia = geo[geo['geolocation_city']=='atibaia']
agg_name = 'geolocation_zip_code_prefix'
atibaia[agg_name].describe().to_frame()
Out[164]:
geolocation_zip_code_prefix
count 1254.000000
mean 12945.042265
std 4.197704
min 12940.000000
25% 12942.000000
50% 12944.000000
75% 12948.000000
max 12954.000000
In [165]:
plot_map(atibaia, 'Zip Codes in Atibaia', ds.min(agg_name), agg_name, cmap = rainbow)
WARNING:param.OverlayPlot72543: No plotting class for WMTS type and bokeh backend found. 
Out[165]:
In [166]:
create_map(atibaia, rainbow, ds.mean(agg_name), 'atibaia_zip_codes')
Out[166]:

Abrangence of zip code digits

What does every digit in the zip code means? Lets zee how it behave, from 1 to 5 digits. Zip codes starting with 2 are all from Rio de Janeiro (RJ) and Espírito Santo (ES) States
In [167]:
# Zip code: 2
df = filter_data('geolocation_zip_code_prefix_1_digits', 2)
create_map(df, cm(Greys9), ds.count(), 'zip_code_2')
Out[167]:
Zip codes starting with 22 are all within the Rio de Janeiro City, but they represent multiple neighborhoods At the from right to left you see neighborhoods such as Leme, Copacabana, Ipanema, Leblon and Barra da Tijuca.
In [168]:
# Zip code: 22
df = filter_data('geolocation_zip_code_prefix_2_digits', 22)
create_map(df, cm(Greys9), ds.count(), 'zip_code_22')
Out[168]:
Zip codes starting with 220 are all within Copacabana, a neighborhood of Rio de Janeiro
In [169]:
# Zip code: 220
df = filter_data('geolocation_zip_code_prefix_3_digits', 220)
create_map(df, cm(Greys9), ds.count(), 'zip_code_220')
Out[169]:
Zip codes starting with 2201 represent some streets of Copacabana
In [170]:
# Zip code: 2201
df = filter_data('geolocation_zip_code_prefix_4_digits', 2201)
create_map(df, cm(Greys9), ds.count(), 'zip_code_2201')
Out[170]:
Zip codes starting with 22010 represent a more strict set of streets of Copacabana
In [171]:
# Zip code: 22010
df = filter_data('geolocation_zip_code_prefix', 22010)
create_map(df, cm(Greys9), ds.count(), 'zip_code_22010')
Out[171]:
So, now it should be clear to you that as we increase the quantity of numbers on zip codes we are selecting subsets of data.

Where does most revenue comes from?

Working with 5 digits of zip codes might lead us to very small samples in some areas, so we will work with 3 digits. Plotting the sum of products value grouped by zip code prefix we see that most of the revenue came from the Southeast and South regions of Brazil. It is also possible to see that large cities and capitals, where population is bigger, have larger participation on revenue.
In [198]:
orders_df = pd.read_csv('olist_orders_dataset.csv')
order_items = pd.read_csv('olist_order_items_dataset.csv')
order_reviews = pd.read_csv('olist_order_reviews_dataset.csv')
customer = pd.read_csv('olist_customers_dataset.csv', dtype={'customer_zip_code_prefix': str})

# getting the first 3 digits of customer zipcode
customer['customer_zip_code_prefix_3_digits'] = customer['customer_zip_code_prefix'].str[0:3]
customer['customer_zip_code_prefix_3_digits'] = customer['customer_zip_code_prefix_3_digits'].astype(int)

brazil_geo = geo.set_index('geolocation_zip_code_prefix_3_digits').copy()
In [211]:
# merging the data
orders = orders_df.merge(order_items, on='order_id')
orders = orders.merge(customer, on='customer_id')
orders = orders.merge(order_reviews, on='order_id')
In [210]:
gp = orders.groupby('customer_zip_code_prefix_3_digits')['price'].sum().to_frame()
revenue = brazil_geo.join(gp)
revenue
Out[210]:
geolocation_zip_code_prefix geolocation_lat geolocation_lng geolocation_city geolocation_state geolocation_zip_code_prefix_1_digits geolocation_zip_code_prefix_2_digits geolocation_zip_code_prefix_4_digits x y price
10 1037 -23.545621 -46.639292 sao paulo SP 0 1 103 -5.191862e+06 -2.698137e+06 24073.51
10 1046 -23.546081 -46.644820 sao paulo SP 0 1 104 -5.192478e+06 -2.698193e+06 24073.51
10 1046 -23.546129 -46.642951 sao paulo SP 0 1 104 -5.192270e+06 -2.698199e+06 24073.51
10 1041 -23.544392 -46.639499 sao paulo SP 0 1 104 -5.191885e+06 -2.697988e+06 24073.51
10 1035 -23.541578 -46.641607 sao paulo SP 0 1 103 -5.192120e+06 -2.697646e+06 24073.51
... ... ... ... ... ... ... ... ... ... ... ...
999 99950 -28.068639 -52.010705 tapejara RS 9 99 9995 -5.789805e+06 -3.257630e+06 4186.04
999 99900 -27.877125 -52.224882 getulio vargas RS 9 99 9990 -5.813647e+06 -3.233491e+06 4186.04
999 99950 -28.071855 -52.014716 tapejara RS 9 99 9995 -5.790252e+06 -3.258036e+06 4186.04
999 99980 -28.388932 -51.846871 david canabarro RS 9 99 9998 -5.771567e+06 -3.298098e+06 4186.04
999 99950 -28.070104 -52.018658 tapejara RS 9 99 9995 -5.790690e+06 -3.257815e+06 4186.04

1000121 rows × 11 columns

In [213]:
gp = orders.groupby('customer_zip_code_prefix_3_digits')['price'].sum().to_frame()
revenue = brazil_geo.join(gp)
agg_name = 'revenue'
revenue[agg_name] = revenue.price / 1000
In [214]:
plot_map(revenue, 'Orders Revenue (thousands R$)', ds.mean(agg_name), agg_name, cmap=fire)
WARNING:param.OverlayPlot73401: No plotting class for WMTS type and bokeh backend found. 
Out[214]:

What is the Average Ticket?

Here we see something somehow unexpected. Customers of the south and southeast regions of Brazil have lower average ticket, than their peers on north and norteast. This might happen because they have to pay more for freight (let's check that in a moment)
In [218]:
gp = orders.groupby('order_id').agg({'price': 'sum', 'customer_zip_code_prefix_3_digits': 'max'})
gp = gp.groupby('customer_zip_code_prefix_3_digits')['price'].mean().to_frame()
avg_ticket = brazil_geo.join(gp)
agg_name = 'avg_ticket'
avg_ticket[agg_name] = avg_ticket.price
In [216]:
plot_map(avg_ticket, 'Orders Average Ticket (R$)', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot74259: No plotting class for WMTS type and bokeh backend found. 
Out[216]:
In [217]:
create_map(avg_ticket, bgy, ds.mean('avg_ticket'), 'avg_ticket_brazil')
Out[217]:

Who pays more for transportation?

We might find a freight ratio by dividing the freight value by the order value. This ratio indicates the percentage of the product price that a person had to pay just to get their order delivered. For example, if a product costs R$50.00 and the freight value was R$10.00, then the freight ratio is 0.2 or 20%. Higher freight ratios are very likely to discourage customers to complete a purchase. Due to logistics costs, we expect to see lower freight ratios in densely populated areas and are higher freight ratios on sparsely poulated regions.
In [220]:
gp = orders.groupby('order_id').agg({'price': 'sum', 'freight_value': 'sum', 'customer_zip_code_prefix_3_digits': 'max'})
agg_name = 'freight_ratio'
gp[agg_name] = gp.freight_value / gp.price
gp = gp.groupby('customer_zip_code_prefix_3_digits')[agg_name].mean().to_frame()
freight_ratio = brazil_geo.join(gp)
In [221]:
plot_map(freight_ratio, 'Orders Average Freight Ratio', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot75117: No plotting class for WMTS type and bokeh backend found. 
Out[221]:
In [222]:
create_map(freight_ratio, bgy, ds.mean('freight_ratio'), 'freight_ratio_brazil')
Out[222]:

Average Delivery Time

Unfortunately, who lives in the north and northeast of Brazil has to bear with higher freight costs and has to wait longer to receive their purchase.
In [223]:
orders['order_delivered_customer_date'] = pd.to_datetime(orders.order_delivered_customer_date)
orders['order_estimated_delivery_date'] = pd.to_datetime(orders.order_estimated_delivery_date)
orders['order_delivered_carrier_date'] = pd.to_datetime(orders.order_delivered_carrier_date)
orders['actual_delivery_time'] = orders.order_delivered_customer_date - orders.order_delivered_carrier_date
orders['actual_delivery_time'] = orders['actual_delivery_time'].dt.days
In [224]:
gp = orders.groupby('customer_zip_code_prefix_3_digits')['actual_delivery_time'].mean().to_frame()
delivery_time = brazil_geo.join(gp)
agg_name = 'avg_delivery_time'
delivery_time[agg_name] = delivery_time['actual_delivery_time']
In [225]:
plot_map(delivery_time, 'Orders Average Delivery Time (days)', ds.mean(agg_name), agg_name, cmap=bjy)
WARNING:param.OverlayPlot75975: No plotting class for WMTS type and bokeh backend found. 
Out[225]:
In [226]:
create_map(delivery_time, bjy, ds.mean(agg_name), 'avg_delivery_time_brazil')
Out[226]:
In [229]:
pr = filter_data('geolocation_state', 'PR').set_index('geolocation_zip_code_prefix_3_digits')
gp = orders.groupby('customer_zip_code_prefix_3_digits')['actual_delivery_time'].mean().to_frame()
pr_delivery_time = pr.join(gp)
pr_delivery_time[agg_name] = pr_delivery_time['actual_delivery_time']
Lets look to delivery times at a state level. We filtered only orders from Parana state (PR). It is possible to see that larger cities have lower average delivery times than the smaller ones.
In [237]:
plot_map(pr_delivery_time, 'Orders Average Delivery Time in Parana State (days)', ds.mean(agg_name), agg_name, cmap=bjy)
WARNING:param.OverlayPlot81123: No plotting class for WMTS type and bokeh backend found. 
Out[237]:
In [238]:
create_map(pr_delivery_time, bjy, ds.mean(agg_name), 'avg_delivery_time_pr')
Out[238]:

Interesting Point About Brazilian Suburbs

Unlike other countries, in Brazil the richest areas usualy are near downtow and suburbs are know for poverty and high violence rates. Lets explore that in Rio the Janeiro.
In [281]:
riodejaneiro = filter_data('geolocation_city', 'rio de janeiro').set_index('geolocation_zip_code_prefix_3_digits')
gp = orders.groupby('customer_zip_code_prefix_3_digits')['actual_delivery_time'].mean().to_frame()
rj_delivery_time = riodejaneiro.join(gp)
rj_delivery_time[agg_name] = rj_delivery_time['actual_delivery_time']
In [282]:
plot_map(rj_delivery_time, 'Orders Average Delivery Time in Rio de Janeiro (days)', ds.mean(agg_name), agg_name, cmap=bjy)
WARNING:param.OverlayPlot92297: No plotting class for WMTS type and bokeh backend found. 
Out[282]:
It turns out that if you live in rich neighborhoods such as Downtown, Botafogo, Copacabana and Flamengo you are likey to receive your order five days earlier than someone who lives in a poor neighborhood such as Cidade de Deus or Bangu. We see the same pattern in Sao Paulo and Porto Alegre, customers near downtown receive their orders faster than those who lives on suburbs.
In [244]:
saopaulo = filter_data('geolocation_city', 'sao paulo').set_index('geolocation_zip_code_prefix_3_digits')
gp = orders.groupby('customer_zip_code_prefix_3_digits')['actual_delivery_time'].mean().to_frame()
sp_delivery_time = saopaulo.join(gp)
sp_delivery_time[agg_name] = sp_delivery_time['actual_delivery_time']
In [245]:
plot_map(sp_delivery_time, 'Orders Average Delivery Time in Sao Paulo (days)', ds.mean(agg_name), agg_name, cmap=bjy)
WARNING:param.OverlayPlot83697: No plotting class for WMTS type and bokeh backend found. 
Out[245]:
In [246]:
create_map(sp_delivery_time, bjy, ds.mean(agg_name), 'sao_paulo_avg_delivery_time')
Out[246]:
In [247]:
poa = filter_data('geolocation_city', 'porto alegre').set_index('geolocation_zip_code_prefix_3_digits')
gp = orders.groupby('customer_zip_code_prefix_3_digits')['actual_delivery_time'].mean().to_frame()
poa_delivery_time = poa.join(gp)
poa_delivery_time[agg_name] = poa_delivery_time['actual_delivery_time']
In [248]:
plot_map(poa_delivery_time, 'Orders Average Delivery Time in Porto Alegre (days)', ds.mean(agg_name), agg_name, cmap=bjy)
WARNING:param.OverlayPlot84555: No plotting class for WMTS type and bokeh backend found. 
Out[248]:
In [249]:
create_map(poa_delivery_time, bjy, ds.mean(agg_name), 'poa_avg_delivery_time')
Out[249]:

Orders Average Review Score

Customers of Rio de Janeiro State and Northeast Region are more likely to give low scores on purchases.
In [254]:
gp = orders.groupby('customer_zip_code_prefix_3_digits')['review_score'].mean().to_frame()
score = brazil_geo.join(gp)
agg_name = 'avg_score'
score[agg_name] = score['review_score']
In [255]:
plot_map(score, 'Orders Average Review Score', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot86271: No plotting class for WMTS type and bokeh backend found. 
Out[255]:
In [256]:
create_map(score, bgy, ds.mean(agg_name), 'avg_review_score_brazil')
Out[256]:
Let's look at the average review score in Rio de Janeiro? Do you see any relationship with average delivery time?
In [258]:
rj_score = riodejaneiro.join(gp)
rj_score[agg_name] = rj_score['review_score']
In [260]:
plot_map(rj_score, 'Orders Average Review Score in Rio de Janeiro', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot87987: No plotting class for WMTS type and bokeh backend found. 
Out[260]:
In [262]:
create_map(rj_score, bgy, ds.mean(agg_name), 'rio_de_janeiro_score')
Out[262]:

Delayed Orders

Lets measure the percentage of delayed orders and see how that relate to the score given by customers?
In [263]:
orders['is_delayed'] = orders['order_delivered_customer_date'] > orders['order_estimated_delivery_date'] 
gp = orders.groupby('customer_zip_code_prefix_3_digits').agg({'is_delayed': ['sum', 'count']})
agg_name = 'delayed'
gp[agg_name] = gp['is_delayed']['sum'] / gp['is_delayed']['count']
gp = gp[agg_name]
order_delay = brazil_geo.join(gp)
In [264]:
plot_map(order_delay, 'Orders Delay Percentage in Brazil', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot88845: No plotting class for WMTS type and bokeh backend found. 
Out[264]:
In [265]:
create_map(order_delay, bgy, ds.mean(agg_name), 'brazil_order_delay_percentage')
Out[265]:
Looking again to Rio de Janeiro to view where order deliveries are more prone to be delayed.
In [266]:
rj_delay = riodejaneiro.join(gp)
In [267]:
plot_map(rj_delay, 'Orders Delay Percentage in Rio de Janeiro', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot89703: No plotting class for WMTS type and bokeh backend found. 
Out[267]:
In [268]:
create_map(rj_delay, bgy, ds.mean(agg_name), 'rio_de_janeiro_order_delay_percentage')
Out[268]:

Does customers from small towns buy more items on a single order

Lets check the average item quantity per order to see if customers that are further out of large cities buy in bulk.
In [270]:
orders['count'] = 1
gp = orders.groupby(['customer_zip_code_prefix_3_digits','order_id'], as_index=False)['count'].count()
gp = gp.groupby('customer_zip_code_prefix_3_digits')['count'].mean().to_frame()
avg_qty = brazil_geo.join(gp)
agg_name = 'avg_qty'
avg_qty[agg_name] = avg_qty['count']
In [272]:
plot_map(avg_qty, 'Average Item Quantity', ds.mean(agg_name), agg_name, cmap=bgy)
WARNING:param.OverlayPlot91142: No plotting class for WMTS type and bokeh backend found. 
Out[272]:
In [274]:
create_map(avg_qty, bgy, ds.mean(agg_name), 'avg_qty_items')
Out[274]:
Apparently that hypothesis is false. Looks like that customers from large cities buy more products per order. But there isn't a clear geographical pattern that would allow us to draw any further conclusion.

Work in progress...

Work in progress... To do: Which categories are most sold. Wich payment method was chosen. How many installments. Analysis on specific cities, such as Sao Paulo, Porto Alegre, Curitiba, Fortaleza, Bahia, Brasilia. Any sugestion?