In this document we will learn about the built-in data visualization capabilities of pandas and explore plotly and cufflinks, which are also great for visualizing data in Python.

Let's take a look!

Imports¶

import numpy as np
import pandas as pd
%matplotlib inline

The Data¶

I have created some fake dataframes named df1 and df2 for data visualization. Let's import the data.

df1 = pd.read_csv('df1',index_col=0)
df2 = pd.read_csv('df2')

Let's look at the first few rows of each of these sample dataframes to get a sense of what the data looks like. All the columns of df1 are numerical, and its rows form a time series. All the columns of df2 are numerical as well.

df1.head()

df2.head()

Style Sheets¶

Matplotlib has style sheets you can use to make your plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and more. They create a set of style rules that your plots follow. I recommend using them: they give all your plots the same look and make them feel more professional. Here is how to use them.

Before plt.style.use() your plots look like this:

df1['A'].hist(histtype='bar', ec='black')

<matplotlib.axes._subplots.AxesSubplot at 0x114968780>

The above plot shows a histogram. Histograms are used to see the distribution of numerical variables. The column 'A' from df1 is used for creating this histogram. In other words the plot shows the distribution of column A.
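To make "distribution" concrete, here is a minimal sketch of the counting a histogram does. It uses synthetic normal data as a stand-in for df1['A'] (an assumption, since the CSV file is not included here) and the default of 10 bins.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df1['A'] (assumption: the real file is not available here)
values = pd.Series(np.random.default_rng(1).normal(size=1000), name="A")

# A histogram splits the value range into equal-width bins and counts the values in each
counts, edges = np.histogram(values, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {n}")
```

Each printed row corresponds to one bar of the histogram: the bin range sets where the bar sits on the x axis and the count sets its height.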

Use different styles:

import matplotlib.pyplot as plt
plt.style.use('ggplot')

Now your plots look like this:

df1['A'].hist(color = 'orange',histtype='bar', ec='black')

<matplotlib.axes._subplots.AxesSubplot at 0x115ab3fd0>

plt.style.use('bmh')
df1['A'].hist(histtype='bar', ec='black')

<matplotlib.axes._subplots.AxesSubplot at 0x114bdd6a0>

plt.style.use('dark_background')
df1['A'].hist(histtype='bar', ec='black')

<matplotlib.axes._subplots.AxesSubplot at 0x11565f208>

plt.style.use('fivethirtyeight')
df1['A'].hist(histtype='bar', ec='black',color='pink')

<matplotlib.axes._subplots.AxesSubplot at 0x115cbbdd8>

plt.style.use('ggplot')
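If you are not sure which style names exist, or you want a style for a single plot only, something like the following sketch may help. It relies on plt.style.available and plt.style.context from matplotlib; the data is a synthetic stand-in for df1['A'].

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a plain script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for df1['A'] (assumption)
values = pd.Series(np.random.default_rng(0).normal(size=500), name="A")

# Names that plt.style.use() accepts
print(sorted(plt.style.available))

# Apply a style to one plot only; the previous style is restored after the with-block
with plt.style.context("bmh"):
    ax = values.hist(histtype="bar", ec="black")
    ax.set_title("bmh style, this plot only")
```

Unlike plt.style.use(), the context manager does not change the style of plots created later.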

Let's stick with the ggplot style and see how to use pandas' built-in plotting capabilities!

Plot Types¶

There are several plot types built-in to pandas, most of them statistical plots by nature:

df.plot.area
df.plot.barh
df.plot.density
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.bar
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.pie

You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box', 'barh', etc.).
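As a quick illustration of the two spellings, the sketch below draws the same box plot both ways on a small random dataframe (made-up data, not df1 or df2).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame(np.random.default_rng(2).random((20, 3)), columns=list("abc"))

ax_accessor = df.plot.box()      # accessor style
ax_kind = df.plot(kind="box")    # same plot, selected through the kind argument

# Both calls return a matplotlib Axes holding the same drawn elements
print(len(ax_accessor.lines), len(ax_kind.lines))
```

Which spelling to use is mostly a matter of taste; the accessor style tends to work better with tab completion.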

Let's start going through them!

Area¶

df2.plot.area(alpha=0.4)

<matplotlib.axes._subplots.AxesSubplot at 0x115d87a20>

You can use an area plot to view many numerical variables at once. The x axis is the row number (there are only 9 rows in the dataframe) and the y axis is the actual value for the respective column (a,b,c,d in this case).

Barplots¶

df2.head()

df2.plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x11610dbe0>

Bar plots are also a good way to visualize your data. This is a side-by-side bar plot where, for each row, all the column values are plotted.

df2.plot.bar(stacked=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1178575f8>

You can create a stacked bar plot as well.
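In a stacked bar plot the segments of one row sit on top of each other, so the top of each stack is the row total. The sketch below checks this on made-up non-negative data (an assumption; df2 itself is not loaded here).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Made-up non-negative data with the same column names as df2
df = pd.DataFrame(np.random.default_rng(3).random((5, 4)), columns=list("abcd"))

ax = df.plot.bar(stacked=True)

# ax.containers holds one group of bars per column; the last group is the top segment
top_segment = ax.containers[-1]
stack_tops = [bar.get_y() + bar.get_height() for bar in top_segment]

# Compare the top of each stack with the sum of that row
print(np.allclose(stack_tops, df.sum(axis=1)))
```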

Histograms¶

df1['A'].plot.hist(bins=50,histtype='bar', ec='white',color = 'brown')

<matplotlib.axes._subplots.AxesSubplot at 0x116e32c88>

We have already discussed histograms above. You can also specify the number of bins with the bins parameter. The ec parameter sets the color of the bar edges, and histtype='bar' selects the traditional bar-style histogram.

Line Plots¶

df1.plot.line(y='B',figsize=(12,3),lw=1,color='brown') # the index is used as x by default

<matplotlib.axes._subplots.AxesSubplot at 0x11706d9e8>

The above plot shows a timeseries plot. The x axis is the time series (index of df1) and y axis is a numerical variable (B variable of df1). The figsize argument is used to specify the size of your figure and lw is the line width. You can also specify color using the color argument.
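Here is a self-contained sketch of the same kind of plot on a synthetic time series. The dates and values are made up; the point is that pandas puts the index on the x axis when no x is given.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Synthetic daily time series (assumption: a stand-in for df1)
idx = pd.date_range("2000-01-01", periods=100, freq="D")
df = pd.DataFrame(np.random.default_rng(4).normal(size=(100, 2)),
                  index=idx, columns=["A", "B"])

# No x argument: the DatetimeIndex becomes the x axis
ax = df.plot.line(y="B", figsize=(12, 3), lw=1, color="brown")

line = ax.get_lines()[0]
print(line.get_linewidth(), len(line.get_ydata()))
```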

Scatter Plots¶

df1.plot.scatter(x='A',y='B',color = 'orange')

<matplotlib.axes._subplots.AxesSubplot at 0x117804390>

The above plot shows a scatterplot. Scatterplots are used to investigate the relationship between two numerical variables.

df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')

<matplotlib.axes._subplots.AxesSubplot at 0x126f7b400>

This plot visualizes three variables. A and B are on the x and y axes respectively, and each point is colored by its value of C through the c parameter; cmap selects the colormap. With 'coolwarm', low values of C are drawn in blue and high values in red.
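The following sketch, on made-up data, shows that the values handed to the colormap are simply the values of the column named in c.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Made-up data with the same column names as df1
df = pd.DataFrame(np.random.default_rng(5).normal(size=(200, 3)),
                  columns=["A", "B", "C"])

ax = df.plot.scatter(x="A", y="B", c="C", cmap="coolwarm")

# The scatter points are the first collection on the Axes;
# get_array() returns the numbers that the colormap turns into colors
points = ax.collections[0]
print(np.allclose(points.get_array(), df["C"]))
```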

df1.plot.scatter(x='A',y='B',s=df1['C']*80)

/anaconda/lib/python3.6/site-packages/matplotlib/collections.py:877: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor

<matplotlib.axes._subplots.AxesSubplot at 0x1175a5898>

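Instead of color, as in the example above, you can use s to set the size of each point from another column. In the pandas version used here, s needs to be an array of values, not just the name of a column. The point is larger when the value of C is higher, and smaller when it is lower. The RuntimeWarning above appears because C contains negative values, which produce negative marker sizes.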
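Marker sizes must not be negative, which is why the call above triggers a RuntimeWarning when C has negative values. One possible workaround, sketched here on made-up data, is to rescale the column to a positive range first; the range of 10 to 210 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Made-up data; C contains negative values, like df1['C']
df = pd.DataFrame(np.random.default_rng(6).normal(size=(200, 3)),
                  columns=["A", "B", "C"])

# Min-max rescale C to the range 10..210 so every size is positive
c = df["C"]
sizes = ((c - c.min()) / (c.max() - c.min()) * 200 + 10).to_numpy()

ax = df.plot.scatter(x="A", y="B", s=sizes)
print(sizes.min(), sizes.max())
```

Rescaling keeps the ordering of C, so larger values still get larger points.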

BoxPlots¶

df2.plot.box() # Can also pass a by= argument for groupby

<matplotlib.axes._subplots.AxesSubplot at 0x117b58c50>

Boxplots are used to see the distribution of numerical variables. In this plot we can see the distribution of each of the 4 variables of the df2 dataframe. This is convenient: rather than plotting 4 histograms, one per variable, you can plot a single boxplot that shows the distribution of every variable at a glance. The box represents the middle 50% of the values of the variable, and the yellow line is the median.
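The numbers behind each box can be computed directly with pandas. This sketch uses made-up data with the same column names as df2; the box spans the first to the third quartile, and the line inside it sits at the median.

```python
import numpy as np
import pandas as pd

# Made-up data with the same column names as df2
df = pd.DataFrame(np.random.default_rng(7).random((50, 4)), columns=list("abcd"))

# Quartiles per column: 25% (bottom of box), 50% (median line), 75% (top of box)
quartiles = df.quantile([0.25, 0.5, 0.75])

# Height of each box: the interquartile range, which holds the middle 50% of values
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(iqr)
```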

Hexagonal Bin Plot¶

Useful for Bivariate Data, alternative to scatterplot:

df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])

df.plot.hexbin(x='a', y='b', gridsize=25,cmap='Oranges',figsize=(5,5))

<matplotlib.axes._subplots.AxesSubplot at 0x1181fe2e8>

Suppose you have a lot of data and want to plot two variables to investigate their relationship. With that many points, a scatterplot has many values overlapping each other, which makes it hard to read. Instead of plotting each data point, we can use a hexagonal bin plot, which shows how many data points fall in each hexagonal region of the x-y plane.
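To see the counting at work, the sketch below draws a hexbin plot of 10,000 made-up points and reads the per-hexagon counts back from the plot. Reading them through ax.collections[0] is an assumption about how the plot is stored on the Axes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Many points, so a scatterplot would be heavily overplotted
df = pd.DataFrame(np.random.default_rng(8).normal(size=(10000, 2)),
                  columns=["a", "b"])

ax = df.plot.hexbin(x="a", y="b", gridsize=25, cmap="Oranges", figsize=(5, 5))

# Each hexagon carries the number of points that fell inside it
counts = ax.collections[0].get_array()
print(int(counts.max()))
```

A larger gridsize gives smaller hexagons and therefore lower counts per hexagon.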

Kernel Density Estimation plot (KDE)¶

df2['a'].plot.kde()

<matplotlib.axes._subplots.AxesSubplot at 0x1276d9160>

The KDE plot is used to see the overall distribution of a numerical variable.
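pandas relies on SciPy for this plot, so the curve can be reproduced by hand with scipy.stats.gaussian_kde. The sketch below does that on a made-up sample and checks that the area under the curve is close to 1, as expected for a density.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Made-up sample standing in for df2['a'] (assumption)
sample = np.random.default_rng(9).normal(size=500)

# Fit the kernel density estimate and evaluate it on a grid
kde = gaussian_kde(sample)
grid = np.linspace(-6, 6, 1201)
density = kde(grid)

# Approximate the area under the curve with a simple Riemann sum
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```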

df2.plot.density()

<matplotlib.axes._subplots.AxesSubplot at 0x1183ba710>

This method of plotting is a lot easier to use than full-on matplotlib; it balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of the underlying matplotlib plt. call.
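As a small example of that balance, the sketch below passes matplotlib line arguments straight through a pandas call and then keeps working with the returned matplotlib Axes. The data, title and axis label are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; omit in a notebook
import numpy as np
import pandas as pd

# Made-up data: two random walks
df = pd.DataFrame(np.random.default_rng(10).normal(size=(100, 2)).cumsum(axis=0),
                  columns=["A", "B"])

# lw and alpha are matplotlib line arguments passed through by pandas
ax = df.plot.line(lw=1, alpha=0.8, figsize=(8, 3))

# The return value is a regular matplotlib Axes, so the usual methods apply
ax.set_title("Two random walks")
ax.set_xlabel("step")
```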

For more examples, check out the official documentation page for the pandas built-in data visualization.

Plotly and Cufflinks¶

Plotly is a library that allows you to create interactive plots that you can use in dashboards or websites (you can save them as html files or static images).

Installation¶

In order for this all to work, you'll need to install plotly, plus cufflinks to call plots directly off of a pandas dataframe. At the time of writing, these libraries were not available through conda but were available through pip. Install the libraries at your command line/terminal using:

pip install plotly
pip install cufflinks

NOTE: If you have more than one installation of Python on your computer, make sure pip installs into the one your notebook uses, otherwise the imports may not work.

Imports and Set-up¶

import pandas as pd
import numpy as np
%matplotlib inline

from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

print(__version__) # requires version >= 1.9.0

1.12.9

import cufflinks as cf

# For Notebooks
init_notebook_mode(connected=True)

# For offline use
cf.go_offline()

Create Some Fake Data¶

df_p = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())

df_p.head() # view the first few rows

df2_p = pd.DataFrame({'Category':['A','B','C'],'Values':[32,43,50]})

df2_p.head()

Using Cufflinks and iplot()¶

Cufflinks adds an iplot() method to pandas dataframes. Its kind argument accepts, among others:

scatter
bar
box
spread
ratio
heatmap
surface
histogram
bubble

Scatter¶

df_p.iplot(kind='scatter',x='A',y='B',mode='markers',size=10)

Bar Plots¶

df2_p.iplot(kind='bar',x='Category',y='Values')

Boxplots¶

df_p.iplot(kind='box')

3d Plots¶

df3d = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,20,10],'z':[5,4,3,2,1]})
df3d.iplot(kind='surface',colorscale='rdylbu')

Spread¶

df_p[['A','B']].iplot(kind='spread')

Histogram¶

df_p['A'].iplot(kind='hist',bins=25)

Bubble plot¶

df_p.iplot(kind='bubble',x='A',y='B',size='C')

scatter_matrix()¶

df_p.scatter_matrix()

I highly recommend visiting the official documentation pages for both Plotly and Cufflinks for further exploration.

Plotly

Cufflinks

Output of df1.head():

	A	B	C	D
2000-01-01	1.339091	-0.163643	-0.646443	1.041233
2000-01-02	-0.774984	0.137034	-0.882716	-2.253382
2000-01-03	-0.921037	-0.482943	-0.417100	0.478638
2000-01-04	-1.738808	-0.072973	0.056517	0.015085
2000-01-05	-0.905980	1.778576	0.381918	0.291436

Output of df2.head():

	a	b	c	d
0	0.039762	0.218517	0.103423	0.957904
1	0.937288	0.041567	0.899125	0.977680
2	0.780504	0.008948	0.557808	0.797510
3	0.672717	0.247870	0.264071	0.444358
4	0.053829	0.520124	0.552264	0.190008

Output of df_p.head():

	A	B	C	D
0	0.528969	-0.144851	-0.438275	0.910590
1	0.207720	-1.249268	0.975464	1.449979
2	1.221294	-0.954854	2.250601	-0.899967
3	-0.700466	0.926348	1.461159	-0.460005
4	0.136282	-0.790601	-0.134462	1.077107

Output of df2_p.head():

	Category	Values
0	A	32
1	B	43
2	C	50