In this document we will learn about pandas' built-in capabilities for data visualization and explore plotly and cufflinks, which are also great for visualizing data in Python.
Let's take a look!
import numpy as np
import pandas as pd
%matplotlib inline
I have created some fake dataframes named df1 and df2 for data visualization. Let's import the data.
df1 = pd.read_csv('df1',index_col=0)
df2 = pd.read_csv('df2')
Let's look at the first few rows of each of these sample dataframes to get a sense of what the data looks like. All the columns of df1 are numerical, and its index is a time series. All the columns of df2 are numerical as well.
df1.head()
df2.head()
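The CSV files behind df1 and df2 are not included here. As a stand-in, the sketch below builds two dataframes with the general shape described above. The names df1_demo and df2_demo, the date range, the row counts, and the random values are all assumptions for illustration, not the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed stand-in for df1: numerical columns indexed by dates.
df1_demo = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000, freq="D"),
    columns=["A", "B", "C", "D"],
)

# Assumed stand-in for df2: a small table of non-negative numbers.
df2_demo = pd.DataFrame(rng.random((9, 4)), columns=["a", "b", "c", "d"])

print(df1_demo.head())
print(df2_demo.head())
```

Swapping df1_demo and df2_demo in for df1 and df2 should be enough to try the plotting calls that follow, although the resulting figures will differ from the originals.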
Matplotlib has style sheets you can use to make your plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and more. They basically create a set of style rules that your plots follow. I recommend using them: they give all your plots the same look and make them feel more professional. Here is how to use them.
Before plt.style.use() your plots look like this:
df1['A'].hist(histtype='bar', ec='black')
The above plot shows a histogram. Histograms are used to see the distribution of numerical variables. The column 'A' from df1 is used for creating this histogram. In other words the plot shows the distribution of column A.
Use different styles:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
Now your plots look like this:
df1['A'].hist(color = 'orange',histtype='bar', ec='black')
plt.style.use('bmh')
df1['A'].hist(histtype='bar', ec='black')
plt.style.use('dark_background')
df1['A'].hist(histtype='bar', ec='black')
plt.style.use('fivethirtyeight')
df1['A'].hist(histtype='bar', ec='black',color='pink')
plt.style.use('ggplot')
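To see which style names your matplotlib installation accepts, and to try a style without changing the global setting, something like the following sketch can help. The sample values are made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Names accepted by plt.style.use() in the installed matplotlib version.
available_styles = sorted(plt.style.available)
print(available_styles)

# Hypothetical sample data for illustration.
values = np.random.default_rng(0).standard_normal(500)

# Apply 'bmh' only inside this block; the global style is left untouched.
with plt.style.context("bmh"):
    fig, ax = plt.subplots()
    ax.hist(values, histtype="bar", ec="black")
    ax.set_title("Histogram drawn with the bmh style")
```

plt.style.use() changes the style for every later plot, while plt.style.context() is handy when only one figure should look different.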
Let's stick with the ggplot style and actually show you how to utilize pandas' built-in plotting capabilities!
There are several plot types built into pandas, most of them statistical plots by nature: area, bar, barh, box, density, hexbin, hist, kde, line, pie, and scatter. Each is available as a method, for example df.plot.area() or df.plot.hist().
You can also just call df.plot(kind='hist') or replace that kind argument with any of the key terms shown in the list above (e.g. 'box', 'barh', etc.).
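As a quick check that the two spellings are interchangeable, this sketch draws the same bar plot both ways on a small made-up dataframe (demo is a hypothetical name) and compares the number of bars drawn.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows, 3 numerical columns.
demo = pd.DataFrame(
    np.random.default_rng(0).random((10, 3)), columns=["a", "b", "c"]
)

fig, (left, right) = plt.subplots(1, 2, figsize=(10, 3))

# Accessor style and kind= style, drawn side by side.
demo.plot.bar(ax=left, title="demo.plot.bar()")
demo.plot(kind="bar", ax=right, title="demo.plot(kind='bar')")

# Each call draws one rectangle per cell of the dataframe.
n_left = len(left.patches)
n_right = len(right.patches)
print(n_left, n_right)
```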
Let's start going through them!
df2.plot.area(alpha=0.4)
You can use an area plot to view many numerical variables at once. The x axis is the row index (there are only 9 rows in the dataframe) and the y axis is the actual value for the respective column (a, b, c, d in this case). By default the areas are stacked on top of each other.
df2.head()
df2.plot.bar()
Bar plots are also a good way to visualize your data. This is a side-by-side bar plot where, for each row, all the column values are plotted.
df2.plot.bar(stacked=True)
You can create a stacked bar plot as well.
df1['A'].plot.hist(bins=50,histtype='bar', ec='white',color = 'brown')
We have already discussed histograms above. You can also specify the number of bins using the bins parameter. The ec parameter sets the edge color of the bars, and histtype='bar' (matplotlib's default) draws a conventional bar histogram.
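To make the effect of bins concrete, the sketch below uses numpy.histogram on made-up values. It is meant only as an illustration of equal-width binning, not as a description of everything pandas does before drawing.

```python
import numpy as np

# Hypothetical sample standing in for a numerical column.
values = np.random.default_rng(0).standard_normal(1000)

# bins=50 splits the range of the data into 50 equal-width intervals.
counts, edges = np.histogram(values, bins=50)

print(len(counts))   # 50 bar heights
print(len(edges))    # 51 bin edges (one more than the number of bins)
print(counts.sum())  # 1000, every value falls into exactly one bin
```

More bins show finer detail but make each bar noisier; fewer bins give a smoother but coarser picture.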
df1.plot.line(y='B',figsize=(12,3),lw=1,color='brown') # x defaults to the index of df1
The above plot shows a time-series plot. The x axis is time (the index of df1) and the y axis is a numerical variable (column B of df1). The figsize argument is used to specify the size of your figure and lw is the line width. You can also specify the color using the color argument.
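When no x is given, pandas plots against the dataframe's index. The sketch below shows this on a made-up date-indexed series (ts is a hypothetical name) and inspects the line that was drawn.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical random-walk series indexed by dates.
ts = pd.DataFrame(
    {"B": np.random.default_rng(0).standard_normal(200).cumsum()},
    index=pd.date_range("2000-01-01", periods=200, freq="D"),
)

# No x argument: the DatetimeIndex is used for the x axis.
ax = ts.plot.line(y="B", figsize=(12, 3), lw=1, color="brown")

line = ax.get_lines()[0]
print(len(line.get_ydata()), line.get_linewidth(), line.get_color())
```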
df1.plot.scatter(x='A',y='B',color = 'orange')
The above plot shows a scatterplot. Scatterplots are used to investigate the relationship between two numerical variables.
df1.plot.scatter(x='A',y='B',c='C',cmap='coolwarm')
This plot visualizes three variables. The A and B variables are on the x and y axes respectively, and each point is colored according to its value of C using the colormap given by cmap. With coolwarm, low values of C appear blue and high values appear red.
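The sketch below shows roughly how a colormap turns numbers into colors: values are first normalized to the range 0 to 1 and then looked up in the colormap. The three C values are made up, and the normalization shown is an assumed min-max scaling for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

# Hypothetical low, middle, and high values of column C.
c_values = np.array([-2.0, 0.0, 2.0])

# Scale the values linearly so the minimum maps to 0 and the maximum to 1.
norm = Normalize(vmin=c_values.min(), vmax=c_values.max())
cmap = plt.get_cmap("coolwarm")

# Each row is an (R, G, B, A) color for the corresponding value.
rgba = cmap(norm(c_values))
for value, color in zip(c_values, rgba):
    print(value, np.round(color, 2))
```

The lowest value has a larger blue component than red, and the highest value has a larger red component than blue, which is why coolwarm reads as "cold to hot".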
df1.plot.scatter(x='A',y='B',s=df1['C']*80)
Instead of color, as shown in the example above, you can use s to set the marker size based on another column. Here s is passed as an array of values (column C scaled by 80) rather than as a column name. The point is bigger if the value of C is higher and smaller if it is lower.
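Marker sizes are areas, so negative values are not meaningful. If the size column can be negative, one option is to rescale it to a positive range first. The sketch below uses a min-max rescaling to the range 20 to 200; the dataframe, the range, and the variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with a size column that contains negative values.
demo = pd.DataFrame(
    np.random.default_rng(0).standard_normal((100, 3)), columns=["A", "B", "C"]
)

# Min-max rescale C so the smallest value maps to 20 and the largest to 200.
c = demo["C"]
sizes = 20 + 180 * (c - c.min()) / (c.max() - c.min())

# Pass the sizes as an array, one entry per row.
ax = demo.plot.scatter(x="A", y="B", s=sizes.to_numpy())
```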
df2.plot.box() # Can also pass a by= argument for groupby
Boxplots are used to see the distribution of numerical variables. In this plot we can see the distribution of each of the 4 variables of the df2 dataframe. This is quite convenient: rather than plotting 4 separate histograms, you can plot one boxplot that shows the distribution of every variable at a glance. The box represents the middle 50% of the values of the variable and the line inside the box marks the median.
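The numbers behind each box can be computed directly. This sketch uses a made-up dataframe (demo is a hypothetical name) and pandas' quantile method to get the quartiles that bound the box and the median inside it.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the same column names as df2.
demo = pd.DataFrame(
    np.random.default_rng(0).random((50, 4)), columns=["a", "b", "c", "d"]
)

# 25th percentile (bottom of box), median, 75th percentile (top of box).
summary = demo.quantile([0.25, 0.5, 0.75])

# Interquartile range: the height of each box.
iqr = summary.loc[0.75] - summary.loc[0.25]

print(summary)
print(iqr)
```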
Useful for Bivariate Data, alternative to scatterplot:
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25,cmap='Oranges',figsize=(5,5))
Suppose you have a lot of data and you want to plot two variables to investigate any relationship between them. With that much data, many points in a scatterplot overlap each other, which makes the plot hard to read. Instead of plotting each data point, you can use a hexbin plot, which divides the x-y plane into hexagonal cells and colors each cell by how many data points fall inside it.
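To see the counts behind the colors, this sketch calls matplotlib's Axes.hexbin directly on made-up data and reads the per-hexagon counts back from the returned collection. The variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical bivariate sample, similar to the one plotted above.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
y = rng.standard_normal(1000)

fig, ax = plt.subplots(figsize=(5, 5))
hexes = ax.hexbin(x, y, gridsize=25, cmap="Oranges")

# One count per hexagon; together they account for the plotted points.
counts = hexes.get_array()
print(len(counts), int(counts.max()), int(counts.sum()))
```

A larger gridsize gives smaller hexagons and finer detail, at the cost of fewer points per cell.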
df2['a'].plot.kde()
The KDE plot is used to see the overall distribution of a numerical variable.
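A KDE places a small smooth bump on every data point and adds the bumps up. The sketch below is a hand-rolled Gaussian KDE in NumPy with a bandwidth chosen by Scott's rule, written only to illustrate the idea. pandas itself delegates the computation to SciPy, so this is not its implementation, and the sample values are made up.

```python
import numpy as np

# Hypothetical sample standing in for a numerical column.
values = np.random.default_rng(0).standard_normal(300)
n = values.size

# Scott's rule for one-dimensional data: std * n^(-1/5).
bandwidth = values.std(ddof=1) * n ** (-1 / 5)

# Evaluate the density on a grid that extends a little past the data.
grid = np.linspace(values.min() - 4 * bandwidth, values.max() + 4 * bandwidth, 512)

# Sum one Gaussian bump per data point, then normalize.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# The area under a density curve should be close to 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```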
df2.plot.density()
This method of plotting is a lot easier to use than full-on matplotlib, and it balances ease of use with control over the figure. Many of the plot calls also accept the additional keyword arguments of the underlying matplotlib plotting function.
For more examples, check out the official documentation page for pandas' built-in data visualization.
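As a small illustration of that balance, the pandas plotting methods return a matplotlib Axes object, so you can keep adjusting the figure with ordinary matplotlib calls. The dataframe, title, and label below are made up for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
demo = pd.DataFrame(
    np.random.default_rng(0).standard_normal((200, 2)), columns=["a", "b"]
)

# alpha and ec are passed through to matplotlib's histogram call.
ax = demo.plot.hist(bins=20, alpha=0.5, ec="black", figsize=(8, 4))

# The returned Axes can be customized with the usual matplotlib methods.
ax.set_title("Overlaid histograms of a and b")
ax.set_xlabel("value")
```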
Plotly is a library that allows you to create interactive plots that you can use in dashboards or websites (you can save them as html files or static images).
In order for this all to work, you'll need to install plotly, plus cufflinks to call plots directly off of a pandas dataframe. Both are available through pip. Install the libraries at your command line/terminal using:
pip install plotly
pip install cufflinks
NOTE: If you have more than one installation of Python on your computer, make sure pip installs into the same one your notebook uses; otherwise the imports below may fail.
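If you are unsure which Python your notebook is running, or whether the two libraries are visible to it, a check along these lines can help. The status dictionary is just an illustrative name.

```python
import importlib.util
import sys

# The interpreter that is running this notebook or script.
print(sys.executable)

# True if the package can be imported from this interpreter, False otherwise.
status = {
    name: importlib.util.find_spec(name) is not None
    for name in ("plotly", "cufflinks")
}
print(status)
```

If a package shows False, running pip through that same interpreter (for example `python -m pip install plotly`, using the path printed above) usually installs it in the right place.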
import pandas as pd
import numpy as np
%matplotlib inline
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
print(__version__) # requires version >= 1.9.0
import cufflinks as cf
# For Notebooks
init_notebook_mode(connected=True)
# For offline use
cf.go_offline()
df_p = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())
df_p.head() # view the first few rows
df2_p = pd.DataFrame({'Category':['A','B','C'],'Values':[32,43,50]})
df2_p.head()
df_p.iplot(kind='scatter',x='A',y='B',mode='markers',size=10)
df2_p.iplot(kind='bar',x='Category',y='Values')
df_p.iplot(kind='box')
df3d = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,20,10],'z':[5,4,3,2,1]})
df3d.iplot(kind='surface',colorscale='rdylbu')
df_p[['A','B']].iplot(kind='spread')
df_p['A'].iplot(kind='hist',bins=25)
df_p.iplot(kind='bubble',x='A',y='B',size='C')
df_p.scatter_matrix()