Intermediate Data Visualization with Seaborn
Instructor: Chris Moffitt - Creator of Practical Business Python
1 Seaborn Introduction
1.1 Reading a csv file
Before you analyze data, you will need to read the data into a pandas DataFrame. In this exercise, you will be looking at data from US School Improvement Grants in 2010. This program gave nearly $4B to schools to help them renovate or improve their programs.
The first step in most data analyses is to import pandas and
seaborn and read a data file so that you can analyze it further.
This course introduces a lot of new concepts, so if you ever need a quick refresher, download the Seaborn Cheat Sheet and keep it handy!
Instructions
- Import pandas and seaborn using the standard naming conventions.
- The path to the csv file is stored in the grant_file variable.
- Use pandas to read the file.
- Store the resulting DataFrame in the variable df.
Answer
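No answer is recorded for this exercise. The following is a minimal sketch of the steps listed above, not the course's own solution. It assumes a tiny made-up csv held in memory in place of the real file; in the course, grant_file is a path string, and the column values here are purely illustrative.

```python
import io

import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the file behind grant_file (the course supplies a real path)
grant_file = io.StringIO(
    "School Name,Award_Amount,Model Selected,Region\n"
    "Example A,250000,Transformation,South\n"
    "Example B,600000,Turnaround,West\n"
)

# Read the csv data into a DataFrame
df = pd.read_csv(grant_file)
print(df.head())
```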
1.2 Comparing a histogram and displot
The pandas library supports simple plotting of data, which is very
convenient when data is already likely to be in a pandas DataFrame.
Seaborn generally does more statistical analysis on data and can provide
more sophisticated insight into the data. In this exercise, we will
compare a pandas histogram vs the seaborn displot.
Instructions
- Use pandas’ plot.hist() function to plot a histogram of the Award_Amount column.
- Use Seaborn’s displot() function to plot a distribution plot of the same column.
Answer
1.3 Plot a histogram
The displot() function will return a histogram by default. The
displot() can also create a KDE or rug plot which are useful ways to
look at the data. Seaborn can also combine these plots so you can
perform more meaningful analysis.
Instructions
- Create a displot for the data.
- Explicitly pass in the number 20 for the number of bins in the histogram.
- Display the plot using plt.show().
Answer
1.4 Rug plot and kde shading
Now that you understand some function arguments for displot(), we can
continue further refining the output. This process of creating a
visualization and updating it in an incremental fashion is a useful and
common approach to look at data from multiple perspectives.
Seaborn excels at making this process simple.
Instructions
- Create a displot of the Award_Amount column in the df.
- Configure it to show a shaded kde plot (using the kind and fill parameters).
- Add a rug plot above the x axis (using the rug parameter).
- Display the plot.
Answer
# Create a displot of the Award Amount
sns.displot(df['Award_Amount'],
            kind='kde',
            rug=True,
            fill=True)
# Display the plot
plt.show()
1.5 Create a regression plot
For this set of exercises, we will be looking at FiveThirtyEight’s data on which US State has the worst drivers. The data set includes summary level information about fatal accidents as well as insurance premiums for each state as of 2010.
In this exercise, we will look at the difference between the regression plotting functions.
Instructions
- The data is available in the dataframe called df.
- Create a regression plot using regplot() with "insurance_losses" on the x axis and "premiums" on the y axis.
- Create a regression plot of "premiums" versus "insurance_losses" using lmplot().
- Display the plot.
Answer
# Create a regression plot of premiums vs. insurance_losses
sns.regplot(data=df,
            x="insurance_losses",
            y="premiums")
# Display the plot
plt.show()
# Create an lmplot of premiums vs. insurance_losses
sns.lmplot(data=df,
           x="insurance_losses",
           y="premiums")
# Display the second plot
plt.show()
1.6 Plotting multiple variables
Since we are using lmplot() now, we can look at the more complex
interactions of data. This data set includes geographic information by
state and area. It might be interesting to see if there is a difference
in relationships based on the Region of the country.
Instructions
- Use lmplot() to look at the relationship between insurance_losses and premiums.
- Plot a regression line for each Region of the country.
Answer
# Create a regression plot using hue
sns.lmplot(data=df,
           x="insurance_losses",
           y="premiums",
           hue="Region")
1.7 Facetting multiple regressions
lmplot() allows us to facet the data across multiple rows and columns.
In the previous plot, the multiple lines were difficult to read in one
plot. We can try creating multiple plots by Region to see if that is a
more useful visualization.
Instructions
- Use lmplot() to look at the relationship between insurance_losses and premiums.
- Create a plot for each Region of the country.
- Display the plots across multiple rows.
Answer
# Create a regression plot with multiple rows
sns.lmplot(data=df,
           x="insurance_losses",
           y="premiums",
           row="Region")
2 Customizing Seaborn Plots
2.1 Setting the default style
For these exercises, we will be looking at fair market rent values calculated by the US Housing and Urban Development Department. This data is used to calculate guidelines for several federal programs. The actual values for rents vary greatly across the US. We can use this data to get some experience with configuring Seaborn plots.
All of the necessary imports for seaborn, pandas and matplotlib
have been completed. The data is stored in the pandas DataFrame df.
Instructions
- Plot a pandas histogram without adjusting the style.
- Set Seaborn’s default style.
- Create another pandas histogram of the fmr_2 column, which represents fair market rent for a 2-bedroom apartment.
Answer
plt.close() # plt.clf() # added/edited
# Set the default seaborn style
sns.set()
# Plot the pandas histogram again
df['fmr_2'].plot.hist()
plt.show()
2.2 Comparing styles
Seaborn supports setting different styles that can control the aesthetics of the final plot. In this exercise, you will plot the same data in two different styles in order to see how the styles change the output.
Instructions
- Create a displot() of the fmr_2 column in df using a dark style.
- Use plt.clf() to clear the figure.
- Create the same displot() of fmr_2 using a whitegrid style. Clear the plot after showing it.
Answer
2.3 Removing spines
In general, visualizations should minimize extraneous markings so that the data speaks for itself. Seaborn lets you remove the lines along the top, bottom, left and right edges of a plot, which are called spines.
Instructions
- Use a white style for the plot.
- Create a lmplot() comparing the pop2010 and the fmr_2 columns.
- Remove the top and right spines using despine().
Answer
# Set the style to white
sns.set_style('white')
# Create a regression plot
sns.lmplot(data=df,
           x='pop2010',
           y='fmr_2')
# Remove the top and right spines
sns.despine()
# Display the plot
plt.show()
2.4 Matplotlib color codes
Seaborn offers several options for modifying the colors of your
visualizations. The simplest approach is to explicitly state the color
of the plot. A quick way to change colors is to use the standard
matplotlib color codes.
Instructions
- Set the default Seaborn style and enable the matplotlib color codes.
- Create a displot for the fmr_3 column using matplotlib’s magenta (m) color code.
Answer
# Set style, enable color code, and create a magenta displot
sns.set(color_codes=True)
sns.displot(df['fmr_3'], color='m')
2.5 Using default palettes
Seaborn includes several default palettes that can be easily applied to
your plots. In this example, we will look at the impact of two different
palettes on the same displot.
Instructions
- Create a for loop to show the difference between the bright and colorblind palette.
- Set the palette using the set_palette() function.
- Use a displot of the fmr_3 column.
Answer
# Loop through differences between bright and colorblind palettes
for p in ['bright', 'colorblind']:
    sns.set_palette(p)
    sns.displot(df['fmr_3'])
    plt.show()
    # Clear the plots
    plt.clf()
2.6 Creating Custom Palettes
Choosing a cohesive palette that works for your data can be time
consuming. Fortunately, Seaborn provides the color_palette() function
to create your own custom sequential, categorical, or diverging
palettes. Seaborn also makes it easy to view your palettes by using the
palplot() function.
In this exercise, you can experiment with creating different palettes.
Instructions
- Create and display a Purples sequential palette containing 8 colors.
- Create and display a palette with 10 colors using the husl system.
- Create and display a diverging palette with 6 colors using coolwarm.
Answer
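No answer is recorded for this exercise. A minimal sketch of the three palettes, using color_palette() to build each one and palplot() to display it; the variable names are this sketch's own.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import seaborn as sns

# Sequential palette: 8 shades of purple
purples = sns.color_palette("Purples", 8)
sns.palplot(purples)
plt.show()

# 10 evenly spaced colors from the husl system
husl = sns.color_palette("husl", 10)
sns.palplot(husl)
plt.show()

# Diverging palette: 6 colors from coolwarm
coolwarm = sns.color_palette("coolwarm", 6)
sns.palplot(coolwarm)
plt.show()
```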
2.7 Using matplotlib axes
Seaborn uses matplotlib as the underlying library for creating plots.
Most of the time, you can use the Seaborn API to modify your
visualizations but sometimes it is helpful to use matplotlib’s
functions to customize your plots. The most important object in this
case is matplotlib’s axes.
Once you have an axes object, you can perform a lot of customization
of your plot.
In these examples, the US HUD data is loaded in the dataframe df and
all libraries are imported.
Instructions
- Use plt.subplots() to create figure and axes objects.
- Plot a histplot of column fmr_3 on the axes.
- Set a more useful label on the x axis of “3 Bedroom Fair Market Rent”.
Answer
# Create a figure and axes
fig, ax = plt.subplots()
# Plot the distribution of data
sns.histplot(df['fmr_3'], ax=ax)
# Create a more descriptive x axis label
ax.set(xlabel="3 Bedroom Fair Market Rent")
# Show the plot
plt.show()
2.8 Additional plot customizations
The matplotlib API supports many common customizations such as
labeling axes, adding titles, and setting limits. Let’s complete another
customization exercise.
Instructions
- Create a histplot of the fmr_1 column.
- Modify the x axis label to say “1 Bedroom Fair Market Rent”.
- Change the x axis limits to be between 100 and 1500.
- Add a descriptive title of "US Rent" to the plot.
Answer
# Create a figure and axes
fig, ax = plt.subplots()
# Plot the distribution of 1 bedroom rents
sns.histplot(df['fmr_1'], ax=ax)
# Modify the properties of the plot
ax.set(xlabel="1 Bedroom Fair Market Rent",
       xlim=(100,1500),
       title="US Rent")
# Display the plot
plt.show()
2.9 Adding annotations
Each of the enhancements we have covered can be combined together. In the next exercise, we can annotate our distribution plot to include lines that show the mean and median rent prices.
For this example, the palette has been changed to bright using
sns.set_palette()
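The exercise treats median and mean as already defined. A minimal sketch of how they could be computed with pandas, using a small made-up fmr_1 column in place of the HUD data:

```python
import pandas as pd

# Made-up 1-bedroom rents standing in for the HUD data
df = pd.DataFrame({"fmr_1": [500, 600, 650, 700, 1550]})

# Summary statistics used for the vertical lines
median = df["fmr_1"].median()
mean = df["fmr_1"].mean()
print(median, mean)
```

With skewed data like this, the mean sits to the right of the median, which is why drawing both lines on the distribution can be informative.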
Instructions
- Create a figure and axes.
- Plot the fmr_1 column distribution.
- Add a vertical line using axvline for the median and mean of the values, which are already defined.
Answer
# Create a figure and axes. Then plot the data
fig, ax = plt.subplots()
sns.histplot(df['fmr_1'], ax=ax)
# Customize the labels and limits
ax.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500), title="US Rent")
# Add vertical lines for the median and mean
ax.axvline(x=median, color='m', label='Median', linestyle='--', linewidth=2)
ax.axvline(x=mean, color='b', label='Mean', linestyle='-', linewidth=2)
# Show the legend and plot the data
ax.legend()
plt.show()
2.10 Multiple plots
For the final exercise we will plot a comparison of the fair market rents for 1-bedroom and 2-bedroom apartments.
Instructions
- Create two axes objects, ax0 and ax1.
- Plot fmr_1 on ax0 and fmr_2 on ax1.
- Display the plots side by side.
Answer
# Create a plot with 1 row and 2 columns that share the y axis label
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True)
# Plot the distribution of 1 bedroom apartments on ax0
sns.histplot(df['fmr_1'], ax=ax0)
ax0.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100,1500))
# Plot the distribution of 2 bedroom apartments on ax1
sns.histplot(df['fmr_2'], ax=ax1)
ax1.set(xlabel="2 Bedroom Fair Market Rent", xlim=(100,1500))
# Display the plot
plt.show()
3 Additional Plot Types
3.1 stripplot() and swarmplot()
Many datasets have categorical data and Seaborn supports several useful plot types for this data. In this example, we will continue to look at the 2010 School Improvement data and segment the data by the types of school improvement models used.
As a refresher, here is the KDE distribution of the Award Amounts:
While this plot is useful, there is a lot more we can learn by looking
at the individual Award_Amount and how the amounts are distributed
among the four categories.
Instructions
- Create a stripplot of the Award_Amount with the Model Selected on the y axis with jitter enabled.
- Create a swarmplot() of the same data, but also include the hue by Region.
Answer
# Create the stripplot
sns.stripplot(data=df,
              x='Award_Amount',
              y='Model Selected',
              jitter=True)
plt.show()
# Create and display a swarmplot with hue set to the Region
sns.swarmplot(data=df,
              x='Award_Amount',
              y='Model Selected',
              hue='Region')
plt.show()
3.2 boxplots, violinplots and boxenplots
Seaborn’s categorical plots also support several abstract representations of data. The API for each of these is the same so it is very convenient to try each plot and see if the data lends itself to one over the other.
In this exercise, we will use the color palette options presented in Chapter 2 to show how colors can easily be included in the plots.
Instructions
- Create and display a boxplot of the data with Award_Amount on the x axis and Model Selected on the y axis.
- Create and display a similar violinplot of the data, but use the husl palette for colors.
- Use Award_Amount on the x axis and Model Selected on the y axis.
- Create and display a boxenplot using the Paired palette and the Region column as the hue.
Answer
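The first instruction asks for a boxplot, but the recorded answer starts with the violinplot. A minimal sketch of the boxplot step, assuming a few made-up rows in place of the School Improvement Grants data:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the grant data
df = pd.DataFrame({
    "Award_Amount": [250000, 400000, 600000, 150000, 900000, 350000],
    "Model Selected": ["Transformation", "Transformation", "Turnaround",
                       "Turnaround", "Restart", "Restart"],
})

# Create a boxplot of award amounts by improvement model
ax = sns.boxplot(data=df,
                 x="Award_Amount",
                 y="Model Selected")
plt.show()
```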
# Create a violinplot with the husl palette
sns.violinplot(data=df,
               x='Award_Amount',
               y='Model Selected',
               palette='husl')
plt.show()
# Create a boxenplot with the Paired palette and the Region column as the hue
sns.boxenplot(data=df,
              x='Award_Amount',
              y='Model Selected',
              palette='Paired',
              hue='Region')
plt.show()
3.3 barplots, pointplots and countplots
The final group of categorical plots consists of barplots, pointplots and
countplots, which create statistical summaries of the data. These plots
follow an API similar to that of the other plots and allow further
customization for the specific problem at hand.
Instructions
- Create a countplot with the df dataframe and Model Selected on the y axis and the color varying by Region.
- Create a pointplot with the df dataframe and Model Selected on the x-axis and Award_Amount on the y-axis.
- Use a capsize in the pointplot in order to add caps to the error bars.
- Create a barplot with the same data on the x and y axis and change the color of each bar based on the Region column.
Answer
# Show a countplot with the number of models used with each region a different color
sns.countplot(data=df,
              y="Model Selected",
              hue="Region")
plt.show()
# Create a pointplot and include the capsize in order to show caps on the error bars
sns.pointplot(data=df,
              y='Award_Amount',
              x='Model Selected',
              capsize=.1)
plt.show()
# Create a barplot with each Region shown as a different color
sns.barplot(data=df,
            y='Award_Amount',
            x='Model Selected',
            hue='Region')
plt.show()
3.4 Regression and residual plots
Linear regression is a useful tool for understanding the relationship between numerical variables. Seaborn has simple but powerful tools for examining these relationships.
For these exercises, we will look at some details from the US Department of Education on 4 year college tuition information and see if there are any interesting insights into which variables might help predict tuition costs.
For these exercises, all data is loaded in the df variable.
Instructions
- Plot a regression plot comparing Tuition and average SAT scores (SAT_AVG_ALL).
- Make sure the values are shown as green triangles.
- Use a residual plot to determine if the relationship looks linear.
Answer
# Display a regression plot for Tuition
sns.regplot(data=df,
            y='Tuition',
            x='SAT_AVG_ALL',
            marker='^',
            color='g')
plt.show()
# Display the residual plot
sns.residplot(data=df,
              y='Tuition',
              x='SAT_AVG_ALL',
              color='g')
plt.show()
3.5 Regression plot parameters
Seaborn’s regression plot supports several parameters that can be used to configure the plots and drive more insight into the data.
For the next exercise, we can look at the relationship between tuition
and the percent of students that receive Pell grants. A Pell grant is
based on student financial need and subsidized by the US Government. In
this data set, each University has some percentage of students that
receive these grants. Since this data is continuous, using x_bins can
be useful to break the percentages into categories in order to summarize
and understand the data.
Instructions
- Plot a regression plot of Tuition and PCTPELL.
- Create another plot that breaks the PCTPELL column into 5 different bins.
- Create a final regression plot that includes a 2nd order polynomial regression line.
Answer
# Plot a regression plot of Tuition and the Percentage of Pell Grants
sns.regplot(data=df,
            y='Tuition',
            x='PCTPELL')
plt.show()
# Create another plot that breaks PCTPELL into 5 bins
sns.regplot(data=df,
            y='Tuition',
            x='PCTPELL',
            x_bins=5)
plt.show()
# The final plot should include a line using a 2nd order polynomial
sns.regplot(data=df,
            y='Tuition',
            x='PCTPELL',
            x_bins=5,
            order=2)
plt.show()
3.6 Binning data
When the data on the x axis is a continuous value, it can be useful to break it into different bins in order to get a better visualization of the changes in the data.
For this exercise, we will look at the relationship between tuition and
the Undergraduate population abbreviated as UG in this data. We will
start by looking at a scatter plot of the data and examining the impact
of different bin sizes on the visualization.
Instructions
- Create a regplot of Tuition and UG and set the fit_reg parameter to False to disable the regression line.
- Create another plot with the UG data divided into 5 bins.
- Create a regplot() with the data divided into 8 bins.
Answer
# Create a scatter plot by disabling the regression line
sns.regplot(data=df,
            y='Tuition',
            x='UG',
            fit_reg=False)
plt.show()
# Create a scatter plot and bin the data into 5 bins
sns.regplot(data=df,
            y='Tuition',
            x='UG',
            x_bins=5)
plt.show()
# Create a regplot and bin the data into 8 bins
sns.regplot(data=df,
            y='Tuition',
            x='UG',
            x_bins=8)
plt.show()
3.7 Creating heatmaps
A heatmap is a common matrix plot that can be used to graphically summarize the relationship between two variables. For this exercise, we will start by looking at guests of the Daily Show from 1999 - 2015 and see how the occupations of the guests have changed over time.
The data includes the date of each guest appearance as well as their
occupation. For the first exercise, we need to get the data into the
right format for Seaborn’s heatmap function to correctly plot the
data. All of the data has already been read into the df variable.
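To make the target format concrete, here is a minimal sketch of what crosstab() produces. It assumes a handful of made-up rows rather than the Daily Show data; the column names match those used in this exercise.

```python
import pandas as pd

# Made-up guest appearances: one row per appearance
df = pd.DataFrame({
    "YEAR": [1999, 1999, 2000, 2000, 2000],
    "Group": ["Acting", "Media", "Acting", "Acting", "Comedy"],
})

# Count appearances for every Group/YEAR combination
pd_crosstab = pd.crosstab(df["Group"], df["YEAR"])
print(pd_crosstab)
```

The result has one row per Group, one column per YEAR, and a count in each cell (zero where a combination never occurs), which is the matrix layout heatmap() plots.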
Instructions
- Use pandas’ crosstab() function to build a table of visits by Group and YEAR.
- Print the pd_crosstab DataFrame.
- Plot the data using Seaborn’s heatmap().
Answer
# Create a crosstab table of the data and print it
pd_crosstab = pd.crosstab(df["Group"], df["YEAR"])
print(pd_crosstab)
## YEAR 1999 2000 2001 2002 2003 ... 2011 2012 2013 2014 2015
## Group ...
## Academic 0 0 2 0 4 ... 10 8 8 10 2
## Acting 108 100 92 84 74 ... 42 33 60 47 33
## Advocacy 0 1 0 1 0 ... 1 2 2 3 3
## Athletics 0 3 1 2 0 ... 2 7 4 4 3
## Business 0 1 0 0 0 ... 3 3 3 1 1
## Clergy 0 0 0 1 1 ... 1 2 0 0 0
## Comedy 25 12 11 5 12 ... 7 6 6 9 7
## Consultant 0 0 0 0 1 ... 0 0 0 0 0
## Government 0 0 2 1 2 ... 3 3 7 6 0
## Media 11 21 31 42 41 ... 51 52 51 53 24
## Military 0 0 0 0 0 ... 3 1 1 1 1
## Misc 0 0 2 1 1 ... 5 6 2 5 3
## Musician 17 13 11 10 7 ... 6 5 5 8 5
## Political Aide 0 1 1 2 1 ... 1 1 3 2 3
## Politician 2 13 3 8 14 ... 23 29 11 13 14
## Science 0 0 0 0 1 ... 5 2 2 1 1
##
## [16 rows x 17 columns]
# Plot a heatmap of the table
sns.heatmap(pd_crosstab)
# Rotate tick marks for visibility
plt.yticks(rotation=0)
3.8 Customizing heatmaps
Seaborn supports several types of additional customizations to improve
the output of a heatmap. For this exercise, we will continue to use the
Daily Show data that is stored in the df variable but we will
customize the output.
Instructions
- Create a crosstab table of Group and YEAR.
- Create a heatmap of the data using the BuGn palette.
- Disable the cbar and increase the linewidths to 0.3.
Answer
# Create the crosstab DataFrame
pd_crosstab = pd.crosstab(df["Group"], df["YEAR"])
# Plot a heatmap of the table
sns.heatmap(pd_crosstab, cbar=False, cmap="BuGn", linewidths=.3)
# Rotate tick marks for visibility
plt.yticks(rotation=0)
4 Creating Plots on Data Aware Grids
4.1 Building a FacetGrid
Seaborn’s FacetGrid is the foundation for building data-aware grids. A
data-aware grid allows you to create a series of small plots that can be
useful for understanding complex data relationships.
For these exercises, we will continue to look at the College Scorecard Data from the US Department of Education. This rich dataset has many interesting data elements that we can plot with Seaborn.
When building a FacetGrid, there are two steps:
- Create a FacetGrid object with columns, rows, or hue.
- Map individual plots to the grid.
Instructions
- Create a FacetGrid that shows a point plot of the Average SAT scores SAT_AVG_ALL.
- Use row_order to control the display order of the degree types.
Answer
# Create FacetGrid with Degree_Type and specify the order of the rows using row_order
g2 = sns.FacetGrid(df,
                   row="Degree_Type",
                   row_order=['Graduate', 'Bachelors', 'Associates', 'Certificate'])
# Map a pointplot of SAT_AVG_ALL onto the grid
g2.map(sns.pointplot, 'SAT_AVG_ALL')
4.2 Using a catplot
In many cases, Seaborn’s catplot() can be a simpler way to create a
FacetGrid. Instead of creating a grid and mapping the plot, we can use
the catplot() to create a plot with one line of code.
For this exercise, we will recreate one of the plots from the previous
exercise using catplot() and show how to create a boxplot on a
data-aware grid.
Instructions
- Create a catplot() that contains a boxplot (box) of Tuition values varying by Degree_Type across rows.
- Create a catplot() of SAT Averages (SAT_AVG_ALL) facetted across Degree_Type that shows a pointplot (point).
- Use row_order to order the degrees from highest to lowest level.
Answer
# Create a catplot that contains boxplots of Tuition values
sns.catplot(data=df,
            x='Tuition',
            kind='box',
            row='Degree_Type')
# Create a pointplot of Average SAT scores facetted by Degree Type
sns.catplot(data=df,
            x='SAT_AVG_ALL',
            kind='point',
            row='Degree_Type',
            row_order=['Graduate', 'Bachelors', 'Associates', 'Certificate'])
4.3 Using a lmplot
The lmplot is used to plot scatter plots with regression lines on
FacetGrid objects. The API is similar to catplot with the difference
that the default behavior of lmplot is to plot regression lines.
For the first set of exercises, we will look at the Undergraduate
population (UG) and compare it to the percentage of students receiving
Pell Grants (PCTPELL).
For the second lmplot exercise, we can look at the relationships
between Average SAT scores and Tuition across the different degree types
and public vs. non-profit schools.
Instructions
- Create a FacetGrid() with Degree_Type columns and scatter plot of UG and PCTPELL.
- Create a lmplot() using the same values from the FacetGrid().
- Create a facetted lmplot() comparing SAT_AVG_ALL to Tuition with columns varying by Ownership and rows by Degree_Type.
- In the lmplot() add a hue for Women Only Universities.
Answer
# added/edited
df = pd.read_csv("college_datav3.csv")
degree_ord = ["Graduate", "Bachelors", "Associates"]
inst_ord = ["Public", "Private non-profit"]
# Create a FacetGrid varying by column, with columns ordered by the degree_ord variable
g = sns.FacetGrid(df, col="Degree_Type", col_order=degree_ord)
# Map a scatter plot of Undergrad Population compared to PCTPELL
g.map(plt.scatter, 'UG', 'PCTPELL')
# Re-create the previous plot as an lmplot
sns.lmplot(data=df,
           x='UG',
           y='PCTPELL',
           col="Degree_Type",
           col_order=degree_ord)
# Create an lmplot that has a column for Ownership, a row for Degree_Type,
# hue based on the WOMENONLY column, and columns ordered by the inst_ord variable
sns.lmplot(data=df,
           x='SAT_AVG_ALL',
           y='Tuition',
           col="Ownership",
           row='Degree_Type',
           row_order=['Graduate', 'Bachelors'],
           hue='WOMENONLY',
           col_order=inst_ord)
4.4 Building a PairGrid
When exploring a dataset, one of the earliest tasks is exploring the relationship between pairs of variables. This step is normally a precursor to additional investigation.
Seaborn supports this pair-wise analysis using the PairGrid. In this
exercise, we will look at the Car Insurance Premium data we analyzed in
Chapter 1. All data is available in the df variable.
Instructions
- Compare "fatal_collisions" to "premiums" by using a scatter plot mapped to a PairGrid().
- Create another PairGrid but plot a histogram on the diagonal and scatter plot on the off diagonal.
Answer
# Create a PairGrid with a scatter plot for fatal_collisions and premiums
g = sns.PairGrid(df, vars=["fatal_collisions", "premiums"])
g2 = g.map(sns.scatterplot)
plt.show()
# Create the same PairGrid but map a histogram on the diag
g = sns.PairGrid(df, vars=["fatal_collisions", "premiums"])
g2 = g.map_diag(sns.histplot)
g3 = g2.map_offdiag(sns.scatterplot)
plt.show()
4.5 Using a pairplot
The pairplot() function is generally a more convenient way to look at
pairwise relationships. In this exercise, we will create the same
results as the PairGrid using less code. Then, we will explore some
additional functionality of the pairplot(). We will also use a
different palette and adjust the transparency of the diagonal plots
using the alpha parameter.
Instructions
- Recreate the pairwise plot from the previous exercise using pairplot().
- Create another pairplot using the "Region" to color code the results.
- Use the RdBu palette to change the colors of the plot.
Answer
# Create a pairwise plot of the variables using a scatter plot
sns.pairplot(data=df,
             vars=["fatal_collisions", "premiums"],
             kind='scatter')
# Plot the same data but use a different color palette and color code by Region
sns.pairplot(data=df,
             vars=["fatal_collisions", "premiums"],
             kind='scatter',
             hue='Region',
             palette='RdBu',
             diag_kws={'alpha':.5})
4.6 Additional pairplots
This exercise goes through a couple more examples of how
pairplot() can be customized to analyze data quickly and
identify areas of interest that might be worthy of additional
analysis.
One area of customization that is useful is to explicitly define the
x_vars and y_vars that you wish to examine. Instead of examining all
pairwise relationships, this capability allows you to look only at the
specific interactions that may be of interest.
We have already looked at using kind to control the types of plots. We
can also use diag_kind to control the types of plots shown on the
diagonals. In the final example, we will include a regression and kde
plot in the pairplot.
Instructions
- Create a pair plot that examines fatal_collisions_speeding and fatal_collisions_alc on the x axis and premiums and insurance_losses on the y axis.
- Use the husl palette and color code the scatter plot by Region.
- Build a pairplot() with kde plots along the diagonals. Include the insurance_losses and premiums as the variables.
- Use a reg plot for the non-diagonal plots.
- Use the BrBG palette for the final plot.
Answer
# Build a pairplot with different x and y variables
sns.pairplot(data=df,
             x_vars=["fatal_collisions_speeding", "fatal_collisions_alc"],
             y_vars=['premiums', 'insurance_losses'],
             kind='scatter',
             hue='Region',
             palette='husl')
# Plot relationships between insurance_losses and premiums
sns.pairplot(data=df,
             vars=["insurance_losses", "premiums"],
             kind='reg',
             palette='BrBG',
             diag_kind='kde',
             hue='Region')
4.7 Building a JointGrid and jointplot
Seaborn’s JointGrid combines univariate plots such as histograms, rug
plots and kde plots with bivariate plots such as scatter and regression
plots. The process for creating these plots should be familiar to you
now. These plots also demonstrate how Seaborn provides convenient
functions to combine multiple plots together.
For these exercises, we will use the bike share data that we reviewed earlier. In this exercise, we will look at the relationship between humidity levels and total rentals to see if there is an interesting relationship we might want to explore later.
Instructions
- Use Seaborn’s “whitegrid” style for these plots.
- Create a JointGrid() with “hum” on the x-axis and “total_rentals” on the y.
- Plot a regplot() and histplot() on the margins.
- Re-create the plot using a jointplot().
Answer
# Build a JointGrid comparing humidity and total_rentals
sns.set_style("whitegrid")
g = sns.JointGrid(x="hum",
                  y="total_rentals",
                  data=df,
                  xlim=(0.1, 1.0))
g.plot(sns.regplot, sns.histplot)
# Create a jointplot similar to the JointGrid
sns.jointplot(x="hum",
              y="total_rentals",
              kind='reg',
              data=df)
4.8 Jointplots and regression
Since the previous plot does not show a relationship between humidity
and rental amounts, we can look at another variable that we reviewed
earlier. Specifically, the relationship between temp and
total_rentals.
Instructions
- Create a jointplot with a 2nd order polynomial regression plot comparing temp and total_rentals.
- Use a residual plot to check the appropriateness of the model.
Answer
# Plot temp vs. total_rentals as a regression plot
sns.jointplot(x="temp",
              y="total_rentals",
              kind='reg',
              data=df,
              order=2,
              xlim=(0, 1))
# Plot a jointplot showing the residuals
sns.jointplot(x="temp",
              y="total_rentals",
              kind='resid',
              data=df,
              order=2)
4.9 Complex jointplots
The jointplot is a convenience wrapper around many of the JointGrid
functions. However, it is possible to overlay some of the JointGrid
plots on top of the standard jointplot. In this example, we can look
at the different distributions for riders that are considered casual
versus those that are registered.
Instructions
- Create a jointplot with a scatter plot comparing temp and casual riders.
- Overlay a kdeplot on top of the scatter plot.
- Build a similar plot for registered users.
Answer
# Create a jointplot of temp vs. casual riders
# Include a kdeplot over the scatter plot
g = sns.jointplot(x="temp",
                  y="casual",
                  kind='scatter',
                  data=df,
                  marginal_kws=dict(bins=10))
g.plot_joint(sns.kdeplot)
# Replicate the previous plot but only for registered riders
g = sns.jointplot(x="temp",
                  y="registered",
                  kind='scatter',
                  data=df,
                  marginal_kws=dict(bins=10))
g.plot_joint(sns.kdeplot)