Introduction to Data Visualization with Seaborn
Instructor: DataCamp Content Creator
1 Introduction to Seaborn
1.1 Making a scatter plot with lists
In this exercise, we’ll use a dataset that contains information about 227 countries. This dataset has lots of interesting information on each country, such as the country’s birth rates, death rates, and its gross domestic product (GDP). Here, GDP is reported per capita: the value of all the goods and services a country produces in a year, expressed as dollars per person.
We’ve created three lists of data from this dataset to get you started.
gdp is a list that contains the value of GDP per country, expressed as
dollars per person. phones is a list of the number of mobile phones
per 1,000 people in that country. Finally, percent_literate is a list
that contains the percent of each country’s population that can read and
write.
Instructions
- Import Matplotlib and Seaborn using the standard naming convention.
- Create a scatter plot of GDP (gdp) vs. number of phones per 1,000 people (phones).
- Display the plot.
- Change the scatter plot so it displays the percent of the population that can read and write (percent_literate) on the y-axis.
Answer
# added/edited
import pandas as pd
countries = pd.read_csv("countries-of-the-world.csv")
gdp = list(map(float,[word.replace(",",".") for word in countries["GDP ($ per capita)"].astype(str)]))
phones = list(map(float,[word.replace(",",".") for word in countries["Phones (per 1000)"].astype(str)]))
percent_literate = list(map(float,[word.replace(",",".") for word in countries["Literacy (%)"].astype(str)]))
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create scatter plot with GDP on the x-axis and number of phones on the y-axis
sns.scatterplot(x=gdp, y=phones)
# Show plot
plt.show()
# Change this scatter plot to have percent literate on the y-axis
sns.scatterplot(x=gdp, y=percent_literate)
# Show plot
plt.show()
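The answer above turns strings such as "36,0" into floats by swapping the decimal comma for a point before calling float. A minimal sketch of that step, using a short made-up list (the raw values below are hypothetical, not taken from the dataset):

```python
import math

# Hypothetical raw values that use a comma as the decimal separator;
# "nan" stands in for a missing entry after astype(str).
raw = ["36,0", "86,5", "nan"]

# Swap the decimal comma for a point, then convert each string to float
values = [float(text.replace(",", ".")) for text in raw]

print(values)
```

pandas' read_csv also accepts a decimal argument (for example decimal=","), which may make the manual conversion unnecessary, though that has not been checked against this particular file.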
1.2 Making a count plot with a list
In the last exercise, we explored a dataset that contains information about 227 countries. Let’s do more exploration of this data - specifically, how many countries are in each region of the world?
To do this, we’ll need to use a count plot. Count plots take in a
categorical list and return bars that represent the number of list
entries per category. You can create one here using a list of regions
for each country, which is a variable named region.
Instructions
- Import Matplotlib and Seaborn using the standard naming conventions.
- Use Seaborn to create a count plot with region on the y-axis.
- Display the plot.
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create count plot with region on the y-axis
sns.countplot(y=region)
# Show plot
plt.show()
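A count plot draws one bar per category, with the bar length equal to the number of list entries in that category. The sketch below reproduces only the counting step with collections.Counter, on a short made-up list (the region values here are hypothetical stand-ins for the course's region variable):

```python
from collections import Counter

# Hypothetical stand-in for the course's region list
region = ["ASIA", "EUROPE", "ASIA", "OCEANIA", "EUROPE", "ASIA"]

# Count how many times each category appears in the list
region_counts = Counter(region)

# Each (category, count) pair corresponds to one bar in the count plot
for name, count in region_counts.items():
    print(name, count)
```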
1.3 “Tidy” vs. “untidy” data
Here, we have a sample dataset from a survey of children about their favorite animals. But can we use this dataset as-is with Seaborn? Let’s use pandas to import the csv file with the data collected from the survey and determine whether it is tidy, which is essential to having it work well with Seaborn.
To get you started, the filepath to the csv file has been assigned to
the variable csv_filepath.
Note that because csv_filepath is a Python variable, you will not need
to put quotation marks around it when you read the csv.
Instructions
- Read the csv file located at csv_filepath into a DataFrame named df.
- Print the head of df to show the first five rows.
Answer
# Import pandas
import pandas as pd
# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)
# Print the head of df
print(df.head())
## Unnamed: 0 How old are you?
## 0 Marion 12
## 1 Elroy 16
## 2 NaN What is your favorite animal?
## 3 Marion dog
## 4 Elroy cat
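The printed head shows that this file is not tidy: the survey questions appear as values inside the rows, so a single column mixes ages and animals. As a sketch of what a tidy version could look like, the DataFrame below is built by hand from the values printed above, with one row per child and one column per question (the column names are assumptions chosen for illustration):

```python
import pandas as pd

# Tidy layout: each row is one respondent, each column is one variable.
# Values are copied from the printed output; column names are hypothetical.
tidy_df = pd.DataFrame({
    "name": ["Marion", "Elroy"],
    "age": [12, 16],
    "favorite_animal": ["dog", "cat"],
})

print(tidy_df)
```

In this layout a column name can be passed directly to arguments such as x= or hue=, which is the shape Seaborn's data= interface expects.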
1.4 Making a count plot with a DataFrame
In this exercise, we’ll look at the responses to a survey sent out to young people. Our primary question here is: how many young people surveyed report being scared of spiders? Survey participants were asked to agree or disagree with the statement “I am afraid of spiders”. Responses vary from 1 to 5, where 1 is “Strongly disagree” and 5 is “Strongly agree”.
To get you started, the filepath to the csv file with the survey data
has been assigned to the variable csv_filepath.
Note that because csv_filepath is a Python variable, you will not need
to put quotation marks around it when you read the csv.
Instructions
- Import Matplotlib, pandas, and Seaborn using the standard names.
- Create a DataFrame named df from the csv file located at csv_filepath.
- Use the countplot() function with the x= and data= arguments to create a count plot with the "Spiders" column values on the x-axis.
- Display the plot.
Answer
# Import Matplotlib, pandas, and Seaborn
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Create a DataFrame from csv file
df = pd.read_csv(csv_filepath)
# Create a count plot with "Spiders" on the x-axis
sns.countplot(x="Spiders", data=df)
# Display the plot
plt.show()
1.5 Hue and scatter plots
In the prior video, we learned how hue allows us to easily make
subgroups within Seaborn plots. Let’s try it out by exploring data from
students in secondary school. We have a lot of information about each
student like their age, where they live, their study habits and their
extracurricular activities.
For now, we’ll look at the relationship between the number of absences they have in school and their final grade in the course, segmented by where the student lives (rural vs. urban area).
Instructions
- Create a scatter plot with "absences" on the x-axis and final grade ("G3") on the y-axis using the DataFrame student_data. Color the plot points based on "location" (urban vs. rural).
- Make "Rural" appear before "Urban" in the plot legend.
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create a scatter plot of absences vs. final grade
sns.scatterplot(x="absences", y="G3",
data=student_data,
hue="location")
# Show plot
plt.show()
# Change the legend order in the scatter plot
sns.scatterplot(x="absences", y="G3",
data=student_data,
hue="location",
hue_order=["Rural", "Urban"])
# Show plot
plt.show()
1.6 Hue and count plots
Let’s continue exploring our dataset from students in secondary school
by looking at a new variable. The "school" column indicates the
initials of which school the student attended - either “GP” or “MS”.
In the last exercise, we created a scatter plot where the plot points were colored based on whether the student lived in an urban or rural area. How many students live in urban vs. rural areas, and does this vary based on what school the student attends? Let’s make a count plot with subgroups to find out.
Instructions
- Fill in the palette_colors dictionary to map the "Rural" location value to the color "green" and the "Urban" location value to the color "blue".
- Create a count plot with "school" on the x-axis using the student_data DataFrame.
- Add subgroups to the plot using the "location" variable and use the palette_colors dictionary to make the location subgroups green and blue.
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create a dictionary mapping subgroup values to colors
palette_colors = {"Rural": "green", "Urban": "blue"}
# Create a count plot of school with location subgroups
sns.countplot(x="school", data=student_data,
hue="location",
palette=palette_colors)
# Display plot
plt.show()
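With hue set, the count plot shows one bar for each combination of the x category and the hue category. The sketch below computes those counts as a table with pandas.crosstab on a few made-up rows (the students DataFrame is a hypothetical stand-in for student_data):

```python
import pandas as pd

# Hypothetical miniature of the student data
students = pd.DataFrame({
    "school": ["GP", "GP", "GP", "MS", "MS"],
    "location": ["Urban", "Urban", "Rural", "Rural", "Rural"],
})

# Rows are schools, columns are locations, cells are row counts
counts = pd.crosstab(students["school"], students["location"])

print(counts)
```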
2 Visualizing Two Quantitative Variables
2.1 Creating subplots with col and row
We’ve seen in prior exercises that students with more absences
("absences") tend to have lower final grades ("G3"). Does this
relationship hold regardless of how much time students study each week?
To answer this, we’ll look at the relationship between the number of
absences that a student has in school and their final grade in the
course, creating separate subplots based on each student’s weekly study
time ("study_time").
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Modify the code to use relplot() instead of scatterplot().
- Modify the code to create one scatter plot for each level of the variable "study_time", arranged in columns.
- Adapt your code to create one scatter plot for each level of a student’s weekly study time, this time arranged in rows.
Answer
# Change to use relplot() instead of scatterplot()
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter")
# Change to make subplots based on study time
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter",
col="study_time")
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3",
data=student_data,
kind="scatter",
row="study_time")
2.2 Creating two-factor subplots
Let’s continue looking at the student_data dataset of students in
secondary school. Here, we want to answer the following question: does a
student’s first semester grade ("G1") tend to correlate with their
final grade ("G3")?
There are many aspects of a student’s life that could result in a higher
or lower final grade in the class. For example, some students receive
extra educational support from their school ("schoolsup") or from
their family ("famsup"), which could result in higher grades. Let’s
try to control for these two factors by creating subplots based on
whether the student received extra educational support from their school
or family.
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Use relplot() to create a scatter plot with "G1" on the x-axis and "G3" on the y-axis, using the student_data DataFrame.
- Create column subplots based on whether the student received support from the school ("schoolsup"), ordered so that “yes” comes before “no”.
- Add row subplots based on whether the student received support from the family ("famsup"), ordered so that “yes” comes before “no”. This will result in subplots based on two factors.
Answer
# Adjust to add subplots based on school support
sns.relplot(x="G1", y="G3",
data=student_data,
kind="scatter",
col="schoolsup",
col_order=["yes", "no"])
# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3",
data=student_data,
kind="scatter",
col="schoolsup",
col_order=["yes", "no"],
row="famsup",
row_order=["yes", "no"])
2.3 Changing the size of scatter plot points
In this exercise, we’ll explore Seaborn’s mpg dataset, which contains
one row per car model and includes information such as the year the car
was made, the number of miles per gallon (“M.P.G.”) it achieves, the
power of its engine (measured in “horsepower”), and its country of
origin.
What is the relationship between the power of a car’s engine
("horsepower") and its fuel efficiency ("mpg")? And how does this
relationship vary by the number of cylinders ("cylinders") the car
has? Let’s find out.
Let’s continue to use relplot() instead of scatterplot() since it
offers more flexibility.
Instructions
- Use relplot() and the mpg DataFrame to create a scatter plot with "horsepower" on the x-axis and "mpg" on the y-axis. Vary the size of the points by the number of cylinders in the car ("cylinders").
- To make this plot easier to read, use hue to vary the color of the points by the number of cylinders in the car ("cylinders").
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
data=mpg, kind="scatter",
size="cylinders")
# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
data=mpg, kind="scatter",
size="cylinders", hue="cylinders")
2.4 Changing the style of scatter plot points
Let’s continue exploring Seaborn’s mpg dataset by looking at the
relationship between how fast a car can accelerate ("acceleration")
and its fuel efficiency ("mpg"). Do these properties vary by country
of origin ("origin")?
Note that the "acceleration" variable is the time to accelerate from 0
to 60 miles per hour, in seconds. Higher values indicate slower
acceleration.
Instructions
- Use relplot() and the mpg DataFrame to create a scatter plot with "acceleration" on the x-axis and "mpg" on the y-axis. Vary the style and color of the plot points by country of origin ("origin").
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create a scatter plot of acceleration vs. mpg
sns.relplot(x="acceleration", y="mpg",
data=mpg, kind="scatter",
style="origin", hue="origin")
2.5 Interpreting line plots
In this exercise, we’ll continue to explore Seaborn’s mpg dataset,
which contains one row per car model and includes information such as
the year the car was made, its fuel efficiency (measured in “miles per
gallon” or “M.P.G”), and its country of origin (USA, Europe, or Japan).
How has the average miles per gallon achieved by these cars changed over time? Let’s use line plots to find out!
Instructions
- Use relplot() and the mpg DataFrame to create a line plot with "model_year" on the x-axis and "mpg" on the y-axis.
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create line plot
sns.relplot(x="model_year", y="mpg",
data=mpg, kind="line")
2.6 Visualizing standard deviation with line plots
In the last exercise, we looked at how the average miles per gallon achieved by cars has changed over time. Now let’s use a line plot to visualize how the distribution of miles per gallon has changed over time.
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Change the plot so the shaded area shows the standard deviation instead of the confidence interval for the mean.
Answer
# Make the shaded area show the standard deviation
sns.relplot(x="model_year", y="mpg",
data=mpg, kind="line",
ci="sd")
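With ci="sd", the shaded band spans roughly one standard deviation on either side of the mean at each x value. The sketch below computes those two quantities per year with pandas on a small made-up table (the numbers are invented, not taken from the mpg dataset):

```python
import pandas as pd

# Hypothetical miniature of the mpg data
cars = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71, 71],
    "mpg": [18.0, 15.0, 21.0, 25.0, 19.0, 22.0],
})

# Mean and (sample) standard deviation of mpg within each model year
summary = cars.groupby("model_year")["mpg"].agg(["mean", "std"])

# Lower and upper edges of a mean +/- one standard deviation band
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]

print(summary)
```

Newer Seaborn releases document an errorbar parameter (for example errorbar="sd") as the replacement for ci, so the ci argument may raise a deprecation warning depending on the installed version.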
2.7 Plotting subgroups in line plots
Let’s continue to look at the mpg dataset. We’ve seen that the average
miles per gallon for cars has increased over time, but how has the
average horsepower for cars changed over time? And does this trend
differ by country of origin?
Instructions
- Use relplot() and the mpg DataFrame to create a line plot with "model_year" on the x-axis and "horsepower" on the y-axis. Turn off the confidence intervals on the plot.
- Create different lines for each country of origin ("origin") that vary in both line style and color.
- Add markers for each data point to the lines.
- Use the dashes parameter to use solid lines for all countries, while still allowing for different marker styles for each line.
Answer
# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
# Create line plot of model year vs. horsepower
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line",
ci=None)
# Change to create subgroups for country of origin
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line",
ci=None, style="origin",
hue="origin")
# Add markers and make each line have the same style
sns.relplot(x="model_year", y="horsepower",
data=mpg, kind="line",
ci=None, style="origin",
hue="origin", markers=True,
dashes=False)
3 Visualizing a Categorical and a Quantitative Variable
3.1 Count plots
In this exercise, we’ll return to exploring our dataset that contains the responses to a survey sent out to young people. We might suspect that young people spend a lot of time on the internet, but how much do they report using the internet each day? Let’s use a count plot to break down the number of survey responses in each category and then explore whether it changes based on age.
As a reminder, to create a count plot, we’ll use the catplot()
function and specify the name of the categorical variable to count
(x=____), the pandas DataFrame to use (data=____), and the type of
plot (kind="count").
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Use sns.catplot() to create a count plot using the survey_data DataFrame with "Internet usage" on the x-axis.
- Make the bars horizontal instead of vertical.
- Separate this plot into two side-by-side column subplots based on "Age Category", which separates respondents into those that are younger than 21 vs. 21 and older.
Answer
# added/edited
sns.set_context("paper", font_scale=0.8)
import numpy as np
survey_data = pd.read_csv("young-people-survey-responses.csv")
survey_data["Age Category"] = np.where(survey_data["Age"]<21, "Less than 21", "21+")
# Create count plot of internet usage
sns.catplot(x="Internet usage", data=survey_data,
kind="count")
# Change the orientation of the plot
sns.catplot(y="Internet usage", data=survey_data,
kind="count")
# Separate into column subplots based on age category
sns.catplot(y="Internet usage", data=survey_data,
kind="count", col="Age Category")
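The "Age Category" column used for the subplots is derived from "Age" in the setup lines of the answer with numpy.where, which picks one of two labels per row depending on a condition. A sketch of that step on made-up ages (the ages array is hypothetical):

```python
import numpy as np

# Hypothetical ages; 21 itself falls in the "21+" group
ages = np.array([15, 20, 21, 30])

# Label each age according to whether it is below 21
age_category = np.where(ages < 21, "Less than 21", "21+")

print(age_category)
```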
3.2 Bar plots with percentages
Let’s continue exploring the responses to a survey sent out to young
people. The variable "Interested in Math" is True if the person
reported being interested or very interested in mathematics, and False
otherwise. What percentage of young people report being interested in
math, and does this vary based on gender? Let’s use a bar plot to find
out.
As a reminder, we’ll create a bar plot using the catplot() function,
providing the name of the categorical variable to put on the x-axis
(x=____), the name of the quantitative variable to summarize on the
y-axis (y=____), the pandas DataFrame to use (data=____), and the
type of categorical plot (kind="bar").
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Use the survey_data DataFrame and sns.catplot() to create a bar plot with "Gender" on the x-axis and "Interested in Math" on the y-axis.
Answer
# Create a bar plot of interest in math, separated by gender
sns.catplot(x="Gender", y="Interested in Math",
data=survey_data, kind="bar")
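A bar plot shows the mean of the y variable in each category. Because "Interested in Math" holds True/False values, and True counts as 1 while False counts as 0, that mean is the proportion of respondents who are interested. A sketch with made-up responses (the rows below are hypothetical, not survey results):

```python
import pandas as pd

# Hypothetical responses
responses = pd.DataFrame({
    "Gender": ["female", "female", "female", "female", "male", "male"],
    "Interested in Math": [True, False, False, False, True, False],
})

# The mean of a boolean column is the share of True values in each group
proportions = responses.groupby("Gender")["Interested in Math"].mean()

print(proportions)
```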
3.3 Customizing bar plots
In this exercise, we’ll explore data from students in secondary school.
The "study_time" variable records each student’s reported weekly study
time as one of the following categories: "<2 hours", "2 to 5 hours",
"5 to 10 hours", or ">10 hours". Do students who report higher
amounts of studying tend to get better final grades? Let’s compare the
average final grade among students in each category using a bar plot.
Seaborn has been imported as sns and matplotlib.pyplot has been
imported as plt.
Instructions
- Use sns.catplot() to create a bar plot with "study_time" on the x-axis and final grade ("G3") on the y-axis, using the student_data DataFrame.
- Using the order parameter and the category_order list that is provided, rearrange the bars so that they are in order from lowest study time to highest.
- Update the plot so that it no longer displays confidence intervals.
Answer
# Create bar plot of average final grade in each study category
sns.catplot(x="study_time", y="G3",
data=student_data,
kind="bar")
# List of categories from lowest to highest
category_order = ["<2 hours",
"2 to 5 hours",
"5 to 10 hours",
">10 hours"]
# Rearrange the categories
sns.catplot(x="study_time", y="G3",
data=student_data,
kind="bar",
order=category_order)
# List of categories from lowest to highest
category_order = ["<2 hours",
"2 to 5 hours",
"5 to 10 hours",
">10 hours"]
# Turn off the confidence intervals
sns.catplot(x="study_time", y="G3",
data=student_data,
kind="bar",
order=category_order,
ci=None)
3.4 Create and interpret a box plot
Let’s continue using the student_data dataset. In an earlier exercise,
we explored the relationship between studying and final grade by using a
bar plot to compare the average final grade ("G3") among students in
different categories of "study_time".
In this exercise, we’ll try using a box plot to look at this relationship
instead. As a reminder, to create a box plot you’ll need to use the
catplot() function and specify the name of the categorical variable to
put on the x-axis (x=____), the name of the quantitative variable to
summarize on the y-axis (y=____), the pandas DataFrame to use
(data=____), and the type of plot (kind="box").
We have already imported matplotlib.pyplot as plt and seaborn as sns.
Instructions
- Use sns.catplot() and the student_data DataFrame to create a box plot with "study_time" on the x-axis and "G3" on the y-axis. Set the ordering of the categories to study_time_order.
Answer
# Specify the category ordering
study_time_order = ["<2 hours", "2 to 5 hours",
"5 to 10 hours", ">10 hours"]
# Create a box plot and set the order of the categories
sns.catplot(x="study_time", y="G3",
data=student_data,
kind="box",
order=study_time_order)
3.5 Omitting outliers
Now let’s use the student_data dataset to compare the distribution of
final grades ("G3") between students who have internet access at home
and those who don’t. To do this, we’ll use the "internet" variable,
which is a binary (yes/no) indicator of whether the student has internet
access at home.
Since internet may be less accessible in rural areas, we’ll add
subgroups based on where the student lives. For this, we can use the
"location" variable, which is an indicator of whether a student lives
in an urban (“Urban”) or rural (“Rural”) location.
Seaborn has already been imported as sns and matplotlib.pyplot has
been imported as plt. As a reminder, you can omit outliers in box
plots by setting the sym parameter equal to an empty string ("").
The answer below passes flierprops={"marker": ""} instead, which also
hides the outlier markers.
Instructions
- Use sns.catplot() to create a box plot with the student_data DataFrame, putting "internet" on the x-axis and "G3" on the y-axis.
- Add subgroups so each box plot is colored based on "location".
- Do not display the outliers.
Answer
# Create a box plot with subgroups and omit the outliers
sns.catplot(x="internet", y="G3",
data=student_data,
kind="box",
hue="location",
flierprops={"marker": ""}) # added/edited
3.6 Adjusting the whiskers
In the lesson we saw that there are multiple ways to define the whiskers
in a box plot. In this set of exercises, we’ll continue to use the
student_data dataset to compare the distribution of final grades
("G3") between students who are in a romantic relationship and those
that are not. We’ll use the "romantic" variable, which is a yes/no
indicator of whether the student is in a romantic relationship.
Let’s create a box plot to look at this relationship and try different ways to define the whiskers.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Adjust the code to make the box plot whiskers extend to 0.5 * IQR. Recall: the IQR is the interquartile range.
- Change the code to set the whiskers to extend to the 5th and 95th percentiles.
- Change the code to set the whiskers to extend to the min and max values.
Answer
# Set the whiskers to 0.5 * IQR
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=0.5)
# Extend the whiskers to the 5th and 95th percentile
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=[5, 95])
# Set the whiskers at the min and max values
sns.catplot(x="romantic", y="G3",
data=student_data,
kind="box",
whis=[0, 100])
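The whis argument controls where the whiskers end: a single number is a multiple of the IQR measured from the box edges, while a pair of numbers is read as lower and upper percentiles. The sketch below computes those limits with NumPy on invented grades, only to show what each setting refers to (the grades array is hypothetical):

```python
import numpy as np

# Hypothetical final grades, invented for illustration
grades = np.array([0, 5, 8, 10, 10, 11, 12, 13, 14, 15, 19])

# Box edges: first and third quartiles
q1, q3 = np.percentile(grades, [25, 75])
iqr = q3 - q1

# whis=0.5: whiskers may reach at most 0.5 * IQR beyond the box edges
lower_limit = q1 - 0.5 * iqr
upper_limit = q3 + 0.5 * iqr

# whis=[5, 95]: whiskers refer to the 5th and 95th percentiles
p5, p95 = np.percentile(grades, [5, 95])

# whis=[0, 100]: whiskers refer to the minimum and maximum
p0, p100 = np.percentile(grades, [0, 100])

print(lower_limit, upper_limit)
print(p5, p95)
print(p0, p100)
```

In the IQR case, the drawn whisker usually stops at the most extreme data point that still lies inside the computed limit, so it can be shorter than the limit itself.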
3.7 Customizing point plots
Let’s continue to look at data from students in secondary school, this
time using a point plot to answer the question: does the quality of the
student’s family relationship influence the number of absences the
student has in school? Here, we’ll use the "famrel" variable, which
describes the quality of a student’s family relationship from 1 (very
bad) to 5 (very good).
As a reminder, to create a point plot, use the catplot() function and
specify the name of the categorical variable to put on the x-axis
(x=____), the name of the quantitative variable to summarize on the
y-axis (y=____), the pandas DataFrame to use (data=____), and the
type of categorical plot (kind="point").
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Use sns.catplot() and the student_data DataFrame to create a point plot with "famrel" on the x-axis and number of absences ("absences") on the y-axis.
- Add “caps” to the end of the confidence intervals with size 0.2.
- Remove the lines joining the points in each category.
Answer
# Create a point plot of family relationship vs. absences
sns.catplot(x="famrel", y="absences",
data=student_data,
kind="point")
# Add caps to the confidence interval
sns.catplot(x="famrel", y="absences",
data=student_data,
kind="point",
capsize=0.2)
# Remove the lines joining the points
sns.catplot(x="famrel", y="absences",
data=student_data,
kind="point",
capsize=0.2,
join=False)
3.8 Point plots with subgroups
Let’s continue exploring the dataset of students in secondary school. This time, we’ll ask the question: is being in a romantic relationship associated with higher or lower school attendance? And does this association differ by which school the students attend? Let’s find out using a point plot.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Use sns.catplot() and the student_data DataFrame to create a point plot with relationship status ("romantic") on the x-axis and number of absences ("absences") on the y-axis. Color the points based on the school that they attend ("school").
- Turn off the confidence intervals for the plot.
- Since there may be outliers of students with many absences, use the median function that we’ve imported from numpy to display the median number of absences instead of the average.
Answer
# Create a point plot that uses color to create subgroups
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school")
# Turn off the confidence intervals for this plot
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school",
ci=None)
# Import median function from numpy
from numpy import median
# Plot the median number of absences instead of the mean
sns.catplot(x="romantic", y="absences",
data=student_data,
kind="point",
hue="school",
ci=None,
estimator=median)
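The median is less sensitive than the mean to a few students with very many absences, which is why it is used as the estimator here. A small illustration with invented numbers, using Python's statistics module rather than numpy (the absences list is hypothetical):

```python
from statistics import mean, median

# Hypothetical absence counts; the last student is an outlier
absences = [0, 2, 2, 3, 4, 4, 75]

# The single large value pulls the mean up but leaves the median alone
average_absences = mean(absences)
median_absences = median(absences)

print("mean:", average_absences)
print("median:", median_absences)
```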
4 Customizing Seaborn Plots
4.1 Changing style and palette
Let’s return to our dataset containing the results of a survey given to young people about their habits and preferences. We’ve provided the code to create a count plot of their responses to the question “How often do you listen to your parents’ advice?”. Now let’s change the style and palette to make this plot easier to interpret.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Set the style to "whitegrid" to help the audience determine the number of responses in each category.
- Set the color palette to the sequential palette named "Purples".
- Change the color palette to the diverging palette named "RdBu".
Answer
# added/edited
advice_mapping = {
1.0: "Never",
2.0: "Rarely",
3.0: "Sometimes",
4.0: "Often",
5.0: "Always",
np.nan: np.nan # Keeping NaN as it is
}
survey_data["Parents Advice"] = survey_data["Parents' advice"].map(advice_mapping)
# Set the style to "whitegrid"
sns.set_style("whitegrid")
# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes",
"Often", "Always"]
sns.catplot(x="Parents Advice",
data=survey_data,
kind="count",
order=category_order)
# Set the color palette to "Purples"
sns.set_style("whitegrid")
sns.set_palette("Purples")
# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes",
"Often", "Always"]
sns.catplot(x="Parents Advice",
data=survey_data,
kind="count",
order=category_order)
# Change the color palette to "RdBu"
sns.set_style("whitegrid")
sns.set_palette("RdBu")
# Create a count plot of survey responses
category_order = ["Never", "Rarely", "Sometimes",
"Often", "Always"]
sns.catplot(x="Parents Advice",
data=survey_data,
kind="count",
order=category_order)
4.2 Changing the scale
In this exercise, we’ll continue to look at the dataset containing responses from a survey of young people. Does the percentage of people reporting that they feel lonely vary depending on how many siblings they have? Let’s find out using a bar plot, while also exploring Seaborn’s four different plot scales (“contexts”).
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Set the scale (“context”) to "paper", which is the smallest of the scale options.
- Change the context to "notebook" to increase the scale.
- Change the context to "talk" to increase the scale.
- Change the context to "poster", which is the largest scale available.
Answer
# Set the context to "paper"
sns.set_context("paper")
# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
data=survey_data, kind="bar")
# Change the context to "notebook"
sns.set_context("notebook")
# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
data=survey_data, kind="bar")
# Change the context to "talk"
sns.set_context("talk")
# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
data=survey_data, kind="bar")
# Change the context to "poster"
sns.set_context("poster")
# Create bar plot
sns.catplot(x="Number of Siblings", y="Feels Lonely",
data=survey_data, kind="bar")
4.3 Using a custom palette
So far, we’ve looked at several things in the dataset of survey responses from young people, including their internet usage, how often they listen to their parents, and how many of them report feeling lonely. However, one thing we haven’t done is a basic summary of the type of people answering this survey, including their age and gender. Providing these basic summaries is always a good practice when dealing with an unfamiliar dataset.
The code provided will create a box plot showing the distribution of ages for male versus female respondents. Let’s adjust the code to customize the appearance, this time using a custom color palette.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Set the style to "darkgrid".
- Set a custom color palette with the hex color codes "#39A7D0" and "#36ADA4".
Answer
# Set the style to "darkgrid"
sns.set_style("darkgrid")
# Set a custom color palette
sns.set_palette(["#39A7D0", "#36ADA4"])
# Create the box plot of age distribution by gender
sns.catplot(x="Gender", y="Age",
data=survey_data, kind="box")
4.4 FacetGrids vs. AxesSubplots
In the recent lesson, we learned that Seaborn plot functions create two
different types of objects: FacetGrid objects and AxesSubplot
objects. The method for adding a title to your plot will differ
depending on the type of object it is.
In the code provided, we’ve used relplot() with the miles per gallon
dataset to create a scatter plot showing the relationship between a
car’s weight and its horsepower. This scatter plot is assigned to the
variable name g. Let’s identify which type of object it is.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Identify what type of object plot g is and assign it to the variable type_of_g.
Answer
# Create scatter plot
g = sns.relplot(x="weight",
y="horsepower",
data=mpg,
kind="scatter")
# Identify plot type
type_of_g = type(g)
# Print type
plt.close() # added/edited
print(type_of_g)
## <class 'seaborn.axisgrid.FacetGrid'>
4.5 Adding a title to a FacetGrid object
In the previous exercise, we used relplot() with the miles per gallon
dataset to create a scatter plot showing the relationship between a
car’s weight and its horsepower. This created a FacetGrid object. Now
that we know what type of object it is, let’s add a title to this plot.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Add the following title to this plot: "Car Weight vs. Horsepower".
Answer
# Create scatter plot
g = sns.relplot(x="weight",
y="horsepower",
data=mpg,
kind="scatter")
# Add a title "Car Weight vs. Horsepower"
g.fig.suptitle("Car Weight vs. Horsepower")
# Show plot
plt.show()
4.6 Adding a title and axis labels
Let’s continue to look at the miles per gallon dataset. This time we’ll create a line plot to answer the question: How does the average miles per gallon achieved by cars change over time for each of the three places of origin? To improve the readability of this plot, we’ll add a title and more informative axis labels.
In the code provided, we create the line plot using the lineplot() function. Note that lineplot() does not support the creation of subplots, so it returns an AxesSubplot object instead of a FacetGrid object.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Add the following title to the plot: "Average MPG Over Time".
- Label the x-axis as "Car Model Year" and the y-axis as "Average MPG".
Answer
# added/edited
mpg_mean = mpg.groupby(["model_year", "origin"]).agg(mpg_mean=("mpg", "mean")).reset_index()
# Create line plot
g = sns.lineplot(x="model_year", y="mpg_mean",
data=mpg_mean,
hue="origin")
# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")
# Show plot
plt.show()
# Create line plot
g = sns.lineplot(x="model_year", y="mpg_mean",
data=mpg_mean,
hue="origin")
# Add a title "Average MPG Over Time"
g.set_title("Average MPG Over Time")
# Add x-axis and y-axis labels
g.set(xlabel="Car Model Year",
ylabel="Average MPG")
# Show plot
plt.show()
4.7 Rotating x-tick labels
In this exercise, we’ll continue looking at the miles per gallon dataset. In the code provided, we create a point plot that displays the average acceleration for cars in each of the three places of origin. Note that the "acceleration" variable is the time to accelerate from 0 to 60 miles per hour, in seconds. Higher values indicate slower acceleration.
Let’s use this plot to practice rotating the x-tick labels. Recall that the function to rotate x-tick labels is a standalone Matplotlib function and not a function applied to the plot object itself.
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
Instructions
- Rotate the x-tick labels 90 degrees.
Answer
# Create point plot
sns.catplot(x="origin",
y="acceleration",
data=mpg,
kind="point",
join=False,
capsize=0.1)
# Rotate x-tick labels
plt.xticks(rotation=90)
## ([0, 1, 2], [Text(0, 0, 'usa'), Text(1, 0, 'japan'), Text(2, 0, 'europe')])
# Show plot
plt.show()
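The rotation step can be sketched as follows. `plt.xticks()` acts on the current Axes and returns the tick locations together with the label objects, which is where a printed tuple of positions and `Text` entries comes from. The `cars` DataFrame here is a made-up stand-in for `mpg`, and the Agg backend is chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the mpg dataset
cars = pd.DataFrame({
    "origin": ["usa", "usa", "japan", "japan", "europe", "europe"],
    "acceleration": [12.0, 11.5, 16.4, 14.5, 15.5, 17.5],
})

g = sns.catplot(x="origin", y="acceleration", data=cars, kind="point")

# Standalone Matplotlib function: rotates the labels of the current Axes,
# which here is the single facet that catplot() just drew
locs, tick_labels = plt.xticks(rotation=90)

rotations = [label.get_rotation() for label in tick_labels]
```

With several facets, `plt.xticks()` would only touch the most recently active Axes, so this approach is best suited to single-facet plots like this one.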
4.8 Box plot with subgroups
In this exercise, we’ll look at the dataset containing responses from a survey given to young people. One of the questions asked of the young people was: “Are you interested in having pets?” Let’s explore whether the distribution of ages of those answering “yes” tends to be higher or lower than those answering “no”, controlling for gender.
Instructions
- Set the color palette to "Blues".
- Add subgroups to color the box plots based on "Interested in Pets".
- Set the title of the FacetGrid object g to "Age of Those Interested in Pets vs. Not".
- Make the plot display using a Matplotlib function.
Answer
# Set palette to "Blues"
sns.set_palette("Blues")
# Adjust to add subgroups based on "Interested in Pets"
g = sns.catplot(x="Gender",
y="Age", data=survey_data,
kind="box", hue="Interested in Pets")
# Set title to "Age of Those Interested in Pets vs. Not"
g.fig.suptitle("Age of Those Interested in Pets vs. Not")
# Show plot
plt.show()
4.9 Bar plot with subgroups and subplots
In this exercise, we’ll return to our young people survey dataset and investigate whether the proportion of people who like techno music ("Likes Techno") varies by their gender ("Gender") or where they live ("Village - town"). This exercise will give us an opportunity to practice the many things we’ve learned throughout this course!
We’ve already imported Seaborn as sns and matplotlib.pyplot as plt.
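A bar plot can show a proportion because the default bar height in Seaborn is the mean of the y variable within each category, and the mean of a yes/no column coded as 1/0 is the share of "yes" answers. The sketch below illustrates this with a made-up `survey` table; coding "Likes Techno" as 1 and 0 is an assumption for the example, not a statement about how the course dataset stores it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for survey_data (1 = likes techno, 0 = does not)
survey = pd.DataFrame({
    "Village - town": ["city", "city", "city",
                       "village", "village", "village"],
    "Likes Techno": [1, 1, 0, 1, 0, 0],
})

# Share of respondents who like techno in each location
proportions = survey.groupby("Village - town")["Likes Techno"].mean()
print(proportions)

# The bars drawn here have these same group means as their heights
g = sns.catplot(x="Village - town", y="Likes Techno",
                data=survey, kind="bar")
```

In this toy table, two of the three city respondents and one of the three village respondents like techno, so the two means are 2/3 and 1/3.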
Instructions
- Set the figure style to "dark".
- Adjust the bar plot code to add subplots based on "Gender", arranged in columns.
- Add the title "Percentage of Young People Who Like Techno" to this FacetGrid plot.
- Label the x-axis "Location of Residence" and y-axis "% Who Like Techno".
Answer
# Set the figure style to "dark"
sns.set_style("dark")
# Adjust to add subplots per gender
g = sns.catplot(x="Village - town", y="Likes Techno",
data=survey_data, kind="bar",
col="Gender")
# Add title and axis labels
g.fig.suptitle("Percentage of Young People Who Like Techno", y=1.02)
g.set(xlabel="Location of Residence",
ylabel="% Who Like Techno")