Introduction to Data Visualization with Matplotlib
Instructor: Ariel Rokem - Senior Data Scientist
1 Introduction to Matplotlib
1.1 Using the matplotlib.pyplot interface
There are many ways to use Matplotlib. In this course, we will focus on the pyplot interface, which provides the most flexibility in creating and customizing data visualizations.
Initially, we will use the pyplot interface to create two kinds of objects: Figure objects and Axes objects.
Instructions
- Import the matplotlib.pyplot API, using the conventional name plt.
- Create Figure and Axes objects using the plt.subplots function.
- Show the results, an empty set of axes, using the plt.show function.
Answer
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
# Call the show function to show the result
plt.show()
1.2 Adding data to an Axes object
Adding data to a figure is done by calling methods of the Axes object. In this exercise, we will use the plot method to add data about rainfall in two American cities: Seattle, WA and Austin, TX.
The data are stored in two pandas DataFrame objects that are already loaded into memory: seattle_weather stores information about the weather in Seattle, and austin_weather stores information about the weather in Austin. Each of the DataFrames has a "MONTH" column that stores the three-letter name of the months. Each also has a column named "MLY-PRCP-NORMAL" that stores the average rainfall in each month during a ten-year period.
In this exercise, you will create a visualization that will allow you to compare the rainfall in these two cities.
Instructions
- Import the matplotlib.pyplot submodule as plt.
- Create a Figure and an Axes object by calling plt.subplots.
- Add data from the seattle_weather DataFrame by calling the Axes plot method.
- Add data from the austin_weather DataFrame in a similar manner and call plt.show to show the results.
Answer
# added/edited
import pandas as pd
austin_weather = pd.read_csv("austin_weather.csv")
austin_weather["MONTH"] = austin_weather["DATE"]
seattle_weather = pd.read_csv("seattle_weather.csv")
seattle_weather = seattle_weather[seattle_weather["STATION"] == "USW00094290"]
seattle_weather["MONTH"] = seattle_weather["DATE"]
# Import the matplotlib.pyplot submodule and name it plt
import matplotlib.pyplot as plt
# Create a Figure and an Axes with plt.subplots
fig, ax = plt.subplots()
# Plot MLY-PRCP-NORMAL from seattle_weather against MONTH
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
# Plot MLY-PRCP-NORMAL from austin_weather against MONTH
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])
# Call the show function
plt.show()
1.3 Customizing data appearance
We can customize the appearance of data in our plots, while adding the data to the plot, using key-word arguments to the plot command.
In this exercise, you will customize the appearance of the markers, the linestyle that is used, and the color of the lines and markers for your data.
As before, the data is already provided in pandas DataFrame objects loaded into memory: seattle_weather and austin_weather. These each have a "MONTH" column and a "MLY-PRCP-NORMAL" column that you will plot against each other.
In addition, a Figure object named fig and an Axes object named ax have already been created for you.
Instructions
- Call ax.plot to plot "MLY-PRCP-NORMAL" against "MONTH" in both DataFrames.
- Pass the color key-word argument to these commands to set the color of the Seattle data to blue ('b') and the Austin data to red ('r').
- Pass the marker key-word argument to these commands to set the Seattle data to circle markers ('o') and the Austin markers to triangles pointing downwards ('v').
- Pass the linestyle key-word argument to use dashed lines for the data from both cities ('--').
Answer
# added/edited
fig, ax = plt.subplots()
# Plot Seattle data, setting data appearance
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"],
color='b', marker='o', linestyle='--')
# Plot Austin data, setting data appearance
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"],
color='r', marker='v', linestyle='--')
# Call show to display the resulting plot
plt.show()
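The same appearance can also be requested with Matplotlib's format-string shorthand, passed as the third positional argument to plot. The following is a minimal sketch; the month names and rainfall values are made up for illustration and stand in for the course DataFrames.

```python
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the course data)
months = ["Jan", "Feb", "Mar", "Apr"]
rainfall = [5.0, 3.5, 3.7, 2.7]

fig, ax = plt.subplots()
# "bo--" is shorthand for color='b', marker='o', linestyle='--'
lines = ax.plot(months, rainfall, "bo--")
line = lines[0]  # plot returns a list of Line2D objects
```

The key-word form used in the answer is usually easier to read; the shorthand is mainly convenient for quick, exploratory plots. Call plt.show() to display the figure.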
1.4 Customizing axis labels and adding titles
Customizing the axis labels requires using the set_xlabel and set_ylabel methods of the Axes object. Adding a title uses the set_title method.
In this exercise, you will customize the content of the axis labels and add a title to a plot.
As before, the data is already provided in pandas DataFrame objects loaded into memory: seattle_weather and austin_weather. These each have a "MONTH" column and a "MLY-PRCP-NORMAL" column. These data are plotted against each other in the first two lines of the sample code provided.
In addition, a Figure object named fig and an Axes object named ax have already been created for you.
Instructions
- Use the set_xlabel method to add the label: "Time (months)".
- Use the set_ylabel method to add the label: "Precipitation (inches)".
- Use the set_title method to add the title: "Weather patterns in Austin and Seattle".
Answer
# added/edited
fig, ax = plt.subplots()
ax.plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
ax.plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])
# Customize the x-axis label
ax.set_xlabel("Time (months)")
# Customize the y-axis label
ax.set_ylabel("Precipitation (inches)")
# Add the title
ax.set_title("Weather patterns in Austin and Seattle")
# Display the figure
plt.show()
1.5 Creating small multiples with plt.subplots
Small multiples are used to plot several datasets side-by-side. In Matplotlib, small multiples can be created using the plt.subplots() function. The first argument is the number of rows in the array of Axes objects to generate and the second argument is the number of columns. In this exercise, you will use the Austin and Seattle data to practice creating and populating an array of subplots.
The data is given to you in DataFrames: seattle_weather and austin_weather. These each have a "MONTH" column and a "MLY-PRCP-NORMAL" column (for average precipitation), as well as a "MLY-TAVG-NORMAL" column (for average temperature). In this exercise, you will plot the monthly average precipitation and average temperature of each city, each in a separate subplot.
Instructions
- Create a Figure and an array of subplots with 2 rows and 2 columns.
- Addressing the top left Axes as index 0, 0, plot the Seattle precipitation.
- In the top right (index 0,1), plot Seattle temperatures.
- In the bottom left (1, 0) and bottom right (1, 1) plot Austin precipitations and temperatures.
Answer
# Create a Figure and an array of subplots with 2 rows and 2 columns
fig, ax = plt.subplots(2, 2)
# Addressing the top left Axes as index 0, 0, plot month and Seattle precipitation
ax[0, 0].plot(seattle_weather["MONTH"], seattle_weather["MLY-PRCP-NORMAL"])
# In the top right (index 0,1), plot month and Seattle temperatures
ax[0, 1].plot(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"])
# In the bottom left (1, 0) plot month and Austin precipitations
ax[1, 0].plot(austin_weather["MONTH"], austin_weather["MLY-PRCP-NORMAL"])
# In the bottom right (1, 1) plot month and Austin temperatures
ax[1, 1].plot(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"])
plt.show()
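How the returned Axes are indexed depends on the grid that was requested. The sketch below only inspects the shape of what plt.subplots returns; the variable names are illustrative and not part of the exercise.

```python
import matplotlib.pyplot as plt

# A 2x2 grid returns a two-dimensional NumPy array of Axes: ax[row, column]
fig, ax = plt.subplots(2, 2)
grid_shape = ax.shape

# A single column of two rows returns a one-dimensional array,
# so each Axes is addressed with one index: ax_col[0], ax_col[1]
fig_col, ax_col = plt.subplots(2, 1)
column_shape = ax_col.shape

# squeeze=False keeps the array two-dimensional whatever the grid size
fig_sq, ax_sq = plt.subplots(2, 1, squeeze=False)
unsqueezed_shape = ax_sq.shape
```

This is why the answer above uses two indices such as ax[0, 1], while a figure with a single row or column of subplots would use only one.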
2 Plotting time-series
2.1 Read data with a time index
pandas DataFrame objects can have an index that denotes time. Matplotlib recognizes such an index and uses it to label the axis.
This exercise involves reading data from climate_change.csv, containing CO2 levels and temperatures recorded on the 6th of each month from 1958 to 2016, using pandas’ read_csv function. The parse_dates and index_col arguments help set a DatetimeIndex.
Instructions
- Import the pandas library as pd.
- Read in the data from a CSV file called 'climate_change.csv' using pd.read_csv.
- Use the parse_dates key-word argument to parse the "date" column as dates.
- Use the index_col key-word argument to set the "date" column as the index.
Answer
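The solution code for this exercise is missing from these notes. A sketch that follows the instructions above might look like the following. Because climate_change.csv is not available here, the example reads a few invented rows from an in-memory string; the values are placeholders, not the course data.

```python
import io

import pandas as pd

# Stand-in for 'climate_change.csv'; these rows are invented for illustration
csv_text = """date,co2,relative_temp
1958-03-06,315.0,0.0
1958-04-06,316.0,0.1
1958-05-06,317.0,0.2
"""

# With the real file, pass 'climate_change.csv' instead of the StringIO object
climate_change = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],  # parse the "date" column as dates
    index_col="date",      # use the "date" column as the index
)
```

After this call, climate_change.index is a DatetimeIndex, which is what the following exercises rely on.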
2.2 Plot time-series data
To plot time-series data, we use the plot method of the Axes object. The first argument to this method holds the values for the x-axis and the second argument holds the values for the y-axis.
This exercise provides data stored in a DataFrame called climate_change. This variable has a time-index with the dates of measurements and two data columns: "co2" and "relative_temp".
In this case, the index of the DataFrame is used as the x-axis values and we will plot the values stored in the "relative_temp" column as the y-axis values. We will also properly label the x-axis and y-axis.
Instructions
- Add the data from climate_change to the plot: use the DataFrame index for the x value and the "relative_temp" column for the y values.
- Set the x-axis label to 'Time'.
- Set the y-axis label to 'Relative temperature (Celsius)'.
- Show the figure.
Answer
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# Add the time-series for "relative_temp" to the plot
ax.plot(climate_change.index, climate_change['relative_temp'])
# Set the x-axis label
ax.set_xlabel('Time')
# Set the y-axis label
ax.set_ylabel('Relative temperature (Celsius)')
# Show the figure
plt.show()
2.3 Using a time index to zoom in
When a time-series is represented with a time index, we can use this index for the x-axis when plotting. We can also select a range of dates to zoom in on a particular period within the time-series using pandas’ indexing facilities. In this exercise, you will select a portion of a time-series dataset and you will plot that period.
The data to use is stored in a DataFrame called climate_change, which has a time-index with dates of measurements and two data columns: "co2" and "relative_temp".
Instructions
- Use plt.subplots to create a Figure and one Axes, called fig and ax, respectively.
- Create a variable called seventies that includes all the data between "1970-01-01" and "1979-12-31".
- Add the data from seventies to the plot: use the DataFrame index for the x value and the "co2" column for the y values.
Answer
import matplotlib.pyplot as plt
# Use plt.subplots to create fig and ax
fig, ax = plt.subplots()
# Create variable seventies with data from "1970-01-01" to "1979-12-31"
seventies = climate_change["1970-01-01":"1979-12-31"]
# Add the time-series for "co2" data from seventies to the plot
ax.plot(seventies.index, seventies["co2"])
# Show the figure
plt.show()
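The date-range selection used above can also be written with .loc, which makes it explicit that rows are being selected by label. The sketch below uses a synthetic monthly series in place of climate_change (the values are made up, and dates fall on the first of each month rather than the 6th) to show that both endpoints are included and that partial date strings select whole periods.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for climate_change (values are made up)
dates = pd.date_range("1958-03-01", "2016-12-01", freq="MS")
climate_change = pd.DataFrame(
    {"co2": np.linspace(315.0, 405.0, len(dates))}, index=dates
)

# Label-based slicing on a DatetimeIndex includes both endpoints
seventies = climate_change.loc["1970-01-01":"1979-12-31"]

# A partial date string stands for the whole period, so this is the same decade
seventies_by_year = climate_change.loc["1970":"1979"]
```

Both selections contain 120 monthly rows, from January 1970 through December 1979.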
2.4 Plotting two variables
If you want to plot two time-series variables that were recorded at the same times, you can add both of them to the same subplot.
If the variables have very different scales, you’ll want to make sure that you plot them in different twin Axes objects. These objects can share one axis (for example, the time, or x-axis) while not sharing the other (the y-axis).
To create a twin Axes object that shares the x-axis, we use the twinx method.
In this exercise, you’ll have access to a DataFrame that has the climate_change data loaded into it. This DataFrame was loaded with the "date" column set as a DatetimeIndex, and it has a column called "co2" with carbon dioxide measurements and a column called "relative_temp" with temperature measurements.
Instructions
- Use plt.subplots to create Figure and Axes objects called fig and ax, respectively.
- Plot the carbon dioxide variable in blue using the Axes plot method.
- Use the Axes twinx method to create a twin Axes that shares the x-axis.
- Plot the relative temperature variable in red on the twin Axes using its plot method.
Answer
import matplotlib.pyplot as plt
# Initialize a Figure and Axes
fig, ax = plt.subplots()
# Plot the CO2 variable in blue
ax.plot(climate_change.index, climate_change["co2"], color='blue')
# Create a twin Axes that shares the x-axis
ax2 = ax.twinx()
# Plot the relative temperature in red
ax2.plot(climate_change.index, climate_change["relative_temp"], color='red')
plt.show()
2.5 Defining a function that plots time-series data
Once you realize that a particular section of code that you have written is useful, it is a good idea to define a function that saves that section of code for you, rather than copying it to other parts of your program where you would like to use this code.
Here, we will define a function that takes inputs such as a time variable and some other variable and plots them as x and y inputs. Then, it sets the labels on the x- and y-axis and sets the colors of the y-axis label, the y-axis ticks and the tick labels.
Instructions
- Define a function called plot_timeseries that takes as input an Axes object (axes), data (x, y), a string with the name of a color and strings for x- and y-axis labels.
- Plot y as a function of x in the color provided as the input color.
- Set the x- and y-axis labels using the provided input xlabel and ylabel, setting the y-axis label color using color.
- Set the y-axis tick parameters using the tick_params method of the Axes object, setting the colors key-word to color.
Answer
# Define a function called plot_timeseries
def plot_timeseries(axes, x, y, color, xlabel, ylabel):
    # Plot the inputs x,y in the provided color
    axes.plot(x, y, color=color)
    # Set the x-axis label
    axes.set_xlabel(xlabel)
    # Set the y-axis label, drawn in the same color as the data
    axes.set_ylabel(ylabel, color=color)
    # Set the color of the y-axis ticks and tick labels
    axes.tick_params('y', colors=color)
2.6 Using a plotting function
Defining functions allows us to reuse the same code without having to repeat all of it. Programmers sometimes say “Don’t repeat yourself”.
In the previous exercise, you defined a function called plot_timeseries:
plot_timeseries(axes, x, y, color, xlabel, ylabel)
that takes an Axes object (as the argument axes), time-series data (as x and y arguments), the name of a color (as a string, provided as the color argument) and x-axis and y-axis labels (as xlabel and ylabel arguments). In this exercise, the function plot_timeseries is already defined and provided to you.
Use this function to plot the climate_change time-series data, provided as a pandas DataFrame object that has a DatetimeIndex with the dates of the measurements and co2 and relative_temp columns.
Instructions
- In the provided ax object, use the function plot_timeseries to plot the "co2" column in blue, with the x-axis label "Time (years)" and y-axis label "CO2 levels".
- Use the ax.twinx method to add an Axes object to the figure that shares the x-axis with ax.
- Use the function plot_timeseries to add the data in the "relative_temp" column in red to the twin Axes object, with the x-axis label "Time (years)" and y-axis label "Relative temperature (Celsius)".
Answer
fig, ax = plt.subplots()
# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change["co2"], 'blue', "Time (years)", "CO2 levels")
# Create a twin Axes object that shares the x-axis
ax2 = ax.twinx()
# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], 'red', "Time (years)", "Relative temperature (Celsius)")
plt.show()
2.7 Annotating a plot of time-series data
Annotating a plot allows us to highlight interesting information in the plot. For example, in describing the climate change dataset, we might want to point to the date at which the relative temperature first exceeded 1 degree Celsius.
For this, we will use the annotate method of the Axes object. In this exercise, you will have the DataFrame called climate_change loaded into memory. Using the Axes methods, plot only the relative temperature column as a function of dates, and annotate the data.
Instructions
- Use the ax.plot method to plot the DataFrame index against the relative_temp column.
- Use the annotate method to add the text '>1 degree' in the location (pd.Timestamp('2015-10-06'), 1).
Answer
fig, ax = plt.subplots()
# Plot the relative temperature data
ax.plot(climate_change.index, climate_change['relative_temp'])
# Annotate the date at which temperatures exceeded 1 degree
ax.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1))
plt.show()
2.8 Plotting time-series: putting it all together
In this exercise, you will plot two time-series with different scales on the same Axes, and annotate the data from one of these series.
The CO2/temperatures data is provided as a DataFrame called climate_change. You should also use the function that we have defined before, called plot_timeseries, which takes an Axes object (as the axes argument), plots a time-series (provided as x and y arguments), sets the labels for the x-axis and y-axis, and sets the color for the data and for the y tick/axis labels:
plot_timeseries(axes, x, y, color, xlabel, ylabel)
Then, you will annotate with text an important time-point in the data: on 2015-10-06, when the temperature first rose to above 1 degree over the average.
Instructions
- Use the plot_timeseries function to plot CO2 levels against time. Set xlabel to "Time (years)", ylabel to "CO2 levels" and color to 'blue'.
- Create ax2, as a twin of the first Axes.
- In ax2, plot temperature against time, setting ylabel to "Relative temp (Celsius)" and color to 'red'.
- Annotate the data using the ax2.annotate method. Place the text ">1 degree" at x=pd.Timestamp('2008-10-06'), y=-0.2, pointing with a thin gray arrow to x=pd.Timestamp('2015-10-06'), y=1.
Answer
fig, ax = plt.subplots()
# Plot the CO2 levels time-series in blue
plot_timeseries(ax, climate_change.index, climate_change["co2"], 'blue', "Time (years)", "CO2 levels")
# Create an Axes object that shares the x-axis
ax2 = ax.twinx()
# Plot the relative temperature data in red
plot_timeseries(ax2, climate_change.index, climate_change['relative_temp'], 'red', "Time (years)", "Relative temp (Celsius)")
# Annotate the point with relative temperature >1 degree
ax2.annotate(">1 degree", xy=(pd.Timestamp('2015-10-06'), 1), xytext=(pd.Timestamp('2008-10-06'), -0.2), arrowprops={'arrowstyle':'->', 'color':'gray'})
plt.show()
3 Quantitative comparisons and statistical visualizations
3.1 Bar chart
Bar charts visualize data that is organized according to categories as a series of bars, where the height of each bar represents the values of the data in this category.
For example, in this exercise, you will visualize the number of gold medals won by each country in the provided medals DataFrame. The DataFrame contains the countries as the index, and a column called "Gold" that contains the number of gold medals won by each country, according to their rows.
Instructions
- Call the ax.bar method to plot the "Gold" column as a function of the country.
- Use the ax.set_xticklabels method to set the x-axis tick labels to be the country names.
- In the call to ax.set_xticklabels, rotate the x-axis tick labels by 90 degrees by using the rotation key-word argument.
- Set the y-axis label to "Number of medals".
Answer
fig, ax = plt.subplots()
# Plot a bar-chart of gold medals as a function of country
ax.bar(medals.index, medals["Gold"])
# Set the x-axis tick labels to the country names
ax.set_xticklabels(medals.index, rotation=90)
# Set the y-axis label
ax.set_ylabel("Number of medals")
plt.show()
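Because ax.bar already labels each bar with its category, the labels only need to be rotated rather than set again. The sketch below shows one alternative, using tick_params with labelrotation; the medals DataFrame here is hypothetical and stands in for the course data. Calling set_xticklabels without fixing the tick positions first can also emit a warning in some Matplotlib versions, which this form avoids.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical medal counts, for illustration only
medals = pd.DataFrame(
    {"Gold": [10, 7, 4]}, index=["Country A", "Country B", "Country C"]
)

fig, ax = plt.subplots()
ax.bar(medals.index, medals["Gold"])
# Rotate the existing category labels without replacing them
ax.tick_params(axis="x", labelrotation=90)
ax.set_ylabel("Number of medals")

fig.canvas.draw()  # render once so the tick labels are populated
tick_text = [label.get_text() for label in ax.get_xticklabels()]
```

Call plt.show() to display the figure.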
3.2 Stacked bar chart
A stacked bar chart contains bars, where the height of each bar represents values. In addition, stacked on top of the first variable may be another variable. The additional height of this bar represents the value of this variable. And you can add more bars on top of that.
In this exercise, you will have access to a DataFrame called medals that contains an index that holds the names of different countries, and three columns: "Gold", "Silver" and "Bronze". You will also have a Figure, fig, and Axes, ax, that you can add data to.
You will create a stacked bar chart that shows the number of gold, silver, and bronze medals won by each country, and you will add labels and create a legend that indicates which bars represent which medals.
Instructions
- Call the ax.bar method to add the "Gold" medals. Call it with the label set to "Gold".
- Call the ax.bar method to stack "Silver" bars on top of that, using the bottom key-word argument so the bottom of the bars will be on top of the gold medal bars, and label to add the label "Silver".
- Use ax.bar to add "Bronze" bars on top of that, using the bottom key-word and label it as "Bronze".
Answer
# added/edited
fig, ax = plt.subplots()
# Add bars for "Gold" with the label "Gold"
ax.bar(medals.index, medals["Gold"], label="Gold")
# Stack bars for "Silver" on top with label "Silver"
ax.bar(medals.index, medals["Silver"], bottom=medals["Gold"], label="Silver")
# Stack bars for "Bronze" on top of that with label "Bronze"
ax.bar(medals.index, medals["Bronze"], bottom=medals["Gold"] + medals["Silver"], label="Bronze")
# Display the legend
ax.legend()
plt.show()
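When there are more than a few layers, writing out the bottom argument by hand becomes error-prone. One way to generalize the pattern above is to keep a running total of the stack height in a loop. This is a sketch with a hypothetical medals DataFrame, not the course data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical medal counts, for illustration only
medals = pd.DataFrame(
    {"Gold": [10, 7, 4], "Silver": [8, 6, 5], "Bronze": [6, 9, 3]},
    index=["Country A", "Country B", "Country C"],
)

fig, ax = plt.subplots()
bottom = pd.Series(0, index=medals.index)  # running height of each stack
for column in ["Gold", "Silver", "Bronze"]:
    ax.bar(medals.index, medals[column], bottom=bottom, label=column)
    bottom = bottom + medals[column]  # the next layer starts where this one ends
ax.legend()
```

After the loop, bottom holds the total number of medals per country, which is also the height of each finished stack.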
3.3 Creating histograms
Histograms show the full distribution of a variable. In this exercise, we will display the distribution of weights of medalists in gymnastics and in rowing in the 2016 Olympic games for a comparison between them.
You will have two DataFrames to use. The first is called mens_rowing and includes information about the medalists in the men’s rowing events. The other is called mens_gymnastics and includes information about medalists in all of the gymnastics events.
Instructions
- Use the ax.hist method to add a histogram of the "Weight" column from the mens_rowing DataFrame.
- Use ax.hist to add a histogram of "Weight" for the mens_gymnastics DataFrame.
- Set the x-axis label to "Weight (kg)" and the y-axis label to "# of observations".
Answer
# added/edited
summer_2016_medals = pd.read_csv("summer2016.csv")
mens_rowing = summer_2016_medals[(summer_2016_medals['Sport'] == 'Rowing') & (summer_2016_medals['Sex'] == 'M')]
mens_gymnastics = summer_2016_medals[(summer_2016_medals['Sport'] == 'Gymnastics') & (summer_2016_medals['Sex'] == 'M')]
fig, ax = plt.subplots()
# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"])
# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"])
# Set the x-axis label to "Weight (kg)"
ax.set_xlabel("Weight (kg)")
# Set the y-axis label to "# of observations"
ax.set_ylabel("# of observations")
plt.show()
3.4 “Step” histogram
Histograms allow us to see the distributions of the data in different groups in our data. In this exercise, you will compare the weight of medalist athletes in two different sports, using data selected from the Summer 2016 Olympic Games medalist dataset.
As in the previous exercise, the data is provided in the mens_rowing and mens_gymnastics DataFrames, each of which has a "Weight" column.
You will visualize and label the histograms of the two sports, "Gymnastics" and "Rowing", and see the marked difference between medalists in these two sports.
Instructions
- Use the hist method to display a histogram of the "Weight" column from the mens_rowing DataFrame, and label this as "Rowing".
- Use hist to display a histogram of the "Weight" column from the mens_gymnastics DataFrame, and label this as "Gymnastics".
- For both histograms, use the histtype argument to visualize the data using the 'step' type and set the number of bins to use to 5.
- Add a legend to the figure, before it is displayed.
Answer
fig, ax = plt.subplots()
# Plot a histogram of "Weight" for mens_rowing
ax.hist(mens_rowing["Weight"], histtype='step', label="Rowing", bins=5)
# Compare to histogram of "Weight" for mens_gymnastics
ax.hist(mens_gymnastics["Weight"], histtype='step', label="Gymnastics", bins=5)
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
# Add the legend and show the Figure
ax.legend()
plt.show()
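With bins=5, each call to hist computes its own five bins from the data it is given, so the two histograms generally end up with different bin widths. If the counts should be directly comparable, one option is to compute a common set of bin edges first and pass it to both calls. The sketch below assumes NumPy is available and uses made-up weights rather than the course data.

```python
import matplotlib.pyplot as plt
import numpy as np

# Made-up weights in kg, for illustration only
rowing = np.array([72, 85, 90, 94, 98, 101, 88, 92])
gymnastics = np.array([52, 58, 61, 63, 66, 70, 55, 60])

# Compute 5 equal-width bins spanning both groups
edges = np.histogram_bin_edges(np.concatenate([rowing, gymnastics]), bins=5)

fig, ax = plt.subplots()
# hist returns (counts, bin edges, patches); keep the edges to compare them
_, rowing_edges, _ = ax.hist(rowing, bins=edges, histtype="step", label="Rowing")
_, gym_edges, _ = ax.hist(gymnastics, bins=edges, histtype="step", label="Gymnastics")
ax.set_xlabel("Weight (kg)")
ax.set_ylabel("# of observations")
ax.legend()
```

Whether shared or separate bins are preferable depends on the comparison being made; shared bins make counts comparable bin by bin, while separate bins show the shape of each distribution in more detail.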
3.5 Adding error-bars to a bar chart
Statistical plotting techniques add quantitative information for comparisons into the visualization. For example, in this exercise, we will add error bars that quantify not only the difference in the means of the height of medalists in the 2016 Olympic Games, but also the standard deviation of each of these groups, as a way to assess whether the difference is substantial relative to the variability within each group.
For the purpose of this exercise, you will have two DataFrames: mens_rowing holds data about the medalists in the rowing events and mens_gymnastics holds information about the medalists in the gymnastics events.
Instructions
- Add a bar with size equal to the mean of the "Height" column in the mens_rowing DataFrame and an error-bar of its standard deviation.
- Add another bar for the mean of the "Height" column in mens_gymnastics with an error-bar of its standard deviation.
- Add a label to the y-axis: "Height (cm)".
Answer
fig, ax = plt.subplots()
# Add a bar for the rowing "Height" column mean/std
ax.bar("Rowing", mens_rowing["Height"].mean(), yerr=mens_rowing["Height"].std())
# Add a bar for the gymnastics "Height" column mean/std
ax.bar("Gymnastics", mens_gymnastics["Height"].mean(), yerr=mens_gymnastics["Height"].std())
# Label the y-axis
ax.set_ylabel("Height (cm)")
plt.show()
3.6 Adding error-bars to a plot
Adding error-bars to a plot is done by using the errorbar method of the Axes object.
Here, you have two DataFrames loaded: seattle_weather has data about the weather in Seattle and austin_weather has data about the weather in Austin. Each DataFrame has a column "MONTH" that has the names of the months, a column "MLY-TAVG-NORMAL" that has the average temperature in each month and a column "MLY-TAVG-STDDEV" that has the standard deviation of the temperatures across years.
In the exercise, you will plot the mean temperature across months and add the standard deviation at each point as y errorbars.
Instructions
- Use the ax.errorbar method to add the Seattle data: the "MONTH" column as x values, the "MLY-TAVG-NORMAL" as y values and "MLY-TAVG-STDDEV" as yerr values.
- Add the Austin data: the "MONTH" column as x values, the "MLY-TAVG-NORMAL" as y values and "MLY-TAVG-STDDEV" as yerr values.
- Set the y-axis label as "Temperature (Fahrenheit)".
Answer
fig, ax = plt.subplots()
# Add the Seattle temperature data in each month with standard deviation error bars
ax.errorbar(seattle_weather["MONTH"], seattle_weather["MLY-TAVG-NORMAL"], yerr=seattle_weather["MLY-TAVG-STDDEV"])
# Add the Austin temperature data in each month with standard deviation error bars
ax.errorbar(austin_weather["MONTH"], austin_weather["MLY-TAVG-NORMAL"], yerr=austin_weather["MLY-TAVG-STDDEV"])
# Set the y-axis label
ax.set_ylabel("Temperature (Fahrenheit)")
plt.show()
3.7 Creating boxplots
Boxplots provide additional information about the distribution of the data that they represent. They tell us what the median of the distribution is, what the inter-quartile range is and also what the expected range of approximately 99% of the data should be. Outliers beyond this range are particularly highlighted.
In this exercise, you will use the data about medalist heights that you previously visualized as histograms, and as bar charts with error bars, and you will visualize it as boxplots.
Again, you will have the mens_rowing and mens_gymnastics DataFrames available to you, and both of these DataFrames have columns called "Height" that you will compare.
Instructions
- Create a boxplot that contains the "Height" column for mens_rowing on the left and mens_gymnastics on the right.
- Add x-axis tick labels: "Rowing" and "Gymnastics".
- Add a y-axis label: "Height (cm)".
Answer
fig, ax = plt.subplots()
# Add a boxplot for the "Height" column in the DataFrames
ax.boxplot([mens_rowing["Height"], mens_gymnastics["Height"]])
# Add x-axis tick labels:
ax.set_xticklabels(["Rowing", "Gymnastics"])
# Add a y-axis label
ax.set_ylabel("Height (cm)")
plt.show()
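To make the parts of a boxplot concrete, the quantities it draws can be computed by hand. The sketch below assumes Matplotlib's default whisker setting (whis=1.5), under which whiskers reach the most extreme observations within 1.5 times the inter-quartile range of the box, and uses made-up heights rather than the course data.

```python
import numpy as np

# Made-up heights in cm, for illustration only; 215 is a deliberate outlier
heights = np.array([170, 172, 175, 178, 180, 182, 185, 188, 190, 215])

q1, median, q3 = np.percentile(heights, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range: the height of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low = heights[heights >= lower_fence].min()
whisker_high = heights[heights <= upper_fence].max()
# Anything outside the fences is drawn as an individual outlier point
outliers = heights[(heights < lower_fence) | (heights > upper_fence)]
```

For roughly normally distributed data, the interval between the two fences covers about 99% of the observations, which is where the figure quoted above comes from.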
3.8 Simple scatter plot
Scatter plots are a bi-variate visualization technique. They plot each record in the data as a point. The location of each point is determined by the value of two variables: the first variable determines the distance along the x-axis and the second variable determines the height along the y-axis.
In this exercise, you will create a scatter plot of the climate_change data. This DataFrame, which is already loaded, has a column "co2" that indicates the measurements of carbon dioxide every month and another column, "relative_temp", that indicates the temperature measured at the same time.
Instructions
- Using the ax.scatter method, add the data to the plot: "co2" on the x-axis and "relative_temp" on the y-axis.
- Set the x-axis label to "CO2 (ppm)".
- Set the y-axis label to "Relative temperature (C)".
Answer
fig, ax = plt.subplots()
# Add data: "co2" on x-axis, "relative_temp" on y-axis
ax.scatter(climate_change["co2"], climate_change["relative_temp"])
# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")
# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")
plt.show()
3.9 Encoding time by color
The screen only has two dimensions, but we can encode another dimension in the scatter plot using color. Here, we will visualize the climate_change dataset, plotting a scatter plot of the "co2" column, on the x-axis, against the "relative_temp" column, on the y-axis. We will encode time using the color dimension, with earlier times appearing as darker shades of blue and later times appearing as brighter shades of yellow.
Instructions
- Using the ax.scatter method, add a scatter plot of the "co2" column (x-axis) against the "relative_temp" column.
- Use the c key-word argument to pass in the index of the DataFrame as input to color each point according to its date.
- Set the x-axis label to "CO2 (ppm)" and the y-axis label to "Relative temperature (C)".
Answer
fig, ax = plt.subplots()
# Add data: "co2", "relative_temp" as x-y, index as color
ax.scatter(climate_change["co2"], climate_change["relative_temp"], c=climate_change.index)
# Set the x-axis label to "CO2 (ppm)"
ax.set_xlabel("CO2 (ppm)")
# Set the y-axis label to "Relative temperature (C)"
ax.set_ylabel("Relative temperature (C)")
plt.show()
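A color encoding is easier to read when the figure also shows which color corresponds to which time. One possible extension, not part of the exercise, is to convert the dates to a fractional year and add a colorbar. The sketch below uses a synthetic stand-in for climate_change with made-up values.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for climate_change (values are made up)
dates = pd.date_range("1958-03-01", periods=120, freq="MS")
climate_change = pd.DataFrame(
    {
        "co2": np.linspace(315.0, 330.0, 120),
        "relative_temp": np.linspace(-0.1, 0.3, 120),
    },
    index=dates,
)

# Encode time as a plain number (fractional year) so the colorbar is readable
year = np.asarray(climate_change.index.year + (climate_change.index.month - 1) / 12)

fig, ax = plt.subplots()
points = ax.scatter(climate_change["co2"], climate_change["relative_temp"], c=year)
colorbar = fig.colorbar(points, ax=ax)  # adds a second Axes holding the color scale
colorbar.set_label("Year")
ax.set_xlabel("CO2 (ppm)")
ax.set_ylabel("Relative temperature (C)")
```

Call plt.show() to display the figure.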