Today, we will use a scatter plot to show correlation within a dataset. Consider the following scenario: you are given a dataset containing information about various animals, and you want to visualize the correlation between two of the animal attributes, maximum longevity in years and body mass in grams.

Note: Axes.set_xscale('log') and Axes.set_yscale('log') change the x-axis and the y-axis, respectively, to a logarithmic scale.
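As a quick illustration of the note above, here is a minimal sketch of the two calls on made-up data. The arrays x and y are assumptions chosen only because they span several orders of magnitude; they are not taken from the animal dataset.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic values spanning several orders of magnitude (not the animal data)
x = np.logspace(0, 6, 50)   # 1 to 1,000,000
y = 2.0 * x ** 0.25         # an arbitrary power law, for illustration only

fig, ax = plt.subplots()
ax.scatter(x, y)

# Switch both axes from the default linear scale to a logarithmic scale
ax.set_xscale('log')
ax.set_yscale('log')
```

On linear axes most of these points would be squeezed against the origin; on log-log axes a power law like this one appears as a straight line. Call plt.show() afterwards to display the figure.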

1 Load library

Import the necessary libraries and modules:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

2 Use Matplotlib Library

2.1 Import dataset

Use pandas to read the dataset, which is stored in the data folder (adjust the path below to match your own machine):

data = pd.read_csv('C:/Users/luong/Desktop/data/anage_data.csv')
print(data)

##       Unnamed: 0  HAGRID  ... Body mass (g) Temperature (K)
## 0              0       3  ...           NaN             NaN
## 1              1       5  ...           NaN             NaN
## 2              2       6  ...           NaN             NaN
## 3              3       8  ...           NaN             NaN
## 4              4       9  ...           NaN             NaN
## ...          ...     ...  ...           ...             ...
## 4213        4214    4239  ...           NaN             NaN
## 4214        4215    4241  ...           NaN             NaN
## 4215        4216    4242  ...           NaN             NaN
## 4216        4217    4243  ...           NaN             NaN
## 4217        4218    4244  ...           NaN             NaN
## 
## [4218 rows x 30 columns]

2.2 Manipulate data

The given dataset is not complete. Filter the data so that you keep only the samples that contain both a body mass and a maximum longevity. The np.isfinite() function checks whether each value is finite, so it returns False for missing (NaN) entries. Then split the filtered data according to the animal class:

# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) & np.isfinite(data[mass])]
# Split according to class
amphibia = data[data['Class'] == 'Amphibia']
aves = data[data['Class'] == 'Aves']
mammalia = data[data['Class'] == 'Mammalia']
reptilia = data[data['Class'] == 'Reptilia']
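The filtering step can be tried in isolation on a small made-up table. The sketch below assumes the same two column names as the dataset, but the rows are invented for illustration. It also shows two pandas alternatives: dropna() for the filter and groupby() for the per-class split.

```python
import numpy as np
import pandas as pd

longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'

# Invented rows; only the column names follow the real dataset
toy = pd.DataFrame({
    'Class': ['Aves', 'Mammalia', 'Aves', 'Reptilia'],
    longevity: [12.0, np.nan, 30.0, 50.0],
    mass: [20.0, 5000.0, np.nan, 900.0],
})

# Keep only rows where both values are finite (drops NaN and +/-inf)
mask = np.isfinite(toy[longevity]) & np.isfinite(toy[mass])
filtered = toy[mask]

# Alternative filter: drop rows with a missing value in either column
dropped = toy.dropna(subset=[longevity, mass])

# Alternative split: one sub-DataFrame per class, keyed by class name
by_class = {name: group for name, group in filtered.groupby('Class')}
```

The two filters agree here because the toy data contains no infinite values. They would differ if it did, since dropna() keeps infinities while np.isfinite() rejects them.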

2.3 Visualize using Matplotlib

Create a scatter plot visualizing the correlation between the body mass and the maximum longevity. Use different colors to group data samples according to their class. Add a legend, labels, and a title. Use a log scale for both the x-axis and y-axis:

# Create figure
plt.figure(figsize=(10, 6), dpi=300)
# Create scatter plot
plt.scatter(amphibia[mass], amphibia[longevity], label='Amphibia')
plt.scatter(aves[mass], aves[longevity], label='Aves')
plt.scatter(mammalia[mass], mammalia[longevity], label='Mammalia')
plt.scatter(reptilia[mass], reptilia[longevity], label='Reptilia')
# Add legend
plt.legend()
# Log scale
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
# Add labels and a title (the title text is illustrative)
plt.xlabel('Body mass in grams')
plt.ylabel('Maximum longevity in years')
plt.title('Maximum longevity versus body mass')
# Show plot
plt.show()
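A scatter plot shows the correlation visually. To put a number on it, one option is to compute a Pearson correlation coefficient on the log-transformed values, which matches the log-log axes of the plot. This is a minimal sketch on invented values, not a result from the animal dataset; mass_g and longevity_yrs are hypothetical names.

```python
import numpy as np

# Invented (body mass, longevity) pairs, for illustration only
mass_g = np.array([10.0, 100.0, 1_000.0, 10_000.0, 100_000.0])
longevity_yrs = np.array([3.0, 6.0, 11.0, 19.0, 40.0])

# Correlate the log10 values so the number matches what the log-log plot shows
r = np.corrcoef(np.log10(mass_g), np.log10(longevity_yrs))[0, 1]
print(f"Pearson r on the log-log scale: {r:.2f}")
```

With the real data, the same call could be applied to data[mass] and data[longevity] after the filtering step, since both columns are then finite. Both must also be strictly positive for the logarithm to be defined.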

3 Use Seaborn Library

Working with DataFrames using Matplotlib adds some inconvenient overhead. For example, simply exploring your dataset can take a lot of time, because you need additional data wrangling before you can plot the data from a DataFrame with Matplotlib.

Seaborn, however, is built to operate on DataFrames and full dataset arrays, which makes this process simpler. It internally performs the necessary semantic mappings and statistical aggregation to produce informative plots. The following example loads a second dataset and uses pairplot() to draw the pairwise relationships between its columns, colored by the Education column:

data = pd.read_csv("C:/Users/luong/Desktop/data/age_salary_hours.csv")
sns.set(style="ticks", color_codes=True)
g = sns.pairplot(data, hue='Education')
plt.show()
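For comparison with the Matplotlib version of the animal plot, here is a minimal sketch of the same kind of scatter plot in Seaborn. The DataFrame toy is invented and only mimics the column names of the animal dataset. The point is that hue='Class' replaces the manual per-class split.

```python
import pandas as pd
import seaborn as sns

# Invented rows that mimic the animal dataset's column names; not real measurements
toy = pd.DataFrame({
    'Class': ['Amphibia', 'Aves', 'Aves', 'Mammalia', 'Mammalia', 'Reptilia'],
    'Body mass (g)': [30.0, 20.0, 1200.0, 25.0, 70000.0, 900.0],
    'Maximum longevity (yrs)': [10.0, 12.0, 30.0, 4.0, 60.0, 50.0],
})

# One call colors the points by class and builds the legend
ax = sns.scatterplot(data=toy, x='Body mass (g)',
                     y='Maximum longevity (yrs)', hue='Class')

# scatterplot() returns a Matplotlib Axes, so the log scales are set as before
ax.set_xscale('log')
ax.set_yscale('log')
```

The axis labels are taken from the column names automatically. Call plt.show() from matplotlib.pyplot to display the figure.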

A work by Luong Nguyen - 18 June 2020

ph.ntluong95@gmail.com

Visualization by Python using Rmarkdown