Today, we will use a scatter plot to show correlation within a dataset. Consider the following scenario: you are given a dataset containing information about various animals, and you want to visualize the correlation between two of their attributes, maximum longevity in years and body mass in grams.
Note: Axes.set_xscale('log') and Axes.set_yscale('log') change the scale of the x-axis and y-axis, respectively, to a logarithmic scale.
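As a minimal sketch of what these two calls do, the following plots a few made-up points spanning several orders of magnitude and switches both axes to a log scale. The values are purely illustrative and are not taken from the animal dataset.

```python
import matplotlib.pyplot as plt

# Made-up values spanning several orders of magnitude (not from the dataset)
x = [1, 10, 100, 1_000, 10_000]
y = [2, 5, 12, 30, 75]

fig, ax = plt.subplots()
ax.scatter(x, y)

# Switch both axes from the default linear scale to a logarithmic one
ax.set_xscale('log')
ax.set_yscale('log')
```

On a log scale, each step along an axis is a constant ratio rather than a constant difference, which keeps very small and very large values readable in the same plot.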
Import the necessary libraries and modules
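The code in this walkthrough refers to np, pd, and plt, so the imports presumably look like the following. The aliases are the conventional ones and are assumed here, not taken from the original.

```python
# Conventional aliases, assumed from the names used later in the code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```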
Use pandas to read the data located in the subfolder data.
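A possible sketch of the loading step is shown here. The data subfolder comes from the text, but the file name animals.csv and the load_dataset helper are hypothetical placeholders, since the original does not state them.

```python
from pathlib import Path

import pandas as pd

# The data subfolder comes from the text; the file name is a hypothetical
# placeholder, since the original does not state it.
DATA_PATH = Path("data") / "animals.csv"


def load_dataset(path):
    """Read a CSV file into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)


# Only load and print when the file is actually present
if DATA_PATH.exists():
    data = load_dataset(DATA_PATH)
    print(data)
```

Printing the DataFrame of the real dataset gives a summary like the one below.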
## Unnamed: 0 HAGRID ... Body mass (g) Temperature (K)
## 0 0 3 ... NaN NaN
## 1 1 5 ... NaN NaN
## 2 2 6 ... NaN NaN
## 3 3 8 ... NaN NaN
## 4 4 9 ... NaN NaN
## ... ... ... ... ... ...
## 4213 4214 4239 ... NaN NaN
## 4214 4215 4241 ... NaN NaN
## 4215 4216 4242 ... NaN NaN
## 4216 4217 4243 ... NaN NaN
## 4217 4218 4244 ... NaN NaN
##
## [4218 rows x 30 columns]
The given dataset is incomplete: many rows lack a body mass or a maximum longevity. Filter the data so that only samples with both values remain, then split the samples by animal class. The NumPy isfinite() function checks whether each element is finite, so it returns False for missing (NaN) values.
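To see how the finiteness check works as a row filter, here is a small sketch on a toy DataFrame. The values are made up for illustration; only the two column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing measurement
toy = pd.DataFrame({
    'Maximum longevity (yrs)': [12.0, np.nan, 30.5, 4.0],
    'Body mass (g)': [250.0, 18.0, np.nan, 1200.0],
})

# np.isfinite returns False for NaN and for +/- infinity
has_longevity = np.isfinite(toy['Maximum longevity (yrs)'])
has_mass = np.isfinite(toy['Body mass (g)'])

# Keep only the rows where both measurements are present
complete = toy[has_longevity & has_mass]
print(complete)
```

Only the first and last rows survive, because each of the other two rows is missing one of the measurements. The preprocessing code that follows applies the same kind of mask to the full dataset.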
# Preprocessing
longevity = 'Maximum longevity (yrs)'
mass = 'Body mass (g)'
data = data[np.isfinite(data[longevity]) & np.isfinite(data[mass])]
# Split according to class
amphibia = data[data['Class'] == 'Amphibia']
aves = data[data['Class'] == 'Aves']
mammalia = data[data['Class'] == 'Mammalia']
reptilia = data[data['Class'] == 'Reptilia']
Create a scatter plot visualizing the correlation between the body mass and the maximum longevity. Use different colors to group data samples according to their class. Add a legend, labels, and a title. Use a log scale for both the x-axis and y-axis:
# Create figure
plt.figure(figsize=(10, 6), dpi=300)
# Create scatter plot
plt.scatter(amphibia[mass], amphibia[longevity], label='Amphibia')
plt.scatter(aves[mass], aves[longevity], label='Aves')
plt.scatter(mammalia[mass], mammalia[longevity], label='Mammalia')
plt.scatter(reptilia[mass], reptilia[longevity], label='Reptilia')
# Add legend
plt.legend()
# Log scale
ax = plt.gca()
ax.set_xscale('log')
ax.set_yscale('log')
# Add labels and title
plt.xlabel('Body mass in grams')
plt.ylabel('Maximum longevity in years')
plt.title('Body mass vs. maximum longevity by animal class')
# Show plot
plt.show()
Plotting DataFrames with Matplotlib adds some inconvenient overhead. Even simple exploration of a dataset can take a lot of time, because the data needs additional wrangling, such as the per-class filtering above, before Matplotlib can plot it.
Seaborn, however, is built to operate on DataFrames and full dataset arrays, which makes this process simpler. It internally performs the necessary semantic mappings and statistical aggregation to produce informative plots.
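For comparison, here is a rough sketch of how a scatter plot grouped by class could be drawn with Seaborn in a single call, without splitting the DataFrame by hand. The toy values are made up and merely stand in for the filtered animal data; only the column names follow the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy frame with made-up values standing in for the filtered animal data
toy = pd.DataFrame({
    'Body mass (g)': [5.0, 40.0, 300.0, 2_500.0, 60_000.0, 900.0],
    'Maximum longevity (yrs)': [6.0, 12.0, 20.0, 25.0, 60.0, 30.0],
    'Class': ['Amphibia', 'Aves', 'Aves', 'Mammalia', 'Mammalia', 'Reptilia'],
})

fig, ax = plt.subplots(figsize=(10, 6))

# hue assigns one color per class and builds the legend automatically
sns.scatterplot(data=toy, x='Body mass (g)', y='Maximum longevity (yrs)',
                hue='Class', ax=ax)

# Same log scaling as in the Matplotlib version
ax.set_xscale('log')
ax.set_yscale('log')
```

The hue argument replaces the four separate filtered DataFrames and the four scatter calls. The pairplot call that follows goes one step further and plots pairwise relationships between columns of a dataset.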
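import seaborn as sns
# The path below is specific to the author's machine; adjust it to your setup
data = pd.read_csv("C:/Users/luong/Desktop/data/age_salary_hours.csv")
sns.set(style="ticks", color_codes=True)
# Plot pairwise relationships between the numeric columns, colored by the Education column
g = sns.pairplot(data, hue='Education')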
plt.show()
A work by Luong Nguyen - 18 June 2020