Introduction

Welcome to our introductory guide to ggplot2 (usually just called ggplot), a wonderful data visualization package for R. In this guide, we’ll learn how to translate our thoughts into ggplot code, understand the structure of ggplot inputs, and incrementally build a plot by adding layers for labels, groupings, and formatting changes.

Translating thoughts into ggplot code

Let’s start with an idea:

“I want to plot our data, with thing 1 on the x-axis and thing 2 on the y-axis. Oh, and I want it to be a scatter plot.”

In ggplot, this sentiment can be translated into the following code:

ggplot(data = your_data_frame, aes(x = thing1, y = thing2)) +
  geom_point()

In this code, your_data_frame is the data frame that contains your data, thing1 is the variable you want on the x-axis, and thing2 is the variable you want on the y-axis. geom_point() specifies that we want a scatter plot. Without a geom, no data will be drawn: ggplot shows only an empty panel with axes, since it doesn’t know what kind of plot we want.
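To make the role of the geom concrete, here is a minimal sketch that keeps the same data and mapping and swaps only the geom. It assumes the built-in mtcars data frame, with wt and mpg standing in for thing1 and thing2; those names are illustrative and not part of this guide's data.

```r
library(ggplot2)

# mtcars ships with R; wt and mpg stand in for thing1 and thing2
p <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# Same data and mapping, different geoms
scatter <- p + geom_point()  # "I want it to be a scatter plot"
line    <- p + geom_line()   # "I want it to be a line plot"

scatter
line
```

The sentence "I want it to be a ..." changes only the geom_*() call; the ggplot() call stays the same.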

Understanding ggplot Inputs

In ggplot, the main function ggplot() usually takes two inputs:

data: The data frame that contains the variables you want to plot.

aes(): Where you specify your x and y variables. aes() stands for aesthetic mappings. It describes how variables in the data are mapped to visual properties (aesthetics) of the plot.
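One way to see what "mapping" means is to compare a mapped aesthetic with a fixed one. The sketch below again assumes the built-in mtcars data frame (an illustrative stand-in): a property that should vary with a column goes inside aes(), while a constant value goes outside it.

```r
library(ggplot2)

# Mapped: color varies with a column in the data, so it goes inside aes()
mapped <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

# Set: color is one fixed value for every point, so it goes outside aes()
fixed <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = "steelblue")

mapped
fixed
```

The mapped version gets a legend because ggplot has to explain which color means which value. The fixed version does not.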

Building a Plot Incrementally

Now that we understand the basics, let’s start by looking at our data and then build a plot incrementally.

View data:

# Read in our data
data <- read.csv("temp_data/Spotify_Song_Attributes.csv")

# View what categories of data they included
colnames(data)
##  [1] "trackName"        "artistName"       "msPlayed"         "genre"           
##  [5] "danceability"     "energy"           "key"              "loudness"        
##  [9] "mode"             "speechiness"      "acousticness"     "instrumentalness"
## [13] "liveness"         "valence"          "tempo"            "type"            
## [17] "id"               "uri"              "track_href"       "analysis_url"    
## [21] "duration_ms"      "time_signature"

Base Plot: We start by creating a base plot. Let’s plot danceability against energy for now.

library(ggplot2)


# Make plot base
base_plot <- ggplot(data = data, aes(x = energy, y = danceability))

Adding a Scatter Plot Layer: We then add a scatter plot layer to our base plot. ggplot adds features to a plot with +. When the code continues onto the next line, the + goes at the end of the line, not at the start of the next one.

# Add a geom_point layer to our plot 
scatter_plot <- base_plot + geom_point()

# Typing the object name on its own prints (draws) the plot
scatter_plot
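A small sketch of what + does to the objects involved, assuming the built-in mtcars data frame as a stand-in: adding a layer returns a new plot object and leaves the original untouched, which is why each step can be saved under its own name.

```r
library(ggplot2)

base <- ggplot(data = mtcars, aes(x = wt, y = mpg))
with_points <- base + geom_point()

# Count the layers stored on each object
length(base$layers)         # the original object is unchanged
length(with_points$layers)  # the new object carries the added layer
```

Because base is not modified, it can be reused to try a different geom without starting over.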

Adding Labels: Next, we add labels to our plot.

scatter_plot_with_labels <- scatter_plot +
  labs(x = "Energy", y = "Danceability", title = "Spotify tracks by energy and danceability")

scatter_plot_with_labels
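labs() accepts more than axis labels and a title. As a hedged illustration on the built-in mtcars data frame (a stand-in, not the Spotify data), the sketch below also sets a subtitle and a caption.

```r
library(ggplot2)

labelled <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    title = "Fuel economy by weight",
    subtitle = "32 cars from the built-in mtcars data",
    caption = "Source: mtcars (datasets package)"
  )

labelled
```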

Adding Groupings: If we have a categorical variable that we want to use to group our data, we can add it to our aesthetic mappings. Let’s include genre, but filter down to just the top 10 genres first (there are 525, and our session can’t handle that many).

# Find top 10 most common genres
# Load the dplyr package
library(dplyr)

# Subset the data to the top 10 genres, excluding the empty genre

top_genres_data <- data %>%
  # The filter() function is used to exclude rows where the genre is empty.
  filter(genre != "") %>%
  # The group_by() function is used to group the data by genre.
  group_by(genre) %>%
  # The summarise() function is used to create a new dataframe that
  # contains the count of rows for each genre.
  summarise(count = n()) %>%
  # The arrange() function is used to sort the genres in descending order of count.
  arrange(desc(count)) %>%
  # The slice_head() function is used to select the top 10 genres.
  slice_head(n = 10) %>%
  # The inner_join() function is used to subset the original data to these top 10 genres.
  inner_join(data, by = "genre")

table(top_genres_data$genre)
## 
##                 alt z     alternative metal           anime lo-fi 
##                   656                   150                   136 
##               art pop               brostep             dance pop 
##                   126                   116                   172 
##           drift phonk                 filmi                   pop 
##                   124                   412                   602 
## singer-songwriter pop 
##                   164
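The same "keep only the most common groups" idea can be written more compactly. The sketch below is an alternative, not the pipeline used above: it runs on a small hypothetical data frame (songs, made up for illustration), uses count() in place of group_by() plus summarise(), and uses semi_join() so that no extra count column is added to the result.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spotify data frame
songs <- data.frame(
  trackName = paste0("track", 1:8),
  genre = c("pop", "pop", "pop", "filmi", "filmi", "brostep", "", "")
)

# Tally the non-empty genres, most common first, and keep the top 2
top_genres <- songs %>%
  filter(genre != "") %>%
  count(genre, sort = TRUE) %>%
  slice_head(n = 2)

# Keep only the rows of songs whose genre appears in top_genres
top_songs <- songs %>%
  semi_join(top_genres, by = "genre")

table(top_songs$genre)
```

Unlike inner_join(), semi_join() only filters rows, so top_songs has the same columns as songs.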

With the data trimmed, let’s go ahead and plot:

scatter_plot_with_groupings <- ggplot(data = top_genres_data,
                                      aes(x = energy,
                                          y = danceability,
                                          color = genre)) +
  geom_point() +
  labs(x = "Energy",
       y = "Danceability",
       title = "Top 10 Spotify genre tracks by energy and danceability")


scatter_plot_with_groupings

This time, we rebuilt the plot from scratch, since two things set in the initial ggplot() call changed: the data frame (now top_genres_data) and the mapping (aes(..., color = genre)).
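As an aside, when the data frame stays the same and only a grouping is added, an aesthetic can also be mapped inside a single layer. This is a minimal sketch on the built-in mtcars data frame (an illustrative stand-in), not the approach taken in this guide.

```r
library(ggplot2)

base <- ggplot(data = mtcars, aes(x = wt, y = mpg))

# color is mapped in the layer, so the ggplot() call stays unchanged
grouped <- base + geom_point(aes(color = factor(cyl)))

grouped
```

A mapping given inside a geom applies to that layer only, while a mapping in ggplot() is the default for every layer.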

Changing formatting: Finally, we can change the formatting of our plot, such as its theme.

final_plot <- scatter_plot_with_groupings + theme_minimal()

final_plot
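Beyond switching to a complete theme, individual elements can be adjusted with theme(). The sketch below assumes the built-in mtcars data frame as a stand-in and shows two common tweaks. Order matters here: theme() comes after theme_minimal(), because a complete theme added later would overwrite the tweaks.

```r
library(ggplot2)

styled <- ggplot(data = mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Fuel economy by weight", color = "Cylinders") +
  theme_minimal() +
  theme(
    legend.position = "bottom",               # move the legend under the plot
    plot.title = element_text(face = "bold")  # bold title
  )

styled
```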