Introduction

I am using the ggplot2 package to show how data visualization can be used in data analysis. The data comes from Kaggle, and the data sets can be found at the following URLs:

https://www.kaggle.com/datasets/joebeachcapital/30000-spotify-songs?select=spotify_songs.csv

https://www.kaggle.com/datasets/larsen0966/penguins?select=penguins.csv

Box plot

A box plot is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It can tell you about outliers in a dataset and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.
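
As a rough illustration of the five-number summary, the sketch below computes it in base R for a small made-up vector of popularity scores (the values are hypothetical and are not taken from the Spotify data). It also flags outliers with the common 1.5 × IQR rule, which is the default that geom_boxplot() documents through its coef argument.

```r
# Hypothetical popularity scores, made up for illustration
scores <- c(12, 35, 41, 48, 52, 55, 60, 63, 67, 98)

# Five-number summary: minimum, Q1, median, Q3, maximum
five_num <- quantile(scores, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Interquartile range and the 1.5 * IQR fences
iqr <- five_num[["75%"]] - five_num[["25%"]]
lower_fence <- five_num[["25%"]] - 1.5 * iqr
upper_fence <- five_num[["75%"]] + 1.5 * iqr

# Points outside the fences are drawn as individual outliers
outliers <- scores[scores < lower_fence | scores > upper_fence]
print(outliers)
```

Here the quartiles are 42.75 and 62.25, so the fences sit at 13.5 and 91.5, and the scores 12 and 98 are flagged. When outliers are present, the whiskers stop at the most extreme points inside the fences (35 and 67 in this toy example) rather than at the minimum and maximum.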

Load the Spotify data

A preview of the Spotify data is shown below.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)

spotify_data <- read_csv('https://raw.githubusercontent.com/hawa1983/Tidyverse-CREATE/main/spotify_songs.csv')

## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(spotify_data)

## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
## $ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
## $ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
## $ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
## $ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
## $ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

Plot the box plot

The box plot below shows the distribution of Track Popularity by Playlist Genre.

library(ggplot2)

# The data is in the data frame 'spotify_data' loaded above
p <- ggplot(spotify_data, aes(x = playlist_genre, y = track_popularity)) + 
  geom_boxplot() +
  theme_minimal() + 
  labs(title = "Distribution of Track Popularity by Playlist Genre",
       x = "Playlist Genre",
       y = "Track Popularity")

print(p)

Violin Plot

The violin plot combines a box plot with a kernel density plot. It shows both the spread of the data and where the values are concentrated: the width of the violin at a given value represents the estimated density of data points there.

library(ggplot2)

ggplot(spotify_data, aes(x = playlist_genre, y = track_popularity)) +
  geom_violin(trim = FALSE)
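
The outline of a violin is a kernel density estimate. As a minimal sketch of that idea, the base R code below estimates a density for a made-up vector (hypothetical values, not the Spotify data) with density(). ggplot2 computes the violin outline for you, so this is only meant to show what the width of the violin represents.

```r
# Hypothetical values, made up for illustration
values <- c(20, 22, 25, 40, 41, 43, 45, 47, 70, 72)

# Gaussian kernel density estimate evaluated on a regular grid
kde <- density(values)

# The estimated density is highest where the values cluster
peak_location <- kde$x[which.max(kde$y)]
print(peak_location)

# A density curve encloses an area of approximately 1
grid_step <- diff(kde$x[1:2])
area <- sum(kde$y) * grid_step
print(area)
```

The peak falls inside the cluster of values in the low-to-mid 40s, which is where a violin drawn from these values would be widest.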

Load the penguins data

The remaining plots use the penguins data.

penquin_data <- read_csv('https://raw.githubusercontent.com/hawa1983/Tidyverse-CREATE/main/penguins.csv')

## New names:
## • `` -> `...1`
## Rows: 344 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): species, island, sex
## dbl (6): ...1, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

penquin_data

## # A tibble: 344 × 9
##     ...1 species island    bill_length_mm bill_depth_mm flipper_length_mm
##    <dbl> <chr>   <chr>              <dbl>         <dbl>             <dbl>
##  1     1 Adelie  Torgersen           39.1          18.7               181
##  2     2 Adelie  Torgersen           39.5          17.4               186
##  3     3 Adelie  Torgersen           40.3          18                 195
##  4     4 Adelie  Torgersen           NA            NA                  NA
##  5     5 Adelie  Torgersen           36.7          19.3               193
##  6     6 Adelie  Torgersen           39.3          20.6               190
##  7     7 Adelie  Torgersen           38.9          17.8               181
##  8     8 Adelie  Torgersen           39.2          19.6               195
##  9     9 Adelie  Torgersen           34.1          18.1               193
## 10    10 Adelie  Torgersen           42            20.2               190
## # ℹ 334 more rows
## # ℹ 3 more variables: body_mass_g <dbl>, sex <chr>, year <dbl>
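
Row 4 of the preview contains missing values (NA). ggplot2 drops rows like this with a "Removed ... rows" warning when they are plotted. As a small sketch of how such rows could be counted and removed beforehand, the code below uses a made-up data frame (hypothetical values that only mimic the column names above).

```r
# Hypothetical measurements with one incomplete row, made up for illustration
toy <- data.frame(
  flipper_length_mm = c(181, 186, NA, 193),
  body_mass_g       = c(3750, 3800, NA, 3450)
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only the rows where both columns are present
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

Filtering first is optional, since the plots are still drawn without the incomplete rows, but it makes the dropped observations explicit.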

Strip Plot

A strip plot is similar to a scatter plot, but the points are jittered to reduce overlap, usually along a categorical axis. It’s good for small datasets but can become cluttered with larger ones. With two numeric variables, as in the code below, the jittered points can also show how the variables are related.

library(ggplot2)

ggplot(penquin_data, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_jitter(width = 0.2)

## Warning: Removed 2 rows containing missing values (`geom_point()`).
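
To show what jittering does, the sketch below applies base R's jitter() to a made-up vector with repeated values (hypothetical numbers, not the penguins data). The amount of 0.2 mirrors the width used above. ggplot2 applies its own position adjustment, so this is only an illustration of the idea.

```r
set.seed(42)  # make the random offsets reproducible

# Hypothetical flipper lengths with repeated values, made up for illustration
flipper <- c(181, 181, 181, 190, 190, 195)

# Shift each value by a uniform random amount in [-0.2, 0.2]
jittered <- jitter(flipper, amount = 0.2)
print(round(jittered, 3))

# No point moves further than the chosen amount
max_shift <- max(abs(jittered - flipper))
print(max_shift <= 0.2)
```

Note that geom_jitter() also jitters vertically by default when height is not given. Setting height = 0 keeps the y values exact.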

Dot Plot

Similar to a strip plot, but typically without jitter. Each point represents an observation. This can be particularly useful when you have discrete data.

library(ggplot2)

ggplot(penquin_data, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_dotplot(binaxis = 'y', stackdir = 'center', dotsize = 1)

## Bin width defaults to 1/30 of the range of the data. Pick better value with
## `binwidth`.

## Warning: Removed 2 rows containing missing values (`stat_bindot()`).
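
The message above says the bin width defaults to 1/30 of the range of the data. As a quick sketch of what that means, the code below works out that default for a made-up vector of body masses (hypothetical values, not the actual data).

```r
# Hypothetical body masses in grams, made up for illustration
mass <- c(2700, 3450, 3750, 4200, 5000, 6300)

# Default described in the message: 1/30 of the range of the data
data_range <- diff(range(mass))
default_binwidth <- data_range / 30
print(default_binwidth)
```

For these values the default would be 120. To choose the value yourself, pass binwidth to geom_dotplot(), for example binwidth = 100 (an arbitrary value chosen only for illustration).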

Bee Swarm Plot

Also similar to a scatter plot, but the points are adjusted so they don’t overlap. This gives a clear indication of the distribution of the data and can show clusters within the data.

library(ggplot2)
library(ggbeeswarm)

## Warning: package 'ggbeeswarm' was built under R version 4.3.2

ggplot(penquin_data, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_beeswarm()

Histogram

Shows the frequency distribution of a single continuous variable by dividing the data into bins and counting the number of observations in each bin.

library(ggplot2)

ggplot(penquin_data, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 60)

## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
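
The binning step can be sketched in base R without drawing anything. The example below uses a made-up vector of body masses (hypothetical values) and bin edges 60 g apart, the same bin width as in the plot above. ggplot2 may place its bin edges differently, so the counts only illustrate the idea.

```r
# Hypothetical body masses in grams, made up for illustration
mass <- c(3000, 3020, 3050, 3100, 3110, 3200)

# Bin edges 60 g apart
edges <- seq(3000, 3240, by = 60)

# Count the observations in each bin without drawing anything
h <- hist(mass, breaks = edges, plot = FALSE)
print(h$counts)
```

The four bins contain 3, 2, 0 and 1 observations. The height of each bar in a histogram is one of these counts.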

Rug Plot

Places a small tick mark for each data point along an axis, which is useful for showing exactly where individual observations fall within a distribution. It’s often combined with other plots like histograms or density plots.

library(ggplot2)

ggplot(penquin_data, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 60) +
  geom_rug(sides = "b") # Add a rug at the bottom

## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

Density Plot

A smoothed version of a histogram, it shows the distribution shape of the data. It can be particularly useful for comparing the distribution of a variable across different groups.

library(ggplot2)

ggplot(penquin_data, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5)

## Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Cumulative Frequency Plot

Shows the cumulative counts or proportions up to a certain value, giving you a sense of the number or proportion of observations below a particular value in the data distribution.

library(ggplot2)

ggplot(penquin_data, aes(x = body_mass_g)) +
  stat_ecdf(geom = "step") # Cumulative distribution function

## Warning: Removed 2 rows containing non-finite values (`stat_ecdf()`).
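
As a small worked example of cumulative counts and proportions, the sketch below counts how many values in a made-up vector (hypothetical body masses) fall at or below a few thresholds. The thresholds are also chosen only for illustration.

```r
# Hypothetical body masses in grams, made up for illustration
mass <- c(2900, 3200, 3400, 3800, 4100, 4600, 5000, 5400)
thresholds <- c(3000, 4000, 5000)

# Cumulative count: how many observations are at or below each threshold
cum_count <- sapply(thresholds, function(t) sum(mass <= t))
print(cum_count)

# Cumulative proportion: the same counts as a share of all observations
cum_prop <- cum_count / length(mass)
print(cum_prop)
```

The counts are 1, 4 and 7 out of 8 observations, so half of the values are at or below 4000.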

Empirical Cumulative Distribution Function (ECDF)

Unlike a box plot that summarizes data, an ECDF provides a complete representation of the distribution and is especially useful for large datasets.

library(ggplot2)


ggplot(penquin_data, aes(x = body_mass_g)) +
  stat_ecdf(geom = "step") +
  labs(title = "Empirical Cumulative Distribution Function",
       x = "Body Mass (g)",
       y = "ECDF")

## Warning: Removed 2 rows containing non-finite values (`stat_ecdf()`).
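
Base R can also build the ECDF as a function that you evaluate at any value. The sketch below uses ecdf() from the stats package on a made-up vector (hypothetical body masses, not the penguins data) to show how the steps in the plot are read.

```r
# Hypothetical body masses in grams, made up for illustration
mass <- c(3000, 3500, 3500, 4000, 4500)

# ecdf() returns a step function F(x) = proportion of values <= x
ecdf_fn <- ecdf(mass)

print(ecdf_fn(3500))  # share of observations at or below 3500
print(ecdf_fn(2999))  # below the smallest value
print(ecdf_fn(4500))  # at the largest value
```

Three of the five values are at or below 3500, so the function returns 0.6 there. It returns 0 below the smallest value and 1 at the largest.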

Tidyverse Create

Fomba Kassoh

2023-11-13