##1. Loading the Tidyverse packages and the data into R:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
spotify_data <- read_csv("https://raw.githubusercontent.com/Doumgit/Sentiment-Analysis-Project/main/spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spotify_data)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
##2. Data Manipulation:
The ‘dplyr’ package is a powerful tool for data transformation and summarization within the Tidyverse collection. It allows for clear and concise data manipulation, enabling a variety of operations such as filtering rows, selecting specific columns, mutating the dataset to include new variables, summarizing data, and arranging rows based on certain criteria. For example:
track_pop_above_60 <- spotify_data %>%
filter(track_popularity > 60) %>%
select(track_name, track_artist, danceability, energy, tempo)
head(track_pop_above_60)
## # A tibble: 6 × 5
## track_name track_artist danceability energy tempo
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 I Don't Care (with Justin Bieber) - Lo… Ed Sheeran 0.748 0.916 122.
## 2 Memories - Dillon Francis Remix Maroon 5 0.726 0.815 100.
## 3 All the Time - Don Diablo Remix Zara Larsson 0.675 0.931 124.
## 4 Someone You Loved - Future Humans Remix Lewis Capal… 0.65 0.833 124.
## 5 Beautiful People (feat. Khalid) - Jack… Ed Sheeran 0.675 0.919 125.
## 6 Never Really Over - R3HAB Remix Katy Perry 0.449 0.856 113.
In the example above, we demonstrated how to use dplyr
to refine the spotify_data dataset to focus on tracks with a popularity
greater than 60. We achieve this by employing the filter()
function. Subsequently, we pare down the dataset to include only
relevant columns that we are interested in analyzing: track name,
artist, danceability, energy, and tempo by using the
select()
function. This streamlined dataset is then
outputted, with head()
used to display just the first few
entries for a quick preview of the transformed data.
##3. Data Visualization:
With dplyr
and ggplot2
together, you can
create a variety of visualizations. For instance, a scatter plot to see
the relationship between ‘danceability’ and ‘energy’ could be made like
so:
# Data manipulation with dplyr: let's categorize tracks as 'High popularity' or 'Low popularity'
# assuming the median of the 'track_popularity' can be a good threshold
spotify_data_mutated <- spotify_data %>%
mutate(popularity_category = if_else(track_popularity > median(track_popularity, na.rm = TRUE),
"High Popularity",
"Low Popularity"))
# For a large dataset like spotify_data, you might want to take a sample to make plotting faster
sampled_data <- sample_n(spotify_data_mutated, 1000)
ggplot(sampled_data, aes(x = danceability, y = energy, color = popularity_category)) +
geom_point(alpha = 0.7) +
facet_wrap(~popularity_category) +
labs(title = "Danceability vs Energy by Popularity Category",
x = "Danceability",
y = "Energy",
color = "Popularity Category") +
theme_minimal()
# Saving the plot as png
ggsave("Danceability_vs_Energy_by_Popularity_Category.png", width = 10, height = 8)
In this vignette, we leverage the capabilities of the Tidyverse,
specifically dplyr
for data manipulation and
ggplot2
for data visualization. First, we use
dplyr
to enhance our dataset by creating a new column that
categorizes tracks based on their popularity. This categorization allows
us to explore nuances in the data, such as differences in danceability
and energy between tracks with high and low popularity. Due to the
potential size of the dataset, we use dplyr
to sample the
data, making our subsequent visualization more efficient and
manageable.
Once our data is prepared, we transition to visualizing it with
ggplot2
. We construct a scatter plot that illustrates the
relationship between danceability
and energy
,
utilizing the newly created popularity categories to color-code the
points. This not only adds a layer of information to our plot but also
enhances readability. To further refine our visualization, we employ
facet_wrap
to generate separate plots for each popularity
category, providing a clearer comparison between the groups. Finally, we
add appropriate labels and titles for context and clarity and save the
resulting plot as a PNG file. This process from data manipulation to
visualization exemplifies a seamless workflow within the Tidyverse
ecosystem, yielding insightful and aesthetically pleasing
representations of our data.
##4. Data Summarization:
Summarization is a crucial step in data analysis, allowing us to
extract meaningful statistics from larger datasets. The
dplyr
package simplifies this process by providing
intuitive functions such as group_by()
and
summarize()
. For example, to calculate the average
loudness
by playlist_genre
:
summarising_data <- spotify_data %>%
group_by(playlist_genre) %>%
summarize(avg_loudness = mean(loudness, na.rm = TRUE))
summarising_data
## # A tibble: 6 × 2
## playlist_genre avg_loudness
## <chr> <dbl>
## 1 edm -5.43
## 2 latin -6.26
## 3 pop -6.32
## 4 r&b -7.86
## 5 rap -7.04
## 6 rock -7.59
In the example above, we use these functions to calculate the average ‘loudness’ for each ‘playlist_genre’ within the ‘spotify_data’ dataset. The ‘group_by()’ function clusters the data by each unique genre, setting the stage for the calculation of summary statistics within each group. Then, ‘summarize()’ is applied to compute the mean ‘loudness’ across these groups, while ‘na.rm = TRUE’ ensures that missing values do not affect the calculation. The resulting object, ‘summarising_data’, contains the average loudness values neatly organized by genre, providing an immediate snapshot of this particular attribute across different genres.