## 1. Loading the Tidyverse packages and the data into R:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
spotify_data <- read_csv("https://raw.githubusercontent.com/Doumgit/Sentiment-Analysis-Project/main/spotify_songs.csv")
## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(spotify_data)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
## 2. Data Manipulation:
The ‘dplyr’ package is a powerful tool for data transformation and summarization within the Tidyverse collection. It allows for clear and concise data manipulation, enabling a variety of operations such as filtering rows, selecting specific columns, mutating the dataset to include new variables, summarizing data, and arranging rows based on certain criteria. For example:
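track_pop_above_60 <- spotify_data %>%
  filter(track_popularity > 60) %>%                                  # keep tracks with popularity above 60
  select(track_name, track_artist, danceability, energy, tempo)      # keep only the columns of interest
head(track_pop_above_60)                                             # preview the first six rows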
## # A tibble: 6 × 5
## track_name track_artist danceability energy tempo
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 I Don't Care (with Justin Bieber) - Lo… Ed Sheeran 0.748 0.916 122.
## 2 Memories - Dillon Francis Remix Maroon 5 0.726 0.815 100.
## 3 All the Time - Don Diablo Remix Zara Larsson 0.675 0.931 124.
## 4 Someone You Loved - Future Humans Remix Lewis Capal… 0.65 0.833 124.
## 5 Beautiful People (feat. Khalid) - Jack… Ed Sheeran 0.675 0.919 125.
## 6 Never Really Over - R3HAB Remix Katy Perry 0.449 0.856 113.
In the example above, we use dplyr to narrow the spotify_data dataset to tracks with a popularity greater than 60 by applying the filter() function. We then keep only the columns we want to analyze (track name, artist, danceability, energy, and tempo) with the select() function. Finally, head() displays the first few rows for a quick preview of the transformed data.
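The other verbs mentioned above, mutate() and arrange(), chain in the same way. The following is a minimal sketch on a small made-up tibble (toy_tracks and its values are hypothetical and not taken from the Spotify data), so it runs without downloading anything:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data standing in for spotify_data
toy_tracks <- tibble(
  track_name       = c("A", "B", "C", "D"),
  track_popularity = c(75, 40, 90, 61),
  duration_ms      = c(180000, 240000, 200000, 210000)
)

popular_toy <- toy_tracks %>%
  filter(track_popularity > 60) %>%               # keep rows above the threshold
  mutate(duration_min = duration_ms / 60000) %>%  # derive a new column in minutes
  arrange(desc(track_popularity)) %>%             # most popular first
  select(track_name, track_popularity, duration_min)

popular_toy
```

Track "B" is dropped by filter(), and the remaining three rows come back ordered from most to least popular.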
## 3. Data Visualization:
With dplyr and ggplot2 together, you can create a variety of visualizations. For instance, a scatter plot showing the relationship between ‘danceability’ and ‘energy’ can be made like so:
# Data manipulation with dplyr: categorize tracks as 'High Popularity' or 'Low Popularity',
# assuming the median of 'track_popularity' is a good threshold
spotify_data_mutated <- spotify_data %>%
  mutate(popularity_category = if_else(track_popularity > median(track_popularity, na.rm = TRUE),
                                       "High Popularity",
                                       "Low Popularity"))
# For a large dataset like spotify_data, you might want to take a sample to make plotting faster
sampled_data <- sample_n(spotify_data_mutated, 1000)
ggplot(sampled_data, aes(x = danceability, y = energy, color = popularity_category)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~popularity_category) +
  labs(title = "Danceability vs Energy by Popularity Category",
       x = "Danceability",
       y = "Energy",
       color = "Popularity Category") +
  theme_minimal()
# Save the most recent plot as a PNG file
ggsave("Danceability_vs_Energy_by_Popularity_Category.png", width = 10, height = 8)
In this vignette, we leverage the capabilities of the Tidyverse, specifically dplyr for data manipulation and ggplot2 for data visualization. First, we use dplyr to enhance our dataset by creating a new column that categorizes tracks based on their popularity. This categorization allows us to explore nuances in the data, such as differences in danceability and energy between tracks with high and low popularity. Due to the potential size of the dataset, we use dplyr to sample the data, making our subsequent visualization more efficient and manageable.
Once our data is prepared, we transition to visualizing it with ggplot2. We construct a scatter plot that illustrates the relationship between danceability and energy, using the newly created popularity categories to color-code the points. This not only adds a layer of information to our plot but also enhances readability. To further refine our visualization, we employ facet_wrap to generate separate plots for each popularity category, providing a clearer comparison between the groups. Finally, we add appropriate labels and titles for context and clarity and save the resulting plot as a PNG file. This process, from data manipulation to visualization, exemplifies a seamless workflow within the Tidyverse ecosystem, yielding insightful and aesthetically pleasing representations of our data.
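Because the sample is random, the scatter plot changes every time the code is run. One way to make it repeatable is to fix the random seed before sampling. The sketch below assumes a made-up tibble (toy) and an arbitrary seed of 42. It uses slice_sample(), which the dplyr documentation lists as the successor to sample_n().

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; any data frame works the same way
toy <- tibble(id = 1:100, value = (1:100) / 100)

set.seed(42)                              # arbitrary seed, chosen only for illustration
sample_a <- slice_sample(toy, n = 10)     # draw 10 rows at random

set.seed(42)                              # resetting the same seed ...
sample_b <- slice_sample(toy, n = 10)     # ... reproduces the same 10 rows

identical(sample_a$id, sample_b$id)
```

Calling set.seed() once at the top of a script is usually enough; it is repeated here only to show that the same seed gives the same rows.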
## 4. Data Summarization:
Summarization is a crucial step in data analysis, allowing us to extract meaningful statistics from larger datasets. The dplyr package simplifies this process by providing intuitive functions such as group_by() and summarize(). For example, to calculate the average loudness by playlist_genre:
summarising_data <- spotify_data %>%
  group_by(playlist_genre) %>%
  summarize(avg_loudness = mean(loudness, na.rm = TRUE))
summarising_data
## # A tibble: 6 × 2
## playlist_genre avg_loudness
## <chr> <dbl>
## 1 edm -5.43
## 2 latin -6.26
## 3 pop -6.32
## 4 r&b -7.86
## 5 rap -7.04
## 6 rock -7.59
In the example above, we use these functions to calculate the average ‘loudness’ for each ‘playlist_genre’ within the ‘spotify_data’ dataset. The ‘group_by()’ function clusters the data by each unique genre, setting the stage for the calculation of summary statistics within each group. Then, ‘summarize()’ is applied to compute the mean ‘loudness’ across these groups, while ‘na.rm = TRUE’ ensures that missing values do not affect the calculation. The resulting object, ‘summarising_data’, contains the average loudness values neatly organized by genre, providing an immediate snapshot of this particular attribute across different genres.
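summarize() can return several statistics per group in one call, which also makes it easy to see how many values na.rm = TRUE is skipping. A small sketch on made-up data (the toy tibble and its loudness values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with one missing loudness value
toy <- tibble(
  playlist_genre = c("pop", "pop", "rock", "rock", "rock"),
  loudness       = c(-5, -7, -8, NA, -6)
)

genre_summary <- toy %>%
  group_by(playlist_genre) %>%
  summarize(
    n_tracks        = n(),                            # rows in the group
    n_missing       = sum(is.na(loudness)),           # values dropped by na.rm = TRUE
    avg_loudness    = mean(loudness, na.rm = TRUE),
    median_loudness = median(loudness, na.rm = TRUE)
  )

genre_summary
```

Here the rock group has three rows but only two usable loudness values, so its mean is computed from -8 and -6.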
You can also use the dplyr package to sort tracks by loudness with the ‘arrange’ function. Wrapping the column in desc() sorts from loudest to quietest:
library(dplyr)
loud_tracks <- spotify_data %>%
  arrange(desc(loudness))
head(loud_tracks)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 3BnVqaDfgKyI4CFCfozors Raw Power… The Stooges 36 6mxbG8KrOTZIx…
## 2 6TaqE6fbzfOGTSnNqmsAMO Escape Fr… Eva Simons 0 51Tg4iZyqpqp1…
## 3 5DQGkXXLiOhf5cKqIyWh5L Rockstar Duki 2 1F7NrR7X4rxJf…
## 4 0jsbEBnXWgHLkvjV49vYVG Nails Ghostemane 49 6DIKWvXlVjvAx…
## 5 02fPJUlHDeH46EB1dHrgmZ Crema Owin 43 281103RQ7mvsW…
## 6 2kOmW169C7UV4SZDN9u0YO Vidrado E… Dj Guuga 78 5HebljJgeo97M…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
If you want to know how many tracks are in each genre, you can use the ‘count’ function:
genre_counts <- spotify_data %>% count(playlist_genre)
head(genre_counts)
## # A tibble: 6 × 2
## playlist_genre n
## <chr> <int>
## 1 edm 6043
## 2 latin 5155
## 3 pop 5507
## 4 r&b 5431
## 5 rap 5746
## 6 rock 4951
A common data cleaning task is filtering out outliers. One approach uses the interquartile range (IQR): keep only the values that lie between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, where Q1 and Q3 are the first and third quartiles.
spotify_data_no_outliers <- spotify_data %>%
  filter(between(duration_ms,
                 quantile(duration_ms, 0.25) - 1.5 * IQR(duration_ms),
                 quantile(duration_ms, 0.75) + 1.5 * IQR(duration_ms)))
head(spotify_data_no_outliers)
## # A tibble: 6 × 23
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5 67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson 70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm… 60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal… 69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## # loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## # instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## # duration_ms <dbl>
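The first six rows look the same as before filtering, so the preview alone does not show what was removed. To make the rule concrete, here is a small worked sketch on made-up durations (toy and its values are hypothetical). With R's default quantile method, Q1 = 205 and Q3 = 235, so the IQR is 30 and the fences are 160 and 280.

```r
library(dplyr)
library(tibble)

# Hypothetical durations in seconds; 600 is the obvious outlier
toy <- tibble(duration_s = c(180, 200, 210, 220, 230, 240, 600))

# Compute the lower and upper fences once, as plain numbers
fences <- toy %>%
  summarize(
    lower = quantile(duration_s, 0.25, names = FALSE) - 1.5 * IQR(duration_s),
    upper = quantile(duration_s, 0.75, names = FALSE) + 1.5 * IQR(duration_s)
  )

# Keep only rows inside the fences
kept <- toy %>%
  filter(between(duration_s, fences$lower, fences$upper))

fences
nrow(toy) - nrow(kept)   # number of rows removed as outliers
```

Comparing nrow() before and after the filter is a quick way to check how much data the rule discards.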
In statistical analysis, you may need to compare two variables that are measured on different scales. To do that, you must first normalize (standardize) the data. You can achieve that using the following functions:
- mutate_at(): This is a dplyr function used to apply a given function to multiple columns in the dataframe.
- vars(loudness, tempo): The vars() function is used here to specify the columns on which the function within mutate_at() will be applied. In this case, it’s being applied to the ‘loudness’ and ‘tempo’ columns.
- ~scale(.): The tilde (~) introduces a formula, which is a way of writing anonymous functions in R. The dot (.) is a placeholder for the columns specified in vars(). scale() is a base R function that standardizes a numeric vector to have a mean of zero and a standard deviation of one. It performs z-score normalization.
- %>% as.vector: This chain within the mutate_at() function takes the output of scale(), which is a matrix, and converts it back to a vector with as.vector(). This is because scale() returns a matrix with attributes for the scaled center and scale, but here we want just the scaled values as regular columns in the dataframe.
- select(): The select function selects the columns of interest.
normalized_spotify_data <- spotify_data %>%
  mutate_at(vars(loudness, tempo), ~scale(.) %>% as.vector) %>%
  select(track_name, loudness, tempo)
head(normalized_spotify_data)
## # A tibble: 6 × 3
## track_name loudness tempo
## <chr> <dbl> <dbl>
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix 1.37 0.0429
## 2 Memories - Dillon Francis Remix 0.586 -0.777
## 3 All the Time - Don Diablo Remix 1.10 0.116
## 4 Call You Mine - Keanu Silva Remix 0.984 0.0400
## 5 Someone You Loved - Future Humans Remix 0.685 0.115
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix 0.447 0.152
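Since dplyr 1.0.0, the documentation marks mutate_at() as superseded by across(), although mutate_at() still works. A sketch of the same standardization written with across(), using a made-up tibble (toy and its values are hypothetical), together with a check that each column ends up with mean 0 and standard deviation 1:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with two columns on very different scales
toy <- tibble(
  track_name = c("A", "B", "C", "D"),
  loudness   = c(-10, -8, -6, -4),
  tempo      = c(90, 100, 120, 130)
)

standardized <- toy %>%
  # across() applies the z-score function to each listed column;
  # as.vector() drops the matrix shape and attributes that scale() returns
  mutate(across(c(loudness, tempo), ~ as.vector(scale(.x))))

# Each standardized column should have mean ~0 and standard deviation ~1
standardized %>%
  summarize(across(c(loudness, tempo), list(mean = mean, sd = sd)))
```

After standardization, loudness and tempo are expressed in the same unit (standard deviations from their own mean), which is what makes them comparable.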
In data analysis, you sometimes need to apply an operation within each group of a column. You can use group_by() and then apply a function to each group independently. Below is an example:
- group_by(): This function groups the data by the playlist_genre column, which means that subsequent operations will be performed on these groups independently.
- top_n(): This function is used to select the top n entries for each group created by group_by(). In this case, it selects the top 5 tracks with the longest duration within each genre.
top_tracks_by_genre <- spotify_data %>%
  group_by(playlist_genre) %>%
  top_n(5, duration_ms) %>%
  select(track_name, playlist_genre, duration_ms)
head(top_tracks_by_genre)
## # A tibble: 6 × 3
## # Groups: playlist_genre [2]
## track_name playlist_genre duration_ms
## <chr> <chr> <dbl>
## 1 Mirrors pop 484147
## 2 Mirrors pop 484147
## 3 Get The Balance Right! - Combination Mix pop 478208
## 4 Bailando - Jose Spinnin Cortes Remix pop 490057
## 5 Get The Balance Right! - Combination Mix pop 475600
## 6 Bring It On rap 496133
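Note that the same track can appear more than once (for example "Mirrors" above), because a track may sit in several playlists. top_n() also keeps ties, so a group can return more than n rows. The dplyr documentation lists top_n() as superseded by slice_max(), whose with_ties argument makes this behavior explicit. A sketch on made-up data (toy and its values are hypothetical):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: tracks A and B tie for the longest pop track
toy <- tibble(
  track_name     = c("A", "B", "C", "D", "E", "F"),
  playlist_genre = c("pop", "pop", "pop", "rap", "rap", "rap"),
  duration_ms    = c(300, 300, 200, 400, 350, 100)
)

# Default: ties are kept, so pop contributes two rows
top_with_ties <- toy %>%
  group_by(playlist_genre) %>%
  slice_max(duration_ms, n = 1)

# with_ties = FALSE returns exactly n rows per group
top_exact <- toy %>%
  group_by(playlist_genre) %>%
  slice_max(duration_ms, n = 1, with_ties = FALSE)

nrow(top_with_ties)
nrow(top_exact)
```

Which variant is appropriate depends on whether tied rows should all be reported or a fixed number of rows per group is required.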
The pivot_longer transformation below takes the selected numerical columns from the Spotify dataset and converts them into two columns: attribute and value. This “long” format is useful for certain types of analysis and visualization where you want to treat these attributes uniformly. For example, after this transformation, you can easily plot all of these attributes’ values against another variable without having to deal with separate columns.
Note: The pivot_longer function belongs to the tidyr package, not dplyr, but the two packages are often used together because they complement each other. Here, I am using it to help with visualizing the data.
library(dplyr)
library(tidyr)
library(readr)
# Select numerical columns for the transformation
numerical_columns <- c('danceability', 'energy', 'key', 'loudness',
'speechiness', 'acousticness', 'instrumentalness',
'liveness', 'valence', 'tempo')
# Perform the pivot_longer operation; all_of() tells tidyselect that
# numerical_columns is an external character vector of column names
pivot_longer_df <- spotify_data %>%
  pivot_longer(cols = all_of(numerical_columns),
               names_to = "attribute",
               values_to = "value")
head(pivot_longer_df)
## # A tibble: 6 × 15
## track_id track_name track_artist track_popularity track_album_id
## <chr> <chr> <chr> <dbl> <chr>
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 2 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 3 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 4 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 5 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## 6 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran 66 2oCs0DGTsRO98…
## # ℹ 10 more variables: track_album_name <chr>, track_album_release_date <chr>,
## # playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## # playlist_subgenre <chr>, mode <dbl>, duration_ms <dbl>, attribute <chr>,
## # value <dbl>
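The output above repeats the same track in every row because each original row becomes one row per pivoted column (ten here). A minimal sketch on a made-up two-row tibble (toy and feature_cols are hypothetical names) shows the shape change and that pivot_wider() reverses it:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data: 2 tracks, 3 numeric attributes
toy <- tibble(
  track_name   = c("A", "B"),
  danceability = c(0.7, 0.5),
  energy       = c(0.9, 0.6),
  tempo        = c(120, 100)
)
feature_cols <- c("danceability", "energy", "tempo")

# Wide to long: 2 rows x 3 attributes gives 6 rows
toy_long <- toy %>%
  pivot_longer(cols = all_of(feature_cols),
               names_to = "attribute",
               values_to = "value")

# Long back to wide: recovers the original column layout
toy_wide <- toy_long %>%
  pivot_wider(names_from = attribute, values_from = value)

dim(toy_long)
names(toy_wide)
```

Passing the column names through all_of() is the form tidyselect recommends when the names are stored in a character vector.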
Now that we have transformed the data from wide format to long format, let us use the long data frame to visualize the data. Here I am focusing on the plots that are most frequently used in statistical analysis.
#### Box Plots of Numerical Attributes
This will show the distribution of each numerical attribute. Box plots are valuable in data analysis for concisely representing the distribution of a data set, highlighting its median, quartiles, and outliers. They provide a clear visual summary of the central tendency, variability, and extremes in the data.
library(ggplot2)
ggplot(pivot_longer_df, aes(x = attribute, y = value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Numerical Attributes", x = "Attribute", y = "Value")
#### Histograms of Numerical Attributes
This will create a faceted grid of histograms, one for each attribute. Histograms are essential in data analysis for visualizing the distribution of numerical data, identifying patterns and outliers, comparing multiple data sets, and understanding their spread and central tendencies. They offer a straightforward way to analyze and gain insights from numerical data.
ggplot(pivot_longer_df, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  facet_wrap(~ attribute, scales = "free") +
  labs(title = "Histograms of Numerical Attributes", x = "Value", y = "Count")
#### Density Plots of Numerical Attributes
This will show the density distribution for each attribute. Density plots are used in data analysis to visualize the distribution and density of continuous data, offering a smooth representation of data variation and identifying where values are concentrated. They provide a clear and continuous view of data distribution, useful for understanding underlying patterns and trends.
ggplot(pivot_longer_df, aes(x = value, fill = attribute)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ attribute, scales = "free") +
  labs(title = "Density Plots of Numerical Attributes", x = "Value", y = "Density")
#### Bar Plot of Average Values of Attributes
This will display the average value of each attribute. Bar plots are used in data analysis to visually represent categorical data, showing a frequency, proportion, or summary value for each category through the height or length of bars, making it easy to compare different categories or groups within a dataset.
library(dplyr)
pivot_longer_df %>%
  group_by(attribute) %>%
  summarise(average_value = mean(value, na.rm = TRUE)) %>%
  ggplot(aes(x = attribute, y = average_value, fill = attribute)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Average Values of Numerical Attributes", x = "Attribute", y = "Average Value")
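The stat = "identity" argument is needed because geom_bar() counts rows by default. ggplot2 also provides geom_col(), which plots the supplied heights directly. A short sketch with made-up averages (toy_means and its values are hypothetical, not computed from the Spotify data):

```r
library(ggplot2)

# Hypothetical averages standing in for the summarised attributes
toy_means <- data.frame(
  attribute     = c("danceability", "energy", "valence"),
  average_value = c(0.65, 0.70, 0.51)
)

# geom_bar() with stat = "identity" uses the y values as bar heights
p_bar <- ggplot(toy_means, aes(x = attribute, y = average_value)) +
  geom_bar(stat = "identity")

# geom_col() does the same without the extra argument
p_col <- ggplot(toy_means, aes(x = attribute, y = average_value)) +
  geom_col()

# Both layers use the identity stat
class(p_bar$layers[[1]]$stat)[1]
class(p_col$layers[[1]]$stat)[1]
```

Printing either object (for example print(p_col)) draws the same bar chart.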
#### Explanation of the ggplot functions used
- ggplot(): This is the core function in ggplot2, used to initialize a ggplot object. It sets up the data and specifies the set of mappings from data to aesthetics.
- geom_*() functions: These functions add layers to the plot, specifying how to display the data. Some common geom_*() functions include:
  - geom_point(): Adds points to the plot, useful for scatter plots.
  - geom_density(): Draws a smoothed density estimate, showing where the values of a continuous variable are concentrated.
  - geom_bar(): Creates bar plots.
  - geom_histogram(): Plots a histogram for data distribution.
  - geom_boxplot(): Creates boxplots to show distributions with quartiles.
- aes(): This function is used to specify the aesthetic mappings, like mapping variables to x and y axes, color, fill, etc.
- facet_wrap() and facet_grid(): These functions are used for creating faceted plots, allowing you to split one plot into multiple plots based on a factor or combination of factors.
- labs(): Used to add or modify labels for the plot, including titles, axis labels, legends, etc.
- theme(): This function is used to customize the non-data components of your plot, like the plot background, grid lines, text, etc.
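To see how these pieces fit together, a plot can be built one component at a time and stored in a variable. This is a minimal sketch on a made-up data frame (toy, its values, and the plot title are hypothetical):

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(
  danceability = c(0.5, 0.6, 0.7, 0.8),
  energy       = c(0.9, 0.7, 0.8, 0.6),
  group        = c("a", "a", "b", "b")
)

p <- ggplot(toy, aes(x = danceability, y = energy))  # data and mappings, no layers yet
p <- p + geom_point()                                # add one layer of points
p <- p + facet_wrap(~ group)                         # one panel per group
p <- p + labs(title = "Toy scatter plot")            # add a title
p <- p + theme_minimal()                             # change non-data styling

length(p$layers)   # only geom_point() adds a layer
```

Facets, labels, and themes modify the plot without adding layers, which is why only the geom shows up in p$layers. Printing p draws the finished plot.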