Tidyverse_Create

##1. Loading the Tidyverse packages and the data into R:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(readr)

spotify_data <- read_csv("https://raw.githubusercontent.com/Doumgit/Sentiment-Analysis-Project/main/spotify_songs.csv")

## Rows: 32833 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
## dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(spotify_data)

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

##2. Data Manipulation:

The ‘dplyr’ package is a powerful tool for data transformation and summarization within the Tidyverse collection. It allows for clear and concise data manipulation, enabling a variety of operations such as filtering rows, selecting specific columns, mutating the dataset to include new variables, summarizing data, and arranging rows based on certain criteria. For example:

track_pop_above_60 <- spotify_data %>% 
  filter(track_popularity > 60) %>%
  select(track_name, track_artist, danceability, energy, tempo)
head(track_pop_above_60)

## # A tibble: 6 × 5
##   track_name                              track_artist danceability energy tempo
##   <chr>                                   <chr>               <dbl>  <dbl> <dbl>
## 1 I Don't Care (with Justin Bieber) - Lo… Ed Sheeran          0.748  0.916  122.
## 2 Memories - Dillon Francis Remix         Maroon 5            0.726  0.815  100.
## 3 All the Time - Don Diablo Remix         Zara Larsson        0.675  0.931  124.
## 4 Someone You Loved - Future Humans Remix Lewis Capal…        0.65   0.833  124.
## 5 Beautiful People (feat. Khalid) - Jack… Ed Sheeran          0.675  0.919  125.
## 6 Never Really Over - R3HAB Remix         Katy Perry          0.449  0.856  113.

In the example above, we use dplyr to narrow the spotify_data dataset to tracks with a popularity greater than 60 by applying the filter() function. We then use select() to keep only the columns we want to analyze: track name, artist, danceability, energy, and tempo. Finally, head() displays the first six rows as a quick preview of the transformed data.

##3. Data Visualization:

With dplyr and ggplot2 together, you can create a variety of visualizations. For instance, a scatter plot to see the relationship between ‘danceability’ and ‘energy’ could be made like so:

# Data manipulation with dplyr: let's categorize tracks as 'High popularity' or 'Low popularity'
# assuming the median of the 'track_popularity' can be a good threshold
spotify_data_mutated <- spotify_data %>%
  mutate(popularity_category = if_else(track_popularity > median(track_popularity, na.rm = TRUE), 
                                       "High Popularity", 
                                       "Low Popularity"))

# For a large dataset like spotify_data, you might want to take a sample to make plotting faster
sampled_data <- sample_n(spotify_data_mutated, 1000)

ggplot(sampled_data, aes(x = danceability, y = energy, color = popularity_category)) +
  geom_point(alpha = 0.7) +   
  facet_wrap(~popularity_category) +   
  labs(title = "Danceability vs Energy by Popularity Category",
       x = "Danceability", 
       y = "Energy",
       color = "Popularity Category") +
  theme_minimal()

# Saving the plot as png
ggsave("Danceability_vs_Energy_by_Popularity_Category.png", width = 10, height = 8)

In this vignette, we leverage the capabilities of the Tidyverse, specifically dplyr for data manipulation and ggplot2 for data visualization. First, we use dplyr to enhance our dataset by creating a new column that categorizes tracks based on their popularity. This categorization allows us to explore nuances in the data, such as differences in danceability and energy between tracks with high and low popularity. Due to the potential size of the dataset, we use dplyr to sample the data, making our subsequent visualization more efficient and manageable.
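
Random sampling returns different rows on every run, so the plot changes each time the vignette is knitted. The sketch below shows one way to make the sample reproducible. It is an illustration under assumptions: `toy_tracks` is a made-up stand-in for `spotify_data`, the seed value is arbitrary, and `slice_sample()` is used in place of `sample_n()`, which dplyr now lists as superseded.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  track_id         = 1:100,
  track_popularity = rep(c(20, 80), 50)
)

# Fixing the seed before sampling makes the draw repeatable
set.seed(607)  # arbitrary seed, chosen only for illustration
sample_a <- toy_tracks %>% slice_sample(n = 10)

# Resetting the same seed reproduces exactly the same rows
set.seed(607)
sample_b <- toy_tracks %>% slice_sample(n = 10)

sample_a
```

With the seed in place, anyone who reruns the code should get the same sampled rows and therefore the same scatter plot.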

Once our data is prepared, we transition to visualizing it with ggplot2. We construct a scatter plot that illustrates the relationship between danceability and energy, utilizing the newly created popularity categories to color-code the points. This not only adds a layer of information to our plot but also enhances readability. To further refine our visualization, we employ facet_wrap to generate separate plots for each popularity category, providing a clearer comparison between the groups. Finally, we add appropriate labels and titles for context and clarity and save the resulting plot as a PNG file. This process from data manipulation to visualization exemplifies a seamless workflow within the Tidyverse ecosystem, yielding insightful and aesthetically pleasing representations of our data.

##4. Data Summarization:

Summarization is a crucial step in data analysis, allowing us to extract meaningful statistics from larger datasets. The dplyr package simplifies this process by providing intuitive functions such as group_by() and summarize(). For example, to calculate the average loudness by playlist_genre:

summarising_data <- spotify_data %>%
  group_by(playlist_genre) %>%
  summarize(avg_loudness = mean(loudness, na.rm = TRUE))
summarising_data

## # A tibble: 6 × 2
##   playlist_genre avg_loudness
##   <chr>                 <dbl>
## 1 edm                   -5.43
## 2 latin                 -6.26
## 3 pop                   -6.32
## 4 r&b                   -7.86
## 5 rap                   -7.04
## 6 rock                  -7.59

In the example above, we use these functions to calculate the average ‘loudness’ for each ‘playlist_genre’ within the ‘spotify_data’ dataset. The ‘group_by()’ function clusters the data by each unique genre, setting the stage for the calculation of summary statistics within each group. Then, ‘summarize()’ is applied to compute the mean ‘loudness’ across these groups, while ‘na.rm = TRUE’ ensures that missing values do not affect the calculation. The resulting object, ‘summarising_data’, contains the average loudness values neatly organized by genre, providing an immediate snapshot of this particular attribute across different genres.
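
A single summarize() call can return several statistics at once. The following is a minimal sketch on a hypothetical `toy_tracks` tibble (the values are invented, and the column names `n_tracks` and `sd_loudness` are chosen here for illustration), showing a track count and a standard deviation alongside the mean.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  playlist_genre = c("pop", "pop", "rock", "rock", "rock"),
  loudness       = c(-5, -7, -8, NA, -10)
)

genre_summary <- toy_tracks %>%
  group_by(playlist_genre) %>%
  summarize(
    n_tracks     = n(),                          # rows per genre, including rows with NA loudness
    avg_loudness = mean(loudness, na.rm = TRUE), # mean of the non-missing values
    sd_loudness  = sd(loudness, na.rm = TRUE)    # spread of loudness within each genre
  )

genre_summary
```

Note that n() counts every row in the group, while na.rm = TRUE only affects mean() and sd(), so a genre with missing loudness values still reports its full track count.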

This extension of Souleymane’s vignette is by Fomba Kassoh

Arranging tracks by loudness

You can also use the dplyr package to arrange tracks by loudness using the ‘arrange’ function. Wrapping the column in desc() sorts from loudest to quietest.

library(dplyr)

loud_tracks <- spotify_data %>% 
  arrange(desc(loudness))

head(loud_tracks)

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 3BnVqaDfgKyI4CFCfozors Raw Power… The Stooges                36 6mxbG8KrOTZIx…
## 2 6TaqE6fbzfOGTSnNqmsAMO Escape Fr… Eva Simons                  0 51Tg4iZyqpqp1…
## 3 5DQGkXXLiOhf5cKqIyWh5L Rockstar   Duki                        2 1F7NrR7X4rxJf…
## 4 0jsbEBnXWgHLkvjV49vYVG Nails      Ghostemane                 49 6DIKWvXlVjvAx…
## 5 02fPJUlHDeH46EB1dHrgmZ Crema      Owin                       43 281103RQ7mvsW…
## 6 2kOmW169C7UV4SZDN9u0YO Vidrado E… Dj Guuga                   78 5HebljJgeo97M…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>

Counting tracks in each playlist genre

If you want to know how many tracks are in each genre, you can use the ‘count’ function.

genre_counts <- spotify_data %>% count(playlist_genre)

head(genre_counts)

## # A tibble: 6 × 2
##   playlist_genre     n
##   <chr>          <int>
## 1 edm             6043
## 2 latin           5155
## 3 pop             5507
## 4 r&b             5431
## 5 rap             5746
## 6 rock            4951
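
For intuition, count() can be thought of as shorthand for group_by() followed by summarize(n = n()). The sketch below compares the two on a hypothetical `toy_tracks` tibble (invented values) and also uses the `sort = TRUE` argument of count() to order genres by frequency.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  playlist_genre = c("pop", "rock", "pop", "edm", "pop", "rock")
)

# count() with sort = TRUE orders genres from most to least frequent
counts_sorted <- toy_tracks %>% count(playlist_genre, sort = TRUE)

# The same counts written out with group_by(), summarize() and arrange()
counts_manual <- toy_tracks %>%
  group_by(playlist_genre) %>%
  summarize(n = n()) %>%
  arrange(desc(n))

counts_sorted
```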

Filtering tracks that are outliers in terms of duration (using IQR)

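A common data cleaning task is filtering out outliers. One widely used rule relies on the interquartile range (IQR): a value is treated as an outlier if it lies more than 1.5 × IQR below the first quartile or more than 1.5 × IQR above the third quartile.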

spotify_data_no_outliers <- spotify_data %>% 
  filter(between(duration_ms, quantile(duration_ms, 0.25) - 1.5 * IQR(duration_ms),
                               quantile(duration_ms, 0.75) + 1.5 * IQR(duration_ms)))
head(spotify_data_no_outliers)

## # A tibble: 6 × 23
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 0r7CVbZTWZgbTCYdfa2P31 Memories … Maroon 5                   67 63rPSO264uRjW…
## 3 1z1Hg7Vb0AhHDiEmnDE79l All the T… Zara Larsson               70 1HoSmj2eLcsrR…
## 4 75FpbthrwQmzHlBJLuGdC7 Call You … The Chainsm…               60 1nqYsOef1yKKu…
## 5 1e8PAfcKUYoKkxPhrHqw4x Someone Y… Lewis Capal…               69 7m7vv9wlQ4i0L…
## 6 7fvUMiyapMsRRxr07cU8Ef Beautiful… Ed Sheeran                 67 2yiy9cd2QktrN…
## # ℹ 18 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, danceability <dbl>, energy <dbl>, key <dbl>,
## #   loudness <dbl>, mode <dbl>, speechiness <dbl>, acousticness <dbl>,
## #   instrumentalness <dbl>, liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   duration_ms <dbl>
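
To make the 1.5 × IQR rule easier to inspect, the bounds can be computed as named values before filtering. This is a sketch on a hypothetical `toy_tracks` tibble with invented durations; the names `lower_fence` and `upper_fence` are chosen here for illustration and do not appear in the original code.

```r
library(dplyr)

# Hypothetical durations in milliseconds; the last value is deliberately extreme
toy_tracks <- tibble(
  duration_ms = c(180000, 200000, 210000, 220000, 240000, 900000)
)

# First and third quartiles and the interquartile range
q1  <- quantile(toy_tracks$duration_ms, 0.25, names = FALSE)
q3  <- quantile(toy_tracks$duration_ms, 0.75, names = FALSE)
iqr <- IQR(toy_tracks$duration_ms)

# Values outside these fences are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Keep only the rows whose duration falls inside the fences
toy_no_outliers <- toy_tracks %>%
  filter(between(duration_ms, lower_fence, upper_fence))

toy_no_outliers
```

Printing `lower_fence` and `upper_fence` before filtering is a quick way to check that the cutoffs are sensible for the data at hand.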

Normalizing audio feature columns (like loudness and tempo) to put them on a common scale

In statistical analysis, you may need to compare two variables that are measured in different units or on different scales. To do that, you must first normalize/standardize the data. You can achieve that using the following dplyr functions:

mutate_at(): This is a dplyr function used to apply a given function to multiple columns in the dataframe.

vars(loudness, tempo): The vars() function is used here to specify the columns on which the function within mutate_at() will be applied. In this case, it’s being applied to the ‘loudness’ and ‘tempo’ columns.

~scale(.): The tilde (~) introduces a formula, which is a way of writing anonymous functions in R. The dot (.) is a placeholder for the columns specified in vars(). scale() is a base R function that standardizes a numeric vector to have a mean of zero and a standard deviation of one. It performs z-score normalization.

%>% as.vector: This chain within the mutate_at() function takes the output of scale(), which is a matrix, and converts it back to a vector with as.vector(). This is because scale() returns a matrix with attributes for the scaled center and scale, but here we want just the scaled values as regular columns in the dataframe.

select: The select function selects columns of interest.

normalized_spotify_data <- spotify_data %>% 
  mutate_at(vars(loudness, tempo), ~scale(.) %>% as.vector) %>%
  select(track_name, loudness, tempo)

head(normalized_spotify_data)

## # A tibble: 6 × 3
##   track_name                                            loudness   tempo
##   <chr>                                                    <dbl>   <dbl>
## 1 I Don't Care (with Justin Bieber) - Loud Luxury Remix    1.37   0.0429
## 2 Memories - Dillon Francis Remix                          0.586 -0.777 
## 3 All the Time - Don Diablo Remix                          1.10   0.116 
## 4 Call You Mine - Keanu Silva Remix                        0.984  0.0400
## 5 Someone You Loved - Future Humans Remix                  0.685  0.115 
## 6 Beautiful People (feat. Khalid) - Jack Wins Remix        0.447  0.152
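
Recent dplyr versions mark mutate_at() as superseded in favor of mutate() combined with across(). The sketch below shows what an equivalent standardization might look like in that style. It runs on a hypothetical `toy_tracks` tibble with invented values rather than on the full dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  track_name = c("a", "b", "c", "d"),
  loudness   = c(-10, -8, -6, -4),
  tempo      = c(90, 100, 120, 130)
)

# across() applies the same function to each selected column;
# as.vector() drops the matrix structure that scale() returns
normalized_toy <- toy_tracks %>%
  mutate(across(c(loudness, tempo), ~ as.vector(scale(.x))))

normalized_toy
```

After the transformation each selected column should have a mean of zero and a standard deviation of one, which can be checked with mean() and sd().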

Grouping data and applying a function within each group

In data analysis, you often need to compute a statistic or select rows within each group defined by a column. You can use group_by() and then apply a function independently to each group. Below is an example:

group_by: This function groups the data by the playlist_genre column, which means that subsequent operations will be performed on these groups independently.

top_n: This function is used to select the top n entries for each group created by group_by(). In this case, it selects the top 5 tracks with the longest duration within each genre.

top_tracks_by_genre <- spotify_data %>% 
  group_by(playlist_genre) %>% 
  top_n(5, duration_ms) %>%
  select(track_name, playlist_genre, duration_ms)

head(top_tracks_by_genre)

## # A tibble: 6 × 3
## # Groups:   playlist_genre [2]
##   track_name                               playlist_genre duration_ms
##   <chr>                                    <chr>                <dbl>
## 1 Mirrors                                  pop                 484147
## 2 Mirrors                                  pop                 484147
## 3 Get The Balance Right! - Combination Mix pop                 478208
## 4 Bailando - Jose Spinnin Cortes Remix     pop                 490057
## 5 Get The Balance Right! - Combination Mix pop                 475600
## 6 Bring It On                              rap                 496133
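
dplyr now lists top_n() as superseded by slice_max(), which also returns the selected rows in descending order within each group. The sketch below illustrates that alternative on a hypothetical `toy_tracks` tibble (invented values). Setting `with_ties = FALSE` is a choice made here so that each group returns exactly n rows. top_n() keeps tied rows, as does slice_max() by default.

```r
library(dplyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  track_name     = c("a", "b", "c", "d", "e", "f"),
  playlist_genre = c("pop", "pop", "pop", "rap", "rap", "rap"),
  duration_ms    = c(200, 300, 300, 150, 400, 250)
)

# Keep the 2 longest tracks per genre, sorted from longest to shortest
top_toy <- toy_tracks %>%
  group_by(playlist_genre) %>%
  slice_max(duration_ms, n = 2, with_ties = FALSE) %>%
  ungroup()

top_toy
```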

Transforming the data frame from the wide format to the long format

The pivot_longer() transformation takes the selected numerical columns from the Spotify dataset and converts them into two columns: attribute, which holds the original column name, and value, which holds the corresponding measurement. Each track therefore appears once per selected attribute. This “long” format is useful for certain types of analysis and visualization where you want to treat these attributes uniformly. For example, after this transformation, you can plot all of these attributes’ values in a single ggplot call without having to handle each column separately.

Note: The pivot_longer function belongs to the tidyr package, not dplyr, but the two packages are often used together because they complement each other. Here, I am using it to help with visualizing the data.

library(dplyr)
library(tidyr)
library(readr)


# Select numerical columns for the transformation
numerical_columns <- c('danceability', 'energy', 'key', 'loudness', 
                       'speechiness', 'acousticness', 'instrumentalness', 
                       'liveness', 'valence', 'tempo')

# Perform the pivot_longer operation
pivot_longer_df <- spotify_data %>% 
  pivot_longer(cols = all_of(numerical_columns), 
               names_to = "attribute", 
               values_to = "value")

head(pivot_longer_df)

## # A tibble: 6 × 15
##   track_id               track_name track_artist track_popularity track_album_id
##   <chr>                  <chr>      <chr>                   <dbl> <chr>         
## 1 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 2 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 3 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 4 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 5 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## 6 6f807x0ima9a1j3VPbc7VN I Don't C… Ed Sheeran                 66 2oCs0DGTsRO98…
## # ℹ 10 more variables: track_album_name <chr>, track_album_release_date <chr>,
## #   playlist_name <chr>, playlist_id <chr>, playlist_genre <chr>,
## #   playlist_subgenre <chr>, mode <dbl>, duration_ms <dbl>, attribute <chr>,
## #   value <dbl>
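
The reshaping is easier to follow on a small table. The following sketch uses a hypothetical `toy_tracks` tibble with two tracks and three feature columns (all values invented). The helper name `feature_columns` is also an assumption made for this example. Wrapping the character vector in all_of() is the tidyselect-recommended way to select columns whose names are stored in a variable.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for spotify_data (values are made up)
toy_tracks <- tibble(
  track_name   = c("a", "b"),
  danceability = c(0.7, 0.5),
  energy       = c(0.9, 0.6),
  tempo        = c(120, 100)
)

feature_columns <- c("danceability", "energy", "tempo")

# Each track contributes one row per feature: 2 tracks x 3 features = 6 rows
toy_long <- toy_tracks %>%
  pivot_longer(cols = all_of(feature_columns),
               names_to = "attribute",
               values_to = "value")

toy_long
```

The long table has one row per track and attribute pair, which is why the full dataset grows to ten times its original number of rows when ten columns are pivoted.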

Visualizations

Now that we have transformed the data from wide format to long format, let us use the long data frame to visualize the data. Here I am focusing on the plots that are most frequently used in statistical analysis.

Box Plot

This will show the distribution of each numerical attribute. Box plots are valuable in data analysis for concisely representing the distribution of a data set, highlighting its median, quartiles, and outliers. They provide a clear visual summary of the central tendency, variability, and extremes in the data.

library(ggplot2)

ggplot(pivot_longer_df, aes(x = attribute, y = value)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Numerical Attributes", x = "Attribute", y = "Value")

#### Histograms of Numerical Attributes

This will create a facet grid of histograms for each attribute. Histograms are essential in data analysis for visualizing the distribution of numerical data, identifying patterns and outliers, comparing multiple data sets, and understanding their spread and central tendencies. They offer a straightforward way to analyze and gain insights from numerical data.

ggplot(pivot_longer_df, aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", color = "black") +
  facet_wrap(~ attribute, scales = "free") +
  labs(title = "Histograms of Numerical Attributes", x = "Value", y = "Count")

#### Density Plots of Numerical Attributes

This will show the density distribution for each attribute. Density plots are used in data analysis to visualize the distribution and density of continuous data, offering a smooth representation of data variation and identifying where values are concentrated. They provide a clear and continuous view of data distribution, useful for understanding underlying patterns and trends.

ggplot(pivot_longer_df, aes(x = value, fill = attribute)) +
  geom_density(alpha = 0.7) +
  facet_wrap(~ attribute, scales = "free") +
  labs(title = "Density Plots of Numerical Attributes", x = "Value", y = "Density")

#### Bar Plot of Average Values of Attributes

This will display the average values of each attribute. Bar plots are used in data analysis to visually represent categorical data, showing the frequency or proportion of categories through the height or length of bars, making it easy to compare different categories or groups within a dataset.

library(dplyr)

pivot_longer_df %>%
  group_by(attribute) %>%
  summarise(average_value = mean(value, na.rm = TRUE)) %>%
  ggplot(aes(x = attribute, y = average_value, fill = attribute)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Average Values of Numerical Attributes", x = "Attribute", y = "Average Value")

#### Explanation of the ggplot functions used

ggplot(): This is the core function in ggplot2, used to initialize a ggplot object. It sets up the data and specifies the set of mappings from data to aesthetics.

geom_* functions: These functions add layers to the plot, specifying how to display the data. Some common geom_* functions include:

geom_point(): Adds points to the plot, useful for scatter plots.

geom_density(): Draws a smoothed kernel density estimate, showing where the values of a continuous variable are concentrated.

geom_bar(): Creates bar plots.

geom_histogram(): Plots a histogram for data distribution.

geom_boxplot(): Creates boxplots to show distributions with quartiles.

aes(): This function is used to specify the aesthetic mappings, like mapping variables to x and y axes, color, fill, etc.

facet_wrap() and facet_grid(): These functions are used for creating faceted plots, allowing you to split one plot into multiple plots based on a factor or combination of factors.

labs(): Used to add or modify labels for the plot, including titles, axis labels, legends, etc.

theme(): This function is used to customize the non-data components of your plot, like the plot background, grid lines, text, etc.