library(dplyr)
library(ggplot2)
library(readr)
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.4 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(conflicted)
library(skimr)
library(ggcorrplot)
# Read the data set (the first rows are displayed further down with head())
data <- read.csv("dataset.csv")
conflicted::conflicts_prefer(dplyr::filter)
[conflicted] Removing existing preference.
[conflicted] Will prefer dplyr::filter over any other package.
# Filtering dataset where explicit is "True" and taking a sample of 9,000 rows
sample_data <- data |> filter(explicit == "True") |> sample_n(9000)
data <- sample_data
nrow(data)
[1] 9000
# Display first few rows
head(data)
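Because the sample above is drawn at random, the rows selected (and any counts derived from them) change from run to run unless a seed is fixed. A minimal sketch of a reproducible draw follows. The data frame `toy` and the seed value 42 are assumptions made for illustration and are not part of the original analysis; `slice_sample()` is the current dplyr replacement for `sample_n()`.

```r
library(dplyr)

# Hypothetical stand-in for the real data set (values are made up)
toy <- data.frame(
  track_id = 1:20,
  explicit = rep(c("True", "False"), each = 10)
)

set.seed(42)  # arbitrary seed, chosen only for illustration
sample_a <- toy |> filter(explicit == "True") |> slice_sample(n = 5)

set.seed(42)  # same seed, so the same rows are drawn again
sample_b <- toy |> filter(explicit == "True") |> slice_sample(n = 5)

# Compare the two draws row for row
identical(sample_a, sample_b)
```

Calling `set.seed()` once before the sampling step in the real pipeline would have the same effect.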
Step 1: Grouping Data and Analyzing Probabilities
Group 1: Grouping by Track Genre
group1 <- data |>
  group_by(track_genre) |>
  summarise(Count = n(), Mean_Popularity = mean(popularity, na.rm = TRUE))
print(group1)
Genres with lower counts have a lower probability of occurring in the
sample (for example, black-metal).
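The link between counts and probability of occurrence can be made explicit by dividing each genre's count by the total number of rows. The sketch below uses a small hypothetical data frame `toy` (made-up genre labels, not the real data) and a hypothetical column name `Probability`.

```r
library(dplyr)

# Hypothetical genre labels (made up for illustration)
toy <- data.frame(
  track_genre = c(rep("emo", 6), rep("hip-hop", 3), "black-metal")
)

genre_prob <- toy |>
  count(track_genre, name = "Count") |>
  mutate(Probability = Count / sum(Count)) |>  # empirical probability per genre
  arrange(Probability)                         # rarest genres first

print(genre_prob)
```

Applied to the real data, the same `mutate()` step could be appended to the `group1` summary so the rarest genres appear at the top.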
ggplot(group1, aes(x = reorder(track_genre, -Mean_Popularity), y = Mean_Popularity)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Mean Popularity by Track Genre",
       x = "Track Genre",
       y = "Mean Popularity") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5, size = 6),
    panel.grid.major.x = element_blank(), # Remove vertical grid lines
    plot.margin = margin(b = 100)         # Add bottom margin for labels
  )

The graph shows a clear hierarchy in genre popularity: the most popular
genres reach a mean of nearly 60 on the popularity scale, while the
least popular fall below 10, a substantial variation in listener
preferences.
Group 2: Grouping by Genre and Explicit Content
group2 <- data |>
  group_by(track_genre, explicit) |>
  summarise(Count = n(), Median_Danceability = median(danceability, na.rm = TRUE))
`summarise()` has grouped output by 'track_genre'. You can override using the `.groups` argument.
print(group2)
# Create faceted bar chart
ggplot(group2, aes(x = reorder(track_genre, -Median_Danceability),
                   y = Median_Danceability,
                   fill = explicit)) +
  geom_bar(stat = "identity") +
  facet_wrap(~explicit) +
  scale_fill_manual(values = c("True" = "coral")) +
  labs(title = "Median Danceability by Genre and Explicit Content",
       x = "Track Genre",
       y = "Median Danceability",
       fill = "Explicit") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90,
                               hjust = 1, vjust = 0.5,
                               size = 4),
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    plot.margin = margin(b = 100)
  )

From the grouping and the graph, we can see that:
Most genres have significantly more non-explicit tracks than explicit
ones - for example, acoustic has 948 non-explicit vs only 52 explicit
tracks, and ambient has 995 non-explicit tracks, showing a clear
preference for clean content in these genres.
Alternative music shows the highest proportion of explicit content
(164 explicit tracks out of 1000) compared to other genres, while
ambient and afrobeat have very few explicit tracks (fewer than 20
each).
Danceability varies notably between explicit and non-explicit tracks
within the same genre - for instance, alternative music shows higher
danceability in explicit tracks (0.663) compared to non-explicit ones
(0.538), while acoustic music shows the opposite trend with non-explicit
tracks being more danceable (0.564 vs 0.491).
Group 3: Grouping by Binned Popularity Scores
data <- data |> mutate(Popularity_Bin = cut(popularity, breaks = 5))
group3 <- data |>
  group_by(track_genre, Popularity_Bin) |>
  summarise(Count = n())
`summarise()` has grouped output by 'track_genre'. You can override using the `.groups` argument.
print(group3)
ggplot(group3, aes(x = Count, y = track_genre, fill = Popularity_Bin)) +
  geom_bar(stat = "identity", position = "stack") +
  scale_fill_brewer(palette = "RdYlBu") +
  labs(title = "Distribution of Popularity Scores Across Music Genres",
       x = "Number of Tracks",
       y = "Genre") +
  theme_minimal() +
  theme(
    axis.text.y = element_text(size = 5,                 # Genre label text size
                               margin = margin(r = 10)), # Add right margin
    plot.margin = margin(r = 8), # Add right margin for legend
    legend.position = "right",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9),
    panel.grid.major.y = element_blank()
  )

Popularity Distribution is Skewed: Most genres show a concentration
of tracks in the middle popularity ranges (20-60), with very few tracks
reaching the highest popularity bracket (80-100). For example, acoustic
has only 1 track in the highest range but 518 tracks in the middle range
(40-60).
Genre-Specific Patterns: Different genres show distinct popularity
patterns:
Acoustic music has a bell-shaped distribution, with most tracks in the
middle ranges.
Afrobeat shows a right-skewed distribution, with most tracks in the
lower popularity ranges (482 tracks in the lowest bracket).
Alt-rock follows a similar pattern to afrobeat, with high counts in the
lower popularity ranges.
Extreme Popularity is Rare: Across all genres, very few tracks achieve
the highest popularity bracket (80-100), suggesting that reaching top
popularity is exceptionally challenging regardless of genre. The middle
ranges (20-60) contain the bulk of tracks for most genres.
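The brackets discussed above depend on how `cut()` builds its bins. With `breaks = 5`, it splits the observed range of the data into five equal-width intervals, so the bin edges shift with the sample. The sketch below contrasts that with fixed 20-point bins, assuming popularity is scored from 0 to 100; the vector `popularity` holds made-up values, and `auto_bins` and `fixed_bins` are hypothetical names.

```r
# Hypothetical popularity scores (made up for illustration)
popularity <- c(0, 12, 25, 38, 41, 55, 59, 63, 77, 91)

# breaks = 5: five equal-width bins over the observed range (0 to 91 here)
auto_bins <- cut(popularity, breaks = 5)

# Explicit breaks: fixed 20-point bins over an assumed 0-100 scale
fixed_bins <- cut(popularity,
                  breaks = seq(0, 100, by = 20),
                  include.lowest = TRUE)

levels(auto_bins)   # edges derived from the data
levels(fixed_bins)  # edges fixed in advance
table(fixed_bins)   # number of scores per fixed bin
```

Fixed edges make bins comparable across samples, at the cost of possibly leaving some bins empty.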
Step 2: Investigating Combinations of Genre and Explicitness
category_combinations <- data |> count(track_genre, explicit)
print(category_combinations)
# Find missing combinations
all_combinations <- expand.grid(track_genre = unique(data$track_genre),
                                explicit = unique(data$explicit),
                                stringsAsFactors = FALSE)
missing_combinations <- anti_join(all_combinations, category_combinations,
                                  by = c("track_genre", "explicit"))
print(missing_combinations)
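An alternative way to surface missing combinations is to fill them in with a zero count and then filter for zeros. The sketch below uses `tidyr::complete()` on a small hypothetical data frame `toy` (made-up rows, not the real data); it is one possible approach, not the method used in the analysis above.

```r
library(dplyr)
library(tidyr)

# Hypothetical tracks (made up for illustration)
toy <- data.frame(
  track_genre = c("emo", "emo", "emo", "ambient"),
  explicit    = c("True", "False", "True", "False")
)

combo_counts <- toy |>
  count(track_genre, explicit) |>
  # Add every genre/explicit pair that never occurs, with a count of 0
  complete(track_genre, explicit, fill = list(n = 0L))

# Pairs with a zero count are the missing combinations
missing <- combo_counts |> filter(n == 0)
print(missing)
```

Note that both approaches only find gaps among values present in the data: if a value of `explicit` has been filtered out entirely, no combination involving it can be reported as missing.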
Step 3: Finding Combinations
# Most common combinations: group3 already holds one row per
# genre/bin pair, so sort by its existing Count column
# (re-summarising with n() would set every Count to 1)
common_combinations <- group3 |>
  ungroup() |>
  arrange(desc(Count))
# Least common combinations
least_common_combinations <- group3 |>
  ungroup() |>
  arrange(Count)
# Display the most common combinations
print("Most common combinations:")
head(common_combinations, 10) # Show the 10 most common combinations
# Display the least common combinations
print("Least common combinations:")
head(least_common_combinations, 10) # Show the 10 least common combinations
From the most common and least common combinations, we can notice:
1. Music genres like “acoustic,” “afrobeat,” and “alt-rock” are spread
across multiple popularity bins.
2. Counts are uneven across popularity bins: as the Group 3 results
show, most tracks fall in the lower and middle bins and very few in the
highest bin.
3. Some genres may perform better in higher popularity ranges, like
“alt-rock.”
4. Certain genres, like “acoustic” and “afrobeat,” could appeal to niche
audiences.
Summary:
Grouping by different categorical variables reveals patterns in
occurrence probabilities. The rarest groups can be identified and
analyzed for anomalies. Missing genre-explicitness combinations hint at
structured gaps in the dataset. Visualizations enhance interpretability
and hypothesis testing.
Further Questions to Investigate:
What factors contribute to the rarity of high-popularity tracks
across all genres?
Is there a temporal component affecting popularity distributions?
How do other track features (danceability, energy) correlate with
these popularity patterns?
Are there regional or cultural factors influencing these
distributions?
