This is a collaborative data analysis project. In your small groups, you will explore the distribution of song durations by genre. The data set is stored in the CSV file “Songs.csv” and contains a list of 88 songs.
List of variables in the data set:
You will consider the following data investigations:
Enjoy making sense of song durations data :)
library(tidyverse)
library(mosaic)
library(knitr)
library(RColorBrewer)
songs_data <- read.csv("Songs.csv")
# Convert duration from seconds to minutes
songs_data$duration_min <- songs_data$duration_sec / 60
attach(songs_data)
ggplot(data = songs_data, aes(x = duration_min)) +
geom_histogram(binwidth = 1,
fill = "darkolivegreen", color = "darkorchid", alpha = 0.7) +
scale_x_continuous(breaks = seq(0, max(songs_data$duration_min), by = 1),
labels = paste0(seq(0, max(songs_data$duration_min), by = 1), " min")) +
labs(title = "Histogram of Song Durations",
# Modify the following subtitle.
subtitle = "Constructed by Group ...",
x = "Duration (minutes)",
y = "Count") +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)
)
The histogram plot shows that the distribution of song durations (in minutes) is slightly right-skewed. Most songs are clustered between 2 and 5 minutes, with a few songs extending beyond 6 minutes. The distribution has a single peak and gradually tapers off, indicating fewer longer songs.
kable(favstats(duration_min))
min | Q1 | median | Q3 | max | mean | sd | n | missing
---|---|---|---|---|---|---|---|---
2.166667 | 3.225 | 3.75 | 4.345833 | 9.133333 | 3.895076 | 1.094425 | 88 | 0
On average, songs last 3.90 minutes, with a standard deviation of 1.09 minutes; that is, a typical song's duration differs from the mean by about 1.1 minutes. The shortest song lasts about 2.17 minutes and the longest about 9.13 minutes. [Here you interpret the standard deviation within the context of the data.]
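To make the standard deviation concrete, the following minimal sketch computes it by hand and checks the result against `sd()`. The vector `x` is a made-up toy example (an assumption for illustration only), not values from Songs.csv.

```r
# Toy durations in minutes (hypothetical values, not from Songs.csv)
x <- c(2, 3, 4, 5, 6)

# Deviation of each value from the mean
deviations <- x - mean(x)

# Sample standard deviation: square root of the sum of squared
# deviations divided by n - 1
manual_sd <- sqrt(sum(deviations^2) / (length(x) - 1))

manual_sd
sd(x)  # the built-in function should give the same value
```

Replacing `x` with `songs_data$duration_min` should reproduce the `sd` column of the `favstats()` table.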
Mean <- mean(duration_min)
SD <- sd(duration_min)
# Compute the interval bounds for 1, 2, and 3 standard deviations
intervals <- tibble(
SD_Level = c("Mean ± 1 SD", "Mean ± 2 SD", "Mean ± 3 SD"),
Lower_Bound = c(Mean - SD, Mean - 2*SD, Mean - 3*SD),
Upper_Bound = c(Mean + SD, Mean + 2*SD, Mean + 3*SD)
)
# Display the intervals
intervals
# Calculate the proportion of data values within each interval
intervals <- intervals %>%
rowwise() %>%
mutate(
Proportion_Within = sum(duration_min >= Lower_Bound & duration_min <= Upper_Bound) / length(duration_min)
)
# Display the intervals with proportions
intervals
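The empirical rule predicts that about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively. Comparing these benchmarks with the proportions computed above, the empirical rule appears to describe the distribution of song durations reasonably well.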
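As a rough cross-check, the benchmark percentages of the empirical rule can be recovered from the normal distribution with `pnorm()`. The helper `prop_within()` and the vector `toy` below are hypothetical names and made-up values introduced for illustration; they are not part of the assignment code.

```r
# Share of a normal distribution within k standard deviations of the mean
k <- 1:3
normal_share <- pnorm(k) - pnorm(-k)
round(normal_share, 4)  # 0.6827 0.9545 0.9973

# Hypothetical helper: proportion of the values in x that fall within
# k standard deviations of their own mean
prop_within <- function(x, k) {
  mean(abs(x - mean(x)) <= k * sd(x))
}

# Toy example (made-up values, not from Songs.csv)
toy <- c(2.5, 3.0, 3.2, 3.5, 3.8, 4.0, 4.4, 9.0)
sapply(k, function(kk) prop_within(toy, kk))
```

Calling `prop_within(songs_data$duration_min, 1)` should match the first value in the `Proportion_Within` column, which can then be compared with the first entry of `normal_share`.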
ggplot(data = songs_data, mapping = aes(x = duration_min)) +
geom_boxplot(fill = "lemonchiffon3", color = "purple") +
labs(x = "Song Durations (in minutes)",
title = "Boxplot of Song Durations",
# Modify the following subtitle.
subtitle = "Group ...") +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
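The boxplot shows that the distribution of song durations (in minutes) is right-skewed: the longest songs, up to the 9.13-minute maximum, lie much farther above the median than the shortest songs lie below it.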
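The points drawn beyond the whiskers can be checked numerically. The sketch below applies the usual 1.5 × IQR rule (the default in `geom_boxplot()`) to the quartiles reported in the `favstats()` table; the variable names are illustrative choices, not part of the assignment code.

```r
# Summary statistics copied from the favstats() table
q1      <- 3.225
q3      <- 4.345833
min_dur <- 2.166667
max_dur <- 9.133333

# Interquartile range and the 1.5 * IQR fences
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(IQR = iqr, lower_fence = lower_fence, upper_fence = upper_fence)

# TRUE if at least one song lies beyond the corresponding fence
min_dur < lower_fence
max_dur > upper_fence
```

The upper fence comes out at roughly 6 minutes, which is consistent with the handful of songs longer than 6 minutes seen in the histogram, while the shortest song sits above the lower fence.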
genre_counts <- songs_data %>%
group_by(genre) %>%
summarise(Count = n()) %>%
mutate(Proportion = Count / sum(Count))
print(genre_counts)
## # A tibble: 7 × 3
## genre Count Proportion
## <chr> <int> <dbl>
## 1 Ambient 3 0.0341
## 2 Classical 7 0.0795
## 3 Hip-Hop 6 0.0682
## 4 Indie 12 0.136
## 5 Pop 39 0.443
## 6 R&B 13 0.148
## 7 Rock 8 0.0909
ggplot(genre_counts, aes(x = reorder(genre, -Proportion), y = Proportion)) +
geom_bar(stat = "identity", fill = "darkgreen") +
labs(title = "Genre Proportion Distribution",
# Modify the following subtitle.
subtitle = "Group ...",
x = "Genre", y = "Proportion") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The frequency distribution of song genres shows that Pop is the most common genre, accounting for approximately 44.3% of all songs. It is followed by R&B (14.8%) and Indie (13.6%). Genres like Ambient and Hip-Hop are less represented, each below 7%.
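The percentages quoted above can be verified directly from the counts. This is a small base R sketch that retypes the counts from the `genre_counts` output; the object names `counts` and `props` are illustrative.

```r
# Genre counts copied from the genre_counts output
counts <- c(Ambient = 3, Classical = 7, "Hip-Hop" = 6, Indie = 12,
            Pop = 39, "R&B" = 13, Rock = 8)

# Total number of songs
sum(counts)

# Proportions, sorted from most to least common, shown as percentages
props <- sort(prop.table(counts), decreasing = TRUE)
round(100 * props, 1)
```

The total should equal the 88 songs in the data set, and the first three entries should be Pop, R&B, and Indie.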
summary_table <- songs_data %>%
group_by(genre) %>%
summarise(
Min = min(duration_min),
Q1 = quantile(duration_min, 0.25),
Median = median(duration_min),
Q3 = quantile(duration_min, 0.75),
Max = max(duration_min),
Mean = mean(duration_min),
SD = sd(duration_min)
)
summary_table
ggplot(data = songs_data, mapping = aes(x = genre,
y = duration_min,
fill = genre)) +
geom_boxplot() +
labs(x = "Genre",
y = "Song Durations (in minutes)",
fill = "Genre",
title = "Side-by-side Boxplots of Song Durations (in minutes) Grouped by Genre",
# Modify the following subtitle.
subtitle = "Constructed by Group ...") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))
The side-by-side boxplots of song durations (in minutes) grouped by
genre show notable differences in typical durations. Ambient and
Hip-Hop songs tend to be longer and more variable, with some
extreme durations. R&B, Pop, and Indie songs tend to be shorter and
more consistent in length.
Have any questions? Post them. R Lady and her teaching team will get back to you.