\href{https://allisonhorst.com/}{Artwork by Allison Horst}

Purpose of the Task

This is a collaborative data analysis project. In your small groups, you will explore the distribution of durations of selected songs by their genre. The data set is stored in the CSV file “Songs.csv” and contains a list of 88 songs.

List of variables in the data set:

You will consider the following data investigations:

  1. Plot a histogram of song durations (in minutes). Describe the shape, centre, and spread.
  2. Obtain the summary statistics for the distribution of song durations (in minutes). Interpret the mean and the standard deviation within the context of the data.
  3. Check whether the Empirical Rule holds for the distribution of song durations (in minutes).
  4. Plot a boxplot of song durations (in minutes). Describe the shape, centre, and spread.
  5. Obtain a frequency table and a bar plot for song genres. Describe the frequency distribution of song genres.
  6. Plot side-by-side boxplots of song durations (in minutes) by genre. Describe the shape, centre, and spread.
  7. Check whether any individually plotted point(s) on the boxplot(s) is/are suspected or highly suspected outlier(s).

Enjoy making sense of song durations data :)

Load the Libraries

library(tidyverse)
library(mosaic)
library(knitr)
library(RColorBrewer)

Load Songs Data Set

songs_data <- read.csv("Songs.csv")
# Convert duration from seconds to minutes
songs_data$duration_min <- songs_data$duration_sec / 60
attach(songs_data)

Histogram of Song Durations (in minutes)

ggplot(data = songs_data, aes(x = duration_min)) +
  geom_histogram(binwidth = 1, 
                 fill = "darkolivegreen", color = "darkorchid", alpha = 0.7) +
  scale_x_continuous(breaks = seq(0, max(songs_data$duration_min), by = 1), 
                     labels = paste0(seq(0, max(songs_data$duration_min), by = 1), " min")) +
  labs(title = "Histogram of Song Durations", 
       # Modify the following subtitle.
       subtitle = "Constructed by Group ...",
       x = "Duration (minutes)", 
       y = "Count") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),  
    plot.subtitle = element_text(hjust = 0.5) 
  )

The histogram plot shows that the distribution of song durations (in minutes) is slightly right-skewed. Most songs are clustered between 2 and 5 minutes, with a few songs extending beyond 6 minutes. The distribution has a single peak and gradually tapers off, indicating fewer longer songs.

Summary Statistics for Song Durations (in minutes)

kable(favstats(duration_min))
     min     Q1  median        Q3       max      mean        sd   n  missing
2.166667  3.225    3.75  4.345833  9.133333  3.895076  1.094425  88        0

Song durations average 3.90 minutes, with a standard deviation of 1.09 minutes; that is, a typical song's duration differs from the mean by about 1.09 minutes. The minimum duration is about 2.17 minutes and the maximum is about 9.13 minutes. [here you interpret standard deviation within the context of data].

Check whether the Empirical Rule holds for Song Durations (in minutes)

Mean = mean(duration_min)
SD = sd(duration_min)

# Compute the interval bounds for 1, 2, and 3 standard deviations
intervals <- tibble(
  SD_Level = c("Mean ± 1 SD", "Mean ± 2 SD", "Mean ± 3 SD"),
  Lower_Bound = c(Mean - SD, Mean - 2*SD, Mean - 3*SD),
  Upper_Bound = c(Mean + SD, Mean + 2*SD, Mean + 3*SD)
)
# Display the intervals
intervals
# Calculate the proportion of data values within each interval
intervals <- intervals %>%
  rowwise() %>%
  mutate(
    Proportion_Within = sum(duration_min >= Lower_Bound & duration_min <= Upper_Bound) / length(duration_min)
  )

# Display the intervals with proportions
intervals

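The Empirical Rule predicts that about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean, respectively. Comparing these benchmarks with the Proportion_Within column, the rule appears to hold reasonably well for the distribution of song durations.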
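To make the comparison with the Empirical Rule explicit, the observed proportions can be placed next to the 68%, 95%, and 99.7% benchmarks. The following is a minimal sketch in base R; the `durations` vector and the helper `within_k()` are hypothetical and used only for illustration, so replace `durations` with `songs_data$duration_min` when working with the real data.

```r
# Hypothetical durations (in minutes), for illustration only;
# these are NOT the values in Songs.csv
durations <- c(2.2, 2.9, 3.1, 3.3, 3.5, 3.7, 3.8, 4.0, 4.2, 4.4, 4.9, 6.5)

# Hypothetical helper: proportion of values within k standard deviations of the mean
within_k <- function(x, k) mean(abs(x - mean(x)) <= k * sd(x))

# Observed proportions next to the Empirical Rule benchmarks
comparison <- data.frame(
  k        = 1:3,
  observed = sapply(1:3, function(k) within_k(durations, k)),
  expected = c(0.68, 0.95, 0.997)
)
comparison$difference <- comparison$observed - comparison$expected

comparison
```

Small differences in the last column suggest the rule describes the data reasonably well; large differences, especially at 1 SD, suggest a shape that departs from the bell curve.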

Boxplot of Song Durations (in minutes)

ggplot(data = songs_data, mapping = aes(x = duration_min)) +
  geom_boxplot(fill = "lemonchiffon3", color = "purple") +
  labs(x = "Song Durations (in minutes)", 
       title = "Boxplot of Song Durations",
       # Modify the following subtitle.
       subtitle = "Group ...") +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5))

The boxplot shows that the distribution of song durations (in minutes) is right-skewed.
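The right skew can also be checked numerically from the summary statistics reported earlier. The sketch below simply copies those reported values by hand (the names `stats` and `skew_checks` are illustrative, not part of the original analysis) and compares the mean with the median and the upper parts of the boxplot with the lower parts.

```r
# Values copied from the favstats() output reported above
stats <- c(min = 2.166667, Q1 = 3.225, median = 3.75,
           Q3 = 4.345833, max = 9.133333, mean = 3.895076)

# In a right-skewed distribution we expect mean > median and a longer upper side
skew_checks <- c(
  mean_minus_median = stats[["mean"]] - stats[["median"]],
  upper_box         = stats[["Q3"]] - stats[["median"]],
  lower_box         = stats[["median"]] - stats[["Q1"]],
  upper_tail        = stats[["max"]] - stats[["Q3"]],
  lower_tail        = stats[["Q1"]] - stats[["min"]]
)

round(skew_checks, 3)
```

Here the mean exceeds the median, and the distance from Q3 to the maximum is much larger than the distance from the minimum to Q1, which is consistent with a right-skewed distribution.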

Create Frequency Distribution for Song Genres: Counts and Proportions

genre_counts <- songs_data %>%
  group_by(genre) %>%
  summarise(Count = n()) %>%
  mutate(Proportion = Count / sum(Count))

print(genre_counts)
## # A tibble: 7 × 3
##   genre     Count Proportion
##   <chr>     <int>      <dbl>
## 1 Ambient       3     0.0341
## 2 Classical     7     0.0795
## 3 Hip-Hop       6     0.0682
## 4 Indie        12     0.136 
## 5 Pop          39     0.443 
## 6 R&B          13     0.148 
## 7 Rock          8     0.0909

Bar Plot of Frequency Distribution for Song Genres

ggplot(genre_counts, aes(x = reorder(genre, -Proportion), y = Proportion)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  labs(title = "Genre Proportion Distribution",
      # Modify the following subtitle.
       subtitle = "Group ...",
       x = "Genre", y = "Proportion") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5))

The frequency distribution of song genres shows that Pop is the most common genre, accounting for approximately 44.3% of all songs. It is followed by R&B (14.8%) and Indie (13.6%). Genres like Ambient and Hip-Hop are less represented, each below 7%.

Summary for Song Durations (in minutes) Grouped by Genre

summary_table <- songs_data %>%
  group_by(genre) %>%
  summarise(
    Min = min(duration_min),
    Q1 = quantile(duration_min, 0.25),
    Median = median(duration_min),
    Q3 = quantile(duration_min, 0.75),
    Max = max(duration_min),
    Mean = mean(duration_min),
    SD = sd(duration_min)
  )
summary_table

Side-by-side Box plots for Song Durations (in minutes) by Genre

ggplot(data = songs_data, mapping = aes(x = genre, 
                                  y = duration_min,
                                  fill = genre)) +
  geom_boxplot() +
  labs(x = "Genre", 
       y = "Song Durations (in minutes)",
       fill = "Genre",
       title = "Side-by-side Boxplots of Song Durations (in minutes) Grouped by Genre",
       # Modify the following subtitle.
       subtitle = "Constructed by Group ...") +
  theme_minimal() + 
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5))

The side-by-side boxplots of song durations (in minutes) grouped by genre show notable differences in typical durations across genres. Ambient and Hip-Hop songs tend to be longer and more variable in length, with some extreme durations. R&B, Pop, and Indie songs tend to be shorter and more consistent in length.
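The outlier check from the task list can be sketched as follows. This is a minimal illustration for the overall distribution, assuming the common convention that points beyond 1.5 × IQR from the quartiles are suspected outliers and points beyond 3 × IQR are highly suspected outliers (use the course's own definition if it differs). The quartiles are copied from the summary statistics reported earlier; `inner_fences`, `outer_fences`, and `classify_outlier()` are hypothetical names introduced here.

```r
# Quartiles and extremes copied from the favstats() output reported above
q1      <- 3.225
q3      <- 4.345833
min_dur <- 2.166667
max_dur <- 9.133333

iqr <- q3 - q1

# Assumed convention: 1.5 * IQR fences flag suspected outliers
inner_fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
# Assumed convention: 3 * IQR fences flag highly suspected outliers
outer_fences <- c(lower = q1 - 3 * iqr, upper = q3 + 3 * iqr)

# Hypothetical helper: label each value according to the fences above
classify_outlier <- function(x) {
  ifelse(x < outer_fences[["lower"]] | x > outer_fences[["upper"]],
         "highly suspected outlier",
         ifelse(x < inner_fences[["lower"]] | x > inner_fences[["upper"]],
                "suspected outlier",
                "not an outlier"))
}

round(inner_fences, 2)
round(outer_fences, 2)
classify_outlier(c(min_dur, max_dur))
```

With these values the upper fences are about 6.03 and 7.71 minutes, so the longest song (about 9.13 minutes) lies beyond the outer fence, while the shortest song is not flagged. To repeat the check per genre, compute the quartiles within each genre (for example from `summary_table`) and apply the same fences.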

Have any questions? Post them \textcolor{blue}{\href{https://q.utoronto.ca/courses/392145/discussion_topics/3091350?module_item_id=6821413}{here}}. R Lady and her teaching team will get back to you.
