In this project, we are simulating a song-shuffling feature, such as the one used by Spotify. (more detail here on the background etc.)
The data for this project comes from… Before importing the data into R, we…
To organize the data in R, I first need to load the necessary libraries and import the data. I will also look at the data to see whether it needs any additional cleaning.
# load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# import data
df <- read_csv("playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
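The import message above suggests specifying the column types. As a minimal sketch, assuming the same three column names, the genre column could be read as a factor at import time. The inline CSV text is a stand-in for the real file (an assumption for illustration), and `demo` is a hypothetical name.

```r
library(readr)

# Stand-in for the real file: three rows of inline CSV text (illustration only)
csv_text <- I("genre,artist,title
Genre 1 - hiphop/rap,Alesso,When I'm Gone
Genre 1 - hiphop/rap,Big Sean,Beware
Genre 4 - rnb/soul,Ella Mai,Trip")

# Declaring the column types up front also quiets the column specification message
demo <- read_csv(
  csv_text,
  col_types = cols(
    genre  = col_factor(),
    artist = col_character(),
    title  = col_character()
  )
)

str(demo$genre)
```

With the real data, the file name would take the place of `csv_text`. One difference to keep in mind: `col_factor()` orders the levels by first appearance in the file, while `as.factor()` sorts them alphabetically.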
# check the first few rows of data
head(df)
## # A tibble: 6 × 3
## genre artist title
## <chr> <chr> <chr>
## 1 Genre 1 - hiphop/rap Alesso When I'm Gone
## 2 Genre 1 - hiphop/rap Aliyah's Interlude It Girl
## 3 Genre 1 - hiphop/rap Big Sean Beware
## 4 Genre 1 - hiphop/rap Clipse Ma, I don't love her
## 5 Genre 1 - hiphop/rap Common Be
## 6 Genre 1 - hiphop/rap Drake Nice for what
Looking at the first 6 rows, I think the column names work, and all 3 columns were read in as character, which is accurate. I would like the genre column to be a factor so that I can group the songs by genre. Then I will look at the counts of each genre.
# change variable type to factor
df$genre <- as.factor(df$genre)
# get the counts for each genre
df %>% count(genre)
## # A tibble: 6 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 26
## 2 Genre 2 - pop/kpop/Latin 22
## 3 Genre 3 - house/ska 5
## 4 Genre 4 - rnb/soul 43
## 5 Genre 5 - alt/indie/folk 34
## 6 Genre 6 - country/rock 14
Now I’m going to save this table of genre counts so I can easily access it again later.
genre_counts <- df %>% count(genre)
Next I need to find the theoretical probabilities of getting each genre by creating a new column in my genre counts table.
genre_counts <- genre_counts %>% mutate(prob = n/sum(n))
genre_counts
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 26 0.181
## 2 Genre 2 - pop/kpop/Latin 22 0.153
## 3 Genre 3 - house/ska 5 0.0347
## 4 Genre 4 - rnb/soul 43 0.299
## 5 Genre 5 - alt/indie/folk 34 0.236
## 6 Genre 6 - country/rock 14 0.0972
I don’t like all those long decimal numbers, so I’m going to round them to 2 decimal places.
genre_counts$prob <- round(genre_counts$prob, 2)
genre_counts
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 26 0.18
## 2 Genre 2 - pop/kpop/Latin 22 0.15
## 3 Genre 3 - house/ska 5 0.03
## 4 Genre 4 - rnb/soul 43 0.3
## 5 Genre 5 - alt/indie/folk 34 0.24
## 6 Genre 6 - country/rock 14 0.1
This new table tells me that there is an 18% chance of getting a hiphop/rap song, a 15% chance of getting a pop/kpop/Latin song, a 3% chance of getting a house/ska song, a 30% chance of getting an rnb/soul song, a 24% chance of getting an alt/indie/folk song, and a 10% chance of getting a country/rock song.
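The arithmetic behind the `prob` column can be checked by hand in base R. This is a small sketch that copies the six counts from the table above; the short genre labels are made up for readability.

```r
# Genre counts copied from the table above (short labels are illustrative)
n <- c(hiphop_rap = 26, pop_kpop_latin = 22, house_ska = 5,
       rnb_soul = 43, alt_indie_folk = 34, country_rock = 14)

# Theoretical probability of each genre = songs in that genre / total songs
prob <- n / sum(n)

sum(n)          # total number of songs in the playlist
round(prob, 4)  # probabilities before rounding to 2 decimal places
sum(prob)       # the unrounded probabilities add up to 1
```

Rounding is convenient for reading the table, but any later calculation is safer with the unrounded values, since rounded probabilities are not guaranteed to sum to exactly 1.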
Our next task is to simulate how shuffle works by pulling a random sample of songs and saving it as a new table.
sample <- sample_n(df, 10)
sample
## # A tibble: 10 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 4 - rnb/soul Ella Mai Trip
## 2 Genre 4 - rnb/soul Beyonce Summertime
## 3 Genre 6 - country/rock Dolly parton Jolene
## 4 Genre 1 - hiphop/rap Isaiah Rashad Headshots
## 5 Genre 1 - hiphop/rap Lakeyah Worst thing
## 6 Genre 6 - country/rock Nirvana Come as you are
## 7 Genre 5 - alt/indie/folk Taylor Swift Ivy
## 8 Genre 5 - alt/indie/folk Nirvana About A Girl
## 9 Genre 2 - pop/kpop/Latin Natalia lafourcade Hasta la raiz
## 10 Genre 2 - pop/kpop/Latin Beyonce Break My Soul
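Because the draw is random, rerunning the code above gives a different set of 10 songs each time. A minimal sketch of one way to make a draw repeatable is to set a seed first. The toy `playlist` below only mimics the genre counts of the real data (titles are placeholders), the seed value is arbitrary, and `slice_sample()` is used here as the newer dplyr replacement for `sample_n()`.

```r
library(dplyr)

# Toy playlist with the same genre counts as the real data (titles are placeholders)
playlist <- tibble(
  genre = rep(paste("Genre", 1:6), times = c(26, 22, 5, 43, 34, 14)),
  title = paste("Song", 1:144)
)

# Fixing the seed makes the random draw repeatable; 123 is an arbitrary choice
set.seed(123)
shuffle_a <- slice_sample(playlist, n = 10)

set.seed(123)
shuffle_b <- slice_sample(playlist, n = 10)

# Same seed, same ten songs in the same order
identical(shuffle_a, shuffle_b)
```

Storing the result under a name such as `shuffle_a` rather than `sample` also avoids confusion with base R's `sample()` function.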
Now I will summarize this sample to show each genre's count and its experimental probability.
sample_counts <- sample %>% count(genre) %>% mutate(probability = n/sum(n))
sample_counts
## # A tibble: 5 × 3
## genre n probability
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 2 0.2
## 2 Genre 2 - pop/kpop/Latin 2 0.2
## 3 Genre 4 - rnb/soul 2 0.2
## 4 Genre 5 - alt/indie/folk 2 0.2
## 5 Genre 6 - country/rock 2 0.2
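A genre with no songs in the sample does not appear in the table above, because `count()` drops empty factor levels by default. As a sketch, passing `.drop = FALSE` keeps every genre and reports a count of 0. The toy data below reproduces only the genre counts of this sample; the names `toy_sample` and `toy_counts` are hypothetical.

```r
library(dplyr)

# Toy sample of 10 songs matching the genre counts in the table above
all_genres <- paste("Genre", 1:6)
toy_sample <- tibble(
  genre = factor(rep(all_genres[c(1, 2, 4, 5, 6)], each = 2), levels = all_genres)
)

# .drop = FALSE keeps factor levels that have no rows, so Genre 3 appears with n = 0
toy_counts <- toy_sample %>%
  count(genre, .drop = FALSE) %>%
  mutate(probability = n / sum(n))

toy_counts
```

Keeping the zero rows makes every sample table the same length, which helps when comparing it line by line with the theoretical probabilities.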
These probabilities all come out to one decimal place, so I don’t think I need to round them.
The first thing that stands out to me is that there are no songs from Genre 3. This isn’t super surprising because there is only a 3% chance of getting a song from Genre 3. I also noticed that the theoretical and experimental probabilities for Genre 1 and Genre 5 are pretty close, while the ones for Genres 2, 4, and 6 are further apart. Genres 2 and 6 had higher experimental probabilities than theoretical, while Genre 4 had a lower experimental probability.
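How unsurprising is a sample with no Genre 3 songs? One way to check, sketched here in base R, is to treat the shuffle as drawing 10 of the 144 songs without replacement, so the number of Genre 3 songs follows a hypergeometric distribution. The variable names are illustrative.

```r
# Playlist facts taken from the genre counts: 144 songs, 5 of them in Genre 3
total_songs  <- 144
genre3_songs <- 5
sample_size  <- 10

# Chance that a sample of 10 (drawn without replacement) contains zero Genre 3 songs
p_none <- dhyper(0, m = genre3_songs, n = total_songs - genre3_songs, k = sample_size)

# Same value computed directly by counting combinations
p_none_check <- choose(total_songs - genre3_songs, sample_size) /
  choose(total_songs, sample_size)

p_none
```

The result is roughly 0.69, so a sample of 10 with no Genre 3 song is the more likely outcome, not a rare one.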
We combined all of our samples of 10 as a class. You can see the results in the table below.
combined <- tibble(genre = genre_counts$genre, n = c(2, 3, 0, 1, 2, 2), probability = n/sum(n))
combined
## # A tibble: 6 × 3
## genre n probability
## <fct> <dbl> <dbl>
## 1 Genre 1 - hiphop/rap 2 0.2
## 2 Genre 2 - pop/kpop/Latin 3 0.3
## 3 Genre 3 - house/ska 0 0
## 4 Genre 4 - rnb/soul 1 0.1
## 5 Genre 5 - alt/indie/folk 2 0.2
## 6 Genre 6 - country/rock 2 0.2
Analysis…
I will now run my simulation for different numbers of songs to see how the probabilities change.
Sampling 20
sample_n(df, 20) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 5 × 3
## genre n probability
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 6 0.3
## 2 Genre 2 - pop/kpop/Latin 3 0.15
## 3 Genre 3 - house/ska 1 0.05
## 4 Genre 4 - rnb/soul 6 0.3
## 5 Genre 5 - alt/indie/folk 4 0.2
Sampling 30
sample_n(df, 30) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 5 × 3
## genre n probability
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 7 0.23
## 2 Genre 2 - pop/kpop/Latin 7 0.23
## 3 Genre 4 - rnb/soul 9 0.3
## 4 Genre 5 - alt/indie/folk 5 0.17
## 5 Genre 6 - country/rock 2 0.07
Sampling 50
sample_n(df, 50) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 6 × 3
## genre n probability
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 9 0.18
## 2 Genre 2 - pop/kpop/Latin 6 0.12
## 3 Genre 3 - house/ska 2 0.04
## 4 Genre 4 - rnb/soul 15 0.3
## 5 Genre 5 - alt/indie/folk 12 0.24
## 6 Genre 6 - country/rock 6 0.12
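To compare the sample sizes above in one place, here is a rough sketch that measures, for each size, the largest gap between a genre's experimental and theoretical probability. It uses a toy playlist with the same genre counts as the real data; `max_gap()` is a hypothetical helper, and the seed and the list of sizes are arbitrary choices.

```r
library(dplyr)
library(purrr)

# Toy playlist with the same genre counts as the real data
genre_sizes <- c(26, 22, 5, 43, 34, 14)
playlist <- tibble(genre = factor(rep(paste("Genre", 1:6), times = genre_sizes)))
theoretical <- genre_sizes / sum(genre_sizes)

# Largest gap between experimental and theoretical probability for one sample
max_gap <- function(size) {
  experimental <- playlist %>%
    slice_sample(n = size) %>%
    count(genre, .drop = FALSE) %>%   # keep genres with zero songs
    mutate(probability = n / sum(n)) %>%
    pull(probability)
  max(abs(experimental - theoretical))
}

set.seed(1)  # arbitrary seed so the run is repeatable
sizes <- c(10, 20, 30, 50, 100, 144)
gaps <- tibble(size = sizes, max_gap = map_dbl(sizes, max_gap))
gaps
```

A single run can wobble, so the gap will not necessarily shrink at every step, but it tends to get smaller as the sample grows. Because the songs are drawn without replacement, a sample of all 144 songs matches the theoretical probabilities exactly.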
Analysis of probabilities, Law of Large numbers, etc. Answering first set of questions from handout 5.3.
Reflection on data science process and learning to code.