Introduction

In this project, we simulate a song shuffling program like the one Spotify uses. (more detail here on the background etc.)

Gathering & Organizing Data

The data for this project comes from… Before importing the data into R, we…

To organize the data in R, I first need to load the necessary libraries and import the data. I will also look at the data to see whether it needs any additional cleaning.

# load necessary libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# import data
df <- read_csv("playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# check first few rows of data
head(df)
## # A tibble: 6 × 3
##   genre                artist             title               
##   <chr>                <chr>              <chr>               
## 1 Genre 1 - hiphop/rap Alesso             When I'm Gone       
## 2 Genre 1 - hiphop/rap Aliyah's Interlude It Girl             
## 3 Genre 1 - hiphop/rap Big Sean           Beware              
## 4 Genre 1 - hiphop/rap Clipse             Ma, I don't love her
## 5 Genre 1 - hiphop/rap Common             Be                  
## 6 Genre 1 - hiphop/rap Drake              Nice for what

Looking at the first 6 rows, the column names work and all 3 columns were read in as character, which is accurate. I would like the genre column to be a factor so that I can group the songs by genre. Then I will look at the counts for each genre.

# change variable type to factor
df$genre <- as.factor(df$genre)

# get the counts for each genre
df %>% count(genre)
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap        26
## 2 Genre 2 - pop/kpop/Latin    22
## 3 Genre 3 -  house/ska         5
## 4 Genre 4 -  rnb/soul         43
## 5 Genre 5 - alt/indie/folk    34
## 6 Genre 6 - country/rock      14
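To illustrate what converting to a factor does, here is a minimal base R sketch on a toy vector. The labels in `genres` are made up for illustration and are not the playlist data. It shows that a factor keeps a fixed set of levels, so a genre with no songs still appears with a count of 0.

```r
# Toy genre labels (hypothetical, not the playlist data)
genres <- c("rock", "pop", "rock", "jazz", "pop", "rock")

# as.factor() stores the distinct labels as levels, sorted alphabetically
genre_factor <- as.factor(genres)
levels(genre_factor)

# table() counts how many times each level occurs
table(genre_factor)

# Dropping every "jazz" entry keeps "jazz" as a level, so it is counted as 0
table(genre_factor[genre_factor != "jazz"])
```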

Now I’m going to save this table of counts so I can easily access it again later.

genre_counts <- df %>% count(genre)

Modeling

Calculating Theoretical Probabilities

Next I need to find the theoretical probability of getting each genre, which is that genre's count divided by the total number of songs. I will add it as a new column in my genre counts table.

genre_counts <- genre_counts %>% mutate(prob = n/sum(n))
genre_counts
## # A tibble: 6 × 3
##   genre                        n   prob
##   <fct>                    <int>  <dbl>
## 1 Genre 1 - hiphop/rap        26 0.181 
## 2 Genre 2 - pop/kpop/Latin    22 0.153 
## 3 Genre 3 -  house/ska         5 0.0347
## 4 Genre 4 -  rnb/soul         43 0.299 
## 5 Genre 5 - alt/indie/folk    34 0.236 
## 6 Genre 6 - country/rock      14 0.0972

I don’t like all those long decimal numbers, so I’m going to round them to 2 decimal places.

genre_counts$prob <- round(genre_counts$prob, 2)
genre_counts
## # A tibble: 6 × 3
##   genre                        n  prob
##   <fct>                    <int> <dbl>
## 1 Genre 1 - hiphop/rap        26  0.18
## 2 Genre 2 - pop/kpop/Latin    22  0.15
## 3 Genre 3 -  house/ska         5  0.03
## 4 Genre 4 -  rnb/soul         43  0.3 
## 5 Genre 5 - alt/indie/folk    34  0.24
## 6 Genre 6 - country/rock      14  0.1

This new table tells me that there is an 18% chance of getting a hiphop/rap song, a 15% chance of getting a pop/kpop/Latin song, a 3% chance of getting a house/ska song, a 30% chance of getting an rnb/soul song, a 24% chance of getting an alt/indie/folk song, and a 10% chance of getting a country/rock song.
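As a quick check on the arithmetic, the same probabilities can be worked out by hand from the six counts. This is a minimal base R sketch: the vector `n` is copied from the genre counts table above, and the names `total` and `prob` are just for illustration.

```r
# Genre counts copied from the table above (Genres 1 to 6)
n <- c(26, 22, 5, 43, 34, 14)

total <- sum(n)      # 144 songs in the playlist
prob <- n / total    # theoretical probability of each genre

round(prob, 4)
sum(prob)            # the unrounded probabilities add up to 1 (up to floating-point error)
round(prob, 2)       # 0.18 0.15 0.03 0.30 0.24 0.10
```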

Calculating Experimental Probabilities

Our next task is to simulate how shuffle works by pulling a random sample of songs and saving it as a new table.

sample <- sample_n(df, 10)
sample
## # A tibble: 10 × 3
##    genre                    artist             title          
##    <fct>                    <chr>              <chr>          
##  1 Genre 4 -  rnb/soul      Ella Mai           Trip           
##  2 Genre 4 -  rnb/soul      Beyonce            Summertime     
##  3 Genre 6 - country/rock   Dolly parton       Jolene         
##  4 Genre 1 - hiphop/rap     Isaiah Rashad      Headshots      
##  5 Genre 1 - hiphop/rap     Lakeyah            Worst thing    
##  6 Genre 6 - country/rock   Nirvana            Come as you are
##  7 Genre 5 - alt/indie/folk Taylor Swift       Ivy            
##  8 Genre 5 - alt/indie/folk Nirvana            About A Girl   
##  9 Genre 2 - pop/kpop/Latin Natalia lafourcade Hasta la raiz  
## 10 Genre 2 - pop/kpop/Latin Beyonce            Break My Soul
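The code above does not show a seed being set, so rerunning it would presumably draw a different set of 10 songs each time. The sketch below shows one way to make a random sample repeatable. It assumes the dplyr package is installed, uses a made-up stand-in table called `songs`, and picks an arbitrary seed value. It also uses `slice_sample()`, which the dplyr documentation lists as the replacement for the superseded `sample_n()`.

```r
library(dplyr)

# Small stand-in playlist (hypothetical titles, not the project data)
songs <- tibble(title = paste("Song", 1:20))

# Setting the same seed before each draw makes the "random" sample repeat
set.seed(2024)
first_draw <- slice_sample(songs, n = 5)

set.seed(2024)
second_draw <- slice_sample(songs, n = 5)

# Compare the two draws song by song
identical(first_draw$title, second_draw$title)
```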

Now I will adjust this table to show me just the genres and their probabilities.

sample_counts <- sample %>% count(genre) %>% mutate(probability = n/sum(n))
sample_counts
## # A tibble: 5 × 3
##   genre                        n probability
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         2         0.2
## 2 Genre 2 - pop/kpop/Latin     2         0.2
## 3 Genre 4 -  rnb/soul          2         0.2
## 4 Genre 5 - alt/indie/folk     2         0.2
## 5 Genre 6 - country/rock       2         0.2

These probabilities already have just one decimal place, so I don’t need to round them.

The first thing that stands out to me is that there are no songs from Genre 3. This isn’t super surprising because there is only a 3% chance of getting a song from Genre 3. I also noticed that the theoretical and experimental probabilities for Genre 1 and Genre 5 are pretty close, while the ones for Genres 2, 4, and 6 are further apart. Genres 2 and 6 had higher experimental probabilities than theoretical, while Genre 4 had a lower experimental probability.
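To put a number on how unsurprising a sample with no Genre 3 songs is, the chance can be computed directly. This sketch assumes the 10 songs are drawn without replacement, which is the default for `sample_n()`, so the hypergeometric distribution applies. The variable names are just for illustration.

```r
# Counts from the playlist: 5 songs in Genre 3, 139 in every other genre
genre3_songs <- 5
other_songs <- 144 - genre3_songs
sample_size <- 10

# dhyper(x, m, n, k): chance of drawing exactly x "successes" when k items
# are drawn without replacement from m successes and n failures
p_none <- dhyper(0, m = genre3_songs, n = other_songs, k = sample_size)
p_none   # roughly 0.69

# The same value from counting combinations directly
choose(other_songs, sample_size) / choose(144, sample_size)
```

Under that assumption, a little over two thirds of all samples of 10 would contain no Genre 3 song at all.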

Combining Class Values

We combined all of our samples of 10 as a class. You can see the results in the table below.

combined <- tibble(genre = genre_counts$genre, n = c(2, 3, 0, 1, 2, 2), probability = n/sum(n))  

combined
## # A tibble: 6 × 3
##   genre                        n probability
##   <fct>                    <dbl>       <dbl>
## 1 Genre 1 - hiphop/rap         2         0.2
## 2 Genre 2 - pop/kpop/Latin     3         0.3
## 3 Genre 3 -  house/ska         0         0  
## 4 Genre 4 -  rnb/soul          1         0.1
## 5 Genre 5 - alt/indie/folk     2         0.2
## 6 Genre 6 - country/rock       2         0.2
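One possible way to line these values up against the theoretical probabilities is to subtract one from the other. The sketch below is base R only. The two vectors are copied from the tables above, and the names `theoretical`, `combined_n`, `experimental`, and `gap` are illustrative.

```r
# Values copied from the tables above (Genres 1 to 6)
theoretical <- c(0.18, 0.15, 0.03, 0.30, 0.24, 0.10)
combined_n <- c(2, 3, 0, 1, 2, 2)

experimental <- combined_n / sum(combined_n)

# Positive gap: the genre came up more often than expected
gap <- round(experimental - theoretical, 2)

data.frame(genre = paste("Genre", 1:6), theoretical, experimental, gap)
```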

Analysis…

More Simulations of Shuffle

I will now run my simulation for different numbers of songs to see how the probabilities change.

Sampling 20

sample_n(df, 20) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 5 × 3
##   genre                        n probability
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         6        0.3 
## 2 Genre 2 - pop/kpop/Latin     3        0.15
## 3 Genre 3 -  house/ska         1        0.05
## 4 Genre 4 -  rnb/soul          6        0.3 
## 5 Genre 5 - alt/indie/folk     4        0.2

Sampling 30

sample_n(df, 30) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 5 × 3
##   genre                        n probability
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         7        0.23
## 2 Genre 2 - pop/kpop/Latin     7        0.23
## 3 Genre 4 -  rnb/soul          9        0.3 
## 4 Genre 5 - alt/indie/folk     5        0.17
## 5 Genre 6 - country/rock       2        0.07

Sampling 50

sample_n(df, 50) %>% count(genre) %>% mutate(probability = round(n/sum(n), 2))
## # A tibble: 6 × 3
##   genre                        n probability
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         9        0.18
## 2 Genre 2 - pop/kpop/Latin     6        0.12
## 3 Genre 3 -  house/ska         2        0.04
## 4 Genre 4 -  rnb/soul         15        0.3 
## 5 Genre 5 - alt/indie/folk    12        0.24
## 6 Genre 6 - country/rock       6        0.12
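Each table above comes from a single shuffle, so some of the differences between sample sizes are just luck. A rough way to see the overall trend is to repeat each shuffle many times and average how far the experimental probabilities land from the theoretical ones. This is a sketch in base R, not the method used above: it rebuilds the genre column from the reported counts, and the helper names `shuffle_probs` and `mean_gap`, the seed, and the 1000 repetitions are all assumptions.

```r
# Rebuild the genre column from the counts reported earlier (Genres 1 to 6)
genre_n <- c(26, 22, 5, 43, 34, 14)
genres <- factor(rep(paste("Genre", 1:6), times = genre_n))
theoretical <- genre_n / sum(genre_n)

# One simulated shuffle: draw `size` songs without replacement and return
# the share of each genre (genres that were not drawn count as 0)
shuffle_probs <- function(size) {
  drawn <- sample(genres, size)
  as.numeric(table(drawn)) / size
}

# Average gap between experimental and theoretical probabilities,
# taken over many repeated shuffles of the same size
mean_gap <- function(size, reps = 1000) {
  gaps <- replicate(reps, mean(abs(shuffle_probs(size) - theoretical)))
  mean(gaps)
}

set.seed(1)  # arbitrary seed so the run can be repeated
sizes <- c(10, 20, 30, 50, 144)
avg_gaps <- sapply(sizes, mean_gap)
avg_gaps
```

Because the songs are drawn without replacement, a "sample" of all 144 songs is the whole playlist, so its gap is exactly 0.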

Analyze & Synthesize

Analysis of probabilities, Law of Large Numbers, etc. Answering the first set of questions from handout 5.3.

Reflection

Reflection on data science process and learning to code.