library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
playlist_data_clean <- read_csv("playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Project #5: Song Shuffle

Intro:

For this project, our class built a playlist from everyone's favorite songs, covering a range of genres and artists. Essential question: Do the genres you hear played on shuffle represent the genres of the songs in our class playlist?

playlist_data_clean$genre <- as.factor(playlist_data_clean$genre)
playlist_data_clean %>% count(genre)
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap        26
## 2 Genre 2 - pop/kpop/Latin    22
## 3 Genre 3 -  house/ska         5
## 4 Genre 4 -  rnb/soul         43
## 5 Genre 5 - alt/indie/folk    34
## 6 Genre 6 - country/rock      14
Genre_amount <- playlist_data_clean %>% count(genre)
sum(Genre_amount$n)
## [1] 144
total_songs <- sum(Genre_amount$n)

Calculating theoretical probability:

Here is the theoretical probability of each genre being played, found by dividing the number of songs in that genre by the total number of songs:

mutate(Genre_amount, genre_probability = Genre_amount$n/total_songs)
## # A tibble: 6 × 3
##   genre                        n genre_probability
##   <fct>                    <int>             <dbl>
## 1 Genre 1 - hiphop/rap        26            0.181 
## 2 Genre 2 - pop/kpop/Latin    22            0.153 
## 3 Genre 3 -  house/ska         5            0.0347
## 4 Genre 4 -  rnb/soul         43            0.299 
## 5 Genre 5 - alt/indie/folk    34            0.236 
## 6 Genre 6 - country/rock      14            0.0972
theoretical <- mutate(Genre_amount, genre_probability = Genre_amount$n/total_songs)
theory_prob <- theoretical %>% select(genre, genre_probability)

These numbers show the fraction of the playlist's 144 songs that belongs to each genre.
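As a minimal sketch of the same calculation, the probabilities can be rebuilt directly from the genre counts printed above. The tibble `genre_counts` is a hypothetical stand-in for `Genre_amount`, typed in by hand so the example runs without the CSV file.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-in for Genre_amount, using the counts printed above
genre_counts <- tibble(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  n = c(26, 22, 5, 43, 34, 14)
)

# Theoretical probability = songs in the genre / songs in the whole playlist
theory_check <- genre_counts %>%
  mutate(genre_probability = n / sum(n))

theory_check
sum(theory_check$genre_probability)  # the six probabilities should add up to 1
```

Inside `mutate()`, a column can be used by its name alone (`n`), so writing `Genre_amount$n` is not required.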

Simulating song shuffle:

We can simulate a shuffle of the playlist by drawing songs at random, with no song repeated within a sample, using sample_n().
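A random draw gives different songs every time the code runs. One way to make a draw repeatable, sketched below, is to call `set.seed()` first. The seed value 5 and the small `demo_playlist` tibble are made up for illustration, and `slice_sample()` is the newer dplyr function that does the same job as `sample_n()`.

```r
library(dplyr)
library(tibble)

# Made-up mini playlist so the example runs without the class CSV
demo_playlist <- tibble(
  genre = rep(c("hiphop/rap", "rnb/soul", "alt/indie/folk"), times = c(4, 5, 3)),
  title = paste("Song", 1:12)
)

set.seed(5)                                      # arbitrary seed, any number works
shuffle_a <- slice_sample(demo_playlist, n = 5)  # 5 songs, no repeats

set.seed(5)                                      # same seed again
shuffle_b <- slice_sample(demo_playlist, n = 5)

identical(shuffle_a, shuffle_b)                  # same seed gives the same shuffle
```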

Experiment 1

experiment1 <- sample_n(playlist_data_clean, 10)
experiment1
## # A tibble: 10 × 3
##    genre                    artist               title               
##    <fct>                    <chr>                <chr>               
##  1 Genre 1 - hiphop/rap     MF DOOM              Doomsday            
##  2 Genre 4 -  rnb/soul      Frank ocean          pyramids            
##  3 Genre 2 - pop/kpop/Latin Post Malone          Goodbyes            
##  4 Genre 4 -  rnb/soul      Lauryn Hill          Tell Him            
##  5 Genre 1 - hiphop/rap     Kendrick Lamar       Money Trees         
##  6 Genre 1 - hiphop/rap     Clipse               Ma, I don't love her
##  7 Genre 4 -  rnb/soul      sza                  snooze              
##  8 Genre 2 - pop/kpop/Latin olivia.R             pretty isnt pretty  
##  9 Genre 5 - alt/indie/folk Cigarettes after sex Sunsetz             
## 10 Genre 1 - hiphop/rap     Isaiah Rashad        Headshots
experiment1 %>% count(genre)
## # A tibble: 4 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         4
## 2 Genre 2 - pop/kpop/Latin     2
## 3 Genre 4 -  rnb/soul          3
## 4 Genre 5 - alt/indie/folk     1
experiment1_count <- experiment1 %>% count(genre)
mutate(experiment1_count, genre_prob1 = experiment1_count$n/10)
## # A tibble: 4 × 3
##   genre                        n genre_prob1
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         4         0.4
## 2 Genre 2 - pop/kpop/Latin     2         0.2
## 3 Genre 4 -  rnb/soul          3         0.3
## 4 Genre 5 - alt/indie/folk     1         0.1
ex1_prob <- mutate(experiment1_count, genre_prob1 = experiment1_count$n/10)
prob1 <- ex1_prob %>% select(genre, genre_prob1)

Experiment 2

experiment2 <- sample_n(playlist_data_clean, 30)
experiment2
## # A tibble: 30 × 3
##    genre                    artist         title                
##    <fct>                    <chr>          <chr>                
##  1 Genre 2 - pop/kpop/Latin Alec Benjamin  Devil Doesn't Bargain
##  2 Genre 5 - alt/indie/folk Cherry Glazerr Soft Drink           
##  3 Genre 4 -  rnb/soul      Drake          Look what You’ve done
##  4 Genre 4 -  rnb/soul      SZA            Kill Bill            
##  5 Genre 4 -  rnb/soul      Queen Naija    Medicine             
##  6 Genre 6 - country/rock   Dolly parton   Jolene               
##  7 Genre 5 - alt/indie/folk alkaline trio  Scars                
##  8 Genre 3 -  house/ska     KAYTRANDA      BUS RIDE             
##  9 Genre 5 - alt/indie/folk Fleetwood Mac  Landslide            
## 10 Genre 2 - pop/kpop/Latin Franglish      Trop Parler          
## # ℹ 20 more rows
experiment2 %>% count(genre)
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         3
## 2 Genre 2 - pop/kpop/Latin     4
## 3 Genre 3 -  house/ska         1
## 4 Genre 4 -  rnb/soul         11
## 5 Genre 5 - alt/indie/folk    10
## 6 Genre 6 - country/rock       1
experiment2_count <- experiment2 %>% count(genre)
mutate(experiment2_count, genre_prob2 = experiment2_count$n/30)
## # A tibble: 6 × 3
##   genre                        n genre_prob2
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap         3      0.1   
## 2 Genre 2 - pop/kpop/Latin     4      0.133 
## 3 Genre 3 -  house/ska         1      0.0333
## 4 Genre 4 -  rnb/soul         11      0.367 
## 5 Genre 5 - alt/indie/folk    10      0.333 
## 6 Genre 6 - country/rock       1      0.0333
ex2_prob <- mutate(experiment2_count, genre_prob2 = experiment2_count$n/30)
prob2 <- ex2_prob %>% select(genre, genre_prob2)

Experiment 3

experiment3 <- sample_n(playlist_data_clean, 75)
experiment3
## # A tibble: 75 × 3
##    genre                    artist               title               
##    <fct>                    <chr>                <chr>               
##  1 Genre 4 -  rnb/soul      Summer Walker        To summer, from cole
##  2 Genre 5 - alt/indie/folk The Internet         Under Control       
##  3 Genre 2 - pop/kpop/Latin shinee               Replay              
##  4 Genre 2 - pop/kpop/Latin ariana grande        everytime           
##  5 Genre 1 - hiphop/rap     Russ                 Handsomer           
##  6 Genre 5 - alt/indie/folk The Shins            New Slang           
##  7 Genre 5 - alt/indie/folk Nirvana              Come As You Are     
##  8 Genre 5 - alt/indie/folk Ginger Root          Loretta             
##  9 Genre 4 -  rnb/soul      Mariah the Scientist From A Woman        
## 10 Genre 4 -  rnb/soul      Fountains of Wayne   Halley's Waitress   
## # ℹ 65 more rows
experiment3 %>% count(genre)
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap        11
## 2 Genre 2 - pop/kpop/Latin    11
## 3 Genre 3 -  house/ska         4
## 4 Genre 4 -  rnb/soul         21
## 5 Genre 5 - alt/indie/folk    19
## 6 Genre 6 - country/rock       9
experiment3_count <- experiment3 %>% count(genre)
mutate(experiment3_count, genre_prob3 = experiment3_count$n/75)
## # A tibble: 6 × 3
##   genre                        n genre_prob3
##   <fct>                    <int>       <dbl>
## 1 Genre 1 - hiphop/rap        11      0.147 
## 2 Genre 2 - pop/kpop/Latin    11      0.147 
## 3 Genre 3 -  house/ska         4      0.0533
## 4 Genre 4 -  rnb/soul         21      0.28  
## 5 Genre 5 - alt/indie/folk    19      0.253 
## 6 Genre 6 - country/rock       9      0.12
ex3_prob <- mutate(experiment3_count, genre_prob3 = experiment3_count$n/75)
prob3 <- ex3_prob %>% select(genre, genre_prob3)
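The three experiments repeat the same steps with a different sample size each time. As a sketch, those steps could be wrapped in one function. `shuffle_probs()` is a hypothetical helper, not part of the original code, and `demo_playlist` is a made-up stand-in for `playlist_data_clean`.

```r
library(dplyr)
library(tibble)

# Made-up stand-in for playlist_data_clean (genre must be a factor for .drop = FALSE to help)
demo_playlist <- tibble(
  genre = factor(rep(c("hiphop/rap", "rnb/soul", "alt/indie/folk"),
                     times = c(26, 43, 34)))
)

# Hypothetical helper: sample `size` songs, count genres, turn counts into proportions
shuffle_probs <- function(data, size) {
  data %>%
    slice_sample(n = size) %>%
    count(genre, .drop = FALSE) %>%   # keep genres that were not sampled, with n = 0
    mutate(genre_prob = n / size)
}

shuffle_probs(demo_playlist, 10)
shuffle_probs(demo_playlist, 30)
```

Because of `.drop = FALSE`, a genre that is never drawn shows up with a count of 0 instead of disappearing from the table, which is what happened to two genres in Experiment 1.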

Next, we combine the theoretical probabilities and all of the experimental probabilities into one table so they can be compared side by side.

full_join(theoretical, ex1_prob, by = "genre")
## # A tibble: 6 × 5
##   genre                      n.x genre_probability   n.y genre_prob1
##   <fct>                    <int>             <dbl> <int>       <dbl>
## 1 Genre 1 - hiphop/rap        26            0.181      4         0.4
## 2 Genre 2 - pop/kpop/Latin    22            0.153      2         0.2
## 3 Genre 3 -  house/ska         5            0.0347    NA        NA  
## 4 Genre 4 -  rnb/soul         43            0.299      3         0.3
## 5 Genre 5 - alt/indie/folk    34            0.236      1         0.1
## 6 Genre 6 - country/rock      14            0.0972    NA        NA
theory_1 <- full_join(theoretical, ex1_prob, by = "genre")
full_join(theory_1, ex2_prob, by = "genre")
## # A tibble: 6 × 7
##   genre                n.x genre_probability   n.y genre_prob1     n genre_prob2
##   <fct>              <int>             <dbl> <int>       <dbl> <int>       <dbl>
## 1 Genre 1 - hiphop/…    26            0.181      4         0.4     3      0.1   
## 2 Genre 2 - pop/kpo…    22            0.153      2         0.2     4      0.133 
## 3 Genre 3 -  house/…     5            0.0347    NA        NA       1      0.0333
## 4 Genre 4 -  rnb/so…    43            0.299      3         0.3    11      0.367 
## 5 Genre 5 - alt/ind…    34            0.236      1         0.1    10      0.333 
## 6 Genre 6 - country…    14            0.0972    NA        NA       1      0.0333
theory_1_2 <- full_join(theory_1, ex2_prob, by = "genre")
full_join(theory_1_2, ex3_prob, by = "genre")
## # A tibble: 6 × 9
##   genre          n.x genre_probability   n.y genre_prob1 n.x.x genre_prob2 n.y.y
##   <fct>        <int>             <dbl> <int>       <dbl> <int>       <dbl> <int>
## 1 Genre 1 - h…    26            0.181      4         0.4     3      0.1       11
## 2 Genre 2 - p…    22            0.153      2         0.2     4      0.133     11
## 3 Genre 3 -  …     5            0.0347    NA        NA       1      0.0333     4
## 4 Genre 4 -  …    43            0.299      3         0.3    11      0.367     21
## 5 Genre 5 - a…    34            0.236      1         0.1    10      0.333     19
## 6 Genre 6 - c…    14            0.0972    NA        NA       1      0.0333     9
## # ℹ 1 more variable: genre_prob3 <dbl>
theory_1_2_3 <- full_join(theory_1_2, ex3_prob, by = "genre")
theory_vs_experiments <- theory_1_2_3 %>% select(genre, genre_probability, genre_prob1, genre_prob2, genre_prob3)
theory_vs_experiments
## # A tibble: 6 × 5
##   genre                    genre_probability genre_prob1 genre_prob2 genre_prob3
##   <fct>                                <dbl>       <dbl>       <dbl>       <dbl>
## 1 Genre 1 - hiphop/rap                0.181          0.4      0.1         0.147 
## 2 Genre 2 - pop/kpop/Latin            0.153          0.2      0.133       0.147 
## 3 Genre 3 -  house/ska                0.0347        NA        0.0333      0.0533
## 4 Genre 4 -  rnb/soul                 0.299          0.3      0.367       0.28  
## 5 Genre 5 - alt/indie/folk            0.236          0.1      0.333       0.253 
## 6 Genre 6 - country/rock              0.0972        NA        0.0333      0.12
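The `NA` values in the Experiment 1 column mean that no songs from those genres were drawn, so the experimental probability there is really 0. Below is a minimal sketch of one way to fill those in, using `replace_na()` from tidyr. The two small tibbles are hand-typed stand-ins for `theory_prob` and `prob1`, with values copied from the tables above.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hand-typed stand-ins for theory_prob and prob1
theory_demo <- tibble(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  genre_probability = c(26, 22, 5, 43, 34, 14) / 144
)
prob1_demo <- tibble(
  genre = c("hiphop/rap", "pop/kpop/Latin", "rnb/soul", "alt/indie/folk"),
  genre_prob1 = c(0.4, 0.2, 0.3, 0.1)
)

# Joining tables that hold only genre + probability avoids the extra n.x / n.y columns
compare_demo <- theory_demo %>%
  left_join(prob1_demo, by = "genre") %>%
  mutate(genre_prob1 = replace_na(genre_prob1, 0))  # genre not sampled = probability 0

compare_demo
```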

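In general, the bigger the sample is, the closer the experimental probability gets to the theoretical probability. The 75-song sample is the closest overall, while the 10-song sample is the furthest off (for example, 0.4 versus 0.181 for hiphop/rap). An NA in the Experiment 1 column means no songs from that genre were drawn, so its experimental probability is 0.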
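One way to check this with a number is to add up how far each experiment's probabilities are from the theoretical ones. This is a sketch using the counts from the tables above, with the genres missing from Experiment 1 counted as 0. The name `total_gap()` is made up for this example.

```r
# Genre counts copied from the tables above (genres 1 to 6, in order)
playlist <- c(26, 22, 5, 43, 34, 14)
exp1 <- c(4, 2, 0, 3, 1, 0)      # 10 songs
exp2 <- c(3, 4, 1, 11, 10, 1)    # 30 songs
exp3 <- c(11, 11, 4, 21, 19, 9)  # 75 songs

theory <- playlist / sum(playlist)

# Hypothetical helper: total distance between experimental and theoretical probabilities
total_gap <- function(counts) sum(abs(counts / sum(counts) - theory))

gaps <- c(n10 = total_gap(exp1), n30 = total_gap(exp2), n75 = total_gap(exp3))
round(gaps, 3)
```

For these three samples the gap shrinks as the sample grows (roughly 0.54, 0.33, and 0.12), though a single small sample can still land close to the theory by luck.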

Analyze and synthesize:

To answer the unit question: yes, the genres you hear on shuffle are representative of the genres on the playlist, and they become more representative the more songs you play.

Theoretical probability comes from the whole data set, whereas experimental probability comes from actual samples. In this project we saw theoretical probability when we calculated the probability of each genre in the entire playlist. Then we saw experimental probability when we calculated the probability of each genre in the experiments we conducted.

Here is how we used the data science process in this project:

Ask questions: We asked the unit question to guide us.

Gather and organize data: We collected songs from everyone in the class and sorted them by genre.

Model: We used RStudio to model the data in tables.

Analyze and synthesize: We looked at the results and considered the implications of the results.

Reflection:

Using a simulation to answer the question helped me get an idea of what it would be like to play a random selection of songs, with the random selection simulated by a few simple lines of code. You could use code in the same way to simulate dice rolls, or other situations with a fixed set of possible outcomes.
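For example, here is a sketch of the dice idea: `sample()` can roll a six-sided die many times, and the experimental probabilities can be compared with the theoretical 1/6. The number of rolls and the seed are arbitrary choices.

```r
set.seed(1)                                       # arbitrary seed so the rolls can be repeated
rolls <- sample(1:6, size = 600, replace = TRUE)  # replace = TRUE because a die can repeat a number

# Experimental probability of each face, next to the theoretical 1/6
table(rolls) / length(rolls)
1 / 6
```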

The most difficult part for me to code was using mutate to add columns showing probability. It took me a while to figure out the syntax.

The part of my coding I’m most proud of is the table I created comparing the theoretical probability with all the experimental probabilities. It was my own idea, and I was determined to figure it out. It took a lot of trial and error, of course, but with a lot of editing I got it working.

I would like to learn more about creating nice-looking graphics such as bar graphs and scatter plots, and about how to adjust their aesthetics.

I don’t think I’d do this project any differently because I think I learned things along the way and I’m proud of what I learned.