#In this project, everyone in our class was asked to submit 10 songs of any genre, which were then compiled into a class playlist. We want to see which songs play when we shuffle the playlist and the probability of each genre being played. The essential question we are aiming to answer is: What is the likelihood of a song of each genre being played?

#Here we loaded tidyverse, a set of packages that makes R easier to use and more accessible for beginners, and then imported our cleaned playlist data, which lists each song's genre, artist, and title.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Above, we imported our data into a data frame named df. Below, we converted the genre column to a factor so that R treats it as a categorical variable.

df$genre <- as.factor(df$genre)

#Below, we counted the songs in each genre to make sure our cleaned data looks right.

genre_counts <- df %>% count(genre)
genre_counts
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap        26
## 2 Genre 2 - pop/kpop/Latin    22
## 3 Genre 3 -  house/ska         5
## 4 Genre 4 -  rnb/soul         43
## 5 Genre 5 - alt/indie/folk    34
## 6 Genre 6 - country/rock      14

#We then calculated each genre's share of the playlist by dividing its count by the total number of songs.

genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 6 × 3
##   genre                        n   prob
##   <fct>                    <int>  <dbl>
## 1 Genre 1 - hiphop/rap        26 0.181 
## 2 Genre 2 - pop/kpop/Latin    22 0.153 
## 3 Genre 3 -  house/ska         5 0.0347
## 4 Genre 4 -  rnb/soul         43 0.299 
## 5 Genre 5 - alt/indie/folk    34 0.236 
## 6 Genre 6 - country/rock      14 0.0972

#We figured out what values we needed and what calculations would give us the probability of each genre being played from our playlist, and we added a new column with those calculations. I figured out that there are 144 songs and that rnb/soul is the most likely genre to play on shuffle, with a probability of 0.299, or about 30%. Below, we saved this table as pt.2.

pt.2 <- genre_counts %>% mutate(prob=n/sum(n))

#The prob column we just made gives the probability that a song played on shuffle belongs to each genre. For example, we should expect hiphop/rap songs to play about 18% of the time.
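
#A minimal sketch of the arithmetic behind the prob column, using the genre counts from the genre_counts table above. The short genre names are hypothetical labels made up for this example, and the expected counts assume a 10-song shuffle.

```r
# Genre counts copied from the genre_counts table above.
n <- c(hiphop_rap = 26, pop_kpop_latin = 22, house_ska = 5,
       rnb_soul = 43, alt_indie_folk = 34, country_rock = 14)

total <- sum(n)      # total number of songs in the playlist
prob  <- n / total   # theoretical probability for each genre

round(prob, 3)
sum(prob)            # the probabilities should add up to 1

# Expected number of songs from each genre in a 10-song shuffle.
expected_in_10 <- 10 * prob
round(expected_in_10, 1)
```

#Multiplying the probability by the shuffle size gives the number of songs we would expect on average, not the number we will see in any one shuffle.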

shuffle <- sample_n(df,10)
shuffle
## # A tibble: 10 × 3
##    genre                    artist      title         
##    <fct>                    <chr>       <chr>         
##  1 Genre 4 -  rnb/soul      Beyonce     Summertime    
##  2 Genre 2 - pop/kpop/Latin Taeyang     Wedding dress 
##  3 Genre 1 - hiphop/rap     Alesso      When I'm Gone 
##  4 Genre 3 -  house/ska     KAYTRANDA   BUS RIDE      
##  5 Genre 2 - pop/kpop/Latin Franglish   Trop Parler   
##  6 Genre 4 -  rnb/soul      Khamari     Drifting      
##  7 Genre 4 -  rnb/soul      lauren hill final hour    
##  8 Genre 1 - hiphop/rap     Rod Wave    Get Ready     
##  9 Genre 5 - alt/indie/folk Crumb       Dust Bunny    
## 10 Genre 4 -  rnb/soul      Sonder      What You Heard
shuffle <- shuffle %>% count(genre)
shuffle 
## # A tibble: 5 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         2
## 2 Genre 2 - pop/kpop/Latin     2
## 3 Genre 3 -  house/ska         1
## 4 Genre 4 -  rnb/soul          4
## 5 Genre 5 - alt/indie/folk     1
shuffle <- shuffle %>% mutate(prob=n/10)
shuffle
## # A tibble: 5 × 3
##   genre                        n  prob
##   <fct>                    <int> <dbl>
## 1 Genre 1 - hiphop/rap         2   0.2
## 2 Genre 2 - pop/kpop/Latin     2   0.2
## 3 Genre 3 -  house/ska         1   0.1
## 4 Genre 4 -  rnb/soul          4   0.4
## 5 Genre 5 - alt/indie/folk     1   0.1

#Compared to the original table of the whole playlist, I noticed that the probability of genre 1 was similar: 20% in this shuffle versus 18% in the playlist. Genre 6 has a probability of 0% here because it didn’t show up in this shuffle. The top genre also makes sense: rnb/soul had the highest probability in our original table (0.30), and it also had the highest probability in this shuffle (0.4).
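
#One possible way to put the shuffle and the playlist side by side, sketched with the genres of the 10 songs in the first shuffle shown above. The shortened genre labels and the names theory, first_shuffle, and comparison are assumptions for this example. Because genre is a factor, count() with .drop = FALSE keeps a row with n = 0 for a genre that did not show up.

```r
library(dplyr)

# Shortened genre labels (assumption: the real data uses longer labels
# such as "Genre 1 - hiphop/rap").
genre_levels <- c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                  "rnb/soul", "alt/indie/folk", "country/rock")

# Theoretical probabilities from the full playlist counts.
theory <- data.frame(
  genre = factor(genre_levels, levels = genre_levels),
  n_playlist = c(26, 22, 5, 43, 34, 14)
) %>%
  mutate(prob_playlist = n_playlist / sum(n_playlist))

# Genres of the 10 songs in the first shuffle, in the order shown above.
first_shuffle <- data.frame(
  genre = factor(
    c("rnb/soul", "pop/kpop/Latin", "hiphop/rap", "house/ska",
      "pop/kpop/Latin", "rnb/soul", "rnb/soul", "hiphop/rap",
      "alt/indie/folk", "rnb/soul"),
    levels = genre_levels
  )
)

# .drop = FALSE keeps factor levels that have zero songs in this shuffle.
comparison <- first_shuffle %>%
  count(genre, .drop = FALSE) %>%
  mutate(prob_shuffle = n / sum(n)) %>%
  left_join(theory, by = "genre")

comparison
```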

shuffle <- sample_n(df,7)
shuffle
## # A tibble: 7 × 3
##   genre                    artist              title                 
##   <fct>                    <chr>               <chr>                 
## 1 Genre 2 - pop/kpop/Latin Natalia lafourcade  Hasta la raiz         
## 2 Genre 1 - hiphop/rap     MF DOOM             Doomsday              
## 3 Genre 5 - alt/indie/folk Death Cab for Cutie No Sunlight           
## 4 Genre 5 - alt/indie/folk Lana Del Rey        Million Dollar Man    
## 5 Genre 6 - country/rock   Dolly Parton        9 to 5                
## 6 Genre 6 - country/rock   talking heads       this must be the place
## 7 Genre 3 -  house/ska     Tobias Dray         Sunsets
shuffle <- shuffle %>% count(genre)
shuffle 
## # A tibble: 5 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         1
## 2 Genre 2 - pop/kpop/Latin     1
## 3 Genre 3 -  house/ska         1
## 4 Genre 5 - alt/indie/folk     2
## 5 Genre 6 - country/rock       2
shuffle <- shuffle %>% mutate(prob=n/7)
shuffle
## # A tibble: 5 × 3
##   genre                        n  prob
##   <fct>                    <int> <dbl>
## 1 Genre 1 - hiphop/rap         1 0.143
## 2 Genre 2 - pop/kpop/Latin     1 0.143
## 3 Genre 3 -  house/ska         1 0.143
## 4 Genre 5 - alt/indie/folk     2 0.286
## 5 Genre 6 - country/rock       2 0.286
shuffle <- sample_n(df,13)
shuffle
## # A tibble: 13 × 3
##    genre                    artist            title                         
##    <fct>                    <chr>             <chr>                         
##  1 Genre 4 -  rnb/soul      Jhene Aiko        Stranger                      
##  2 Genre 4 -  rnb/soul      Thee sacred souls For now                       
##  3 Genre 5 - alt/indie/folk Jeff Buckley      Love you should have come over
##  4 Genre 5 - alt/indie/folk Cherry Glazerr    Soft Drink                    
##  5 Genre 6 - country/rock   talking heads     this must be the place        
##  6 Genre 5 - alt/indie/folk Video Age         Comic Relief                  
##  7 Genre 4 -  rnb/soul      Brent faiyaz      Forever yours                 
##  8 Genre 1 - hiphop/rap     Clipse            Ma, I don't love her          
##  9 Genre 4 -  rnb/soul      Giveon            Still your best               
## 10 Genre 4 -  rnb/soul      Ciara             Like a boy                    
## 11 Genre 3 -  house/ska     Sublime           Garden Grove                  
## 12 Genre 4 -  rnb/soul      Childish Gambino  Redbone                       
## 13 Genre 2 - pop/kpop/Latin Kali Uchis        I Wish you Roses
shuffle <- shuffle %>% count(genre)
shuffle 
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         1
## 2 Genre 2 - pop/kpop/Latin     1
## 3 Genre 3 -  house/ska         1
## 4 Genre 4 -  rnb/soul          6
## 5 Genre 5 - alt/indie/folk     3
## 6 Genre 6 - country/rock       1
shuffle <- shuffle %>% mutate(prob=n/13)
shuffle
## # A tibble: 6 × 3
##   genre                        n   prob
##   <fct>                    <int>  <dbl>
## 1 Genre 1 - hiphop/rap         1 0.0769
## 2 Genre 2 - pop/kpop/Latin     1 0.0769
## 3 Genre 3 -  house/ska         1 0.0769
## 4 Genre 4 -  rnb/soul          6 0.462 
## 5 Genre 5 - alt/indie/folk     3 0.231 
## 6 Genre 6 - country/rock       1 0.0769
shuffle <- sample_n(df,23)
shuffle
## # A tibble: 23 × 3
##    genre                    artist               title           
##    <fct>                    <chr>                <chr>           
##  1 Genre 4 -  rnb/soul      Jcole                In the morning  
##  2 Genre 1 - hiphop/rap     Aliyah's Interlude   It Girl         
##  3 Genre 4 -  rnb/soul      Beyonce              Summertime      
##  4 Genre 4 -  rnb/soul      Mariah the Scientist From A Woman    
##  5 Genre 2 - pop/kpop/Latin Kali Uchis           I Wish you Roses
##  6 Genre 4 -  rnb/soul      Lauryn Hill          Doo Wop         
##  7 Genre 5 - alt/indie/folk Taylor Swift         Peace           
##  8 Genre 1 - hiphop/rap     Big Sean             Beware          
##  9 Genre 6 - country/rock   Citizen              Hyper Trophy    
## 10 Genre 5 - alt/indie/folk Lake Street Dive     Hypotheticals   
## # ℹ 13 more rows
shuffle <- shuffle %>% count(genre)
shuffle 
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap         4
## 2 Genre 2 - pop/kpop/Latin     2
## 3 Genre 3 -  house/ska         1
## 4 Genre 4 -  rnb/soul          8
## 5 Genre 5 - alt/indie/folk     5
## 6 Genre 6 - country/rock       3
shuffle <- shuffle %>% mutate(prob=n/23)
shuffle
## # A tibble: 6 × 3
##   genre                        n   prob
##   <fct>                    <int>  <dbl>
## 1 Genre 1 - hiphop/rap         4 0.174 
## 2 Genre 2 - pop/kpop/Latin     2 0.0870
## 3 Genre 3 -  house/ska         1 0.0435
## 4 Genre 4 -  rnb/soul          8 0.348 
## 5 Genre 5 - alt/indie/folk     5 0.217 
## 6 Genre 6 - country/rock       3 0.130
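
#The four shuffles above repeat the same three steps (sample, count, divide by the sample size). A hedged sketch of how those steps could be wrapped in one function follows. The function name shuffle_genres and the stand-in data frame playlist are hypothetical. The stand-in is rebuilt from the genre counts, so it has no artist or title columns.

```r
library(dplyr)

# Shortened genre labels (assumption: the real data uses longer labels).
genre_levels <- c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                  "rnb/soul", "alt/indie/folk", "country/rock")

# Stand-in for df: one row per song, built from the genre counts above.
playlist <- data.frame(
  genre = factor(rep(genre_levels, times = c(26, 22, 5, 43, 34, 14)),
                 levels = genre_levels)
)

# Hypothetical helper: shuffle `size` songs and tabulate their genres.
shuffle_genres <- function(data, size) {
  data %>%
    sample_n(size) %>%                # draw songs without replacement
    count(genre, .drop = FALSE) %>%   # keep genres that did not appear
    mutate(prob = n / size)           # share of the shuffle for each genre
}

set.seed(1)  # makes the random shuffles repeatable
shuffle_genres(playlist, 10)
shuffle_genres(playlist, 23)
```

#Passing the size as an argument avoids typing the sample size twice, once in sample_n() and once in the division.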

#After running these shuffles, I believe that if I were to run 100 or 1000 more randomly generated tables, the average probability of each genre would get closer to what we found with our main playlist. One thing that I found was that not every genre shows up in every randomly generated table: genre 6 was missing from the 10-song shuffle and genre 4 was missing from the 7-song shuffle. So in a small shuffle, the experimental probability of a genre can be 0 even though its theoretical probability is not.
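
#A rough sketch of what running 1000 shuffles could look like, written in base R. It assumes 10 songs per shuffle and rebuilds the playlist from the genre counts above. The shortened genre labels and the variable names are made up for this example.

```r
# Shortened genre labels (assumption: the real data uses longer labels).
genre_levels <- c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                  "rnb/soul", "alt/indie/folk", "country/rock")

# One entry per song, built from the genre counts above.
genres <- factor(rep(genre_levels, times = c(26, 22, 5, 43, 34, 14)),
                 levels = genre_levels)

theoretical <- as.vector(table(genres)) / length(genres)

set.seed(42)          # makes the simulation repeatable
n_shuffles   <- 1000
shuffle_size <- 10

# Each column is one shuffle; each row is a genre's share of that shuffle.
shares <- replicate(n_shuffles, {
  picked <- sample(genres, shuffle_size)    # draw without replacement
  as.vector(table(picked)) / shuffle_size   # zero-count genres are kept
})

# Average share of each genre across all the shuffles.
average_share <- rowMeans(shares)

data.frame(genre = genre_levels,
           theoretical = round(theoretical, 3),
           simulated = round(average_share, 3))
```

#Any single shuffle can still be far from the playlist probabilities. It is the average over many shuffles that should settle near them.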

#Analysis and synthesis: After completing this project and going back to our original question of “What is the likelihood of a song of each genre being played?” I would say that each genre is more likely than not to be played when a greater number of songs is shuffled. I also know that the genres that have more songs have a higher likelihood of being played, but that doesn’t rule out the other genres being played. I have learned that in our case, experimental and theoretical probability are similar enough. For example, songs in genre 4 had the highest theoretical probability of being played because there are significantly more songs in that genre than in the others, and genre 4 was also the most common genre in three of our four shuffles. Throughout this project we used the data science process frequently: we figured out our main question, gathered data, cleaned and prepared our data, explored our data, found a model and an experiment we wanted to pursue further, and then continued with that.
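
#The idea that a genre becomes more likely to appear as more songs are shuffled can also be checked with a formula instead of an experiment. This is a sketch, assuming songs are drawn without replacement (the default for sample_n()), so the number of songs from one genre follows a hypergeometric distribution. The short genre names and the shuffle sizes 7, 10, 13, and 23 are taken from the tables above, and p_at_least_one is a made-up name.

```r
# Genre counts copied from the genre_counts table above.
n <- c(hiphop_rap = 26, pop_kpop_latin = 22, house_ska = 5,
       rnb_soul = 43, alt_indie_folk = 34, country_rock = 14)
N <- sum(n)                 # total number of songs in the playlist
sizes <- c(7, 10, 13, 23)   # the shuffle sizes used above

# P(at least one song of the genre) = 1 - P(zero songs of the genre).
# dhyper(0, m, n, k): chance of drawing 0 of the m "genre" songs
# in k draws when there are n "other" songs.
p_at_least_one <- sapply(sizes, function(k) {
  1 - dhyper(0, m = n, n = N - n, k = k)
})
rownames(p_at_least_one) <- names(n)
colnames(p_at_least_one) <- paste0("k=", sizes)

# Rows are genres, columns are shuffle sizes.
round(p_at_least_one, 2)
```

#Reading across a row shows how the chance of hearing at least one song from that genre grows with the shuffle size.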

#Reflection: For this project we had our main question, and without coding a simulation and experiment I wouldn’t have known how to approach it. The piece of code that was most challenging for me to write was the part where we found the probability of each genre being played. At the same time, I feel most proud of that code and the “shuffles” experiment we did, because it felt like I was starting to understand what the code meant and what exactly I was doing. I’m not sure if this specifically deals with programming or if there is more to it, but I would like to learn more about what it takes to develop an app or something like that. If I were to do this project again, I would probably run a few more “shuffles” so we could have more examples of each genre showing up and a better estimate of its probability.