#In this project, everyone in our class was asked to submit 10 songs of any genre, which were then compiled into a class playlist. We want to see which songs play when we shuffle the playlist and the probability of each genre being played. The essential question we are aiming to answer is: What is the likelihood of a song of each genre being played?
#Here we loaded tidyverse, a set of packages that makes R easier to use and more accessible for beginners, and then imported our cleaned playlist data, which lists each song with its genre.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Above, we imported our data into a data frame named df. Below, we converted the genre column to a factor so that R treats it as a set of categories.
df$genre <- as.factor(df$genre)
#Below, we counted how many songs are in each genre and printed the table to make sure everything looks right.
genre_counts <- df %>% count(genre)
genre_counts
## # A tibble: 6 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 26
## 2 Genre 2 - pop/kpop/Latin 22
## 3 Genre 3 - house/ska 5
## 4 Genre 4 - rnb/soul 43
## 5 Genre 5 - alt/indie/folk 34
## 6 Genre 6 - country/rock 14
#We then added a prob column that divides each genre's count by the total number of songs.
genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 26 0.181
## 2 Genre 2 - pop/kpop/Latin 22 0.153
## 3 Genre 3 - house/ska 5 0.0347
## 4 Genre 4 - rnb/soul 43 0.299
## 5 Genre 5 - alt/indie/folk 34 0.236
## 6 Genre 6 - country/rock 14 0.0972
#We figured out what values we needed and what calculations we could do to obtain the probability of each genre being played from our playlist, and we added a new column with those calculations. I figured out that there are 144 songs and that rnb/soul has the highest probability of playing on shuffle, at 0.299, or about 30%.
pt.2 <- genre_counts %>% mutate(prob=n/sum(n))
#The column we just made gives the probability that a song playing on shuffle belongs to each genre. For example, we should expect hiphop/rap songs to play about 18% of the time.
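#To make a number like 18% more concrete, here is a minimal sketch that turns each genre's probability into the number of songs we would expect from that genre in one shuffle. The counts are copied from the genre_counts table with shortened labels, and the shuffle size k = 10 is an assumption chosen only for illustration.

```r
# Genre counts copied from the genre_counts table (labels shortened).
genre_n <- c("hiphop/rap" = 26, "pop/kpop/Latin" = 22, "house/ska" = 5,
             "rnb/soul" = 43, "alt/indie/folk" = 34, "country/rock" = 14)

# Theoretical probability of each genre: its count over all 144 songs.
prob <- genre_n / sum(genre_n)

# Expected number of songs per genre in a shuffle of k songs.
# k = 10 is an assumption chosen for illustration.
k <- 10
expected_songs <- k * prob
round(expected_songs, 2)
```

#The expected counts always add up to k, so they show how a shuffle of that size would be split if it matched the playlist exactly.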
shuffle <- sample_n(df,10)
shuffle
## # A tibble: 10 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 4 - rnb/soul Beyonce Summertime
## 2 Genre 2 - pop/kpop/Latin Taeyang Wedding dress
## 3 Genre 1 - hiphop/rap Alesso When I'm Gone
## 4 Genre 3 - house/ska KAYTRANDA BUS RIDE
## 5 Genre 2 - pop/kpop/Latin Franglish Trop Parler
## 6 Genre 4 - rnb/soul Khamari Drifting
## 7 Genre 4 - rnb/soul lauren hill final hour
## 8 Genre 1 - hiphop/rap Rod Wave Get Ready
## 9 Genre 5 - alt/indie/folk Crumb Dust Bunny
## 10 Genre 4 - rnb/soul Sonder What You Heard
shuffle <- shuffle %>% count(genre)
shuffle
## # A tibble: 5 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 2
## 2 Genre 2 - pop/kpop/Latin 2
## 3 Genre 3 - house/ska 1
## 4 Genre 4 - rnb/soul 4
## 5 Genre 5 - alt/indie/folk 1
shuffle <- shuffle %>% mutate(prob=n/10)
shuffle
## # A tibble: 5 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 2 0.2
## 2 Genre 2 - pop/kpop/Latin 2 0.2
## 3 Genre 3 - house/ska 1 0.1
## 4 Genre 4 - rnb/soul 4 0.4
## 5 Genre 5 - alt/indie/folk 1 0.1
#Compared to the original table for the whole playlist, I noticed that the probability of genre 1 was similar: it is 20% here and 18% in the full playlist. Genre 6 has a probability of 0% because it didn’t show up in this shuffle. The result makes sense because rnb/soul had the highest probability in our original table, at 0.30, and it is also the highest in this table, at 0.4.
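#The shuffle table above has only 5 rows because count() leaves out a genre that was never drawn. One possible way to keep that genre in the table with a count of 0 is the .drop = FALSE argument of count(), which works because genre is a factor. This is a sketch only: example_shuffle is a hypothetical stand-in that mirrors the counts printed above, with shortened labels.

```r
library(dplyr)

genre_levels <- c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                  "rnb/soul", "alt/indie/folk", "country/rock")

# Hypothetical 10-song shuffle with the same genre counts as the table above
# (2, 2, 1, 4, 1) and no country/rock songs.
example_shuffle <- tibble(
  genre = factor(
    c("rnb/soul", "pop/kpop/Latin", "hiphop/rap", "house/ska", "pop/kpop/Latin",
      "rnb/soul", "rnb/soul", "hiphop/rap", "alt/indie/folk", "rnb/soul"),
    levels = genre_levels
  )
)

# .drop = FALSE keeps factor levels that have zero songs in the shuffle.
shuffle_counts <- example_shuffle %>%
  count(genre, .drop = FALSE) %>%
  mutate(prob = n / sum(n))
shuffle_counts
```

#With all six genres in the table, each shuffle lines up row by row with the full playlist table.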
shuffle <- sample_n(df,7)
shuffle
## # A tibble: 7 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 2 - pop/kpop/Latin Natalia lafourcade Hasta la raiz
## 2 Genre 1 - hiphop/rap MF DOOM Doomsday
## 3 Genre 5 - alt/indie/folk Death Cab for Cutie No Sunlight
## 4 Genre 5 - alt/indie/folk Lana Del Rey Million Dollar Man
## 5 Genre 6 - country/rock Dolly Parton 9 to 5
## 6 Genre 6 - country/rock talking heads this must be the place
## 7 Genre 3 - house/ska Tobias Dray Sunsets
shuffle <- shuffle %>% count(genre)
shuffle
## # A tibble: 5 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 1
## 2 Genre 2 - pop/kpop/Latin 1
## 3 Genre 3 - house/ska 1
## 4 Genre 5 - alt/indie/folk 2
## 5 Genre 6 - country/rock 2
shuffle <- shuffle %>% mutate(prob=n/7)
shuffle
## # A tibble: 5 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 1 0.143
## 2 Genre 2 - pop/kpop/Latin 1 0.143
## 3 Genre 3 - house/ska 1 0.143
## 4 Genre 5 - alt/indie/folk 2 0.286
## 5 Genre 6 - country/rock 2 0.286
shuffle <- sample_n(df,13)
shuffle
## # A tibble: 13 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 4 - rnb/soul Jhene Aiko Stranger
## 2 Genre 4 - rnb/soul Thee sacred souls For now
## 3 Genre 5 - alt/indie/folk Jeff Buckley Love you should have come over
## 4 Genre 5 - alt/indie/folk Cherry Glazerr Soft Drink
## 5 Genre 6 - country/rock talking heads this must be the place
## 6 Genre 5 - alt/indie/folk Video Age Comic Relief
## 7 Genre 4 - rnb/soul Brent faiyaz Forever yours
## 8 Genre 1 - hiphop/rap Clipse Ma, I don't love her
## 9 Genre 4 - rnb/soul Giveon Still your best
## 10 Genre 4 - rnb/soul Ciara Like a boy
## 11 Genre 3 - house/ska Sublime Garden Grove
## 12 Genre 4 - rnb/soul Childish Gambino Redbone
## 13 Genre 2 - pop/kpop/Latin Kali Uchis I Wish you Roses
shuffle <- shuffle %>% count(genre)
shuffle
## # A tibble: 6 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 1
## 2 Genre 2 - pop/kpop/Latin 1
## 3 Genre 3 - house/ska 1
## 4 Genre 4 - rnb/soul 6
## 5 Genre 5 - alt/indie/folk 3
## 6 Genre 6 - country/rock 1
shuffle <- shuffle %>% mutate(prob=n/13)
shuffle
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 1 0.0769
## 2 Genre 2 - pop/kpop/Latin 1 0.0769
## 3 Genre 3 - house/ska 1 0.0769
## 4 Genre 4 - rnb/soul 6 0.462
## 5 Genre 5 - alt/indie/folk 3 0.231
## 6 Genre 6 - country/rock 1 0.0769
shuffle <- sample_n(df,23)
shuffle
## # A tibble: 23 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 4 - rnb/soul Jcole In the morning
## 2 Genre 1 - hiphop/rap Aliyah's Interlude It Girl
## 3 Genre 4 - rnb/soul Beyonce Summertime
## 4 Genre 4 - rnb/soul Mariah the Scientist From A Woman
## 5 Genre 2 - pop/kpop/Latin Kali Uchis I Wish you Roses
## 6 Genre 4 - rnb/soul Lauryn Hill Doo Wop
## 7 Genre 5 - alt/indie/folk Taylor Swift Peace
## 8 Genre 1 - hiphop/rap Big Sean Beware
## 9 Genre 6 - country/rock Citizen Hyper Trophy
## 10 Genre 5 - alt/indie/folk Lake Street Dive Hypotheticals
## # ℹ 13 more rows
shuffle <- shuffle %>% count(genre)
shuffle
## # A tibble: 6 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 4
## 2 Genre 2 - pop/kpop/Latin 2
## 3 Genre 3 - house/ska 1
## 4 Genre 4 - rnb/soul 8
## 5 Genre 5 - alt/indie/folk 5
## 6 Genre 6 - country/rock 3
shuffle <- shuffle %>% mutate(prob=n/23)
shuffle
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 4 0.174
## 2 Genre 2 - pop/kpop/Latin 2 0.0870
## 3 Genre 3 - house/ska 1 0.0435
## 4 Genre 4 - rnb/soul 8 0.348
## 5 Genre 5 - alt/indie/folk 5 0.217
## 6 Genre 6 - country/rock 3 0.130
#After running these tables with random songs, I believe that if I were to run 100 or 1000 more randomly generated tables, the probabilities of each genre would, on average, continue to match what we found with our main playlist. One thing that I found was that not every genre shows up in every randomly generated table: the 10-song shuffle had no country/rock and the 7-song shuffle had no rnb/soul. That means the likelihood of a genre showing up at all is lower when fewer songs are shuffled.
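#The idea of running 1000 shuffles can be sketched as follows. This is not the code we ran for the project: the playlist is rebuilt from the genre counts alone, the labels are shortened, and the seed, the shuffle size of 10, and the number of shuffles are all assumptions chosen for illustration.

```r
set.seed(2024)  # arbitrary seed so the sketch is reproducible

# Genre counts copied from the genre_counts table (labels shortened).
genre_n <- c("hiphop/rap" = 26, "pop/kpop/Latin" = 22, "house/ska" = 5,
             "rnb/soul" = 43, "alt/indie/folk" = 34, "country/rock" = 14)

# Stand-in playlist with one entry per song (titles are not needed here).
playlist <- factor(rep(names(genre_n), times = genre_n), levels = names(genre_n))

# Share of each genre in one shuffle of k songs drawn without replacement.
shuffle_props <- function(k) {
  as.numeric(prop.table(table(sample(playlist, k))))
}

# Run many 10-song shuffles; each column of sims is one shuffle.
n_shuffles <- 1000
sims <- vapply(seq_len(n_shuffles), function(i) shuffle_props(10),
               numeric(length(genre_n)))
rownames(sims) <- names(genre_n)

# Compare the average simulated share with the theoretical probability.
comparison <- data.frame(
  theoretical = as.numeric(genre_n / sum(genre_n)),
  simulated = as.numeric(rowMeans(sims)),
  row.names = names(genre_n)
)
round(comparison, 3)
```

#Any single shuffle can be far from the playlist, but the average over many shuffles should land close to the theoretical column.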
#Analysis and synthesis: After completing this project and going back to our original question of “What is the likelihood of a song of each genre being played?” I would say that every genre is more likely than not to be played when a greater number of songs is shuffled. I also know that the genres that have more songs will have a higher likelihood of being played because their probability is higher, but that doesn’t rule out the possibility of other genres being played. I have learned that in our case, experimental and theoretical probability are similar enough. For example, songs in genre 4 had the highest theoretical probability of being played because there are significantly more songs in that genre than in the others, and when we ran our experiment genre 4 played the most in three of our four shuffles. Throughout this project we used the data science process frequently: we figured out our main question, gathered data, cleaned and prepared our data, explored our data, found a model and an experiment we wanted to pursue further, and then continued with that.
#Reflection: For this project we had our main question, and without coding a simulation and experiment I wouldn’t have known how to approach it. The piece of code that was most challenging for me to write was the one finding the probability of each genre being played. At the same time, I feel most proud of that code and the “shuffles” experiment we did because it felt like I was starting to understand what the code meant and what exactly I was doing. I’m not sure if this specifically deals with programming or if there is more to it, but I would like to learn more about what it takes to develop an app or something like that. If I were to do this project again, I would probably run a few more “shuffles” so we would have more examples of each genre showing up and of the probability of that.
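#The claim that a genre is more likely to be played when more songs are shuffled can be checked with a formula instead of an experiment. The sketch below assumes a shuffle draws songs at random without repeats, so the number of songs from one genre follows a hypergeometric distribution, and the chance of hearing that genre at least once is 1 minus the chance of drawing none. The helper name p_at_least_one is made up for this example, and the labels are shortened.

```r
# Genre counts copied from the genre_counts table (labels shortened).
genre_n <- c("hiphop/rap" = 26, "pop/kpop/Latin" = 22, "house/ska" = 5,
             "rnb/soul" = 43, "alt/indie/folk" = 34, "country/rock" = 14)
total_songs <- sum(genre_n)

# Probability that a genre with m songs appears at least once in a shuffle
# of k songs: 1 minus the hypergeometric probability of drawing zero of them.
p_at_least_one <- function(m, k) {
  1 - dhyper(0, m, total_songs - m, k)
}

# The four shuffle sizes used in this project.
shuffle_sizes <- c(7, 10, 13, 23)
appear_prob <- sapply(shuffle_sizes, function(k) p_at_least_one(genre_n, k))
rownames(appear_prob) <- names(genre_n)
colnames(appear_prob) <- paste0("k=", shuffle_sizes)
round(appear_prob, 3)
```

#Each row should increase from left to right, which matches the idea that bigger shuffles give every genre a better chance of showing up.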