Introduction

A. In data science we focus on modeling and interpreting data. The purpose of this particular project was to learn how to calculate theoretical probabilities from a dataset and to see how theoretical and experimental probabilities compare.
B. Essential Question - Do the genres we hear played on shuffle represent the genres of songs in our class playlist?

Gathering & Organizing Data

A. To gather data, each student in our class submitted ten songs, all of different genres, to a Google Form. The responses were put into a spreadsheet and then compiled into a class playlist.
B. Before importing the data into R, we looked at the spreadsheet as a class and cleaned it up. This included deleting duplicate songs, naming the columns, and organizing each song by genre. This was done to make the data easier to read and work with.

library(tidyverse)  # provides read_csv, %>%, count, mutate, and sample_n
df <- read_csv("Copy of playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df$genre <- as.factor(df$genre)

Modeling

Here we imported the data into R and assigned it the name df. Before importing the data, we cleaned it up by deleting repeated songs, naming the columns, and organizing the songs by genre. The last line converts the genre column to a factor so that R treats it as a set of categories.

df %>% count(genre) 
## # A tibble: 6 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 1 - hiphop/rap        26
## 2 Genre 2 - pop/kpop/Latin    22
## 3 Genre 3 -  house/ska         5
## 4 Genre 4 -  rnb/soul         43
## 5 Genre 5 - alt/indie/folk    34
## 6 Genre 6 - country/rock      14

Modeling

The code above took the data set and counted how many rows fall into each category. Putting genre in the parentheses told count to use the genre column, so the result shows the number of songs in each of the six genres.
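As a small illustration of what count is doing, the sketch below builds a made-up five-song data frame (the name toy and its rows are invented for this example and are not from the class playlist). It suggests that count(genre) gives the same numbers as grouping by genre and then counting the rows in each group.

```r
library(dplyr)

# Hypothetical mini playlist, invented for illustration only
toy <- tibble(
  genre = c("rnb", "pop", "rnb", "rock", "pop"),
  title = c("song a", "song b", "song c", "song d", "song e")
)

# Short form: one row per genre, with n = number of songs in that genre
by_count <- toy %>% count(genre)

# Long form of the same idea: group the rows, then count each group
by_group <- toy %>%
  group_by(genre) %>%
  summarise(n = n())

by_count
by_group
```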

Modeling

“tidyverse” is a collection of R packages, and we installed it in order to use certain functions, such as read_csv, count, and mutate. To use those functions, the packages first have to be loaded from the library into the working space with library(tidyverse), which has to run before any of the other code in this report.

genre_counts<- df %>% count(genre)

Modeling

The code took the data frame, which holds the playlist data, counted the number of occurrences of each value in the genre column, and stored the result in a new data frame called genre_counts. The code right below adds a new column named “prob” to the genre_counts data frame. It is the theoretical probability of each genre: the number of songs in that genre divided by the total number of songs in the playlist.

genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 6 × 3
##   genre                        n   prob
##   <fct>                    <int>  <dbl>
## 1 Genre 1 - hiphop/rap        26 0.181 
## 2 Genre 2 - pop/kpop/Latin    22 0.153 
## 3 Genre 3 -  house/ska         5 0.0347
## 4 Genre 4 -  rnb/soul         43 0.299 
## 5 Genre 5 - alt/indie/folk    34 0.236 
## 6 Genre 6 - country/rock      14 0.0972
short <- genre_counts %>% mutate(prob=n/sum(n))

Modeling

This code reran the previous calculation and assigned the resulting data frame to a new variable called “short.”
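To make the arithmetic behind the prob column concrete, here is a sketch that rebuilds the table from the six counts printed earlier. The name playlist_counts and the shortened genre labels are assumptions made for this example; the counts themselves are copied from the count(genre) output.

```r
library(dplyr)

# Counts copied from the count(genre) output; labels shortened for readability
playlist_counts <- tibble(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  n = c(26, 22, 5, 43, 34, 14)
)

# Theoretical probability = songs in the genre / songs in the whole playlist
theoretical <- playlist_counts %>% mutate(prob = n / sum(n))

theoretical
sum(theoretical$prob)  # the probabilities of all genres add up to 1
```

For example, rnb/soul has 43 of the 144 songs, so its theoretical probability is 43 / 144, or about 0.299.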

sample_n(df, 10)
## # A tibble: 10 × 3
##    genre                    artist        title               
##    <fct>                    <chr>         <chr>               
##  1 Genre 4 -  rnb/soul      Queen Naija   Medicine            
##  2 Genre 2 - pop/kpop/Latin bad bunny     Ojitos lindos       
##  3 Genre 5 - alt/indie/folk laufey        like the movies     
##  4 Genre 4 -  rnb/soul      Jcole         In the morning      
##  5 Genre 4 -  rnb/soul      sza           normal girl         
##  6 Genre 4 -  rnb/soul      Summer Walker To summer, from cole
##  7 Genre 5 - alt/indie/folk Nirvana       About A Girl        
##  8 Genre 4 -  rnb/soul      SZA           Good Days           
##  9 Genre 4 -  rnb/soul      Brent Faiyaz  Missin Out          
## 10 Genre 4 -  rnb/soul      Maxwell       This woman’s work

Modeling

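The code above randomly selects 10 rows from the playlist data frame and returns them as a new data frame. By default sample_n does not pick the same row twice, so this is like hearing 10 different songs on shuffle.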

sample1<- sample_n(df, 10)

Modeling

This code runs the same random selection of 10 rows from the playlist data set again and stores the results in a new data frame called “sample1.” Because the selection is random, sample1 holds a different set of 10 songs from the ones printed above.
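Since every call to sample_n draws a fresh random set of songs, the results change each time the report is run. One way to make a draw repeatable, sketched here with an invented data frame called toy and an arbitrary seed value of 1, is to call set.seed() right before sampling.

```r
library(dplyr)

# Hypothetical 20-song playlist, invented for illustration only
toy <- tibble(
  genre = rep(c("rnb", "pop", "rock", "indie"), times = 5),
  title = paste("song", 1:20)
)

set.seed(1)                     # fix the random number generator (1 is arbitrary)
first_draw <- sample_n(toy, 5)

set.seed(1)                     # same seed again...
second_draw <- sample_n(toy, 5)

identical(first_draw$title, second_draw$title)  # ...so the same 5 songs come back
```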

genre_counts<- sample1 %>% count(genre)
genre_counts
## # A tibble: 4 × 2
##   genre                        n
##   <fct>                    <int>
## 1 Genre 2 - pop/kpop/Latin     2
## 2 Genre 3 -  house/ska         1
## 3 Genre 4 -  rnb/soul          4
## 4 Genre 5 - alt/indie/folk     3

Modeling

The code above takes the “sample1” data frame, counts the number of occurrences of each value in the genre column, and stores the results in the data frame genre_counts, replacing the earlier whole-playlist counts. Only four genres are listed because the other two did not come up in this sample.
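By default, count leaves out genres that have no songs in the sample. Because genre was converted to a factor earlier, one option is to pass .drop = FALSE so that genres that were not drawn appear with a count of 0. The sketch below uses an invented data frame (toy_sample) rather than the real sample1.

```r
library(dplyr)

# Hypothetical sample of 5 songs; the factor has 3 possible genres,
# but "rock" does not appear in the sample
toy_sample <- tibble(
  genre = factor(c("rnb", "pop", "rnb", "rnb", "pop"),
                 levels = c("pop", "rnb", "rock"))
)

# Default: genres that were not drawn are left out
drawn_only <- toy_sample %>% count(genre)
drawn_only

# .drop = FALSE keeps every factor level, including those with 0 songs
all_genres <- toy_sample %>% count(genre, .drop = FALSE)
all_genres
```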

genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 4 × 3
##   genre                        n  prob
##   <fct>                    <int> <dbl>
## 1 Genre 2 - pop/kpop/Latin     2   0.2
## 2 Genre 3 -  house/ska         1   0.1
## 3 Genre 4 -  rnb/soul          4   0.4
## 4 Genre 5 - alt/indie/folk     3   0.3

Modeling

The code above takes the genre_counts data frame and adds a new column named “prob,” which is each genre's share of the 10 sampled songs. This is the experimental probability of each genre for this sample.
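To compare these experimental probabilities with the theoretical ones, the two tables can be lined up by genre. The sketch below retypes the counts from the two count(genre) outputs into small tables and joins them, treating genres missing from the sample as 0. The names theoretical, experimental, and comparison, the column names, and the shortened genre labels are assumptions made for this example.

```r
library(dplyr)

# Whole-playlist counts, copied from the first count(genre) output
theoretical <- tibble(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  n_playlist = c(26, 22, 5, 43, 34, 14)
) %>%
  mutate(prob_theoretical = n_playlist / sum(n_playlist))

# Counts from the 10-song sample (sample1)
experimental <- tibble(
  genre = c("pop/kpop/Latin", "house/ska", "rnb/soul", "alt/indie/folk"),
  n_sample = c(2, 1, 4, 3)
) %>%
  mutate(prob_experimental = n_sample / sum(n_sample))

# Keep every playlist genre; genres not drawn get an experimental probability of 0
comparison <- theoretical %>%
  left_join(experimental, by = "genre") %>%
  mutate(
    prob_experimental = coalesce(prob_experimental, 0),
    difference = prob_experimental - prob_theoretical
  ) %>%
  select(genre, prob_theoretical, prob_experimental, difference)

comparison
```

In this sample, rnb/soul came up 40% of the time against a theoretical 29.9%, while hiphop/rap did not come up at all despite a theoretical 18.1%.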

sample_n(df, 6)
## # A tibble: 6 × 3
##   genre                  artist        title               
##   <fct>                  <chr>         <chr>               
## 1 Genre 1 - hiphop/rap   Mac Miller    ROS                 
## 2 Genre 4 -  rnb/soul    Summer Walker To summer, from cole
## 3 Genre 6 - country/rock Citizen       Hyper Trophy        
## 4 Genre 1 - hiphop/rap   Alesso        When I'm Gone       
## 5 Genre 4 -  rnb/soul    Khamari       Drifting            
## 6 Genre 1 - hiphop/rap   Lakeyah       Worst thing
sample_n(df, 3)
## # A tibble: 3 × 3
##   genre                    artist                     title            
##   <fct>                    <chr>                      <chr>            
## 1 Genre 6 - country/rock   Citizen                    World            
## 2 Genre 4 -  rnb/soul      Maxwell                    This woman’s work
## 3 Genre 2 - pop/kpop/Latin Rosalia and Rauw Alejandro Beso
sample_n(df, 8)
## # A tibble: 8 × 3
##   genre                    artist        title               
##   <fct>                    <chr>         <chr>               
## 1 Genre 1 - hiphop/rap     Lakeyah       Worst thing         
## 2 Genre 4 -  rnb/soul      Brent faiyaz  Rehab               
## 3 Genre 1 - hiphop/rap     Mac Miller    Surf                
## 4 Genre 4 -  rnb/soul      Tayc          Le Miel             
## 5 Genre 4 -  rnb/soul      Soul for real Candy rain          
## 6 Genre 4 -  rnb/soul      Summer Walker To summer, from cole
## 7 Genre 2 - pop/kpop/Latin Frank Ocean   Nights              
## 8 Genre 5 - alt/indie/folk Aidan Bissett Tripping Over Air

Modeling

Each of the three previous pieces of code randomly selects a number of songs of our choice from the playlist data frame to simulate shuffle play. I chose 6, 3, and 8 songs.
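A single small sample can land far from the theoretical probabilities, so one way to extend the simulation is to repeat the shuffle many times and average the results. The sketch below is an illustration under stated assumptions, not part of the original assignment: it rebuilds a genre column from the six playlist counts, draws 10 songs 1,000 times with replicate(), and averages each genre's share. The names genre_names, genres, one_shuffle, and shares, the seed, and the number of repeats are all choices made for this example.

```r
# Genre labels (shortened) repeated according to the playlist counts
genre_names <- c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                 "rnb/soul", "alt/indie/folk", "country/rock")
genres <- factor(rep(genre_names, times = c(26, 22, 5, 43, 34, 14)),
                 levels = genre_names)

# One simulated shuffle: draw `size` different songs, return each genre's share
one_shuffle <- function(size = 10) {
  picked <- sample(genres, size)     # without replacement, like sample_n
  as.vector(table(picked)) / size    # table() keeps all 6 levels, even at 0
}

set.seed(2024)                            # arbitrary seed so the run is repeatable
shares <- replicate(1000, one_shuffle())  # matrix: 6 genres x 1000 shuffles

average_share <- rowMeans(shares)
theoretical   <- as.vector(table(genres)) / length(genres)

data.frame(
  genre         = genre_names,
  theoretical   = round(theoretical, 3),
  average_share = round(average_share, 3)
)
```

Any one shuffle of 10 songs may miss a genre entirely, but the averages over many shuffles should sit close to the theoretical column.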

Analysis & Synthesis

Drawing back to our unit question: do the genres we hear played on shuffle represent the genres of songs in our class playlist? Yes, I think the genres we hear played on shuffle represent the genres of songs in our class playlist, and the data supports this. In the data table that contains the 10 randomly selected songs, multiple genres appear, all of them genres from the class playlist, although some genres are represented more than others.

Through this project I learned how experimental and theoretical probability can differ. Experimental probability comes from running trials and recording what actually happened, while theoretical probability is what we expect to happen based on the makeup of the playlist. With a small number of trials the two can be far apart. Both apply here, as we modeled both experimental and theoretical probability using the data.

Comparing the genres in the shuffled playlist with the class playlist, I notice that while RnB and Alt/Indie have the largest number of songs in the class playlist, they aren’t overrepresented in the shuffled playlist. The genres are represented fairly evenly.

In this project we followed the data science process: asking a question, collecting data, and then modeling and analyzing that data.

Reflection

Coding a simulation helped answer the unit question by giving us something to compare with other results. It also let us see the different possibilities for the data. Simulations can be used to answer real-world problems or even to simulate video games.

The piece of code that was most challenging to write was the “10” sample. It’s also the one I feel most proud of, because although I found it challenging, I eventually got it, and that felt really rewarding. In programming I’d like to learn more about what happens between entering code or a function and getting its output. If I were to do this again, I wouldn’t do anything differently.