A. In data science we focus on modeling and interpreting data. The
purpose of this particular project was to learn how to calculate
theoretical probabilities in a dataset while also learning about
the difference between theoretical and experimental probabilities.
B. Essential Question: Do the genres we hear played on shuffle
represent the genres of songs in our class playlist?
A. In order to gather data, each student from our class submitted ten songs, all of different genres, to a Google Form. Those were then put into a spreadsheet before being compiled into a class playlist.
B. Before importing the data into R, as a class we looked at the spreadsheet containing the data and cleaned it up. This included deleting duplicate songs, naming the columns, and organizing each song by genre. This was done to make the data easier to read and work with.
df <- read_csv("Copy of playlist - data_clean.csv")
## Rows: 144 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): genre, artist, title
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df$genre <- as.factor(df$genre)
Here we imported the data into R and assigned it the name df, then converted the genre column to a factor. Before importing the data we cleaned it up by deleting repeated songs, naming the columns, and organizing the data by genre.
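We did the cleanup by hand in the spreadsheet, but the same steps could probably be done in R as well. The sketch below is only an illustration under that assumption: raw and clean are hypothetical names, and the three rows are a made-up example with one duplicate, not the class data.

```r
library(dplyr)

# Made-up example rows; the second row duplicates the first
raw <- data.frame(
  genre  = c("rnb/soul", "rnb/soul", "hiphop/rap"),
  artist = c("SZA", "SZA", "Mac Miller"),
  title  = c("Good Days", "Good Days", "Surf")
)

clean <- raw %>%
  distinct(artist, title, .keep_all = TRUE) %>%  # keep one row per artist/title pair
  mutate(genre = as.factor(genre))               # store genre as a factor, as in the report

clean
```

Note that distinct() only catches exact matches, so a song typed as "sza" in one row and "SZA" in another would still need to be fixed by hand.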
df %>% count(genre)
## # A tibble: 6 × 2
## genre n
## <fct> <int>
## 1 Genre 1 - hiphop/rap 26
## 2 Genre 2 - pop/kpop/Latin 22
## 3 Genre 3 - house/ska 5
## 4 Genre 4 - rnb/soul 43
## 5 Genre 5 - alt/indie/folk 34
## 6 Genre 6 - country/rock 14
The code above took the data set and counted how many rows have each value in a column. The name in the parentheses tells it which column to count by, so here it counted the number of songs in each genre.
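To see what count() is doing, here is a small sketch on a toy data frame (the name toy and its rows are placeholders, not the class data). It shows that count(genre) gives the same tally as grouping by genre and counting the rows in each group.

```r
library(dplyr)

# Toy data: hypothetical rows, not the class playlist
toy <- data.frame(
  genre = c("rnb/soul", "rnb/soul", "hiphop/rap", "country/rock", "rnb/soul"),
  title = c("song a", "song b", "song c", "song d", "song e")
)

# count() is shorthand for group_by() followed by summarise(n = n())
by_count <- toy %>% count(genre)
by_group <- toy %>% group_by(genre) %>% summarise(n = n())

by_count
```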
“tidyverse” is a collection of packages, and we downloaded it in order to use certain functions, such as read_csv(), count(), and mutate(). In order to use those functions we first ran library(tidyverse), which loads the packages from my library into the workspace for use.
genre_counts<- df %>% count(genre)
The code took the data frame df, which is the playlist data, counted the number of occurrences of each value in the genre column, and stored the result in a new data frame called genre_counts. The code right below adds a new column named “prob” to the genre_counts data frame; it represents the probability of each genre, which is the number of songs in that genre divided by the total number of songs in the playlist.
genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 6 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 1 - hiphop/rap 26 0.181
## 2 Genre 2 - pop/kpop/Latin 22 0.153
## 3 Genre 3 - house/ska 5 0.0347
## 4 Genre 4 - rnb/soul 43 0.299
## 5 Genre 5 - alt/indie/folk 34 0.236
## 6 Genre 6 - country/rock 14 0.0972
short <- genre_counts %>% mutate(prob=n/sum(n))
This code took the previous one and assigned the resulting data frame to a new variable called “short.”
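As a check on the prob column, the sketch below rebuilds the table from the genre counts printed earlier (typed in by hand here, with shortened genre labels, instead of reading the CSV) and confirms that the counts add up to 144 songs and the theoretical probabilities add up to 1.

```r
library(dplyr)

# Genre counts copied from the count(genre) output for the full playlist
genre_counts <- data.frame(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  n = c(26, 22, 5, 43, 34, 14)
)

# Theoretical probability = songs in the genre / total songs in the playlist
short <- genre_counts %>% mutate(prob = n / sum(n))

sum(genre_counts$n)   # total number of songs in the playlist
sum(short$prob)       # the probabilities should add up to 1
```

For example, rnb/soul has 43 of the 144 songs, so its theoretical probability is 43/144, which rounds to the 0.299 shown in the table.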
sample_n(df, 10)
## # A tibble: 10 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 4 - rnb/soul Queen Naija Medicine
## 2 Genre 2 - pop/kpop/Latin bad bunny Ojitos lindos
## 3 Genre 5 - alt/indie/folk laufey like the movies
## 4 Genre 4 - rnb/soul Jcole In the morning
## 5 Genre 4 - rnb/soul sza normal girl
## 6 Genre 4 - rnb/soul Summer Walker To summer, from cole
## 7 Genre 5 - alt/indie/folk Nirvana About A Girl
## 8 Genre 4 - rnb/soul SZA Good Days
## 9 Genre 4 - rnb/soul Brent Faiyaz Missin Out
## 10 Genre 4 - rnb/soul Maxwell This woman’s work
The code above randomly selects 10 rows from the playlist data frame and returns them as a new data frame.
sample1<- sample_n(df, 10)
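This code runs the same random selection of 10 rows from the playlist data set again and stores the results in a new data frame called “sample1.” Because sample_n() draws a new random sample every time it runs, the songs stored in sample1 are not the same 10 songs printed above.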
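One way to make the printed sample and the stored sample match is to store the sample first and then print the stored object. Setting a seed also makes the random draw repeatable. The sketch below is a rough illustration, assuming a stand-in df built from the genre counts in this report (genre column only); the seed value 2024 and the name sample2 are arbitrary choices for the example.

```r
library(dplyr)

# Stand-in for the playlist: 144 rows with the genre counts from the report
df <- data.frame(
  genre = rep(c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                "rnb/soul", "alt/indie/folk", "country/rock"),
              times = c(26, 22, 5, 43, 34, 14))
)

set.seed(2024)                 # arbitrary seed, chosen only for this example
sample1 <- sample_n(df, 10)    # draw once and keep the result
sample1                        # print the stored rows instead of sampling again

set.seed(2024)                 # resetting the same seed...
sample2 <- sample_n(df, 10)    # ...repeats the same random draw
identical(sample1$genre, sample2$genre)
```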
genre_counts<- sample1 %>% count(genre)
genre_counts
## # A tibble: 4 × 2
## genre n
## <fct> <int>
## 1 Genre 2 - pop/kpop/Latin 2
## 2 Genre 3 - house/ska 1
## 3 Genre 4 - rnb/soul 4
## 4 Genre 5 - alt/indie/folk 3
The line of code above takes the “sample1” data frame from the previous code, counts the number of occurrences of each value in the genre column, and stores the results in the data frame genre_counts. This replaces the full-playlist counts that were stored in genre_counts earlier.
genre_counts %>% mutate(prob=n/sum(n))
## # A tibble: 4 × 3
## genre n prob
## <fct> <int> <dbl>
## 1 Genre 2 - pop/kpop/Latin 2 0.2
## 2 Genre 3 - house/ska 1 0.1
## 3 Genre 4 - rnb/soul 4 0.4
## 4 Genre 5 - alt/indie/folk 3 0.3
The code above takes the genre_counts data frame and adds a new column named “prob,” which is each genre's count divided by the total number of songs in the sample (10), giving a modified data table. These are the experimental probabilities for this sample.
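To compare the two kinds of probability side by side, the tables could be joined by genre. The sketch below is one possible way to do that, using the counts printed in this report typed in by hand with shortened genre labels; theo_tbl, exp_tbl, and comparison are hypothetical names that are not part of the original code.

```r
library(dplyr)

# Theoretical probabilities from the full playlist (counts from the report)
theo_tbl <- data.frame(
  genre = c("hiphop/rap", "pop/kpop/Latin", "house/ska",
            "rnb/soul", "alt/indie/folk", "country/rock"),
  n = c(26, 22, 5, 43, 34, 14)
) %>%
  mutate(theoretical = n / sum(n)) %>%
  select(genre, theoretical)

# Experimental probabilities from sample1 (counts from the report)
exp_tbl <- data.frame(
  genre = c("pop/kpop/Latin", "house/ska", "rnb/soul", "alt/indie/folk"),
  n = c(2, 1, 4, 3)
) %>%
  mutate(experimental = n / sum(n)) %>%
  select(genre, experimental)

# Genres that never came up in the sample get an experimental probability of 0
comparison <- theo_tbl %>%
  left_join(exp_tbl, by = "genre") %>%
  mutate(experimental = coalesce(experimental, 0),
         difference = experimental - theoretical)

comparison
```

A positive difference means the genre came up more often in the sample than its share of the playlist, and a negative difference means it came up less often.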
sample_n(df, 6)
## # A tibble: 6 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 1 - hiphop/rap Mac Miller ROS
## 2 Genre 4 - rnb/soul Summer Walker To summer, from cole
## 3 Genre 6 - country/rock Citizen Hyper Trophy
## 4 Genre 1 - hiphop/rap Alesso When I'm Gone
## 5 Genre 4 - rnb/soul Khamari Drifting
## 6 Genre 1 - hiphop/rap Lakeyah Worst thing
sample_n(df, 3)
## # A tibble: 3 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 6 - country/rock Citizen World
## 2 Genre 4 - rnb/soul Maxwell This woman’s work
## 3 Genre 2 - pop/kpop/Latin Rosalia and Rauw Alejandro Beso
sample_n(df, 8)
## # A tibble: 8 × 3
## genre artist title
## <fct> <chr> <chr>
## 1 Genre 1 - hiphop/rap Lakeyah Worst thing
## 2 Genre 4 - rnb/soul Brent faiyaz Rehab
## 3 Genre 1 - hiphop/rap Mac Miller Surf
## 4 Genre 4 - rnb/soul Tayc Le Miel
## 5 Genre 4 - rnb/soul Soul for real Candy rain
## 6 Genre 4 - rnb/soul Summer Walker To summer, from cole
## 7 Genre 2 - pop/kpop/Latin Frank Ocean Nights
## 8 Genre 5 - alt/indie/folk Aidan Bissett Tripping Over Air
Each of the 3 previous pieces of code randomly selects a number of songs of our choice from the playlist data frame to simulate shuffle play. I chose to do 6, 3, and 8 songs.
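A single small sample can land far from the playlist's real proportions, so one possible extension is to repeat the shuffle many times and average the results. The sketch below is only an illustration: it uses a stand-in df built from the genre counts in this report, and the function name one_shuffle, the seed, and the choice of 1000 repeats are all assumptions made for the example.

```r
library(dplyr)

# Stand-in for the playlist: 144 rows with the genre counts from the report
df <- data.frame(
  genre = rep(c("hiphop/rap", "pop/kpop/Latin", "house/ska",
                "rnb/soul", "alt/indie/folk", "country/rock"),
              times = c(26, 22, 5, 43, 34, 14))
)

set.seed(1)   # arbitrary seed so the example is repeatable

# One simulated shuffle: share of rnb/soul songs among `size` random songs
one_shuffle <- function(size = 10) {
  picked <- sample_n(df, size)
  mean(picked$genre == "rnb/soul")
}

# Repeat the shuffle 1000 times and collect the experimental probabilities
results <- replicate(1000, one_shuffle())

mean(results)   # average experimental probability across all shuffles
43 / 144        # theoretical probability for comparison
```

With many repeats, the average experimental probability should land close to the theoretical one, even though individual shuffles of 10 songs vary quite a bit.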
Drawing back to our unit question: do the genres we hear played on shuffle represent the genres of songs in our class playlist? Yes, I think the genres we hear played on shuffle represent the genres of songs in our class playlist, and the data supports this. Looking at the data table that contains the 10 randomly selected songs, multiple genres are represented, all of them genres from the class playlist, although some genres are represented more than others.
Through this project I learned that experimental and theoretical probabilities can differ a lot. Experimental probability comes from trials and actual testing, so it can change from one sample to the next, while theoretical probability is what we expect to happen based on the makeup of the whole playlist. They both apply here, as we modeled both experimental and theoretical probability using the data.
Comparing the genres in the shuffled playlist to the class playlist, I notice that while RnB and Alt/Indie have a larger number of songs in the class playlist, they aren’t overrepresented in the shuffled playlist. All genres are represented pretty equally.
In this project we used the data science process by asking a question, collecting data, and then modeling and analyzing that data.
Coding a simulation helped answer the unit question by giving us something to compare against other results. It also helped us visualize the different possibilities for the data. You can use simulations to answer real-world problems or even to simulate video games.
The piece of code that was most challenging to write was the “10” sample. It’s also the one I feel most proud of, because although I found it challenging, I eventually got it and that felt really rewarding. In programming I’d like to learn more about what happens between inputting code or a function and getting the output from it. If I were to do this again, I wouldn’t do anything differently.