This first chunk loads all of the needed libraries.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.0.6 v dplyr 1.0.3
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(gganimate)
library(magick)
## Warning: package 'magick' was built under R version 4.0.4
## Linking to ImageMagick 6.9.11.57
## Enabled features: cairo, freetype, fftw, ghostscript, heic, lcms, pango, raw, rsvg, webp
## Disabled features: fontconfig, x11
This chunk sets the working directory and reads in the dataset.
setwd("C:/Users/noahz/Desktop/Data 110 R/Datasets/Hate crime data sets")
albums <- read_csv("albums.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## id = col_double(),
## artist_id = col_double(),
## album_title = col_character(),
## genre = col_character(),
## year_of_pub = col_double(),
## num_of_tracks = col_double(),
## num_of_sales = col_double(),
## rolling_stone_critic = col_double(),
## mtv_critic = col_double(),
## music_maniac_critic = col_double()
## )
This begins the cleaning: it pivots the three critic score columns into a single variable.
# Columns 8 to 10 are rolling_stone_critic, mtv_critic and music_maniac_critic
condensed_scores <- pivot_longer(albums, cols = 8:10, names_to = "reviewer", values_to = "critic")
condensed_scores
## # A tibble: 300,000 x 9
## id artist_id album_title genre year_of_pub num_of_tracks num_of_sales
## <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 1767 Call me Ca~ Folk 2006 11 905193
## 2 1 1767 Call me Ca~ Folk 2006 11 905193
## 3 1 1767 Call me Ca~ Folk 2006 11 905193
## 4 2 23548 Down Mare Metal 2014 7 969122
## 5 2 23548 Down Mare Metal 2014 7 969122
## 6 2 23548 Down Mare Metal 2014 7 969122
## 7 3 17822 Embarrasse~ Lati~ 2000 11 522095
## 8 3 17822 Embarrasse~ Lati~ 2000 11 522095
## 9 3 17822 Embarrasse~ Lati~ 2000 11 522095
## 10 4 19565 Standard I~ Pop 2017 4 610116
## # ... with 299,990 more rows, and 2 more variables: reviewer <chr>,
## # critic <dbl>
This pivots the set wide to separate all of the genres into their own variables.
scores_by_genre <- pivot_wider(condensed_scores, names_from = "genre", values_from = "critic")
scores_by_genre
## # A tibble: 300,000 x 45
## id artist_id album_title year_of_pub num_of_tracks num_of_sales reviewer
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 1767 Call me Ca~ 2006 11 905193 rolling~
## 2 1 1767 Call me Ca~ 2006 11 905193 mtv_cri~
## 3 1 1767 Call me Ca~ 2006 11 905193 music_m~
## 4 2 23548 Down Mare 2014 7 969122 rolling~
## 5 2 23548 Down Mare 2014 7 969122 mtv_cri~
## 6 2 23548 Down Mare 2014 7 969122 music_m~
## 7 3 17822 Embarrasse~ 2000 11 522095 rolling~
## 8 3 17822 Embarrasse~ 2000 11 522095 mtv_cri~
## 9 3 17822 Embarrasse~ 2000 11 522095 music_m~
## 10 4 19565 Standard I~ 2017 4 610116 rolling~
## # ... with 299,990 more rows, and 38 more variables: Folk <dbl>, Metal <dbl>,
## # Latino <dbl>, Pop <dbl>, `Black Metal` <dbl>, Progressive <dbl>,
## # `Pop-Rock` <dbl>, Retro <dbl>, Western <dbl>, `K-Pop` <dbl>, Indie <dbl>,
## # Lounge <dbl>, `J-Rock` <dbl>, `Hard Rock` <dbl>, Unplugged <dbl>,
## # Jazz <dbl>, Trap <dbl>, Ambient <dbl>, Rap <dbl>, `Heavy Metal` <dbl>,
## # Dance <dbl>, Alternative <dbl>, `Death Metal` <dbl>, Live <dbl>,
## # Blues <dbl>, Compilation <dbl>, Gospel <dbl>, Country <dbl>, `Deep
## # House` <dbl>, `Brit-Pop` <dbl>, Parody <dbl>, Techno <dbl>, Rock <dbl>,
## # Punk <dbl>, `Boy Band` <dbl>, Indietronica <dbl>, `Holy Metal` <dbl>,
## # `Electro-Pop` <dbl>
This first keeps only the 6 genres that I believe are the most popular and diverse, along with the year, sales, and track count. It then pivots the set longer to gather the kept genres back into a single genre variable.
filtered <- scores_by_genre %>%
  select('Metal', 'Pop', 'Rock', 'Jazz', 'Rap', 'Country', 'year_of_pub', 'num_of_sales', 'num_of_tracks') %>%
  pivot_longer(cols = 1:6, names_to = "genre", values_to = "score")
filtered
## # A tibble: 1,800,000 x 5
## year_of_pub num_of_sales num_of_tracks genre score
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 2006 905193 11 Metal NA
## 2 2006 905193 11 Pop NA
## 3 2006 905193 11 Rock NA
## 4 2006 905193 11 Jazz NA
## 5 2006 905193 11 Rap NA
## 6 2006 905193 11 Country NA
## 7 2006 905193 11 Metal NA
## 8 2006 905193 11 Pop NA
## 9 2006 905193 11 Rock NA
## 10 2006 905193 11 Jazz NA
## # ... with 1,799,990 more rows
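The select-then-pivot step above yields 1,800,000 rows, most of them NA, because each album only has scores in its own genre column. As a minimal sketch of a shorter route, the non-NA rows could also be reached by filtering on genre straight after the first pivot. The names `toy_albums`, `keep_genres` and `direct`, and every value in the toy table, are made up for illustration; only the column names follow albums.csv.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for albums.csv (hypothetical values, same column names)
toy_albums <- tibble(
  id = 1:4,
  genre = c("Folk", "Metal", "Pop", "Jazz"),
  year_of_pub = c(2006, 2014, 2017, 2000),
  num_of_tracks = c(11, 7, 4, 9),
  num_of_sales = c(905193, 969122, 610116, 522095),
  rolling_stone_critic = c(4, 1.5, 5, 3),
  mtv_critic = c(1.5, 4, 3.5, 2),
  music_maniac_critic = c(3, 2, 4.5, 1)
)

# The six genres kept in the project
keep_genres <- c("Metal", "Pop", "Rock", "Jazz", "Rap", "Country")

direct <- toy_albums %>%
  # One row per album and reviewer, with the score in a single column
  pivot_longer(cols = ends_with("_critic"), names_to = "reviewer", values_to = "score") %>%
  # Keep only rows whose genre is one of the six, so no NA rows are created
  filter(genre %in% keep_genres) %>%
  select(year_of_pub, num_of_sales, num_of_tracks, genre, score)

print(direct)
```

Since the box plot is drawn with `na.rm = TRUE`, dropping the NA rows this way should not change what the plot shows; it only avoids building the wide intermediate table.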
This chunk first creates a simple box-and-whisker plot of score by genre, with custom colors and a title. Then, using gganimate, it has the boxes change from one publication year to the next. Finally, the animation is rendered with the magick library.
animation <- ggplot(filtered) +
  # Horizontal box plot: one box per genre, NA scores dropped silently
  geom_boxplot(mapping = aes(y = genre, x = score, fill = genre), na.rm = TRUE) +
  ggtitle("music genre critic rating") +
  scale_fill_manual(
    name = "genre",
    labels = c("Country", "Jazz", "Metal", "Pop", "Rap", "Rock"),
    values = c("green3", "gold1", "grey47", "pink1", "slateblue", "tan4")
  ) +
  # One animation state per publication year
  transition_states(
    year_of_pub,
    transition_length = 2,
    state_length = 1
  ) +
  enter_fade() +
  exit_shrink() +
  ease_aes('sine-in-out') +
  # Show the year of the current state in the caption
  labs(caption = '{closest_state}')
animate(animation, duration = 15, fps = 10, renderer = magick_renderer())
To address a potential problem upfront, this dataset seems to have been artificially generated, so it may not be representative of any real events. The dataset is from Kaggle, titled “Music Label Dataset” by Revil Rosa. The main variables that I used are “genre”, which is the genre of the album; “year_of_pub”, which is the year that the album was published; and “rolling_stone_critic”, “mtv_critic”, and “music_maniac_critic”, which are the scores that different music critics gave the album. There are other variables, but I did not use them in my project.

I knew generally what I wanted to do with this dataset before I started working on it, so most of the tidying was done in service of that goal; I believe the dataset was already tidy for general use. To get the data into a form that I could use for my purposes, I first pivoted the dataframe longer to compile the critic scores from the 3 sources into 1 variable. Then I pivoted it wider, making each genre its own variable. I then selected only the variables that I needed and only the 6 genres of music that I believe are the most popular and diverse in the dataset, as I believe that comparing too many genres would get visually cluttered. Doing this allowed me to pivot longer again, returning the genres to being observations, but now with all the critic scores condensed into 1 variable.

Once I had the data in the tidy form that I needed, I started by making a basic box plot that compared the genres to the scores. Once I had the basic plot done, I used gganimate to have each box plot change to show how the scores changed as time passed.
Now that I know that the data is artificially created and random, I find it interesting that the median of every genre changes depending on the year, while only some of the first quartiles change and none of the third quartiles do. This initially seemed strange to me, since if the data were randomly generated you might expect it to stay consistent. After considering it further, though, staying exactly the same every year would require a level of consistency that is extremely unlikely for random numbers.

There were some other avenues that I wanted to explore, such as whether score correlates with sales or number of songs, and whether genres differ in their number of songs. Given that this dataset is random, all they produced were charts that looked identical and completely random, so I chose to leave them out as I felt they would not add anything to the project.
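Both observations can also be checked numerically rather than read off the animation. The sketch below is only an illustration on simulated data: `toy`, `quartiles` and `correlations` are hypothetical names, the half-point score scale is an assumption, and the random values are not the album data. On the real data, the same two summaries could be run on the `filtered` table.

```r
library(dplyr)

set.seed(110)  # arbitrary seed so the toy data is reproducible

# Simulated stand-in for `filtered`: random scores and random sales
# (hypothetical values; the half-point score scale is an assumption)
toy <- tibble(
  genre = sample(c("Jazz", "Pop", "Rock"), 3000, replace = TRUE),
  year_of_pub = sample(2000:2004, 3000, replace = TRUE),
  num_of_sales = sample(1:1000000, 3000, replace = TRUE),
  score = sample(seq(0.5, 5, by = 0.5), 3000, replace = TRUE)
)

# Quartiles of the score for every genre and year,
# i.e. the numbers each animated box is drawn from
quartiles <- toy %>%
  group_by(genre, year_of_pub) %>%
  summarise(
    q1 = quantile(score, 0.25),
    median = median(score),
    q3 = quantile(score, 0.75),
    .groups = "drop"
  )

# Correlation between score and sales within each genre
correlations <- toy %>%
  group_by(genre) %>%
  summarise(score_vs_sales = cor(score, num_of_sales), .groups = "drop")

print(quartiles)
print(correlations)
```

Reading the `quartiles` table row by row shows which of the three statistics move between years, and correlations close to zero would match the impression that the extra charts showed no pattern.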