DATA 607 TidyVerse CREATE: Dplyr

Part 1: assignment

In this assignment, you’ll practice collaborating around a code project with GitHub. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

For this assignment dplyr was utilized to conduct a superficial analysis of the ‘Music Dataset : 1950 to 2019’ which provides a list of songs ‘from 1950 to 2019 describing music metadata as sadness, danceability, loudness, acousticness, etc.’ More specifically, dplyr was used to analyze aspects of the ‘sadness’ variable to demonstrate the main functions of the dplyr package. The final product is a playlist of songs between the years 1950-2019 which includes the top one ‘saddest’ song from each year.

This dataset is available via Kaggle: https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019?resource=downoad

Part 2: dataset/package

Below the dataset is read into R from a github repository and dplyr package loaded.

suppressWarnings(suppressMessages(library(dplyr)))

sad<- read.csv("https://raw.githubusercontent.com/greggmaloy/Data607_R/main/20230406_MUSIC.csv", na.strings=c("","NA"))

glimpse(sad)

## Rows: 28,372
## Columns: 31
## $ X                        <int> 0, 4, 6, 10, 12, 14, 15, 17, 20, 23, 28, 32, …
## $ artist_name              <chr> "mukesh", "frankie laine", "johnnie ray", "pé…
## $ track_name               <chr> "mohabbat bhi jhoothi", "i believe", "cry", "…
## $ release_date             <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
## $ genre                    <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ lyrics                   <chr> "hold time feel break feel untrue convince sp…
## $ len                      <int> 95, 51, 24, 54, 48, 98, 179, 21, 30, 61, 113,…
## $ dating                   <dbl> 0.000598086, 0.035537134, 0.002770083, 0.0482…
## $ violence                 <dbl> 0.063746128, 0.096776742, 0.002770083, 0.0015…
## $ world.life               <dbl> 0.000598086, 0.443435174, 0.002770083, 0.0015…
## $ night.time               <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0015…
## $ shake.the.audience       <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0215…
## $ family.gospel            <dbl> 0.048857015, 0.027007477, 0.002770083, 0.0015…
## $ romantic                 <dbl> 0.017104339, 0.001283697, 0.158564466, 0.4115…
## $ communication            <dbl> 0.263750881, 0.001283697, 0.250667910, 0.0015…
## $ obscene                  <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0015…
## $ music                    <dbl> 0.039288366, 0.118033841, 0.323794052, 0.0015…
## $ movement.places          <dbl> 0.000598086, 0.001283697, 0.002770083, 0.1292…
## $ light.visual.perceptions <dbl> 0.000598086, 0.212681067, 0.002770083, 0.0015…
## $ family.spiritual         <dbl> 0.000598086, 0.051124199, 0.002770083, 0.0015…
## $ like.girls               <dbl> 0.000598086, 0.001283697, 0.002770084, 0.0811…
## $ sadness                  <dbl> 0.380298895, 0.001283697, 0.002770083, 0.2258…
## $ feelings                 <dbl> 0.117175451, 0.001283697, 0.225422323, 0.0015…
## $ danceability             <dbl> 0.3577385, 0.3317448, 0.4562981, 0.6869923, 0…
## $ loudness                 <dbl> 0.4541189, 0.6475399, 0.5852883, 0.7444043, 0…
## $ acousticness             <dbl> 0.99799197, 0.95481923, 0.84036129, 0.0839348…
## $ instrumentalness         <dbl> 0.901821862, 0.000001530, 0.000000000, 0.1993…
## $ valence                  <dbl> 0.33944765, 0.32502061, 0.35181368, 0.7753503…
## $ energy                   <dbl> 0.1371102, 0.2632403, 0.1391123, 0.7437357, 0…
## $ topic                    <chr> "sadness", "world/life", "music", "romantic",…
## $ age                      <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1…

Part 3: select()

The select() function is used to select variables by name and index number. Select() also allows the variable to be renamed. The below code produces a new data frame, sadder_df, which consists of four variables and is a subset of the larger dataframe ‘sad’.

sadder_df<-sad %>% select(2,3,4,22) 
sadder_df %>% slice(1:5)

##            artist_name           track_name release_date     sadness
## 1               mukesh mohabbat bhi jhoothi         1950 0.380298895
## 2        frankie laine            i believe         1950 0.001283697
## 3          johnnie ray                  cry         1950 0.002770083
## 4          pérez prado             patricia         1950 0.225889484
## 5 giorgos papadopoulos   apopse eida oneiro         1950 0.068800152

Part 4: arrange()

The arrange() function sorts rows in the dataframe via values of referenced columns. Below the ‘sadness’ variable is sorted in descending order. The song ‘fading’ by decyfer down is the song with the highest ‘sadness’ score in the entire dataset.

saddest_song<-sadder_df %>%arrange(desc(sadness))
saddest_song %>% slice(1:5)

##            artist_name                       track_name release_date   sadness
## 1         decyfer down                           fading         2009 0.9814241
## 2           carl smith                           slowly         1958 0.9391476
## 3            ray price whose heart are you breaking now         1962 0.9300551
## 4 fine young cannibals              ever fallen in love         1988 0.9259045
## 5          frank zappa              tears began to fall         1971 0.9252872

Part 5: filter()

The filter() function is used to subset rows based on a reference value. Below the dataframe is filtered to songs from the year ‘1976’ and subsequently arrange by the sadness score to produce the dataframe ‘saddest_song_1976’. The song ‘why must i cry’ by Peter Tosh is the song with the highest ‘sadness’ score in 1976.

#unique(sad$genre)
saddest_song_1976<-sadder_df%>% 
  filter(release_date %in% c('1976')) %>% 
  arrange(desc(sadness))
saddest_song_1976 %>% slice(1:5)

##     artist_name      track_name release_date   sadness
## 1    peter tosh  why must i cry         1976 0.7120046
## 2 george benson this masquerade         1976 0.6924788
## 3  prince far 1          shadow         1976 0.6682303
## 4     ry cooder    yellow roses         1976 0.6283087
## 5  henri texier            amir         1976 0.6022294

Part 6: group_by()

The group_by() function is used to group rows by referenced columns. Below the group_by() function is used to create a dataframe, saddest_song, which groups each row(song) by release date and then arranges in descending order by the variable ‘sadness’. Finally the slice() function grabs the top row of each group_by. The end result is a dataframe which consists of the song with the largest ‘sadness’ score for each year.

saddest_song<-sadder_df%>%
  group_by (release_date)%>%
  arrange(desc(sadness))%>%
  slice(1)

saddest_song %>% slice(1:5)

## # A tibble: 70 × 4
## # Groups:   release_date [70]
##    artist_name          track_name                release_date sadness
##    <chr>                <chr>                            <int>   <dbl>
##  1 stélios kazantzídis  finito la mouzika                 1950   0.631
##  2 louis prima          eleanor                           1951   0.700
##  3 dean martin          i'm yours                         1952   0.594
##  4 lefty frizzell       making believe                    1953   0.694
##  5 oscar peterson       bill                              1954   0.590
##  6 tony bennett         i fall in love too easily         1955   0.841
##  7 chet baker           i fall in love too easily         1956   0.821
##  8 johnny mathis        baby, baby, baby                  1957   0.750
##  9 carl smith           slowly                            1958   0.939
## 10 sons of the pioneers teardrops in my heart             1959   0.786
## # … with 60 more rows

Part 7: summarize()

Below the summarzie() function is introduced. The summarise() function returns one row for ‘each combination of grouping variables’ (https://dplyr.tidyverse.org/reference/summarise.html). Below summarise(), group_by() and arrange_by() functions are used to 1. create a variable, sadness_agg, which aggregates the ‘sadness’ variable, 2. groups ‘sadness_agg’ by ‘release_date’ so that each year has a sadness score based on the ‘sadness_agg variable, and 3. arranges the ’release_date’ variable in descending order so the first row represents the year with the highest aggregated sadness score. The year 1981 is the year with the highest aggregated sadness score of 78.6.

saddest_year<-sadder_df%>% 
  group_by (release_date) %>% 
  summarise(sadness_agg=sum(sadness))%>% 
  arrange(desc(sadness_agg))

saddest_year %>% slice(1:5)

## # A tibble: 5 × 2
##   release_date sadness_agg
##          <int>       <dbl>
## 1         1981        78.6
## 2         1983        75.2
## 3         1984        74.5
## 4         1991        74.3
## 5         1993        71.6

Part 8: mutate()

The mutate() function facilitates variable creation.
Below a new variable, rank_sadness_year, is created which ranks the year with regard to the aggregated ‘sadness_agg’ variable. The year 1981 is the year with the highest aggregated ‘sadness’ value and is thus assigned the rank value of ‘1’.

saddest_year_rank<-sadder_df%>% 
  group_by (release_date) %>% 
  summarise(sadness_agg=sum(sadness))%>% 
  arrange(desc(sadness_agg))%>% 
  mutate(rank_sadness_year = row_number())

saddest_year_rank %>% slice(1:5)

## # A tibble: 5 × 3
##   release_date sadness_agg rank_sadness_year
##          <int>       <dbl>             <int>
## 1         1981        78.6                 1
## 2         1983        75.2                 2
## 3         1984        74.5                 3
## 4         1991        74.3                 4
## 5         1993        71.6                 5

Part 9: SQL join

Dplyr can also perform SQL left joins. Below the table ‘saddest_year_rank’, which was created in part 8, is left joined to the saddest_song dataframe to append the column ‘saddest_year_rank’ to the saddest_song dataframe. The end result is a list/potential playlist of the saddest songs for each year between the years 1950-2019 with an additional column appended ‘rank’ which ranks the song’s release_year in terms of ‘sadness’ compared to other years.

final<-saddest_song %>% left_join(saddest_year_rank, by=c('release_date'))%>% select(1,2,3,6) 
final %>% slice(1:5)

## # A tibble: 70 × 4
## # Groups:   release_date [70]
##    artist_name          track_name                release_date rank_sadness_year
##    <chr>                <chr>                            <int>             <int>
##  1 stélios kazantzídis  finito la mouzika                 1950                69
##  2 louis prima          eleanor                           1951                68
##  3 dean martin          i'm yours                         1952                67
##  4 lefty frizzell       making believe                    1953                70
##  5 oscar peterson       bill                              1954                66
##  6 tony bennett         i fall in love too easily         1955                65
##  7 chet baker           i fall in love too easily         1956                64
##  8 johnny mathis        baby, baby, baby                  1957                63
##  9 carl smith           slowly                            1958                55
## 10 sons of the pioneers teardrops in my heart             1959                50
## # … with 60 more rows

Part 10: conclusion

Dplyr functions are a powerful tool for data manipulation. Below is the final palylist

final

## # A tibble: 70 × 4
## # Groups:   release_date [70]
##    artist_name          track_name                release_date rank_sadness_year
##    <chr>                <chr>                            <int>             <int>
##  1 stélios kazantzídis  finito la mouzika                 1950                69
##  2 louis prima          eleanor                           1951                68
##  3 dean martin          i'm yours                         1952                67
##  4 lefty frizzell       making believe                    1953                70
##  5 oscar peterson       bill                              1954                66
##  6 tony bennett         i fall in love too easily         1955                65
##  7 chet baker           i fall in love too easily         1956                64
##  8 johnny mathis        baby, baby, baby                  1957                63
##  9 carl smith           slowly                            1958                55
## 10 sons of the pioneers teardrops in my heart             1959                50
## # … with 60 more rows