In this assignment, you’ll practice collaborating around a code project with GitHub. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
For this assignment, dplyr was used to run a brief analysis of the ‘Music Dataset : 1950 to 2019’, which provides a list of songs ‘from 1950 to 2019 describing music metadata as sadness, danceability, loudness, acousticness, etc.’ More specifically, dplyr was used to analyze aspects of the ‘sadness’ variable in order to demonstrate the main functions of the package. The final product is a playlist covering the years 1950-2019 that contains the single ‘saddest’ song from each year.
This dataset is available via Kaggle: https://www.kaggle.com/datasets/saurabhshahane/music-dataset-1950-to-2019?resource=downoad
Below, the dataset is read into R from a GitHub repository and the dplyr package is loaded.
suppressWarnings(suppressMessages(library(dplyr)))
sad<- read.csv("https://raw.githubusercontent.com/greggmaloy/Data607_R/main/20230406_MUSIC.csv", na.strings=c("","NA"))
glimpse(sad)
## Rows: 28,372
## Columns: 31
## $ X <int> 0, 4, 6, 10, 12, 14, 15, 17, 20, 23, 28, 32, …
## $ artist_name <chr> "mukesh", "frankie laine", "johnnie ray", "pé…
## $ track_name <chr> "mohabbat bhi jhoothi", "i believe", "cry", "…
## $ release_date <int> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 195…
## $ genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
## $ lyrics <chr> "hold time feel break feel untrue convince sp…
## $ len <int> 95, 51, 24, 54, 48, 98, 179, 21, 30, 61, 113,…
## $ dating <dbl> 0.000598086, 0.035537134, 0.002770083, 0.0482…
## $ violence <dbl> 0.063746128, 0.096776742, 0.002770083, 0.0015…
## $ world.life <dbl> 0.000598086, 0.443435174, 0.002770083, 0.0015…
## $ night.time <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0015…
## $ shake.the.audience <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0215…
## $ family.gospel <dbl> 0.048857015, 0.027007477, 0.002770083, 0.0015…
## $ romantic <dbl> 0.017104339, 0.001283697, 0.158564466, 0.4115…
## $ communication <dbl> 0.263750881, 0.001283697, 0.250667910, 0.0015…
## $ obscene <dbl> 0.000598086, 0.001283697, 0.002770083, 0.0015…
## $ music <dbl> 0.039288366, 0.118033841, 0.323794052, 0.0015…
## $ movement.places <dbl> 0.000598086, 0.001283697, 0.002770083, 0.1292…
## $ light.visual.perceptions <dbl> 0.000598086, 0.212681067, 0.002770083, 0.0015…
## $ family.spiritual <dbl> 0.000598086, 0.051124199, 0.002770083, 0.0015…
## $ like.girls <dbl> 0.000598086, 0.001283697, 0.002770084, 0.0811…
## $ sadness <dbl> 0.380298895, 0.001283697, 0.002770083, 0.2258…
## $ feelings <dbl> 0.117175451, 0.001283697, 0.225422323, 0.0015…
## $ danceability <dbl> 0.3577385, 0.3317448, 0.4562981, 0.6869923, 0…
## $ loudness <dbl> 0.4541189, 0.6475399, 0.5852883, 0.7444043, 0…
## $ acousticness <dbl> 0.99799197, 0.95481923, 0.84036129, 0.0839348…
## $ instrumentalness <dbl> 0.901821862, 0.000001530, 0.000000000, 0.1993…
## $ valence <dbl> 0.33944765, 0.32502061, 0.35181368, 0.7753503…
## $ energy <dbl> 0.1371102, 0.2632403, 0.1391123, 0.7437357, 0…
## $ topic <chr> "sadness", "world/life", "music", "romantic",…
## $ age <dbl> 1.0000000, 1.0000000, 1.0000000, 1.0000000, 1…
The select() function selects variables by name or by column index, and it can also rename them. The code below produces a new data frame, sadder_df, a four-variable subset of the larger data frame ‘sad’, by picking columns 2, 3, 4 and 22 (artist_name, track_name, release_date and sadness).
sadder_df<-sad %>% select(2,3,4,22)
sadder_df %>% slice(1:5)
## artist_name track_name release_date sadness
## 1 mukesh mohabbat bhi jhoothi 1950 0.380298895
## 2 frankie laine i believe 1950 0.001283697
## 3 johnnie ray cry 1950 0.002770083
## 4 pérez prado patricia 1950 0.225889484
## 5 giorgos papadopoulos apopse eida oneiro 1950 0.068800152
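Selecting by position breaks silently if the column order ever changes, and the renaming mentioned above is not shown in the code. The sketch below illustrates both points on a small made-up data frame (`toy` and its values are hypothetical stand-ins for `sad`, and the new name `year` is only an example).

```r
library(dplyr)

# Toy data frame standing in for `sad`; the values are made up for illustration.
toy <- data.frame(
  X = 1:3,
  artist_name = c("artist a", "artist b", "artist c"),
  track_name = c("song a", "song b", "song c"),
  release_date = c(1950L, 1950L, 1951L),
  sadness = c(0.38, 0.01, 0.23)
)

# Select columns by name and rename release_date to year in the same call.
toy_subset <- toy %>%
  select(artist_name, track_name, year = release_date, sadness)

names(toy_subset)
```

Selecting by name keeps the code readable and makes it independent of where each column happens to sit in the file.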
The arrange() function sorts the rows of a data frame by the values of the referenced columns. Below, sadder_df is sorted by the ‘sadness’ variable in descending order. The song ‘fading’ by decyfer down has the highest ‘sadness’ score in the entire dataset.
saddest_song<-sadder_df %>%arrange(desc(sadness))
saddest_song %>% slice(1:5)
## artist_name track_name release_date sadness
## 1 decyfer down fading 2009 0.9814241
## 2 carl smith slowly 1958 0.9391476
## 3 ray price whose heart are you breaking now 1962 0.9300551
## 4 fine young cannibals ever fallen in love 1988 0.9259045
## 5 frank zappa tears began to fall 1971 0.9252872
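arrange() also accepts several columns, in which case later columns break ties in earlier ones. A minimal sketch on made-up data (the `toy` data frame and its values are hypothetical):

```r
library(dplyr)

# Toy data with a tie in sadness; values are made up for illustration.
toy <- data.frame(
  artist_name = c("artist b", "artist c", "artist a"),
  sadness = c(0.5, 0.9, 0.5)
)

# Sort by sadness (highest first), then alphabetically by artist to break ties.
sorted <- toy %>% arrange(desc(sadness), artist_name)

sorted$artist_name
```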
The filter() function subsets rows based on a condition. Below, the data frame is filtered to songs from the year 1976 and then arranged by the sadness score to produce the data frame ‘saddest_song_1976’. The song ‘why must i cry’ by Peter Tosh has the highest ‘sadness’ score in 1976.
#unique(sad$genre)
saddest_song_1976<-sadder_df%>%
filter(release_date == 1976) %>%
arrange(desc(sadness))
saddest_song_1976 %>% slice(1:5)
## artist_name track_name release_date sadness
## 1 peter tosh why must i cry 1976 0.7120046
## 2 george benson this masquerade 1976 0.6924788
## 3 prince far 1 shadow 1976 0.6682303
## 4 ry cooder yellow roses 1976 0.6283087
## 5 henri texier amir 1976 0.6022294
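filter() can take several conditions at once; conditions separated by commas are combined with a logical AND. The sketch below uses a made-up data frame (`toy` and the 0.6 threshold are assumptions for illustration only) to show a single-year filter next to a range filter with a score cutoff.

```r
library(dplyr)

# Toy data; values are made up for illustration.
toy <- data.frame(
  track_name = c("song a", "song b", "song c", "song d"),
  release_date = c(1975L, 1976L, 1976L, 1977L),
  sadness = c(0.70, 0.20, 0.65, 0.90)
)

# Single condition: release_date is an integer column, so compare with a number.
songs_1976 <- toy %>% filter(release_date == 1976)

# Several conditions: a range of years AND a minimum sadness score.
sad_mid_70s <- toy %>%
  filter(release_date >= 1975, release_date <= 1977, sadness > 0.6)
```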
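The group_by() function groups rows by the referenced columns. Below, group_by() is used to create a data frame, saddest_song, by grouping each row (song) by release date and sorting in descending order of ‘sadness’. By default arrange() ignores the grouping and sorts the whole data frame, but the rows within each year still end up ordered from saddest to least sad, so slice(1) keeps the saddest song of each group. The end result is a data frame containing the song with the largest ‘sadness’ score for each year. Because saddest_song is still grouped by release_date, the slice(1:5) call in the preview is also applied per group, which is why the output reports all 70 rows rather than five.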
saddest_song<-sadder_df%>%
group_by (release_date)%>%
arrange(desc(sadness))%>%
slice(1)
saddest_song %>% slice(1:5)
## # A tibble: 70 × 4
## # Groups: release_date [70]
## artist_name track_name release_date sadness
## <chr> <chr> <int> <dbl>
## 1 stélios kazantzídis finito la mouzika 1950 0.631
## 2 louis prima eleanor 1951 0.700
## 3 dean martin i'm yours 1952 0.594
## 4 lefty frizzell making believe 1953 0.694
## 5 oscar peterson bill 1954 0.590
## 6 tony bennett i fall in love too easily 1955 0.841
## 7 chet baker i fall in love too easily 1956 0.821
## 8 johnny mathis baby, baby, baby 1957 0.750
## 9 carl smith slowly 1958 0.939
## 10 sons of the pioneers teardrops in my heart 1959 0.786
## # … with 60 more rows
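As an alternative to sorting and then slicing, dplyr (version 1.0.0 or later) provides slice_max(), which picks the rows with the largest value per group directly. The sketch below is one possible formulation on made-up data, not the approach used above; `toy` and its values are hypothetical. Calling ungroup() at the end means later calls such as slice(1:5) operate on the whole table instead of on each year.

```r
library(dplyr)

# Toy data; values are made up for illustration.
toy <- data.frame(
  track_name = c("song a", "song b", "song c", "song d"),
  release_date = c(1950L, 1950L, 1951L, 1951L),
  sadness = c(0.10, 0.60, 0.80, 0.30)
)

# Keep the single saddest song per year, then drop the grouping.
top_per_year <- toy %>%
  group_by(release_date) %>%
  slice_max(sadness, n = 1, with_ties = FALSE) %>%
  ungroup()

top_per_year
```

Setting with_ties = FALSE guarantees exactly one row per year even when two songs share the top score.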
Below, the summarise() function is introduced. summarise() returns one row for ‘each combination of grouping variables’ (https://dplyr.tidyverse.org/reference/summarise.html). The group_by(), summarise() and arrange() functions are used to 1. group the songs by ‘release_date’, 2. create a variable, sadness_agg, which sums the ‘sadness’ variable within each year so that each year has an aggregated sadness score, and 3. sort the years by ‘sadness_agg’ in descending order so the first row represents the year with the highest aggregated sadness score. The year 1981 has the highest aggregated sadness score, 78.6.
saddest_year<-sadder_df%>%
group_by (release_date) %>%
summarise(sadness_agg=sum(sadness))%>%
arrange(desc(sadness_agg))
saddest_year %>% slice(1:5)
## # A tibble: 5 × 2
## release_date sadness_agg
## <int> <dbl>
## 1 1981 78.6
## 2 1983 75.2
## 3 1984 74.5
## 4 1991 74.3
## 5 1993 71.6
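A summed score grows with the number of songs recorded for a year, so a year can rank highly simply because it has more rows. Whether that affects this dataset is not checked here; the sketch below only shows, on made-up data, how summarise() can report the sum, the mean and the song count side by side (`toy`, `sadness_mean` and `n_songs` are hypothetical names).

```r
library(dplyr)

# Toy data: 1950 has three mildly sad songs, 1951 has one very sad song.
toy <- data.frame(
  release_date = c(1950L, 1950L, 1950L, 1951L),
  sadness = c(0.2, 0.2, 0.2, 0.5)
)

by_year <- toy %>%
  group_by(release_date) %>%
  summarise(
    sadness_agg = sum(sadness),    # total, as in the analysis above
    sadness_mean = mean(sadness),  # average per song
    n_songs = n()                  # number of songs in the year
  ) %>%
  arrange(desc(sadness_mean))

by_year
```

In this toy example 1950 has the larger sum but 1951 has the larger mean, so the two measures rank the years differently.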
The mutate() function facilitates variable creation.
Below a new variable, rank_sadness_year, is created which ranks the year
with regard to the aggregated ‘sadness_agg’ variable. The year 1981 is
the year with the highest aggregated ‘sadness’ value and is thus
assigned the rank value of ‘1’.
saddest_year_rank<-sadder_df%>%
group_by (release_date) %>%
summarise(sadness_agg=sum(sadness))%>%
arrange(desc(sadness_agg))%>%
mutate(rank_sadness_year = row_number())
saddest_year_rank %>% slice(1:5)
## # A tibble: 5 × 3
## release_date sadness_agg rank_sadness_year
## <int> <dbl> <int>
## 1 1981 78.6 1
## 2 1983 75.2 2
## 3 1984 74.5 3
## 4 1991 74.3 4
## 5 1993 71.6 5
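The ranking above relies on the rows already being sorted before row_number() is applied. A possible alternative, sketched here on made-up data, is min_rank(desc(...)), which computes the rank from the values themselves and gives tied years the same rank (`toy` is a hypothetical stand-in for the summarised table).

```r
library(dplyr)

# Toy summary table, deliberately left unsorted; values are for illustration.
toy <- data.frame(
  release_date = c(1984L, 1981L, 1983L),
  sadness_agg = c(74.5, 78.6, 75.2)
)

# Rank 1 goes to the largest sadness_agg, regardless of row order.
ranked <- toy %>%
  mutate(rank_sadness_year = min_rank(desc(sadness_agg)))

ranked
```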
dplyr can also perform SQL-style left joins. Below, the table ‘saddest_year_rank’, created in the previous step, is left joined to the saddest_song data frame on ‘release_date’ to append the column ‘rank_sadness_year’. The end result is a potential playlist of the saddest song for each year from 1950 to 2019, with the appended column ‘rank_sadness_year’ ranking each song’s release year against the other years by aggregated ‘sadness’.
final<-saddest_song %>% left_join(saddest_year_rank, by=c('release_date'))%>% select(1,2,3,6)
final %>% slice(1:5)
## # A tibble: 70 × 4
## # Groups: release_date [70]
## artist_name track_name release_date rank_sadness_year
## <chr> <chr> <int> <int>
## 1 stélios kazantzídis finito la mouzika 1950 69
## 2 louis prima eleanor 1951 68
## 3 dean martin i'm yours 1952 67
## 4 lefty frizzell making believe 1953 70
## 5 oscar peterson bill 1954 66
## 6 tony bennett i fall in love too easily 1955 65
## 7 chet baker i fall in love too easily 1956 64
## 8 johnny mathis baby, baby, baby 1957 63
## 9 carl smith slowly 1958 55
## 10 sons of the pioneers teardrops in my heart 1959 50
## # … with 60 more rows
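A left join keeps every row of the left table and fills in NA where the right table has no match. The sketch below shows this on two small made-up tables (`songs` and `ranks` are hypothetical) and selects the final columns by name rather than by position.

```r
library(dplyr)

# Toy tables; values are made up for illustration.
songs <- data.frame(
  track_name = c("song a", "song b", "song c"),
  release_date = c(1950L, 1951L, 1952L)
)
ranks <- data.frame(
  release_date = c(1950L, 1951L),
  rank_sadness_year = c(2L, 1L)
)

# Every song is kept; 1952 has no rank in `ranks`, so it gets NA.
playlist <- songs %>%
  left_join(ranks, by = "release_date") %>%
  select(track_name, release_date, rank_sadness_year)

playlist
```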
dplyr functions are a powerful tool for data manipulation. Below is the final playlist.
final
## # A tibble: 70 × 4
## # Groups: release_date [70]
## artist_name track_name release_date rank_sadness_year
## <chr> <chr> <int> <int>
## 1 stélios kazantzídis finito la mouzika 1950 69
## 2 louis prima eleanor 1951 68
## 3 dean martin i'm yours 1952 67
## 4 lefty frizzell making believe 1953 70
## 5 oscar peterson bill 1954 66
## 6 tony bennett i fall in love too easily 1955 65
## 7 chet baker i fall in love too easily 1956 64
## 8 johnny mathis baby, baby, baby 1957 63
## 9 carl smith slowly 1958 55
## 10 sons of the pioneers teardrops in my heart 1959 50
## # … with 60 more rows