In this coding project I will use the dataset called “spotifysongs.csv”. This dataset is sourced from Link to source. It covers many different artists and their songs, with data such as genre, popularity, year, and song duration. For my project I wanted to explore the top 5 artists for a particular year (2014), which I determine by identifying their top songs and how popular they were that year. The variables I will be using are: artist, song, year, and popularity. Popularity is a quantitative variable, and it is the one I will use to measure and rank these artists.

To clean this dataset, I first filter for the year 2014 so that I have the data for only that year. Next, I find the top 5 artists for that year by their popularity. I do this by creating a new variable called top_artists, which uses the group_by, summarize, and arrange functions to group each artist's songs together and add up their popularity points. After this, using the same functions, I find each artist's top song. As a result, I have narrowed the dataset to only the data needed to find the top 5 artists and their songs in 2014. Please see each chunk and its explanation for more details.
In this project, I will examine the top 5 artists for a particular year and identify each one's most popular song.
Great question! My dataset is about Spotify songs! In this dataset I will focus on a few variables/columns: year, artist, song, and popularity. My x variable will be song, my y-axis will be popularity, and my legend will be the artist name, shown on the graph by color.
These libraries just provide me with the tools/functions I’ll need.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tinytex)
This is because I downloaded my dataset. As a result, I need to tell RStudio where to find that data and load it into the program.
setwd("/Users/mikea/Desktop/Data 110 ")
Spotify <- read_csv("spotifysongs.csv")
unique(Spotify$year)
## [1] 2000 1999 2001 2011 2002 2016 1998 2018 2004 2010 2015 2006 2008 2019 2003
## [16] 2013 2005 2012 2020 2007 2009 2017 2014
Reason? This was my freshman year in high school, and I wanted to find out what songs were most popular at the time and see if I also listened to those artists/songs. Plus, nostalgia.
The reason? I need to narrow the artists and other information down to only that year. So, I used the filter function to look at the column “year” and return only the rows that pertain to 2014.
spotify_2014 <- Spotify %>%
  filter(year == 2014)
spotify_2014
## # A tibble: 104 × 18
## artist song duration_ms explicit year popularity danceability energy key
## <chr> <chr> <dbl> <lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Azeali… "212" 204956 TRUE 2014 0 0.847 0.769 11
## 2 Klangk… "Son… 238120 FALSE 2014 68 0.579 0.549 5
## 3 Ellie … "Bur… 231211 FALSE 2014 70 0.559 0.777 1
## 4 Storm … "Loo… 150400 FALSE 2014 0 0.832 0.815 0
## 5 Pharre… "Hap… 232720 FALSE 2014 79 0.647 0.822 5
## 6 ScHool… "Col… 299960 TRUE 2014 0 0.826 0.571 11
## 7 Iggy A… "Fan… 199938 TRUE 2014 69 0.912 0.716 10
## 8 Maroon… "Ani… 231013 FALSE 2014 79 0.279 0.742 4
## 9 Sam Sm… "Sta… 172723 FALSE 2014 80 0.418 0.42 0
## 10 MAGIC! "Rud… 224840 FALSE 2014 80 0.773 0.758 1
## # ℹ 94 more rows
## # ℹ 9 more variables: loudness <dbl>, mode <dbl>, speechiness <dbl>,
## # acousticness <dbl>, instrumentalness <dbl>, liveness <dbl>, valence <dbl>,
## # tempo <dbl>, genre <chr>
Why? Remember, I’m looking for the top 5 artists, and the variable spotify_2014 holds all the artists from 2014. I now create a new variable called top_artists. The group_by function groups each artist's rows together, and summarise adds up their popularity points. After that, arrange sorts the artists in descending order and top_n(5) keeps the 5 with the highest totals, ranked 1 to 5 from most to least popular.
top_artists <- spotify_2014 %>%
  group_by(artist) %>%
  summarise(total_popularity = sum(popularity)) %>%
  arrange(desc(total_popularity)) %>%
  top_n(5)
## Selecting by total_popularity
top_artists
## # A tibble: 5 × 2
## artist total_popularity
## <chr> <dbl>
## 1 Taylor Swift 431
## 2 Ariana Grande 300
## 3 Calvin Harris 246
## 4 Ed Sheeran 222
## 5 Sam Smith 221
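As a side note, the dplyr documentation marks top_n() as superseded by slice_max(), which names the ranking column explicitly instead of picking it silently. Below is a minimal sketch of the same group, sum, and keep-the-top pattern. The toy data and the names toy and top_two are made up for illustration and are not taken from spotifysongs.csv.

```r
library(dplyr)

# Toy data (hypothetical values, not from spotifysongs.csv)
toy <- tibble(
  artist     = c("A", "A", "B", "C", "C", "D"),
  popularity = c(80, 70, 90, 60, 20, 10)
)

top_two <- toy %>%
  group_by(artist) %>%
  # Add up popularity per artist and drop the grouping afterwards
  summarise(total_popularity = sum(popularity), .groups = "drop") %>%
  # Keep the 2 artists with the highest totals; the ranking column is explicit
  slice_max(total_popularity, n = 2)

top_two
```

By default slice_max() keeps ties, so it can return more than n rows when totals are equal; with_ties = FALSE forces exactly n.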
I’ll just create a new variable and filter for the songs that belong to those top 5 artists by referring to my previous variable top_artists and looking at its artist column.
filtered_songs <- spotify_2014 %>%
  filter(artist %in% top_artists$artist) %>%
  distinct(song)
filtered_songs
## # A tibble: 19 × 1
## song
## <chr>
## 1 Stay With Me
## 2 Bad Blood
## 3 Under Control (feat. Hurts)
## 4 Blame (feat. John Newman)
## 5 Summer
## 6 Break Free
## 7 Don't
## 8 Love Me Harder
## 9 Style
## 10 Problem
## 11 Shake It Off
## 12 Sing
## 13 Money On My Mind
## 14 Thinking out Loud
## 15 Outside (feat. Ellie Goulding)
## 16 Blank Space
## 17 Like I Can
## 18 Wildest Dreams
## 19 One Last Time
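Because distinct(song) keeps only the song column, the list above does not say which artist each song belongs to. A small sketch of one way to keep both columns is shown here. The data and the names toy_2014, top_toy, and songs_with_artist are hypothetical stand-ins, not the real tables.

```r
library(dplyr)

# Hypothetical stand-ins for spotify_2014 and top_artists
toy_2014 <- tibble(
  artist = c("A", "A", "B", "B", "C"),
  song   = c("x", "y", "z", "z", "w")
)
top_toy <- tibble(artist = c("A", "B"))

songs_with_artist <- toy_2014 %>%
  # Keep only rows whose artist is in the top list
  filter(artist %in% top_toy$artist) %>%
  # One row per artist/song pair, with the artist column kept
  distinct(artist, song)

songs_with_artist
```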
The previous variable showed me all of the songs by those artists in the 2014 data. Now I need to find their TOP song. I created a new variable, top_songs_by_artist. I again filter for the same 5 artists using the top_artists variable. Next, I group by artist and song and calculate the sum of popularity for each song. Then, same as before, I arrange in descending order so that each artist's top songs come first. Lastly, I group by artist, choose each artist's number one song, and print.
This was the hardest part. Once it's done, the heavy lifting is over and it's time to plot!
top_songs_by_artist <- spotify_2014 %>%
  filter(artist %in% top_artists$artist) %>%
  group_by(artist, song) %>%
  summarise(total_popularity = sum(popularity)) %>%
  arrange(artist, desc(total_popularity)) %>%
  group_by(artist) %>%
  top_n(1)
## `summarise()` has grouped output by 'artist'. You can override using the
## `.groups` argument.
## Selecting by total_popularity
top_songs_by_artist
## # A tibble: 5 × 3
## # Groups: artist [5]
## artist song total_popularity
## <chr> <chr> <dbl>
## 1 Ariana Grande Love Me Harder 148
## 2 Calvin Harris Outside (feat. Ellie Goulding) 78
## 3 Ed Sheeran Thinking out Loud 81
## 4 Sam Smith Stay With Me 80
## 5 Taylor Swift Style 138
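One thing worth checking: total_popularity is a sum over rows, so a song that appears in more than one row is counted more than once. A single-song total such as 148 or 138 may be a sign of that. Whether spotifysongs.csv actually contains repeated rows is not verified here, so this is only a sketch on made-up data of how the check and a de-duplicated ranking could look. The names toy, repeated, and top_song are hypothetical.

```r
library(dplyr)

# Toy data with one song listed twice (hypothetical values)
toy <- tibble(
  artist     = c("A", "A", "A", "B"),
  song       = c("x", "x", "y", "z"),
  popularity = c(74, 74, 69, 80)
)

# Songs that occupy more than one row
repeated <- toy %>%
  count(artist, song) %>%
  filter(n > 1)

# Keep one row per artist/song pair, then pick each artist's top song
top_song <- toy %>%
  distinct(artist, song, .keep_all = TRUE) %>%
  group_by(artist) %>%
  slice_max(popularity, n = 1, with_ties = FALSE) %>%
  ungroup()

repeated
top_song
```

With the duplicate removed, song x scores 74 rather than a summed 148, so the ranking reflects a single popularity rating per song.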
Great! The heavy lifting's been done. Now it’s just a matter of putting it all into a graph. First I used the ggplot function and passed it the top_songs_by_artist variable. Next, inside aes(), I put song on the x-axis, total_popularity on the y-axis, and set fill to artist.
The rest is just me fixing the aesthetics: adding x and y labels and a title, wrapping the song names so they fit under the bars, and moving the legend to the bottom. This makes the visualization easier on the eye.
ggplot(top_songs_by_artist, aes(x = song, y = total_popularity, fill = artist)) +
  geom_bar(stat = "identity") +
  scale_fill_discrete(name = "Artist Names") +
  xlab(" Top Songs of the year") +
  ylab("Popularity Ratings") +
  ggtitle("Top 5 Songs by Artist in the year 2014") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5)) +
  theme(legend.position = "bottom")
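By default ggplot orders a text x-axis alphabetically, so the bars do not appear in rank order. As an optional variation, the sketch below sorts the bars from most to least popular with reorder() and uses geom_col(), which is shorthand for geom_bar(stat = "identity"). The data frame toy_top and its values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for top_songs_by_artist
toy_top <- data.frame(
  artist           = c("A", "B", "C"),
  song             = c("Song one", "Song two", "Song three"),
  total_popularity = c(78, 148, 81)
)

# reorder() sorts the song labels by descending total_popularity
p <- ggplot(toy_top,
            aes(x = reorder(song, -total_popularity),
                y = total_popularity,
                fill = artist)) +
  geom_col() +
  labs(x = "Top song", y = "Popularity", fill = "Artist") +
  theme(legend.position = "bottom")

p
```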
This visualization shows the top song for each of the top 5 artists of 2014. By total popularity, the number 1 song was “Love Me Harder” by Ariana Grande. In 2nd place was “Style” by Taylor Swift, in 3rd “Thinking out Loud” by Ed Sheeran, in 4th “Stay With Me” by Sam Smith, and in 5th “Outside” by Calvin Harris. Thus, I have completed my project goal of identifying the top artists and their most popular songs in 2014 on Spotify.

This was a very fun project to explore. For my next project I would like to explore the genre aspect of Spotify by comparing the most popular genres between 2000 and 2010. I think this would be a good project to do next because it would be interesting to see whether any genres repeat or change drastically. My prediction is that I will see a spike in the rap genre, due to its increasing popularity among teens. These are some new things that I would like to include as I continue to explore the Spotify dataset.