The data source used here is inspired by "Tidy Data" by Hadley Wickham, published in the Journal of Statistical Software. The dataset contains information about songs that entered the Billboard Top 100. The variables of interest are artist, track, date.entered, week and rank. For each song, the rank is recorded for every week after it enters the Billboard Top 100, in a separate column per week (x1st.week through x76th.week). The dataset is a .csv file. Let us load the data and see what it looks like. Get the data from: https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv
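To follow along, you can fetch the file to the local path used below; the destination name Billboard_Data.csv is simply what this post assumes (a minimal sketch):
# download the raw CSV once so the read.csv() call below can use a local copy
download.file(
  "https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv",
  destfile = "./Billboard_Data.csv"
)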
rm(list = ls())  # start with a clean workspace
library(tibble)
# read the local copy of the CSV and convert it to a tibble for nicer printing
billboard.df <- as_tibble(read.csv("./Billboard_Data.csv", stringsAsFactors = FALSE))
head(billboard.df)
## # A tibble: 6 x 83
## year artist.inverted track time genre date.entered date.peaked
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2000 Destiny's Child Inde… 3:38 Rock 2000-09-23 2000-11-18
## 2 2000 Santana Mari… 4:18 Rock 2000-02-12 2000-04-08
## 3 2000 Savage Garden I Kn… 4:07 Rock 1999-10-23 2000-01-29
## 4 2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16
## 5 2000 Aguilera, Chri… Come… 3:38 Rock 2000-08-05 2000-10-14
## 6 2000 Janet Does… 4:17 Rock 2000-06-17 2000-08-26
## # … with 76 more variables: x1st.week <int>, x2nd.week <int>,
## # x3rd.week <int>, x4th.week <int>, x5th.week <int>, x6th.week <int>,
## # x7th.week <int>, x8th.week <int>, x9th.week <int>, x10th.week <int>,
## # x11th.week <int>, x12th.week <int>, x13th.week <int>,
## # x14th.week <int>, x15th.week <int>, x16th.week <int>,
## # x17th.week <int>, x18th.week <int>, x19th.week <int>,
## # x20th.week <int>, x21st.week <int>, x22nd.week <int>,
## # x23rd.week <int>, x24th.week <int>, x25th.week <int>,
## # x26th.week <int>, x27th.week <int>, x28th.week <int>,
## # x29th.week <int>, x30th.week <int>, x31st.week <int>,
## # x32nd.week <int>, x33rd.week <int>, x34th.week <int>,
## # x35th.week <int>, x36th.week <int>, x37th.week <int>,
## # x38th.week <int>, x39th.week <int>, x40th.week <int>,
## # x41st.week <int>, x42nd.week <int>, x43rd.week <int>,
## # x44th.week <int>, x45th.week <int>, x46th.week <int>,
## # x47th.week <int>, x48th.week <int>, x49th.week <int>,
## # x50th.week <int>, x51st.week <int>, x52nd.week <int>,
## # x53rd.week <int>, x54th.week <int>, x55th.week <int>,
## # x56th.week <int>, x57th.week <int>, x58th.week <int>,
## # x59th.week <int>, x60th.week <int>, x61st.week <int>,
## # x62nd.week <int>, x63rd.week <int>, x64th.week <int>,
## # x65th.week <int>, x66th.week <lgl>, x67th.week <lgl>,
## # x68th.week <lgl>, x69th.week <lgl>, x70th.week <lgl>,
## # x71st.week <lgl>, x72nd.week <lgl>, x73rd.week <lgl>,
## # x74th.week <lgl>, x75th.week <lgl>, x76th.week <lgl>
This dataset is untidy because it violates the principles of tidy data, which state that:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
ref: https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure
In this dataset, the column headers are themselves values: the week numbers. The week numbers need a column of their own, and the values they hold, the ranks, need a column of their own. The dataset therefore violates principles #1 and #2.
To tidy this dataset, we gather all the week columns together: the column names supply the week and the cell values become the ranks. Before gathering, let's clean up the column names.
library(tidyverse)
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tidyr 0.8.1 ✔ dplyr 0.7.7
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ ggplot2 3.1.0 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
names(billboard.df)[8:83]<-gsub(".week","",colnames(billboard.df[,8:83]))
names(billboard.df)[8:83]<-gsub('.{-2}.$','',colnames(billboard.df[,8:83]))
names(billboard.df)[8:83]<-gsub('x','wk',colnames(billboard.df[,8:83]))
names(billboard.df)
## [1] "year" "artist.inverted" "track"
## [4] "time" "genre" "date.entered"
## [7] "date.peaked" "wk1" "wk2"
## [10] "wk3" "wk4" "wk5"
## [13] "wk6" "wk7" "wk8"
## [16] "wk9" "wk10" "wk11"
## [19] "wk12" "wk13" "wk14"
## [22] "wk15" "wk16" "wk17"
## [25] "wk18" "wk19" "wk20"
## [28] "wk21" "wk22" "wk23"
## [31] "wk24" "wk25" "wk26"
## [34] "wk27" "wk28" "wk29"
## [37] "wk30" "wk31" "wk32"
## [40] "wk33" "wk34" "wk35"
## [43] "wk36" "wk37" "wk38"
## [46] "wk39" "wk40" "wk41"
## [49] "wk42" "wk43" "wk44"
## [52] "wk45" "wk46" "wk47"
## [55] "wk48" "wk49" "wk50"
## [58] "wk51" "wk52" "wk53"
## [61] "wk54" "wk55" "wk56"
## [64] "wk57" "wk58" "wk59"
## [67] "wk60" "wk61" "wk62"
## [70] "wk63" "wk64" "wk65"
## [73] "wk66" "wk67" "wk68"
## [76] "wk69" "wk70" "wk71"
## [79] "wk72" "wk73" "wk74"
## [82] "wk75" "wk76"
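A quick check that all 76 week columns were renamed as expected; an empty result means every wkN name is present (a sketch):
setdiff(paste0("wk", 1:76), names(billboard.df))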
names(billboard.df)[2] <- "artist"  # rename artist.inverted to artist
library(dplyr)
# gather the 76 week columns into two columns: week (from the names) and rank (from the values)
billboard.newdf <- billboard.df %>% gather(week, rank, wk1:wk76, na.rm = TRUE)
billboard.newdf <- arrange(billboard.newdf, year, artist, track)
In the gather() call above, na.rm = TRUE drops the missing values from the gathered columns. The missing values represent weeks in which a song was not on the Billboard 100 chart, so they can be safely dropped.
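Note that gather() has since been superseded in tidyr. On tidyr 1.0.0 or later, the same reshaping can be written with pivot_longer(), where values_drop_na = TRUE plays the role of na.rm = TRUE (a sketch of the equivalent call):
billboard.newdf <- billboard.df %>%
  pivot_longer(wk1:wk76, names_to = "week", values_to = "rank",
               values_drop_na = TRUE) %>%   # drop weeks the song was off the chart
  arrange(year, artist, track)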
Let us convert the week variable to a number and compute the date corresponding to each week on the chart: week 1 falls on date.entered, week 2 is seven days later, and so on.
billboard <- billboard.newdf %>% mutate(
  week = parse_number(week),                    # "wk1" -> 1, "wk2" -> 2, ...
  date = as.Date(date.entered) + 7*(week - 1))  # calendar date of that chart week
billboard <- billboard[c("year", "artist", "time", "track", "date", "week", "rank")]
billboard
## # A tibble: 5,307 x 7
## year artist time track date week rank
## <int> <chr> <chr> <chr> <date> <dbl> <int>
## 1 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-02-26 1 87
## 2 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-04 2 82
## 3 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-11 3 72
## 4 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-18 4 77
## 5 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-25 5 87
## 6 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-04-01 6 94
## 7 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-04-08 7 99
## 8 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-02 1 91
## 9 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-09 2 87
## 10 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-16 3 92
## # … with 5,297 more rows
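As a quick sanity check that the reshaped data is tidy, each artist/track/week combination should appear in exactly one row; if so, the filter below returns zero rows (a sketch):
billboard %>% count(artist, track, week) %>% filter(n > 1)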
The data now looks cleaner and tidier: each variable has its own column and each observation its own row, so key insights can be drawn from it. We can further normalize the data by splitting it into separate tables for songs and for the weekly ranks of those songs.
Normalization of the data to make it more meaningful
# Song table: one row per unique song, with a surrogate key (song_id)
song <- billboard %>% select(artist, track, year, time) %>% unique()
song$song_id <- 1:nrow(song)
head(song)
## # A tibble: 6 x 5
## artist track year time song_id
## <chr> <chr> <int> <chr> <int>
## 1 2 Pac Baby Don't Cry (Keep Ya Head Up II) 2000 4:22 1
## 2 2Ge+her The Hardest Part Of Breaking Up (Is Gett… 2000 3:15 2
## 3 3 Doors Do… Kryptonite 2000 3:53 3
## 4 3 Doors Do… Loser 2000 4:24 4
## 5 504 Boyz Wobble Wobble 2000 3:35 5
## 6 "98\xa1" Give Me Just One Night (Una Noche) 2000 3:24 6
# Rank table: one row per song per week, keyed back to the song table by song_id
rank <- left_join(billboard, song)  # joins on all shared columns
## Joining, by = c("year", "artist", "time", "track")
rank <- rank[c("song_id","week", "date", "rank")]
rank <- arrange(rank, song_id, date)
rank$rank_id <- 1:nrow(rank)
rank
## # A tibble: 5,307 x 5
## song_id week date rank rank_id
## <int> <dbl> <date> <int> <int>
## 1 1 1 2000-02-26 87 1
## 2 1 2 2000-03-04 82 2
## 3 1 3 2000-03-11 72 3
## 4 1 4 2000-03-18 77 4
## 5 1 5 2000-03-25 87 5
## 6 1 6 2000-04-01 94 6
## 7 1 7 2000-04-08 99 7
## 8 2 1 2000-09-02 91 8
## 9 2 2 2000-09-09 87 9
## 10 2 3 2000-09-16 92 10
## # … with 5,297 more rows
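With the two normalized tables in place, questions such as "what was each song's best chart position?" become a simple summarise-and-join (a sketch):
rank %>%
  group_by(song_id) %>%
  summarise(peak = min(rank)) %>%   # best (lowest) position reached
  left_join(song, by = "song_id") %>%
  arrange(peak) %>%
  head()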
We are going to plot a histogram of the songs in our dataset to see the distribution of their time lengths. In our dataset the values of time are in the format mm:ss, so to plot a histogram of time length we first need to convert the time into seconds. We can do that with the "lubridate" package:
# histogram of the song time
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# convert "mm:ss" strings into a number of seconds, then plot the distribution
song$time <- as.numeric(as.period(ms(song$time), unit = "sec"))
ggplot(song, aes(x=time)) +
  geom_histogram(color="darkblue", fill="lightblue", binwidth=10)
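As a quick sanity check on the conversion, a 4:22 track should come out as 262 seconds (a sketch):
as.numeric(as.period(ms("4:22"), unit = "sec"))
## [1] 262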
Patterns in the plot
The above histogram of song lengths highlights the outliers: a few songs that hit the Billboard 100 are more than 6 minutes long! It also shows that most songs that hit the Billboard 100 are between roughly 220 and 250 seconds long, i.e. about 3.7 to 4.2 minutes.
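We can list those long outliers directly; time is now in seconds, so 6 minutes corresponds to 360 seconds (a sketch):
song %>% filter(time > 360) %>% arrange(desc(time))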
Next, we plot a scatterplot of rank against week on the chart, with a smoothed trend line.
# join the normalized tables back together for plotting
bars <- inner_join(song, rank, by = "song_id")
gg <- ggplot(bars, aes(x=week, y=rank)) +
  geom_point(aes(col=rank)) +
  geom_smooth(method="loess", se=F) +
  labs(subtitle="Songs over Weeks vs Ranks",
       y="Rank",
       x="Weeks",
       title="Scatterplot",
       caption = "Source: Billboard")
plot(gg)
Patterns in the plot
The above scatterplot shows the relationship between the week on the chart and the rank for the songs in our dataset. Beyond roughly week 25, the songs in our dataset are no longer part of the Billboard 100 list; new songs keep entering the chart and displace the existing ones. A song that enters the Billboard 100 typically stays on the chart for up to about 20 weeks, around five months, before it drops off.
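We can quantify this longevity directly from the rank table (a sketch):
rank %>%
  group_by(song_id) %>%
  summarise(weeks_on_chart = max(week)) %>%   # last week each song appeared on the chart
  summarise(median_weeks = median(weeks_on_chart),
            songs_past_week_25 = sum(weeks_on_chart > 25))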