The data source used here is inspired by "Tidy Data" by Hadley Wickham, published in the Journal of Statistical Software. The dataset contains information about songs that entered the Billboard Top 100. The variables of interest are artist, track, date.entered, week and rank. For each song, the rank is recorded for every week after it enters the Billboard Top 100, in a separate column per week (x1st.week through x76th.week). The dataset is a .csv file. Let us load the data and see what it looks like. Get the data from: https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv
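To follow along, you can fetch the file to the local path used below; the destination name Billboard_Data.csv is simply what this post assumes (a minimal sketch):
# download the raw CSV once so the read.csv() call below can use a local copy
download.file(
  "https://raw.githubusercontent.com/hadley/tidy-data/master/data/billboard.csv",
  destfile = "./Billboard_Data.csv"
)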
rm(list = ls())  # start with a clean workspace
library(tibble)
# read the local copy of the CSV and convert it to a tibble for nicer printing
billboard.df <- as_tibble(read.csv("./Billboard_Data.csv", stringsAsFactors = FALSE))
head(billboard.df)
## # A tibble: 6 x 83
## year artist.inverted track time genre date.entered date.peaked
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2000 Destiny's Child Inde… 3:38 Rock 2000-09-23 2000-11-18
## 2 2000 Santana Mari… 4:18 Rock 2000-02-12 2000-04-08
## 3 2000 Savage Garden I Kn… 4:07 Rock 1999-10-23 2000-01-29
## 4 2000 Madonna Music 3:45 Rock 2000-08-12 2000-09-16
## 5 2000 Aguilera, Chri… Come… 3:38 Rock 2000-08-05 2000-10-14
## 6 2000 Janet Does… 4:17 Rock 2000-06-17 2000-08-26
## # … with 76 more variables: x1st.week <int>, x2nd.week <int>,
## # x3rd.week <int>, x4th.week <int>, x5th.week <int>, x6th.week <int>,
## # x7th.week <int>, x8th.week <int>, x9th.week <int>, x10th.week <int>,
## # x11th.week <int>, x12th.week <int>, x13th.week <int>,
## # x14th.week <int>, x15th.week <int>, x16th.week <int>,
## # x17th.week <int>, x18th.week <int>, x19th.week <int>,
## # x20th.week <int>, x21st.week <int>, x22nd.week <int>,
## # x23rd.week <int>, x24th.week <int>, x25th.week <int>,
## # x26th.week <int>, x27th.week <int>, x28th.week <int>,
## # x29th.week <int>, x30th.week <int>, x31st.week <int>,
## # x32nd.week <int>, x33rd.week <int>, x34th.week <int>,
## # x35th.week <int>, x36th.week <int>, x37th.week <int>,
## # x38th.week <int>, x39th.week <int>, x40th.week <int>,
## # x41st.week <int>, x42nd.week <int>, x43rd.week <int>,
## # x44th.week <int>, x45th.week <int>, x46th.week <int>,
## # x47th.week <int>, x48th.week <int>, x49th.week <int>,
## # x50th.week <int>, x51st.week <int>, x52nd.week <int>,
## # x53rd.week <int>, x54th.week <int>, x55th.week <int>,
## # x56th.week <int>, x57th.week <int>, x58th.week <int>,
## # x59th.week <int>, x60th.week <int>, x61st.week <int>,
## # x62nd.week <int>, x63rd.week <int>, x64th.week <int>,
## # x65th.week <int>, x66th.week <lgl>, x67th.week <lgl>,
## # x68th.week <lgl>, x69th.week <lgl>, x70th.week <lgl>,
## # x71st.week <lgl>, x72nd.week <lgl>, x73rd.week <lgl>,
## # x74th.week <lgl>, x75th.week <lgl>, x76th.week <lgl>
This dataset is untidy because it violates the principles of tidy data, which state that:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.
ref: https://r4ds.had.co.nz/tidy-data.html#fig:tidy-structure
In this dataset, the column headers are themselves values: the week numbers. The week numbers need a column of their own, and the values they hold, the ranks, need a column of their own. The dataset therefore violates principles #1 and #2.
To tidy this dataset, we gather all the week columns together: the column names supply the week and the cell values become the ranks. Before gathering, let's clean up the column names.
library(tidyverse)
## ── Attaching packages ──────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.0
## ✔ tidyr 0.8.1 ✔ dplyr 0.7.7
## ✔ readr 1.3.1 ✔ stringr 1.4.0
## ✔ ggplot2 3.1.0 ✔ forcats 0.3.0
## ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
names(billboard.df)[8:83]<-gsub(".week","",colnames(billboard.df[,8:83]))
names(billboard.df)[8:83]<-gsub('.{-2}.$','',colnames(billboard.df[,8:83]))
names(billboard.df)[8:83]<-gsub('x','wk',colnames(billboard.df[,8:83]))
names(billboard.df)
## [1] "year" "artist.inverted" "track"
## [4] "time" "genre" "date.entered"
## [7] "date.peaked" "wk1" "wk2"
## [10] "wk3" "wk4" "wk5"
## [13] "wk6" "wk7" "wk8"
## [16] "wk9" "wk10" "wk11"
## [19] "wk12" "wk13" "wk14"
## [22] "wk15" "wk16" "wk17"
## [25] "wk18" "wk19" "wk20"
## [28] "wk21" "wk22" "wk23"
## [31] "wk24" "wk25" "wk26"
## [34] "wk27" "wk28" "wk29"
## [37] "wk30" "wk31" "wk32"
## [40] "wk33" "wk34" "wk35"
## [43] "wk36" "wk37" "wk38"
## [46] "wk39" "wk40" "wk41"
## [49] "wk42" "wk43" "wk44"
## [52] "wk45" "wk46" "wk47"
## [55] "wk48" "wk49" "wk50"
## [58] "wk51" "wk52" "wk53"
## [61] "wk54" "wk55" "wk56"
## [64] "wk57" "wk58" "wk59"
## [67] "wk60" "wk61" "wk62"
## [70] "wk63" "wk64" "wk65"
## [73] "wk66" "wk67" "wk68"
## [76] "wk69" "wk70" "wk71"
## [79] "wk72" "wk73" "wk74"
## [82] "wk75" "wk76"
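A quick check that all 76 week columns were renamed as expected; an empty result means every wkN name is present (a sketch):
setdiff(paste0("wk", 1:76), names(billboard.df))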
names(billboard.df)[2] <- "artist"  # rename artist.inverted to artist
library(dplyr)
# gather the 76 week columns into two columns: week (from the names) and rank (from the values)
billboard.newdf <- billboard.df %>% gather(week, rank, wk1:wk76, na.rm = TRUE)
billboard.newdf <- arrange(billboard.newdf, year, artist, track)
In the gather() call above, na.rm = TRUE drops the missing values from the gathered columns. The missing values represent weeks in which a song was not on the Billboard 100 chart, so they can be safely dropped.
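Note that gather() has since been superseded in tidyr. On tidyr 1.0.0 or later, the same reshaping can be written with pivot_longer(), where values_drop_na = TRUE plays the role of na.rm = TRUE (a sketch of the equivalent call):
billboard.newdf <- billboard.df %>%
  pivot_longer(wk1:wk76, names_to = "week", values_to = "rank",
               values_drop_na = TRUE) %>%   # drop weeks the song was off the chart
  arrange(year, artist, track)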
Let us convert the week variable to a number and compute the date corresponding to each week on the chart: week 1 falls on date.entered, week 2 is seven days later, and so on.
billboard <- billboard.newdf %>% mutate(
  week = parse_number(week),                    # "wk1" -> 1, "wk2" -> 2, ...
  date = as.Date(date.entered) + 7*(week - 1))  # calendar date of that chart week
billboard <- billboard[c("year", "artist", "time", "track", "date", "week", "rank")]
billboard
## # A tibble: 5,307 x 7
## year artist time track date week rank
## <int> <chr> <chr> <chr> <date> <dbl> <int>
## 1 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-02-26 1 87
## 2 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-04 2 82
## 3 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-11 3 72
## 4 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-18 4 77
## 5 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-03-25 5 87
## 6 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-04-01 6 94
## 7 2000 2 Pac 4:22 Baby Don't Cry (Keep Ya Head… 2000-04-08 7 99
## 8 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-02 1 91
## 9 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-09 2 87
## 10 2000 2Ge+her 3:15 The Hardest Part Of Breaking… 2000-09-16 3 92
## # … with 5,297 more rows
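As a quick sanity check that the reshaped data is tidy, each artist/track/week combination should appear in exactly one row; if so, the filter below returns zero rows (a sketch):
billboard %>% count(artist, track, week) %>% filter(n > 1)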
The data now looks cleaner and tidier: each variable has its own column and each observation its own row, so key insights can be drawn from it. We can further normalize the data by splitting it into separate tables for songs and for the weekly ranks of those songs.
Normalization of the data to make it more meaningful
# Song table: one row per unique song, with a surrogate key (song_id)
song <- billboard %>% select(artist, track, year, time) %>% unique()
song$song_id <- 1:nrow(song)
head(song)
## # A tibble: 6 x 5
## artist track year time song_id
## <chr> <chr> <int> <chr> <int>
## 1 2 Pac Baby Don't Cry (Keep Ya Head Up II) 2000 4:22 1
## 2 2Ge+her The Hardest Part Of Breaking Up (Is Gett… 2000 3:15 2
## 3 3 Doors Do… Kryptonite 2000 3:53 3
## 4 3 Doors Do… Loser 2000 4:24 4
## 5 504 Boyz Wobble Wobble 2000 3:35 5
## 6 "98\xa1" Give Me Just One Night (Una Noche) 2000 3:24 6
# Rank table: one row per song per week, keyed back to the song table by song_id
rank <- left_join(billboard, song)  # joins on all shared columns
## Joining, by = c("year", "artist", "time", "track")
rank <- rank[c("song_id","week", "date", "rank")]
rank <- arrange(rank, song_id, date)
rank$rank_id <- 1:nrow(rank)
rank
## # A tibble: 5,307 x 5
## song_id week date rank rank_id
## <int> <dbl> <date> <int> <int>
## 1 1 1 2000-02-26 87 1
## 2 1 2 2000-03-04 82 2
## 3 1 3 2000-03-11 72 3
## 4 1 4 2000-03-18 77 4
## 5 1 5 2000-03-25 87 5
## 6 1 6 2000-04-01 94 6
## 7 1 7 2000-04-08 99 7
## 8 2 1 2000-09-02 91 8
## 9 2 2 2000-09-09 87 9
## 10 2 3 2000-09-16 92 10
## # … with 5,297 more rows
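With the two normalized tables in place, questions such as "what was each song's best chart position?" become a simple summarise-and-join (a sketch):
rank %>%
  group_by(song_id) %>%
  summarise(peak = min(rank)) %>%   # best (lowest) position reached
  left_join(song, by = "song_id") %>%
  arrange(peak) %>%
  head()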
We are going to plot a histogram of the songs in our dataset to see the distribution of their time lengths. In our dataset the values of time are in the format mm:ss, so to plot a histogram of time length we first need to convert the time into seconds. We can do that with the "lubridate" package:
# histogram of the song time
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# convert "mm:ss" strings into a number of seconds, then plot the distribution
song$time <- as.numeric(as.period(ms(song$time), unit = "sec"))
ggplot(song, aes(x=time)) +
  geom_histogram(color="darkblue", fill="lightblue", binwidth=10)
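As a quick sanity check on the conversion, a 4:22 track should come out as 262 seconds (a sketch):
as.numeric(as.period(ms("4:22"), unit = "sec"))
## [1] 262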
Patterns in the plot
The above histogram of song lengths highlights the outliers: a few songs that hit the Billboard 100 are more than 6 minutes long! It also shows that most songs that hit the Billboard 100 are between roughly 220 and 250 seconds long, i.e. about 3.7 to 4.2 minutes.
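We can list those long outliers directly; time is now in seconds, so 6 minutes corresponds to 360 seconds (a sketch):
song %>% filter(time > 360) %>% arrange(desc(time))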
Next, we plot a scatterplot of rank against week on the chart, with a smoothed trend line.
# join the normalized tables back together for plotting
bars <- inner_join(song, rank, by = "song_id")
gg <- ggplot(bars, aes(x=week, y=rank)) +
  geom_point(aes(col=rank)) +
  geom_smooth(method="loess", se=F) +
  labs(subtitle="Songs over Weeks vs Ranks",
       y="Rank",
       x="Weeks",
       title="Scatterplot",
       caption = "Source: Billboard")
plot(gg)
Patterns in the plot
The above scatterplot shows the relationship between the week on the chart and the rank for the songs in our dataset. Beyond roughly week 25, the songs in our dataset are no longer part of the Billboard 100 list; new songs keep entering the chart and displace the existing ones. A song that enters the Billboard 100 typically stays on the chart for up to about 20 weeks, around five months, before it drops off.
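We can quantify this longevity directly from the rank table (a sketch):
rank %>%
  group_by(song_id) %>%
  summarise(weeks_on_chart = max(week)) %>%   # last week each song appeared on the chart
  summarise(median_weeks = median(weeks_on_chart),
            songs_past_week_25 = sum(weeks_on_chart > 25))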