STAT 545A - Homework 6
Christian Birch Okkels
October 21, 2013
Here, we perform a data analysis and visualization of the music chart history. The dataset to be investigated comes from the so-called Whitburn Project - a huge undertaking by music enthusiasts to preserve and share high-quality recordings of popular songs since the 1890s. The project has spawned a vast spreadsheet with data about almost 40,000 songs (as of 2013) that have been hits on the Billboard Chart since 1890. The dataset contains more than a hundred columns of raw data, with everything from song lengths, artists, songwriters, albums, labels, peak positions, number of weeks on the charts, and even week-by-week chart position, and so much more. This presents a ton of different options for data analysis and visualization; thus, with the relatively small time allotted for this assignment, we will probably just scratch the surface of the true range of possibilities. Although various versions of the dataset exist, none of them are easy to come by. However, we have been lucky to dig one up from the mound of information that is the Internet. Of course, the dataset does come with plenty of shortcomings related to missing data for certain variables. As a result, it needs cleaning and preparation before we can go toe-to-toe with the fun stuff. This cleaning procedure will be described more thoroughly later. For now, let us have a look at some of what the dataset has to offer.
As mentioned above, the dataset contains countless columns of raw data for each of the many observations. Some of the more interesting columns (for our case) are described in the table below. As some of the original variable names can be difficult to decipher, the variable names below are those specified in the cleaned version of the data.
Variable name | Description |
---|---|
Year | Year in which the single first hit its highest weekly position. |
Yearly.Rank | Yearly rankings. Formula: Highest position / Number of weeks at highest / Number of weeks in Top 10/40/100. |
Prefix | Year and Rank combined for sorting purposes. |
nWeeksChart | Number of weeks the single charted. |
nWeeksChartTop40 | Number of weeks the single charted in the Top 40 (position 40 or better). |
nWeeksChartTop10 | Number of weeks the single charted in the Top 10 (position 10 or better). |
nWeeksChartPeak | Number of weeks the single charted at its highest position. |
High | The peak position of the single. |
Artist | The artist. |
Artist.Inverted | The artist (Last name, First name). |
Featured | Featured artist(s). |
UnFeatured | Additional artists not listed as featured. |
Album | Title of the album that the single originally came from. |
Track | Title of the song. |
Time | Length of the song. |
Artist.ID | A number ID to distinguish artists. |
Label.Number | Name of the label. |
Genre | Type of music. |
Written.By | The writers of the song. |
ScorePoints | Scoring system where points are given for every week on the charts, and by how high it charted each week; 100 points for no. 1, 99 for no. 2, etc. |
Date.Entered | The month/day/year the song first hit the charts. |
Date.Peaked | The month/day/year the song first hit its highest peak position. |
X1st.Week - X66th.Week | Chart ranking history: “1st Week” is the ranking of the song upon entering the chart; “2nd Week” is the ranking position of the song in the second week (if it's still on the chart); etc. |
A separate R script, data_cleanPrepare.R, has been coded to perform the initial data cleaning and preparation.
First, the script loads some necessary R packages: lattice, plyr, and xtable:
library(lattice)
library(plyr)
library(xtable)
It then proceeds to load, or source, two functions from separate scripts:
source("func_timeToSec.R")
source("func_htmlPrint.R")
The first script, func_timeToSec.R, contains a function that converts time formats of “hh:mm:ss” into seconds. This conversion is critical, since we need numeric or integer values to compute certain statistical properties, etc.
The second script, func_htmlPrint.R, holds a function to print data.frames as HTML tables.
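The conversion function itself is not shown in this report. As a minimal sketch of how such a helper could work (the name timeToSecSketch and its handling of blank or malformed entries are assumptions, not the contents of func_timeToSec.R):

```r
# Hypothetical helper: convert "mm:ss" or "hh:mm:ss" strings to seconds.
# This is NOT the author's func_timeToSec(); blank or malformed input yields NA.
timeToSecSketch <- function(x) {
  parts <- suppressWarnings(
    as.numeric(strsplit(as.character(x), ":", fixed = TRUE)[[1]])
  )
  if (length(parts) == 0 || any(is.na(parts))) return(NA_real_)
  # Weights 60^(k-1), ..., 60, 1 for the k colon-separated fields.
  weights <- 60^((length(parts) - 1):0)
  sum(parts * weights)
}

timeToSecSketch("4:26")     # 266
timeToSecSketch("1:02:03")  # 3723
timeToSecSketch("")         # NA
```

Returning NA for blank entries keeps missing song lengths visible as missing, rather than silently turning them into 0 seconds.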
The data cleaning script then loads the raw data from the text file charts.txt:
charts_orig = read.delim("charts.txt")
# str(charts_orig) # basic sanity check.
The second line performs a basic sanity check. It has been commented out, since its output takes up a lot of space.
Next, we start to cut to the bone by keeping only the most interesting columns:
charts <- subset(charts_orig, select = c(Year, Yearly.Rank, Prefix, CH, X40,
X10, PK, High, Artist, Featured, Album, Track, Time, Genre, Temp.1, Date.Entered,
Date.Peaked))
All of the many other columns are thereby excluded from the new data.frame charts. Now, as seen in the code above, some variables have rather odd names, e.g. CH, X40, etc. We therefore rename them to something more meaningful:
charts <- rename(charts, c(CH = "nWeeksChart", X40 = "nWeeksChartTop40", X10 = "nWeeksChartTop10",
PK = "nWeeksChartPeak", Temp.1 = "ScorePoints"))
Many observations are missing data for certain variables. This is particularly so for older songs from the 1940s and earlier. Therefore, we keep only the data for the years 1950-2013:
charts <- subset(charts, Year > 1949)
Now, the current column Time contains song lengths in the format “mm:ss”. For many purposes, we would like to work with a more manageable format. We thus create a new column, Time.num, which contains the song length in seconds, and add it to our data.frame. This is where we use the aforementioned function in the script func_timeToSec.R.
charts$Time.num <- sapply(charts$Time, func_timeToSec)
As mentioned, some observations are missing data for certain variables, even after we cut away the older data. The main variable that we consider here is the new Time.num, which is derived from Time. Therefore, we find all blank entries in Time, replace them with NAs, and then eliminate all observations with NA:
is.na(charts$Time) <- which(charts$Time == "")
charts <- subset(charts, !is.na(charts$Time))
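The blank-to-NA step can be tried in isolation on a toy data.frame. This is only a sketch with made-up rows (the data.frame toy is hypothetical); it uses the same is.na<- idiom as the code above:

```r
# Toy data (made up) with one blank Time entry.
toy <- data.frame(Track = c("A", "B", "C"),
                  Time  = c("2:30", "", "3:45"),
                  stringsAsFactors = FALSE)

# Mark blank strings as missing, then drop the incomplete rows.
is.na(toy$Time) <- which(toy$Time == "")
toyClean <- subset(toy, !is.na(Time))

toyClean$Track  # "A" "C"
```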
The nWeeksChart column has also presented plenty of problems. It contains both blank elements and the strange string n/a (not even “NA”). Moreover, we would like it to be numeric. These cleaning procedures took a long time to figure out for this particular case, but we finally found the solution:
# Remove all observations with blanks or 'n/a''s in nWeeksChart column.
charts <- subset(charts, !nWeeksChart == "")
charts <- subset(charts, !nWeeksChart == "n/a")
charts$nWeeksChart <- factor(charts$nWeeksChart) # update factor levels.
# Convert nWeeksChart from Factor to Numeric:
charts$nWeeksChart <- as.numeric(levels(charts$nWeeksChart))[as.integer(charts$nWeeksChart)]
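The reason for indexing the levels, rather than calling as.numeric() on the factor directly, is that a factor stores integer level codes internally. A small sketch with made-up values illustrates the difference:

```r
# A factor whose labels look like numbers (made-up values).
f <- factor(c("48", "40", "7"))

# Direct conversion returns the internal level codes, not the labels.
codes <- as.numeric(f)

# Converting the levels and indexing by the codes recovers the intended values.
values <- as.numeric(levels(f))[as.integer(f)]

codes   # 2 1 3
values  # 48 40 7
```

The simpler as.numeric(as.character(f)) gives the same values; the levels-based form only converts each distinct level once.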
Finally, we write the cleaned data to the file charts_clean.tsv:
write.table(charts, "charts_clean.tsv", quote = FALSE, sep = "\t", row.names = FALSE)
This part is the central one; here, we perform a variety of data aggregation and plotting tasks and write the data tables and figures to file.
The R script to perform all of this is data_aggregatePlot.R. It starts out by loading the necessary libraries and sourcing the two functions. Moreover, it reads the cleaned data saved by the script described above.
charts <- read.delim("charts_clean.tsv")
str(charts) # basic sanity check.
## 'data.frame': 28142 obs. of 18 variables:
## $ Year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ Yearly.Rank : Factor w/ 890 levels "1","10","100",..: 127 206 230 244 271 295 314 330 339 352 ...
## $ Prefix : Factor w/ 28142 levels "1950_001","1950_002",..: 27812 27813 27814 27815 27816 27817 27818 27819 27820 27821 ...
## $ nWeeksChart : int 48 40 47 24 33 55 25 49 22 24 ...
## $ nWeeksChartTop40: int 17 33 29 19 17 35 15 41 6 18 ...
## $ nWeeksChartTop10: int 0 13 0 10 0 20 0 21 0 1 ...
## $ nWeeksChartPeak : int 1 1 2 2 1 4 1 6 1 1 ...
## $ High : Factor w/ 103 levels "","--","0","1",..: 17 60 11 49 12 27 22 4 26 93 ...
## $ Artist : Factor w/ 7290 levels "'Til Tuesday",..: 5075 5357 3116 3761 2108 3116 6033 4397 4160 5148 ...
## $ Featured : Factor w/ 1856 levels ""," Feat. Master P, Destiny's Child, O'Dell, Mo B. Dick",..: 1 1 1 1088 1 1 1417 189 647 1 ...
## $ Album : Factor w/ 6668 levels "","#1's","#1s\311and Then Some",..: 3723 6656 1224 675 17 1224 6656 6656 6656 6116 ...
## $ Track : Factor w/ 22807 levels "#1","#1 Dee Jay",..: 11329 7476 10001 1905 18377 14935 6755 19766 13542 20213 ...
## $ Time : Factor w/ 357 levels "1:02","1:16",..: 185 128 158 146 177 106 121 154 141 166 ...
## $ Genre : Factor w/ 39 levels "","Adult Contemporary",..: 1 32 1 1 1 1 1 1 1 1 ...
## $ ScorePoints : Factor w/ 2686 levels "","#REF!","0",..: 1346 1688 1555 911 935 1855 475 1918 2681 810 ...
## $ Date.Entered : Factor w/ 3285 levels "02/20/2010","04/21/2012",..: 1755 2448 2322 2709 2637 2818 3268 3069 3141 513 ...
## $ Date.Peaked : Factor w/ 3284 levels "","1/1/00","1/1/05",..: 1155 91 162 232 27 2698 1232 1197 27 1155 ...
## $ Time.num : int 266 209 239 227 258 187 202 235 222 247 ...
The output above gives us a good overview of the cleaned, prepared data. For instance, we see the new Time.num column; it has integer elements which are exactly the song lengths in seconds (this can be confirmed by comparing the song lengths to those in the “mm:ss” format in the Time column). The nWeeksChart column, which was earlier a factor, has also been converted successfully; it was converted to numeric, but since it contains only whole numbers, read.delim() reads it back in as integer.
In the following, we consider an array of different data aggregation and plotting tasks.
We start with the simple task of visualizing the distribution of song durations.
histogram(~Time.num, charts, nint = 50, col = "blue", main = "Song length distribution",
xlab = "Song length (seconds)")
We see two peaks at roughly 150 and 225 seconds, corresponding to 2:30 and 3:45 minutes, respectively. It is interesting that the left dropoff is considerably steeper than the one on the right; the distribution is right-skewed, so unusually long songs are more common than unusually short ones.
In this part we consider how song lengths have evolved through time. This is done by making a scatterplot of duration against year. In order to avoid overplotting, we specify the alpha argument. Moreover, a line is drawn through the average song length for each year.
startYear <- min(charts$Year) # starting year (used for plotting x-range).
endYear <- max(charts$Year) # final year (used for plotting x-range).
songLengthVsYear_xyPlot <- xyplot(Time.num ~ Year, charts,
main = "Song length vs. year", xlab = "Year", ylab = "Song length (seconds)", grid = TRUE,
alpha = 0.5, # combat overplotting via alpha.
type = c("p", "a"), col.line = "darkorange", lwd = 3, # draw averages.
scales = list(y = list(at = seq(0, 600, 30)),
x = list(at = seq(startYear, endYear, 5))))
print(songLengthVsYear_xyPlot)
This shows something very interesting; having increased through the 1960s and up to the beginning of the 1990s, the average song length peaked just before 1995 and has slowly decreased since then. (A similar plot can be made using stripplot, as exemplified in the data aggregation script.)
Now, we can also combat overplotting via smoothScatter in the panel argument, as done below:
songLengthVsYear_xyPlot2 <- xyplot(Time.num ~ Year, charts, main = "Song length vs. year",
xlab = "Year", ylab = "Song length (seconds)", grid = TRUE, scales = list(y = list(at = seq(0,
600, 30)), x = list(at = seq(startYear, endYear, 5))), panel = panel.smoothScatter) # combat overplotting via smoothScatter.
print(songLengthVsYear_xyPlot2)
## KernSmooth 2.23 loaded Copyright M. P. Wand 1997-2009 (loaded the
## KernSmooth namespace)
This gives a nice view of where the majority of the data points are located. From the blue brush-like stroke, we see the same behaviour as described above.
This part is much like the previous, except we consider decades instead of years. This is mainly used as an exercise to learn more about R. In this case, considering decades instead of years can lead to some very interesting possibilities; for example, it allows us to more easily condition on the decade variable as a factor (conditioning on the year variable would simply be too much, as there are too many levels), whereby we can obtain a lot of interesting plots.
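First, we use cut() to make a new factor called Decade. The factor levels are renamed directly in the function call. Note that cut() with 6 bins splits 1950-2013 into equal-width intervals of about 10.5 years, so the decade labels are approximate and the years 2010-2013 fall under “2000s”. Then we use the decades in the plot: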
Decade <- cut(charts$Year, 6, labels = c("1950s", "1960s", "1970s", "1980s",
"1990s", "2000s"))
songLengthVsDecade_stripplot <- stripplot(Time.num ~ Decade, charts, grid = TRUE,
main = "Song length vs. decade", xlab = "Decade", ylab = "Song length (seconds)",
jitter.data = TRUE, panel = panel.smoothScatter, scales = list(y = list(at = seq(0,
600, 30), rot = c(0, 0), cex = 0.8), x = list(rot = c(0, 0), cex = 0.8)))
print(songLengthVsDecade_stripplot)
The same behaviour described above can also be discerned here. Another interesting thing is the data for the 1960s; the big, blue, faded dot appears darker than those for the other decades. This means that more points are gathered there. Consequently, it was more common in the 1960s for songs to be of roughly similar durations. Later on, especially in the past two decades, there seems to be a larger spread.
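If exact calendar decades are wanted, cut() can be given explicit breaks instead of a number of bins. The following sketch uses a made-up vector of years; whether exact decades are preferable for this analysis is an assumption, and the names approxDecade and exactDecade are illustrative only:

```r
years <- c(1950, 1959, 1960, 1999, 2000, 2013)  # made-up sample of years

# Six equal-width bins over 1950-2013 are about 10.5 years wide,
# so 1960 lands in the first bin and 2013 in the last one.
approxDecade <- cut(years, 6, labels = c("1950s", "1960s", "1970s",
                                         "1980s", "1990s", "2000s"))

# Explicit breaks give exact decades; right = FALSE makes the bins
# [1950, 1960), [1960, 1970), ..., [2010, 2020).
exactDecade <- cut(years, breaks = seq(1950, 2020, by = 10), right = FALSE,
                   labels = paste0(seq(1950, 2010, by = 10), "s"))

data.frame(years, approxDecade, exactDecade)
```

With explicit breaks a seventh level, “2010s”, appears for the years 2010-2013.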
Let us find the shortest and longest songs for each year. We want not only the minimum and maximum song lengths, but also some information about these songs, e.g. song title, artist, etc. This data aggregation is performed via the plyr library. In the ddply() calls we supply our own customized function:
minSongLengthInfoEachYear <- ddply(charts, ~Year, function(x) {
theMin <- which.min(x$Time.num)
shortestSongInfo <- x[theMin, c("Year", "Track", "Artist", "Time.num", "Time")]
shortestSongInfo <- rename(shortestSongInfo, c(Time.num = "TimeInSeconds"))
})
maxSongLengthInfoEachYear <- ddply(charts, ~Year, function(x) {
theMax <- which.max(x$Time.num)
longestSongInfo <- x[theMax, c("Year", "Track", "Artist", "Time.num", "Time")]
longestSongInfo <- rename(longestSongInfo, c(Time.num = "TimeInSeconds"))
})
write.table(minSongLengthInfoEachYear, "table_minSongLengthInfoEachYear.txt",
quote = FALSE, sep = "\t", row.names = FALSE) # write to file.
write.table(maxSongLengthInfoEachYear, "table_maxSongLengthInfoEachYear.txt",
quote = FALSE, sep = "\t", row.names = FALSE) # write to file.
The above code only defines the data.frames and writes them to file.
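For readers without plyr, the same split-apply-combine idea can be sketched in base R. The toy data below are made up, and the name shortestPerYear is illustrative; this is an alternative sketch, not the code used in the script:

```r
# Made-up toy data in the same shape as the relevant columns of charts.
toy <- data.frame(Year     = c(1964, 1964, 1976, 1976),
                  Track    = c("A", "B", "C", "D"),
                  Time.num = c(62, 150, 570, 200),
                  stringsAsFactors = FALSE)

# Split by year, keep the row with the shortest song, and bind the rows again.
shortestPerYear <- do.call(rbind, lapply(split(toy, toy$Year), function(x) {
  x[which.min(x$Time.num), ]
}))

shortestPerYear$Track  # "A" "D"
```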
NOTE: For the sake of exercise, let us try to read in data from these files and then show it in HTML tables. The tables are rather long, so we will just do it for the minimum song lengths. (To see the longest songs and their info, uncomment the second line.)
htmlPrint(read.delim("table_minSongLengthInfoEachYear.txt"))
Year | Track | Artist | TimeInSeconds | Time |
---|---|---|---|---|
1950 | A Bushel and a Peck | Margaret Whiting & Jimmy Wakely | 120 | 2:00 |
1951 | Jingle Bells | Les Paul | 94 | 1:34 |
1952 | Meet Mister Callaghan | Les Paul | 107 | 1:47 |
1953 | The Typewriter | Leroy Anderson & His Pops Concert Orchestra | 95 | 1:35 |
1954 | Oh, That’ll be Joyful | Four Lads | 155 | 2:35 |
1955 | Ballad Of Davy Crockett | Fess Parker | 99 | 1:39 |
1956 | Dear Elvis (Page 1) | Audrey | 93 | 1:33 |
1957 | Santa And The Satellite (Part I) | Buchanan & Goodman | 83 | 1:23 |
1958 | Bluebirds Over The Mountain | Ersel Hickey | 86 | 1:26 |
1959 | Some Kind-A Earthquake | Duane Eddy | 77 | 1:17 |
1960 | What Do You Want? | Bobby Vee | 94 | 1:34 |
1961 | Let’s Get Together | Hayley Mills | 88 | 1:28 |
1962 | Sugar Blues | Ace Cannon | 90 | 1:30 |
1963 | Ten Little Indians | Beach Boys, The | 85 | 1:25 |
1964 | Little Boxes | Womenfolk, The | 62 | 1:02 |
1965 | Sunshine, Lollipops And Rainbows | Lesley Gore | 97 | 1:37 |
1966 | Please Tell Me Why | Dave Clark Five | 90 | 1:30 |
1967 | Long Legged Girl (With The Short Dress On) | Elvis Presley | 86 | 1:26 |
1968 | Tip-Toe Thru The Tulips With Me | Tiny Tim | 108 | 1:48 |
1969 | She’s A Lady | John Sebastian | 105 | 1:45 |
1970 | Theme Music for The film 2001 A Space Odyssey | Berlin Philharmonic | 98 | 1:38 |
1971 | Rags To Riches | Elvis Presley | 114 | 1:54 |
1972 | Those Were The Days | Carroll O'Connor & Jean Stapleton (As The Bunkers) | 87 | 1:27 |
1973 | Dueling Tubas (Theme From Belligerence) | Martin Mull | 86 | 1:26 |
1974 | Energy Crisis ‘74 | Dickie Goodman | 120 | 2:00 |
1975 | Sneaky Snake | Tom T. Hall | 117 | 1:57 |
1976 | Hurt | Elvis Presley | 125 | 2:05 |
1977 | Telephone Man | Meri Wilson | 118 | 1:58 |
1978 | Do You Wanna Dance | Ramones | 115 | 1:55 |
1979 | Good Timin’ | Beach Boys, The | 130 | 2:10 |
1980 | Theme From The Dukes Of Hazzard (Good Ol' Boys) | Waylon | 126 | 2:06 |
1981 | Almost Saturday Night | Dave Edmunds | 131 | 2:11 |
1982 | Come Go With Me | Beach Boys, The | 126 | 2:06 |
1983 | Holiday Road | Lindsey Buckingham | 131 | 2:11 |
1984 | Sunshine In The Shade | Fixx, The | 146 | 2:26 |
1985 | Miami Vice Theme | Jan Hammer | 146 | 2:26 |
1986 | In Between Days | Cure, The | 136 | 2:16 |
1987 | Come On, Let’s Go | Los Lobos | 129 | 2:09 |
1988 | Hippy Hippy Shake | Georgia Satellites, The | 105 | 1:45 |
1989 | Pop Singer | John Cougar Mellencamp | 165 | 2:45 |
1990 | Drag My Bad Name Down | 4 Of Us, The | 170 | 2:50 |
1991 | The Star Spangled Banner | Whitney Houston | 129 | 2:09 |
1992 | All Shook Up | Billy Joel | 125 | 2:05 |
1993 | Chattahoochee | Alan Jackson | 144 | 2:24 |
1994 | Bizarre Love Triangle | Frente! | 117 | 1:57 |
1995 | Roll To Me | Del Amitri | 127 | 2:07 |
1996 | Esa Nena Linda | Artie The 1 Man Party | 156 | 2:36 |
1997 | Little Bitty | Alan Jackson | 155 | 2:35 |
1998 | We Shouldn’t Really Be Doing This | George Strait | 149 | 2:29 |
1999 | Crazy Little Thing Called Love | Dwight Yoakam | 142 | 2:22 |
2000 | www.memory | Alan Jackson | 156 | 2:36 |
2001 | The Star Spangled Banner | Whitney Houston | 129 | 2:09 |
2002 | Some Days You Gotta Dance | Dixie Chicks, The | 150 | 2:30 |
2003 | Faint | Linkin Park | 162 | 2:42 |
2004 | Drinkin' Bone | Tracy Byrd | 129 | 2:09 |
2005 | Naked | Marques Houston | 130 | 2:10 |
2006 | What I’ve Been Looking For (Reprise) | Zac Efron | 79 | 1:19 |
2007 | Not Fade Away | Sheryl Crow | 125 | 2:05 |
2008 | Anyone Else But You | Michael Cera & Ellen Page | 116 | 1:56 |
2009 | It’s My Life / Confessions Part II | Glee Cast | 111 | 1:51 |
2010 | Sing! | Glee Cast | 111 | 1:51 |
2011 | Isn’t She Lovely | Glee Cast | 98 | 1:38 |
2012 | Yesterday | Adam Levine | 131 | 2:11 |
2013 | Cups | Anna Kendrick | 76 | 1:16 |
# htmlPrint(read.delim('table_maxSongLengthInfoEachYear.txt'))
We can easily find the shortest and longest songs for ALL the years considered:
htmlPrint(minSongLengthInfoEachYear[which.min(minSongLengthInfoEachYear$TimeInSeconds),
c("Year", "Track", "Artist", "TimeInSeconds", "Time")])
Year | Track | Artist | TimeInSeconds | Time |
---|---|---|---|---|
1964 | Little Boxes | Womenfolk, The | 62 | 1:02 |
htmlPrint(maxSongLengthInfoEachYear[which.max(maxSongLengthInfoEachYear$TimeInSeconds),
c("Year", "Track", "Artist", "TimeInSeconds", "Time")])
Year | Track | Artist | TimeInSeconds | Time |
---|---|---|---|---|
1976 | A Better Place To Be (Live) (Parts 1 & 2) | Harry Chapin | 570 | 9:30 |
On a related note, we can also find the average song length for ALL years:
sprintf("Average song length for years %d-%d = %4.2f seconds.", min(charts$Year),
max(charts$Year), mean(charts$Time.num, na.rm = TRUE))
## [1] "Average song length for years 1950-2013 = 200.79 seconds."
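To read such a value in the “m:ss” format used in the Time column, the seconds can be converted back. A minimal sketch, with a hypothetical helper name secToTimeSketch that is not part of the original scripts:

```r
# Hypothetical helper: format a duration in seconds as "m:ss"
# (fractions of a second are dropped).
secToTimeSketch <- function(s) {
  s <- as.integer(floor(s))
  sprintf("%d:%02d", s %/% 60L, s %% 60L)
}

secToTimeSketch(200.79)  # "3:20"
secToTimeSketch(62)      # "1:02"
```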
Above, we looked at tables and song info for the shortest and longest songs. We now take a step forward and try to visualize the minimum and maximum song lengths through time. We also include the average. In the function in the ddply() call below, these statistics are included as the levels of a factor in the resulting data.frame:
songLengthStatsEachYear2 <- ddply(charts, ~Year, function(x) {
cLevels <- c("min", "max", "avg")
data.frame(stat = factor(cLevels, levels = cLevels), songLength = c(range(x$Time.num,
na.rm = TRUE), mean(x$Time.num, na.rm = TRUE)))
})
write.table(songLengthStatsEachYear2, "table_songLengthStatsEachYear2.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
NOTE: Above, the data.frame is only defined and written to file, because we once again want to toy with reading it back in and showing it:
htmlPrint(read.delim("table_songLengthStatsEachYear2.txt"))
Year | stat | songLength |
---|---|---|
1950 | min | 120 |
1950 | max | 213 |
1950 | avg | 170 |
1951 | min | 94 |
1951 | max | 246 |
1951 | avg | 167 |
1952 | min | 107 |
1952 | max | 367 |
1952 | avg | 165 |
1953 | min | 95 |
1953 | max | 405 |
1953 | avg | 165 |
1954 | min | 155 |
1954 | max | 155 |
1954 | avg | 155 |
1955 | min | 99 |
1955 | max | 370 |
1955 | avg | 156 |
1956 | min | 93 |
1956 | max | 335 |
1956 | avg | 153 |
1957 | min | 83 |
1957 | max | 220 |
1957 | avg | 147 |
1958 | min | 86 |
1958 | max | 410 |
1958 | avg | 143 |
1959 | min | 77 |
1959 | max | 249 |
1959 | avg | 143 |
1960 | min | 94 |
1960 | max | 282 |
1960 | avg | 148 |
1961 | min | 88 |
1961 | max | 495 |
1961 | avg | 148 |
1962 | min | 90 |
1962 | max | 354 |
1962 | avg | 150 |
1963 | min | 85 |
1963 | max | 271 |
1963 | avg | 149 |
1964 | min | 62 |
1964 | max | 219 |
1964 | avg | 149 |
1965 | min | 97 |
1965 | max | 360 |
1965 | avg | 155 |
1966 | min | 90 |
1966 | max | 333 |
1966 | avg | 157 |
1967 | min | 86 |
1967 | max | 298 |
1967 | avg | 161 |
1968 | min | 108 |
1968 | max | 440 |
1968 | avg | 170 |
1969 | min | 105 |
1969 | max | 444 |
1969 | avg | 180 |
1970 | min | 98 |
1970 | max | 413 |
1970 | avg | 186 |
1971 | min | 114 |
1971 | max | 410 |
1971 | avg | 186 |
1972 | min | 87 |
1972 | max | 516 |
1972 | avg | 200 |
1973 | min | 86 |
1973 | max | 391 |
1973 | avg | 201 |
1974 | min | 120 |
1974 | max | 390 |
1974 | avg | 199 |
1975 | min | 117 |
1975 | max | 444 |
1975 | avg | 200 |
1976 | min | 125 |
1976 | max | 570 |
1976 | avg | 207 |
1977 | min | 118 |
1977 | max | 391 |
1977 | avg | 208 |
1978 | min | 115 |
1978 | max | 475 |
1978 | avg | 213 |
1979 | min | 130 |
1979 | max | 437 |
1979 | avg | 218 |
1980 | min | 126 |
1980 | max | 396 |
1980 | avg | 217 |
1981 | min | 131 |
1981 | max | 393 |
1981 | avg | 219 |
1982 | min | 126 |
1982 | max | 347 |
1982 | avg | 217 |
1983 | min | 131 |
1983 | max | 367 |
1983 | avg | 227 |
1984 | min | 146 |
1984 | max | 372 |
1984 | avg | 233 |
1985 | min | 146 |
1985 | max | 382 |
1985 | avg | 236 |
1986 | min | 136 |
1986 | max | 375 |
1986 | avg | 240 |
1987 | min | 129 |
1987 | max | 352 |
1987 | avg | 238 |
1988 | min | 105 |
1988 | max | 361 |
1988 | avg | 241 |
1989 | min | 165 |
1989 | max | 444 |
1989 | avg | 245 |
1990 | min | 170 |
1990 | max | 400 |
1990 | avg | 248 |
1991 | min | 129 |
1991 | max | 489 |
1991 | avg | 248 |
1992 | min | 125 |
1992 | max | 536 |
1992 | avg | 256 |
1993 | min | 144 |
1993 | max | 423 |
1993 | avg | 254 |
1994 | min | 117 |
1994 | max | 392 |
1994 | avg | 245 |
1995 | min | 127 |
1995 | max | 455 |
1995 | avg | 250 |
1996 | min | 156 |
1996 | max | 410 |
1996 | avg | 250 |
1997 | min | 155 |
1997 | max | 443 |
1997 | avg | 248 |
1998 | min | 149 |
1998 | max | 392 |
1998 | avg | 240 |
1999 | min | 142 |
1999 | max | 429 |
1999 | avg | 235 |
2000 | min | 156 |
2000 | max | 470 |
2000 | avg | 242 |
2001 | min | 129 |
2001 | max | 415 |
2001 | avg | 238 |
2002 | min | 150 |
2002 | max | 370 |
2002 | avg | 239 |
2003 | min | 162 |
2003 | max | 468 |
2003 | avg | 241 |
2004 | min | 129 |
2004 | max | 394 |
2004 | avg | 235 |
2005 | min | 130 |
2005 | max | 349 |
2005 | avg | 232 |
2006 | min | 79 |
2006 | max | 380 |
2006 | avg | 227 |
2007 | min | 125 |
2007 | max | 448 |
2007 | avg | 231 |
2008 | min | 116 |
2008 | max | 515 |
2008 | avg | 236 |
2009 | min | 111 |
2009 | max | 392 |
2009 | avg | 229 |
2010 | min | 111 |
2010 | max | 416 |
2010 | avg | 231 |
2011 | min | 98 |
2011 | max | 436 |
2011 | avg | 227 |
2012 | min | 131 |
2012 | max | 502 |
2012 | avg | 231 |
2013 | min | 76 |
2013 | max | 484 |
2013 | avg | 232 |
The data in this data.frame can be used for plotting:
minMaxAvgSongLengthVsYear <- xyplot(songLength ~ Year, songLengthStatsEachYear2,
main = "min, max, and average song length vs. year", ylab = "Song length (seconds)",
group = stat, type = "b", grid = "h", as.table = TRUE, auto.key = list(columns = 3))
# print(minMaxAvgSongLengthVsYear)
png("plot_minMaxAvgSongLengthVsYear.png")
print(minMaxAvgSongLengthVsYear)
dev.off()
## pdf
## 2
It is on purpose that we don't immediately print the plot in the above code chunk; we just save it to file.
NOTE: We now want to embed the pre-made plot in this document:
From the green graph we see that the average song length increased from the 1960s to the beginning of the 1990s and has decreased slightly since then. This matches what we discussed in some of the earlier sections. Moreover, it is interesting to notice how the minimum song length appears to vary much less from year to year than the maximum song length. (It should be noted that the point in 1954 where the max, min, and avg coincide at 155 seconds most likely reflects a single observation for that year; it should thus be taken with a grain of salt, or simply be disregarded.)
We now consider how the number of songs listed on the charts has evolved from year to year. We skip the years up to and including 1955, since these contain so few data points and so many missing values that they would distort the counts.
nSongsEachYear <- ddply(subset(charts, Year > 1955), ~Year, summarize, nSongs = length(Prefix))
# the variable Prefix is unique for each song, which is optimal for this
# case.
nSongsVsYear <- xyplot(nSongs ~ Year, nSongsEachYear, main = "No. of songs on the charts vs. year",
ylab = "No. of songs", type = "b", grid = "h")
# print(nSongsVsYear)
png("plot_nSongsVsYear.png")
print(nSongsVsYear)
dev.off()
## pdf
## 2
NOTE: Again, we only define the plot and print it to file; we want to bring it back in just like before by embedding the pre-made plot in this document:
This plot shows something very interesting, telling us a lot about the diversity of songs on the chart through time. We see a clear peak in the mid-1960s; more than 700 songs were on the chart in each of the years 1964-1967. From the 1970s and all the way to the beginning of the 2000s, the number of songs decreased overall; this indicates less diversity on the charts in this period, i.e. the same songs seemed to dominate. From about 2004, the number of songs increases markedly, peaking at about 500 in 2011. The sudden drop after this could be real, or it could be attributed to the dataset not having been updated with all the new songs for the most recent years of 2012 and 2013.
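The per-year counts can also be cross-checked in base R with table(). The sketch below uses made-up years (one entry per song) and an illustrative name, nSongsSketch:

```r
toyYears <- c(1964, 1964, 1964, 2011, 2011)  # made-up years, one per song

# Tabulate the years and turn the result into a data.frame with a count column.
nSongsSketch <- as.data.frame(table(Year = toyYears), responseName = "nSongs")

# table() stores the years as factor levels; convert them back to integers.
nSongsSketch$Year <- as.integer(as.character(nSongsSketch$Year))

nSongsSketch$nSongs  # 3 2
```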
Here, we consider the number and the proportion of songs that have a duration longer than a certain threshold. This threshold is set to be the average song length for the entire time period, but it can be changed at will. The resulting data.frame is shown in the table below.
threshold <- mean(charts$Time.num, na.rm = TRUE)
nSongsLongerThanAvgEachYear <- ddply(subset(charts, Year > 1955), ~Year, function(x) {
count <- sum(x$Time.num >= threshold, na.rm = TRUE)
total <- nrow(x)
prop <- count/total
data.frame(Count = count, Total = total, Proportion = prop)
})
htmlPrint(nSongsLongerThanAvgEachYear, digits = 2)
Year | Count | Total | Proportion |
---|---|---|---|
1956 | 11 | 505 | 0.02 |
1957 | 6 | 496 | 0.01 |
1958 | 4 | 530 | 0.01 |
1959 | 5 | 576 | 0.01 |
1960 | 15 | 602 | 0.02 |
1961 | 12 | 681 | 0.02 |
1962 | 14 | 676 | 0.02 |
1963 | 13 | 658 | 0.02 |
1964 | 5 | 718 | 0.01 |
1965 | 27 | 717 | 0.04 |
1966 | 20 | 743 | 0.03 |
1967 | 45 | 739 | 0.06 |
1968 | 83 | 686 | 0.12 |
1969 | 138 | 672 | 0.21 |
1970 | 161 | 653 | 0.25 |
1971 | 161 | 635 | 0.25 |
1972 | 236 | 591 | 0.40 |
1973 | 237 | 536 | 0.44 |
1974 | 223 | 496 | 0.45 |
1975 | 250 | 568 | 0.44 |
1976 | 277 | 534 | 0.52 |
1977 | 280 | 473 | 0.59 |
1978 | 292 | 453 | 0.64 |
1979 | 346 | 476 | 0.73 |
1980 | 343 | 474 | 0.72 |
1981 | 292 | 408 | 0.72 |
1982 | 310 | 424 | 0.73 |
1983 | 373 | 452 | 0.83 |
1984 | 391 | 435 | 0.90 |
1985 | 370 | 405 | 0.91 |
1986 | 374 | 397 | 0.94 |
1987 | 359 | 398 | 0.90 |
1988 | 355 | 387 | 0.92 |
1989 | 365 | 392 | 0.93 |
1990 | 359 | 376 | 0.95 |
1991 | 360 | 385 | 0.94 |
1992 | 342 | 371 | 0.92 |
1993 | 321 | 349 | 0.92 |
1994 | 306 | 345 | 0.89 |
1995 | 333 | 357 | 0.93 |
1996 | 302 | 324 | 0.93 |
1997 | 320 | 341 | 0.94 |
1998 | 312 | 346 | 0.90 |
1999 | 262 | 315 | 0.83 |
2000 | 281 | 317 | 0.89 |
2001 | 261 | 301 | 0.87 |
2002 | 269 | 295 | 0.91 |
2003 | 281 | 312 | 0.90 |
2004 | 262 | 306 | 0.86 |
2005 | 297 | 342 | 0.87 |
2006 | 295 | 363 | 0.81 |
2007 | 288 | 349 | 0.83 |
2008 | 329 | 396 | 0.83 |
2009 | 339 | 436 | 0.78 |
2010 | 369 | 483 | 0.76 |
2011 | 380 | 497 | 0.76 |
2012 | 303 | 374 | 0.81 |
2013 | 259 | 331 | 0.78 |
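The Proportion column rests on a simple fact: the mean of a logical vector is the fraction of TRUE values. A base-R sketch on made-up data (toy, toyThreshold, and propLonger are illustrative names):

```r
# Made-up song lengths (seconds) for two years.
toy <- data.frame(Year     = c(1960, 1960, 1990, 1990, 1990),
                  Time.num = c(150, 210, 240, 250, 190))
toyThreshold <- 200  # made-up threshold, in seconds

# The mean of a logical vector is the fraction of TRUE values.
propLonger <- tapply(toy$Time.num >= toyThreshold, toy$Year, mean)

propLonger  # 1960: 0.5, 1990: about 0.67
```

One difference from the ddply() version above: with missing song lengths, sum(..., na.rm = TRUE) / nrow(x) keeps the NA rows in the denominator, whereas mean(..., na.rm = TRUE) would drop them.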
We then plot the proportion of songs vs. year:
propSongsLongerThanAvgVsYear <- xyplot(Proportion ~ Year, nSongsLongerThanAvgEachYear,
main = paste("Proportion of songs with length >=", round(threshold, 1), "s (= avg. over all years)"), # round to keep the title short.
ylab = "Proportion of songs", type = "b", grid = "h")
print(propSongsLongerThanAvgVsYear)
So, we are comparing the song lengths to the average duration (over all years), which is about 3:20 minutes. Evidently, very few older songs are longer than this threshold. There is a reason for this: in the 1960s and earlier, singles were released in the so-called 45 RPM format, which had a capacity of about 3 minutes. It is thus no wonder that the left end of the graph looks the way it does, with very low proportions for each year. At the end of the 1960s, these constraints were relaxed, and this is exactly what we see in the plot; from the end of the 1960s, the proportion of songs with a duration longer than 3:20 increases. The peak was reached around 1990, when almost all songs on the chart (a proportion of 0.95) were longer than 3:20 minutes. Since then, the proportion has declined.
In this part we will investigate which songs charted the longest in the Top 100, Top 40, and Top 10, as well as which song charted longest at its highest position. We begin by considering the entire time period. We also extract some additional information belonging to the longest charting songs, e.g. artist, song title, etc.
# longest charting song in top 100 and its info:
htmlPrint(charts[which.max(charts$nWeeksChart), c("Track", "Artist", "Time",
"Date.Entered", "High", "Date.Peaked", "nWeeksChart")])
Track | Artist | Time | Date.Entered | High | Date.Peaked | nWeeksChart |
---|---|---|---|---|---|---|
I’m Yours | Jason Mraz | 4:03 | 5/3/08 | 6 | 9/20/08 | 76 |
# in top 40:
htmlPrint(charts[which.max(charts$nWeeksChartTop40), c("Track", "Artist", "Time",
"Date.Entered", "High", "Date.Peaked", "nWeeksChartTop40")])
Track | Artist | Time | Date.Entered | High | Date.Peaked | nWeeksChartTop40 |
---|---|---|---|---|---|---|
I’m Yours | Jason Mraz | 4:03 | 5/3/08 | 6 | 9/20/08 | 62 |
# in top 10:
htmlPrint(charts[which.max(charts$nWeeksChartTop10), c("Track", "Artist", "Time",
"Date.Entered", "High", "Date.Peaked", "nWeeksChartTop10")])
Track | Artist | Time | Date.Entered | High | Date.Peaked | nWeeksChartTop10 |
---|---|---|---|---|---|---|
How Do I Live | LeAnn Rimes | 4:18 | 6/21/97 | 2 | 12/13/97 | 32 |
# song that charted longest at its peak/highest position:
htmlPrint(charts[which.max(charts$nWeeksChartPeak), c("Track", "Artist", "Time",
"Date.Entered", "High", "Date.Peaked", "nWeeksChartPeak")])
Track | Artist | Time | Date.Entered | High | Date.Peaked | nWeeksChartPeak |
---|---|---|---|---|---|---|
One Sweet Day | Mariah Carey | 4:42 | 12/2/95 | 1 | 12/2/95 | 16 |
It is quite impressive what the outputs above tell us. First of all, “I'm Yours” by Jason Mraz was in the Top 100 for 76 weeks, and in the Top 40 for as long as 62 weeks, even though its highest position was only 6. More impressive, perhaps, is that “One Sweet Day” by Mariah Carey held the no. 1 position for 16 weeks; that is almost 4 months without being pushed off the pole position!
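One caveat about which.max(): it returns only the first position of the maximum, so a tie between several songs would go unnoticed. Whether ties actually occur in this dataset has not been checked here; the sketch below uses made-up week counts:

```r
weeks <- c(10, 16, 7, 16)  # made-up week counts with a tie for the maximum

firstMax <- which.max(weeks)            # only the first maximum
allMax   <- which(weeks == max(weeks))  # every position that attains the maximum

firstMax  # 2
allMax    # 2 4
```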
One can also look at earlier years, as is done in the data aggregation and plotting script file. This raises the question of whether songs are charting longer nowadays.
We can investigate this:
nWeeksInTop100vsYear <- xyplot(nWeeksChart ~ Year, subset(charts, Year > 1955),
grid = "h", main = "No. of weeks a song has charted in Top 100 vs. Year",
ylab = "No. of weeks", type = c("p", "a"), col.line = "darkorange", lwd = 3,
alpha = 0.5)
print(nWeeksInTop100vsYear)
Some songs in more recent years do seem to chart much longer (as seen by the scattered points in the top right), but the average (orange line) is decreasing.
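The orange line drawn by type = "a" joins the average of the points at each year. To read that trend off as numbers rather than from the plot, the yearly means could be computed directly. The sketch below uses a small made-up data frame (toyCharts is hypothetical) in place of subset(charts, Year > 1955).

```r
# Hypothetical toy data standing in for subset(charts, Year > 1955).
toyCharts <- data.frame(
  Year = c(1960, 1960, 1961, 1961, 1961),
  nWeeksChart = c(10, 20, 6, NA, 12)
)

# Mean number of weeks per year. The formula interface of aggregate()
# drops rows containing NA by default (na.action = na.omit).
meanWeeksByYear <- aggregate(nWeeksChart ~ Year, data = toyCharts, FUN = mean)
print(meanWeeksByYear)
```

Applied to the real data, the resulting column could be inspected or plotted on its own to check the direction of the trend.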
In this part we investigate whether/how song length is related to success on the charts; i.e. should we make our new song long or short? For this purpose we create a large data.frame through a customized function in a ddply() call:
# longest charting songs (in Top 100, 40, 10) and their lengths for each
# year:
longestChartingSongsEachYear <- ddply(charts, ~Year, function(x) {
max_nWeeksChart <- max(x$nWeeksChart)
theMax_nWeeksChart <- which.max(x$nWeeksChart)
max_nWeeksChartTop40 <- max(x$nWeeksChartTop40)
theMax_nWeeksChartTop40 <- which.max(x$nWeeksChartTop40)
max_nWeeksChartTop10 <- max(x$nWeeksChartTop10)
theMax_nWeeksChartTop10 <- which.max(x$nWeeksChartTop10)
cLevels <- c("weeksInTop100", "weeksInTop40", "weeksInTop10")
data.frame(successMeasure = factor(cLevels, levels = cLevels), nWeeks = c(max_nWeeksChart,
max_nWeeksChartTop40, max_nWeeksChartTop10), Track = c(as.character(x$Track[theMax_nWeeksChart]),
as.character(x$Track[theMax_nWeeksChartTop40]), as.character(x$Track[theMax_nWeeksChartTop10])),
Artist = c(as.character(x$Artist[theMax_nWeeksChart]), as.character(x$Artist[theMax_nWeeksChartTop40]),
as.character(x$Artist[theMax_nWeeksChartTop10])), songLength = c(x$Time.num[theMax_nWeeksChart],
x$Time.num[theMax_nWeeksChartTop40], x$Time.num[theMax_nWeeksChartTop10]))
})
write.table(longestChartingSongsEachYear, "table_longestChartingSongsEachYear.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
The data.frame is written to file in the last two lines of code above. Let's read the table back in and print it as HTML (though it is quite large):
htmlPrint(read.delim("table_longestChartingSongsEachYear.txt"))
Year | successMeasure | nWeeks | Track | Artist | songLength |
---|---|---|---|---|---|
1950 | weeksInTop100 | 27 | The Third Man Theme | Anton Karas | 131 |
1950 | weeksInTop40 | | The Third Man Theme | Anton Karas | 131 |
1950 | weeksInTop10 | | The Third Man Theme | Anton Karas | 131 |
1951 | weeksInTop100 | 34 | Be My Love | Mario Lanza | 208 |
1951 | weeksInTop40 | | Be My Love | Mario Lanza | 208 |
1951 | weeksInTop10 | | Be My Love | Mario Lanza | 208 |
1952 | weeksInTop100 | 38 | Blue Tango | Leroy Anderson & His Pops Concert Orchestra | 171 |
1952 | weeksInTop40 | | Blue Tango | Leroy Anderson & His Pops Concert Orchestra | 171 |
1952 | weeksInTop10 | | Blue Tango | Leroy Anderson & His Pops Concert Orchestra | 171 |
1953 | weeksInTop100 | 31 | Vaya Con Dios (May God Be With You) | Les Paul & Mary Ford | 171 |
1953 | weeksInTop40 | | Vaya Con Dios (May God Be With You) | Les Paul & Mary Ford | 171 |
1953 | weeksInTop10 | | Vaya Con Dios (May God Be With You) | Les Paul & Mary Ford | 171 |
1954 | weeksInTop100 | 1 | Oh, That’ll be Joyful | Four Lads | 155 |
1954 | weeksInTop40 | | Oh, That’ll be Joyful | Four Lads | 155 |
1954 | weeksInTop10 | | Oh, That’ll be Joyful | Four Lads | 155 |
1955 | weeksInTop100 | 27 | Melody Of Love | Billy Vaughn | 175 |
1955 | weeksInTop40 | | Melody Of Love | Billy Vaughn | 175 |
1955 | weeksInTop10 | | Cherry Pink And Apple Blossom White | Perez Prado and His Orchestra | 176 |
1956 | weeksInTop100 | 31 | Canadian Sunset | Hugo Winterhalter | 170 |
1956 | weeksInTop40 | 24 | Lisbon Antigua | Nelson Riddle and His Orchestra | 153 |
1956 | weeksInTop10 | 21 | Don’t Be Cruel | Elvis Presley | 123 |
1957 | weeksInTop100 | 39 | Wonderful! Wonderful! | Johnny Mathis | 167 |
1957 | weeksInTop40 | | So Rare | Jimmy Dorsey | 150 |
1957 | weeksInTop10 | | So Rare | Jimmy Dorsey | 150 |
1958 | weeksInTop100 | 30 | All The Way | Frank Sinatra | 170 |
1958 | weeksInTop40 | | Chantilly Lace | Big Bopper | 140 |
1958 | weeksInTop10 | | Patricia | Perez Prado and His Orchestra | 138 |
1959 | weeksInTop100 | 26 | Mack The Knife | Bobby Darin | 184 |
1959 | weeksInTop40 | 22 | Mack The Knife | Bobby Darin | 184 |
1959 | weeksInTop10 | 16 | Mack The Knife | Bobby Darin | 184 |
1960 | weeksInTop100 | 27 | Running Bear | Johnny Preston | 153 |
1960 | weeksInTop40 | 20 | He’ll Have To Go | Jim Reeves | 136 |
1960 | weeksInTop10 | 12 | The Theme From A Summer Place | Percy Faith | 144 |
1961 | weeksInTop100 | 26 | Moon River | Henry Mancini, His Orchestra and Chorus | 161 |
1961 | weeksInTop40 | 18 | Exodus | Ferrante and Teicher | 174 |
1961 | weeksInTop10 | 12 | Tossin' And Turnin' | Bobby Lewis | 150 |
1962 | weeksInTop100 | 23 | Limbo Rock | Chubby Checker | 142 |
1962 | weeksInTop40 | 18 | The Twist | Chubby Checker | 152 |
1962 | weeksInTop10 | 13 | The Twist | Chubby Checker | 152 |
1963 | weeksInTop100 | 20 | Up On The Roof | Drifters, The | 154 |
1963 | weeksInTop40 | 13 | Sugar Shack | Jimmy Gilmer | 121 |
1963 | weeksInTop10 | 10 | Sugar Shack | Jimmy Gilmer | 121 |
1964 | weeksInTop100 | 22 | Hello, Dolly! | Louis Armstrong | 142 |
1964 | weeksInTop40 | 19 | Hello, Dolly! | Louis Armstrong | 142 |
1964 | weeksInTop10 | 13 | Hello, Dolly! | Louis Armstrong | 142 |
1965 | weeksInTop100 | 18 | Wooly Bully | Sam the Sham and the Pharaohs | 140 |
1965 | weeksInTop40 | 14 | Wooly Bully | Sam the Sham and the Pharaohs | 140 |
1965 | weeksInTop10 | 10 | I Can’t Help Myself | Four Tops | 163 |
1966 | weeksInTop100 | 21 | Born Free | Roger Williams | 142 |
1966 | weeksInTop40 | 14 | Devil With A Blue Dress On & Good Golly Miss Molly | Mitch Ryder | 194 |
1966 | weeksInTop10 | 12 | I’m A Believer | Monkees, The | 161 |
1967 | weeksInTop100 | 18 | Boogaloo Down Broadway | Fantastic Johnny C, The | 161 |
1967 | weeksInTop40 | 15 | To Sir With Love | Lulu | 164 |
1967 | weeksInTop10 | 10 | Daydream Believer | Monkees, The | 177 |
1968 | weeksInTop100 | 26 | Sunshine Of Your Love | Cream, The | 183 |
1968 | weeksInTop40 | 19 | Hey Jude | Beatles, The | 431 |
1968 | weeksInTop10 | 14 | Hey Jude | Beatles, The | 431 |
1969 | weeksInTop100 | 22 | Sugar, Sugar | Archies, The | 168 |
1969 | weeksInTop40 | | Sugar, Sugar | Archies, The | 168 |
1969 | weeksInTop10 | | Sugar, Sugar | Archies, The | 168 |
1970 | weeksInTop100 | 23 | Yellow River | Christie | 160 |
1970 | weeksInTop40 | | Raindrops Keep Fallin' On My Head | B.J. Thomas | 182 |
1970 | weeksInTop10 | | Raindrops Keep Fallin' On My Head | B.J. Thomas | 182 |
1971 | weeksInTop100 | 26 | I’ve Found Someone Of My Own | Free Movement, The | 225 |
1971 | weeksInTop40 | | Knock Three Times | Dawn | 176 |
1971 | weeksInTop10 | | Joy To The World | Three Dog Night | 197 |
1972 | weeksInTop100 | 22 | I Am Woman | Helen Reddy | 188 |
1972 | weeksInTop40 | | American Pie (Parts 1 and 2) | Don McLean | 516 |
1972 | weeksInTop10 | | The First Time Ever I Saw Your Face | Roberta Flack | 255 |
1973 | weeksInTop100 | 38 | Why Me | Kris Kristofferson | 205 |
1973 | weeksInTop40 | | Why Me | Kris Kristofferson | 205 |
1973 | weeksInTop10 | | Let’s Get It On | Marvin Gaye | 238 |
1974 | weeksInTop100 | 28 | One Hell Of A Woman | Mac Davis | 172 |
1974 | weeksInTop40 | | Come And Get Your Love | Redbone | 210 |
1974 | weeksInTop10 | | The Way We Were | Barbra Streisand | 209 |
1975 | weeksInTop100 | 32 | Feelings | Morris Albert | 226 |
1975 | weeksInTop40 | | Rhinestone Cowboy | Glen Campbell | 188 |
1975 | weeksInTop10 | | One Of These Nights | Eagles | 208 |
1976 | weeksInTop100 | 28 | A Fifth Of Beethoven | Walter Murphy | 182 |
1976 | weeksInTop40 | | A Fifth Of Beethoven | Walter Murphy | 182 |
1976 | weeksInTop10 | | Tonight’s The Night (Gonna Be Alright) | Rod Stewart | 235 |
1977 | weeksInTop100 | 33 | How Deep Is Your Love | Bee Gees | 210 |
1977 | weeksInTop40 | | How Deep Is Your Love | Bee Gees | 210 |
1977 | weeksInTop10 | | How Deep Is Your Love | Bee Gees | 210 |
1978 | weeksInTop100 | 40 | I Go Crazy | Paul Davis | 234 |
1978 | weeksInTop40 | | I Go Crazy | Paul Davis | 234 |
1978 | weeksInTop10 | | Le Freak | Chic | 210 |
1979 | weeksInTop100 | 27 | I Will Survive | Gloria Gaynor | 195 |
1979 | weeksInTop40 | | Pop Muzik | M | 200 |
1979 | weeksInTop10 | | Hot Stuff | Donna Summer | 227 |
1980 | weeksInTop100 | 31 | Another One Bites The Dust | Queen | 212 |
1980 | weeksInTop40 | | Do That To Me One More Time | Captain and Tennille | 229 |
1980 | weeksInTop10 | | Another One Bites The Dust | Queen | 212 |
1981 | weeksInTop100 | 32 | Jessie’s Girl | Rick Springfield | 194 |
1981 | weeksInTop40 | 22 | Jessie’s Girl | Rick Springfield | 194 |
1981 | weeksInTop10 | 15 | Physical | Olivia Newton-John | 223 |
1982 | weeksInTop100 | 43 | Tainted Love | Soft Cell | 158 |
1982 | weeksInTop40 | 22 | Hurts So Good | John Cougar | 215 |
1982 | weeksInTop10 | 16 | Hurts So Good | John Cougar | 215 |
1983 | weeksInTop100 | 32 | Baby, Come To Me | Patti Austin | 210 |
1983 | weeksInTop40 | 21 | You And I | Eddie Rabbitt | 238 |
1983 | weeksInTop10 | 14 | Flashdance…What A Feeling | Irene Cara | 235 |
1984 | weeksInTop100 | 30 | Borderline | Madonna | 238 |
1984 | weeksInTop40 | | What’s Love Got To Do With It | Tina Turner | 229 |
1984 | weeksInTop10 | | When Doves Cry | Prince | 229 |
1985 | weeksInTop100 | 29 | I Miss You | Klymaxx | 244 |
1985 | weeksInTop40 | | Careless Whisper | Wham! | 300 |
1985 | weeksInTop10 | | Say You, Say Me | Lionel Richie | 239 |
1986 | weeksInTop100 | 27 | Something About You | Level 42 | 228 |
1986 | weeksInTop40 | 17 | That’s What Friends Are For | Dionne & Friends | 254 |
1986 | weeksInTop10 | 10 | That’s What Friends Are For | Dionne & Friends | 254 |
1987 | weeksInTop100 | 30 | In My Dreams | REO Speedwagon | 260 |
1987 | weeksInTop40 | 16 | Shake You Down | Gregory Abbott | 244 |
1987 | weeksInTop10 | 9 | Faith | George Michael | 194 |
1988 | weeksInTop100 | 30 | I’ll Always Love You | Taylor Dayne | 258 |
1988 | weeksInTop40 | 17 | Need You Tonight | INXS | 181 |
1988 | weeksInTop10 | 8 | Every Rose Has Its Thorn | Poison | 260 |
1989 | weeksInTop100 | 39 | Bust A Move | Young M.C. | 260 |
1989 | weeksInTop40 | 20 | Bust A Move | Young M.C. | 260 |
1989 | weeksInTop10 | 10 | Another Day In Paradise | Phil Collins | 284 |
1990 | weeksInTop100 | 30 | Close To You | Maxi Priest | 235 |
1990 | weeksInTop40 | 19 | From A Distance | Bette Midler | 275 |
1990 | weeksInTop10 | 10 | Because I Love You (The Postman Song) | Stevie B | 255 |
1991 | weeksInTop100 | 29 | High Enough | Damn Yankees | 250 |
1991 | weeksInTop40 | 20 | Emotions | Mariah Carey | 249 |
1991 | weeksInTop10 | 11 | (Everything I Do) I Do It For You | Bryan Adams | 243 |
1992 | weeksInTop100 | 37 | Just Another Day | Jon Secada | 251 |
1992 | weeksInTop40 | | Just Another Day | Jon Secada | 251 |
1992 | weeksInTop10 | | End Of The Road | Boyz II Men | 350 |
1993 | weeksInTop100 | 45 | Whoomp! There It Is | Tag Team | 267 |
1993 | weeksInTop40 | | Whoomp! There It Is | Tag Team | 267 |
1993 | weeksInTop10 | | Whoomp! There It Is | Tag Team | 267 |
1994 | weeksInTop100 | 45 | Another Night | Real McCoy | 233 |
1994 | weeksInTop40 | | Another Night | Real McCoy | 233 |
1994 | weeksInTop10 | | Another Night | Real McCoy | 233 |
1995 | weeksInTop100 | 49 | Run-Around | Blues Traveler | 252 |
1995 | weeksInTop40 | | Gangsta’s Paradise | Coolio | 240 |
1995 | weeksInTop10 | | Gangsta’s Paradise | Coolio | 240 |
1996 | weeksInTop100 | 60 | Macarena (Bayside Boys Mix) | Los Del Rio | 234 |
1996 | weeksInTop40 | | You’re Makin' Me High | Toni Braxton | 269 |
1996 | weeksInTop10 | | Un-Break My Heart | Toni Braxton | 264 |
1997 | weeksInTop100 | 69 | How Do I Live | LeAnn Rimes | 258 |
1997 | weeksInTop40 | | How Do I Live | LeAnn Rimes | 258 |
1997 | weeksInTop10 | | How Do I Live | LeAnn Rimes | 258 |
1998 | weeksInTop100 | 56 | I Don’t Want To Wait | Paula Cole | 247 |
1998 | weeksInTop40 | 52 | Truly Madly Deeply | Savage Garden | 279 |
1998 | weeksInTop10 | 26 | Truly Madly Deeply | Savage Garden | 279 |
1999 | weeksInTop100 | 58 | Smooth | Santana | 244 |
1999 | weeksInTop40 | 50 | Smooth | Santana | 244 |
1999 | weeksInTop10 | 30 | Smooth | Santana | 244 |
2000 | weeksInTop100 | 57 | Higher | Creed | 316 |
2000 | weeksInTop40 | 43 | Amazed | Lonestar | 265 |
2000 | weeksInTop10 | 19 | Everything You Want | Vertical Horizon | 241 |
2001 | weeksInTop100 | 56 | The Way You Love Me | Faith Hill | 186 |
2001 | weeksInTop40 | 45 | Hanging By A Moment | Lifehouse | 213 |
2001 | weeksInTop10 | 23 | How You Remind Me | Nickelback | 223 |
2002 | weeksInTop100 | 45 | Wherever You Will Go | Calling, The | 208 |
2002 | weeksInTop40 | 40 | Wherever You Will Go | Calling, The | 208 |
2002 | weeksInTop10 | 19 | Dilemma | Nelly | 287 |
2003 | weeksInTop100 | 54 | Unwell | matchbox twenty | 228 |
2003 | weeksInTop40 | 42 | Here Without You | 3 Doors Down | 238 |
2003 | weeksInTop10 | 17 | Hey Ya! | OutKast | 249 |
2004 | weeksInTop100 | 50 | Someday | Nickelback | 207 |
2004 | weeksInTop40 | 41 | Yeah! | Usher | 250 |
2004 | weeksInTop10 | 24 | Yeah! | Usher | 250 |
2005 | weeksInTop100 | 62 | You And Me | Lifehouse | 195 |
2005 | weeksInTop40 | 44 | You And Me | Lifehouse | 195 |
2005 | weeksInTop10 | 23 | We Belong Together | Mariah Carey | 201 |
2006 | weeksInTop100 | 58 | How To Save A Life | Fray, The | 261 |
2006 | weeksInTop40 | 36 | How To Save A Life | Fray, The | 261 |
2006 | weeksInTop10 | 19 | How To Save A Life | Fray, The | 261 |
2007 | weeksInTop100 | 64 | Before He Cheats | Carrie Underwood | 200 |
2007 | weeksInTop40 | 53 | Before He Cheats | Carrie Underwood | 200 |
2007 | weeksInTop10 | 25 | Apologize | Timbaland | 184 |
2008 | weeksInTop100 | 76 | I’m Yours | Jason Mraz | 243 |
2008 | weeksInTop40 | 62 | I’m Yours | Jason Mraz | 243 |
2008 | weeksInTop10 | 23 | Low | Flo Rida | 232 |
2009 | weeksInTop100 | 57 | Use Somebody | Kings Of Leon | 230 |
2009 | weeksInTop40 | 47 | I Gotta Feeling | Black Eyed Peas, The | 289 |
2009 | weeksInTop10 | 24 | Down | Jay Sean | 212 |
2010 | weeksInTop100 | 60 | Need You Now | Lady Antebellum | 237 |
2010 | weeksInTop40 | 50 | Need You Now | Lady Antebellum | 237 |
2010 | weeksInTop10 | 22 | Just The Way You Are | Bruno Mars | 220 |
2011 | weeksInTop100 | 68 | Party Rock Anthem | LMFAO | 263 |
2011 | weeksInTop40 | 53 | Rolling In The Deep | Adele | 228 |
2011 | weeksInTop10 | 29 | Party Rock Anthem | LMFAO | 263 |
2012 | weeksInTop100 | 62 | Ho Hey | Lumineers, The | 163 |
2012 | weeksInTop40 | 44 | Somebody That I Used To Know | Gotye | 244 |
2012 | weeksInTop10 | 24 | Somebody That I Used To Know | Gotye | 244 |
2013 | weeksInTop100 | 55 | Radioactive | Imagine Dragons | 187 |
2013 | weeksInTop40 | 41 | Thrift Shop | Macklemore | 235 |
2013 | weeksInTop10 | 21 | Thrift Shop | Macklemore | 235 |
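Several rows in the table above list a track but no nWeeks value. One plausible explanation, which is an assumption and has not been verified against the data, is that some songs in those years have missing week counts: max() returns NA as soon as any input is NA, whereas which.max() skips missing values and still picks a row. A minimal sketch with a made-up vector:

```r
# Hypothetical week counts for the songs of one year, one of them missing.
weeks <- c(14, NA, 22, 9)

maxDefault <- max(weeks)             # missing values propagate to the result
maxNoNA <- max(weeks, na.rm = TRUE)  # missing values are dropped first
whichRow <- which.max(weeks)         # which.max() ignores missing values

print(c(maxDefault = maxDefault, maxNoNA = maxNoNA, whichRow = whichRow))
```

If missing values are indeed the cause, passing na.rm = TRUE to the max() calls inside the ddply() function would presumably fill in these cells.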
What's more interesting, perhaps, is to make a plot:
nWeeksOnChartsVsSongLength <- xyplot(nWeeks ~ songLength, longestChartingSongsEachYear,
group = successMeasure, main = "Longest charting songs (3 different measures) for each year vs. song length",
xlab = "Song length (seconds)", ylab = "No. of weeks", grid = "h", auto.key = list(columns = 3))
print(nWeeksOnChartsVsSongLength)
pdf("plot_nWeeksOnChartsVsSongLength.pdf")
print(nWeeksOnChartsVsSongLength)
dev.off()
## pdf
## 2
Finally, we can calculate the average song length based on the longest charting songs; this would give us the “best” song length for success:
bestSongLengthForSucess <- mean(longestChartingSongsEachYear$songLength)
sprintf("Best song length for staying on charts = %4.2f seconds.", bestSongLengthForSucess)
## [1] "Best song length for staying on charts = 213.27 seconds."
So, if we were to make a song and wanted it to last long on the charts, a good starting point might be a length of about 213 seconds, i.e. roughly 3 minutes and 33 seconds.
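Note that the overall mean counts a song once for every success measure it tops in a given year, so a song such as “How Do I Live” in 1997, which leads all three lists, contributes three times. Two possible alternatives are a mean per success measure and a mean over unique songs. The sketch below illustrates both on made-up data (toyLongest is hypothetical and only mimics the layout of longestChartingSongsEachYear).

```r
# Hypothetical toy version of longestChartingSongsEachYear.
toyLongest <- data.frame(
  Year = c(2000, 2000, 2000, 2001, 2001, 2001),
  successMeasure = rep(c("weeksInTop100", "weeksInTop40", "weeksInTop10"), 2),
  Track = c("Song A", "Song A", "Song A", "Song B", "Song C", "Song C"),
  songLength = c(240, 240, 240, 180, 210, 210),
  stringsAsFactors = FALSE
)

# Mean song length separately for each success measure.
meanByMeasure <- tapply(toyLongest$songLength, toyLongest$successMeasure, mean)

# Mean over unique (Year, Track) pairs, so each song counts once per year.
uniqueSongs <- unique(toyLongest[, c("Year", "Track", "songLength")])
meanUnique <- mean(uniqueSongs$songLength)

print(meanByMeasure)
print(meanUnique)
```

Which of these is the more sensible "best" length depends on what one means by success; the sketch only shows that the choice can change the number.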
Here, we look at the number and the proportion of songs that have charted for at least a certain number of weeks (the benchmark, set to 10 by default) in the Top 100. We do this for every year from 1956 onwards.
benchmark <- 10
nSongsChartLongerEachYear <- ddply(subset(charts, Year > 1955), ~Year, function(x) {
count <- sum(x$nWeeksChart >= benchmark, na.rm = TRUE)
total <- nrow(x)
prop <- count/total
data.frame(Count = count, Total = total, Proportion = prop)
})
# write table to file:
write.table(nSongsChartLongerEachYear, "table_nSongsChartLongerEachYear.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
# print HTML table:
htmlPrint(nSongsChartLongerEachYear, digits = 2)
Year | Count | Total | Proportion |
---|---|---|---|
1956 | 255 | 505 | 0.50 |
1957 | 252 | 496 | 0.51 |
1958 | 249 | 530 | 0.47 |
1959 | 252 | 576 | 0.44 |
1960 | 251 | 602 | 0.42 |
1961 | 210 | 681 | 0.31 |
1962 | 243 | 676 | 0.36 |
1963 | 225 | 658 | 0.34 |
1964 | 203 | 718 | 0.28 |
1965 | 198 | 717 | 0.28 |
1966 | 180 | 743 | 0.24 |
1967 | 178 | 739 | 0.24 |
1968 | 197 | 686 | 0.29 |
1969 | 222 | 672 | 0.33 |
1970 | 220 | 653 | 0.34 |
1971 | 242 | 635 | 0.38 |
1972 | 252 | 591 | 0.43 |
1973 | 262 | 536 | 0.49 |
1974 | 241 | 496 | 0.49 |
1975 | 239 | 568 | 0.42 |
1976 | 223 | 534 | 0.42 |
1977 | 233 | 473 | 0.49 |
1978 | 263 | 453 | 0.58 |
1979 | 251 | 476 | 0.53 |
1980 | 271 | 474 | 0.57 |
1981 | 240 | 408 | 0.59 |
1982 | 260 | 424 | 0.61 |
1983 | 267 | 452 | 0.59 |
1984 | 263 | 435 | 0.60 |
1985 | 276 | 405 | 0.68 |
1986 | 268 | 397 | 0.68 |
1987 | 275 | 398 | 0.69 |
1988 | 269 | 387 | 0.70 |
1989 | 263 | 392 | 0.67 |
1990 | 265 | 376 | 0.70 |
1991 | 274 | 385 | 0.71 |
1992 | 264 | 371 | 0.71 |
1993 | 256 | 349 | 0.73 |
1994 | 249 | 345 | 0.72 |
1995 | 256 | 357 | 0.72 |
1996 | 247 | 324 | 0.76 |
1997 | 255 | 341 | 0.75 |
1998 | 240 | 346 | 0.69 |
1999 | 250 | 315 | 0.79 |
2000 | 242 | 317 | 0.76 |
2001 | 257 | 301 | 0.85 |
2002 | 242 | 295 | 0.82 |
2003 | 247 | 312 | 0.79 |
2004 | 245 | 306 | 0.80 |
2005 | 240 | 342 | 0.70 |
2006 | 219 | 363 | 0.60 |
2007 | 223 | 349 | 0.64 |
2008 | 239 | 396 | 0.60 |
2009 | 223 | 436 | 0.51 |
2010 | 227 | 483 | 0.47 |
2011 | 221 | 497 | 0.44 |
2012 | 204 | 374 | 0.55 |
2013 | 170 | 331 | 0.51 |
# plot:
propSongsChartLongerVsYear <- xyplot(Proportion ~ Year, nSongsChartLongerEachYear,
main = paste("Proportion of songs charting in Top 100 \n longer than",
benchmark, "weeks vs. Year"), ylab = "Proportion", type = "b", grid = "h")
print(propSongsChartLongerVsYear)
# write plot to file:
pdf("plot_propSongsChartLongerVsYear.pdf")
print(propSongsChartLongerVsYear)
dev.off()
## pdf
## 2
This is quite interesting to see: the proportion of songs that charted in the Top 100 for at least 10 weeks fell from the 1950s to the end of the 1960s. Afterwards, we observe an overall increase lasting all the way into the 2000s, peaking at 0.85 in 2001. Over the last 10 or so years the trend is decreasing again, however, indicating that songs stay in the Top 100 for fewer and fewer weeks.
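Since the benchmark of 10 weeks is somewhat arbitrary, one could check how sensitive the proportions are to that choice. The sketch below does this in base R on made-up data; toyCharts, propAtLeast, and the benchmark values are all hypothetical and chosen only for illustration.

```r
# Hypothetical toy data; only the column names mirror the cleaned dataset.
toyCharts <- data.frame(
  Year = c(1990, 1990, 1990, 1991, 1991, 1991, 1991),
  nWeeksChart = c(5, 12, 20, 8, 10, 15, NA)
)

# Proportion of songs per year charting at least `benchmark` weeks.
# Songs with a missing week count stay in the denominator, mirroring
# the use of nrow(x) as the total in the ddply() call in the text.
propAtLeast <- function(data, benchmark) {
  sapply(split(data$nWeeksChart, data$Year), function(weeks) {
    sum(weeks >= benchmark, na.rm = TRUE) / length(weeks)
  })
}

# One column of proportions per benchmark, one row per year.
benchmarks <- c(10, 15, 20)
propTable <- sapply(benchmarks, function(b) propAtLeast(toyCharts, b))
colnames(propTable) <- paste0("atLeast", benchmarks)
print(round(propTable, 2))
```

If the rise and fall described above shows up for several benchmarks on the real data, the conclusion does not hinge on the value 10.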
In the data aggregation and plotting script, data_aggregatePlot.R, we have also considered the variable ScorePoints, which contains a certain score (see the beginning sections for details) for each song. We omit it here for the sake of brevity (though it seems we are a far cry from brevity with this report…) and merely mention it.
As mentioned in the beginning, the dataset considered here provides an almost endless array of opportunities for data aggregation and visualization. The only limiting factors are the finite amount of time we have available in our lives and the dataset's shortcomings with regard to, e.g., missing observations.
One thing we would have liked to look at was the Genre variable/column. This would be a great factor on which to condition, e.g. to turn some of the plots above into nice multi-panel plots. However, Genre is available for only very few observations in the dataset.
There are other cool analyses and visualizations of the data in the Whitburn Project:
Regarding code externalization, I have tried to read code chunks from my R scripts into this R Markdown document, but without success so far. It seems most of the Internet resources on this topic deal with .Rnw files, which use a slightly different syntax for code chunks than .Rmd files do. My failed attempt is below. The referenced code chunk, labelled my-label, is at the bottom of the data_aggregatePlot.R script.
read_chunk('data_aggregatePlot.R')
<<my-label>>=
@
However, in the analysis and visualizations above, I have embedded pre-existing figures into this R Markdown document as well as imported pre-existing data from files and worked with it.
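Regarding the externalization attempt above, my understanding of the knitr documentation is that the .Rmd route needs two things: a label comment of the form `## ---- my-label ----` in the script, and an empty chunk with the header {r my-label} in the document, placed after a call to read_chunk(). The sketch below is untested against the actual script; it writes a tiny stand-in file (its contents are placeholders) and guards the knitr call in case the package is not installed.

```r
# Write a tiny stand-in for data_aggregatePlot.R to a temporary file.
# The "## ---- my-label ----" comment marks the start of a named chunk.
script <- tempfile(fileext = ".R")
writeLines(c(
  "## ---- my-label ----",
  "exampleValue <- 1 + 1  # placeholder code, not from the original script"
), script)

# In the .Rmd, an ordinary R chunk would register the script like this
# (skipped here when knitr is not installed):
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::read_chunk(script)
}

# A later, empty chunk with the header {r my-label} would then be filled
# with the code stored under that label. Here we only list the labels
# that the script declares.
labelLines <- grep("^## ---- .* ----$", readLines(script), value = TRUE)
print(labelLines)
```

The noweb-style `<<my-label>>=` and `@` delimiters belong to .Rnw documents, which may be why the attempt above did not work in .Rmd.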
As of Monday, Oct 21, just before the deadline, I am also working on a Git repository.
UPDATE on Monday, Oct 21: The Git repository is up and running!