Metallica is an American heavy metal band formed in 1981 by James Hetfield and Lars Ulrich. As pioneers of thrash and members of the “Big Four,” they have sold more than 100 million records worldwide and earned multiple Grammy Awards. The current lineup features Hetfield (vocals, rhythm guitar), Ulrich (drums), Kirk Hammett (lead guitar), and Robert Trujillo (bass).
All raw data sources: metallica.com and the official Metallica YouTube channel.
Next, we outline the pipeline for obtaining the data, applying preprocessing steps, and deriving the relevant features.
The dataset comprises 136 songs from 13 Metallica studio albums (1983–2025). Audio comes from official releases, metadata (year, band, album, track, song, duration, bpm) from public discographies, and lyrics from official sources.
Each track is decoded to WAV via FFmpeg, down-mixed to mono, and amplitude-normalized. Frames use a Hann window (\(46\) ms) with a \(23\) ms hop. For frame \(i\): samples \(x_i[m]\), \(m=0,\dots,L-1\), with frame length \(L\); DFT \(X_i[k]\); magnitudes \(M_i[k]=|X_i[k]|\) for bins \(k=1,\dots,K\); frequencies \(f_k\); and a small \(\varepsilon>0\) for numerical stability.
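As a concrete illustration of this step, here is a minimal Python sketch of the framing and DFT stage, assuming the track has already been decoded and down-mixed by FFmpeg; the file name, sample rate, and variable names (`frames`, `M`, `freqs`, `eps`) are illustrative, not part of the original pipeline.

```python
import numpy as np
from scipy.io import wavfile

# Assumed input: a mono WAV already produced by FFmpeg, e.g.
#   ffmpeg -i track.flac -ac 1 -ar 44100 track.wav
# (the file name and sample rate are illustrative).
sr, x = wavfile.read("track.wav")
x = x.astype(np.float64)
x /= np.max(np.abs(x)) + 1e-12             # amplitude normalization

frame_len = int(round(0.046 * sr))         # ~46 ms Hann window
hop = int(round(0.023 * sr))               # ~23 ms hop
window = np.hanning(frame_len)

frames, mags = [], []
for start in range(0, len(x) - frame_len + 1, hop):
    xi = x[start:start + frame_len] * window
    Xi = np.fft.rfft(xi)                   # DFT of frame i
    frames.append(xi)
    mags.append(np.abs(Xi))                # magnitudes M_i[k]

frames = np.array(frames)                  # shape (num_frames, L)
M = np.array(mags)                         # shape (num_frames, K+1), bin 0 = DC
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)   # bin frequencies f_k
eps = 1e-10                                # small epsilon for numerical stability
```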
Per song we compute four per-frame metrics, smooth each with a cubic spline, and optionally \(z\)-score; a code sketch follows the four definitions below:
Loudness: \[ \mathrm{RMS}_i=\sqrt{\tfrac{1}{L}\sum_{m=0}^{L-1}x_i[m]^2}, \qquad \log\mathrm{RMS}_i=\log(\mathrm{RMS}_i+\varepsilon). \]
Higher values indicate louder passages with more acoustic energy; lower values correspond to quieter sections or silence.
Spectral brightness: \[ \displaystyle \mathrm{logSC}_i=\frac{\sum_{k=1}^K \log(f_k)\,M_i[k]}{\sum_{k=1}^K M_i[k]+\varepsilon}. \]
Higher values reflect brighter, treble-rich timbres (e.g., cymbals, distorted guitars), while lower values indicate darker or bass-heavy sounds.
Spectral flatness: \[ \displaystyle \mathrm{SFM}_i=\frac{\exp\!\big(\tfrac{1}{K}\sum_{k=1}^K \log(M_i[k]+\varepsilon)\big)}{\tfrac{1}{K}\sum_{k=1}^K M_i[k]+\varepsilon},\qquad \mathrm{logitSFM}_i=\log\!\left(\tfrac{\mathrm{SFM}_i}{1-\mathrm{SFM}_i}\right). \]
Higher values suggest noise-like or texture-like content with flatter spectra; lower values point to tonal, pitched, or harmonic sounds.
Onset/rhythmic activity: \[ \displaystyle \mathrm{Flux}_i=\sum_{k=1}^K \max\!\big(M_i[k]-M_{i-1}[k],0\big),\qquad \log\mathrm{Flux}_i=\log(\mathrm{Flux}_i+\varepsilon). \]
Higher values capture frequent or strong spectral changes, often aligned with beats or accents; lower values correspond to sustained tones, smoother textures, or silence.
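The four metrics can be computed per frame roughly as follows; this sketch reuses the `frames`, `M`, `freqs`, and `eps` arrays from the framing sketch above, and the small guard added to the logit denominator is a numerical convenience rather than part of the definitions.

```python
import numpy as np  # continues the framing sketch above

# Loudness: RMS of each windowed frame, then log-compressed.
rms = np.sqrt(np.mean(frames ** 2, axis=1))
log_rms = np.log(rms + eps)

# Spectral brightness: log-frequency-weighted centroid (DC bin dropped so log(f_k) is defined).
Mk, fk = M[:, 1:], freqs[1:]
log_sc = (Mk @ np.log(fk)) / (Mk.sum(axis=1) + eps)

# Spectral flatness: geometric over arithmetic mean of the magnitudes, then its logit.
sfm = np.exp(np.mean(np.log(Mk + eps), axis=1)) / (np.mean(Mk, axis=1) + eps)
logit_sfm = np.log(sfm / (1.0 - sfm + eps))

# Onset/rhythmic activity: half-wave-rectified spectral flux between consecutive frames
# (one value fewer than the other metrics, since frame 0 has no predecessor).
flux = np.maximum(np.diff(Mk, axis=0), 0.0).sum(axis=1)
log_flux = np.log(flux + eps)
```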
For each metric \(y_s(t)\) we retain three representations: the raw series, a smoothed version \(\tilde y_s(t)\) obtained using cubic smoothing splines with smoothing parameter \(0.5\), and a standardized series \(z_s(t)=(\tilde y_s(t)-\text{mean}_s)/\text{sd}_s\). All frames are indexed on a normalized time scale \(t\in[0,1]\).
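A minimal sketch of the three representations for one series (here the loudness series from the sketch above); note that scipy's spline smoothing argument is parameterized differently from the smoothing parameter \(0.5\) quoted in the text, so the value used below is only illustrative.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def three_representations(y):
    """Return the raw, spline-smoothed, and z-scored versions of one per-frame series."""
    t = np.linspace(0.0, 1.0, len(y))                 # normalized time in [0, 1]
    # scipy's `s` is not the same scale as the smoothing parameter 0.5 in the text;
    # the value below is an illustrative choice.
    spline = UnivariateSpline(t, y, k=3, s=0.5 * len(y) * np.var(y))
    y_smooth = spline(t)
    z = (y_smooth - y_smooth.mean()) / (y_smooth.std(ddof=1) + 1e-12)
    return y, y_smooth, z

raw, smooth, z = three_representations(log_rms)       # e.g. the loudness series
```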
Lyrics are tokenized into words, lowercased, stripped of stopwords, and accent-folded to ASCII. We compute: Bing coverage and negative proportion; NRC emotion proportions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust); a sentiment arc (mean, SD, slope, range); lexical richness (type–token ratio, hapax proportion); and NRC VAD means (valence, arousal, dominance) with coverage.
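A hedged sketch of a few of these lyric features is given below; the stopword list and Bing lexicon shown are tiny placeholders (the real resources are loaded separately), and the NRC emotion and VAD proportions would follow the same counting pattern with their respective lexicons.

```python
import re
from collections import Counter
from unicodedata import normalize
import numpy as np

# Illustrative placeholders only; the full stopword list and Bing lexicon are loaded elsewhere.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}
BING = {"love": "positive", "free": "positive", "hate": "negative", "kill": "negative"}

def lyric_features(text: str) -> dict:
    ascii_text = normalize("NFKD", text).encode("ascii", "ignore").decode()
    tokens = [w for w in re.findall(r"[a-z']+", ascii_text.lower()) if w not in STOPWORDS]
    counts = Counter(tokens)

    # Sentiment arc: per-token Bing polarity (+1/-1) in song order.
    arc = np.array([1.0 if BING[w] == "positive" else -1.0 for w in tokens if w in BING])
    slope = float(np.polyfit(np.linspace(0, 1, len(arc)), arc, 1)[0]) if len(arc) > 1 else 0.0

    return {
        "bing_coverage": len(arc) / max(len(tokens), 1),
        "bing_negative_prop": float(np.mean(arc < 0)) if len(arc) else 0.0,
        "arc_mean": float(arc.mean()) if len(arc) else 0.0,
        "arc_sd": float(arc.std(ddof=1)) if len(arc) > 1 else 0.0,
        "arc_slope": slope,
        "arc_range": float(arc.max() - arc.min()) if len(arc) else 0.0,
        "type_token_ratio": len(counts) / max(len(tokens), 1),
        "hapax_proportion": sum(c == 1 for c in counts.values()) / max(len(counts), 1),
    }
```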
We retain year, band, album, track, song, duration, and bpm per track (duration is in seconds; bpm is the predominant tempo).
Each song curve \(y_i(t)\) is
interpolated onto a uniform grid
\[
t = \{t_1,\dots,t_M\}, \quad t_m \in [0,1],
\]
so that all songs have aligned time points.
For each song \(i\), the resampled
curve is standardized:
\[
\tilde y_i(t_m) = \frac{y_i(t_m)-\mu_i}{\sigma_i},
\qquad
\mu_i = \frac{1}{M}\sum_{m=1}^M y_i(t_m), \quad
\sigma_i^2 = \frac{1}{M-1}\sum_{m=1}^M \big(y_i(t_m)-\mu_i\big)^2.
\]
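A minimal sketch of the resampling and standardization step; the grid size is an assumption, since the text does not fix \(M\).

```python
import numpy as np

def resample_and_standardize(y, num_points=200):
    """Interpolate one song curve onto a uniform grid on [0, 1], then z-score it.

    num_points (the grid size M) is an illustrative choice, not fixed by the text.
    """
    t_orig = np.linspace(0.0, 1.0, len(y))
    t_grid = np.linspace(0.0, 1.0, num_points)
    y_res = np.interp(t_grid, t_orig, y)
    return (y_res - y_res.mean()) / (y_res.std(ddof=1) + 1e-12)   # sd uses the M-1 denominator
```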
For two songs \(i\) and \(j\), their dissimilarity is computed
as
\[
D_{i,j} = \sqrt{\sum_{m=1}^M \Big(\tilde y_i(t_m)-\tilde
y_j(t_m)\Big)^2}.
\]
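Stacking the standardized curves, all pairwise distances can be computed in one vectorized step; `curves` (one per-frame series per song) and the helper from the previous sketch are assumptions of this illustration.

```python
# `curves` is assumed to hold one per-frame series per song.
Y = np.vstack([resample_and_standardize(c) for c in curves])        # shape (n_songs, M)
D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))    # pairwise Euclidean distances
```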
Distances are transformed into similarities using a Gaussian (RBF)
kernel:
\[
W_{i,j} = \exp\!\left(-\frac{D_{i,j}^2}{2\sigma^2}\right),
\qquad W_{i,i}=0,
\]
where \(\sigma\) is chosen as the
median of nonzero distances.
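A direct translation of this kernel step, continuing from the distance matrix `D` above:

```python
sigma = np.median(D[D > 0])                  # bandwidth: median of the nonzero distances
W = np.exp(-(D ** 2) / (2.0 * sigma ** 2))   # Gaussian (RBF) similarities
np.fill_diagonal(W, 0.0)                     # enforce W_ii = 0
```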
For each song \(i\), keep only the \(k\) strongest similarities \(W_{i,j}\) and set the rest to zero.
The adjacency matrix is then symmetrized:
\[
A_{i,j} = \max\!\big(W_{i,j}, W_{j,i}\big).
\]
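Finally, a sketch of the sparsification and symmetrization, where the neighborhood size \(k\) is an illustrative choice (the text leaves it unspecified):

```python
k = 10                                       # illustrative neighborhood size
W_knn = np.zeros_like(W)
for i in range(W.shape[0]):
    nearest = np.argsort(W[i])[-k:]          # indices of the k strongest similarities in row i
    W_knn[i, nearest] = W[i, nearest]
A = np.maximum(W_knn, W_knn.T)               # symmetrized adjacency matrix
```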