I selected the data set behind FiveThirtyEight’s article, “Our Guide to the Exuberant Nonsense of College Fight Songs.” In high school, I was in the band and had to play our fight song (which, I have since discovered, is the same as Cal’s) repeatedly. I’m a bit overloaded on politics and not particularly into professional sports, so this data set had a certain charm!
At its core, the data set records whether each song’s lyrics contain certain clichés, such as “fight,” “victory,” and “rah.” The article itself is more of a tool for viewing each school’s song alongside summary stats on its clichés and how it rates among the others. The page also starts playing music automatically, so make sure your speakers are at an appropriately low volume before clicking that link.
I linked directly to 538’s raw source data on GitHub.
fightSongRawURL <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/fight-songs/fight-songs.csv"
fightSongRaw <- read.csv(file = fightSongRawURL, header = TRUE, sep = ",")
To get an initial impression of the data, I used the summary function to break out summary stats for each column. Because most of the columns are stored as character strings, summary reports only their length and class, so there’s not much to glean about how often a given trope appears. I zeroed in on the beats per minute (bpm) and song duration (sec_duration) columns and chose those for a subset data frame.
summary(fightSongRaw)
## school conference song_name writers
## Length:65 Length:65 Length:65 Length:65
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## year student_writer official_song contest
## Length:65 Length:65 Length:65 Length:65
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## bpm sec_duration fight number_fights
## Min. : 65.0 Min. : 27.00 Length:65 Min. : 0.000
## 1st Qu.: 90.0 1st Qu.: 58.00 Class :character 1st Qu.: 0.000
## Median :140.0 Median : 67.00 Mode :character Median : 2.000
## Mean :128.8 Mean : 71.91 Mean : 2.846
## 3rd Qu.:151.0 3rd Qu.: 85.00 3rd Qu.: 5.000
## Max. :180.0 Max. :172.00 Max. :17.000
## victory win_won victory_win_won rah
## Length:65 Length:65 Length:65 Length:65
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## nonsense colors men opponents
## Length:65 Length:65 Length:65 Length:65
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## spelling trope_count spotify_id
## Length:65 Min. :0.000 Length:65
## Class :character 1st Qu.:3.000 Class :character
## Mode :character Median :4.000 Mode :character
## Mean :3.615
## 3rd Qu.:5.000
## Max. :8.000
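Since summary says so little about character columns, one option is to convert the trope columns to logical values and count them. The sketch below uses a small made-up data frame rather than the real file, and it assumes the trope columns are coded as "Yes"/"No" strings, which should be checked against the actual CSV. The names toy, tropeCols, and tropeCounts are hypothetical.

```r
# Toy stand-in for a few rows of the raw data (values are made up).
# Assumption: trope columns hold the strings "Yes" and "No".
toy <- data.frame(
  school = c("A", "B", "C", "D"),
  fight  = c("Yes", "No", "Yes", "Yes"),
  rah    = c("No", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Columns to treat as Yes/No flags
tropeCols <- c("fight", "rah")

# Convert each flag column from character to logical
toy[tropeCols] <- lapply(toy[tropeCols], function(x) x == "Yes")

# Count how many songs use each trope (TRUE sums as 1)
tropeCounts <- sapply(toy[tropeCols], sum)
print(tropeCounts)
```

Once the columns are logical, summary would also report TRUE/FALSE counts instead of just the length and class.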
I assigned the subset to its own variable, selecting a handful of columns from the raw data by position. To check that I had pulled the right variables, I ran the head function on the result.
fightSongSub <- fightSongRaw[c(1,2,9:10,22)]
head(fightSongSub)
## school conference bpm sec_duration trope_count
## 1 Notre Dame Independent 152 64 6
## 2 Baylor Big 12 76 99 5
## 3 Iowa State Big 12 155 55 4
## 4 Kansas Big 12 137 62 3
## 5 Kansas State Big 12 80 67 3
## 6 Oklahoma Big 12 153 37 2
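Selecting by position works, but it would silently pick the wrong columns if the source file ever gained or reordered a column. A possible alternative is to select by name. This is only a sketch: toyRaw is a hypothetical stand-in built from the first three rows printed above, and the extra column is a placeholder for the columns that get dropped.

```r
# Hypothetical stand-in for the raw data, using the first three rows shown above.
# "extra" is a placeholder for any column we do not want to keep.
toyRaw <- data.frame(
  school       = c("Notre Dame", "Baylor", "Iowa State"),
  conference   = c("Independent", "Big 12", "Big 12"),
  extra        = c("x", "y", "z"),
  bpm          = c(152, 76, 155),
  sec_duration = c(64, 99, 55),
  trope_count  = c(6, 5, 4),
  stringsAsFactors = FALSE
)

# Name the columns to keep instead of relying on their positions
keepCols <- c("school", "conference", "bpm", "sec_duration", "trope_count")
toySub <- toyRaw[keepCols]

head(toySub)
```

With the real data, the same idea would be fightSongRaw[keepCols], assuming the column names match those shown in the summary output.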
Rebuilding something like 538’s tool, which lets users explore each song in more depth, would be a logical next step with these data. I was also curious how the songs stack up against championship wins, bowl game appearances, and season records, and whether the songs (and perhaps the number of times they’re played per game) have any connection to wins. I rather doubt it, but it would be an interesting future project with additional data.
I thought it’d be fun to throw in a scatterplot. Using what I learned from the R bridge program, I plotted bpm against sec_duration, colored by conference. I didn’t expect conference to make much of a difference, and the plot seems to bear that out: there is no visible pattern by conference. I was a bit surprised to see that the bpm values are polarized, with a gap in the middle.
library(ggplot2)
ggplot(fightSongSub, aes(x = bpm, y = sec_duration, color = conference)) +
  geom_point()
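One way to look at the bpm gap more directly is to bin the tempos and count the songs in each bin. The sketch below uses only the six bpm values printed by head earlier, so it illustrates the technique rather than the full result. The 20-bpm bin width and the names bpmSample and bpmBins are my own choices.

```r
# The six bpm values from the head() output above (not the full data set)
bpmSample <- c(152, 76, 155, 137, 80, 153)

# Cut the 60-180 range into 20-bpm bins; intervals are closed on the right
bpmBins <- table(cut(bpmSample, breaks = seq(60, 180, by = 20)))

# Empty middle bins would point to the same gap seen in the scatterplot
print(bpmBins)
```

Running the same binning on fightSongSub$bpm would show whether the gap holds across all 65 songs.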