library('ggplot2')
library('plyr')
fight_songs <- read.csv('https://raw.githubusercontent.com/anilak1978/data/master/fight-songs/fight-songs.csv')
Are a fight song's attributes (bpm and type of lyrics, for example) and the conference that the school is in predictive of the school's college football championships?
Each case represents a school. There are 65 schools, and the conference variable takes 6 values (the Power Five conferences plus Independent).
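The case and conference counts can be checked directly from the data. The sketch below is a minimal illustration built from the six rows that `head(fight_songs)` prints in this document, so that it runs without downloading the CSV. The name `songs` is an assumption introduced here; with the full data, substitute `fight_songs`.

```r
# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  school     = c("Notre Dame", "Baylor", "Iowa State",
                 "Kansas", "Kansas State", "Oklahoma"),
  conference = c("Independent", "Big 12", "Big 12",
                 "Big 12", "Big 12", "Big 12"),
  stringsAsFactors = FALSE
)

n_cases       <- nrow(songs)                       # number of rows (cases)
n_schools     <- length(unique(songs$school))      # distinct schools
n_conferences <- length(unique(songs$conference))  # distinct conference values

# If each case is one school, the two counts agree.
stopifnot(n_cases == n_schools)

# Number of schools per conference value.
table(songs$conference)
```

Run against the full `fight_songs` data frame, the same three counts should match the `str()` output for the dataset.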
The dataset is available from https://fivethirtyeight.com/ and can be found here: https://github.com/fivethirtyeight/data/blob/master/fight-songs/fight-songs.csv. It contains data about the fight songs of all schools in the Power Five conferences. I have added the ncaa_won column, which holds each school's NCAA college football championship count; the counts were collected via Google search.
Further documentation, including a description of each variable, can be found here: https://github.com/anilak1978/data/tree/master/fight-songs.
head(fight_songs)
## school conference song_name
## 1 Notre Dame Independent Victory March
## 2 Baylor Big 12 Old Fight
## 3 Iowa State Big 12 Iowa State Fights
## 4 Kansas Big 12 I'm a Jayhawk
## 5 Kansas State Big 12 Wildcat Victory
## 6 Oklahoma Big 12 Boomer Sooner
## writers year student_writer
## 1 Michael J. Shea and John F. Shea 1908 No
## 2 Dick Baker and Frank Boggs 1947 Yes
## 3 Jack Barker, Manly Rice, Paul Gnam, Rosalind K. Cook 1930 Yes
## 4 George "Dumpy" Bowles 1912 Yes
## 5 Harry E. Erickson 1927 Yes
## 6 Arthur M. Alden 1905 Yes
## official_song contest bpm sec_duration fight number_fights victory
## 1 Yes No 152 64 Yes 1 Yes
## 2 Yes No 76 99 Yes 4 Yes
## 3 Yes No 155 55 Yes 5 No
## 4 Yes No 137 62 No 0 No
## 5 Yes No 80 67 Yes 6 Yes
## 6 Yes No 153 37 No 0 No
## win_won victory_win_won rah nonsense colors men opponents spelling
## 1 Yes Yes Yes No Yes Yes No No
## 2 Yes Yes No No Yes No No Yes
## 3 No No Yes No No Yes No Yes
## 4 No No No Yes No Yes Yes No
## 5 No Yes No No Yes No No No
## 6 No No Yes No No No No Yes
## trope_count spotify_id ncaa_won
## 1 6 15a3ShKX3XWKzq0lSS48yr 13
## 2 5 2ZsaI0Cu4nz8DHfBkPt0Dl 9
## 3 4 3yyfoOXZQCtR6pfRJqu9pl 2
## 4 3 0JzbjZgcjugS0dmPjF9R89 0
## 5 3 4xxDK4g1OHhZ44sTFy8Ktm 0
## 6 2 0QXC8Gg1oKWkORegslTXoT 16
This is an observational study.
The data was collected by https://fivethirtyeight.com for an article published on August 30, 2019. It was gathered manually from Spotify, school websites, news accounts and Google, and added to the dataset CSV. The lyrics are limited to the fight songs that are sung most regularly and published by the school. Each song’s cliché count excludes the words of the song’s title. Songs were counted as mentioning “men”, “sons”, or “boys” even if those words were part of a compound word like “cowboys”. All cliché counts are based on the version of the lyrics available on that school’s website.
The full article can be found here: https://projects.fivethirtyeight.com/college-fight-song-lyrics/
The response variable is the number of NCAA football championships won by each school (ncaa_won) and is numerical.
The independent variables are bpm and sec_duration, which are numerical, and student_writer, conference, contest (whether the song was chosen by contest), and the indicators of whether certain words are mentioned in the lyrics, which are categorical.
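As a rough illustration of how the response and independent variables fit together, the sketch below fits a linear model of `ncaa_won` on the two numerical predictors. This is an assumption about one possible analysis, not a method stated in this proposal. It uses only the six rows shown by `head(fight_songs)`, so the coefficients carry no real meaning, and the name `songs` is introduced here for illustration.

```r
# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  bpm          = c(152, 76, 155, 137, 80, 153),
  sec_duration = c(64, 99, 55, 62, 67, 37),
  ncaa_won     = c(13, 9, 2, 0, 0, 16)
)

# Response on the left of the formula, independent variables on the right.
fit <- lm(ncaa_won ~ bpm + sec_duration, data = songs)

# One intercept plus one slope per numerical predictor.
coef(fit)
```

With the full data, categorical predictors such as `conference` or `nonsense` could be added to the right-hand side of the formula. Because `ncaa_won` is a count, `glm()` with `family = poisson` may be worth considering as an alternative.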
str(fight_songs)
## 'data.frame': 65 obs. of 24 variables:
## $ school : Factor w/ 65 levels "Alabama","Arizona",..: 37 6 19 20 21 39 40 52 50 54 ...
## $ conference : Factor w/ 6 levels "ACC","Big 12",..: 4 2 2 2 2 2 2 2 2 2 ...
## $ song_name : Factor w/ 65 levels "Aggie War Hymn",..: 62 40 34 30 64 4 45 50 48 19 ...
## $ writers : Factor w/ 64 levels " Harold P. Williams and N. Loyall McLaren; Kelly James",..: 45 9 36 23 28 2 34 60 7 5 ...
## $ year : Factor w/ 44 levels "1893","1898",..: 6 34 25 10 23 4 28 19 24 29 ...
## $ student_writer : Factor w/ 3 levels "No","Unknown",..: 1 3 3 3 3 3 1 1 1 3 ...
## $ official_song : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ contest : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ bpm : int 152 76 155 137 80 153 180 81 149 159 ...
## $ sec_duration : int 64 99 55 62 67 37 29 65 47 54 ...
## $ fight : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 2 2 2 ...
## $ number_fights : int 1 4 5 0 6 0 5 17 2 8 ...
## $ victory : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 2 2 ...
## $ win_won : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 2 1 1 ...
## $ victory_win_won: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 2 2 2 ...
## $ rah : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 2 1 ...
## $ nonsense : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ colors : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 2 2 ...
## $ men : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 2 1 ...
## $ opponents : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 1 1 ...
## $ spelling : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ trope_count : int 6 5 4 3 3 2 4 4 6 3 ...
## $ spotify_id : Factor w/ 65 levels "06Qz83gVtmHtWTClrucgjX",..: 12 28 33 4 42 8 5 46 3 30 ...
## $ ncaa_won : int 13 9 2 0 0 16 1 47 2 2 ...
summary(fight_songs$bpm)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.0 90.0 140.0 128.8 151.0 180.0
summary(fight_songs$sec_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 27.00 58.00 67.00 71.91 85.00 172.00
Looking at the structure of the data, we can see the categorical variables that indicate whether certain types of words appear in the lyrics of the fight song. We can also see that the average tempo is 128.8 beats per minute and the average song duration is approximately 72 seconds.
ggplot(data = fight_songs, aes(x = bpm, y = ncaa_won)) +
  geom_point(aes(col = win_won, size = sec_duration)) +
  geom_smooth(method = "loess", se = FALSE)
We can investigate whether there is a positive or negative correlation between ncaa_won and bpm. At first glance, the smoothed curve suggests that bpm and ncaa_won have a strong positive relationship up to a little over 90 beats per minute, then a weak negative relationship between 100 bpm and 140 bpm.
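The visual impression from the scatter plot can be backed by a correlation coefficient. The sketch below computes Pearson and Spearman correlations between `bpm` and `ncaa_won` on the six rows shown by `head(fight_songs)`. The name `songs` is an assumption introduced here, and six rows are far too few to support a conclusion.

```r
# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  bpm      = c(152, 76, 155, 137, 80, 153),
  ncaa_won = c(13, 9, 2, 0, 0, 16)
)

# Pearson measures linear association; Spearman measures monotonic association
# and is less sensitive to schools with unusually many championships.
pearson  <- cor(songs$bpm, songs$ncaa_won)
spearman <- cor(songs$bpm, songs$ncaa_won, method = "spearman")

round(c(pearson = pearson, spearman = spearman), 2)
```

Since the smoothed curve changes direction, a single coefficient over the whole bpm range may understate the relationship. Computing it separately below and above roughly 90 to 100 bpm on the full data would be closer to what the plot shows.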
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=nonsense))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can look at the lyrics of the championship winners' fight songs. At first glance, the majority of them do not have nonsense syllables such as “Whoo-rah” or “Hooperay” within their lyrics.
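The claim about championship winners can be checked numerically by restricting the data to schools with at least one championship and tabulating `nonsense`. This is a minimal sketch on the six rows shown by `head(fight_songs)`. The names `songs` and `winners` are assumptions introduced here.

```r
# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  ncaa_won = c(13, 9, 2, 0, 0, 16),
  nonsense = factor(c("No", "No", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
)

# Keep only schools that have won at least one championship.
winners <- subset(songs, ncaa_won > 0)

# Counts and shares of songs with and without nonsense syllables.
nonsense_counts <- table(winners$nonsense)
nonsense_share  <- prop.table(nonsense_counts)

nonsense_counts
round(nonsense_share, 2)
```

Comparing this share with the share among schools that have no championships would show whether the pattern is specific to winners or simply reflects how rare nonsense syllables are overall.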
count(fight_songs, "student_writer")
## student_writer freq
## 1 No 30
## 2 Unknown 3
## 3 Yes 32
count(fight_songs, "official_song")
## official_song freq
## 1 No 7
## 2 Yes 58
count(fight_songs, "contest")
## contest freq
## 1 No 55
## 2 Yes 10
count(fight_songs, "victory")
## victory freq
## 1 No 32
## 2 Yes 33
count(fight_songs, "win_won")
## win_won freq
## 1 No 34
## 2 Yes 31
count(fight_songs, "victory_win_won")
## victory_win_won freq
## 1 No 24
## 2 Yes 41
count(fight_songs, "rah")
## rah freq
## 1 No 47
## 2 Yes 18
count(fight_songs, "nonsense")
## nonsense freq
## 1 No 55
## 2 Yes 10
count(fight_songs, "colors")
## colors freq
## 1 No 30
## 2 Yes 35
count(fight_songs, "men")
## men freq
## 1 No 41
## 2 Yes 24
count(fight_songs, "opponents")
## opponents freq
## 1 No 53
## 2 Yes 12
count(fight_songs, "spelling")
## spelling freq
## 1 No 36
## 2 Yes 29
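The repeated `count()` calls above can also be written as one loop over the Yes/No columns. The sketch below uses base R `table()` rather than `plyr::count()`, which is a substitution made here for brevity. It runs on three of the columns from the six rows shown by `head(fight_songs)`, and the names `songs`, `flag_cols` and `flag_counts` are assumptions introduced for illustration.

```r
# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  fight    = c("Yes", "Yes", "Yes", "No", "Yes", "No"),
  victory  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  nonsense = c("No", "No", "No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

flag_cols <- c("fight", "victory", "nonsense")

# Tabulate every flag column at once. Fixing the levels keeps a "No" and a
# "Yes" row for each column even when one of the values never occurs.
flag_counts <- sapply(
  songs[flag_cols],
  function(x) table(factor(x, levels = c("No", "Yes")))
)

flag_counts
```

The result is a matrix with one column per variable and one row each for "No" and "Yes", which makes the frequencies easier to compare side by side than a dozen separate outputs.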
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=official_song))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=student_writer))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=contest))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=victory))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=win_won))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=victory_win_won))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=rah))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=nonsense))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=colors))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=men))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=opponents))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
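The histograms above differ only in the fill variable, so they can be generated by one small helper. The sketch below is a minimal illustration: `plot_championships_by` is a hypothetical helper name, `songs` is a stand-in built from the six rows shown by `head(fight_songs)`, and the `binwidth` of 2 is an arbitrary choice made here to avoid the `stat_bin()` message.

```r
library(ggplot2)

# Six rows copied from the head(fight_songs) output; `songs` is a stand-in name.
songs <- data.frame(
  ncaa_won = c(13, 9, 2, 0, 0, 16),
  victory  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  rah      = c("Yes", "No", "Yes", "No", "No", "Yes"),
  nonsense = c("No", "No", "No", "Yes", "No", "No")
)

# Hypothetical helper: histogram of ncaa_won, filled by one Yes/No column.
# .data[[fill_var]] looks the column up by its name given as a string.
plot_championships_by <- function(data, fill_var, binwidth = 2) {
  ggplot(data, aes(x = ncaa_won, fill = .data[[fill_var]])) +
    geom_histogram(binwidth = binwidth) +
    scale_fill_brewer(palette = "Spectral") +
    labs(fill = fill_var)
}

fill_vars <- c("victory", "rah", "nonsense")

# Build one plot per fill variable and keep them in a named list.
plots <- lapply(fill_vars, function(v) plot_championships_by(songs, v))
names(plots) <- fill_vars
```

A single plot can then be shown with `print(plots[["nonsense"]])`, or all of them with `for (p in plots) print(p)`. With the full data, pass `fight_songs` and the complete list of Yes/No column names.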