DATA 606 Data Project Proposal

Data Preparation

library('ggplot2')
library('plyr')

fight_songs <- read.csv('https://raw.githubusercontent.com/anilak1978/data/master/fight-songs/fight-songs.csv')

Research question

Are fight song attributes (for example, bpm and the type of lyrics) and the conference a school belongs to predictive of the number of college football championships the school has won?

Cases

Each case represents a school. There are 65 schools and 6 conference groups (the Power Five conferences plus Independent) represented.

Data collection

The dataset is available from https://fivethirtyeight.com/ and can be found here: https://github.com/fivethirtyeight/data/blob/master/fight-songs/fight-songs.csv. The dataset contains data about the fight songs of all schools in the Power Five conferences (plus Notre Dame, which is listed as Independent). I have added the ncaa_won column, which holds each school's college football championship count. These counts were collected via Google search and added as a column to the dataset.

Further documentation, including a description of each variable, can be found here: https://github.com/anilak1978/data/tree/master/fight-songs

head(fight_songs)

##         school  conference         song_name
## 1   Notre Dame Independent     Victory March
## 2       Baylor      Big 12         Old Fight
## 3   Iowa State      Big 12 Iowa State Fights
## 4       Kansas      Big 12     I'm a Jayhawk
## 5 Kansas State      Big 12   Wildcat Victory
## 6     Oklahoma      Big 12     Boomer Sooner
##                                                writers year student_writer
## 1                     Michael J. Shea and John F. Shea 1908             No
## 2                           Dick Baker and Frank Boggs 1947            Yes
## 3 Jack Barker, Manly Rice, Paul Gnam, Rosalind K. Cook 1930            Yes
## 4                                George "Dumpy" Bowles 1912            Yes
## 5                                    Harry E. Erickson 1927            Yes
## 6                                      Arthur M. Alden 1905            Yes
##   official_song contest bpm sec_duration fight number_fights victory
## 1           Yes      No 152           64   Yes             1     Yes
## 2           Yes      No  76           99   Yes             4     Yes
## 3           Yes      No 155           55   Yes             5      No
## 4           Yes      No 137           62    No             0      No
## 5           Yes      No  80           67   Yes             6     Yes
## 6           Yes      No 153           37    No             0      No
##   win_won victory_win_won rah nonsense colors men opponents spelling
## 1     Yes             Yes Yes       No    Yes Yes        No       No
## 2     Yes             Yes  No       No    Yes  No        No      Yes
## 3      No              No Yes       No     No Yes        No      Yes
## 4      No              No  No      Yes     No Yes       Yes       No
## 5      No             Yes  No       No    Yes  No        No       No
## 6      No              No Yes       No     No  No        No      Yes
##   trope_count             spotify_id ncaa_won
## 1           6 15a3ShKX3XWKzq0lSS48yr       13
## 2           5 2ZsaI0Cu4nz8DHfBkPt0Dl        9
## 3           4 3yyfoOXZQCtR6pfRJqu9pl        2
## 4           3 0JzbjZgcjugS0dmPjF9R89        0
## 5           3 4xxDK4g1OHhZ44sTFy8Ktm        0
## 6           2 0QXC8Gg1oKWkORegslTXoT       16

Type of study

This is an observational study.

Data Source

The data was collected by https://fivethirtyeight.com for an article published on August 30, 2019. It was gathered manually from Spotify, school websites, news accounts and Google and added to the dataset CSV. The lyrics are limited to the fight songs that are sung most regularly and published by the school. Each song’s cliché count excludes the words of the song’s title. Songs were counted as mentioning “men”, “sons”, or “boys” even if those words were part of a compound word like “cowboys”. All cliché counts are based on the version of the lyrics available on that school’s website.

Full article can be found here: https://projects.fivethirtyeight.com/college-fight-song-lyrics/

Dependent Variable

The response variable is ncaa_won, the number of NCAA football championships each school has won. It is numerical.

Independent Variable

The numerical independent variables are bpm and sec_duration. The categorical independent variables are student_writer, conference, contest (whether the song was chosen by contest), and the indicators for whether certain words are mentioned in the lyrics.
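To make the roles of the response and independent variables concrete, here is a minimal sketch of how they could enter a model formula. It uses only the six rows shown in the head() output above, and the choice of an ordinary least-squares model is an assumption made for illustration; the proposal does not commit to a specific model.

```r
# First six rows of the dataset, copied from the head() output above.
songs <- data.frame(
  bpm          = c(152, 76, 155, 137, 80, 153),
  sec_duration = c(64, 99, 55, 62, 67, 37),
  win_won      = c("Yes", "Yes", "No", "No", "No", "No"),
  ncaa_won     = c(13, 9, 2, 0, 0, 16)
)

# Assumption: a linear model, chosen only to illustrate the formula layout.
# ncaa_won is the response; bpm and sec_duration are numerical predictors;
# win_won stands in for the categorical lyric indicators.
fit <- lm(ncaa_won ~ bpm + sec_duration + win_won, data = songs)

# One coefficient per numerical predictor, one per non-reference factor level.
coef(fit)
```

With only six rows the estimates are not meaningful; the point is the structure of the formula, which would be applied to the full fight_songs data frame.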

Relevant summary statistics

str(fight_songs)

## 'data.frame':    65 obs. of  24 variables:
##  $ school         : Factor w/ 65 levels "Alabama","Arizona",..: 37 6 19 20 21 39 40 52 50 54 ...
##  $ conference     : Factor w/ 6 levels "ACC","Big 12",..: 4 2 2 2 2 2 2 2 2 2 ...
##  $ song_name      : Factor w/ 65 levels "Aggie War Hymn",..: 62 40 34 30 64 4 45 50 48 19 ...
##  $ writers        : Factor w/ 64 levels " Harold P. Williams and N. Loyall McLaren; Kelly James",..: 45 9 36 23 28 2 34 60 7 5 ...
##  $ year           : Factor w/ 44 levels "1893","1898",..: 6 34 25 10 23 4 28 19 24 29 ...
##  $ student_writer : Factor w/ 3 levels "No","Unknown",..: 1 3 3 3 3 3 1 1 1 3 ...
##  $ official_song  : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ contest        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ bpm            : int  152 76 155 137 80 153 180 81 149 159 ...
##  $ sec_duration   : int  64 99 55 62 67 37 29 65 47 54 ...
##  $ fight          : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 2 2 2 ...
##  $ number_fights  : int  1 4 5 0 6 0 5 17 2 8 ...
##  $ victory        : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 2 2 ...
##  $ win_won        : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 2 1 1 ...
##  $ victory_win_won: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 2 2 2 ...
##  $ rah            : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 2 1 ...
##  $ nonsense       : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ colors         : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 2 2 ...
##  $ men            : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 2 1 ...
##  $ opponents      : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 1 1 ...
##  $ spelling       : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
##  $ trope_count    : int  6 5 4 3 3 2 4 4 6 3 ...
##  $ spotify_id     : Factor w/ 65 levels "06Qz83gVtmHtWTClrucgjX",..: 12 28 33 4 42 8 5 46 3 30 ...
##  $ ncaa_won       : int  13 9 2 0 0 16 1 47 2 2 ...

summary(fight_songs$bpm)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    65.0    90.0   140.0   128.8   151.0   180.0

summary(fight_songs$sec_duration)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   27.00   58.00   67.00   71.91   85.00  172.00

The structure of the data shows the categorical variables that indicate whether certain types of words appear in the lyrics of each fight song. The summaries show that the average tempo is 128.8 beats per minute and the average song duration is approximately 72 seconds.
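The str() output lists year as a factor with 44 levels rather than as an integer, which suggests that at least one entry is not a number. A minimal sketch of converting such a column to numeric follows. The values and the "Unknown" placeholder are hypothetical and were not checked against the actual column.

```r
# Hypothetical values; "Unknown" is an assumed placeholder for a missing year.
year <- factor(c("1908", "1947", "Unknown", "1930"))

# Convert through character first. Calling as.numeric() directly on a factor
# would return the internal level codes, not the years.
year_num <- suppressWarnings(as.numeric(as.character(year)))

# Non-numeric entries become NA.
year_num
```

The same two-step conversion could be applied to fight_songs$year before using it as a numerical variable.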

ggplot(data = fight_songs, aes(x = bpm, y = ncaa_won)) +
  geom_point(aes(col = win_won, size = sec_duration)) +
  geom_smooth(method = "loess", se = FALSE)

We can investigate whether ncaa_won and bpm are positively or negatively correlated. At first glance, bpm and ncaa_won appear to have a strong positive relationship up to a little over 90 beats per minute, then a weak negative relationship between 100 bpm and 140 bpm.
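A visual impression like this can be complemented with a correlation coefficient. The sketch below uses only the six rows from the head() output, so the resulting values say nothing about the full dataset. Adding the Spearman coefficient is an assumption on our part, motivated by the non-linear shape of the smoothed curve.

```r
# bpm and ncaa_won for the first six schools, copied from the head() output.
bpm      <- c(152, 76, 155, 137, 80, 153)
ncaa_won <- c(13, 9, 2, 0, 0, 16)

# Pearson measures linear association; Spearman measures monotonic association
# and is less sensitive to a few schools with very large championship counts.
r_pearson  <- cor(bpm, ncaa_won)
r_spearman <- cor(bpm, ncaa_won, method = "spearman")

c(pearson = r_pearson, spearman = r_spearman)
```

On the full data this would be cor(fight_songs$bpm, fight_songs$ncaa_won).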

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=nonsense))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Looking at the fight song lyrics of championship-winning schools, at first glance the majority of them do not contain nonsense syllables such as “Whoo-rah” or “Hooperay”.
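The same comparison can be read off a cross-tabulation instead of a stacked histogram. This is a minimal sketch on the six rows from the head() output; the has_title flag (at least one championship) is a helper introduced here for illustration and is not a column in the dataset.

```r
# First six rows, copied from the head() output above.
songs <- data.frame(
  nonsense = c("No", "No", "No", "Yes", "No", "No"),
  ncaa_won = c(13, 9, 2, 0, 0, 16)
)

# Hypothetical helper flag: TRUE if the school has won at least one championship.
has_title <- songs$ncaa_won > 0

# Rows: championship flag; columns: whether the lyrics contain nonsense syllables.
tab <- table(has_title, nonsense = songs$nonsense)
tab
```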

count(fight_songs, "student_writer")

##   student_writer freq
## 1             No   30
## 2        Unknown    3
## 3            Yes   32

count(fight_songs, "official_song")

##   official_song freq
## 1            No    7
## 2           Yes   58

count(fight_songs, "contest")

##   contest freq
## 1      No   55
## 2     Yes   10

count(fight_songs, "victory")

##   victory freq
## 1      No   32
## 2     Yes   33

count(fight_songs, "win_won")

##   win_won freq
## 1      No   34
## 2     Yes   31

count(fight_songs, "victory_win_won")

##   victory_win_won freq
## 1              No   24
## 2             Yes   41

count(fight_songs, "rah")

##   rah freq
## 1  No   47
## 2 Yes   18

count(fight_songs, "nonsense")

##   nonsense freq
## 1       No   55
## 2      Yes   10

count(fight_songs, "colors")

##   colors freq
## 1     No   30
## 2    Yes   35

count(fight_songs, "men")

##   men freq
## 1  No   41
## 2 Yes   24

count(fight_songs, "opponents")

##   opponents freq
## 1        No   53
## 2       Yes   12

count(fight_songs, "spelling")

##   spelling freq
## 1       No   36
## 2      Yes   29
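The repeated count() calls above could also be collapsed into a single table. This is a sketch in base R on the six rows from the head() output, offered as an alternative layout rather than as the method used in this proposal. Only four of the Yes/No columns are included to keep it short.

```r
# Four of the Yes/No columns for the first six schools, from the head() output.
songs <- data.frame(
  fight    = c("Yes", "Yes", "Yes", "No", "Yes", "No"),
  victory  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  rah      = c("Yes", "No", "Yes", "No", "No", "Yes"),
  nonsense = c("No", "No", "No", "Yes", "No", "No")
)

flags <- c("fight", "victory", "rah", "nonsense")

# Fix the levels so every column reports both "No" and "Yes",
# even when one of them does not occur.
count_yes_no <- function(x) table(factor(x, levels = c("No", "Yes")))

# One row per variable, one column per answer.
counts <- t(sapply(songs[flags], count_yes_no))
counts
```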

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=official_song))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=student_writer))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=contest))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=victory))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=win_won))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=victory_win_won))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=rah))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=nonsense))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=colors))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=men))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data=fight_songs, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
  geom_histogram(aes(fill=opponents))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
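The histograms above differ only in the fill variable, and each one triggers the stat_bin() message because no binwidth is set. A minimal sketch of a helper that builds one plot per variable with an explicit binwidth follows. The helper name plot_by and the choice of binwidth = 1 are assumptions made for illustration, and the data is again limited to the six rows from the head() output.

```r
library(ggplot2)

# First six rows, copied from the head() output above.
songs <- data.frame(
  ncaa_won = c(13, 9, 2, 0, 0, 16),
  rah      = c("Yes", "No", "Yes", "No", "No", "Yes"),
  nonsense = c("No", "No", "No", "Yes", "No", "No")
)

# Hypothetical helper: histogram of ncaa_won, filled by the column named in `flag`.
# binwidth = 1 is an assumed value (one bar per championship count).
plot_by <- function(data, flag) {
  ggplot(data, aes(x = ncaa_won, fill = .data[[flag]])) +
    geom_histogram(binwidth = 1) +
    scale_fill_brewer(palette = "Spectral") +
    labs(fill = flag)
}

flags <- c("rah", "nonsense")
plots <- lapply(flags, function(f) plot_by(songs, f))
names(plots) <- flags
```

Printing plots$rah or plots$nonsense draws the corresponding histogram; with the full data, flags would list all of the Yes/No columns.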