##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Introduction
Data Requirements
Data Collection
Data Understanding
Analytical Approach
Data Exploration
Model Development
Model Evaluation
Conclusion
American football is a huge part of American tradition, and it is very competitive at every level, from 3rd grade to high school and college. In some cases college football can be more competitive than the NFL, and teams do whatever it takes to win. Teams push themselves beyond their limits to add extra speed and strength, from waking up at 4:15 am every morning to lifting weights and running. The impact of this hard work is definitely positive, but are there other methods that can contribute to their victory? What about using bagpipes for motivation or intimidation? Bagpipes were traditionally used in war by the Scots to scare their enemies or to encourage the Scots into battle. How about College Fight Songs? Would the songs that all college students memorize and sing during every college football game have an impact on the game result? Would the lyrics of the College Fight Song have an impact on victory or loss? Would it matter if the College Fight Song is fast or slow? Would it help if the song uses the word “victory” often?
The purpose of this study is to determine whether College Fight Song attributes have an impact on a college football team’s victory or loss. The business problem in question is: “Can we find attributes in a College Fight Song that can help a college football team win?” The main group that may benefit from this study is the college’s Athletic Department, which may learn whether College Fight Songs have a positive or negative impact on victory.
There is a variety of data that we need to collect in order to answer the problem statement and meet the business objectives, such as loudness, rhythm, tempo and lyrics. Considering the scope, we proceed with an observational study rather than an experimental one, and we use publicly available data rather than collecting it ourselves. The study is limited to all schools in the Power Five conferences (the ACC, Big Ten, Big 12, Pac-12 and SEC) plus Notre Dame. For the purpose of this study, a College Fight Song is a song that the school defines as “official” or one that fans sing out. The study is also limited to the lyrics that are sung most regularly and are published by the school.
FiveThirtyEight (https://fivethirtyeight.com/) published a guide to College Fight Songs on August 30th, 2019, including a dataset collected by its staff. The full dataset can be found here: https://github.com/fivethirtyeight/data/blob/master/fight-songs/fight-songs.csv . In addition to the “fight songs” dataset provided by FiveThirtyEight, the number of NCAA championships won by each school was collected via Google search. The complete dataset can be found in CSV format here: https://raw.githubusercontent.com/anilak1978/data/master/fight-songs/fight-songs.csv
Definitions of the variables are as follows:
school: School name
conference: School college football conference
song_name: Song title
writers: Song author
year: Year the song was written. Some values are Unknown
student_writer: Was the author a student? Some values are Unknown
official_song: Is the song the official fight song according to the university?
contest: Was the song chosen as the result of a contest?
bpm: Beats per minute
sec_duration: Duration of song in seconds
fight: Does the song say “fight”?
number_fights: Number of times the song says “fight”?
victory: Does the song say “victory”?
win_won: Does the song say “win” or “won”?
victory_win_won: Does the song say “victory,” “win” or “won”?
rah: Does the song say “rah”?
nonsense: Does the song use nonsense syllables (e.g. “Whoo-Rah” or “Hooperay”)
colors: Does the song mention the school colors?
men: Does the song refer to a group of men (e.g. men, boys, sons, etc.)?
opponents: Does the song mention any opponents?
spelling: Does the song spell anything?
trope_count: Total number of tropes (fight, victory, win_won, rah, nonsense, colors, men, opponents, and spelling).
spotify_id: Spotify id for the song
ncaa_won: Number of NCAA championships the school has won
fight_songs <- read.csv('https://raw.githubusercontent.com/anilak1978/data/master/fight-songs/fight-songs.csv')
head(fight_songs)
## school conference song_name
## 1 Notre Dame Independent Victory March
## 2 Baylor Big 12 Old Fight
## 3 Iowa State Big 12 Iowa State Fights
## 4 Kansas Big 12 I'm a Jayhawk
## 5 Kansas State Big 12 Wildcat Victory
## 6 Oklahoma Big 12 Boomer Sooner
## writers year student_writer
## 1 Michael J. Shea and John F. Shea 1908 No
## 2 Dick Baker and Frank Boggs 1947 Yes
## 3 Jack Barker, Manly Rice, Paul Gnam, Rosalind K. Cook 1930 Yes
## 4 George "Dumpy" Bowles 1912 Yes
## 5 Harry E. Erickson 1927 Yes
## 6 Arthur M. Alden 1905 Yes
## official_song contest bpm sec_duration fight number_fights victory
## 1 Yes No 152 64 Yes 1 Yes
## 2 Yes No 76 99 Yes 4 Yes
## 3 Yes No 155 55 Yes 5 No
## 4 Yes No 137 62 No 0 No
## 5 Yes No 80 67 Yes 6 Yes
## 6 Yes No 153 37 No 0 No
## win_won victory_win_won rah nonsense colors men opponents spelling
## 1 Yes Yes Yes No Yes Yes No No
## 2 Yes Yes No No Yes No No Yes
## 3 No No Yes No No Yes No Yes
## 4 No No No Yes No Yes Yes No
## 5 No Yes No No Yes No No No
## 6 No No Yes No No No No Yes
## trope_count spotify_id ncaa_won
## 1 6 15a3ShKX3XWKzq0lSS48yr 13
## 2 5 2ZsaI0Cu4nz8DHfBkPt0Dl 9
## 3 4 3yyfoOXZQCtR6pfRJqu9pl 2
## 4 3 0JzbjZgcjugS0dmPjF9R89 0
## 5 3 4xxDK4g1OHhZ44sTFy8Ktm 0
## 6 2 0QXC8Gg1oKWkORegslTXoT 16
## 'data.frame': 65 obs. of 24 variables:
## $ school : Factor w/ 65 levels "Alabama","Arizona",..: 37 6 19 20 21 39 40 52 50 54 ...
## $ conference : Factor w/ 6 levels "ACC","Big 12",..: 4 2 2 2 2 2 2 2 2 2 ...
## $ song_name : Factor w/ 65 levels "Aggie War Hymn",..: 62 40 34 30 64 4 45 50 48 19 ...
## $ writers : Factor w/ 64 levels " Harold P. Williams and N. Loyall McLaren; Kelly James",..: 45 9 36 23 28 2 34 60 7 5 ...
## $ year : Factor w/ 44 levels "1893","1898",..: 6 34 25 10 23 4 28 19 24 29 ...
## $ student_writer : Factor w/ 3 levels "No","Unknown",..: 1 3 3 3 3 3 1 1 1 3 ...
## $ official_song : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ contest : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ bpm : int 152 76 155 137 80 153 180 81 149 159 ...
## $ sec_duration : int 64 99 55 62 67 37 29 65 47 54 ...
## $ fight : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 2 2 2 2 ...
## $ number_fights : int 1 4 5 0 6 0 5 17 2 8 ...
## $ victory : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 2 2 ...
## $ win_won : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 2 1 1 ...
## $ victory_win_won: Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 2 2 2 ...
## $ rah : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 2 1 1 2 1 ...
## $ nonsense : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ colors : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 2 2 2 ...
## $ men : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 2 1 ...
## $ opponents : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 2 1 1 ...
## $ spelling : Factor w/ 2 levels "No","Yes": 1 2 2 1 1 2 1 1 2 1 ...
## $ trope_count : int 6 5 4 3 3 2 4 4 6 3 ...
## $ spotify_id : Factor w/ 65 levels "06Qz83gVtmHtWTClrucgjX",..: 12 28 33 4 42 8 5 46 3 30 ...
## $ ncaa_won : int 13 9 2 0 0 16 1 47 2 2 ...
There are 65 cases (observations) with 24 variables in our dataset. Each case represents a school. Some categorical variables are stored as factors and need to be converted to character or date types. Not all variables are required for the study.
# Update data types to the correct format
fight_songs$school <- as.character(fight_songs$school)
fight_songs$song_name <- as.character(fight_songs$song_name)
fight_songs$writers <- as.character(fight_songs$writers)
# "Unknown" years cannot be parsed and become NA
fight_songs$year <- format(as.Date(fight_songs$year, format='%Y'),"%Y")
fight_songs$spotify_id <- as.character(fight_songs$spotify_id)
The song_name, writers, contest, spelling, trope_count and spotify_id variables are not required for our analysis. There were 5 missing values, and the rows containing them are removed from the dataset.
# select the needed columns
fight_songs_df <- fight_songs %>%
select(school, conference, year, student_writer, official_song, bpm, sec_duration, fight, number_fights, victory, win_won, victory_win_won, rah, nonsense, colors, men, opponents, ncaa_won)
# total na values in the dataset
sum(is.na(fight_songs_df))
## [1] 5
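The step that drops the incomplete rows is not shown in the code above. A minimal sketch of one way to do it, assuming base R's `na.omit()` and a small made-up data frame (`toy_df` and its values are hypothetical, not the fight-songs data):

```r
# Hypothetical toy data standing in for fight_songs_df
toy_df <- data.frame(
  school = c("A", "B", "C", "D"),
  year   = c("1908", NA, "1930", "1912"),
  bpm    = c(152L, 76L, NA, 137L),
  stringsAsFactors = FALSE
)

# Count missing values across the whole data frame
n_missing <- sum(is.na(toy_df))

# Drop every row that contains at least one NA
toy_clean <- na.omit(toy_df)

n_missing
nrow(toy_clean)
```

Applied to the real data, the same call would reduce the 65 rows to the 60 complete cases reported in the summaries that follow.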
## school conference year
## 1 Notre Dame Independent 1908
## 2 Baylor Big 12 1947
## 3 Iowa State Big 12 1930
## 4 Kansas Big 12 1912
## 5 Kansas State Big 12 1927
## 6 Oklahoma Big 12 1905
## student_writer official_song bpm sec_duration fight number_fights
## 1 No Yes 152 64 Yes 1
## 2 Yes Yes 76 99 Yes 4
## 3 Yes Yes 155 55 Yes 5
## 4 Yes Yes 137 62 No 0
## 5 Yes Yes 80 67 Yes 6
## 6 Yes Yes 153 37 No 0
## victory win_won victory_win_won rah nonsense colors men opponents
## 1 Yes Yes Yes Yes No Yes Yes No
## 2 Yes Yes Yes No No Yes No No
## 3 No No No Yes No No Yes No
## 4 No No No No Yes No Yes Yes
## 5 Yes No Yes No No Yes No No
## 6 No No No Yes No No No No
## ncaa_won
## 1 13
## 2 9
## 3 2
## 4 0
## 5 0
## 6 16
## school conference year student_writer
## Length:60 ACC :11 Length:60 No :29
## Class :character Big 12 :10 Class :character Unknown: 0
## Mode :character Big Ten :14 Mode :character Yes :31
## Independent: 1
## Pac-12 :11
## SEC :13
## official_song bpm sec_duration fight number_fights
## No : 7 Min. : 65.0 Min. : 27.00 No :18 Min. : 0.00
## Yes:53 1st Qu.: 82.5 1st Qu.: 58.75 Yes:42 1st Qu.: 0.00
## Median :139.0 Median : 68.00 Median : 2.00
## Mean :126.9 Mean : 73.20 Mean : 2.85
## 3rd Qu.:150.2 3rd Qu.: 86.00 3rd Qu.: 5.00
## Max. :180.0 Max. :172.00 Max. :17.00
## victory win_won victory_win_won rah nonsense colors men
## No :29 No :31 No :21 No :43 No :50 No :27 No :37
## Yes:31 Yes:29 Yes:39 Yes:17 Yes:10 Yes:33 Yes:23
##
##
##
##
## opponents ncaa_won
## No :49 Min. : 0.000
## Yes:11 1st Qu.: 1.000
## Median : 3.000
## Mean : 6.017
## 3rd Qu.: 6.250
## Max. :51.000
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 60 126.92 33.82 139 128.48 20.76 65 180 115 -0.6 -1.16
## se
## X1 4.37
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 60 2.85 3.21 2 2.33 2.97 0 17 17 1.75 4.48 0.41
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 60 73.2 24.94 68 71.15 20.76 27 172 145 1.19 2.83
## se
## X1 3.22
The basic measures of the College Fight Songs data are outlined below.
After removing the 5 observations that had missing values, our dataset has 60 cases, each of which is a school.
31 of the songs were written by a student.
53 of the songs are official school songs.
The average tempo is 126.9 beats per minute.
The average duration of the songs is 73.20 seconds.
42 of the songs contain the word “fight”.
On average, a song uses the word “fight” 2.85 times.
31 of the songs contain the word “victory”.
10 of the songs use nonsense syllables.
23 of the songs refer to a group of men.
11 of the songs mention opponents.
On average, a school has won the NCAA championship 6 times.
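The measures listed above are counts of Yes/No variables and means of numeric ones. A minimal sketch of how such figures can be computed in base R, using a made-up data frame (`toy_df` is hypothetical; only the column names mirror the dataset):

```r
# Hypothetical toy data; column names mirror the dataset, values are made up
toy_df <- data.frame(
  fight         = factor(c("Yes", "No", "Yes", "Yes")),
  number_fights = c(4L, 0L, 5L, 1L),
  bpm           = c(152, 76, 155, 137)
)

# Number of songs in each category of a Yes/No variable
fight_counts <- table(toy_df$fight)

# Means of the numeric variables
avg_bpm    <- mean(toy_df$bpm)
avg_fights <- mean(toy_df$number_fights)

fight_counts
avg_bpm
avg_fights
```

`summary()` on the whole data frame reports the same counts and means in one call, which is the form of the output shown earlier.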
theme_set(theme_classic())
ggplot(fight_songs_df, aes(victory))+
geom_bar(aes(fill=conference), width=0.5)
Songs that use the word “victory” tend to use the words “win” and “won” as well.
theme_set(theme_bw())
ggplot(fight_songs_df, aes(sec_duration)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=student_writer),
binwidth = 10,
col="black",
size=1)
The sec_duration variable is roughly bell-shaped but right-skewed (skew 1.19).
theme_set(theme_bw())
ggplot(fight_songs_df, aes(number_fights)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=conference),
binwidth = 1,
col="black",
size=1)
The number of times the word “fight” is used is right-skewed (skew 1.75) and not normally distributed.
The majority of the college football teams that won the NCAA championship most frequently have College Fight Songs with a tempo of 130-150 bpm or 70-80 bpm.
theme_set(theme_bw())
ggplot(fight_songs_df, aes(bpm)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=fight),
binwidth = 10,
col="black",
size=1)
If we look only at songs with bpm above 110, the distribution looks approximately normal.
theme_set(theme_bw())
ggplot(fight_songs_df, aes(ncaa_won)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=conference),
binwidth = 1,
col="black",
size=1)
The ncaa_won variable is not normally distributed; it is strongly right-skewed (skew 3.27).
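Skewness is what separates the right-skewed variables here from the roughly symmetric ones. A minimal sketch of a moment-based sample skewness, assuming the simple definition \(m_3 / m_2^{3/2}\) (the helper `sample_skew` and the toy values are hypothetical, and `psych::describe()` may apply a slightly different small-sample adjustment):

```r
# Moment-based sample skewness: m3 / m2^(3/2)
sample_skew <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)   # second central moment
  m3 <- mean(d^3)   # third central moment
  m3 / m2^(3 / 2)
}

# Hypothetical values: many small counts and one large one
wins_toy <- c(0, 0, 1, 2, 2, 3, 16)

sample_skew(wins_toy)          # positive, i.e. right-skewed
sample_skew(c(1, 2, 3, 4, 5))  # symmetric data, skewness of zero
```

A positive value means the long tail is on the right, as with ncaa_won, where a few schools have far more championships than the rest.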
Based on the initial overview of the dataset, we select bpm, sec_duration, conference and ncaa_won. bpm is a numerical variable that gives the song’s tempo, sec_duration is a numerical variable that gives the song’s length, conference is a categorical variable that gives the conference the college team plays in, and ncaa_won is a numerical variable that gives the number of times the school has won the NCAA championship.
The potential explanatory variables we focus on are bpm and sec_duration. The response variable is ncaa_won.
This study is an observational study. The collected data is not a census: it does not include all schools, only a subset of them. The data is not collected randomly, as the cases are from the Power Five conferences and Notre Dame. The cases are independent of each other; however, there might be sampling bias, since the sample is not random. One important factor to keep in mind is that the majority of the schools that compete for the championship are included in the dataset. The cases are spread fairly evenly across the five conferences (10 to 14 schools each).
The analysis cannot be generalized to the whole population (all colleges that have a football team); however, it can be generalized to the subsection of the population made up of the colleges in the Power Five conferences that compete for the NCAA championship. The expectation from the study is to find an association between the explanatory variable bpm and the response variable ncaa_won. The result of the study probably holds for all the colleges within the Power Five conferences that compete for the NCAA championship. This study is not an experimental design, as there is no control group, so any association found cannot be read as causal. The sampling captures most of the population of schools, so the conclusion might be generalized to that population.
Based on the business problem in question, our hypotheses are as follows.
\(H_{0}:\) College Fight Song attributes are not predictive of NCAA championships.
\(H_{1}:\) College Fight Song attributes are predictive of NCAA championships.
# create the dataframe with only the variables we need
fight_songs_df_2 <- fight_songs_df %>%
select(school, conference,bpm, sec_duration, ncaa_won)
head(fight_songs_df_2)
## school conference bpm
## 1 Notre Dame Independent 152
## 2 Baylor Big 12 76
## 3 Iowa State Big 12 155
## 4 Kansas Big 12 137
## 5 Kansas State Big 12 80
## 6 Oklahoma Big 12 153
## sec_duration ncaa_won
## 1 64 13
## 2 99 9
## 3 55 2
## 4 62 0
## 5 67 0
## 6 37 16
## Warning in describe(fight_songs_df_2): NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
## vars n mean sd median trimmed mad min max range
## school* 1 60 NaN NA NA NaN NA Inf -Inf -Inf
## conference* 2 60 3.50 1.85 3 3.50 2.97 1 6 5
## bpm 3 60 126.92 33.82 139 128.48 20.76 65 180 115
## sec_duration 4 60 73.20 24.94 68 71.15 20.76 27 172 145
## ncaa_won 5 60 6.02 9.50 3 4.00 4.45 0 51 51
## skew kurtosis se
## school* NA NA NA
## conference* 0.09 -1.50 0.24
## bpm -0.60 -1.16 4.37
## sec_duration 1.19 2.83 3.22
## ncaa_won 3.27 11.65 1.23
# see correlation between bpm and ncaa_won
options(scipen=999)
theme_set(theme_bw())
ggplot(fight_songs_df_2, aes(bpm, ncaa_won))+
geom_point(aes(color=conference, size=sec_duration))
options(scipen=999)
theme_set(theme_bw())
ggplot(fight_songs_df_2, aes(sec_duration, ncaa_won))+
geom_point()
# keep songs with bpm above 110, where the distribution is close to normal
fight_songs_df_3 <- filter(fight_songs_df_2, bpm > 110)
theme_set(theme_bw())
ggplot(fight_songs_df_3, aes(bpm)) + scale_fill_brewer(palette = "Spectral")+
geom_histogram(aes(fill=conference),
binwidth = 10,
col="black",
size=1)
For bpm above 110, the values are approximately normally distributed, with some points deviating from the reference line at both the top and the bottom.
For sec_duration, the values are approximately normally distributed, with some points deviating from the reference line towards the top.
The sample size is 60, bpm follows a close-to-normal distribution above 110, and the sample covers the Power Five conference College Fight Songs rather than being randomly selected.
Based on our analytical approach, we use a simple linear regression model.
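The call that fits the model is not shown in this report, although the regression output records it as `lm(formula = ncaa_won ~ bpm, data = fight_songs_df_3)`. A minimal sketch of the same kind of fit on made-up data (`toy_df` and `toy_model` are hypothetical names and values):

```r
# Hypothetical toy data; the report's model uses fight_songs_df_3 instead
toy_df <- data.frame(
  bpm      = c(120, 130, 140, 150, 160),
  ncaa_won = c(8, 6, 7, 3, 2)
)

# Fit ncaa_won as a linear function of bpm
toy_model <- lm(ncaa_won ~ bpm, data = toy_df)

# Intercept and slope of the fitted line
coef(toy_model)

# Full summary: coefficients, R-squared, F-statistic and p-value
summary(toy_model)
```

`summary()` on the fitted object produces the coefficient table and fit statistics in the layout shown for the real model.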
options(scipen=999)
theme_set(theme_bw())
ggplot(fight_songs_df_3, aes(bpm, ncaa_won))+
geom_point()
## [1] -0.2099433
## [1] 0.003739684
There is a weak negative linear correlation (r = -0.21) between the explanatory variable bpm and the response variable ncaa_won. There is almost no linear correlation (r = 0.004) between sec_duration and ncaa_won.
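The two correlation values above are Pearson coefficients. A minimal sketch of how such values can be obtained with base R's `cor()`, again on made-up data (`toy_df` is hypothetical):

```r
# Hypothetical toy data with the three variables of interest
toy_df <- data.frame(
  bpm          = c(120, 130, 140, 150, 160),
  sec_duration = c(60, 90, 55, 80, 70),
  ncaa_won     = c(8, 6, 7, 3, 2)
)

# Pearson correlation (the default method) of each predictor with the response
r_bpm      <- cor(toy_df$bpm, toy_df$ncaa_won)
r_duration <- cor(toy_df$sec_duration, toy_df$ncaa_won)

# In a one-predictor regression, the squared correlation equals R-squared
r_squared_bpm <- r_bpm^2

r_bpm
r_duration
r_squared_bpm
```

The sign of the coefficient gives the direction of the linear relationship, and values near zero indicate almost no linear relationship.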
##
## Call:
## lm(formula = ncaa_won ~ bpm, data = fight_songs_df_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.558 -4.594 -1.977 1.317 45.095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.8783 14.7559 1.754 0.0869 .
## bpm -0.1377 0.1002 -1.375 0.1766
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.824 on 41 degrees of freedom
## Multiple R-squared: 0.04408, Adjusted R-squared: 0.02076
## F-statistic: 1.89 on 1 and 41 DF, p-value: 0.1766
The model is treated as meeting the linearity and nearly normal residuals conditions, although the residuals range from -7.558 to 45.095, which points to at least one large outlier. The fit is weak: the multiple \(R^2\) is 0.044 and the adjusted \(R^2\) is 0.021, so bpm explains only about 4% of the variability in ncaa_won.
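The fit statistics in the regression output are tied together, which gives a quick consistency check. A minimal sketch, assuming only the t value for bpm and the residual degrees of freedom copied from that output (the variable names are illustrative):

```r
# Values copied from the regression output
t_bpm <- -1.375   # t value for bpm
df    <- 41       # residual degrees of freedom
n     <- df + 2   # observations used in the fit

# Two-sided p-value for the slope
p_value <- 2 * pt(-abs(t_bpm), df)

# With one predictor, F = t^2 and R^2 = t^2 / (t^2 + df)
f_stat    <- t_bpm^2
r_squared <- f_stat / (f_stat + df)

# Adjusted R^2 penalises for the number of predictors (one here)
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)

round(c(p = p_value, F = f_stat, R2 = r_squared, adjR2 = adj_r_squared), 4)
```

Up to rounding of the t value, these should agree with the p-value, F-statistic, multiple R-squared and adjusted R-squared printed in the output.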
The findings of our study of College Fight Songs are as follows:
The fit of the model is weak: the multiple \(R^2\) is 0.044 (adjusted \(R^2\) 0.021), so bpm explains only about 4% of the variability in ncaa_won.
The p-value for bpm (0.1766) is above the 0.05 significance level, so bpm is not a statistically significant predictor of ncaa_won.
For every additional beat per minute, the model predicts about 0.14 fewer NCAA championships, or roughly 1.4 fewer for every additional 10 bpm.
With simple linear regression, we cannot use the available College Fight Song attributes to predict NCAA championship wins. We fail to reject the hypothesis that College Fight Song attributes are not predictive of NCAA championships.
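As a rough illustration of how the fitted line turns a tempo into a predicted number of championships, here is a minimal sketch using the intercept and bpm coefficient reported in the regression output. The helper `predict_wins` is hypothetical, and the line was fitted only on songs above 110 bpm, so it should not be applied to slower songs:

```r
# Coefficients copied from the regression output
intercept <- 25.8783
slope_bpm <- -0.1377

# Hypothetical helper: predicted championships for a given tempo (bpm > 110)
predict_wins <- function(bpm) intercept + slope_bpm * bpm

# Prediction at 140 bpm
wins_140 <- predict_wins(140)

# Change in the prediction for a 10 bpm increase
change_per_10_bpm <- predict_wins(150) - predict_wins(140)

wins_140
change_per_10_bpm
```

Given the weak fit and the non-significant slope, these predictions are illustrative only.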