The following analysis compares the ages of professional soccer and hockey players. The data includes the ages of all soccer players in the Bundesliga 2015-2016 season and all players in the NHL 2015-2016 season.

Research Question

Do soccer or hockey players tend to keep playing professionally into their veteran years? That is, which sport has a larger share of veteran (older) players? For the purposes of this analysis, a veteran is any player aged 35 or older.

Data Processing

Skip to the Analysis section if this section doesn't interest you.

Bundesliga Data:

library(jsonlite)
library(plyr)
footy<-fromJSON("https://raw.githubusercontent.com/jokecamp/FootballData/master/Germany/bundesliga-2015-2016-rosters.json")

##Read in data from Bundesliga 2015-2016. Convert it from a JSON to a nested list

I wrote the function below to calculate each soccer player’s age from their birthdate and to get the data from the nested list into a data frame (the layers of the nesting are Team >> Player >> Characteristics).

ageForTeam<-function(x) {
  ##flatten the players of team x into an array with one row per player
  team1<-laply(footy[[x]],identity)
  birthdate<-as.character(team1[,1])
  index<-1:nrow(team1)
  team<-data.frame(birthdate, index)
  ##convert to date variable
  team$birthdate<-as.Date(team$birthdate,format="%d.%m.%Y")
  ##calculate age as of the day the script is run, rounded to the nearest year
  date<-Sys.Date()
  transform(team, age=as.numeric(round((date-birthdate)/365)))
}
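The function above measures age against Sys.Date() and rounds to the nearest year, so the result depends on when the script is run. As a minimal sketch of an alternative, the hypothetical helper below (ageOn is not part of the original analysis) counts completed years on a fixed reference date. The reference date of 2016-01-01 is an assumption chosen only for illustration.

```r
## Hypothetical helper: age in completed years on a fixed reference date
ageOn <- function(birthdate, refDate) {
  b <- as.POSIXlt(birthdate)
  r <- as.POSIXlt(refDate)
  age <- r$year - b$year
  ## subtract one year if the birthday has not yet occurred in the reference year
  notYet <- (r$mon < b$mon) | (r$mon == b$mon & r$mday < b$mday)
  age - as.integer(notYet)
}

## Two birthdates in the same day.month.year format as the Bundesliga data
birthdates <- as.Date(c("19.12.1993", "13.02.1995"), format = "%d.%m.%Y")
ages <- ageOn(birthdates, as.Date("2016-01-01"))  ## assumed reference date
ages
```

With a fixed reference date the ages do not drift as the script is re-run later.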

The for loop then applies the function to all 18 teams in the Bundesliga and binds them in a data frame called soccer.

soccer<-data.frame()

for (i in 1:18) {
  soccer<-rbind(soccer,ageForTeam(i))
  
}
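Growing a data frame with rbind inside a loop works fine for 18 teams. A common alternative, sketched here with a toy stand-in for ageForTeam (toyTeam is hypothetical and returns made-up rows), is to build all the pieces with lapply and bind them once.

```r
## Toy stand-in for ageForTeam(): returns a small data frame per team
toyTeam <- function(i) data.frame(index = 1:2, team = i)

## Build one data frame per team, then bind them in a single call
soccerToy <- do.call(rbind, lapply(1:3, toyTeam))
nrow(soccerToy)
```

In the real analysis the same pattern would read do.call(rbind, lapply(1:18, ageForTeam)).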
 

head(soccer) ##first 6 rows of the data
##    birthdate index age
## 1 1993-12-19     1  23
## 2 1995-02-13     2  22
## 3 1994-03-13     3  23
## 4 1990-05-27     4  27
## 5 1993-08-15     5  24
## 6 1993-05-12     6  24

We now have the age of every player in the Bundesliga.

NHL Data:

library(xlsx)
puck <- read.xlsx("/Users/roberttalarico/Desktop/Coursera/NHL Ages.xlsx", 1)  ##Read in Hockey Data
head(puck) ## First 6 rows of data
##    Last.Name        DOB Age
## 1 Abdelkader 1987-02-25  28
## 2    Acciari 1991-12-01  24
## 3   Agostino 1992-04-30  23
## 4   Agozzino 1991-01-03  25
## 5     Alzner 1988-09-24  27
## 6   Anderson 1994-05-07  21

We already have the age of each NHL player, so no further processing is necessary.

Analysis

Since there are more players in the NHL (n=898) than in the Bundesliga (n=530), the raw count of veteran players would be misleading. Thus, we will look at the proportion of veteran players within each league.

##Overall Summary
summary(soccer$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   24.00   27.00   26.93   30.00   40.00
summary(puck$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   23.00   25.00   26.28   29.00   43.00

The median age (27 vs. 25) shows that soccer players are slightly older; the means (26.93 vs. 26.28) are closer.

data1<-data.frame(age=soccer$age, sport="soccer")
data2<-data.frame(age=puck$Age, sport="hockey")
data<-rbind(data1, data2) ## Create one dataset called data

data$agecat<-cut(data$age, seq(16,45,9)) ##Divide into age groups
unique(data$agecat) 
## [1] (16,25] (25,34] (34,43]
## Levels: (16,25] (25,34] (34,43]

The players are divided into 3 age groups: 17-25, 26-34, and 35-43.
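To see how cut assigns the boundary ages, here is a small sketch using a made-up vector of ages (toyAges is illustrative only). The breaks are the same seq(16,45,9) used above; each interval excludes its lower bound and includes its upper bound.

```r
## Made-up ages sitting on the group boundaries
toyAges <- c(17, 25, 26, 34, 35, 43)

## seq(16, 45, 9) gives the breaks 16, 25, 34, 43
groups <- cut(toyAges, seq(16, 45, 9))
data.frame(age = toyAges, agecat = groups)
```

So a 25-year-old falls in (16,25] while a 35-year-old falls in (34,43], the veteran group.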

library(ggplot2)
ggplot(data=data)+
  geom_bar(mapping=aes(x=agecat,y=..prop..,group=1))+
  facet_grid(.~sport)+
  labs(x="Age Group", title="Proportion of Each Age Group by Sport")

##proportions bar plot of age categories for hockey vs soccer

A higher proportion of soccer players are between the ages of 26 and 34. However, a greater proportion of hockey players are aged 35 or older.
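The proportions behind a plot like this can also be read off a table. The sketch below uses a small made-up data frame (toy is illustrative, not the real data) and prop.table with margin=1, so that each sport's row sums to 1.

```r
## Made-up data with the same two columns the plot relies on
toy <- data.frame(
  sport  = c("soccer", "soccer", "soccer", "soccer", "hockey", "hockey"),
  agecat = c("(16,25]", "(25,34]", "(25,34]", "(34,43]", "(16,25]", "(34,43]")
)

## Row-wise proportions: share of each age group within each sport
props <- prop.table(table(toy$sport, toy$agecat), margin = 1)
round(props, 2)
```

Applied to the real data, the call would be prop.table(table(data$sport, data$agecat), margin = 1).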

Next, calculate the proportion of players over 30 and the proportion aged 35 or older for each sport.

##proportion of soccer players over 30 and aged 35+
over30<-soccer$age>30
soccerProportionOver30<-sum(over30)/nrow(soccer)
over35<-soccer$age>=35
soccerProportionOver35<-sum(over35)/nrow(soccer)

##proportion of hockey players over 30 and aged 35+
over30<-puck$Age>30
puckProportionOver30<-sum(over30)/nrow(puck)
over35<-puck$Age>=35
puckProportionOver35<-sum(over35)/nrow(puck)

lst<-list(soccerProportionOver30*100,soccerProportionOver35*100,
     puckProportionOver30*100,puckProportionOver35*100)
names(lst)<-c("Soccer Players Over 30 (%)","Soccer Players Age 35+ (%)",
              "Hockey Players Over 30 (%)", "Hockey Players Age 35+ (%)")

lst
## $`Soccer Players Over 30 (%)`
## [1] 18.2
## 
## $`Soccer Players Age 35+ (%)`
## [1] 3.4
## 
## $`Hockey Players Over 30 (%)`
## [1] 18.59688
## 
## $`Hockey Players Age 35+ (%)`
## [1] 5.345212

Looking at the actual proportions, hockey has an edge of about 2 percentage points over soccer in the share of veterans (5.3% vs. 3.4%). Also, a similar proportion of hockey players and soccer players are over 30 (~18%).
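As a quick check of the size of that gap, the arithmetic can be sketched from the two percentages printed above (the variable names here are illustrative).

```r
## Percentages of players aged 35+ as printed in the output above
hockey35 <- 5.345212
soccer35 <- 3.4

diffPoints <- hockey35 - soccer35  ## gap in percentage points
ratio <- hockey35 / soccer35       ## hockey share relative to soccer share
round(c(difference = diffPoints, ratio = ratio), 2)
```

The gap is small in absolute terms (just under 2 percentage points), but the hockey share is roughly 1.6 times the soccer share.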

Limitations

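Soccer data was from the Bundesliga only. Hockey data was from the NHL only. In addition, the soccer ages were computed from birthdates using Sys.Date() and rounded to the nearest year, while the NHL ages were taken as given in the spreadsheet, so the two age columns may not be measured as of the same date.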

Conclusion

A larger proportion of hockey players play into their veteran years (35+) compared to soccer players.

Note that these are population parameters, since every player in each league was included in the analysis. Also, age is measured without error, so there is no sampling variability or measurement error to quantify.