R Markdown

Project Statement

Analyse the lifespans of British first-class cricketers to understand how the sport affected their longevity, and analyse their causes of death.

Read data file

#install.packages("ggplot2")
require(ggplot2)
## Loading required package: ggplot2
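cricketer_data <- "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/DAAG/cricketer.csv"

# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0.0),
# which the factor counts in the summary output below rely on
cricketer <- read.csv(cricketer_data, header = TRUE, sep = ",", stringsAsFactors = TRUE)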
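
The per-level counts that `summary()` prints for `left` and `cause` only appear when those columns are factors, and since R 4.0.0 `read.csv` reads text columns as character by default. The sketch below shows the effect without a network call. It assumes a three-row stand-in (`csv_text`, a hypothetical name) copied from the `head(cricketer)` output rather than the real file.

```r
# Three rows copied from the head(cricketer) output, as an in-memory CSV
csv_text <- "X,left,year,life,dead,acd,kia,inbed,cause
1,right,1890,102,0,0,0,0,alive
2,left,1892,100,0,0,0,0,alive
3,right,1893,99,0,0,0,0,alive"

# Read the text as if it were a file; text columns become factors
demo <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = TRUE)

# Inspect the column types: left and cause should be factors
str(demo)
```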

View file content and format

head(cricketer)
##   X  left year life dead acd kia inbed cause
## 1 1 right 1890  102    0   0   0     0 alive
## 2 2  left 1892  100    0   0   0     0 alive
## 3 3 right 1893   99    0   0   0     0 alive
## 4 4 right 1894   98    0   0   0     0 alive
## 5 5 right 1896   96    0   0   0     0 alive
## 6 6 right 1896   96    0   0   0     0 alive

Data summary and other counts

summary(cricketer)
##        X           left           year           life       
##  Min.   :   1   left :1101   Min.   :1840   Min.   : 19.00  
##  1st Qu.:1491   right:4859   1st Qu.:1878   1st Qu.: 49.00  
##  Median :2992                Median :1908   Median : 63.00  
##  Mean   :3036                Mean   :1906   Mean   : 61.84  
##  3rd Qu.:4579                3rd Qu.:1935   3rd Qu.: 76.00  
##  Max.   :6172                Max.   :1960   Max.   :102.00  
##       dead             acd               kia              inbed       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :0.5683   Mean   :0.03154   Mean   :0.02013   Mean   :0.5367  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##    cause     
##  acd  : 188  
##  alive:2573  
##  inbed:3199  
##              
##              
## 
table(cricketer$cause, cricketer$kia )
##        
##            0    1
##   acd     68  120
##   alive 2573    0
##   inbed 3199    0
table(cricketer$cause)
## 
##   acd alive inbed 
##   188  2573  3199
table(cricketer$kia )
## 
##    0    1 
## 5840  120
table(cricketer$kia, cricketer$cause)
##    
##      acd alive inbed
##   0   68  2573  3199
##   1  120     0     0
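
The raw counts are easier to compare as proportions. As a minimal sketch, the matrix below re-enters the counts from the last cross-tabulation by hand (so it runs without the data file) and uses `prop.table` to get the share of each cause that has `kia = 1`. The names `counts`, `col_share` and `kia_share_of_acd` are illustrative only.

```r
# Counts copied from table(cricketer$kia, cricketer$cause) above
# (rows: kia flag, columns: cause); matrix() fills column by column
counts <- matrix(
  c(68, 120,    # acd
    2573, 0,    # alive
    3199, 0),   # inbed
  nrow = 2,
  dimnames = list(kia = c("0", "1"), cause = c("acd", "alive", "inbed"))
)

# Proportions within each cause (each column sums to 1)
col_share <- prop.table(counts, margin = 2)
round(col_share, 3)

# Share of accidental deaths flagged kia = 1, i.e. 120 / 188
kia_share_of_acd <- col_share["1", "acd"]
```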

Extract subset of columns and rename

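# Keep the columns of interest under friendlier names.
# Note: in the DAAG documentation, kia flags players killed in action;
# it is stored here under the name "onField".
cricketer1 <- data.frame("handedness"=cricketer$left, "year"=cricketer$year, "life"=cricketer$life, "dead"=cricketer$dead, "onField"=cricketer$kia, "cause"=cricketer$cause)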


head(cricketer1)
##   handedness year life dead onField cause
## 1      right 1890  102    0       0 alive
## 2       left 1892  100    0       0 alive
## 3      right 1893   99    0       0 alive
## 4      right 1894   98    0       0 alive
## 5      right 1896   96    0       0 alive
## 6      right 1896   96    0       0 alive
summary(cricketer1)
##  handedness        year           life             dead       
##  left :1101   Min.   :1840   Min.   : 19.00   Min.   :0.0000  
##  right:4859   1st Qu.:1878   1st Qu.: 49.00   1st Qu.:0.0000  
##               Median :1908   Median : 63.00   Median :1.0000  
##               Mean   :1906   Mean   : 61.84   Mean   :0.5683  
##               3rd Qu.:1935   3rd Qu.: 76.00   3rd Qu.:1.0000  
##               Max.   :1960   Max.   :102.00   Max.   :1.0000  
##     onField          cause     
##  Min.   :0.00000   acd  : 188  
##  1st Qu.:0.00000   alive:2573  
##  Median :0.00000   inbed:3199  
##  Mean   :0.02013               
##  3rd Qu.:0.00000               
##  Max.   :1.00000
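
The same subset-and-rename step can also be written by selecting columns by name and then renaming in place. This is only a sketch of an alternative, run on a three-row stand-in (`cricketer_demo`, a hypothetical name built from the `head()` output), not a replacement for the code above.

```r
# Stand-in for the first three rows shown by head(cricketer)
cricketer_demo <- data.frame(
  X = 1:3,
  left = c("right", "left", "right"),
  year = c(1890, 1892, 1893),
  life = c(102, 100, 99),
  dead = 0, acd = 0, kia = 0, inbed = 0,
  cause = "alive"
)

# Select the columns to keep, in the desired order
keep <- c("left", "year", "life", "dead", "kia", "cause")
demo1 <- cricketer_demo[, keep]

# Rename two columns by matching on their old names
names(demo1)[names(demo1) == "left"] <- "handedness"
names(demo1)[names(demo1) == "kia"] <- "onField"

names(demo1)
```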

Analyse alive vs dead

Here “1” means dead and “0” means alive
liveVsdead <- table(cricketer1$dead, cricketer1$cause)

barplot(liveVsdead, main = "Alive vs Dead", xlab = "Status as of 1992", col=c("green","red"), legend.text = c("Alive", "Dead"))
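
Since the 0/1 coding has to be explained in words, one option is to convert the flag to a labelled factor so that tables and legends carry the meaning themselves. A minimal sketch, using a made-up vector `dead_flag` (hypothetical values, not taken from the data):

```r
# Hypothetical 0/1 flags standing in for cricketer1$dead
dead_flag <- c(0, 0, 1, 1, 1)

# Map 0 -> "Alive" and 1 -> "Dead"
status <- factor(dead_flag, levels = c(0, 1), labels = c("Alive", "Dead"))

# Tables built from the factor now show readable labels
table(status)
```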

Analyse the subset of data focused on causes of death

Box plot showing the spread of age at death
deathstats <- cricketer1[which(cricketer1$dead > 0), ]

boxplot(deathstats$life, ylab = "Age at death")

Scatter plot of birth year vs the age at which a player died
Most deaths occurred at older ages, with fewer at younger ages. Because the data stop in 1992, players born in later years can appear here only if they died young.
plot(deathstats$year, deathstats$life, main = "Age at death vs birth year",
    xlab = "Birth year", ylab = "Age at death", pch = 19)
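
When reading this scatter plot, it helps to keep in mind that status is recorded as of 1992, so the largest age at death that can be observed depends on birth year. The sketch below works that ceiling out for a few birth years within the range of the data (1840 to 1960). The variable names are illustrative.

```r
# A few birth years spanning the range reported by summary()
birth_year <- c(1840, 1900, 1940, 1960)

# A player born in a given year cannot have died older than this by 1992
max_observable_age <- 1992 - birth_year

data.frame(birth_year, max_observable_age)
```

For example, a player born in 1960 can only appear among the deaths if he died at 32 or younger, so the lower-right region of the plot is empty by construction.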

Bar plot comparing deaths “in bed”, which indicates a natural death, with accidental deaths. Accidental deaths (188) were far fewer than deaths in bed (3199).
barplot(table(droplevels(as.factor(deathstats$cause))), main = "Cause of death", xlab = "Cause", col = c("red", "blue"))

Histogram of age at death, showing that most of the players who died were in their 60s or older.
ggplot(deathstats, aes(x = life)) + geom_histogram() + xlab("Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
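
The message above suggests choosing a bin width explicitly. One way to make a claim like "most were 60 or older" checkable is to bin ages by decade and compute the share directly. A minimal sketch on made-up ages (`ages` is hypothetical, not drawn from the data):

```r
# Hypothetical ages at death, for illustration only
ages <- c(23, 47, 61, 64, 68, 72, 75, 79, 83, 91)

# Ten-year bins: [10,20), [20,30), ..., [100,110)
decade <- cut(ages, breaks = seq(10, 110, by = 10), right = FALSE)
table(decade)

# Share of these ages that are 60 or above
share_60_plus <- mean(ages >= 60)
```

With the real data, `ages` would be `deathstats$life`.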

Zoom in on cause of accidental deaths

Bar plot showing that more than 50% of accidental deaths (120 of 188, about 64%) were flagged kia (killed in action, the column named onField above).
accidentstat <- cricketer[which(cricketer$acd > 0), ]

barplot(table(accidentstat$kia), main = "Accidental deaths", xlab = "Type of accidental death", col = c("red", "green"), names.arg = c("Other", "Killed in action"))
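
Bar labels passed through `names.arg` are matched to bars by position, and `table()` orders a 0/1 flag as 0 first, then 1. A sketch of a safer pattern is to look the labels up by the table's own names. The flag vector below is rebuilt from the counts in the cross-tabulation (68 and 120), and `labels` and `bar_labels` are illustrative names.

```r
# Rebuild the kia flag for accidental deaths from the tabulated counts
kia_flag <- c(rep(0, 68), rep(1, 120))
counts <- table(kia_flag)

# The table is ordered "0", "1"
names(counts)

# Look labels up by name so they cannot drift out of order
labels <- c("0" = "Other accident", "1" = "Killed in action")
bar_labels <- labels[names(counts)]
unname(bar_labels)
```

`bar_labels` can then be supplied as `names.arg` in `barplot(counts, names.arg = bar_labels)`.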

Overall histogram comparison of lifespans for all players, players who died, and players who died accidentally

hist(cricketer$life, main = "Lifespan (or age in 1992) of all players", xlab = "Age")

hist(deathstats$life, main = "Lifespan of players who died", xlab = "Age")

hist(accidentstat$life, main = "Lifespan of players who died accidentally", xlab = "Age")

Conclusion

From the histograms above, most of the players lived a long life, and the lifespans of the players who died are skewed towards older ages. However, the players who died accidentally were largely under the age of 50.
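
A numeric summary by group can back up a visual comparison like this one. The sketch below shows the pattern with `tapply` on a small made-up data frame. The values in `demo` are invented for illustration and are not results from the cricketer data.

```r
# Invented example values, NOT taken from the cricketer data
demo <- data.frame(
  life  = c(24, 31, 45, 66, 72, 80, 58, 63, 70),
  cause = c("acd", "acd", "acd",
            "inbed", "inbed", "inbed",
            "alive", "alive", "alive")
)

# Median age within each cause group
median_by_cause <- tapply(demo$life, demo$cause, median)
median_by_cause
```

On the real data the equivalent call would be `tapply(cricketer1$life, cricketer1$cause, median)`.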