library(moments)
library(cricketdata)
# Drop missing (NA/NaN) and infinite values from a vector
remove_missing_infinite <- function(x) {
  x <- as.numeric(x)
  # Keep only finite values: this drops NA, NaN, Inf and -Inf
  x[is.finite(x)]
}
Skewness
Skewness is a statistical measure of how symmetrical (or lopsided) a distribution of data is.
In this tutorial, we will look at skewness, and how to calculate it in R, using cricket player data as an example.
Following along
If you don’t already have R installed, you will need to install it first, along with RStudio if you would like to follow along in the same environment.
In these examples, the cricketdata package will be used to download cricket match and player data from ESPNCricinfo and Cricsheet.
You can install this package using either of the following two methods. Run the code (and the other code in this tutorial) in your R console, which is the bottom-left panel in RStudio:
Installation Method 1
install.packages("cricketdata", dependencies = TRUE)
Installation Method 2
# install.packages("devtools")
devtools::install_github("robjhyndman/cricketdata")
Load Libraries
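Then load the required libraries by running library(moments) and library(cricketdata) in your R console, followed by the remove_missing_infinite() helper function and the other lines of code in this tutorial. If the moments package is not yet installed, you can install it with install.packages("moments").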
Types of skewness
There are three types of skewness: none (zero), positive, and negative skewness.
As well as visually inspecting the distribution of data, the difference between the mean (average) and median (middle number) can also indicate skewness.
To calculate skewness in R, we can use the skewness() function from the moments package (make sure that you have loaded it by running library(moments)).
The mean and median can be calculated with the base R functions mean() and median().
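For readers curious about what a skewness function computes, the sketch below works through the moment-based definition by hand in base R: the third central moment divided by the second central moment raised to the power 3/2. It assumes the population (divide-by-n) form, which is our understanding of what the moments package uses; manual_skewness is a hypothetical helper name for illustration, not part of any package, and the input values are made up.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2),
# using population moments (dividing by n rather than n - 1)
manual_skewness <- function(x) {
  x <- x[is.finite(x)]            # ignore missing and infinite values
  deviations <- x - mean(x)       # distance of each value from the mean
  m2 <- mean(deviations^2)        # second central moment (population variance)
  m3 <- mean(deviations^3)        # third central moment
  m3 / m2^(3 / 2)
}

# Made-up values with one large value in the right tail
example_values <- c(1, 2, 3, 4, 10)
manual_skewness(example_values)   # positive, because the long tail is on the right
```

Cubing the deviations keeps their sign, so a few large values above the mean push the result positive, while a few large values below the mean push it negative.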
Zero skewness
A distribution with no skewness is symmetrical, meaning that the left and right sides of the distribution are approximately mirror images of each other.
For example, a bell-shaped normal distribution has essentially zero skewness.
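To see this in practice, the following sketch simulates normally distributed data and plots it in the same style as the other histograms in this tutorial. The data are simulated rather than taken from cricket, and the seed, sample size, mean, and standard deviation are arbitrary choices made for illustration.

```r
library(moments)  # provides skewness()

set.seed(42)                            # arbitrary seed, for reproducibility only
z <- rnorm(10000, mean = 50, sd = 10)   # simulated values, not cricket data

hist(z, main = "Histogram of simulated normal data", xlab = "")
abline(v = mean(z), col = "red", lwd = 2)
abline(v = median(z), col = "blue", lwd = 2)
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lty = 1, lwd = 2)

print(paste("Mean:", mean(z)))
print(paste("Median:", median(z)))
print(paste("Skewness:", skewness(z)))
```

With a sample this large, the mean and median lines should nearly overlap and the skewness should be close to zero, although it will rarely be exactly zero for sampled data.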
Positive skewness
If there is positive skewness, the right-hand tail of the distribution is longer than the left-hand tail, meaning there are more extreme values on the right side of the distribution than on the left side.
In other words, the values are more concentrated on the left side of the distribution.
This is a common feature of player-related data in sports, where there are often a small number of truly exceptional players in a particular facet of play.
In the case of positive skewness, the mean value is typically greater than the median.
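A small made-up example shows why a long right tail pulls the mean above the median. The values below are invented for illustration and are not cricket data; toy_values is a hypothetical variable name.

```r
# Made-up values: most are small, with one large value in the right tail
toy_values <- c(1, 2, 2, 3, 3, 3, 4, 20)

mean(toy_values)    # 4.75: pulled upwards by the single large value
median(toy_values)  # 3: unaffected by how extreme the largest value is
mean(toy_values) > median(toy_values)  # TRUE
```

The median depends only on the order of the values, so replacing 20 with 200 would change the mean but leave the median at 3.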
For example, consider the distribution of ODI bowling averages of Australian men's bowlers:
# Fetch data
df <- fetch_cricinfo("ODI", "men", "bowling", country = "australia")
x <- df$Average
x <- remove_missing_infinite(x)
# Plot the histogram with the mean (red) and median (blue) marked
hist(x, main = "Histogram of Australian Men ODI Bowling Averages", xlab = "")
abline(v = mean(x), col = "red", lwd = 2)
abline(v = median(x), col = "blue", lwd = 2)
# Adding a legend (lty = 1 matches the solid lines drawn by abline)
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lty = 1, lwd = 2)
print(paste("Mean:", mean(x)))
[1] "Mean: 37.993105753619"
print(paste("Median:", median(x)))
[1] "Median: 31.25"
print(paste("Skewness:", skewness(x)))
[1] "Skewness: 4.60126754229636"
The skewness was calculated to be 4.601, indicating positive skewness, and this is reflected in the mean (37.99) being greater than the median value (31.25).
Negative skewness
If there is negative skewness, the left-hand tail of the distribution is longer than the right-hand tail, meaning there are more extreme values on the left side of the distribution than on the right side.
In other words, the values are more concentrated on the right side of the distribution.
In the case of negative skewness, the mean value is typically less than the median.
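Negative skewness is the mirror image of positive skewness, which the following sketch illustrates by reflecting a made-up right-tailed sample so that its long tail points left. The values are invented for illustration, and right_tailed and left_tailed are hypothetical variable names.

```r
library(moments)  # provides skewness()

# Made-up values with a long right tail
right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)
# Reflect the values so that the long tail is now on the left
left_tailed <- 21 - right_tailed

print(paste("Mean:", mean(left_tailed)))
print(paste("Median:", median(left_tailed)))
print(paste("Skewness (right-tailed):", skewness(right_tailed)))
print(paste("Skewness (left-tailed):", skewness(left_tailed)))
```

The two skewness values should have the same size but opposite signs, and the mean of the reflected sample falls below its median.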
For example, consider the distribution of the years in which New Zealand men's ODI bowlers retired (excluding 2023, so that players whose careers had not yet ended are left out):
# Fetch data
df <- fetch_cricinfo("ODI", "men", "bowling", country = "new zealand")
# Keep only careers that ended before 2023
df <- df[df$End < 2023, ]
x <- df$End
x <- remove_missing_infinite(x)
# Plot the histogram with the mean (red) and median (blue) marked
hist(x, main = "Histogram of NZ Men ODI bowler years of retirement", xlab = "")
abline(v = mean(x), col = "red", lwd = 2)
abline(v = median(x), col = "blue", lwd = 2)
# Adding a legend (lty = 1 matches the solid lines drawn by abline)
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lty = 1, lwd = 2)
print(paste("Mean:", mean(x)))
[1] "Mean: 1999.31182795699"
print(paste("Median:", median(x)))
[1] "Median: 2000"
print(paste("Skewness:", skewness(x)))
[1] "Skewness: -0.190567244136317"
The skewness was calculated to be -0.1906, indicating very slight negative skewness, and this is reflected in the mean (1999.3) being slightly less than the median value (2000).