Descriptive statistics are numerical values that briefly summarize a data set, which may represent either an entire population or a sample drawn from it. They consist of measures of central tendency and measures of statistical dispersion.
The mean, median and mode should be among the first things you look at when examining a data set. The mean is the average of all of the observations. The median is the middle observation when the observations are arranged in order (or the average of the two middle observations when there is an even number of them). The mode is the value that occurs most often. We often do not report the mode if there are more than two modes. The examples that follow use the mpg variable of R's built-in mtcars data set.
dat <- mtcars
head(dat)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
To find the mean and median, we use the following code.
mean(dat$mpg)
## [1] 20.09062
median(dat$mpg)
## [1] 19.2
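As a quick check on the definitions above, the following sketch recomputes both values by hand. The variable names (mpg, manual_mean, manual_median) are illustrative and are not part of the original code.

```r
# mpg values from R's built-in mtcars data set
mpg <- mtcars$mpg
n <- length(mpg)

# mean: the sum of the observations divided by their count
manual_mean <- sum(mpg) / n

# median: the middle value of the sorted data, or the average of the
# two middle values when n is even
sorted <- sort(mpg)
manual_median <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  (sorted[n / 2] + sorted[n / 2 + 1]) / 2
}

all.equal(manual_mean, mean(mpg))      # TRUE
all.equal(manual_median, median(mpg))  # TRUE
```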
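Unfortunately, R has no built-in function for the statistical mode of a variable (the base function mode() reports an object's storage type instead). We can, however, create one. It is given its own name here so that it does not mask base R's mode(). When several values tie for most frequent, it returns the one that appears first in the data.
stat_mode <- function(v) {
  # distinct values, in order of first appearance
  unique_values <- unique(v)
  # count occurrences of each distinct value and return the most frequent one
  unique_values[which.max(tabulate(match(v, unique_values)))]
}
stat_mode(dat$mpg)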
## [1] 21
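The function above returns a single value even when several values are tied for most frequent. A minimal sketch of a variant that reports every tied value is shown here; the name all_modes is hypothetical and chosen only for illustration.

```r
# Hypothetical helper: return every value tied for the highest frequency
all_modes <- function(v) {
  unique_values <- unique(v)
  counts <- tabulate(match(v, unique_values))
  unique_values[counts == max(counts)]
}

all_modes(c(1, 1, 2, 2, 3))   # returns 1 2
all_modes(mtcars$mpg)         # every mpg value tied for most frequent
```

If the second call returns more than two values, that is a sign the mode is not a useful summary for this variable.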
Measures of central tendency give a glimpse of typical values. Be careful: they do not tell you anything about the variability of the values. For that, there are many measures of dispersion. Each one gives you an idea of how spread out the data are.
The range is the difference between the largest and smallest values in the data set. Note that R's range() function returns the minimum and maximum rather than their difference.
min(dat$mpg)
## [1] 10.4
max(dat$mpg)
## [1] 33.9
range(dat$mpg)
## [1] 10.4 33.9
max(dat$mpg) - min(dat$mpg)
## [1] 23.5
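The same difference can be obtained in one step. This is a small sketch that assumes the mpg values are taken directly from mtcars; the variable name mpg_range is illustrative.

```r
mpg <- mtcars$mpg

# range() returns c(min, max); diff() subtracts the first from the second
mpg_range <- diff(range(mpg))
mpg_range  # 23.5
```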
We can find the threshold below which a certain percentage of the data falls. These thresholds are known as percentiles. For any percentage \(p\), the \(p\)th percentile is the value such that approximately \(p\) percent of all values are less than or equal to it. The 25th, 50th and 75th percentiles are called the first, second and third quartiles, because they divide the ordered data into four groups of roughly equal size. Note that the second quartile is equal to the median by definition, and that the 0th and 100th percentiles are the minimum and maximum of the data. These measures are easily computed.
quantile(dat$mpg, 0.25)
## 25%
## 15.425
quantile(dat$mpg, 0.5)
## 50%
## 19.2
quantile(dat$mpg, 0.75)
## 75%
## 22.8
quantile(dat$mpg, 1)
## 100%
## 33.9
quantile(dat$mpg)
##     0%    25%    50%    75%   100%
## 10.400 15.425 19.200 22.800 33.900
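To connect these numbers back to the definition of a percentile, the sketch below requests two other percentiles in a single call and checks what share of the observations lies at or below the 25th percentile. The variable names are illustrative.

```r
mpg <- mtcars$mpg

# several percentiles at once: pass a vector of probabilities
tails <- quantile(mpg, probs = c(0.10, 0.90))

# share of observations at or below the 25th percentile
q25 <- quantile(mpg, 0.25)
share_below <- mean(mpg <= q25)
share_below  # 0.25
```

With only 32 observations the share will not match the requested percentage exactly for every choice of \(p\); here it happens to.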
The distance between the 75th and 25th percentiles (the third and first quartiles) is called the interquartile range (IQR).
IQR(dat$mpg)
## [1] 7.375
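As a check, the IQR can be rebuilt from the two quartiles. This sketch assumes the default quantile method, which is also what IQR() uses by default; manual_iqr is an illustrative name.

```r
mpg <- mtcars$mpg
q <- quantile(mpg, c(0.25, 0.75))

# unname() drops the "75%" label that quantile() attaches to the result
manual_iqr <- unname(q[2] - q[1])

all.equal(manual_iqr, IQR(mpg))  # TRUE
```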
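It should be noted that there are several ways to compute percentiles, which differ in how they interpolate between neighboring observations. R's quantile() function offers nine of them through its type argument (the default is type 7). All of these procedures give similar answers.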
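To see how small the differences are in practice, the following sketch computes the 25th percentile of mpg under each of the nine methods available through the type argument of quantile(). The name q25_by_type is illustrative.

```r
mpg <- mtcars$mpg

# 25th percentile under each of quantile()'s nine algorithms
q25_by_type <- sapply(1:9, function(t) {
  quantile(mpg, 0.25, type = t, names = FALSE)
})
names(q25_by_type) <- paste0("type", 1:9)

round(q25_by_type, 3)
```

For mpg, every result lies between the two neighboring observations 15.2 and 15.5, and the default (type 7) gives 15.425.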
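A 5-number summary is a quick numerical representation of the data. It contains the minimum, 1st quartile, median, 3rd quartile and maximum values. R's fivenum() function reports Tukey's hinges for the quartiles, which is why its second value (15.35) differs slightly from the 25th percentile computed above (15.425).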
fivenum(dat$mpg)
## [1] 10.40 15.35 19.20 22.80 33.90
The summary command also includes the mean (technically making it a 6-number summary).
summary(dat$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.40   15.43   19.20   20.09   22.80   33.90
Each of these values can also be computed individually.
min(dat$mpg)
## [1] 10.4
quantile(dat$mpg, 0.25)
## 25%
## 15.425
median(dat$mpg)
## [1] 19.2
mean(dat$mpg)
## [1] 20.09062
quantile(dat$mpg, 0.75)
## 75%
## 22.8
max(dat$mpg)
## [1] 33.9
The range provides a crude measure of variability, and percentiles describe how the data are divided. The most common measures of variability are the variance and the standard deviation. The variance is expressed in squared units of the data, whereas the standard deviation expresses variability in the same units as the data.
var(dat$mpg)
## [1] 36.3241
sd(dat$mpg)
## [1] 6.026948
To compute the variance and standard deviation, we use the formula \[s^2 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1},\] where \(x_i\) is the \(i^{th}\) observation (of \(n\) total) and \(\bar{x}\) is the mean of the observations. The standard deviation is simply the positive square root of the variance, \[s = \sqrt{s^2}.\] Standard deviation is often preferred to variance since it is in the same units as the data. Alternatively, the coefficient of variation expresses the standard deviation as a percentage of the mean of the data, \[\text{coefficient of variation} = \dfrac{s}{\bar{x}} \times 100\%.\] The coefficient of variation provides a context to the size of the standard deviation relative to the data.
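The formulas above can be checked against R's built-in functions. The sketch below applies them term by term to mpg; the names x_bar, s2, s and cv mirror the symbols in the formulas and are otherwise illustrative.

```r
mpg <- mtcars$mpg
n <- length(mpg)
x_bar <- mean(mpg)

# sample variance: sum of squared deviations divided by n - 1
s2 <- sum((mpg - x_bar)^2) / (n - 1)

# standard deviation: positive square root of the variance
s <- sqrt(s2)

# coefficient of variation: standard deviation as a percentage of the mean
cv <- s / x_bar * 100

all.equal(s2, var(mpg))  # TRUE
all.equal(s, sd(mpg))    # TRUE
cv
```

For mpg the coefficient of variation comes out at roughly 30 percent, meaning the standard deviation is a little under a third of the mean.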
An outlier is a data point that is unexpectedly large or small relative to the rest of the data. These points may be correctly part of the data; however, often they are not. It is important to take time to examine each of these points for accuracy and appropriateness of inclusion. If an error has been made in the data collection, corrective action should be taken. This may include rejecting the data value entirely.
An outlier can often be identified using \(z\)-scores. The \(z\)-score of an observation \(x_i\) is calculated using the formula \[z_i = \dfrac{x_i - \bar{x}}{s}.\] If a data point has a positive \(z\)-score, it is larger than the mean. If a data point has a negative \(z\)-score, it is smaller than the mean. As a rule of thumb, any data point with a \(z\)-score that is larger than 3 or smaller than -3 should be investigated as a potential outlier.
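A minimal sketch of this screening rule, applied to mpg, is given below. The cutoff of 3 follows the rule of thumb above, and the variable names z and flagged are illustrative.

```r
mpg <- mtcars$mpg

# z-score of every observation, labelled with the car names
z <- (mpg - mean(mpg)) / sd(mpg)
names(z) <- rownames(mtcars)

# smallest and largest z-scores
range(z)

# observations more than 3 standard deviations from the mean
flagged <- z[abs(z) > 3]
flagged
```

For mpg, no observation is flagged: the largest \(z\)-score is about 2.3, so flagged is empty.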