Descriptive Statistics for Quantitative Data

Descriptive statistics are numerical values that summarize a data set, which may cover an entire population or only a sample of it. They consist of measures of central tendency and measures of statistical dispersion.

Central Tendency

The mean, median and mode should be among the first things you look at when examining a data set. The mean is the arithmetic average of all of the observations. The median is the middle number of the observations when they are arranged in order (or the average of the two middle numbers when the number of observations is even). The mode is the value that shows up most often. We often do not report the mode if there are more than two modes.

dat <- mtcars
head(dat)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

To find the mean and median, we use the following code.

mean(dat$mpg)

## [1] 20.09062

median(dat$mpg)

## [1] 19.2

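Unfortunately, base R has no function for the statistical mode of a variable (the built-in mode() function reports an object's storage mode instead). We can, however, create one. We name it stat_mode so that it does not mask the built-in function.

stat_mode <- function(v) {
        # Distinct values, in order of first appearance
        unique_values <- unique(v)
        # Count each distinct value and return the first one with the highest count
        unique_values[which.max(tabulate(match(v, unique_values)))]
}

stat_mode(dat$mpg)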

## [1] 21
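
The function above returns only one value, the first one to reach the highest count, even when several values tie. A minimal sketch of a variant that returns every tied value is shown here; the name all_modes is our own and is not part of base R.

```r
# Hypothetical helper (not part of base R): return every value that
# attains the highest frequency, in order of first appearance.
all_modes <- function(v) {
  unique_values <- unique(v)
  counts <- tabulate(match(v, unique_values))
  unique_values[counts == max(counts)]
}

all_modes(mtcars$mpg)
```

If this returns more than two values, the data have no single clear mode, which is the situation in which the mode is usually not reported.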

Statistical Dispersion

Measures of central tendency give a glimpse of typical values, but be careful: they provide no information about the variability of the values. There are many measures of dispersion, and each one gives you an idea of how the data are spread out.

Range

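The range is the difference between the largest and smallest values in the data set. Note that R's range() function returns the minimum and maximum themselves rather than their difference.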

min(dat$mpg)

## [1] 10.4

max(dat$mpg)

## [1] 33.9

range(dat$mpg)

## [1] 10.4 33.9

max(dat$mpg) - min(dat$mpg)

## [1] 23.5
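
The same subtraction can be written more compactly by combining range() with diff(). This is a small convenience rather than a different method, and the variable name mpg_range_width is only illustrative.

```r
# Width of the range: the maximum minus the minimum
mpg_range_width <- diff(range(mtcars$mpg))
mpg_range_width
```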

Percentiles

We can find the threshold below which a certain percentage of the data lies. These thresholds are known as percentiles. For any percentage \(p\), the \(p\)th percentile is the value such that approximately \(p\) percent of all values fall at or below it. The 25th, 50th and 75th percentiles are called the first, second and third quartiles, because they divide the data into four groups. Note that the second quartile is equal to the median by definition, and the 100th percentile is the maximum of the data. These measures are easily computed.

quantile(dat$mpg, 0.25)

##    25% 
## 15.425

quantile(dat$mpg, 0.5)

##  50% 
## 19.2

quantile(dat$mpg, 0.75)

##  75% 
## 22.8

quantile(dat$mpg, 1)

## 100% 
## 33.9

quantile(dat$mpg)

##     0%    25%    50%    75%   100% 
## 10.400 15.425 19.200 22.800 33.900

The distance between the 75th and 25th percentiles (the third and first quartiles) is called the interquartile range (IQR).

IQR(dat$mpg)

## [1] 7.375
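
As a quick check of the definition, the IQR can also be computed by subtracting the two quartiles directly. This sketch assumes the default settings of quantile(); the names q and iqr_by_hand are illustrative.

```r
# First and third quartiles, returned without the "25%" / "75%" labels
q <- quantile(mtcars$mpg, c(0.25, 0.75), names = FALSE)

# IQR as the third quartile minus the first quartile
iqr_by_hand <- q[2] - q[1]
iqr_by_hand
```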

It should be noted that there are several ways to compute percentiles. These procedures generally give similar, though not always identical, answers.
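
In R, the alternative procedures are selected with the type argument of quantile(), which accepts the integers 1 through 9 and defaults to 7. The sketch below computes the first quartile of mpg under each type so the results can be compared side by side; the variable names are illustrative.

```r
x <- mtcars$mpg

# First quartile under each of the nine quantile algorithms
first_quartile_by_type <- sapply(1:9, function(k) {
  quantile(x, 0.25, type = k, names = FALSE)
})
names(first_quartile_by_type) <- paste0("type", 1:9)
first_quartile_by_type
```

With 32 observations, every one of these estimates lies between the 8th and 9th ordered values, so the differences are small. They tend to matter most for small samples.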

The 5-Number Summary

A 5-number summary is a quick numerical representation of the data. It contains the minimum, 1st quartile, median, 3rd quartile and maximum values.

fivenum(dat$mpg)

## [1] 10.40 15.35 19.20 22.80 33.90
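
The output of fivenum() is unlabeled, so it can help to attach names. Note also that fivenum() reports Tukey's hinges rather than the default quantile() quartiles, which is why its second value (15.35) differs slightly from the 15.425 shown earlier. The labels below are our own choice.

```r
# Attach descriptive labels to the five values returned by fivenum()
five <- setNames(
  fivenum(mtcars$mpg),
  c("min", "lower_hinge", "median", "upper_hinge", "max")
)
five
```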

The summary command also includes the mean (technically making it a 6-number summary).

summary(dat$mpg)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.43   19.20   20.09   22.80   33.90

Each of these values can also be computed on its own.

min(dat$mpg)

## [1] 10.4

quantile(dat$mpg, 0.25)

##    25% 
## 15.425

mean(dat$mpg)

## [1] 20.09062

quantile(dat$mpg, 0.75)

##  75% 
## 22.8

max(dat$mpg)

## [1] 33.9

Variance

The range provides a crude measure of variability, and percentiles describe how the data are divided. The most common measures of variability are the variance and the standard deviation. The variance is roughly the average squared deviation from the mean, and the standard deviation expresses that variability in the same units as the data.

var(dat$mpg)

## [1] 36.3241

sd(dat$mpg)

## [1] 6.026948

To compute the variance and standard deviation, we use the formula \[s^2 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1},\] where \(x_i\) is the \(i^{th}\) observation (of \(n\) total) and \(\bar{x}\) is the mean of the observations. The standard deviation is simply the positive square root of the variance, \[s = \sqrt{s^2}.\] Standard deviation is often preferred to variance since it is in the same units as the data. Alternatively, the coefficient of variation expresses the standard deviation as a percentage of the mean of the data, \[\text{coefficient of variation} = \dfrac{s}{\bar{x}} \times 100\%.\] The coefficient of variation provides a context to the size of the standard deviation relative to the data.
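
The formulas above can be checked against R's built-in functions by computing each quantity step by step. This is only an illustrative sketch; the variable names are our own.

```r
x <- mtcars$mpg
n <- length(x)
x_bar <- mean(x)

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum((x - x_bar)^2) / (n - 1)

# Standard deviation: positive square root of the variance
s <- sqrt(s2)

# Coefficient of variation, as a percentage of the mean
cv <- s / x_bar * 100

c(variance = s2, sd = s, cv_percent = cv)
```

The first two values should agree with var(dat$mpg) and sd(dat$mpg). For mpg, the coefficient of variation comes out at roughly 30%, meaning the standard deviation is about three tenths of the mean.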

Outliers

An outlier is a data point that is unexpectedly large or small relative to the rest of the data. These points may be correctly part of the data; however, often they are not. It is important to take time to examine each of these points for accuracy and appropriateness of inclusion. If an error has been made in the data collection, corrective action should be taken. This may include rejecting the data value entirely.

An outlier can often be identified using \(z\)-scores. The \(z\)-score of an observation \(x_i\) is calculated using the formula \[z = \dfrac{x_i - \bar{x}}{s}.\] If a data point has a positive \(z\)-score, it is larger than average. If a data point has a negative \(z\)-score, it is smaller than average. Any data point with a \(z\)-score that is larger than 3 or smaller than -3 should be investigated as a potential outlier.
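
A minimal sketch of this screening rule applied to mpg follows. The cutoff of 3 comes from the text above; the variable names z and flagged are illustrative.

```r
x <- mtcars$mpg

# z-score of each observation, labelled with the car's name
z <- (x - mean(x)) / sd(x)
names(z) <- rownames(mtcars)

# Observations more than 3 standard deviations from the mean
flagged <- z[abs(z) > 3]
flagged
```

For mpg, no car is flagged: the largest value, 33.9, has a \(z\)-score of about 2.29, well inside the cutoff.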

OC Data Science