Central Tendency

Mean

The population mean \((\mu)\):

\[\mu = \frac{\sum_{i=1}^N X_i}{N}\]

The sample mean \((\bar{X})\):

\[\bar{X} = \frac{\sum_{i=1}^n X_i}{n}\]

> market_returns <- c(0.12,0.25,0.34,0.15,0.19,0.44,
+                     0.54,0.33,0.22,0.28,0.17,0.24)
> 
> cat("population mean ",mean(market_returns),"\n")
> 
> # random sample of 5
> set.seed(1)
> cat("sample mean ",mean(sample(market_returns,5)))

population mean  0.2725 
sample mean  0.256

The Geometric Mean:

\[G = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}\]

The \(n\)th root has a real solution only if the product is non-negative, and returns can be negative. To work around this, you can add 1.0 to each value under the radical and then subtract 1.0 from the result.
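
Written out for a series of periodic returns, the adjustment described above looks like this. The symbols \(R_i\) (the return in period \(i\)) and \(R_G\) (the geometric mean return) are notation introduced here for illustration, and the formula assumes no single return is below \(-100\%\):

\[R_G = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1\]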

Since there’s no function for the geometric mean in base R, I created one.

> stock_returns <-  c(-.0934,0.2345,0.0892)
> 
> cat("arithmetic mean ",mean(stock_returns),"\n")
> 
> # created a new function
> gm_mean <-  function(x, na.rm=TRUE){
+   
+   # drop NAs first so the root uses the count of values actually multiplied
+   if (na.rm) x <- x[!is.na(x)]
+   prod(x+1)^(1/length(x))-1
+   
+ }
> 
> cat("geometric mean", gm_mean(stock_returns))

arithmetic mean  0.07676667 
geometric mean 0.0682465

Geometric means are used to calculate the compound rate of return over multiple periods. The geometric mean is always less than or equal to the arithmetic mean, and the two are equal only when every observation is identical. The gap between them widens as the variability of the observations increases.

> #Return year 1 = 100%
> #Return year 2 = -50%
> 
> stock_returns <-  c(1,-0.5)
> mean(stock_returns) # arithmetic

[1] 0.25

> gm_mean(stock_returns)# geometric

[1] 0

The geometric mean of 0% reflects the fact that the ending price equals the starting price.
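
To make the point above concrete, here is a minimal sketch that compounds the two returns on a hypothetical starting value of 100 (the starting value and the variable names are assumptions for illustration, not part of the original example). It compares the actual price path with what each type of mean would imply if it were earned every year.

```r
# Hypothetical starting value, chosen only for illustration
start_value <- 100
returns <- c(1, -0.5)   # +100% in year 1, -50% in year 2

# Value at the end of each year from compounding the actual returns
path <- start_value * cumprod(1 + returns)

# Geometric mean return, same formula as gm_mean() above
g <- prod(1 + returns)^(1 / length(returns)) - 1

# Ending value if the geometric mean were earned every year
end_geometric <- start_value * (1 + g)^length(returns)

# Ending value implied by compounding the arithmetic mean instead
end_arithmetic <- start_value * (1 + mean(returns))^length(returns)

print(path)            # 200 100
print(end_geometric)   # 100
print(end_arithmetic)  # 156.25
```

Compounding the geometric mean reproduces the true ending value, while compounding the arithmetic mean of 25% overstates it.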

The Weighted Mean:

\[\bar{X}_W = (w_1X_1+w_2X_2+\dots+w_nX_n)/(w_1+w_2+\dots+w_n)\]

> stock_returns <- c(0.12,0.07,0.03)
> stock_weights <- c(0.50,0.40,0.10)
> 
> weighted.mean(stock_returns,stock_weights)

[1] 0.091
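
The denominator in the formula matters when the weights do not already sum to 1. As a small sketch, the weights below are hypothetical position sizes (an assumption for illustration, not from the original) in the same 50/40/10 proportions, and the formula is applied by hand alongside weighted.mean().

```r
stock_returns <- c(0.12, 0.07, 0.03)

# Hypothetical position sizes; they sum to 10000, not to 1
position_sizes <- c(5000, 4000, 1000)

# Weighted mean written out from the formula
manual <- sum(position_sizes * stock_returns) / sum(position_sizes)

# weighted.mean() divides by the sum of the weights as well
builtin <- weighted.mean(stock_returns, position_sizes)

print(c(manual = manual, builtin = builtin))  # both 0.091
```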

Median

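The median is the middle value of the data once it is sorted in ascending order. With an even number of observations, it is the average of the two middle values.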

> stock_returns <- c(0.20,0.20,0.20,0.10,0.10,0.10,0.15)
> median(stock_returns)

[1] 0.15
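
The example above has an odd number of observations, so one value sits exactly in the middle. The sketch below uses four hypothetical returns (not from the original) to show what happens with an even count, where the two middle values are averaged.

```r
# Hypothetical returns with an even number of observations
even_returns <- c(0.20, 0.10, 0.15, 0.05)

sorted <- sort(even_returns)   # 0.05 0.10 0.15 0.20
n <- length(sorted)

# Average of the two middle values
manual_median <- mean(sorted[c(n / 2, n / 2 + 1)])

print(manual_median)           # 0.125
print(median(even_returns))    # 0.125
```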

Mode

The mode is the most frequent value. A set of data can have more than one mode or no mode.

Strangely, base R has no function for the statistical mode (its mode() function returns an object's storage type instead).

You can sort the output of table() to display the frequency of each value.

> sort(table(stock_returns),decreasing = TRUE)

stock_returns
 0.1  0.2 0.15 
   3    3    1

I created a simple function to either display the modes or a message.

> mode_check <- function(x){
+   
+   if (max(table(x)) == min(table(x))){
+       return(cat("same frequency for all values"))
+   } else {
+     z <- as.data.frame(table(x))
+     z[z$Freq==max(table(x)),]
+   }
+   
+ }
> 
> mode_check(stock_returns)

    x Freq
1 0.1    3
3 0.2    3

> stock_returns <- c(0.20,0.10,0.15)
> 
> mode_check(stock_returns)

same frequency for all values

Quantiles

Quantile is a general term for a value at or below which a stated fraction of data lies.

Quartiles - data divided into quarters
Quintiles - data divided into fifths
Deciles - data divided into tenths
Percentiles - data divided into hundredths

The location \(L_y\) of the \(y\)th percentile in data sorted in ascending order is:

\[L_y = (n+1)\frac{y}{100}\]

y = the percentage of observations that lie below the value
n = number of data points

To find the 3rd quartile (75% lie below) for the following 11 returns, in ascending order:

\[8\%, 10\%, 12\%, 13\%, 15\%, 17\%, 17\%, 18\%, 19\%, 23\%, 24\%\]

\[L_{75} = (11+1) \times \frac{75}{100}=9\]

The \(9^{th}\) data point from the left, or \(19\%\).

With a 12th observation added, the location is no longer a whole number:

\[8\%, 10\%, 12\%, 13\%, 15\%, 17\%, 17\%, 18\%, 19\%, 23\%, 24\%, 26\%\]

\[L_{75} = (12+1) \times \frac{75}{100}=9.75\]

The \(9^{th}\) data point from the left plus \(0.75 \times\) the distance between the \(9^{th}\) and \(10^{th}\) values, or \(19\% + 0.75 \times (23\% - 19\%) = 22\%\).

> stock_returns <- c(0.08,0.10,0.12,0.13,0.15,
+                    0.17,0.17,0.18,0.19,0.23,0.24)
> 
> quantile(stock_returns, type=6)

  0%  25%  50%  75% 100% 
0.08 0.12 0.17 0.19 0.24

> stock_returns <- c(0.08,0.10,0.12,0.13,0.15,0.17,
+                    0.17,0.18,0.19,0.23,0.24,0.26)
> 
> quantile(stock_returns, type=6)

    0%    25%    50%    75%   100% 
0.0800 0.1225 0.1700 0.2200 0.2600

> #quintile
> quantile(stock_returns, type=6, probs = c(0:5)/5)

   0%   20%   40%   60%   80%  100% 
0.080 0.112 0.154 0.178 0.234 0.260

> #decile
> quantile(stock_returns, type=6, probs = c(0:10)/10)

   0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
0.080 0.086 0.112 0.129 0.154 0.170 0.178 0.194 0.234 0.254 0.260
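
The location formula and the interpolation step can also be sketched by hand. The helper below, quantile_by_location(), is a hypothetical function written for illustration (it is not part of base R); it finds the location \((n+1)y/100\), interpolates between the neighbouring values, and is compared with quantile(type = 6).

```r
# Hypothetical helper: value at location (n + 1) * y / 100 in sorted data
quantile_by_location <- function(x, y) {
  x <- sort(x)
  n <- length(x)
  loc <- (n + 1) * y / 100
  lower <- floor(loc)           # whole-number part of the location
  frac <- loc - lower           # fractional part used for interpolation
  if (lower < 1) return(x[1])   # location falls before the first observation
  if (lower >= n) return(x[n])  # location falls at or beyond the last one
  x[lower] + frac * (x[lower + 1] - x[lower])
}

stock_returns <- c(0.08, 0.10, 0.12, 0.13, 0.15, 0.17,
                   0.17, 0.18, 0.19, 0.23, 0.24, 0.26)

by_hand <- quantile_by_location(stock_returns, 75)
built_in <- unname(quantile(stock_returns, 0.75, type = 6))

print(c(by_hand = by_hand, built_in = built_in))  # both 0.22
```

The clamping at the first and last observation is a design choice of this sketch for locations that fall outside the data.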

Dispersion

Dispersion is the variability around the central tendency.

Range

Range = Max - Min

> stock_returns <- c(0.30,0.12,0.25,0.20,0.23)
> range(stock_returns)

[1] 0.12 0.30

> cat("range: ",diff(range(stock_returns)))

range:  0.18

Mean Absolute Deviation

The mean absolute deviation is the average of the absolute values of the deviations from the arithmetic mean.

\[MAD = \frac{\sum_{i=1}^n |X_i-\bar{X}|}{n}\]

Base R's mad() function computes the median absolute deviation, not the mean absolute deviation, so I created a function for the latter.

> mean_abs_dev <- function(x){
+   
+   sum(abs(x-mean(x)))/
+     length(x)
+ }
> 
> mean_abs_dev(stock_returns)

[1] 0.048
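
Because the abbreviation MAD is used for two different statistics, a short comparison may help. This sketch computes the mean absolute deviation directly and contrasts it with base R's mad(), which measures deviations from the median and takes their median; setting constant = 1 turns off its default scaling factor.

```r
stock_returns <- c(0.30, 0.12, 0.25, 0.20, 0.23)

# Mean absolute deviation: average distance from the arithmetic mean
mean_abs <- mean(abs(stock_returns - mean(stock_returns)))

# Median absolute deviation: median distance from the median, unscaled
median_abs <- mad(stock_returns, constant = 1)

print(c(mean_abs = mean_abs, median_abs = median_abs))  # 0.048 and 0.03
```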

Variance

Population Variance:

\[\sigma^2 = \frac{\sum_{i=1}^N(X_i-\mu)^2}{N} = \frac{\sum_{i=1}^N X_i^2}{N}-\left(\frac{\sum_{i=1}^N X_i}{N}\right)^2 = E[X^2]-E[X]^2\]

Sample Variance:

\[s^2 = \frac{\sum_{i=1}^n(X_i-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^n X_i^2-\frac{\left(\sum_{i=1}^n X_i\right)^2}{n}}{n-1}\]

Using \(n\) in the denominator when a sample stands in for its population underestimates the population variance, because the deviations are measured from the sample mean \(\bar{X}\) rather than from the true mean \(\mu\). Dividing by \(n-1\) instead compensates for this.
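
The underestimate can be seen exactly on a small case. The sketch below assumes a hypothetical four-value population and sampling with replacement (both are assumptions for illustration); it enumerates every possible sample of size 2 and averages the two versions of the variance.

```r
# Hypothetical population, small enough to enumerate every sample
population <- c(1, 2, 3, 4)
pop_var <- mean((population - mean(population))^2)   # 1.25

# Every ordered sample of size 2 drawn with replacement (16 in total)
samples <- expand.grid(first = population, second = population)

# Variance of each sample with n and with n - 1 in the denominator
var_n <- apply(samples, 1, function(s) mean((s - mean(s))^2))
var_n_minus_1 <- apply(samples, 1, var)

print(pop_var)               # 1.25
print(mean(var_n))           # 0.625, too small on average
print(mean(var_n_minus_1))   # 1.25, matches the population variance
```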

Base R's var() calculates only the sample variance, with \(n-1\) in the denominator. To get the population variance, you can multiply the sample variance by \((n-1)/n\).

> stock_returns <- c(0.30,0.12,0.25,0.20,0.23)
> 
> # Sample variance * (n-1)/n
> var(stock_returns) * 
+   ((length(stock_returns)-1)/length(stock_returns))

[1] 0.00356

> # variance function
> var_pop <- function(x) {
+   
+   mean((x-mean(x))^2)
+ }
> 
> var_pop(stock_returns) # population variance

[1] 0.00356

> var(stock_returns) # sample variance

[1] 0.00445
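
The sum-of-squares shortcuts, \(E[X^2]-E[X]^2\) for a population and \(\left(\sum X_i^2 - (\sum X_i)^2/n\right)/(n-1)\) for a sample, can be checked numerically against the results above. This is a quick sketch; the variable names are chosen here for illustration.

```r
stock_returns <- c(0.30, 0.12, 0.25, 0.20, 0.23)
n <- length(stock_returns)

# Population variance: mean of the squares minus the square of the mean
pop_shortcut <- mean(stock_returns^2) - mean(stock_returns)^2

# Sample variance: sum of squares minus (sum)^2 / n, all over n - 1
sample_shortcut <- (sum(stock_returns^2) - sum(stock_returns)^2 / n) / (n - 1)

print(pop_shortcut)      # 0.00356
print(sample_shortcut)   # 0.00445, same as var(stock_returns)
```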

Standard Deviation

Variance is expressed in squared units, which makes it difficult to interpret. Standard deviation is the square root of the variance, so it is in the same units as the data.

\[\sigma = \sqrt{\sigma^2}\]

Since base R's sd() only calculates the sample standard deviation, you can create a function to calculate the population standard deviation or multiply the sample value by \(\sqrt{(n-1)/n}\).

> sd_pop <- function(x) {
+   
+   (mean((x-mean(x))^2))^0.5
+ }
> 
> sd_pop(stock_returns)

[1] 0.05966574

> # multiply sample sd by sqrt((n-1)/n)
> sd(stock_returns)* 
+   ((length(stock_returns)-1)/length(stock_returns))^0.5

[1] 0.05966574

> sd(stock_returns) # sample SD

[1] 0.06670832

Chebyshev’s Inequality

Chebyshev’s Inequality states that for any set of observations, the proportion of the observations within \(k\) standard deviations of the mean is at least \(1-1/k^2\) for all \(k>1\).

> chebyshev <- function(x){
+   
+   if (x<=1) {
+     return(cat("must be >1"))
+   } else{
+     1-(1/x^2)
+   }
+ }
> 
> chebyshev(1.25)

[1] 0.36

> chebyshev(1.50)

[1] 0.5555556

> chebyshev(2.00)

[1] 0.75

> chebyshev(3.00)

[1] 0.8888889

> chebyshev(4.00)

[1] 0.9375
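
The bound can be checked against actual data. As a sketch, the code below treats the twelve market returns from the mean example as a full population (so the population standard deviation is used) and compares the observed share of values within \(k\) standard deviations with the Chebyshev minimum.

```r
market_returns <- c(0.12, 0.25, 0.34, 0.15, 0.19, 0.44,
                    0.54, 0.33, 0.22, 0.28, 0.17, 0.24)

mu <- mean(market_returns)
sigma <- sqrt(mean((market_returns - mu)^2))   # population SD

k <- c(1.25, 1.5, 2, 3)

# Observed proportion of returns within k standard deviations of the mean
within_k <- sapply(k, function(ki) mean(abs(market_returns - mu) <= ki * sigma))

# Chebyshev's lower bound for each k
bound <- 1 - 1 / k^2

print(data.frame(k, within_k, bound))
```

The observed proportions sit at or above the bound for every \(k\), as the inequality guarantees. The bound is usually conservative because it must hold for any distribution.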

Relative Dispersion

Relative dispersion is the amount of variability present in comparison to a reference point. A common measure of relative dispersion is the coefficient of variation (CV).

\[CV = \frac{s}{\overline{X}}\]

Example:

Mean monthly return on T-bills: 0.25%
SD of returns: 0.36%

Mean monthly return on S&P 500: 1.09%
SD of returns: 7.30%

T-bills: \(CV=0.36/0.25 = 1.44\)
S&P 500: \(CV=7.30/1.09 = 6.70\)

T-bills have less dispersion relative to their mean than the S&P 500.
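
The same calculation in R, as a short sketch using the figures from the example (the vector names are chosen here for illustration):

```r
# Figures from the example above, in percent per month
mean_return <- c(t_bills = 0.25, sp500 = 1.09)
sd_return <- c(t_bills = 0.36, sp500 = 7.30)

# Coefficient of variation: standard deviation per unit of mean return
cv <- sd_return / mean_return

print(round(cv, 2))
```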

Sharpe Measure

The Sharpe ratio measures excess return per unit of risk.

\[\text{Sharpe ratio} = \frac{\overline{r_p}-\overline{r_f}}{\sigma_p}\]

\(\overline{r_p} =\) mean portfolio return
\(\overline{r_f} =\) mean risk-free return
\(\sigma_p =\) standard deviation of portfolio returns

Example:

Mean monthly return on T-bills: 0.25%
Mean monthly return on S&P 500: 1.30%
Standard deviation of monthly S&P 500 returns: 7.30%

Sharpe = \((1.30-0.25)/7.30 = 0.144\)

To annualize a monthly Sharpe ratio, multiply by \(\sqrt{12}\):

\[0.144 \times \sqrt{12}= 0.499\]
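
The same arithmetic in R, as a minimal sketch using the figures from the example (variable names are chosen here for illustration):

```r
# Figures from the example above, in percent per month
rf <- 0.25     # mean T-bill return
rp <- 1.30     # mean S&P 500 return
sd_p <- 7.30   # standard deviation of S&P 500 returns

sharpe_monthly <- (rp - rf) / sd_p
sharpe_annual <- sharpe_monthly * sqrt(12)

print(round(sharpe_monthly, 3))  # 0.144
print(round(sharpe_annual, 3))   # 0.498
```

Carrying full precision gives roughly 0.498 rather than 0.499, because the hand calculation rounds the monthly ratio to 0.144 before annualizing.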

Descriptive Statistics using Returns

Measures of Central Tendency and Dispersion

Paul Jozefek

2021-08-04