The population mean \((\mu)\):
\[\mu = \frac{\sum_{i=1}^N X_i}{N}\]
The sample mean \((\bar{X})\):
\[\bar{X} = \frac{\sum_{i=1}^n X_i}{n}\]
> market_returns <- c(0.12,0.25,0.34,0.15,0.19,0.44,
+ 0.54,0.33,0.22,0.28,0.17,0.24)
>
> cat("population mean ",mean(market_returns),"\n")
>
> # random sample of 5
> set.seed(1)
> cat("sample mean ",mean(sample(market_returns,5)))
population mean 0.2725
sample mean 0.256
The Geometric Mean:
\[G = \sqrt[n]{X_1 \times X_2 \times \dots \times X_n}\]
A real solution exists only if the product under the radical is non-negative, so for returns you can add 1.0 to each value under the radical and then subtract 1.0 from the result.
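Written out for a series of periodic returns \(R_1, \dots, R_n\) (notation assumed here for illustration), the adjusted calculation would look like this:
\[R_G = \sqrt[n]{(1+R_1)(1+R_2)\cdots(1+R_n)} - 1\]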
Since there’s no geometric mean function in base R, I created one.
> stock_returns <- c(-.0934,0.2345,0.0892)
>
> cat("arithmetic mean ",mean(stock_returns),"\n")
>
> # created a new function
> gm_mean <- function(x, na.rm=TRUE){
+
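+ # count only the non-missing values so the root matches the product
+ n <- if (na.rm) sum(!is.na(x)) else length(x)
+ prod(x+1,na.rm=na.rm)^(1/n)-1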
+
+ }
>
> cat("geometric mean", gm_mean(stock_returns))
arithmetic mean 0.07676667
geometric mean 0.0682465
Geometric means are used to calculate investment returns over multiple periods. The geometric mean will always be less than or equal to the arithmetic mean. The difference between the two will increase with the variability between observations.
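The relationship between variability and the gap can be checked numerically. The sketch below uses two made-up return series (both assumptions, chosen to share the same 5% arithmetic mean) and a small helper, `geo_mean`, that is a hypothetical name rather than a base R function.

```r
# Two hypothetical return series with the same arithmetic mean (5%)
low_var  <- c(0.04, 0.05, 0.06)
high_var <- c(-0.20, 0.05, 0.30)

# Geometric mean of returns: compound (1 + r), take the nth root, subtract 1
geo_mean <- function(r) prod(1 + r)^(1 / length(r)) - 1

# Gap between the arithmetic and geometric means for each series
gap_low  <- mean(low_var)  - geo_mean(low_var)
gap_high <- mean(high_var) - geo_mean(high_var)

cat("gap, low variability: ", gap_low, "\n")
cat("gap, high variability:", gap_high, "\n")
```

Both gaps should be non-negative, and the more variable series should show the larger one.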
> #Return year 1 = 100%
> #Return year 2 = -50%
>
> stock_returns <- c(1,-0.5)
> mean(stock_returns) # arithmetic
[1] 0.25
> gm_mean(stock_returns) # geometric
[1] 0
The geometric mean of 0% reflects the fact that the ending price equals the starting price.
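One way to see this is to follow a price through both years. The sketch below assumes a hypothetical starting price of 100; the variable names are illustrative only.

```r
start_price    <- 100          # hypothetical starting price
yearly_returns <- c(1, -0.5)   # +100% in year 1, -50% in year 2

# Compound the returns to get the ending price
end_price <- start_price * prod(1 + yearly_returns)

# Compound annual return implied by the start and end prices
compound_rate <- prod(1 + yearly_returns)^(1 / length(yearly_returns)) - 1

cat("ending price:", end_price, "\n")
cat("compound annual return:", compound_rate, "\n")
```

The price doubles and then halves, so it ends where it started, which is what the geometric mean reports and the 25% arithmetic mean hides.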
The Weighted Mean:
\[\bar{X}_W = (w_1X_1+w_2X_2+\dots+w_nX_n)/(w_1+w_2+\dots+w_n)\]
> stock_returns <- c(0.12,0.07,0.03)
> stock_weights <- c(0.50,0.40,0.10)
>
> weighted.mean(stock_returns,stock_weights)
[1] 0.091
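To connect the formula to the built-in function, the same figure can be computed by hand. This is a minimal sketch using the returns and weights from the example above; `manual` and `builtin` are illustrative names.

```r
stock_returns <- c(0.12, 0.07, 0.03)
stock_weights <- c(0.50, 0.40, 0.10)

# Sum of weight * value, divided by the sum of the weights
manual  <- sum(stock_weights * stock_returns) / sum(stock_weights)
builtin <- weighted.mean(stock_returns, stock_weights)

cat("manual: ", manual, "\n")
cat("builtin:", builtin, "\n")
```

Dividing by the sum of the weights means the weights do not have to add up to 1.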
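The median is the midpoint of the data when sorted in ascending order. With an even number of observations, it is the average of the two middle values.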
[1] 0.15
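The data behind the output above isn't shown, so the sketch below uses assumed values: one series with an odd number of observations and one with an even number.

```r
# Assumed example data (not taken from the original code)
odd_returns  <- c(0.10, 0.20, 0.15, 0.10, 0.20, 0.10, 0.20)
even_returns <- c(0.12, 0.07, 0.03, 0.10)

# Odd count: the single middle value of the sorted data
median_odd <- median(odd_returns)

# Even count: the average of the two middle values
median_even <- median(even_returns)

cat("median (odd n): ", median_odd, "\n")
cat("median (even n):", median_even, "\n")
```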
The mode is the most frequent value. A set of data can have more than one mode or no mode.
Strangely, base R has no function for the statistical mode (its mode() function returns an object's storage mode instead).
You can sort the output of table() to display the frequencies.
stock_returns
 0.1  0.2 0.15
   3    3    1
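The call that produced this table isn't shown. A sketch that yields the same frequencies, assuming a return series with three 0.10s, three 0.20s, and one 0.15, could look like this:

```r
# Assumed data, chosen to match the frequencies in the table above
stock_returns <- c(0.10, 0.20, 0.15, 0.10, 0.20, 0.10, 0.20)

# table() counts each distinct value; sort() puts the most frequent first
freq <- sort(table(stock_returns), decreasing = TRUE)
print(freq)
```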
I created a simple function to either display the modes or a message.
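> mode_check <- function(x){
+
+ freq <- table(x) # frequency of each distinct value
+ if (max(freq) == min(freq)){
+ # every value occurs equally often, so there is no mode
+ return(cat("same frequency for all values"))
+ } else {
+ z <- as.data.frame(freq)
+ z[z$Freq==max(freq),] # keep only the most frequent value(s)
+ }
+
+ }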
>
> mode_check(stock_returns)
    x Freq
1 0.1    3
3 0.2    3
same frequency for all values
Quantile is a general term for a value at or below which a stated fraction of data lies.
The location \(L_y\) of the \(y^{th}\) percentile among \(n\) observations sorted in ascending order is:
\[L_y = (n+1)\frac{y}{100}\]
To find the 3rd quartile (75% lie below) for the following returns, in ascending order:
\[8\%, 10\%, 12\%, 13\%, 15\%, 17\%, 17\%, 18\%, 19\%, 23\%, 24\%\]
\[L_{75} = (11+1)\times\frac{75}{100}=9\]
The \(9^{th}\) data point from the left, or \(19\%\).
With a twelfth observation of \(26\%\) added:
\[8\%, 10\%, 12\%, 13\%, 15\%, 17\%, 17\%, 18\%, 19\%, 23\%, 24\%, 26\%\]
\[L_{75} = (12+1)\times\frac{75}{100}=9.75\]
The \(9^{th}\) data point from the left plus \(0.75 \times\) the distance between the \(9^{th}\) and \(10^{th}\) values: \(19\% + 0.75 \times (23\% - 19\%) = 22\%\).
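The location rule can be sketched as a small function. `percentile_loc` is a hypothetical helper written for illustration, not a base R function; it clamps the location to the valid range and interpolates linearly between neighbouring values.

```r
# Hypothetical helper: percentile via the (n + 1) * y / 100 location rule
percentile_loc <- function(x, y) {
  x <- sort(x)
  n <- length(x)
  loc <- (n + 1) * y / 100
  loc <- min(max(loc, 1), n)     # keep the location inside 1..n
  lower <- floor(loc)
  frac  <- loc - lower
  if (frac == 0) return(x[lower])
  # interpolate between the two neighbouring observations
  x[lower] + frac * (x[lower + 1] - x[lower])
}

returns_11 <- c(0.08, 0.10, 0.12, 0.13, 0.15, 0.17,
                0.17, 0.18, 0.19, 0.23, 0.24)
returns_12 <- c(returns_11, 0.26)

q3_11 <- percentile_loc(returns_11, 75)
q3_12 <- percentile_loc(returns_12, 75)

cat("Q3, 11 observations:", q3_11, "\n")
cat("Q3, 12 observations:", q3_12, "\n")
```

This is the same rule that quantile() applies when called with type=6.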
> stock_returns <- c(0.08,0.10,0.12,0.13,0.15,
+ 0.17,0.17,0.18,0.19,0.23,0.24)
>
> quantile(stock_returns, type=6)
  0%  25%  50%  75% 100%
0.08 0.12 0.17 0.19 0.24
> stock_returns <- c(0.08,0.10,0.12,0.13,0.15,0.17,
+ 0.17,0.18,0.19,0.23,0.24,0.26)
>
> quantile(stock_returns, type=6)
    0%    25%    50%    75%   100%
0.0800 0.1225 0.1700 0.2200 0.2600
0% 20% 40% 60% 80% 100%
0.080 0.112 0.154 0.178 0.234 0.260
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
0.080 0.086 0.112 0.129 0.154 0.170 0.178 0.194 0.234 0.254 0.260
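The calls behind the quintile and decile outputs aren't shown. One way to get such breakdowns, assuming the same twelve returns and type=6, is to pass the desired fractions through the probs argument of quantile():

```r
stock_returns <- c(0.08, 0.10, 0.12, 0.13, 0.15, 0.17,
                   0.17, 0.18, 0.19, 0.23, 0.24, 0.26)

# Quintiles: cut points at every 20%
quintiles <- quantile(stock_returns, probs = seq(0, 1, by = 0.2), type = 6)

# Deciles: cut points at every 10%
deciles <- quantile(stock_returns, probs = seq(0, 1, by = 0.1), type = 6)

print(quintiles)
print(deciles)
```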
Dispersion is the variability around the central tendency.
Range = Max - Min
[1] 0.12 0.30
range: 0.18
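The code for this output isn't shown. A sketch that gives the same numbers, assuming the five-return series used in the variance examples, might be:

```r
# Assumed data: its minimum and maximum match the output above
stock_returns <- c(0.30, 0.12, 0.25, 0.20, 0.23)

# range() returns the minimum and maximum as a pair
min_max <- range(stock_returns)
print(min_max)

# diff() of that pair gives Max - Min
spread <- diff(min_max)
cat("range:", spread, "\n")
```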
Mean absolute deviation (MAD) is the average of the absolute values of the deviations from the arithmetic mean.
\[MAD = \frac{\sum_{i=1}^n |X_i-\bar{X}|}{n}\]
There is no mean absolute deviation function in base R (mad() computes the median absolute deviation), but it was easy to create.
> mean_abs_dev <- function(x){
+
+ sum(abs(x-mean(x)))/
+ length(x)
+ }
>
> mean_abs_dev(stock_returns)
[1] 0.048
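It is worth seeing how this differs from the similarly named mad() in base R, which measures deviations from the median and takes their median. The sketch below assumes the five-return series from the variance examples and sets constant = 1 to switch off mad()'s default scaling factor.

```r
# Assumed data: the series for which the mean absolute deviation is 0.048
stock_returns <- c(0.30, 0.12, 0.25, 0.20, 0.23)

# Mean absolute deviation: average distance from the arithmetic mean
mean_abs <- mean(abs(stock_returns - mean(stock_returns)))

# Median absolute deviation: median distance from the median (unscaled)
median_abs <- mad(stock_returns, constant = 1)

cat("mean absolute deviation:  ", mean_abs, "\n")
cat("median absolute deviation:", median_abs, "\n")
```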
Population Variance:
\[\sigma^2 = \frac{\sum_{i=1}^N(X_i-\mu)^2}{N} = \frac{\sum_{i=1}^N X_i^2}{N}-\left(\frac{\sum_{i=1}^N X_i}{N}\right)^2 = E[X^2]-E[X]^2\]
Sample Variance:
\[s^2 = \frac{\sum_{i=1}^n(X_i-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^n X_i^2-\frac{1}{n}\left(\sum_{i=1}^n X_i\right)^2}{n-1}\]
Dividing by \(n\) when a sample is used to estimate the population variance underestimates it, because the deviations are measured from the sample mean \(\bar{X}\) rather than the true mean \(\mu\). Dividing by \(n-1\) instead compensates for this.
Base R's var() calculates only the sample variance. To calculate the population variance, you can multiply the sample variance by \((n-1)/n\).
> stock_returns <- c(0.30,0.12,0.25,0.20,0.23)
>
> # Sample variance * (n-1)/n
> var(stock_returns) *
+ ((length(stock_returns)-1)/length(stock_returns))
[1] 0.00356
> # variance function
> var_pop <- function(x) {
+
+ mean((x-mean(x))^2)
+ }
>
> var_pop(stock_returns) # population variance
[1] 0.00356
> var(stock_returns) # sample variance
[1] 0.00445
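The computational forms of both variances can be checked against these results: \(E[X^2]-E[X]^2\) for the population, and \(\left(\sum X_i^2 - (\sum X_i)^2/n\right)/(n-1)\) for the sample. This is a minimal sketch on the same five returns, with illustrative variable names.

```r
x <- c(0.30, 0.12, 0.25, 0.20, 0.23)
n <- length(x)

# Population variance: mean of the squares minus the square of the mean
pop_shortcut <- mean(x^2) - mean(x)^2

# Sample variance: sum of squares minus (sum)^2 / n, all over n - 1
samp_shortcut <- (sum(x^2) - sum(x)^2 / n) / (n - 1)

cat("population variance:", pop_shortcut, "\n")
cat("sample variance:    ", samp_shortcut, "\n")
```

These forms are convenient by hand, but subtracting two nearly equal numbers can lose precision on large values, which is one reason to prefer the deviation form in code.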
Variance is expressed in squared units, which makes it difficult to interpret. Standard deviation is the square root of the variance, so it is in the same units as the data.
\[\sigma = \sqrt{\sigma^2}\]
Since base R's sd() calculates only the sample standard deviation, you can either create a function to calculate the population standard deviation or multiply the sample value by sqrt((n-1)/n).
[1] 0.05966574
> # multiply sample sd by sqrt((n-1)/n)
> sd(stock_returns)*
+ ((length(stock_returns)-1)/length(stock_returns))^0.5
[1] 0.05966574
> sd(stock_returns) # sample standard deviation
[1] 0.06670832
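The function-based route mentioned above isn't shown, so here is one possible sketch. `sd_pop` is a hypothetical name; it takes the square root of the mean squared deviation and should agree with the rescaled sd() result.

```r
stock_returns <- c(0.30, 0.12, 0.25, 0.20, 0.23)

# Hypothetical helper: population standard deviation (divides by n)
sd_pop <- function(x) {
  sqrt(mean((x - mean(x))^2))
}

n <- length(stock_returns)
from_function <- sd_pop(stock_returns)
from_rescale  <- sd(stock_returns) * sqrt((n - 1) / n)

cat("population sd (function):", from_function, "\n")
cat("population sd (rescaled):", from_rescale, "\n")
```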
Chebyshev’s Inequality states that for any set of observations, the proportion of the observations within \(k\) standard deviations of the mean is at least \(1-1/k^2\) for all \(k>1\).
> chebyshev <- function(x){
+
+ if (x<=1) {
+ return(cat("must be >1"))
+ } else{
+ 1-(1/x^2)
+ }
+ }
>
> chebyshev(1.25)
[1] 0.36
> chebyshev(1.50)
[1] 0.5555556
> chebyshev(2)
[1] 0.75
> chebyshev(3)
[1] 0.8888889
> chebyshev(4)
[1] 0.9375
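Because the inequality holds for any set of observations, it can be checked against real data. The sketch below reuses the twelve market returns from the first example and assumes \(k = 1.5\); the variable names are illustrative.

```r
market_returns <- c(0.12, 0.25, 0.34, 0.15, 0.19, 0.44,
                    0.54, 0.33, 0.22, 0.28, 0.17, 0.24)
k <- 1.5   # assumed number of standard deviations

mu    <- mean(market_returns)
sigma <- sqrt(mean((market_returns - mu)^2))   # population sd

# Proportion of observations within k standard deviations of the mean
within_k <- mean(abs(market_returns - mu) <= k * sigma)

# Chebyshev's lower bound for that proportion
bound <- 1 - 1 / k^2

cat("observed proportion:", within_k, "\n")
cat("Chebyshev bound:    ", bound, "\n")
```

The observed proportion should never fall below the bound, though for most data it will be well above it, since the bound makes no assumption about the shape of the distribution.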
Relative dispersion is the amount of variability present in comparison to a reference point. A common measure of relative dispersion is the coefficient of variation (CV).
\[CV = \frac{s}{\overline{X}}\]
Example:
Monthly returns on T-bills: 0.25%
SD of returns: 0.36%
Monthly returns on S&P 500: 1.09%
SD of returns: 7.30%
T-bills: \(CV=0.36/0.25 = 1.44\)
S&P 500: \(CV=7.30/1.09 = 6.70\)
There is less dispersion relative to the mean for T-bills: each unit of mean return comes with 1.44 units of standard deviation, versus 6.70 for the S&P 500.
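The same comparison can be reproduced in R. This is a small sketch using the figures from the example; `cv` is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: coefficient of variation from a given sd and mean
cv <- function(std_dev, avg) {
  std_dev / avg
}

tbill_cv <- cv(0.36, 0.25)   # T-bills: SD 0.36%, mean 0.25%
sp500_cv <- cv(7.30, 1.09)   # S&P 500: SD 7.30%, mean 1.09%

cat("T-bills CV:", round(tbill_cv, 2), "\n")
cat("S&P 500 CV:", round(sp500_cv, 2), "\n")
```

For raw data rather than summary figures, sd(x)/mean(x) gives the sample version directly.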