Measures of Central Tendency

The population mean for random variable \(X_j\) is \(\mu_j = E(X_j)\). The expected value is the mean of the population, or equivalently, the mean of all random draws from a stochastic model. The sample mean \(\bar{x}_j\) is an unbiased estimator of \(\mu_j\), \(E(\bar{x}_j) = \mu_j\).

\[\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n}{X_{ij}}\]
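
As a quick check of the formula, the sketch below computes the sample mean from its definition and compares it with R's built-in `mean()`. The vector `x` is made-up illustrative data, not part of the nutrient study.

```r
# Hypothetical data: five observations of a single variable
x <- c(2, 4, 4, 5, 10)
n <- length(x)

# Sample mean from the definition: (1/n) * sum of the observations
x_bar <- sum(x) / n

# The built-in function should agree with the hand computation
all.equal(x_bar, mean(x))
```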

Measures of Dispersion

The population variance is the average squared difference between a variable’s values and their mean, \(\sigma_j^2 = E(X_j - \mu_j)^2\). The sample variance \(s_j^2\) is an unbiased estimator of \(\sigma_j^2\), \(E(s_j^2) = \sigma_j^2\).

\[s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j)^2}\]
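
The role of the \(n-1\) divisor can be illustrated with a small sketch. The vector `x` is hypothetical data; the only claim is that the hand computation with \(n-1\) reproduces R's `var()`, while dividing by \(n\) gives a smaller value.

```r
# Hypothetical data: five observations of a single variable
x <- c(2, 4, 4, 5, 10)
n <- length(x)
x_bar <- mean(x)

# Sum of squared deviations from the sample mean
ss <- sum((x - x_bar)^2)

# Sample variance with the n - 1 divisor, as in the formula above
s2 <- ss / (n - 1)

# Dividing by n instead gives a smaller value that underestimates sigma^2 on average
s2_divide_by_n <- ss / n

# var() uses the n - 1 divisor
all.equal(s2, var(x))
```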

The total variation of a data set with \(p\) variables is the sum of the \(p\) variances. In matrix notation, the total variation is the trace of the variance-covariance matrix, \(\operatorname{trace}(S) = \sum_{j=1}^{p}{s_j^2}\). The total variation is of interest for principal components analysis and factor analysis.

The generalized variance of a data set with \(p\) variables is the determinant of the variance-covariance matrix, \(|S|\). For given variances, this measure is largest when the variables are uncorrelated and shrinks toward zero as the variables become more strongly correlated.
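
A minimal sketch of the difference between the two measures, using two hypothetical \(2 \times 2\) covariance matrices with the same variances (4 and 9) but different correlations. The matrices and the helper names `total_variation` and `generalized_variance` are assumptions made for illustration.

```r
# Two hypothetical covariance matrices with identical variances (4 and 9)
S_uncorrelated <- matrix(c(4, 0,
                           0, 9), nrow = 2, byrow = TRUE)
S_correlated   <- matrix(c(4.0, 5.4,
                           5.4, 9.0), nrow = 2, byrow = TRUE)  # correlation 5.4 / (2 * 3) = 0.9

# Total variation: trace of S, i.e. the sum of the variances
total_variation <- function(S) sum(diag(S))

# Generalized variance: determinant of S
generalized_variance <- function(S) det(S)

# The total variation ignores the covariances, so it is 13 for both matrices
total_variation(S_uncorrelated)
total_variation(S_correlated)

# The generalized variance drops when the variables are correlated:
# 4 * 9 = 36 without correlation, 4 * 9 - 5.4^2 = 6.84 with it
generalized_variance(S_uncorrelated)
generalized_variance(S_correlated)
```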

Measures of Association

The population covariance is the average product of the differences between two variables’ values and their means, \(\sigma_{jk} = E((X_j - \mu_j)(X_k - \mu_k))\). The sample covariance \(s_{jk}\) is an unbiased estimator of \(\sigma_{jk}\), \(E(s_{jk}) = \sigma_{jk}\).

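\[s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j) (X_{ik} - \bar{x}_k)}\]

In matrix notation, with \(X_i\) the vector of the \(p\) measurements on observation \(i\) and \(\bar{x}\) the vector of sample means, the sample variance-covariance matrix is

\[S = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{i} - \bar{x}) (X_{i} - \bar{x})'}\]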
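
The matrix form of the sample covariance can be checked numerically. The sketch below assumes a small hypothetical data matrix `X` with observations in rows; it centres the columns, forms the sum of outer products, and compares the result with R's `cov()`.

```r
# Hypothetical data matrix: n = 4 observations (rows), p = 2 variables (columns)
X <- matrix(c(1, 2,
              2, 4,
              3, 5,
              6, 9), ncol = 2, byrow = TRUE)
n <- nrow(X)

# Subtract each column's sample mean from that column
Xc <- sweep(X, 2, colMeans(X))

# crossprod(Xc) is t(Xc) %*% Xc, the sum over observations of the
# outer products (X_i - x_bar)(X_i - x_bar)'
S <- crossprod(Xc) / (n - 1)

# The off-diagonal entry is s_12; the diagonal holds the two sample variances
all.equal(S, cov(X))
```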

The magnitude of the covariance depends on the units of measurement, so by itself it reveals nothing about the strength of the association. To assess the strength of an association, use the correlation. The population correlation is the population covariance divided by the product of the population standard deviations, \(\rho_{jk} = \sigma_{jk} / (\sigma_j \sigma_k)\). The sample correlation \(r_{jk}\) estimates \(\rho_{jk}\); it is a consistent estimator, although in general it is slightly biased in small samples.

\[r_{jk} = \frac{s_{jk}}{s_j s_k}\]
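
A short sketch of why the correlation, unlike the covariance, is comparable across units. The data matrix `X` is hypothetical; the sketch builds the correlation matrix from the covariance matrix, then rescales one variable (for example, grams to milligrams) and shows that only the covariance changes.

```r
# Hypothetical data matrix: 4 observations (rows), 2 variables (columns)
X <- matrix(c(1, 2,
              2, 4,
              3, 5,
              6, 9), ncol = 2, byrow = TRUE)

# Correlation matrix from the covariance matrix: r_jk = s_jk / (s_j * s_k)
S <- cov(X)
s <- sqrt(diag(S))        # sample standard deviations
R <- S / outer(s, s)      # divide each s_jk by s_j * s_k

# Both built-in routes should give the same matrix
all.equal(R, cor(X))
all.equal(R, cov2cor(S))

# Rescale the first variable by 1000 (a change of units)
X_scaled <- X
X_scaled[, 1] <- X[, 1] * 1000

# The covariance grows by the same factor of 1000 ...
cov_ratio <- cov(X_scaled)[1, 2] / cov(X)[1, 2]

# ... but the correlation is unchanged
all.equal(cor(X_scaled), cor(X))
```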

The coefficient of determination \(r_{jk}^2\) is another measure of association and is simply the square of the correlation. It is the fraction of the variation in variable \(j\) that is explained by a linear relationship with variable \(k\), or vice versa.
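
The "explained variation" reading of the squared correlation can be checked against a simple linear regression. The vectors `x` and `y` below are hypothetical; the sketch compares \(r^2\) with the R-squared reported by `lm()` in both directions, which illustrates the symmetry.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 6)
y <- c(2, 4, 5, 9)

# Squared sample correlation
r2 <- cor(x, y)^2

# R-squared from regressing y on x, and from regressing x on y
r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

# All three quantities should agree
all.equal(r2, r2_y_on_x)
all.equal(r2, r2_x_on_y)
```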

Example

A study measured nutrient intake for a random sample of 737 women aged 25-50 years.

library(readr)

nutrient <- read_fwf(file = "./Data/nutrient.txt",
                     skip = 0,
                     fwf_widths(c(3, 8, 7, 8, 9, 8), 
                                c("id", "Calcium", "Iron", "Protien", "A", "C")))

The standard deviations of Vitamin A and Vitamin C are large relative to their means, indicating high variability among the subjects.

library(dplyr)
library(tidyr)
knitr::kable(
nutrient %>%
  gather(key = "Variable",
         value = "Value",
         -id) %>%
  group_by(Variable) %>%
  summarise(Mean = round(mean(Value), 1),
            SD = round(sd(Value), 1)),
caption = "Descriptive Statistics")
Descriptive Statistics
| Variable | Mean | SD |
|:---|---:|---:|
| A | 839.6 | 1633.5 |
| C | 78.9 | 73.6 |
| Calcium | 624.0 | 397.3 |
| Iron | 11.1 | 6.0 |
| Protien | 65.8 | 30.6 |

The covariances are all positive, indicating that the daily intake of each nutrient increases with increased intake of the remaining nutrients.

knitr::kable(
cov(nutrient[, -1]),  
caption = "Variance-Covariance Matrix")
Variance-Covariance Matrix
| | Calcium | Iron | Protien | A | C |
|:---|---:|---:|---:|---:|---:|
| Calcium | 157829.4439 | 940.08944 | 6075.8163 | 102411.127 | 6701.6160 |
| Iron | 940.0894 | 35.81054 | 114.0580 | 2383.153 | 137.6720 |
| Protien | 6075.8163 | 114.05803 | 934.8769 | 7330.052 | 477.1998 |
| A | 102411.1266 | 2383.15341 | 7330.0515 | 2668452.371 | 22063.2486 |
| C | 6701.6160 | 137.67199 | 477.1998 | 22063.249 | 5416.2641 |

Protein, iron, and calcium are all positively associated. Intake of each of these three nutrients tends to increase with intake of the other two.

knitr::kable(
cor(nutrient[, -1]),
caption = "Correlation Matrix")
Correlation Matrix
| | Calcium | Iron | Protien | A | C |
|:---|---:|---:|---:|---:|---:|
| Calcium | 1.0000000 | 0.3954301 | 0.5001882 | 0.1578060 | 0.2292111 |
| Iron | 0.3954301 | 1.0000000 | 0.6233662 | 0.2437905 | 0.3126009 |
| Protien | 0.5001882 | 0.6233662 | 1.0000000 | 0.1467574 | 0.2120670 |
| A | 0.1578060 | 0.2437905 | 0.1467574 | 1.0000000 | 0.1835227 |
| C | 0.2292111 | 0.3126009 | 0.2120670 | 0.1835227 | 1.0000000 |
library(GGally)
ggpairs(nutrient[, -1],
        title = "Correlation Matrix",
        progress = FALSE)

The coefficient of determination is the square of the correlation. About 39% of the variation in iron intake is explained by protein intake. Conversely, about 39% of the variation in protein intake is explained by iron intake.

knitr::kable(
cor(nutrient[, -1])^2,
caption = "Coefficient of Determination Matrix")
Coefficient of Determination Matrix
| | Calcium | Iron | Protien | A | C |
|:---|---:|---:|---:|---:|---:|
| Calcium | 1.0000000 | 0.1563650 | 0.2501882 | 0.0249027 | 0.0525377 |
| Iron | 0.1563650 | 1.0000000 | 0.3885854 | 0.0594338 | 0.0977193 |
| Protien | 0.2501882 | 0.3885854 | 1.0000000 | 0.0215377 | 0.0449724 |
| A | 0.0249027 | 0.0594338 | 0.0215377 | 1.0000000 | 0.0336806 |
| C | 0.0525377 | 0.0977193 | 0.0449724 | 0.0336806 | 1.0000000 |

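The total variation is the trace of the variance-covariance matrix. The total variation equals 2,832,668.8. Nearly all of it comes from the variance of Vitamin A (2,668,452.4), so the total mostly reflects the scale of that one variable.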

sum(diag(cov(nutrient[, -1])))
## [1] 2832669
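
To see which variables drive the total variation, the sketch below takes the five variances from the diagonal of the variance-covariance matrix printed earlier and expresses each as a share of the total. The values are typed in by hand from that table (keeping the source's `Protien` column name), so the sketch does not need the data file.

```r
# Variances copied from the diagonal of the printed variance-covariance matrix
variances <- c(Calcium = 157829.4439,
               Iron    = 35.81054,
               Protien = 934.8769,
               A       = 2668452.371,
               C       = 5416.2641)

# Total variation: the sum of the variances (the trace of the matrix)
total_variation <- sum(variances)

# Percentage of the total variation contributed by each variable
share_pct <- round(100 * variances / total_variation, 1)
share_pct
```

Because the variables are measured on very different scales, a comparison like this says more about units than about the nutrients themselves.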

The generalized variance is the determinant of the variance-covariance matrix. The generalized variance equals approximately \(2.83 \times 10^{19}\).

det(cov(nutrient[, -1]))
## [1] 2.831042e+19
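
The generalized variance factors into the product of the variances and the determinant of the correlation matrix, \(|S| = \left(\prod_{j} s_j^2\right) |R|\). The sketch below illustrates this with the variance-covariance matrix typed in from the printed table; because those entries are rounded, the determinant should match the value above only approximately.

```r
# Variance-covariance matrix copied by hand from the printed table (rounded values)
vars <- c("Calcium", "Iron", "Protien", "A", "C")
S <- matrix(c(
  157829.4439,  940.08944, 6075.8163,  102411.1266,  6701.6160,
     940.08944,  35.81054,  114.05803,   2383.15341,  137.67199,
    6075.8163,  114.05803,  934.8769,    7330.0515,   477.1998,
  102411.1266, 2383.15341, 7330.0515, 2668452.371,  22063.2486,
    6701.6160,  137.67199,  477.1998,   22063.2486,  5416.2641),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Generalized variance from the typed-in matrix
gen_var <- det(S)

# Determinant of the implied correlation matrix
det_R <- det(cov2cor(S))

# Dividing out the variances leaves det(R), a unit-free value between 0 and 1
all.equal(gen_var / prod(diag(S)), det_R)
```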

References

PSU STAT505.