Measures of Central Tendency

The population mean for random variable \(X_j\) is \(\mu_j = E(X_j)\). The expected value is the mean of the population, or equivalently, the mean of all random draws from a stochastic model. The sample mean \(\bar{x}_j\) is an unbiased estimator of \(\mu_j\), \(E(\bar{x}_j) = \mu_j\).

\[\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n}{X_{ij}}\]
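As a quick illustration of the formula, the sketch below computes the sample mean of each column of a small matrix by hand and compares it with base R's `colMeans()`. The matrix `X` is hypothetical toy data, not part of the nutrient study.

```r
# Hypothetical toy data: n = 4 observations (rows) on p = 2 variables (columns).
# matrix() fills column by column, so x1 = 2, 4, 6, 8 and x2 = 1, 3, 2, 6.
X <- matrix(c(2, 4, 6, 8,
              1, 3, 2, 6),
            ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))

n <- nrow(X)

# Sample mean of each variable, straight from the formula: (1/n) * sum_i X_ij.
xbar_manual <- colSums(X) / n

# Built-in equivalent.
xbar_builtin <- colMeans(X)

print(xbar_manual)
print(all.equal(xbar_manual, xbar_builtin))
```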

Measures of Dispersion

The population variance is the average squared difference between a variable’s values and its mean, \(\sigma_j^2 = E[(X_j - \mu_j)^2]\). The sample variance \(s_j^2\) is an unbiased estimator of \(\sigma_j^2\), \(E(s_j^2) = \sigma_j^2\).

\[s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j)^2}\]
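The following sketch applies the sample variance formula to a hypothetical vector `x` and checks it against base R's `var()`, which also divides by \(n - 1\). The data are made up for illustration.

```r
# Hypothetical toy data for a single variable.
x <- c(2, 4, 6, 8)
n <- length(x)

# Sample variance from the formula: squared deviations from the sample mean,
# summed and divided by n - 1.
s2_manual <- sum((x - mean(x))^2) / (n - 1)

# var() uses the same n - 1 denominator.
s2_builtin <- var(x)

print(s2_manual)
print(all.equal(s2_manual, s2_builtin))
```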

The total variation of a data set with \(p\) variables is the sum of the \(p\) variances. In matrix notation, the total variation is the trace of the variance-covariance matrix, \(\mathrm{trace}(S) = \sum_{j=1}^{p}{s_j^2}\). The total variation is of interest for principal components analysis and factor analysis.

The generalized variance of a data set with \(p\) variables is the determinant of the variance-covariance matrix, \(|S|\). This measure takes a large value when the variables show little correlation among themselves, and it shrinks toward zero as the variables become more strongly correlated.
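To see how the two measures differ, the sketch below compares two hypothetical \(2 \times 2\) variance-covariance matrices that have the same variances but different correlations. The helper names `total_variation()` and `generalized_variance()` are introduced here for illustration only.

```r
# Two hypothetical covariance matrices with unit variances.
# They differ only in the covariance (here equal to the correlation).
S_low  <- matrix(c(1, 0.1,
                   0.1, 1), nrow = 2)
S_high <- matrix(c(1, 0.9,
                   0.9, 1), nrow = 2)

# Illustrative helpers (not from any package).
total_variation      <- function(S) sum(diag(S))  # trace of S
generalized_variance <- function(S) det(S)        # determinant of S

# The trace ignores the off-diagonal entries, so it is identical for both.
tv_low  <- total_variation(S_low)
tv_high <- total_variation(S_high)

# For unit variances the determinant is 1 - r^2, so it falls as r grows.
gv_low  <- generalized_variance(S_low)
gv_high <- generalized_variance(S_high)

print(c(tv_low = tv_low, tv_high = tv_high))
print(c(gv_low = gv_low, gv_high = gv_high))
```

The total variation is the same in both cases, while the generalized variance is much smaller for the strongly correlated pair.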

Measures of Association

The population covariance is the average product of two variables’ deviations from their means, \(\sigma_{jk} = E[(X_j - \mu_j)(X_k - \mu_k)]\). The sample covariance \(s_{jk}\) is an unbiased estimator of \(\sigma_{jk}\), \(E(s_{jk}) = \sigma_{jk}\).

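\[s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j) (X_{ik} - \bar{x}_k)}\]

In matrix notation, with \(X_i\) the vector of the \(p\) measurements on observation \(i\) and \(\bar{x}\) the vector of sample means, the variance-covariance matrix \(S\) collects every \(s_{jk}\):

\[S = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{i} - \bar{x}) (X_{i} - \bar{x})'}\]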
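The matrix form of the sample covariance can be checked numerically. The sketch below centers a hypothetical data matrix, forms the cross-product of the centered columns, divides by \(n - 1\), and compares the result with base R's `cov()`. The data are toy values chosen for easy arithmetic.

```r
# Hypothetical toy data: 4 observations on 2 variables.
X <- matrix(c(2, 4, 6, 8,
              1, 3, 2, 6),
            ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))
n <- nrow(X)

# Subtract each column's mean from that column (sweep's default FUN is "-").
Xc <- sweep(X, 2, colMeans(X))

# crossprod(Xc) is t(Xc) %*% Xc, i.e. the sum over observations of the
# outer products of the centered rows.
S_manual <- crossprod(Xc) / (n - 1)

# Built-in variance-covariance matrix.
S_builtin <- cov(X)

print(S_manual)
print(all.equal(S_manual, S_builtin))
```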

The magnitude of the covariance depends on the units in which the variables are measured, so by itself it reveals little about the strength of the association. To assess the strength of an association, use the correlation. The population correlation is the population covariance divided by the product of the population standard deviations, \(\rho_{jk} = \sigma_{jk} / (\sigma_j \sigma_k)\). The sample correlation \(r_{jk}\) is a consistent estimator of \(\rho_{jk}\), although it is generally slightly biased in small samples.

\[r_{jk} = \frac{s_{jk}}{s_j s_k}\]
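The sketch below builds the correlation matrix from the covariance matrix using the formula above, and then rescales one variable to show that the covariance changes with the units while the correlation does not. The data and the factor of 1000 are hypothetical.

```r
# Hypothetical toy data: 4 observations on 2 variables.
X <- matrix(c(2, 4, 6, 8,
              1, 3, 2, 6),
            ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))

S <- cov(X)
s <- sqrt(diag(S))            # sample standard deviations

# r_jk = s_jk / (s_j * s_k), applied to every entry at once.
R_manual <- S / outer(s, s)

# Base R equivalents.
R_cov2cor <- cov2cor(S)
R_builtin <- cor(X)

# Change the units of x1 (for example, grams to milligrams).
X_scaled <- X
X_scaled[, "x1"] <- 1000 * X[, "x1"]

cov_ratio <- cov(X_scaled)["x1", "x2"] / S["x1", "x2"]  # covariance scales
cor_same  <- all.equal(cor(X_scaled), R_builtin)        # correlation does not

print(R_manual)
print(cov_ratio)
print(cor_same)
```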

The coefficient of determination \(r_{jk}^2\) is another measure of association and is simply the square of the correlation. It says that a fraction \(r_{jk}^2\) of the variation in variable \(j\) is explained by a linear relationship with variable \(k\), or vice versa.
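One way to see the "explained variation" reading is to compare the squared correlation with the \(R^2\) from a simple linear regression fitted in either direction. The sketch below does this for two hypothetical vectors `x` and `y`.

```r
# Hypothetical toy data.
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)

# Squared sample correlation.
r2 <- cor(x, y)^2

# R-squared from simple linear regressions in both directions.
r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

print(c(r2 = r2, y_on_x = r2_y_on_x, x_on_y = r2_x_on_y))
```

All three values agree, which is why the measure is symmetric in the two variables.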

Example

A study measured nutrient intake for a random sample of 737 women aged 25-50 years.

library(readr)

nutrient <- read_fwf(file = "./Data/nutrient.txt",
                     skip = 0,
                     fwf_widths(c(3, 8, 7, 8, 9, 8), 
                                c("id", "Calcium", "Iron", "Protien", "A", "C")))

The standard deviations of Vitamin A and Vitamin C are large relative to their means, indicating high variability among the subjects.

library(dplyr)
library(tidyr)

knitr::kable(
nutrient %>%
  gather(key = "Variable",
         value = "Value",
         -id) %>%
  group_by(Variable) %>%
  summarise(Mean = round(mean(Value), 1),
            SD = round(sd(Value), 1)),
caption = "Descriptive Statistics")

Descriptive Statistics
Variable	Mean	SD
A	839.6	1633.5
C	78.9	73.6
Calcium	624.0	397.3
Iron	11.1	6.0
Protien	65.8	30.6
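One way to make "large relative to their means" concrete is the coefficient of variation, SD divided by mean. The sketch below recomputes it from the rounded values reported in the table, so the results are approximate. The data frame `desc` and the column `CV` are names introduced here for illustration.

```r
# Means and SDs copied from the Descriptive Statistics table (rounded values).
desc <- data.frame(
  Variable = c("A", "C", "Calcium", "Iron", "Protien"),
  Mean     = c(839.6, 78.9, 624.0, 11.1, 65.8),
  SD       = c(1633.5, 73.6, 397.3, 6.0, 30.6)
)

# Coefficient of variation: SD expressed as a multiple of the mean.
desc$CV <- round(desc$SD / desc$Mean, 2)

# Sort from most to least variable relative to the mean.
desc <- desc[order(-desc$CV), ]
print(desc, row.names = FALSE)
```

Vitamin A has an SD nearly twice its mean and Vitamin C an SD close to its mean, while the other three nutrients have ratios well below 1.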

The covariances are all positive, indicating that the daily intake of each nutrient tends to increase as the intake of each of the other nutrients increases.

knitr::kable(
cov(nutrient[, -1]),  
caption = "Variance-Covariance Matrix")

Variance-Covariance Matrix
	Calcium	Iron	Protien	A	C
Calcium	157829.4439	940.08944	6075.8163	102411.127	6701.6160
Iron	940.0894	35.81054	114.0580	2383.153	137.6720
Protien	6075.8163	114.05803	934.8769	7330.052	477.1998
A	102411.1266	2383.15341	7330.0515	2668452.371	22063.2486
C	6701.6160	137.67199	477.1998	22063.249	5416.2641

Protein, iron, and calcium are all positively associated. Intake of each of these three nutrients tends to increase with intake of the other two.

knitr::kable(
cor(nutrient[, -1]),
caption = "Correlation Matrix")

Correlation Matrix
	Calcium	Iron	Protien	A	C
Calcium	1.0000000	0.3954301	0.5001882	0.1578060	0.2292111
Iron	0.3954301	1.0000000	0.6233662	0.2437905	0.3126009
Protien	0.5001882	0.6233662	1.0000000	0.1467574	0.2120670
A	0.1578060	0.2437905	0.1467574	1.0000000	0.1835227
C	0.2292111	0.3126009	0.2120670	0.1835227	1.0000000

library(GGally)
ggpairs(nutrient[, -1],
        title = "Correlation Matrix",
        progress = FALSE)

The coefficient of determination is the square of the correlation. About 39% of the variation in iron intake is explained by protein intake. Conversely, about 39% of the variation in protein intake is explained by iron intake.

knitr::kable(
cor(nutrient[, -1])^2,
caption = "Coefficient of Determination Matrix")

Coefficient of Determination Matrix
	Calcium	Iron	Protien	A	C
Calcium	1.0000000	0.1563650	0.2501882	0.0249027	0.0525377
Iron	0.1563650	1.0000000	0.3885854	0.0594338	0.0977193
Protien	0.2501882	0.3885854	1.0000000	0.0215377	0.0449724
A	0.0249027	0.0594338	0.0215377	1.0000000	0.0336806
C	0.0525377	0.0977193	0.0449724	0.0336806	1.0000000

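The total variation is the trace of the variance-covariance matrix. The total variation equals 2,832,668.8. This large value is driven almost entirely by the variance of Vitamin A (2,668,452.4), which accounts for about 94% of the total.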

sum(diag(cov(nutrient[, -1])))

## [1] 2832669
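Because the total variation is a plain sum of variances, it is sensitive to the scale of each variable. The sketch below takes the diagonal of the variance-covariance matrix as reported in the table above and computes each variable's share of the total. The vector name `variances` is introduced here for illustration.

```r
# Diagonal entries copied from the Variance-Covariance Matrix table.
variances <- c(Calcium = 157829.4439,
               Iron    = 35.81054,
               Protien = 934.8769,
               A       = 2668452.371,
               C       = 5416.2641)

# Total variation: the trace of S.
total <- sum(variances)

# Percentage of the total contributed by each variable.
share <- round(100 * variances / total, 1)

print(total)
print(sort(share, decreasing = TRUE))
```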

The generalized variance is the determinant of the variance-covariance matrix. The generalized variance equals \(2.83 \times 10^{19}\).

det(cov(nutrient[, -1]))

## [1] 2.831042e+19

References

Penn State University, STAT 505: Applied Multivariate Statistical Analysis.

Multivariate Measures