The population mean for random variable \(X_j\) is \(\mu_j = E(X_j)\). The expected value is the mean of the population, or equivalently, the mean of all random draws from a stochastic model. The sample mean \(\bar{x}_j\) is an unbiased estimator of \(\mu_j\), that is, \(E(\bar{x}_j) = \mu_j\).
\[\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n}{X_{ij}}\]
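As a minimal sketch of the formula above, the following R snippet computes the sample means of a small made-up data matrix, once from the definition and once with the built-in `colMeans()`. The values and the column names `x1` and `x2` are illustrative assumptions, not data from the study discussed later.

```r
# Toy data matrix: n = 4 observations (rows) on p = 2 variables (columns).
# The values are made up for illustration only.
X <- matrix(c(2, 4, 6, 8,
              1, 3, 5, 11),
            ncol = 2,
            dimnames = list(NULL, c("x1", "x2")))
n <- nrow(X)

# Sample mean of each variable from the definition: (1 / n) * sum_i X_ij
xbar_manual <- colSums(X) / n

# Built-in equivalent
xbar_builtin <- colMeans(X)

xbar_manual
all.equal(xbar_manual, xbar_builtin)
```

Both approaches give a mean of 5 for each of the two toy variables.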
The population variance is the average squared difference between a variable’s values and their mean, \(\sigma_j^2 = E(X_j - \mu_j)^2\). The sample variance \(s_j^2\) is an unbiased estimator of \(\sigma_j^2\), \(E(s_j^2) = \sigma_j^2\).
\[s_j^2 = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j)^2}\]
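The role of the \(n - 1\) divisor can be seen in a short R sketch. The vector `x` below is made up for illustration; the point is only that R's `var()` uses the \(n - 1\) divisor, while dividing by \(n\) gives a smaller value.

```r
# Made-up sample of n = 4 values
x <- c(2, 4, 6, 8)
n <- length(x)
xbar <- mean(x)

# Sample variance from the definition, with the n - 1 divisor
s2_manual <- sum((x - xbar)^2) / (n - 1)

# Built-in equivalent: var() also divides by n - 1
s2_builtin <- var(x)

# Dividing by n instead gives a smaller value (the biased version)
s2_divide_by_n <- sum((x - xbar)^2) / n

c(manual = s2_manual, builtin = s2_builtin, divide_by_n = s2_divide_by_n)
```

Here the sum of squared deviations is 20, so the sample variance is 20 / 3 while the divide-by-\(n\) version is 5.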
The total variation of a data set with \(p\) variables is the sum of the \(p\) variances. In matrix notation, the total variation is the trace of the variance-covariance matrix, \(\mathrm{trace}(S) = \sum_{j=1}^{p}{s_j^2}\). The total variation is of interest for principal components analysis and factor analysis.
The generalized variance of a data set with \(p\) variables is the determinant of the variance-covariance matrix, \(|S|\). For given variances, this measure is largest when the variables are uncorrelated and shrinks toward zero as the variables become highly correlated.
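The difference between the two summaries can be illustrated with a small sketch. The helper `make_S()` and its arguments are hypothetical names introduced here, and the variances (4 and 9) and correlations (0.1 and 0.9) are made-up values: the two matrices share the same variances, so they have the same total variation, but the generalized variance is much smaller for the highly correlated pair.

```r
# Hypothetical helper: build a 2 x 2 covariance matrix from a correlation r
# and two standard deviations s1 and s2 (all values made up for illustration).
make_S <- function(r, s1 = 2, s2 = 3) {
  matrix(c(s1^2,        r * s1 * s2,
           r * s1 * s2, s2^2),
         nrow = 2, byrow = TRUE)
}
S_low  <- make_S(r = 0.1)  # weakly correlated variables
S_high <- make_S(r = 0.9)  # strongly correlated variables

# Total variation (trace) ignores the covariances, so it is the same for both
total_low  <- sum(diag(S_low))
total_high <- sum(diag(S_high))

# Generalized variance (determinant) shrinks as the correlation grows
gv_low  <- det(S_low)
gv_high <- det(S_high)

c(total_low = total_low, total_high = total_high)
c(gv_low = gv_low, gv_high = gv_high)
```

Both traces equal 13, while the determinants are 35.64 and 6.84 respectively.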
The population covariance is the average product of the difference between two variables’ values and their means, \(\sigma_{jk} = E((X_{ij} - \mu_j)(X_{ik} - \mu_k))\). The sample covariance is an unbiased estimator of \(\sigma_{jk}\), \(E(s_{jk}) = \sigma_{jk}\).
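\[s_{jk} = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{ij} - \bar{x}_j) (X_{ik} - \bar{x}_k)}\]
In matrix form, with \(X_i\) the \(p \times 1\) vector of values for observation \(i\) and \(\bar{x}\) the vector of sample means, the sample variance-covariance matrix \(S\), whose \((j, k)\) element is \(s_{jk}\), is
\[S = \frac{1}{n-1} \sum_{i=1}^{n}{(X_{i} - \bar{x}) (X_{i} - \bar{x})'}\]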
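The element-wise and matrix forms of the sample covariance can be checked against each other in R. This is a sketch on a made-up \(4 \times 3\) data matrix; the object names are illustrative assumptions.

```r
# Made-up data: n = 4 observations on p = 3 variables
X <- matrix(c(2, 4, 6, 8,
              1, 3, 5, 11,
              9, 7, 8, 4),
            ncol = 3)
n <- nrow(X)

# Element-wise form: covariance between variables 1 and 2
s12 <- sum((X[, 1] - mean(X[, 1])) * (X[, 2] - mean(X[, 2]))) / (n - 1)

# Matrix form: center each column, then S = Xc' Xc / (n - 1)
Xc <- sweep(X, 2, colMeans(X))      # subtract the column means
S_manual <- crossprod(Xc) / (n - 1)

# Both agree with the built-in cov()
all.equal(s12, S_manual[1, 2])
all.equal(S_manual, cov(X))
```

The diagonal of `S_manual` holds the sample variances, and the off-diagonal elements hold the sample covariances.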
The magnitude of the covariance depends on the units of measurement, so by itself it reveals little about the strength of the association. To assess the strength of an association, use the correlation. The population correlation is the population covariance divided by the product of the population standard deviations, \(\rho_{jk} = \sigma_{jk} / (\sigma_j \sigma_k)\). The sample correlation \(r_{jk}\) estimates \(\rho_{jk}\). Unlike the sample mean and sample variance, it is not exactly unbiased, but the bias is small in large samples.
\[r_{jk} = \frac{s_{jk}}{s_j s_k}\]
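A brief sketch of this scaling in R, on made-up data: dividing each covariance by the product of the two standard deviations gives the same result as the built-in `cor()` and `cov2cor()` functions.

```r
# Made-up data: n = 4 observations on p = 3 variables
X <- matrix(c(2, 4, 6, 8,
              1, 3, 5, 11,
              9, 7, 8, 4),
            ncol = 3)

S <- cov(X)          # sample variance-covariance matrix
s <- sqrt(diag(S))   # sample standard deviations

# r_jk = s_jk / (s_j * s_k), applied to every pair at once
R_manual <- S / outer(s, s)

# Built-in equivalents
all.equal(R_manual, cor(X))
all.equal(cov2cor(S), cor(X))
```

Because the units cancel in the ratio, the correlation is unchanged if a variable is rescaled, for example from milligrams to grams.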
The coefficient of determination \(r_{jk}^2\) is another measure of association and is simply the square of the correlation. It says that a fraction \(r_{jk}^2\) of the variation in \(X_j\) is explained by a linear relationship with \(X_k\), or vice versa.
A study measured nutrient intake for a random sample of 737 women aged 25-50 years.
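The "or vice versa" part can be checked with a short sketch: for two variables, the squared correlation equals the R-squared of a simple linear regression in either direction. The vectors `x` and `y` are made-up values used only for illustration.

```r
# Made-up paired observations
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 2.9, 4.2, 4.8, 6.5, 6.9)

# Squared sample correlation
r2 <- cor(x, y)^2

# R-squared from regressing y on x, and x on y
r2_y_on_x <- summary(lm(y ~ x))$r.squared
r2_x_on_y <- summary(lm(x ~ y))$r.squared

c(r2 = r2, y_on_x = r2_y_on_x, x_on_y = r2_x_on_y)
```

All three values coincide, which is why the measure is symmetric in the two variables.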
library(readr)
nutrient <- read_fwf(file = "./Data/nutrient.txt",
skip = 0,
fwf_widths(c(3, 8, 7, 8, 9, 8),
c("id", "Calcium", "Iron", "Protien", "A", "C")))
The standard deviations of Vitamin A and Vitamin C are large relative to their means, indicating high variability among the subjects.
library(dplyr)
library(tidyr)
knitr::kable(
nutrient %>%
gather(key = "Variable",
value = "Value",
-id) %>%
group_by(Variable) %>%
summarise(Mean = round(mean(Value), 1),
SD = round(sd(Value), 1)),
caption = "Descriptive Statistics")
| Variable | Mean | SD |
|---|---|---|
| A | 839.6 | 1633.5 |
| C | 78.9 | 73.6 |
| Calcium | 624.0 | 397.3 |
| Iron | 11.1 | 6.0 |
| Protien | 65.8 | 30.6 |
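One way to make "large relative to their means" concrete is the coefficient of variation, SD divided by the mean. The sketch below re-enters the rounded values from the table above by hand (the data frame `desc` and the column `CV` are names introduced here for illustration), so the results are only as precise as the table.

```r
# Means and standard deviations copied from the descriptive statistics table
desc <- data.frame(
  Variable = c("A", "C", "Calcium", "Iron", "Protien"),
  Mean     = c(839.6, 78.9, 624.0, 11.1, 65.8),
  SD       = c(1633.5, 73.6, 397.3, 6.0, 30.6)
)

# Coefficient of variation: SD expressed as a multiple of the mean
desc$CV <- round(desc$SD / desc$Mean, 2)

# Sort from most to least variable relative to the mean
desc_sorted <- desc[order(-desc$CV), ]
desc_sorted
```

Vitamin A has an SD of nearly twice its mean and Vitamin C an SD close to its mean, while the ratios for calcium, iron, and protein fall between roughly 0.47 and 0.64.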
The covariances are all positive, indicating that the daily intake of each nutrient tends to increase with the intake of each of the other nutrients.
knitr::kable(
cov(nutrient[, -1]),
caption = "Variance-Covariance Matrix")
| | Calcium | Iron | Protien | A | C |
|---|---|---|---|---|---|
| Calcium | 157829.4439 | 940.08944 | 6075.8163 | 102411.127 | 6701.6160 |
| Iron | 940.0894 | 35.81054 | 114.0580 | 2383.153 | 137.6720 |
| Protien | 6075.8163 | 114.05803 | 934.8769 | 7330.052 | 477.1998 |
| A | 102411.1266 | 2383.15341 | 7330.0515 | 2668452.371 | 22063.2486 |
| C | 6701.6160 | 137.67199 | 477.1998 | 22063.249 | 5416.2641 |
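Protein, iron, and calcium are all positively associated, with pairwise correlations between 0.40 and 0.62. The intake of each of these three nutrients tends to increase with the intake of the other two. The correlations involving Vitamin A and Vitamin C are weaker, between 0.15 and 0.31.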
knitr::kable(
cor(nutrient[, -1]),
caption = "Correlation Matrix")
| | Calcium | Iron | Protien | A | C |
|---|---|---|---|---|---|
| Calcium | 1.0000000 | 0.3954301 | 0.5001882 | 0.1578060 | 0.2292111 |
| Iron | 0.3954301 | 1.0000000 | 0.6233662 | 0.2437905 | 0.3126009 |
| Protien | 0.5001882 | 0.6233662 | 1.0000000 | 0.1467574 | 0.2120670 |
| A | 0.1578060 | 0.2437905 | 0.1467574 | 1.0000000 | 0.1835227 |
| C | 0.2292111 | 0.3126009 | 0.2120670 | 0.1835227 | 1.0000000 |
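The correlations in the table above can be recovered from the variance-covariance matrix. As a sketch, the block below re-enters the calcium, iron, and protein block of that matrix by hand (the object names `S3` and `R3` are introduced here for illustration) and rescales it with `cov2cor()`. Because the inputs are rounded printed values, the results match the table only up to rounding.

```r
# 3 x 3 block of the variance-covariance matrix, copied from the printed table
vars <- c("Calcium", "Iron", "Protien")
S3 <- matrix(c(157829.4439, 940.08944, 6075.8163,
                  940.08944,  35.81054,  114.05803,
                 6075.8163,  114.05803,  934.8769),
             nrow = 3, byrow = TRUE,
             dimnames = list(vars, vars))

# Rescale covariances to correlations: r_jk = s_jk / (s_j * s_k)
R3 <- cov2cor(S3)
round(R3, 4)
```

The same rescaling applied to the full matrix is what `cor(nutrient[, -1])` reports.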
library(GGally)
ggpairs(nutrient[, -1],
title = "Correlation Matrix",
progress = FALSE)
The coefficient of determination is the square of the correlation. About 39% of the variation in iron intake is explained by protein intake. Conversely, about 39% of the variation in protein intake is explained by iron intake.
knitr::kable(
cor(nutrient[, -1])^2,
caption = "Coefficient of Determination Matrix")
| | Calcium | Iron | Protien | A | C |
|---|---|---|---|---|---|
| Calcium | 1.0000000 | 0.1563650 | 0.2501882 | 0.0249027 | 0.0525377 |
| Iron | 0.1563650 | 1.0000000 | 0.3885854 | 0.0594338 | 0.0977193 |
| Protien | 0.2501882 | 0.3885854 | 1.0000000 | 0.0215377 | 0.0449724 |
| A | 0.0249027 | 0.0594338 | 0.0215377 | 1.0000000 | 0.0336806 |
| C | 0.0525377 | 0.0977193 | 0.0449724 | 0.0336806 | 1.0000000 |
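The total variation is the trace of the variance-covariance matrix, here 2,832,668.8. The variance of Vitamin A (2,668,452.4) accounts for about 94% of this total, so the total variation is dominated by the variable with the largest variance.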
sum(diag(cov(nutrient[, -1])))
## [1] 2832669
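To see how each nutrient contributes to the total, the sketch below re-enters the diagonal of the variance-covariance matrix by hand and expresses each variance as a share of the trace. The object names are illustrative, and the inputs are the rounded values from the printed table.

```r
# Variances (diagonal of the variance-covariance matrix), copied from the table
variances <- c(Calcium = 157829.4439,
               Iron    = 35.81054,
               Protien = 934.8769,
               A       = 2668452.371,
               C       = 5416.2641)

# Total variation is the sum of the variances (the trace)
total_variation <- sum(variances)

# Share of the total variation contributed by each variable
share <- variances / total_variation
round(share, 4)
```

Since the variances depend on the measurement units, these shares would change if a variable were rescaled, which is one reason analyses such as principal components are often run on standardized variables.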
The generalized variance is the determinant of the variance-covariance matrix, here about \(2.83 \times 10^{19}\).
det(cov(nutrient[, -1]))
## [1] 2.831042e+19
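The link between the generalized variance and correlation can be made explicit through the identity \(|S| = \left(\prod_{j} s_j^2\right) |R|\), where \(R\) is the correlation matrix. The sketch below re-enters the printed variance-covariance matrix by hand (so values are rounded) and checks the identity numerically. The object names are assumptions introduced for illustration.

```r
# Variance-covariance matrix copied from the printed table (rounded values)
vars <- c("Calcium", "Iron", "Protien", "A", "C")
S <- matrix(c(
  157829.4439,  940.08944,  6075.8163,  102411.127,   6701.6160,
     940.08944,  35.81054,   114.05803,   2383.15341,   137.67199,
    6075.8163,  114.05803,   934.8769,    7330.0515,    477.1998,
  102411.127,  2383.15341,  7330.0515, 2668452.371,   22063.2486,
    6701.6160,  137.67199,   477.1998,   22063.2486,   5416.2641),
  nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

# Generalized variance
gv <- det(S)

# Product of the variances times the determinant of the correlation matrix
prod_var <- prod(diag(S))
det_R    <- det(cov2cor(S))

c(generalized_variance = gv, product_form = prod_var * det_R)
det_R  # between 0 and 1; equals 1 only if all variables are uncorrelated
```

The product of the variances is the value the generalized variance would take if the nutrients were uncorrelated, and `det_R` is the factor by which the observed correlations reduce it.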