Joint Probabilities, Independence, Correlation and Covariance in R
It can be useful (or not useful) to plot the marginal probabilities in the margins of a chart or table…
head(penguins_dt)
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fctr> <fctr> <num> <num> <int> <int>
## 1: Adelie Torgersen 39.1 18.7 181 3750
## 2: Adelie Torgersen 39.5 17.4 186 3800
## 3: Adelie Torgersen 40.3 18.0 195 3250
## 4: Adelie Torgersen NA NA NA NA
## 5: Adelie Torgersen 36.7 19.3 193 3450
## 6: Adelie Torgersen 39.3 20.6 190 3650
## sex year
## <fctr> <int>
## 1: male 2007
## 2: female 2007
## 3: female 2007
## 4: <NA> 2007
## 5: female 2007
## 6: male 2007
There are two distinct clusters in the data, corresponding to two
classes of penguin…
library(ggplot2)
library(ggExtra)
# Create a joint density plot
p <- ggplot(dt, aes(x = Distance, y = Speed)) +
  geom_point(color = "lightblue", alpha = 0.85) +
  geom_density_2d(color = "steelblue", alpha = 0.75) +
  theme_minimal()
# Add marginal densities
ggMarginal(p, type = "density", color = "steelblue", fill = "lightblue")
A joint probability mass function \(f_{XY}(x_i,y_j)\) describes the probability distribution of a pair of discrete random variables \(X\) and \(Y\) over the value pairs \((x_i,y_j)\), where \(0\leq f_{XY}(x_i,y_j)\leq 1\) and \(\sum_{i=1}^N\sum_{j=1}^Mf_{XY}(x_i,y_j)=1\). The joint probability mass function can be visualised as a joint probability table in which an event corresponds to a subset of the cells, and its probability is the sum over those cells: \[P(E) = \sum_{(x_i,y_j)\in E}f_{XY}(x_i,y_j)\] Similarly, the joint cumulative distribution function is given by: \[F_{XY}(x,y)=\sum_{x_i\leq x}\sum_{y_j\leq y}\:f_{XY}(x_i,y_j)\]
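The table view can be sketched in base R. The pmf values below are invented purely for illustration (they are not taken from the penguin data), and the names `joint_pmf`, `p_event` and `joint_cdf` are hypothetical.

```r
# Hypothetical joint pmf for X in {0, 1, 2} (rows) and Y in {0, 1} (columns)
x_vals <- c(0, 1, 2)
y_vals <- c(0, 1)
joint_pmf <- matrix(
  c(0.10, 0.20,
    0.30, 0.15,
    0.05, 0.20),
  nrow = 3, byrow = TRUE,
  dimnames = list(x = as.character(x_vals), y = as.character(y_vals))
)

# A valid pmf is non-negative and its cells sum to 1
stopifnot(all(joint_pmf >= 0), isTRUE(all.equal(sum(joint_pmf), 1)))

# Marginal pmfs are the row and column totals of the table
marginal_x <- rowSums(joint_pmf)
marginal_y <- colSums(joint_pmf)

# P(E) for the event E = {X >= 1 and Y = 1}: sum the matching cells
p_event <- sum(joint_pmf[x_vals >= 1, y_vals == 1])

# Joint cdf F(x, y): sum every cell with x_i <= x and y_j <= y
joint_cdf <- function(x, y) {
  sum(joint_pmf[x_vals <= x, y_vals <= y, drop = FALSE])
}

p_event          # 0.35
joint_cdf(1, 0)  # 0.40
```

Logical indexing selects the cells that make up the event, so the event probability and the cdf are both plain sums over a sub-table.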
A joint probability density function \(f_{XY}(x,y)\) describes the probability distribution of a pair of continuous random variables \(X\) and \(Y\), where \(0\leq f_{XY}(x,y)\) and \(\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f_{XY}(x,y)\:dx\:dy=1\). The joint probability density function can be visualised on a 2-d plot in which an event corresponds to a subset of the area on the canvas.
The probability of an event is the integral of the joint probability density function over that region: \[\qquad\qquad P(E) = \int\int_Ef_{XY}(x,y)\:dx\:dy\qquad\qquad\] The joint cumulative distribution function is: \[F_{XY}(x,y)=\int_{-\infty}^y\int_{-\infty}^x\:f_{XY}(u,v)\:du\:dv\] and the density is recovered from it by differentiation: \[f_{XY}(x,y)=\frac{\partial^2}{\partial x\,\partial y}F_{XY}(x,y)\]
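As a rough numerical check of these integrals, the sketch below uses a made-up density \(f_{XY}(x,y)=x+y\) on the unit square (zero elsewhere), chosen only because its integrals are easy to verify by hand. The helper `prob_rect` is hypothetical and simply nests base R's `integrate()`.

```r
# Hypothetical joint pdf: f(x, y) = x + y on [0, 1] x [0, 1], zero elsewhere
joint_pdf <- function(x, y) {
  ifelse(x >= 0 & x <= 1 & y >= 0 & y <= 1, x + y, 0)
}

# P(x_lo <= X <= x_hi, y_lo <= Y <= y_hi) as a nested one-dimensional integral
prob_rect <- function(x_lo, x_hi, y_lo, y_hi) {
  # Inner integral over x, evaluated separately for each y that integrate() requests
  inner <- function(y) {
    sapply(y, function(yy) {
      integrate(function(x) joint_pdf(x, yy), x_lo, x_hi)$value
    })
  }
  integrate(inner, y_lo, y_hi)$value
}

total_mass <- prob_rect(0, 1, 0, 1)      # should be close to 1
p_event <- prob_rect(0, 0.5, 0, 0.5)     # also F(0.5, 0.5) for this density
```

For this density the cdf on the unit square has the closed form \(F_{XY}(x,y)=\tfrac{1}{2}x^2y+\tfrac{1}{2}xy^2\), so \(F_{XY}(0.5,0.5)=0.125\), which the numerical result should match to within integration error.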
For independent events \(A\) and \(B\): \[P(A\cap B)=P(A)P(B)\] …that is to say, any event defined by \(X\) is independent of any event defined by \(Y\) (they can be accurately represented on two independent sample spaces).
Jointly-distributed random variables \(X\) and \(Y\) are independent if their joint cdf or pdf is the product of the marginal cdfs and pdfs: \[F_{XY}(x,y)=F_X(x)F_Y(y)\] \[f_{XY}(x,y)=f_X(x)f_Y(y)\] For independent random variables, each “slice” or “line” through the joint PDF is just a scaled version of the other marginal PDF.
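For a discrete table, the product rule can be checked directly. This is a minimal sketch with invented marginals; `is_independent` is a hypothetical helper, and the tolerance is an arbitrary choice to absorb floating-point error.

```r
# Hypothetical marginal pmfs
p_x <- c(0.2, 0.5, 0.3)
p_y <- c(0.4, 0.6)

# Under independence the joint pmf is the outer product of the marginals
independent_joint <- outer(p_x, p_y)

# Compare a joint table with the product of its own marginals
is_independent <- function(joint, tol = 1e-12) {
  product_of_marginals <- outer(rowSums(joint), colSums(joint))
  all(abs(joint - product_of_marginals) < tol)
}

# A table whose cells do not factorise (values invented for illustration)
dependent_joint <- matrix(
  c(0.10, 0.20,
    0.30, 0.15,
    0.05, 0.20),
  nrow = 3, byrow = TRUE
)

is_independent(independent_joint)  # TRUE
is_independent(dependent_joint)    # FALSE

# "Slice" view: each column of an independent table is p_x scaled by p_y[j]
slice_rescaled <- independent_joint[, 1] / p_y[1]
```

Dividing a column by its marginal probability recovers `p_x`, which is the discrete analogue of every slice through the joint PDF being a scaled copy of the other marginal.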
Correlation is given by: \[\rho=\boxed{\text{Cor}(X,Y)=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}}\]
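The same ratio can be checked on a sample using base R's `cov()`, `sd()` and `cor()`. The data here are simulated only for illustration; the coefficients 0.6 and 0.8 are arbitrary assumptions.

```r
set.seed(1)

# Simulated, linearly related sample (hypothetical data)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)

# Correlation as covariance scaled by both standard deviations
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# Should agree with the built-in Pearson correlation
isTRUE(all.equal(rho_manual, cor(x, y)))
```

Dividing by \(\sigma_X\sigma_Y\) removes the units of \(X\) and \(Y\), which is why the result always lies between -1 and 1.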
If \(X\) and \(Y\) are discrete random variables with joint pmf \(p(x_i,y_j)\), then the covariance is: \[\begin{align} \text{Cov}(X,Y)&=E\left[(X-\mu_X)(Y-\mu_Y)\right]\\ &=\sum_i\sum_j p(x_i,y_j)(x_i-\mu_X)(y_j-\mu_Y)\\ &=\sum_i\sum_j p(x_i,y_j)(x_iy_j-x_i\mu_Y-\mu_Xy_j+\mu_X\mu_Y)\\ &=\sum_i\sum_j p(x_i,y_j)x_iy_j-\mu_Y\sum_i\sum_j p(x_i,y_j)x_i-\mu_X\sum_i\sum_j p(x_i,y_j)y_j+\mu_X\mu_Y\sum_i\sum_j p(x_i,y_j)\\ &=E\left[XY\right]-\mu_Y\mu_X-\mu_X\mu_Y+\mu_X\mu_Y\\ &=E\left[XY\right]-\mu_X\mu_Y \end{align}\]
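Both forms of the covariance can be computed from a joint pmf table and compared. The table values are invented for illustration, and all variable names are hypothetical.

```r
# Hypothetical joint pmf for X in {0, 1, 2} (rows) and Y in {0, 1} (columns)
x_vals <- c(0, 1, 2)
y_vals <- c(0, 1)
joint_pmf <- matrix(
  c(0.10, 0.20,
    0.30, 0.15,
    0.05, 0.20),
  nrow = 3, byrow = TRUE
)

# Means from the marginal pmfs
mu_x <- sum(x_vals * rowSums(joint_pmf))
mu_y <- sum(y_vals * colSums(joint_pmf))

# Definition: sum over cells of p * (x - mu_x) * (y - mu_y)
cov_definition <- sum(joint_pmf * outer(x_vals - mu_x, y_vals - mu_y))

# Shortcut: E[XY] - mu_x * mu_y
e_xy <- sum(joint_pmf * outer(x_vals, y_vals))
cov_shortcut <- e_xy - mu_x * mu_y

c(definition = cov_definition, shortcut = cov_shortcut)
```

`outer()` builds the grid of products \(x_iy_j\) with the same shape as the table, so each cell is weighted by its own probability.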
The bivariate normal distribution has density: \[f(x,y)=\frac{e^{\frac{-1}{2(1-\rho^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2}+\frac{(y-\mu_Y)^2}{\sigma_Y^2}-\frac{2\rho(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y}\right]}}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}}\] with covariance matrix \(\begin{bmatrix}\sigma_X^2&\sigma_{XY}\\\sigma_{YX}&\sigma_{Y}^2\end{bmatrix}\), where \(\sigma_{XY}=\sigma_{YX}=\rho\sigma_X\sigma_Y\).
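The density can be written out in base R as a direct transcription of the formula. This is only a sketch: `bivariate_normal_pdf` and `covariance_matrix` are hypothetical helper names, and the parameter values in the example are arbitrary.

```r
# Bivariate normal density, transcribed term by term from the formula
bivariate_normal_pdf <- function(x, y, mu_x = 0, mu_y = 0,
                                 sigma_x = 1, sigma_y = 1, rho = 0) {
  zx <- (x - mu_x) / sigma_x
  zy <- (y - mu_y) / sigma_y
  quad <- zx^2 + zy^2 - 2 * rho * zx * zy
  exp(-quad / (2 * (1 - rho^2))) /
    (2 * pi * sigma_x * sigma_y * sqrt(1 - rho^2))
}

# Covariance matrix: variances on the diagonal, rho * sigma_x * sigma_y off it
covariance_matrix <- function(sigma_x, sigma_y, rho) {
  sigma_xy <- rho * sigma_x * sigma_y
  matrix(c(sigma_x^2, sigma_xy,
           sigma_xy,  sigma_y^2), nrow = 2, byrow = TRUE)
}

# With rho = 0 the density factorises into the two marginal normal densities
joint_value <- bivariate_normal_pdf(0.3, -1.2, mu_x = 1, mu_y = 2,
                                    sigma_x = 2, sigma_y = 0.5, rho = 0)
product_value <- dnorm(0.3, mean = 1, sd = 2) * dnorm(-1.2, mean = 2, sd = 0.5)
isTRUE(all.equal(joint_value, product_value))
```

The check with `rho = 0` ties the formula back to the independence condition \(f_{XY}(x,y)=f_X(x)f_Y(y)\): uncorrelated jointly normal variables are independent.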
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select