Multivariate distributions are designed to describe the probability distributions of more than one random variable at the same time. Since the variables are often correlated, exploring them individually wpuld only provide limited insight
# Read in the wine dataset
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep = ",")
# Print the first four entries
head(wine, n = 4)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14
## 1 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
# Find the dimensions of the data
dim(wine)
## [1] 178 14
# Check the names of the wine dataset
names(wine)
## [1] "V1" "V2" "V3" "V4" "V5" "V6" "V7" "V8" "V9" "V10" "V11" "V12"
## [13] "V13" "V14"
# Assign new names
colnames(wine) <- c('Type', 'Alcohol', 'Malic', 'Ash', 'Alcalinity', 'Magnesium', 'Phenols', 'Flavanoids', 'Nonflavanoids','Proanthocyanins', 'Color', 'Hue', 'Dilution', 'Proline')
# Check the new column names
names(wine)
## [1] "Type" "Alcohol" "Malic" "Ash"
## [5] "Alcalinity" "Magnesium" "Phenols" "Flavanoids"
## [9] "Nonflavanoids" "Proanthocyanins" "Color" "Hue"
## [13] "Dilution" "Proline"
# Check data type/structure of each variable
str(wine)
## 'data.frame': 178 obs. of 14 variables:
## $ Type : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
# Change the Type variable data type
wine$Type <- as.factor(wine$Type)
# Check data type/structure again
str(wine)
## 'data.frame': 178 obs. of 14 variables:
## $ Type : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoids : num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins: num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ Dilution : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
For multivaritate data, the mean identifies the location of the distribution in a multidimensional space
The variance-covariance measures spread of multivariate data in several directions
# Calculate the mean of the Alcohol, Malic, Ash, and Alcalinity variables
colMeans(wine[, 2:5])
## Alcohol Malic Ash Alcalinity
## 13.000618 2.336348 2.366517 19.494944
# Calculate the mean of the variables by wine type
by(wine[2:5], wine$Type, colMeans)
## wine$Type: 1
## Alcohol Malic Ash Alcalinity
## 13.744746 2.010678 2.455593 17.037288
## ------------------------------------------------------------
## wine$Type: 2
## Alcohol Malic Ash Alcalinity
## 12.278732 1.932676 2.244789 20.238028
## ------------------------------------------------------------
## wine$Type: 3
## Alcohol Malic Ash Alcalinity
## 13.153750 3.333750 2.437083 21.416667
# Calculate the variance-covariance matrix of the variables Alcohol, Malic, Ash, Alcalinity
var.wine <- var(wine[2:5])
# Round the matrix values to two decimal places
round(var.wine, 2)
## Alcohol Malic Ash Alcalinity
## Alcohol 0.66 0.09 0.05 -0.84
## Malic 0.09 1.25 0.05 1.08
## Ash 0.05 0.05 0.08 0.41
## Alcalinity -0.84 1.08 0.41 11.15
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
# Calculate the correlation matrix
cor.wine <- cor(wine[, 2:5])
# Round the matrix to two decimal places
round(cor.wine, 2)
## Alcohol Malic Ash Alcalinity
## Alcohol 1.00 0.09 0.21 -0.31
## Malic 0.09 1.00 0.16 0.29
## Ash 0.21 0.16 1.00 0.44
## Alcalinity -0.31 0.29 0.44 1.00
# Plot the correlations
corrplot(cor.wine, method = "ellipse")
library(lattice)
# Scatter plot matrix using the base R plot function
pairs(wine[,2:5])
# Scatter plot matrix colored by groups
splom( ~ wine[,2:5], pch = 16, col = wine$Type)
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Color the points by wine type
wine.gg <- ggpairs(data = wine, mapping = aes(color = Type), columns = 2:5)
wine.gg
It is a generalization of the univariate normal and is specified by a mean vector and variance covariance matrix. Even if all variables individually follow a univariate normal, the joint distribution might not be a multivariate normal.
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 4.0.3
mu.sim <- c(2, -2, 1)
sigma.sim <- matrix(data = c(9,5,5,5,5,3,5,3,4), nrow = 3)
# Generate 100 bivariate normal samples
multnorm.sample <- rmvnorm(n = 100, mean = mu.sim, sigma = sigma.sim)
# View the first 6 samples
head(multnorm.sample)
## [,1] [,2] [,3]
## [1,] -0.1031677 -4.0362852 -1.0112424
## [2,] 10.4023207 3.6987711 7.4120582
## [3,] -0.8593295 -0.5760292 -1.1337082
## [4,] 3.5361230 -0.6839956 1.3939369
## [5,] 0.3137523 -3.5272389 -0.7769091
## [6,] 5.4426615 1.4807898 2.8038625
# Scatterplot of the bivariate samples
ggpairs(data.frame(multnorm.sample))
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.0.3
# Calculate density
multnorm.dens <- dmvnorm(multnorm.sample, mean = mu.sim, sigma = sigma.sim)
# Create scatter plot of density heights
scatterplot3d(cbind(multnorm.sample, multnorm.dens),
color="blue", pch="", type = "h",
xlab = "x", ylab = "y", zlab = "density")
# Volume under a bivariate standard normal
pmvnorm(lower = c(-1, -1), upper = c(1, 1))
## [1] 0.4660649
## attr(,"error")
## [1] 1e-15
## attr(,"msg")
## [1] "Normal Completion"
# Volume under specified mean and variance-covariance matrix
pmvnorm(lower = c(-5, -5), upper = c(5, 5), mean = mu.sim, sigma = sigma.sim)
## Warning in cbind(lower, upper, mean): number of rows of result is not a multiple
## of vector length (arg 1)
## [1] 0.7485242
## attr(,"error")
## [1] 0.0001896578
## attr(,"msg")
## [1] "Normal Completion"
# Probability contours for a standard bivariate normal
qmvnorm(0.9, tail = "both", sigma = diag(2))
## $quantile
## [1] 1.948779
##
## $f.quantile
## [1] -1.537507e-06
##
## attr(,"message")
## [1] "Normal Completion"
# Probability contours for a bivariate normal
qmvnorm(0.95, tail = "both", mean = mu.sim, sigma = sigma.sim)
## $quantile
## [1] 7.253087
##
## $f.quantile
## [1] 5.054298e-06
##
## attr(,"message")
## [1] "Normal Completion"
If any single variable fails to follow normality we cannot have joint multivariate normality
# Test sample normality
qqnorm(multnorm.sample[, 1])
qqline(multnorm.sample[, 1])
library(MVN)
## Warning: package 'MVN' was built under R version 4.0.3
## sROC 0.1-2 loaded
# Create qqnorm plot
# uniPlot(wine[, 2:5], type = "qqplot")
# mardiaTest qqplot
wine.mvntest <- mardiaTest(wine[, 2:5], qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
wine.mvntest
## [1] "'mardiaTest' is deprecated.\nUse 'mvn' instead.\nSee help(\"Deprecated\")"
# Use mardiaTest
mardiaTest(multnorm.sample)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
# Use hzTest
hzTest(wine[, 2:5])
## Warning: 'hzTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
It can model heavier tails and are generalization of univariate Stundent’s t-distribution
\[ t_{df}(\delta, \Sigma) \]
Where \(df\) are the degrees of freedom in all dimensions
# Generate the t-samples
multt.sample <- rmvt(n = 200,sigma = sigma.sim, df = 5, delta = mu.sim)
# Print the first 6 samples
head(multt.sample)
## [,1] [,2] [,3]
## [1,] 3.130277 -1.9400025 0.9762581
## [2,] 1.361335 -3.0098776 0.1205345
## [3,] 3.802967 -0.7209706 2.6175912
## [4,] -1.776014 -6.8858459 -2.2717340
## [5,] 4.637783 -1.2810504 3.8066385
## [6,] 3.051600 -2.3924482 1.4898844
# Check multivariate normality
mardiaTest(multt.sample, qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
# Calculate densities
multt.dens <- dmvt(x = multt.sample, delta = mu.sim, sigma = sigma.sim, df = 5)
# Plot 3D heights of densities
#scatterplot3d(cbind(multt.sample, multt.dens),
# color = "blue", pch = "", type = "h",
# xlab = "x", ylab = "y", zlab = "density")
# Calculate the volume under a t-distribution
pmvt(lower = c(-5, -5), upper = c(5, 5), delta = mu.sim, df = 5, sigma = sigma.sim)
## Warning in cbind(lower, upper, mean): number of rows of result is not a multiple
## of vector length (arg 1)
## [1] 0.6434244
## attr(,"error")
## [1] 0.0003743722
## attr(,"msg")
## [1] "Normal Completion"
# Calculate the equal probability contour
qmvt(p = 0.9, tail = "both", sigma = diag(2))
## $quantile
## [1] 8.956747
##
## $f.quantile
## [1] -1.35426e-07
##
## attr(,"message")
## [1] "Normal Completion"
\(SN(\xi, \omega, \alpha)\)
library(sn)
## Warning: package 'sn' was built under R version 4.0.3
## Loading required package: stats4
##
## Attaching package: 'sn'
## The following object is masked from 'package:stats':
##
## sd
# Generate the skew-normal samples
skewnorm.sample <- rmsn(n = 100, xi = mu.sim, Omega = sigma.sim, alpha = c(4, -4))
## Warning in matrix(dp$Omega, d, d): la longitud de los datos [9] no es un
## submúltiplo o múltiplo del número de filas [2] en la matriz
## Warning in dp[[1]] + mu.z * omega: longitud de objeto mayor no es múltiplo de la
## longitud de uno menor
## Warning in dp0$xi + lot$aux$omega * z: longitud de objeto mayor no es múltiplo
## de la longitud de uno menor
# Print first six samples
head(skewnorm.sample)
## [,1] [,2]
## [1,] -0.27268774 -4.0253926
## [2,] 1.00484814 0.2731621
## [3,] 0.05980053 2.5412612
## [4,] 2.60591642 -1.8870326
## [5,] 3.63497414 1.4335588
## [6,] 0.18127730 2.1798305
# Generate the skew-t samples
skewt.sample <- rmst(n = 100, xi = mu.sim, Omega = sigma.sim, alpha = c(4, -4), nu = 5)
## Warning in matrix(dp$Omega, d, d): la longitud de los datos [9] no es un
## submúltiplo o múltiplo del número de filas [2] en la matriz
## Warning in xi + t(z/sqrt(v)): longitud de objeto mayor no es múltiplo de la
## longitud de uno menor
# Print first six samples
head(skewt.sample)
## [,1] [,2]
## [1,] 3.7659280 -3.5896138
## [2,] 3.4277042 0.7036523
## [3,] 0.1525090 1.5846982
## [4,] 0.6648123 -4.6884818
## [5,] -6.6229018 -4.6319690
## [6,] -5.9962971 -2.4466693
# Contour plot for skew-normal sample
#ggplot(data.frame(skewnorm.sample), aes(x = x, y = y)) + geom_point() + geom_density_2d()
# Normality test for skew-t sample
skewt.Test <- mardiaTest(skewt.sample, qqplot = TRUE)
## Warning: 'mardiaTest' is deprecated.
## Use 'mvn' instead.
## See help("Deprecated")
# Calculate PCs
pca.state <- princomp(state.x77, cor = TRUE, scores = TRUE)
# Plot the PCA object
plot(pca.state)
# Print the summary of the PCs
summary(pca.state)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.8970755 1.2774659 1.0544862 0.84113269 0.62019488
## Proportion of Variance 0.4498619 0.2039899 0.1389926 0.08843803 0.04808021
## Cumulative Proportion 0.4498619 0.6538519 0.7928445 0.88128252 0.92936273
## Comp.6 Comp.7 Comp.8
## Standard deviation 0.55449226 0.3800642 0.33643379
## Proportion of Variance 0.03843271 0.0180561 0.01414846
## Cumulative Proportion 0.96779544 0.9858515 1.00000000
# Variance explained by each PC
pca.var <- pca.state$sdev^2
# Proportion of variance explained by each PC
pca.pvar <- pca.var/sum(pca.var)
# Proportion of variance explained by each principal component
pca.pvar
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.44986195 0.20398990 0.13899264 0.08843803 0.04808021 0.03843271 0.01805610
## Comp.8
## 0.01414846
# Cumulative variance explained plot
plot(cumsum(pca.pvar), xlab = "Principal component", ylab = "Cumulative Proportion of variance explained", ylim = c(0,1), type = 'b')
grid()
# Add a horizontal line
abline(h = 0.95, col = "blue")
# Draw screeplot
screeplot(pca.state, type = "l")
grid()
Show the weights that were used to construct the PCA. the blank entries in the data frame are values that are close to zero and have been ommitted to provide a clear interpretation.
# Create dataframe of scores
scores.state <- data.frame(pca.state$scores)
# Plot of scores colored by region
ggplot(data = scores.state, aes(x = Comp.1, y = Comp.2, label = rownames(scores.state), color = state.region)) +
geom_text(alpha = 0.8, size = 3) +
ggtitle("PCA of states data colored by region")
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_ind(pca.state)
fviz_pca_var(pca.state)
fviz_pca_biplot(pca.state)
Technique for performing dimension reduction. The primary goal of multidimensional scaling is to providea visual representation of the pattern of proximities, meaning the similarities or distances, among a set of objects.
This is accomplished by assigning objects to specific locations in a conceptual space usually in two or three dimensions, so the distances between points in the space match the given similarities as closely as possible
# Calculate distance
state.dist <- dist(state.x77)
# Perform multidimensional scaling
mds.state <- cmdscale(state.dist)
# Change mds.state to a dataframe for use in ggplot
mds.state_df <- data.frame(mds.state)
# Plot the representation of the data in two dimensions
ggplot(data = mds.state_df, aes(x = X1, y = X2, label = rownames(mds.state), color = state.region)) +
geom_text(alpha = 0.8, size = 3)
# Calculate distance
wine.dist <- dist(wine[,-1])
# Perform multidimensional scaling
mds.wine <- cmdscale(wine.dist, k = 3)
mds.wine_df <- data.frame(mds.wine)
# Plot the representation of the data in three dimensions
scatterplot3d(mds.wine_df, color = wine$Type, pch = 19, type = "h", lty.hplot = 2)