Define Numer of variables nvars and numer of samples n.
n <- 20 # number of samples
nvars <- 3 # number of variables
Define the correlation matrix sigma, with dimension nvars x nvars which represents the correlation between each variable.
Of course, the diagonal must be = 1.0.
# correlation matrix (symmetric!)
sigma <- matrix(rep(0, nvars*nvars), nrow = nvars, ncol = nvars) # uncorrelated
sigma[1,] <- c(1.0, 0.5, 0.0)
sigma[2,] <- c(0.5, 1.0, 0.8)
sigma[3,] <- c(0.0, 0.8, 1.0)
If n is large, LHS funciton may not converge. Try increasing eps, at cost of reducing accuracy.
set.seed(123)
corrLHS <- pse::LHS(factors = nvars, N = n, method = "HL", opts = list(COR = sigma, eps = 0.05))
XX <- pse::get.data(corrLHS)
head(XX)
## I1 I2 I3
## 1 0.725 0.425 0.175
## 2 0.925 0.975 0.825
## 3 0.675 0.925 0.975
## 4 0.125 0.325 0.475
## 5 0.475 0.475 0.375
## 6 0.075 0.125 0.575
plot(XX)
All variables are in [0, 1] interval.
Note that can be any kind of distribution
df <- data.frame(var1 = qlnorm(XX[, 1], 10, 2),
var2 = qnorm(XX[, 2], 5, 0.5),
var3 = qunif(XX[, 3], 0, 1))
The correlation of sampled variables should be equal to sigma. Increasing eps (necessary for large number of samples) reduces the accuracy and therefore there may be difference between sigma and the sampled variables correlation.
round(cor(df, method = "spearman"), 2)
## var1 var2 var3
## var1 1.00 0.50 0.02
## var2 0.50 1.00 0.77
## var3 0.02 0.77 1.00
sigma
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.0
## [2,] 0.5 1.0 0.8
## [3,] 0.0 0.8 1.0
df[ , 1] = log10(df[ , 1])
psych::pairs.panels(df,
method = "spearman", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE) # show correlation ellipses
What you obtain in this step are navrs normal distributions, correlated between each other acocording to sigma.
# n <- 1000
set.seed(123)
copula <- MASS::mvrnorm(n, mu = rep(0, nvars), Sigma = sigma, empirical = TRUE)
par(mfrow = c(1, 3))
hist(copula[ ,1])
hist(copula[ ,2])
hist(copula[ ,3])
XX <- pnorm(copula)
head(XX)
## [,1] [,2] [,3]
## [1,] 0.1564713 0.6764229 0.76505670
## [2,] 0.3125947 0.5049848 0.59719605
## [3,] 0.8290174 0.9836215 0.95746657
## [4,] 0.7305729 0.1816715 0.02272921
## [5,] 0.6366200 0.3544741 0.14755066
## [6,] 0.8733880 0.9929314 0.95192462
plot(as.data.frame(XX))
All variables are in [0, 1] interval.
Following steps are the same as for the case of LHS.
Note that can be any kind of distribution
df <- data.frame(var1 = qlnorm(XX[, 1], 10, 2),
var2 = qnorm(XX[, 2], 5, 0.5),
var3 = qunif(XX[, 3], 0, 1))
The correlation of sampled variables should be equal to sigma. In this case, since the sampling is random, greater is the number of samples n, closer to sigma is the sampled variables correlation.
round(cor(df, method = "spearman"), 2)
## var1 var2 var3
## var1 1.00 0.43 -0.08
## var2 0.43 1.00 0.75
## var3 -0.08 0.75 1.00
sigma
## [,1] [,2] [,3]
## [1,] 1.0 0.5 0.0
## [2,] 0.5 1.0 0.8
## [3,] 0.0 0.8 1.0
df[ , 1] = log10(df[ , 1])
psych::pairs.panels(df,
method = "spearman", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = TRUE) # show correlation ellipses