Set up RStudio

Setting up R Markdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that can help you communicate your data analysis and research findings more effectively.

Course Purpose

The purpose of this course is to impart knowledge and skills in the various methods used to analyze multivariate data.

Expected Learning Outcomes

  • Estimate the mean and variance of a multivariate normal distribution
  • Assess the distribution of the mean in multivariate normal data
  • Estimate parameters of the Wishart distribution
  • Calculate Hotelling's T^2 statistic from given data
  • Make inferences based on the T^2 statistic and the union-intersection principle
  • Carry out Multivariate Analysis of Variance (MANOVA)
  • Carry out the test of independence
  • Calculate the discriminant coefficient of given multivariate data
  • Carry out cluster and factor analysis of given multivariate data
  • Obtain the principal component of given multivariate data
  • Calculate the canonical correlation coefficient of given multivariate data

Estimate the mean and variance of a multivariate normal distribution

library(MASS)

# Set mean vector and covariance matrix
mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.5, 0.2,
                  0.5, 2, 0.7,
                  0.2, 0.7, 3), nrow = 3, ncol = 3)
head(Sigma,5)
     [,1] [,2] [,3]
[1,]  1.0  0.5  0.2
[2,]  0.5  2.0  0.7
[3,]  0.2  0.7  3.0
# Generate sample from multivariate normal distribution
set.seed(123)
sample <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
head(sample,10)
            [,1]        [,2]     [,3]
 [1,]  0.8985705  2.71939684 1.411937
 [2,]  1.6317920  2.79092281 1.938381
 [3,]  1.0589338  3.59973722 5.450206
 [4,]  2.0770323  1.76314070 3.071658
 [5,]  2.4365231  4.70174090 1.486549
 [6,]  0.5273357  2.66529265 6.413310
 [7,] -0.4449927  2.79818429 3.831340
 [8,] -1.1090918 -1.40574623 2.594450
 [9,]  2.1080602 -0.05643363 2.451312
[10,]  0.6246550  2.26885997 1.964628
# Calculate sample mean and sample covariance matrix
sample_mean <- colMeans(sample)
sample_mean
[1] 0.9678422 1.9774317 3.0536556
sample_cov <- cov(sample)
sample_cov
          [,1]      [,2]      [,3]
[1,] 0.9175884 0.4226797 0.1237425
[2,] 0.4226797 1.8737283 0.5748587
[3,] 0.1237425 0.5748587 3.1527724

Assess the distribution of the mean in multivariate normal data

library(MASS)

# Set mean vector and covariance matrix
mu <- c(1, 2, 3)
Sigma <- matrix(c(1, 0.5, 0.2,
                  0.5, 2, 0.7,
                  0.2, 0.7, 3), nrow = 3, ncol = 3)
# Generate multiple samples from multivariate normal distribution
set.seed(123)
n_samples <- 1000
sample_size <- 100
samples <- array(NA, dim = c(n_samples, sample_size, length(mu)))
for (i in 1:n_samples) {
  samples[i, , ] <- mvrnorm(n = sample_size, mu = mu, Sigma = Sigma)
}
# Calculate sample means for each sample (averaging over the observations,
# so each column of sample_means corresponds to one element of mu)
sample_means <- apply(samples, c(1, 3), mean)
head(sample_means, 10)
# Plot histogram of sample means
hist(sample_means[,1], breaks = 30, col = "grey", main = "Histogram of Sample Means",
     xlab = "Sample Mean", ylim = c(0, 140))

hist(sample_means[,2], breaks = 30, col = "grey", main = "Histogram of Sample Means",
     xlab = "Sample Mean", ylim = c(0, 140))

hist(sample_means[,3], breaks = 30, col = "grey", main = "Histogram of Sample Means",
     xlab = "Sample Mean", ylim = c(0, 140))

In this example, we generate 1000 samples of size 100 from a multivariate normal distribution with mean vector mu and covariance matrix Sigma. We then calculate the sample means for each sample using the apply() function. Finally, we plot histograms of the sample means for each of the three variables in the mean vector.

The resulting histogram should give us an idea of the distribution of the sample means. If the true distribution of the mean is normal, then the histogram should also look roughly normal, assuming that the sample size is large enough. We can also perform a normality test on the sample means using the shapiro.test() function:

# Perform normality test on sample means
shapiro.test(sample_means[,1])

    Shapiro-Wilk normality test

data:  sample_means[, 1]
W = 0.9986, p-value = 0.6211
shapiro.test(sample_means[,2])

    Shapiro-Wilk normality test

data:  sample_means[, 2]
W = 0.99648, p-value = 0.02399
shapiro.test(sample_means[,3])

    Shapiro-Wilk normality test

data:  sample_means[, 3]
W = 0.99801, p-value = 0.2852

The shapiro.test() function returns a p-value for the null hypothesis that the sample means are drawn from a normal distribution. If the p-value is less than 0.05, we reject the null hypothesis that the sample means are normally distributed. If the p-value is greater than 0.05, we fail to reject the null hypothesis, meaning that the sample means are consistent with a normal distribution.

Note that this is just one way to assess the distribution of the mean in multivariate normal data, and there may be other techniques and tests that are more appropriate depending on the specific research question and data.
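
For example, one complementary check (a minimal sketch using only base R and MASS) is a chi-square Q-Q plot of squared Mahalanobis distances: if the data are multivariate normal, the squared distances should approximately follow a chi-square distribution with degrees of freedom equal to the number of variables.

# Generate a single large sample and compute squared Mahalanobis distances
x_mvn <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)
d2 <- mahalanobis(x_mvn, center = colMeans(x_mvn), cov = cov(x_mvn))

# Compare against chi-square quantiles with df = number of variables
qqplot(qchisq(ppoints(length(d2)), df = ncol(x_mvn)), d2,
       xlab = "Chi-square quantiles", ylab = "Squared Mahalanobis distance")
abline(0, 1, col = "red")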

Calculate Hotelling's T^2 statistic from given data

Hotelling’s T^2 statistic is a multivariate statistical test used to compare two groups of multivariate observations, which is analogous to the t-test for univariate data. The T^2 statistic measures the distance between the mean vectors of the two groups in a multi-dimensional space, and takes into account the correlations between the variables.

To calculate Hotelling's T^2 statistic in R, you can use the hotelling.test() function from the Hotelling package; the rmvnorm() function from the mvtnorm package is used here to generate the example data. Here is an example of how to use the function:

# Load the mvtnorm package
library(mvtnorm)
library(Hotelling)

# Generate two groups of multivariate data
set.seed(123)
group1 <- rmvnorm(n = 10, mean = c(1, 2, 3), sigma = matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3))
group2 <- rmvnorm(n = 10, mean = c(2, 3, 4), sigma = matrix(c(1, 0.5, 0.5, 0.5, 1, 0.5, 0.5, 0.5, 1), nrow = 3))

group1
           [,1]      [,2]     [,3]
 [1,] 0.7847162 2.0182723 4.283206
 [2,] 1.5011941 2.5427573 4.664071
 [3,] 0.9744854 0.7540351 2.162890
 [4,] 0.9531536 3.1338408 3.522711
 [5,] 1.2729261 2.0678024 2.596499
 [6,] 2.3385262 2.4270213 1.684380
 [7,] 1.2981182 1.4678707 2.047119
 [8,] 0.3808584 0.8094952 2.019586
 [9,] 0.2106182 0.4599154 3.244993
[10,] 1.1718671 1.2586316 3.949997
group2
           [,1]     [,2]     [,3]
 [1,] 2.5435085 3.033306 4.874902
 [2,] 3.1838748 4.143886 5.049883
 [3,] 2.4355298 3.000073 3.827503
 [4,] 1.4285379 2.206340 3.550552
 [5,] 1.6029201 5.031374 5.351849
 [6,] 0.7361705 2.245446 3.200353
 [7,] 2.7754156 3.164946 4.403020
 [8,] 2.2855638 3.275435 5.273497
 [9,] 1.7795321 4.011483 2.844043
[10,] 2.6312698 3.305464 4.370579
# Carry out the test and extract the T^2 statistic
# (the statistic is stored in the stats component of the returned object)
test_result <- hotelling.test(x = group1, y = group2)
T2 <- test_result$stats$statistic

# Print the T^2 statistic
print(T2)
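
For reference, the two-sample T^2 statistic can also be computed directly from its textbook definition using the pooled sample covariance matrix. Here is a minimal sketch (the names n1, xbar1, S_pooled and T2_manual are introduced only for illustration); the result can be compared with the statistic reported by hotelling.test():

# Sample sizes, mean vectors and pooled covariance matrix
n1 <- nrow(group1); n2 <- nrow(group2)
xbar1 <- colMeans(group1); xbar2 <- colMeans(group2)
S_pooled <- ((n1 - 1) * cov(group1) + (n2 - 1) * cov(group2)) / (n1 + n2 - 2)

# Two-sample Hotelling T^2
d <- xbar1 - xbar2
T2_manual <- drop((n1 * n2 / (n1 + n2)) * t(d) %*% solve(S_pooled) %*% d)
T2_manual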

Carry out Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA) is a multivariate statistical technique used to test for differences in means of two or more groups of variables. In R, you can use the manova() function to perform MANOVA.

# Load the data
data(iris)
head(iris,5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
# Perform MANOVA
fit <- manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ Species, data = iris)
fit
Call:
   manova(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ 
    Species, data = iris)

Terms:
                 Species Residuals
Sepal.Length     63.2121   38.9562
Sepal.Width      11.3449   16.9620
Petal.Length    437.1028   27.2226
Petal.Width      80.4133    6.1566
Deg. of Freedom        2       147

Residual standard errors: 0.5147894 0.3396877 0.4303345 0.20465
Estimated effects may be unbalanced
# Print the results
summary(fit)
           Df Pillai approx F num Df den Df    Pr(>F)    
Species     2 1.1919   53.466      8    290 < 2.2e-16 ***
Residuals 147                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this example, we use the iris dataset that is included in R by default. The dataset contains measurements of sepal length, sepal width, petal length, and petal width for three different species of iris flowers. We use the manova() function to perform MANOVA on the four variables, where Species is the grouping variable. The cbind() function is used to combine the four variables into a single matrix.

Finally, we use the summary() function to print the results of the MANOVA. By default the output reports Pillai's trace, together with an approximate F statistic, its degrees of freedom, and the p-value for the test; Wilks' lambda can be requested with summary(fit, test = "Wilks"). Univariate ANOVA results for each of the four variables can be obtained separately with summary.aov(fit).

Note that MANOVA assumes that the data are normally distributed and have equal covariance matrices across groups. You should check these assumptions before interpreting the results.
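
As a sketch of how such checks might look (assuming the heplots package is installed), Box's M test compares the group covariance matrices, and summary.aov() gives the univariate ANOVA for each response:

# Box's M test for homogeneity of covariance matrices across species
library(heplots)
boxM(iris[, 1:4], iris$Species)

# Univariate follow-up ANOVAs for each response variable
summary.aov(fit)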

Carry out test of independence

To test the independence between two categorical variables in R, you can use the chi-squared test of independence, which is implemented in the chisq.test() function.

Here is an example of how to perform the chi-squared test of independence in R:

# Load the data
data(mtcars)
# Create a contingency table
contingency_table <- table(mtcars$cyl, mtcars$gear)
contingency_table
   
     3  4  5
  4  1  8  2
  6  2  4  1
  8 12  0  2
# Perform the chi-squared test of independence
test_result <- chisq.test(contingency_table)

# Print the results
print(test_result)

    Pearson's Chi-squared test

data:  contingency_table
X-squared = 18.036, df = 4, p-value = 0.001214

In this example, we use the mtcars dataset that is included in R by default. The dataset contains information about cars, including the number of cylinders (cyl) and the type of transmission (gear). We use the table() function to create a contingency table that shows the number of cars for each combination of cyl and gear. We then use the chisq.test() function to perform the chi-squared test of independence on the contingency table. The function returns the chi-squared statistic, the degrees of freedom, and the p-value. Finally, we print the results using the print() function.

Note that the chi-squared test of independence assumes that the observations are independent and the expected cell counts are at least 5. You should check these assumptions before interpreting the results.
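
For example, the expected counts can be inspected directly from the test object, and Fisher's exact test provides a common alternative when some of them are small (a short sketch using base R only):

# Inspect the expected cell counts under independence
test_result$expected

# If some expected counts are below 5, Fisher's exact test is an alternative
fisher.test(contingency_table)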

Calculate the discriminant coefficient of given multivariate data

To calculate the discriminant coefficients for a given multivariate dataset in R, you can use the lda() function from the MASS package, which performs Linear Discriminant Analysis (LDA).

Here is an example of how to calculate the discriminant coefficients using LDA in R:

# Load the required package and the data
library(MASS)
data(iris)
# Create a data frame with the predictors
X <- data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
# Create a factor with the class variable
y <- factor(iris$Species)
# Fit the LDA model
lda_model <- lda(X, y)

# Print the coefficients
lda_model$scaling
                         LD1         LD2
iris.Sepal.Length  0.8293776  0.02410215
iris.Sepal.Width   1.5344731  2.16452123
iris.Petal.Length -2.2012117 -0.93192121
iris.Petal.Width  -2.8104603  2.83918785

In this example, we use the iris dataset that is included in R by default. The dataset contains measurements of sepal length, sepal width, petal length, and petal width for three different species of iris flowers. We create a data frame X with the four predictors and a factor y with the class variable. We then fit the LDA model using the lda() function on X and y. Finally, we print the discriminant coefficients using the $scaling attribute of the LDA model.

The discriminant coefficients are a set of weights that are used to combine the predictors into a linear discriminant function. The function separates the observations into different classes based on their distances from the class centroids.
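
As an illustration of how the fitted discriminant functions are used (a short sketch based on the model fitted above), predict() returns both the discriminant scores and the predicted classes:

# Classify the training observations with the fitted discriminant functions
pred <- predict(lda_model)

# Discriminant scores (values of LD1 and LD2 for each observation)
head(pred$x)

# Confusion table of predicted versus actual species
table(Predicted = pred$class, Actual = y)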

Carry out cluster and factor analysis of given multivariate data

# Load the required packages
library(stats)
library(factoextra)

# Load the data
data(iris)
# Create a data frame with the predictors
X <- data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
# Perform hierarchical clustering
hc <- hclust(dist(X))
plot(hc)

# Perform k-means clustering
kmeans_clusters <- kmeans(X, 3)
fviz_cluster(kmeans_clusters, data = X)

# Perform factor analysis
fa <- factanal(X, factors = 1)
print(fa, digits=2)

Call:
factanal(x = X, factors = 1)

Uniquenesses:
iris.Sepal.Length  iris.Sepal.Width iris.Petal.Length  iris.Petal.Width 
             0.24              0.82              0.00              0.07 

Loadings:
                  Factor1
iris.Sepal.Length  0.87  
iris.Sepal.Width  -0.42  
iris.Petal.Length  1.00  
iris.Petal.Width   0.96  

               Factor1
SS loadings       2.86
Proportion Var    0.72

Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 85.51 on 2 degrees of freedom.
The p-value is 2.7e-19 

In this example, we use the iris dataset that is included in R by default. The dataset contains measurements of sepal length, sepal width, petal length, and petal width for three different species of iris flowers.

We create a data frame X with the four predictors. We then perform hierarchical clustering using the hclust() function and plot the dendrogram using the plot() function. We also perform k-means clustering using the kmeans() function and plot the clusters using the fviz_cluster() function from the factoextra package.

Finally, we perform factor analysis using the factanal() function, specifying a single factor (with only four observed variables there are not enough degrees of freedom to fit more than one factor). We print the results using the print() function.

Note that cluster analysis and factor analysis are unsupervised learning methods, meaning that they do not require a response variable to be specified. However, the interpretation of the results may require domain knowledge of the dataset being analyzed.
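
One way to aid that interpretation (a sketch using the hierarchical clustering fitted above) is to cut the dendrogram into a chosen number of clusters and cross-tabulate the memberships against the known species labels:

# Cut the dendrogram into three clusters and compare with the species labels
clusters <- cutree(hc, k = 3)
table(Cluster = clusters, Species = iris$Species)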

Obtain the principal component of given multivariate data

To obtain the principal components of a given multivariate dataset in R, you can use the prcomp() function.

Here is an example of how to obtain the principal components of a dataset in R:

# Load the data
data(iris)

# Create a data frame with the predictors
X <- data.frame(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)

# Run PCA
pca <- prcomp(X, scale. = TRUE)

# View summary of results
summary(pca)
Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.7084 0.9560 0.38309 0.14393
Proportion of Variance 0.7296 0.2285 0.03669 0.00518
Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# Plot the results
plot(pca, type = "l")

In this example, we use the iris dataset that is included in R by default. The dataset contains measurements of sepal length, sepal width, petal length, and petal width for three different species of iris flowers.

We create a data frame X with the four predictors. We then run PCA using the prcomp() function and set scale. = TRUE to scale the variables to have unit variance.

We can view a summary of the results using the summary() function, which shows the standard deviation of each principal component and the proportion (and cumulative proportion) of variance it explains. The loadings (the weights of the original variables in each component) are stored in pca$rotation, and the scores (the values of each observation on each component) are stored in pca$x.

We can also plot the results using the plot() function, which produces a scree plot of the variances of the principal components; a biplot of the loadings and scores can be drawn with biplot(pca).
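
For reference, here is a short sketch of extracting the loadings and scores and drawing the biplot:

# Loadings: weights of the original variables in each principal component
pca$rotation

# Scores: values of each observation on each principal component
head(pca$x)

# Biplot of the first two principal components
biplot(pca)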

Note that the principal components returned by prcomp() can also be used as predictors in a regression model, an approach known as principal components regression (PCR).

Calculate the canonical correlation coefficient of given multivariate data

To calculate the canonical correlation coefficients of a given multivariate dataset in R, you can use the cancor() function; in the example below we use the version provided by the candisc package, which produces a detailed summary.

Here is an example of how to calculate the canonical correlation coefficients of a dataset in R:

# load the following library
library(stats)
library(heplots)
library(candisc)
# Load the data
data(mtcars)

# Create a data frame with the predictors
X <- data.frame(mtcars$mpg, mtcars$cyl, mtcars$disp, mtcars$hp, mtcars$drat)

# Create a data frame with the response variables
Y <- data.frame(mtcars$qsec, mtcars$vs, mtcars$am, mtcars$gear, mtcars$carb)

# Run canonical correlation analysis
cca <- cancor(X, Y)

# View summary of results
summary(cca)

Canonical correlation analysis of:
     5   X  variables:  mtcars.mpg, mtcars.cyl, mtcars.disp, mtcars.hp, mtcars.drat 
  with   5   Y  variables:  mtcars.qsec, mtcars.vs, mtcars.am, mtcars.gear, mtcars.carb 

     CanR   CanRSQ     Eigen  percent    cum                          scree
1 0.95424 0.910573 10.182325 77.01444  77.01 ******************************
2 0.85411 0.729498  2.696831 20.39760  97.41 ********                      
3 0.49651 0.246527  0.327188  2.47470  99.89 *                             
4 0.10667 0.011378  0.011509  0.08705  99.97                               
5 0.05877 0.003453  0.003465  0.02621 100.00                               

Test of H0: The canonical correlations in the 
current row and all that follow are zero

     CanR LR test stat approx F numDF  denDF   Pr(> F)    
1 0.95424      0.01796   6.4946    25 83.228 4.095e-11 ***
2 0.85411      0.20080   3.0635    16 70.904 0.0006133 ***
3 0.49651      0.74233   0.8474     9 58.560 0.5761997    
4 0.10667      0.98521   0.0935     4 50.000 0.9840856    
5 0.05877      0.99655   0.0901     1 26.000 0.7664349    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Raw canonical coefficients

   X  variables: 
                  Xcan1      Xcan2      Xcan3      Xcan4     Xcan5
mtcars.mpg  -0.04881581  0.0061958 -0.2835440 -0.1711798  0.120250
mtcars.cyl   0.43373239 -0.0177430 -0.9944985  0.1119323 -1.151769
mtcars.disp -0.00032343  0.0079218 -0.0040436  0.0085164  0.016677
mtcars.hp    0.00027270 -0.0198376  0.0055437 -0.0181979  0.012690
mtcars.drat  0.04262095 -0.8862728 -0.8457911  2.7866709 -0.231910

   Y  variables: 
               Ycan1    Ycan2    Ycan3    Ycan4    Ycan5
mtcars.qsec -0.19228  0.23567  0.36964  0.95067 -0.19650
mtcars.vs   -0.52321 -0.83425  1.03760 -3.38968 -0.89386
mtcars.am   -0.72499 -0.47399 -1.00237  0.90750 -3.32937
mtcars.gear -0.45040 -0.34614  0.12384  0.65607  2.56483
mtcars.carb  0.20417 -0.37941  0.62171  0.10447 -0.59766

Measure Theory and Probability

R provides a number of packages that can be used for Measure Theory and Probability. Here are some examples:

Base R's stats package provides the d/p/q/r families of functions for the standard probability distributions (for example dnorm(), pnorm(), qnorm(), and rnorm()), covering random variable generation, density evaluation, and computation of cumulative distribution and quantile functions.

The distr package provides a framework for working with probability distributions in R. It includes a variety of distribution functions, such as the normal, t, and chi-square distributions. The package also provides tools for manipulating distributions, such as generating random variables, computing density functions, and calculating cumulative distribution functions.

The copula package provides tools for working with copulas, which are mathematical objects used to model dependence structures between random variables. The package includes functions for simulating copulas, computing densities, and fitting copula models to data.

The mcsm package provides tools for Monte Carlo simulations and Markov chain Monte Carlo (MCMC) methods. It includes functions for generating random variables, simulating stochastic processes, and performing Bayesian inference.

The prob package provides a range of tools for working with probability distributions, such as calculating probabilities and moments, computing distribution functions, and generating random variables.

The spatstat package provides tools for analyzing spatial point patterns, which can be used to model a wide range of phenomena, such as the distribution of trees in a forest or the locations of earthquakes in a region. It includes functions for fitting models to spatial data, simulating point patterns, and performing hypothesis tests.

These are just a few examples of packages that can be used for Measure Theory and Probability in R. Depending on your specific needs and interests, there may be other packages that are more relevant to your work.

Generating random variables from the normal distribution:

# Set seed for reproducibility
set.seed(123)

# Generate 100 random values from a normal distribution
x <- rnorm(100, mean = 0, sd = 1)

# View the first 10 values
head(x, n = 10)
 [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
 [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197

Calculating the probability density function of the t-distribution:

# Generate a sequence of values for the t-distribution
x <- seq(-4, 4, by = 0.1)

# Calculate the probability density function
y <- dt(x, df = 10)

# Plot the results
plot(x, y, type = "l", main = "t-Distribution Density", xlab = "x", ylab = "Density")

Calculating the cumulative distribution function of the chi-square distribution:

# Generate a sequence of values for the chi-square distribution
x <- seq(0, 20, by = 0.1)

# Calculate the cumulative distribution function
y <- pchisq(x, df = 5)

# Plot the results
plot(x, y, type = "l", main = "Chi-Square Distribution CDF", xlab = "x", ylab = "Probability")

Computing the expected value and variance of a discrete probability distribution:

# Define the probability distribution
p <- c(0.2, 0.4, 0.3, 0.1)
x <- c(1, 2, 3, 4)

# Compute the expected value
mean <- sum(p * x)

# Compute the variance
var <- sum(p * (x - mean)^2)

# Print the results
cat("Expected value:", mean, "\n")
Expected value: 2.3 
cat("Variance:", var, "\n")
Variance: 0.81 

Central limit theorem in R

The central limit theorem (CLT) is a fundamental result in probability theory which states that the suitably standardized sum (or average) of a large number of independent and identically distributed (i.i.d.) random variables converges in distribution to a normal distribution, regardless of the underlying distribution of the individual random variables. In R, we can simulate the CLT using the following code:

set.seed(123) # for reproducibility
n <- 1000 # sample size
B <- 1000 # number of simulations
mu <- 2 # population mean
sigma <- 1 # population standard deviation

# simulate B samples of size n from the population distribution
samples <- replicate(B, rnorm(n, mu, sigma))

# calculate the sample means
sample_means <- apply(samples, 2, mean)

# plot the sample means
hist(sample_means, breaks = 20, probability = TRUE, col = "lightblue", main = "CLT Simulation")
curve(dnorm(x, mean = mu, sd = sigma/sqrt(n)), add = TRUE, col = "red", lwd = 2)

In this example, we simulate B = 1000 samples of size n = 1000 from a normal population distribution with mean mu = 2 and standard deviation sigma = 1. We then calculate the sample means and plot the histogram of the sample means along with the theoretical normal distribution with mean mu and standard deviation sigma/sqrt(n), where sqrt(n) is the square root of the sample size. As we can see from the resulting plot, the histogram of the sample means closely follows the normal distribution, which is in accordance with the CLT.

Establish convergences in probability distributions

To establish convergence in probability distributions in R, we can use simulation and comparison of empirical distributions to theoretical distributions. The following are some examples:

Law of Large Numbers (LLN):

The LLN states that the sample mean converges to the population mean as the sample size increases. To demonstrate this in R, we can generate a sequence of i.i.d. random variables and calculate the sample mean for increasing sample sizes. We can then plot the sample mean against the sample size and compare it to the population mean using a horizontal line. The code is as follows:

set.seed(123) # for reproducibility
n <- seq(1, 10000, by = 10) # sample sizes
mu <- 2 # population mean
sigma <- 1 # population standard deviation

# generate the random variables
x <- rnorm(max(n), mu, sigma)

# calculate the sample means for increasing sample sizes
sample_means <- sapply(n, function(m) mean(x[1:m]))

# plot the sample mean against the sample size
plot(n, sample_means, type = "l", col = "blue", lwd = 2, xlab = "Sample Size", ylab = "Sample Mean")

# add a horizontal line for the population mean
abline(h = mu, col = "red", lwd = 2)

In this example, we generate a sequence of sample sizes ranging from 1 to 10,000 and calculate the sample mean for each size. We then plot the sample mean against the sample size and add a horizontal line for the population mean. As we can see from the resulting plot, the sample mean approaches the population mean as the sample size increases, which is in accordance with the LLN.

Central Limit Theorem (CLT):

The CLT states that the sample mean follows a normal distribution with mean equal to the population mean and variance equal to the population variance divided by the sample size as the sample size increases. To demonstrate this in R, we can simulate a large number of i.i.d. random variables and calculate the sample mean for a range of sample sizes. We can then plot the empirical distribution of the sample means against the theoretical normal distribution with the aforementioned mean and variance. The code is as follows:

set.seed(123) # for reproducibility
n <- c(10, 100, 1000) # sample sizes
mu <- 2 # population mean
sigma <- 1 # population standard deviation
B <- 1000 # number of simulations

# simulate B samples for each sample size
samples <- lapply(n, function(m) replicate(B, mean(rnorm(m, mu, sigma))))

# plot the empirical distributions of the sample means
par(mfrow = c(1, length(n)))
for (i in seq_along(n)) {
  hist(samples[[i]], breaks = 30, probability = TRUE, main = paste("Sample Size =", n[i]), xlab = "Sample Mean", col = "lightblue")
  curve(dnorm(x, mean = mu, sd = sigma/sqrt(n[i])), add = TRUE, col = "red", lwd = 2)
}

In this example, we simulate B = 1000 samples for each of the three sample sizes: n = 10, 100, and 1000. We then plot the empirical distributions of the sample means for each sample size and add the theoretical normal distribution with mean equal to the population mean and variance equal to the population variance divided by the sample size. As we can see from the resulting plots, the empirical distributions of the sample means closely follow the theoretical normal distribution as the sample size increases, which is in accordance with the central limit theorem.

PARAMETRIC REGRESSION ANALYSIS

Course Purpose

The purpose of this course is to enable the learners to handle models involving different types of least squares.

Expected Learning Outcomes

At the end of the course, the student should be able to:

  • Explain the statistical properties of ordinary least squares
  • Fit regression models involving ordinary, generalized and nonlinear least squares
  • Carry out hypothesis testing on regression models
  • Apply Quasi Maximum Likelihood theory in modelling data

Properties of Ordinary Least Squares

Ordinary Least Squares (OLS) is a commonly used method in R for linear regression analysis. Here are some properties of OLS:

  • Linearity: OLS assumes a linear relationship between the dependent variable and the independent variables. This can be assessed using scatter plots and residual plots.
  • Constant variance: OLS assumes that the variance of the errors is constant across all values of the independent variables. This assumption can be checked using residual plots.
  • Independence: OLS assumes that the errors are independent of each other. Independence can be checked using residual plots and the Durbin-Watson test.
  • Normality: OLS assumes that the errors are normally distributed. Normality can be checked using Q-Q plots, histograms, and normal probability plots.
  • Unbiasedness: OLS is an unbiased estimator, which means that, on average, it estimates the true values of the population regression coefficients. OLS can, however, be biased if the assumptions above are not met.
  • Efficiency: under the Gauss-Markov conditions, OLS has the smallest variance among all linear unbiased estimators.
  • Consistency: OLS is a consistent estimator, which means that as the sample size increases the estimates converge to the true population values.
  • Hypothesis testing: OLS can be used to test hypotheses about the regression coefficients using t-tests or F-tests.

In summary, OLS is a powerful method for linear regression analysis in R, but it is important to check the assumptions before interpreting the results.

# Fit a linear regression model
library(kableExtra)
library(jtools)
library(gtsummary)
library(broom)
model <- lm(mpg ~ wt + hp, data = mtcars)
summ(model,confint = TRUE, digits = 3)
Observations 32
Dependent variable mpg
Type OLS linear regression
F(2,29) 69.211
R² 0.827
Adj. R² 0.815
Est. 2.5% 97.5% t val. p
(Intercept) 37.227 33.957 40.497 23.285 0.000
wt -3.878 -5.172 -2.584 -6.129 0.000
hp -0.032 -0.050 -0.013 -3.519 0.001
Standard errors: OLS
# Obtain the residuals
resid <- resid(model)
# Check for linearity
plot(mtcars$wt, resid)

# Check for constant variance
plot(fitted(model), resid)

# Check for independence
plot(resid ~ seq_along(resid))

# Check for normality
hist(resid)

qqnorm(resid)
qqline(resid)
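
The independence assumption can also be checked with a formal Durbin-Watson test; a minimal sketch, assuming the lmtest package is installed:

# Durbin-Watson test for autocorrelation in the residuals
library(lmtest)
dwtest(model)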

Sufficiency

Base R does not provide a ready-made test of sufficiency (the stats package has no such function); a statistic is usually shown to be sufficient analytically, for example via the factorization theorem. We can, however, illustrate the idea numerically: for a normal sample with known standard deviation, the likelihood depends on the data only through the sample mean, so two samples of the same size with the same sample mean produce log-likelihood curves for mu that differ only by an additive constant.

Here’s an example code:

# Generate a sample from a normal distribution with mean 0 and standard deviation 1
set.seed(123)
x <- rnorm(100, 0, 1)

# Log-likelihood of mu for a normal sample with known sigma
log_likelihood <- function(mu, x, sigma = 1) {
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Maximum likelihood estimate of mu; it coincides with the sample mean
mle <- optimize(function(mu) -log_likelihood(mu, x), interval = c(-5, 5))
mu_mle <- mle$minimum
c(mu_mle, mean(x))

# Construct a second sample of the same size with the same sample mean
x2 <- rnorm(100, 0, 1)
x2 <- x2 - mean(x2) + mean(x)

# The two log-likelihood curves differ only by a constant that does not
# involve mu, illustrating that the sample mean carries all the information
# about mu contained in the data
mu_grid <- seq(-1, 1, by = 0.1)
ll_x  <- sapply(mu_grid, log_likelihood, x = x)
ll_x2 <- sapply(mu_grid, log_likelihood, x = x2)
range(ll_x - ll_x2)

Consistency

In statistics, consistency refers to the property of an estimator that its value converges to the true value of the parameter as the sample size increases. In R, there are several ways to test the consistency of an estimator. Here are a few examples:

Simulate data from a known distribution with a true parameter value and estimate the parameter using different sample sizes. Plot the estimates against sample size and see if they converge to the true value. For example, to test the consistency of the sample mean as an estimator of the population mean, you can use the following R code:

set.seed(123)
true_mean <- 5
sample_sizes <- seq(from = 10, to = 1000, by = 10)
estimates <- sapply(sample_sizes, function(n) mean(rnorm(n, true_mean, sd = 1)))
plot(sample_sizes, estimates, type = "l", xlab = "Sample size", ylab = "Estimate")
abline(h = true_mean, lty = 2)

This code simulates data from a normal distribution with mean 5 and standard deviation 1, and estimates the mean using sample sizes ranging from 10 to 1000. The estimates are plotted against sample size, and a horizontal dashed line is added to indicate the true mean. As the sample size increases, the estimates become more precise and converge to the true mean.

Consistency Theorem

Use a consistency theorem to derive the asymptotic distribution of the estimator and test if it is consistent. For example, the sample mean is known to be a consistent estimator of the population mean under mild conditions, such as finite variance. We can use the central limit theorem to derive the asymptotic distribution of the sample mean, which is normal with mean equal to the population mean and variance equal to the population variance divided by the sample size. We can then test if the sample mean follows this distribution as the sample size increases. For example:

set.seed(123)
true_mean <- 5
sample_sizes <- seq(from = 10, to = 1000, by = 10)
estimates <- sapply(sample_sizes, function(n) mean(rnorm(n, true_mean, sd = 1)))
se <- sqrt(var(rnorm(100000, true_mean, sd = 1))/sample_sizes)
z <- (estimates - true_mean)/se
plot(sample_sizes, z, type = "l", xlab = "Sample size", ylab = "Z-score")
abline(h = 0, lty = 2)

This code simulates data from a normal distribution with mean 5 and standard deviation 1, and estimates the mean using sample sizes ranging from 10 to 1000. It then calculates the standard error of the sample mean using a large number of simulations, and computes the z-score of each estimate using the standard error. The z-scores are plotted against sample size, and a horizontal dashed line is added at zero. The z-scores fluctuate around zero with roughly constant spread for every sample size, which is what we expect if the sample mean follows the asymptotic standard normal distribution predicted by the central limit theorem.

Unbiasedness

In statistics, an estimator is said to be unbiased if its expected value is equal to the true value of the parameter being estimated. To test for unbiasedness in R, one can simulate a large number of random samples from a known distribution, estimate the parameter of interest using the estimator in question, and then calculate the average of the estimates. If this average is close to the true value of the parameter, then the estimator can be considered unbiased.

For example, let’s consider the problem of estimating the mean of a normal distribution with known variance using the sample mean estimator. We can simulate 1000 random samples of size 10 from a normal distribution with mean 5 and variance 1, and then calculate the sample mean for each sample using the mean() function in R. We can then calculate the average of these sample means using the mean() function again, and compare it to the true mean of 5.

set.seed(123)
n <- 10
m <- 1000
x <- matrix(rnorm(n*m, mean = 5, sd = 1), nrow = n)
estimates <- apply(x, 2, mean)
mean_estimates <- mean(estimates)
true_mean <- 5
bias <- mean_estimates - true_mean
bias
[1] -0.002371702

The output is very close to zero, indicating that the sample mean estimator is unbiased for estimating the mean of a normal distribution with known variance. Note that the set.seed() function is used to ensure reproducibility of the results.

Fit regression models involving

ordinary least squares

To fit a linear regression model using ordinary least squares (OLS) with the iris dataset in R, we can use the lm() function. The iris dataset is included in R, so we do not need to load any external data. Here’s an example code to fit a linear regression model to predict petal length based on petal width using OLS:

# Load iris dataset
data(iris)
head(iris,5)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
# Fit linear regression model using OLS
model1 <- lm(Petal.Length ~ Petal.Width, data = iris)
summ(model1,confint = TRUE, digits = 3)
Observations 150
Dependent variable Petal.Length
Type OLS linear regression
F(1,148) 1882.452
R² 0.927
Adj. R² 0.927
Est. 2.5% 97.5% t val. p
(Intercept) 1.084 0.939 1.228 14.850 0.000
Petal.Width 2.230 2.128 2.332 43.387 0.000
Standard errors: OLS

In the code above, we first load the iris dataset using the data() function. We then fit a linear regression model to predict Petal.Length based on Petal.Width using the lm() function, and assign the resulting model object to the variable model1. Finally, we view the summary of the model using the summ() function from the jtools package. The output includes the coefficients of the regression model, goodness-of-fit statistics, and other diagnostic information. Note that the data argument specifies that the data for the regression model are taken from the iris dataset, in which Petal.Length and Petal.Width are variable names.

generalized least squares

To fit a linear regression model using generalized least squares (GLS) in R, we can use the gls() function from the nlme package. Here’s an example code to fit a GLS model to the iris dataset:

# Load required packages
library(nlme)

# Fit GLS model
model <- gls(Petal.Length ~ Petal.Width + Species, data = iris, correlation = corAR1(form = ~1 | Species))

# View model summary
summary(model)
Generalized least squares fit by REML
  Model: Petal.Length ~ Petal.Width + Species 
  Data: iris 
      AIC      BIC    logLik
  152.032 169.9337 -70.01602

Correlation Structure: AR(1)
 Formula: ~1 | Species 
 Parameter estimate(s):
     Phi 
0.160395 

Coefficients:
                     Value  Std.Error   t-value p-value
(Intercept)       1.194798 0.07270005 16.434619       0
Petal.Width       1.085817 0.14902037  7.286367       0
Speciesversicolor 1.626273 0.18389737  8.843373       0
Speciesvirginica  2.156301 0.27991657  7.703370       0

 Correlation: 
                  (Intr) Ptl.Wd Spcsvrs
Petal.Width       -0.504               
Speciesversicolor  0.146 -0.876        
Speciesvirginica   0.284 -0.948  0.907 

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.67767889 -0.57820197 -0.03153986  0.49583770  3.10432219 

Residual standard error: 0.3792349 
Degrees of freedom: 150 total; 146 residual

In the code above, we first load the nlme package using the library() function. We then fit a GLS model to predict Petal.Length based on Petal.Width and Species using the gls() function. The correlation argument specifies a first-order autoregressive, AR(1), correlation structure for the errors within each Species. The summary() function can be used to view the model summary, which includes the coefficients of the regression model, goodness-of-fit statistics, and other diagnostic information. Note that other correlation structures can be specified using the corStruct classes from the nlme package (for example corCompSymm() or corSymm()).
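
As a sketch of this, a compound-symmetry structure could be fitted with corCompSymm() and compared with the AR(1) fit using information criteria (model_cs is introduced here for illustration):

# Compound-symmetry correlation structure within each species
model_cs <- gls(Petal.Length ~ Petal.Width + Species, data = iris,
                correlation = corCompSymm(form = ~ 1 | Species))

# Compare the two correlation structures by information criteria
AIC(model, model_cs)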

nonlinear least squares.

Nonlinear least squares is used to fit a regression model when the relationship between the independent variables and dependent variable is nonlinear. In R, the nls() function is used to perform nonlinear least squares regression.

Here’s an example of fitting a nonlinear regression model to the mtcars dataset using nls() function:

# Load the dataset
data(mtcars)
head(mtcars,5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Fit the nonlinear model
fit <- nls(mpg ~ a * exp(-b * wt) + c, data = mtcars, start = list(a = 20, b = 1, c = 10))
# Print the summary of the model
summary(fit)

Formula: mpg ~ a * exp(-b * wt) + c

Parameters:
  Estimate Std. Error t value Pr(>|t|)    
a  50.5559     6.6489   7.604 2.21e-08 ***
b   0.4412     0.1572   2.806  0.00887 ** 
c   6.7804     4.8173   1.407  0.16991    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.665 on 29 degrees of freedom

Number of iterations to convergence: 9 
Achieved convergence tolerance: 1.471e-06
# Plot the model (sorting by wt so the fitted curve is drawn smoothly)
ord <- order(mtcars$wt)
plot(mtcars$wt, mtcars$mpg)
lines(mtcars$wt[ord], predict(fit)[ord], col = "red")

In this example, we are fitting a model with mpg as the response variable and wt as the predictor variable. The model is assumed to be of the form mpg = a * exp(-b * wt) + c, where a, b, and c are the parameters we are trying to estimate. The start argument specifies the initial values for the parameters. The nls() function estimates the parameters using nonlinear least squares regression and returns an object of class nls.

We can then use summary() function to print the summary of the model, which includes information about the estimated coefficients and their standard errors, residual standard error, and goodness of fit measures.

Finally, we can plot the model by first creating a scatter plot of mpg vs wt and then overlaying the predicted values from the model using the predict() function, ordering the points by wt so that the fitted curve is drawn smoothly.

Carry out hypothesis testing on regression models.

To carry out hypothesis testing on regression models in R, you can use the summary() function on the output of a linear regression model. This will provide you with a summary of the model, including the coefficient estimates, standard errors, t-values, and p-values for each predictor variable in the model.

Here is an example using the mtcars dataset in R:

# Load the mtcars dataset
data(mtcars)

# Fit a linear regression model
model4 <- lm(mpg ~ wt + cyl, data = mtcars)

# View the summary of the model
summ(model4,confint = TRUE, digits = 3)
Observations 32
Dependent variable mpg
Type OLS linear regression
F(2,29) 70.908
R² 0.830
Adj. R² 0.819
Est. 2.5% 97.5% t val. p
(Intercept) 39.686 36.179 43.194 23.141 0.000
wt -3.191 -4.739 -1.643 -4.216 0.000
cyl -1.508 -2.356 -0.660 -3.636 0.001
Standard errors: OLS

The output of this code will show the coefficient estimates, standard errors, t-values, and p-values for the predictor variables wt and cyl. To carry out hypothesis tests on these coefficients, you can use their respective t-values and p-values.

For example, to test the null hypothesis that the coefficient for wt is equal to zero, you can use the t-value and p-value for wt from the model summary. The null hypothesis is rejected if the p-value is less than the chosen significance level (usually 0.05).

Here is an example:

# Test the null hypothesis that the coefficient for wt is zero
t_value <- coef(summary(model4))["wt", "t value"]
p_value <- coef(summary(model4))["wt", "Pr(>|t|)"]
alpha <- 0.05

if (p_value < alpha) {
  cat("Reject null hypothesis: coefficient for wt is not equal to zero\n")
} else {
  cat("Fail to reject null hypothesis: coefficient for wt is equal to zero\n")
}
Reject null hypothesis: coefficient for wt is not equal to zero
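
Hypotheses involving one or more coefficients can also be tested by comparing nested models with an F-test; a short sketch (model_reduced is introduced here for illustration) tests whether cyl adds explanatory power beyond wt:

# Fit a reduced model without cyl and compare it with the full model
model_reduced <- lm(mpg ~ wt, data = mtcars)
anova(model_reduced, model4)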

Apply Quasi Maximum Likelihood theory in modelling data.

Quasi Maximum Likelihood (QML) is a method for estimating the parameters of a statistical model. It is particularly useful when the distribution of the data is unknown or does not follow a well-known distribution. QML is an extension of Maximum Likelihood Estimation (MLE) and is commonly used in econometrics, finance, and other fields.

To apply QML in modeling data, we need to follow these steps:

Define the model:

We need to specify the functional form of the model and the distribution of the errors. For example, we can use a linear regression model with normal errors.

Obtain the likelihood function:

We need to derive the likelihood function for the model. In QML, the likelihood function may not be tractable, but we can use a quasilikelihood function instead.

Estimate the parameters:

We need to estimate the parameters of the model by maximizing the quasilikelihood function. This can be done using numerical optimization algorithms, such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm.

Test the model:

We need to test the goodness of fit of the model using statistical tests, such as the Wald test or the likelihood ratio test.

Here is an example code in R that applies QML to a linear regression model with normal errors:

set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2*x1 + 3*x2 + rt(n, df = 3)
data <- data.frame(y, x1, x2)
head(data,5)
            y          x1         x2
1 -0.62617705 -0.56047565 -0.7104066
2  1.05492040 -0.23017749  0.2568837
3  2.71127413  1.55870831 -0.2466919
4  0.08723968  0.07050839 -0.3475426
5 -1.65872762  0.12928774 -0.9516186
tail(data,5)
             y         x1         x2
96   6.4456520 -0.6002596  1.9972134
97   6.7888631  2.1873330  0.6007088
98  -0.1812239  1.5326106 -1.2512714
99  -1.0105219 -0.2357004 -0.6111659
100 -5.9767234 -1.0264209 -1.1854801

Here, we simulated a response variable y as a linear function of two normally distributed predictors x1 and x2, with errors generated from a heavy-tailed t distribution with 3 degrees of freedom rather than from a normal distribution.

Next, we will fit a linear regression model using the lm function:

model_lm <- lm(y ~ x1 + x2, data = data)
summ(model_lm,confint = TRUE, digits = 3)
Observations 100
Dependent variable y
Type OLS linear regression
F(2,97) 236.062
R² 0.830
Adj. R² 0.826
Est. 2.5% 97.5% t val. p
(Intercept) 1.009 0.698 1.319 6.449 0.000
x1 1.857 1.519 2.196 10.888 0.000
x2 3.111 2.792 3.431 19.320 0.000
Standard errors: OLS

Because the errors were generated from a heavy-tailed t distribution, the usual OLS assumptions about the error distribution are doubtful here. To apply QML, we need to specify a working model for the variance of the errors. A common choice is a power function of the mean, known as the power variance function:

power.variance <- function(mu, phi) {
  mu^phi
}

Here, phi is a parameter that controls the shape of the variance function. In the nlme package this kind of structure can be specified through the varPower() variance function when fitting a model with gls().
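
Another practical way to operationalise the QML idea for this simulated dataset is to keep the OLS point estimates but base inference on heteroscedasticity-robust (sandwich) standard errors; a minimal sketch, assuming the sandwich and lmtest packages are installed:

# Keep the OLS point estimates but use heteroscedasticity-robust (sandwich)
# standard errors for inference
library(sandwich)
library(lmtest)
coeftest(model_lm, vcov. = vcovHC(model_lm, type = "HC1"))

In the next example we demonstrate the quasi-likelihood approach directly, using the glm() function with a quasi family on the iris data.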

Using iris dataset

The iris dataset is typically used for classification problems, but we can still demonstrate the use of Quasi Maximum Likelihood (QML) for regression by treating one of the features as the dependent variable and the others as independent variables. Let’s use the petal length as the dependent variable and the sepal length, sepal width, and petal width as independent variables.

First, we will load the iris dataset and create the necessary variables:

data(iris)

# Create variables for QML regression
y <- iris$Petal.Length
x <- cbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Width)

Next, we will fit a regression model using the glm() function with the family argument set to quasi. The quasi family does not require a full distributional assumption for the errors; here we use an identity link with a variance proportional to the square of the mean (variance = "mu^2"), a form of heteroscedasticity:

# Fit QML regression model
model <- glm(y ~ x, family = quasi(variance = "mu^2"))

# Print model summary
summ(model,confint = TRUE, digits = 3)
Observations 150
Dependent variable y
Type Generalized linear model
Family quasi
Link identity
χ²(3) 42.897
Pseudo-R² (Cragg-Uhler) NA
Pseudo-R² (McFadden) NA
AIC NA
BIC NA
Est. 2.5% 97.5% t val. p
(Intercept) 0.022 -0.507 0.550 0.080 0.936
x1 0.520 0.383 0.657 7.443 0.000
x2 -0.443 -0.573 -0.313 -6.688 0.000
x3 1.707 1.537 1.876 19.711 0.000
Standard errors: MLE

The output shows the coefficient estimates, standard errors, t-values, and p-values for each independent variable, together with the model chi-square. Note that likelihood-based measures such as AIC and BIC are not available for quasi families, which is why they are reported as NA.

We can also test the significance of the overall model using a likelihood ratio test:

# Perform likelihood ratio test
null_model <- glm(y ~ 1, family = quasi(variance = "mu^2"))
lr_test <- anova(null_model, model, test = "LRT")
lr_test
Analysis of Deviance Table

Model 1: y ~ 1
Model 2: y ~ x
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       149     44.655                          
2       146      1.758  3   42.897 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output will show the chi-squared test statistic, degrees of freedom, and p-value for the likelihood ratio test, which compares the fit of the null model (a constant mean) to the QML regression model.

Note that QML requires only that the mean structure of the model is correctly specified; the estimates remain consistent even if the assumed variance function or error distribution is wrong. In that sense QML is more robust than full maximum likelihood, which relies on a completely and correctly specified likelihood.

MATH 841: DESIGN AND ANALYSIS OF EXPERIMENTS (L/P 30/30; CF 3.0)

Course Purpose

The purpose of this course is to enable learners apply the fundamental principles of statistically designed experiments and to design and analyze classical experiments.

Expected Learning Outcomes

At the end of the course, the student should be able to:

  • Explain the concepts of randomization, replication and control of error
  • Apply analysis of variance to a completely randomized design, in order to estimate model parameters and to test model assumptions
  • Explain how to design with increased precision using a randomized complete block design, including randomization, data analysis and handling missing data
  • Apply the Latin Square and Graeco-Latin Square designs
  • Apply blocking efficiency, management and confounding
  • Design the random effects model
  • Apply treatment comparisons, including orthogonal treatment contrasts, orthogonal polynomials for quantitative treatments and multiple comparison procedures
  • Design factorial experiments, and interpret the main effects and interaction
  • Apply analysis of covariance

Explain the concept of randomization, replication and control of error.

Randomization, replication, and control of error are fundamental concepts in experimental design and data analysis. These concepts help to ensure that the results of experiments are reliable, accurate, and unbiased.

Randomization is the process of randomly assigning treatments or conditions to experimental units. The goal of randomization is to eliminate or reduce the effect of any systematic differences that may exist among the units. By randomly assigning treatments, the likelihood of any particular treatment being associated with specific units is reduced, and the effect of any other factors that might be associated with those units is also minimized.

Replication involves repeating an experiment or study multiple times to obtain a more accurate estimate of the treatment effect or the relationship between the variables of interest. Replication can be either within-subject or between-subject. Within-subject replication involves using the same subjects or experimental units to test different treatments or conditions. Between-subject replication involves using different subjects or experimental units for each treatment or condition. Replication allows for the estimation of the variability of the treatment effect or the relationship between the variables of interest.

Control of error is the process of identifying and minimizing errors in experimental design and data analysis. Control of error is important because errors can lead to biased or inaccurate results. The most common sources of error are sampling error, measurement error, and confounding variables. Sampling error arises when the sample is not representative of the population. Measurement error occurs when the measurements are imprecise or inaccurate. Confounding variables are variables that are associated with both the treatment and the outcome and can distort the estimate of the treatment effect.

In summary, randomization, replication, and control of error are essential concepts in experimental design and data analysis. These concepts help to ensure that the results of experiments are reliable, accurate, and unbiased.

In R, randomization can be implemented using various functions in the base package or other packages.

For example, the sample() function in the base package can be used to randomly permute a vector or sample randomly from a set of values. The runif() function can be used to generate random uniform variables, and the rnorm() function can be used to generate random normal variables.

Here’s an example of randomizing a vector using the sample() function:

# create a vector of treatments
treatments <- c("A", "B", "C", "D")

# randomly assign treatments to experimental units
set.seed(123)  # set random seed for reproducibility
treatment_assignment <- sample(treatments, size = 20, replace = TRUE)
treatment_assignment
 [1] "C" "C" "C" "B" "C" "B" "B" "B" "C" "A" "D" "B" "B" "A" "B" "C" "D" "A" "C"
[20] "C"

In this example, the sample() function randomly assigns the four treatments (A, B, C, and D) to 20 experimental units with replacement (i.e., the same treatment can be assigned to multiple units). The set.seed() function is used to ensure that the same random assignment is obtained every time the code is run.
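
In practice a balanced completely randomized design is often preferred; a short sketch (balanced_assignment is introduced here for illustration) randomly permutes a vector in which each treatment appears equally often:

# A balanced assignment: each treatment is allocated to exactly 5 of the 20 units
set.seed(123)
balanced_assignment <- sample(rep(treatments, times = 5))
table(balanced_assignment)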

Randomization can also be used to generate random data for simulation studies, which are often used to evaluate the performance of statistical methods under different scenarios. For example, the runif() function can be used to generate random uniform data in R:

# generate 1000 random uniform data between 0 and 1
set.seed(123)
random_data <- runif(1000)
head(random_data,100)
  [1] 0.2875775201 0.7883051354 0.4089769218 0.8830174040 0.9404672843
  [6] 0.0455564994 0.5281054880 0.8924190444 0.5514350145 0.4566147353
 [11] 0.9568333453 0.4533341562 0.6775706355 0.5726334020 0.1029246827
 [16] 0.8998249704 0.2460877344 0.0420595335 0.3279207193 0.9545036491
 [21] 0.8895393161 0.6928034062 0.6405068138 0.9942697766 0.6557057991
 [26] 0.7085304682 0.5440660247 0.5941420204 0.2891597373 0.1471136473
 [31] 0.9630242325 0.9022990451 0.6907052784 0.7954674177 0.0246136845
 [36] 0.4777959711 0.7584595375 0.2164079358 0.3181810076 0.2316257854
 [41] 0.1428000224 0.4145463358 0.4137243263 0.3688454509 0.1524447477
 [46] 0.1388060634 0.2330340995 0.4659624503 0.2659726404 0.8578277153
 [51] 0.0458311667 0.4422000742 0.7989248456 0.1218992600 0.5609479838
 [56] 0.2065313896 0.1275316502 0.7533078643 0.8950453592 0.3744627759
 [61] 0.6651151946 0.0948406609 0.3839696378 0.2743836446 0.8146400389
 [66] 0.4485163414 0.8100643530 0.8123895095 0.7943423211 0.4398316876
 [71] 0.7544751586 0.6292211316 0.7101824014 0.0006247733 0.4753165741
 [76] 0.2201188852 0.3798165377 0.6127710033 0.3517979092 0.1111354243
 [81] 0.2436194727 0.6680555874 0.4176467797 0.7881958340 0.1028646443
 [86] 0.4348927415 0.9849569800 0.8930511144 0.8864690608 0.1750526503
 [91] 0.1306956916 0.6531019250 0.3435164723 0.6567581280 0.3203732425
 [96] 0.1876911193 0.7822943013 0.0935949867 0.4667790416 0.5115054599

This code generates 1000 random uniform data between 0 and 1 using the runif() function, with a specified random seed for reproducibility.

The Concept of Replication

In experimental design, replication refers to the practice of repeating measurements or experiments under identical conditions. Replication helps to account for random variation and provides a more accurate estimate of experimental error. In R, replication can be achieved by running multiple iterations of a particular analysis or experiment using the same set of inputs.

For example, suppose we want to estimate the mean of a population based on a random sample of observations. To replicate the sampling process, we can use the replicate() function to generate multiple random samples and calculate the sample mean for each sample:

# Generate 10 random samples of size 5 from a normal distribution
set.seed(123)  # Set seed for reproducibility
samples <- replicate(10, rnorm(5))

# Calculate the sample mean for each sample
sample_means <- apply(samples, 2, mean)

# Display the results
sample_means
 [1]  0.19357026 -0.04431897  0.30790173  0.10934219 -0.73314671 -0.11597103
 [7]  0.54524659  0.09884251  0.24177947 -0.25921054

In this example, we use the replicate() function to generate 10 random samples of size 5 from a normal distribution. We then calculate the sample mean for each sample using the apply() function, which applies the mean() function to each column of the samples matrix. Finally, we display the resulting sample means. By replicating the sampling process, we can obtain a more accurate estimate of the population mean and account for random variation in the data.
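
To see the effect of replication directly, we can compare how much the sample means vary when each sample contains few versus many observations; the sketch below uses sample sizes of 5 and 50, chosen arbitrarily for illustration:

set.seed(123)
means_n5 <- replicate(1000, mean(rnorm(5)))    # 1000 means of samples of size 5
means_n50 <- replicate(1000, mean(rnorm(50)))  # 1000 means of samples of size 50

sd(means_n5)   # larger spread of the sample means
sd(means_n50)  # smaller spread: more replication gives a more precise estimate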

The Concept of Control of Error

The concept of control of error in R refers to the process of minimizing the possibility of errors when conducting statistical analyses. This involves carefully designing experiments, collecting and processing data, and verifying results to ensure that they are accurate and reliable.

There are several ways to control errors in R, including:

Pre-processing data:

This involves cleaning, transforming, and filtering data to remove outliers, missing values, and other anomalies that could affect the accuracy of the results.
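
A small sketch of this step (the data frame and the cut-off used here are invented for illustration):

# hypothetical data with a missing value and an implausibly large observation
raw <- data.frame(y = c(5.1, 4.8, NA, 5.3, 50.0),
                  group = c("A", "A", "B", "B", "B"))

# drop rows with missing values
clean <- na.omit(raw)

# keep only values inside an assumed plausible range (upper limit of 10 here)
clean <- subset(clean, y <= 10)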

Choosing appropriate statistical methods:

It is important to select statistical methods that are appropriate for the data being analyzed and the research question being addressed. This can help to reduce the risk of errors due to inappropriate assumptions or flawed analyses.

Conducting multiple tests:

It is common to conduct multiple tests on a dataset, but this can increase the risk of errors due to multiple comparisons. To control for this, researchers may use techniques such as Bonferroni correction, which adjusts the significance threshold for multiple comparisons.
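
A minimal sketch using base R's p.adjust() (the p-values below are made up for illustration):

# three hypothetical unadjusted p-values from three separate tests
p_values <- c(0.012, 0.030, 0.200)

# the Bonferroni correction multiplies each p-value by the number of tests
# (capped at 1), so each comparison is judged more conservatively
p.adjust(p_values, method = "bonferroni")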

Randomization:

Randomization involves randomly assigning subjects to groups or treatments in an experiment. This helps to control for extraneous variables that could affect the results, such as individual differences or environmental factors.

Replication:

Replication involves repeating an experiment or analysis to ensure that the results are reliable and replicable. This can help to control for errors that may arise due to chance or individual variation.

By controlling for errors in these ways, researchers can increase the validity and reliability of their statistical analyses and draw more accurate conclusions from their data.

To apply analysis of variance (ANOVA) to a completely randomized design (CRD) in R, we can use the aov() function. For example, suppose we have a dataset with a response variable y and a factor variable x with three levels. We want to fit a CRD model to the data and perform an ANOVA to test the significance of the factor variable x. Here’s how we can do it:

# Generate example data
set.seed(123)
y <- rnorm(30, mean = rep(1:3, each = 10), sd = 0.5)
x <- rep(LETTERS[1:3], each = 10)

crd <- data.frame(y,x)
head(crd)
          y x
1 0.7197622 A
2 0.8849113 A
3 1.7793542 A
4 1.0352542 A
5 1.0646439 A
6 1.8575325 A
tail(crd)
          y x
25 2.687480 C
26 2.156653 C
27 3.418894 C
28 3.076687 C
29 2.430932 C
30 3.626907 C
# Fit a CRD model using aov()
fit <- aov(y ~ x, data = crd)

# Display the ANOVA table
summary(fit)
            Df Sum Sq Mean Sq F value   Pr(>F)    
x            2 15.565   7.782   32.73 6.08e-08 ***
Residuals   27  6.421   0.238                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This will output an ANOVA table with the following columns:

  • Df: degrees of freedom
  • Sum Sq: sum of squares
  • Mean Sq: mean square
  • F value: F statistic
  • Pr(>F): p-value

The ANOVA table can be used to test the null hypothesis that the means of the response variable y are equal across the levels of the factor variable x. The p-value tells us the probability of obtaining an F statistic as large as the one observed, assuming the null hypothesis is true. If the p-value is less than a chosen significance level (e.g., 0.05), we reject the null hypothesis and conclude that there is evidence of a significant difference in means among the levels of x.

To test the assumptions of the model, we can use diagnostic plots such as a normal probability plot of the residuals, a plot of residuals vs. fitted values, and a plot of residuals vs. the factor variable x. These plots can be generated using the qqnorm(), plot(), and boxplot() functions in R, respectively.

Assumptions

To test the assumptions of the model, we can use diagnostic plots. Here are the steps to follow:

  • Obtain the model residuals
  • Plot the residuals vs. fitted values to check for linearity and equal variance assumptions
  • Plot the histogram of residuals to check for normality assumption
  • Perform a Shapiro-Wilk test for the normality assumption

Here’s the R code to do this for the previously fitted ANOVA model:
# Obtain the model residuals
residuals <- residuals(fit)

# Plot residuals vs. fitted values
plot(fitted(fit), residuals, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. Fitted Values Plot")

# Plot histogram of residuals
hist(residuals, main = "Histogram of Residuals")

# Perform Shapiro-Wilk test for normality assumption
shapiro.test(residuals)

    Shapiro-Wilk normality test

data:  residuals
W = 0.96213, p-value = 0.3507

The plot() function will produce a scatter plot of the residuals against the fitted values. We are checking for a random pattern in the plot, which would indicate that the linearity assumption is valid. We also want to see that the variance of the residuals is approximately constant across the range of the fitted values.

The hist() function will produce a histogram of the residuals. We are checking for a normal distribution of the residuals.

The shapiro.test() function will perform a Shapiro-Wilk test on the residuals to test the normality assumption. The null hypothesis is that the residuals come from a normal distribution. If the p-value is less than 0.05, we reject the null hypothesis and conclude that the residuals are not normally distributed.
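
The normal probability plot mentioned earlier complements the Shapiro-Wilk test; points lying close to the reference line suggest approximate normality:

# normal Q-Q plot of the residuals with a reference line
qqnorm(residuals, main = "Normal Q-Q Plot of Residuals")
qqline(residuals)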

If any of these assumptions are violated, we may need to consider transforming the data or using a different type of analysis.

Explain how to design with increased precision using a randomized complete block design, including randomization, data analysis and handling missing data.

A randomized complete block design (RCBD) is a type of experimental design that is used to increase precision by accounting for sources of variability beyond the treatments of interest. In an RCBD, experimental units are grouped into blocks based on one or more variables that are thought to influence the response variable. Within each block, the treatments are randomly assigned to the experimental units.

The randomization step in an RCBD involves randomly assigning the treatments to the experimental units within each block. This ensures that any systematic differences in the response variable across the blocks are not confounded with the treatment effects. Randomization can be achieved in R using the sample() function, which can be used to randomly assign treatments to experimental units within each block.
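
A minimal sketch of within-block randomization using sample() (the number of blocks and the treatment labels are made up for illustration):

# four treatments randomized independently within each of five blocks
treatments <- c("A", "B", "C", "D")
n_blocks <- 5

set.seed(123)
rcbd_layout <- data.frame(
  block = rep(1:n_blocks, each = length(treatments)),
  treatment = unlist(lapply(1:n_blocks, function(b) sample(treatments)))
)
head(rcbd_layout, 8)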

To analyze the data from an RCBD, a two-way analysis of variance (ANOVA) can be used to estimate model parameters and test model assumptions. The ANOVA model for an RCBD includes two factors: treatment and block. The treatment factor represents the treatments of interest, and the block factor represents the blocking variable(s). The model can be fit using the lm() or aov() functions in R to estimate the treatment and block effects; with only one observation per treatment-block combination, the treatment-by-block interaction cannot be estimated separately from the error term.
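
A minimal sketch of this analysis, assuming a data frame rcbd with columns yield (the response), treatment and block (the names are illustrative, not from a particular data set):

# treatment and block must be factors so each receives the correct degrees of freedom
rcbd$treatment <- factor(rcbd$treatment)
rcbd$block <- factor(rcbd$block)

# two-way ANOVA for an RCBD: treatment effects assessed after adjusting for blocks
fit_rcbd <- lm(yield ~ treatment + block, data = rcbd)
anova(fit_rcbd)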

Handling missing data in an RCBD can be challenging because missing data in one block may affect the analysis of the entire block. One approach is to use a model-based imputation method, such as multiple imputation, to impute the missing values. Another approach is to use a balanced incomplete block design, which is a variation of the RCBD that allows for missing values by only including a subset of the treatments in each block.
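
As a hedged sketch of the model-based route, assuming the mice package is installed and an rcbd data frame like the one above but with some missing (NA) responses:

# create several completed data sets by multiple imputation
library(mice)
imp <- mice(rcbd, m = 5, seed = 123)

# fit the RCBD model to each completed data set and pool the results
fits <- with(imp, lm(yield ~ treatment + block))
summary(pool(fits))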

In summary, an RCBD is a powerful experimental design that can increase the precision of an experiment by accounting for sources of variability beyond the treatments of interest. Randomization is used to ensure that any systematic differences across the blocks are not confounded with the treatment effects, and ANOVA is used to estimate model parameters and test model assumptions. Handling missing data in an RCBD can be challenging, but several approaches are available, including model-based imputation and balanced incomplete block designs.

Apply the Latin Square and Graeco-Latin Square designs.

The Latin Square and Graeco-Latin Square designs are row-column designs: in addition to the treatments, they block on two (or, for the Graeco-Latin Square, three) sources of nuisance variation. In R, these layouts can be generated with functions from add-on packages, for example design.lsd() and design.graeco() in the agricolae package (if installed), or written out by hand for a small number of treatments.

In a Latin Square design with k treatments, the experimental units are arranged in a k x k grid of rows and columns, and each treatment appears exactly once in every row and exactly once in every column. A Graeco-Latin Square superimposes a second set of k treatment symbols on a Latin Square so that every pair of symbols from the two sets occurs together exactly once, which allows a third blocking (or treatment) factor to be accommodated.
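
If the agricolae package is installed, its design.lsd() and design.graeco() functions can generate randomized layouts of this kind; a minimal sketch (the treatment labels are illustrative):

library(agricolae)

# Latin Square: four treatments, each appearing once per row and once per column
lsd <- design.lsd(trt = c("A", "B", "C", "D"), seed = 123)
lsd$sketch  # the square layout
lsd$book    # the field book: row, column and treatment for each plot

# Graeco-Latin Square: a second set of treatment symbols superimposed on the first
gls <- design.graeco(trt1 = c("A", "B", "C", "D"), trt2 = c("a", "b", "c", "d"), seed = 123)
gls$book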

Here’s a simple numeric illustration of arranging responses recorded in a square layout into a data frame and fitting a model with treatment and blocking terms (a fuller Latin Square analysis, with rows, columns, and treatments all entered as factors, is sketched after this example):

# create a matrix with the data
data <- matrix(c(5, 3, 2, 6, 7, 8,
                 7, 5, 6, 9, 8, 3,
                 2, 6, 7, 3, 1, 5,
                 9, 7, 8, 4, 6, 2,
                 4, 1, 3, 5, 9, 7,
                 8, 2, 4, 1, 3, 6), nrow = 6, ncol = 6, byrow = TRUE)
head(data,5)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    5    3    2    6    7    8
[2,]    7    5    6    9    8    3
[3,]    2    6    7    3    1    5
[4,]    9    7    8    4    6    2
[5,]    4    1    3    5    9    7
# convert the matrix to a data frame
# (data.frame() recycles the 9-row Treatment x Block grid to match the 36
#  responses, which are taken column by column from the matrix)
df <- data.frame(expand.grid(Treatment = 1:3, Block = 1:3), Response = as.vector(data))
# fit a linear model with treatment and blocking terms
# (Treatment and Block are numeric 1-3 here, so each contributes a single
#  degree of freedom in the ANOVA below; wrapping them in factor() would give
#  the usual multi-degree-of-freedom treatment and block sums of squares)
fit <- lm(Response ~ Treatment + Block, data = df)

# perform ANOVA
summary(aov(fit))
            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    1  15.04  15.042   2.471  0.125
Block        1   0.00   0.000   0.000  1.000
Residuals   33 200.85   6.086               
# check assumptions
plot(fit)
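
For a genuine Latin Square analysis, rows, columns and treatments are all entered as factors so that each source of variation gets its own line in the ANOVA table. A minimal sketch, assuming a data frame latin_df with columns row, column, treatment and response (names chosen for illustration):

# all three classification variables must be factors
latin_df$row <- factor(latin_df$row)
latin_df$column <- factor(latin_df$column)
latin_df$treatment <- factor(latin_df$treatment)

# Latin Square ANOVA: treatment effects adjusted for row and column blocking
fit_ls <- aov(response ~ row + column + treatment, data = latin_df)
summary(fit_ls)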

What Next!!!!

Apply blocking efficiency, management and confounding.

Design the random effects model.

Apply treatment comparisons, including orthogonal treatment contrasts, orthogonal polynomials for quantitative treatments and multiple comparison procedures.

Design factorial experiments, and interpret the main effects and interaction.

Apply analysis of covariance