Completely Randomized Design - Non-Parametric (Matthew Harvey, Jessica Dedeaux, Harsh Patel)

Parametric ANOVA tests assume normality of data and constant variance. If the assumption of normality is proven false, then non-parametric methods may be used. This method is known as the Kruskal-Wallis Test. Prior to the testing the hypothesis using Kruskal-Wallis, the Completely Randomized Design or CRD needs to be set-up in the R simulation console. Further, CRDs are built on the basis of defined levels and replications as shown in the Process section.


Assumptions

  • Independence: The observations within each group are assumed to be independent of each other. This means that values in one group should not be related to or influence the values in any other group.

  • Random Sampling: The data should be collected using a random sampling process. This means that each observation has an equal probability of being included in the study.

  • Constant Variance: Like one-way ANOVA testing, the variance among the groups is assumed to be equal for non-parametric tests.

  • Normality: For hypothesis testing, in one-way ANOVA, the model errors are assumed to be normally distributed. However, for the Kruskal-Wallis test (non-parametric ANOVA) we cannot assume normal distribution.


Process

Please note: the replications for the lesson are pre-defined. Traditionally, one would run a power analysis (pwr.t.test) for the desired alpha (i.e. 0.05) and the power for a specific effect (i.e. 1). This would denote the number of replications needed statistically satisfy the performance of the hypothesis testing.


Design Layout

library(agricolae)
# Extensive functionality on experimental design functions package, typically used for agriculture research, package needs to be loaded to develop the functions


OutDesign<-function(levels, replication){
  outdesign<-design.crd(levels, replication, serie = 1, 34268)$book
  #outdesign<-outdesign[order(outdesign$levels), ]
  write.csv(outdesign,"layout.csv")}
# Line 1: OutDesign creates a function to export data
# Line 2: CRD have defaults in the serie (plot number degrees) and seed (random number for testing) that can be altered to create the data. In this we have created number plot of 2 digits (1) and the random seed (34268). The random seed can be any set of numbers based on the study.
# Line 3: CRD are random sets of data that do not need to be sorted from A to Z to run properly; therefore, this line is not needed
# Line 4: Writes the layout to a csv file to the set working directory, check the set working directory to ensure it is saved in the proper location for your exploration


?design.crd
# provides the parameters used in the function and proper description, typically used for agriculture research


Exporting

The Outdesign function is created to pull data out of R into a csv file. This is done by creating a CRD with the listed parameters then writing the CSV. The CSV file saves to the set working directory. Note where the file was saved because this will need to be call later to import the dat. The importance of exporting the data and working from an exported file is the repeatability of the analysis. The export can be saved to online repository (i.e. https://github.com), and it can be linked in the code. The link will enable other users to repeat your analysis to be used in future renditions of the study.


Collect Data

InDesign<-function(path){
  indesign<-read.csv(path)
  indesign<-indesign[,2:4]
  indesign <- indesign[,-c(2)]
  colnames(indesign)<-c("response","levels")
  indesign}
# Line 1: Creates the InDesign function and then defines the pathname from the set working directory
# Line 2: Note the columns needed for test, this will be based on the layout.csv column numbers
# Line 3: Further manipulate the data, remove or add any columns needed for the analysis
# Line 4: Name the columns, proper names to represent your data set


levels<-c("Level1","Level2","Level3")
replication<-5
OutDesign(levels, response) #open layout.csv, add column with response, and save file
# Line 1: Define the levels of the analysis (i.e. i values).
# Line 2: Define the number of replications in the analysis (i.e. j values).
# Line 3: This will rename the replications output column to response, so you have the levels and the responses. This will also generate a file titled layout.csv.
5 replications was chosen by the lesson to show an extensive amount of data for example and reference material.


dat<-InDesign(path="/Users/YOURNAMEHERE/layout.csv") #read file back into R with response added
dat
# Line 1: Reads in the file layout.csv from the pathname from your set working directory to InDesign. If you view your files (right side of R Studio layout), you will see the file layout.csv . If not, make sure you have the library(agricolae) package loaded.
# Line 2: Shows modified export data with levels and response only.


Adding Response to your data

The InDesign function is created to pull data from the csv file back into R, but it is modified to perform the analysis (i.e. Kruskal-Wallace Test). This is done by calling the path that the CSV file is located which is in the set working directory. Then the data is modified by showing only the levels and the responses which is previously known for the replication outputs in the data set. Following, the proper naming of columns and rows can be used to present the “dat” of the data which is used for the study. Additionally, one can save the export to an online directory, and one can pull the data from there. Make sure the file is up to data and accurate before they test is run.


Other Examples of Design Layout for CRD

No seed means that the random number will be generated by the default parameters.
outdesign2 <-design.crd(trt,r,serie=3)
print(outdesign2$parameters)
book2<-outdesign2


Seeds mean the random number set is identified.
# seed = 12543
outdesign1 <-design.crd(trt,r,serie=2,12543,"Mersenne-Twister")
book1<-outdesign1


Pre-defined data set means that the data will then be randomized in which order the samples are taken.
trt <-c("CIP-101","CIP-201","CIP-301","CIP-401","CIP-501")
r <-c(4,3,5,4,3)


Statistical Test


Linear effects equation

\[ y_{ij}= \mu + \tau_{i} + \epsilon_{ij} ~; ~where ~i=1,2,3 ~and ~j=1,2,3,4,5 \]


Hypothesis Testing:

\[ H_{0} : \mu_{1} = \mu_{2} = \mu_{3}~or~H_{0} : \tau_{i} = 0 \]

\[ H_{a} : At~least~one~\mu ~is ~different ~or~H_{a} : \tau_{i} \neq 0 \]


Equation explanations can be found in the textbook Sections 3.2-3.3.
(Montgomery, D. C. (2013). Design and analysis of experiments (8. ed). Wiley.)


Level1 <- c(values)
Level2 <- c(values)
Level3 <- c(values)
# Create vectors of the levels that contain the data values.


Levels <- data.frame(Level1, Level2, Level3)
# Compiles the vectors together to make the data frame.


Level_long <- pivot_longer(Levels, c(Level1, Level2, Level3))
# Pivot the data frame from wide to long.


aov.model<-aov(value~name, data=Level_long)
# Create an AOV model.


summary(aov.model)
# Generates a summary of the ANOVA test. This can be used to evaluate the data.


plot(aov.model)
# Generates Residual Plots from the data


qqnorm(Level_long$value)
qqline(Level_long$value)
# Generates a Q-Q Normal Plot with a best fit line


kruskal.test(value~name, data= Level_long)
# This function performs the Kruskal-Wallis test.


Residuals

Residuals can help determine if the data fits normality or not, and example plots are provided below:


If we see that the data is following the line of best fit, then we can say that the data is normally distributed. But if the data starts to deviate from the line of best fit then we need to look for another method to test our Null hypothesis of the means of each group being equal to each other, like the Kruskal-Wallace Test.

As we can see in the plots above, the red line deviates from the line of best fit there showing us that the data is not normally distributed. With this conclusion that the data is not normally distributed and has “messy” data, we can say that another test can be used to test the Null hypothesis that means of all the groups are equal (Kruskal-Wallace).


One instance of when non-parametric testing can be used is Ordinal or Categorical Data: Non-parametric tests are appropriate for data with ordered categories (ordinal data) or purely categorical data (nominal data). These tests, such as the Mann-Whitney U test, Kruskal-Wallis test, or Chi-squared test, do not assume interval-level measurement or normally distributed data


Inference

Once you run the ANOVA and check Residuals, you can determine whether the data is normally distributed or not. If it is not, you can run the Kruskal-Wallis test to analyze the data properly instead.


Once the Kruskal-Wallis test is run, compare the P-value to alpha. If it is less than alpha, we reject the null hypothesis. If it is greater than alpha, we fail to reject the null hypothesis.


Example

library(tidyr)
# Organizes the "messy" data to usable format, need to load the package


library(dplyr)
# Package used to manipulate the data into a uniform set to resolve the analysis, need to load the package


library(MASS)
# Package supports functions and dataset for least squares, best fit, linear modes, and Kruskal-Wallace for multidimensional scaling
-Libraries’ description from https://cran.r-project.org


Modifying the example from https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/16-6-kruskal-wallis-test-h-test/:

With the following data on ml of potassium/quart in brands of drink, determine if there is a significant difference in the potassium content between brands. Assume alpha = 0.05.


##   Brand.A Brand.B Brand.C
## 1     4.7     5.3     6.3
## 2     3.2     6.4     8.2
## 3     5.1     7.3     6.2
## 4     5.2     6.8     7.1
## 5     5.0     7.2     6.6


Linear Effects Equation

\[ y_{ij}= \mu + \tau_{i} + \epsilon_{ij} ; ~where ~i=1,2,3 ~and ~j=1,2,3,4,5 \]

i is the number of levels (brands) and j is the number of observations (DV).


Hypothesis Testing:

\[ H_{0} : \mu_{1} = \mu_{2} = \mu_{3} ~or ~H_{0} : \tau_{i} = 0 \]

The null hypothesis assumes that the potassium content of each brand is the same.


\[ H_{a} : At~least~one~\mu~is~different~or~H_{a} : \tau_{i} \neq 0 \]

The alternative hypothesis assumes that at least one brand’s potassium content will differ.


Create the vectors and compile them together to make the data frame.
A <- c(4.7, 3.2, 5.1,5.2,5)
B <- c(5.3,6.4,7.3,6.8,7.2)
C <- c(6.3,8.2,6.2,7.1,6.6)
brands1 <- data.frame(A,B,C)


Pivot the data frame from wide to long and rename columns to match the original data table
brand_long <- pivot_longer(brands1,c(A,B,C))
colnames(brand_long)<-c("Brand","DV")
brand_long
## # A tibble: 15 × 2
##    Brand    DV
##    <chr> <dbl>
##  1 A       4.7
##  2 B       5.3
##  3 C       6.3
##  4 A       3.2
##  5 B       6.4
##  6 C       8.2
##  7 A       5.1
##  8 B       7.3
##  9 C       6.2
## 10 A       5.2
## 11 B       6.8
## 12 C       7.1
## 13 A       5  
## 14 B       7.2
## 15 C       6.6


OR


An alternative way: Read in data from an online repository.
brands2<-read.csv("https://raw.githubusercontent.com/mharvey2696/Test1DoE/main/CRD-NONPAR-CSV.csv")


Create an AOV model and then generate a summary of the ANOVA test. This can be used to evaluate the data.
aov.model<-aov(DV~Brand,data=brands2)
summary(aov.model)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## Brand        2  14.90   7.448   11.14 0.00184 **
## Residuals   12   8.02   0.668                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Now that the ANOVA test has been completed, is the data Normally Distributed?


Generate Residual Plots from the data.
plot(aov.model)

The residuals do not follow the best fit line, and we can see there are some outliers. Markers 2, 6, and 12. The outliers represent data points that may skew the data to a false negative or positive.


The residuals denote the error as it relates to the fitted values. The line of best fit is skewed above the zero line, splitting the fitted values at varying locations. The residuals’ responses for each group are not similar in vertical length or variance. Therefore, the data does not show a constant variance. The varying data points cause for the assumption of the non-normal data, and it needs a transformation to become a data set with constant variance. In the case of a CRD, the data is non-parametric. This can be analyzed using the Kruskal-Wallace Test to understand the acceptance or rejection of the null hypothesis.


The Levene’s Test uses the data to test for constant variance. According to the chart, outliers 2, 6, 12 are extending the data outside of the constant variance; therefore, the data set does not maintain a constant variance. The data depicts a non-parametric data set that would need further exploration to access the condition of the null hypothesis.


The Scale-Location plot is used to assess assumptions of constant variance of residuals and identify outliers and influential observations. In this data set, the data shows an inconsistent variance, the data is not split evenly over the line of best fit, and the points do not have similar distance between each set of fitted values; therefore, the data is non-parametric.


Now, generate a Q-Q Normal Plot with a best fit line.
qqnorm(brands2$DV)
qqline(brands2$DV)


Looking at the plot, the data does not closely follow the best fit line. The outliers represent data points that may skew the data to a false negative or positive.


After analyzing all of the plots, we can conclude that the data is not normally distributed. Now that we have come to this conclusion, we can run the Kruskal-Wallis test for a proper analysis.


kruskal.test(DV~Brand,data=brands2)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  DV by Brand
## Kruskal-Wallis chi-squared = 9.38, df = 2, p-value = 0.009187
# This function performs the Kruskal-Wallis test.


The p-value = 9.187e-03 is less than alpha = .05. Therefore, we reject the null hypothesis and conclude that there is a difference in potassium content between brands.


Complete Code of Example

library(agricolae)
library(tidyr)
library(dplyr)
library(MASS)

brands<-read.csv("https://raw.githubusercontent.com/mharvey2696/Test1DoE/main/test.csv")
brands

A <- c(4.7, 3.2, 5.1,5.2,5)
B <- c(5.3,6.4,7.3,6.8,7.2)
C <- c(6.3,8.2,6.2,7.1,6.6)

brands1 <- data.frame(A,B,C)
brand_long <- pivot_longer(brands1,c(A,B,C))
colnames(brand_long)<-c("Brand","DV")
# OR
brands2<-read.csv("https://raw.githubusercontent.com/mharvey2696/Test1DoE/main/CRD-NONPAR-CSV.csv")

aov.model<-aov(DV~Brand,data=brands2)
summary(aov.model)

plot(aov.model)
qqnorm(brands2$DV)
qqline(brands2$DV)

kruskal.test(DV~Brand,data=brands2)