The use of parametric tests to compare numerical variables (such as Student's t test) is ubiquitous and in many cases robust. A number of assumptions underlie the use of these tests. They can be tedious to check, so much so that some researchers ignore them. It is important to be aware of these assumptions, though, and to test them before the final decision to use a parametric test is made.
What are these assumptions, then? Below follows a short description of the four important assumptions.
The first assumption concerns normality. The p value for parametric tests depends on a normal sampling distribution. If the sample size is large enough, the central limit theorem ensures that the sampling distribution of the mean is approximately normal even when the individual data point values are not; for smaller samples, the data themselves should be approximately normally distributed.
In regression analysis and in general linear models, it is the errors that need to be normally distributed.
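As a small illustration of this reasoning (the simulation below is purely illustrative and not part of the data used later in this post), the sampling distribution of the mean can be simulated from a markedly skewed population; for samples of size 100 the distribution of the sample means is close to normal.
# Means of 1000 samples, each of size 100, drawn from a skewed gamma population
means <- replicate(1000, mean(rgamma(100, shape = 2, rate = 2)))
hist(means, prob = TRUE, main = "Sampling distribution of the mean", las = 1, xlab = "Sample mean")
lines(density(means))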
The second assumption, homogeneity of variance, refers to the need for similar variance throughout the data. This means that the variable should have a similar variance in each of the populations from which the samples were taken.
In the case of regression, the variance of the residuals should remain constant across the range of fitted values (homoscedasticity).
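As a hedged sketch of what this looks like in practice (the variables x and y below are simulated for illustration only), the residuals of a simple linear model can be plotted against the fitted values; a roughly constant vertical spread is what the assumption requires.
x <- rnorm(100, mean = 10, sd = 2)
y <- 3 + 0.5 * x + rnorm(100)
model <- lm(y ~ x)
# Constant spread of the residuals across the fitted values suggests homoscedasticity
plot(fitted(model), resid(model), las = 1, xlab = "Fitted values", ylab = "Residuals", main = "Residuals versus fitted values")
abline(h = 0, lty = 2)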
The third assumption is that the data point values should be for a numerical variable, measured at the interval or ratio level.
The fourth assumption is independence: data point values for the different groups should be independent of each other. In regression analysis, the errors should likewise be independent.
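For regression errors, one way to examine this (specifically, the absence of serial correlation among the residuals) is the Durbin-Watson test in the car package, which is loaded later in this post in any case. The sketch below is illustrative only and reuses the type of simulated model shown above.
library(car)
# Re-create a simulated regression for illustration
x <- rnorm(100, mean = 10, sd = 2)
y <- 3 + 0.5 * x + rnorm(100)
model <- lm(y ~ x)
# Durbin-Watson test for serial correlation among the residuals
durbinWatsonTest(model)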
In the rest of this post, the first two assumptions are investigated.
The assumption of normality is arguably the most important. It can be checked visually or numerically. Histograms and quantile-quantile (QQ) plots serve as visual checks, and various statistical tests, such as the Shapiro-Wilk test, serve as numerical checks.
In the code below, 100 data point values for a variable named hb are created. The values are taken from a normal distribution with a mean of 15 and a standard deviation of 3. A histogram with default bin size is created to visualize the frequency distribution of the data. A kernel density estimate is added using the lines() command.
hb <- rnorm(100, mean = 15, sd = 3)
hist(hb, prob = TRUE, main = "Histogram of hemoglobin values", las = 1, xlab = "Hemoglobin")
lines(density(hb))
From the plot above it seems clear that the data are approximately normally distributed. The QQ plot below plots the sample quantile of each data point value against its theoretical quantile. A line is added for clarity. The closer the data point values follow the line, the more likely it is that our assumption has been met.
qqnorm(hb, main = "QQ plot of hemoglobin values")
qqline(hb)
The next simulated variable is named crp and takes its values from a gamma distribution. Once again, 100 data point values are created. Following this are the accompanying histogram and QQ plot.
crp <- rgamma(100, shape = 2, rate = 2)
hist(crp, prob = TRUE, main = "Histogram of c-reactive protein values", las = 1, xlab = "CRP")
lines(density(crp))
qqnorm(crp, main = "QQ plot of CRP values")
qqline(crp)
As expected, the visual indication is that the assumption of normality is not met.
Simply describing the data point values of a variable can give a good understanding of the underlying distribution. The summary() command returns the basic descriptive statistics, including the minimum, maximum, and the quartile values.
summary(hb)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.922 13.051 14.555 14.777 16.755 20.038
If the mean of the data point values is approximately the same as the median value, it can be an indication of the normality of the data.
mean(hb)
## [1] 14.77742
The stat.desc() command in the pastecs package gives more information. It is used below in conjunction with the round() command, with the digits = argument set to 3.
library(pastecs)
round(stat.desc(hb, basic = FALSE, norm = TRUE), digits = 3)
## median mean SE.mean CI.mean.0.95 var
## 14.555 14.777 0.276 0.548 7.622
## std.dev coef.var skewness skew.2SE kurtosis
## 2.761 0.187 -0.235 -0.487 -0.417
## kurt.2SE normtest.W normtest.p
## -0.436 0.983 0.226
The absolute values of skewness and kurtosis are interpreted as usual. The skew.2SE and kurt.2SE values need some care. They express the relevant value divided by twice its standard error. For small sample sizes, values of less than -1.0 or more than +1.0 indicate a p value of less than 0.05, values of less than -1.29 or more than +1.29 indicate a p value of about 0.01, and so on. For very large sample sizes the standard error becomes very small, so that even trivial skewness or kurtosis crosses these cut-offs; in such cases they should be interpreted with caution.
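As a rough check on this interpretation, the skew.2SE value can be reproduced by hand. The standard error formula below is the common small-sample formula for the skewness statistic; that pastecs uses it is an assumption here, although it does reproduce the value reported above.
n <- length(hb)
# Standard error of the skewness statistic (common small-sample formula)
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
desc <- stat.desc(hb, basic = FALSE, norm = TRUE)
desc["skewness"] / (2 * se_skew)  # should be close to the skew.2SE value above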
The normtest.W and normtest.p values give the test statistic and p value of the Shapiro-Wilk test.
The Shapiro-Wilk test can also be used on its own. A p value of less than 0.05 indicates a high likelihood that the assumption of normality is NOT met. Below, the hb and crp variables are passed as arguments to the shapiro.test() command, resulting in the same test statistic and p value for hb as above.
shapiro.test(hb)
##
## Shapiro-Wilk normality test
##
## data: hb
## W = 0.983, p-value = 0.2261
shapiro.test(crp)
##
## Shapiro-Wilk normality test
##
## data: crp
## W = 0.89939, p-value = 1.324e-06
The Levene test is used to test for homogeneity of variance. The null hypothesis states equality of variances. In order to conduct Levene's test, the Companion to Applied Regression (car) package is required.
The leveneTest() command requires the use of a data.frame object. The code below imports a csv file and prints the first six rows and a summary to the screen.
df <- read.csv(file = "data.csv", header = TRUE)
head(df)
## CRP Group
## 1 11.7 A
## 2 9.1 C
## 3 9.2 C
## 4 6.1 C
## 5 11.0 C
## 6 10.3 A
summary(df)
## CRP Group
## Min. : 3.700 A:150
## 1st Qu.: 8.500 C:150
## Median : 9.800
## Mean : 9.859
## 3rd Qu.:11.200
## Max. :15.400
Note that there are two variables, namely CRP and Group. The first is a ratio-type numerical variable and the second is a nominal categorical variable (a factor in R). The leveneTest() command requires specification of a factor by which to group the numerical variable under consideration. The third argument is center = and can take the value median (the default) or mean. We are interested in the mean. Using the median results in the Brown-Forsythe test for variances.
The code below loads the car package and runs the Levene test.
library(car)
## Loading required package: carData
leveneTest(df$CRP, df$Group, center = mean)
## Levene's Test for Homogeneity of Variance (center = mean)
## Df F value Pr(>F)
## group 1 0.5336 0.4657
## 298
The p value is greater than the usually chosen \(\alpha\) value of 0.05. The null hypothesis is not rejected and the variances are accepted as being equal, allowing for the use of a parametric test.
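With equal variances not rejected (and assuming the CRP values in each group are approximately normally distributed), a parametric test such as Student's t test, mentioned at the start of this post, could now follow. The sketch below reuses the df object imported above.
# Student's t test comparing CRP between the two groups,
# with var.equal = TRUE in light of the Levene test result
t.test(CRP ~ Group, data = df, var.equal = TRUE)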
It is taken for granted that the variable considered for testing the assumptions is numerical and that the values for the different groups are independent, and this post will conclude here.
Testing the assumptions for the use of parametric tests can seem laborious, but is an essential requirement in data analysis.
Written by Dr Juan H Klopper
http://www.juanklopper.com
YouTube https://www.youtube.com/jhklopper
Twitter @docjuank