Hello Witches and Wizards, :male_mage:
In the initial posts for this thread, my colleagues aptly described the formal and informal ways in which we can assess normality. In this post, I’d like to add to the discussion by showing how this can be done in practice using the R programming language.
To start, we first need to generate a normal distribution and a non-normal distribution.
## as we're going to deal with random numbers, setting the seed ensures
## that results are reproducible
set.seed(123)
## the rnorm function has the arguments n, mean and sd
## when mean and sd aren't specified, the default values are 0 and 1
## which are the ones we're going to use here
samplesize <- 1000
normal <- rnorm(samplesize)
## let's see what the distribution looks like
head(normal, 100)
## [1] -0.560475647 -0.230177489 1.558708314 0.070508391 0.129287735
## [6] 1.715064987 0.460916206 -1.265061235 -0.686852852 -0.445661970
## [11] 1.224081797 0.359813827 0.400771451 0.110682716 -0.555841135
## [16] 1.786913137 0.497850478 -1.966617157 0.701355902 -0.472791408
## [21] -1.067823706 -0.217974915 -1.026004448 -0.728891229 -0.625039268
## [26] -1.686693311 0.837787044 0.153373118 -1.138136937 1.253814921
## [31] 0.426464221 -0.295071483 0.895125661 0.878133488 0.821581082
## [36] 0.688640254 0.553917654 -0.061911711 -0.305962664 -0.380471001
## [41] -0.694706979 -0.207917278 -1.265396352 2.168955965 1.207961998
## [46] -1.123108583 -0.402884835 -0.466655354 0.779965118 -0.083369066
## [51] 0.253318514 -0.028546755 -0.042870457 1.368602284 -0.225770986
## [56] 1.516470604 -1.548752804 0.584613750 0.123854244 0.215941569
## [61] 0.379639483 -0.502323453 -0.333207384 -1.018575383 -1.071791226
## [66] 0.303528641 0.448209779 0.053004227 0.922267468 2.050084686
## [71] -0.491031166 -2.309168876 1.005738524 -0.709200763 -0.688008616
## [76] 1.025571370 -0.284773007 -1.220717712 0.181303480 -0.138891362
## [81] 0.005764186 0.385280401 -0.370660032 0.644376549 -0.220486562
## [86] 0.331781964 1.096839013 0.435181491 -0.325931586 1.148807618
## [91] 0.993503856 0.548396960 0.238731735 -0.627906076 1.360652449
## [96] -0.600259587 2.187332993 1.532610626 -0.235700359 -1.026420900
## now let's generate a skewed distribution
## base R doesn't have a function for this by itself, so let's use a package
require(fGarch)
## rsnorm has the arguments n, mean, sd and xi (skewness)
## the defaults are mean = 0, sd = 1 and xi = 1.5 (a moderate right skew),
## which we're going to use; note that xi = 1 would mean no skew at all
skewed <- rsnorm(samplesize)
## let's see what skewed looks like
head(skewed, 100)
## [1] 0.506286044 -0.182308105 0.615006252 0.246361750 0.907554213
## [6] 2.257130082 -0.103416992 0.578391509 0.779038086 0.618572857
## [11] -0.736081292 1.493971530 0.426296284 0.919335329 -0.769953740
## [16] 0.124620240 -0.758505467 0.320413212 -0.436935962 -0.132152815
## [21] -0.660645621 -0.214278806 -0.946234535 -0.170673986 -0.583223417
## [26] -0.316027499 1.459883876 1.226086543 -0.326584475 -0.970364978
## [31] 1.767632129 -0.023128088 -0.575715578 -0.181906423 -0.038818876
## [36] -1.183357642 -0.347709901 -0.691707300 -0.723384977 -1.028599850
## [41] 0.895524784 -0.219968058 -0.492052562 -1.424987090 -0.692167624
## [46] -1.534117394 -0.631039340 0.228741373 1.478200171 -0.443502167
## [51] -0.564937395 0.422322749 -1.211258434 0.676746020 -1.138738272
## [56] 1.146647634 -0.025320375 -1.256922259 -1.220301753 0.110103398
## [61] 0.325384363 -1.185991978 -0.542514051 -0.913548145 0.107709809
## [66] 0.948926880 0.631967768 -0.275043484 -0.736569230 -0.635395685
## [71] 1.183074344 -1.010066228 0.410343784 0.236370951 1.530344647
## [76] -1.082423966 -0.336301452 -0.316460666 -0.619973555 0.089659870
## [81] 0.509545743 -1.101623172 -1.443765282 -0.003981283 0.468252619
## [86] -0.215746660 0.398658430 -0.905918511 1.777477259 0.301604121
## [91] 2.280832704 -0.424709391 0.011451877 0.929363795 -0.312957018
## [96] -0.671574219 -0.789850229 2.050027552 2.419955115 0.529384446
Looking at the numbers alone, it’s almost impossible to differentiate the two.
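One crude numeric check we can run before any plotting (a sketch using only base R) is to compare each sample’s mean and median: in a symmetric distribution the two should be close, while skewness pulls them apart.
## quick numeric check: in a symmetric sample the mean and the median
## should be close; skewness pulls them apart
summary(normal)
summary(skewed)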
Beyond that quick check, now that we have generated both distributions, let’s begin with exploratory data analysis, which will also serve as our informal assessment of normality.
## let's load the packages that we will use in the informal analysis
require(tidyverse)
## we will create multiple plots and then display them all on a single page.
## the ggplot function allows us to plot data: it takes a data frame as its
## argument and needs aesthetics to be set. in order to convert our numeric
## vectors into data frames, we're using the as.data.frame function, and
## we're setting the aesthetics with the aes function.
## our first plot is going to be a histogram: after defining our data and
## our aesthetics, we add the histogram geom to finish the plot, setting
## the number of bins to the rounded-up square root of the sample size
## (a common rule of thumb)
p1 <- ggplot(as.data.frame(normal), aes(x=normal)) + geom_histogram(bins = ceiling(sqrt(samplesize)))
## next is a Q-Q plot
p2 <- ggplot(as.data.frame(normal), aes(sample = normal)) + geom_qq() + geom_qq_line()
## next is a density plot
p3 <- ggplot(as.data.frame(normal), aes(x = normal)) + geom_density()
## we will now do the same three plots for our skewed distribution
p4 <- ggplot(as.data.frame(skewed), aes(x=skewed)) + geom_histogram(bins = ceiling(sqrt(samplesize)))
p5 <- ggplot(as.data.frame(skewed), aes(sample = skewed)) + geom_qq() + geom_qq_line()
p6 <- ggplot(as.data.frame(skewed), aes(x = skewed)) + geom_density()
## multiplot is not a built-in R function; it was defined by
## Winston Chang [1]
multiplot(p1,p2,p3,p4,p5,p6, cols = 2)
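If you’d rather not copy the multiplot definition, the patchwork package offers a similar layout. A minimal sketch, assuming patchwork is installed (byrow = FALSE mimics multiplot’s column-wise filling):
## alternative arrangement using the patchwork package
require(patchwork)
wrap_plots(p1, p2, p3, p4, p5, p6, ncol = 2, byrow = FALSE)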
Our informal assessment suggests that the first distribution is normal while the second one is not: the normal sample’s histogram looks roughly bell-shaped and its Q-Q points stay close to the reference line, whereas the skewed sample’s Q-Q plot bows away from the line and its density curve is visibly asymmetric.
Let’s proceed with the formal assessment (which is a lot simpler on the code side of things).
## R has a built-in function for the Shapiro-Wilk test, which we're
## going to use: its name is shapiro.test and its only argument is
## a numeric vector
shapiro.test(normal)
##
## Shapiro-Wilk normality test
##
## data: normal
## W = 0.99838, p-value = 0.4765
The output above has a p-value greater than 0.05, which means we cannot reject the null hypothesis; the data are consistent with a normal distribution.
shapiro.test(skewed)
##
## Shapiro-Wilk normality test
##
## data: skewed
## W = 0.97752, p-value = 2.574e-11
We can format the p-value to make it easier to read.
format(shapiro.test(skewed)$p.value, scientific = FALSE)
## [1] "0.00000000002574398"
With this, it is easy to see that the p-value is smaller than 0.05, so we can reject the null hypothesis and conclude that this sample does not come from a normal distribution.
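To make repeated checks less tedious, the decision rule can be wrapped in a small helper. This is only a convenience sketch: looks_normal is a name I’m introducing here, not a standard function, and the usual caveats about reducing a test to a single p-value threshold apply.
## hypothetical helper (not part of base R): TRUE when the Shapiro-Wilk
## test does not reject normality at the chosen significance level
looks_normal <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value > alpha
}
looks_normal(normal) ## TRUE here, given the p-value of 0.4765 above
looks_normal(skewed) ## FALSE here, given the tiny p-value above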
To understand the importance of sample size, I will repeat the experiment above, changing the samplesize variable to 150. I will omit the rest of the code, as it would just be repeated.
samplesize <- 150
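For anyone following along, the omitted step amounts to rerunning the two sampling calls at the new size (unless the seed is reset beforehand, the exact numbers will differ from the first run):
## regenerate both samples at the smaller sample size
normal <- rnorm(samplesize)
skewed <- rsnorm(samplesize)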
The purpose of this exercise is to show how a small sample size can muddy the waters and make it harder to visually determine whether the variables are normally distributed.
Let’s try the Shapiro-Wilk test and see what comes of it.
shapiro.test(normal)
##
## Shapiro-Wilk normality test
##
## data: normal
## W = 0.98713, p-value = 0.1802
shapiro.test(skewed)
##
## Shapiro-Wilk normality test
##
## data: skewed
## W = 0.95652, p-value = 0.0001188
Even with the visual inspection compromised, the Shapiro-Wilk test still correctly identified the normal distribution as normal and the skewed one as non-normal. That might not always happen, though.
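To get a feel for how often the test catches the skew at this sample size, we can run a quick simulation. This is a sketch under the same assumptions as before (fGarch’s rsnorm with its default skewness); the proportion it prints estimates the test’s power against this particular alternative, and anything below 1 means the skew can occasionally slip through undetected.
## estimate how often Shapiro-Wilk rejects normality for skewed samples
## of this size: each replicate draws a fresh sample and tests it
set.seed(123)
rejected <- replicate(1000, shapiro.test(rsnorm(samplesize))$p.value < 0.05)
mean(rejected)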