This is a quick one that was prompted by a colleague who had a query about data transforms that minimise skewness. Quite often a log transform is used to do this. However, when there are zero values (or even negative values) in the data, this will lead to some of the values in the transformed variable being undefined. For this reason, a constant is sometimes added to the variable before taking a log transform:
\[ f(x_i) = \log(x_i + c) \]
where \( x_i \) is the \( i \)'th observed value, and \( f \) is the transformation function. Sometimes \( c \) is 1, but it can take other values. The question my colleague asked was whether it was possible to find a value of \( c \) that would minimise the sample skewness of the variable \( x \). The idea here was that this would then be fed into some kind of data mining algorithm, such as k-means.
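As a quick illustration of why the constant is needed (a toy sketch using made-up numbers, not the data analysed below): in R, taking the log of a zero gives -Inf, whereas adding a suitable constant first keeps everything finite:
x <- c(0, 1, 5, 12)  # a made-up variable containing a zero
log(x)               # the zero becomes -Inf
log(x + 1)           # adding c = 1 gives finite values throughout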
This can be viewed as an optimisation problem. The skewness of the transformed data can be regarded as a function of \( c \) - and we wish to find a value of \( c \) that makes the skewness as close to zero as possible. Here, the definition of skewness is
\[ b_3(c) = \left( \frac{n-1}{n} \right)^{\frac{3}{2}} \frac{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^3}{\left( \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2 \right)^{\frac{3}{2}}} \]
where \( y_i = \log(x_i + c) \) and \( \bar{y} \) is the mean of the \( y_i \). This is the same definition as that used by Minitab and BMDP - but slightly different to that used by SAS and SPSS.
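To make the definition concrete, here is a minimal sketch (the function name skew.by.hand is just for illustration) that computes \( b_3 \) directly from the formula above; it should agree with the skewness function in e1071, whose default estimator is of this type:
library(e1071)  # loaded properly in the next step; needed here for skewness()
# hand-rolled version of the b_3 definition above
skew.by.hand <- function(y) {
  n  <- length(y)
  m2 <- mean((y - mean(y))^2)  # second central moment of the transformed values
  m3 <- mean((y - mean(y))^3)  # third central moment
  ((n - 1)/n)^(3/2) * m3/m2^(3/2)
}
y <- log(c(0, 2, 3, 8, 21) + 1)  # a small made-up example with c = 1
skew.by.hand(y)
skewness(y)  # e1071's default should give the same value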
For this example, you need two R packages: the somewhat enigmatically named e1071, which contains the skewness function, and datasets, which provides an example data set relating to air quality measurements in New York. Make sure they are installed, and then load them:
library(e1071)
## Loading required package: class
library(datasets)
Next, extract the air quality data:
data(airquality)
ozone <- airquality$Ozone
ozone <- ozone[!is.na(ozone)]
and check visually that it is skewed:
hist(ozone, col = "tomato")
and check its skewness numerically:
skewness(ozone)
## [1] 1.21
So this data has a clear positive skew. To find the most appropriate value of \( c \) it is necessary to minimise a function that measures the discrepancy between the skewness and zero. One possibility is \( \left( b_3(c) \right)^2 \). To use the R function optimise, it is necessary first to define this function:
skew.score <- function(c, x) (skewness(log(x + c)))^2
In this function x is the data set and c is the value referred to in the equation above - the quantity whose optimal value we wish to find. Before running the optimisation, it is often worth inspecting the situation graphically. Here the skewness is plotted against values of c between 0 and 20:
cval <- seq(0, 20, l = 101)
skew <- cval * 0
for (i in 1:length(cval)) skew[i] <- skewness(log(cval[i] + ozone))
plot(cval, skew, type = "l", ylab = expression(b[3](c)), xlab = expression(c))
abline(h = 0, lty = 3)
The line \( y=0 \) is shown as a dotted line. The fact that the curve intersects this line shows that it is in fact possible to find a \( c \) value that makes the skewness equal to zero for this data set - it occurs when \( c \) has a value near to 5. Now, optimise is used to find a more precise optimal value. In the code below, optimise is given the name of the function to optimise, then a search range for the input variable. Finally, the x=... part of the function call supplies the extra information that skew.score needs - basically the x argument for the data set. The optimise function returns a list of values (check it out in help to see what they all are) but here we are only interested in the actual optimal value (called minimum), hence the $minimum at the end of the function call.
best.c <- optimise(skew.score, c(0, 20), x = ozone)$minimum
best.c
## [1] 5.066
The optimal \( c \) value is around 5.066. A histogram of the transformed data is below:
ozone.transformed <- log(ozone + best.c)
hist(ozone.transformed, col = "azure")
The skewness of the transformed data can also be checked:
skewness(ozone.transformed)
## [1] -4.931e-08
Although the minimisation algorithm is not perfectly exact, it has done pretty well.
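As an aside: since the plot above showed that \( b_3(c) \) itself crosses zero, another way to get essentially the same answer (a sketch of an alternative, not the approach used above) is to search for that crossing point directly with R's uniroot function, applied to the signed skewness rather than its square - this assumes the skewness has opposite signs at the two ends of the search interval, which the plot suggests is the case here:
skew.root <- function(c, x) skewness(log(x + c))  # signed skewness, not squared
uniroot(skew.root, c(0.1, 20), x = ozone)$root    # should land very close to best.c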
The above technique minimises skewness, but does not necessarily transform the data to normality - there are lots of distributions that are symmetrical, and have zero skewness, but are not normal. To see how close to the normal distribution the transformed data is, a Q-Q plot may be used. This plots the sorted data set against the corresponding quantiles of the normal distribution - a data set that is close to the normal should give something reasonably close to a straight line relationship.
qqnorm(ozone.transformed)
qqline(ozone.transformed)
The plot suggests that the transformed data is reasonably close to normal - but that there are departures in both the upper and lower tails. In the upper tail, the sample has lower values than expected under an assumption of normality, and in the lower tail the sample has higher values. This suggests that our sample is actually 'sharper' than the normal distribution - in the sense that it is more concentrated toward the centre.
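If a numerical check is wanted to back up the visual impression from the Q-Q plot, one option (just an aside, using base R's shapiro.test) is a Shapiro-Wilk test on the transformed data - a small p-value would indicate evidence of departure from normality:
shapiro.test(ozone.transformed)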