Author: Daniele Melotti
Date: Nov 2023
See the related GitHub repository for code, data and list of tasks.
Consider the composite distribution below - Distribution 1. It has been created by combining three normal distributions. When we visualize it, there are two important things to observe: first, the distribution is positively skewed (tail stretches to the right); second, the mean and the median are different.
# Setting seed
set.seed(123)
# Creating three normally distributed data sets
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
# Combining them into a composite dataset
d123 <- c(d1, d2, d3)
# Plotting the density function of d123
plot(density(d123), col="cornflowerblue", lwd=2,
main = "Distribution 1")
# Adding vertical lines showing mean and median
abline(v=mean(d123))
abline(v=median(d123), lty="dashed")
Similarly, we create and visualize a new Distribution 2: a combined dataset (n = 800) that is negatively skewed (tail stretches to the left). This time we draw three new normal samples (d4, d5, and d6) with different means and standard deviations to obtain the new shape. Additionally, we compute the mean and median, and draw lines showing the mean (solid line) and median (dashed line).
# Setting seed
set.seed(456)
# Three normally distributed data sets
d4 <- rnorm(n = 500, mean = 55, sd = 5)
d5 <- rnorm(n = 200, mean = 45, sd = 10)
d6 <- rnorm(n = 100, mean = 30, sd = 15)
# Merging the three data sets together
d456 <- c(d4, d5, d6)
# Computing the mean and median
mean(d456)
## [1] 49.70087
median(d456)
## [1] 52.88905
# Plotting the density function of d456
plot(density(d456), col = "darkorange", lwd = 2, main = "Distribution 2")
abline(v = mean(d456))
abline(v = median(d456), lty = "dashed")
As we can see, the mean is 49.7009, while the median of Distribution 2 is 52.8891.
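To back up the visual impression of skewness, we can also compute a simple moment-based sample skewness for both composite datasets. The helper below is not part of the original analysis, just a quick check: a positive value indicates a right tail, a negative value a left tail.
# Simple moment-based sample skewness (illustrative helper, not in the original analysis)
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)
# Positive for Distribution 1, negative for Distribution 2
c(skew_d123 = skewness(d123), skew_d456 = skewness(d456))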
Next, we proceed to create Distribution 3: a single dataset
that is normally distributed (bell-shaped, symmetric). We can use the R
function rnorm() to create a single large dataset (n =
800). Similarly to Distribution 2, we compute the mean and median, and
draw lines showing the mean (solid line) and median (dashed line).
# Setting seed
set.seed(789)
# Creating a normally distributed dataset d
d <- rnorm(n = 800)
# Computing the mean and median
mean(d)
## [1] -0.006176743
median(d)
## [1] 0.02046861
# Plotting the density function of d
plot(density(d), col = "coral2", lwd = 2, main = "Distribution 3")
abline(v = mean(d))
abline(v = median(d), lty = "dashed")
The mean and median of Distribution 3 are both close to 0 (-0.0062
and 0.020 respectively). That is because Distribution 3 is derived from
the normal distribution, and in this type of distribution, mean and
median are the same (mode as well). In fact, normal distributions are
symmetrical, which means that the distribution of values on the left and
on the right sides of the mean is the same. Consequently, the median
aligns with the mean. Moreover, we have not modified the mean
parameter of rnorm(), resulting in the mean and median
being both close to the default 0.
In the case of the composite Distributions 1 and 2, why do the means and medians not align? When we merge distributions with different means and standard deviations, we create a composite distribution that is no longer normally distributed. Indeed, in our case we created two asymmetric distributions, one positively and one negatively skewed. In skewed distributions, the mean is typically pulled towards the tail, while the median stays closer to the bulk of the data.
If we were to add outliers to our data, the mean would be affected the most. That is because the calculation of the mean is made by summing all the values in a dataset and then dividing by the number of values. Outliers are values that are significantly lower or significantly higher than the rest of the data, hence, they can influence the mean value. On the other hand, the median is the middle value when a data set is ordered from lowest to highest. It divides the dataset into two equal halves. Therefore, it is less sensitive to outliers because it doesn’t take into account the magnitude of each value. We can demonstrate the influence of outliers with the code below:
# Adding an extreme value to Distribution 3 data:
d_with_outlier <- c(d, 500)
# Showing results before adding the outlier vs after adding the outlier:
c(mean(d), median(d), mean(d_with_outlier), median(d_with_outlier))
## [1] -0.006176743  0.020468613  0.618050693  0.023874958
As we can see, adding an outlier to Distribution 3 increased the mean from approximately zero to 0.6, while the median remained almost unchanged.
Now, we will get more insight into what standard deviations are. First, we create a random dataset (call it rdata) that is normally distributed with n = 2000, mean = 0, sd = 1. Then we draw a density plot and put a solid vertical line at the mean, and dashed vertical lines at the 1st, 2nd, and 3rd standard deviations to the left and right of the mean.
# Creating a random dataset
set.seed(180)
rdata <- rnorm(n = 2000, mean = 0, sd = 1)
# Plotting the density function of rdata
plot(density(rdata), col = "red", lwd = 2, main = "rdata density plot")
# With the following vertical lines
abline(v = mean(rdata), col = "blue" )
abline(v = (mean(rdata) - (3 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) - (2 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) - sd(rdata)), lty = "dashed")
abline(v = (mean(rdata) + sd(rdata)), lty = "dashed")
abline(v = (mean(rdata) + (2 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) + (3 * sd(rdata))), lty = "dashed")
Adding standard deviation lines allows us to see how much of the data falls within 1, 2, or 3 standard deviations of the mean. If a distribution is normal, it should contain about 68% of the data within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
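We can verify this empirically for rdata; the small check below is an addition to the original analysis and simply counts the share of observations falling within each band.
# Share of observations within 1, 2, and 3 standard deviations of the mean
sapply(1:3, function(k) mean(abs(rdata - mean(rdata)) <= k * sd(rdata)))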
Now, using the quantile() function, we will find out
which data points correspond to the 1st, 2nd, and 3rd quartiles (i.e.,
25th, 50th, 75th percentiles) of rdata.
quantile(rdata, 0.25)
## 25%
## -0.693956
quantile(rdata, 0.5)
## 50%
## -0.05575455
quantile(rdata, 0.75)
## 75%
## 0.6296026
Computing the quartiles tells us the data points below which 25%, 50%, and 75% of the data fall. The 2nd quartile is the median. Next, we compute how many standard deviations away from the mean the points corresponding to the 1st, 2nd, and 3rd quartiles are (by subtracting the mean and dividing by the standard deviation). This is an indicator of how spread out our data is in relation to the mean.
(quantile(rdata, 0.25) - mean(rdata)) / sd(rdata)
## 25%
## -0.6930447
(quantile(rdata, 0.5) - mean(rdata)) / sd(rdata)
## 50%
## -0.01890233
(quantile(rdata, 0.75) - mean(rdata)) / sd(rdata)
## 75%
## 0.7050513
The distances of the quartiles from the mean, expressed in standard deviations, are very close to the values of the quartiles themselves (indeed, we are subtracting a mean of circa 0 and dividing by a standard deviation of circa 1). The distance of the 1st quartile from the mean is 0.693, the distance of the 2nd quartile is 0.019, and the distance of the 3rd quartile is 0.705.
Next, we create a new random dataset that is normally distributed with: n = 2000, mean = 35, sd = 3.5. We take a peek at how many standard deviations away from the mean are those points corresponding to the 1st, 2nd and 3rd quartiles in this scenario.
# Creating a new random dataset
set.seed(190)
rdata2 <- rnorm(n = 2000, mean = 35, sd = 3.5)
# Computing the distance of 1st, 2nd and 3rd quartile from the mean in terms of sd
(quantile(rdata2, 0.25) - mean(rdata2)) / (sd(rdata2))
## 25%
## -0.6903467
(quantile(rdata2, 0.5) - mean(rdata2)) / (sd(rdata2))
## 50%
## -0.006950097
(quantile(rdata2, 0.75) - mean(rdata2)) / (sd(rdata2))
## 75%
## 0.6800506
The quartiles' distances from the mean in terms of standard deviations are quite similar to those computed from rdata, despite the two datasets having different means and standard deviations. This is a consequence of the properties of normal distributions:
* In normal distributions, the quartiles are always located at the same position in terms of standard deviations from the mean. This is due to the symmetric and consistent shape of the normal distribution curve.
* The 1st quartile (25th percentile) is always approximately -0.675 standard deviations from the mean, the 2nd quartile (50th percentile, or median) is 0 standard deviations from the mean (since in a normal distribution the median equals the mean), and the 3rd quartile (75th percentile) is approximately +0.675 standard deviations from the mean (see the check below).
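These theoretical positions can be confirmed with qnorm(), which returns the quantiles of the standard normal distribution:
# Theoretical quartile positions of a standard normal distribution
# (approximately -0.675, 0, and +0.675 standard deviations from the mean)
qnorm(c(0.25, 0.5, 0.75))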
Now, recall Distribution 1 shown at the beginning of this project. Does it look like a normal distribution? Again, we will verify how many standard deviations away from the mean the data points corresponding to the 1st, 2nd, and 3rd quartiles are.
(quantile(d123, 0.25) - mean(d123)) / (sd(d123))
## 25%
## -0.7463353
(quantile(d123, 0.5) - mean(d123)) / (sd(d123))
## 50%
## -0.2909566
(quantile(d123, 0.75) - mean(d123)) / (sd(d123))
## 75%
## 0.5974905
In this scenario, we see that the distances have changed, especially the distance of the 2nd quartile from the mean. That is because Distribution 1 was created by merging three random samples drawn from normal distributions, which does not mean that Distribution 1 is normally distributed itself. We can already notice this in its density plot, and the distances above confirm it.
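A normal Q-Q plot offers another quick visual check (a small addition to the original analysis): if d123 were normal, the points would fall along the reference line, whereas the skewed composite bends away from it.
# Normal Q-Q plot of Distribution 1; departures from the line indicate non-normality
qqnorm(d123, main = "Normal Q-Q plot of Distribution 1")
qqline(d123, lty = "dashed")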
When we choose to display histograms of data, we should be wary of
the influence of the bin size of histograms. Specifically, there are
different ways to calculate bin width h and number of bins
k. Note that, for any dataset d, we can calculate the
number of bins k from the bin width h:
k = ceiling((max(d) - min(d))/h); and the bin width from
the number of bins: h = (max(d) - min(d)) / k.
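As a small illustration of these two conversions (using the Distribution 3 data d from earlier purely as an example; the bin width of 0.5 is an arbitrary choice):
# Converting between bin width (h) and number of bins (k) for dataset d
h <- 0.5                              # an example bin width
k <- ceiling((max(d) - min(d)) / h)   # number of bins from the bin width
h_from_k <- (max(d) - min(d)) / k     # bin width back from the number of bins
c(k = k, h_from_k = h_from_k)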
However, if we weren't given k or h initially, we could use one of the several existing methods for calculating the ideal bin width and number of bins. Among the most popular are:
* Sturges' formula
* Scott's normal reference rule
* Freedman-Diaconis' choice
What rule is best? According to a thread on Cross Validated about calculating the optimal number of bins, it is suggested to use the Freedman-Diaconis' rule. This formula uses the IQR instead of the standard deviation, which, according to the Wikipedia page on histograms, makes it less sensitive to outliers in the data.
We shall explore what the three methods suggest. We create a random, normally distributed dataset (rand_data, with n observations) and compute h and k for each of the above methods. We will summarize the results at the end of the calculations.
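For reproducibility, here is a minimal sketch of how such a dataset could be set up; the seed, sample size, mean, and standard deviation below are illustrative assumptions, not necessarily the values behind the tables that follow.
# Assumed setup for rand_data (illustrative parameters only)
set.seed(170)
n <- 1000
rand_data <- rnorm(n = n, mean = 20, sd = 5)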
Sturges' formula is the following: k = ceiling(log2(n) + 1).
Hence, we can calculate the number of bins and then the bin width:
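A sketch of this step, mirroring the outlier computation carried out later in this document:
# Sturges' formula: number of bins first, then the bin width
sturges_k <- ceiling(log2(n) + 1) # k
sturges_h <- (max(rand_data) - min(rand_data))/sturges_k # h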
Scott's normal reference rule is the following: h = (3.49 * sd) / n^(1/3).
We move on to compute h using the above formula, then we derive k:
scott_h <- (3.49 * sd(rand_data))/(n ^ (1/3)) # h
scott_k <- ceiling((max(rand_data) - min(rand_data))/scott_h) # k
Finally, Freedman-Diaconis' choice: h = (2 * IQR) / n^(1/3).
Before computing the bin width, we need to calculate the interquartile range (IQR):
IQR <- quantile(rand_data, 0.75) - quantile(rand_data, 0.25)
freedman_h <- unname(2 * (IQR / (n ^ (1/3)))) # h
freedman_k <- unname(ceiling((max(rand_data) - min(rand_data))/freedman_h)) # k
Now that we computed h and k using each method, we can summarize the results:
# Loading the gt package for building the summary tables
library(gt)
hk_summary <- data.frame(
Method = c("Sturges' formula", "Scott's normal reference rule",
"Freedman-Diaconis'"),
h = c(sturges_h, scott_h, freedman_h),
k = c(sturges_k, scott_k, freedman_k))
# Summary
gt::gt(hk_summary) %>%
tab_header(title = "Table 1: Summary of results.",
subtitle = "Bin width (h) and number of bins (k).") %>%
tab_style(style = list(cell_text(weight = "bold")),
locations = cells_column_labels())

Table 1: Summary of results. Bin width (h) and number of bins (k).

| Method | h | k |
|---|---|---|
| Sturges' formula | 3.011333 | 11 |
| Scott's normal reference rule | 1.922323 | 18 |
| Freedman-Diaconis' | 1.396355 | 24 |
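As a cross-check (not part of the original analysis), base R's grDevices package ships implementations of the same three rules; their rounding conventions differ slightly, so the counts may not match the manual calculations exactly.
# Base R's built-in bin-count rules
c(sturges = nclass.Sturges(rand_data),
  scott = nclass.scott(rand_data),
  fd = nclass.FD(rand_data))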
It seems that the higher the number of bins, the smaller their width. The method which proposes the highest number of bins is Freedman-Diaconis' choice, while Sturges' formula proposes the fewest; in the middle, we find Scott's normal reference rule.
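To see how much the choice matters visually, we can draw the histogram of rand_data with each suggested number of bins (forcing exact break points, since hist() otherwise treats a single breaks value only as a suggestion):
# Histograms of rand_data with each method's suggested number of bins
par(mfrow = c(1, 3))
for (k in c(sturges_k, scott_k, freedman_k)) {
  hist(rand_data,
       breaks = seq(min(rand_data), max(rand_data), length.out = k + 1),
       main = paste("k =", k), xlab = "rand_data")
}
par(mfrow = c(1, 1))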
Now, let's verify which method is more or less sensitive to outliers in the data, focusing especially on changes in bin width. We are interested in this because the bin width of a histogram can significantly impact how the underlying distribution of the data is perceived, so analyzing changes in bin width helps us understand how robust the bin width calculation methods are against data anomalies such as outliers. Hence, we extend rand_data with some outliers and recompute all the bin widths and numbers of bins with the three methods:
# Adding the outlier data:
set.seed(100)
out_data <- c(rand_data, runif(10, min=40, max=60))
n2 <- n + 10
# Sturges' formula:
sturge_out_k <- ceiling(log2(n2) + 1) # k
sturge_out_h <- (max(out_data) - min(out_data))/sturge_out_k # h
# Scott's normal reference rule:
scott_out_h <- (3.49 * sd(out_data))/(n2 ^ (1/3)) # h
scott_out_k <- unname(ceiling((max(out_data) - min(out_data))/scott_out_h)) # k
# Freedman-Diaconis' choice:
IQR_out <- quantile(out_data, 0.75) - quantile(out_data, 0.25)
freedman_out_h <- unname(2 * (IQR_out / (n2 ^ (1/3)))) # h
freedman_out_k <- unname(ceiling((max(out_data) - min(out_data))/freedman_out_h)) # k
We can summarize the new results:
hk_out_summary <- data.frame(
Method = c("Sturges' formula", "Scott's normal reference rule", "Freedman-Diaconis'"),
h = c(sturge_out_h, scott_out_h, freedman_out_h),
k = c(sturge_out_k, scott_out_k, freedman_out_k))
# Summary
gt::gt(hk_out_summary) %>%
tab_header(title = "Table 2: Summary of results - Data with outliers.",
subtitle = "Bin width (h) and number of bins (k).") %>%
tab_style(style = list(cell_text(weight = "bold")),
locations = cells_column_labels())

Table 2: Summary of results - Data with outliers. Bin width (h) and number of bins (k).

| Method | h | k |
|---|---|---|
| Sturges' formula | 4.804724 | 11 |
| Scott's normal reference rule | 2.234558 | 24 |
| Freedman-Diaconis' | 1.410308 | 38 |
As we can see, the method which shows the least change in h is Freedman-Diaconis' choice. Indeed, this method uses the IQR, as opposed to Scott's normal reference rule, which uses the standard deviation. The quartiles (needed to calculate the IQR) depend only on the ranks of the observations, not on how extreme the largest or smallest values are, whereas the standard deviation takes the magnitude of every observation into account. Therefore, a small number of extreme observations will cause a greater change in the standard deviation than in the quartiles (and consequently in the IQR). Freedman-Diaconis' choice is also more reliable than Sturges' formula: after calculating k with Sturges' formula, we still have to derive h from the maximum and minimum data points in the dataset, and those extremes are exactly where the outliers are likely to sit.
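As a quick sanity check using the objects already created above, we can compare how much each spread measure moves once the outliers are appended; the standard deviation should shift considerably more than the IQR.
# Comparing spread measures before and after adding the outliers
c(sd_before = sd(rand_data), sd_after = sd(out_data))
c(IQR_before = unname(IQR), IQR_after = unname(IQR_out))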
In this comprehensive study, we delved into computational statistics with a particular emphasis on statistical distributions and histogram analysis. The project was structured in a few key sections. We began by creating and visualizing different types of distributions: a positively skewed composite distribution (Distribution 1), created by combining three normal distributions, a negatively skewed composite distribution (Distribution 2), and a normal distribution (Distribution 3). We saw that the measure of central tendency most affected by the addition of outliers to the data is the mean.

Then we simulated another dataset and computed the quartiles as well as their distances from the mean in terms of standard deviations. We found that as long as the distribution is normal, the distances of the quartiles from the mean will be similar across datasets, even if they have different means and standard deviations. However, for a composite distribution such as Distribution 1, this rule no longer holds.

Finally, we conducted an analysis of different methods for computing histogram bin widths: Sturges' formula, Scott's normal reference rule, and Freedman-Diaconis' choice. The sensitivity of these methods to outliers was examined, revealing how data anomalies can affect visual representation and data interpretation. The method most resilient to outliers is Freedman-Diaconis' choice.