Author: Daniele Melotti
Date: Nov 2023
See the related GitHub repository for code, data and list of tasks.
Consider the composite distribution below - Distribution 1. It has been created by combining three normal distributions. When we visualize it, there are two important things to observe: first, the distribution is positively skewed (tail stretches to the right); second, the mean and the median are different.
# Setting seed
set.seed(123)
# Creating three normally distributed data sets
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
# Combining them into a composite dataset
d123 <- c(d1, d2, d3)
# Plotting the density function of d123
plot(density(d123), col="cornflowerblue", lwd=2,
main = "Distribution 1")
# Adding vertical lines showing mean and median
abline(v=mean(d123))
abline(v=median(d123), lty="dashed")
Similarly, we create and visualize a new Distribution 2: a combined dataset (n = 800) that is negatively skewed (tail stretches to the left). This time we draw three new normal samples (d4, d5, and d6) with different means and standard deviations to obtain the new shape. Additionally, we compute the mean and median, and draw lines showing the mean (solid line) and median (dashed line).
# Setting seed
set.seed(456)
# Three normally distributed data sets
d4 <- rnorm(n = 500, mean = 55, sd = 5)
d5 <- rnorm(n = 200, mean = 45, sd = 10)
d6 <- rnorm(n = 100, mean = 30, sd = 15)
# Merging the three data sets together
d456 <- c(d4, d5, d6)
# Computing the mean and median
mean(d456)
## [1] 49.70087
median(d456)
## [1] 52.88905
# Plotting the density function of d456
plot(density(d456), col = "darkorange", lwd = 2, main = "Distribution 2")
abline(v = mean(d456))
abline(v = median(d456), lty = "dashed")
As we can see, the mean is 49.7009, while the median of Distribution 2 is 52.8891.
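To back up the visual impression of skewness, we can also compute a simple moment-based sample skewness for both composite datasets. The helper below is not part of the original analysis, just a quick check: a positive value indicates a right tail, a negative value a left tail.
# Simple moment-based sample skewness (illustrative helper, not in the original analysis)
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)
# Positive for Distribution 1, negative for Distribution 2
c(skew_d123 = skewness(d123), skew_d456 = skewness(d456))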
Next, we proceed to create Distribution 3: a single dataset
that is normally distributed (bell-shaped, symmetric). We can use the R
function rnorm() to create a single large dataset (n =
800). Similarly to Distribution 2, we compute the mean and median, and
draw lines showing the mean (solid line) and median (dashed line).
# Setting seed
set.seed(789)
# Creating a normally distributed dataset d
d <- rnorm(n = 800)
# Computing the mean and median
mean(d)
## [1] -0.006176743
median(d)
## [1] 0.02046861
# Plotting the density function of d
plot(density(d), col = "coral2", lwd = 2, main = "Distribution 3")
abline(v = mean(d))
abline(v = median(d), lty = "dashed")
The mean and median of Distribution 3 are both close to 0 (-0.0062
and 0.020 respectively). That is because Distribution 3 is derived from
the normal distribution, and in this type of distribution, mean and
median are the same (mode as well). In fact, normal distributions are
symmetrical, which means that the distribution of values on the left and
on the right sides of the mean is the same. Consequently, the median
aligns with the mean. Moreover, we have not modified the mean
parameter of rnorm(), resulting in the mean and median
being both close to the default 0.
In the case of the composite Distributions 1 and 2, why do the means and medians not align? When we merge distributions with different means and standard deviations, we create a composite distribution that is no longer normally distributed. Indeed, in our case we created two asymmetric distributions, one positively and one negatively skewed. In skewed distributions, the mean is typically pulled towards the tail, while the median stays closer to the bulk of the data.
If we were to add outliers to our data, the mean would be affected the most. That is because the calculation of the mean is made by summing all the values in a dataset and then dividing by the number of values. Outliers are values that are significantly lower or significantly higher than the rest of the data, hence, they can influence the mean value. On the other hand, the median is the middle value when a data set is ordered from lowest to highest. It divides the dataset into two equal halves. Therefore, it is less sensitive to outliers because it doesn’t take into account the magnitude of each value. We can demonstrate the influence of outliers with the code below:
# Adding an extreme value to Distribution 3 data:
d_with_outlier <- c(d, 500)
# Showing results before adding the outlier vs after adding the outlier:
c(mean(d), median(d), mean(d_with_outlier), median(d_with_outlier))
## [1] -0.006176743  0.020468613  0.618050693  0.023874958
As we can see, adding an outlier to Distribution 3 increased the mean from approximately zero to 0.6, while the median remained almost unchanged.
Now, we will get more insight into what standard deviations are. First, we create a random dataset (call it rdata) that is normally distributed with n = 2000, mean = 0, sd = 1. Then we draw a density plot and put a solid vertical line at the mean, and dashed vertical lines at the 1st, 2nd, and 3rd standard deviations to the left and right of the mean.
# Creating a random dataset
set.seed(180)
rdata <- rnorm(n = 2000, mean = 0, sd = 1)
# Plotting the density function of rdata
plot(density(rdata), col = "red", lwd = 2, main = "rdata density plot")
# With the following vertical lines
abline(v = mean(rdata), col = "blue" )
abline(v = (mean(rdata) - (3 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) - (2 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) - sd(rdata)), lty = "dashed")
abline(v = (mean(rdata) + sd(rdata)), lty = "dashed")
abline(v = (mean(rdata) + (2 * sd(rdata))), lty = "dashed")
abline(v = (mean(rdata) + (3 * sd(rdata))), lty = "dashed")
Adding standard deviation lines allows us to see how much of the data falls within 1, 2, or 3 standard deviations of the mean. If a distribution is normal, it should contain about 68% of the data within 1 standard deviation, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
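We can verify this empirically for rdata; the small check below is an addition to the original analysis and simply counts the share of observations falling within each band.
# Share of observations within 1, 2, and 3 standard deviations of the mean
sapply(1:3, function(k) mean(abs(rdata - mean(rdata)) <= k * sd(rdata)))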
Now, using the quantile() function, we will find out
which data points correspond to the 1st, 2nd, and 3rd quartiles (i.e.,
25th, 50th, 75th percentiles) of rdata.
quantile(rdata, 0.25)
## 25%
## -0.693956
quantile(rdata, 0.5)
## 50%
## -0.05575455
quantile(rdata, 0.75)
## 75%
## 0.6296026
Computing the quartiles tells us the data points below which 25%, 50%, and 75% of the data fall. The 2nd quartile is the median. Next, we compute how many standard deviations away from the mean the points corresponding to the 1st, 2nd, and 3rd quartiles are (by subtracting the mean and dividing by the standard deviation). This is an indicator of how spread out our data is in relation to the mean.
(quantile(rdata, 0.25) - mean(rdata)) / sd(rdata)
## 25%
## -0.6930447
(quantile(rdata, 0.5) - mean(rdata)) / sd(rdata)
## 50%
## -0.01890233
(quantile(rdata, 0.75) - mean(rdata)) / sd(rdata)
## 75%
## 0.7050513
The distances of the quartiles from the mean, expressed in standard deviations, are very close to the values of the quartiles themselves (indeed, we are subtracting a mean of circa 0 and dividing by a standard deviation of circa 1). The distance of the 1st quartile from the mean is 0.693, the distance of the 2nd quartile is 0.019, and the distance of the 3rd quartile is 0.705.
Next, we create a new random dataset that is normally distributed with: n = 2000, mean = 35, sd = 3.5. We take a peek at how many standard deviations away from the mean are those points corresponding to the 1st, 2nd and 3rd quartiles in this scenario.
# Creating a new random dataset
set.seed(190)
rdata2 <- rnorm(n = 2000, mean = 35, sd = 3.5)
# Computing the distance of 1st, 2nd and 3rd quartile from the mean in terms of sd
(quantile(rdata2, 0.25) - mean(rdata2)) / (sd(rdata2))
## 25%
## -0.6903467
(quantile(rdata2, 0.5) - mean(rdata2)) / (sd(rdata2))
## 50%
## -0.006950097
(quantile(rdata2, 0.75) - mean(rdata2)) / (sd(rdata2))
## 75%
## 0.6800506
The quartiles' distances from the mean in terms of standard deviations are quite similar to those computed from rdata, despite the two datasets having different means and standard deviations. This is a consequence of the properties of normal distributions:
* In normal distributions, the quartiles are always located at the same position in terms of standard deviations from the mean. This is due to the symmetric and consistent shape of the normal distribution curve.
* The 1st quartile (25th percentile) is always approximately -0.675 standard deviations from the mean, the 2nd quartile (50th percentile, or median) is 0 standard deviations from the mean (since in a normal distribution the median equals the mean), and the 3rd quartile (75th percentile) is approximately +0.675 standard deviations from the mean (see the check below).
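These theoretical positions can be confirmed with qnorm(), which returns the quantiles of the standard normal distribution:
# Theoretical quartile positions of a standard normal distribution
# (approximately -0.675, 0, and +0.675 standard deviations from the mean)
qnorm(c(0.25, 0.5, 0.75))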
Now, recall Distribution 1 shown at the beginning of this project. Does it look like a normal distribution? Again, we will verify how many standard deviations away from the mean the data points corresponding to the 1st, 2nd, and 3rd quartiles are.
(quantile(d123, 0.25) - mean(d123)) / (sd(d123))
## 25%
## -0.7463353
(quantile(d123, 0.5) - mean(d123)) / (sd(d123))
## 50%
## -0.2909566
(quantile(d123, 0.75) - mean(d123)) / (sd(d123))
## 75%
## 0.5974905
In this scenario, we see that the distances have changed, especially the distance of the 2nd quartile from the mean. That is because Distribution 1 was created by merging three random samples drawn from normal distributions, which does not mean that Distribution 1 is normally distributed itself. We can already notice this in its density plot, and the distances above confirm it.
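A normal Q-Q plot offers another quick visual check (a small addition to the original analysis): if d123 were normal, the points would fall along the reference line, whereas the skewed composite bends away from it.
# Normal Q-Q plot of Distribution 1; departures from the line indicate non-normality
qqnorm(d123, main = "Normal Q-Q plot of Distribution 1")
qqline(d123, lty = "dashed")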
When we choose to display histograms of data, we should be wary of
the influence of the bin size of histograms. Specifically, there are
different ways to calculate bin width h and number of bins
k. Note that, for any dataset d, we can calculate the
number of bins k from the bin width h:
k = ceiling((max(d) - min(d))/h); and the bin width from
the number of bins: h = (max(d) - min(d)) / k.
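As a small illustration of these two conversions (using the Distribution 3 data d from earlier purely as an example; the bin width of 0.5 is an arbitrary choice):
# Converting between bin width (h) and number of bins (k) for dataset d
h <- 0.5                              # an example bin width
k <- ceiling((max(d) - min(d)) / h)   # number of bins from the bin width
h_from_k <- (max(d) - min(d)) / k     # bin width back from the number of bins
c(k = k, h_from_k = h_from_k)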
However, if we weren't given k or h initially, we could use one of the several existing methods for calculating the ideal bin width and number of bins. Among the most popular are:
* Sturges' formula
* Scott's normal reference rule
* Freedman-Diaconis' choice
What rule is best? According to a thread on Cross Validated about calculating the optimal number of bins, it is suggested to use the Freedman-Diaconis' rule. This formula uses the IQR instead of the standard deviation, which, according to the Wikipedia page on histograms, makes it less sensitive to outliers in the data.
We shall explore what the three methods suggest. We create a random, normally distributed dataset (rand_data, with n observations) and compute h and k for each of the above methods. We will summarize the results at the end of the calculations.
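For reproducibility, here is a minimal sketch of how such a dataset could be set up; the seed, sample size, mean, and standard deviation below are illustrative assumptions, not necessarily the values behind the tables that follow.
# Assumed setup for rand_data (illustrative parameters only)
set.seed(170)
n <- 1000
rand_data <- rnorm(n = n, mean = 20, sd = 5)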
Sturges' formula is the following: k = ceiling(log2(n) + 1).
Hence, we can calculate the number of bins and then the bin width:
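A sketch of this step, mirroring the outlier computation carried out later in this document:
# Sturges' formula: number of bins first, then the bin width
sturges_k <- ceiling(log2(n) + 1) # k
sturges_h <- (max(rand_data) - min(rand_data))/sturges_k # h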
Scott's normal reference rule is the following: h = (3.49 * sd) / n^(1/3).
We move on to compute h using the above formula, then we derive k:
scott_h <- (3.49 * sd(rand_data))/(n ^ (1/3)) # h
scott_k <- ceiling((max(rand_data) - min(rand_data))/scott_h) # k
Finally, Freedman-Diaconis' choice: h = (2 * IQR) / n^(1/3).
Before computing the bin width, we need to calculate the interquartile range (IQR):
IQR <- quantile(rand_data, 0.75) - quantile(rand_data, 0.25)
freedman_h <- unname(2 * (IQR / (n ^ (1/3)))) # h
freedman_k <- unname(ceiling((max(rand_data) - min(rand_data))/freedman_h)) # k
Now that we computed h and k using each method, we can summarize the results:
# Loading the gt package for building the summary tables
library(gt)
hk_summary <- data.frame(
Method = c("Sturges' formula", "Scott's normal reference rule",
"Freedman-Diaconis'"),
h = c(sturges_h, scott_h, freedman_h),
k = c(sturges_k, scott_k, freedman_k))
# Summary
gt::gt(hk_summary) %>%
tab_header(title = "Table 1: Summary of results.",
subtitle = "Bin width (h) and number of bins (k).") %>%
tab_style(style = list(cell_text(weight = "bold")),
locations = cells_column_labels())

Table 1: Summary of results. Bin width (h) and number of bins (k).

| Method | h | k |
|---|---|---|
| Sturges' formula | 3.011333 | 11 |
| Scott's normal reference rule | 1.922323 | 18 |
| Freedman-Diaconis' | 1.396355 | 24 |
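As a cross-check (not part of the original analysis), base R's grDevices package ships implementations of the same three rules; their rounding conventions differ slightly, so the counts may not match the manual calculations exactly.
# Base R's built-in bin-count rules
c(sturges = nclass.Sturges(rand_data),
  scott = nclass.scott(rand_data),
  fd = nclass.FD(rand_data))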
It seems that the higher the number of bins, the smaller their width. The method which proposes the highest number of bins is Freedman-Diaconis' choice, while Sturges' formula proposes the fewest; in the middle, we find Scott's normal reference rule.
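To see how much the choice matters visually, we can draw the histogram of rand_data with each suggested number of bins (forcing exact break points, since hist() otherwise treats a single breaks value only as a suggestion):
# Histograms of rand_data with each method's suggested number of bins
par(mfrow = c(1, 3))
for (k in c(sturges_k, scott_k, freedman_k)) {
  hist(rand_data,
       breaks = seq(min(rand_data), max(rand_data), length.out = k + 1),
       main = paste("k =", k), xlab = "rand_data")
}
par(mfrow = c(1, 1))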
Now, let's verify which method is more or less sensitive to outliers in the data, focusing especially on changes in bin width. We are interested in this because the bin width of a histogram can significantly impact how the underlying distribution of the data is perceived, so analyzing changes in bin width helps us understand how robust the bin width calculation methods are against data anomalies such as outliers. Hence, we extend rand_data with some outliers and recompute all the bin widths and numbers of bins with the three methods:
# Adding the outlier data:
set.seed(100)
out_data <- c(rand_data, runif(10, min=40, max=60))
n2 <- n + 10
# Sturges' formula:
sturge_out_k <- ceiling(log2(n2) + 1) # k
sturge_out_h <- (max(out_data) - min(out_data))/sturge_out_k # h
# Scott's normal reference rule:
scott_out_h <- (3.49 * sd(out_data))/(n2 ^ (1/3)) # h
scott_out_k <- unname(ceiling((max(out_data) - min(out_data))/scott_out_h)) # k
# Freedman-Diaconis' choice:
IQR_out <- quantile(out_data, 0.75) - quantile(out_data, 0.25)
freedman_out_h <- unname(2 * (IQR_out / (n2 ^ (1/3)))) # h
freedman_out_k <- unname(ceiling((max(out_data) - min(out_data))/freedman_out_h)) # k
We can summarize the new results:
hk_out_summary <- data.frame(
Method = c("Sturges' formula", "Scott's normal reference rule", "Freedman-Diaconis'"),
h = c(sturge_out_h, scott_out_h, freedman_out_h),
k = c(sturge_out_k, scott_out_k, freedman_out_k))
# Summary
gt::gt(hk_out_summary) %>%
tab_header(title = "Table 2: Summary of results - Data with outliers.",
subtitle = "Bin width (h) and number of bins (k).") %>%
tab_style(style = list(cell_text(weight = "bold")),
locations = cells_column_labels())

Table 2: Summary of results - Data with outliers. Bin width (h) and number of bins (k).

| Method | h | k |
|---|---|---|
| Sturges' formula | 4.804724 | 11 |
| Scott's normal reference rule | 2.234558 | 24 |
| Freedman-Diaconis' | 1.410308 | 38 |
As we can see, the method which shows the least change in h is Freedman-Diaconis' choice. Indeed, this method uses the IQR, as opposed to Scott's normal reference rule, which uses the standard deviation. The quartiles (needed to calculate the IQR) depend only on the ranks of the observations, not on how extreme the largest or smallest values are, whereas the standard deviation takes the magnitude of every observation into account. Therefore, a small number of extreme observations will cause a greater change in the standard deviation than in the quartiles (and consequently in the IQR). Freedman-Diaconis' choice is also more reliable than Sturges' formula: after calculating k with Sturges' formula, we still have to derive h from the maximum and minimum data points in the dataset, and those extremes are exactly where the outliers are likely to sit.
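As a quick sanity check using the objects already created above, we can compare how much each spread measure moves once the outliers are appended; the standard deviation should shift considerably more than the IQR.
# Comparing spread measures before and after adding the outliers
c(sd_before = sd(rand_data), sd_after = sd(out_data))
c(IQR_before = unname(IQR), IQR_after = unname(IQR_out))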
In this comprehensive study, we delved into computational statistics with a particular emphasis on statistical distributions and histogram analysis. The project was structured in a few key sections. We began by creating and visualizing different types of distributions: a positively skewed composite distribution (Distribution 1), created by combining three normal distributions, a negatively skewed composite distribution (Distribution 2), and a normal distribution (Distribution 3). We saw that the measure of central tendency most affected by the addition of outliers to the data is the mean.

Then we simulated another dataset and computed the quartiles as well as their distances from the mean in terms of standard deviations. We found that as long as the distribution is normal, the distances of the quartiles from the mean will be similar across datasets, even if they have different means and standard deviations. However, for a composite distribution such as Distribution 1, this rule no longer holds.

Finally, we conducted an analysis of different methods for computing histogram bin widths: Sturges' formula, Scott's normal reference rule, and Freedman-Diaconis' choice. The sensitivity of these methods to outliers was examined, revealing how data anomalies can affect visual representation and data interpretation. The method most resilient to outliers is Freedman-Diaconis' choice.