This is distribution 1:
# Three normally distributed data sets
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)
# Combining them into a composite dataset
d123 <- c(d1, d2, d3)
# Let’s plot the density function of d123
plot(density(d123), col="blue", lwd=2,
     main = "Distribution 1")
# Add vertical lines showing mean and median
abline(v=mean(d123))
abline(v=median(d123), lty="dashed")
# Three more normal samples combined into a second composite dataset,
# this time with the bulk of the data at the high end
d4 <- rnorm(n=500, mean=45, sd=6)
d5 <- rnorm(n=200, mean=30, sd=6)
d6 <- rnorm(n=100, mean=15, sd=8)
d456 <- c(d4, d5, d6)
plot(density(d456), col="blue", lwd=2,
     main = "Distribution 2")
# Mean (thick line) and median (thin line)
abline(v=mean(d456), lwd=4)
abline(v=median(d456), lwd=2)
# A single normal sample: symmetric, so mean and median nearly coincide
ndis <- rnorm(800, mean = 20, sd = 5)
plot(density(ndis), col="blue", lwd=2,
     main = "Distribution 3")
abline(v=mean(ndis), col="red", lwd=4)
abline(v=median(ndis), lwd=2)
When there are outliers in the data, the mean is sensitive to them, whereas the median is largely unaffected.
This is because the mean is computed by summing every data point and dividing by the number of points, so extreme values enter the sum directly and pull the mean away from the bulk of the data.
In contrast, the median is simply the middle value once the data are sorted; it depends only on the rank of the central observation(s), so it reflects the central tendency of the data more robustly than the mean.
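To make this concrete, here is a small made-up vector (not one of the data sets above) with a single extreme value appended to it:
# One extreme value shifts the mean substantially but the median only slightly
x <- c(10, 12, 13, 15, 18)
mean(x); median(x)          # mean = 13.6, median = 13
x_out <- c(x, 200)          # append a single outlier
mean(x_out); median(x_out)  # mean jumps to about 44.7, median moves only to 14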
# A standard normal sample and the multiples of sd to mark on the plot
rdata <- rnorm(n=2000, mean=0, sd=1)
m <- c(-3,-2,-1,1,2,3)
plot(density(rdata), col="blue", lwd=2)
abline(v=mean(rdata))
# Dashed lines at multiples of the sample standard deviation
for(i in m){
  abline(v=i*sd(rdata), col="darkgreen", lty="dashed")
}
first <- quantile(rdata,0.25);first
## 25%
## -0.7316187
sec <- quantile(rdata,0.5);sec
## 50%
## -0.08612573
thr <- quantile(rdata,0.75);thr
## 75%
## 0.5958916
# Express each quartile as a z-score: (quartile - mean) / sd
(first - mean(rdata))/sd(rdata)
## 25%
## -0.6823764
(sec - mean(rdata))/sd(rdata)
## 50%
## -0.02597815
(thr - mean(rdata))/sd(rdata)
## 75%
## 0.6675614
data2 <- rnorm(2000,mean=35,sd=3.5)
(quantile(data2,0.25)-mean(data2)) / sd(data2)
## 25%
## -0.6939739
(quantile(data2,0.75)-mean(data2)) / sd(data2)
## 75%
## 0.6808987
Compared to my answer in (b), the difference for the 1st quartile is 0.0409749 and for the 3rd quartile it is -0.0064419. The standardized quartiles of data2 are very close to those of rdata.
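For reference, both sets of values can be compared against the theoretical quartiles of a standard normal distribution, which any standardized normal sample should approximate:
# Theoretical 1st and 3rd quartiles of the standard normal distribution
qnorm(c(0.25, 0.75))
## [1] -0.6744898  0.6744898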
(quantile(d123,0.25)-mean(d123)) / sd(d123)
## 25%
## -0.7578198
(quantile(d123,0.75)-mean(d123)) / sd(d123)
## 75%
## 0.6345025
Compared to my answer in (b), the difference for the 1st quartile is -0.0482468 and for the 3rd quartile it is -0.0229364. The values for d123 are still fairly close to those of rdata, although, because d123 is a right-skewed mixture rather than a single normal sample, the deviations are somewhat larger than for data2.
According to Wikipedia, the Freedman-Diaconis rule is based on the interquartile range (IQR). It replaces the 3.5σ of Scott’s rule with 2·IQR, which is less sensitive than the standard deviation to outliers in the data. The formulas are written out below, followed by the calculations.
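Written out (with s the sample standard deviation, n the sample size, and range = max - min), the two bin-width rules and the implied number of bins are:
Scott: h = 3.5 * s / n^(1/3) (the code below uses the constant 3.49, a common variant)
Freedman-Diaconis: h = 2 * IQR / n^(1/3)
Number of bins: k = ceiling(range / h)
Sturges’ rule instead fixes the number of bins directly, k = log2(n) + 1 (usually rounded up), and the width follows as h = range / k.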
rand_data <- rnorm(800, mean=20, sd = 5)
# Sturges' rule: number of bins from the sample size alone
St_k <- log2(800) + 1;St_k
## [1] 10.64386
St_h <- (max(rand_data) - min(rand_data)) / St_k;St_h
## [1] 2.829943
# Scott's rule: bin width from the standard deviation
Sc_h <- 3.49 * sd(rand_data) / 800^(1/3);Sc_h
## [1] 1.835896
Sc_k <- ceiling((max(rand_data) - min(rand_data))/Sc_h);Sc_k
## [1] 17
# Freedman-Diaconis rule: bin width from the IQR
Fr_h <- 2 * IQR(rand_data) / 800^(1/3);Fr_h
## [1] 1.348928
Fr_k <- ceiling((max(rand_data) - min(rand_data))/Fr_h);Fr_k
## [1] 23
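As a quick sanity check (not part of the original question), base R’s grDevices package ships built-in versions of these rules, and hist() accepts them by name through its breaks argument (e.g. breaks = "FD"). The bin counts they return should be in the same ballpark as the manual calculations above:
# Built-in bin-count rules from grDevices (loaded by default)
nclass.Sturges(rand_data)   # Sturges
nclass.scott(rand_data)     # Scott
nclass.FD(rand_data)        # Freedman-Diaconis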
# Append 10 uniform values between 40 and 60 to act as outliers,
# then recompute the three rules (n is kept at 800, as in the code above)
out_data <- c(rand_data, runif(10, min=40, max=60))
St_k1 <- log2(800) + 1;St_k1
## [1] 10.64386
St_h1 <- (max(out_data) - min(out_data)) / St_k1;St_h1
## [1] 4.814394
Sc_h1 <- 3.49 * sd(out_data) / 800^(1/3);Sc_h1
## [1] 2.190623
Sc_k1 <- ceiling((max(out_data) - min(out_data))/Sc_h1);Sc_k1
## [1] 24
Fr_h1 <- 2 * IQR(out_data) / 800^(1/3);Fr_h1
## [1] 1.383518
Fr_k1 <- ceiling((max(out_data) - min(out_data))/Fr_h1);Fr_k1
## [1] 38
When the outliers were added, the bin width h changed the least under the Freedman-Diaconis rule.
In my opinion, this is because the interquartile range measures where the “middle fifty” of a data set lies: since the Freedman-Diaconis formula uses the interquartile range rather than the standard deviation, it is less sensitive to outliers in the data.
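To quantify that claim from the values computed above (a quick follow-up check, not part of the original question), the relative change in bin width for each rule is:
# Relative change in bin width h after adding the outliers
(St_h1 - St_h) / St_h   # Sturges-based width: large increase, driven by the wider range
(Sc_h1 - Sc_h) / Sc_h   # Scott: moderate increase, driven by the inflated sd
(Fr_h1 - Fr_h) / Fr_h   # Freedman-Diaconis: smallest increase, since the IQR barely moves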