This is distribution 1:

# Three normally distributed data sets
d1 <- rnorm(n=500, mean=15, sd=5)
d2 <- rnorm(n=200, mean=30, sd=5)
d3 <- rnorm(n=100, mean=45, sd=5)

# Combining them into a composite dataset
d123 <- c(d1, d2, d3)

# Let’s plot the density function of d123
plot(density(d123), col="blue", lwd=2,
     main = "Distribution 1")

# Add vertical lines showing mean and median
abline(v=mean(d123))
abline(v=median(d123), lty="dashed")

Q1-a) Create and visualize a new “Distribution 2”: a combined dataset (n=800) that is negatively skewed (tail stretches to the left).

d4 <- rnorm(n=500, mean=45, sd=6)
d5 <- rnorm(n=200, mean=30, sd=6)
d6 <- rnorm(n=100, mean=15, sd=8)

d456 <- c(d4,d5,d6)

plot(density(d456), col="blue", lwd=2,
     main = "Distribution 2")

# Vertical lines: thicker = mean, thinner = median
abline(v=mean(d456), lwd=4)
abline(v=median(d456), lwd=2)

Q1-b) Create a “Distribution 3”: a single dataset that is normally distributed (bell-shaped, symmetric)

ndis <- rnorm(800, mean = 20, sd = 5)
plot(density(ndis), col="blue", lwd=2,
     main = "Distribution 3")

# Mean (red, thick) and median (black, thin) nearly coincide for a symmetric distribution
abline(v=mean(ndis), col="red", lwd=4)
abline(v=median(ndis), lwd=2)

Q1-c) Which measure of central tendency (mean or median) do you think will be more sensitive (will change more) to outliers being added to your data?

When there are outliers in the data, the mean is more sensitive to them, whereas the median is much less affected.

This is because the mean is computed by summing all the data points and dividing by the number of points, so a few extreme values enter the sum directly and pull the mean away from the bulk of the data.

In contrast, the median is the middle value when the data are sorted, so it depends only on the ordering of the central observations, not on how extreme the largest or smallest values are. It therefore reflects the central tendency of the data more robustly than the mean when outliers are present.
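
A small made-up example (illustrative values, not part of the assignment data) shows the effect directly:

x <- c(10, 12, 13, 14, 15)
mean(x); median(x)          # mean = 12.8, median = 13

x_out <- c(x, 100)          # add a single extreme outlier
mean(x_out); median(x_out)  # mean jumps to about 27.3, median only moves to 13.5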

Q2-a) Create a random dataset and draw a density plot and put a solid vertical line on the mean, and dashed vertical lines at the 1st, 2nd, and 3rd standard deviations to the left and right of the mean.

rdata <- rnorm(n=2000,mean=0,sd=1)
m <- c(-3,-2,-1,1,2,3)
plot(density(rdata), col="blue", lwd=2)

# Solid line at the mean
abline(v=mean(rdata))

# Dashed lines at 1, 2, and 3 standard deviations on either side of the mean
for(i in m){
  abline(v=mean(rdata) + i*sd(rdata), col="darkgreen", lty="dashed")
}

Q2-b) Which data points correspond to the 1st, 2nd, and 3rd quartiles (i.e., 25th, 50th, 75th percentiles) of rdata?

first <- quantile(rdata,0.25);first
##        25% 
## -0.7316187
sec <- quantile(rdata,0.5);sec
##         50% 
## -0.08612573
thr <- quantile(rdata,0.75);thr
##       75% 
## 0.5958916

How many standard deviations away from the mean are those points corresponding to the 1st, 2nd, and 3rd quartiles?

(first - mean(rdata))/sd(rdata)
##        25% 
## -0.6823764
(sec - mean(rdata))/sd(rdata)
##         50% 
## -0.02597815
(thr - mean(rdata))/sd(rdata)
##       75% 
## 0.6675614

Q2-c) Create a new random dataset and answer the question “how many standard deviations away from the mean (use positive or negative) are those points corresponding to the 1st and 3rd quartiles?”

data2 <- rnorm(2000,mean=35,sd=3.5)
(quantile(data2,0.25)-mean(data2)) / sd(data2)
##        25% 
## -0.6939739
(quantile(data2,0.75)-mean(data2)) / sd(data2)
##       75% 
## 0.6808987

Compare your answer to (b)

Compared with my answer in (b), the z-scores differ by only about 0.01 for the 1st quartile and about 0.013 for the 3rd quartile. The values for data2 are very similar to those for rdata, as expected: both are normally distributed samples, so changing the mean and standard deviation does not change how many standard deviations the quartiles sit from the mean.
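
As a sanity check, the quartiles of any normal distribution sit at fixed z-scores, which R can compute directly:

# Theoretical z-scores of the 25th and 75th percentiles of a normal distribution
qnorm(0.25)   # approximately -0.674
qnorm(0.75)   # approximately  0.674

Both rdata and data2 give sample values close to ±0.674, which is why the two answers agree so closely.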

Q2-d) recall the dataset d123 and answer the question “how many standard deviations away from the mean (use positive or negative) are those data points corresponding to the 1st and 3rd quartiles?”

(quantile(d123,0.25)-mean(d123)) / sd(d123)
##        25% 
## -0.7578198
(quantile(d123,0.75)-mean(d123)) / sd(d123)
##       75% 
## 0.6345025

Compared with my answer in (b), the 1st quartile of d123 lies about 0.075 standard deviations farther below the mean (-0.758 vs -0.682), and the 3rd quartile lies about 0.033 standard deviations closer to the mean (0.635 vs 0.668). Because d123 is a right-skewed composite rather than a single normal sample, its quartiles are noticeably asymmetric about the mean.
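
A quick check (a minimal sketch using the variables already defined) confirms the skew, since a right-skewed distribution has its mean pulled above its median:

# Should be TRUE for d123, whose long tail stretches toward the larger values
mean(d123) > median(d123)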

Q3-a) From the question on the forum, Rob Hyndman’s answer (the 1st answer) suggests using the Freedman-Diaconis rule.

According to Wikipedia, the Freedman-Diaconis rule is based on the interquartile range, denoted IQR. It replaces the 3.5σ of Scott’s rule with 2 IQR; the interquartile range is less sensitive than the standard deviation to outliers in the data.
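
Base R also implements this rule; a minimal sketch on a made-up normal sample (the vector x below is illustrative only):

x <- rnorm(800, mean=20, sd=5)

# Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)
2 * IQR(x) / length(x)^(1/3)

# Built-in helpers that apply the rule
nclass.FD(x)                 # suggested number of bins
hist(x, breaks = "FD", main = "Freedman-Diaconis binning")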

Q3-b) Compute the bin width (h) and number of bins (k) according to each of the following formulas.

Sturges’ formula

rand_data <- rnorm(800, mean=20, sd = 5)
St_k <- log2(800) + 1;St_k
## [1] 10.64386
St_h <- (max(rand_data) - min(rand_data)) / St_k;St_h
## [1] 2.829943

Scott’s normal reference rule (uses standard deviation)

Sc_h <- 3.49 * sd(rand_data) / 800^(1/3);Sc_h
## [1] 1.835896
Sc_k <- ceiling((max(rand_data) - min(rand_data))/Sc_h);Sc_k
## [1] 17

Freedman-Diaconis’ choice (uses IQR)

Fr_h <- 2 * IQR(rand_data) / 800^(1/3);Fr_h
## [1] 1.348928
Fr_k <- ceiling((max(rand_data) - min(rand_data))/Fr_h);Fr_k
## [1] 23
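
For comparison, base R provides nclass.Sturges(), nclass.scott(), and nclass.FD(), which apply the same three rules (with slightly different rounding conventions), so the suggested bin counts should be close to the ones computed above:

nclass.Sturges(rand_data)
nclass.scott(rand_data)
nclass.FD(rand_data)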

Q3-c) We extend the rand_data dataset with some outliers. Compute the bin width (h) and number of bins (k) according to each of the following formulas.

Sturges’ formula

out_data <- c(rand_data, runif(10, min=40, max=60))

# Note: n is kept at 800 in the formulas below, although out_data now has 810 points
St_k1 <- log2(800) + 1;St_k1
## [1] 10.64386
St_h1 <- (max(out_data) - min(out_data)) / St_k1;St_h1
## [1] 4.814394

Scott’s normal reference rule (uses standard deviation)

Sc_h1 <- 3.49 * sd(out_data) / 800^(1/3);Sc_h1
## [1] 2.190623
Sc_k1 <- ceiling((max(out_data) - min(out_data))/Sc_h1);Sc_k1
## [1] 24

Freedman-Diaconis’ choice (uses IQR)

Fr_h1 <- 2 * IQR(out_data) / 800^(1/3);Fr_h1
## [1] 1.383518
Fr_k1 <- ceiling((max(out_data) - min(out_data))/Fr_h1);Fr_k1
## [1] 38

When the outliers were added, the bin width (h) changed the least under the Freedman-Diaconis rule.

In my opinion, this is because the interquartile range measures where the “middle fifty” of a data set lies. Since the Freedman-Diaconis formula uses the interquartile range rather than the standard deviation, it is less sensitive to outliers in the data.
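
A quick way to see why: compare how much the IQR and the standard deviation themselves change when the outliers are added.

# The IQR barely moves, while the standard deviation increases noticeably
IQR(rand_data); IQR(out_data)
sd(rand_data); sd(out_data)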