R Notebook

Question 1)

a) Create and visualize a new “Distribution 2”: a combined dataset (n=800) that is negatively skewed (tail stretches to the left).

> #construct 3 normally distributed datasets
> d1<-rnorm(n = 500,mean = 45,sd = 5)
> d2<-rnorm(n = 200,mean = 30,sd = 5)
> d3<-rnorm(n = 100,mean = 15,sd = 5)
> #combine them into a single dataset
> d123<-c(d1,d2,d3)
> #plot the density, with the mean as a thick line and the median as a thin line
> plot(density(d123),main = "Distribution 2",col="blue",lwd=2)
> abline(v=mean(d123),lwd=3)
> abline(v=median(d123))
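
As a quick numeric check on the skew direction, the sketch below rebuilds the same mixture with a fixed seed and compares the mean with the median. The seed value and the helper `skewness_moment` are illustrative assumptions, not part of the assignment. For a left-skewed sample the mean is usually pulled below the median, and the third standardized moment is negative.

```r
set.seed(42)  # assumed seed, used only to make the sketch reproducible

# Same mixture as above: most of the mass near 45, a tail toward 15
d123 <- c(rnorm(500, mean = 45, sd = 5),
          rnorm(200, mean = 30, sd = 5),
          rnorm(100, mean = 15, sd = 5))

# Hypothetical helper: third standardized moment as a simple skewness measure
skewness_moment <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Both comparisons should hold for a negatively skewed sample
mean(d123) < median(d123)
skewness_moment(d123) < 0
```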

b) Create a “Distribution 3”: a single dataset that is normally distributed (bell-shaped, symmetric) – you do not need to combine datasets, just use the rnorm function to create a single large dataset (n=800). Show your code, compute the mean and median, and draw lines showing the mean (thick line) and median (thin line).

> #construct a normal distribution
> d<-rnorm(n = 800,mean = 10,sd = 3)
> #compute the mean
> mean(d)

[1] 9.858738

> #compute the median
> median(d)

[1] 9.877601

> #plot
> plot(density(d),main="Distribution 3",col="blue")
> #the mean and median lines nearly overlap
> abline(v=mean(d),lwd=3,col="red")
> abline(v=median(d),lty="dashed")

The mean is 9.858738.
The median is 9.877601.

c) In general, which measure of central tendency (mean or median) do you think will be more sensitive (will change more) to outliers being added to your data?

I think the mean is more sensitive to outliers. The median is the middle point of the sorted data, so it depends only on the ordering of the values. The mean is computed by summing all the values, so every outlier contributes its full magnitude to the result.
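
A small worked example makes the difference concrete. The values below are made up for illustration: one extreme value is appended to a five-point sample and the shift in each measure is compared.

```r
# Made-up sample and the same sample with one extreme value added
x     <- c(10, 12, 13, 15, 18)
x_out <- c(x, 200)

# How far each measure of central tendency moves
shift_mean   <- mean(x_out) - mean(x)      # 268/6 - 68/5
shift_median <- median(x_out) - median(x)  # 14 - 13

c(mean = shift_mean, median = shift_median)
```

The mean moves by about 31 units while the median moves by 1, even though only one of the six values is extreme.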

Question 2)

a) Create a random dataset (call it ‘rdata’) that is normally distributed with: n=2000, mean=0, sd=1. Draw a density plot and put a solid vertical line on the mean, and dashed vertical lines at the 1st, 2nd, and 3rd standard deviations on both sides of the mean.

> #construct a normal distribution with mean=0, sd=1
> rdata<-rnorm(n = 2000,mean = 0,sd = 1)
> #create the integer sequence -3,-2,...,3
> grid<-seq(from=-3,to = 3)
> #positions of the mean and of the 1st, 2nd, 3rd standard deviations on each side (7 lines, left to right)
> lines<-mean(rdata)+grid*sd(rdata)
> #plot
> plot(density(rdata),main="rdata")
> #lty=2 draws a dashed line, lty=1 a solid line (the middle line is the mean)
> abline(v=lines,lty=c(2,2,2,1,2,2,2))

b) Using the quantile() function, which data points correspond to the 1st, 2nd, and 3rd quartiles (i.e., 25th, 50th, 75th percentiles)? How many standard deviations away from the mean (use positive or negative) are those points corresponding to the 1st, 2nd, and 3rd quartiles?

> #calculate quantile (25th,50th,75th percentiles)
> rdata.quantile<-quantile(x = rdata,probs = c(.25,.5,.75))
> #calculate how many sd away from the mean
> dist<-unname((rdata.quantile-mean(rdata))/sd(rdata))

The 1st quartile is -0.6878983 standard deviations from the mean.
The 2nd quartile is -0.0233312 standard deviations from the mean.
The 3rd quartile is 0.6696678 standard deviations from the mean.

c) Now create a new random dataset that is normally distributed with: n=2000, mean=35, sd=3.5. In this distribution, how many standard deviations away from the mean (use positive or negative) are those points corresponding to the 1st and 3rd quartiles? Compare your answer to (b)

> rdata.new<-rnorm(n = 2000,mean = 35,sd = 3.5)
> #calculate quantile (25th,50th,75th percentiles)
> rdata.new.quantile<-quantile(x = rdata.new,probs = c(.25,.5,.75))
> #calculate how many sd away from the mean (called dist)
> dist<-unname((rdata.new.quantile-mean(rdata.new))/sd(rdata.new))

The 1st quartile is -0.6714332 standard deviations from the mean.
The 2nd quartile is 0.001645 standard deviations from the mean.
The 3rd quartile is 0.6408483 standard deviations from the mean.
There is no significant difference between (b) and (c): the number of standard deviations from the mean is nearly the same for each quartile, because subtracting the mean and dividing by the standard deviation removes the effect of both parameters.
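
The agreement between (b) and (c) can also be checked against the theoretical quartiles of a normal distribution, using `qnorm()` from base R. This is a sketch of the theory rather than part of the assignment: in standard-deviation units the quartiles do not depend on the mean or sd.

```r
# Theoretical quartiles of the standard normal, in sd units
z <- qnorm(c(0.25, 0.50, 0.75))
round(z, 4)  # -0.6745  0.0000  0.6745

# Quartiles of N(mean = 35, sd = 3.5), converted back to sd units
q <- qnorm(c(0.25, 0.50, 0.75), mean = 35, sd = 3.5)
z_rescaled <- (q - 35) / 3.5

# The two sets of standardized quartiles coincide
all.equal(z, z_rescaled)
```

The sample values in (b) and (c) scatter around roughly -0.67, 0, and 0.67, which matches these theoretical values up to sampling noise.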

d) Finally, recall the dataset d123 shown in the description of question 1. In that distribution, how many standard deviations away from the mean (use positive or negative) are those data points corresponding to the 1st and 3rd quartiles? Compare your answer to (b)

> #Distribution 1
> #construct 3 normally distributed datasets
> d1<-rnorm(n = 500,mean = 15,sd = 5)
> d2<-rnorm(n = 200,mean = 30,sd = 5)
> d3<-rnorm(n = 100,mean = 45,sd = 5)
> #combine them into a single dataset
> d123<-c(d1,d2,d3)
> #calculate the 1st and 3rd quartiles
> d123.quantile<-quantile(x = d123,probs = c(.25,.75))
> #subtract the mean and divide by the sd to get the distance in standard deviations
> dist_d123<-unname((d123.quantile-mean(d123))/sd(d123))

The 1st quartile is -0.7525149 standard deviations from the mean.
The 3rd quartile is 0.6231439 standard deviations from the mean.
There is no large difference between (b) and (d), although the two quartiles are less symmetric about the mean here because d123 is skewed.

Question 3)

a) From the StackOverflow question, which formula does Rob Hyndman’s answer (1st answer) suggest to use for bin widths/number? Also, what does the Wikipedia article say is the benefit of that formula?

The answer suggests the Freedman-Diaconis rule, \(h=2\frac{IQR(x)}{n^{1/3}}\), for the bin width. The number of bins then follows by dividing the range of the data by h.

According to the Wikipedia article, it replaces the \(3.5\hat{\sigma}\) of Scott’s rule with \(2\,IQR(x)\), and the IQR is less sensitive than the standard deviation to outliers in the data.

(Scott’s rule is \(h=\frac{3.5\hat{\sigma}}{n^{1/3}}\).)
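
The two width rules translate into short R helpers. This is a minimal sketch: the names `scott_width` and `fd_width`, the sample `x`, and the seed are illustrative assumptions. One detail matters when coding the formulas: in R, `^` binds more tightly than `/`, so `n^1/3` evaluates to `n/3`, and the cube root has to be written `n^(1/3)`.

```r
set.seed(1)  # assumed seed, used only to make the sketch reproducible
x <- rnorm(800, mean = 20, sd = 5)
n <- length(x)

# Operator precedence: the first value is n/3, the second is the cube root
c(without_parens = n^1/3, with_parens = n^(1/3))

# Hypothetical helpers for the two bin-width rules
scott_width <- function(x) 3.5 * sd(x) / length(x)^(1/3)
fd_width    <- function(x) 2 * IQR(x) / length(x)^(1/3)

# Number of bins: Sturges directly, the others as range / width
k_sturges <- ceiling(log2(n)) + 1
k_scott   <- ceiling(diff(range(x)) / scott_width(x))
k_fd      <- ceiling(diff(range(x)) / fd_width(x))

# Built-in counterparts from grDevices, shown side by side as a sanity check
rbind(manual   = c(sturges = k_sturges, scott = k_scott, fd = k_fd),
      built_in = c(nclass.Sturges(x), nclass.scott(x), nclass.FD(x)))
```

The built-in functions may apply small internal adjustments, so agreement with the manual values is a sanity check rather than a guarantee.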

b) Given a random normal distribution:

>   rand_data <- rnorm(800, mean=20, sd = 5)

Compute the bin widths (h) and number of bins (k) according to each of the following formulas:

Sturges’ formula \[k=\left \lceil \log_2{n} \right \rceil+1\]

> #number of bins
> k<-ceiling(log2(length(rand_data)))+1
> #bin widths
> h<-(max(rand_data)-min(rand_data))/k
> paste("the number of bins is ",k)

[1] “the number of bins is 11”

> paste("the bin width is",h)

[1] “the bin width is 2.72289242639274”

Scott’s normal reference rule (uses standard deviation)

\[h=\frac{3.5\hat{\sigma}}{n^{1/3}}\]

> #bin widths (the exponent 1/3 needs parentheses, because ^ binds tighter than /)
> h<-3.5*sd(rand_data)/(length(rand_data)^(1/3))
> #number of bins
> k<-ceiling((max(rand_data) - min(rand_data))/h)
> paste("the number of bins is ",k)

[1] “the number of bins is 17”

> paste("the bin width is",round(h,4))

[1] “the bin width is 1.7731”

Freedman-Diaconis’ choice (uses IQR)

\[h=2\frac{IQR(x)}{n^{1/3}}\]

> #bin widths (the exponent 1/3 needs parentheses, because ^ binds tighter than /)
> h<-2*IQR(rand_data)/(length(rand_data)^(1/3))
> #number of bins
> k<-ceiling((max(rand_data) - min(rand_data))/h)
> paste("the number of bins is ",k)

[1] “the number of bins is 24”

> paste("the bin width is",round(h,4))

[1] “the bin width is 1.2783”

c) Repeat part (b) but extend the rand_data dataset with some outliers (use a new dataset out_data):

>   out_data <- c(rand_data, runif(10, min=40, max=60))

Sturges’ formula

> #number of bins
> k<-ceiling(log2(length(out_data)))+1
> #bin widths
> h<-(max(out_data)-min(out_data))/k
> paste("the number of bins is ",k)

[1] “the number of bins is 11”

> paste("the bin width is",h)

[1] “the bin width is 5.17609144060784”

Scott’s normal reference rule (uses standard deviation)

> #bin widths (the exponent 1/3 needs parentheses, because ^ binds tighter than /)
> h<-3.5*sd(out_data)/(length(out_data)^(1/3))
> #number of bins
> k<-ceiling((max(out_data) - min(out_data))/h)
> paste("the number of bins is ",k)

[1] “the number of bins is 26”

> paste("the bin width is",round(h,4))

[1] “the bin width is 2.2175”

Freedman-Diaconis’ choice (uses IQR)

> #bin widths (the exponent 1/3 needs parentheses, because ^ binds tighter than /)
> h<-2*IQR(out_data)/(length(out_data)^(1/3))
> #number of bins
> k<-ceiling((max(out_data) - min(out_data))/h)
> paste("the number of bins is ",k)

[1] “the number of bins is 44”

> paste("the bin width is",round(h,4))

[1] “the bin width is 1.2999”

d) From your answers above, in which of the three methods does the bin width (h) change the least when outliers are added (i.e., which is least sensitive to outliers), and (briefly) WHY do you think that is?

I think \(\textbf{Freedman–Diaconis' choice}\) is the least sensitive to outliers, because it uses the IQR instead of the \(\hat{\sigma}\) used in \(\textbf{Scott’s normal reference rule}\). The IQR is a more robust measure of spread: it depends only on the middle 50% of the data, so a few extreme values barely move it, whereas they inflate the standard deviation. \(\textbf{Sturges’ formula}\) fixes k from n alone, so its bin width (range divided by k) grows directly with the range, which the outliers stretch.
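
The comparison can be summarized as a relative change in h for each rule. The sketch below regenerates data of the same shape with a fixed seed, so its numbers will not match the run above. The helper `bin_widths` and the seed are illustrative assumptions.

```r
set.seed(1)  # assumed seed, used only to make the sketch reproducible
rand_data <- rnorm(800, mean = 20, sd = 5)
out_data  <- c(rand_data, runif(10, min = 40, max = 60))

# Hypothetical helper: bin width h under each of the three rules
bin_widths <- function(x) {
  n <- length(x)
  c(sturges = diff(range(x)) / (ceiling(log2(n)) + 1),
    scott   = 3.5 * sd(x) / n^(1/3),
    fd      = 2 * IQR(x) / n^(1/3))
}

h_before <- bin_widths(rand_data)
h_after  <- bin_widths(out_data)

# Relative change in h after the outliers are added
rel_change <- (h_after - h_before) / h_before
round(rel_change, 3)

# Name of the rule whose width changed the least
names(which.min(abs(rel_change)))
```

Reporting the change relative to the original width puts the three rules on the same scale, which is useful because their absolute widths differ.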