R Markdown

DISCUSSION -

Part 1. The (Student's) t distribution converges to the normal distribution as the degrees of freedom increase (beyond about 120 the two are nearly indistinguishable). Please plot a normal distribution and a few t distributions on the same chart with 2, 5, 15, 30, and 120 degrees of freedom.

set.seed(123)  # make the random draws reproducible
myNormal <- rnorm(500,0,1)
myT <- rt(500,2)
# Kernel density estimates of random samples; the normal curve is drawn first
plot(density(myNormal),xlim=c(-5,5),ylim=c(0,0.5), col="darkolivegreen4", lwd = 4,
     main = "Normal vs. t distributions", xlab = "x")
# Overlay t distributions with increasing degrees of freedom
lines(density(myT),col="purple", lwd = 2)
lines(density(rt(500,5)),col="red",lwd = 2)
lines(density(rt(500,15)),col="blue",lwd = 2)
lines(density(rt(500,30)),col="green",lwd = 2)
lines(density(rt(500,120)),col="orange",lwd = 4)
legend(-4.5,.42, legend=c("normal","df=2","df=5", "df=15","df=30","df=120"),
       col=c("darkolivegreen4","purple","red","blue","green","orange"), lty=1, lwd=2, cex=0.8)
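
The curves above come from random samples, so they wobble from run to run. As a rough cross-check of the convergence claim, the sketch below compares the exact densities from `dt()` and `dnorm()` on a fixed grid and reports the largest gap for each degrees-of-freedom value. The grid `x` and the name `max_gap` are my own choices for illustration, not part of the assignment.

```r
# Exact densities on a common grid (no random sampling involved)
x <- seq(-5, 5, length.out = 1001)
dfs <- c(2, 5, 15, 30, 120)

# Largest absolute gap between each t density and the standard normal density
max_gap <- sapply(dfs, function(df) max(abs(dt(x, df) - dnorm(x))))
names(max_gap) <- paste0("df=", dfs)

print(round(max_gap, 4))
```

The gap should shrink as the degrees of freedom grow, which is the numerical counterpart of the t curves closing in on the normal curve in the plot.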

Part 2. Let's work with the normal data below (1000 observations with a mean of 108 and an sd of 7.2).

set.seed(123)  # Set seed for reproducibility
nVals <- 1000
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = nVals, mean = mu, sd = sigma)

Plot two charts - the normally distributed data (above) and the Z score distribution of the same data.

par(mfrow=c(1,2))
hist_data <- hist(data_values,breaks = 40, col = "aquamarine", main = "Normal Distribution", xlab = "Observations", ylab = "Frequency")
x_values <- seq(min(data_values), max(data_values), length = 100)
y_values <- dnorm(x_values, mean = mean(data_values), sd = sd(data_values)) 
y_values <- y_values * diff(hist_data$mids[1:2]) * length(data_values) 
lines(x_values, y_values, lwd=2,col="red")

# Z-score: subtract the mean, then divide by the standard deviation
ZSDist <- (data_values - mu) / sigma
hist_data <- hist(ZSDist, breaks = 40, col = "aquamarine", main = "Z-Score Distribution", xlab = "Z-Scores", ylab = "Frequency")
x_values <- seq(min(ZSDist), max(ZSDist), length = 100)
y_values <- dnorm(x_values, mean = mean(ZSDist), sd = sd(ZSDist)) 
y_values <- y_values * diff(hist_data$mids[1:2]) * length(ZSDist) 
lines(x_values, y_values, lwd=2,col="red")

Do they have the same distributional shape? Why or why not?

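Yes, the two distributions have the same shape. Computing z-scores is a linear transformation: it subtracts the mean and divides by the standard deviation, which shifts and rescales the values without changing their positions relative to one another. The z-scores are centered at 0 with a standard deviation of 1, but their shape is inherited from the original data, which here is normal. Standardizing does not by itself make data normal, and the Central Limit Theorem (which describes the distribution of sample means) is not what makes the two shapes match.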
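
One way to check the shape question numerically is to compare skewness and kurtosis before and after standardizing. This is only a sketch: `shape_stats` is a hypothetical helper I wrote for illustration, and the exponential sample `e` is an extra, assumed example added to show what happens with data that are not normal.

```r
set.seed(123)
x <- rnorm(1000, mean = 108, sd = 7.2)  # same setup as the data above
z <- (x - mean(x)) / sd(x)              # z-scores from the sample mean and sd

# Hypothetical helper: skewness and kurtosis from standardized moments
shape_stats <- function(v) {
  s <- (v - mean(v)) / sd(v)
  c(skewness = mean(s^3), kurtosis = mean(s^4))
}

print(shape_stats(x))
print(shape_stats(z))   # matches the line above up to rounding error
print(cor(x, z))        # a linear transformation gives correlation 1

# Skewed data stay skewed after z-scoring
e <- rexp(1000)
e_z <- (e - mean(e)) / sd(e)
print(shape_stats(e_z))
```

Because the z-score only shifts and rescales, the shape statistics of `x` and `z` agree, and the standardized exponential sample keeps its right skew.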

Part 3. In your own words, please explain what is p-value?

A p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It measures how compatible the sample is with the null hypothesis and is compared against a chosen type 1 error rate (the significance level), represented by \(\alpha\), to decide whether to reject or fail to reject the null hypothesis. If the p-value is less than \(\alpha\), the result is said to be statistically significant, and we reject the null hypothesis (\(H_{0}\)) in favor of the alternative hypothesis (\(H_{a}\)).
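
To make the definition concrete, here is a minimal sketch of a two-sided one-sample t-test, computed by hand and then checked against `t.test()`. The sample, the seed, the null value `mu0 = 108`, and `alpha = 0.05` are all assumptions chosen for illustration, not values from the assignment.

```r
set.seed(42)                                 # hypothetical seed
sample_x <- rnorm(30, mean = 110, sd = 7.2)  # hypothetical sample
mu0 <- 108                                   # null hypothesis: the true mean is 108
alpha <- 0.05                                # assumed type 1 error rate

# Test statistic: how many standard errors the sample mean is from mu0
n <- length(sample_x)
t_stat <- (mean(sample_x) - mu0) / (sd(sample_x) / sqrt(n))

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# Built-in test for comparison
p_builtin <- t.test(sample_x, mu = mu0)$p.value

print(c(manual = p_manual, builtin = p_builtin))
print(if (p_manual < alpha) "reject H0" else "fail to reject H0")
```

The manual and built-in p-values should agree, since `t.test()` performs the same two-sided calculation by default.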