DISCUSSION -
Part 2. The Student's t distribution converges to the normal distribution as the degrees of freedom increase (beyond 120 the two are nearly indistinguishable). Please plot a normal distribution and a few t distributions with 2, 5, 15, 30, and 120 degrees of freedom on the same chart.
# Draw samples from the standard normal and from a t distribution with 2 df
myNormal <- rnorm(500, 0, 1)
myT <- rt(500, 2)
# Kernel density of the normal sample as the reference curve
plot(density(myNormal), xlim = c(-5, 5), ylim = c(0, 0.5), col = "darkolivegreen4", lwd = 4)
# Overlay kernel densities of t samples; the heavy tails shrink toward the normal as df increases
lines(density(myT), col = "purple", lwd = 2)
lines(density(rt(500, 5)), col = "red", lwd = 2)
lines(density(rt(500, 15)), col = "blue", lwd = 2)
lines(density(rt(500, 30)), col = "green", lwd = 2)
lines(density(rt(500, 120)), col = "orange", lwd = 4)
legend(-4.5, 0.42, legend = c("normal", "df=2", "df=5", "df=15", "df=30", "df=120"),
       col = c("darkolivegreen4", "purple", "red", "blue", "green", "orange"), lty = 1, lwd = 2, cex = 0.8)
Part 2. Let's work with normally distributed data below (1000 observations with a mean of 108 and a standard deviation of 7.2).
set.seed(123) # Set seed for reproducibility
nVals <- 1000
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = nVals, mean = mu, sd = sigma )
Plot two charts: the normally distributed data (above) and the Z-score distribution of the same data.
par(mfrow = c(1, 2))  # two charts side by side
# Histogram of the raw data with a fitted normal curve overlaid
hist_data <- hist(data_values, breaks = 40, col = "aquamarine", main = "Normal Distribution", xlab = "Observations", ylab = "Frequency")
x_values <- seq(min(data_values), max(data_values), length = 100)
y_values <- dnorm(x_values, mean = mean(data_values), sd = sd(data_values))
# Rescale the density to the frequency scale: bin width times number of observations
y_values <- y_values * diff(hist_data$mids[1:2]) * length(data_values)
lines(x_values, y_values, lwd = 2, col = "red")
# Z-scores: subtract the mean and divide by the standard deviation
ZSDist <- (data_values - mean(data_values)) / sd(data_values)
hist_data <- hist(ZSDist, breaks = 40, col = "aquamarine", main = "Z-Score Distribution", xlab = "Z-scores", ylab = "Frequency")
x_values <- seq(min(ZSDist), max(ZSDist), length = 100)
y_values <- dnorm(x_values, mean = mean(ZSDist), sd = sd(ZSDist))
y_values <- y_values * diff(hist_data$mids[1:2]) * length(ZSDist)
lines(x_values, y_values, lwd = 2, col = "red")
Do they have the same distributional shape? Why or why not?
Yes, the two distributions have the same shape. Converting to z-scores is a linear transformation: each observation has the sample mean subtracted and is then divided by the sample standard deviation. This only shifts and rescales the horizontal axis, so the z-score distribution keeps the original normal shape but is centered at 0 with a standard deviation of 1.
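A quick check (a sketch reusing the data_values and ZSDist objects created above) confirms that the z-scores are just a shifted and rescaled copy of the original data:
round(mean(ZSDist), 10)  # approximately 0
round(sd(ZSDist), 10)    # 1
# Undoing the linear transformation recovers the original observations,
# which is why the histogram shape cannot change
all.equal(data_values, mean(data_values) + ZSDist * sd(data_values))  # TRUE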
Part 3. In your own words, please explain what a p-value is.
A p-value is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true. It measures how surprising the sample is relative to what the null hypothesis predicts, and it is compared against the type 1 error rate, \(\alpha\), to decide whether to retain or reject that hypothesis. If the p-value is less than \(\alpha\), the result is considered statistically significant and we reject the null hypothesis (\(H_{o}\)) in favor of the alternative hypothesis (\(H_{a}\)).
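As an illustration (a sketch reusing the simulated data_values from Part 2; the hypothesized means below are arbitrary choices, not part of the prompt), a one-sample t-test returns a p-value that can be compared against \(\alpha\):
alpha <- 0.05
# Null hypothesis mu = 108 is the true generating mean, so we would not expect a small p-value
t.test(data_values, mu = 108)$p.value
# Null hypothesis mu = 107 is far from the truth relative to the standard error,
# so the p-value should fall below alpha and H_o would be rejected
t.test(data_values, mu = 107)$p.value < alpha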