Part I:

The (Student's) t distribution converges to the normal distribution as the degrees of freedom increase (the two are nearly indistinguishable beyond about 120). Please plot a normal distribution and a few t distributions with 2, 5, 15, 30, and 120 degrees of freedom on the same chart.

#Set the seed
set.seed(123)

#Create a vector of 1000 evenly spaced x values from -5 to 5
x <- seq(-5, 5, length.out = 1000)

#Assign the normal distribution using dnorm
normal_dist <- dnorm(x,mean = 0, sd = 1)

#Set the various t distributions and their degrees of freedom
t_2 <- dt(x,df=2)
t_5 <- dt(x,df=5)
t_15 <- dt(x,df=15)
t_30 <- dt(x,df=30)
t_120 <- dt(x,df=120)

#Plot the normal distribution first, and highlight it with a thick black line
plot(x, normal_dist,
     type = "l",
     main = "Normal Distribution vs t Distributions",
     ylab = "Density", lwd = 3
     )

#Plot the various t distributions with different colors to distinguish each of them
lines(x,t_2,col='blue')
lines(x,t_5,col='red')
lines(x,t_15,col='darkorchid')
lines(x,t_30,col='darkgreen')
lines(x,t_120,col='orange')

#Add a legend to identify each of the t distributions by color
legend("topright",
       legend = c("Normal (bold)", "DF = 2", "DF = 5", "DF = 15", "DF = 30", "DF = 120"),
       col = c("black", "blue", "red", "darkorchid", "darkgreen", "orange"),
       lty = 1,
       lwd = c(3, 1, 1, 1, 1, 1),  #match the thicker normal curve
       bty = "o",
       title = "Graph Legend"
       )

Answer:

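Looking at the different t distributions, the higher the degrees of freedom, the closer the t distribution gets to the standard normal distribution. With few degrees of freedom (df = 2), the curve has a lower peak and heavier tails. By df = 120 the orange line sits essentially on top of the black line, so the t distribution is approximately (though not exactly) normal.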
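The visual impression can also be checked numerically. The sketch below is a minimal illustration, not part of the original assignment: it measures the largest vertical gap between each t density and the standard normal density on the same grid used for the plot. The names `dfs` and `max_gap` are introduced here for illustration only.

```r
# Evaluation grid (same range as the plot above)
x <- seq(-5, 5, length.out = 1000)

# Degrees of freedom used in the chart
dfs <- c(2, 5, 15, 30, 120)

# Largest absolute difference between each t density and the standard normal density
max_gap <- sapply(dfs, function(df) max(abs(dt(x, df = df) - dnorm(x))))
names(max_gap) <- paste0("df=", dfs)

# The gap should shrink as the degrees of freedom grow
print(round(max_gap, 4))
```

If the gaps decrease from df = 2 to df = 120, that supports the convergence seen in the chart.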

Part II:

Plot two charts: the normally distributed data (above) and the Z score distribution of the same data. Do they have the same distributional shape? Why or why not?

set.seed(123)  # Set seed for reproducibility

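#Population mean and standard deviation
mu <- 108
sigma <- 7.2

#Draw 1000 values from a normal distribution with that mean and standard deviation
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)

#Histogram of the raw data
hist(data_values)

#Standardize: subtract the mean and divide by the standard deviation
z_score <- (data_values - mu) / sigma

#Histogram of the z scores
hist(z_score)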

Answer:

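Yes, the two distributions have the same shape. The z-score standardizes each observation by subtracting the mean and dividing by the standard deviation, so it measures how many standard deviations the observation lies above or below the mean. This is a linear transformation: it shifts the center from 108 to 0 and rescales the spread from 7.2 to 1, but it does not change the relative position of any observation. As a result, the histogram of the “raw data” and the histogram of the z-scores are both bell-shaped, and only the values on the x-axis differ. (Small differences in the bars can appear because hist() chooses bin edges separately for each plot.) The same would hold for non-normal data, since standardizing never changes the shape of a distribution.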
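One way to check this claim without relying on the histograms is to compare shape statistics before and after standardizing. The sketch below is an illustration under stated assumptions: `skewness` is a small helper defined here (not a function from a package), and the exponential sample is a made-up example added only to show that the result does not depend on normality.

```r
set.seed(123)

# Same data as above
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)
z_score <- (data_values - mu) / sigma

# Hypothetical helper: a simple moment-based skewness (a shape measure)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# A linear rescaling keeps every observation in the same relative position,
# so the correlation with the raw data is 1 and the skewness is unchanged
print(cor(data_values, z_score))
print(all.equal(skewness(data_values), skewness(z_score)))

# Assumed extra example: right-skewed data stays right-skewed after standardizing
skewed <- rexp(1000, rate = 1)
skewed_z <- (skewed - mean(skewed)) / sd(skewed)
print(all.equal(skewness(skewed), skewness(skewed_z)))
```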

Part III:

In your own words, please explain what is p-value?

Answer:

Note that I could not access the linked article (it said “access denied”), so I used other online resources (linked below) to inform my explanation. A p-value is used to evaluate how strongly the data contradict the null hypothesis. The p-value is the probability of observing results at least as extreme as the ones actually observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis itself is true. A lower p-value indicates stronger evidence against the null hypothesis (higher statistical significance), while a higher p-value indicates weaker evidence (lower statistical significance). The p-value is typically compared to a chosen cutoff (the significance level), which is usually 0.05. If the p-value falls below the cutoff, we reject the null hypothesis. For example, a p-value of .0001 provides strong evidence against the null hypothesis and points in favor of the alternative hypothesis. In other words, if the null hypothesis were true, we would see a result at least as extreme as the one we observed only 0.01% of the time.
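To make the definition concrete, here is a minimal sketch of a two-sided z test. Everything in it is an assumption made for illustration: the null mean `mu0` and `sigma` are borrowed from Part II, the sample size `n` and the true mean of 111 are made up, and the population standard deviation is treated as known. The p-value is computed two ways: from the normal distribution, and by simulating many samples with the null hypothesis true and counting how often the result is at least as extreme.

```r
set.seed(123)

# Assumed setup: H0 says the population mean is 108, sigma is known
mu0 <- 108
sigma <- 7.2
n <- 50

# Hypothetical sample drawn with a true mean of 111, so H0 is false here
sample_values <- rnorm(n, mean = 111, sd = sigma)

# z statistic: how many standard errors the sample mean lies from mu0
z_stat <- (mean(sample_values) - mu0) / (sigma / sqrt(n))

# Two-sided p-value: probability, under H0, of a z statistic at least this far from 0
p_value <- 2 * pnorm(-abs(z_stat))

# Simulation check: generate z statistics with H0 true and count the extreme ones
z_null <- replicate(10000, (mean(rnorm(n, mean = mu0, sd = sigma)) - mu0) / (sigma / sqrt(n)))
p_sim <- mean(abs(z_null) >= abs(z_stat))

print(c(analytic = p_value, simulated = p_sim))
```

The simulated proportion is the definition applied directly, so it should land close to the analytic value, up to simulation noise.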

Sources:
https://www.scribbr.com/statistics/p-value/
https://www.simplypsychology.org/p-value.html
https://www.investopedia.com/terms/p/p-value.asp