The Student's t distribution converges to the normal distribution as the
degrees of freedom increase (beyond 120). Please plot a normal
distribution and a few t distributions on the same chart with 2, 5, 15,
30, and 120 degrees of freedom.
Observations: The red line (corresponding to df = 2) has
fatter “tails” than the normal, which means extreme values are
more likely. We can also see that as the degrees of freedom increase, the
curves get closer and closer to the normal, and once the degrees of
freedom pass 20 or so, the t distribution becomes practically
indistinguishable from the normal.
# Standard normal distribution parameters
mean <- 0
sigma <- 1
# Create sequence of x values ranging from -4 to 4
x <- seq(-4, 4, length.out = 1000)
# Calculate density
n_dist <- dnorm(x, mean, sigma)
# Calculate densities for t-distributions with different degrees of freedom
tdf_2 <- dt(x, df = 2)
tdf_5 <- dt(x, df = 5)
tdf_15 <- dt(x, df = 15)
tdf_30 <- dt(x, df = 30)
tdf_120 <- dt(x, df = 120)
# Create plot with the normal distribution
plot(x, n_dist,
     type = "l",
     lwd = 3,
     col = "black",
     main = "Convergence of t-distribution\nto normal distribution",
     xlab = "Standard Deviations",
     ylab = "Probability Density")
# Add the remaining degree of freedom plots
lines(x, tdf_2, col = "red", lwd = 2, lty = 2)
lines(x, tdf_5, col = "orange", lwd = 2, lty = 2)
lines(x, tdf_15, col = "yellow", lwd = 2, lty = 2)
lines(x, tdf_30, col = "green", lwd = 2, lty = 2)
lines(x, tdf_120, col = "blue", lwd = 2, lty = 3)
# Add a legend
legend("topright",
title = "t-Dist",
text.font = 3,
legend = c("Normal (Z)" , "df=2" , "df=5", "df=15", "df=30", "df=120"),
col = c("black", "red", "orange", "yellow", "green", "blue"),
lwd = 3,
cex = 0.8
)
grid()
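As a quick numeric check (my own addition, not part of the prompt), we can compare tail probabilities directly: the chance of landing more than 3 standard deviations below center is far larger under t with 2 df than under the normal, and the 97.5th percentile of t converges toward the normal's 1.96.
# Tail probability P(X < -3) under each distribution
pt(-3, df = 2)     # ~0.048 -- heavy tails
pt(-3, df = 30)    # ~0.0027
pnorm(-3)          # ~0.0013
# 97.5th percentiles shrink toward qnorm(0.975) = 1.96 as df grows
qt(0.975, df = c(2, 5, 15, 30, 120))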
Let's work with normally distributed data below (1000 observations with
a mean of 108 and an sd of 7.2).
set.seed(123) # Set seed for reproducibility
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)
Plot two charts - the normally distributed data (above) and the Z-score distribution of the same data. Do they have the same distributional shape? Why or why not?
set.seed(42) # Life, the Universe, and Everything
mu <- 108
sigma <- 7.2
# Store this in a data vector of 1000 random values
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)
# Now calculate Z-scores, estimating mu and sigma with the sample mean and sd:
# Z = (X - mean(X)) / sd(X)
z_scores <- (data_values - mean(data_values)) / sd(data_values)
# Arrange the two plots side by side (1 row, 2 columns)
par(mfrow = c(1, 2))
# Plot 1: Original data
hist(data_values,
     breaks = 30,
     probability = TRUE,
     main = "Original Normal Data",
     xlab = "Value",
     col = "pink")
# Overlay a density curve
curve(dnorm(x, mean = mean(data_values), sd = sd(data_values)),
      add = TRUE,
      col = "darkblue",
      lwd = 2)
# Draw a vertical line at the mean and label it
abline(v = mean(data_values), col = "red", lwd = 2, lty = 2)
text(mean(data_values), 0.05,
     paste0("Mean = ", round(mean(data_values), 2), "\nSD = ", round(sd(data_values), 2)),
     pos = 4, col = "red", cex = 0.7)
# Plot 2: Z-scores
hist(z_scores,
     breaks = 30,
     probability = TRUE,
     main = "Z-Score Distribution",
     xlab = "Z-Score",
     col = "lightgreen")
curve(dnorm(x, mean = 0, sd = 1),
      add = TRUE,
      col = "darkgreen",
      lwd = 2)
abline(v = mean(z_scores), col = "red", lwd = 2, lty = 2)
text(0, 0.3,
     paste0("Mean = ", round(mean(z_scores), 4), "\nSD = ", round(sd(z_scores), 2)),
     pos = 4, col = "red", cex = 0.7)
par(mfrow = c(1, 1)) # Reset plot layout for next time
Thoughts: They do have the same shape. The reason is that all the Z-score transformation does is shift and re-scale the original normal values. What it does change is the mean (from 108 to 0), the scale (from 7.2 to 1), and the units, which become standard deviations from the mean (the literal definition of the Z-score).
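As a small sanity check (my own, not required by the question), base R's scale() applies the same shift-and-rescale, so it should reproduce our z_scores exactly, with mean 0 and sd 1 by construction.
# scale() centers by the mean and divides by the sd, just like our formula
all.equal(as.numeric(scale(data_values)), z_scores)  # TRUE
round(mean(z_scores), 10)  # 0 (up to floating-point error)
sd(z_scores)               # 1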
In your own words, please explain what a p-value is.
Response: After watching a few videos, I came across a
goofy one that actually helped me wrap my brain around it: https://youtu.be/vemZtEM63GY?si=wJ5slg5K082s0HGy.
Strictly speaking, and straight out of our textbook, the p-value is defined as the probability of observing data at least as favorable to the alternative hypothesis as the current data set, assuming that the null hypothesis is true. In other words, a p-value is the probability of observing data as extreme as, or more extreme than, what was actually observed.
Using a classic example, suppose we take a coin, flip it 100 times,
and get heads 70 times. The question we can ask is "Is the coin
fair?", since 70 heads out of 100 tosses seems unlikely. To determine
this:
+ Step 1: Set up a hypothesis (the null hypothesis is that the coin is
fair, which is our default position).
+ Step 2: Observe the data.
+ Step 3: Calculate the p-value.
+ Step 4: Make a decision. The standard cutoff is p-value < 0.05 (though
it is important to note that the ASA pdf provided to us calls this
standard into question).
I believe this would be a two-tailed test, as we would be equally
surprised to see only 30 heads (i.e., 70 tails).
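To make Steps 3 and 4 concrete (my own sketch, not from the textbook), we can compute the exact two-tailed p-value for the coin example in R:
# Exact two-sided binomial test: 70 heads in 100 flips of a fair coin
binom.test(x = 70, n = 100, p = 0.5, alternative = "two.sided")
# Equivalently, double the upper-tail probability P(X >= 70)
2 * pbinom(69, size = 100, prob = 0.5, lower.tail = FALSE)  # ~8e-05
# Far below 0.05, so we would reject the null hypothesis that the coin is fair.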
Thank you. Dan Leone