Week 5 Discussion - Statistical Inference

Part 1:

The (Student's) t distribution converges to the normal distribution as the degrees of freedom increase (beyond 120). Please plot a normal distribution and a few t distributions on the same chart, with 2, 5, 15, 30, and 120 degrees of freedom.

library(ggplot2)

# Set seed for reproducibility
set.seed(224)

# Generating a sequence of values from -4 to 4 with 200 points in between
part1_data <- seq(-4, 4, length.out = 200)

# Creating a data frame for the standard normal distribution
normal_dist_data <- data.frame(x = part1_data,
                               y = dnorm(part1_data),
                               distribution = 'Normal')

# Creating a data frame for t distributions with different degrees of freedom
t_dist_2 <- data.frame(x = part1_data, y = dt(part1_data, df = 2), distribution = 't (df=2)')
t_dist_5 <- data.frame(x = part1_data, y = dt(part1_data, df = 5), distribution = 't (df=5)')
t_dist_15 <- data.frame(x = part1_data, y = dt(part1_data, df = 15), distribution = 't (df=15)')
t_dist_30 <- data.frame(x = part1_data, y = dt(part1_data, df = 30), distribution = 't (df=30)')
t_dist_120 <- data.frame(x = part1_data, y = dt(part1_data, df = 120), distribution = 't (df=120)')

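# Combine everything
all_t_data <- rbind(normal_dist_data, t_dist_2, t_dist_5, t_dist_15, t_dist_30, t_dist_120)

# Keep the legend in order of increasing degrees of freedom instead of alphabetical order
all_t_data$distribution <- factor(all_t_data$distribution, levels = unique(all_t_data$distribution))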

# Plot Graph
ggplot(all_t_data, 
       aes(x = x, 
           y = y, 
           color = distribution)) +
  geom_line() +
  theme_minimal() +
  labs(title = "Normal and t Distributions",
       x = "Value",
       y = "Density")
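
The convergence seen in the chart can also be checked numerically. The following is a minimal sketch, not part of the original assignment: it measures the largest absolute gap between each t density and the standard normal density on the same grid used for the plot. The names `dfs`, `x_grid`, and `max_gap` are introduced here for illustration only.

```r
# Degrees of freedom from the prompt
dfs <- c(2, 5, 15, 30, 120)

# Same grid as the plot: 200 points from -4 to 4
x_grid <- seq(-4, 4, length.out = 200)

# Largest absolute gap between each t density and the standard normal density
max_gap <- sapply(dfs, function(d) max(abs(dt(x_grid, df = d) - dnorm(x_grid))))
names(max_gap) <- paste0("df=", dfs)

print(round(max_gap, 4))
```

The gap should shrink as the degrees of freedom grow, which is the numeric counterpart of the t curves closing in on the normal curve in the chart.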

Part 2:

Let's work with the normal data below (1000 observations with a mean of 108 and an sd of 7.2).

set.seed(123) # Set seed for reproducibility

mu <- 108

sigma <- 7.2

data_values <- rnorm(n = 1000, mean = mu, sd = sigma )

Plot two charts: the normally distributed data (above) and the Z score distribution of the same data. Do they have the same distributional shape? Why or why not?

# Set seed for reproducibility
set.seed(224)

# Generate normal data for graph
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, 
                     mean = mu, 
                     sd = sigma) 

# Calculate Z-scores
z_scores <- (data_values - mu) / sigma

# Create a layout for side-by-side histograms
par(mfrow = c(1, 2))

# Plot the original normal data
hist(data_values, 
     main = "Original Normal Data",
     xlab = "Value", 
     ylab = "Frequency", 
     col = "lightgreen", 
     border = "black")


# Plot the Z-score distribution
hist(z_scores, 
     main = "Z-Score Distribution",
     xlab = "Z-Score", 
     ylab = "Frequency", 
     col = "lightblue", 
     border = "black")

The two histograms differ in scale but have essentially the same bell shape. A z-score is a linear transformation of the data (subtract the mean, then divide by the standard deviation), so it shifts and rescales the values without changing the shape of their distribution. Since the original data are normally distributed, the z-scores are too. The main difference is that the z-score distribution is centered at 0 with a standard deviation of 1, which are exactly the mean and standard deviation of the Standard Normal Distribution (SND).
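
As a rough check of this reasoning, the sketch below recomputes the z-scores with the same seed and parameters as the chunk above and looks at their sample mean and standard deviation. The names `z_mean`, `z_sd`, and `shape_cor` are introduced here for illustration only. Because `mu` and `sigma` are the population values rather than sample estimates, the results should be close to 0 and 1 but not exactly equal to them.

```r
set.seed(224)  # same seed as the answer chunk above

mu <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)
z_scores <- (data_values - mu) / sigma

# Sample mean and sd of the z-scores: near 0 and 1, but not exact,
# because mu and sigma are population values, not sample estimates
z_mean <- mean(z_scores)
z_sd <- sd(z_scores)

# A linear rescaling keeps every point's relative position,
# so the correlation between raw values and z-scores is 1
shape_cor <- cor(data_values, z_scores)

print(c(mean = z_mean, sd = z_sd, cor = shape_cor))
```

A correlation of 1 reflects that standardizing only shifts and rescales the data, which is why the two histograms share a shape.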

Part 3:

In your own words, please explain what is p-value?

In hypothesis testing, the p-value is the probability of obtaining results as extreme as, or more extreme than, the observed results, assuming the null hypothesis is true. A smaller p-value generally means stronger evidence against the null hypothesis, and therefore in favor of the alternative hypothesis.
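
To make the definition concrete, here is a minimal sketch of a two-sided p-value computed by hand for a one-sample t-test and compared with R's built-in `t.test()`. The sample, the seed, and the null value `mu_0` are hypothetical and chosen only for illustration; they are not part of the assignment data.

```r
set.seed(42)  # hypothetical seed, for reproducibility only

# Hypothetical sample: 25 draws from a normal distribution with true mean 110
sample_values <- rnorm(n = 25, mean = 110, sd = 7.2)

# Null hypothesis: the population mean is 108
mu_0 <- 108

# Test statistic: distance of the sample mean from mu_0 in standard-error units
n <- length(sample_values)
t_stat <- (mean(sample_values) - mu_0) / (sd(sample_values) / sqrt(n))

# Two-sided p-value: probability, under the null, of a t statistic
# at least as far from 0 as the one observed
p_manual <- 2 * pt(-abs(t_stat), df = n - 1)

# Built-in test for comparison
p_builtin <- t.test(sample_values, mu = mu_0)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```

The two values should agree, since `t.test()` uses the same t distribution with `n - 1` degrees of freedom for a one-sample test.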

When conducting a hypothesis test, you state a null hypothesis (H₀) and an alternative hypothesis (H₁ or Hₐ). Generally, the null hypothesis states that there is no effect or no difference, while the alternative hypothesis states the opposite of the null hypothesis.

In terms of wording, there are two possible outcomes of a hypothesis test: either you reject the null or you do not reject the null, but you never say “I accept the null hypothesis.” This distinction matters for my explanation of p-values because comparing the p-value with the significance level is what allows you to either reject or not reject the null. The null is rejected if the p-value is less than the significance level, because the results are then considered statistically significant.

For example, with a significance level of 0.05 and a p-value of 0.03, you would reject the null hypothesis, and the results would indicate that there is evidence for the alternative hypothesis. Similarly, with a significance level of 0.05 and a p-value of 0.05 or more, you would not reject the null hypothesis.
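
The decision rule in the example can be sketched in a few lines of R. The p-values 0.03 and 0.05 come from the example above; the value 0.20 and the names `alpha`, `p_values`, and `decision` are added here for illustration only.

```r
# Significance level from the example
alpha <- 0.05

# Two p-values from the example plus one extra illustrative value
p_values <- c(0.03, 0.05, 0.20)

# Reject the null only when the p-value is strictly below alpha
decision <- ifelse(p_values < alpha, "reject H0", "do not reject H0")

print(data.frame(p_value = p_values, decision = decision))
```

Note that a p-value exactly equal to the significance level falls on the "do not reject" side under this strict-inequality convention.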

Anddrew Gregory

2024-14-2
