rm(list = ls())

Part 1

# Load necessary libraries
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Set up a sequence of values for the x-axis
x_values <- seq(-4, 4, 0.01)

# Create data frames for normal and t-distributions with different degrees of freedom
normal_data <- data.frame(x = x_values, y = dnorm(x_values))
t_dist_df2 <- data.frame(x = x_values, y = dt(x_values, df = 2))
t_dist_df5 <- data.frame(x = x_values, y = dt(x_values, df = 5))
t_dist_df15 <- data.frame(x = x_values, y = dt(x_values, df = 15))
t_dist_df30 <- data.frame(x = x_values, y = dt(x_values, df = 30))
t_dist_df120 <- data.frame(x = x_values, y = dt(x_values, df = 120))

# Combine data frames
combined_data <- bind_rows(
  data.frame(distribution = "Normal", normal_data),
  data.frame(distribution = "t (df=2)", t_dist_df2),
  data.frame(distribution = "t (df=5)", t_dist_df5),
  data.frame(distribution = "t (df=15)", t_dist_df15),
  data.frame(distribution = "t (df=30)", t_dist_df30),
  data.frame(distribution = "t (df=120)", t_dist_df120)
)

# Plot the distributions: the normal curve is solid and thick,
# the t-distributions are dashed and thinner
ggplot(mapping = aes(x = x,
                     y = y,
                     color = distribution)
) +
  # t-distributions
  geom_line(data = filter(combined_data, distribution != "Normal"),
            linewidth = 0.8,
            linetype = "dashed"
  ) +
  # Normal distribution, drawn last so it sits on top
  geom_line(data = filter(combined_data, distribution == "Normal"),
            linewidth = 1.5,
            linetype = "solid"
  ) +
  labs(title = "Normal and Student's t-Distributions",
       x = "Value",
       y = "Density") +
  theme_minimal()
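The plot suggests that the t-distribution moves toward the normal curve as the degrees of freedom grow. A minimal numeric sketch of that impression is given here, using the same degrees of freedom and x-grid as the plot; the names `df_values`, `t_crit`, `z_crit`, `max_gap`, and `convergence` are introduced only for this illustration.

```r
# Degrees of freedom used in the plot
df_values <- c(2, 5, 15, 30, 120)

# Two-sided 95% critical values: t-distribution vs. standard normal
t_crit <- qt(0.975, df = df_values)
z_crit <- qnorm(0.975)

# Largest absolute gap between each t density and the normal density
# on the plotted grid
x_grid <- seq(-4, 4, 0.01)
max_gap <- sapply(df_values, function(d) {
  max(abs(dt(x_grid, df = d) - dnorm(x_grid)))
})

# One row per t-distribution
convergence <- data.frame(df = df_values,
                          t_crit = t_crit,
                          z_crit = z_crit,
                          max_density_gap = max_gap)
print(convergence)
```

Both the critical values and the density gap should shrink toward the normal values as `df` increases, which matches what the curves show.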

Part 2

# Set seed for reproducibility
set.seed(123)

# Generate normally distributed data
mu <- 108
sigma <- 7.2
data_values <- rnorm(n = 1000, mean = mu, sd = sigma)

# Calculate Z-scores
z_scores <- (data_values - mean(data_values)) / sd(data_values)

# Plot the original normally distributed data
ggplot(data = data.frame(x = data_values), aes(x = x)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Original Normally Distributed Data",
       x = "Value",
       y = "Frequency") +
  theme_minimal()

# Plot the Z-score distribution
ggplot(data = data.frame(x = z_scores), aes(x = x)) +
  geom_histogram(binwidth = 0.1, fill = "lightcoral", color = "black", alpha = 0.7) +
  labs(title = "Z-Score Distribution",
       x = "Z-Score",
       y = "Frequency") +
  theme_minimal()

The data follow a normal distribution. The two histograms have a similar shape, but the second one has a mean of 0 and a standard deviation of 1 because of the Z-score transformation.

The Z-score transformation shifts and linearly rescales the original data, producing a distribution with a mean of 0 and a standard deviation of 1. The shape of the distribution stays the same because the transformation is linear: it subtracts one constant from every value and divides the result by another.

The formula for calculating the Z-score (standard score) for a data point x in a distribution with mean μ and standard deviation σ is:

\[ Z = \frac{x - \mu}{\sigma} \]

This formula essentially standardizes the data by expressing each data point in terms of how many standard deviations it is away from the mean. As a result, the Z-score transformation doesn’t change the relative relationships between data points; it only scales and shifts the distribution.
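These properties can be checked numerically. The sketch below regenerates the simulated data with the same seed and parameters as in Part 2 and, as in that code, uses the sample mean and sample standard deviation in place of μ and σ. The names `z_mean`, `z_sd`, and `z_cor` are introduced only for this illustration.

```r
# Recreate the simulated data from Part 2
set.seed(123)
data_values <- rnorm(n = 1000, mean = 108, sd = 7.2)

# Standardize with the sample mean and sample standard deviation
z_scores <- (data_values - mean(data_values)) / sd(data_values)

# Mean should be 0 and standard deviation 1, up to floating-point error
z_mean <- mean(z_scores)
z_sd <- sd(z_scores)

# A linear transformation with a positive scale factor keeps the
# relative positions of the points, so the correlation between the
# original values and their Z-scores should be 1
z_cor <- cor(data_values, z_scores)

print(c(mean = z_mean, sd = z_sd, correlation = z_cor))
```

A correlation of 1 between the original values and the Z-scores is one way to see that only the location and scale changed, not the shape.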

P-Value:

In statistics, the p-value helps us judge whether the results of a study could plausibly be explained by chance alone. It answers the question, “How likely is it to see results at least this extreme if there is no real effect?”

Consider flipping a coin. If the coin is fair, each flip has a 50% chance of landing heads and a 50% chance of landing tails. If one flips the coin ten times and it lands heads every time, one might wonder whether the coin is really fair or whether something else is going on.

Here, the p-value is the probability of getting ten heads in a row by chance, assuming the coin is fair, which is 0.5^10, or about 0.001. If the p-value is low (conventionally less than 0.05), the results would be unlikely under chance alone, and we may begin to suspect that there is a true effect or difference.

In short, the p-value tells us how surprising our findings would be if they were the consequence of chance or randomness alone. It is not the probability that the findings are true.
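The coin example can be worked through in R. This is a minimal sketch that assumes a one-sided question ("is the coin biased toward heads?"); the names `p_by_hand`, `p_one_sided`, and `p_two_sided` are introduced only for this illustration, and `binom.test()` is the exact binomial test from base R's stats package.

```r
# Probability of ten heads in ten flips of a fair coin, computed by hand
p_by_hand <- 0.5^10

# The same probability from the exact binomial test
# (one-sided: is the coin biased toward heads?)
p_one_sided <- binom.test(x = 10, n = 10, p = 0.5,
                          alternative = "greater")$p.value

# Two-sided version: ten tails would count as equally extreme
p_two_sided <- binom.test(x = 10, n = 10, p = 0.5)$p.value

print(c(by_hand = p_by_hand,
        one_sided = p_one_sided,
        two_sided = p_two_sided))
```

Both p-values fall well below the conventional 0.05 threshold, so ten heads in a row would be surprising for a fair coin.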