1 Introduction and Objectives

This mini-project explores concepts of probability distributions through simulation and analysis of the normal distribution. Understanding probability distributions is essential for quantitative social research, as they form the foundation of statistical inference, hypothesis testing, and data modeling.

1.1 Background: Probability Distributions

A probability distribution describes how the values of a random variable are distributed, specifying what values are possible and how likely each value is to occur. In social science research, probability distributions help us:

  • Model and understand variation in social phenomena
  • Make inferences about populations from samples
  • Test hypotheses and quantify uncertainty
  • Assess the reliability of statistical estimates

2 The Normal Distribution

The normal distribution (also called the Gaussian distribution or bell curve) is one of the most widely used probability distributions in statistics. It is characterized by two parameters:

  • Mean (μ): The center of the distribution, representing the average value
  • Standard Deviation (σ): The spread of the distribution, indicating variability around the mean
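
For reference, these two parameters enter the standard textbook density of the normal distribution as follows. The formula is not derived in this report and is included only to show where μ and σ appear:

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
  \qquad -\infty < x < \infty
```

Changing μ shifts the curve left or right without altering its shape, while changing σ stretches or compresses it around the mean.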

2.1 Key Properties of the Normal Distribution

  • Symmetric around the mean
  • Bell-shaped curve
  • Mean = Median = Mode
  • About 68% of values fall within 1 standard deviation of the mean
  • About 95% of values fall within 2 standard deviations of the mean
  • About 99.7% of values fall within 3 standard deviations of the mean
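
The three percentages above can be checked directly from the normal cumulative distribution function. The snippet below is a minimal sketch using base R's pnorm(); the variable names k and within_k are illustrative and do not appear elsewhere in this report.

```r
# Probability that a normal variable lies within k standard deviations
# of its mean. The answer is the same for any mean and sd, so the
# standard normal (mean 0, sd 1) is used here.
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
names(within_k) <- paste0("within_", k, "_sd")

round(within_k, 4)
# within_1_sd within_2_sd within_3_sd
#      0.6827      0.9545      0.9973
```

The familiar 68%, 95% and 99.7% figures are therefore rounded values of these probabilities.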

3 Task 1: Simulate a Normal Distribution

# Set seed for reproducibility
set.seed(123)

# Generate random sample of 1000 observations
sample_1000 <- rnorm(n = 1000, mean = 100, sd = 15)

# Create histogram with theoretical density curve
library(ggplot2)

task1_plot <- ggplot(data.frame(x = sample_1000), aes(x = x)) +
  # after_stat() and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue",
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  stat_function(fun = dnorm, args = list(mean = 100, sd = 15),
                color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Normal Distribution Simulation (n=1000)",
       subtitle = "Histogram with Theoretical Density Curve",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12))

task1_plot

Interpretation: The histogram shows the distribution of our simulated data, drawn from a normal distribution with mean = 100 and standard deviation = 15. The red line is the kernel density estimate computed from the sample, while the blue dashed line shows the theoretical normal density. The close alignment between these curves indicates that our random sample closely follows the intended normal distribution.
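
The visual comparison can be complemented with a numeric check. The sketch below is an addition to the original analysis: it uses the Kolmogorov-Smirnov test (base R's ks.test()) as one possible way to compare the sample with N(100, 15), and it regenerates the Task 1 sample so that it runs on its own. The name ks_result is illustrative.

```r
# Regenerate the Task 1 sample so this snippet is self-contained.
set.seed(123)
sample_1000 <- rnorm(n = 1000, mean = 100, sd = 15)

# Kolmogorov-Smirnov test against the target distribution N(100, 15).
# A large p-value means there is no evidence that the sample departs
# from the target distribution.
ks_result <- ks.test(sample_1000, "pnorm", mean = 100, sd = 15)
ks_result$statistic  # largest gap between empirical and theoretical CDFs
ks_result$p.value

# Share of observations within 1, 2 and 3 SDs of the target mean,
# for comparison with roughly 68%, 95% and 99.7%.
sapply(1:3, function(k) mean(abs(sample_1000 - 100) <= k * 15))
```

A small test statistic together with empirical shares close to the theoretical percentages would support the reading of the plot given above.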

4 Task 2: Calculate Descriptive Statistics

# Calculate descriptive statistics
descriptive_stats <- data.frame(
  Statistic = c("Sample Mean", "Population Mean", "Sample SD", "Population SD", 
                "Sample Median", "Minimum", "Maximum"),
  Value = c(mean(sample_1000), 100, sd(sample_1000), 15, 
            median(sample_1000), min(sample_1000), max(sample_1000)),
  Type = c("Sample", "Theoretical", "Sample", "Theoretical", 
           "Sample", "Sample", "Sample")
)

# Display formatted table
library(knitr)
kable(descriptive_stats, digits = 3, caption = "Descriptive Statistics for Simulated Data")

Descriptive Statistics for Simulated Data

Statistic           Value  Type
---------------  --------  -----------
Sample Mean       100.242  Sample
Population Mean   100.000  Theoretical
Sample SD          14.875  Sample
Population SD      15.000  Theoretical
Sample Median     100.138  Sample
Minimum            57.853  Sample
Maximum           148.616  Sample

4.1 Analysis of Sample vs. Theoretical Parameters

The analysis reveals only small differences between the sample statistics and the theoretical parameters. Our sample mean of 100.242 closely approximates the theoretical mean of 100, which is consistent with the sample mean being an unbiased estimator of the population mean. Similarly, the sample standard deviation of 14.875 aligns well with the theoretical value of 15.

These minor discrepancies arise from sampling variability, which is the natural fluctuation that occurs when drawing random samples from a population. With a sample size of 1,000, we expect these estimates to be quite close to their theoretical counterparts, and indeed they are. The sample median of 100.138 is also consistent with the symmetric characteristic of the normal distribution, as it closely matches both the sample and population means.
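
To put the gap between 100.242 and 100 on a common scale, it can be compared with the standard error of the mean, σ/√n, using the known σ = 15 and n = 1000. The arithmetic below is a worked illustration and is not part of the original analysis:

```latex
\mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{1000}} \approx 0.474,
\qquad
\frac{\bar{x} - \mu}{\mathrm{SE}} = \frac{100.242 - 100}{0.474} \approx 0.51
```

A deviation of about half a standard error is well within what random sampling alone would produce.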

5 Task 3: Explore Sample Size Effects

# Generate samples of different sizes
# Note: resetting the seed to 123 means sample_30 repeats the first 30 draws of sample_1000
set.seed(123)
sample_30 <- rnorm(n = 30, mean = 100, sd = 15)
sample_100 <- rnorm(n = 100, mean = 100, sd = 15)
sample_10000 <- rnorm(n = 10000, mean = 100, sd = 15)

# Create combined data frame for plotting
df_30 <- data.frame(x = sample_30, size = "n=30")
df_100 <- data.frame(x = sample_100, size = "n=100")
df_1000 <- data.frame(x = sample_1000, size = "n=1000")
df_10000 <- data.frame(x = sample_10000, size = "n=10000")

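combined_data <- rbind(df_30, df_100, df_1000, df_10000)

# Order the panels by sample size; character labels would otherwise sort alphabetically
combined_data$size <- factor(combined_data$size,
                             levels = c("n=30", "n=100", "n=1000", "n=10000"))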

# Create 2x2 grid of histograms
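task3_plot <- ggplot(combined_data, aes(x = x)) +
  # Map fill on the histogram only, so the density curve is drawn as a line rather than a filled area
  geom_histogram(aes(y = after_stat(density), fill = size), bins = 25, alpha = 0.7, position = "identity") +
  geom_density(color = "red", linewidth = 0.8) +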
  facet_wrap(~size, ncol = 2) +
  labs(title = "Effect of Sample Size on Distribution Appearance",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        legend.position = "none")

task3_plot

5.1 Interpretation of Sample Size Effects

The visualization of different sample sizes illustrates how sample size affects our ability to recognize distribution patterns. With n=30, the histogram appears irregular and somewhat jagged, making it challenging to definitively identify the underlying normal distribution. The n=100 sample shows clearer bell-shaped characteristics, while the n=1,000 sample provides a much smoother representation that closely matches the theoretical normal curve. The n=10,000 sample offers the most precise depiction, with minimal sampling fluctuation.

This shows that larger samples provide more accurate representations of population distributions, reducing the impact of random sampling variation and allowing clearer identification of the underlying distributional shape.
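
One way to quantify this reduction in sampling variation is the standard error of the sample mean, σ/√n, which shrinks as n grows. The sketch below tabulates it for the four sample sizes used in this task, assuming the known population value σ = 15. The names sigma, n and se_table are illustrative.

```r
# Standard error of the sample mean for each sample size used in Task 3,
# assuming the known population standard deviation of 15.
sigma <- 15
n <- c(30, 100, 1000, 10000)

se_table <- data.frame(
  n = n,
  standard_error = sigma / sqrt(n)
)

se_table
# Standard errors are approximately 2.739, 1.500, 0.474 and 0.150.
```

Because the standard error falls with the square root of n, a hundredfold increase in sample size is needed to cut it by a factor of ten.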

6 Task 4: Demonstrate the Central Limit Theorem

# Demonstrate Central Limit Theorem
set.seed(456)
n_samples <- 1000
sample_size <- 30

# Draw 1000 samples of size n=30 and calculate means
sample_means <- replicate(n_samples, {
  sample_data <- rnorm(n = sample_size, mean = 100, sd = 15)
  mean(sample_data)
})

# Create histogram of sample means
task4_plot <- ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen",
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  geom_vline(xintercept = mean(sample_means), color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Central Limit Theorem Demonstration",
       # Build the subtitle from the simulation settings and split it over two lines
       subtitle = paste0("Distribution of ", n_samples, " Sample Means (n=", sample_size, ")\n",
                         "Mean of sampling distribution: ", round(mean(sample_means), 3),
                         " | SD of sampling distribution: ", round(sd(sample_means), 3)),
       x = "Sample Mean", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10))

task4_plot

# Calculate sampling distribution statistics
theoretical_se <- 15 / sqrt(sample_size)

sampling_stats <- data.frame(
  Statistic = c("Mean of Sampling Distribution", 
                "SD of Sampling Distribution", 
                "Theoretical Standard Error (σ/√n)"),
  Value = c(mean(sample_means), 
            sd(sample_means), 
            theoretical_se),
  Description = c("Observed mean of sample means",
                  "Observed standard deviation of sample means", 
                  "Expected standard error (15/√30)")
)

kable(sampling_stats, digits = 4, caption = "Central Limit Theorem Results")

Central Limit Theorem Results

Statistic                             Value  Description
---------------------------------  --------  -------------------------------------------
Mean of Sampling Distribution      100.1119  Observed mean of sample means
SD of Sampling Distribution          2.6988  Observed standard deviation of sample means
Theoretical Standard Error (σ/√n)    2.7386  Expected standard error (15/√30)

6.1 Central Limit Theorem

The Central Limit Theorem demonstration shows that the sampling distribution of means (mean = 100.112, SD = 2.699) is centered on the population mean and closely follows a normal distribution, even though each sample contains only 30 observations. Because the population here is itself normal, the sample means are normally distributed for any sample size; the theorem's broader claim is that the same pattern emerges approximately for non-normal populations once n is reasonably large.

The observed standard deviation of the sampling distribution (2.699) aligns well with the theoretical standard error (2.739), showing that the standard error σ/√n accurately predicts the variability of sample means. This result shows that sample means are normally distributed around the population mean, with variability determined by the population standard deviation and the sample size. This property forms the foundation for statistical inference and confidence intervals.
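
The simulation above draws from a normal population, for which sample means are normal at any sample size. To see the theorem at work on a skewed population, the sketch below repeats the exercise with an exponential distribution. The exponential population (rate = 1, so mean 1 and SD 1) and the name exp_means are assumptions made for illustration and are not part of the original tasks.

```r
# Repeat the Task 4 design with a strongly right-skewed population.
set.seed(456)
n_samples <- 1000
sample_size <- 30

# Exponential population with rate 1: population mean 1, population SD 1.
exp_means <- replicate(n_samples, mean(rexp(n = sample_size, rate = 1)))

# Compare the simulated sampling distribution with the theoretical values.
c(mean_of_means  = mean(exp_means),
  sd_of_means    = sd(exp_means),
  theoretical_se = 1 / sqrt(sample_size))
```

If the theorem holds, the mean of the sample means should sit near 1 and their standard deviation near 1/√30 ≈ 0.183, and a histogram of exp_means should look far more symmetric than the exponential population itself.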

7 Summary Comparison Across Sample Sizes

# Summary statistics for all samples
summary_comparison <- data.frame(
  Sample_Size = c(30, 100, 1000, 10000),
  Sample_Mean = c(mean(sample_30), mean(sample_100), mean(sample_1000), mean(sample_10000)),
  Sample_SD = c(sd(sample_30), sd(sample_100), sd(sample_1000), sd(sample_10000)),
  Theoretical_Mean = rep(100, 4),
  Theoretical_SD = rep(15, 4)
)

kable(summary_comparison, digits = 3, caption = "Summary Comparison Across Sample Sizes")

Summary Comparison Across Sample Sizes

Sample_Size  Sample_Mean  Sample_SD  Theoretical_Mean  Theoretical_SD
-----------  -----------  ---------  ----------------  --------------
         30       99.293     14.715               100              15
        100      100.324     12.902               100              15
       1000      100.242     14.875               100              15
      10000       99.964     14.997               100              15

8 Conclusions

8.1 Main Findings

  1. Sample Statistics as Estimators: Our sample statistics closely approximated the theoretical parameters, illustrating that random sampling is a reliable way of estimating population parameters.

  2. Sample Size Impact: Larger samples provide more accurate representations of population distributions, with n=10,000 showing minimal sampling variation.

  3. Central Limit Theorem: The sampling distribution of means followed a normal distribution centered near 100, with an observed SD (2.699) close to the theoretical standard error (2.739).

8.2 Practical Implications

  • Research Design: Larger sample sizes yield more reliable statistical estimates
  • Statistical Inference: The Central Limit Theorem justifies using normal theory for hypothesis testing and confidence intervals
  • Quality Control: Understanding sampling variability helps interpret differences between sample statistics and population parameters
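
As an illustration of the statistical inference point, the sketch below builds a 95% confidence interval for the mean from a sample of 30 observations. It regenerates a sample with the same seed and parameters as Task 3, and it uses the t distribution for the critical value, which is a common choice when σ is estimated from the sample. The names x, se_hat, t_crit and ci_95 are illustrative.

```r
# Regenerate a sample of 30 observations (same seed and parameters as Task 3).
set.seed(123)
x <- rnorm(n = 30, mean = 100, sd = 15)

n <- length(x)
se_hat <- sd(x) / sqrt(n)        # estimated standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # two-sided 95% critical value

# 95% confidence interval: sample mean plus or minus t_crit standard errors.
ci_95 <- mean(x) + c(-1, 1) * t_crit * se_hat
ci_95
```

Across many repeated samples, intervals built this way would be expected to contain the population mean of 100 about 95% of the time.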