1 Introduction and Objectives

This mini-project explores concepts of probability distributions through simulation and analysis of the normal distribution. Understanding probability distributions is essential for quantitative social research, as they form the foundation of statistical inference, hypothesis testing, and data modeling.

1.1 Background: Probability Distributions

A probability distribution describes how the values of a random variable are distributed, specifying what values are possible and how likely each value is to occur. In social science research, probability distributions help us:

Model and understand variation in social phenomena
Make inferences about populations from samples
Test hypotheses and quantify uncertainty
Assess the reliability of statistical estimates

2 The Normal Distribution

The normal distribution (also called the Gaussian distribution or bell curve) is one of the most widely used probability distributions in statistics. It is characterized by two parameters:

Mean (μ): The center of the distribution, representing the average value
Standard Deviation (σ): The spread of the distribution, indicating variability around the mean

2.1 Key Properties of the Normal Distribution

Symmetric around the mean
Bell-shaped curve
Mean = Median = Mode
About 68% of data falls within 1 standard deviation of the mean
About 95% of data falls within 2 standard deviations of the mean
About 99.7% of data falls within 3 standard deviations of the mean
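
The three coverage figures above can be checked directly in R. The sketch below is illustrative and not part of the assigned tasks: it computes the exact coverage from the normal CDF and compares it with the share of simulated values that fall inside each band. The seed, the sample size of 100,000, and the object names are assumptions chosen for the example.

```r
# Theoretical coverage within k standard deviations of the mean, k = 1, 2, 3.
# For any normal distribution this equals pnorm(k) - pnorm(-k).
k <- 1:3
theoretical_coverage <- pnorm(k) - pnorm(-k)

# Empirical coverage in a simulated sample. mean = 100 and sd = 15 are the
# values used in this report; the seed and n are illustrative assumptions.
set.seed(1)
x <- rnorm(n = 100000, mean = 100, sd = 15)
empirical_coverage <- sapply(k, function(k_i) mean(abs(x - 100) <= k_i * 15))

coverage_check <- data.frame(
  k = k,
  theoretical = round(theoretical_coverage, 4),
  empirical = round(empirical_coverage, 4)
)
print(coverage_check)
```

The theoretical column gives 0.6827, 0.9545, and 0.9973, which is why the 68%, 95%, and 99.7% figures are best read as rounded approximations.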

3 Task 1: Simulate a Normal Distribution

# Set seed for reproducibility
set.seed(123)

# Generate random sample of 1000 observations
sample_1000 <- rnorm(n = 1000, mean = 100, sd = 15)

# Create histogram with theoretical density curve
library(ggplot2)

task1_plot <- ggplot(data.frame(x = sample_1000), aes(x = x)) +
  # after_stat() and linewidth replace the deprecated ..density.. and size
  # (requires ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue", 
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  stat_function(fun = dnorm, args = list(mean = 100, sd = 15), 
                color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Normal Distribution Simulation (n=1000)",
       subtitle = "Histogram with Theoretical Density Curve",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12))

task1_plot

Interpretation: The histogram shows the distribution of our simulated data, drawn from a normal distribution with mean = 100 and standard deviation = 15. The red line is the empirical density estimated from our sample, while the blue dashed line is the theoretical normal density. The close alignment between the two curves indicates that our random sample closely follows the intended normal distribution.

4 Task 2: Calculate Descriptive Statistics

# Calculate descriptive statistics
descriptive_stats <- data.frame(
  Statistic = c("Sample Mean", "Population Mean", "Sample SD", "Population SD", 
                "Sample Median", "Minimum", "Maximum"),
  Value = c(mean(sample_1000), 100, sd(sample_1000), 15, 
            median(sample_1000), min(sample_1000), max(sample_1000)),
  Type = c("Sample", "Theoretical", "Sample", "Theoretical", 
           "Sample", "Sample", "Sample")
)

# Display formatted table
library(knitr)
kable(descriptive_stats, digits = 3, caption = "Descriptive Statistics for Simulated Data")

Descriptive Statistics for Simulated Data
Statistic	Value	Type
Sample Mean	100.242	Sample
Population Mean	100.000	Theoretical
Sample SD	14.875	Sample
Population SD	15.000	Theoretical
Sample Median	100.138	Sample
Minimum	57.853	Sample
Maximum	148.616	Sample

4.1 Analysis of Sample vs. Theoretical Parameters

The comparison reveals only small differences between the sample statistics and the theoretical parameters. Our sample mean of 100.242 closely approximates the theoretical mean of 100, which is consistent with the sample mean being an unbiased estimator of the population mean (a single sample cannot demonstrate unbiasedness on its own). Similarly, the sample standard deviation of 14.875 aligns well with the theoretical value of 15.

These minor discrepancies arise from sampling variability, the natural fluctuation that occurs when drawing random samples from a population. With a sample size of 1,000, we expect the estimates to be close to their theoretical counterparts, and they are. The sample median of 100.138 is consistent with the symmetry of the normal distribution, as it closely matches both the sample and population means.
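
To put the size of the discrepancy in context, the gap between the sample mean and the population mean can be expressed in standard-error units. The short calculation below is a sketch that uses only the values reported in the table above; the variable names are chosen for illustration.

```r
# Values reported in the descriptive statistics table.
sample_mean <- 100.242
mu <- 100      # theoretical mean
sigma <- 15    # theoretical standard deviation
n <- 1000      # sample size

# Standard error of the mean: sigma / sqrt(n).
se_mean <- sigma / sqrt(n)

# Observed deviation expressed in standard-error units.
z <- (sample_mean - mu) / se_mean

cat("Standard error of the mean:", round(se_mean, 3), "\n")
cat("Deviation in SE units:", round(z, 2), "\n")
```

The standard error is about 0.474, so the observed gap of 0.242 is roughly half a standard error, well within the range that sampling variability alone would produce.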

5 Task 3: Explore Sample Size Effects

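# Generate samples of different sizes
# Note: resetting the seed to 123 restarts the random number stream, so
# sample_30 reproduces the first 30 draws of sample_1000 from Task 1
set.seed(123)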
sample_30 <- rnorm(n = 30, mean = 100, sd = 15)
sample_100 <- rnorm(n = 100, mean = 100, sd = 15)
sample_10000 <- rnorm(n = 10000, mean = 100, sd = 15)

# Create combined data frame for plotting
df_30 <- data.frame(x = sample_30, size = "n=30")
df_100 <- data.frame(x = sample_100, size = "n=100")
df_1000 <- data.frame(x = sample_1000, size = "n=1000")
df_10000 <- data.frame(x = sample_10000, size = "n=10000")

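combined_data <- rbind(df_30, df_100, df_1000, df_10000)

# Order the panels by sample size; character labels would otherwise be
# faceted alphabetically (n=100, n=1000, n=10000, n=30)
combined_data$size <- factor(combined_data$size,
                             levels = c("n=30", "n=100", "n=1000", "n=10000"))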

# Create 2x2 grid of histograms
task3_plot <- ggplot(combined_data, aes(x = x, fill = size)) +
  geom_histogram(aes(y = after_stat(density)), bins = 25, alpha = 0.7, position = "identity") +
  geom_density(color = "red", linewidth = 0.8) +
  facet_wrap(~size, ncol = 2) +
  labs(title = "Effect of Sample Size on Distribution Appearance",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        legend.position = "none")

task3_plot

5.1 Interpretation of Sample Size Effects

Comparing the four sample sizes shows how sample size affects our ability to recognize the shape of a distribution. With n=30, the histogram is irregular and jagged, making it hard to identify the underlying normal distribution with confidence. The n=100 sample shows a clearer bell shape, while the n=1,000 sample is much smoother and closely resembles a normal curve. The n=10,000 sample gives the most precise depiction, with minimal sampling fluctuation.

Larger samples therefore represent the population distribution more accurately: random sampling variation has less influence, and the underlying distributional shape is easier to identify.
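
The visual impression can also be summarized numerically. One possible approach, sketched below and not part of the original tasks, is to measure the Kolmogorov-Smirnov distance (the largest gap between a sample's empirical CDF and the theoretical normal CDF) and average it over repeated samples of each size. The seed, the 100 replicates, and the helper name `mean_ks_distance` are assumptions made for this example.

```r
# Average Kolmogorov-Smirnov distance between the empirical CDF of a
# simulated sample and the theoretical N(100, 15) CDF, by sample size.
# Seed, replicate count, and helper name are illustrative assumptions.
set.seed(2024)
sample_sizes <- c(30, 100, 1000, 10000)
n_reps <- 100

mean_ks_distance <- function(n, reps) {
  distances <- replicate(reps, {
    x <- rnorm(n = n, mean = 100, sd = 15)
    # ks.test() reports the maximum ECDF-vs-CDF gap as $statistic
    unname(ks.test(x, "pnorm", mean = 100, sd = 15)$statistic)
  })
  mean(distances)
}

ks_by_size <- data.frame(
  n = sample_sizes,
  mean_ks = sapply(sample_sizes, mean_ks_distance, reps = n_reps)
)
print(ks_by_size)
```

The average distance should shrink as n grows, which is the numerical counterpart of the histograms becoming smoother from panel to panel.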

6 Task 4: Demonstrate the Central Limit Theorem

# Demonstrate Central Limit Theorem
set.seed(456)
n_samples <- 1000
sample_size <- 30

# Draw 1000 samples of size n=30 and calculate means
sample_means <- replicate(n_samples, {
  sample_data <- rnorm(n = sample_size, mean = 100, sd = 15)
  mean(sample_data)
})

# Create histogram of sample means
task4_plot <- ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen", 
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  geom_vline(xintercept = mean(sample_means), color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Central Limit Theorem Demonstration",
       # Put the summary statistics on their own line of the subtitle
       subtitle = paste0("Distribution of ", n_samples, " Sample Means (n=", sample_size, ")\n",
                         "Mean of sampling distribution: ", round(mean(sample_means), 3),
                         "; SD of sampling distribution: ", round(sd(sample_means), 3)),
       x = "Sample Mean", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10))

task4_plot

# Calculate sampling distribution statistics
theoretical_se <- 15 / sqrt(sample_size)  # sigma / sqrt(n), with n = 30

sampling_stats <- data.frame(
  Statistic = c("Mean of Sampling Distribution", 
                "SD of Sampling Distribution", 
                "Theoretical Standard Error (σ/√n)"),
  Value = c(mean(sample_means), 
            sd(sample_means), 
            theoretical_se),
  Description = c("Observed mean of sample means",
                  "Observed standard deviation of sample means", 
                  "Expected standard error (15/√30)")
)

kable(sampling_stats, digits = 4, caption = "Central Limit Theorem Results")

Central Limit Theorem Results
Statistic	Value	Description
Mean of Sampling Distribution	100.1119	Observed mean of sample means
SD of Sampling Distribution	2.6988	Observed standard deviation of sample means
Theoretical Standard Error (σ/√n)	2.7386	Expected standard error (15/√30)

6.1 Central Limit Theorem

The simulation shows that the sampling distribution of the mean (mean = 100.112, SD = 2.699) is approximately normal and centered on the population mean, even though each sample contained only 30 observations. Because the population here is itself normal, the sample mean is exactly normally distributed for any sample size; the Central Limit Theorem makes the stronger claim that the same holds approximately for non-normal populations with finite variance once the sample size is large enough.

The observed standard deviation of the sampling distribution (2.699) aligns well with the theoretical standard error (2.739), showing that the standard error σ/√n accurately predicts the variability of sample means. Sample means are thus distributed around the population mean with variability determined by the population standard deviation and the sample size, which forms the foundation for statistical inference and confidence intervals.
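
The theorem is easiest to appreciate when the population is not normal. The sketch below repeats the procedure with a skewed population. It is an illustrative extension rather than part of the assigned tasks: the exponential distribution with rate = 1 (population mean and standard deviation both equal to 1), the seed, and the 5,000 replicates are all assumptions chosen for the example.

```r
# CLT sketch with a non-normal population.
# Assumption: exponential population with rate = 1, so the population mean
# and standard deviation are both 1. Not part of the original tasks.
set.seed(789)
n_samples_exp <- 5000
sample_size_exp <- 30

# Draw 5000 samples of size 30 and keep each sample mean
exp_sample_means <- replicate(
  n_samples_exp,
  mean(rexp(n = sample_size_exp, rate = 1))
)

clt_check <- c(
  mean_of_means = mean(exp_sample_means),
  sd_of_means = sd(exp_sample_means),
  theoretical_se = 1 / sqrt(sample_size_exp)
)
print(round(clt_check, 4))
```

If the theorem applies, the mean of the sample means should sit near 1 and their standard deviation near 1/√30 (about 0.183), even though individual exponential draws are strongly right-skewed.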

7 Summary Comparison Across Sample Sizes

# Summary statistics for all samples
summary_comparison <- data.frame(
  Sample_Size = c(30, 100, 1000, 10000),
  Sample_Mean = c(mean(sample_30), mean(sample_100), mean(sample_1000), mean(sample_10000)),
  Sample_SD = c(sd(sample_30), sd(sample_100), sd(sample_1000), sd(sample_10000)),
  Theoretical_Mean = rep(100, 4),
  Theoretical_SD = rep(15, 4)
)

kable(summary_comparison, digits = 3, caption = "Summary Comparison Across Sample Sizes")

Summary Comparison Across Sample Sizes
Sample_Size	Sample_Mean	Sample_SD	Theoretical_Mean	Theoretical_SD
30	99.293	14.715	100	15
100	100.324	12.902	100	15
1000	100.242	14.875	100	15
10000	99.964	14.997	100	15

8 Conclusions

8.1 Main Findings

Sample Statistics as Estimators: Our sample statistics closely approximated the theoretical parameters, illustrating that random sampling is a reliable way of estimating population parameters.
Sample Size Impact: Larger samples provide more accurate representations of population distributions, with n=10,000 showing minimal sampling variation.
Central Limit Theorem: The sampling distribution of means was approximately normal, with a standard deviation (2.699) close to the theoretical standard error (2.739).

8.2 Practical Implications

Research Design: Larger sample sizes yield more reliable statistical estimates
Statistical Inference: The Central Limit Theorem justifies using normal theory for hypothesis testing and confidence intervals
Quality Control: Understanding sampling variability helps interpret differences between sample and population parameters
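
As a concrete instance of the statistical inference point above, the sketch below builds a 95% confidence interval for a mean using the normal approximation. It regenerates the data with the same call used for sample_1000 in Task 1 so that it runs on its own; the object names are illustrative, and a t-based interval (for example via t.test()) would be a reasonable alternative.

```r
# 95% confidence interval for a mean using the normal approximation.
set.seed(123)
x <- rnorm(n = 1000, mean = 100, sd = 15)  # same call as sample_1000 in Task 1

x_bar <- mean(x)
se_hat <- sd(x) / sqrt(length(x))   # estimated standard error of the mean
z_crit <- qnorm(0.975)              # two-sided 95% critical value, about 1.96

ci_95 <- c(lower = x_bar - z_crit * se_hat, upper = x_bar + z_crit * se_hat)
print(round(ci_95, 3))
```

With the sample mean (100.242) and standard deviation (14.875) reported in Task 2, the margin of error works out to roughly 0.92 on either side of the sample mean.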

Mini Project 2:

Simulating a Probability Distribution

SOC470

2025-11-19