Introduction and Objectives
This mini-project explores concepts of probability distributions
through simulation and analysis of the normal distribution.
Understanding probability distributions is essential for quantitative
social research, as they form the foundation of statistical inference,
hypothesis testing, and data modeling.
Background: Probability Distributions
A probability distribution describes how the values of a random
variable are distributed, specifying what values are possible and how
likely each value is to occur. In social science research, probability
distributions help us:
- Model and understand variation in social phenomena
- Make inferences about populations from samples
- Test hypotheses and quantify uncertainty
- Assess the reliability of statistical estimates
The Normal Distribution
The normal distribution (also called the Gaussian distribution or bell
curve) is one of the most important probability distributions in
statistics. It is characterized by two parameters:
- Mean (μ): The center of the distribution,
representing the average value
- Standard Deviation (σ): The spread of the
distribution, indicating variability around the mean
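For reference, these two parameters fully determine the shape of the curve. The standard form of the normal density is written out below; the symbol f is introduced here only for illustration and is not used elsewhere in this report.

```latex
f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

In R, this density is what `dnorm(x, mean, sd)` evaluates.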
Key Properties of the Normal Distribution
- Symmetric around the mean
- Bell-shaped curve
- Mean = Median = Mode
- About 68% of data falls within 1 standard deviation of the mean
- About 95% of data falls within 2 standard deviations of the mean
- About 99.7% of data falls within 3 standard deviations of the mean
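The three percentages above can be checked directly from the normal cumulative distribution function. The following is a minimal sketch; the variable names `k` and `coverage` are chosen here for illustration.

```r
# Probability mass of a normal distribution within k standard deviations of the mean.
# pnorm() is the standard normal CDF, so the result holds for any mean and sd.
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 4)
# [1] 0.6827 0.9545 0.9973
```

The 68%, 95%, and 99.7% figures are therefore rounded values, not exact ones.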
Task 1: Simulate a Normal Distribution
# Set seed for reproducibility
set.seed(123)
# Generate random sample of 1000 observations
sample_1000 <- rnorm(n = 1000, mean = 100, sd = 15)
# Create histogram with empirical (red) and theoretical (blue, dashed) density curves
# after_stat() and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
library(ggplot2)
task1_plot <- ggplot(data.frame(x = sample_1000), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightblue",
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  stat_function(fun = dnorm, args = list(mean = 100, sd = 15),
                color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Normal Distribution Simulation (n=1000)",
       subtitle = "Histogram with Theoretical Density Curve",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 12))
task1_plot

Interpretation: The histogram shows the distribution
of our simulated data, drawn with mean = 100 and standard deviation = 15.
The red line is the empirical density curve estimated from our sample,
while the blue dashed line shows the theoretical normal density. The
close alignment between these curves indicates that our random sample
represents the intended normal distribution well.
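The visual agreement can also be summarized numerically. The sketch below re-creates the Task 1 sample and compares the share of simulated values within 1, 2, and 3 theoretical standard deviations against the normal shares; the name `within_k` is an illustrative choice, not part of the original analysis.

```r
# Re-create the Task 1 sample (same seed and parameters as above)
set.seed(123)
sample_1000 <- rnorm(n = 1000, mean = 100, sd = 15)

# Share of simulated values within k theoretical standard deviations of the mean
within_k <- sapply(1:3, function(k) mean(abs(sample_1000 - 100) <= k * 15))

# Compare with the theoretical shares from the normal CDF
data.frame(k = 1:3,
           empirical = within_k,
           theoretical = pnorm(1:3) - pnorm(-(1:3)))
```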
Task 2: Calculate Descriptive Statistics
# Calculate descriptive statistics
descriptive_stats <- data.frame(
  Statistic = c("Sample Mean", "Population Mean", "Sample SD", "Population SD",
                "Sample Median", "Minimum", "Maximum"),
  Value = c(mean(sample_1000), 100, sd(sample_1000), 15,
            median(sample_1000), min(sample_1000), max(sample_1000)),
  Type = c("Sample", "Theoretical", "Sample", "Theoretical",
           "Sample", "Sample", "Sample")
)
# Display formatted table
library(knitr)
kable(descriptive_stats, digits = 3, caption = "Descriptive Statistics for Simulated Data")
Descriptive Statistics for Simulated Data
| Statistic | Value | Type |
|---|---|---|
| Sample Mean | 100.242 | Sample |
| Population Mean | 100.000 | Theoretical |
| Sample SD | 14.875 | Sample |
| Population SD | 15.000 | Theoretical |
| Sample Median | 100.138 | Sample |
| Minimum | 57.853 | Sample |
| Maximum | 148.616 | Sample |
Analysis of Sample vs. Theoretical Parameters
The analysis reveals only small differences between the sample
statistics and the theoretical parameters. Our sample mean of 100.242
closely approximates the theoretical mean of 100, which is consistent
with the sample mean being an unbiased estimator of the population
mean. Similarly, the sample standard deviation of 14.875 aligns well
with the theoretical value of 15.
These minor discrepancies arise from sampling
variability, which is the natural fluctuation that occurs when
drawing random samples from a population. With a sample size of 1,000,
we expect these estimates to be quite close to their theoretical
counterparts, and indeed they are. The sample median of
100.138 is consistent with the symmetry of
the normal distribution, as it closely matches both the sample and
population means.
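To judge whether a gap of this size is what sampling variability would produce, it can be expressed in units of the standard error of the mean, σ/√n. The sketch below uses the values reported in the table above; the names `se_mean` and `z` are illustrative.

```r
# Values reported in the descriptive statistics table
sample_mean <- 100.242
mu <- 100      # theoretical mean
sigma <- 15    # theoretical standard deviation
n <- 1000

# Standard error of the mean and the observed deviation expressed in SE units
se_mean <- sigma / sqrt(n)
z <- (sample_mean - mu) / se_mean
round(c(se = se_mean, z = z), 3)
```

The standard error is roughly 0.47, so the observed sample mean sits about half a standard error above 100, well within the range expected from random sampling.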
Task 3: Explore Sample Size Effects
# Generate samples of different sizes
# Note: re-seeding with 123 makes sample_30 repeat the first 30 draws of sample_1000 from Task 1
set.seed(123)
sample_30 <- rnorm(n = 30, mean = 100, sd = 15)
sample_100 <- rnorm(n = 100, mean = 100, sd = 15)
sample_10000 <- rnorm(n = 10000, mean = 100, sd = 15)
# Create combined data frame for plotting (sample_1000 comes from Task 1)
df_30 <- data.frame(x = sample_30, size = "n=30")
df_100 <- data.frame(x = sample_100, size = "n=100")
df_1000 <- data.frame(x = sample_1000, size = "n=1000")
df_10000 <- data.frame(x = sample_10000, size = "n=10000")
combined_data <- rbind(df_30, df_100, df_1000, df_10000)
# Order the panels by sample size rather than alphabetically
combined_data$size <- factor(combined_data$size,
                             levels = c("n=30", "n=100", "n=1000", "n=10000"))
# Create 2x2 grid of histograms; fill is mapped only in the histogram layer
# so that the density curve is drawn as an outline
task3_plot <- ggplot(combined_data, aes(x = x)) +
  geom_histogram(aes(y = after_stat(density), fill = size), bins = 25,
                 alpha = 0.7, position = "identity") +
  geom_density(color = "red", linewidth = 0.8) +
  facet_wrap(~size, ncol = 2) +
  labs(title = "Effect of Sample Size on Distribution Appearance",
       x = "Value", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        legend.position = "none")
task3_plot

Interpretation of Sample Size Effects
The visualization of different sample sizes illustrates how sample
size affects our ability to recognize distribution patterns. With
n=30, the histogram appears irregular and somewhat
jagged, making it challenging to definitively identify the underlying
normal distribution. The n=100 sample shows clearer
bell-shaped characteristics, while the n=1,000 sample
provides a much smoother representation that closely matches the
theoretical normal curve. The n=10,000 sample offers
the most precise depiction, with minimal sampling fluctuation.
This shows that larger samples provide more accurate
representations of population distributions, reducing the
impact of random sampling variation and allowing clearer identification
of the underlying distributional shape.
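One way to put a number on this shrinking fluctuation is the standard error of the mean, σ/√n, evaluated at each of the four sample sizes. This is a small sketch that uses the theoretical σ = 15; it describes the precision of the sample mean, not the smoothness of the histogram itself.

```r
sigma <- 15
n <- c(30, 100, 1000, 10000)

# Standard error of the mean at each sample size
se_mean <- sigma / sqrt(n)
data.frame(n = n, se_mean = round(se_mean, 3))
```

Because the standard error falls with the square root of n, going from n=100 to n=10,000 (100 times more data) reduces it by a factor of 10.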
Task 4: Demonstrate the Central Limit Theorem
# Demonstrate Central Limit Theorem
set.seed(456)
n_samples <- 1000
sample_size <- 30
# Draw 1000 samples of size n=30 and calculate means
sample_means <- replicate(n_samples, {
  sample_data <- rnorm(n = sample_size, mean = 100, sd = 15)
  mean(sample_data)
})
# Create histogram of sample means with a dashed line at their mean
# after_stat() and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
task4_plot <- ggplot(data.frame(means = sample_means), aes(x = means)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, fill = "lightgreen",
                 color = "black", alpha = 0.7) +
  geom_density(color = "red", linewidth = 1.2) +
  geom_vline(xintercept = mean(sample_means), color = "blue", linewidth = 1, linetype = "dashed") +
  labs(title = "Central Limit Theorem Demonstration",
       subtitle = paste("Distribution of 1000 Sample Means (n=30)",
                        "Mean of sampling distribution:", round(mean(sample_means), 3),
                        "SD of sampling distribution:", round(sd(sample_means), 3)),
       x = "Sample Mean", y = "Density") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
        plot.subtitle = element_text(hjust = 0.5, size = 10))
task4_plot

# Calculate sampling distribution statistics
theoretical_se <- 15 / sqrt(30)
sampling_stats <- data.frame(
  Statistic = c("Mean of Sampling Distribution",
                "SD of Sampling Distribution",
                "Theoretical Standard Error (σ/√n)"),
  Value = c(mean(sample_means),
            sd(sample_means),
            theoretical_se),
  Description = c("Observed mean of sample means",
                  "Observed standard deviation of sample means",
                  "Expected standard error (15/√30)")
)
kable(sampling_stats, digits = 4, caption = "Central Limit Theorem Results")
Central Limit Theorem Results
| Statistic | Value | Description |
|---|---|---|
| Mean of Sampling Distribution | 100.1119 | Observed mean of sample means |
| SD of Sampling Distribution | 2.6988 | Observed standard deviation of sample means |
| Theoretical Standard Error (σ/√n) | 2.7386 | Expected standard error (15/√30) |
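The theoretical standard error in the table follows from a short variance calculation. The derivation below is the standard one and assumes the n observations in each sample are independent draws with common variance σ²; the notation X̄ for the sample mean is introduced here for illustration.

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right)
  = \frac{1}{n^2} \sum_{i=1}^{n} \operatorname{Var}(X_i)
  = \frac{\sigma^2}{n},
\qquad
\operatorname{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}} = \frac{15}{\sqrt{30}} \approx 2.7386
```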
Central Limit Theorem
The Central Limit Theorem demonstration shows that the sampling
distribution of means (mean = 100.112, SD = 2.699) is centered on the
population mean and approximately normal in shape, even though each
sample contained only 30 observations. Because the population simulated
here is itself normal, the sample means are normally distributed at any
sample size; the theorem's broader claim is that the same pattern
emerges approximately for non-normal populations with finite variance
once the sample size is large enough.
The observed standard deviation of the sampling distribution
(2.699) aligns well with the theoretical standard error
(2.739), showing that the standard error
σ/√n accurately predicts the variability of sample
means. Sample means are thus distributed around the population mean
with variability determined by the population standard deviation and
sample size, which forms the foundation for statistical inference and
confidence intervals.
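Because the simulation above draws from a population that is already normal, a complementary check is to repeat it with a skewed population. The sketch below is not part of the original tasks: it assumes an Exponential(rate = 1) population, chosen purely as an example of a right-skewed distribution, and the name `exp_means` is illustrative.

```r
set.seed(456)      # same seed value as Task 4, used here only for reproducibility
n_samples <- 1000
sample_size <- 30

# Means of 1000 samples of size 30 from a right-skewed Exponential(rate = 1) population
exp_means <- replicate(n_samples, mean(rexp(n = sample_size, rate = 1)))

# Exponential(1) has mean 1 and SD 1, so the CLT predicts mean 1 and SE 1/sqrt(30)
c(mean_of_means = mean(exp_means),
  sd_of_means = sd(exp_means),
  theoretical_se = 1 / sqrt(sample_size))
```

A histogram of `exp_means` can be drawn with the same ggplot2 code as Task 4 to see how close to bell-shaped the sample means are despite the skewed parent population.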
Summary Comparison Across Sample Sizes
# Summary statistics for all samples
summary_comparison <- data.frame(
  Sample_Size = c(30, 100, 1000, 10000),
  Sample_Mean = c(mean(sample_30), mean(sample_100), mean(sample_1000), mean(sample_10000)),
  Sample_SD = c(sd(sample_30), sd(sample_100), sd(sample_1000), sd(sample_10000)),
  Theoretical_Mean = rep(100, 4),
  Theoretical_SD = rep(15, 4)
)
kable(summary_comparison, digits = 3, caption = "Summary Comparison Across Sample Sizes")
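Summary Comparison Across Sample Sizes
| Sample_Size | Sample_Mean | Sample_SD | Theoretical_Mean | Theoretical_SD |
|---|---|---|---|---|
| 30 | 99.293 | 14.715 | 100 | 15 |
| 100 | 100.324 | 12.902 | 100 | 15 |
| 1000 | 100.242 | 14.875 | 100 | 15 |
| 10000 | 99.964 | 14.997 | 100 | 15 |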
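The sample standard deviations in this table can be put on a common scale in the same way as the means. The sketch below relies on a textbook large-sample approximation for normally distributed data, SE(s) ≈ σ/√(2(n−1)), which is an assumption of this example and not something computed in the original analysis; `se_sd` and `z_sd` are illustrative names.

```r
n <- c(30, 100, 1000, 10000)
sample_sd <- c(14.715, 12.902, 14.875, 14.997)   # values from the summary table
sigma <- 15

# Approximate standard error of the sample SD for normal data (assumption of this sketch)
se_sd <- sigma / sqrt(2 * (n - 1))

# Deviation of each sample SD from 15, in approximate standard-error units
z_sd <- (sample_sd - sigma) / se_sd
data.frame(n = n, sample_sd = sample_sd, se_sd = round(se_sd, 3), z = round(z_sd, 2))
```

Under this approximation the n=100 sample stands out at roughly two standard errors below 15, while the other three lie well within one. This is a reminder that a single moderately sized sample can still miss the population value noticeably.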
Conclusions
Main Findings
- Sample Statistics as Estimators: Our sample
statistics closely approximated the theoretical parameters, illustrating
that random sampling is a reliable way of estimating population parameters.
- Sample Size Impact: Larger samples provide more
accurate representations of population distributions, with n=10,000
showing minimal sampling variation.
- Central Limit Theorem: The sampling distribution
of means was approximately normal and centered near the population mean,
with a spread close to the theoretical standard error.
Practical Implications
- Research Design: Larger sample sizes yield more
reliable statistical estimates
- Statistical Inference: The Central Limit Theorem
justifies using normal theory for hypothesis testing and confidence
intervals
- Quality Control: Understanding sampling variability
helps interpret differences between sample statistics and population
parameters
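As an illustration of the statistical inference point, a normal-theory 95% confidence interval for the mean can be computed from the n=1,000 sample statistics reported in Task 2. This is a minimal sketch under the usual large-sample assumptions; the names `z_crit`, `margin`, and `ci` are introduced here for illustration.

```r
# Sample statistics reported for the n = 1000 sample
sample_mean <- 100.242
sample_sd <- 14.875
n <- 1000

# Normal-theory 95% confidence interval: mean +/- z * s / sqrt(n)
z_crit <- qnorm(0.975)
margin <- z_crit * sample_sd / sqrt(n)
ci <- c(lower = sample_mean - margin, upper = sample_mean + margin)
round(ci, 3)
```

The interval runs from about 99.32 to 101.16 and contains the theoretical mean of 100.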