df <- read.csv("synthetic_microdata_01.csv")
I will categorize each person by their education level, if at least college = 1. No college = 0.
df$college <- ifelse(df$education == "High school", 0, 1)
p_hat_full <- mean(df$college)
p_hat_full
## [1] 0.8174192
Therefore the estimated precentage of the population of individuals in this dataset is: 81.74%
B <- 2000
phat_300 <- replicate(B, mean(sample(df$college, 300, replace = TRUE)))
phat_1000 <- replicate(B, mean(sample(df$college, 1000, replace = TRUE)))
phat_3000 <- replicate(B, mean(sample(df$college, 3000, replace = TRUE)))
phat_10000 <- replicate(B, mean(sample(df$college, 10000, replace = TRUE)))
mean(phat_300)
## [1] 0.817645
var(phat_300)
## [1] 0.0004948736
mean(phat_1000)
## [1] 0.817445
var(phat_1000)
## [1] 0.0001546993
mean(phat_3000)
## [1] 0.8178225
var(phat_3000)
## [1] 5.164982e-05
mean(phat_10000)
## [1] 0.8174468
var(phat_10000)
## [1] 1.464395e-05
p <- p_hat_full
n_vals <- c(300, 1000, 3000, 10000)
empirical_sd <- c(
sd(phat_300),
sd(phat_1000),
sd(phat_3000),
sd(phat_10000)
)
theoretical_sd <- sqrt(p * (1 - p) / n_vals)
table_9d <- data.frame(
n = n_vals,
empirical_sd = empirical_sd,
theoretical_sd = theoretical_sd
)
table_9d
## n empirical_sd theoretical_sd
## 1 300 0.022245755 0.022304340
## 2 1000 0.012437818 0.012216590
## 3 3000 0.007186781 0.007053252
## 4 10000 0.003826742 0.003863225
Explanation: The empirical standard deviations are very close to the theoretical ones, and as the sample size n increases both the empirical and theoretical sd’s decrease, coming closer together. This shows that larger samples produce more precise estimates of the population portion, the closeness of the empirical to theoretical standard deviations proces the simulation behaves under expected sampling theory.
moe <- 2 * theoretical_sd
moe_table <- data.frame(
n = n_vals,
margin_of_error = moe
)
moe_table
## n margin_of_error
## 1 300 0.04460868
## 2 1000 0.02443318
## 3 3000 0.01410650
## 4 10000 0.00772645
The margin of error decreases as the sample size increases, but improvements become smaller as the sample size grows. To exemplify, when n=300, the margin of error is about 4.46%, and increasing the sample size to n=1000 gives a margin of error of about 2.44%. Then we increase the sample more to n=3000, and our margin of error only decreases to about 1.41%, a 3x increase in sample size for only a .58x reduction, showing this difference in sample size growth the margin of error decrease begins to get very uneffecient even at these lower sample size numbers. Next we look at n=10,000 which gives us a margin of error about 0.77%, seeing this sentiment again in a 3.33x increase in sample size only netted a 0.55x reduction in margin of error, showing that it gets increasingly difficult to reduce margin of error as sample size grows. This occurs because the margin of error shrinks at a rate proportional to 1/sqrt(n), meaning that reducing the margin of error by half requires 4 times as many as observations. As a result, the larger our n grows, large increase to sample size only lead to a relatively small improvement in precision, this is why surveys often stop at a few thousand observations instead of doing tens of thousands to barely reduce margin of error.
par(mfrow = c(1, 2))
hist(
phat_300,
breaks = 30,
main = "Sampling Distribution (n = 300)",
xlab = "p-hat",
col = "lightblue",
border = "white"
)
abline(v = p_hat_full, col = "red", lwd = 2)
hist(
phat_10000,
breaks = 30,
main = "Sampling Distribution (n = 10000)",
xlab = "p-hat",
col = "lightgreen",
border = "white"
)
abline(v = p_hat_full, col = "red", lwd = 2)
par(mfrow = c(1, 1))
The histograms show the sampling distribution of phat for sample sizes n=300 and n=10000. Both distributions show that they are centered around the red vertical line, the estimated phat value, but the histogram for n=300 and n=10000 appear very differently. The n=300 distribution is much more spread out and non-uniform as there is greater variability in the sample estimates. Whereas, the n=10000 distribution is much more concentrated around the estimated phat value, also showing more uniform shape, telling us that the n=10000 sample size estimates much more preciously. This demonstrates the Law of Large numbers and Central Limit Theory in the sense that as sample size increases, the variability of the sample proportion decreases and the distribution becomes tighter around the true population portion.
mu_hat <- mean(df$commute_minutes, na.rm = TRUE)
mu_hat
## [1] 27.00011
group_share <- prop.table(table(df$city_type))
group_share
##
## NYC Other large city Rural Suburb
## 0.1747508 0.5256067 0.0998325 0.1998100
group_means <- tapply(df$commute_minutes, df$city_type, mean, na.rm = TRUE)
group_means
## NYC Other large city Rural Suburb
## 21.29285 25.23610 38.67529 30.79854
left_side <- mu_hat
right_side <- sum(group_share * group_means)
difference <- left_side - right_side
data.frame(
left_side = left_side,
right_side = right_side,
difference = difference
)
## left_side right_side difference
## 1 27.00011 27.00011 0
The overall sample mean is 27.00011 and so is the weighted average if the group, meaning their difference is 0, which verifies the total law of expectation.
overall_var <- var(df$commute_minutes, na.rm = TRUE)
group_vars <- tapply(df$commute_minutes, df$city_type, var, na.rm = TRUE)
within <- sum(group_share * group_vars)
between <- sum(group_share * (group_means - mu_hat)^2)
rhs <- within + between
data.frame(
overall_var = overall_var,
within = within,
between = between,
rhs = rhs,
difference = overall_var - rhs
)
## overall_var within between rhs difference
## 1 284.6424 260.8247 23.8187 284.6434 -0.0009546345
group_vars <- tapply(df$commute_minutes, df$city_type, var, na.rm = TRUE)
group_vars
## NYC Other large city Rural Suburb
## 113.4814 175.6522 793.4533 347.6169
within_component <- sum(group_share * group_vars)
within_component
## [1] 260.8247
Ehat[X|G] in my computation is the expected commute time, or sample mean, for each city type G=g. For example the Ehat[X|G=NYC] = 21.29 minutes.
The within-group component is much larger than the between-group. The within group variance accounts for about 92% of the total variance in commute times, whereas the between-group variance accounts only for the remaining about 8%. This means that the most of the variation in commute times originate from individuals within the same city rather than between different city types. Although the average commute times differ across cities, those differences only contribute to a small portion of the overall variation in commute times. In contrast, there is substantial variability in commute times among people living in the same type of area. Overall, city type only explains a small fraction of the differences in commuting behavior, suggesting that there are more factors outside of city type that impact commute times.
mu_hat_height <- mean(df$height_cm, na.rm = TRUE)
sigma_hat_height <- sd(df$height_cm, na.rm = TRUE)
mu_hat_height
## [1] 168.5179
sigma_hat_height
## [1] 9.536798
z_scores <- (df$height_cm - mu_hat_height) / sigma_hat_height
frac_1 <- mean(abs(z_scores) <= 1, na.rm = TRUE)
frac_2 <- mean(abs(z_scores) <= 2, na.rm = TRUE)
frac_3 <- mean(abs(z_scores) <= 3, na.rm = TRUE)
frac_1
## [1] 0.6599064
frac_2
## [1] 0.9630449
frac_3
## [1] 0.9995716
empirical <- c(frac_1, frac_2, frac_3)
normal_rule <- c(0.68, 0.95, 0.997)
comparison_table <- data.frame(
range = c("|z| <= 1", "|z| <= 2", "|z| <= 3"),
empirical_fraction = empirical,
normal_prediction = normal_rule
)
comparison_table
## range empirical_fraction normal_prediction
## 1 |z| <= 1 0.6599064 0.680
## 2 |z| <= 2 0.9630449 0.950
## 3 |z| <= 3 0.9995716 0.997
The empirical fractions are close to the predictions from 68-95-99.7 rule for a normal distribution. About 65.99% of people fall within one standard deviation of the mean, close to the 68 prediction. While 96.3% of people fall within 2 standard deviations, which is also close to the normal prediction of 95%. Then, within 3 standard deviations, about 99.96% of observations fall in this range, compared to the normal prediction of 99.7%. Overall, the empirical data distribution of height appears to be very close to the normal distribution. This suggest that the normal model provides a reasonable estimator for the distribution of heights in this data set.
empirical_190 <- mean(df$height_cm >= 190, na.rm = TRUE)
z_190 <- (190 - mu_hat_height) / sigma_hat_height
normal_190 <- 1 - pnorm(z_190)
empirical_190
## [1] 0.008374223
normal_190
## [1] 0.01214371
The empirical fraction of people with height at least 190cm is about 0.84% of the sample, which the normal model predicts that about 1.21% of people would be above 190cm. The normal approximation is reasonably close so we can conclude that our data is very close to being normally distributed.
The normal model can still be useful to predict the distribution of heights even if the distribution of heights is not perfectly normal. Modeling height as approximately normal allows us to make easy probability predictions and comparisons. To exemplify this, we can use the normal distribution to estimate the probability of people with height at or above 190cm, and while the prediction wasn’t exactly correct, it is still reasonably close to the empirical value. So we can conclude that the normal distribution captures the general shape and spread of height distribution, while it may not be able to exactly predict the extreme tails of the empirical distribution, it still provides us with an simple analysis that can produce reasonable estimates. So, the normal model is a helpful approximation for understanding and analyzing height data.
hist(
df$height_cm,
breaks = 40,
probability = TRUE,
ylim = c(0, 0.045),
main = "Height Distribution with Normal Curve",
xlab = "Height (cm)",
col = "lightblue",
border = "white"
)
curve(
dnorm(x, mean = mu_hat_height, sd = sigma_hat_height),
add = TRUE,
col = "red",
lwd = 2
)
The histogram with the fitted normal curve shows that the distribution
of heights is roughly symmetric and centered around the sample mean of
168.5. The normal curve closely follows the overall shape of the
histogram, but there are small deviations, namely that the histogram
appears slightly flatter and wider than the normal curve but overall it
tracks the data very closely. This is consistent with earlier results in
that the normal model gives us a good approximation of the empirical
data but is not exactly accurate across the entire distribution.
Overall, the plot shows that the Normal distribution provides a
reasonable estimation for height, although is not perfect.