Problem Set 2

df <- read.csv("synthetic_microdata_01.csv")

Question 9

Binary Atribute : College Education (Yes/No)

I will categorize each person by their education level, if at least college = 1. No college = 0.

df$college <- ifelse(df$education == "High school", 0, 1)

9(A) Estimate P(Hat)full

p_hat_full <- mean(df$college)
p_hat_full

## [1] 0.8174192

Therefore the estimated precentage of the population of individuals in this dataset is: 81.74%

9(B) Sampling

B <- 2000

phat_300 <- replicate(B, mean(sample(df$college, 300, replace = TRUE)))

phat_1000 <- replicate(B, mean(sample(df$college, 1000, replace = TRUE)))

phat_3000 <- replicate(B, mean(sample(df$college, 3000, replace = TRUE)))

phat_10000 <- replicate(B, mean(sample(df$college, 10000, replace = TRUE)))

9(C) Empirical Mean

mean(phat_300)

## [1] 0.817645

var(phat_300)

## [1] 0.0004948736

mean(phat_1000)

## [1] 0.817445

var(phat_1000)

## [1] 0.0001546993

mean(phat_3000)

## [1] 0.8178225

var(phat_3000)

## [1] 5.164982e-05

mean(phat_10000)

## [1] 0.8174468

var(phat_10000)

## [1] 1.464395e-05

9(d) Compare

p <- p_hat_full

n_vals <- c(300, 1000, 3000, 10000)

empirical_sd <- c(
  sd(phat_300),
  sd(phat_1000),
  sd(phat_3000),
  sd(phat_10000)
)

theoretical_sd <- sqrt(p * (1 - p) / n_vals)

table_9d <- data.frame(
  n = n_vals,
  empirical_sd = empirical_sd,
  theoretical_sd = theoretical_sd
)

table_9d

##       n empirical_sd theoretical_sd
## 1   300  0.022245755    0.022304340
## 2  1000  0.012437818    0.012216590
## 3  3000  0.007186781    0.007053252
## 4 10000  0.003826742    0.003863225

Explanation: The empirical standard deviations are very close to the theoretical ones, and as the sample size n increases both the empirical and theoretical sd’s decrease, coming closer together. This shows that larger samples produce more precise estimates of the population portion, the closeness of the empirical to theoretical standard deviations proces the simulation behaves under expected sampling theory.

9(e)

moe <- 2 * theoretical_sd

moe_table <- data.frame(
  n = n_vals,
  margin_of_error = moe
)

moe_table

##       n margin_of_error
## 1   300      0.04460868
## 2  1000      0.02443318
## 3  3000      0.01410650
## 4 10000      0.00772645

9(f)

The margin of error decreases as the sample size increases, but improvements become smaller as the sample size grows. To exemplify, when n=300, the margin of error is about 4.46%, and increasing the sample size to n=1000 gives a margin of error of about 2.44%. Then we increase the sample more to n=3000, and our margin of error only decreases to about 1.41%, a 3x increase in sample size for only a .58x reduction, showing this difference in sample size growth the margin of error decrease begins to get very uneffecient even at these lower sample size numbers. Next we look at n=10,000 which gives us a margin of error about 0.77%, seeing this sentiment again in a 3.33x increase in sample size only netted a 0.55x reduction in margin of error, showing that it gets increasingly difficult to reduce margin of error as sample size grows. This occurs because the margin of error shrinks at a rate proportional to 1/sqrt(n), meaning that reducing the margin of error by half requires 4 times as many as observations. As a result, the larger our n grows, large increase to sample size only lead to a relatively small improvement in precision, this is why surveys often stop at a few thousand observations instead of doing tens of thousands to barely reduce margin of error.

9(g)

par(mfrow = c(1, 2))

hist(
  phat_300,
  breaks = 30,
  main = "Sampling Distribution (n = 300)",
  xlab = "p-hat",
  col = "lightblue",
  border = "white"
)
abline(v = p_hat_full, col = "red", lwd = 2)

hist(
  phat_10000,
  breaks = 30,
  main = "Sampling Distribution (n = 10000)",
  xlab = "p-hat",
  col = "lightgreen",
  border = "white"
)
abline(v = p_hat_full, col = "red", lwd = 2)

par(mfrow = c(1, 1))

The histograms show the sampling distribution of phat for sample sizes n=300 and n=10000. Both distributions show that they are centered around the red vertical line, the estimated phat value, but the histogram for n=300 and n=10000 appear very differently. The n=300 distribution is much more spread out and non-uniform as there is greater variability in the sample estimates. Whereas, the n=10000 distribution is much more concentrated around the estimated phat value, also showing more uniform shape, telling us that the n=10000 sample size estimates much more preciously. This demonstrates the Law of Large numbers and Central Limit Theory in the sense that as sample size increases, the variability of the sample proportion decreases and the distribution becomes tighter around the true population portion.

Problem 10

10(a)

mu_hat <- mean(df$commute_minutes, na.rm = TRUE)
mu_hat

## [1] 27.00011

10(b)

group_share <- prop.table(table(df$city_type))
group_share

## 
##              NYC Other large city            Rural           Suburb 
##        0.1747508        0.5256067        0.0998325        0.1998100

group_means <- tapply(df$commute_minutes, df$city_type, mean, na.rm = TRUE)
group_means

##              NYC Other large city            Rural           Suburb 
##         21.29285         25.23610         38.67529         30.79854

10(c)

left_side <- mu_hat
right_side <- sum(group_share * group_means)
difference <- left_side - right_side

data.frame(
  left_side = left_side,
  right_side = right_side,
  difference = difference
)

##   left_side right_side difference
## 1  27.00011   27.00011          0

The overall sample mean is 27.00011 and so is the weighted average if the group, meaning their difference is 0, which verifies the total law of expectation.

10(d)

overall_var <- var(df$commute_minutes, na.rm = TRUE)

group_vars <- tapply(df$commute_minutes, df$city_type, var, na.rm = TRUE)

within <- sum(group_share * group_vars)

between <- sum(group_share * (group_means - mu_hat)^2)

rhs <- within + between

data.frame(
  overall_var = overall_var,
  within = within,
  between = between,
  rhs = rhs,
  difference = overall_var - rhs
)

##   overall_var   within between      rhs    difference
## 1    284.6424 260.8247 23.8187 284.6434 -0.0009546345

10(e)

group_vars <- tapply(df$commute_minutes, df$city_type, var, na.rm = TRUE)
group_vars

##              NYC Other large city            Rural           Suburb 
##         113.4814         175.6522         793.4533         347.6169

10(f)

within_component <- sum(group_share * group_vars)

within_component

## [1] 260.8247

Ehat[X|G] in my computation is the expected commute time, or sample mean, for each city type G=g. For example the Ehat[X|G=NYC] = 21.29 minutes.

10(g)

The within-group component is much larger than the between-group. The within group variance accounts for about 92% of the total variance in commute times, whereas the between-group variance accounts only for the remaining about 8%. This means that the most of the variation in commute times originate from individuals within the same city rather than between different city types. Although the average commute times differ across cities, those differences only contribute to a small portion of the overall variation in commute times. In contrast, there is substantial variability in commute times among people living in the same type of area. Overall, city type only explains a small fraction of the differences in commuting behavior, suggesting that there are more factors outside of city type that impact commute times.

Problem 11

11(a)

mu_hat_height <- mean(df$height_cm, na.rm = TRUE)
sigma_hat_height <- sd(df$height_cm, na.rm = TRUE)

mu_hat_height

## [1] 168.5179

sigma_hat_height

## [1] 9.536798

11(b)

z_scores <- (df$height_cm - mu_hat_height) / sigma_hat_height

11(c)

frac_1 <- mean(abs(z_scores) <= 1, na.rm = TRUE)
frac_2 <- mean(abs(z_scores) <= 2, na.rm = TRUE)
frac_3 <- mean(abs(z_scores) <= 3, na.rm = TRUE)

frac_1

## [1] 0.6599064

frac_2

## [1] 0.9630449

frac_3

## [1] 0.9995716

empirical <- c(frac_1, frac_2, frac_3)
normal_rule <- c(0.68, 0.95, 0.997)

comparison_table <- data.frame(
  range = c("|z| <= 1", "|z| <= 2", "|z| <= 3"),
  empirical_fraction = empirical,
  normal_prediction = normal_rule
)

comparison_table

##      range empirical_fraction normal_prediction
## 1 |z| <= 1          0.6599064             0.680
## 2 |z| <= 2          0.9630449             0.950
## 3 |z| <= 3          0.9995716             0.997

The empirical fractions are close to the predictions from 68-95-99.7 rule for a normal distribution. About 65.99% of people fall within one standard deviation of the mean, close to the 68 prediction. While 96.3% of people fall within 2 standard deviations, which is also close to the normal prediction of 95%. Then, within 3 standard deviations, about 99.96% of observations fall in this range, compared to the normal prediction of 99.7%. Overall, the empirical data distribution of height appears to be very close to the normal distribution. This suggest that the normal model provides a reasonable estimator for the distribution of heights in this data set.

11(d)

empirical_190 <- mean(df$height_cm >= 190, na.rm = TRUE)

z_190 <- (190 - mu_hat_height) / sigma_hat_height
normal_190 <- 1 - pnorm(z_190)

empirical_190

## [1] 0.008374223

normal_190

## [1] 0.01214371

The empirical fraction of people with height at least 190cm is about 0.84% of the sample, which the normal model predicts that about 1.21% of people would be above 190cm. The normal approximation is reasonably close so we can conclude that our data is very close to being normally distributed.

11(e)

The normal model can still be useful to predict the distribution of heights even if the distribution of heights is not perfectly normal. Modeling height as approximately normal allows us to make easy probability predictions and comparisons. To exemplify this, we can use the normal distribution to estimate the probability of people with height at or above 190cm, and while the prediction wasn’t exactly correct, it is still reasonably close to the empirical value. So we can conclude that the normal distribution captures the general shape and spread of height distribution, while it may not be able to exactly predict the extreme tails of the empirical distribution, it still provides us with an simple analysis that can produce reasonable estimates. So, the normal model is a helpful approximation for understanding and analyzing height data.

11(f)

hist(
  df$height_cm,
  breaks = 40,
  probability = TRUE,
  ylim = c(0, 0.045),
  main = "Height Distribution with Normal Curve",
  xlab = "Height (cm)",
  col = "lightblue",
  border = "white"
)

curve(
  dnorm(x, mean = mu_hat_height, sd = sigma_hat_height),
  add = TRUE,
  col = "red",
  lwd = 2
)

The histogram with the fitted normal curve shows that the distribution of heights is roughly symmetric and centered around the sample mean of 168.5. The normal curve closely follows the overall shape of the histogram, but there are small deviations, namely that the histogram appears slightly flatter and wider than the normal curve but overall it tracks the data very closely. This is consistent with earlier results in that the normal model gives us a good approximation of the empirical data but is not exactly accurate across the entire distribution. Overall, the plot shows that the Normal distribution provides a reasonable estimation for height, although is not perfect.