Consider Question 1, but now suppose 50 homes are sampled instead of 4. Simulate 10,000 replications of this sampling scheme. Visualize the joint distribution of Y_1 = number of fires in family homes and Y_2 = number of fires in apartments. Make sure to use appropriate methods for avoiding overplotting. Answer the following.
Which (Y_1,Y_2) combination is most common?
Simulate Cov(Y_1,Y_2). How well does it approximate the analytic covariance?
library(tidyverse)
library(dplyr)
library(GGally)
(houses_samples <- rmultinom(10000, size = 50, prob = c(.73, .2, .07)) %>%
  t() %>%
  data.frame() %>%
  setNames(c('Y1', 'Y2', 'Y3'))) %>% head
The simulated and analytic covariances are very similar, differing by only about 0.04.
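The comparison can be sketched in base R as follows. This is a minimal sketch, not the original solution code: the seed and the names `sims`, `pair_counts`, `most_common`, `sim_cov`, and `analytic_cov` are assumptions introduced here, and the category probabilities are taken from the `rmultinom()` call above. The analytic value uses the standard multinomial result Cov(Y_1, Y_2) = -n p_1 p_2.

```r
set.seed(2024)  # arbitrary seed (assumption), only for reproducibility

n_homes <- 50
probs   <- c(Y1 = 0.73, Y2 = 0.20, Y3 = 0.07)  # probabilities from the call above

# Each column of rmultinom() is one replication; transpose so rows are replications.
sims <- t(rmultinom(10000, size = n_homes, prob = probs))
colnames(sims) <- names(probs)

# Tabulate the observed (Y1, Y2) pairs. Plotting these counts (for example as
# tiles or sized points) avoids overplotting the discrete outcomes.
pair_counts <- as.data.frame(table(Y1 = sims[, "Y1"], Y2 = sims[, "Y2"]),
                             stringsAsFactors = FALSE)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
most_common <- pair_counts[which.max(pair_counts$Freq), ]

# Simulated covariance versus the analytic multinomial covariance -n * p1 * p2.
sim_cov      <- cov(sims[, "Y1"], sims[, "Y2"])
analytic_cov <- -n_homes * probs[["Y1"]] * probs[["Y2"]]

print(most_common)
print(c(simulated = sim_cov, analytic = analytic_cov))
```

With these probabilities the analytic covariance is -50 × 0.73 × 0.20 = -7.3, so the simulated value should land close to that; the exact gap varies from run to run.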
Consider the problem of simulating bivariate normal data from Lecture 5.8. Using parameters() and add_trials() from the purrrfect package, and the relationships given on Slide 12, write the code that will produce the simulated data on Slide 13. This data set consists of 10000 realizations of (X,Y)∼BVN(μ_X=5, μ_Y=5, σ_X^2, σ_Y^2, ρ) for each combination of σ_X∈{0.5,1}, σ_Y∈{0.5,1}, and ρ∈{-0.8,-0.4,0,0.4,0.8}. Use this data set to reproduce the visualizations on Slide 14.
library(tidyverse)
library(purrrfect)
ggplot(simstudy, aes(x = x, y = y)) +
  geom_point(alpha = 0.5, size = 0.5) +
  geom_point(aes(x = 5, y = 5), color = "red") +
  facet_grid(rho ~ sigx + sigy, labeller = label_both) +
  theme_bw()
Warning in geom_point(aes(x = 5, y = 5), color = "red"): All aesthetics have length 1, but the data has 200000 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
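The code that builds `simstudy` does not appear above, so here is one way a data frame with the same columns (`sigx`, `sigy`, `rho`, `x`, `y`) might be generated. It is a sketch under two assumptions: it uses base R's `expand.grid()` in place of the purrrfect `parameters()` and `add_trials()` helpers named in the prompt, and it uses the standard construction of a bivariate normal pair from two independent standard normals Z_1, Z_2, namely X = μ_X + σ_X Z_1 and Y = μ_Y + σ_Y (ρ Z_1 + sqrt(1 - ρ^2) Z_2), which may or may not be written the same way on Slide 12.

```r
set.seed(2024)  # arbitrary seed (assumption)

n_trials <- 10000
mu_x <- 5
mu_y <- 5

# All 2 x 2 x 5 = 20 parameter combinations from the prompt.
grid <- expand.grid(sigx = c(0.5, 1),
                    sigy = c(0.5, 1),
                    rho  = c(-0.8, -0.4, 0, 0.4, 0.8))

# For each combination, draw n_trials pairs via the Z1/Z2 construction.
simulate_cell <- function(i) {
  p  <- grid[i, ]
  z1 <- rnorm(n_trials)
  z2 <- rnorm(n_trials)
  data.frame(sigx = p$sigx, sigy = p$sigy, rho = p$rho,
             x = mu_x + p$sigx * z1,
             y = mu_y + p$sigy * (p$rho * z1 + sqrt(1 - p$rho^2) * z2))
}

simstudy <- do.call(rbind, lapply(seq_len(nrow(grid)), simulate_cell))

# Sanity check: the sample correlation in each cell should be near its rho.
check <- aggregate(cbind(x, y) ~ sigx + sigy + rho, data = simstudy, FUN = sd)
print(dim(simstudy))
print(head(check))
```

The result has 20 × 10,000 = 200,000 rows, which matches the row count reported in the ggplot warning above.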
Consider Question 4. Simulate 10,000 realizations of (X,Y) pairs. Create a scatterplot of these (X,Y) pairs. Superimpose the analytic and simulated conditional mean function. Then use the simulated data to approximate the answers to e)-j). Use lm() to estimate the regression coefficients β_0 and β_1 for part k).
library(tidyverse)
N <- 10000
(sim_study <- data.frame(x = rexp(N, rate = 1)) %>%
  mutate(y = map_dbl(x, \(x) x + rexp(1, 1)))) %>% head
x y
1 0.8015773 2.0204804
2 0.7599702 1.3518701
3 1.8424854 1.9670407
4 0.2127960 0.8643048
5 1.9040161 2.4508210
6 0.7748084 2.1749495
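The remaining steps (conditional mean and regression coefficients) could be approached along these lines. This is a hedged sketch, not the original solution: it assumes the model implied by the simulation code above, Y = X + E with X and E independent Exp(1), so that the analytic conditional mean is E[Y | X = x] = x + 1. The seed, the decile binning, and the names `fit`, `beta_hat`, and `cond_mean` are choices made here for illustration.

```r
set.seed(2024)  # arbitrary seed (assumption)

N <- 10000
x <- rexp(N, rate = 1)
y <- x + rexp(N, rate = 1)  # same model as above, drawn in one vectorized call

# Estimate beta_0 and beta_1 by least squares.
fit      <- lm(y ~ x)
beta_hat <- coef(fit)

# Simulated conditional mean: average y within decile bins of x,
# compared with the assumed analytic curve x + 1 at each bin's mean x.
bins <- cut(x, breaks = quantile(x, probs = seq(0, 1, by = 0.1)),
            include.lowest = TRUE)
cond_mean <- data.frame(x_mid  = as.numeric(tapply(x, bins, mean)),
                        y_mean = as.numeric(tapply(y, bins, mean)))
cond_mean$analytic <- cond_mean$x_mid + 1

print(beta_hat)
print(cond_mean)
```

Under this assumed model both coefficients should come out near 1, and the binned means should track the line x + 1. Either curve can be layered on the scatterplot, for instance with `geom_abline(intercept = 1, slope = 1)` for the analytic line.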