library(haven)
setwd("/Users/isaiahmireles/Desktop")
dat <- read_dta("SamplingPrac/usfacts.dta")
dat2 <- read_dta("SamplingPrac/usfacts.dta")
# class(dat) # understand obj type 
length(dat$state)
## [1] 51

Now I may be dumb, but last I checked there are 50…

dat
print("found it : ")
## [1] "found it : "
dat[dat$state=="District of Columbia",]
# which(dat$state=="District of Columbia")

Alright I found it.

Time to sample :

# row 9 is District of Columbia; drop it so only the 50 states remain
dat <- dat[-9,]
dat2 <- dat2[-9,]

W/ replacement :

set.seed(2)
s2 <- sample(dat2$state, 5, replace = TRUE)
cat(s2)
## Massachusetts Iowa Colorado Colorado New York
# notice the subset below has 4 rows, not 5
dat2[dat2$state%in%s2,]

As we can see, we’ve sampled Colorado twice, which is not good: despite drawing 5 times, the subset contains only four distinct states.

Each state has a \(\frac{1}{N}\) chance of being chosen on each draw.
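To get a feel for how often a repeat like this happens, here is a minimal sketch. It uses a hypothetical vector `labels` of 50 placeholder names rather than the real `dat2$state`, and the number of repetitions (10,000) is an arbitrary choice.

```r
# Hypothetical stand-in for the 50 state names (not the real data)
labels <- paste0("state_", 1:50)
N <- length(labels)
n <- 5

set.seed(1)
# Repeat the with-replacement draw many times and record
# how many distinct labels each draw of size n contains
n_distinct <- replicate(10000, length(unique(sample(labels, n, replace = TRUE))))

# Simulated share of draws with at least one repeated label
sim_dup <- mean(n_distinct < n)

# Exact probability of at least one repeat:
# 1 - (N/N) * ((N-1)/N) * ... * ((N-n+1)/N)
exact_dup <- 1 - prod((N - 0:(n - 1)) / N)

c(simulated = sim_dup, exact = exact_dup)
```

With \(N=50\) and \(n=5\) the exact value is about 0.186, so a repeat in a sample this small is not rare.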

W/out replacement :

Here we use sample() without replacement, so every state has the same probability of ending up in the sample. In the notation of the class :

\[ \pi_i=\frac{n}{N} \]

set.seed(2)
s <- sample(dat$state, 5, replace = FALSE)
cat(s)
## Massachusetts Iowa Colorado West Virginia New York
# subsetted df
dat <- dat[dat$state %in% s,]
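As a rough check on the inclusion probability \(\pi_i=\frac{n}{N}\), the sketch below repeats the without-replacement draw many times and counts how often one particular unit is selected. The vector `labels`, the unit `"state_1"`, and the 20,000 repetitions are all hypothetical choices for illustration, not part of the data.

```r
# Hypothetical population of N = 50 placeholder labels (not the real data)
labels <- paste0("state_", 1:50)
N <- length(labels)
n <- 5

set.seed(1)
reps <- 20000
# For each repetition, check whether "state_1" lands in the sample
included <- replicate(reps, "state_1" %in% sample(labels, n, replace = FALSE))

# Simulated inclusion frequency versus the theoretical pi_i = n / N
c(simulated = mean(included), theoretical = n / N)
```

The simulated frequency should land close to \(5/50 = 0.1\).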

Graphic

Nothing makes sense without pictures :

library(tidyverse)
us_map <- map_data("state")
dat <- 
  dat |> select(state, home, area, density)|> 
  mutate(state = tolower(state))
map_df <- 
  us_map |>
  right_join(dat, by = c("region" = "state"))
map_df <- map_df |> select(-subregion) # subregion is not used here
library(tidyr)
library(ggplot2)

map_long <- map_df %>%
  pivot_longer(
    cols = c(density, area, home),
    names_to = "variable",
    values_to = "value"
  )
library(patchwork) #might combine plts

p1 <- ggplot(map_df, aes(long, lat, group = group, fill = density)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "C", name = "Density") +
  ggtitle("Population Density")

p2 <- ggplot(map_df, aes(long, lat, group = group, fill = area)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_distiller(palette = "Blues", name = "Land Area") +
  ggtitle("Land Area")

p3 <- ggplot(map_df, aes(long, lat, group = group, fill = home)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_distiller(palette = "Reds", name = "Homes") +
  ggtitle("Number of Homes")

p1

p2

p3

Finite vs. Infinite :

Recall that the general form for a confidence interval is :

\[ \hat{\theta}\pm\text{C}_{\text{val}}*\text{S.E.} \]

And under the same confidence level, the critical value remains constant!

So for the same sample statistic, the interval varies only through the \(\text{S.E.}\). The estimate of \(\mu\) is \(\bar{x}\), meaning the standard errors are :

\[ \text{S.E.}_{\text{Inf}} = \sqrt{\hat{V}_{\bar{x}}} \]

\[ \text{S.E.}_{\text{Fin}} = \sqrt{\hat{V}_{\bar{x}}*\text{FPC}} \]
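As a rough illustration of the two standard errors, the sketch below builds both intervals from the same sample. The values in `x` are made up, \(\hat{V}_{\bar{x}}\) is taken to be \(s^2/n\), the FPC is taken to be \(\frac{N-n}{N-1}\), and the critical value comes from a t distribution with \(n-1\) degrees of freedom; all of these are assumptions for the example.

```r
# Hypothetical sample of n = 5 values from a population of N = 50
x <- c(12, 15, 9, 20, 14)
n <- length(x)
N <- 50

v_hat <- var(x) / n          # estimated variance of the sample mean (assumed s^2 / n)
fpc   <- (N - n) / (N - 1)   # assumed form of the FPC

se_inf <- sqrt(v_hat)        # infinite-population standard error
se_fin <- sqrt(v_hat * fpc)  # finite-population standard error

# Same critical value for both intervals (t with n - 1 df is an assumption)
c_val <- qt(0.975, df = n - 1)

rbind(
  infinite = mean(x) + c(-1, 1) * c_val * se_inf,
  finite   = mean(x) + c(-1, 1) * c_val * se_fin
)
```

Both intervals share the same center and critical value, so the finite-population interval is narrower only because its standard error is multiplied by the square root of the FPC.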

Meaning :

Approximation Investigation ( \(\frac{N-n}{N-1}\approx 1-\frac{n}{N}\) ) :

Therefore, for a constant \(N\), the only time the two standard errors are the same is when \(n=1\), since then \(\frac{N-n}{N-1}=1\). In other words, taking a single value from the finite population gives the same variance as the infinite case. As \(n \rightarrow N\), we are getting closer to a census and the \(\text{FPC}\) shrinks toward 0.
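The two forms of the FPC can be compared directly. This is a small sketch with \(N=50\) (the number of states) and a handful of sample sizes chosen only for illustration.

```r
N <- 50                      # population size used in this example
n <- c(1, 5, 10, 25, 50)     # a few illustrative sample sizes

fpc_exact  <- (N - n) / (N - 1)
fpc_approx <- 1 - n / N

# Side-by-side comparison; diff is the gap between the two forms
data.frame(n, fpc_exact, fpc_approx, diff = fpc_exact - fpc_approx)
```

At \(n=1\) the first form equals 1 exactly, while at \(n=N\) both forms equal 0, which matches the census case.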

Metrics :

Based on our sample of 5, what do we estimate?

sumd <- 
  dat |>
  select(-state) |>
  summarize(across(everything(), mean))
cat("w/out replacement")
## w/out replacement
sumd
# Subset to the states drawn with replacement (4 distinct rows, since Colorado was drawn twice) :
dat2 <- dat2[dat2$state %in% s2,]
sumd2 <- 
  dat2[,colnames(dat)] |> 
  select(-state) |> 
  summarize(across(everything(), mean))
cat("w/ replacement")
## w/ replacement
sumd2
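One thing to keep in mind is that subsetting with `%in%` keeps each matching row once, so the with-replacement mean above is taken over four distinct states rather than five draws. If the repeated draw should count twice, one possible approach is to index with `match()`. The sketch below uses a hypothetical toy data frame `toy` and a made-up `draws` vector, not the usfacts data.

```r
# Hypothetical toy data frame (not the usfacts data)
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  value = c(10, 20, 30, 40)
)
draws <- c("B", "D", "B")   # "B" drawn twice, as Colorado was above

# %in% keeps each matching row once, so the repeat is lost
by_in <- toy[toy$state %in% draws, ]

# match() returns one row index per draw, so repeats are kept
by_match <- toy[match(draws, toy$state), ]

c(mean_in = mean(by_in$value), mean_match = mean(by_match$value))
```

Here the `%in%` version averages two rows and the `match()` version averages three, so the two means differ.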

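Notice that the two estimates differ, because the without-replacement sample includes West Virginia while the with-replacement subset averages over only four distinct states :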

wor <- unlist(sumd)
wr <- unlist(sumd2)
cat("difference")
## difference
wor-wr
##       home       area    density 
## -509057.65   -5916.45     -50.26