library(haven)
setwd("/Users/isaiahmireles/Desktop")
dat <- read_dta("SamplingPrac/usfacts.dta")
dat2 <- read_dta("SamplingPrac/usfacts.dta")
# class(dat) # understand obj type 
length(dat$state)
## [1] 51

Now I may be dumb, but last I checked there are 50 states…

dat
print("found it : ")
## [1] "found it : "
dat[dat$state=="District of Columbia",]
# which(dat$state=="District of Columbia")

Alright I found it.

Time to sample :

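# drop D.C. by name rather than by row position
dat <- dat[dat$state != "District of Columbia", ]
dat2 <- dat2[dat2$state != "District of Columbia", ]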

W/ replacement :

set.seed(22)
s2 <- sample(dat2$state, 5,replace = T)
s2
## [1] "Pennsylvania" "Florida"      "Mississippi"  "Georgia"      "New Jersey"
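# %in% keeps one row per distinct state, so a repeated draw would give fewer than 5 rows
dat2[dat2$state %in% s2,]

Sampling with replacement can pick the same state twice – not good. When that happens, the subset above has fewer than 5 rows despite our sampling 5. The draw shown here happens to contain five distinct states.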

Each state has a \(\frac{1}{N}\) chance of being chosen on each draw.
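To see how often repeats show up, here is a minimal sketch. It assumes base R's built-in `state.name` vector as a stand-in for `dat2$state`, and the seed is arbitrary.

```r
# Stand-in for dat2$state: base R's built-in vector of the 50 state names
states <- state.name
N <- length(states)

set.seed(1)  # arbitrary seed, only for reproducibility
draws <- sample(states, 5, replace = TRUE)

# Distinct states among the 5 draws (can be fewer than 5)
n_distinct <- length(unique(draws))

# Subsetting with %in% returns one entry per distinct state, not per draw
n_rows <- sum(states %in% draws)

# Probability that 5 draws with replacement are all distinct
p_all_distinct <- prod((N - 0:4) / N)
```

With \(N=50\), `p_all_distinct` is about 0.81, so roughly one run in five contains at least one repeated state.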

W/out replacement :

Here we use sample(), which by default draws without replacement and gives each state index equal probability. In the notation of the class, the inclusion probability is :

\[ \pi_i=\frac{n}{N} \]
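As a short check of where this comes from (a sketch of the standard counting argument, not something computed from the data): of the \(\binom{N}{n}\) equally likely samples, \(\binom{N-1}{n-1}\) contain state \(i\), so

\[ \pi_i=\frac{\binom{N-1}{n-1}}{\binom{N}{n}}=\frac{(N-1)!}{(n-1)!\,(N-n)!}\cdot\frac{n!\,(N-n)!}{N!}=\frac{n}{N} \]

For \(N=50\) states and \(n=5\), this gives \(\pi_i=\frac{5}{50}=0.1\).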

set.seed(123)
s <- sample(dat$state, 5)
s
## [1] "New Mexico" "Iowa"       "Indiana"    "Arizona"    "Tennessee"
# subsetted df
dat <- dat[dat$state %in% s,]

Graphic

Nothing makes sense without pictures :

library(tidyverse)
us_map <- map_data("state")
dat <- 
  dat |> select(state, home, area, density)|> 
  mutate(state = tolower(state))
map_df <- 
  us_map |>
  right_join(dat, by = c("region" = "state"))
map_df <- map_df |> select(-subregion) # drop subregion, not used below
library(tidyr)
library(ggplot2)

map_long <- map_df %>%
  pivot_longer(
    cols = c(density, area, home),
    names_to = "variable",
    values_to = "value"
  )
library(patchwork) # optional: lets the plots be combined, e.g. p1 + p2 + p3

p1 <- ggplot(map_df, aes(long, lat, group = group, fill = density)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_viridis_c(option = "C", name = "Density") +
  ggtitle("Population Density")

p2 <- ggplot(map_df, aes(long, lat, group = group, fill = area)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_distiller(palette = "Blues", name = "Land Area") +
  ggtitle("Land Area")

p3 <- ggplot(map_df, aes(long, lat, group = group, fill = home)) +
  geom_polygon(color = "white") +
  coord_fixed(1.3) +
  scale_fill_distiller(palette = "Reds", name = "Homes") +
  ggtitle("Number of Homes")

p1

p2

p3

Finite vs. Infinite :

Recall that the general form for a confidence interval is :

\[ \hat{\theta}\pm\text{C}_{\text{val}}*\text{S.E.} \]

And under the same confidence level, the critical value remains constant!

So for the same sample statistic, the interval varies only through the \(\text{S.E.}\). The estimate of \(\mu\) is \(\bar{x}\), meaning the standard errors are :

\[ \text{S.E.}_{\text{Inf}} = \sqrt{\hat{V}_{\bar{x}}} \]

\[ \text{S.E.}_{\text{Fin}} = \sqrt{\hat{V}_{\bar{x}}*\text{FPC}} \]
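A minimal sketch of the two standard errors side by side. The values in `x` are made up for illustration, \(N=50\) is assumed to match the 50 states, and the FPC is taken as \(1-\frac{n}{N}\); a t critical value is one common choice for a sample this small.

```r
# Hypothetical sample of n = 5 values from a population of N = 50
x <- c(12.1, 9.8, 15.3, 11.0, 13.6)
n <- length(x)
N <- 50

v_hat <- var(x) / n       # estimated variance of the sample mean
fpc   <- 1 - n / N        # finite population correction (assumed form)

se_inf <- sqrt(v_hat)         # infinite-population standard error
se_fin <- sqrt(v_hat * fpc)   # finite-population standard error

# Same confidence level, so the same critical value for both intervals
crit <- qt(0.975, df = n - 1)
ci_inf <- mean(x) + c(-1, 1) * crit * se_inf
ci_fin <- mean(x) + c(-1, 1) * crit * se_fin
```

Both intervals share the same center and critical value, so the finite-population interval is narrower by a factor of \(\sqrt{\text{FPC}}\).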

Meaning :

Approximation Investigation ( \(\frac{N-n}{N-1}\approx 1-\frac{n}{N}\) )
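A quick numeric check of the approximation, as a sketch. \(N=50\) is an assumption chosen to match the 50 states, and the variable names are only illustrative.

```r
N <- 50
n <- 1:N

fpc_exact  <- (N - n) / (N - 1)   # exact finite population correction
fpc_approx <- 1 - n / N           # common approximation

# Difference between the two forms at every sample size
gap <- fpc_exact - fpc_approx

# Largest difference across all n
max_gap <- max(gap)
```

Algebraically the gap is \(\frac{N-n}{N(N-1)}\), which is largest at \(n=1\) (where it equals \(\frac{1}{N}\)) and zero at \(n=N\). Note also that `fpc_exact` is exactly 1 at \(n=1\).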

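Therefore, with a constant \(N\), the only time the two standard errors are the same is when \(n=1\), since then \(\frac{N-n}{N-1}=1\). Meaning the finite-population variance equals the infinite-population one only when you take 1 value from the finite population – otherwise the FPC is below 1 and the finite-population standard error is the smaller of the two.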

Metrics :