Week_Four

I. SAMPLING METHODOLOGIES

a. Probability sampling refers to the selection of a sample from a population, when this selection is based on the principle of randomization, that is, random selection or chance. Probability sampling is more complex, more time-consuming and usually more costly than non-probability sampling.

b. Non-probability sampling, by contrast, is a method of selecting units from a population using a subjective (i.e. non-random) method. Because it does not require a complete survey frame, it is a fast, easy and inexpensive way of obtaining data.

Several sampling methods fall under probability sampling.

Simple random sampling :

In simple random sampling, all members of the population have an equal chance of being selected and the selection is done randomly.
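A minimal sketch of simple random sampling in R, assuming a made-up population of 1,000 unit IDs and a sample size of 50 (both numbers are illustrative, not from any survey):

```r
# Hypothetical population of 1,000 unit IDs (illustrative only)
set.seed(42)              # make the draw reproducible
population <- 1:1000
n <- 50                   # desired sample size

# Draw n units without replacement; every unit has the same chance of selection
srs <- sample(population, size = n, replace = FALSE)

length(srs)               # number of sampled units
anyDuplicated(srs)        # 0 means no unit was selected twice
```

Sampling without replacement is the usual choice here, so that no unit can appear in the sample more than once.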

Stratified random sampling :

Many populations can be divided into smaller groups (strata) based on specific characteristics; the groups do not overlap, and together they make up the entire population. In stratified random sampling, a random sample is drawn from each stratum.
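As a rough illustration, stratified random sampling can be sketched in R by drawing a simple random sample inside each group. The population, the stratum labels "A", "B", "C" and the 10% sampling fraction below are all assumptions made up for the example:

```r
set.seed(42)
# Hypothetical population: 600 units in stratum "A", 300 in "B", 100 in "C"
pop <- data.frame(id = 1:1000,
                  stratum = rep(c("A", "B", "C"), times = c(600, 300, 100)))

frac <- 0.10   # sample 10% of every stratum (proportional allocation)

# Split by stratum, draw a simple random sample inside each, then recombine
strat_sample <- do.call(rbind, lapply(split(pop, pop$stratum), function(s) {
  s[sample(nrow(s), size = round(frac * nrow(s))), ]
}))

table(strat_sample$stratum)   # sample counts per stratum
```

With proportional allocation each stratum contributes to the sample in the same proportion as it appears in the population (here 60, 30 and 10 units).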

Systematic sampling :

Systematic sampling is similar to simple random sampling, though it’s usually a bit easier to conduct. Each member of the population is assigned a number, then selected at regular intervals to form a sample.
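The numbering-and-interval idea can be sketched in R as follows. The population size, sample size and the use of a random starting point within the first interval are assumptions for illustration:

```r
set.seed(42)
N <- 1000                 # hypothetical population size
n <- 50                   # desired sample size
k <- N %/% n              # sampling interval: take every k-th unit

start <- sample(1:k, 1)   # random starting point within the first interval
sys_sample <- seq(from = start, by = k, length.out = n)

head(sys_sample)          # first few selected unit numbers
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the interval.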

Cluster sampling :

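Like stratified sampling, cluster sampling involves separating the population into subgroups, or clusters. But that is where the two probability sampling methods diverge: stratified sampling draws a random sample from every subgroup, whereas cluster sampling randomly selects entire clusters and then surveys the members of only the selected clusters.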
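A small sketch of one-stage cluster sampling in R, where whole clusters are drawn at random and everyone in a chosen cluster is kept. The population of 50 equal-sized clusters is hypothetical:

```r
set.seed(42)
# Hypothetical population: 1,000 people living in 50 clusters of 20
pop <- data.frame(id = 1:1000, cluster = rep(1:50, each = 20))

# Step 1: randomly select 5 whole clusters
chosen <- sample(unique(pop$cluster), size = 5)

# Step 2: keep every member of the chosen clusters
cluster_sample <- pop[pop$cluster %in% chosen, ]

nrow(cluster_sample)      # 5 clusters x 20 members
```

In a multistage design, step 2 would instead draw a random sample within each chosen cluster rather than keeping all of its members.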

Types of Non-probability sampling :

Convenience sampling :

It is a non-probability sampling technique where samples are selected from the population only because they are conveniently available to the researcher.

Quota sampling :

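Quota sampling is one of the most common methods for collecting data in surveys and research studies. The researcher sets a target number (quota) of units for each subgroup and selects units non-randomly until every quota is filled.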

Purposive sampling :

Purposive sampling selects units from the population based on the judgment of the researcher.

Snowball sampling :

Snowball sampling builds a sample by asking existing participants to recruit additional participants.

Part II

What is the population of interest in the Current Population Survey (CPS)? Be precise: are there any age criteria, occupation criteria, and/or geographic criteria?

The population of interest is the civilian, non-institutionalized population of the United States aged 16 or older. People who live in institutions or who are on active duty in the armed forces are not included in the sample. There is no upper age limit.

What is the sample used to estimate the population parameters/headline statistics, like the unemployment rate and the employment-to-population ratio? Be precise: how many households and/or people are included? Is there anything interesting/unique about the survey methodology that you found?

The U.S. Census Bureau conducts the Current Population Survey (CPS) as a sample survey of about 60,000 eligible households. The CPS is a probability sample, and it uses cluster sampling to select geographic areas.

3) Do you think the CPS is a representative sample of the entire US population after reading about its methodology or doing your own online research?

Because it draws a multistage probability-based sample of American households, the CPS is a representative sample. The sample size is also chosen to meet reliability requirements, so that the survey is a trustworthy source for estimating the unemployment rate at the national and state levels.

# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")

## Loading required package: ipumsr

ddi <- read_ipums_ddi("cps_00007.xml")
data <- read_ipums_micro(ddi)

## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

library(readr)
library(psych)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ purrr     1.0.2
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Drop records coded 99999999, the INCWAGE code for "not in universe" (NIU)
data1 <- data %>% filter(INCWAGE != 99999999)

Income by labor force status

library(ggplot2)
ggplot(data = data1,
       mapping = aes(x = LABFORCE, y = INCWAGE)) +
  geom_point()

Yes, the graph makes sense: respondents who are in the labor force report higher wage income than respondents who are not.

describe(data1$INCWAGE)

##    vars      n     mean       sd median  trimmed   mad min     max   range skew
## X1    1 790984 33857.95 62954.22  15000 33857.95 22239   0 2099999 2099999 7.66
##    kurtosis    se
## X1   105.19 70.78

describe(data1$LABFORCE)

##    vars      n mean  sd median trimmed mad min max range  skew kurtosis se
## X1    1 790984 1.62 0.5      2    1.62   0   0   2     2 -0.58    -1.38  0

data1 %>% 
  group_by(LABFORCE) %>% 
  summarize(mean_income = mean(INCWAGE),
            median_income = median(INCWAGE))

## # A tibble: 3 × 3
##   LABFORCE                       mean_income median_income
##   <int+lbl>                            <dbl>         <dbl>
## 1 0 [NIU]                             57966.         50002
## 2 1 [No, not in the labor force]       2279.             0
## 3 2 [Yes, in the labor force]         52796.         38000

# Set the values for N, x, and p
N <- 100      # Total number of procedures performed
x <- 10       # Number of procedures resulting in death within 30 days
p <- 0.05     # National proportion of deaths in these cases

# Binomial distribution
binom_prob <- dbinom(x, N, p)                          # P(X = x)
binom_cum_prob <- pbinom(x, N, p, lower.tail = FALSE)  # P(X > x): upper tail, excluding x itself

# Poisson distribution (approximation to the binomial for large N and small p)
lambda <- N * p                                           # expected number of deaths
poisson_prob <- dpois(x, lambda)                          # P(X = x)
poisson_cum_prob <- ppois(x, lambda, lower.tail = FALSE)  # P(X > x)

# Print the probabilities
cat("Binomial Probability:", binom_prob, "\n")

## Binomial Probability: 0.01671588

cat("Binomial Cumulative Probability:", binom_cum_prob, "\n")

## Binomial Cumulative Probability: 0.01147241

cat("Poisson Probability:", poisson_prob, "\n")

## Poisson Probability: 0.01813279

cat("Poisson Cumulative Probability:", poisson_cum_prob, "\n")

## Poisson Cumulative Probability: 0.01369527

# Perform hypothesis test
alpha <- 0.05  # Significance level

# Binomial test
binom_test <- binom.test(x, N, p, alternative = "greater")
cat("Binomial Test p-value:", binom_test$p.value, "\n")

## Binomial Test p-value: 0.02818829

cat("Binomial Test Conclusion:", ifelse(binom_test$p.value < alpha, "Reject Null Hypothesis", "Fail to Reject Null Hypothesis"), "\n")

## Binomial Test Conclusion: Reject Null Hypothesis
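The one-sided p-value printed above (0.02818829) is P(X >= 10), which is the sum of the two binomial quantities computed earlier (0.01671588 + 0.01147241). A short cross-check, re-declaring the same N, x and p so that it runs on its own:

```r
N <- 100; x <- 10; p <- 0.05   # same values as in the analysis above

# P(X >= x) = P(X = x) + P(X > x)
upper_tail <- dbinom(x, N, p) + pbinom(x, N, p, lower.tail = FALSE)

# Equivalent single call: P(X > x - 1)
upper_tail_alt <- pbinom(x - 1, N, p, lower.tail = FALSE)

# Both should agree with the exact binomial test
test_p <- binom.test(x, N, p, alternative = "greater")$p.value
all.equal(upper_tail, upper_tail_alt)
all.equal(upper_tail, test_p)
```

The point of the check is that `lower.tail = FALSE` excludes x itself, so the tail probability for "x or more" needs `x - 1` as the cutoff.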

# Poisson test
# r = p sets the null rate to the national proportion (the default r = 1 would test against 1 death per procedure)
poisson_test <- poisson.test(x, T = N, r = p, alternative = "greater")
cat("Poisson Test p-value:", poisson_test$p.value, "\n")

## Poisson Test p-value: 0.03182806

cat("Poisson Test Conclusion:", ifelse(poisson_test$p.value < alpha, "Reject Null Hypothesis", "Fail to Reject Null Hypothesis"), "\n")

## Poisson Test Conclusion: Reject Null Hypothesis