rm(list = ls()) # Clear all objects from your environment
gc()            # Clear unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 526562 28.2    1169750 62.5         NA   669425 35.8
## Vcells 971900  7.5    8388608 64.0      16384  1852074 14.2
cat("\f")       # Clear the console
graphics.off()  # Clear all graphs

Part A

Part 1

Describe what probability sampling and non-probability sampling are (max 5 lines each). You can also talk about the differences between the sampling methodologies. Which probability sampling method should we use?

Answer

Probability sampling draws a sample in which every unit of the population has a known, non-zero chance of being selected (an equal chance in the case of simple random sampling). The four major methods are simple random sampling, stratified sampling, systematic sampling, and cluster sampling. The overall goal is to limit bias by drawing a sample that represents the population well.

Non-probability sampling, on the other hand, selects units using non-random criteria such as geographical proximity, expert opinion, or the availability of individuals. The five major methods are convenience sampling, judgmental sampling, snowball sampling, quota sampling, and volunteer sampling. Unlike probability sampling, the chance that a given unit is included is unknown, and some units may have no chance at all, so the resulting sample is more prone to bias.

For these reasons, we should lean toward probability sampling, as it is more suitable for making statistical inferences and generalizing findings to the larger population. That said, probability samples tend to be more expensive and harder to obtain, so non-probability sampling is often the easier option in practice.
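To make the distinction concrete, here is a minimal base R sketch of two probability sampling methods: simple random sampling and proportionate stratified sampling. The population, the `region` variable, and the 10% sampling fraction are hypothetical and chosen only for illustration.

```r
# Hypothetical population: 1,000 units split across three strata of unequal size.
set.seed(42)  # make the random draws reproducible
population <- data.frame(
  id     = 1:1000,
  region = rep(c("North", "South", "West"), times = c(500, 300, 200))
)

# Simple random sampling: every unit has the same inclusion probability (100 / 1000).
srs <- population[sample(nrow(population), size = 100), ]

# Proportionate stratified sampling: draw 10% within each region separately,
# so each stratum is represented in proportion to its size.
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.10 * nrow(s))), ]
}))

table(srs$region)         # regional counts vary from draw to draw
table(stratified$region)  # regional counts are fixed by design: 50, 30, 20
```

The simple random sample can over- or under-represent a region by chance, while the stratified sample fixes the regional shares in advance.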

Part 2

Also describe at least 2 survey designs from each of the two categorizations above 

Answer

  1. Systematic sampling: This is one of the most popular survey designs. You choose a random starting point and then sample every nth item or individual in the population. It is widely used for quality control in manufacturing facilities, where an inspector might check every third product to make sure there are no errors.

  2. Convenience sampling: Another popular survey design. As the name suggests, it is based on picking individuals who are nearby and easy to reach, and thus convenient to sample. It is often used to produce a quick and inexpensive sample, but because the respondents share the same location it can introduce bias. An example is asking shoppers at a nearby mall about their favorite items to buy. The responses could change if a different mall, located in a higher- or lower-income area, were chosen.
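The systematic design in item 1 can be sketched in a few lines of base R. The batch size `N` and the interval `k` are hypothetical values that mirror the "every third product" example.

```r
set.seed(1)
N <- 300                  # hypothetical number of products coming off the line
k <- 3                    # sampling interval: inspect every third product
start <- sample(k, 1)     # random starting point between 1 and k
inspected <- seq(from = start, to = N, by = k)

length(inspected)         # N / k = 100 products inspected
head(inspected)           # first few inspected positions
```

Only the starting point is random. Once it is chosen, the rest of the sample is fully determined by the interval.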

Part B

Part 1

What is the population of interest in the Current Population Survey (CPS) ? Be precise – are there any age criteria, occupation criteria, and/or geographic criteria ?

Answer: The CPS is a monthly household survey that gathers information on education, labor force status, demographics, and other aspects of the U.S. population. Its population of interest is the civilian non-institutional population of the United States aged 16 and older, covering all 50 states and the District of Columbia. There are no occupation criteria, as the survey covers a broad range of occupations and industries.

Part 2

What is the sample used to estimate the population parameters/headline statistics like unemployment rate, employment to population ratio ? Be precise - how many households and/or people are included? Is there anything interesting/unique about the survey methodology that you found ?  

Answer: The sample covers more than 65,000 civilian households across the United States. I found it interesting that the survey questions change over time depending on current events, which allows the topics to stay relevant. The difficulty is that this makes it harder to piece together surveys from different years in which the format or the variables have changed.

The sampling method used is a stratified multistage sampling scheme. In this process, PSUs (Primary Sampling Units) consist of counties or groups of contiguous counties in the United States and are grouped into strata.
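The idea of a stratified multistage design can be illustrated with a small base R sketch. This is a simplified toy version and not the actual CPS procedure: the number of strata, PSUs, and households, and the use of equal-probability draws at each stage, are all assumptions made for illustration.

```r
set.seed(123)
# Hypothetical frame: 4 strata, 5 PSUs per stratum, 50 households per PSU.
frame <- expand.grid(household = 1:50, psu = 1:5, stratum = 1:4)
frame$psu_id <- paste(frame$stratum, frame$psu, sep = "-")

# Stage 1: within each stratum, select 2 PSUs at random.
psus <- unique(frame[, c("stratum", "psu_id")])
chosen_psus <- unlist(lapply(split(psus$psu_id, psus$stratum),
                             function(ids) sample(ids, size = 2)))

# Stage 2: within each selected PSU, select 10 households at random.
selected <- frame[frame$psu_id %in% chosen_psus, ]
sampled <- do.call(rbind, lapply(split(selected, selected$psu_id), function(p) {
  p[sample(nrow(p), size = 10), ]
}))

nrow(sampled)            # 4 strata x 2 PSUs x 10 households = 80
table(sampled$stratum)   # 20 households from every stratum
```

Stratifying first guarantees that every stratum contributes to the sample, while sampling PSUs before households keeps fieldwork concentrated in a limited number of areas.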

Part 3

Do you think CPS is a representative sample of the US entire population after reading about its methodology or your online research ?

Answer: Based on its criteria, structure, and sample size, the CPS gives a good representative sample of the US. That said, there are obstacles, such as incomplete surveys, that create shortfalls in the data. Nevertheless, it has been the benchmark for the United States for many years, and economists have been able to produce strong economic analysis from its data, which supports treating it as a strong representation.

Part 6

#install.packages("ipumsr")
library(ipumsr)
ddi <- read_ipums_ddi("cps_00002.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
#install.packages("dplyr")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data2 <- data %>% filter(INCWAGE !=  99999999)


data_2022 <- subset(data2, YEAR == 2022) # Everyone surveyed in 2022, regardless of labor force status

dataLF <- subset(data2, LABFORCE == 2 & YEAR == 2022) # In the labor force (LABFORCE == 2) in 2022
library(psych)
describe(data2$INCWAGE)
##    vars      n     mean       sd median  trimmed   mad min     max   range skew
## X1    1 632936 35669.63 66776.78  15000 35669.63 22239   0 2099999 2099999 7.64
##    kurtosis    se
## X1   101.36 83.94
# Data for all
data2 %>%
  group_by(LABFORCE) %>%
  summarise(mean(INCWAGE),
            median(INCWAGE))
## # A tibble: 3 × 3
##   LABFORCE                       `mean(INCWAGE)` `median(INCWAGE)`
##   <int+lbl>                                <dbl>             <dbl>
## 1 0 [NIU]                                 59852.             54650
## 2 1 [No, not in the labor force]           2388.                 0
## 3 2 [Yes, in the labor force]             56369.             40000
# Data for Year of 2022
data_2022 %>%
  group_by(LABFORCE) %>%
  summarise(mean(INCWAGE),
            median(INCWAGE))
## # A tibble: 3 × 3
##   LABFORCE                       `mean(INCWAGE)` `median(INCWAGE)`
##   <int+lbl>                                <dbl>             <dbl>
## 1 0 [NIU]                                 63039.             55681
## 2 1 [No, not in the labor force]           2424.                 0
## 3 2 [Yes, in the labor force]             58378.             42000

Average wage income for people in the labor force is substantially higher than for those outside it, which makes sense because people outside the labor force generally have no steady wage or salary income.

data_2022 %>%
  group_by(STATECENSUS) %>%
  summarise(mean(INCWAGE),
            median(INCWAGE))
## # A tibble: 51 × 3
##    STATECENSUS        `mean(INCWAGE)` `median(INCWAGE)`
##    <int+lbl>                    <dbl>             <dbl>
##  1 11 [Maine]                  31029.             12000
##  2 12 [New Hampshire]          42582.             19000
##  3 13 [Vermont]                38920.             17961
##  4 14 [Massachusetts]          49584.             22000
##  5 15 [Rhode Island]           41278.             20000
##  6 16 [Connecticut]            47757.             20901
##  7 21 [New York]               38270.             11000
##  8 22 [New Jersey]             46196.             20000
##  9 23 [Pennsylvania]           38014.             15301
## 10 31 [Ohio]                   31851.             12000
## # ℹ 41 more rows

When we look at average income by state for 2022, we see that Washington DC, Massachusetts, and Maryland are among the highest in income.
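Because the printed tibble only shows the first ten rows, sorting the state summary makes it easier to see which states rank highest. Below is a minimal base R sketch on hypothetical toy data; the `toy` data frame and its `state` labels stand in for `data_2022` and `STATECENSUS`.

```r
# Hypothetical toy data; in the assignment this would be data_2022.
toy <- data.frame(
  state   = c("A", "A", "B", "B", "C", "C"),
  INCWAGE = c(10000, 30000, 50000, 70000, 20000, 60000)
)

# Mean and median wage income per state.
mean_by_state   <- tapply(toy$INCWAGE, toy$state, mean)
median_by_state <- tapply(toy$INCWAGE, toy$state, median)

ranked <- data.frame(
  state         = names(mean_by_state),
  mean_income   = as.vector(mean_by_state),
  median_income = as.vector(median_by_state)
)

# Sort so the highest-income states come first.
ranked <- ranked[order(ranked$mean_income, decreasing = TRUE), ]
head(ranked, 3)
```

With the dplyr pipeline used in this assignment, the same ordering could be obtained by naming the summary column inside `summarise()` and piping the result into `arrange(desc(...))`.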

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
ggplot(data_2022, aes(x = STATECENSUS, y = INCWAGE)) +
  stat_summary(fun = "mean", geom = "bar", fill = "darkgreen") +
  labs(x = "State", y = "Avg Income") +
  ggtitle("Avg Income by State (2022)")

This graph shows average income levels for 2022 by state, with Washington DC standing out in the middle.

ggplot(dataLF, aes(x = STATECENSUS, y = INCWAGE)) +
  stat_summary(fun = "mean", geom = "bar", fill = "red") +
  labs(x = "State", y = "Avg Income (in the labor force)") +
  ggtitle("Avg Income for people in the labor force by State (2022)")