1.1 In your own words, after your online readings, describe what is probability sampling and non-probability sampling (max 5 lines each). You can even talk about the differences between the sampling methodologies. Which probability sampling method should we use?

Probability Sampling:

In probability sampling, often referred to as random sampling, we randomly select a group from the population that we want to do research on. It is extremely important that this group is truly chosen at random so that we can make accurate predictions about the population; if the selection is not truly random, the results can show certain biases. There are four types of probability sampling: simple random sampling, stratified sampling, systematic sampling, and cluster sampling. The ultimate goal is to take the findings from our small sample group and apply them to the much larger population.

Non-Probability Sampling:

Non-probability sampling selects the sample using non-random methods. Since selection is not random, not everyone has an equal chance of being selected for the sample, and there is a higher chance of bias in the sample. There are five common types of non-probability sampling: convenience sampling, quota sampling, self-selection (volunteer) sampling, snowball sampling, and purposive (judgmental) sampling.

Differences and when we should use each:

We should use probability sampling when we want a sample that accurately represents an entire population. Taking our sample completely at random gives us the best chance of representing a large population without bias. An example of this would be surveying 25 randomly selected people about whether they liked a new Netflix show that recently came out.

We should use non-probability sampling when we want to sample a specific group sharing similar characteristics. With this method, it is much easier to reach the correct target audience. An example of this would be asking whether the show Suits is an accurate representation of real life. To answer this, we would sample a group of attorneys. Our sample group is not random, but it likely provides a more expert opinion on the question.

1.2 Also describe at least 2 survey designs from each of the two categorizations above (max 3 lines each). A more colorful chart here. Note that survey designs are only broadly designed, so different articles will have different numbers of survey designs in the two taxonomies above.

Simple Random Sampling:

In this method, we draw a random selection from the entire population so that every unit has an equal chance of selection. It is the most basic, and one of the most commonly used, methods for selecting a random sample. An example of this would be a poll on political views whose respondents are drawn at random from a full list of voters.
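As a rough illustration of the idea, simple random sampling can be sketched in base R with `sample()`. The sampling frame of 1,000 IDs and the sample size of 25 are made-up values for the example, not anything from the CPS.

```r
# Simple random sampling: every unit in the frame has the same chance of selection.
set.seed(42)  # fixed seed so the draw is reproducible

# Hypothetical sampling frame of 1,000 people (IDs are made up)
frame <- data.frame(id = 1:1000)

n <- 25  # desired sample size (assumed for illustration)

# Draw n IDs without replacement, then keep the matching rows
srs_ids <- sample(frame$id, size = n, replace = FALSE)
srs <- frame[frame$id %in% srs_ids, , drop = FALSE]

nrow(srs)  # number of sampled units
```

Sampling without replacement means no person can appear in the sample twice.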

Systematic Sampling:

This is when we take a sample from the target population by selecting units at regular intervals, starting from a random point. To calculate the interval, we divide the size of our sampling frame by the sample size we want; for example, a frame of 1,000 units and a sample of 50 gives an interval of 20. We then pick a random starting unit within the first interval and select every 20th unit after it until we get the total number we want for our sample.
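The interval approach can be sketched in a few lines of base R. This is only a minimal example under assumed values: a frame of 1,000 units and a target sample of 50, with the interval taken as the frame size divided by the sample size.

```r
# Systematic sampling: random start, then every k-th unit.
set.seed(42)  # fixed seed so the draw is reproducible

N <- 1000      # hypothetical size of the sampling frame
n <- 50        # desired sample size (assumed for illustration)
k <- N %/% n   # sampling interval: 1000 / 50 = 20

# Random starting point somewhere in the first interval (1..k)
start <- sample(seq_len(k), 1)

# Positions of the selected units: start, start + k, start + 2k, ...
selected <- seq(from = start, by = k, length.out = n)

head(selected)
```

Only the starting point is random; once it is chosen, the rest of the sample is fully determined by the interval.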

Non-Probability Sampling

Convenience Sampling:

Convenience sampling selects participants because they are convenient for the researcher to reach. Common reasons include ease of access, geographical proximity, and existing contact with the population of interest. One important thing to know about this method is that these samples are sometimes referred to as accidental samples, since participants can be selected simply because they happen to be near the researcher.

Self-Selection Sampling:

Also known as volunteer sampling, self-selection sampling is when participants volunteer to be a part of the research sample. It is most common when the sample must meet specific criteria, which is often the case in medical or psychological research. Volunteers usually sign up for these studies through ads or are recruited.

2 The Current Population Survey (CPS) is a monthly survey of households conducted by the Bureau of Census for the Bureau of Labor Statistics. It provides a comprehensive body of data on the labor force, employment, unemployment, persons not in the labor force, hours of work, earnings, and other demographic and labor force characteristics.

In particular, CPS is a sample survey of about 60,000 eligible households (in 2017) scientifically selected to reflect the entire U.S. civilian noninstitutional population. On the basis of responses to a series of questions on work and job search activities, each person 16 years and over in a sample household is classified as employed, unemployed, or not in the labor force.

2.1 What is the population of interest in the Current Population Survey (CPS) ? Be precise – are there any age criteria, occupation criteria, and/or geographic criteria ?

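The population of interest in the Current Population Survey is the entire United States civilian noninstitutional population. The age criterion is 16 years of age and over. There is no criterion based on occupation itself, but the population excludes active-duty members of the Armed Forces and people living in institutions such as prisons or nursing homes; everyone else is classified as employed, unemployed, or not in the labor force. Geographically, the survey covers the entire United States.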

2.2 What is the sample used to estimate the population parameters like unemployment rate, employment to population ratio ? Be precise - how many households and/or people are included ? Is there anything interesting/unique about the survey methodology that you found ?

The number of households in the monthly sample was just over 65,000 per IPUMS. While we know the number of households, we do not know exactly how many people there are, since this differs by household. I found it interesting to see the way they broke up their findings into 10-year periods to be able to capture the events of each period. Splitting the data up this way helps keep us from attributing changes in unemployment to the wrong causes. For example, WWII was going on in the 1940s, and we can see it in the graph posted: unemployment dropped to historic lows as the war effort absorbed workers. Without looking at external economic events, we would be left with more questions than answers as to why certain time periods have such drastic swings. The sampling method used was stratified, multistage sampling, in which the population is divided into subgroups (strata) and random samples are drawn from within them in stages. We know that for this study they took these samples from different areas in the country as well.
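To make the stratified part concrete, here is a small base R sketch. It is a single-stage simplification, not the actual CPS multistage design, and the three regions, the household counts, and the 10% sampling rate are all invented for illustration.

```r
# Stratified sampling: draw a random sample separately within each stratum.
set.seed(42)  # fixed seed so the draw is reproducible

# Hypothetical frame: 300 households spread over three made-up regions
frame <- data.frame(
  household = 1:300,
  region    = rep(c("North", "South", "West"), times = c(150, 100, 50))
)

# Proportional allocation: sample 10% of the households in every stratum
strata <- split(frame, frame$region)
sampled <- lapply(strata, function(s) {
  s[sample(nrow(s), size = round(0.10 * nrow(s))), , drop = FALSE]
})
strat_sample <- do.call(rbind, sampled)

table(strat_sample$region)  # households sampled per region
```

Because each region is sampled on its own, every region is guaranteed to appear in the sample in proportion to its size, which a single simple random draw would not guarantee.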

2.3 Do you think CPS is a representative sample of the US entire population after reading about its methodology or your online research ?

Yes, I do. Since the CPS uses stratified sampling, we know that it does a good job of limiting bias to get the best sense of the entire population, in this case, the entire US population. This methodology does so by sampling from different geographic areas as well as accounting for other economic factors.

2.5 Importing the Data into R

# Load ipumsr (installed beforehand)
library(ipumsr)

# Read the extract's DDI codebook, then the microdata it describes
ddi <- read_ipums_ddi("cps_00001.xml")
data <- read_ipums_micro(ddi)

## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

2.6 Plot / summarize income wage variable by labor force status. You can revise your data extract easily as you decide what variables to add/discard/keep. Do you find any patterns in labor force statistics that make sense, such as income varying by labor force status?

summary(data)

##       YEAR          SERIAL          MONTH           HWTFINL       
##  Min.   :2009   Min.   :    1   Min.   : 1.000   Min.   :    0    
##  1st Qu.:2017   1st Qu.:19513   1st Qu.: 3.000   1st Qu.: 1567    
##  Median :2021   Median :38877   Median : 3.000   Median : 3381    
##  Mean   :2019   Mean   :40153   Mean   : 4.807   Mean   : 3105    
##  3rd Qu.:2022   3rd Qu.:58851   3rd Qu.: 7.000   3rd Qu.: 4310    
##  Max.   :2024   Max.   :99461   Max.   :12.000   Max.   :20133    
##                                                  NA's   :2777271  
##      CPSID              ASECFLAG           HFLAG            ASECWTH       
##  Min.   :0.000e+00   Min.   :1         Min.   :0         Min.   :   53    
##  1st Qu.:2.013e+13   1st Qu.:1         1st Qu.:0         1st Qu.:  871    
##  Median :2.020e+13   Median :1         Median :0         Median : 1623    
##  Mean   :1.730e+13   Mean   :1         Mean   :0         Mean   : 1785    
##  3rd Qu.:2.022e+13   3rd Qu.:1         3rd Qu.:1         3rd Qu.: 2301    
##  Max.   :2.024e+13   Max.   :2         Max.   :1         Max.   :28654    
##                      NA's   :3713050   NA's   :6594980   NA's   :4017265  
##      PERNUM           WTFINL            CPSIDP              CPSIDV         
##  Min.   : 1.000   Min.   :    0     Min.   :0.000e+00   Min.   :0.000e+00  
##  1st Qu.: 1.000   1st Qu.: 1572     1st Qu.:2.013e+13   1st Qu.:2.013e+14  
##  Median : 2.000   Median : 3380     Median :2.020e+13   Median :2.020e+14  
##  Mean   : 2.166   Mean   : 3184     Mean   :1.730e+13   Mean   :1.730e+14  
##  3rd Qu.: 3.000   3rd Qu.: 4401     3rd Qu.:2.022e+13   3rd Qu.:2.022e+14  
##  Max.   :16.000   Max.   :44748     Max.   :2.024e+13   Max.   :2.024e+14  
##                   NA's   :2777271                                          
##      ASECWT           LABFORCE        INCWAGE        
##  Min.   :   50     Min.   :0.000   Min.   :       0  
##  1st Qu.:  882     1st Qu.:1.000   1st Qu.:       0  
##  Median : 1638     Median :1.000   Median :   30000  
##  Mean   : 1828     Mean   :1.293   Mean   :22444229  
##  3rd Qu.: 2389     3rd Qu.:2.000   3rd Qu.:  129000  
##  Max.   :44424     Max.   :2.000   Max.   :99999999  
##  NA's   :4017265                   NA's   :4017265

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Drop the 99999999 placeholder code in INCWAGE; filter() also drops rows where INCWAGE is NA
data2 <- data %>% filter(INCWAGE != 99999999)
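The reason for removing 99999999 shows up in the summary output, where the INCWAGE mean is about 22.4 million while the median is 30,000. The small sketch below uses a made-up wage vector to show how a placeholder code like this can distort the mean. Recoding it to `NA` is shown as one alternative to dropping the rows, not as the approach used in this extract.

```r
# Toy wage vector (values are made up); 99999999 mimics the placeholder code
wages <- c(0, 30000, 45000, 99999999, 52000, 99999999)

mean(wages)  # dominated by the two placeholder values

# Recode the placeholder to missing instead of treating it as a real wage
wages_clean <- wages
wages_clean[wages_clean == 99999999] <- NA

mean(wages_clean, na.rm = TRUE)  # average of the four real values
```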

library(psych)

describe(data2$INCWAGE)

##    vars       n     mean       sd median  trimmed     mad min     max   range
## X1    1 2154587 30310.41 56835.37  12000 30310.41 17791.2   0 2099999 2099999
##    skew kurtosis    se
## X1 8.15   121.31 38.72

describe(data2$LABFORCE)

##    vars       n mean   sd median trimmed mad min max range  skew kurtosis se
## X1    1 2154587 1.62 0.49      2    1.62   0   0   2     2 -0.62    -1.34  0

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(data = data2,
       mapping = aes(x = LABFORCE, y = INCWAGE)) +
  geom_point()
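A grouped summary is another way to look at wage income by labor force status, since a scatter plot over only a few status codes stacks millions of points on top of each other. The sketch below uses made-up data rather than the extract. The labels assume the IPUMS CPS coding of LABFORCE (1 = not in labor force, 2 = in labor force), and the wage values are simulated purely for illustration.

```r
set.seed(42)  # fixed seed so the simulated wages are reproducible

# Made-up data: 100 people per status; most people outside the labor force
# are given zero wages here purely as an assumption for the example
toy <- data.frame(
  LABFORCE = rep(c(1, 2), each = 100),
  INCWAGE  = c(rep(0, 80), round(rlnorm(20, meanlog = 10, sdlog = 0.8)),
               round(rlnorm(100, meanlog = 10.5, sdlog = 0.8)))
)

# Turn the numeric codes into readable labels (assumed IPUMS CPS coding)
toy$status <- factor(toy$LABFORCE, levels = c(1, 2),
                     labels = c("Not in labor force", "In labor force"))

# Count, mean, and median wage within each status group
by_status <- do.call(rbind, lapply(split(toy$INCWAGE, toy$status), function(x) {
  data.frame(n = length(x), mean = mean(x), median = median(x))
}))

print(by_status)
```

The same grouping can be plotted with `boxplot(INCWAGE ~ status, data = toy)`, which shows the spread within each status instead of a single column of points.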

DrewBaker_Discussion4

2024-04-12