PART 1

1a. Probability Sampling: Probability sampling selects units at random, so every member of the population being studied has a known, nonzero chance of being selected (an equal chance, in the case of simple random sampling). The main types are simple random sampling (e.g. using a random number generator), stratified sampling (a random sample within each stratum), systematic sampling (selecting units at fixed intervals from a random start), and cluster sampling. A large probability sample is the best way to ensure that a sample is representative of the population and that the results are generalizable.

1b. Non-Probability Sampling: Non-probability sampling uses non-random methods to obtain a sample, and is mainly used when population parameters are unknown or impossible to identify. The main types are convenience sampling (ease of access), quota sampling (a predetermined number or proportion of units), volunteer sampling, snowball sampling (networking), and purposive (judgmental) sampling. Non-probability sampling is best suited to qualitative research, pilot studies that gauge the cost of a larger study, and studies of a specific trait. It usually makes data collection much easier, and in many cases it is cheaper and quicker than probability sampling.

1c. We should use probability sampling whenever possible, both to ensure a representative sample and to be able to use inference and generalize the results of a statistical analysis. The choice of probability sampling method will most likely depend on the goals of the study and the nature of the data being studied. It is not always possible to conduct simple random sampling, so other methods must sometimes be used (e.g. for statistics on the US population, simple random sampling would not be feasible). Non-probability sampling can be useful when a specific section of the population is of interest, or for preliminary studies.

2a. Probability Sampling

a. Simple Random Sampling: Each unit in the population has an equal chance of being selected. A common way to do this is to give each unit in the population a unique number and then use a random number generator to select units.

b. Systematic Sampling: This method has some similarities with simple random sampling. The sampler divides the population into segments, called intervals. The size of each interval is determined by dividing the population size by the desired sample size. One unit in the first interval is selected using simple random sampling, and the unit in that same position is then selected in every other interval (e.g. if you picked unit 5 in interval 1, then you must pick unit 5 in all the other intervals).
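The two probability methods above can be sketched in a few lines of base R. This is only an illustration: the population of 1,000 numbered units and the sample size of 50 are assumed values, not figures from any real study.

```r
# Hypothetical population: 1,000 units, each with a unique ID (assumed for illustration)
set.seed(42)                    # fixed seed so the draws are reproducible
N <- 1000                       # population size
n <- 50                         # desired sample size
population_ids <- seq_len(N)

# Simple random sampling: draw n IDs without replacement,
# so every unit has the same chance of selection
srs_ids <- sample(population_ids, size = n, replace = FALSE)

# Systematic sampling: interval length k = N / n, a random start in the
# first interval, then the same position in every later interval
k <- N %/% n
start <- sample(seq_len(k), size = 1)
systematic_ids <- seq(from = start, by = k, length.out = n)

head(srs_ids)
head(systematic_ids)
```

In the systematic sample, consecutive IDs always differ by exactly `k`, which is the "same position in every interval" rule described above.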

2b. Non-Probability Sampling:

a. Convenience Sampling: This method is what it sounds like: it is convenient for the sampler. The units or individuals in the sample are selected for convenience and accessibility rather than for being the most representative of the population. This method is sometimes called “accidental” sampling, because subjects end up in the sample simply because they were nearby when the study was being conducted. Usually the selection is driven by ease of access, geographic proximity, or existing contact with a person of interest. This type of sampling can introduce bias, so it is important to consider its potential weaknesses before collecting data. An example of such bias would be asking people at the mall whether they enjoy shopping.

b. Self-Selection (Volunteer): This method of sampling relies on people volunteering to be in the study. It is common when a study needs people who fit very specific criteria (such as medical studies that need subjects with a certain rare condition). This type of sample is not always representative of the population, because people choose for themselves whether to take part, which can introduce self-selection bias. The researcher can also reject volunteers who do not appear to fit the criteria. The method is useful when researching a rare condition, or when only a very small percentage of the population is relevant to the study.

PART 2

  1. The current CPS is administered monthly by the US Bureau of the Census to over 65,000 households. The survey gathers information on education, labor force status, demographics, and other aspects of the population. A “basic monthly survey” is asked every month. It is designed to reflect the civilian noninstitutional population (which excludes people serving in the Armed Forces and people living in institutions such as nursing and care facilities and correctional institutions). In terms of employment, every person age 16 and over is classified as employed, unemployed, or not in the labor force. The survey also measures age, sex, race, ethnicity, education, full- or part-time job status, multiple jobholding, duration of employment, and reason for unemployment.

    Definitions:

    Not in the Labor Force: Not employed AND has not actively looked for work (or been on temporary layoff) in the last 4 weeks. Usually split into three groups: 1. people who want a job now, 2. people marginally attached to the labor force, 3. discouraged workers. Others include retirees, children, students, etc.

    Employed: Worked at least 1 hour as a paid employee or in their own business, or was temporarily absent from their job.

    Unemployed: Not employed, available for work, and made at least one specific and active effort to find a job during the 4-week period; OR temporarily laid off and expecting to be recalled.
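To make the three definitions concrete, they can be written as a small decision rule. The sketch below is a simplification for illustration only: the function name and its yes/no arguments are hypothetical and are not actual CPS variables or the official classification procedure.

```r
# Hypothetical helper that applies the three definitions above in order.
# All arguments are assumed TRUE/FALSE answers, not real CPS fields.
classify_labor_status <- function(worked_for_pay, temporarily_absent,
                                  on_temporary_layoff,
                                  searched_last_4_weeks, available_for_work) {
  # Employed: worked at least 1 hour, or temporarily absent from a job
  if (worked_for_pay || temporarily_absent) {
    return("Employed")
  }
  # Unemployed: on temporary layoff, or searched actively and is available
  if (on_temporary_layoff || (searched_last_4_weeks && available_for_work)) {
    return("Unemployed")
  }
  # Everyone else (retirees, students, discouraged workers, ...)
  "Not in the Labor Force"
}

classify_labor_status(TRUE,  FALSE, FALSE, FALSE, FALSE)  # a paid worker
classify_labor_status(FALSE, FALSE, FALSE, TRUE,  TRUE)   # an active job seeker
classify_labor_status(FALSE, FALSE, FALSE, FALSE, TRUE)   # has not searched
```

The order of the checks matters: a person who worked for pay is classified as employed even if they also searched for another job.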

  2. Sample Method:

    The CPS uses multi-stage probability sampling. The sample consists of independent samples from each of the 50 states and DC, each tailored to the demographic and labor market conditions of that particular state. The sample is made up of household addresses.

The sample starts with 72,000 to 74,000 housing units (the figure depends on the source) from 824 sample areas, of which about 62,000 households are available for interview each month (after eliminating units that are destroyed, vacant, converted, or occupied by people whose usual place of residence is elsewhere). In all, interviews are completed for about 54,000 households; the rest are missed because of temporary absence of the occupants, inability to contact them, or refusal to cooperate. On average, this gives information for about 105,000 people aged 16 and over.

In the first stage of sampling, 824 sample areas are chosen, and in the second stage, sampling units are selected. About 72,000 housing units are selected, of which about 60,000 are occupied and eligible for interview. Of those 60,000, about 7.5% are not interviewed because of temporary absence (such as vacation) or inability to contact the occupants. Overall, information is gathered on about 112,000 people age 16 and over.

To select sample areas, the US is divided into 1,987 Primary Sampling Units (PSUs). A typical PSU includes urban and rural residents of various economic levels and diverse occupations. The 1,987 PSUs are grouped into strata within each state, and one PSU is then selected from each stratum with probability proportional to the PSU's population. In the first round of sampling, 852 sample areas are chosen.
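Selecting one PSU per stratum with probability proportional to population can be sketched as follows. The six PSUs, their strata, and their populations are made-up values for illustration, not CPS data, and the code is a minimal sketch rather than the Census Bureau's actual procedure.

```r
# Hypothetical PSUs: IDs, strata, and populations are invented for illustration
set.seed(7)
psus <- data.frame(
  psu_id     = 1:6,
  stratum    = c("A", "A", "A", "B", "B", "B"),
  population = c(50000, 150000, 300000, 20000, 80000, 100000)
)

# Within each stratum, draw one row index with probability proportional
# to population (sample() rescales the weights passed in `prob`)
pick_one_psu <- function(stratum_rows) {
  chosen <- sample(nrow(stratum_rows), size = 1, prob = stratum_rows$population)
  stratum_rows[chosen, ]
}

selected <- do.call(rbind, lapply(split(psus, psus$stratum), pick_one_psu))
selected
```

In stratum A, for example, the PSU with 300,000 residents has a 300,000 / 500,000 = 60% chance of being the one selected.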

Within states, the sampling

  1. Sampling Design (implemented 2015): https://www2.census.gov/programs-surveys/cps/methodology/CPS-Tech-Paper-77.pdf

    The CPS sample is a probability sample. It consists of independent samples from each state and DC, each specifically tailored to the demographic and labor market conditions of that particular state.

    STAGE 1: The United States is divided into primary sampling units (PSUs), each of which lies within the boundary of a single state. PSUs are grouped into strata so that they are as homogeneous as possible with respect to labor force and socioeconomic characteristics that are highly correlated with unemployment. One PSU is sampled from each stratum, and the probability of selecting a PSU within its stratum is proportional to its population in the 2010 census.

    STAGE 2: This stage is conducted annually. A sample of housing units (HUs) is selected within the sampled PSUs. Ultimate sampling units (USUs) are small groups of HUs. The selected HUs are drawn systematically from a list of blocks.

    STAGE 3: Each month, interviewers collect data from the HUs selected for the sample. An HU is interviewed for 4 consecutive months, dropped from the sample for the next 8 months, and then interviewed again for another 4 months (the 4-8-4 design). This rotation ensures that in any single month about 1/8 of the HUs are interviewed for the first time, 1/8 for the second time, and so on. The design yields a 75% month-to-month overlap and a 50% year-to-year overlap in the sample.
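The overlap figures for the 4-8-4 rotation can be checked with a short simulation. The sketch assumes one new group of HUs (called a cohort here, a name used only for this example) enters the sample every month, and the months 40, 41, and 52 are arbitrary choices made after the rotation is fully running.

```r
# A cohort that enters in `entry_month` is interviewed in its months 1-4
# and again in its months 13-16 (offsets 0-3 and 12-15 from entry)
in_sample <- function(entry_month, month) {
  (month - entry_month) %in% c(0:3, 12:15)
}

entry_months <- 1:60   # assumed: one new cohort enters each month

# Which cohorts are being interviewed in a given calendar month?
cohorts_in <- function(month) {
  entry_months[in_sample(entry_months, month)]
}

# Share of the cohorts interviewed in month m2 that were also in month m1
overlap <- function(m1, m2) {
  length(intersect(cohorts_in(m1), cohorts_in(m2))) / length(cohorts_in(m2))
}

length(cohorts_in(40))   # 8 cohorts, so each is 1/8 of the monthly sample
overlap(40, 41)          # month-to-month overlap: 0.75
overlap(40, 52)          # year-to-year overlap: 0.5
```

Between consecutive months, two of the eight cohorts rotate out (one finishing its fourth interview and one finishing its eighth), which leaves six of eight, or 75%, in common.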

  2. Yes, I do think the CPS survey gives a good representation of the whole US population. The sample design is probability sampling, aimed to get

# NOTE: To load data, you must download both the extract's data and the DDI
# and also set the working directory to the folder with these files (or change the path below).

if (!require("ipumsr")) stop("Reading IPUMS data into R requires the ipumsr package. It can be installed using the following command: install.packages('ipumsr')")
## Loading required package: ipumsr
ddi <- read_ipums_ddi("cps_00002.xml")
data <- read_ipums_micro(ddi)
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
#Remove the 99999999 values (the INCWAGE code for NIU, not in universe)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
clean_data <- data %>%
  filter(INCWAGE != 99999999)
#plot income wage by labor force status

library(ggplot2)
ggplot(clean_data, aes(x = LABFORCE, y = INCWAGE)) +
  geom_point() +
  labs(title = "Labor Force vs Income Wage",
       x = "Labor Force",
       y = "Income")

0 = NIU - not in universe - young people (from 1989 onward, the variable covers people age 15 and over)

1 = No, not in labor force - school, retire, housework, unable to work

2 = Yes, in the labor force - employed, unemployed

Looking at the graph of labor status vs wage there are some trends worth noting. The group with the lowest wages, according to the graph is the NIU group (which accounts for people under the age of 15). This makes sense, as most people in this age group are not in the labor force. Some may have part time jobs, but their incomes are expected to be low as they are dependents and in school.

The group that had the second lowest wages according to the graph is “Not in the Labor Force”. This is expected as people not in the labor force are not expected to have much income, if any. This could include discouraged workers. It can also include retirees, who may make some income on investments as well as programs like social security. It is expected that they would make more than people not in the universe (children), but less than people in the workforce.

The group that made the largest wages according to the graph is the group in the labor force. This is expected as this group contains employed people. These are the people making a substantial paycheck every month for working. It is important to note that this group also includes unemployed people (and can explain why there are also a lot of data points on the lower end of the y-axis).

Overall, there are individuals in each group (NIU, not in the labor force, in the labor force) who are expected to make $0 or very little income. However, as we move from NIU to not in the labor force to in the labor force, we expect more and more data points (individuals) with higher incomes, which is what the graph shows.
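The pattern read off the scatter plot can also be checked numerically by summarizing wages within each labor force group. The sketch below uses a tiny made-up data frame (the values are hypothetical, chosen only to mimic the shape of the plot) rather than the CPS extract.

```r
# Toy data: LABFORCE codes follow the key above; INCWAGE values are invented
toy <- data.frame(
  LABFORCE = c(0, 0, 1, 1, 1, 2, 2, 2, 2),
  INCWAGE  = c(0, 0, 0, 0, 5000, 0, 30000, 45000, 120000)
)

# Turn the numeric codes into readable group names
toy$status <- factor(toy$LABFORCE,
                     levels = c(0, 1, 2),
                     labels = c("NIU", "Not in labor force", "In labor force"))

# Median wage and share of people with $0 wage income in each group
median_wage <- tapply(toy$INCWAGE, toy$status, median)
share_zero  <- tapply(toy$INCWAGE, toy$status, function(x) mean(x == 0))

median_wage
share_zero
```

Applied to the real extract, the same two `tapply()` calls on `clean_data` would show whether the share of zero-wage records falls, and the median rises, from the NIU group to the in-the-labor-force group.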

#plot income wage by state
ggplot(clean_data, aes(x = STATECENSUS, y = INCWAGE)) +
  geom_point() +
  labs(x = "State",
       y = "Income",
       title = "Income By State") +
  ylim(0,1000000)
## Warning: Removed 622 rows containing missing values (`geom_point()`).

#same plot as before but zoomed in
ggplot(clean_data, aes(x = STATECENSUS, y = INCWAGE)) +
  geom_point() +
  labs(x = "State",
       y = "Income",
       title = "Income By State") +
  ylim(0,150000)
## Warning: Removed 19193 rows containing missing values (`geom_point()`).

#grouping the data by state and then taking the mean and median values for each state respectively and storing it in a data frame called "average_wage_by_state"

average_wage_by_state <- clean_data %>%
  group_by(STATECENSUS) %>%
  summarise(wage_mean = mean(INCWAGE),
           wage_median = median(INCWAGE))



print(average_wage_by_state)
## # A tibble: 51 × 3
##    STATECENSUS        wage_mean wage_median
##    <int+lbl>              <dbl>       <dbl>
##  1 11 [Maine]            30130.       12800
##  2 12 [New Hampshire]    40533.       20000
##  3 13 [Vermont]          36188.       17000
##  4 14 [Massachusetts]    47610.       22000
##  5 15 [Rhode Island]     38578.       20000
##  6 16 [Connecticut]      43162.       17000
##  7 21 [New York]         38210.       14000
##  8 22 [New Jersey]       44808.       20000
##  9 23 [Pennsylvania]     36856.       15000
## 10 31 [Ohio]             32974.       13000
## # ℹ 41 more rows
#finding the max mean income wage and indexing the state number for the max value

max(average_wage_by_state$wage_mean)
## [1] 65851.74
index_max <- average_wage_by_state$STATECENSUS[which.max(average_wage_by_state$wage_mean)]

print(index_max)
## <labelled<integer>[1]>: State (Census code)
## [1] 53
## 
## Labels:
##  value                                                                   label
##      0                                                                 Unknown
##     11                                                                   Maine
##     12                                                           New Hampshire
##     13                                                                 Vermont
##     14                                                           Massachusetts
##     15                                                            Rhode Island
##     16                                                             Connecticut
##     19                             Maine, New Hampshire, Vermont, Rhode Island
##     21                                                                New York
##     22                                                              New Jersey
##     23                                                            Pennsylvania
##     31                                                                    Ohio
##     32                                                                 Indiana
##     33                                                                Illinois
##     34                                                                Michigan
##     35                                                               Wisconsin
##     39                                                     Michigan, Wisconsin
##     41                                                               Minnesota
##     42                                                                    Iowa
##     43                                                                Missouri
##     44                                                            North Dakota
##     45                                                            South Dakota
##     46                                                                Nebraska
##     47                                                                  Kansas
##     49 Minnesota, Iowa, Missouri, North Dakota, South Dakota, Nebraska, Kansas
##     50                             Delaware, Maryland, Virginia, West Virginia
##     51                                                                Delaware
##     52                                                                Maryland
##     53                                                    District of Columbia
##     54                                                                Virginia
##     55                                                           West Virginia
##     56                                                          North Carolina
##     57                                                          South Carolina
##     58                                                                 Georgia
##     59                                                                 Florida
##     60                                                 South Carolina, Georgia
##     61                                                                Kentucky
##     62                                                               Tennessee
##     63                                                                 Alabama
##     64                                                             Mississippi
##     67                                                     Kentucky, Tennessee
##     69                                                    Alabama, Mississippi
##     71                                                                Arkansas
##     72                                                               Louisiana
##     73                                                                Oklahoma
##     74                                                                   Texas
##     79                                           Arkansas, Louisiana, Oklahoma
##     81                                                                 Montana
##     82                                                                   Idaho
##     83                                                                 Wyoming
##     84                                                                Colorado
##     85                                                              New Mexico
##     86                                                                 Arizona
##     87                                                                    Utah
##     88                                                                  Nevada
##     89    Montana, Idaho, Wyoming, Colorado, New Mexico, Arizona, Utah, Nevada
##     91                                                              Washington
##     92                                                                  Oregon
##     93                                                              California
##     94                                                                  Alaska
##     95                                                                  Hawaii
##     99                                      Washington, Oregon, Alaska, Hawaii
#finding the min value of mean income wage by state and indexing the state number with that value

min(average_wage_by_state$wage_mean)
## [1] 24703.62
index_min <- average_wage_by_state$STATECENSUS[which.min(average_wage_by_state$wage_mean)]

print(index_min)
## <labelled<integer>[1]>: State (Census code)
## [1] 64
#plot mean income by state

ggplot(average_wage_by_state, aes(x = STATECENSUS, y = wage_mean)) + 
         geom_bar(stat = "identity") +
         labs(x = "State",
              y = "Mean Income",
              title = "Mean Income by State")

These results show that average income varies by state. The highest-income area by a wide margin is area 53, the District of Columbia, where the mean income is about $65,852. The area with the lowest mean income is area 64, Mississippi, at about $24,704.

The vast majority of areas fall between the high $20,000s and the low $40,000s. Only DC is over $50,000, and no area falls below about $24,700.

Massachusetts has a mean income of about $47,610.