The sample data comes from the Current Population Survey (CPS) and covers the 2010 to 2022 sample years. The CPS gathers information on education, labor force status, demographics, and other aspects of the U.S. population, which we will use to investigate social and economic trends in the U.S. This sample data has the following characteristics:
- The sample contains 210,451 individuals.
- The sample data is composed entirely of individual person records from the survey.
- This is a weighted sample, meaning that individuals do not all represent the same number of persons in the population.
- The sample is predominantly White: 88% of individuals are White and the remaining 12% are Black.
- The sample is approximately 50% male and 50% female.
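
To make the idea of a weighted sample concrete, here is a minimal sketch with made-up numbers. The vectors `wage` and `weight` and their values are illustrative assumptions, not figures from the CPS extract. The sketch only shows how a weighted average can differ from a simple average when respondents represent different numbers of people.

```r
# Hypothetical toy values: three respondents with an hourly wage and a survey weight
wage   <- c(10, 20, 30)
weight <- c(3000, 1000, 1000)  # assumed number of people each respondent represents

# Unweighted mean: every respondent counts equally
unweighted <- mean(wage)

# Weighted mean: each respondent counts in proportion to their weight
weighted <- weighted.mean(wage, weight)

unweighted  # 20
weighted    # 16
```

The first respondent stands in for three times as many people as each of the others, so the weighted mean is pulled toward that respondent's lower wage.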
| Variable | Explanation |
|---|---|
| year | This variable is the survey year of the person record in IPUMS CPS |
| sex | This variable is the sex of the person, either male or female |
| age | This variable is the age of the person |
| race | This variable is the race of the person, either white or black |
| education | This variable is the educational attainment, as measured by the highest year of school or degree completed |
| hourwage | This variable reports how much the respondent earned per hour in the current job, in contemporary dollars, for workers paid an hourly wage |
First, as part of data preparation, we clean the raw dataset to remove variables and records that are irrelevant to the analysis. Cleaning the data up front gives us accurate, defensible data and, in turn, more reliable models and visualizations.
```
## Use of data from IPUMS CPS is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
```

```r
library(dplyr)   # %>% and mutate()
library(ipumsr)  # lbl_clean() and as_factor()

#str(data)

# Convert the labelled SEX, EDUC, and RACE variables to factors
df <- df %>%
  mutate(
    SEX.factor  = as_factor(lbl_clean(SEX)),
    EDUC.factor = as_factor(lbl_clean(EDUC)),
    RACE.factor = as_factor(lbl_clean(RACE))
  )

# Remove the original SEX, EDUC, and RACE variables now that factor versions exist
df1 <- subset(df, select = -c(SEX, EDUC, RACE))

# Change column names (by position)
names(df1)[1:7] <- c("year", "age", "hourweight", "hourwage", "sex", "education", "race")

# Not needed for this analysis
df1$hourweight <- NULL
df1$year <- NULL

# Add a formatted ID for each record
formatted_id <- sprintf("ID%04d", seq_len(nrow(df1)))
df1 <- cbind(id = formatted_id, df1)

# Remove records with a missing hourly wage (coded as 999.99), then drop unused factor levels
df1 <- subset(df1, hourwage != 999.99) %>% droplevels()
```

```r
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Preview the first 10 rows as a styled table
head(df2, 10) %>%
  kable() %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE,
    position = "center"
  )
```

| id | Zip_Code | sex | race | age | education | hourwage |
|---|---|---|---|---|---|---|
| ID0008 | 23221 | Female | White | 30 | Associate’s degree, occupational/vocational program | 11.49 |
| ID0009 | 23460 | Male | White | 33 | Associate’s degree, occupational/vocational program | 14.50 |
| ID0077 | 22153 | Male | White | 60 | High school diploma or equivalent | 9.00 |
| ID0080 | 22303 | Female | White | 41 | High school diploma or equivalent | 7.50 |
| ID0081 | 22181 | Female | White | 17 | Grade 11 | 7.25 |
| ID0083 | 24015 | Female | White | 31 | Associate’s degree, academic program | 10.42 |
| ID0084 | 23832 | Male | White | 31 | High school diploma or equivalent | 13.00 |
| ID0088 | 23603 | Male | White | 53 | Grade 11 | 12.00 |
| ID0089 | 23323 | Male | White | 45 | Some college but no degree | 28.75 |
| ID0090 | 22303 | Female | White | 38 | Professional school degree | 13.00 |
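
One likely reason the preview IDs are not consecutive is that the cleaning code assigns `id` before it drops the records coded 999.99. The following minimal sketch reproduces that filtering step on a small made-up data frame. The objects `toy` and `clean` and all of their values are illustrative assumptions, not rows from the CPS extract.

```r
# Hypothetical toy data standing in for the data frame before filtering
toy <- data.frame(
  id        = sprintf("ID%04d", 1:5),
  hourwage  = c(11.49, 999.99, 9.00, 999.99, 7.25),
  education = factor(c("Grade 11", "Grade 12", "Grade 11", "Grade 12", "Grade 11"))
)

# Same filtering step as in the cleaning code: drop the 999.99 placeholder rows,
# then remove factor levels that no longer occur in the data
clean <- droplevels(subset(toy, hourwage != 999.99))

# Quick sanity checks on the result
nrow(clean)                # number of rows kept
as.character(clean$id)     # surviving IDs keep their original numbers
levels(clean$education)    # unused levels are gone
any(clean$hourwage == 999.99)
```

In this toy case the second and fourth rows are removed, so the remaining IDs skip numbers, and "Grade 12" disappears from the factor levels because no remaining row uses it.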