library(tidyverse)
library(openintro)
More With Employment Data
This activity uses concepts from two labs in the Smith College SDS 100 course: Lab 3: Basic Data Visualizations and Lab 4: Intro to Data Wrangling. It may be helpful to refer to code from these labs.
Question 1
Load the tidyverse and openintro packages and create a variable that will hold the acs12 data contained in openintro.
Question 2
Create a histogram of the income variable within the acs12 data.
ggplot(
  data = acs12,
  mapping = aes(x = income)
) + geom_histogram()
Question 3
The histogram looks very skewed toward incomes that are near zero because individuals without a job are included. Apply the R function unique (which you can look up in R documentation) to get a list of possible values in the data set’s employment variable.
unique(acs12$employment)
[1] not in labor force <NA> employed unemployed
Levels: not in labor force unemployed employed
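The output above lists <NA> among the unique values but not among the levels, and the two lines order the categories differently. A minimal base R sketch of that difference, using a made-up factor (status, its values, and its level order are illustrative assumptions, not taken from acs12):

```r
# Hypothetical toy factor; values and level order are assumptions for illustration
status <- factor(
  c("employed", NA, "unemployed", "employed"),
  levels = c("unemployed", "employed")
)

# unique() keeps values in order of first appearance and includes NA
observed_values <- unique(status)

# levels() returns the defined categories in their stored order
defined_levels <- levels(status)

print(observed_values)
print(defined_levels)
```

Comparing the two results is one way to spot missing values before filtering on a column.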
Question 4
Filter the data set to create a new data object including only those individuals who have a job. Repeat the histogram from question 2, with this subset of data, which we will use for the remainder of the lab.
# Keep every column so later questions can reuse this subset
subset_df <- acs12 |>
  filter(employment == "employed")
ggplot(
  data = subset_df,
  mapping = aes(x = income)
) + geom_histogram()
Question 5
Visualize the incomes of working individuals using a boxplot instead of a histogram. To suppress some of the high-income outliers, set the maximum visible income at $100,000 by adding the code coord_cartesian(xlim = c(0, 1e5)) at the end of the ggplot command. The 1e5 is shorthand using scientific notation, where 1∙10^5 = 100,000.
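As a quick check of the notation, the following base R sketch confirms that 1e5 and 100000 are the same number (income_cap is an illustrative name, not part of the lab):

```r
# 1e5 means 1 * 10^5; income_cap is a hypothetical variable name
income_cap <- 1e5

# Both comparisons evaluate to TRUE
same_as_literal <- identical(income_cap, 100000)
same_as_power <- income_cap == 1 * 10^5

# Print without scientific notation and with a thousands separator
print(format(income_cap, scientific = FALSE, big.mark = ","))
```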
ggplot(
  data = subset_df,
  mapping = aes(x = income)
) + geom_boxplot() + coord_cartesian(xlim = c(0, 1e5))
Question 6
Now, make a set of income boxplots among the working individuals, stratified by a categorical variable (binary is acceptable) that you select. Fill each boxplot with a different color, and put income (again capped at $100,000) on the vertical axis.
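# Restrict to working individuals and put income on the vertical axis
ggplot(
  data = filter(acs12, employment == "employed"),
  mapping = aes(x = race, y = income, fill = race)
) + geom_boxplot() + coord_cartesian(ylim = c(0, 1e5))
Question 7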
Another way of visualizing this data is with a scatter plot. Make a scatter plot with hours worked on the horizontal axis, income on the vertical axis, and each group of your categorical variable in a different color.
ggplot(
  data = acs12,
  mapping = aes(x = hrs_work, y = income, color = race)
) + geom_point()
Question 8
Repeat your boxplot from question 6, but this time include only college graduates who are working at least 35 hours per week. You might need to raise your income upper limit when working with only this group.
unique(acs12$edu)
[1] college hs or lower grad <NA>
Levels: hs or lower college grad
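# The edu levels are "hs or lower", "college", and "grad"; treating both
# "college" and "grad" as college graduates is an assumption
subset_df <- acs12 |>
  filter(employment == "employed", edu %in% c("college", "grad"), hrs_work >= 35)
# Raise the upper limit in ylim if the boxes look clipped
ggplot(
  data = subset_df,
  mapping = aes(x = race, y = income, fill = race)
) + geom_boxplot() + coord_cartesian(ylim = c(0, 1e5))
Question 9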
When subdividing a data set into various groups (often called “segmentation”), it can be helpful to know how many members are in each. The table function in base R can tell us this information. Create a new data object containing only the column you want to segment by (from among the employed individuals), then use the table function to count how many are in each segment.
library(dplyr)
filter_employed <- acs12 |>
select(employment)
table1<-table(filter_employed)
print(table1)
employment
not in labor force unemployed employed
               656        106      843
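Since unique showed <NA> among the employment values, it may be worth noting that table leaves out missing values unless asked otherwise. A minimal base R sketch on a made-up vector (status and its values are illustrative assumptions, not taken from acs12):

```r
# Hypothetical toy factor standing in for a column such as employment
status <- factor(c("employed", "unemployed", NA, "employed", NA))

# Default behavior: missing values are left out of the tally
counts_default <- table(status)

# useNA = "ifany" adds an NA column when any missing values are present
counts_with_na <- table(status, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```

Comparing the two totals against the number of rows shows how many observations a segment count silently skips.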
Question 10
Determine your own well-defined data science problem on the acs12 data set, e.g. “How do ___ and ___ affect income?” and write it here. Then explore the problem using the R methods that you are familiar with. Include at least two plots and briefly discuss your findings.
ggplot(
  data = acs12,
  mapping = aes(x = employment, y = income, color = race)
) + geom_boxplot()
ggplot(
  data = acs12,
  mapping = aes(x = income, fill = race)
) + geom_boxplot() + coord_cartesian(xlim = c(0, 1e5))
How does someone’s race impact their hours worked, and therefore their income?