Setup

To make bar charts and line graphs, first install the tidyverse, which already includes ggplot2. Type install.packages("tidyverse") in the console; install.packages("ggplot2") is only needed if you want ggplot2 on its own. Then load the packages with library(tidyverse) and library(ggplot2), as in the example below.

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.7
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ggplot2)

The Data

Now let’s load the data. You can download the data set, which contains male and female incomes in the US (2015), from Kaggle. Remember that the data set is a CSV file. read.csv() reads the file in table format and creates a data frame from it.

income_by_occupation <- read.csv("data/inc_occ_gender.csv", stringsAsFactors = FALSE)
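Before going further, it helps to check how each column was read. The sketch below uses a tiny inline stand-in for the CSV file (the rows and numbers are invented for illustration, they are not taken from the Kaggle file) to show how a single "Na" entry makes read.csv() treat a whole income column as character.

```r
# Toy stand-in for the CSV file; the values are invented for illustration.
csv_text <- "Occupation,M_weekly,F_weekly
MANAGEMENT,1500,1100
Chief executives,2200,Na
SALES,900,600"

toy_income <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# M_weekly holds only numbers, so it is read as integer.
# F_weekly contains the string "Na", so the whole column becomes character.
str(toy_income)
sapply(toy_income, class)
```

Running str() on the real income_by_occupation data frame in the same way shows which columns need to be cleaned before plotting.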

1. Filtering out unnecessary data (for a side-by-side bar chart)

First, we are going to use select() to extract the columns of interest from our data frame. The values that we are interested in are the male weekly income and the female weekly income of each occupation.

weekly_income_by_gender <- income_by_occupation %>%
  select(Occupation, M_weekly, F_weekly)

2. Filtering out “NA” values

The male and female incomes have to be numeric, but in this data set missing incomes are recorded as the string "Na", so the income columns are read as character. We first filter out the rows that contain "Na", and then convert the weekly income columns to numeric values.

new_weekly_income_by_gender <- weekly_income_by_gender %>%
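  filter(M_weekly != "Na") %>% # Drop rows where the male income is recorded as "Na".
  filter(F_weekly != "Na") %>% # Drop rows where the female income is recorded as "Na".
  # Convert the remaining income values from character to numeric.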
  transform(M_weekly = as.numeric(M_weekly), F_weekly = as.numeric(F_weekly))
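As an alternative sketch (not the approach used above), the conversion can come first: as.numeric() turns any non-numeric string into a real NA, which can then be dropped with is.na(). The toy data frame below is invented for illustration, and the warning that as.numeric() would normally raise is suppressed on purpose.

```r
library(dplyr)

# Toy data; the values are invented for illustration.
toy_income <- data.frame(
  Occupation = c("MANAGEMENT", "Chief executives", "SALES"),
  M_weekly = c("1500", "2200", "900"),
  F_weekly = c("1100", "Na", "600"),
  stringsAsFactors = FALSE
)

toy_clean <- toy_income %>%
  # as.numeric() turns non-numeric strings such as "Na" into NA.
  mutate(
    M_weekly = suppressWarnings(as.numeric(M_weekly)),
    F_weekly = suppressWarnings(as.numeric(F_weekly))
  ) %>%
  # Keep only the rows where both incomes are known.
  filter(!is.na(M_weekly), !is.na(F_weekly))

toy_clean
```

This version does not depend on the exact spelling of the missing-value marker, at the cost of silently dropping any other malformed entry as well.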

3. Gathering the weekly incomes of male and female of each type of occupation

Instead of plotting the weekly incomes of 142 individual occupations, we are going to plot the weekly incomes of each type of occupation. There are 21 types of occupations in total: management, business, computational, engineering, science, social service, legal, education, arts, healthcare professionals, healthcare support, protective service, culinary, groundskeeping, service, sales, office, agricultural, construction, maintenance, and production. We are going to keep only the rows whose occupation matches one of those 21 types.

# The 21 occupation types to keep.
occupation_types <- c("MANAGEMENT", "BUSINESS", "COMPUTATIONAL", "ENGINEERING",
                      "SCIENCE", "SOCIAL SERVICE", "LEGAL", "EDUCATION", "ARTS",
                      "HEALTHCARE PROFESSIONAL", "HEALTHCARE SUPPORT",
                      "PROTECTIVE SERVICE", "CULINARY", "GROUNDSKEEPING",
                      "SERVICE", "SALES", "OFFICE", "AGRICULTURAL",
                      "CONSTRUCTION", "MAINTENANCE", "PRODUCTION")

# Keep only the rows whose Occupation is one of those types.
weekly_income_by_occupation <- new_weekly_income_by_gender %>%
  filter(Occupation %in% occupation_types)

4. Preparing data to make side-by-side bar chart

We want to compare the weekly incomes of males and females, and a side-by-side bar chart is a natural choice because it places the two bars for each occupation next to each other. To make a side-by-side bar chart, we need a single column that contains the gender values: male and female. This can be achieved by using gather(), which stacks the Male and Female columns into one gender column and one income column.

# Change the names of the columns from M_weekly to Male and from F_weekly to Female.
colnames(weekly_income_by_occupation)[2] <- "Male"
colnames(weekly_income_by_occupation)[3] <- "Female" 

# Data frame that contains each gender's weekly income in each type of occupation. 
new_weekly_income_df <- weekly_income_by_occupation %>%
  gather(key = "Gender", value = "Weekly_income", Male, Female)
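To see what gather() does to the shape of the data, here is a minimal sketch on a toy data frame (the values are invented for illustration). It also shows pivot_longer(), which is available in tidyr 1.0.0 and later and is newer than the tidyr version loaded in this walkthrough, so treat that part as optional.

```r
library(tidyr)

# Toy wide data: one row per occupation, one column per gender.
toy_wide <- data.frame(
  Occupation = c("MANAGEMENT", "SALES"),
  Male = c(1500, 900),
  Female = c(1100, 600),
  stringsAsFactors = FALSE
)

# gather() stacks the Male and Female columns into key/value pairs,
# giving one row per occupation and gender.
toy_long <- gather(toy_wide, key = "Gender", value = "Weekly_income", Male, Female)
toy_long

# With tidyr >= 1.0.0, pivot_longer() produces the same columns
# (the row order differs from gather()).
toy_long2 <- pivot_longer(toy_wide, cols = c(Male, Female),
                          names_to = "Gender", values_to = "Weekly_income")
toy_long2
```

Two occupations with two genders each give four rows in the long format, which is the layout ggplot2 needs in order to map gender to a fill color.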

5. Filtering out unnecessary data (for a line graph)

Again, we are going to use select() to extract the columns of interest from our data frame. This time, the values that we are interested in are the numbers of male and female workers in each occupation. We are also going to keep only the rows whose occupation matches one of the 21 types of occupations.

# Data frame that only includes the types of occupations, and number of male and female workers in each type of 
# occupation. 
number_of_male_female <- income_by_occupation %>%
  select(Occupation, M_workers, F_workers)

# Only saving the rows of the types of occupations.  
new_number_of_male_female <- number_of_male_female %>%
  filter(Occupation %in% c("MANAGEMENT", "BUSINESS", "COMPUTATIONAL", "ENGINEERING",
                           "SCIENCE", "SOCIAL SERVICE", "LEGAL", "EDUCATION", "ARTS",
                           "HEALTHCARE PROFESSIONAL", "HEALTHCARE SUPPORT",
                           "PROTECTIVE SERVICE", "CULINARY", "GROUNDSKEEPING",
                           "SERVICE", "SALES", "OFFICE", "AGRICULTURAL",
                           "CONSTRUCTION", "MAINTENANCE", "PRODUCTION"))

6. Preparing data to make a line graph for each gender

This step is similar to step 4. To make line graphs that compare the numbers of male and female workers in each type of occupation, we are going to make a column that contains the gender values by using gather(). We then join the result with the income data so that both can be drawn on one plot.

# Change the names of the columns from M_workers to Male and from F_workers to Female.
colnames(new_number_of_male_female)[2] <- "Male"
colnames(new_number_of_male_female)[3] <- "Female"

# Data frame that contains number of male workers and female workers in each type of occupation. 
new_number_df <- new_number_of_male_female %>%
  gather(key = "Gender", value = "Number", Male, Female)

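# Join the two data frames by the type of occupation. Both data frames have a
# Gender column, so the join keeps them as Gender.x (from the income data) and
# Gender.y (from the worker counts). Each occupation present in both data frames
# ends up with four rows (each income row paired with each worker-count row).
joined_occupation_df <- left_join(new_weekly_income_df, new_number_df, by = "Occupation", copy = FALSE)

# Rename the two gender columns: column 2 (Gender.x) is used to fill the bars
# and column 4 (Gender.y) is used to group and color the lines.
colnames(joined_occupation_df)[2] <- "Income"
colnames(joined_occupation_df)[4] <- "Number_of_People"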
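The effect of joining on Occupation alone is easier to see on a toy example (the values below are invented for illustration). The sketch also shows, as an alternative that is not used in this walkthrough, a join on both Occupation and Gender, which would keep one row per occupation and gender.

```r
library(dplyr)

# Toy long-format data frames; the values are invented for illustration.
toy_income <- data.frame(
  Occupation = c("SALES", "SALES"),
  Gender = c("Male", "Female"),
  Weekly_income = c(900, 600),
  stringsAsFactors = FALSE
)
toy_number <- data.frame(
  Occupation = c("SALES", "SALES"),
  Gender = c("Male", "Female"),
  Number = c(5000, 3000),
  stringsAsFactors = FALSE
)

# Joining on Occupation alone pairs every income row with every number row:
# 2 x 2 = 4 rows, with the two Gender columns renamed Gender.x and Gender.y.
# (Recent dplyr versions may warn about the many-to-many match.)
by_occupation <- left_join(toy_income, toy_number, by = "Occupation")
names(by_occupation)
nrow(by_occupation)

# Joining on both keys keeps one row per occupation and gender.
by_both <- left_join(toy_income, toy_number, by = c("Occupation", "Gender"))
nrow(by_both)
```

The plot in this walkthrough relies on the first form, because it uses the two separate gender columns for the bar fill and the line grouping.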

Creating a chart

1. Bar chart: geom_bar()

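Now the data is ready to be visualized. To make a side-by-side bar chart, you can use ggplot() and geom_bar() with stat = "identity", so that the bar heights come from the data, and position = "dodge", so that the bars for each gender sit next to each other.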

# Plot the graph that displays both numbers of male workers and female workers in each type of occupation
# and weekly income of each type of occupation of each gender. 
first_plot <- ggplot(joined_occupation_df, aes(x = reorder(Occupation, Weekly_income)))
first_plot <- first_plot + 
  geom_bar(aes(y = Weekly_income, fill = Income), stat = "identity", position = "dodge")
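If you want to try the bar chart in isolation, here is a minimal sketch on toy data (the values are invented for illustration). It uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Toy long-format data: one row per occupation and gender.
toy_long <- data.frame(
  Occupation = c("MANAGEMENT", "MANAGEMENT", "SALES", "SALES"),
  Gender = c("Male", "Female", "Male", "Female"),
  Weekly_income = c(1500, 1100, 900, 600)
)

# position = "dodge" puts the bars for each gender next to each other
# instead of stacking them.
toy_bar <- ggplot(toy_long, aes(x = reorder(Occupation, Weekly_income),
                                y = Weekly_income, fill = Gender)) +
  geom_col(position = "dodge") +
  labs(x = "Type of Occupation", y = "Weekly Income")

toy_bar
```

reorder() sorts the occupations by income, so the bars appear from the lowest-paid to the highest-paid type of occupation.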

2. Line graph: geom_line() and geom_point()

To make a line graph, you can use geom_line() and geom_point(). You can draw the lines on top of the side-by-side bar chart, as in the example below. That way, the plot shows not only the weekly income of each gender in each type of occupation, but also the number of workers of each gender in each type of occupation.

# Plot the line graph. 
first_plot <- first_plot + 
  geom_line(aes(y = Number, group = Number_of_People, col = Number_of_People)) + 
  geom_point(aes(y = Number, group = Number_of_People, col = Number_of_People)) 

# Adding a secondary axis for the number of workers, and adding the necessary titles.
first_plot <- first_plot + 
  scale_y_continuous(sec.axis = sec_axis(~.*1, name = "Number of People")) + 
  labs(title = "Weekly Income and Number of Workers of Each Gender in Each Type of Occupation (2015)") + 
  theme(plot.title = element_text(size = 9)) + 
  xlab("Type of Occupation") + 
  ylab("Weekly Income") + 
  coord_flip() # Flipping the coordinates for a better view. 

# View the first plot. 
first_plot 
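Note that sec_axis(~.*1, ...) leaves the secondary axis on the same scale as the primary one. When the two variables have very different ranges, one common approach is to rescale the second variable onto the first axis and undo that rescaling inside sec_axis(). The following is only a sketch on toy data; the values and the scale_factor variable are invented for illustration and are not part of the plot above.

```r
library(ggplot2)

# Toy data; the values are invented for illustration.
toy <- data.frame(
  Occupation = c("MANAGEMENT", "SALES", "LEGAL"),
  Weekly_income = c(1500, 900, 1300),
  Number = c(12000, 9000, 1500)
)

# Hypothetical scaling factor: ratio between the largest worker count
# and the largest weekly income.
scale_factor <- max(toy$Number) / max(toy$Weekly_income)

toy_dual <- ggplot(toy, aes(x = Occupation)) +
  geom_col(aes(y = Weekly_income)) +
  # Shrink the worker counts onto the income scale before drawing them.
  geom_line(aes(y = Number / scale_factor, group = 1)) +
  geom_point(aes(y = Number / scale_factor)) +
  # Multiply back inside sec_axis() so the secondary axis shows real counts.
  scale_y_continuous(
    name = "Weekly Income",
    sec.axis = sec_axis(~ . * scale_factor, name = "Number of People")
  )

toy_dual
```

The bars are read against the primary axis and the line against the secondary axis, so both series use the full height of the plot.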

Save your work

Finally, save your visualization!

# Save the image of the plot.
ggsave("first_plot.png", first_plot, width = 8, height = 5)