Introduction
In this tutorial, we will visualize the Top 10 Departments by Avergae Salary using a dataset that includes salary information by department and salary. We will clean the data and use ggplot2 to create visualizations that compare average salaries by departments.
Step 1: Load Required Libraries and Dataset
First, we need to load the required libraries and the dataset. We are using a sample dataset that includes calendar year, full name, job description, department description, salary, and full-time equivalence (FTE).
# Load required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_url <- "https://docs.google.com/spreadsheets/d/1RoiO9bfpbXowprWdZrgtYXG9_WuK3NFemwlvDGdym7E/export?gid=1335284952&format=csv"
salary_data <- read.csv(data_url)
Step 2: Inspect the Dataset
Next, we inspect the dataset to understand its structure. The dataset includes columns such as calendar year, full name, job description, department description, salary, and FTE. This helps us identify if there are any necessary cleaning steps for the Salary column, such as converting it to numeric or normalizing it based on FTE.
head(salary_data)
## Calendar.Year Full.Name Job.Description
## 1 2020 ABBASI, Mohammad Research/Lab Assistant
## 2 2020 ARQUIZA, Jose Maria Reynaldo Apollo Lecturer
## 3 2020 Aaberg, Kelsea Student Support Specialist
## 4 2020 Abadjivor, Enyah Project Manager
## 5 2020 Abayesu, Precious Management Intern
## 6 2020 Abbas, James Assoc Professor
## Department.Description Salary FTE
## 1 Sch Biological & Hlth Sys Engr $35,090.00 100
## 2 Sch Biological & Hlth Sys Engr $71,400.00 100
## 3 Admission Services $36,000.00 100
## 4 CASGE Tempe $64,000.00 100
## 5 Health & Clinical Partnerships $20,800.00 50
## 6 Sch Biological & Hlth Sys Engr $107,195.00 100
We can see that the Salary column contains dollar signs and commas, so we will clean these before proceeding with further analysis.
Step 3: Clean the Data
We will clean the Salary column by removing the dollar signs and commas and then convert it to numeric. We will also normalize salaries by dividing by FTE (full-time equivalence) to account for different working hours.
salary_data <- salary_data %>%
mutate(Salary = gsub("[\\$,]", "", Salary),
Salary = as.numeric(Salary),
FTE = as.numeric(FTE),
Salary_FTE = Salary / FTE) %>%
filter(!is.na(Salary_FTE))
head(salary_data)
## Calendar.Year Full.Name Job.Description
## 1 2020 ABBASI, Mohammad Research/Lab Assistant
## 2 2020 ARQUIZA, Jose Maria Reynaldo Apollo Lecturer
## 3 2020 Aaberg, Kelsea Student Support Specialist
## 4 2020 Abadjivor, Enyah Project Manager
## 5 2020 Abayesu, Precious Management Intern
## 6 2020 Abbas, James Assoc Professor
## Department.Description Salary FTE Salary_FTE
## 1 Sch Biological & Hlth Sys Engr 35090 100 350.90
## 2 Sch Biological & Hlth Sys Engr 71400 100 714.00
## 3 Admission Services 36000 100 360.00
## 4 CASGE Tempe 64000 100 640.00
## 5 Health & Clinical Partnerships 20800 50 416.00
## 6 Sch Biological & Hlth Sys Engr 107195 100 1071.95
Step 4: Visualize the Top 10 Departments by Average Salary
In this step, we aim to visualize the departments with the highest average salaries (normalized by full-time equivalence or FTE). We’ll first calculate the average salary for each department, then filter the top 10 departments with the highest average salary, and finally create a horizontal bar chart for better readability.
top_10_departments <- salary_data %>%
group_by(Department.Description) %>%
summarize(Avg_Salary_FTE = mean(Salary_FTE, na.rm = TRUE)) %>%
top_n(10, Avg_Salary_FTE)
ggplot(top_10_departments, aes(x = reorder(Department.Description, Avg_Salary_FTE), y = Avg_Salary_FTE)) +
geom_bar(stat = "identity", fill = "#56B4E9") +
labs(title = "Top 10 Departments by Average Salary (Normalized by FTE)",
x = "Department", y = "Average Salary (FTE)") +
coord_flip() +
theme_minimal() +
theme(axis.text.y = element_text(size = 8))
If your having trouble always makesure to check your column
names to makesure you are inputting the correct names when creating your
codes
colnames(salary_data)
## [1] "Calendar.Year" "Full.Name" "Job.Description"
## [4] "Department.Description" "Salary" "FTE"
## [7] "Salary_FTE"
Step 5: Further Customization of the Bar Chart
In this step, we’ll explore additional customizations to improve the overall presentation of the graph. We will:
1. Add labels showing the exact salary amounts on the bars.
2. Customize the color palette for a more visually appealing chart.
3. Adjust the title, axis labels, and other aesthetics for clarity.
ggplot(top_10_departments, aes(x = reorder(Department.Description, Avg_Salary_FTE), y = Avg_Salary_FTE)) +
geom_bar(stat = "identity", fill = "#0073C2FF") +
geom_text(aes(label = round(Avg_Salary_FTE, 2)),
hjust = -0.1, size = 3) +
labs(title = "Top 10 Departments by Average Salary (Normalized by FTE)",
subtitle = "Showing departments with the highest average salaries",
x = "Department", y = "Average Salary (FTE)") +
coord_flip() +
theme_minimal() +
theme(axis.text.y = element_text(size = 8),
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10))
**Here is an Explanation of the Customizations made*:**
1.Custom color: We use a custom blue color (#0073C2FF) to give the chart a more professional appearance.
2.Labels on bars: We add the salary values as text labels on the bars using geom_text(). The labels are rounded to two decimal places and positioned slightly outside the bars (hjust = -0.1).
3.Title and subtitle: We include a subtitle to provide additional context, and we bold the title and adjust its size for emphasis.
4.Readability improvements: The text size for department names (axis.text.y) is adjusted to ensure they are readable, and we make slight adjustments to the title and subtitle text sizes.