Today’s focus is on careers and salaries, using a dataset called Salary. The numerical variables are age, years of experience, and salary; the categorical variables are education level, gender, job title, country, and race. To narrow the analysis, we keep only data from the United States and build the USAnalyst data frame from it.
The topic and dataset were chosen out of a personal interest in a career transition into data analysis, which also makes the analysis practically useful.
Please note that the data originates from the Kaggle Repository, with the source stating, “This dataset originates from a combination of publicly available salary surveys, data collected by reputable job search websites, and government labor statistics,” and is released under the CC0: Public Domain license.
# Loading libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
setwd("C:/Users/It's Me Betty/Documents/MC. Data 110") # set working directory
SalaryBASED <- read_csv("Salary.csv", show_col_types = FALSE)
data.frame(sapply(SalaryBASED, class)) # check the class of every variable in one go
data.frame(unique(SalaryBASED$`Job Title`)) # look at every job title
data.frame(unique(SalaryBASED$Country)) # look at every country
Cleaning The Data
After inspecting the data, I kept only entries from the USA whose job titles contain “Analyst.” Using the mutate function, I created a new categorical column that assigns each person to a generation based on age (a proxy for birth year, taking 2023 as the reference year). I also removed the spaces from the column names for coding convenience and corrected one implausible salary value, replacing 350 with 35,000.
USAnalyst <- SalaryBASED |>
  filter(Country == "USA", grepl("Analyst", `Job Title`)) |>
  mutate(Generation = case_when( # create a new column for generations
    between(Age, 11, 26) ~ "Gen Z",
    between(Age, 27, 42) ~ "Millennial",
    between(Age, 43, 58) ~ "Gen X",
    between(Age, 59, 77) ~ "Boomer"
  ))

names(USAnalyst) <- gsub(" ", "", names(USAnalyst)) # remove spaces from the column names
USAnalyst$Salary[13] <- 35000 # changed because a salary of just 350 is not plausible

# max(USAnalyst$Age)
# min(USAnalyst$Age)
# 2023 - 23
# 2023 - 41
# Boomer Generation: 1946–1964. 59-77
# Generation X: 1965–1980. 43-58
# Generation Y (Millennials): 1981–1996. 27-42
# Generation Z: 1997–2010. 11-26
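To make the generation binning easier to follow, here is a minimal sketch on a toy table. The `toy` data frame and its ages are made up for illustration and are not rows from Salary.csv; the age bands are the ones used in the cleaning step.

```r
library(dplyr)

# Hypothetical ages for illustration only (not taken from Salary.csv)
toy <- tibble(Age = c(23, 30, 45, 60, 80))

toy <- toy |>
  mutate(Generation = case_when(
    between(Age, 11, 26) ~ "Gen Z",
    between(Age, 27, 42) ~ "Millennial",
    between(Age, 43, 58) ~ "Gen X",
    between(Age, 59, 77) ~ "Boomer"
  ))

print(toy)
```

Any age that matches none of the bands (80 here, or a fractional age such as 26.5) is left as NA, because `case_when()` returns NA when no condition is true. Checking `sum(is.na(USAnalyst$Generation))` after the cleaning step would show whether that affects the real data.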
A Quick Statistical Analysis + minor Viz
# Choosing these two job titles out of personal interest
GenderStats <- USAnalyst |>
  filter(JobTitle %in% c("Data Analyst", "Financial Analyst")) |>
  group_by(JobTitle, Gender) |>
  summarise(
    MaxSalary = max(Salary),
    MeanSalary = mean(Salary),
    AveExpeience = mean(YearsofExperience),
    .groups = "drop"
  )
print(GenderStats)
# A tibble: 4 × 5
  JobTitle          Gender MaxSalary MeanSalary AveExpeience
  <chr>             <chr>      <dbl>      <dbl>        <dbl>
1 Data Analyst      Female    150000    114500          5.13
2 Data Analyst      Male      195000    126922.         5.73
3 Financial Analyst Female    150000    100000          8
4 Financial Analyst Male      130000     82857.         4.29
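One way to read the table is to express the difference in mean salary as a share of the male mean. The sketch below does that using the rounded means printed above; the `means` table and the `GapShare` column are names introduced here for illustration, and the calculation ignores group sizes and differences in experience.

```r
library(dplyr)
library(tidyr)

# Mean salaries copied (rounded) from the GenderStats output
means <- tibble(
  JobTitle = c("Data Analyst", "Data Analyst",
               "Financial Analyst", "Financial Analyst"),
  Gender = c("Female", "Male", "Female", "Male"),
  MeanSalary = c(114500, 126922, 100000, 82857)
)

gap <- means |>
  pivot_wider(names_from = Gender, values_from = MeanSalary) |>
  # Positive values mean the male average is higher
  mutate(GapShare = (Male - Female) / Male)

print(gap)
```

With these figures the share is positive for Data Analyst and negative for Financial Analyst, so the two titles point in different directions. The groups are small, so this is a rough descriptive comparison rather than an estimate of the pay gap.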
ggplot(USAnalyst, aes(x = JobTitle, y = Salary, fill = Gender)) +
  geom_boxplot() +
  labs(
    title = "Salary Comparison",
    x = NULL, # remove x label due to redundancy
    y = "Salary"
  ) +
  scale_fill_manual(values = c("Male" = "#1313BDB1", "Female" = "#C40E0E9C")) +
  scale_y_continuous(labels = scales::label_dollar(scale = 0.001)) + # show salaries in thousands
  theme_gray() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(hjust = .5),
    legend.position = "top"
  )
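The axis labels in these plots come from `scales::label_dollar(scale = 0.001)`, which returns a function that multiplies each value by 0.001 before formatting it as dollars. A small sketch of what that function does on its own is below; the `suffix = "K"` argument is an addition for illustration and is not used in the plots, and the two salaries are arbitrary example values.

```r
library(scales)

# label_dollar() returns a labelling function rather than labels
to_thousands <- label_dollar(scale = 0.001, suffix = "K")

# Apply it to two example salaries
labels <- to_thousands(c(35000, 150000))
print(labels)
```

Without a suffix or an axis title such as "Salary (in thousands)", a tick reading $150 can be mistaken for 150 dollars, which is why the main plot spells out the unit in its x label.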
The Main Data Viz
ggplot(USAnalyst, aes(x = Salary, y = JobTitle, color = Gender, size = YearsofExperience)) +
  geom_point(alpha = .2) +
  facet_wrap(~Generation) + # scales = "free_x"
  labs(
    title = "Boost Your Net Worth",
    y = NULL, # remove y label due to redundancy
    x = "Salary (in thousands)",
    size = "Years of Experience",
    subtitle = "Request a Raise Backed by Data, Sealed with Confidence!",
    caption = "Data Source: Kaggle Repository by Amirmahdi Aboutalebi (Owner) "
  ) +
  scale_x_continuous(labels = scales::label_dollar(scale = 0.001)) +
  scale_color_manual(values = c("Male" = "#00EDFF", "Female" = "#FFED00"), name = NULL) +
  theme_classic() +
  theme(
    plot.title = element_text(face = "bold", size = 20, hjust = 0.2, color = "#2b964f"),
    plot.subtitle = element_text(hjust = 0.199, vjust = 4, size = 8, color = "#FFFFFF"),
    plot.background = element_rect(fill = "#1C1710"),
    plot.caption = element_text(hjust = -.9, size = 7, color = "#E0E0E0"),
    axis.text.y = element_text(face = "bold", color = "#00cccc"),
    axis.text.x = element_text(color = "#00cccc"),
    axis.title.x = element_text(face = "bold", color = "#E0E0E0"),
    strip.text = element_text(color = "white", face = "bold"),
    strip.background = element_rect(fill = "#1A8A8A"),
    legend.background = element_rect(fill = "#1C1710"),
    legend.position = "top",
    legend.title = element_text(face = "bold", size = 10, color = "#FFFFFF"),
    legend.spacing.x = unit(1.5, "mm"),
    legend.margin = margin(0, 0, -11, 0),
    legend.text = element_text(color = "#FFFFFF"),
    panel.background = element_rect(fill = "#000000", color = "#000000"),
    panel.grid.major.x = element_line(linetype = "longdash", color = "#1a1a1a", linewidth = 0.2),
    panel.grid.minor = element_line(linetype = "dashed", color = "#1a1a1a", linewidth = 0.2)
  ) +
  # the guides() call below changes the default color of the "size" legend icons
  guides(size = guide_legend(override.aes = list(color = "#FFFFFF")))
# Interactive version of the plot above. Removed some code that is invalid for plotly.
abc2 <- ggplot(USAnalyst, aes(x = Salary, y = JobTitle, color = Gender, size = YearsofExperience)) +
  geom_point(alpha = .3) +
  facet_wrap(~Generation, scales = "free_x") +
  labs(
    title = "Boost Your Net Worth",
    y = NULL,
    x = "Salary (in thousands)",
    size = NULL
  ) +
  scale_color_manual(values = c("Male" = "#00EDFF", "Female" = "#FFED00"), name = NULL) +
  scale_x_continuous(labels = scales::label_dollar(scale = 0.001)) +
  theme_classic() +
  theme(
    plot.title = element_text(face = "bold", size = 20, hjust = 0, color = "#2b964f"),
    plot.background = element_rect(fill = "#1C1710"),
    axis.text.y = element_text(face = "bold", color = "#00cccc"),
    axis.text.x = element_text(color = "#00cccc"),
    axis.title.x = element_text(face = "bold", color = "#E0E0E0"),
    strip.text = element_text(color = "white", face = "bold"),
    strip.background = element_rect(fill = "#1A8A8A"),
    legend.background = element_rect(fill = "#1C1710"),
    legend.title = element_text(face = "bold", size = 10, color = "#FFFFFF"),
    legend.text = element_text(color = "#FFFFFF"),
    panel.background = element_rect(fill = "#000000", color = "#000000")
  )
acbply2 <- ggplotly(abc2)
acbply2
A Briefing
Click here - for background article (Link at bottom if hyperlink broken)
The unseen barrier that keeps women from reaching top positions in companies is a consequence of unjust perceptions and limited opportunities. The “gender pay gap” is the disparity in earnings between women and men in comparable roles. This visualization focuses on salaries for job titles containing the term “analyst,” taking years of experience and generation into account. It shows the expected pattern that Millennials generally have more work experience. While no unexpected findings emerged, the visualization did reaffirm the existence of the glass ceiling.
I had hoped for better results with my alluvial plot, which can be seen below.
A “Failed” Attempt
library(alluvial)
Warning: package 'alluvial' was built under R version 4.3.2
library(ggalluvial)
Warning: package 'ggalluvial' was built under R version 4.3.2