Project 2 Data 110

Author

Betty Liu

Intro

Today’s focus is on careers and salaries, using a dataset called Salary. The numerical variables are age, years of experience, and salary; the categorical variables are education level, gender, job title, country, and race. To narrow the analysis, we target data from the United States only, creating the USAnalyst data frame.

I chose this topic and dataset out of a personal interest in transitioning to a data analyst career, which also makes the analysis practically useful to me.

Please note that the data originates from the Kaggle Repository, with the source stating, “This dataset originates from a combination of publicly available salary surveys, data collected by reputable job search websites, and government labor statistics,” and is released under the CC0: Public Domain license.

#Loading possible libraries
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
setwd("C:/Users/It's Me Betty/Documents/MC. Data 110") #set working directory 
SalaryBASED <- read_csv("Salary.csv", show_col_types = FALSE)
data.frame(sapply(SalaryBASED, class)) #Checking class of all variables in one go.
data.frame(unique(SalaryBASED$`Job Title`)) #looking at every Job Title
data.frame(unique(SalaryBASED$Country)) #looking at every Country

Cleaning The Data

After exploring the data, I kept only the entries from the USA whose job titles contain “Analyst.” Using the mutate function, I added a new categorical column that assigns each individual to a generation based on age, which stands in for birth year. I also removed spaces from the column names for coding convenience and corrected a single salary value that was implausibly low.

USAnalyst <- SalaryBASED |>
  filter(Country == "USA", grepl("Analyst", `Job Title`)) |> #refer to the piped column, not SalaryBASED$
  mutate(Generation = case_when( #Creating new column for Generations
    between(Age, 11, 26) ~ "Gen Z",
    between(Age, 27, 42) ~ "Millennial",
    between(Age, 43, 58) ~ "Gen X",
    between(Age, 59, 77) ~ "Boomer")) 
names(USAnalyst) <- gsub(" ","", names(USAnalyst)) #eliminated spaces in head title
USAnalyst$Salary[13] <- 35000 #Corrected: a salary of 350 is implausible, assumed to mean 35,000
#max(USAnalyst$Age)
#min(USAnalyst$Age)
#2023-23
#2023-41
#Boomer Generation:1946–1964. 59-77
#Generation X:1965–1980. 43-58
#Generation Y (Millennials):1981–1996. 27-42
#Generation Z:1997–2010. 11-26
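
The age cut-offs above only hold for one survey year. A minimal sketch of deriving the same labels from birth year instead, assuming the data was collected in 2023 (the source does not state this; the commented arithmetic above only suggests it). The helper `generation_from_age()` is hypothetical, and the Gen Z range ends at 2012 here so that it matches the 11-26 age band used in the code:

```r
library(dplyr)

# Hypothetical helper, not part of the original analysis.
# survey_year = 2023 is an assumption.
generation_from_age <- function(age, survey_year = 2023) {
  birth_year <- survey_year - age
  case_when(
    between(birth_year, 1946, 1964) ~ "Boomer",
    between(birth_year, 1965, 1980) ~ "Gen X",
    between(birth_year, 1981, 1996) ~ "Millennial",
    between(birth_year, 1997, 2012) ~ "Gen Z",
    TRUE ~ NA_character_ # ages outside every band stay unlabelled
  )
}

# Illustrative ages only, not rows from Salary.csv
ages <- c(23, 30, 41, 50, 60)
generations <- generation_from_age(ages)
print(generations)
```

Inside the pipeline this could replace the hard-coded bands with `mutate(Generation = generation_from_age(Age))`.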

A Quick Statistical Analysis + minor Viz

#Choosing these two job titles out of personal interest
GenderStats <- USAnalyst |>
  filter(JobTitle %in% c('Data Analyst', 'Financial Analyst')) |>
  group_by(JobTitle, Gender) |>
  summarise(MaxSalary = max(Salary),
            MeanSalary = mean(Salary),
            AveExperience = mean(YearsofExperience),
            .groups = 'drop')

print(GenderStats)
# A tibble: 4 × 5
  JobTitle          Gender MaxSalary MeanSalary AveExperience
  <chr>             <chr>      <dbl>      <dbl>         <dbl>
1 Data Analyst      Female    150000    114500           5.13
2 Data Analyst      Male      195000    126922.          5.73
3 Financial Analyst Female    150000    100000           8   
4 Financial Analyst Male      130000     82857.          4.29
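
To make the gap in the table easier to read, the mean salaries can be turned into a female-to-male ratio. This is only a sketch: the means are copied (rounded) from the printed summary, and `gender_means`, `gap`, and `FemaleToMale` are hypothetical names that do not appear in the original analysis.

```r
library(dplyr)
library(tidyr)

# Mean salaries copied (rounded) from the printed summary above.
gender_means <- tibble(
  JobTitle   = c("Data Analyst", "Data Analyst",
                 "Financial Analyst", "Financial Analyst"),
  Gender     = c("Female", "Male", "Female", "Male"),
  MeanSalary = c(114500, 126922, 100000, 82857)
)

# One row per job title, one column per gender, then the ratio.
# Below 1: women earn less on average. Above 1: women earn more.
gap <- gender_means |>
  pivot_wider(names_from = Gender, values_from = MeanSalary) |>
  mutate(FemaleToMale = Female / Male)

print(gap)
```

On these figures the ratio is below 1 for Data Analyst and above 1 for Financial Analyst, so the direction of the gap differs between the two titles. The group sizes are not shown in the summary, so the ratios should be read with caution.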
ggplot(USAnalyst, aes(x = JobTitle, y = Salary, fill = Gender)) +
  geom_boxplot() +
  labs(title = "Salary Comparison",
       x = NULL,  #Remove X label due to redundancy
       y = "Salary (in thousands)") +
  scale_fill_manual(values = c("Male" = "#1313BDB1", "Female" = "#C40E0E9C")) +
  scale_y_continuous(labels = scales::label_dollar(scale = 0.001)) + #show dollars in thousands
  theme_gray() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(hjust = .5),
        legend.position = "top")
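
Because `label_dollar(scale = 0.001)` multiplies each value by 0.001 before formatting, the axis reads in thousands of dollars. A small sketch of what the labeller returns, using two illustrative salaries; the "K" suffix variant is my addition and is not used in the original plot:

```r
library(scales)

# scale = 0.001 converts dollars to thousands before formatting.
in_thousands <- label_dollar(scale = 0.001)

# Optional: a suffix makes the unit explicit on the axis itself.
in_thousands_k <- label_dollar(scale = 0.001, suffix = "K")

salaries <- c(35000, 150000) # illustrative values
plain_labels <- in_thousands(salaries)
suffixed_labels <- in_thousands_k(salaries)

print(plain_labels)
print(suffixed_labels)
```

With the suffix, the axis title no longer needs to say "in thousands".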

The Main Data Viz

ggplot(USAnalyst, aes(x = Salary , y = JobTitle , color = Gender, size = YearsofExperience )) +
  
  geom_point(alpha = .2)+
  facet_wrap(~Generation) + #optionally: scales = "free_x"
  labs(title = "Boost Your Net Worth",
        y = NULL, #Remove Y label due to redundancy
        x= "Salary (in thousands)",
       size = "Years of Experience",
       subtitle = "Request a Raise Backed by Data, Sealed with Confidence!",
       caption = "Data Source: Kaggle Repository by Amirmahdi Aboutalebi (Owner)     "    ) +
  
  scale_x_continuous(labels = scales::label_dollar(scale = 0.001)) + 
  scale_color_manual(values = c("Male" = "#00EDFF", "Female" = "#FFED00"), name = NULL) +
 
  theme_classic()+ 
  theme(
        plot.title = element_text(face = "bold", size = 20, hjust = 0.2, color = "#2b964f"),
        plot.subtitle = element_text(hjust = 0.199, vjust = 4, size = 8, color = "#FFFFFF"),
        plot.background = element_rect(fill = "#1C1710"),
        plot.caption = element_text(hjust = -.9, size = 7, color = "#E0E0E0"), 
        
        axis.text.y = element_text(face = "bold", color = "#00cccc"),
        axis.text.x = element_text(color = "#00cccc"),
        axis.title.x = element_text(face = "bold", color = "#E0E0E0"),
      
        strip.text = element_text(color = "white", face = "bold"),
        strip.background = element_rect(fill = "#1A8A8A"),
        
        legend.background = element_rect(fill = "#1C1710"),
        legend.position = "top", 
        legend.title = element_text(face = "bold", size = 10, color = "#FFFFFF"),
        legend.spacing.x = unit(1.5, "mm"),
        legend.margin = margin(0, 0, -11, 0),
        legend.text = element_text(color = "#FFFFFF"),
        
        panel.background = element_rect(fill = "#000000", color = "#000000"),
        panel.grid.major.x = element_line( 
          linetype = "longdash", color = "#1a1a1a", linewidth = 0.2),
        panel.grid.minor = element_line(
          linetype = "dashed",color = "#1a1a1a", linewidth = 0.2)
        
        ) +
guides(size = guide_legend(override.aes = list(color = "#FFFFFF")))

# The guides() call directly above changes the default color of the "size" legend icons.
# Interactive version of the plot above. Code that plotly does not support has been removed.
abc2<- ggplot(USAnalyst, aes(x = Salary , y = JobTitle , color = Gender, size = YearsofExperience )) +
  
  geom_point(alpha = .3)+
  facet_wrap(~Generation, scales = "free_x")+
  
  labs(title = "Boost Your Net Worth",
        y = NULL,
        x= "Salary (in thousands)",
       size = NULL) +
   
  scale_color_manual(values = c("Male" = "#00EDFF", "Female" = "#FFED00"), name = NULL) +
  scale_x_continuous(labels = scales::label_dollar(scale = 0.001)) +
  
  theme_classic()+
  theme(
        plot.title = element_text(face = "bold", size = 20, hjust = 0, color = "#2b964f"),
        plot.background = element_rect(fill = "#1C1710"),
       
        axis.text.y = element_text(face = "bold", color = "#00cccc"),
        axis.text.x = element_text(color = "#00cccc"),
        axis.title.x = element_text(face = "bold", color = "#E0E0E0"),
      
        strip.text = element_text(color = "white", face = "bold"),
        strip.background = element_rect(fill = "#1A8A8A"),
        
        legend.background = element_rect(fill = "#1C1710"),
        legend.title = element_text(face = "bold", size = 10, color = "#FFFFFF"),
        legend.text = element_text(color = "#FFFFFF"),
        
        panel.background = element_rect(fill = "#000000", color = "#000000")
        )

acbply2 <- ggplotly(abc2)
acbply2
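
The hover labels of the interactive plot can be customised through the `tooltip` argument of `ggplotly()`. A minimal sketch, assuming a custom `text` aesthetic is wanted; the `toy` data frame holds made-up rows for illustration only, and `hover_plot` is a hypothetical name:

```r
library(ggplot2)
library(plotly)

# Toy rows for illustration only; these are not values from Salary.csv.
toy <- data.frame(
  Salary = c(60000, 85000, 120000),
  JobTitle = c("Data Analyst", "Financial Analyst", "Data Analyst"),
  Gender = c("Female", "Male", "Male"),
  YearsofExperience = c(1, 4, 9)
)

# geom_point() does not draw the `text` aesthetic itself,
# but ggplotly() can use it as the hover label.
p <- ggplot(toy, aes(x = Salary, y = JobTitle, color = Gender,
                     size = YearsofExperience,
                     text = paste0(JobTitle,
                                   "<br>Salary: $", Salary,
                                   "<br>Experience: ", YearsofExperience, " yrs"))) +
  geom_point(alpha = .3)

# tooltip = "text" shows only the custom label on hover.
hover_plot <- ggplotly(p, tooltip = "text")
# Print hover_plot to render the widget.
```

The same `text` mapping could be added to `abc2` so that each point shows its job title, salary, and experience on hover.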

A Briefing

Click here - for background article (Link at bottom if hyperlink broken)

The unseen barrier that keeps women from reaching top positions in companies, often called the glass ceiling, is a consequence of unjust perceptions and limited opportunities. The “gender pay gap” is the disparity in earnings between women and men in comparable roles. This visualization focuses on salaries for job titles containing the term “analyst,” broken down by years of experience and generation. It shows the expected pattern that Millennials generally have more work experience than Gen Z. While no unexpected findings emerged, the visualization did reaffirm the existence of the glass ceiling.

I had hoped for better results with my alluvial plot, which can be seen below.

A “Failed” Attempt

library(alluvial)
Warning: package 'alluvial' was built under R version 4.3.2
library(ggalluvial)
Warning: package 'ggalluvial' was built under R version 4.3.2
ggplot(as.data.frame(USAnalyst),
       aes( axis1 = Generation, axis2 = YearsofExperience, axis3 = JobTitle, axis4 = Salary)) +
  scale_x_discrete(limits = c("Generation", "Years of Experience", "Job Title", "Salary"), expand = c(0.1, 0.01)) +
  
  geom_alluvium(aes(fill = Gender), width = .4) +
  geom_stratum(width = .2, fill = "black") +
  #geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  geom_label(stat = "stratum", aes(label = after_stat(stratum)), size = 2.5) +
  ggtitle("Careers and Salary")+
  
  theme_classic()+
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank(),
        plot.title =  element_text(hjust = .5),
        legend.position = "bottom")
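
One possible reason the alluvial plot is hard to read (this is my assumption, not something the analysis states) is that YearsofExperience and Salary are numeric, so every distinct value becomes its own stratum. A sketch of binning them with base R's `cut()` before plotting; the values, break points, and labels below are all hypothetical:

```r
# Toy values for illustration only; not taken from Salary.csv.
salary <- c(35000, 82000, 126000, 195000)
experience <- c(1, 4, 8, 15)

# Hypothetical break points; adjust them to the real distribution.
# Intervals are closed on the right, e.g. (75000, 125000].
salary_band <- cut(salary,
                   breaks = c(0, 75000, 125000, Inf),
                   labels = c("Up to $75K", "$75K-$125K", "Over $125K"))

experience_band <- cut(experience,
                       breaks = c(0, 3, 7, Inf),
                       labels = c("0-3 yrs", "4-7 yrs", "8+ yrs"),
                       include.lowest = TRUE) # keep 0 years in the first band

print(table(salary_band))
print(table(experience_band))
```

Binned columns like these could then be mapped to axis2 and axis4 in place of the raw values, which would leave a handful of strata per axis.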

Link to the article, in case the “Click here” hyperlink above is not working

https://go-gale-com.montgomerycollege.idm.oclc.org/ps/retrieve.do?tabID=T002&resultListType=RESULT_LIST&searchResultsType=SingleTab&retrievalId=263d683f-3947-4ac7a24b28ad57d367e0&hitCount=151&searchType=BasicSearchForm&currentPosition=3&docId=GALE%7CA472268220&docType=Report&sort=Relevance&contentSegment=ZONE-MOD1&prodId=AONE&pageNum=1&contentSet=GALE%7CA472268220&searchId=R2&userGroupName=rock77357&inPS=true