Introduction

This report explores a dataset of tech salaries, obtained from Kaggle, that offers a glimpse into compensation trends within the tech industry. It includes attributes such as annual base pay, signing bonus, annual bonus, stock value bonus, job title, employer name, and location country. The goal is to identify the factors that influence compensation and to see how pay varies across different tech roles.

Here are three main stories we’re going to explore:

  1. How Pay Relates to Job Titles: We want to see if there’s a connection between someone’s base salary and how much they make in total, and whether certain job titles earn more overall.

  2. Looking at Pay Distributions: We’re curious about how salaries are spread out. Are most people making similar amounts, or is there a big range? This can tell us a lot about fairness and equality in pay.

  3. Comparing Pay Across Jobs: We’ll compare how much people make on average in different tech jobs. This can help us see if some roles are more lucrative than others.

Load Packages

# Load necessary R packages for data manipulation and visualization
library(tidyverse)  # Loads dplyr (data manipulation) and ggplot2 (plotting), among others
library(here)       # Builds file paths relative to the project root
library(janitor)    # Provides clean_names() for tidying column names
# Set option to prevent scientific notation for large numbers
options(scipen = 999)
# Read the salaries dataset from a CSV file (via kaggle.com)
salaries <- read.csv(here("r_data", "salaries.csv"))
head(salaries)
##   index salary_id      employer_name     location_name location_state
## 1     0         1             opower san francisco, ca             CA
## 2     1         3            walmart   bentonville, ar             AR
## 3     2         4 vertical knowledge     cleveland, oh             OH
## 4     3         6             netapp           waltham               
## 5     4        12              apple         cupertino               
## 6     5        14             casino    eastern oregon             OR
##   location_country location_latitude location_longitude         job_title
## 1               US             37.77            -122.41  systems engineer
## 2               US             36.36             -94.20  senior developer
## 3               US             41.47             -81.67 software engineer
## 4                                 NA                 NA               mts
## 5                                 NA                 NA software engineer
## 6               US             38.00             -97.00     it technician
##   job_title_category job_title_rank total_experience_years
## 1        Engineering                                    13
## 2           Software         Senior                     15
## 3           Software                                     4
## 4              Other                                     4
## 5           Software                                     4
## 6              Other                                     5
##   employer_experience_years annual_base_pay signing_bonus annual_bonus
## 1                       2.0          125000          5000            0
## 2                       8.0           65000            NA         5000
## 3                       1.0           86000          5000         6000
## 4                       0.0          105000          5000         8500
## 5                       3.0          110000          5000         7000
## 6                       1.5           40000             0          500
##   stock_value_bonus         comments  submitted_at
## 1       5000 shares Don't work here. 3/21/16 12:58
## 2             3,000                  3/21/16 12:58
## 3                 0                  3/21/16 12:59
## 4                 0                  3/21/16 13:00
## 5            150000                  3/21/16 13:02
## 6                 0                  3/21/16 13:03

Data Preparation

Before the analysis, the column names are cleaned. They are converted to lowercase snake_case for consistency, and the columns ‘location_country’, ‘location_state’, ‘location_latitude’, and ‘location_longitude’ are renamed to ‘country’, ‘state’, ‘latitude’, and ‘longitude’ respectively for clarity.

# Clean column names: lowercase snake_case, then shorten the location columns
cleansalaries <- salaries %>%
  clean_names() %>%
  rename(latitude = location_latitude,
         longitude = location_longitude,
         country = location_country,
         state = location_state)
names(cleansalaries)
##  [1] "index"                     "salary_id"                
##  [3] "employer_name"             "location_name"            
##  [5] "state"                     "country"                  
##  [7] "latitude"                  "longitude"                
##  [9] "job_title"                 "job_title_category"       
## [11] "job_title_rank"            "total_experience_years"   
## [13] "employer_experience_years" "annual_base_pay"          
## [15] "signing_bonus"             "annual_bonus"             
## [17] "stock_value_bonus"         "comments"                 
## [19] "submitted_at"

Total Compensation Calculation

In this section, the dataset is augmented with a new column named ‘total_compensation’. This column is the sum of annual base pay, signing bonus, stock value bonus, and annual bonus for each entry, which gives a fuller picture of employees’ overall earnings than base pay alone. Entries that cannot be converted to a number (for example, a stock value recorded as “5000 shares”) become missing values, and rows with missing values are dropped. Finally, only entries with an annual base pay below $200,000 are kept, so the analysis focuses on the typical salary range rather than on the highest-paying positions.

# Convert the pay columns to numeric type and calculate total compensation
cleansalaries <- cleansalaries %>%
  # Entries that are not plain numbers (e.g. "5000 shares", "3,000") become NA;
  # as.character() guards against the columns having been read as factors
  mutate(across(c(annual_base_pay, signing_bonus, stock_value_bonus, annual_bonus),
                ~ as.double(as.character(.x)))) %>%
  mutate(total_compensation = annual_base_pay + signing_bonus + stock_value_bonus + annual_bonus) %>%
  # Drop rows with an NA in any column, including latitude and longitude
  na.omit() %>%
  # Keep entries with an annual base pay below $200,000
  filter(annual_base_pay < 200000) %>%
  select(index, salary_id, employer_name, job_title, job_title_category, latitude, longitude, country, state, annual_base_pay, signing_bonus, stock_value_bonus, total_compensation) %>%
  # Order the data set by total_compensation in descending order
  arrange(desc(total_compensation))
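
To make the arithmetic concrete, here is a minimal sketch that applies the same conversion and sum to the six rows shown in the head(salaries) output above. It uses base R only so that it runs without the tidyverse. The data frame name `example_rows` is introduced here for illustration, and the values are typed in by hand rather than read from the CSV.

```r
# Six rows copied by hand from the head(salaries) output above.
# `example_rows` is an illustrative name, not part of the original analysis.
example_rows <- data.frame(
  annual_base_pay   = c(125000, 65000, 86000, 105000, 110000, 40000),
  signing_bonus     = c(5000, NA, 5000, 5000, 5000, 0),
  annual_bonus      = c(0, 5000, 6000, 8500, 7000, 500),
  stock_value_bonus = c("5000 shares", "3,000", "0", "0", "150000", "0"),
  stringsAsFactors  = FALSE
)

# as.double() returns NA (with a warning) for strings that are not plain numbers
stock_numeric <- suppressWarnings(as.double(example_rows$stock_value_bonus))

# Any NA component makes the whole sum NA
example_rows$total_compensation <- example_rows$annual_base_pay +
  example_rows$signing_bonus + stock_numeric + example_rows$annual_bonus

print(example_rows$total_compensation)
```

The first two rows come out as NA, because their stock values are not plain numbers and the second row also lacks a signing bonus, so na.omit() would drop them. The remaining rows sum to 97000, 118500, 272000, and 40500. In the full pipeline, the fourth and fifth rows would be dropped as well, since their latitude and longitude are NA.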

Analysis

Exploring Relationships

To begin our exploration, let’s visualize the relationship between annual base pay and total compensation. The aim is to understand how these two variables correlate across different job titles or categories within the tech industry.

# Plotting the relationship between annual base pay and total compensation
relationship_plot <- cleansalaries %>%
  ggplot(aes(x = annual_base_pay, y = total_compensation, color = job_title_category)) +
  geom_point() +
  labs(title = "Relationship Between Annual Base Pay and Total Compensation",
       x = "Annual Base Pay ($)",
       y = "Total Compensation ($)",
       color = "Job Title Category",
       caption = "Figure 1: Scatter plot showing the relationship between annual base pay and total compensation,\nwith color indicating job title category.") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))

relationship_plot

Figure 1 shows annual base pay plotted against total compensation, color-coded by job title category. The two variables are positively correlated: as annual base pay increases, total compensation also tends to increase. Part of this is by construction, since base pay is one of the components summed into total compensation. Software jobs stand out, with total compensation rising notably as annual base pay increases. This can be attributed to large bonuses and stock awards, which contribute significantly to the total compensation package. In contrast, the other job categories follow more linear patterns, reflecting differences in compensation structure and bonus distribution.
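
One way to check the claim about bonuses and stock is to compute the share of total compensation that does not come from base pay. The sketch below is only an illustration of that calculation: the helper `non_base_share` and the data frame `demo` are hypothetical names, and the four rows are the ones from the head(salaries) output whose pay columns are all plain numbers, with totals summed by hand. Four rows are far too few to support any conclusion.

```r
# Hypothetical helper (not from the original analysis): fraction of total
# compensation that comes from bonuses and stock rather than base pay.
non_base_share <- function(base_pay, total_comp) {
  (total_comp - base_pay) / total_comp
}

# Four rows from the head(salaries) output; totals are the sum of the
# base pay, signing bonus, annual bonus, and stock value shown there.
demo <- data.frame(
  job_title_category = c("Software", "Other", "Software", "Other"),
  annual_base_pay    = c(86000, 105000, 110000, 40000),
  total_compensation = c(97000, 118500, 272000, 40500),
  stringsAsFactors   = FALSE
)
demo$non_base_share <- non_base_share(demo$annual_base_pay, demo$total_compensation)

# Average non-base share per job title category
share_by_category <- tapply(demo$non_base_share, demo$job_title_category, mean)
print(round(share_by_category, 3))
```

On the cleaned data, the same helper could be applied to cleansalaries$annual_base_pay and cleansalaries$total_compensation and then averaged by job_title_category.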

Distribution Analysis

Next, let’s delve into the distribution of annual base pay and total compensation within our dataset. This visualization provides insights into how these two variables are spread out and their relationship with each other.

# Plotting the distribution of annual base pay and total compensation
distribution_plot <- cleansalaries %>%
  ggplot(aes(x = annual_base_pay, y = total_compensation)) +
  geom_density_2d() +
  labs(title = "Distribution of Annual Base Pay and Total Compensation",
       x = "Annual Base Pay ($)",
       y = "Total Compensation ($)",
       caption = "Figure 2: Distribution of Annual Base Pay and Total Compensation") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))

distribution_plot

The density plot in Figure 2 shows the typical salary ranges within the dataset. Most individuals have an annual base pay between $50,000 and $150,000, while total compensation reaches up to $200,000. In other words, bonuses and other compensation components lift total compensation above base pay. These ranges give both employers and employees in the tech industry a reference point for typical pay.
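
A range read off a contour plot can be backed up with a direct count. The following is a minimal sketch, assuming a hypothetical helper `share_in_range` that is not part of the original analysis. For a self-contained demonstration it uses the base pay of the six rows shown in the head(salaries) output, not the full dataset.

```r
# Hypothetical helper (not from the original analysis): fraction of values
# that fall inside [lower, upper], ignoring missing values.
share_in_range <- function(x, lower, upper) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}

# Base pay of the six rows shown in the head(salaries) output
base_pay <- c(125000, 65000, 86000, 105000, 110000, 40000)

in_range <- share_in_range(base_pay, 50000, 150000)
print(in_range)

# Quartiles give a complementary view of where the bulk of the values sit
print(quantile(base_pay, probs = c(0.25, 0.5, 0.75)))
```

On the cleaned data, the equivalent call would be share_in_range(cleansalaries$annual_base_pay, 50000, 150000).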

Comparative Analysis

Furthermore, average base pay and total compensation across different job categories are compared to identify trends and disparities within the tech industry. This comparison is visualized using a scatterplot.

# Comparing average base pay and total compensation across different job categories
comparison_plot <- cleansalaries %>%
  group_by(job_title_category) %>%
  summarise(avg_base_pay = mean(annual_base_pay), avg_total_compensation = mean(total_compensation)) %>%
  ggplot(aes(x = avg_base_pay, y = avg_total_compensation, color = job_title_category)) +
  geom_point() +
  geom_text(aes(label = job_title_category), vjust = -0.5, hjust = 0.5, size = 3) +
  labs(title = "Average Base Pay vs. Total Compensation by Job Category",
       x = "Average Annual Base Pay ($)",
       y = "Average Total Compensation ($)",
       color = "Job Title Category",
       caption = "Figure 3: Scatterplot comparing the average base pay and total compensation across\ndifferent job categories within the tech industry.") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))
  
comparison_plot

Figure 3 visualizes this comparison, with each point representing one job title category. The x-axis shows the average annual base pay, the y-axis shows the average total compensation, and each point is colored and labeled by category.

Notable disparities emerge. Applied science has the highest average total compensation, followed by software and engineering. Web and other job categories have the lowest. This suggests that individuals in applied science roles typically receive larger compensation packages than those in software, engineering, or other roles within the tech industry.

Next, we examine the distribution of total compensation across different job title categories within the tech industry using a box plot visualization. The box plot provides insights into the central tendency and spread of total compensation values for each job title category.

# Draw box plot of Total Compensation by Job Title Category
box_plot <- cleansalaries %>%
  ggplot(aes(x = job_title_category, y = total_compensation, fill = job_title_category)) +
  geom_boxplot() +
  geom_hline(yintercept = median(cleansalaries$total_compensation), linetype = "dashed", color = "black") + # Dashed line: overall median across all categories
  labs(title = "Box Plot of Total Compensation by Job Title Category",
       x = "Job Title Category",
       y = "Total Compensation ($)",
       caption = "Figure 4: Box plot showing the distribution of total compensation across job title categories.") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0, margin = margin(t = 10, r = 0, b = 0, l = 0)))
box_plot
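
The numbers behind each box can also be tabulated directly. This is a minimal sketch in base R, assuming a hypothetical helper `spread_by_group` that is not part of the original analysis. The demonstration values are the four totals that can be computed from the head(salaries) output, so the result only illustrates the calculation.

```r
# Hypothetical helper (not from the original analysis): median and
# interquartile range of `values` within each level of `groups`.
spread_by_group <- function(values, groups) {
  med <- tapply(values, groups, median)
  iqr <- tapply(values, groups, IQR)
  data.frame(category = names(med),
             median   = as.numeric(med),
             iqr      = as.numeric(iqr),
             stringsAsFactors = FALSE)
}

# Totals summed by hand from the head(salaries) output
total_comp <- c(97000, 118500, 272000, 40500)
category   <- c("Software", "Other", "Software", "Other")

spread_table <- spread_by_group(total_comp, category)
print(spread_table)
```

On the cleaned data, the call would be spread_by_group(cleansalaries$total_compensation, cleansalaries$job_title_category). The median corresponds to the line inside each box and the IQR to the height of the box.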

Conclusion

Through these visualizations, we explored the relationship between annual base pay and total compensation, examined their distributions, and compared average pay across different job categories. These findings can inform salary negotiations, talent acquisition strategies, and resource allocation within the tech industry.

That concludes our exploration of tech salaries data. Stay tuned for more data-driven insights and analyses in the future!