Introduction

Hi everyone! It’s been quite a while since the last time I shared my R code works. I have been through a lot of unexpected things in my life forcing me to take a break on playing around with datasets :D. I guess now it is the time to back at it again and a bit material revision seems necessary just to have a review in data processing (as I haven’t been using this language for 2 years!)

I would like to practice my R language (as per usual) by doing exploratory data analysis (EDA) using a Job Salaries Data that I got from kaggle (this is the link to the dataset https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries). All of the data visualisation instruments including charts and graphs will be done using ggplot package, so I can show you how powerful this library is in helping us producing beautiful yet attractive charts.

Let’s go through the dataset! Happy data-playing!

Data Preparation

Import Library

First thing first, import all libraries that might be needed

library(tidyverse)
library(ggplot2)
library(plotly)
library(reshape2)

Import Data

Obviously don’t forget about the data

df <-read.csv(file ="ds_salaries.csv")
df<-df %>% 
  select(-X, -salary_currency,-salary) %>% 
  mutate_at(c("job_title","employment_type","experience_level","company_location","company_size","employee_residence","remote_ratio"),as.factor)
str(df)
## 'data.frame':    607 obs. of  9 variables:
##  $ work_year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : Factor w/ 4 levels "EN","EX","MI",..: 3 4 4 3 4 1 4 3 3 4 ...
##  $ employment_type   : Factor w/ 4 levels "CT","FL","FT",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ job_title         : Factor w/ 50 levels "3D Computer Vision Researcher",..: 23 41 8 48 38 13 35 23 9 34 ...
##  $ salary_in_usd     : int  79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
##  $ employee_residence: Factor w/ 57 levels "AE","AR","AT",..: 15 33 21 24 56 56 56 26 56 42 ...
##  $ remote_ratio      : Factor w/ 3 levels "0","50","100": 1 1 2 1 2 3 3 2 3 2 ...
##  $ company_location  : Factor w/ 50 levels "AE","AS","AT",..: 13 30 19 21 49 49 49 23 49 39 ...
##  $ company_size      : Factor w/ 3 levels "L","M","S": 1 3 2 3 1 1 3 1 1 3 ...

We’ve got 9 columns left with:
1. Integer type columns: “work_year”,“salary_in_usd”,“remote_ratio”
2. Categorical type columns : “experience_level”,“job_title”,“employment_type”

NA value check

Let’s check if we’ve got some NA here

anyNA(df )
## [1] FALSE

Nope, no missing values - everything is just fine. Sound

Data Exploratory on Each Columns

I will split the EDA of this dataset in to two parts: categorical columns data and numerical columns data.
Let’s begin with categorical cloumns first

1. Categorical Columns

We will change the experience_level values as follow:
1. MI : Mid-Level
2. SE : Senior-Level
3. EN : Entry-Level
4. EX : Executive-Level  

Feel free to switch around the tab!

df<-df %>% 
  mutate(experience_level = as.factor(ifelse(experience_level == "MI","Mid-Level",ifelse(experience_level == "SE","Senior-Level",ifelse(experience_level == "EN","Entry-Level", "Executive-Level")))))

head(df)
##   work_year experience_level employment_type                  job_title
## 1      2020        Mid-Level              FT             Data Scientist
## 2      2020     Senior-Level              FT Machine Learning Scientist
## 3      2020     Senior-Level              FT          Big Data Engineer
## 4      2020        Mid-Level              FT       Product Data Analyst
## 5      2020     Senior-Level              FT  Machine Learning Engineer
## 6      2020      Entry-Level              FT               Data Analyst
##   salary_in_usd employee_residence remote_ratio company_location company_size
## 1         79833                 DE            0               DE            L
## 2        260000                 JP            0               JP            S
## 3        109024                 GB           50               GB            M
## 4         20000                 HN            0               HN            S
## 5        150000                 US           50               US            L
## 6         72000                 US          100               US            L

Well-Done!

Experience Levels

I will show the experience level distribution in a tree map

library(treemapify) #for creating treemap

p1<-df %>% 
  select(experience_level) %>% 
  group_by(experience_level) %>% 
  summarise(freq = n()) %>%
  mutate(percentage = as.integer(100*freq/(213+26+280+88))) %>% 
  ggplot(mapping = aes(fill = experience_level, area = freq, label = paste0(experience_level,"\n",freq,"\n",percentage,"%"))) +
  geom_treemap(layout = "scol") +
  geom_treemap_text(colour = "white", place = "centre",size = 14, )+
  scale_fill_brewer(palette = "Dark2")+
  theme(legend.position = "none") 

p1

Job Titles

First, let’s count how many job titles are in the datasets.

cat ("number of job title :",length(unique(df$job_title)))
## number of job title : 50

We’ve got 50 unique job titles in the job_titles column, now we’d like to know the top 10 job titles.

library(glue)
p2<-df %>% 
  count(job_title) %>% 
  rename(freq=n) %>% 
  head(10) %>% 
  ggplot(mapping = aes(x = reorder(job_title, freq), y=freq))+
  geom_col(mapping = aes(fill = job_title, text = glue("Job Title: {job_title} 
                                                      Count : {freq}")))+
  coord_flip()+
  theme_minimal()+
  labs(title = "Top 10 Job Titles")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white", face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))


ggplotly(p2, tooltip = "text")  

Remarks

As we can see on the chart above, Big Data Engineer stands on top of the set with 8 titles.

Employment Type

There are four different employment types:
1. FT : “Full-Time”
2. PT : “Part-Time”
3. CT : “Contract”
4. FL : “Freelance”

I will explore the size distribution of these employment types

p3<-df %>% 
  count(employment_type) %>% 
  rename(freq=n) %>% 
  ggplot(mapping = aes(x = reorder(employment_type, -freq), y=freq))+
  geom_col(mapping = aes(fill = employment_type, text = glue("Employment Type: {employment_type} 
                                                      Count : {freq}")))+
  theme_minimal()+
  labs(title = "Employment Type Distribution")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))


ggplotly(p3, tooltip = "text") 

Remarks
It appears that most of the employees are full-time workers

Employee Residence and Location

In this section, I will show you the where most of the employees live across the globe. A world map might be a good instrument to visualise this kind of data, as we can see exactly the distribution of employees’ residence on the map.

library(countrycode) #this library is for code translating from country name to iso2c and vice versa
library(maps) 
df_residence<-df %>% 
  mutate(region = as.character(countrycode(df$employee_residence,"iso2c","country.name"))) %>% 
  count(region) %>% 
  rename(freq = n) 

df_residence$region<-replace(df_residence$region, df_residence$region =="United States","USA") #create residence dataframe

Create a new dataframe and store it into mapdata

mapdata<-map_data("world") %>% 
  mutate(region = as.factor(region)) #create mapdata using maps package. We want world as the attribute value

mapdata_residence<-mapdata %>% #join the world map coordinate with residence dataframe 
  left_join(df_residence,by = "region") %>% 
  select(-subregion)

#remove the NA data from freq
mapdata_new <- mapdata_residence %>% 
  filter(!is.na(mapdata_residence$freq))

Let’s plot the map!

Here I’m using ggplot2 package to build the map by plotting longitudes and latitudes point. Let me show you how it works

map_residence<-mapdata_residence %>% 
  ggplot(mapping = aes(x = long, y=lat, group = group))+
  geom_polygon(aes(fill = freq, text = glue("{region}
                                            {freq} employees")), color = "black")+
  scale_fill_gradient(name ="employees", low = "lightblue", high = "blue", na.value = "grey50")+ #use geom_polygon to plot all the coordinates so it aligns like a map 
  labs(title = "Employee Residence Distribution")+
  theme_minimal()+
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid = element_blank(),
        plot.background = element_rect(fill = "black"),
        panel.background = element_rect(fill="black"),
        legend.text = element_text(colour = "white"),
        legend.title = element_text(colour = "white"),
        plot.title = element_text(colour = "white", face = "italic"),
        rect = element_blank(),
        aspect.ratio = 1)

ggplotly(map_residence, tooltip = "text")

Remarks
As we can see on the map most of the employees live in the US. Brilliant!

The bar plot below gives a clearer comparison in regards to employee_residence:

p4<-df_residence %>% 
  group_by(region) %>% 
  arrange(desc(freq)) %>% 
  head(10) %>% 
  ggplot(mapping = aes(x = reorder(region,-freq), y = freq)) +
  geom_col(mapping = aes(fill = region, text = glue("Region: {region} 
                                                      Count : {freq}")))+
  theme_minimal()+
  labs(title = "Top 10 Employee Residence")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white", face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_line(colour = "grey50"),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))


ggplotly(p4, tooltip = "text")  

Remarks
From the chart above, we can conclude that USA is the region with the highest number of employees followed by UK, India, Canada, and Germany in the top 5.

Now I am going to compare the company location and employee residence

#create a new dataframe containing company location and its frequency 
df_location<-df %>% 
  mutate(region = as.character(countrycode(df$company_location,"iso2c","country.name"))) %>% 
  count(region) %>% 
  rename(freq = n) 

df_location$region<-replace(df_location$region, df_location$region =="United States","USA")

I will alter the df_residence dataframe by renaming the frequency column, so I can create a merged employee_residence and company_location dataframe

table_res<-df_residence %>% 
  mutate(region = as.factor(region)) %>%
  rename(freq_residence = freq) %>% 
  group_by(region) %>% 
  arrange(desc(freq_residence)) %>% 
  head(10)

Let’s merge the dataframe and create the plot!

p5<-df_location %>% 
  mutate(region = as.factor(region)) %>%
  rename(freq_loc = freq) %>% 
  group_by(region) %>% 
  arrange(desc(freq_loc)) %>% 
  head(10) %>% 
  left_join(table_res, by = "region") %>% #merging dataframe
  pivot_longer(cols = c("freq_loc","freq_residence")) %>% #create a long type dataframe
  
  
  ggplot(mapping = aes(x = reorder (region,-value), y = value)) + #plot the daaframe 
  geom_col(mapping = aes(fill = name, text = glue("Region : {region}
                                                   Count  : {value}")),position = "dodge") + #use dodge type position 
  theme_minimal()+
  labs(title = "Top 10 Employee Residence and Company Location")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "top", 
        plot.title = element_text(colour = "white", face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_line(colour = "black"),
        panel.grid.major.x = element_line(colour = "black"),
        panel.grid.minor.y = element_blank(),
        axis.line.x = element_line(colour = "white"),
        legend.text = element_text(colour = "white"))+
  scale_fill_discrete(aesthetics = "fill", labels = c("Company Location", "Employee Residence"))

p5

Remarks
The employess looked mostly live in the States- not a bad place thouh

Company size

Create a bar plot to show company size comparison - it’s a simple plot anyway

p6<-df %>% 
  count(company_size) %>% 
  rename(freq = n) %>% 
  ggplot(mapping = aes(x = company_size, y = freq)) +
  geom_col(mapping = aes(fill = company_size, text = glue("Company size :{company_size}
                                                           Count : {freq}"))) +
  theme_minimal()+
  labs(title = "Company Size Distribution")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))


ggplotly(p6, tooltip = "text") 

Remarks
It can be concluded that most of the companies are medium-sized followed by large and small.

2. Numerical Columns

On our df dataframe, we’ve got three numeric columns which are:

  1. work_year
  2. salary_in_usd
  3. remote_ratio

I will explore each of the columns

Work Year

I am going to illustrate the work_year distribution by using a pie chart in ggplot package

df <- df %>% 
  mutate(work_year = as.factor(work_year))

pie1<-df %>% 
  count(work_year) %>% 
  rename(freq = n) %>% 
  arrange(desc(work_year)) %>% 
  mutate(prop = freq / sum(freq) *100) %>%
  mutate(ypos = cumsum(prop)- 0.5*prop )

         
pie1 %>% 
  ggplot(mapping = aes(x = 2, y = prop, fill =work_year))+
  geom_bar(stat = "identity", width = 1, color = "white") +
  coord_polar("y", start = 0) +
  theme_void() +
  geom_text(aes(y = ypos, label = freq), color = "white", size=5) +
  theme(legend.position = "right", 
        legend.title = element_blank()) +
  scale_fill_brewer(palette = "Set2") +
  xlim(0.5,2.5)+
  labs (title = "Work Year Pie Chart")

Remarks
Pie chart above illustrates that most workers worked in 2022 with 318 employees (more than half the data population) were active during the period.

Salary in USD

Let’s see what is going on with the salary column. I will break the visualisation in two types: boxplot and density plot

Boxplot

df %>% 
  ggplot( mapping = aes (x = "",y = salary_in_usd))+
  geom_boxplot(fill ="slateblue", alpha = 0.3,
               color = "white",
               outlier.fill = "red",
               outlier.color = "orange")+
  ylab("USD")+
  labs(title = "Salary in USD Boxplot")+
  theme_minimal()+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))+
  scale_y_continuous(labels = scales::unit_format(unit = "k",
                                                  scale = 1/1000),
                     limits = c(0,600000),
                     breaks  = c(0,100000,200000,300000,400000,500000,600000))

Density Plot

library(hrbrthemes)
df %>% 
  ggplot(mapping = aes (x = salary_in_usd))+
  geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
  theme_ipsum()+
  scale_x_continuous(labels = scales::unit_format(unit = "k", scale = 1/1000))+
  ggtitle("Salary in USD Density Plot")+
  theme(plot.background = element_rect(fill = "white"),
        plot.title = element_text(colour = "black"),
        text = element_text(colour = "white"))

Looks great

Remote Ratio

Let’s take a look at remote ratio column

p8<-df %>% 
  count (remote_ratio) %>% 
  rename(freq = n ) %>% 
  mutate(remote_ratio = as.factor(ifelse(remote_ratio == "0", "No Remote Work", ifelse(remote_ratio == "50", "Partially Remote", "Fully Remote")))) %>% 
  ggplot(mapping = aes(x = remote_ratio, y = freq)) +
  geom_col(mapping = aes(fill = remote_ratio, text = glue("{remote_ratio}
                                                           Count : {freq}"))) +
  theme_minimal()+
  labs(title = "Remote Ratio Distribution")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"))


ggplotly(p8, tooltip ="text")  

Remarks
Most of the employees are fully remote workers

Employment type by Experience Level Analysis

I want to take a look deeper on the employement type by experience level. We can see how this two values correlate each other

p9<-df %>% 
  select (employment_type, experience_level) %>% 
  count(employment_type, experience_level, .drop = F) %>% 
  rename(freq = n) %>% 

  ggplot(mapping = aes(x = reorder (employment_type, -freq), y = freq)) +
  geom_col(mapping = aes(fill = experience_level, text = glue("{employment_type}
                                                           Count : {freq}")), position = "dodge") +
  theme_minimal()+
  labs(title = "Employment Type and Experience Level Corelation")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "right", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.y  = element_blank(),
        axis.title.x = element_blank(), 
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White"),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white"), 
        legend.text = element_text(colour = "white"))


ggplotly(p9, tooltip ="text")  

Remarks
Full time and contract employee consist of all type of workers, however there are only entry-level and mid-level workers with part-time employment type. There are no entry-level and mid-level working for freelance job type.

Highest Salary by Job Title

Lastly, I’m going to create a plot which illustrates highest paid salary by job title

p10<-df %>% 
  select(salary_in_usd, job_title) %>% 
  group_by(job_title) %>% 
  arrange(desc(salary_in_usd)) %>% 
  head(15) %>% 
  ggplot(mapping = aes(x = job_title, y = salary_in_usd)) +
  geom_col(mapping = aes(fill = salary_in_usd, text = glue("{job_title}
                                                           Salary : {salary_in_usd}")), position = "stack") +
  theme_minimal()+
  labs(title = "Top 15 Highest Salary by Job Title")+
  ylab("Salary in USD")+
  theme(panel.background = element_rect(fill = "black"), 
        legend.position = "none", 
        plot.title = element_text(colour = "white",face = "italic"),
        axis.title.x = element_blank(), 
        axis.title.y = element_text(colour = "white"),
        plot.background = element_rect(fill = "black"), 
        axis.text.y = element_text(colour = "White"),
        axis.text.x  = element_text(colour = "White", angle = 30, hjust = 1),
        panel.grid.major.y= element_blank(),
        panel.grid.major.x = element_line(colour = "black"),
        axis.line.x = element_line(colour = "white")) +
  scale_y_continuous(labels = scales::unit_format(unit = "k",
                                                  scale = 1/1000),
                     limits = c(0,800000),
                     breaks  = c(0,100000,200000,300000,400000,500000,600000,700000,800000))
  

ggplotly(p10, tooltip = "text")

End

So, that’s all for the EDA process of Job Salaries Data using packages in R programming language. What a tough way to refresh all of the syntax and algorithms. I know this is only a small portion of data analysis process, but everything big starts from smaller piece!. I wish this post can help you understand about EDA and the way to take advantage of ggplot package in visualising any variation of data structures.

See you in the other page!

Author, Alfado Milala

Notes :
In case you want to look up my profile, click the link below :
Jump To My Profile (open link in a new tab)