Hi everyone! It’s been quite a while since the last time I shared my R code works. I have been through a lot of unexpected things in my life forcing me to take a break on playing around with datasets :D. I guess now it is the time to back at it again and a bit material revision seems necessary just to have a review in data processing (as I haven’t been using this language for 2 years!)
I would like to practice my R language (as per usual) by doing exploratory data analysis (EDA) using a Job Salaries Data that I got from kaggle (this is the link to the dataset https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries). All of the data visualisation instruments including charts and graphs will be done using ggplot package, so I can show you how powerful this library is in helping us producing beautiful yet attractive charts.
Let’s go through the dataset! Happy data-playing!
First thing first, import all libraries that might be needed
library(tidyverse)
library(ggplot2)
library(plotly)
library(reshape2)Obviously don’t forget about the data
df <-read.csv(file ="ds_salaries.csv")
df<-df %>%
select(-X, -salary_currency,-salary) %>%
mutate_at(c("job_title","employment_type","experience_level","company_location","company_size","employee_residence","remote_ratio"),as.factor)
str(df)## 'data.frame': 607 obs. of 9 variables:
## $ work_year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ experience_level : Factor w/ 4 levels "EN","EX","MI",..: 3 4 4 3 4 1 4 3 3 4 ...
## $ employment_type : Factor w/ 4 levels "CT","FL","FT",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ job_title : Factor w/ 50 levels "3D Computer Vision Researcher",..: 23 41 8 48 38 13 35 23 9 34 ...
## $ salary_in_usd : int 79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
## $ employee_residence: Factor w/ 57 levels "AE","AR","AT",..: 15 33 21 24 56 56 56 26 56 42 ...
## $ remote_ratio : Factor w/ 3 levels "0","50","100": 1 1 2 1 2 3 3 2 3 2 ...
## $ company_location : Factor w/ 50 levels "AE","AS","AT",..: 13 30 19 21 49 49 49 23 49 39 ...
## $ company_size : Factor w/ 3 levels "L","M","S": 1 3 2 3 1 1 3 1 1 3 ...
We’ve got 9 columns left with:
1. Integer type columns:
“work_year”,“salary_in_usd”,“remote_ratio”
2. Categorical type columns :
“experience_level”,“job_title”,“employment_type”
Let’s check if we’ve got some NA here
anyNA(df )## [1] FALSE
Nope, no missing values - everything is just fine. Sound
I will split the EDA of this dataset in to two parts:
categorical columns data and numerical
columns data.
Let’s begin with categorical cloumns first
We will change the experience_level values as
follow:
1. MI : Mid-Level
2. SE : Senior-Level
3. EN : Entry-Level
4. EX : Executive-Level
Feel free to switch around the tab!
df<-df %>%
mutate(experience_level = as.factor(ifelse(experience_level == "MI","Mid-Level",ifelse(experience_level == "SE","Senior-Level",ifelse(experience_level == "EN","Entry-Level", "Executive-Level")))))
head(df)## work_year experience_level employment_type job_title
## 1 2020 Mid-Level FT Data Scientist
## 2 2020 Senior-Level FT Machine Learning Scientist
## 3 2020 Senior-Level FT Big Data Engineer
## 4 2020 Mid-Level FT Product Data Analyst
## 5 2020 Senior-Level FT Machine Learning Engineer
## 6 2020 Entry-Level FT Data Analyst
## salary_in_usd employee_residence remote_ratio company_location company_size
## 1 79833 DE 0 DE L
## 2 260000 JP 0 JP S
## 3 109024 GB 50 GB M
## 4 20000 HN 0 HN S
## 5 150000 US 50 US L
## 6 72000 US 100 US L
Well-Done!
I will show the experience level distribution in a tree map
library(treemapify) #for creating treemap
p1<-df %>%
select(experience_level) %>%
group_by(experience_level) %>%
summarise(freq = n()) %>%
mutate(percentage = as.integer(100*freq/(213+26+280+88))) %>%
ggplot(mapping = aes(fill = experience_level, area = freq, label = paste0(experience_level,"\n",freq,"\n",percentage,"%"))) +
geom_treemap(layout = "scol") +
geom_treemap_text(colour = "white", place = "centre",size = 14, )+
scale_fill_brewer(palette = "Dark2")+
theme(legend.position = "none")
p1First, let’s count how many job titles are in the datasets.
cat ("number of job title :",length(unique(df$job_title)))## number of job title : 50
We’ve got 50 unique job titles in the job_titles column, now we’d like to know the top 10 job titles.
library(glue)
p2<-df %>%
count(job_title) %>%
rename(freq=n) %>%
head(10) %>%
ggplot(mapping = aes(x = reorder(job_title, freq), y=freq))+
geom_col(mapping = aes(fill = job_title, text = glue("Job Title: {job_title}
Count : {freq}")))+
coord_flip()+
theme_minimal()+
labs(title = "Top 10 Job Titles")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white", face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))
ggplotly(p2, tooltip = "text") Remarks
As we can see on the chart above, Big Data Engineer stands on top of the set with 8 titles.
There are four different employment types:
1. FT : “Full-Time”
2. PT : “Part-Time”
3. CT : “Contract”
4. FL : “Freelance”
I will explore the size distribution of these employment types
p3<-df %>%
count(employment_type) %>%
rename(freq=n) %>%
ggplot(mapping = aes(x = reorder(employment_type, -freq), y=freq))+
geom_col(mapping = aes(fill = employment_type, text = glue("Employment Type: {employment_type}
Count : {freq}")))+
theme_minimal()+
labs(title = "Employment Type Distribution")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))
ggplotly(p3, tooltip = "text") Remarks
It appears that most of the employees are full-time workers
In this section, I will show you the where most of the employees live across the globe. A world map might be a good instrument to visualise this kind of data, as we can see exactly the distribution of employees’ residence on the map.
library(countrycode) #this library is for code translating from country name to iso2c and vice versa
library(maps)
df_residence<-df %>%
mutate(region = as.character(countrycode(df$employee_residence,"iso2c","country.name"))) %>%
count(region) %>%
rename(freq = n)
df_residence$region<-replace(df_residence$region, df_residence$region =="United States","USA") #create residence dataframeCreate a new dataframe and store it into mapdata
mapdata<-map_data("world") %>%
mutate(region = as.factor(region)) #create mapdata using maps package. We want world as the attribute value
mapdata_residence<-mapdata %>% #join the world map coordinate with residence dataframe
left_join(df_residence,by = "region") %>%
select(-subregion)
#remove the NA data from freq
mapdata_new <- mapdata_residence %>%
filter(!is.na(mapdata_residence$freq))Let’s plot the map!
Here I’m using ggplot2 package to build the map by plotting longitudes and latitudes point. Let me show you how it works
map_residence<-mapdata_residence %>%
ggplot(mapping = aes(x = long, y=lat, group = group))+
geom_polygon(aes(fill = freq, text = glue("{region}
{freq} employees")), color = "black")+
scale_fill_gradient(name ="employees", low = "lightblue", high = "blue", na.value = "grey50")+ #use geom_polygon to plot all the coordinates so it aligns like a map
labs(title = "Employee Residence Distribution")+
theme_minimal()+
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.grid = element_blank(),
plot.background = element_rect(fill = "black"),
panel.background = element_rect(fill="black"),
legend.text = element_text(colour = "white"),
legend.title = element_text(colour = "white"),
plot.title = element_text(colour = "white", face = "italic"),
rect = element_blank(),
aspect.ratio = 1)
ggplotly(map_residence, tooltip = "text")Remarks
As we can see on the map most of the employees live in the US.
Brilliant!
The bar plot below gives a clearer comparison in regards to employee_residence:
p4<-df_residence %>%
group_by(region) %>%
arrange(desc(freq)) %>%
head(10) %>%
ggplot(mapping = aes(x = reorder(region,-freq), y = freq)) +
geom_col(mapping = aes(fill = region, text = glue("Region: {region}
Count : {freq}")))+
theme_minimal()+
labs(title = "Top 10 Employee Residence")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white", face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_line(colour = "grey50"),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))
ggplotly(p4, tooltip = "text") Remarks
From the chart above, we can conclude that USA is the region with the
highest number of employees followed by UK, India, Canada, and Germany
in the top 5.
Now I am going to compare the company location and employee residence
#create a new dataframe containing company location and its frequency
df_location<-df %>%
mutate(region = as.character(countrycode(df$company_location,"iso2c","country.name"))) %>%
count(region) %>%
rename(freq = n)
df_location$region<-replace(df_location$region, df_location$region =="United States","USA")I will alter the df_residence dataframe by renaming the frequency column, so I can create a merged employee_residence and company_location dataframe
table_res<-df_residence %>%
mutate(region = as.factor(region)) %>%
rename(freq_residence = freq) %>%
group_by(region) %>%
arrange(desc(freq_residence)) %>%
head(10)Let’s merge the dataframe and create the plot!
p5<-df_location %>%
mutate(region = as.factor(region)) %>%
rename(freq_loc = freq) %>%
group_by(region) %>%
arrange(desc(freq_loc)) %>%
head(10) %>%
left_join(table_res, by = "region") %>% #merging dataframe
pivot_longer(cols = c("freq_loc","freq_residence")) %>% #create a long type dataframe
ggplot(mapping = aes(x = reorder (region,-value), y = value)) + #plot the daaframe
geom_col(mapping = aes(fill = name, text = glue("Region : {region}
Count : {value}")),position = "dodge") + #use dodge type position
theme_minimal()+
labs(title = "Top 10 Employee Residence and Company Location")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "top",
plot.title = element_text(colour = "white", face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_line(colour = "black"),
panel.grid.major.x = element_line(colour = "black"),
panel.grid.minor.y = element_blank(),
axis.line.x = element_line(colour = "white"),
legend.text = element_text(colour = "white"))+
scale_fill_discrete(aesthetics = "fill", labels = c("Company Location", "Employee Residence"))
p5
Remarks
The employess looked mostly live in the States- not a bad place
thouh
Create a bar plot to show company size comparison - it’s a simple plot anyway
p6<-df %>%
count(company_size) %>%
rename(freq = n) %>%
ggplot(mapping = aes(x = company_size, y = freq)) +
geom_col(mapping = aes(fill = company_size, text = glue("Company size :{company_size}
Count : {freq}"))) +
theme_minimal()+
labs(title = "Company Size Distribution")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))
ggplotly(p6, tooltip = "text") Remarks
It can be concluded that most of the companies are medium-sized followed
by large and small.
On our df dataframe, we’ve got three numeric columns
which are:
I will explore each of the columns
I am going to illustrate the work_year distribution by using a pie chart in ggplot package
df <- df %>%
mutate(work_year = as.factor(work_year))
pie1<-df %>%
count(work_year) %>%
rename(freq = n) %>%
arrange(desc(work_year)) %>%
mutate(prop = freq / sum(freq) *100) %>%
mutate(ypos = cumsum(prop)- 0.5*prop )
pie1 %>%
ggplot(mapping = aes(x = 2, y = prop, fill =work_year))+
geom_bar(stat = "identity", width = 1, color = "white") +
coord_polar("y", start = 0) +
theme_void() +
geom_text(aes(y = ypos, label = freq), color = "white", size=5) +
theme(legend.position = "right",
legend.title = element_blank()) +
scale_fill_brewer(palette = "Set2") +
xlim(0.5,2.5)+
labs (title = "Work Year Pie Chart")Remarks
Pie chart above illustrates that most workers worked in 2022 with 318
employees (more than half the data population) were active during the
period.
Let’s see what is going on with the salary column. I will break the visualisation in two types: boxplot and density plot
df %>%
ggplot( mapping = aes (x = "",y = salary_in_usd))+
geom_boxplot(fill ="slateblue", alpha = 0.3,
color = "white",
outlier.fill = "red",
outlier.color = "orange")+
ylab("USD")+
labs(title = "Salary in USD Boxplot")+
theme_minimal()+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))+
scale_y_continuous(labels = scales::unit_format(unit = "k",
scale = 1/1000),
limits = c(0,600000),
breaks = c(0,100000,200000,300000,400000,500000,600000))library(hrbrthemes)
df %>%
ggplot(mapping = aes (x = salary_in_usd))+
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)+
theme_ipsum()+
scale_x_continuous(labels = scales::unit_format(unit = "k", scale = 1/1000))+
ggtitle("Salary in USD Density Plot")+
theme(plot.background = element_rect(fill = "white"),
plot.title = element_text(colour = "black"),
text = element_text(colour = "white"))
Looks great
Let’s take a look at remote ratio column
p8<-df %>%
count (remote_ratio) %>%
rename(freq = n ) %>%
mutate(remote_ratio = as.factor(ifelse(remote_ratio == "0", "No Remote Work", ifelse(remote_ratio == "50", "Partially Remote", "Fully Remote")))) %>%
ggplot(mapping = aes(x = remote_ratio, y = freq)) +
geom_col(mapping = aes(fill = remote_ratio, text = glue("{remote_ratio}
Count : {freq}"))) +
theme_minimal()+
labs(title = "Remote Ratio Distribution")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"))
ggplotly(p8, tooltip ="text") Remarks
Most of the employees are fully remote workers
I want to take a look deeper on the employement type by experience level. We can see how this two values correlate each other
p9<-df %>%
select (employment_type, experience_level) %>%
count(employment_type, experience_level, .drop = F) %>%
rename(freq = n) %>%
ggplot(mapping = aes(x = reorder (employment_type, -freq), y = freq)) +
geom_col(mapping = aes(fill = experience_level, text = glue("{employment_type}
Count : {freq}")), position = "dodge") +
theme_minimal()+
labs(title = "Employment Type and Experience Level Corelation")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "right",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White"),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white"),
legend.text = element_text(colour = "white"))
ggplotly(p9, tooltip ="text") Remarks
Full time and contract employee consist of all type of workers, however
there are only entry-level and mid-level workers with part-time
employment type. There are no entry-level and mid-level working for
freelance job type.
Lastly, I’m going to create a plot which illustrates highest paid salary by job title
p10<-df %>%
select(salary_in_usd, job_title) %>%
group_by(job_title) %>%
arrange(desc(salary_in_usd)) %>%
head(15) %>%
ggplot(mapping = aes(x = job_title, y = salary_in_usd)) +
geom_col(mapping = aes(fill = salary_in_usd, text = glue("{job_title}
Salary : {salary_in_usd}")), position = "stack") +
theme_minimal()+
labs(title = "Top 15 Highest Salary by Job Title")+
ylab("Salary in USD")+
theme(panel.background = element_rect(fill = "black"),
legend.position = "none",
plot.title = element_text(colour = "white",face = "italic"),
axis.title.x = element_blank(),
axis.title.y = element_text(colour = "white"),
plot.background = element_rect(fill = "black"),
axis.text.y = element_text(colour = "White"),
axis.text.x = element_text(colour = "White", angle = 30, hjust = 1),
panel.grid.major.y= element_blank(),
panel.grid.major.x = element_line(colour = "black"),
axis.line.x = element_line(colour = "white")) +
scale_y_continuous(labels = scales::unit_format(unit = "k",
scale = 1/1000),
limits = c(0,800000),
breaks = c(0,100000,200000,300000,400000,500000,600000,700000,800000))
ggplotly(p10, tooltip = "text")So, that’s all for the EDA process of Job Salaries Data using packages in R programming language. What a tough way to refresh all of the syntax and algorithms. I know this is only a small portion of data analysis process, but everything big starts from smaller piece!. I wish this post can help you understand about EDA and the way to take advantage of ggplot package in visualising any variation of data structures.
See you in the other page!
Author, Alfado Milala
Notes :
In case you want to look up my profile, click the link below :
Jump To My Profile
(open link in a new tab)