knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>" )

1 Introduction

Jobs in data typically refers to roles in data science, data analysis, data engineering, business intelligence (BI), machine learning engineering, data visualization, and related areas. These jobs involve working with data to extract insights, make predictions, and inform decision-making. In this project for Programming for Data Science with R, we analyze the benefits, chiefly salary, of jobs in data science. The analysis is based on data collected from Kaggle.

2 🛠 Data Preparation

2.1 Import Library

# Data Preparation
library(dplyr)
library(lubridate) # working with datetime
library(scales)    # number formatting (adds thousands separators, etc.)
library(tidyr)
library(forcats)

# Visualization
library(ggplot2) 
library(ggpubr)
library(plotly) 
library(glue)
library(ggridges)
library(treemap)
library(viridis)
library(hrbrthemes)

2.2 Creating Initial Dataframe

jobs_data <- read.csv("datainput/jobs_in_data.csv")
head(jobs_data)

2.3 Column Description

  • salary_currency: The currency in which the salary is paid, such as USD, EUR, etc.
  • salary: The annual gross salary of the role in the local currency.
  • salary_in_usd: The annual gross salary converted to United States Dollars (USD). This uniform currency conversion aids in global salary comparisons and analyses.
  • employee_residence: The country of residence of the employee.
  • experience_level: Classifies the professional experience level of the employee.
  • employment_type: Specifies the type of employment, such as ‘Full-time’, ‘Part-time’, ‘Contract’, etc.
  • work_setting: The work setting or environment, such as ‘Remote’, ‘In-person’, or ‘Hybrid’. This column allows analysis of how work settings relate to salary levels in the data industry.
  • company_location: The country where the company is located. It helps in analyzing how the location of the company affects salary structures.
  • company_size: The size of the employer company, often categorized into small (S), medium (M), and large (L) sizes. This allows for analysis of how company size influences salary.

3 🛠 Data Processing

3.1 Check general data information

glimpse(jobs_data)
#> Rows: 9,355
#> Columns: 12
#> $ work_year          <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
#> $ job_title          <chr> "Data DevOps Engineer", "Data Architect", "Data Arc…
#> $ job_category       <chr> "Data Engineering", "Data Architecture and Modeling…
#> $ salary_currency    <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
#> $ salary             <int> 88000, 186000, 81800, 212000, 93300, 130000, 100000…
#> $ salary_in_usd      <int> 95012, 186000, 81800, 212000, 93300, 130000, 100000…
#> $ employee_residence <chr> "Germany", "United States", "United States", "Unite…
#> $ experience_level   <chr> "Mid-level", "Senior", "Senior", "Senior", "Senior"…
#> $ employment_type    <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
#> $ work_setting       <chr> "Hybrid", "In-person", "In-person", "In-person", "I…
#> $ company_location   <chr> "Germany", "United States", "United States", "Unite…
#> $ company_size       <chr> "L", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…

The output above shows that the dataset has 12 columns and 9,355 rows, along with the data type of each column. Checking the data types is a crucial step because each column must have an appropriate type for analysis.

3.2 Missing Value

colSums(is.na(jobs_data))
#>          work_year          job_title       job_category    salary_currency 
#>                  0                  0                  0                  0 
#>             salary      salary_in_usd employee_residence   experience_level 
#>                  0                  0                  0                  0 
#>    employment_type       work_setting   company_location       company_size 
#>                  0                  0                  0                  0

Missing values in a dataset can significantly affect the results of an analysis. In this dataset, there are no missing values in any of the columns.

3.3 Duplicates

sum(duplicated(jobs_data))
#> [1] 4014

The duplicate check finds 4,014 duplicated rows. We will not delete them: the dataset has no unique identifier per employee, so identical rows may represent different people with the same role, location, and salary, and they still provide meaningful information.
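Before deciding to keep duplicated rows, it can help to look at them directly. The sketch below uses a small made-up data frame (`toy`, not the Kaggle data) to show what `duplicated()` counts and how the repeated rows could be inspected.

```r
# Toy data standing in for jobs_data (values are illustrative only)
toy <- data.frame(
  job_title     = c("Data Analyst", "Data Analyst", "Data Engineer", "Data Analyst"),
  salary_in_usd = c(100000, 100000, 120000, 95000)
)

# duplicated() flags every repeat after the first occurrence of a row
n_dup <- sum(duplicated(toy))

# Keep all copies of any row that occurs more than once
dup_rows <- toy[duplicated(toy) | duplicated(toy, fromLast = TRUE), ]

# Count how often each distinct row appears
row_counts <- aggregate(n ~ job_title + salary_in_usd,
                        data = transform(toy, n = 1L),
                        FUN = sum)

n_dup
dup_rows
row_counts
```

Note that `sum(duplicated(x))` counts only the extra copies, so a row that appears twice contributes 1 to the total.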

3.4 Data Cleaning

jobsdata_clean <- jobs_data %>% 
  mutate(
    job_category   = as.factor(job_category),
    salary_currency   = as.factor(salary_currency),
    experience_level  = as.factor(experience_level),
    employment_type  = as.factor(employment_type),
    work_setting  = as.factor(work_setting),
    company_size  = as.factor(company_size)
    )
head(jobsdata_clean)

In this data cleaning step, we fix data types that are not appropriate. The columns job_category, salary_currency, experience_level, employment_type, work_setting, and company_size are converted from character to factor (categorical) type.
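By default `as.factor()` sorts levels alphabetically, which is why experience_level later appears as Entry-level, Executive, Mid-level, Senior. If a career-style ordering is preferred in tables and plots, the levels can be set explicitly. This is a minimal sketch on made-up values; the order in `level_order` is our assumption, not something defined by the dataset.

```r
# Toy column standing in for jobs_data$experience_level
toy <- data.frame(
  experience_level = c("Senior", "Entry-level", "Executive", "Mid-level", "Senior"),
  stringsAsFactors = FALSE
)

# Assumed seniority order (hypothetical choice for illustration)
level_order <- c("Entry-level", "Mid-level", "Senior", "Executive")

# factor() with explicit levels keeps this order in summaries and plot legends
toy$experience_level <- factor(toy$experience_level, levels = level_order)

levels(toy$experience_level)
table(toy$experience_level)
```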

4 🛠 Data Exploration

Firstly, let’s take a look at the distribution of data for each column.

summary(jobsdata_clean)
#>    work_year     job_title                            job_category 
#>  Min.   :2020   Length:9355        Data Science and Research:3014  
#>  1st Qu.:2023   Class :character   Data Engineering         :2260  
#>  Median :2023   Mode  :character   Data Analysis            :1457  
#>  Mean   :2023                      Machine Learning and AI  :1428  
#>  3rd Qu.:2023                      Leadership and Management: 503  
#>  Max.   :2023                      BI and Visualization     : 313  
#>                                    (Other)                  : 380  
#>  salary_currency     salary       salary_in_usd    employee_residence
#>  USD    :8591    Min.   : 14000   Min.   : 15000   Length:9355       
#>  GBP    : 347    1st Qu.:105200   1st Qu.:105700   Class :character  
#>  EUR    : 340    Median :143860   Median :143000   Mode  :character  
#>  CAD    :  38    Mean   :149928   Mean   :150300                     
#>  AUD    :  11    3rd Qu.:187000   3rd Qu.:186723                     
#>  PLN    :   7    Max.   :450000   Max.   :450000                     
#>  (Other):  21                                                        
#>     experience_level  employment_type    work_setting  company_location  
#>  Entry-level: 496    Contract :  19   Hybrid   : 191   Length:9355       
#>  Executive  : 281    Freelance:  11   In-person:5730   Class :character  
#>  Mid-level  :1869    Full-time:9310   Remote   :3434   Mode  :character  
#>  Senior     :6709    Part-time:  15                                      
#>                                                                          
#>                                                                          
#>                                                                          
#>  company_size
#>  L: 748      
#>  M:8448      
#>  S: 159      
#>              
#>              
#>              
#> 

Insight :

  • The most common job category in the field of data is “Data Science and Research”, with 3,014 records.
  • The minimum salary is 15,000 USD and the maximum salary is 450,000 USD.
  • The most common employment type is full-time, with 9,310 employees, and the most common work setting is “In-person”, with 5,730 employees.
  • Most data jobs are at medium-sized companies, with 8,448 records.

4.1 🏄 Average Salary based on Job Category

4.1.1 Aggregation Table

Create an aggregation table to determine the average income for each job_category in the field of data

Q1 <- jobsdata_clean %>% 
  group_by(job_category) %>% 
  summarise(Avg_Salary = mean(salary_in_usd)) %>% 
  ungroup() %>% 
  arrange(-Avg_Salary)

Q1
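An average can be pulled upward by a few very high salaries, so it may be useful to report the median and the group size next to the mean. The sketch below shows one way to do that with dplyr on a made-up data frame; the category names and salaries are illustrative only.

```r
library(dplyr)

# Toy data standing in for jobsdata_clean (values are illustrative only)
toy <- data.frame(
  job_category  = c("A", "A", "A", "B", "B"),
  salary_in_usd = c(100000, 120000, 400000, 90000, 110000)
)

salary_summary <- toy %>%
  group_by(job_category) %>%
  summarise(
    n             = n(),                    # number of records in the group
    Avg_Salary    = mean(salary_in_usd),    # sensitive to extreme values
    Median_Salary = median(salary_in_usd)   # robust to extreme values
  ) %>%
  arrange(desc(Avg_Salary))

salary_summary
```

In this toy example, category A has a mean far above its median because of the single 400,000 value.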

4.1.2 Visualization with densityplot

Create a visualization from the aggregation table as above

Plot1 <- Q1 %>%
  # Reorder the categories by average salary so the ridges are sorted
  mutate(job_category = fct_reorder(job_category, Avg_Salary)) %>%
  ggplot(aes(y = job_category, x = Avg_Salary, fill = job_category)) +
  # Q1 holds one average per category, so each ridge shows a single bin
  geom_density_ridges(alpha = 0.6, stat = "binline", bins = 20) +
  theme_ridges() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Average Salary") +
  ylab("Job Category") +
  ggtitle("Salary by Job Category")
Plot1

Insight :

  1. Machine Learning and AI has the highest average salary of all job categories, at 178,925.8 USD, followed by Data Science and Research at 163,758.6 USD.
  2. Data Quality and Operations has the lowest average salary of all job categories, at 100,879.5 USD.
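The aggregation table Q1 has one average per category, so a ridgeline plot drawn from it can only show one bin per ridge. To see the full salary distribution, the plot would need row-level salaries instead. The sketch below illustrates this with synthetic data (randomly generated, not the Kaggle data) and the same `geom_density_ridges()` call used in this section.

```r
library(ggplot2)
library(ggridges)

# Synthetic row-level salaries (illustrative only)
set.seed(42)
toy <- data.frame(
  job_category  = rep(c("Category A", "Category B", "Category C"), each = 100),
  salary_in_usd = c(rnorm(100, 150000, 30000),
                    rnorm(100, 120000, 25000),
                    rnorm(100, 100000, 20000))
)

# Order the ridges by mean salary
toy$job_category <- reorder(toy$job_category, toy$salary_in_usd, FUN = mean)

# One histogram-style ridge per category, built from every salary in the group
ridge_plot <- ggplot(toy, aes(x = salary_in_usd, y = job_category, fill = job_category)) +
  geom_density_ridges(alpha = 0.6, stat = "binline", bins = 20) +
  theme_ridges() +
  theme(legend.position = "none") +
  xlab("Salary in USD") +
  ylab("Job Category")
ridge_plot
```

With the real data, `toy` would be replaced by jobsdata_clean, which already has the job_category and salary_in_usd columns.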

4.2 🏄 Trend of average Salary in USD based on experience_level.

4.2.1 Aggregation Table

Create an aggregation table to determine trend of average income based on experience_level

Q2 <- aggregate(salary_in_usd ~ experience_level + work_year,
                data = jobsdata_clean,
                FUN = mean)
head(Q2)

4.2.2 Visualization with Line Plot

Create a visualization from the aggregation table as above

Plot2 <- ggplot(data = Q2, aes(x = work_year, y = salary_in_usd, color = experience_level)) +
  geom_line(aes(group = experience_level)) +
  ggtitle("Trend Salary by experience_level") 
Plot2

Insight :

  1. Average salaries for entry-level, mid-level, and senior roles declined in 2021 and then increased through 2023.
  2. The executive-level average salary decreased in 2023, but only slightly.
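The rises and falls read from the line plot can also be checked numerically by computing the year-over-year change within each experience level. This is a small base R sketch on a made-up table shaped like Q2; the salary values are illustrative only.

```r
# Toy table with the same columns as Q2 (values are illustrative only)
q2_toy <- data.frame(
  experience_level = rep(c("Entry-level", "Senior"), each = 3),
  work_year        = rep(2021:2023, times = 2),
  salary_in_usd    = c(60000, 70000, 80000, 150000, 140000, 160000)
)

# Sort so that each level's years are in order before differencing
q2_toy <- q2_toy[order(q2_toy$experience_level, q2_toy$work_year), ]

# Change versus the previous year, computed separately for each level;
# the first year of each level has no previous year, so it is NA
q2_toy$change <- ave(q2_toy$salary_in_usd, q2_toy$experience_level,
                     FUN = function(x) c(NA, diff(x)))

q2_toy
```

A negative value in `change` marks a year in which the average salary fell for that level.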

4.3 🏄 Salary range in USD based on employment_type.

4.3.1 Aggregation Table

Create a table of salaries by employment_type, with a tooltip label for each row.

Q3 <- jobsdata_clean %>% 
  select(employment_type, salary_in_usd) %>%
  group_by(employment_type) %>% 
  arrange(-salary_in_usd) %>% 
  mutate(labeling = glue("employment_type: {employment_type}
                          Salary: {salary_in_usd}"))

head(Q3)

4.3.2 Visualization with Boxplot

Create a visualization from the aggregation table as above

Plot3 <- Q3 %>%
  ggplot( aes(x=employment_type, y=salary_in_usd, fill=employment_type,label = labeling)) +
    geom_boxplot() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Range Salary by Employment_type") +
    xlab("Employment_type")
ggplotly(Plot3, tooltip = "label")

Insight :

  1. Full-time employees have the highest salaries of all employment types, with an average salary of around 144,000 USD and a maximum of 450,000 USD.
  2. Freelance has the lowest average salary of all employment types, but it is still a promising income at around 50,000 USD.
  3. The average salaries for freelance and part-time work are similar.
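A boxplot draws the median rather than the mean, so figures read off the plot are easier to confirm with a small summary table. The sketch below computes the count, minimum, median, mean, and maximum per employment type in base R, using made-up salaries (not the Kaggle data).

```r
# Toy data standing in for Q3 (values are illustrative only)
toy <- data.frame(
  employment_type = c("Full-time", "Full-time", "Full-time",
                      "Freelance", "Freelance", "Part-time"),
  salary_in_usd   = c(90000, 140000, 200000, 40000, 60000, 45000)
)

# Split salaries by employment type and summarise each group
salary_stats <- do.call(
  rbind,
  lapply(split(toy$salary_in_usd, toy$employment_type), function(x) {
    c(n = length(x), min = min(x), median = median(x),
      mean = mean(x), max = max(x))
  })
)

salary_stats
```

The `n` column matters here: in the real data some employment types have very few records, so their averages are less reliable.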

4.5 🏄 The percentage of work_setting in the data job field.

4.5.1 Aggregation Table

Create an aggregation table to determine percentage of work_setting.

Q5 <- jobsdata_clean %>% 
  group_by(work_setting) %>% 
  summarise(count = n()) %>% 
  ungroup() %>%
  arrange(-count)
Q5

4.5.2 Additional calculation

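Add helper columns for the chart: the percentage of each work setting, the ymax and ymin bounds of each segment, and the label text and position.

Q5$fraction <- Q5$count / sum(Q5$count) * 100  # percentage of all records
Q5$ymax <- cumsum(Q5$fraction)                 # upper bound of each segment
Q5$ymin <- c(0, head(Q5$ymax, n = -1))         # lower bound of each segment
Q5$labelPosition <- (Q5$ymax + Q5$ymin) / 2    # midpoint, used to place the label
Q5$label <- paste0(Q5$work_setting, ":", comma(round(Q5$fraction, 2)), "%")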
Q5
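To make the segment arithmetic concrete, the sketch below repeats the same calculation on the work_setting counts reported by summary() earlier (In-person 5,730, Remote 3,434, Hybrid 191). The vector names are ours and used only for illustration.

```r
# Counts taken from the summary() output, sorted from largest to smallest
counts <- c("In-person" = 5730, "Remote" = 3434, "Hybrid" = 191)

fraction <- counts / sum(counts) * 100        # share of each setting in percent
ymax <- cumsum(fraction)                      # each segment ends where the running total ends
ymin <- c(0, head(ymax, n = -1))              # each segment starts where the previous one ended
label_position <- (ymax + ymin) / 2           # label sits in the middle of its segment

round(fraction, 2)
# In-person 61.25, Remote 36.71, Hybrid 2.04
```

Because the segments are stacked from 0 to 100, the last `ymax` is always 100 and each `ymin` equals the previous `ymax`.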

4.5.3 Visualization with donut chart

Create a visualization from the aggregation table as above

Plot5 <- ggplot(Q5, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=work_setting)) +
  geom_rect() +
  geom_label( x=3.5, aes(y=labelPosition, label=label), size=6) +
  scale_fill_brewer(palette=4) +
  coord_polar(theta="y") +
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "none") +
  ggtitle("Percentage of Work_setting")
Plot5

Insight :

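The most common work_setting in the data job field is in-person (on-site), with about 61% of records. The remaining 39% are remote (about 37%) or hybrid (about 2%), which make it possible to work from anywhere at least part of the time.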

4.6 🏄 The country with the most job opportunities in the field of data.

4.6.1 Aggregation Table

Create an aggregation table to determine the countries with the most data job opportunities.

Q6 <- jobsdata_clean %>% 
  group_by(company_location) %>% 
  summarise(count = n()) %>% 
  ungroup() %>% 
  arrange(-count) %>% 
  head(10)
Q6
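A raw count is easier to interpret next to its share of all records. The following base R sketch ranks locations and adds a percentage, using a short made-up vector of country names in place of the company_location column.

```r
# Toy vector standing in for jobsdata_clean$company_location
locations <- c("United States", "United States", "United Kingdom",
               "Canada", "United States", "United Kingdom")

# Count records per country and sort from most to least frequent
location_counts <- sort(table(locations), decreasing = TRUE)

# Keep the top 2 here; the analysis above keeps the top 10
top_locations <- head(location_counts, 2)

# Share of all records, in percent
top_share <- round(100 * top_locations / length(locations), 1)

top_locations
top_share
```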

4.6.2 Visualization with treemap

Create a visualization from the aggregation table as above

treemap(Q6,
        index = "company_location",
        vSize = "count",
        type = "index")

Insight:

Currently, the country with the most job opportunities in the field of data is the United States.

4.7 🏄 The range of average salary based on company_size

4.7.1 Aggregation Table

Create an aggregation table to determine the average salary by work_year and company_size (small, medium, large).

Q7 <- jobsdata_clean %>% 
  select(work_year, company_size, salary_in_usd) %>% 
  group_by(work_year, company_size) %>% 
  summarise(Avg_Salary = mean(salary_in_usd)) %>% 
  ungroup() %>% 
  mutate(labeling = glue("Company: {company_size}
                         Average Salary: {comma(round(Avg_Salary,2))}"))
Q7

4.7.2 Visualization with barplot

Create a visualization from the aggregation table as above

Plot7 <- ggplot(Q7, aes(fill= company_size, y=Avg_Salary, x= company_size, text = labeling)) +
    geom_bar(position="dodge", stat="identity") +
    scale_fill_viridis(discrete = TRUE, option = "E") +
    ggtitle("Company_Size Salary") +
    facet_wrap(~work_year) +
    theme_ipsum() +
    theme(legend.position="none") +
    xlab("")
ggplotly(Plot7, tooltip = "text")

Insight:

  1. From 2020 to 2021, the average salary was relatively stagnant at large companies, declined at medium companies, and increased at small companies.
  2. In 2022 and 2023, average salaries continued to increase across all company sizes.
  3. In 2022 and 2023, medium companies had a higher average salary than large companies.
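The comparison between company sizes can be read more directly from a year-by-size table of averages. The sketch below builds such a table with `tapply()` and picks the size with the highest average in each year; the salaries are made up for illustration and are not the values behind the plot.

```r
# Toy data standing in for jobsdata_clean (values are illustrative only)
toy <- data.frame(
  work_year     = c(2022, 2022, 2022, 2023, 2023, 2023),
  company_size  = c("L", "M", "S", "L", "M", "S"),
  salary_in_usd = c(130000, 150000, 90000, 140000, 155000, 95000)
)

# Rows are years, columns are company sizes, cells are average salaries
avg_by_year_size <- tapply(toy$salary_in_usd,
                           list(toy$work_year, toy$company_size),
                           mean)

# For each year, the company size with the highest average salary
top_size <- colnames(avg_by_year_size)[apply(avg_by_year_size, 1, which.max)]

avg_by_year_size
top_size
```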

5 Explanatory Text & Business Recommendation

We have created various data visualizations from the dataset on jobs in the field of data. These visualizations provide a number of insights, including:

  1. Machine Learning and AI is the job category in the data field with the highest average income, at more than 175,000 USD. Several factors likely contribute to the high income potential of Machine Learning and AI professionals: high demand, specialized skills, complex work, business impact, and innovation.

  2. From 2021 onward, average salaries for jobs in the field of data increased for entry-level, mid-level, and senior roles. The executive level is the exception, with a slight decrease in 2023.

  3. Based on employment_type, full-time employees have a very wide salary range, with a maximum income of 450,000 USD. Freelance work is also promising, with an average of around 50,000 USD, and can be considered as a source of additional income, for example for stay-at-home parents.

  4. Based on job title, the most popular professions are data engineer (2,195 workers), data scientist (1,989), and data analyst (1,388). The increasing reliance on data-driven insights, coupled with advancements in technology and a growing demand for skilled professionals, has made data engineering, data science, and data analysis among the most popular jobs in the current job market.

  5. The most common work_setting is in-person (on-site), with about 61% of records. The remaining 39% are remote or hybrid, which make it possible to work from anywhere at least part of the time.

  6. Currently, the country with the most job opportunities in the field of data is the United States.

  7. Careers in data offer good prospects for advancement. Data skills are currently in high demand at many companies, from small businesses to large enterprises. In 2023, medium-sized companies paid a higher average salary for jobs in data than large companies. Therefore, data professionals do not necessarily have to work at large companies to earn a high income.

The visualizations we have created show that data visualization is very beneficial in making it easier for us and our audience to extract and understand information from the data.