Jobs in data typically refer to roles in data science, data analysis, data engineering, business intelligence (BI), machine learning engineering, data visualization, and related areas. These jobs involve working with data to extract insights, make predictions, and inform decision-making. In this project for Programming for Data Science with R, we analyze the benefits of jobs in data science. The analysis is based on data collected from Kaggle.
# Data Preparation
library(dplyr)
library(lubridate) # working with datetime
library(scales) # number formatting (thousands separators, etc.)
library(tidyr)
library(forcats)
# visualization
library(ggplot2)
library(ggpubr)
library(plotly)
library(glue)
library(ggridges)
library(treemap)
library(viridis)
library(hrbrthemes)
2.2 Creating Initial Dataframe
#> Rows: 9,355
#> Columns: 12
#> $ work_year <int> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
#> $ job_title <chr> "Data DevOps Engineer", "Data Architect", "Data Arc…
#> $ job_category <chr> "Data Engineering", "Data Architecture and Modeling…
#> $ salary_currency <chr> "EUR", "USD", "USD", "USD", "USD", "USD", "USD", "U…
#> $ salary <int> 88000, 186000, 81800, 212000, 93300, 130000, 100000…
#> $ salary_in_usd <int> 95012, 186000, 81800, 212000, 93300, 130000, 100000…
#> $ employee_residence <chr> "Germany", "United States", "United States", "Unite…
#> $ experience_level <chr> "Mid-level", "Senior", "Senior", "Senior", "Senior"…
#> $ employment_type <chr> "Full-time", "Full-time", "Full-time", "Full-time",…
#> $ work_setting <chr> "Hybrid", "In-person", "In-person", "In-person", "I…
#> $ company_location <chr> "Germany", "United States", "United States", "Unite…
#> $ company_size <chr> "L", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…
The output above shows that the dataset has 12 columns and 9,355 rows, along with the data type of each column. Checking the data types is a crucial step because each column must have a type that is appropriate for the analysis.
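The call that produced the structure summary is not shown here. Output in that shape is what `dplyr::glimpse()` prints, so a minimal sketch of the inspection step might look like the following. The `toy_jobs` data frame is a hypothetical two-row stand-in built from values visible in the output above, not the real dataset.

```r
library(dplyr)

# Hypothetical stand-in for the real dataset: two rows copied from the output above
toy_jobs <- data.frame(
  work_year = c(2023L, 2023L),
  job_title = c("Data DevOps Engineer", "Data Architect"),
  salary_in_usd = c(95012L, 186000L),
  stringsAsFactors = FALSE
)

# Print the number of rows and columns plus the type and first values of each column
glimpse(toy_jobs)

# Collect the class of every column for a quick programmatic check
col_types <- sapply(toy_jobs, class)
col_types
```

Keeping the column classes in a named vector makes it easy to spot character columns that should later become factors.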
#> work_year job_title job_category salary_currency
#> 0 0 0 0
#> salary salary_in_usd employee_residence experience_level
#> 0 0 0 0
#> employment_type work_setting company_location company_size
#> 0 0 0 0
Missing values in a dataset can significantly affect the results of an analysis. In the dataset above, none of the columns contain missing values.
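The check behind the per-column counts is not shown. One common way to get such counts in base R is `colSums(is.na(...))`; the sketch below assumes that approach and uses a hypothetical toy data frame with deliberately injected `NA` values so the effect is visible.

```r
# Hypothetical toy data with two missing values injected for illustration
toy_missing <- data.frame(
  salary_in_usd = c(95012, NA, 81800),
  company_size = c("L", "M", NA),
  work_year = c(2023, 2023, 2023)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy_missing))
na_counts
```

A result of all zeros, as in the real dataset, means no imputation or row removal is needed before the analysis.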
#> [1] 4014
The duplicate check found 4,014 duplicate rows. We do not delete them, because the duplicated rows in this dataset carry meaningful information: the dataset has no unique identifier column, so identical rows may well represent different employees with the same role and salary.
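The code for the duplicate check is not shown either. A minimal sketch, assuming the count comes from base R's `duplicated()`, is given below on a hypothetical toy data frame.

```r
# Hypothetical toy data: the second row repeats the first exactly
toy_dupes <- data.frame(
  job_title = c("Data Architect", "Data Architect", "Data Analyst"),
  salary_in_usd = c(186000, 186000, 81800)
)

# duplicated() flags every row that is identical to an earlier row
n_dup <- sum(duplicated(toy_dupes))
n_dup
```

Note that `duplicated()` does not flag the first occurrence of a repeated row, so the count is the number of extra copies rather than the number of rows involved.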
jobsdata_clean <- jobs_data %>%
  mutate(
    job_category = as.factor(job_category),
    salary_currency = as.factor(salary_currency),
    experience_level = as.factor(experience_level),
    employment_type = as.factor(employment_type),
    work_setting = as.factor(work_setting),
    company_size = as.factor(company_size)
  )
head(jobsdata_clean)
In this data cleaning step, we convert the columns whose data types are not appropriate. The job_category, salary_currency, experience_level, employment_type, work_setting, and company_size columns are converted from the character data type to the factor (categorical) data type.
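As an alternative to listing each column separately, the same kind of conversion can be sketched with `dplyr::across()`. This is an illustrative variant, not the code used in this project; `toy_raw` and `factor_cols` are hypothetical names.

```r
library(dplyr)

# Hypothetical toy data with character columns, mirroring the real column names
toy_raw <- data.frame(
  job_title = c("Data Architect", "Data Analyst"),
  experience_level = c("Senior", "Mid-level"),
  company_size = c("M", "L"),
  stringsAsFactors = FALSE
)

# Columns that should become factors; job_title is intentionally left as character
factor_cols <- c("experience_level", "company_size")

# across() applies as.factor to every column named in factor_cols
toy_clean <- toy_raw %>%
  mutate(across(all_of(factor_cols), as.factor))

sapply(toy_clean, class)
```

Keeping the column names in one vector makes the list of categorical columns easy to review and extend.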
Firstly, let’s take a look at the distribution of data for each column.
#> work_year job_title job_category
#> Min. :2020 Length:9355 Data Science and Research:3014
#> 1st Qu.:2023 Class :character Data Engineering :2260
#> Median :2023 Mode :character Data Analysis :1457
#> Mean :2023 Machine Learning and AI :1428
#> 3rd Qu.:2023 Leadership and Management: 503
#> Max. :2023 BI and Visualization : 313
#> (Other) : 380
#> salary_currency salary salary_in_usd employee_residence
#> USD :8591 Min. : 14000 Min. : 15000 Length:9355
#> GBP : 347 1st Qu.:105200 1st Qu.:105700 Class :character
#> EUR : 340 Median :143860 Median :143000 Mode :character
#> CAD : 38 Mean :149928 Mean :150300
#> AUD : 11 3rd Qu.:187000 3rd Qu.:186723
#> PLN : 7 Max. :450000 Max. :450000
#> (Other): 21
#> experience_level employment_type work_setting company_location
#> Entry-level: 496 Contract : 19 Hybrid : 191 Length:9355
#> Executive : 281 Freelance: 11 In-person:5730 Class :character
#> Mid-level :1869 Full-time:9310 Remote :3434 Mode :character
#> Senior :6709 Part-time: 15
#>
#>
#>
#> company_size
#> L: 748
#> M:8448
#> S: 159
#>
#>
#>
#>
Insight:
Create an aggregation table to determine the average income for each job_category in the data field.
Q1 <- jobsdata_clean %>%
  group_by(job_category) %>%
  summarise(Avg_Salary = mean(salary_in_usd)) %>%
  ungroup() %>%
  arrange(desc(Avg_Salary))
Q1
Create a visualization from the aggregation table above.
Plot1 <- Q1 %>%
  # reorder the categories by average salary so the plot is sorted
  mutate(job_category = fct_reorder(job_category, Avg_Salary)) %>%
  ggplot(aes(y = job_category, x = Avg_Salary, fill = job_category)) +
  geom_density_ridges(alpha = 0.6, stat = "binline", bins = 20) +
  theme_ridges() +
  theme(
    legend.position = "none",
    panel.spacing = unit(0.1, "lines"),
    strip.text.x = element_text(size = 8)
  ) +
  xlab("Average Salary") +
  ylab("Job Category") +
  ggtitle("Salary by Job Category")
Plot1
Insight:
Create an aggregation table to determine the trend of average income by experience_level.
Q2 <- aggregate(salary_in_usd ~ experience_level + work_year,
                data = jobsdata_clean,
                FUN = mean)
head(Q2)
Create a visualization from the aggregation table above.
Plot2 <- ggplot(data = Q2, aes(x = work_year, y = salary_in_usd, color = experience_level)) +
  geom_line(aes(group = experience_level)) +
  ggtitle("Salary Trend by experience_level")
Plot2
Insight:
Create a labeled table to examine the salary range for each employment_type.
Q3 <- jobsdata_clean %>%
  select(employment_type, salary_in_usd) %>%
  group_by(employment_type) %>%
  arrange(desc(salary_in_usd)) %>%
  # tooltip text for the interactive plot
  mutate(labeling = glue("employment_type: {employment_type}\nSalary: {salary_in_usd}"))
head(Q3)
Create a visualization from the table above.
Plot3 <- Q3 %>%
  ggplot(aes(x = employment_type, y = salary_in_usd, fill = employment_type, label = labeling)) +
  geom_boxplot() +
  theme(
    legend.position = "none",
    plot.title = element_text(size = 11)
  ) +
  ggtitle("Salary Range by Employment_type") +
  xlab("Employment_type")
ggplotly(Plot3, tooltip = "label")
Insight:
Create an aggregation table to determine the 10 most popular job_title values in the data field.
Q4 <- jobsdata_clean %>%
  group_by(job_title) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  arrange(desc(count)) %>%
  head(10)
Q4
Create a visualization from the aggregation table above.
Plot4 <- ggplot(data = Q4, mapping = aes(x = count, y = reorder(job_title, count))) +
  geom_col(aes(fill = count)) +
  scale_fill_gradient(low = "white", high = "darkgreen") +
  ggtitle("Top 10 Popular Job_title")
Plot4
Insight:
The most popular job titles in the data field are Data Engineer with 2,195 employees, Data Scientist with 1,989, and Data Analyst with 1,388.
Create an aggregation table to determine the percentage of each work_setting.
Q5 <- jobsdata_clean %>%
  group_by(work_setting) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  arrange(desc(count))
Q5
Create additional columns for the chart: the percentage, ymax, ymin, and label.
Q5$fraction <- Q5$count / sum(Q5$count) * 100 # percentage
Q5$ymax <- cumsum(Q5$fraction) # top of each segment
Q5$ymin <- c(0, head(Q5$ymax, n = -1)) # bottom of each segment
Q5$labelPosition <- (Q5$ymax + Q5$ymin) / 2 # midpoint, used to place the label
Q5$label <- paste0(Q5$work_setting, ":", comma(round(Q5$fraction, 2)), "%")
Q5
Create a visualization from the aggregation table above.
Plot5 <- ggplot(Q5, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = work_setting)) +
  geom_rect() +
  geom_label(x = 3.5, aes(y = labelPosition, label = label), size = 6) +
  scale_fill_brewer(palette = 4) +
  coord_polar(theta = "y") +
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "none") +
  ggtitle("Percentage of Work_setting")
Plot5
Insight:
The most common work_setting in the data field is in-person (on-site) at 61%; the remaining 39% are remote or hybrid roles that can be done, at least partly, from anywhere.
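The percentages behind this chart can be reproduced by hand from the work_setting counts in the summary output earlier (In-person 5,730, Remote 3,434, Hybrid 191). The following is a small base R sketch of that arithmetic; the vector names are illustrative only.

```r
# Counts taken from the summary() output for work_setting
counts <- c("In-person" = 5730, "Remote" = 3434, "Hybrid" = 191)

# Share of each work setting, in percent
pct <- counts / sum(counts) * 100

# Segment boundaries for the ring chart: each segment starts where the previous one ends
ymax <- cumsum(pct)
ymin <- c(0, head(ymax, n = -1))

round(pct, 2)
round(pct[["Remote"]] + pct[["Hybrid"]], 2)  # combined share of non-in-person roles
```

The last segment ends at 100 because the shares are expressed in percent, which is what lets the polar coordinate system close the ring.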
Create an aggregation table to determine the average salary by company_size (S = small, M = medium, L = large) for each year.
Q7 <- jobsdata_clean %>%
  select(work_year, company_size, salary_in_usd) %>%
  group_by(work_year, company_size) %>%
  summarise(Avg_Salary = mean(salary_in_usd)) %>%
  ungroup() %>%
  # tooltip text for the interactive plot
  mutate(labeling = glue("Company: {company_size}\nAverage Salary: {comma(round(Avg_Salary, 2))}"))
Q7
Create a visualization from the aggregation table above.
Plot7 <- ggplot(Q7, aes(fill = company_size, y = Avg_Salary, x = company_size, text = labeling)) +
  geom_bar(position = "dodge", stat = "identity") +
  scale_fill_viridis(discrete = TRUE, option = "E") +
  ggtitle("Company_Size Salary") +
  facet_wrap(~work_year) +
  theme_ipsum() +
  theme(legend.position = "none") +
  xlab("")
ggplotly(Plot7, tooltip = "text")
Insight:
In this project, we created various data visualizations from a dataset on jobs in the data field. These visualizations yield many insights, including:
Machine Learning and AI stands out as a job category in the data field with an average income of more than 175,000 USD. Several factors likely contribute to the high income potential of Machine Learning and AI professionals: high demand, specialized skills, complex work, business impact, and innovation.
Based on average salary from 2021 through 2023, the latest year in the dataset, income for data jobs increased consistently across all experience levels.
By employment_type, full-time positions have a very wide salary range, with a maximum income of 450,000 USD. Freelance work also looks promising, with an average of about 50,000 USD; it could be considered as a main source of income or as a way for full-time mothers to earn additional income.
By job title, the most popular professions are Data Engineer with 2,195 workers, Data Scientist with 1,989, and Data Analyst with 1,388. The increasing reliance on data-driven insights, coupled with advances in technology and a growing demand for skilled professionals, has made data engineering, data science, and data analysis among the most popular jobs in the current job market.
The most common work_setting is in-person (on-site) at 61%; the remaining 39% are remote or hybrid roles that can be done, at least partly, from anywhere.
The region with the most job opportunities in the data field is currently the United States.
Careers in data offer good prospects for advancement. Data skills are currently in high demand at many companies, from small businesses to large enterprises. In 2023, the average income for data jobs at medium-sized companies was higher than at large companies. Data professionals therefore do not necessarily have to work at large companies to earn a high income.
The visualizations we have created show that data visualization is very beneficial in making it easier for us and our audience to extract and understand information from the data.