In R, we can analyse what we want with a variety of datasets. In this case, we can use tidyverse package. This tidyverse is a package that supports data from pre-processing stages to data loading, filtering, aggregation, and storage DB connections. There is even a case of searching while modeling data in the data and preprocessing search step. In addition, tidyverse can search a large number of data using the existing model in R Base, which is very fast. This file will research about dplyr package, one of the packages included in the tidyverse package. This report is written through Kaggle’s dataset in the link below.
https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries
First of all, for reading the dataset, and since we need the tidyverse package to proceed with this analysis, look at the package installation process as below.
if (!require("tidyverse")) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
We can see all the functions included in tidyverse through the command below.
tidyverse_packages() # tidyverse package includes dplyr package
## [1] "broom" "cli" "crayon" "dbplyr"
## [5] "dplyr" "dtplyr" "forcats" "ggplot2"
## [9] "googledrive" "googlesheets4" "haven" "hms"
## [13] "httr" "jsonlite" "lubridate" "magrittr"
## [17] "modelr" "pillar" "purrr" "readr"
## [21] "readxl" "reprex" "rlang" "rstudioapi"
## [25] "rvest" "stringr" "tibble" "tidyr"
## [29] "xml2" "tidyverse"
Loads to use the embedded packages and functions in tidyverse.
library(tidyverse)
The readr package is designed to easily and conveniently read grid data such as csv, tsv, and fwf.
ds_salaries <- read.csv("C:/Users/kms57/Desktop/36103 Statistical Thinking for Data Science/ds_salaries.csv")
Check the data form that was loaded using the str() function as shown below.
str(ds_salaries)
## 'data.frame': 607 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ work_year : int 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
## $ experience_level : chr "MI" "SE" "SE" "MI" ...
## $ employment_type : chr "FT" "FT" "FT" "FT" ...
## $ job_title : chr "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
## $ salary : int 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
## $ salary_currency : chr "EUR" "USD" "GBP" "USD" ...
## $ salary_in_usd : int 79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
## $ employee_residence: chr "DE" "JP" "GB" "HN" ...
## $ remote_ratio : int 0 0 50 0 50 100 100 50 100 50 ...
## $ company_location : chr "DE" "JP" "GB" "HN" ...
## $ company_size : chr "L" "S" "M" "S" ...
The dplyr package is required for this process, but since it is already included in tidyverse, additional library calls are required.
da <- ds_salaries %>% filter(work_year == 2022 & job_title == "Data Analyst")
ds <- ds_salaries %>% filter(work_year == 2022 & job_title == "Data Scientist")
mean(da$salary_in_usd) # Data Analyst Average Annual Salary in USD, 2022
## [1] 100550.7
mean(ds$salary_in_usd) # Data Scientist Average Annual Salary in USD, 2022
## [1] 136172.1
ds_salaries %>%
select(work_year, job_title, salary_in_usd) %>%
head # Extract up to the first six lines
## work_year job_title salary_in_usd
## 1 2020 Data Scientist 79833
## 2 2020 Machine Learning Scientist 260000
## 3 2020 Big Data Engineer 109024
## 4 2020 Product Data Analyst 20000
## 5 2020 Machine Learning Engineer 150000
## 6 2020 Data Analyst 72000
# When the amount of data is too large, head (the desired number of output data) can be used to output as many data as desired and head is from the beginning of the data, whereas tail can be checked from the end.
ds_salaries %>% arrange(work_year,salary_in_usd) %>%
head # Sort in ascending order
## X work_year experience_level employment_type job_title salary
## 1 18 2020 EN FT Data Science Consultant 423000
## 2 21 2020 MI FT Product Data Analyst 450000
## 3 50 2020 EN FT Data Analyst 450000
## 4 15 2020 MI FT Data Analyst 8000
## 5 38 2020 EN FT Data Analyst 10000
## 6 45 2020 EN PT ML Engineer 14000
## salary_currency salary_in_usd employee_residence remote_ratio
## 1 INR 5707 IN 50
## 2 INR 6072 IN 100
## 3 INR 6072 IN 0
## 4 USD 8000 PK 50
## 5 USD 10000 NG 100
## 6 EUR 15966 DE 100
## company_location company_size
## 1 IN M
## 2 IN L
## 3 IN S
## 4 PK L
## 5 NG S
## 6 DE S
ds_salaries %>% arrange(desc(work_year),desc(salary_in_usd)) %>%
head # Sort in descending order
## X work_year experience_level employment_type job_title salary
## 1 523 2022 SE FT Data Analytics Lead 405000
## 2 519 2022 SE FT Applied Data Scientist 380000
## 3 482 2022 EX FT Data Engineer 324000
## 4 534 2022 SE FT Data Architect 266400
## 5 416 2022 SE FT Data Scientist 260000
## 6 337 2022 SE FT Data Engineer 243900
## salary_currency salary_in_usd employee_residence remote_ratio
## 1 USD 405000 US 100
## 2 USD 380000 US 100
## 3 USD 324000 US 100
## 4 USD 266400 US 100
## 5 USD 260000 US 100
## 6 USD 243900 US 100
## company_location company_size
## 1 US L
## 2 US L
## 3 US M
## 4 US M
## 5 US M
## 6 US M
ds_salaries %>%
mutate(salary_in_aud = salary_in_usd * 1.40) %>%
head
## X work_year experience_level employment_type job_title
## 1 0 2020 MI FT Data Scientist
## 2 1 2020 SE FT Machine Learning Scientist
## 3 2 2020 SE FT Big Data Engineer
## 4 3 2020 MI FT Product Data Analyst
## 5 4 2020 SE FT Machine Learning Engineer
## 6 5 2020 EN FT Data Analyst
## salary salary_currency salary_in_usd employee_residence remote_ratio
## 1 70000 EUR 79833 DE 0
## 2 260000 USD 260000 JP 0
## 3 85000 GBP 109024 GB 50
## 4 20000 USD 20000 HN 0
## 5 150000 USD 150000 US 50
## 6 72000 USD 72000 US 100
## company_location company_size salary_in_aud
## 1 DE L 111766.2
## 2 JP S 364000.0
## 3 GB M 152633.6
## 4 HN S 28000.0
## 5 US L 210000.0
## 6 US L 100800.0
ds_salaries %>%
group_by(job_title) %>%
summarise(mean_salaries = mean(salary_in_usd))
## # A tibble: 50 × 2
## job_title mean_salaries
## <chr> <dbl>
## 1 3D Computer Vision Researcher 5409
## 2 AI Scientist 66136.
## 3 Analytics Engineer 175000
## 4 Applied Data Scientist 175655
## 5 Applied Machine Learning Scientist 142069.
## 6 BI Data Analyst 74755.
## 7 Big Data Architect 99703
## 8 Big Data Engineer 51974
## 9 Business Data Analyst 76691.
## 10 Cloud Data Engineer 124647
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows
ds_salaries %>%
filter(work_year == 2022) %>%
select(work_year, job_title, salary_in_usd, employee_residence) %>%
arrange(salary_in_usd) %>%
mutate(salary_in_aud = salary_in_usd * 1.40) %>%
group_by(job_title) %>%
summarise(mean_salaries_in_aud = mean(salary_in_aud),
median_salaries = median(salary_in_aud),
number_of_samples = n()) #Annual salary by data occupation in 2022
## # A tibble: 33 × 4
## job_title mean_salaries_in_aud median_sala…¹ numbe…²
## <chr> <dbl> <dbl> <int>
## 1 AI Scientist 224000 224000 2
## 2 Analytics Engineer 245000 251790 4
## 3 Applied Data Scientist 333200 247800 3
## 4 Applied Machine Learning Scientist 74812. 74812. 2
## 5 Business Data Analyst 62548. 62548. 2
## 6 Computer Vision Engineer 94500 94500 2
## 7 Computer Vision Software Engineer 210000 210000 1
## 8 Data Analyst 140771. 147000 73
## 9 Data Analytics Engineer 28000 28000 1
## 10 Data Analytics Lead 567000 567000 1
## # … with 23 more rows, and abbreviated variable names ¹median_salaries,
## # ²number_of_samples
## # ℹ Use `print(n = ...)` to see more rows
For these funtions, two or more datasets are required, so let’s create and combine simple data.
area1 <- data.frame(code = c(1,2,3),
america = c("NA","CA","SA"))
area2 <- data.frame(code = c(1,2,3),
asia = c("ENA","MEA","ESA"))
area1
## code america
## 1 1 NA
## 2 2 CA
## 3 3 SA
area2
## code asia
## 1 1 ENA
## 2 2 MEA
## 3 3 ESA
total_horizontal <- left_join(area1, area2, by = "code")
total_horizontal
## code america asia
## 1 1 NA ENA
## 2 2 CA MEA
## 3 3 SA ESA
zone1 <- data.frame(code = c(1,2,3),
continent = c("NA","CA","SA"))
zone2 <- data.frame(code = c(4,5,6),
continent = c("ENA","MEA","ESA"))
zone1
## code continent
## 1 1 NA
## 2 2 CA
## 3 3 SA
zone2
## code continent
## 1 4 ENA
## 2 5 MEA
## 3 6 ESA
total_vertical <- bind_rows(zone1, zone2)
total_vertical
## code continent
## 1 1 NA
## 2 2 CA
## 3 3 SA
## 4 4 ENA
## 5 5 MEA
## 6 6 ESA