Overview

In R, we can analyse what we want with a variety of datasets. In this case, we can use tidyverse package. This tidyverse is a package that supports data from pre-processing stages to data loading, filtering, aggregation, and storage DB connections. There is even a case of searching while modeling data in the data and preprocessing search step. In addition, tidyverse can search a large number of data using the existing model in R Base, which is very fast. This file will research about dplyr package, one of the packages included in the tidyverse package. This report is written through Kaggle’s dataset in the link below.

https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries

Package installation (tidyverse)

First of all, for reading the dataset, and since we need the tidyverse package to proceed with this analysis, look at the package installation process as below.

if (!require("tidyverse")) install.packages("tidyverse")

## Loading required package: tidyverse

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

View the package configuration

We can see all the functions included in tidyverse through the command below.

tidyverse_packages() # tidyverse package includes dplyr package

##  [1] "broom"         "cli"           "crayon"        "dbplyr"       
##  [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
##  [9] "googledrive"   "googlesheets4" "haven"         "hms"          
## [13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
## [17] "modelr"        "pillar"        "purrr"         "readr"        
## [21] "readxl"        "reprex"        "rlang"         "rstudioapi"   
## [25] "rvest"         "stringr"       "tibble"        "tidyr"        
## [29] "xml2"          "tidyverse"

Loading the package

Loads to use the embedded packages and functions in tidyverse.

library(tidyverse)

Reading dataset (readr package)

The readr package is designed to easily and conveniently read grid data such as csv, tsv, and fwf.

ds_salaries <- read.csv("C:/Users/kms57/Desktop/36103 Statistical Thinking for Data Science/ds_salaries.csv")

Check the data form that was loaded using the str() function as shown below.

str(ds_salaries)

## 'data.frame':    607 obs. of  12 variables:
##  $ X                 : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ work_year         : int  2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : chr  "MI" "SE" "SE" "MI" ...
##  $ employment_type   : chr  "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr  "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
##  $ salary            : int  70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
##  $ salary_currency   : chr  "EUR" "USD" "GBP" "USD" ...
##  $ salary_in_usd     : int  79833 260000 109024 20000 150000 72000 190000 35735 135000 125000 ...
##  $ employee_residence: chr  "DE" "JP" "GB" "HN" ...
##  $ remote_ratio      : int  0 0 50 0 50 100 100 50 100 50 ...
##  $ company_location  : chr  "DE" "JP" "GB" "HN" ...
##  $ company_size      : chr  "L" "S" "M" "S" ...

Data processing

The dplyr package is required for this process, but since it is already included in tidyverse, additional library calls are required.

Data preprocessing - Process data into whatever shape you want

Extract data that meets the requirements

1. Extract only rows that meet the criteria and create data. - filter()

da <- ds_salaries %>% filter(work_year == 2022 & job_title == "Data Analyst")
ds <- ds_salaries %>% filter(work_year == 2022 & job_title == "Data Scientist")

mean(da$salary_in_usd) # Data Analyst Average Annual Salary in USD, 2022

## [1] 100550.7

mean(ds$salary_in_usd) # Data Scientist Average Annual Salary in USD, 2022

## [1] 136172.1

2. Extract only columns(variables) that meet the criteria and leave out a large amount of data. - select()

ds_salaries %>% 
  select(work_year, job_title, salary_in_usd) %>% 
  head # Extract up to the first six lines

##   work_year                  job_title salary_in_usd
## 1      2020             Data Scientist         79833
## 2      2020 Machine Learning Scientist        260000
## 3      2020          Big Data Engineer        109024
## 4      2020       Product Data Analyst         20000
## 5      2020  Machine Learning Engineer        150000
## 6      2020               Data Analyst         72000

# When the amount of data is too large, head (the desired number of output data) can be used to output as many data as desired and head is from the beginning of the data, whereas tail can be checked from the end.

3. Arrange data in order.- arrange()

ds_salaries %>% arrange(work_year,salary_in_usd) %>% 
  head # Sort in ascending order

##    X work_year experience_level employment_type               job_title salary
## 1 18      2020               EN              FT Data Science Consultant 423000
## 2 21      2020               MI              FT    Product Data Analyst 450000
## 3 50      2020               EN              FT            Data Analyst 450000
## 4 15      2020               MI              FT            Data Analyst   8000
## 5 38      2020               EN              FT            Data Analyst  10000
## 6 45      2020               EN              PT             ML Engineer  14000
##   salary_currency salary_in_usd employee_residence remote_ratio
## 1             INR          5707                 IN           50
## 2             INR          6072                 IN          100
## 3             INR          6072                 IN            0
## 4             USD          8000                 PK           50
## 5             USD         10000                 NG          100
## 6             EUR         15966                 DE          100
##   company_location company_size
## 1               IN            M
## 2               IN            L
## 3               IN            S
## 4               PK            L
## 5               NG            S
## 6               DE            S

ds_salaries %>% arrange(desc(work_year),desc(salary_in_usd)) %>% 
  head # Sort in descending order

##     X work_year experience_level employment_type              job_title salary
## 1 523      2022               SE              FT    Data Analytics Lead 405000
## 2 519      2022               SE              FT Applied Data Scientist 380000
## 3 482      2022               EX              FT          Data Engineer 324000
## 4 534      2022               SE              FT         Data Architect 266400
## 5 416      2022               SE              FT         Data Scientist 260000
## 6 337      2022               SE              FT          Data Engineer 243900
##   salary_currency salary_in_usd employee_residence remote_ratio
## 1             USD        405000                 US          100
## 2             USD        380000                 US          100
## 3             USD        324000                 US          100
## 4             USD        266400                 US          100
## 5             USD        260000                 US          100
## 6             USD        243900                 US          100
##   company_location company_size
## 1               US            L
## 2               US            L
## 3               US            M
## 4               US            M
## 5               US            M
## 6               US            M

4. Add derivative variable. - mutate()

ds_salaries %>%
  mutate(salary_in_aud = salary_in_usd * 1.40) %>% 
  head

##   X work_year experience_level employment_type                  job_title
## 1 0      2020               MI              FT             Data Scientist
## 2 1      2020               SE              FT Machine Learning Scientist
## 3 2      2020               SE              FT          Big Data Engineer
## 4 3      2020               MI              FT       Product Data Analyst
## 5 4      2020               SE              FT  Machine Learning Engineer
## 6 5      2020               EN              FT               Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  70000             EUR         79833                 DE            0
## 2 260000             USD        260000                 JP            0
## 3  85000             GBP        109024                 GB           50
## 4  20000             USD         20000                 HN            0
## 5 150000             USD        150000                 US           50
## 6  72000             USD         72000                 US          100
##   company_location company_size salary_in_aud
## 1               DE            L      111766.2
## 2               JP            S      364000.0
## 3               GB            M      152633.6
## 4               HN            S       28000.0
## 5               US            L      210000.0
## 6               US            L      100800.0

5. Summarise by group. - group_by() and summarise()

ds_salaries %>% 
  group_by(job_title) %>% 
  summarise(mean_salaries = mean(salary_in_usd))

## # A tibble: 50 × 2
##    job_title                          mean_salaries
##    <chr>                                      <dbl>
##  1 3D Computer Vision Researcher              5409 
##  2 AI Scientist                              66136.
##  3 Analytics Engineer                       175000 
##  4 Applied Data Scientist                   175655 
##  5 Applied Machine Learning Scientist       142069.
##  6 BI Data Analyst                           74755.
##  7 Big Data Architect                        99703 
##  8 Big Data Engineer                         51974 
##  9 Business Data Analyst                     76691.
## 10 Cloud Data Engineer                      124647 
## # … with 40 more rows
## # ℹ Use `print(n = ...)` to see more rows

An example of a complex use of functions

ds_salaries %>%  
  filter(work_year == 2022) %>%
  select(work_year, job_title, salary_in_usd, employee_residence) %>%
  arrange(salary_in_usd) %>%
  mutate(salary_in_aud = salary_in_usd * 1.40) %>%
  group_by(job_title) %>% 
  summarise(mean_salaries_in_aud = mean(salary_in_aud),
            median_salaries = median(salary_in_aud),
            number_of_samples = n()) #Annual salary by data occupation in 2022

## # A tibble: 33 × 4
##    job_title                          mean_salaries_in_aud median_sala…¹ numbe…²
##    <chr>                                             <dbl>         <dbl>   <int>
##  1 AI Scientist                                    224000        224000        2
##  2 Analytics Engineer                              245000        251790        4
##  3 Applied Data Scientist                          333200        247800        3
##  4 Applied Machine Learning Scientist               74812.        74812.       2
##  5 Business Data Analyst                            62548.        62548.       2
##  6 Computer Vision Engineer                         94500         94500        2
##  7 Computer Vision Software Engineer               210000        210000        1
##  8 Data Analyst                                    140771.       147000       73
##  9 Data Analytics Engineer                          28000         28000        1
## 10 Data Analytics Lead                             567000        567000        1
## # … with 23 more rows, and abbreviated variable names ¹median_salaries,
## #   ²number_of_samples
## # ℹ Use `print(n = ...)` to see more rows

6. Consolidate data. - left_join() and bind_rows()

For these funtions, two or more datasets are required, so let’s create and combine simple data.

Combine data horizontally - left_join() function adds columns as a function of combining column data.

area1 <- data.frame(code = c(1,2,3),
                    america = c("NA","CA","SA"))
area2 <- data.frame(code = c(1,2,3),
                    asia = c("ENA","MEA","ESA"))

area1

##   code america
## 1    1      NA
## 2    2      CA
## 3    3      SA

area2

##   code asia
## 1    1  ENA
## 2    2  MEA
## 3    3  ESA

total_horizontal <- left_join(area1, area2, by = "code")
total_horizontal

##   code america asia
## 1    1      NA  ENA
## 2    2      CA  MEA
## 3    3      SA  ESA

Consolidate data vertically - bind_rows() function adds rows as a function of combining row data.

zone1 <- data.frame(code = c(1,2,3),
                    continent = c("NA","CA","SA"))
zone2 <- data.frame(code = c(4,5,6),
                    continent = c("ENA","MEA","ESA"))

zone1

##   code continent
## 1    1        NA
## 2    2        CA
## 3    3        SA

zone2

##   code continent
## 1    4       ENA
## 2    5       MEA
## 3    6       ESA

total_vertical <- bind_rows(zone1, zone2)
total_vertical

##   code continent
## 1    1        NA
## 2    2        CA
## 3    3        SA
## 4    4       ENA
## 5    5       MEA
## 6    6       ESA

AT1 - Utilise dataset using dplyr package

STDS 2022 Spring

Minsu Kim

13.08.2022