ASK phase of the study

This case study does not represent a company; its purpose is to explore salaries for entry-level Data Analysts based in the United States. The pertinent stakeholders are therefore not just prospective junior data scientists, but also business recruiters. The data was pulled from one of RANDOMARNAB’s Kaggle datasets (cited at the end of the study), a list of data science-related positions available worldwide as of March 2023. The dataset contained some information that was not needed for the study, such as jobs located outside the United States and positions that required more experience. The key questions for this project are “What keywords should I use when job-hunting?” and “What kind of salary can I expect from (relatively) market-matching companies?”

PREPARE phase of the study

As previously mentioned, the data came from a public dataset on Kaggle (cited below). The dataset was created in March 2023, just three months before this study. Admittedly, it is not as current as the author of this study would prefer, but it is current enough to give a gist of the present job market and market salaries. The data was filtered to contain only information pertinent to the study: US-based, entry-level, virtual positions. Filtering by country keeps the data specific to the stakeholders, while restricting it to virtual positions keeps the results applicable country-wide. The process of cleaning and transforming the data is shown in the R code chunks below.

Setting up my environment

Notes: I loaded the necessary packages: “flexdashboard”, “tinytex”, “rmarkdown”, “tidyverse”, “ggplot2”, “tidyr”, “readr”, “dplyr”, “skimr”, “janitor”, “here”, “formatR”, “readxl”, “packrat”, and “rsconnect”.

library("flexdashboard")
library("tinytex")
library("rmarkdown")
library("tidyverse")
library("ggplot2")
library("tidyr")
library("readr")
library("dplyr")
library("skimr")
library("janitor")
library("here")
library("formatR")
library("readxl")
library("packrat")
library("rsconnect")

Upload the dataframe

data_science_salaries <- read_csv(here("ds_salaries.csv"))

ANALYZE and SHARE phases of the study

Data Cleansing Process

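Checking that column names contain only letters, numbers, and underscores

Next, I used clean_names() from the janitor package to make sure the column names contain only letters, numbers, and underscores.

# clean_names() returns a new dataframe, so assign the result back
data_science_salaries <- clean_names(data_science_salaries)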

Checking for missing data

After that, I ran a short snippet of code to view the dataset at a glance and look for missing data.

skim_without_charts(data_science_salaries)
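As a complement to the skimr summary, missing values can also be counted directly in base R. The following is only a minimal sketch: `toy_salaries` is a small made-up dataframe standing in for the real data, and its values are illustrative assumptions, not figures from the Kaggle dataset.

```r
# Toy stand-in for the salaries data; all values are made up for illustration.
toy_salaries <- data.frame(
  job_title = c("Data Analyst", "Data Engineer", NA, "Data Scientist"),
  salary_in_usd = c(80000, NA, 95000, 120000),
  company_location = c("US", "US", "CA", "US"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
missing_per_column <- colSums(is.na(toy_salaries))
print(missing_per_column)

# Share of rows that have no missing value in any column.
complete_share <- mean(complete.cases(toy_salaries))
print(complete_share)
```

A column whose count is zero has no missing values, which is the same information the skim table reports in its missing-value column.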

Checking for irregular data

I continued the data cleansing process by checking character length in key columns.

# Store the character length of each key column in a new "_leng" column
data_science_salaries$work_year_leng <- nchar(as.character(data_science_salaries$work_year))
data_science_salaries$salary_currency_leng <- nchar(as.character(data_science_salaries$salary_currency))
data_science_salaries$employee_residence_leng <- nchar(as.character(data_science_salaries$employee_residence))
data_science_salaries$company_location_leng <- nchar(as.character(data_science_salaries$company_location))
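One way to turn these length columns into a pass/fail check is to compare each length against the length the column is expected to have. The sketch below assumes four-digit years, three-letter currency codes, and two-letter country codes; those expected lengths, the toy dataframe, and the name `expected_lengths` are assumptions made for illustration.

```r
# Toy stand-in for the salaries data; two codes are deliberately malformed.
toy_salaries <- data.frame(
  work_year = c(2023, 2023, 2022, 2023),
  salary_currency = c("USD", "USD", "EUR", "US"),  # "US" is too short
  company_location = c("US", "US", "DE", "USA"),   # "USA" is too long
  stringsAsFactors = FALSE
)

# Assumed character length for each column (hypothetical, for illustration).
expected_lengths <- c(work_year = 4, salary_currency = 3, company_location = 2)

# For each column, count the values whose length differs from the expectation.
irregular_counts <- sapply(names(expected_lengths), function(column) {
  lengths_found <- nchar(as.character(toy_salaries[[column]]))
  sum(lengths_found != expected_lengths[[column]])
})
print(irregular_counts)
```

A count of zero for every column would indicate that no value has an irregular length.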

Validating data returned from the previous table

Since the results of the code I ran matched the table returned by the skim_without_charts command, I knew I could proceed to the analysis process. Since I live in the United States, the only jobs I needed to look at in the dataset were the US-based ones.

Narrowing search to US-based jobs

us_ds_salaries <- subset(data_science_salaries, company_location == "US")

There were a lot of different job titles listed, so I ran some code to determine exactly how many different job title descriptions were in the list.

Finding job title spreads

n_distinct(us_ds_salaries$job_title)

## [1] 70

Determining the most numerous job titles

That line of code revealed that there were 70 different job titles among the US-based positions. I decided to sort the data further and look only at the job titles that both interested me and were most numerous. Focusing on the most common job titles narrows my search and tells me which keywords to use when looking for a job.

us_ds_salaries %>% 
  count(job_title,sort = TRUE)

## # A tibble: 70 × 2
##    job_title                     n
##    <chr>                     <int>
##  1 Data Engineer               907
##  2 Data Scientist              674
##  3 Data Analyst                525
##  4 Machine Learning Engineer   219
##  5 Data Architect               97
##  6 Analytics Engineer           92
##  7 Applied Scientist            58
##  8 Research Scientist           58
##  9 Data Science Manager         52
## 10 Research Engineer            31
## # ℹ 60 more rows

Common Job Titles

The returned table shows that Data Engineer, Data Scientist, Data Analyst, and Machine Learning Engineer are the four most common job titles. I decided to home in on those and continue my analysis by creating a new dataframe that included only those 4 job titles.

most_common_us_job_titles <-
  data_science_salaries[data_science_salaries$company_location == 'US' &
                data_science_salaries$job_title %in% c('Data Engineer',
                                             'Data Scientist',
                                             'Data Analyst',
                                             'Machine Learning Engineer'), ]

From there, I created a barplot to visualize the spread of each job title across the new dataframe I had created.
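The plotting code itself is not shown in this write-up. One way such a bar plot could be drawn with ggplot2 is sketched below; the toy dataframe `toy_job_titles`, the plot title, and the axis labels are assumptions for illustration, not the original chart.

```r
library(ggplot2)

# Toy stand-in for most_common_us_job_titles; the rows are made up.
toy_job_titles <- data.frame(
  job_title = c("Data Engineer", "Data Engineer", "Data Scientist",
                "Data Analyst", "Data Engineer", "Machine Learning Engineer"),
  stringsAsFactors = FALSE
)

# geom_bar() counts the rows in each job_title category and draws one bar each.
job_title_plot <- ggplot(toy_job_titles, aes(x = job_title)) +
  geom_bar() +
  labs(
    title = "Most common US job titles (toy data)",
    x = "Job title",
    y = "Number of positions"
  ) +
  coord_flip()  # horizontal bars keep long job titles readable

print(job_title_plot)
```

With the real data, `toy_job_titles` would be replaced by the filtered dataframe created above.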

Refining the search

Since I am using this project as a first look at finding my first job in data analytics, I needed to further refine my search by creating yet another dataframe which included only entry-level positions.

entry_level_jobs <- subset(most_common_us_job_titles, experience_level == "EN")

Additionally, since the United States is a large area in which to find a job and the dataset does not specify the state or city, it is best to look for a work-from-home position, so I created another dataframe which included only virtual positions.

# a remote_ratio of 100 marks a fully remote position; compare it as a number
remote_entry_level <- subset(entry_level_jobs, remote_ratio == 100)

After that, I created a summative barplot showing the distribution of the most common job titles that are US-based, virtual, and entry-level.
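The counts behind such a plot can also be produced by a single dplyr pipeline that applies all of the filters described above at once. This is a sketch under stated assumptions: the toy dataframe and the names `common_titles` and `remote_entry_level_counts` are hypothetical and not part of the original analysis.

```r
library(dplyr)

# Toy stand-in for the salaries data; the rows are made up for illustration.
toy_salaries <- data.frame(
  job_title = c("Data Analyst", "Data Analyst", "Data Engineer",
                "Data Scientist", "Data Architect", "Data Analyst"),
  company_location = c("US", "US", "US", "CA", "US", "US"),
  experience_level = c("EN", "SE", "EN", "EN", "EN", "EN"),
  remote_ratio = c(100, 100, 100, 100, 100, 0),
  stringsAsFactors = FALSE
)

# The four job titles the study focuses on.
common_titles <- c("Data Engineer", "Data Scientist",
                   "Data Analyst", "Machine Learning Engineer")

# Keep US-based, entry-level, fully remote rows, then count them per title.
remote_entry_level_counts <- toy_salaries %>%
  filter(
    company_location == "US",
    job_title %in% common_titles,
    experience_level == "EN",
    remote_ratio == 100
  ) %>%
  count(job_title, sort = TRUE)

print(remote_entry_level_counts)
```

Each row of the result corresponds to one bar in the summary plot, with `n` as the bar height.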

Finally, it is important in job interviews to know how much an entry-level position should pay, so I calculated the mean, median, and standard deviation of the salaries for these entry-level positions.

mean_salary <-
  mean(remote_entry_level$salary_in_usd) # the mean salary is $84,105.83
median_salary <-
  median(remote_entry_level$salary_in_usd) # the median salary is $80,000
salary_standard_deviation <-
  sd(remote_entry_level$salary_in_usd, na.rm = TRUE) # the standard deviation is $32,891.17

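Since the mean and median salaries are close to each other (they differ by about 4,100 USD, well within one standard deviation), it is reasonable to expect an entry-level salary of around 80,000 USD. The standard deviation of roughly 32,900 USD does mean that individual offers can vary widely around that figure.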
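The comparison between the mean and the median can be made explicit with a little arithmetic. The sketch below reuses the three figures reported in the code comments above; the names `gap`, `gap_in_sd`, and `coefficient_of_variation` are introduced here for illustration only.

```r
# Figures reported earlier in this study (in USD).
mean_salary <- 84105.83
median_salary <- 80000
salary_standard_deviation <- 32891.17

# How far apart are the mean and the median, in dollars?
gap <- mean_salary - median_salary

# The same gap expressed in units of one standard deviation.
gap_in_sd <- gap / salary_standard_deviation

# Spread of the salaries relative to the mean.
coefficient_of_variation <- salary_standard_deviation / mean_salary

print(c(gap = gap, gap_in_sd = gap_in_sd,
        coefficient_of_variation = coefficient_of_variation))
```

The gap works out to roughly 0.12 standard deviations, while the standard deviation itself is about 39% of the mean, so the center of the distribution is stable even though individual salaries are spread out.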

ACT phase of the study

Based on the study, a smart move when job-hunting is to search for the job titles “Data Analyst”, “Data Engineer”, “Data Scientist”, and “Machine Learning Engineer”, since those are the most common job titles in the dataset. Once a candidate has secured an interview, it is wise for them to know that the market salary for US-based, virtual, entry-level positions as of March 2023 is about 80,000 USD. These pieces of information will not only help candidates find jobs, but also prevent them from being sold short in salary discussions.

RANDOMARNAB. (2023). Data Science Salaries 2023. Kaggle. https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023/code

US-Based Entry-Level Data Analyst Position Recommendation and Important Info to Know

Peter Schmuldt

2023-06-29
