Edward Harvey, Yecheng Cao, Xinyang Liu
2023-05-09
As we and fellow members of our class pursue careers related to data, we thought it would be interesting to explore the relationship between salary and various job characteristics. Significant findings could potentially influence career decisions for ourselves, our peers, and our future employers. In particular, we wanted to explore the relationship between salary and:
- Fully remote vs. not fully remote work
- Level of experience
- Full-time work vs. part-time or freelance/contract work
- Geographic location
In addition, we wanted to develop an app that lets you estimate your future salary in data science based on your choices of various job characteristics.
AI, ML, Data salaries: Salary trends in AI, ML, Data around the world from 2020-2023
The dataset draws on contributions from around the world and has been collected and updated continuously (usually on a weekly basis) from 2020 to the present. It is published in the public domain, so users can access and download it easily.
Downloaded from Kaggle.com
| experience_level | employment_type | job_title | salary_currency | salary_in_usd | employee_residence | company_location | company_size | remote_status |
|---|---|---|---|---|---|---|---|---|
| MI | FT | Other | USD | 258000 | US | US | L | Fully Remote |
| SE | FT | Data Scientist | USD | 225000 | US | US | M | Not Fully Remote |
| SE | FT | Data Scientist | USD | 156400 | US | US | M | Not Fully Remote |
| SE | FT | Data Engineer | USD | 190000 | US | US | M | Fully Remote |
| SE | FT | Data Engineer | USD | 150000 | US | US | M | Fully Remote |
| SE | FT | Data Scientist | USD | 196000 | US | US | M | Not Fully Remote |
| SE | FT | Data Scientist | USD | 121000 | US | US | M | Not Fully Remote |
| SE | FT | Data Scientist | USD | 219000 | US | US | M | Not Fully Remote |
| SE | FT | Data Scientist | USD | 141000 | US | US | M | Not Fully Remote |
| SE | FT | Data Engineer | USD | 230000 | US | US | M | Not Fully Remote |
| SE | FT | Data Engineer | USD | 206000 | US | US | M | Not Fully Remote |
| SE | FT | Other | USD | 192000 | US | US | M | Not Fully Remote |
| SE | FT | Other | USD | 164000 | US | US | M | Not Fully Remote |
| MI | FT | Machine Learning Engineer | USD | 300000 | Other | Other | M | Not Fully Remote |
| MI | FT | Machine Learning Engineer | USD | 260000 | Other | Other | M | Not Fully Remote |
| SE | FT | Data Analyst | USD | 147000 | US | US | M | Fully Remote |
| SE | FT | Data Analyst | USD | 92000 | US | US | M | Fully Remote |
| SE | FT | Machine Learning Engineer | USD | 200000 | US | US | M | Not Fully Remote |
We took a first look at the data to check for outliers in salary.
There are some salaries that are surprisingly low, but perhaps not outliers in the sense that they are obviously data errors.
| salary_in_usd | company_location | salary_currency | employment_type |
|---|---|---|---|
| 7799 | Other | BRL | FT |
| 7500 | Other | USD | CT |
| 7000 | Other | USD | FT |
| 6359 | Other | INR | FT |
| 6304 | Other | EUR | FT |
| 6270 | Other | BRL | FT |
| 6072 | Other | INR | FT |
| 6072 | Other | INR | FT |
| 5882 | Other | INR | FT |
| 5723 | Other | INR | FT |
| 5707 | Other | INR | FT |
| 5679 | US | INR | FT |
| 5409 | Other | INR | FT |
| 5409 | Other | INR | PT |
| 5132 | Other | CZK | FT |
Filtering the data, we see that many of these are from lower-income countries, such as India, the Philippines, Brazil, and the Czech Republic, or are contract work in the US or Europe. While there are a few surprisingly low incomes in the US and Europe, we will not exclude them.
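The screening step described above can be sketched in a few lines. This is a minimal Python illustration, not the code used for this report: the rows are a subset copied from the table above, and the helper names `lowest_salaries` and `currency_counts` are hypothetical.

```python
from collections import Counter

# Illustrative subset of the rows shown in the table above, as
# (salary_in_usd, company_location, salary_currency, employment_type).
# In practice these would come from the full downloaded dataset.
rows = [
    (7799, "Other", "BRL", "FT"),
    (7500, "Other", "USD", "CT"),
    (6359, "Other", "INR", "FT"),
    (5679, "US", "INR", "FT"),
    (5409, "Other", "INR", "PT"),
    (5132, "Other", "CZK", "FT"),
]


def lowest_salaries(records, n):
    """Return the n records with the smallest salary_in_usd, ascending."""
    return sorted(records, key=lambda r: r[0])[:n]


def currency_counts(records):
    """Count records per salary_currency to see where low salaries cluster."""
    return Counter(r[2] for r in records)


if __name__ == "__main__":
    for record in lowest_salaries(rows, 3):
        print(record)
    print(currency_counts(rows).most_common())
```

Looking at the currency and employment type alongside the salary is what lets us separate plausible low salaries (local-currency pay, contract or part-time work) from likely data errors.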
Mid-sized companies pay slightly more
US pays the most, but Canada shows high variability
Experience pays, but not uniformly. There is considerable overlap between experience levels.
With experience_level on the X-axis in each plot, the observations listed below emerge:
(Y-axis: experience_level) Executive-level (EX, called expert-level below) and Senior-level (SE) professionals earn the highest average salaries, whereas Entry-level (EN) individuals receive the lowest.
(Y-axis: company_location) In the US, GB, and other countries, EX-level employees have the highest earnings, while in Canada SE-level professionals earn even more than EX-level workers.
(Y-axis: company_size) In terms of company size, employees in medium-sized firms have the highest overall earnings. Large organizations offer higher salaries to entry-level, junior-level, and senior-level workers than small companies, but expert-level employees in small firms earn more than those in large organizations.
(Y-axis: remote_ratio) Data scientists with minimal remote work (less than 20%) have the highest overall salaries, followed by those working entirely remotely (over 80%). Expert-level employees, however, earn the most in fully remote companies. Partially remote companies provide the lowest compensation.
(Y-axis: employment_type) EX and SE-level employees command the highest salaries in Contract (CT) roles, while EN and MI-level workers earn the most in Full-time (FT) positions. Part-time (PT) and Freelance (FL) employees generally receive significantly lower pay.
(Y-axis: job_title) With regard to job titles, expert-level data engineers, other professionals, and data scientists are the highest earners. Data analysts typically receive the lowest compensation. Data engineers enjoy near-top salaries across all positions, from entry-level to expert-level.
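Each observation above compares mean salary_in_usd across experience_level and one other variable. A minimal Python sketch of that grouped-mean calculation follows. It is an illustration under stated assumptions, not the code behind the plots: it uses only the 18 preview rows from the sample table earlier in this report (which are not representative of the full dataset), takes remote_status as the second variable, and the function name `mean_salary_by` is hypothetical.

```python
from collections import defaultdict

# The 18 preview rows from the sample table earlier in this report, as
# (experience_level, remote_status, salary_in_usd).
sample = [
    ("MI", "Fully Remote", 258000),
    ("SE", "Not Fully Remote", 225000),
    ("SE", "Not Fully Remote", 156400),
    ("SE", "Fully Remote", 190000),
    ("SE", "Fully Remote", 150000),
    ("SE", "Not Fully Remote", 196000),
    ("SE", "Not Fully Remote", 121000),
    ("SE", "Not Fully Remote", 219000),
    ("SE", "Not Fully Remote", 141000),
    ("SE", "Not Fully Remote", 230000),
    ("SE", "Not Fully Remote", 206000),
    ("SE", "Not Fully Remote", 192000),
    ("SE", "Not Fully Remote", 164000),
    ("MI", "Not Fully Remote", 300000),
    ("MI", "Not Fully Remote", 260000),
    ("SE", "Fully Remote", 147000),
    ("SE", "Fully Remote", 92000),
    ("SE", "Not Fully Remote", 200000),
]


def mean_salary_by(records):
    """Return {(experience_level, remote_status): mean salary_in_usd}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, row count]
    for level, status, salary in records:
        totals[(level, status)][0] += salary
        totals[(level, status)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


if __name__ == "__main__":
    for key, mean in sorted(mean_salary_by(sample).items()):
        print(key, round(mean))
```

Swapping the second tuple field for company_size, employment_type, or job_title gives the other comparisons in the same way.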
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| company_location | 3 | 3.824557e+12 | 1.274852e+12 | 507.67 | 0.00 |
| company_size | 2 | 8.471606e+10 | 4.235803e+10 | 16.87 | 0.00 |
| employment_type | 3 | 5.981473e+10 | 1.993824e+10 | 7.94 | 0.00 |
| experience_level | 3 | 1.464579e+12 | 4.881931e+11 | 194.41 | 0.00 |
| job_title | 12 | 9.444936e+11 | 7.870780e+10 | 31.34 | 0.00 |
| remote_status | 1 | 4.792904e+09 | 4.792904e+09 | 1.91 | 0.17 |
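As a reading aid for the ANOVA table, the columns are linked by two standard identities: Mean Sq = Sum Sq / Df, and F value = Mean Sq / residual Mean Sq. The residual row is not shown in the table, so the short Python sketch below only checks the first identity against the printed numbers and infers the residual mean square from the second. The function names are hypothetical and the values are copied from the table.

```python
import math

# (term, Df, Sum Sq, Mean Sq, F value), copied from the ANOVA table above.
ANOVA_ROWS = [
    ("company_location", 3, 3.824557e12, 1.274852e12, 507.67),
    ("company_size", 2, 8.471606e10, 4.235803e10, 16.87),
    ("employment_type", 3, 5.981473e10, 1.993824e10, 7.94),
    ("experience_level", 3, 1.464579e12, 4.881931e11, 194.41),
    ("job_title", 12, 9.444936e11, 7.870780e10, 31.34),
    ("remote_status", 1, 4.792904e09, 4.792904e09, 1.91),
]


def mean_square(sum_sq, df):
    """Mean Sq is the sum of squares divided by its degrees of freedom."""
    return sum_sq / df


def implied_residual_mean_square(mean_sq, f_value):
    """Since F = Mean Sq / residual Mean Sq, invert to infer the residual."""
    return mean_sq / f_value


if __name__ == "__main__":
    for term, df, sum_sq, mean_sq, f_value in ANOVA_ROWS:
        consistent = math.isclose(mean_square(sum_sq, df), mean_sq, rel_tol=1e-6)
        residual = implied_residual_mean_square(mean_sq, f_value)
        print(f"{term:18s} consistent={consistent} residual_ms={residual:.3e}")
```

Every row implies roughly the same residual mean square (about 2.5e9, up to rounding in the F column), which is what we would expect from a single fitted model. With that shared denominator, remote_status has by far the smallest F value and is the only term whose p-value (0.17) is above the usual 0.05 threshold.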
There is a lack of uniformity in job titles across the industry. Establishing a grouping of titles or some other standard of comparison could make for more meaningful analysis.
Salary is only one form of compensation. There may be other forms, such as health insurance, vacation time, or interesting work, that are not accounted for here.
More granular employee residence information would be helpful for comparing salary against cost of living. If state-level data can be incorporated, it could be visualized with “plot_mapbox()”.
Years of experience would be a valuable quantitative variable to add to the analysis.