For this project, we research data scientists’ career outlook, necessary skills and salaries. Data were collected from the Indeed, O*NET and Bureau of Labor Statistics websites by web scraping with R.
Hui Han (Gracie) and I (Jun Pan) focused on analyzing 31 data science related jobs and the hard skills they require.
First, the required skills for 31 data science related jobs were downloaded from the O*NET website and saved in a GitHub repository. Load the CSV files for occupations and skills from GitHub:
f1<- read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_11-3111-00.csv")
f3<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1141-00.csv")
f4<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1161-00.csv")
f5<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2011-02.csv")
f6<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2041-00.csv")
f7<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2051-00.csv")
f8<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2053-00.csv")
f9<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2099-02.csv")
f10<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1111-00.csv")
f11<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1121-00.csv")
f12<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1131-00.csv")
f13<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1133-00.csv")
f14<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1134-00.csv")
f15<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1141-00.csv")
f17<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2021-00.csv")
f18<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2031-00.csv")
f19<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-00.csv")
f20<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-01.csv")
f21<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-02.csv")
f22<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-2099-01.csv")
f23<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3011-00.csv")
f24<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3022-00.csv")
f25<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-4061-00.csv")
f26<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-1021-00.csv")
f27<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4011-00.csv")
f28<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4012-00.csv")
f29<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_43-9011-00.csv")
f30<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_49-2011-00.csv")
f31<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-9011-00.csv")
f32<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1151-00.csv")
f33<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2011-00.csv")
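The repeated calls above differ only in the occupation code, so the same loading step could be sketched as a loop. This is a minimal sketch, not the code used for this report: `read_skill_files` is a hypothetical helper, only three of the codes are listed, and the download line is left commented out because it needs network access. In R versions before 4.0, `read.csv` converts strings to factors by default; passing `stringsAsFactors = FALSE` keeps them as character vectors.

```r
# Base URL of the folder holding the per-occupation CSV files (copied from the calls above)
base_url <- "https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/"

# Three of the O*NET occupation codes used above; extend this vector with the rest
codes <- c("11-3111-00", "13-1141-00", "13-1161-00")

# Build one URL per occupation code and name each URL by its code
urls <- paste0(base_url, "technology_skills_", codes, ".csv")
names(urls) <- codes

# Hypothetical helper: read every URL into a named list of data frames.
# stringsAsFactors = FALSE keeps text columns as character vectors.
read_skill_files <- function(urls) {
  lapply(urls, read.csv, stringsAsFactors = FALSE)
}

# tables <- read_skill_files(urls)  # requires network access
print(urls)
```

Because `lapply` preserves names, the resulting list would be keyed by occupation code, which makes it easy to trace each table back to its source file.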
Set Working Environment
library(dplyr)
library(ggplot2)
Combine the information for all 31 jobs into one large data frame
df<-bind_rows(f1,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30, f31, f32, f33)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
head(df)
##      Section                                    Category                    Example
## 1 Technology                         Accounting software           Deltek Costpoint
## 2 Technology                         Accounting software          Intuit QuickBooks
## 3 Technology           Analytical or scientific software Business analysis software
## 4 Technology           Analytical or scientific software              Relex Weibull
## 5 Technology                Data base reporting software                AdRelevance
## 6 Technology Data base user interface and query software           Microsoft Access
tail(df)
##         Section                                          Category                 Example
## 1879 Technology Object or component oriented development software                       R
## 1880 Technology     Object oriented data base management software Microsoft Visual FoxPro
## 1881 Technology                             Office suite software        Microsoft Office
## 1882 Technology                             Presentation software    Microsoft PowerPoint
## 1883 Technology                              Spreadsheet software         Microsoft Excel
## 1884 Technology                          Word processing software          Microsoft Word
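The combined data frame shows Section, Category and Example, but not which occupation each row came from. One way to keep that information is the `.id` argument of `bind_rows`, sketched below on toy tables. The pairing of occupation codes with rows here is illustrative only, and the column name `occupation` is an assumption, not something used elsewhere in this report.

```r
library(dplyr)

# Toy stand-ins for two per-occupation tables; which code holds which rows is made up
tables <- list(
  "15-2041-00" = data.frame(Category = "Analytical or scientific software",
                            Example = c("SAS", "Relex Weibull"),
                            stringsAsFactors = FALSE),
  "15-1111-00" = data.frame(Category = "Spreadsheet software",
                            Example = "Microsoft Excel",
                            stringsAsFactors = FALSE)
)

# .id adds a leading column that holds the name of the list element each row came from
combined <- bind_rows(tables, .id = "occupation")
print(combined)
```

With such a column in place, skills could later be counted per occupation rather than only across the whole data frame.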
Our partner, Ravi, has organized a set of key skills for data scientists. Later, we will see how they match our combined data frame of 31 occupations.
skills_Ravi<- c("AWS", "Python","AI", "SQL", "R", "SAS", "Tableau", "AZURE", "SparkML", "Spark","Hadoop", "Machine Learning", "Shiny","Statistics","Probability")
After reviewing Ravi’s key skills and the O*NET website, Gracie and Jun pulled out a second set of skills as a backup plan for this study.
skills_Jun <- c("C", "C#", "Cassandra", "Django", "Hadoop", "Hive", "HTML", "Java", "MongoDB", "Matlab", "Python", "Pig", "SAS", "R", "Ruby", "SQL", "Statistics", "Tableau", "Teradata")
Use the filter function of the dplyr package to keep the rows of the combined data frame that match Ravi’s key skills.
df_Ravi <- df %>% filter(Example %in% skills_Ravi)
print(df_Ravi)
## Section Category Example
## 1 Technology Analytical or scientific software SAS
## 2 Technology Object or component oriented development software R
## 3 Technology Analytical or scientific software SAS
## 4 Technology Business intelligence and data analysis software Tableau
## 5 Technology Object or component oriented development software R
## 6 Technology Analytical or scientific software SAS
## 7 Technology Analytical or scientific software SAS
## 8 Technology Business intelligence and data analysis software Tableau
## 9 Technology Object or component oriented development software R
## 10 Technology Analytical or scientific software SAS
## 11 Technology Business intelligence and data analysis software Tableau
## 12 Technology Object or component oriented development software R
## 13 Technology Analytical or scientific software SAS
## 14 Technology Business intelligence and data analysis software Tableau
## 15 Technology Object or component oriented development software Python
## 16 Technology Object or component oriented development software R
## 17 Technology Analytical or scientific software SAS
## 18 Technology Object or component oriented development software Python
## 19 Technology Analytical or scientific software SAS
## 20 Technology Business intelligence and data analysis software Tableau
## 21 Technology Object or component oriented development software Python
## 22 Technology Analytical or scientific software SAS
## 23 Technology Object or component oriented development software Python
## 24 Technology Business intelligence and data analysis software Tableau
## 25 Technology Object or component oriented development software Python
## 26 Technology Analytical or scientific software SAS
## 27 Technology Business intelligence and data analysis software Tableau
## 28 Technology Object or component oriented development software Python
## 29 Technology Object or component oriented development software R
## 30 Technology Analytical or scientific software SAS
## 31 Technology Object or component oriented development software Python
## 32 Technology Object or component oriented development software R
## 33 Technology Analytical or scientific software SAS
## 34 Technology Business intelligence and data analysis software Tableau
## 35 Technology Object or component oriented development software Python
## 36 Technology Object or component oriented development software R
## 37 Technology Analytical or scientific software SAS
## 38 Technology Business intelligence and data analysis software Tableau
## 39 Technology Object or component oriented development software Python
## 40 Technology Object or component oriented development software R
## 41 Technology Analytical or scientific software SAS
## 42 Technology Object or component oriented development software Python
## 43 Technology Analytical or scientific software SAS
## 44 Technology Analytical or scientific software SAS
## 45 Technology Object or component oriented development software Python
## 46 Technology Analytical or scientific software SAS
## 47 Technology Business intelligence and data analysis software Tableau
## 48 Technology Object or component oriented development software Python
## 49 Technology Analytical or scientific software SAS
## 50 Technology Object or component oriented development software Python
## 51 Technology Object or component oriented development software Python
## 52 Technology Business intelligence and data analysis software Tableau
## 53 Technology Object or component oriented development software Python
## 54 Technology Analytical or scientific software SAS
## 55 Technology Object or component oriented development software R
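The filter above keeps a row only when its Example equals a skill name exactly. Many entries in the data carry a vendor prefix (for instance "Microsoft Excel"), so a skill written without one would be missed. The sketch below compares exact matching with a case-insensitive, word-level match. Apart from "Microsoft Excel", the example strings are hypothetical, `has_skill` is an illustrative helper, and multi-word skills such as "Machine Learning" would need a different approach.

```r
# "Microsoft Excel" appears in the data above; the other strings are hypothetical
examples <- c("Microsoft Excel", "Apache Hadoop", "Python", "R", "Oracle Java")
skills <- c("Hadoop", "Python", "R", "Java")

# Hypothetical helper: TRUE if any whitespace-separated word of the example
# equals one of the skills, ignoring case
has_skill <- function(example, skills) {
  tokens <- strsplit(tolower(example), "[[:space:]]+")[[1]]
  any(tokens %in% tolower(skills))
}

# Exact matching, as done by %in% in the filter above
exact <- examples %in% skills

# Word-level matching
token <- vapply(examples, has_skill, logical(1), skills = skills, USE.NAMES = FALSE)

print(data.frame(example = examples, exact = exact, token = token))
```

Matching on whole words rather than substrings avoids one-letter skills such as "R" or "C" matching inside unrelated product names.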
Data visualization using ggplot2. Of Ravi’s skills, four match entries in the combined data frame: SAS (19 rows), Python (15), Tableau (11) and R (10). By this measure, these are the top 4 skills for data scientists.
pl <- ggplot(df_Ravi, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl)
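The bar chart draws one bar per skill, with height equal to the number of matching rows. The same counts can be checked in base R, and the skills can be reordered by frequency. In this sketch the vector `example` is rebuilt from counts tallied by hand from the printed `df_Ravi` output above, rather than read from the data frame itself.

```r
# Rebuild the Example column from counts tallied from the printed df_Ravi output
example <- rep(c("R", "Python", "SAS", "Tableau"), times = c(10, 15, 19, 11))

# Count rows per skill and sort from most to least frequent
counts <- sort(table(example), decreasing = TRUE)
print(counts)

# Reorder the factor levels by frequency; ggplot2 draws discrete bars in level order
example_ordered <- factor(example, levels = names(counts))
print(levels(example_ordered))
```

Using a frequency-ordered factor as the x aesthetic would make the ranking of skills readable directly from the plot.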
Similar findings were observed using Gracie and Jun’s keywords for data scientists. The top key skills for data scientists are SAS, Python, Tableau, R, C, Ruby and Django.
df_Jun <- df %>% filter(Example %in% skills_Jun)
pl_Jun <- ggplot(df_Jun, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl_Jun)
These are very preliminary results from our analysis.