For this project, we research the career outlook, necessary skills, and salaries of data scientists. Data were collected from Indeed, O*NET, and the Bureau of Labor Statistics website by web scraping using R.

Hui Han (Gracie) and I (Jun Pan) focused on analyzing 31 data science related jobs and the hard skills they require.

First, the required skills for 31 data science related jobs were downloaded from the O*NET website and saved in a GitHub repository. Load the CSV files for occupations and skills from GitHub:

f1<- read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_11-3111-00.csv")
f3<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1141-00.csv")
f4<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1161-00.csv")
f5<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2011-02.csv")
f6<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2041-00.csv")
f7<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2051-00.csv")
f8<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2053-00.csv")
f9<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2099-02.csv")
f10<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1111-00.csv")
f11<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1121-00.csv")
f12<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1131-00.csv")
f13<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1133-00.csv")
f14<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1134-00.csv")
f15<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1141-00.csv")
f17<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2021-00.csv")
f18<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2031-00.csv")
f19<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-00.csv")
f20<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-01.csv")
f21<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-02.csv")
f22<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-2099-01.csv")
f23<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3011-00.csv")
f24<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3022-00.csv")
f25<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-4061-00.csv")
f26<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-1021-00.csv")
f27<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4011-00.csv")
f28<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4012-00.csv")
f29<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_43-9011-00.csv")
f30<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_49-2011-00.csv")
f31<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-9011-00.csv")
f32<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1151-00.csv")
f33<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2011-00.csv")
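
The file names above differ only in the occupation code, so the URLs can also be generated rather than typed out. The following is a minimal sketch under that assumption: `base_url`, `soc_codes`, and the helper `load_skills` are illustrative names that do not appear in the original analysis, and only the first three codes are listed. It uses base R `rbind`, which assumes every file has the same columns.

```r
# Sketch: build the download URLs from the occupation codes instead of
# writing one read.csv() call per file.
base_url <- "https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/"

# Only the first three codes from the list above; the rest follow the same pattern.
soc_codes <- c("11-3111-00", "13-1141-00", "13-1161-00")

urls <- paste0(base_url, "technology_skills_", soc_codes, ".csv")

# Hypothetical helper: read every file as character data and stack the rows.
load_skills <- function(urls) {
  tables <- lapply(urls, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, tables)
}

# df <- load_skills(urls)  # requires network access
```

Adding an occupation then means adding one code to `soc_codes` instead of another `read.csv` line.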

Set Working Environment
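
The code for this step is not shown in the report. The later calls to `bind_rows`, `filter`, `%>%`, `ggplot`, and `geom_bar` suggest that dplyr and ggplot2 were loaded here, so the following is a sketch under that assumption, not the original setup code. The names `required_pkgs`, `installed`, and `not_installed` are illustrative.

```r
# Assumed packages: dplyr (bind_rows, filter, %>%) and ggplot2 (ggplot, geom_bar).
required_pkgs <- c("dplyr", "ggplot2")

# Check which of the required packages are installed.
installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required_pkgs[!installed]

if (length(not_installed) > 0) {
  # Report what is missing rather than failing later with "could not find function".
  message("Please install: ", paste(not_installed, collapse = ", "))
} else {
  library(dplyr)
  library(ggplot2)
}
```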

Combine the information for all 31 jobs into one large data frame.

df<-bind_rows(f1,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30, f31, f32, f33)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

head(df)
##      Section                                    Category
## 1 Technology                         Accounting software
## 2 Technology                         Accounting software
## 3 Technology           Analytical or scientific software
## 4 Technology           Analytical or scientific software
## 5 Technology                Data base reporting software
## 6 Technology Data base user interface and query software
##                      Example
## 1           Deltek Costpoint
## 2          Intuit QuickBooks
## 3 Business analysis software
## 4              Relex Weibull
## 5                AdRelevance
## 6           Microsoft Access
tail(df)
##         Section                                          Category
## 1879 Technology Object or component oriented development software
## 1880 Technology     Object oriented data base management software
## 1881 Technology                             Office suite software
## 1882 Technology                             Presentation software
## 1883 Technology                              Spreadsheet software
## 1884 Technology                          Word processing software
##                      Example
## 1879                       R
## 1880 Microsoft Visual FoxPro
## 1881        Microsoft Office
## 1882    Microsoft PowerPoint
## 1883         Microsoft Excel
## 1884          Microsoft Word
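
The `bind_rows` warnings above most likely arise because `read.csv` returned the text columns as factors (the default before R 4.0.0), and each file carries a different set of factor levels. A minimal sketch of one way to avoid this, assuming the columns are meant to be plain text: read the files with `stringsAsFactors = FALSE`. The two inline CSV snippets are stand-ins built from rows shown in the `head()` and `tail()` output, not the real files.

```r
# Two small CSV snippets standing in for two of the downloaded files.
csv_a <- "Section,Category,Example\nTechnology,Accounting software,Intuit QuickBooks"
csv_b <- "Section,Category,Example\nTechnology,Spreadsheet software,Microsoft Excel"

# stringsAsFactors = FALSE keeps text columns as character vectors,
# so there are no factor levels to reconcile when the rows are stacked.
a <- read.csv(text = csv_a, stringsAsFactors = FALSE)
b <- read.csv(text = csv_b, stringsAsFactors = FALSE)

combined <- rbind(a, b)
str(combined)
```

The result is the same as before, since the warnings already report a coercion to character, but the output is no longer flooded with one warning per column and file.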

Our partner Ravi has organized a set of key skills for data scientists. Next, we check how they match our combined data frame of 31 occupations.

skills_Ravi<- c("AWS", "Python","AI", "SQL", "R", "SAS", "Tableau", "AZURE", "SparkML", "Spark","Hadoop", "Machine Learning", "Shiny","Statistics","Probability")

After reviewing Ravi’s key skills and the O*NET website, Gracie and Jun pulled out a second set of skills as a backup plan for this study.

skills_Jun <- c("C", "C#", "Cassandra", "Django", "Hadoop", "Hive", "HTML", "Java", "MongoDB", "Matlab", "Python", "Pig", "SAS", "R", "Ruby", "SQL", "Statistics", "Tableau", "Teradata")

Use the filter function from the dplyr package to keep the rows of the combined data frame that match Ravi’s key skills.

df_Ravi <- df %>% filter (Example %in% skills_Ravi)
print(df_Ravi)
##       Section                                          Category Example
## 1  Technology                 Analytical or scientific software     SAS
## 2  Technology Object or component oriented development software       R
## 3  Technology                 Analytical or scientific software     SAS
## 4  Technology  Business intelligence and data analysis software Tableau
## 5  Technology Object or component oriented development software       R
## 6  Technology                 Analytical or scientific software     SAS
## 7  Technology                 Analytical or scientific software     SAS
## 8  Technology  Business intelligence and data analysis software Tableau
## 9  Technology Object or component oriented development software       R
## 10 Technology                 Analytical or scientific software     SAS
## 11 Technology  Business intelligence and data analysis software Tableau
## 12 Technology Object or component oriented development software       R
## 13 Technology                 Analytical or scientific software     SAS
## 14 Technology  Business intelligence and data analysis software Tableau
## 15 Technology Object or component oriented development software  Python
## 16 Technology Object or component oriented development software       R
## 17 Technology                 Analytical or scientific software     SAS
## 18 Technology Object or component oriented development software  Python
## 19 Technology                 Analytical or scientific software     SAS
## 20 Technology  Business intelligence and data analysis software Tableau
## 21 Technology Object or component oriented development software  Python
## 22 Technology                 Analytical or scientific software     SAS
## 23 Technology Object or component oriented development software  Python
## 24 Technology  Business intelligence and data analysis software Tableau
## 25 Technology Object or component oriented development software  Python
## 26 Technology                 Analytical or scientific software     SAS
## 27 Technology  Business intelligence and data analysis software Tableau
## 28 Technology Object or component oriented development software  Python
## 29 Technology Object or component oriented development software       R
## 30 Technology                 Analytical or scientific software     SAS
## 31 Technology Object or component oriented development software  Python
## 32 Technology Object or component oriented development software       R
## 33 Technology                 Analytical or scientific software     SAS
## 34 Technology  Business intelligence and data analysis software Tableau
## 35 Technology Object or component oriented development software  Python
## 36 Technology Object or component oriented development software       R
## 37 Technology                 Analytical or scientific software     SAS
## 38 Technology  Business intelligence and data analysis software Tableau
## 39 Technology Object or component oriented development software  Python
## 40 Technology Object or component oriented development software       R
## 41 Technology                 Analytical or scientific software     SAS
## 42 Technology Object or component oriented development software  Python
## 43 Technology                 Analytical or scientific software     SAS
## 44 Technology                 Analytical or scientific software     SAS
## 45 Technology Object or component oriented development software  Python
## 46 Technology                 Analytical or scientific software     SAS
## 47 Technology  Business intelligence and data analysis software Tableau
## 48 Technology Object or component oriented development software  Python
## 49 Technology                 Analytical or scientific software     SAS
## 50 Technology Object or component oriented development software  Python
## 51 Technology Object or component oriented development software  Python
## 52 Technology  Business intelligence and data analysis software Tableau
## 53 Technology Object or component oriented development software  Python
## 54 Technology                 Analytical or scientific software     SAS
## 55 Technology Object or component oriented development software       R
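
Only four of the fifteen keywords appear in the output above. One possible explanation, which has not been checked against the data, is that `%in%` requires exact equality: a keyword such as "AZURE" cannot match an entry with different capitalization or a longer product name. The sketch below illustrates the difference on toy data. The entries "Microsoft Azure" and "Apache Hadoop" are hypothetical, and the loose search assumes the keywords contain no regular-expression special characters.

```r
# Toy data: product names as they might appear in the Example column.
toy <- data.frame(
  Example = c("SAS", "Python", "Microsoft Azure", "Apache Hadoop", "R"),
  stringsAsFactors = FALSE
)
keywords <- c("Python", "AZURE", "Hadoop", "R")

# %in% keeps only values that are exactly equal to a keyword.
exact <- toy$Example[toy$Example %in% keywords]
print(exact)

# Looser alternative: case-insensitive substring search.
# Single-letter keywords such as "R" are left out because they would
# match almost every product name.
long_keywords <- keywords[nchar(keywords) > 1]
pattern <- paste(long_keywords, collapse = "|")
loose <- toy$Example[grepl(pattern, toy$Example, ignore.case = TRUE)]
print(loose)
```

Here the exact match returns only "Python" and "R", while the loose search also finds the Azure and Hadoop entries. The loose search can over-match, so its results would need a manual review.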

Data visualization using ggplot2. According to Ravi’s skills, the top 4 skills for data scientists are the following: SAS, Python, Tableau, and R.

pl <- ggplot(df_Ravi, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl)
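
The ranking can also be read off as numbers instead of bar heights. The sketch below rebuilds the `Example` column from the 55 rows printed above (19 SAS, 15 Python, 11 Tableau, 10 R) and counts them with base R. The names `examples` and `skill_counts` are illustrative. With the real data, `table(df_Ravi$Example)` would take the place of the rebuilt vector.

```r
# Rebuild the Example column of df_Ravi from the printed output:
# 19 rows for SAS, 15 for Python, 11 for Tableau, 10 for R (55 in total).
examples <- rep(c("SAS", "Python", "Tableau", "R"), times = c(19, 15, 11, 10))

# Count each skill and sort from most to least frequent.
skill_counts <- sort(table(examples), decreasing = TRUE)
print(skill_counts)
```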

Similar findings were observed using Gracie and Jun’s keywords for data scientists. The top key skills for data scientists are SAS, Python, Tableau, R, C, Ruby, and Django.

df_Jun <- df %>% filter (Example %in% skills_Jun)
pl_Jun <- ggplot(df_Jun, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl_Jun)
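
A keyword that is misspelled or absent from the O*NET data drops out of the filter result without any message, so it can help to list the keywords that matched nothing. A minimal sketch on toy values: `available` stands in for `unique(df$Example)`, and "Pyhton" is a deliberately misspelled, hypothetical keyword.

```r
# Toy stand-ins: values present in the data and a keyword list.
available <- c("SAS", "Python", "R", "Tableau")
keywords <- c("SAS", "Pyhton", "R", "SAS")

# Drop repeated keywords, then list those that match nothing in the data.
keywords <- unique(keywords)
unmatched <- setdiff(keywords, available)
print(unmatched)
```

With the real data, the equivalent check would be `setdiff(skills_Jun, unique(df$Example))`.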

These are very preliminary results from our analysis.