For this project, we research the career outlook, necessary skills, and salaries of data scientists. Data were collected from Indeed, O*NET, and the Bureau of Labor Statistics website by web scraping with R.

Hui (Gracie) Han and Jun Pan focused on analyzing 31 data science related jobs and the hard skills they require.

First, the technology skills listed for 31 data science related jobs were downloaded from the O*NET website and saved in a GitHub repository. The code below loads one CSV file per occupation from GitHub.

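# Packages used below: dplyr for bind_rows(), filter() and %>%; ggplot2 for the plots
library(dplyr)
library(ggplot2)
#try(setwd("tech_skills/old"))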
f1<- read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_11-3111-00.csv")
f3<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1141-00.csv")
f4<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-1161-00.csv")
f5<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2011-02.csv")
f6<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2041-00.csv")
f7<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2051-00.csv")
f8<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2053-00.csv")
f9<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_13-2099-02.csv")
f10<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1111-00.csv")
f11<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1121-00.csv")
f12<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1131-00.csv")
f13<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1133-00.csv")
f14<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1134-00.csv")
f15<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1141-00.csv")
f17<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2021-00.csv")
f18<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2031-00.csv")
f19<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-00.csv")
f20<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-01.csv")
f21<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2041-02.csv")
f22<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-2099-01.csv")
f23<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3011-00.csv")
f24<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-3022-00.csv")
f25<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_19-4061-00.csv")
f26<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-1021-00.csv")
f27<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4011-00.csv")
f28<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_27-4012-00.csv")
f29<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_43-9011-00.csv")
f30<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_49-2011-00.csv")
f31<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_25-9011-00.csv")
f32<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-1151-00.csv")
f33<-read.csv("https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/technology_skills_15-2011-00.csv")
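The 31 `read.csv` calls above differ only in the occupation code at the end of the file name, so the URLs can also be built from a vector of codes. The following is a minimal sketch of that idea, not the code used for this report. The names `base_url`, `occupation_codes`, `skill_urls`, and `read_skill_tables` are introduced here for illustration, and only the first three codes are listed.

```r
# Common prefix of every file URL used above
base_url <- "https://raw.githubusercontent.com/simplymathematics/data-skills/master/tech_skills/old/"

# First three occupation codes from the list above; extend with the remaining 28
occupation_codes <- c("11-3111-00", "13-1141-00", "13-1161-00")

# Build one URL per occupation and label each URL with its code
skill_urls <- paste0(base_url, "technology_skills_", occupation_codes, ".csv")
names(skill_urls) <- occupation_codes

# Hypothetical helper: reads every URL into a named list of data frames.
# It needs network access, so it is defined here but not called.
read_skill_tables <- function(urls) {
  lapply(urls, read.csv, stringsAsFactors = FALSE)
}

print(skill_urls)
```

Calling `read_skill_tables(skill_urls)` would return a list with one data frame per occupation, named by occupation code, which can be passed directly to `bind_rows()`.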

Set Working Environment

Combine the information for all 31 jobs into one large data frame.

df<-bind_rows(f1,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12,f13,f14,f15,f17,f18,f19,f20,f21,f22,f23,f24,f25,f26,f27,f28,f29,f30, f31, f32, f33)
## Warning in bind_rows_(x, .id): Unequal factor levels: coercing to character
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector

head (df)
##      Section                                    Category
## 1 Technology                         Accounting software
## 2 Technology                         Accounting software
## 3 Technology           Analytical or scientific software
## 4 Technology           Analytical or scientific software
## 5 Technology                Data base reporting software
## 6 Technology Data base user interface and query software
##                      Example
## 1           Deltek Costpoint
## 2          Intuit QuickBooks
## 3 Business analysis software
## 4              Relex Weibull
## 5                AdRelevance
## 6           Microsoft Access
tail(df)
##         Section                                          Category
## 1879 Technology Object or component oriented development software
## 1880 Technology     Object oriented data base management software
## 1881 Technology                             Office suite software
## 1882 Technology                             Presentation software
## 1883 Technology                              Spreadsheet software
## 1884 Technology                          Word processing software
##                      Example
## 1879                       R
## 1880 Microsoft Visual FoxPro
## 1881        Microsoft Office
## 1882    Microsoft PowerPoint
## 1883         Microsoft Excel
## 1884          Microsoft Word
dim(df)
## [1] 1884    3
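The coercion warnings above arise because `read.csv` in older R versions turns text columns into factors, and each file has different factor levels. The combined data frame also has only three columns, so it no longer records which occupation a row came from. The sketch below shows one possible way to address both points, assuming dplyr is installed. The two inline CSV strings are toy stand-ins for the downloaded files, and the pairing of rows with occupation codes is made up for illustration.

```r
library(dplyr)

# Toy stand-ins for two downloaded files (row-to-occupation pairing is illustrative)
csv_a <- "Section,Category,Example
Technology,Spreadsheet software,Microsoft Excel
Technology,Object or component oriented development software,R"
csv_b <- "Section,Category,Example
Technology,Analytical or scientific software,SAS"

# stringsAsFactors = FALSE keeps text columns as character vectors,
# so bind_rows() has no factor levels to reconcile
tables <- list(
  "15-2041-00" = read.csv(text = csv_a, stringsAsFactors = FALSE),
  "15-1111-00" = read.csv(text = csv_b, stringsAsFactors = FALSE)
)

# .id adds a column holding the list names, here the occupation codes
combined <- bind_rows(tables, .id = "occupation_code")
print(combined)
```

Keeping the occupation code makes it possible to count in how many distinct occupations a skill appears, rather than only how many rows mention it.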

Our partner Ravi has organized a set of key skills for data scientists. Later, we will see how they match our combined data frame of 31 occupations.

skills_Ravi<- c("AWS", "Python","AI", "SQL", "R", "SAS", "Tableau", "AZURE", "SparkML", "Spark","Hadoop", "Machine Learning", "Shiny","Statistics","Probability")

After reviewing Ravi’s key skills and the O*NET website, Gracie and Jun pulled out a second set of skills as a backup plan for this study.

skills_Jun <- c("C", "C#", "Cassandra", "Django", "Hadoop", "Hive", "HTML", "Java", "MongoDB", "Matlab", "Python", "Pig", "R", "Ruby", "SAS", "SQL", "Statistics", "Tableau", "Teradata")

Use the filter function of the dplyr package to keep the rows of the combined data frame whose Example exactly matches one of Ravi’s key skills.

df_Ravi <- df %>% filter (Example %in% skills_Ravi)
## Warning: package 'bindrcpp' was built under R version 3.3.3
print(df_Ravi)
##       Section                                          Category Example
## 1  Technology                 Analytical or scientific software     SAS
## 2  Technology Object or component oriented development software       R
## 3  Technology                 Analytical or scientific software     SAS
## 4  Technology  Business intelligence and data analysis software Tableau
## 5  Technology Object or component oriented development software       R
## 6  Technology                 Analytical or scientific software     SAS
## 7  Technology                 Analytical or scientific software     SAS
## 8  Technology  Business intelligence and data analysis software Tableau
## 9  Technology Object or component oriented development software       R
## 10 Technology                 Analytical or scientific software     SAS
## 11 Technology  Business intelligence and data analysis software Tableau
## 12 Technology Object or component oriented development software       R
## 13 Technology                 Analytical or scientific software     SAS
## 14 Technology  Business intelligence and data analysis software Tableau
## 15 Technology Object or component oriented development software  Python
## 16 Technology Object or component oriented development software       R
## 17 Technology                 Analytical or scientific software     SAS
## 18 Technology Object or component oriented development software  Python
## 19 Technology                 Analytical or scientific software     SAS
## 20 Technology  Business intelligence and data analysis software Tableau
## 21 Technology Object or component oriented development software  Python
## 22 Technology                 Analytical or scientific software     SAS
## 23 Technology Object or component oriented development software  Python
## 24 Technology  Business intelligence and data analysis software Tableau
## 25 Technology Object or component oriented development software  Python
## 26 Technology                 Analytical or scientific software     SAS
## 27 Technology  Business intelligence and data analysis software Tableau
## 28 Technology Object or component oriented development software  Python
## 29 Technology Object or component oriented development software       R
## 30 Technology                 Analytical or scientific software     SAS
## 31 Technology Object or component oriented development software  Python
## 32 Technology Object or component oriented development software       R
## 33 Technology                 Analytical or scientific software     SAS
## 34 Technology  Business intelligence and data analysis software Tableau
## 35 Technology Object or component oriented development software  Python
## 36 Technology Object or component oriented development software       R
## 37 Technology                 Analytical or scientific software     SAS
## 38 Technology  Business intelligence and data analysis software Tableau
## 39 Technology Object or component oriented development software  Python
## 40 Technology Object or component oriented development software       R
## 41 Technology                 Analytical or scientific software     SAS
## 42 Technology Object or component oriented development software  Python
## 43 Technology                 Analytical or scientific software     SAS
## 44 Technology                 Analytical or scientific software     SAS
## 45 Technology Object or component oriented development software  Python
## 46 Technology                 Analytical or scientific software     SAS
## 47 Technology  Business intelligence and data analysis software Tableau
## 48 Technology Object or component oriented development software  Python
## 49 Technology                 Analytical or scientific software     SAS
## 50 Technology Object or component oriented development software  Python
## 51 Technology Object or component oriented development software  Python
## 52 Technology  Business intelligence and data analysis software Tableau
## 53 Technology Object or component oriented development software  Python
## 54 Technology                 Analytical or scientific software     SAS
## 55 Technology Object or component oriented development software       R
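Because `%in%` compares whole strings, a skill such as "SQL" is found only when an Example is exactly "SQL", not when it appears inside a longer product name. The sketch below contrasts exact matching with a case-insensitive whole-word search using base R `grepl`. The `examples` vector is a toy set of strings chosen for illustration and is not taken from the downloaded data.

```r
# Toy example strings (illustrative, not rows from the O*NET files)
examples <- c("Microsoft SQL Server", "Structured query language SQL",
              "R", "Microsoft Excel", "MySQL")

# Exact matching, as done with %in% above: only the bare "R" is found
exact <- examples %in% c("SQL", "R")

# Whole-word, case-insensitive search for "SQL" inside longer names.
# \\b marks a word boundary, so "MySQL" is not counted.
partial <- grepl("\\bSQL\\b", examples, ignore.case = TRUE, perl = TRUE)

print(data.frame(examples, exact, partial))
```

Whether partial matching is appropriate depends on the question being asked: it finds more mentions of a skill, but it also needs care with very short names such as "R" or "C".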

Data visualization using ggplot2. Of Ravi’s 15 key skills, only four appear as exact matches in the combined data frame: SAS (19 rows), Python (15), Tableau (11), and R (10). By this measure, they are the top four skills for data scientists.

pl <- ggplot(df_Ravi, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl)
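The bar heights in this plot are row counts per skill, and they can also be read as a table. The sketch below rebuilds the `Example` column from the 55 rows printed for `df_Ravi` and tallies it with base R. The vector `matched` is a reconstruction made here for illustration, so the snippet runs without downloading the data.

```r
# Reconstruct the Example column from the printed df_Ravi rows:
# 19 SAS, 15 Python, 11 Tableau, 10 R (55 rows in total)
matched <- rep(c("SAS", "Python", "Tableau", "R"), times = c(19, 15, 11, 10))

# Count rows per skill and order from most to least frequent
skill_counts <- sort(table(matched), decreasing = TRUE)
print(skill_counts)
```

On the real data, the same tally would be `sort(table(df_Ravi$Example), decreasing = TRUE)`.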

Similar findings were observed using Gracie and Jun’s keywords for data scientists. The top key skills are SAS, Python, Tableau, R, C, Ruby, and Django.

df_Jun <- df %>% filter (Example %in% skills_Jun)
pl_Jun <- ggplot(df_Jun, aes(x = Example, color = Example, fill = Example)) + geom_bar()
print(pl_Jun)

Techskills.graphic <- pl_Jun

These are very preliminary results from our analysis.

Techskill.frame <- df_Ravi
Techskills.graphic

setwd("..")