This is the result of our exploration in answering the question, “What is the most valued data science skill?”

In order to explore and attempt to answer this question, our team took the following approach:

Approach

Christopher is doing it. We are incorporating some graphs that I have done… including the technology stack for communication. Status: Still Pending

Team

Christopher is doing it. Team composition, including the group. Status: Still Pending

1. Scraping data from the Web

The group in charge of scraping the information from the web followed similar approaches.
Valerie is integrating an overview of the approaches for each source. I will write a high-level summary and then link to each individual RMarkdown document on RPubs. Please send me the RMarkdown using the common .css.
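As a placeholder until the individual RMarkdown documents are linked, below is a minimal sketch of the general scraping pattern; the URL, CSS selector, and skill list are illustrative assumptions only, not the ones any of our scrapers actually used.

library(rvest)

# Illustrative placeholders only -- the real URLs and selectors live in the
# individual scrapers (Google, Kaggle, Indeed).
page <- read_html("https://www.example.com/data-scientist-jobs")
postings <- html_text(html_nodes(page, ".job-description"))

# Count how often each candidate skill term appears across the postings
skills <- c("R", "Python", "SQL", "Hadoop", "machine learning")
counts <- sapply(skills, function(s) sum(grepl(s, postings, ignore.case = TRUE)))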

a. Google

Scott ** Status = Done **

b. Kaggle

Arindam ** Status = Re-run Rmarkdown with common .css and correct typos **

c. Indeed

Dan F. ** Status = Re-Run Rmarkdown with common .css**

d. Other

Valerie ** Status = In Progress ** Yadu, please send me the Word document describing the effort and the problems encountered.

2. Transforming and aggregating the raw data

We need an overview of what this section is; I am hoping I can write it from each member's input.

a. Consolidation of the various .csv files

Armenoush ** Status = Pending **
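Pending Armenoush's write-up, here is a minimal sketch of what the consolidation step could look like; the file names are placeholders for the per-source scraping outputs.

library(dplyr)

# File names are placeholders for the per-source scraping outputs
google <- read.csv("google_skills.csv", stringsAsFactors = FALSE)
indeed <- read.csv("indeed_skills.csv", stringsAsFactors = FALSE)
kaggle <- read.csv("kaggle_skills.csv", stringsAsFactors = FALSE)

# Tag each row with its source and stack the three files into one table
consolidated <- bind_rows(
  mutate(google, source_name = "Google"),
  mutate(indeed, source_name = "Indeed"),
  mutate(kaggle, source_name = "Kaggle")
)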

b. Mapping of detailed skills to skill set and skill type

Rob ** Status = Pending ** Please send me the link to GitHub as soon as possible so that I can determine whether anything is missing.
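Until the GitHub link is in hand, here is a hedged sketch of the mapping step; the lookup table below is hypothetical and only illustrates the join against the consolidated table from the previous sketch.

library(dplyr)

# Hypothetical lookup table -- the real mapping lives in Rob's repository
skill_map <- data.frame(
  skill_name      = c("Python", "R", "SQL", "communication", "ANOVA"),
  skill_set_name  = c("languages", "languages", "databases", "soft skills", "statistics"),
  skill_type_name = c("programming", "programming", "programming", "communication", "math"),
  stringsAsFactors = FALSE
)

# Attach skill set and skill type to every scraped skill by name
mapped <- left_join(consolidated, skill_map, by = "skill_name")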

c. Weighting and aggregation

Dan Brooks ** Status = Pending ** Dan, I think I will need an overview of the weighting algorithm if you can provide one, unless I can easily get it from the documentation on GitHub.
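Until I have that overview, here is a placeholder sketch of one possible weighting scheme (mention counts rescaled within each source); this is not necessarily the algorithm Dan actually used.

library(dplyr)

# Placeholder only, not necessarily Dan's algorithm: assuming `mapped` (from the
# sketch above) has one row per skill mention, weight each skill by how often it
# appears and rescale within its source.
weighted <- mapped %>%
  group_by(source_name, skill_type_name, skill_name) %>%
  summarise(n = n()) %>%
  group_by(source_name) %>%
  mutate(weighted_rating_overall = 100 * n / sum(n)) %>%
  ungroup()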

3. Loading the transformed data to a database hosted in the cloud

Keith ** Status = Pending ** I do not think we need a separate RPubs document for this; I will integrate the code here. I just need the command.
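In the meantime, here is a minimal sketch of the load command, assuming a MySQL instance in the cloud; the host, database name, and credentials are placeholders until Keith confirms the details.

library(DBI)
library(RMySQL)

# Connection details are placeholders for the cloud-hosted instance
con <- dbConnect(MySQL(),
                 host     = "cloud-host.example.com",
                 dbname   = "data_science_skills",
                 user     = "ds607",
                 password = "********")

# Write the weighted skills table (from the weighting sketch above) to the database
dbWriteTable(con, "tbl_data", weighted, row.names = FALSE, overwrite = TRUE)
dbDisconnect(con)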

a. SQL Schema

Keith ** Status = Pending ** Keith, let me know if we can get this, or I will create one in Visio.
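Until we have the actual schema (or the Visio diagram), here is a hypothetical DDL sketch based only on the columns used elsewhere in this document.

# Hypothetical schema only -- the actual schema is Keith's to confirm.
# `con` is the placeholder connection opened in the load step above.
dbSendQuery(con, "
  CREATE TABLE tbl_data (
    id                      INT AUTO_INCREMENT PRIMARY KEY,
    source_name             VARCHAR(50),
    skill_name              VARCHAR(100),
    skill_type_name         VARCHAR(50),
    weighted_rating_overall DECIMAL(10,4)
  )")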

b. Data dictionary

Transformer Group: I think we are making use of a data dictionary; this section may not prove necessary. I just need a few words on it.

c. Cloud hosting

Transformer Group: I just need a few words on it, basically where we are hosting it… I can speak to the cloud solution.

d. Pulling data from DB

Keith ** Status = Done ** I do not think we need a separate RPubs document for this; I will integrate the code here. I just need the command. Valerie, integration to be done…
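For the integration, here is a minimal sketch of the pull command, reusing the same placeholder connection details as in the load step and caching the result as the CSV the presenters read.

library(DBI)
library(RMySQL)

# Same placeholder connection details as in the load step above
con <- dbConnect(MySQL(),
                 host = "cloud-host.example.com", dbname = "data_science_skills",
                 user = "ds607", password = "********")

# Pull the table and cache it as the CSV used by the presenters
skills_df <- dbGetQuery(con, "SELECT * FROM tbl_data")
write.csv(skills_df, "tbl_data_v2.csv", row.names = FALSE)
dbDisconnect(con)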

4. Visualization and Analysis

a. Visualization and Graphs

The Presenters used the CSV file pulled from the Data Science Skills Database to create bar charts, word clouds, and other visualizations that show and summarize the group’s findings. When we examined the data, we found that the skill names are weighted on different scales and that the top data science skills differ across the three sources.
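Because the weights sit on different scales, one way to put the three sources side by side (an illustrative sketch only; the graphs below use the raw weights) is to rescale each source’s ratings to a common 0 to 100 range.

library(dplyr)

# Rescale each source's ratings so its top skill is 100; illustration only
skills_raw <- read.csv("https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv", stringsAsFactors = FALSE)
rescaled <- skills_raw %>%
  group_by(source_name) %>%
  mutate(rating_scaled = 100 * weighted_rating_overall / max(weighted_rating_overall)) %>%
  ungroup()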

1. Bar Graph Showing the Top 6 Data Science Skills from the 3 Sources (Google, Indeed and Kaggle)

library(RCurl)
## Warning: package 'RCurl' was built under R version 3.2.3
## Loading required package: bitops
url = "https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv"
x = getURL(url)
weightedskills = read.csv(file = textConnection(x), header = T)

Include all R packages needed for visualization of the results.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.3

Convert columns to the proper data types, pick the relevant columns, and arrange by source.

weightedskills$skill_name = as.character(weightedskills$skill_name)
weightedskills$weighted_rating_overall = as.numeric(weightedskills$weighted_rating_overall)
weightedskills$source_name = as.character(weightedskills$source_name)
weightedskills1 = weightedskills %>%
  select(skill_name, weighted_rating_overall, source_name) %>%
  arrange(source_name)

Convert the result to a plain data frame.

weightedskills1 = data.frame(weightedskills1)

Pick all rows that have a Google source.

weightedskills11 = filter(weightedskills1, source_name == "Google")
weightedskills11 = weightedskills11[order(-weightedskills11$weighted_rating_overall),]

Pick the top six Google data science skills and generate a bar graph.

h1 = head(weightedskills11)
p1 = ggplot(h1, aes(y = weighted_rating_overall, fill = skill_name))
h1$skill_name = reorder(h1$skill_name, -h1$weighted_rating_overall)
p1 + geom_bar(aes(x = skill_name), data = h1, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Google")

Pick all rows that have an Indeed source.

weightedskills12 = filter(weightedskills1, source_name == "Indeed")
weightedskills12 = weightedskills12[order(-weightedskills12$weighted_rating_overall),]

Pick the top six Indeed data science skills and generate a bar graph.

h2 = head(weightedskills12)
p2 = ggplot(h2, aes(y = weighted_rating_overall, fill = skill_name))
h2$skill_name = reorder(h2$skill_name, -h2$weighted_rating_overall)
p2 + geom_bar(aes(x = skill_name), data = h2, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Indeed")

Pick all rows that have a Kaggle source.

weightedskills13 = filter(weightedskills1, source_name == "Kaggle")
weightedskills13 = weightedskills13[order(-weightedskills13$weighted_rating_overall),]

Pick the top six Kaggle data science skills and generate a bar graph.

h3 = head(weightedskills13)
p3 = ggplot(h3, aes(y = weighted_rating_overall, fill = skill_name))
h3$skill_name = reorder(h3$skill_name, -h3$weighted_rating_overall)
p3 + geom_bar(aes(x = skill_name), data = h3, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Kaggle")

2. Horizontal Bar Graph Showing the Top 20 Data Science Skills from the 3 Sources (Google, Indeed and Kaggle)

Read the data.

jobdata <- read.csv("https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv")

Order by source_name (ascending) and weighted_rating_overall (descending).

newjobdata <- jobdata[with(jobdata, order(source_name,-weighted_rating_overall)),]

List the top 20 skills (skill_name) for each source_name.

Google <- subset(newjobdata, source_name == "Google", select=c(source_name, skill_name,weighted_rating_overall))
Google <- Google[c(1:20),]
Indeed <- subset(newjobdata, source_name == "Indeed", select=c(source_name, skill_name, weighted_rating_overall))
Indeed <- Indeed[c(1:20),]
Kaggle <- subset(newjobdata, source_name == "Kaggle", select=c(source_name, skill_name, weighted_rating_overall))
Kaggle <- Kaggle[c(1:20),]
Combined <- cbind(Google,Indeed,Kaggle)
Combined$source_name <- NULL
Combined$source_name <- NULL
Combined$source_name <- NULL
colnames(Combined)[1] <- "GoogleSkills"
colnames(Combined)[2] <- "GoogleRatings"
colnames(Combined)[3] <- "IndeedSkills"
colnames(Combined)[4] <- "IndeedRatings"
colnames(Combined)[5] <- "KaggleSkills"
colnames(Combined)[6] <- "KaggleRatings"
kable(Combined)
|     | GoogleSkills     | GoogleRatings | IndeedSkills | IndeedRatings | KaggleSkills         | KaggleRatings |
|-----|------------------|---------------|--------------|---------------|----------------------|---------------|
| 13  | big data         | 21.128834     | GIS          | 117.21472     | Python               | 14.0858896    |
| 24  | machine learning | 7.546012      | XML          | 93.06748      | machine learning     | 9.5582822     |
| 21  | Hadoop           | 6.036810      | text mining  | 87.53374      | programming          | 7.0429448     |
| 17  | data natives     | 4.024540      | clustering   | 83.00614      | SQL                  | 6.0368098     |
| 29  | Python           | 3.521472      | BUGS         | 80.49080      | modeling             | 4.5276074     |
| 30  | R                | 3.521472      | Pig          | 77.97546      | big data             | 3.5214724     |
| 34  | SQL              | 3.018405      | JSON         | 73.44785      | Hadoop               | 3.0184049     |
| 28  | NOSQL            | 2.515337      | SVM          | 72.44172      | Java                 | 3.0184049     |
| 15  | data engineering | 2.012270      | DBA          | 71.93865      | analytics            | 2.9631902     |
| 16  | data mining      | 1.509203      | ANOVA        | 69.42331      | business             | 1.7177914     |
| 86  | analytics        | 1.417178      | Simulation   | 68.41718      | statistics           | 1.6748466     |
| 107 | analysis         | 1.226994      | Rails        | 66.40491      | MATLAB               | 1.5092025     |
| 111 | information      | 1.104294      | Objective C  | 64.89571      | predictive analytics | 1.5092025     |
| 14  | cloud            | 1.006135      | Teradata     | 63.88957      | SAS                  | 1.5092025     |
| 18  | devops           | 1.006135      | PostgreSQL   | 62.38037      | team                 | 0.7730061     |
| 19  | fintech          | 1.006135      | SPSS         | 61.37423      | communication        | 0.6073620     |
| 20  | galaxql          | 1.006135      | Oracle       | 60.87117      | R                    | 0.5030675     |
| 22  | IOS              | 1.006135      | MySQL        | 57.34969      | interpersonal        | 0.4969325     |
| 23  | Java             | 1.006135      | MATLAB       | 54.33129      | management           | 0.4969325     |
| 25  | MATLAB           | 1.006135      | Stata        | 53.82822      | research             | 0.4907975     |

Plot the results.

darkcols <- brewer.pal(8,"Dark2")
names <- Combined$GoogleSkills
barplot(Combined$GoogleRatings,main="GoogleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

names <- Combined$IndeedSkills
barplot(Combined$IndeedRatings,main="IndeedRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

names <- Combined$KaggleSkills
barplot(Combined$KaggleRatings,main="KaggleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

3. Bubble Graph Showing Weighted Rank of Skills by Skill Type and Source (Google, Indeed and Kaggle)

data<- read.csv("https://raw.githubusercontent.com/ChristopheHunt/MSDA---Coursework/master/Data%20607/Homework/Group%20Project/tbl_data_version1%20.csv")
ggplot(data, aes(source_name, skill_name, label = skill_name, 
                    size = weighted_rating_overall, fill = skill_type_name)) + 
        geom_point(pch = 21) + 
        scale_fill_manual(values =  brewer.pal(9, "Set1")) + 
        scale_size_continuous(range =c(1,20)) + 
        facet_grid(~skill_type_name) + 
        theme_light() +
        xlab("Source") + 
        ylab("Skill") + 
        theme(legend.position = "none" , axis.text.y = element_text(size=3)) +
        ggtitle("Weighted Rank of Skill by Skill Type and Source")

4. Bar Graph Showing Average Weighted Overall Rating by Source and Skill Type

stab <- data %>% group_by(source_name,skill_type_name) %>% summarise(ave_wgt =mean(weighted_rating_overall))

stab
## Source: local data frame [15 x 3]
## Groups: source_name [?]
## 
##    source_name skill_type_name    ave_wgt
##         (fctr)          (fctr)      (dbl)
## 1       Google        business  0.4907975
## 2       Google   communication  0.3865031
## 3       Google            math  0.8374233
## 4       Google     programming  2.7970552
## 5       Google   visualization  0.4049080
## 6       Indeed        business  9.7392638
## 7       Indeed   communication  4.8588957
## 8       Indeed            math 12.0379601
## 9       Indeed     programming 52.9226994
## 10      Indeed   visualization  6.7062883
## 11      Kaggle        business  0.8179959
## 12      Kaggle   communication  0.4601227
## 13      Kaggle            math  1.6748466
## 14      Kaggle     programming  4.6533742
## 15      Kaggle   visualization  0.2361963
ggplot(stab, aes(x =source_name, y=round(ave_wgt,2), fill = skill_type_name)) +  geom_bar(stat="identity",position="dodge") + xlab("Source") + ylab("Average Weighted Rating Overall") + ggtitle("Average Weighted Overall Rating by Source and Skill Type") 

Jeff? Musa?

4. Analysis of Bar Graph Showing Average Weighted Overall Rating by Source and Skill Type

The data from all three sources show that programming is the primary and predominant skill needed in data science. The average weighted overall rating for programming, which included skills such as GIS, machine learning, and Python, exceeded the average weighted overall rating for all the other skill types combined.

For all three sources, math skills came in second, followed by business skills. For Google and Indeed, visualization skills came in fourth, followed by communication skills last. For our Kaggle source, communication skills came in fourth, followed by visualization skills last.

There may be several reasons for these results. First, our group’s classification of data science skill set types (programming, math, business, communication, and visualization) is not mutually exclusive. There are obvious overlaps between programming, math, and visualization skills. It seems that when employers post skills on job boards, or when bloggers write articles on data science, they assume that when a person has the skill to program in Python, Hadoop, machine learning tools, or R, the math and visualization skills are already part of the package. Employers and writers assume that if a person is proficient in a data science programming language, he or she also has the math and visualization skills that come with knowing that language. Second, programming is the predominant skill needed in the early stages of the data science process, such as data collection, data cleaning, and building algorithms and models. It is only at the visualization and data analysis stage that math, communication, and visualization skills become as significant as programming skills. Third, domain knowledge and expertise (business skills), although as important as technical and math skills (see the data science Venn diagram), are not emphasized on job sites. Most data scientist jobs are entry- or mid-level positions that do not require domain expertise. That expertise is assumed to come later, as the employee gains more experience with the company and learns its business processes.

So what is the most valued skill type in data science? Not surprisingly, technical skills such as programming and math skills are the most valued. You need to be technically savvy to have a career in data science.

b. Conclusion