This document presents the results of our exploration of the question: "What is the most valued data science skill?"
To explore and attempt to answer this question, our team took the following approach:
- Introduction, incorporating graphs and the technology stack used for communication (Christopher) ** Status = Pending **
- Team composition, including the group structure (Christopher) ** Status = Pending **
- Web scraping: the group in charge of scraping the information from the web followed similar approaches. Valerie is integrating an overview of each approach: a high-level summary plus a link to each individual RMarkdown document on RPubs, all rendered with the common .css.
  - Scott ** Status = Done **
  - Arindam ** Status = Re-run RMarkdown with the common .css and correct typos **
  - Dan F. ** Status = Re-run RMarkdown with the common .css **
  - Valerie ** Status = In Progress ** (Yadu to supply a write-up of effort and problems encountered)
- Overview of this section, to be written from each member's input:
  - Armenoush ** Status = Pending **
  - Rob ** Status = Pending ** (GitHub link needed to determine whether anything is missing)
  - Dan Brooks ** Status = Pending ** (an overview of the weighting algorithm is needed, unless it can be taken from the documentation on GitHub)
  - Keith ** Status = Pending ** (no separate RPubs document is needed; the code will be integrated here once the command is supplied)
  - Keith ** Status = Pending ** (diagram to be supplied, or it will be drawn in Visio)
- Data dictionary (Transformer Group): since we make use of the data dictionary, this section may not prove necessary; a few words will suffice.
- Hosting (Transformer Group): a few words on where the data is hosted; the cloud solution will be described.
- Keith ** Status = Done ** (no separate RPubs document is needed; the code is integrated below, with final integration by Valerie)
The Presenters used the CSV file pulled from the Data Science Skills database to create bar charts, word clouds, and other visualizations that summarize the group's findings. When we examined the data, we found that the skill names are weighted on different scales and that the top data science skills differ across the three sources.
library(RCurl)
## Warning: package 'RCurl' was built under R version 3.2.3
## Loading required package: bitops
url = "https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv"
x = getURL(url)
weightedskills = read.csv(file = textConnection(x), header = T)
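Note that read.csv() can also read the https URL directly, as a later chunk in this document does, which makes the RCurl step optional on recent versions of R.

As noted above, the three sources weight skills on very different scales. One optional way to make the ratings comparable across sources (a sketch of our own, not part of the group's pipeline) is to min-max rescale each source's ratings to [0, 1] in base R:

# rescale to [0, 1] within each source (illustrative only)
rng <- function(x) (x - min(x)) / (max(x) - min(x))
weightedskills$rating_scaled <- ave(weightedskills$weighted_rating_overall,
                                    weightedskills$source_name, FUN = rng)

After this, a rating_scaled of 1 marks the top skill within its own source, regardless of the source's raw scale.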
Include all R packages needed for visualization of the results:
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
##
## complete
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.3
Convert the columns to the proper data types, pick the relevant columns, and arrange by source:
weightedskills$skill_name = as.character(weightedskills$skill_name)
weightedskills$weighted_rating_overall = as.numeric(weightedskills$weighted_rating_overall)
weightedskills$source_name = as.character(weightedskills$source_name)
weightedskills1 = weightedskills %>%
select(skill_name, weighted_rating_overall, source_name) %>%
arrange(source_name)
Ensure the result is a plain data frame:
weightedskills1 = data.frame(weightedskills1)
Pick all rows that have a Google source:
weightedskills11 = filter(weightedskills1, source_name == "Google")
weightedskills11 = weightedskills11[order(-weightedskills11$weighted_rating_overall),]
Pick the top six Google data science skills and generate a bar graph:
h1 = head(weightedskills11)
# order the bars by descending rating, then build the plot
h1$skill_name = reorder(h1$skill_name, -h1$weighted_rating_overall)
p1 = ggplot(h1, aes(x = skill_name, y = weighted_rating_overall, fill = skill_name))
p1 + geom_bar(stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
    ggtitle("Top 6 Data Science Skills - Google")
Pick all rows that have an Indeed source:
weightedskills12 = filter(weightedskills1, source_name == "Indeed")
weightedskills12 = weightedskills12[order(-weightedskills12$weighted_rating_overall),]
Pick the top six Indeed data science skills and generate a bar graph:
h2 = head(weightedskills12)
# order the bars by descending rating, then build the plot
h2$skill_name = reorder(h2$skill_name, -h2$weighted_rating_overall)
p2 = ggplot(h2, aes(x = skill_name, y = weighted_rating_overall, fill = skill_name))
p2 + geom_bar(stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
    ggtitle("Top 6 Data Science Skills - Indeed")
Pick all rows that have a Kaggle source:
weightedskills13 = filter(weightedskills1, source_name == "Kaggle")
weightedskills13 = weightedskills13[order(-weightedskills13$weighted_rating_overall),]
Pick the top six Kaggle data science skills and generate a bar graph:
h3 = head(weightedskills13)
# order the bars by descending rating, then build the plot
h3$skill_name = reorder(h3$skill_name, -h3$weighted_rating_overall)
p3 = ggplot(h3, aes(x = skill_name, y = weighted_rating_overall, fill = skill_name))
p3 + geom_bar(stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
    ggtitle("Top 6 Data Science Skills - Kaggle")
Read the data:
jobdata <- read.csv("https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv")
Order by source_name (ascending) and weighted_rating_overall (descending):
newjobdata <- jobdata[with(jobdata, order(source_name,-weighted_rating_overall)),]
List the top 20 skill_name entries for each source_name:
Google <- subset(newjobdata, source_name == "Google", select=c(source_name, skill_name,weighted_rating_overall))
Google <- Google[c(1:20),]
Indeed <- subset(newjobdata, source_name == "Indeed", select=c(source_name, skill_name, weighted_rating_overall))
Indeed <- Indeed[c(1:20),]
Kaggle <- subset(newjobdata, source_name == "Kaggle", select=c(source_name, skill_name, weighted_rating_overall))
Kaggle <- Kaggle[c(1:20),]
Combined <- cbind(Google,Indeed,Kaggle)
# each assignment removes the first remaining column named source_name
Combined$source_name <- NULL
Combined$source_name <- NULL
Combined$source_name <- NULL
colnames(Combined)[1] <- "GoogleSkills"
colnames(Combined)[2] <- "GoogleRatings"
colnames(Combined)[3] <- "IndeedSkills"
colnames(Combined)[4] <- "IndeedRatings"
colnames(Combined)[5] <- "KaggleSkills"
colnames(Combined)[6] <- "KaggleRatings"
kable(Combined)
|    | GoogleSkills | GoogleRatings | IndeedSkills | IndeedRatings | KaggleSkills | KaggleRatings |
|----|--------------|---------------|--------------|---------------|--------------|---------------|
| 13 | big data | 21.128834 | GIS | 117.21472 | Python | 14.0858896 |
| 24 | machine learning | 7.546012 | XML | 93.06748 | machine learning | 9.5582822 |
| 21 | Hadoop | 6.036810 | text mining | 87.53374 | programming | 7.0429448 |
| 17 | data natives | 4.024540 | clustering | 83.00614 | SQL | 6.0368098 |
| 29 | Python | 3.521472 | BUGS | 80.49080 | modeling | 4.5276074 |
| 30 | R | 3.521472 | Pig | 77.97546 | big data | 3.5214724 |
| 34 | SQL | 3.018405 | JSON | 73.44785 | Hadoop | 3.0184049 |
| 28 | NOSQL | 2.515337 | SVM | 72.44172 | Java | 3.0184049 |
| 15 | data engineering | 2.012270 | DBA | 71.93865 | analytics | 2.9631902 |
| 16 | data mining | 1.509203 | ANOVA | 69.42331 | business | 1.7177914 |
| 86 | analytics | 1.417178 | Simulation | 68.41718 | statistics | 1.6748466 |
| 107 | analysis | 1.226994 | Rails | 66.40491 | MATLAB | 1.5092025 |
| 111 | information | 1.104294 | Objective C | 64.89571 | predictive analytics | 1.5092025 |
| 14 | cloud | 1.006135 | Teradata | 63.88957 | SAS | 1.5092025 |
| 18 | devops | 1.006135 | PostgreSQL | 62.38037 | team | 0.7730061 |
| 19 | fintech | 1.006135 | SPSS | 61.37423 | communication | 0.6073620 |
| 20 | galaxql | 1.006135 | Oracle | 60.87117 | R | 0.5030675 |
| 22 | IOS | 1.006135 | MySQL | 57.34969 | interpersonal | 0.4969325 |
| 23 | Java | 1.006135 | MATLAB | 54.33129 | management | 0.4969325 |
| 25 | MATLAB | 1.006135 | Stata | 53.82822 | research | 0.4907975 |
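The scale mismatch flagged earlier is plain in this table: Indeed's top-20 ratings run from roughly 54 to 117, while Google's and Kaggle's top ratings stay below 22, so ratings should be compared within a source rather than across sources.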
Plot the results:
darkcols <- brewer.pal(8,"Dark2")
names <- Combined$GoogleSkills
barplot(Combined$GoogleRatings,main="GoogleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)
names <- Combined$IndeedSkills
barplot(Combined$IndeedRatings,main="IndeedRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)
names <- Combined$KaggleSkills
barplot(Combined$KaggleRatings,main="KaggleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)
data <- read.csv("https://raw.githubusercontent.com/ChristopheHunt/MSDA---Coursework/master/Data%20607/Homework/Group%20Project/tbl_data_version1%20.csv")
ggplot(data, aes(source_name, skill_name, label = skill_name,
size = weighted_rating_overall, fill = skill_type_name)) +
geom_point(pch = 21) +
scale_fill_manual(values = brewer.pal(9, "Set1")) +
scale_size_continuous(range =c(1,20)) +
facet_grid(~skill_type_name) +
theme_light() +
xlab("Source") +
ylab("Skill") +
theme(legend.position = "none" , axis.text.y = element_text(size=3)) +
ggtitle("Weighted Rank of Skill by Skill Type and Source")
stab <- data %>% group_by(source_name, skill_type_name) %>% summarise(ave_wgt = mean(weighted_rating_overall))
stab
## Source: local data frame [15 x 3]
## Groups: source_name [?]
##
## source_name skill_type_name ave_wgt
## (fctr) (fctr) (dbl)
## 1 Google business 0.4907975
## 2 Google communication 0.3865031
## 3 Google math 0.8374233
## 4 Google programming 2.7970552
## 5 Google visualization 0.4049080
## 6 Indeed business 9.7392638
## 7 Indeed communication 4.8588957
## 8 Indeed math 12.0379601
## 9 Indeed programming 52.9226994
## 10 Indeed visualization 6.7062883
## 11 Kaggle business 0.8179959
## 12 Kaggle communication 0.4601227
## 13 Kaggle math 1.6748466
## 14 Kaggle programming 4.6533742
## 15 Kaggle visualization 0.2361963
ggplot(stab, aes(x = source_name, y = round(ave_wgt, 2), fill = skill_type_name)) +
    geom_bar(stat = "identity", position = "dodge") +
    xlab("Source") +
    ylab("Average Weighted Rating Overall") +
    ggtitle("Average Weighted Overall Rating by Source and Skill Type")
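Because Indeed's average ratings are roughly an order of magnitude larger than Google's or Kaggle's, the Indeed bars dominate this chart; the ordering of skill types within each source is the more meaningful comparison.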
### a. Visualization and Graphs
The data from all three sources show that programming is the primary and predominant skill needed in data science. The average weighted overall rating for programming, which included skills such as GIS, machine learning, and Python, exceeded the average weighted overall rating for all the other skill types combined.
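As a quick check of that claim against the stab summary computed above (a sketch; it assumes stab as defined earlier):

# compare programming's average rating with the sum of all other
# skill types' averages, per source
stab %>%
  group_by(source_name) %>%
  summarise(programming = ave_wgt[skill_type_name == "programming"],
            all_others = sum(ave_wgt[skill_type_name != "programming"]))
# programming exceeds all_others for every source
# (e.g. Indeed: 52.9 vs. about 33.3)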
For all three sources, math skills came in second, followed by business skills. For Google and Indeed, visualization skills were fourth, with communication skills last. For our Kaggle source, communication skills came in fourth, with visualization skills last.
There may be several reasons for these results. First, our group's data science skill type categories (programming, math, business, communication, and visualization) are not mutually exclusive. There are obvious overlaps between programming, math, and visualization skills. It seems that when employers post skills on job boards, or when bloggers write articles on data science, they assume that a person who can program in Python, Hadoop, machine learning frameworks, or R already has the math and visualization skills that come with knowing those tools. Second, programming is the predominant skill needed in the early stages of the data science process, such as data collection, data cleaning, and building algorithms and models. It is only at the visualization and data analysis stage that math, communication, and visualization skills become as significant as programming skills. Third, domain knowledge and expertise (business skills), although as important as technical and math skills (see the data science Venn diagram), are not emphasized on job sites. Most data scientist jobs are entry- or mid-level positions that do not require domain expertise; that expertise is assumed to come later, as the employee gains experience with the company and learns its business processes.
So what is the most valued skill type in data science? Not surprisingly, technical skills such as programming and math are the most valued. You need to be technically savvy to have a career in data science.