This is the result of our exploration in answering the question, “What is the most valued data science skill?”

In order to explore and attempt to answer this question, our team took the following approach:

Approach

Christopher is doing it. We are incorporating some graphs that I have done… including the technology stack for communication. Status: Still Pending

Team

Christopher is doing it. Team composition, including the group. Status: Still Pending

1. Scraping data from the Web

The group in charge of scraping the information from the web followed similar approaches.
Valerie is integrating an overview of the approaches for each source. I will write a high-level summary and then link to each individual RMarkdown document on RPubs. Please send me the RMarkdown using the common .css.
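As a placeholder until the individual RMarkdown documents are linked, below is a minimal sketch of the general scraping pattern; the URL, CSS selector, and skill list are illustrative assumptions only, not the ones any of our scrapers actually used.

library(rvest)

# Illustrative placeholders only -- the real URLs and selectors live in the
# individual scrapers (Google, Kaggle, Indeed).
page <- read_html("https://www.example.com/data-scientist-jobs")
postings <- html_text(html_nodes(page, ".job-description"))

# Count how often each candidate skill term appears across the postings
skills <- c("R", "Python", "SQL", "Hadoop", "machine learning")
counts <- sapply(skills, function(s) sum(grepl(s, postings, ignore.case = TRUE)))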

a. Google

Scott ** Status = Done **

b. Kaggle

Arindam ** Status = Re-run Rmarkdown with common .css and correct typos **

c. Indeed

Dan F. ** Status = Re-Run Rmarkdown with common .css**

d. Other

Valerie ** Status = In Progress ** Yadu, please send me the Word document describing the effort and the problems encountered.

2. Transforming and aggregating the raw data

We need an overview of what this section is; I am hoping I can write it from each member's input.

a. Consolidation of the various .csv files

Armenoush ** Status = Pending **
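Pending Armenoush's write-up, here is a minimal sketch of what the consolidation step could look like; the file names are placeholders for the per-source scraping outputs.

library(dplyr)

# File names are placeholders for the per-source scraping outputs
google <- read.csv("google_skills.csv", stringsAsFactors = FALSE)
indeed <- read.csv("indeed_skills.csv", stringsAsFactors = FALSE)
kaggle <- read.csv("kaggle_skills.csv", stringsAsFactors = FALSE)

# Tag each row with its source and stack the three files into one table
consolidated <- bind_rows(
  mutate(google, source_name = "Google"),
  mutate(indeed, source_name = "Indeed"),
  mutate(kaggle, source_name = "Kaggle")
)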

b. Mapping of detailed skills to skill set and skill type

Rob ** Status = Pending ** Please send me the link to GitHub as soon as possible so that I can determine whether anything is missing.
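Until the GitHub link is in hand, here is a hedged sketch of the mapping step; the lookup table below is hypothetical and only illustrates the join against the consolidated table from the previous sketch.

library(dplyr)

# Hypothetical lookup table -- the real mapping lives in Rob's repository
skill_map <- data.frame(
  skill_name      = c("Python", "R", "SQL", "communication", "ANOVA"),
  skill_set_name  = c("languages", "languages", "databases", "soft skills", "statistics"),
  skill_type_name = c("programming", "programming", "programming", "communication", "math"),
  stringsAsFactors = FALSE
)

# Attach skill set and skill type to every scraped skill by name
mapped <- left_join(consolidated, skill_map, by = "skill_name")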

c. Weighting and aggregation

Dan Brooks ** Status = Pending ** Dan, I think I will need an overview of the weighting algorithm if you can provide one, unless I can easily get it from the documentation on GitHub.
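Until I have that overview, here is a placeholder sketch of one possible weighting scheme (mention counts rescaled within each source); this is not necessarily the algorithm Dan actually used.

library(dplyr)

# Placeholder only, not necessarily Dan's algorithm: assuming `mapped` (from the
# sketch above) has one row per skill mention, weight each skill by how often it
# appears and rescale within its source.
weighted <- mapped %>%
  group_by(source_name, skill_type_name, skill_name) %>%
  summarise(n = n()) %>%
  group_by(source_name) %>%
  mutate(weighted_rating_overall = 100 * n / sum(n)) %>%
  ungroup()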

3. Loading the transformed data to a database hosted in the cloud

Keith ** Status = Pending ** I do not think we need a separate RPubs document for this; I will integrate the code here. I just need the command.
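In the meantime, here is a minimal sketch of the load command, assuming a MySQL instance in the cloud; the host, database name, and credentials are placeholders until Keith confirms the details.

library(DBI)
library(RMySQL)

# Connection details are placeholders for the cloud-hosted instance
con <- dbConnect(MySQL(),
                 host     = "cloud-host.example.com",
                 dbname   = "data_science_skills",
                 user     = "ds607",
                 password = "********")

# Write the weighted skills table (from the weighting sketch above) to the database
dbWriteTable(con, "tbl_data", weighted, row.names = FALSE, overwrite = TRUE)
dbDisconnect(con)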

a. SQL Schema

Keith ** Status = Pending ** Keith, let me know if we can get this, or I will create one in Visio.
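Until we have the actual schema (or the Visio diagram), here is a hypothetical DDL sketch based only on the columns used elsewhere in this document.

# Hypothetical schema only -- the actual schema is Keith's to confirm.
# `con` is the placeholder connection opened in the load step above.
dbSendQuery(con, "
  CREATE TABLE tbl_data (
    id                      INT AUTO_INCREMENT PRIMARY KEY,
    source_name             VARCHAR(50),
    skill_name              VARCHAR(100),
    skill_type_name         VARCHAR(50),
    weighted_rating_overall DECIMAL(10,4)
  )")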

b. Data dictionary

Transformer Group: I think we are making use of a data dictionary; this section may not prove necessary. I just need a few words on it.

c. Cloud hosting

Transformer Group: I just need a few words on it, basically where we are hosting it… I can speak to the cloud solution.

d. Pulling data from DB

Keith ** Status = Done ** I do not think we need a separate RPubs document for this; I will integrate the code here. I just need the command. Valerie, integration to be done…
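For the integration, here is a minimal sketch of the pull command, reusing the same placeholder connection details as in the load step and caching the result as the CSV the presenters read.

library(DBI)
library(RMySQL)

# Same placeholder connection details as in the load step above
con <- dbConnect(MySQL(),
                 host = "cloud-host.example.com", dbname = "data_science_skills",
                 user = "ds607", password = "********")

# Pull the table and cache it as the CSV used by the presenters
skills_df <- dbGetQuery(con, "SELECT * FROM tbl_data")
write.csv(skills_df, "tbl_data_v2.csv", row.names = FALSE)
dbDisconnect(con)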

4. Visualization and Analysis

a. Visualization and Graphs

The Presenters used the CSV file pulled from the Data Science Skills Database to create bar charts, word clouds, and other visualizations that show and summarize the group’s findings. When we examined the data, we found that the skill names are weighted on different scales and that the top data science skills differ across the three sources.
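Because the weights sit on different scales, one way to put the three sources side by side (an illustrative sketch only; the graphs below use the raw weights) is to rescale each source’s ratings to a common 0 to 100 range.

library(dplyr)

# Rescale each source's ratings so its top skill is 100; illustration only
skills_raw <- read.csv("https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv", stringsAsFactors = FALSE)
rescaled <- skills_raw %>%
  group_by(source_name) %>%
  mutate(rating_scaled = 100 * weighted_rating_overall / max(weighted_rating_overall)) %>%
  ungroup()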

1. Bar Graph Showing the Top 6 Data Science Skills from the 3 Sources (Google, Indeed and Kaggle)

library(RCurl)
## Warning: package 'RCurl' was built under R version 3.2.3
## Loading required package: bitops
url = "https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv"
x = getURL(url)
weightedskills = read.csv(file = textConnection(x), header = T)

Include all R packages needed for visualization of the results.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:RCurl':
## 
##     complete
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.2.3

Convert columns to the proper data types, pick the relevant columns, and arrange by source.

weightedskills$skill_name = as.character(weightedskills$skill_name)
weightedskills$weighted_rating_overall = as.numeric(weightedskills$weighted_rating_overall)
weightedskills$source_name = as.character(weightedskills$source_name)
weightedskills1 = weightedskills %>%
  select(skill_name, weighted_rating_overall, source_name) %>%
  arrange(source_name)

Convert the result to a plain data frame.

weightedskills1 = data.frame(weightedskills1)

Pick all rows that have a Google source.

weightedskills11 = filter(weightedskills1, source_name == "Google")
weightedskills11 = weightedskills11[order(-weightedskills11$weighted_rating_overall),]

Pick the top six Google data science skills and generate a bar graph.

h1 = head(weightedskills11)
p1 = ggplot(h1, aes(y = weighted_rating_overall, fill = skill_name))
h1$skill_name = reorder(h1$skill_name, -h1$weighted_rating_overall)
p1 + geom_bar(aes(x = skill_name), data = h1, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Google")

Pick all rows that have an Indeed source.

weightedskills12 = filter(weightedskills1, source_name == "Indeed")
weightedskills12 = weightedskills12[order(-weightedskills12$weighted_rating_overall),]

Pick the top six Indeed data science skills and generate a bar graph.

h2 = head(weightedskills12)
p2 = ggplot(h2, aes(y = weighted_rating_overall, fill = skill_name))
h2$skill_name = reorder(h2$skill_name, -h2$weighted_rating_overall)
p2 + geom_bar(aes(x = skill_name), data = h2, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Indeed")

Pick all rows that have a Kaggle source.

weightedskills13 = filter(weightedskills1, source_name == "Kaggle")
weightedskills13 = weightedskills13[order(-weightedskills13$weighted_rating_overall),]

Pick the top six Kaggle data science skills and generate a bar graph.

h3 = head(weightedskills13)
p3 = ggplot(h3, aes(y = weighted_rating_overall, fill = skill_name))
h3$skill_name = reorder(h3$skill_name, -h3$weighted_rating_overall)
p3 + geom_bar(aes(x = skill_name), data = h3, stat = "identity") + xlab("skill name") + ylab("weighted rating overall") + theme(legend.position = "none") +
ggtitle("Top 6 Data Science Skills - Kaggle")

2. Horizontal Bar Graph Showing the Top 20 Data Science Skills from the 3 Sources (Google, Indeed and Kaggle)

Read the data.

jobdata <- read.csv("https://raw.githubusercontent.com/danielhong98/MSDA-Spring-2016/347a383eae3b9f02bc5d128efb5de14e1f688f8e/tbl_data_v2.csv")

Order by source_name (ascending) and weighted_rating_overall (descending).

newjobdata <- jobdata[with(jobdata, order(source_name,-weighted_rating_overall)),]

List the top 20 skills (skill_name) for each source_name.

Google <- subset(newjobdata, source_name == "Google", select=c(source_name, skill_name,weighted_rating_overall))
Google <- Google[c(1:20),]
Indeed <- subset(newjobdata, source_name == "Indeed", select=c(source_name, skill_name, weighted_rating_overall))
Indeed <- Indeed[c(1:20),]
Kaggle <- subset(newjobdata, source_name == "Kaggle", select=c(source_name, skill_name, weighted_rating_overall))
Kaggle <- Kaggle[c(1:20),]
Combined <- cbind(Google,Indeed,Kaggle)
Combined$source_name <- NULL
Combined$source_name <- NULL
Combined$source_name <- NULL
colnames(Combined)[1] <- "GoogleSkills"
colnames(Combined)[2] <- "GoogleRatings"
colnames(Combined)[3] <- "IndeedSkills"
colnames(Combined)[4] <- "IndeedRatings"
colnames(Combined)[5] <- "KaggleSkills"
colnames(Combined)[6] <- "KaggleRatings"
kable(Combined)
|     | GoogleSkills     | GoogleRatings | IndeedSkills | IndeedRatings | KaggleSkills         | KaggleRatings |
|-----|------------------|---------------|--------------|---------------|----------------------|---------------|
| 13  | big data         | 21.128834     | GIS          | 117.21472     | Python               | 14.0858896    |
| 24  | machine learning | 7.546012      | XML          | 93.06748      | machine learning     | 9.5582822     |
| 21  | Hadoop           | 6.036810      | text mining  | 87.53374      | programming          | 7.0429448     |
| 17  | data natives     | 4.024540      | clustering   | 83.00614      | SQL                  | 6.0368098     |
| 29  | Python           | 3.521472      | BUGS         | 80.49080      | modeling             | 4.5276074     |
| 30  | R                | 3.521472      | Pig          | 77.97546      | big data             | 3.5214724     |
| 34  | SQL              | 3.018405      | JSON         | 73.44785      | Hadoop               | 3.0184049     |
| 28  | NOSQL            | 2.515337      | SVM          | 72.44172      | Java                 | 3.0184049     |
| 15  | data engineering | 2.012270      | DBA          | 71.93865      | analytics            | 2.9631902     |
| 16  | data mining      | 1.509203      | ANOVA        | 69.42331      | business             | 1.7177914     |
| 86  | analytics        | 1.417178      | Simulation   | 68.41718      | statistics           | 1.6748466     |
| 107 | analysis         | 1.226994      | Rails        | 66.40491      | MATLAB               | 1.5092025     |
| 111 | information      | 1.104294      | Objective C  | 64.89571      | predictive analytics | 1.5092025     |
| 14  | cloud            | 1.006135      | Teradata     | 63.88957      | SAS                  | 1.5092025     |
| 18  | devops           | 1.006135      | PostgreSQL   | 62.38037      | team                 | 0.7730061     |
| 19  | fintech          | 1.006135      | SPSS         | 61.37423      | communication        | 0.6073620     |
| 20  | galaxql          | 1.006135      | Oracle       | 60.87117      | R                    | 0.5030675     |
| 22  | IOS              | 1.006135      | MySQL        | 57.34969      | interpersonal        | 0.4969325     |
| 23  | Java             | 1.006135      | MATLAB       | 54.33129      | management           | 0.4969325     |
| 25  | MATLAB           | 1.006135      | Stata        | 53.82822      | research             | 0.4907975     |

Plot the results.

darkcols <- brewer.pal(8,"Dark2")
names <- Combined$GoogleSkills
barplot(Combined$GoogleRatings,main="GoogleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

names <- Combined$IndeedSkills
barplot(Combined$IndeedRatings,main="IndeedRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

names <- Combined$KaggleSkills
barplot(Combined$KaggleRatings,main="KaggleRatings", horiz=TRUE, names.arg=names, las=1, col=darkcols, cex.axis=0.5, cex.names = 0.5)

3. Bubble Graph Showing Weighted Rank of Skills by Skill Type and Source (Google, Indeed and Kaggle)

data<- read.csv("https://raw.githubusercontent.com/ChristopheHunt/MSDA---Coursework/master/Data%20607/Homework/Group%20Project/tbl_data_version1%20.csv")
ggplot(data, aes(source_name, skill_name, label = skill_name, 
                    size = weighted_rating_overall, fill = skill_type_name)) + 
        geom_point(pch = 21) + 
        scale_fill_manual(values =  brewer.pal(9, "Set1")) + 
        scale_size_continuous(range =c(1,20)) + 
        facet_grid(~skill_type_name) + 
        theme_light() +
        xlab("Source") + 
        ylab("Skill") + 
        theme(legend.position = "none" , axis.text.y = element_text(size=3)) +
        ggtitle("Weighted Rank of Skill by Skill Type and Source")

4. Bar Graph Showing Average Weighted Overall Rating by Source and Skill Type

stab <- data %>% group_by(source_name,skill_type_name) %>% summarise(ave_wgt =mean(weighted_rating_overall))

stab
## Source: local data frame [15 x 3]
## Groups: source_name [?]
## 
##    source_name skill_type_name    ave_wgt
##         (fctr)          (fctr)      (dbl)
## 1       Google        business  0.4907975
## 2       Google   communication  0.3865031
## 3       Google            math  0.8374233
## 4       Google     programming  2.7970552
## 5       Google   visualization  0.4049080
## 6       Indeed        business  9.7392638
## 7       Indeed   communication  4.8588957
## 8       Indeed            math 12.0379601
## 9       Indeed     programming 52.9226994
## 10      Indeed   visualization  6.7062883
## 11      Kaggle        business  0.8179959
## 12      Kaggle   communication  0.4601227
## 13      Kaggle            math  1.6748466
## 14      Kaggle     programming  4.6533742
## 15      Kaggle   visualization  0.2361963
ggplot(stab, aes(x =source_name, y=round(ave_wgt,2), fill = skill_type_name)) +  geom_bar(stat="identity",position="dodge") + xlab("Source") + ylab("Average Weighted Rating Overall") + ggtitle("Average Weighted Overall Rating by Source and Skill Type") 

Jeff? Musa?

4. Analysis of Bar Graph Showing Average Weighted Overall Rating by Source and Skill Type

The data from all three sources show that programming is the primary and predominant skill needed in data science. The average weighted overall rating for programming, which included skills such as GIS, machine learning, and Python, exceeded the average weighted overall rating for all the other skill types combined.

For all three sources, math skills came in second, followed by business skills. For Google and Indeed, visualization skills came in fourth, followed by communication skills last. For our Kaggle source, communication skills came in fourth, followed by visualization skills last.

There may be several reasons for these results. First, our group’s classification of data science skill set types (programming, math, business, communication, and visualization) is not mutually exclusive. There are obvious overlaps between programming, math, and visualization skills. It seems that when employers post skills on job boards, or when bloggers write articles on data science, they assume that when a person has the skill to program in Python, Hadoop, machine learning tools, or R, the math and visualization skills are already part of the package. Employers and writers assume that if a person is proficient in a data science programming language, he or she also has the math and visualization skills that come with knowing that language. Second, programming is the predominant skill needed in the early stages of the data science process, such as data collection, data cleaning, and building algorithms and models. It is only at the visualization and data analysis stage that math, communication, and visualization skills become as significant as programming skills. Third, domain knowledge and expertise (business skills), although as important as technical and math skills (see the data science Venn diagram), are not emphasized on job sites. Most data scientist jobs are entry- or mid-level positions that do not require domain expertise. That expertise is assumed to come later, as the employee gains more experience with the company and learns its business processes.

So what is the most valued skill type in data science? Not surprisingly, technical skills such as programming and math skills are the most valued. You need to be technically savvy to have a career in data science.

b. Conclusion