Introduction

For Project 3, our class split up into three teams: an initial Sourcer team to scrape the data and output source CSVs, a Transformer team to tidy and transform the data with the use of a database, and a Presenter team to create graphical representations and assemble the final product.

The class had agreed upon a standard format: skill_type, skill, rating. Due to varieties in the datasets received by the Transformers, variable and structure cleanup was necessary to get the three primary datasets into a usable format.

The Sourcer group was able to output three lists of skills. As the class did not know the end result of the scraping before starting, the categorization of these skills into skill_type had to be done downstream by the Transformers. The received data was reviewed to identify skill type categories. Via team discussion and with the use of supplmental skill lists from the Source team, five categories were identified.

All skills were assigned to one of five categories:

Programming

Math

Visualization

Business

Communication

Process Summary:

Raw CSVs were imported.

Nonsensical or messy skill names were either rejected or cleaned.

Skills with zero ratings were rejected.

For each dataframe, skills were grouped by skill name.

Each dataframe was run through a lookup table to assign skill type names.

The resulting cleaned datasets were output to CSVs and handed off to the other Transformers for processing.

Load Packages.

suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(stringr)))
suppressMessages(suppressWarnings(library(tidyr)))
suppressMessages(suppressWarnings(library(plyr)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(RCurl)))

Import and Clean Source 1.

x08 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source08.csv") #import source CSV

s08<-data.frame(read.csv(text=x08, header=T, stringsAsFactors = FALSE, sep=""))
suppressWarnings(s08<-separate(s08,frequency., into = c("count", "rating"), sep = " "))
s08 <- subset(s08[, c(3,2)]) #subset needed columns
s08$skill[s08$skill == "machine"] <- "machine learning" #rename skill
s08$skill[s08$skill == "big"] <- "big data" #rename skill
s08$skill[s08$skill == "scriptingpythonrjavascala"] <- "r" #rename skill
s08$skill[s08$skill == "predictive"] <- "predictive analytics" #rename skill
#kable(head(s08))

#import source 1 skill type lookup table
s08csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s08v.csv") #import source CSV
s08v<-data.frame(read.csv(text=s08csv, header=T, stringsAsFactors = FALSE))
#kable(head(s08v))
#kable(s08v)

#join lookup table
s08c<- join(s08, s08v, by = "skill")
s08c<- subset(s08c[, c(3,1,2)])
kable(head(s08c))

skill_type	skill	rating
business	analysis	2
math	analytics	23
programming	big data	7
business	business	14
visualization	design	6
programming	hadoop	6

Import and Clean Source 2.

x10 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source10.csv") #import source CSV

s10<-data.frame(read.csv(text=x10, header=T, stringsAsFactors = FALSE))
#s10 <- subset(s10[c(1:80), c(2,4)])
s10<- subset(s10[, c(2,4)]) #subset needed columns
s10<-subset(s10, hit.count>1) #omit rows with 1 or fewer hits
colnames(s10) <- c("skill","rating") #rename rows
#kable(head(s10))

#import source 2 skill translation table
s10t<-data.frame(read.csv(file="s10translation.csv", header=T, stringsAsFactors = FALSE))
#kable(head(s10t))
#kable(s10t)

#join translation table
s10r<- join(s10, s10t, by = "skill")
s10r<-ddply(s10r, c("skill_name"), summarize, rating = sum(rating)) #group by skill name
s10r<-na.omit(s10r) #omit NA skill names
colnames(s10r) <- c("skill","rating") #rename rows
#kable(s10r)

#import source 2 skill type lookup table
s10csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s10v.csv") #import source CSV
s10v<-data.frame(read.csv(text=s10csv, header=T, stringsAsFactors = FALSE))
#kable(head(s10v))
#kable(s10v)

#join lookup table
s10c<- join(s10r, s10v, by = "skill")
s10c<- subset(s10c[, c(3,1,2)]) #reorder columns
kable(head(s10c))

skill_type	skill	rating
communication	ai law	7
business	analysis	10
programming	analytics	18
programming	big data	42
programming	cloud	2
business	creativity	3

Import and Clean Source 3.

x11 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source11.csv") #import source CSV

s11<-data.frame(read.csv(text=x11, header=T, stringsAsFactors = FALSE))
s11 <- subset(s11[, c(2,6)])
s11<-ddply(s11, c("Skill_Term"), summarize, rating = sum(new_jobs_count))
colnames(s11) <- c("skill","rating")
s11<-subset(s11, rating>0) #omit rows with 0 hits
s11$skill<- as.character(s11$skill)  #rename skill
s11$skill[s11$skill == "Visualisation"] <- "Visualization"  #rename skill
#kable(s11)

#import source 3 skill type lookup table
s11csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s11v.csv") #import source CSV
s11v<-data.frame(read.csv(text=s11csv, header=T, stringsAsFactors = FALSE))
#kable(head(s11v))
#kable(s11v)

#join lookup table
s11c<- join(s11, s11v, by = "skill")
s11c<- subset(s11c[, c(3,1,2)])
kable(head(s11c))

skill_type	skill	rating
programming	agent-based Simulation	32
programming	ANOVA	138
math	Bayes networks	74
math	Bayesian Statistics	175
programming	Big Data	44
programming	BUGS	160

Export Cleaned CSV Files.

write.table(s08, "08-barman-data-cleaned.csv")
write.table(s10, "10-karr-data-cleaned.csv")
write.table(s11, "11-fanelli-data-cleaned.csv")

Conclusion

The exported cleaned CSV files were handed off to the rest of the Transfomer team for data weighting and shaping via a SQL database. The resulting views were then handed off to the Presenter team.

Data Science Skills

Group 1