Introduction

For Project 3, our class split up into three teams: an initial Sourcer team to scrape the data and output source CSVs, a Transformer team to tidy and transform the data with the use of a database, and a Presenter team to create graphical representations and assemble the final product.

The class had agreed upon a standard format: skill_type, skill, rating. Due to varieties in the datasets received by the Transformers, variable and structure cleanup was necessary to get the three primary datasets into a usable format.

The Sourcer group was able to output three lists of skills. As the class did not know the end result of the scraping before starting, the categorization of these skills into skill_type had to be done downstream by the Transformers. The received data was reviewed to identify skill type categories. Via team discussion and with the use of supplmental skill lists from the Source team, five categories were identified.

All skills were assigned to one of five categories:

Programming
Math
Visualization
Business
Communication

Process Summary:

Raw CSVs were imported.
Nonsensical or messy skill names were either rejected or cleaned.
Skills with zero ratings were rejected.
For each dataframe, skills were grouped by skill name.
Each dataframe was run through a lookup table to assign skill type names.
The resulting cleaned datasets were output to CSVs and handed off to the other Transformers for processing.

Load Packages.

suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(stringr)))
suppressMessages(suppressWarnings(library(tidyr)))
suppressMessages(suppressWarnings(library(plyr)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(RCurl)))

Import and Clean Source 1.

x08 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source08.csv") #import source CSV

s08<-data.frame(read.csv(text=x08, header=T, stringsAsFactors = FALSE, sep=""))
suppressWarnings(s08<-separate(s08,frequency., into = c("count", "rating"), sep = " "))
s08 <- subset(s08[, c(3,2)]) #subset needed columns
s08$skill[s08$skill == "machine"] <- "machine learning" #rename skill
s08$skill[s08$skill == "big"] <- "big data" #rename skill
s08$skill[s08$skill == "scriptingpythonrjavascala"] <- "r" #rename skill
s08$skill[s08$skill == "predictive"] <- "predictive analytics" #rename skill
#kable(head(s08))

#import source 1 skill type lookup table
s08csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s08v.csv") #import source CSV
s08v<-data.frame(read.csv(text=s08csv, header=T, stringsAsFactors = FALSE))
#kable(head(s08v))
#kable(s08v)

#join lookup table
s08c<- join(s08, s08v, by = "skill")
s08c<- subset(s08c[, c(3,1,2)])
kable(head(s08c))
skill_type skill rating
business analysis 2
math analytics 23
programming big data 7
business business 14
visualization design 6
programming hadoop 6

Import and Clean Source 2.

x10 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source10.csv") #import source CSV

s10<-data.frame(read.csv(text=x10, header=T, stringsAsFactors = FALSE))
#s10 <- subset(s10[c(1:80), c(2,4)])
s10<- subset(s10[, c(2,4)]) #subset needed columns
s10<-subset(s10, hit.count>1) #omit rows with 1 or fewer hits
colnames(s10) <- c("skill","rating") #rename rows
#kable(head(s10))

#import source 2 skill translation table
s10t<-data.frame(read.csv(file="s10translation.csv", header=T, stringsAsFactors = FALSE))
#kable(head(s10t))
#kable(s10t)

#join translation table
s10r<- join(s10, s10t, by = "skill")
s10r<-ddply(s10r, c("skill_name"), summarize, rating = sum(rating)) #group by skill name
s10r<-na.omit(s10r) #omit NA skill names
colnames(s10r) <- c("skill","rating") #rename rows
#kable(s10r)

#import source 2 skill type lookup table
s10csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s10v.csv") #import source CSV
s10v<-data.frame(read.csv(text=s10csv, header=T, stringsAsFactors = FALSE))
#kable(head(s10v))
#kable(s10v)

#join lookup table
s10c<- join(s10r, s10v, by = "skill")
s10c<- subset(s10c[, c(3,1,2)]) #reorder columns
kable(head(s10c))
skill_type skill rating
communication ai law 7
business analysis 10
programming analytics 18
programming big data 42
programming cloud 2
business creativity 3

Import and Clean Source 3.

x11 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source11.csv") #import source CSV

s11<-data.frame(read.csv(text=x11, header=T, stringsAsFactors = FALSE))
s11 <- subset(s11[, c(2,6)])
s11<-ddply(s11, c("Skill_Term"), summarize, rating = sum(new_jobs_count))
colnames(s11) <- c("skill","rating")
s11<-subset(s11, rating>0) #omit rows with 0 hits
s11$skill<- as.character(s11$skill)  #rename skill
s11$skill[s11$skill == "Visualisation"] <- "Visualization"  #rename skill
#kable(s11)

#import source 3 skill type lookup table
s11csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s11v.csv") #import source CSV
s11v<-data.frame(read.csv(text=s11csv, header=T, stringsAsFactors = FALSE))
#kable(head(s11v))
#kable(s11v)

#join lookup table
s11c<- join(s11, s11v, by = "skill")
s11c<- subset(s11c[, c(3,1,2)])
kable(head(s11c))
skill_type skill rating
programming agent-based Simulation 32
programming ANOVA 138
math Bayes networks 74
math Bayesian Statistics 175
programming Big Data 44
programming BUGS 160

Export Cleaned CSV Files.

write.table(s08, "08-barman-data-cleaned.csv")
write.table(s10, "10-karr-data-cleaned.csv")
write.table(s11, "11-fanelli-data-cleaned.csv")

Conclusion

The exported cleaned CSV files were handed off to the rest of the Transfomer team for data weighting and shaping via a SQL database. The resulting views were then handed off to the Presenter team.