For Project 3, our class split up into three teams: an initial Sourcer team to scrape the data and output source CSVs, a Transformer team to tidy and transform the data with the use of a database, and a Presenter team to create graphical representations and assemble the final product.
The class had agreed upon a standard output format: `skill_type`, `skill`, `rating`. Because the datasets the Transformers received varied in structure and naming, variable and structural cleanup was necessary to get the three primary datasets into a usable format.
The Sourcer group output three lists of skills. Because the class did not know what the scraping would yield before starting, the categorization of these skills into `skill_type` had to be done downstream by the Transformers. The received data was reviewed to identify skill type categories, and through team discussion and the use of supplemental skill lists from the Sourcer team, five categories were identified: business, communication, math, programming, and visualization.
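Before the per-source code, a miniature sketch of this approach is shown below with made-up skills and ratings (the values are illustrative only, not from the scraped data); the actual cleanup that follows does the same thing with plyr's `join()`.

```r
# Hypothetical mini-example of the agreed skill_type, skill, rating format
# and the lookup-table join used to assign skill_type (values are illustrative only).
example_skills <- data.frame(skill  = c("r", "statistics", "writing"),
                             rating = c(42, 30, 12),
                             stringsAsFactors = FALSE)
example_lookup <- data.frame(skill      = c("r", "statistics", "writing"),
                             skill_type = c("programming", "math", "communication"),
                             stringsAsFactors = FALSE)
example_standard <- merge(example_skills, example_lookup, by = "skill") #base-R stand-in for plyr::join
example_standard <- example_standard[, c("skill_type", "skill", "rating")] #reorder to the standard format
example_standard
```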
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(stringr)))
suppressMessages(suppressWarnings(library(tidyr)))
suppressMessages(suppressWarnings(library(plyr)))
suppressMessages(suppressWarnings(library(dplyr)))
suppressMessages(suppressWarnings(library(RCurl)))
x08 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source08.csv") #import source CSV
s08<-data.frame(read.csv(text=x08, header=T, stringsAsFactors = FALSE, sep=""))
suppressWarnings(s08<-separate(s08,frequency., into = c("count", "rating"), sep = " ")) #split combined frequency column into count and rating
s08 <- subset(s08[, c(3,2)]) #subset needed columns
s08$skill[s08$skill == "machine"] <- "machine learning" #rename skill
s08$skill[s08$skill == "big"] <- "big data" #rename skill
s08$skill[s08$skill == "scriptingpythonrjavascala"] <- "r" #rename skill
s08$skill[s08$skill == "predictive"] <- "predictive analytics" #rename skill
#kable(head(s08))
#import source 1 skill type lookup table
s08csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s08v.csv") #import source CSV
s08v<-data.frame(read.csv(text=s08csv, header=T, stringsAsFactors = FALSE))
#kable(head(s08v))
#kable(s08v)
#join lookup table
s08c<- join(s08, s08v, by = "skill")
s08c<- subset(s08c[, c(3,1,2)]) #reorder columns
kable(head(s08c))

| skill_type | skill | rating |
|---|---|---|
| business | analysis | 2 |
| math | analytics | 23 |
| programming | big data | 7 |
| business | business | 14 |
| visualization | design | 6 |
| programming | hadoop | 6 |
x10 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source10.csv") #import source CSV
s10<-data.frame(read.csv(text=x10, header=T, stringsAsFactors = FALSE))
#s10 <- subset(s10[c(1:80), c(2,4)])
s10<- subset(s10[, c(2,4)]) #subset needed columns
s10<-subset(s10, hit.count>1) #omit rows with 1 or fewer hits
colnames(s10) <- c("skill","rating") #rename columns
#kable(head(s10))
#import source 2 skill translation table
s10t<-data.frame(read.csv(file="s10translation.csv", header=T, stringsAsFactors = FALSE))
#kable(head(s10t))
#kable(s10t)
#join translation table
s10r<- join(s10, s10t, by = "skill")
s10r<-ddply(s10r, c("skill_name"), summarize, rating = sum(rating)) #group by skill name
s10r<-na.omit(s10r) #omit NA skill names
colnames(s10r) <- c("skill","rating") #rename columns
#kable(s10r)
#import source 2 skill type lookup table
s10csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s10v.csv") #import source CSV
s10v<-data.frame(read.csv(text=s10csv, header=T, stringsAsFactors = FALSE))
#kable(head(s10v))
#kable(s10v)
#join lookup table
s10c<- join(s10r, s10v, by = "skill")
s10c<- subset(s10c[, c(3,1,2)]) #reorder columns
kable(head(s10c))

| skill_type | skill | rating |
|---|---|---|
| communication | ai law | 7 |
| business | analysis | 10 |
| programming | analytics | 18 |
| programming | big data | 42 |
| programming | cloud | 2 |
| business | creativity | 3 |
x11 <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/source11.csv") #import source CSV
s11<-data.frame(read.csv(text=x11, header=T, stringsAsFactors = FALSE))
s11 <- subset(s11[, c(2,6)]) #subset needed columns
s11<-ddply(s11, c("Skill_Term"), summarize, rating = sum(new_jobs_count)) #group by skill term and sum job counts
colnames(s11) <- c("skill","rating") #rename columns
s11<-subset(s11, rating>0) #omit rows with 0 hits
s11$skill<- as.character(s11$skill) #convert skill to character
s11$skill[s11$skill == "Visualisation"] <- "Visualization" #standardize spelling
#kable(s11)
#import source 3 skill type lookup table
s11csv <- getURL("https://raw.githubusercontent.com/spsstudent15/2016-01-607-09-P3/master/s11v.csv") #import source CSV
s11v<-data.frame(read.csv(text=s11csv, header=T, stringsAsFactors = FALSE))
#kable(head(s11v))
#kable(s11v)
#join lookup table
s11c<- join(s11, s11v, by = "skill")
s11c<- subset(s11c[, c(3,1,2)]) #reorder columns
kable(head(s11c))

| skill_type | skill | rating |
|---|---|---|
| programming | agent-based Simulation | 32 |
| programming | ANOVA | 138 |
| math | Bayes networks | 74 |
| math | Bayesian Statistics | 175 |
| programming | Big Data | 44 |
| programming | BUGS | 160 |
write.csv(s08c, "08-barman-data-cleaned.csv", row.names = FALSE) #export cleaned source 1
write.csv(s10c, "10-karr-data-cleaned.csv", row.names = FALSE) #export cleaned source 2
write.csv(s11c, "11-fanelli-data-cleaned.csv", row.names = FALSE) #export cleaned source 3

The exported cleaned CSV files were handed off to the rest of the Transformer team for data weighting and shaping via a SQL database. The resulting views were then handed off to the Presenter team.
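The database work itself (weighting and shaping) was handled by other members of the Transformer team and is not reproduced here. Purely as an illustration, assuming a local SQLite database accessed through the DBI and RSQLite packages (an assumption; the team's actual database and queries may have differed), the cleaned data frames could be loaded and exposed as a single combined view along these lines:

```r
# Illustrative sketch only -- assumes DBI/RSQLite; the Transformer team's actual
# database engine, table names, and weighting logic are not shown in this report.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "skills.sqlite") #hypothetical database file

# Load each cleaned dataset into its own table
dbWriteTable(con, "source08", s08c, overwrite = TRUE)
dbWriteTable(con, "source10", s10c, overwrite = TRUE)
dbWriteTable(con, "source11", s11c, overwrite = TRUE)

# A simple combined view; the real weighting was applied downstream by the team
dbExecute(con, "
  CREATE VIEW IF NOT EXISTS all_skills AS
  SELECT skill_type, skill, rating, '08' AS source FROM source08
  UNION ALL
  SELECT skill_type, skill, rating, '10' AS source FROM source10
  UNION ALL
  SELECT skill_type, skill, rating, '11' AS source FROM source11
")

combined <- dbGetQuery(con, "SELECT * FROM all_skills")
dbDisconnect(con)
```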