Sourcing Google search

Load libraries

The Data Model

Development started with a conversation about how to model the data we collect from matching skills and rating them. “Search”, “source”, “match” and “rate” were general themes that persisted throughout the project.

Excerpted dialogue:

The issue I want to discuss . . . is “search”. We could query the term: “Data Science” with the filter: “skills”, get the results from each source and be done. If we take a more expansive position, we can include queries for synonyms such as term: “Data Analytics”, term: “professional skill” or subsets such as term: “Big Data”, skill: “R”.

The idea is to evaluate the information sources . . . but to generalize the categorization engine that evaluates . . . This may be useful if we get a lot of different types of data to compare.

. . . maybe we should start simpler - what do you think of a subset

there’s a list of skills in the DB
there’s a bunch of different R scripts/source pairs that use different strategies at rating those skills
we use a cross table to score/join the “thing” with the “source”
the final skill score is the average of what the different tools scored that skill in the cross table…
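
The cross-table idea in the list above can be sketched in a few lines of base R. The skill names, source names and scores below are made up for illustration; they are not project data, and the sketch assumes every source reports on a comparable scale.

```r
# Hypothetical cross table: one row per skill, one column per rating source.
# NA means a source produced no score for that skill.
scores <- matrix(
  c(10,  8, NA,
     4,  6,  5,
     7, NA,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    skill  = c("r", "python", "sql"),
    source = c("google", "twitter", "trends")
  )
)

# Final skill score: the mean of whatever the individual sources reported.
final_score <- rowMeans(scores, na.rm = TRUE)
final_score <- sort(final_score, decreasing = TRUE)
print(final_score)
```

Ignoring missing scores with `na.rm = TRUE` is one possible choice; treating a missing score as zero would penalise skills that a source never mentions.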

. . . look into whether the . . . “evaluator” can handle the following:

· Source (Google Trends, Twitter)
· Query term (Data Science)
· Filter (skills)
· Variants (synonyms, subsets)
· Result (score)
· Result Classification (Rank, Count, Percentage, Mean)
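
One way to picture the checklist above is as a single record per evaluation. The sketch below is an assumption about how such a record might look in R: the function `new_evaluation`, its field names and the scores are all hypothetical and do not come from the project code.

```r
# Hypothetical helper: one row per (source, query variant) evaluation.
# Field names mirror the checklist above.
new_evaluation <- function(source, term, filter, variant_of = NA_character_,
                           score = NA_real_, classification = "Count") {
  allowed <- c("Rank", "Count", "Percentage", "Mean")
  stopifnot(classification %in% allowed)
  data.frame(
    source = source, term = term, filter = filter,
    variant_of = variant_of, score = score,
    classification = classification,
    stringsAsFactors = FALSE
  )
}

# Made-up example rows: the base query plus two variants of it.
evaluations <- rbind(
  new_evaluation("Google Trends", "Data Science", "skills", score = 42),
  new_evaluation("Google Trends", "Data Analytics", "skills",
                 variant_of = "Data Science", score = 17),
  new_evaluation("Twitter", "Big Data", "skills",
                 variant_of = "Data Science", score = 3, classification = "Rank")
)
print(evaluations)
```

Keeping `variant_of` on each row lets synonyms and subsets be rolled up to the query they were derived from.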

Initialize helper function - getLinks
http://stackoverflow.com/questions/25213983/explanation-of-how-this-complex-function-works

getLinks <- function() { 
   links <- character() 
   list(a = function(node, ...) { 
               links <<- c(links, xmlGetAttr(node, "href"))
               node 
            }, 
        links = function()links)
}
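
The helper relies on an R closure: `links` lives in the environment created by each call to `getLinks()`, and `<<-` updates that variable from inside the handler. The sketch below reproduces the same pattern without the XML package so it can be run on its own; `make_collector` and its inputs are illustrative names, not part of the original code.

```r
# A closure with private state, mirroring the structure of getLinks().
make_collector <- function() {
  items <- character()            # private to this collector
  list(
    add = function(x) {
      items <<- c(items, x)       # modify the enclosing environment's `items`
      invisible(x)
    },
    items = function() items      # accessor for the accumulated values
  )
}

collector <- make_collector()
collector$add("/news/subscribe.html")
collector$add("http://twitter.com/kdnuggets")
print(collector$items())

# A second collector has its own, independent state.
other <- make_collector()
print(length(other$items()))
```

In the original helper the element named `a` plays the role of `add`: when the list is passed as the `handlers` argument to an XML package parser such as `htmlTreeParse`, it is expected to be called for each `<a>` node, which is how the `href` values accumulate.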

GOOGLE - search for “data science skills”
Collect result set of child urls from the initial 10

Parse and extract key words from url result set . . .

# aggregate urls 
df_links <- rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10)
#str(df_links)
head(df_links)

##                                                    filterLinks.links..
## 1                                                                    /
## 2                                                 /news/subscribe.html
## 3                                         http://twitter.com/kdnuggets
## 4                                        http://facebook.com/kdnuggets
## 5 http://www.linkedin.com/groups/KDnuggets-Analytics-Data-Mining-54257
## 6                                                            /gps.html

# break urls into  domain and path components
df_url_parsed <- url_parse(as.character(df_links$filterLinks.links))
df_url_parsed <- data.frame(df_url_parsed)[,c(2,4)]
#str(df_url_parsed)
head(df_url_parsed)

##             domain                                         path
## 1                                                              
## 2                                           news/subscribe.html
## 3      twitter.com                                    kdnuggets
## 4     facebook.com                                    kdnuggets
## 5 www.linkedin.com groups/kdnuggets-analytics-data-mining-54257
## 6                                                      gps.html
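
To make the domain/path split concrete, here is a rough base R approximation of what the parsing step produces for the links shown above. It is only a sketch: `split_url` is a hypothetical helper, it ignores ports, query strings and fragments, and it lowercases the URL simply to mirror the printed output.

```r
# Hypothetical helper: split URLs into domain and path without extra packages.
split_url <- function(url) {
  url <- tolower(url)
  scheme <- "^[a-z][a-z0-9+.-]*://"
  has_scheme <- grepl(scheme, url)
  rest <- sub(scheme, "", url)
  # Relative links have no domain; absolute links keep everything before the first "/".
  domain <- ifelse(has_scheme, sub("/.*$", "", rest), "")
  path <- ifelse(has_scheme, sub("^[^/]*/?", "", rest), sub("^/", "", rest))
  data.frame(domain = domain, path = path, stringsAsFactors = FALSE)
}

urls <- c("/",
          "/news/subscribe.html",
          "http://twitter.com/kdnuggets",
          "http://www.linkedin.com/groups/KDnuggets-Analytics-Data-Mining-54257")
parsed <- split_url(urls)
print(parsed)
```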

# split the path into up to 6 segments, last segment first (lvl1 = last segment)
df_url_parsed["lvl1"] <- basename(df_url_parsed$path)
df_url_parsed["lvl2"] <- basename(dirname(df_url_parsed$path))
df_url_parsed["lvl3"] <- basename(dirname(dirname(df_url_parsed$path)))
df_url_parsed["lvl4"] <- basename(dirname(dirname(dirname(df_url_parsed$path))))
df_url_parsed["lvl5"] <- basename(dirname(dirname(dirname(dirname(df_url_parsed$path)))))
df_url_parsed["lvl6"] <- basename(dirname(dirname(dirname(dirname(dirname(df_url_parsed$path))))))
df <- df_url_parsed %>%
        select(lvl1,lvl2,lvl3,lvl4,lvl5,lvl6,domain,path)
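
The nested `dirname()` calls can also be expressed by splitting each path once and reversing the segments. The following is a minimal sketch under that assumption; `path_levels` is a hypothetical helper and the sample paths are taken from the output shown earlier.

```r
# Hypothetical helper: return the last `n` path segments, last segment first,
# padded with "" when the path has fewer than `n` segments.
path_levels <- function(path, n = 6) {
  parts <- strsplit(path, "/", fixed = TRUE)
  out <- t(vapply(parts, function(p) {
    p <- rev(p[nzchar(p)])             # drop empty pieces, last segment first
    c(p, rep("", n))[seq_len(n)]       # pad or truncate to exactly n entries
  }, character(n)))
  colnames(out) <- paste0("lvl", seq_len(n))
  as.data.frame(out, stringsAsFactors = FALSE)
}

paths <- c("", "news/subscribe.html",
           "groups/kdnuggets-analytics-data-mining-54257")
levels_df <- path_levels(paths)
print(levels_df[, 1:3])
```

One difference from the `basename(dirname(...))` approach: a single-segment path such as `"kdnuggets"` yields `"."` for the missing levels there (because `dirname("kdnuggets")` is `"."`), whereas this sketch pads with empty strings.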

Parsed urls!

This is the ugly part!

It would be better done on the database against a dictionary of distinct skills

We ‘the sourcers’ weren’t able to develop an authoritative dictionary of skills that could be cross-referenced at the sourcing stage of development. Such a dictionary was developed by the transformers and, with more time, could perhaps be introduced at the point where data is retrieved. What follows was an initial pass at matching the data by trial and error.
‘Emphasis on error’
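
For comparison, this is roughly what matching against a dictionary of distinct skills could look like. It is a sketch only: the dictionary and the tokens below are invented for illustration, and in practice the dictionary would come from the database.

```r
# Hypothetical dictionary of distinct skills.
skills_dict <- c("python", "r", "sql", "machine learning", "statistics")

# Hypothetical URL tokens, as they might look after parsing.
tokens <- c("python", "subscribe", "machine-learning", "sql", "python", "gps")

# Normalise the tokens the same way the dictionary is written.
normalised <- tolower(gsub("[-_]+", " ", tokens))

# Keep only tokens that are known skills, then count occurrences.
matched <- normalised[normalised %in% skills_dict]
skill_counts <- sort(table(matched), decreasing = TRUE)
print(skill_counts)
```

With a lookup like this, noise such as "subscribe" or "gps" is dropped because it is absent from the dictionary, not because it was listed in a removal pattern.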

Subset result set to match skills

df[,1] <- gsub("\\.com|\\.html", " ", df[,1])
df[,1] <- gsub("\\s{2,}|\\d+|[[:punct:]]", " ", df[,1])
df[,1] <- gsub("edit|help|free|make|more|uwsp|hire|your|must|have|goes|udemy", "", df[,1])
#df[,1] <- gsub(" a | and|the |who |jpg|tips|top |new|job|for | in |is | us | one |if |to |get |tma |bad |mba| ms |aspx|law |fake|ways|spot|at |umuc|id ", "", df[,1])
df[,1] <- gsub("the | job |who ","",df[,1])
df[,1] <- gsub("fix |vs |usa| or |tips|mq |d d|paw|aust|mar|avp| asa |know|to see |much|why  |wine", "", df[,1])
df[,1] <- gsub("index|login|money|knew |every|thing|chair|works|steps|flow |easy|last |says|when|about|match|made| you", "", df[,1])
df[,1] <- gsub("master|signup|skills|advice|summer|burtch|hired|april|tweets|great|good|week|with|sports|learn them| need", "", df[,1])
df[,1] <- gsub("skytree|career|schools|stories|hiring|become|philip|maymsports|academy|dataconomy|guide|does|best", "", df[,1])
df[,1] <- gsub("opinions|revision|subscribe|webinars|require|seven|gartner|flash|section|gbdc|course|lawyers", "", df[,1])
df[,1] <- gsub("kdnuggets|interview|bootcamps|ontotext|required|unicorn|questions|answers|canada|scikit|next big|wind|tutorial|articles", "", df[,1])
df[,1] <- gsub("data science|data scientist|professor|offerings|programs|possible|becoming|bootcamp|datacamp|peachih|weather|experts|expect", "", df[,1])
df[,1] <- gsub("fundamentals|automated|certificates|certificate|fellowship|specialties|disappoints|gainers|losers|s program|influence", "", df[,1])
df[,1] <- gsub("google tensor|gooata science|tensor learn|explained|tutorials|inflection|opening|continuum|medicine|training", "", df[,1])
df[,1] <- gsub("usa|san francisco|niagara falls|alabama|arizona|arkansas|california|colorado|connecticut|florida|georgia|hawaii", "", df[,1])
df[,1] <- gsub("beginners|dataconomy|heaven|certification|srch|ko|htm|how|salary|future|terms of use|driving|announcing|opportunities", "", df[,1])
df[,1] <- gsub("rediscovered|manulife|specialist|tedtalks|advanced|open ch| world|degrees|prosensus|quandl|doctorate|events| from sas horton", "", df[,1])
df[,1] <- gsub("top stories|stackoverflow|dissapoints|stackexchange|interviews|nate silver|startups|cheat sheets|learning both|list|forms", "", df[,1])
df[,1] <- gsub("healthcare|profile|document|message|share|article|archive|contact|contact us|essential of| chesheets","",df[,1])
df[,1] <- gsub(" amsterdam| barcelona| berlin| brussels| budapest| dusseldorf ln| frankfurt| hamburg| munich| paris| stockholm| vienna| istanbul","",df[,1])
df[,1] <- gsub(" tel aviv|london|t necessary learn","",df[,1])
df[,2] <- gsub("\\d+","",df[,2]) 
df[,2] <- gsub("d-id|course|blog|tag|careers|industry|data-science|tracks|data-scientist-skills|specialties|groups|community-brands","",df[,2])
df[,2] <- gsub("multimedia|whitepaper|this-just-in|vendor|application|sector|topic|wiki|-pages|education|big-data-analyst-salary|intent","",df[,2])
df[,2] <- gsub("technology|event|reports|ideas|event|list|formsa-hoic-approach-to-countering-insider-threats|airbnb|answers|apps|class","",df[,2])
df[,2] <- gsub("computer-and-information-|create|cut-your-costs-with-netsupport-dna|daily-life|data|data-scientists-|datanami|ds|legal|nl|ottawa","",df[,2])
df[,2] <- gsub("electrical-engineering-and-computer-science|en|framework|ged|go|gogreen|help|jobs-and--in-|key-tools-for-hybrid-cloud|thesis","",df[,2])
df[,3] <- gsub("\\d+","",df[,3])
df[,3] <- gsub("v|d|questions|pages|jobs|certification|framework|topics|training|courses|articles|whitepapers|webcasts","",df[,3])
df[,3] <- gsub("insight|pin|store|what-are-the-most-aluable-skills-to-learn-for-a-ata-|post|alexaner-lees|kickoff|ooh|resources","",df[,3])
df[,3] <- gsub("watersworks|prouctiity|acaemic","",df[,3])
df[,4] <- gsub("\\d+","",df[,4])
df[,4] <- gsub("data-science-skills-to-boost-your-salary|big-data-brings-big-security-problems|are-you-recruiting-a-data-scientist-or-unicorn","",df[,4])
df[,4] <- gsub("-state-of-database|in-a-fever-for-big-data|where-data-science-meets-it-|windows--inside-nyc-launch-day","",df[,4])
df[,4] <- gsub("windows--inside-nyc-launch-day|the-age-of-biotechnology-has-arrived|-cloud-startups-worth-your-attention","",df[,4])
df[,4] <- gsub("-hot-it-jobs-that-deliver-work-life-balance|files|uploads|training|certification|profile|blogs|images|watersworks|blog","",df[,4])
df[,4] <- gsub("guest|category|go|whitepapers|webcasts","",df[,4])
df[,5] <- gsub("\\d+","",df[,5])
df[,5] <- gsub("big-data-analytics|wp-content|guest|articles|abstract|whitepaper|adtmag|blogs|it-life","",df[,5])
df[,6] <- gsub("\\d+","",df[,6])
df[,6] <- gsub("blog.udacity.com|ecg","",df[,6])


df <- df %>% filter(lvl1 != "")
df <- df %>%
          filter(
            lvl1 != ""  & 
            lvl1 != "s"  &
            lvl1 != "datascientist isnt being inventive" &
            lvl2 != 'schools' & 
            lvl2 != 'certification' & 
            lvl2 != "author" & 
            lvl2 != "+udacity" &
            lvl2 != "about" &
            lvl2 != "course" &
            lvl2 != "news" &       
            lvl2 != "category" &                     
            lvl2 != "q" &                                   
            lvl2 != "a" &                                   
            lvl2 != "company" &                                   
            lvl2 != "profile" &                                   
            lvl2 != "readings" &                                   
            lvl2 != "api" &                                   
            lvl2 != "gampad" &                                   
            lvl2 != "pages" &                                                 
            lvl2 != "jobs" &                
            lvl2 != "software" &                           
            lvl2 != "opinions" &
            lvl2 != "datasets" &     
            lvl2 != "sharer" &        
            lvl2 != "users" &                    
            lvl2 != "s"  &     
            lvl2 != "p"  &              
            lvl2 != "meetings"  &     
            lvl2 != "academic"  &     
            lvl2 != "tutorials"  &     
            lvl2 != "companies"  &     
            lvl2 != "polls"  &                 
            lvl2 != "salaries"  &     
            lvl2 != "webcasts"  &     
            lvl2 != "unanswered"  &     
            lvl2 != "sets"  &     
            lvl2 != "skills"  &     
            lvl2 != "forms"  &                 
            lvl2 != "catery"  &     
            lvl2 != "questions"  &                 
            lvl2 != "www.reddit.com"  &     
            lvl2 != "forms"  &
            lvl2 != "catery"  &     
            lvl2 != "top--reasons-to-ctralize-your-business-communications"  &
            lvl2 != "survey-report:-the-value-of-threat-intelligce-in-protection"  &
            lvl2 != "the-forrester-wave:-digital-experice-platforms,-q-"  &
            lvl2 != "a-hoic-approach-to-countering-insider-threats"  &
            lvl2 != "forms"  &                 
            lvl2 != "question"  &     
            lvl2 != "stackexchange.com"  &
            lvl3 != "sitemap" &  
            lvl3 != "eents" &
            lvl3 != "posts" &
            lvl3 != "certify" &
            lvl3 != "us" &
            lvl3 != "s"  &     
            lvl3 != "mastersinata" &
            lvl3 != "profiles" &
            lvl3 != "users" &       
            lvl3 != "category" &
            lvl3 != "mark-meloon" &
            lvl3 != "lasegas" &
            lvl3 != "sites" &
            lvl3 != "licenses" &
            lvl3 != "what-are-the-most-aluable-skills-to-learn-for-a-ata-" &
            lvl3 != "scientist-now" &
            lvl3 != "ata-scientist-the-sexiest-job-of-the-st-century" &
            lvl4 != "googles-next-hq-modern-with-retro-flairs-" &
            lvl4 != "everything-youve-been-told-about-mobility-is-wrong"&
            lvl6 != "content" & 
            lvl6 != "webcasts" & 
            lvl6 != "strategic-cio"
      )

df[df=="."] <- ""
df[,1] <- gsub("whitepaper|^s |asp|history|like see |easier|adopt|bashos|john|musser|stacver|tweet|perfect pairing","", df[,1])
df[,1] <- gsub("essential of|privacy|sites|adt tech library|mediadata|home|radio|i want be|essential of a|us ","", df[,1])
df[,1] <- gsub("^\\s+|\\s+$", "", df[,1])

df[,1] <- gsub("php|thjust in| spark","", df[,1])
df <- df %>% filter(lvl1 != "")
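
The long chains of `gsub()` alternations and `!=` conditions could be driven by plain character vectors, which are easier to extend and review. The sketch below illustrates that idea on toy data; `noise_words`, `drop_lvl2` and the `toy` data frame are hypothetical, and the use of word boundaries is a design choice that differs from the substring matching used above.

```r
# Hypothetical noise terms and excluded lvl2 values (short samples only).
noise_words <- c("edit", "help", "free", "subscribe", "tutorial")
drop_lvl2   <- c("schools", "certification", "author", "news")

# Build one pattern from the vector; \\b keeps "free" from matching inside "freedom".
noise_pattern <- paste0("\\b(", paste(noise_words, collapse = "|"), ")\\b")

toy <- data.frame(
  lvl1 = c("python tutorial", "free sql help", "freedom", "subscribe"),
  lvl2 = c("blog", "news", "posts", "about"),
  stringsAsFactors = FALSE
)

# Remove noise words, collapse repeated spaces, trim the ends.
cleaned <- gsub(noise_pattern, "", toy$lvl1, perl = TRUE)
toy$lvl1 <- trimws(gsub("\\s+", " ", cleaned))

# Drop rows whose lvl1 is now empty or whose lvl2 is on the exclusion list.
toy <- toy[toy$lvl1 != "" & !(toy$lvl2 %in% drop_lvl2), , drop = FALSE]
print(toy)
```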

Output result file

df_out <-
        df %>%
        select(lvl1)  %>%
        group_by(lvl1) %>%
        summarise(score= n()) %>%
        arrange(desc(score))

write.csv(df_out, file = "google.csv")

Lessons learned

Matching Engine

Step 7 of this data sourcing exercise was unnecessarily laborious and inaccurate. At best, this can be attributed to a discovery process. It became apparent early on, however, that a source-skill lookup would be a more effective way to handle matching, and that the database, which is optimized for this purpose, may be better suited to the matching function than R.

Process Improvement

Ideally, with each pass through the matching cycle the list of skills becomes larger. These skills can be further classified through metadata. Variants on the initial query, such as “data analytics” and “business intelligence”, should be incorporated, and result sets should also be classified, e.g. technical skills, soft skills, skill concentration by location, industry, etc.

The notion is that each cycle of matching and classification improves the ultimate search result.
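
The improvement cycle described above can be illustrated with a small sketch. Everything in it is hypothetical: the `match_cycle` helper, the skill categories and the sample terms are assumptions made for the example, not part of the project.

```r
# Hypothetical skills dictionary with classification metadata.
skills <- data.frame(
  skill    = c("python", "sql", "communication"),
  category = c("technical", "technical", "soft"),
  stringsAsFactors = FALSE
)

# One matching cycle: score known skills and return unknown terms for review.
match_cycle <- function(terms, skills) {
  known  <- terms[terms %in% skills$skill]
  counts <- as.data.frame(table(skill = known), stringsAsFactors = FALSE)
  list(
    scored    = merge(counts, skills, by = "skill"),
    unmatched = setdiff(unique(terms), skills$skill)
  )
}

terms <- c("python", "sql", "python", "hadoop", "communication")
pass1 <- match_cycle(terms, skills)
print(pass1$unmatched)

# After review, a newly confirmed skill is added, so the next pass matches more terms.
skills <- rbind(skills, data.frame(skill = "hadoop", category = "technical",
                                   stringsAsFactors = FALSE))
pass2 <- match_cycle(terms, skills)
print(pass2$scored)
```

Each pass returns the terms it could not match, so reviewing that list is what grows the dictionary from one cycle to the next.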