Load libraries
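The packages below are inferred from the functions used in this section (xmlGetAttr() from XML, url_parse() from urltools, and the dplyr verbs); the original setup chunk is not shown, so treat this load list as an assumption.

library(XML)       # xmlGetAttr(), htmlTreeParse() handlers
library(urltools)  # url_parse()
library(dplyr)     # %>%, select(), filter(), group_by(), summarise(), arrange()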
The objective of this utility is to model a simple approach for searching a web-based data source for a few key words, and to apply that model to “data science skills”. This segment of the utility focuses on sourcing a result set from a Google query and returning it in a usable format to the transformer segment.
Development started with a conversation about how to model the data we collect from matching skills and rating them. “Search”, “source”, “match”, and “rate” were general themes persistent throughout the project.
The issue I want to discuss . . . is “search”. We could query the term “Data Science” with the filter “skills”, get the results from each source, and be done. If we take a more expansive approach, we can include queries with synonyms, such as the term “Data Analytics” with the qualifier “professional skill”, or subsets of a term, such as “Big Data” with the skill “R”.
The idea is to evaluate the sources . . . but to generalize the categorization engine that evaluates . . . them. This may be useful if we get a lot of different types of data to compare.
. . . maybe we should start simpler - what do you think of a subset
. . . look into whether the . . . “evaluator” can handle the following:
· Source (Google Trends, Twitter)
· Query term (Data Science)
· Filter (skills)
· Variants (synonyms, subsets)
· Result (score)
· Result Classification (Rank, Count, Percentage, Mean)
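A minimal sketch of how one evaluator record could be represented in R; the field names follow the list above, and the example values are hypothetical.

query_spec <- data.frame(
  source         = "Google",                  # e.g. Google Trends, Twitter
  query_term     = "Data Science",
  filter         = "skills",
  variant        = "synonym: Data Analytics", # synonyms, subsets
  result         = NA_real_,                  # score, filled in after matching
  classification = "Count"                    # Rank, Count, Percentage, Mean
)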
# handler factory for XML::htmlTreeParse(): the "a" element handler collects
# the href attribute of every anchor it visits, and links() returns the
# accumulated character vector
getLinks <- function() {
  links <- character()
  list(a = function(node, ...) {
         links <<- c(links, xmlGetAttr(node, "href"))
         node
       },
       links = function() links)
}
GOOGLE - search for “data science skills”. *Note: Google obscures its initial result markup, which makes scraping the results page directly unreliable.
Collect a result set of child URLs from the initial result set of 10.
Parse and extract key words from the result set . . .
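The data frames df1 through df10 below are not constructed in this excerpt. A hedged reconstruction of how each one was likely produced, based on the getLinks() handler above and the filterLinks.links.. column name in the output (filterLinks is an inferred helper that does not appear in the source, and the URL is illustrative):

filterLinks <- function(links) links[!is.na(links)]        # hypothetical cleaner
h1 <- getLinks()
htmlTreeParse("http://www.kdnuggets.com/", handlers = h1)  # one of the 10 child URLs
links <- h1$links()
df1 <- data.frame(filterLinks(links))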
# aggregate urls
df_links <- rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10)
#str(df_links)
head(df_links)
## filterLinks.links..
## 1 /
## 2 /news/subscribe.html
## 3 http://twitter.com/kdnuggets
## 4 http://facebook.com/kdnuggets
## 5 http://www.linkedin.com/groups/KDnuggets-Analytics-Data-Mining-54257
## 6 /gps.html
# break urls into domain and path components;
# urltools::url_parse() returns scheme, domain, port, path, parameter, fragment
df_url_parsed <- url_parse(as.character(df_links$filterLinks.links))
df_url_parsed <- data.frame(df_url_parsed)[,c(2,4)]  # keep domain and path
#str(df_url_parsed)
head(df_url_parsed)
## domain path
## 1
## 2 news/subscribe.html
## 3 twitter.com kdnuggets
## 4 facebook.com kdnuggets
## 5 www.linkedin.com groups/kdnuggets-analytics-data-mining-54257
## 6 gps.html
# split the path into up to six elements, deepest segment first:
# e.g. "groups/kdnuggets-analytics-data-mining-54257" ->
#   lvl1 = "kdnuggets-analytics-data-mining-54257", lvl2 = "groups"
p <- df_url_parsed$path
for (i in 1:6) {
  df_url_parsed[paste0("lvl", i)] <- basename(p)
  p <- dirname(p)
}
df <- df_url_parsed %>%
select(lvl1,lvl2,lvl3,lvl4,lvl5,lvl6,domain,path)
We, ‘the sourcers’, weren’t able to develop an authoritative dictionary of skills that could be cross-referenced at the sourcing stage of development. That dictionary was developed by the transformers, and with more time it could be introduced at the point the data is retrieved. What follows was an initial pass at matching the data by trial and error.
‘Emphasis on error’
df[,1] <- gsub("\\.com|\\.html", " ", df[,1])
df[,1] <- gsub("\\s{2,}|\\d+|[[:punct:]]", " ", df[,1])
df[,1] <- gsub("edit|help|free|make|more|uwsp|hire|your|must|have|goes|udemy", "", df[,1])
#df[,1] <- gsub(" a | and|the |who |jpg|tips|top |new|job|for | in |is | us | one |if |to |get |tma |bad |mba| ms |aspx|law |fake|ways|spot|at |umuc|id ", "", df[,1])
df[,1] <- gsub("the | job |who ","",df[,1])
df[,1] <- gsub("fix |vs |usa| or |tips|mq |d d|paw|aust|mar|avp| asa |know|to see |much|why |wine", "", df[,1])
df[,1] <- gsub("index|login|money|knew |every|thing|chair|works|steps|flow |easy|last |says|when|about|match|made| you", "", df[,1])
df[,1] <- gsub("master|signup|skills|advice|summer|burtch|hired|april|tweets|great|good|week|with|sports|learn them| need", "", df[,1])
df[,1] <- gsub("skytree|career|schools|stories|hiring|become|philip|maymsports|academy|dataconomy|guide|does|best", "", df[,1])
df[,1] <- gsub("opinions|revision|subscribe|webinars|require|seven|gartner|flash|section|gbdc|course|lawyers", "", df[,1])
df[,1] <- gsub("kdnuggets|interview|bootcamps|ontotext|required|unicorn|questions|answers|canada|scikit|next big|wind|tutorial|articles", "", df[,1])
df[,1] <- gsub("data science|data scientist|professor|offerings|programs|possible|becoming|bootcamp|datacamp|peachih|weather|experts|expect", "", df[,1])
df[,1] <- gsub("fundamentals|automated|certificates|certificate|fellowship|specialties|disappoints|gainers|losers|s program|influence", "", df[,1])
df[,1] <- gsub("google tensor|gooata science|tensor learn|explained|tutorials|inflection|opening|continuum|medicine|training", "", df[,1])
df[,1] <- gsub("usa|san francisco|niagara falls|alabama|arizona|arkansas|california|colorado|connecticut|florida|georgia|hawaii", "", df[,1])
df[,1] <- gsub("beginners|dataconomy|heaven|certification|srch|ko|htm|how|salary|future|terms of use|driving|announcing|opportunities", "", df[,1])
df[,1] <- gsub("rediscovered|manulife|specialist||tedtalks|advanced|open ch| world|degrees|prosensus|quandl|doctorate|events| from sas horton", "", df[,1])
df[,1] <- gsub("top stories|stackoverflow|dissapoints|stackexchange|interviews|nate silver|startups|cheat sheets|learning both|list|forms", "", df[,1])
df[,1] <- gsub("healthcare|profile|document|message|share|article|archive|contact|contact us|essential of| chesheets","",df[,1])
df[,1] <- gsub(" amsterdam| barcelona| berlin| brussels| budapest| dusseldorf ln| frankfurt| hamburg| munich| paris| stockholm| vienna| istanbul","",df[,1])
df[,1] <- gsub(" tel aviv|london|t necessary learn|","",df[,1])
df[,2] <- gsub("\\d+","",df[,2])
df[,2] <- gsub("d-id|course|blog|tag|careers|industry|data-science|tracks|data-scientist-skills|specialties|groups|community-brands","",df[,2])
df[,2] <- gsub("multimedia|whitepaper|this-just-in|vendor|application|sector|topic|wiki|-pages|education|big-data-analyst-salary|intent","",df[,2])
df[,2] <- gsub("technology|event|reports|ideas|event|list|formsa-hoic-approach-to-countering-insider-threats|airbnb|answers|apps|class","",df[,2])
df[,2] <- gsub("computer-and-information-|create|cut-your-costs-with-netsupport-dna|daily-life|data|data-scientists-|datanami|ds|legal||nl|ottawa","",df[,2])
df[,2] <- gsub("electrical-engineering-and-computer-science|en|framework|ged|go|gogreen|help|jobs-and--in-|key-tools-for-hybrid-cloud|thesis","",df[,2])
df[,3] <- gsub("\\d+","",df[,3])
df[,3] <- gsub("v|d|questions|pages|jobs|certification|framework|topics|training|courses|articles|whitepapers|webcasts","",df[,3])
df[,3] <- gsub("insight|pin|store|what-are-the-most-aluable-skills-to-learn-for-a-ata-|post|alexaner-lees|kickoff|ooh|resources","",df[,3])
df[,3] <- gsub("watersworks|prouctiity|acaemic","",df[,3])
df[,4] <- gsub("\\d+","",df[,4])
df[,4] <- gsub("data-science-skills-to-boost-your-salary|big-data-brings-big-security-problems|are-you-recruiting-a-data-scientist-or-unicorn","",df[,4])
df[,4] <- gsub("-state-of-database|in-a-fever-for-big-data|where-data-science-meets-it-|windows--inside-nyc-launch-day","",df[,4])
df[,4] <- gsub("windows--inside-nyc-launch-day|the-age-of-biotechnology-has-arrived|-cloud-startups-worth-your-attention","",df[,4])
df[,4] <- gsub("-hot-it-jobs-that-deliver-work-life-balance|files|uploads|training|certification|profile|blogs|images|watersworks|blog","",df[,4])
df[,4] <- gsub("guest|category|go|whitepapers|webcasts","",df[,4])
df[,5] <- gsub("\\d+","",df[,5])
df[,5] <- gsub("big-data-analytics|wp-content|guest|articles|abstract|whitepaper|adtmag|blogs|it-life","",df[,5])
df[,6] <- gsub("\\d+","",df[,6])
df[,6] <- gsub("blog.udacity.com|ecg","",df[,6])
# drop rows whose path elements are empty or match known non-skill values
df <- df %>%
  filter(
    !lvl1 %in% c("", "s", "datascientist isnt being inventive"),
    !lvl2 %in% c("schools", "certification", "author", "+udacity", "about",
                 "course", "news", "category", "q", "a", "company", "profile",
                 "readings", "api", "gampad", "pages", "jobs", "software",
                 "opinions", "datasets", "sharer", "users", "s", "p",
                 "meetings", "academic", "tutorials", "companies", "polls",
                 "salaries", "webcasts", "unanswered", "sets", "skills",
                 "forms", "catery", "questions", "www.reddit.com",
                 "top--reasons-to-ctralize-your-business-communications",
                 "survey-report:-the-value-of-threat-intelligce-in-protection",
                 "the-forrester-wave:-digital-experice-platforms,-q-",
                 "a-hoic-approach-to-countering-insider-threats",
                 "question", "stackexchange.com"),
    !lvl3 %in% c("sitemap", "eents", "posts", "certify", "us", "s",
                 "mastersinata", "profiles", "users", "category",
                 "mark-meloon", "lasegas", "sites", "licenses",
                 "what-are-the-most-aluable-skills-to-learn-for-a-ata-",
                 "scientist-now",
                 "ata-scientist-the-sexiest-job-of-the-st-century"),
    !lvl4 %in% c("googles-next-hq-modern-with-retro-flairs-",
                 "everything-youve-been-told-about-mobility-is-wrong"),
    !lvl6 %in% c("content", "webcasts", "strategic-cio")
  )
df[df=="."] <- ""
df[,1] <- gsub("whitepaper|^s |asp|history|like see |easier|adopt|bashos|john|musser|stacver|tweet|perfect pairing","", df[,1])
df[,1] <- gsub("essential of|privacy|sites|adt tech library|mediadata||home|radio|i want be|essential of a|us ","", df[,1])
df[,1] <- gsub("^\\s+|\\s+$", "", df[,1])
df[,1] <- gsub("php|thjust in| spark","", df[,1])
df <- df %>% filter(lvl1 != "")
# score each remaining term by its frequency across the result set
df_out <-
  df %>%
  select(lvl1) %>%
  group_by(lvl1) %>%
  summarise(score = n()) %>%
  arrange(desc(score))
write.csv(df_out, file = "google.csv")
Step 7 of this data sourcing exercise was unnecessarily laborious and inaccurate. At best, this can be attributed to a discovery process. It became apparent early on, however, that a source-skill lookup would be a more effective way to handle matching, and that a database optimized for this purpose may be better suited to the matching function than R.
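A minimal sketch of the lookup approach described above; the skills vector is a hypothetical stand-in for the dictionary the transformers developed, which does not appear in this section.

# hypothetical skills dictionary; the real one was built by the transformers
skills <- c("r", "python", "sql", "statistics", "machine learning")
# count how many cleaned path elements mention each dictionary entry
skill_counts <- sapply(skills, function(s) sum(grepl(paste0("\\b", s, "\\b"), df$lvl1)))
sort(skill_counts, decreasing = TRUE)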
Ideally, with each pass through the matching cycle the list of skills improves. The skills can be further classified through metadata. Variants on the initial query, such as “data analytics” and “business intelligence”, should be incorporated, and the result sets should also be classified, i.e. technical skills, soft skills, skill concentration by location, industry, etc.
The notion is that each cycle of matching and classification improves the ultimate search result.
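Such a classified reference might look like the sketch below; the columns and values are illustrative assumptions, not the project’s actual dictionary.

# hypothetical classified skills reference (all values illustrative)
skills_ref <- data.frame(
  skill   = c("r", "python", "communication"),
  class   = c("technical", "technical", "soft"),
  variant = c("data science", "data analytics", "business intelligence")
)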