Introduction
In this project we worked as a team to gather text data to address the question, “Which are the most valued data science skills?” Our approach involved scraping data from two very different sources: the job-listing site Indeed and the content-aggregation site Reddit. Our motivation was to capture two distinct perspectives on valued data science skills: one from the job market and another from relevant conversations within a subset of the data science community.
Here we describe the methods used to scrape text data from the two sources and to process and clean the data, discuss our analysis approaches, and present our findings. We pursue two analysis directions:
- Unsupervised Natural Language Processing (NLP) Analysis: This approach makes no assumptions about which data science skills to extract from the raw text data. We use the NLP library udpipe to apply a model of the English language to the text scraped from Indeed & Reddit and identify the data science skills that occur with the highest frequency.
- Supervised Word & Phrase Frequency Analysis: Here we scrape a new set of text from Indeed and search it for known data science skill terms.
Table of Contents
R Environment
- Libraries used
Unsupervised Natural Language Processing (NLP) Analysis
- Scraping Relevant Reddit Comments
- Query Reddit for relevant comments with get_reddit()
- Query Reddit for relevant URLs with reddit_urls() and accessing comments with reddit_content()
- Combining comments & removing duplicate cases
- Scraping Indeed Text Data
- Comment cleaning with the text mining library tm
- Cleaning Reddit data
- Cleaning Indeed data
- Natural Language Processing with the udpipe library
- Processing Reddit data
- Processing Indeed data
- Centralizing datasets in a Relational Database
- Joint Analysis of the Normalized Data
Supervised Word & Phrase Frequency Analysis
- Approach to Scraping Indeed Job Listings
- Putting all the Pieces Together to Build our Scraper
- Analyzing frequencies of Data Science Skills words and phrases
In Closing
- Conclusions
- Future Directions
Libraries used
These libraries are used at various steps in this file. The use of prominent libraries is highlighted throughout the text.
library( RedditExtractoR )
library( dplyr )
library( tidyr )
library( tm )
library( SnowballC )
library( wordcloud )
library( RColorBrewer )
library( udpipe )
library( reshape2 )
library( ggplot2 )
library( lattice )
library( rvest )
library( tidyverse )
library( RMySQL )
library( xml2 )
library( stringi )
library( ggpubr )
Unsupervised Natural Language Processing (NLP) Analysis
Scraping Relevant Reddit Comments with RedditExtractoR
Reddit is a popular content aggregation site with different boards, or ‘subreddits’, each dedicated (and strictly moderated) to a specific topic. This section describes the methods used to scrape Reddit comments relevant to data science skills from the subreddit r/datascience.
RedditExtractoR is an R library with tools designed specifically for extracting unstructured data from Reddit. Its functions were used to extract comment text data from relevant threads.
Goal: Query Reddit for threads relevant to ‘data science skills’, then collect and mine the text for insights, chiefly: “What are the most valued data science skills?”
Query Reddit for relevant comments with get_reddit()
The get_reddit() function was used in multiple queries within the subreddit r/datascience to find relevant thread & comment results for such terms as ‘data science skills’, ‘data science tools’, ‘learning data science’, ‘data tools’, etc.
An example get_reddit() query:
closeAllConnections()
URLs <- get_reddit(
search_terms = "data science skills",
cn_threshold = 5,
subreddit = 'datascience'
)
The result of a query is a data.frame with 18 features, one of them being the comments from relevant threads.
Multiple queries were performed and the resulting data.frames were combined with the rbind() function and exported as a .csv file. The file was uploaded to B. Cooper’s GitHub and can be accessed here:
rTexts <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/allTexts.csv'
rTexts_df <- read.csv( rTexts, stringsAsFactors = F)
dim( rTexts_df )
## [1] 3148 19
Query Reddit for relevant URLs with reddit_urls() and accessing comments with reddit_content()
Comments were also collected with other RedditExtractoR methods. Thread URLs were collected using reddit_urls() with the same search criteria that were used with get_reddit().
The data.frames from multiple reddit_urls() queries were collated with rbind(). Next, a for loop was used to scrape the comments from each URL with the RedditExtractoR function reddit_content():
Here is an example query with reddit_urls():
closeAllConnections()
URLs <- reddit_urls(
search_terms = "data science skills",
cn_threshold = 1,
subreddit = 'datascience'
)
numComments <- sum( rURLs_df$num_comments )
allComments <- data.frame( matrix( 0, nrow = numComments, ncol = 1 ) )
numURLs <- length( rURLs_df$URL )
IDX <- 1
Secs <- 3
for (aURL in seq(1,numURLs)){
urlContent <- reddit_content( rURLs_df$URL[aURL], wait_time = 2 )
Sys.sleep(Secs)
closeAllConnections()
gc()
numComments_thisURL <- length( urlContent$comment )
print( numComments_thisURL )
if (numComments_thisURL>0){
allComments[ IDX:(IDX + numComments_thisURL -1),] <-
urlContent$comment
}
IDX <- IDX + numComments_thisURL
print(IDX)
}
The comments that resulted from this approach were exported as a .csv and uploaded to GitHub to be accessed here:
url1 <- 'https://raw.githubusercontent.com/SmilodonCub/DATA607/master/redditComments.csv'
moreComments_df <- read.csv( url1, stringsAsFactors = F) %>%
select(-X ) %>%
rename( comment = matrix.0..nrow...numComments..ncol...1.)
dim( moreComments_df )
## [1] 2386 1
This supplementary .R script was used for the analysis and is available on B. Cooper’s GitHub.
Combining Comments & Removing Duplicate Cases
Multiple queries with similar search terms predictably yielded overlapping results. Therefore, we create a new data.frame with just the ‘comment’ feature and remove the duplicate rows.
#combine comments from different redditExtractoR methods
allComments_df <- rTexts_df %>% select( comment )
allComments_df <- rbind( allComments_df, moreComments_df)
dim( allComments_df )
## [1] 5534 1
## [1] 4726 1
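The first dimension above is the combined comment count; the second is the count after duplicates were removed. The deduplication code does not appear in this rendering; a minimal sketch of that step, assuming base R duplicated():
#drop duplicate comments collected by the overlapping queries
allComments_df <- allComments_df[ !duplicated( allComments_df$comment ), ]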
Scraping Indeed Text Data
A Python script that utilizes the selenium and BeautifulSoup packages was used to collect a list of Indeed job ad links for Data Science jobs.
The following script accesses the output of the Python scrape and formats the text into a single string for downstream processing:
# read in the links that R will scrape from a csv and create column names
URL <- 'https://raw.githubusercontent.com/dmoste/DATA607/master/Project%203/data_science_links.csv'
links <- read.csv(URL, header = FALSE)
names(links) <- c("Link")
links$Link <- as.character(links$Link)
# scrape the lists from each link and add the text to a single string (textList)
textList <- c()
for(i in 1:length(links$Link)){
h_text <- read_html(links[i,]) %>%
html_nodes("li") %>%
html_text()
textList <- rbind(textList, h_text)
}
Natural Language Processing with the udpipe library
Here we use the udpipe library to apply some basic Natural Language Processing to the text. We tag each word with its part of speech and select the words that are nouns, under the assumption that the frequency of occurrence of nouns will be most informative about the data science skills in the text.
We start by downloading, and loading into the R environment, an English language model for udpipe to apply to our text:
model <- udpipe_download_model(language = "english")
udmodel_english <- udpipe_load_model(file = model$file_model)
Processing Reddit data
Now we apply udpipe’s English language model to the Reddit text data. The udpipe_annotate() function will process each word and associate several features with it. For instance, it will tag each word with its most likely part of speech (e.g. noun, verb, etc.).
reddit1000_processedWords <- udpipe_annotate(udmodel_english,
reddit1000_words$word )
reddit1000_NLP <- data.frame(reddit1000_processedWords)
head( reddit1000_NLP )
## doc_id paragraph_id sentence_id sentence token_id token lemma upos xpos
## 1 doc1 1 1 data 1 data data NOUN NN
## 2 doc2 1 1 can 1 can can AUX MD
## 3 doc3 1 1 like 1 like like INTJ UH
## 4 doc4 1 1 get 1 get get VERB VB
## 5 doc5 1 1 just 1 just just ADV RB
## 6 doc6 1 1 work 1 work work NOUN NN
## feats head_token_id dep_rel deps misc
## 1 Number=Sing 0 root <NA> SpacesAfter=\\n
## 2 VerbForm=Fin 0 root <NA> SpacesAfter=\\n
## 3 <NA> 0 root <NA> SpacesAfter=\\n
## 4 Mood=Imp|VerbForm=Fin 0 root <NA> SpacesAfter=\\n
## 5 <NA> 0 root <NA> SpacesAfter=\\n
## 6 Number=Sing 0 root <NA> SpacesAfter=\\n
Now that the words have been annotated, we can subset the data for the nouns, with the assumption that nouns will be more informative about data science skills.
#remove duplicated word entries (for ambiguous text)
reddit1000_NLP <- reddit1000_NLP[ !duplicated( reddit1000_NLP$doc_id ), ]
#merge two dataframes
reddit1000_NLP$value <- reddit1000_words$value
#Most occurring nouns
nounsReddit <- subset(reddit1000_NLP, upos %in% c("NOUN"))
#to check if i'm missing anything interesting:
#verbs <- subset(top1000_NLP, upos %in% c("VERB"))
#adjs <- subset(top1000_NLP, upos %in% c("ADJ"))
nounsReddit <- nounsReddit %>% group_by( lemma ) %>%
summarise( value = sum( value )) %>%
arrange( desc( value ) )
nounsReddit$lemma <- factor(nounsReddit$lemma,
levels = rev(nounsReddit$lemma))
fig <- ggplot(head(nounsReddit,15), aes(x=lemma, y=value)) +
geom_bar(stat="identity") +
xlab("Word") +
ylab("Count") +
coord_flip()
print(fig)
Present the results as a word cloud:
set.seed(36) #be sure to set the seed if you want to reproduce the same again
wordcloud(words=nounsReddit$lemma, freq=nounsReddit$value, scale=c(3,.5), max.words = 360, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
Processing Indeed data
We now apply the same procedures to the equivalent data.frame of Indeed text data.
indeed1000_processedWords <- udpipe_annotate(udmodel_english,
indeed1000_words$word )
indeed1000_NLP <- data.frame(indeed1000_processedWords)
#head( indeed1000_NLP )
#remove duplicated word entries (for ambiguous text)
indeed1000_NLP <- indeed1000_NLP[ !duplicated( indeed1000_NLP$doc_id ), ]
#merge two dataframes
indeed1000_NLP$value <- indeed1000_words$value
#Most occurring nouns
nounsIndeed <- subset(indeed1000_NLP, upos %in% c("NOUN"))
#to check if i'm missing anything interesting:
#verbs <- subset(top1000_NLP, upos %in% c("VERB"))
#adjs <- subset(top1000_NLP, upos %in% c("ADJ"))
nounsIndeed <- nounsIndeed %>% group_by( lemma ) %>%
summarise( value = sum( value )) %>%
arrange( desc( value ) )
nounsIndeed$lemma <- factor(nounsIndeed$lemma,
levels = rev(nounsIndeed$lemma))
#barchart(sentence ~ value, data = head(nouns, 20), col = "cadetblue",
#main = "Most occurring nouns", xlab = "Freq")
fig <- ggplot(head(nounsIndeed,15), aes(x=lemma, y=value)) +
geom_bar(stat="identity") +
xlab("Word") +
ylab("Count") +
coord_flip()
print(fig)
set.seed(36) #be sure to set the seed if you want to reproduce the same again
wordcloud(words=nounsIndeed$lemma, freq=nounsIndeed$value, scale=c(3,.5), max.words = 300, random.order=FALSE, rot.per=0.35, colors=brewer.pal(8, "Dark2"))
As an initial finding, there are some remarkable similarities between the words in the two data sets. In both of the bar charts we can see several words in common, and the word clouds share even more. However, we need to compare the datasets directly to explore the similarities further.
Centralizing Datasets in a Relational Database
To facilitate the joint analysis of the Indeed and Reddit text data, both data sets were stored in a MySQL relational database. The SQL script that generates the tables can be accessed here.
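The linked SQL script is not reproduced here. As an illustrative sketch only (not the original script): from R, the annotated word counts could be written to the database with the DBI/RMySQL interface. The table names indeed1 and reddit match the queries in the next section, con is a connection like the one opened below, and redditWordCounts / indeedWordCounts are hypothetical names for data.frames holding the token, lemma, upos and count columns:
#hypothetical sketch: persist the word-count tables to MySQL
dbWriteTable( con, name = 'reddit', value = redditWordCounts, row.names = FALSE, overwrite = TRUE )
dbWriteTable( con, name = 'indeed1', value = indeedWordCounts, row.names = FALSE, overwrite = TRUE )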
Joint Analysis of the Normalized Data
The datasets have been stored in separate tables in a MySQL database. We will now query the data and perform an inner join of the two tables.
To start, we establish a connection to the MySQL database:
## <MySQLConnection:0,0>
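The code that opens the connection does not appear in the rendering; only the printed connection object is shown above. A minimal sketch of how such a connection is typically opened with RMySQL, using hypothetical credentials:
#hypothetical credentials; substitute your own MySQL user, password, host and schema
con <- dbConnect( MySQL(), user = 'project3', password = 'xxxxxxxx',
                  dbname = 'project3', host = 'localhost' )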
Now we prepare the data so that we can make direct comparisons between the two datasets. We do this by querying the SQL tables through our connection to the MySQL database, processing the data, combining the data with an inner join, and tidying the result to facilitate visualization:
#query the indeed1 and reddit MySQL datatables with a SELECT statement such that all of the contents are selected and cast as R data.frames
sql <- "SELECT * FROM indeed1"
indeed1 <- dbGetQuery(con, sql)
sql <- "SELECT * FROM reddit"
reddit <- dbGetQuery(con, sql)
#Filter by part of speech for words that are nouns
indeed_data <- indeed1 %>%
filter(upos == "NOUN")
reddit_data <- reddit %>%
filter(upos == "NOUN")
#Normalize the results to facilitate a direct comparison of the data.
indeed_data$Indeed <- (indeed_data$count)/sum(indeed_data$count)
reddit_data$Reddit <- (reddit_data$count)/sum(reddit_data$count)
#Make a new data.frame that is the result of an inner join of the indeed and reddit data. This will hold all the nouns that the two data.frames have in common.
#Then, process the joined data set to filter out low incidence cases.
#tidy the data with the pivot_longer() function to facilitate downstream analysis & visualization.
data_compare <- indeed_data %>%
inner_join(reddit_data, by = "token") %>% #join common elements
filter(count.x >100) %>% #filter out low incidence cases
filter(count.y >100) %>%
select(-lemma.x, -lemma.y, -upos.x, -upos.y) %>%
pivot_longer( #tidy the data to a long format
c("Reddit", "Indeed"),
names_to = "Database",
values_to = "Count"
)
Visualize the joint dataset:
ggplot(data_compare, aes(x = reorder(token, Count), y = Count,
fill = Database)) +
geom_bar(position = "dodge", stat = "identity") +
scale_fill_brewer(palette = "Dark2") +
labs(x = "Word",
y = "Usage",
fill = "Site") +
coord_flip()
The figure above directly compares the normalized occurrences of the nouns shared between the Indeed and Reddit data. Next, we plot and quantify the correlation of these commonly shared words.
data_compare <- data_compare %>% pivot_wider( names_from = Database, values_from = Count )
cor( data_compare$Reddit, data_compare$Indeed)
## [1] 0.8718501
ggscatter(data_compare, x= "Reddit", y= "Indeed",
add = "reg.line", cor.coef = TRUE, conf.int = TRUE)Our key finding here: we observe several nouns that, using our domain knowledge, we identify as data science skills: Experience, python, business, analysis, degree and machine (learning). Our data indicate that these skills examples are represented the most when input is considered from both the job market perspective (Indeed) and the context of the data science community (Reddit). We also observe a strong correlation of occurences in text for the words that are shared in common between the two datasets.
Supervised Word & Phrase Frequency Analysis
Approach to Scraping Indeed Job Listings
Here we perform a separate scrape of Indeed to build a text dataset independent of the unsupervised dataset. The objective is to create a data frame that includes the job titles and the job descriptions from Indeed job listings.
After searching for a data scientist job in a specific location (in this case New York, NY), we copied the link address and stored the URL in a variable called url. Then we used the xml2 package and the read_html() function to parse the page. In short, this means that the function reads in the code from the webpage and breaks it down into its different HTML elements (headings, paragraphs, list items, etc.) for you to analyse.
url <- "https://www.indeed.com/jobs?as_and=data+scientist&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&as_src=&salary=&radius=25&l=New+York%2C+NY&fromage=any&limit=50&sort=&psf=advsrch&from=advancedsearch"
page <- read_html(url)
The Job Titles
By inspecting the code of the Indeed website with the browser's Inspect Element tool, we see that the job title is located under the anchor tag. If we look further, we can also see that it is located under the jobtitle CSS selector.
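As a short illustration (using the same selector that appears in the full scraper below), the job titles can be pulled from the parsed page with rvest:
#extract the job titles from the parsed search-results page
JobTitle <- page %>%
  rvest::html_nodes('[data-tn-element="jobTitle"]') %>%
  rvest::html_attr("title")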
Job Descriptions
You’ll notice that, on the current page, there is just a short summary of each job. However, we want to get the full description: how many years of experience are needed, what skill set is required, and what responsibilities the job entails.
We start by collecting the links on the page. After that we can locate where the job description sits in the document. After inspecting a full listing we noticed that the job description is in an element with the class attribute value .jobsearch-JobComponent-description. We also need our scraper to handle multiple pages of results: since the only thing that changes in the URL when moving from one page to the next is the result offset, we can scrape multiple pages by incrementing this number.
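For example, a sketch of the URL pattern (with the search URL abbreviated here for readability): moving to the next page of 50 results only requires appending a larger start offset.
#hypothetical abbreviation of the full search URL used in the scraper below
base_url  <- "https://www.indeed.com/jobs?as_and=data+scientist&l=New+York%2C+NY&limit=50"
page2_url <- paste0( base_url, "&start=", 50 ) #results 51-100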
Putting all the Pieces Together to Build our Scraper
# We changed the number of results per page from 10 to 50 results per page
first_page <- 50 # first page of result
last_page <- 500 # last page of results
results <- seq(from = first_page, to = last_page, by = 50)
full_df <- data.frame()
for(i in seq_along(results)) {
first_page_url <- "https://www.indeed.com/jobs?as_and=data+scientist&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&as_src=&salary=&radius=25&l=New+York%2C+NY&fromage=any&limit=50&sort=&psf=advsrch&from=advancedsearch"
url <- paste0(first_page_url, "&start=", results[i])
page <- xml2::read_html(url)
Sys.sleep(3) # to avoid error messages such as "Error in open.connection..."
##Job Title
JobTitle <- page %>%
rvest::html_nodes('[data-tn-element="jobTitle"]') %>%
rvest::html_attr("title")
## Job Link
links <- page %>%
rvest::html_nodes('[data-tn-element="jobTitle"]') %>%
rvest::html_attr("href")
## Job Description
job_description <- c()
for(j in seq_along(links)) {
url <- paste0("https://www.indeed.com/", links[j])
page <- xml2::read_html(url)
job_description[[j]] <- page %>%
html_nodes('.jobsearch-JobComponent-description') %>%
html_text() %>%
stri_trim_both()
}
#collate this page of results with the pages scraped so far
df <- data.frame(JobTitle, job_description)
full_df <- rbind(full_df, df) %>%
mutate_at(vars(JobTitle, job_description), as.character)
}
#full_df_count <- str_count(full_df$job_description, "SQL" )
head(full_df)
## JobTitle
## 1 Senior Data Scientist (Remote-friendly)
## 2 Data Science Specialist USA
## 3 Senior Data Scientist
## 4 Senior Data Analyst
## 5 Media Science Big Data Analyst
## 6 Head of Data Science
## job_description
## 1 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n At Noom, we use scientifically proven methods to help our users create healthier lifestyles, and manage important conditions like Type-II Diabetes, Obesity, and Hypertension. Our Engineering team is at the forefront of this challenge, solving complex technical and UX problems on our mobile apps that center around habits, behavior, and lifestyle.We are looking for a Data Scientist to join our Data team and help us ensure that we apply the best approaches to data analysis and research, artificial intelligence, and machine learning.What You’ll Like About Us: We work on problems that affect the lives of real people. Our users depend on us to make positive changes to their health and their lives.We base our work on scientifically-proven, peer-reviewed methodologies that are designed by medical professionals.We are a data-driven company through and through.We’re a respectful, diverse, and dynamic environment in which Engineering is a first-class citizen, and where you’ll be able to work on a variety of interesting problems that affect the lives of real people.We offer a generous budget for personal development expenses like training courses, conferences, and books.You’ll get three weeks’ paid vacation and a flexible work policy that is remote- and family-friendly (about 50% of our engineering team is fully remote). We worry about results, not time spent in seats.Delicious (and nutritious) daily lunches and snacks prepared by Sam, our NYC office on-site chef.What We’ll Like About You: You have 4+ years of experience as a Data Scientist in a similarly-sized organization, with a proven record of analysis and research that positively impacts your team.You have a superior knowledge of statistical analysis methods, such as input selection, logistic and standard regression, random forests, etc.You have extensive experience with pandas, numpy, and sklearn. Experience with deep learning frameworks (TensorFlow, Keras, PyTorch, or similar) is a plusYou are capable of working with engineers to build an actual production system that uses machine learning and artificial intelligence. We don’t expect you to write production-quality code, but you should have some programming experience.You are comfortable with at least “medium data” technologies and how to transcend the “memory bound” nature of most analytics tools.You possess excellent SQL/relational algebra skills, ideally with at least a basic knowledge of how different types of databases (e.g.: column vs row storage) work.You possess excellent communication skills and the ability to clearly communicate technical concepts to a non-technical audience.Job Type: Full-timeBenefits:Health insuranceDental insuranceVision insuranceRetirement planPaid time offRelocation assistance\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
## 2 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n Note: This position is accepting applicants for- New York and Texas.\nSkills :\nExcellent knowledge of NLP and ML alogrithms.\nExcellent understanding of machine learning techniques and algorithms, such as k-NN, Naive Bayes, SVM, Decision Forests, etc\nExperience with common data science toolkits, such as R, Weka, NumPy, MatLab, etc {{depending on specific project requirements}}. Excellence in at least one of these is highly desirable\nGreat communication skills\nExperience with data visualization tools, such as D3.js, GGplot, etc.\nProficiency in using query languages such as SQL, Hive, Pig {{actual list depends on what you are currently using in your company}}\nExperience with NoSQL databases, such as MongoDB, Cassandra, HBase {{depending on project needs}}\nGood applied statistics skills, such as distributions, statistical testing, regression, etc.\nQualification:\nTECH (Computer Science Engineering/ IT)/ MCA\nWhat’s in for you?\nAt Mphasis, we promise you the perfect opportunity of building technical excellence, understand business performance and nuances, be abreast with the latest happenings in technology world and enjoy a satisfying work life balance.\nWith the current opportunity, you will get to work with the team that has consistently been setting benchmarks for other deliveries in terms of delivery high CSATs, project completion on time and being one of the best teams to work for in the organization.\nYou get an open and transparent culture along with freedom to experimentation and innovation\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
## 3 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n How the Position WorksThe Data Scientist reports to the VP of Data Science and Analytics and is responsible for gathering data, conducting analysis, building predictive algorithms and communicating findings to drive profitable growth and performance across Tranzact.The Data Scientist must have a strong grasp on the data structure, business needs, and statistical and predictive modeling.They must be comfortable gathering, manipulating, and utilizing large sets of data to solve business problems.Prior work experience in creating predictive models to be utilized in automating operations is a plus.The specific responsibilities of the Sr. Data Scientist include;Gathering data from various parts of the organization and third parties / online resourcesProcessing, cleaning, and verifying data integrity for analysisSelecting features, building and optimizing prediction models (classifications and regressions)Completing ad-hoc analysis and presenting results clearlyConceptualizing how prediction models can be utilized in an automated systemWhat you need to be successfulBachelor’s Degree in a quantitative field (Computer Science, Engineering, Mathematics, Statistics, etc.)Minimum 5 years of experience building predictive algorithms in industryExcellent understanding of machine learning techniques and algorithms such as decision forests, logistic regression, k-means clustering, etc.Ability to communicate findings clearlyFluency with R or Python for statistical modeling and machine learningProficiency with a query language such as SQLExperience with data visualization tools (Tableau, Power BI, etc.)Experience with big data platform (i.e. Hadoop)Expert in Microsoft ExcelCuriosity and ability to be objectiveStrong organizational skills including time and project managementThrives in a fast-paced environment that is constantly changingWhat we would LOVE to see!Masters or PhD in a quantitative field (Computer Science, Engineering, Mathematics, Statistics, etc.)Finance, Insurance, Call center, Marketing or Sales industry experienceExperience developing predictive models for business and implementing them in an automated fashionExperience working with unbalanced datasetsGood programming skills (C#, JavaScript, etc.)Experience managing or mentoring other data scientist is a plusJob Type: Full-timeSalary: $90,000.00 to $125,000.00 /yearExperience:building predictive algorithms: 3 years (Required)Work authorization:United States (Required)\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
## 4 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n Generate EmblemHealth Provider File reports and data analysis for a variety of internal and external customers. Create basic Provider dashboards pertaining to Product participation by service areas. Convert old processes in Access into SAS code, including streamlining network adequacy by specialty and product and product print directory data extractions. Transform current programs to fit new data infrastructure and ensure that data is accurately captured through parallel testing QA protocols. Generate ad-hoc provider reports that meet the various business needs of the requestors, which can include disruption, network adequacy, provider/facility termination and regulatory reports.Responsibilities: Program old databases to interface with current Provider information systems.Rewrite queries, currently written in Microsoft Access, and translate them into efficient SAS code.Run Quality Assurance protocols to ensure that the same information is translated into SAS code.Work with/mentor other team members to design and develop all standard and ad-hoc reports and analyses within the Provider Reporting team.Design dashboards & visualizations that demonstrate Provider Network participation by product line and service areas.Generate ad-hoc data reports for various internal and external stakeholders (includes meetings to determine user purpose and needs; ensure specifications are met, and to ensure that timelines and expectations are reasonable.Compile regulatory reports for network adequacy to the New York State Department of Health (NYS DOH).Produce disruption reports and network/product figures to sales and contractors for recruitment and request for proposal (RFP) purposes.Automate manual recurring data processes utilizing SAS.Additional duties as assigned.Qualifications: Bachelor’s degreeSAS Base Certification preferred4 – 6 years of related professional work experience required3+ years’ work experience in SAS programming requiredAdditional years of experience/training (i.e., SAS programming) may be considered in lieu of educational requirements requiredExcellent analytical, problem-solving, organizational, and project- and time- management skills requiredProven track record of successfully completing deliverables and projects within specified timeframes requiredProven ability to identify problems/bottlenecks and to provide recommendations to resolve them requiredAn intellectual curiosity and willingness to take initiative on various longstanding projects requiredExceptional interpersonal skills and written communication skills to frequently interact with all levels of the organization requiredProficiency with Microsoft Access query design and Excel functions requiredExperience working in a team environment, exercising independent thinking and achieving goals and deadlines requiredFamiliarity with healthcare operations data, (e.g., CPF or FACETS), preferredJob Type: Full-timeAdditional Compensation:Other formsWork Location:One locationBenefits:Health insuranceDental insuranceVision insuranceRetirement planPaid time off\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = 
window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
## 5 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n Position Summary:\nOur Team: Media Strategy and Analytics\nThe Media Strategy & Analytics team (MSA) acts as an internal media team that handles the strategy, execution, and reporting of all paid & owned media vehicles with the goal of driving viewership across TV, digital and direct-to-consumer (DTC) products for all of Discovery, Inc’s 16+ networks (Discovery, HGTV, Food Network, TLC, etc.) and direct-to-consumer products (MotorTrend, GOLFTV, Dplay, etc.).\nMedia Science is a growing discipline in the media team. The business encompasses both paid media for TV and direct-to-consumer business and owned assets, which have seen exponential growth in data collection and management. Join a data-driven team that is building out a Media Science discipline incorporating big data sets, automation and visualizations to drive media efficiencies and effectiveness.\nThe Opportunity\nAs TV steers into the world of DTC and streaming, the Media Science Big Data Analyst will drill into the cross-platform evolution of the business. Sitting at the intersection of media and big data, the role will query/mine data across traditional linear, TVE and DTC viewing behaviors and conversions to improve media efficacy and drive revenue generating outcomes.\nThe role will help buildout the internal reporting infrastructure for viewership data and support advanced business initiatives including real-time optimization, media mix modeling and multi-touch methodologies. Join a media team that is building out smart tv advanced analytics in-house. If you possess a combination of both data engineering and research/analytical skills to tackle big data, love TV and thrive on asking "what if ?", come join us.\nResponsibilities:\n1. Query and analyze raw data including smart tv data, campaign exposure data and other datasets (e.g. digital and app measurement partners) to drive media business requirements and outcomes\n2. Analyze ecosystem metrics across linear, TVE and DTC, producing automated dashboards for media partners while identifying larger consumer shifts and business opportunities\n3. Interpret paid and owned media performance across platforms and channels to define audience opportunities, media mix model strategies and multi-touch methodologies\n\n4. Utilize big datasets to analyze and extract deep learnings to guide media strategy and optimization\n\n5. Develop automated data visualizations (e.g. Tableau, Gsheet) to explain/display complex datasets empowering business stakeholders with context\n\n6. Collaborate with other analysts to build out and maintain dashboards\n7. Work with data science to test supervised and unsupervised learning models and build algorithmic solutions\n\n8. Work with data engineering to consolidate and analyze structured and unstructured, diverse “big data” sources\n9. Write queries and code to extract relevant data from various platforms\n10. Help ensure that data is accurate and consistent throughout the organization\n11. Scope and manage ad-hoc analytics projects driven by the needs of the company\nRequirements:Bachelor’s degree in a subject such as Applied Mathematics, Statistics, Engineering, Computer Science, etc. 
Master’s Degree preferred2+ years of relevant work-related experience working with large datasetsExperience with TV viewing data. Media Industry experience and/or ad tech would be a plusStrong SQL experience, ability to perform effective querying and analysis of data from multiple sourcesWorking knowledge of AWS Big Data Tools: S3, Red Shift, AthenaWorking knowledge of Python, R, SQLExperience doing data extraction, transformation, and reporting\nExperience with source control system, ideally GithubUnderstanding of Predictive AnalyticsMachine learning experience is a nice to haveHighly proficient mathematics, including probabilityVisualization experience (Tableau or other tools)Ability to prioritize tasks and resolve problems in a timely mannerAbility to work autonomously, multi-task and work in a fast-paced environmentAbility to work in agile environments\nAbility to understand business problems, draw conclusions from data and offer solutions\nProfessional/polished communicatorTeam playerExtremely detail oriented and organizedSelf-starter, willing to get hands dirty and “make it happen.”Must have the legal right to work in the United States\nNew York City, New York, NYC, NY\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
## 6 try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'aboveFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};\n About Us\n\nMediaMath is a leading independent advertising technology company, working with brands and agencies. We created the first software for real-time media buying in 2007 and today work with over two-thirds of the Fortune 500 and more than 3,500 brands and their agency partners to grow and deepen direct customer relationships.\n\nWe have recently launched SOURCE by MediaMath which provides our clients with the purest media supply to connect their brands with consumers: real impressions on real media properties; real humans connected to with real ads, at scale; and a true and trusted data set that enables machine learning and attribution at scale, across channels such as mobile, Connected TV, Digital Out of Home, and display.\n\nWe need talent like you to fuel this next-generation ecosystem.\n\nKey Responsibilities\n\nMediaMath is seeking a leader of Data Science to help transform our engineering culture, to grow and develop our people, and to build the greatest tech possible. The Head of Data Science designs and launches innovative and complex analytic models, utilizing a blend of contemporary and traditional data mining techniques, which, when applied to both structured and unstructured data sets, drive insights and benefits not otherwise apparent. This person should have business domain expertise in order to translate goals into data-based deliverables, using quantitative analysis, statistical modeling, predictive and prescriptive analytics, optimization and attribution algorithms, pattern detection analysis, etc. This person should have knowledge of current AI and machine learning capabilities and should stay up to speed on advances in the field, both academic and applied. This person should be interested in the academics of data science, but more focused on practical application. This person should clearly articulate the purpose of data science solutions and then translate those into action. These solutions will encompass such things as product innovations and fixes, data architecture improvements, system architecture enhancements, business risks, and process improvements.\n\nWe are seeking an individual who thrives on describing a vision and then inspiring the team to achieve it. A person who values giving credit over taking it. We are looking for someone who will break down barriers, enlist and empower, communicate and stimulate. Someone who harnesses advanced analytic data modeling systems to drive positive outcomes for our customers. From the definition of a strategy through the execution of it, you will develop, collect, and report the objective metrics required to assure it. 
You will be responsible for driving employee engagement and productivity across Data Science and into Engineering\nYou will:Define the vision for our data science applications, focused on up-leveling internal use of machine learning\nPartner with our product teams to help predict system behavior, establish metrics, identify bugs and improve debugging skills\nEnsure data quality and integrity within our products as well as our teams\nTest performance of data-driven products\nPartner with our client teams to enhance products and develop client solutions applying critical thinking skills to remove extraneous inputs\nConceive, plan and prioritize data projects\nInterpret and analyze data problems\nLead data mining and collection procedures, especially focused on unstructured and siloed data sets\nBuild analytic systems and predictive models\nVisualize data and create reports\nExperiment with new models and techniques\nDrive the implementation of models into Production through various Engineering teams\nManage, develop, coach, and mentor a team of Data Scientists, machine learning engineers and big data specialists\nCreate a positive culture to maximize productivity and minimize attrition\nYou are:An innate personal interest in technology\nCritical thinker\nStrong attention to detail and extremely well-organized\nAbility to manage multiple projects with competing priorities\nAbility to understand business language\nAbility to work with diverse teams (product, engineering, analytics, sales, services) and at various levels within the organization\nDemonstrated passion for excellence with respect to Engineering services, education, and support\nStrong interpersonal skills, demonstrating an ability to work well with and enthusiastically influence teams & stakeholders\nYou have:A strong focus on the client\nProven experience as a Data Scientist or similar role\nSolid understanding of machine learning\nKnowledge of data management and visualization techniques\nA knack for statistical analysis and predictive modeling\nPractical knowledge of machine learning tools and techniques. (e.g. Python Tensorflow, PyTorch, Spark, Scala)\nStrong organizational and leadership skills\nA business mindset\nExcellent communication and analytics translation skills\n\nPreferred Qualifications\nMaster's or PhD in Statistics, Machine Learning, Mathematics, Computer Science, Economics, or any other related quantitative field. A working experience of the same is also acceptable for the position.\nAt least 7 years of working experience in a data science position, preferably working as a Senior Data Scientist.\nProven and successful track record of leading high-performing data analyst teams leading through the successful performance of advanced quantitative analyses and statistical modeling that positively impact business performance.\n\nWhy We Work at MediaMath\n\nWe are restless innovators, smart, passionate and kind. At the heart of our culture are six values that provide a framework for how we approach our work and the world: Teams Win, Scale + Innovation, Obsess Over Learning & Growth, Align then Execute, Do Good Better and Embrace the Journey. These values inform how we energize one another and engage with our clients. They get us amped to come to work. And, let's face it, so do the free snacks, great benefits, and unlimited vacation.\n\nMediaMath is committed to equal employment opportunity. 
It is a fundamental principle at MediaMath not to discriminate against employees or applicants for employment on any legally-recognized basis including, but not limited to: age, race, creed, color, religion, national origin, sexual orientation, sex, disability, predisposing genetic characteristics, genetic information, military or veteran status, marital status, gender identity/transgender status, pregnancy, childbirth or related medical condition, and other protected characteristic as established by law.\n try {\n window.mosaic.onMosaicApiReady(function() {\n var zoneId = 'belowFullJobDescription';\n var providers = window.mosaic.zonedProviders[zoneId];\n\n if (providers) {\n providers.filter(function(p) { return window.mosaic.lazyFns[p]; }).forEach(function(p) {\n return window.mosaic.api.loadProvider(p);\n });\n }\n });\n } catch (e) {};
Analyzing frequencies of Data Science Skills words and phrases
Here we use our domain knowledge of the Data Science field to perform a top-down search for key skills that we know to be valuable in data science.
# In this part I'll search and count some words related to data science using mutate and str_count
skills <- full_df %>%
mutate(mathematics = str_count(full_df$job_description, "mathematics" )) %>%
mutate(SQL = str_count(full_df$job_description, "SQL" )) %>%
mutate(Python = str_count(full_df$job_description, "Python" )) %>%
mutate(programming = str_count(full_df$job_description, "programming" )) %>%
mutate(Hadoop = str_count(full_df$job_description, "Hadoop" )) %>%
mutate(statistics = str_count(full_df$job_description, "statistics" )) %>%
mutate(modeling = str_count(full_df$job_description, "modeling" )) %>%
mutate(communication = str_count(full_df$job_description, "communication" )) %>%
mutate(Java = str_count(full_df$job_description, "Java" )) %>%
mutate(Apache = str_count(full_df$job_description, "Apache" )) %>%
mutate(Tableau = str_count(full_df$job_description, "Tableau" )) %>%
mutate(computer_science = str_count(full_df$job_description, "computer science" )) %>%
mutate(TensorFlow = str_count(full_df$job_description, "TensorFlow" )) %>%
mutate(big_data = str_count(full_df$job_description, "big data" )) %>%
mutate(machine_learning = str_count(full_df$job_description, "machine learning" )) %>%
mutate(SAS = str_count(full_df$job_description, "SAS" )) %>%
mutate(R = str_count(full_df$job_description, "R" )) %>% #note: the bare pattern "R" matches every capital R in the text, so this count is inflated
select(3:19) %>% summarise_all(funs(sum))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## mathematics SQL Python programming Hadoop statistics modeling communication
## 1 14 48 53 37 12 23 33 37
## Java Apache Tableau computer_science TensorFlow big_data machine_learning SAS
## 1 21 2 19 9 5 17 67 18
## R
## 1 411
Display the result in a table format:
table <- gather(skills, "skill", "Count", 1:16) %>%
arrange( Count) %>%
mutate(skill = factor( skill, skill))
table
## R skill Count
## 1 411 Apache 2
## 2 411 TensorFlow 5
## 3 411 computer_science 9
## 4 411 Hadoop 12
## 5 411 mathematics 14
## 6 411 big_data 17
## 7 411 SAS 18
## 8 411 Tableau 19
## 9 411 Java 21
## 10 411 statistics 23
## 11 411 modeling 33
## 12 411 programming 37
## 13 411 communication 37
## 14 411 SQL 48
## 15 411 Python 53
## 16 411 machine_learning 67
Visualize as a bar plot:
From the bar plot above, our finding is that the data science skill words and phrases of interest were well represented in the text data. In particular, we observe that hard skills like machine learning, Python and SQL occurred frequently. However, we also see that some soft skills (e.g. communication) are very common as well. Additionally, another key finding was that several of the terms used here also occurred frequently in the unsupervised NLP analysis, for example python and machine (learning).
In Closing
Conclusions
For this project, we scraped, wrangled and processed text data from both Indeed and Reddit. We believe that these sources give two different perspectives on which skills are the most valued in data science.
Based on the analysis of our data, it appears that some of the most valued skills for data scientists are python, analysis, modeling, machine (learning), analytics, and teamwork. These words came up frequently in every dataset we examined and make sense in terms of what we have learned so far. When attempting to get a job, it looks like experience and a degree are also valued (though these are not skills, per se).
Future Directions
It would be very interesting to see how this data correlates with text data sets from other sources. Perhaps from other groups for this Project 3?…
For our analysis, we only looked at single words. We could revisit the data with udpipe’s RAKE (Rapid Automatic Keyword Extraction) methods to surface phrases or word pairings (e.g. ‘machine learning’ or ‘big data’), as sketched below.
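A minimal sketch of what that might look like with udpipe’s keywords_rake(), assuming the raw comments in allComments_df are annotated as whole comments (rather than the single-word entries used above):
#annotate whole comments so that multi-word phrases are preserved
commentAnnotation <- data.frame( udpipe_annotate( udmodel_english, x = allComments_df$comment ) )
#rank candidate keyword phrases built from nouns & adjectives with RAKE
rakeKeywords <- keywords_rake( x = commentAnnotation, term = "lemma", group = "doc_id",
                               relevant = commentAnnotation$upos %in% c("NOUN", "ADJ") )
head( rakeKeywords )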
For the unsupervised approach to analyzing the text data, we made the assumption that nouns would be more informative than other parts of speech. This was based on subjectively screening the adjectives, verbs, etc. in the data. However, we could take measures to verify this assumption.
Comment cleaning with the text mining library tm
We will now use a text mining library to break the text into word elements, group like elements & calculate their frequency for both datasets.
Cleaning Reddit data
We previously used multiple RedditExtractoR methods to build a data.frame of relevant Reddit comments. Now we will use the tm library to reformat the comments into a structure that holds all of the unique words across the dataset along with a count of their frequency of occurrence. We start by casting the Reddit data.frame as a corpus, or a collection of documents. This transformation is necessary for us to apply the tm_map() methods to the text data.
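The cleaning code itself does not appear in this section of the rendering. A minimal sketch of the kind of tm pipeline described above, assuming the combined comments in allComments_df and using reddit1000_words (the word/value data.frame consumed by udpipe earlier) as the hypothetical result name:
#cast the comments as a corpus and normalize the text
redditCorpus <- Corpus( VectorSource( allComments_df$comment ) )
redditCorpus <- tm_map( redditCorpus, content_transformer( tolower ) )
redditCorpus <- tm_map( redditCorpus, removePunctuation )
redditCorpus <- tm_map( redditCorpus, removeNumbers )
redditCorpus <- tm_map( redditCorpus, removeWords, stopwords( "english" ) )
redditCorpus <- tm_map( redditCorpus, stripWhitespace )
#build a term-document matrix and collapse it to per-word frequencies
redditTDM <- TermDocumentMatrix( redditCorpus )
wordFreqs <- sort( rowSums( as.matrix( redditTDM ) ), decreasing = TRUE )
#keep the 1000 most frequent words as a word/value data.frame
reddit1000_words <- data.frame( word = names( wordFreqs ), value = wordFreqs )[ 1:1000, ]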
Cleaning Indeed data
Here we will access the scraped listings output and use tm methods to clean the data. We will follow the same steps as we did with the Reddit comments data: