Job Postings Data Analysis

Data Analysis

Libraries

library(kableExtra)
library(tidyverse)
library(readxl)
library(ggplot2)

Combine NY and CA Jobs Data

data_store_path <- "~/R/Project3_Group3"

#load NY data into jobs.ny data frame
load(file.path(data_store_path, "jobs_df.RData"))
jobs.ny <- job_df
rm(job_df)
names(jobs.ny)[1] <- "job_post_region"
jobs.ny$job_post_region <- 'NY'
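# Drop the title column by name; unlike -which(...), this keeps every column if the name is absent
jobs.ny <- jobs.ny[, names(jobs.ny) != "job_post_title", drop = FALSE]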
kable_styling(knitr::kable(head(jobs.ny, 3), "html", caption = "NY Data Frame"), bootstrap_options = "striped")

NY Data Frame
job_post_region	job_post_summary
NY	Job Requisition Number:64749The Fixed Income Real-Time Pricing team provides intraday pricing on a wide variety of asset classes. Our product gives clients unprecedented transparency into the OTC markets. We develop and support several real-time pricing engines customized for different use cases.We are working on new pricing models for the European, Chinese, and various other fixed income markets. The quality of the algorithms we develop must be highly defensible and transparent to those who use the data for trading. We apply various quantitative methods for data cleaning and price generation, and build the tools needed to validate and monitor online quality. Backtesting is a key part of our algorithm development. Some of the biggest technical challenges we face are those of scale; we must produce pricing in real time, while taking in millions of data points per minute. We achieve low latency and high throughput via distributed computing that scales horizontally. As a Data Scientist you will help lead our research efforts as well as develop production-quality systems to further expand our product. You will work closely with our highly-quantitative product side to design the models for these systems. 
We will trust you to: Contribute research and be hands-on in the development of the product Partner with stakeholders to understand requirements and take projects fully through the research and software life cycle Design models that are highly robust with an emphasis on data integrity and a rigorous, defendable scientific approach What’s in it for you: You will use C++, Python, Jupyter notebook, Redis, Jenkins and Google test, as well as explore cutting-edge technologies like Kafka and Cassandra on the Bloomberg Cloud You will tackle some of the biggest scalability challenges in producing real-time pricing while taking in millions of data points per minute You’ll need to have: 3 or more years of industry experience in a quantitative finance role within the fixed income area Advanced degree in Mathematics, Statistics, Physics, Engineering, Finance or related field. Strong understanding of object-oriented design and problem solving skills Experience leading projects through the complete life cycle from research and development to testing and deployment Solid understanding of financial concepts including Fixed Income, Credit, callable bonds, curves, and relative value Deep understanding of statistics, probability, inverse problems, stochastic calculus, financial and econometric models, as well as estimation and calibration techniques Hands-on experience with C/C++ and Python Quantitative Development. Excellent communication and collaboration skills Experience with handling large scale data sets We’d love to see: PhD Experience with MATLAB, Mathematica or R
NY	RESPONSIBILITIES: Kforce has a client seeking a Lead/Manager of Data Science based in New York, New York (NY). Summary: The Manager/Head of Data Science position leads the Data Science and Analytics team. This is a senior position within the organization that will be responsible for the analytics and data science for the organization including developing the strategy and future vision of the area. This is a hands-on leadership position. The client is looking for a visionary that understands the importance of data insights in a growing organization and is up for the challenge of developing a practice that can harness those insights to drive meaningful business decisions. This position reports directly to VP of Engineering and will have the additional responsibility of managing a staff of 3-5 employees (data scientists). Manager of Data Science - Python, Pandas, and Numpy Responsibilities: Mentor and develop members of the Data Science and Data Analytics Teams Apply deep, creative, rigorous thinking to solve broad, platform-wide technical and/or business problems Teach, mentor, and collaborate with members of the broader Data Science Data Analytics Teams, both to build out specific projects and to continuously teach and learn new technology and techniques Design, conduct, and analyze real-time A/B tests Support and guide team in using machine-learning techniques, visualizations, and statistical analysis to gain insight into various data Support and guide team in researching, designing, simulating and/or prototyping new algorithmic product features per business need Work closely with the team to deliver products that enhance campaign optimization REQUIREMENTS:BS in Computer Science, Statistics, Economics, or related field, Masters or PhD is preferred 6+ years of experience with at least 4 years of hands-on technical experience Must have experience with the following: Machine learning algorithms, can code in Python or R, Numpy and Pandas; some understanding of 
statistics is important Experience leading multiple teams Experience with applying probabilistic and/or statistical methods to real-world data sets Experience delivering multiple complex, scalable, mission critical products Solid Computer Science fundamentals including complex distributed systems and space-time complexity Experience optimizing machine learning algorithms; experience with predictive algorithms Experience with stream processing platforms such as Spark or Storm a plus Knowledge of and/or working exposure to the scikit-learn open source library is a plus Experience with machine learning techniques; expert knowledge is a major plus Strong coding proficiency in SQL and at least one other common scripting or compiled programming language (Python, Java, and/or Scala a plus) Experience working in a Linux environment Kforce is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status. 
* BS in Computer Science, Statistics, Economics, or related field, Masters or PhD is preferred 6+ years of experience with at least 4 years of hands-on technical experience Must have experience with the following: Machine learning algorithms, can code in Python or R, Numpy and Pandas; some understanding of statistics is important Experience leading multiple teams Experience with applying probabilistic and/or statistical methods to real-world data sets Experience delivering multiple complex, scalable, mission critical products Solid Computer Science fundamentals including complex distributed systems and space-time complexity Experience optimizing machine learning algorithms; experience with predictive algorithms Experience with stream processing platforms such as Spark or Storm a plus Knowledge of and/or working exposure to the scikit-learn open source library is a plus Experience with machine learning techniques; expert knowledge is a major plus Strong coding proficiency in SQL and at least one other common scripting or compiled programming language (Python, Java, and/or Scala a plus) Experience working in a Linux environment Kforce is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, pregnancy, sexual orientation, gender identity, national origin, age, protected veteran status, or disability status.
NY	My client a growing leader in the AdTech space has hit rapid growth mode!…. They are looking an experienced Data Scientist to build a custom pricing platform. Your vision will be to build tools that help determine how much of each product to buy, to recommend the best prices for said products, and to ensure that the clients have the right level of logistics capacity to fulfill their demand. The Pricing and Forecasting engineering team uses cutting edge technology to forecast demand and pricing with focus on business and customer experience. Responsibilities and Duties What you’ll do: Influence pricing decisions through advanced statistical techniques and concepts. Uses predictive modeling, data mining, segmentation and other statistical analyses to optimize revenue and retail sales performance for our clients. Recommends, designs, and evaluates pricing tests to help drive business growth and impact strategic decisions. Works collaboratively across the organization to understand their need and challenges providing thought leadership and perspective around appropriate statistical analyses to address those issues. Is an advocate for implementing pricing best practices and identifying opportunities to improve decision making. Qualifications and skills Required: You are experienced in the latest technologies and processes to transform an insight into a model and to deploy a highly scalable service. You also posses good programming skills (preferably Python), ML frameworks (TensorFlow, R), SQL/NoSQL, AWS. You have a solid academic history and a proven track record in research and innovation in the field of machine learning and optimization. Leadership in is you blood as you will be required to act as a role model and a mentor to the more junior data scientists as the team grows. This is currently a growing team so you need to be able to communicate on varying levels! 
In any case, your strong analytical skills help you easily translate data insights into business opportunities. You hold subject expertise in traditional and statistical models. Including but not limited to; generalized linear modeling, regression, time series, naive Bayesian classifiers, parametric statistical analysis, error calculation. You also hold knowledge of optimization techniques and linear programming.

#load CA (San Francisco) data into jobs.ca data frame
load(file.path(data_store_path, "SanFrancisco_CA_searchAllJobUrls_and_job_sum_text_objects_after_remove_empty_and_duplicate_postings.Rdata"))
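# Keep the text columns as character (not factor) so they combine cleanly with the NY data
jobs.ca <- data.frame(job_post_region = "CA", job_post_summary = job_sum_text, stringsAsFactors = FALSE)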
rm(job_sum_text, job_post_region, job_post_summary)
kable_styling(knitr::kable(head(jobs.ca, 3), "html", caption = "CA Data Frame"), bootstrap_options = "striped")

CA Data Frame
job_post_region	job_post_summary
CA	We are looking to bring on a Senior-level Data Scientist for a national client of ours for a long-term contract with the potential of full-time conversion. Rate: $60-65/hour (W2) \| $70-75/hour (Corp-to-Corp, 1099) RequirementsOur expectations for this person is to have expertise in the three following topic areas:Data AnalyticsModellingA/B testing Minimum Requirements / Responsibilities:At least five (5) years of hands-on experience doing this and some skill sets on our current analytics platform/environment, Amazon Web Services (AWS) Cloud, such as S3 within a Hadoop ecosystem.Bachelor of Science in Computer Science, Engineering, Mathematics, or Economics, or related discipline or equivalent work experience.Experience leading data and modeling initiatives, complex and impactful.Track record of writing clear and well documented code, preferably in Python.SQL proficiency and experience working with relational databases.Proficiency with data visualization tools such as Tableau, matplotlib, etc.Strong understanding of statistics and experience developing supervised and unsupervised learning models.GIS experience
CA	Qualifications:This position is a DATA ENGINEER, not a Data Scientist.The ideal candidate would be a strong SAS/SQL/AWS/HIVE developer who can support the non-statistical model development project in the SAS/ Teradata platform and subsequently transition these models to the AWS/HIVE/Python platform.Since a lot of our work spins off from new regulatory requirements, we lack detailed documentation of the specifications. We are looking for someone who can capture the requirements and build the models without detailed guidance.Bachelor“s Degree in Econometrics, Economics, Engineering, Mathematics, Applied Sciences, Statistics or job-related discipline or equivalent experience Job-related experience, 8 years, OR Master”s Degree and job-related experience, 6 years, OR Doctorate Degree and job-related experience, 3 years. Experience in data modeling, 5yrs DesiredEducation / Skills: PhD in engineering or a related field (computer science, natural sciences, mathematics) Experience with Python, R, Scala, SQL Experience developing solutions with Pandas/Scikit-learn, Spark or comparable technologies Experience data science notebooks (Jupyter, Zeppelin or other) Experience with AWS, Azure, cloud computing technologies Scrum team experience Energy industry Experience designing efficient data science workflows and database architecture for data science purposes Experience with forecasting, Bayesian networks, and graph analytics Strong statistics experience with software development methodologies and software engineering principles.Knowledge of program management theories, concepts, methods, best practices, and techniques as needed to perform at the job level Knowledge of relevant programming languages - for example Visual Basic, Ladder Logic, Programmable Logic Controller, C, SharePoint, HTML, Java, Adobe – as needed to perform at the job level Competency in knowing the most effective and efficient processes to get things done, with a focus on continuous improvement.Knowledge 
of principles, techniques, and procedures used for production and design of technology based equipment and systems as needed to perform at the job level.Knowledge of statistical theories, concepts, methods, best practices, and analyses as needed to perform at the job level Ability to develop reports, models, and simulations as needed to perform at the job level Competency in developing and delivering multi-mode communications that convey a clear understanding of the unique needs of different audiences.Knowledge of data model design philosophies and methodologies for data warehouse and OLTP systemsResponsibilities: Client“s Information Technology (IT) organization is comprised of various unified departments which collaborate effectively in order to deliver high quality technology solutions. The Digital Catalyst Team is a new enterprise team that is responsible for working collaboratively with the lines of business (e.g., Gas Operations, Electric Operations, etc.) to implement consumer grade mobile and analytical solutions across various user groups (e.g., field users, office workers, etc.). This includes, but is not limited to:Deploying best-in-class / rapid delivery capability for mobile solutions.Simplifying, improving, and standardizing business work management processes for mobile needs.Delivering high value analytics across all Lines of Businesses.Rapid delivery of web applications.Digital Catalyst consists of a staff of highly skilled professionals working together to produce mobile solutions following an agile methodology and design thinking. 
We are a”start-up” department within IT and building driven and creative mobile development team.We take the time to understand our partners" needs and translate those into solutions that delight our users.Our goal is to deliver products with intuitive user experience that will improve Client employees" and customer“s safety, productivity and overall well-being.Position Summary: We are seeking an experienced Data Scientist in the Digital Catalyst Team who will provide strong execution and delivery of data science.Working as a part of the product team, this Data Scientist will translate business needs into advanced analytics and machine learning models. The successful candidate will be responsible for model selection and identification of appropriate training data sets; building, training, and evaluating models; and delivering results to the business on a regular cadence.This role is part of a fully Agile Scrum team, so the data scientist will work alongside a product owner, technical lead, and team of developers and data engineers to support delivery of high-value analytics and software products. 
PositionResponsibilities: Leads development of high complexity models and training setsProvides hands-on execution and implementation of data science models • Translates business analysis needs into well-defined data science problems, and selecting appropriate models and algorithms and communicates model evaluation and implications of results back to stakeholdersRecognizes and prioritizes the most important work related to data science models to achieve highest operational impact for analytics in the businessBalances tradeoffs among analytics value, model development methods and design and technologies used to implement data science models with a bias toward actionPerforms collaborative work on data science problems and mentor junior data scientistsCreates shared process models, business objects, activity diagrams and process documentation to effectively articulate multiple views of the business solutions that support technical architecture.Manages development of quantitative models and tools.Collaborates with leaders, other LOBs, and business partners to work on issues, projects or activities.Develops new or revises complex models to predict business demand trends, and volume and expenditures forecasts capacity analysis, and various other metrics to identify potential opportunities.Assesses business implications associated with modeling assumptions, inputs, methodologies, technical implementation, analytic procedures and processes, and advanced data analysis.Partners with leaders to drive high performance in their lines of business.Develop deep understanding of business drivers and financial levers to provide strategic decision support.Oversees resolution of complex projects and programs.Develops and maintains up-to-date detailed project schedules and work plans.Performs analysis on complex data models requiring customized reports and data and presents recommendations.Contact:Rozina Hudda Email: rozina@sunrisesys.comHelp \| 732-395-4460Asha Krishna Email: 
asha@sunrisesys.comHelp \| 732-395-4591
CA	•Data Analysis: Compile and analyze data from different sources, to check completeness, business sense and filtering criteria ’s. •Data Cleansing: Develop/Edit SQL scripts to pull and clean datasets. •Strong quantitative training•Substantial experience in data manipulation •Savviness with building and validating models •Business acumen and great communication. •4+ years related professional experience •Proven achievements resulting from data analysis •Degree in computer science, applied math, physics, economics, or other quantitative science (graduate degree a plus) •Experience with languages used for querying (e.g. SQL/Hive), and statistical analysis (e.g. SAS/R/Matlab/python) •Proven ability to succeed in both collaborative and independent work environments. •Ability to leverage experience to scale and validate models •Experience working with business partners to validate the output of analytical models •Retail industry experience preferred

#Combine NY and CA data frames into jobs.df
jobs.df <- bind_rows(jobs.ny, jobs.ca)
rm(jobs.ny, jobs.ca)
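
The load() calls above depend on knowing which object names each .RData file defines (job_df for NY, job_sum_text for CA), and load() places those objects directly in the global workspace. As a minimal sketch, assuming one wants to inspect a file before using it, the hypothetical helper load_isolated below (not part of the original analysis) loads a file into its own environment. It is demonstrated on a temporary toy file, not on the project data.

```r
# Hypothetical helper: load an .RData file into a fresh environment and
# return that environment, leaving the global workspace untouched.
load_isolated <- function(path) {
  env <- new.env()
  load(path, envir = env)
  env
}

# Toy demonstration with a temporary file (not the project data).
tmp <- tempfile(fileext = ".RData")
job_df <- data.frame(job_post_title = "Data Scientist",
                     job_post_summary = "Python and SQL required",
                     stringsAsFactors = FALSE)
save(job_df, file = tmp)
rm(job_df)

ny_env <- load_isolated(tmp)
loaded_names <- ls(ny_env)                     # names of the objects stored in the file
jobs_ny_demo <- get("job_df", envir = ny_env)  # pull out one object by name
```

Listing the names first makes it easier to spot a mismatch between the expected object name and what the file actually contains.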

Prepare list of key terms - from Heather G. and Raj K.

The list of data science skills is based on the list at https://www.thebalance.com/list-of-data-scientist-skills-2062381

# Read the tab-separated keyword table
keywords <- read.table(
  "https://raw.githubusercontent.com/heathergeiger/Data607_Project3_Group3/master/heathergeiger_individual_work/combine_ny_and_san_francisco/keywords.txt",
  header = TRUE, check.names = FALSE, stringsAsFactors = FALSE, sep = "\t"
)
# Drop rows whose notes contain 'This is probably too tough'
keywords <- keywords[grep('This is probably too tough', keywords$Other.notes, invert = TRUE), ]
kable_styling(knitr::kable(head(keywords[,-4], 10), "html", caption = "Keywords"), bootstrap_options = "striped")

Keywords
	Skill	Soft.or.technical	Synonyms
1	algorithm	technical	None
2	appengine	technical	None
3	aws	technical	None
4	big data	technical	None
5	c++	technical	None
6	collaboration	soft	collaborative,collaborate,collaborated,team player,teamwork
7	communication	soft	communicative,communicate,communicated
8	prediction	technical	predictive,predict,predicted
10	couchdb	technical	None
11	creativity	soft	creative

# Build one character vector per skill: the skill itself followed by any synonyms
keyword_list <- vector("list", length = nrow(keywords))
for (i in seq_len(nrow(keywords))) {
  keywords_this_row <- keywords$Skill[i]
  if (keywords$Synonyms[i] != "None") {
    keywords_this_row <- c(keywords_this_row, strsplit(keywords$Synonyms[i], ",")[[1]])
  }
  keyword_list[[i]] <- keywords_this_row
}
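
To make the shape of keyword_list concrete, here is a minimal sketch on a two-row toy table. The names keywords_demo, expand_row and keyword_list_demo are hypothetical and used only for illustration. The sketch assumes, as in the table above, that a skill without synonyms is marked "None" and that synonyms are comma-separated.

```r
# Toy stand-in for the keywords table (two rows only).
keywords_demo <- data.frame(
  Skill = c("aws", "communication"),
  Synonyms = c("None", "communicative,communicate,communicated"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the skill plus its synonyms as one character vector.
expand_row <- function(skill, synonyms) {
  if (synonyms == "None") return(skill)
  c(skill, strsplit(synonyms, ",", fixed = TRUE)[[1]])
}

# One list element per skill, in the same order as the table rows.
keyword_list_demo <- unname(Map(expand_row, keywords_demo$Skill, keywords_demo$Synonyms))
```

Each list element then holds every spelling that should count as a mention of that skill.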

space_or_comma <- "[[:space:],]"
word_boundary <- "\\b"
pattern_for_one_keyword <- function(keyword){
  regexes <- paste0(space_or_comma,keyword,space_or_comma)
  regexes <- c(regexes,paste0(word_boundary,keyword,word_boundary))
  regexes <- c(regexes,paste0(word_boundary,keyword,space_or_comma))
  regexes <- c(regexes,paste0(space_or_comma,keyword,word_boundary))
  return(paste0(regexes,collapse="|"))
}
pattern_for_multiple_keywords <- function(keyword_vector){
  # Join the per-keyword patterns so that a match on the skill or any synonym counts
  individual_regexes <- vapply(keyword_vector, pattern_for_one_keyword, character(1))
  return(paste0(individual_regexes, collapse="|"))
}
keyword_regexes <- vapply(keyword_list, pattern_for_multiple_keywords, character(1))
kable_styling(knitr::kable(head(keyword_regexes), "html", caption = "Regex of Keywords"), bootstrap_options = "striped")

Regex of Keywords
x
[[:space:],]algorithm[[:space:],]|\balgorithm\b|\balgorithm[[:space:],]|[[:space:],]algorithm\b
[[:space:],]appengine[[:space:],]|\bappengine\b|\bappengine[[:space:],]|[[:space:],]appengine\b
[[:space:],]aws[[:space:],]|\baws\b|\baws[[:space:],]|[[:space:],]aws\b
[[:space:],]big data[[:space:],]|\bbig data\b|\bbig data[[:space:],]|[[:space:],]big data\b
[[:space:],]c++[[:space:],]|\bc++\b|\bc++[[:space:],]|[[:space:],]c++\b
[[:space:],]collaboration[[:space:],]|\bcollaboration\b|\bcollaboration[[:space:],]|[[:space:],]collaboration\b|[[:space:],]collaborative[[:space:],]|\bcollaborative\b|\bcollaborative[[:space:],]|[[:space:],]collaborative\b|[[:space:],]collaborate[[:space:],]|\bcollaborate\b|\bcollaborate[[:space:],]|[[:space:],]collaborate\b|[[:space:],]collaborated[[:space:],]|\bcollaborated\b|\bcollaborated[[:space:],]|[[:space:],]collaborated\b|[[:space:],]team player[[:space:],]|\bteam player\b|\bteam player[[:space:],]|[[:space:],]team player\b|[[:space:],]teamwork[[:space:],]|\bteamwork\b|\bteamwork[[:space:],]|[[:space:],]teamwork\b
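
A caveat on these patterns: keywords such as c++ and d3.js contain characters with special meaning in a regular expression (+ is a quantifier, . matches any character), so they are interpreted as operators instead of literal text. For example, c++ can match a lone "c" surrounded by spaces or commas. The sketch below shows one possible guard, using a hypothetical helper escape_regex that is not part of the original analysis. It uses base R grepl with perl = TRUE to stay dependency-free; the assumption is that stringr's engine treats these metacharacters similarly.

```r
# Hypothetical helper: backslash-escape regex metacharacters so that a
# keyword is matched as literal text.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

space_or_comma <- "[[:space:],]"
text_with_c   <- "experience with r, c, java or python"
text_with_cpp <- "hands-on experience with c++, python and sql"

# Unescaped keyword: "c++" is read as "one or more c", so a bare "c" matches.
raw_pattern <- paste0(space_or_comma, "c++", space_or_comma)
raw_hit_on_c <- grepl(raw_pattern, text_with_c, perl = TRUE)

# Escaped keyword: only the literal text "c++" matches.
safe_pattern <- paste0(space_or_comma, escape_regex("c++"), space_or_comma)
safe_hit_on_c   <- grepl(safe_pattern, text_with_c, perl = TRUE)
safe_hit_on_cpp <- grepl(safe_pattern, text_with_cpp, perl = TRUE)
```

Applying such a helper to each keyword before building the patterns would change which postings are flagged for c++ and d3.js, so any results produced with the unescaped patterns would need to be regenerated.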

Compare keywords against jobs post summary data

# Add one logical column per skill: TRUE when the lower-cased summary matches that skill's regex
summaries_lower <- tolower(jobs.df$job_post_summary)
for (i in seq_along(keyword_regexes)) {
  jobs.df[, keywords$Skill[i]] <- str_detect(summaries_lower, keyword_regexes[i])
}
kable_styling(knitr::kable(head(jobs.df[15:18,]), "html", caption = "Job Skills [in Wide Format]"), bootstrap_options = "striped")

Job Skills [in Wide Format]
	job_post_region	job_post_summary	algorithm	appengine	aws	big data	c++	collaboration	communication	prediction	couchdb	creativity	critical thinking	customer service	data manipulation	data wrangling	data mining	d3.js	decision making	decision tree	ecl	flare	google visualization api	hadoop	java	leadership	machine learning	matlab	microsoft excel	mining social media	modeling	perl	powerpoint	presentation	problem solving	python	r	raphael.js	risk modeling	sas	scripting languages	sql	statistics	tableau	a/b testing	data visualization
15	NY	EY is the only professional services firm with a separate business unit (“FSO”) that is dedicated to the financial services marketplace. Our FSO teams have been at the forefront of every event that has reshaped and redefined the financial services industry. If you have a passion for rallying together to solve the most complex challenges in the financial services industry, come join our dynamic FSO team! The Opportunity Today’s clients are looking to transform processes and organizations, and achieve data-driven growth. Data science seniors come from a variety of technical backgrounds and apply statistical and machine learning models to business problems ranging from traditional econometrics, to NLP, and Deep Learning. As a member of the Advanced Analytics team, you’ll work in a highly collaborative environment with clients, experienced data science practitioners, subject matter experts, and other advisory professionals to drive business value through the use of advanced analytics. This is a high growth, high visibility area with opportunities to enhance your skillset and build your career. EY’s Advanced Analytics team supports both internal business teams and external clients in developing innovative techniques and methods, product solutions, and proof-of-concepts. The business problems our clients are facing today are not the same problems they have faced in the past. The rapid pace of development in AI and the technology that enables it has created an urgent need to innovate and adapt to the new global business paradigm. Financial institutions are looking to build smarter and more efficient ways to operate their business, create new revenue streams, and better manage risk, through new opportunities uncovered by their data. 
We believe that to fully unlock the potential of AI and advanced analytics we need to look not only at the application of AI, but also at the strategy level for how best to transform the enterprise into one that is technology and data focused and ready for the new age. Our clients’ problems are becoming increasingly complex while at the same time the need to automate and streamline is rising. Your key responsibilities Analyze structured and unstructured data at scale to derive new insights and opportunities Build and validate predictive models Advise clients and project teams on leading data science practices Create data-driven business recommendations Contribute to internal research and development efforts in cutting edge areas including NLP, graph analytics, and deep learning / AI Skills and attributes for success Fostering an innovative and inclusive team-oriented work environment Leading and coaching diverse teams of professionals with different backgrounds Demonstrating in-depth technical capabilities and professional knowledge Establishing strong relationships with the clients Working in an entrepreneurial environment to pave your own career path To qualify for the role you’ll need A MSc in a technical field like Computer Science, Econometrics, Mathematics, Engineering, or a related field Practical experience with advanced machine learning techniques and big data technologies An understanding of the latest industry developments in Big Data and AI such as graph databases, Natural Language Processing, and neural networks A thorough understanding of common languages and libraries used in machine learning (such as R, C, Java or Python. Experience with big data tools, Hive, Spark, Neo4j, etc. 
a plus) Excellent business, communication and presentation skills (experience in financial services domains a plus, but not required) The ability to translate complex information into non-technical, easy to understand language Ideally, you’ll also have A PhD in a technical field like Computer Science, Econometrics, Mathematics, Engineering, or a related fieldWhat we look for We’re looking for well-rounded, technical, and intellectually curious individuals with an entrepreneurial spirit and a genuine desire to influence entire industries. You’ll need great analytical, strategic and communication skills, as well as the ability to handle new responsibilities and challenges. If you’re ready to own complex projects and bring new perspectives to a constantly evolving industry, this role is for you. What working at EY offers We offer a competitive compensation package where you’ll be rewarded based on your performance and recognized for the value you bring to our business. In addition, our Total Rewards package includes medical and dental coverage, both pension and 401(k) plans, a minimum of three weeks of vacation plus 10 observed holidays and three paid personal days, and a range of programs and benefits designed to support your physical, financial and social wellbeing. Plus, we offer Support, coaching and feedback from some of the best colleagues around Opportunities to develop new skills and progress your career The freedom and flexibility to handle your role in a way that’s right for you A rewards package tailored to your unique needs If you can confidently demonstrate that you meet the criteria above, please contact us as soon as possible. About EY As a global leader in assurance, tax, transaction and advisory services, we’re using the finance products, expertise and systems we’ve developed to build a better working world. That starts with a culture that believes in giving you the training, opportunities and creative freedom to make things better. 
Whenever you join, however long you stay, the exceptional EY experience lasts a lifetime. And with a commitment to hiring and developing the most passionate people, we’ll make our ambition to be the best employer by 2020 a reality. .	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
16	NY	Job Requisition Number:63976The Bloomberg Law (BLAW) division of Bloomberg LP provides best-in-class tools, search and analytics for several Bloomberg products in the areas of Law and Government. Our Machine Learning team is looking for an experienced data scientist to join us and contribute to building innovative legal research products using Natural Language Processing (NLP) and Machine Learning. We extract knowledge from hundreds of millions of legal documents and build intelligent models to enable our customers to get the right answers quickly. Our data scientists and engineers work closely with product managers, content team members and market strategists. We apply NLP and Machine Learning techniques to text such as entity extraction and disambiguation, text classification and clustering. As a data scientist on our team you will work with other team members on the latest technology and models to optimize our current line of products. You’ll need to have: A B.S, M.S. or PhD. in Computer Science, Electrical Engineering, Applied Mathematics or a related field 3+ years of experience with Machine Learning, Data Science Statistical Models, NLP and Text Analytics on large data sets Experience with techniques such as topic modeling, text classification, entity extraction and disambiguation A solid understanding of machine learning techniques including Support Vector Machines (SVM), Logistic Regression, Decision Trees, Max entropy, Conditional Random Fields (CRF) and Unsupervised Learning Methods 1+ year of experience programming in Python Experience using R, or other data science toolkits, such as Scikit-learn, NLTK, WEKA, Mallet, CLUTO and GENSIM We’d love to see: Experience in all phases of machine learning application lifecycles from data gathering and preparation to optimizing model performance Publications or presentations in relevant communities (ICML, NIPS, CVPR, SIGIR, ACM Multimedia) Legal or financial domain experience	FALSE	FALSE	FALSE	FALSE	FALSE	
FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE
17	CA	We are looking to bring on a Senior-level Data Scientist for a national client of ours for a long-term contract with the potential of full-time conversion. Rate: $60-65/hour (W2) \| $70-75/hour (Corp-to-Corp, 1099) RequirementsOur expectations for this person is to have expertise in the three following topic areas:Data AnalyticsModellingA/B testing Minimum Requirements / Responsibilities:At least five (5) years of hands-on experience doing this and some skill sets on our current analytics platform/environment, Amazon Web Services (AWS) Cloud, such as S3 within a Hadoop ecosystem.Bachelor of Science in Computer Science, Engineering, Mathematics, or Economics, or related discipline or equivalent work experience.Experience leading data and modeling initiatives, complex and impactful.Track record of writing clear and well documented code, preferably in Python.SQL proficiency and experience working with relational databases.Proficiency with data visualization tools such as Tableau, matplotlib, etc.Strong understanding of statistics and experience developing supervised and unsupervised learning models.GIS experience	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	TRUE	FALSE	TRUE
18	CA	Qualifications:This position is a DATA ENGINEER, not a Data Scientist.The ideal candidate would be a strong SAS/SQL/AWS/HIVE developer who can support the non-statistical model development project in the SAS/ Teradata platform and subsequently transition these models to the AWS/HIVE/Python platform.Since a lot of our work spins off from new regulatory requirements, we lack detailed documentation of the specifications. We are looking for someone who can capture the requirements and build the models without detailed guidance.Bachelor“s Degree in Econometrics, Economics, Engineering, Mathematics, Applied Sciences, Statistics or job-related discipline or equivalent experience Job-related experience, 8 years, OR Master”s Degree and job-related experience, 6 years, OR Doctorate Degree and job-related experience, 3 years. Experience in data modeling, 5yrs DesiredEducation / Skills: PhD in engineering or a related field (computer science, natural sciences, mathematics) Experience with Python, R, Scala, SQL Experience developing solutions with Pandas/Scikit-learn, Spark or comparable technologies Experience data science notebooks (Jupyter, Zeppelin or other) Experience with AWS, Azure, cloud computing technologies Scrum team experience Energy industry Experience designing efficient data science workflows and database architecture for data science purposes Experience with forecasting, Bayesian networks, and graph analytics Strong statistics experience with software development methodologies and software engineering principles.Knowledge of program management theories, concepts, methods, best practices, and techniques as needed to perform at the job level Knowledge of relevant programming languages - for example Visual Basic, Ladder Logic, Programmable Logic Controller, C, SharePoint, HTML, Java, Adobe – as needed to perform at the job level Competency in knowing the most effective and efficient processes to get things done, with a focus on continuous 
improvement.Knowledge of principles, techniques, and procedures used for production and design of technology based equipment and systems as needed to perform at the job level.Knowledge of statistical theories, concepts, methods, best practices, and analyses as needed to perform at the job level Ability to develop reports, models, and simulations as needed to perform at the job level Competency in developing and delivering multi-mode communications that convey a clear understanding of the unique needs of different audiences.Knowledge of data model design philosophies and methodologies for data warehouse and OLTP systemsResponsibilities: Client“s Information Technology (IT) organization is comprised of various unified departments which collaborate effectively in order to deliver high quality technology solutions. The Digital Catalyst Team is a new enterprise team that is responsible for working collaboratively with the lines of business (e.g., Gas Operations, Electric Operations, etc.) to implement consumer grade mobile and analytical solutions across various user groups (e.g., field users, office workers, etc.). This includes, but is not limited to:Deploying best-in-class / rapid delivery capability for mobile solutions.Simplifying, improving, and standardizing business work management processes for mobile needs.Delivering high value analytics across all Lines of Businesses.Rapid delivery of web applications.Digital Catalyst consists of a staff of highly skilled professionals working together to produce mobile solutions following an agile methodology and design thinking. 
We are a”start-up” department within IT and building driven and creative mobile development team.We take the time to understand our partners" needs and translate those into solutions that delight our users.Our goal is to deliver products with intuitive user experience that will improve Client employees" and customer“s safety, productivity and overall well-being.Position Summary: We are seeking an experienced Data Scientist in the Digital Catalyst Team who will provide strong execution and delivery of data science.Working as a part of the product team, this Data Scientist will translate business needs into advanced analytics and machine learning models. The successful candidate will be responsible for model selection and identification of appropriate training data sets; building, training, and evaluating models; and delivering results to the business on a regular cadence.This role is part of a fully Agile Scrum team, so the data scientist will work alongside a product owner, technical lead, and team of developers and data engineers to support delivery of high-value analytics and software products. 
PositionResponsibilities: Leads development of high complexity models and training setsProvides hands-on execution and implementation of data science models • Translates business analysis needs into well-defined data science problems, and selecting appropriate models and algorithms and communicates model evaluation and implications of results back to stakeholdersRecognizes and prioritizes the most important work related to data science models to achieve highest operational impact for analytics in the businessBalances tradeoffs among analytics value, model development methods and design and technologies used to implement data science models with a bias toward actionPerforms collaborative work on data science problems and mentor junior data scientistsCreates shared process models, business objects, activity diagrams and process documentation to effectively articulate multiple views of the business solutions that support technical architecture.Manages development of quantitative models and tools.Collaborates with leaders, other LOBs, and business partners to work on issues, projects or activities.Develops new or revises complex models to predict business demand trends, and volume and expenditures forecasts capacity analysis, and various other metrics to identify potential opportunities.Assesses business implications associated with modeling assumptions, inputs, methodologies, technical implementation, analytic procedures and processes, and advanced data analysis.Partners with leaders to drive high performance in their lines of business.Develop deep understanding of business drivers and financial levers to provide strategic decision support.Oversees resolution of complex projects and programs.Develops and maintains up-to-date detailed project schedules and work plans.Performs analysis on complex data models requiring customized reports and data and presents recommendations.Contact:Rozina Hudda Email: rozina@sunrisesys.comHelp \| 732-395-4460Asha Krishna Email: 
asha@sunrisesys.comHelp \| 732-395-4591	FALSE	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	FALSE	TRUE	FALSE	TRUE	FALSE	FALSE	FALSE	TRUE	FALSE	FALSE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE	TRUE	FALSE	TRUE	TRUE	FALSE	FALSE	FALSE

jobs.df.long <- jobs.df %>% gather("Skill", "Appears", 3:length(jobs.df)) %>% inner_join(keywords) %>% select(-c(Synonyms, Other.notes))

## Joining, by = "Skill"
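As an illustration of what the reshaping step above does, here is a minimal sketch on made-up data. The names toy_jobs, toy_keywords and toy_long are hypothetical stand-ins for jobs.df, keywords and jobs.df.long, and the values are invented for the example.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per posting, one logical column per skill.
toy_jobs <- data.frame(
  job_post_region  = c("NY", "CA"),
  job_post_summary = c("posting a", "posting b"),
  python           = c(TRUE, FALSE),
  sql              = c(TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical keyword lookup with the skill type.
toy_keywords <- data.frame(
  Skill             = c("python", "sql"),
  Soft.or.technical = c("technical", "technical"),
  stringsAsFactors  = FALSE
)

# Wide to long: each (posting, skill) pair becomes its own row,
# then the skill type is attached by joining on Skill.
toy_long <- toy_jobs %>%
  gather("Skill", "Appears", 3:ncol(toy_jobs)) %>%
  inner_join(toy_keywords, by = "Skill")

toy_long
```

Two postings and two skill columns give four rows in the long table, which is why each job summary repeats once per skill in the output that follows.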

kable_styling(knitr::kable(head(jobs.df.long[15:18,]), "html", caption = "Job Skills [in Long Format]"), bootstrap_options = "striped")

Job Skills [in Long Format]
	job_post_region	job_post_summary	Skill	Appears	Soft.or.technical
15	NY	EY is the only professional services firm with a separate business unit (“FSO”) that is dedicated to the financial services marketplace. Our FSO teams have been at the forefront of every event that has reshaped and redefined the financial services industry. If you have a passion for rallying together to solve the most complex challenges in the financial services industry, come join our dynamic FSO team! The Opportunity Today’s clients are looking to transform processes and organizations, and achieve data-driven growth. Data science seniors come from a variety of technical backgrounds and apply statistical and machine learning models to business problems ranging from traditional econometrics, to NLP, and Deep Learning. As a member of the Advanced Analytics team, you’ll work in a highly collaborative environment with clients, experienced data science practitioners, subject matter experts, and other advisory professionals to drive business value through the use of advanced analytics. This is a high growth, high visibility area with opportunities to enhance your skillset and build your career. EY’s Advanced Analytics team supports both internal business teams and external clients in developing innovative techniques and methods, product solutions, and proof-of-concepts. The business problems our clients are facing today are not the same problems they have faced in the past. The rapid pace of development in AI and the technology that enables it has created an urgent need to innovate and adapt to the new global business paradigm. Financial institutions are looking to build smarter and more efficient ways to operate their business, create new revenue streams, and better manage risk, through new opportunities uncovered by their data. 
We believe that to fully unlock the potential of AI and advanced analytics we need to look not only at the application of AI, but also at the strategy level for how best to transform the enterprise into one that is technology and data focused and ready for the new age. Our clients’ problems are becoming increasingly complex while at the same time the need to automate and streamline is rising. Your key responsibilities Analyze structured and unstructured data at scale to derive new insights and opportunities Build and validate predictive models Advise clients and project teams on leading data science practices Create data-driven business recommendations Contribute to internal research and development efforts in cutting edge areas including NLP, graph analytics, and deep learning / AI Skills and attributes for success Fostering an innovative and inclusive team-oriented work environment Leading and coaching diverse teams of professionals with different backgrounds Demonstrating in-depth technical capabilities and professional knowledge Establishing strong relationships with the clients Working in an entrepreneurial environment to pave your own career path To qualify for the role you’ll need A MSc in a technical field like Computer Science, Econometrics, Mathematics, Engineering, or a related field Practical experience with advanced machine learning techniques and big data technologies An understanding of the latest industry developments in Big Data and AI such as graph databases, Natural Language Processing, and neural networks A thorough understanding of common languages and libraries used in machine learning (such as R, C, Java or Python. Experience with big data tools, Hive, Spark, Neo4j, etc. 
a plus) Excellent business, communication and presentation skills (experience in financial services domains a plus, but not required) The ability to translate complex information into non-technical, easy to understand language Ideally, you’ll also have A PhD in a technical field like Computer Science, Econometrics, Mathematics, Engineering, or a related fieldWhat we look for We’re looking for well-rounded, technical, and intellectually curious individuals with an entrepreneurial spirit and a genuine desire to influence entire industries. You’ll need great analytical, strategic and communication skills, as well as the ability to handle new responsibilities and challenges. If you’re ready to own complex projects and bring new perspectives to a constantly evolving industry, this role is for you. What working at EY offers We offer a competitive compensation package where you’ll be rewarded based on your performance and recognized for the value you bring to our business. In addition, our Total Rewards package includes medical and dental coverage, both pension and 401(k) plans, a minimum of three weeks of vacation plus 10 observed holidays and three paid personal days, and a range of programs and benefits designed to support your physical, financial and social wellbeing. Plus, we offer Support, coaching and feedback from some of the best colleagues around Opportunities to develop new skills and progress your career The freedom and flexibility to handle your role in a way that’s right for you A rewards package tailored to your unique needs If you can confidently demonstrate that you meet the criteria above, please contact us as soon as possible. About EY As a global leader in assurance, tax, transaction and advisory services, we’re using the finance products, expertise and systems we’ve developed to build a better working world. That starts with a culture that believes in giving you the training, opportunities and creative freedom to make things better. 
Whenever you join, however long you stay, the exceptional EY experience lasts a lifetime. And with a commitment to hiring and developing the most passionate people, we’ll make our ambition to be the best employer by 2020 a reality. .	algorithm	FALSE	technical
16	NY	Job Requisition Number:63976The Bloomberg Law (BLAW) division of Bloomberg LP provides best-in-class tools, search and analytics for several Bloomberg products in the areas of Law and Government. Our Machine Learning team is looking for an experienced data scientist to join us and contribute to building innovative legal research products using Natural Language Processing (NLP) and Machine Learning. We extract knowledge from hundreds of millions of legal documents and build intelligent models to enable our customers to get the right answers quickly. Our data scientists and engineers work closely with product managers, content team members and market strategists. We apply NLP and Machine Learning techniques to text such as entity extraction and disambiguation, text classification and clustering. As a data scientist on our team you will work with other team members on the latest technology and models to optimize our current line of products. You’ll need to have: A B.S, M.S. or PhD. in Computer Science, Electrical Engineering, Applied Mathematics or a related field 3+ years of experience with Machine Learning, Data Science Statistical Models, NLP and Text Analytics on large data sets Experience with techniques such as topic modeling, text classification, entity extraction and disambiguation A solid understanding of machine learning techniques including Support Vector Machines (SVM), Logistic Regression, Decision Trees, Max entropy, Conditional Random Fields (CRF) and Unsupervised Learning Methods 1+ year of experience programming in Python Experience using R, or other data science toolkits, such as Scikit-learn, NLTK, WEKA, Mallet, CLUTO and GENSIM We’d love to see: Experience in all phases of machine learning application lifecycles from data gathering and preparation to optimizing model performance Publications or presentations in relevant communities (ICML, NIPS, CVPR, SIGIR, ACM Multimedia) Legal or financial domain experience	algorithm	FALSE	technical
17	CA	We are looking to bring on a Senior-level Data Scientist for a national client of ours for a long-term contract with the potential of full-time conversion. Rate: $60-65/hour (W2) \| $70-75/hour (Corp-to-Corp, 1099) RequirementsOur expectations for this person is to have expertise in the three following topic areas:Data AnalyticsModellingA/B testing Minimum Requirements / Responsibilities:At least five (5) years of hands-on experience doing this and some skill sets on our current analytics platform/environment, Amazon Web Services (AWS) Cloud, such as S3 within a Hadoop ecosystem.Bachelor of Science in Computer Science, Engineering, Mathematics, or Economics, or related discipline or equivalent work experience.Experience leading data and modeling initiatives, complex and impactful.Track record of writing clear and well documented code, preferably in Python.SQL proficiency and experience working with relational databases.Proficiency with data visualization tools such as Tableau, matplotlib, etc.Strong understanding of statistics and experience developing supervised and unsupervised learning models.GIS experience	algorithm	FALSE	technical
18	CA	Qualifications:This position is a DATA ENGINEER, not a Data Scientist.The ideal candidate would be a strong SAS/SQL/AWS/HIVE developer who can support the non-statistical model development project in the SAS/ Teradata platform and subsequently transition these models to the AWS/HIVE/Python platform.Since a lot of our work spins off from new regulatory requirements, we lack detailed documentation of the specifications. We are looking for someone who can capture the requirements and build the models without detailed guidance.Bachelor“s Degree in Econometrics, Economics, Engineering, Mathematics, Applied Sciences, Statistics or job-related discipline or equivalent experience Job-related experience, 8 years, OR Master”s Degree and job-related experience, 6 years, OR Doctorate Degree and job-related experience, 3 years. Experience in data modeling, 5yrs DesiredEducation / Skills: PhD in engineering or a related field (computer science, natural sciences, mathematics) Experience with Python, R, Scala, SQL Experience developing solutions with Pandas/Scikit-learn, Spark or comparable technologies Experience data science notebooks (Jupyter, Zeppelin or other) Experience with AWS, Azure, cloud computing technologies Scrum team experience Energy industry Experience designing efficient data science workflows and database architecture for data science purposes Experience with forecasting, Bayesian networks, and graph analytics Strong statistics experience with software development methodologies and software engineering principles.Knowledge of program management theories, concepts, methods, best practices, and techniques as needed to perform at the job level Knowledge of relevant programming languages - for example Visual Basic, Ladder Logic, Programmable Logic Controller, C, SharePoint, HTML, Java, Adobe – as needed to perform at the job level Competency in knowing the most effective and efficient processes to get things done, with a focus on continuous 
improvement.Knowledge of principles, techniques, and procedures used for production and design of technology based equipment and systems as needed to perform at the job level.Knowledge of statistical theories, concepts, methods, best practices, and analyses as needed to perform at the job level Ability to develop reports, models, and simulations as needed to perform at the job level Competency in developing and delivering multi-mode communications that convey a clear understanding of the unique needs of different audiences.Knowledge of data model design philosophies and methodologies for data warehouse and OLTP systemsResponsibilities: Client“s Information Technology (IT) organization is comprised of various unified departments which collaborate effectively in order to deliver high quality technology solutions. The Digital Catalyst Team is a new enterprise team that is responsible for working collaboratively with the lines of business (e.g., Gas Operations, Electric Operations, etc.) to implement consumer grade mobile and analytical solutions across various user groups (e.g., field users, office workers, etc.). This includes, but is not limited to:Deploying best-in-class / rapid delivery capability for mobile solutions.Simplifying, improving, and standardizing business work management processes for mobile needs.Delivering high value analytics across all Lines of Businesses.Rapid delivery of web applications.Digital Catalyst consists of a staff of highly skilled professionals working together to produce mobile solutions following an agile methodology and design thinking. 
We are a”start-up” department within IT and building driven and creative mobile development team.We take the time to understand our partners" needs and translate those into solutions that delight our users.Our goal is to deliver products with intuitive user experience that will improve Client employees" and customer“s safety, productivity and overall well-being.Position Summary: We are seeking an experienced Data Scientist in the Digital Catalyst Team who will provide strong execution and delivery of data science.Working as a part of the product team, this Data Scientist will translate business needs into advanced analytics and machine learning models. The successful candidate will be responsible for model selection and identification of appropriate training data sets; building, training, and evaluating models; and delivering results to the business on a regular cadence.This role is part of a fully Agile Scrum team, so the data scientist will work alongside a product owner, technical lead, and team of developers and data engineers to support delivery of high-value analytics and software products. 
PositionResponsibilities: Leads development of high complexity models and training setsProvides hands-on execution and implementation of data science models • Translates business analysis needs into well-defined data science problems, and selecting appropriate models and algorithms and communicates model evaluation and implications of results back to stakeholdersRecognizes and prioritizes the most important work related to data science models to achieve highest operational impact for analytics in the businessBalances tradeoffs among analytics value, model development methods and design and technologies used to implement data science models with a bias toward actionPerforms collaborative work on data science problems and mentor junior data scientistsCreates shared process models, business objects, activity diagrams and process documentation to effectively articulate multiple views of the business solutions that support technical architecture.Manages development of quantitative models and tools.Collaborates with leaders, other LOBs, and business partners to work on issues, projects or activities.Develops new or revises complex models to predict business demand trends, and volume and expenditures forecasts capacity analysis, and various other metrics to identify potential opportunities.Assesses business implications associated with modeling assumptions, inputs, methodologies, technical implementation, analytic procedures and processes, and advanced data analysis.Partners with leaders to drive high performance in their lines of business.Develop deep understanding of business drivers and financial levers to provide strategic decision support.Oversees resolution of complex projects and programs.Develops and maintains up-to-date detailed project schedules and work plans.Performs analysis on complex data models requiring customized reports and data and presents recommendations.Contact:Rozina Hudda Email: rozina@sunrisesys.comHelp \| 732-395-4460Asha Krishna Email: 
asha@sunrisesys.comHelp \| 732-395-4591	algorithm	FALSE	technical

save(jobs.df.long, file = file.path(data_store_path, "jobs_df_long.RData"), ascii = TRUE)

Analysing Results

#Filter out Job Skills which did not appear
job.skills <- jobs.df.long %>% 
  filter(Appears == TRUE)

table(job.skills$Soft.or.technical, job.skills$Appears) %>% 
  knitr::kable("html", caption = "Soft vs. Technical") %>% kable_styling(bootstrap_options = "striped")

Soft vs. Technical
	TRUE
soft	902
technical	2083

table(job.skills$Soft.or.technical, job.skills$Appears) %>%
  as.data.frame %>%
  ggplot(aes(x = Var1, y = Freq, fill = Var1)) +
    geom_col() +
    labs(title = "Job Results by Skill Type", x = "Skill Type", y = "Number of Results") + 
    scale_fill_discrete(name = "Skill Type") +
    geom_text(aes(label = Freq, y = Freq + 12), size = 5, position = position_dodge(0.9), vjust = 0)

n.jobs <- nrow(jobs.df)
job.skills %>%
  group_by(Soft.or.technical, Skill) %>% 
  summarize(Skill.Percent = 100 * sum(Appears == TRUE)/n.jobs) %>%
  ggplot(aes(x = reorder(Skill, Skill.Percent), Skill.Percent, fill = Soft.or.technical)) + 
    geom_bar(stat = 'identity', position = 'dodge') +
    coord_flip() +
    labs(title = "Top Data Scientist Skills", x = "Skills", y = "Percentage of Jobs with Skill") + 
    scale_fill_discrete(name = "Skill Type")

Conclusion

Because our search criteria targeted top-paying, senior-level Data Scientist positions, the top 10 skills are representative of what one would realistically expect for this type of role.
* At senior job levels, it is no surprise that Leadership tops all other soft and technical skills, with two other soft skills, Communication and Collaboration, following closely in importance.
* On the technical side, Modeling and Machine Learning are, as expected, the top technical skills at a senior level. Python and R follow, with Statistics ranked between them, and SQL and Big Data complete the top 10.
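The chart plots every matched skill, so the top 10 has to be read off the figure. A minimal sketch of how the ranking could be pulled into a table instead, using a small made-up data frame in place of job.skills (toy_skills, n_toy_jobs and their values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical stand-in for job.skills: one row per (posting, skill) match.
toy_skills <- data.frame(
  Skill = c("leadership", "python", "python", "sql",
            "leadership", "leadership"),
  Soft.or.technical = c("soft", "technical", "technical", "technical",
                        "soft", "soft"),
  stringsAsFactors = FALSE
)
n_toy_jobs <- 4  # assumed number of postings in the toy data

# Share of postings mentioning each skill, highest first, top 10 only.
top_skills <- toy_skills %>%
  group_by(Soft.or.technical, Skill) %>%
  summarize(Skill.Percent = 100 * n() / n_toy_jobs) %>%
  ungroup() %>%
  arrange(desc(Skill.Percent)) %>%
  head(10)

top_skills
```

With the real data, the same pipeline would start from job.skills and divide by n.jobs.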

Data Scraping (NY - Indeed.com)

Libraries

library(RCurl)
library(XML)
library(tidyverse)
library(rvest)
library(stringr)
library(ggplot2)

Get listing of 16 HTML files for the Data Scientist [from Indeed.com] job posts

#NOTE: provide an existing path (in your environment) in order to store generated output files
data_store_path <- "~/GitHub/Project3"

jobURLs <- list.files(data_store_path, "indeed_job_post_.*.html")
head(jobURLs, 3)

Visit each job posting HTML file and scrape job title and description for analysis

job_sum_text <- vector(mode = "character", length = length(jobURLs))
job_title <- vector(mode = "character", length = length(jobURLs))

for (i in seq_along(jobURLs)) {
  #Visit each HTML page
  htmFile <- file.path(data_store_path, jobURLs[i])
  h <- read_html(htmFile)

  #Get HTML nodes with CSS id "job_summary"
  jobSum <- html_nodes(h, "#job_summary")
  
  #Get textual content from the "job summary" nodes
  job_sum_text[i] <- html_text(jobSum)

  #Collect job title text
  #Search for HTML <b> nodes with CSS class "jobtitle"
  jobTitleNode <- html_nodes(h, "b.jobtitle")
  job_title[i] <- html_text(jobTitleNode)
}
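The loop above assumes that every saved page contains a match for each selector; a page with no matching node would stop the loop with an error. A minimal sketch of a more defensive extraction, run here on an inline HTML string rather than a saved file. The helper first_text_or_empty and the sample markup are hypothetical and not part of the original script.

```r
library(rvest)

# Hypothetical helper: text of the first node matching `css`,
# or an empty string when nothing matches.
first_text_or_empty <- function(doc, css) {
  nodes <- html_nodes(doc, css)
  if (length(nodes) == 0) {
    return("")
  }
  html_text(nodes[[1]])
}

# Made-up page that mimics the two elements scraped above.
page <- read_html(paste0(
  "<html><body>",
  "<b class='jobtitle'>Data Scientist</b>",
  "<div id='job_summary'>Build models.</div>",
  "</body></html>"
))

title_text   <- first_text_or_empty(page, "b.jobtitle")
summary_text <- first_text_or_empty(page, "#job_summary")
missing_text <- first_text_or_empty(page, "#not_there")
```

Inside the loop, the helper would replace the direct html_text() calls, so a malformed page yields an empty entry instead of halting the scrape.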

Create a data frame holding the result of scraping (job title, job summary, etc.) and save to a file

job_df <- data.frame(job_post_source = "INDEED", job_post_title = job_title, job_post_summary = job_sum_text)
glimpse(job_df)
save(job_df, file = file.path(data_store_path, "jobs_df.RData"), ascii = TRUE)

To load the data frame object [named job_df] back into the environment call:

load(file.path(data_store_path, "jobs_df.RData"))
head(job_df, 2)
View(job_df)

Data Scraping (CA - Monster.com)

Set URLs.

These URLs were copied from actual search results in a browser. Doing it this way lets us search by the job title “data scientist” (not just a keyword search) and restrict results to a reasonable radius around the city.

new_york_url <- "https://www.monster.com/jobs/search/New-York+New-York-City+Data-Scientist_125?where=New-York__2c-NY&rad=20-miles"

san_francisco_url <- "https://www.monster.com/jobs/search/California+San-Francisco+Data-Scientist_125?where=San-Francisco__2c-CA&rad=20-miles"

Load libraries.

library(stringr)    #For string operations
library(rvest)      #For web scraping
library(tokenizers) #For tokenizing text
library(tidyverse)  #For Tidyverse
library(RCurl)      #For HTTP requests
library(dplyr)      #For manipulating data frames
library(DT)         #For interactive data tables
library(curl)       #For URL connections

Set city to New York or San Francisco, then make output directory and pg. 1 URL object.

Run this part only once to avoid getting banned.

On subsequent runs, load the results from the saved Rdata file instead.

The base URL returns the first 25 results; appending “&page=2”, “&page=3”, etc. returns the following pages.

We take the first 500 results per city, i.e. the first 20 pages.

I checked that both New York and San Francisco have more than 500 jobs in the search results.
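The chunk that sets the city and builds searchPage_url is not shown in this write-up. A minimal sketch of what that setup might look like, assuming a city variable as the switch and a per-city output directory (both are assumptions; a temporary directory is used here so the sketch does not touch real project files):

```r
# URLs copied from the "Set URLs" step above.
new_york_url <- "https://www.monster.com/jobs/search/New-York+New-York-City+Data-Scientist_125?where=New-York__2c-NY&rad=20-miles"
san_francisco_url <- "https://www.monster.com/jobs/search/California+San-Francisco+Data-Scientist_125?where=San-Francisco__2c-CA&rad=20-miles"

# Assumed switch: set to "New York" for the NY run.
city <- "San Francisco"
searchPage_url <- if (city == "New York") new_york_url else san_francisco_url

# Hypothetical output directory, one per city.
data_store_path <- file.path(tempdir(), gsub(" ", "_", city))
dir.create(data_store_path, showWarnings = FALSE, recursive = TRUE)

# Page 1 is the base URL; pages 2 to 20 append "&page=N".
# At 25 results per page this covers the first 500 results.
page_urls <- c(searchPage_url, paste0(searchPage_url, "&page=", 2:20))
length(page_urls)
```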

searchPage <- read_html(searchPage_url)

searchAllJobUrls <- unlist(str_extract_all(searchPage,'(job-openings\\.monster\\.com\\/)\\w.[^\\"]+'))
searchAllJobUrls <- paste("https://",searchAllJobUrls,sep = "")

searchAllJobUrls <- searchAllJobUrls[1:25]

for(page in 2:20)
{
  searchPage <- read_html(paste0(searchPage_url,"&page=",page))
  searchAllJobUrls_this_page <- unlist(str_extract_all(searchPage,'(job-openings\\.monster\\.com\\/)\\w.[^\\"]+'))
  searchAllJobUrls_this_page <- paste("https://",searchAllJobUrls_this_page,sep = "")
  searchAllJobUrls <- c(searchAllJobUrls,searchAllJobUrls_this_page[1:25])
}

save(searchAllJobUrls,file=paste0(data_store_path,"/searchAllJobUrls.Rdata"))
length(unique(tolower(searchAllJobUrls)))

If rerunning this script after already scraping the search results, set above to eval=FALSE and the below to eval=TRUE.

load(paste0(data_store_path,"/searchAllJobUrls.Rdata"))
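Toggling eval by hand works, but the scrape-once rule could also be expressed in code by checking for the saved file first. A sketch under stated assumptions: load_or_scrape is a hypothetical helper, fake_scrape stands in for the real scraping loop, and the URLs it returns are made up.

```r
# Hypothetical helper: reuse the cached URLs if the file exists,
# otherwise run the scraper once and cache its result.
load_or_scrape <- function(cache_file, scrape_fn) {
  if (file.exists(cache_file)) {
    env <- new.env()
    load(cache_file, envir = env)
    return(env$searchAllJobUrls)
  }
  searchAllJobUrls <- scrape_fn()
  save(searchAllJobUrls, file = cache_file)
  searchAllJobUrls
}

# Stand-in for the real scraping code; counts how often it is called.
calls <- 0
fake_scrape <- function() {
  calls <<- calls + 1
  c("https://job-openings.monster.com/example-a",
    "https://job-openings.monster.com/example-b")
}

cache_file <- tempfile(fileext = ".Rdata")
first_run  <- load_or_scrape(cache_file, fake_scrape)  # scrapes and saves
second_run <- load_or_scrape(cache_file, fake_scrape)  # loads from file
```

The second call never reaches the scraper, so rerunning the script does not send new requests to the site.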

To make sure everything looks correct, show URLs 1, 26, and 51 so we can compare to the links we get by looking in a browser at search pages 1, 2, and 3.

searchAllJobUrls[c(1,26,51)]

So, these match what we see by looking in browser results.

However, we were initially concerned to find that the number of unique URLs is less than 500.

Looking manually through a few pages, it appears that the same job is sometimes listed under two different headlines (e.g., a “Data Scientist” job at Open Systems Technologies was listed as “Data Scientist” on page 2 and “Machine Learning Data Scientist” on page 3).

Since these entries point to the same posting, it should be safe to run unique on searchAllJobUrls and then proceed as normal.

searchAllJobUrls <- unique(searchAllJobUrls)
length(searchAllJobUrls)

Now, read from each URL in searchAllJobUrls and save text in job description.

job_sum_text <- vector(mode = "character", length = length(searchAllJobUrls))

for(i in seq_along(searchAllJobUrls))
{
  h <- read_html(searchAllJobUrls[i])
  forecasthtml <- html_nodes(h,"#JobDescription")
  #Check that there is exactly one node with id "JobDescription"; one of the URLs did not have this node and it broke the for loop.
  if(length(forecasthtml) == 1)
  {
    job_sum_text[i] <- html_text(forecasthtml)
  } else {
    #Store an empty string for now; these entries are removed later on.
    job_sum_text[i] <- ""
  }
}

save(job_sum_text,file=paste0(data_store_path,"/job_description_text.Rdata"))

If rerunning this script after already scraping the job pages, set above to eval=FALSE and below to eval=TRUE.

load(paste0(data_store_path,"/job_description_text.Rdata"))

length(job_sum_text)
class(job_sum_text)
job_sum_text[1:3]
length(unique(job_sum_text))

When running a similar script for Columbus, OH, we found at least one job without a valid JobDescription node, so we stored an empty string as its description text.

We also found an instance of the same job clearly listed under two different URLs.

Let’s check if this happens here, and remove such instances if so.

Then, save Rdata again.

searchAllJobUrls <- searchAllJobUrls[job_sum_text != "" & duplicated(job_sum_text) == FALSE]
job_sum_text <- job_sum_text[job_sum_text != "" & duplicated(job_sum_text) == FALSE]

length(searchAllJobUrls)
length(job_sum_text)

save(list=c("searchAllJobUrls","job_sum_text"),
file=paste0(data_store_path,"/searchAllJobUrls_and_job_sum_text_objects_after_remove_empty_and_duplicate_postings.Rdata"))
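The filtering above relies on subsetting searchAllJobUrls before job_sum_text is overwritten, so that both vectors are cut with the same mask. A toy sketch of the same idea with the mask computed once, on made-up values (toy_urls, toy_text and keep are hypothetical names):

```r
# Made-up URLs and description texts: the second is empty,
# the third duplicates the first.
toy_urls <- c("u1", "u2", "u3", "u4")
toy_text <- c("A", "", "A", "B")

# Keep entries that are non-empty and not a repeat of an earlier text.
keep <- toy_text != "" & !duplicated(toy_text)

toy_urls <- toy_urls[keep]
toy_text <- toy_text[keep]

toy_urls  # "u1" "u4"
toy_text  # "A"  "B"
```

Storing the mask in one variable keeps the two vectors aligned regardless of the order in which they are subset.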

DATA 607 - Project3 [Job Posts Web Data Analysis]

Simon U. & Ritesh Lohiya

March 25, 2018
