Project 3 - Data Science Skills

What are the most valued data science skills?

Introduction

Data is the descriptive tapestry woven through almost every part of business and our lives. As technology advances, so do the collection and volume of data. This abundance of data has led to entire sectors focused on how to manage, analyze, and communicate data.

This leads to a fundamental question: What are the most valued data science skills?

This project explores this question through the acquisition, normalization, cleaning, and analysis of postings for data-related jobs. The goal is to look at job postings through the lens of the data domain they specialize in: Engineering, Science, or Analysis. To do this, the analysis considers salary trends, demand in the different domains, common skills and tools, and job location.

This project was executed in three primary steps:

1. Data transformation, cleaning, and normalization
2. Data exploration and analysis
3. A brief summation

Preparing the R environment

The following code loads all required packages.

library(RMySQL)

## Loading required package: DBI

library(DBI)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Connect to the database.

user <- Sys.getenv("project03")
psw <- Sys.getenv("PROJECT03_PW")

con <- dbConnect(MySQL(), 
                 dbname = "data607_database", 
                 host = "project03.mysql.database.azure.com", 
                 user = user, 
                 password = psw)
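Because the credentials come from environment variables, an unset variable would silently pass an empty string to `dbConnect()`. The sketch below shows one way to fail early instead. It is a minimal illustration, not part of the original analysis: the helper `get_required_env()` and the `DEMO_*` variable names are hypothetical and are used only so the example runs without a database.

```r
# Hypothetical helper: read an environment variable and stop if it is unset or empty.
get_required_env <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || !nzchar(value)) {
    stop("Environment variable '", name, "' is not set", call. = FALSE)
  }
  value
}

# Demonstration with a throwaway variable (not one used by this project).
Sys.setenv(DEMO_DB_USER = "example_user")
demo_user <- get_required_env("DEMO_DB_USER")

# An unset variable raises an error rather than returning an empty credential.
demo_missing <- tryCatch(
  get_required_env("DEMO_DB_UNSET_VARIABLE"),
  error = function(e) conditionMessage(e)
)
```

In the connection code above, the same helper could wrap the two `Sys.getenv()` calls. Once the data has been read, the connection can be closed with `dbDisconnect(con)`.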

Acquire the dataset:

This dataset was loaded into the database in advance. The original data was acquired from:

https://www.kaggle.com/datasets/fahadrehman07/data-science-jobs-and-salary-glassdoor

db_glassdoor_data2 <- dbGetQuery(con, "SELECT * FROM glassdoor_data2")

Data transformation, cleaning, and normalization

Database Normalization

The dataset was loaded from its original source into the database. Within the database, the dataset was normalized according to the following entity-relationship diagram:

Entity-relationship diagram
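To make the idea of normalization concrete, the sketch below splits a small wide table into a company table and a job table linked by a key. It is an illustration under stated assumptions only: the actual tables and keys are those in the entity-relationship diagram, which is not reproduced here, so the split shown, the `company_id` key, and the toy company names and salaries are all hypothetical. Base R is used so the example runs without a database.

```r
# Toy wide table: company attributes are repeated on every job row.
jobs_raw <- data.frame(
  job_title   = c("Data Scientist", "Data Engineer", "Data Analyst"),
  company_txt = c("Acme", "Acme", "Globex"),
  sector      = c("Tech", "Tech", "Finance"),
  avg_salary  = c(100, 110, 80),
  stringsAsFactors = FALSE
)

# Company table: one row per distinct company, with a surrogate key.
companies <- unique(jobs_raw[, c("company_txt", "sector")])
companies$company_id <- seq_len(nrow(companies))
rownames(companies) <- NULL

# Job table: keep job-level columns and replace company attributes with the key.
jobs <- merge(jobs_raw, companies, by = c("company_txt", "sector"))
jobs <- jobs[, c("job_title", "avg_salary", "company_id")]

# Joining the two tables on the key recovers the original wide rows.
rebuilt <- merge(jobs, companies, by = "company_id")
```

Storing each company once means a correction to a company attribute is made in a single row rather than in every posting for that company.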

Transformation and cleaning

With the dataset acquired, the data must be transformed, cleaned and tidied so that an analysis may be performed.
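As one small illustration of the kind of cleaning meant here, the `company_txt` values in this dataset end with a newline character (it prints as `\n` in the `head()` output further down). The sketch below shows how such values could be trimmed with base R. It is a minimal example on three values copied from that output, not a step taken from the original analysis.

```r
# Values copied from the company_txt column; each ends with a newline.
company_txt <- c("Tecolote Research\n", "KnowBe4\n", "PNNL\n")

# trimws() removes leading and trailing whitespace, including newlines.
company_clean <- trimws(company_txt)
```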

The code below selects the columns to be used in the clean dataset.

glassdoor_data2 <- db_glassdoor_data2 |> 
  select(
         "Job_Title",
         "Size",
         "company_txt",
         "Type_of_ownership",
         "Job_Description",
         "Industry",
         "Sector",
         "Revenue",
         "min_salary",
         "max_salary",
         "avg_salary",
         "City",
         "State",
         "Country",
         "Source",
         "same_state",
         "python_yn",
         "r_yn",
         "spark",
         "aws",
         "excel")
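Since the columns are selected by name, `select()` will stop if a name is not present in the query result. A quick check beforehand can list every missing name at once. The sketch below is a minimal illustration: `missing_columns()` is a hypothetical helper, and the data frame is toy data rather than the Glassdoor table.

```r
# Hypothetical helper: return the expected column names that are absent from df.
missing_columns <- function(df, expected) {
  setdiff(expected, names(df))
}

# Toy data frame standing in for the query result.
toy <- data.frame(
  Job_Title  = "Data Scientist",
  avg_salary = 72,
  stringsAsFactors = FALSE
)

all_present <- missing_columns(toy, c("Job_Title", "avg_salary"))
one_missing <- missing_columns(toy, c("Job_Title", "python_yn"))
```

Here `all_present` is an empty character vector and `one_missing` contains `"python_yn"`.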

Then, the column headers are made uniform.

glassdoor_data2 <- glassdoor_data2 |> 
  clean_names()

head(glassdoor_data2)

##                   job_title                   size
## 1            Data Scientist  501 to 1000 employees
## 2 Healthcare Data Scientist       10000+ employees
## 3            Data Scientist  501 to 1000 employees
## 4            Data Scientist 1001 to 5000 employees
## 5            Data Scientist    51 to 200 employees
## 6            Data Scientist   201 to 500 employees
##                               company_txt  type_of_ownership
## 1                     Tecolote Research\n  Company - Private
## 2 University of Maryland Medical System\n Other Organization
## 3                               KnowBe4\n  Company - Private
## 4                                  PNNL\n         Government
## 5                    Affinity Solutions\n  Company - Private
## 6                              CyrusOne\n   Company - Public
##   job_description
## 1 Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization’s data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3 KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5 Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate
##                            revenue min_salary max_salary avg_salary        city
## 1        $50 to $100 million (USD)         53         91         72 Albuquerque
## 2           $2 to $5 billion (USD)         63        112       87.5   Linthicum
## 3       $100 to $500 million (USD)         80         90         85  Clearwater
## 4 $500 million to $1 billion (USD)         56         97       76.5    Richland
## 5         Unknown / Non-Applicable         86        143      114.5    New York
## 6           $1 to $2 billion (USD)         71        119         95      Dallas
##   state country source same_state python_yn r_yn spark aws excel
## 1    NM      US   <NA>          0         1    0     0   0     1
## 2    MD      US   <NA>          0         1    0     0   0     0
## 3    FL      US   <NA>          1         1    0     1   0     1
## 4    WA      US   <NA>          1         1    0     0   0     0
## 5    NY      US   <NA>          1         1    0     0   0     1
## 6    TX      US   <NA>          1         1    0     0   1     1

This code converts the salary columns and the 0/1 indicator columns (same_state and the skill flags) to numeric using parse_number().

glassdoor_data2 <- glassdoor_data2 |> 
  mutate(across(c(min_salary, 
                  max_salary, 
                  avg_salary,
                  same_state,
                  python_yn, 
                  r_yn, 
                  spark, 
                  aws, 
                  excel), 
                parse_number))
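
As a small illustration of the conversion above, the sketch below applies the same across() and parse_number() pattern to a toy data frame. The object names toy and toy_numeric and all of the values are made up for illustration; it assumes, as the use of parse_number() suggests, that the original columns are stored as text.

```r
library(dplyr)
library(readr)

# Hypothetical toy data: numbers stored as text, mimicking the columns converted above
toy <- data.frame(
  min_salary = c("53", "63"),
  avg_salary = c("72", "87.5"),
  python_yn  = c("1", "0"),
  stringsAsFactors = FALSE
)

# parse_number() takes a character vector and returns a numeric (double) vector
toy_numeric <- toy |>
  mutate(across(c(min_salary, avg_salary, python_yn), parse_number))

# Check that every column is now numeric
sapply(toy_numeric, is.numeric)
```

Running str() on the real data frame after the conversion is a quick way to confirm that the nine columns changed type as intended.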

Then unknown values in the revenue column ("-1" and "Unknown / Non-Applicable") are replaced with NA, R's missing-value marker.

glassdoor_data2 <- glassdoor_data2 |> 
  mutate(revenue = ifelse(revenue == "-1" | revenue == "Unknown / Non-Applicable", NA, revenue))
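
The replacement above can be checked on a small example. The sketch below is not the code used in this project: it runs on a hypothetical data frame (toy) and writes the same condition with %in% and dplyr's if_else(), which is one alternative to the ifelse() call above.

```r
library(dplyr)

# Hypothetical revenue values, including both "unknown" codes handled above
toy <- data.frame(
  revenue = c("$1 to $2 billion (USD)", "-1", "Unknown / Non-Applicable"),
  stringsAsFactors = FALSE
)

# The two codes that should be treated as missing
unknown_codes <- c("-1", "Unknown / Non-Applicable")

# Replace the unknown codes with a missing value and keep everything else
toy_clean <- toy |>
  mutate(revenue = if_else(revenue %in% unknown_codes, NA_character_, revenue))

# Count how many values are now missing
sum(is.na(toy_clean$revenue))
```

Because the result is a true missing value rather than the string "NA", functions such as is.na() will detect it.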

Three columns are given more readable names.

glassdoor_data2 <- glassdoor_data2 |> 
  rename("company_name" = "company_txt",
         "python" = "python_yn",
         "r_lang" = "r_yn")
head(glassdoor_data2)

##                   job_title                   size
## 1            Data Scientist  501 to 1000 employees
## 2 Healthcare Data Scientist       10000+ employees
## 3            Data Scientist  501 to 1000 employees
## 4            Data Scientist 1001 to 5000 employees
## 5            Data Scientist    51 to 200 employees
## 6            Data Scientist   201 to 500 employees
##                              company_name  type_of_ownership
## 1                     Tecolote Research\n  Company - Private
## 2 University of Maryland Medical System\n Other Organization
## 3                               KnowBe4\n  Company - Private
## 4                                  PNNL\n         Government
## 5                    Affinity Solutions\n  Company - Private
## 6                              CyrusOne\n   Company - Public
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            job_description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                    Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor‚Äôs degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor‚Äôs Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years‚Äô experience credit for Master‚Äôs degree; five years‚Äô experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. 
citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization‚Äôs data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n‚Ä¢ Develops predictive and prescriptive analytic models in support of the organization‚Äôs clinical, operations and business initiatives and priorities.\n‚Ä¢ Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n‚Ä¢ Supports and drives analytic efforts designed around organization‚Äôs strategic priorities and clinical/business problems\n‚Ä¢ Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n‚Ä¢ Builds and extends our analytics portfolio supported by robust documentation\n‚Ä¢ Works with autonomy to find solutions to complex problems using open source tools and in-house development\n‚Ä¢ Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n‚Ä¢ Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n‚Ä¢ Develops and maintains relationships with business, IT and clinical leaders and stakeholders 
across the enterprise to facilitate collaboration and effective communication\n‚Ä¢ Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n‚Ä¢ Performs analysis to evaluate primary and secondary objectives from such pilots\n‚Ä¢ Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n‚Ä¢ Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n‚Ä¢ Master‚Äôs or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n‚Ä¢ 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n‚Ä¢ Strong development skills in two or more of the following: C/C++, C#, Python, Java\n‚Ä¢ Combining analytic methods with advanced data visualizations\n‚Ä¢ Expert ability to breakdown and clearly define problems\n‚Ä¢ Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n‚Ä¢ Proven communications skills ‚Äì Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n‚Ä¢ Proven analytical skills ‚Äì Able to compare, contrast, and validate work with keen attention to detail. 
Skilled in working with ‚Äúreal world‚Äù data including scrubbing, transformation, and imputation.\n‚Ä¢ Proven problem solving skills ‚Äì Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n‚Ä¢ Design and prototype new application functionality for our products.\n‚Ä¢ Change oriented ‚Äì actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n‚Ä¢ Effective communicator and change agent\n‚Ä¢ Ability to prioritize the tasks of the project timeline to achieve the desired results\n‚Ä¢ Strong analytic and problem solving skills\n‚Ä¢ Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Na√Øve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply 
basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate
##                            revenue min_salary max_salary avg_salary        city
## 1        $50 to $100 million (USD)         53         91       72.0 Albuquerque
## 2           $2 to $5 billion (USD)         63        112       87.5   Linthicum
## 3       $100 to $500 million (USD)         80         90       85.0  Clearwater
## 4 $500 million to $1 billion (USD)         56         97       76.5    Richland
## 5                             <NA>         86        143      114.5    New York
## 6           $1 to $2 billion (USD)         71        119       95.0      Dallas
##   state country source same_state python r_lang spark aws excel
## 1    NM      US   <NA>          0      1      0     0   0     1
## 2    MD      US   <NA>          0      1      0     0   0     0
## 3    FL      US   <NA>          1      1      0     1   0     1
## 4    WA      US   <NA>          1      1      0     0   0     0
## 5    NY      US   <NA>          1      1      0     0   0     1
## 6    TX      US   <NA>          1      1      0     0   1     1

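Each posting is assigned a job type by matching keywords in its job title. Titles that match two domains receive a combined label (for example, "Engineer/Scientist"), and titles that match none of the keywords are left as NA.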

glassdoor_data2 <- glassdoor_data2 |>
  mutate(job_type = case_when(
    # case_when() keeps the first condition that matches, so the
    # combined labels are tested before the single-domain labels
    grepl("engineer", job_title, ignore.case = TRUE) &
      grepl("scientist", job_title, ignore.case = TRUE) ~ "Engineer/Scientist",
    grepl("engineer", job_title, ignore.case = TRUE) &
      grepl("analyst", job_title, ignore.case = TRUE) ~ "Engineer/Analyst",
    grepl("scientist", job_title, ignore.case = TRUE) &
      grepl("analyst", job_title, ignore.case = TRUE) ~ "Scientist/Analyst",
    # Single-domain labels
    grepl("engineer", job_title, ignore.case = TRUE) ~ "Engineer",
    grepl("analyst", job_title, ignore.case = TRUE) |
      grepl("analytics", job_title, ignore.case = TRUE) ~ "Analyst",
    grepl("scientist", job_title, ignore.case = TRUE) |
      grepl("science", job_title, ignore.case = TRUE) ~ "Scientist",
    # Titles with none of the keywords stay unclassified
    TRUE ~ NA_character_
  ))
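
The effect of the rule order can be checked on a small example. The sketch below is illustrative only: the titles are made up rather than taken from the Glassdoor data, and `has_kw()`, `toy_titles`, `toy_typed`, and `toy_counts` are hypothetical names introduced here. It applies the same keyword rules and then tallies the labels, including titles that matched nothing.

```r
library(dplyr)

# Hypothetical titles, made up for illustration (not from the dataset)
toy_titles <- data.frame(job_title = c(
  "Data Scientist",
  "Data Engineer/Scientist",
  "Data Science Engineer",
  "Director of Analytics",
  "Product Manager"
))

# Hypothetical helper: case-insensitive keyword test on a character vector
has_kw <- function(x, kw) grepl(kw, x, ignore.case = TRUE)

# Same rule order as the chunk above: combined labels first,
# then single-domain labels, then NA for anything unmatched
toy_typed <- toy_titles |>
  mutate(job_type = case_when(
    has_kw(job_title, "engineer") & has_kw(job_title, "scientist") ~ "Engineer/Scientist",
    has_kw(job_title, "engineer") & has_kw(job_title, "analyst")   ~ "Engineer/Analyst",
    has_kw(job_title, "scientist") & has_kw(job_title, "analyst")  ~ "Scientist/Analyst",
    has_kw(job_title, "engineer")                                  ~ "Engineer",
    has_kw(job_title, "analyst|analytics")                         ~ "Analyst",
    has_kw(job_title, "scientist|science")                         ~ "Scientist",
    TRUE                                                           ~ NA_character_
  ))

# Tally each label; unmatched titles appear in the NA row
toy_counts <- count(toy_typed, job_type, sort = TRUE)

print(toy_typed)
print(toy_counts)
```

In this toy set, "Data Science Engineer" is labeled "Engineer" rather than "Engineer/Scientist", because the combined rule looks for "scientist" and not "science". Running a similar `count()` on the real `job_type` column would be one way to see how many postings fall into each group and how many remain unclassified.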

head(glassdoor_data2)

##                   job_title                   size
## 1            Data Scientist  501 to 1000 employees
## 2 Healthcare Data Scientist       10000+ employees
## 3            Data Scientist  501 to 1000 employees
## 4            Data Scientist 1001 to 5000 employees
## 5            Data Scientist    51 to 200 employees
## 6            Data Scientist   201 to 500 employees
##                              company_name  type_of_ownership
## 1                     Tecolote Research\n  Company - Private
## 2 University of Maryland Medical System\n Other Organization
## 3                               KnowBe4\n  Company - Private
## 4                                  PNNL\n         Government
## 5                    Affinity Solutions\n  Company - Private
## 6                              CyrusOne\n   Company - Public
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            job_description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                    Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor‚Äôs degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor‚Äôs Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years‚Äô experience credit for Master‚Äôs degree; five years‚Äô experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. 
citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization‚Äôs data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders 
across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. 
Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply 
basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate
##                            revenue min_salary max_salary avg_salary        city
## 1        $50 to $100 million (USD)         53         91       72.0 Albuquerque
## 2           $2 to $5 billion (USD)         63        112       87.5   Linthicum
## 3       $100 to $500 million (USD)         80         90       85.0  Clearwater
## 4 $500 million to $1 billion (USD)         56         97       76.5    Richland
## 5                             <NA>         86        143      114.5    New York
## 6           $1 to $2 billion (USD)         71        119       95.0      Dallas
##   state country source same_state python r_lang spark aws excel  job_type
## 1    NM      US   <NA>          0      1      0     0   0     1 Scientist
## 2    MD      US   <NA>          0      1      0     0   0     0 Scientist
## 3    FL      US   <NA>          1      1      0     1   0     1 Scientist
## 4    WA      US   <NA>          1      1      0     0   0     0 Scientist
## 5    NY      US   <NA>          1      1      0     0   0     1 Scientist
## 6    TX      US   <NA>          1      1      0     0   1     1 Scientist

The following code converts the skill and tool indicator columns, along with same_state, from binary 0/1 coding to logical TRUE/FALSE values.

glassdoor_data2 <- glassdoor_data2 |> 
  mutate(
    same_state = as.logical(same_state),
    python = as.logical(python),
    r_lang = as.logical(r_lang),
    spark = as.logical(spark),
    aws = as.logical(aws),
    excel = as.logical(excel)
  )
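
As a minimal sketch of the same idea, the conversion can also be written once with across() from dplyr. The data frame toy_flags below is hypothetical and only stands in for the 0/1 indicator columns; it is not the project data.

```r
library(dplyr)

# Hypothetical toy data standing in for the 0/1 indicator columns
toy_flags <- data.frame(
  python = c(1, 0, 1),
  excel  = c(0, 0, 1)
)

# as.logical() maps 1 to TRUE and 0 to FALSE; across() applies it
# to every listed column in a single mutate() call
toy_logical <- toy_flags |>
  mutate(across(c(python, excel), as.logical))

toy_logical
```

Listing the columns explicitly, as the project code does, is more verbose but makes it obvious which columns are converted; across() is mainly a convenience when the list grows.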

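Salaries are recorded in thousands of dollars (for example, a min_salary of 53 means $53,000), so each salary column is multiplied by 1,000 to give its full value.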

glassdoor_data2 <- glassdoor_data2 |>
  mutate(
    min_salary = min_salary * 1000,
    max_salary = max_salary * 1000,
    avg_salary = avg_salary * 1000
  )
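
A quick way to sanity-check this kind of rescaling is to run it on a couple of known rows. The sketch below uses a hypothetical data frame, toy_salaries, filled with the salary values from the first two rows of the output shown earlier, and assumes (as those rows suggest) that the average is the midpoint of the minimum and maximum.

```r
library(dplyr)

# Hypothetical toy data: salary values (in thousands) copied from the
# first two rows of the printed output
toy_salaries <- data.frame(
  min_salary = c(53, 63),
  max_salary = c(91, 112),
  avg_salary = c(72.0, 87.5)
)

# Scale all three salary columns to full dollar amounts
toy_full <- toy_salaries |>
  mutate(across(c(min_salary, max_salary, avg_salary), \(x) x * 1000))

# For these rows the scaled average is still the midpoint of min and max
stopifnot(
  all(toy_full$avg_salary == (toy_full$min_salary + toy_full$max_salary) / 2)
)

toy_full
```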

Next, an avg_salary_range column is created by binning each job's average salary into $25,000 intervals. This allows analysis by salary band and not just by individual values.

glassdoor_data2 <- glassdoor_data2 |> 
  mutate(avg_salary_range = cut(avg_salary,
                                # The final break is Inf so the "250000+" bin is
                                # open-ended and no high salary is binned as NA
                                breaks = c(0, 
                                           25000, 
                                           50000, 
                                           75000, 
                                           100000, 
                                           125000, 
                                           150000, 
                                           175000, 
                                           200000, 
                                           225000, 
                                           250000, 
                                           Inf),
                                labels = c("0-25000", 
                                           "25000-50000", 
                                           "50000-75000", 
                                           "75000-100000", 
                                           "100000-125000", 
                                           "125000-150000", 
                                           "150000-175000", 
                                           "175000-200000", 
                                           "200000-225000", 
                                           "225000-250000", 
                                           "250000+"),
                                right = TRUE))
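
Because right = TRUE makes each interval closed on its upper edge, a salary that sits exactly on a break falls into the lower band, and any value outside the supplied breaks is returned as NA. The sketch below illustrates this with a reduced set of three bins and hypothetical salaries (toy_avg) chosen to sit on and around the edges; it is an illustration of how cut() behaves, not the project's full binning.

```r
# Hypothetical salaries chosen to sit on and around the bin edges
toy_avg <- c(72000, 75000, 75001, 114500, 300000)

# Reduced three-bin version of the scheme, using the same label style
toy_range <- cut(
  toy_avg,
  breaks = c(50000, 75000, 100000, 125000),
  labels = c("50000-75000", "75000-100000", "100000-125000"),
  right = TRUE
)

as.character(toy_range)
# → "50000-75000" "50000-75000" "75000-100000" "100000-125000" NA
```

Here 75000 lands in "50000-75000" rather than "75000-100000", and 300000 becomes NA because it lies above the last break.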

Columns are reordered to group related information and improve readability. Columns left out of the selection, such as source and same_state, are dropped.

glassdoor_data2 <- glassdoor_data2 |> 
  select(
         job_title, 
         job_type, 
         job_description,
         avg_salary_range, 
         avg_salary, 
         min_salary, 
         max_salary, 
         python,
         r_lang,
         spark,
         aws,
         excel,
         city, 
         state, 
         country,
         company_name,
         revenue,
         size,
         type_of_ownership,
         industry,
         sector
  )
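
Note that select() both reorders and filters: only the named columns survive. The sketch below contrasts it with relocate() from dplyr, which only moves columns, using a hypothetical one-row data frame (toy_jobs) rather than the project data.

```r
library(dplyr)

# Hypothetical one-row data frame with columns in an arbitrary order
toy_jobs <- data.frame(
  city       = "Dallas",
  source     = NA,
  job_title  = "Data Scientist",
  avg_salary = 95000
)

# select() keeps only the named columns, in the order given
selected <- toy_jobs |> select(job_title, avg_salary, city)

# relocate() moves the named columns to the front and keeps the rest
relocated <- toy_jobs |> relocate(job_title, avg_salary)

names(selected)
# → "job_title" "avg_salary" "city"
names(relocated)
# → "job_title" "avg_salary" "city" "source"
```

Using select() is a reasonable choice when unused columns should be removed in the same step; relocate() may be preferable when every column needs to be retained.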

head(glassdoor_data2)

##                   job_title  job_type
## 1            Data Scientist Scientist
## 2 Healthcare Data Scientist Scientist
## 3            Data Scientist Scientist
## 4            Data Scientist Scientist
## 5            Data Scientist Scientist
## 6            Data Scientist Scientist
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            job_description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                                                                                                                                                                                                                    Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor‚Äôs degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor‚Äôs Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years‚Äô experience credit for Master‚Äôs degree; five years‚Äô experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. 
citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization‚Äôs data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders 
across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. 
Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
                                                                                                                                                Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply 
basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##   avg_salary_range avg_salary min_salary max_salary python r_lang spark   aws
## 1      50000-75000      72000      53000      91000   TRUE  FALSE FALSE FALSE
## 2     75000-100000      87500      63000     112000   TRUE  FALSE FALSE FALSE
## 3     75000-100000      85000      80000      90000   TRUE  FALSE  TRUE FALSE
## 4     75000-100000      76500      56000      97000   TRUE  FALSE FALSE FALSE
## 5    100000-125000     114500      86000     143000   TRUE  FALSE FALSE FALSE
## 6     75000-100000      95000      71000     119000   TRUE  FALSE FALSE  TRUE
##   excel        city state country                            company_name
## 1  TRUE Albuquerque    NM      US                     Tecolote Research\n
## 2 FALSE   Linthicum    MD      US University of Maryland Medical System\n
## 3  TRUE  Clearwater    FL      US                               KnowBe4\n
## 4 FALSE    Richland    WA      US                                  PNNL\n
## 5  TRUE    New York    NY      US                    Affinity Solutions\n
## 6  TRUE      Dallas    TX      US                              CyrusOne\n
##                            revenue                   size  type_of_ownership
## 1        $50 to $100 million (USD)  501 to 1000 employees  Company - Private
## 2           $2 to $5 billion (USD)       10000+ employees Other Organization
## 3       $100 to $500 million (USD)  501 to 1000 employees  Company - Private
## 4 $500 million to $1 billion (USD) 1001 to 5000 employees         Government
## 5                             <NA>    51 to 200 employees  Company - Private
## 6           $1 to $2 billion (USD)   201 to 500 employees   Company - Public
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate

A lone job with an illegible title is removed.

glassdoor_data2 <- glassdoor_data2 |> filter(job_title != "sg nsjx nm/.;'" )

Placeholder values in company size ("Unknown" and "-1") are replaced with NA.

glassdoor_data2 <- glassdoor_data2 |> 
  mutate(size = na_if(size,"Unknown"),
         size = na_if(size, "-1"))
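
As a minimal sketch of what this replacement does, the toy data frame below runs the same two na_if() calls and counts the resulting missing values. The values in it are hypothetical and are not rows from the project data.

```r
library(dplyr)

# Hypothetical company-size values, including both placeholder codes
sizes <- data.frame(
  size = c("51 to 200 employees", "Unknown", "-1", "10000+ employees")
)

# Same pattern as above: each na_if() call turns one placeholder into NA
cleaned <- sizes |>
  mutate(size = na_if(size, "Unknown"),
         size = na_if(size, "-1"))

# Number of placeholders that became NA
n_missing <- sum(is.na(cleaned$size))
print(n_missing)
```

Real size values are left unchanged, so only the two placeholder rows are affected.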

Then factors are created for the columns whose values have an implied order that plain character strings do not capture: salary range, revenue, and company size. Again, this will allow for easier sorting, ordering, and displaying during analysis.

avg_salary_levels <- c("0-25000", "25000-50000", "50000-75000", 
                       "75000-100000", "100000-125000", "125000-150000", 
                       "150000-175000", "175000-200000", "200000-225000", 
                       "225000-250000", "250000+")

revenue_levels <- c("Less than $1 million (USD)", "$1 to $5 million (USD)", 
                    "$5 to $10 million (USD)", "$10 to $25 million (USD)",
                    "$25 to $50 million (USD)", "$50 to $100 million (USD)",
                    "$100 to $500 million (USD)", "$500 million to $1 billion (USD)",
                    "$1 to $2 billion (USD)", "$2 to $5 billion (USD)", "$5 to $10 billion (USD)",
                    "$10+ billion (USD)")

size_levels <- c("1 to 50 employees","51 to 200 employees","201 to 500 employees",
          "501 to 1000 employees","1001 to 5000 employees", "5001 to 10000 employees",
          "10000+ employees")

glassdoor_data2 <- glassdoor_data2 |> mutate(
  avg_salary_range = factor(avg_salary_range, levels = avg_salary_levels),
  revenue = factor(revenue, levels = revenue_levels),
  size = factor(size, levels = size_levels)
)
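
The effect of explicit levels can be sketched on a small vector. The example below uses base R only, with hypothetical revenue values rather than project data. It shows that sorting a factor follows the level order, and that a value which does not match a level exactly is silently converted to NA.

```r
# Toy revenue values in arbitrary order (hypothetical, not project data)
revenue <- c("$5 to $10 million (USD)",
             "Less than $1 million (USD)",
             "$1 to $5 million (USD)")

# Levels listed from smallest to largest define the sort order
demo_levels <- c("Less than $1 million (USD)",
                 "$1 to $5 million (USD)",
                 "$5 to $10 million (USD)")

revenue_fct <- factor(revenue, levels = demo_levels)

# Sorting a factor follows the level order, not alphabetical order
sorted_revenue <- as.character(sort(revenue_fct))

# A value that matches no level exactly (note the leading space) becomes NA
mismatched <- factor(" $1 to $5 million (USD)", levels = demo_levels)
n_unmatched <- sum(is.na(mismatched))

print(sorted_revenue)
print(n_unmatched)
```

Comparing the count of NA values in a column before and after the conversion is a quick way to confirm that every value matched one of its levels.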

head(glassdoor_data2)

##                   job_title  job_type
## 1            Data Scientist Scientist
## 2 Healthcare Data Scientist Scientist
## 3            Data Scientist Scientist
## 4            Data Scientist Scientist
## 5            Data Scientist Scientist
## 6            Data Scientist Scientist
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            job_description
## 1 Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. 
citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization’s data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders 
across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. 
Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5 Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply 
basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##   avg_salary_range avg_salary min_salary max_salary python r_lang spark   aws
## 1      50000-75000      72000      53000      91000   TRUE  FALSE FALSE FALSE
## 2     75000-100000      87500      63000     112000   TRUE  FALSE FALSE FALSE
## 3     75000-100000      85000      80000      90000   TRUE  FALSE  TRUE FALSE
## 4     75000-100000      76500      56000      97000   TRUE  FALSE FALSE FALSE
## 5    100000-125000     114500      86000     143000   TRUE  FALSE FALSE FALSE
## 6     75000-100000      95000      71000     119000   TRUE  FALSE FALSE  TRUE
##   excel        city state country                            company_name
## 1  TRUE Albuquerque    NM      US                     Tecolote Research\n
## 2 FALSE   Linthicum    MD      US University of Maryland Medical System\n
## 3  TRUE  Clearwater    FL      US                               KnowBe4\n
## 4 FALSE    Richland    WA      US                                  PNNL\n
## 5  TRUE    New York    NY      US                    Affinity Solutions\n
## 6  TRUE      Dallas    TX      US                              CyrusOne\n
##                            revenue                   size  type_of_ownership
## 1        $50 to $100 million (USD)  501 to 1000 employees  Company - Private
## 2           $2 to $5 billion (USD)       10000+ employees Other Organization
## 3       $100 to $500 million (USD)  501 to 1000 employees  Company - Private
## 4 $500 million to $1 billion (USD) 1001 to 5000 employees         Government
## 5                             <NA>    51 to 200 employees  Company - Private
## 6           $1 to $2 billion (USD)   201 to 500 employees   Company - Public
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate

Additional placeholder values ("-1" and "Unknown") in the type_of_ownership, industry, and sector columns are replaced with NA.

glassdoor_data3 <- glassdoor_data2 |> mutate(
    type_of_ownership = na_if(type_of_ownership,"-1"),
    type_of_ownership = na_if(type_of_ownership, "Unknown"),
    industry = na_if(industry,"-1"),
    industry = na_if(industry, "Unknown"),
    sector = na_if(sector,"-1"),
    sector = na_if(sector, "Unknown"))
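
The same replacement can be written more compactly with across(). The following is only a minimal sketch on a toy data frame, not the code used in this project: the data frame toy, its values, and the helper to_na() are all hypothetical and exist purely for illustration.

```r
library(dplyr)

# Toy stand-in for glassdoor_data2; the values are invented for illustration.
toy <- data.frame(
  type_of_ownership = c("Company - Private", "-1", "Unknown"),
  industry          = c("Energy", "Unknown", "-1"),
  sector            = c("-1", "Health Care", "Unknown"),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: any value listed in `placeholders` becomes NA.
placeholders <- c("-1", "Unknown")
to_na <- function(x) replace(x, x %in% placeholders, NA)

# Apply the helper to all three columns in a single mutate() call.
toy_clean <- toy |>
  mutate(across(c(type_of_ownership, industry, sector), to_na))

toy_clean
```

Listing the placeholders in one vector means a newly discovered placeholder value only has to be added in one place, rather than as one extra na_if() line per column.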

head(glassdoor_data3)

##                   job_title  job_type
## 1            Data Scientist Scientist
## 2 Healthcare Data Scientist Scientist
## 3            Data Scientist Scientist
## 4            Data Scientist Scientist
## 5            Data Scientist Scientist
## 6            Data Scientist Scientist
##   job_description
## 1 Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. 
citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization’s data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders 
across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. 
Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3 KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. 
Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. 
Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. 
Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle 
Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5 Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply 
basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6 CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. 
This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability 
with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
##   avg_salary_range avg_salary min_salary max_salary python r_lang spark   aws
## 1      50000-75000      72000      53000      91000   TRUE  FALSE FALSE FALSE
## 2     75000-100000      87500      63000     112000   TRUE  FALSE FALSE FALSE
## 3     75000-100000      85000      80000      90000   TRUE  FALSE  TRUE FALSE
## 4     75000-100000      76500      56000      97000   TRUE  FALSE FALSE FALSE
## 5    100000-125000     114500      86000     143000   TRUE  FALSE FALSE FALSE
## 6     75000-100000      95000      71000     119000   TRUE  FALSE FALSE  TRUE
##   excel        city state country                            company_name
## 1  TRUE Albuquerque    NM      US                     Tecolote Research\n
## 2 FALSE   Linthicum    MD      US University of Maryland Medical System\n
## 3  TRUE  Clearwater    FL      US                               KnowBe4\n
## 4 FALSE    Richland    WA      US                                  PNNL\n
## 5  TRUE    New York    NY      US                    Affinity Solutions\n
## 6  TRUE      Dallas    TX      US                              CyrusOne\n
##                            revenue                   size  type_of_ownership
## 1        $50 to $100 million (USD)  501 to 1000 employees  Company - Private
## 2           $2 to $5 billion (USD)       10000+ employees Other Organization
## 3       $100 to $500 million (USD)  501 to 1000 employees  Company - Private
## 4 $500 million to $1 billion (USD) 1001 to 5000 employees         Government
## 5                             <NA>    51 to 200 employees  Company - Private
## 6           $1 to $2 billion (USD)   201 to 500 employees   Company - Public
##                           industry                       sector
## 1              Aerospace & Defense          Aerospace & Defense
## 2 Health Care Services & Hospitals                  Health Care
## 3                Security Services            Business Services
## 4                           Energy Oil, Gas, Energy & Utilities
## 5          Advertising & Marketing            Business Services
## 6                      Real Estate                  Real Estate

Now we put the data in a tidy format. Originally, each skill or tool was stored as its own column, so the values of a variable were being used as column headers. Pivoting these columns longer moves the skill names into a single skill column, with a companion required column indicating whether the posting asks for that skill. This makes the skills and tools much easier to analyze.

skills <- c("python", "r_lang", "spark", "aws", "excel")


tidy_glassdoor <- glassdoor_data3 |>
  pivot_longer(
    cols = all_of(skills),
    names_to = "skill",
    values_to = "required"
  )

head(tidy_glassdoor)

## # A tibble: 6 × 18
##   job_title      job_type job_description avg_salary_range avg_salary min_salary
##   <chr>          <chr>    <chr>           <fct>                 <dbl>      <dbl>
## 1 Data Scientist Scienti… "Data Scientis… 50000-75000           72000      53000
## 2 Data Scientist Scienti… "Data Scientis… 50000-75000           72000      53000
## 3 Data Scientist Scienti… "Data Scientis… 50000-75000           72000      53000
## 4 Data Scientist Scienti… "Data Scientis… 50000-75000           72000      53000
## 5 Data Scientist Scienti… "Data Scientis… 50000-75000           72000      53000
## 6 Healthcare Da… Scienti… "What You Will… 75000-100000          87500      63000
## # ℹ 12 more variables: max_salary <dbl>, city <chr>, state <chr>,
## #   country <chr>, company_name <chr>, revenue <fct>, size <fct>,
## #   type_of_ownership <chr>, industry <chr>, sector <chr>, skill <chr>,
## #   required <lgl>
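To make the reshaping concrete, here is a minimal sketch on a hypothetical two-row table (toy_jobs and its values are made up for illustration and are not part of the Glassdoor data). It shows how one logical column per skill becomes one row per posting and skill.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two postings, one logical column per skill
toy_jobs <- tibble(
  job_title = c("Data Scientist", "Data Engineer"),
  python    = c(TRUE, TRUE),
  spark     = c(FALSE, TRUE)
)

# Move the skill columns into a skill/required pair of columns
toy_tidy <- pivot_longer(
  toy_jobs,
  cols      = c(python, spark),
  names_to  = "skill",
  values_to = "required"
)

print(toy_tidy)
```

Each posting now appears once per skill, so any later count of postings has to de-duplicate first.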

Data Analysis, Exploration, and Visualization

Total Jobs

First, let's see how many jobs are included in the dataset.

tot_jobs <- tidy_glassdoor |>
  distinct(job_title, job_description) |>
  nrow()

cat("There are", tot_jobs, "distinct jobs included in the dataset\n")

## There are 463 distinct jobs included in the dataset
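The distinct() step matters because the tidy table repeats every posting once per skill. A small sketch with hypothetical data (toy_tidy is invented for illustration) shows the difference between the raw row count and the number of distinct postings.

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: 2 postings x 2 skills = 4 rows
toy_tidy <- tibble(
  job_title       = rep(c("Data Scientist", "Data Engineer"), each = 2),
  job_description = rep(c("desc A", "desc B"), each = 2),
  skill           = rep(c("python", "spark"), times = 2),
  required        = c(TRUE, FALSE, TRUE, TRUE)
)

# Counting rows directly counts each posting once per skill
raw_rows <- nrow(toy_tidy)

# De-duplicating on the identifying columns recovers the posting count
toy_job_count <- toy_tidy |>
  distinct(job_title, job_description) |>
  nrow()

cat("Rows:", raw_rows, "- distinct postings:", toy_job_count, "\n")
```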

Jobs By Type

Next, let's see how many jobs there are per job type.

jobs_by_type <- tidy_glassdoor |>
  filter(!is.na(job_type), !is.na(avg_salary)) |>
  distinct(job_title, job_description, job_type, avg_salary) |>
  group_by(job_type) |>
  summarise(job_count = n(),
            perc_of_total = round((job_count / tot_jobs),2),
            avg_salary_type = mean(avg_salary)) |>
  arrange(desc(job_count))

print(jobs_by_type)

## # A tibble: 6 × 4
##   job_type           job_count perc_of_total avg_salary_type
##   <chr>                  <int>         <dbl>           <dbl>
## 1 Scientist                266          0.57         109912.
## 2 Engineer                  88          0.19         103273.
## 3 Analyst                   85          0.18          73829.
## 4 Engineer/Scientist        13          0.03          94385.
## 5 Engineer/Analyst           2          0             60500 
## 6 Scientist/Analyst          2          0             97000

We can better see this through a visualization:

ggplot(jobs_by_type, aes(x = reorder(job_type, -job_count), y = job_count,fill = job_type)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Number of Jobs by Job Type",
    x = "Job Type",
    y = "Number of Jobs"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observation:

The dataset contains more postings for the Scientist job type than for any other type. We should keep this in mind when comparing average salaries: the more postings a job type has, the more robust its average is and the less susceptible it is to outliers. Types with only a handful of postings, such as Engineer/Analyst and Scientist/Analyst with 2 each, should be read with caution.

It may also be an indication of demand for the job type in the market.
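One detail worth noting: the job counts in the table sum to 456, while perc_of_total divides by tot_jobs (463), so the shares do not add up to exactly 1. The sketch below re-enters the counts from the table by hand and computes the share against the rows that actually appear in it. The column name share_of_listed is an assumption introduced here for illustration.

```r
library(dplyr)
library(tibble)

# Counts copied from the jobs_by_type output above
type_counts <- tibble(
  job_type  = c("Scientist", "Engineer", "Analyst",
                "Engineer/Scientist", "Engineer/Analyst", "Scientist/Analyst"),
  job_count = c(266, 88, 85, 13, 2, 2)
)

# Share relative to the postings present in the table (456), not tot_jobs (463)
type_shares <- type_counts |>
  mutate(share_of_listed = round(job_count / sum(job_count), 3))

print(type_shares)
```

The difference is small here, but using the same filtered set for the numerator and denominator keeps the percentages internally consistent.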

We can explore the average salary by job type next.

ggplot(jobs_by_type, aes(x = reorder(job_type, -avg_salary_type), y = avg_salary_type, fill = job_type)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Avg Salary by Job Type",
    x = "Job Type",
    y = "Average Salary"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

avg_sal_dist <- glassdoor_data3 |>
  filter(!is.na(job_type), !is.na(avg_salary))


ggplot(avg_sal_dist, aes(x = job_type, y = avg_salary,fill = job_type)) +
  geom_boxplot() +
  labs(
    title = "Salary Distribution by Job Type",
    x = "Job Type",
    y = "Average Salary"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observation:

We see that the Scientist category has a higher average and median salary than any other job type (both grouped and not grouped).

Of the job types that are not grouped (Analyst vs Engineer vs Scientist), we see that Scientist has the highest average and median salary, while Analyst has the lowest average and median salary.

Of the job types that are grouped (Engineer/Analyst vs Engineer/Scientist vs Scientist/Analyst), we see that Scientist/Analyst has the highest average and median salary, while Engineer/Analyst has the lowest average and median salary.
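The medians above are read off the boxplot. If exact values are wanted, the mean and median can be computed side by side. This is a minimal sketch on hypothetical salaries (toy_salaries is invented for illustration), not the project's actual figures.

```r
library(dplyr)
library(tibble)

# Hypothetical salaries for two job types
toy_salaries <- tibble(
  job_type   = c("Scientist", "Scientist", "Scientist",
                 "Analyst", "Analyst", "Analyst"),
  avg_salary = c(90000, 110000, 160000, 60000, 70000, 74000)
)

# Mean and median per job type, plus the group size for context
toy_summary <- toy_salaries |>
  group_by(job_type) |>
  summarise(
    mean_salary   = mean(avg_salary),
    median_salary = median(avg_salary),
    n             = n(),
    .groups       = "drop"
  )

print(toy_summary)
```

Reporting the group size next to each statistic makes it easier to spot averages that rest on very few postings.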

Jobs By Location

What kind of insights can we draw from looking at job locations?

jobs_by_loc <- tidy_glassdoor |>
  filter(!is.na(job_type), !is.na(avg_salary), !is.na(state)) |>
  distinct(job_title, job_description, job_type,state, avg_salary) |>
  group_by(state) |>
  summarise(job_count_loc = n(),
            perc_of_total = round((job_count_loc / tot_jobs),2),
            avg_salary_loc = mean(avg_salary)) |>
  arrange(desc(avg_salary_loc))

print(jobs_by_loc)

## # A tibble: 37 × 4
##    state job_count_loc perc_of_total avg_salary_loc
##    <chr>         <int>         <dbl>          <dbl>
##  1 CA               97          0.21        124521.
##  2 IL               22          0.05        112295.
##  3 DC                9          0.02        110167.
##  4 MA               57          0.12        104623.
##  5 MI                4          0.01        104500 
##  6 NJ               13          0.03        102346.
##  7 NY               46          0.1         100533.
##  8 RI                1          0           100000 
##  9 NC               11          0.02         99182.
## 10 MD               21          0.05         98595.
## # ℹ 27 more rows

Let's look at the 5 states with the most job postings.

top_5_job_count <- jobs_by_loc |>
  arrange(desc(job_count_loc)) |>
  slice_head(n = 5)

print(top_5_job_count)

## # A tibble: 5 × 4
##   state job_count_loc perc_of_total avg_salary_loc
##   <chr>         <int>         <dbl>          <dbl>
## 1 CA               97          0.21        124521.
## 2 MA               57          0.12        104623.
## 3 NY               46          0.1         100533.
## 4 VA               30          0.06         97667.
## 5 IL               22          0.05        112295.

What about the 5 states with the highest average salary for data-related jobs?

top_5_job_sal <- jobs_by_loc |>
  arrange(desc(avg_salary_loc)) |>
  slice_head(n = 5)

print(top_5_job_sal)

## # A tibble: 5 × 4
##   state job_count_loc perc_of_total avg_salary_loc
##   <chr>         <int>         <dbl>          <dbl>
## 1 CA               97          0.21        124521.
## 2 IL               22          0.05        112295.
## 3 DC                9          0.02        110167.
## 4 MA               57          0.12        104623.
## 5 MI                4          0.01        104500

We can also explore the states with the fewest job postings and the lowest average salaries.

bot_5_job_count <- jobs_by_loc |>
  arrange((job_count_loc)) |>
  slice_head(n = 5)

print(bot_5_job_count)

## # A tibble: 5 × 4
##   state job_count_loc perc_of_total avg_salary_loc
##   <chr>         <int>         <dbl>          <dbl>
## 1 RI                1             0         100000
## 2 KS                1             0          87000
## 3 SC                1             0          60500
## 4 MN                2             0          85500
## 5 NM                2             0          73750

bot_5_job_sal <- jobs_by_loc |>
  arrange((avg_salary_loc)) |>
  slice_head(n = 5)

print(bot_5_job_sal)

## # A tibble: 5 × 4
##   state job_count_loc perc_of_total avg_salary_loc
##   <chr>         <int>         <dbl>          <dbl>
## 1 DE                2          0            27500 
## 2 LA                3          0.01         46167.
## 3 NE                3          0.01         46333.
## 4 ID                2          0            56250 
## 5 AL                5          0.01         58800

What if we group the states into regions and look at the trends by region?

# Using built in state and region mappings
state_region <- data.frame(
  state = state.abb,
  region = state.region
)

# DC is not included in the built in regions so this has to be put in 
# manually since it is in our DF
dc <- data.frame(state = "DC", region = "South")
state_region <- bind_rows(state_region, dc)

# Join in the region 
glassdoor_region <- tidy_glassdoor |>
  left_join(state_region, by = c("state" = "state"))


jobs_by_region <- glassdoor_region |>
  filter(!is.na(job_type), !is.na(avg_salary), !is.na(region)) |>
  distinct(job_title, job_description, job_type,region, avg_salary) |>
  group_by(region) |>
  summarise(job_count_reg = n(),
            perc_of_total = round((job_count_reg / tot_jobs),2),
            avg_salary_reg = mean(avg_salary)) |>
  arrange(desc(avg_salary_reg))

print(jobs_by_region)

## # A tibble: 4 × 4
##   region        job_count_reg perc_of_total avg_salary_reg
##   <chr>                 <int>         <dbl>          <dbl>
## 1 West                    137          0.3         112869.
## 2 Northeast               137          0.3         101062.
## 3 North Central            61          0.13         94426.
## 4 South                   121          0.26         91517.
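Because the region lookup is built from R's state.abb and state.region plus a manual DC row, any other code that is not one of the 50 states would silently get an NA region and be dropped by the filter. A minimal sketch of how one might check for that, using a hypothetical set of postings (toy_postings and the "PR" code are made up for illustration):

```r
library(dplyr)

# Same lookup idea as above; region stored as character for simplicity
state_region <- data.frame(
  state  = state.abb,
  region = as.character(state.region)
)
state_region <- bind_rows(state_region,
                          data.frame(state = "DC", region = "South"))

# Hypothetical postings, including one code with no mapping
toy_postings <- data.frame(state = c("CA", "DC", "NY", "PR"))

# Rows that would end up with an NA region after the left join
unmatched <- anti_join(toy_postings, state_region, by = "state")
print(unmatched)

# The join itself, for comparison
toy_joined <- left_join(toy_postings, state_region, by = "state")
print(toy_joined)
```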

Observation:

We see that most of the job postings are in the West and Northeast regions. The fewest are in the North Central region.

Of the two regions with the most jobs, the West region has the higher average salary.

Interestingly, the 5 states with the most job postings are CA, MA, NY, VA, and IL, which may indicate strong demand for these types of jobs in those states. However, if we look at the 5 states with the highest average salaries (CA, IL, DC, MA, and MI), NY and VA drop off the list. So even though there may be many opportunities in NY and VA, average salaries there may not be as competitive.

jobs_by_reg_type <- glassdoor_region |>
  filter(!is.na(job_type), !is.na(avg_salary), !is.na(region)) |>
  distinct(job_title, job_description, job_type,region, avg_salary) |>
  group_by(region,job_type) |>
  summarise(job_count_reg = n(),
            perc_of_total = round((job_count_reg / tot_jobs),2),
            avg_salary_reg = mean(avg_salary)) |>
  arrange(desc(avg_salary_reg))

## `summarise()` has grouped output by 'region'. You can override using the
## `.groups` argument.

print(jobs_by_reg_type)

## # A tibble: 20 × 5
## # Groups:   region [4]
##    region        job_type           job_count_reg perc_of_total avg_salary_reg
##    <chr>         <chr>                      <int>         <dbl>          <dbl>
##  1 West          Engineer/Scientist             2          0           128750 
##  2 West          Scientist                     81          0.17        123204.
##  3 West          Scientist/Analyst              1          0           121500 
##  4 West          Engineer                      28          0.06        114179.
##  5 North Central Engineer/Scientist             3          0.01        111500 
##  6 Northeast     Scientist                     89          0.19        110320.
##  7 Northeast     Engineer                      13          0.03        103346.
##  8 North Central Scientist                     29          0.06        103121.
##  9 South         Engineer                      34          0.07         97235.
## 10 South         Scientist                     67          0.14         96239.
## 11 North Central Engineer                      13          0.03         95500 
## 12 South         Engineer/Scientist             3          0.01         86667.
## 13 Northeast     Analyst                       30          0.06         76950 
## 14 West          Analyst                       25          0.05         76300 
## 15 North Central Analyst                       15          0.03         75667.
## 16 Northeast     Engineer/Scientist             5          0.01         75000 
## 17 South         Scientist/Analyst              1          0            72500 
## 18 South         Engineer/Analyst               1          0            62500 
## 19 South         Analyst                       15          0.03         61633.
## 20 North Central Engineer/Analyst               1          0            58500
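The summarise() message and the "Groups: region [4]" line in the output indicate that the result is still grouped by region. One way to avoid this, sketched here on hypothetical data (toy_region_jobs is invented for illustration), is to pass .groups = "drop" so the summary comes back ungrouped.

```r
library(dplyr)
library(tibble)

# Hypothetical postings with a region and a job type
toy_region_jobs <- tibble(
  region     = c("West", "West", "South"),
  job_type   = c("Scientist", "Scientist", "Analyst"),
  avg_salary = c(120000, 130000, 60000)
)

# .groups = "drop" returns an ungrouped tibble and silences the message
toy_by_region_type <- toy_region_jobs |>
  group_by(region, job_type) |>
  summarise(
    job_count      = n(),
    avg_salary_reg = mean(avg_salary),
    .groups        = "drop"
  ) |>
  arrange(desc(avg_salary_reg))

print(toy_by_region_type)
```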

ggplot(jobs_by_reg_type, aes(x = reorder(job_type, -avg_salary_reg), y = avg_salary_reg, fill = job_type)) +
  geom_bar(stat = "identity") +
  facet_wrap(~region) +
  labs(
    title = "Avg Salary by Job Type",
    x = "Job Type",
    y = "Average Salary"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Observation:

Scientists have the highest average salary in the Northeast region.

Engineer/Scientist, followed by Scientist, has the highest average salary in the West region.

Scientist and Engineer are very close in average salary in the South region.

Jobs By Skill Requirement

Count of skills in the job listings.

skills_summary <- tidy_glassdoor |>
  filter(!is.na(job_type), !is.na(skill)) |>
  distinct(job_title, job_description, job_type,skill,required, avg_salary) |>
  group_by(skill) |>
  summarise(jobs_req = sum(required == TRUE, na.rm = TRUE))|>
  arrange(desc(jobs_req))

print(skills_summary)

## # A tibble: 5 × 2
##   skill  jobs_req
##   <chr>     <int>
## 1 python      255
## 2 excel       242
## 3 aws         106
## 4 spark       106
## 5 r_lang        2

Skills by job type.

skills_summary_wide <- tidy_glassdoor |>
  filter(!is.na(job_type), !is.na(skill)) |>
  distinct(job_title, job_description, job_type, skill, required, avg_salary) |>
  group_by(skill, job_type) |>
  summarise(jobs_req = sum(required == TRUE, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = job_type, values_from = jobs_req, values_fill = 0) |>
  arrange(skill)

print(skills_summary_wide)

## # A tibble: 5 × 7
##   skill  Analyst Engineer `Engineer/Analyst` `Engineer/Scientist` Scientist
##   <chr>    <int>    <int>              <int>                <int>     <int>
## 1 aws          8       40                  0                    3        55
## 2 excel       60       42                  2                    9       128
## 3 python      32       59                  1                    9       154
## 4 r_lang       2        0                  0                    0         0
## 5 spark        4       46                  1                    6        49
## # ℹ 1 more variable: `Scientist/Analyst` <int>

A count of the number of skills listed per job. The majority of jobs in the data set list only 1 or 2 skills.

tidy_glassdoor |> 
  distinct() |> 
  filter(required == TRUE) |> 
  group_by(job_description) |> 
  count(skill, required) |> 
  summarize(
    number_of_skills_per_job = n_distinct(skill)
  ) |> count(number_of_skills_per_job)

## # A tibble: 4 × 2
##   number_of_skills_per_job     n
##                      <int> <int>
## 1                        1   160
## 2                        2   143
## 3                        3    64
## 4                        4    20
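The counts in the table above sum to 387, fewer than the 463 distinct jobs, which is consistent with postings that require none of the five skills being removed by the required == TRUE filter. The sketch below reproduces the same kind of distribution in fewer steps on hypothetical data (toy_skills, n_skills, and n_jobs are names introduced here for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical tidy data: job C requires none of the tracked skills
toy_skills <- tibble(
  job_description = c("A", "A", "B", "B", "C", "C"),
  skill           = rep(c("python", "excel"), times = 3),
  required        = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Skills per job, then how many jobs have each skill count
toy_skill_dist <- toy_skills |>
  filter(required) |>
  distinct(job_description, skill) |>
  count(job_description, name = "n_skills") |>
  count(n_skills, name = "n_jobs")

print(toy_skill_dist)
```

Job C never appears in the result, so a zero-skill row would have to be added separately if it is needed.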

The number of jobs requiring each skill, broken out by job type.

# Prepare data: Count jobs by skill + job type
skills_by_type <- tidy_glassdoor |>
  filter(!is.na(job_type), !is.na(skill), required == TRUE) |>
  distinct(job_title, job_description, job_type, skill) |>
  group_by(job_type, skill) |>
  summarise(job_count = n(), .groups = "drop")


ggplot(skills_by_type, aes(x = reorder(skill, -job_count), y = job_count, fill = skill)) +
  geom_bar(stat = "identity") +
  facet_wrap(~job_type) +
  labs(
    title = "Number of Jobs Requiring Each Skill by Job Type",
    x = "Skill",
    y = "Number of Jobs"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"
  )

Observation:

Generally speaking, we see that the most frequent skills required are Python and Excel.

For engineers, we can see that Python and Spark are the most frequently required skills.

avg_salary_by_skill <- tidy_glassdoor |>
  filter(required == TRUE, !is.na(avg_salary), !is.na(skill)) |>
  group_by(skill) |>
  summarise(avg_salary = round(mean(avg_salary, na.rm = TRUE), 2)) |>
  arrange(desc(avg_salary))

ggplot(avg_salary_by_skill, aes(x = reorder(skill, avg_salary), y = avg_salary, fill = skill)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Salary by Skill (Across All Jobs)",
    x = "Skill",
    y = "Average Salary"
  ) +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.position = "none")

Analysis and Observations Directly from Database

The following code converts the skill columns in the cleaned data set from logical to integer and writes the result to the database as a clean table.

glassdoor_data3$python <- as.integer(glassdoor_data3$python)
glassdoor_data3$r_lang <- as.integer(glassdoor_data3$r_lang)
glassdoor_data3$spark <- as.integer(glassdoor_data3$spark)
glassdoor_data3$aws <- as.integer(glassdoor_data3$aws)
glassdoor_data3$excel <- as.integer(glassdoor_data3$excel)

dbWriteTable(
  con,
  name = "glassdoor_clean",    # this will be the new table name in MySQL
  value = glassdoor_data3,
  row.names = FALSE,
  overwrite = TRUE             # change to FALSE if you want to append
)

dbListTables(con)
dbReadTable(con, "glassdoor_clean")
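The five column conversions can also be written in one step. This is a minimal sketch on a hypothetical table (toy_flags is invented for illustration) using across() with the same kind of skills vector. It does not touch the database.

```r
library(dplyr)
library(tibble)

# Names of the logical skill columns to convert (subset used for the toy data)
toy_skill_cols <- c("python", "excel")

# Hypothetical data with logical skill flags, including a missing value
toy_flags <- tibble(
  job_title = c("Data Scientist", "Data Analyst"),
  python    = c(TRUE, FALSE),
  excel     = c(FALSE, NA)
)

# Convert every listed column from logical to integer (TRUE -> 1, FALSE -> 0)
toy_flags_int <- toy_flags |>
  mutate(across(all_of(toy_skill_cols), as.integer))

print(toy_flags_int)
```

Missing values stay missing after the conversion, so they would be written to the database as NULL rather than 0.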

This query counts the most frequently referenced soft and hard skills directly from the database.

top_skills <- dbGetQuery(con, "
  SELECT s.skill_name, COUNT(*) AS count
  FROM job_skills js
  JOIN skills s ON js.skill_id = s.skill_id
  GROUP BY s.skill_name
  ORDER BY count DESC
  limit 5

")

print (top_skills)

##      skill_name count
## 1      Analytic   333
## 2 Communication   261
## 3        Python   257
## 4           SQL   251
## 5   Engineering   222

The following code finds the 5 industries with the most job postings and lists the most frequently referenced skill in each of them.

top_skill_top5_industries <- dbGetQuery(con, "
  -- First, find the top 5 industries by number of job postings
  WITH top_industries AS (
    SELECT i.industry_id, i.industry_name, COUNT(*) AS posting_count
    FROM job_postings jp
    JOIN industry i ON jp.industry_id = i.industry_id
    GROUP BY i.industry_id, i.industry_name
    ORDER BY posting_count DESC
    LIMIT 5
  ),
  skill_counts AS (
    SELECT 
      i.industry_name, 
      s.skill_name, 
      COUNT(*) AS count
    FROM job_postings jp
    JOIN industry i ON jp.industry_id = i.industry_id
    JOIN job_skills js ON jp.job_id = js.job_id
    JOIN skills s ON js.skill_id = s.skill_id
    WHERE i.industry_id IN (SELECT industry_id FROM top_industries)
    GROUP BY i.industry_name, s.skill_name
  )
  SELECT sc.industry_name, sc.skill_name, sc.count
  FROM skill_counts sc
  JOIN (
    SELECT industry_name, MAX(count) AS max_count
    FROM skill_counts
    GROUP BY industry_name
  ) maxed ON sc.industry_name = maxed.industry_name AND sc.count = maxed.max_count
  ORDER BY sc.industry_name;
")


print(top_skill_top5_industries)

##                             industry_name    skill_name count
## 1               Biotech & Pharmaceuticals Communication    48
## 2            Computer Hardware & Software      Analytic    30
## 3 Enterprise Software & Network Solutions      Analytic    22
## 4                      Insurance Carriers      Analytic    31
## 5                             IT Services      Analytic    29
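The final step of the query (keep the skill with the maximum count per industry) has a close dplyr analogue. The sketch below assumes the per-industry skill counts have already been pulled into R. The table toy_skill_counts and its values are hypothetical. Like the MAX() join in the SQL, slice_max() with with_ties = TRUE keeps every skill that shares the top count.

```r
library(dplyr)
library(tibble)

# Hypothetical skill counts per industry; industry B has a tie
toy_skill_counts <- tibble(
  industry_name = c("A", "A", "B", "B"),
  skill_name    = c("Python", "SQL", "Python", "SQL"),
  skill_count   = c(10, 7, 4, 4)
)

# Top skill(s) per industry, keeping ties
toy_top_skill <- toy_skill_counts |>
  group_by(industry_name) |>
  slice_max(skill_count, n = 1, with_ties = TRUE) |>
  ungroup() |>
  arrange(industry_name)

print(toy_top_skill)
```

With a tie, an industry contributes more than one row, which is worth remembering when reading a result that is expected to have exactly five rows.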

A Brief Summation

Through the acquisition, normalization, transformation, and analysis of this dataset, several key insights emerge.

Among the three domains (data science, data engineering, and data analytics), data science is the most in demand, with 57% of job listings seeking a data scientist. It also offers the highest average salary, at roughly $109,912.

Geographically, data jobs are most concentrated in the West and Northeast, with California, Massachusetts, and New York having the most listings. However, the highest-paying states on average are California, Illinois, and Washington, D.C.

Regarding skills and tools, Python is the most sought-after, followed by Excel, while R is rarely requested.

This data set holds even more potential insights, offering valuable indicators of emerging trends in the data job market.

data_science_607_project_3

Cindy Lin, Samuel Crummett, Maxfield Raynolds, William Forero

2025-03-20
