For this story, I used the Data Scientist Salary dataset from Kaggle: https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor/data
This dataset was built by web scraping "Data Scientist" job postings from www.glassdoor.com in the USA using Selenium. I selected it because state-level salary data for data science jobs in the US is otherwise hard to find.
The data includes fields for Job Title, Salary Estimate, Job Description, Rating, Company Name, Location, Headquarters, Company Size, Founded Date, Type of Ownership, Industry, Sector, Revenue, and Competitors.
I loaded the raw data from my GitHub repository, then explored it by checking for missing and duplicate values and printing summary statistics.
url <- "https://raw.githubusercontent.com/pujaroy280/DATA608Story4/main/data_cleaned_2021.csv"
df_salaries <- read.csv(url)
head(df_salaries)
## index Job.Title Salary.Estimate
## 1 0 Data Scientist $53K-$91K (Glassdoor est.)
## 2 1 Healthcare Data Scientist $63K-$112K (Glassdoor est.)
## 3 2 Data Scientist $80K-$90K (Glassdoor est.)
## 4 3 Data Scientist $56K-$97K (Glassdoor est.)
## 5 4 Data Scientist $86K-$143K (Glassdoor est.)
## 6 5 Data Scientist $71K-$119K (Glassdoor est.)
## Job.Description
## 1 Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization’s data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. 
Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3 KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. 
If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5 Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6 CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. 
We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
## Rating Company.Name Location
## 1 3.8 Tecolote Research\n3.8 Albuquerque, NM
## 2 3.4 University of Maryland Medical System\n3.4 Linthicum, MD
## 3 4.8 KnowBe4\n4.8 Clearwater, FL
## 4 3.8 PNNL\n3.8 Richland, WA
## 5 2.9 Affinity Solutions\n2.9 New York, NY
## 6 3.4 CyrusOne\n3.4 Dallas, TX
## Headquarters Size Founded Type.of.ownership
## 1 Goleta, CA 501 - 1000 1973 Company - Private
## 2 Baltimore, MD 10000+ 1984 Other Organization
## 3 Clearwater, FL 501 - 1000 2010 Company - Private
## 4 Richland, WA 1001 - 5000 1965 Government
## 5 New York, NY 51 - 200 1998 Company - Private
## 6 Dallas, TX 201 - 500 2000 Company - Public
## Industry Sector
## 1 Aerospace & Defense Aerospace & Defense
## 2 Health Care Services & Hospitals Health Care
## 3 Security Services Business Services
## 4 Energy Oil, Gas, Energy & Utilities
## 5 Advertising & Marketing Business Services
## 6 Real Estate Real Estate
## Revenue
## 1 $50 to $100 million (USD)
## 2 $2 to $5 billion (USD)
## 3 $100 to $500 million (USD)
## 4 $500 million to $1 billion (USD)
## 5 Unknown / Non-Applicable
## 6 $1 to $2 billion (USD)
## Competitors
## 1 -1
## 2 -1
## 3 -1
## 4 Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory
## 5 Commerce Signals, Cardlytics, Yodlee
## 6 Digital Realty, CoreSite, Equinix
## Hourly Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## 1 0 0 53 91 72.0
## 2 0 0 63 112 87.5
## 3 0 0 80 90 85.0
## 4 0 0 56 97 76.5
## 5 0 0 86 143 114.5
## 6 0 0 71 119 95.0
## company_txt Job.Location Age Python spark aws excel
## 1 Tecolote Research NM 48 1 0 0 1
## 2 University of Maryland Medical System MD 37 1 0 0 0
## 3 KnowBe4 FL 11 1 1 0 1
## 4 PNNL WA 56 1 0 0 0
## 5 Affinity Solutions NY 23 1 0 0 1
## 6 CyrusOne TX 21 1 0 1 1
## sql sas keras pytorch scikit tensor hadoop tableau bi flink mongo google_an
## 1 0 1 0 0 0 0 0 1 1 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 1 1 0 0 0 0 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0 0 1 0 1 0
## job_title_sim seniority_by_title Degree
## 1 data scientist na M
## 2 data scientist na M
## 3 data scientist na M
## 4 data scientist na na
## 5 data scientist na na
## 6 data scientist na na
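The structure listing below was presumably produced by a call along these lines (the chunk itself is not echoed, so this is an assumption):

str(df_salaries)  # show the class, dimensions, and a preview of each of the 42 variables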
## 'data.frame': 742 obs. of 42 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Job.Title : chr "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary.Estimate : chr "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job.Description : chr "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company.Name : chr "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
## $ Location : chr "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr "501 - 1000 " "10000+ " "501 - 1000 " "1001 - 5000 " ...
## $ Founded : int 1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
## $ Type.of.ownership : chr "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ Hourly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Employer.provided : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Lower.Salary : int 53 63 80 56 86 71 54 86 38 120 ...
## $ Upper.Salary : int 91 112 90 97 143 119 93 142 84 160 ...
## $ Avg.Salary.K. : num 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
## $ Job.Location : chr "NM" "MD" "FL" "WA" ...
## $ Age : int 48 37 11 56 23 21 13 16 7 12 ...
## $ Python : int 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : int 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : int 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : int 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : int 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : int 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : int 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : int 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : int 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : chr "data scientist" "data scientist" "data scientist" "data scientist" ...
## $ seniority_by_title: chr "na" "na" "na" "na" ...
## $ Degree : chr "M" "M" "M" "na" ...
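The summary statistics that follow look like the output of the standard summary() call (again an assumption, since the chunk is not echoed):

summary(df_salaries)  # quartiles, mean, min, and max for numeric columns; length and class for character columns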
## index Job.Title Salary.Estimate Job.Description
## Min. : 0.0 Length:742 Length:742 Length:742
## 1st Qu.:221.5 Class :character Class :character Class :character
## Median :472.5 Mode :character Mode :character Mode :character
## Mean :469.1
## 3rd Qu.:707.8
## Max. :955.0
## Rating Company.Name Location Headquarters
## Min. :-1.000 Length:742 Length:742 Length:742
## 1st Qu.: 3.300 Class :character Class :character Class :character
## Median : 3.700 Mode :character Mode :character Mode :character
## Mean : 3.619
## 3rd Qu.: 4.000
## Max. : 5.000
## Size Founded Type.of.ownership Industry
## Length:742 Min. : -1 Length:742 Length:742
## Class :character 1st Qu.:1939 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1837
## 3rd Qu.:2007
## Max. :2019
## Sector Revenue Competitors Hourly
## Length:742 Length:742 Length:742 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Mode :character Median :0.00000
## Mean :0.03234
## 3rd Qu.:0.00000
## Max. :1.00000
## Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## Min. :0.00000 Min. : 15.00 Min. : 16.0 Min. : 15.5
## 1st Qu.:0.00000 1st Qu.: 52.00 1st Qu.: 96.0 1st Qu.: 73.5
## Median :0.00000 Median : 69.50 Median :124.0 Median : 97.5
## Mean :0.02291 Mean : 74.75 Mean :128.2 Mean :101.5
## 3rd Qu.:0.00000 3rd Qu.: 91.00 3rd Qu.:155.0 3rd Qu.:122.5
## Max. :1.00000 Max. :202.00 Max. :306.0 Max. :254.0
## company_txt Job.Location Age Python
## Length:742 Length:742 Min. : -1.00 Min. :0.0000
## Class :character Class :character 1st Qu.: 12.00 1st Qu.:0.0000
## Mode :character Mode :character Median : 25.00 Median :1.0000
## Mean : 47.52 Mean :0.5283
## 3rd Qu.: 60.00 3rd Qu.:1.0000
## Max. :277.00 Max. :1.0000
## spark aws excel sql
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.2251 Mean :0.2372 Mean :0.5229 Mean :0.5121
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## sas keras pytorch scikit
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.08895 Mean :0.03908 Mean :0.05256 Mean :0.07278
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## tensor hadoop tableau bi
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.09703 Mean :0.1671 Mean :0.1995 Mean :0.07547
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## flink mongo google_an job_title_sim
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Length:742
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 Class :character
## Median :0.00000 Median :0.00000 Median :0.00000 Mode :character
## Mean :0.01348 Mean :0.04987 Mean :0.01887
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## seniority_by_title Degree
## Length:742 Length:742
## Class :character Class :character
## Mode :character Mode :character
##
##
##
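The [1] 0 result below most likely comes from an overall missing-value check, something like the following (an assumption; the original call is not shown):

sum(is.na(df_salaries))  # total number of missing values across the entire data frame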
## [1] 0
# Count the number of missing values in each column
missing_values <- colSums(is.na(df_salaries))
missing_values
## index Job.Title Salary.Estimate Job.Description
## 0 0 0 0
## Rating Company.Name Location Headquarters
## 0 0 0 0
## Size Founded Type.of.ownership Industry
## 0 0 0 0
## Sector Revenue Competitors Hourly
## 0 0 0 0
## Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## 0 0 0 0
## company_txt Job.Location Age Python
## 0 0 0 0
## spark aws excel sql
## 0 0 0 0
## sas keras pytorch scikit
## 0 0 0 0
## tensor hadoop tableau bi
## 0 0 0 0
## flink mongo google_an job_title_sim
## 0 0 0 0
## seniority_by_title Degree
## 0 0
The summary statistics shown earlier report the mean, median, minimum, maximum, and quartile values for each numerical column in the data frame, and the count (length), class, and mode for each character column.
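The listing of column names below was presumably generated by a call such as (an assumption):

names(df_salaries)  # list all 42 column names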
## [1] "index" "Job.Title" "Salary.Estimate"
## [4] "Job.Description" "Rating" "Company.Name"
## [7] "Location" "Headquarters" "Size"
## [10] "Founded" "Type.of.ownership" "Industry"
## [13] "Sector" "Revenue" "Competitors"
## [16] "Hourly" "Employer.provided" "Lower.Salary"
## [19] "Upper.Salary" "Avg.Salary.K." "company_txt"
## [22] "Job.Location" "Age" "Python"
## [25] "spark" "aws" "excel"
## [28] "sql" "sas" "keras"
## [31] "pytorch" "scikit" "tensor"
## [34] "hadoop" "tableau" "bi"
## [37] "flink" "mongo" "google_an"
## [40] "job_title_sim" "seniority_by_title" "Degree"
To check for duplicate rows in the dataset, I used the duplicated() function, which flags repeated rows; subsetting the data frame with those flags returns the duplicated rows, and wrapping the flags in sum() counts them.
# Check for duplicate rows
num_duplicates <- sum(duplicated(df_salaries))
duplicates <- df_salaries[duplicated(df_salaries), ]
print(duplicates)
## [1] index Job.Title Salary.Estimate Job.Description
## [5] Rating Company.Name Location Headquarters
## [9] Size Founded Type.of.ownership Industry
## [13] Sector Revenue Competitors Hourly
## [17] Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## [21] company_txt Job.Location Age Python
## [25] spark aws excel sql
## [29] sas keras pytorch scikit
## [33] tensor hadoop tableau bi
## [37] flink mongo google_an job_title_sim
## [41] seniority_by_title Degree
## <0 rows> (or 0-length row.names)
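Because duplicates has zero rows, the count computed above is zero as well; printing it directly would confirm this (a small sketch, not shown in the original output):

print(num_duplicates)  # expected to print 0, consistent with the empty data frame above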
I prepared the data for visualization by converting the salary column to numeric, filtering out non-numeric values, and aggregating average salaries by job title and job location. I then filtered the data to keep the top job titles by highest average salary and their corresponding locations, sorted in descending order.
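The preparation pipeline below relies on dplyr; library(dplyr) (or the full tidyverse) is assumed to have been attached in an earlier, unechoed chunk:

library(dplyr)  # provides %>%, mutate(), filter(), group_by(), and summarise()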
# Data Preparation: Convert salary to numeric, filter out non-numeric values,
# and aggregate average salary by job title and job location
avg_salary <- df_salaries %>%
  mutate(Salary = as.numeric(Avg.Salary.K.)) %>%
  filter(!is.na(Salary)) %>%
  group_by(Job.Title, Job.Location) %>%
  summarise(Avg_Salary = mean(Salary))
## `summarise()` has grouped output by 'Job.Title'. You can override using the
## `.groups` argument.
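As the message notes, summarise() can be told explicitly how to handle grouping. A minimal variation (not part of the original code) that silences the message and yields the same aggregated table is:

avg_salary <- df_salaries %>%
  mutate(Salary = as.numeric(Avg.Salary.K.)) %>%
  filter(!is.na(Salary)) %>%
  group_by(Job.Title, Job.Location) %>%
  summarise(Avg_Salary = mean(Salary), .groups = "drop")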
# Find the top job titles by highest average salary, in descending order.
# Note: top_n(9, ...) keeps the nine highest-paying titles, which map to the
# ten title-location rows shown below (one title appears in two states).
top_10_job_titles <- avg_salary %>%
  group_by(Job.Title) %>%
  summarise(Avg_Salary = mean(Avg_Salary)) %>%
  top_n(9, Avg_Salary) %>%
  arrange(desc(Avg_Salary))
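top_n() still works but is superseded in current dplyr releases; an equivalent, more explicit sketch (assuming a recent dplyr) is:

top_10_job_titles <- avg_salary %>%
  group_by(Job.Title) %>%
  summarise(Avg_Salary = mean(Avg_Salary), .groups = "drop") %>%
  slice_max(Avg_Salary, n = 9) %>%
  arrange(desc(Avg_Salary))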
# Filter the data to include only the top job titles and their corresponding
# locations, sorted by average salary in descending order
top_10_data <- avg_salary %>%
  filter(Job.Title %in% top_10_job_titles$Job.Title) %>%
  arrange(desc(Avg_Salary))
# Print top 10 job titles and their corresponding locations in descending order
print(top_10_data)
## # A tibble: 10 × 3
## # Groups: Job.Title [9]
## Job.Title Job.Location Avg_Salary
## <chr> <chr> <dbl>
## 1 Director II, Data Science - GRM Actuarial IL 254
## 2 Principal Machine Learning Scientist CA 232.
## 3 Principal Data Scientist with over 10 years experien… CA 225
## 4 Data Science Manager CA 222.
## 5 Lead Data Engineer CA 205
## 6 Director II, Data Science - GRS Predictive Analytics IL 194.
## 7 Staff Machine Learning Engineer CA 181
## 8 Director, Data Science IL 180.
## 9 Sr. Scientist II CA 174
## 10 Data Science Manager PA 128.
I visualized the data as a heat map, in which color intensity encodes the average salary for each of the top-paying job titles and their corresponding locations.
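The plotting code below uses ggplot2 together with the viridis color scale; the package-startup messages that follow suggest these were attached in an earlier, unechoed chunk, roughly as follows (an assumption on my part):

library(ggplot2)
library(viridis)  # provides scale_fill_viridis(); attaching it loads viridisLite, as the message below shows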
## Warning: package 'viridis' was built under R version 4.3.3
## Loading required package: viridisLite
# Create a heatmap of average salary by job title and location
ggplot(top_10_data, aes(x = Job.Location, y = Job.Title, fill = Avg_Salary)) +
  geom_tile() +
  scale_fill_viridis(name = "Average Salary (in thousands USD)") +
  theme_minimal() +
  labs(title = "Top 10 Jobs by Highest Avg Salary & Location",
       x = "Job Location",
       y = "Job Title")

Based on the analysis and visualization, senior and mid-level data science roles pay the highest salaries in California, Illinois, and Pennsylvania. In this data, Director II, Data Science - GRM Actuarial, Principal Machine Learning Scientist, and Principal Data Scientist with over 10 years of experience earn the most, reflecting the deeper technical expertise these roles demand. The analysis revealed significant differences in average salaries across data practitioner roles: some roles command markedly higher compensation than others, suggesting that job responsibilities, required skill sets, and market demand play a crucial part in determining salary levels in this field. In addition, the heat map of average salaries across job titles and locations made geographical disparities in compensation easy to identify, underscoring the importance of regional factors such as cost of living, industry presence, and local economic conditions when assessing salary expectations. By understanding the salary trends associated with different roles and locations, current and aspiring data practitioners can make more informed decisions about job opportunities, career advancement, and potential relocation.