For this story, I used the Data Scientist Salary dataset from Kaggle: https://www.kaggle.com/datasets/nikhilbhathi/data-scientist-salary-us-glassdoor/data
This dataset was built by web scraping "Data Scientist" job postings from www.glassdoor.com in the USA using Selenium. I selected it because state-level salary data for data science jobs in the US is otherwise hard to find.
The data includes fields for Job Title, Salary Estimate, Job Description, Rating, Company Name, Location, Headquarters, Company Size, Founded Date, Type of Ownership, Industry, Sector, Revenue, and Competitors.
I loaded the raw data from my GitHub repository, then explored it by checking for missing and duplicate values and printing summary statistics.
url <- "https://raw.githubusercontent.com/pujaroy280/DATA608Story4/main/data_cleaned_2021.csv"
df_salaries <- read.csv(url)
head(df_salaries)
## index Job.Title Salary.Estimate
## 1 0 Data Scientist $53K-$91K (Glassdoor est.)
## 2 1 Healthcare Data Scientist $63K-$112K (Glassdoor est.)
## 3 2 Data Scientist $80K-$90K (Glassdoor est.)
## 4 3 Data Scientist $56K-$97K (Glassdoor est.)
## 5 4 Data Scientist $86K-$143K (Glassdoor est.)
## 6 5 Data Scientist $71K-$119K (Glassdoor est.)
## Job.Description
## 1 Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, engineering, business, or the sciences.\nSkills Required:\nBachelor’s Degree in relevant field, e.g., math, data analysis, database, computer science, Artificial Intelligence (AI); three years’ experience credit for Master’s degree; five years’ experience credit for a Ph.D\nApplicant should be proficient in the use of Power BI, Tableau, Python, MATLAB, Microsoft Word, PowerPoint, Excel, and working knowledge of MS Access, LMS, SAS, data visualization tools, and have a strong algorithmic aptitude\nExcellent verbal and written communication skills, and quantitative analytical skills are required\nApplicant must be able to work in a team environment\nU.S. citizenship and ability to obtain a DoD Secret Clearance required\nResponsibilities: The applicant will be responsible for formulating analytical solutions to complex data problems; creating data analytic models to improve data metrics; analyzing customer behavior and trends; delivering insights to stakeholders, as well as designing and crafting reports, dashboards, models, and algorithms to make data insights actionable; selecting features, building and optimizing classifiers using machine learning techniques; data mining using state-of-the-art methods, extending organization’s data with third party sources of information when needed; enhancing data collection procedures to include information that is relevant for building analytic systems; processing, cleansing, and verifying the integrity of data used for analysis; doing ad-hoc analysis and presenting results in a clear manner; and creating automated anomaly detection systems and constant tracking of its performance.\nBenefits:\nWe offer competitive salaries commensurate with education and experience. We have an excellent benefits package that includes:\nComprehensive health, dental, life, long and short term disability insurance\n100% Company funded Retirement Plans\nGenerous vacation, holiday and sick pay plans\nTuition assistance\n\nBenefits are provided to employees regularly working a minimum of 30 hours per week.\n\nTecolote Research is a private, employee-owned corporation where people are our primary resource. Our investments in technology and training give our employees the tools to ensure our clients are provided the solutions they need, and our very high employee retention rate and stable workforce is an added value to our customers. Apply now to connect with a company that invests in you.
## 2 What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analytics group at the University of Maryland Medical System (UMMS) in support of its strategic priority to become a data-driven and outcomes-oriented organization. The successful candidate will have 3+ years of experience with Machine Learning, Predictive Modeling, Statistical Analysis, Mathematical Optimization, Algorithm Development and a passion for working with healthcare data. Previous experience with various computational approaches along with an ability to demonstrate a portfolio of relevant prior projects is essential. This position will report to the UMMS Vice President for Enterprise Data and Analytics (ED&A).\n\nII. Principal Responsibilities and Tasks\n\n• Develops predictive and prescriptive analytic models in support of the organization’s clinical, operations and business initiatives and priorities.\n• Deploys solutions so that they provide actionable insights to the organization and are embedded or integrated with application systems\n• Supports and drives analytic efforts designed around organization’s strategic priorities and clinical/business problems\n• Works in a team to drive disruptive innovation, which may translate into improved quality of care, clinical outcomes, reduced costs, temporal efficiencies and process improvements.\n• Builds and extends our analytics portfolio supported by robust documentation\n• Works with autonomy to find solutions to complex problems using open source tools and in-house development\n• Stays abreast of state-of-the-art literature in the fields of operations research, statistical modeling, statistical process control and mathematical optimization\n• Creates, communicates, and manages the project plans and other required project documentation and provides updates to leadership as necessary\n• Develops and maintains relationships with business, IT and clinical leaders and stakeholders across the enterprise to facilitate collaboration and effective communication\n• Works with the analytics team and clinical/business stakeholders to develop pilots so that they may be tested and validated in pilot settings\n• Performs analysis to evaluate primary and secondary objectives from such pilots\n• Assists leadership with strategies for scaling successful projects across the organization and enhances the analytics applications based on feedback from end-users and clinical/business consumers\n• Assists leadership with dissemination of success stories (and failures) in an effort to increase analytics literacy and adoption across the organization.\n\nWhat You Need to Be Successful:\n\nIII. Education and Experience\n\n• Master’s or higher degree (may be substituted by relevant work experience) in applied mathematics, physics, computer science, engineering, statistics or a related field\n• 3+ years of Mathematical Optimization, Machine Learning, Predictive Analytics and Algorithm Development experience (experience with tools such as WEKA, RapidMiner, R. Python or other open source tools strongly desired)\n• Strong development skills in two or more of the following: C/C++, C#, Python, Java\n• Combining analytic methods with advanced data visualizations\n• Expert ability to breakdown and clearly define problems\n• Experience with Natural Language Processing preferred\n\nIV. Knowledge, Skills and Abilities\n\n• Proven communications skills – Effective at working independently and in collaboration with other staff members. 
Capable of clearly presenting findings orally, in writing, or through graphics.\n• Proven analytical skills – Able to compare, contrast, and validate work with keen attention to detail. Skilled in working with “real world” data including scrubbing, transformation, and imputation.\n• Proven problem solving skills – Able to plan work, set clear direction, and coordinate own tasks in a fast-paced multidisciplinary environment. Expert at triaging issues, identifying data anomalies, and debugging software.\n• Design and prototype new application functionality for our products.\n• Change oriented – actively generates process improvements; supports and drives change, and confronts difficult circumstances in creative ways\n• Effective communicator and change agent\n• Ability to prioritize the tasks of the project timeline to achieve the desired results\n• Strong analytic and problem solving skills\n• Ability to cooperatively and effectively work with people from various organization levels\n\nWe are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, and basis of disability or any other federal, state or local protected class.
## 3 KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school security awareness training and simulated phishing. KnowBe4 was created to help organizations manage the ongoing problem of social engineering. Tens of thousands of organizations worldwide use KnowBe4's platform to mobilize their end users as a last line of defense and enable them to make better security decisions, every day.\n\nWe are ranked #1 best place to work in technology nationwide by Fortune Magazine and have placed #1 or #2 in The Tampa Bay Top Workplaces Survey for the last four years. We also just had our 27th record-setting quarter in a row!\n\nThe Data Scientist will work closely with the VP of FP&A and the Quantitative Analytics Manager to implement advanced analytical models and other data-driven solutions.\n\nResponsibilities:\nWork with key stakeholders throughout the organization to identify opportunities using financial data to develop business solutions.\nDevelop new and enhance existing data collection procedures to ensure that all data relevant for analyses is captured.\nCleanse, consolidate, and verify the integrity of data used in analyses.\nBuild and validate predictive models to increase customer retention, revenue generation, and other business outcomes.\nDevelop relevant statistical models to assist with profitability forecasting\nCreate the analytics to leverage known, inferred and appended information about origins and recognizing patterns to assist in outlook forecasting\nAssist in the design and data modeling of data warehouse.\nVisualize data, especially in reports and dashboards, to communicate analysis results to stakeholders.\nExtend data collection to unstructured data within the company and external sources\nMine and collect data (both structured and unstructured) to detect patterns, opportunities and insights that drive our organization\nCreate and execute automation and data mining requests utilizing SQL, Access, Excel, SAS and other statistical programs\nTrouble shoot forecast and optimization anomalies with FP&A team through the use of statistical and mathematical optimization models. Develop testing to explain and or reduce these anomalies.\nOversee and develop key metric forecasts as well as provide budget support based on trends in the business/industry.\nMinimum Qualifications:\nMaster's degree in Statistics, Computer Science, Mathematics or other quantitative discipline required\n2-3 years of experience in similar role (Master's Degree)\n0-2 years of experience in similar role (PhD)\nExperience leveraging predictive modeling, big data analytics, exploratory data analysis and machine learning to drive significant business impact\nExperience with statistical computer languages (Python, R etc.) to manipulate and analyze large datasets preferred.\nExperience with data visualization tools like D3.js, matplotlib, etc., preferred\nExceptional understanding of machine learning algorithms such as Random Forest, SVM, k-NN, Naïve Bayes, Gradient Boosting a plus.\nApplied statistical skills including statistical testing, regression, etc.\nExperience with data bases, query languages, and associated data architecture.\nExperience with distributed computing tools (Hive, Spark, etc.) is a plus.\nStrong analytical skills and ability to meet project deadlines.\nNote: An applicant assessment, background check and drug test may be part of your hiring procedure.\n\nNo recruitment agencies, please.
## 4 *Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological Sciences\n\nGroup: Exposure Science Team\n*Job Description**\nThe Biological System Science (BSS) Group in the Biological Sciences Division of the Pacific Northwest National Laboratory (PNNL) is seeking a staff scientist with multidisciplinary experience in computational chemistry, cheminformatics, advanced statistics and/or machine learning/deep learning/AI. Preferred candidates will have a broad understanding of the state of computational metabolomics and experience in designing and implementing novel deep learning networks for chemistry applications. Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification is also highly valued. Successful candidates will join a large, uniquely collaborative, collegial group of innovators driving the integration of data science, computational science and analytical chemistry to solve the nations most challenging problems in human health, chemical forensics, and national security. The BSS Group is diverse and inclusive, working closely with colleagues across the laboratory with expertise in computational biology, integrative omics, applied mathematics, computer science, and statistics.\n\n+ Apply knowledge of statistics, machine learning, advanced mathematics, simulation, software development, and data modeling to to design, development and implement methods that integrate, clean and analyze data, recognize patterns, address uncertainty, pose questions, and make discoveries from structured and/or unstructured data.\n\n+ Produce solutions driven by exploratory data analysis from complex and high-dimensional datasets.\n\n+ Design, develop, and evaluate predictive models and advanced algorithms that lead to optimal value extraction from data.\n\n+ Develop and maintain existing deep learning networks that generate novel molecules for drug discovery applications\n\n+ Contribue or author proposals, peer-reviewed papers, and other technical products.\n*Minimum Qualifications**\nBS/BA with 0-1 years of experience or MS/MA with 0-1 years of experience\n*Preferred Qualifications**\n+ MS in chemical engineering, computer science, or related field with a GPA of 3.5+ 5+ years of research experience\n\n+ Intermediate level programming experience (preferably Python) and high-performance computing experience\n\n+ At least one first author published, or proof of submitted, paper applying deep learning for use in novel compound generation\n\n+ Understanding of the NMDA receptor and potential drug targets\n\n+ Research experience in drug design, cheminformatics, deep learning, machine learning and/or small molecule identification\n*Equal Employment Opportunity**\nBattelle Memorial Institute (BMI) at Pacific Northwest National Laboratory (PNNL) is an Affirmative Action/Equal Opportunity Employer and supports diversity in the workplace. All employment decisions are made without regard to race, color, religion, sex, national origin, age, disability, veteran status, marital or family status, sexual orientation, gender identity, or genetic information. All BMI staff must be able to demonstrate the legal right to work in the United States. BMI is an E-Verify employer. Learn more at jobs.pnnl.gov.\n*_Please be aware that the Department of Energy (DOE) prohibits DOE employees and contractors from participation in certain foreign government talent recruitment programs. 
If you are offered a position at PNNL and are currently a participant in a foreign government talent recruitment program you will be required to disclose this information before your first day of employment._**\n_Directorate:_ _Earth & Biological Sciences_\n\n_Job Category:_ _Scientists/Scientific Support_\n\n_Group:_ _Biological Systems Science_\n\n_Opening Date:_ _2020-03-26_\n\n_Closing Date:_ _2020-04-05_
## 5 Data Scientist\nAffinity Solutions / Marketing Cloud seeks smart, curious, technically savvy candidates to join our cutting-edge data science team. We hire the best and brightest and give them the opportunity to work on industry-leading technologies.\nThe data sciences team at AFS/Marketing Cloud build models, machine learning algorithms that power all our ad-tech/mar-tech products at scale, develop methodology and tools to precisely and effectively measure market campaign effects, and research in-house and public data sources for consumer spend behavior insights. In this role, you'll have the opportunity to come up with new ideas and solutions that will lead to improvement of our ability to target the right audience, derive insights and provide better measurement methodology for marketing campaigns. You'll access our core data asset and machine learning infrastructure to power your ideas.\nDuties and Responsibilities\n· Support all clients model building needs, including maintaining and improving current modeling/scoring methodology and processes,\n· Provide innovative solutions to customized modeling/scoring/targeting with appropriate ML/statistical tools,\n· Provide analytical/statistical support such as marketing test design, projection, campaign measurement, market insights to clients and stakeholders.\n· Mine large consumer datasets in the cloud environment to support ad hoc business and statistical analysis,\n· Develop and Improve automation capabilities to enable customized delivery of the analytical products to clients,\n· Communicate the methodologies and the results to the management, clients and none technical stakeholders.\nBasic Qualifications\n· Advanced degree in Statistics/Mathematics/Computer Science/Economics or other fields that requires advanced training in data analytics.\n· Being able to apply basic statistical/ML concepts and reasoning to address and solve business problems such as targeting, test design, KPI projection and performance measurement.\n· Entrepreneurial, highly self-motivated, collaborative, keen attention to detail, willingness and capable learn quickly, and ability to effectively prioritize and execute tasks in a high pressure environment.\n· Being flexible to accept different task assignments and able to work on a tight time schedule.\n· Excellent command of one or more programming languages; preferably Python, SAS or R\n· Familiar with one of the database technologies such as PostgreSQL, MySQL, can write basic SQL queries\n· Great communication skills (verbal, written and presentation)\nPreferred Qualifications\n· Experience or exposure to large consumer and/or demographic data sets.\n· Familiarity with data manipulation and cleaning routines and techniques.
## 6 CyrusOne is seeking a talented Data Scientist who holds a range of data-focused skills both in technical and analytical domains. The ideal candidate is adept at processing, cleansing, and verifying the integrity of data used for visualization and analysis. This role is dynamic, granting the candidate the opportunity to participate in a wide variety of projects and collaborate with many cross-functional teams throughout the business.\n\nDuties and Responsibilities:\nParticipate in an agile scrum cadence\nProcess, cleanse, and verify the integrity of data used for analysis\nPerform functional business requirements analysis and data analysis\nDevelop data models and algorithms to apply to data sets\nAugment data collection procedures to include necessary information for building accurate analytics\nCollaborate with stakeholders throughout the organization to identify opportunities for leveraging data to drive business solutions\nEvaluate the effectiveness and accuracy of data sources and data gathering techniques\nGather critical information from meetings with various stakeholders and produce useful reports\nCoordinate with cross-functional teams to implement models and monitor outcomes\nDevelop automated discrepancy detection systems and distribute reconciliation reports to stakeholders\nRequirements:\nMust be legally authorized to work in the United States for any employer without sponsorship\nProfessional experience using statistical software languages like R, Python, and SQL to query, manipulate, and draw insights from data sets\nStrong problem-solving skills with an emphasis on product development\nExtensive experience with Microsoft SQL, MySQL and MongoDB\nUnderstanding of version control (git) and project management with Azure DevOps\nKnowledge of machine learning techniques (clustering, decision tree learning, artificial neural networks, etc.)\nExperience visualizing data for stakeholders using visualization tools such as Power BI\nExperience working with and creating data architectures\nUnderstanding and adherence to agile principles and practices\nAbility to work on problems of any scope where the analysis of situations or data requires a review of a variety of factors\nSelf-maintainability and reliability with minimal supervision\nExcellent interpersonal communication, decision making, presentation, and organizational skills\nAbility to build productive internal/external working relationships\nHarmonious with CyrusOne culture, core values, and business goals\nMinimum Qualifications:\n2+ years of related experience in a data analyst role\nStrong can-do attitude in a time sensitive environment\nOther important information about this position:\nThis position requires typical weekday (Monday - Friday) attendance in an office setting, at times after hours work may be required to meet business and customer needs\nEvery position requires certain physical capabilities. CyrusOne seeks to make reasonable accommodations that enable individuals with disabilities to perform essential duties when possible\nCyrusOne is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, sex, sexual orientation, gender identity, religion, national origin, disability, veteran status, or other legally protected status.\n\nCyrusOne provides reasonable accommodation for qualified individuals with disabilities in accordance with the Americans with Disabilities Act (ADA) and any other state or local laws. 
We will respond to requests for reasonable accommodations to assist you in applying for positions at CyrusOne, or to submit a resume. If you need to request an accommodation, please contact our Human Resources at 214.488.1365 (Option 7) or by email at HR@cyrusone.com.
## Rating Company.Name Location
## 1 3.8 Tecolote Research\n3.8 Albuquerque, NM
## 2 3.4 University of Maryland Medical System\n3.4 Linthicum, MD
## 3 4.8 KnowBe4\n4.8 Clearwater, FL
## 4 3.8 PNNL\n3.8 Richland, WA
## 5 2.9 Affinity Solutions\n2.9 New York, NY
## 6 3.4 CyrusOne\n3.4 Dallas, TX
## Headquarters Size Founded Type.of.ownership
## 1 Goleta, CA 501 - 1000 1973 Company - Private
## 2 Baltimore, MD 10000+ 1984 Other Organization
## 3 Clearwater, FL 501 - 1000 2010 Company - Private
## 4 Richland, WA 1001 - 5000 1965 Government
## 5 New York, NY 51 - 200 1998 Company - Private
## 6 Dallas, TX 201 - 500 2000 Company - Public
## Industry Sector
## 1 Aerospace & Defense Aerospace & Defense
## 2 Health Care Services & Hospitals Health Care
## 3 Security Services Business Services
## 4 Energy Oil, Gas, Energy & Utilities
## 5 Advertising & Marketing Business Services
## 6 Real Estate Real Estate
## Revenue
## 1 $50 to $100 million (USD)
## 2 $2 to $5 billion (USD)
## 3 $100 to $500 million (USD)
## 4 $500 million to $1 billion (USD)
## 5 Unknown / Non-Applicable
## 6 $1 to $2 billion (USD)
## Competitors
## 1 -1
## 2 -1
## 3 -1
## 4 Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory
## 5 Commerce Signals, Cardlytics, Yodlee
## 6 Digital Realty, CoreSite, Equinix
## Hourly Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## 1 0 0 53 91 72.0
## 2 0 0 63 112 87.5
## 3 0 0 80 90 85.0
## 4 0 0 56 97 76.5
## 5 0 0 86 143 114.5
## 6 0 0 71 119 95.0
## company_txt Job.Location Age Python spark aws excel
## 1 Tecolote Research NM 48 1 0 0 1
## 2 University of Maryland Medical System MD 37 1 0 0 0
## 3 KnowBe4 FL 11 1 1 0 1
## 4 PNNL WA 56 1 0 0 0
## 5 Affinity Solutions NY 23 1 0 0 1
## 6 CyrusOne TX 21 1 0 1 1
## sql sas keras pytorch scikit tensor hadoop tableau bi flink mongo google_an
## 1 0 1 0 0 0 0 0 1 1 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0 0
## 5 1 1 0 0 0 0 0 0 0 0 0 0
## 6 1 0 0 0 0 0 0 0 1 0 1 0
## job_title_sim seniority_by_title Degree
## 1 data scientist na M
## 2 data scientist na M
## 3 data scientist na M
## 4 data scientist na na
## 5 data scientist na na
## 6 data scientist na na
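The structure listing below was presumably produced by a call along these lines (the chunk itself is not echoed, so this is an assumption):

str(df_salaries)  # show the class, dimensions, and a preview of each of the 42 variables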
## 'data.frame': 742 obs. of 42 variables:
## $ index : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Job.Title : chr "Data Scientist" "Healthcare Data Scientist" "Data Scientist" "Data Scientist" ...
## $ Salary.Estimate : chr "$53K-$91K (Glassdoor est.)" "$63K-$112K (Glassdoor est.)" "$80K-$90K (Glassdoor est.)" "$56K-$97K (Glassdoor est.)" ...
## $ Job.Description : chr "Data Scientist\nLocation: Albuquerque, NM\nEducation Required: Bachelor’s degree required, preferably in math, "| __truncated__ "What You Will Do:\n\nI. General Summary\n\nThe Healthcare Data Scientist position will join our Advanced Analyt"| __truncated__ "KnowBe4, Inc. is a high growth information security company. We are the world's largest provider of new-school "| __truncated__ "*Organization and Job ID**\nJob ID: 310709\n\nDirectorate: Earth & Biological Sciences\n\nDivision: Biological "| __truncated__ ...
## $ Rating : num 3.8 3.4 4.8 3.8 2.9 3.4 4.1 3.8 3.3 4.6 ...
## $ Company.Name : chr "Tecolote Research\n3.8" "University of Maryland Medical System\n3.4" "KnowBe4\n4.8" "PNNL\n3.8" ...
## $ Location : chr "Albuquerque, NM" "Linthicum, MD" "Clearwater, FL" "Richland, WA" ...
## $ Headquarters : chr "Goleta, CA" "Baltimore, MD" "Clearwater, FL" "Richland, WA" ...
## $ Size : chr "501 - 1000 " "10000+ " "501 - 1000 " "1001 - 5000 " ...
## $ Founded : int 1973 1984 2010 1965 1998 2000 2008 2005 2014 2009 ...
## $ Type.of.ownership : chr "Company - Private" "Other Organization" "Company - Private" "Government" ...
## $ Industry : chr "Aerospace & Defense" "Health Care Services & Hospitals" "Security Services" "Energy" ...
## $ Sector : chr "Aerospace & Defense" "Health Care" "Business Services" "Oil, Gas, Energy & Utilities" ...
## $ Revenue : chr "$50 to $100 million (USD)" "$2 to $5 billion (USD)" "$100 to $500 million (USD)" "$500 million to $1 billion (USD)" ...
## $ Competitors : chr "-1" "-1" "-1" "Oak Ridge National Laboratory, National Renewable Energy Lab, Los Alamos National Laboratory" ...
## $ Hourly : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Employer.provided : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Lower.Salary : int 53 63 80 56 86 71 54 86 38 120 ...
## $ Upper.Salary : int 91 112 90 97 143 119 93 142 84 160 ...
## $ Avg.Salary.K. : num 72 87.5 85 76.5 114.5 ...
## $ company_txt : chr "Tecolote Research" "University of Maryland Medical System" "KnowBe4" "PNNL" ...
## $ Job.Location : chr "NM" "MD" "FL" "WA" ...
## $ Age : int 48 37 11 56 23 21 13 16 7 12 ...
## $ Python : int 1 1 1 1 1 1 0 1 0 1 ...
## $ spark : int 0 0 1 0 0 0 0 1 0 1 ...
## $ aws : int 0 0 0 0 0 1 0 1 0 0 ...
## $ excel : int 1 0 1 0 1 1 1 1 0 0 ...
## $ sql : int 0 0 1 0 1 1 0 1 0 0 ...
## $ sas : int 1 0 1 0 1 0 0 0 0 0 ...
## $ keras : int 0 0 0 0 0 0 0 0 0 0 ...
## $ pytorch : int 0 0 0 0 0 0 0 1 0 0 ...
## $ scikit : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tensor : int 0 0 0 0 0 0 0 1 0 0 ...
## $ hadoop : int 0 0 0 0 0 0 0 0 0 0 ...
## $ tableau : int 1 0 0 0 0 0 0 0 0 0 ...
## $ bi : int 1 0 0 0 0 1 0 0 0 0 ...
## $ flink : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mongo : int 0 0 0 0 0 1 0 0 0 0 ...
## $ google_an : int 0 0 0 0 0 0 0 0 0 0 ...
## $ job_title_sim : chr "data scientist" "data scientist" "data scientist" "data scientist" ...
## $ seniority_by_title: chr "na" "na" "na" "na" ...
## $ Degree : chr "M" "M" "M" "na" ...
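The summary statistics that follow look like the output of the standard summary() call (again an assumption, since the chunk is not echoed):

summary(df_salaries)  # quartiles, mean, min, and max for numeric columns; length and class for character columns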
## index Job.Title Salary.Estimate Job.Description
## Min. : 0.0 Length:742 Length:742 Length:742
## 1st Qu.:221.5 Class :character Class :character Class :character
## Median :472.5 Mode :character Mode :character Mode :character
## Mean :469.1
## 3rd Qu.:707.8
## Max. :955.0
## Rating Company.Name Location Headquarters
## Min. :-1.000 Length:742 Length:742 Length:742
## 1st Qu.: 3.300 Class :character Class :character Class :character
## Median : 3.700 Mode :character Mode :character Mode :character
## Mean : 3.619
## 3rd Qu.: 4.000
## Max. : 5.000
## Size Founded Type.of.ownership Industry
## Length:742 Min. : -1 Length:742 Length:742
## Class :character 1st Qu.:1939 Class :character Class :character
## Mode :character Median :1988 Mode :character Mode :character
## Mean :1837
## 3rd Qu.:2007
## Max. :2019
## Sector Revenue Competitors Hourly
## Length:742 Length:742 Length:742 Min. :0.00000
## Class :character Class :character Class :character 1st Qu.:0.00000
## Mode :character Mode :character Mode :character Median :0.00000
## Mean :0.03234
## 3rd Qu.:0.00000
## Max. :1.00000
## Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## Min. :0.00000 Min. : 15.00 Min. : 16.0 Min. : 15.5
## 1st Qu.:0.00000 1st Qu.: 52.00 1st Qu.: 96.0 1st Qu.: 73.5
## Median :0.00000 Median : 69.50 Median :124.0 Median : 97.5
## Mean :0.02291 Mean : 74.75 Mean :128.2 Mean :101.5
## 3rd Qu.:0.00000 3rd Qu.: 91.00 3rd Qu.:155.0 3rd Qu.:122.5
## Max. :1.00000 Max. :202.00 Max. :306.0 Max. :254.0
## company_txt Job.Location Age Python
## Length:742 Length:742 Min. : -1.00 Min. :0.0000
## Class :character Class :character 1st Qu.: 12.00 1st Qu.:0.0000
## Mode :character Mode :character Median : 25.00 Median :1.0000
## Mean : 47.52 Mean :0.5283
## 3rd Qu.: 60.00 3rd Qu.:1.0000
## Max. :277.00 Max. :1.0000
## spark aws excel sql
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :1.0000
## Mean :0.2251 Mean :0.2372 Mean :0.5229 Mean :0.5121
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## sas keras pytorch scikit
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.08895 Mean :0.03908 Mean :0.05256 Mean :0.07278
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## tensor hadoop tableau bi
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.09703 Mean :0.1671 Mean :0.1995 Mean :0.07547
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## flink mongo google_an job_title_sim
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Length:742
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 Class :character
## Median :0.00000 Median :0.00000 Median :0.00000 Mode :character
## Mean :0.01348 Mean :0.04987 Mean :0.01887
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## seniority_by_title Degree
## Length:742 Length:742
## Class :character Class :character
## Mode :character Mode :character
##
##
##
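The [1] 0 result below most likely comes from an overall missing-value check, something like the following (an assumption; the original call is not shown):

sum(is.na(df_salaries))  # total number of missing values across the entire data frame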
## [1] 0
# Count the number of missing values in each column
missing_values <- colSums(is.na(df_salaries))
missing_values
## index Job.Title Salary.Estimate Job.Description
## 0 0 0 0
## Rating Company.Name Location Headquarters
## 0 0 0 0
## Size Founded Type.of.ownership Industry
## 0 0 0 0
## Sector Revenue Competitors Hourly
## 0 0 0 0
## Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## 0 0 0 0
## company_txt Job.Location Age Python
## 0 0 0 0
## spark aws excel sql
## 0 0 0 0
## sas keras pytorch scikit
## 0 0 0 0
## tensor hadoop tableau bi
## 0 0 0 0
## flink mongo google_an job_title_sim
## 0 0 0 0
## seniority_by_title Degree
## 0 0
The summary statistics shown earlier report the mean, median, minimum, maximum, and quartile values for each numerical column in the data frame, and the count (length), class, and mode for each character column.
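The listing of column names below was presumably generated by a call such as (an assumption):

names(df_salaries)  # list all 42 column names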
## [1] "index" "Job.Title" "Salary.Estimate"
## [4] "Job.Description" "Rating" "Company.Name"
## [7] "Location" "Headquarters" "Size"
## [10] "Founded" "Type.of.ownership" "Industry"
## [13] "Sector" "Revenue" "Competitors"
## [16] "Hourly" "Employer.provided" "Lower.Salary"
## [19] "Upper.Salary" "Avg.Salary.K." "company_txt"
## [22] "Job.Location" "Age" "Python"
## [25] "spark" "aws" "excel"
## [28] "sql" "sas" "keras"
## [31] "pytorch" "scikit" "tensor"
## [34] "hadoop" "tableau" "bi"
## [37] "flink" "mongo" "google_an"
## [40] "job_title_sim" "seniority_by_title" "Degree"
To check for duplicate rows in the dataset, I used the duplicated() function, which flags repeated rows; subsetting the data frame with those flags returns the duplicated rows, and wrapping the flags in sum() counts them.
# Check for duplicate rows
num_duplicates <- sum(duplicated(df_salaries))
duplicates <- df_salaries[duplicated(df_salaries), ]
print(duplicates)
## [1] index Job.Title Salary.Estimate Job.Description
## [5] Rating Company.Name Location Headquarters
## [9] Size Founded Type.of.ownership Industry
## [13] Sector Revenue Competitors Hourly
## [17] Employer.provided Lower.Salary Upper.Salary Avg.Salary.K.
## [21] company_txt Job.Location Age Python
## [25] spark aws excel sql
## [29] sas keras pytorch scikit
## [33] tensor hadoop tableau bi
## [37] flink mongo google_an job_title_sim
## [41] seniority_by_title Degree
## <0 rows> (or 0-length row.names)
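Because duplicates has zero rows, the count computed above is zero as well; printing it directly would confirm this (a small sketch, not shown in the original output):

print(num_duplicates)  # expected to print 0, consistent with the empty data frame above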
I prepared the data for visualization by converting the salary column to numeric, filtering out non-numeric values, and aggregating average salaries by job title and job location. I then filtered the data to keep the top job titles by highest average salary and their corresponding locations, sorted in descending order.
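The preparation pipeline below relies on dplyr; library(dplyr) (or the full tidyverse) is assumed to have been attached in an earlier, unechoed chunk:

library(dplyr)  # provides %>%, mutate(), filter(), group_by(), and summarise()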
# Data Preparation: Convert salary to numeric, filter out non-numeric values,
# and aggregate average salary by job title and job location
avg_salary <- df_salaries %>%
  mutate(Salary = as.numeric(Avg.Salary.K.)) %>%
  filter(!is.na(Salary)) %>%
  group_by(Job.Title, Job.Location) %>%
  summarise(Avg_Salary = mean(Salary))
## `summarise()` has grouped output by 'Job.Title'. You can override using the
## `.groups` argument.
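As the message notes, summarise() can be told explicitly how to handle grouping. A minimal variation (not part of the original code) that silences the message and yields the same aggregated table is:

avg_salary <- df_salaries %>%
  mutate(Salary = as.numeric(Avg.Salary.K.)) %>%
  filter(!is.na(Salary)) %>%
  group_by(Job.Title, Job.Location) %>%
  summarise(Avg_Salary = mean(Salary), .groups = "drop")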
# Find the top job titles by highest average salary, in descending order.
# Note: top_n(9, ...) keeps the nine highest-paying titles, which map to the
# ten title-location rows shown below (one title appears in two states).
top_10_job_titles <- avg_salary %>%
  group_by(Job.Title) %>%
  summarise(Avg_Salary = mean(Avg_Salary)) %>%
  top_n(9, Avg_Salary) %>%
  arrange(desc(Avg_Salary))
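top_n() still works but is superseded in current dplyr releases; an equivalent, more explicit sketch (assuming a recent dplyr) is:

top_10_job_titles <- avg_salary %>%
  group_by(Job.Title) %>%
  summarise(Avg_Salary = mean(Avg_Salary), .groups = "drop") %>%
  slice_max(Avg_Salary, n = 9) %>%
  arrange(desc(Avg_Salary))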
# Filter the data to include only the top job titles and their corresponding
# locations, sorted by average salary in descending order
top_10_data <- avg_salary %>%
  filter(Job.Title %in% top_10_job_titles$Job.Title) %>%
  arrange(desc(Avg_Salary))
# Print top 10 job titles and their corresponding locations in descending order
print(top_10_data)
## # A tibble: 10 × 3
## # Groups: Job.Title [9]
## Job.Title Job.Location Avg_Salary
## <chr> <chr> <dbl>
## 1 Director II, Data Science - GRM Actuarial IL 254
## 2 Principal Machine Learning Scientist CA 232.
## 3 Principal Data Scientist with over 10 years experien… CA 225
## 4 Data Science Manager CA 222.
## 5 Lead Data Engineer CA 205
## 6 Director II, Data Science - GRS Predictive Analytics IL 194.
## 7 Staff Machine Learning Engineer CA 181
## 8 Director, Data Science IL 180.
## 9 Sr. Scientist II CA 174
## 10 Data Science Manager PA 128.
I visualized the data as a heat map, in which color intensity encodes the average salary for each of the top-paying job titles and their corresponding locations.
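The plotting code below uses ggplot2 together with the viridis color scale; the package-startup messages that follow suggest these were attached in an earlier, unechoed chunk, roughly as follows (an assumption on my part):

library(ggplot2)
library(viridis)  # provides scale_fill_viridis(); attaching it loads viridisLite, as the message below shows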
## Warning: package 'viridis' was built under R version 4.3.3
## Loading required package: viridisLite
# Create a heatmap of average salary by job title and location
ggplot(top_10_data, aes(x = Job.Location, y = Job.Title, fill = Avg_Salary)) +
  geom_tile() +
  scale_fill_viridis(name = "Average Salary (in thousands USD)") +
  theme_minimal() +
  labs(title = "Top 10 Jobs by Highest Avg Salary & Location",
       x = "Job Location",
       y = "Job Title")

Based on the analysis and visualization, senior and mid-level data science roles pay the highest salaries in California, Illinois, and Pennsylvania. In this data, Director II, Data Science - GRM Actuarial, Principal Machine Learning Scientist, and Principal Data Scientist with over 10 years of experience earn the most, reflecting the deeper technical expertise these roles demand. The analysis revealed significant differences in average salaries across data practitioner roles: some roles command markedly higher compensation than others, suggesting that job responsibilities, required skill sets, and market demand play a crucial part in determining salary levels in this field. In addition, the heat map of average salaries across job titles and locations made geographical disparities in compensation easy to identify, underscoring the importance of regional factors such as cost of living, industry presence, and local economic conditions when assessing salary expectations. By understanding the salary trends associated with different roles and locations, current and aspiring data practitioners can make more informed decisions about job opportunities, career advancement, and potential relocation.