Data Sources
Kaggle Data Science Job Data set (Primary) : data_cleaned csv (on Github repository)
Why this?
Large scale job posting dataset (job titles, salaries, skill flags (Python, SQL, AWS, Tableau etc.) and job descriptions - great structured columns ready to be analyzed, salary ranges/average - good for exploring skill-salary association, Job description for texting mining soft skills, Many job posting with variety, technical skills listed can be used for statistical relationship analysis
BuiltIn Job Postings (Supplement Data Source): SoFi, TempusAI, Grow Therapy, Braze, PRC..etc
Supplement Skill Reference Articles (Qualitative reference sources),
https://online.nyit.edu/blog/data-science-skills
provides overview of core technical and soft skills expected of data science professionals (programming, statistics, communication, and domain knowledge)
https://www.pragmaticinstitute.com/resources/articles/data/must-have-data-science-skills-for-2023/
Highlights industry‑aligned skill requirements, modern tools, business communication, and deployment skills.
Why this?
Up to date skill requirements reflecting 2025-2026 job market, Includes qualitative details (soft skill, modern tools, expectation. Great for comparison between Kaggle dataset from the past and present day job demands on what quality is still the same
Kaggle Data Acquistion Extraction Plan:
load from csv on github, extract skill indicators, salary information for skill to salary analysis, Soft skill via text search (Teamwork, Communication..etc), grouping job titles (data scientist vs analyst vs engineers)
BuiltIn Job Posting Extraction Plan:
Scrapping –> Tools (PowerBI, Tableu etc), Soft skills (Presentation, communication, collaboration et), Salary, Job description text, Required Programming Skills (R, Python, SQL, Scala etc), Industry domain, Job seniority
Storing dataset in standardized columns
job_idsource = “builtin”
company
job_title
skill_list
soft_skills
salary_range
job_description
date_collected
# For data wrangling
#install.packages('polite')
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.2
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tibble' was built under R version 4.5.2
## Warning: package 'readr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Website scrapping, HTML parsing (polite to avoid blockage)
library(rvest)
##
## Attaching package: 'rvest'
##
## The following object is masked from 'package:readr':
##
## guess_encoding
library(polite)
## Warning: package 'polite' was built under R version 4.5.2
# 607 Project 3 Main Github repository
kaggle_url <- "https://raw.githubusercontent.com/meiqing39/607-Project-3/main/data_cleaned_2021.csv"
kg <- read_csv(kaggle_url, show_col_types = FALSE) |>
clean_names() # <- renames by snake_case format, removes special character
#Ensures consistent and reproducible references to variable by using janitor function clean_names()
# Tibble Resource table
sources_roles <- tibble(
Source = c(
"Kaggle dataset (structured job postings)",
"BuiltIn job postings (recent, unstructured)",
"NYIT article (skills overview)",
"Pragmatic Institute article (skills 2023)"
),
Role = c(
"Primary quantitative dataset used for skill frequency and salary association analyses",
"Supplemental contemporary postings used to validate/contrast 2025–2026 skill demand",
"Qualitative reference to define/justify skill taxonomy (not stored in DB)",
"Qualitative reference to contextualize/interpret findings (not stored in DB)"
)
)
print(sources_roles)
## # A tibble: 4 × 2
## Source Role
## <chr> <chr>
## 1 Kaggle dataset (structured job postings) Primary quantitative dataset used…
## 2 BuiltIn job postings (recent, unstructured) Supplemental contemporary posting…
## 3 NYIT article (skills overview) Qualitative reference to define/j…
## 4 Pragmatic Institute article (skills 2023) Qualitative reference to contextu…
ER diagram shows the normalized database design used to store job postings, technical skills, and soft skills collected from the Kaggle dataset and BuiltIn job postings.
job level attributes (title, company, industry, salary) extracted into jobs table
Technical skills: extract from binary columns, reshape into longer format with pivot longer function connect each job with required skills
job description test extracted without modding for later soft skill text analysis
# Extract job level data
jobs <- kg |>
transmute(
job_id = row_number(),
job_title,
company = company_name,
location,
industry,
avg_salary_k,
source = "kaggle"
)
# Extract job description (soft skill use)
job_descriptions <- kg |>
transmute(
job_id = row_number(),
job_description
)
# Id skill columns
skill_cols <- names(kg)[
names(kg) %in% c(
"python","sql","aws","excel","tableau","spark",
"sas","pytorch","keras","scikit","tensor",
"hadoop","flink","mongo","google_an"
)
]
# Combine created skill_cols to job_id to make job_skills table
job_skills <- kg |>
mutate(job_id = row_number()) |>
select(job_id, all_of(skill_cols)) |>
pivot_longer(
cols = -job_id,
names_to = "skill",
values_to = "present"
) |>
filter(present == 1) |>
select(job_id, skill)
# Create Scrapping function for scrapping of job title and job description later
get1 <- function(html, css) {
node <- html_element(html, css)
if (inherits(node, "xml_missing") || is.na(node)) return("")
html_text2(node)
}
first_nonempty <- function(html, selectors) {
for (sel in selectors) {
txt <- get1(html, sel)
if (nzchar(txt)) return(txt)
}
""
}
scrape_builtin <- function(url, ua = "607-Project-3/edu (rvest+polite)") {
ses <- polite::bow(url, user_agent = ua, force = TRUE)
pg <- polite::scrape(ses)
# Title
title <- first_nonempty(pg, c("h1", "meta[property='og:title']"))
if (!nzchar(title)) title <- NA_character_
# Description: try a few common containers, fall back to main/body
desc <- first_nonempty(pg, c(
"article",
"[data-testid='job-description']",
"[data-qa='job-description']",
"div[class*='job-description']",
"section[class*='description']"
))
if (!nzchar(desc)) desc <- get1(pg, "[role='main']")
if (!nzchar(desc)) desc <- get1(pg, "body")
if (!nzchar(desc)) desc <- NA_character_
tibble(
job_title = title,
job_description = desc,
job_url = url
)
}
# Organize all job posting URLs into a tibble
builtin_urls <- tibble(
job_url = c(
"https://builtin.com/job/manager-data-scientist-alternate-data-strategy/8162509",
"https://builtin.com/job/principal-associate-data-scientist-ai-software-engineering/8017518",
"https://builtin.com/job/principal-associate-data-scientist-us-card-new-credit-team/8029691",
"https://builtin.com/job/senior-associate-data-scientist/8432791",
"https://builtin.com/job/data-scientist-ii-real-world-evidence-rwe-pharma-r-d/8031762",
"https://builtin.com/job/staff-data-scientist-growth-care/8019471",
"https://www.builtinnyc.com/job/senior-staff-data-scientist-invest/8697390",
"https://www.builtinnyc.com/job/staff-data-scientist-borrow/8290699",
"https://www.builtinnyc.com/job/field-data-scientist-ai-deployment/7405417",
"https://www.builtinnyc.com/job/ai-genai-data-scientist-senior-manager/8258875",
"https://www.builtinnyc.com/job/data-scientist/8446518"
))
# Scrape --> builtin_raw
safe_scrape <- purrr::safely(scrape_builtin)
builtin_raw <- builtin_urls |>
mutate(res = map(job_url, safe_scrape)) |>
transmute(
job_url,
job_title = map_chr(res, ~ .x$result$job_title %||% NA_character_),
job_description = map_chr(res, ~ .x$result$job_description %||% NA_character_))
# Double checking that job description returns with page body text for skill, soft skill keywords, and frequency comparison later - text number appear = good
nchar(builtin_raw$job_description)
## [1] 12241 12498 12419 12083 13239 13691 10203 10779 12258 11764 12244
# Scrape and Manually add company and industry from data seen on job posting website
builtin_jobs <- builtin_raw |>
mutate(
company = case_when(
str_detect(job_url, "capital|cap1") ~ "Capital One",
str_detect(job_url, "nylife|new-york-life|senior-associate-data-scientist/8432791")
~ "NY Life",
str_detect(job_url, "tempus") ~ "Tempus AI",
str_detect(job_url, "grow") ~ "Grow Therapy",
str_detect(job_url, "sofi") ~ "SoFi",
str_detect(job_url, "braze") ~ "Braze",
str_detect(job_url, "prc") ~ "PRC",
str_detect(job_url, "octus") ~ "Octus",
TRUE ~ NA_character_
),
industry = case_when(
company %in% c("Capital One", "SoFi", "NY Life") ~ "Finance",
company == "Tempus AI" ~ "Healthcare / Pharma",
company == "Grow Therapy" ~ "Healthcare",
company == "Braze" ~ "SaaS / Martech",
company %in% c("PRC", "Octus") ~ "Consulting",
TRUE ~ NA_character_
),
source = "builtin",
job_id = row_number()
) %>%
select(job_id, job_title, company, industry, source, job_url, job_description)
# extract skills seen on job descriptions
skill_keywords <- c(
# languages
"python","r","sql","scala","java",
# libraries / ml
"pytorch","tensorflow","scikit","sklearn","xgboost",
# platforms
"aws","azure","gcp","databricks","snowflake","spark",
# bi / viz
"tableau","power bi")
builtin_job_skills <- builtin_jobs %>%
transmute(job_id, text = tolower(job_description %||% "")) %>%
filter(nzchar(text)) %>%
crossing(skill = skill_keywords) %>%
mutate(rx = paste0("\\b", skill, "\\b")) %>%
filter(str_detect(text, regex(rx, ignore_case = TRUE))) %>%
distinct(job_id, skill)
head(builtin_jobs)
## # A tibble: 6 × 7
## job_id job_title company industry source job_url job_description
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Manager, Data Scientis… <NA> <NA> built… https:… "https://ad.do…
## 2 2 Principal Associate, D… <NA> <NA> built… https:… "https://ad.do…
## 3 3 Principal Associate, D… <NA> <NA> built… https:… "https://ad.do…
## 4 4 Senior Associate - Dat… NY Life Finance built… https:… "New York Life…
## 5 5 Data Scientist II, Rea… <NA> <NA> built… https:… "Tempus AI\nDa…
## 6 6 Staff Data Scientist -… Grow T… Healthc… built… https:… "Grow Therapy\…
builtin_job_skills |>
count(skill) |>
arrange(desc(n))
## # A tibble: 14 × 2
## skill n
## <chr> <int>
## 1 python 10
## 2 sql 9
## 3 aws 7
## 4 gcp 4
## 5 r 3
## 6 spark 3
## 7 snowflake 2
## 8 tableau 2
## 9 azure 1
## 10 databricks 1
## 11 scala 1
## 12 scikit 1
## 13 tensorflow 1
## 14 xgboost 1
##Extracting csv file from builtin_jobs and builtin_job_skills
write_csv(builtin_jobs, "builtin_jobs.csv")
write_csv(builtin_job_skills, "builtin_job_skills.csv")
I scraped job posting descriptions from Builtin to extract modern technical and soft skills using keyword matching while also understanding that some attributes (salary, explicit seniority levels) are inconsistent available on job posting websites. Salary based analyses rely mostly on the structured kaggle dataset, while the Builtin job posting provide up to date information on skill demands and tools used as data scientists. Industry classification were assigned manually based on company information.
pacman::p_load(janitor,
tidyverse)
#library(tidyverse) Moved into pacman statement package to make it easier to open up and install the latest version
#library(dplyr) Removed this library because it is already included in tidyverse
#library(readr) Removed this library because it is already included in tidyverse
#library(ggplot2) Removed this library because it is already included in tidyverse
#library(janitor) Moved into pacman statement package to make it easier to open up and install the latest version
kaggle_url <- "https://raw.githubusercontent.com/meiqing39/607-Project-3/refs/heads/main/data_cleaned_2021.csv"
kg <- read_csv(kaggle_url, show_col_types = FALSE) |>
clean_names() # <- renames by snake_case format, removes special character
#Ensures consistent and reproducible references to variable by using janitor function clean_names()
kg
## # A tibble: 742 × 42
## index job_title salary_estimate job_description rating company_name location
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 0 Data Scie… $53K-$91K (Gla… "Data Scientis… 3.8 "Tecolote R… Albuque…
## 2 1 Healthcar… $63K-$112K (Gl… "What You Will… 3.4 "University… Linthic…
## 3 2 Data Scie… $80K-$90K (Gla… "KnowBe4, Inc.… 4.8 "KnowBe4\n4… Clearwa…
## 4 3 Data Scie… $56K-$97K (Gla… "*Organization… 3.8 "PNNL\n3.8" Richlan…
## 5 4 Data Scie… $86K-$143K (Gl… "Data Scientis… 2.9 "Affinity S… New Yor…
## 6 5 Data Scie… $71K-$119K (Gl… "CyrusOne is s… 3.4 "CyrusOne\n… Dallas,…
## 7 6 Data Scie… $54K-$93K (Gla… "Job Descripti… 4.1 "ClearOne A… Baltimo…
## 8 7 Data Scie… $86K-$142K (Gl… "Advanced Anal… 3.8 "Logic20/20… San Jos…
## 9 8 Research … $38K-$84K (Gla… "SUMMARY\n\nTh… 3.3 "Rochester … Rochest…
## 10 9 Data Scie… $120K-$160K (G… "isn’t your us… 4.6 "<intent>\n… New Yor…
## # ℹ 732 more rows
## # ℹ 35 more variables: headquarters <chr>, size <chr>, founded <dbl>,
## # type_of_ownership <chr>, industry <chr>, sector <chr>, revenue <chr>,
## # competitors <chr>, hourly <dbl>, employer_provided <dbl>,
## # lower_salary <dbl>, upper_salary <dbl>, avg_salary_k <dbl>,
## # company_txt <chr>, job_location <chr>, age <dbl>, python <dbl>,
## # spark <dbl>, aws <dbl>, excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, …
colnames(kg)
## [1] "index" "job_title" "salary_estimate"
## [4] "job_description" "rating" "company_name"
## [7] "location" "headquarters" "size"
## [10] "founded" "type_of_ownership" "industry"
## [13] "sector" "revenue" "competitors"
## [16] "hourly" "employer_provided" "lower_salary"
## [19] "upper_salary" "avg_salary_k" "company_txt"
## [22] "job_location" "age" "python"
## [25] "spark" "aws" "excel"
## [28] "sql" "sas" "keras"
## [31] "pytorch" "scikit" "tensor"
## [34] "hadoop" "tableau" "bi"
## [37] "flink" "mongo" "google_an"
## [40] "job_title_sim" "seniority_by_title" "degree"
#head(10)
Several variables contain placeholder values that represent missing
data rather than valid observations. For example, Rating,
Founded, and Age include values of
-1, while variables such as Competitors,
seniority_by_title, and Degree use strings
such as "-1" or "na" to indicate unavailable
information. These values must be recoded as missing before further
analysis.
The dataset contains 742 job postings and 42 variables. These
variables include job and company descriptors, salary information, and
binary indicators for technical skills such as Python, SQL, AWS, and
Tableau. Initial inspection using glimpse() and
summary() showed that the dataset contains both raw text
variables and derived variables that are more suitable for analysis.
The binary skill columns provide a practical proxy for skill demand in job postings. Since these variables are coded as 0 and 1, their means can be interpreted as the proportion of postings that require each skill. Preliminary results suggest that Python, Excel, and SQL are among the most frequently requested skills in the dataset.
Exploratory summaries also suggest the presence of extreme values. For instance, company age ranges up to 277 years, and average salary ranges from 15.5 to 254 thousand dollars. While some of these values may be valid, they require further inspection through visualizations such as boxplots and histograms.
Standardize Kaggle jobs
kaggle_jobs <- kg |>
transmute(
job_id = paste0("kg_", row_number()),
source = "kaggle",
job_title,
company = company_name,
location,
industry,
avg_salary_k,
salary_min_k = lower_salary,
salary_max_k = upper_salary,
job_description,
job_url = NA_character_
)
kaggle_jobs
## # A tibble: 742 × 11
## job_id source job_title company location industry avg_salary_k salary_min_k
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 kg_1 kaggle Data Scien… "Tecol… Albuque… Aerospa… 72 53
## 2 kg_2 kaggle Healthcare… "Unive… Linthic… Health … 87.5 63
## 3 kg_3 kaggle Data Scien… "KnowB… Clearwa… Securit… 85 80
## 4 kg_4 kaggle Data Scien… "PNNL\… Richlan… Energy 76.5 56
## 5 kg_5 kaggle Data Scien… "Affin… New Yor… Adverti… 114. 86
## 6 kg_6 kaggle Data Scien… "Cyrus… Dallas,… Real Es… 95 71
## 7 kg_7 kaggle Data Scien… "Clear… Baltimo… Banks &… 73.5 54
## 8 kg_8 kaggle Data Scien… "Logic… San Jos… Consult… 114 86
## 9 kg_9 kaggle Research S… "Roche… Rochest… Health … 61 38
## 10 kg_10 kaggle Data Scien… "<inte… New Yor… Internet 140 120
## # ℹ 732 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>
builtin_jobs_clean <- builtin_jobs |>
transmute(
job_id = paste0("bi_", row_number()),
source = "builtin",
job_title,
company,
location = NA_character_,
industry,
avg_salary_k = NA_real_,
salary_min_k = NA_real_,
salary_max_k = NA_real_,
job_description,
job_url
)
builtin_jobs_clean
## # A tibble: 11 × 11
## job_id source job_title company location industry avg_salary_k salary_min_k
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 bi_1 builtin Manager, … <NA> <NA> <NA> NA NA
## 2 bi_2 builtin Principal… <NA> <NA> <NA> NA NA
## 3 bi_3 builtin Principal… <NA> <NA> <NA> NA NA
## 4 bi_4 builtin Senior As… NY Life <NA> Finance NA NA
## 5 bi_5 builtin Data Scie… <NA> <NA> <NA> NA NA
## 6 bi_6 builtin Staff Dat… Grow T… <NA> Healthc… NA NA
## 7 bi_7 builtin Senior St… <NA> <NA> <NA> NA NA
## 8 bi_8 builtin Staff Dat… <NA> <NA> <NA> NA NA
## 9 bi_9 builtin Forward-D… <NA> <NA> <NA> NA NA
## 10 bi_10 builtin AI & GenA… <NA> <NA> <NA> NA NA
## 11 bi_11 builtin Data Scie… <NA> <NA> <NA> NA NA
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>
all_jobs <- bind_rows(kaggle_jobs, builtin_jobs_clean)
all_jobs
## # A tibble: 753 × 11
## job_id source job_title company location industry avg_salary_k salary_min_k
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 kg_1 kaggle Data Scien… "Tecol… Albuque… Aerospa… 72 53
## 2 kg_2 kaggle Healthcare… "Unive… Linthic… Health … 87.5 63
## 3 kg_3 kaggle Data Scien… "KnowB… Clearwa… Securit… 85 80
## 4 kg_4 kaggle Data Scien… "PNNL\… Richlan… Energy 76.5 56
## 5 kg_5 kaggle Data Scien… "Affin… New Yor… Adverti… 114. 86
## 6 kg_6 kaggle Data Scien… "Cyrus… Dallas,… Real Es… 95 71
## 7 kg_7 kaggle Data Scien… "Clear… Baltimo… Banks &… 73.5 54
## 8 kg_8 kaggle Data Scien… "Logic… San Jos… Consult… 114 86
## 9 kg_9 kaggle Research S… "Roche… Rochest… Health … 61 38
## 10 kg_10 kaggle Data Scien… "<inte… New Yor… Internet 140 120
## # ℹ 743 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>
Missing values were identified using the is.na()
function. Depending on the variable, missing values were either removed
or handled appropriately to maintain data integrit
skill_cols <- names(kg)[
names(kg) %in% c(
"python","sql","aws","excel","tableau","spark",
"sas","pytorch","keras","scikit","tensor",
"hadoop","flink","mongo","google_an"
)
]
kaggle_job_skills <- kg |>
mutate(job_id = paste0("kg_", row_number())) |>
select(job_id, all_of(skill_cols)) |>
pivot_longer(
cols = -job_id,
names_to = "skill",
values_to = "present"
) |>
filter(present == 1) |>
select(job_id, skill)
#Summarizing the skills we can see that python, excel and sql are the most required skill in the job market
#BuiltIn skills
skill_keywords <- c(
"python","r","sql","scala","java",
"pytorch","tensorflow","scikit","sklearn","xgboost",
"aws","azure","gcp","databricks","snowflake","spark",
"tableau","power bi"
)
builtin_job_skills_clean <- builtin_jobs |>
mutate(job_id = paste0("bi_", row_number())) |>
transmute(job_id, text = tolower(coalesce(job_description, ""))) |>
filter(nzchar(text)) |>
crossing(skill = skill_keywords) |>
mutate(rx = paste0("\\b", skill, "\\b")) |>
filter(str_detect(text, regex(rx, ignore_case = TRUE))) |>
distinct(job_id, skill)
all_job_skills <- bind_rows(
kaggle_job_skills,
builtin_job_skills_clean
)
all_job_skills
## # A tibble: 2,142 × 2
## job_id skill
## <chr> <chr>
## 1 kg_1 python
## 2 kg_1 excel
## 3 kg_1 sas
## 4 kg_1 tableau
## 5 kg_2 python
## 6 kg_3 python
## 7 kg_3 spark
## 8 kg_3 excel
## 9 kg_3 sql
## 10 kg_3 sas
## # ℹ 2,132 more rows
count(all_jobs, source)
## # A tibble: 2 × 2
## source n
## <chr> <int>
## 1 builtin 11
## 2 kaggle 742
count(all_job_skills, skill, sort = TRUE)
## # A tibble: 23 × 2
## skill n
## <chr> <int>
## 1 python 402
## 2 sql 389
## 3 excel 388
## 4 aws 183
## 5 spark 170
## 6 tableau 150
## 7 hadoop 124
## 8 tensor 72
## 9 sas 66
## 10 scikit 55
## # ℹ 13 more rows
glimpse(all_jobs)
## Rows: 753
## Columns: 11
## $ job_id <chr> "kg_1", "kg_2", "kg_3", "kg_4", "kg_5", "kg_6", "kg_7"…
## $ source <chr> "kaggle", "kaggle", "kaggle", "kaggle", "kaggle", "kag…
## $ job_title <chr> "Data Scientist", "Healthcare Data Scientist", "Data S…
## $ company <chr> "Tecolote Research\n3.8", "University of Maryland Medi…
## $ location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL", …
## $ industry <chr> "Aerospace & Defense", "Health Care Services & Hospita…
## $ avg_salary_k <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0…
## $ salary_min_k <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 106,…
## $ salary_max_k <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 106,…
## $ job_description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducation …
## $ job_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
glimpse(all_job_skills)
## Rows: 2,142
## Columns: 2
## $ job_id <chr> "kg_1", "kg_1", "kg_1", "kg_1", "kg_2", "kg_3", "kg_3", "kg_3",…
## $ skill <chr> "python", "excel", "sas", "tableau", "python", "python", "spark…
checks
count(all_jobs, source)
## # A tibble: 2 × 2
## source n
## <chr> <int>
## 1 builtin 11
## 2 kaggle 742
count(all_job_skills, skill, sort = TRUE)
## # A tibble: 23 × 2
## skill n
## <chr> <int>
## 1 python 402
## 2 sql 389
## 3 excel 388
## 4 aws 183
## 5 spark 170
## 6 tableau 150
## 7 hadoop 124
## 8 tensor 72
## 9 sas 66
## 10 scikit 55
## # ℹ 13 more rows
glimpse(all_jobs)
## Rows: 753
## Columns: 11
## $ job_id <chr> "kg_1", "kg_2", "kg_3", "kg_4", "kg_5", "kg_6", "kg_7"…
## $ source <chr> "kaggle", "kaggle", "kaggle", "kaggle", "kaggle", "kag…
## $ job_title <chr> "Data Scientist", "Healthcare Data Scientist", "Data S…
## $ company <chr> "Tecolote Research\n3.8", "University of Maryland Medi…
## $ location <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL", …
## $ industry <chr> "Aerospace & Defense", "Health Care Services & Hospita…
## $ avg_salary_k <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0…
## $ salary_min_k <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 106,…
## $ salary_max_k <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 106,…
## $ job_description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducation …
## $ job_url <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
glimpse(all_job_skills)
## Rows: 2,142
## Columns: 2
## $ job_id <chr> "kg_1", "kg_1", "kg_1", "kg_1", "kg_2", "kg_3", "kg_3", "kg_3",…
## $ skill <chr> "python", "excel", "sas", "tableau", "python", "python", "spark…
skill_counts <- kaggle_job_skills |>
count(skill, sort = TRUE)
ggplot(skill_counts, aes(x = reorder(skill, n), y = n, fill = skill )) +
geom_col() +
coord_flip() +
labs(
title = "Frequency of Skills in Job Postings",
x = "Skill",
y = "Number of postings"
) +
theme(legend.position = 'none')
skill_counts <- builtin_job_skills_clean |>
count(skill, sort = TRUE)
ggplot(skill_counts, aes(x = reorder(skill, n), y = n, fill = skill )) +
geom_col() +
coord_flip() +
labs(
title = "Frequency of Skills in Job Postings",
x = "Skill",
y = "Number of postings"
) +
theme(legend.position = 'none')
all_jobs_salary <- all_jobs |>
filter(!is.na(avg_salary_k))
ggplot(all_jobs_salary, aes(x = avg_salary_k)) +
geom_histogram(bins = 30) +
labs(
title = "Distribution of Average Salary",
x = "Average Salary (K)",
y = "Number of Job Postings"
)
ggplot(all_jobs_salary, aes(y = avg_salary_k)) +
geom_boxplot() +
labs(
title = "Boxplot of Average Salary",
y = "Average Salary (K)"
)
##The Average Salary is 70k to 125k after observing the boxplot
The histogram shows that most job postings fall within the mid-range salary band, with a concentration between approximately 70K and 130K. Higher salary values are less frequent and may represent senior or specialized roles.
job_salary_summary <- all_jobs_salary |>
group_by(job_title) |>
summarise(
mean_salary = mean(avg_salary_k, na.rm = TRUE),
median_salary = median(avg_salary_k, na.rm = TRUE),
n = n(),
.groups = "drop"
) |>
filter(n >= 5) |>
arrange(desc(mean_salary))
job_salary_summary |>
ggplot(aes( x = reorder(job_title, mean_salary), y = mean_salary, fill = job_title)) +
geom_col() +
coord_flip() +
labs(
title = "Average Salary for Job Title",
x = "Job Title",
y = " Average Salary (K)"
) +
theme(legend.position = 'none')
##To compare pay across roles, job postings were grouped by job title and summarized using average salary. Because many titles appeared only a few times, the analysis focused on titles with at least five postings. This provides a more stable comparison of salary levels across common job categories.
source_skill_counts <- all_jobs |>
left_join(all_job_skills, by = "job_id") |>
filter(!is.na(skill)) |>
count(source, skill)
source_skill_counts |>
ggplot(aes(x = reorder(skill, n), y = n, fill = source)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Skill Frequency by Data Source",
x = "Skill",
y = "Number of Job Postings"
)
What are the most common required skills?
Let’s look at the top 5 skills from the Builtin and Kaggle data sets:
builtin_top5 <- source_skill_counts |>
filter(source == "builtin") |>
arrange(desc(n)) |>
rename("count" = n) |>
slice_head(n = 5)
kaggle_top5 <- source_skill_counts |>
filter(source == "kaggle") |>
arrange(desc(n)) |>
rename("count" = n) |>
slice_head(n = 5)
library(gt)
## Warning: package 'gt' was built under R version 4.5.2
builtin_top5 |>
select(skill, count) |>
gt() |>
cols_label(
skill = "Skill",
count = "Count"
) |>
tab_header(title = "Top 5 Skills - Builtin")
| Top 5 Skills - Builtin | |
| Skill | Count |
|---|---|
| python | 10 |
| sql | 9 |
| aws | 7 |
| gcp | 4 |
| r | 3 |
builtin_top5 |>
ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Top 5 Skills (Builitin)", x = "Skill", y = "Count")
kaggle_top5 |>
select(skill, count) |>
gt() |>
cols_label(
skill = "Skill",
count = "Count"
) |>
tab_header(title = "Top 5 Skills - Kaggle")
| Top 5 Skills - Kaggle | |
| Skill | Count |
|---|---|
| python | 392 |
| excel | 388 |
| sql | 380 |
| aws | 176 |
| spark | 167 |
kaggle_top5 |>
ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Top 5 Skills (Kaggle)", x = "Skill", y = "Count")
For the Builtin data set, Python, SQL, and R are the top software skills. Cloud skills are also in demand.
It’s worth noting that this data set is much smaller and all the job posts are in the greater NYC area, and salaries will reflect that.
Kaggle shows Python, Excel, and SQL as the top 3 software skills. R is excluded from this data set, so it’s not a good metric of how commonly R is a required skill. AWS cloud skills are also in demand, as is Apache Spark.
Looking at the full list, Tableau shows up quite early. In my experience, the barrier to using Tableau isn’t necessarily learning to use it, but the price.
#visualizing all Builtin Skills
source_skill_counts |>
filter(source == "builtin") |>
arrange(desc(n)) |>
rename("count" = n) |>
ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "All Skills (Builitin)", x = "Skill", y = "Count")
#All Kaggle skills
source_skill_counts |>
filter(source == "kaggle") |>
arrange(desc(n)) |>
rename("count" = n) |>
ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
geom_col() +
geom_text(aes(label = count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "All Skills (Kaggle)", x = "Skill", y = "Count")
In general, the Kaggle skills start with broad software (Python, Excel, SQL, AWS), then moves into more niche Apache software. Some of these are advanced Python libraries, like Pytorch, scikit, and Keras. Tensor is a data structure.
Above, we have salary by title. In general, data scientists and data engineers make more than analysts.
Here, we’ll look at salary by state, just to see the impact of that and benchmark for later observations.
all_jobs_state <- all_jobs_salary |>
separate(col = location, into = c("city", "state"), sep = ",") |>
mutate(across(c(city, state), str_trim)) |>
mutate(state = if_else(state == "Los Angeles", "CA", state))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [127].
#benchmarking the salaries for later observations
summary(all_jobs_state$avg_salary_k)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.5 73.5 97.5 101.5 122.5 254.0
avg_state <- all_jobs_state |>
filter(source == "kaggle") |>
group_by(state) |>
summarise(
avg_salary = round(mean(avg_salary_k, na.rm = TRUE)),
job_count = n()
) |>
arrange(desc(avg_salary))
#top 10 states by salary
avg_state |>
slice_head(n = 10) |>
select(state, avg_salary, job_count) |>
gt() |>
cols_align(
align = "center",
columns = everything()
) |>
cols_label(
state = "State",
avg_salary = "Average Salary (k)",
job_count = "Number of Jobs"
) |>
tab_header(title = "Top 10 States by Average Salary")
| Top 10 States by Average Salary | ||
| State | Average Salary (k) | Number of Jobs |
|---|---|---|
| CA | 124 | 152 |
| IL | 117 | 40 |
| DC | 110 | 11 |
| MA | 107 | 103 |
| NJ | 105 | 17 |
| MI | 100 | 6 |
| RI | 100 | 1 |
| NY | 99 | 72 |
| NC | 98 | 21 |
| KY | 97 | 6 |
avg_state |>
slice_head(n = 10) |>
ggplot(aes(x = reorder(state, -avg_salary), y = avg_salary, fill = state)) +
geom_col() +
geom_text(aes(label = paste0(avg_salary, "k")), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Top 10 States by Average Salary", x = "State", y = "Average Salary (k)")
Here, we take the job skills table and join it with the KG skills table based on index. Then, we create an average salary per skill column. This is to see which skill is associated with the highest average salary.
#making a slight tweak to this code so I can keep the index column and join
#with the kg data set for average salary
job_skills_2 <- kg |>
mutate(job_id = index) |>
select(job_id, all_of(skill_cols)) |>
pivot_longer(
cols = -job_id,
names_to = "skill",
values_to = "present"
) |>
filter(present == 1) |>
select(job_id, skill)
#left join with KG to get the salary
job_skills_salary <- job_skills_2 |>
left_join(kg |> select(index, avg_salary_k), by = c("job_id" = "index"))
#Averaging the salary
salary_skill <- job_skills_salary |>
group_by(skill) |>
summarise(
avg_salary = round(mean(avg_salary_k, na.rm = TRUE)),
job_count = n()) |>
arrange(desc(avg_salary))
salary_skill |>
ggplot(aes(x = reorder(skill, -avg_salary), y = avg_salary, fill = skill)) +
geom_col() +
geom_text(aes(label = paste0(avg_salary, "k")), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Average Salary by Skill", x = "Skill", y = "Average Salary (k)")
In general, it seems like more niche skills are more highly valued. Scikit and Keras are Python libraries. Flink = Apache Flink. SAS ranks fifth – this data set is a little older and SAS is likely not as popular as it used to be.
How does the average salary for each skill compare to the median salary?
all_jobs_salary |> summarise(mean(avg_salary_k))
## # A tibble: 1 × 1
## `mean(avg_salary_k)`
## <dbl>
## 1 101.
all_jobs_salary
## # A tibble: 742 × 11
## job_id source job_title company location industry avg_salary_k salary_min_k
## <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 kg_1 kaggle Data Scien… "Tecol… Albuque… Aerospa… 72 53
## 2 kg_2 kaggle Healthcare… "Unive… Linthic… Health … 87.5 63
## 3 kg_3 kaggle Data Scien… "KnowB… Clearwa… Securit… 85 80
## 4 kg_4 kaggle Data Scien… "PNNL\… Richlan… Energy 76.5 56
## 5 kg_5 kaggle Data Scien… "Affin… New Yor… Adverti… 114. 86
## 6 kg_6 kaggle Data Scien… "Cyrus… Dallas,… Real Es… 95 71
## 7 kg_7 kaggle Data Scien… "Clear… Baltimo… Banks &… 73.5 54
## 8 kg_8 kaggle Data Scien… "Logic… San Jos… Consult… 114 86
## 9 kg_9 kaggle Research S… "Roche… Rochest… Health … 61 38
## 10 kg_10 kaggle Data Scien… "<inte… New Yor… Internet 140 120
## # ℹ 732 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>
#the average salary for this data set is ~101K
salary_skill |>
mutate("difference" = avg_salary - 101) |>
mutate("percentage" = round(difference/101,2)) |>
select(skill, avg_salary, difference, percentage) |>
gt() |>
cols_label(
skill = "Skill",
avg_salary = "Mean Salary (k)",
difference = "Diff from Mean Salary (k)",
percentage = "% Diff from Mean Salary"
) |>
tab_header(title = "How Much Are Skills Worth?")
| How Much Are Skills Worth? | |||
| Skill | Mean Salary (k) | Diff from Mean Salary (k) | % Diff from Mean Salary |
|---|---|---|---|
| flink | 129 | 28 | 0.28 |
| scikit | 125 | 24 | 0.24 |
| keras | 123 | 22 | 0.22 |
| tensor | 120 | 19 | 0.19 |
| sas | 114 | 13 | 0.13 |
| aws | 113 | 12 | 0.12 |
| mongo | 113 | 12 | 0.12 |
| python | 113 | 12 | 0.12 |
| spark | 113 | 12 | 0.12 |
| hadoop | 111 | 10 | 0.10 |
| pytorch | 109 | 8 | 0.08 |
| sql | 102 | 1 | 0.01 |
| excel | 99 | -2 | -0.02 |
| tableau | 96 | -5 | -0.05 |
| google_an | 68 | -33 | -0.33 |
#A slightly dramatic bar chart
salary_skill |>
mutate(difference = avg_salary - 101) |>
ggplot(aes(x = reorder(skill, -difference), y = difference, fill = skill)) +
geom_col() +
geom_text(aes(label = paste0(difference, "k"),
vjust = ifelse(difference >= 0, -0.5, 1.5)),
size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
expand_limits(y = c(min(salary_skill$avg_salary - 101) * 1.2,
max(salary_skill$avg_salary - 101) * 1.2)) +
labs(title = "How Much Are Skills Worth?", x = "Skill", y = "Difference from Mean Salary (k)")
Excel is the #2 skill, but it’s assocated with a lower salary. Can we conclude that learning Excel will drop your salary down by two grand? Probably not. But it’s a more accessible skill, and may be associated with less intense/specialized jobs. Tableau is also fairly accessible, as is Google Analytics.
What (if anything) does title mean? Are certain skills associated with a data scientist/analyst/engineer vs an analyst? On average, what is the pay differential between these titles?
Let’s look at titles and skill counts together.
#This data set that keeps everything and pivots skills
kg_long <- kg |>
pivot_longer(
cols = python:google_an,
names_to = "skill",
values_to = "present"
) |>
filter(present == 1) |>
select(index:age, job_title_sim:degree, skill)
kg_skills_pivot <- kg_long |>
select(job_title_sim, skill)
#changing the title "data analitics" to just analyst, so it matches
#(it seems like an error and it's also a typo)
kg_skills_pivot <- kg_skills_pivot |>
mutate(job_title_sim = if_else(job_title_sim == "data analitics", "analyst",
job_title_sim))
kg_skills_agg <- kg_skills_pivot |>
group_by(job_title_sim, skill) |>
summarise("skill_count" = n(), .groups = "drop")
#Let's look at analyst, data engineer, and data scientist
kg_skills_agg |> filter(job_title_sim == "analyst") |>
arrange(desc(skill_count)) |>
select(skill, skill_count) |>
gt() |>
cols_label(
skill = "Skill",
skill_count = "Count"
) |>
tab_header(title = "Most Common Analyst Skills")
| Most Common Analyst Skills | |
| Skill | Count |
|---|---|
| excel | 81 |
| sql | 79 |
| tableau | 44 |
| python | 36 |
| bi | 19 |
| sas | 13 |
| google_an | 11 |
| aws | 10 |
| spark | 6 |
| hadoop | 3 |
| mongo | 3 |
kg_skills_agg |> filter(job_title_sim == "data scientist") |>
arrange(desc(skill_count)) |>
select(skill, skill_count) |>
gt() |>
cols_label(
skill = "Skill",
skill_count = "Count"
) |>
tab_header(title = "Most Common Data Scientist Skills")
| Most Common Data Scientist Skills | |
| Skill | Count |
|---|---|
| python | 240 |
| sql | 176 |
| excel | 155 |
| spark | 84 |
| aws | 79 |
| tableau | 76 |
| hadoop | 60 |
| tensor | 60 |
| sas | 50 |
| scikit | 47 |
| pytorch | 33 |
| keras | 29 |
| bi | 28 |
| mongo | 19 |
| flink | 4 |
| google_an | 3 |
kg_skills_agg |> filter(job_title_sim == "data engineer") |>
arrange(desc(skill_count)) |>
select(skill, skill_count) |>
gt() |>
cols_label(
skill = "Skill",
skill_count = "Count"
) |>
tab_header(title = "Most Common Data Engineer Skills")
| Most Common Data Engineer Skills | |
| Skill | Count |
|---|---|
| sql | 87 |
| python | 77 |
| spark | 67 |
| aws | 59 |
| excel | 54 |
| hadoop | 50 |
| mongo | 13 |
| tableau | 11 |
| flink | 6 |
| bi | 3 |
| tensor | 2 |
| sas | 1 |
#analyst skills bar chart
kg_skills_agg |>
filter(job_title_sim == "analyst") |>
ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
geom_col() +
geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Most Common Data Analyst Skills", x = "Skill", y = "Count")
#scientist
kg_skills_agg |>
filter(job_title_sim == "data scientist") |>
ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
geom_col() +
geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Most Common Data Scientist Skills", x = "Skill", y = "Count")
#engineer
kg_skills_agg |>
filter(job_title_sim == "data engineer") |>
ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
geom_col() +
geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Most Common Data Engineer Skills", x = "Skill", y = "Count")
Analysts seem more likely to need excel, SQL, and Tableau before python, whereas engineers and scientists generally need SQL, Python, and cloud skills. Analysts need more visualization tools like Tableau, BI, and google analytics. In general, programming skills are preferred for scientists and engineers.
Is understanding a particular area (e.g. healthcare or finance) correlated with a higher salary?
In data science, you may need two skill sets. For example, you may need chemistry and data science skills in order to work at a pharmaceutical company.
#all jobs salary
by_industry <- all_jobs_salary |>
filter(industry != "-1") |>
group_by(industry) |>
summarise(
avg_salary = round(mean(avg_salary_k, na.rm = TRUE), 0),
job_count = n()
) |>
arrange(desc(avg_salary))
#there are a lot of one-offs in here, so I am going to get rid of anything with a count of less than 2
by_industry_multi <- by_industry |>
filter(job_count > 1)
#Let's look at the top 10 and bottom 10
by_industry_multi |>
slice_head(n = 10) |>
select(industry, avg_salary) |>
gt() |>
cols_label(
industry = "Industry",
avg_salary = "Average Salary (Thousands)"
) |>
tab_header(title = "Top Average Salaries by Industry")
| Top Average Salaries by Industry | |
| Industry | Average Salary (Thousands) |
|---|---|
| Financial Analytics & Research | 145 |
| Telecommunications Services | 132 |
| Brokerage Services | 129 |
| Internet | 124 |
| Investment Banking & Asset Management | 118 |
| TV Broadcast & Cable Networks | 118 |
| Computer Hardware & Software | 115 |
| Enterprise Software & Network Solutions | 115 |
| Biotech & Pharmaceuticals | 112 |
| Consulting | 109 |
Unsurprisingly, 3/10 are in finance (brokerage, financial analytics, investment banking), industries that tend to pay more. Others are in technology generally, along with media.
by_industry_multi |>
slice_tail(n = 10) |>
select(industry, avg_salary) |>
gt() |>
cols_label(
industry = "Industry",
avg_salary = "Average Salary (Thousands)"
) |>
tab_header(title = "Bottom Average Salaries by Industry")
| Bottom Average Salaries by Industry | |
| Industry | Average Salary (Thousands) |
|---|---|
| Research & Development | 78 |
| Health Care Services & Hospitals | 77 |
| Banks & Credit Unions | 71 |
| Travel Agencies | 70 |
| Construction | 55 |
| Food & Beverage Manufacturing | 53 |
| Architectural & Engineering Services | 50 |
| Gambling | 48 |
| Social Assistance | 48 |
| Telecommunications Manufacturing | 44 |
Some of these are industries where salaries tend to be on the lower side, like food & beverage and social services. Construction and architecture are adjacent fields.
Gambling is interesting – it may offer higher salaries now, given the growth of gambling apps.
If we define valuable as “most opportunities”:
by_industry_multi |>
arrange(desc(job_count)) |>
slice_head(n = 10) |>
select(industry, job_count) |>
gt() |>
cols_align(
align = "center",
columns = job_count
) |>
cols_label(
industry = "Industry",
job_count = "Number of Jobs"
) |>
tab_header(title = "Most Common Industries")
| Most Common Industries | |
| Industry | Number of Jobs |
|---|---|
| Biotech & Pharmaceuticals | 112 |
| Insurance Carriers | 63 |
| Computer Hardware & Software | 59 |
| IT Services | 50 |
| Health Care Services & Hospitals | 49 |
| Enterprise Software & Network Solutions | 42 |
| Internet | 29 |
| Consulting | 29 |
| Advertising & Marketing | 25 |
| Aerospace & Defense | 25 |
by_industry_multi |>
arrange(desc(job_count)) |>
slice_head(n = 10) |>
ggplot(aes(x = reorder(industry, -job_count), y = job_count, fill = industry)) +
geom_col() +
geom_text(aes(label = job_count), vjust = -0.2) +
scale_fill_viridis_d(option = "magma") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none") +
labs(title = "Most Common Industries for Data Science", x = "Industry", y = "Number of Jobs")
According to this, having a biology background could be valuable (biotech & pharmaceuticals). There’s demand in IT and adjacent fields like software and internet, along with insurance carriers (presumably to assess risk).
final_wide <- all_jobs %>%
left_join(all_job_skills, by = "job_id") %>%
mutate(has_skill = 1L) %>%
pivot_wider(
names_from = skill,
values_from = has_skill,
values_fill = list(has_skill = 0L),
values_fn = list(has_skill = max)
)
# Columns from the original all_jobs table (keep these)
orig_cols <- names(all_jobs)
# Skill columns are all columns in final_wide that were *not* in all_jobs
skill_cols <- setdiff(names(final_wide), orig_cols)
# Identify skill columns that are all zeros (no presence across any job)
zero_only <- vapply(final_wide[skill_cols], function(x) {
# handle numeric/logical 0/1; coerce characters like "0"/"1" if needed
if (is.character(x)) x <- suppressWarnings(as.numeric(x))
sum(x, na.rm = TRUE) == 0
}, logical(1))
# Drop unnecessary columns
final_wide_trim <- final_wide %>%
select(
-any_of(skill_cols[zero_only]), # drop skills with only 0
-any_of("NA") # drop the NA column
)
run_univariate_lm <- function(df,
y = "avg_salary_k",
x_cols = 12:34,
na_action = na.exclude,
assign_to_global = FALSE,
sanitize_names = FALSE,
coerce_binary_chars = TRUE) {
# Basic checks
if (!y %in% names(df)) {
stop(sprintf("Response column '%s' not found in data.", y))
}
if (any(x_cols < 1 | x_cols > ncol(df))) {
stop("x_cols index range is out of bounds for the provided data frame.")
}
# Optionally sanitize names to syntactic R names
if (sanitize_names) {
old_names <- names(df)
names(df) <- make.unique(make.names(names(df)))
if (!y %in% names(df)) {
stop("After sanitizing names, the response column was renamed; set y to the new name.")
}
message("Column names were sanitized via make.names() + make.unique().")
}
# Ensure response is numeric (lm works best this way)
if (!is.numeric(df[[y]])) {
suppressWarnings({
df[[y]] <- as.numeric(df[[y]])
})
if (anyNA(df[[y]])) {
warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
}
}
results <- list()
for (i in x_cols) {
xname <- names(df)[i]
xvec <- df[[xname]]
# Skip problematic types (list/data.frame columns)
if (is.list(xvec) && !is.atomic(xvec)) {
message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xname))
next
}
# Optional coercion for binary character columns
if (coerce_binary_chars && is.character(xvec)) {
vals <- unique(na.omit(xvec))
if (all(vals %in% c("0", "1"))) {
xvec_num <- as.numeric(xvec)
df[[xname]] <- xvec_num
}
}
# Skip non-variant predictors
if (all(is.na(xvec)) || length(unique(na.omit(xvec))) < 2) {
message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xname))
next
}
# Build formula with backticks to handle non-syntactic names safely
f <- as.formula(sprintf("`%s` ~ `%s`", y, xname))
# Fit model (try-catch for edge cases)
fit <- try(lm(f, data = df, na.action = na_action), silent = TRUE)
if (inherits(fit, "try-error")) {
message(sprintf("Skipping '%s': lm() failed (%s).", xname, as.character(fit)))
next
}
# Store with name "lm_<column_name>"
obj_name <- paste0("lm_", xname)
results[[obj_name]] <- fit
# Print summary
cat("\n", strrep("=", 72), "\n", sep = "")
cat(sprintf("Model: %s\nFormula: %s\n", obj_name, deparse(f)))
cat(strrep("-", 72), "\n", sep = "")
print(summary(fit))
}
# Optionally assign each model to the global environment
if (assign_to_global && length(results) > 0) {
list2env(results, envir = .GlobalEnv)
message("Models assigned to .GlobalEnv with names 'lm_<column_name>'.")
}
invisible(results)
}
models <- run_univariate_lm(final_wide_trim)
##
## ========================================================================
## Model: lm_python
## Formula: avg_salary_k ~ python
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.476 -25.976 -5.653 20.980 141.347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.976 1.902 46.769 <2e-16 ***
## python 23.677 2.617 9.046 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35.59 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.09957, Adjusted R-squared: 0.09836
## F-statistic: 81.83 on 1 and 740 DF, p-value: < 2.2e-16
##
##
## ========================================================================
## Model: lm_excel
## Formula: avg_salary_k ~ excel
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -83.397 -27.397 -5.321 19.679 149.679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 104.321 1.988 52.467 <2e-16 ***
## excel -5.424 2.750 -1.973 0.0489 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.41 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.005231, Adjusted R-squared: 0.003886
## F-statistic: 3.891 on 1 and 740 DF, p-value: 0.04892
##
##
## ========================================================================
## Model: lm_sas
## Formula: avg_salary_k ~ sas
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.791 -27.601 -5.041 22.084 140.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.291 1.435 69.885 < 2e-16 ***
## sas 13.413 4.812 2.788 0.00545 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.31 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.01039, Adjusted R-squared: 0.009054
## F-statistic: 7.77 on 1 and 740 DF, p-value: 0.005447
##
##
## ========================================================================
## Model: lm_tableau
## Formula: avg_salary_k ~ tableau
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.382 -26.382 -4.629 21.118 151.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.882 1.535 67.039 <2e-16 ***
## tableau -7.007 3.436 -2.039 0.0418 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.4 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.005588, Adjusted R-squared: 0.004244
## F-statistic: 4.158 on 1 and 740 DF, p-value: 0.04179
##
##
## ========================================================================
## Model: lm_spark
## Formula: avg_salary_k ~ spark
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.539 -27.414 -5.347 21.961 155.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.039 1.541 63.611 < 2e-16 ***
## spark 15.308 3.249 4.712 2.93e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.96 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.02913, Adjusted R-squared: 0.02782
## F-statistic: 22.2 on 1 and 740 DF, p-value: 2.929e-06
##
##
## ========================================================================
## Model: lm_sql
## Formula: avg_salary_k ~ sql
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.907 -27.907 -4.058 20.942 152.593
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.4075 1.9714 51.440 <2e-16 ***
## sql 0.1504 2.7547 0.055 0.956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.51 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 4.03e-06, Adjusted R-squared: -0.001347
## F-statistic: 0.002982 on 1 and 740 DF, p-value: 0.9565
##
##
## ========================================================================
## Model: lm_aws
## Formula: avg_salary_k ~ aws
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.378 -26.878 -5.878 22.122 156.122
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 97.878 1.553 63.030 < 2e-16 ***
## aws 15.204 3.189 4.768 2.24e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.94 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.02981, Adjusted R-squared: 0.0285
## F-statistic: 22.74 on 1 and 740 DF, p-value: 2.236e-06
##
##
## ========================================================================
## Model: lm_mongo
## Formula: avg_salary_k ~ mongo
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.376 -27.376 -4.876 20.499 153.124
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.876 1.409 71.591 <2e-16 ***
## mongo 12.205 6.310 1.934 0.0535 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.41 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.00503, Adjusted R-squared: 0.003686
## F-statistic: 3.741 on 1 and 740 DF, p-value: 0.05346
##
##
## ========================================================================
## Model: lm_pytorch
## Formula: avg_salary_k ~ pytorch
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.59 -28.09 -3.84 21.29 152.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.090 1.413 71.532 <2e-16 ***
## pytorch 7.513 6.164 1.219 0.223
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.47 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.002003, Adjusted R-squared: 0.0006547
## F-statistic: 1.485 on 1 and 740 DF, p-value: 0.2233
##
##
## ========================================================================
## Model: lm_tensor
## Formula: avg_salary_k ~ tensor
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.019 -26.894 -4.769 20.662 154.481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.519 1.430 69.576 < 2e-16 ***
## tensor 20.259 4.592 4.412 1.18e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.02 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.02563, Adjusted R-squared: 0.02431
## F-statistic: 19.47 on 1 and 740 DF, p-value: 1.176e-05
##
##
## ========================================================================
## Model: lm_hadoop
## Formula: avg_salary_k ~ hadoop
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.13 -27.13 -4.43 20.87 154.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.629 1.500 66.440 < 2e-16 ***
## hadoop 11.100 3.668 3.026 0.00256 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.28 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.01222, Adjusted R-squared: 0.01089
## F-statistic: 9.158 on 1 and 740 DF, p-value: 0.002563
##
##
## ========================================================================
## Model: lm_keras
## Formula: avg_salary_k ~ keras
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.111 -27.611 -4.111 20.389 153.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.611 1.395 72.109 <2e-16 ***
## keras 22.355 7.058 3.167 0.0016 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.26 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.01338, Adjusted R-squared: 0.01204
## F-statistic: 10.03 on 1 and 740 DF, p-value: 0.001601
##
##
## ========================================================================
## Model: lm_scikit
## Formula: avg_salary_k ~ scikit
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.114 -26.614 -4.464 20.386 154.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.614 1.407 70.795 < 2e-16 ***
## scikit 25.701 5.216 4.927 1.03e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.91 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.03177, Adjusted R-squared: 0.03046
## F-statistic: 24.28 on 1 and 740 DF, p-value: 1.029e-06
##
##
## ========================================================================
## Model: lm_flink
## Formula: avg_salary_k ~ flink
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.609 -27.984 -4.609 21.266 152.891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.109 1.381 73.203 <2e-16 ***
## flink 27.891 11.898 2.344 0.0193 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.37 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.007372, Adjusted R-squared: 0.00603
## F-statistic: 5.496 on 1 and 740 DF, p-value: 0.01933
##
##
## ========================================================================
## Model: lm_google_an
## Formula: avg_salary_k ~ google_an
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.625 -26.625 -4.375 21.125 151.875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.12 1.38 74.03 < 2e-16 ***
## google_an -33.95 10.04 -3.38 0.000763 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.22 on 740 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.0152, Adjusted R-squared: 0.01387
## F-statistic: 11.42 on 1 and 740 DF, p-value: 0.0007626
##
##
## ========================================================================
## Model: lm_r
## Formula: avg_salary_k ~ r
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## r NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_scala
## Formula: avg_salary_k ~ scala
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## scala NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_databricks
## Formula: avg_salary_k ~ databricks
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## databricks NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_gcp
## Formula: avg_salary_k ~ gcp
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## gcp NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_snowflake
## Formula: avg_salary_k ~ snowflake
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## snowflake NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_tensorflow
## Formula: avg_salary_k ~ tensorflow
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## tensorflow NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_xgboost
## Formula: avg_salary_k ~ xgboost
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## xgboost NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
##
##
## ========================================================================
## Model: lm_azure
## Formula: avg_salary_k ~ azure
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -85.985 -27.985 -3.985 21.015 152.515
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.485 1.376 73.75 <2e-16 ***
## azure NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.48 on 741 degrees of freedom
## (11 observations deleted due to missingness)
run_multivar_lm <- function(df,
y = "avg_salary_k",
x_cols = 12:34,
na_action = na.exclude,
sanitize_names = FALSE,
coerce_binary_chars = TRUE,
standardize = FALSE) {
# --- Basic checks ---
if (!y %in% names(df)) {
stop(sprintf("Response column '%s' not found in data.", y))
}
if (any(x_cols < 1 | x_cols > ncol(df))) {
stop("x_cols index range is out of bounds for the provided data frame.")
}
# --- Optionally sanitize column names ---
if (sanitize_names) {
names(df) <- make.unique(make.names(names(df)))
if (!y %in% names(df)) {
stop("After sanitizing names, the response column was renamed; update 'y' to the new name.")
}
message("Column names were sanitized via make.names() + make.unique().")
}
# --- Ensure response is numeric ---
if (!is.numeric(df[[y]])) {
suppressWarnings({
df[[y]] <- as.numeric(df[[y]])
})
if (anyNA(df[[y]])) {
warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
}
}
# --- Collect candidate predictors ---
xnames <- names(df)[x_cols]
# --- Filter: skip problematic types, coerce binary chars, skip constants ---
keep <- logical(length(xnames))
for (i in seq_along(xnames)) {
xn <- xnames[i]
xv <- df[[xn]]
# Skip list/data-frame columns
if (is.list(xv) && !is.atomic(xv)) {
message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xn))
keep[i] <- FALSE
next
}
# Optional coercion: "0"/"1" -> numeric
if (coerce_binary_chars && is.character(xv)) {
vals <- unique(na.omit(xv))
if (length(vals) > 0 && all(vals %in% c("0", "1"))) {
df[[xn]] <- as.numeric(xv)
xv <- df[[xn]]
}
}
# Skip non-variant predictors (all NA or <2 unique non-NA values)
if (all(is.na(xv)) || length(unique(na.omit(xv))) < 2) {
message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xn))
keep[i] <- FALSE
next
}
keep[i] <- TRUE
}
xnames <- xnames[keep]
if (length(xnames) == 0) {
stop("No valid predictors found in the selected range after filtering.")
}
# --- Optional standardization of numeric predictors ---
if (standardize) {
for (xn in xnames) {
if (is.numeric(df[[xn]])) {
df[[xn]] <- as.numeric(scale(df[[xn]]))
}
}
message("Numeric predictors standardized (mean=0, sd=1).")
}
# --- Build safe formula with backticks ---
backtick <- function(nm) paste0("`", nm, "`")
rhs <- paste(vapply(xnames, backtick, ""), collapse = " + ")
f <- as.formula(paste(backtick(y), "~", rhs))
# --- Fit model ---
fit <- lm(f, data = df, na.action = na_action)
# --- Print concise diagnostics ---
cat("\n", strrep("=", 72), "\n", sep = "")
cat("Multivariable Linear Model\n")
cat(sprintf("Formula: %s\n", deparse(f)))
cat(strrep("-", 72), "\n", sep = "")
s <- summary(fit)
print(s)
# Extra diagnostics
cat(strrep("-", 72), "\n", sep = "")
cat(sprintf("Observations used: %d\n", s$df[1] + s$df[2]))
cat(sprintf("Residual df: %d\n", s$df[2]))
cat(sprintf("Residual SE: %.3f\n", s$sigma))
cat(sprintf("Multiple R-squared: %.5f\n", s$r.squared))
cat(sprintf("Adjusted R-squared: %.5f\n", s$adj.r.squared))
suppressWarnings({
aic_val <- try(AIC(fit), silent = TRUE)
if (!inherits(aic_val, "try-error")) {
cat(sprintf("AIC: %.3f\n", aic_val))
}
})
invisible(fit)
}
mmodel <- run_multivar_lm(final_wide_trim)
##
## ========================================================================
## Multivariable Linear Model
## Formula: avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## Formula: aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## Formula: flink + google_an + r + scala + databricks + gcp + snowflake +
## Formula: tensorflow + xgboost + azure
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = f, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -74.135 -23.389 -6.199 19.860 125.237
##
## Coefficients: (8 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.184 2.443 37.739 < 2e-16 ***
## python 22.627 2.928 7.729 3.63e-14 ***
## excel -2.549 2.579 -0.988 0.323313
## sas 18.499 4.675 3.957 8.35e-05 ***
## tableau -7.601 3.515 -2.162 0.030920 *
## spark -1.442 4.071 -0.354 0.723305
## sql -10.613 3.060 -3.468 0.000555 ***
## aws 9.877 3.258 3.031 0.002522 **
## mongo 7.776 6.173 1.260 0.208133
## pytorch -18.653 7.584 -2.459 0.014149 *
## tensor 15.412 6.682 2.306 0.021371 *
## hadoop 4.712 4.207 1.120 0.263017
## keras -3.025 8.340 -0.363 0.716898
## scikit 11.969 6.222 1.924 0.054778 .
## flink 4.294 11.486 0.374 0.708604
## google_an -28.312 9.386 -3.016 0.002646 **
## r NA NA NA NA
## scala NA NA NA NA
## databricks NA NA NA NA
## gcp NA NA NA NA
## snowflake NA NA NA NA
## tensorflow NA NA NA NA
## xgboost NA NA NA NA
## azure NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.21 on 726 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.1841, Adjusted R-squared: 0.1672
## F-statistic: 10.92 on 15 and 726 DF, p-value: < 2.2e-16
##
## ------------------------------------------------------------------------
## Observations used: 742
## Residual df: 726
## Residual SE: 34.206
## Multiple R-squared: 0.18407
## Adjusted R-squared: 0.16721
## AIC: 7365.593
run_stepwise_backward_lm <- function(df,
y = "avg_salary_k",
x_cols = 12:34,
na_action = na.exclude,
sanitize_names = FALSE,
coerce_binary_chars = TRUE,
standardize = FALSE,
criterion = c("AIC", "BIC"),
trace = TRUE) {
criterion <- match.arg(criterion)
# --- Basic checks ---
if (!y %in% names(df)) {
stop(sprintf("Response column '%s' not found in data.", y))
}
if (any(x_cols < 1 | x_cols > ncol(df))) {
stop("x_cols index range is out of bounds for the provided data frame.")
}
# --- Optionally sanitize column names ---
if (sanitize_names) {
names(df) <- make.unique(make.names(names(df)))
if (!y %in% names(df)) {
stop("After sanitizing names, the response column was renamed; update 'y' to the new name.")
}
message("Column names were sanitized via make.names() + make.unique().")
}
# --- Ensure response is numeric ---
if (!is.numeric(df[[y]])) {
suppressWarnings({
df[[y]] <- as.numeric(df[[y]])
})
if (anyNA(df[[y]])) {
warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
}
}
# --- Collect candidate predictors ---
xnames <- names(df)[x_cols]
# --- Filter: skip problematic types, coerce binary chars, skip constants ---
keep <- logical(length(xnames))
for (i in seq_along(xnames)) {
xn <- xnames[i]
xv <- df[[xn]]
# Skip list/data-frame columns
if (is.list(xv) && !is.atomic(xv)) {
message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xn))
keep[i] <- FALSE
next
}
# Optional coercion: "0"/"1" -> numeric
if (coerce_binary_chars && is.character(xv)) {
vals <- unique(na.omit(xv))
if (length(vals) > 0 && all(vals %in% c("0", "1"))) {
df[[xn]] <- as.numeric(xv)
xv <- df[[xn]]
}
}
# Skip non-variant predictors (all NA or <2 unique non-NA values)
if (all(is.na(xv)) || length(unique(na.omit(xv))) < 2) {
message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xn))
keep[i] <- FALSE
next
}
keep[i] <- TRUE
}
xnames <- xnames[keep]
if (length(xnames) == 0) {
stop("No valid predictors found in the selected range after filtering.")
}
# --- Optional standardization of numeric predictors ---
if (standardize) {
for (xn in xnames) {
if (is.numeric(df[[xn]])) {
df[[xn]] <- as.numeric(scale(df[[xn]]))
}
}
message("Numeric predictors standardized (mean=0, sd=1).")
}
# --- Build safe formula with backticks ---
backtick <- function(nm) paste0("`", nm, "`")
rhs <- paste(vapply(xnames, backtick, ""), collapse = " + ")
f_full <- as.formula(paste(backtick(y), "~", rhs))
# --- Fit full model ---
fit_full <- lm(f_full, data = df, na.action = na_action)
# --- Choose penalty k (AIC vs BIC) ---
n <- nobs(fit_full) # number of observations actually used
k <- if (criterion == "AIC") 2 else log(n)
# --- Backward stepwise using stats::step ---
fit_step <- step(fit_full,
direction = "backward",
trace = as.integer(trace),
k = k)
# --- Print concise diagnostics for the final selected model ---
cat("\n", strrep("=", 72), "\n", sep = "")
cat(sprintf("Backward Stepwise Linear Model (criterion: %s, k = %.3f)\n", criterion, k))
cat(sprintf("Final Formula: %s\n", deparse(formula(fit_step))))
cat(strrep("-", 72), "\n", sep = "")
s <- summary(fit_step)
print(s)
# Which predictors were selected?
selected <- attr(terms(fit_step), "term.labels")
dropped <- setdiff(xnames, selected)
cat(strrep("-", 72), "\n", sep = "")
cat("Selected predictors:\n")
if (length(selected) > 0) {
cat(paste0(" - ", selected), sep = "\n")
} else {
cat(" (none; intercept-only model)\n")
}
cat("\nDropped predictors:\n")
if (length(dropped) > 0) {
cat(paste0(" - ", dropped), sep = "\n")
} else {
cat(" (none dropped)\n")
}
# Extra metrics
cat("\n", strrep("-", 72), "\n", sep = "")
cat(sprintf("Observations used: %d\n", s$df[1] + s$df[2]))
cat(sprintf("Residual df: %d\n", s$df[2]))
cat(sprintf("Residual SE: %.3f\n", s$sigma))
cat(sprintf("Multiple R-squared: %.5f\n", s$r.squared))
cat(sprintf("Adjusted R-squared: %.5f\n", s$adj.r.squared))
cat(sprintf("%s: %.3f\n", criterion, AIC(fit_step) + (k - 2) * length(coef(fit_step))))
# Note: AIC() returns AIC with k = 2. For BIC, we report adjusted by (k-2)*p.
invisible(fit_step)
}
stepmodel <- run_stepwise_backward_lm(final_wide_trim)
## Start: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks + gcp + snowflake +
## tensorflow + xgboost + azure
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks + gcp + snowflake +
## tensorflow + xgboost
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks + gcp + snowflake +
## tensorflow
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks + gcp + snowflake
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks + gcp
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala + databricks
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r + scala
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an + r
##
##
## Step: AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql +
## aws + mongo + pytorch + tensor + hadoop + keras + scikit +
## flink + google_an
##
## Df Sum of Sq RSS AIC
## - spark 1 147 849581 5256.0
## - keras 1 154 849588 5256.0
## - flink 1 164 849597 5256.0
## - excel 1 1143 850577 5256.9
## - hadoop 1 1468 850902 5257.2
## - mongo 1 1857 851291 5257.5
## <none> 849434 5257.9
## - scikit 1 4330 853764 5259.7
## - tableau 1 5471 854904 5260.7
## - tensor 1 6224 855657 5261.3
## - pytorch 1 7077 856511 5262.0
## - google_an 1 10646 860080 5265.1
## - aws 1 10750 860184 5265.2
## - sql 1 14071 863505 5268.1
## - sas 1 18316 867750 5271.7
## - python 1 69885 919319 5314.6
##
## Step: AIC=5256.02
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo +
## pytorch + tensor + hadoop + keras + scikit + flink + google_an
##
## Df Sum of Sq RSS AIC
## - keras 1 127 849708 5254.1
## - flink 1 133 849713 5254.1
## - excel 1 1101 850681 5255.0
## - hadoop 1 1346 850927 5255.2
## - mongo 1 1741 851321 5255.5
## <none> 849581 5256.0
## - scikit 1 4195 853776 5257.7
## - tableau 1 5335 854916 5258.7
## - tensor 1 6111 855692 5259.3
## - pytorch 1 6936 856516 5260.0
## - google_an 1 10602 860182 5263.2
## - aws 1 10604 860184 5263.2
## - sql 1 14507 864087 5266.6
## - sas 1 18508 868089 5270.0
## - python 1 70982 920563 5313.6
##
## Step: AIC=5254.13
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo +
## pytorch + tensor + hadoop + scikit + flink + google_an
##
## Df Sum of Sq RSS AIC
## - flink 1 170 849878 5252.3
## - excel 1 1103 850811 5253.1
## - hadoop 1 1404 851112 5253.4
## - mongo 1 1690 851398 5253.6
## <none> 849708 5254.1
## - scikit 1 4156 853863 5255.7
## - tableau 1 5425 855133 5256.9
## - tensor 1 6241 855949 5257.6
## - pytorch 1 6876 856584 5258.1
## - aws 1 10481 860189 5261.2
## - google_an 1 10726 860434 5261.4
## - sql 1 14388 864096 5264.6
## - sas 1 18787 868495 5268.4
## - python 1 70881 920589 5311.6
##
## Step: AIC=5252.28
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo +
## pytorch + tensor + hadoop + scikit + google_an
##
## Df Sum of Sq RSS AIC
## - excel 1 1110 850988 5251.2
## - hadoop 1 1544 851421 5251.6
## - mongo 1 1686 851564 5251.7
## <none> 849878 5252.3
## - scikit 1 4135 854013 5253.9
## - tableau 1 5477 855355 5255.0
## - tensor 1 6754 856631 5256.1
## - pytorch 1 7331 857209 5256.6
## - google_an 1 10718 860596 5259.6
## - aws 1 10948 860826 5259.8
## - sql 1 14822 864699 5263.1
## - sas 1 18856 868734 5266.6
## - python 1 71124 921002 5309.9
##
## Step: AIC=5251.24
## avg_salary_k ~ python + sas + tableau + sql + aws + mongo + pytorch +
## tensor + hadoop + scikit + google_an
##
## Df Sum of Sq RSS AIC
## - mongo 1 1515 852503 5250.6
## - hadoop 1 1554 852542 5250.6
## <none> 850988 5251.2
## - scikit 1 4060 855049 5252.8
## - tableau 1 5993 856981 5254.5
## - tensor 1 6998 857986 5255.3
## - pytorch 1 7292 858280 5255.6
## - aws 1 11034 862022 5258.8
## - google_an 1 11498 862486 5259.2
## - sql 1 15412 866400 5262.6
## - sas 1 18820 869808 5265.5
## - python 1 72221 923209 5309.7
##
## Step: AIC=5250.56
## avg_salary_k ~ python + sas + tableau + sql + aws + pytorch +
## tensor + hadoop + scikit + google_an
##
## Df Sum of Sq RSS AIC
## - hadoop 1 1980 854483 5250.3
## <none> 852503 5250.6
## - scikit 1 4119 856621 5252.1
## - tableau 1 5556 858059 5253.4
## - tensor 1 7143 859646 5254.8
## - pytorch 1 7683 860185 5255.2
## - google_an 1 11455 863958 5258.5
## - aws 1 12616 865119 5259.5
## - sql 1 14645 867148 5261.2
## - sas 1 20120 872623 5265.9
## - python 1 71465 923968 5308.3
##
## Step: AIC=5250.29
## avg_salary_k ~ python + sas + tableau + sql + aws + pytorch +
## tensor + scikit + google_an
##
## Df Sum of Sq RSS AIC
## <none> 854483 5250.3
## - scikit 1 5112 859595 5252.7
## - tableau 1 5621 860104 5253.2
## - tensor 1 6544 861027 5253.9
## - pytorch 1 8208 862691 5255.4
## - google_an 1 11679 866162 5258.4
## - sql 1 13033 867516 5259.5
## - aws 1 15824 870307 5261.9
## - sas 1 18930 873413 5264.5
## - python 1 74733 929216 5310.5
##
## ========================================================================
## Backward Stepwise Linear Model (criterion: AIC, k = 2.000)
## Final Formula: avg_salary_k ~ python + sas + tableau + sql + aws + pytorch +
## Final Formula: tensor + scikit + google_an
## ------------------------------------------------------------------------
##
## Call:
## lm(formula = avg_salary_k ~ python + sas + tableau + sql + aws +
## pytorch + tensor + scikit + google_an, data = df, na.action = na_action)
##
## Residuals:
## Min 1Q Median 3Q Max
## -75.452 -23.502 -5.291 19.320 123.744
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.952 2.089 43.543 < 2e-16 ***
## python 22.804 2.850 8.001 4.82e-15 ***
## sas 18.520 4.599 4.027 6.24e-05 ***
## tableau -7.582 3.455 -2.194 0.028524 *
## sql -9.796 2.932 -3.341 0.000876 ***
## aws 11.327 3.076 3.682 0.000249 ***
## pytorch -19.513 7.359 -2.652 0.008181 **
## tensor 14.230 6.010 2.368 0.018155 *
## scikit 12.114 5.788 2.093 0.036717 *
## google_an -29.472 9.318 -3.163 0.001626 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34.17 on 732 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.1792, Adjusted R-squared: 0.1691
## F-statistic: 17.76 on 9 and 732 DF, p-value: < 2.2e-16
##
## ------------------------------------------------------------------------
## Selected predictors:
## - python
## - sas
## - tableau
## - sql
## - aws
## - pytorch
## - tensor
## - scikit
## - google_an
##
## Dropped predictors:
## - excel
## - spark
## - mongo
## - hadoop
## - keras
## - flink
## - r
## - scala
## - databricks
## - gcp
## - snowflake
## - tensorflow
## - xgboost
## - azure
##
## ------------------------------------------------------------------------
## Observations used: 742
## Residual df: 732
## Residual SE: 34.166
## Multiple R-squared: 0.17922
## Adjusted R-squared: 0.16912
## AIC: 7357.991
Google Gemini. (2026). Gemini 3 Flash \[Large language model\]. https://gemini.google.com. Accessed March 20, 2026.