Data Sources

Kaggle Data Science Job Data set (Primary) : data_cleaned csv (on Github repository)

Why this?

Large scale job posting dataset (job titles, salaries, skill flags (Python, SQL, AWS, Tableau etc.) and job descriptions - great structured columns ready to be analyzed, salary ranges/average - good for exploring skill-salary association, Job description for texting mining soft skills, Many job posting with variety, technical skills listed can be used for statistical relationship analysis
BuiltIn Job Postings (Supplement Data Source): SoFi, TempusAI, Grow Therapy, Braze, PRC..etc
Supplement Skill Reference Articles (Qualitative reference sources),
1. NYIT Online – *Data Science Skills*
https://online.nyit.edu/blog/data-science-skills

provides overview of core technical and soft skills expected of data science professionals (programming, statistics, communication, and domain knowledge)
1. Pragmatic Institute – *Must‑Have Data Science Skills for 2023*
https://www.pragmaticinstitute.com/resources/articles/data/must-have-data-science-skills-for-2023/

Highlights industry‑aligned skill requirements, modern tools, business communication, and deployment skills.

Why this?

Up to date skill requirements reflecting 2025-2026 job market, Includes qualitative details (soft skill, modern tools, expectation. Great for comparison between Kaggle dataset from the past and present day job demands on what quality is still the same

Kaggle Data Acquistion Extraction Plan:

load from csv on github, extract skill indicators, salary information for skill to salary analysis, Soft skill via text search (Teamwork, Communication..etc), grouping job titles (data scientist vs analyst vs engineers)

BuiltIn Job Posting Extraction Plan:

Scrapping –> Tools (PowerBI, Tableu etc), Soft skills (Presentation, communication, collaboration et), Salary, Job description text, Required Programming Skills (R, Python, SQL, Scala etc), Industry domain, Job seniority

Storing dataset in standardized columns

job_idsource = “builtin”

company

job_title

skill_list

soft_skills

salary_range

job_description

date_collected

Load Dataset and required packages

# For data wrangling
#install.packages('polite')
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.5.2

## Warning: package 'ggplot2' was built under R version 4.5.2

## Warning: package 'tibble' was built under R version 4.5.2

## Warning: package 'readr' was built under R version 4.5.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## Warning: package 'janitor' was built under R version 4.5.2

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# Website scrapping, HTML parsing (polite to avoid blockage)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:readr':
## 
##     guess_encoding

library(polite)

## Warning: package 'polite' was built under R version 4.5.2

# 607 Project 3 Main Github repository


kaggle_url <- "https://raw.githubusercontent.com/meiqing39/607-Project-3/main/data_cleaned_2021.csv"

kg <- read_csv(kaggle_url, show_col_types = FALSE) |> 
  clean_names()   # <- renames by snake_case format, removes special character

#Ensures consistent and reproducible references to variable by using janitor function clean_names()

# Tibble Resource table
sources_roles <- tibble(
  Source = c(
    "Kaggle dataset (structured job postings)",
    "BuiltIn job postings (recent, unstructured)",
    "NYIT article (skills overview)",
    "Pragmatic Institute article (skills 2023)"
  ),
  Role = c(
    "Primary quantitative dataset used for skill frequency and salary association analyses",
    "Supplemental contemporary postings used to validate/contrast 2025–2026 skill demand",
    "Qualitative reference to define/justify skill taxonomy (not stored in DB)",
    "Qualitative reference to contextualize/interpret findings (not stored in DB)"
  )
)

print(sources_roles)

## # A tibble: 4 × 2
##   Source                                      Role                              
##   <chr>                                       <chr>                             
## 1 Kaggle dataset (structured job postings)    Primary quantitative dataset used…
## 2 BuiltIn job postings (recent, unstructured) Supplemental contemporary posting…
## 3 NYIT article (skills overview)              Qualitative reference to define/j…
## 4 Pragmatic Institute article (skills 2023)   Qualitative reference to contextu…

Entity-Relationship (ER) Diagram

ER diagram shows the normalized database design used to store job postings, technical skills, and soft skills collected from the Kaggle dataset and BuiltIn job postings.

Normalize Kaggle CSV - relational connection job_id

job level attributes (title, company, industry, salary) extracted into jobs table
Technical skills: extract from binary columns, reshape into longer format with pivot longer function connect each job with required skills
job description test extracted without modding for later soft skill text analysis

# Extract job level data
jobs <- kg |> 
  transmute(
    job_id = row_number(),
    job_title,
    company = company_name,
    location,
    industry,
    avg_salary_k,
    source = "kaggle"
  )

# Extract job description (soft skill use)
job_descriptions <- kg |> 
  transmute(
    job_id = row_number(),
    job_description
  )

# Id skill columns

skill_cols <- names(kg)[
  names(kg) %in% c(
    "python","sql","aws","excel","tableau","spark",
    "sas","pytorch","keras","scikit","tensor",
    "hadoop","flink","mongo","google_an"
  )
]



# Combine created skill_cols to job_id to make job_skills table
job_skills <- kg |> 
  mutate(job_id = row_number()) |> 
  select(job_id, all_of(skill_cols)) |> 
  pivot_longer(
    cols = -job_id,
    names_to = "skill",
    values_to = "present"
  ) |> 
  filter(present == 1) |> 
  select(job_id, skill)

Job Posting Web Scrapping

# Create Scrapping function for scrapping of job title and job description later

get1 <- function(html, css) {
  node <- html_element(html, css)
  if (inherits(node, "xml_missing") || is.na(node)) return("")
  html_text2(node)
}

first_nonempty <- function(html, selectors) {
  for (sel in selectors) {
    txt <- get1(html, sel)
    if (nzchar(txt)) return(txt)
  }
  ""
}

scrape_builtin <- function(url, ua = "607-Project-3/edu (rvest+polite)") {
  ses <- polite::bow(url, user_agent = ua, force = TRUE)
  pg  <- polite::scrape(ses)

  # Title
  title <- first_nonempty(pg, c("h1", "meta[property='og:title']"))
  if (!nzchar(title)) title <- NA_character_

  # Description: try a few common containers, fall back to main/body
  desc <- first_nonempty(pg, c(
    "article",
    "[data-testid='job-description']",
    "[data-qa='job-description']",
    "div[class*='job-description']",
    "section[class*='description']"
  ))
  if (!nzchar(desc)) desc <- get1(pg, "[role='main']")
  if (!nzchar(desc)) desc <- get1(pg, "body")
  if (!nzchar(desc)) desc <- NA_character_

  tibble(
    job_title       = title,
    job_description = desc,
    job_url         = url
  )
}

# Organize all job posting URLs into a tibble

builtin_urls <- tibble(
  job_url = c(
    "https://builtin.com/job/manager-data-scientist-alternate-data-strategy/8162509",
    "https://builtin.com/job/principal-associate-data-scientist-ai-software-engineering/8017518",
    "https://builtin.com/job/principal-associate-data-scientist-us-card-new-credit-team/8029691",
    "https://builtin.com/job/senior-associate-data-scientist/8432791",
    "https://builtin.com/job/data-scientist-ii-real-world-evidence-rwe-pharma-r-d/8031762",
    "https://builtin.com/job/staff-data-scientist-growth-care/8019471",
    "https://www.builtinnyc.com/job/senior-staff-data-scientist-invest/8697390",
    "https://www.builtinnyc.com/job/staff-data-scientist-borrow/8290699",
    "https://www.builtinnyc.com/job/field-data-scientist-ai-deployment/7405417",
    "https://www.builtinnyc.com/job/ai-genai-data-scientist-senior-manager/8258875",
    "https://www.builtinnyc.com/job/data-scientist/8446518"
  ))


# Scrape --> builtin_raw 
safe_scrape <- purrr::safely(scrape_builtin)

builtin_raw <- builtin_urls |> 
  mutate(res = map(job_url, safe_scrape)) |> 
  transmute(
    job_url,
    job_title  = map_chr(res, ~ .x$result$job_title %||% NA_character_),
    job_description = map_chr(res, ~ .x$result$job_description %||% NA_character_))


# Double checking that job description returns with page body text for skill, soft skill keywords, and frequency comparison later - text number appear = good 

nchar(builtin_raw$job_description)

##  [1] 12241 12498 12419 12083 13239 13691 10203 10779 12258 11764 12244

# Scrape and  Manually add company and industry from data seen on job posting website
builtin_jobs <- builtin_raw |> 
  mutate(
    company = case_when(
      str_detect(job_url, "capital|cap1") ~ "Capital One",
      str_detect(job_url, "nylife|new-york-life|senior-associate-data-scientist/8432791")
~ "NY Life",
      str_detect(job_url, "tempus") ~ "Tempus AI",
      str_detect(job_url, "grow") ~ "Grow Therapy",
      str_detect(job_url, "sofi") ~ "SoFi",
      str_detect(job_url, "braze") ~ "Braze",
      str_detect(job_url, "prc") ~ "PRC",
      str_detect(job_url, "octus") ~ "Octus",
      TRUE ~ NA_character_
    ),
    industry = case_when(
      company %in% c("Capital One", "SoFi", "NY Life") ~ "Finance",
company == "Tempus AI" ~ "Healthcare / Pharma",
      company == "Grow Therapy" ~ "Healthcare",
      company == "Braze" ~ "SaaS / Martech",
      company %in% c("PRC", "Octus") ~ "Consulting",
      TRUE ~ NA_character_
    ),
    source = "builtin",
    job_id = row_number()
  ) %>%
  select(job_id, job_title, company, industry, source, job_url, job_description)


# extract skills seen on job descriptions
skill_keywords <- c(
  # languages
  "python","r","sql","scala","java",
  # libraries / ml
  "pytorch","tensorflow","scikit","sklearn","xgboost",
  # platforms
  "aws","azure","gcp","databricks","snowflake","spark",
  # bi / viz
  "tableau","power bi")


builtin_job_skills <- builtin_jobs %>%
  transmute(job_id, text = tolower(job_description %||% "")) %>%
  filter(nzchar(text)) %>%
  crossing(skill = skill_keywords) %>%
  mutate(rx = paste0("\\b", skill, "\\b")) %>%
  filter(str_detect(text, regex(rx, ignore_case = TRUE))) %>%
  distinct(job_id, skill)

Table result to ensure data was extracted and organized

head(builtin_jobs)

## # A tibble: 6 × 7
##   job_id job_title               company industry source job_url job_description
##    <int> <chr>                   <chr>   <chr>    <chr>  <chr>   <chr>          
## 1      1 Manager, Data Scientis… <NA>    <NA>     built… https:… "https://ad.do…
## 2      2 Principal Associate, D… <NA>    <NA>     built… https:… "https://ad.do…
## 3      3 Principal Associate, D… <NA>    <NA>     built… https:… "https://ad.do…
## 4      4 Senior Associate - Dat… NY Life Finance  built… https:… "New York Life…
## 5      5 Data Scientist II, Rea… <NA>    <NA>     built… https:… "Tempus AI\nDa…
## 6      6 Staff Data Scientist -… Grow T… Healthc… built… https:… "Grow Therapy\…

builtin_job_skills |>
  count(skill) |> 
  arrange(desc(n))

## # A tibble: 14 × 2
##    skill          n
##    <chr>      <int>
##  1 python        10
##  2 sql            9
##  3 aws            7
##  4 gcp            4
##  5 r              3
##  6 spark          3
##  7 snowflake      2
##  8 tableau        2
##  9 azure          1
## 10 databricks     1
## 11 scala          1
## 12 scikit         1
## 13 tensorflow     1
## 14 xgboost        1

##Extracting csv file from builtin_jobs and builtin_job_skills

write_csv(builtin_jobs, "builtin_jobs.csv")
write_csv(builtin_job_skills, "builtin_job_skills.csv")

I scraped job posting descriptions from Builtin to extract modern technical and soft skills using keyword matching while also understanding that some attributes (salary, explicit seniority levels) are inconsistent available on job posting websites. Salary based analyses rely mostly on the structured kaggle dataset, while the Builtin job posting provide up to date information on skill demands and tools used as data scientists. Industry classification were assigned manually based on company information.

pacman::p_load(janitor,
               tidyverse)

#library(tidyverse) Moved into pacman statement package to make it easier to open up and install the latest version
#library(dplyr) Removed this library because it is already included in tidyverse
#library(readr) Removed this library because it is already included in tidyverse
#library(ggplot2) Removed this library because it is already included in tidyverse
#library(janitor) Moved into pacman statement package to make it easier to open up and install the latest version

kaggle_url <- "https://raw.githubusercontent.com/meiqing39/607-Project-3/refs/heads/main/data_cleaned_2021.csv"
kg <- read_csv(kaggle_url, show_col_types = FALSE) |> 
  clean_names()   # <- renames by snake_case format, removes special character

#Ensures consistent and reproducible references to variable by using janitor function clean_names()
kg

## # A tibble: 742 × 42
##    index job_title  salary_estimate job_description rating company_name location
##    <dbl> <chr>      <chr>           <chr>            <dbl> <chr>        <chr>   
##  1     0 Data Scie… $53K-$91K (Gla… "Data Scientis…    3.8 "Tecolote R… Albuque…
##  2     1 Healthcar… $63K-$112K (Gl… "What You Will…    3.4 "University… Linthic…
##  3     2 Data Scie… $80K-$90K (Gla… "KnowBe4, Inc.…    4.8 "KnowBe4\n4… Clearwa…
##  4     3 Data Scie… $56K-$97K (Gla… "*Organization…    3.8 "PNNL\n3.8"  Richlan…
##  5     4 Data Scie… $86K-$143K (Gl… "Data Scientis…    2.9 "Affinity S… New Yor…
##  6     5 Data Scie… $71K-$119K (Gl… "CyrusOne is s…    3.4 "CyrusOne\n… Dallas,…
##  7     6 Data Scie… $54K-$93K (Gla… "Job Descripti…    4.1 "ClearOne A… Baltimo…
##  8     7 Data Scie… $86K-$142K (Gl… "Advanced Anal…    3.8 "Logic20/20… San Jos…
##  9     8 Research … $38K-$84K (Gla… "SUMMARY\n\nTh…    3.3 "Rochester … Rochest…
## 10     9 Data Scie… $120K-$160K (G… "isn’t your us…    4.6 "<intent>\n… New Yor…
## # ℹ 732 more rows
## # ℹ 35 more variables: headquarters <chr>, size <chr>, founded <dbl>,
## #   type_of_ownership <chr>, industry <chr>, sector <chr>, revenue <chr>,
## #   competitors <chr>, hourly <dbl>, employer_provided <dbl>,
## #   lower_salary <dbl>, upper_salary <dbl>, avg_salary_k <dbl>,
## #   company_txt <chr>, job_location <chr>, age <dbl>, python <dbl>,
## #   spark <dbl>, aws <dbl>, excel <dbl>, sql <dbl>, sas <dbl>, keras <dbl>, …

colnames(kg)

##  [1] "index"              "job_title"          "salary_estimate"   
##  [4] "job_description"    "rating"             "company_name"      
##  [7] "location"           "headquarters"       "size"              
## [10] "founded"            "type_of_ownership"  "industry"          
## [13] "sector"             "revenue"            "competitors"       
## [16] "hourly"             "employer_provided"  "lower_salary"      
## [19] "upper_salary"       "avg_salary_k"       "company_txt"       
## [22] "job_location"       "age"                "python"            
## [25] "spark"              "aws"                "excel"             
## [28] "sql"                "sas"                "keras"             
## [31] "pytorch"            "scikit"             "tensor"            
## [34] "hadoop"             "tableau"            "bi"                
## [37] "flink"              "mongo"              "google_an"         
## [40] "job_title_sim"      "seniority_by_title" "degree"

#head(10)

Data quality issues

Several variables contain placeholder values that represent missing data rather than valid observations. For example, Rating, Founded, and Age include values of -1, while variables such as Competitors, seniority_by_title, and Degree use strings such as "-1" or "na" to indicate unavailable information. These values must be recoded as missing before further analysis.

Summary of the initial inspection

The dataset contains 742 job postings and 42 variables. These variables include job and company descriptors, salary information, and binary indicators for technical skills such as Python, SQL, AWS, and Tableau. Initial inspection using glimpse() and summary() showed that the dataset contains both raw text variables and derived variables that are more suitable for analysis.

Skill demand

The binary skill columns provide a practical proxy for skill demand in job postings. Since these variables are coded as 0 and 1, their means can be interpreted as the proportion of postings that require each skill. Preliminary results suggest that Python, Excel, and SQL are among the most frequently requested skills in the dataset.

Outliers and unusual values

Exploratory summaries also suggest the presence of extreme values. For instance, company age ranges up to 277 years, and average salary ranges from 15.5 to 254 thousand dollars. While some of these values may be valid, they require further inspection through visualizations such as boxplots and histograms.

#Re-coding placeholder values as missing

Standardize Kaggle jobs

kaggle_jobs <- kg |>
  transmute(
    job_id = paste0("kg_", row_number()),
    source = "kaggle",
    job_title,
    company = company_name,
    location,
    industry,
    avg_salary_k,
    salary_min_k = lower_salary,
    salary_max_k = upper_salary,
    job_description,
    job_url = NA_character_
  )
kaggle_jobs

## # A tibble: 742 × 11
##    job_id source job_title   company location industry avg_salary_k salary_min_k
##    <chr>  <chr>  <chr>       <chr>   <chr>    <chr>           <dbl>        <dbl>
##  1 kg_1   kaggle Data Scien… "Tecol… Albuque… Aerospa…         72             53
##  2 kg_2   kaggle Healthcare… "Unive… Linthic… Health …         87.5           63
##  3 kg_3   kaggle Data Scien… "KnowB… Clearwa… Securit…         85             80
##  4 kg_4   kaggle Data Scien… "PNNL\… Richlan… Energy           76.5           56
##  5 kg_5   kaggle Data Scien… "Affin… New Yor… Adverti…        114.            86
##  6 kg_6   kaggle Data Scien… "Cyrus… Dallas,… Real Es…         95             71
##  7 kg_7   kaggle Data Scien… "Clear… Baltimo… Banks &…         73.5           54
##  8 kg_8   kaggle Data Scien… "Logic… San Jos… Consult…        114             86
##  9 kg_9   kaggle Research S… "Roche… Rochest… Health …         61             38
## 10 kg_10  kaggle Data Scien… "<inte… New Yor… Internet        140            120
## # ℹ 732 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>

Standardize BuiltIn jobs

builtin_jobs_clean <- builtin_jobs |>
  transmute(
    job_id = paste0("bi_", row_number()),
    source = "builtin",
    job_title,
    company,
    location = NA_character_,
    industry,
    avg_salary_k = NA_real_,
    salary_min_k = NA_real_,
    salary_max_k = NA_real_,
    job_description,
    job_url
  )
builtin_jobs_clean

## # A tibble: 11 × 11
##    job_id source  job_title  company location industry avg_salary_k salary_min_k
##    <chr>  <chr>   <chr>      <chr>   <chr>    <chr>           <dbl>        <dbl>
##  1 bi_1   builtin Manager, … <NA>    <NA>     <NA>               NA           NA
##  2 bi_2   builtin Principal… <NA>    <NA>     <NA>               NA           NA
##  3 bi_3   builtin Principal… <NA>    <NA>     <NA>               NA           NA
##  4 bi_4   builtin Senior As… NY Life <NA>     Finance            NA           NA
##  5 bi_5   builtin Data Scie… <NA>    <NA>     <NA>               NA           NA
##  6 bi_6   builtin Staff Dat… Grow T… <NA>     Healthc…           NA           NA
##  7 bi_7   builtin Senior St… <NA>    <NA>     <NA>               NA           NA
##  8 bi_8   builtin Staff Dat… <NA>    <NA>     <NA>               NA           NA
##  9 bi_9   builtin Forward-D… <NA>    <NA>     <NA>               NA           NA
## 10 bi_10  builtin AI & GenA… <NA>    <NA>     <NA>               NA           NA
## 11 bi_11  builtin Data Scie… <NA>    <NA>     <NA>               NA           NA
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>

Merge Job postings

all_jobs <- bind_rows(kaggle_jobs, builtin_jobs_clean)
all_jobs

## # A tibble: 753 × 11
##    job_id source job_title   company location industry avg_salary_k salary_min_k
##    <chr>  <chr>  <chr>       <chr>   <chr>    <chr>           <dbl>        <dbl>
##  1 kg_1   kaggle Data Scien… "Tecol… Albuque… Aerospa…         72             53
##  2 kg_2   kaggle Healthcare… "Unive… Linthic… Health …         87.5           63
##  3 kg_3   kaggle Data Scien… "KnowB… Clearwa… Securit…         85             80
##  4 kg_4   kaggle Data Scien… "PNNL\… Richlan… Energy           76.5           56
##  5 kg_5   kaggle Data Scien… "Affin… New Yor… Adverti…        114.            86
##  6 kg_6   kaggle Data Scien… "Cyrus… Dallas,… Real Es…         95             71
##  7 kg_7   kaggle Data Scien… "Clear… Baltimo… Banks &…         73.5           54
##  8 kg_8   kaggle Data Scien… "Logic… San Jos… Consult…        114             86
##  9 kg_9   kaggle Research S… "Roche… Rochest… Health …         61             38
## 10 kg_10  kaggle Data Scien… "<inte… New Yor… Internet        140            120
## # ℹ 743 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>

Missing Values

Missing values were identified using the is.na() function. Depending on the variable, missing values were either removed or handled appropriately to maintain data integrit

Merge kaggle skills

skill_cols <- names(kg)[
  names(kg) %in% c(
    "python","sql","aws","excel","tableau","spark",
    "sas","pytorch","keras","scikit","tensor",
    "hadoop","flink","mongo","google_an"
  )
]

kaggle_job_skills <- kg |>
  mutate(job_id = paste0("kg_", row_number())) |>
  select(job_id, all_of(skill_cols)) |>
  pivot_longer(
    cols = -job_id,
    names_to = "skill",
    values_to = "present"
  ) |>
  filter(present == 1) |>
  select(job_id, skill)

#Summarizing the skills we can see that python, excel and sql are the most required skill in the job market

#BuiltIn skills

skill_keywords <- c(
  "python","r","sql","scala","java",
  "pytorch","tensorflow","scikit","sklearn","xgboost",
  "aws","azure","gcp","databricks","snowflake","spark",
  "tableau","power bi"
)

builtin_job_skills_clean <- builtin_jobs |>
  mutate(job_id = paste0("bi_", row_number())) |>
  transmute(job_id, text = tolower(coalesce(job_description, ""))) |>
  filter(nzchar(text)) |>
  crossing(skill = skill_keywords) |>
  mutate(rx = paste0("\\b", skill, "\\b")) |>
  filter(str_detect(text, regex(rx, ignore_case = TRUE))) |>
  distinct(job_id, skill)

Merge all skills

all_job_skills <- bind_rows(
  kaggle_job_skills,
  builtin_job_skills_clean
)
all_job_skills

## # A tibble: 2,142 × 2
##    job_id skill  
##    <chr>  <chr>  
##  1 kg_1   python 
##  2 kg_1   excel  
##  3 kg_1   sas    
##  4 kg_1   tableau
##  5 kg_2   python 
##  6 kg_3   python 
##  7 kg_3   spark  
##  8 kg_3   excel  
##  9 kg_3   sql    
## 10 kg_3   sas    
## # ℹ 2,132 more rows

count(all_jobs, source)

## # A tibble: 2 × 2
##   source      n
##   <chr>   <int>
## 1 builtin    11
## 2 kaggle    742

count(all_job_skills, skill, sort = TRUE)

## # A tibble: 23 × 2
##    skill       n
##    <chr>   <int>
##  1 python    402
##  2 sql       389
##  3 excel     388
##  4 aws       183
##  5 spark     170
##  6 tableau   150
##  7 hadoop    124
##  8 tensor     72
##  9 sas        66
## 10 scikit     55
## # ℹ 13 more rows

glimpse(all_jobs)

## Rows: 753
## Columns: 11
## $ job_id          <chr> "kg_1", "kg_2", "kg_3", "kg_4", "kg_5", "kg_6", "kg_7"…
## $ source          <chr> "kaggle", "kaggle", "kaggle", "kaggle", "kaggle", "kag…
## $ job_title       <chr> "Data Scientist", "Healthcare Data Scientist", "Data S…
## $ company         <chr> "Tecolote Research\n3.8", "University of Maryland Medi…
## $ location        <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL", …
## $ industry        <chr> "Aerospace & Defense", "Health Care Services & Hospita…
## $ avg_salary_k    <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0…
## $ salary_min_k    <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 106,…
## $ salary_max_k    <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 106,…
## $ job_description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducation …
## $ job_url         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

glimpse(all_job_skills)

## Rows: 2,142
## Columns: 2
## $ job_id <chr> "kg_1", "kg_1", "kg_1", "kg_1", "kg_2", "kg_3", "kg_3", "kg_3",…
## $ skill  <chr> "python", "excel", "sas", "tableau", "python", "python", "spark…

checks

count(all_jobs, source)

## # A tibble: 2 × 2
##   source      n
##   <chr>   <int>
## 1 builtin    11
## 2 kaggle    742

count(all_job_skills, skill, sort = TRUE)

## # A tibble: 23 × 2
##    skill       n
##    <chr>   <int>
##  1 python    402
##  2 sql       389
##  3 excel     388
##  4 aws       183
##  5 spark     170
##  6 tableau   150
##  7 hadoop    124
##  8 tensor     72
##  9 sas        66
## 10 scikit     55
## # ℹ 13 more rows

glimpse(all_jobs)

## Rows: 753
## Columns: 11
## $ job_id          <chr> "kg_1", "kg_2", "kg_3", "kg_4", "kg_5", "kg_6", "kg_7"…
## $ source          <chr> "kaggle", "kaggle", "kaggle", "kaggle", "kaggle", "kag…
## $ job_title       <chr> "Data Scientist", "Healthcare Data Scientist", "Data S…
## $ company         <chr> "Tecolote Research\n3.8", "University of Maryland Medi…
## $ location        <chr> "Albuquerque, NM", "Linthicum, MD", "Clearwater, FL", …
## $ industry        <chr> "Aerospace & Defense", "Health Care Services & Hospita…
## $ avg_salary_k    <dbl> 72.0, 87.5, 85.0, 76.5, 114.5, 95.0, 73.5, 114.0, 61.0…
## $ salary_min_k    <dbl> 53, 63, 80, 56, 86, 71, 54, 86, 38, 120, 126, 64, 106,…
## $ salary_max_k    <dbl> 91, 112, 90, 97, 143, 119, 93, 142, 84, 160, 201, 106,…
## $ job_description <chr> "Data Scientist\nLocation: Albuquerque, NM\nEducation …
## $ job_url         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…

glimpse(all_job_skills)

## Rows: 2,142
## Columns: 2
## $ job_id <chr> "kg_1", "kg_1", "kg_1", "kg_1", "kg_2", "kg_3", "kg_3", "kg_3",…
## $ skill  <chr> "python", "excel", "sas", "tableau", "python", "python", "spark…

Kaggle Frequency of skills in job postings

skill_counts <- kaggle_job_skills |>
  count(skill, sort = TRUE)

ggplot(skill_counts, aes(x = reorder(skill, n), y = n, fill = skill )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Frequency of Skills in Job Postings",
    x = "Skill",
    y = "Number of postings"
  ) +
  theme(legend.position = 'none')

BuiltIn frequency of skills in job postings

skill_counts <- builtin_job_skills_clean |>
  count(skill, sort = TRUE)

ggplot(skill_counts, aes(x = reorder(skill, n), y = n, fill = skill )) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Frequency of Skills in Job Postings",
    x = "Skill",
    y = "Number of postings"
  ) +
  theme(legend.position = 'none')

Number of jobs posting vs Average Salary

all_jobs_salary <- all_jobs |>
  filter(!is.na(avg_salary_k))

ggplot(all_jobs_salary, aes(x = avg_salary_k)) +
  geom_histogram(bins = 30) +
  labs(
    title = "Distribution of Average Salary",
    x = "Average Salary (K)",
    y = "Number of Job Postings"
  )

ggplot(all_jobs_salary, aes(y = avg_salary_k)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Average Salary",
    y = "Average Salary (K)"
  )

Check salary outliers

##The Average Salary is 70k to 125k after observing the boxplot

The histogram shows that most job postings fall within the mid-range salary band, with a concentration between approximately 70K and 130K. Higher salary values are less frequent and may represent senior or specialized roles.

Check job title and salary

job_salary_summary <- all_jobs_salary |>
  
  group_by(job_title) |>
  summarise(
    mean_salary = mean(avg_salary_k, na.rm = TRUE),
    median_salary = median(avg_salary_k, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  filter(n >= 5) |>
  arrange(desc(mean_salary))
job_salary_summary |>
  ggplot(aes( x = reorder(job_title, mean_salary), y = mean_salary, fill = job_title)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Salary for Job Title",
    x = "Job Title",
    y = " Average Salary (K)"
  ) +
  theme(legend.position = 'none')

##To compare pay across roles, job postings were grouped by job title and summarized using average salary. Because many titles appeared only a few times, the analysis focused on titles with at least five postings. This provides a more stable comparison of salary levels across common job categories.

Comparing Kaggle vs BuiltIn skills

source_skill_counts <- all_jobs |>
  left_join(all_job_skills, by = "job_id") |>
  filter(!is.na(skill)) |>
  count(source, skill)

source_skill_counts |>
  ggplot(aes(x = reorder(skill, n), y = n, fill = source)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Skill Frequency by Data Source",
    x = "Skill",
    y = "Number of Job Postings"
  )

Analysis

What are the most common required skills?

Let’s look at the top 5 skills from the Builtin and Kaggle data sets:

builtin_top5 <- source_skill_counts |> 
  filter(source == "builtin") |>
  arrange(desc(n)) |>
  rename("count" = n) |>
  slice_head(n = 5)

kaggle_top5 <- source_skill_counts |> 
  filter(source == "kaggle") |>
  arrange(desc(n)) |>
  rename("count" = n) |>
  slice_head(n = 5)

library(gt)

## Warning: package 'gt' was built under R version 4.5.2

builtin_top5 |>
  select(skill, count) |>
  gt() |>
  cols_label(
    skill = "Skill",
    count = "Count"
  ) |>
  tab_header(title = "Top 5 Skills - Builtin")

Skill	Count
Top 5 Skills - Builtin
python	10
sql	9
aws	7
gcp	4
r	3

builtin_top5 |> 
  ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Top 5 Skills (Builitin)", x = "Skill", y = "Count")

kaggle_top5 |>
  select(skill, count) |>
  gt() |>
  cols_label(
    skill = "Skill",
    count = "Count"
  ) |>
  tab_header(title = "Top 5 Skills - Kaggle")

Skill	Count
Top 5 Skills - Kaggle
python	392
excel	388
sql	380
aws	176
spark	167

kaggle_top5 |> 
  ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Top 5 Skills (Kaggle)", x = "Skill", y = "Count")

For the Builtin data set, Python, SQL, and R are the top software skills. Cloud skills are also in demand.

It’s worth noting that this data set is much smaller and all the job posts are in the greater NYC area, and salaries will reflect that.

Kaggle shows Python, Excel, and SQL as the top 3 software skills. R is excluded from this data set, so it’s not a good metric of how commonly R is a required skill. AWS cloud skills are also in demand, as is Apache Spark.

Looking at the full list, Tableau shows up quite early. In my experience, the barrier to using Tableau isn’t necessarily learning to use it, but the price.

All Skills

#visualizing all Builtin Skills
source_skill_counts |> 
  filter(source == "builtin") |>
  arrange(desc(n)) |>
  rename("count" = n)  |>
  ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "All Skills (Builitin)", x = "Skill", y = "Count")

#All Kaggle skills
source_skill_counts |> 
  filter(source == "kaggle") |>
  arrange(desc(n)) |>
  rename("count" = n) |>
   ggplot(aes(x = reorder(skill, -count), y = count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "All Skills (Kaggle)", x = "Skill", y = "Count")

In general, the Kaggle skills start with broad software (Python, Excel, SQL, AWS), then moves into more niche Apache software. Some of these are advanced Python libraries, like Pytorch, scikit, and Keras. Tensor is a data structure.

Salary Analysis

Above, we have salary by title. In general, data scientists and data engineers make more than analysts.

Here, we’ll look at salary by state, just to see the impact of that and benchmark for later observations.

all_jobs_state <- all_jobs_salary |>
    separate(col = location, into = c("city", "state"), sep = ",") |>
  mutate(across(c(city, state), str_trim)) |> 
  mutate(state = if_else(state == "Los Angeles", "CA", state))

## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [127].

#benchmarking the salaries for later observations

summary(all_jobs_state$avg_salary_k)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.5    73.5    97.5   101.5   122.5   254.0

avg_state <- all_jobs_state |>
  filter(source == "kaggle") |>
  group_by(state) |>
  summarise(
    avg_salary = round(mean(avg_salary_k, na.rm = TRUE)),
    job_count = n()
  ) |>
  arrange(desc(avg_salary))

#top 10 states by salary

avg_state |>
  slice_head(n = 10) |>
  select(state, avg_salary, job_count) |>
  gt() |>
  cols_align(
    align = "center",
    columns = everything()
  ) |>
  cols_label(
    state = "State",
    avg_salary = "Average Salary (k)",
    job_count = "Number of Jobs"
  ) |>
  tab_header(title = "Top 10 States by Average Salary")

State	Average Salary (k)	Number of Jobs
Top 10 States by Average Salary
CA	124	152
IL	117	40
DC	110	11
MA	107	103
NJ	105	17
MI	100	6
RI	100	1
NY	99	72
NC	98	21
KY	97	6

avg_state |> 
  slice_head(n = 10) |> 
  ggplot(aes(x = reorder(state, -avg_salary), y = avg_salary, fill = state)) +
  geom_col() +
  geom_text(aes(label = paste0(avg_salary, "k")), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Top 10 States by Average Salary", x = "State", y = "Average Salary (k)")

Skills and Salary

Here, we take the job skills table and join it with the KG skills table based on index. Then, we create an average salary per skill column. This is to see which skill is associated with the highest average salary.

#making a slight tweak to this code so I can keep the index column and join
#with the kg data set for average salary

job_skills_2 <- kg |> 
  mutate(job_id = index) |> 
  select(job_id, all_of(skill_cols)) |> 
  pivot_longer(
    cols = -job_id,
    names_to = "skill",
    values_to = "present"
  ) |> 
  filter(present == 1) |> 
  select(job_id, skill)

#left join with KG to get the salary
job_skills_salary <- job_skills_2 |>
    left_join(kg |> select(index, avg_salary_k), by = c("job_id" = "index"))

#Averaging the salary
salary_skill <- job_skills_salary |>
  group_by(skill) |>
  summarise(
    avg_salary = round(mean(avg_salary_k, na.rm = TRUE)),
    job_count = n()) |>
      arrange(desc(avg_salary))


salary_skill |> 
    ggplot(aes(x = reorder(skill, -avg_salary), y = avg_salary, fill = skill)) +
    geom_col() +
    geom_text(aes(label = paste0(avg_salary, "k")), vjust = -0.5, size = 3) +
    scale_fill_viridis_d(option = "magma") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    guides(fill = "none") +
    labs(title = "Average Salary by Skill", x = "Skill", y = "Average Salary (k)")

In general, it seems like more niche skills are more highly valued. Scikit and Keras are Python libraries. Flink = Apache Flink. SAS ranks fifth – this data set is a little older and SAS is likely not as popular as it used to be.

Are certain skills associated with lower salaries?

How does the average salary for each skill compare to the median salary?

all_jobs_salary |> summarise(mean(avg_salary_k))

## # A tibble: 1 × 1
##   `mean(avg_salary_k)`
##                  <dbl>
## 1                 101.

all_jobs_salary

## # A tibble: 742 × 11
##    job_id source job_title   company location industry avg_salary_k salary_min_k
##    <chr>  <chr>  <chr>       <chr>   <chr>    <chr>           <dbl>        <dbl>
##  1 kg_1   kaggle Data Scien… "Tecol… Albuque… Aerospa…         72             53
##  2 kg_2   kaggle Healthcare… "Unive… Linthic… Health …         87.5           63
##  3 kg_3   kaggle Data Scien… "KnowB… Clearwa… Securit…         85             80
##  4 kg_4   kaggle Data Scien… "PNNL\… Richlan… Energy           76.5           56
##  5 kg_5   kaggle Data Scien… "Affin… New Yor… Adverti…        114.            86
##  6 kg_6   kaggle Data Scien… "Cyrus… Dallas,… Real Es…         95             71
##  7 kg_7   kaggle Data Scien… "Clear… Baltimo… Banks &…         73.5           54
##  8 kg_8   kaggle Data Scien… "Logic… San Jos… Consult…        114             86
##  9 kg_9   kaggle Research S… "Roche… Rochest… Health …         61             38
## 10 kg_10  kaggle Data Scien… "<inte… New Yor… Internet        140            120
## # ℹ 732 more rows
## # ℹ 3 more variables: salary_max_k <dbl>, job_description <chr>, job_url <chr>

#the average salary for this data set is ~101K

salary_skill |> 
  mutate("difference" = avg_salary - 101) |>
  mutate("percentage" = round(difference/101,2)) |>
  select(skill, avg_salary, difference, percentage) |>
  gt() |>
  cols_label(
    skill = "Skill",
    avg_salary = "Mean Salary (k)",
    difference = "Diff from Mean Salary (k)",
    percentage = "% Diff from Mean Salary"
  ) |>
  tab_header(title = "How Much Are Skills Worth?")

Skill	Mean Salary (k)	Diff from Mean Salary (k)	% Diff from Mean Salary
How Much Are Skills Worth?
flink	129	28	0.28
scikit	125	24	0.24
keras	123	22	0.22
tensor	120	19	0.19
sas	114	13	0.13
aws	113	12	0.12
mongo	113	12	0.12
python	113	12	0.12
spark	113	12	0.12
hadoop	111	10	0.10
pytorch	109	8	0.08
sql	102	1	0.01
excel	99	-2	-0.02
tableau	96	-5	-0.05
google_an	68	-33	-0.33

#A slightly dramatic bar chart
salary_skill |> 
  mutate(difference = avg_salary - 101) |>
  ggplot(aes(x = reorder(skill, -difference), y = difference, fill = skill)) +
  geom_col() +
  geom_text(aes(label = paste0(difference, "k"), 
                vjust = ifelse(difference >= 0, -0.5, 1.5)), 
            size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  expand_limits(y = c(min(salary_skill$avg_salary - 101) * 1.2, 
                      max(salary_skill$avg_salary - 101) * 1.2)) +
  labs(title = "How Much Are Skills Worth?", x = "Skill", y = "Difference from Mean Salary (k)")

Excel is the #2 skill, but it’s assocated with a lower salary. Can we conclude that learning Excel will drop your salary down by two grand? Probably not. But it’s a more accessible skill, and may be associated with less intense/specialized jobs. Tableau is also fairly accessible, as is Google Analytics.

Skills and Title

What (if anything) does title mean? Are certain skills associated with a data scientist/analyst/engineer vs an analyst? On average, what is the pay differential between these titles?

Let’s look at titles and skill counts together.

#This data set that keeps everything and pivots skills
kg_long <- kg |>
    pivot_longer(
        cols = python:google_an,
        names_to = "skill",
        values_to = "present"
    ) |>
    filter(present == 1) |>
    select(index:age, job_title_sim:degree, skill)

kg_skills_pivot <- kg_long |>
  select(job_title_sim, skill)

#changing the title "data analitics" to just analyst, so it matches 
#(it seems like an error and it's also a typo)
kg_skills_pivot <- kg_skills_pivot |>
    mutate(job_title_sim = if_else(job_title_sim == "data analitics", "analyst",
                                   job_title_sim))

  

kg_skills_agg <- kg_skills_pivot |> 
    group_by(job_title_sim, skill) |> 
    summarise("skill_count" = n(), .groups = "drop")

#Let's look at analyst, data engineer, and data scientist

kg_skills_agg |> filter(job_title_sim == "analyst") |>
  arrange(desc(skill_count)) |>
   select(skill, skill_count) |>
  gt() |>
  cols_label(
    skill = "Skill",
    skill_count = "Count"
  ) |>
  tab_header(title = "Most Common Analyst Skills")

Skill	Count
Most Common Analyst Skills
excel	81
sql	79
tableau	44
python	36
bi	19
sas	13
google_an	11
aws	10
spark	6
hadoop	3
mongo	3

kg_skills_agg |> filter(job_title_sim == "data scientist") |>
  arrange(desc(skill_count)) |>
   select(skill, skill_count) |>
  gt() |>
  cols_label(
    skill = "Skill",
    skill_count = "Count"
  ) |>
  tab_header(title = "Most Common Data Scientist Skills")

Skill	Count
Most Common Data Scientist Skills
python	240
sql	176
excel	155
spark	84
aws	79
tableau	76
hadoop	60
tensor	60
sas	50
scikit	47
pytorch	33
keras	29
bi	28
mongo	19
flink	4
google_an	3

kg_skills_agg |> filter(job_title_sim == "data engineer") |>
  arrange(desc(skill_count)) |>
   select(skill, skill_count) |>
  gt() |>
  cols_label(
    skill = "Skill",
    skill_count = "Count"
  ) |>
  tab_header(title = "Most Common Data Engineer Skills")

Skill	Count
Most Common Data Engineer Skills
sql	87
python	77
spark	67
aws	59
excel	54
hadoop	50
mongo	13
tableau	11
flink	6
bi	3
tensor	2
sas	1

#analyst skills bar chart
kg_skills_agg |> 
  filter(job_title_sim == "analyst") |>
  ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Most Common Data Analyst Skills", x = "Skill", y = "Count")

#scientist
kg_skills_agg |> 
  filter(job_title_sim == "data scientist") |>
  ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Most Common Data Scientist Skills", x = "Skill", y = "Count")

#engineer
kg_skills_agg |> 
  filter(job_title_sim == "data engineer") |>
  ggplot(aes(x = reorder(skill, -skill_count), y = skill_count, fill = skill)) +
  geom_col() +
  geom_text(aes(label = skill_count), vjust = -0.5, size = 3) +
  scale_fill_viridis_d(option = "magma") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none") +
  labs(title = "Most Common Data Engineer Skills", x = "Skill", y = "Count")

Analysts seem more likely to need excel, SQL, and Tableau before python, whereas engineers and scientists generally need SQL, Python, and cloud skills. Analysts need more visualization tools like Tableau, BI, and google analytics. In general, programming skills are preferred for scientists and engineers.

What fields pay the most?

Is understanding a particular area (e.g. healthcare or finance) correlated with a higher salary?

In data science, you may need two skill sets. For example, you may need chemistry and data science skills in order to work at a pharmaceutical company.

#all jobs salary
by_industry <- all_jobs_salary |> 
  filter(industry != "-1") |> 
  group_by(industry) |>
  summarise(
    avg_salary = round(mean(avg_salary_k, na.rm = TRUE), 0),
    job_count = n()
  ) |>
  arrange(desc(avg_salary))

#there are a lot of one-offs in here, so I am going to get rid of anything with a count of less than 2

by_industry_multi <- by_industry |>
  filter(job_count > 1)

#Let's look at the top 10 and bottom 10

by_industry_multi |>
  slice_head(n = 10) |>
  select(industry, avg_salary) |>
  gt() |>
  cols_label(
    industry = "Industry",
    avg_salary = "Average Salary (Thousands)"
  ) |>
  tab_header(title = "Top Average Salaries by Industry")

Industry	Average Salary (Thousands)
Top Average Salaries by Industry
Financial Analytics & Research	145
Telecommunications Services	132
Brokerage Services	129
Internet	124
Investment Banking & Asset Management	118
TV Broadcast & Cable Networks	118
Computer Hardware & Software	115
Enterprise Software & Network Solutions	115
Biotech & Pharmaceuticals	112
Consulting	109

Unsurprisingly, 3/10 are in finance (brokerage, financial analytics, investment banking), industries that tend to pay more. Others are in technology generally, along with media.

by_industry_multi |>
  slice_tail(n = 10) |>
  select(industry, avg_salary) |>
  gt() |>
  cols_label(
    industry = "Industry",
    avg_salary = "Average Salary (Thousands)"
  ) |>
  tab_header(title = "Bottom Average Salaries by Industry")

Industry	Average Salary (Thousands)
Bottom Average Salaries by Industry
Research & Development	78
Health Care Services & Hospitals	77
Banks & Credit Unions	71
Travel Agencies	70
Construction	55
Food & Beverage Manufacturing	53
Architectural & Engineering Services	50
Gambling	48
Social Assistance	48
Telecommunications Manufacturing	44

Some of these are industries where salaries tend to be on the lower side, like food & beverage and social services. Construction and architecture are adjacent fields.

Gambling is interesting – it may offer higher salaries now, given the growth of gambling apps.

What field has the most jobs?

If we define valuable as “most opportunities”:

by_industry_multi |>
  arrange(desc(job_count)) |>
  slice_head(n = 10) |>
  select(industry, job_count) |>
  gt() |>
  cols_align(
    align = "center",
    columns = job_count
  ) |>
  cols_label(
    industry = "Industry",
    job_count = "Number of Jobs"
  ) |>
  tab_header(title = "Most Common Industries")

Industry	Number of Jobs
Most Common Industries
Biotech & Pharmaceuticals	112
Insurance Carriers	63
Computer Hardware & Software	59
IT Services	50
Health Care Services & Hospitals	49
Enterprise Software & Network Solutions	42
Internet	29
Consulting	29
Advertising & Marketing	25
Aerospace & Defense	25

by_industry_multi |>
    arrange(desc(job_count)) |>
    slice_head(n = 10) |>
    ggplot(aes(x = reorder(industry, -job_count), y = job_count, fill = industry)) +
    geom_col() +
    geom_text(aes(label = job_count), vjust = -0.2) +
    scale_fill_viridis_d(option = "magma") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    guides(fill = "none") +
    labs(title = "Most Common Industries for Data Science", x = "Industry", y = "Number of Jobs")

According to this, having a biology background could be valuable (biotech & pharmaceuticals). There’s demand in IT and adjacent fields like software and internet, along with insurance carriers (presumably to assess risk).

Modeling

Data Prep

final_wide <- all_jobs %>%
  left_join(all_job_skills, by = "job_id") %>%
  mutate(has_skill = 1L) %>%
  pivot_wider(
    names_from = skill,
    values_from = has_skill,
    values_fill = list(has_skill = 0L),
    values_fn = list(has_skill = max)
  )

# Columns from the original all_jobs table (keep these)
orig_cols <- names(all_jobs)

# Skill columns are all columns in final_wide that were *not* in all_jobs
skill_cols <- setdiff(names(final_wide), orig_cols)

# Identify skill columns that are all zeros (no presence across any job)
zero_only <- vapply(final_wide[skill_cols], function(x) {
  # handle numeric/logical 0/1; coerce characters like "0"/"1" if needed
  if (is.character(x)) x <- suppressWarnings(as.numeric(x))
  sum(x, na.rm = TRUE) == 0
}, logical(1))

# Drop unnecessary columns
final_wide_trim <- final_wide %>%
  select(
    -any_of(skill_cols[zero_only]),   # drop skills with only 0
    -any_of("NA")                     # drop the NA column
  )

Building LM Function

run_univariate_lm <- function(df,
                              y = "avg_salary_k",
                              x_cols = 12:34,
                              na_action = na.exclude,
                              assign_to_global = FALSE,
                              sanitize_names = FALSE,
                              coerce_binary_chars = TRUE) {
  # Basic checks
  if (!y %in% names(df)) {
    stop(sprintf("Response column '%s' not found in data.", y))
  }
  if (any(x_cols < 1 | x_cols > ncol(df))) {
    stop("x_cols index range is out of bounds for the provided data frame.")
  }
  
  # Optionally sanitize names to syntactic R names
  if (sanitize_names) {
    old_names <- names(df)
    names(df) <- make.unique(make.names(names(df)))
    if (!y %in% names(df)) {
      stop("After sanitizing names, the response column was renamed; set y to the new name.")
    }
    message("Column names were sanitized via make.names() + make.unique().")
  }
  
  # Ensure response is numeric (lm works best this way)
  if (!is.numeric(df[[y]])) {
    suppressWarnings({
      df[[y]] <- as.numeric(df[[y]])
    })
    if (anyNA(df[[y]])) {
      warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
    }
  }
  
  results <- list()
  
  for (i in x_cols) {
    xname <- names(df)[i]
    xvec <- df[[xname]]
    
    # Skip problematic types (list/data.frame columns)
    if (is.list(xvec) && !is.atomic(xvec)) {
      message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xname))
      next
    }
    
    # Optional coercion for binary character columns
    if (coerce_binary_chars && is.character(xvec)) {
      vals <- unique(na.omit(xvec))
      if (all(vals %in% c("0", "1"))) {
        xvec_num <- as.numeric(xvec)
        df[[xname]] <- xvec_num
      }
    }
    
    # Skip non-variant predictors
    if (all(is.na(xvec)) || length(unique(na.omit(xvec))) < 2) {
      message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xname))
      next
    }
    
    # Build formula with backticks to handle non-syntactic names safely
    f <- as.formula(sprintf("`%s` ~ `%s`", y, xname))
    
    # Fit model (try-catch for edge cases)
    fit <- try(lm(f, data = df, na.action = na_action), silent = TRUE)
    if (inherits(fit, "try-error")) {
      message(sprintf("Skipping '%s': lm() failed (%s).", xname, as.character(fit)))
      next
    }
    
    # Store with name "lm_<column_name>"
    obj_name <- paste0("lm_", xname)
    results[[obj_name]] <- fit
    
    # Print summary
    cat("\n", strrep("=", 72), "\n", sep = "")
    cat(sprintf("Model: %s\nFormula: %s\n", obj_name, deparse(f)))
    cat(strrep("-", 72), "\n", sep = "")
    print(summary(fit))
  }
  
  # Optionally assign each model to the global environment
  if (assign_to_global && length(results) > 0) {
    list2env(results, envir = .GlobalEnv)
    message("Models assigned to .GlobalEnv with names 'lm_<column_name>'.")
  }
  
  invisible(results)
}

Running LM Function

models <- run_univariate_lm(final_wide_trim)

## 
## ========================================================================
## Model: lm_python
## Formula: avg_salary_k ~ python
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -73.476 -25.976  -5.653  20.980 141.347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   88.976      1.902  46.769   <2e-16 ***
## python        23.677      2.617   9.046   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35.59 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.09957,    Adjusted R-squared:  0.09836 
## F-statistic: 81.83 on 1 and 740 DF,  p-value: < 2.2e-16
## 
## 
## ========================================================================
## Model: lm_excel
## Formula: avg_salary_k ~ excel
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -83.397 -27.397  -5.321  19.679 149.679 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  104.321      1.988  52.467   <2e-16 ***
## excel         -5.424      2.750  -1.973   0.0489 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.41 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.005231,   Adjusted R-squared:  0.003886 
## F-statistic: 3.891 on 1 and 740 DF,  p-value: 0.04892
## 
## 
## ========================================================================
## Model: lm_sas
## Formula: avg_salary_k ~ sas
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.791 -27.601  -5.041  22.084 140.295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  100.291      1.435  69.885  < 2e-16 ***
## sas           13.413      4.812   2.788  0.00545 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.31 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.01039,    Adjusted R-squared:  0.009054 
## F-statistic:  7.77 on 1 and 740 DF,  p-value: 0.005447
## 
## 
## ========================================================================
## Model: lm_tableau
## Formula: avg_salary_k ~ tableau
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -87.382 -26.382  -4.629  21.118 151.118 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  102.882      1.535  67.039   <2e-16 ***
## tableau       -7.007      3.436  -2.039   0.0418 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.4 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.005588,   Adjusted R-squared:  0.004244 
## F-statistic: 4.158 on 1 and 740 DF,  p-value: 0.04179
## 
## 
## ========================================================================
## Model: lm_spark
## Formula: avg_salary_k ~ spark
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.539 -27.414  -5.347  21.961 155.961 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   98.039      1.541  63.611  < 2e-16 ***
## spark         15.308      3.249   4.712 2.93e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.96 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.02913,    Adjusted R-squared:  0.02782 
## F-statistic:  22.2 on 1 and 740 DF,  p-value: 2.929e-06
## 
## 
## ========================================================================
## Model: lm_sql
## Formula: avg_salary_k ~ sql
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.907 -27.907  -4.058  20.942 152.593 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 101.4075     1.9714  51.440   <2e-16 ***
## sql           0.1504     2.7547   0.055    0.956    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.51 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  4.03e-06,   Adjusted R-squared:  -0.001347 
## F-statistic: 0.002982 on 1 and 740 DF,  p-value: 0.9565
## 
## 
## ========================================================================
## Model: lm_aws
## Formula: avg_salary_k ~ aws
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.378 -26.878  -5.878  22.122 156.122 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   97.878      1.553  63.030  < 2e-16 ***
## aws           15.204      3.189   4.768 2.24e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.94 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.02981,    Adjusted R-squared:  0.0285 
## F-statistic: 22.74 on 1 and 740 DF,  p-value: 2.236e-06
## 
## 
## ========================================================================
## Model: lm_mongo
## Formula: avg_salary_k ~ mongo
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.376 -27.376  -4.876  20.499 153.124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  100.876      1.409  71.591   <2e-16 ***
## mongo         12.205      6.310   1.934   0.0535 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.41 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.00503,    Adjusted R-squared:  0.003686 
## F-statistic: 3.741 on 1 and 740 DF,  p-value: 0.05346
## 
## 
## ========================================================================
## Model: lm_pytorch
## Formula: avg_salary_k ~ pytorch
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -85.59 -28.09  -3.84  21.29 152.91 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.090      1.413  71.532   <2e-16 ***
## pytorch        7.513      6.164   1.219    0.223    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.47 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.002003,   Adjusted R-squared:  0.0006547 
## F-statistic: 1.485 on 1 and 740 DF,  p-value: 0.2233
## 
## 
## ========================================================================
## Model: lm_tensor
## Formula: avg_salary_k ~ tensor
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.019 -26.894  -4.769  20.662 154.481 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.519      1.430  69.576  < 2e-16 ***
## tensor        20.259      4.592   4.412 1.18e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.02 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.02563,    Adjusted R-squared:  0.02431 
## F-statistic: 19.47 on 1 and 740 DF,  p-value: 1.176e-05
## 
## 
## ========================================================================
## Model: lm_hadoop
## Formula: avg_salary_k ~ hadoop
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -84.13 -27.13  -4.43  20.87 154.37 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.629      1.500  66.440  < 2e-16 ***
## hadoop        11.100      3.668   3.026  0.00256 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.28 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.01222,    Adjusted R-squared:  0.01089 
## F-statistic: 9.158 on 1 and 740 DF,  p-value: 0.002563
## 
## 
## ========================================================================
## Model: lm_keras
## Formula: avg_salary_k ~ keras
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.111 -27.611  -4.111  20.389 153.389 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  100.611      1.395  72.109   <2e-16 ***
## keras         22.355      7.058   3.167   0.0016 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.26 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.01338,    Adjusted R-squared:  0.01204 
## F-statistic: 10.03 on 1 and 740 DF,  p-value: 0.001601
## 
## 
## ========================================================================
## Model: lm_scikit
## Formula: avg_salary_k ~ scikit
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.114 -26.614  -4.464  20.386 154.386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.614      1.407  70.795  < 2e-16 ***
## scikit        25.701      5.216   4.927 1.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.91 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.03177,    Adjusted R-squared:  0.03046 
## F-statistic: 24.28 on 1 and 740 DF,  p-value: 1.029e-06
## 
## 
## ========================================================================
## Model: lm_flink
## Formula: avg_salary_k ~ flink
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.609 -27.984  -4.609  21.266 152.891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.109      1.381  73.203   <2e-16 ***
## flink         27.891     11.898   2.344   0.0193 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.37 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.007372,   Adjusted R-squared:  0.00603 
## F-statistic: 5.496 on 1 and 740 DF,  p-value: 0.01933
## 
## 
## ========================================================================
## Model: lm_google_an
## Formula: avg_salary_k ~ google_an
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -86.625 -26.625  -4.375  21.125 151.875 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   102.12       1.38   74.03  < 2e-16 ***
## google_an     -33.95      10.04   -3.38 0.000763 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.22 on 740 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.0152, Adjusted R-squared:  0.01387 
## F-statistic: 11.42 on 1 and 740 DF,  p-value: 0.0007626
## 
## 
## ========================================================================
## Model: lm_r
## Formula: avg_salary_k ~ r
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## r                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_scala
## Formula: avg_salary_k ~ scala
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## scala             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_databricks
## Formula: avg_salary_k ~ databricks
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## databricks        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_gcp
## Formula: avg_salary_k ~ gcp
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## gcp               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_snowflake
## Formula: avg_salary_k ~ snowflake
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## snowflake         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_tensorflow
## Formula: avg_salary_k ~ tensorflow
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## tensorflow        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_xgboost
## Formula: avg_salary_k ~ xgboost
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## xgboost           NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)
## 
## 
## ========================================================================
## Model: lm_azure
## Formula: avg_salary_k ~ azure
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -85.985 -27.985  -3.985  21.015 152.515 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  101.485      1.376   73.75   <2e-16 ***
## azure             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.48 on 741 degrees of freedom
##   (11 observations deleted due to missingness)

Building Multi LM Function

run_multivar_lm <- function(df,
                            y = "avg_salary_k",
                            x_cols = 12:34,
                            na_action = na.exclude,
                            sanitize_names = FALSE,
                            coerce_binary_chars = TRUE,
                            standardize = FALSE) {
  # --- Basic checks ---
  if (!y %in% names(df)) {
    stop(sprintf("Response column '%s' not found in data.", y))
  }
  if (any(x_cols < 1 | x_cols > ncol(df))) {
    stop("x_cols index range is out of bounds for the provided data frame.")
  }
  
  # --- Optionally sanitize column names ---
  if (sanitize_names) {
    names(df) <- make.unique(make.names(names(df)))
    if (!y %in% names(df)) {
      stop("After sanitizing names, the response column was renamed; update 'y' to the new name.")
    }
    message("Column names were sanitized via make.names() + make.unique().")
  }
  
  # --- Ensure response is numeric ---
  if (!is.numeric(df[[y]])) {
    suppressWarnings({
      df[[y]] <- as.numeric(df[[y]])
    })
    if (anyNA(df[[y]])) {
      warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
    }
  }
  
  # --- Collect candidate predictors ---
  xnames <- names(df)[x_cols]
  
  # --- Filter: skip problematic types, coerce binary chars, skip constants ---
  keep <- logical(length(xnames))
  for (i in seq_along(xnames)) {
    xn <- xnames[i]
    xv <- df[[xn]]
    
    # Skip list/data-frame columns
    if (is.list(xv) && !is.atomic(xv)) {
      message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xn))
      keep[i] <- FALSE
      next
    }
    
    # Optional coercion: "0"/"1" -> numeric
    if (coerce_binary_chars && is.character(xv)) {
      vals <- unique(na.omit(xv))
      if (length(vals) > 0 && all(vals %in% c("0", "1"))) {
        df[[xn]] <- as.numeric(xv)
        xv <- df[[xn]]
      }
    }
    
    # Skip non-variant predictors (all NA or <2 unique non-NA values)
    if (all(is.na(xv)) || length(unique(na.omit(xv))) < 2) {
      message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xn))
      keep[i] <- FALSE
      next
    }
    
    keep[i] <- TRUE
  }
  xnames <- xnames[keep]
  
  if (length(xnames) == 0) {
    stop("No valid predictors found in the selected range after filtering.")
  }
  
  # --- Optional standardization of numeric predictors ---
  if (standardize) {
    for (xn in xnames) {
      if (is.numeric(df[[xn]])) {
        df[[xn]] <- as.numeric(scale(df[[xn]]))
      }
    }
    message("Numeric predictors standardized (mean=0, sd=1).")
  }
  
  # --- Build safe formula with backticks ---
  backtick <- function(nm) paste0("`", nm, "`")
  rhs <- paste(vapply(xnames, backtick, ""), collapse = " + ")
  f <- as.formula(paste(backtick(y), "~", rhs))
  
  # --- Fit model ---
  fit <- lm(f, data = df, na.action = na_action)
  
  # --- Print concise diagnostics ---
  cat("\n", strrep("=", 72), "\n", sep = "")
  cat("Multivariable Linear Model\n")
  cat(sprintf("Formula: %s\n", deparse(f)))
  cat(strrep("-", 72), "\n", sep = "")
  s <- summary(fit)
  print(s)
  
  # Extra diagnostics
  cat(strrep("-", 72), "\n", sep = "")
  cat(sprintf("Observations used: %d\n", s$df[1] + s$df[2]))
  cat(sprintf("Residual df: %d\n", s$df[2]))
  cat(sprintf("Residual SE: %.3f\n", s$sigma))
  cat(sprintf("Multiple R-squared: %.5f\n", s$r.squared))
  cat(sprintf("Adjusted R-squared: %.5f\n", s$adj.r.squared))
  suppressWarnings({
    aic_val <- try(AIC(fit), silent = TRUE)
    if (!inherits(aic_val, "try-error")) {
      cat(sprintf("AIC: %.3f\n", aic_val))
    }
  })
  
  invisible(fit)
}

Running Multi LM Function

mmodel <- run_multivar_lm(final_wide_trim)

## 
## ========================================================================
## Multivariable Linear Model
## Formula: avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##  Formula:     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##  Formula:     flink + google_an + r + scala + databricks + gcp + snowflake + 
##  Formula:     tensorflow + xgboost + azure
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = f, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -74.135 -23.389  -6.199  19.860 125.237 
## 
## Coefficients: (8 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   92.184      2.443  37.739  < 2e-16 ***
## python        22.627      2.928   7.729 3.63e-14 ***
## excel         -2.549      2.579  -0.988 0.323313    
## sas           18.499      4.675   3.957 8.35e-05 ***
## tableau       -7.601      3.515  -2.162 0.030920 *  
## spark         -1.442      4.071  -0.354 0.723305    
## sql          -10.613      3.060  -3.468 0.000555 ***
## aws            9.877      3.258   3.031 0.002522 ** 
## mongo          7.776      6.173   1.260 0.208133    
## pytorch      -18.653      7.584  -2.459 0.014149 *  
## tensor        15.412      6.682   2.306 0.021371 *  
## hadoop         4.712      4.207   1.120 0.263017    
## keras         -3.025      8.340  -0.363 0.716898    
## scikit        11.969      6.222   1.924 0.054778 .  
## flink          4.294     11.486   0.374 0.708604    
## google_an    -28.312      9.386  -3.016 0.002646 ** 
## r                 NA         NA      NA       NA    
## scala             NA         NA      NA       NA    
## databricks        NA         NA      NA       NA    
## gcp               NA         NA      NA       NA    
## snowflake         NA         NA      NA       NA    
## tensorflow        NA         NA      NA       NA    
## xgboost           NA         NA      NA       NA    
## azure             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.21 on 726 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.1841, Adjusted R-squared:  0.1672 
## F-statistic: 10.92 on 15 and 726 DF,  p-value: < 2.2e-16
## 
## ------------------------------------------------------------------------
## Observations used: 742
## Residual df: 726
## Residual SE: 34.206
## Multiple R-squared: 0.18407
## Adjusted R-squared: 0.16721
## AIC: 7365.593

Building Backward Stepwise Regression Function

run_stepwise_backward_lm <- function(df,
                                     y = "avg_salary_k",
                                     x_cols = 12:34,
                                     na_action = na.exclude,
                                     sanitize_names = FALSE,
                                     coerce_binary_chars = TRUE,
                                     standardize = FALSE,
                                     criterion = c("AIC", "BIC"),
                                     trace = TRUE) {
  criterion <- match.arg(criterion)
  
  # --- Basic checks ---
  if (!y %in% names(df)) {
    stop(sprintf("Response column '%s' not found in data.", y))
  }
  if (any(x_cols < 1 | x_cols > ncol(df))) {
    stop("x_cols index range is out of bounds for the provided data frame.")
  }
  
  # --- Optionally sanitize column names ---
  if (sanitize_names) {
    names(df) <- make.unique(make.names(names(df)))
    if (!y %in% names(df)) {
      stop("After sanitizing names, the response column was renamed; update 'y' to the new name.")
    }
    message("Column names were sanitized via make.names() + make.unique().")
  }
  
  # --- Ensure response is numeric ---
  if (!is.numeric(df[[y]])) {
    suppressWarnings({
      df[[y]] <- as.numeric(df[[y]])
    })
    if (anyNA(df[[y]])) {
      warning(sprintf("Coercing '%s' to numeric introduced NAs; check your response column.", y))
    }
  }
  
  # --- Collect candidate predictors ---
  xnames <- names(df)[x_cols]
  
  # --- Filter: skip problematic types, coerce binary chars, skip constants ---
  keep <- logical(length(xnames))
  for (i in seq_along(xnames)) {
    xn <- xnames[i]
    xv <- df[[xn]]
    
    # Skip list/data-frame columns
    if (is.list(xv) && !is.atomic(xv)) {
      message(sprintf("Skipping '%s': predictor is a list/data-frame column.", xn))
      keep[i] <- FALSE
      next
    }
    
    # Optional coercion: "0"/"1" -> numeric
    if (coerce_binary_chars && is.character(xv)) {
      vals <- unique(na.omit(xv))
      if (length(vals) > 0 && all(vals %in% c("0", "1"))) {
        df[[xn]] <- as.numeric(xv)
        xv <- df[[xn]]
      }
    }
    
    # Skip non-variant predictors (all NA or <2 unique non-NA values)
    if (all(is.na(xv)) || length(unique(na.omit(xv))) < 2) {
      message(sprintf("Skipping '%s': predictor has <2 unique non-NA values.", xn))
      keep[i] <- FALSE
      next
    }
    
    keep[i] <- TRUE
  }
  xnames <- xnames[keep]
  
  if (length(xnames) == 0) {
    stop("No valid predictors found in the selected range after filtering.")
  }
  
  # --- Optional standardization of numeric predictors ---
  if (standardize) {
    for (xn in xnames) {
      if (is.numeric(df[[xn]])) {
        df[[xn]] <- as.numeric(scale(df[[xn]]))
      }
    }
    message("Numeric predictors standardized (mean=0, sd=1).")
  }
  
  # --- Build safe formula with backticks ---
  backtick <- function(nm) paste0("`", nm, "`")
  rhs <- paste(vapply(xnames, backtick, ""), collapse = " + ")
  f_full <- as.formula(paste(backtick(y), "~", rhs))
  
  # --- Fit full model ---
  fit_full <- lm(f_full, data = df, na.action = na_action)
  
  # --- Choose penalty k (AIC vs BIC) ---
  n <- nobs(fit_full)  # number of observations actually used
  k <- if (criterion == "AIC") 2 else log(n)
  
  # --- Backward stepwise using stats::step ---
  fit_step <- step(fit_full,
                   direction = "backward",
                   trace = as.integer(trace),
                   k = k)
  
  # --- Print concise diagnostics for the final selected model ---
  cat("\n", strrep("=", 72), "\n", sep = "")
  cat(sprintf("Backward Stepwise Linear Model (criterion: %s, k = %.3f)\n", criterion, k))
  cat(sprintf("Final Formula: %s\n", deparse(formula(fit_step))))
  cat(strrep("-", 72), "\n", sep = "")
  
  s <- summary(fit_step)
  print(s)
  
  # Which predictors were selected?
  selected <- attr(terms(fit_step), "term.labels")
  dropped  <- setdiff(xnames, selected)
  
  cat(strrep("-", 72), "\n", sep = "")
  cat("Selected predictors:\n")
  if (length(selected) > 0) {
    cat(paste0(" - ", selected), sep = "\n")
  } else {
    cat(" (none; intercept-only model)\n")
  }
  
  cat("\nDropped predictors:\n")
  if (length(dropped) > 0) {
    cat(paste0(" - ", dropped), sep = "\n")
  } else {
    cat(" (none dropped)\n")
  }
  
  # Extra metrics
  cat("\n", strrep("-", 72), "\n", sep = "")
  cat(sprintf("Observations used: %d\n", s$df[1] + s$df[2]))
  cat(sprintf("Residual df: %d\n", s$df[2]))
  cat(sprintf("Residual SE: %.3f\n", s$sigma))
  cat(sprintf("Multiple R-squared: %.5f\n", s$r.squared))
  cat(sprintf("Adjusted R-squared: %.5f\n", s$adj.r.squared))
  cat(sprintf("%s: %.3f\n", criterion, AIC(fit_step) + (k - 2) * length(coef(fit_step))))
  # Note: AIC() returns AIC with k = 2. For BIC, we report adjusted by (k-2)*p.
  
  invisible(fit_step)
}

Running Backward Stepwise Regression Function

stepmodel <- run_stepwise_backward_lm(final_wide_trim)

## Start:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks + gcp + snowflake + 
##     tensorflow + xgboost + azure
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks + gcp + snowflake + 
##     tensorflow + xgboost
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks + gcp + snowflake + 
##     tensorflow
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks + gcp + snowflake
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks + gcp
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala + databricks
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r + scala
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an + r
## 
## 
## Step:  AIC=5257.89
## avg_salary_k ~ python + excel + sas + tableau + spark + sql + 
##     aws + mongo + pytorch + tensor + hadoop + keras + scikit + 
##     flink + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - spark      1       147 849581 5256.0
## - keras      1       154 849588 5256.0
## - flink      1       164 849597 5256.0
## - excel      1      1143 850577 5256.9
## - hadoop     1      1468 850902 5257.2
## - mongo      1      1857 851291 5257.5
## <none>                   849434 5257.9
## - scikit     1      4330 853764 5259.7
## - tableau    1      5471 854904 5260.7
## - tensor     1      6224 855657 5261.3
## - pytorch    1      7077 856511 5262.0
## - google_an  1     10646 860080 5265.1
## - aws        1     10750 860184 5265.2
## - sql        1     14071 863505 5268.1
## - sas        1     18316 867750 5271.7
## - python     1     69885 919319 5314.6
## 
## Step:  AIC=5256.02
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo + 
##     pytorch + tensor + hadoop + keras + scikit + flink + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - keras      1       127 849708 5254.1
## - flink      1       133 849713 5254.1
## - excel      1      1101 850681 5255.0
## - hadoop     1      1346 850927 5255.2
## - mongo      1      1741 851321 5255.5
## <none>                   849581 5256.0
## - scikit     1      4195 853776 5257.7
## - tableau    1      5335 854916 5258.7
## - tensor     1      6111 855692 5259.3
## - pytorch    1      6936 856516 5260.0
## - google_an  1     10602 860182 5263.2
## - aws        1     10604 860184 5263.2
## - sql        1     14507 864087 5266.6
## - sas        1     18508 868089 5270.0
## - python     1     70982 920563 5313.6
## 
## Step:  AIC=5254.13
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo + 
##     pytorch + tensor + hadoop + scikit + flink + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - flink      1       170 849878 5252.3
## - excel      1      1103 850811 5253.1
## - hadoop     1      1404 851112 5253.4
## - mongo      1      1690 851398 5253.6
## <none>                   849708 5254.1
## - scikit     1      4156 853863 5255.7
## - tableau    1      5425 855133 5256.9
## - tensor     1      6241 855949 5257.6
## - pytorch    1      6876 856584 5258.1
## - aws        1     10481 860189 5261.2
## - google_an  1     10726 860434 5261.4
## - sql        1     14388 864096 5264.6
## - sas        1     18787 868495 5268.4
## - python     1     70881 920589 5311.6
## 
## Step:  AIC=5252.28
## avg_salary_k ~ python + excel + sas + tableau + sql + aws + mongo + 
##     pytorch + tensor + hadoop + scikit + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - excel      1      1110 850988 5251.2
## - hadoop     1      1544 851421 5251.6
## - mongo      1      1686 851564 5251.7
## <none>                   849878 5252.3
## - scikit     1      4135 854013 5253.9
## - tableau    1      5477 855355 5255.0
## - tensor     1      6754 856631 5256.1
## - pytorch    1      7331 857209 5256.6
## - google_an  1     10718 860596 5259.6
## - aws        1     10948 860826 5259.8
## - sql        1     14822 864699 5263.1
## - sas        1     18856 868734 5266.6
## - python     1     71124 921002 5309.9
## 
## Step:  AIC=5251.24
## avg_salary_k ~ python + sas + tableau + sql + aws + mongo + pytorch + 
##     tensor + hadoop + scikit + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - mongo      1      1515 852503 5250.6
## - hadoop     1      1554 852542 5250.6
## <none>                   850988 5251.2
## - scikit     1      4060 855049 5252.8
## - tableau    1      5993 856981 5254.5
## - tensor     1      6998 857986 5255.3
## - pytorch    1      7292 858280 5255.6
## - aws        1     11034 862022 5258.8
## - google_an  1     11498 862486 5259.2
## - sql        1     15412 866400 5262.6
## - sas        1     18820 869808 5265.5
## - python     1     72221 923209 5309.7
## 
## Step:  AIC=5250.56
## avg_salary_k ~ python + sas + tableau + sql + aws + pytorch + 
##     tensor + hadoop + scikit + google_an
## 
##             Df Sum of Sq    RSS    AIC
## - hadoop     1      1980 854483 5250.3
## <none>                   852503 5250.6
## - scikit     1      4119 856621 5252.1
## - tableau    1      5556 858059 5253.4
## - tensor     1      7143 859646 5254.8
## - pytorch    1      7683 860185 5255.2
## - google_an  1     11455 863958 5258.5
## - aws        1     12616 865119 5259.5
## - sql        1     14645 867148 5261.2
## - sas        1     20120 872623 5265.9
## - python     1     71465 923968 5308.3
## 
## Step:  AIC=5250.29
## avg_salary_k ~ python + sas + tableau + sql + aws + pytorch + 
##     tensor + scikit + google_an
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   854483 5250.3
## - scikit     1      5112 859595 5252.7
## - tableau    1      5621 860104 5253.2
## - tensor     1      6544 861027 5253.9
## - pytorch    1      8208 862691 5255.4
## - google_an  1     11679 866162 5258.4
## - sql        1     13033 867516 5259.5
## - aws        1     15824 870307 5261.9
## - sas        1     18930 873413 5264.5
## - python     1     74733 929216 5310.5
## 
## ========================================================================
## Backward Stepwise Linear Model (criterion: AIC, k = 2.000)
## Final Formula: avg_salary_k ~ python + sas + tableau + sql + aws + pytorch + 
##  Final Formula:     tensor + scikit + google_an
## ------------------------------------------------------------------------
## 
## Call:
## lm(formula = avg_salary_k ~ python + sas + tableau + sql + aws + 
##     pytorch + tensor + scikit + google_an, data = df, na.action = na_action)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.452 -23.502  -5.291  19.320 123.744 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   90.952      2.089  43.543  < 2e-16 ***
## python        22.804      2.850   8.001 4.82e-15 ***
## sas           18.520      4.599   4.027 6.24e-05 ***
## tableau       -7.582      3.455  -2.194 0.028524 *  
## sql           -9.796      2.932  -3.341 0.000876 ***
## aws           11.327      3.076   3.682 0.000249 ***
## pytorch      -19.513      7.359  -2.652 0.008181 ** 
## tensor        14.230      6.010   2.368 0.018155 *  
## scikit        12.114      5.788   2.093 0.036717 *  
## google_an    -29.472      9.318  -3.163 0.001626 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.17 on 732 degrees of freedom
##   (11 observations deleted due to missingness)
## Multiple R-squared:  0.1792, Adjusted R-squared:  0.1691 
## F-statistic: 17.76 on 9 and 732 DF,  p-value: < 2.2e-16
## 
## ------------------------------------------------------------------------
## Selected predictors:
##  - python
##  - sas
##  - tableau
##  - sql
##  - aws
##  - pytorch
##  - tensor
##  - scikit
##  - google_an
## 
## Dropped predictors:
##  - excel
##  - spark
##  - mongo
##  - hadoop
##  - keras
##  - flink
##  - r
##  - scala
##  - databricks
##  - gcp
##  - snowflake
##  - tensorflow
##  - xgboost
##  - azure
## 
## ------------------------------------------------------------------------
## Observations used: 742
## Residual df: 732
## Residual SE: 34.166
## Multiple R-squared: 0.17922
## Adjusted R-squared: 0.16912
## AIC: 7357.991

Google Gemini. (2026). Gemini 3 Flash \[Large language model\]. https://gemini.google.com. Accessed March 20, 2026.

Project 3 Collaboration Data Acquisition

Mei Qi Ng

now