How To: Load Data Straight From Kaggle

Author

By Tony Fraser

Published

November 8, 2023

Overview

This snippet demonstrates how to call the kaggle client directly from an R program. If you use this, you r code can run against any kaggle dataset.

Required setup

  1. Make sure python and pip are both installed on your computer, and in your path.
  2. Make sure you installed the kaggle client with pip install kaggle. Kaggle needs to be in your path as well.
  3. You need your credentials file from kaggle. Log in, click on settings, api, and then download your an api credentials file. You’ll see below that this code looks for a file called nogit_kaggle.json in the same directory. Adjust as necessary.

Notes before you start

  1. This is automation code and you don’t have to use it. If you have kaggle installed on your computer already, you can just as easily run the kaggle download with all the flags yourself.
  2. Make SURE to add your dataset file and your kaggle credentials file to .gitignore.
  3. In this demo, the name “merged_stack_wide.csv” is the exact name of the file stored up on kaggle.

load_kaggle_data()

library(jsonlite)
library(readr)
library(tidyverse)

load_kaggle_data <- function(local_filename, kaggle_dataset) {
  if (!file.exists(local_filename)) {
    # Load Kaggle API credentials only if the file needs to be downloaded
    kaggle_credentials <- fromJSON("./nogit_kaggle.json")
    kaggle_username <- kaggle_credentials$username
    kaggle_key <- kaggle_credentials$key
    Sys.setenv(KAGGLE_USERNAME = kaggle_username)
    Sys.setenv(KAGGLE_KEY = kaggle_key)

    if (system("kaggle --version", ignore.stdout = TRUE, ignore.stderr = TRUE) != 0) {
      stop("Kaggle CLI is not installed or not in the PATH.")
    }

    message(paste("Downloading", local_filename, "from Kaggle"))
    
    if (system(paste("kaggle datasets download", kaggle_dataset, "-f", local_filename)) != 0) {
      stop("Failed to download the dataset with Kaggle CLI.")
    }
    
    unzip_status <- system(paste("unzip -o", local_filename))
    if (unzip_status != 0) {
      stop("Failed to unzip the dataset.")
    }
    
    zip_filename <- paste0(local_filename, ".zip")
    if (file.exists(zip_filename)) {
      file.remove(zip_filename)
    }
  } else {
    message(paste(local_filename, "found locally. Loading..."))
  }
  
  read_csv(local_filename)
}

Use the function

The function will download the kaggle dataset file, save locally, unzip it, delete the zip file, read the csv back in, and return a dataframe. If you have already downloaded the file, it will load from your local computer, like loading from cache. Simply delete the data file to force a re-download.

k_dataset <- "tonyfraser/formatted-stack-overflow-survey-2017-2022"
local_wide <- "merged_stack_wide.csv"

# Load the dataset
merged_stack_wide <- load_kaggle_data(local_wide, k_dataset)
glimpse(merged_stack_wide)
Rows: 460,298
Columns: 44
$ Year               <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201…
$ OrgSize            <chr> NA, "20 to 99 employees", "10,000 or more employees…
$ Country            <chr> "United States", "United Kingdom", "United Kingdom"…
$ Employment         <chr> "Not employed, and not looking for work", "Employed…
$ Gender             <chr> "Male", "Male", "Male", "Male", NA, NA, "Male", "Ma…
$ EdLevel            <chr> "Secondary school", "Some college/university study …
$ Age                <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ DevType            <chr> NA, NA, "Other", NA, "Mobile developer; Graphics pr…
$ DatabaseWorkedWith <chr> NA, "MySQL; SQLite", "MySQL", "MongoDB; Redis; SQL …
$ LanguageWorkedWith <chr> "Swift", "JavaScript; Python; Ruby; SQL", "Java; PH…
$ PlatformWorkedWith <chr> "iOS", "Amazon Web Services (AWS)", NA, "Windows De…
$ YearsCodePro       <chr> "2 to 3 years", "9 to 10 years", "20 or more years"…
$ AnnualSalary       <dbl> NA, NA, 113750, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ YearsCodeProAvg    <dbl> 2.5, 9.5, 20.0, 14.5, 20.0, 6.5, 9.5, 10.5, 13.5, 1…
$ OrgSizeAvg         <dbl> NA, 59.50, 5.00, 5.00, 14.50, NA, 59.50, 10.00, 253…
$ AgeAvg             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ python             <chr> "no", "yes", "yes", "yes", "no", "no", "yes", "yes"…
$ sql                <chr> "no", "yes", "no", "yes", "no", "no", "no", "no", "…
$ java               <chr> "no", "no", "yes", "no", "no", "no", "no", "no", "n…
$ javascript         <chr> "no", "yes", "no", "no", "no", "yes", "no", "yes", …
$ ruby               <chr> "no", "yes", "no", "no", "no", "no", "no", "yes", "…
$ php                <chr> "no", "no", "yes", "no", "no", "yes", "no", "no", "…
$ c                  <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ swift              <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "n…
$ scala              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ r                  <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "n…
$ rust               <chr> "no", "no", "no", "no", "no", "yes", "no", "no", "n…
$ julia              <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ mysql              <chr> "no", "yes", "yes", "yes", "no", "yes", "no", "yes"…
$ microsoftsqlserver <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ mongodb            <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "n…
$ postgresql         <chr> "no", "no", "no", "no", "no", "no", "no", "yes", "n…
$ oracle             <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ ibmdb2             <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ redis              <chr> "no", "no", "no", "yes", "no", "no", "no", "yes", "…
$ sqlite             <chr> "no", "yes", "no", "yes", "no", "no", "yes", "no", …
$ mariadb            <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ microsoftazure     <chr> "no", "no", "no", "no", "no", "no", "no", "no", "ye…
$ googlecloud        <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ ibmcloudorwatson   <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ kubernetes         <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ linux              <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "…
$ windows            <chr> "no", "no", "no", "yes", "no", "no", "yes", "no", "…
$ aws                <chr> "no", "yes", "no", "yes", "no", "yes", "no", "yes",…