Spring 2026

Data Storage

Common Storage Needs

  • Every part of the pipeline can have storage requirements
  • For example: Even during data collection and processing, you may need to store data to:
    • To retain metadata information about the location and state of your data
    • To retain important queries for obtaining data
    • Help filter other sources, scripts for scraping
    • Post-processed versions of data to make analysis easier
  • The analysis & visualization portion could require the storage of:
    • Models
    • Training/testing sets
    • Results of analysis
    • Plots, tables, graphs, etc
  • Data storage can help provide durability and availability to your data analysis process

Common Sources for Storage

  • You can store your data in the same types of sources your receive: CSV, excel, stata, a database, etc
  • Data Warehouses – Optimized for structured data, typically in a database, to provide facile access to data for analysis and reporting
  • Data Lakes – Optimized for structured and unstructured data, to provide flexibility and the ability to retrieve raw, unprocessed data

Typical Workflows

  1. Scraping and collection tasks obtain data from different sources
  2. Munging and processing tasks reshape, clean, group, and aggregate data
  3. Intermediate data is warehoused for anlaysis purposes
  4. Models trained on warehoused, structured data
  5. Results & visualizations are stored for reporting purposes

Storage with R

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

url_str <- "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data <- read.table(url_str, header=T, sep=',')   # Load the data off the Internet
head(enterprise_data, 3)
##   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC
## 1 2023                     Level 1                99999       All industries
## 2 2023                     Level 1                99999       All industries
## 3 2023                     Level 1                99999       All industries
##                Units Variable_code
## 1 Dollars (millions)           H01
## 2 Dollars (millions)           H04
## 3 Dollars (millions)           H05
##                                     Variable_name     Variable_category  Value
## 1                                    Total income Financial performance 930995
## 2 Sales, government funding, grants and subsidies Financial performance 821630
## 3               Interest, dividends and donations Financial performance  84354
##                                                                                             Industry_code_ANZSIC06
## 1 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 2 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 3 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)

The write Functions

You can write out data frames as text-readable files using any delimiter (separator) you wish:

write.table(enterprise_data, file="enterprise.csv", sep=",")
system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm

Saving as Binary Objects

  • R data objects in the environment can be saved as binary objects
  • This allows for faster loads
  • Also helps with partial environment restores
save(enterprise_data, file="enterprise.Rdata")   # Save it as a binary object
rm(enterprise_data)                              # Remove it from the environment
load(file="enterprise.Rdata")                    # Restore it to the environment

Saving To Excel

You will need to install external library writexl

library(writexl)
write_xlsx(enterprise_data, "enterprise-data.xslsx")

Other Exports

Install the haven external library to perform other exports:

  • SAS:  write_xpt(enterprise_data, "enterprise-data.xpt")

  • SPSS:  write_sav(enterprise_data, "enterprise-data.sav")

  • Stata:  write_dta(enterprise_data, "enterprise-data.dta")

  • More information: https://haven.tidyverse.org/

Storage with Python

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

import pandas as pd
import os
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',')   # Load the data off the Internet
enterprise_data.head(3)
##    Year  ...                             Industry_code_ANZSIC06
## 0  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 1  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 2  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 
## [3 rows x 10 columns]

The to_csv DataFrame Method

You can write out data frames as text-readable files using any delimeter (separator) you wish:

enterprise_data.to_csv("enterprise.csv", sep=",")
os.system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm
## 0

Pickling

  • Python allows you to serialize objects to a binary format
  • These can be saved
  • Python calls such serialized binary objects “pickles”
enterprise_data.to_pickle('enterprise.pkl')        # Save it as a binary object
enterprise_data = None                             # Remove it from the environment
enterprise_data = pd.read_pickle('enterprise.pkl') # Restore it to the environment

Saving To Excel

You will need to install external library openpyxl

import pandas as pd
import openpyxl
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',') 
enterprise_data.to_excel("foo.xlsx")

Note: I fixed this after the video … this should work now.

Other Exports

Install the haven external library to perform other exports:

Storage with Julia

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

using DataFrames, HTTP, CSV
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv";
enterprise_data = DataFrame(CSV.File(HTTP.get(url_str).body));

Use CSV.write to Output a CSCV

Serialize Data to Binary Files

  • Julia also allows you to serialize data to a binary format
  • These can be saved and reloaded for quick storage/retrieval
using Serialization
serialize("enterprise.bin", enterprise_data)
myvar = deserialize("enterprise.bin")

More Info: https://docs.julialang.org/en/v1/stdlib/Serialization/

Do-Blocks in Julia

  • When using things like list comprehension or mapping it can be problematic if what you need to do requires more than one statement
  • Of course, we could make a function, but that’s annoying if it’s a one-off
  • We call these ideas anonymous functions –more about that later in the class
  • One way to employ anonymous functions in Julia is using a do-block
  • You’re essentially passing a block of code as an argument to something
  • It’s basically saying: do this block of code when you get to that part of your function

More Info: https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments

Saving To Excel, in General

You will need to install external package XLSX

XLSX.openxlsx("new-spreadsheet.xlsx", mode="w") do xf
    sheet = xf[1]
    XLSX.rename!(sheet, "MyGreatSheet")
    sheet["A1"] = "CSCI296"
    sheet["A2"] = "Paul"
    sheet["A3"] = "Wiegand"
    sheet["A4"] = 99.9

    # will add a matrix from "A7" to "C9"
    sheet["A7:C9"] = [ 1 2 3 ; 4 5 6 ; 7 8 9 ]
end

More Info: https://felipenoris.github.io/XLSX.jl/stable/tutorial/#Writing-Excel-Files

Saving DataFrames To Excel

Other Formats