Data Storage

Spring 2026

Data Storage

Common Storage Needs

Every part of the pipeline can have storage requirements
For example: Even during data collection and processing, you may need to store data to:
- To retain metadata information about the location and state of your data
- To retain important queries for obtaining data
- Help filter other sources, scripts for scraping
- Post-processed versions of data to make analysis easier
The analysis & visualization portion could require the storage of:
- Models
- Training/testing sets
- Results of analysis
- Plots, tables, graphs, etc
Data storage can help provide durability and availability to your data analysis process

Common Sources for Storage

You can store your data in the same types of sources your receive: CSV, excel, stata, a database, etc
Data Warehouses – Optimized for structured data, typically in a database, to provide facile access to data for analysis and reporting
Data Lakes – Optimized for structured and unstructured data, to provide flexibility and the ability to retrieve raw, unprocessed data

Typical Workflows

Scraping and collection tasks obtain data from different sources
Munging and processing tasks reshape, clean, group, and aggregate data
Intermediate data is warehoused for anlaysis purposes
Models trained on warehoused, structured data
Results & visualizations are stored for reporting purposes

Storage with R

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

url_str <- "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data <- read.table(url_str, header=T, sep=',')   # Load the data off the Internet
head(enterprise_data, 3)

##   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC
## 1 2023                     Level 1                99999       All industries
## 2 2023                     Level 1                99999       All industries
## 3 2023                     Level 1                99999       All industries
##                Units Variable_code
## 1 Dollars (millions)           H01
## 2 Dollars (millions)           H04
## 3 Dollars (millions)           H05
##                                     Variable_name     Variable_category  Value
## 1                                    Total income Financial performance 930995
## 2 Sales, government funding, grants and subsidies Financial performance 821630
## 3               Interest, dividends and donations Financial performance  84354
##                                                                                             Industry_code_ANZSIC06
## 1 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 2 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 3 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)

The `write` Functions

You can write out data frames as text-readable files using any delimiter (separator) you wish:

write.table(enterprise_data, file="enterprise.csv", sep=",")
system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm

Saving as Binary Objects

R data objects in the environment can be saved as binary objects
This allows for faster loads
Also helps with partial environment restores

save(enterprise_data, file="enterprise.Rdata")   # Save it as a binary object
rm(enterprise_data)                              # Remove it from the environment
load(file="enterprise.Rdata")                    # Restore it to the environment

Saving To Excel

You will need to install external library writexl

library(writexl)
write_xlsx(enterprise_data, "enterprise-data.xslsx")

Other Exports

Install the haven external library to perform other exports:

SAS: write_xpt(enterprise_data, "enterprise-data.xpt")
SPSS: write_sav(enterprise_data, "enterprise-data.sav")
Stata: write_dta(enterprise_data, "enterprise-data.dta")
More information: https://haven.tidyverse.org/

Storage with Python

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

import pandas as pd
import os
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',')   # Load the data off the Internet
enterprise_data.head(3)

##    Year  ...                             Industry_code_ANZSIC06
## 0  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 1  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 2  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 
## [3 rows x 10 columns]

The `to_csv` DataFrame Method

You can write out data frames as text-readable files using any delimeter (separator) you wish:

enterprise_data.to_csv("enterprise.csv", sep=",")
os.system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm

## 0

Pickling

Python allows you to serialize objects to a binary format
These can be saved
Python calls such serialized binary objects “pickles”

enterprise_data.to_pickle('enterprise.pkl')        # Save it as a binary object
enterprise_data = None                             # Remove it from the environment
enterprise_data = pd.read_pickle('enterprise.pkl') # Restore it to the environment

Saving To Excel

You will need to install external library openpyxl

import pandas as pd
import openpyxl
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',') 
enterprise_data.to_excel("foo.xlsx")

Note: I fixed this after the video … this should work now.

Other Exports

Install the haven external library to perform other exports:

Stata: enterprise_data.to_stata("enterprise-data.dta")
SAS: Requires the saspy library
SPSS: Requires the pyreadstat library
More information: https://pandas.pydata.org/docs/reference/io.html

Storage with Julia

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

using DataFrames, HTTP, CSV
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv";
enterprise_data = DataFrame(CSV.File(HTTP.get(url_str).body));

Use CSV.write to Output a CSCV

You can write out data frames as text-readable files using any delimiter (separator) you wish:

CSV.write("enterprise.csv", enterprise_data, delim='|');

More Info: https://dataframes.juliadata.org/stable/man/importing_and_exporting/

Serialize Data to Binary Files

Julia also allows you to serialize data to a binary format
These can be saved and reloaded for quick storage/retrieval

using Serialization
serialize("enterprise.bin", enterprise_data)
myvar = deserialize("enterprise.bin")

More Info: https://docs.julialang.org/en/v1/stdlib/Serialization/

Do-Blocks in Julia

When using things like list comprehension or mapping it can be problematic if what you need to do requires more than one statement
Of course, we could make a function, but that’s annoying if it’s a one-off
We call these ideas anonymous functions –more about that later in the class
One way to employ anonymous functions in Julia is using a do-block
You’re essentially passing a block of code as an argument to something
It’s basically saying: do this block of code when you get to that part of your function

More Info: https://docs.julialang.org/en/v1/manual/functions/#Do-Block-Syntax-for-Function-Arguments

Saving To Excel, in General

You will need to install external package XLSX

XLSX.openxlsx("new-spreadsheet.xlsx", mode="w") do xf
    sheet = xf[1]
    XLSX.rename!(sheet, "MyGreatSheet")
    sheet["A1"] = "CSCI296"
    sheet["A2"] = "Paul"
    sheet["A3"] = "Wiegand"
    sheet["A4"] = 99.9

    # will add a matrix from "A7" to "C9"
    sheet["A7:C9"] = [ 1 2 3 ; 4 5 6 ; 7 8 9 ]
end

More Info: https://felipenoris.github.io/XLSX.jl/stable/tutorial/#Writing-Excel-Files

Saving DataFrames To Excel

Use XLSX.writetable to write a DataFrame to Excel format
You can put different dataframes into different sheets straightfowardly

XLSX.writetable("enterprise.xlsx", "EnterpriseSheet" => enterprise_data);

More Info: https://felipenoris.github.io/XLSX.jl/stable/tutorial/#Writing-Excel-Files

Other Formats

There are external packages for reading/writing other stats data formats:

Data Storage

Common Storage Needs

Common Sources for Storage

Typical Workflows

Storage with R

Let’s Go Get Some Data

The write Functions

Saving as Binary Objects

Saving To Excel

Other Exports

Storage with Python

Let’s Go Get Some Data

The to_csv DataFrame Method

Pickling

Saving To Excel

Other Exports

Storage with Julia

Let’s Go Get Some Data

Use CSV.write to Output a CSCV

Serialize Data to Binary Files

Do-Blocks in Julia

Saving To Excel, in General

Saving DataFrames To Excel

Other Formats

The `write` Functions

The `to_csv` DataFrame Method