Spring 2025

Data Storage

Common Storage Needs

  • Every part of the pipeline can have storage requirements
  • For example, even during data collection and processing, you may need storage to:
    • Retain metadata about the location and state of your data
    • Retain important queries for obtaining data
    • Keep filters for other sources and scripts for scraping
    • Keep post-processed versions of data that make analysis easier
  • The analysis & visualization portion could require the storage of:
    • Models
    • Training/testing sets
    • Results of analysis
    • Plots, tables, graphs, etc.
  • Data storage can help provide durability and availability to your data analysis process

Common Sources for Storage

  • You can store your data in the same types of sources you receive: CSV, Excel, Stata, a database, etc.
  • Data Warehouses – Optimized for structured data, typically in a database, to provide facile access to data for analysis and reporting
  • Data Lakes – Optimized for structured and unstructured data, to provide flexibility and the ability to retrieve raw, unprocessed data
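
To make the warehouse/lake distinction concrete, here is a minimal sketch in Python (pandas and the standard sqlite3 module assumed; the table contents are made-up toy values): structured rows go into a relational database for warehouse-style access, while the raw file is kept untouched, lake-style, for later reprocessing.

import os
import sqlite3
import pandas as pd

# A tiny structured table standing in for processed survey data (toy values)
df = pd.DataFrame({"Year": [2023, 2023],
                   "Industry": ["All industries", "Mining"],
                   "Income": [930995, 12345]})

# Warehouse-style: load structured rows into a relational database
con = sqlite3.connect("warehouse.db")
df.to_sql("enterprise", con, if_exists="replace", index=False)
con.close()

# Lake-style: keep the raw, unprocessed file as-is for later reprocessing
os.makedirs("lake", exist_ok=True)
df.to_csv("lake/enterprise_raw.csv", index=False)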

Typical Workflows

  1. Scraping and collection tasks obtain data from different sources
  2. Munging and processing tasks reshape, clean, group, and aggregate data
  3. Intermediate data is warehoused for analysis purposes
  4. Models are trained on warehoused, structured data
  5. Results & visualizations are stored for reporting purposes
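
A minimal sketch of that flow in Python (pandas assumed; the file names, and a raw_survey.csv with Year and Value columns, are illustrative assumptions):

import pandas as pd

# 1-2. Collect the raw data, then clean and aggregate it
raw = pd.read_csv("raw_survey.csv")
by_year = raw.dropna().groupby("Year", as_index=False)["Value"].sum()

# 3. Warehouse the intermediate, analysis-ready table
by_year.to_pickle("warehouse_by_year.pkl")

# 4. Train models on the warehoused data (modeling code omitted)

# 5. Store results and visualizations for reporting
by_year.to_csv("results_by_year.csv", index=False)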

Storage with R

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

url_str <- "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data <- read.table(url_str, header=T, sep=',')   # Load the data off the Internet
head(enterprise_data, 3)
##   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC
## 1 2023                     Level 1                99999       All industries
## 2 2023                     Level 1                99999       All industries
## 3 2023                     Level 1                99999       All industries
##                Units Variable_code
## 1 Dollars (millions)           H01
## 2 Dollars (millions)           H04
## 3 Dollars (millions)           H05
##                                     Variable_name     Variable_category  Value
## 1                                    Total income Financial performance 930995
## 2 Sales, government funding, grants and subsidies Financial performance 821630
## 3               Interest, dividends and donations Financial performance  84354
##                                                                                             Industry_code_ANZSIC06
## 1 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 2 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 3 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)

The write Functions

You can write out data frames as text-readable files using any delimiter (separator) you wish:

write.table(enterprise_data, file="enterprise.csv", sep=",", row.names=FALSE)   # row.names=FALSE keeps the header aligned with the columns
system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm

Saving as Binary Objects

  • R data objects in the environment can be saved as binary objects
  • This allows for faster loads
  • Also helps with partial environment restores
save(enterprise_data, file="enterprise.Rdata")   # Save it as a binary object
rm(enterprise_data)                              # Remove it from the environment
load(file="enterprise.Rdata")                    # Restore it to the environment

Saving To Excel

You will need to install the external writexl library:

library(writexl)
write_xlsx(enterprise_data, "enterprise-data.xlsx")

Other Exports

Install the haven external library and load it with library(haven) to perform other exports:

  • SAS:  write_xpt(enterprise_data, "enterprise-data.xpt")

  • SPSS:  write_sav(enterprise_data, "enterprise-data.sav")

  • Stata:  write_dta(enterprise_data, "enterprise-data.dta")

  • More information: https://haven.tidyverse.org/

Storage with Python

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

import pandas as pd
import os
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',')   # Load the data off the Internet
enterprise_data.head(3)
##    Year  ...                             Industry_code_ANZSIC06
## 0  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 1  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 2  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 
## [3 rows x 10 columns]

The to_csv DataFrame Method

You can write out data frames as text-readable files using any delimiter (separator) you wish:

enterprise_data.to_csv("enterprise.csv", sep=",", index=False)   # index=False drops the row-index column
os.system("cat enterprise.csv | head -n 4")  # show the top 4 lines of the file; the printed 0 is the shell's exit status
## 0

Pickling

  • Python allows you to serialize objects to a binary format
  • These can be saved
  • Python calls such serialized binary objects “pickles”
enterprise_data.to_pickle('enterprise.pkl')        # Save it as a binary object
enterprise_data = None                             # Remove it from the environment
enterprise_data = pd.read_pickle('enterprise.pkl') # Restore it to the environment
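
The same idea extends beyond data frames: any picklable Python object (a fitted model, a dictionary of results) can be saved with the standard pickle module. A minimal sketch using made-up values:

import pickle

results = {"model": "ols", "r_squared": 0.87}    # any picklable object; toy values

with open("results.pkl", "wb") as fh:            # save it as a binary object
    pickle.dump(results, fh)

with open("results.pkl", "rb") as fh:            # restore it later
    results = pickle.load(fh)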

Saving To Excel

You will need to install the external openpyxl library:

enterprise_data.to_excel('enterprise-data.xlsx', index=False)

Other Exports

Install whatever external libraries are needed to perform other exports. For example:
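
pandas can write Stata files natively, and the external pyreadstat package (one option, mirroring haven's role in R; its use here is an assumption) covers SPSS and SAS transport files:

import pyreadstat   # external package: pip install pyreadstat

# Stata: supported directly by pandas
enterprise_data.to_stata("enterprise-data.dta", write_index=False)

# SPSS and SAS transport files via pyreadstat
pyreadstat.write_sav(enterprise_data, "enterprise-data.sav")
pyreadstat.write_xport(enterprise_data, "enterprise-data.xpt")

More information: https://github.com/Roche/pyreadstat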

Project Proposals

Semester Project:

  1. Pose question(s) you’d like to answer from data
  2. Locate some data that you’d like to analyze
  3. Collect and prepare all of the data that you will need
  4. Perform the analysis needed to answer your question using some modeling technique
  5. Develop at least two visualizations in support of your analysis and conclusions
  6. Store the products of your collection, analysis, and visualizations
  7. End of Semester: Submit all code, and give a presentation showing how you addressed each step of the data pipeline

Project Proposal:

  1. Pose question(s) you’d like to answer from data
  2. Locate preliminary sources of the data you will need
  3. Give high level consideration to the types of analysis you may wish to do
  4. Middle of Semester: Give a short presentation describing the three items above