Spring 2025

Data Storage

Common Storage Needs

  • Every part of the pipeline can have storage requirements
  • For example, even during data collection and processing, you may need storage to:
    • Retain metadata about the location and state of your data
    • Retain important queries for obtaining data
    • Keep filters for other sources and scripts for scraping
    • Keep post-processed versions of data that make analysis easier
  • The analysis & visualization portion could require the storage of:
    • Models
    • Training/testing sets
    • Results of analysis
    • Plots, tables, graphs, etc.
  • Data storage can help provide durability and availability to your data analysis process

Common Sources for Storage

  • You can store your data in the same types of sources you receive: CSV, Excel, Stata, a database, etc.
  • Data Warehouses – Optimized for structured data, typically in a database, to provide facile access to data for analysis and reporting
  • Data Lakes – Optimized for structured and unstructured data, to provide flexibility and the ability to retrieve raw, unprocessed data
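
To make the warehouse/lake distinction concrete, here is a minimal sketch in Python (pandas and the standard sqlite3 module assumed; the table contents are made-up toy values): structured rows go into a relational database for warehouse-style access, while the raw file is kept untouched, lake-style, for later reprocessing.

import os
import sqlite3
import pandas as pd

# A tiny structured table standing in for processed survey data (toy values)
df = pd.DataFrame({"Year": [2023, 2023],
                   "Industry": ["All industries", "Mining"],
                   "Income": [930995, 12345]})

# Warehouse-style: load structured rows into a relational database
con = sqlite3.connect("warehouse.db")
df.to_sql("enterprise", con, if_exists="replace", index=False)
con.close()

# Lake-style: keep the raw, unprocessed file as-is for later reprocessing
os.makedirs("lake", exist_ok=True)
df.to_csv("lake/enterprise_raw.csv", index=False)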

Typical Workflows

  1. Scraping and collection tasks obtain data from different sources
  2. Munging and processing tasks reshape, clean, group, and aggregate data
  3. Intermediate data is warehoused for analysis purposes
  4. Models are trained on warehoused, structured data
  5. Results & visualizations are stored for reporting purposes
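
A minimal sketch of that flow in Python (pandas assumed; the file names, and a raw_survey.csv with Year and Value columns, are illustrative assumptions):

import pandas as pd

# 1-2. Collect the raw data, then clean and aggregate it
raw = pd.read_csv("raw_survey.csv")
by_year = raw.dropna().groupby("Year", as_index=False)["Value"].sum()

# 3. Warehouse the intermediate, analysis-ready table
by_year.to_pickle("warehouse_by_year.pkl")

# 4. Train models on the warehoused data (modeling code omitted)

# 5. Store results and visualizations for reporting
by_year.to_csv("results_by_year.csv", index=False)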

Storage with R

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

url_str <- "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data <- read.table(url_str, header=T, sep=',')   # Load the data off the Internet
head(enterprise_data, 3)
##   Year Industry_aggregation_NZSIOC Industry_code_NZSIOC Industry_name_NZSIOC
## 1 2023                     Level 1                99999       All industries
## 2 2023                     Level 1                99999       All industries
## 3 2023                     Level 1                99999       All industries
##                Units Variable_code
## 1 Dollars (millions)           H01
## 2 Dollars (millions)           H04
## 3 Dollars (millions)           H05
##                                     Variable_name     Variable_category  Value
## 1                                    Total income Financial performance 930995
## 2 Sales, government funding, grants and subsidies Financial performance 821630
## 3               Interest, dividends and donations Financial performance  84354
##                                                                                             Industry_code_ANZSIC06
## 1 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 2 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)
## 3 ANZSIC06 divisions A-S (excluding classes K6330, L6711, O7552, O760, O771, O772, S9540, S9601, S9602, and S9603)

The write Functions

You can write out data frames as text-readable files using any delimiter (separator) you wish:

write.table(enterprise_data, file="enterprise.csv", sep=",", row.names=FALSE)   # row.names=FALSE keeps the header aligned with the columns
system("cat enterprise.csv | head -n 4")  # just show the top 4 lines of that file to confirm

Saving as Binary Objects

  • R data objects in the environment can be saved as binary objects
  • This allows for faster loads
  • Also helps with partial environment restores
save(enterprise_data, file="enterprise.Rdata")   # Save it as a binary object
rm(enterprise_data)                              # Remove it from the environment
load(file="enterprise.Rdata")                    # Restore it to the environment

Saving To Excel

You will need to install the external writexl library:

library(writexl)
write_xlsx(enterprise_data, "enterprise-data.xlsx")

Other Exports

Install the haven external library and load it with library(haven) to perform other exports:

  • SAS:  write_xpt(enterprise_data, "enterprise-data.xpt")

  • SPSS:  write_sav(enterprise_data, "enterprise-data.sav")

  • Stata:  write_dta(enterprise_data, "enterprise-data.dta")

  • More information: https://haven.tidyverse.org/

Storage with Python

Let’s Go Get Some Data

For these examples, I’ll use some New Zealand fiscal data:

import pandas as pd
import os
url_str = "https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2023-financial-year-provisional/Download-data/annual-enterprise-survey-2023-financial-year-provisional.csv"
enterprise_data = pd.read_csv(url_str, sep=',')   # Load the data off the Internet
enterprise_data.head(3)
##    Year  ...                             Industry_code_ANZSIC06
## 0  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 1  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 2  2023  ...  ANZSIC06 divisions A-S (excluding classes K633...
## 
## [3 rows x 10 columns]

The to_csv DataFrame Method

You can write out data frames as text-readable files using any delimiter (separator) you wish:

enterprise_data.to_csv("enterprise.csv", sep=",", index=False)   # index=False drops the row-index column
os.system("cat enterprise.csv | head -n 4")  # show the top 4 lines of the file; the printed 0 is the shell's exit status
## 0

Pickling

  • Python allows you to serialize objects to a binary format
  • These can be saved
  • Python calls such serialized binary objects “pickles”
enterprise_data.to_pickle('enterprise.pkl')        # Save it as a binary object
enterprise_data = None                             # Remove it from the environment
enterprise_data = pd.read_pickle('enterprise.pkl') # Restore it to the environment
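
The same idea extends beyond data frames: any picklable Python object (a fitted model, a dictionary of results) can be saved with the standard pickle module. A minimal sketch using made-up values:

import pickle

results = {"model": "ols", "r_squared": 0.87}    # any picklable object; toy values

with open("results.pkl", "wb") as fh:            # save it as a binary object
    pickle.dump(results, fh)

with open("results.pkl", "rb") as fh:            # restore it later
    results = pickle.load(fh)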

Saving To Excel

You will need to install the external openpyxl library:

enterprise_data.to_excel('enterprise-data.xlsx', index=False)

Other Exports

Install whatever external libraries are needed to perform other exports. For example:
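
pandas can write Stata files natively, and the external pyreadstat package (one option, mirroring haven's role in R; its use here is an assumption) covers SPSS and SAS transport files:

import pyreadstat   # external package: pip install pyreadstat

# Stata: supported directly by pandas
enterprise_data.to_stata("enterprise-data.dta", write_index=False)

# SPSS and SAS transport files via pyreadstat
pyreadstat.write_sav(enterprise_data, "enterprise-data.sav")
pyreadstat.write_xport(enterprise_data, "enterprise-data.xpt")

More information: https://github.com/Roche/pyreadstat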

Project Proposals

Semester Project:

  1. Pose question(s) you’d like to answer from data
  2. Locate some data that you’d like to analyze
  3. Collect and prepare all of the data that you will need
  4. Perform the analysis needed to answer your question using some modeling technique
  5. Develop at least two visualizations in support of your analysis and conclusions
  6. Store the products of your collection, analysis, and visualizations
  7. End of Semester: Submit all code, and give a presentation showing how you addressed each step of the data pipeline

Project Proposal:

  1. Pose question(s) you’d like to answer from data
  2. Locate preliminary sources of the data you will need
  3. Give high level consideration to the types of analysis you may wish to do
  4. Middle of Semester: Give a short presentation describing the three items above