Workshop 2 - Data Structure for Accounting

Author

Alberto Dorantes, Ph.D.

Published

May 11, 2026

Abstract

In this workshop we continue practicing basic data analytic skills using real financial data

1 Data Structures for Business Analytics

Before working with real financial datasets, it is essential to understand how data is organized and stored. In this section we review two fundamental concepts: (1) common data formats — wide vs. long — and (2) the Star Database model, the standard structure used in business intelligence and data analytics.

1.1 Common Data Formats: Wide vs. Long (Panel Data)

1.1.1 Wide Format

In a wide format table, each entity (e.g., a firm) occupies a single row, and the measurements for different time periods or categories appear as separate columns. This layout feels natural because it resembles a traditional accounting spreadsheet: you can read all of a firm’s information across a single row.

Example — Revenue (in millions USD) for three firms over four years (Wide Format):

Wide Format — Revenue by firm and year
Firm	Rev_2020	Rev_2021	Rev_2022	Rev_2023
Apple	274,515	365,817	394,328	383,285
Microsoft	143,015	168,088	198,270	211,915
Tesla	31,536	53,823	81,462	96,773

Advantages of wide format:

Easy to read visually — one row per entity.
Familiar to spreadsheet users.
Convenient for simple cross-sectional calculations (e.g., comparing firms in a single year).

Disadvantages of wide format:

Adding a new year requires adding a new column, which breaks the structure.
It is difficult to filter, aggregate, or summarize by time period using analytic tools.
Incompatible with most Business Intelligence (BI) tools such as Power BI, Tableau, or Excel Pivot Tables.

1.1.2 Long Format (Panel Data)

In a long format (also called panel data or tidy data) table, each row represents one observation for one entity at one point in time. Instead of spreading time periods across columns, time becomes its own column. Every variable has exactly one column and every observation has exactly one row.

Example — Same revenue data restructured as Long Format (Panel Data):

Long (Panel Data) Format — Revenue by firm and year
FirmID	FirmName	Year	Revenue
AAPL	Apple	2020	274,515
AAPL	Apple	2021	365,817
AAPL	Apple	2022	394,328
AAPL	Apple	2023	383,285
MSFT	Microsoft	2020	143,015
MSFT	Microsoft	2021	168,088
MSFT	Microsoft	2022	198,270
MSFT	Microsoft	2023	211,915
TSLA	Tesla	2020	31,536
TSLA	Tesla	2021	53,823
TSLA	Tesla	2022	81,462
TSLA	Tesla	2023	96,773

Notice that the same 12 data points are present in both tables. The difference is purely structural. In long format:

FirmID / FirmName identify the entity.
Year is an explicit column (not a column name).
Revenue is a single column regardless of how many years exist.

Advantages of long format:

Easily extendable — new years just add new rows, not new columns.
Required by most analytics and BI tools for filtering, grouping, and summarization.
The natural format for statistical models and machine learning.
Supports multiple variables cleanly: just add columns for COGS, NetIncome, TotalAssets, etc.

Disadvantages of long format:

More rows — a dataset with 3,600 firms × 24 years has over 86,000 rows.
Less intuitive at first glance for users accustomed to spreadsheets.

Key takeaway: The dataset you will work with in this workshop (findata) is already in long format. Each row is one firm-year observation. This is the standard structure required before building Dashboards and conducting any serious data analysis.

1.1.3 Visual Comparison: Wide → Long

The diagram below illustrates the transformation from wide to long format conceptually:

WIDE FORMAT LONG FORMAT ───────────────────────────── ────────────────────────────────── Firm Rev_2020 Rev_2021 Firm Year Revenue ───────────────────────────── ────────────────────────────────── Apple 274,515 365,817 → Apple 2020 274,515 Tesla 31,536 53,823 Apple 2021 365,817 Tesla 2020 31,536 Tesla 2021 53,823 ───────────────────────────── ────────────────────────────────── 2 rows × 3 columns 4 rows × 3 columns ***

This transformation is called pivoting (wide → long is an unpivot, or melt). In Excel this can be done with Power Query. In R it is done with tidyr::pivot_longer().

1.2 The Star Database Model

When organizations need to analyze large volumes of historical data — such as financial statements for thousands of firms over many years — they often organize the data using a Star Database Model (also called a Star Schema). This is the foundational model of data warehouses and Business Intelligence systems.

The name “star” comes from its shape: a central Fact Table is surrounded by several Dimension Tables, forming a star-like diagram.

          ┌─────────────────┐
          │  DIM_TIME       │
          │─────────────────│
          │ YearID (PK)     │
          │ Year            │
          │ Quarter         │
          └────────┬────────┘
                   │

┌─────────────┐ │ ┌──────────────────┐ │ DIM_FIRM │ │ │ DIM_INDUSTRY │ │─────────────│ │ │──────────────────│ │ FirmID (PK) │ │ │ IndustryID (PK) │ │ FirmName │ │ │ IndustryName │ │ Ticker ├─────┼─────┤ Sector │ │ Country │ │ └──────────────────┘ │ Exchange │ │ └──────┬──────┘ │ │ │ │ ┌─────────────────────────────┐ │ │ FACT_FINANCIALS │ └───│─────────────────────────────│ │ FirmID (FK → DIM_FIRM) │ │ YearID (FK → DIM_TIME) │ │ IndustryID(FK → DIM_IND.) │ │ Revenue │ │ COGS │ │ SGAE │ │ Depreciation │ │ TotalAssets │ │ TotalLiabilities │ │ ShareholderEquity │ └─────────────────────────────┘ ***

1.2.1 Fact Tables

A Fact Table is the core of the star schema. It stores the quantitative measurements (facts) of business events or transactions. Each row in a Fact Table represents one observation — for example, the financial results of one firm for one fiscal year.

Characteristics of a Fact Table:

Contains numeric measures (revenue, costs, assets, liabilities, etc.).
Contains foreign keys that link to each Dimension Table.
Is typically in long format — one row per unique combination of dimensions.
Tends to be large (millions of rows in enterprise systems).

Example — FACT_FINANCIALS (excerpt, values in thousands USD):

Fact Table — FACT_FINANCIALS (excerpt)
FirmID	YearID	IndustryID	Revenue	COGS	TotalAssets
AAPL	2023	TECH	383,285	223,546	352,583
MSFT	2023	TECH	211,915	65,863	411,976
TSLA	2023	AUTO	96,773	79,113	93,941
AAPL	2022	TECH	394,328	223,546	352,755
MSFT	2022	TECH	198,270	62,650	364,840

1.2.2 Dimension Tables

Dimension Tables store the descriptive attributes that give context to the facts. They answer the questions: Who? What? When? Where? They are usually much smaller than the Fact Table and contain a Primary Key (PK) that is referenced by the Fact Table as a Foreign Key (FK).

DIM_FIRM — Firm dimension (catalog of companies):

Dimension Table — DIM_FIRM
FirmID	FirmName	Ticker	Country	Exchange	Status
AAPL	Apple Inc.	AAPL	USA	NASDAQ	Active
MSFT	Microsoft	MSFT	USA	NASDAQ	Active
TSLA	Tesla Inc.	TSLA	USA	NASDAQ	Active

DIM_TIME — Time dimension:

Dimension Table — DIM_TIME
YearID	Year	Decade	IsCrisisYear
2020	2020	2020s	Yes
2021	2021	2020s	No
2022	2022	2020s	No
2023	2023	2020s	No

DIM_INDUSTRY — Industry dimension:

Dimension Table — DIM_INDUSTRY
IndustryID	IndustryName	Sector
TECH	Technology	Technology
AUTO	Automotive	Consumer
BANK	Commercial Banking	Finance
HLTH	Healthcare	Healthcare

1.2.3 How the Star Model Works Together

The power of the Star Schema comes from joining the Fact Table with the Dimension Tables using the keys. For example, to answer “What was the average Revenue by Industry and Year for active US firms?”, a query would:

Start with FACT_FINANCIALS (to get Revenue values).
Join with DIM_FIRM (to filter by Country = USA and Status = Active).
Join with DIM_INDUSTRY (to group by IndustryName).
Join with DIM_TIME (to group by Year).

In Excel, this is essentially what you do when you use VLOOKUP / BUSCARV to bring firm names and industry labels from a catalog sheet into your data sheet — and then use a Pivot Table with Slicers to summarize the results. The findata sheet acts as your Fact Table and the usfirm sheet acts as a Dimension Table!

1.2.4 Summary: Key Concepts

Summary of key data structure concepts
Concept	Description	Analogy in Excel
Wide Format	One row per entity; periods as columns	Traditional accounting table
Long Format	One row per entity-period; periods as a column value	The findata dataset
Fact Table	Stores numeric measures; one row per observation	findata sheet
Dimension Table	Stores descriptive attributes; one row per entity	usfirm sheet
Foreign Key	A column in Fact Table that links to a Dimension	The ticker used in VLOOKUP
Star Schema	Fact Table surrounded by Dimension Tables	Data model behind Pivot Tables

Remember: The dataset for this workshop was built following an ETL (Extraction, Transformation, Loading) process and is already structured as a long-format Fact Table, ready to be analyzed with Pivot Tables and Slicers — exactly the Star Schema approach in practice.

2 CHALLENGE - Business Analytics using real data

Last workshop we review the basics of financial statements, financial ratios and basic data analytic skills using data from a Multinational firm. In this workshop we continue practicing data analytical skills and financial ratio calculations, but we will use real historical data of US firms.

You have to work in Excel for this Exercise

Public firms are firms that issue shares to the public. In the US, you can get any historical financial statement of public US firms from [https://www.sec.gov/edgar/searchedgar/companysearch.html||the SEC EDGAR site]. This is the US Security and Exchange Commision (SEC), Company Filings Database. Among many other responsibilities, the SEC monitors and regulates the financial market (similar to the “Comisión Nacional Bancaria y de Valores” in Mexico).

In the SEC EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database any body can find (for free) real financial statement and other corporate information of public US firms. For this exercise I created a dataset using Economatica, a leading financial software. An important data source of Economatica is SEC EDGAR.

I did some programming for data collection, data cleaning, and data management with financial statement variables of all US public firms from the NASDAQ and the New York exchange (around 3,600 firms). This information can be found in the SEC database in the 10-K reports, which are annual statements (Filings) of US firms.

Download the W2-template. This template has 3 Sheets:

• findata: it has real annual data of several financial statement variables for all US public from 2000 to 2023. This file is a dataset with a “panel data” structure, also called “long” format. This long format is required before we start creating Dashboards to summarize the data by different dimensions such as year, industry, firm, etc.

• usfirm: this is a catalog of all US public firms with general information such as company name, industry, status (active/cancelled), etc.

• data dictionary: dictionary that explains the meaning of each variable (column) of the findata dataset

The data is in thousands US dollars. I constructed this dataset following an “ETL” (Extraction, Transformation, Loading) process.

Your task is to use this integrated dataset to do the following Dashboards:

2.1 Dupont Dashboard

Create new columns in the Data Sheet for EBIT, TotalRevenue, Net Income, Shareholder Equity, and PreviousTotalAssets (at t-1). Consider the following:

EBIT (Earnings Before Interest and Taxes) = revenue - cogs - sgae - depreciation

totalRevenue = revenue + otherincome

netIncome = EBIT + otherincome + extraordinaryitems - finexp - incometax

previousTotalAssets = totalassets at the beginning of the year = totalassets in the previous year (at t-1)

The “OTHERINCOME” column represents all revenue different than operational Sales (for example, financial income, extraordinary sales, etc). Then, to to compute EBIT and those ratios related with it, consider ONLY revenue (Not OtherIncome) as Sales. To compute Net Income, ATO, and those ratios related with it, consider Sales + OtherIncome as Total Revenue. For the ratios that end up with invalid result, leave empty spaces (use the conditional function IFERROR or SI.ERROR to do this).

Create a Dashboard to see the following financial ratios BY YEAR:

Average ROA (AVGROA),

Average Profit Margin (AVGPM),

Average Turn Over (AVGATO),

Average Operating ROA (AVGROABIT),

Average Operating Profit Margin (AVGOPPM),

Average Operating Asset Turnover (AVGOPATO).

For the the Operating ratios (AVGROABIT AVGOPPM, AVGOPATO) use only revenue (without including Otherincome nor Extraordinaryitems) since other income is usually not part of daily regular operations of a company.

You have to include SLICERS (Segmentación de Datos) to select a) Active vs Cancelled firms, b) one or more company names, and c) one or more Industries.

Notes:

To compute ROA, ROABIT and ATO, use Total Assets at the beginning of the year, that is the same as Total Assets at the end of the previous year (use Total Assets at t-1), as it is stated in the Note about the basics of Financial Analysis.

For all average ratios, remember that you have to use calculated fields.

Remember, it’s usually NOT correct to calculate arithmetic averages for ratios or percentages!

In the same Dashboard design and create a graph that illustrate the behavior of ROABIT over the years

2.2 Financial Leverage Dashboard

This dashboard has to have the following ratios:

Average debt-to-equity ratio and Average debt-to-asset ratio per year. You have to include the same SLICERS you used for the DuPont Dashboard.

Before doing the Dashboards, you have to calculate some variables that you might need. You also have to add the firm name, status and industry as a new columns to the dataset using the “buscarv” (vlookup) function.

With these Dashboards you will be able to analyze the Dupont and financial leverage ratios by Industry for the US public firms over time. Now you can have a better benchmark for analyzing financial ratios for other companies.

Once you have your 2 Dashboards, copy and paste with values in the corresponding Sheets and you will receive feedback and grading of your results.

You have to submit your Excel file with all your work through Canvas

You have to submit your Excel file through Canvas