Workshop 2 - Data Structure for Accounting

Author

Alberto Dorantes, Ph.D.

Published

May 11, 2026

Abstract
In this workshop we continue practicing basic data analytic skills using real financial data

1 Data Structures for Business Analytics

Before working with real financial datasets, it is essential to understand how data is organized and stored. In this section we review two fundamental concepts: (1) common data formats — wide vs. long — and (2) the Star Database model, the standard structure used in business intelligence and data analytics.

1.1 Common Data Formats: Wide vs. Long (Panel Data)

1.1.1 Wide Format

In a wide format table, each entity (e.g., a firm) occupies a single row, and the measurements for different time periods or categories appear as separate columns. This layout feels natural because it resembles a traditional accounting spreadsheet: you can read all of a firm’s information across a single row.

Example — Revenue (in millions USD) for three firms over four years (Wide Format):

Wide Format — Revenue by firm and year
Firm Rev_2020 Rev_2021 Rev_2022 Rev_2023
Apple 274,515 365,817 394,328 383,285
Microsoft 143,015 168,088 198,270 211,915
Tesla 31,536 53,823 81,462 96,773

Advantages of wide format:

  • Easy to read visually — one row per entity.
  • Familiar to spreadsheet users.
  • Convenient for simple cross-sectional calculations (e.g., comparing firms in a single year).

Disadvantages of wide format:

  • Adding a new year requires adding a new column, which breaks the structure.
  • It is difficult to filter, aggregate, or summarize by time period using analytic tools.
  • Incompatible with most Business Intelligence (BI) tools such as Power BI, Tableau, or Excel Pivot Tables.

1.1.2 Long Format (Panel Data)

In a long format (also called panel data or tidy data) table, each row represents one observation for one entity at one point in time. Instead of spreading time periods across columns, time becomes its own column. Every variable has exactly one column and every observation has exactly one row.

Example — Same revenue data restructured as Long Format (Panel Data):

Long (Panel Data) Format — Revenue by firm and year
FirmID FirmName Year Revenue
AAPL Apple 2020 274,515
AAPL Apple 2021 365,817
AAPL Apple 2022 394,328
AAPL Apple 2023 383,285
MSFT Microsoft 2020 143,015
MSFT Microsoft 2021 168,088
MSFT Microsoft 2022 198,270
MSFT Microsoft 2023 211,915
TSLA Tesla 2020 31,536
TSLA Tesla 2021 53,823
TSLA Tesla 2022 81,462
TSLA Tesla 2023 96,773

Notice that the same 12 data points are present in both tables. The difference is purely structural. In long format:

  • FirmID / FirmName identify the entity.
  • Year is an explicit column (not a column name).
  • Revenue is a single column regardless of how many years exist.

Advantages of long format:

  • Easily extendable — new years just add new rows, not new columns.
  • Required by most analytics and BI tools for filtering, grouping, and summarization.
  • The natural format for statistical models and machine learning.
  • Supports multiple variables cleanly: just add columns for COGS, NetIncome, TotalAssets, etc.

Disadvantages of long format:

  • More rows — a dataset with 3,600 firms × 24 years has over 86,000 rows.
  • Less intuitive at first glance for users accustomed to spreadsheets.

Key takeaway: The dataset you will work with in this workshop (findata) is already in long format. Each row is one firm-year observation. This is the standard structure required before building Dashboards and conducting any serious data analysis.


1.1.3 Visual Comparison: Wide → Long

The diagram below illustrates the transformation from wide to long format conceptually:


WIDE FORMAT LONG FORMAT ───────────────────────────── ────────────────────────────────── Firm Rev_2020 Rev_2021 Firm Year Revenue ───────────────────────────── ────────────────────────────────── Apple 274,515 365,817 → Apple 2020 274,515 Tesla 31,536 53,823 Apple 2021 365,817 Tesla 2020 31,536 Tesla 2021 53,823 ───────────────────────────── ────────────────────────────────── 2 rows × 3 columns 4 rows × 3 columns ***

This transformation is called pivoting (wide → long is an unpivot, or melt). In Excel this can be done with Power Query. In R it is done with tidyr::pivot_longer().


1.2 The Star Database Model

When organizations need to analyze large volumes of historical data — such as financial statements for thousands of firms over many years — they often organize the data using a Star Database Model (also called a Star Schema). This is the foundational model of data warehouses and Business Intelligence systems.

The name “star” comes from its shape: a central Fact Table is surrounded by several Dimension Tables, forming a star-like diagram.


          ┌─────────────────┐
          │  DIM_TIME       │
          │─────────────────│
          │ YearID (PK)     │
          │ Year            │
          │ Quarter         │
          └────────┬────────┘
                   │

┌─────────────┐ │ ┌──────────────────┐ │ DIM_FIRM │ │ │ DIM_INDUSTRY │ │─────────────│ │ │──────────────────│ │ FirmID (PK) │ │ │ IndustryID (PK) │ │ FirmName │ │ │ IndustryName │ │ Ticker ├─────┼─────┤ Sector │ │ Country │ │ └──────────────────┘ │ Exchange │ │ └──────┬──────┘ │ │ │ │ ┌─────────────────────────────┐ │ │ FACT_FINANCIALS │ └───│─────────────────────────────│ │ FirmID (FK → DIM_FIRM) │ │ YearID (FK → DIM_TIME) │ │ IndustryID(FK → DIM_IND.) │ │ Revenue │ │ COGS │ │ SGAE │ │ Depreciation │ │ TotalAssets │ │ TotalLiabilities │ │ ShareholderEquity │ └─────────────────────────────┘ ***

1.2.1 Fact Tables

A Fact Table is the core of the star schema. It stores the quantitative measurements (facts) of business events or transactions. Each row in a Fact Table represents one observation — for example, the financial results of one firm for one fiscal year.

Characteristics of a Fact Table:

  • Contains numeric measures (revenue, costs, assets, liabilities, etc.).
  • Contains foreign keys that link to each Dimension Table.
  • Is typically in long format — one row per unique combination of dimensions.
  • Tends to be large (millions of rows in enterprise systems).

Example — FACT_FINANCIALS (excerpt, values in thousands USD):

Fact Table — FACT_FINANCIALS (excerpt)
FirmID YearID IndustryID Revenue COGS TotalAssets
AAPL 2023 TECH 383,285 223,546 352,583
MSFT 2023 TECH 211,915 65,863 411,976
TSLA 2023 AUTO 96,773 79,113 93,941
AAPL 2022 TECH 394,328 223,546 352,755
MSFT 2022 TECH 198,270 62,650 364,840

1.2.2 Dimension Tables

Dimension Tables store the descriptive attributes that give context to the facts. They answer the questions: Who? What? When? Where? They are usually much smaller than the Fact Table and contain a Primary Key (PK) that is referenced by the Fact Table as a Foreign Key (FK).

DIM_FIRM — Firm dimension (catalog of companies):

Dimension Table — DIM_FIRM
FirmID FirmName Ticker Country Exchange Status
AAPL Apple Inc. AAPL USA NASDAQ Active
MSFT Microsoft MSFT USA NASDAQ Active
TSLA Tesla Inc. TSLA USA NASDAQ Active

DIM_TIME — Time dimension:

Dimension Table — DIM_TIME
YearID Year Decade IsCrisisYear
2020 2020 2020s Yes
2021 2021 2020s No
2022 2022 2020s No
2023 2023 2020s No

DIM_INDUSTRY — Industry dimension:

Dimension Table — DIM_INDUSTRY
IndustryID IndustryName Sector
TECH Technology Technology
AUTO Automotive Consumer
BANK Commercial Banking Finance
HLTH Healthcare Healthcare

1.2.3 How the Star Model Works Together

The power of the Star Schema comes from joining the Fact Table with the Dimension Tables using the keys. For example, to answer “What was the average Revenue by Industry and Year for active US firms?”, a query would:

  1. Start with FACT_FINANCIALS (to get Revenue values).
  2. Join with DIM_FIRM (to filter by Country = USA and Status = Active).
  3. Join with DIM_INDUSTRY (to group by IndustryName).
  4. Join with DIM_TIME (to group by Year).

In Excel, this is essentially what you do when you use VLOOKUP / BUSCARV to bring firm names and industry labels from a catalog sheet into your data sheet — and then use a Pivot Table with Slicers to summarize the results. The findata sheet acts as your Fact Table and the usfirm sheet acts as a Dimension Table!


1.2.4 Summary: Key Concepts

Summary of key data structure concepts
Concept Description Analogy in Excel
Wide Format One row per entity; periods as columns Traditional accounting table
Long Format One row per entity-period; periods as a column value The findata dataset
Fact Table Stores numeric measures; one row per observation findata sheet
Dimension Table Stores descriptive attributes; one row per entity usfirm sheet
Foreign Key A column in Fact Table that links to a Dimension The ticker used in VLOOKUP
Star Schema Fact Table surrounded by Dimension Tables Data model behind Pivot Tables

Remember: The dataset for this workshop was built following an ETL (Extraction, Transformation, Loading) process and is already structured as a long-format Fact Table, ready to be analyzed with Pivot Tables and Slicers — exactly the Star Schema approach in practice.


2 CHALLENGE - Business Analytics using real data

Last workshop we review the basics of financial statements, financial ratios and basic data analytic skills using data from a Multinational firm. In this workshop we continue practicing data analytical skills and financial ratio calculations, but we will use real historical data of US firms.

You have to work in Excel for this Exercise

Public firms are firms that issue shares to the public. In the US, you can get any historical financial statement of public US firms from [https://www.sec.gov/edgar/searchedgar/companysearch.html||the SEC EDGAR site]. This is the US Security and Exchange Commision (SEC), Company Filings Database. Among many other responsibilities, the SEC monitors and regulates the financial market (similar to the “Comisión Nacional Bancaria y de Valores” in Mexico).

In the SEC EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database any body can find (for free) real financial statement and other corporate information of public US firms. For this exercise I created a dataset using Economatica, a leading financial software. An important data source of Economatica is SEC EDGAR.

I did some programming for data collection, data cleaning, and data management with financial statement variables of all US public firms from the NASDAQ and the New York exchange (around 3,600 firms). This information can be found in the SEC database in the 10-K reports, which are annual statements (Filings) of US firms.

Download the W2-template. This template has 3 Sheets:

• findata: it has real annual data of several financial statement variables for all US public from 2000 to 2023. This file is a dataset with a “panel data” structure, also called “long” format. This long format is required before we start creating Dashboards to summarize the data by different dimensions such as year, industry, firm, etc.

• usfirm: this is a catalog of all US public firms with general information such as company name, industry, status (active/cancelled), etc.

• data dictionary: dictionary that explains the meaning of each variable (column) of the findata dataset

The data is in thousands US dollars. I constructed this dataset following an “ETL” (Extraction, Transformation, Loading) process.

Your task is to use this integrated dataset to do the following Dashboards:

2.1 Dupont Dashboard

  1. Create new columns in the Data Sheet for EBIT, TotalRevenue, Net Income, Shareholder Equity, and PreviousTotalAssets (at t-1). Consider the following:

EBIT (Earnings Before Interest and Taxes) = revenue - cogs - sgae - depreciation

totalRevenue = revenue + otherincome

netIncome = EBIT + otherincome + extraordinaryitems - finexp - incometax

previousTotalAssets = totalassets at the beginning of the year = totalassets in the previous year (at t-1)

The “OTHERINCOME” column represents all revenue different than operational Sales (for example, financial income, extraordinary sales, etc). Then, to to compute EBIT and those ratios related with it, consider ONLY revenue (Not OtherIncome) as Sales. To compute Net Income, ATO, and those ratios related with it, consider Sales + OtherIncome as Total Revenue. For the ratios that end up with invalid result, leave empty spaces (use the conditional function IFERROR or SI.ERROR to do this).

  1. Create a Dashboard to see the following financial ratios BY YEAR:

Average ROA (AVGROA),

Average Profit Margin (AVGPM),

Average Turn Over (AVGATO),

Average Operating ROA (AVGROABIT),

Average Operating Profit Margin (AVGOPPM),

Average Operating Asset Turnover (AVGOPATO).

For the the Operating ratios (AVGROABIT AVGOPPM, AVGOPATO) use only revenue (without including Otherincome nor Extraordinaryitems) since other income is usually not part of daily regular operations of a company.

You have to include SLICERS (Segmentación de Datos) to select a) Active vs Cancelled firms, b) one or more company names, and c) one or more Industries.

Notes:

To compute ROA, ROABIT and ATO, use Total Assets at the beginning of the year, that is the same as Total Assets at the end of the previous year (use Total Assets at t-1), as it is stated in the Note about the basics of Financial Analysis.

For all average ratios, remember that you have to use calculated fields.

Remember, it’s usually NOT correct to calculate arithmetic averages for ratios or percentages!

  1. In the same Dashboard design and create a graph that illustrate the behavior of ROABIT over the years

2.2 Financial Leverage Dashboard

This dashboard has to have the following ratios:

  1. Average debt-to-equity ratio and Average debt-to-asset ratio per year. You have to include the same SLICERS you used for the DuPont Dashboard.

Before doing the Dashboards, you have to calculate some variables that you might need. You also have to add the firm name, status and industry as a new columns to the dataset using the “buscarv” (vlookup) function.

With these Dashboards you will be able to analyze the Dupont and financial leverage ratios by Industry for the US public firms over time. Now you can have a better benchmark for analyzing financial ratios for other companies.

Once you have your 2 Dashboards, copy and paste with values in the corresponding Sheets and you will receive feedback and grading of your results.

You have to submit your Excel file with all your work through Canvas

You have to submit your Excel file through Canvas