Workshop 2 - Data Structure for Accounting
1 Data Structures for Business Analytics
Before working with real financial datasets, it is essential to understand how data is organized and stored. In this section we review two fundamental concepts: (1) common data formats — wide vs. long — and (2) the Star Database model, the standard structure used in business intelligence and data analytics.
1.1 Common Data Formats: Wide vs. Long (Panel Data)
1.1.1 Wide Format
In a wide format table, each entity (e.g., a firm) occupies a single row, and the measurements for different time periods or categories appear as separate columns. This layout feels natural because it resembles a traditional accounting spreadsheet: you can read all of a firm’s information across a single row.
Example — Revenue (in millions USD) for three firms over four years (Wide Format):
| Firm | Rev_2020 | Rev_2021 | Rev_2022 | Rev_2023 |
|---|---|---|---|---|
| Apple | 274,515 | 365,817 | 394,328 | 383,285 |
| Microsoft | 143,015 | 168,088 | 198,270 | 211,915 |
| Tesla | 31,536 | 53,823 | 81,462 | 96,773 |
Advantages of wide format:
- Easy to read visually — one row per entity.
- Familiar to spreadsheet users.
- Convenient for simple cross-sectional calculations (e.g., comparing firms in a single year).
Disadvantages of wide format:
- Adding a new year requires adding a new column, which breaks the structure.
- It is difficult to filter, aggregate, or summarize by time period using analytic tools.
- Incompatible with most Business Intelligence (BI) tools such as Power BI, Tableau, or Excel Pivot Tables.
1.1.2 Long Format (Panel Data)
In a long format (also called panel data or tidy data) table, each row represents one observation for one entity at one point in time. Instead of spreading time periods across columns, time becomes its own column. Every variable has exactly one column and every observation has exactly one row.
Example — Same revenue data restructured as Long Format (Panel Data):
| FirmID | FirmName | Year | Revenue |
|---|---|---|---|
| AAPL | Apple | 2020 | 274,515 |
| AAPL | Apple | 2021 | 365,817 |
| AAPL | Apple | 2022 | 394,328 |
| AAPL | Apple | 2023 | 383,285 |
| MSFT | Microsoft | 2020 | 143,015 |
| MSFT | Microsoft | 2021 | 168,088 |
| MSFT | Microsoft | 2022 | 198,270 |
| MSFT | Microsoft | 2023 | 211,915 |
| TSLA | Tesla | 2020 | 31,536 |
| TSLA | Tesla | 2021 | 53,823 |
| TSLA | Tesla | 2022 | 81,462 |
| TSLA | Tesla | 2023 | 96,773 |
Notice that the same 12 data points are present in both tables. The difference is purely structural. In long format:
- FirmID / FirmName identify the entity.
- Year is an explicit column (not a column name).
- Revenue is a single column regardless of how many years exist.
Advantages of long format:
- Easily extendable — new years just add new rows, not new columns.
- Required by most analytics and BI tools for filtering, grouping, and summarization.
- The natural format for statistical models and machine learning.
- Supports multiple variables cleanly: just add columns for COGS, NetIncome, TotalAssets, etc.
Disadvantages of long format:
- More rows — a dataset with 3,600 firms × 24 years has over 86,000 rows.
- Less intuitive at first glance for users accustomed to spreadsheets.
Key takeaway: The dataset you will work with in this workshop (findata) is already in long format. Each row is one firm-year observation. This is the standard structure required before building Dashboards and conducting any serious data analysis.
1.1.3 Visual Comparison: Wide → Long
The diagram below illustrates the transformation from wide to long format conceptually:
WIDE FORMAT LONG FORMAT ───────────────────────────── ────────────────────────────────── Firm Rev_2020 Rev_2021 Firm Year Revenue ───────────────────────────── ────────────────────────────────── Apple 274,515 365,817 → Apple 2020 274,515 Tesla 31,536 53,823 Apple 2021 365,817 Tesla 2020 31,536 Tesla 2021 53,823 ───────────────────────────── ────────────────────────────────── 2 rows × 3 columns 4 rows × 3 columns ***
This transformation is called pivoting (wide → long is an unpivot, or melt). In Excel this can be done with Power Query. In R it is done with tidyr::pivot_longer().
1.2 The Star Database Model
When organizations need to analyze large volumes of historical data — such as financial statements for thousands of firms over many years — they often organize the data using a Star Database Model (also called a Star Schema). This is the foundational model of data warehouses and Business Intelligence systems.
The name “star” comes from its shape: a central Fact Table is surrounded by several Dimension Tables, forming a star-like diagram.
┌─────────────────┐
│ DIM_TIME │
│─────────────────│
│ YearID (PK) │
│ Year │
│ Quarter │
└────────┬────────┘
│
┌─────────────┐ │ ┌──────────────────┐ │ DIM_FIRM │ │ │ DIM_INDUSTRY │ │─────────────│ │ │──────────────────│ │ FirmID (PK) │ │ │ IndustryID (PK) │ │ FirmName │ │ │ IndustryName │ │ Ticker ├─────┼─────┤ Sector │ │ Country │ │ └──────────────────┘ │ Exchange │ │ └──────┬──────┘ │ │ │ │ ┌─────────────────────────────┐ │ │ FACT_FINANCIALS │ └───│─────────────────────────────│ │ FirmID (FK → DIM_FIRM) │ │ YearID (FK → DIM_TIME) │ │ IndustryID(FK → DIM_IND.) │ │ Revenue │ │ COGS │ │ SGAE │ │ Depreciation │ │ TotalAssets │ │ TotalLiabilities │ │ ShareholderEquity │ └─────────────────────────────┘ ***
1.2.1 Fact Tables
A Fact Table is the core of the star schema. It stores the quantitative measurements (facts) of business events or transactions. Each row in a Fact Table represents one observation — for example, the financial results of one firm for one fiscal year.
Characteristics of a Fact Table:
- Contains numeric measures (revenue, costs, assets, liabilities, etc.).
- Contains foreign keys that link to each Dimension Table.
- Is typically in long format — one row per unique combination of dimensions.
- Tends to be large (millions of rows in enterprise systems).
Example — FACT_FINANCIALS (excerpt, values in thousands USD):
| FirmID | YearID | IndustryID | Revenue | COGS | TotalAssets |
|---|---|---|---|---|---|
| AAPL | 2023 | TECH | 383,285 | 223,546 | 352,583 |
| MSFT | 2023 | TECH | 211,915 | 65,863 | 411,976 |
| TSLA | 2023 | AUTO | 96,773 | 79,113 | 93,941 |
| AAPL | 2022 | TECH | 394,328 | 223,546 | 352,755 |
| MSFT | 2022 | TECH | 198,270 | 62,650 | 364,840 |
1.2.2 Dimension Tables
Dimension Tables store the descriptive attributes that give context to the facts. They answer the questions: Who? What? When? Where? They are usually much smaller than the Fact Table and contain a Primary Key (PK) that is referenced by the Fact Table as a Foreign Key (FK).
DIM_FIRM — Firm dimension (catalog of companies):
| FirmID | FirmName | Ticker | Country | Exchange | Status |
|---|---|---|---|---|---|
| AAPL | Apple Inc. | AAPL | USA | NASDAQ | Active |
| MSFT | Microsoft | MSFT | USA | NASDAQ | Active |
| TSLA | Tesla Inc. | TSLA | USA | NASDAQ | Active |
DIM_TIME — Time dimension:
| YearID | Year | Decade | IsCrisisYear |
|---|---|---|---|
| 2020 | 2020 | 2020s | Yes |
| 2021 | 2021 | 2020s | No |
| 2022 | 2022 | 2020s | No |
| 2023 | 2023 | 2020s | No |
DIM_INDUSTRY — Industry dimension:
| IndustryID | IndustryName | Sector |
|---|---|---|
| TECH | Technology | Technology |
| AUTO | Automotive | Consumer |
| BANK | Commercial Banking | Finance |
| HLTH | Healthcare | Healthcare |
1.2.3 How the Star Model Works Together
The power of the Star Schema comes from joining the Fact Table with the Dimension Tables using the keys. For example, to answer “What was the average Revenue by Industry and Year for active US firms?”, a query would:
- Start with FACT_FINANCIALS (to get Revenue values).
- Join with DIM_FIRM (to filter by Country = USA and Status = Active).
- Join with DIM_INDUSTRY (to group by IndustryName).
- Join with DIM_TIME (to group by Year).
In Excel, this is essentially what you do when you use VLOOKUP / BUSCARV to bring firm names and industry labels from a catalog sheet into your data sheet — and then use a Pivot Table with Slicers to summarize the results. The findata sheet acts as your Fact Table and the usfirm sheet acts as a Dimension Table!
1.2.4 Summary: Key Concepts
| Concept | Description | Analogy in Excel |
|---|---|---|
| Wide Format | One row per entity; periods as columns | Traditional accounting table |
| Long Format | One row per entity-period; periods as a column value | The findata dataset |
| Fact Table | Stores numeric measures; one row per observation | findata sheet |
| Dimension Table | Stores descriptive attributes; one row per entity | usfirm sheet |
| Foreign Key | A column in Fact Table that links to a Dimension | The ticker used in VLOOKUP |
| Star Schema | Fact Table surrounded by Dimension Tables | Data model behind Pivot Tables |
Remember: The dataset for this workshop was built following an ETL (Extraction, Transformation, Loading) process and is already structured as a long-format Fact Table, ready to be analyzed with Pivot Tables and Slicers — exactly the Star Schema approach in practice.
2 CHALLENGE - Business Analytics using real data
Last workshop we review the basics of financial statements, financial ratios and basic data analytic skills using data from a Multinational firm. In this workshop we continue practicing data analytical skills and financial ratio calculations, but we will use real historical data of US firms.
You have to work in Excel for this Exercise
Public firms are firms that issue shares to the public. In the US, you can get any historical financial statement of public US firms from [https://www.sec.gov/edgar/searchedgar/companysearch.html||the SEC EDGAR site]. This is the US Security and Exchange Commision (SEC), Company Filings Database. Among many other responsibilities, the SEC monitors and regulates the financial market (similar to the “Comisión Nacional Bancaria y de Valores” in Mexico).
In the SEC EDGAR (Electronic Data Gathering, Analysis, and Retrieval) database any body can find (for free) real financial statement and other corporate information of public US firms. For this exercise I created a dataset using Economatica, a leading financial software. An important data source of Economatica is SEC EDGAR.
I did some programming for data collection, data cleaning, and data management with financial statement variables of all US public firms from the NASDAQ and the New York exchange (around 3,600 firms). This information can be found in the SEC database in the 10-K reports, which are annual statements (Filings) of US firms.
Download the W2-template. This template has 3 Sheets:
• findata: it has real annual data of several financial statement variables for all US public from 2000 to 2023. This file is a dataset with a “panel data” structure, also called “long” format. This long format is required before we start creating Dashboards to summarize the data by different dimensions such as year, industry, firm, etc.
• usfirm: this is a catalog of all US public firms with general information such as company name, industry, status (active/cancelled), etc.
• data dictionary: dictionary that explains the meaning of each variable (column) of the findata dataset
The data is in thousands US dollars. I constructed this dataset following an “ETL” (Extraction, Transformation, Loading) process.
Your task is to use this integrated dataset to do the following Dashboards:
2.1 Dupont Dashboard
- Create new columns in the Data Sheet for EBIT, TotalRevenue, Net Income, Shareholder Equity, and PreviousTotalAssets (at t-1). Consider the following:
EBIT (Earnings Before Interest and Taxes) = revenue - cogs - sgae - depreciation
totalRevenue = revenue + otherincome
netIncome = EBIT + otherincome + extraordinaryitems - finexp - incometax
previousTotalAssets = totalassets at the beginning of the year = totalassets in the previous year (at t-1)
The “OTHERINCOME” column represents all revenue different than operational Sales (for example, financial income, extraordinary sales, etc). Then, to to compute EBIT and those ratios related with it, consider ONLY revenue (Not OtherIncome) as Sales. To compute Net Income, ATO, and those ratios related with it, consider Sales + OtherIncome as Total Revenue. For the ratios that end up with invalid result, leave empty spaces (use the conditional function IFERROR or SI.ERROR to do this).
- Create a Dashboard to see the following financial ratios BY YEAR:
Average ROA (AVGROA),
Average Profit Margin (AVGPM),
Average Turn Over (AVGATO),
Average Operating ROA (AVGROABIT),
Average Operating Profit Margin (AVGOPPM),
Average Operating Asset Turnover (AVGOPATO).
For the the Operating ratios (AVGROABIT AVGOPPM, AVGOPATO) use only revenue (without including Otherincome nor Extraordinaryitems) since other income is usually not part of daily regular operations of a company.
You have to include SLICERS (Segmentación de Datos) to select a) Active vs Cancelled firms, b) one or more company names, and c) one or more Industries.
Notes:
To compute ROA, ROABIT and ATO, use Total Assets at the beginning of the year, that is the same as Total Assets at the end of the previous year (use Total Assets at t-1), as it is stated in the Note about the basics of Financial Analysis.
For all average ratios, remember that you have to use calculated fields.
Remember, it’s usually NOT correct to calculate arithmetic averages for ratios or percentages!
- In the same Dashboard design and create a graph that illustrate the behavior of ROABIT over the years
2.2 Financial Leverage Dashboard
This dashboard has to have the following ratios:
- Average debt-to-equity ratio and Average debt-to-asset ratio per year. You have to include the same SLICERS you used for the DuPont Dashboard.
Before doing the Dashboards, you have to calculate some variables that you might need. You also have to add the firm name, status and industry as a new columns to the dataset using the “buscarv” (vlookup) function.
With these Dashboards you will be able to analyze the Dupont and financial leverage ratios by Industry for the US public firms over time. Now you can have a better benchmark for analyzing financial ratios for other companies.
Once you have your 2 Dashboards, copy and paste with values in the corresponding Sheets and you will receive feedback and grading of your results.
You have to submit your Excel file with all your work through Canvas
You have to submit your Excel file through Canvas