Online analytical processing is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions.

1 Facts and Dimensions

1.1 Types of tables

  • Fact table contains records each describe an instance of a transaction.
    • Each transaction records contains categorical attributes (which describes contextual aspects of the transaction, such as space, time, user)
    • as well as numeric attributes (called “measures” which describes quantitative aspects of the transaction, such as no of items sold, dollar amount).
  • Dimension table contain records that further elaborates the contextual attributes, such as user profile data, location details … etc.

1.2 Multi-dimensional model

In a typical setting…

  • Each fact table contains foreign keys that references the primary key of multiple dimension tables. In the most simple form, it is called a STAR schema.

  • Dimension tables can contain foreign keys that references other dimensional tables. This provides a sophisticated detail breakdown of the contextual aspects. This is also called a SNOWFLAKE schema.

  • Also this is not a hard rule, Fact table tends to be independent of other Fact table and usually doesn’t contain reference pointer among each other.

  • However, different Fact table usually share the same set of dimension tables. This is also called GALAXY schema. But it is a hard rule that Dimension table NEVER points / references Fact table

1.3 STAR schema

A simple STAR schema is shown in the following diagram.


  • Each dimension can also be hierarchical so that the analysis can be done at different degree of granularity.

  • For example, the time dimension can be broken down into days, weeks, months, quarter and annual;

  • Similarly, location dimension can be broken down into countries, regions, cities … etc.

1.4 Sales fact table

Here we first create a sales fact table that records each sales transaction.

# Setup the dimension tables
state_table <- data.frame(
  key=c(
    "LM", "CA", "AR", 
    "SD", "SA"
  ), 
  name=c(
    "Lima", "Callao", "Arequipa", 
    "Santo Domingo", "Santiago"), 
  country=c(
    "Peru", "Peru", "Peru", 
    "Dominicana", "Dominicana"
  )
)

month_table <- data.frame(
  key=1:12, 
  desc=c(
    "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
  ), 
  quarter=c(
    "Q1","Q1","Q1","Q2","Q2","Q2",
    "Q3","Q3","Q3","Q4","Q4","Q4"
  )
)

# Function to generate the Sales table
gen_sales <- function(no_of_recs) { # Generate transaction data randomly
    prod_table <- data.frame(
      key=c("Cool", "Kurios", "Energina"), 
      price=c(225, 570, 1120),
      stringsAsFactors=T
    )
    loc <- sample(state_table$key, no_of_recs, replace=T, prob=c(2,2,1,1,1))
    time_month <- sample(month_table$key, no_of_recs, replace=T)
    time_year <- sample(c(2023, 2024), no_of_recs, replace=T)
    prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
    unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
    amount <- unit*prod_table[match(prod, prod_table$key),]$price
    sales <- data.frame(
      month=time_month, 
      year=time_year, 
      loc=loc, 
      prod=prod, 
      unit=unit, 
      amount=amount
    ) # Sort the records by time order
    sales <- sales[order(sales$year, sales$month),]
    row.names(sales) <- NULL
    return(sales)
}

# Now create the sales fact table
sales_fact <- gen_sales(5000)
sales_fact %>%
  format.dt.f(.)

2 Multi-dimensional Cube

2.1 Cube

Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension.


Build up a cube

revenue_cube <- tapply(
  sales_fact$amount, 
  sales_fact[,c("prod", "month", "year", "loc")], 
  FUN=function(x){return(sum(x))}
)

revenue_cube[1:2,1:12,1,1:2]
, , loc = AR

          month
prod          1     2     3    4     5     6     7    8     9    10    11    12
  Cool     1575   900  1125 1125  2700  2475  2250  675   450   675   675   675
  Energina 7840 11200 14560 7840 12320 14560 19040 7840 19040 11200 25760 12320

, , loc = CA

          month
prod           1     2     3     4     5     6     7     8     9    10    11
  Cool      2925  2025  3600  2700  1575  2925  2700  2925  2025  1575  2925
  Energina 28000 23520 20160 22400 23520 30240 12320 19040 17920 40320 30240
          month
prod          12
  Cool      1800
  Energina 25760

dimnames(revenue_cube) %>%
  listviewer::jsonedit()

3 OLAP Operations

3.1 Common operations

Slice, Dice, Rollup, Drilldown, Pivot

3.2 Slice

Fixing certain dimensions to analyze the remaining dimensions.

# sales happening in "2023", "Jan"
revenue_cube[, "1", "2023",]
          loc
prod          AR    CA    LM    SA    SD
  Cool      1575  2925  2250  1350  3375
  Energina  7840 28000 23520 15680 22400
  Kurios   11970 25080 16530 11400 14820
# sales happening in "2023", "Jan", "Kurios".
revenue_cube["Kurios", "1", "2023",]
   AR    CA    LM    SA    SD 
11970 25080 16530 11400 14820 

3.3 Dice

Limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube.

# sales happening in [Jan/Feb/Mar, Energina/Kurios, LM]
revenue_cube[c("Kurios","Energina"), c("1","2","3"), 1, c("LM")]
          month
prod           1     2     3
  Kurios   16530 23940 16530
  Energina 23520 26880 30240

3.4 Rollup

Applying an aggregation function to collapse a number of dimensions.

# annual revenue for each product and collapse the location dimension
apply(
  revenue_cube, 
  c("year", "prod"), 
  FUN=function(x) {return(sum(x, na.rm=TRUE))}
)
      prod
year     Cool Energina Kurios
  2023 112950  1113280 913710
  2024 120150  1090880 894900

3.5 Drilldown

Reverse of “rollup” and applying an aggregation function to a finer level of granularity.

# annual and monthly revenue for each product and 
# collapse the location dimension 
apply( # we don't care where we sold our product
  revenue_cube, 
  c("year", "month", "prod"), 
  FUN=function(x) {return(sum(x, na.rm=TRUE))}
)[,,1]
      month
year       1     2    3    4     5     6     7     8     9    10    11    12
  2023 11475 10350 8550 9450 11475 12375  8325  9225  8775  7650  8325  6975
  2024  8325  9900 6975 9450  9900  7425 11700 10575 11700 10800 12375 11025

3.6 Pivot

About analyzing the combination of a pair of selected dimensions.

# analyze the revenue by year and month
apply(
  revenue_cube, 
  c("year", "month"), 
  FUN=function(x) {return(sum(x, na.rm=TRUE))}
)[,1:6]
      month
year        1      2      3      4      5      6
  2023 188715 169710 158060 161890 161475 185585
  2024 178075 170910 167885 177870 168570 185505

# analyze the revenue by product and location
apply(
  revenue_cube, 
  c("prod", "loc"), 
  FUN=function(x) {return(sum(x, na.rm=TRUE))}
)
          loc
prod           AR     CA     LM     SA     SD
  Cool      33300  60750  64350  34425  40275
  Energina 316960 610400 647360 316960 312480
  Kurios   263340 506730 520410 260490 257640