Online analytical processing is a very common way to analyze raw transaction data by aggregating along different combinations of dimensions.
In a typical setting…
Each fact table contains foreign keys that references the primary key of multiple dimension tables. In the most simple form, it is called a STAR schema.
Dimension tables can contain foreign keys that references other dimensional tables. This provides a sophisticated detail breakdown of the contextual aspects. This is also called a SNOWFLAKE schema.
Also this is not a hard rule, Fact table tends to be independent of other Fact table and usually doesn’t contain reference pointer among each other.
However, different Fact table usually share the same set of dimension tables. This is also called GALAXY schema. But it is a hard rule that Dimension table NEVER points / references Fact table
A simple STAR schema is shown in the following diagram.
Each dimension can also be hierarchical so that the analysis can be done at different degree of granularity.
For example, the time dimension can be broken down into days, weeks, months, quarter and annual;
Similarly, location dimension can be broken down into countries, regions, cities … etc.
Here we first create a sales fact table that records each sales transaction.
# Setup the dimension tables
state_table <- data.frame(
key=c(
"LM", "CA", "AR",
"SD", "SA"
),
name=c(
"Lima", "Callao", "Arequipa",
"Santo Domingo", "Santiago"),
country=c(
"Peru", "Peru", "Peru",
"Dominicana", "Dominicana"
)
)
month_table <- data.frame(
key=1:12,
desc=c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
),
quarter=c(
"Q1","Q1","Q1","Q2","Q2","Q2",
"Q3","Q3","Q3","Q4","Q4","Q4"
)
)
# Function to generate the Sales table
gen_sales <- function(no_of_recs) { # Generate transaction data randomly
prod_table <- data.frame(
key=c("Cool", "Kurios", "Energina"),
price=c(225, 570, 1120),
stringsAsFactors=T
)
loc <- sample(state_table$key, no_of_recs, replace=T, prob=c(2,2,1,1,1))
time_month <- sample(month_table$key, no_of_recs, replace=T)
time_year <- sample(c(2023, 2024), no_of_recs, replace=T)
prod <- sample(prod_table$key, no_of_recs, replace=T, prob=c(1, 3, 2))
unit <- sample(c(1,2), no_of_recs, replace=T, prob=c(10, 3))
amount <- unit*prod_table[match(prod, prod_table$key),]$price
sales <- data.frame(
month=time_month,
year=time_year,
loc=loc,
prod=prod,
unit=unit,
amount=amount
) # Sort the records by time order
sales <- sales[order(sales$year, sales$month),]
row.names(sales) <- NULL
return(sales)
}
# Now create the sales fact table
sales_fact <- gen_sales(5000)
sales_fact %>%
format.dt.f(.)
Now, we turn this fact table into a hypercube with multiple dimensions. Each cell in the cube represents an aggregate value for a unique combination of each dimension.
Build up a cube
revenue_cube <- tapply(
sales_fact$amount,
sales_fact[,c("prod", "month", "year", "loc")],
FUN=function(x){return(sum(x))}
)
revenue_cube[1:2,1:12,1,1:2]
, , loc = AR
month
prod 1 2 3 4 5 6 7 8 9 10 11 12
Cool 1575 900 1125 1125 2700 2475 2250 675 450 675 675 675
Energina 7840 11200 14560 7840 12320 14560 19040 7840 19040 11200 25760 12320
, , loc = CA
month
prod 1 2 3 4 5 6 7 8 9 10 11
Cool 2925 2025 3600 2700 1575 2925 2700 2925 2025 1575 2925
Energina 28000 23520 20160 22400 23520 30240 12320 19040 17920 40320 30240
month
prod 12
Cool 1800
Energina 25760
dimnames(revenue_cube) %>%
listviewer::jsonedit()
Slice, Dice, Rollup, Drilldown, Pivot
Fixing certain dimensions to analyze the remaining dimensions.
# sales happening in "2023", "Jan"
revenue_cube[, "1", "2023",]
loc
prod AR CA LM SA SD
Cool 1575 2925 2250 1350 3375
Energina 7840 28000 23520 15680 22400
Kurios 11970 25080 16530 11400 14820
# sales happening in "2023", "Jan", "Kurios".
revenue_cube["Kurios", "1", "2023",]
AR CA LM SA SD
11970 25080 16530 11400 14820
Limited each dimension to a certain range of values, while keeping the number of dimensions the same in the resulting cube.
# sales happening in [Jan/Feb/Mar, Energina/Kurios, LM]
revenue_cube[c("Kurios","Energina"), c("1","2","3"), 1, c("LM")]
month
prod 1 2 3
Kurios 16530 23940 16530
Energina 23520 26880 30240
Applying an aggregation function to collapse a number of dimensions.
# annual revenue for each product and collapse the location dimension
apply(
revenue_cube,
c("year", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))}
)
prod
year Cool Energina Kurios
2023 112950 1113280 913710
2024 120150 1090880 894900
Reverse of “rollup” and applying an aggregation function to a finer level of granularity.
# annual and monthly revenue for each product and
# collapse the location dimension
apply( # we don't care where we sold our product
revenue_cube,
c("year", "month", "prod"),
FUN=function(x) {return(sum(x, na.rm=TRUE))}
)[,,1]
month
year 1 2 3 4 5 6 7 8 9 10 11 12
2023 11475 10350 8550 9450 11475 12375 8325 9225 8775 7650 8325 6975
2024 8325 9900 6975 9450 9900 7425 11700 10575 11700 10800 12375 11025
About analyzing the combination of a pair of selected dimensions.
# analyze the revenue by year and month
apply(
revenue_cube,
c("year", "month"),
FUN=function(x) {return(sum(x, na.rm=TRUE))}
)[,1:6]
month
year 1 2 3 4 5 6
2023 188715 169710 158060 161890 161475 185585
2024 178075 170910 167885 177870 168570 185505
# analyze the revenue by product and location
apply(
revenue_cube,
c("prod", "loc"),
FUN=function(x) {return(sum(x, na.rm=TRUE))}
)
loc
prod AR CA LM SA SD
Cool 33300 60750 64350 34425 40275
Energina 316960 610400 647360 316960 312480
Kurios 263340 506730 520410 260490 257640