A cube is a trulymultidimensional data structure for capturing and analyzing data.A cube generalizes the tabular spreadsheet such that there can be any number of dimensions (and not only two as in spreadsheets).
A cube consists of uniquely identifiable cells at each of the dimensions’ intersections. A nonempty cell is called a fact
Dimensions are used for two purposes: the selection of data and the grouping of data at a desired level of detail.
A dimension is organized into a containment-like hierarchy composed of a number of levels, each of which represents a level of detail that is of interest to analyses to be performed.
Facts are the objects that represent the subjects of the desired analyses, i.e., the interesting “things” or events or processes, that are to be analyzed to better understand their behavior.
The facts are implicitly defined by their combination of dimension values. If a non-empty cell exists for a particular combination, a fact exists; otherwise, no fact exists for that combination.
A measure has two components: a numerical property of a fact, e.g., the sales price or profit, and a formula (most often a simple aggregation function such as SUM) that can be used to combine several measure values into one. In amultidimensional database, measures generally represent the properties of the chosen facts that the users want to study,
A star schema has one dimension table for each dimension. This table has a key column and one column for each level of the dimension
A snowflake schema has a fact table, as does a star schema. Snowflake schemas, however, contain several dimension tables for each dimension, namely one table for each (non-) level.
Data warehouse as a subject oriented, integrated, time variant, non-volatile collection of data in support of management’s decision making process
1.1. A) Facts:nrofevents, totalparticipants. Dimensions: dayofweek, year, location.
1.2. A) Facts: nrofregistrations. Dimensions: instID, institute, cityID, city, province, study, phase, year, sex
Use the data spreadsheet BI Raw Data.csv. The encoding of this file is “ISO-8859-1”.
A large warehouse located in the US has many customers for its different product from around the world. The warehouse manager Mr Jack Bezos has many large customers from over the world ordering different products from his warehouse. We follow the method of Section 1.1.1 where all steps are more or less already given (and demoed at the lecture).
Step 1: Business questions Mr. Bezos wants to know answers for the following questions • Who are his top-5 most valued customers? • What are his top-5 most important products?
** Step 2a: Multidimensional model Notice that Mr. Bezos wants to know something about customers and products. These are dimensions. Time usually also is a dimension if you are interested in trends, so we also include the dimension ‘orderdate’. What is it he wants to know? It is ‘most valued’ and ‘most important’, but what does this mean concretely? How can we measure value and importance? Obviously in terms of money: the most valued customer is the one who bought for the most money, and the most important product is the one which was sold for the most money. Therefore, our fact is ‘amount’ (of money), i.e., ‘sales’.
Step 2b: Create tables in your database The database consists of three tables: • ‘customer’ with attributes customerid:integer (primary key, NOT NULL), name:character varying( 100), country:character varying(100) • ‘product’ with attributes productid: integer (primary key, NOT NULL), name:character varying(100), category:character varying(100) • ‘sales‘ with attributes orderdate:date (primary key, NOT NULL), customerid:integer (primary key, NOT NULL), productid: integer (primary key, NOT NULL), sales:double precision
Note: The orderid is not a key! A customer can order more than one product in one order which results in several rows for one order. Furthermore, in its purest form, a cube does not store information on individual transactions or cases, but only aggregated data for each combination of dimension values. In the case of Mr Bezos, the cube does not store individual orders, but the sales (fact) for all combinations of the dimensions orderdate, customer, and product. In other words, the orderdate, customerid, and productid together are the primary key!
Step 3: ETL — Prepare data and fill the database
1º Passo: Abrir os pacotes
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.0.5
library(DBI)
## Warning: package 'DBI' was built under R version 4.0.5
library(RPostgreSQL)
## Warning: package 'RPostgreSQL' was built under R version 4.0.5
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.0.5
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
2º Passo: Then, we load the data by using the command “read delim“ and put it into the object “data0“
data0 <- read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/BI_Raw_data.csv",
delim = ";", col_names = TRUE, col_types = NULL,
locale = locale(encoding="ISO-8859-1"))
##
## -- Column specification --------------------------------------------------------
## cols(
## Order_ID = col_double(),
## Order_Date_Year = col_double(),
## Order_Date_Month = col_double(),
## Order_Date_Day = col_character(),
## Order_Price_Total = col_double(),
## Customer_Name = col_character(),
## Customer_Country = col_character(),
## Product_Name = col_character(),
## Product_Category = col_character(),
## Product_Order_Unit_Price = col_double(),
## Product_Order_Quantity = col_double(),
## Product_Order_Price_Total = col_double()
## )
3º Passo: Criar a tabela Product a partir da tabela principal data0
product <- data0 %>% #Seleciono a minha tabela principal de onde vão sair os dados
select(Product_Name, Product_Category) %>% #Seleciono as colunas relacionadas a produto
rename(name = Product_Name, category = Product_Category) %>% #Renomeio as minhas colunas selecionadas
group_by(name, category) %>% #Remover as linhas duplicadas das colunas selecionadas
distinct() %>%
ungroup() %>% #Tenho que desagrupar os dados
mutate(productid = row_number()) #Criar uma nova coluna. O productid pode ser utilizado como primary key da tabela (e que vai servir de link com a tabela principal)
4º Passo: Criar a tabela Customer a partir da tabela principal data0
customer <- data0 %>% #Seleciono a minha tabela principal de onde vão sair os dados
select(Customer_Name, Customer_Country) %>% #Seleciono as colunas relacionadas a produto
rename(name = Customer_Name, country = Customer_Country) %>% #Renomeio as minhas colunas selecionadas
group_by(name, country) %>% #Remover as linhas duplicadas das colunas selecionadas
distinct() %>%
ungroup() %>% #Tenho que desagrupar os dados
mutate(customerid = row_number()) #Criar uma nova coluna. O productid pode ser utilizado como primary key da tabela (e que vai servir de link com a tabela principal)
5º Passo: Criar a tabela sales com os atributos de interesse
sales <- data0 %>%
select(Order_Date_Day, Product_Order_Price_Total, Customer_Name, Customer_Country, Product_Name, Product_Category, Order_Price_Total)
6º Passo: Criar a tabela sales juntando a tabela de product
sales1 <- sales %>%
full_join(product, by = c("Product_Name" = "name", "Product_Category" = "category")) %>% #For the sales table you first need to select the columns from the original data that you need and then join the product and customer table to add the productid and customerid to the sales table
select( -Product_Name, -Product_Category) #drop columns that are redundant in the sales table.
7º Passo: Criar a tabela sales juntando a tabela customer
sales2 <- sales1 %>%
full_join(customer, by = c("Customer_Name" = "name", "Customer_Country" = "country")) %>% #For the sales table you first need to select the columns from the original data that you need and then join the product and customer table to add the productid and customerid to the sales table
select( -Customer_Name, -Customer_Country) #drop columns that are redundant in the sales table.
8º Passo: Having now constructed all data for the dimension and fact tables, you can now store them in the database. From R you can connect to the PostgreSQL database server with the functions dbDriver, dbConnect in packages DBI and RPostgreSQL. So, install these packages and load them first. The code to store the constructed R tables in the database is as follows (for schema ‘ass2’).
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, port = 5432, host = "bronto.ewi.utwente.nl",
dbname = "dab_ds21221a_33", user = "dab_ds21221a_33", password = "DctKLxeJOIzBjcz/",
options="-c search_path=ass2")
dbWriteTable(con, "product", value = product, overwrite = T, row.names = F)
## [1] TRUE
dbWriteTable(con, "customer", value = customer, overwrite = T, row.names = F)
## [1] TRUE
dbWriteTable(con, "sales", value = sales, overwrite = T, row.names = F)
## [1] TRUE
Customer and Sales
Product and Sales
9º Passo: Pegar as informações adquiridas nos passos anteriores
dbListTables(con)
## [1] "product_main" "customer_main" "returnstatus" "Sales"
## [5] "product" "customer" "sales" "product"
## [9] "customer" "sales"
str(dbReadTable(con,"customer"))
## 'data.frame': 89 obs. of 3 variables:
## $ name : chr "Simons bistro" "Richter Supermarkt" "Bon app'" "Rattlesnake Canyon Grocery" ...
## $ country : chr "Denmark" "Switzerland" "France" "USA" ...
## $ customerid: int 1 2 3 4 5 6 7 8 9 10 ...
str(dbReadTable(con,"product"))
## 'data.frame': 77 obs. of 3 variables:
## $ name : chr "Pavlova" "Chang" "Lakkalikööri" "Spegesild" ...
## $ category : chr "Confections" "Beverages" "Beverages" "Seafood" ...
## $ productid: int 1 2 3 4 5 6 7 8 9 10 ...
str(dbReadTable(con,"sales"))
## 'data.frame': 2155 obs. of 7 variables:
## $ Order_Date_Day : chr "10-9-2009" "10-9-2009" "10-9-2009" "10-9-2009" ...
## $ Product_Order_Price_Total: num 244 190 36 360 500 ...
## $ Customer_Name : chr "Simons bistro" "Richter Supermarkt" "Richter Supermarkt" "Richter Supermarkt" ...
## $ Customer_Country : chr "Denmark" "Switzerland" "Switzerland" "Switzerland" ...
## $ Product_Name : chr "Pavlova" "Chang" "Lakkalikööri" "Spegesild" ...
## $ Product_Category : chr "Confections" "Beverages" "Beverages" "Seafood" ...
## $ Order_Price_Total : num 244 586 586 586 1057 ...
Step 4: Visualize Now that we have our data ready in a form suitable for analysis, it is time to address the two business problems raised in the beginning. We create a dashboard with metrics to measure the required KPIs. We describe below how to do this for R and for Tableau. Other programming languages and tools have similar steps. Don’t use spreadsheet software for this. Depois criar um gráfico com ggplot
ggplot(sales2, aes(x=productid, y=Product_Order_Price_Total, color="yellow"))+
theme_classic()+
geom_point()+
ggtitle("Sales")
##Assignment 3: Do it yourself
Step 1: Business Question –> Star Squema foi fornecida
• Which products/product categories made the most loss? • Which products/product categories were shipped really late (more than 2 days)? • Which products/product categories were returned the most?
Database Squema
Step 2: Multidimensional model and database structure
1º Passo: Pegar o meu banco de dados principal
main <-read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/SuperstoreSales_main.csv",
delim = ";", col_names = TRUE, col_types = NULL,
locale = locale(encoding="ISO-8859-1"))
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_character(),
## `Row ID` = col_double(),
## `Order ID` = col_double(),
## `Order Quantity` = col_double(),
## Sales = col_number(),
## `Unit Price` = col_number()
## )
## i Use `spec()` for the full column specifications.
2º Passo: A partir do meu banco principal vou criar uma tabela para product_main
product_main <- main %>%
select(`Product Name`, `Product Category`, `Product Sub-Category`) %>%
rename(name = `Product Name`, category =`Product Category`, product_sub_category = `Product Sub-Category`) %>%
group_by(name, category, product_sub_category) %>%
distinct() %>%
ungroup() %>%
mutate(productid = row_number())
3º Passo: A partir do meu banco principal vou criar uma tabela para customer_main
customer_main <- main %>%
select(`Customer Name`, Province, Region, `Customer Segment`) %>%
rename(Name = `Customer Name`, Segment = `Customer Segment`) %>%
arrange(Name, Province, Region, Segment) %>%
group_by(Name, Province, Region, Segment) %>%
distinct() %>%
ungroup() %>%
mutate(customerid = row_number())
4º Passo: Criar a tabela returns
returns <-read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/SuperstoreSales_returns.csv",
delim = ";", col_names = TRUE, col_types = NULL,
locale = locale(encoding="ISO-8859-1"))
##
## -- Column specification --------------------------------------------------------
## cols(
## `Order ID` = col_double(),
## Status = col_character()
## )
5º Passo: Criar a tabela retunrstatusid
returnstatusid <- c(1,2)
returnvalue <- c('Returned','NotReturned')
returnstatus <- data.frame(returnstatusid, returnvalue)
6º Passo: Juntar a tabela main com a returns
main <- merge(main, returns, by = "Order ID", all.x = TRUE)
main$Status[is.na(main$Status)] <- "NotReturned"
7º Passo: Transformar as colunas order date e ship date de strings para formato de data
main$`Order Date` <- dmy(main$`Order Date`)
main$`Ship Date` <- dmy(main$`Ship Date`)
is.Date(main$`Order Date`)
## [1] TRUE
is.Date(main$`Ship Date`)
## [1] TRUE
8º Passo: Calcular a diferença de dias entre a requisição de compras e o envio do produto
main$delivery_time <- interval(main$`Order Date`, main$`Ship Date`) / ddays()
9º Passo: Criar uma nova coluna para saber se o tempo entre a requisição de compras e o envio do produto foi igual ou maior que 2 dias
for(i in 1:length(main$delivery_time)) {
if (main$delivery_time[i] < 2) {
main$Late[i] = "NotLate"
}
else {
main$Late[i] = "Late"
}
}
10º Passo: Selecionando as colunas de interesse da tabela Main para criar a tabela Sales
Sales <- main %>%
select(`Customer Name`, Province, Region, `Customer Segment`, `Product Name`, `Product Category`, `Product Sub-Category`, `Order Date`, Sales, `Order Quantity`, `Unit Price`, `Profit`, `Shipping Cost`, Status, `delivery_time`, Late)
11º Passo: Unir a tabela sales com a tabela customer
Sales <- Sales %>%
full_join(customer_main, by = c("Customer Name" = "Name", "Province" = "Province", "Region" = "Region", "Customer Segment" = "Segment")) %>%
select(-`Customer Name`, -Province, -Region, -`Customer Segment`)
12º Passo: Unir a tabela sales com a tabela product
Sales <- Sales %>%
full_join(product_main, by = c("Product Name" = "name", "Product Category" = "category","Product Sub-Category" = "product_sub_category")) %>%
select(-`Product Name`, -`Product Category`, -`Product Sub-Category`)
13º Passo: Unir a tabela sales com a tabela returnstatus
Sales <- Sales %>%
full_join(returnstatus, by = c("Status" = "returnvalue")) %>%
select(-Status)
14º Passo: Renomear as colunas
Sales <- Sales %>%
rename(Order_Date = `Order Date`, Order_Quantity = `Order Quantity`,
Unit_Price = `Unit Price`, Shipping_Cost = `Shipping Cost`)
15º PassO
Sales$Profit <- gsub(',','.',Sales$Profit)
Sales$Profit <- as.numeric(Sales$Profit)
Sales$Shipping_Cost <- gsub(',','.',Sales$Shipping_Cost)
Sales$Shipping_Cost <- as.numeric(Sales$Shipping_Cost)
16º Passo: Transferir os dados para PostgreSQL
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, port = 5432, host = "bronto.ewi.utwente.nl",
dbname = "dab_ds21221a_33", user = "dab_ds21221a_33", password = "DctKLxeJOIzBjcz/",
options="-c search_path=ass3")
dbWriteTable(con, "product_main", value = product_main, overwrite = T, row.names = F)
## [1] TRUE
dbWriteTable(con, "customer_main", value = customer_main, overwrite = T, row.names = F)
## [1] TRUE
dbWriteTable(con, "returnstatus", value = returnstatus, overwrite = T, row.names = F)
## [1] TRUE
dbWriteTable(con, "Sales", value = Sales, overwrite = T, row.names = F)
## [1] TRUE
Profit
Profit2
Profit3
17º Passo: Output dos dados
dbGetQuery(con,
"SELECT table_name FROM information_schema.tables WHERE table_schema='ass3'")
## table_name
## 1 product_main
## 2 customer_main
## 3 returnstatus
## 4 Sales
## 5 product
## 6 customer
## 7 sales
str(dbReadTable(con, c("ass3", "customer")))
## 'data.frame': 1832 obs. of 5 variables:
## $ Name : chr "Aaron Bergman" "Aaron Bergman" "Aaron Hawkins" "Aaron Hawkins" ...
## $ Province : chr "Alberta" "Nunavut" "British Columbia" "Nova Scotia" ...
## $ Region : chr "West" "Nunavut" "West" "Atlantic" ...
## $ Segment : chr "Corporate" "Corporate" "Home Office" "Home Office" ...
## $ customerid: int 1 2 3 4 5 6 7 8 9 10 ...
str(dbReadTable(con, c("ass3", "product")))
## 'data.frame': 1263 obs. of 4 variables:
## $ Name : chr "\"While you Were Out\" Message Book, One Form per Page" "*Staples* Highlighting Markers" "*Staples* Letter Opener" "*Staples* Packaging Labels" ...
## $ Category : chr "Office Supplies" "Office Supplies" "Office Supplies" "Office Supplies" ...
## $ Sub_Category: chr "Paper" "Pens & Art Supplies" "Scissors, Rulers and Trimmers" "Labels" ...
## $ productid : int 1 2 3 4 5 6 7 8 9 10 ...
str(dbReadTable(con, c("ass3", "returnstatus")))
## 'data.frame': 2 obs. of 2 variables:
## $ returnstatusid: num 1 2
## $ returnvalue : chr "Returned" "NotReturned"
str(dbReadTable(con, c("ass3", "sales")))
## 'data.frame': 8399 obs. of 10 variables:
## $ Order_Date : Date, format: "2010-10-13" "2012-02-20" ...
## $ Sales : num 26154 693 14056 280808 17614 ...
## $ Order_Quantity: num 6 2 15 26 24 23 30 14 46 32 ...
## $ Unit_Price : num 3894 208 846 10753 7089 ...
## $ Profit : num -213.25 -4.64 -128.38 1054.82 -1748.56 ...
## $ Shipping_Cost : num 35 2.56 8.99 5.81 89.3 5.03 2.25 8.99 4.2 1.99 ...
## $ Late : chr "Late" "NotLate" "NotLate" "NotLate" ...
## $ customerid : int 1327 1565 1146 1145 1145 1145 1000 1000 1587 1708 ...
## $ productid : int 405 712 675 1051 711 257 384 333 56 1104 ...
## $ returnstatusid: num 2 2 2 2 2 2 2 2 2 1 ...
A key difference of multidimensional modeling from “ordinary” data modeling is that the multidimensional modeler should neither try to include all the available data nor all the existing relationships in the data in the model. Only those aspects that are essential “drivers” of the business should be included.
Multidimensional modeling process into four subprocesses:
Choose the business process(es) to model. Step 1 relates to the observation that not all business processes are equally important for the business.
Choose the granularity of the business process. Step 2 implies that it is important to use the data granularity that best matches the analysis needs.
Design the dimensions. Step 3 then goes on to refine the schema of each part of the grain into a complete dimension with levels and attributes
Choose the measures. the numerical measures to capture for each combination of dimension values
Goodness score = [Sum of every interval of the timestamp of starting the game and that of switching to another app (which is the total time of playing the game)] x the level of completion of the game. Since beta testing is for getting feedback, it is necessary for testers to play as much as they can in order to come up with more advice from their experiences. In addition to total playing time, the level of completion of the game should be considered. Therefore, we multiply these two factors, and if someone has a higher goodness score, it means that he or she plays this game for a long time and completes many levels of the game.
Which gamers are more suitable for being beta testers?
Star Squema Beta Tester
We choose Scores as facts because the main purpose of the Beta Tester Service is to evaluate the good gamers in order to find the most suitable ones for being beta testers. Also, we decide to create two dimensions: Beta Testers and Developers since both are related to the process of developing a game.