Data Science: Data Preparation and Visualization

Data:25/10/2021

Assigment 1: Facts and Dimension

Conceitos Importantes

1) Cubo

A cube is a trulymultidimensional data structure for capturing and analyzing data.A cube generalizes the tabular spreadsheet such that there can be any number of dimensions (and not only two as in spreadsheets).

A cube consists of uniquely identifiable cells at each of the dimensions’ intersections. A nonempty cell is called a fact

2) Dimension

Dimensions are used for two purposes: the selection of data and the grouping of data at a desired level of detail.

A dimension is organized into a containment-like hierarchy composed of a number of levels, each of which represents a level of detail that is of interest to analyses to be performed.

3) Facts

Facts are the objects that represent the subjects of the desired analyses, i.e., the interesting “things” or events or processes, that are to be analyzed to better understand their behavior.

The facts are implicitly defined by their combination of dimension values. If a non-empty cell exists for a particular combination, a fact exists; otherwise, no fact exists for that combination.

4) Measures

A measure has two components: a numerical property of a fact, e.g., the sales price or profit, and a formula (most often a simple aggregation function such as SUM) that can be used to combine several measure values into one. In amultidimensional database, measures generally represent the properties of the chosen facts that the users want to study,

5) Star Squema

A star schema has one dimension table for each dimension. This table has a key column and one column for each level of the dimension

6) Snowflake Squema

A snowflake schema has a fact table, as does a star schema. Snowflake schemas, however, contain several dimension tables for each dimension, namely one table for each (non-) level.

7) Data Warehouse

Data warehouse as a subject oriented, integrated, time variant, non-volatile collection of data in support of management’s decision making process

1.1. A) Facts:nrofevents, totalparticipants. Dimensions: dayofweek, year, location.

1.2. A) Facts: nrofregistrations. Dimensions: instID, institute, cityID, city, province, study, phase, year, sex

Assigment 2: Re-create the demo from the lecture

Use the data spreadsheet BI Raw Data.csv. The encoding of this file is “ISO-8859-1”.

A large warehouse located in the US has many customers for its different product from around the world. The warehouse manager Mr Jack Bezos has many large customers from over the world ordering different products from his warehouse. We follow the method of Section 1.1.1 where all steps are more or less already given (and demoed at the lecture).

Step 1: Business questions Mr. Bezos wants to know answers for the following questions • Who are his top-5 most valued customers? • What are his top-5 most important products?

** Step 2a: Multidimensional model Notice that Mr. Bezos wants to know something about customers and products. These are dimensions. Time usually also is a dimension if you are interested in trends, so we also include the dimension ‘orderdate’. What is it he wants to know? It is ‘most valued’ and ‘most important’, but what does this mean concretely? How can we measure value and importance? Obviously in terms of money: the most valued customer is the one who bought for the most money, and the most important product is the one which was sold for the most money. Therefore, our fact is ‘amount’ (of money), i.e., ‘sales’.

Step 2b: Create tables in your database The database consists of three tables: • ‘customer’ with attributes customerid:integer (primary key, NOT NULL), name:character varying( 100), country:character varying(100) • ‘product’ with attributes productid: integer (primary key, NOT NULL), name:character varying(100), category:character varying(100) • ‘sales‘ with attributes orderdate:date (primary key, NOT NULL), customerid:integer (primary key, NOT NULL), productid: integer (primary key, NOT NULL), sales:double precision

Note: The orderid is not a key! A customer can order more than one product in one order which results in several rows for one order. Furthermore, in its purest form, a cube does not store information on individual transactions or cases, but only aggregated data for each combination of dimension values. In the case of Mr Bezos, the cube does not store individual orders, but the sales (fact) for all combinations of the dimensions orderdate, customer, and product. In other words, the orderdate, customerid, and productid together are the primary key!

Step 3: ETL — Prepare data and fill the database

1º Passo: Abrir os pacotes

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)

## Warning: package 'readr' was built under R version 4.0.5

library(DBI)

## Warning: package 'DBI' was built under R version 4.0.5

library(RPostgreSQL)

## Warning: package 'RPostgreSQL' was built under R version 4.0.5

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

2º Passo: Then, we load the data by using the command “read delim“ and put it into the object “data0“

data0 <- read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/BI_Raw_data.csv",
      delim = ";", col_names = TRUE, col_types = NULL,
          locale = locale(encoding="ISO-8859-1"))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   Order_ID = col_double(),
##   Order_Date_Year = col_double(),
##   Order_Date_Month = col_double(),
##   Order_Date_Day = col_character(),
##   Order_Price_Total = col_double(),
##   Customer_Name = col_character(),
##   Customer_Country = col_character(),
##   Product_Name = col_character(),
##   Product_Category = col_character(),
##   Product_Order_Unit_Price = col_double(),
##   Product_Order_Quantity = col_double(),
##   Product_Order_Price_Total = col_double()
## )

3º Passo: Criar a tabela Product a partir da tabela principal data0

product <- data0 %>% #Seleciono a minha tabela principal de onde vão sair os dados
  select(Product_Name, Product_Category) %>% #Seleciono as colunas relacionadas a produto
    rename(name = Product_Name, category = Product_Category) %>%  #Renomeio as minhas colunas selecionadas
      group_by(name, category) %>% #Remover as linhas duplicadas das colunas selecionadas
        distinct() %>% 
          ungroup() %>% #Tenho que desagrupar os dados
            mutate(productid = row_number()) #Criar uma nova coluna. O productid pode ser utilizado como primary key da tabela (e que vai servir de link com a tabela principal)

4º Passo: Criar a tabela Customer a partir da tabela principal data0

customer <- data0 %>% #Seleciono a minha tabela principal de onde vão sair os dados
  select(Customer_Name, Customer_Country) %>% #Seleciono as colunas relacionadas a produto
    rename(name = Customer_Name, country = Customer_Country) %>%  #Renomeio as minhas colunas selecionadas
      group_by(name, country) %>% #Remover as linhas duplicadas das colunas selecionadas
        distinct() %>% 
          ungroup() %>% #Tenho que desagrupar os dados
            mutate(customerid = row_number()) #Criar uma nova coluna. O productid pode ser utilizado como primary key da tabela (e que vai servir de link com a tabela principal)

5º Passo: Criar a tabela sales com os atributos de interesse

sales <- data0 %>% 
  select(Order_Date_Day, Product_Order_Price_Total, Customer_Name, Customer_Country, Product_Name, Product_Category, Order_Price_Total)

6º Passo: Criar a tabela sales juntando a tabela de product

sales1 <- sales %>% 
  full_join(product, by = c("Product_Name" = "name", "Product_Category" = "category")) %>%  #For the sales table you first need to select the columns from the original data that you need and then join the product and customer table to add the productid and customerid to the sales table
      select( -Product_Name, -Product_Category) #drop columns that are redundant in the sales table.

7º Passo: Criar a tabela sales juntando a tabela customer

sales2 <- sales1 %>% 
  full_join(customer, by = c("Customer_Name" = "name", "Customer_Country" = "country")) %>%  #For the sales table you first need to select the columns from the original data that you need and then join the product and customer table to add the productid and customerid to the sales table
      select( -Customer_Name, -Customer_Country) #drop columns that are redundant in the sales table.

8º Passo: Having now constructed all data for the dimension and fact tables, you can now store them in the database. From R you can connect to the PostgreSQL database server with the functions dbDriver, dbConnect in packages DBI and RPostgreSQL. So, install these packages and load them first. The code to store the constructed R tables in the database is as follows (for schema ‘ass2’).

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, port = 5432, host = "bronto.ewi.utwente.nl",
dbname = "dab_ds21221a_33", user = "dab_ds21221a_33", password = "DctKLxeJOIzBjcz/",
options="-c search_path=ass2")
dbWriteTable(con, "product", value = product, overwrite = T, row.names = F)

## [1] TRUE

dbWriteTable(con, "customer", value = customer, overwrite = T, row.names = F)

## [1] TRUE

dbWriteTable(con, "sales", value = sales, overwrite = T, row.names = F)

## [1] TRUE

Customer and Sales

Product and Sales

9º Passo: Pegar as informações adquiridas nos passos anteriores

dbListTables(con)

##  [1] "product_main"  "customer_main" "returnstatus"  "Sales"        
##  [5] "product"       "customer"      "sales"         "product"      
##  [9] "customer"      "sales"

str(dbReadTable(con,"customer"))

## 'data.frame':    89 obs. of  3 variables:
##  $ name      : chr  "Simons bistro" "Richter Supermarkt" "Bon app'" "Rattlesnake Canyon Grocery" ...
##  $ country   : chr  "Denmark" "Switzerland" "France" "USA" ...
##  $ customerid: int  1 2 3 4 5 6 7 8 9 10 ...

str(dbReadTable(con,"product"))

## 'data.frame':    77 obs. of  3 variables:
##  $ name     : chr  "Pavlova" "Chang" "Lakkalikööri" "Spegesild" ...
##  $ category : chr  "Confections" "Beverages" "Beverages" "Seafood" ...
##  $ productid: int  1 2 3 4 5 6 7 8 9 10 ...

str(dbReadTable(con,"sales"))

## 'data.frame':    2155 obs. of  7 variables:
##  $ Order_Date_Day           : chr  "10-9-2009" "10-9-2009" "10-9-2009" "10-9-2009" ...
##  $ Product_Order_Price_Total: num  244 190 36 360 500 ...
##  $ Customer_Name            : chr  "Simons bistro" "Richter Supermarkt" "Richter Supermarkt" "Richter Supermarkt" ...
##  $ Customer_Country         : chr  "Denmark" "Switzerland" "Switzerland" "Switzerland" ...
##  $ Product_Name             : chr  "Pavlova" "Chang" "Lakkalikööri" "Spegesild" ...
##  $ Product_Category         : chr  "Confections" "Beverages" "Beverages" "Seafood" ...
##  $ Order_Price_Total        : num  244 586 586 586 1057 ...

Step 4: Visualize Now that we have our data ready in a form suitable for analysis, it is time to address the two business problems raised in the beginning. We create a dashboard with metrics to measure the required KPIs. We describe below how to do this for R and for Tableau. Other programming languages and tools have similar steps. Don’t use spreadsheet software for this. Depois criar um gráfico com ggplot

ggplot(sales2, aes(x=productid, y=Product_Order_Price_Total, color="yellow"))+
  theme_classic()+
    geom_point()+
      ggtitle("Sales")

##Assignment 3: Do it yourself

Step 1: Business Question –> Star Squema foi fornecida

• Which products/product categories made the most loss? • Which products/product categories were shipped really late (more than 2 days)? • Which products/product categories were returned the most?

Database Squema

Step 2: Multidimensional model and database structure

1º Passo: Pegar o meu banco de dados principal

main <-read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/SuperstoreSales_main.csv",
      delim = ";", col_names = TRUE, col_types = NULL,
          locale = locale(encoding="ISO-8859-1"))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_character(),
##   `Row ID` = col_double(),
##   `Order ID` = col_double(),
##   `Order Quantity` = col_double(),
##   Sales = col_number(),
##   `Unit Price` = col_number()
## )
## i Use `spec()` for the full column specifications.

2º Passo: A partir do meu banco principal vou criar uma tabela para product_main

product_main <- main %>%
  select(`Product Name`, `Product Category`, `Product Sub-Category`) %>% 
    rename(name = `Product Name`, category =`Product Category`, product_sub_category = `Product Sub-Category`) %>%
      group_by(name, category, product_sub_category) %>% 
        distinct() %>% 
          ungroup() %>% 
            mutate(productid = row_number())

3º Passo: A partir do meu banco principal vou criar uma tabela para customer_main

customer_main <- main %>%
  select(`Customer Name`, Province, Region, `Customer Segment`) %>%
  rename(Name = `Customer Name`, Segment = `Customer Segment`) %>%
  arrange(Name, Province, Region, Segment) %>%
  group_by(Name, Province, Region, Segment) %>%
  distinct() %>%
  ungroup() %>%
  mutate(customerid = row_number())

4º Passo: Criar a tabela returns

returns <-read_delim(file = "C:/Users/naian/Documents/UTwente/DataScience/SuperstoreSales_returns.csv",
      delim = ";", col_names = TRUE, col_types = NULL,
          locale = locale(encoding="ISO-8859-1"))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `Order ID` = col_double(),
##   Status = col_character()
## )

5º Passo: Criar a tabela retunrstatusid

returnstatusid <- c(1,2)
returnvalue <- c('Returned','NotReturned')
returnstatus <- data.frame(returnstatusid, returnvalue)

6º Passo: Juntar a tabela main com a returns

main <- merge(main, returns, by = "Order ID", all.x = TRUE)
main$Status[is.na(main$Status)] <- "NotReturned"

7º Passo: Transformar as colunas order date e ship date de strings para formato de data

main$`Order Date` <- dmy(main$`Order Date`)
main$`Ship Date` <- dmy(main$`Ship Date`)

is.Date(main$`Order Date`)

## [1] TRUE

is.Date(main$`Ship Date`)

## [1] TRUE

8º Passo: Calcular a diferença de dias entre a requisição de compras e o envio do produto

main$delivery_time <- interval(main$`Order Date`, main$`Ship Date`) / ddays()

9º Passo: Criar uma nova coluna para saber se o tempo entre a requisição de compras e o envio do produto foi igual ou maior que 2 dias

for(i in 1:length(main$delivery_time)) {
  if (main$delivery_time[i] < 2) {
    main$Late[i] = "NotLate"
  }
  else {
    main$Late[i] = "Late"
  }
}

10º Passo: Selecionando as colunas de interesse da tabela Main para criar a tabela Sales

Sales <- main %>%
  select(`Customer Name`, Province, Region, `Customer Segment`, `Product Name`, `Product Category`, `Product Sub-Category`, `Order Date`, Sales, `Order Quantity`, `Unit Price`, `Profit`, `Shipping Cost`, Status, `delivery_time`, Late)

11º Passo: Unir a tabela sales com a tabela customer

Sales <- Sales %>% 
  full_join(customer_main, by = c("Customer Name" = "Name", "Province" = "Province", "Region" = "Region", "Customer Segment" = "Segment")) %>%
  select(-`Customer Name`, -Province, -Region, -`Customer Segment`)

12º Passo: Unir a tabela sales com a tabela product

Sales <- Sales %>% 
  full_join(product_main, by = c("Product Name" = "name", "Product Category" = "category","Product Sub-Category" = "product_sub_category")) %>%
  select(-`Product Name`, -`Product Category`, -`Product Sub-Category`)

13º Passo: Unir a tabela sales com a tabela returnstatus

Sales <- Sales %>% 
  full_join(returnstatus, by = c("Status" = "returnvalue")) %>% 
    select(-Status)

14º Passo: Renomear as colunas

Sales <- Sales %>%
  rename(Order_Date = `Order Date`, Order_Quantity = `Order Quantity`, 
         Unit_Price = `Unit Price`, Shipping_Cost =  `Shipping Cost`)

15º PassO

Sales$Profit <- gsub(',','.',Sales$Profit)
Sales$Profit <- as.numeric(Sales$Profit)
Sales$Shipping_Cost <- gsub(',','.',Sales$Shipping_Cost)
Sales$Shipping_Cost <- as.numeric(Sales$Shipping_Cost)

16º Passo: Transferir os dados para PostgreSQL

drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, port = 5432, host = "bronto.ewi.utwente.nl",
                 dbname = "dab_ds21221a_33", user = "dab_ds21221a_33", password = "DctKLxeJOIzBjcz/",
                 options="-c search_path=ass3")
dbWriteTable(con, "product_main", value = product_main, overwrite = T, row.names = F)

## [1] TRUE

dbWriteTable(con, "customer_main", value = customer_main, overwrite = T, row.names = F)

## [1] TRUE

dbWriteTable(con, "returnstatus", value = returnstatus, overwrite = T, row.names = F)

## [1] TRUE

dbWriteTable(con, "Sales", value = Sales, overwrite = T, row.names = F)

## [1] TRUE

Profit

Profit2

Profit3

17º Passo: Output dos dados

dbGetQuery(con,
           "SELECT table_name FROM information_schema.tables WHERE table_schema='ass3'")

##      table_name
## 1  product_main
## 2 customer_main
## 3  returnstatus
## 4         Sales
## 5       product
## 6      customer
## 7         sales

str(dbReadTable(con, c("ass3", "customer")))

## 'data.frame':    1832 obs. of  5 variables:
##  $ Name      : chr  "Aaron Bergman" "Aaron Bergman" "Aaron Hawkins" "Aaron Hawkins" ...
##  $ Province  : chr  "Alberta" "Nunavut" "British Columbia" "Nova Scotia" ...
##  $ Region    : chr  "West" "Nunavut" "West" "Atlantic" ...
##  $ Segment   : chr  "Corporate" "Corporate" "Home Office" "Home Office" ...
##  $ customerid: int  1 2 3 4 5 6 7 8 9 10 ...

str(dbReadTable(con, c("ass3", "product")))

## 'data.frame':    1263 obs. of  4 variables:
##  $ Name        : chr  "\"While you Were Out\" Message Book, One Form per Page" "*Staples* Highlighting Markers" "*Staples* Letter Opener" "*Staples* Packaging Labels" ...
##  $ Category    : chr  "Office Supplies" "Office Supplies" "Office Supplies" "Office Supplies" ...
##  $ Sub_Category: chr  "Paper" "Pens & Art Supplies" "Scissors, Rulers and Trimmers" "Labels" ...
##  $ productid   : int  1 2 3 4 5 6 7 8 9 10 ...

str(dbReadTable(con, c("ass3", "returnstatus")))

## 'data.frame':    2 obs. of  2 variables:
##  $ returnstatusid: num  1 2
##  $ returnvalue   : chr  "Returned" "NotReturned"

str(dbReadTable(con, c("ass3", "sales")))

## 'data.frame':    8399 obs. of  10 variables:
##  $ Order_Date    : Date, format: "2010-10-13" "2012-02-20" ...
##  $ Sales         : num  26154 693 14056 280808 17614 ...
##  $ Order_Quantity: num  6 2 15 26 24 23 30 14 46 32 ...
##  $ Unit_Price    : num  3894 208 846 10753 7089 ...
##  $ Profit        : num  -213.25 -4.64 -128.38 1054.82 -1748.56 ...
##  $ Shipping_Cost : num  35 2.56 8.99 5.81 89.3 5.03 2.25 8.99 4.2 1.99 ...
##  $ Late          : chr  "Late" "NotLate" "NotLate" "NotLate" ...
##  $ customerid    : int  1327 1565 1146 1145 1145 1145 1000 1000 1587 1708 ...
##  $ productid     : int  405 712 675 1051 711 257 384 333 56 1104 ...
##  $ returnstatusid: num  2 2 2 2 2 2 2 2 2 1 ...

Assignment 4: multidimensional modeling. Case “Mobile app beta tester service”

A key difference of multidimensional modeling from “ordinary” data modeling is that the multidimensional modeler should neither try to include all the available data nor all the existing relationships in the data in the model. Only those aspects that are essential “drivers” of the business should be included.

Multidimensional modeling process into four subprocesses:

Choose the business process(es) to model. Step 1 relates to the observation that not all business processes are equally important for the business.
Choose the granularity of the business process. Step 2 implies that it is important to use the data granularity that best matches the analysis needs.
Design the dimensions. Step 3 then goes on to refine the schema of each part of the grain into a complete dimension with levels and attributes
Choose the measures. the numerical measures to capture for each combination of dimension values

Goodness score = [Sum of every interval of the timestamp of starting the game and that of switching to another app (which is the total time of playing the game)] x the level of completion of the game. Since beta testing is for getting feedback, it is necessary for testers to play as much as they can in order to come up with more advice from their experiences. In addition to total playing time, the level of completion of the game should be considered. Therefore, we multiply these two factors, and if someone has a higher goodness score, it means that he or she plays this game for a long time and completes many levels of the game.
Which gamers are more suitable for being beta testers?
Star Squema Beta Tester

We choose Scores as facts because the main purpose of the Beta Tester Service is to evaluate the good gamers in order to find the most suitable ones for being beta testers. Also, we decide to create two dimensions: Beta Testers and Developers since both are related to the process of developing a game.

Data_Preparation_Visualization