Week 1

Introduction to BI

The Balanced Scorecard

On the website of the Balanced Scorecard Institute you can learn the basics of the Balanced Scorecard (BSC).

For the videos below, please open them in a new tab! Direct clicking won’t work.

This video introduces the Balanced Scorecard in simple terms.

Market Capitalism versus Managerial Capitalism

Market capitalism is very much the domain of Milton Friedman.

This video explains Friedman’s views, particularly on corporate social responsibility. Try to formulate for yourself how Friedman thinks of “stakeholders”.

In this video, you can learn more about the logic of managerial capitalism.

Managing supply chains in times of turbulence

Christopher is one of the gurus in logistics.

An interesting article by Christopher can be downloaded here.

Week 2

We have introduced a case study on Fashion Logistics.

As stated by Provost & Fawcett (2013): “The key to success is often a creative problem formulation by an analyst regarding how to cast the business problem as one or more data science problems”.

Now reconsider the following slide:

Give an accurate summary of the situation!

  1. Under the (unrealistic) assumption that the fashion supply chain behaves as one entity with clear, uniform objectives, describe what the set of KPIs to run this entity smoothly would look like! We can refer to this assumption as the ideal situation.
  2. Next, given your set of KPIs, what data is needed from the various parts of the supply chain (retailers; wholesalers; distribution centers; transportation companies; municipalities; umbrella organizations)?
  3. Now, realize that we are not in an ideal situation: the various parts of the supply chain do not make up one organization: they have common goals on some issues; conflicting goals on other issues; some of them are actually competing; and so on. So, before we start off, we need your views on (i) what the potential conflicts are, and (ii) how these conflicts are stumbling blocks to obtaining the industry-level KPIs needed to improve the supply chain!

In data science, we use the CRISP-DM model (see below). The above questions refer to the first phases of the model: business understanding and data understanding.

From your answers to questions 1-3 above, think of a smart (organizational) design to collect the data needed to compile your KPIs!

In data science, we make a rough distinction between:

In designing your BI/data science project for fashion logistics, you can use one or more of these kinds of techniques to typify your strategy. What would be the main focus of your strategy? Why?

Week 3/4

We advise you to look closely at all of the slides presented during the two sessions.

A particularly relevant slide is the following. So if a question pops up on dimensions, have this slide in mind!

Robert Rongen

Week 5

One of the most time-consuming steps in the CRISP-DM model is data preparation.

There are several reasons for this:

The data obtained in the project includes the postal codes of origin and destination. Since these postal codes are one-to-one related to coordinates (expressed as longitude and latitude), we can use a pretty complex formula to compute the beeline distance (the shortest distance between two points on a globe). However, we need data on:

In the data set, the beeline distances are given.
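
For reference, the standard way to compute such a beeline distance is the haversine (great-circle) formula. Below is a minimal sketch in R, assuming coordinates in decimal degrees; the example coordinates are illustrative, not taken from the project data.

# Haversine (great-circle) distance between two points, coordinates in decimal degrees
beeline_km <- function(lat1, lon1, lat2, lon2, R = 6371) {   # R = mean earth radius in km
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))                                      # distance in kilometres
}

# Illustrative example: Amsterdam (52.37, 4.90) to Rotterdam (51.92, 4.48)
beeline_km(52.37, 4.90, 51.92, 4.48)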

Think of ways to estimate (from the beeline distance):

Week 6/7

TBA: R-training, and interview with Henny Jordaan

Week 8 - Plan of Analysis

Based on your design for the collection and analysis of data in the ideal situation, including KPIs and metrics, but now taking into account the constraints of the data as available, several researchers (and students) are helping AMFI in analyzing the data.

There are three aspects to this.

  1. What can we learn from the data?
  2. Which improvements can be brought about in the supply chain, in order to save time, cost and lower emissions (e.g. by simulating hubs close to cities where goods are bundled before sending them to stores in the city centers)?
  3. Which data are not yet available, but crucial to have for the next round of the project?

In order to tackle these issues systematically, we need a plan. Draft your plan as follows.

Your plan does not need to be at the highly detailed level of R-scripts; but we do expect that you include the variables as they appear in the data. One good option is to use ArchiMate to describe the workflow!

As assessment criteria, we will use the clarity of your plan and the ability to translate the plan into analyses (e.g., in R-scripts).

Week 9

Henri Masson

Week 10-15

6 Techniques:

Formative Qs week 10

Regression tree analysis is a powerful technique that combines classification and cause-and-effect analysis.

It is often used to explain a numerical variable (that is, a value measured on an ordinal, interval, or ratio scale): income, temperature, quality level, and so on.

In our example we have used wine quality, to see if and how quality is indicated by a number of predictor (or independent, or explanatory) variables. Often, we don’t have a clear model or theory in mind. We use the analysis to reveal which of the variables are the best indicators, or predictors.

Applying the analysis to wine gives a lot of output, but the results are neatly summarized and visualized in the two graphs.

You should be able to interpret these graphs clearly!

wwine <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/whitewines.csv")
(n <- nrow(wwine))              # number of observations
## [1] 4898
set.seed(789)
train_set <- sample(n, 0.80*n)  # random sample of 80% of the rows
wwine_train <- wwine[train_set, ]
wwine_test  <- wwine[-train_set, ]
library(rpart)
m <- rpart(quality ~ ., data = wwine_train)      # grow the regression tree
# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m, digits = 2, fallen.leaves = TRUE)  # visualize the tree
p <- predict(m, wwine_test)                      # predicted quality for the test set
cor(p, wwine_test$quality)                       # correlation between predicted and actual quality
## [1] 0.5309301
# Scatterplot of predicted versus actual quality
wwine_test$pred <- p
head(wwine_test)
library(ggplot2)

# Jitter the points to reduce overplotting
ggplot(wwine_test, aes(x = quality, y = pred)) +
  geom_point(shape = 1, position = position_jitter(width = 1, height = .5)) +
  geom_smooth()

Week 11

Association Rule Mining is a technique employed in cases where we have a lot of “items”, all of which occur infrequently. The typical example is the supermarket, where customers purchase a couple of items (maybe 1 or 2 items; maybe 100 items for weekly shopping), out of the 1000s that the supermarket is selling.

The challenge is to see which items somehow are bought in combination.

This is in fact what you have learned in statistics when analyzing tables (cross-tabulations). A statistic called chi-square tells us whether there is an association between two variables - both measured in classes (nominal or ordinal scales).
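
To make the link concrete, here is a minimal sketch of a single 2x2 cross-tabulation and its chi-square test in R; the counts are taken from the curd / whole milk figures in the grocery output further below.

# One 2x2 cross-tabulation out of the many that association rule mining scans:
# curd vs. whole milk in 1000 transactions (26 contain both, 56 contain curd, 305 contain whole milk)
tab <- matrix(c(26, 30, 279, 665), nrow = 2, byrow = TRUE,
              dimnames = list(curd = c("yes", "no"), `whole milk` = c("yes", "no")))
tab
chisq.test(tab)   # tests whether buying curd and buying whole milk are associated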

In this technique, we analyze an enormous number of 2x2 cross-tabulations, in a smart way.

The output is, however, relatively simple.

You should be able to interpret the rules! Of crucial importance is the correct interpretation of the three measures: support, confidence, and lift.

Oh yeah, don’t forget the meaning of density in the sparse matrix. And what was a sparse matrix again?

## market basket analysis
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
# you can experiment with the data set Groceries
# data("Groceries") 
# we made a selection
groc1000 <- read.transactions("buas_mba_assignment.csv",sep=",")
summary(groc1000)
## transactions as itemMatrix in sparse format with
##  1000 rows (elements/itemsets/transactions) and
##  154 columns (items) and a density of 0.03038961 
## 
## most frequent items:
##       whole milk             soda       rolls/buns other vegetables 
##              305              215              213              197 
##           yogurt          (Other) 
##              151             3599 
## 
## element (itemset/transaction) length distribution:
## sizes
##   2   3   4   5   6   7   8   9  10 
## 212 185 150 123 102  78  62  48  40 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    4.68    6.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
groc1000_r1 <- apriori(groc1000, parameter = list(support = .025, confidence = .05, minlen = 2))  # rules with at least 2.5% support, 5% confidence, and 2 items
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.025      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 25 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[154 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [66 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(groc1000_r1[1:5]) # first 5 rules
##     lhs                     rhs                support confidence coverage
## [1] {curd}               => {whole milk}       0.026   0.4642857  0.056   
## [2] {whole milk}         => {curd}             0.026   0.0852459  0.305   
## [3] {brown bread}        => {whole milk}       0.031   0.4078947  0.076   
## [4] {whole milk}         => {brown bread}      0.031   0.1016393  0.305   
## [5] {whipped/sour cream} => {other vegetables} 0.027   0.3600000  0.075   
##     lift     count
## [1] 1.522248 26   
## [2] 1.522248 26   
## [3] 1.337360 31   
## [4] 1.337360 31   
## [5] 1.827411 27
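
As a quick check, you can reproduce the three measures for rule [1] ({curd} => {whole milk}) by hand from the counts in the output above: 1000 transactions, 56 containing curd (coverage 0.056), 305 containing whole milk, and 26 containing both.

# Reproducing support, confidence and lift for rule [1]: {curd} => {whole milk}
n_total <- 1000                 # number of transactions
n_curd  <- 56                   # transactions containing curd (coverage of the lhs)
n_milk  <- 305                  # transactions containing whole milk
n_both  <- 26                   # transactions containing both (count)
n_both / n_total                # support    = 0.026
(conf <- n_both / n_curd)       # confidence = 0.464
conf / (n_milk / n_total)       # lift       = 1.522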

Week 12

kNN stands for k-nearest neighbours: we classify objects based on the classes of their k nearest neighbours.

Pretty cryptic, but elegantly simple.

Suppose a guy walks into your bank, and applies for a loan. You can ask for many traits (age, gender, eye color, previous loans, income, …) and use the loan performance of look-alikes to predict your client’s creditworthiness.

Elegant as it is, we call it a lazy learner. Why?

You have to be able to understand the following concepts:

We looked at model evaluation. The result of the procedure is simple: we use the model to classify the object. Whether this works well or not is tested on our test set of new cases (well, not really new, but cases that were not used in the analysis).

For the objects in the test set we can compare the predicted values against the real outcomes.

From a classification table, we can derive:

From a managerial point of view, you have to be able to evaluate these numbers.

False negatives and false positives may carry different weights. In testing for Covid-19, what would be worse: a false negative or a false positive? The same question is relevant for our loan applicant.

The good thing is: we can agree or disagree with your point of view, but there are no hard rules here. You are the manager, you decide!

We won’t give an elaborate example and script, but just focus on the bottom line.

Try to interpret the classification table below.
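
If you want to build such a table yourself, here is a minimal sketch in R, with hypothetical predictions purely for illustration.

# Hypothetical true classes and model predictions for 20 test cases (1 = positive class, 0 = negative class)
actual    <- factor(c(1,1,1,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0), levels = c(0,1))
predicted <- factor(c(1,1,0,1,1,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0), levels = c(0,1))
(tab <- table(predicted, actual))   # the classification (confusion) table
sum(diag(tab)) / sum(tab)           # accuracy: share of correctly classified cases
tab["1", "0"]                       # false positives: predicted 1, actually 0
tab["0", "1"]                       # false negatives: predicted 0, actually 1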

Forecasting

  1. Consider the sketch below.

You are the data analyst of “the Manufacturer”. Which data will you collect, and in which form for your:

  1. Forecasting: exponential smoothing and ARIMA (chapters 7 and 8 of the reference textbook)

How would you check if a sophisticated time series forecasting method (exponential smoothing, ARIMA) performs better than the elementary ones (average, naïve, seasonal naïve, drift)?
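
One common approach is to hold out the last part of the series and compare out-of-sample accuracy. A minimal sketch in R, assuming the forecast package and the built-in AirPassengers series (not the course data):

# Compare elementary benchmarks with exponential smoothing and ARIMA on a hold-out period
library(forecast)
train <- window(AirPassengers, end   = c(1958, 12))   # fit on data up to 1958
test  <- window(AirPassengers, start = c(1959, 1))    # keep 1959-1960 for evaluation
h <- length(test)

f_mean   <- meanf(train, h = h)                # average method
f_naive  <- naive(train, h = h)                # naive: last observed value
f_snaive <- snaive(train, h = h)               # seasonal naive
f_drift  <- rwf(train, h = h, drift = TRUE)    # drift method
f_ets    <- forecast(ets(train), h = h)        # exponential smoothing
f_arima  <- forecast(auto.arima(train), h = h) # ARIMA, automatically selected

accuracy(f_snaive, test)   # out-of-sample RMSE, MAE, MAPE, ... for one benchmark
accuracy(f_ets, test)      # compare against exponential smoothing
accuracy(f_arima, test)    # and against ARIMA

The method with the lowest error measures on the test set (not on the training set!) is the better forecaster.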

  1. Advanced Data Mining for Logistics

Assume you are the data analyst of a manufacturing company.