Week 1

Introduction to BI

The Balanced Scorecard

On the website of the Balanced Scorecard Institute you can learn the basics of the Balanced Scorecard (BSC).

For the videos below, please open them in a new tab! Direct clicking won’t work.

This video introduces the Balanced Scorecard in simple terms.

Market Capitalism versus Managerial Capitalism

Market capitalism is very much the domain of Milton Friedman.

This video explains Friedman’s views, particularly on corporate social responsibility. Try to formulate for yourself how Friedman thinks of “stakeholders”.

In this video, you can learn more about the logic of managerial capitalism.

Managing supply chains in times of turbulence

Christopher is one of the gurus in logistics.

An interesting article by Christopher can be downloaded here.

Week 2

We have introduced a case study on Fashion Logistics.

As stated by Provost & Fawcett (2013): “The key to success is often a creative problem formulation by an analyst regarding how to cast the business problem as one or more data science problems”.

Now reconsider the following slide:

Give an accurate summary of the situation!

  1. Under the (unrealistic) assumption that the fashion supply chain behaves as one entity with clear, uniform objectives, describe what the set of KPIs to run this entity smoothly would look like! We can refer to this assumption as the ideal situation.
  2. Next, given your set of KPIs, what data is needed from the various parts of the supply chain (retailers; wholesalers; distribution centers; transportation companies; municipalities; umbrella organizations)?
  3. Now, realize that we are not in an ideal situation: the various parts of the supply chain do not make up one organization: they have common goals on some issues; conflicting goals on other issues; some of them are actually competing; and so on. So, before we start off, we need your views on (i) what the potential conflicts are, and (ii) how these conflicts are stumbling blocks to obtaining the industry-level KPIs needed to improve the supply chain!

In data science, we use the CRISP-DM model (see below). The above questions refer to the first phases of the model: business understanding and data understanding.

From your answers to questions 1-3 above, think of a smart (organizational) design to collect the data needed to compile your KPIs!

In data science, we make a rough distinction between:

In designing your BI/data science project for fashion logistics, you can use one or more of these kinds of techniques to typify your strategy. What would be the main focus of your strategy? Why?

Week 3/4

We advise you to look closely at all of the slides presented during the two sessions.

A particularly relevant slide is the following. So if a question pops up on dimensions, have this slide in mind!

Robert Rongen

Week 5

One of the most time-consuming steps in the CRISP-DM model is data preparation.

There are several reasons for this:

The data obtained in the project includes the postal codes of origin and destination. Since these postal codes are one-to-one related to coordinates (expressed as longitude and latitude), we can use a pretty complex formula to compute the beeline distance (the shortest distance between two points on a globe). However, we need data on:

In the data set, the beeline distances are given.
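
For reference, the standard way to compute such a beeline distance is the haversine (great-circle) formula. Below is a minimal sketch in R, assuming coordinates in decimal degrees; the example coordinates are illustrative, not taken from the project data.

# Haversine (great-circle) distance between two points, coordinates in decimal degrees
beeline_km <- function(lat1, lon1, lat2, lon2, R = 6371) {   # R = mean earth radius in km
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))                                      # distance in kilometres
}

# Illustrative example: Amsterdam (52.37, 4.90) to Rotterdam (51.92, 4.48)
beeline_km(52.37, 4.90, 51.92, 4.48)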

Think of ways to estimate (from the beeline distance):

Week 6/7

TBA: R-training, and interview with Henny Jordaan

Week 8 - Plan of Analysis

Based on your design for the collection and analysis of data in the ideal situation, including KPIs and metrics, but now taking into account the constraints of the data as available, several researchers (and students) are helping AMFI in analyzing the data.

There are three aspects to this.

  1. What can we learn from the data?
  2. Which improvements can be brought about in the supply chain, in order to save time, cost and lower emissions (e.g. by simulating hubs close to cities where goods are bundled before sending them to stores in the city centers)?
  3. Which data are not yet available, but crucial to have for the next round of the project?

In order to tackle these issues systematically, we need a plan. Draft your plan as follows.

Your plan does not need to be at the highly detailed level of R-scripts; but we do expect that you include the variables as they appear in the data. One good option is to use ArchiMate to describe the workflow!

As assessment criteria, we will use the clarity of your plan and the ability to translate the plan into analyses (e.g., in R-scripts).

Week 9

Henri Masson

Week 10-15

6 Techniques:

Formative Qs week 10

Regression tree analysis is a powerful technique that combines classification and cause-and-effect analysis.

It is often used to explain a numerical variable (that is, a value measured on an ordinal, interval, or ratio scale): income, temperature, quality level, and so on.

In our example we have used wine quality, to see if and how quality is indicated by a number of predictor (or independent, or explanatory) variables. Often, we don’t have a clear model or theory in mind. We use the analysis to reveal which of the variables are the best indicators, or predictors.

Applying the analysis to wine gives a lot of output, but the results are neatly summarized and visualized in the two graphs.

You should be able to interpret these graphs clearly!

wwine <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/whitewines.csv")
(n <- nrow(wwine))              # number of observations
## [1] 4898
set.seed(789)
train_set <- sample(n, 0.80*n)  # random sample of 80% of the rows
wwine_train <- wwine[train_set, ]
wwine_test  <- wwine[-train_set, ]
library(rpart)
m <- rpart(quality ~ ., data = wwine_train)      # grow the regression tree
# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m, digits = 2, fallen.leaves = TRUE)  # visualize the tree
p <- predict(m, wwine_test)                      # predicted quality for the test set
cor(p, wwine_test$quality)                       # correlation between predicted and actual quality
## [1] 0.5309301
# Scatterplot of predicted versus actual quality
wwine_test$pred <- p
head(wwine_test)
library(ggplot2)

# Jitter the points to reduce overplotting
ggplot(wwine_test, aes(x = quality, y = pred)) +
  geom_point(shape = 1, position = position_jitter(width = 1, height = .5)) +
  geom_smooth()

Week 11

Association Rule Mining is a technique employed in cases where we have a lot of “items”, all of which occur infrequently. The typical example is the supermarket, where customers purchase a couple of items (maybe 1 or 2 items; maybe 100 items for weekly shopping), out of the 1000s that the supermarket is selling.

The challenge is to see which items somehow are bought in combination.

This is in fact what you have learned in statistics when analyzing tables (cross-tabulations). A statistic called chi-square tells us whether there is an association between two variables - both measured in classes (nominal or ordinal scales).
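
To make the link concrete, here is a minimal sketch of a single 2x2 cross-tabulation and its chi-square test in R; the counts are taken from the curd / whole milk figures in the grocery output further below.

# One 2x2 cross-tabulation out of the many that association rule mining scans:
# curd vs. whole milk in 1000 transactions (26 contain both, 56 contain curd, 305 contain whole milk)
tab <- matrix(c(26, 30, 279, 665), nrow = 2, byrow = TRUE,
              dimnames = list(curd = c("yes", "no"), `whole milk` = c("yes", "no")))
tab
chisq.test(tab)   # tests whether buying curd and buying whole milk are associated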

In this technique, we analyze an enormous number of 2x2 cross-tabulations, in a smart way.

The output is, however, relatively simple.

You should be able to interpret the rules! Of crucial importance is the correct interpretation of the three measures: support, confidence, and lift.

Oh yeah, don’t forget the meaning of density in the sparse matrix. And what was a sparse matrix again?

## market basket analysis
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
# you can experiment with the data set Groceries
# data("Groceries") 
# we made a selection
groc1000 <- read.transactions("buas_mba_assignment.csv",sep=",")
summary(groc1000)
## transactions as itemMatrix in sparse format with
##  1000 rows (elements/itemsets/transactions) and
##  154 columns (items) and a density of 0.03038961 
## 
## most frequent items:
##       whole milk             soda       rolls/buns other vegetables 
##              305              215              213              197 
##           yogurt          (Other) 
##              151             3599 
## 
## element (itemset/transaction) length distribution:
## sizes
##   2   3   4   5   6   7   8   9  10 
## 212 185 150 123 102  78  62  48  40 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.00    4.00    4.68    6.00   10.00 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
groc1000_r1 <- apriori(groc1000, parameter = list(support = .025, confidence = .05, minlen = 2))  # rules with at least 2.5% support, 5% confidence, and 2 items
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.025      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 25 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[154 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [66 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(groc1000_r1[1:5]) # first 5 rules
##     lhs                     rhs                support confidence coverage
## [1] {curd}               => {whole milk}       0.026   0.4642857  0.056   
## [2] {whole milk}         => {curd}             0.026   0.0852459  0.305   
## [3] {brown bread}        => {whole milk}       0.031   0.4078947  0.076   
## [4] {whole milk}         => {brown bread}      0.031   0.1016393  0.305   
## [5] {whipped/sour cream} => {other vegetables} 0.027   0.3600000  0.075   
##     lift     count
## [1] 1.522248 26   
## [2] 1.522248 26   
## [3] 1.337360 31   
## [4] 1.337360 31   
## [5] 1.827411 27
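
As a quick check, you can reproduce the three measures for rule [1] ({curd} => {whole milk}) by hand from the counts in the output above: 1000 transactions, 56 containing curd (coverage 0.056), 305 containing whole milk, and 26 containing both.

# Reproducing support, confidence and lift for rule [1]: {curd} => {whole milk}
n_total <- 1000                 # number of transactions
n_curd  <- 56                   # transactions containing curd (coverage of the lhs)
n_milk  <- 305                  # transactions containing whole milk
n_both  <- 26                   # transactions containing both (count)
n_both / n_total                # support    = 0.026
(conf <- n_both / n_curd)       # confidence = 0.464
conf / (n_milk / n_total)       # lift       = 1.522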

Week 12

kNN stands for k-nearest neighbours: we classify objects based on the classes of their k nearest neighbours.

Pretty cryptic, but elegantly simple.

Suppose a guy walks into your bank, and applies for a loan. You can ask for many traits (age, gender, eye color, previous loans, income, …) and use the loan performance of look-alikes to predict your client’s creditworthiness.

Elegant as it is, we call it a lazy learner. Why?

You have to be able to understand the following concepts:

We looked at model evaluation. The result of the procedure is simple: we use the model to classify the object. Whether this works well or not is tested on our test set of new cases (well, not really new, but cases that were not used in the analysis).

For the objects in the test set we can compare the predicted values against the real outcomes.

From a classification table, we can derive:

From a managerial point of view, you have to be able to evaluate these numbers.

False negatives and false positives may carry different weights. In testing for Covid-19, what would be worse: a false negative or a false positive? The same question is relevant for our loan applicant.

The good thing is: we can agree or disagree with your point of view, but there are no hard rules here. You are the manager, you decide!

We won’t give an elaborate example and script, but just focus on the bottom line.

Try to interpret the classification table below.
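
If you want to build such a table yourself, here is a minimal sketch in R, with hypothetical predictions purely for illustration.

# Hypothetical true classes and model predictions for 20 test cases (1 = positive class, 0 = negative class)
actual    <- factor(c(1,1,1,1,1,0,0,0,0,0,1,0,1,0,0,1,0,0,1,0), levels = c(0,1))
predicted <- factor(c(1,1,0,1,1,0,0,1,0,0,1,0,1,0,1,1,0,0,0,0), levels = c(0,1))
(tab <- table(predicted, actual))   # the classification (confusion) table
sum(diag(tab)) / sum(tab)           # accuracy: share of correctly classified cases
tab["1", "0"]                       # false positives: predicted 1, actually 0
tab["0", "1"]                       # false negatives: predicted 0, actually 1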

Forecasting

  1. Consider the sketch below.

You are the data analyst of “the Manufacturer”. Which data will you collect, and in which form for your:

  1. Forecasting: exponential smoothing and ARIMA (chapters 7 and 8 of the reference textbook)

How would you check if a sophisticated time series forecasting method (exponential smoothing, ARIMA) performs better than the elementary ones (average, naïve, seasonal naïve, drift)?
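
One common approach is to hold out the last part of the series and compare out-of-sample accuracy. A minimal sketch in R, assuming the forecast package and the built-in AirPassengers series (not the course data):

# Compare elementary benchmarks with exponential smoothing and ARIMA on a hold-out period
library(forecast)
train <- window(AirPassengers, end   = c(1958, 12))   # fit on data up to 1958
test  <- window(AirPassengers, start = c(1959, 1))    # keep 1959-1960 for evaluation
h <- length(test)

f_mean   <- meanf(train, h = h)                # average method
f_naive  <- naive(train, h = h)                # naive: last observed value
f_snaive <- snaive(train, h = h)               # seasonal naive
f_drift  <- rwf(train, h = h, drift = TRUE)    # drift method
f_ets    <- forecast(ets(train), h = h)        # exponential smoothing
f_arima  <- forecast(auto.arima(train), h = h) # ARIMA, automatically selected

accuracy(f_snaive, test)   # out-of-sample RMSE, MAE, MAPE, ... for one benchmark
accuracy(f_ets, test)      # compare against exponential smoothing
accuracy(f_arima, test)    # and against ARIMA

The method with the lowest error measures on the test set (not on the training set!) is the better forecaster.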

  1. Advanced Data Mining for Logistics

Assume you are the data analyst of a manufacturing company.