On the website of the Balanced Scorecard Institute, you can learn the basics of the Balanced Scorecard (BSC).
Please open the videos below in a new tab! Direct clicking won’t work.
This video introduces the Balanced Scorecard in simple terms.
Market capitalism is very much the domain of Milton Friedman.
This video explains Friedman’s views, particularly on corporate social responsibility. Try to formulate for yourself how Friedman thinks of “stakeholders”.
In this video, you can learn more about the logic of managerial capitalism.
Christopher is one of the gurus in logistics.
An interesting article by Christopher can be downloaded here.
We have introduced a case study on Fashion Logistics.
As stated by Provost & Fawcett (2013): “The key to success is often a creative problem formulation by an analyst regarding how to cast the business problem as one or more data science problems”.
Now reconsider the following slide:
Give an accurate summary of the situation!
In data science, we use the CRISP model (see below). The above questions refer to the first phases of the model: business understanding and data understanding.
From your answers to the questions 1-3 above, think of a smart (organizational) design to collect the data needed to compile your KPIs!
In data science, we make a rough distinction between:
In designing your BI/data science project for fashion logistics, you can use one or more of these kinds of techniques to typify your strategy. What would be the main focus of your strategy? Why?
We advise you to look closely at all of the slides presented during the two sessions.
A particularly relevant slide is the following. So if a question pops up on dimensions, then have this slide in mind!
Robert Rongen
One of the most time-consuming steps in the CRISP model is data preparation.
There are several reasons for this:
The data obtained in the project includes the postal codes of origin and destination. Since these postal codes are one-to-one related to coordinates (expressed as longitude and latitude), we can use a pretty complex formula to compute the beeline distance: the shortest distance between two points on a globe (see the sketch below). However, we need data on:
In the data set, the beeline distances are given.
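The “pretty complex formula” referred to above is presumably the great-circle (haversine) formula. A minimal sketch in R, assuming the coordinates are available in decimal degrees (the function name and arguments are made up for illustration):

# Haversine (great-circle) distance between two points on the globe, in kilometres
# Assumes latitudes/longitudes in decimal degrees; 6371 km is the mean Earth radius
beeline_km <- function(lat1, lon1, lat2, lon2, R = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(pmin(1, a)))
}

beeline_km(52.37, 4.90, 48.86, 2.35) # Amsterdam - Paris, roughly 430 km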
Think of ways to estimate (from the beeline distance):
TBA: R-training, and interview with Henny Jordaan
Based on your design for the collection and analysis of data in the ideal situation, including KPIs and metrics, but now taking into account the constraints of the data as available, several researchers (and students) are helping AMFI in analyzing the data.
There are three aspects to this.
In order to tackle these issues systematically, we need a plan. Draft your plan as follows.
Your plan does not need to be at the highly detailed level of R-scripts, but we do expect that you include the variables as they appear in the data. One good option is to use ArchiMate to describe the workflow!
As assessment criteria, we will use the clarity of your plan and the ability to translate the plan into analyses (e.g., in R-scripts).
Henri Masson
6 Techniques:
Regression tree analysis is a powerful technique that combines classification and cause-and-effect analysis.
It is often used to explain a numerical variable (that is, a value measured on an ordinal, interval, or ratio scale): income, temperature, quality level, and so on.
In our example we have used wine quality, to see if and how quality is indicated by a number of predictor (or independent, or explanatory) variables. Often, we don’t have a clear model or theory in mind. We use the analysis to reveal which of the variables are the best indicators, or predictors.
Applying the analysis to wine gives a lot of output, but the results are neatly summarized and visualized in the two graphs.
You should be able to interpret these graphs clearly!
wwine <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/whitewines.csv")
(n <- nrow(wwine)) # number of observations
## [1] 4898
set.seed(789)
train_set <- sample(n,0.80*n) # sample of 80%
wwine_train <- wwine[train_set,]
wwine_test <- wwine[-train_set,]
library(rpart)
m <- rpart(quality ~ .,data=wwine_train)
# install.packages("rpart.plot")
library(rpart.plot)
rpart.plot(m, digits=2,fallen.leaves=TRUE)
p <- predict(m,wwine_test) # predicted quality for the test set
cor(p,wwine_test$quality) # correlation between predicted and actual quality
## [1] 0.5309301
# Scatterplot of predicted vs. actual quality
pdf <- as.data.frame(p)
wwine_test$pred <- pdf$p; head(wwine_test) # add predictions to the test set
library(ggplot2)
# Jitter the points
ggplot(wwine_test, aes(x=quality, y=pred)) + geom_point(shape=1, position=position_jitter(width=1,height=.5))+ geom_smooth()
Association Rule Mining is a technique employed in cases where we have a lot of “items”, each of which occurs infrequently. The typical example is the supermarket, where customers purchase a couple of items (maybe 1 or 2; maybe 100 for the weekly shopping) out of the thousands that the supermarket is selling.
The challenge is to see which items somehow are bought in combination.
This is in fact what you have learned in statistics when analyzing tables (cross-tabulations). A statistic called chi-square tells us whether there is an association between two variables, both measured in classes (nominal or ordinal scales).
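As a refresher, a minimal sketch of such a test in R (the 2x2 table below is made up, purely for illustration):

# Hypothetical cross-tabulation: did the customer buy bread, did the customer buy milk?
tab <- matrix(c(120, 80,   # bread: yes (milk: yes / no)
                 60, 240), # bread: no  (milk: yes / no)
              nrow = 2, byrow = TRUE,
              dimnames = list(bread = c("yes", "no"), milk = c("yes", "no")))
tab
chisq.test(tab) # a small p-value indicates that the two purchases are associated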
In association rule mining, we essentially analyze an enormous number of such 2x2 cross-tabulations, in a smart way.
The output is, however, relatively simple.
You should be able to interpret the rules! Of crucial importance is the correct interpretation of the three measures:
Oh, and don’t forget the meaning of density in the sparse matrix. And what was a sparse matrix again?
## market basket analysis
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
# you can experiment with the data set Groceries
# data("Groceries")
# we made a selection
groc1000 <- read.transactions("buas_mba_assignment.csv",sep=",")
summary(groc1000)
## transactions as itemMatrix in sparse format with
## 1000 rows (elements/itemsets/transactions) and
## 154 columns (items) and a density of 0.03038961
##
## most frequent items:
## whole milk soda rolls/buns other vegetables
## 305 215 213 197
## yogurt (Other)
## 151 3599
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10
## 212 185 150 123 102 78 62 48 40
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 3.00 4.00 4.68 6.00 10.00
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
groc1000_r1 <- apriori(groc1000, parameter = list(support=.025, confidence=.05, minlen=2)) # rules with support >= 2.5%, confidence >= 5%, and at least 2 items
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.025 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 25
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[154 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [66 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(groc1000_r1[1:5]) # first 5 rules
## lhs rhs support confidence coverage
## [1] {curd} => {whole milk} 0.026 0.4642857 0.056
## [2] {whole milk} => {curd} 0.026 0.0852459 0.305
## [3] {brown bread} => {whole milk} 0.031 0.4078947 0.076
## [4] {whole milk} => {brown bread} 0.031 0.1016393 0.305
## [5] {whipped/sour cream} => {other vegetables} 0.027 0.3600000 0.075
## lift count
## [1] 1.522248 26
## [2] 1.522248 26
## [3] 1.337360 31
## [4] 1.337360 31
## [5] 1.827411 27
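As a check on your understanding of the three measures, take rule [1], {curd} => {whole milk}: curd appears in 56 of the 1000 baskets (coverage 0.056) and whole milk in 305 (0.305, the coverage of rule [2]); the two occur together in 26 baskets. Hence support = 26/1000 = 0.026, confidence = 26/56 ≈ 0.464, and lift = 0.464/0.305 ≈ 1.52: customers who buy curd are about 1.5 times as likely to buy whole milk as the average customer.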
kNN stands for k-nearest neighbours: we classify an object based on the classes of its k nearest neighbours.
Pretty cryptic, but elegantly simple.
Suppose a guy walks into your bank and applies for a loan. You can ask for many traits (age, gender, eye color, previous loans, income, …) and use the loan performance of look-alikes to predict your client’s creditworthiness.
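A minimal sketch of this idea with the class package (the applicant data below are made up, purely for illustration):

library(class) # provides knn()

# Hypothetical applicants: age and income (already standardized), with known loan performance
train_x <- data.frame(age    = c(-1.2, -0.8,  0.1,  0.9,  1.3,  0.4),
                      income = c(-0.9, -1.1,  0.2,  1.0,  1.2, -0.3))
train_y <- factor(c("default", "default", "repaid", "repaid", "repaid", "default"))

# New applicant: classified by majority vote among the k = 3 nearest look-alikes
new_x <- data.frame(age = 0.8, income = 0.9)
knn(train = train_x, test = new_x, cl = train_y, k = 3) # predicts "repaid"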
Elegant as it is, we call it a lazy learner. Why?
You have to be able to understand the following concepts:
We looked at model evaluation. The result of the procedure is simple: we use the model to classify the object. Whether this works well or not is tested on our test set of new cases (well, not really new, but cases that were not used to build the model).
For the objects in the test set we can compare the predicted values against the real outcomes.
From a classification table, we can derive:
From a managerial point of view, you have to be able to evaluate these numbers.
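To make these numbers concrete, a minimal sketch with a hypothetical classification table for the loan example (the counts are made up):

# Hypothetical classification table: predicted vs. actual loan performance
TP <- 40; FP <- 10   # predicted "default": 40 actual defaults, 10 false alarms
FN <- 15; TN <- 135  # predicted "repaid":  15 missed defaults, 135 actual repayers

accuracy    <- (TP + TN) / (TP + FP + FN + TN) # 0.875
sensitivity <- TP / (TP + FN)                  # 0.73: share of actual defaults caught
specificity <- TN / (TN + FP)                  # 0.93: share of actual repayers cleared
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)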
False negatives and false positives may carry different weights. In testing for Covid-19, what would be worse: a false negative or a false positive? The same question is relevant for our loan applicant.
The good thing is that we can agree or disagree with your point of view, but there are no hard rules here. You are the manager; you decide!
We won’t give an elaborate example and script, but just focus on the bottom line.
Try to interpret the classification table below.
You are the data analyst of “the Manufacturer”. Which data will you collect, and in which form for your:
How would you check whether a sophisticated time series forecasting method (exponential smoothing, ARIMA) performs better than the elementary ones (average, naïve, seasonal naïve, drift)?
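One way to do this, sketched below with the forecast package: hold out the last part of the series as a test set, fit every method on the training part only, and compare their out-of-sample accuracy. The series AirPassengers is used here merely as a stand-in for your own demand data.

library(forecast)

ts_demand <- AirPassengers # placeholder for your own monthly demand series
train <- window(ts_demand, end = c(1957, 12))
test  <- window(ts_demand, start = c(1958, 1))
h <- length(test)

# Elementary benchmarks
f_mean   <- meanf(train, h = h)             # average
f_naive  <- naive(train, h = h)             # naive
f_snaive <- snaive(train, h = h)            # seasonal naive
f_drift  <- rwf(train, h = h, drift = TRUE) # drift

# Sophisticated methods
f_ets   <- forecast(ets(train), h = h)        # exponential smoothing
f_arima <- forecast(auto.arima(train), h = h) # ARIMA

# Out-of-sample accuracy (RMSE, MAE, MAPE, ...) on the test set
rbind(mean   = accuracy(f_mean, test)["Test set", ],
      naive  = accuracy(f_naive, test)["Test set", ],
      snaive = accuracy(f_snaive, test)["Test set", ],
      drift  = accuracy(f_drift, test)["Test set", ],
      ets    = accuracy(f_ets, test)["Test set", ],
      arima  = accuracy(f_arima, test)["Test set", ])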
Assume you are the data analyst of a manufacturing company.