Monte Carlo Resampling (a.k.a Bootstrapping)

Derek G. Nokes
2015-03-03

Introduction

Bootstrap: “to pull oneself up by one's bootstraps”; to seemingly do the impossible

  • Much of early statistics was about what to do when we didn't have enough data

  • Now we are more likely to have too much data

  • In early statistics we needed computational shortcuts

  • Now we have immense computational power

  • Bootstrap resampling allows us harness computational power to get an infinite amount of data

Load the Data

We load some data

# load the data
colClasses=c('POSIXct','numeric','numeric',
             'numeric','numeric','numeric',
             'numeric')
# load the sales data
path<-"C:/Users/dgn2/Documents/R/IS606/"
salesFile<-"sales.csv"
sales <- read.csv(paste0(path,salesFile),
                  header=TRUE,
                  stringsAsFactors=FALSE,
                  colClasses=colClasses)
# load the price and cost details
detailsFile<-"details.csv"
details <- read.csv(paste0(path,detailsFile),
                    header=TRUE,
                    stringsAsFactors=FALSE)

P&L

# extract the price and cost of each product
turkeyPrice<-details$price[2]
turkeyCost<-details$cost[2]

# define the function to compute revenue, expense, and P&L
pnlUnderScenario<- function(demand,supply,pricePerUnit,
                            costPerUnit){
  expense<-supply*costPerUnit
  unitsSold<-demand
  flag<-supply-demand<0
  unitsSold[flag]<-supply[flag]
  revenue<-unitsSold*pricePerUnit
  pnl<-sum(revenue-expense)
  }

Supply & Demand

Let's take a look at the P&L

# turkey revenue, expense, and P&L
turkeyDemand<-sales[,3]
turkeySupply<-sales[,6]
turkeyExpense<-turkeySupply*turkeyCost
turkeyUnitsSold<-turkeyDemand
turkeyFlag<-turkeySupply-turkeyDemand<0
turkeyUnitsSold[turkeyFlag]<-turkeySupply[
  turkeyFlag]
turkeyRevenue<-turkeyUnitsSold*turkeyPrice
turkeyPnl<-turkeyRevenue-turkeyExpense

Supply & Demand

Let's take a look at the supply & demand:

plot of chunk unnamed-chunk-4

Visual inspection indicates that the demand is typically above the supply

Resample the Data

We sample from the data with replacement

# set the parameters for resampling
dimension<-dim(sales)
nRows<-dimension[1]
nCols<-dimension[2]
nPaths<-1000
# create the data resampling index
resampleIndex<-sample(1:nRows,
                      nRows*nPaths,
                      replace=TRUE,
                      prob=NULL)

# create the resampled data for each column
resampledData<-sales[resampleIndex,]

Reshape the Data

Re-organize the data so that we have nRows by nPath matrix

# reshape the demand data
turkeyDemandPaths<-data.frame(matrix(
  resampledData[,3],nrow=nRows,ncol=nPaths))
# reshape the supply data
turkeySupplyPaths<-data.frame(matrix(
  resampledData[,6],nrow=nRows,ncol=nPaths))

Each row represents a point in time and each column represents an alternate reality consistent with the variability of observed observations

Compute the Statistic we Care About

Say we care about the expected value of supply versus demand.

turkeyCumulativeDemand<-cumsum(
  turkeyDemandPaths)
turkeyCumulativeSupply<-cumsum(
  turkeySupplyPaths)

Resampling gives us a number of alternate realities consistent with the variation in the data

Compute the Statistic we Care About - Continued

From these paths / alternate realities we can determine the distribution of the statistic of interest (in this case, the expected value)

turkeyExpectedDemand<-mean(as.numeric(
  turkeyCumulativeDemand[nRows,]/nRows))
turkeyExpectedSupply<-mean(as.numeric(
  turkeyCumulativeSupply[nRows,]/nRows))

The expected demand for turkey significantly exceeded supply (22.0643615 versus 17.2389231).

Distribution of Turkey Supply & Demand

plot of chunk unnamed-chunk-9

Notice the complete separation between the two distributions

Find the P&L Distribution

We find the distribution of P&L by resampling the original supply and demand.

# find the distribution of P&L using the original supply 
# and demand
turkeyPnLUnderScenarios<-0

for (pathIndex in 1:nPaths){
  turkeyPnLUnderScenarios[pathIndex]<-
    pnlUnderScenario(
    turkeyDemandPaths[,pathIndex],
    turkeySupplyPaths[,pathIndex],turkeyPrice,
    turkeyCost)
  }

P&L Distribution Visualization

plot of chunk unnamed-chunk-11

Optimizing Expected P&L

# maximize turkey P&L for each price scenario
turkeySupplyRange<-min(turkeyDemand):
  max(turkeyDemand)
nScenarios<-length(turkeySupplyRange)
turkeyPnLUnderSupplyScenarios<-matrix(rep(0,
  nPaths*nScenarios),nrow=nPaths,ncol=nScenarios)
for (scenarioIndex in 1:nScenarios){
  turkeySupplyScenario<-rep(turkeySupplyRange[
    scenarioIndex],nRows,nrow=nRows,ncol=1)
  for (pathIndex in 1:nPaths){
    turkeyPnLUnderSupplyScenarios[pathIndex,
      scenarioIndex]<-pnlUnderScenario(
    turkeyDemandPaths[,pathIndex],
      turkeySupplyScenario,
      turkeyPrice,turkeyCost)
  }
}

We can determine the distribution of P&L for a range of supply scenarios, then find the supply that maximizes expected P&L for each product.

Optimizing Expected P&L - Continued

# compute the expected P&L under each scenario
turkeyExpectedPnLUnderScenario<-colMeans(
  turkeyPnLUnderSupplyScenarios)
# compute the percentile bounds
lowerBound<-apply(turkeyPnLUnderSupplyScenarios, 2, 
                  quantile, probs = c(0.05))
upperBound<-apply(turkeyPnLUnderSupplyScenarios, 2, 
                  quantile, probs = c(0.95))
# find the max P&L
turkeyMaxExpectedPnLUnderScenario<-max(
  turkeyExpectedPnLUnderScenario)
# find the max P&L index
turkeyMaxIndex<-turkeyExpectedPnLUnderScenario==
  turkeyMaxExpectedPnLUnderScenario
# find the optimal supply
turkeyOptimalSupply<-turkeySupplyRange[
  turkeyMaxIndex]

Expected P&L By Supply Scenario

The expected P&L of 43.17275 by day for turkey is maximized by supplying 20 units each day.

plot of chunk unnamed-chunk-14

How does this help us?

  • Visualize variation
  • Determine the distribution of any statistic/parameter; helps determine whether a result could be explained just by variation (i.e., significance)
  • Solve path-dependent problems

Most Important Points:

  • Anyone can use this to get the right answer
  • Anyone can understand this
  • Computational power is cheaper than the skills needed to get the right answer the traditional way!
  • Welcome to the era of computation!

References

Efron, Bradley; Tibshirani, Robert J. (1993). An introduction to the bootstrap, New York: Chapman & Hall.

Simon, J. L. The Philosophy and Practice of Resampling Statistics http://www.juliansimon.com/writings/Resampling_Philosophy/

Simon, J. L. (1997): Resampling: The New Statistics http://www.resample.com/intro-text-online/