Introductory Lecture Problem

Seth J. Chandler

August 28, 2014

Introduction

Purpose

The idea of this document is not to scare you – though I accept it could have that result – but rather to show you what you really will be able to do at the end of this course. A major goal of this course is to let you evaluate settlement offers. And not just simple settlement offers, but the more complex scenarios you will see in the real world.

We’re going to use some statistics, decision theory and finance in order to do this, three of the main topics of the course. We’ll also use R very heavily. I do not – repeat DO NOT – expect you to understand 100% or even 90% of what we are doing today. It is almost as if you are in a course in Russian and I speak to you mostly in the sort of Russian you will actually know by the end of the course. Rather, I want to let you know where we want to be at the end of the journey to give you an incentive to go on the trip.

Special Note

If this is totally, completely scary and you could never imagine in a million years being able to do this, perhaps this course is not for you. On the other hand, if it looks doable with some work, I believe passionately you will find this not only one of the most valuable courses for law practice, but also a course you will use frequently in making good everyday life decisions. My hope, however, is that you will not only find this course useful, but sufficiently fun that you want to continue to work in the area. The lawyer of the future will need programming and algorithmic skills in addition to the conventional ones of today. This course will continue your growth in that field.

The Scenario

You are involved in some litigation in Texas and representing the plaintiff. The defendant does not have a great deal of money relative to the size of judgments that might be rendered against them, but they do have liability insurance in the amount of $1,000,000.

You have also done some research on cases similar to this one. So far, cases like this have been brought only in Texas and California. Your assistant stuck the data in a text file that looks like this. For each piece of litigation, you’ve identified the state in which it was brought.

Getting the Data

As is often the case, actually getting the data into R is the hardest part. We see that the first 10 cases are from Texas and the second 10 cases are from California. We can use the rep command and the c command to make a vector of states that will end up being one of the columns of our data.frame. We will name this piece of data state and use “<-” to assign the right hand side to the left hand side.

state<-c(rep("Texas",10),rep("California",10))
state
##  [1] "Texas"      "Texas"      "Texas"      "Texas"      "Texas"     
##  [6] "Texas"      "Texas"      "Texas"      "Texas"      "Texas"     
## [11] "California" "California" "California" "California" "California"
## [16] "California" "California" "California" "California" "California"

The output is simply a vector of characters.

Now we need to input the judgments. The simplest way to do this is to type them in. We put a c around the input to combine the values into a vector.1

judgment<-c(436000,94000,247000,418000,656000,334000,49000,402000,134000,
            128000,1023000,617000,198000,0,0,0,0,39000,1640000,1089000)
judgment
##  [1]  436000   94000  247000  418000  656000  334000   49000  402000
##  [9]  134000  128000 1023000  617000  198000       0       0       0
## [17]       0   39000 1640000 1089000

Moving from data into a data.frame

Now let’s create an R data.frame, perhaps R’s best and most famous data structure. It’s kind of like a spreadsheet. R is kind of a column-oriented language, so we are going to tell the data.frame the name and contents of each column. I then ask R to print out the result.

g<-data.frame(thestate=state,thejudgment=judgment)
print(g)
##      thestate thejudgment
## 1       Texas      436000
## 2       Texas       94000
## 3       Texas      247000
## 4       Texas      418000
## 5       Texas      656000
## 6       Texas      334000
## 7       Texas       49000
## 8       Texas      402000
## 9       Texas      134000
## 10      Texas      128000
## 11 California     1023000
## 12 California      617000
## 13 California      198000
## 14 California           0
## 15 California           0
## 16 California           0
## 17 California           0
## 18 California       39000
## 19 California     1640000
## 20 California     1089000

Fiddling with the data.frame

In order to avoid confusion, I named the columns “thestate” and “thejudgment” so that they would not be confused with the variables state and judgment, which hold data. But, really, I don’t want to have to refer to the columns as “thestate” and “thejudgment.” So, I want to rename the columns. I want to rename them “state” and “judgment.” Here’s the R command to do so. Compare this output with the previous one and you should see the difference.2

colnames(g)<-c("state","judgment")
print(g)
##         state judgment
## 1       Texas   436000
## 2       Texas    94000
## 3       Texas   247000
## 4       Texas   418000
## 5       Texas   656000
## 6       Texas   334000
## 7       Texas    49000
## 8       Texas   402000
## 9       Texas   134000
## 10      Texas   128000
## 11 California  1023000
## 12 California   617000
## 13 California   198000
## 14 California        0
## 15 California        0
## 16 California        0
## 17 California        0
## 18 California    39000
## 19 California  1640000
## 20 California  1089000

Exporting and Importing the data.frame

Exporting

Often it’s helpful if other programs have access to your data. R contains a variety of mechanisms for exporting data.frames. One of those is write.csv, which creates – not surprisingly – a csv (comma separated value) file that is simple and widely used. Here’s the sort of code you would use.

write.csv(g,file="put your file name here")

Importing

Or, perhaps the data is already waiting for us in a csv file and we want to suck it into an R data.frame. Here’s how we do it.

g<-read.csv(file="your file name here",header=TRUE,stringsAsFactors=FALSE)

This command assumes that the first row of the csv file contains the header and that the file is not so huge that we need any strings in the data to be compressed into R “factors.” There are lots of optional arguments to read.csv for more elaborate situations. The documentation of this command is pretty good.
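Here is a minimal round trip, offered as a sketch rather than a recipe. It writes a tiny data.frame to a temporary csv file and reads it back. The names demo, csv_path and demo_again are made up for illustration, and row.names=FALSE is an optional argument added here so that the row numbers are not written out as an extra column.

```r
# A small stand-in data.frame (two rows borrowed from the data above)
demo <- data.frame(state = c("Texas", "California"),
                   judgment = c(436000, 1023000))

# tempfile() gives a throwaway path; substitute your own file name
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE keeps the row numbers out of the file
write.csv(demo, file = csv_path, row.names = FALSE)

# Read the file back; assign the result, or it is only printed
demo_again <- read.csv(file = csv_path, header = TRUE, stringsAsFactors = FALSE)
print(demo_again)
```

If you open the temporary file in a text editor you should see a header line followed by one line per case.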

The judgment distribution

First pass: Summary judgment

I’d like to get a handle on what the judgments look like. This will help me evaluate any settlement offers that come in from the defendant. One way of doing this is just to get a summary of the data.

summary(g)
##         state       judgment      
##  California:10   Min.   :      0  
##  Texas     :10   1st Qu.:  46500  
##                  Median : 222500  
##                  Mean   : 375200  
##                  3rd Qu.: 481250  
##                  Max.   :1640000

The result tells us the minimum, maximum, mean and quartiles of the judgments, along with the number of cases from each state. Kind of handy.

Second pass: Histogram

We might want to visualize the results, however. One way to do this would be with a simple histogram. R has built into it the hist command, which creates a simple histogram of a single vector of data.

hist(g[,"judgment"])

[Figure: histogram of the judgment column of g]

The idea of this code is first to extract the “judgment” column of our data (g). We do this by leaving the position before the comma blank in g[,"judgment"], which selects all the rows of g, and then giving "judgment" to signify which column we want. We then wrap this up in the hist command. The result is a passable depiction of our data.
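To make the indexing concrete, here is a small sketch showing three equivalent ways of pulling a column out of a data.frame. The object g_small is a made-up three-row stand-in for g so that the example runs on its own.

```r
# A three-row stand-in for g (hypothetical name, values from the data above)
g_small <- data.frame(state = c("Texas", "Texas", "California"),
                      judgment = c(436000, 94000, 1023000))

by_bracket <- g_small[, "judgment"]    # blank row index means all rows
by_dollar  <- g_small$judgment         # shorthand for a single column
by_double  <- g_small[["judgment"]]    # list-style extraction

# All three give the same numeric vector
print(by_bracket)
print(identical(by_bracket, by_dollar) && identical(by_dollar, by_double))

# Supplying a row index picks out particular rows instead of all of them
print(g_small[1:2, "judgment"])
```

Which form you use is largely a matter of taste; the bracket form used in this document has the advantage of extending naturally to row selection.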

Third pass: Does California look different?

One of the things we might be curious about as we sit in our Texas law office is whether to include the lawsuits from California as part of our analysis. Maybe California is just different? There’s no perfect way to resolve this issue from the data alone, but it wouldn’t hurt to compare the distribution of Texas judgments with that of the California judgments. Here’s one way we would do it: the tapply command.

tapply(g[,"judgment"],g[,"state"],summary)
## $California
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0  118000  461000  922000 1640000 
## 
## $Texas
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   49000  130000  290000  290000  414000  656000

The idea of tapply is to break the data in the first argument (g[,"judgment"]) up by the values in the second argument (g[,"state"]) and then to apply the function contained in the third argument (summary) to each of the groups created thereby.
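The split-then-apply idea can be spelled out in two explicit steps. The sketch below uses a made-up four-case data set (judgment_demo and state_demo are illustrative names) and the mean function in place of summary, then repeats the computation with split and sapply to show what tapply is doing.

```r
judgment_demo <- c(436000, 94000, 1023000, 0)
state_demo    <- c("Texas", "Texas", "California", "California")

# tapply: split the first argument by the second, apply the third to each group
means_tapply <- tapply(judgment_demo, state_demo, mean)
print(means_tapply)

# The same computation in two steps
groups      <- split(judgment_demo, state_demo)  # a list with one vector per state
means_split <- sapply(groups, mean)              # apply mean to each piece
print(means_split)
```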

Third Pass (again)

I can also use tapply to make some histograms.

tapply(g[,"judgment"],g[,"state"],hist)

[Figures: histograms of the California and Texas judgments, one per state]

## $California
## $breaks
## [1]       0  500000 1000000 1500000 2000000
## 
## $counts
## [1] 6 1 2 1
## 
## $density
## [1] 1.2e-06 2.0e-07 4.0e-07 2.0e-07
## 
## $mids
## [1]  250000  750000 1250000 1750000
## 
## $xname
## [1] "X[[1L]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"
## 
## $Texas
## $breaks
## [1] 0e+00 1e+05 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05
## 
## $counts
## [1] 2 2 1 1 3 0 1
## 
## $density
## [1] 2e-06 2e-06 1e-06 1e-06 3e-06 0e+00 1e-06
## 
## $mids
## [1]  50000 150000 250000 350000 450000 550000 650000
## 
## $xname
## [1] "X[[2L]]"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

What you should notice is that this is not a very good graphic. The x-scales are different, the labeling is ugly, etc. So, while perhaps this would be enough for someone good at data to make some decisions, it isn’t the type of document you want to give to the senior partner or the client. For that, we need something more sophisticated than hist and the built-in R functionality.

Enter ggplot2

Unlike Mathematica, which contains a vast amount of functionality built-in – and has a memory and file footprint to match – R, which is collectively produced, has a fairly small core and depends on external packages developed by third parties to achieve most of its functionality. One of the most important third party packages is ggplot2, developed by Hadley Wickham of Rice University. This is the package that most people use to do any sophisticated graphics. It’s a little difficult to understand at first, but once you get the hang of it you will realize it is extremely powerful and flexible.

This package comes included with RStudio. But all that means is that it is somewhere on your hard drive. To make it available to R’s computational engine, we need the following code. By the way, you might want to push the little Update button on the Packages tab of RStudio to make sure you have the latest version of this and any other packages.

require(ggplot2)

What can ggplot2 do?

There are books written on how to use ggplot2. So, this little document will barely scratch the surface. Rather than teach you the details, I’m just going to show you the kind of code you need to make an attractive document.

ggDatabase<-ggplot(g, aes(judgment, fill = state)) #creates a plot data base
transparentLayer<-geom_density(alpha = 0.2)
print(nicePicture<-ggDatabase + transparentLayer)

[Figure: overlaid density curves of judgment, one per state]

This graphic gives us an interesting picture.3 It suggests to me that judgments in California may be somewhat higher than in Texas. Perhaps we shouldn’t use the California cases much if at all in evaluating settlements.

Evaluating a complex settlement offer

Suppose our defendant makes a somewhat complex settlement offer. The defendant says, “let’s both cap our risk here. I will guarantee you $100,000 (Contingency 1) even if the judgment is less than $100,000, indeed even if the judgment is zero. But I don’t have a lot of cash. So, if the judgment is more than $1 million (Contingency 2), I will pay you $100,000 a year for 12 years, starting one year from now. And if the judgment is between $100,000 and $1 million (Contingency 3), I will pay you 18% of the judgment each year for 7 years, starting 1 year from now.”

To evaluate this offer, we want to bring everything back to present value. We need to transform these cashflows – or annuities – into a sum of money. The insight we need is that money received today can grow. Here’s the value of $x invested for t years at an annual interest rate r, compounded annually, where r is written as a decimal (so 4% is r = 0.04).

\(x (r+1)^t\)

With a little algebra, we can see that this means that if we receive $y in the future, that is the equivalent of the following amount today.

\(y (r+1)^{-t}\)

So, what we need to do is just calculate \(\sum_{t=1}^{12} 100000\,(1+r)^{-t}\) to get the value for Contingency 2 and \(\sum_{t=1}^{7} 0.18\,j\,(1+r)^{-t}\) to get the value for Contingency 3, where j is the judgment.
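The two sums can be computed directly in base R. This is only a sketch: annuity_pv is a hypothetical helper name, and the 4% rate and the $436,000 judgment are picked purely for illustration. The last line cross-checks the sum against the standard closed form for a level annuity, payment times (1 - (1+r)^(-n))/r.

```r
# Present value of a level payment received at the end of years 1..n,
# computed as the sum of payment * (1 + r)^(-t)
annuity_pv <- function(payment, n, r) {
  t <- seq_len(n)
  sum(payment * (1 + r)^(-t))
}

r_demo <- 0.04       # assumed discount rate, for illustration only
j_demo <- 436000     # one of the judgments in the data

print(annuity_pv(100000, 12, r_demo))         # Contingency 2
print(annuity_pv(0.18 * j_demo, 7, r_demo))   # Contingency 3

# Cross-check Contingency 2 against the closed form
print(100000 * (1 - (1 + r_demo)^(-12)) / r_demo)
```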

Enter lifecontingencies

There are many ways we could do this computation, but there is a very nice package in R, lifecontingencies that is designed for just these sorts of problems – and indeed far more complex ones. You should install it on your computer. And we’ll now require it to be present.

require("lifecontingencies")

The command we need to execute is aptly named presentValue. Here’s its usage, as given in the documentation:

presentValue(cashFlows, timeIds, interestRates, probabilities,power=1)

So, here is how we could evaluate a stream of payments of $1 at time 1, $3 at time 2 and $5 at time 3. I will assume a discount rate (interest rate) of 10%.

presentValue(c(1,3,5),c(1,2,3),0.1)
## [1] 7.145
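It is worth checking a result like this by hand at least once. The sketch below discounts each payment separately in base R, without the package; the variable names are made up for illustration.

```r
cash_flows <- c(1, 3, 5)   # payments
times      <- c(1, 2, 3)   # years until each payment arrives
rate       <- 0.1          # 10% discount rate

# Discount each payment back to today, then add them up
discounted <- cash_flows / (1 + rate)^times
print(discounted)

pv_by_hand <- sum(discounted)
print(round(pv_by_hand, 3))
```

The rounded total agrees with the 7.145 that presentValue reported.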

We can now write an R function that calculates the value of Contingency 2 depending on the interest rate. Remember that rep(100000,12) produces 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000, 100000 and seq(1,12) produces 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12.

contingency2<-function (r) presentValue(rep(100000,12),seq(1,12),r)

And we can write an R function that calculates the value of Contingency 3 depending on the judgment and the interest rate.

contingency3<-function (j,r) presentValue(rep(0.18*j,7),seq(1,7),r)

The settlement-transformed judgment

What we now want to do is compute the present value of any potential judgment after having been transformed by the defendant’s settlement offer. To do this, we write a function of judgment (j) and interest rate (r). I will call it, after some thought, settlementTransformedJudgment. A bit long, but expressive.

settlementTransformedJudgment<-function (j,r) {
  ifelse(j<100000,100000,
         ifelse(100000<=j & j<1000000,
                contingency3(j,r),
                contingency2(r)))
}

I make use in the code above of a nested ifelse command. The idea of ifelse is set forth in the R documentation as

ifelse returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE.

And the usage is

ifelse(test, yes, no)
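A toy example may help show how ifelse works element by element and how nesting gives a three-way branch. The vector j_demo and the labels are made up for illustration; the thresholds are the ones in the settlement offer.

```r
j_demo <- c(50000, 250000, 2000000)   # illustrative judgments

# One test, applied to each element in turn
print(ifelse(j_demo < 100000, "below floor", "at or above floor"))

# Putting a second ifelse in the "no" slot gives three branches
branch <- ifelse(j_demo < 100000, "Contingency 1",
                 ifelse(j_demo < 1000000, "Contingency 3", "Contingency 2"))
print(branch)
```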

The settlement transform of our judgments

I now want to find the value, after the settlement transform, of the judgments in our little database. If I take the mean of those values, I might have some sense as to the value of the settlement offer the defendant is making. Let me start by assuming that the appropriate discount rate to use is 4%.

The R function I need is sapply. This function is very much like tapply (and like Map in Mathematica) in that it takes a function as an argument4 and then applies that function to pieces of data. Here I create a little function settlementTransform04 that is the same as settlementTransformedJudgment but that fixes the interest rate at 0.04.

settlementTransform04<- function(j) settlementTransformedJudgment(j,0.04)
print(
  transformed_judgments<-sapply(g[,"judgment"],settlementTransform04)
)
##  [1] 471041 100000 266851 451595 708723 360844 100000 434309 144770 138287
## [11] 938507 666588 213913 100000 100000 100000 100000 100000 938507 938507

I could also have written this code as follows:

print(
  transformed_judgments<-sapply(g[,"judgment"],function(j) settlementTransformedJudgment(j,0.04))
)

Doing so would prevent my namespace from being “polluted” by settlementTransform04. Because I never gave the function I applied a name in the code above, it is known as an “anonymous function.” Just a useful piece of programming vocabulary.
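The named and anonymous styles can be compared side by side on a toy problem. In this sketch, discount_one_year is a hypothetical helper that discounts $100 received one year from now, and the rates are illustrative.

```r
rates <- c(0.02, 0.04, 0.06)   # illustrative discount rates

# Named helper: stays in the workspace after use
discount_one_year <- function(r) 100 / (1 + r)
named_result <- sapply(rates, discount_one_year)

# Anonymous function: defined inline, leaves no name behind
anonymous_result <- sapply(rates, function(r) 100 / (1 + r))

print(named_result)
print(identical(named_result, anonymous_result))
```

The anonymous form is handy for one-off use; a named function is usually the better choice once the same logic is needed in more than one place.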

Evaluating the settlement offer

We are now ready to evaluate the settlement offer. Here’s all the code we need to do it.

mean(transformed_judgments)
## [1] 368622

Or, if we want to restrict ourselves to just the judgments from Texas, here’s what we could do. Notice that I am using g[,"state"]=="Texas", which evaluates to TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, as a way of picking the elements of the transformed_judgments vector that I want.

mean(transformed_judgments[g[,"state"]=="Texas"])
## [1] 317642
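Picking elements with a vector of TRUE and FALSE values is worth seeing on a small scale. The sketch below uses four made-up entries (values, states and is_texas are illustrative names) rather than the full data.

```r
values <- c(471041, 100000, 938507, 666588)          # illustrative values
states <- c("Texas", "Texas", "California", "California")

is_texas <- states == "Texas"   # a logical vector, one entry per case
print(is_texas)

print(values[is_texas])          # keeps the entries where is_texas is TRUE
print(mean(values[is_texas]))    # mean over the Texas entries only
print(mean(values[!is_texas]))   # "!" flips the mask to pick everything else
```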

Bottom line. If we think all the judgments are relevant, the defendant’s offer is worth about $369,000 whereas if only the Texas judgments are relevant, the defendant’s offer is worth about $318,000.

Evaluating the settlement offer as a function of the interest rate

So far, I just evaluated the settlement offer on the assumption that the discount rate was 4%. But what if I want to know how the settlement offer depends on the discount rate? Here’s how I would do it. I would write a function that puts a lot of what we have done together. The function will have as its first argument the judgments in the relevant cases.

settlementValue<-function (relevant_judgments,r) 
  mean(sapply(relevant_judgments,
              function(j) settlementTransformedJudgment(j,r))
       )

I now use this function to generate a data.frame relating the interest rate to the settlement value assuming all the cases are relevant and then assuming that just the Texas cases are relevant.

print(settlementValueAllCases<-
  data.frame("r"=seq(0.01,0.1,0.01),
            "settlement.value"=sapply(seq(0.01,0.1,0.01),function (r) {
            settlementValue(g[,"judgment"],r)})))
##       r settlement.value
## 1  0.01           420003
## 2  0.02           401575
## 3  0.03           384489
## 4  0.04           368622
## 5  0.05           353865
## 6  0.06           340120
## 7  0.07           327298
## 8  0.08           315322
## 9  0.09           304120
## 10 0.10           293628
print(settlementValueJustTexas<-
  data.frame("r"=seq(0.01,0.1,0.01),
            "settlement.value"=sapply(seq(0.01,0.1,0.01),function (r) {
            settlementValue(g[g[,"state"]=="Texas","judgment"],r)})))
##       r settlement.value
## 1  0.01           353651
## 2  0.02           340946
## 3  0.03           328960
## 4  0.04           317642
## 5  0.05           306946
## 6  0.06           296830
## 7  0.07           287255
## 8  0.08           278184
## 9  0.09           269584
## 10 0.10           261425
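As a cross-check, both tables can be rebuilt in base R without the lifecontingencies package. This is a sketch under stated assumptions, not the method used above: annuity_factor and settlement_value_base are hypothetical names, and the nested ifelse is applied to the whole vector of judgments at once instead of going case by case with sapply.

```r
judgment <- c(436000, 94000, 247000, 418000, 656000, 334000, 49000, 402000, 134000,
              128000, 1023000, 617000, 198000, 0, 0, 0, 0, 39000, 1640000, 1089000)
state <- c(rep("Texas", 10), rep("California", 10))

# Present value of 1 per year for n years at rate r (hypothetical helper)
annuity_factor <- function(n, r) sum((1 + r)^(-seq_len(n)))

# Mean settlement-transformed judgment at a single rate r
settlement_value_base <- function(j, r) {
  transformed <- ifelse(j < 100000, 100000,
                 ifelse(j < 1000000, 0.18 * j * annuity_factor(7, r),
                        100000 * annuity_factor(12, r)))
  mean(transformed)
}

rates <- seq(0.01, 0.1, 0.01)
rate_table <- data.frame(
  r         = rates,
  all.cases = sapply(rates, function(r) settlement_value_base(judgment, r)),
  texas     = sapply(rates, function(r) settlement_value_base(judgment[state == "Texas"], r))
)
print(rate_table)
```

The all.cases and texas columns should agree, up to rounding, with the two tables printed above.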

Plotting the settlement value as a function of interest rate

I’d now like to use ggplot to visualize these computations of settlement value.

qplot(r,settlement.value,data=settlementValueAllCases,
      geom=c("point","line"),
      main="Value of settlement offer as a function of the discount rate\nusing all cases")

[Figure: value of settlement offer as a function of the discount rate, using all cases]

Are the Texas cases materially different from the California cases?

I’d now like to get a better handle on whether the inclusion of the California cases makes sense. On the one hand, I’d like more data. On the other hand, I don’t want inapplicable data contaminating my evaluation of a settlement offer. There are a number of ways to make the comparison. One is to ask whether the means of the judgments reached in the two states would be as different as they are as a matter of chance.

For reasons that need not concern you today, a sensible approach to this issue is to compute a statistic known as W. We then ask whether a value of W as large as that actually found would be likely to occur if something known as the “null hypothesis” were true. It’s not that different from asking, if we are observing Munchkins lying around, how likely is it that this would be true if we were really in Kansas (our null hypothesis). In this context, however, the null hypothesis means that whatever distribution generated the California cases and whatever distribution generated the Texas cases are really the same, so that neither is shifted up or down relative to the other.

An R command that can do the necessary computation is wilcox.test. The idea is to consider the relationship between judgment and state, where state is no longer treated as some sort of “character” variable but rather as a variable that can take on several values – what R-lingo terms a “factor variable.” We also add the caution that the data is not paired: we did not try the same case in California and in Texas and see what the judgments were in both cases. The with command can be thought of as simply an instruction to R to use data from g for purposes of understanding what things like “judgment” and “state” refer to in the code that follows.

with(g,wilcox.test(judgment ~ factor(state),paired=FALSE))
## Warning: cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  judgment by factor(state)
## W = 43, p-value = 0.6219
## alternative hypothesis: true location shift is not equal to 0

The result we see is that the p-value is fairly high (considerably greater than conventional thresholds like 0.05 or 0.01). That means the data do not give us grounds to reject the null hypothesis. The warning simply tells us that, because several judgments are tied at zero, R has used an approximation rather than an exact p-value. Even though the Texas and California data look a bit different, that may simply be a matter of chance. So, in evaluating settlements, we should probably keep the California data in.
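The W statistic itself is less mysterious than it sounds. For two groups it is the number of (California, Texas) pairs in which the California judgment is the larger one, with tied pairs counting one half. The sketch below counts those pairs directly and compares the count with what wilcox.test reports. The vector names are made up for illustration, and exact=FALSE is passed here only to request the normal approximation up front rather than trigger the warning about ties.

```r
texas      <- c(436000, 94000, 247000, 418000, 656000, 334000, 49000, 402000, 134000, 128000)
california <- c(1023000, 617000, 198000, 0, 0, 0, 0, 39000, 1640000, 1089000)

# Compare every California judgment with every Texas judgment
california_wins <- sum(outer(california, texas, ">"))
ties            <- sum(outer(california, texas, "=="))

# W counts the pairs where the first group is larger; ties count half
w_by_hand <- california_wins + 0.5 * ties
print(w_by_hand)

# The same statistic as reported by wilcox.test
w_from_r <- wilcox.test(california, texas, paired = FALSE, exact = FALSE)$statistic
print(unname(w_from_r))
```

With 10 cases in each state there are 100 pairs, so a W near 50 is what we would expect if neither state tended to produce larger judgments.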

Risk aversion (advanced)

So far, I have assumed it is appropriate to simply take the mean of the transformed judgments to calculate the value of the settlement offer. But the plaintiff might be risk averse; the plaintiff might be thinking about the worst case rather than the mean. We will learn a lot of ways of taking risk aversion into account in this course. But one way to do so is to think of a plaintiff as behaving as if instead of the actual outcome they always received the worse of two possible outcomes from the distribution of possible judgments. This method of thinking about risk is a variant of an increasingly popular method called “spectral measures”. It’s essentially taking an “order statistic” on a distribution.

This is fairly advanced material, so if you feel you’ve had enough R and enough Analytic Methods for now, don’t feel compelled to read this part of the presentation. On the other hand …

Simulating order statistics in R

Step 1: Creating a matrix of draws

The code below shows how to evaluate a settlement given a plaintiff who behaves as if instead of the actual outcome they always received the worse of two possible outcomes from the distribution of possible judgments. I start by using sapply again. Here, I use it to compute the present value the plaintiff would receive pursuant to the settlement for all of the judgments in our little database. I call the result settlementValueDistribution. I then use R’s built-in sample command to create 10,000 draws from this distribution and place them in a \(5000 \times 2\) matrix. I permit the sampling to proceed “with replacement,” meaning that the fact that one particular value has been drawn from the distribution does not preclude that value from being drawn again. I use the head command to permit you to see just the first part of this large matrix.

settlementValueDistribution<-sapply(g[,"judgment"],function (j) settlementTransformedJudgment(j,0.04))
draws2<-matrix(sample(settlementValueDistribution,10000,replace=TRUE),nrow=5000,ncol=2)
head(draws2)
##        [,1]   [,2]
## [1,] 266851 938507
## [2,] 451595 100000
## [3,] 100000 938507
## [4,] 266851 938507
## [5,] 360844 213913
## [6,] 100000 138287

Step 2: Computing the mean of the minimum of many sets of two draws

I now take each row of the resulting matrix and find the minimum value. To do this, I use the built-in apply function, and specify, by making the second “margin” argument equal to 1, that I want to go by rows instead of by columns, in applying the min function to each row. The min function, you may have guessed, takes a bunch of numbers and finds the smallest. I end by just taking the mean value of these 5,000 minimums.

mean(apply(draws2,1,min))
## [1] 208005

The result is about $208,000. Because the draws are random, the exact figure will vary a little from run to run. For the risk averse plaintiff, the settlement has a much lower value than it did for the risk neutral plaintiff examined before. Of course, that is partly because this strange settlement offer has been structured so it is not risk free. The risk averse plaintiff will likewise place a lower value on going to trial relative to the risk neutral plaintiff.
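Since there are only 20 possible values, the simulation can be checked against an exact calculation. Every ordered pair of draws is equally likely, so the expected value of the worse of two draws is just the average of the pairwise minimum over all 20 x 20 combinations. The sketch below assumes the rounded settlement-transformed values printed earlier (copied into v); the exact answer comes out at roughly $206,000, close to the simulated figure. The seed passed to set.seed is arbitrary and only makes the random draws repeatable.

```r
# Settlement-transformed values at 4%, copied (rounded) from the earlier output
v <- c(471041, 100000, 266851, 451595, 708723, 360844, 100000, 434309, 144770, 138287,
       938507, 666588, 213913, 100000, 100000, 100000, 100000, 100000, 938507, 938507)

# Average the smaller of the two values over every ordered pair
exact_min_of_two <- mean(outer(v, v, pmin))
print(exact_min_of_two)

# A simulation like the one above should land close to the exact figure
set.seed(1)
draws <- matrix(sample(v, 10000, replace = TRUE), nrow = 5000, ncol = 2)
simulated_min_of_two <- mean(apply(draws, 1, min))
print(simulated_min_of_two)
```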

Conclusion

I hope that you have learned a bit about how to evaluate a settlement and a bit about how to program in R. We have barely scratched the surface of either topic. I hope you are inspired, however, to learn more.


  1. You may wonder why I am repeating the judgment and (above) the state in the code. I do this because if you assign the results of a computation to a variable, R’s default behavior is not to print out the result. An alternative way of addressing this behavior is to write print(vv<-2+3). That results in both assignment and printing.

  2. Notice also that in R, at least sometimes, the left hand side of an assignment need not be a symbol like g but can be a function of g, such as colnames(g).

  3. Notice that ggplot has smoothed out the data. That is, our data presented discrete judgments. There was no possibility, for example, that a judgment of $94,002 could be entered. What ggplot has assumed here – and we could definitely tell it not to – was that the judgments we observed were drawn from a continuous distribution. The ggplot function is attempting to approximate that distribution. In Mathematica, by the way, one can achieve similar results by creating a SmoothKernelDistribution.

  4. Programming in which the argument to a function can itself be a function is often called “functional programming.” Functional programming has been at the core of Mathematica for 25 years and also is central to R. Its popularity has grown rapidly over the past 5-10 years, and it is now being included as a feature of popular languages such as Java.