[ source files available on GitHub ]
Libraries needed for data processing and plotting:
library("tm")
library("SnowballC")
library("caTools")
library("rpart")
library("rpart.plot")
library("ROCR")
We will be looking into how to use the text of emails in the inboxes of Enron executives to predict if those emails are relevant to an investigation into the company.
We will be extracting word frequencies from the text of the documents, and then integrating those frequencies into predictive models.
We are going to talk about predictive coding – an emerging use of text analytics in the area of criminal justice.
The case we will consider concerns Enron, a US energy company based in Houston, Texas, that was involved in a number of electricity production and distribution markets and that collapsed in the early 2000s after widespread accounting fraud was exposed. To date Enron remains a stunning symbol of corporate corruption.
While Enron’s collapse stemmed largely from accounting fraud, the firm also faced sanctions for its involvement in the California electricity crisis.
In 2000 and 2001, California experienced a number of power blackouts, despite having sufficient generating capacity.
It later surfaced that Enron played a key role in this energy crisis by artificially reducing power supply to spike prices and then making a profit from this market instability.
The Federal Energy Regulatory Commission, or FERC, investigated Enron’s involvement in the crisis, and its investigation eventually led to a $1.52 billion settlement.
FERC’s investigation into Enron will be the topic of today’s recitation.
Enron was a huge company, and its corporate servers contained millions of emails and other electronic files. Sifting through these documents to find the ones relevant to an investigation is no simple task.
In law, this electronic document retrieval process is called the eDiscovery problem, and relevant files are called responsive documents.
Traditionally, the eDiscovery problem has been solved by keyword search (in our case, perhaps searching for phrases like “electricity bid” or “energy schedule”), followed by an expensive and time-consuming manual review process in which attorneys read through thousands of documents to determine which ones are responsive.
Predictive coding is a new technique in which attorneys manually label some documents and then use text analytics models trained on the manually labeled documents to predict which of the remaining documents are responsive.
As part of its investigation, FERC released hundreds of thousands of emails from top executives at Enron, creating the largest publicly available set of emails to date.
We will use this data set, called the Enron Corpus, to perform predictive coding in this recitation.
The data set contains just two fields: email, the text of the email in question, and responsive, a binary (0/1) indicator of whether the email is relevant to the FERC investigation.
The labels for these emails were made by attorneys as part of the 2010 Text Retrieval Conference (TREC) Legal Track, a predictive coding competition.
emails <- read.csv("data/energy_bids.csv.gz", stringsAsFactors = FALSE)
str(emails)
## 'data.frame': 855 obs. of 2 variables:
## $ email : chr "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
## $ responsive: int 0 1 0 1 0 0 1 0 0 0 ...
Let’s look at a few examples, using the strwrap() function for easier-to-read formatting:
strwrap(emails$email[1])
## [1] "North America's integrated electricity market requires cooperation on environmental"
## [2] "policies Commission for Environmental Cooperation releases working paper on North"
## [3] "America's electricity market Montreal, 27 November 2001 -- The North American Commission"
## [4] "for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend"
## [5] "towards increasing trade, competition and cross-border investment in electricity between"
## [6] "Canada, Mexico and the United States. It is hoped that the working paper, Environmental"
## [7] "Challenges and Opportunities in the Evolving North American Electricity Market, will"
## [8] "stimulate public discussion around a CEC symposium of the same title about the need to"
## [9] "coordinate environmental policies trinationally as a North America-wide electricity"
## [10] "market develops. The CEC symposium will take place in San Diego on 29-30 November, and"
## [11] "will bring together leading experts from industry, academia, NGOs and the governments of"
## [12] "Canada, Mexico and the United States to consider the impact of the evolving continental"
## [13] "electricity market on human health and the environment. \"Our goal [with the working paper"
## [14] "and the symposium] is to highlight key environmental issues that must be addressed as the"
## [15] "electricity markets in North America become more and more integrated,\" said Janine"
## [16] "Ferretti, executive director of the CEC. \"We want to stimulate discussion around the"
## [17] "important policy questions being raised so that countries can cooperate in their approach"
## [18] "to energy and the environment.\" The CEC, an international organization created under an"
## [19] "environmental side agreement to NAFTA known as the North American Agreement on"
## [20] "Environmental Cooperation, was established to address regional environmental concerns,"
## [21] "help prevent potential trade and environmental conflicts, and promote the effective"
## [22] "enforcement of environmental law. The CEC Secretariat believes that greater North"
## [23] "American cooperation on environmental policies regarding the continental electricity"
## [24] "market is necessary to: * protect air quality and mitigate climate change, * minimize the"
## [25] "possibility of environment-based trade disputes, * ensure a dependable supply of"
## [26] "reasonably priced electricity across North America * avoid creation of pollution havens,"
## [27] "and * ensure local and national environmental measures remain effective. The Changing"
## [28] "Market The working paper profiles the rapid changing North American electricity market."
## [29] "For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of"
## [30] "electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9"
## [31] "thousand GWh of electricity. \"Over the past few decades, the North American electricity"
## [32] "market has developed into a complex array of cross-border transactions and"
## [33] "relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's"
## [34] "Electricity Advisory Board. \"We need to achieve this new level of cooperation in our"
## [35] "environmental approaches as well.\" The Environmental Profile of the Electricity Sector"
## [36] "The electricity sector is the single largest source of nationally reported toxins in the"
## [37] "United States and Canada and a large source in Mexico. In the US, the electricity sector"
## [38] "emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2"
## [39] "emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions."
## [40] "These emissions have a large impact on airsheds, watersheds and migratory species"
## [41] "corridors that are often shared between the three North American countries. \"We want to"
## [42] "discuss the possible outcomes from greater efforts to coordinate federal, state or"
## [43] "provincial environmental laws and policies that relate to the electricity sector,\" said"
## [44] "Ferretti. \"How can we develop more compatible environmental approaches to help make"
## [45] "domestic environmental policies more effective?\" The Effects of an Integrated Electricity"
## [46] "Market One key issue raised in the paper is the effect of market integration on the"
## [47] "competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice"
## [48] "largely determines environmental impacts from a specific facility, along with pollution"
## [49] "control technologies, performance standards and regulations. The paper highlights other"
## [50] "impacts of a highly competitive market as well. For example, concerns about so called"
## [51] "\"pollution havens\" arise when significant differences in environmental laws or"
## [52] "enforcement practices induce power companies to locate their operations in jurisdictions"
## [53] "with lower standards. \"The CEC Secretariat is exploring what additional environmental"
## [54] "policies will work in this restructured market and how these policies can be adapted to"
## [55] "ensure that they enhance competitiveness and benefit the entire region,\" said Sharp."
## [56] "Because trade rules and policy measures directly influence the variables that drive a"
## [57] "successfully integrated North American electricity market, the working paper also"
## [58] "addresses fuel choice, technology, pollution control strategies and subsidies. The CEC"
## [59] "will use the information gathered during the discussion period to develop a final report"
## [60] "that will be submitted to the Council in early 2002. For more information or to view the"
## [61] "live video webcast of the symposium, please go to: http://www.cec.org/electricity. You"
## [62] "may download the working paper and other supporting documents from:"
## [63] "http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english."
## [64] "Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal"
## [65] "(Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org"
## [66] "***********"
We can see just by parsing through the first couple of lines that this is an email about a new working paper, “The Environmental Challenges and Opportunities in the Evolving North American Electricity Market”, released by the Commission for Environmental Cooperation, or CEC.
While this certainly deals with electricity markets, it doesn’t have to do with energy schedules or bids, hence it is not responsive to our query. If we look at the value of the responsive variable for this email:
emails$responsive[1]
## [1] 0
we see that its value is 0, as expected.
Let’s check the second email:
strwrap(emails$email[2])
## [1] "FYI -----Original Message----- From: \"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON"
## [2] "[mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com]"
## [3] "Sent: Thursday, June 28, 2001 3:40 PM To: Silvia Woodard; Paul Runci; Katrin Thomas; John"
## [4] "A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan"
## [5] "Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin"
## [6] "Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert"
## [7] "Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R."
## [8] "Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James"
## [9] "Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips;"
## [10] "Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze;"
## [11] "Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason;"
## [12] "Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger;"
## [13] "Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl;"
## [14] "Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John"
## [15] "Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman;"
## [16] "Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale;"
## [17] "William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck;"
## [18] "Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges;"
## [19] "Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph"
## [20] "Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A."
## [21] "Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott"
## [22] "Subject: Energy Deregulation - California State Auditor Report Attached is my report"
## [23] "prepared on behalf of the California State Auditor. I look forward to seeing you at The"
## [24] "Aspen Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC -"
## [25] "ca report new.pdf ***********"
The original message is actually very short; it just says “FYI”, and most of the email is a forwarded message. We have the long list of recipients, and down at the very bottom is the message itself: “Attached is my report prepared on behalf of the California State Auditor.” There is also an attached report.
Our data set contains just the text of the emails and not the text of the attachments. It turns out, as we might expect, that this attachment had to do with Enron’s electricity bids in California, and therefore this email is responsive to our query.
We can check this in the value of the responsive variable:
emails$responsive[2]
## [1] 1
We see that it is indeed 1.
Let’s look at the breakdown of the number of emails that are responsive to our query.
table(emails$responsive)
##
## 0 1
## 716 139
We see that the data set is unbalanced, with a relatively small proportion of emails responsive to the query. This is typical in predictive coding problems.
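As a quick check, we can also look at the class proportions directly; prop.table() is base R, so this is just an optional convenience:
prop.table(table(emails$responsive))
This shows that roughly 16% of the emails (139 of 855) are responsive.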
We will need to convert our emails to a corpus for pre-processing. Various functions in the tm package can be used to create a corpus in many different ways. We will create it from the email column of our data frame using two functions, Corpus() and VectorSource(), feeding to the latter the email variable of the emails data frame.
corpus <- Corpus(VectorSource(emails$email))
Let’s take a look at corpus:
corpus
## <<VCorpus (documents: 855, metadata (corpus/indexed): 0/0)>>
We use the tm_map() function, which takes as arguments the corpus and a transformation we want to apply to it. First we transform all text to lower case, convert the documents back to the PlainTextDocument type (transformations such as tolower return plain character vectors), and remove punctuation:
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
Stop words can be removed by passing the removeWords transformation to tm_map(), together with an extra argument specifying which words to remove; here we simply use the list of English stop words provided by the tm package.
We will remove all of these English stop words, since they appear in almost every document and are unlikely to be useful in our prediction problem.
corpus <- tm_map(corpus, removeWords, stopwords("english"))
Lastly, we want to stem our documents with the stemDocument transformation.
corpus <- tm_map(corpus, stemDocument)
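As a quick illustration of what stemming does, stemDocument relies on the SnowballC stemmer, which we can call directly through wordStem():
# All four forms reduce to the common stem "argu"
wordStem(c("argue", "argued", "argues", "arguing"))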
Now that we have gone through those four preprocessing steps, we can take a second look at the first email in the corpus.
strwrap(corpus[[1]])
## [1] "north america integr electr market requir cooper environment polici commiss environment"
## [2] "cooper releas work paper north america electr market montreal 27 novemb 2001 north"
## [3] "american commiss environment cooper cec releas work paper highlight trend toward increas"
## [4] "trade competit crossbord invest electr canada mexico unit state hope work paper"
## [5] "environment challeng opportun evolv north american electr market will stimul public"
## [6] "discuss around cec symposium titl need coordin environment polici trinat north americawid"
## [7] "electr market develop cec symposium will take place san diego 2930 novemb will bring"
## [8] "togeth lead expert industri academia ngos govern canada mexico unit state consid impact"
## [9] "evolv continent electr market human health environ goal work paper symposium highlight"
## [10] "key environment issu must address electr market north america becom integr said janin"
## [11] "ferretti execut director cec want stimul discuss around import polici question rais"
## [12] "countri can cooper approach energi environ cec intern organ creat environment side"
## [13] "agreement nafta known north american agreement environment cooper establish address"
## [14] "region environment concern help prevent potenti trade environment conflict promot effect"
## [15] "enforc environment law cec secretariat believ greater north american cooper environment"
## [16] "polici regard continent electr market necessari protect air qualiti mitig climat chang"
## [17] "minim possibl environmentbas trade disput ensur depend suppli reason price electr across"
## [18] "north america avoid creation pollut haven ensur local nation environment measur remain"
## [19] "effect chang market work paper profil rapid chang north american electr market exampl"
## [20] "2001 us project export 131 thousand gigawatthour gwh electr canada mexico 2007 number"
## [21] "project grow 169 thousand gwh electr past decad north american electr market develop"
## [22] "complex array crossbord transact relationship said phil sharp former us congressman"
## [23] "chairman cec electr advisori board need achiev new level cooper environment approach well"
## [24] "environment profil electr sector electr sector singl largest sourc nation report toxin"
## [25] "unit state canada larg sourc mexico us electr sector emit approxim 25 percent nox emiss"
## [26] "rough 35 percent co2 emiss 25 percent mercuri emiss almost 70 percent so2 emiss emiss"
## [27] "larg impact airsh watersh migratori speci corridor often share three north american"
## [28] "countri want discuss possibl outcom greater effort coordin feder state provinci"
## [29] "environment law polici relat electr sector said ferretti can develop compat environment"
## [30] "approach help make domest environment polici effect effect integr electr market one key"
## [31] "issu rais paper effect market integr competit particular fuel coal natur gas renew fuel"
## [32] "choic larg determin environment impact specif facil along pollut control technolog"
## [33] "perform standard regul paper highlight impact high competit market well exampl concern"
## [34] "call pollut haven aris signific differ environment law enforc practic induc power compani"
## [35] "locat oper jurisdict lower standard cec secretariat explor addit environment polici will"
## [36] "work restructur market polici can adapt ensur enhanc competit benefit entir region said"
## [37] "sharp trade rule polici measur direct influenc variabl drive success integr north"
## [38] "american electr market work paper also address fuel choic technolog pollut control"
## [39] "strategi subsidi cec will use inform gather discuss period develop final report will"
## [40] "submit council earli 2002 inform view live video webcast symposium pleas go"
## [41] "httpwwwcecorgelectr may download work paper support document"
## [42] "httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss"
## [43] "environment cooper 393 rue stjacqu ouest bureau 200 montréal québec canada h2i 1n9 tel"
## [44] "514 3504300 fax 514 3504314 email infoccemtlorg"
It looks quite a bit different now. It is a lot harder to read now that we have removed the stop words and punctuation and stemmed the words, but the emails in this corpus are ready for our machine learning algorithms.
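Since these same transformations recur in most text analytics tasks, they could optionally be wrapped in a small helper function. This is just a sketch; clean_corpus() is our own name and not part of the tm package:
clean_corpus <- function(text) {
  # Build a corpus from a character vector and apply the same steps as above
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, tolower)
  corpus <- tm_map(corpus, PlainTextDocument)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stemDocument)
  corpus
}
Calling clean_corpus(emails$email) would reproduce the corpus we just built.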
We are now ready to extract the word frequencies to be used in our prediction problem. The tm package provides a function called DocumentTermMatrix() that generates a matrix in which each row corresponds to a document (here, an email), each column corresponds to a term (word), and each value is the number of times that word appears in that document.
DTM <- DocumentTermMatrix(corpus)
DTM
## <<DocumentTermMatrix (documents: 855, terms: 21735)>>
## Non-/sparse entries: 102511/18480914
## Sparsity : 99%
## Maximal term length: 113
## Weighting : term frequency (tf)
What we can see is that, even though we have only 855 emails in the corpus, we have 21,735 terms that showed up at least once, which is clearly too many variables for the number of observations we have.
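Before pruning the matrix, we could peek at which terms are common; findFreqTerms() from the tm package lists terms appearing at least a given number of times across the corpus (the threshold of 200 below is arbitrary, and the output is not shown here):
findFreqTerms(DTM, lowfreq = 200)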
So we want to remove the terms that don’t appear very often. The removeSparseTerms() function does exactly this: with a sparsity threshold of 0.97, only terms appearing in roughly 3% or more of the documents are kept.
sparse_DTM <- removeSparseTerms(DTM, 0.97)
Now we can take a look at the summary statistics for the document-term matrix:
sparse_DTM
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51644/622096
## Sparsity : 92%
## Maximal term length: 19
## Weighting : term frequency (tf)
We can see that we have decreased the number of terms to 788, which is a much more reasonable number.
Let’s convert the sparse matrix into a data frame that we will be able to use for our predictive models.
labeledTerms <- as.data.frame(as.matrix(sparse_DTM))
To make all variable names R-friendly use:
colnames(labeledTerms) <- make.names(colnames(labeledTerms))
We also have to add back in the outcome variable:
labeledTerms$responsive <- emails$responsive
# str(labeledTerms)
The data frame contains an awful lot of variables, 789 in total: 788 are the frequencies of various words in the emails, and the last one is responsive, the outcome variable.
Lastly, let’s split our data into a training set and a testing set, putting 70% of the data in the training set.
set.seed(144)
split <- sample.split(labeledTerms$responsive, SplitRatio = 0.7)
train <- subset(labeledTerms, split == TRUE)
test <- subset(labeledTerms, split == FALSE)
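Because sample.split() preserves the ratio of the outcome labels in both sets, the proportion of responsive emails should be roughly the same in the training and testing sets; an optional check:
prop.table(table(train$responsive))
prop.table(table(test$responsive))
Both should be close to the 16% responsive rate we saw in the full data set.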
Now we are ready to build the model, and we will build a simple CART model using the default parameters. A random forest would be another good choice from our toolset (a sketch is given after the tree plot below).
emailCART <- rpart(responsive ~ . , data = train, method = "class")
prp(emailCART)
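As mentioned above, a random forest is another reasonable model for this task. A minimal sketch, assuming the randomForest package is installed (it is not among the libraries loaded at the top of this post), might look like this:
library("randomForest")
set.seed(144)
# Treat responsive as a factor so randomForest does classification rather than regression
emailRF <- randomForest(as.factor(responsive) ~ . , data = train)
# Predicted probability of the responsive class on the test set
predictRF <- predict(emailRF, newdata = test, type = "prob")[ , 2]
We will stick with the CART model for the rest of the recitation, since its tree is easy to interpret.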
We see at the very top of the tree the word californ: if it appears at least twice in an email, we take the right path and predict that the document is responsive. Further down the tree we see splits on other terms such as system, demand, bid, and gas. There is also jeff, which is perhaps a reference to Enron’s CEO, Jeff Skilling, who ended up actually being jailed for his involvement in the fraud at the company.
Now that we have trained a model, we need to evaluate it on the test set.
We build an object predictCART that has the predicted probabilities for each class from our CART model, by using the predict() function on the model emailCART and the test data with newdata = test.
predictCART <- predict(emailCART, newdata = test)
This new object gives us the predicted probabilities on the test set. We can look at the first 10 rows with
predictCART[1:10, ]
## 0 1
## character(0) 0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000
In our case we are interested in the predicted probability of the document being responsive, and it is convenient to handle that as a separate variable.
predictCART.prob <- predictCART[ , 2]
This new object contains our test set predicted probabilities.
We are interested in the accuracy of our model on the test set, i.e. out-of-sample. First we compute the confusion matrix:
cmat_CART <- table(test$responsive, predictCART.prob >= 0.5)
cmat_CART
##
## FALSE TRUE
## 0 195 20
## 1 17 25
accu_CART <- (cmat_CART[1,1] + cmat_CART[2,2])/sum(cmat_CART)
The overall accuracy of the CART model is then (195 + 25)/257 ≈ 0.856.
Let’s compare this to a simple baseline model that always predicts non-responsive (i.e. the most common value of the dependent variable).
To compute the accuracy of the baseline model, let’s make a table of just the outcome variable responsive:
cmat_baseline <- table(test$responsive)
cmat_baseline
##
## 0 1
## 215 42
accu_baseline <- max(cmat_baseline)/sum(cmat_baseline)
The accuracy of the baseline model is then 0.8366.
We see just a small improvement in accuracy using the CART model, which is a common case in unbalanced data sets.
However, as in most document retrieval applications, there are uneven costs for different types of errors here.
Typically, a human will still have to manually review all of the documents predicted to be responsive to make sure they are actually responsive. Therefore, a false positive (a non-responsive document flagged as responsive) costs only some additional review time, while a false negative (a responsive document that the model misses) is never reviewed at all and could compromise the investigation.
For this reason, we are going to assign a higher cost to false negatives than to false positives, which makes this a good time to look at other cut-offs on our ROC curve.
Let’s look at the ROC curve so we can understand the performance of our model at different cutoffs.
To plot the ROC curve we use the performance() function to extract the true positive rate and false positive rate.
predROCR <- prediction(predictCART.prob, test$responsive)
perfROCR <- performance(predROCR, "tpr", "fpr")
We then plot the ROC curve, with the option that color-codes the different cutoff thresholds.
plot(perfROCR, colorize = TRUE, lwd = 4)
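To make the threshold values easier to read off the curve, the plot can also be annotated with cutoff labels; print.cutoffs.at and text.adj are standard options of ROCR’s plot method:
plot(perfROCR, colorize = TRUE, lwd = 4, print.cutoffs.at = seq(0, 1, by = 0.1), text.adj = c(-0.2, 1.7))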
The best cutoff to select depends entirely on the costs assigned by the decision maker to false positives and false negatives.
However, we do favor cutoffs that give us a high sensitivity, i.e. we want to identify a large number of the responsive documents.
Therefore a choice that might look promising is in the part of the curve where it becomes flatter (going towards the right), where we have a true positive rate of around 70% (meaning that we capture about 70% of all the responsive documents) and a false positive rate of about 20% (meaning that we mistakenly identify as responsive about 20% of the non-responsive documents).
Since, typically, the vast majority of documents are non-responsive, operating at this cutoff would result in a large decrease in the amount of manual effort needed in the eDiscovery process.
From the blue color of the plot at this particular location we can infer that we are looking at a threshold of around 0.15, significantly lower than 0.5, which is what we would expect since we prefer false positives over false negatives.
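To see what operating at such a threshold would mean in terms of raw counts, we could recompute the confusion matrix at, say, a 0.15 cutoff (the exact numbers depend on the fitted tree, so none are shown here):
table(test$responsive, predictCART.prob >= 0.15)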
auc_CART <- as.numeric(performance(predROCR, "auc")@y.values)
The AUC of the CART model is 0.7936, which means that it can distinguish between a randomly selected responsive document and a randomly selected non-responsive document about 79.4% of the time.