Course: Analytics Edge Unit 5 Recitation

Enron was an energy company based in Houston that was involved in a number of electricity production and distribution markets. While Enron’s collapse stemmed largely from accounting fraud, the firm also faced sanctions for its involvement in the California electricity crisis. In 2000 and 2001, California experienced a number of power blackouts despite having sufficient generating capacity. It later surfaced that Enron played a key role in this energy crisis by artificially reducing power supply to spike prices and then profiting from the resulting market instability. The Federal Energy Regulatory Commission, or FERC, investigated Enron’s involvement in the crisis.

Now, Enron was a huge company, and its corporate servers contained millions of emails and other electronic files. Sifting through these documents to find the ones relevant to an investigation is no simple task. In law, this electronic document retrieval process is called the eDiscovery problem, and relevant files are called responsive documents.

Traditionally, the eDiscovery problem has been solved by using keyword search (in our case, perhaps, searching for phrases like “electricity bid” or “energy schedule”), followed by an expensive and time-consuming manual review process in which attorneys read through thousands of documents to determine which ones are responsive. Predictive coding, however, is a newer technique in which attorneys manually label some documents and then use text analytics models trained on those labeled documents to predict which of the remaining documents are responsive.
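
As a rough illustration of the keyword-search step, here is a minimal base R sketch. The four documents are made up for this example and are not taken from the Enron Corpus; only the two keyword phrases come from the text above.

```r
# Toy documents (hypothetical, not from the Enron Corpus)
docs <- c(
  "Please review the electricity bid for tomorrow",
  "Lunch on Friday?",
  "Updated energy schedule attached",
  "Bidding on electricity starts at noon"
)
keywords <- c("electricity bid", "energy schedule")

# Build one case-insensitive pattern that matches any of the keyword phrases
pattern <- paste(keywords, collapse = "|")
keyword_hit <- grepl(pattern, docs, ignore.case = TRUE)
keyword_hit
```

In this toy example the last document is about electricity bidding but does not contain either exact phrase, so keyword search misses it. That kind of gap is one reason a manual review, or a trained model, is needed on top of keyword matching.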

Now, as part of its investigation, the FERC released hundreds of thousands of emails from top executives at Enron, creating the largest publicly available set of emails today. We will use this data set, called the Enron Corpus, to perform predictive coding in this recitation.

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_5_Text_analytics")
emails <- read.csv("energy_bids.csv", stringsAsFactors = FALSE)
str(emails)
## 'data.frame':    855 obs. of  2 variables:
##  $ email     : chr  "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Coope"| __truncated__ "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Felic"| __truncated__ "14:13:53 Synchronizing Mailbox 'Kean, Steven J.' 14:13:53 Synchronizing Hierarchy 14:13:53 Synchronizing Favorites 14:13:53 Syn"| __truncated__ "^ ----- Forwarded by Steven J Kean/NA/Enron on 03/02/2001 12:27 PM ----- Suzanne_Nimocks@mckinsey.com Sent by: Carol_Benter@mck"| __truncated__ ...
##  $ responsive: int  0 1 0 1 0 0 1 0 0 0 ...
emails$email[1]
## [1] "North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. \"Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated,\" said Janine Ferretti, executive director of the CEC. \"We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment.\" The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. \"Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships,\" said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. \"We need to achieve this new level of cooperation in our environmental approaches as well.\" The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. \"We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector,\" said Ferretti. 
\"How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?\" The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called \"pollution havens\" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. \"The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region,\" said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********"
emails$responsive[1]
## [1] 0
emails$email[2]
## [1] "FYI -----Original Message----- From: \t\"Ginny Feliciano\" <gfeliciano@earthlink.net>@ENRON [mailto:IMCEANOTES-+22Ginny+20Feliciano+22+20+3Cgfeliciano+40earthlink+2Enet+3E+40ENRON@ENRON.com] Sent:\tThursday, June 28, 2001 3:40 PM To:\tSilvia Woodard; Paul Runci; Katrin Thomas; John A. Riggs; Kurt E. Yeager; Gregg Ward; Philip K. Verleger; Admiral Richard H. Truly; Susan Tomasky; Tsutomu Toichi; Susan F. Tierney; John A. Strom; Gerald M. Stokes; Kevin Stoffer; Edward M. Stern; Irwin M. Stelzer; Hoff Stauffer; Steven R. Spencer; Robert Smart; Bernie Schroeder; George A. Schreiber, Jr.; Robert N. Schock; James R. Schlesinger; Roger W. Sant; John W. Rowe; James E. Rogers; John F. Riordan; James Ragland; Frank J. Puzio; Tony Prophet; Robert Priddle; Michael Price; John B. Phillips; Robert Perciasepe; D. Louis Peoples; Robert Nordhaus; Walker Nolan; William A. Nitze; Kazutoshi Muramatsu; Ernest J. Moniz; Nancy C. Mohn; Callum McCarthy; Thomas R. Mason; Edward P. Martin; Jan W. Mares; James K. Malernee; S. David Freeman; Edwin Lupberger; Amory B. Lovins; Lynn LeMaster; Hoesung Lee; Lay, Kenneth; Lester Lave; Wilfrid L. Kohl; Soo Kyung Kim; Melanie Kenderdine; Paul L. Joskow; Ira H. Jolles; Frederick E. John; John Jimison; William W. Hogan; Robert A. Hefner, III; James K. Gray; Craig G. Goodman; Charles F. Goff, Jr.; Jerry D. Geist; Fritz Gautschi; Larry G. Garberding; Roger Gale; William Fulkerson; Stephen E. Frank; George Frampton; Juan Eibenschutz; Theodore R. Eck; Congressman John Dingell; Brian N. Dickie; William E. Dickenson; Etienne Deffarges; Wilfried Czernie; Loren C. Cox; Anne Cleary; Bernard H. Cherry; Red Cavaney; Ralph Cavanagh; Thomas R. Casten; Peter Bradford; Peter D. Blair; Ellen Berman; Roger A. Berliner; Michael L. Beatty; Vicky A. Bailey; Merribel S. Ayres; Catherine G. Abbott Subject:\tEnergy Deregulation - California State Auditor Report Attached is my report prepared on behalf of the  California State Auditor. 
I look forward to seeing you at The Aspen  Institute Energy Policy Forum. Charles J. Cicchetti Pacific Economics Group, LLC - ca report new.pdf ***********"
emails$responsive[2]
## [1] 1
table(emails$responsive)
## 
##   0   1 
## 716 139

Create corpus

Note that the corpus is built from the email text only, so the dependent variable is excluded. That’s why we have to add it back later.

library(tm)
## Loading required package: NLP
corpus <- Corpus(VectorSource(emails$email))
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## North America's integrated electricity market requires cooperation on environmental policies Commission for Environmental Cooperation releases working paper on North America's electricity market Montreal, 27 November 2001 -- The North American Commission for Environmental Cooperation (CEC) is releasing a working paper highlighting the trend towards increasing trade, competition and cross-border investment in electricity between Canada, Mexico and the United States. It is hoped that the working paper, Environmental Challenges and Opportunities in the Evolving North American Electricity Market, will stimulate public discussion around a CEC symposium of the same title about the need to coordinate environmental policies trinationally as a North America-wide electricity market develops. The CEC symposium will take place in San Diego on 29-30 November, and will bring together leading experts from industry, academia, NGOs and the governments of Canada, Mexico and the United States to consider the impact of the evolving continental electricity market on human health and the environment. "Our goal [with the working paper and the symposium] is to highlight key environmental issues that must be addressed as the electricity markets in North America become more and more integrated," said Janine Ferretti, executive director of the CEC. "We want to stimulate discussion around the important policy questions being raised so that countries can cooperate in their approach to energy and the environment." The CEC, an international organization created under an environmental side agreement to NAFTA known as the North American Agreement on Environmental Cooperation, was established to address regional environmental concerns, help prevent potential trade and environmental conflicts, and promote the effective enforcement of environmental law. 
The CEC Secretariat believes that greater North American cooperation on environmental policies regarding the continental electricity market is necessary to: *  protect air quality and mitigate climate change, *  minimize the possibility of environment-based trade disputes, *  ensure a dependable supply of reasonably priced electricity across North America *  avoid creation of pollution havens, and *  ensure local and national environmental measures remain effective. The Changing Market The working paper profiles the rapid changing North American electricity market. For example, in 2001, the US is projected to export 13.1 thousand gigawatt-hours (GWh) of electricity to Canada and Mexico. By 2007, this number is projected to grow to 16.9 thousand GWh of electricity. "Over the past few decades, the North American electricity market has developed into a complex array of cross-border transactions and relationships," said Phil Sharp, former US congressman and chairman of the CEC's Electricity Advisory Board. "We need to achieve this new level of cooperation in our environmental approaches as well." The Environmental Profile of the Electricity Sector The electricity sector is the single largest source of nationally reported toxins in the United States and Canada and a large source in Mexico. In the US, the electricity sector emits approximately 25 percent of all NOx emissions, roughly 35 percent of all CO2 emissions, 25 percent of all mercury emissions and almost 70 percent of SO2 emissions. These emissions have a large impact on airsheds, watersheds and migratory species corridors that are often shared between the three North American countries. "We want to discuss the possible outcomes from greater efforts to coordinate federal, state or provincial environmental laws and policies that relate to the electricity sector," said Ferretti. "How can we develop more compatible environmental approaches to help make domestic environmental policies more effective?" 
The Effects of an Integrated Electricity Market One key issue raised in the paper is the effect of market integration on the competitiveness of particular fuels such as coal, natural gas or renewables. Fuel choice largely determines environmental impacts from a specific facility, along with pollution control technologies, performance standards and regulations. The paper highlights other impacts of a highly competitive market as well. For example, concerns about so called "pollution havens" arise when significant differences in environmental laws or enforcement practices induce power companies to locate their operations in jurisdictions with lower standards. "The CEC Secretariat is exploring what additional environmental policies will work in this restructured market and how these policies can be adapted to ensure that they enhance competitiveness and benefit the entire region," said Sharp. Because trade rules and policy measures directly influence the variables that drive a successfully integrated North American electricity market, the working paper also addresses fuel choice, technology, pollution control strategies and subsidies. The CEC will use the information gathered during the discussion period to develop a final report that will be submitted to the Council in early 2002. For more information or to view the live video webcast of the symposium, please go to: http://www.cec.org/electricity. You may download the working paper and other supporting documents from: http://www.cec.org/programs_projects/other_initiatives/electricity/docs.cfm?varlan=english. Commission for Environmental Cooperation 393, rue St-Jacques Ouest, Bureau 200 Montréal (Québec) Canada H2Y 1N9 Tel: (514) 350-4300; Fax: (514) 350-4314 E-mail: info@ccemtl.org ***********

Pre-processing

corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus[[1]]
## <<PlainTextDocument (metadata: 7)>>
## north america integr electr market requir cooper  environment polici commiss  environment cooper releas work paper  north america electr market montreal 27 novemb 2001   north american commiss  environment cooper cec  releas  work paper highlight  trend toward increas trade competit  crossbord invest  electr  canada mexico   unit state   hope   work paper environment challeng  opportun   evolv north american electr market will stimul public discuss around  cec symposium    titl   need  coordin environment polici trinat   north americawid electr market develop  cec symposium will take place  san diego  2930 novemb  will bring togeth lead expert  industri academia ngos   govern  canada mexico   unit state  consid  impact   evolv continent electr market  human health   environ  goal   work paper   symposium   highlight key environment issu  must  address   electr market  north america becom    integr said janin ferretti execut director   cec  want  stimul discuss around  import polici question  rais   countri can cooper   approach  energi   environ  cec  intern organ creat   environment side agreement  nafta known   north american agreement  environment cooper  establish  address region environment concern help prevent potenti trade  environment conflict  promot  effect enforc  environment law  cec secretariat believ  greater north american cooper  environment polici regard  continent electr market  necessari    protect air qualiti  mitig climat chang   minim  possibl  environmentbas trade disput   ensur  depend suppli  reason price electr across north america   avoid creation  pollut haven    ensur local  nation environment measur remain effect  chang market  work paper profil  rapid chang north american electr market  exampl  2001  us  project  export 131 thousand gigawatthour gwh  electr  canada  mexico  2007  number  project  grow  169 thousand gwh  electr   past  decad  north american electr market  develop   complex array  crossbord transact  relationship 
said phil sharp former us congressman  chairman   cec electr advisori board  need  achiev  new level  cooper   environment approach  well  environment profil   electr sector  electr sector   singl largest sourc  nation report toxin   unit state  canada   larg sourc  mexico   us  electr sector emit approxim 25 percent   nox emiss rough 35 percent   co2 emiss 25 percent   mercuri emiss  almost 70 percent  so2 emiss  emiss   larg impact  airsh watersh  migratori speci corridor   often share   three north american countri  want  discuss  possibl outcom  greater effort  coordin feder state  provinci environment law  polici  relat   electr sector said ferretti  can  develop  compat environment approach  help make domest environment polici  effect  effect   integr electr market one key issu rais   paper   effect  market integr   competit  particular fuel   coal natur gas  renew fuel choic larg determin environment impact   specif facil along  pollut control technolog perform standard  regul  paper highlight  impact   high competit market  well  exampl concern   call pollut haven aris  signific differ  environment law  enforc practic induc power compani  locat  oper  jurisdict  lower standard  cec secretariat  explor  addit environment polici will work   restructur market    polici can  adapt  ensur   enhanc competit  benefit  entir region said sharp  trade rule  polici measur direct influenc  variabl  drive  success integr north american electr market  work paper also address fuel choic technolog pollut control strategi  subsidi  cec will use  inform gather   discuss period  develop  final report  will  submit   council  earli 2002   inform   view  live video webcast   symposium pleas go  httpwwwcecorgelectr  may download  work paper   support document  httpwwwcecorgprogramsprojectsotherinitiativeselectricitydocscfmvarlanenglish commiss  environment cooper 393 rue stjacqu ouest bureau 200 montrãal quãbec canada h2i 1n9 tel 514 3504300 fax 514 3504314 email infoccemtlorg
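
To make the individual cleaning steps easier to follow, here is a small base R sketch of lowercasing, punctuation removal, and stop word removal on a single sentence. It is not how tm implements these steps, the stop word list is a tiny hypothetical subset, and stemming is left out entirely.

```r
# Toy sentence; stop_words is a tiny hypothetical subset of an English stop word list
text <- "The CEC is releasing a Working Paper on North America's electricity market."
stop_words <- c("the", "is", "a", "on")

lowered  <- tolower(text)                      # 1. lowercase everything
no_punct <- gsub("[[:punct:]]", "", lowered)   # 2. drop punctuation characters
tokens   <- strsplit(no_punct, "\\s+")[[1]]    # split on whitespace
kept     <- tokens[!(tokens %in% stop_words)]  # 3. drop stop words
kept
```

A stemmer would additionally reduce words such as "releasing" and "working" to shorter stems, which is why the corpus output above contains forms like "releas" and "work".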

Build a document term matrix

dtm <- DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 22164)>>
## Non-/sparse entries: 102863/18847357
## Sparsity           : 99%
## Maximal term length: 156
## Weighting          : term frequency (tf)

Remove any term that doesn’t appear in at least 3% of the documents.

dtm <- removeSparseTerms(dtm, 0.97)
dtm
## <<DocumentTermMatrix (documents: 855, terms: 788)>>
## Non-/sparse entries: 51612/622128
## Sparsity           : 92%
## Maximal term length: 19
## Weighting          : term frequency (tf)
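
The idea behind filtering sparse terms can be sketched in base R on a toy count matrix. This is a simplified rule written for illustration; the matrix values and the 40% threshold are made up, and the exact boundary handling inside removeSparseTerms may differ.

```r
# Toy count matrix: 5 documents (rows) by 3 terms (columns), hypothetical values
counts <- matrix(
  c(1, 0, 2,
    0, 0, 1,
    3, 0, 0,
    1, 1, 0,
    0, 0, 4),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("market", "rare", "electr"))
)

# Share of documents in which each term appears at least once
doc_share <- colMeans(counts > 0)

# Keep terms present in at least 40% of the toy documents
min_share <- 0.4
kept_terms <- counts[, doc_share >= min_share, drop = FALSE]
colnames(kept_terms)
```

Here "rare" appears in only one of five documents and is dropped, while the other two terms are kept. With 855 emails and a 3% cutoff, the same idea removes any term found in fewer than roughly 26 emails.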

Build a data frame called labeledTerms out of this document-term matrix.

labeledTerms <- as.data.frame(as.matrix(dtm))

So right now this data frame only includes the frequencies of the words that appear in at least 3% of the documents. To train our text analytics models, we also need the outcome variable, which is whether or not each email was responsive. So we add that variable to the data frame.

labeledTerms$responsive <- emails$responsive
str(labeledTerms)
## 'data.frame':    855 obs. of  789 variables:
##  $ 100                : num  0 0 0 0 0 0 5 0 0 0 ...
##  $ 1400               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 1999               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 2000               : num  0 0 1 0 1 0 6 0 1 0 ...
##  $ 2001               : num  2 1 0 0 0 0 7 0 0 0 ...
##  $ 713                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ 77002              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl                : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ accept             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ access             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ accord             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ account            : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ act                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ action             : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ activ              : num  0 0 1 0 1 0 1 0 0 0 ...
##  $ actual             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ add                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ addit              : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ address            : num  3 0 0 0 2 0 0 0 0 1 ...
##  $ administr          : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ advanc             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ advis              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ affect             : num  0 0 0 0 2 0 0 0 0 0 ...
##  $ afternoon          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agenc              : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ ago                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ agre               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agreement          : num  2 0 0 0 2 0 1 0 0 1 ...
##  $ alan               : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ allow              : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ along              : num  1 0 0 0 1 0 1 0 0 0 ...
##  $ alreadi            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ also               : num  1 0 0 0 0 0 8 0 0 0 ...
##  $ altern             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ although           : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ amend              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ america            : num  4 0 0 0 0 0 0 0 1 0 ...
##  $ among              : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ amount             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ analysi            : num  0 0 0 2 0 0 0 0 0 0 ...
##  $ analyst            : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ andor              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ andrew             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ announc            : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ anoth              : num  0 0 0 0 0 0 6 0 0 0 ...
##  $ answer             : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ anyon              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ anyth              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appear             : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ appli              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ applic             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ appreci            : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ approach           : num  3 0 0 0 0 0 1 0 0 0 ...
##  $ appropri           : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ approv             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ approxim           : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ april              : num  0 0 0 0 0 0 3 0 0 0 ...
##  $ area               : num  0 0 0 0 1 0 3 0 0 0 ...
##  $ around             : num  2 0 0 0 0 0 1 0 0 0 ...
##  $ arrang             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ articl             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ ask                : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ asset              : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ assist             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ associ             : num  0 0 1 0 1 0 0 0 0 0 ...
##  $ assum              : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ attach             : num  0 1 0 1 1 0 1 0 3 1 ...
##  $ attend             : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ attent             : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ attorney           : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ august             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ author             : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ avail              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ averag             : num  0 0 0 0 0 0 5 0 0 0 ...
##  $ avoid              : num  1 0 0 0 1 0 2 0 0 0 ...
##  $ awar               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ back               : num  0 0 0 0 1 1 1 0 0 0 ...
##  $ balanc             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bank               : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ base               : num  0 0 0 0 1 0 9 0 0 0 ...
##  $ basi               : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ becom              : num  1 0 0 0 0 0 4 0 0 0 ...
##  $ begin              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ believ             : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ benefit            : num  1 0 0 0 0 0 5 0 0 0 ...
##  $ best               : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ better             : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ bid                : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ big                : num  0 0 0 0 0 1 6 0 0 0 ...
##  $ bill               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ billion            : num  0 0 0 0 0 0 2 0 0 0 ...
##  $ bit                : num  0 0 0 0 0 1 2 0 0 0 ...
##  $ board              : num  1 0 0 0 0 0 0 0 0 0 ...
##  $ bob                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ book               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ brian              : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ brief              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bring              : num  1 0 0 0 0 0 2 0 0 0 ...
##  $ build              : num  0 0 0 0 0 0 7 0 1 0 ...
##   [list output truncated]
labeledTerms[1:10,1:10]
##                100 1400 1999 2000 2001 713 77002 abl accept access
## character(0)     0    0    0    0    2   0     0   0      0      0
## character(0).1   0    0    0    0    1   0     0   0      0      0
## character(0).2   0    0    0    1    0   0     0   0      0      0
## character(0).3   0    0    0    0    0   0     0   0      0      0
## character(0).4   0    0    0    1    0   0     0   0      0      0
## character(0).5   0    0    0    0    0   0     0   0      0      0
## character(0).6   5    0    0    6    7   0     0   2      1      0
## character(0).7   0    0    0    0    0   0     0   0      0      0
## character(0).8   0    0    0    1    0   0     0   0      0      0
## character(0).9   0    0    0    0    0   0     0   0      0      0

So as we expect, there are an awful lot of variables, 789 in total. Of those, 788 are the frequencies of various words in the emails, and the last one is responsive, the outcome variable.

Split dataset

library(caTools)
set.seed(144)
spl <- sample.split(labeledTerms$responsive, 0.7)
train <- subset(labeledTerms, spl == TRUE)
test <- subset(labeledTerms, spl == FALSE)
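
A split that keeps the share of responsive emails roughly equal in both sets can be sketched in base R as follows. This is an illustration of the idea of a stratified split and is not necessarily how sample.split works internally; the label vector y is a stand-in built from the class counts reported earlier (716 non-responsive, 139 responsive).

```r
set.seed(144)
# Stand-in labels with the same class counts as the data set
y <- c(rep(0, 716), rep(1, 139))

# Draw 70% of the row indices within each class separately
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))
in_train <- seq_along(y) %in% train_idx

# Rows: true class; columns: FALSE = test set, TRUE = training set
table(y, in_train)
```

Because the sampling is done per class, the test set ends up with 215 non-responsive and 42 responsive rows regardless of which particular rows are drawn.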

Build a model

We’ll build a simple CART model using the default parameters, although a random forest would be another good choice from our toolset. We start by loading the packages for the CART model.

library(rpart)
library(rpart.plot)
emailCART <- rpart(responsive ~., data = train, method = "class")
prp(emailCART)

Evaluate the model

pred <- predict(emailCART, newdata = test)
pred[1:10,]
##                        0          1
## character(0)   0.2156863 0.78431373
## character(0).1 0.9557522 0.04424779
## character(0).2 0.9557522 0.04424779
## character(0).3 0.8125000 0.18750000
## character(0).4 0.4000000 0.60000000
## character(0).5 0.9557522 0.04424779
## character(0).6 0.9557522 0.04424779
## character(0).7 0.9557522 0.04424779
## character(0).8 0.1250000 0.87500000
## character(0).9 0.1250000 0.87500000

So the left column here is the predicted probability of the document being non-responsive. And the right column is the predicted probability of the document being responsive.

So in our case, we want to extract the predicted probability of the document being responsive. So we’re looking for the rightmost column.

pred.prob <- pred[, 2]
table(test$responsive, pred.prob >= 0.5)
##    
##     FALSE TRUE
##   0   195   20
##   1    17   25

Model accuracy is 0.856

(195+25)/nrow(test)
## [1] 0.8560311
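
Accuracy is only one summary of the confusion matrix. The short sketch below recomputes it from the four counts printed above and adds sensitivity and specificity; the variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (threshold 0.5)
tn <- 195; fp <- 20
fn <- 17;  tp <- 25

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of responsive emails that were flagged
specificity <- tn / (tn + fp)   # share of non-responsive emails left unflagged

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

At the 0.5 threshold the model flags only 25 of the 42 responsive test emails, so its sensitivity is noticeably lower than its overall accuracy.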

Baseline model

table(test$responsive)
## 
##   0   1 
## 215  42

Baseline accuracy is 0.837: always predicting non-responsive is correct for 215 of the 257 test emails.

215/nrow(test)
## [1] 0.8365759

So we see just a small improvement in accuracy using the CART model, which, as we know, is common with unbalanced data sets.

If we have a false positive, in which a non-responsive document is labeled as responsive, the mistake translates to a bit of additional work in the manual review process but no further harm, since the manual review process will remove this erroneous result. But on the other hand, if we have a false negative, in which a responsive document is labeled as non-responsive by our model, we will miss the document entirely in our predictive coding process. Therefore, we’re going to assign a higher cost to false negatives than to false positives, which makes this a good time to look at other cut-offs on our ROC curve.
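
One way to make the asymmetric costs concrete is to total them at a few candidate cutoffs. The sketch below uses made-up probabilities and labels, and the 5-to-1 cost ratio is purely an assumption for illustration, not a figure from the recitation.

```r
# Hypothetical predicted probabilities and true labels for 8 documents
prob  <- c(0.05, 0.10, 0.20, 0.30, 0.45, 0.60, 0.80, 0.90)
truth <- c(0,    0,    1,    0,    1,    0,    1,    1)

# Assumed costs: a missed responsive document costs 5x as much as an extra review
cost_fn <- 5
cost_fp <- 1

total_cost <- function(cutoff) {
  flagged <- prob >= cutoff
  fn <- sum(!flagged & truth == 1)   # responsive documents we would miss
  fp <- sum(flagged & truth == 0)    # non-responsive documents sent to review
  cost_fn * fn + cost_fp * fp
}

cutoffs <- c(0.15, 0.5, 0.7)
costs <- sapply(cutoffs, total_cost)
names(costs) <- cutoffs
costs
```

Under these assumed costs the lowest cutoff is the cheapest, because it trades a couple of extra manual reviews for not missing any responsive document.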

ROC curve

library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
predROCR <- prediction(pred.prob, test$responsive)
perfROCR <-  performance(predROCR, "tpr", "fpr")
plot(perfROCR, colorize = TRUE)

Now, of course, the best cutoff to select is entirely dependent on the costs assigned by the decision maker to false positives and false negatives. However, again, we do favor cutoffs that give us a high sensitivity, since we want to identify a large number of the responsive documents.

So a promising operating point might be right around here, in this part of the curve, where we have a true positive rate of around 70%, meaning that we’re getting about 70% of all the responsive documents, and a false positive rate of about 20%, meaning that we’re mistakenly identifying as responsive 20% of the non-responsive documents.

Now, since the vast majority of documents are typically non-responsive, operating at this cutoff could result in a large decrease in the amount of manual effort needed in the eDiscovery process. And we can see from the blue color of the plot at this particular location that we’re looking at a threshold of maybe 0.15 or so, significantly lower than 0.5, which is what we would expect since we prefer false positives to false negatives.
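
Each point on an ROC curve is just the true positive rate and false positive rate at one cutoff. The base R sketch below computes two such points on made-up data (not the Enron test set) to show how lowering the cutoff moves along the curve; roc_point is a helper defined here for illustration.

```r
# Hypothetical predicted probabilities and true labels (not the Enron test set)
prob  <- c(0.04, 0.04, 0.19, 0.19, 0.60, 0.78, 0.88, 0.88, 0.04, 0.19)
truth <- c(0,    0,    0,    1,    0,    1,    1,    1,    1,    0)

# True and false positive rates when flagging every document with prob >= cutoff
roc_point <- function(cutoff) {
  flagged <- prob >= cutoff
  c(cutoff = cutoff,
    tpr = sum(flagged & truth == 1) / sum(truth == 1),
    fpr = sum(flagged & truth == 0) / sum(truth == 0))
}

roc_points <- t(sapply(c(0.5, 0.15), roc_point))
roc_points
```

Moving the cutoff from 0.5 down to 0.15 raises both rates in this toy example: more responsive documents are caught, at the price of more non-responsive ones being sent to review.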

Compute AUC value

performance(predROCR, "auc")@y.values
## [[1]]
## [1] 0.7936323

We can see that we have an AUC in the test set of about 79.4%, which means that our model can differentiate between a randomly selected responsive and non-responsive document about 79% of the time.
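
The ranking interpretation of AUC can be checked directly on a small example by comparing every responsive score with every non-responsive score. The scores and labels below are made up, and this pairwise calculation is a sketch of the definition rather than the routine ROCR uses.

```r
# Hypothetical scores and labels
prob  <- c(0.9, 0.8, 0.4, 0.35, 0.3, 0.1)
truth <- c(1,   1,   0,   1,    0,   0)

pos <- prob[truth == 1]   # scores of responsive documents
neg <- prob[truth == 0]   # scores of non-responsive documents

# Compare every responsive score with every non-responsive score; ties count as half
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc <- mean(wins)
auc
```

In this toy example the responsive document scores higher in 8 of the 9 possible pairs, so the AUC is 8/9.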