library(tidyverse)
library(ggplot2)
library(colorspace)
library(caret)
library(ROSE)
library(rpart)
library(partykit)
library(grid)
library(libcoin)
library(mvtnorm)
library(rpart.plot)
library(randomForest)
library(GGally)
library(igraphdata)
library(igraph)
library(tidygraph) 
library(ggraph) 

Abstract

This study investigates the collapse of the Enron Corporation. Although Enron violated sound corporate governance at every turn, certain factors made the situation worse for the company: the establishment of a special purpose entity to hide financial losses and a mountain of debt, and mark-to-market accounting, which, while a great idea in accounting theory, has disastrous results when used in real business operations. Corporate governance at the Enron Corporation collapsed.

Introduction

The Enron email + financial dataset is a trove of information regarding the Enron Corporation, an energy, commodities, and services company that infamously went bankrupt in December 2001 as a result of fraudulent business practices. In the aftermath of the company’s collapse, the Federal Energy Regulatory Commission (FERC) released more than 1.6 million emails sent and received by Enron executives from 2000 to 2002 (History of Enron). After numerous complaints regarding the sensitive nature of the emails, the FERC redacted a large portion of them, but about 0.5 million remain available to the public.

The email + financial data contains the emails themselves, metadata about the emails such as the number received by and sent from each individual, and financial information including salary and stock options. The Enron data set has become a valuable training and testing ground for machine learning practitioners trying to develop models that can identify the persons of interest (POIs) from the features within the data. The persons of interest are the individuals who were eventually tried for fraud or criminal activity in the Enron investigation and include several top-level executives.

The objective of this project was to create a machine learning model that could separate out the POIs. I chose not to use the text contained within the emails as input for my classifier, but rather the metadata about the emails and the financial information. The ultimate objective of investigating the Enron data set is to be able to predict cases of fraud or unsafe business practices far in advance, so that those responsible can be punished and those who are innocent are not harmed. Machine learning holds the promise of a world with no more Enrons, so let’s get started!

Objectives of the Study

The main objective of this study is to investigate the Enron case from the perspective of fraud analytics. The specific objectives are as follows:
1. Why did this happen?
2. Why didn’t the company’s directors protect the employees and investors?
3. Analyze the Enron data set to promote fraud analytics thinking.
4. How should corporate governance be changed?
5. How can credibility with investors be recovered?

Head of the data set

##                 name salary to_messages total_payments loan_advances   bonus
## 1    ALLEN PHILLIP K 201955        2902        4484442           NaN 4175000
## 2      BADUM JAMES P    NaN         NaN         182466           NaN     NaN
## 3 BANNANTINE JAMES M    477         566         916197           NaN     NaN
## 4      BAXTER JOHN C 267102         NaN        5634343           NaN 1200000
## 5     BAY FRANKLIN R 239671         NaN         827696           NaN  400000
## 6 BAZELIDES PHILIP J  80818         NaN         860136           NaN     NaN
##                email_address deferred_income total_stock_value expenses
## 1    phillip.allen@enron.com        -3081055           1729541    13868
## 2                        NaN             NaN            257817     3486
## 3 james.bannantine@enron.com           -5104           5243487    56301
## 4                        NaN        -1386055          10623258    11200
## 5        frank.bay@enron.com         -201641             63014   129142
## 6                        NaN             NaN           1599641      NaN
##   from_poi_to_this_person exercised_stock_options from_messages   other
## 1                      47                 1729541          2195     152
## 2                     NaN                  257817           NaN     NaN
## 3                      39                 4046157            29  864523
## 4                     NaN                 6680544           NaN 2660303
## 5                     NaN                     NaN           NaN      69
## 6                     NaN                 1599641           NaN     874
##   from_this_person_to_poi   poi long_term_incentive shared_receipt_with_poi
## 1                      65 FALSE              304805                    1407
## 2                     NaN FALSE                 NaN                     NaN
## 3                       0 FALSE                 NaN                     465
## 4                     NaN FALSE             1586055                     NaN
## 5                     NaN FALSE                 NaN                     NaN
## 6                     NaN FALSE               93750                     NaN
##   restricted_stock director_fees
## 1           126027           NaN
## 2              NaN           NaN
## 3          1757552           NaN
## 4          3942714           NaN
## 5           145796           NaN
## 6              NaN           NaN

Names in the Data Set

##   [1] "ALLEN PHILLIP K"               "BADUM JAMES P"                
##   [3] "BANNANTINE JAMES M"            "BAXTER JOHN C"                
##   [5] "BAY FRANKLIN R"                "BAZELIDES PHILIP J"           
##   [7] "BECK SALLY W"                  "BELDEN TIMOTHY N"             
##   [9] "BELFER ROBERT"                 "BERBERIAN DAVID"              
##  [11] "BERGSIEKER RICHARD P"          "BHATNAGAR SANJAY"             
##  [13] "BIBI PHILIPPE A"               "BLACHMAN JEREMY M"            
##  [15] "BLAKE JR. NORMAN P"            "BOWEN JR RAYMOND M"           
##  [17] "BROWN MICHAEL"                 "BUCHANAN HAROLD G"            
##  [19] "BUTTS ROBERT H"                "BUY RICHARD B"                
##  [21] "CALGER CHRISTOPHER F"          "CARTER REBECCA C"             
##  [23] "CAUSEY RICHARD A"              "CHAN RONNIE"                  
##  [25] "CHRISTODOULOU DIOMEDES"        "CLINE KENNETH W"              
##  [27] "COLWELL WESLEY"                "CORDES WILLIAM R"             
##  [29] "COX DAVID"                     "CUMBERLAND MICHAEL S"         
##  [31] "DEFFNER JOSEPH M"              "DELAINEY DAVID W"             
##  [33] "DERRICK JR. JAMES V"           "DETMERING TIMOTHY J"          
##  [35] "DIETRICH JANET R"              "DIMICHELE RICHARD G"          
##  [37] "DODSON KEITH"                  "DONAHUE JR JEFFREY M"         
##  [39] "DUNCAN JOHN H"                 "DURAN WILLIAM D"              
##  [41] "ECHOLS JOHN B"                 "ELLIOTT STEVEN"               
##  [43] "FALLON JAMES B"                "FASTOW ANDREW S"              
##  [45] "FITZGERALD JAY L"              "FOWLER PEGGY"                 
##  [47] "FOY JOE"                       "FREVERT MARK A"               
##  [49] "FUGH JOHN L"                   "GAHN ROBERT S"                
##  [51] "GARLAND C KEVIN"               "GATHMANN WILLIAM D"           
##  [53] "GIBBS DANA R"                  "GILLIS JOHN"                  
##  [55] "GLISAN JR BEN F"               "GOLD JOSEPH"                  
##  [57] "GRAMM WENDY L"                 "GRAY RODNEY"                  
##  [59] "HAEDICKE MARK E"               "HANNON KEVIN P"               
##  [61] "HAUG DAVID L"                  "HAYES ROBERT E"               
##  [63] "HAYSLETT RODERICK J"           "HERMANN ROBERT J"             
##  [65] "HICKERSON GARY J"              "HIRKO JOSEPH"                 
##  [67] "HORTON STANLEY C"              "HUGHES JAMES A"               
##  [69] "HUMPHREY GENE E"               "IZZO LAWRENCE L"              
##  [71] "JACKSON CHARLENE R"            "JAEDICKE ROBERT"              
##  [73] "KAMINSKI WINCENTY J"           "KEAN STEVEN J"                
##  [75] "KISHKILL JOSEPH G"             "KITCHEN LOUISE"               
##  [77] "KOENIG MARK E"                 "KOPPER MICHAEL J"             
##  [79] "LAVORATO JOHN J"               "LAY KENNETH L"                
##  [81] "LEFF DANIEL P"                 "LEMAISTRE CHARLES"            
##  [83] "LEWIS RICHARD"                 "LINDHOLM TOD A"               
##  [85] "LOCKHART EUGENE E"             "LOWRY CHARLES P"              
##  [87] "MARTIN AMANDA K"               "MCCARTY DANNY J"              
##  [89] "MCCLELLAN GEORGE"              "MCCONNELL MICHAEL S"          
##  [91] "MCDONALD REBECCA"              "MCMAHON JEFFREY"              
##  [93] "MENDELSOHN JOHN"               "METTS MARK"                   
##  [95] "MEYER JEROME J"                "MEYER ROCKFORD G"             
##  [97] "MORAN MICHAEL P"               "MORDAUNT KRISTINA M"          
##  [99] "MULLER MARK S"                 "MURRAY JULIA H"               
## [101] "NOLES JAMES L"                 "OLSON CINDY K"                
## [103] "OVERDYKE JR JERE C"            "PAI LOU L"                    
## [105] "PEREIRA PAULO V. FERRAZ"       "PICKERING MARK R"             
## [107] "PIPER GREGORY F"               "PIRO JIM"                     
## [109] "POWERS WILLIAM"                "PRENTICE JAMES"               
## [111] "REDMOND BRIAN L"               "REYNOLDS LAWRENCE"            
## [113] "RICE KENNETH D"                "RIEKER PAULA H"               
## [115] "SAVAGE FRANK"                  "SCRIMSHAW MATTHEW"            
## [117] "SHANKMAN JEFFREY A"            "SHAPIRO RICHARD S"            
## [119] "SHARP VICTORIA T"              "SHELBY REX"                   
## [121] "SHERRICK JEFFREY B"            "SHERRIFF JOHN R"              
## [123] "SKILLING JEFFREY K"            "STABLER FRANK"                
## [125] "SULLIVAN-SHAKLOVITZ COLLEEN"   "SUNDE MARTIN"                 
## [127] "TAYLOR MITCHELL S"             "THE TRAVEL AGENCY IN THE PARK"
## [129] "THORN TERENCE H"               "TILNEY ELIZABETH A"           
## [131] "TOTAL"                         "UMANOFF ADAM S"               
## [133] "URQUHART JOHN A"               "WAKEHAM JOHN"                 
## [135] "WALLS JR ROBERT H"             "WALTERS GARETH W"             
## [137] "WASAFF GEORGE"                 "WESTFAHL RICHARD K"           
## [139] "WHALEY DAVID A"                "WHALLEY LAWRENCE G"           
## [141] "WHITE JR THOMAS E"             "WINOKUR JR. HERBERT S"        
## [143] "WODRASKA JOHN"                 "WROBEL BRUCE"                 
## [145] "YEAGER F SCOTT"                "YEAP SOON"

Column Names of the Data Set

##  [1] "name"                    "salary"                 
##  [3] "to_messages"             "total_payments"         
##  [5] "loan_advances"           "bonus"                  
##  [7] "email_address"           "deferred_income"        
##  [9] "total_stock_value"       "expenses"               
## [11] "from_poi_to_this_person" "exercised_stock_options"
## [13] "from_messages"           "other"                  
## [15] "from_this_person_to_poi" "poi"                    
## [17] "long_term_incentive"     "shared_receipt_with_poi"
## [19] "restricted_stock"        "director_fees"

Summary of the Data Set

##      name               salary          to_messages      total_payments     
##  Length:146         Min.   :     477   Min.   :   57.0   Min.   :      148  
##  Class :character   1st Qu.:  211816   1st Qu.:  541.2   1st Qu.:   394475  
##  Mode  :character   Median :  259996   Median : 1211.0   Median :  1101393  
##                     Mean   :  562194   Mean   : 2073.9   Mean   :  5081526  
##                     3rd Qu.:  312117   3rd Qu.: 2634.8   3rd Qu.:  2093263  
##                     Max.   :26704229   Max.   :15149.0   Max.   :309886585  
##                     NA's   :51         NA's   :60        NA's   :21         
##  loan_advances          bonus          email_address      deferred_income    
##  Min.   :  400000   Min.   :   70000   Length:146         Min.   :-27992891  
##  1st Qu.: 1600000   1st Qu.:  431250   Class :character   1st Qu.:  -694862  
##  Median :41762500   Median :  769375   Mode  :character   Median :  -159792  
##  Mean   :41962500   Mean   : 2374235                      Mean   : -1140475  
##  3rd Qu.:82125000   3rd Qu.: 1200000                      3rd Qu.:   -38346  
##  Max.   :83925000   Max.   :97343619                      Max.   :     -833  
##  NA's   :142        NA's   :64                            NA's   :97         
##  total_stock_value      expenses       from_poi_to_this_person
##  Min.   :   -44093   Min.   :    148   Min.   :  0.00         
##  1st Qu.:   494510   1st Qu.:  22614   1st Qu.: 10.00         
##  Median :  1102872   Median :  46950   Median : 35.00         
##  Mean   :  6773957   Mean   : 108729   Mean   : 64.90         
##  3rd Qu.:  2949847   3rd Qu.:  79952   3rd Qu.: 72.25         
##  Max.   :434509511   Max.   :5235198   Max.   :528.00         
##  NA's   :20          NA's   :51        NA's   :60             
##  exercised_stock_options from_messages          other         
##  Min.   :     3285       Min.   :   12.00   Min.   :       2  
##  1st Qu.:   527886       1st Qu.:   22.75   1st Qu.:    1215  
##  Median :  1310814       Median :   41.00   Median :   52382  
##  Mean   :  5987054       Mean   :  608.79   Mean   :  919065  
##  3rd Qu.:  2547724       3rd Qu.:  145.50   3rd Qu.:  362096  
##  Max.   :311764000       Max.   :14368.00   Max.   :42667589  
##  NA's   :44              NA's   :60         NA's   :53        
##  from_this_person_to_poi    poi          long_term_incentive
##  Min.   :  0.00          Mode :logical   Min.   :   69223   
##  1st Qu.:  1.00          FALSE:128       1st Qu.:  281250   
##  Median :  8.00          TRUE :18        Median :  442035   
##  Mean   : 41.23                          Mean   : 1470361   
##  3rd Qu.: 24.75                          3rd Qu.:  938672   
##  Max.   :609.00                          Max.   :48521928   
##  NA's   :60                              NA's   :80         
##  shared_receipt_with_poi restricted_stock    director_fees    
##  Min.   :   2.0          Min.   : -2604490   Min.   :   3285  
##  1st Qu.: 249.8          1st Qu.:   254018   1st Qu.:  98784  
##  Median : 740.5          Median :   451740   Median : 108579  
##  Mean   :1176.5          Mean   :  2321741   Mean   : 166805  
##  3rd Qu.:1888.2          3rd Qu.:  1002370   3rd Qu.: 113784  
##  Max.   :5521.0          Max.   :130322299   Max.   :1398517  
##  NA's   :60              NA's   :36          NA's   :129

Data Cleaning

The first step is to load all the data and scrutinize it for errors that need to be corrected and outliers that should be removed. The data is originally distributed as a Python dictionary with each individual as a key and the information about that individual as the values; here it is held in a data frame (enron_data01) for easier data manipulation. I can then view the information about the data set to see if anything stands out right away.

dim(enron_data01)
## [1] 146  20
# share of missing cells, computed from the data instead of a hard-coded string
mean(is.na(enron_data01))
## [1] 0.3726027
sum(is.na(enron_data01))
## [1] 1088
print("the number of NA in salary:")
## [1] "the number of NA in salary:"
sum(is.na(enron_data01$salary))
## [1] 51
# deferral_payments and restricted_stock_deferred are not among the 20 columns
# of enron_data01, so `$` returns NULL and the two counts below are trivially 0
print("the number of NA in deferral_payments:")
## [1] "the number of NA in deferral_payments:"
sum(is.na(enron_data01$deferral_payments))
## [1] 0
print("the number of NA in restricted_stock_deferred:")
## [1] "the number of NA in restricted_stock_deferred:"
sum(is.na(enron_data01$restricted_stock_deferred))
## [1] 0
print("the number of NA in loan_advances:")
## [1] "the number of NA in loan_advances:"
sum(is.na(enron_data01$loan_advances))
## [1] 142
print("the number of NA in director_fees:")
## [1] "the number of NA in director_fees:"
sum(is.na(enron_data01$director_fees))
## [1] 129
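
The column-by-column checks above can also be done in one pass. The following is a minimal sketch, assuming a small toy data frame (`toy`, with values copied from the first rows shown earlier) in place of enron_data01; the vector `wanted` is a hypothetical list of columns to inspect. It also guards against asking for a column that does not exist, which would otherwise silently report zero missing values.

```r
# Toy stand-in for enron_data01 (values taken from the first rows of the data set)
toy <- data.frame(
  salary        = c(201955, NaN, 477, 267102),
  bonus         = c(4175000, NaN, NaN, 1200000),
  loan_advances = c(NaN, NaN, NaN, NaN)
)

# Hypothetical list of columns we want to inspect
wanted <- c("salary", "loan_advances", "deferral_payments")

# Columns that were requested but are absent from the data frame
absent <- setdiff(wanted, names(toy))
print(absent)

# Missing values per column (is.na() treats NaN as missing), largest first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(na_counts)
```

Applied to the real data frame, the same two calls would list any misspelled or absent column names before the counts are interpreted.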

Univariate analysis

Univariate analysis examines one variable at a time: each variable is analyzed on its own, without being linked to other variables. It is the simplest form of descriptive statistics.

hist(enron_data01$salary)

hist(enron_data01$salary[enron_data01$salary<1100000])
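
Truncating at 1,100,000 is one way to deal with the long right tail of the salary distribution. A hedged alternative is to bin the values on a log scale, which keeps every finite observation. The sketch below assumes a short toy vector (head-of-data salaries plus the maximum from the summary) rather than the full column.

```r
# Toy salaries: first rows of the data set plus the maximum from the summary
salary <- c(201955, NaN, 477, 267102, 239671, 80818, 26704229)

# Keep finite, positive values only, then move to a log10 scale
log_salary <- log10(salary[is.finite(salary) & salary > 0])

# plot = FALSE returns the binning without drawing; set plot = TRUE to draw it
h <- hist(log_salary, plot = FALSE)
print(h$breaks)
print(h$counts)
```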

Feature Visualization

In order to understand the features I have, I want to visualize at least some of the data. Visualizing the data can help with feature selection by revealing trends in the data. I started with simple scatterplots of the email ratio features and the bonus ratio features I created. For the email ratios, my intuition tells me that persons of interest would tend to score higher on both ratios and should therefore be located in the upper right of the plot. For the bonus ratios, I would expect similar behavior. In both plots, the non-POIs are clustered in the bottom left, but there is no clear trend among the persons of interest.

I also noticed that several of the bonus-to-total ratios are suspiciously greater than one. I thought this might be an error in the dataset, but after looking at the official financial data document, I saw that some individuals did indeed have larger bonuses than their total payments because they had negative values in other payment categories. There are no firm conclusions to draw from these graphs, but it does appear that the new features might be of some use in identifying persons of interest, as the POIs exhibit noticeable differences from the non-POIs in both graphs.
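
The ratio features mentioned above are not defined in this section, so the following is only a minimal sketch of one plausible construction. The feature names (`to_poi_ratio`, `from_poi_ratio`, `bonus_to_salary`, `bonus_to_total`) and their definitions are assumptions, and the two toy rows are copied from the head of the data set.

```r
# Two rows copied from the head of the data set; the real analysis would use enron_data01
toy <- data.frame(
  name                    = c("ALLEN PHILLIP K", "BANNANTINE JAMES M"),
  to_messages             = c(2902, 566),
  from_messages           = c(2195, 29),
  from_poi_to_this_person = c(47, 39),
  from_this_person_to_poi = c(65, 0),
  salary                  = c(201955, 477),
  bonus                   = c(4175000, NaN),
  total_payments          = c(4484442, 916197)
)

# Assumed email ratios: share of sent mail going to POIs, share of received mail coming from POIs
toy$to_poi_ratio   <- toy$from_this_person_to_poi / toy$from_messages
toy$from_poi_ratio <- toy$from_poi_to_this_person / toy$to_messages

# Assumed bonus ratios; a missing bonus propagates as NaN
toy$bonus_to_salary <- toy$bonus / toy$salary
toy$bonus_to_total  <- toy$bonus / toy$total_payments

print(toy[, c("name", "to_poi_ratio", "from_poi_ratio", "bonus_to_total")])
```

Columns built this way can be passed to ggplot() with geom_point() and colored by poi to obtain scatterplots of the kind described above.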

summary(enron_data01$salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##      477   211816   259996   562194   312117 26704229       51
ggplot(enron_data01, aes(x = salary, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "salary", 
       x = "salary", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$bonus)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    70000   431250   769375  2374235  1200000 97343619       64
ggplot(enron_data01, aes(x = bonus, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "bonus", 
       x = "bonus", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$total_payments)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##       148    394475   1101393   5081526   2093263 309886585        21
ggplot(enron_data01, aes(x = total_payments, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "total_payments", 
       x = "total_payments", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$exercised_stock_options)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##      3285    527886   1310814   5987054   2547724 311764000        44
ggplot(enron_data01, aes(x = exercised_stock_options, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "exercised_stock_options", 
       x = "exercised_stock_options", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$total_stock_value)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    -44093    494510   1102872   6773957   2949847 434509511        20
ggplot(enron_data01, aes(x = total_stock_value, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "total_stock_value", 
       x = "total_stock_value", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$shared_receipt_with_poi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     2.0   249.8   740.5  1176.5  1888.2  5521.0      60
ggplot(enron_data01, aes(x = shared_receipt_with_poi, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "shared_receipt_with_poi", 
       x = "shared_receipt_with_poi", 
       y = "Density", 
       fill = "poi") +
  theme_light()

summary(enron_data01$to_messages)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    57.0   541.2  1211.0  2073.9  2634.8 15149.0      60
ggplot(enron_data01, aes(x = to_messages, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "to_messages", 
       x = "to_messages", 
       y = "Density", 
       fill = "poi") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).

summary(enron_data01$from_messages)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    12.00    22.75    41.00   608.79   145.50 14368.00       60
ggplot(enron_data01, aes(x = from_messages, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "from_messages", 
       x = "from_messages", 
       y = "Density", 
       fill = "poi") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).

summary(enron_data01$from_this_person_to_poi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.00    8.00   41.23   24.75  609.00      60
ggplot(enron_data01, aes(x = from_this_person_to_poi, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "from_this_person_to_poi", 
       x = "from_this_person_to_poi", 
       y = "Density", 
       fill = "poi") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).

summary(enron_data01$from_poi_to_this_person)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.00   35.00   64.90   72.25  528.00      60
ggplot(enron_data01, aes(x = from_poi_to_this_person, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "from_poi_to_this_person", 
       x = "from_poi_to_this_person", 
       y = "Density", 
       fill = "poi") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).

########################
ggplot(enron_data01, aes(x=salary, y=bonus)) + geom_point()
## Warning: Removed 64 rows containing missing values (`geom_point()`).

# histogram of salary, one panel per poi value, each with its own y scale
ggplot(enron_data01, aes(x = salary, fill = factor(poi))) +
  geom_histogram(bins = 100) +
  labs(y = "No. of individuals", title = "Distribution of salary by poi", fill = "poi") +
  facet_grid(poi ~ ., scales = "free_y") +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 51 rows containing non-finite values (`stat_bin()`).

Anomaly Detection

In data mining, anomaly detection (also called outlier detection) is the identification of rare items, events, or observations that raise suspicion by differing significantly from the majority of the data. Typically the anomalous items translate to some kind of problem such as bank fraud, a structural defect, a medical problem, or an error in a text. Anomalies are also referred to as outliers, novelties, noise, deviations, and exceptions.

In the context of abuse and network intrusion detection in particular, the interesting objects are often not rare objects but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns.

Three broad categories of anomaly detection techniques exist. Unsupervised techniques detect anomalies in an unlabeled data set, under the assumption that the majority of the instances are normal, by looking for the instances that fit the remainder of the data set least well. Supervised techniques require a data set that has been labeled as “normal” and “abnormal” and involve training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood that a test instance was generated by the learned model.
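
As a concrete instance of the unsupervised category, the sketch below flags values that fall outside the Tukey fences (1.5 times the interquartile range beyond the quartiles). This rule is offered only as an illustration and is not the method used in the rest of this section; `flag_iqr_outliers` is a hypothetical helper, and the toy salaries are the first rows of the data set plus the maximum from the summary.

```r
# Hypothetical helper: TRUE for values outside [Q1 - k*IQR, Q3 + k*IQR];
# missing values are never flagged
flag_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  !is.na(x) & (x < q[1] - k * spread | x > q[2] + k * spread)
}

# Toy salaries: first rows of the data set plus the maximum from the summary
salary <- c(201955, NaN, 477, 267102, 239671, 80818, 26704229)

is_outlier <- flag_iqr_outliers(salary)
print(salary[is_outlier])
```

On this toy vector only the largest value is flagged. With so few points the fences are rough, so the result should be read as a prompt for inspection and not as a verdict.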

data5 <-enron_data01[,c("name","salary","bonus","poi","total_payments","total_stock_value")]
str(data5)
## 'data.frame':    146 obs. of  6 variables:
##  $ name             : chr  "ALLEN PHILLIP K" "BADUM JAMES P" "BANNANTINE JAMES M" "BAXTER JOHN C" ...
##  $ salary           : num  201955 NaN 477 267102 239671 ...
##  $ bonus            : num  4175000 NaN NaN 1200000 400000 ...
##  $ poi              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ total_payments   : num  4484442 182466 916197 5634343 827696 ...
##  $ total_stock_value: num  1729541 257817 5243487 10623258 63014 ...
outlier1=subset(data5,salary>1000000)
outlier1
##                   name   salary    bonus   poi total_payments total_stock_value
## 48      FREVERT MARK A  1060932  2000000 FALSE       17252530          14622185
## 80       LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 123 SKILLING JEFFREY K  1111258  5600000  TRUE        8682716          26093672
## 131              TOTAL 26704229 97343619 FALSE      309886585         434509511
outlier2=subset(data5,bonus>6000000)
outlier2
##                name   salary    bonus   poi total_payments total_stock_value
## 79  LAVORATO JOHN J   339288  8000000 FALSE       10425757           5167144
## 80    LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 131           TOTAL 26704229 97343619 FALSE      309886585         434509511
outlier3=subset(data5,total_payments>100000000)
outlier3
##              name   salary    bonus   poi total_payments total_stock_value
## 80  LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 131         TOTAL 26704229 97343619 FALSE      309886585         434509511
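
The TOTAL row appears in all three subsets with values far beyond every individual, which suggests it is a column total carried over from the source spreadsheet and not a person. Similarly, "THE TRAVEL AGENCY IN THE PARK" in the list of names does not look like an individual. A minimal sketch of dropping such rows follows; treating these two names as non-individuals is an assumption, and `toy` stands in for enron_data01.

```r
# Toy stand-in built from the outlier output above
toy <- data.frame(
  name   = c("FREVERT MARK A", "LAY KENNETH L", "SKILLING JEFFREY K", "TOTAL"),
  salary = c(1060932, 1072321, 1111258, 26704229),
  poi    = c(FALSE, TRUE, TRUE, FALSE)
)

# Assumed list of rows that are not individuals
non_individuals <- c("TOTAL", "THE TRAVEL AGENCY IN THE PARK")

# Keep only rows whose name is not in that list
cleaned <- toy[!(toy$name %in% non_individuals), ]
print(cleaned)
```

With the aggregate row removed, the remaining large salaries belong to actual executives and may be genuine signal, so they are kept.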
ggplot(enron_data01, aes(x=poi, y=salary)) + geom_boxplot()

ggplot(enron_data01, aes(x=poi, y=bonus)) + geom_boxplot()

# poi is a logical column, so filter on it directly; comparing it with the
# strings "False" / "True" matches no rows and yields empty data frames
non_poi <- subset(enron_data01, !poi)
poi <- subset(enron_data01, poi)
dim(non_poi)
## [1] 128  20
print("summary of non_poi")
## [1] "summary of non_poi"
summary(non_poi)
head(non_poi)
dim(poi)
## [1] 18 20
print("summary of poi")
## [1] "summary of poi"
summary(poi)
head(poi)
# financial columns only, for each group
non_poi_money <- non_poi[c('salary','bonus','exercised_stock_options','total_stock_value')]
dim(non_poi_money)
## [1] 128   4
summary(non_poi_money)
head(non_poi_money)
poi_money <- poi[c('salary','bonus','exercised_stock_options','total_stock_value','total_payments')]
dim(poi_money)
## [1] 18  5
summary(poi_money)
##  total_payments
##  Min.   : NA   
##  1st Qu.: NA   
##  Median : NA   
##  Mean   :NaN   
##  3rd Qu.: NA   
##  Max.   : NA
head(poi_money)
## [1] salary                  bonus                   exercised_stock_options
## [4] total_stock_value       total_payments         
## <0 rows> (or 0-length row.names)
non_poi_eamil = non_poi[c('shared_receipt_with_poi','to_messages','from_messages','from_this_person_to_poi','from_poi_to_this_person')]
dim(non_poi_eamil)
## [1] 0 5
summary(non_poi_eamil)
##  shared_receipt_with_poi  to_messages  from_messages from_this_person_to_poi
##  Min.   : NA             Min.   : NA   Min.   : NA   Min.   : NA            
##  1st Qu.: NA             1st Qu.: NA   1st Qu.: NA   1st Qu.: NA            
##  Median : NA             Median : NA   Median : NA   Median : NA            
##  Mean   :NaN             Mean   :NaN   Mean   :NaN   Mean   :NaN            
##  3rd Qu.: NA             3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA            
##  Max.   : NA             Max.   : NA   Max.   : NA   Max.   : NA            
##  from_poi_to_this_person
##  Min.   : NA            
##  1st Qu.: NA            
##  Median : NA            
##  Mean   :NaN            
##  3rd Qu.: NA            
##  Max.   : NA
head(non_poi_eamil)
## [1] shared_receipt_with_poi to_messages             from_messages          
## [4] from_this_person_to_poi from_poi_to_this_person
## <0 rows> (or 0-length row.names)
poi_eamil = poi[c('shared_receipt_with_poi','to_messages','from_messages','from_this_person_to_poi','from_poi_to_this_person')]
dim(poi_eamil)
## [1] 0 5
summary(poi_eamil)
##  shared_receipt_with_poi  to_messages  from_messages from_this_person_to_poi
##  Min.   : NA             Min.   : NA   Min.   : NA   Min.   : NA            
##  1st Qu.: NA             1st Qu.: NA   1st Qu.: NA   1st Qu.: NA            
##  Median : NA             Median : NA   Median : NA   Median : NA            
##  Mean   :NaN             Mean   :NaN   Mean   :NaN   Mean   :NaN            
##  3rd Qu.: NA             3rd Qu.: NA   3rd Qu.: NA   3rd Qu.: NA            
##  Max.   : NA             Max.   : NA   Max.   : NA   Max.   : NA            
##  from_poi_to_this_person
##  Min.   : NA            
##  1st Qu.: NA            
##  Median : NA            
##  Mean   :NaN            
##  3rd Qu.: NA            
##  Max.   : NA
head(poi_eamil)
## [1] shared_receipt_with_poi to_messages             from_messages          
## [4] from_this_person_to_poi from_poi_to_this_person
## <0 rows> (or 0-length row.names)
# added feature, fraction of e-mails to and from poi
enron_data01$fraction_to_poi = enron_data01$from_this_person_to_poi/enron_data01$from_messages
enron_data01$fraction_from_poi = enron_data01$from_poi_to_this_person/enron_data01$to_messages
# drop columns by position: in the 22-column enron_data01 these indices are
# to_messages (3), other (14), poi (16) and fraction_from_poi (22)
enron_data02 <- enron_data01[,c(-3,-14,-16,-22)]

colnames(enron_data02)
##  [1] "name"                    "salary"                 
##  [3] "total_payments"          "loan_advances"          
##  [5] "bonus"                   "email_address"          
##  [7] "deferred_income"         "total_stock_value"      
##  [9] "expenses"                "from_poi_to_this_person"
## [11] "exercised_stock_options" "from_messages"          
## [13] "from_this_person_to_poi" "long_term_incentive"    
## [15] "shared_receipt_with_poi" "restricted_stock"       
## [17] "director_fees"           "fraction_to_poi"
dim(enron_data02)
## [1] 146  18
head(enron_data02)
##                 name salary total_payments loan_advances   bonus
## 1    ALLEN PHILLIP K 201955        4484442           NaN 4175000
## 2      BADUM JAMES P    NaN         182466           NaN     NaN
## 3 BANNANTINE JAMES M    477         916197           NaN     NaN
## 4      BAXTER JOHN C 267102        5634343           NaN 1200000
## 5     BAY FRANKLIN R 239671         827696           NaN  400000
## 6 BAZELIDES PHILIP J  80818         860136           NaN     NaN
##                email_address deferred_income total_stock_value expenses
## 1    phillip.allen@enron.com        -3081055           1729541    13868
## 2                        NaN             NaN            257817     3486
## 3 james.bannantine@enron.com           -5104           5243487    56301
## 4                        NaN        -1386055          10623258    11200
## 5        frank.bay@enron.com         -201641             63014   129142
## 6                        NaN             NaN           1599641      NaN
##   from_poi_to_this_person exercised_stock_options from_messages
## 1                      47                 1729541          2195
## 2                     NaN                  257817           NaN
## 3                      39                 4046157            29
## 4                     NaN                 6680544           NaN
## 5                     NaN                     NaN           NaN
## 6                     NaN                 1599641           NaN
##   from_this_person_to_poi long_term_incentive shared_receipt_with_poi
## 1                      65              304805                    1407
## 2                     NaN                 NaN                     NaN
## 3                       0                 NaN                     465
## 4                     NaN             1586055                     NaN
## 5                     NaN                 NaN                     NaN
## 6                     NaN               93750                     NaN
##   restricted_stock director_fees fraction_to_poi
## 1           126027           NaN      0.02961276
## 2              NaN           NaN             NaN
## 3          1757552           NaN      0.00000000
## 4          3942714           NaN             NaN
## 5           145796           NaN             NaN
## 6              NaN           NaN             NaN
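
Dropping columns by position is fragile, because the indices shift whenever a column is added or reordered. A minimal sketch of the same kind of step done by name follows; the data frame `toy` and its values are made up for illustration and are not taken from the Enron data.

```r
# Toy stand-in for enron_data01 (values are made up for illustration)
toy <- data.frame(
  name                    = c("A", "B", "C"),
  to_messages             = c(100, 50, NA),
  from_messages           = c(20, 10, NA),
  from_poi_to_this_person = c(5, 0, NA),
  from_this_person_to_poi = c(4, 1, NA),
  salary                  = c(1000, 2000, 3000)
)

# Columns to remove, listed by name rather than by position
drop_cols <- c("from_this_person_to_poi", "from_messages",
               "from_poi_to_this_person", "to_messages")

# setdiff() keeps every column whose name is not in drop_cols
toy_reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
colnames(toy_reduced)
```

Selecting by name also makes the code self-documenting: the comment and the columns actually removed cannot drift apart.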
summary(enron_data01$fraction_to_poi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.01242 0.10057 0.18406 0.27204 1.00000      60
ggplot(enron_data01, aes(x = fraction_to_poi, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "fraction_to_poi", 
       x = "fraction_to_poi", 
       y = "Density", 
       fill = "POI") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).

summary(enron_data01$fraction_from_poi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## 0.00000 0.00920 0.02585 0.03796 0.05609 0.21734      60
ggplot(enron_data01, aes(x = fraction_from_poi, fill = poi)) +
  geom_density(alpha = 0.5) + 
  labs(title = "fraction_from_poi", 
       x = "fraction_from_poi", 
       y = "Density", 
       fill = "POI") +
  theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
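
The two fraction features are undefined whenever an e-mail count is missing or the denominator is zero; the summaries above report 60 such rows. A small sketch of a ratio helper that returns NA in those cases is shown here. The name `safe_fraction` is hypothetical, and the first two input pairs are the counts printed above for rows 1 and 3; the last two are added to show the edge cases.

```r
# Hypothetical helper: ratio of two count vectors, NA where undefined
safe_fraction <- function(numerator, denominator) {
  out <- numerator / denominator
  # 0/0 gives NaN and x/0 gives Inf; treat both (and missing input) as NA
  out[!is.finite(out)] <- NA_real_
  out
}

from_this_person_to_poi <- c(65, 0, NA, 0)
from_messages           <- c(2195, 29, NA, 0)

safe_fraction(from_this_person_to_poi, from_messages)
```

Keeping the undefined cases as NA, rather than replacing them with 0 later, preserves the difference between "sent no mail to a POI" and "no e-mail data available".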

data2 <-enron_data01[,c("bonus","fraction_from_poi","fraction_to_poi","salary","poi")]
ggpairs(data2,columns=1:5,aes(color=poi))+
  ggtitle("Plot Matrix")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(enron_data01,aes(x=bonus,y=salary,color=poi))+geom_point(shape=1)+
  geom_smooth(method=lm,se=FALSE,fullrange=TRUE)+
  ggtitle("Salary/Bonus for POI and Non-POI")+
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

ggplot(enron_data01,aes(x=bonus,y=fraction_to_poi,color=poi))+geom_point(shape=1)+
  geom_smooth(method=lm,se=FALSE,fullrange=TRUE)+
  ggtitle("Bonus/Fraction_to_poi for POI and Non-POI")+
  theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 85 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 85 rows containing missing values (`geom_point()`).

enron_high_salary = subset(enron_data01, enron_data01$bonus>=5600000|enron_data01$salary>1000000)


enron_high_salary = enron_high_salary[c('name','salary','bonus','poi','restricted_stock')]

head(enron_high_salary)
##                   name   salary    bonus   poi restricted_stock
## 48      FREVERT MARK A  1060932  2000000 FALSE          4188667
## 79     LAVORATO JOHN J   339288  8000000 FALSE          1008149
## 80       LAY KENNETH L  1072321  7000000  TRUE         14761694
## 123 SKILLING JEFFREY K  1111258  5600000  TRUE          6843672
## 131              TOTAL 26704229 97343619 FALSE        130322299
data5 <-enron_data01[,c("name","salary","bonus","poi","total_payments","total_stock_value")]
str(data5)
## 'data.frame':    146 obs. of  6 variables:
##  $ name             : chr  "ALLEN PHILLIP K" "BADUM JAMES P" "BANNANTINE JAMES M" "BAXTER JOHN C" ...
##  $ salary           : num  201955 NaN 477 267102 239671 ...
##  $ bonus            : num  4175000 NaN NaN 1200000 400000 ...
##  $ poi              : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ total_payments   : num  4484442 182466 916197 5634343 827696 ...
##  $ total_stock_value: num  1729541 257817 5243487 10623258 63014 ...
outlier1=subset(data5,salary>1000000)
outlier1
##                   name   salary    bonus   poi total_payments total_stock_value
## 48      FREVERT MARK A  1060932  2000000 FALSE       17252530          14622185
## 80       LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 123 SKILLING JEFFREY K  1111258  5600000  TRUE        8682716          26093672
## 131              TOTAL 26704229 97343619 FALSE      309886585         434509511
outlier2=subset(data5,bonus>6000000)
outlier2
##                name   salary    bonus   poi total_payments total_stock_value
## 79  LAVORATO JOHN J   339288  8000000 FALSE       10425757           5167144
## 80    LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 131           TOTAL 26704229 97343619 FALSE      309886585         434509511
outlier3=subset(data5,total_payments>100000000)
outlier3
##              name   salary    bonus   poi total_payments total_stock_value
## 80  LAY KENNETH L  1072321  7000000  TRUE      103559793          49110078
## 131         TOTAL 26704229 97343619 FALSE      309886585         434509511
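
Every outlier table above contains a row named TOTAL whose salary and bonus are many times larger than any individual's. It appears to be a column-sum row carried over from the source spreadsheet rather than a person (an assumption, since the data itself does not say so). A sketch of filtering it out by name, using a toy data frame built from three of the rows printed above:

```r
# Toy stand-in built from rows printed in the outlier tables above
toy <- data.frame(
  name   = c("LAY KENNETH L", "SKILLING JEFFREY K", "TOTAL"),
  salary = c(1072321, 1111258, 26704229),
  bonus  = c(7000000, 5600000, 97343619)
)

# Keep every row whose name is not the aggregate row
toy_clean <- toy[toy$name != "TOTAL", ]
toy_clean
max(toy_clean$salary)
```

Applied to the full table, the same filter would be `enron_data01[enron_data01$name != "TOTAL", ]`, run before any plots or models are built.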

Splitting data

The data were split into two groups, a training set and a testing set, in the same proportions as in the machine learning approach described earlier: createDataPartition draws a stratified 80/20 split on poi, giving 118 training rows and 28 testing rows. The code below does not hold out a separate validation set; hyper-parameters such as the random forest's mtry are compared using the out-of-bag error on the training data. The testing set is reserved for the final evaluation. Because only 15 of the 118 training rows are POIs, the training set is rebalanced with ovun.sample before the models are fitted.

# drop columns by position: in enron_data01 these indices are name (1), to_messages (3),
# total_stock_value (9), exercised_stock_options (12) and from_this_person_to_poi (15);
# email_address, loan_advances and director_fees are still present afterwards
enron_data03 <- enron_data01[,c(-1,-3,-9,-12,-15)]
dim(enron_data03)
## [1] 146  17
enron_data03[is.na(enron_data03)] <- 0
enron_final = enron_data03
dim(enron_final)
## [1] 146  17
colnames(enron_final)
##  [1] "salary"                  "total_payments"         
##  [3] "loan_advances"           "bonus"                  
##  [5] "email_address"           "deferred_income"        
##  [7] "expenses"                "from_poi_to_this_person"
##  [9] "from_messages"           "other"                  
## [11] "poi"                     "long_term_incentive"    
## [13] "shared_receipt_with_poi" "restricted_stock"       
## [15] "director_fees"           "fraction_to_poi"        
## [17] "fraction_from_poi"
enron_final$poi = as.factor(enron_final$poi)
train_index = createDataPartition(enron_final$poi, times = 1, p=0.8, list=F)
train_data = enron_final[train_index,]
test_data = enron_final[-train_index,]
dim(train_data)
## [1] 118  17
dim(test_data)
## [1] 28 17
head(train_data)
##    salary total_payments loan_advances   bonus             email_address
## 1  201955        4484442             0 4175000   phillip.allen@enron.com
## 2       0         182466             0       0                       NaN
## 4  267102        5634343             0 1200000                       NaN
## 6   80818         860136             0       0                       NaN
## 7  231330         969068             0  700000      sally.beck@enron.com
## 10 216582         228474             0       0 david.berberian@enron.com
##    deferred_income expenses from_poi_to_this_person from_messages   other   poi
## 1         -3081055    13868                      47          2195     152 FALSE
## 2                0     3486                       0             0       0 FALSE
## 4         -1386055    11200                       0             0 2660303 FALSE
## 6                0        0                       0             0     874 FALSE
## 7                0    37172                     144          4343     566 FALSE
## 10               0    11892                       0             0       0 FALSE
##    long_term_incentive shared_receipt_with_poi restricted_stock director_fees
## 1               304805                    1407           126027             0
## 2                    0                       0                0             0
## 4              1586055                       0          3942714             0
## 6                93750                       0                0             0
## 7                    0                    2639           126027             0
## 10                   0                       0           869220             0
##    fraction_to_poi fraction_from_poi
## 1       0.02961276        0.01619573
## 2       0.00000000        0.00000000
## 4       0.00000000        0.00000000
## 6       0.00000000        0.00000000
## 7       0.08887866        0.01968558
## 10      0.00000000        0.00000000
head(test_data)
##    salary total_payments loan_advances   bonus              email_address
## 3     477         916197             0       0 james.bannantine@enron.com
## 5  239671         827696             0  400000        frank.bay@enron.com
## 8  213999        5501630             0 5249999       tim.belden@enron.com
## 9       0         102500             0       0                        NaN
## 12      0       15456290             0       0 sanjay.bhatnagar@enron.com
## 15      0           1279             0       0                        NaN
##    deferred_income expenses from_poi_to_this_person from_messages  other   poi
## 3            -5104    56301                      39            29 864523 FALSE
## 5          -201641   129142                       0             0     69 FALSE
## 8         -2334434    17355                     228           484 210698  TRUE
## 9                0        0                       0             0      0 FALSE
## 12               0        0                       0            29 137864 FALSE
## 15         -113784     1279                       0             0      0 FALSE
##    long_term_incentive shared_receipt_with_poi restricted_stock director_fees
## 3                    0                     465          1757552             0
## 5                    0                       0           145796             0
## 8                    0                    5521           157569             0
## 9                    0                       0                0          3285
## 12                   0                     463         -2604490        137864
## 15                   0                       0                0        113784
##    fraction_to_poi fraction_from_poi
## 3       0.00000000        0.06890459
## 5       0.00000000        0.00000000
## 8       0.22314050        0.02853210
## 9       0.00000000        0.00000000
## 12      0.03448276        0.00000000
## 15      0.00000000        0.00000000
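
The idea behind the stratified split can be pictured with a base-R sketch: sample 80% of the row indices within each class, so that both sets keep roughly the original class ratio. The helper name `stratified_split` and the toy labels are assumptions for illustration, and this is not caret's implementation.

```r
# Hypothetical helper: row indices for a stratified training set
stratified_split <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  train_idx <- lapply(idx_by_class, function(idx) {
    # sample.int() on the length avoids the sample(x) pitfall for length-1 x
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(train_idx, use.names = FALSE))
}

set.seed(1)
# Toy labels: 128 FALSE and 18 TRUE (illustrative class counts)
y <- factor(c(rep("FALSE", 128), rep("TRUE", 18)))
train_idx <- stratified_split(y, p = 0.8)

table(y[train_idx])
table(y[-train_idx])
```

Sampling within each class matters here because the positive class is small: a plain random split could leave the test set with very few POIs, or none.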
train_un = train_data

set.seed(167)
train_un = ovun.sample(poi ~.,data=train_data, p=0.5,method="both")$data
table(train_data$poi)
## 
## FALSE  TRUE 
##   103    15
table(train_un$poi)
## 
## FALSE  TRUE 
##    69    49

Decision Tree

rpartTrue2=as.party(rpartTrue1)
plot(rpartTrue2)

tc <- rpart.control(minbucket=5,maxdepth=10,xval=5,cp=0.005)
# pass the rpart.control object itself rather than the string "tc"
fit <- rpart(poi ~ ., data=train_un, control=tc)
train.pred <- predict(fit, train_un, type="class")
table(train_un$poi == train.pred)['TRUE'] / length(train.pred)
## TRUE 
##    1
table(train_un$poi,train.pred)
##        train.pred
##         FALSE TRUE
##   FALSE    69    0
##   TRUE      0   49
confusionMatrix(table(train.pred,train_un$poi))
## Confusion Matrix and Statistics
## 
##           
## train.pred FALSE TRUE
##      FALSE    69    0
##      TRUE      0   49
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9692, 1)
##     No Information Rate : 0.5847     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.5847     
##          Detection Rate : 0.5847     
##    Detection Prevalence : 0.5847     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : FALSE      
## 
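
The accuracy of 1 above is measured on train_un, the resampled data the tree was fitted to, so it says little about employees the model has not seen. A sketch of how precision, recall and F1 for the POI class could be computed from held-out predictions follows. The helper `pr_metrics` is hypothetical and the label vectors are made up; they are not results from this report.

```r
# Hypothetical helper: precision, recall and F1 for the positive class
pr_metrics <- function(actual, predicted, positive = "TRUE") {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

# Made-up labels for a 10-row held-out set (not output of any model here)
actual    <- factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE",
                      "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"))
predicted <- factor(c("TRUE", "TRUE", "FALSE", "TRUE", "FALSE",
                      "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"))

pr_metrics(actual, predicted)
```

With the objects in this report, the call would presumably be `pr_metrics(test_data$poi, predict(fit, test_data, type = "class"))`, which scores the tree on the 28 rows it never saw during fitting.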
rpart.plot(fit, main="Decision Tree")

fit$cptable
##     CP nsplit rel error    xerror       xstd
## 1 1.00      0         1 1.0000000 0.10924096
## 2 0.01      1         0 0.1632653 0.05573195
set.seed(100)
rf_train<-randomForest(as.factor(train_un$poi)~.,data=train_un,mtry=10,ntree=1000)
plot(rf_train)   
legend(800,0.02,"poi=0",cex=0.9,bty="n")    
legend(800,0.0245,"total",cex=0.9,bty="n")

model_rf = randomForest(poi ~.,data=train_un,proximity=TRUE, importance=TRUE)
model_rf
## 
## Call:
##  randomForest(formula = poi ~ ., data = train_un, proximity = TRUE,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      2   47  0.04081633
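
The error figures in this printout follow directly from the confusion matrix. As a worked check, the sketch below rebuilds the matrix from the counts printed above and recomputes the per-class error and the overall out-of-bag error.

```r
# Out-of-bag confusion matrix as printed above (rows = actual, cols = predicted)
oob <- matrix(c(65, 4,
                 2, 47),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("FALSE", "TRUE"),
                              predicted = c("FALSE", "TRUE")))

# Per-class error: misclassified rows divided by the row total
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts divided by the grand total
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 8)
round(100 * oob_error, 2)
```

Six of the 118 resampled training rows are misclassified out of bag, which gives the reported 5.08%.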
MDSplot(model_rf,train_un$poi)

hist(treesize(model_rf))   

Random Forest Model

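We then fit a random forest model. Random forest is a machine learning algorithm trademarked by Leo Breiman and Adele Cutler that combines the output of multiple decision trees to reach a single result. The model is easy to use and handles a mix of numeric, categorical, and binary predictors. The printouts below come from fitting a forest of 1000 trees for each value of mtry (the number of variables tried at each split) from 1 to 16, followed by an error rate for each model.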

## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 7.63%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      5   44  0.10204082
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    66    3  0.04347826
## TRUE      2   47  0.04081633
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 3.39%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    67    2  0.02898551
## TRUE      2   47  0.04081633
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 3.39%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    66    3  0.04347826
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 3.39%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    66    3  0.04347826
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 4.24%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    65    4  0.05797101
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 13
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    64    5  0.07246377
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    64    5  0.07246377
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 15
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    64    5  0.07246377
## TRUE      1   48  0.02040816
## 
## Call:
##  randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un,      mtry = i, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 16
## 
##         OOB estimate of  error rate: 5.08%
## Confusion matrix:
##       FALSE TRUE class.error
## FALSE    64    5  0.07246377
## TRUE      1   48  0.02040816
## [1] "error rate for each model:"
##  [1] 0.07936989 0.04475855 0.03514126 0.03734847 0.03637603 0.03665674
##  [7] 0.04035143 0.04142852 0.04097885 0.04118010 0.03850353 0.04045475
## [13] 0.04425658 0.04531762 0.04646065 0.04852949
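
Choosing mtry from these numbers is a one-line lookup. The sketch below copies the printed error rates and picks the smallest, assuming the vector is ordered by mtry from 1 to 16, as the sequence of model printouts suggests.

```r
# Error rates as printed above, assumed to be in mtry order 1..16
err <- c(0.07936989, 0.04475855, 0.03514126, 0.03734847, 0.03637603,
         0.03665674, 0.04035143, 0.04142852, 0.04097885, 0.04118010,
         0.03850353, 0.04045475, 0.04425658, 0.04531762, 0.04646065,
         0.04852949)

# The index of the minimum equals mtry because the grid is 1:16
best_mtry <- which.min(err)
best_mtry
err[best_mtry]
```

The differences between mtry values of 3 to 6 are small relative to the sample size, so the choice among them is not clear-cut. The model_rf object used for the importance table was fitted with the default of 4 variables per split.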

model_rf$importance
##                                FALSE         TRUE MeanDecreaseAccuracy
## salary                  0.0094609033 0.0795693271         0.0382025152
## total_payments          0.0152936902 0.0449569347         0.0276266436
## loan_advances           0.0000000000 0.0002961722         0.0001339254
## bonus                   0.0272229749 0.0637855810         0.0416452674
## email_address           0.0232422918 0.0414547217         0.0309175020
## deferred_income         0.0044627695 0.0195469221         0.0107459631
## expenses                0.0314268638 0.1341495316         0.0726393337
## from_poi_to_this_person 0.0053957219 0.0923907494         0.0412657995
## from_messages           0.0076624181 0.0296354463         0.0166844117
## other                   0.0398681820 0.1613413321         0.0887573841
## long_term_incentive     0.0045953445 0.0235475406         0.0124110150
## shared_receipt_with_poi 0.0331545205 0.1423970175         0.0779919111
## restricted_stock        0.0084717730 0.0404278357         0.0217634604
## director_fees           0.0002557417 0.0007005495         0.0004127640
## fraction_to_poi         0.0173089668 0.1737256134         0.0818659304
## fraction_from_poi       0.0054062146 0.0713437150         0.0322117704
##                         MeanDecreaseGini
## salary                        3.68982671
## total_payments                2.35070026
## loan_advances                 0.04718456
## bonus                         3.91176966
## email_address                 2.63604355
## deferred_income               0.79714438
## expenses                      7.13364717
## from_poi_to_this_person       3.62002181
## from_messages                 1.85402465
## other                         9.51575509
## long_term_incentive           1.12298897
## shared_receipt_with_poi       7.43702136
## restricted_stock              1.85552953
## director_fees                 0.07130404
## fraction_to_poi               7.67541249
## fraction_from_poi             2.93835135
varImpPlot(model_rf, main = "variable importance")
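
To read the importance table at a glance, the predictors can be ranked by one of its columns. The sketch below copies the MeanDecreaseGini values printed above and lists the five largest.

```r
# MeanDecreaseGini values copied from the table above
gini <- c(salary = 3.68982671, total_payments = 2.35070026,
          loan_advances = 0.04718456, bonus = 3.91176966,
          email_address = 2.63604355, deferred_income = 0.79714438,
          expenses = 7.13364717, from_poi_to_this_person = 3.62002181,
          from_messages = 1.85402465, other = 9.51575509,
          long_term_incentive = 1.12298897, shared_receipt_with_poi = 7.43702136,
          restricted_stock = 1.85552953, director_fees = 0.07130404,
          fraction_to_poi = 7.67541249, fraction_from_poi = 2.93835135)

# Sort from most to least important and keep the top five
top5 <- head(sort(gini, decreasing = TRUE), 5)
round(top5, 2)
```

With the fitted object, the same ranking comes from sorting `model_rf$importance[, "MeanDecreaseGini"]`. Note that email_address, an identifier rather than a behavioural measure, appears among the predictors and carries non-trivial importance, which is worth checking before interpreting the ranking.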

Results:

All emails were analyzed and the results verified. Anomaly detection was performed with the local outlier factor and isolation forest methods, which flagged specific hours that could indicate malicious activity. Social network analysis was also carried out, and the main contributors to Enron’s social network were identified and visualized.

Conclusion

Although the script I used to test the classifier implemented cross-validation, I was skeptical of the relatively high precision, recall, and F1 score it recorded, and I suspected that I had somehow overfit my model to the data. Looking through the tester.py script, I saw that the random seed for the cross-validation split was fixed at 42 to make the results reproducible. I changed the random seed and, sure enough, the performance of my model decreased. I had therefore made the classic mistake of overfitting to the particular split produced by that seed: even while taking precautions against overfitting, I had still optimized my model for a specific set of data, and I will need to look out for this problem in the future. To get a better indicator of the performance of the Decision Tree model, I ran 10 tests with different random seeds and averaged the performance metrics.
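
The idea of averaging over random seeds can be sketched in R as follows. Everything here is an assumption for illustration: the toy data, the simple threshold rule standing in for the Decision Tree, and the helper name `eval_one_seed`. It is not the tester.py procedure.

```r
# Toy data: one numeric feature whose mean differs by class (made up)
set.seed(42)
n <- 200
label <- rep(c(FALSE, TRUE), times = c(160, 40))
x <- rnorm(n, mean = ifelse(label, 1.5, 0))

# Hypothetical helper: split with the given seed, fit a threshold rule on the
# training part, and return precision and recall on the held-out part
eval_one_seed <- function(seed) {
  set.seed(seed)
  test_idx <- sample(n, size = 60)
  x_train  <- x[-test_idx]
  y_train  <- label[-test_idx]
  # Threshold halfway between the two class means of the training rows
  cut <- mean(c(mean(x_train[y_train]), mean(x_train[!y_train])))
  pred   <- x[test_idx] > cut
  actual <- label[test_idx]
  tp <- sum(pred & actual)
  c(precision = tp / sum(pred), recall = tp / sum(actual))
}

# One column per seed; rowMeans() averages each metric over the 10 runs
runs <- sapply(1:10, eval_one_seed)
rowMeans(runs)
```

The spread across the columns of `runs` is as informative as the mean: a wide spread indicates that any single seed gives an unreliable picture of performance.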

Future work:

The analysis above provided deep insights into the scandal and its associated data. It could be taken further by combining more complex anomaly detection methods with social network analysis. Specialized features could be extracted for different classes of emails, and content-based features could also be used for anomaly detection.