library(tidyverse)
library(ggplot2)
library(colorspace)
library(caret)
library(ROSE)
library(rpart)
library(partykit)
library(grid)
library(libcoin)
library(mvtnorm)
library(rpart.plot)
library(randomForest)
library(GGally)
library(igraphdata)
library(igraph)
library(tidygraph)
library(ggraph)
This study is an endeavor to investigate even though the Enron Corporation violated it at every turn, there were certain factors that made the situation worse for the company. The establishment of a special purpose entity to hide financial losses and a mountain of debt; mark-to-market accounting, while a great idea for accounting, has disastrous results when used in real business operations. Corporate governance at Enron Corporation collapse.
The Enron email + financial dataset is a trove of information regarding the Enron Corporation, an energy, commodities, and services company that infamously went bankrupt in December 2001 as a result of fraudulent business practices. In the aftermath of the company’s collapse, the Federal Energy Regulatory Commission released more 1.6 million emails sent and received by Enron executives in the years from 2000–2002 (History of Enron). After numerous complaints regarding the sensitive nature of the emails, the FERC redacted a large portion of the emails, but about 0.5 million remain available to the public. The email + financial data contains the emails themselves, metadata about the emails such as number received by and sent from each individual, and financial information including salary and stock options. The Enron data set has become a valuable training and testing ground for machine learning practitioners to try and develop models that can identify the persons of interests (POIs) from the features within the data. The persons of interest are the individuals who were eventually tried for fraud or criminal activity in the Enron investigation and include several top level executives. The objective of this project was to create a machine learning model that could separate out the POIs. I choose not to use the text contained within the emails as input for my classifier, but rather the metadata about the emails and the financial information. The ultimate objective of investigating the Enron data set is to be able to predict cases of fraud or unsafe business practices far in advance, so those responsible can be punished, and those who are innocent are not harmed. Machine learning holds the promise of a world with no more Enron, so let’s get started!
Specifically, the main objective of the underlying study is to investigate the ENRON case on fraud analytics thinking. So, the objectives are outlined as follow: 1. Why did this happen? 2. Why didn’t the company’s directors protect the employees and investors? 3. Analyze the Enron data set to promote fraud analytics thinking. 4. How the corporate governance should be changed? 5. How can credibility be recovered with investors?
## name salary to_messages total_payments loan_advances bonus
## 1 ALLEN PHILLIP K 201955 2902 4484442 NaN 4175000
## 2 BADUM JAMES P NaN NaN 182466 NaN NaN
## 3 BANNANTINE JAMES M 477 566 916197 NaN NaN
## 4 BAXTER JOHN C 267102 NaN 5634343 NaN 1200000
## 5 BAY FRANKLIN R 239671 NaN 827696 NaN 400000
## 6 BAZELIDES PHILIP J 80818 NaN 860136 NaN NaN
## email_address deferred_income total_stock_value expenses
## 1 phillip.allen@enron.com -3081055 1729541 13868
## 2 NaN NaN 257817 3486
## 3 james.bannantine@enron.com -5104 5243487 56301
## 4 NaN -1386055 10623258 11200
## 5 frank.bay@enron.com -201641 63014 129142
## 6 NaN NaN 1599641 NaN
## from_poi_to_this_person exercised_stock_options from_messages other
## 1 47 1729541 2195 152
## 2 NaN 257817 NaN NaN
## 3 39 4046157 29 864523
## 4 NaN 6680544 NaN 2660303
## 5 NaN NaN NaN 69
## 6 NaN 1599641 NaN 874
## from_this_person_to_poi poi long_term_incentive shared_receipt_with_poi
## 1 65 FALSE 304805 1407
## 2 NaN FALSE NaN NaN
## 3 0 FALSE NaN 465
## 4 NaN FALSE 1586055 NaN
## 5 NaN FALSE NaN NaN
## 6 NaN FALSE 93750 NaN
## restricted_stock director_fees
## 1 126027 NaN
## 2 NaN NaN
## 3 1757552 NaN
## 4 3942714 NaN
## 5 145796 NaN
## 6 NaN NaN
## [1] "ALLEN PHILLIP K" "BADUM JAMES P"
## [3] "BANNANTINE JAMES M" "BAXTER JOHN C"
## [5] "BAY FRANKLIN R" "BAZELIDES PHILIP J"
## [7] "BECK SALLY W" "BELDEN TIMOTHY N"
## [9] "BELFER ROBERT" "BERBERIAN DAVID"
## [11] "BERGSIEKER RICHARD P" "BHATNAGAR SANJAY"
## [13] "BIBI PHILIPPE A" "BLACHMAN JEREMY M"
## [15] "BLAKE JR. NORMAN P" "BOWEN JR RAYMOND M"
## [17] "BROWN MICHAEL" "BUCHANAN HAROLD G"
## [19] "BUTTS ROBERT H" "BUY RICHARD B"
## [21] "CALGER CHRISTOPHER F" "CARTER REBECCA C"
## [23] "CAUSEY RICHARD A" "CHAN RONNIE"
## [25] "CHRISTODOULOU DIOMEDES" "CLINE KENNETH W"
## [27] "COLWELL WESLEY" "CORDES WILLIAM R"
## [29] "COX DAVID" "CUMBERLAND MICHAEL S"
## [31] "DEFFNER JOSEPH M" "DELAINEY DAVID W"
## [33] "DERRICK JR. JAMES V" "DETMERING TIMOTHY J"
## [35] "DIETRICH JANET R" "DIMICHELE RICHARD G"
## [37] "DODSON KEITH" "DONAHUE JR JEFFREY M"
## [39] "DUNCAN JOHN H" "DURAN WILLIAM D"
## [41] "ECHOLS JOHN B" "ELLIOTT STEVEN"
## [43] "FALLON JAMES B" "FASTOW ANDREW S"
## [45] "FITZGERALD JAY L" "FOWLER PEGGY"
## [47] "FOY JOE" "FREVERT MARK A"
## [49] "FUGH JOHN L" "GAHN ROBERT S"
## [51] "GARLAND C KEVIN" "GATHMANN WILLIAM D"
## [53] "GIBBS DANA R" "GILLIS JOHN"
## [55] "GLISAN JR BEN F" "GOLD JOSEPH"
## [57] "GRAMM WENDY L" "GRAY RODNEY"
## [59] "HAEDICKE MARK E" "HANNON KEVIN P"
## [61] "HAUG DAVID L" "HAYES ROBERT E"
## [63] "HAYSLETT RODERICK J" "HERMANN ROBERT J"
## [65] "HICKERSON GARY J" "HIRKO JOSEPH"
## [67] "HORTON STANLEY C" "HUGHES JAMES A"
## [69] "HUMPHREY GENE E" "IZZO LAWRENCE L"
## [71] "JACKSON CHARLENE R" "JAEDICKE ROBERT"
## [73] "KAMINSKI WINCENTY J" "KEAN STEVEN J"
## [75] "KISHKILL JOSEPH G" "KITCHEN LOUISE"
## [77] "KOENIG MARK E" "KOPPER MICHAEL J"
## [79] "LAVORATO JOHN J" "LAY KENNETH L"
## [81] "LEFF DANIEL P" "LEMAISTRE CHARLES"
## [83] "LEWIS RICHARD" "LINDHOLM TOD A"
## [85] "LOCKHART EUGENE E" "LOWRY CHARLES P"
## [87] "MARTIN AMANDA K" "MCCARTY DANNY J"
## [89] "MCCLELLAN GEORGE" "MCCONNELL MICHAEL S"
## [91] "MCDONALD REBECCA" "MCMAHON JEFFREY"
## [93] "MENDELSOHN JOHN" "METTS MARK"
## [95] "MEYER JEROME J" "MEYER ROCKFORD G"
## [97] "MORAN MICHAEL P" "MORDAUNT KRISTINA M"
## [99] "MULLER MARK S" "MURRAY JULIA H"
## [101] "NOLES JAMES L" "OLSON CINDY K"
## [103] "OVERDYKE JR JERE C" "PAI LOU L"
## [105] "PEREIRA PAULO V. FERRAZ" "PICKERING MARK R"
## [107] "PIPER GREGORY F" "PIRO JIM"
## [109] "POWERS WILLIAM" "PRENTICE JAMES"
## [111] "REDMOND BRIAN L" "REYNOLDS LAWRENCE"
## [113] "RICE KENNETH D" "RIEKER PAULA H"
## [115] "SAVAGE FRANK" "SCRIMSHAW MATTHEW"
## [117] "SHANKMAN JEFFREY A" "SHAPIRO RICHARD S"
## [119] "SHARP VICTORIA T" "SHELBY REX"
## [121] "SHERRICK JEFFREY B" "SHERRIFF JOHN R"
## [123] "SKILLING JEFFREY K" "STABLER FRANK"
## [125] "SULLIVAN-SHAKLOVITZ COLLEEN" "SUNDE MARTIN"
## [127] "TAYLOR MITCHELL S" "THE TRAVEL AGENCY IN THE PARK"
## [129] "THORN TERENCE H" "TILNEY ELIZABETH A"
## [131] "TOTAL" "UMANOFF ADAM S"
## [133] "URQUHART JOHN A" "WAKEHAM JOHN"
## [135] "WALLS JR ROBERT H" "WALTERS GARETH W"
## [137] "WASAFF GEORGE" "WESTFAHL RICHARD K"
## [139] "WHALEY DAVID A" "WHALLEY LAWRENCE G"
## [141] "WHITE JR THOMAS E" "WINOKUR JR. HERBERT S"
## [143] "WODRASKA JOHN" "WROBEL BRUCE"
## [145] "YEAGER F SCOTT" "YEAP SOON"
## [1] "name" "salary"
## [3] "to_messages" "total_payments"
## [5] "loan_advances" "bonus"
## [7] "email_address" "deferred_income"
## [9] "total_stock_value" "expenses"
## [11] "from_poi_to_this_person" "exercised_stock_options"
## [13] "from_messages" "other"
## [15] "from_this_person_to_poi" "poi"
## [17] "long_term_incentive" "shared_receipt_with_poi"
## [19] "restricted_stock" "director_fees"
## name salary to_messages total_payments
## Length:146 Min. : 477 Min. : 57.0 Min. : 148
## Class :character 1st Qu.: 211816 1st Qu.: 541.2 1st Qu.: 394475
## Mode :character Median : 259996 Median : 1211.0 Median : 1101393
## Mean : 562194 Mean : 2073.9 Mean : 5081526
## 3rd Qu.: 312117 3rd Qu.: 2634.8 3rd Qu.: 2093263
## Max. :26704229 Max. :15149.0 Max. :309886585
## NA's :51 NA's :60 NA's :21
## loan_advances bonus email_address deferred_income
## Min. : 400000 Min. : 70000 Length:146 Min. :-27992891
## 1st Qu.: 1600000 1st Qu.: 431250 Class :character 1st Qu.: -694862
## Median :41762500 Median : 769375 Mode :character Median : -159792
## Mean :41962500 Mean : 2374235 Mean : -1140475
## 3rd Qu.:82125000 3rd Qu.: 1200000 3rd Qu.: -38346
## Max. :83925000 Max. :97343619 Max. : -833
## NA's :142 NA's :64 NA's :97
## total_stock_value expenses from_poi_to_this_person
## Min. : -44093 Min. : 148 Min. : 0.00
## 1st Qu.: 494510 1st Qu.: 22614 1st Qu.: 10.00
## Median : 1102872 Median : 46950 Median : 35.00
## Mean : 6773957 Mean : 108729 Mean : 64.90
## 3rd Qu.: 2949847 3rd Qu.: 79952 3rd Qu.: 72.25
## Max. :434509511 Max. :5235198 Max. :528.00
## NA's :20 NA's :51 NA's :60
## exercised_stock_options from_messages other
## Min. : 3285 Min. : 12.00 Min. : 2
## 1st Qu.: 527886 1st Qu.: 22.75 1st Qu.: 1215
## Median : 1310814 Median : 41.00 Median : 52382
## Mean : 5987054 Mean : 608.79 Mean : 919065
## 3rd Qu.: 2547724 3rd Qu.: 145.50 3rd Qu.: 362096
## Max. :311764000 Max. :14368.00 Max. :42667589
## NA's :44 NA's :60 NA's :53
## from_this_person_to_poi poi long_term_incentive
## Min. : 0.00 Mode :logical Min. : 69223
## 1st Qu.: 1.00 FALSE:128 1st Qu.: 281250
## Median : 8.00 TRUE :18 Median : 442035
## Mean : 41.23 Mean : 1470361
## 3rd Qu.: 24.75 3rd Qu.: 938672
## Max. :609.00 Max. :48521928
## NA's :60 NA's :80
## shared_receipt_with_poi restricted_stock director_fees
## Min. : 2.0 Min. : -2604490 Min. : 3285
## 1st Qu.: 249.8 1st Qu.: 254018 1st Qu.: 98784
## Median : 740.5 Median : 451740 Median : 108579
## Mean :1176.5 Mean : 2321741 Mean : 166805
## 3rd Qu.:1888.2 3rd Qu.: 1002370 3rd Qu.: 113784
## Max. :5521.0 Max. :130322299 Max. :1398517
## NA's :60 NA's :36 NA's :129
The first step is to load in all the data and scrutinize it for any errors that need to be corrected and outlines that should be removed. The data is provided in the form of a Python dictionary with each individual as a key and the information about the individual as values, and I will convert it to a pandas data frame for easier data manipulation. I can then view the information about the data set to see if anything stands out right away.
dim(enron_data01)
## [1] 146 20
print("the % of NA: 0.4132625995")
## [1] "the % of NA: 0.4132625995"
sum(is.na(enron_data01))
## [1] 1088
print("the number of NA in salary:")
## [1] "the number of NA in salary:"
sum(is.na(enron_data01$salary))
## [1] 51
print("the number of NA in deferral_payments:")
## [1] "the number of NA in deferral_payments:"
sum(is.na(enron_data01$deferral_payments))
## [1] 0
print("the number of NA in restricted_stock_deferred:")
## [1] "the number of NA in restricted_stock_deferred:"
sum(is.na(enron_data01$restricted_stock_deferred))
## [1] 0
print("the number of NA in loan_advances:")
## [1] "the number of NA in loan_advances:"
sum(is.na(enron_data01$loan_advances))
## [1] 142
print("the number of NA in director_fees:")
## [1] "the number of NA in director_fees:"
sum(is.na(enron_data01$director_fees))
## [1] 129
Univariate analysis is a technique for analyzing data on one variable independently,each variable is analyzed without being linked to other variables. Univariate analysis is also called descriptive statistics.
hist(enron_data01$salary)
hist(enron_data01$salary[enron_data01$salary<1100000])
In order to understand the features I have, I want to visualize at least some of the data. Visualizing the data can help with feature selection by revealing trends in the data. The following is a simple scatterplot of the email ratio features I created and the bonus ratios I created. For the email ratios, my intuition tells me that persons of interest would tend to have points higher in both ratios and therefore should tend to be located in the upper right of the plot. For the bonus ratios, I would expect similar behavior. In both plots, the non persons of interest are clustered to the bottom left, but there is not a clear trend among the persons of interest. I also noticed suspiciously that several of the bonus to total ratios are greater than one. I thought this might be an error in the dataset, but after looking at the official financial data document, I saw some individuals did indeed have larger bonuses than their total payments because they had negative values in other payment categories. There are no firm conclusions to draw from these graphs, but it does appear that the new features might be of some use in identifiying persons of interest as the POIs exhibit noticeable differences from the non POIs in both graphs.
summary(enron_data01$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 477 211816 259996 562194 312117 26704229 51
ggplot(enron_data01, aes(x = salary, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "salary",
x = "salary",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$bonus)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 70000 431250 769375 2374235 1200000 97343619 64
ggplot(enron_data01, aes(x = bonus, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "bonus",
x = "bonus",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$total_payments)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 148 394475 1101393 5081526 2093263 309886585 21
ggplot(enron_data01, aes(x = total_payments, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "total_payments",
x = "total_payments",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$exercised_stock_options)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3285 527886 1310814 5987054 2547724 311764000 44
ggplot(enron_data01, aes(x = exercised_stock_options, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "exercised_stock_options",
x = "exercised_stock_options",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$total_stock_value)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -44093 494510 1102872 6773957 2949847 434509511 20
ggplot(enron_data01, aes(x = total_stock_value, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "total_stock_value",
x = "total_stock_value",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$shared_receipt_with_poi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.0 249.8 740.5 1176.5 1888.2 5521.0 60
ggplot(enron_data01, aes(x = shared_receipt_with_poi, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "shared_receipt_with_poi",
x = "shared_receipt_with_poi",
y = "Density",
col = "poi") +
theme_light()
summary(enron_data01$to_messages)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 57.0 541.2 1211.0 2073.9 2634.8 15149.0 60
ggplot(enron_data01, aes(x = to_messages, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "to_messages",
x = "to_messages",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
summary(enron_data01$from_messages)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.00 22.75 41.00 608.79 145.50 14368.00 60
ggplot(enron_data01, aes(x = from_messages, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "from_messages",
x = "from_messages",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
summary(enron_data01$from_this_person_to_poi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 1.00 8.00 41.23 24.75 609.00 60
ggplot(enron_data01, aes(x = from_this_person_to_poi, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "from_this_person_to_poi",
x = "from_this_person_to_poi",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
summary(enron_data01$from_poi_to_this_person)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 10.00 35.00 64.90 72.25 528.00 60
ggplot(enron_data01, aes(x = from_poi_to_this_person, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "from_poi_to_this_person",
x = "from_poi_to_this_person",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
########################
ggplot(enron_data01, aes(x=salary, y=bonus)) + geom_point()
## Warning: Removed 64 rows containing missing values (`geom_point()`).
ggplot(enron_data01, aes(x=salary, fill=factor(poi))) + geom_histogram(bins=100) + labs(y="No. of transactions", title="Distribution of amount by poi", fill="poi") + facet_grid(poi~., scale="free_y") + theme(plot.title=element_text(hjust=0.5))
## Warning: Removed 51 rows containing non-finite values (`stat_bin()`).
In data mining, anomaly detection (also outlier detection) is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically the anomalous items will translate to some kind of problem such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are also referred to as outliers, novelties, noise, deviations and exceptions. In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular unsupervised methods) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro clusters formed by these patterns. Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as “normal” and “abnormal” and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the learnt model.
data5 <-enron_data01[,c("name","salary","bonus","poi","total_payments","total_stock_value")]
str(data5)
## 'data.frame': 146 obs. of 6 variables:
## $ name : chr "ALLEN PHILLIP K" "BADUM JAMES P" "BANNANTINE JAMES M" "BAXTER JOHN C" ...
## $ salary : num 201955 NaN 477 267102 239671 ...
## $ bonus : num 4175000 NaN NaN 1200000 400000 ...
## $ poi : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total_payments : num 4484442 182466 916197 5634343 827696 ...
## $ total_stock_value: num 1729541 257817 5243487 10623258 63014 ...
outlier1=subset(data5,salary>1000000)
outlier1
## name salary bonus poi total_payments total_stock_value
## 48 FREVERT MARK A 1060932 2000000 FALSE 17252530 14622185
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 123 SKILLING JEFFREY K 1111258 5600000 TRUE 8682716 26093672
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
outlier2=subset(data5,bonus>6000000)
outlier2
## name salary bonus poi total_payments total_stock_value
## 79 LAVORATO JOHN J 339288 8000000 FALSE 10425757 5167144
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
outlier3=subset(data5,total_payments>100000000)
outlier3
## name salary bonus poi total_payments total_stock_value
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
ggplot(enron_data01, aes(x=poi, y=salary)) + geom_boxplot()
ggplot(enron_data01, aes(x=poi, y=bonus)) + geom_boxplot()
non_poi = subset(enron_data01, enron_data01$poi=="False")
poi = subset(enron_data01, enron_data01$poi=="True")
dim(non_poi)
## [1] 0 20
print("summary of non_poi")
## [1] "summary of non_poi"
summary(non_poi)
## name salary to_messages total_payments loan_advances
## Length:0 Min. : NA Min. : NA Min. : NA Min. : NA
## Class :character 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Mode :character Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## bonus email_address deferred_income total_stock_value
## Min. : NA Length:0 Min. : NA Min. : NA
## 1st Qu.: NA Class :character 1st Qu.: NA 1st Qu.: NA
## Median : NA Mode :character Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## expenses from_poi_to_this_person exercised_stock_options from_messages
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## other from_this_person_to_poi poi long_term_incentive
## Min. : NA Min. : NA Mode:logical Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## shared_receipt_with_poi restricted_stock director_fees
## Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
head(non_poi)
## [1] name salary to_messages
## [4] total_payments loan_advances bonus
## [7] email_address deferred_income total_stock_value
## [10] expenses from_poi_to_this_person exercised_stock_options
## [13] from_messages other from_this_person_to_poi
## [16] poi long_term_incentive shared_receipt_with_poi
## [19] restricted_stock director_fees
## <0 rows> (or 0-length row.names)
dim(poi)
## [1] 0 20
print("summary of poi")
## [1] "summary of poi"
summary(poi)
## name salary to_messages total_payments loan_advances
## Length:0 Min. : NA Min. : NA Min. : NA Min. : NA
## Class :character 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Mode :character Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## bonus email_address deferred_income total_stock_value
## Min. : NA Length:0 Min. : NA Min. : NA
## 1st Qu.: NA Class :character 1st Qu.: NA 1st Qu.: NA
## Median : NA Mode :character Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## expenses from_poi_to_this_person exercised_stock_options from_messages
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## other from_this_person_to_poi poi long_term_incentive
## Min. : NA Min. : NA Mode:logical Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
## shared_receipt_with_poi restricted_stock director_fees
## Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA
head(poi)
## [1] name salary to_messages
## [4] total_payments loan_advances bonus
## [7] email_address deferred_income total_stock_value
## [10] expenses from_poi_to_this_person exercised_stock_options
## [13] from_messages other from_this_person_to_poi
## [16] poi long_term_incentive shared_receipt_with_poi
## [19] restricted_stock director_fees
## <0 rows> (or 0-length row.names)
non_poi_money = non_poi[c('salary','bonus','exercised_stock_options','total_stock_value')]
dim(non_poi_money)
## [1] 0 4
summary(non_poi_money)
## salary bonus exercised_stock_options total_stock_value
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
head(non_poi_money)
## [1] salary bonus exercised_stock_options
## [4] total_stock_value
## <0 rows> (or 0-length row.names)
poi_money = poi[c('salary','bonus','exercised_stock_options','total_stock_value','total_payments')]
dim(poi_money)
## [1] 0 5
summary(poi_money)
## salary bonus exercised_stock_options total_stock_value
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## total_payments
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
head(poi_money)
## [1] salary bonus exercised_stock_options
## [4] total_stock_value total_payments
## <0 rows> (or 0-length row.names)
non_poi_eamil = non_poi[c('shared_receipt_with_poi','to_messages','from_messages','from_this_person_to_poi','from_poi_to_this_person')]
dim(non_poi_eamil)
## [1] 0 5
summary(non_poi_eamil)
## shared_receipt_with_poi to_messages from_messages from_this_person_to_poi
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## from_poi_to_this_person
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
head(non_poi_eamil)
## [1] shared_receipt_with_poi to_messages from_messages
## [4] from_this_person_to_poi from_poi_to_this_person
## <0 rows> (or 0-length row.names)
poi_eamil = poi[c('shared_receipt_with_poi','to_messages','from_messages','from_this_person_to_poi','from_poi_to_this_person')]
dim(poi_eamil)
## [1] 0 5
summary(poi_eamil)
## shared_receipt_with_poi to_messages from_messages from_this_person_to_poi
## Min. : NA Min. : NA Min. : NA Min. : NA
## 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA 1st Qu.: NA
## Median : NA Median : NA Median : NA Median : NA
## Mean :NaN Mean :NaN Mean :NaN Mean :NaN
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.: NA
## Max. : NA Max. : NA Max. : NA Max. : NA
## from_poi_to_this_person
## Min. : NA
## 1st Qu.: NA
## Median : NA
## Mean :NaN
## 3rd Qu.: NA
## Max. : NA
head(poi_eamil)
## [1] shared_receipt_with_poi to_messages from_messages
## [4] from_this_person_to_poi from_poi_to_this_person
## <0 rows> (or 0-length row.names)
# added feature, fraction of e-mails to and from poi
enron_data01$fraction_to_poi = enron_data01$from_this_person_to_poi/enron_data01$from_messages
enron_data01$fraction_from_poi = enron_data01$from_poi_to_this_person/enron_data01$to_messages
# delete from_this_person_to_poi, from_messages, from_poi_to_this_person, to_messages
enron_data02 <- enron_data01[,c(-3,-14,-16,-22)]
colnames(enron_data02)
## [1] "name" "salary"
## [3] "total_payments" "loan_advances"
## [5] "bonus" "email_address"
## [7] "deferred_income" "total_stock_value"
## [9] "expenses" "from_poi_to_this_person"
## [11] "exercised_stock_options" "from_messages"
## [13] "from_this_person_to_poi" "long_term_incentive"
## [15] "shared_receipt_with_poi" "restricted_stock"
## [17] "director_fees" "fraction_to_poi"
dim(enron_data02)
## [1] 146 18
head(enron_data02)
## name salary total_payments loan_advances bonus
## 1 ALLEN PHILLIP K 201955 4484442 NaN 4175000
## 2 BADUM JAMES P NaN 182466 NaN NaN
## 3 BANNANTINE JAMES M 477 916197 NaN NaN
## 4 BAXTER JOHN C 267102 5634343 NaN 1200000
## 5 BAY FRANKLIN R 239671 827696 NaN 400000
## 6 BAZELIDES PHILIP J 80818 860136 NaN NaN
## email_address deferred_income total_stock_value expenses
## 1 phillip.allen@enron.com -3081055 1729541 13868
## 2 NaN NaN 257817 3486
## 3 james.bannantine@enron.com -5104 5243487 56301
## 4 NaN -1386055 10623258 11200
## 5 frank.bay@enron.com -201641 63014 129142
## 6 NaN NaN 1599641 NaN
## from_poi_to_this_person exercised_stock_options from_messages
## 1 47 1729541 2195
## 2 NaN 257817 NaN
## 3 39 4046157 29
## 4 NaN 6680544 NaN
## 5 NaN NaN NaN
## 6 NaN 1599641 NaN
## from_this_person_to_poi long_term_incentive shared_receipt_with_poi
## 1 65 304805 1407
## 2 NaN NaN NaN
## 3 0 NaN 465
## 4 NaN 1586055 NaN
## 5 NaN NaN NaN
## 6 NaN 93750 NaN
## restricted_stock director_fees fraction_to_poi
## 1 126027 NaN 0.02961276
## 2 NaN NaN NaN
## 3 1757552 NaN 0.00000000
## 4 3942714 NaN NaN
## 5 145796 NaN NaN
## 6 NaN NaN NaN
summary(enron_data01$fraction_to_poi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.01242 0.10057 0.18406 0.27204 1.00000 60
ggplot(enron_data01, aes(x = fraction_to_poi, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "fraction_to_poi",
x = "fraction_to_poi",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
summary(enron_data01$fraction_from_poi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00000 0.00920 0.02585 0.03796 0.05609 0.21734 60
ggplot(enron_data01, aes(x = fraction_from_poi, fill = poi)) +
geom_density(alpha = 0.5) +
labs(title = "fraction_from_poi",
x = "fraction_from_poi",
y = "Density",
col = "Class") +
theme_light()
## Warning: Removed 60 rows containing non-finite values (`stat_density()`).
data2 <-enron_data01[,c("bonus","fraction_from_poi","fraction_to_poi","salary","poi")]
ggpairs(data2,columns=1:5,aes(color=poi))+
ggtitle("Plot Martrix")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(enron_data01,aes(x=enron_data01$bonus,y=enron_data01$salary,color=enron_data01$poi))+geom_point(shape=1)+
geom_smooth(method=lm,se=FALSE,fullrange=TRUE)+
ggtitle("Salary/Bonus for POI and Non-POI")+
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
ggplot(enron_data01,aes(x=enron_data01$bonus,y=enron_data01$fraction_to_poi,color=enron_data01$poi))+geom_point(shape=1)+
geom_smooth(method=lm,se=FALSE,fullrange=TRUE)+
ggtitle("Bonus/Fraction_to_poi for POI and Non-POI")+
theme_bw()
## Warning: Use of `enron_data01$bonus` is discouraged.
## ℹ Use `bonus` instead.
## Warning: Use of `enron_data01$fraction_to_poi` is discouraged.
## ℹ Use `fraction_to_poi` instead.
## Warning: Use of `enron_data01$poi` is discouraged.
## ℹ Use `poi` instead.
## Warning: Use of `enron_data01$bonus` is discouraged.
## ℹ Use `bonus` instead.
## Warning: Use of `enron_data01$fraction_to_poi` is discouraged.
## ℹ Use `fraction_to_poi` instead.
## Warning: Use of `enron_data01$poi` is discouraged.
## ℹ Use `poi` instead.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 85 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 85 rows containing missing values (`geom_point()`).
enron_high_salary = subset(enron_data01, enron_data01$bonus>=5600000|enron_data01$salary>1000000)
enron_high_salary = enron_high_salary[c('name','salary','bonus','poi','restricted_stock')]
head(enron_high_salary)
## name salary bonus poi restricted_stock
## 48 FREVERT MARK A 1060932 2000000 FALSE 4188667
## 79 LAVORATO JOHN J 339288 8000000 FALSE 1008149
## 80 LAY KENNETH L 1072321 7000000 TRUE 14761694
## 123 SKILLING JEFFREY K 1111258 5600000 TRUE 6843672
## 131 TOTAL 26704229 97343619 FALSE 130322299
data5 <-enron_data01[,c("name","salary","bonus","poi","total_payments","total_stock_value")]
str(data5)
## 'data.frame': 146 obs. of 6 variables:
## $ name : chr "ALLEN PHILLIP K" "BADUM JAMES P" "BANNANTINE JAMES M" "BAXTER JOHN C" ...
## $ salary : num 201955 NaN 477 267102 239671 ...
## $ bonus : num 4175000 NaN NaN 1200000 400000 ...
## $ poi : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total_payments : num 4484442 182466 916197 5634343 827696 ...
## $ total_stock_value: num 1729541 257817 5243487 10623258 63014 ...
outlier1=subset(data5,salary>1000000)
outlier1
## name salary bonus poi total_payments total_stock_value
## 48 FREVERT MARK A 1060932 2000000 FALSE 17252530 14622185
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 123 SKILLING JEFFREY K 1111258 5600000 TRUE 8682716 26093672
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
outlier2=subset(data5,bonus>6000000)
outlier2
## name salary bonus poi total_payments total_stock_value
## 79 LAVORATO JOHN J 339288 8000000 FALSE 10425757 5167144
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
outlier3=subset(data5,total_payments>100000000)
outlier3
## name salary bonus poi total_payments total_stock_value
## 80 LAY KENNETH L 1072321 7000000 TRUE 103559793 49110078
## 131 TOTAL 26704229 97343619 FALSE 309886585 434509511
Data was split into 2 groups as the training, validation and testing set as mentioned before in the machine learning approach. After obtaining the initial result, the validation set was used to tune the hyper-parameters in order to improve the final scores. The testing set was then used to make the final evaluation. The percentages of training, validation and testing data was the same as the ones used for the machine learning approach.
# take out 'name','email_address', because of uesless
# take out 'loan_advances', 'restricted_stock_deferred','deferral_payments','director_fees', because of missing values take out all 'loan_advances' because of missing values
enron_data03 <- enron_data01[,c(-1,-3,-9,-12,-15)]
dim(enron_data03)
## [1] 146 17
enron_data03[is.na(enron_data03)] <- 0
enron_final = enron_data03
dim(enron_final)
## [1] 146 17
colnames(enron_final)
## [1] "salary" "total_payments"
## [3] "loan_advances" "bonus"
## [5] "email_address" "deferred_income"
## [7] "expenses" "from_poi_to_this_person"
## [9] "from_messages" "other"
## [11] "poi" "long_term_incentive"
## [13] "shared_receipt_with_poi" "restricted_stock"
## [15] "director_fees" "fraction_to_poi"
## [17] "fraction_from_poi"
enron_final$poi = as.factor(enron_final$poi)
train_index = createDataPartition(enron_final$poi, times = 1, p=0.8, list=F)
train_data = enron_final[train_index,]
test_data = enron_final[-train_index,]
dim(train_data)
## [1] 118 17
dim(test_data)
## [1] 28 17
head(train_data)
## salary total_payments loan_advances bonus email_address
## 1 201955 4484442 0 4175000 phillip.allen@enron.com
## 2 0 182466 0 0 NaN
## 4 267102 5634343 0 1200000 NaN
## 6 80818 860136 0 0 NaN
## 7 231330 969068 0 700000 sally.beck@enron.com
## 10 216582 228474 0 0 david.berberian@enron.com
## deferred_income expenses from_poi_to_this_person from_messages other poi
## 1 -3081055 13868 47 2195 152 FALSE
## 2 0 3486 0 0 0 FALSE
## 4 -1386055 11200 0 0 2660303 FALSE
## 6 0 0 0 0 874 FALSE
## 7 0 37172 144 4343 566 FALSE
## 10 0 11892 0 0 0 FALSE
## long_term_incentive shared_receipt_with_poi restricted_stock director_fees
## 1 304805 1407 126027 0
## 2 0 0 0 0
## 4 1586055 0 3942714 0
## 6 93750 0 0 0
## 7 0 2639 126027 0
## 10 0 0 869220 0
## fraction_to_poi fraction_from_poi
## 1 0.02961276 0.01619573
## 2 0.00000000 0.00000000
## 4 0.00000000 0.00000000
## 6 0.00000000 0.00000000
## 7 0.08887866 0.01968558
## 10 0.00000000 0.00000000
head(test_data)
## salary total_payments loan_advances bonus email_address
## 3 477 916197 0 0 james.bannantine@enron.com
## 5 239671 827696 0 400000 frank.bay@enron.com
## 8 213999 5501630 0 5249999 tim.belden@enron.com
## 9 0 102500 0 0 NaN
## 12 0 15456290 0 0 sanjay.bhatnagar@enron.com
## 15 0 1279 0 0 NaN
## deferred_income expenses from_poi_to_this_person from_messages other poi
## 3 -5104 56301 39 29 864523 FALSE
## 5 -201641 129142 0 0 69 FALSE
## 8 -2334434 17355 228 484 210698 TRUE
## 9 0 0 0 0 0 FALSE
## 12 0 0 0 29 137864 FALSE
## 15 -113784 1279 0 0 0 FALSE
## long_term_incentive shared_receipt_with_poi restricted_stock director_fees
## 3 0 465 1757552 0
## 5 0 0 145796 0
## 8 0 5521 157569 0
## 9 0 0 0 3285
## 12 0 463 -2604490 137864
## 15 0 0 0 113784
## fraction_to_poi fraction_from_poi
## 3 0.00000000 0.06890459
## 5 0.00000000 0.00000000
## 8 0.22314050 0.02853210
## 9 0.00000000 0.00000000
## 12 0.03448276 0.00000000
## 15 0.00000000 0.00000000
train_un = train_data
set.seed(167)
train_un = ovun.sample(poi ~.,data=train_data, p=0.5,method="both")$data
table(train_data$poi)
##
## FALSE TRUE
## 103 15
table(train_un$poi)
##
## FALSE TRUE
## 69 49
rpartTrue2=as.party(rpartTrue1)
plot(rpartTrue2)
tc <- rpart.control(minbucket=5,maxdepth=10,xval=5,cp=0.005)
fit <- rpart(poi ~ ., data=train_un, control="tc")
train.pred <- predict(fit, train_un, type="class")
table(train_un$poi == train.pred)['TRUE'] / length(train.pred)
## TRUE
## 1
table(train_un$poi,train.pred)
## train.pred
## FALSE TRUE
## FALSE 69 0
## TRUE 0 49
confusionMatrix(table(train.pred,train_un$poi))
## Confusion Matrix and Statistics
##
##
## train.pred FALSE TRUE
## FALSE 69 0
## TRUE 0 49
##
## Accuracy : 1
## 95% CI : (0.9692, 1)
## No Information Rate : 0.5847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.5847
## Detection Rate : 0.5847
## Detection Prevalence : 0.5847
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : FALSE
##
rpart.plot(fit, main="Decision Tree")
fit$cptable
## CP nsplit rel error xerror xstd
## 1 1.00 0 1 1.0000000 0.10924096
## 2 0.01 1 0 0.1632653 0.05573195
set.seed(100)
rf_train<-randomForest(as.factor(train_un$poi)~.,data=train_un,mtry=10,ntree=1000)
plot(rf_train)
legend(800,0.02,"poi=0",cex=0.9,bty="n")
legend(800,0.0245,"total",cex=0.09,bty="n")
model_rf = randomForest(poi ~.,data=train_un,proximity=TRUE, importance=TRUE)
model_rf
##
## Call:
## randomForest(formula = poi ~ ., data = train_un, proximity = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 2 47 0.04081633
MDSplot(model_rf,train_un$poi)
print(model_rf)
##
## Call:
## randomForest(formula = poi ~ ., data = train_un, proximity = TRUE, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 2 47 0.04081633
hist(treesize(model_rf))
First we will use Random forest Model, this model used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to achieve a single result. This model is easy to use and fit for the combination of numeric, categorical, and binomial data
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 7.63%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 5 44 0.10204082
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 66 3 0.04347826
## TRUE 2 47 0.04081633
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 3.39%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 67 2 0.02898551
## TRUE 2 47 0.04081633
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 3.39%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 66 3 0.04347826
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 3.39%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 66 3 0.04347826
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 10
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 4.24%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 65 4 0.05797101
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 13
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 64 5 0.07246377
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 64 5 0.07246377
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 15
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 64 5 0.07246377
## TRUE 1 48 0.02040816
##
## Call:
## randomForest(formula = as.factor(train_un$poi) ~ ., data = train_un, mtry = i, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 5.08%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 64 5 0.07246377
## TRUE 1 48 0.02040816
## [1] "error rate for each model:"
## [1] 0.07936989 0.04475855 0.03514126 0.03734847 0.03637603 0.03665674
## [7] 0.04035143 0.04142852 0.04097885 0.04118010 0.03850353 0.04045475
## [13] 0.04425658 0.04531762 0.04646065 0.04852949
model_rf$importance
## FALSE TRUE MeanDecreaseAccuracy
## salary 0.0094609033 0.0795693271 0.0382025152
## total_payments 0.0152936902 0.0449569347 0.0276266436
## loan_advances 0.0000000000 0.0002961722 0.0001339254
## bonus 0.0272229749 0.0637855810 0.0416452674
## email_address 0.0232422918 0.0414547217 0.0309175020
## deferred_income 0.0044627695 0.0195469221 0.0107459631
## expenses 0.0314268638 0.1341495316 0.0726393337
## from_poi_to_this_person 0.0053957219 0.0923907494 0.0412657995
## from_messages 0.0076624181 0.0296354463 0.0166844117
## other 0.0398681820 0.1613413321 0.0887573841
## long_term_incentive 0.0045953445 0.0235475406 0.0124110150
## shared_receipt_with_poi 0.0331545205 0.1423970175 0.0779919111
## restricted_stock 0.0084717730 0.0404278357 0.0217634604
## director_fees 0.0002557417 0.0007005495 0.0004127640
## fraction_to_poi 0.0173089668 0.1737256134 0.0818659304
## fraction_from_poi 0.0054062146 0.0713437150 0.0322117704
## MeanDecreaseGini
## salary 3.68982671
## total_payments 2.35070026
## loan_advances 0.04718456
## bonus 3.91176966
## email_address 2.63604355
## deferred_income 0.79714438
## expenses 7.13364717
## from_poi_to_this_person 3.62002181
## from_messages 1.85402465
## other 9.51575509
## long_term_incentive 1.12298897
## shared_receipt_with_poi 7.43702136
## restricted_stock 1.85552953
## director_fees 0.07130404
## fraction_to_poi 7.67541249
## fraction_from_poi 2.93835135
varImpPlot(model_rf, main = "variable importance")
The analysis of all emails has been performed and verified. The anomaly detection was performed using methods like local outlier factor and Isolation forests. It allowed detection of specific hours which could indicate malicious activity as tasked. Similarly social network analysis and main contributors to Enron’s social network were identified and visualized as shown.
Although the script I used to test the classifier implemented cross-validation, I was skeptical of the relatively high precision, recall, and F1 score recorded. I was conscious that I had somehow overfit my model to the data even though the script implements cross-validation. Looking through the tester.py script, I saw that the random seed for the cross-validation split was set at 42 in order to generate reproducible results. I changed the random seed and sure enough, the performance of my model decreased. Therefore, I must have made the classic mistake of overfitting on my training set for the given cross-validation random seed, and I will need to look out for this problem in the future. Even taking precuations against overfitting, I had still optimized my model for a specific set of data. In order to get a better indicator of the perfomance of the Decision Tree model, I ran 10 tests with different random seeds and found the average performance metrics.
The analysis done above provided very deep insights into the scandal and all associated data. This analysis can be taken further and be used with more complex combinations of anomaly detection methods and social network analysis. Special features can be extracted based on various classes of emails. Even content based features can be used for performing anomaly detection.