  • Decision Trees
    • The Data
    • Exploratory Data Analysis
    • Train Test Split
    • Decision Trees
    • Prediction and Evaluation
    • Tree Visualization
  • Random Forests
  • Example - Decision Tree
    • The Data
    • Exploratory Data Analysis
    • Categorical Features
    • Train Test Split
    • Training the Model
    • Predictions and Evaluation
  • Example - Random Forest
    • Predictions and Evaluation

Decision Trees


> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Data

> df = pd.read_csv('kyphosis.csv')
+ df.head()
  Kyphosis  Age  Number  Start
0   absent   71       3      5
1   absent  158       3     14
2  present  128       4      5
3   absent    2       5      1
4   absent    1       4     15
  • Kyphosis - whether an abnormal curvature of the spine was present after the operation
  • Age - age in months (the patients are children)
  • Number - the number of vertebrae involved
  • Start - the number of the topmost vertebra operated on
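The target is binary, and the evaluation further below (a support of 19 absent vs. 6 present cases in the test set) suggests the classes are imbalanced; a quick check worth running:
> df['Kyphosis'].value_counts()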
> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
Kyphosis    81 non-null object
Age         81 non-null int64
Number      81 non-null int64
Start       81 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.7+ KB
> target_names = list(df['Kyphosis'].unique())
+ target_names
['absent', 'present']

Exploratory Data Analysis

> sns.set_style('darkgrid')
+ g = sns.pairplot(df,hue='Kyphosis',palette='Set1')
+ # pairplot attaches its own figure-level legend; remove it and draw a
+ # regular axes legend so its position can be controlled explicitly
+ g._legend.remove();
+ plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
+ plt.tight_layout()
+ plt.show()

Train Test Split

> from sklearn.model_selection import train_test_split
> X = df.drop('Kyphosis',axis=1)
+ y = df['Kyphosis']
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.30)
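Note that no random_state is set here, so the split (and every number reported below) will vary from run to run. For reproducibility one could seed the split, e.g. (the seed value is an arbitrary choice):
> X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.30, random_state=101)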

Decision Trees

> from sklearn.tree import DecisionTreeClassifier
> dtree = DecisionTreeClassifier()
> dtree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

Prediction and Evaluation

> predictions = dtree.predict(X_test)
> from sklearn.metrics import classification_report,confusion_matrix
> print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

      absent       0.86      0.95      0.90        19
     present       0.75      0.50      0.60         6

    accuracy                           0.84        25
   macro avg       0.80      0.72      0.75        25
weighted avg       0.83      0.84      0.83        25
> print(confusion_matrix(y_test,predictions))
[[18  1]
 [ 3  3]]
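scikit-learn orders the matrix by sorted label (here ['absent', 'present']), with rows as actual classes and columns as predicted. Wrapping it in a DataFrame makes that explicit; a small sketch:
> pd.DataFrame(confusion_matrix(y_test,predictions),
+     index=target_names,columns=target_names)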

Tree Visualization

> from IPython.display import Image
+ from io import StringIO  # sklearn.externals.six is deprecated; io.StringIO works the same here
> from sklearn.tree import export_graphviz
+ import pydot
> features = list(df.columns[1:])
+ features
['Age', 'Number', 'Start']
> dot_data = StringIO()
+ export_graphviz(dtree, out_file=dot_data,
+                 feature_names=features,
+                 class_names=target_names,
+                 filled=True,rounded=True)
+ 
+ # pydot.graph_from_dot_data returns a list of graphs
+ graph = pydot.graph_from_dot_data(dot_data.getvalue())
+ Image(graph[0].create_png())
> graph[0].write_png("dtree.png")
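If Graphviz and pydot are not available, scikit-learn 0.21+ ships a matplotlib-only renderer; a minimal sketch using sklearn.tree.plot_tree:
> from sklearn.tree import plot_tree
> plt.figure(figsize=(12,8))
+ plot_tree(dtree,feature_names=features,
+     class_names=target_names,
+     filled=True,rounded=True);
+ plt.show()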

Random Forests


> from sklearn.ensemble import RandomForestClassifier
+ rfc = RandomForestClassifier(n_estimators=100)
+ rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
> rfc_pred = rfc.predict(X_test)
> print(confusion_matrix(y_test,rfc_pred))
[[18  1]
 [ 3  3]]
> print(classification_report(y_test,rfc_pred))
              precision    recall  f1-score   support

      absent       0.86      0.95      0.90        19
     present       0.75      0.50      0.60         6

    accuracy                           0.84        25
   macro avg       0.80      0.72      0.75        25
weighted avg       0.83      0.84      0.83        25
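As an aside, the fitted forest exposes per-feature importance scores (mean impurity decrease); a quick sketch using the rfc fitted above:
> importances = pd.Series(rfc.feature_importances_, index=X.columns)
+ importances.sort_values(ascending=False)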

Example - Decision Tree


We will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors).

We will use lending data from 2007-2010 and try to predict whether or not the borrower paid back their loan in full. You can download the data from here.

Here are what the columns represent:

  • credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
  • purpose: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “home_improvement”, “major_purchase”, “small_business”, and “all_other”).
  • int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
  • installment: The monthly installments owed by the borrower if the loan is funded.
  • log.annual.inc: The natural log of the self-reported annual income of the borrower.
  • dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  • fico: The FICO credit score of the borrower.
  • days.with.cr.line: The number of days the borrower has had a credit line.
  • revol.bal: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
  • revol.util: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
  • inq.last.6mths: The borrower’s number of inquiries by creditors in the last 6 months.
  • delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  • pub.rec: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).

The Data

> loans = pd.read_csv('loan_data.csv')
+ loans.head()
   credit.policy             purpose  int.rate  installment  log.annual.inc    dti  fico  days.with.cr.line  revol.bal  revol.util  inq.last.6mths  delinq.2yrs  pub.rec  not.fully.paid
0              1  debt_consolidation    0.1189       829.10        11.35041  19.48   737           5639.958      28854        52.1               0            0        0               0
1              1         credit_card    0.1071       228.22        11.08214  14.29   707           2760.000      33623        76.7               0            0        0               0
2              1  debt_consolidation    0.1357       366.86        10.37349  11.63   682           4710.000       3511        25.6               1            0        0               0
3              1  debt_consolidation    0.1008       162.34        11.35041   8.10   712           2699.958      33667        73.2               1            0        0               0
4              1         credit_card    0.1426       102.92        11.29973  14.97   667           4066.000       4740        39.5               0            1        0               0
> loans.describe().transpose()
                     count        mean        std         min         25%         50%  ...
credit.policy       9578.0   0.8049697  0.3962447   0.0000000   1.0000000   1.0000000  ...
int.rate            9578.0   0.1226401  0.0268470   0.0600000   0.1039000   0.1221000  ...
installment         9578.0  319.0894    207.0713   15.6700    163.7700    268.9500     ...
log.annual.inc      9578.0  10.9321171  0.6148128   7.5475017  10.5584135  10.9288836  ...
dti                 9578.0  12.60668    6.88397     0.00000     7.21250    12.66500    ...
fico                9578.0  710.84631   37.97054   612.00000  682.00000   707.00000    ...
days.with.cr.line   9578.0  4560.7672   2496.9304  178.9583   2820.0000   4139.9583    ...
revol.bal           9578.0  16913.96    33756.19     0.00     3187.00     8596.00      ...
revol.util          9578.0  46.79924    29.01442     0.00000    22.60000    46.30000   ...
inq.last.6mths      9578.0   1.577469   2.200245     0.000000    0.000000    1.000000  ...
delinq.2yrs         9578.0   0.1637085  0.5462149    0.0000000   0.0000000   0.0000000 ...
pub.rec             9578.0   0.0621215  0.2621263    0.0000000   0.0000000   0.0000000 ...
not.fully.paid      9578.0   0.1600543  0.3666755    0.0000000   0.0000000   0.0000000 ...
> loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy        9578 non-null int64
purpose              9578 non-null object
int.rate             9578 non-null float64
installment          9578 non-null float64
log.annual.inc       9578 non-null float64
dti                  9578 non-null float64
fico                 9578 non-null int64
days.with.cr.line    9578 non-null float64
revol.bal            9578 non-null int64
revol.util           9578 non-null float64
inq.last.6mths       9578 non-null int64
delinq.2yrs          9578 non-null int64
pub.rec              9578 non-null int64
not.fully.paid       9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
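The summary statistics above already hint at class imbalance: the mean of not.fully.paid is about 0.16, i.e. roughly 16% positives (about 1533 of the 9578 loans). A direct check:
> loans['not.fully.paid'].value_counts()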

Exploratory Data Analysis

  • Create a histogram of two FICO distributions on top of each other, one for each credit.policy outcome.
> sns.set_style('darkgrid')
+ plt.figure(figsize=(10,6))
+ loans[loans['credit.policy']==1]['fico'].hist(
+     alpha=0.7,color='yellow',
+     bins=30,label='Credit.Policy=1');
+ loans[loans['credit.policy']==0]['fico'].hist(
+       alpha=0.5,color='black',
+       bins=30,label='Credit.Policy=0');
+ plt.legend();
+ plt.xlabel('FICO');
+ plt.show()

  • Create a similar figure, except this time select by the not.fully.paid column.
> plt.figure(figsize=(10,6))
+ 
+ loans[loans['not.fully.paid']==0]['fico'].hist(
+     alpha=0.5,color='black',
+     bins=30,label='not.fully.paid=0');
+ loans[loans['not.fully.paid']==1]['fico'].hist(
+     alpha=0.7,color='yellow',
+     bins=30,label='not.fully.paid=1');
+ plt.legend();
+ plt.xlabel('FICO');
+ plt.show()

  • Create a countplot using seaborn showing the counts of loans by purpose, with the color hue defined by not.fully.paid.
> plt.figure(figsize=(11,7))
+ sns.countplot(x='purpose',hue='not.fully.paid',
+   data=loans,palette='rainbow');
+ plt.show()

  • Let’s see the trend between FICO score and interest rate.
> sns.jointplot(x='fico',y='int.rate',
+   data=loans,color='purple');
+ plt.show()

  • Create lmplots to see if the trend differed between not.fully.paid and credit.policy.
> # lmplot is figure-level and creates its own grid, so no plt.figure() is needed
+ sns.lmplot(y='int.rate',x='fico',
+     data=loans,hue='credit.policy',
+     col='not.fully.paid',palette='Set1');
+ plt.show()

Categorical Features

Notice that the purpose column is categorical.

We need to transform it using dummy variables. We’ll do this in one clean step using pd.get_dummies.

  • Create a list of 1 element containing the string ‘purpose’. Call this list cat_feats.
> cat_feats = ['purpose']
  • Now use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a new, larger dataframe in which purpose is replaced by dummy-variable columns. Set this dataframe as final_data.
> final_data = pd.get_dummies(loans,
+     columns=cat_feats,drop_first=True)
> final_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
credit.policy                 9578 non-null int64
int.rate                      9578 non-null float64
installment                   9578 non-null float64
log.annual.inc                9578 non-null float64
dti                           9578 non-null float64
fico                          9578 non-null int64
days.with.cr.line             9578 non-null float64
revol.bal                     9578 non-null int64
revol.util                    9578 non-null float64
inq.last.6mths                9578 non-null int64
delinq.2yrs                   9578 non-null int64
pub.rec                       9578 non-null int64
not.fully.paid                9578 non-null int64
purpose_credit_card           9578 non-null uint8
purpose_debt_consolidation    9578 non-null uint8
purpose_educational           9578 non-null uint8
purpose_home_improvement      9578 non-null uint8
purpose_major_purchase        9578 non-null uint8
purpose_small_business        9578 non-null uint8
dtypes: float64(6), int64(7), uint8(6)
memory usage: 1.0 MB
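Note that purpose has seven categories but only six dummy columns appear above: with drop_first=True the first category (alphabetically, all_other) gets no column and is encoded implicitly as all six dummies equal to zero. A toy sketch of the same mechanic on a hypothetical mini-frame:
> demo = pd.DataFrame({'purpose':['all_other','credit_card','educational']})
+ pd.get_dummies(demo,columns=cat_feats,drop_first=True)
   purpose_credit_card  purpose_educational
0                    0                    0
1                    1                    0
2                    0                    1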

Train Test Split

  • Use sklearn to split your data into a training set and a testing set

not.fully.paid is the dependent (target) variable.

> X = final_data.drop('not.fully.paid',axis=1)
+ y = final_data['not.fully.paid']
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.30, random_state=101)

Training the Model

> dtree = DecisionTreeClassifier()
> dtree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
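A default tree grows until its leaves are pure, which on 9,000+ rows tends to overfit. One optional variation (the depth value is an arbitrary illustration, not part of the original run) is to cap its depth:
> dtree_shallow = DecisionTreeClassifier(max_depth=5)
+ dtree_shallow.fit(X_train,y_train)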

Predictions and Evaluation

> predictions = dtree.predict(X_test)
> print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.85      0.82      0.84      2431
           1       0.19      0.23      0.21       443

    accuracy                           0.73      2874
   macro avg       0.52      0.53      0.52      2874
weighted avg       0.75      0.73      0.74      2874
> print(confusion_matrix(y_test,predictions))
[[2000  431]
 [ 342  101]]
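For context: the test set contains 2431 negatives out of 2874 cases, so a trivial model that always predicts 0 would already score about 0.85 accuracy, above the tree's 0.73. A one-line check of that baseline:
> print((y_test == 0).mean())  # accuracy of always predicting the majority class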

Example - Random Forest


  • Create an instance of the RandomForestClassifier class and fit it to our training data from the previous step.
> rfc = RandomForestClassifier(n_estimators=600)
> rfc.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=600,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

Predictions and Evaluation

  • Predict the class of not.fully.paid for the X_test data.
> predictions = rfc.predict(X_test)
> print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.85      1.00      0.92      2431
           1       0.53      0.02      0.03       443

    accuracy                           0.85      2874
   macro avg       0.69      0.51      0.48      2874
weighted avg       0.80      0.85      0.78      2874
> print(confusion_matrix(y_test,predictions))
[[2424    7]
 [ 435    8]]
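Neither model wins outright: the forest has the better accuracy (0.85 vs. 0.73), but it recovers almost none of the risky loans (recall of 0.02 vs. 0.23 on class 1), which is arguably the class a lender cares about. To pull that single number out directly:
> from sklearn.metrics import recall_score
> print(recall_score(y_test,predictions,pos_label=1))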