> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
> df = pd.read_csv('kyphosis.csv')
Kyphosis | Age | Number | Start |
---|---|---|---|
absent | 71 | 3 | 5 |
absent | 158 | 3 | 14 |
present | 128 | 4 | 5 |
absent | 2 | 5 | 1 |
absent | 1 | 4 | 15 |
absent | 1 | 2 | 16 |
Kyphosis
- abnormally curved spine
Age
- age in months (children)
Number
- number of vertebrae involved
Start
- number of the first (topmost) vertebra operated on
> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 4 columns):
Kyphosis 81 non-null object
Age 81 non-null int64
Number 81 non-null int64
Start 81 non-null int64
dtypes: int64(3), object(1)
memory usage: 2.7+ KB
> # sorted so the names line up with dtree.classes_ when passed to export_graphviz
+ target_names = sorted(df['Kyphosis'].unique())
+ target_names
['absent', 'present']
> sns.set_style('darkgrid')
+ g = sns.pairplot(df,hue='Kyphosis',palette='Set1')
+ g._legend.remove();
+ plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
+ plt.tight_layout()
+ plt.show()
> from sklearn.model_selection import train_test_split
> X = df.drop('Kyphosis',axis=1)
+ y = df['Kyphosis']
> X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=0.30)
> from sklearn.tree import DecisionTreeClassifier
> dtree = DecisionTreeClassifier()
> dtree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
> predictions = dtree.predict(X_test)
> from sklearn.metrics import classification_report,confusion_matrix
> print(classification_report(y_test,predictions))
precision recall f1-score support
absent 0.86 0.95 0.90 19
present 0.75 0.50 0.60 6
accuracy 0.84 25
macro avg 0.80 0.72 0.75 25
weighted avg 0.83 0.84 0.83 25
> print(confusion_matrix(y_test,predictions))
[[18 1]
[ 3 3]]
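The precision and recall figures in the classification report can be read directly off this confusion matrix. The following is a minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels, both in the order absent, present. The helper name `per_class_metrics` is hypothetical, and the matrix is copied from the output above.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels,
# both in the order ['absent', 'present'].
cm = [[18, 1],
      [3, 3]]
labels = ['absent', 'present']

def per_class_metrics(cm, labels):
    """Return {label: (precision, recall)} for a square confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_pos = cm[i][i]
        predicted_as_label = sum(row[i] for row in cm)  # column sum
        actually_label = sum(cm[i])                     # row sum
        metrics[label] = (true_pos / predicted_as_label,
                          true_pos / actually_label)
    return metrics

metrics = per_class_metrics(cm, labels)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

For the present class, 3 of the 4 predicted cases are correct (precision 0.75), but only 3 of the 6 actual cases are found (recall 0.50).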
> from IPython.display import Image
+ from io import StringIO
> from sklearn.tree import export_graphviz
+ import pydot
> features = list(df.columns[1:])
+ features
['Age', 'Number', 'Start']
> dot_data = StringIO()
+ export_graphviz(dtree, out_file=dot_data,
+ feature_names=features,
+ class_names=target_names,
+ filled=True,rounded=True)
+
+ graph = pydot.graph_from_dot_data(dot_data.getvalue())
+ Image(graph[0].create_png())
> graph[0].write_png("dtree.png")
> from sklearn.ensemble import RandomForestClassifier
+ rfc = RandomForestClassifier(n_estimators=100)
+ rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
> rfc_pred = rfc.predict(X_test)
> print(confusion_matrix(y_test,rfc_pred))
[[18 1]
[ 3 3]]
> print(classification_report(y_test,rfc_pred))
precision recall f1-score support
absent 0.86 0.95 0.90 19
present 0.75 0.50 0.60 6
accuracy 0.84 25
macro avg 0.80 0.72 0.75 25
weighted avg 0.83 0.84 0.83 25
We will be exploring publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors).
We will use lending data from 2007-2010 and try to classify and predict whether or not the borrower paid back their loan in full. You can download the data from here.
Here is what the columns represent:
credit.policy
: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose
: The purpose of the loan (takes values “credit_card”, “debt_consolidation”, “educational”, “home_improvement”, “major_purchase”, “small_business”, and “all_other”).
int.rate
: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment
: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc
: The natural log of the self-reported annual income of the borrower.
dti
: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico
: The FICO credit score of the borrower.
days.with.cr.line
: The number of days the borrower has had a credit line.
revol.bal
: The borrower’s revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util
: The borrower’s revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths
: The borrower’s number of inquiries by creditors in the last 6 months.
delinq.2yrs
: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec
: The borrower’s number of derogatory public records (bankruptcy filings, tax liens, or judgments).
> loans = pd.read_csv('loan_data.csv')
credit.policy | purpose | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | debt_consolidation | 0.1189 | 829.10 | 11.35041 | 19.48 | 737 | 5639.958 | 28854 | 52.1 | 0 | 0 | 0 | 0 |
1 | credit_card | 0.1071 | 228.22 | 11.08214 | 14.29 | 707 | 2760.000 | 33623 | 76.7 | 0 | 0 | 0 | 0 |
1 | debt_consolidation | 0.1357 | 366.86 | 10.37349 | 11.63 | 682 | 4710.000 | 3511 | 25.6 | 1 | 0 | 0 | 0 |
1 | debt_consolidation | 0.1008 | 162.34 | 11.35041 | 8.10 | 712 | 2699.958 | 33667 | 73.2 | 1 | 0 | 0 | 0 |
1 | credit_card | 0.1426 | 102.92 | 11.29973 | 14.97 | 667 | 4066.000 | 4740 | 39.5 | 0 | 1 | 0 | 0 |
1 | credit_card | 0.0788 | 125.13 | 11.90497 | 16.98 | 727 | 6120.042 | 50807 | 51.0 | 0 | 0 | 0 | 0 |
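Two of the columns in the rows above are stored in transformed form: log.annual.inc is a natural log and int.rate is a proportion. As a small illustration, using values copied from the first row, both can be converted back to more familiar units. The variable names are chosen here for illustration only.

```python
import math

# Values copied from the first row of the table above.
log_annual_inc = 11.35041
int_rate = 0.1189

annual_income = math.exp(log_annual_inc)  # undo the natural log
rate_percent = int_rate * 100             # proportion -> percent

print(f"annual income ~ {annual_income:,.0f}")
print(f"interest rate = {rate_percent:.2f}%")
```

For that borrower the self-reported income works out to roughly 85,000 per year.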
> loandesc = loans.describe()
statistic | credit.policy | int.rate | installment | log.annual.inc | dti | fico | days.with.cr.line | revol.bal | revol.util | inq.last.6mths | delinq.2yrs | pub.rec | not.fully.paid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 9578.0000000 | 9578.0000000 | 9578.0000 | 9578.0000000 | 9578.00000 | 9578.00000 | 9578.0000 | 9578.00 | 9578.00000 | 9578.000000 | 9578.0000000 | 9578.0000000 | 9578.0000000 |
mean | 0.8049697 | 0.1226401 | 319.0894 | 10.9321171 | 12.60668 | 710.84631 | 4560.7672 | 16913.96 | 46.79924 | 1.577469 | 0.1637085 | 0.0621215 | 0.1600543 |
std | 0.3962447 | 0.0268470 | 207.0713 | 0.6148128 | 6.88397 | 37.97054 | 2496.9304 | 33756.19 | 29.01442 | 2.200245 | 0.5462149 | 0.2621263 | 0.3666755 |
min | 0.0000000 | 0.0600000 | 15.6700 | 7.5475017 | 0.00000 | 612.00000 | 178.9583 | 0.00 | 0.00000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
25% | 1.0000000 | 0.1039000 | 163.7700 | 10.5584135 | 7.21250 | 682.00000 | 2820.0000 | 3187.00 | 22.60000 | 0.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
50% | 1.0000000 | 0.1221000 | 268.9500 | 10.9288836 | 12.66500 | 707.00000 | 4139.9583 | 8596.00 | 46.30000 | 1.000000 | 0.0000000 | 0.0000000 | 0.0000000 |
> loans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
credit.policy 9578 non-null int64
purpose 9578 non-null object
int.rate 9578 non-null float64
installment 9578 non-null float64
log.annual.inc 9578 non-null float64
dti 9578 non-null float64
fico 9578 non-null int64
days.with.cr.line 9578 non-null float64
revol.bal 9578 non-null int64
revol.util 9578 non-null float64
inq.last.6mths 9578 non-null int64
delinq.2yrs 9578 non-null int64
pub.rec 9578 non-null int64
not.fully.paid 9578 non-null int64
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
Histograms of the FICO score, drawn on top of each other, one for each credit.policy outcome.
> sns.set_style('darkgrid')
+ plt.figure(figsize=(10,6))
+ loans[loans['credit.policy']==1]['fico'].hist(
+ alpha=0.7,color='yellow',
+ bins=30,label='Credit.Policy=1');
+ loans[loans['credit.policy']==0]['fico'].hist(
+ alpha=0.5,color='black',
+ bins=30,label='Credit.Policy=0');
+ plt.legend();
+ plt.xlabel('FICO');
+ plt.show()
Histograms of the FICO score, drawn on top of each other, one for each value of the not.fully.paid column.
> plt.figure(figsize=(10,6))
+
+ loans[loans['not.fully.paid']==0]['fico'].hist(
+ alpha=0.5,color='black',
+ bins=30,label='not.fully.paid=0');
+ loans[loans['not.fully.paid']==1]['fico'].hist(
+ alpha=0.7,color='yellow',
+ bins=30,label='not.fully.paid=1');
+ plt.legend();
+ plt.xlabel('FICO');
+ plt.show()
A countplot of the number of loans by purpose, with the color hue defined by not.fully.paid.
> plt.figure(figsize=(11,7))
+ sns.countplot(x='purpose',hue='not.fully.paid',
+ data=loans,palette='rainbow');
+ plt.show()
> sns.jointplot(x='fico',y='int.rate',
+ data=loans,color='purple');
+ plt.show()
lmplots of int.rate against fico, to see whether the trend differs between not.fully.paid and credit.policy.
> plt.figure(figsize=(11,7))
+ sns.lmplot(y='int.rate',x='fico',
+ data=loans,hue='credit.policy',
+ col='not.fully.paid',palette='Set1');
+ plt.show()
Notice that the purpose column is categorical.
We need to transform it using dummy variables. We’ll do this in one clean step using pd.get_dummies.
Create a list containing the single string 'purpose' and call it cat_feats.
> cat_feats = ['purpose']
Use pd.get_dummies(loans,columns=cat_feats,drop_first=True) to create a larger dataframe that has new feature columns with dummy variables. Set this dataframe as final_data.
> final_data = pd.get_dummies(loans,
+ columns=cat_feats,drop_first=True)
> final_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 19 columns):
credit.policy 9578 non-null int64
int.rate 9578 non-null float64
installment 9578 non-null float64
log.annual.inc 9578 non-null float64
dti 9578 non-null float64
fico 9578 non-null int64
days.with.cr.line 9578 non-null float64
revol.bal 9578 non-null int64
revol.util 9578 non-null float64
inq.last.6mths 9578 non-null int64
delinq.2yrs 9578 non-null int64
pub.rec 9578 non-null int64
not.fully.paid 9578 non-null int64
purpose_credit_card 9578 non-null uint8
purpose_debt_consolidation 9578 non-null uint8
purpose_educational 9578 non-null uint8
purpose_home_improvement 9578 non-null uint8
purpose_major_purchase 9578 non-null uint8
purpose_small_business 9578 non-null uint8
dtypes: float64(6), int64(7), uint8(6)
memory usage: 1.0 MB
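The info output above lists six purpose_ columns and no purpose_all_other, even though the data has seven purpose values. That is the effect of drop_first=True, which drops the first level (in sorted order) of each encoded column. The sketch below shows the same behavior on a toy frame; the frame and its values are made up for illustration and are not the loan data.

```python
import pandas as pd

# Toy frame with three hypothetical purpose values (not the real loan data).
toy = pd.DataFrame({
    'purpose': ['all_other', 'credit_card', 'educational', 'credit_card'],
    'fico': [700, 710, 690, 720],
})

# One dummy column per category.
full = pd.get_dummies(toy, columns=['purpose'])

# Same encoding, but the first category in sorted order is dropped.
reduced = pd.get_dummies(toy, columns=['purpose'], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

A row with zeros in every remaining purpose_ column then stands for the dropped category, so no information is lost and the dummy columns are no longer perfectly collinear.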
not.fully.paid is the dependent (target) variable.
> X = final_data.drop('not.fully.paid',axis=1)
+ y = final_data['not.fully.paid']
+ X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=0.30, random_state=101)
> dtree = DecisionTreeClassifier()
> dtree.fit(X_train,y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
> predictions = dtree.predict(X_test)
> print(classification_report(y_test,predictions))
precision recall f1-score support
0 0.85 0.82 0.84 2431
1 0.19 0.23 0.21 443
accuracy 0.73 2874
macro avg 0.52 0.53 0.52 2874
weighted avg 0.75 0.73 0.74 2874
> print(confusion_matrix(y_test,predictions))
[[2000 431]
[ 342 101]]
> rfc = RandomForestClassifier(n_estimators=600)
> rfc.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=600,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
> predictions = rfc.predict(X_test)
> print(classification_report(y_test,predictions))
precision recall f1-score support
0 0.85 1.00 0.92 2431
1 0.53 0.02 0.03 443
accuracy 0.85 2874
macro avg 0.69 0.51 0.48 2874
weighted avg 0.80 0.85 0.78 2874
> print(confusion_matrix(y_test,predictions))
[[2424 7]
[ 435 8]]
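The accuracy figures above are easier to judge next to a trivial baseline. The sketch below recomputes accuracy and class-1 recall from the two confusion matrices printed above and compares them with a hypothetical model that always predicts 0 (fully paid). The helper names are illustrative, and rows are assumed to be true classes and columns predicted classes.

```python
# Confusion matrices copied from the outputs above
# (rows: true class 0/1, columns: predicted class 0/1).
tree_cm = [[2000, 431], [342, 101]]
forest_cm = [[2424, 7], [435, 8]]

def accuracy(cm):
    """Fraction of all test loans that were classified correctly."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

def recall_positive(cm):
    """Share of truly not-fully-paid loans (class 1) that the model flagged."""
    return cm[1][1] / sum(cm[1])

# Hypothetical baseline that always predicts class 0 (fully paid).
n_negative = sum(forest_cm[0])
n_total = sum(sum(row) for row in forest_cm)
baseline_accuracy = n_negative / n_total

for name, cm in [('decision tree', tree_cm), ('random forest', forest_cm)]:
    print(f"{name}: accuracy={accuracy(cm):.3f} "
          f"recall(1)={recall_positive(cm):.3f}")
print(f"always-0 baseline: accuracy={baseline_accuracy:.3f}")
```

On these numbers the random forest's accuracy is almost the same as the always-0 baseline, because it flags only 8 of the 443 loans that were not fully paid. The single decision tree is less accurate overall but catches 101 of them.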