1 Initial information

Here we reproduce the code from the video by Data Professor (Chanin Nantasenamat): https://youtu.be/XmSlFPDjKdc.
The original Python code, written in Jupyter Lab, is at https://github.com/dataprofessor/code/tree/master/python/iris.

You can download the file iris-classification-random-forest.ipynb into the project folder. Here, however, we copy and paste its code into this RMarkdown document in order to reproduce the classification routine with the Iris dataset. I recommend opening https://github.com/dataprofessor/code/blob/master/python/iris/iris-classification-random-forest.ipynb so that you can see each block of code to be used in the Rmd chunks.

2 Import libraries and packages

This assumes the reader already has the required Python packages installed.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

3 Load the iris dataset

The dataset is included in the datasets module imported above.

iris = datasets.load_iris()

4 Input features

The iris dataset contains 4 input features and 1 output variable (the class label).

# Input features
print(iris.feature_names)
## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# Output class names
print(iris.target_names)
## ['setosa' 'versicolor' 'virginica']

5 A first look at the data

5.1 Input features

iris.data
## array([[5.1, 3.5, 1.4, 0.2],
##        [4.9, 3. , 1.4, 0.2],
##        [4.7, 3.2, 1.3, 0.2],
##        [4.6, 3.1, 1.5, 0.2],
##        [5. , 3.6, 1.4, 0.2],
##        [5.4, 3.9, 1.7, 0.4],
##        [4.6, 3.4, 1.4, 0.3],
##        [5. , 3.4, 1.5, 0.2],
##        [4.4, 2.9, 1.4, 0.2],
##        [4.9, 3.1, 1.5, 0.1],
##        [5.4, 3.7, 1.5, 0.2],
##        [4.8, 3.4, 1.6, 0.2],
##        [4.8, 3. , 1.4, 0.1],
##        [4.3, 3. , 1.1, 0.1],
##        [5.8, 4. , 1.2, 0.2],
##        [5.7, 4.4, 1.5, 0.4],
##        [5.4, 3.9, 1.3, 0.4],
##        [5.1, 3.5, 1.4, 0.3],
##        [5.7, 3.8, 1.7, 0.3],
##        [5.1, 3.8, 1.5, 0.3],
##        [5.4, 3.4, 1.7, 0.2],
##        [5.1, 3.7, 1.5, 0.4],
##        [4.6, 3.6, 1. , 0.2],
##        [5.1, 3.3, 1.7, 0.5],
##        [4.8, 3.4, 1.9, 0.2],
##        [5. , 3. , 1.6, 0.2],
##        [5. , 3.4, 1.6, 0.4],
##        [5.2, 3.5, 1.5, 0.2],
##        [5.2, 3.4, 1.4, 0.2],
##        [4.7, 3.2, 1.6, 0.2],
##        [4.8, 3.1, 1.6, 0.2],
##        [5.4, 3.4, 1.5, 0.4],
##        [5.2, 4.1, 1.5, 0.1],
##        [5.5, 4.2, 1.4, 0.2],
##        [4.9, 3.1, 1.5, 0.2],
##        [5. , 3.2, 1.2, 0.2],
##        [5.5, 3.5, 1.3, 0.2],
##        [4.9, 3.6, 1.4, 0.1],
##        [4.4, 3. , 1.3, 0.2],
##        [5.1, 3.4, 1.5, 0.2],
##        [5. , 3.5, 1.3, 0.3],
##        [4.5, 2.3, 1.3, 0.3],
##        [4.4, 3.2, 1.3, 0.2],
##        [5. , 3.5, 1.6, 0.6],
##        [5.1, 3.8, 1.9, 0.4],
##        [4.8, 3. , 1.4, 0.3],
##        [5.1, 3.8, 1.6, 0.2],
##        [4.6, 3.2, 1.4, 0.2],
##        [5.3, 3.7, 1.5, 0.2],
##        [5. , 3.3, 1.4, 0.2],
##        [7. , 3.2, 4.7, 1.4],
##        [6.4, 3.2, 4.5, 1.5],
##        [6.9, 3.1, 4.9, 1.5],
##        [5.5, 2.3, 4. , 1.3],
##        [6.5, 2.8, 4.6, 1.5],
##        [5.7, 2.8, 4.5, 1.3],
##        [6.3, 3.3, 4.7, 1.6],
##        [4.9, 2.4, 3.3, 1. ],
##        [6.6, 2.9, 4.6, 1.3],
##        [5.2, 2.7, 3.9, 1.4],
##        [5. , 2. , 3.5, 1. ],
##        [5.9, 3. , 4.2, 1.5],
##        [6. , 2.2, 4. , 1. ],
##        [6.1, 2.9, 4.7, 1.4],
##        [5.6, 2.9, 3.6, 1.3],
##        [6.7, 3.1, 4.4, 1.4],
##        [5.6, 3. , 4.5, 1.5],
##        [5.8, 2.7, 4.1, 1. ],
##        [6.2, 2.2, 4.5, 1.5],
##        [5.6, 2.5, 3.9, 1.1],
##        [5.9, 3.2, 4.8, 1.8],
##        [6.1, 2.8, 4. , 1.3],
##        [6.3, 2.5, 4.9, 1.5],
##        [6.1, 2.8, 4.7, 1.2],
##        [6.4, 2.9, 4.3, 1.3],
##        [6.6, 3. , 4.4, 1.4],
##        [6.8, 2.8, 4.8, 1.4],
##        [6.7, 3. , 5. , 1.7],
##        [6. , 2.9, 4.5, 1.5],
##        [5.7, 2.6, 3.5, 1. ],
##        [5.5, 2.4, 3.8, 1.1],
##        [5.5, 2.4, 3.7, 1. ],
##        [5.8, 2.7, 3.9, 1.2],
##        [6. , 2.7, 5.1, 1.6],
##        [5.4, 3. , 4.5, 1.5],
##        [6. , 3.4, 4.5, 1.6],
##        [6.7, 3.1, 4.7, 1.5],
##        [6.3, 2.3, 4.4, 1.3],
##        [5.6, 3. , 4.1, 1.3],
##        [5.5, 2.5, 4. , 1.3],
##        [5.5, 2.6, 4.4, 1.2],
##        [6.1, 3. , 4.6, 1.4],
##        [5.8, 2.6, 4. , 1.2],
##        [5. , 2.3, 3.3, 1. ],
##        [5.6, 2.7, 4.2, 1.3],
##        [5.7, 3. , 4.2, 1.2],
##        [5.7, 2.9, 4.2, 1.3],
##        [6.2, 2.9, 4.3, 1.3],
##        [5.1, 2.5, 3. , 1.1],
##        [5.7, 2.8, 4.1, 1.3],
##        [6.3, 3.3, 6. , 2.5],
##        [5.8, 2.7, 5.1, 1.9],
##        [7.1, 3. , 5.9, 2.1],
##        [6.3, 2.9, 5.6, 1.8],
##        [6.5, 3. , 5.8, 2.2],
##        [7.6, 3. , 6.6, 2.1],
##        [4.9, 2.5, 4.5, 1.7],
##        [7.3, 2.9, 6.3, 1.8],
##        [6.7, 2.5, 5.8, 1.8],
##        [7.2, 3.6, 6.1, 2.5],
##        [6.5, 3.2, 5.1, 2. ],
##        [6.4, 2.7, 5.3, 1.9],
##        [6.8, 3. , 5.5, 2.1],
##        [5.7, 2.5, 5. , 2. ],
##        [5.8, 2.8, 5.1, 2.4],
##        [6.4, 3.2, 5.3, 2.3],
##        [6.5, 3. , 5.5, 1.8],
##        [7.7, 3.8, 6.7, 2.2],
##        [7.7, 2.6, 6.9, 2.3],
##        [6. , 2.2, 5. , 1.5],
##        [6.9, 3.2, 5.7, 2.3],
##        [5.6, 2.8, 4.9, 2. ],
##        [7.7, 2.8, 6.7, 2. ],
##        [6.3, 2.7, 4.9, 1.8],
##        [6.7, 3.3, 5.7, 2.1],
##        [7.2, 3.2, 6. , 1.8],
##        [6.2, 2.8, 4.8, 1.8],
##        [6.1, 3. , 4.9, 1.8],
##        [6.4, 2.8, 5.6, 2.1],
##        [7.2, 3. , 5.8, 1.6],
##        [7.4, 2.8, 6.1, 1.9],
##        [7.9, 3.8, 6.4, 2. ],
##        [6.4, 2.8, 5.6, 2.2],
##        [6.3, 2.8, 5.1, 1.5],
##        [6.1, 2.6, 5.6, 1.4],
##        [7.7, 3. , 6.1, 2.3],
##        [6.3, 3.4, 5.6, 2.4],
##        [6.4, 3.1, 5.5, 1.8],
##        [6. , 3. , 4.8, 1.8],
##        [6.9, 3.1, 5.4, 2.1],
##        [6.7, 3.1, 5.6, 2.4],
##        [6.9, 3.1, 5.1, 2.3],
##        [5.8, 2.7, 5.1, 1.9],
##        [6.8, 3.2, 5.9, 2.3],
##        [6.7, 3.3, 5.7, 2.5],
##        [6.7, 3. , 5.2, 2.3],
##        [6.3, 2.5, 5. , 1.9],
##        [6.5, 3. , 5.2, 2. ],
##        [6.2, 3.4, 5.4, 2.3],
##        [5.9, 3. , 5.1, 1.8]])

5.2 Output variable (the class label)

iris.target
## array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
##        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
##        0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
##        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
##        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
##        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

5.3 Defining the input and output variables

X holds the 4 input variables and Y holds the output variable (the class label).

X = iris.data
Y = iris.target

5.4 Checking the dimensions of X and Y

X.shape
## (150, 4)
Y.shape
## (150,)

6 Random Forest classification model

clf = RandomForestClassifier()
clf.fit(X, Y)
## RandomForestClassifier()

7 Feature importance in explaining Y

print(clf.feature_importances_)
## [0.10283932 0.02174083 0.4636447  0.41177515]
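
The importances are printed in the same order as iris.feature_names, so pairing the two makes the ranking easier to read. The sketch below hard-codes the values printed above instead of refitting the model; a fresh fit would likely give slightly different numbers, because RandomForestClassifier() is created here without a fixed random_state. The variable names are illustrative.

```python
# Feature names and importances copied from the outputs printed above.
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
importances = [0.10283932, 0.02174083, 0.4636447, 0.41177515]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In this run the two petal measurements account for roughly 88% of the total importance.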

8 Prediction

X[0]
## array([5.1, 3.5, 1.4, 0.2])
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))  # result 0 corresponds to setosa
## [0]
print(clf.predict(X[[0]]))
## [0]
print(clf.predict_proba(X[[0]]))
## [[1. 0. 0.]]
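
When the model is fitted on iris.target, predict returns integer codes, and each column of predict_proba corresponds to one class in the same order as iris.target_names. A minimal sketch of translating both back to species names, using the values printed above (the names predicted and proba are illustrative, not part of the original notebook):

```python
target_names = ["setosa", "versicolor", "virginica"]  # order of iris.target_names

predicted = [0]            # output of clf.predict(X[[0]]) shown above
proba = [[1.0, 0.0, 0.0]]  # output of clf.predict_proba(X[[0]]) shown above

# Integer code -> species name.
predicted_names = [target_names[code] for code in predicted]

# Most probable class per row: index of the largest probability.
most_probable = [target_names[row.index(max(row))] for row in proba]

print(predicted_names)
print(most_probable)
```

Fitting on string labels instead, as the next line does with iris.target_names[iris.target], makes predict return the names directly.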
clf.fit(iris.data, iris.target_names[iris.target])
## RandomForestClassifier()

9 Data split (80/20 ratio): training and test

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train.shape, Y_train.shape
## ((120, 4), (120,))
X_test.shape, Y_test.shape
## ((30, 4), (30,))
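
train_test_split shuffles the rows at random, so the split (and therefore the score obtained later) changes from run to run. A sketch of a reproducible variant, assuming you want a fixed seed and equal class proportions in both subsets; random_state=42 is an arbitrary choice, and stratify=Y is an addition that the original notebook does not use:

```python
from collections import Counter

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# random_state fixes the shuffle; stratify=Y keeps the three classes in the
# same proportion in the training and test subsets.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42, stratify=Y
)

print(X_train.shape, X_test.shape)
print(Counter(Y_test.tolist()))
```

With 50 samples per class and a 20% test share, the stratified test set holds 10 samples of each species.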

10 Refitting the Random Forest model

clf.fit(X_train, Y_train)
## RandomForestClassifier()

10.1 Predicting a single sample from the dataset

print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))
## [0]
print(clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))
## [[1. 0. 0.]]

10.2 Predicting on the test set

Predicted class labels:

print(clf.predict(X_test))
## [2 0 2 1 0 1 2 1 2 1 1 1 1 0 2 0 0 2 2 2 2 0 1 2 1 0 2 0 2 1]

Observed class labels:

print(Y_test)
## [2 0 2 1 0 1 2 1 2 1 1 1 1 0 2 0 0 1 2 2 2 0 1 2 1 0 2 0 2 2]
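
Comparing the two vectors by eye is error-prone, so it can help to tabulate them. The sketch below builds a confusion matrix (rows are observed classes, columns are predicted classes) from the two outputs printed above; the lists are copied by hand from this particular run and will differ on a new random split.

```python
# Labels copied from the outputs printed above (0=setosa, 1=versicolor, 2=virginica).
predicted = [2, 0, 2, 1, 0, 1, 2, 1, 2, 1, 1, 1, 1, 0, 2,
             0, 0, 2, 2, 2, 2, 0, 1, 2, 1, 0, 2, 0, 2, 1]
observed = [2, 0, 2, 1, 0, 1, 2, 1, 2, 1, 1, 1, 1, 0, 2,
            0, 0, 1, 2, 2, 2, 0, 1, 2, 1, 0, 2, 0, 2, 2]

n_classes = 3
# confusion[i][j] counts samples of observed class i that were predicted as class j.
confusion = [[0] * n_classes for _ in range(n_classes)]
for obs, pred in zip(observed, predicted):
    confusion[obs][pred] += 1

for row in confusion:
    print(row)
```

All setosa samples are classified correctly; the only two errors are one versicolor predicted as virginica and one virginica predicted as versicolor. sklearn.metrics.confusion_matrix(observed, predicted) produces the same table.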

11 Model performance

print(clf.score(X_test, Y_test))
## 0.9333333333333333
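
For a classifier, score returns the mean accuracy: the fraction of test samples whose predicted label equals the observed one. As a check, the value can be recomputed from the two label vectors printed in section 10.2 (copied by hand from this run):

```python
# Predicted and observed test labels copied from section 10.2.
predicted = [2, 0, 2, 1, 0, 1, 2, 1, 2, 1, 1, 1, 1, 0, 2,
             0, 0, 2, 2, 2, 2, 0, 1, 2, 1, 0, 2, 0, 2, 1]
observed = [2, 0, 2, 1, 0, 1, 2, 1, 2, 1, 1, 1, 1, 0, 2,
            0, 0, 1, 2, 2, 2, 0, 1, 2, 1, 0, 2, 0, 2, 2]

# Count matching positions and divide by the number of test samples.
correct = sum(p == o for p, o in zip(predicted, observed))
accuracy = correct / len(observed)

print(correct, len(observed), accuracy)
```

Here 28 of the 30 test samples match, which gives the 0.9333 reported by clf.score.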