Here we reproduce the code from the Data Professor (Chanin Nantasenamat) video: https://youtu.be/XmSlFPDjKdc.
The original Python code, written in Jupyter Lab, is available at https://github.com/dataprofessor/code/tree/master/python/iris.
You can download the file iris-classification-random-forest.ipynb into the project folder. Here, however, we copy and paste the code into this RMarkdown document in order to reproduce the classification routine with the Iris dataset. I recommend opening https://github.com/dataprofessor/code/blob/master/python/iris/iris-classification-random-forest.ipynb so you can see each code block that will be used in the Rmd chunks.
It is assumed that the reader already has the required Python packages installed.
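Since the Python chunks below are run from this Rmd (presumably via reticulate), a quick sanity check of the environment can help; this minimal sketch only prints the scikit-learn version actually loaded (no particular version is assumed).

import sklearn
print(sklearn.__version__)  # shows which scikit-learn build the chunks below will use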
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
The dataset is included in the datasets module of scikit-learn.
iris = datasets.load_iris()
The iris dataset contains 4 input features and 1 output variable (the class label).
# Input features
print(iris.feature_names)
## ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
# Output features
print(iris.target_names)
## ['setosa' 'versicolor' 'virginica']
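As a quick check (not in the original notebook), the three classes are balanced, with 50 samples each; a minimal sketch using numpy:

import numpy as np
values, counts = np.unique(iris.target, return_counts=True)
print(dict(zip(iris.target_names[values], counts)))  # expected: 50 samples per species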
iris.data
## array([[5.1, 3.5, 1.4, 0.2],
## [4.9, 3. , 1.4, 0.2],
## [4.7, 3.2, 1.3, 0.2],
## [4.6, 3.1, 1.5, 0.2],
## [5. , 3.6, 1.4, 0.2],
## [5.4, 3.9, 1.7, 0.4],
## [4.6, 3.4, 1.4, 0.3],
## [5. , 3.4, 1.5, 0.2],
## [4.4, 2.9, 1.4, 0.2],
## [4.9, 3.1, 1.5, 0.1],
## [5.4, 3.7, 1.5, 0.2],
## [4.8, 3.4, 1.6, 0.2],
## [4.8, 3. , 1.4, 0.1],
## [4.3, 3. , 1.1, 0.1],
## [5.8, 4. , 1.2, 0.2],
## [5.7, 4.4, 1.5, 0.4],
## [5.4, 3.9, 1.3, 0.4],
## [5.1, 3.5, 1.4, 0.3],
## [5.7, 3.8, 1.7, 0.3],
## [5.1, 3.8, 1.5, 0.3],
## [5.4, 3.4, 1.7, 0.2],
## [5.1, 3.7, 1.5, 0.4],
## [4.6, 3.6, 1. , 0.2],
## [5.1, 3.3, 1.7, 0.5],
## [4.8, 3.4, 1.9, 0.2],
## [5. , 3. , 1.6, 0.2],
## [5. , 3.4, 1.6, 0.4],
## [5.2, 3.5, 1.5, 0.2],
## [5.2, 3.4, 1.4, 0.2],
## [4.7, 3.2, 1.6, 0.2],
## [4.8, 3.1, 1.6, 0.2],
## [5.4, 3.4, 1.5, 0.4],
## [5.2, 4.1, 1.5, 0.1],
## [5.5, 4.2, 1.4, 0.2],
## [4.9, 3.1, 1.5, 0.2],
## [5. , 3.2, 1.2, 0.2],
## [5.5, 3.5, 1.3, 0.2],
## [4.9, 3.6, 1.4, 0.1],
## [4.4, 3. , 1.3, 0.2],
## [5.1, 3.4, 1.5, 0.2],
## [5. , 3.5, 1.3, 0.3],
## [4.5, 2.3, 1.3, 0.3],
## [4.4, 3.2, 1.3, 0.2],
## [5. , 3.5, 1.6, 0.6],
## [5.1, 3.8, 1.9, 0.4],
## [4.8, 3. , 1.4, 0.3],
## [5.1, 3.8, 1.6, 0.2],
## [4.6, 3.2, 1.4, 0.2],
## [5.3, 3.7, 1.5, 0.2],
## [5. , 3.3, 1.4, 0.2],
## [7. , 3.2, 4.7, 1.4],
## [6.4, 3.2, 4.5, 1.5],
## [6.9, 3.1, 4.9, 1.5],
## [5.5, 2.3, 4. , 1.3],
## [6.5, 2.8, 4.6, 1.5],
## [5.7, 2.8, 4.5, 1.3],
## [6.3, 3.3, 4.7, 1.6],
## [4.9, 2.4, 3.3, 1. ],
## [6.6, 2.9, 4.6, 1.3],
## [5.2, 2.7, 3.9, 1.4],
## [5. , 2. , 3.5, 1. ],
## [5.9, 3. , 4.2, 1.5],
## [6. , 2.2, 4. , 1. ],
## [6.1, 2.9, 4.7, 1.4],
## [5.6, 2.9, 3.6, 1.3],
## [6.7, 3.1, 4.4, 1.4],
## [5.6, 3. , 4.5, 1.5],
## [5.8, 2.7, 4.1, 1. ],
## [6.2, 2.2, 4.5, 1.5],
## [5.6, 2.5, 3.9, 1.1],
## [5.9, 3.2, 4.8, 1.8],
## [6.1, 2.8, 4. , 1.3],
## [6.3, 2.5, 4.9, 1.5],
## [6.1, 2.8, 4.7, 1.2],
## [6.4, 2.9, 4.3, 1.3],
## [6.6, 3. , 4.4, 1.4],
## [6.8, 2.8, 4.8, 1.4],
## [6.7, 3. , 5. , 1.7],
## [6. , 2.9, 4.5, 1.5],
## [5.7, 2.6, 3.5, 1. ],
## [5.5, 2.4, 3.8, 1.1],
## [5.5, 2.4, 3.7, 1. ],
## [5.8, 2.7, 3.9, 1.2],
## [6. , 2.7, 5.1, 1.6],
## [5.4, 3. , 4.5, 1.5],
## [6. , 3.4, 4.5, 1.6],
## [6.7, 3.1, 4.7, 1.5],
## [6.3, 2.3, 4.4, 1.3],
## [5.6, 3. , 4.1, 1.3],
## [5.5, 2.5, 4. , 1.3],
## [5.5, 2.6, 4.4, 1.2],
## [6.1, 3. , 4.6, 1.4],
## [5.8, 2.6, 4. , 1.2],
## [5. , 2.3, 3.3, 1. ],
## [5.6, 2.7, 4.2, 1.3],
## [5.7, 3. , 4.2, 1.2],
## [5.7, 2.9, 4.2, 1.3],
## [6.2, 2.9, 4.3, 1.3],
## [5.1, 2.5, 3. , 1.1],
## [5.7, 2.8, 4.1, 1.3],
## [6.3, 3.3, 6. , 2.5],
## [5.8, 2.7, 5.1, 1.9],
## [7.1, 3. , 5.9, 2.1],
## [6.3, 2.9, 5.6, 1.8],
## [6.5, 3. , 5.8, 2.2],
## [7.6, 3. , 6.6, 2.1],
## [4.9, 2.5, 4.5, 1.7],
## [7.3, 2.9, 6.3, 1.8],
## [6.7, 2.5, 5.8, 1.8],
## [7.2, 3.6, 6.1, 2.5],
## [6.5, 3.2, 5.1, 2. ],
## [6.4, 2.7, 5.3, 1.9],
## [6.8, 3. , 5.5, 2.1],
## [5.7, 2.5, 5. , 2. ],
## [5.8, 2.8, 5.1, 2.4],
## [6.4, 3.2, 5.3, 2.3],
## [6.5, 3. , 5.5, 1.8],
## [7.7, 3.8, 6.7, 2.2],
## [7.7, 2.6, 6.9, 2.3],
## [6. , 2.2, 5. , 1.5],
## [6.9, 3.2, 5.7, 2.3],
## [5.6, 2.8, 4.9, 2. ],
## [7.7, 2.8, 6.7, 2. ],
## [6.3, 2.7, 4.9, 1.8],
## [6.7, 3.3, 5.7, 2.1],
## [7.2, 3.2, 6. , 1.8],
## [6.2, 2.8, 4.8, 1.8],
## [6.1, 3. , 4.9, 1.8],
## [6.4, 2.8, 5.6, 2.1],
## [7.2, 3. , 5.8, 1.6],
## [7.4, 2.8, 6.1, 1.9],
## [7.9, 3.8, 6.4, 2. ],
## [6.4, 2.8, 5.6, 2.2],
## [6.3, 2.8, 5.1, 1.5],
## [6.1, 2.6, 5.6, 1.4],
## [7.7, 3. , 6.1, 2.3],
## [6.3, 3.4, 5.6, 2.4],
## [6.4, 3.1, 5.5, 1.8],
## [6. , 3. , 4.8, 1.8],
## [6.9, 3.1, 5.4, 2.1],
## [6.7, 3.1, 5.6, 2.4],
## [6.9, 3.1, 5.1, 2.3],
## [5.8, 2.7, 5.1, 1.9],
## [6.8, 3.2, 5.9, 2.3],
## [6.7, 3.3, 5.7, 2.5],
## [6.7, 3. , 5.2, 2.3],
## [6.3, 2.5, 5. , 1.9],
## [6.5, 3. , 5.2, 2. ],
## [6.2, 3.4, 5.4, 2.3],
## [5.9, 3. , 5.1, 1.8]])
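The raw array is hard to read; if pandas is available (an assumption, since it is not imported in the original notebook), the data can be inspected as a labelled DataFrame:

import pandas as pd
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]
print(df.head())  # first rows with named columns and the species label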
iris.target
## array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
## 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
## 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
## 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
## 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
## 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The 4 input features go into X and the output variable (the class label) into Y.
X = iris.data
Y = iris.target
X.shape
## (150, 4)
Y.shape
## (150,)
clf = RandomForestClassifier()
clf.fit(X, Y)
## RandomForestClassifier()
print(clf.feature_importances_)
## [0.10283932 0.02174083 0.4636447 0.41177515]
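The importances are printed in the same order as iris.feature_names, so petal length and petal width dominate here. A small sketch pairing each importance with its feature name (the exact values vary between runs because the forest is random):

for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")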
X[0]
## array([5.1, 3.5, 1.4, 0.2])
print(clf.predict([[5.1, 3.5, 1.4, 0.2]])) # result 0 corresponds to setosa
## [0]
print(clf.predict(X[[0]]))
## [0]
print(clf.predict_proba(X[[0]]))
## [[1. 0. 0.]]
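The three probabilities follow the order of clf.classes_ (0, 1, 2 here); a short sketch making that mapping explicit:

proba = clf.predict_proba(X[[0]])[0]
for cls, p in zip(clf.classes_, proba):
    print(iris.target_names[cls], p)  # e.g. setosa should get a probability close to 1

Next, the classifier is refit using the species names themselves as labels: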
clf.fit(iris.data, iris.target_names[iris.target])
## RandomForestClassifier()
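Because the classifier was refit with the species names as labels, predictions now come back as strings rather than integer codes; a quick illustrative check (not in the original notebook):

print(clf.predict(X[[0]]))  # expected to print ['setosa'] instead of [0]
print(clf.classes_)         # ['setosa' 'versicolor' 'virginica']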
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
X_train.shape, Y_train.shape
## ((120, 4), (120,))
X_test.shape, Y_test.shape
## ((30, 4), (30,))
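train_test_split shuffles the data randomly, so the exact 120/30 split (and the scores below) will differ between runs; passing random_state makes it reproducible, as in this sketch (the seed value 42 is arbitrary):

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split

The classifier is then refit on the training portion only: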
clf.fit(X_train, Y_train)
## RandomForestClassifier()
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))
## [0]
print(clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))
## [[1. 0. 0.]]
Predicted class labels:
print(clf.predict(X_test))
## [2 0 2 1 0 1 2 1 2 1 1 1 1 0 2 0 0 2 2 2 2 0 1 2 1 0 2 0 2 1]
Observed (true) class labels:
print(Y_test)
## [2 0 2 1 0 1 2 1 2 1 1 1 1 0 2 0 0 1 2 2 2 0 1 2 1 0 2 0 2 2]
print(clf.score(X_test, Y_test))
## 0.9333333333333333
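The score is simply the accuracy on the held-out set (28 of the 30 test samples were classified correctly in this run). For a more detailed breakdown, sklearn.metrics provides a confusion matrix and a per-class report; a minimal sketch:

from sklearn.metrics import confusion_matrix, classification_report
Y_pred = clf.predict(X_test)
print(confusion_matrix(Y_test, Y_pred))  # rows: true class, columns: predicted class
print(classification_report(Y_test, Y_pred, target_names=iris.target_names))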