The goal is to determine the quality of a wine (from 0 = very bad to 10 = exceptional) from physico-chemical measurements.
Details: http://archive.ics.uci.edu/ml/datasets/Wine+Quality
We assume that if two different wines have the same composition, their taste quality will be identical. The database contains 6,497 red and white wines from Portugal. If an unknown wine has a composition identical to one of the listed wines, we can assume it will receive the same rating.
Using the learning algorithms studied in RCP209, I will propose a way to construct a rating for a new composition. In other words, using the knowledge accumulated over these 6,497 wines, I will try to predict, or estimate, the rating.
Here is the dataset: it contains 6,497 wines and 13 variables, including 1 illustrative quantitative variable (quality) and 1 illustrative qualitative variable (color).
# Importing required packages.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx
from scipy.stats import uniform, norm
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, LabelEncoder, normalize
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RandomizedSearchCV
# import the dataset
df_red = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/winequality-red.csv', sep=';')
df_red.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1599 entries, 0 to 1598
## Data columns (total 12 columns):
## fixed acidity 1599 non-null float64
## volatile acidity 1599 non-null float64
## citric acid 1599 non-null float64
## residual sugar 1599 non-null float64
## chlorides 1599 non-null float64
## free sulfur dioxide 1599 non-null float64
## total sulfur dioxide 1599 non-null float64
## density 1599 non-null float64
## pH 1599 non-null float64
## sulphates 1599 non-null float64
## alcohol 1599 non-null float64
## quality 1599 non-null int64
## dtypes: float64(11), int64(1)
## memory usage: 150.0 KB
df_red.head(10)
## fixed acidity volatile acidity citric acid ... sulphates alcohol quality
## 0 7.4 0.70 0.00 ... 0.56 9.4 5
## 1 7.8 0.88 0.00 ... 0.68 9.8 5
## 2 7.8 0.76 0.04 ... 0.65 9.8 5
## 3 11.2 0.28 0.56 ... 0.58 9.8 6
## 4 7.4 0.70 0.00 ... 0.56 9.4 5
## 5 7.4 0.66 0.00 ... 0.56 9.4 5
## 6 7.9 0.60 0.06 ... 0.46 9.4 5
## 7 7.3 0.65 0.00 ... 0.47 10.0 7
## 8 7.8 0.58 0.02 ... 0.57 9.5 7
## 9 7.5 0.50 0.36 ... 0.80 10.5 5
##
## [10 rows x 12 columns]
df_red.describe()
## fixed acidity volatile acidity ... alcohol quality
## count 1599.000000 1599.000000 ... 1599.000000 1599.000000
## mean 8.319637 0.527821 ... 10.422983 5.636023
## std 1.741096 0.179060 ... 1.065668 0.807569
## min 4.600000 0.120000 ... 8.400000 3.000000
## 25% 7.100000 0.390000 ... 9.500000 5.000000
## 50% 7.900000 0.520000 ... 10.200000 6.000000
## 75% 9.200000 0.640000 ... 11.100000 6.000000
## max 15.900000 1.580000 ... 14.900000 8.000000
##
## [8 rows x 12 columns]
df_white = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/winequality-white.csv', sep=';')
df_white.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 4898 entries, 0 to 4897
## Data columns (total 12 columns):
## fixed acidity 4898 non-null float64
## volatile acidity 4898 non-null float64
## citric acid 4898 non-null float64
## residual sugar 4898 non-null float64
## chlorides 4898 non-null float64
## free sulfur dioxide 4898 non-null float64
## total sulfur dioxide 4898 non-null float64
## density 4898 non-null float64
## pH 4898 non-null float64
## sulphates 4898 non-null float64
## alcohol 4898 non-null float64
## quality 4898 non-null int64
## dtypes: float64(11), int64(1)
## memory usage: 459.3 KB
df_white.head(10)
## fixed acidity volatile acidity citric acid ... sulphates alcohol quality
## 0 7.0 0.27 0.36 ... 0.45 8.8 6
## 1 6.3 0.30 0.34 ... 0.49 9.5 6
## 2 8.1 0.28 0.40 ... 0.44 10.1 6
## 3 7.2 0.23 0.32 ... 0.40 9.9 6
## 4 7.2 0.23 0.32 ... 0.40 9.9 6
## 5 8.1 0.28 0.40 ... 0.44 10.1 6
## 6 6.2 0.32 0.16 ... 0.47 9.6 6
## 7 7.0 0.27 0.36 ... 0.45 8.8 6
## 8 6.3 0.30 0.34 ... 0.49 9.5 6
## 9 8.1 0.22 0.43 ... 0.45 11.0 6
##
## [10 rows x 12 columns]
df_white.describe()
## fixed acidity volatile acidity ... alcohol quality
## count 4898.000000 4898.000000 ... 4898.000000 4898.000000
## mean 6.854788 0.278241 ... 10.514267 5.877909
## std 0.843868 0.100795 ... 1.230621 0.885639
## min 3.800000 0.080000 ... 8.000000 3.000000
## 25% 6.300000 0.210000 ... 9.500000 5.000000
## 50% 6.800000 0.260000 ... 10.400000 6.000000
## 75% 7.300000 0.320000 ... 11.400000 6.000000
## max 14.200000 1.100000 ... 14.200000 9.000000
##
## [8 rows x 12 columns]
df_white['color'] = "W"
df_red['color'] = "R"
df = pd.concat([df_red, df_white])
df.head(10)
## fixed acidity volatile acidity citric acid ... alcohol quality color
## 0 7.4 0.70 0.00 ... 9.4 5 R
## 1 7.8 0.88 0.00 ... 9.8 5 R
## 2 7.8 0.76 0.04 ... 9.8 5 R
## 3 11.2 0.28 0.56 ... 9.8 6 R
## 4 7.4 0.70 0.00 ... 9.4 5 R
## 5 7.4 0.66 0.00 ... 9.4 5 R
## 6 7.9 0.60 0.06 ... 9.4 5 R
## 7 7.3 0.65 0.00 ... 10.0 7 R
## 8 7.8 0.58 0.02 ... 9.5 7 R
## 9 7.5 0.50 0.36 ... 10.5 5 R
##
## [10 rows x 13 columns]
df["color"].value_counts()
## W 4898
## R 1599
## Name: color, dtype: int64
# Load the pre-merged red + white dataset (with the color column) from file
wine = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/vinsquality.csv', sep=';')
wine.head(10)
## fixed acidity volatile acidity citric acid ... alcohol quality color
## 0 7.4 0.70 0.00 ... 9.4 5 R
## 1 7.8 0.88 0.00 ... 9.8 5 R
## 2 7.8 0.76 0.04 ... 9.8 5 R
## 3 11.2 0.28 0.56 ... 9.8 6 R
## 4 7.4 0.70 0.00 ... 9.4 5 R
## 5 7.4 0.66 0.00 ... 9.4 5 R
## 6 7.9 0.60 0.06 ... 9.4 5 R
## 7 7.3 0.65 0.00 ... 10.0 7 R
## 8 7.8 0.58 0.02 ... 9.5 7 R
## 9 7.5 0.50 0.36 ... 10.5 5 R
##
## [10 rows x 13 columns]
vins = df
print(vins.describe())
## fixed acidity volatile acidity ... alcohol quality
## count 6497.000000 6497.000000 ... 6497.000000 6497.000000
## mean 7.215307 0.339666 ... 10.491801 5.818378
## std 1.296434 0.164636 ... 1.192712 0.873255
## min 3.800000 0.080000 ... 8.000000 3.000000
## 25% 6.400000 0.230000 ... 9.500000 5.000000
## 50% 7.000000 0.290000 ... 10.300000 6.000000
## 75% 7.700000 0.400000 ... 11.300000 6.000000
## max 15.900000 1.580000 ... 14.900000 9.000000
##
## [8 rows x 12 columns]
X = vins.drop(['quality','color'],axis=1)
Y = vins['quality']
vins["color"].value_counts()
## W 4898
## R 1599
## Name: color, dtype: int64
print(len(vins[vins.color == 'W']), "white wines")
## 4898 white wines
print(len(vins[vins.color == 'R']), "red wines")
## 1599 red wines
We have several thousand ratings, given by experts, for thousands of wines for which we know the same 12 compositional measurements. Since the wines range from very good to very bad, let us check whether the ratings are distributed non-uniformly.
plt.close('all')
#plt.style.use('ggplot')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,4))
vins.quality.hist(bins=18, ax=ax)
plt.title('Distribution of the wine ratings')
plt.show()
Wines rated 3 or 9 are poorly represented, with only 5-10 samples each, which may cause some overfitting problems.
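To quantify the imbalance behind the histogram, here is a quick per-grade count (a simple check on the merged frame):
# Number of wines per quality grade
print(vins['quality'].value_counts().sort_index())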
We will examine the characteristics of the variables and, above all, the relationships between them. For this we use a factorial analysis method to highlight relationships between the quantitative variables (the physico-chemical measurements) and the illustrative quantitative variable "quality".
We will run a principal component analysis (PCA) to represent the set of points in a lower-dimensional space.
pca = PCA(n_components=5)
# normalize() rescales each sample (row) to unit norm before the PCA
Xn = normalize(X)
pca.fit(Xn)
## PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
## svd_solver='auto', tol=0.0, whiten=False)
eig = pd.DataFrame(dict(valeur=pca.explained_variance_ratio_))
ax = eig.plot(kind='bar', figsize=(3,3))
ax.set_title("Valeur propres de l'ACP apres normalisation");
plt.show()
The first two axes capture more than 90% of the information. Let us look at the coordinates on the first axis (v1) and the second axis (v2).
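One can verify this claim by printing the cumulative share of variance explained by the axes (a one-line check on the fitted pca object):
# Cumulative explained variance of the principal axes
print(np.cumsum(pca.explained_variance_ratio_))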
v2 = pd.DataFrame(pca.components_[0:2,:]).T
v2.index = vins.columns[:-2]  # feature names (all columns except quality and color)
v2.columns = ['v1', 'v2']
ax = v2.plot(y=['v1', 'v2'], kind='bar', figsize=(10,4))
ax.legend(loc='upper left')
ax.set_title("Coordinates of the variables on the first two PCA axes")
plt.show()
We note that alcohol, acidity, sulfur dioxide and pH seem to play a larger role than the other variables.
proj = pca.transform(Xn)
pl = pd.DataFrame(proj[:, :3])
pl.columns = ['v1', 'v2', 'v3']
pl['quality'] = wine['quality']
pl['color'] = wine['color']
# First plot, colored by wine color.
ax = sns.lmplot(x="v1", y="v2", hue="color", truncate=True, data=pl, scatter_kws={"s": 1}, fit_reg=False, height=3)
ax.ax.set_title("Projection of the wines on the first two PCA axes");
plt.show()
With the PCA plot we see that red and white wines could be chemically different, and that there is a boundary between them. It would therefore be possible to predict the color from the features available in this dataset via classification. However, that is not the objective of this project.
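One could quickly test this separability with a simple classifier (a hedged aside, not part of the project pipeline; it reuses LogisticRegression and train_test_split imported above):
# Sanity check: predict the color from the physico-chemical features
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    vins.drop(['quality', 'color'], axis=1), vins['color'],
    test_size=0.2, random_state=42)
color_clf = LogisticRegression(solver='lbfgs', max_iter=1000)
color_clf.fit(Xc_train, yc_train)
print("Color prediction accuracy:", round(color_clf.score(Xc_test, yc_test), 3))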
We now plot the wine ratings.
# Plot of the wine ratings.
fig, axs = plt.subplots(1, 3, figsize=(12,4))
red = pl[pl.color == 'R']
white = pl[pl.color == 'W']
# Choose a colormap here
cmap = plt.get_cmap('plasma')
cnorm = colors.Normalize(vmin=pl['quality'].min(), vmax=pl['quality'].max())
scalar = cmx.ScalarMappable(norm=cnorm, cmap=cmap)
for i, data, title in [(0, pl, 'all'), (1, red, 'red'), (2, white, 'white')]:
    ax = axs[i]
    # Draw the points in white first so the text labels stay inside the axes
    pl.plot(x='v1', y='v2', kind='scatter', color="white", ax=ax)
    for note in sorted(set(data['quality'])):
        sub = data[data.quality == note]
        if sub.shape[0] > 100:
            sub = sub.sample(n=30)
        color = scalar.to_rgba(note)
        for row in sub.itertuples():
            # row = (Index, v1, v2, v3, quality, color); print the grade at (v1, v2)
            ax.text(row[1], row[2], str(row[4]), color=color)
    ax.set_title(title);
plt.show()
Red and white wines appear very different, so it would probably be worth building two separate models if performance is not good enough. The good ratings do not stand out particularly on these plots. The problem may be simple, but these plots alone will not tell us.
We will look at the relationships between the variables in more detail with a deeper analysis of the PCA, programmed in R, in the appendix.
We will now determine which algorithm is the most accurate for predicting a wine's rating. We will test 6 algorithms to get the best possible prediction:
1. Random Forests
2. Logistic Regression
3. Stochastic Gradient Descent Classifier
4. Decision Trees
5. SVM
6. k-Nearest Neighbors (kNN)
Here are the algorithms we will test, first with default hyperparameters.
# Now separate the dataset into the response variable and the feature variables
X = vins.drop(['quality','color'], axis = 1)
y = vins['quality']
# Train and test split of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# Apply standard scaling; fit on the training set, then reuse on the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
We first split the data in two: 80% of the data goes into the training set to fit the model, and the remaining 20% into the test set to validate it.
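Given how rare grades 3 and 9 are, a stratified split would keep the grade proportions identical in the two sets (a sketch only; this variant was not used for the results reported below):
# Stratified variant of the same 80/20 split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)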
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features='auto', max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=10,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
## "10 in version 0.20 to 100 in 0.22.", FutureWarning)
pred_rfc = rfc.predict(X_test)
#Let's see how our model performed
print(classification_report(y_test, pred_rfc))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.55 0.14 0.22 43
## 5 0.61 0.71 0.65 402
## 6 0.64 0.70 0.67 597
## 7 0.64 0.46 0.53 215
## 8 0.82 0.25 0.38 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.63 1300
## macro avg 0.46 0.32 0.35 1300
## weighted avg 0.63 0.63 0.61 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_rfc), 3))
## Overall Accuracy: 0.625
rfc_defaut=round(metrics.accuracy_score(y_test, pred_rfc), 3)
lr = LogisticRegression()
lr.fit(X_train, y_train)
## LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
## intercept_scaling=1, l1_ratio=None, max_iter=100,
## multi_class='warn', n_jobs=None, penalty='l2',
## random_state=None, solver='warn', tol=0.0001, verbose=0,
## warm_start=False)
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
## FutureWarning)
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
## "this warning.", FutureWarning)
pred_lr = lr.predict(X_test)
#Let's see how our model performed
print(classification_report(y_test, pred_lr))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.00 0.00 0.00 43
## 5 0.53 0.61 0.57 402
## 6 0.52 0.69 0.60 597
## 7 0.48 0.12 0.19 215
## 8 0.00 0.00 0.00 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.52 1300
## macro avg 0.22 0.20 0.19 1300
## weighted avg 0.48 0.52 0.48 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_lr), 3))
## Overall Accuracy: 0.524
sgd = SGDClassifier(penalty=None)
sgd.fit(X_train, y_train)
## SGDClassifier(alpha=0.0001, average=False, class_weight=None,
## early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
## l1_ratio=0.15, learning_rate='optimal', loss='hinge',
## max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty=None,
## power_t=0.5, random_state=None, shuffle=True, tol=0.001,
## validation_fraction=0.1, verbose=0, warm_start=False)
pred_sgd = sgd.predict(X_test)
print(classification_report(y_test, pred_sgd))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.00 0.00 0.00 43
## 5 0.46 0.64 0.53 402
## 6 0.51 0.46 0.48 597
## 7 0.38 0.33 0.35 215
## 8 0.00 0.00 0.00 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.46 1300
## macro avg 0.19 0.20 0.20 1300
## weighted avg 0.44 0.46 0.44 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_sgd), 3))
## Overall Accuracy: 0.462
4. Decision Trees
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
## DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
## max_features=None, max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, presort=False,
## random_state=None, splitter='best')
pred_dt = dt.predict(X_test)
print(classification_report(y_test, pred_dt))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.22 0.19 0.20 43
## 5 0.55 0.60 0.58 402
## 6 0.60 0.56 0.58 597
## 7 0.47 0.47 0.47 215
## 8 0.25 0.28 0.26 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.54 1300
## macro avg 0.30 0.30 0.30 1300
## weighted avg 0.54 0.54 0.54 1300
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_dt), 3))
## Overall Accuracy: 0.536
svc = SVC()
svc.fit(X_train, y_train)
## SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
## decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
## kernel='rbf', max_iter=-1, probability=False, random_state=None,
## shrinking=True, tol=0.001, verbose=False)
pred_svc = svc.predict(X_test)
print(classification_report(y_test, pred_svc))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.00 0.00 0.00 43
## 5 0.58 0.65 0.61 402
## 6 0.55 0.72 0.62 597
## 7 0.61 0.19 0.29 215
## 8 0.00 0.00 0.00 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.56 1300
## macro avg 0.25 0.22 0.22 1300
## weighted avg 0.53 0.56 0.52 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_svc), 3))
## Overall Accuracy: 0.562
svc_defaut=round(metrics.accuracy_score(y_test, pred_svc), 3)
knn = KNeighborsRegressor(n_neighbors=1)  # with k=1 the regressor returns the nearest neighbor's exact grade
knn.fit(X_train, y_train)
## KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
## metric_params=None, n_jobs=None, n_neighbors=1, p=2,
## weights='uniform')
pred_knn = knn.predict(X_test)
print(classification_report(y_test, pred_knn))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.00 0.00 0.00 43
## 5 0.58 0.65 0.61 402
## 6 0.55 0.72 0.62 597
## 7 0.61 0.19 0.29 215
## 8 0.00 0.00 0.00 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.56 1300
## macro avg 0.25 0.22 0.22 1300
## weighted avg 0.53 0.56 0.52 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_knn), 3))
## Overall Accuracy: 0.562
We found that the most accurate algorithm is the random forest, with an accuracy of 0.625. It is followed by the decision tree, SVM and k-nearest-neighbors algorithms, with accuracies around 0.54-0.56 for predicting the ratings. We will now select and validate the hyperparameters.
# Designate distributions to sample hyperparameters from
n_estimators = np.random.uniform(70, 80, 5).astype(int)
max_features = np.random.normal(6, 3, 5).astype(int)
# Check max_features>0 & max_features<=total number of features
max_features[max_features <= 0] = 1
max_features[max_features > X.shape[1]] = X.shape[1]
hyperparameters = {'n_estimators': list(n_estimators),
'max_features': list(max_features)}
print (hyperparameters)
## {'n_estimators': [72, 71, 72, 79, 79], 'max_features': [3, 7, 6, 5, 8]}
We will use these hyperparameters to optimize the algorithm with RandomizedSearchCV, which is faster to run than GridSearchCV.
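Note that RandomizedSearchCV also accepts distributions to sample from at search time instead of pre-sampled lists (a sketch; scipy.stats.randint is an extra import, not used elsewhere in this project):
from scipy.stats import randint
# Let the search itself draw integer hyperparameters from these ranges
hyperparameters_dist = {'n_estimators': randint(70, 81),
                        'max_features': randint(1, X.shape[1] + 1)}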
## Randomized search using cross-validation for the random forest
# Run randomized search
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(X_train, y_train)
# Identify optimal hyperparameter values
## RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
## estimator=RandomForestClassifier(bootstrap=True,
## class_weight=None,
## criterion='gini',
## max_depth=None,
## max_features='auto',
## max_leaf_nodes=None,
## min_impurity_decrease=0.0,
## min_impurity_split=None,
## min_samples_leaf=1,
## min_samples_split=2,
## min_weight_fraction_leaf=0.0,
## n_estimators='warn',
## n_jobs=None,
## oob_score=False,
## random_state=None,
## verbose=0,
## warm_start=False),
## iid='warn', n_iter=20, n_jobs=None,
## param_distributions={'max_features': [3, 7, 6, 5, 8],
## 'n_estimators': [72, 71, 72, 79, 79]},
## pre_dispatch='2*n_jobs', random_state=None, refit=True,
## return_train_score=False, scoring=None, verbose=0)
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
## warnings.warn(CV_WARNING, FutureWarning)
best_n_estim = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']
print("The best performing n_estimators value is: {:5d}".format(best_n_estim))
## The best performing n_estimators value is: 72
print("The best performing max_features value is: {:5d}".format(best_max_features))
## The best performing max_features value is: 3
Having found the best hyperparameters, we can now run the training and then test on the held-out data.
## Optimal training with the new hyperparameters
# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
rfc2 = RandomForestClassifier(n_estimators=best_n_estim,
max_features=best_max_features)
rfc2.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
## max_depth=None, max_features=3, max_leaf_nodes=None,
## min_impurity_decrease=0.0, min_impurity_split=None,
## min_samples_leaf=1, min_samples_split=2,
## min_weight_fraction_leaf=0.0, n_estimators=72,
## n_jobs=None, oob_score=False, random_state=None,
## verbose=0, warm_start=False)
rfc2_predictions = rfc2.predict(X_test)
print (metrics.classification_report(y_test, rfc2_predictions))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 0.60 0.07 0.12 43
## 5 0.67 0.70 0.69 402
## 6 0.65 0.76 0.70 597
## 7 0.70 0.52 0.60 215
## 8 1.00 0.33 0.50 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.66 1300
## macro avg 0.52 0.34 0.37 1300
## weighted avg 0.67 0.66 0.65 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print ("Overall Accuracy optimal:", round(metrics.accuracy_score(y_test, rfc2_predictions), 3))
## Overall Accuracy optimal: 0.664
print("Overall Accuracy par defaut :",rfc_defaut)
## Overall Accuracy par defaut : 0.625
We can see that accuracy improved with the new hyperparameters. However, the model is still too imprecise to grade wines reliably, mostly because of the wines rated 9, of which we have only about 5 samples among the 6,497 wines.
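To see where the errors concentrate, one can print the confusion matrix of the tuned forest (a quick check, reusing the confusion_matrix imported earlier):
# Rows = true grades, columns = predicted grades
labels = sorted(y_test.unique())
cm = confusion_matrix(y_test, rfc2_predictions, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))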
# Convert the numpy array to a list
xRF=np.array(rfc2_predictions).tolist()
# Print the first 5 predictions
print("\nRandom forest predictions:\n")
##
## Random forest predictions:
for i in range(0,5):
    print(xRF[i])
## 6
## 5
## 7
## 6
## 5
# Print the first five observed ratings
print("\nRandom forest observations:\n")
##
## Random forest observations:
print (y_test.head())
## 1504 8
## 1419 5
## 3162 7
## 3091 6
## 2433 6
## Name: quality, dtype: int64
We note that the predicted ratings are not very reliable.
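Since the grades are ordered, the mean absolute error gives a complementary view to accuracy: a miss by one grade weighs much less than a miss by three (a short check; mean_absolute_error is an extra import):
from sklearn.metrics import mean_absolute_error
# Average distance between predicted and true grades for the tuned forest
print("MAE random forest:", round(mean_absolute_error(y_test, rfc2_predictions), 3))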
Now we will run a test on the SVM algorithm to see whether we can improve the prediction accuracy.
# Designate distributions to sample hyperparameters from
np.random.seed(123)
g_range = np.random.uniform(0.0, 0.3, 5).astype(float)
C_range = np.random.normal(1, 0.1, 5).astype(float)
# Check that gamma>0 and C>0
g_range[g_range <= 0] = 0.0001
C_range[C_range < 0] = 0.0001
hyperparameters = {'gamma': list(g_range),
'C': list(C_range)}
print (hyperparameters)
## {'gamma': [0.2089407556793585, 0.08584180048511383, 0.06805543606926093, 0.16539443072486737, 0.2158406909356689], 'C': [1.0322106068339623, 0.9948482279060615, 0.9795799035361106, 1.197934843277785, 0.8380699934963254]}
## Randomized search using cross-validation for the SVM
For this prediction we will use an SVM with a nonlinear kernel of the radial basis function (RBF) type, as it is a very popular choice.
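As a reminder of what this kernel computes (an illustrative sketch, not part of the pipeline), the RBF kernel scores the similarity of two samples as exp(-gamma * ||x - x'||^2), so gamma controls how quickly similarity decays with distance:
# Illustrative only: RBF similarity between two scaled feature vectors
def rbf_similarity(x1, x2, gamma):
    return np.exp(-gamma * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))
print(rbf_similarity(X_train[0], X_train[1], gamma=0.1))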
# Run randomized search
randomCV = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(X_train, y_train)
# Identify optimal hyperparameter values
## RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
## estimator=SVC(C=1.0, cache_size=200, class_weight=None,
## coef0=0.0, decision_function_shape='ovr',
## degree=3, gamma='auto_deprecated',
## kernel='rbf', max_iter=-1, probability=False,
## random_state=None, shrinking=True, tol=0.001,
## verbose=False),
## iid='warn', n_iter=20, n_jobs=None,
## param_distributions={'C': [1.0322106068339623,
## 0.9948482279060615,
## 0.9795799035361106,
## 1.197934843277785,
## 0.8380699934963254],
## 'gamma': [0.2089407556793585,
## 0.08584180048511383,
## 0.06805543606926093,
## 0.16539443072486737,
## 0.2158406909356689]},
## pre_dispatch='2*n_jobs', random_state=None, refit=True,
## return_train_score=False, scoring=None, verbose=0)
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
## warnings.warn(CV_WARNING, FutureWarning)
best_gamma = randomCV.best_params_['gamma']
best_C = randomCV.best_params_['C']
print("The best performing gamma value is: {:5.2f}".format(best_gamma))
## The best performing gamma value is: 0.22
print("The best performing C value is: {:5.2f}".format(best_C))
## The best performing C value is: 0.99
## Optimal training of the SVM with the new hyperparameters
# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=best_C, gamma=best_gamma)
rbfSVM.fit(X_train, y_train)
## SVC(C=0.9948482279060615, cache_size=200, class_weight=None, coef0=0.0,
## decision_function_shape='ovr', degree=3, gamma=0.2158406909356689,
## kernel='rbf', max_iter=-1, probability=False, random_state=None,
## shrinking=True, tol=0.001, verbose=False)
svm_predictions = rbfSVM.predict(X_test)
print(metrics.classification_report(y_test, svm_predictions))
## precision recall f1-score support
##
## 3 0.00 0.00 0.00 6
## 4 1.00 0.02 0.05 43
## 5 0.59 0.67 0.63 402
## 6 0.56 0.71 0.63 597
## 7 0.64 0.25 0.36 215
## 8 0.00 0.00 0.00 36
## 9 0.00 0.00 0.00 1
##
## accuracy 0.58 1300
## macro avg 0.40 0.24 0.24 1300
## weighted avg 0.58 0.58 0.54 1300
##
##
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
## 'precision', 'predicted', average, warn_for)
print("Overall Accuracy optimise:", round(metrics.accuracy_score(y_test, svm_predictions),1))
## Overall Accuracy optimise: 0.6
print(" Overall Accuracy par defaut: ",svc_defaut)
## Overall Accuracy par defaut: 0.562
We note that the accuracy increased, but not enough to catch up with the random forest.
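To summarize the headline numbers, we can print the scores kept in variables above side by side (a small recap, recomputed rather than hard-coded):
# Accuracy with default vs. tuned hyperparameters for both finalists
print(pd.DataFrame(
    {'default': [rfc_defaut, svc_defaut],
     'tuned': [round(metrics.accuracy_score(y_test, rfc2_predictions), 3),
               round(metrics.accuracy_score(y_test, svm_predictions), 3)]},
    index=['Random Forest', 'SVM']))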
We check a few wine ratings for the SVM algorithm.
# Convert the numpy array to a list
x=np.array(svm_predictions).tolist()
# Print the first 5 predictions
print("\nThe SVM predictions:\n")
##
## The SVM predictions:
for i in range(0,5):
    print(x[i])
## 6
## 5
## 7
## 5
## 5
# Print the first five observed ratings
print("\nThe SVM observations:\n")
##
## The SVM observations:
print(y_test.head())
## 1504 8
## 1419 5
## 3162 7
## 3091 6
## 2433 6
## Name: quality, dtype: int64
Out of these 5 examples, only 2 ratings are correct. However, over the full set of rated wines the SVM reaches about 57% accuracy.
The best algorithm for predicting a wine's rating on this dataset is the random forest. Optimizing the hyperparameters improved the accuracy, though not by much. This project was very interesting to carry out: I applied the methods seen in the RCP208 and RCP209 courses and also used new libraries such as Pandas. I also used R for the principal component analysis, which is simpler there than in Python. It would also be interesting to build a classifier to predict the color of the wine.