Introduction

The goal is to predict the quality of a wine (from 0 = very bad to 10 = exceptional) from physico-chemical measurements.

Details: http://archive.ics.uci.edu/ml/datasets/Wine+Quality

We assume that two different wines with the same composition taste equally good. The database contains 6,497 red and white wines from Portugal. If an unknown wine has the same composition as one of these catalogued wines, we can assume it will receive the same score.
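The assumption above amounts to a lookup: if an unknown wine matches a catalogued composition exactly, reuse its score. Here is a minimal sketch on a made-up three-wine table. The values, the two-column restriction and the helper name `lookup_score` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Toy catalogue with made-up values; only two chemical columns are used here.
catalogue = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5],
    "pH": [3.51, 3.20, 3.35],
    "quality": [5, 5, 7],
})

def lookup_score(catalogue, composition):
    """Return the score of the first wine matching the composition exactly, else None."""
    mask = pd.Series(True, index=catalogue.index)
    for column, value in composition.items():
        mask &= catalogue[column] == value
    matches = catalogue.loc[mask, "quality"]
    return int(matches.iloc[0]) if not matches.empty else None

known = lookup_score(catalogue, {"alcohol": 10.5, "pH": 3.35})
unknown = lookup_score(catalogue, {"alcohol": 12.0, "pH": 3.00})
print(known, unknown)
```

An exact match is rare in practice, which is why the rest of the report trains models that generalise to compositions never seen before.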

Using the learning algorithms covered in RCP209, I propose a way to build a score for a new composition: drawing on the knowledge accumulated from these 6,497 wines, I will try to predict, or estimate, its score.

The dataset contains 6,497 wines and 13 variables, including one supplementary quantitative variable (quality) and one supplementary qualitative variable (color).

# Import the required packages.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cmx

from scipy.stats import uniform, norm

from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler, normalize
from sklearn.svm import SVC
# import the dataset

df_red = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/winequality-red.csv', sep=';')
df_red.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1599 entries, 0 to 1598
## Data columns (total 12 columns):
## fixed acidity           1599 non-null float64
## volatile acidity        1599 non-null float64
## citric acid             1599 non-null float64
## residual sugar          1599 non-null float64
## chlorides               1599 non-null float64
## free sulfur dioxide     1599 non-null float64
## total sulfur dioxide    1599 non-null float64
## density                 1599 non-null float64
## pH                      1599 non-null float64
## sulphates               1599 non-null float64
## alcohol                 1599 non-null float64
## quality                 1599 non-null int64
## dtypes: float64(11), int64(1)
## memory usage: 150.0 KB
df_red.head(10)
##    fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
## 0            7.4              0.70         0.00  ...       0.56      9.4        5
## 1            7.8              0.88         0.00  ...       0.68      9.8        5
## 2            7.8              0.76         0.04  ...       0.65      9.8        5
## 3           11.2              0.28         0.56  ...       0.58      9.8        6
## 4            7.4              0.70         0.00  ...       0.56      9.4        5
## 5            7.4              0.66         0.00  ...       0.56      9.4        5
## 6            7.9              0.60         0.06  ...       0.46      9.4        5
## 7            7.3              0.65         0.00  ...       0.47     10.0        7
## 8            7.8              0.58         0.02  ...       0.57      9.5        7
## 9            7.5              0.50         0.36  ...       0.80     10.5        5
## 
## [10 rows x 12 columns]
df_red.describe()
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    1599.000000       1599.000000  ...  1599.000000  1599.000000
## mean        8.319637          0.527821  ...    10.422983     5.636023
## std         1.741096          0.179060  ...     1.065668     0.807569
## min         4.600000          0.120000  ...     8.400000     3.000000
## 25%         7.100000          0.390000  ...     9.500000     5.000000
## 50%         7.900000          0.520000  ...    10.200000     6.000000
## 75%         9.200000          0.640000  ...    11.100000     6.000000
## max        15.900000          1.580000  ...    14.900000     8.000000
## 
## [8 rows x 12 columns]
df_white = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/winequality-white.csv', sep=';')
df_white.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 4898 entries, 0 to 4897
## Data columns (total 12 columns):
## fixed acidity           4898 non-null float64
## volatile acidity        4898 non-null float64
## citric acid             4898 non-null float64
## residual sugar          4898 non-null float64
## chlorides               4898 non-null float64
## free sulfur dioxide     4898 non-null float64
## total sulfur dioxide    4898 non-null float64
## density                 4898 non-null float64
## pH                      4898 non-null float64
## sulphates               4898 non-null float64
## alcohol                 4898 non-null float64
## quality                 4898 non-null int64
## dtypes: float64(11), int64(1)
## memory usage: 459.3 KB
df_white.head(10)
##    fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
## 0            7.0              0.27         0.36  ...       0.45      8.8        6
## 1            6.3              0.30         0.34  ...       0.49      9.5        6
## 2            8.1              0.28         0.40  ...       0.44     10.1        6
## 3            7.2              0.23         0.32  ...       0.40      9.9        6
## 4            7.2              0.23         0.32  ...       0.40      9.9        6
## 5            8.1              0.28         0.40  ...       0.44     10.1        6
## 6            6.2              0.32         0.16  ...       0.47      9.6        6
## 7            7.0              0.27         0.36  ...       0.45      8.8        6
## 8            6.3              0.30         0.34  ...       0.49      9.5        6
## 9            8.1              0.22         0.43  ...       0.45     11.0        6
## 
## [10 rows x 12 columns]
df_white.describe()
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    4898.000000       4898.000000  ...  4898.000000  4898.000000
## mean        6.854788          0.278241  ...    10.514267     5.877909
## std         0.843868          0.100795  ...     1.230621     0.885639
## min         3.800000          0.080000  ...     8.000000     3.000000
## 25%         6.300000          0.210000  ...     9.500000     5.000000
## 50%         6.800000          0.260000  ...    10.400000     6.000000
## 75%         7.300000          0.320000  ...    11.400000     6.000000
## max        14.200000          1.100000  ...    14.200000     9.000000
## 
## [8 rows x 12 columns]
df_white['color'] = "W"

df_red['color'] = "R"
df = pd.concat([df_red, df_white])
df.head(10)
##    fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  color
## 0            7.4              0.70         0.00  ...      9.4        5      R
## 1            7.8              0.88         0.00  ...      9.8        5      R
## 2            7.8              0.76         0.04  ...      9.8        5      R
## 3           11.2              0.28         0.56  ...      9.8        6      R
## 4            7.4              0.70         0.00  ...      9.4        5      R
## 5            7.4              0.66         0.00  ...      9.4        5      R
## 6            7.9              0.60         0.06  ...      9.4        5      R
## 7            7.3              0.65         0.00  ...     10.0        7      R
## 8            7.8              0.58         0.02  ...      9.5        7      R
## 9            7.5              0.50         0.36  ...     10.5        5      R
## 
## [10 rows x 13 columns]
df["color"].value_counts() 
#vins2=pd.read_csv("C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/winequality.csv", sep=';')
#vins2["color"].value_counts() 
#df.to_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/vinsqualityessai.csv',sep='\t', encoding='utf-8',index=False) #sep=';',mode = 'w', index=False)
## W    4898
## R    1599
## Name: color, dtype: int64
wine = pd.read_csv('C:/Users/lnzb7292/Downloads/RCP209/projet/Projet Vins RCP209/vinsquality.csv', sep=';')
wine.head(10)
##    fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  color
## 0            7.4              0.70         0.00  ...      9.4        5      R
## 1            7.8              0.88         0.00  ...      9.8        5      R
## 2            7.8              0.76         0.04  ...      9.8        5      R
## 3           11.2              0.28         0.56  ...      9.8        6      R
## 4            7.4              0.70         0.00  ...      9.4        5      R
## 5            7.4              0.66         0.00  ...      9.4        5      R
## 6            7.9              0.60         0.06  ...      9.4        5      R
## 7            7.3              0.65         0.00  ...     10.0        7      R
## 8            7.8              0.58         0.02  ...      9.5        7      R
## 9            7.5              0.50         0.36  ...     10.5        5      R
## 
## [10 rows x 13 columns]
vins=df
print(vins.describe())
##        fixed acidity  volatile acidity  ...      alcohol      quality
## count    6497.000000       6497.000000  ...  6497.000000  6497.000000
## mean        7.215307          0.339666  ...    10.491801     5.818378
## std         1.296434          0.164636  ...     1.192712     0.873255
## min         3.800000          0.080000  ...     8.000000     3.000000
## 25%         6.400000          0.230000  ...     9.500000     5.000000
## 50%         7.000000          0.290000  ...    10.300000     6.000000
## 75%         7.700000          0.400000  ...    11.300000     6.000000
## max        15.900000          1.580000  ...    14.900000     9.000000
## 
## [8 rows x 12 columns]
X = vins.drop(['quality','color'],axis=1)
Y = vins['quality']


vins["color"].value_counts()
## W    4898
## R    1599
## Name: color, dtype: int64
print(len(vins[vins.color == 'W']), "vins Blanc")
## 4898 vins Blanc
print(len(vins[vins.color == 'R']), "vins Rouge")
## 1599 vins Rouge

We have several thousand expert scores, for wines of both colours, each described by the same 11 physico-chemical measurements shown above (12 columns counting the score). Since the wines range from very good to very bad, we check how the scores are distributed and whether that distribution is far from uniform.

Distribution of the wine scores

import matplotlib.pyplot as plt
plt.close('all')
#plt.style.use('ggplot')
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10,4))
vins.quality.hist(bins=18, ax=ax)
plt.title('Distribution des notes des vins')
plt.show()

Wines scored 3 or 9 are very poorly represented (score 9 has only 5 samples), which is likely to cause overfitting on these classes.
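One way to spot such rare classes is to count the samples per score. The sketch below uses a short made-up list of scores, and the threshold `min_count` is an arbitrary, hypothetical choice. On the real data the same call would apply to `vins['quality']`.

```python
import pandas as pd

# Made-up scores for illustration only.
scores = pd.Series([5, 6, 6, 5, 7, 6, 3, 6, 5, 9, 6, 7])

# Number of wines per score, ordered by score.
counts = scores.value_counts().sort_index()

# Flag scores with fewer than min_count samples (hypothetical threshold).
min_count = 2
rare = counts[counts < min_count].index.tolist()
print(counts.to_dict(), rare)
```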

We now examine the variables and, above all, the relationships between them. To do so, we use a factor analysis method to bring out relationships between the quantitative variables (the physico-chemical measurements) and the supplementary quantitative variable “quality”.

We run a principal component analysis (PCA) to represent the set of points in a lower-dimensional space.

Principal component analysis (PCA)

pca = PCA(n_components=5)
Xn = normalize(X)
pca.fit(Xn)
## PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
##     svd_solver='auto', tol=0.0, whiten=False)

eig = pd.DataFrame(dict(valeur=pca.explained_variance_ratio_))
ax = eig.plot(kind='bar', figsize=(3,3))
ax.set_title("Valeur propres de l'ACP apres normalisation");
plt.show()

The first two axes carry more than 90% of the information (explained variance). Let us look at the coordinates of the first axis, v1, and the second axis, v2.
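The share of information carried by the first axes can be read from the cumulative sum of `explained_variance_ratio_`. The sketch below illustrates this on synthetic data, not on the wine data: five columns driven by two latent factors. The 0.90 threshold and all variable names are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 columns driven mostly by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

pca_demo = PCA(n_components=5).fit(data)

# Cumulative share of variance explained by the first k axes.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# Smallest number of axes whose cumulative share reaches 90%.
n_axes = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_axes)
```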

v2 = pd.DataFrame(pca.components_[0:2,:]).T
v2.index = vins.columns[:-2]
v2.columns = ['v1', 'v2']
ax = v2.plot(y=['v1', 'v2'], kind='bar', figsize=(10,4))
ax.legend(loc='upper left')
ax.set_title("Comparaison des coordonnees des deux premiers axes de l ACP")


plt.show()

Alcohol, acidity, sulfur dioxide and pH appear to play a larger role than the other variables.

proj = pca.transform(Xn)

pl = pd.DataFrame(proj[:, :3])
pl.columns = ['v1', 'v2', 'v3']
pl['quality'] = wine['quality']
pl['color'] = wine['color']

# First plot: wines coloured by type.

# `size` was renamed to `height` in seaborn, hence the change below.
ax = sns.lmplot(x="v1", y="v2", hue="color", truncate=True, data=pl, scatter_kws={"s": 1}, fit_reg=False, height=3)
ax.ax.set_title("Projection des vins sur les deux premiers axes de l ACP");

plt.show()

The PCA plot suggests that white and red wines differ chemically and that there is a boundary between them. It should therefore be possible to predict the colour from this dataset with a classifier. However, that is not the goal of this project.

We now plot the wine scores.

Plotting the wine scores

fig, axs = plt.subplots(1, 3, figsize=(12,4))
red = pl[pl.color == 'R']
white = pl[pl.color == 'W']
# Choose a colour gradient here
cmap = plt.get_cmap('plasma')
cnorm = colors.Normalize(vmin=pl['quality'].min(), vmax=pl['quality'].max())
scalar = cmx.ScalarMappable(norm=cnorm, cmap=cmap)

for i, data, title in [(0, pl, 'tous'), (1, red, 'red'), (2, white, 'white')]:
    ax = axs[i]
    # Draw the points in white first so that the text labels stay inside the axes
    pl.plot(x='v1', y='v2', kind='scatter', color="white", ax=ax)

    for note in sorted(set(data['quality'])):
        sub = data[data.quality == note]
        if sub.shape[0] > 100:
            sub = sub.sample(n=30)

        color = scalar.to_rgba(note)
        for i, row in enumerate(sub.itertuples()):
            ax.text(row[1], row[2], str(row[4]), color=color)
    ax.set_title(title);
    
    
plt.show()  

Red and white wines appear very different, so it will probably be worth building two separate models if performance is not good enough. Good scores do not stand out particularly on these plots. The problem may be simple, but these plots will not tell us.

We look at the relationships between variables in more detail with a more thorough PCA, programmed in R, in the appendix.

We now choose the most accurate algorithm for predicting a wine's score. We test 6 algorithms to get the best possible prediction.

1.Random Forests

2.Logistic Regression

3.Stochastic Gradient Descent Classifier

4.Decision Trees

5.SVM

6.K-nearest neighbours (KNN)

These are the algorithms we will test, with default hyperparameters.

The models

Splitting the data into training and test sets


#Now separate the dataset into the response variable and the feature variables
X = vins.drop(['quality','color'], axis = 1)
y = vins['quality']

#Train and Test splitting of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
#Apply standard scaling: fit the scaler on the training set only, then reuse it on the test set
sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

We first split the data in two: 80% goes into the training set used to fit the model, and the remaining 20% goes into the test set used to validate it.
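A minimal sketch of an 80/20 split in which the scaling statistics are learned from the training part only and then reused on the test part. It runs on synthetic data, and the names `features`, `labels` and `scaler` are placeholders rather than the variables of this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for the wine table: 100 samples, 3 features, scores 3..8.
features = rng.normal(loc=10.0, scale=2.0, size=(100, 3))
labels = rng.integers(3, 9, size=100)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

scaler = StandardScaler()
f_train_scaled = scaler.fit_transform(f_train)  # learn mean and std on the training set
f_test_scaled = scaler.transform(f_test)        # reuse the same mean and std on the test set
print(f_train_scaled.shape, f_test_scaled.shape)
```

Fitting the scaler a second time on the test set would scale the two parts with different means and standard deviations, so the model would be evaluated on features that are not comparable with those it was trained on.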

1.Random Forest Classifier

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
##                        max_depth=None, max_features='auto', max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=10,
##                        n_jobs=None, oob_score=False, random_state=None,
##                        verbose=0, warm_start=False)
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\ensemble\forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
##   "10 in version 0.20 to 100 in 0.22.", FutureWarning)
pred_rfc = rfc.predict(X_test)

#Let's see how our model performed
print(classification_report(y_test, pred_rfc))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.55      0.14      0.22        43
##            5       0.61      0.71      0.65       402
##            6       0.64      0.70      0.67       597
##            7       0.64      0.46      0.53       215
##            8       0.82      0.25      0.38        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.63      1300
##    macro avg       0.46      0.32      0.35      1300
## weighted avg       0.63      0.63      0.61      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_rfc), 3))
## Overall Accuracy: 0.625
rfc_defaut=round(metrics.accuracy_score(y_test, pred_rfc), 3)

2.Logistic Regression




lr = LogisticRegression()
lr.fit(X_train, y_train)
## LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
##                    intercept_scaling=1, l1_ratio=None, max_iter=100,
##                    multi_class='warn', n_jobs=None, penalty='l2',
##                    random_state=None, solver='warn', tol=0.0001, verbose=0,
##                    warm_start=False)
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
##   FutureWarning)
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\linear_model\logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
##   "this warning.", FutureWarning)
pred_lr = lr.predict(X_test)
#Let's see how our model performed
print(classification_report(y_test, pred_lr))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.00      0.00      0.00        43
##            5       0.53      0.61      0.57       402
##            6       0.52      0.69      0.60       597
##            7       0.48      0.12      0.19       215
##            8       0.00      0.00      0.00        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.52      1300
##    macro avg       0.22      0.20      0.19      1300
## weighted avg       0.48      0.52      0.48      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_lr), 3))
## Overall Accuracy: 0.524

3.Stochastic Gradient Descent Classifier

sgd = SGDClassifier(penalty=None)
sgd.fit(X_train, y_train)
## SGDClassifier(alpha=0.0001, average=False, class_weight=None,
##               early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
##               l1_ratio=0.15, learning_rate='optimal', loss='hinge',
##               max_iter=1000, n_iter_no_change=5, n_jobs=None, penalty=None,
##               power_t=0.5, random_state=None, shuffle=True, tol=0.001,
##               validation_fraction=0.1, verbose=0, warm_start=False)
pred_sgd = sgd.predict(X_test)
print(classification_report(y_test, pred_sgd))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.00      0.00      0.00        43
##            5       0.46      0.64      0.53       402
##            6       0.51      0.46      0.48       597
##            7       0.38      0.33      0.35       215
##            8       0.00      0.00      0.00        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.46      1300
##    macro avg       0.19      0.20      0.20      1300
## weighted avg       0.44      0.46      0.44      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_sgd), 3))
## Overall Accuracy: 0.462

4.Decision Trees

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
## DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
##                        max_features=None, max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, presort=False,
##                        random_state=None, splitter='best')
pred_dt = dt.predict(X_test)
print(classification_report(y_test, pred_dt))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.22      0.19      0.20        43
##            5       0.55      0.60      0.58       402
##            6       0.60      0.56      0.58       597
##            7       0.47      0.47      0.47       215
##            8       0.25      0.28      0.26        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.54      1300
##    macro avg       0.30      0.30      0.30      1300
## weighted avg       0.54      0.54      0.54      1300
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_dt), 3))
## Overall Accuracy: 0.536

5.SVM

svc = SVC()
svc.fit(X_train, y_train)
## SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
##     decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
##     kernel='rbf', max_iter=-1, probability=False, random_state=None,
##     shrinking=True, tol=0.001, verbose=False)
pred_svc = svc.predict(X_test)
print(classification_report(y_test, pred_svc))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.00      0.00      0.00        43
##            5       0.58      0.65      0.61       402
##            6       0.55      0.72      0.62       597
##            7       0.61      0.19      0.29       215
##            8       0.00      0.00      0.00        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.56      1300
##    macro avg       0.25      0.22      0.22      1300
## weighted avg       0.53      0.56      0.52      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_svc), 3))
## Overall Accuracy: 0.562
svc_defaut=round(metrics.accuracy_score(y_test, pred_svc), 3)

6.K-nearest neighbours (KNN)

# k-NN with a single neighbour: each prediction is the score of the closest training wine.
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train, y_train)
## KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
##                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
##                     weights='uniform')
# Predict with knn itself. The original line called svc.predict, so the report and
# accuracy printed here merely repeated the SVM results; rerun to get the k-NN figures.
pred_knn = knn.predict(X_test).astype(int)
print(classification_report(y_test, pred_knn))
print ("Overall Accuracy:", round(metrics.accuracy_score(y_test, pred_knn), 3))

The most accurate algorithm is the random forest, with an accuracy of about 0.63 (0.625 in the run shown above). The SVM follows at 0.56 and the decision tree at 0.54. The 0.56 first reported for k-nearest neighbours was in fact the SVM score, because those predictions were computed with svc.predict. We will now select and validate the hyperparameters.
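A ranking based on a single train/test split can shift from one run to the next. As a hedged alternative, the sketch below compares a few classifiers with 5-fold cross-validation, each wrapped in a pipeline so that scaling is refitted inside every fold. It uses synthetic data from `make_classification`; the model selection, names and settings are assumptions for illustration, not the procedure followed in this report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the wine scores.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

cv_scores = {}
for name, model in candidates.items():
    # The scaler is part of the pipeline, so it is fitted on each training fold only.
    pipeline = make_pipeline(StandardScaler(), model)
    cv_scores[name] = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()

best_name = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_name)
```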

Hyperparameter selection and validation

Hyperparameters for random forests

# Designate distributions to sample hyperparameters from
n_estimators = np.random.uniform(70, 80, 5).astype(int)
max_features = np.random.normal(6, 3, 5).astype(int)

# Check max_features>0 & max_features<=total number of features
max_features[max_features <= 0] = 1
max_features[max_features > X.shape[1]] = X.shape[1]

hyperparameters = {'n_estimators': list(n_estimators),
                   'max_features': list(max_features)}

print (hyperparameters)
## {'n_estimators': [72, 71, 72, 79, 79], 'max_features': [3, 7, 6, 5, 8]}

We use these hyperparameters to tune the algorithm with RandomizedSearchCV, which runs faster than GridSearchCV because it only evaluates a sample of the combinations.
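RandomizedSearchCV also accepts distributions directly, which avoids pre-sampling a short list of candidate values. The sketch below is a variant under assumptions, not the search run in this report: it draws `n_estimators` and `max_features` from `scipy.stats.randint` on synthetic data, with small `n_iter` and `cv` values chosen only to keep it quick.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the scaled wine features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)

# randint(low, high) draws integers in [low, high), so the upper bound is exclusive.
param_distributions = {
    "n_estimators": randint(70, 81),                  # 70..80 trees
    "max_features": randint(1, X_demo.shape[1] + 1),  # 1..number of features
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)
best = search.best_params_
print(best)
```

Fixing `random_state` makes the sampled candidates reproducible, which helps when comparing two runs of the search.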

Randomized search with cross-validation for random forests

# Run randomized search
randomCV = RandomizedSearchCV(RandomForestClassifier(), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(X_train, y_train)

# Identify optimal hyperparameter values
## RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
##                    estimator=RandomForestClassifier(bootstrap=True,
##                                                     class_weight=None,
##                                                     criterion='gini',
##                                                     max_depth=None,
##                                                     max_features='auto',
##                                                     max_leaf_nodes=None,
##                                                     min_impurity_decrease=0.0,
##                                                     min_impurity_split=None,
##                                                     min_samples_leaf=1,
##                                                     min_samples_split=2,
##                                                     min_weight_fraction_leaf=0.0,
##                                                     n_estimators='warn',
##                                                     n_jobs=None,
##                                                     oob_score=False,
##                                                     random_state=None,
##                                                     verbose=0,
##                                                     warm_start=False),
##                    iid='warn', n_iter=20, n_jobs=None,
##                    param_distributions={'max_features': [3, 7, 6, 5, 8],
##                                         'n_estimators': [72, 71, 72, 79, 79]},
##                    pre_dispatch='2*n_jobs', random_state=None, refit=True,
##                    return_train_score=False, scoring=None, verbose=0)
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
##   warnings.warn(CV_WARNING, FutureWarning)
best_n_estim      = randomCV.best_params_['n_estimators']
best_max_features = randomCV.best_params_['max_features']

print("The best performing n_estimators value is: {:5d}".format(best_n_estim))
## The best performing n_estimators value is:    72
print("The best performing max_features value is: {:5d}".format(best_max_features))
## The best performing max_features value is:     3

With the best hyperparameters found, we can now train the model and then test it on the data.

Optimal training with the new hyperparameters

The RandomForestClassifier algorithm

# Train classifier using optimal hyperparameter values
# We could have also gotten this model out from randomCV.best_estimator_
rfc2 = RandomForestClassifier(n_estimators=best_n_estim,
                            max_features=best_max_features)

rfc2.fit(X_train, y_train)
## RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
##                        max_depth=None, max_features=3, max_leaf_nodes=None,
##                        min_impurity_decrease=0.0, min_impurity_split=None,
##                        min_samples_leaf=1, min_samples_split=2,
##                        min_weight_fraction_leaf=0.0, n_estimators=72,
##                        n_jobs=None, oob_score=False, random_state=None,
##                        verbose=0, warm_start=False)
rfc2_predictions = rfc2.predict(X_test)
print (metrics.classification_report(y_test, rfc2_predictions))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       0.60      0.07      0.12        43
##            5       0.67      0.70      0.69       402
##            6       0.65      0.76      0.70       597
##            7       0.70      0.52      0.60       215
##            8       1.00      0.33      0.50        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.66      1300
##    macro avg       0.52      0.34      0.37      1300
## weighted avg       0.67      0.66      0.65      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print ("Overall Accuracy optimal:", round(metrics.accuracy_score(y_test, rfc2_predictions), 3))
## Overall Accuracy optimal: 0.664
print("Overall Accuracy (default):", rfc_defaut)
## Overall Accuracy (default): 0.625

Accuracy improved with the tuned hyperparameters (0.664 versus 0.625 with the defaults). However, the predictions are still too imprecise for grading wines, mainly because some grades are very rare: grade 9, for instance, has only 5 samples among the roughly 6,400 wines.
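The gap between the macro and weighted averages in the report above is a direct effect of this class imbalance. As a quick check, the sketch below recomputes both averages of the recall column from the per-class values printed in the report. The variable names are illustrative and not part of the original analysis.

```python
# Per-class recall and support copied from the random forest
# classification report (grades 3 to 9). Names are illustrative.
recall = {3: 0.00, 4: 0.07, 5: 0.70, 6: 0.76, 7: 0.52, 8: 0.33, 9: 0.00}
support = {3: 6, 4: 43, 5: 402, 6: 597, 7: 215, 8: 36, 9: 1}

# Macro average: every grade counts equally, however rare it is.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each grade counts in proportion to its number of test wines.
total = sum(support.values())
weighted_recall = sum(recall[g] * support[g] for g in recall) / total

print("macro avg recall:    {:.2f}".format(macro_recall))
print("weighted avg recall: {:.2f}".format(weighted_recall))
```

The two values come out at 0.34 and 0.66, matching the report: the rare grades (3, 4, 8, 9) pull the macro average down, while the weighted average is dominated by grades 5 and 6.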

#converting the numpy array to list
xRF=np.array(rfc2_predictions).tolist()

#printing first 5 predictions
print("\nRandom forest predictions:\n")
## 
## Random forest predictions:
for i in range(0,5):
    print(xRF[i])
## 6
## 5
## 7
## 6
## 5
#printing first five expectations
print("\nRandom forest observations:\n")
## 
## Random forest observations:
print (y_test.head())
## 1504    8
## 1419    5
## 3162    7
## 3091    6
## 2433    6
## Name: quality, dtype: int64

Only three of these five predictions match the observed grade, so individual predicted grades are not very reliable.

Next, we test the SVM algorithm to see whether it can improve prediction accuracy.

The SVM algorithm
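This comparison can be made less manual by lining up the predictions with the observations and counting the matches. A minimal sketch using the five values printed above, hard-coded here so that the snippet runs on its own (in the notebook one would use `rfc2_predictions` and `y_test` directly):

```python
# First five random forest predictions and observed grades,
# copied from the output above.
predicted = [6, 5, 7, 6, 5]
observed = [8, 5, 7, 6, 6]

# Count exact matches and measure how far off each prediction is.
matches = sum(p == o for p, o in zip(predicted, observed))
abs_errors = [abs(p - o) for p, o in zip(predicted, observed)]

print("exact matches: {} of {}".format(matches, len(observed)))
print("absolute errors:", abs_errors)
```

Looking at the absolute error as well as exact matches is useful here because the grades are ordered: predicting 5 for a wine rated 6 is a smaller mistake than predicting 6 for a wine rated 8.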

# Designate distributions to sample hyperparameters from
np.random.seed(123)
g_range = np.random.uniform(0.0, 0.3, 5).astype(float)
C_range = np.random.normal(1, 0.1, 5).astype(float)

# Ensure C > 0 (SVC requires a strictly positive C); the gamma values sampled above are already positive
C_range[C_range <= 0] = 0.0001

hyperparameters = {'gamma': list(g_range),
                    'C': list(C_range)}

print (hyperparameters)
## {'gamma': [0.2089407556793585, 0.08584180048511383, 0.06805543606926093, 0.16539443072486737, 0.2158406909356689], 'C': [1.0322106068339623, 0.9948482279060615, 0.9795799035361106, 1.197934843277785, 0.8380699934963254]}

##RandomizedSearchCV using cross-validation for SVM

For this prediction we use an SVM with a nonlinear kernel, namely the radial basis function (RBF) kernel, which is a very common choice.
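For reference, the RBF kernel scores the similarity of two samples x and z as exp(-gamma * ||x - z||^2), so gamma controls how quickly similarity decays with distance. The sketch below evaluates this formula in plain Python. The function name and the sample vectors are illustrative assumptions, not part of the original analysis.

```python
import math

def rbf_kernel_value(x, z, gamma):
    """Return exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two made-up feature vectors (hypothetical values).
x = [0.5, -1.0, 0.0]
z = [0.0, 0.0, 0.0]

# A point is maximally similar to itself (kernel value 1).
print(rbf_kernel_value(x, x, gamma=0.2))

# For distinct points, similarity shrinks as gamma grows.
for gamma in (0.05, 0.2, 1.0):
    print(gamma, rbf_kernel_value(x, z, gamma))
```

A larger gamma makes each training point influence only its close neighbours, which is why gamma is tuned together with C in the search that follows.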

# Run randomized search
randomCV = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions=hyperparameters, n_iter=20)
randomCV.fit(X_train, y_train)

# Identify optimal hyperparameter values
## RandomizedSearchCV(cv='warn', error_score='raise-deprecating',
##                    estimator=SVC(C=1.0, cache_size=200, class_weight=None,
##                                  coef0=0.0, decision_function_shape='ovr',
##                                  degree=3, gamma='auto_deprecated',
##                                  kernel='rbf', max_iter=-1, probability=False,
##                                  random_state=None, shrinking=True, tol=0.001,
##                                  verbose=False),
##                    iid='warn', n_iter=20, n_jobs=None,
##                    param_distributions={'C': [1.0322106068339623,
##                                               0.9948482279060615,
##                                               0.9795799035361106,
##                                               1.197934843277785,
##                                               0.8380699934963254],
##                                         'gamma': [0.2089407556793585,
##                                                   0.08584180048511383,
##                                                   0.06805543606926093,
##                                                   0.16539443072486737,
##                                                   0.2158406909356689]},
##                    pre_dispatch='2*n_jobs', random_state=None, refit=True,
##                    return_train_score=False, scoring=None, verbose=0)
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
##   warnings.warn(CV_WARNING, FutureWarning)
best_gamma  = randomCV.best_params_['gamma']
best_C      = randomCV.best_params_['C']
print("The best performing gamma value is: {:5.2f}".format(best_gamma))
## The best performing gamma value is:  0.22
print("The best performing C value is: {:5.2f}".format(best_C))
## The best performing C value is:  0.99
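The search above draws its candidates from five pre-sampled values per hyperparameter. `RandomizedSearchCV` also accepts distribution objects directly, in which case each iteration draws a fresh (gamma, C) pair. The sketch below illustrates this on synthetic data generated with `make_classification`. The data, the ranges, and the variable names are assumptions chosen for illustration, not the settings used in this report.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (hypothetical, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=0)

# scipy's uniform(loc, scale) samples from [loc, loc + scale].
param_distributions = {
    'gamma': uniform(loc=0.01, scale=0.29),  # gamma in [0.01, 0.30]
    'C': uniform(loc=0.5, scale=1.0),        # C in [0.5, 1.5]
}

search = RandomizedSearchCV(SVC(kernel='rbf'),
                            param_distributions=param_distributions,
                            n_iter=10, cv=3, random_state=0)
search.fit(X_demo, y_demo)

# With refit=True (the default), best_estimator_ is already retrained
# on the full training data with the best parameters found.
best_model = search.best_estimator_
print(search.best_params_)
print(best_model.score(X_demo, y_demo))
```

Using `best_estimator_` avoids building and fitting a second model by hand, and fixing `cv` and `random_state` explicitly makes the search reproducible and silences the `cv` warning seen above.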

##Optimal training with the new hyperparameters

The SVM algorithm

# Train SVM and output predictions
rbfSVM = SVC(kernel='rbf', C=best_C, gamma=best_gamma)
rbfSVM.fit(X_train, y_train)
## SVC(C=0.9948482279060615, cache_size=200, class_weight=None, coef0=0.0,
##     decision_function_shape='ovr', degree=3, gamma=0.2158406909356689,
##     kernel='rbf', max_iter=-1, probability=False, random_state=None,
##     shrinking=True, tol=0.001, verbose=False)
svm_predictions = rbfSVM.predict(X_test)
print(metrics.classification_report(y_test, svm_predictions))
##               precision    recall  f1-score   support
## 
##            3       0.00      0.00      0.00         6
##            4       1.00      0.02      0.05        43
##            5       0.59      0.67      0.63       402
##            6       0.56      0.71      0.63       597
##            7       0.64      0.25      0.36       215
##            8       0.00      0.00      0.00        36
##            9       0.00      0.00      0.00         1
## 
##     accuracy                           0.58      1300
##    macro avg       0.40      0.24      0.24      1300
## weighted avg       0.58      0.58      0.54      1300
## 
## 
## C:\Users\lnzb7292\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\metrics\classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
##   'precision', 'predicted', average, warn_for)
print("Overall Accuracy optimized:", round(metrics.accuracy_score(y_test, svm_predictions), 2))
## Overall Accuracy optimized: 0.58
print("Overall Accuracy (default):", svc_defaut)
## Overall Accuracy (default): 0.562

Accuracy improved slightly (0.58 in the classification report versus 0.562 with the defaults), but it remains below that of the random forest (0.664).

Let us check a few wine grades predicted by the SVM.

#converting the numpy array to list
x=np.array(svm_predictions).tolist()

#printing first 5 predictions
print("\nThe prediction SVM:\n")
## 
## The prediction SVM:
for i in range(0,5):
    print (x[i])
## 6
## 5
## 7
## 5
## 5
#printing first five expectations
print("\nThe expectation SVM:\n")
## 
## The expectation SVM:
print(y_test.head())
## 1504    8
## 1419    5
## 3162    7
## 3091    6
## 2433    6
## Name: quality, dtype: int64

Only 2 of these 5 grades are predicted correctly. Over the full test set of 1,300 wines, however, the SVM reaches an accuracy of about 58%.

Conclusion

The best algorithm for predicting a wine's grade on this dataset is the random forest. Hyperparameter tuning improved accuracy, but only modestly (from 0.625 to 0.664 for the random forest). This project was very interesting to carry out: I applied the methods covered in the RCP208 and RCP209 courses and used libraries that were new to me, such as Pandas. I also used R for the principal component analysis, which I found simpler than in Python. A natural extension would be to train a classifier to predict the wine's color (red or white).