ml/demo @ 5346d1d1c035e85b21d5572acc33dd36b0c555e0

The usual suspects: imports. You only need to run this once.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Load the data and split it into features and labels. Again, this only needs to run once.

wine_data = pd.read_csv("../WineQT.csv", delimiter=",")
wine_features = wine_data.drop("quality", axis=1).drop("Id", axis=1)
wine_labels = np.ravel(wine_data['quality'])

Check the data samples.

wine_features

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol
0	7.4	0.700	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4
1	7.8	0.880	0.00	2.6	0.098	25.0	67.0	0.99680	3.20	0.68	9.8
2	7.8	0.760	0.04	2.3	0.092	15.0	54.0	0.99700	3.26	0.65	9.8
3	11.2	0.280	0.56	1.9	0.075	17.0	60.0	0.99800	3.16	0.58	9.8
4	7.4	0.700	0.00	1.9	0.076	11.0	34.0	0.99780	3.51	0.56	9.4
...	...	...	...	...	...	...	...	...	...	...	...
1138	6.3	0.510	0.13	2.3	0.076	29.0	40.0	0.99574	3.42	0.75	11.0
1139	6.8	0.620	0.08	1.9	0.068	28.0	38.0	0.99651	3.42	0.82	9.5
1140	6.2	0.600	0.08	2.0	0.090	32.0	44.0	0.99490	3.45	0.58	10.5
1141	5.9	0.550	0.10	2.2	0.062	39.0	51.0	0.99512	3.52	0.76	11.2
1142	5.9	0.645	0.12	2.0	0.075	32.0	44.0	0.99547	3.57	0.71	10.2

1143 rows × 11 columns

wine_labels

array([5, 5, 5, ..., 5, 6, 5], shape=(1143,))

Split the dataset into train and test subsets, as is common.

NOTE: It may be tempting to use descriptive variable names such as features_train, features_test, labels_train and labels_test, but they are long to type and easy to mix up. Most examples use x for features (the input data) and y for labels (the expected results), so we follow that convention here.

x_train, x_test, y_train, y_test = train_test_split(wine_features, wine_labels, test_size=0.2, random_state=50)

Again, verify the data set sizes and samples.

print("train:", len(x_train), "test:", len(x_test))

train: 914 test: 229
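The sizes above follow directly from test_size=0.2 applied to 1143 rows. As a minimal sketch (using placeholder arrays instead of the wine data, since only the row count matters here), the arithmetic can be checked like this:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 1143  # number of rows in the wine data set shown above
test_size = 0.2

# Placeholder data (an assumption for this sketch): only the row count matters.
x = np.zeros((n_samples, 11))
y = np.zeros(n_samples)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=test_size, random_state=50)

# The test subset is rounded up; the train subset gets the remaining rows.
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

print("train:", len(x_tr), "test:", len(x_te))
print("expected train:", expected_train, "expected test:", expected_test)
```

Since 0.2 × 1143 = 228.6, the test subset is rounded up to 229 rows and the remaining 914 rows go to training.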

print("sample:\n", resample(x_train, n_samples=5))

sample:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
287             9.9              0.40         0.53             6.7      0.097   
147             7.8              0.44         0.28             2.7      0.100   
177            11.1              0.35         0.48             3.1      0.090   
1094            7.2              0.53         0.13             2.0      0.058   
305            10.4              0.41         0.55             3.2      0.076   

      free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
287                   6.0                  19.0  0.99860  3.27       0.82   
147                  18.0                  95.0  0.99660  3.22       0.67   
177                   5.0                  21.0  0.99860  3.17       0.53   
1094                 18.0                  22.0  0.99573  3.21       0.68   
305                  22.0                  54.0  0.99960  3.15       0.89   

      alcohol  
287      11.7  
147       9.4  
177      10.5  
1094      9.9  
305       9.9

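Note that the columns above span very different ranges. We need to normalise the data; with StandardScaler this means standardising each column to zero mean and unit variance, using the mean and standard deviation computed from the training subset only. Note that this does not squeeze the values into a fixed range a..b (that is what MinMaxScaler does); it only puts all columns on a comparable scale.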

scaler = StandardScaler().fit(x_train)
nx_train = scaler.transform(x_train)
nx_test = scaler.transform(x_test)
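To make it concrete what the scaler does, here is a small sketch that reproduces its result by hand. The demo_* arrays are made-up stand-ins for x_train and x_test, not the wine data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-ins for x_train / x_test: two columns on very different scales.
demo_train = rng.normal(loc=[8.0, 0.5], scale=[1.7, 0.18], size=(100, 2))
demo_test = rng.normal(loc=[8.0, 0.5], scale=[1.7, 0.18], size=(20, 2))

demo_scaler = StandardScaler().fit(demo_train)

# Same computation by hand: subtract the training mean, divide by the training std.
train_mean = demo_train.mean(axis=0)
train_std = demo_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (demo_test - train_mean) / train_std

matches = np.allclose(manual_test, demo_scaler.transform(demo_test))
print("manual result matches StandardScaler:", matches)

# After scaling, each training column has mean ~0 and standard deviation ~1.
scaled_train = demo_scaler.transform(demo_train)
print("column means:", scaled_train.mean(axis=0))
print("column stds: ", scaled_train.std(axis=0))
```

The test subset is transformed with the training statistics on purpose: fitting the scaler on the test data as well would leak information about it into the model.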

Review the data set now.

print("normalised sample:\n", resample(nx_train, n_samples=5))

normalised sample:
 [[-0.36337554  0.30744761 -0.88665518 -0.56520518  0.37265697  1.25154072
   1.96409859 -0.55957907 -1.4666001  -1.03970354 -0.61549675]
 [-0.75789051 -0.09066002 -0.93801151 -0.3273245  -0.54314184 -1.14742793
  -1.15154328 -0.46240645  0.24877775 -0.72667599  0.03054512]
 [-0.58881267  0.93304531  0.24318401 -0.24803095 -0.38287705  1.95123991
   1.80831649 -0.21691774  0.12171272 -0.91449252 -0.89237183]
 [-1.7159983   0.25057509 -1.34886212 -0.64449874 -0.7263016  -1.0474709
  -0.59072774 -1.71030739  1.64649303  1.15148931  1.87637902]
 [ 0.31293584  1.27428042 -0.68122987 -0.01015027  0.00633745 -0.6476428
  -0.49725849  0.76503709  0.18524524 -0.10062089  0.03054512]]

Time to rock & roll! Let's train the SVC model.

print("**** TESTING C-Support Vector Classification ****")

from sklearn.svm import SVC

svc_model = SVC()
svc_model.fit(nx_train, y_train)

# now test the fitness with the test subset
svc_y_predict = svc_model.predict(nx_test)

# visualise it
print("x: predictions, y: labels")
svc_cm = np.array(confusion_matrix(y_test, svc_y_predict, labels=[0,1,2,3,4,5,6,7,8,9,10]))
svc_conf_matrix = pd.DataFrame(svc_cm)
print(svc_conf_matrix)

**** TESTING C-Support Vector Classification ****
x: predictions, y: labels
    0   1   2   3   4   5   6   7   8   9   10
0    0   0   0   0   0   0   0   0   0   0   0
1    0   0   0   0   0   0   0   0   0   0   0
2    0   0   0   0   0   0   0   0   0   0   0
3    0   0   0   0   0   2   0   0   0   0   0
4    0   0   0   0   0   6   4   1   0   0   0
5    0   0   0   0   0  78  21   0   0   0   0
6    0   0   0   0   0  29  50   0   0   0   0
7    0   0   0   0   0   1  27   9   0   0   0
8    0   0   0   0   0   0   1   0   0   0   0
9    0   0   0   0   0   0   0   0   0   0   0
10   0   0   0   0   0   0   0   0   0   0   0
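The matrix above can be boiled down to a few numbers. The sketch below copies only its non-zero rows and columns (true labels 3 to 8, predicted labels 5 to 7) and derives the overall accuracy and the per-class recall for the classes the model actually predicts.

```python
import numpy as np

# Non-zero part of the SVC confusion matrix printed above.
# Rows: true labels 3..8, columns: predicted labels 5, 6, 7 (all other cells are 0).
true_labels = [3, 4, 5, 6, 7, 8]
pred_labels = [5, 6, 7]
cm = np.array([
    [2, 0, 0],
    [6, 4, 1],
    [78, 21, 0],
    [29, 50, 0],
    [1, 27, 9],
    [0, 1, 0],
])

# Correct predictions are the cells where the true label equals the predicted label.
total = cm.sum()
correct = sum(cm[true_labels.index(k), pred_labels.index(k)] for k in pred_labels)
accuracy = correct / total
print(f"accuracy: {correct}/{total} = {accuracy:.3f}")

# Recall per class: correct predictions divided by all samples of that true class.
recalls = {}
for k in pred_labels:
    row = cm[true_labels.index(k)]
    recalls[k] = row[pred_labels.index(k)] / row.sum()
    print(f"recall for quality {k}: {recalls[k]:.3f}")
```

This gives an accuracy of 137/229 (about 0.598). The model never predicts qualities 3, 4 or 8, and recall drops from about 0.788 for quality 5 to about 0.243 for quality 7.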

Visualise the SVC model performance in a nice heatmap graph.

sns.heatmap(svc_conf_matrix, annot=True, fmt='g')
plt.show()

Try another, simpler, visualisation.

plt.scatter(range(0, len(y_test)), y_test, color = 'blue')
plt.scatter(range(0, len(y_test)), svc_y_predict, color = 'green')

<matplotlib.collections.PathCollection at 0x179ef1310>

Now let's try and train the NuSVC model, too.

print("**** TESTING Nu-Support Vector Classification ****")

from sklearn.svm import NuSVC

nusvc_model = NuSVC(nu=0.015)
nusvc_model.fit(nx_train, y_train)

# now test the fitness with the test subset (predict with nusvc_model, not svc_model)
nusvc_y_predict = nusvc_model.predict(nx_test)

# visualise it
print("x: predictions, y: labels")
nu_cm = np.array(confusion_matrix(y_test, nusvc_y_predict, labels=[0,1,2,3,4,5,6,7,8,9,10]))
nu_conf_matrix = pd.DataFrame(nu_cm)
print(nu_conf_matrix)

**** TESTING Nu-Support Vector Classification ****
x: predictions, y: labels
    0   1   2   3   4   5   6   7   8   9   10
0    0   0   0   0   0   0   0   0   0   0   0
1    0   0   0   0   0   0   0   0   0   0   0
2    0   0   0   0   0   0   0   0   0   0   0
3    0   0   0   0   0   2   0   0   0   0   0
4    0   0   0   0   0   6   4   1   0   0   0
5    0   0   0   0   0  78  21   0   0   0   0
6    0   0   0   0   0  29  50   0   0   0   0
7    0   0   0   0   0   1  27   9   0   0   0
8    0   0   0   0   0   0   1   0   0   0   0
9    0   0   0   0   0   0   0   0   0   0   0
10   0   0   0   0   0   0   0   0   0   0   0

Visualise the NuSVC model performance in a nice heatmap graph as well.

# visualise the NuSVC model in a nice picture
sns.heatmap(nu_conf_matrix, annot=True, fmt='g')
plt.show()

Similarly, try the 2D visualisation.

plt.scatter(range(0, len(y_test)), y_test, color = 'blue')
plt.scatter(range(0, len(y_test)), nusvc_y_predict, color = 'green')

<matplotlib.collections.PathCollection at 0x179f5ad50>

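The two confusion matrices shown above are identical, but that alone does not show that SVC and NuSVC behave the same here. Identical output is exactly what results when the NuSVC predictions are taken from svc_model instead of nusvc_model, so check that predict is called on nusvc_model and re-run the cell before drawing a conclusion. Note also that NuSVC is not run with default settings: it uses nu=0.015.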
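A quick way to check whether two models really produce the same predictions is to compare the prediction arrays directly rather than eyeballing two matrices. The sketch below does that on synthetic two-class data from make_classification (an assumption for illustration, not the wine set) and with NuSVC at its default nu.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, NuSVC

# Synthetic two-class data as a stand-in for the wine set (assumption for this sketch).
x, y = make_classification(n_samples=400, n_features=11, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=50)

scaler = StandardScaler().fit(x_tr)
nx_tr, nx_te = scaler.transform(x_tr), scaler.transform(x_te)

svc = SVC().fit(nx_tr, y_tr)
nusvc = NuSVC().fit(nx_tr, y_tr)  # default nu=0.5

svc_pred = svc.predict(nx_te)
nusvc_pred = nusvc.predict(nx_te)  # each model predicts with its own fitted instance

# Fraction of test samples on which both models give the same answer.
agreement = np.mean(svc_pred == nusvc_pred)
print("identical predictions:", np.array_equal(svc_pred, nusvc_pred))
print(f"agreement between models: {agreement:.3f}")
print(f"SVC accuracy:   {np.mean(svc_pred == y_te):.3f}")
print(f"NuSVC accuracy: {np.mean(nusvc_pred == y_te):.3f}")
```

Applied to the notebook, the same comparison would be np.array_equal(svc_y_predict, nusvc_y_predict).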

Let's try our luck with a regression model as well - SVR is one such possibility.

print("**** TESTING C-Support Vector Regression ****")

from sklearn.svm import SVR

svr_model = SVR(kernel="rbf")
svr_model.fit(nx_train, y_train)  # use the scaled features, as with SVC and NuSVC

# now test the fitness with the test subset
svr_y_predict = svr_model.predict(nx_test)

**** TESTING C-Support Vector Regression ****
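Unlike the classifiers, SVR returns continuous values rather than whole quality scores, so a confusion matrix does not apply directly. One possible way to put a number on it, sketched below on synthetic data with an integer target in 3..8 (an assumption, not the wine set), is to report the mean absolute error and then round the predictions to compare them with the labels.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for the wine data: 11 features and an integer score in 3..8.
x = rng.normal(size=(400, 11))
noise = rng.normal(scale=0.5, size=400)
y = np.clip(np.rint(5.6 + 0.8 * x[:, 0] + noise), 3, 8).astype(int)

x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=50)
scaler = StandardScaler().fit(x_tr)

svr = SVR(kernel="rbf").fit(scaler.transform(x_tr), y_tr)
raw_pred = svr.predict(scaler.transform(x_te))  # continuous values, not whole scores

# Round to the nearest whole score so the result is comparable with the classifiers.
rounded_pred = np.clip(np.rint(raw_pred), 3, 8).astype(int)

mae = mean_absolute_error(y_te, raw_pred)
exact = np.mean(rounded_pred == y_te)
print(f"mean absolute error: {mae:.3f}")
print(f"exact matches after rounding: {exact:.3f}")
```

In the notebook, the same two lines of evaluation would take y_test and svr_y_predict as inputs.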

Again, try a 2D visualisation of results.

plt.scatter(range(0, len(y_test)), y_test, color = 'blue')
plt.scatter(range(0, len(y_test)), svr_y_predict, color = 'green')

<matplotlib.collections.PathCollection at 0x179fe47d0>

Feel free to take this on and play with it some more. What is there to do? Here are some ideas:

improve visualisations
experiment with model settings
try different train/test sizes
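For the "experiment with model settings" idea, one option is to let scikit-learn try several combinations with cross-validation. This is only a sketch: the data is synthetic, and the candidate values for C and gamma are arbitrary starting points rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine data (assumption for this sketch).
x, y = make_classification(n_samples=300, n_features=11, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=50)

# Scaling sits inside the pipeline, so it is refitted on each cross-validation fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# Arbitrary candidate values; "svc__" addresses the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(x_tr, y_tr)

print("best settings:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(x_te, y_te):.3f}")
```

With the wine data, x_train and y_train would go into search.fit unscaled, because the pipeline handles the scaling itself.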

wine-sklearn.ipynb 140 KB