
== Getting Up ==

=== First Things First ===

You want to at least get the terminology right. Use a reference such as https://www.manning.com/books/deep-learning-with-python-third-edition[_"Deep Learning with Python"_ by François Chollet], specifically chapter 1, _"What is Deep Learning?"_

You want to understand how the technology works. From the same book, chapter 2, _"The Mathematical Building Blocks of Neural Networks"_ does a great job at explaining it without too much academic distraction.

=== Get a Look at the Landscape ===

There are so many different layers of projects, tools, and libraries.

Just looking at Python's most popular ML frameworks:

* https://scikit-learn.org/stable/index.html[SciKit-Learn]
* https://pytorch.org/[PyTorch]
* https://www.tensorflow.org/[TensorFlow]

Additionally, https://keras.io/[Keras] is a multi-framework ML frontend that can run on top of TensorFlow, PyTorch, or JAX as interchangeable backends.

The problem is that nowadays this is the middle layer, not even the topmost one any more.
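
To make the multi-backend idea concrete, here is a minimal Keras 3 sketch - assuming Keras and the chosen backend are installed in the active environment; the model itself is just an arbitrary example:

[source,python]
----
import os

# Pick the backend before importing Keras: "tensorflow", "torch", or "jax"
os.environ["KERAS_BACKEND"] = "torch"

import keras

# The same model definition works regardless of the backend chosen above
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),
])
print(keras.backend.backend())  # confirms which backend is active
----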

These frameworks all come with their own tools that make working with data and training models easier:

* SciKit has a https://scikit-learn.org/stable/related_projects.html[related projects page]
* PyTorch has an https://landscape.pytorch.org/[entire ecosystem of supporting tools]
* TensorFlow has https://www.tensorflow.org/resources/libraries-extensions[the same thing]

The lower layers are libraries you will use to work with data, such as the following (a short sketch of how they stack follows the list):

* https://numpy.org/[NumPy]
* https://pandas.pydata.org/[Pandas] (based on NumPy)
* https://matplotlib.org/[Matplotlib] (for drawing simple graphs)
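
A minimal sketch of how these layers fit together, assuming all three packages are installed (as they are in the environments created later in this document):

[source,python]
----
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy provides the raw arrays...
rng = np.random.default_rng(seed=42)
data = rng.normal(size=(100, 2))

# ...Pandas wraps them with labels and table semantics...
df = pd.DataFrame(data, columns=["x", "y"])

# ...and Matplotlib draws them
df.plot.scatter(x="x", y="y")
plt.show()
----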

Core Python data types and libraries are an even lower layer. A course such as https://www.redhat.com/en/services/training/ad141-red-hat-training-presents-introduction-to-python-programming[AD141] might help you speed up through this lowest layer.

There will generally be some intermediate-level tools like these (see the Seaborn sketch after the list):

* https://scipy.org/[SciPy], based on NumPy, for more efficient data processing, adding some high-level functions
* https://seaborn.pydata.org/[Seaborn], based on Matplotlib, for richer visualisation
* https://plotly.com/python/[Plotly], another high-level visualisation library
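
For instance, a single Seaborn call can replace a fair amount of Matplotlib code. A minimal sketch, assuming Seaborn is installed (its `load_dataset` helper fetches a small sample dataset over the network):

[source,python]
----
import seaborn as sns
import matplotlib.pyplot as plt

# A bundled sample dataset; downloading it requires network access
penguins = sns.load_dataset("penguins")

# One call renders a grid of pairwise scatter plots, coloured by species
sns.pairplot(penguins, hue="species")
plt.show()
----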

The higher layers involve job scheduling frameworks like:

* https://www.ray.io/[Ray], infrastructure orchestration for distributed workloads
* https://codeflare.dev/[CodeFlare], distributed serverless machine learning

Additionally, building on top of pre-trained models:

* https://huggingface.co/docs/transformers/index[Transformers], an LLM-oriented library of pre-trained models built on top of lower-level packages (TF, PyTorch) - a usage sketch follows
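
A minimal sketch, assuming the `transformers` package and one of its backends (e.g. PyTorch) are installed; the first call downloads a default pre-trained model:

[source,python]
----
from transformers import pipeline

# pipeline() picks a default pre-trained model for the given task
classifier = pipeline("sentiment-analysis")
print(classifier("Conda environments make my life so much easier."))
----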

Frameworks for creating applications:

* https://www.langchain.com/[LangChain], a framework for creating LLM-based applications
* https://github.com/i-am-bee/beeai-framework[Bee Agent Framework], a framework to build applications based on AI agents (Agentic AI) that can make decisions and perform tasks autonomously

There will then be special-purpose libraries, such as:

* https://github.com/dmlc/xgboost[XGBoost], a multi-language regularising gradient-boosting library
* https://lightgbm.readthedocs.io/en/stable/[LightGBM], another gradient boosting framework

MLOps is another can of worms:

* https://www.kubeflow.org/[KubeFlow] for Kubernetes-related integrations (included in RHOAI)
* https://github.com/elyra-ai/elyra[Elyra], integrating with JupyterLab Notebooks

Don't worry - start small and when the sandbox becomes too small, look around to see what can extend it.

=== Get Some Samples ===

Have a look at some very simple datasets to start with.

A couple of great types of models to start learning with are regression (prediction) and classification (shape recognition, text analysis).

The data you need for those is typically very different. Try figuring out why and what kind of models a certain type of data supports well.
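
To make the distinction concrete, here is a minimal SciKit-Learn sketch contrasting the two model types on synthetic data:

[source,python]
----
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: features map to a continuous target value
X, y = make_regression(n_samples=200, n_features=3, noise=0.1, random_state=0)
print(LinearRegression().fit(X, y).predict(X[:1]))  # a continuous prediction

# Classification: features map to a discrete class label
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
print(LogisticRegression().fit(X, y).predict(X[:1]))  # a class label
----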

Nowadays, https://www.kaggle.com/datasets[Kaggle] has some great example datasets (requires a free account).

* https://casas.wsu.edu/datasets/[CASAS-HAR] (Human Activity Recognition from Continuous Ambient Sensor Data) is a very flexible regression dataset (available for https://www.kaggle.com/datasets/utkarshx27/ambient-sensor-based-human-activity-recognition[download from Kaggle]).
* MNIST and Fashion-MNIST are two excellent classification datasets: handwritten digits and fashion items. They are a great start for classification, and almost every framework includes them. Both are also available at Kaggle - https://www.kaggle.com/datasets/hojjatk/mnist-dataset[digits] and https://www.kaggle.com/datasets/zalando-research/fashionmnist[fashion] (see the loading sketch below).
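
A minimal loading sketch, assuming a SciKit-Learn environment; `fetch_openml` downloads and caches the dataset on first use:

[source,python]
----
from sklearn.datasets import fetch_openml

# ~15 MB download on first use, cached under ~/scikit_learn_data
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target
print(X.shape, y.shape)  # (70000, 784) (70000,)
----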

== Getting Running ==

Create a system-independent Python installation. Trust me, you want it divorced from the system Python.

https://conda-forge.org[Conda-Forge] is a community GitHub organisation containing repositories of https://conda.org[Conda] recipes.

Conda is a _"language-agnostic, multi-platform package management ecosystem for projects"_.

The minimal Conda installer, preconfigured to use Conda-Forge, is called https://conda-forge.org/download/[Miniforge].

=== Installing and Preparing Conda ===

Fresh installation:

[subs="+quotes"]
----
$ *mkdir /opt/miniforge*
$ *bash ~/Downloads/Miniforge3-24.11.3-2-MacOSX-arm64.sh -b -f -p /opt/miniforge*
----

For updates:

[subs="+quotes"]
----
$ *bash ~/Downloads/Miniforge3-24.11.3-2-MacOSX-arm64.sh -b -f -u -p /opt/miniforge*
----

Obtain an activation script that is not embedded into your `bashrc`:

[subs="+quotes"]
----
$ *mv ~/.bash_profile{,-backup}*
$ *touch ~/.bash_profile*
$ *mamba init bash*
no change /opt/miniforge/condabin/conda
...
no change /opt/miniforge/etc/profile.d/conda.csh
modified /foo/bar/.bash_profile

==> For changes to take effect, close and re-open your current shell. <==

Added mamba to /foo/bar/.bash_profile

==> For changes to take effect, close and re-open your current shell. <==

$ *(echo '#!/bin/false'; cat ~/.bash_profile) > conda-init.sh*
$ *mv ~/.bash_profile{-backup,}*
----

For activation at any time, source the script:

[subs="+quotes"]
----
$ *source conda-init.sh*
(base) $
----

=== Creating Conda Environments ===

You can create any number of environments in Conda.

Let's create a few: SciKit-Learn, PyTorch, and TensorFlow.

Step one is always identifying the Python version the toolkit works best with, taking all of its dependencies into account.

Sometimes, the toolkit will suggest installation steps for the package manager we chose (Conda). I propose you completely ignore those and roll your own environment. It will be for the better once you hit some issues (and you will): you will at least be familiar with the components you chose and with the process of replacing them and/or adding more.

Check the https://www.python.org/downloads/[Current Python Release Status]. As of this writing, 3.13 is the latest stable (non-pre-release) version.

Cross-check it against the https://scikit-learn.org/stable/install.html[latest stable SciKit-Learn release].

Create an environment description, say `env-sklearn-16.yml`:

[source,yaml]
----
---
name: sklearn-16
channels:
- conda-forge
dependencies:
- python>=3.13,<3.14
- numpy>=1.19.5
- scipy>=1.6.0
- scikit-learn>=1.6.1,<1.7.0
- cython>=3.0.10
- pandas>=1.1.5
- matplotlib>=3.3.4
- seaborn>=0.9.0
...
----

Now tell `mamba` (or `conda`) to create it:

[subs="+quotes"]
----
(base) $ *mamba env create -n sklearn-16 -f ./env-sklearn-16.yml*
Channels:
- conda-forge
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

Downloading and Extracting Packages:
...

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
...
----

Activate it (some checks along the way to show you how the entire thing works):

[subs="+quotes"]
----
(base) $ *which python*
/opt/miniforge/bin/python

(base) $ *python --version*
Python 3.12.9

(base) $ *mamba env list*
# conda environments:
#
base * /opt/miniforge
sklearn-16 /opt/miniforge/envs/sklearn-16

(base) $ *mamba activate sklearn-16*

(sklearn-16) $ *which python*
/opt/miniforge/envs/sklearn-16/bin/python

(sklearn-16) $ *python --version*
Python 3.13.2

(sklearn-16) $ *python3*
Python 3.13.2 | packaged by conda-forge | (main, Feb 17 2025, 14:02:48) [Clang 18.1.8 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> *import sklearn*

>>> *sklearn.show_versions()*

System:
python: 3.13.2 | packaged by conda-forge | (main, Feb 17 2025, 14:02:48) [Clang 18.1.8 ]
executable: /opt/miniforge/envs/sklearn-16/bin/python3
machine: macOS-15.4-arm64-arm-64bit-Mach-O

Python dependencies:
sklearn: 1.6.1
pip: 25.0.1
setuptools: 78.1.0
numpy: 2.2.4
scipy: 1.15.2
Cython: 3.0.12
pandas: 2.2.3
matplotlib: 3.10.1
joblib: 1.4.2
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 10
prefix: libopenblas
filepath: /opt/miniforge/envs/sklearn-16/lib/libopenblas.0.dylib
version: 0.3.29
threading_layer: openmp
architecture: VORTEX

user_api: openmp
internal_api: openmp
num_threads: 10
prefix: libomp
filepath: /opt/miniforge/envs/sklearn-16/lib/libomp.dylib
version: None

>>> *exit()*
----

If you later want to update some of the environment components, edit the env file and issue the following command:

[subs="+quotes"]
----
(sklearn-16) $ *mamba env update -f ./env-sklearn-16.yml*
----
WARNING: `env update` is always applied to the _current_ environment.

You can do the same with other environments: PyTorch, TensorFlow, etc. These may even come with hardware acceleration support for your computer system.

[source,yaml]
----
---
name: pytorch-26
channels:
- conda-forge
dependencies:
- python>=3.12,<3.13
- numpy>=1.19.5
- pandas>=1.1.5
- matplotlib>=3.3.4
- pytorch>=2.6,<2.7
...
----

[subs="+quotes"]
----
(base) $ *mamba env create -n pytorch-26 -f ./env-pytorch-26.yml*
----

[source,yaml]
----
---
name: tensorflow-2.16
channels:
- apple
- conda-forge
dependencies:
- python>=3.9
- numpy>=1.19.5
- pandas>=1.1.5
- matplotlib>=3.3.4
- tensorflow-deps
- pip>=25.0
- pip:
- tensorflow-macos
- tensorflow-metal
...
----

[subs="+quotes"]
----
(base) $ *mamba env create -n tf-216 -f ./env-tf-216.yml*
----

=== What is JupyterLab? ===

Try out the workflow by writing a script first. Testing it will involve a lot of re-running of the same code.

There is an example script for two model types using SciKit-Learn called `wine-sklearn.py`. The second model is deliberately commented out because there is an issue with it.

If you try to figure out what its problem is, you need to re-run the entire script every time you make a change, which is awkward and time-consuming.

Try executing the same workflow in an interactive interpreter by copying the script to a Python shell line by line. It's extremely inconvenient.

Sometimes you want to go back a couple of steps, change something about your data, and then re-run the training of a model. It is not very transparent what the state of your data is at any given moment, or what the correct order of steps should be.

JupyterLab Notebooks were designed to resolve those problems by being something in between. You can run them as a script, but you can also run individual blocks of a notebook called _cells_ in isolation.

Not only that - you can define different Python kernels, each belonging to a different Conda environment, within the same JupyterLab instance, and simply associate each notebook with the kernel it needs, so that it runs in whichever environment you want.

If you want to use them, the best way to do it is to install `jupyterlab` into the base environment.

[subs="+quotes"]
----
(_whatever_) $ *mamba activate base*

(base) $ *pip install jupyterlab*
Collecting jupyterlab
...
Successfully installed MarkupSafe-3.0.2 anyio-4.9.0 appnope-0.1.4 argon2-cffi-23.1.0 argon2-cffi-bindings-21.2.0 arrow-1.3.0 asttokens-3.0.0 async-lru-2.0.5 attrs-25.3.0 babel-2.17.0 beautifulsoup4-4.13.3 bleach-6.2.0 comm-0.2.2 debugpy-1.8.13 decorator-5.2.1 defusedxml-0.7.1 executing-2.2.0 fastjsonschema-2.21.1 fqdn-1.5.1 h11-0.14.0 httpcore-1.0.7 httpx-0.28.1 ipykernel-6.29.5 ipython-9.1.0 ipython-pygments-lexers-1.1.1 isoduration-20.11.0 jedi-0.19.2 jinja2-3.1.6 json5-0.12.0 jsonschema-4.23.0 jsonschema-specifications-2024.10.1 jupyter-client-8.6.3 jupyter-core-5.7.2 jupyter-events-0.12.0 jupyter-lsp-2.2.5 jupyter-server-2.15.0 jupyter-server-terminals-0.5.3 jupyterlab-4.4.0 jupyterlab-pygments-0.3.0 jupyterlab-server-2.27.3 matplotlib-inline-0.1.7 mistune-3.1.3 nbclient-0.10.2 nbconvert-7.16.6 nbformat-5.10.4 nest-asyncio-1.6.0 notebook-shim-0.2.4 overrides-7.7.0 pandocfilters-1.5.1 parso-0.8.4 pexpect-4.9.0 prometheus-client-0.21.1 prompt_toolkit-3.0.50 psutil-7.0.0 ptyprocess-0.7.0 pure-eval-0.2.3 pygments-2.19.1 python-dateutil-2.9.0.post0 python-json-logger-3.3.0 pyyaml-6.0.2 pyzmq-26.4.0 referencing-0.36.2 rfc3339-validator-0.1.4 rfc3986-validator-0.1.1 rpds-py-0.24.0 send2trash-1.8.3 six-1.17.0 sniffio-1.3.1 soupsieve-2.6 stack_data-0.6.3 terminado-0.18.1 tinycss2-1.4.0 tornado-6.4.2 traitlets-5.14.3 types-python-dateutil-2.9.0.20241206 typing_extensions-4.13.1 uri-template-1.3.0 wcwidth-0.2.13 webcolors-24.11.1 webencodings-0.5.1 websocket-client-1.8.0
----

Starting Jupyter will automatically open it in your browser.

[subs="+quotes"]
----
(base) $ *jupyter lab*
[I 2025-04-07 14:54:37.059 ServerApp] jupyter_lsp | extension was successfully linked.
...
[I 2025-04-07 14:54:39.694 LabApp] Build is up to date
----

If you want to reopen it at any later point, you can point your browser to `\http://localhost:8888/lab` and it will reload the last state of the workbench before you closed it.

=== Adding Conda Environments to JupyterLab ===

Introduce Jupyter kernels into the specific environments: while JupyterLab is running, install `ipykernel` into your environment and tell the module to register itself.

[subs="+quotes"]
----
(base) $ *mamba activate sklearn-16*

(sklearn-16) $ *pip install ipykernel*
Collecting ipykernel
...
Successfully installed appnope-0.1.4 asttokens-3.0.0 comm-0.2.2 debugpy-1.8.13 decorator-5.2.1 executing-2.2.0 ipykernel-6.29.5 ipython-9.1.0 ipython-pygments-lexers-1.1.1 jedi-0.19.2 jupyter-client-8.6.3 jupyter-core-5.7.2 matplotlib-inline-0.1.7 nest-asyncio-1.6.0 parso-0.8.4 pexpect-4.9.0 platformdirs-4.3.7 prompt_toolkit-3.0.50 psutil-7.0.0 ptyprocess-0.7.0 pure-eval-0.2.3 pygments-2.19.1 pyzmq-26.4.0 stack_data-0.6.3 traitlets-5.14.3 wcwidth-0.2.13

(sklearn-16) $ *python -m ipykernel install --user --name=sklearn-16*
Installed kernelspec sklearn-16 in /foo/bar/baz/Jupyter/kernels/sklearn-16
----

Do the same thing for the other two environments.

Once you open a notebook, you can select the kernel you need to run it with in the top-right corner menu.

== Magic Time ==

The https://www.kaggle.com/datasets/yasserh/wine-quality-dataset[Wine Quality Dataset] is a versatile dataset that can be used both as a classification and a regression dataset.

Download it from the above link and place it in the same directory as the other files, calling it `WineQT.csv`.

Its features are 11-dimensional vectors describing a wine's chemical composition, with one integer label between 0 and 10 expressing the rating the wine received.
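
A minimal inspection sketch; the `quality` and `Id` column names are assumptions based on the Kaggle download, so adjust them if your copy differs:

[source,python]
----
import pandas as pd

df = pd.read_csv("WineQT.csv")

# "quality" (the label) and "Id" are assumed column names
X = df.drop(columns=["quality", "Id"])
y = df["quality"]
print(X.shape)                        # samples x 11 features
print(y.value_counts().sort_index())  # how the ratings are distributed
----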

=== What the Files Do ===

The following files are available in this project:

`wine-sklearn.py`::
A SciKit-Learn script that loads the data, splits it into training and testing subsets, normalises the features, and trains a _C-Support Vector Classification_ model (called `SVC` in SKLearn). It then visualises the performance of the model using a _confusion matrix_ rendered as a heatmap. The commented-out part - training a modified SVC called `NuSVC`, which has an issue - is there to demonstrate how awkward testing and fixing a script is.
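
This is not the project script itself, just a minimal sketch of the same pipeline (again assuming the `quality` and `Id` column names from the Kaggle download):

[source,python]
----
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

df = pd.read_csv("WineQT.csv")
X = df.drop(columns=["quality", "Id"])  # assumed column names
y = df["quality"]

# Split, normalise (fit the scaler on training data only), train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)
model = SVC().fit(scaler.transform(X_train), y_train)

# Visualise how predictions line up with the true labels
cm = confusion_matrix(y_test, model.predict(scaler.transform(X_test)))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()
----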