Release G: AI & Machine Learning
Kubeflow
Kubeflow is a collection of cloud-native tools that covers all stages of the Model Development Life Cycle: data exploration, data preparation, model training/tuning/testing, and model serving.
There is currently a diverse selection of libraries, tools, and frameworks for machine learning. Kubeflow allows you to compose and customize your own stack based on your specific needs.
Kubeflow Pipelines
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows using Docker containers.
The isolation provided by containers allows machine learning stages to be portable and reproducible.
Kubeflow Pipelines is designed to simplify the process of building and deploying machine learning systems at scale.
Kubeflow Pipelines provides:
- An orchestration engine for running multistep workflows
- A Python SDK for building and running pipeline components
- A user interface for visualizing your workflows
Kubeflow Pipelines is based on Argo Workflows, an open source, container-native workflow engine for Kubernetes.
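To give a feel for the Python SDK, here is a minimal sketch of a two-step pipeline using the kfp package (pip install kfp); the component logic, names, and data path are illustrative:
import kfp
from kfp.components import create_component_from_func

# Hypothetical components: each function becomes a containerized pipeline step.
def prepare_data() -> str:
    return "s3://ml-training/data/iris.tar.gz"

def train_model(data_path: str):
    print("training on " + data_path)

prepare_op = create_component_from_func(prepare_data)
train_op = create_component_from_func(train_model)

@kfp.dsl.pipeline(name="iris-demo", description="Toy two-step pipeline")
def iris_pipeline():
    prepare_task = prepare_op()
    train_op(prepare_task.output)

# Compile to a workflow definition that can be uploaded via the UI or SDK.
kfp.compiler.Compiler().compile(iris_pipeline, "iris_pipeline.yaml")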
You can install Kubeflow Pipelines as a standalone component (it includes MinIO) using the following commands; the cluster-scoped resources must be created first:
kubectl create -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=1.8.2"
kubectl create -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=1.8.2"
Run the following command to view the pipeline dashboard on your localhost:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
See also:
Experiment with the Pipelines Samples
Kubeflow Pipelines up and running in 5 minutes
MinIO
MinIO is a high-performance object store. It is API-compatible with the Amazon S3 cloud storage service.
Run the following command to view the MinIO dashboard on your localhost:
kubectl port-forward -n kubeflow svc/minio-service 9000:9000
You can then connect to minio, create a bucket and then upload a file:
mc config host add minio http://localhost:9000 minio minio123
mc mb minio/ml-training
mc cp iris.tar.gz minio/ml-training/data/iris.tar.gz
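Equivalently, you can use the MinIO Python SDK (pip install minio); a minimal sketch, assuming the same credentials and the port-forward above:
from minio import Minio

# Connect to the port-forwarded MinIO service (credentials as above).
client = Minio("localhost:9000", access_key="minio", secret_key="minio123", secure=False)

# Create the bucket if it does not exist, then upload a file.
if not client.bucket_exists("ml-training"):
    client.make_bucket("ml-training")
client.fput_object("ml-training", "data/iris.tar.gz", "iris.tar.gz")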
MinIO Playground (AWS Access Key ID: Q3AM3UQ867SPQQA43P2F, AWS Secret Access Key: zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG)
MinIO Bucket Notification Guide
Querying data without servers or databases using Amazon S3 Select
How to use s3 select in AWS SDK for Go
AWS Glue 101: All you need to know with a full walk-through
KServe
KServe enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases.
You can install KServe using the following command:
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.8/hack/quick_install.sh" | bash
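As a sketch of what deployment looks like from Python, the KServe SDK can create an InferenceService programmatically; the scikit-learn model URI below comes from the KServe samples, and the name and namespace are illustrative:
from kubernetes import client
from kserve import (KServeClient, constants, V1beta1InferenceService,
                    V1beta1InferenceServiceSpec, V1beta1PredictorSpec,
                    V1beta1SKLearnSpec)

# Define an InferenceService serving a sample scikit-learn model.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"))))

# Submit it to the cluster.
KServeClient().create(isvc)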
See Also:
Deploy Transformer with InferenceService
KServe: A Robust and Extensible Cloud Native Model Server
Deploying a model using the KServe Python SDK
Knative Monitoring
Grafana can be used to monitor your Kubeflow stack.
The Knative monitoring YAML is available here: monitoring-metrics-prometheus.yaml
You can install knative monitoring using the following commands:
kubectl create ns knative-monitoring
kubectl create -f kubeflow/monitoring-metrics-prometheus.yaml
Knative Eventing
Knative Eventing is a collection of APIs that enable you to use an event-driven architecture with your applications.
Knative Eventing can be installed using the following commands:
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/eventing-crds.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/eventing-core.yaml
To install the in-memory channel and the multi-tenant (MT) channel-based broker, use:
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/in-memory-channel.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/mt-channel-broker.yaml
To install the Kafka broker, use:
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.6.0/eventing-kafka-controller.yaml
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.6.0/eventing-kafka-broker.yaml
If you choose to use Kafka, you'll need to install Strimzi (a Kubernetes operator for Apache Kafka) first:
kubectl create ns kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
kubectl apply -f https://strimzi.io/examples/latest/kafka/kafka-persistent-single.yaml -n kafka
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka
Using CloudEvents with KafkaSource:
Knative Eventing uses the CloudEvents format for exchanging events. A KafkaSource reads messages from a Kafka topic and delivers them to a sink as CloudEvents, as sketched below.
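For example (a sketch, assuming a KafkaSource named my-kafka-source on an illustrative topic my-topic in the default namespace; the image tag, payload, offsets, and timestamps are made up):
Producing a Kafka message:
kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:latest-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic my-topic
>{"message": "hello"}
The Kafka message as a CloudEvent, shown as the sink receives it in HTTP binary mode:
ce-specversion: 1.0
ce-type: dev.knative.kafka.event
ce-source: /apis/v1/namespaces/default/kafkasources/my-kafka-source#my-topic
ce-id: partition:0/offset:0
ce-time: 2022-06-01T12:00:00.000Z
content-type: application/json

{"message": "hello"}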
See also:
Installing Knative Eventing using YAML files
Processing S3 Files using Knative Eventing
Katib
Katib is a Kubernetes-native project for automated machine learning (AutoML) that is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users' choice and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.
Argo
Argo Workflows is an open source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is implemented as a Kubernetes CRD (Custom Resource Definition): you define workflows in which each step runs in its own container.
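As a sketch of the programming model, here is the well-known hello-world workflow from the Argo examples, submitted with kubectl (the argo namespace assumes a standard Argo installation):
kubectl create -n argo -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]
EOF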
Installing Argo
See Also:
Argo Events - The Event-Based Dependency Manager for Kubernetes
MLFlow
MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.
MLflow: A Machine Learning Lifecycle Platform
Installing MLflow
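To illustrate experiment tracking, here is a minimal sketch using the mlflow Python package (pip install mlflow); the parameter and metric names are made up:
import mlflow

# Each run records its parameters, metrics, and artifacts for later comparison.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.42)
You can then run mlflow ui to browse the recorded runs.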
Time Series Databases
InfluxDB
InfluxDB is an open-source time series database developed by InfluxData. It is written in the Go programming language and is used for the storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics.
A time series database is specifically designed to handle time-stamped metrics, events, and measurements.
Internet of Things (IoT) is typically defined as a group of devices that are connected to the Internet, all collecting, sharing, and storing data.
Examples include temperature sensors in an air-conditioning unit and pressure sensors installed on a remote oil pump.
Scalability and the capacity to ingest data quickly are the main database requirements for IoT applications. NoSQL systems are well suited to IoT because they are designed for significant horizontal scalability.
InfluxDB is central to many IoT solutions, providing high-throughput ingestion, compression, and real-time querying of that same data.
You can run InfluxDB on minikube using the following file: influxdb.yaml
The ConfigMap stores the InfluxDB configuration file, which points to the directory where the data files are stored (in this case /var/influxdb); you may want to change this in your own environment.
Python code example - create a time series table in InfluxDB
import pandas as pd
from influxdb import DataFrameClient, InfluxDBClient
import pandas_datareader as pdr
from datetime import datetime, timedelta

print("Create pandas DataFrame")
yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
lastyear = (datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d')
today = datetime.today().strftime('%Y-%m-%d')

# Fetch a year of daily share prices for the selected tickers.
companies = ['AAPL']  # , 'MSFT', 'GOOGL']
df = pdr.DataReader(companies, 'yahoo', start=lastyear, end=yesterday)
print(df)
print(df.index)

column_names = list(df.columns)
print(column_names)

# Flatten the (field, ticker) MultiIndex columns into plain column names,
# remembering the ticker symbol.
new_columns = []
symbol = ""
for column in df.columns:
    new_columns.append(column[0])
    symbol = column[1]
df.columns = new_columns

# Add the ticker symbol as its own column.
df['Ticker'] = symbol
column_names = list(df.columns)
print(column_names)

user = 'influxdb'
password = 'influxdb'
dbname = 'shares'
protocol = 'line'
host = 'localhost'
port = 8086
measurement = 'price'

client = DataFrameClient(host, port, user, password, dbname)
print("Create database: " + dbname)
client.create_database(dbname)

print("Write DataFrame")
client.write_points(df, measurement, protocol=protocol)

print("Read DataFrame")
client = InfluxDBClient(host, port, user, password, dbname)
q = "select * from " + measurement
# Returns all points
df = pd.DataFrame(client.query(q, chunked=True, chunk_size=10000).get_points())
print(df.head())

print("Delete database: " + dbname)
client.drop_database(dbname)
Feature Engineering
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and train better features.
Feature Engine
Feature-engine is a Python library with multiple transformers to engineer and select features to use in machine learning models.
Feature-engine: A Python library for Feature Engineering and Selection
Feature-engine: A new open source Python package for feature engineering
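A minimal sketch of what a Feature-engine transformer looks like (pip install feature-engine); the toy DataFrame and column names are illustrative:
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

# Toy data with missing values.
df = pd.DataFrame({"age": [25, None, 40, 31], "income": [50000, 62000, None, 58000]})

# Impute missing values with the median of each listed variable.
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "income"])
print(imputer.fit_transform(df))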
Feature Selection
SelectFromModel Feature Selection Example in Python
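For reference, a minimal sketch of scikit-learn's SelectFromModel on a toy dataset:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Keep only features whose importance in the fitted forest exceeds the default threshold.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)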
Time Series Forecasting
Time series analysis comprises methods for analyzing time-series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
Feature engineering for time series forecasting
LSTM (Long short-term memory)
Time Series Forecasting With RNN(LSTM)
Time Series Forecasting Using LSTM
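A minimal sketch of next-step forecasting with a Keras LSTM (pip install tensorflow); the sine-wave series, window size, and layer sizes are illustrative:
import numpy as np
from tensorflow import keras

# Toy series: a sine wave, split into sliding windows of 20 steps.
series = np.sin(np.arange(400) * 0.1)
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]  # shape: (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))  # forecast the next value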
Explore and Interpret ML Models
Dalex
Visualizing ML model bias with dalex
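A minimal sketch of model explanation with dalex (pip install dalex); the model and dataset are toy examples:
import dalex as dx
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation-based feature importance for the fitted model.
explainer = dx.Explainer(model, X, y, label="rf")
print(explainer.model_parts().result.head())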
AI Fairness 360
Anchors
Ultimate Guide To Model Explainability
Feast
Feast is a framework for storing and serving features to machine learning models.
Creating feature store with Feast
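A minimal sketch of reading online features with the feast Python package (pip install feast); it assumes an already-initialized repository containing the driver_hourly_stats feature view from the Feast quickstart:
from feast import FeatureStore

# Point at an existing feature repository.
store = FeatureStore(repo_path=".")

# Fetch online features for one entity.
features = store.get_online_features(
    features=["driver_hourly_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)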