Release G: AI & Machine Learning
Kubeflow
Kubeflow is a collection of cloud-native tools that covers all stages of the Model Development Life Cycle: data exploration, data preparation, model training/tuning/testing, and model serving.
There is currently a diverse selection of libraries, tools, and frameworks for machine learning. Kubeflow allows you to compose and customize your own stack based on your specific needs.
Kubeflow Pipelines
Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows using Docker containers.
The isolation provided by containers allows machine learning stages to be portable and reproducible.
Kubeflow Pipelines is designed to simplify the process of building and deploying machine learning systems at scale.
Kubeflow Pipelines provides:
- An orchestration engine for running multistep workflows
- A Python SDK for building and running pipeline components
- A user interface for visualizing your workflows
Kubeflow Pipelines is based on Argo Workflows, an open source, container-native workflow engine for Kubernetes.
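To give a feel for the Python SDK, here is a minimal sketch of a two-step pipeline using the kfp package (pip install kfp); the component logic, names, and data path are illustrative:
import kfp
from kfp.components import create_component_from_func

# Hypothetical components: each function becomes a containerized pipeline step.
def prepare_data() -> str:
    return "s3://ml-training/data/iris.tar.gz"

def train_model(data_path: str):
    print("training on " + data_path)

prepare_op = create_component_from_func(prepare_data)
train_op = create_component_from_func(train_model)

@kfp.dsl.pipeline(name="iris-demo", description="Toy two-step pipeline")
def iris_pipeline():
    prepare_task = prepare_op()
    train_op(prepare_task.output)

# Compile to a workflow definition that can be uploaded via the UI or SDK.
kfp.compiler.Compiler().compile(iris_pipeline, "iris_pipeline.yaml")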
You can install Kubeflow Pipelines as a standalone component (it includes MinIO) using the following commands; the cluster-scoped resources must be created first:
kubectl create -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=1.8.2"
kubectl create -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=1.8.2"
Run the following command to view the pipeline dashboard on your localhost:
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
See also:
Experiment with the Pipelines Samples
Kubeflow Pipelines up and running in 5 minutes
MinIO
MinIO is a high-performance object store. It is API-compatible with the Amazon S3 cloud storage service.
Run the following command to view the MinIO dashboard on your localhost:
kubectl port-forward -n kubeflow svc/minio-service 9000:9000
You can then connect to minio, create a bucket and then upload a file:
mc config host add minio http://localhost:9000 minio minio123
mc mb minio/ml-training
mc cp iris.tar.gz minio/ml-training/data/iris.tar.gz
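Equivalently, you can use the MinIO Python SDK (pip install minio); a minimal sketch, assuming the same credentials and the port-forward above:
from minio import Minio

# Connect to the port-forwarded MinIO service (credentials as above).
client = Minio("localhost:9000", access_key="minio", secret_key="minio123", secure=False)

# Create the bucket if it does not exist, then upload a file.
if not client.bucket_exists("ml-training"):
    client.make_bucket("ml-training")
client.fput_object("ml-training", "data/iris.tar.gz", "iris.tar.gz")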
MinIO Playground (AWS Access Key ID: Q3AM3UQ867SPQQA43P2F, AWS Secret Access Key: zuf+tfteSlswRu7BJ86wekitnifILbZam1KYY3TG)
MinIO Bucket Notification Guide
Querying data without servers or databases using Amazon S3 Select
How to use s3 select in AWS SDK for Go
AWS Glue 101: All you need to know with a full walk-through
KServe
KServe enables serverless inferencing on Kubernetes and provides performant, high abstraction interfaces for common machine learning (ML) frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases.
You can install KServe using the following command:
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.8/hack/quick_install.sh" | bash
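As a sketch of what deployment looks like from Python, the KServe SDK can create an InferenceService programmatically; the scikit-learn model URI below comes from the KServe samples, and the name and namespace are illustrative:
from kubernetes import client
from kserve import (KServeClient, constants, V1beta1InferenceService,
                    V1beta1InferenceServiceSpec, V1beta1PredictorSpec,
                    V1beta1SKLearnSpec)

# Define an InferenceService serving a sample scikit-learn model.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name="sklearn-iris", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri="gs://kfserving-examples/models/sklearn/1.0/model"))))

# Submit it to the cluster.
KServeClient().create(isvc)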
See Also:
Deploy Transformer with InferenceService
KServe: A Robust and Extensible Cloud Native Model Server
Deploying a model using the KServe Python SDK
Knative Monitoring
Grafana can be used to monitor your Kubeflow stack.
The Knative monitoring YAML is available here: monitoring-metrics-prometheus.yaml
You can install knative monitoring using the following commands:
kubectl create ns knative-monitoring
kubectl create -f kubeflow/monitoring-metrics-prometheus.yaml
Knative Eventing
Knative Eventing is a collection of APIs that enable you to use an event-driven architecture with your applications.
Knative Eventing can be installed using the following commands:
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/eventing-crds.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/eventing-core.yaml
To install the in-memory channel and the multi-tenant (MT) channel-based broker, use:
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/in-memory-channel.yaml
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.6.0/mt-channel-broker.yaml
To install the Kafka broker, use:
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.6.0/eventing-kafka-controller.yaml
kubectl apply -f https://github.com/knative-sandbox/eventing-kafka-broker/releases/download/knative-v1.6.0/eventing-kafka-broker.yaml
If you choose to use Kafka, you'll need to install Strimzi (a Kubernetes operator for Apache Kafka) first:
kubectl create ns kafka
kubectl create -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
kubectl apply -f https://strimzi.io/examples/latest/kafka/kafka-persistent-single.yaml -n kafka
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka
Using CloudEvents with KafkaSource:
Knative Eventing uses the CloudEvents format for exchanging events. A KafkaSource reads messages from a Kafka topic and delivers them to a sink as CloudEvents, as sketched below.
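For example (a sketch, assuming a KafkaSource named my-kafka-source on an illustrative topic my-topic in the default namespace; the image tag, payload, offsets, and timestamps are made up):
Producing a Kafka message:
kubectl -n kafka run kafka-producer -ti --image=quay.io/strimzi/kafka:latest-kafka-3.2.0 --rm=true --restart=Never -- bin/kafka-console-producer.sh --bootstrap-server my-cluster-kafka-bootstrap:9092 --topic my-topic
>{"message": "hello"}
The Kafka message as a CloudEvent, shown as the sink receives it in HTTP binary mode:
ce-specversion: 1.0
ce-type: dev.knative.kafka.event
ce-source: /apis/v1/namespaces/default/kafkasources/my-kafka-source#my-topic
ce-id: partition:0/offset:0
ce-time: 2022-06-01T12:00:00.000Z
content-type: application/json

{"message": "hello"}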
See also:
Installing Knative Eventing using YAML files
Processing S3 Files using Knative Eventing
Katib
Katib is a Kubernetes-native project for automated machine learning (AutoML) that is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users' choice and natively supports many ML frameworks, such as TensorFlow, MXNet, PyTorch, XGBoost, and others.
Argo
Argo Workflows is an open source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is implemented as a Kubernetes CRD (Custom Resource Definition): you define workflows in which each step runs in its own container.
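As a sketch of the programming model, here is the well-known hello-world workflow from the Argo examples, submitted with kubectl (the argo namespace assumes a standard Argo installation):
kubectl create -n argo -f - <<EOF
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  entrypoint: whalesay
  templates:
  - name: whalesay
    container:
      image: docker/whalesay
      command: [cowsay]
      args: ["hello world"]
EOF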
Installing Argo
See Also:
Argo Events - The Event-Based Dependency Manager for Kubernetes
MLFlow
MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models.
MLflow: A Machine Learning Lifecycle Platform
Installing MLflow
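To illustrate experiment tracking, here is a minimal sketch using the mlflow Python package (pip install mlflow); the parameter and metric names are made up:
import mlflow

# Each run records its parameters, metrics, and artifacts for later comparison.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("rmse", 0.42)
You can then run mlflow ui to browse the recorded runs.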
Time Series Databases
InfluxDB
InfluxDB is an open-source time series database developed by InfluxData. It is written in the Go programming language and is used for the storage and retrieval of time series data in fields such as operations monitoring, application metrics, Internet of Things sensor data, and real-time analytics.
A time series database is specifically designed to handle time-stamped metrics, events, and measurements.
Internet of Things (IoT) is typically defined as a group of devices that are connected to the Internet, all collecting, sharing, and storing data.
Examples include temperature sensors in an air-conditioning unit and pressure sensors installed on a remote oil pump.
Scalability and the capacity to ingest data quickly are the main database requirements for IoT applications. NoSQL systems are well suited to IoT because they are designed for significant horizontal scalability.
InfluxDB is central to many IoT solutions, providing high-throughput ingestion, compression, and real-time querying of that same data.
You can run InfluxDB on minikube using the following file: influxdb.yaml
The ConfigMap stores the InfluxDB configuration file, which points to the directory where the data files are stored (in this case /var/influxdb); you may want to change this in your own environment.
Python code example - create a time series table in InfluxDB
import pandas as pd
from influxdb import DataFrameClient, InfluxDBClient
import pandas_datareader as pdr
from datetime import datetime, timedelta

print("Create pandas DataFrame")
yesterday = (datetime.now() - timedelta(days=1)).strftime('%Y-%m-%d')
lastyear = (datetime.now() - timedelta(days=365)).strftime('%Y-%m-%d')
today = datetime.today().strftime('%Y-%m-%d')

# Fetch a year of daily share prices for the selected tickers.
companies = ['AAPL']  # , 'MSFT', 'GOOGL']
df = pdr.DataReader(companies, 'yahoo', start=lastyear, end=yesterday)
print(df)
print(df.index)

column_names = list(df.columns)
print(column_names)

# Flatten the (field, ticker) MultiIndex columns into plain column names,
# remembering the ticker symbol.
new_columns = []
symbol = ""
for column in df.columns:
    new_columns.append(column[0])
    symbol = column[1]
df.columns = new_columns

# Add the ticker symbol as its own column.
df['Ticker'] = symbol
column_names = list(df.columns)
print(column_names)

user = 'influxdb'
password = 'influxdb'
dbname = 'shares'
protocol = 'line'
host = 'localhost'
port = 8086
measurement = 'price'

client = DataFrameClient(host, port, user, password, dbname)
print("Create database: " + dbname)
client.create_database(dbname)

print("Write DataFrame")
client.write_points(df, measurement, protocol=protocol)

print("Read DataFrame")
client = InfluxDBClient(host, port, user, password, dbname)
q = "select * from " + measurement
# Returns all points
df = pd.DataFrame(client.query(q, chunked=True, chunk_size=10000).get_points())
print(df.head())

print("Delete database: " + dbname)
client.drop_database(dbname)
Feature Engineering
Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used in supervised learning. In order to make machine learning work well on new tasks, it might be necessary to design and train better features.
Feature Engine
Feature-engine is a Python library with multiple transformers to engineer and select features to use in machine learning models.
Feature-engine: A Python library for Feature Engineering and Selection
Feature-engine: A new open source Python package for feature engineering
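A minimal sketch of what a Feature-engine transformer looks like (pip install feature-engine); the toy DataFrame and column names are illustrative:
import pandas as pd
from feature_engine.imputation import MeanMedianImputer

# Toy data with missing values.
df = pd.DataFrame({"age": [25, None, 40, 31], "income": [50000, 62000, None, 58000]})

# Impute missing values with the median of each listed variable.
imputer = MeanMedianImputer(imputation_method="median", variables=["age", "income"])
print(imputer.fit_transform(df))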
Feature Selection
SelectFromModel Feature Selection Example in Python
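For reference, a minimal sketch of scikit-learn's SelectFromModel on a toy dataset:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Keep only features whose importance in the fitted forest exceeds the default threshold.
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)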
Time Series Forecasting
Time series analysis comprises methods for analyzing time-series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
Feature engineering for time series forecasting
LSTM (Long short-term memory)
Time Series Forecasting With RNN(LSTM)
Time Series Forecasting Using LSTM
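A minimal sketch of next-step forecasting with a Keras LSTM (pip install tensorflow); the sine-wave series, window size, and layer sizes are illustrative:
import numpy as np
from tensorflow import keras

# Toy series: a sine wave, split into sliding windows of 20 steps.
series = np.sin(np.arange(400) * 0.1)
window = 20
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]  # shape: (samples, timesteps, features)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(window, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[-1:]))  # forecast the next value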
Explore and Interpret ML Models
Dalex
Visualizing ML model bias with dalex
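A minimal sketch of model explanation with dalex (pip install dalex); the model and dataset are toy examples:
import dalex as dx
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Permutation-based feature importance for the fitted model.
explainer = dx.Explainer(model, X, y, label="rf")
print(explainer.model_parts().result.head())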
AI Fairness 360
Anchors
Ultimate Guide To Model Explainability
Feast
Feast is a framework for storing and serving features to machine learning models.
Creating feature store with Feast
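A minimal sketch of reading online features with the feast Python package (pip install feast); it assumes an already-initialized repository containing the driver_hourly_stats feature view from the Feast quickstart:
from feast import FeatureStore

# Point at an existing feature repository.
store = FeatureStore(repo_path=".")

# Fetch online features for one entity.
features = store.get_online_features(
    features=["driver_hourly_stats:avg_daily_trips"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(features)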