AIML Monitoring System

When the performance of an AI model degrades, it can cause a serious decline in quality of service.

AI models should therefore be re-trained with data newly collected from the real environment before the degradation has a negative impact.

To decide when to re-train and update a model, a monitoring system is essential in our AIML Framework.

When a network operator wants to analyze AI model performance and request re-training from the ATHP, they need to collect the necessary data and develop analysis modules.

There are many methods for detecting AI model performance degradation, depending on the characteristics of the model.

If a model makes a prediction, we can calculate the accuracy of the inference by comparing it with the real data once the time targeted by the prediction has passed.

If a model decides values for other functions, we can measure the credibility of the model's decisions by tracking KPIs collected from the environment.
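
For the prediction case, a minimal Python sketch of the accuracy check might look like the following. The `Prediction` record, the integer timestamps, and the choice of mean absolute error are assumptions made for illustration; the framework does not prescribe a particular error metric.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    target_time: int   # time the prediction refers to (hypothetical epoch seconds)
    predicted: float   # value the model predicted for target_time


def mean_absolute_error(predictions, observed, now):
    """Compare matured predictions with the values actually observed.

    `observed` maps a timestamp to the value measured at that time.
    Predictions whose target time is still in the future, or for which
    no measurement exists yet, are skipped.
    """
    errors = [
        abs(p.predicted - observed[p.target_time])
        for p in predictions
        if p.target_time <= now and p.target_time in observed
    ]
    if not errors:
        return None  # nothing can be evaluated yet
    return sum(errors) / len(errors)


predictions = [Prediction(100, 10.0), Prediction(200, 12.0), Prediction(300, 15.0)]
observed = {100: 11.0, 200: 9.0}
mae = mean_absolute_error(predictions, observed, now=250)
```

Only the first two predictions have matured at `now=250`, so only they contribute to the error. A growing error over successive evaluations would be one possible signal that re-training is needed.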

In the Near-RT RIC, we can monitor three kinds of metrics to evaluate the model.

First, RAN KPIs stored in the SDL by the KPIMon xApp.

If a model aims to maintain a specific KPI at a high value, the operator needs to monitor the change of the KPI value over time.
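
As a rough illustration of watching a KPI over time, the sketch below raises a flag when the moving average of a KPI falls below a floor. The class name `KpiWatcher`, the window size, and the floor value are all hypothetical; a real deployment would choose them per KPI.

```python
from collections import deque


class KpiWatcher:
    """Flags a KPI whose moving average falls below a floor."""

    def __init__(self, floor, window=5):
        self.floor = floor
        self.samples = deque(maxlen=window)  # keeps only the latest samples

    def add(self, value):
        """Record a sample; return True if the windowed mean is below the floor."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        mean = sum(self.samples) / len(self.samples)
        # Only judge once the window is full, to avoid alerts on start-up.
        return full and mean < self.floor


watcher = KpiWatcher(floor=50.0, window=3)
alerts = [watcher.add(v) for v in [60.0, 58.0, 55.0, 40.0, 30.0]]
```

Using a moving average rather than single samples is a design choice that trades detection speed for robustness against short spikes.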

Second, the model's input/output data and its trend.

Sometimes the KPI deteriorates for reasons other than the AI model.

Therefore, the operator must also monitor the AI model input/output data.

If the trend of the input data differs from the previous trend (or from the trend of the training data), the network environment may have changed.

We can gather these values from both the Assist xApp and the ML xApp, as long as they store the data in InfluxDB.

(We can add more functions to xAppFW to make it easier to store this kind of data.)
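
The comparison between the recent input data and the training data can be sketched in many ways. The Python example below is one simple option, assuming the trend is summarized by the mean: it measures how far the recent mean has moved, in units of the training data's standard deviation. The function name and the sample values are hypothetical.

```python
import statistics


def mean_shift_score(baseline, recent):
    """Return how many baseline standard deviations the recent mean has moved."""
    base_mean = statistics.fmean(baseline)
    base_std = statistics.pstdev(baseline)  # population standard deviation
    if base_std == 0:
        raise ValueError("baseline has no variance")
    return abs(statistics.fmean(recent) - base_mean) / base_std


training_inputs = [8.0, 10.0, 12.0]
recent_inputs = [15.0, 16.0, 17.0]
score = mean_shift_score(training_inputs, recent_inputs)
drifted = score > 3.0  # hypothetical threshold
```

A score near zero suggests the input distribution is centered where it was during training; a large score suggests the environment has shifted. Other summaries, such as variance or skew, could be compared in the same way.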

Lastly, operators might also need to check RIC control messages.

By analyzing not only changes in the environment but also the impact that RIC control messages generated by the AI model have on the network environment, it becomes possible to decide when to update the AI model.

Monitoring RIC control messages and KPI values together can help detect malfunctioning control messages.
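
One way to look at control messages and KPI values together is to compare the KPI shortly before and shortly after each control message. The sketch below assumes the KPI is available as `(timestamp, value)` pairs and that the time at which the control message was issued is known; both the data layout and the function name are assumptions for illustration.

```python
def kpi_delta_around(kpi_series, control_time, span):
    """Mean KPI after a control message minus mean KPI before it.

    "Before" covers [control_time - span, control_time) and
    "after" covers (control_time, control_time + span].
    Returns None if either side has no samples.
    """
    before = [v for t, v in kpi_series if control_time - span <= t < control_time]
    after = [v for t, v in kpi_series if control_time < t <= control_time + span]
    if not before or not after:
        return None
    return sum(after) / len(after) - sum(before) / len(before)


kpi_series = [(1, 50.0), (2, 52.0), (3, 51.0), (4, 40.0), (5, 38.0), (6, 39.0)]
delta = kpi_delta_around(kpi_series, control_time=3, span=3)
```

A consistently negative delta after control messages from the same model would be a hint that its decisions are hurting the KPI. It is not proof, because other causes can change the KPI in the same interval.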

Rapid detection of AI model performance degradation can increase network availability and reduce network operation costs by removing potential risks.

The monitoring system consists of a monitoring server deployed in the central cluster, monitoring agents distributed across the RICs, pluggable performance analysis modules, and additional functions in the dashboard.

Operators can request and retrieve information about ML xApps (and Assist xApps) using the dashboard.

When an operator requests data collection for a specific ML model, the monitoring server asks the monitoring agents to subscribe to the metrics stored in the RIC platform.

This process assumes that the KPIMon xApp stores KPIs in InfluxDB in chronological order.

It also assumes that the Assist xApps store the request and response data exchanged with the ML xApps, as well as the RIC control messages. (Alternatively, the ML xApp itself can store the data in InfluxDB.)

The monitoring agent forwards the stored data to the monitoring server.

If the raw data does not all need to be transferred, statistical values such as mean, std, min, max, and skew are calculated and sent every 5 ms by default. (The time interval and format are flexible.)

Since InfluxDB is the default time-series DB in the RIC, we can use its built-in operations for time-series data. (https://docs.influxdata.com/flux/v0/stdlib/universe/)
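
To make the statistics listed above concrete, the following Python sketch computes them for one window of samples outside the database. It is only an illustration: the function name and the output format are hypothetical, and the skew here is the population (moment-based) skewness, which may differ slightly from the definition the database uses.

```python
import statistics


def summarize(window):
    """Return mean, std, min, max and skew for one window of samples."""
    n = len(window)
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)  # population standard deviation
    if std == 0:
        skew = 0.0  # all samples identical, so there is no asymmetry
    else:
        # Third standardized moment of the window.
        skew = sum(((x - mean) / std) ** 3 for x in window) / n
    return {
        "mean": mean,
        "std": std,
        "min": min(window),
        "max": max(window),
        "skew": skew,
    }


symmetric = summarize([1.0, 2.0, 3.0])
right_tailed = summarize([1.0, 1.0, 1.0, 10.0])
```

A symmetric window gives a skew close to zero, while a window with a few unusually large values gives a positive skew. Sending only this summary instead of every raw sample keeps the traffic between the agents and the server small.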

The monitoring server stores the statistics in the InfluxDB instance in the central cluster.

In the central cluster, several analysis modules can be added to validate model performance.

For example, the trend change detection module can check whether the skew value of the model's input/output data falls outside a certain range.

If the trend of the model's input values changes, we can assume that the characteristics of the environment may have changed.

If the trend of the output values changes, we can also suspect that something is wrong with the model itself.
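
A trend change detection module of this kind could be sketched as below. The class `SkewRangeCheck`, the function `diagnose`, and the use of a single range for both input and output are assumptions for illustration; an actual module would likely configure ranges per model and per feature.

```python
from dataclasses import dataclass


@dataclass
class SkewRangeCheck:
    """Hypothetical analysis module: flags skew values outside [low, high]."""
    low: float
    high: float

    def out_of_range(self, skew):
        return skew < self.low or skew > self.high


def diagnose(check, input_skew, output_skew):
    """Map the two checks to the interpretations described in the text."""
    findings = []
    if check.out_of_range(input_skew):
        findings.append("input trend changed: environment may have changed")
    if check.out_of_range(output_skew):
        findings.append("output trend changed: model may be faulty")
    return findings


check = SkewRangeCheck(low=-1.0, high=1.0)
findings = diagnose(check, input_skew=0.2, output_skew=1.8)
```

In this example only the output skew leaves the allowed range, so the module reports a possible problem with the model rather than with the environment. The findings are hints for the operator, not a definitive diagnosis.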