Ops Metrics
Ops exposes Prometheus metrics for monitoring its controller and server components.
Metrics Endpoints
- Controller: :9090/metrics
- Server: :9090/metrics
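As a quick way to inspect what an endpoint returns, the following Python sketch parses the Prometheus text exposition format into `(name, labels, value)` tuples. The helper names `parse_metrics` and `fetch_metrics` are hypothetical, the host part of the URL is an assumption (the source gives only the port and path), and the parser is deliberately minimal: it does not unescape label values.

```python
import re
from urllib.request import urlopen

# One sample per line: metric_name{label="value",...} value [timestamp]
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(base_url):
    """Fetch and parse <base_url>/metrics (base_url is an assumption)."""
    with urlopen(base_url.rstrip("/") + "/metrics", timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `fetch_metrics("http://localhost:9090")` would return the parsed samples if the controller or server is reachable at that (assumed) address.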
Controller Metrics
Resource Info Metrics
These metrics expose information about each resource and are updated during each reconcile.
| Metric | Labels | Description |
|---|---|---|
| ops_controller_task_info | namespace, name, desc, host, runtime_image | Task resource info |
| ops_controller_pipeline_info | namespace, name, desc | Pipeline resource info |
| ops_controller_host_info | namespace, name, address, hostname, distribution, arch, cpu_total, mem_total, disk_total, accelerator_vendor, accelerator_model, accelerator_count, heart_status | Host resource info |
| ops_controller_cluster_info | namespace, name, server, version, node, pod, running_pod, heart_status | Cluster resource info |
| ops_controller_eventhooks_info | namespace, name, type, subject, url | EventHooks resource info |
TaskRun Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_taskrun_info | namespace, name, taskref, crontab, status | TaskRun resource info and status |
| ops_controller_taskrun_start_time | namespace, name, taskref | TaskRun start time (unix timestamp) |
| ops_controller_taskrun_end_time | namespace, name, taskref | TaskRun end time (unix timestamp) |
| ops_controller_taskrun_duration_seconds | namespace, name, taskref, status | TaskRun duration in seconds |
PipelineRun Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_pipelinerun_info | namespace, name, pipelineref, crontab, status | PipelineRun resource info and status |
| ops_controller_pipelinerun_start_time | namespace, name, pipelineref | PipelineRun start time (unix timestamp) |
| ops_controller_pipelinerun_end_time | namespace, name, pipelineref | PipelineRun end time (unix timestamp) |
| ops_controller_pipelinerun_duration_seconds | namespace, name, pipelineref, status | PipelineRun duration in seconds |
Run Count Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_taskref_run_total | namespace, taskref, status | Total number of TaskRef runs |
| ops_controller_pipelineref_run_total | namespace, pipelineref, status | Total number of PipelineRef runs |
EventHooks Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_eventhooks_trigger_total | namespace, eventhook_name, keyword, event_id, status | EventHooks trigger count with matched keyword and event ID |
Reconcile Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_reconcile_total | controller, namespace, result | Total number of reconcile operations |
| ops_controller_reconcile_errors_total | controller, namespace, error_type | Total number of reconcile errors |
Controller Resource Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_controller_resource_goroutines | pod | Number of controller goroutines |
| ops_controller_resource_cpu_usage_seconds_total | pod | Controller CPU usage in seconds (cumulative, read from cgroup) |
| ops_controller_resource_memory_usage_bytes | pod | Controller memory usage in bytes (read from cgroup) |
| ops_controller_uptime_seconds | pod | Controller uptime in seconds |
| ops_controller_info | pod, version, build_date | Controller information |
Server Metrics
Resource Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_server_resource_goroutines | pod | Number of server goroutines |
| ops_server_resource_cpu_usage_seconds_total | pod | Server CPU usage in seconds (cumulative, read from cgroup) |
| ops_server_resource_memory_usage_bytes | pod | Server memory usage in bytes (read from cgroup) |
Throughput Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_server_throughput_http_requests_total | method, path, status_code | Total number of HTTP requests |
| ops_server_throughput_api_requests_total | endpoint, namespace, status | Total number of API requests |
| ops_server_throughput_api_errors_total | endpoint, namespace, error_type | Total number of API errors |
Server Info Metrics
| Metric | Labels | Description |
|---|---|---|
| ops_server_info | pod, version, build_date | Server information |
| ops_server_uptime_seconds | pod | Server uptime in seconds |
Example Queries
Get all running TaskRuns
ops_controller_taskrun_info{status="Running"}
Get Task run count by status
sum by (taskref, status) (ops_controller_taskref_run_total)
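To make the aggregation concrete, the Python sketch below mirrors what `sum by (taskref, status)` computes over a set of samples. Prometheus performs this server-side; the helper name `sum_by` is hypothetical, and the sample labels and values (including the status strings) are invented purely for illustration.

```python
from collections import defaultdict


def sum_by(samples, keys):
    """Sum sample values grouped by the given label names.

    `samples` is an iterable of (labels_dict, value) pairs. A label that
    is missing from a sample is treated as the empty string, so such
    samples fall into the same group.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        group = tuple(labels.get(key, "") for key in keys)
        totals[group] += value
    return dict(totals)


# Invented samples: the same taskref run from two namespaces.
samples = [
    ({"namespace": "ns-a", "taskref": "backup", "status": "success"}, 3),
    ({"namespace": "ns-b", "taskref": "backup", "status": "success"}, 4),
    ({"namespace": "ns-a", "taskref": "backup", "status": "failed"}, 1),
]

# The namespace label is dropped; counts are merged per (taskref, status).
totals = sum_by(samples, ["taskref", "status"])
```

Grouping by fewer labels merges more series, which is why the query above reports one total per task and status regardless of namespace.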
Get Host list with status
ops_controller_host_info
Get EventHooks triggers by keyword
sum by (eventhook_name, keyword) (ops_controller_eventhooks_trigger_total)
Get TaskRun duration
ops_controller_taskrun_duration_seconds
Get Controller CPU usage rate
rate(ops_controller_resource_cpu_usage_seconds_total{pod="xxx"}[5m])
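Because the CPU metric is a cumulative counter, its raw value is rarely useful on its own; `rate()` turns it into CPU seconds consumed per second, which reads as the average number of cores in use. The Python sketch below shows the idea in simplified form. The function name `simple_rate` is hypothetical, and unlike PromQL's `rate()` it does not extrapolate to the window boundaries.

```python
def simple_rate(samples):
    """Approximate the per-second rate of a counter over a window.

    `samples` is a list of (unix_timestamp, counter_value) pairs sorted
    by time. A drop in value is treated as a counter reset (for example
    after a pod restart), and the post-reset value counts as the increase.
    Returns None when fewer than two samples are available.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed
```

With samples of 10, 16, and 22 CPU seconds taken 60 seconds apart, the increase is 12 seconds over 120 seconds, so the result is 0.1, roughly a tenth of one core.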
Get Controller memory usage
ops_controller_resource_memory_usage_bytes{pod="xxx"}
Get Server CPU usage rate
rate(ops_server_resource_cpu_usage_seconds_total{pod="xxx"}[5m])
Get Server memory usage
ops_server_resource_memory_usage_bytes{pod="xxx"}
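The queries above can also be run programmatically through the Prometheus HTTP API, which evaluates an instant query at `/api/v1/query`. The Python sketch below only builds the request URL, so that braces, quotes, and brackets in the PromQL expression are encoded correctly. The helper name `instant_query_url` and the Prometheus address are assumptions, not part of Ops.

```python
from urllib.parse import urlencode


def instant_query_url(prometheus_url, promql):
    """Build the URL for a Prometheus instant query.

    `prometheus_url` is the base address of a Prometheus server that
    scrapes the Ops endpoints (an assumption about your deployment).
    """
    return prometheus_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})


# Hypothetical Prometheus address; substitute your own.
url = instant_query_url(
    "http://prometheus.example:9090",
    'ops_controller_taskrun_info{status="Running"}',
)
```

The resulting URL can be fetched with any HTTP client; Prometheus responds with JSON whose `data.result` field holds the matching series.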