Ops Metrics
Ops exposes Prometheus metrics for monitoring controller and server components.
Metrics Endpoints
- Controller:
:9090/metrics - Server:
:9090/metrics
Controller Metrics
Resource Info Metrics
These metrics expose resource information during each reconcile.
| Metric | Labels | Description |
|---|---|---|
ops_controller_task_info |
namespace, name, desc, host, runtime_image | Task resource info (static fields only) |
ops_controller_task_status |
namespace, name | Task resource status (dynamic fields) |
ops_controller_pipeline_info |
namespace, name, desc | Pipeline resource info (static fields only) |
ops_controller_pipeline_status |
namespace, name | Pipeline resource status (dynamic fields) |
ops_controller_host_info |
namespace, name, address | Host resource info (static fields only) |
ops_controller_host_status |
namespace, name, hostname, distribution, arch, status | Host resource status (dynamic fields) |
ops_controller_cluster_info |
namespace, name, server | Cluster resource info (static fields only) |
ops_controller_cluster_status |
namespace, name, version, status, node, pod_count, running_pod, cert_not_after_days | Cluster resource status (dynamic fields) |
ops_controller_eventhooks_info |
namespace, name, type, subject, url | EventHooks resource info (static fields only) |
ops_controller_eventhooks_status |
namespace, name, keyword, event_id | EventHooks resource status (dynamic fields, including trigger information) |
TaskRun Metrics
| Metric | Labels | Description |
|---|---|---|
ops_controller_taskrun_info |
namespace, name, taskref, crontab | TaskRun resource info (static fields only) |
ops_controller_taskrun_status |
namespace, name, status | TaskRun resource status (dynamic fields) |
ops_controller_taskrun_start_time |
namespace, name, taskref | TaskRun start time (unix timestamp) |
ops_controller_taskrun_end_time |
namespace, name, taskref | TaskRun end time (unix timestamp) |
ops_controller_taskrun_duration_seconds |
namespace, name, taskref, status | TaskRun duration in seconds |
PipelineRun Metrics
| Metric | Labels | Description |
|---|---|---|
ops_controller_pipelinerun_info |
namespace, name, pipelineref, crontab | PipelineRun resource info (static fields only) |
ops_controller_pipelinerun_status |
namespace, name, status | PipelineRun resource status (dynamic fields) |
ops_controller_pipelinerun_start_time |
namespace, name, pipelineref | PipelineRun start time (unix timestamp) |
ops_controller_pipelinerun_end_time |
namespace, name, pipelineref | PipelineRun end time (unix timestamp) |
ops_controller_pipelinerun_duration_seconds |
namespace, name, pipelineref, status | PipelineRun duration in seconds |
Run Count Metrics
Run counts can be calculated from _info and _status metrics:
- TaskRun total by taskref and status:
count by (taskref, status) (ops_controller_taskrun_info{namespace="$namespace"} == 1) * on(namespace, name) group_left(status) ops_controller_taskrun_status{namespace="$namespace"} - PipelineRun total by pipelineref and status:
count by (pipelineref, status) (ops_controller_pipelinerun_info{namespace="$namespace"} == 1) * on(namespace, name) group_left(status) ops_controller_pipelinerun_status{namespace="$namespace"}
EventHooks Metrics
EventHooks trigger information is recorded in ops_controller_eventhooks_status metric with keyword and event_id labels.
Reconcile Metrics
| Metric | Labels | Description |
|---|---|---|
ops_controller_reconcile_total |
controller, namespace, result | Total number of reconcile operations |
ops_controller_reconcile_errors_total |
controller, namespace, error_type | Total number of reconcile errors |
Controller Resource Metrics
| Metric | Labels | Description |
|---|---|---|
ops_controller_resource_goroutines |
pod | Controller number of goroutines |
ops_controller_resource_cpu_usage_seconds_total |
pod | Controller CPU usage in seconds (cumulative, read from cgroup) |
ops_controller_resource_memory_usage_bytes |
pod | Controller memory usage in bytes (read from cgroup) |
ops_controller_uptime_seconds |
pod | Controller uptime in seconds |
ops_controller_info |
pod, version, build_date | Controller information |
Server Metrics
Resource Metrics
| Metric | Labels | Description |
|---|---|---|
ops_server_resource_goroutines |
pod | Server number of goroutines |
ops_server_resource_cpu_usage_seconds_total |
pod | Server CPU usage in seconds (cumulative, read from cgroup) |
ops_server_resource_memory_usage_bytes |
pod | Server memory usage in bytes (read from cgroup) |
Throughput Metrics
| Metric | Labels | Description |
|---|---|---|
ops_server_throughput_http_requests_total |
method, path, status_code | Total number of HTTP requests |
ops_server_throughput_api_requests_total |
endpoint, namespace, status | Total number of API requests |
ops_server_throughput_api_errors_total |
endpoint, namespace, error_type | Total number of API errors |
Server Info Metrics
| Metric | Labels | Description |
|---|---|---|
ops_server_info |
pod, version, build_date | Server information |
ops_server_uptime_seconds |
pod | Server uptime in seconds |
Example Queries
Get all running TaskRuns
ops_controller_taskrun_info{status="Running"}
Get Task run count by status
count by (taskref, status) (
ops_controller_taskrun_info{namespace="$namespace"} == 1
* on(namespace, name) group_left(status)
ops_controller_taskrun_status{namespace="$namespace"}
)
Get Host list with status
ops_controller_host_info
Get EventHooks triggers by keyword
count by (name, keyword) (ops_controller_eventhooks_status{namespace="$namespace",keyword!=""})
Get TaskRun duration
ops_controller_taskrun_duration_seconds
Get Controller CPU usage rate
rate(ops_controller_resource_cpu_usage_seconds_total{pod="xxx"}[5m])
Get Controller memory usage
ops_controller_resource_memory_usage_bytes{pod="xxx"}
Get Server CPU usage rate
rate(ops_server_resource_cpu_usage_seconds_total{pod="xxx"}[5m])
Get Server memory usage
ops_server_resource_memory_usage_bytes{pod="xxx"}