Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Monitoring

The Node Readiness Controller exposes Prometheus-compatible metrics. This page documents the metrics currently registered by the controller and how they can be used for monitoring rule evaluation, taint operations, failures, bootstrap progress, and rule health.

Metrics Endpoint

The controller serves metrics on /metrics only when metrics are explicitly enabled.

Depending on the installation, the endpoint is exposed as:

  • HTTP on port 8080 when the standard Prometheus component is enabled.
  • HTTPS on port 8443 when the Prometheus TLS component is enabled.

See Installation for deployment details.

Metric Lifecycle Management

When a NodeReadinessRule is deleted, the controller automatically cleans up the associated rule-labeled Prometheus series. This prevents stale metrics from remaining visible in dashboards and alerts.

Metrics cleaned up on rule deletion:

  • node_readiness_taint_operations_total{rule="..."}
  • node_readiness_evaluation_duration_seconds{rule="..."}
  • node_readiness_failures_total{rule="..."}
  • node_readiness_bootstrap_completed_total{rule="..."}
  • node_readiness_reconciliation_latency_seconds{rule="..."}
  • node_readiness_bootstrap_duration_seconds{rule="..."}
  • node_readiness_nodes_by_state{rule="..."}
  • node_readiness_rule_last_reconciliation_timestamp_seconds{rule="..."}

This ensures that:

  • Deleted rules do not continue to appear in dashboards with stale values.
  • Memory usage does not grow unbounded from removed rules.
  • Metric cardinality remains highly accurate over time.

Note: The global node_readiness_rules_total gauge is updated separately. Rule-labeled metrics are explicitly deleted during rule cleanup.


Core Metrics

node_readiness_rules_total

Number of NodeReadinessRule objects currently tracked by the controller.

PropertyValue
Typegauge
Labelsnone
Recorded whenThe controller refreshes or removes a tracked rule

node_readiness_taint_operations_total

Total number of taint operations performed by the controller.

PropertyValue
Typecounter
Labelsrule, operation (add, remove)
Recorded whenThe controller successfully adds or removes a taint

node_readiness_evaluation_duration_seconds

Duration of the controller’s internal rule evaluations.

PropertyValue
Typehistogram
Labelsrule
BucketsPrometheus default histogram buckets
Recorded whenThe controller evaluates a rule against a node

node_readiness_failures_total

Total number of failure events recorded by the controller.

PropertyValue
Typecounter
Labelsrule, reason (EvaluationError, AddTaintError, RemoveTaintError)
Recorded whenThe controller encounters an error evaluating or patching a node

node_readiness_bootstrap_completed_total

Total number of nodes that have completed bootstrap.

PropertyValue
Typecounter
Labelsrule
Recorded whenThe controller marks bootstrap as completed for a node under a bootstrap-only rule

Extended Health and SLI Metrics

node_readiness_reconciliation_latency_seconds

End-to-end latency from node condition change to taint operation completion.

PropertyValue
Typehistogram
Labelsrule, operation (add_taint, remove_taint)
Buckets0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300 seconds
Recorded whenA taint operation completes

Use case: Measure how quickly the controller responds to node condition changes in the cluster.

node_readiness_bootstrap_duration_seconds

Time from node creation to bootstrap completion for bootstrap-only rules.

PropertyValue
Typehistogram
Labelsrule
Buckets1, 5, 10, 30, 60, 120, 300, 600, 1200 seconds
Recorded whenBootstrap completion is observed for a node under a bootstrap-only rule

Use case: Measure the actual time nodes take to become fully provisioned and bootstrap-complete.

node_readiness_nodes_by_state

Number of nodes in each readiness state per rule.

PropertyValue
Typegauge
Labelsrule, state (ready, not_ready, bootstrapping)
Recorded whenA rule reconciliation completes

Use case: Track aggregate node health without introducing per-node metric cardinality, keeping controller memory footprint lean.

node_readiness_rule_last_reconciliation_timestamp_seconds

Unix timestamp of the last reconciliation for a rule.

PropertyValue
Typegauge
Labelsrule
Recorded whenA rule reconciliation loop successfully completes

Use case: Detect rules that may be stuck or not reconciling frequently enough.


Example Queries & SLOs

Latency Monitoring & SLOs

Objective: 95% of internal evaluations complete within 50 milliseconds (0.05s).

# Percentage of evaluations completing within 50ms
sum(rate(node_readiness_evaluation_duration_seconds_bucket{le="0.05"}[5m])) /
sum(rate(node_readiness_evaluation_duration_seconds_count[5m])) * 100
# P95 End-to-End Reconciliation Latency across all rules
histogram_quantile(0.95,
  sum by (le) (
    rate(node_readiness_reconciliation_latency_seconds_bucket[5m])
  )
)

Freshness Monitoring & SLOs

Objective: All rules reconcile within the last 2 minutes.

# Alert if any rule has not reconciled in the last 120 seconds
(time() - node_readiness_rule_last_reconciliation_timestamp_seconds) > 120

Availability Monitoring & SLOs

Objective: 99.9% of targeted nodes are ready.

# Percentage of ready nodes globally
100 * sum(node_readiness_nodes_by_state{state="ready"}) / sum(node_readiness_nodes_by_state)

# Percentage of ready nodes per rule
100 * node_readiness_nodes_by_state{state="ready"} / sum by (rule) (node_readiness_nodes_by_state)

Monitoring and Scale Testing

For an end-to-end monitoring setup with Prometheus and Grafana during scale tests, see the scale testing guide.

Alerting Recommendations

Typical alerts to consider:

  • High latency: P95 reconciliation latency above 10s for 5 minutes.
  • Stale reconciliations: Any rule with no reconciliation for more than 5 minutes.
  • High failure rate: Sustained increase in node_readiness_failures_total.
  • Low availability: Ready-node percentage below your target threshold for a sustained period.