Alert policy templates
Alert policies use the following templates to define how the alert is triggered. The alert templates have been created using Prometheus expressions.
DB CQLSH connection
CQLSH connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_cqlsh_connectivity_error{universe_uuid="$uuid"} > 0
DB compaction overload
Database compaction rejections detected for universe '$universe_name'.
sum by (node_prefix) (increase(majority_sst_files_rejections{node_prefix="$node_prefix"}[10m])) > 0
DB write/read test error
Test YSQL write/read operation failed on $value nodes for universe '$universe_name'.
count by (node_prefix) (yb_node_ysql_write_read{node_prefix="$node_prefix"} < 1)
DB version mismatch
Version mismatch has been detected for universe '$universe_name' for $value Master or T-Server instances.
ybp_health_check_tserver_version_mismatch{universe_uuid="$uuid"} + ybp_health_check_master_version_mismatch{universe_uuid="$uuid"} > 0
Client to node cert expiry
Client to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Health check notification error
Failed to issue health check notification for universe '$universe_name'. You need to check Health notification settings and YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_notification_status{universe_uuid = "$uuid"}[1d]) < 1
Health check error
Failed to perform health check for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_status{universe_uuid = "$uuid"}[1d]) < 1
Encryption at rest config expiry
Encryption at rest configuration for universe '$universe_name' expires in $value days.
ybp_universe_encryption_key_expiry_days{universe_uuid="$uuid"} < 3
DB Redis connection
Redis connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_redis_connectivity_error{universe_uuid="$uuid"} > 0
Backup schedule failure
Last attempt to run a scheduled backup for universe '$universe_name' failed due to other backup or universe operation in progress.
last_over_time(ybp_schedule_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Node to node cert expiry
Node to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_cert_validity_days{universe_uuid="$uuid"} < 30)
DB core files
Core files detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_tserver_core_files{universe_uuid="$uuid"} > 0
Alert notification failed
Last attempt to send alert notifications for customer 'yugabyte support' failed. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_manager_status{customer_uuid = "$uuid"}[1d]) < 1
Backup failure
Last backup task for universe '$universe_name' failed. You need to check the backup task result for details.
last_over_time(ybp_create_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Under-replicated tablets
$value tablets remain under-replicated for more than 5 minutes in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_underreplicated_tablet{node_prefix="$node_prefix"}[5m])) > 0)
DB memory overload
Database memory rejections have been detected for universe '$universe_name'.
sum by (node_prefix) (increase(leader_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(follower_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(operation_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) > 0
Client to node CA cert expiry
Client to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
DB queues overflow
Database queues overflow has been detected for universe '$universe_name'.
sum by (node_prefix) (increase(rpcs_queue_overflow{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(rpcs_timed_out_in_queue{node_prefix="$node_prefix"}[10m])) > 1
Alert rules sync failed
Last alert rules synchronization for customer 'yugabyte support' has failed. YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_config_writer_status[1d]) < 1
Alert query failed
Last alert query for customer 'yugabyte support' failed. YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_query_status[1d]) < 1
DB fatal logs
Fatal logs have been detected for universe '$universe_name' on $value Master or T-Server instances.
sum by (universe_uuid) (ybp_health_check_node_master_fatal_logs{universe_uuid="$uuid"} < bool 1) + sum by (universe_uuid) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="$uuid"} < bool 1) > 0
Master leader missing
Master leader is missing for universe '$universe_name'.
max by (node_prefix) (yb_node_is_master_leader{node_prefix="$node_prefix"}) < 1
DB node restart
Universe '$universe_name' database node has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(node_boot_time{node_prefix="$node_prefix"}[30m])) > 0
Node to node CA cert expiry
Node to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
DB instance restart
Universe '$universe_name' Master or T-Server has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(yb_node_boot_time{node_prefix="$node_prefix"}[30m]) and on (node_prefix) (max_over_time(ybp_universe_update_in_progress{node_prefix="$node_prefix"}[31m]) == 0)) > 0
DB node OOM
More than one out of memory (OOM) kills have been detected for universe '$universe_name' on $value nodes.
count by (node_prefix) (yb_node_oom_kills_10min{node_prefix="$node_prefix"} > 1) > 0
DB node down
$value database nodes are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (max_over_time(up{export_type="node_export",node_prefix="$node_prefix"}[15m]) < 1) > 0
DB node file descriptors usage
Node file descriptors usage for universe '$universe_name' is above 70% on $value nodes.
count by (universe_uuid) (ybp_health_check_used_fd_pct{universe_uuid="$uuid"} > 70)
Alert channel failed
Last attempt to send alert notifications to channel '{{ $labels.source_name }}' has failed. You need to try sending a test alert to obtain details.
last_over_time(ybp_alert_manager_channel_status{customer_uuid = "$uuid"}[1d]) < 1
DB node CPU usage
Average node CPU usage for universe '$universe_name' is more than 90% on $value nodes.
count by(node_prefix) ((100 - (avg by (node_prefix, instance) (avg_over_time(irate(node_cpu_seconds_total{job="node",mode="idle", node_prefix="$node_prefix"}[1m])[30m:])) * 100)) > 90)
DB instance down
$value database Master or T-Server instances are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (label_replace(max_over_time(up{export_type=~"master_export|tserver_export",node_prefix="$node_prefix"}[15m]), "exported_instance", "$1", "instance", "(.*)") < 1 and on (node_prefix, export_type, exported_instance) (min_over_time(ybp_universe_node_function{node_prefix="$node_prefix"}[15m]) == 1)) > 0
Inactive cronjob nodes
$value nodes have inactive cronjob for universe '$universe_name'.
ybp_universe_inactive_cron_nodes{universe_uuid = "$uuid"} > 0
DB node disk usage
Node disk usage for universe '$universe_name' is more than 70% on $value nodes.
count by (node_prefix) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"/mnt/.*", node_prefix="$node_prefix"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"/mnt/.*", node_prefix="$node_prefix"}) * 100) > 70)
Clock skew
Maximum clock skew for universe '$universe_name' is more than 500 milliseconds. The current value is $value milliseconds.
max by (node_prefix) (max_over_time(hybrid_clock_skew{node_prefix="$node_prefix"}[10m])) / 1000 > 500
Leaderless tablets
The tablet leader is missing for more than 5 minutes for $value tablets in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_leaderless_tablet{node_prefix="$node_prefix"}[5m])) > 0)