Alert policy templates
Each alert policy is based on one of the following templates, which defines the condition that triggers the alert. The conditions are written as Prometheus expressions. Each entry lists the template name, the alert message, and the expression.
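The alert messages on this page contain placeholders such as $universe_name and $value that are replaced when an alert fires. As a rough illustration only (this page does not describe how YugabyteDB Anywhere performs the substitution internally), the following Python sketch fills one message using the standard library's string.Template. The function name and the sample universe name and value are hypothetical.

```python
from string import Template

# Message text taken from the "DB CQLSH connection" template on this page.
MESSAGE = Template(
    "CQLSH connection failure has been detected for universe "
    "'$universe_name' on $value T-Server instances."
)


def render_message(universe_name: str, value: int) -> str:
    """Fill the placeholders with sample data (hypothetical helper)."""
    return MESSAGE.substitute(universe_name=universe_name, value=value)


print(render_message("demo-universe", 2))
```

Here $value stands for the number the Prometheus expression evaluates to, which is why most expressions below are written to return a count or a measurement.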
DB CQLSH connection
CQLSH connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_cqlsh_connectivity_error{universe_uuid="$uuid"} > 0
DB compaction overload
Database compaction rejections detected for universe '$universe_name'.
sum by (node_prefix) (increase(majority_sst_files_rejections{node_prefix="$node_prefix"}[10m])) > 0
DB write/read test error
Test YSQL write/read operation failed on $value nodes for universe '$universe_name'.
count by (node_prefix) (yb_node_ysql_write_read{node_prefix="$node_prefix"} < 1)
DB version mismatch
Version mismatch has been detected for universe '$universe_name' for $value Master or T-Server instances.
ybp_health_check_tserver_version_mismatch{universe_uuid="$uuid"} + ybp_health_check_master_version_mismatch{universe_uuid="$uuid"} > 0
Client to node cert expiry
Client to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_cert_validity_days{universe_uuid="$uuid"} < 30)
Health check notification error
Failed to issue health check notification for universe '$universe_name'. You need to check Health notification settings and YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_notification_status{universe_uuid = "$uuid"}[1d]) < 1
Health check error
Failed to perform health check for universe '$universe_name'. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_health_check_status{universe_uuid = "$uuid"}[1d]) < 1
Encryption at rest config expiry
Encryption at rest configuration for universe '$universe_name' expires in $value days.
ybp_universe_encryption_key_expiry_days{universe_uuid="$uuid"} < 3
DB Redis connection
Redis connection failure has been detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_redis_connectivity_error{universe_uuid="$uuid"} > 0
Backup schedule failure
Last attempt to run a scheduled backup for universe '$universe_name' failed due to other backup or universe operation in progress.
last_over_time(ybp_schedule_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Node to node cert expiry
Node to node certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_cert_validity_days{universe_uuid="$uuid"} < 30)
DB core files
Core files detected for universe '$universe_name' on $value T-Server instances.
ybp_health_check_tserver_core_files{universe_uuid="$uuid"} > 0
Alert notification failed
Last attempt to send alert notifications for customer 'yugabyte support' failed. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_manager_status{customer_uuid = "$uuid"}[1d]) < 1
Backup failure
Last backup task for universe '$universe_name' failed. You need to check the backup task result for details.
last_over_time(ybp_create_backup_status{universe_uuid = "$uuid"}[1d]) < 1
Under-replicated tablets
$value tablets remain under-replicated for more than 5 minutes in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_underreplicated_tablet{node_prefix="$node_prefix"}[5m])) > 0)
DB memory overload
Database memory rejections have been detected for universe '$universe_name'.
sum by (node_prefix) (increase(leader_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(follower_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(operation_memory_pressure_rejections{node_prefix="$node_prefix"}[10m])) > 0
Client to node CA cert expiry
Client to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_c2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
DB queues overflow
Database queues overflow has been detected for universe '$universe_name'.
sum by (node_prefix) (increase(rpcs_queue_overflow{node_prefix="$node_prefix"}[10m])) + sum by (node_prefix) (increase(rpcs_timed_out_in_queue{node_prefix="$node_prefix"}[10m])) > 1
Alert rules sync failed
Last alert rules synchronization for customer 'yugabyte support' has failed. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_config_writer_status[1d]) < 1
Alert query failed
Last alert query for customer 'yugabyte support' failed. You need to check YugabyteDB Anywhere logs for details or contact Yugabyte Support.
last_over_time(ybp_alert_query_status[1d]) < 1
DB fatal logs
Fatal logs have been detected for universe '$universe_name' on $value Master or T-Server instances.
sum by (universe_uuid) (ybp_health_check_node_master_fatal_logs{universe_uuid="$uuid"} < bool 1) + sum by (universe_uuid) (ybp_health_check_node_tserver_fatal_logs{universe_uuid="$uuid"} < bool 1) > 0
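The expression above relies on the PromQL bool modifier: with bool, a comparison returns 1 when it holds and 0 when it does not, instead of filtering series out. Summing those results therefore counts the instances whose metric is below 1. The Python sketch below mimics that counting under simplified assumptions; the function names and sample values are hypothetical, and it treats a missing metric as contributing 0, whereas PromQL would return no result if one side of the addition had no series.

```python
from typing import Mapping


def count_below_one(samples: Mapping[str, float]) -> int:
    """Mimic sum(metric < bool 1): 1 per series whose value is below 1."""
    return sum(1 if value < 1 else 0 for value in samples.values())


def fatal_logs_alert_value(
    master: Mapping[str, float], tserver: Mapping[str, float]
) -> int:
    """Add the Master and T-Server counts, as the expression does."""
    return count_below_one(master) + count_below_one(tserver)


# Hypothetical per-node samples for the two health check metrics.
master_samples = {"node-1": 1, "node-2": 0}
tserver_samples = {"node-1": 1, "node-2": 1}

value = fatal_logs_alert_value(master_samples, tserver_samples)
print(value, value > 0)  # the alert condition is "> 0"
```

The resulting count is what the message reports as $value.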
Master leader missing
Master leader is missing for universe '$universe_name'.
max by (node_prefix) (yb_node_is_master_leader{node_prefix="$node_prefix"}) < 1
DB node restart
Universe '$universe_name' database node has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(node_boot_time{node_prefix="$node_prefix"}[30m])) > 0
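The restart count in the expression above comes from the PromQL changes() function, which returns how many times a series' value changed within the range. Because a node's boot time only changes when the node reboots, each change corresponds to one restart. The following Python sketch reproduces that idea on made-up samples; the helper names and the data are hypothetical.

```python
from typing import Mapping, Sequence


def changes(samples: Sequence[float]) -> int:
    """Count how often consecutive samples differ, like PromQL changes()."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)


def max_restarts(boot_times_by_node: Mapping[str, Sequence[float]]) -> int:
    """Mimic max by (node_prefix) (changes(...)) for one universe."""
    return max(changes(series) for series in boot_times_by_node.values())


# Hypothetical boot-time samples (Unix seconds) scraped over 30 minutes.
boot_times = {
    "node-1": [1700000000, 1700000000, 1700000900, 1700000900, 1700001500],
    "node-2": [1699990000, 1699990000, 1699990000, 1699990000, 1699990000],
}

restarts = max_restarts(boot_times)
print(restarts, restarts > 0)  # the alert condition is "> 0"
```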
Node to node CA cert expiry
Node to node CA certificate for universe '$universe_name' expires in $value days.
min by (node_name) (ybp_health_check_n2n_ca_cert_validity_days{universe_uuid="$uuid"} < 30)
DB instance restart
Universe '$universe_name' Master or T-Server has restarted $value times during last 30 minutes.
max by (node_prefix) (changes(yb_node_boot_time{node_prefix="$node_prefix"}[30m]) and on (node_prefix) (max_over_time(ybp_universe_update_in_progress{node_prefix="$node_prefix"}[31m]) == 0)) > 0
DB node OOM
More than one out of memory (OOM) kills have been detected for universe '$universe_name' on $value nodes.
count by (node_prefix) (yb_node_oom_kills_10min{node_prefix="$node_prefix"} > 1) > 0
DB node down
$value database nodes are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (max_over_time(up{export_type="node_export",node_prefix="$node_prefix"}[15m]) < 1) > 0
DB node file descriptors usage
Node file descriptors usage for universe '$universe_name' is above 70% on $value nodes.
count by (universe_uuid) (ybp_health_check_used_fd_pct{universe_uuid="$uuid"} > 70)
Alert channel failed
Last attempt to send alert notifications to channel '{{ $labels.source_name }}' has failed. You need to try sending a test alert to obtain details.
last_over_time(ybp_alert_manager_channel_status{customer_uuid = "$uuid"}[1d]) < 1
DB node CPU usage
Average node CPU usage for universe '$universe_name' is more than 90% on $value nodes.
count by(node_prefix) ((100 - (avg by (node_prefix, instance) (avg_over_time(irate(node_cpu_seconds_total{job="node",mode="idle", node_prefix="$node_prefix"}[1m])[30m:])) * 100)) > 90)
DB instance down
$value database Master or T-Server instances are down for more than 15 minutes for universe '$universe_name'.
count by (node_prefix) (label_replace(max_over_time(up{export_type=~"master_export|tserver_export",node_prefix="$node_prefix"}[15m]), "exported_instance", "$1", "instance", "(.*)") < 1 and on (node_prefix, export_type, exported_instance) (min_over_time(ybp_universe_node_function{node_prefix="$node_prefix"}[15m]) == 1)) > 0
Inactive cronjob nodes
$value nodes have inactive cronjob for universe '$universe_name'.
ybp_universe_inactive_cron_nodes{universe_uuid = "$uuid"} > 0
DB node disk usage
Node disk usage for universe '$universe_name' is more than 70% on $value nodes.
count by (node_prefix) (100 - (sum without (saved_name) (node_filesystem_free_bytes{mountpoint=~"/mnt/.*", node_prefix="$node_prefix"}) / sum without (saved_name) (node_filesystem_size_bytes{mountpoint=~"/mnt/.*", node_prefix="$node_prefix"}) * 100) > 70)
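The disk usage expression above computes the used percentage as 100 minus free bytes divided by total bytes times 100, then counts the results above 70. The same arithmetic can be sketched in Python as follows; the function names and byte values are hypothetical, and each entry stands for one filesystem series mounted under /mnt/.

```python
from typing import Iterable, Tuple

THRESHOLD_PCT = 70  # threshold used by the expression above


def used_pct(free_bytes: float, size_bytes: float) -> float:
    """Used space in percent: 100 - free / size * 100."""
    return 100 - free_bytes / size_bytes * 100


def count_over_threshold(filesystems: Iterable[Tuple[float, float]]) -> int:
    """Count (free, size) pairs whose usage exceeds the threshold."""
    return sum(1 for free, size in filesystems if used_pct(free, size) > THRESHOLD_PCT)


GIB = 1024 ** 3
# Hypothetical (free_bytes, size_bytes) pairs, one per filesystem series.
sample = [(40 * GIB, 100 * GIB), (25 * GIB, 100 * GIB), (10 * GIB, 200 * GIB)]

print([used_pct(free, size) for free, size in sample])
print(count_over_threshold(sample))
```

In this sample the first filesystem is 60% full and stays below the threshold, while the other two (75% and 95%) are counted.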
Clock skew
Maximum clock skew for universe '$universe_name' is more than 500 milliseconds. The current value is $value milliseconds.
max by (node_prefix) (max_over_time(hybrid_clock_skew{node_prefix="$node_prefix"}[10m])) / 1000 > 500
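The clock skew expression above divides the metric by 1000 before comparing it with 500, while the message reports milliseconds. This suggests, although the page does not state it, that the raw hybrid_clock_skew samples are in microseconds. Under that assumption, the arithmetic can be sketched in Python as follows; the function name and sample values are hypothetical.

```python
from typing import Mapping, Sequence

THRESHOLD_MS = 500  # threshold used by the expression above


def max_skew_ms(samples_by_node: Mapping[str, Sequence[float]]) -> float:
    """Largest skew over all nodes and samples, converted to milliseconds.

    Assumes the raw samples are in microseconds (not confirmed by this page).
    """
    max_us = max(max(samples) for samples in samples_by_node.values())
    return max_us / 1000


# Hypothetical skew samples (microseconds) collected over 10 minutes.
skew_samples = {
    "node-1": [120_000, 640_000, 90_000],
    "node-2": [80_000, 95_000],
}

skew = max_skew_ms(skew_samples)
print(skew, skew > THRESHOLD_MS)
```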
Leaderless tablets
The tablet leader is missing for more than 5 minutes for $value tablets in universe '$universe_name'.
max by (node_prefix) (count by (node_prefix, exported_instance) (max_over_time(yb_node_leaderless_tablet{node_prefix="$node_prefix"}[5m])) > 0)