Monitor and observe
High-availability Namespaces are in Public Preview for Temporal Cloud.
How do you trigger failovers and observe Workflow Executions? This section provides how-to instructions for the following operations tasks: tracking replication lag metrics, monitoring and observability, and auditing operational events.
Metrics
Replication lag is the delay in transmitting Workflow updates and history events from the active region to the standby region. A forced failover during a period of large replication lag is more likely to roll back Workflow progress, so always check the replication lag metrics before initiating a failover. Temporal Cloud emits three replication lag-specific metrics: temporal_cloud_v0_replication_lag_bucket, temporal_cloud_v0_replication_lag_sum, and temporal_cloud_v0_replication_lag_count. The following sample queries demonstrate how you can use these metrics to explore replication lag.
P99 replication lag histogram
histogram_quantile(0.99, sum(rate(temporal_cloud_v0_replication_lag_bucket[$__rate_interval])) by (temporal_namespace, le))
Average replication lag
sum(rate(temporal_cloud_v0_replication_lag_sum[$__rate_interval])) by (temporal_namespace)
/
sum(rate(temporal_cloud_v0_replication_lag_count[$__rate_interval])) by (temporal_namespace)
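The following Python sketch illustrates the arithmetic behind the average query and how a lag reading might gate a failover decision. It is a minimal example under stated assumptions: the function names and the max_lag threshold are hypothetical, the input values are assumed to have been fetched from your metrics backend already, and lag is expressed in whatever unit the metric reports.

```python
from typing import Optional


def average_lag(lag_sum_rate: float, lag_count_rate: float) -> Optional[float]:
    """Average lag = rate of the _sum series divided by rate of the _count series.

    Returns None when no samples were recorded in the window, which avoids
    a division by zero.
    """
    if lag_count_rate <= 0:
        return None
    return lag_sum_rate / lag_count_rate


def safe_to_failover(lag: Optional[float], max_lag: float) -> bool:
    """Hypothetical pre-failover gate.

    max_lag is an assumed, operator-chosen threshold in the same unit as the
    metric. A missing reading is treated as unsafe rather than as zero lag.
    """
    if lag is None:
        return False
    return lag <= max_lag


# Example values standing in for query results (not real measurements).
current = average_lag(lag_sum_rate=6.0, lag_count_rate=3.0)
proceed = safe_to_failover(current, max_lag=5.0)
```

Treating a missing reading as unsafe is a deliberate choice in this sketch: if the metric is absent, you cannot rule out a large lag.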
Monitoring and observability
You can view and alert on key cloud metrics using the Web UI, the 'tcld' CLI utility, and Temporal Cloud APIs. For example, while adding a region to a Namespace, you can see the progress of Workflow replication. Any errors that occur also surface in the Namespace Web UI.
You may notice that a multi-region Namespace shows twice (2x) the Action count in temporal_cloud_v0_total_action_count. This doubling happens due to regional replication.
Auditing operational events
Temporal Cloud provides several ways to audit events:
- When Temporal triggers failovers, the audit log updates with details. Look specifically for "operation": "FailoverNamespace" in the logs.
- You can set alerts for Temporal-initiated failover events.
- After a failover, you can check that the Namespace is active in the new region using the Temporal Cloud Web UI.
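As an illustration of the audit log check described above, the following Python sketch filters exported audit log entries for failover operations. It assumes the log has been exported as one JSON object per line; only the "operation": "FailoverNamespace" key and value come from this page, and every other field and value in the sample data is a hypothetical placeholder.

```python
import json
from typing import Iterable, Iterator


def find_failover_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield audit log entries whose "operation" is "FailoverNamespace".

    Assumes one JSON object per line. Blank lines and lines that are not
    valid JSON are skipped rather than raising an error.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict) and entry.get("operation") == "FailoverNamespace":
            yield entry


# Sample input; all fields other than "operation" are made up for illustration.
sample_lines = [
    '{"operation": "FailoverNamespace", "hypothetical_detail": "example"}',
    '{"operation": "SomeOtherOperation"}',
    "not a JSON line",
]
failover_events = list(find_failover_events(sample_lines))
```

The same filter could feed an alerting hook, for example by notifying an on-call channel whenever failover_events is non-empty.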