For a high-availability configuration of the env zero self-hosted agent, you can install the agent in two different Kubernetes clusters in two different availability zones. You can configure the self-hosted agent in the following mode:
Active-Passive
- Set up the failover agent in a separate region or availability zone.
- Once you have confirmed that the agent is live and running, scale the deployment to 0.
kubectl scale deploy <agentname>.agent-trigger-deployment --replicas=0
If using the proxy pod, also scale the proxy deployment to 0.
kubectl scale deploy <agentname>.agent-proxy-deployment --replicas=0
- When an outage occurs in the primary cluster, scale the failover deployments back to 1.
kubectl scale deploy <agentname>.agent-trigger-deployment --replicas=1
If using the proxy pod, also scale the proxy deployment to 1.
kubectl scale deploy <agentname>.agent-proxy-deployment --replicas=1
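The scale-down and scale-up steps above can be wrapped in a small helper so that the same commands run during a drill and during a real outage. The following is a minimal sketch, not an official env zero script: the `AGENT_NAME` and `USE_PROXY` variables and the `standby` and `activate` function names are assumptions introduced for illustration, and the 120-second rollout timeout is an arbitrary choice.

```shell
#!/bin/sh
# Sketch: switch the failover agent between standby (0 replicas) and active (1 replica).
# AGENT_NAME and USE_PROXY are hypothetical variables introduced for this example.
set -eu

AGENT_NAME="${AGENT_NAME:-<agentname>}"  # replace with your agent name
USE_PROXY="${USE_PROXY:-false}"          # set to "true" if the proxy pod is deployed

# Scale the trigger deployment (and the proxy deployment, if used) to $1 replicas.
scale_agent() {
  kubectl scale deploy "${AGENT_NAME}.agent-trigger-deployment" --replicas="$1"
  if [ "${USE_PROXY}" = "true" ]; then
    kubectl scale deploy "${AGENT_NAME}.agent-proxy-deployment" --replicas="$1"
  fi
}

# Park the failover agent once it has been verified as live and running.
standby() {
  scale_agent 0
}

# Bring the failover agent back during an outage and wait for the rollout to finish.
activate() {
  scale_agent 1
  kubectl rollout status deploy "${AGENT_NAME}.agent-trigger-deployment" --timeout=120s
}
```

After sourcing the file with the kubectl context pointed at the failover cluster, call `standby` once the agent has been verified and `activate` when the primary cluster is down.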
Things to consider
Working Directory
When failing over to a new cluster, the working directory may not be in sync or backed up. As a result, you will most likely be unable to resume or approve a deployment that was waiting for approval. A simple redeployment will allow you to resume your deployments in the new cluster.
- Alternatively, you can use env zero Hosted Encrypted State to resume deployments across multiple clusters.
State File
When implementing a failover strategy for self-hosted env zero agents, it is critical to ensure that Terraform state files remain accessible and consistent across clusters.
Remote state storage (recommended)
For best reliability, state should be stored in a remote backend that remains accessible regardless of cluster failures. This ensures that when a standby cluster is activated, it can resume deployments without losing state. Common options are:
- env zero’s remote backend, which is backed up across two AWS regions, providing built-in redundancy
- Terraform’s remote backend
- AWS S3 + DynamoDB (state locking)
- Google Cloud Storage (GCS)
- Azure Blob Storage
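Before relying on a standby cluster, it can help to confirm that the remote backend is reachable with the credentials the standby agent will use. The sketch below assumes the AWS S3 + DynamoDB option and an installed AWS CLI. The bucket and table names are hypothetical placeholders, and the `check_state_backend` function is introduced here for illustration only.

```shell
#!/bin/sh
# Sketch: check that an S3 + DynamoDB state backend is reachable from the standby side.
# STATE_BUCKET and LOCK_TABLE are hypothetical names; substitute your own.
set -eu

STATE_BUCKET="${STATE_BUCKET:-my-terraform-state}"
LOCK_TABLE="${LOCK_TABLE:-my-terraform-locks}"

check_state_backend() {
  # head-bucket exits non-zero if the bucket is missing or access is denied.
  if ! aws s3api head-bucket --bucket "${STATE_BUCKET}" >/dev/null 2>&1; then
    echo "state bucket ${STATE_BUCKET} is not reachable"
    return 1
  fi
  # describe-table exits non-zero if the lock table is missing or access is denied.
  if ! aws dynamodb describe-table --table-name "${LOCK_TABLE}" >/dev/null 2>&1; then
    echo "lock table ${LOCK_TABLE} is not reachable"
    return 1
  fi
  echo "state backend is reachable"
}
```

Running `check_state_backend` from an environment that uses the standby agent's AWS credentials gives an early signal if the state would be inaccessible after a failover.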
State stored in the agent namespace (less ideal)
Terraform state can also be stored within the env zero agent’s Kubernetes namespace (read more about it here). In this case, extra steps are needed to sync state between the primary and standby clusters. This may involve:
- Persistent storage replication across clusters.
- A manual or automated process to copy state from the failed cluster to the standby.
Next steps
- env zero Hosted Encrypted State - Resume deployments across clusters without PVC sync.
- Running Multiple Self-Hosted Agents - Assign separate agents to different projects or regions.
- Self-Hosted Kubernetes Agent - Configure the agent Helm chart for your cluster.