Terraform Enterprise VMware Reference Architecture

Introduction

This document provides recommended practices and a reference architecture for HashiCorp Terraform Enterprise implementations on VMware.

Implementation Modes

Terraform Enterprise can be installed and function in different implementation modes with increasing capability and complexity:

Standalone/Mounted Disk: The base architecture with a single application node that supports the standard implementation requirements for the platform.
Active/Active: This is an extension of Standalone mode that adds multiple active node capability that can expand horizontally to support larger and increasing execution loads.

This guide will present the base Standalone/Mounted Disk mode first and then discuss the differences that alter the implementation into the Active/Active mode.

Required Reading

Prior to making hardware sizing and architectural decisions, read through the pre-install checklist to familiarise yourself with the application components and architecture. Further, read the reliability and availability guidance as a primer to understanding the recommendations in this reference architecture.

Infrastructure Requirements

The operational mode determines the infrastructure requirements for your instance. For example, an installation in Mounted Disk mode may require a single virtual machine, whereas a stateless production installation may require multiple virtual machines to host the HCP Terraform application, PostgreSQL, Redis, and external Vault servers.

Standalone/Mounted Disk

This mode requires that you specify the local path for data storage. The local path should be a mounted disk from a SAN or NAS device, or some other replicated storage. This allows for rapid recovery or failover.

If you need or want to define storage externally and independently, you can choose the External Services operational mode. This is a more complicated implementation in VMware that requires you to independently manage other services which will not be detailed in this document. You will need to deploy S3-compatible storage either by connecting to a true AWS S3 bucket or by using a compatible alternative on-prem solution, such as Ceph. You will also need to deploy and separately manage an external PostgreSQL database on an additional server or servers.

Some additional information about the External Services option can be found at the end of this document.

Although it is possible for Terraform Enterprise to use an external Vault server instead of its internally managed one, we do not recommended it. External Vault usage is not addressed in this document.

The following table provides high-level server recommendations as a guideline. Please note, thick provision, lazy zeroed storage is preferred. Thin provisioned is only recommended if you are using an external PostgreSQL database and external Vault server. Using thin provisioned disks when using the internal database or Vault may result in serious performance issues.

Terraform Enterprise Servers

Type	CPU Sockets	Total Cores*	Memory	Disk
Minimum	2	4	16 GB RAM	40GB
Scaled	2	8	32 GB RAM	40GB

Note: Per VMware’s recommendation, always allocate the least amount of vCPUs and cores necessary and scale the resources based on application demand. HashiCorp recommends starting with 4 CPUs and increasing if necessary.

Hardware Sizing Considerations

The minimum size would be appropriate for most initial production deployments, or for development/testing environments.
The scaled size is for production environments where there is a consistent high workload in the form of concurrent terraform runs.
Please monitor the actual CPU utilization in vCenter before making the decision to increase the CPU allocation.

Other Considerations

Network

To deploy Terraform Enterprise on VMware you will need to create new or use existing networking infrastructure that has access to any infrastructure you expect to manage with the Terraform Enterprise server. If you plan to use your Terraform Enterprise server to manage or deploy infrastructure on external providers (eg Amazon Web Services, Microsoft Azure or Google Cloud), you will need to make sure the Terraform Enterprise server has unimpeded access to those providers. The same goes for any other public or private datacenter the server will need to connect with.

DNS

The fully qualified domain name should resolve to the IP address of the virtual machine using an A record. Creating the required DNS entry is outside the scope of this guide.

SSL/TLS

A valid, signed SSL/TLS certificate is required for secure communication between clients and the Terraform Enterprise application server. Requesting a certificate is outside the scope of this guide. You will be prompted for the public and private certificates during installation.

Infrastructure Diagram

vmware-mounted-disk-infrastructure-diagram

Application Layer

The Application Layer is a VMware virtual machine running on an ESXi cluster providing an auto-recovery mechanism in the event of virtual machine or physical server failure.

Storage Layer

The Storage Layer is provided in the form of attached disk space configured with or benefiting from inherent resiliency provided by the NAS or SAN. The primary Terraform Enterprise VM will have 2 disks which must meet the requirements detailed here. The first disk is independent to this VM and contains the OS and Terraform Enterprise components specific to this individual install, such as configuration information. The second disk will contain Terraform Enterprise's configuration information such as Workspaces and their resulting Terraform state files. This second disk needs to be regularly backed up, for instance via replication or snapshotting inherent to your SAN or other software, at a rate that meets your desired RPO. Similarly, the standby VM will have two disks. An OS disk that is independent to that VM and a disk which is simply a point in time copy of the primary instance's second disk.

Note: Terraform Enterprise's storage device or service must be highly reliable and high-speed in both I/O and connectivity to meet performance requirements. Device types in the supported list will usually meet these requirements, but many standard NAS and other device types will not perform at the level required. Only use a NAS or other device type not in the supported list if you are certain it can accommodate these requirements.

The specific selection and configuration of the storage device is not covered in this document. For more information about high-speed and highly available storage, please see your storage vendor. We recommend that each of these VMs be deployed as immutable architecture to enable one to easily redeploy the secondary VM when the primary has been upgraded or changed. If this is not possible a snapshot methodology inherent to Terraform Enterprise along with examples of restoring those snapshots is available at Terraform Enterprise Automated Recovery.

For more information about Terraform Enterprise's disk requirements, see Before Installing: Disk Requirements.

Active/Active

The Active/Active mode provides a higher level of availability and failover as well as horizontal scaling. It requires additional external services, and all of the requirements and instructions are available on the Terraform Enterprise Active/Active page.

We have tested Active/Active on VMware vSphere internally, with ESXi version 7.0.1 and vCenter Server version 7.0.2.00200, but should work on any version supported by the vSphere Provider for Terraform.

We recommend a setup with the following:

A load balancer to route traffic to both Terraform Enterprise virtual machines.
Both virtual machines located in the same physical datacenter and on the same network. High network latency between the Terraform Enterprise virtual machines and the external services may result in plan and apply issues.
High-speed disks, as they are critical for good performance.
Both Terraform Enterprise virtual machines can access an external Redis server, a PostgreSQL database, and an S3-compatible blob storage bucket. Terraform Enterprise will use an internal Vault server by default. Optionally, you can configure Terraform Enterprise to use an existing Vault cluster.

An example of a recommended setup:

Terraform Enterprise Active/Active on VMware

Infrastructure Provisioning

The recommended way to deploy Terraform Enterprise for production is through use of a Terraform configuration that defines the required resources, their references to other resources and dependencies.

Normal Operation

Component Interaction

In Mounted Disk Mode the PostgreSQL database will be run in a local container and data will be written to the specified path (which should be a mounted storage device, replicated and/or backed up frequently.) In Active/Active or External Services Mod the external PostgreSQL server will be used.

State and other data will be written to the specified local path (which should be a mounted storage device, replicated and/or backed up frequently) in Mounted Disk, and the S3-compatible storage in Active/Active or External Service Mode.

Redis is used to managed job flow and does not contain stateful data. In Mounted Disk Mode and External Services this service will be started locally as a container. In Active/Active this will be an external server.

Vault will be run in a local container and used only for transit data encryption and decryption. This stateless use of Vault provides easy recovery in the event of a Vault service failure.

Monitoring

While there is not currently a full monitoring guide for Terraform Enterprise, information around logging, diagnostics as well as reliability and availability can be found on our website.

Upgrades

See the Upgrades section of the documentation. For Active/Active you'll need to scale down to a single virtual machine before proceeding with an upgrade.

High Availability

Failure Scenarios

VMware vSphere provides a high level of resilience in various cases of failure, such as at the server hardware layer through vSphere High Availability (HA) and at the network layer through virtual distributed switching. In addition, employing tools such as VMware Site Recovery Manager or utilizing stretched clusters can assist in recovery in the case of a total data center to support failover to a DR datacenter. See the Disaster Recovery section.

The Active/Active deployment method can provide additional failover.

Terraform Enterprise Servers (VMware Virtual Machine)

Should the TFE-main server fail, it can be recovered in place, or the virtual machine can restored to a Disaster Recovery target, to resume service when the failure is limited to the Terraform Enterprise server layer. See the Disaster Recovery section.

Single ESXi Host Failure

In the event of a single ESXi host failure, vSphere HA will restart the Terraform Enterprise virtual machine to a functioning ESXi host in the cluster. This restart can take up to 30 seconds for the failed virtual machine to come back online on a healthy host within the cluster.

If VMware vSphere Fault Tolerance (FT) has been configured for the Terraform Enterprise server, the failover does not result in any visiable outage to the end user.

PostgreSQL Database

When running in Mounted Disk operational mode the PostgreSQL server runs inside a Docker container. If the PostgreSQL service fails a new container should be automatically created. However, if the service is hung, or otherwise fails without triggering a new container deployment, the Terraform Enterprise server should be stopped and the standby server started. All PostgreSQL data will have been written to the mounted disk and will then be accessible on the standby node.

Object Storage

The object storage will be stored on the mounted disk and the expectation is that the NAS or SAN or other highly available mounted storage is fault tolerant and replicated or has fast recovery available.

Redis (Active/Active Only)

The Redis service does not contain stateful data and does not require backups or data sync. While Redis Cluster is not supported, Redis Replication Groups can be utilized for high availability and/or failover. Redis Sentinel is not supported for high availability.

Disaster Recovery

Failure Scenarios

Terraform Enterprise Servers (VMware Virtual Machine)

Through deployment of two virtual machines in different ESXi clusters, the Terraform Enterprise Reference Architecture is designed to provide improved availability and reliability. Should the TFE-main server fail, it can be recovered, or traffic can be routed to a newly built Terraform Enterprise server, with a backup restored to it, to resume service when the failure is limited to the Terraform Enterprise server layer. The load balancer should be manually updated to point to the new Terraform Enterprise VM after services have been started on it in the event of a failure. The specifics of how data should be handled in a Disaster Recovery event will depend on the operational mode.

For a mounted disk install:

If the backup method is to snapshot the virtual machine from the ESX host, file-quiesence must be enabled to ensure data consistency. The other backup option is to make use of the Backup and Restore API.

Once a Disaster has been declared, or an in-place recovery after a failure is otherwise not an option, either a new virtual machine should be created and the backup from the primary should be restored into it via the API, or the virtual machine snapshot should be deployed to the the new ESX host. Please be aware, some configuration items may need to be updated; if the DR database address is different from the primary, for example. After any configuration changes, the Terraform Enterprise server will need to be restarted.

Mounted Disk - Data layer - PostgreSQL Database & Object Storage

The PostgreSQL data and object storage will be written to the mounted disk. The expectation is that the Terraform Enterprise application data is backedup via the Backup and Restore API, or the entire virtual machine is backed up via snapshot (with file-quiescence enabled), and then replicated or backed up offsite and made available to in the event of a DR.

Active/Active Disaster Recovery

You should back up and replicate the stateful external services (PostgreSQL and Blob Storage) to an offsite location to enable a disaster recovery or datacenter failover. Use service-native tools for backups rather than the Backup/Restore API, which is designed to help with platform migration. If you use virtual machine snapshots, you must enable file-quiecense. You do not need to back up the Redis instance because it does not store stateful data.

Redeploy the Terraform Enterprise virtual machines in the restore location using the same automation as in the primary datacenter, and update names and IP addresses for the external services as is necessary, or restore the virtual machine snapshot to the target datacenter and update any configuration as needed (database and redis urls, object storage endpoint) and restart the Terraform Enterprise application.

External Services Storage Options

This information is included if External Services operational mode is required.

External Services - Object Storage Options

An S3 Standard bucket, or compatible storage, must be specified during the Terraform Enterprise installation for application data to be stored securely and redundantly away from the virtual servers running the Terraform Enterprise application. This object storage must be accessible via the network to the Terraform Enterprise virtual machine. Vault is used to encrypt all application data stored in this location. This allows for further server-side encryption by S3 if required by your security policy.

Recommended object storage solutions are AWS S3, Google Cloud storage, Azure blob storage. Other options for S3-compatible storage are MinIO, and Ceph, and ECS, among many others. Please feel free to reach out to support with questions.

External Services - PostgreSQL Database

External Services - PostgreSQL Database Management

Using a PostgreSQL cluster will provide fault tolerance at the database layer. Documentation on how to deploy a PostgreSQL cluster can be found on the PostgreSQL documentation page.

Backup and recovery of PostgreSQL will vary based on your implementation and is not covered in this document. We do recommend regular database snapshots.

External Services - PostgreSQL Database Sizing

Type	CPU Sockets	Total Cores	Memory	Storage
Production	2	4-8 core	16-32 GB RAM	50GB

Active/Active - Redis Server

Redis server versions 5.x is supported and has been tested thoroughly with Terraform Enterprise. Redis (cluster enabled) Cluster is not currently supported. Options are provided for the following:

redis_port: Allows for connecting to a Redis server running on a nonstandard port
redis_use_password_auth: This can be set to 1 if you are using password authentication, or 0 if not.
redis_use_tls: Allows to enabling(1) or disabling(0) the TLS requirement

Additional details can be found on the Active/Active Installation page.