»Monitoring a Terraform Enterprise Instance

This document outlines best practices for monitoring a Terraform Enterprise instance.

»Health Check

Terraform Enterprise provides a /_health_check endpoint on the instance. If Terraform Enterprise is up, the health check will return a 200 OK.

The /_health_check endpoint operates in two modes:

  • Full check
  • Minimal check

A full check attempts to verify the status of internal components and PostgreSQL. A minimal check, in contrast, returns 200 OK automatically once a full check has succeeded.

The endpoint's default behavior is to perform a full check during startup of the instance, and minimal checks after Terraform Enterprise is active and running.

Note: To force a full check, add the query parameter full=1: /_health_check?full=1. Use this with caution, because every call makes requests to internal components and PostgreSQL, which increases system load and latency.
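
As a minimal sketch of polling the endpoint described above, the following Python helper builds the health check URL and treats anything other than a 200 OK as unhealthy. The hostname tfe.example.com, the HTTPS scheme, and the helper names health_check_url and is_healthy are assumptions for illustration only; they are not part of Terraform Enterprise.

```python
import urllib.error
import urllib.request


def health_check_url(host: str, full: bool = False) -> str:
    """Build the health check URL; full=True appends ?full=1 to force a full check."""
    url = f"https://{host}/_health_check"
    return url + "?full=1" if full else url


def is_healthy(host: str, full: bool = False, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers 200 OK within the timeout."""
    try:
        with urllib.request.urlopen(health_check_url(host, full), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and HTTP error responses count as unhealthy.
        return False


if __name__ == "__main__":
    # Hypothetical hostname; replace with your own instance.
    print(is_healthy("tfe.example.com"))
```

A monitoring job would normally call is_healthy with the default full=False and reserve full=True for occasional deeper checks, given the extra load a full check causes.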

»Metrics & Telemetry

In addition to health-check monitoring, we recommend monitoring standard server metrics on the Terraform Enterprise instance:

  • I/O
  • RAM
  • CPU
  • Disk

As of the v202201-1 release, Terraform Enterprise supports exporting container-level resource utilization metrics.

»Terraform Enterprise Metrics

The Terraform Enterprise Metrics service collects a number of runtime metrics. Operators can use this data to gain real-time visibility into their installation, and to set up monitoring and alerting that detects anomalous incidents, performance degradation, and utilization trends.

Metrics are aggregated on a five-second interval and are retained in memory for fifteen seconds. To use Terraform Enterprise metrics for monitoring, you must store the data in metric aggregation software. Terraform Enterprise currently supports exposing metrics data in Prometheus format, as well as a JSON representation.

»Enable Metrics Collection

Metrics collection can be configured with the metrics_endpoint_enabled config flag in the application config file. By default, metrics_endpoint_enabled is set to "0" (disabled). To enable metrics collection, set this value to "1".

»Access Metrics

When enabled, Terraform Enterprise will expose metrics on a port separate from the application. This allows operators to use network access controls to restrict access to metrics data to authorized consumers, e.g., a Prometheus server. By default, port 9090 is used for plaintext HTTP requests, and port 9091 for HTTPS traffic. Both of these values are configurable via the metrics_endpoint_port_http and metrics_endpoint_port_https configuration values, respectively.
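
The three settings named in this section could be assembled as in the sketch below. The setting names and default ports come from the text above. The {"value": "..."} wrapper and the use of string values are assumptions about the layout of the application config file, and metrics_settings is a hypothetical helper, so check the result against your own settings file before using it.

```python
import json


def metrics_settings(http_port: int = 9090, https_port: int = 9091) -> dict:
    """Return a settings fragment that enables the metrics endpoint.

    Assumption: each setting is stored as {"value": "<string>"}.
    """
    return {
        "metrics_endpoint_enabled": {"value": "1"},
        "metrics_endpoint_port_http": {"value": str(http_port)},
        "metrics_endpoint_port_https": {"value": str(https_port)},
    }


# Print the fragment so it can be merged into an existing settings file by hand.
print(json.dumps(metrics_settings(), indent=2))
```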

Both the HTTP and HTTPS ports will respond to HTTP requests with the path /metrics. By default, requests to the /metrics endpoint will generate a response in JSON format; adding the query string ?format=prometheus will generate a response in Prometheus format.
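
A minimal sketch of requesting both formats follows. The hostname tfe.example.com and the helpers metrics_url and fetch_metrics are hypothetical; the path, query string, and default port are taken from the text above. The structure of the JSON document is not described here, so the sketch returns it as parsed without assuming any keys.

```python
import json
import urllib.request


def metrics_url(host: str, port: int = 9090, prometheus: bool = False,
                scheme: str = "http") -> str:
    """Build the /metrics URL; prometheus=True adds ?format=prometheus."""
    url = f"{scheme}://{host}:{port}/metrics"
    return url + "?format=prometheus" if prometheus else url


def fetch_metrics(host: str, port: int = 9090, prometheus: bool = False,
                  timeout: float = 5.0):
    """Fetch metrics as Prometheus text (str) or as parsed JSON."""
    with urllib.request.urlopen(metrics_url(host, port, prometheus), timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return body if prometheus else json.loads(body)


if __name__ == "__main__":
    # Hypothetical hostname; requires network access to a real instance.
    print(fetch_metrics("tfe.example.com", prometheus=True))
```

For HTTPS, pass scheme="https" to metrics_url together with the HTTPS port (9091 by default).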

When using Prometheus, it is recommended to use a scrape interval shorter than the expiration time of 15 seconds, to ensure that data points from short-lived processes are not missed.

»Container Metrics

These metrics report runtime information about Terraform Enterprise containers.

| Exposed Metric | Metric Type | Description |
|---|---|---|
| tfe.container.cpu.usage.user | counter | Running count, in nanoseconds, of the total amount of time processes in the container have spent in userspace |
| tfe.container.cpu.usage.kernel | counter | Running count, in nanoseconds, of the total amount of time processes in the container have spent in kernel space |
| tfe.container.memory.used_bytes | gauge | The amount of memory allocated to the container in bytes, minus memory that is used for page cache |
| tfe.container.memory.limit | gauge | The maximum amount of memory in bytes that can be allocated by the container |
| tfe.container.network.rx_bytes_total | counter | Running count of the number of network bytes received by the container |
| tfe.container.network.rx_packets_total | counter | Running count of the number of network packets received by the container |
| tfe.container.network.tx_bytes_total | counter | Running count of the number of network bytes transmitted by the container |
| tfe.container.network.tx_packets_total | counter | Running count of the number of network packets transmitted by the container |
| tfe.container.disk.io_op_read_total | counter | Running count of the number of read disk operations executed by the container |
| tfe.container.disk.io_op_write_total | counter | Running count of the number of write disk operations executed by the container |
| tfe.container.disk.io_bytes_read_total | counter | Running count of the number of disk bytes read by the container |
| tfe.container.disk.io_bytes_write_total | counter | Running count of the number of disk bytes written by the container |
| tfe.container.process_count | gauge | The number of processes active within the container |
| tfe.container.process_limit | gauge | The maximum number of processes that can be executed inside the container |
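
Because the CPU metrics are running counters in nanoseconds, a utilization figure has to be derived from the difference between two samples. The sketch below shows that arithmetic, along with a memory ratio from the two memory gauges. The function names and sample values are made up for illustration, and treating a decreasing counter as a reset is an assumption about container restarts.

```python
def cpu_cores_used(prev_ns: int, curr_ns: int, elapsed_s: float) -> float:
    """Average number of CPU cores consumed between two counter samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta_ns = curr_ns - prev_ns
    if delta_ns < 0:
        # Assumption: a smaller value means the counter restarted from zero.
        delta_ns = curr_ns
    # One core running flat out accumulates 1e9 ns of CPU time per second.
    return delta_ns / (elapsed_s * 1e9)


def memory_fraction(used_bytes: int, limit_bytes: int):
    """Fraction of the memory limit in use, or None if no limit is reported."""
    return used_bytes / limit_bytes if limit_bytes > 0 else None


# Two made-up samples of tfe.container.cpu.usage.user taken 5 seconds apart.
print(cpu_cores_used(2_000_000_000, 4_500_000_000, 5.0))  # 0.5
# Made-up gauges: 512 MiB used against a 2 GiB limit.
print(memory_fraction(512 * 1024**2, 2 * 1024**3))  # 0.25
```

Adding the user and kernel results for the same interval gives the container's total CPU usage.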

The following metadata labels will be added to each container metric emitted:

  • id: The container ID
  • name: The container name
  • image: The container image

Build worker container metrics include four additional labels: run_type, run_id, workspace_name, and organization_name. You can use these labels to associate a build worker container with its type, run, workspace, and organization, respectively. Metrics for long-running service containers will not include these labels.

In addition to the per-container metrics, the following global metrics are exposed:

| Exposed Metric | Metric Type | Description |
|---|---|---|
| tfe.run.count | gauge | Number of running containers being used for Terraform operations (runs and plans) |
| tfe.run.limit | gauge | Maximum number of jobs as defined by the capacity_concurrency Replicated config |

The name and ID for build worker containers are unique for each build, and build container names take the form of a UUID. Be aware of this when planning for Prometheus storage capacity requirements that relate to metric cardinality. Environments that do not need to track resource consumption of individual build containers or runs can use Prometheus metric relabelling to remove the unique ID, name, and run type labels from container metrics. This reduces cardinality within the dataset while still retaining the ability to associate resource usage with a given workspace and organization.
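
To illustrate why dropping these labels reduces cardinality, the sketch below counts distinct label sets before and after removing id, name, and run_type. It only models the effect; in practice the relabelling is configured in Prometheus itself, not in Python. The label values and the helpers drop_labels and series_count are hypothetical, and the sample keeps run_id because the text above does not list it for removal.

```python
DROP_LABELS = frozenset({"id", "name", "run_type"})


def drop_labels(labels: dict, drop=DROP_LABELS) -> dict:
    """Return a copy of a label set without the high-cardinality labels."""
    return {k: v for k, v in labels.items() if k not in drop}


def series_count(label_sets) -> int:
    """Count distinct label sets, i.e. distinct time series for one metric."""
    return len({tuple(sorted(ls.items())) for ls in label_sets})


# Made-up label sets for three build worker containers.
samples = [
    {"id": "c1", "name": "uuid-1", "run_type": "plan", "run_id": "run-1",
     "workspace_name": "network", "organization_name": "acme"},
    {"id": "c2", "name": "uuid-2", "run_type": "apply", "run_id": "run-1",
     "workspace_name": "network", "organization_name": "acme"},
    {"id": "c3", "name": "uuid-3", "run_type": "plan", "run_id": "run-2",
     "workspace_name": "network", "organization_name": "acme"},
]

print(series_count(samples))                            # 3
print(series_count([drop_labels(s) for s in samples]))  # 2
```

Series that become identical after the labels are removed need to be combined on the consumer side, and the workspace and organization labels remain available for grouping.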

»Grafana Dashboard

This template Grafana dashboard demonstrates how you can use Grafana and Prometheus to visualize exported Terraform Enterprise metrics.
