» google_dataproc_cluster

Manages a Cloud Dataproc cluster resource within GCP. For more information see the official dataproc documentation.

» Example Usage - Basic

resource "google_dataproc_cluster" "simplecluster" {
  name   = "simplecluster"
  region = "us-central1"
}

» Example Usage - Advanced

resource "google_dataproc_cluster" "mycluster" {
  name   = "mycluster"
  region = "us-central1"
  labels = {
    foo = "bar"
  }

  cluster_config {
    staging_bucket = "dataproc-staging-bucket"

    master_config {
      num_instances = 1
      machine_type  = "n1-standard-1"
      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 15
      }
    }

    worker_config {
      num_instances    = 2
      machine_type     = "n1-standard-1"
      min_cpu_platform = "Intel Skylake"
      disk_config {
        boot_disk_size_gb = 15
        num_local_ssds    = 1
      }
    }

    preemptible_worker_config {
      num_instances = 0
    }

    # Override or set some custom properties
    software_config {
      image_version = "1.3.7-deb9"
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }

    gce_cluster_config {
      tags = ["foo", "bar"]
      service_account_scopes = [
        "https://www.googleapis.com/auth/monitoring",
        "useraccounts-ro",
        "storage-rw",
        "logging-write",
      ]
    }

    # You can define multiple initialization_action blocks
    initialization_action {
      script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
      timeout_sec = 500
    }
  }
}

» Example Usage - Using a GPU accelerator

resource "google_dataproc_cluster" "accelerated_cluster" {
  name   = "my-cluster-with-gpu"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"
    }

    master_config {
      accelerators {
        accelerator_type  = "nvidia-tesla-k80"
        accelerator_count = 1
      }
    }
  }
}

» Argument Reference

  • name - (Required) The name of the cluster, unique within the project and zone.

  • project - (Optional) The ID of the project in which the cluster will exist. If it is not provided, the provider project is used.

  • region - (Optional) The region in which the cluster and associated nodes will be created. Defaults to global.

  • labels - (Optional, Computed) The list of labels (key/value pairs) to be applied to instances in the cluster. GCP generates some itself including goog-dataproc-cluster-name which is the name of the cluster.

  • cluster_config - (Optional) Allows you to configure various aspects of the cluster. Structure defined below.


The cluster_config block supports:

    cluster_config {
        gce_cluster_config        { ... }
        master_config             { ... }
        worker_config             { ... }
        preemptible_worker_config { ... }
        software_config           { ... }

        # You can define multiple initialization_action blocks
        initialization_action     { ... }
        encryption_config         { ... }
    }
  • staging_bucket - (Optional) The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option.

  • gce_cluster_config - (Optional) Common config settings for resources of Google Compute Engine cluster instances, applicable to all instances in the cluster. Structure defined below.

  • master_config - (Optional) The Google Compute Engine config settings for the master instances in a cluster. Structure defined below.

  • worker_config - (Optional) The Google Compute Engine config settings for the worker instances in a cluster. Structure defined below.

  • preemptible_worker_config - (Optional) The Google Compute Engine config settings for the additional (aka preemptible) instances in a cluster. Structure defined below.

  • software_config - (Optional) The config settings for software inside the cluster. Structure defined below.

  • autoscaling_config - (Optional) The autoscaling policy config associated with the cluster. Structure defined below.

  • initialization_action - (Optional) Commands to execute on each node after config is completed. This block can be specified multiple times. Structure defined below.

  • encryption_config - (Optional) The customer-managed encryption key settings for the cluster. Structure defined below.
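
Because an auto-generated staging bucket may be shared with other clusters, one option is to manage a dedicated bucket in the same configuration and pass its name to staging_bucket. The following is a minimal sketch under that assumption; the bucket name, location, and resource labels are hypothetical placeholders, not values from this page.

```hcl
# Hypothetical dedicated staging bucket; bucket names must be globally unique.
resource "google_storage_bucket" "dataproc_staging" {
  name     = "my-dataproc-staging-bucket"
  location = "US-CENTRAL1"
}

resource "google_dataproc_cluster" "with_staging" {
  name   = "cluster-with-staging"
  region = "us-central1"

  cluster_config {
    # Referencing the bucket resource makes Terraform create it before the cluster.
    staging_bucket = google_storage_bucket.dataproc_staging.name
  }
}
```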


The cluster_config.gce_cluster_config block supports:

  cluster_config {
    gce_cluster_config {
      zone = "us-central1-a"

      # Set only one of the below to hook into a custom network / subnetwork
      network    = google_compute_network.dataproc_network.name
      subnetwork = google_compute_subnetwork.dataproc_subnetwork.name

      tags = ["foo", "bar"]
    }
  }
  • zone - (Optional, Computed) The GCP zone where your data is stored and used (i.e. where the master and the worker nodes will be created). If region is set to 'global' (default) then zone is mandatory, otherwise GCP is able to make use of Auto Zone Placement to determine this automatically for you. Note: This setting additionally determines and restricts which computing resources are available for use with other configs such as cluster_config.master_config.machine_type and cluster_config.worker_config.machine_type.

  • network - (Optional, Computed) The name or self_link of the Google Compute Engine network the cluster will be part of. Conflicts with subnetwork. If neither is specified, this defaults to the "default" network.

  • subnetwork - (Optional) The name or self_link of the Google Compute Engine subnetwork the cluster will be part of. Conflicts with network.

  • service_account - (Optional) The service account to be used by the Node VMs. If not specified, the "default" service account is used.

  • service_account_scopes - (Optional, Computed) The set of Google API scopes to be made available on all of the node VMs under the service_account specified. These can be either FQDNs or scope aliases. The following scopes must be set if any other scopes are set. They're necessary to ensure the correct functioning of the cluster, and are set automatically by the API:

    • useraccounts-ro (https://www.googleapis.com/auth/cloud.useraccounts.readonly)
    • storage-rw (https://www.googleapis.com/auth/devstorage.read_write)
    • logging-write (https://www.googleapis.com/auth/logging.write)
  • tags - (Optional) The list of instance tags applied to instances in the cluster. Tags are used to identify valid sources or targets for network firewalls.

  • internal_ip_only - (Optional) By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. If set to true, all instances in the cluster will only have internal IP addresses. Note: Private Google Access (also known as privateIpGoogleAccess) must be enabled on the subnetwork that the cluster will be launched in.

  • metadata - (Optional) A map of the Compute Engine metadata entries to add to all instances (see Project and instance metadata).
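
To illustrate how subnetwork, internal_ip_only, and service_account_scopes fit together, the sketch below places a cluster on a custom subnetwork with Private Google Access enabled. It is an illustration under assumptions, not a reference setup: the network and subnetwork names and the CIDR range are hypothetical, and a custom network typically also needs firewall rules that let the nodes reach each other, which are omitted here.

```hcl
resource "google_compute_network" "dataproc_network" {
  name                    = "dataproc-network"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "dataproc_subnetwork" {
  name          = "dataproc-subnetwork"
  region        = "us-central1"
  network       = google_compute_network.dataproc_network.self_link
  ip_cidr_range = "10.0.0.0/16"

  # Required when internal_ip_only = true
  private_ip_google_access = true
}

resource "google_dataproc_cluster" "private_cluster" {
  name   = "private-cluster"
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      # subnetwork conflicts with network, so only one of them is set
      subnetwork       = google_compute_subnetwork.dataproc_subnetwork.name
      internal_ip_only = true

      # The three mandatory scopes are listed alongside the extra one
      service_account_scopes = [
        "useraccounts-ro",
        "storage-rw",
        "logging-write",
        "https://www.googleapis.com/auth/monitoring",
      ]
    }
  }
}
```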


The cluster_config.master_config block supports:

cluster_config {
  master_config {
    num_instances    = 1
    machine_type     = "n1-standard-1"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-ssd"
      boot_disk_size_gb = 15
      num_local_ssds    = 1
    }
  }
}
  • num_instances - (Optional, Computed) Specifies the number of master nodes to create. If not specified, GCP will default to a predetermined computed value (currently 1).

  • machine_type - (Optional, Computed) The name of a Google Compute Engine machine type to create for the master. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).

  • min_cpu_platform - (Optional, Computed, Beta) The name of a minimum generation of CPU family for the master. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • image_uri - (Optional) The URI for the image to use for the master. See the guide for more information.

  • disk_config - (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.
    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each master cluster node. Defaults to 0.
  • accelerators - (Optional) The Compute Engine accelerator (GPU) configuration for these instances. Can be specified multiple times.

    • accelerator_type - (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.
    • accelerator_count - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

The cluster_config.worker_config block supports:

cluster_config {
  worker_config {
    num_instances    = 3
    machine_type     = "n1-standard-1"
    min_cpu_platform = "Intel Skylake"

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 15
      num_local_ssds    = 1
    }
  }
}
  • num_instances - (Optional, Computed) Specifies the number of worker nodes to create. If not specified, GCP will default to a predetermined computed value (currently 2). There is currently a beta feature which allows you to run a Single Node Cluster. In order to take advantage of this you need to set "dataproc:dataproc.allow.zero.workers" = "true" in cluster_config.software_config.override_properties.

  • machine_type - (Optional, Computed) The name of a Google Compute Engine machine type to create for the worker nodes. If not specified, GCP will default to a predetermined computed value (currently n1-standard-4).

  • min_cpu_platform - (Optional, Computed, Beta) The name of a minimum generation of CPU family for the worker nodes. If not specified, GCP will default to a predetermined computed value for each zone. See the guide for details about which CPU families are available (and defaulted) for each zone.

  • disk_config - (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.
    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each worker cluster node. Defaults to 0.
  • image_uri - (Optional) The URI for the image to use for this worker. See the guide for more information.

  • accelerators - (Optional) The Compute Engine accelerator configuration for these instances. Can be specified multiple times.

    • accelerator_type - (Required) The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.
    • accelerator_count - (Required) The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.
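
The Single Node Cluster option mentioned under num_instances can be sketched as follows. This assumes that zero workers plus the dataproc:dataproc.allow.zero.workers property is all the beta feature needs; the resource label and cluster name are hypothetical.

```hcl
resource "google_dataproc_cluster" "single_node" {
  name   = "single-node-cluster"
  region = "us-central1"

  cluster_config {
    master_config {
      num_instances = 1
    }

    # No workers: the master runs all of the cluster's workloads
    worker_config {
      num_instances = 0
    }

    software_config {
      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }
  }
}
```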

The cluster_config.preemptible_worker_config block supports:

cluster_config {
  preemptible_worker_config {
    num_instances = 1

    disk_config {
      boot_disk_type    = "pd-standard"
      boot_disk_size_gb = 15
      num_local_ssds    = 1
    }
  }
}

Note: Unlike worker_config, you cannot set the machine_type value directly. This will be set for you based on whatever was set for the worker_config.machine_type value.

  • num_instances - (Optional) Specifies the number of preemptible nodes to create. Defaults to 0.

  • disk_config - (Optional) Disk Config

    • boot_disk_type - (Optional) The disk type of the primary disk attached to each preemptible worker node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".
    • boot_disk_size_gb - (Optional, Computed) Size of the primary disk attached to each preemptible worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.
    • num_local_ssds - (Optional) The number of local SSD disks that will be attached to each preemptible worker node. Defaults to 0.
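
Since machine_type cannot be set on preemptible workers, the fragment below sketches how the two worker blocks are usually paired. The instance counts and machine type are illustrative values only.

```hcl
cluster_config {
  worker_config {
    num_instances = 2
    machine_type  = "n1-standard-2"
  }

  preemptible_worker_config {
    # No machine_type here: it follows worker_config.machine_type
    num_instances = 4
  }
}
```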

The cluster_config.software_config block supports:

cluster_config {
  # Override or set some custom properties
  software_config {
    image_version = "1.3.7-deb9"

    override_properties = {
      "dataproc:dataproc.allow.zero.workers" = "true"
    }
  }
}
  • image_version - (Optional, Computed) The Cloud Dataproc image version to use for the cluster - this controls the sets of software versions installed onto the nodes when you create clusters. If not specified, defaults to the latest version. For a list of valid versions see Cloud Dataproc versions.

  • override_properties - (Optional) A list of override and additional properties (key/value pairs) used to modify various aspects of the common configuration files used when creating a cluster. For a list of valid properties please see Cluster properties.


The cluster_config.autoscaling_config block supports:

cluster_config {
  # Attach an existing autoscaling policy to the cluster
  autoscaling_config {
    policy_uri = "projects/projectId/locations/region/autoscalingPolicies/policyId"
  }
}
  • policy_uri - (Required) The autoscaling policy used by the cluster. Only resource names including projectid and location (region) are valid. Examples:

    • https://www.googleapis.com/compute/v1/projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]
    • projects/[projectId]/locations/[dataproc_region]/autoscalingPolicies/[policy_id]

    Note that the policy must be in the same project and Cloud Dataproc region.
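
One way to keep the policy in the same project and region as the cluster is to build policy_uri from the same values the cluster uses. The sketch below assumes the policy already exists; the variable names project_id and policy_id and the local value region are hypothetical.

```hcl
variable "project_id" {
  type = string
}

variable "policy_id" {
  type = string
}

locals {
  region = "us-central1"
}

resource "google_dataproc_cluster" "autoscaled" {
  name    = "autoscaled-cluster"
  project = var.project_id
  region  = local.region

  cluster_config {
    autoscaling_config {
      # Same project and region as the cluster itself
      policy_uri = "projects/${var.project_id}/locations/${local.region}/autoscalingPolicies/${var.policy_id}"
    }
  }
}
```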


The initialization_action block (Optional) can be specified multiple times and supports:

cluster_config {
  # You can define multiple initialization_action blocks
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }
}
  • script - (Required) The script to be executed during initialization of the cluster. The script must be a GCS file with a gs:// prefix.

  • timeout_sec - (Optional, Computed) The maximum duration (in seconds) which script is allowed to take to execute its action. GCP will default to a predetermined computed value if not set (currently 300).
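
As a sketch of specifying the block more than once, the fragment below declares two actions with different timeouts. The second script path is a hypothetical placeholder for a script in a bucket you control.

```hcl
cluster_config {
  initialization_action {
    script      = "gs://dataproc-initialization-actions/stackdriver/stackdriver.sh"
    timeout_sec = 500
  }

  initialization_action {
    # Hypothetical script; timeout_sec is omitted, so the computed default applies
    script = "gs://my-bucket/scripts/install-extras.sh"
  }
}
```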


The encryption_config block supports:

cluster_config {
  encryption_config {
    kms_key_name = "projects/projectId/locations/region/keyRings/keyRingName/cryptoKeys/keyName"
  }
}
  • kms_key_name - (Required) The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.

» Attributes Reference

In addition to the arguments listed above, the following computed attributes are exported:

» Timeouts

This resource provides the following Timeouts configuration options:

  • create - Default is 20 minutes.
  • update - Default is 20 minutes.
  • delete - Default is 20 minutes.
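
These defaults can be overridden with a timeouts block inside the resource. A minimal sketch, with illustrative durations and a hypothetical resource label:

```hcl
resource "google_dataproc_cluster" "slow_cluster" {
  name   = "slow-cluster"
  region = "us-central1"

  timeouts {
    create = "30m"
    delete = "30m"
    # update is left at its 20 minute default
  }
}
```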