Google Cloud Dataproc

Dataproc cooperative multi-tenancy

Data analysts run their BI workloads on Dataproc to generate dashboards, reports, and insights. Because analysts from many different teams share this infrastructure to analyze data, there is a growing need for multi-tenancy in Dataproc workloads. Today, workloads from all users on a cluster run as a single service account, so every workload has the same data access. Dataproc Cooperative Multi-tenancy enables multiple users with distinct data access to run workloads on the same cluster.

A Dataproc cluster normally runs workloads as the cluster service account. Creating a Dataproc cluster with Dataproc Cooperative Multi-tenancy enabled lets you isolate user identities when running jobs that access Cloud Storage resources. The mapping of Cloud IAM users to service accounts is specified at cluster creation time, and multiple service accounts can be configured for a given cluster. As a result, interactions with Cloud Storage are authenticated as the service account mapped to the user who submits the job, instead of the cluster service account.

Considerations

Dataproc Cooperative Multi-Tenancy has the following considerations:

  • Set up the mapping of Cloud IAM users to service accounts by setting the dataproc:dataproc.cooperative.multi-tenancy.user.mapping property (see the example after this list). When a user submits a job to the cluster, the VM service account impersonates the service account mapped to that user and interacts with Cloud Storage as that service account, through the GCS connector.
  • Requires GCS connector version 2.1.4 or later.
  • Does not support clusters with Kerberos enabled.
  • Intended for jobs submitted through the Dataproc Jobs API only.
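
For reference, the value of this mapping property is a comma-separated list of user-to-service-account pairs. A minimal sketch with placeholder identities (the actual command appears in the cluster creation step later in this post):

dataproc:dataproc.cooperative.multi-tenancy.user.mapping=alice@example.com:sa-allow@my-project.iam.gserviceaccount.com,bob@example.com:sa-deny@my-project.iam.gserviceaccount.com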

Objectives

In this blog post, we will demonstrate the following:

  • Create a Dataproc cluster with Dataproc Cooperative Multi-tenancy enabled.
  • Submit jobs to the cluster with different user identities and observe different access rules applied when interacting with Cloud Storage.

  • Verify that interactions with Cloud Storage are authenticated as different service accounts, using Stackdriver logging.

Before You Begin

Create a Project

  1. In the Cloud Console, on the project selector page, select or create a Cloud project.
  2. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
  3. Enable the Dataproc API.
  4. Enable the Stackdriver Logging API. (A gcloud command that enables both APIs is sketched after this list.)
  5. Install and initialize the Cloud SDK.
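
If you prefer the command line, the two APIs can be enabled with gcloud. This assumes the Dataproc and Logging APIs (dataproc.googleapis.com and logging.googleapis.com) are the ones you need:

gcloud services enable dataproc.googleapis.com logging.googleapis.com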

Simulate a Second User

Normally, the second user would simply be another person with their own account. For this walkthrough, however, you can simulate a second user with a separate service account. Since you are going to submit jobs to the cluster as different users, activate a service account in your gcloud settings to act as the second user.

  • First, get the currently active account in gcloud. In most cases this is your personal account.
FIRST_USER=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")
  • Create a service account
PROJECT_ID=<your-project-id>
SECOND_USER_SA_NAME=<name-of-service-account-to-simulate-second-user>
gcloud iam service-accounts create ${SECOND_USER_SA_NAME}
SECOND_USER=${SECOND_USER_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
  • Grant the service account the permissions it needs to submit jobs to a Dataproc cluster, as sketched below.
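A minimal sketch of such a grant, assuming the project-level roles/dataproc.editor role is sufficient for your setup (you may prefer a narrower role):
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
    --member serviceAccount:${SECOND_USER} \
    --role roles/dataproc.editor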

Create a key for the service account and use the key to activate it in gcloud. You can delete the key file after the service account is activated.

gcloud iam service-accounts keys create ./key.json --iam-account ${SECOND_USER}
gcloud auth activate-service-account ${SECOND_USER} --key-file ./key.json
rm ./key.json

Now if you run the following command:

gcloud auth list --filter=status:ACTIVE --format="value(account)"

You will see that this service account is now the active account. To proceed with the examples below, switch back to your original account:

gcloud config set account ${FIRST_USER}

Configure the Service Accounts

  • Create three additional service accounts: one as the Dataproc VM service account, and the other two as the service accounts mapped to users (user service accounts). Note: we recommend using a per-cluster VM service account and allowing it to impersonate only the user service accounts you intend to use on that specific cluster.
VM_SA_NAME=<vm-service-account-name>
USER_SA_ALLOW_NAME=<user-service-account-with-gcs-access>
USER_SA_DENY_NAME=<user-service-account-without-gcs-access>
gcloud iam service-accounts create ${VM_SA_NAME}
gcloud iam service-accounts create ${USER_SA_ALLOW_NAME}
gcloud iam service-accounts create ${USER_SA_DENY_NAME}
VM_SA=${VM_SA_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
USER_SA_ALLOW=${USER_SA_ALLOW_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
USER_SA_DENY=${USER_SA_DENY_NAME}@${PROJECT_ID}.iam.gserviceaccount.com
  • Grant the iam.serviceAccountTokenCreator role to the VM service account on the two user service accounts, so it can impersonate them.
gcloud iam service-accounts add-iam-policy-binding \
  ${USER_SA_ALLOW} \
  --member serviceAccount:${VM_SA} \
  --role roles/iam.serviceAccountTokenCreator

And

gcloud iam service-accounts add-iam-policy-binding \
    ${USER_SA_DENY} \
    --member serviceAccount:${VM_SA} \
    --role roles/iam.serviceAccountTokenCreator
  • Grant the dataproc.worker role to the VM service account so it can perform necessary jobs on the cluster VMs.
gcloud projects add-iam-policy-binding \
    ${PROJECT_ID} \
    --member serviceAccount:${VM_SA} \
    --role roles/dataproc.worker

Create Cloud Storage Resource and Configure Service Accounts

  • Create a bucket
BUCKET=<your bucket name>
gsutil mb gs://${BUCKET}
  • Write a simple file to the bucket.
echo "This is a simple file" | gsutil cp - gs://${BUCKET}/file
  • Grant only the first user service account, USER_SA_ALLOW, admin access to the bucket.
gsutil iam ch serviceAccount:${USER_SA_ALLOW}:admin gs://${BUCKET}

Create a Cluster and Configure Service Accounts

  • In this example, we will map the user “FIRST_USER” (your personal account) to the service account with GCS admin permissions, and the user “SECOND_USER” (simulated with a service account) to the service account without GCS access.
  • Note that cooperative multi-tenancy is only available in GCS connector version 2.1.4 and later. The connector is pre-installed on Dataproc image version 1.5.11 and up, but you can use the connectors initialization action to install a specific version of the GCS connector on older Dataproc images (see the sketch after the cluster creation command below).
  • The VM service account needs to call the generateAccessToken API to fetch access tokens for the job service accounts, so make sure your cluster has the right scopes. In the example below we’ll just use the cloud-platform scope.
CLUSTER_NAME=<cluster-name>
REGION=us-central1
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --image-version=1.5-debian10 \
    --scopes=cloud-platform \
    --service-account=${VM_SA} \
    --properties="^#^dataproc:dataproc.cooperative.multi-tenancy.user.mapping=${FIRST_USER}:${USER_SA_ALLOW},${SECOND_USER}:${USER_SA_DENY}"

Note:

1. The user service accounts might need to have access to the config bucket associated with the cluster in order to run jobs, so make sure you grant the user service accounts access to it.

configBucket=$(gcloud dataproc clusters describe ${CLUSTER_NAME} --region ${REGION} \
    --format="value(config.configBucket)")
gsutil iam ch serviceAccount:${USER_SA_ALLOW}:admin gs://${configBucket}
gsutil iam ch serviceAccount:${USER_SA_DENY}:admin gs://${configBucket}

2. On Dataproc clusters with 1.5+ images, by default, Spark and MapReduce history files are sent to the temp bucket associated with the cluster, so you might want to grant the user service accounts access to this bucket.

tempBucket=$(gcloud dataproc clusters describe ${CLUSTER_NAME} --region ${REGION} \
    --format="value(config.tempBucket)")
gsutil iam ch serviceAccount:${USER_SA_ALLOW}:admin gs://${tempBucket}
gsutil iam ch serviceAccount:${USER_SA_DENY}:admin gs://${tempBucket}

Run Example Jobs

  • Run a Spark job as “FIRST_USER”, and since the mapped service account has access to the GCS file gs://${BUCKET}/file, the job will succeed.
gcloud config set account ${FIRST_USER}
gcloud dataproc jobs submit spark --region=${REGION} --cluster=${CLUSTER_NAME} \
    --class=org.apache.spark.examples.JavaWordCount \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- gs://${BUCKET}/file

The job output will look something like this:

is: 1
a: 1
simple: 1
This: 1
file: 1
...
Job [752712...] finished successfully.
done: true
  • Now run the same job as “SECOND_USER”. Since the mapped service account has no access to the GCS file gs://${BUCKET}/file, the job will fail, and the driver output will show that the failure is due to a permission issue.
gcloud config set account ${SECOND_USER}
gcloud dataproc jobs submit spark --region=${REGION} --cluster=${CLUSTER_NAME} \
    --class=org.apache.spark.examples.JavaWordCount \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar -- gs://${BUCKET}/file

The job driver output shows that the job failed because the service account used does not have storage.objects.get access to the GCS file:

GET https://storage.googleapis.com/storage/v1/b/<BUCKET>/o/file
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "[USER_SA_DENY] does not have storage.objects.get access to the Google Cloud Storage object.",
    "reason" : "forbidden"
  } ],
  "message" : "[USER_SA_DENY] does not have storage.objects.get access to the Google Cloud Storage object."
}

Similarly, for a Hive job (creating an external table in GCS, inserting records, then reading them back), running the following as user “FIRST_USER” will succeed because the mapped service account has access to the bucket:

gcloud config set account ${FIRST_USER}
gcloud dataproc jobs submit hive --region=${REGION} --cluster=${CLUSTER_NAME} \
  -e "create external table if not exists employee (eid int, name String) location 'gs://${BUCKET}/employee'; insert into employee values (1, 'alice'), (2, 'bob');select * from employee;"

...
Connecting to jdbc:hive2://<CLUSTER_NAME>-m:10000
Connected to: Apache Hive (version 2.3.7)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
No rows affected (0.538 seconds)
No rows affected (27.668 seconds)
+---------------+----------------+
| employee.eid  | employee.name  |
+---------------+----------------+
| 1             | alice          |
| 2             | bob            |
+---------------+----------------+
2 rows selected (1.962 seconds)
Beeline version 2.3.7 by Apache Hive
Closing: 0: jdbc:hive2://<CLUSTER_NAME>-m:10000
Job [ea9acf13205a44dd...] finished successfully.
done: true
...

However, when the table employee is queried as the second user “SECOND_USER”, the job uses the second user service account, which has no access to the bucket, and the job will fail.

gcloud config set account ${SECOND_USER}
gcloud dataproc jobs submit hive --region=${REGION} --cluster=${CLUSTER_NAME} \
  -e "select * from employee;"

…
GET https://storage.googleapis.com/storage/v1/b/<BUCKET>/o/employee
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "[USER_SA_DENY] does not have storage.objects.get access to the Google Cloud Storage object.",
    "reason" : "forbidden"
  } ],
  "message" : "[USER_SA_DENY] does not have storage.objects.get access to the Google Cloud Storage object."
}

Verify Service Account Authentication With Cloud Storage Through Stackdriver Logging

First, check the usage of the first service account which has access to the bucket.

  • Make sure the gcloud active account is your personal account
gcloud config set account ${FIRST_USER}
  • Find logs about access to the bucket using the service account with GCS permissions
gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${USER_SA_ALLOW}\""

And we can see that permission was always granted:

protoPayload:
  '@type': type.googleapis.com/google.cloud.audit.AuditLog
  authenticationInfo:
    principalEmail: [USER_SA_ALLOW]
    ...
  authorizationInfo:
  - granted: true
    permission: storage.objects.get
    resource: projects/_/buckets/[BUCKET]/objects/file
    resourceAttributes: {}

Next, check the service account that has no access to the bucket:

gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\"  AND protoPayload.authenticationInfo.principalEmail=\"${USER_SA_DENY}\""

And we see that access was never granted (note that no granted: true field appears):

protoPayload:
  '@type': type.googleapis.com/google.cloud.audit.AuditLog
  authenticationInfo:
    principalEmail: [USER_SA_DENY]
    ...
  authorizationInfo:
  - permission: storage.objects.get
    resource: projects/_/buckets/[BUCKET]/objects/employee
    resourceAttributes: {}
  - permission: storage.objects.list
    resource: projects/_/buckets/[BUCKET]
    resourceAttributes: {}

Finally, we can verify that the VM service account was never used directly to access the bucket; the following gcloud command returns 0 log entries:

gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${VM_SA}\""

Cleanup

  • Delete the cluster
gcloud dataproc clusters delete ${CLUSTER_NAME} --region ${REGION} --quiet
  • Delete the bucket
gsutil rm -r gs://${BUCKET}
  • Deactivate the service account used to simulate a second user
gcloud auth revoke ${SECOND_USER}
  • Delete the service accounts
gcloud iam service-accounts delete ${SECOND_USER} --quiet
gcloud iam service-accounts delete ${VM_SA} --quiet
gcloud iam service-accounts delete ${USER_SA_ALLOW} --quiet
gcloud iam service-accounts delete ${USER_SA_DENY} --quiet

Note

  1. The cooperative multi-tenancy feature does not yet work on clusters with Kerberos enabled.
  2. Jobs submitted by users without service accounts mapped to them will fall back to using the VM service account when accessing GCS resources. However, you can set the core:fs.gs.auth.impersonation.service.account property to change the fallback service account, as sketched below. The VM service account must also be able to call generateAccessToken to fetch access tokens for this fallback service account.
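A sketch of what setting the fallback at cluster creation time could look like (FALLBACK_SA is a placeholder for a service account that the VM service account is allowed to impersonate):
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --image-version=1.5-debian10 \
    --scopes=cloud-platform \
    --service-account=${VM_SA} \
    --properties="^#^dataproc:dataproc.cooperative.multi-tenancy.user.mapping=${FIRST_USER}:${USER_SA_ALLOW}#core:fs.gs.auth.impersonation.service.account=${FALLBACK_SA}"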

This blog post demonstrates how you can use Dataproc Cooperative Multi-tenancy to share Dataproc clusters across multiple users with distinct data access.

By Susheel Kaushik and Chao Yuan. Source: Google Cloud Blog.


