It’s 4 PM on a Friday, and your mind has already checked out for the weekend. Just as you are about to close your laptop, you see an e-mail come in from your engineering manager. You dread what lies ahead.
“Our data science team needs to analyze streaming data from our Kafka cluster. They need the data in BigQuery. Can you deliver this ask by Monday morning?”
Sounds simple enough.
You might be tempted to write an ETL script that pulls data from the Kafka cluster every 30 minutes. But things get complicated quickly once you have to introduce retry logic. What if data written to your BigQuery table is not in the right format? And what if users want to filter out a subset of the inbound data, or convert certain fields into a different format?
Then come the other user requirements. What about non-functional requirements that are table stakes for any production data pipeline, such as monitoring and logging? Not to mention the operational challenge of scaling a homegrown ETL stack to the wider organization.
Not so simple a request anymore. Looks like your weekend is totally shot.
What if there were a cloud-native way to handle this data movement use case?
Enter Dataflow Templates.
Dataflow Templates let you set your data in motion in just a handful of clicks. Through a simple user interface, you select a source-sink combination from a dropdown menu, enter values for the required parameters, choose any optional settings, and deploy a pipeline. Once launched, the pipeline runs on the industry-leading, fully managed Dataflow service, which includes horizontal and vertical autoscaling, dynamic work rebalancing, and limitless backends like Shuffle and Streaming Engine.
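The same launch flow is also scriptable. As a hedged sketch, here is what launching the public Pub/Sub Subscription to BigQuery template looks like from the gcloud CLI; the project, subscription, and table values are placeholders you would replace with your own:

```shell
# Launch the Pub/Sub Subscription to BigQuery template from the CLI.
# PROJECT, SUBSCRIPTION, and DATASET.TABLE are illustrative placeholders.
gcloud dataflow jobs run psub-to-bq-job \
  --region=us-central1 \
  --gcs-location=gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
  --parameters=inputSubscription=projects/PROJECT/subscriptions/SUBSCRIPTION,outputTableSpec=PROJECT:DATASET.TABLE
```

The same parameters you would fill in through the console UI map one-to-one onto the `--parameters` flag here, which makes templates easy to fold into CI/CD or scheduled automation.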
Need file format conversion? We’ve got a template for that.
Filter data using our built-in UDF support.
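For the templates that support JavaScript UDFs, each record is passed to a function as a JSON string: returning a string forwards the (possibly transformed) record, while returning nothing drops it. A minimal sketch, with illustrative field names:

```javascript
/**
 * Illustrative Dataflow template UDF: keep only "purchase" events and
 * normalize the amount field to a number. The eventType and amount
 * fields are assumed for this example; returning undefined drops a record.
 */
function process(inJson) {
  var obj = JSON.parse(inJson);
  if (obj.eventType !== "purchase") {
    return; // record is filtered out of the pipeline
  }
  obj.amount = Number(obj.amount); // convert string amounts to numbers
  return JSON.stringify(obj);
}
```

You upload a file like this to Cloud Storage and point the template at it via its UDF parameters, so filtering and light transformation never require touching pipeline code.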
What about those pesky duplicates? We have that covered.
No wonder studies have found that Dataflow boosts data engineering productivity by 55%.
Looks like your weekend might not be over after all.
The Dataflow team is excited to announce the general availability of 24 Google-provided Dataflow templates, including:
- Pub/Sub Subscription to BigQuery
- Pub/Sub Topic to BigQuery
- Pub/Sub Avro to BigQuery
- Pub/Sub Proto to BigQuery
- Pub/Sub to Pub/Sub
- Pub/Sub Avro to Cloud Storage
- Pub/Sub Text to Cloud Storage
- Cloud Storage Text to BigQuery
- Cloud Storage Text to Pub/Sub
- Kafka to BigQuery
- CDC from MySQL to BigQuery
- Datastream to Spanner
- Utility (for use cases that go beyond data transport)
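Circling back to that Friday-afternoon e-mail: the Kafka to BigQuery template ships as a Flex Template, so the whole ask can be a single command. This is a sketch under assumptions: the bucket path and the bootstrapServers, inputTopics, and outputTableSpec parameter names follow the public Kafka_to_BigQuery template, and the broker, topic, and table values are placeholders:

```shell
# Launch the Kafka to BigQuery Flex Template.
# BROKER_HOST, TOPIC, PROJECT, and DATASET.TABLE are illustrative placeholders.
gcloud dataflow flex-template run kafka-to-bq-job \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-templates/latest/flex/Kafka_to_BigQuery \
  --parameters bootstrapServers=BROKER_HOST:9092 \
  --parameters inputTopics=TOPIC \
  --parameters outputTableSpec=PROJECT:DATASET.TABLE
```

Retries, monitoring, and autoscaling come from the managed Dataflow service rather than from hand-rolled ETL code.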
If you are new to Dataflow, Dataflow Templates is absolutely the right place to begin your Dataflow journey.
If you have been using Dataflow for some time, you might note that Dataflow Templates have been around for as long as you can remember. It’s true that we introduced Dataflow Templates in 2017, and since then, thousands of customers have come to rely on Dataflow Templates to automate many of their data movements between different data stores. What’s new is that we now have the structure and personnel in place to provide technical support for these open-source contributions. We have made the requisite investments with dedicated staffing, and now when you use these Dataflow Templates, you can feel confident that your production workloads will be supported no differently than any other workload you run on Google Cloud.
Dataflow Templates might serve your immediate data processing needs, but as any data engineer knows, requirements evolve and customizations are necessary. Thankfully, Dataflow is well-positioned to serve those use cases too.
- Begin your Dataflow journey with our Google-provided templates
- Visit our open-source Templates repository so you can modify our templates for your use case (or launch a Cloud Shell instance with the templates preloaded!)
- Deploy Flex Templates, which take custom templates to the next level and make it easier to reuse code across your teams
- Review how Tyson Foods leveraged Templates to democratize data movement for their end users
By: Mehran Nazir (Product Manager, Dataflow)
Source: Google Cloud Blog