“Cloud Wisdom Weekly: for tech companies and startups” is a new blog series we’re running this fall to answer common questions our tech and startup customers ask us about how to build apps faster, smarter, and cheaper. In this installment, Google Cloud Big Data & Analytics Consultant Julianne Cuneo explores how to get started using BigQuery effectively.
Working with large amounts of data – like those encountered in traditional data warehouses and data lakes – can be complex, expensive, and reliant on specialized skills that can be difficult to source. Yet to compete in today’s customer-centric, data-driven marketplaces, overcoming these challenges is critical.
Analyzing data at scale is crucial to this effort, but so is managing costs and resources. Many businesses are thus looking to the cloud to find solutions and strike the right balance. In this article, we will explore how growing tech companies and startups leverage BigQuery for innovation, and we will share tips that will help you do more with Google’s industry-leading enterprise cloud data warehouse.
Optimizing data management and analytics
Oftentimes, companies rush into loading data and running queries just to see how a new technology performs. That’s reasonable for a quick proof-of-concept or evaluation, but it doesn’t necessarily set you up for long-term success, when business, security, and budgetary needs demand a more sophisticated approach. The tips below will help you set up a strong, scalable foundation, with specific examples of how to optimize a data platform architecture with BigQuery.
1. Independently scale storage and compute
When it comes to handling massive amounts of data, having the right storage capabilities is one of the biggest challenges. Even if you can afford to maintain large volumes of information, effectively analyzing and extracting value from it can be more daunting still. A serverless architecture can help you overcome these challenges in a couple of ways.
First, serverless platforms such as BigQuery separate compute and storage, letting you pay independently for the resources you use and scale up or down as your data needs change. Whereas some services bundle resources such that you get (and pay for) more compute and storage than you need, this approach makes storing large amounts of data more cost-effective and therefore more feasible.
Second, if you can afford to store more data, you create more potential for insights. To that end, BigQuery’s scalable compute capacity allows you to query terabytes or even petabytes of data in a single request.
Combined, these capabilities enable you to scale analytics efforts according to your needs, rather than a predefined amount of storage or compute resources.
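To illustrate that compute scalability, here’s a sketch of a query over one of BigQuery’s large public datasets (the dataset and column names below come from the `bigquery-public-data` program; any sufficiently large table behaves the same way):

```sql
-- Aggregate a multi-terabyte public dataset in a single query.
-- Bytes scanned (and on-demand cost) depend on the columns selected,
-- not on the number of rows returned.
SELECT
  title,
  SUM(views) AS total_views
FROM `bigquery-public-data.wikipedia.pageviews_2020`
WHERE DATE(datehour) = '2020-06-01'
GROUP BY title
ORDER BY total_views DESC
LIMIT 10;
```

Because storage is billed separately, a query like this consumes compute only while it runs; no cluster has to be sized or kept warm for it.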
2. Carefully organize storage and datasets
Providing secure and consistent data access to the necessary people at the right cost is another crucial aspect of data management and analytics. Appropriately planning for resource optimization can save time and circumvent security, billing, and workflow problems down the road. For instance, in BigQuery’s resource organization, key design considerations include:
- Datasets and their objects (e.g., tables, views, ML models) belong to a single project, and that project is billed for the dataset’s storage costs. Consider whether you’d want to implement a centralized data warehouse, allocate data marts to individual projects, or mix both approaches.
- Access to objects in BigQuery can be controlled at the dataset, table, row, and column level, which should also be factored into your storage design (e.g., grouping closely-related objects in the same dataset to simplify access grants).
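To make these points concrete, here’s a minimal sketch using BigQuery’s DDL and DCL statements; the project, dataset, and group names are hypothetical:

```sql
-- Create a dataset in the project that should carry its storage bill.
CREATE SCHEMA `my-analytics-project.sales_mart`
  OPTIONS (location = 'US', description = 'Sales data mart');

-- Grant read access at the dataset level, so closely-related tables
-- grouped in this dataset share a single access grant.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-analytics-project.sales_mart`
TO 'group:sales-analysts@example.com';
```

Finer-grained table-, row-, and column-level controls can then be layered on only where a dataset-level grant would be too broad.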
3. Optimize compute cost and performance across teams and use cases
Some use cases may require precise cost controls or resource planning to meet tight service-level agreements (SLAs). In BigQuery, for instance, data belongs to a single project but can be queried from anywhere, with compute resources billed to the project that runs the query, regardless of where the data lives. Therefore, to track query usage at a granular level, you can create individual projects for different teams (e.g., finance, sales) or use cases (e.g., BI, data science).
In addition to segmenting your compute projects by team or use case for billing purposes, think about how you want to control compute resources across projects for workload management. In BigQuery, “slot commitments” are dedicated compute resources purchased at a flat rate, which can be divided into smaller allocations (or “reservations”) and either assigned to an individual project or shared by multiple projects. By mixing and matching slot commitments with the on-demand model, you can reserve compute power for high-priority or compute-intensive workloads, balancing on-demand flexibility with flat-rate predictability and often enjoying cost savings over the on-demand query model.
For example, say your company has committed to 1,000 slots. You may choose to allocate 500 to your compute-intensive data science projects, 300 to ETL, and 200 to internal BI which has a more flexible SLA. Best of all, your idle slots aren’t isolated in a silo to be left unused. If your ETL projects aren’t using all of their 300 slots, these idle resources can be seamlessly shared with your other data science or BI projects until they are needed again.
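The 1,000-slot split above can be expressed with BigQuery’s reservation DDL; this is a sketch with hypothetical admin-project and assignee names, run from the administration project for the relevant region:

```sql
-- Divide a 1,000-slot commitment into per-workload reservations.
CREATE RESERVATION `admin-project.region-us.data-science`
  OPTIONS (slot_capacity = 500);
CREATE RESERVATION `admin-project.region-us.etl`
  OPTIONS (slot_capacity = 300);
CREATE RESERVATION `admin-project.region-us.bi`
  OPTIONS (slot_capacity = 200);

-- Point a project's queries at its reservation. Idle slots flow to
-- other reservations automatically unless that sharing is disabled.
CREATE ASSIGNMENT `admin-project.region-us.etl.etl-assignment`
  OPTIONS (assignee = 'projects/etl-pipeline-project',
           job_type = 'QUERY');
```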
4. Load and optimize your data schemas
Once you understand how your data will be organized, you can start populating your data warehouse. BigQuery provides numerous ways to ingest data: flat files loaded from Google Cloud Storage, pre-built connectors to apps and databases through the Data Transfer Service, streaming inserts, and compatibility with numerous third-party data migration and ETL tools.
A few simple optimizations to your table schemas can help you achieve the best results. In most cases, this means applying partitioning and/or clustering based on your expected query patterns to significantly reduce the amount of data scanned by queries.
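As a sketch of those optimizations (the table and column names here are hypothetical), a time-partitioned, clustered table looks like this:

```sql
-- Partition by event date and cluster by commonly-filtered columns so
-- queries touching a date range scan only the matching partitions.
CREATE TABLE `my-analytics-project.events.page_views`
(
  event_ts TIMESTAMP,
  user_id  STRING,
  country  STRING,
  url      STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY country, user_id;

-- This query prunes to a single day's partition rather than
-- scanning the full table.
SELECT country, COUNT(*) AS views
FROM `my-analytics-project.events.page_views`
WHERE DATE(event_ts) = '2022-09-01'
GROUP BY country;
```

Choosing the partitioning column to match your most common filter (typically a date) is what turns full-table scans into narrow, cheap ones.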
5. Unify your data investments
Your data and analysis needs might involve working with unstructured and semi-structured data alongside your more broadly-understood, structured data. For this, it is helpful to think beyond just “enterprise data warehouse” and broaden your focus to include solutions that provide a true, centralized data lake.
If you’re using BigQuery, the platform’s federation capabilities let you seamlessly query data stored in Google services including Cloud Storage, Drive, Bigtable, Cloud SQL, and Cloud Spanner, as well as data in other clouds. BigQuery’s Storage API also gives other services such as Dataproc, Dataflow, ML, and BI tools fast access to BigQuery storage at high volumes. Features such as these can help ensure that your data efforts are part of a unified, consistent approach, rather than being splintered across platforms and teams.
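For example, Cloud Storage federation can be sketched with an external table (the bucket, dataset, and table names are hypothetical):

```sql
-- Query Parquet files in place in Cloud Storage, without loading them.
CREATE EXTERNAL TABLE `my-analytics-project.lake.raw_orders`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/orders/*.parquet']
);

-- External and native tables can be joined in the same query.
SELECT o.order_id, c.segment
FROM `my-analytics-project.lake.raw_orders` AS o
JOIN `my-analytics-project.warehouse.customers` AS c
  USING (customer_id);
```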
6. Run queries and have fun!
Once your data is available, it’s time to start querying! To make sure you don’t hit any snags, your platform should ideally provide an easy onramp that lets people get started right away.
As an ANSI-compliant solution, BigQuery SQL lets the average SQL developer leverage existing skills right from the start. There are also numerous third-party tools that provide native connectors to BigQuery or leverage BigQuery’s JDBC/ODBC drivers to author queries on the user’s behalf. If you have numerous SQL scripts from a previous data warehouse investment, BigQuery’s Migration Service can help automate translation of jobs coming from Teradata, Redshift, and several other services. These features allow you to make data available, protected, and smartly budgeted, and help ensure it can easily plug into user-friendly interfaces for analysis.
And if you’re making the move to BigQuery, be sure to take advantage of BigQuery’s unique features, rather than just porting existing queries and operating as before. Run those large analyses that couldn’t execute on another system. Try training a prototype machine learning model using SQL-based BigQuery ML. Query streaming data in real time. Perform geospatial analysis with built-in GIS functions. It’s time to innovate.
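For instance, a prototype BigQuery ML model really is just SQL; this sketch assumes a hypothetical customer table with a boolean `churned` label:

```sql
-- Train a logistic regression model directly in SQL.
CREATE OR REPLACE MODEL `my-analytics-project.ml.churn_model`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT churned, tenure_months, monthly_spend, support_tickets
FROM `my-analytics-project.warehouse.customers`;

-- Score rows with ML.PREDICT.
SELECT *
FROM ML.PREDICT(
  MODEL `my-analytics-project.ml.churn_model`,
  TABLE `my-analytics-project.warehouse.customers`);
```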
Building a solid data foundation takes time and planning
The tips put forth in this article should help position your company for success in the near- and long-term, sparing you from the need to rearchitect your warehousing solution as your business matures. Deciding to put the time, effort, and monetary investment into any new technology requires careful evaluation, so we encourage you to get hands-on with BigQuery through quickstarts, and by visiting our Startups page or reaching out to Google Cloud experts.
By: Julianne Cuneo (Big Data & Analytics Cloud Consultant)
Source: Google Cloud Blog