Speeding Up Small Queries In BigQuery With BI Engine

A quick and easy way to speed up small queries in BigQuery (such as to populate interactive applications or dashboards) is to use BI Engine. The New York Times, for example, uses the SQL interface to BI Engine to speed up their Data Reporting Engine.

The queries

To illustrate, I'll use three representative queries on tables between 100 MB and 3 GB, sizes that are typically considered smallish by BigQuery standards. Because BigQuery is a columnar database, I'm reporting only the size of the columns that the query needs to read.
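You can check how much data a query will scan without running it by doing a dry run. Here is a small sketch, assuming the google-cloud-bigquery library is installed and default credentials are configured; the helper names bytes_scanned and human_readable are my own, not part of the API:

```python
def bytes_scanned(sql):
    # A dry run validates the query and estimates bytes scanned
    # without actually reading any data (so it is free).
    from google.cloud import bigquery
    client = bigquery.Client()
    job = client.query(sql, job_config=bigquery.QueryJobConfig(dry_run=True))
    return job.total_bytes_processed

def human_readable(num_bytes):
    # Format a byte count using decimal units, e.g. 400000000 -> "400.0 MB".
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1000 or unit == "TB":
            return f"{num_bytes:.1f} {unit}"
        num_bytes /= 1000
```

Because BigQuery is columnar, the estimate goes down if you select fewer columns from the same table.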

The first involves processing about 400 MB of data to find the average cost of a Medicare claim:

   SELECT
      SUM(total_drug_cost)/SUM(total_claim_count) AS avg_cost
   FROM `bigquery-public-data.medicare.part_d_prescriber_2014`
   LIMIT 10

The second processes a larger table (3GB) to find the average tip for a Chicago taxi based on payment type:

 SELECT payment_type, AVG(tips) AS avg_tip
 FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
 GROUP BY payment_type

The third processes a smaller table (100 MB) to find the most polluted sites:

            SELECT site_num,
                   ANY_VALUE(state_name) AS state,
                   AVG(aqi) AS air_quality_index
            FROM `bigquery-public-data.epa_historical_air_quality.pm10_daily_summary`
            GROUP BY site_num
            ORDER BY air_quality_index DESC
            LIMIT 10

Measuring query performance

To measure query performance, I'll run each query 5 times, taking care to turn off the cache:

  from google.cloud import bigquery
  from datetime import timedelta

  # Construct a BigQuery client object.
  client = bigquery.Client()

  def run_query(query, n=5):
      tot_slot_millis, tot_time_elapsed = 0, timedelta(0)
      for _ in range(n):
          config = bigquery.QueryJobConfig(use_query_cache=False)
          query_job = client.query(query, job_config=config)
          df = query_job.result().to_dataframe()  # fetch results, forcing the query to complete
          tot_slot_millis += query_job.slot_millis
          tot_time_elapsed += (query_job.ended - query_job.started)
      print("Job stats: slot_millis={} server_time={}".format(
          tot_slot_millis / n, tot_time_elapsed / n))

The code prints out the average resources consumed (slot-milliseconds) and the average time taken per query on the server.

If you are on flat-rate pricing (as most enterprise customers are), slot-milliseconds reflect how long your slots are occupied, so they are a good proxy for what a query costs you.
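To see how slot-milliseconds map to dollars, here's a back-of-the-envelope sketch. The price constant is an assumption for illustration (roughly the 2021 list price of a 100-slot annual commitment), not a quote:

```python
MONTHLY_PRICE_PER_100_SLOTS = 1700.0  # USD; assumed flat-rate list price

def slot_millis_to_dollars(slot_millis, days_per_month=30):
    # A 100-slot reservation supplies 100 slots continuously all month,
    # i.e. 100 * days * 24 h * 3600 s * 1000 ms of slot-milliseconds.
    slot_millis_per_month = 100 * days_per_month * 24 * 3600 * 1000
    return slot_millis * MONTHLY_PRICE_PER_100_SLOTS / slot_millis_per_month
```

Under this assumption, a query consuming a few hundred thousand slot-milliseconds costs well under a cent, which is why the 8x-20x reductions reported below matter mostly in aggregate, across many dashboard queries.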

The elapsed time (server_time) is the time the server spent computing the results. I don't measure the network round-trip time because it is the same whether or not you use BI Engine.

I first ran the three queries without BI Engine (see my notebook on GitHub).

Turning on BI Engine

I then went to the BigQuery web console and created a 10 GB BI Engine reservation (monthly cost: $300):

Why did I pick 10 GB?

One of the key ways BI Engine speeds up queries on small tables is by caching the tables in memory. It applies other optimizations too, but this is the main one. So you want to give it enough memory to comfortably hold the tables you will be querying; BI Engine will automatically manage that memory for you.

In my case, the Chicago taxi query reads 3 GB, so I used 10 GB. For the other two queries, 1 GB would have been enough. You can go up to 100 GB, which I'd recommend if you have many concurrent queries on small tables. You might be able to get away with less because BI Engine stores data compressed, but it's better to leave some extra room.
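One way to pick a size is to sum the table sizes you plan to query and round up with some headroom. This is a sketch, assuming google-cloud-bigquery is installed; the function names are hypothetical, and because BI Engine compresses data, sizing from the uncompressed num_bytes is a conservative choice:

```python
import math

def round_up_gb(total_bytes, headroom=1.3):
    # Add ~30% headroom and round up to whole gigabytes.
    return math.ceil(total_bytes * headroom / 2**30)

def suggested_bi_engine_gb(table_ids, headroom=1.3):
    # table_ids are fully qualified names such as
    # 'bigquery-public-data.chicago_taxi_trips.taxi_trips'.
    from google.cloud import bigquery
    client = bigquery.Client()
    total_bytes = sum(client.get_table(t).num_bytes for t in table_ids)
    return round_up_gb(total_bytes, headroom)
```

For example, a 3 GB table with 30% headroom rounds up to a 4 GB suggestion; I simply rounded further up to the next tier, 10 GB.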

You can also turn on BI Engine from a script using the Reservations API. Note, however, that it takes a few minutes for the memory to become available, so plan to keep the reservation on for at least a few hours rather than toggling it per query.
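A scripted version might look like the sketch below, assuming the google-cloud-bigquery-reservation library is installed and you have reservation-admin permissions; the function name is my own. The BI Engine reservation is a singleton resource per project and location, and its size is specified in bytes:

```python
GB = 2 ** 30  # the BiReservation size field is in bytes

def set_bi_engine_size(project_id, location, size_gb):
    # Resize (or create) the project's BI Engine reservation in a location.
    from google.cloud import bigquery_reservation_v1
    client = bigquery_reservation_v1.ReservationServiceClient()
    bi_reservation = bigquery_reservation_v1.BiReservation(
        name=f"projects/{project_id}/locations/{location}/biReservation",
        size=size_gb * GB,
    )
    return client.update_bi_reservation(
        bi_reservation=bi_reservation,
        update_mask={"paths": ["size"]},
    )
```

Setting the size to 0 releases the reservation, which is handy if you only want to pay for BI Engine during business hours.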

Performance with BI Engine

Here’s how the three queries compared with and without BI Engine:

As you can see, I got cost improvements of 8x to 20x and timing improvements of about 2x. 

Note that some BI Engine capacity is included at no extra cost when you purchase a flat-rate reservation. For example, you get 50 GB of BI Engine included free with a 1,000-slot annual commitment. At $30-$300/month, additional BI Engine capacity is a very cost-effective way to make your BigQuery reservations go further, and you get a speedup too.


Next steps

  1. Try it out. My code is on GitHub.
  2. For more about BI Engine and how to enroll for the preview, see this blog post.
  3. You can get the full speedup in your applications provided you are using the query() method in the BigQuery client API (the insertJob() methods are also accelerated, but not as much). So, check your code.
  4. To learn more about BigQuery, read my book.

Note: The SQL interface to BI Engine is in preview at the time of writing (April 2021). Use this enrollment form to turn this on in your account. Thanks to Mosha Pasumansky and the BI Engine team for helpful discussions.

By: Lak Lakshmanan(Head of Analytics & AI Solutions, Google Cloud)
Source: Google Cloud Blog

