Mercury, the Roman god of commerce, is often depicted carrying a purse, symbolic of business transactions, and wearing winged sandals, illustrating his ability to move at great speed. Transactions power today's business systems, from the millions of packages that logistics companies track in real time to global payments spanning personal loans, securities trading, and intergovernmental transfers, keeping goods and services flowing worldwide. Today, we are very pleased to announce the public preview of multi-statement transactions in BigQuery.
BigQuery has long supported single-statement transactions through DML statements such as INSERT, UPDATE, DELETE, MERGE, and TRUNCATE TABLE, applied to one table per transaction. Multi-statement transactions expand this scope by supporting multiple SQL statements, including DML, spanning multiple tables in a single transaction. This means that the changes to data across multiple tables associated with all statements in a given transaction are committed atomically (all at once) on success, or all rolled back atomically in the event of a failure. These new multi-statement transaction capabilities bridge the gap between online transaction processing (OLTP) systems and BigQuery through tighter and faster integration, just as Mercury’s speed enabled him to travel rapidly between the mortal and divine worlds.
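Conceptually, a multi-statement transaction wraps several statements between BEGIN TRANSACTION and COMMIT TRANSACTION. A minimal sketch (the `mydataset.orders` and `mydataset.order_audit` tables and their columns are hypothetical):

```sql
-- Both DML statements below commit atomically or not at all.
BEGIN TRANSACTION;

UPDATE mydataset.orders
SET status = 'SHIPPED'
WHERE order_id = 1001;

INSERT INTO mydataset.order_audit (order_id, event, event_time)
VALUES (1001, 'SHIPPED', CURRENT_TIMESTAMP());

COMMIT TRANSACTION;
```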
A transaction is a set of tasks carried out against a database or a data warehouse representing a business unit of work. For example, an order expedite action may result in a series of changes to multiple tables, e.g., updating delivery dates on the order, adding new shipment lines to create expedited delivery shipments, and canceling (updating) prior normal delivery shipment lines. Correspondingly, when a batch job executes an ETL process in a data warehouse, it may perform data wrangling and data cleansing operations by transforming and saving the data in multiple tables. In both of these cases, if a failure occurs, the entire transaction must be rolled back so that the system of record does not save partial data, which would leave the dataset in an inconsistent state.
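The order-expedite scenario above might look like the following sketch, where all three changes succeed or fail together (the table names, column names, and values are hypothetical):

```sql
BEGIN TRANSACTION;

-- Move the promised delivery date up on the order.
UPDATE mydataset.orders
SET promised_delivery_date = DATE_ADD(CURRENT_DATE(), INTERVAL 2 DAY)
WHERE order_id = 1001;

-- Cancel the original standard-delivery shipment lines.
UPDATE mydataset.shipment_lines
SET line_status = 'CANCELLED'
WHERE order_id = 1001 AND service_level = 'STANDARD';

-- Add expedited shipment lines in their place.
INSERT INTO mydataset.shipment_lines (order_id, service_level, line_status)
VALUES (1001, 'EXPRESS', 'OPEN');

COMMIT TRANSACTION;
```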
Transactions are known for their atomicity, consistency, isolation, and durability (ACID) properties. Atomicity indicates that all data changes (including metadata changes) are applied or reverted as a single operation, e.g. when a set of rows is added to a table with enrichment and the corresponding set of rows is deleted, both operations happen together. Consistency refers to the consistency of the data before and after the transaction completes. When data is added to or updated in a table, the changes remain consistent with the constraints specified on the table or its columns, e.g. ensuring that a NOT NULL constraint is honored when data is added or modified. Isolation indicates that multiple transactions can make changes to tables concurrently as long as the changes don’t conflict with each other, e.g. two transactions loading data into tables from different business units can proceed concurrently. Durability, the last and possibly most important property, guarantees that after a transaction has committed, all of its changes survive failures and remain visible to future transactions and queries.
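In BigQuery scripting, atomicity is commonly paired with an exception handler so that any failure rolls the whole transaction back. A sketch of that pattern, assuming hypothetical `mydataset.enriched_rows` and `mydataset.staging_rows` tables with matching schemas:

```sql
BEGIN
  BEGIN TRANSACTION;

  -- Enrich-and-move: copy staged rows, then clear the staging table.
  INSERT INTO mydataset.enriched_rows
  SELECT * FROM mydataset.staging_rows;

  DELETE FROM mydataset.staging_rows WHERE TRUE;

  COMMIT TRANSACTION;
EXCEPTION WHEN ERROR THEN
  -- Any failure reverts both statements, leaving both tables unchanged.
  ROLLBACK TRANSACTION;
  RAISE;
END;
```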
All transactions in BigQuery including multi-statement transactions support snapshot isolation. At this isolation level, all statements in a transaction see a consistent snapshot of the database, as of the start of the transaction. BigQuery achieves this due to its multi-version property where it maintains the history of mutations to the database for a certain period. Transactions do not see any changes, committed or uncommitted, from other concurrent transactions while the transaction is in progress. BigQuery supports read-your-own-writes, where statements in a transaction can see all the changes applied by prior statements in the same transaction.
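A sketch of both behaviors, assuming a hypothetical `mydataset.t` table: statements after the INSERT see the new row, while concurrent transactions continue to see the pre-transaction snapshot until the commit:

```sql
BEGIN TRANSACTION;

INSERT INTO mydataset.t (id, v) VALUES (1, 'new');

-- Read-your-own-writes: this SELECT sees the row inserted above,
-- even though the transaction has not yet committed.
SELECT * FROM mydataset.t WHERE id = 1;

-- A concurrent transaction running at this moment would still see the
-- snapshot of mydataset.t taken as of its own start time.
COMMIT TRANSACTION;
```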
Let’s run through an example to demonstrate how BigQuery supports multi-statement transactions with multiple sets of changes across multiple tables in one operation. First, we will set up the demo by creating 10 tables, each with 100 partitions:
DECLARE num_tables DEFAULT 10;
DECLARE x INT64 DEFAULT 0;
# Replace `mydataset` with your preferred dataset name
DECLARE ds STRING DEFAULT "`mydataset`";
DECLARE prefix STRING DEFAULT "mst_demo";
DECLARE table_name STRING;
DECLARE pre_table_name STRING;

/* Sets up 10 tables, with schema (partition_id:integer, value:timestamp).
   Each table has 100 partitions, with partition keys 0, 10, 20, ..., 990. */
SET x = 0;
WHILE x < num_tables DO
  SET table_name = format("%s.%s_%d", ds, prefix, x);
  EXECUTE IMMEDIATE format("""DROP TABLE IF EXISTS %s""", table_name);
  EXECUTE IMMEDIATE format("""
    CREATE OR REPLACE TABLE %s (partition_id INT64, value TIMESTAMP)
    PARTITION BY RANGE_BUCKET(partition_id, GENERATE_ARRAY(0, 1000, 10))
    AS SELECT partition_id, CURRENT_TIMESTAMP() AS value
    FROM UNNEST(GENERATE_ARRAY(0, 990, 10)) AS partition_id
  """, table_name);
  SET x = x + 1;
END WHILE;

# Verify the partitioned tables created and the timestamp
EXECUTE IMMEDIATE format("""
  SELECT table_schema, table_name, COUNT(partition_id), SUM(total_rows), MAX(last_modified_time)
  FROM %s.INFORMATION_SCHEMA.PARTITIONS
  WHERE STARTS_WITH(table_name, '%s')
  GROUP BY table_schema, table_name
""", ds, prefix);

# Verify data in table 9 based on the timestamp
EXECUTE IMMEDIATE format("""
  SELECT * FROM %s ORDER BY partition_id
""", table_name);
Next, we will show how the new COMMIT TRANSACTION and ROLLBACK TRANSACTION commands in BigQuery allow you to save or reverse the changes in a transaction spanning multiple tables.
DECLARE num_tables DEFAULT 10;
DECLARE x INT64 DEFAULT 0;
# Replace `mydataset` with your preferred dataset name
DECLARE ds STRING DEFAULT "`mydataset`";
DECLARE prefix STRING DEFAULT "mst_demo";
DECLARE table_name STRING;
DECLARE pre_table_name STRING;

/* Steps of the transaction:
   1. For table 0, update the timestamp to CURRENT_TIMESTAMP.
   2. For tables 1 to 9, merge table x using data from table x-1.
      Each MERGE sees the updated data in table x-1, so the change in
      table 0 is propagated all the way to table 9.
   3. A total of 1,000 partition changes across 10 tables. */
BEGIN TRANSACTION;

SET x = 0;
WHILE x < num_tables DO
  SET table_name = format("%s.%s_%d", ds, prefix, x);
  IF x = 0 THEN
    EXECUTE IMMEDIATE format("""
      UPDATE %s SET value = CURRENT_TIMESTAMP WHERE true
    """, table_name);
  ELSE
    SET pre_table_name = format("%s.%s_%d", ds, prefix, x-1);
    EXECUTE IMMEDIATE format("""
      MERGE %s t2
      USING %s t1
      ON t2.partition_id = t1.partition_id
      WHEN MATCHED THEN UPDATE SET value = t1.value
    """, table_name, pre_table_name);
  END IF;
  SET x = x + 1;
END WHILE;

# Verify data in table 9 inside the transaction shows the recently updated data
SET table_name = format("%s.%s_%d", ds, prefix, num_tables-1);
EXECUTE IMMEDIATE format("""
  SELECT * FROM %s ORDER BY partition_id
""", table_name);

# x is 10 after the loop, so this branch rolls the transaction back,
# demonstrating ROLLBACK across all 10 tables.
IF x = 9 THEN
  COMMIT TRANSACTION;
ELSE
  ROLLBACK TRANSACTION;
END IF;

# Verify data in table 9 outside the transaction shows the original timestamps
EXECUTE IMMEDIATE format("""
  SELECT * FROM %s ORDER BY partition_id
""", table_name);
Monitoring and Management
Unlike single-statement transactions, which map to a single DML command, multi-statement transactions are complex, spanning multiple DML operations across multiple tables. To monitor these complex jobs, the BigQuery JOBS INFORMATION_SCHEMA views have been expanded with a new column, TRANSACTION_ID, which contains a unique identifier for each transaction in BigQuery. You can query both in-progress and completed transactions through these views.
# Returns the list of transactions running under a parent job ID
SELECT DISTINCT transaction_id
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE parent_job_id = "job_id"
  AND state = "RUNNING";

# Returns the active transactions and parent job IDs that are running and affect a named table
WITH running_transactions AS (
  SELECT DISTINCT transaction_id
  FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
  WHERE state = "RUNNING")
SELECT transaction_id, parent_job_id, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT, running_transactions
WHERE destination_table.table_id = "mytable"
  AND transaction_id = running_transactions.transaction_id;

# Returns all successfully committed transactions
SELECT transaction_id, parent_job_id, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = "COMMIT_TRANSACTION"
  AND error_result IS NULL;

# Returns all rolled back transactions
SELECT transaction_id, parent_job_id, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE statement_type = "ROLLBACK_TRANSACTION"
  AND error_result IS NULL;
To terminate a transaction, cancel the parent job under which it is running by passing the parent job ID of the script containing the transaction to the BQ.JOBS.CANCEL system procedure.
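For example, with a placeholder job ID:

```sql
-- Cancels the script job with the given ID; the transaction running
-- under it is rolled back. 'my-project.US.bquxjob_1234abcd' is a
-- placeholder: substitute the parent job ID of your own script.
CALL BQ.JOBS.CANCEL('my-project.US.bquxjob_1234abcd');
```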
As enterprises accelerate toward real-time, insight-driven decisions, multi-statement transactions are a key enabler for bringing transactional data into analytical data warehouses such as BigQuery. With strong ACID properties, full transaction support for commit and rollback, and rich monitoring and management APIs built on a proven transaction-processing backbone, we are very pleased to help BigQuery customers unlock the value of transactions through these new capabilities.
By: Jagan R. Athreya (Product Manager, Google Cloud) and Pavan Edara (Principal Engineer, Google Cloud)
Source: Google Cloud Blog