google Cloud | Pipes

AutoML Tables Is Now Generally Available In BigQuery ML

Google’s cloud data warehouse, BigQuery, has enabled organizations around the world to accelerate their digital transformation and empower their data analysts to unlock actionable insights from their data. Using BigQuery ML, data analysts are able to create sophisticated machine learning models with just SQL and uncover predictive insights from their data much faster.  Today we are excited to announce the addition of the AutoML Tables model type to the list of supported ML models within BigQuery ML. The AutoML Tables model type, now generally available, integrates directly and seamlessly with our Vertex AI AutoML Tables offering, and enables teams to automatically build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale. BigQuery ML can improve AutoML models, as it transforms input variables into features for AutoML Tables by standardizing numeric columns, one-hot encoding non-numerical columns, extracting components from timestamp, and even expanding array and struct columns. It even does missing value imputation with approaches for numerical, categorial and timestamp columns. 

How does AutoML Tables build powerful, sophisticated models? Behind the scenes, AutoML does quite a lot of machine learning magic:

  • preprocesses the data
  • performs automatic feature engineering
  • model architecture search
  • model tuning
  • cross validation
  • automatic model selection and ensembling

Walking through an example: 

Linear Regression

Using the new_york_taxi_trips.tlc_yellow_trips_2018 dataset that is part of BigQuery’s public datasets, you can try using AutoML Tables to predict the taxi ride tip amount. As a first iteration, since we are trying to predict a continuous dependent variable, we will build a linear regression model using just SQL:

CREATE OR REPLACE MODEL <your_dataset>.nyc_yellow_taxi_tip
OPTIONS
    (model_type='linear_reg',
    input_label_cols=['tip_amount']) AS
SELECT
tip_amount,
  SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
  SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
  passenger_count,
  trip_distance,
  rate_code,
  fare_amount,
  pickup_location_id,
  dropoff_location_id
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`;

You can see the evaluation metrics for this linear regression model below. Note the low R2 value of .35.

SELECT * FROM ml.evaluate(model bqml.nyc_yellow_taxi_tip);

You can then use the model to do predictions:

SELECT * FROM ML.PREDICT(MODEL your_dataset.nyc_yellow_taxi_tip, (
     SELECT
        tip_amount,
        SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
        SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
        passenger_count,
        trip_distance,
        rate_code,
        fare_amount,
        pickup_location_id,
        dropoff_location_id
    FROM
       `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`         
LIMIT 100));

AutoML Tables

To try to improve on the R2 metric, you can use the AutoML Tables model type:

CREATE OR REPLACE MODEL <your_dataset>.nyc_yellow_taxi_tip_automl
OPTIONS(model_type='AUTOML_REGRESSOR',
        input_label_cols=['tip_amount'],
        budget_hours=1.0) AS
SELECT
  tip_amount,
  SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
  SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
  passenger_count,
  trip_distance,
  rate_code,
  fare_amount,
  pickup_location_id,
  dropoff_location_id
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
TABLESAMPLE SYSTEM (20 PERCENT)

There are a couple of things to note here:

1:  The model_type is ‘AUTOML_REGRESSOR’, because the goal is to predict a number (continuous dependent variable). To predict a category or class, you can instead use the ‘AUTOML_CLASSIFIER’ model type.

2:  The ‘budget-hours’ parameter tells AutoML Tables to train the model for a maximum of one hour, compress the model if necessary, and then stop.  

Here are the evaluation metrics with AutoML Tables:

The AutoML Tables model has improved the R2  from .35 to .41, a 15% improvement.  That is outstanding!  

Using the AutoML model for prediction:

SELECT 
  * 
FROM 
  ML.PREDICT(MODEL <your_dataset>.nyc_yellow_taxi_tip_automl, (
     SELECT
        tip_amount,
        SUBSTR(CAST(pickup_datetime AS string), 12, 2) pickup_datetime,
        SUBSTR(CAST(dropoff_datetime AS string), 12, 2) dropoff_datetime,
        passenger_count,
        trip_distance,
        rate_code,
        fare_amount,
        pickup_location_id,
        dropoff_location_id
    FROM
       `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`         
LIMIT 100));

The power of AutoML Tables lies in the ability to feed the model any and all of your data, let Google’s machine learning perform the feature engineering, model selection, and hyperparameter tuning, and ensemble a state-of-the-art model for you. Whether it’s to use AutoML to create an initial benchmark for your data science team, or to use AutoML directly for your machine learning problems, one thing is clear: you can save time and complexity by relying on AutoML Tables.  That leaves you more time to solve your next business problem! 

By: Steve Walker (Customer Engineer, Machine Learning)
Source: Google Cloud Blog

Total
0
Shares
Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Article

Mercury Rising In BigQuery With Multistatement Transactions

Next Article
Google Cloud | Compute

EFPL’s Swiss Plasma Center Aims To Simulate Fusion On Google Cloud

Related Posts