Simplify Model Serving With Custom Prediction Routines On Vertex AI

The data received at serving time is rarely in the format your model expects. Numerical columns need to be normalized, features created, image bytes decoded, and input values validated. Transforming the data can be as important as the prediction itself. That’s why we’re excited to announce custom prediction routines on Vertex AI, which simplify the process of writing preprocessing and postprocessing code.

With custom prediction routines, you provide your data transformations as Python code, and behind the scenes the Vertex AI SDK builds a custom container that you can test locally and deploy to the cloud.

Understanding custom prediction routines

The Vertex AI pre-built containers handle prediction requests by performing the prediction operation of the machine learning framework. Prior to custom prediction routines, if you wanted to preprocess the input before the prediction was performed, or postprocess the model’s prediction before returning the result, you needed to build a custom container from scratch.

Building a custom serving container requires writing an HTTP server that wraps the trained model, translates HTTP requests into model inputs, and translates model outputs into responses. You can see an example here showing how to build a model server with FastAPI.

With custom prediction routines, Vertex AI provides the serving-related components for you, so that you can focus on your model and data transformations.

The predictor

The predictor class is responsible for the ML-related logic in a prediction request: loading the model, getting predictions, and applying custom preprocessing and postprocessing. To write custom prediction logic, you’ll subclass the Vertex AI Predictor interface. In most cases, customizing the predictor is all you’ll require, but check out this notebook if you’d like to see an example of customizing the request handler.

This release of custom prediction routines comes with reusable XGBoost and Sklearn predictors, but if you need to use a different framework you can create your own by subclassing the base predictor.

You can see an example predictor implementation below, specifically the reusable Sklearn predictor. This is all the code you would need to write in order to build this custom model server.

import joblib
import numpy as np

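from google.cloud.aiplatform.utils import prediction_utils
from google.cloud.aiplatform.prediction.predictor import Predictor

class SklearnPredictor(Predictor):
   """Default Predictor implementation for Sklearn models."""

   def __init__(self):
       return

   def load(self, artifacts_uri: str):
       # Copy the artifacts from artifacts_uri into the working directory first.
       prediction_utils.download_model_artifacts(artifacts_uri)
       self._model = joblib.load("model.joblib")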

   def preprocess(self, prediction_input: dict) -> np.ndarray:
       instances = prediction_input["instances"]
       return np.asarray(instances)

   def predict(self, instances: np.ndarray) -> np.ndarray:
       return self._model.predict(instances)

   def postprocess(self, prediction_results: np.ndarray) -> dict:
       return {"predictions": prediction_results.tolist()}

A predictor implements four methods:

  • Load: Loads the model artifacts, and any optional preprocessing artifacts such as an encoder you saved to a pickle file.
  • Preprocess: Performs the logic to preprocess the input data before the prediction is made. By default, the preprocess method receives a dictionary which contains all the data in the request body after it has been deserialized from JSON.
  • Predict: Performs the prediction, which will look something like model.predict(instances) depending on what framework you’re using.
  • Postprocess: Postprocesses the prediction results before returning them to the end user. By default, the output of the postprocess method will be serialized into a JSON object and returned as the response body.
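
The way these four methods chain together for a single request can be sketched in plain Python. This is only an illustration under stated assumptions: RowSumPredictor and handle_request are hypothetical names, the "model" is a stand-in that sums each row, and no Vertex AI classes are involved.

```python
from typing import Any, Dict, List


class RowSumPredictor:
    """Toy predictor with the same four methods; not a Vertex AI class."""

    def load(self, artifacts_uri: str) -> None:
        # A real predictor would read model files from artifacts_uri here.
        self._offset = 1

    def preprocess(self, prediction_input: Dict[str, Any]) -> List[List[float]]:
        # The request body arrives as a dict deserialized from JSON.
        return prediction_input["instances"]

    def predict(self, instances: List[List[float]]) -> List[float]:
        # Stand-in for model.predict: sum each row and add the loaded offset.
        return [sum(row) + self._offset for row in instances]

    def postprocess(self, prediction_results: List[float]) -> Dict[str, Any]:
        # The returned dict would be serialized to JSON as the response body.
        return {"predictions": prediction_results}


def handle_request(predictor: RowSumPredictor, body: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the serving layer's per-request flow."""
    return predictor.postprocess(predictor.predict(predictor.preprocess(body)))


predictor = RowSumPredictor()
predictor.load("unused-in-this-sketch")  # load runs once, before any request
response = handle_request(predictor, {"instances": [[1, 2], [3, 4]]})
```

Note that load runs once at startup, while the other three methods run on every request.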

You can customize as many of the above methods as your use case requires. To customize, all you need to do is subclass the predictor and save your new custom predictor to a Python file.

Let’s take a deeper look at how you might customize each one of these methods.


Load

The load method is where you load any artifacts from Cloud Storage. This includes the model, but can also include custom preprocessors.

For example, let’s say you wrote the following preprocessor to scale numerical features, and stored it as a pickle file called preprocessor.pkl in Cloud Storage.

class MySimpleScaler(object):
   def __init__(self):
       self._means = None
       self._stds = None

   def preprocess(self, data):
       if self._means is None:  # during training only
           self._means = np.mean(data, axis=0)

       if self._stds is None:  # during training only
           self._stds = np.std(data, axis=0)
           if not self._stds.all():
               raise ValueError("At least one column has standard deviation of 0.")
       return (data - self._means) / self._stds
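
The important property of this scaler is that it records its statistics on the first call (training) and reuses them on every later call (serving). That behaviour can be traced with a pure-Python, single-column analogue. SingleColumnScaler is a hypothetical name used only for this illustration, and the numbers are made up.

```python
import statistics
from typing import List, Optional


class SingleColumnScaler:
    """Pure-Python, one-column analogue of MySimpleScaler (illustrative only)."""

    def __init__(self) -> None:
        self._mean: Optional[float] = None
        self._std: Optional[float] = None

    def preprocess(self, column: List[float]) -> List[float]:
        if self._mean is None:  # first call: record the training mean
            self._mean = statistics.fmean(column)
        if self._std is None:  # first call: record the training std
            # pstdev is the population standard deviation, matching np.std's default.
            self._std = statistics.pstdev(column)
            if self._std == 0:
                raise ValueError("Column has standard deviation of 0.")
        return [(value - self._mean) / self._std for value in column]


scaler = SingleColumnScaler()
scaled_training = scaler.preprocess([1.0, 3.0])  # stores mean 2.0 and std 1.0
scaled_serving = scaler.preprocess([4.0])        # reuses the stored statistics
```

Because the statistics live on the object, pickling the fitted scaler after training is what carries them over to serving time.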

When customizing the predictor, you would write a load method to read the pickle file, similar to the following, where artifacts_uri is the Cloud Storage path to your model and preprocessing artifacts.

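import pickle

from google.cloud import storage

def load(self, artifacts_uri: str):
   """Loads the model and preprocessor artifacts."""
   super().load(artifacts_uri)  # assumes the parent predictor loads the model
   # Download the pickled preprocessor from Cloud Storage to a local file.
   gcs_client = storage.Client()
   with open("preprocessor.pkl", "wb") as preprocessor_f:
       gcs_client.download_blob_to_file(
           f"{artifacts_uri}/preprocessor.pkl", preprocessor_f
       )

   with open("preprocessor.pkl", "rb") as f:
       preprocessor = pickle.load(f)

   self._preprocessor = preprocessor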


Preprocess

The preprocess method is where you write the logic to perform any preprocessing needed for your serving data. It can be as simple as applying the preprocessor you loaded in the load method, as shown below:

def preprocess(self, prediction_input):
   inputs = super().preprocess(prediction_input)
   return self._preprocessor.preprocess(inputs)

Instead of loading in a preprocessor, you might write the preprocessing directly in the preprocess method. For example, you might need to check your inputs are in the format you expect. Let’s say your model expects the feature at index 3 to be a string in its abbreviated form. You want to check that at serving time the value for that feature is abbreviated.

def preprocess(self, prediction_input):
   inputs = super().preprocess(prediction_input)
   clarity_dict={"Flawless": "FL",
                 "Internally Flawless": "IF",
                 "Very Very Slightly Included": "VVS1",
                 "Very Slightly Included": "VS2",
                 "Slightly Included": "SI2",
                 "Included": "I3"}
   for sample in inputs:
       if sample[3] not in clarity_dict.values():
           sample[3] = clarity_dict[sample[3]]   
   return inputs

There are numerous other ways you could customize the preprocessing logic. You might need to tokenize text for a language model, generate new features, or load data from an external source.
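
Input validation is another common case. The following is a minimal sketch of a check you might call at the start of preprocess, assuming the request body carries an "instances" list of rows. The helper name validate_instances and the feature count of 4 are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict, List

EXPECTED_NUM_FEATURES = 4  # hypothetical; use your model's real feature count


def validate_instances(prediction_input: Dict[str, Any]) -> List[List[Any]]:
    """Returns the instances if they are well formed, otherwise raises ValueError."""
    instances = prediction_input.get("instances")
    if not isinstance(instances, list) or not instances:
        raise ValueError('Request body must contain a non-empty "instances" list.')
    for index, row in enumerate(instances):
        if not isinstance(row, list) or len(row) != EXPECTED_NUM_FEATURES:
            raise ValueError(
                f"Instance {index} must be a list of {EXPECTED_NUM_FEATURES} features."
            )
    return instances


valid_rows = validate_instances({"instances": [[0.3, 61.5, 55.0, "FL"]]})
```

Failing early with a clear message is usually easier to debug than an error raised deep inside the model's predict call.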


Predict

The predict method usually just calls model.predict, and generally doesn’t need to be customized unless you’re building your predictor from scratch instead of with a reusable predictor.
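
One case where you might override it is when a later step needs a confidence score rather than a class label. The following is a minimal sketch, assuming the loaded model exposes an sklearn-style predict_proba method. StubClassifier and ProbabilityPredictor are hypothetical names, and in real use the predictor would subclass a Vertex AI predictor rather than stand alone.

```python
from typing import List


class StubClassifier:
    """Stand-in for a trained model with an sklearn-style predict_proba."""

    def predict_proba(self, instances: List[List[float]]) -> List[List[float]]:
        # Fake probabilities: negative first feature favors class 0.
        return [[0.9, 0.1] if row[0] < 0 else [0.3, 0.7] for row in instances]


class ProbabilityPredictor:
    """Illustrative only; shows the shape of a customized predict method."""

    def __init__(self, model: StubClassifier) -> None:
        self._model = model

    def predict(self, instances: List[List[float]]) -> List[float]:
        # Return the highest class probability for each instance.
        return [max(probs) for probs in self._model.predict_proba(instances)]


confidences = ProbabilityPredictor(StubClassifier()).predict([[-1.0], [2.0]])
```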


Postprocess

Sometimes the model prediction is only the first step. After you get a prediction from the model you might need to transform it to make it valuable to the end user. This might be something as simple as converting the numerical class label returned by the model to the string label as shown below.

def postprocess(self, prediction_results):
   label_dict = {0: 'rose',
                 1: 'daisy',
                 2: 'dandelion',
                 3: 'tulip',
                 4: 'sunflower'}
   return {"predictions": [label_dict[class_num] for class_num in prediction_results]}

Or you could implement additional business logic. For example, you might want to only return a prediction if the model’s confidence is above a certain threshold. If it’s below, you want the input to be sent to a human instead to double check.

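def postprocess(self, prediction_results):
   returned_predictions = []
   for result in prediction_results:
       if result > self._confidence_threshold:
           # Confident enough: return the model's prediction.
           returned_predictions.append(result)
       else:
           # Below the threshold: flag the input for human review.
           returned_predictions.append("confidence too low for prediction")
   return {"predictions": returned_predictions}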

Just like with preprocessing, there are numerous ways you can postprocess your data with custom prediction routines. You might need to detokenize text for a language model, convert the model output into a more readable format for the end user, or even call a Vertex AI Matching Engine index endpoint to search for data with a similar embedding.

Local Testing

When you’ve written your predictor, you’ll want to save the class out to a Python file. Then you can build your image with the command below, where LOCAL_SOURCE_DIR is a local directory that contains the Python file where you saved your custom predictor.

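from google.cloud.aiplatform.prediction import LocalModel
from src_dir.predictor import MyCustomPredictor
import os

# IMAGE_URI is a placeholder for the Artifact Registry URI to tag the image with.
local_model = LocalModel.build_cpr_model(
   LOCAL_SOURCE_DIR,
   IMAGE_URI,
   predictor=MyCustomPredictor,
   requirements_path=os.path.join(LOCAL_SOURCE_DIR, "requirements.txt"),
)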

Once the image is built, you can test it out by deploying it to a local endpoint and then calling the predict method and passing in the request data. You’ll set artifact_uri to the path in Cloud Storage where you’ve saved your model and any artifacts needed for preprocessing or postprocessing. You can also use a local path for testing.

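# MODEL_ARTIFACT_URI and INPUT_FILE are placeholders for your own paths.
with local_model.deploy_to_local_endpoint(
   artifact_uri=MODEL_ARTIFACT_URI,
) as local_endpoint:
   predict_response = local_endpoint.predict(
       request_file=INPUT_FILE,
       headers={"Content-Type": "application/json"},
   )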
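
The request data is a JSON document whose "instances" key holds the rows to score, which is the dictionary the preprocess method receives after deserialization. Below is a minimal sketch of writing such a request to a file for local testing. The file name instances.json and the feature values are made up for illustration.

```python
import json
import os
import tempfile

# Hypothetical request body: two rows of three numeric features each.
request_body = {"instances": [[0.5, 1.2, 3.4], [0.1, 0.0, 2.2]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    input_file = os.path.join(tmp_dir, "instances.json")

    # Write the request exactly as it would be sent in the HTTP body.
    with open(input_file, "w") as f:
        json.dump(request_body, f)

    # Read it back to confirm it parses to the dict preprocess would receive.
    with open(input_file) as f:
        round_tripped = json.load(f)
```

In a real test you would write the file to a path that outlives the script and pass that path when calling predict on the local endpoint.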

Deploy to Vertex AI

After testing the model locally to confirm that the predictions work as expected, the next steps are to push the image to Artifact Registry, import the model to the Vertex AI Model Registry, and optionally deploy it to an endpoint if you want online predictions.

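from google.cloud import aiplatform

# push image
local_model.push_image()

# upload to registry (MODEL_DISPLAY_NAME and MODEL_ARTIFACT_URI are placeholders)
model = aiplatform.Model.upload(
   local_model=local_model,
   display_name=MODEL_DISPLAY_NAME,
   artifact_uri=MODEL_ARTIFACT_URI,
)

# deploy
endpoint = model.deploy(machine_type="n1-standard-4")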

When the model has been uploaded to Vertex AI and deployed, you’ll be able to see it in the model registry. You can then make prediction requests as you would with any other model you have deployed on Vertex AI.

# get prediction (INSTANCES is a placeholder for your list of input rows)
endpoint.predict(instances=INSTANCES)

What’s next

You now know the basics of how to use custom prediction routines to add powerful customization to your serving workflows without having to worry about model servers or building Docker containers. To get hands-on experience with an end-to-end example, check out this codelab. It’s time to start writing some custom prediction code of your own!

By: Nikita Namjoshi (Developer Advocate) and Sam Thrasher (Software Engineer)
Source: Google Cloud Blog
