PyTorch Serving

This page covers how PyTorch powers two-tower (and general) scoring in ecosystem.Ai. The runtime itself stays lightweight: PyTorch runs in a sidecar (ecosystem-notebooks /pytorch), and the runtime talks to it over HTTP.

In Workbench, PyTorch is an alternative training engine for the /two-tower module. It should produce the same run metadata and Mongo embedding export shape as the H2O engine, so downstream deployment remains engine-agnostic.

Engine options in the runtime

Engine	Status	Two-tower suitability
`api:` HTTP serving	Production	Recommended — PyTorch in a sidecar
DJL + TorchScript (in-JVM)	Scaffolding only (Maven `-Pdeep`)	Possible later; single model slot
ONNX Runtime	Not present	Would be a new integration
Precomputed embeddings	Pattern in use	Best latency for items

The recommended path is api: HTTP serving: it is already wired through ApiModelClient + ApiResponseNormalizer, supports framework=pytorch, and keeps libtorch out of the JVM.

Two implemented `api:pytorch` paths

1. General PyTorch model scoring


mojo.key=api:pytorch:http://ecosystem-notebooks:8010:my_model_v1

The runtime POSTs {model_id, instances:[features]} to {base_url}/invocations and normalizes the response into the canonical score shape.
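As a rough illustration of that request, the sketch below splits a `mojo.key` value into its parts and builds the URL and JSON body. The helper names `parse_api_key` and `build_scoring_request` are hypothetical, and the `api:<framework>:<base_url>:<model_id>` layout is inferred from the example key above, not taken from the runtime source.

```python
import json


def parse_api_key(key: str) -> tuple[str, str, str]:
    """Split 'api:<framework>:<base_url>:<model_id>' (assumed layout)."""
    scheme, framework, rest = key.split(":", 2)
    if scheme != "api":
        raise ValueError(f"not an api: key: {key!r}")
    # The base URL contains colons itself, so the model id is taken from the right.
    base_url, model_id = rest.rsplit(":", 1)
    return framework, base_url, model_id


def build_scoring_request(key: str, features: dict) -> tuple[str, bytes]:
    """Return the target URL and the encoded JSON body for one feature row."""
    _framework, base_url, model_id = parse_api_key(key)
    url = f"{base_url}/invocations"
    body = json.dumps({"model_id": model_id, "instances": [features]})
    return url, body.encode("utf-8")
```

The returned pair can be handed to any HTTP client; response normalization is left to the runtime and is not modeled here.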

2. Two-tower user-tower embedding

For two-tower, the user vector can be computed live by the sidecar when no precomputed vector exists:


predictor.model.type=similarity
predictor.twotower.user.embed=pytorch:http://ecosystem-notebooks:8010:two_tower_user_v1

PrePredictTwoTower calls ApiModelClient.embed(...), which POSTs the customer features to /pytorch/invocations and reads the embedding from the response. Item vectors stay precomputed, so there is one sidecar call per request, not one per offer. The call carries explicit deadlines (3s connect / 10s read, tunable via -Dembed.connect.timeout.ms / -Dembed.read.timeout.ms) so a hung sidecar cannot block runtime request threads.
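The "one call per request" property can be sketched as follows. This is a simplified model, not the plugin code: `score_request` and `embed_user` are hypothetical names, the sidecar call is represented by a plain callable, and a dot product stands in for whatever similarity the runtime actually applies.

```python
def score_request(customer_features, item_vecs, embed_user, cached_user_vec=None):
    """Rank offers for one request using at most one live embedding call."""
    # Use the precomputed user vector if one exists; otherwise ask the sidecar once.
    if cached_user_vec is not None:
        user_vec = cached_user_vec
    else:
        user_vec = embed_user(customer_features)

    scores = {}
    for offer, vec in item_vecs.items():
        if len(vec) != len(user_vec):
            raise ValueError(f"dimension mismatch for offer {offer!r}")
        # Item vectors are precomputed, so scoring each offer is pure arithmetic.
        scores[offer] = sum(u * v for u, v in zip(user_vec, vec))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

However many offers are scored, `embed_user` runs zero or one time, which is why the sidecar deadlines bound the added latency per request.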

The sidecar: ecosystem-notebooks `/pytorch`

The concrete PyTorch service lives in ecosystem-notebooks (Flask, port 8010) and exposes:

Endpoint	Purpose
`POST /pytorch/train`	train an MLP or a two-tower model; save artifacts
`POST /pytorch/invocations`	score / embed by `model_id`
`GET /pytorch/models`	list trained model ids
`GET /pytorch/health`	health check

Training contract

Workbench converts a saved Two-Tower configuration into this request. The source.pipeline should match the selected predictor/date filters from logging.ecosystemruntime_flatten.


{
  "model_id": "customer_offer_retrieval_v1",
  "model_type": "two_tower",
  "async": true,
  "problem_type": "binary_classification",
  "data": {
    "source": {
      "database": "logging",
      "collection": "ecosystemruntime_flatten",
      "pipeline": [
        { "$match": { "predictor": "my_predictor" } },
        { "$project": { "_id": 0, "customer_id": 1, "offer": 1, "price": 1, "rank": 1, "score": 1, "accepted": 1 } }
      ],
      "limit": 100000
    },
    "target_column": "accepted",
    "categorical_columns": ["customer_id", "offer"],
    "train_test_split": 0.2,
    "random_state": 42
  },
  "hyperparameters": {
    "epochs": 25, "batch_size": 256, "hidden": 64, "learning_rate": 0.001,
    "embedding_dim": 32,
    "user_features": ["customer_id", "price", "rank", "score"],
    "item_features": ["offer", "price", "rank", "score"],
    "user_id_column": "customer_id", "item_id_column": "offer"
  }
}

Training data is read from MongoDB (MONGODB_URI) via the source spec; a csv_path (under the existing DATA_DIR) or inline rows are also accepted. Artifacts are written under DATA_DIR/pytorch_models/{model_id}/.
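Before submitting a training request, a client-side sanity check along these lines may catch common mismatches between the pipeline and the hyperparameters. The function `check_training_request` and its rules are assumptions made for illustration; they are not the sidecar's own validation.

```python
def check_training_request(req: dict) -> list[str]:
    """Return a list of human-readable problems; empty means nothing was flagged."""
    problems = []
    data = req.get("data", {})
    hp = req.get("hyperparameters", {})
    user_features = hp.get("user_features", [])
    item_features = hp.get("item_features", [])

    if hp.get("user_id_column") not in user_features:
        problems.append("user_id_column is not listed in user_features")
    if hp.get("item_id_column") not in item_features:
        problems.append("item_id_column is not listed in item_features")

    # Every feature and the target must survive the Mongo pipeline.
    needed = set(user_features) | set(item_features)
    if data.get("target_column"):
        needed.add(data["target_column"])

    # Collect fields kept by any $project stage; skip the check if there is none.
    projected = set()
    for stage in data.get("source", {}).get("pipeline", []):
        for field, keep in stage.get("$project", {}).items():
            if keep:
                projected.add(field)
    if projected:
        for col in sorted(needed - projected):
            problems.append(f"column {col!r} is not kept by $project")
    return problems
```

Run against the sample request above, this sketch reports no problems; dropping `accepted` from the `$project` stage would be flagged.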

The Workbench run metadata should record:


{
  "run_id": "tt_pytorch_abc123",
  "config_id": "customer_offer_retrieval_v1",
  "engine": "pytorch",
  "model_id": "customer_offer_retrieval_v1",
  "pytorch_sidecar_url": "http://ecosystem-notebooks:8010",
  "embedding_dim": 32,
  "user_column": "customer_id",
  "item_column": "offer"
}

Scoring / embedding contract


{
  "model_id": "customer_offer_retrieval_v1",
  "instances": [
    { "customer_id": "user_1", "price": 0, "rank": 1, "score": 0, "tower": "user" }
  ]
}

The response carries both a score and an embedding, so it satisfies both general scoring and ApiModelClient.embed():


{
  "predictions": [ { "prediction": 0.87, "embedding": [0.11, 0.20, 0.07] } ],
  "final_result": [ { "prediction": 0.87, "embedding": [0.11, 0.20, 0.07] } ],
  "framework": "pytorch"
}

For two_tower models, set "tower": "user" or "tower": "item" on an instance to choose which tower’s embedding is returned (default: user).
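A caller might pull the embedding out of that response as sketched below. The helper `extract_embedding` is hypothetical, and the order in which it tries `final_result` and `predictions` is an assumption; the document does not state which field ApiModelClient.embed() reads.

```python
def extract_embedding(response: dict) -> list[float]:
    """Return the first instance's embedding from a sidecar response."""
    # Assumed lookup order: final_result first, then predictions.
    for key in ("final_result", "predictions"):
        rows = response.get(key) or []
        if rows and "embedding" in rows[0]:
            return [float(x) for x in rows[0]["embedding"]]
    raise ValueError("response carries no embedding")
```

Raising on a missing embedding keeps a misconfigured model (for example, a plain MLP used where a two-tower model is expected) from silently producing an empty vector.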

Exporting PyTorch embeddings to Mongo

For production runtime scoring, Workbench should usually export PyTorch vectors to MongoDB rather than call the sidecar for every request.

The export job calls /pytorch/invocations in batches:


{
  "model_id": "customer_offer_retrieval_v1",
  "instances": [
    { "tower": "user", "customer_id": "user_1", "price": 0, "rank": 1, "score": 0 },
    { "tower": "user", "customer_id": "user_2", "price": 0, "rank": 1, "score": 0 }
  ]
}

and:


{
  "model_id": "customer_offer_retrieval_v1",
  "instances": [
    { "tower": "item", "offer": "ProductA", "price": 0, "rank": 1, "score": 0 },
    { "tower": "item", "offer": "ProductB", "price": 0, "rank": 1, "score": 0 }
  ]
}

In practice, the export job fills the feature values from sampled source rows using the run’s recorded user_features / item_features (the zeros above are illustrative). The sidecar returns raw activations; when the export’s normalized flag is set, Workbench L2-normalizes the returned embeddings. The vectors are then bulk-upserted into the configured embedding collections (default database: logging) using the same document contract as H2O exports. Real-time scoring then uses the same PrePredictTwoTower and PostScoreTwoTower plugins regardless of training engine.
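The batching and normalization steps of such an export can be sketched as follows. The names `batched_payloads` and `l2_normalize` are hypothetical, the batch size is arbitrary, and the Mongo upsert is omitted because the document contract is not spelled out here.

```python
import math


def batched_payloads(model_id, tower, rows, batch_size=2):
    """Yield /pytorch/invocations request bodies, batch_size rows at a time."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        yield {
            "model_id": model_id,
            # Tag each instance with the tower whose embedding is wanted.
            "instances": [{"tower": tower, **row} for row in chunk],
        }


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]
```

Normalizing on the client side, as the text describes, keeps the sidecar output raw, so the same trained model can serve both normalized and unnormalized exports.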

The pytorch_mlrun engine value is rejected at request validation (HTTP 422): the MLRun trainer sidecar ships only a tabular-MLP handler and has no two-tower support. Use pytorch (the ecosystem-notebooks sidecar described on this page) for PyTorch two-tower training.

See the full request/response samples in the API Reference.