Data Preparation

Both reference use-cases follow the same shape:

master.bank_transactions ──┐                       ┌── master.bank_transactions_mlrun
                           ├──► feature pipeline ──┤
master.bank_customer ──────┘ (MongoDB aggregation) └── master.bank_customer_mlrun

The feature pipeline is a MongoDB aggregation persisted in ecosystem_meta.mlrun_feature_pipelines. The pipeline is generated by one of three paths:

  • AI pipeline assistant — the Generate tab in the console pairs a free-form prompt with a pipeline_kind selector and two action buttons (Draft only / Generate everything). The assistant samples the source collection, drafts the aggregation, infers the target column / problem type, and (optionally) runs every selected framework end-to-end. See Console Tour — Generate tab.
  • Manual editor — paste an aggregation directly into the Aggregation Pipeline JSON text area on the Data Source tab and click Preview to sample the first results.
  • The seeder (scripts/seed_mlrun_use_cases.py) — uses a hand-tuned aggregation that derives the per-use-case target field on the fly so the seed run is reproducible without any prior labelling.

AI assistant inputs

The Generate tab passes the following to the backend /api/v1/mlrun-runtime/generate-from-prompt endpoint:

| Field | Effect |
| --- | --- |
| prompt | Free-form natural-language description of the goal. |
| kind | One of auto, numeric, categorical, temporal, mixed, aggregates, type_coercion. Selects the aggregation template the assistant uses. |
| source.{database,collection} | Inherited from the Data Source tab and used to sample the schema (default 100 documents) before drafting the pipeline. |
| target.{database,collection} | Where the engineered output gets written. |
| project_id, configuration_id | When set, the assistant updates the existing configuration in place. |
| run_training + frameworks | When run_training=true, the backend also dispatches one POST /train per selected framework after persisting the configuration. |
| persist | Always true from the Generate everything button; the Draft only button submits the same payload but never writes the configuration. |
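As an illustration of the fields above, the following sketch assembles a request body in Python. The nested layout of source and target, the framework name "h2o", and the helper build_generate_payload are assumptions made for the example; only the field names come from the table, and the real schema should be checked against the backend.

```python
import json

# Kinds listed in the table above.
VALID_KINDS = {
    "auto", "numeric", "categorical", "temporal",
    "mixed", "aggregates", "type_coercion",
}


def build_generate_payload(prompt, source, target, kind="auto", frameworks=(),
                           persist=True, project_id=None, configuration_id=None):
    """Assemble a generate-from-prompt body (hypothetical helper, assumed layout)."""
    if kind not in VALID_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    payload = {
        "prompt": prompt,
        "kind": kind,
        # Assumption: source/target are nested objects with database + collection.
        "source": {"database": source[0], "collection": source[1]},
        "target": {"database": target[0], "collection": target[1]},
        # Training is requested only when at least one framework is selected.
        "run_training": bool(frameworks),
        "frameworks": list(frameworks),
        "persist": persist,
    }
    # The ids are included only when an existing configuration is updated in place.
    if project_id is not None:
        payload["project_id"] = project_id
    if configuration_id is not None:
        payload["configuration_id"] = configuration_id
    return payload


payload = build_generate_payload(
    prompt="Flag customers with risky spending in the last 90 days",
    source=("master", "bank_transactions"),
    target=("master", "bank_transactions_mlrun"),
    kind="aggregates",
    frameworks=["h2o"],  # assumed framework identifier
)
print(json.dumps(payload, indent=2))
```

Passing persist=False would correspond to the Draft only button, which submits the same payload without writing the configuration.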

The assistant returns:

  • The full MlrunFeatureEnrichmentRequest (loaded back into the form).
  • A schema_summary describing how many documents and fields were sampled.
  • An optional training_spec with the inferred target, problem type, feature columns, and confidence.
  • A training_results list (one entry per framework) when run_training=true.

Pipeline kinds

| Kind | When to pick it |
| --- | --- |
| auto | Default. The assistant decides based on the prompt and the sampled schema. |
| numeric | Pure numeric features; missing values are filled with 0. |
| categorical | String columns; populates H2O enum_columns. |
| temporal | Calendar / time-of-day boolean indicators. |
| mixed | Balanced blend of numeric + categorical + temporal. |
| aggregates | Per-entity rollups using a MongoDB $group stage. |
| type_coercion | Uses $addFields / $switch to clean mixed-type fields. |
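To make the type_coercion kind more concrete, here is a minimal sketch of an $addFields / $switch stage built as a Python dict. The helper coerce_to_double_stage and the field name amount are hypothetical, and the stage is only one plausible shape for such a template, not the one the assistant actually emits.

```python
import json


def coerce_to_double_stage(field):
    """Build an $addFields stage that coerces `field` to a double (illustrative)."""
    ref = f"${field}"
    return {
        "$addFields": {
            field: {
                "$switch": {
                    "branches": [
                        {
                            # Strings are parsed; unparseable strings become null.
                            "case": {"$eq": [{"$type": ref}, "string"]},
                            "then": {
                                "$convert": {
                                    "input": ref,
                                    "to": "double",
                                    "onError": None,
                                    "onNull": None,
                                }
                            },
                        },
                        {
                            # Integer types are widened to double.
                            "case": {"$in": [{"$type": ref}, ["int", "long"]]},
                            "then": {"$toDouble": ref},
                        },
                    ],
                    # Doubles and any other types pass through unchanged.
                    "default": ref,
                }
            }
        }
    }


stage = coerce_to_double_stage("amount")
print(json.dumps(stage, indent=2))
```

The resulting dict can be placed in an aggregation list ahead of any $group stage so that the rollups see a single numeric type.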

Spend Risk pipeline

The spend-risk aggregation:

  1. Filters transactions to the last 90 days.
  2. Buckets by merchant category and channel.
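  3. Computes per-customer rollups: amount sum / mean / population std, declined count, and the list of merchant categories seen.
  4. Adds the derived is_high_risk boolean: true when the customer has at least 3 declined transactions or at least one transaction in the gambling, crypto, or cash_advance categories.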
  5. Writes engineered rows to master.bank_transactions_mlrun.

[
  { "$match": { "$expr": { "$gte": [
    "$transaction_date",
    { "$dateSubtract": { "startDate": "$$NOW", "unit": "day", "amount": 90 } }
  ] } } },
  { "$group": {
    "_id": "$customer_id",
    "amount_sum": { "$sum": "$amount" },
    "amount_mean": { "$avg": "$amount" },
    "amount_std": { "$stdDevPop": "$amount" },
    "declined_count": { "$sum": { "$cond": [{ "$eq": ["$status", "DECLINED"] }, 1, 0] } },
    "high_risk_categories": { "$push": "$merchant_category" }
  } },
  { "$addFields": {
    "is_high_risk": { "$or": [
      { "$gte": ["$declined_count", 3] },
      { "$gt": [{ "$size": { "$setIntersection": ["$high_risk_categories", ["gambling", "crypto", "cash_advance"]] } }, 0] }
    ] }
  } },
  { "$out": "bank_transactions_mlrun" }
]
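To check the rollup logic without a database, the $match, $group, and $addFields stages can be mirrored in plain Python. The sketch below is only an approximation for reasoning about the output: the function name spend_risk_rollup and the sample transactions are made up, and it assumes transaction_date holds timezone-aware datetimes.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import fmean, pstdev

HIGH_RISK_CATEGORIES = {"gambling", "crypto", "cash_advance"}


def spend_risk_rollup(transactions, now=None, window_days=90):
    """Mirror the spend-risk aggregation over in-memory transactions."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=window_days)

    # $match: keep transactions inside the window, grouped by customer.
    by_customer = defaultdict(list)
    for tx in transactions:
        if tx["transaction_date"] >= cutoff:
            by_customer[tx["customer_id"]].append(tx)

    rows = []
    for customer_id, txs in by_customer.items():
        # $group: per-customer rollups.
        amounts = [tx["amount"] for tx in txs]
        declined = sum(1 for tx in txs if tx["status"] == "DECLINED")
        categories = [tx["merchant_category"] for tx in txs]
        rows.append({
            "_id": customer_id,
            "amount_sum": sum(amounts),
            "amount_mean": fmean(amounts),
            "amount_std": pstdev(amounts),  # population std, like $stdDevPop
            "declined_count": declined,
            "high_risk_categories": categories,
            # $addFields: 3+ declines or any high-risk merchant category.
            "is_high_risk": declined >= 3
            or bool(HIGH_RISK_CATEGORIES & set(categories)),
        })
    return rows


utc = timezone.utc
sample = [
    {"customer_id": "c1", "amount": 10.0, "status": "APPROVED",
     "merchant_category": "grocery", "transaction_date": datetime(2024, 5, 20, tzinfo=utc)},
    {"customer_id": "c1", "amount": 30.0, "status": "APPROVED",
     "merchant_category": "crypto", "transaction_date": datetime(2024, 5, 25, tzinfo=utc)},
    {"customer_id": "c2", "amount": 50.0, "status": "DECLINED",
     "merchant_category": "grocery", "transaction_date": datetime(2024, 5, 30, tzinfo=utc)},
    {"customer_id": "c2", "amount": 20.0, "status": "APPROVED",
     "merchant_category": "grocery", "transaction_date": datetime(2023, 1, 1, tzinfo=utc)},
]
rollup = {r["_id"]: r for r in spend_risk_rollup(sample, now=datetime(2024, 6, 1, tzinfo=utc))}
print(rollup["c1"]["is_high_risk"], rollup["c2"]["is_high_risk"])
```

In this sample, c1 is flagged because of the crypto transaction, while c2 has a single decline inside the window and is not flagged.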

Customer Personality pipeline

The personality aggregation pulls demographic fields from bank_customer, normalises them, and prepares a multiclass training frame:

  1. Selects demographic fields: age, gender, income_band, region, life_stage, marital_status, dependents, tenure_months.
  2. Converts age to bands (<25, 25–34, 35–44, 45–54, 55+).
  3. Drops rows with missing personality and rows whose personality value occurs fewer than 25 times across the entire collection.
  4. Writes the training frame to master.bank_customer_mlrun.

[
  { "$match": { "personality": { "$nin": [null, ""] } } },
  { "$addFields": {
    "age_band": { "$switch": {
      "branches": [
        { "case": { "$lt": ["$age", 25] }, "then": "lt_25" },
        { "case": { "$lt": ["$age", 35] }, "then": "25_34" },
        { "case": { "$lt": ["$age", 45] }, "then": "35_44" },
        { "case": { "$lt": ["$age", 55] }, "then": "45_54" }
      ],
      "default": "55_plus"
    } }
  } },
  { "$project": {
    "customer_id": 1, "age_band": 1, "gender": 1, "income_band": 1, "region": 1,
    "life_stage": 1, "marital_status": 1, "dependents": 1, "tenure_months": 1,
    "personality": 1
  } },
  { "$out": "bank_customer_mlrun" }
]
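The age banding and the rare-label rule from step 3 can be sketched in plain Python. The stages shown above only filter out missing personality values, so the rare-label filter here is a guess at how the 25-occurrence threshold might be applied; age_band and drop_rare_labels are hypothetical helpers, not functions from the seeder.

```python
from collections import Counter


def age_band(age):
    """Mirror the $switch branches: first matching upper bound wins."""
    if age < 25:
        return "lt_25"
    if age < 35:
        return "25_34"
    if age < 45:
        return "35_44"
    if age < 55:
        return "45_54"
    return "55_plus"


def drop_rare_labels(rows, label="personality", min_count=25):
    """Drop rows with a missing label or a label seen fewer than min_count times."""
    counts = Counter(
        row[label] for row in rows if row.get(label) not in (None, "")
    )
    # Missing or empty labels are never counted, so they fall below the threshold.
    return [row for row in rows if counts.get(row.get(label), 0) >= min_count]


rows = (
    [{"age": 30, "personality": "saver"}] * 25
    + [{"age": 60, "personality": "explorer"}] * 3
    + [{"age": 41, "personality": None}]
)
kept = drop_rare_labels(rows)
frame = [{"age_band": age_band(r["age"]), "personality": r["personality"]} for r in kept]
print(len(frame), frame[0])
```

Counting labels over the whole collection before filtering matches the wording of step 3, which measures frequency across the entire collection rather than within a sample.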

Preflight checks

Before training, the seeder runs three checks against MongoDB:

  1. The source collection has at least one document (estimated_document_count > 0).
  2. For multiclass use-cases, there are ≥2 distinct values of the target field.
  3. For multiclass use-cases, the most frequent label accounts for less than 95% of the population (avoids the “predict-the-majority” trap).

If any check fails, the seeder skips the run and does not write a spurious SEED_MLRUN_USE_CASE activity row.
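The three checks can be expressed as a small pure function. This is a sketch under assumptions rather than the seeder's actual code: preflight and its parameters are hypothetical, and the document count and label values are passed in as plain inputs instead of being read from MongoDB.

```python
from collections import Counter


def preflight(doc_count, labels=None, multiclass=False, max_majority_share=0.95):
    """Return a list of failure reasons; an empty list means the run may proceed."""
    failures = []

    # Check 1: the source collection must not be empty.
    if doc_count <= 0:
        failures.append("source collection is empty")

    if multiclass:
        counts = Counter(labels or [])
        if len(counts) < 2:
            # Check 2: at least two distinct target values.
            failures.append("fewer than 2 distinct target values")
        else:
            # Check 3: the majority label must stay below the share threshold.
            top_count = counts.most_common(1)[0][1]
            if top_count / sum(counts.values()) >= max_majority_share:
                failures.append(
                    f"majority label share >= {max_majority_share:.0%}"
                )
    return failures


balanced = preflight(100, ["a"] * 50 + ["b"] * 50, multiclass=True)
skewed = preflight(100, ["a"] * 96 + ["b"] * 4, multiclass=True)
print(balanced, skewed)
```

A caller would skip the run whenever the returned list is non-empty, which matches the skip behaviour described above.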

MLRun console — generated feature pipeline
