Data Preparation

Both reference use-cases follow the same shape:


master.bank_transactions      ──┐                       ┌── master.bank_transactions_mlrun
                                ├──► feature pipeline ──┤
master.bank_customer          ──┘  (MongoDB aggregation) └── master.bank_customer_mlrun

The feature pipeline is a MongoDB aggregation persisted in ecosystem_meta.mlrun_feature_pipelines. The pipeline is generated by one of three paths:

AI pipeline assistant: the Generate tab in the console pairs a free-form prompt with a pipeline_kind selector and two action buttons (Draft only / Generate everything). The assistant samples the source collection, drafts the aggregation, infers the target column and problem type, and (optionally) runs every selected framework end-to-end. See Console Tour — Generate tab.
Manual editor: paste an aggregation directly into the Aggregation Pipeline JSON text area on the Data Source tab and click Preview to sample the first results.
Seeder (scripts/seed_mlrun_use_cases.py): uses a hand-tuned aggregation that derives the per-use-case target field on the fly, so the seed run is reproducible without any prior labelling.

AI assistant inputs

The Generate tab passes the following to the backend /api/v1/mlrun-runtime/generate-from-prompt endpoint:

Field	Effect
`prompt`	Free-form natural-language description of the goal.
`kind`	One of `auto`, `numeric`, `categorical`, `temporal`, `mixed`, `aggregates`, `type_coercion`. Selects the aggregation template the assistant uses.
`source.{database,collection}`	Inherited from the Data Source tab and used to sample the schema (default 100 documents) before drafting the pipeline.
`target.{database,collection}`	Where the engineered output gets written.
`project_id`, `configuration_id`	When set, the assistant updates the existing configuration in place.
`run_training` + `frameworks`	When `run_training=true`, the backend also dispatches one `POST /train` per selected framework after persisting the configuration.
`persist`	Always `true` from the Generate everything button; the Draft only button submits the same payload but never writes the configuration.
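
As a rough illustration, the fields in the table could be assembled into a request body as sketched below. Only the field names, the allowed `kind` values, and the endpoint path come from this page. The nested layout of `source` and `target`, the example prompt, and the helper name `build_generate_request` are assumptions. Sending `persist=False` for Draft only and deriving `run_training` from the framework list are also assumptions, since this page only states that Draft only never writes the configuration.

```python
import json

ENDPOINT = "/api/v1/mlrun-runtime/generate-from-prompt"  # path quoted on this page

ALLOWED_KINDS = {
    "auto", "numeric", "categorical", "temporal",
    "mixed", "aggregates", "type_coercion",
}


def build_generate_request(prompt, kind="auto", *, source, target,
                           frameworks=(), persist=False,
                           project_id=None, configuration_id=None):
    """Assemble a request body (hypothetical helper; the body layout is assumed)."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown kind: {kind!r}")
    body = {
        "prompt": prompt,
        "kind": kind,
        "source": dict(source),
        "target": dict(target),
        "persist": persist,
        # Assumption: training is requested whenever frameworks are selected.
        "run_training": bool(frameworks),
        "frameworks": list(frameworks),
    }
    # Only included when an existing configuration should be updated in place.
    if project_id is not None:
        body["project_id"] = project_id
    if configuration_id is not None:
        body["configuration_id"] = configuration_id
    return body


draft = build_generate_request(
    "Flag customers with risky spending patterns",  # made-up example prompt
    kind="aggregates",
    source={"database": "master", "collection": "bank_transactions"},
    target={"database": "master", "collection": "bank_transactions_mlrun"},
)
print(json.dumps(draft, indent=2))
```

The body would then be sent as JSON to `ENDPOINT` with whatever HTTP client and authentication the deployment uses; neither is specified here.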

The assistant returns:

The full MlrunFeatureEnrichmentRequest (loaded back into the form).
A schema_summary describing how many documents and fields were sampled.
An optional training_spec with the inferred target, problem type, feature columns, and confidence.
A training_results list (one entry per framework) when run_training=true.

Pipeline kinds

Kind	When to pick it
`auto`	Default — let the assistant decide based on the prompt and the sampled schema.
`numeric`	Pure numeric features; missing values are filled with 0.
`categorical`	String columns; populates H2O `enum_columns`.
`temporal`	Calendar / time-of-day boolean indicators.
`mixed`	Balanced blend of numeric + categorical + temporal.
`aggregates`	Per-entity rollups built with a MongoDB `$group` stage.
`type_coercion`	Use `$addFields` / `$switch` to clean mixed-type fields.
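
To make the `type_coercion` kind more concrete, the sketch below builds a single `$addFields` stage that normalises a mixed-type field to a double using `$switch`. The assistant's actual template is not shown on this page, so the helper name `coerce_to_double_stage` and the choice to map unparseable or missing values to null are assumptions.

```python
NUMERIC_BSON_TYPES = ["double", "int", "long", "decimal"]


def coerce_to_double_stage(field):
    """Return an $addFields stage that rewrites `field` as a double or null."""
    ref = f"${field}"
    return {
        "$addFields": {
            field: {
                "$switch": {
                    "branches": [
                        # Already numeric: widen to double.
                        {
                            "case": {"$in": [{"$type": ref}, NUMERIC_BSON_TYPES]},
                            "then": {"$toDouble": ref},
                        },
                        # Strings such as "12.50": parse, fall back to null on failure.
                        {
                            "case": {"$eq": [{"$type": ref}, "string"]},
                            "then": {
                                "$convert": {
                                    "input": ref,
                                    "to": "double",
                                    "onError": None,
                                    "onNull": None,
                                }
                            },
                        },
                    ],
                    # Missing, null, or any other BSON type becomes null.
                    "default": None,
                }
            }
        }
    }


stage = coerce_to_double_stage("amount")
```

The returned dictionary can be pasted into an aggregation as one stage, ahead of any stage that does arithmetic on the field.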

Spend Risk pipeline

The spend-risk aggregation:

Filters transactions to the last 90 days.
Buckets by merchant category and channel.
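Computes per-customer rollups: amount sum / mean / std, declined count, and the merchant categories seen.
Adds the derived is_high_risk boolean, which is true for customers with three or more declined transactions or any transaction in the gambling, crypto, or cash_advance categories.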
Writes engineered rows to master.bank_transactions_mlrun.


[
  { "$match": { "$expr": { "$gte": ["$transaction_date", { "$dateSubtract": { "startDate": "$$NOW", "unit": "day", "amount": 90 } }] } } },
  { "$group": {
      "_id": "$customer_id",
      "amount_sum": { "$sum": "$amount" },
      "amount_mean": { "$avg": "$amount" },
      "amount_std": { "$stdDevPop": "$amount" },
      "declined_count": { "$sum": { "$cond": [{ "$eq": ["$status", "DECLINED"] }, 1, 0] } },
      "high_risk_categories": { "$push": "$merchant_category" }
  }},
  { "$addFields": {
      "is_high_risk": {
          "$or": [
              { "$gte": ["$declined_count", 3] },
              { "$gt": [{ "$size": { "$setIntersection": ["$high_risk_categories", ["gambling", "crypto", "cash_advance"]] } }, 0] }
          ]
      }
  }},
  { "$out": "bank_transactions_mlrun" }
]
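
For readers who want to check the arithmetic, the `$group` and `$addFields` stages above can be mirrored in plain Python on a handful of in-memory rows. This is a minimal sketch, not the seeder's code: the function name `spend_risk_rollup` and the sample transactions are made up, and the 90-day filter is left out.

```python
from collections import defaultdict
from statistics import fmean, pstdev

HIGH_RISK_CATEGORIES = {"gambling", "crypto", "cash_advance"}


def spend_risk_rollup(transactions):
    """Mirror the $group + $addFields stages on a list of dicts."""
    by_customer = defaultdict(list)
    for txn in transactions:
        by_customer[txn["customer_id"]].append(txn)

    rows = []
    for customer_id, txns in by_customer.items():
        amounts = [t["amount"] for t in txns]
        declined = sum(1 for t in txns if t["status"] == "DECLINED")
        categories = {t["merchant_category"] for t in txns}
        rows.append({
            "_id": customer_id,
            "amount_sum": sum(amounts),
            "amount_mean": fmean(amounts),
            "amount_std": pstdev(amounts),  # population std, like $stdDevPop
            "declined_count": declined,
            # Same rule as the $or expression in the pipeline.
            "is_high_risk": declined >= 3 or bool(categories & HIGH_RISK_CATEGORIES),
        })
    return rows


# Made-up sample rows, for illustration only.
sample = [
    {"customer_id": "c1", "amount": 10.0, "status": "APPROVED", "merchant_category": "grocery"},
    {"customer_id": "c1", "amount": 30.0, "status": "APPROVED", "merchant_category": "fuel"},
    {"customer_id": "c2", "amount": 50.0, "status": "APPROVED", "merchant_category": "crypto"},
]
rows = {row["_id"]: row for row in spend_risk_rollup(sample)}
```

In this sample, customer `c1` is not flagged (no declines, no high-risk category), while `c2` is flagged because of the single `crypto` transaction.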

Customer Personality pipeline

The personality aggregation pulls demographic fields from bank_customer, normalises them, and prepares a multiclass training frame:

Selects demographic fields: age, gender, income_band, region, life_stage, marital_status, dependents, tenure_months.
Converts age to bands (<25, 25–34, 35–44, 45–54, 55+).
Drops rows with missing personality and rows whose personality value occurs fewer than 25 times across the entire collection.
Writes the training frame to master.bank_customer_mlrun.


[
  { "$match": { "personality": { "$nin": [null, ""] } } },
  { "$addFields": {
      "age_band": {
        "$switch": {
          "branches": [
            { "case": { "$lt": ["$age", 25] }, "then": "lt_25" },
            { "case": { "$lt": ["$age", 35] }, "then": "25_34" },
            { "case": { "$lt": ["$age", 45] }, "then": "35_44" },
            { "case": { "$lt": ["$age", 55] }, "then": "45_54" }
          ],
          "default": "55_plus"
        }
      }
  }},
  { "$project": {
      "customer_id": 1, "age_band": 1, "gender": 1, "income_band": 1,
      "region": 1, "life_stage": 1, "marital_status": 1,
      "dependents": 1, "tenure_months": 1, "personality": 1
  }},
  { "$out": "bank_customer_mlrun" }
]
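
The pipeline listed above only filters out missing labels; it does not show the rule that drops labels occurring fewer than 25 times. One way that rule could be expressed, assuming MongoDB 5.0 or later for `$setWindowFields`, is sketched below. The helper name `rare_label_filter_stages` and the temporary field `_label_count` are hypothetical, and the seeder may implement the rule differently.

```python
def rare_label_filter_stages(label_field="personality", min_count=25):
    """Stages that keep only rows whose label occurs at least `min_count` times."""
    return [
        # Attach the per-label document count to every row (MongoDB 5.0+).
        {"$setWindowFields": {
            "partitionBy": f"${label_field}",
            "output": {"_label_count": {"$count": {}}},
        }},
        # Keep only rows whose label is frequent enough.
        {"$match": {"_label_count": {"$gte": min_count}}},
        # Remove the helper field before the rows are written out.
        {"$unset": "_label_count"},
    ]


stages = rare_label_filter_stages()
```

These stages would sit directly after the initial `$match`, so that the counts are computed only over rows that have a label.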

Preflight checks

Before training, the seeder runs three checks against MongoDB:

The source collection has at least one document (estimated_document_count > 0).
For multiclass use-cases, there are ≥2 distinct values of the target field.
For multiclass use-cases, the most frequent label accounts for less than 95% of the population (avoids the “predict-the-majority” trap).

If any check fails, the seeder skips the run and does not write a spurious SEED_MLRUN_USE_CASE activity row.
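
The three checks can be sketched as a pure function over counts that would already have been fetched from MongoDB (for example, the document count from `estimated_document_count` and the per-label counts from a `$group` on the target field). The function name `preflight`, its signature, and the failure messages are assumptions for illustration, not the seeder's actual interface.

```python
def preflight(document_count, label_counts=None, *, multiclass=False,
              max_majority_share=0.95):
    """Return a list of failure reasons; an empty list means the run may proceed."""
    failures = []

    # Check 1: the source collection must not be empty.
    if document_count <= 0:
        failures.append("source collection is empty")

    if multiclass:
        counts = label_counts or {}
        total = sum(counts.values())
        # Check 2: at least two distinct target values.
        if len(counts) < 2:
            failures.append("fewer than 2 distinct target values")
        # Check 3: the majority label must stay below the share threshold.
        elif total > 0 and max(counts.values()) / total >= max_majority_share:
            failures.append(
                f"majority label covers >= {max_majority_share:.0%} of rows"
            )
    return failures


balanced = preflight(1000, {"saver": 600, "spender": 400}, multiclass=True)
skewed = preflight(1000, {"saver": 960, "spender": 40}, multiclass=True)
```

Here `balanced` is an empty list, so the run would proceed, while `skewed` contains one failure because the majority label covers 96% of the rows.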

[Figure: MLRun console, generated feature pipeline]