Data Preparation
Both reference use-cases follow the same shape:
master.bank_transactions ──┐                       ┌── master.bank_transactions_mlrun
                           ├──► feature pipeline ──┤
master.bank_customer     ──┘ (MongoDB aggregation) └── master.bank_customer_mlrun

The feature pipeline is a MongoDB aggregation persisted in
ecosystem_meta.mlrun_feature_pipelines. The pipeline is generated by
one of three paths:
- AI pipeline assistant: the Generate tab in the console pairs a free-form prompt with a pipeline_kind selector and two action buttons (Draft only / Generate everything). The assistant samples the source collection, drafts the aggregation, infers the target column / problem type, and (optionally) runs every selected framework end-to-end. See Console Tour: Generate tab.
- Manual editor: paste an aggregation directly into the Aggregation Pipeline JSON text area on the Data Source tab and click Preview to sample the first results.
- The seeder (scripts/seed_mlrun_use_cases.py): uses a hand-tuned aggregation that derives the per-use-case target field on the fly, so the seed run is reproducible without any prior labelling.

AI assistant inputs
The Generate tab passes the following to the backend
/api/v1/mlrun-runtime/generate-from-prompt endpoint:
| Field | Effect |
|---|---|
| prompt | Free-form natural-language description of the goal. |
| kind | One of auto, numeric, categorical, temporal, mixed, aggregates, type_coercion. Selects the aggregation template the assistant uses. |
| source.{database,collection} | Inherited from the Data Source tab and used to sample the schema (default 100 documents) before drafting the pipeline. |
| target.{database,collection} | Where the engineered output gets written. |
| project_id, configuration_id | When set, the assistant updates the existing configuration in place. |
| run_training + frameworks | When run_training=true, the backend also dispatches one POST /train per selected framework after persisting the configuration. |
| persist | Always true from the Generate everything button; the Draft only button submits the same payload but never writes the configuration. |

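As an illustration of how the fields in the table fit together, here is a minimal Python sketch that assembles a request body. The field names come from the table; the nesting of source and target as objects, the helper name build_generate_payload, the framework identifier "h2o", and the mapping of the Draft only button to persist=False are assumptions made for the example, not confirmed details of the endpoint.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

# Kind values as listed in the table above.
ALLOWED_KINDS = {
    "auto", "numeric", "categorical", "temporal",
    "mixed", "aggregates", "type_coercion",
}


def build_generate_payload(
    prompt: str,
    source: Tuple[str, str],
    target: Tuple[str, str],
    kind: str = "auto",
    frameworks: Optional[List[str]] = None,
    persist: bool = True,
) -> Dict[str, Any]:
    """Assemble a request body (hypothetical helper, assumed payload shape)."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown pipeline kind: {kind!r}")
    selected = list(frameworks or [])
    return {
        "prompt": prompt,
        "kind": kind,
        # Assumption: source/target are nested objects with these two keys.
        "source": {"database": source[0], "collection": source[1]},
        "target": {"database": target[0], "collection": target[1]},
        # Training is only requested when at least one framework is selected.
        "run_training": bool(selected),
        "frameworks": selected,
        "persist": persist,
    }


payload = build_generate_payload(
    prompt="Flag customers with risky spending in the last 90 days",
    source=("master", "bank_transactions"),
    target=("master", "bank_transactions_mlrun"),
    kind="aggregates",
    frameworks=["h2o"],  # assumed framework identifier
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary could then be sent as the JSON body of a POST to /api/v1/mlrun-runtime/generate-from-prompt with whatever HTTP client and authentication the deployment uses.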
The assistant returns:
- The full MlrunFeatureEnrichmentRequest (loaded back into the form).
- A schema_summary describing how many documents and fields were sampled.
- An optional training_spec with the inferred target, problem type, feature columns, and confidence.
- A training_results list (one entry per framework) when run_training=true.

Pipeline kinds
| Kind | When to pick it |
|---|---|
| auto | Default: let the assistant decide based on the prompt and the sampled schema. |
| numeric | Pure numeric features; clean fillna with 0. |
| categorical | String columns; populates H2O enum_columns. |
| temporal | Calendar / time-of-day boolean indicators. |
| mixed | Balanced blend of numeric + categorical + temporal. |
| aggregates | Per-entity rollups using a Mongo $group aggregation pipeline. |
| type_coercion | Use $addFields / $switch to clean mixed-type fields. |

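To make the type_coercion kind more concrete, the sketch below builds a single $addFields stage that uses $switch to normalise a mixed-type field to a number. It only illustrates the general technique: the helper name coerce_numeric_stage is hypothetical, and the stage the assistant actually generates may look different. The operators used ($type, $in, $convert) are standard MongoDB aggregation expressions.

```python
from typing import Any, Dict

# BSON type names returned by the $type aggregation operator for numeric values.
NUMERIC_BSON_TYPES = ["double", "int", "long", "decimal"]


def coerce_numeric_stage(field: str) -> Dict[str, Any]:
    """Build an $addFields stage that coerces `field` to a number or null.

    Hypothetical helper: numbers pass through, strings are parsed with
    $convert, and anything else (missing, objects, arrays) becomes null.
    """
    ref = f"${field}"
    return {
        "$addFields": {
            field: {
                "$switch": {
                    "branches": [
                        {
                            # Already numeric: keep the value unchanged.
                            "case": {"$in": [{"$type": ref}, NUMERIC_BSON_TYPES]},
                            "then": ref,
                        },
                        {
                            # String: try to parse, falling back to null.
                            "case": {"$eq": [{"$type": ref}, "string"]},
                            "then": {
                                "$convert": {
                                    "input": ref,
                                    "to": "double",
                                    "onError": None,
                                    "onNull": None,
                                }
                            },
                        },
                    ],
                    "default": None,
                }
            }
        }
    }


stage = coerce_numeric_stage("amount")
```

A stage like this would typically sit near the start of a pipeline, before any $group that sums or averages the field.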
Spend Risk pipeline
The spend-risk aggregation:
- Filters transactions to the last 90 days.
- Buckets by merchant category and channel.
- Computes per-customer rollups: amount sum / mean / std, declined-count, and high-risk flag.
- Adds the derived is_high_risk boolean.
- Writes engineered rows to master.bank_transactions_mlrun.

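For unit-testing the rollup and the risk rule outside MongoDB, the same logic can be mirrored in plain Python. The sketch below is not part of the seeder; the function names are hypothetical. The thresholds (three or more declined transactions, or any transaction in gambling, crypto, or cash_advance) are taken from the aggregation in this section, and statistics.pstdev is used because it matches the population standard deviation computed by $stdDevPop.

```python
from statistics import fmean, pstdev
from typing import Any, Dict, Iterable, List

HIGH_RISK_CATEGORIES = {"gambling", "crypto", "cash_advance"}
DECLINED_THRESHOLD = 3


def is_high_risk(declined_count: int, categories: Iterable[str]) -> bool:
    """Mirror of the $or rule: many declines or any high-risk category."""
    return (
        declined_count >= DECLINED_THRESHOLD
        or bool(HIGH_RISK_CATEGORIES & set(categories))
    )


def rollup(transactions: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute the per-customer rollup for one customer's transactions."""
    amounts = [t["amount"] for t in transactions]
    declined = sum(1 for t in transactions if t.get("status") == "DECLINED")
    categories = [t.get("merchant_category") for t in transactions]
    return {
        "amount_sum": sum(amounts),
        "amount_mean": fmean(amounts),
        "amount_std": pstdev(amounts),  # population std, like $stdDevPop
        "declined_count": declined,
        "is_high_risk": is_high_risk(declined, categories),
    }


# Illustrative data only, not taken from the real collection.
example = rollup([
    {"amount": 10, "status": "APPROVED", "merchant_category": "groceries"},
    {"amount": 20, "status": "DECLINED", "merchant_category": "crypto"},
    {"amount": 30, "status": "APPROVED", "merchant_category": "groceries"},
])
print(example)
```

In this example the customer is flagged because one transaction falls in a high-risk category, even though only one transaction was declined.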
[
{ "$match": { "$expr": { "$gte": ["$transaction_date", { "$dateSubtract": { "startDate": "$$NOW", "unit": "day", "amount": 90 } }] } } },
{ "$group": {
"_id": "$customer_id",
"amount_sum": { "$sum": "$amount" },
"amount_mean": { "$avg": "$amount" },
"amount_std": { "$stdDevPop": "$amount" },
"declined_count": { "$sum": { "$cond": [{ "$eq": ["$status", "DECLINED"] }, 1, 0] } },
"high_risk_categories": { "$push": "$merchant_category" }
}},
{ "$addFields": {
"is_high_risk": {
"$or": [
{ "$gte": ["$declined_count", 3] },
{ "$gt": [{ "$size": { "$setIntersection": ["$high_risk_categories", ["gambling", "crypto", "cash_advance"]] } }, 0] }
]
}
}},
{ "$out": "bank_transactions_mlrun" }
]

Customer Personality pipeline
The personality aggregation pulls demographic fields from
bank_customer, normalises them, and prepares a multiclass training
frame:
- Selects demographic fields: age, gender, income_band, region, life_stage, marital_status, dependents, tenure_months.
- Converts age to bands (<25, 25–34, 35–44, 45–54, 55+).
- Drops rows with missing personality and rows whose personality value occurs fewer than 25 times across the entire collection.
- Writes the training frame to master.bank_customer_mlrun.

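The rare-label rule in the steps above (dropping personality values that occur fewer than 25 times) needs a count per label before rows can be filtered. One possible way to express it, sketched here as an assumption rather than the seeder's actual method, is a $setWindowFields stage (MongoDB 5.0 or later) that attaches the label count to every row, followed by a $match. The helper name rare_label_filter_stages and the temporary field label_count are hypothetical.

```python
from typing import Any, Dict, List


def rare_label_filter_stages(
    label_field: str = "personality",
    min_count: int = 25,
) -> List[Dict[str, Any]]:
    """Build stages that drop rows whose label occurs fewer than min_count times."""
    return [
        {
            # Partition by label; with no explicit window, $sum covers the
            # whole partition, so every row gets its label's total count.
            "$setWindowFields": {
                "partitionBy": f"${label_field}",
                "output": {"label_count": {"$sum": 1}},
            }
        },
        # Keep only rows whose label is frequent enough.
        {"$match": {"label_count": {"$gte": min_count}}},
        # Remove the helper field so it does not leak into the training frame.
        {"$unset": "label_count"},
    ]


stages = rare_label_filter_stages()
```

These stages would be spliced in after the initial $match on personality, so that the counts are computed only over rows that already have a label.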
[
{ "$match": { "personality": { "$nin": [null, ""] } } },
{ "$addFields": {
"age_band": {
"$switch": {
"branches": [
{ "case": { "$lt": ["$age", 25] }, "then": "lt_25" },
{ "case": { "$lt": ["$age", 35] }, "then": "25_34" },
{ "case": { "$lt": ["$age", 45] }, "then": "35_44" },
{ "case": { "$lt": ["$age", 55] }, "then": "45_54" }
],
"default": "55_plus"
}
}
}},
{ "$project": {
"customer_id": 1, "age_band": 1, "gender": 1, "income_band": 1,
"region": 1, "life_stage": 1, "marital_status": 1,
"dependents": 1, "tenure_months": 1, "personality": 1
}},
{ "$out": "bank_customer_mlrun" }
]

Preflight checks
Before training, the seeder runs three checks against MongoDB:
- The source collection has at least one document (estimated_document_count > 0).
- For multiclass use-cases, there are ≥2 distinct values of the target field.
- For multiclass use-cases, the most frequent label is not ≥95% of the population (avoids the “predict-the-majority” trap).

If any check fails, the seeder skips the run and does not write a
spurious SEED_MLRUN_USE_CASE activity row.
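
The three checks can be sketched as a pure function that takes the document count and the target labels, which keeps the logic testable without a database. This is an illustration only: the function name preflight and its return shape are hypothetical, and the seeder's real implementation may differ. In practice the count could come from PyMongo's Collection.estimated_document_count() and the labels from a query on the target field.

```python
from collections import Counter
from typing import Iterable, List, Optional


def preflight(
    doc_count: int,
    labels: Optional[Iterable[str]] = None,
    max_majority_share: float = 0.95,
) -> List[str]:
    """Return the list of failed checks; an empty list means the run may proceed.

    Pass `labels` only for multiclass use-cases; leave it as None otherwise.
    """
    if doc_count <= 0:
        return ["source collection is empty"]

    failures: List[str] = []
    if labels is not None:
        counts = Counter(labels)
        if len(counts) < 2:
            failures.append("fewer than 2 distinct target values")
        else:
            top_count = counts.most_common(1)[0][1]
            share = top_count / sum(counts.values())
            if share >= max_majority_share:
                failures.append("majority label covers too much of the population")
    return failures


balanced = preflight(100, ["saver"] * 50 + ["spender"] * 50)
skewed = preflight(100, ["saver"] * 96 + ["spender"] * 4)
print(balanced, skewed)
```

Returning the reasons rather than a bare boolean makes it straightforward to log why a run was skipped.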
