Ecosystem Rewards

The Ecosystem Rewards algorithm implements a version of the Thompson Sampling algorithm for solving multi-armed bandit style problems.

Algorithm

The Ecosystem Rewards algorithm stores and updates a beta distribution for each offer in each segment in the system. When offers need to be scored for an entity in a given segment, the beta distributions for the offers in that segment are sampled and the samples are used as the scores for the offers. Segments can be specified using up to two discrete contextual variables, or segmentation can be done at the entity level, i.e. each entity has its own beta distribution for each offer.

The beta distributions are specified using an alpha and a beta parameter. When an offer is presented and accepted, the alpha parameter for the relevant distribution is incremented. When an offer is presented and not accepted, the beta parameter for the relevant distribution is incremented. The size of the increment in each case is a parameter that can be set when configuring the algorithm. The historical data used for this updating of alpha and beta can be windowed by time, by the number of events for each offer in a specific segment, or by both. It is strongly recommended that a windowing approach be configured, both for performance reasons and to prevent historical data from overwhelming changes in the current state of the system.

By default the algorithm will assume initial values of one for both alpha and beta for all of the distributions. This corresponds to a uniform distribution, which is a reasonable assumption when no data is available. However, it is possible to specify different initial values for alpha and beta for each offer in each segment. This can be useful when there is some prior knowledge about the performance of the offers in the system.
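
The sampling and update cycle described above can be illustrated with a short Python sketch. This is an illustration only: the offer names, reward sizes and random seed are hypothetical, and the actual implementation runs inside the ecosystem.Ai runtime.

import numpy as np

rng = np.random.default_rng(42)

# One beta distribution per offer in a single segment, starting from the
# default uniform prior alpha = beta = 1.
offers = {
    "offer_a": {"alpha": 1.0, "beta": 1.0},
    "offer_b": {"alpha": 1.0, "beta": 1.0},
}

success_reward = 0.5  # increment to alpha when an offer is accepted
fail_reward = 0.05    # increment to beta when an offer is not accepted

def score_offers():
    # Sample each offer's beta distribution; the samples are the scores.
    return {name: rng.beta(p["alpha"], p["beta"]) for name, p in offers.items()}

def update(offer, accepted):
    # Increment alpha on success and beta on failure by the configured rewards.
    if accepted:
        offers[offer]["alpha"] += success_reward
    else:
        offers[offer]["beta"] += fail_reward

scores = score_offers()
best = max(scores, key=scores.get)  # present the highest scoring offer
update(best, accepted=True)         # record the outcome of the interaction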

Parameters

  • Processing Window: Restricts the data used when the model updates to a time period going back from the present, specified in milliseconds.
  • Historical Count: Restricts the data used when the model updates based on a count of interactions. The count used is per offer and segment.
  • Decay Parameter: Used to treat repeated interactions from the same customer differently from one-off interactions from individual customers. The weight of each successive interaction from a customer is reduced by a factor of one over the decay parameter, i.e. the latest interaction has a weight of one, the interaction before that has a weight of one over the decay parameter, and the interaction before that has a weight of one over the decay parameter squared (see the sketch after this list).
  • Max interactions: Used to treat repeated interactions from the same customer differently from one-off interactions from individual customers. Restricts the number of interactions from an individual customer that will be used when updating the model; the latest interactions are used.
  • Success Reward: The size of the increment to the alpha parameter of the beta distributions used in the Thompson Sampling when an interaction is successful. This impacts the rate of convergence.
  • Fail Reward: The size of the increment to the beta parameter of the beta distributions used in the Thompson Sampling when an interaction is not successful. This impacts the rate of convergence.
  • Prior Success Reward: The size of the increment to the alpha parameter of the beta distributions used in the Thompson Sampling when an interaction is successful in the historical data. Used when historical data is used to train the algorithm before deployment.
  • Prior Fail Reward: The size of the increment to the beta parameter of the beta distributions used in the Thompson Sampling when an interaction is not successful in the historical data. Used when historical data is used to train the algorithm before deployment.
  • Test options across segments: If there are options that are configured to only be available for specific values of the contextual variables, electing to test options across segments will occasionally predict those options for contextual variable values where they are not available.
  • epsilon: The proportion of API calls that will be allocated to exploration. This exploration is done by sampling from a uniform distribution rather than the beta distribution when scoring.
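
The weighting implied by the Decay Parameter and Max Interactions parameters can be illustrated with a short calculation. The helper below is hypothetical and only reproduces the weighting described in the list above.

def interaction_weights(n_interactions, decay, max_interactions=None):
    # Most recent interaction first: weights 1, 1/decay, 1/decay**2, ...
    if max_interactions is not None:
        n_interactions = min(n_interactions, max_interactions)
    return [1.0 / (decay ** i) for i in range(n_interactions)]

print(interaction_weights(4, decay=2.0))                      # [1.0, 0.5, 0.25, 0.125]
print(interaction_weights(4, decay=2.0, max_interactions=2))  # [1.0, 0.5]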

Example

Below is an example configuration of the Ecosystem Rewards algorithm in Python.

from prediction.apis import deployment_management as dm
from prediction.apis import ecosystem_generation_engine as ge
from prediction.apis import data_management_engine as dme
from prediction.apis import online_learning_management as ol
from prediction.apis import prediction_engine as pe
from prediction.apis import worker_file_service as fs
from prediction import jwt_access
 
auth = jwt_access.Authenticate("http://localhost:3001/api", ecosystem_username, ecosystem_password)
 
deployment_id = "demo-deployment"
 
online_learning_uuid = ol.create_online_learning(
        auth,
        algorithm="ecosystem_rewards",
        name=deployment_id,
        description="Demo deployment for illustrating python configuration",
        feature_store_collection="set_up_features",
        feature_store_database="my_mongo_database",
        options_store_database="my_mongo_database",
        options_store_collection="demo-deployment_options",
        randomisation_success_reward=0.5,
        randomisation_fail_reward=0.05,
        randomisation_processing_count=200,
        randomisation_processing_window=604800000,
        contextual_variables_offer_key="offer",
        contextual_variables_contextual_variable_one_name="customer_segment",
        contextual_variables_contextual_variable_one_from_data_source=True,
        contextual_variables_contextual_variable_one_lookup="customer_segment",
        create_options_index=True,
        create_covering_index=True
)
 
online_learning = dm.define_deployment_multi_armed_bandit(epsilon=0, dynamic_interaction_uuid=online_learning_uuid)
 
parameter_access = dm.define_deployment_parameter_access(
    auth,
    lookup_key="customer_id",
    lookup_type="string",
    database="my_mongo_database",
    table_collection="customer_feature_store",
    datasource="mongodb"
)
 
deployment_step = dm.create_deployment(
    auth,
    project_id="demo-project",
    deployment_id=deployment_id,
    description="Demo project for illustrating python configuration",
    version="001",
    plugin_post_score_class="PlatformDynamicEngagement.java",
    plugin_pre_score_class="PreScoreDynamic.java",
    scoring_engine_path_dev="http://localhost:8091",
    mongo_connect=f"mongodb://{mongo_user}:{mongo_password}@localhost:54445/?authSource=admin",
    parameter_access=parameter_access,
    multi_armed_bandit=online_learning
)

Inspecting Beta distributions

To understand the behaviour of the Ecosystem Rewards algorithm, it is important to inspect the Beta distributions that are produced by the algorithm scoring. There are two ways to do this: using the simulation functionality, or by producing box and whisker plots from the information contained in the Options Store.

Use the simulation functionality when setting up your configuration. The simulation results will include plots of the Beta distributions, which can be used to understand the behaviour of the algorithm in the simulated environment.

Use the /postEcosystemRewardsBetaDistributionBoxPlots API on the ecosystem.Ai server to examine the distributions contained in the Options Store. The response to the API call will contain the parameters required to produce box and whisker plots of the Beta Distributions. The API is called with a json payload with the following structure:

{
    "options_store_collection": "demo_options",
    "options_store_database": "recommender",
    "contextual_variable_one": "segment_one_value",
    "contextual_variable_two": "segment_two_value",
    "outlier_threshold": 0.1,
    "show_low_data": "false"
}

options_store_collection and options_store_database specify the location of the Options Store in MongoDB. contextual_variable_one and contextual_variable_two specify the values of the contextual variables for which you want to inspect the beta distributions. outlier_threshold specifies the threshold for determining the max and min values for the box and whisker plots: the max is taken to be the value of \(X\) where the CDF is 1 - outlier_threshold and the min is taken to be the value of \(X\) where the CDF is outlier_threshold. show_low_data specifies whether to show the data for options without contacts or responses. The response from the API will be a json object with the following structure:

[
  {
    "q1": 0.20900000000000016,
    "q3": 0.2590000000000002,
    "min": 0.18800000000000014,
    "median": 0.23419203747072595,
    "max": 0.2830000000000002,
    "category": "Bulk Purchase Discount"
  },
  {
    "q1": 0.20900000000000016,
    "q3": 0.2590000000000002,
    "min": 0.18800000000000014,
    "median": 0.23419203747072595,
    "max": 0.2830000000000002,
    "category": "Bundle Purchase Discount"
  },
  {
    "q1": 0.23200000000000018,
    "q3": 0.2840000000000002,
    "min": 0.21100000000000016,
    "median": 0.25816249050873197,
    "max": 0.3080000000000002,
    "category": "Get 10% Off For The Next Hour"
  },
  {
    "q1": 0.004,
    "q3": 0.01900000000000001,
    "min": 0.002,
    "median": 0.013568521031207597,
    "max": 0.03200000000000002,
    "category": "Rewards Booster"
  },
  {
    "q1": 0.17100000000000012,
    "q3": 0.21900000000000017,
    "min": 0.1510000000000001,
    "median": 0.19559902200488996,
    "max": 0.2430000000000002,
    "category": "Save 10% By Paying Upfront"
  }
]
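
To see how these values relate to an option's alpha and beta parameters, the sketch below computes them from the quantile function of the Beta distribution. It assumes that min and max are the outlier_threshold and 1 - outlier_threshold quantiles, as described above, and that q1, median and q3 are the standard quartiles; the server-side calculation may differ in detail.

from scipy.stats import beta

def beta_box(alpha, beta_param, outlier_threshold=0.1):
    dist = beta(alpha, beta_param)
    return {
        "min": dist.ppf(outlier_threshold),      # CDF = outlier_threshold
        "q1": dist.ppf(0.25),
        "median": dist.ppf(0.5),
        "q3": dist.ppf(0.75),
        "max": dist.ppf(1 - outlier_threshold),  # CDF = 1 - outlier_threshold
    }

# For example, an option with alpha = 20 and beta = 65.
print(beta_box(20, 65))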

The same functionality can be used to produce box and whisker plots using the python package, as shown in the following truncated example.

from prediction.apis import data_munging_engine as mu
import matplotlib.pyplot as plt
 
boxes = mu.ecosystem_rewards_beta_box_plots(auth, options_store_collection, options_store_database, contextual_variable_one, contextual_variable_two)
 
fig, ax = plt.subplots()  
ax.bxp(boxes, showfliers=False)
ax.set_ylabel("PDF")
plt.xticks(rotation=90)
plt.show()

Quantifying exploration

Exploration and exploitation are not split explicitly in Thompson Sampling style algorithms. Instead, exploration occurs automatically when the beta distributions overlap. However, there are situations where it is useful to have a measure of the degree of exploration being performed by the algorithm. Here we outline two approaches for achieving this:

  1. Popular Offer Comparison: The basis of this approach is the assumption that if the top option selected by the algorithm is not the option with the highest take up rate, then the algorithm is exploring.
  2. Beta Distribution Overlap: In this approach we consider the score produced for each option and determine the probability that it is higher than the scores sampled from the set of options with higher take up rates than the option under consideration. The average of these probabilities will give a measure of the degree of exploration being performed by the algorithm.

It is important to note that while we can use these approaches to quantify the number of recommendations where exploration is occurring, the exploration is of a different character than when using \(\epsilon\)-based exploration. During an instance of \(\epsilon\)-based exploration the options will be ranked randomly. In contrast, when the Ecosystem Rewards algorithm explores it will still take into account the information it has about the system, making it much more likely to interchange similarly performing offers when exploring rather than ranking completely randomly. In this sense, an instance of \(\epsilon\)-based exploration is more extreme than an instance of Ecosystem Rewards based exploration.

Popular Offer Comparison

In the popular offer comparison approach we determine the proportion of recommendations made where the top option recommended is not the option with the highest take up rate, given the contextual variable values for the recommendation being made. There are two ways to do this:

  1. Using the logging data once recommendations have already been made.
  2. By performing the check in the post scoring logic while the recommendation is being made and logging the results. We will outline both of these approaches here.

Using the logging data

To use logging data for the popular offer comparison, we make use of the /postEcosystemRewardsExploration API on the ecosystem.Ai server. This API is called with a json payload with the following structure:

{
    "start_time": "2025-04-23T08:24:16"
    ,"end_time": "2025-04-25T08:24:17"
    ,"options_store_collection": "dynamic_set_up_feature_store_options"
    ,"options_store_database": "recommender_demos"
    ,"logging_collection": "ecosystemruntime"
    ,"logging_database": "logging"
    ,"predictor": "offer_recommend_dynamic"
    ,"sample_size": "1000"
    ,"check_indexes": "true"
}

start_time and end_time specify the time period over which the logs should be considered. The Options Store and logging parameters specify the location of the Options Store and contacts logging collections in MongoDB. predictor specifies the name of the case for which you want to quantify the exploration. check_indexes specifies whether the API should check for the required indexes on the Options Store and logging collections; if the indexes are not present, the API will create them. sample_size specifies the number of documents to retrieve from the logs, with the documents selected randomly. Set sample_size to 0 to retrieve all of the relevant logs.

The response from the API will be a json object with the following structure:

[
  {
    "number_of_explore_offers": 7954,
    "_id": "None",
    "number_of_offers": 13803,
    "exploration_ratio": 0.5762515395203941
  }
]

exploration_ratio is the proportion of recommendations made where the top option recommended is not the option with the highest take up rate, i.e. options where we assume that exploration is taking place.

In order to produce this result two MongoDB aggregation pipelines are run. The first determines the most popular offer for each combination of contextual variable values. This is done using the propensity field contained in the Options Store. The second pipeline then aggregates over the contact logs for the specified time period and determines when the first option with an arm_reward field in the response does not match the option with the highest propensity. Recommendations where no options are returned, i.e. final_result is empty, are ignored. The specific pipelines run and their results can be seen in the logs of the ecosystem.Ai server.
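
The shape of the first pipeline can be sketched in Python as follows. The contextual variable and propensity field names are taken from this document, but the option key field and the exact pipeline are assumptions; the authoritative pipelines are the ones shown in the server logs.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:54445/")
options_store = client["recommender_demos"]["dynamic_set_up_feature_store_options"]

# Illustrative pipeline: for each combination of contextual variable values,
# keep the option with the highest propensity.
pipeline = [
    {"$sort": {"propensity": -1}},
    {"$group": {
        "_id": {
            "contextual_variable_one": "$contextual_variable_one",
            "contextual_variable_two": "$contextual_variable_two",
        },
        # The name of the field holding the option identifier is assumed here;
        # check the documents in your Options Store.
        "most_popular_option": {"$first": "$optionKey"},
        "propensity": {"$first": "$propensity"},
    }},
]

for doc in options_store.aggregate(pipeline):
    print(doc)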

This functionality can also be accessed using the ecosystem_rewards_explore function in the python package.
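
A minimal sketch of such a call is shown below. The module and keyword arguments are assumed to mirror the API payload above; check the python package documentation for the exact signature.

from prediction.apis import data_munging_engine as mu

# Assumed signature: keyword arguments mirroring the API payload above.
exploration = mu.ecosystem_rewards_explore(
    auth,
    start_time="2025-04-23T08:24:16",
    end_time="2025-04-25T08:24:17",
    options_store_collection="dynamic_set_up_feature_store_options",
    options_store_database="recommender_demos",
    logging_collection="ecosystemruntime",
    logging_database="logging",
    predictor="offer_recommend_dynamic",
    sample_size=1000,
    check_indexes=True
)
print(exploration)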

While this is the simplest and most easily interpretable exploration quantification, there are two potential issues to be aware of with this approach:

  • If the sorting of the options is modified in the post scoring logic, such that the top option is not necessarily the option with the highest score from the Ecosystem Rewards algorithm, then the exploration ratio will be artificially inflated.
  • Only the top option in the logs is considered. This means that exploration could be underestimated if there is less data, or less separation of propensities, among the lower propensity options.

The first issue can be addressed by performing the check in the post scoring logic while the recommendation is being made.

🔥
Note

The offer popularity is determined using the propensity in the Options Store. This is impacted by the Processing Window and Historical Count parameters in the Ecosystem Rewards algorithm rather than the start_time and end_time specified in the API call.

Using the post scoring logic

The popular offer comparison can be performed in the post scoring logic by using the Options Store detail passed to the post score in the params object. This can be done using the getOptions function.

JSONArray options = getOptions(params);

options is a JSONArray of the scored options returned by the Ecosystem Rewards algorithm. Each scored option is represented using a JSONObject. The key fields in the JSONObject for this purpose are propensity and arm_reward. To determine if the recommendation should be marked as exploration, check whether the option with the highest arm_reward also has the highest propensity. The result can then be added to predictModelMojoResult so that it will be stored in the logs.

🔥
Note

Do not use the key explore when adding the exploration indicator to predictModelMojoResult as that is the default key used to indicate epsilon based exploration.

The contact logs with the exploration indicator can then be counted to determine the exploration rate of the Ecosystem Rewards algorithm.
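
As a sketch, assuming the indicator was added to predictModelMojoResult under a hypothetical key named explore_popular_offer, the rate could be computed with an aggregation along the following lines. The database, collection and field path will depend on your configuration and should be verified against your logging documents.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:54445/")
logs = client["logging"]["ecosystemruntime"]

pipeline = [
    {"$group": {
        "_id": None,
        "total": {"$sum": 1},
        # explore_popular_offer and its path in the logged document are
        # hypothetical; adjust to your own post scoring logic and logging.
        "explore": {"$sum": {"$cond": ["$predictModelMojoResult.explore_popular_offer", 1, 0]}},
    }}
]

result = next(logs.aggregate(pipeline))
print(result["explore"] / result["total"])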

While this approach addresses the issue of modified sorting in the post scoring logic, it still has the drawback of only considering the highest scored option returned by the Ecosystem Rewards algorithm.

Beta Distribution Overlap

In this approach we consider the score produced for each option and determine the probability that it is higher than the scores sampled from the set of options with higher take up rates than the option under consideration. The average of these probabilities will give a measure of the degree of exploration being performed by the algorithm.

To illustrate the approach followed, consider a single option in the contacts logging collection. Let \(r\) be the arm_reward for the option. For each option in the Options Store with a higher propensity than the option, \(\theta\), under consideration, we determine the probability \(P_{i}(X < r)\), i.e. the probability that the arm_reward generated for \(\theta\) is greater than the arm_reward generated for option \(i\), despite the fact that \(i\) has a higher propensity. To take into account the fact that we may not want to measure exploration between similarly performing offers, we replace \(r\) with \(r_t=r+t\) where \(t>0\). We combine the probabilities \(P_{i}(X < r_t)\), \(\forall i\), to determine the probability that \(\theta\) is an exploration option, \(P_{explore}\). This is done using the following formula:

\(P_{explore} = P_0(X < r_t) + P_0(X > r_t)[P_1(X < r_t) + P_1(X > r_t)[P_2(X < r_t) + ...]]\)

Options where either \(\alpha=\alpha_0\) or \(\beta=\beta_0\) are allocated a value of \(P_{explore} = 1\). \(P_{explore}\) is calculated for each option in the contacts logging collection and the exploration rate of the Ecosystem Rewards algorithm is taken to be the average of the \(P_{explore}\) values.
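
A minimal sketch of this calculation for a single logged option is shown below, using scipy for the Beta CDF. It implements the formula above for the options with a higher propensity than the option under consideration; the special case where alpha or beta still equal their initial values is not reproduced here.

from scipy.stats import beta

def p_explore(r, higher_propensity_params, t=0.001):
    # r is the arm_reward of the option under consideration and
    # higher_propensity_params is a list of (alpha, beta) pairs for the
    # options with a higher propensity.
    r_t = r + t
    p_all_above = 1.0
    for a, b in higher_propensity_params:
        # P_i(X > r_t) for the Beta(a, b) distribution of option i.
        p_all_above *= 1.0 - beta(a, b).cdf(r_t)
    # Expanding the nested formula gives 1 minus the product of the P_i(X > r_t).
    return 1.0 - p_all_above

print(p_explore(0.21, [(30, 90), (25, 100)], t=0.001))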

To access the result of this calculation, make use of the /postEcosystemRewardsExplorationApprox API on the ecosystem.Ai server. This API is called with a json payload with the following structure:

{
    "start_time": "2025-04-23T08:24:16"
    ,"end_time": "2025-04-25T08:24:17"
    ,"options_store_collection": "dynamic_set_up_feature_store_options"
    ,"options_store_database": "recommender_demos"
    ,"logging_collection": "ecosystemruntime"
    ,"logging_database": "logging"
    ,"predictor": "offer_recommend_dynamic"
    ,"check_indexes": "true"
    ,"sample_size": "1000"
    ,"score_filter": "1"
    ,"threshold": "0.001"
}

start_time and end_time specify the time period over which the logs should be considered. The Options Store and logging parameters specify the location of the Options Store and contacts logging collections in MongoDB. predictor specifies the name of the case for which you want to quantify the exploration. check_indexes specifies whether the API should check for the required indexes on the Options Store and logging collections; if the indexes are not present, the API will create them. sample_size specifies the number of documents to retrieve from the logs, with the documents selected randomly. Set sample_size to 0 to retrieve all of the relevant logs. score_filter will filter out offers with the given score, which can be useful if offers are being manually inserted in the post scoring logic; set it to false to include all scores. threshold sets the value of \(t\) in \(r_t=r+t\).

The response from the API will be a json object with the following structure:

{
    "exploration_probability": 0.27273470649093373
}

exploration_probability is the average of the \(P_{explore}\) values.

To view the exploration_probability split by contextual variable values, use the /getEcosystemRewardsExploreApproxContext API. The API is called with the same payload as the /postEcosystemRewardsExplorationApprox API. The response from the API will be a json object with the following structure:

[
  {
    "contextual_variable_two": "All",
    "contextual_variable_one": "All",
    "number_of_interactions": 1000,
    "exploration_probability": 0.07521690966398219
  },
  {
    "contextual_variable_two": "Diploma",
    "contextual_variable_one": "Intentional",
    "number_of_interactions": 25,
    "exploration_probability": 0.034429999360797686
  },
  {
    "contextual_variable_two": "Grade12",
    "contextual_variable_one": "Intentional",
    "number_of_interactions": 165,
    "exploration_probability": 0.1975850836543586
  }
]

This functionality can also be accessed using the ecosystem_rewards_explore_approx and ecosystem_rewards_explore_approx_context functions in the python package.
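
As with ecosystem_rewards_explore, the module and exact signatures below are assumptions and should be checked against the python package; the keyword arguments mirror the API payload above, and ecosystem_rewards_explore_approx_context is assumed to take the same arguments while returning the per contextual variable breakdown.

from prediction.apis import data_munging_engine as mu

overall = mu.ecosystem_rewards_explore_approx(
    auth,
    start_time="2025-04-23T08:24:16",
    end_time="2025-04-25T08:24:17",
    options_store_collection="dynamic_set_up_feature_store_options",
    options_store_database="recommender_demos",
    logging_collection="ecosystemruntime",
    logging_database="logging",
    predictor="offer_recommend_dynamic",
    sample_size=1000,
    score_filter=1,
    threshold=0.001
)
print(overall["exploration_probability"])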