Q-Learning

The Q-learning algorithm applies the Q-learning reinforcement learning framework, where the actions are the offers presented to clients and the states record whether the client accepted those offers. The algorithm works at the customer level: training is done per customer. Segment-level implementations are planned for future releases.

Algorithm

The Q-learning algorithm represents offers as the set of actions \(\mathcal{A}\), while the set of states \(\mathcal{S}\) represents the take-up of offers. The algorithm then uses a Q-table to score and rank offers by their Q value, which expresses the expected utility of the offer.
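
As a minimal sketch of the ranking step, assume the Q-table is held as a mapping from (state, action) pairs to Q values. The layout, the offer names and the values below are hypothetical and are not taken from the ecosystem.Ai runtime; the sketch only illustrates ordering offers by Q value for the current state.

```python
from typing import Dict, List, Tuple

# Hypothetical Q-table layout: (state, action) -> Q value.
QTable = Dict[Tuple[int, str], float]


def rank_offers(q_table: QTable, state: int, actions: List[str]) -> List[Tuple[str, float]]:
    """Return (offer, Q) pairs for `state`, highest Q first.

    Offers without an entry yet are treated as Q = 0.0 (an assumption).
    """
    scored = [(action, q_table.get((state, action), 0.0)) for action in actions]
    # sorted() is stable, so tied offers keep the order given in `actions`.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only.
q_table = {(1, "offer_a"): 2.5, (1, "offer_b"): 4.0, (0, "offer_a"): -1.0}
ranking = rank_offers(q_table, state=1, actions=["offer_a", "offer_b", "offer_c"])
print(ranking)
```

For state 1 this places offer_b first, then offer_a, then the unseen offer_c with the default value.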

There are two approaches for generating the Q-table. Both use a reward function, which calculates the reward for a given action.

The first approach uses the data stored in the ecosystem.Ai runtime logs to generate the Q-table. In this case, each state is the combination of the last two offers presented to the client and the last offer taken up by the client. The Q-table is updated for each step in the logged data by calculating the Q value for the action that was taken, using the reward function and the Bellman equation.

In the second approach, the Q-table is updated using a policy function, which calculates the next action. The Q-table is calculated for each customer as a recommendation is made. The first row of the table is generated by selecting an initial state and action at random, calculating the reward using the reward function and then calculating Q using \(Q=\alpha (R + \gamma R_{max})\), where \(\alpha\) is the learning rate, \(R\) is the reward returned by the reward function, \(\gamma\) is the discount factor and \(R_{max}\) is the maximum reward. The Q-table is then updated using the Q value and the reward. The subsequent rows of the Q-table are generated by selecting an action using the policy function, calculating the reward using the reward function and then calculating Q using the full Bellman equation. After each step, the Q-table is again updated using the Q value and the reward.
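
The two Q calculations in the policy-based approach can be sketched as follows. The first-row formula is the one given in the text. The text does not spell out the full Bellman equation, so the sketch assumes the standard tabular Q-learning update \(Q(s,a) \leftarrow Q(s,a) + \alpha (R + \gamma \max_{a'} Q(s',a') - Q(s,a))\). The function names and the Q-table layout are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical Q-table layout: (state, action) -> Q value.
QTable = Dict[Tuple[int, str], float]


def first_row_q(reward: float, alpha: float, gamma: float, r_max: float) -> float:
    """First-row value from the text: Q = alpha * (R + gamma * R_max)."""
    return alpha * (reward + gamma * r_max)


def bellman_update(
    q_table: QTable,
    state: int,
    action: str,
    reward: float,
    next_state: int,
    actions: List[str],
    alpha: float,
    gamma: float,
) -> float:
    """Standard tabular Q-learning update (assumed form of the full Bellman equation)."""
    current = q_table.get((state, action), 0.0)
    # Best Q value reachable from the next state; unseen entries count as 0.0.
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    updated = current + alpha * (reward + gamma * best_next - current)
    q_table[(state, action)] = updated
    return updated


alpha, gamma, r_max = 0.25, 0.75, 10.0
actions = ["offer_a", "offer_b"]
q_table: QTable = {}

# First row: 0.25 * (10 + 0.75 * 10) = 4.375
q_table[(0, "offer_a")] = first_row_q(reward=10.0, alpha=alpha, gamma=gamma, r_max=r_max)

# Subsequent row: 0 + 0.25 * (8 + 0.75 * 4.375 - 0) = 2.8203125
q_next = bellman_update(q_table, 0, "offer_b", 8.0, 0, actions, alpha, gamma)
print(q_table)
```

Under this assumed update, an empty Q-table entry with the next-state maximum replaced by \(R_{max}\) reduces to the first-row formula, which is one reason the standard form is a plausible reading.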

The policy function selects the next action based on the current state of the Q-table. The algorithm uses an \(\epsilon\)-greedy policy: the action with the highest Q value is selected with probability \(1-\epsilon\), and a random action is selected with probability \(\epsilon\).
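
A minimal sketch of an \(\epsilon\)-greedy selection is shown below. The Q-table layout, function name and offer names are hypothetical; the runtime's own implementation may differ, for example in how ties are broken.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical Q-table layout: (state, action) -> Q value.
QTable = Dict[Tuple[int, str], float]


def epsilon_greedy(
    q_table: QTable,
    state: int,
    actions: List[str],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a random action with probability epsilon, else the highest-Q action."""
    if rng is None:
        rng = random.Random()
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    # Exploit: unseen (state, action) pairs are treated as Q = 0.0 (an assumption).
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))


q_table = {(0, "offer_a"): 1.0, (0, "offer_b"): 3.0}
rng = random.Random(42)
picks = [
    epsilon_greedy(q_table, 0, ["offer_a", "offer_b"], epsilon=0.2, rng=rng)
    for _ in range(1000)
]
share_best = picks.count("offer_b") / len(picks)
print(share_best)
```

With two offers and \(\epsilon = 0.2\), the best offer is expected about 90% of the time: 80% from greedy picks plus half of the 20% random picks.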

The reward function is a Java plugin that calculates the reward for a given action. The reward function should be calibrated for the specific use case. Below is an illustrative reward function that uses copcar values for offers and feature store information to calculate the reward. This Java class should be located in the src/main/java/com/ecosystem/algorithm/qlearn directory.

package com.ecosystem.algorithm.qlearn;
 
import com.ecosystem.plugin.PluginLoaderSuper;
import com.ecosystem.utils.log.LogManager;
import com.ecosystem.utils.log.Logger;
import org.json.JSONObject;
 
import java.util.List;
import java.util.ArrayList;
import java.util.Arrays;
 
public class QLearnRewardPlugin extends PluginLoaderSuper {
 
    private static final Logger LOGGER = LogManager.getLogger(QLearnRewardPlugin.class.getName());
 
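    /**
     * Illustrative reward function for the Q-learning algorithm.
     *
     * @param actions list of available actions (offers)
     * @param action the action to calculate the reward for
     * @param state the current state; overwritten below from previous_purchase
     * @param offers values for the offers, index-aligned with actions; parsed as the cost of the action
     * @param category not used in this example
     * @param params runtime parameters, including preloadCorpora and featuresObj
     * @param log_action_state not used in this example
     * @param training_data_source not used in this example
     * @return the reward for the given action
     */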
    public double reward(List<String> actions, String action, int state, List<String> offers, int category, JSONObject params, JSONObject log_action_state, String training_data_source) {
 
        double reward = 0.0;
 
        try {
            // Copcar values read from the feature store, and their median.
            List<Double> copcar_list = new ArrayList<>();
            double copcar_median = 0;

            if (params.has("preloadCorpora")) {
                if (params.getJSONObject("preloadCorpora").has("copcar_lookup")) {
                    JSONObject copcar_lookup  = params.getJSONObject("preloadCorpora").getJSONObject("copcar_lookup");
                    JSONObject copcar_fields = copcar_lookup.getJSONObject("copcar");
                    // The "value" field holds a comma-separated list of feature store keys.
                    String copcar_fields_string = String.valueOf(copcar_fields.get("value"));
                    ArrayList<String> copcarList = new ArrayList<>(Arrays.asList(copcar_fields_string.split(",")));
                    for (String copcar_key : copcarList) {
                        double copcar_value = Double.parseDouble(String.valueOf(params.getJSONObject("featuresObj").get(copcar_key)));
                        copcar_list.add(copcar_value);
                    }
                }
            }

            // The median is only meaningful on sorted values.
            copcar_list.sort(Double::compare);
            int list_size = copcar_list.size();
            if (list_size % 2 == 1) {
                copcar_median = copcar_list.get(list_size / 2);
            } else if (list_size > 0) {
                // Even count: average the two middle values.
                copcar_median = (copcar_list.get((list_size / 2) - 1) + copcar_list.get(list_size / 2)) / 2.0;
            }
 
            double previous_purchase = 0;
            if (params.getJSONObject("featuresObj").has("previous_purchase"))
                previous_purchase = Double.valueOf(String.valueOf(params.getJSONObject("featuresObj").get("previous_purchase")));
            else
                LOGGER.info("previous_purchase not found in featuresObj");
 
            if (previous_purchase > 0) {
                state = 1;
            } else {
                state = 0;
            }
 
            double cost = Double.parseDouble(offers.get(actions.indexOf(action)));
 
            if (cost > copcar_median && state == 1) {
                reward = 10;
            } else if (cost == copcar_median && state == 1) {
                reward = 8;
            } else if (cost < copcar_median && state == 1) {
                reward = 0;
            } else if (cost > copcar_median && state == 0) {
                reward = -10;
            } else if (cost == copcar_median && state == 0) {
                reward = -10;
            } else if (cost < copcar_median && state == 0) {
                reward = -10;
            } else {
                reward = 0;
            }
 
        } catch (Exception e) {
            e.printStackTrace();
            LOGGER.error(e.getMessage());
        }
 
        return reward;
 
    }
}
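
When calibrating the reward for a specific use case, it can help to check the reward tiers outside the runtime first. The sketch below mirrors the tiers of the illustrative plugin in Python. It is not part of the ecosystem.Ai package: the function name is hypothetical, and it uses Python's `statistics.median` for the median of the copcar values.

```python
from statistics import median
from typing import List


def reward_tier(cost: float, copcar_values: List[float], previous_purchase: float) -> float:
    """Mirror of the illustrative reward tiers (hypothetical helper)."""
    # An empty copcar list falls back to a median of 0, as in the plugin.
    copcar_median = median(copcar_values) if copcar_values else 0.0
    # State 1 means the customer has a previous purchase, state 0 means none.
    state = 1 if previous_purchase > 0 else 0

    if state == 0:
        return -10.0  # no previous purchase: penalised at every cost level
    if cost > copcar_median:
        return 10.0
    if cost == copcar_median:
        return 8.0
    return 0.0


copcar_values = [10.0, 40.0, 20.0]  # illustrative values, median is 20.0
for cost in (30.0, 20.0, 5.0):
    print(cost, reward_tier(cost, copcar_values, previous_purchase=1.0))
```

Tabulating the tiers this way makes it easy to see that, in the illustrative plugin, customers without a previous purchase receive the same reward regardless of offer cost.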

Parameters

  • Processing Window: Restricts the data used when the model updates to a time period going back from the present by a specified number of milliseconds.
  • Historical Count: Restricts the data used when the model updates based on a count of interactions. This is an overall interaction count rather than a count per offer and segment as used in the Ecosystem Rewards algorithm.
  • Learning Rate: The learning rate \(\alpha\) used in the Bellman equation when calculating Q.
  • Discount Factor: The discount factor \(\gamma\) used in the Bellman equation when calculating Q.
  • Maximum Reward: The maximum reward \(R_{max}\) used in the Bellman equation when calculating Q.
  • Random Action: The value for \(\epsilon\) used in the \(\epsilon\)-greedy policy function.
  • training_data_source: (only supported through the Python package) The source of the training data for the online learning. The options are:
    • feature_store: Use the fields in the feature store for the training. These fields will need to be specified in the rewards plugin class. This approach will use the policy based method for creating the Q-table.
    • logging: Use data stored in the ecosystem.Ai runtime logs for the training. This approach will use the logging data based approach for creating the Q-table.
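
To make the effect of Processing Window and Historical Count concrete, the sketch below filters a list of logged interactions. It is an illustration only: the `timestamp_ms` field and the function name are hypothetical, and it assumes that both restrictions apply together and that the count keeps the most recent interactions, which the parameter descriptions do not state.

```python
from typing import Dict, List

DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in one day


def select_training_interactions(
    interactions: List[Dict],
    now_ms: int,
    processing_window_ms: int,
    historical_count: int,
) -> List[Dict]:
    """Keep interactions inside the window, then at most the newest `historical_count`."""
    cutoff_ms = now_ms - processing_window_ms
    recent = [i for i in interactions if i["timestamp_ms"] >= cutoff_ms]
    recent.sort(key=lambda i: i["timestamp_ms"])
    if historical_count <= 0:
        return []
    # Overall count across all offers, not a count per offer.
    return recent[-historical_count:]


now_ms = 10 * DAY_MS
interactions = [{"offer": "offer_a", "timestamp_ms": d * DAY_MS} for d in (1, 4, 8, 9)]
selected = select_training_interactions(interactions, now_ms, 7 * DAY_MS, 2)
print([i["timestamp_ms"] // DAY_MS for i in selected])
```

Here `7 * DAY_MS` equals 604800000, so a processing window of 604800000 milliseconds corresponds to seven days.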

Example

Below is an example configuration of the Q-learning algorithm in Python:

from prediction.apis import deployment_management as dm
from prediction.apis import ecosystem_generation_engine as ge
from prediction.apis import data_management_engine as dme
from prediction.apis import online_learning_management as ol
from prediction.apis import prediction_engine as pe
from prediction.apis import worker_file_service as fs
from prediction import jwt_access
 
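# ecosystem_username, ecosystem_password, mongo_user and mongo_password are assumed to be defined before this point.
auth = jwt_access.Authenticate("http://localhost:3001/api", ecosystem_username, ecosystem_password)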
 
deployment_id = "demo-deployment"
 
online_learning_uuid = ol.create_online_learning(
        auth,
        algorithm="q_learning",
        name=deployment_id,
        description="Demo deployment for illustrating python configuration",
        feature_store_collection="set_up_features",
        feature_store_database="my_mongo_database",
        options_store_database="my_mongo_database",
        options_store_collection="demo-deployment_options",
        randomisation_processing_count=500,
        randomisation_processing_window=604800000,
        randomisation_discount_factor=0.75,
        randomisation_max_reward=10,
        randomisation_random_action=0.2,
        randomisation_learning_rate=0.25,
        randomisation_training_data_source="feature_store",
        contextual_variables_offer_key="offer"
)
 
online_learning = dm.define_deployment_multi_armed_bandit(epsilon=0, dynamic_interaction_uuid=online_learning_uuid)
 
parameter_access = dm.define_deployment_parameter_access(
    auth,
    lookup_key="customer_id",
    lookup_type="string",
    database="my_mongo_database",
    table_collection="customer_feature_store",
    datasource="mongodb",
    defaults=["feature_one", "feature_two", "feature_three"]
)
 
deployment_step = dm.create_deployment(
    auth,
    project_id="demo-project",
    deployment_id=deployment_id,
    description="Demo project for illustrating python configuration",
    version="001",
    plugin_post_score_class="PlatformDynamicEngagement.java",
    plugin_pre_score_class="PreScoreDynamic.java",
    scoring_engine_path_dev="http://localhost:8091",
    mongo_connect=f"mongodb://{mongo_user}:{mongo_password}@localhost:54445/?authSource=admin",
    parameter_access=parameter_access,
    multi_armed_bandit=online_learning
)