(Note: for those interested, the Jupyter notebook with all the code for this post, plus additional outputs and explanations, is available for download on my Decision Sciences GitHub repository)
Introduction:
Reinforcement learning (RL) is making a resurgence of sorts, especially in the context of Agentic AI that leverages large language models (LLMs). This trend is fueling interest in how RL can serve as the “decision-making engine” behind an LLM’s language capabilities, producing AI “agents” capable of more autonomous and adaptive behavior. As Agentic AI continues to evolve, it’s crucial to understand the fundamentals of RL—how agents learn from trial and error, what terms like “rewards” and “policies” mean, and why this approach can be so powerful.
Reinforcement learning (RL) can sometimes feel like a “black box”—especially if you’re coming from a more traditional machine learning background. Unlike traditional ML, however, which often uses labeled data to predict outcomes, RL focuses on sequential decision-making in dynamic “environments”—allowing an AI agent to efficiently test different scenarios and discover optimal solutions. This is especially relevant in Decision Sciences, where businesses seek decision-driven methods to maximize profit or ROI through repeated experimentation.
Core Idea: we want an AI agent to learn the optimal amount of marketing spend (a single variable) to maximize net profit. By the end of this post, you’ll have a clear picture of how RL differs from traditional machine learning, why it’s relevant in scenarios like marketing optimization, and what it might look like before we explore more advanced use cases—such as Agentic AI with LLMs—in future posts.
A Simple RL Scenario: Foundations for Agentic AI:
Our simple spend optimization example demonstrates how reinforcement learning (RL) discovers the best marketing spend through trial and error—exactly the mechanism that underpins Agentic AI when paired with large language models. By having an agent repeatedly test actions (increasing or decreasing spend) and observe the resulting net profit, we see the fundamental RL loop: exploration (trying new spend levels) and exploitation (sticking to high-profit ones). Although it’s a small-scale illustration, this same process scales to more complex problems or integrates with an LLM’s reasoning. In other words, the core RL principle—learning from sequential interactions and rewards—remains central to building Agentic AI that can autonomously discover strategies in real-world settings.
What is Reinforcement Learning (RL)?
In RL, an “agent” interacts with an environment by taking actions and observing the resulting reward and new state. Over many interactions (episodes), the agent gradually learns which actions lead to higher rewards. This process is driven by a policy, which maps states to actions. The RL algorithm updates this policy based on the feedback (i.e. rewards) it receives, aiming to maximize the cumulative reward over time.
Concretely, the agent starts with little or no knowledge about which actions are best. By trying different actions and observing their outcomes, it explores the environment. As it collects experience, it exploits what it has learned by choosing actions that yielded higher rewards in the past. This exploration vs. exploitation balance is central to RL: the agent must try new actions to discover better strategies while also using known successful actions to gain higher rewards.
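The exploration vs. exploitation idea is often illustrated with an epsilon-greedy rule. This is a generic sketch for intuition only (the PPO agent used later in this post balances exploration differently, through a stochastic policy): with probability epsilon the agent tries a random action, otherwise it exploits its current best estimate.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon (explore),
    otherwise pick the highest-valued known action (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: any action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# Toy illustration: estimated values for three actions
q = [1.0, 5.0, 2.0]
# With epsilon=0 the agent always exploits action 1 (the highest estimate)
assert epsilon_greedy(q, epsilon=0.0) == 1
```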
In traditional supervised learning, you typically label each data point with a “correct answer.” But in reinforcement learning, you’re using historical data (or a simulated environment derived from it) to let the agent experiment with different actions—effectively learning the best strategy over time rather than reading it off the data directly.
RL is Interventional:
Although standard RL is a type of "testing" method, it is important to note that out-of-the-box RL doesn't guarantee a formal causal model (such as pure "causal inference" using randomized controlled trials); agents typically learn by trial and error rather than by dissecting why an action works. However, RL is interventional: the agent actively takes actions and observes outcomes, which goes beyond simple correlation.
In practice, this means that while standard RL may not formally prove “cause and effect,” it does let you systematically test decisions (like changing spend) and see if they consistently drive higher returns. Thus, you can gain strong evidence that certain choices cause better results in your specific environment, even if it’s not a fully rigorous causal analysis (note: there is a growing research area often referred to as “causal reinforcement learning” that aims to integrate causal models into RL).
Mini Use Case: Marketing Spend Optimization:
Imagine you’re a marketing team with 10 customer records (e.g., 10 different accounts), and you’re trying to figure out the optimal spend on marketing that will maximize your return. The return in this case is the net profit generated from each customer based on different spend levels. You use a reinforcement learning model to test different strategies for marketing spend (the agent will choose the spend at each step). The objective is to maximize the return (reward) over time.
The model simulates a very simple marketing campaign where we optimize the spend across different marketing actions. By having an agent adjust the spend, it tries to maximize the total reward over multiple steps (episodes).
This toy environment I’ve set up uses a simple parabolic revenue function with known properties so we can know, upfront, what the optimal spending level is (more on this below), but it still highlights the core ideas behind RL.
Parabolic Function:

Revenue = -(s - 5)^2 + 25
- Peak at spend (s) = 5, maximum revenue of 25.
- If spend (s) drifts away from 5, revenue drops quickly.
For our simple example, we assume cost is directly proportional to the marketing spend (i.e., cost = spend x cost_factor), so it grows linearly: each additional dollar spent on ads is charged at a constant rate. For example, if the cost factor is $1 per unit of spend, then spending $5 costs $5, spending $10 costs $10, and so on. This isn't always the case in the real world. If every $1 of direct ad spend results in $1.50 in total expenses, then your cost factor would be 1.5, meaning that for every dollar allocated to the campaign, the actual cost grows at 1.5 times your spend. This could happen, for example, if you spend $2,000 on ad placements but also owe a 50% overhead in creative, editing, and platform fees, making the total cost $3,000.
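Putting the two pieces together, the reward our agent will see is just the parabolic revenue minus the linear cost. A quick sanity check in code (a standalone helper for illustration, not the environment class used later in this post):

```python
def net_profit(spend, cost_factor=1.0):
    """Net profit under the post's parabolic revenue model."""
    revenue = -(spend - 5) ** 2 + 25   # revenue peaks at spend = 5
    cost = spend * cost_factor          # cost grows linearly with spend
    return revenue - cost

print(net_profit(5))                    # 20.0 at the revenue peak
print(net_profit(4))                    # also 20.0: same profit, lower spend
print(net_profit(5, cost_factor=1.5))   # 17.5 when costs carry 50% overhead
```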
Although the revenue function peaks at 25 when spend = 5, you’ll notice that net profit — the value our RL agent is optimizing for — is also $20 at spend = 4 (24 in revenue minus 4 in cost). This means from a pure net-profit standpoint, both 4 and 5 yield the same maximum of $20. In our simple model, the agent might therefore toggle between 4 and 5 without penalty, since the environment doesn’t differentiate one over the other.
If we wanted to prefer the lower cost for the same profit (i.e., pick 4 instead of 5), we could modify the environment’s reward function—for instance, by adding a slight penalty for higher spend levels. But for this simple example, treating them as equally optimal keeps the scenario straightforward and focuses on whether the agent converges on peak net profit rather than cost minimization. It is interesting to observe the behaviour of the agent during training either way.
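If we did want to break the tie in favor of the lower spend, one hypothetical tweak (not implemented in this post's environment) is to subtract a small per-unit spend penalty from the reward:

```python
def penalized_reward(spend, cost_factor=1.0, penalty=0.01):
    """Hypothetical reward variant: a small per-unit spend penalty makes
    spend=4 strictly better than spend=5 when their net profits are equal."""
    revenue = -(spend - 5) ** 2 + 25
    return revenue - spend * cost_factor - penalty * spend

# With the penalty, 4 is now strictly preferred over 5
assert penalized_reward(4) > penalized_reward(5)
```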
Sample Data Set:
We also generated a simple 10-row dataset to help us validate the RL agent’s decisions step by step. By limiting ourselves to just 10 records—each corresponding to a small, clear snapshot of possible spending scenarios—we can easily compare the agent’s chosen actions and resulting net profits against our known parabolic revenue function. This way, we can follow along more intuitively, ensuring the agent’s performance aligns with our expectations (e.g., converging near the optimal spend of $4 or $5).
Below is the simple 10-row customer dataset with the fields for historical Spend, Revenue, Cost and Net Profit:
```
{'spend': 0, 'revenue': 0, 'cost': 0.0, 'net_profit': 0.0}
{'spend': 1, 'revenue': 9, 'cost': 1.0, 'net_profit': 8.0}
{'spend': 2, 'revenue': 16, 'cost': 2.0, 'net_profit': 14.0}
{'spend': 3, 'revenue': 21, 'cost': 3.0, 'net_profit': 18.0}
{'spend': 4, 'revenue': 24, 'cost': 4.0, 'net_profit': 20.0}
{'spend': 5, 'revenue': 25, 'cost': 5.0, 'net_profit': 20.0}
{'spend': 6, 'revenue': 24, 'cost': 6.0, 'net_profit': 18.0}
{'spend': 7, 'revenue': 21, 'cost': 7.0, 'net_profit': 14.0}
{'spend': 8, 'revenue': 16, 'cost': 8.0, 'net_profit': 8.0}
{'spend': 9, 'revenue': 9, 'cost': 9.0, 'net_profit': 0.0}
{'spend': 10, 'revenue': 0, 'cost': 10.0, 'net_profit': -10.0}
```
And here is a quick example record from the 10-row dataset that helps illustrate how we’re tying the parabolic revenue function to a specific spend level:
For spend = 3 (customer record #4):
- Revenue = -(3 - 5)^2 + 25 = -(-2)^2 + 25 = -4 + 25 = 21
- Cost = 3 x 1.0 (assuming a cost factor of 1.0) = 3
- Net Profit = 21 - 3 = 18
Having a row like this in the dataset means we can explicitly see that if the agent picks a spend of 3, the resulting net profit should be 18. When the RL agent consistently discovers or selects higher net-profit spend levels (like 4 or 5), we can confirm it’s learning to move toward the peak of the parabolic function. This kind of record-by-record check makes the RL model’s behavior transparent—we can see exactly where each spend stands in terms of profit.
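For reference, the full validation table (spend levels 0 through 10) can be regenerated with a few lines of code. `build_dataset` is an illustrative helper name, not necessarily the one used in the notebook:

```python
def build_dataset(max_spend=10, cost_factor=1.0):
    """Recreate the validation dataset: one row per spend level 0..max_spend."""
    rows = []
    for spend in range(max_spend + 1):
        revenue = -(spend - 5) ** 2 + 25
        cost = spend * cost_factor
        rows.append({
            "spend": spend,
            "revenue": revenue,
            "cost": cost,
            "net_profit": revenue - cost,
        })
    return rows

dataset = build_dataset()
# Matches the worked example above for customer record #4 (spend = 3)
assert dataset[3] == {"spend": 3, "revenue": 21, "cost": 3.0, "net_profit": 18.0}
```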
Expected Outcome:
Before diving into the code, let’s outline what should happen if our RL algorithm works correctly. As explained, we’ve deliberately used a parabolic revenue function where both spend = $4 and spend = $5 yield the maximum net profit of $20. If the agent truly learns the environment’s dynamics, we expect it to converge on an action strategy that hovers around these spend levels. We’ll see this reflected in the outputs and logs once the model is trained.
Therefore, in this simple environment, the optimal policy the agent converges on is essentially:
- Increase spend if it’s below $4 (or eventually $5),
- Decrease spend if it’s above $5,
- Do nothing (keep spend the same) when it’s already at $4 or $5.
The agent’s best move in any given state is to steer the spend level toward one of these peak net-profit points. In a real RL implementation, you won’t necessarily see this written as a simple “if/then” in the code—but if you analyze the trained agent’s behavior, you’ll find it consistently picks actions that nudge the spend toward $4 or $5 (or keep it there once it arrives). If you wanted to prefer $4 over $5 (e.g., to minimize total spend for the same net profit), you could modify the environment’s reward function accordingly, but in this toy example they’re treated as equally optimal.
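For clarity, here is that implied policy written out as the explicit "if/then" rule the trained agent approximates. The action codes (0 = keep, 1 = increase, 2 = decrease) match the environment defined in the next section:

```python
def optimal_action(spend):
    """Reference policy described above, treating spend levels 4 and 5
    as equally optimal peak net-profit points."""
    if spend < 4:
        return 1  # increase: move up toward the peak
    if spend > 5:
        return 2  # decrease: move down toward the peak
    return 0      # keep: already at a peak net-profit level

assert optimal_action(2) == 1
assert optimal_action(8) == 2
assert optimal_action(4) == 0
```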
RL Environment Setup:
OpenAI Gym is a widely used library in reinforcement learning (RL) that provides a standard interface for different “environments.” Think of the environment in RL as the board and rules of a game, while your RL agent is the player. By defining how the game state looks, what moves are possible, and how the board responds, the agent can systematically test different actions and learn which ones lead to success.
- What moves the player can make: this corresponds to the action space in RL, telling the agent which choices are valid.
- What the current situation on the board looks like: this represents the state—the information the agent observes at any given time.
- How the board reacts to each move: this is the reward (did we gain or lose?) and the new state (where we end up after the move).
In RL, the agent (player) repeatedly tries moves and sees the results. Having a consistent environment (the board game) means the agent can learn the best moves over time. Without such a structured setup, the agent wouldn’t know how to systematically test its decisions and figure out which ones lead to higher rewards.
Our Simple Environment for Spend Optimization:
In our spend optimization scenario, we define:
- Spend: the agent’s main decision variable (e.g., how many dollars to invest).
- Revenue: a parabolic function that peaks at spend = 5. Spending less or more leads to lower returns.
- Cost: spend multiplied by a cost factor.
- Reward: net profit (revenue minus cost).
The key constraints in this optimization scenario are as follows:
- Spend Range: spend is capped between 0 and max_spend=10, so the agent can’t allocate a negative budget or exceed $10.
- Discrete Actions: at each step, the agent can only increase spend by 1, decrease it by 1, or keep it the same.
- Fixed Cost Factor: cost scales linearly at a rate of 1.0 per spend unit (though you could vary it).
- Limited Episode Length: we typically run each episode for a maximum of 10 steps, after which it ends.
These constraints keep the model simple and ensure the agent’s spend decisions stay within a controlled, discrete set of possibilities, making it easier to illustrate basic RL concepts.
Below is the code for a custom Gym environment, SimpleMarketingEnv, that implements this logic:
```python
import gym
import numpy as np
from gym import spaces

class SimpleMarketingEnv(gym.Env):
    def __init__(self, max_spend=10, cost_factor=1.0, max_steps=10):
        super().__init__()
        self.max_spend = max_spend
        self.cost_factor = cost_factor
        self.max_steps = max_steps
        self.observation_space = spaces.Box(
            low=np.array([0], dtype=np.float32),
            high=np.array([float(max_spend)], dtype=np.float32),
            shape=(1,),
            dtype=np.float32
        )
        self.action_space = spaces.Discrete(3)
        self.state = None
        self.current_step = 0

    # Randomize the initial spend in reset so the agent trains on all states:
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        # Start at a random spend from 0..max_spend
        initial_spend = np.random.randint(0, self.max_spend + 1)
        self.state = np.array([float(initial_spend)], dtype=np.float32)
        return self.state, {}

    def step(self, action):
        spend = self.state[0]
        # Update spend based on action
        if action == 1:    # +1
            spend += 1
        elif action == 2:  # -1
            spend -= 1
        spend = float(np.clip(spend, 0, self.max_spend))
        # Simple parabola: revenue = -(s - 5)^2 + 25
        revenue = -(spend - 5)**2 + 25
        cost = spend * self.cost_factor
        reward = revenue - cost
        self.state = np.array([spend], dtype=np.float32)
        self.current_step += 1
        done = False
        truncated = (self.current_step >= self.max_steps)
        return self.state, reward, done, truncated, {}

    def render(self):
        print(f"Step {self.current_step}, Spend={self.state[0]}")
```
Key Points:
- State: a single float in an array (the current spend).
- Actions: a discrete set of three moves (0 = keep spend the same, 1 = increase by 1, 2 = decrease by 1).
- Reward: net profit = revenue - cost.
- Episode: ends after max_steps steps or if we define another stopping condition.
Training with PPO:
We use Proximal Policy Optimization (PPO) from stable-baselines3 to train an agent on this environment. PPO is a policy gradient method for reinforcement learning that balances exploration (trying new actions) and exploitation (using the best-known actions). Developed by OpenAI, PPO periodically updates the policy in a way that keeps each update close to the old policy—hence “proximal.” This prevents overly large, destabilizing policy shifts, which can happen in naive policy gradient approaches. As a result, PPO tends to be more stable and simpler to tune than many other RL algorithms, making it a popular choice for a wide range of tasks.
```python
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Function to train the agent
def train_simple_ppo(total_timesteps=10_000):
    env = SimpleMarketingEnv(max_spend=10, cost_factor=1.0, max_steps=10)
    model = PPO("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=total_timesteps)
    return model

if __name__ == "__main__":
    # Train the agent
    model = train_simple_ppo(total_timesteps=10_000)

    # Evaluate the trained agent
    test_env = SimpleMarketingEnv(max_spend=10, cost_factor=1.0, max_steps=10)
    mean_reward, std_reward = evaluate_policy(model, test_env, n_eval_episodes=5, deterministic=True)
    print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
```
What’s Happening?
- `model = PPO("MlpPolicy", env, verbose=1)`: creates a PPO model with a neural network policy.
- `model.learn(total_timesteps=10000)`: the agent interacts with the environment for 10,000 steps (across multiple episodes) to find a good policy.
- `evaluate_policy(...)`: we test the final policy on a fresh environment to see the average reward.
Learning Process During Model Training:
The output below is the first 60 lines only (the first 6 episodes) of the detailed training log of the agent’s interactions with the environment during the training phase. It’s showing how the agent makes decisions step by step during its learning process, along with the corresponding rewards and updates to the spend. This RL model was trained using 10,000 timesteps.
```
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
[Training Step 1] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 2] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 3] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 4] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 5] old_spend=5.0, action=1, reward=18.00, new_spend=6.0
[Training Step 6] old_spend=6.0, action=2, reward=20.00, new_spend=5.0
[Training Step 7] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 8] old_spend=5.0, action=1, reward=18.00, new_spend=6.0
[Training Step 9] old_spend=6.0, action=1, reward=14.00, new_spend=7.0
[Training Step 10] old_spend=7.0, action=1, reward=8.00, new_spend=8.0
[Training Step 1] old_spend=5.0, action=1, reward=18.00, new_spend=6.0
[Training Step 2] old_spend=6.0, action=1, reward=14.00, new_spend=7.0
[Training Step 3] old_spend=7.0, action=2, reward=18.00, new_spend=6.0
[Training Step 4] old_spend=6.0, action=0, reward=18.00, new_spend=6.0
[Training Step 5] old_spend=6.0, action=2, reward=20.00, new_spend=5.0
[Training Step 6] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 7] old_spend=4.0, action=2, reward=18.00, new_spend=3.0
[Training Step 8] old_spend=3.0, action=1, reward=20.00, new_spend=4.0
[Training Step 9] old_spend=4.0, action=2, reward=18.00, new_spend=3.0
[Training Step 10] old_spend=3.0, action=0, reward=18.00, new_spend=3.0
[Training Step 1] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 2] old_spend=4.0, action=1, reward=20.00, new_spend=5.0
[Training Step 3] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 4] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 5] old_spend=4.0, action=2, reward=18.00, new_spend=3.0
[Training Step 6] old_spend=3.0, action=0, reward=18.00, new_spend=3.0
[Training Step 7] old_spend=3.0, action=0, reward=18.00, new_spend=3.0
[Training Step 8] old_spend=3.0, action=0, reward=18.00, new_spend=3.0
[Training Step 9] old_spend=3.0, action=0, reward=18.00, new_spend=3.0
[Training Step 10] old_spend=3.0, action=2, reward=14.00, new_spend=2.0
[Training Step 1] old_spend=2.0, action=1, reward=18.00, new_spend=3.0
[Training Step 2] old_spend=3.0, action=1, reward=20.00, new_spend=4.0
[Training Step 3] old_spend=4.0, action=2, reward=18.00, new_spend=3.0
[Training Step 4] old_spend=3.0, action=1, reward=20.00, new_spend=4.0
[Training Step 5] old_spend=4.0, action=1, reward=20.00, new_spend=5.0
[Training Step 6] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 7] old_spend=4.0, action=1, reward=20.00, new_spend=5.0
[Training Step 8] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 9] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 10] old_spend=4.0, action=1, reward=20.00, new_spend=5.0
[Training Step 1] old_spend=7.0, action=1, reward=8.00, new_spend=8.0
[Training Step 2] old_spend=8.0, action=0, reward=8.00, new_spend=8.0
[Training Step 3] old_spend=8.0, action=1, reward=0.00, new_spend=9.0
[Training Step 4] old_spend=9.0, action=1, reward=-10.00, new_spend=10.0
[Training Step 5] old_spend=10.0, action=2, reward=0.00, new_spend=9.0
[Training Step 6] old_spend=9.0, action=2, reward=8.00, new_spend=8.0
[Training Step 7] old_spend=8.0, action=0, reward=8.00, new_spend=8.0
[Training Step 8] old_spend=8.0, action=1, reward=0.00, new_spend=9.0
[Training Step 9] old_spend=9.0, action=0, reward=0.00, new_spend=9.0
[Training Step 10] old_spend=9.0, action=1, reward=-10.00, new_spend=10.0
[Training Step 1] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 2] old_spend=5.0, action=2, reward=20.00, new_spend=4.0
[Training Step 3] old_spend=4.0, action=2, reward=18.00, new_spend=3.0
[Training Step 4] old_spend=3.0, action=2, reward=14.00, new_spend=2.0
[Training Step 5] old_spend=2.0, action=1, reward=18.00, new_spend=3.0
[Training Step 6] old_spend=3.0, action=1, reward=20.00, new_spend=4.0
[Training Step 7] old_spend=4.0, action=0, reward=20.00, new_spend=4.0
[Training Step 8] old_spend=4.0, action=1, reward=20.00, new_spend=5.0
[Training Step 9] old_spend=5.0, action=0, reward=20.00, new_spend=5.0
[Training Step 10] old_spend=5.0, action=1, reward=18.00, new_spend=6.0
[Training Step 1] old_spend=10.0, action=2, reward=0.00, new_spend=9.0
```
Key Elements in This Output:
- Old Spend: the amount of budget (or "spend") before the agent makes a decision.
- Action: the action taken by the agent (0 = keep, 1 = increase, 2 = decrease).
- Reward: the reward received after the action is taken (based on the environment's feedback to the action).
- New Spend: the amount of budget (or "spend") after the action has been executed.
Plot RL Agent “Action” Distribution Over Time:
Below is an example showing how to train a PPO agent in SimpleMarketingEnv, then log the actions with a custom loop, and finally plot the action distribution over time using a custom function. This approach gives us a clear view of which actions the agent takes over the course of model training, while ensuring that all actions are indeed being evaluated during the training process.
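A sketch of such a logging loop and plotting helper is shown here. The function names and plotting choices are mine (not necessarily those in the notebook), and `model`/`env` are assumed to come from the training code above:

```python
import numpy as np

def log_actions(model, env, n_episodes=50):
    """Roll out the policy and record every action taken
    (a stand-in for the custom logging loop described above)."""
    actions = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = truncated = False
        while not (done or truncated):
            action, _ = model.predict(obs, deterministic=False)
            actions.append(int(action))
            obs, reward, done, truncated, _ = env.step(action)
    return actions

def plot_action_distribution(actions):
    """Bar chart of how often each action (0=keep, 1=increase, 2=decrease) was chosen."""
    # Imported here so the logging helper above works even without matplotlib installed
    import matplotlib.pyplot as plt
    labels, counts = np.unique(actions, return_counts=True)
    plt.bar([str(a) for a in labels], counts)
    plt.xlabel("Action (0=keep, 1=increase, 2=decrease)")
    plt.ylabel("Count")
    plt.title("RL Agent Action Distribution")
    plt.show()
```

Usage would be `plot_action_distribution(log_actions(model, SimpleMarketingEnv()))` after training.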

Evaluate RL Model vs. Random Policy:
A quick way to see if our RL agent truly learns something better than random guessing is to compare the mean reward of:
- Random Policy: sample random actions from the environment’s action space.
- RL Policy: our trained agent.
We can also calculate lift as the percentage improvement of the RL policy over the random policy.
```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

def evaluate_random_policy(env, n_episodes=5):
    all_rewards = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done = False
        truncated = False
        episode_reward = 0.0
        while not (done or truncated):
            # Random action
            action = env.action_space.sample()
            obs, reward, done, truncated, info = env.step(action)
            episode_reward += reward
        all_rewards.append(episode_reward)
    return np.mean(all_rewards), np.std(all_rewards)

def calculate_lift(rl_mean, random_mean):
    """
    Calculate percentage lift of RL policy over random baseline.
    Lift (%) = ((RL - Random) / Random) * 100
    """
    if random_mean == 0:
        # If the random mean is zero (edge case), we handle it separately
        return np.inf if rl_mean > 0 else 0
    return ((rl_mean - random_mean) / random_mean) * 100.0

# Evaluate both policies on a fresh environment
eval_env = SimpleMarketingEnv(max_spend=10, cost_factor=1.0, max_steps=10)
random_mean_reward, random_std_reward = evaluate_random_policy(eval_env, n_episodes=5)
rl_mean_reward, rl_std_reward = evaluate_policy(model, eval_env, n_eval_episodes=5, deterministic=True)

print(f"Random Policy --> Mean reward: {random_mean_reward:.2f} +/- {random_std_reward:.2f}")
print(f"RL Policy --> Mean reward: {rl_mean_reward:.2f} +/- {rl_std_reward:.2f}")
if rl_mean_reward > random_mean_reward:
    print("Your RL policy outperforms the random baseline!")

lift = calculate_lift(rl_mean_reward, random_mean_reward)
print(f"Lift of RL over Random: {lift:.2f}%")
```
The output looks like this:
```
Random Policy --> Mean reward: 124.80 +/- 55.29
RL Policy --> Mean reward: 189.60 +/- 8.14
Your RL policy outperforms the random baseline!
Lift of RL over Random: 51.92%
```
Predictions (Single Step):
We can now apply our trained RL policy on the 10-record customer dataset by treating each record’s “spend” as the current state, letting the RL agent pick an action, and then interpreting that action as the recommended change in spend. Essentially, for each record:
- Set the environment’s state to the record’s current spend level.
- Ask the agent (model) for its action.
- Interpret that action (e.g., “increase by 1,” “decrease by 1,” or “stay”) as the recommendation.
Below is a simple code snippet that demonstrates how to do this:
```python
import numpy as np

def apply_policy_to_dataset(model, dataset):
    """
    For each row in the dataset, treat the 'spend' column as the current state,
    and let the RL model pick an action.
    Returns a list of (spend, action, recommended_spend, reward) for each row.
    """
    results = []
    for row in dataset:
        # 1. Assume 'row' has a 'spend' value.
        #    We'll create a one-step environment or manually set the state.
        spend = row["spend"]

        # 2. Set up a dummy environment or just treat the spend as an observation
        obs = np.array([spend], dtype=np.float32)

        # 3. Model predicts an action
        action, _states = model.predict(obs, deterministic=True)

        # 4. Interpret the action:
        #    0 => stay, 1 => increase, 2 => decrease
        if action == 1:
            new_spend = min(spend + 1, 10)  # or your max spend
        elif action == 2:
            new_spend = max(spend - 1, 0)   # or your min spend
        else:
            new_spend = spend

        # 5. Calculate the reward from the environment's formula
        revenue = -(new_spend - 5)**2 + 25
        cost = new_spend * 1.0  # cost_factor=1.0
        reward = revenue - cost

        results.append({
            "current_spend": spend,
            "action": action,
            "recommended_spend": new_spend,
            "reward_if_applied": reward
        })
    return results

model_results = apply_policy_to_dataset(model, dataset)
for row in model_results:
    print(row)
```
The output looks like this:
```
{'current_spend': 0, 'action': array(1, dtype=int64), 'recommended_spend': 1, 'reward_if_applied': 8.0}
{'current_spend': 1, 'action': array(1, dtype=int64), 'recommended_spend': 2, 'reward_if_applied': 14.0}
{'current_spend': 2, 'action': array(1, dtype=int64), 'recommended_spend': 3, 'reward_if_applied': 18.0}
{'current_spend': 3, 'action': array(1, dtype=int64), 'recommended_spend': 4, 'reward_if_applied': 20.0}
{'current_spend': 4, 'action': array(1, dtype=int64), 'recommended_spend': 5, 'reward_if_applied': 20.0}
{'current_spend': 5, 'action': array(2, dtype=int64), 'recommended_spend': 4, 'reward_if_applied': 20.0}
{'current_spend': 6, 'action': array(2, dtype=int64), 'recommended_spend': 5, 'reward_if_applied': 20.0}
{'current_spend': 7, 'action': array(2, dtype=int64), 'recommended_spend': 6, 'reward_if_applied': 18.0}
{'current_spend': 8, 'action': array(2, dtype=int64), 'recommended_spend': 7, 'reward_if_applied': 14.0}
{'current_spend': 9, 'action': array(2, dtype=int64), 'recommended_spend': 8, 'reward_if_applied': 8.0}
{'current_spend': 10, 'action': array(2, dtype=int64), 'recommended_spend': 9, 'reward_if_applied': 0.0}
```
- Step 1: we read each record’s spend level.
- Step 2: we create a simple observation (a 1D array with the spend).
- Step 3: we call model.predict(obs) to get the action the RL policy recommends.
- Step 4: we interpret that action (0 = stay, 1 = increase, 2 = decrease) and compute the resulting spend.
- Step 5: we optionally calculate the new reward if that recommendation were applied—just to confirm how profitable the agent’s choice is.
We end up with a list of rows showing, for each initial spend in your 10-record dataset, what action the agent recommends, the new spend, and the reward if you followed that advice. This effectively anchors the RL policy to each record, letting you see how the agent would adjust spend in every scenario.
Line-by-Line Interpretation:
Here’s a line-by-line interpretation of your results, keeping in mind that each row represents a single-step decision rather than a multi-step run. The agent is effectively asked, “What’s your best immediate move from this spend level?” and it picks an action that leads to a new spend and a single-step reward.
1) Current Spend = 0
```
{'current_spend': 0, 'action': array(1, dtype=int64), 'recommended_spend': 1, 'reward_if_applied': 8.0}
```
- Action = 1 → “Increase spend by 1.”
- New spend = 1 → Net profit $8 (revenue=9, cost=1).
- Interpretation: From spend=0, the agent sees an immediate improvement by moving to spend=1.
2) Current Spend = 1
```
{'current_spend': 1, 'action': array(1, dtype=int64), 'recommended_spend': 2, 'reward_if_applied': 14.0}
```
- Action = 1 → “Increase spend by 1.”
- New spend = 2 → Net profit $14 (revenue=16, cost=2).
- Interpretation: It continues to raise spend, as 2 yields higher immediate net profit than 1.
3) Current Spend = 2
```
{'current_spend': 2, 'action': array(1, dtype=int64), 'recommended_spend': 3, 'reward_if_applied': 18.0}
```
- Action = 1 → “Increase spend by 1.”
- New spend = 3 → Net profit $18 (revenue=21, cost=3).
- Interpretation: Spend=3 is even better for a single step, so it chooses “increase” again.
4) Current Spend = 3
```
{'current_spend': 3, 'action': array(1, dtype=int64), 'recommended_spend': 4, 'reward_if_applied': 20.0}
```
- Action = 1 → “Increase spend by 1.”
- New spend = 4 → Net profit $20 (revenue=24, cost=4).
- Interpretation: Spend=4 yields an immediate net profit of 20, which is better than 18 at spend=3.
5) Current Spend = 4
```
{'current_spend': 4, 'action': array(1, dtype=int64), 'recommended_spend': 5, 'reward_if_applied': 20.0}
```
- Action = 1 → “Increase spend by 1.”
- New spend = 5 → Net profit $20 (revenue=25, cost=5).
- Interpretation: Spend=5 also produces $20 net profit, so moving from 4 to 5 is just as good (for one step).
6) Current Spend = 5
```
{'current_spend': 5, 'action': array(2, dtype=int64), 'recommended_spend': 4, 'reward_if_applied': 20.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 4 → Net profit $20 (revenue=24, cost=4).
- Interpretation: At spend=5, net profit is $20. At spend=4, it’s also $20. The agent sees no difference for a single step, so it picks “decrease.” That’s still consistent with an immediate net profit of $20.
7) Current Spend = 6
```
{'current_spend': 6, 'action': array(2, dtype=int64), 'recommended_spend': 5, 'reward_if_applied': 20.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 5 → Net profit $20 (revenue=25, cost=5).
- Interpretation: Going from 6 to 5 improves net profit from $18 to $20.
8) Current Spend = 7
```
{'current_spend': 7, 'action': array(2, dtype=int64), 'recommended_spend': 6, 'reward_if_applied': 18.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 6 → Net profit $18 (revenue=24, cost=6).
- Interpretation: For a single step, 6 yields a better net profit ($18) than 7 does ($14), so the agent chooses to decrease spend.
9) Current Spend = 8
```
{'current_spend': 8, 'action': array(2, dtype=int64), 'recommended_spend': 7, 'reward_if_applied': 14.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 7 → Net profit $14 (revenue=21, cost=7).
- Interpretation: Dropping from 8 to 7 is an immediate improvement from $8 net profit to $14.
10) Current Spend = 9
```
{'current_spend': 9, 'action': array(2, dtype=int64), 'recommended_spend': 8, 'reward_if_applied': 8.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 8 → Net profit $8 (revenue=16, cost=8).
- Interpretation: That’s better than staying at 9, which would yield $0 net profit.
11) Current Spend = 10
```
{'current_spend': 10, 'action': array(2, dtype=int64), 'recommended_spend': 9, 'reward_if_applied': 0.0}
```
- Action = 2 → “Decrease spend by 1.”
- New spend = 9 → Net profit $0 (revenue=9, cost=9).
- Interpretation: Spend=10 has a net profit of -$10, so dropping to 9 yields $0 net profit for one step, an immediate improvement.
Overall Interpretation
- Spends < 5: The agent increases spend, inching closer to the peak net profit region.
- Spend = 5: The agent sees net profit = $20 but might also be okay with 4 (also $20).
- Spends > 5: The agent decreases spend to move back toward 5, improving immediate net profit.
These results align with single-step logic in a parabolic environment: for each immediate scenario, the agent picks the action that maximizes net profit right now.
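Before moving on, it may help to see just how mechanical this single-step logic is. The sketch below is my own inference, not the post's environment code: a revenue curve of spend × (10 − spend) with cost equal to spend matches every dollar figure quoted above, and a greedy comparison of the two actions reproduces each choice in the table.

```python
# A minimal sketch of the environment's economics, inferred from the
# dollar figures above (an assumption, not the post's actual code):
# revenue = spend * (10 - spend) and cost = spend reproduce every
# revenue/cost pair quoted in the walkthrough.
def net_profit(spend: int) -> int:
    return spend * (10 - spend) - spend

# Single-step greedy choice between action 1 ("increase spend by 1")
# and action 2 ("decrease spend by 1"), mirroring the table above.
def greedy_action(spend: int) -> str:
    up = net_profit(min(spend + 1, 10))
    down = net_profit(max(spend - 1, 0))
    return "increase" if up > down else "decrease"

for s in [3, 5, 7]:
    print(s, greedy_action(s), net_profit(s))
# prints: 3 increase 18 / 5 decrease 20 / 7 decrease 14
```

Note the tie between spends 4 and 5 (both $20 net profit), which is exactly what makes the agent's choice between them arbitrary.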
Prediction (Multi-Step Horizon):
Below is a line-by-line summary of the multi-step results for each initial spend in the customer dataset, followed by a brief explanation of each record's overall outcome. These runs show the agent acting repeatedly to fully optimize each customer record (i.e., not just taking the single next step, as we did above), continuing until it either reaches the maximum number of steps or stabilizes around its chosen strategy.
Starting from Spend=0
{'step': 0, 'spend_before': 0.0, 'action': 1, 'spend_after': 1.0, 'reward': 8.0}
{'step': 1, 'spend_before': 1.0, 'action': 1, 'spend_after': 2.0, 'reward': 14.0}
{'step': 2, 'spend_before': 2.0, 'action': 1, 'spend_after': 3.0, 'reward': 18.0}
{'step': 3, 'spend_before': 3.0, 'action': 1, 'spend_after': 4.0, 'reward': 20.0}
{'step': 4, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 5, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 6, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 7, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 8, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 9, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
Total Reward: 180.0
- Interpretation: the agent initially increases spend each step (0 → 1 → 2 → 3 → 4 → 5) to boost net profit from $8 up to $20. Once it hits 4 or 5, it oscillates between those two spends because both yield a net profit of $20 in a single step. Over 10 steps, total reward is $180, reflecting mostly net profit of $20 each step after step 3.
Starting from Spend=1
{'step': 0, 'spend_before': 1.0, 'action': 1, 'spend_after': 2.0, 'reward': 14.0}
{'step': 1, 'spend_before': 2.0, 'action': 1, 'spend_after': 3.0, 'reward': 18.0}
{'step': 2, 'spend_before': 3.0, 'action': 1, 'spend_after': 4.0, 'reward': 20.0}
{'step': 3, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 4, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 5, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 6, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 7, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 8, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 9, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
Total Reward: 192.0
- Interpretation: from spend=1, the agent increases step by step until it hits 4 or 5. Then it bounces between 4 and 5, each yielding $20 net profit. The total reward is $192, slightly higher than the previous scenario because it reached the $20 region sooner.
Starting from Spend=2
{'step': 0, 'spend_before': 2.0, 'action': 1, 'spend_after': 3.0, 'reward': 18.0}
{'step': 1, 'spend_before': 3.0, 'action': 1, 'spend_after': 4.0, 'reward': 20.0}
{'step': 2, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 3, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 4, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 5, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 6, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 7, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 8, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 9, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
Total Reward: 198.0
- Interpretation: similar pattern: the agent quickly moves from 2 → 3 → 4 → 5, then oscillates between 4 and 5 for $20 each step. It spends fewer steps in lower net-profit states, so it ends up with a higher total reward (198).
Starting from Spend=3
{'step': 0, 'spend_before': 3.0, 'action': 1, 'spend_after': 4.0, 'reward': 20.0}
{'step': 1, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 2, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 3, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 4, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 5, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 6, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 7, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 8, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 9, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
Total Reward: 200.0
- Interpretation: the agent starts near the peak, quickly moves between 4 and 5, consistently getting $20 net profit. Highest total reward so far (200) because it wastes almost no steps in suboptimal states.
Starting from Spend=4
{'step': 0, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
{'step': 1, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 2, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: the agent flips between 4 and 5 each step, each time getting $20. Another perfect scenario, netting 200 total.
Starting from Spend=5
{'step': 0, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
{'step': 1, 'spend_before': 4.0, 'action': 1, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: the agent oscillates between 5 and 4, each yielding $20. Total reward 200, same reason.
Starting from Spend=6
{'step': 0, 'spend_before': 6.0, 'action': 2, 'spend_after': 5.0, 'reward': 20.0}
{'step': 1, 'spend_before': 5.0, 'action': 2, 'spend_after': 4.0, 'reward': 20.0}
...
- Interpretation: moves from 6 to 5, then toggles between 4 and 5, earning $20 each step for a total reward of 200.
Starting from Spend=7
{'step': 0, 'spend_before': 7.0, 'action': 2, 'spend_after': 6.0, 'reward': 18.0}
{'step': 1, 'spend_before': 6.0, 'action': 2, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: the agent quickly drops from 7 → 6 → 5, then flips between 4 and 5 for $20. Only the initial step yields less ($18); the rest yield $20 each.
Starting from Spend=8
{'step': 0, 'spend_before': 8.0, 'action': 2, 'spend_after': 7.0, 'reward': 14.0}
{'step': 1, 'spend_before': 7.0, 'action': 2, 'spend_after': 6.0, 'reward': 18.0}
{'step': 2, 'spend_before': 6.0, 'action': 2, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: the agent gradually moves from 8 → 7 → 6 → 5 (and then toggles 4 ↔ 5). Early steps yield lower net profit, but eventually it stabilizes at $20 each step.
Starting from Spend=9
{'step': 0, 'spend_before': 9.0, 'action': 2, 'spend_after': 8.0, 'reward': 8.0}
{'step': 1, 'spend_before': 8.0, 'action': 2, 'spend_after': 7.0, 'reward': 14.0}
{'step': 2, 'spend_before': 7.0, 'action': 2, 'spend_after': 6.0, 'reward': 18.0}
{'step': 3, 'spend_before': 6.0, 'action': 2, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: the agent sees immediate improvement from 9 → 8 → 7 → 6 → 5. Once near 5 or 4, it collects $20 each step. The first few steps have lower rewards (8, 14, 18), but the rest are $20.
Starting from Spend=10
{'step': 0, 'spend_before': 10.0, 'action': 2, 'spend_after': 9.0, 'reward': 0.0}
{'step': 1, 'spend_before': 9.0, 'action': 2, 'spend_after': 8.0, 'reward': 8.0}
{'step': 2, 'spend_before': 8.0, 'action': 2, 'spend_after': 7.0, 'reward': 14.0}
{'step': 3, 'spend_before': 7.0, 'action': 2, 'spend_after': 6.0, 'reward': 18.0}
{'step': 4, 'spend_before': 6.0, 'action': 2, 'spend_after': 5.0, 'reward': 20.0}
...
- Interpretation: from 10, net profit is -$10, so the agent quickly drops to 9, 8, 7, 6, and eventually 5 or 4, collecting $20 each step once it’s near the peak.
Overall Summary for Each Initial Spend
- Spend=0 or 1: the agent steps up each turn until it reaches 4 or 5, then toggles for $20. It might take a few steps to get there, so total reward is in the 180–192 range.
- Spend=2 or 3: it quickly moves to 4 or 5, then oscillates. Fewer steps are “wasted” in low net-profit states, so total reward is higher (up to 198 or 200).
- Spend=4 or 5: the agent is already near or at the peak. It just toggles between 4 and 5 for $20 each step, giving total reward ~200.
- Spend=6, 7, 8, 9, or 10: the agent decreases spend step by step until it’s around 4 or 5. Early steps have lower net profit (e.g., $18 at 6, $14 at 7, etc.), but it eventually stabilizes at $20 each step. So total reward depends on how many steps it spends above 5.
In every case, the agent eventually settles around spend=4 or 5, where net profit is $20. The slight oscillation (4 ↔ 5) is because the environment yields equal single-step net profit for 4 and 5, so it flips back and forth. This is normal for a discrete action environment with no penalty for switching.
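All of the full-horizon runs above follow one simple loop: observe the current spend, pick an action, apply it, record the reward, repeat. The stand-in below is an assumption on my part: it substitutes the single-step greedy policy and an inferred profit curve (revenue = spend × (10 − spend), cost = spend) for the trained PPO model and SimpleMarketingEnv, whose internals aren't shown here, but it keeps the same loop structure and reproduces the logged trajectories and totals.

```python
# Stand-in for the notebook's multi-step rollout: the real version calls
# model.predict(obs) and env.step(action); here a greedy policy over an
# inferred net-profit curve takes their place (an assumption).
def net_profit(spend):
    return spend * (10 - spend) - spend

def rollout(start_spend, max_steps=10):
    """Apply the policy repeatedly, logging one record per step."""
    spend, total, log = start_spend, 0.0, []
    for step in range(max_steps):
        # Action 1 = increase spend by 1, action 2 = decrease by 1.
        up, down = min(spend + 1, 10), max(spend - 1, 0)
        action = 1 if net_profit(up) > net_profit(down) else 2
        new_spend = up if action == 1 else down
        reward = net_profit(new_spend)
        log.append({"step": step, "spend_before": float(spend), "action": action,
                    "spend_after": float(new_spend), "reward": float(reward)})
        total += reward
        spend = new_spend
    return log, total

_, total = rollout(0)
print(total)  # 180.0, matching the logged run that starts at spend=0
```

The same call with start spends 1, 2, and 3 returns totals of 192, 198, and 200, matching the runs above.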
Hyperparameter Tuning (Optional):
Below is a fully functioning example showing how to use Optuna to tune a few hyperparameters (learning_rate, n_steps, and gamma) for a PPO agent in SimpleMarketingEnv. Because our environment is so simple, the RL model often converges quickly even with the default settings; by systematically exploring the hyperparameter space with Optuna, however, you can identify the best-performing configuration for your environment.
import optuna
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy


def make_env():
    """Utility function to create a fresh instance of your environment."""
    return SimpleMarketingEnv(max_spend=10, cost_factor=1.0, max_steps=10)


def objective(trial):
    """
    Objective function for Optuna. We define the hyperparameter search
    space, train the model, then evaluate it. Optuna aims to MINIMIZE
    the returned value, so we return -mean_reward (because we want to
    MAXIMIZE mean_reward).
    """
    # Define search space
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    n_steps = trial.suggest_categorical("n_steps", [128, 256, 512])
    gamma = trial.suggest_float("gamma", 0.8, 0.9999)

    # Create train environment
    train_env = make_env()

    # Create model with the trial's hyperparams
    model = PPO(
        "MlpPolicy",
        train_env,
        verbose=0,
        learning_rate=learning_rate,
        n_steps=n_steps,
        gamma=gamma,
    )

    # Train the model for some timesteps
    model.learn(total_timesteps=5000)

    # Create a fresh env for evaluation
    eval_env = make_env()
    mean_reward, _ = evaluate_policy(model, eval_env, n_eval_episodes=5, deterministic=True)

    # We want to maximize mean_reward, but Optuna minimizes by default
    return -mean_reward
Conclusion:
In this mini use case, we demonstrated how an RL agent can learn to optimize marketing spend through trial and error, even in a simplified environment. Despite the simplicity, the agent’s step-by-step approach—exploration (testing new actions) and exploitation (focusing on profitable actions)—is precisely how more advanced RL systems function, including those in Agentic AI setups that combine RL with large language models (LLMs).
By starting with a parabolic revenue function, we could clearly see how the agent converged on a strategy that maximizes net profit. In real-world scenarios, the optimal solution might not be so obvious, and that’s where RL’s power to discover high-value strategies through systematic experimentation truly shines. As you move toward Agentic AI, you’ll leverage these same fundamentals—defining environments, actions, and rewards—but scale them up or integrate them with LLM-driven reasoning to tackle more complex, dynamic decision-making problems.
In future posts, we’ll build on this simple example and explore three key ways to integrate LLMs with RL: using an LLM to summarize or interpret the agent’s decisions in plain English, having the LLM suggest or generate new environment configurations or hyperparameters, and combining RL with LLM-driven feedback (or “human-like” preferences) to refine the agent’s policy in a more Agentic AI setting.

