(Workbook) Agentic AI: Optimize Marketing Spend with LLM as Meta-Controller of Learning Environment (Part 3)

(Note: for those interested, the Jupyter notebook with all the code for this post, plus additional outputs and explanations, is available for download on my Decision Sciences GitHub repository)


Marketing teams are constantly seeking ways to maximize return on investment by fine-tuning their spend. In this post, we explore the fundamentals of an innovative, albeit simplified, approach that combines a reinforcement learning (RL) model with an LLM “meta-controller” to optimize marketing spend. We’ll break down how a custom RL environment is built, how helper functions and tools manage the RL sub-policy, how an LLM agent is initialized to oversee the training process, and finally, how to interpret the iterative results that guide us toward an optimal strategy.

With the challenge clearly defined, let’s explore how our approach evolved from a pure RL algorithm to a system where an LLM orchestrates the process as a meta-controller.


LLM as Meta-Controller: An Overview

In our first post, we focused solely on a reinforcement learning (RL) algorithm that explored different marketing spend strategies on its own, gradually learning from the rewards. In the second post, we shifted control so that an LLM directly made the discrete spend decisions (i.e., choosing to add or subtract 1) instead of the RL algorithm autonomously. Now, we’ve taken another significant step forward by introducing an LLM as a meta-controller. This means that the LLM no longer makes the individual spend decisions; instead, it oversees and orchestrates the entire RL training process. The meta-controller strategically manages tasks like initializing the RL sub-policy, training it, evaluating its performance, and adjusting environment parameters. By doing so, it leverages its natural language reasoning to guide the RL algorithm toward optimal performance, providing a higher level of abstraction and transparency compared to our earlier approaches.

A meta-controller is like the conductor of an orchestra. While the individual musicians (or the RL algorithm) perform their parts, the conductor oversees the entire performance—adjusting the tempo, dynamics, and balance—to ensure a harmonious result. Similarly, an LLM meta-controller doesn’t make every individual decision but instead orchestrates the overall training process, guiding, evaluating, and fine-tuning the RL agent to achieve optimal performance.

Having introduced the role of an LLM as a meta-controller, let's now turn to the real-world challenge it aims to solve: optimizing marketing spend across multiple customer accounts.


Introduction to the Marketing Spend Optimization Challenge

Imagine a marketing team managing high-value customer accounts that need to be targeted with optimized spend. The goal is simple yet challenging: determine the spend level that maximizes net profit. In our simplified scenario, revenue follows a parabolic curve as a function of spend (discussed in more detail below), rising up to a point before it begins to decline, while costs are directly proportional to the spend itself. By employing reinforcement learning, we enable an RL agent to experiment with various spend levels and learn, over time, which decisions yield the highest profit. To further enhance the process, an LLM meta-controller is integrated to iteratively guide, adjust, and track performance improvements across training sessions. This combination allows us to dynamically steer the RL process toward the best possible outcomes.

To address this challenge, we needed a robust simulation environment. Next, I describe how we designed a custom RL environment that models our multi-channel marketing scenario.


Designing a Custom RL Environment

The heart of our solution is a custom Gymnasium environment called MultiChannelMarketingEnv. This environment now simulates two marketing channels where each channel’s spend is limited to a maximum value of 10. Here’s how it works: the environment defines an observation space representing the current spend levels for each channel and an action space that allows the RL agent to increment or decrement these spends.

The revenue for each channel is modeled as a parabolic function—peaking when spend is near 5—and the cost is calculated by multiplying the total spend by a fixed cost factor (set to 1.0 for consistency). The reward at each step is then computed as the total revenue minus the total cost. Additionally, the environment is designed to reset at the beginning of every episode, sampling initial spend levels randomly. This structure provides a controlled yet flexible simulation for testing various spend optimization strategies.
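To make this concrete, here is a minimal sketch of how such an environment might be written with Gymnasium. The MultiChannelMarketingEnv name, the two-channel state, the maximum spend of 10, and the cost factor of 1.0 come from the description above; the specific parabolic form revenue(s) = 25 − (s − 5)², the nine-action encoding (covered in the recap below), and the episode length of 20 steps are illustrative assumptions, and the notebook's actual implementation may differ.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class MultiChannelMarketingEnv(gym.Env):
    """Two marketing channels; each channel's spend is bounded by max_spend."""

    def __init__(self, max_spend=10, cost_factor=1.0, episode_length=20):
        super().__init__()
        self.max_spend = max_spend
        self.cost_factor = cost_factor
        self.episode_length = episode_length
        # Observation: the current spend level for each of the two channels.
        self.observation_space = spaces.Box(low=0, high=max_spend, shape=(2,), dtype=np.float32)
        # Action: the nine combinations of {-1, 0, +1} adjustments across the two channels.
        self.action_space = spaces.Discrete(9)

    def _revenue(self, spend):
        # Parabolic revenue: peaks at spend = 5 with a maximum of 25 per channel (assumed form).
        return 25.0 - (spend - 5.0) ** 2

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        # Each episode starts from randomly sampled spend levels.
        self.spend = self.np_random.uniform(0, self.max_spend, size=2).astype(np.float32)
        self.steps = 0
        return self.spend.copy(), {}

    def step(self, action):
        # Decode the discrete action into per-channel deltas in {-1, 0, +1}.
        delta0, delta1 = action // 3 - 1, action % 3 - 1
        self.spend = np.clip(
            self.spend + np.array([delta0, delta1], dtype=np.float32), 0, self.max_spend
        )
        revenue = self._revenue(self.spend[0]) + self._revenue(self.spend[1])
        cost = self.cost_factor * self.spend.sum()
        reward = float(revenue - cost)  # net profit for this step
        self.steps += 1
        terminated = False
        truncated = self.steps >= self.episode_length
        return self.spend.copy(), reward, terminated, truncated, {}
```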

To be very clear, in this set-up the reinforcement learning (RL) agent is the workhorse of the system: it directly interacts with the environment by choosing actions, learning from the resulting rewards, and updating its policy based on trial and error. In contrast, the LLM meta-controller (whose OpenAI set-up we discuss in more detail below) acts as a high-level overseer or “conductor” for the entire process. Rather than making individual spend decisions, the meta-controller monitors the RL agent’s performance, orchestrates its training cycles, and dynamically adjusts parameters to guide the learning process toward optimal performance.

In essence, while the RL agent is responsible for the granular, step-by-step interactions, the LLM meta-controller strategically manages and fine-tunes the overall training regimen, ensuring that the system evolves efficiently and transparently.


Recap of Our Multi-Channel Marketing Scenario

Unlike in the previous two posts, this example slightly increases the complexity by introducing a second marketing channel, giving us a two-channel scenario. At each step, the agent selects one of nine discrete actions. These actions correspond to the possible combinations of per-channel adjustments (decrease by 1, keep constant, or increase by 1): for example, increasing the spend in channel 0 while decreasing the spend in channel 1, or keeping one channel constant while modifying the other.

Revenue for each channel is calculated using a parabolic function:

  • Peak at spend (s) = 5, maximum revenue of 25.
  • If spend (s) drifts away from 5, revenue drops quickly.

The total revenue is the sum of the revenues from both channels. The cost is computed as the sum of the spends, each multiplied by a cost factor (fixed at 1 in our case). The net profit (or reward) at each step is then:

  reward = revenue(spend_ch0) + revenue(spend_ch1) − cost_factor × (spend_ch0 + spend_ch1)

where revenue(s) is the parabolic function above, peaking at 25 when s = 5.

In this multi-channel setup, the RL agent’s goal is to learn a policy that balances the spend across both channels to maximize the overall net profit. Previously, a standard RL algorithm would explore different spend values for a single channel and converge on near-optimal actions. Now, the challenge is to coordinate spending across two channels, leveraging the interplay between the two parabolic revenue functions to achieve the highest possible reward.
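As a quick sanity check on those numbers, the short snippet below evaluates the assumed revenue curve from the environment sketch at the optimum of spend = 5 for both channels, and confirms that the per-step net profit works out to 40, the theoretical maximum targeted later in the post.

```python
# Quick check of the per-step reward at the optimum (assumes revenue(s) = 25 - (s - 5)**2).
def revenue(spend: float) -> float:
    return 25.0 - (spend - 5.0) ** 2

spends = [5.0, 5.0]          # both channels at the revenue peak
cost_factor = 1.0
profit = sum(revenue(s) for s in spends) - cost_factor * sum(spends)
print(profit)                # 25 + 25 - 10 = 40.0, the theoretical per-step maximum
```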

With the environment in place, the next step was to build the underlying RL sub-policy. In this section, we explain how helper functions and tools are implemented to control and monitor the RL agent.


Building the RL Sub-Policy and Tools

To manage the RL model effectively, we implement a set of helper functions that serve as the backbone for our toolset.

Helper functions are the building blocks that enable the LLM meta-controller to interface seamlessly with the RL environment. These functions encapsulate essential operations such as initializing the RL sub-policy, training the PPO model, evaluating performance, and adjusting environment parameters. By abstracting these low-level details into discrete, callable routines, the helper functions provide a standardized interface through which the meta-controller can invoke actions and retrieve feedback.

This modular design is critical because it allows the LLM meta-controller to focus on high-level strategic decisions—like orchestrating the overall training process and tracking performance—without needing to manage the underlying complexities of the RL algorithm directly. In essence, helper functions act as the essential communication layer between the meta-controller’s reasoning and the dynamic operations of the RL environment.

The helper functions I’ve set up for this LLM meta-controller are as follows, with a rough sketch of their implementation after the list:

  • Initialize and reset the RL sub-policy: A function creates a new instance of the environment and a PPO model, ensuring that the agent starts from a clean slate.
  • Train the RL sub-policy: Another function allows the model to learn by training it for a specified number of timesteps, optionally updating hyperparameters like the learning rate.
  • Evaluate the sub-policy: The evaluation function runs multiple episodes to calculate metrics such as cumulative reward, mean per-step reward, and the average final spend per channel.
  • Track best performance: Global variables record the best mean per-step reward achieved, along with the corresponding spend levels. This helps us monitor the system’s progress.
  • Adjust environment parameters: A dedicated tool enables dynamic changes to the environment (e.g., updating the maximum spend) to further refine performance.
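To give a feel for what these helpers might look like, here is a rough sketch built on the environment above. The names train_subpolicy, evaluate_subpolicy, and set_env_params mirror the tool names that appear in the agent log later in the post; initialize_subpolicy, the global best-score tracking, and the input formats are illustrative stand-ins rather than the notebook's exact code.

```python
import numpy as np
from stable_baselines3 import PPO

# Globals shared across the helper functions (and surfaced to the meta-controller as tools).
env = None
model = None
best_mean_reward = float("-inf")
best_spends = None


def initialize_subpolicy(_: str = "") -> str:
    """Create a fresh environment and PPO model so the agent starts from a clean slate."""
    global env, model
    env = MultiChannelMarketingEnv()          # the environment sketched earlier
    model = PPO("MlpPolicy", env, verbose=0)
    return "Sub-policy initialized with the default PPO model."


def train_subpolicy(timesteps: int = 5000, learning_rate: float = 0.0005) -> str:
    """Train the current PPO model for a given number of timesteps."""
    # Overriding lr_schedule is one simple way to change the learning rate between runs in SB3.
    model.lr_schedule = lambda _progress: learning_rate
    model.learn(total_timesteps=timesteps)
    return f"Trained sub-policy for {timesteps} timesteps (lr={learning_rate})."


def evaluate_subpolicy(n_episodes: int = 5) -> str:
    """Roll out the trained policy and report cumulative reward, per-step reward, and final spends."""
    global best_mean_reward, best_spends
    total_reward, total_steps, final_spends = 0.0, 0, []
    for _ in range(n_episodes):
        obs, _info = env.reset()
        done = False
        while not done:
            action, _state = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, _info = env.step(int(action))
            total_reward += reward
            total_steps += 1
            done = terminated or truncated
        final_spends.append(obs.copy())
    mean_step_reward = total_reward / total_steps
    avg_final_spend = np.mean(final_spends, axis=0)
    if mean_step_reward > best_mean_reward:   # track the best score seen so far
        best_mean_reward, best_spends = mean_step_reward, avg_final_spend
    return (f"cumulative_reward={total_reward:.1f}, "
            f"mean_step_reward={mean_step_reward:.2f}, "
            f"avg_final_spend={np.round(avg_final_spend, 2).tolist()}")


def set_env_params(params: str = "") -> str:
    """Adjust an environment parameter, e.g. 'max_spend=12' (hypothetical input format)."""
    key, value = params.split("=")
    setattr(env, key.strip(), float(value))
    return f"Environment parameter updated: {params.strip()}"
```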

These helper functions are wrapped as LangChain tools, making them accessible for the meta-controller to invoke as needed during the iterative optimization process:
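A sketch of that wrapping using LangChain's classic Tool class might look like the following. The tool descriptions are what the meta-controller reads when deciding which tool to call; because these tools receive a single string argument from the agent, the comma-separated input convention for train_subpolicy is an assumption, as is the get_best_performance tool name.

```python
from langchain.agents import Tool


def _train_tool(arg: str) -> str:
    # Hypothetical convention: "5000, 0.0005" -> timesteps=5000, learning_rate=0.0005
    parts = [p.strip() for p in arg.split(",") if p.strip()]
    timesteps = int(parts[0]) if parts else 5000
    lr = float(parts[1]) if len(parts) > 1 else 0.0005
    return train_subpolicy(timesteps=timesteps, learning_rate=lr)


tools = [
    Tool(name="initialize_subpolicy", func=initialize_subpolicy,
         description="Reset the environment and create a fresh PPO sub-policy."),
    Tool(name="train_subpolicy", func=_train_tool,
         description="Train the sub-policy. Input: '<timesteps>, <learning_rate>'."),
    Tool(name="evaluate_subpolicy", func=lambda _: evaluate_subpolicy(),
         description="Evaluate the sub-policy and report cumulative and per-step rewards."),
    Tool(name="set_env_params", func=set_env_params,
         description="Adjust an environment parameter, e.g. 'max_spend=12'."),
    Tool(name="get_best_performance",
         func=lambda _: f"best_mean_step_reward={best_mean_reward:.2f}, spends={best_spends}",
         description="Return the best mean per-step reward seen so far and its spend levels."),
]
```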


Then we put the list of tools in a dictionary for the LLM to access during training:
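Given the tools list above, one simple way to do this is to key each tool by its name (the exact structure used in the notebook may differ):

```python
# Index the tools by name so the meta-controller set-up below can look them up easily.
tools_dict = {tool.name: tool for tool in tools}
```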

Now that we have established the tools to manage our RL agent, we introduce the LLM meta-controller that orchestrates these operations. Let’s see how to initialize and configure the meta-controller agent.


Creating the LLM Meta-Controller Agent

The next step involves setting up an LLM agent to serve as our meta-controller. Using LangChain, we integrate OpenAI’s GPT-3.5 Turbo (via the ChatOpenAI class) with a conversation memory (ConversationBufferMemory) that stores the chain-of-thought across interactions. The agent is initialized with our list of tools—each corresponding to one of the helper functions described earlier—and is configured to use a zero-shot reasoning approach.

This configuration enables the LLM to “think aloud” by generating a chain-of-thought that guides it through decisions such as initializing the sub-policy, training it, evaluating its performance, and adjusting the environment. With parameters like a deterministic temperature (set to 0) and verbose logging, the meta-controller is designed to provide detailed insights into its decision-making process, ensuring that every step is informed by previous results and that key performance metrics are tracked.
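Pulling those pieces together, the initialization might look roughly like the sketch below, which uses LangChain's classic initialize_agent helper; the max_iterations value is an assumption to give the agent room for a long run, and the notebook's exact configuration may differ.

```python
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.agents import initialize_agent, AgentType

# Deterministic (temperature=0), verbose meta-controller built on GPT-3.5 Turbo.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
memory = ConversationBufferMemory(memory_key="chat_history")

meta_controller = initialize_agent(
    tools=list(tools_dict.values()),              # the wrapped helper functions from above
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,  # zero-shot reasoning over tool descriptions
    memory=memory,                                # keeps the chain-of-thought across interactions
    verbose=True,                                 # log every Thought / Action / Observation
    max_iterations=150,                           # assumed; allows many tool-call cycles
)
```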

With our meta-controller agent fully configured, it’s time to put it into action. In the next section, we detail the iterative optimization process and explain how to interpret the output from each cycle.


Meta-Controller Prompt: Orchestrating Iterative RL Optimization

The meta-controller prompt is a multi-line instruction that guides the LLM in orchestrating the iterative process of optimizing the RL sub-policy. It directs the LLM to act as a high-level manager—using a suite of tools that can initialize the RL agent, train the model, evaluate performance, adjust environment parameters, and retrieve the best performance metrics along with the recommended spend levels.

The prompt lays out a clear objective: to achieve a mean per-step reward of 40 (the theoretical maximum) and determine the optimal spend configuration for each marketing channel.

In addition, the prompt specifies an iterative process that involves repeatedly initializing, training, evaluating, and adjusting until the target performance is reached. It instructs the meta-controller to record the highest mean per-step reward observed and to continue the process for at least 100 iterations, ensuring thorough exploration of the solution space.

Finally, once the iterations conclude, the LLM is expected to output a detailed final report that not only provides the best metrics and corresponding spend levels but also explains the reasoning behind its decision to stop iterating.
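Based on the description above, a condensed version of the prompt might read something like the sketch below (the real prompt in the notebook is longer and more detailed), after which the run is kicked off with a single call:

```python
meta_controller_prompt = """
You are a meta-controller overseeing the training of an RL sub-policy that
optimizes marketing spend across two channels.

You have tools to: initialize the sub-policy, train it, evaluate it, adjust
environment parameters, and retrieve the best performance seen so far.

Objective: reach a mean per-step reward of 40 (the theoretical maximum) and
determine the optimal spend configuration for each channel.

Process: repeatedly initialize, train, evaluate, and adjust. Record the highest
mean per-step reward observed. Continue for at least 100 iterations unless the
target is reached.

When you stop, produce a final report with the best mean per-step reward, the
recommended spend per channel, and the reasoning behind your decision to stop.
"""

result = meta_controller.run(meta_controller_prompt)
print(result)
```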


Iterative Optimization and Interpreting Results

The final piece of the puzzle is the prompt that instructs the meta-controller to guide the iterative optimization process. In this prompt, the LLM is told to repeatedly perform the following steps:

  1. Initialize the sub-policy.
  2. Train the PPO model.
  3. Evaluate its performance, where the output includes metrics such as cumulative reward, mean per-step reward, and average spend levels per channel.
  4. Optionally adjust environment parameters to see if further improvements can be made.

Each cycle of the process follows a clear Thought – Action – Observation pattern:

  • Thought: The agent outlines its reasoning, e.g., noting that the current mean per-step reward is below the optimal target.
  • Action: It then calls one of the available tools (e.g., training, evaluation, or parameter adjustment).
  • Observation: Finally, it reviews the output from that tool call, which provides the performance metrics and updates on the best score achieved.

The final readout includes a detailed summary of the iterative process, reporting the best mean per-step reward achieved and the recommended spend levels for each channel. The meta-controller also explains why it decided to stop iterating—for example, when further improvements plateau or when computational limits are reached. This structured output not only provides performance metrics but also a transparent narrative of how the optimization evolved over time.

Below is the output of an optimization run using the LLM meta-controller, so you can follow the Thought -> Action -> Observation pattern from start to finish:

Now that we have seen the detailed output from a full optimization run, we can unpack the information that is being provided.

Interpretation of LLM Agent Output

Below is a detailed interpretation of the iterative log output. This explanation helps us read and understand how the meta-controller’s chain-of-thought unfolds during the optimization process. The log is structured as a series of iterative cycles, each of which follows a pattern of thought, action, and observation. Here’s what you need to know:

  • Initialization:
    The process begins with a message stating that a new AgentExecutor chain is starting. The first action is to initialize the RL sub-policy. The log confirms that the sub-policy is initialized with the default PPO model, setting the stage for training.
  • Training Phase:
    Next, the agent calls the training tool (e.g., train_subpolicy) with parameters like a specified number of timesteps and a learning rate. Each training action results in a log message that confirms the training was executed (for example, “Trained sub-policy for 5000 timesteps (lr=0.0005)”). This indicates that the model is undergoing learning during that cycle.
  • Evaluation Phase:
    After training, the agent evaluates the sub-policy using the evaluate_subpolicy tool. The observation output includes key performance metrics such as the cumulative reward, the mean per-step reward, and the average final spend per channel.
  • Adjustments and Iteration:
    When the evaluation indicates that the current performance is below the target, the agent may decide to adjust the environment parameters using the set_env_params tool. For example, it might tweak parameters like the cost factor. The log shows this as a clear “adjustment” action, and subsequent training and evaluation cycles reveal whether these adjustments have helped.
  • Progress and Fluctuations:
    As the iterations progress, you will notice that the mean per-step reward may sometimes increase and other times decrease. This fluctuation is normal in RL training, as the agent explores different strategies. The log records these variations, and the updated best score reflects the highest mean per-step reward observed up to that point.
  • Final Outcome:
    Eventually, after numerous iterations, the log concludes with a final summary. This final output states the best mean per-step reward achieved (in this example, very close to the target) along with the corresponding recommended spend levels for each channel. The agent also explains why it decided to stop iterating—often because further improvements were minimal or the performance had plateaued.

Overall, the log provides a transparent narrative of the agent’s decision-making process: it starts by initializing, then trains and evaluates its performance, adjusts parameters when necessary, and continues to iterate until its performance nears the optimal target. This detailed feedback allows you, as a user, to see both the progress made and the reasoning behind the final decision.


Conclusion

The integrated approach described above—combining a custom RL environment with an LLM meta-controller—demonstrates a powerful (albeit simplified in this example) iterative method for optimizing marketing spend. By leveraging reinforcement learning, we allow the RL agent to explore various strategies, while the LLM meta-controller dynamically adjusts the process based on real-time performance feedback.

The detailed tracking of performance metrics and spend levels ensures that the system can converge toward an optimal solution, ultimately guiding marketing decisions with both precision and transparency. This framework not only optimizes net profit but also offers a compelling example of how advanced AI techniques can work together to solve real-world business challenges.

Over the next one or two posts, I’ll build on the example above by increasing the sophistication of our marketing spend scenario—exploring enhancements such as an unbounded revenue function, additional tactics and channels, and saturation curves. I’ll also refine the code used to execute and manage the LLM meta-controller.

(Note: for those interested, the Jupyter notebook with all the code for this post, plus additional outputs and explanations, is available for download on my Decision Sciences GitHub repository)

