(Note: for those interested, the Jupyter notebook with all the code for this post, plus additional outputs and explanations, is available for download on my Decision Sciences GitHub repository)
In my previous post, I showcased a simple Reinforcement Learning (RL) approach to optimizing marketing spend, demonstrating how an RL agent can learn the best actions to maximize net profit. This time, I’m taking a slightly unorthodox route: letting a large language model (LLM), rather than a trained RL system, directly control the RL environment’s actions. While it’s not the usual way to incorporate RL into an “Agentic AI” framework, this example highlights how an LLM can literally “think out loud,” decide which actions to take, and interact with the environment step by step—revealing a more transparent, flexible style of AI-driven decision-making.
Letting the LLM Take the Wheel
In a typical Agentic AI pipeline, an RL system often runs quietly behind the scenes as a low-level policy, with the LLM providing higher-level reasoning or oversight. Here, we’re flipping that dynamic: the LLM itself “drives” the environment at every step. This setup isn’t meant to replace a well-tuned RL system—indeed, RL remains a powerful method for optimizing tasks at scale. Instead, it illustrates how a language model can handle decisions in real time, letting you see every action, observe the environment’s feedback, and watch the AI’s chain-of-thought evolve. That level of transparency is both instructive for learning and potentially useful for scenarios where you want a more conversational or domain-savvy AI to manage tasks.
Recap of Our Simple RL Environment
Let’s briefly revisit the simple Marketing environment we’ve been working with. The state of the environment is just a single spend variable, which can range from $0 to $10. Each step, the agent (or AI) chooses one of three actions:
- Stay at the current spend level,
- Increase the spend by $1, or
- Decrease the spend by $1.
We use a parabolic revenue function that peaks around spend=5. Subtracting the cost (spend × cost_factor) from the revenue gives us a net profit or reward, which the AI tries to maximize. Previously, a standard RL algorithm would repeatedly explore different spend values, observe rewards, and update its policy until it converged on near-optimal actions.
Parabolic Function:

Revenue = -(s - 5)^2 + 25
- Peak at spend (s) = 5, maximum revenue of 25.
- If spend (s) drifts away from 5, revenue drops quickly.
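To make the shape of the curve concrete, here is a quick standalone sanity check (using cost_factor=1.0, the environment's default, to compute net profit alongside revenue):

```python
# Quick sanity check of the reward curve (cost_factor=1.0, the env default).
def revenue(s):
    return -(s - 5) ** 2 + 25

cost_factor = 1.0
for s in range(11):
    profit = revenue(s) - s * cost_factor
    print(f"spend={s:2d}  revenue={revenue(s):3d}  net profit={profit:5.1f}")
```

One subtlety worth noticing: revenue alone peaks at spend=5, but once cost is subtracted the continuous optimum sits at spend=4.5, so the integer spends 4 and 5 tie at a net profit of 20.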
Why Let the LLM Drive?
I’m showcasing this example to demonstrate how a language model can literally “talk through” each decision in a simple environment, producing a human-readable chain of reasoning. This is a great way to illustrate the concept of “Agentic AI,” where the LLM actively decides actions step by step rather than relying on an invisible algorithm (i.e., a hidden policy network) working behind the scenes.
Think of it as handing the LLM the car keys to drive the environment, rather than having an unseen autopilot (the RL system) quietly steer on its own. In other words, we’ve given the environment a “steering wheel” that the LLM can operate: it presses the gas or the brakes (i.e., raises or lowers spend) based on what it sees, and it’s free to decide when to stop. Instead of an unseen algorithm controlling everything internally, the LLM is now in the driver’s seat—actively making decisions that anyone can watch and understand.
This is not the typical path to optimal solutions using LLMs—traditional RL often excels at efficiency and convergence—but it highlights the transparency and flexibility we gain when an AI agent is driven by natural language. Whether for rapid prototyping, educational demos, or exploring domain constraints in a more conversational manner, letting the LLM control our toy marketing environment offers a clear view of how an AI might reason about actions and rewards in real time.
Below is the code to set up our simple Marketing environment, which is the same as the code used in my previous blog post:
```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from gymnasium.utils import seeding


class SimpleMarketingEnv(gym.Env):
    """
    A simple marketing environment using Gymnasium.
    The agent controls a single 'spend' variable in [0..max_spend].
    Each step, the agent picks one of three actions:
        0 = stay, 1 = increase spend by 1, 2 = decrease spend by 1.
    The environment calculates a parabolic revenue function minus cost.
    """
    metadata = {"render_modes": ["human"]}

    def __init__(self, max_spend=10, cost_factor=1.0, max_steps=1000):
        super().__init__()
        self.max_spend = max_spend
        self.cost_factor = cost_factor
        self.max_steps = max_steps

        # Observation space: single float in [0, max_spend]
        self.observation_space = spaces.Box(
            low=np.array([0], dtype=np.float32),
            high=np.array([float(max_spend)], dtype=np.float32),
            shape=(1,),
            dtype=np.float32
        )

        # Action space: Discrete(3) -> {0=stay, 1=+1, 2=-1}
        self.action_space = spaces.Discrete(3)

        self.state = None
        self.current_step = 0
        self._np_random, _ = seeding.np_random()  # default seeding

    def seed(self, seed=None):
        """
        Optional: If you want reproducible randomness,
        call env.seed(some_seed) before reset().
        """
        self._np_random, _seed = seeding.np_random(seed)
        return [_seed]

    def reset(self, seed=None, options=None):
        """
        Resets the environment. Returns (observation, info).
        """
        super().reset(seed=seed)
        if seed is not None:
            self.seed(seed)

        self.current_step = 0
        # Start at a random spend from 0..max_spend
        # NOTE: use 'integers(...)' instead of 'randint(...)' in new NumPy
        initial_spend = self._np_random.integers(0, self.max_spend + 1)
        self.state = np.array([float(initial_spend)], dtype=np.float32)

        info = {}
        return self.state, info

    def step(self, action):
        """
        Applies the given action (0=stay, 1=+1, 2=-1).
        Returns (observation, reward, terminated, truncated, info).
        """
        spend = float(self.state[0])

        if action == 1:    # increase
            spend += 1
        elif action == 2:  # decrease
            spend -= 1

        spend = np.clip(spend, 0, self.max_spend)

        # Parabolic revenue: -(spend - 5)^2 + 25
        revenue = -(spend - 5) ** 2 + 25
        cost = spend * self.cost_factor
        reward = revenue - cost

        self.state = np.array([spend], dtype=np.float32)
        self.current_step += 1

        terminated = False  # set True if you want a specific terminal condition
        truncated = (self.current_step >= self.max_steps)
        info = {}

        return self.state, reward, terminated, truncated, info

    def render(self):
        """
        Optional: print or visualize the environment's state.
        """
        print(f"Step {self.current_step}, Spend={self.state[0]}")
```
Using Tools to Let the LLM Control the Environment
Instead of coding a loop where an RL system calls env.step(action) internally, we expose that functionality through LangChain “tools.” We create a simple “reset” tool that initializes the environment to a new spend level and a “step” tool that takes a chosen action (e.g., “increase” or “decrease” by 1). The LLM calls these tools by name, passing the desired action in textual form. This might look like:
- “reset_env” to start a new episode, returning a text description like “Environment reset. Current spend=3.”
- “step_env with input ‘1’” to increase the spend, receiving back something like “Action=1, new spend=4, reward=20, done=False.”
As background, LangChain is a framework designed to bridge LLMs with external actions or data sources, often referred to as “tools.” By exposing our simple Marketing environment’s core functions (like reset_env or step_env) as tools, we give the LLM a simple, text-based way to manipulate the environment in real time. Instead of an internal RL loop calling env.step(action) under the hood, the LLM literally “requests” the action—like “step_env: 1”—as part of its natural language output. This approach not only makes the AI’s decision process visible but also showcases how an LLM can be given agentic control of an environment, issuing instructions just as a human would.
From the LLM’s perspective, it’s having a conversation with the environment—no different than if it were chatting with a human. Behind the scenes, we still have the same marketing environment code that calculates net profit. But now, the environment’s entire interface is text-based, letting the LLM “see” each reward in a more literal sense.
```python
###############################
#  Create Tools for the LLM   #
###############################
from langchain.agents import Tool
from typing import Optional

marketing_env = SimpleMarketingEnv()


def reset_env_tool(_input: str) -> str:
    obs, _ = marketing_env.reset()
    return f"Environment reset. Current spend={obs[0]}, step=0"


def step_env_tool(action_str: str) -> str:
    try:
        action = int(action_str)
    except ValueError:
        return "Invalid action. Please provide 0, 1, or 2."
    obs, reward, done, truncated, info = marketing_env.step(action)
    step_info = (
        f"Action={action}, new spend={obs[0]:.1f}, "
        f"reward={reward:.1f}, done={done or truncated}"
    )
    return step_info


def stop_env_tool(_input: str) -> str:
    return "Stop requested. Finalizing environment run."


tools = [
    Tool(
        name="reset_env",
        func=reset_env_tool,
        description="Reset the environment to start a new episode."
    ),
    Tool(
        name="step_env",
        func=step_env_tool,
        description="Take an action in the environment: 0=stay, 1=increase, 2=decrease."
    ),
    Tool(
        name="stop_env",
        func=stop_env_tool,
        description="Stop the environment run."
    ),
]
```
Giving the LLM a Memory
One challenge we face when letting an LLM “drive” an environment step by step is that it doesn’t automatically remember what happened in previous steps. By default, each response the LLM gives is based on the most recent message or two—it has no built-in memory of the entire interaction history. This can lead to the AI repeating unprofitable actions or forgetting that it already discovered a higher net profit earlier.
To fix this, we can give the LLM a conversation memory. In LangChain, this typically means creating a memory object—like ConversationBufferMemory—and passing it to the agent. The idea is that every step, the environment’s output (e.g., “spend=7, reward=14”) and the LLM’s own reasoning (“Thought: let’s lower spend again”) are all appended to a running transcript. The LLM then has access to this entire transcript whenever it decides the next action.
In simpler terms, it’s like the LLM keeping a notebook of everything it has tried so far. For example, if it tested spend=6, saw a reward of 18, and then tested spend=7, saw 14, it can now recall that 6 gave it a higher net profit than 7. Without memory, the LLM might forget which spend levels it already tested and accidentally go back to 7 again. With memory, it’s more likely to continue exploring different actions or stay at the best one found so far.
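The mechanics of that “notebook” are easy to picture without any framework code. The sketch below is purely illustrative (it is not how ConversationBufferMemory works internally): the memory is just accumulated text, and anything recorded in it can be recovered later instead of forgotten.

```python
import re

# Illustrative sketch of transcript-style memory (not LangChain internals):
# every observation and thought is appended to one running log.
transcript = []

def record(line: str) -> None:
    transcript.append(line)

record("Observation: spend=6, reward=18")
record("Thought: let's try a higher spend")
record("Observation: spend=7, reward=14")

# Because the full history is available, the best result seen so far
# can be recalled rather than re-discovered:
results = [(float(m.group(2)), float(m.group(1)))
           for line in transcript
           for m in re.finditer(r"spend=(\d+), reward=(\d+)", line)]
best_reward, best_spend = max(results)
print(f"Best so far: spend={best_spend}, reward={best_reward}")
```

With memory like this attached, the agent can see that spend=6 beat spend=7 and avoid retesting the worse option.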
How the Thought→Action Flow Works
When we specify the Zero-Shot ReAct agent type in LangChain, the LLM is prompted to produce lines of reasoning (“Thought: …”) and then lines indicating which tool it wants to call (“Action: step_env”). After each action, the environment’s response is fed back to the LLM as an “Observation,” which it can incorporate into its next thought. This leads to a loop like:
- Thought: “We’re at spend=3 and net profit=18, which is good but maybe we can do better.”
- Action: “step_env” with input “1” (increase by $1).
- Observation: “Action=1, new spend=4, reward=20, done=False.”
Seeing that net profit is now $20, the LLM might think, “We’ve hit the peak, so let’s stop.” This entire sequence is transparent in the logs, making the AI’s process far more interpretable than a typical RL approach.
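Under the hood, the agent framework pulls the “Action” and “Action Input” lines out of the model’s text and dispatches them to the matching tool. The toy parser below (a simplified sketch, not LangChain’s actual parsing code, with a stubbed tool standing in for the real environment) shows the shape of that loop:

```python
def step_env(action_str: str) -> str:
    # Stub standing in for the real environment tool.
    return f"Action={action_str}, new spend=4.0, reward=20.0, done=False"

tools = {"step_env": step_env}

def parse(llm_output: str):
    """Pull the 'Action:' and 'Action Input:' lines from the model's text."""
    action = action_input = None
    for line in llm_output.splitlines():
        if line.startswith("Action:"):
            action = line.split(":", 1)[1].strip()
        elif line.startswith("Action Input:"):
            action_input = line.split(":", 1)[1].strip()
    return action, action_input

llm_output = ("Thought: We're at spend=3, maybe we can do better.\n"
              "Action: step_env\n"
              "Action Input: 1")
action, action_input = parse(llm_output)
observation = tools[action](action_input)
print("Observation:", observation)
# The observation text is appended to the prompt for the next Thought.
```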
In the code below, we initialize the agent and the LLM using ChatOpenAI, and then run the agent with our prompt. The prompt is essentially a set of instructions telling the LLM how to interact with a marketing spend environment. Instead of letting a hidden algorithm make decisions, we’re guiding the LLM to call specific “tools” (reset_env, step_env, stop_env) so it can raise or lower spend and see the resulting net profit. By reading the environment’s responses (in plain text), the LLM can iteratively decide whether to keep going or stop.
```python
############################################
#  Add Conversation Memory to the Agent    #
############################################
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, initialize_agent, AgentType

# Create a memory object that stores all conversation history
memory = ConversationBufferMemory(memory_key="chat_history")

# Initialize the LLM (ChatOpenAI) with zero temperature
llm = ChatOpenAI(
    temperature=0,
    model_name="gpt-3.5-turbo"  # or "gpt-4" if you have access
)

# Pass the memory object to the agent, so it can see the entire conversation
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    handle_parsing_errors=True
)

###############################
#  Provide the custom prompt  #
###############################
prompt = """
You are an AI controlling a marketing spend environment.
Actions:
- reset_env -> resets environment
- step_env -> step the environment with an action: 0=stay, 1=+1, 2=-1
- stop_env -> finalize your solution

Goal: maximize net profit by choosing good actions.
Keep track of the highest reward you've encountered so far.
Try multiple actions, including lowering or increasing spend all the way
if needed, until you confirm you can't do better.
If no improvement occurs over a few steps, continue experimenting until
you are confident the solution cannot be improved further.
Please keep calling step_env until you believe you've maximized net profit
or see no further improvement. Then call stop_env or just give your final answer.
"""

response = agent.invoke(prompt)
final_text = response.get("output", "")
print("\nFinal Agent Response:\n", final_text)
```
The prompt also spells out both the goal (e.g., “maximize net profit”) and the instructions on how to proceed (like “call step_env until you see no further improvement, then call stop_env”). This ensures the LLM knows why it’s performing each action (to reach the highest net profit) and when it should wrap up (once further changes won’t help). By including these details, we guide the LLM toward completing the task in a logical, stepwise manner, rather than leaving it unsure about what success looks like or when to stop.
Running the code above gives the following output (note: the output will vary with each new run of the code):
```
> Entering new AgentExecutor chain...
I need to start by resetting the environment to begin a new episode and maximize net profit.
Action: reset_env
Action Input: N/A
Observation: Environment reset. Current spend=8.0, step=0
I should start by keeping the current spend and see if it leads to a higher profit.
Action: step_env
Action Input: 0
Observation: Action=0, new spend=8.0, reward=8.0, done=False
The reward is the same as the current spend, so I should try increasing the spend to see if it leads to a higher profit.
Action: step_env
Action Input: 1
Observation: Action=1, new spend=9.0, reward=0.0, done=False
Increasing the spend did not lead to a higher profit, so I should try decreasing the spend.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=8.0, reward=8.0, done=False
Decreasing the spend led to a higher profit, so I should continue decreasing the spend to see if it can be further optimized.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=7.0, reward=14.0, done=False
Decreasing the spend further led to an even higher profit, so I should continue decreasing the spend.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=6.0, reward=18.0, done=False
Continuing to decrease the spend is still increasing the profit, so I should keep going.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=5.0, reward=20.0, done=False
Continuing to decrease the spend is still increasing the profit, so I should keep going.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=4.0, reward=20.0, done=False
Continuing to decrease the spend did not increase the profit further, so I should stop and finalize my solution.
Action: stop_env
Action Input: N/A
Observation: Stop requested. Finalizing environment run.
I now know the final answer is to decrease the marketing spend to $5.0 to maximize net profit.
Final Answer: Decrease the marketing spend to $5.0.

> Finished chain.

Final Agent Response:
 Decrease the marketing spend to $5.0.
```
In this output, you can see the LLM’s entire decision-making process as it tries to find the best spend level. The LLM begins with a spend of $8, tests staying at that level, then systematically tries increasing and decreasing to find the best net profit. It quickly discovers that raising spend to $9 yields no improvement, while lowering spend step by step (down to $5) steadily boosts net profit up to $20. After testing $4 and seeing no further gain, the LLM decides to stop, concluding that spend = $5 is the optimal choice for maximizing net profit.
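As a cross-check, every reward the LLM observed in the transcript matches the environment’s reward formula. This standalone sketch reuses the parabola and the default cost_factor of 1.0:

```python
# Cross-check the rewards from the agent's run against the reward formula:
# reward = -(spend - 5)**2 + 25 - spend * cost_factor, with cost_factor=1.
def reward(spend: float) -> float:
    return -(spend - 5) ** 2 + 25 - spend

observed = {9: 0.0, 8: 8.0, 7: 14.0, 6: 18.0, 5: 20.0, 4: 20.0}
for s, r in observed.items():
    assert reward(s) == r
print("All logged rewards match the formula.")
```

Note the tie: spend=4 and spend=5 both yield a net profit of 20, because the cost term shifts the continuous optimum to spend=4.5. The agent’s “decrease to $5.0” answer is therefore one of two equally good integer spends.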
Conclusion
By letting an LLM directly control our simple marketing environment, we illustrate a more agentic style of AI, where language-based reasoning drives each step rather than a learned policy. It’s not necessarily the standard path to optimizing net profit—traditional RL might still be more robust for large-scale or highly complex tasks. But as a demonstration of Agentic AI principles, it’s compelling: the LLM literally “talks” to the environment, sees textual rewards, and decides how to proceed.
In upcoming posts, I’ll continue to build on the foundational knowledge from the first two posts and explore a more hybrid, realistic approach, where a trained RL sub-policy handles the low-level actions while the LLM serves as a meta-controller—analyzing logs, tweaking hyperparameters, and guiding the entire system from a higher vantage point. For now, our simple environment and tool-based approach should give you a taste of how easily an LLM can step into the role of an RL policy, making decisions in a more transparent, conversational manner.