(Note: for those interested, the Jupyter notebook with all the code for this post, plus additional outputs and explanations, is available for download on my Decision Sciences GitHub repository)
In my previous post, I showcased a simple Reinforcement Learning (RL) approach to optimizing marketing spend, demonstrating how an RL agent can learn the best actions to maximize net profit. This time, I’m taking a slightly unorthodox route: letting a large language model (LLM), rather than a trained RL system, directly control the RL environment’s actions. While it’s not the usual way to incorporate RL into an “Agentic AI” framework, this example highlights how an LLM can literally “think out loud,” decide which actions to take, and interact with the environment step by step—revealing a more transparent, flexible style of AI-driven decision-making.
Letting the LLM Take the Wheel
In a typical Agentic AI pipeline, an RL system often runs quietly behind the scenes as a low-level policy, with the LLM providing higher-level reasoning or oversight. Here, we’re flipping that dynamic: the LLM itself “drives” the environment at every step. This setup isn’t meant to replace a well-tuned RL system—indeed, RL remains a powerful method for optimizing tasks at scale. Instead, it illustrates how a language model can handle decisions in real time, letting you see every action, observe the environment’s feedback, and watch the AI’s chain-of-thought evolve. That level of transparency is both instructive for learning and potentially useful for scenarios where you want a more conversational or domain-savvy AI to manage tasks.
Recap of Our Simple RL Environment
Let’s briefly revisit the simple Marketing environment we’ve been working with. The state of the environment is just a single spend variable, which can range from $0 to $10. Each step, the agent (or AI) chooses one of three actions:
- Stay at the current spend level,
- Increase the spend by $1, or
- Decrease the spend by $1.
We use a parabolic revenue function that peaks around spend=5. Subtracting the cost (spend × cost_factor) from the revenue gives us a net profit or reward, which the AI tries to maximize. Previously, a standard RL algorithm would repeatedly explore different spend values, observe rewards, and update its policy until it converged on near-optimal actions.
Parabolic Function:

Revenue = -(s - 5)^2 + 25
- Peak at spend (s) = 5, maximum revenue of 25.
- If spend (s) drifts away from 5, revenue drops quickly.
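To make the shape of the curve concrete, here is a quick standalone sanity check (using cost_factor=1.0, the environment's default, to compute net profit alongside revenue):

```python
# Quick sanity check of the reward curve (cost_factor=1.0, the env default).
def revenue(s):
    return -(s - 5) ** 2 + 25

cost_factor = 1.0
for s in range(11):
    profit = revenue(s) - s * cost_factor
    print(f"spend={s:2d}  revenue={revenue(s):3d}  net profit={profit:5.1f}")
```

One subtlety worth noticing: revenue alone peaks at spend=5, but once cost is subtracted the continuous optimum sits at spend=4.5, so the integer spends 4 and 5 tie at a net profit of 20.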
Why Let the LLM Drive?
I’m showcasing this example to demonstrate how a language model can literally “talk through” each decision in a simple environment, producing a human-readable chain of reasoning. This is a great way to illustrate the concept of “Agentic AI,” where the LLM actively decides actions step by step rather than relying on an invisible algorithm (i.e., a hidden policy network) working behind the scenes.
Think of it as handing the LLM the car keys to drive the environment, rather than having an unseen autopilot (the RL system) quietly steer on its own. In other words, we’ve given the environment a “steering wheel” that the LLM can operate: it presses the gas or the brakes (i.e., raises or lowers spend) based on what it sees, and it’s free to decide when to stop. Instead of an unseen algorithm controlling everything internally, the LLM is now in the driver’s seat—actively making decisions that anyone can watch and understand.
This is not the typical path to optimal solutions using LLMs—traditional RL often excels at efficiency and convergence—but it highlights the transparency and flexibility we gain when an AI agent is driven by natural language. Whether for rapid prototyping, educational demos, or exploring domain constraints in a more conversational manner, letting the LLM control our toy marketing environment offers a clear view of how an AI might reason about actions and rewards in real time.
Below is the code to set up our simple Marketing environment, which is the same as the code used in my previous blog post:
```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from gymnasium.utils import seeding


class SimpleMarketingEnv(gym.Env):
    """
    A simple marketing environment using Gymnasium.
    The agent controls a single 'spend' variable in [0..max_spend].
    Each step, the agent picks one of three actions:
        0 = stay, 1 = increase spend by 1, 2 = decrease spend by 1.
    The environment calculates a parabolic revenue function minus cost.
    """
    metadata = {"render_modes": ["human"]}

    def __init__(self, max_spend=10, cost_factor=1.0, max_steps=1000):
        super().__init__()
        self.max_spend = max_spend
        self.cost_factor = cost_factor
        self.max_steps = max_steps

        # Observation space: single float in [0, max_spend]
        self.observation_space = spaces.Box(
            low=np.array([0], dtype=np.float32),
            high=np.array([float(max_spend)], dtype=np.float32),
            shape=(1,),
            dtype=np.float32
        )

        # Action space: Discrete(3) -> {0=stay, 1=+1, 2=-1}
        self.action_space = spaces.Discrete(3)

        self.state = None
        self.current_step = 0
        self._np_random, _ = seeding.np_random()  # default seeding

    def seed(self, seed=None):
        """
        Optional: If you want reproducible randomness,
        call env.seed(some_seed) before reset().
        """
        self._np_random, _seed = seeding.np_random(seed)
        return [_seed]

    def reset(self, seed=None, options=None):
        """
        Resets the environment. Returns (observation, info).
        """
        super().reset(seed=seed)
        if seed is not None:
            self.seed(seed)

        self.current_step = 0
        # Start at a random spend from 0..max_spend
        # NOTE: use 'integers(...)' instead of 'randint(...)' in new NumPy
        initial_spend = self._np_random.integers(0, self.max_spend + 1)
        self.state = np.array([float(initial_spend)], dtype=np.float32)

        info = {}
        return self.state, info

    def step(self, action):
        """
        Applies the given action (0=stay, 1=+1, 2=-1).
        Returns (observation, reward, terminated, truncated, info).
        """
        spend = float(self.state[0])

        if action == 1:    # increase
            spend += 1
        elif action == 2:  # decrease
            spend -= 1

        spend = np.clip(spend, 0, self.max_spend)

        # Parabolic revenue: -(spend - 5)^2 + 25
        revenue = -(spend - 5) ** 2 + 25
        cost = spend * self.cost_factor
        reward = revenue - cost

        self.state = np.array([spend], dtype=np.float32)
        self.current_step += 1

        terminated = False  # set True if you want a specific terminal condition
        truncated = (self.current_step >= self.max_steps)
        info = {}

        return self.state, reward, terminated, truncated, info

    def render(self):
        """
        Optional: print or visualize the environment's state.
        """
        print(f"Step {self.current_step}, Spend={self.state[0]}")
```
Using Tools to Let the LLM Control the Environment
Instead of coding a loop where an RL system calls env.step(action) internally, we expose that functionality through LangChain “tools.” We create a simple “reset” tool that initializes the environment to a new spend level and a “step” tool that takes a chosen action (e.g., “increase” or “decrease” by 1). The LLM calls these tools by name, passing the desired action in textual form. This might look like:
- “reset_env” to start a new episode, returning a text description like “Environment reset. Current spend=3.”
- “step_env with input ‘1’” to increase the spend, receiving back something like “Action=1, new spend=4, reward=20, done=False.”
As background, LangChain is a framework designed to bridge LLMs with external actions or data sources, often referred to as “tools.” By exposing our simple Marketing environment’s core functions (like reset_env or step_env) as tools, we give the LLM a simple, text-based way to manipulate the environment in real time. Instead of an internal RL loop calling env.step(action) under the hood, the LLM literally “requests” the action—like “step_env: 1”—as part of its natural language output. This approach not only makes the AI’s decision process visible but also showcases how an LLM can be given agentic control of an environment, issuing instructions just as a human would.
From the LLM’s perspective, it’s having a conversation with the environment—no different than if it were chatting with a human. Behind the scenes, we still have the same marketing environment code that calculates net profit. But now, the environment’s entire interface is text-based, letting the LLM “see” each reward in a more literal sense.
```python
###############################
#  Create Tools for the LLM   #
###############################
from langchain.agents import Tool
from typing import Optional

marketing_env = SimpleMarketingEnv()


def reset_env_tool(_input: str) -> str:
    obs, _ = marketing_env.reset()
    return f"Environment reset. Current spend={obs[0]}, step=0"


def step_env_tool(action_str: str) -> str:
    try:
        action = int(action_str)
    except ValueError:
        return "Invalid action. Please provide 0, 1, or 2."
    obs, reward, done, truncated, info = marketing_env.step(action)
    step_info = (
        f"Action={action}, new spend={obs[0]:.1f}, "
        f"reward={reward:.1f}, done={done or truncated}"
    )
    return step_info


def stop_env_tool(_input: str) -> str:
    return "Stop requested. Finalizing environment run."


tools = [
    Tool(
        name="reset_env",
        func=reset_env_tool,
        description="Reset the environment to start a new episode."
    ),
    Tool(
        name="step_env",
        func=step_env_tool,
        description="Take an action in the environment: 0=stay, 1=increase, 2=decrease."
    ),
    Tool(
        name="stop_env",
        func=stop_env_tool,
        description="Stop the environment run."
    ),
]
```
Giving the LLM a Memory
One challenge we face when letting an LLM “drive” an environment step by step is that it doesn’t automatically remember what happened in previous steps. By default, each response the LLM gives is based on the most recent message or two—it has no built-in memory of the entire interaction history. This can lead to the AI repeating unprofitable actions or forgetting that it already discovered a higher net profit earlier.
To fix this, we can give the LLM a conversation memory. In LangChain, this typically means creating a memory object—like ConversationBufferMemory—and passing it to the agent. The idea is that every step, the environment’s output (e.g., “spend=7, reward=14”) and the LLM’s own reasoning (“Thought: let’s lower spend again”) are all appended to a running transcript. The LLM then has access to this entire transcript whenever it decides the next action.
In simpler terms, it’s like the LLM keeping a notebook of everything it has tried so far. For example, if it tested spend=6, saw a reward of 18, and then tested spend=7, saw 14, it can now recall that 6 gave it a higher net profit than 7. Without memory, the LLM might forget which spend levels it already tested and accidentally go back to 7 again. With memory, it’s more likely to continue exploring different actions or stay at the best one found so far.
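The mechanics of that “notebook” are easy to picture without any framework code. The sketch below is purely illustrative (it is not how ConversationBufferMemory works internally): the memory is just accumulated text, and anything recorded in it can be recovered later instead of forgotten.

```python
import re

# Illustrative sketch of transcript-style memory (not LangChain internals):
# every observation and thought is appended to one running log.
transcript = []

def record(line: str) -> None:
    transcript.append(line)

record("Observation: spend=6, reward=18")
record("Thought: let's try a higher spend")
record("Observation: spend=7, reward=14")

# Because the full history is available, the best result seen so far
# can be recalled rather than re-discovered:
results = [(float(m.group(2)), float(m.group(1)))
           for line in transcript
           for m in re.finditer(r"spend=(\d+), reward=(\d+)", line)]
best_reward, best_spend = max(results)
print(f"Best so far: spend={best_spend}, reward={best_reward}")
```

With memory like this attached, the agent can see that spend=6 beat spend=7 and avoid retesting the worse option.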
How the Thought→Action Flow Works
When we specify the Zero-Shot ReAct agent type in LangChain, the LLM is prompted to produce lines of reasoning (“Thought: …”) and then lines indicating which tool it wants to call (“Action: step_env”). After each action, the environment’s response is fed back to the LLM as an “Observation,” which it can incorporate into its next thought. This leads to a loop like:
- Thought: “We’re at spend=3 and net profit=18, which is good but maybe we can do better.”
- Action: “step_env” with input “1” (increase by $1).
- Observation: “Action=1, new spend=4, reward=20, done=False.”
Seeing that net profit is now $20, the LLM might think, “We’ve hit the peak, so let’s stop.” This entire sequence is transparent in the logs, making the AI’s process far more interpretable than a typical RL approach.
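Under the hood, the agent framework pulls the “Action” and “Action Input” lines out of the model’s text and dispatches them to the matching tool. The toy parser below (a simplified sketch, not LangChain’s actual parsing code, with a stubbed tool standing in for the real environment) shows the shape of that loop:

```python
def step_env(action_str: str) -> str:
    # Stub standing in for the real environment tool.
    return f"Action={action_str}, new spend=4.0, reward=20.0, done=False"

tools = {"step_env": step_env}

def parse(llm_output: str):
    """Pull the 'Action:' and 'Action Input:' lines from the model's text."""
    action = action_input = None
    for line in llm_output.splitlines():
        if line.startswith("Action:"):
            action = line.split(":", 1)[1].strip()
        elif line.startswith("Action Input:"):
            action_input = line.split(":", 1)[1].strip()
    return action, action_input

llm_output = ("Thought: We're at spend=3, maybe we can do better.\n"
              "Action: step_env\n"
              "Action Input: 1")
action, action_input = parse(llm_output)
observation = tools[action](action_input)
print("Observation:", observation)
# The observation text is appended to the prompt for the next Thought.
```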
In the code below, we initialize the agent and the LLM using ChatOpenAI, and then run the agent with our prompt. The prompt is essentially a set of instructions telling the LLM how to interact with a marketing spend environment. Instead of letting a hidden algorithm make decisions, we’re guiding the LLM to call specific “tools” (reset_env, step_env, stop_env) so it can raise or lower spend and see the resulting net profit. By reading the environment’s responses (in plain text), the LLM can iteratively decide whether to keep going or stop.
```python
############################################
#  Add Conversation Memory to the Agent    #
############################################
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, initialize_agent, AgentType

# Create a memory object that stores all conversation history
memory = ConversationBufferMemory(memory_key="chat_history")

# Initialize the LLM (ChatOpenAI) with zero temperature
llm = ChatOpenAI(
    temperature=0,
    model_name="gpt-3.5-turbo"  # or "gpt-4" if you have access
)

# Pass the memory object to the agent, so it can see the entire conversation
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
    memory=memory,
    handle_parsing_errors=True
)

###############################
#  Provide the custom prompt  #
###############################
prompt = """
You are an AI controlling a marketing spend environment.
Actions:
- reset_env -> resets environment
- step_env -> step the environment with an action: 0=stay, 1=+1, 2=-1
- stop_env -> finalize your solution

Goal: maximize net profit by choosing good actions.
Keep track of the highest reward you've encountered so far.
Try multiple actions, including lowering or increasing spend all the way
if needed, until you confirm you can't do better.
If no improvement occurs over a few steps, continue experimenting until
you are confident the solution cannot be improved further.
Please keep calling step_env until you believe you've maximized net profit
or see no further improvement. Then call stop_env or just give your final answer.
"""

response = agent.invoke(prompt)
final_text = response.get("output", "")
print("\nFinal Agent Response:\n", final_text)
```
The prompt also spells out both the goal (e.g., “maximize net profit”) and the instructions on how to proceed (like “call step_env until you see no further improvement, then call stop_env”). This ensures the LLM knows why it’s performing each action (to reach the highest net profit) and when it should wrap up (once further changes won’t help). By including these details, we guide the LLM toward completing the task in a logical, stepwise manner, rather than leaving it unsure about what success looks like or when to stop.
Running the code above gives the following output (note: the output will vary with each new run of the code):
```
> Entering new AgentExecutor chain...
I need to start by resetting the environment to begin a new episode and maximize net profit.
Action: reset_env
Action Input: N/A
Observation: Environment reset. Current spend=8.0, step=0
I should start by keeping the current spend and see if it leads to a higher profit.
Action: step_env
Action Input: 0
Observation: Action=0, new spend=8.0, reward=8.0, done=False
The reward is the same as the current spend, so I should try increasing the spend to see if it leads to a higher profit.
Action: step_env
Action Input: 1
Observation: Action=1, new spend=9.0, reward=0.0, done=False
Increasing the spend did not lead to a higher profit, so I should try decreasing the spend.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=8.0, reward=8.0, done=False
Decreasing the spend led to a higher profit, so I should continue decreasing the spend to see if it can be further optimized.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=7.0, reward=14.0, done=False
Decreasing the spend further led to an even higher profit, so I should continue decreasing the spend.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=6.0, reward=18.0, done=False
Continuing to decrease the spend is still increasing the profit, so I should keep going.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=5.0, reward=20.0, done=False
Continuing to decrease the spend is still increasing the profit, so I should keep going.
Action: step_env
Action Input: 2
Observation: Action=2, new spend=4.0, reward=20.0, done=False
Continuing to decrease the spend did not increase the profit further, so I should stop and finalize my solution.
Action: stop_env
Action Input: N/A
Observation: Stop requested. Finalizing environment run.
I now know the final answer is to decrease the marketing spend to $5.0 to maximize net profit.
Final Answer: Decrease the marketing spend to $5.0.

> Finished chain.

Final Agent Response:
 Decrease the marketing spend to $5.0.
```
In this output, you can see the LLM’s entire decision-making process as it tries to find the best spend level. The LLM begins with a spend of $8, tests staying at that level, then systematically tries increasing and decreasing to find the best net profit. It quickly discovers that raising spend to $9 yields no improvement, while lowering spend step by step (down to $5) steadily boosts net profit up to $20. After testing $4 and seeing no further gain, the LLM decides to stop, concluding that spend = $5 is the optimal choice for maximizing net profit.
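As a cross-check, every reward the LLM observed in the transcript matches the environment’s reward formula. This standalone sketch reuses the parabola and the default cost_factor of 1.0:

```python
# Cross-check the rewards from the agent's run against the reward formula:
# reward = -(spend - 5)**2 + 25 - spend * cost_factor, with cost_factor=1.
def reward(spend: float) -> float:
    return -(spend - 5) ** 2 + 25 - spend

observed = {9: 0.0, 8: 8.0, 7: 14.0, 6: 18.0, 5: 20.0, 4: 20.0}
for s, r in observed.items():
    assert reward(s) == r
print("All logged rewards match the formula.")
```

Note the tie: spend=4 and spend=5 both yield a net profit of 20, because the cost term shifts the continuous optimum to spend=4.5. The agent’s “decrease to $5.0” answer is therefore one of two equally good integer spends.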
Conclusion
By letting an LLM directly control our simple marketing environment, we illustrate a more agentic style of AI, where language-based reasoning drives each step rather than a learned policy. It’s not necessarily the standard path to optimizing net profit—traditional RL might still be more robust for large-scale or highly complex tasks. But as a demonstration of Agentic AI principles, it’s compelling: the LLM literally “talks” to the environment, sees textual rewards, and decides how to proceed.
In upcoming posts, I’ll continue to build on the foundational knowledge from the first two posts and explore a more hybrid, realistic approach, where a trained RL sub-policy handles the low-level actions while the LLM serves as a meta-controller—analyzing logs, tweaking hyperparameters, and guiding the entire system from a higher vantage point. For now, our simple environment and tool-based approach should give you a taste of how easily an LLM can step into the role of an RL policy, making decisions in a more transparent, conversational manner.