Sandbox Testing Guide
Abstract
The Sandbox is a secure environment for executing and testing Agent code against specific Tasks. This guide shows how to use the Sandbox framework for integrated testing of Agents and Tasks: initializing the ChainStream environment, starting test Agents, and evaluating Task results.
Task Data Sources
- daily news
- daily dialogue (text from voice transcription)
- chat message records
- email history
- daily arxiv papers
- daily stock market updates
More data sources coming soon...
You can also add your own data sources and place them in the `test_data` folder.
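As a rough illustration of adding a data source, the snippet below writes a small set of items into the `test_data` folder. The folder name comes from this guide; the JSON Lines format, file name, and field names are assumptions, so match whatever format the existing data sources in the repository use.

```python
# A minimal sketch of adding a custom data source to the test_data folder.
# The JSON Lines format, the file name, and the field names are assumptions.
import json
from pathlib import Path

items = [
    {"title": "Example headline", "content": "Example body text."},
    {"title": "Another headline", "content": "More body text."},
]

out_path = Path("test_data") / "my_news.jsonl"  # hypothetical file name
out_path.parent.mkdir(exist_ok=True)
with out_path.open("w", encoding="utf-8") as f:
    for item in items:
        f.write(json.dumps(item) + "\n")
```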
Task Evaluation Metrics
- Success Rate: Does the Agent start without errors?
- Input/Output Correctness: Are the input and output streams correctly selected?
- Static Evaluation: Differences between Agent Generator code and human-written routines.
- Dynamic Evaluation: Differences between Agent Generator output streams and the output streams of human-written routines.
More evaluation metrics coming soon...
You can also add your own evaluation metrics by implementing them in the `evaluate_task` function.
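As a rough sketch, a custom metric might be computed inside `evaluate_task` along the following lines. The method name comes from this guide; the attributes `recorded_output` and `expected_output` and the metric itself are hypothetical placeholders.

```python
# A sketch of custom metrics inside evaluate_task, assumed to live in a
# TaskConfig subclass; recorded_output and expected_output are hypothetical.
def evaluate_task(self, runtime):
    matches = sum(
        1 for out, ref in zip(self.recorded_output, self.expected_output)
        if out == ref
    )
    return {
        # Success Rate: did the Agent produce any output at all?
        "success": len(self.recorded_output) > 0,
        # Dynamic Evaluation: fraction of outputs matching the human routine's outputs.
        "output_match_rate": matches / max(len(self.expected_output), 1),
    }
```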
Task Framework Development
- Select a manually written Agent for evaluation. You can use the pre-developed Agents in the `scripts` folder or write your own; for the process, refer to the ChainStream Agent Development Guide.
- Choose a Task for evaluation. You can refer to the existing Tasks in the `tasks` folder or create a new Task, making sure it inherits from the `TaskConfig` class in `task_config_base.py`. Define the specific task description and input/output streams, and override three methods (see the sketch after this list):
  1. `init_environment`: initialize the task environment and create the test agents and streams.
  2. `start_task`: start the source stream.
  3. `evaluate_task`: evaluate the output stream data processed by the Agent and return the evaluation results.
- Run the selected Agent and Task in the Sandbox.
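A minimal sketch of such a Task is shown below. The base class and the three overridden methods come from this guide; the constructor attribute, the stream names, and the helper `load_test_data` are assumptions, so check the existing Tasks in the `tasks` folder for the actual API.

```python
# A minimal, hypothetical Task sketch; verify attribute names and the stream
# API against the real TaskConfig implementations in the tasks folder.
import chainstream as cs
from task_config_base import TaskConfig  # assumed import path


class MyNewsTaskConfig(TaskConfig):
    def __init__(self):
        super().__init__()
        self.task_description = "Tag each daily news item with a topic."  # assumed attribute

    def init_environment(self, runtime):
        # Create the input/output streams that the Agent under test will use.
        self.input_stream = cs.get_stream("all_news")      # assumed stream name
        self.output_stream = cs.get_stream("tagged_news")  # assumed stream name

    def start_task(self, runtime):
        # Feed the test data into the source stream.
        for item in self.load_test_data():                  # hypothetical helper
            self.input_stream.add_item(item)

    def evaluate_task(self, runtime):
        # Compute and return the evaluation results
        # (see the metric sketch earlier in this guide).
        return {"success": True}  # placeholder result
```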
Note
You can add your Task to the `__init__.py` file in the `tasks` folder and register it in a dictionary named `ALL_TASKS` for centralized management and easier referencing later.
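For example, the registration in `tasks/__init__.py` could look roughly like this; the module path of the Task is an assumption, while the dictionary name follows the note above and the `'ArxivTask'` key follows the example later in this guide.

```python
# tasks/__init__.py — a sketch of centralized Task registration.
from .arxiv_task import ArxivTaskConfig  # hypothetical module path

ALL_TASKS = {
    "ArxivTask": ArxivTaskConfig,
}
```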
Sandbox Framework Development
Note
This requires a running Runtime with evaluation mode enabled, which can monitor the actions of the Agent under test, including its calls to the various APIs of the ChainStream Agent module.
1. Initialization
- ChainStream Initialization: set the Task and Agent to be used.
- Get Runtime Environment: initialize the Runtime using `get_chainstream_core()`.
- Agent Setup: read the Agent script content, accepting either a path to a `.py` file or the code itself as a string.
```python
def __init__(self, task, agent_file):
    # Start a ChainStream core server and keep a handle to its Runtime.
    cs_server.init(server_type='core')
    cs_server.start()
    self.runtime = cs_server.get_chainstream_core()
    self.task = task
    # Accept either a path to a .py file or the Agent source code as a string.
    if isinstance(agent_file, str) and agent_file.endswith('.py'):
        with open(agent_file, 'r') as f:
            agent_file = f.read()
    self.agent_str = agent_file
    self.result = {}
```
2. Start Testing Agent
- Initialize Task Environment: call `init_environment` to initialize the Task environment within the Runtime.
- Start Agent: call `_start_agent` to create an Agent instance, start it, and configure the various action listeners.
- Begin Task Flow: call `start_task` to start the Task data source.
- Evaluate Task: call `evaluate_task` to collect the test results after the data source ends, archive them, and invoke the evaluation functions.
```python
def start_test_agent(self):
    # Prepare the Task environment, launch the Agent, then run the Task and record its output.
    self.task.init_environment(self.runtime)
    self._start_agent()
    self.task.start_task(self.runtime)
    self.task.record_output(self.runtime)

def _start_agent(self):
    # Execute the Agent source code and instantiate the first class it defines.
    namespace = {}
    exec(self.agent_str, globals(), namespace)
    class_object = None
    globals().update(namespace)
    for name, obj in namespace.items():
        if isinstance(obj, type):
            class_object = obj
            break
    if class_object is not None:
        self.agent_instance = class_object()
        self.agent_instance.start()
```
Tip
During development, you can define custom exception classes such as `ExecError`, `StartError`, and `RunningError` to capture and handle errors that may occur at different stages, which makes testing more efficient.
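A possible shape for those exceptions is sketched below; the class names come from the tip above, and the commented usage is an assumption rather than the actual implementation.

```python
# Hypothetical custom exceptions for the different testing stages.
class ExecError(Exception):
    """Raised when exec() fails to load the Agent source code."""


class StartError(Exception):
    """Raised when the Agent instance cannot be created or started."""


class RunningError(Exception):
    """Raised when the Agent fails while processing stream items."""


# Example use inside _start_agent (assumed, not the actual implementation):
#     try:
#         exec(self.agent_str, globals(), namespace)
#     except Exception as err:
#         raise ExecError(f"failed to load agent code: {err}") from err
```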
3. Testing Example
Success
Below is an example demonstrating how to use the `SandBox` class to test a specific Task:
```python
if __name__ == "__main__":
    from tasks import ALL_TASKS_OLD

    ArxivTaskConfig = ALL_TASKS_OLD['ArxivTask']
    agent_file = '''
import chainstream as cs
from chainstream.llm import get_model


class TestAgent(cs.agent.Agent):
    def __init__(self):
        super().__init__("test_arxiv_agent")
        self.input_stream = cs.get_stream("all_arxiv")
        self.output_stream = cs.get_stream("cs_arxiv")
        self.llm = get_model(["text"])

    def start(self):
        def process_paper(paper):
            if "abstract" in paper:
                paper_title = paper["title"]
                paper_content = paper["abstract"]
                paper_versions = paper["versions"]
                stage_tags = ['Conceptual', 'Development', 'Testing', 'Deployment', 'Maintenance', 'Other']
                prompt = "Give you an abstract of a paper: {} and the version of this paper:{}. What tag would you like to add to this paper? Choose from the following: {}".format(paper_content, paper_versions, ', '.join(stage_tags))
                prompt_message = [
                    {
                        "role": "user",
                        "content": prompt
                    }
                ]
                response = self.llm.query(prompt_message)
                print(paper_title + " : " + response)
                self.output_stream.add_item(paper_title + " : " + response)

        self.input_stream.for_each(self, process_paper)

    def stop(self):
        self.input_stream.unregister_all(self)
'''

    oj = SandBox(ArxivTaskConfig(), agent_file)
    oj.start_test_agent()
```
In this example, we select a specific Task, and `agent_file` contains the Agent code required to execute it. The Sandbox can then instantiate and start the `TestAgent` and test its performance.