AgentBench
Active·★ 3.5k·Apache-2.0·Updated 2026-02-08
★ Trending · ★ Essential
A Comprehensive Benchmark to Evaluate LLMs as Agents (ICLR'24)
AgentBench is a comprehensive benchmark for evaluating Large Language Models (LLMs) as agents across diverse environments. It now includes a function-calling version integrated with AgentRL. Its containerized setup covers tasks such as OS interaction, database operations, and web shopping, which keeps agent evaluation reproducible.
#LLM Evaluation #Agent Benchmarking #Function Calling #Docker #Multi-task Learning
01 Features
01 Comprehensive LLM-as-Agent evaluation across diverse environments.
02 Function-calling integration for advanced agent interaction.
03 Fully containerized deployment using Docker Compose for reproducibility.
04 Multi-task and multi-turn interaction for realistic agent assessment.
05 Extensible framework for adding new evaluation tasks.
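To make the multi-turn, function-calling style of evaluation listed above more concrete, here is a minimal sketch of an episode loop. It is not AgentBench's API: `ToolCall`, `CounterEnv`, `run_episode`, `greedy_agent`, and `success_rate` are hypothetical names, the environment is a toy task, and the "agent" is a scripted stand-in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolCall:
    """Hypothetical function call emitted by an agent (not an AgentBench type)."""
    name: str
    arguments: dict


@dataclass
class CounterEnv:
    """Toy environment: reach `target` with 'add' calls, then call 'submit'."""
    target: int
    value: int = 0
    done: bool = False
    success: bool = False

    def step(self, call: ToolCall) -> str:
        # Execute one tool call and return a text observation.
        if call.name == "add":
            self.value += int(call.arguments["amount"])
            return f"value={self.value}"
        if call.name == "submit":
            self.done = True
            self.success = self.value == self.target
            return "submitted"
        return f"unknown tool: {call.name}"


# An agent maps the interaction history to the next tool call.
Agent = Callable[[List[str]], ToolCall]


def run_episode(env: CounterEnv, agent: Agent, max_turns: int = 10) -> bool:
    """Alternate agent calls and environment observations until done or out of turns."""
    history = [f"target={env.target}"]
    for _ in range(max_turns):
        call = agent(history)
        history.append(env.step(call))
        if env.done:
            break
    return env.success


def greedy_agent(history: List[str]) -> ToolCall:
    """Scripted stand-in for an LLM: add the remaining gap, then submit."""
    target = int(history[0].split("=")[1])
    value = 0
    for observation in history[1:]:
        if observation.startswith("value="):
            value = int(observation.split("=")[1])
    if value == target:
        return ToolCall("submit", {})
    return ToolCall("add", {"amount": target - value})


def success_rate(targets: List[int], agent: Agent) -> float:
    """Fraction of episodes in which the agent solved the task."""
    results = [run_episode(CounterEnv(target=t), agent) for t in targets]
    return sum(results) / len(results)
```

A real harness would replace `greedy_agent` with a model call and `CounterEnv` with a containerized task, but the loop shape (history in, tool call out, observation appended) stays the same.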
02 Compatibility
Docker · Native · Verified via docs
Python · Native · Verified via docs
OpenAI API · Supported · Verified via docs
Large Language Models · Supported · Verified via docs
03 Quick start
$ pip install -r requirements.txt
04 Use cases
↳ Systematically benchmark the performance of various LLM-based agents.
↳ Develop and refine advanced LLM agent architectures and strategies.
↳ Conduct academic research on the capabilities and limitations of agentic AI.
05 Alternatives
GitHub MCP Server · ★ 30.3k
GitHub's official MCP Server. Allows AI agents to interact directly with your GitHub repositories (read files, search code, issues).
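chinese-llm-benchmark · ★ 6.1k
ReLE evaluation: a continuously updated capability benchmark for Chinese AI large language models. It currently covers 335 models, including commercial models such as chatgpt, gpt-5.2, o4-mini, Google gemini-3-pro, Claude-4.5, Wenxin ERNIE-X1.1, ERNIE-5.0-Thinking, qwen3-max, Baichuan, iFlytek Spark, and SenseTime senseChat, as well as open-source models such as kimi-k2, ernie4.5, minimax-M2, deepseek-v3.2, qwen3-2507, llama4, Zhipu GLM-4.6, gemma3, and mistral. Besides a leaderboard, it provides a library of more than 2 million large-model defects that the community can use to study, analyze, and improve large models.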
FinnewsHunter · ★ 1.4k
FinnewsHunter: Multi-agent financial intelligence platform powered by AgenticX. Real-time news analysis, sentiment fusion, and alpha factor mining.