15 open-source agents tracked · Ranked by trust score · Updated 2026-06-04 18:04 UTC
HVTracker independently evaluates 15 open-source observability & evaluation projects using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked observability & evaluation project is MLflow, with a trust score of 90.7/100 (Grade A). Other leading projects include Weights & Biases Weave and Langfuse.
| # | Agent | Grade | Status | Description | Trust | Stars | Language |
|---|---|---|---|---|---|---|---|
| 1 | MLflow | A | Listed | The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, eva | 90.7 | 26.3k | Python |
| 2 | Weights & Biases Weave | A | Listed | Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. | 83.6 | 1.1k | Python |
| 3 | Langfuse | B | Listed | Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte | 77.4 | 28.5k | TypeScript |
| 4 | Promptfoo | B | Listed | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C | 76.7 | 21.9k | TypeScript |
| 5 | Arize Phoenix | B | Listed | AI Observability & Evaluation | 76.2 | 10.0k | Python |
| 6 | Evidently | B | Listed | Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or d | 72.9 | 7.6k | Jupyter Notebook |
| 7 | Giskard | B | Listed | Open-Source Evaluation & Testing library for LLM Agents | 71.0 | 5.4k | Python |
| 8 | LangWatch | B | Listed | The platform for LLM evaluations and AI agent testing | 69.1 | 3.3k | TypeScript |
| 9 | DeepEval | B | Listed | The LLM Evaluation Framework | 67.5 | 15.9k | Python |
| 10 | Helicone | C | Listed | Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 | 58.4 | 5.8k | TypeScript |
| 11 | Ragas | C | Listed | Supercharge Your LLM Application Evaluations | 56.2 | 14.2k | Python |
| 12 | Opik | D | Listed | Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, autom | 46.7 | 19.4k | Python |
| 13 | TruLens | D | Listed | Evaluation and Tracking for LLM Experiments and AI Agents | 42.1 | 3.4k | Python |
| 14 | Agenta | D | Listed | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one | 40.6 | 4.2k | TypeScript |
| 15 | AgentOps | D | Listed | Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame | 35.1 | 5.6k | Python |