Observability & Evaluation

15 open-source agents tracked · Ranked by trust score · Updated 2026-06-04 18:04 UTC

HVTracker independently evaluates 15 open-source observability & evaluation agents using daily signals from GitHub, package registries, and security databases. Each agent is scored on activity, adoption, transparency, safety, and identity. The top-ranked observability & evaluation agent is MLflow, with a trust score of 90.7/100 (Grade A). Other leading projects include Weights & Biases Weave and Langfuse.

Agents: 15 · Avg Trust: 64 · Total Stars: 172.5k · Grade A: 2
| # | Agent | Grade | Status | Description | Trust | Stars | Language |
|---|-------|-------|--------|-------------|-------|-------|----------|
| 1 | MLflow | A | Listed | The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, eva… | 90.7 | 26.3k | Python |
| 2 | Weights & Biases Weave | A | Listed | Weave is a toolkit for developing AI-powered applications, built by Weights & Biases. | 83.6 | 1.1k | Python |
| 3 | Langfuse | B | Listed | 🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Inte… | 77.4 | 28.5k | TypeScript |
| 4 | Promptfoo | B | Listed | Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, C… | 76.7 | 21.9k | TypeScript |
| 5 | Arize Phoenix | B | Listed | AI Observability & Evaluation | 76.2 | 10.0k | Python |
| 6 | Evidently | B | Listed | Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or d… | 72.9 | 7.6k | Jupyter Notebook |
| 7 | Giskard | B | Listed | 🐢 Open-Source Evaluation & Testing library for LLM Agents | 71.0 | 5.4k | Python |
| 8 | LangWatch | B | Listed | The platform for LLM evaluations and AI agent testing | 69.1 | 3.3k | TypeScript |
| 9 | DeepEval | B | Listed | The LLM Evaluation Framework | 67.5 | 15.9k | Python |
| 10 | Helicone | C | Listed | 🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓 | 58.4 | 5.8k | TypeScript |
| 11 | Ragas | C | Listed | Supercharge Your LLM Application Evaluations 🚀 | 56.2 | 14.2k | Python |
| 12 | Opik | D | Listed | Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, autom… | 46.7 | 19.4k | Python |
| 13 | TruLens | D | Listed | Evaluation and Tracking for LLM Experiments and AI Agents | 42.1 | 3.4k | Python |
| 14 | Agenta | D | Listed | The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one… | 40.6 | 4.2k | TypeScript |
| 15 | AgentOps | D | Listed | Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frame… | 35.1 | 5.6k | Python |
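
The headline figures (agent count, average trust, total stars, Grade A count) can be cross-checked against the rows of the table. The sketch below is only an illustration: it assumes "Avg Trust" is the arithmetic mean of the per-agent trust scores rounded to the nearest integer, and that "Total Stars" is a plain sum. The `ROWS` list and the `summarize` helper are hypothetical names introduced here, not part of HVTracker.

```python
from statistics import mean

# (agent, grade, trust score, stars in thousands), copied from the table.
ROWS = [
    ("MLflow", "A", 90.7, 26.3),
    ("Weights & Biases Weave", "A", 83.6, 1.1),
    ("Langfuse", "B", 77.4, 28.5),
    ("Promptfoo", "B", 76.7, 21.9),
    ("Arize Phoenix", "B", 76.2, 10.0),
    ("Evidently", "B", 72.9, 7.6),
    ("Giskard", "B", 71.0, 5.4),
    ("LangWatch", "B", 69.1, 3.3),
    ("DeepEval", "B", 67.5, 15.9),
    ("Helicone", "C", 58.4, 5.8),
    ("Ragas", "C", 56.2, 14.2),
    ("Opik", "D", 46.7, 19.4),
    ("TruLens", "D", 42.1, 3.4),
    ("Agenta", "D", 40.6, 4.2),
    ("AgentOps", "D", 35.1, 5.6),
]


def summarize(rows):
    """Recompute the headline figures from (agent, grade, trust, stars_k) rows.

    Assumption: the average is an unweighted mean rounded to an integer.
    """
    return {
        "agents": len(rows),
        "avg_trust": round(mean(trust for _, _, trust, _ in rows)),
        "total_stars_k": round(sum(stars for _, _, _, stars in rows), 1),
        "grade_a": sum(1 for _, grade, _, _ in rows if grade == "A"),
    }


summary = summarize(ROWS)
print(summary)
```

Under these assumptions the agent count, average trust, and Grade A count match the summary line. The star total computed from the rounded per-row values comes out about 0.1k higher than the listed 172.5k, which is presumably a rounding effect, since the table shows stars to one decimal place only.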