Cofounder & CEO | Confident AI
Creator of DeepEval (14k+ stars, 3M+ monthly downloads), the most widely adopted open-source LLM evaluation framework used by OpenAI, Google, and Microsoft. Also built DeepTeam (1.4k+ stars) for red-teaming LLM systems with 50+ vulnerability tests. Imperial College London graduate. Former SWE at Google (YouTube) and Microsoft AI (Office 365). YC W25, raised $2.2M seed in 5 days.
Biography
**Jeffrey Ip** is the Cofounder & CEO of Confident AI, a San Francisco-based startup he founded in 2024 with Kritin Vongthongsri after previously working as an engineer at Google (YouTube) and Microsoft AI (Office 365). He created DeepEval, an open-source LLM evaluation framework that has grown to become one of the most widely adopted in the world, used by enterprises including BCG, AstraZeneca, and Mercedes-Benz for testing AI systems. Under his leadership, Confident AI raised a $2.2 million oversubscribed seed round in just five days with participation from Y Combinator, Flex Capital, and others.
DeepEval is a comprehensive open-source framework for evaluating large language model applications, designed as a pytest-like tool specifically for LLM unit testing. It matters because it provides standardized, deterministic evaluation metrics for AI systems, addressing the critical challenge of measuring LLM performance reliably. The framework has achieved 14.2k GitHub stars, over 3 million monthly downloads, and processes 600k-800k daily evaluations for enterprises like BCG, AstraZeneca, AXA, and Capgemini.
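The pytest-like workflow can be illustrated with a minimal, self-contained sketch. Note that `TestCase`, `relevancy_metric`, and `assert_test` here are hypothetical stand-ins written for this example, not DeepEval's actual APIs, and a real framework would score with an LLM judge rather than keyword overlap:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    # A minimal stand-in for an LLM test case: the prompt and the model's answer.
    input: str
    actual_output: str

def relevancy_metric(case: TestCase) -> float:
    # Hypothetical metric: fraction of input keywords echoed back in the output.
    # A real evaluation framework would score this with an LLM judge instead.
    keywords = set(case.input.lower().split())
    answered = set(case.actual_output.lower().split())
    return len(keywords & answered) / len(keywords)

def assert_test(case: TestCase, threshold: float = 0.5) -> None:
    # pytest-style assertion: the test fails when the metric score falls
    # below the threshold, exactly like a unit test fails on a bad value.
    score = relevancy_metric(case)
    assert score >= threshold, f"score {score:.2f} below threshold {threshold}"

case = TestCase(input="what is the capital of France",
                actual_output="The capital of France is Paris.")
assert_test(case)  # passes: most input keywords appear in the answer
```

The point of the pattern is that LLM outputs become assertable artifacts, so regressions surface in CI the same way a failing unit test does.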
G-Eval is a novel evaluation framework that uses LLM-as-a-judge with chain-of-thought prompting to assess LLM outputs against custom criteria. This matters because it brings near-human accuracy to automated evaluation while remaining fully customizable and reproducible at scale. The framework has become the most versatile metric in DeepEval, capable of evaluating almost any use case, and has been widely adopted for its ability to provide structured, reliable scoring of LLM outputs.
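The G-Eval recipe, turning a plain-language criterion into chain-of-thought evaluation steps and having an LLM judge score against them, can be sketched with a stubbed judge. The `judge` function below is a placeholder for a real LLM call, and the prompt shape is illustrative rather than the framework's actual template:

```python
def judge(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned score for the demo.
    return "8"

def g_eval(criteria: str, steps: list[str], input: str, output: str) -> float:
    # Build a chain-of-thought judging prompt: the criterion, the numbered
    # evaluation steps, and the test case to be graded.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    prompt = (
        f"Criteria: {criteria}\n"
        f"Evaluation steps:\n{numbered}\n"
        f"Input: {input}\nOutput: {output}\n"
        "Follow the steps, then answer with a single score from 1-10."
    )
    raw_score = int(judge(prompt))
    return raw_score / 10  # normalize to a 0-1 score

score = g_eval(
    criteria="Is the output factually consistent with the input?",
    steps=["Extract the claims made in the output",
           "Check each claim against the input",
           "Penalize contradictions heavily"],
    input="The Eiffel Tower is in Paris.",
    output="The Eiffel Tower is located in Paris, France.",
)
print(score)  # 0.8 with the stubbed judge
```

Making the judge spell out its steps before scoring is what pushes agreement with human raters higher than asking for a bare number.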
Jeffrey developed deterministic evaluation metrics based on LLM-powered decision trees that provide full control and reproducibility in LLM assessment. This matters because traditional LLM evaluation suffers from non-determinism, making comparisons unreliable. The DAG (Directed Acyclic Graph) metric architecture enables engineers to create use-case-specific, fully deterministic evaluation pipelines that are more controllable than traditional G-Eval metrics.
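The decision-tree idea behind a DAG-style metric can be sketched as a hypothetical example: each node asks the judge one narrow yes/no question, and the path through the tree determines the score deterministically. The `ask` function stands in for an LLM judge constrained to yes/no verdicts and is faked with keyword checks so the example runs:

```python
def ask(question: str, output: str) -> bool:
    # Stand-in for an LLM judge constrained to a yes/no verdict.
    # Faked here with keyword checks so the sketch is runnable.
    if "apology" in question:
        return "sorry" in output.lower()
    if "solution" in question:
        return "restart" in output.lower()
    return False

def dag_metric(output: str) -> float:
    # Each branch is decided by one narrow yes/no question, so the same
    # output always walks the same path to the same final score.
    if not ask("Does the reply contain an apology?", output):
        return 0.0
    if ask("Does the reply offer a concrete solution?", output):
        return 1.0
    return 0.5

print(dag_metric("Sorry about that! Please restart the app."))  # 1.0
print(dag_metric("Sorry, we can't help with this."))            # 0.5
```

Because every node returns a hard verdict rather than a free-form score, two runs over the same output cannot disagree, which is the controllability the paragraph above contrasts with G-Eval.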
The Confident AI platform is a centralized SaaS solution that extends DeepEval's capabilities with cloud-based collaboration, dataset curation, and automated LLM testing with tracing. This matters because it provides enterprise-grade infrastructure for teams to benchmark, safeguard, and improve LLM applications at scale. The platform helps companies save up to 80% in LLM inference costs and hundreds of hours weekly on fixing breaking changes.
Developed DeepEval's integration with HuggingFace's transformers library, including the DeepEvalHuggingFaceCallback for real-time evaluation during model fine-tuning. This matters because it bridges the gap between model development and evaluation, allowing developers to assess LLM outputs directly within their training pipelines. The integration enables seamless evaluation of HuggingFace models and has become a critical tool for ML engineers working with transformer-based models.
Created extensive educational content and thought leadership around LLM evaluation best practices, including comprehensive guides on metrics, testing methodologies, and evaluation frameworks. This matters because it has helped establish industry standards and best practices for LLM evaluation, educating thousands of developers and organizations. The content has positioned Confident AI as a thought leader in the space and contributed to widespread adoption of systematic LLM evaluation approaches.
DeepEval is an open-source LLM evaluation framework I've been working on for the past year, and all of its LLM evaluation metrics use LLM-as-a-judge.
Today I'm proud to announce Confident AI's oversubscribed $2.2M seed round with participation from Y Combinator, Flex Capital, Oliver Jung, Vermilion Cliffs Ventures, Liquid 2 Ventures, January Capital, and Rebel Fund.
LLM-as-a-Judge is a powerful technique that uses LLMs to evaluate LLM responses against any criteria of your choice; in other words, it uses LLMs to carry out LLM (system) evaluation.
And so that was how we came to apply to YC with Confident AI, get accepted, and close our seed round in five days.
The concept is straightforward: provide an LLM with an evaluation criterion, and let it do the grading for you.
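That concept fits in a few lines. This is a minimal sketch, not DeepEval code: the `llm` function is a stub standing in for a real model call, and the `Score: N/10` format is an assumed convention for parsing the judge's free-text answer:

```python
import re

def llm(prompt: str) -> str:
    # Stand-in for a real model call; a canned judge response for the demo.
    return "The response fully addresses the question. Score: 9/10"

def grade(criterion: str, response: str) -> int:
    # Hand the criterion and the response to the judge, then parse the
    # numeric score out of its free-text answer.
    prompt = (f"Criterion: {criterion}\n"
              f"Response to evaluate: {response}\n"
              "Give a score from 1 to 10 in the form 'Score: N/10'.")
    match = re.search(r"Score:\s*(\d+)/10", llm(prompt))
    return int(match.group(1)) if match else 0

print(grade("Answer the user's question helpfully",
            "Paris is the capital of France."))  # 9 with the stubbed judge
```

In practice the hard parts are everything around this loop: constraining the judge's output format, calibrating scores, and keeping results stable across runs.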