Evaluation how-to guides
These guides answer “How do I…?” format questions. They are goal-oriented and concrete, and are meant to help you complete a specific task. For conceptual explanations, see the Conceptual guide. For end-to-end walkthroughs, see Tutorials. For comprehensive descriptions of every class and function, see the API reference.
Offline evaluation
Evaluate and improve your application before deploying it.
Run an evaluation
- Run an evaluation
- Run an evaluation asynchronously
- Run an evaluation comparing two experiments
- Evaluate a langchain runnable
- Evaluate a langgraph graph
- Evaluate an existing experiment (Python only)
- Run an evaluation via the REST API
- Run an evaluation from the UI
Define an evaluator
- Define a custom evaluator
- Define an LLM-as-a-judge evaluator
- Define a pairwise evaluator
- Define a summary evaluator
- Use an off-the-shelf evaluator via the SDK (Python only)
- Evaluate intermediate steps
- Return multiple metrics in one evaluator
- Return categorical vs numerical metrics
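To make the evaluator topics above more concrete, here is a minimal sketch in plain Python of a custom evaluator that returns several metrics at once, one numerical and one categorical. The function names, the "answer" field, and the {"key", "score"} / {"key", "value"} result shape are assumptions made for illustration; the linked guides describe the signatures and return formats the SDK actually accepts.

```python
# Hypothetical evaluators written as plain functions. The argument names and
# the result dictionaries are illustrative assumptions, not a confirmed API.

def exact_match(outputs: dict, reference_outputs: dict) -> dict:
    """Numerical metric: 1.0 if the answer equals the reference, else 0.0."""
    predicted = outputs["answer"].strip().lower()
    expected = reference_outputs["answer"].strip().lower()
    return {"key": "exact_match", "score": float(predicted == expected)}


def length_bucket(outputs: dict) -> dict:
    """Categorical metric: label the answer by its word count."""
    n_words = len(outputs["answer"].split())
    label = "short" if n_words < 20 else "long"
    return {"key": "length_bucket", "value": label}


def combined(outputs: dict, reference_outputs: dict) -> list[dict]:
    """Return multiple metrics from a single evaluator call."""
    return [exact_match(outputs, reference_outputs), length_bucket(outputs)]


if __name__ == "__main__":
    print(combined({"answer": "Paris "}, {"answer": "paris"}))
```

Keeping each metric in its own small function makes it easy to test them separately and then bundle them when one evaluator pass should report several scores.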
Configure the evaluation data
Configure an evaluation job
- Evaluate with repetitions
- Handle model rate limits
- Print detailed logs (Python only)
- Run an evaluation locally (beta, Python only)
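The ideas behind repetitions and rate-limit handling can be sketched without any SDK. The example below is only an illustration under stated assumptions: RateLimitError, call_with_backoff, and evaluate_with_repetitions are hypothetical names, and each example is assumed to be a dict with "inputs" and "outputs" entries. It retries a rate-limited call with exponential backoff and averages a score over several repetitions per example.

```python
import random
import time
from statistics import mean


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate-limit exception."""


def call_with_backoff(fn, *args, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying with exponential backoff plus jitter on rate limits."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))


def evaluate_with_repetitions(target, score, examples, num_repetitions=3,
                              sleep=time.sleep):
    """Run target on each example several times and average the scores."""
    averaged = []
    for example in examples:
        scores = []
        for _ in range(num_repetitions):
            output = call_with_backoff(target, example["inputs"], sleep=sleep)
            scores.append(score(output, example["outputs"]))
        averaged.append(mean(scores))
    return averaged
```

Averaging over repetitions smooths out run-to-run variance from a nondeterministic model, and the backoff wrapper keeps a transient rate limit from failing the whole job.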
Unit testing
Unit test your system to identify bugs and regressions.
Online evaluation
Evaluate and monitor your system's live performance on production data.
Automatic evaluation
Set up evaluators that automatically run for all experiments against a dataset.
Analyzing experiment results
Use the UI and API to understand your experiment results.
- Compare experiments with the comparison view
- Filter experiments
- View pairwise experiments
- Fetch experiment results in the SDK
- Upload experiments run outside of LangSmith with the REST API