Research
Browsing Lost Unformed Recollections: A Benchmark for Tip-of-the-Tongue Search and Reasoning [2025]
TRAIL: Trace Reasoning and Agentic Issue Localization [2025]
Glider: Grading LLM Interactions and Decisions using Explainable Ranking [2024]
Lynx: An Open Source Hallucination Evaluation Model [2024]
FinanceBench: A New Benchmark for Financial Question Answering [2023]
SimpleSafetyTests: A Test Suite for Identifying Critical Safety Risks in Large Language Models [2023]
Step by Step to Fairness: Attributing Societal Bias in Task-oriented Dialogue Systems [2023]
Perturbation Augmentation for Fairer NLP [2022]
Many Episode Learning in a Modular Embodied Agent via End-to-End Interaction [2022]