Marta Ziosi
Oxford
Marta Ziosi is a Postdoctoral Researcher at the Oxford Martin AI Governance Initiative, where she leads the workstream on AI best practices and conducts research on standards for advanced AI. Marta has a background in policy, philosophy and mathematics, and she holds a PhD from the Oxford Internet Institute, where her doctoral research focused on algorithmic bias. She currently serves as a vice-chair for the EU GPAI Code of Practice.
Jat Singh
Cambridge
Jat leads the Compliant and Accountable Systems research group. The group considers the mechanisms by which technology can be better designed, engineered and deployed to accord with legal and regulatory concerns, and works to better ground policy/regulatory discussions in technical realities.
More panelists to follow
Lucilla Sioli
EU AI Office
Lucilla is the Director of the European AI Office of the European Commission. She is responsible for the coordination of the European AI strategy, including the implementation of the AI Act and international collaboration in trustworthy AI and AI for good.
Cozmin Ududec
UK AI Security Institute
Cozmin Ududec leads the Science of Evaluations team at the AI Security Institute. He was previously Chief Scientist at Invenia Labs, an applied ML startup focused on optimising electricity grids. Cozmin received his PhD from the University of Waterloo and the Perimeter Institute for Theoretical Physics.
Beyond Pass/Fail: Extracting Behavioral Insights from Large-Scale AI Agent Safety Evaluations
Automated LLM-based agent evaluations have become a standard for assessing AI capabilities in both industry and government, but current reporting practices focus on what agents accomplish, with little insight into how they accomplish it. In this talk I will discuss how UK AISI mines evaluation transcripts to (i) detect issues in evaluation tasks that could lead to mis-estimating capabilities, and (ii) understand how agent capabilities are evolving. I will survey a selection of AISI's methods, tools, and results, and outline research opportunities for better analysis instruments and their connection to safety and governance.
Shayne Longpre
MIT
Shayne is a PhD Candidate at MIT. His research focuses on methods for training and evaluating general-purpose AI systems, often with implications for AI policy. He leads the Data Provenance Initiative, as well as efforts to introduce AI flaw reporting and safe harbors to proprietary systems. He has received recognition for his research with best paper awards from ACL (2024) and NAACL (2024, 2025), as well as coverage by the NYT, Washington Post, Atlantic, 404 Media, Vox, and MIT Tech Review.
In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Based on a collaboration among experts in software security, machine learning, law, social science, and policy, we design and propose new flaw reporting and coordination measures for GPAI systems, including flaw report forms designed for rapid triaging, AI bug bounty programs, and coordination centers for universally transferable flaws that may pertain to many developers at once. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
Victor Ojewale
Brown University
Victor is a CS PhD student at Brown University. He is also affiliated with the Center for Tech Responsibility, Reimagination and Redesign (CNTR) and the Data Science Institute (DSI), where he is advised by Prof. Suresh Venkatasubramanian. Victor's research interests lie in understanding perceptions of algorithmic systems, AI audits, and sociotechnical evaluation of Large Language Models (LLMs). Victor is also a member of the RISE Lab at Brown University, where he works with Prof. Malik Boykin. Previously, he studied Computer Science at the University of Ibadan.
Technical AI Governance in Practice: What Tools Miss, and Where We Go Next
Audits are increasingly used to identify risks in deployed AI systems, but current audit tooling often falls short by focusing narrowly on evaluation while neglecting key needs such as harms discovery, audit communication, and support for advocacy. Based on interviews with 35 practitioners and a landscape analysis of over 400 tools, I outline how this limited scope hinders effective accountability. Yet even where tools do focus on evaluation, they often rely on monolingual and decontextualized methods that fail to capture real-world model behaviour. I illustrate this through a case study on multilingual evaluation, in which we developed functional benchmarks in six languages. These benchmarks reveal significant cross-linguistic fragility in LLM performance and underscore the risks of governance frameworks that assume language-agnostic capability. Together, these findings point to the need for a more expansive vision of technical governance that centers contextual robustness and the infrastructural conditions for meaningful accountability.