Marta Ziosi
University of Oxford
Marta Ziosi is a Postdoctoral Researcher at the Oxford Martin AI Governance Initiative, where she leads the workstream on AI best practices and conducts research on standards for Advanced AI. Marta has a background in policy, philosophy and mathematics, and she holds a PhD from the Oxford Internet Institute, where her doctoral research focused on algorithmic bias. She currently serves as a vice-chair for the EU GPAI Code of Practice.
Jat Singh
RC-Trust, Germany & University of Cambridge
Jat leads the Compliant and Accountable Systems research group. The group considers the mechanisms by which technology can be better designed, engineered and deployed to accord with legal and regulatory concerns, and works to better ground policy/regulatory discussions in technical realities.
Tobin South
Stanford University
Tobin leads AI Agents at WorkOS, building enterprise-ready MCP and agent tools, and is a research fellow at Stanford University, where he leads research for the Loyal Agents initiative at HAI. Tobin completed his PhD at MIT in 2025 on "Private, Verifiable, and Auditable AI Systems", during which he was a senior fellow with the E14 VC fund. Tobin was an Australian-American Fulbright Future Scholar for his time at MIT, a Pivotal Fellowship mentor, and an author of the 2025 International AI Safety Report.
Niloofar Mireshghallah
Meta FAIR
Niloofar Mireshghallah is an incoming assistant professor at CMU (EPP & LTI) and a Research Scientist at Meta FAIR. Before that, she was a postdoctoral scholar at the Paul G. Allen School of Computer Science & Engineering at the University of Washington. She received her Ph.D. from the CSE department of UC San Diego in 2023. Her research interests are trustworthy machine learning and natural language processing. She is a recipient of the National Center for Women & IT (NCWIT) Collegiate Award in 2020 for her work on privacy-preserving inference, a finalist of the Qualcomm Innovation Fellowship in 2021, and a recipient of the 2022 Rising Star in Adversarial ML award.
Lucilla Sioli
EU AI Office
Lucilla is the Director of the European AI Office of the European Commission. She is responsible for the coordination of the European AI strategy, including the implementation of the AI Act and international collaboration in trustworthy AI and AI for good.
EU Approach to General Purpose AI Governance
The EU AI Act establishes the first comprehensive regulatory framework for AI. As AI rapidly evolves, the European Commission's AI Office develops and implements policies that ensure safety, accountability, and innovation.
This keynote outlines the EU's approach to regulating General Purpose AI under the AI Act. It highlights key initiatives led by the AI Office and how to engage with them. These include the Codes of Practice for regulatory guidance, a tender on AI safety to build evaluation and enforcement capacity, and a Network of Evaluators to advance scientific evaluations. Additionally, it presents the Scientific Panel, which aims to provide technical and scientific input for the enforcement of the AI Act.
Cozmin Ududec
UK AI Security Institute
Cozmin Ududec leads the Science of Evaluations team at the AI Security Institute. He was previously Chief Scientist at Invenia Labs, an applied ML startup focused on optimising electricity grids. Cozmin received his PhD from the University of Waterloo and the Perimeter Institute for Theoretical Physics.
Beyond Pass/Fail: Extracting Behavioral Insights from Large-Scale AI Agent Safety Evaluations
Automated LLM-based agent evaluations have become a standard for assessing AI capabilities in both industry and government, but current reporting practices focus on what agents accomplish, with little insight into how they accomplish it. In this talk I will discuss how UK AISI mines evaluation transcripts to (i) detect issues in evaluation tasks that could lead to mis-estimating capabilities, and (ii) understand how agent capabilities are evolving. I will survey a selection of AISI's methods, tools, and results, and outline research opportunities for better analysis instruments and their connection to safety and governance.
Shayne Longpre
MIT
Shayne is a PhD Candidate at MIT. His research focuses on methods for training and evaluating general-purpose AI systems, often with implications for AI policy. He leads the Data Provenance Initiative, as well as efforts to introduce AI flaw reporting and safe harbors to proprietary systems. He has received recognition for his research with best paper awards from ACL (2024) and NAACL (2024, 2025), as well as coverage by the NYT, Washington Post, Atlantic, 404 Media, Vox, and MIT Tech Review.
In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we design and propose new flaw reporting and coordination measures for GPAI systems, including flaw report forms designed for rapid triaging, AI bug bounty programs, and coordination centers for universally transferable flaws that may pertain to many developers at once. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
Victor Ojewale
Brown University
Victor is a CS PhD student at Brown University. He is also affiliated with the Center for Tech Responsibility, Reimagination and Redesign (CNTR) and the Data Science Institute (DSI), where he is advised by Prof. Suresh Venkatasubramanian. Victor's research interests lie in understanding perceptions of algorithmic systems, AI audits, and sociotechnical evaluation of Large Language Models (LLMs). Victor is also a member of the RISE Lab at Brown University, where he works with Prof. Malik Boykin. Previously, he studied Computer Science at the University of Ibadan.
Technical AI Governance in Practice: What Tools Miss, and Where We Go Next
Audits are increasingly used to identify risks in deployed AI systems, but current audit tooling often falls short by focusing narrowly on evaluation while neglecting key needs like harms discovery, audit communication, and support for advocacy. Based on interviews with 35 practitioners and a landscape analysis of over 400 tools, I outline how this limited scope hinders effective accountability. Yet even where tools do focus on evaluation, they often rely on monolingual and decontextualized methods that fail to capture real-world model behaviour. I illustrate this through a case study on multilingual evaluation, where we developed functional benchmarks in six languages. These benchmarks reveal significant cross-linguistic fragility in LLM performance and underscore the risks of governance frameworks that assume language-agnostic capability. Together, these findings point to the need for a more expansive vision of technical governance that centers contextual robustness and the infrastructural conditions for meaningful accountability.