Departmental Colloquium 6.6.24

Usual Time
Thursday, 6.6.24, at 12:00
Place
Auditorium, Computer Science Building No. 503
More Details

Title: On the Brittleness of Evaluation in NLP

Abstract: 

Large language models (LLMs; e.g., ChatGPT) are commonly evaluated against several popular benchmarks, including HELM, MMLU, and BIG-bench, all of which rely on a single prompt template per task. I will begin by presenting our recent large-scale statistical analysis of over 6.5M samples, showing that minimal prompt paraphrases lead to drastic changes in both the absolute performance and the relative ranking of different LLMs. These results call into question many recent empirical observations about the strengths and weaknesses of LLMs. I will then discuss desiderata for more meaningful evaluation in NLP, leading to our formulation of diverse metrics tailored to different use cases, and conclude with a proposal for a probabilistic benchmarking approach for modern LLMs.
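
For readers unfamiliar with prompt-sensitivity analyses, the minimal Python sketch below illustrates the general idea of scoring the same task under several prompt paraphrases. It is not the speaker's code or data; the model_answer helper, the templates, and the toy dataset are hypothetical placeholders.

```python
# Minimal illustrative sketch, NOT the speaker's actual code or data:
# it only shows the idea of scoring one task under several prompt paraphrases.
# `model_answer`, the templates, and the toy dataset are hypothetical placeholders.

PARAPHRASES = [
    "Question: {q}\nAnswer:",
    "Q: {q}\nA:",
    "Please answer the following question.\n{q}\nThe answer is:",
]

DATASET = [  # hypothetical (question, gold answer) pairs
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "eight"),
]


def model_answer(prompt: str) -> str:
    """Placeholder for a call to an LLM; replace with a real model/API call."""
    raise NotImplementedError


def accuracy_per_template(answer_fn=model_answer):
    """Return one accuracy score per prompt template."""
    scores = []
    for template in PARAPHRASES:
        correct = 0
        for question, gold in DATASET:
            prediction = answer_fn(template.format(q=question))
            correct += int(gold.lower() in prediction.lower())
        scores.append(correct / len(DATASET))
    return scores

# A large spread between the returned scores (or rank flips when several models
# are compared this way) is the kind of brittleness the abstract describes.
```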

Bio:

Dr. Gabriel Stanovsky is a senior lecturer (assistant professor) in the School of Computer Science & Engineering at the Hebrew University of Jerusalem, and a research scientist at the Allen Institute for AI (AI2). He did his postdoctoral research at the University of Washington and AI2 in Seattle, working with Prof. Luke Zettlemoyer and Prof. Noah Smith, and his PhD with Prof. Ido Dagan at Bar-Ilan University. He is interested in developing natural language processing models that deal with real-world texts and help answer multi-disciplinary research questions in archeology, law, medicine, and more. His work has received awards at top-tier venues, including ACL, NAACL, and CoNLL, and has been covered by popular outlets such as Science, New Scientist, and The New York Times.