Winners of 2024


Platinum Prize
Zhimin Zhao, Queen’s University: An Exploratory Study on the Operations and Smells of Foundation Model Leaderboards
This study examines the state of practice of leaderboard operations tailored to foundation models (FM), focusing on their essential role in AI reporting. We aim to improve the understanding and maintainability of leaderboards through a thorough analysis of their characteristics, operations, and smells. To complete this survey, we begin by retrieving multivocal literature reviews on FM evaluation from Google Scholar and GitHub Awesome Lists. We then search for qualified leaderboards, starting from the literature and using backward snowball sampling, complemented by leaderboards hosted on the Hugging Face Hub. Lastly, we analyze the documentation, publication, and commit history of these leaderboards, complemented by (in)direct communication with their administrators. Our findings introduce a novel conceptual framework, called “leaderboard operations” (LBOps), for leaderboard deployment in real-world production. We also introduce the concept of a “leaderboard smell”, which refers to operational issues that can affect the maintainability and trustworthiness of leaderboards. Our study underscores the need for more standardized benchmarking practices to support responsible AI reporting.
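To make the leaderboard-collection step concrete, the sketch below shows one way candidate leaderboards hosted on the Hugging Face Hub could be enumerated. It uses the public huggingface_hub client, but the search term and the screening step are illustrative assumptions rather than the study's actual pipeline.

```python
# Minimal sketch (not the study's actual pipeline): enumerate Hugging Face
# Spaces whose metadata mentions "leaderboard" as candidates for manual screening.
from huggingface_hub import HfApi

api = HfApi()

# Each result is a SpaceInfo object; the Space id is enough to visit and
# screen the candidate leaderboard later.
candidates = [space.id for space in api.list_spaces(search="leaderboard")]

print(f"Found {len(candidates)} candidate leaderboard Spaces")
```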

First Prize
Mohammad Hossein Amini, University of Ottawa: Evaluating the Impact of Flaky Simulators on Testing Autonomous Driving Systems
Fatal crashes involving Tesla and Google’s Autonomous Driving Systems (ADS) have underscored the need for accurate ADS testing. Simulation-based testing is the safe technique of choice for at-scale verification of ADS: simulators exercise very large numbers of system-usage scenarios that would be prohibitively expensive or time-consuming to enact in the real world. Many ADS rely partially or entirely on deep neural networks (DNNs). Recent studies have noted that ADS simulators can be flaky, introducing a major challenge to reproducibility and reliability in ADS testing and potentially causing severe safety issues for autonomous cars. However, to the best of our knowledge, there is no systematic study on the prevalence, impact, and potential mitigation strategies for test flakiness in ADS testing. In this paper, we present a systematic study of flakiness in simulation-based ADS testing and introduce a novel machine-learning-based solution for flakiness mitigation.
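As a rough illustration of the general idea behind flakiness detection and prediction (not the paper's actual technique), the sketch below labels a test scenario as flaky when repeated runs disagree on the verdict and trains a standard classifier to predict flakiness from scenario features. The run_scenario stub and the feature set are hypothetical stand-ins for a real ADS simulator and its scenario parameters.

```python
# Illustrative sketch only: flag a scenario as flaky when repeated simulation
# runs disagree, then train a classifier to predict flakiness without re-running.
import random
from sklearn.ensemble import RandomForestClassifier

def run_scenario(features):
    # Placeholder for a real simulator run (e.g. CARLA); returns a pass/fail verdict.
    return random.random() > features["difficulty"]

def is_flaky(features, runs=5):
    """A scenario is flaky if repeated executions yield different verdicts."""
    return len({run_scenario(features) for _ in range(runs)}) > 1

# Hypothetical scenario features (e.g. difficulty of driving conditions, ego speed).
scenarios = [{"difficulty": random.random(), "speed": random.uniform(10, 30)} for _ in range(50)]
X = [[s["difficulty"], s["speed"]] for s in scenarios]
y = [is_flaky(s) for s in scenarios]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
```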

Second Prize
Pouya Fathollahzadeh, Queen’s University: Towards Refining Developer Queries using LLM-Based Named Entity Recognition
Software developers often spend a significant amount of time seeking answers to questions related to their coding tasks. They typically search for answers online, post queries on Q&A websites, and, more recently, participate in chat communities. However, many of these questions go unanswered or require substantial follow-up and clarification. Automatically identifying ways to refine a developer query so that it adequately captures the problem and the required context, such as software versions, could save time and effort. To address this issue, we first explore the use of Large Language Models (LLMs) for Named Entity Recognition (NER) to identify software engineering (SE)-related entities. We evaluate the performance of Mixtral 8x7B by prompting it for NER tasks across four popular programming chatrooms on Discord and assess how effectively it identifies SE-related entities. Preliminary results show that the approach is highly effective, with an accuracy of 0.89. We then investigate how the presence of specific SE-related entities in queries influences the likelihood and speed of receiving a response. Our next step is to propose query refinements with the goal of making queries more likely to receive answers.
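The snippet below sketches the kind of NER prompt such a study might send to an LLM like Mixtral 8x7B. The entity types, output format, and example question are illustrative assumptions, and call_llm is a placeholder for whichever inference endpoint is used.

```python
# Hedged sketch of an LLM-based NER prompt for developer queries; the entity
# categories and output format are assumptions, not the authors' exact prompt.
NER_PROMPT = """You are an assistant that extracts software engineering entities.
Given a developer question, list every entity of the following types:
Library, Framework, Language, Version, Error, API, Tool.

Return one line per entity in the form: <type>: <surface form>

Question:
{question}
"""

question = "Why does pandas 2.1 raise SettingWithCopyWarning when I chain .loc assignments?"
prompt = NER_PROMPT.format(question=question)

# response = call_llm(prompt)   # placeholder for a local or hosted Mixtral endpoint
print(prompt)
```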

Third Prize
Iren Mazloomzadeh, Polytechnique Montreal: Exploring the Boundaries of Large Language Models in Refactoring
Refactoring plays a crucial role in software engineering, facilitating effective software maintenance in large-scale projects. The principles of refactoring not only enhance code readability but also serve as guidelines for long-term project maintenance, reducing existing code smells and preventing the emergence of new code issues over time. Despite many years of research and effort to automate refactoring, state-of-the-art automatic refactoring tools fail to cover diverse types of refactoring and often suggest inaccurate refactored code snippets. By leveraging the innate capabilities of LLMs in code understanding and generation, we want to examine the potential and pitfalls of LLMs in conducting refactoring. In this study, we perform an empirical evaluation of the capability of LLMs to perform refactoring tasks. We start by collecting refactoring patterns from two GitHub repositories using a state-of-the-art refactoring detection tool. Then, we sample several cases across various types of refactoring and prompt an LLM to execute these refactoring tasks. Our findings reveal that, first, only certain types of refactoring can, in theory, be identified by automatic refactoring tools. On the other hand, although LLMs demonstrate capability in understanding and generating code, they encounter challenges in accurately and comprehensively applying certain types of refactoring.
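As a purely illustrative example of the prompting step, the sketch below asks an LLM to apply a single well-defined refactoring, Extract Method, to a small Java snippet. Both the prompt wording and the snippet are demonstration assumptions rather than the study's actual setup, and call_llm is a hypothetical model endpoint.

```python
# Illustrative sketch: prompt an LLM to apply an Extract Method refactoring.
EXTRACT_METHOD_PROMPT = (
    "Apply an Extract Method refactoring to the following Java code. "
    "Move the duplicated validation logic into a new private method and call it "
    "from both call sites. Preserve behavior exactly and return only the refactored code.\n\n"
    "{code}"
)

java_code = """\
public void register(User u) {
    if (u.getName() == null || u.getName().isEmpty()) throw new IllegalArgumentException();
    repository.save(u);
}
public void update(User u) {
    if (u.getName() == null || u.getName().isEmpty()) throw new IllegalArgumentException();
    repository.update(u);
}"""

prompt = EXTRACT_METHOD_PROMPT.format(code=java_code)
# response = call_llm(prompt)   # placeholder for the chosen model endpoint
print(prompt)
```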