Clarification question models currently achieve only limited user engagement in
search systems, casting doubt on their overall usefulness. To improve the
performance of these models, it is crucial to employ
assessment approaches that encompass both real-time feedback from users (online
evaluation) and the characteristics of clarification questions evaluated
through human assessment (offline evaluation). However, the relationship
between online and offline evaluations has long been debated in information
retrieval, with studies reporting discordance between the two. This study
investigates whether such discordance also holds in search clarification. We
use online user engagement as the ground truth and employ several offline
labels to examine to what extent the offline ranked lists of clarifications
resemble the ideal ranked lists derived from online user engagement.
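To make the comparison concrete, the following minimal sketch illustrates one way such an analysis could be set up: clarifications are ranked once by an offline human-assessed label and once by online engagement, and the agreement between the two rankings is measured with a rank correlation. The label names, engagement scores, and the choice of Kendall's tau are assumptions for illustration, not the study's actual data or metric.

```python
# Hypothetical sketch: comparing an offline-label ranking of clarifications
# against the "ideal" ranking induced by online user engagement.
from scipy.stats import kendalltau

# Each clarification has an offline human-assessed label and an online
# engagement score; all values here are made up for illustration.
clarifications = [
    {"id": "c1", "offline_label": 0.9, "engagement": 0.31},
    {"id": "c2", "offline_label": 0.7, "engagement": 0.42},
    {"id": "c3", "offline_label": 0.6, "engagement": 0.12},
    {"id": "c4", "offline_label": 0.4, "engagement": 0.27},
]

# Rank by the offline label and by online engagement (descending).
offline_order = sorted(clarifications, key=lambda c: -c["offline_label"])
online_order = sorted(clarifications, key=lambda c: -c["engagement"])

# Map each clarification to its rank position under both orderings.
offline_rank = {c["id"]: i for i, c in enumerate(offline_order)}
online_rank = {c["id"]: i for i, c in enumerate(online_order)}

ids = [c["id"] for c in clarifications]
tau, p_value = kendalltau(
    [offline_rank[i] for i in ids],
    [online_rank[i] for i in ids],
)
print(f"Kendall's tau between offline and engagement rankings: {tau:.2f}")
```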