Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems

Abstract

A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of five NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at the model level and the rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with the Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
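To make the idea of rank-level calibration concrete, the following is a minimal sketch, not the evaluation procedure used in the paper. It assumes calibration is summarized with a binned expected calibration error (ECE), which this abstract does not confirm, and computes one score per rank of the N-best list. The function names, the data layout (lists of `(intent, confidence)` pairs sorted by descending confidence), and the choice of 10 equal-width bins are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Assumption: equal-width bins over [0, 1]; the paper may use another metric.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so that conf == 1.0 falls into the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def rank_level_ece(
    nbest_lists: Sequence[Sequence[Tuple[str, float]]],
    gold_intents: Sequence[str],
    n_bins: int = 10,
) -> List[float]:
    """Compute one ECE value per rank position of the N-best list.

    Hypothetical layout: nbest_lists[i] holds (intent, confidence) pairs for
    utterance i, best hypothesis first; gold_intents[i] is the true intent.
    """
    max_rank = max((len(hyps) for hyps in nbest_lists), default=0)
    scores: List[float] = []
    for rank in range(max_rank):
        confidences: List[float] = []
        correct: List[bool] = []
        for hyps, gold in zip(nbest_lists, gold_intents):
            if rank < len(hyps):
                intent, conf = hyps[rank]
                confidences.append(conf)
                correct.append(intent == gold)
        scores.append(expected_calibration_error(confidences, correct, n_bins))
    return scores
```

Under these assumptions, a model-level score could be obtained by pooling the hypotheses from all ranks into a single call to `expected_calibration_error`, while `rank_level_ece` shows whether lower-ranked hypotheses are over- or under-confident.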

Publication
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SigDIAL), 2022
Ranim Khojah
PhD Student in Computer Science

My research interests include Large Language Models, chatbots, and software development.