Evaluating N-best Calibration of Natural Language Understanding for Dialogue Systems


A Natural Language Understanding (NLU) component can be used in a dialogue system to perform intent classification, returning an N-best list of hypotheses with corresponding confidence estimates. We perform an in-depth evaluation of 5 NLUs, focusing on confidence estimation. We measure and visualize calibration for the 10 best hypotheses at the model level and the rank level, and also measure classification performance. The results indicate a trade-off between calibration and performance. In particular, Rasa (with the Sklearn classifier) had the best calibration but the lowest performance scores, while Watson Assistant had the best performance but poor calibration.
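To make the model-level versus rank-level distinction concrete, here is a minimal sketch. It assumes expected calibration error (ECE) with equal-width bins as the calibration measure; the abstract does not name the metric used in the paper, so this is an illustrative choice rather than the authors' method. The function names and the N-best data layout (a list of `(intent, confidence)` pairs per utterance, sorted by confidence) are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

NBest = List[Tuple[str, float]]  # assumed layout: [(intent, confidence), ...]


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 (or a stray out-of-range value) stays in range.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def model_level_ece(nbest_lists: List[NBest], gold: List[str], n_bins: int = 10) -> float:
    """Pool every hypothesis from every N-best list into one ECE value."""
    confs: List[float] = []
    oks: List[bool] = []
    for hyps, truth in zip(nbest_lists, gold):
        for intent, conf in hyps:
            confs.append(conf)
            oks.append(intent == truth)
    return expected_calibration_error(confs, oks, n_bins)


def rank_level_ece(
    nbest_lists: List[NBest], gold: List[str], n_bins: int = 10
) -> Dict[int, float]:
    """Compute one ECE value per rank position (1 = top hypothesis)."""
    by_rank: Dict[int, Tuple[List[float], List[bool]]] = {}
    for hyps, truth in zip(nbest_lists, gold):
        for rank, (intent, conf) in enumerate(hyps, start=1):
            confs, oks = by_rank.setdefault(rank, ([], []))
            confs.append(conf)
            oks.append(intent == truth)
    return {r: expected_calibration_error(c, o, n_bins) for r, (c, o) in by_rank.items()}
```

Under these assumptions, a per-rank breakdown can show miscalibration in lower-ranked hypotheses that a single pooled score would average away.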

In SIGDIAL Conference
Ranim Khojah
PhD Student in Computer Science

My research interests include Natural Language Processing, chatbots, and software development.