At Character.AI, the top character by engagement (defined as chats per day) is a Spanish Tutor. However, we noticed that its actual response accuracy (scored by human reviewers) is low. How would you figure out what’s happening, and what follow-up steps would you recommend to your PM?