A product team needs to be confident that their LLM-enabled product will perform well in the wild. LLMs are tricky. They don’t always give the identical answer to the same or similar input — on one run a model may even tell you that 1 is greater than 4. This is called non-determinism.
How to Solve this Problem❓
To produce high-quality LLM products, teams will need to combine analytics with evaluation.
Combining analytics with evaluation is the key to improving performance. Analytics surface the questions users actually ask when using the model. Those questions become the evaluations product teams use to measure performance. Teams then gather additional data, retrain or fine-tune the model, and release it again.
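That loop can be sketched in a few lines. This is a minimal sketch, not a real pipeline: `collect_user_questions`, `build_eval_set`, and the toy lookup-table "model" are all hypothetical stand-ins for a team's actual analytics and training tooling.

```python
def collect_user_questions(analytics_log):
    # Analytics step: surface the questions users ask in production.
    return [event["question"] for event in analytics_log]

def build_eval_set(questions, labeled_answers):
    # Turn surfaced questions into (question, expected_answer) evals.
    return [(q, labeled_answers[q]) for q in questions if q in labeled_answers]

def run_evals(model, eval_set):
    # Score the model: fraction of questions answered as expected.
    hits = sum(1 for q, expected in eval_set if model(q) == expected)
    return hits / len(eval_set)

# Toy stand-in for an LLM: a lookup table (note the deliberate wrong answer).
toy_model = {"capital of France?": "Paris", "2 + 2?": "5"}.get

log = [{"question": "capital of France?"}, {"question": "2 + 2?"}]
labels = {"capital of France?": "Paris", "2 + 2?": "4"}

evals = build_eval_set(collect_user_questions(log), labels)
score = run_evals(toy_model, evals)
print(f"pass rate: {score:.0%}")  # the toy model gets 1 of 2 right: 50%
```

A failing eval like the `2 + 2?` case above is exactly the signal that tells the team what data to gather before the next retrain-and-release cycle.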
Today, evaluations are rule-based or human-in-the-loop. But in the future, other models will judge the output to ensure consistency over time. And with each turn of the iteration wheel the product improves, ensuring that the Dad jokes from a model really are the best.
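The two evaluation styles can be contrasted in a short sketch. Everything here is hypothetical: the rules are made up, and `judge_model` is a stand-in for what would, in practice, be a call to a second LLM.

```python
def rule_based_eval(output: str) -> bool:
    # Rule-based: deterministic checks, e.g. the joke must be short
    # and actually end with a punchline.
    return len(output) < 200 and output.strip().endswith("!")

def judge_model(prompt: str) -> str:
    # Hypothetical judge: a real system would call another LLM here.
    return "PASS" if "pun" in prompt.lower() else "FAIL"

def model_as_judge_eval(output: str) -> bool:
    # Model-as-judge: ask a second model whether the output is good.
    verdict = judge_model(f"Is this a good Dad joke? {output}")
    return verdict == "PASS"

joke = "Why don't eggs tell jokes? They would crack up at their own pun!"
print(rule_based_eval(joke))      # rule check passes
print(model_as_judge_eval(joke))  # judge check passes
```

Rule-based checks are cheap and repeatable; a model judge can grade fuzzier qualities (is the joke actually funny?) that rules cannot express.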