This week The Washington Post published a deep dive into AI chatbot assistants from H&R Block and TurboTax, which are meant to help serve up answers to arcane questions about taxes. The verdict, after extensive testing: they often provide irrelevant, misleading, or inaccurate responses.
This is just the latest crap pile that major companies have stepped in with AI-powered chatbots. Last month saw a ruling in a case involving an Air Canada chatbot that promised a customer a discount the airline later tried to renege on. (The tribunal ultimately ruled that Air Canada was responsible for what the chatbot said.) And these are only the most recent headlines in a long line of mishaps.
What is happening here? It’s easy to point to this as an example of corporations trying to jump on the AI bandwagon before the technology is fully ready, at least for public-facing use cases like customer service. Up until recently, automated chatbots generally worked by answering common questions with canned responses about straightforward information. Anything more complicated, or poorly communicated by the customer, was kicked to a human.
With generative AI, chatbots are now being asked to do more: interpreting the queries of flawed humans, then finding and summarizing the correct information on the fly. As anyone who’s used one of these chatbots knows, the results can amaze you with their accuracy. And they can also go completely off the rails.
This is a feature, not a bug, of generative systems. The same “magic” that enables a large language model to come up with responses that give the illusion of actual reasoning will also occasionally lead it down dead ends. And while models and fine-tuning techniques will continue to improve, it appears unlikely, if not impossible, that AI’s propensity to “hallucinate” can be weeded out completely.
And this is why the AI-powered chatbot is unlikely to deliver on the promise of fully replacing the human element of customer service. Raw outputs will never be 100% perfect, and the liability that can be attached to a single wrong answer is potentially devastating, regardless of how many disclaimer labels you slap on top of it.
This is also why I’m not holding my breath for an AI-powered Siri or Alexa. Don’t get me wrong: I’m certain Apple and Amazon will roll out generative features in their digital assistants in a limited way, and probably fairly soon. But imagine the recent Gemini debacle happening on every iPhone in the world; there’s simply no way either assistant will get smart overnight.
The main takeaway, besides the fact that AI hype and reality are worlds apart, is that raw output from LLMs is still a minefield, at least with this current generation of AI. There are glimmers of hope that models, and the systems built on them, will improve to a point where they can check themselves (Anthropic’s newly announced Claude 3 apparently can not only find obscure information in a data set, but also show some meta-awareness about the task it’s performing). Until then, though: humans in the loop, always.