Apple Says Generative AI Isn’t Good at Math: What It Means for Banks and Credit Unions
GenAI tools can do amazing things. Doling out reliable financial advice and guidance isn’t one of them.
A survey from the Motley Fool revealed some surprising—and, frankly, hard to believe—statistics about Americans’ use of the generative AI tool ChatGPT for financial advice. The study found that:
- 54% of Americans have used ChatGPT for finance recommendations. Six in 10 Gen Zers and millennials, half of Gen Xers, and a third of baby boomers said they’ve received recommendations for at least one of eight financial products. Credit cards and checking accounts—cited by 26% and 23% of respondents, respectively—were the products most frequently asked about.
- Half of consumers said they would use ChatGPT to get a recommendation. That said, few expressed an interest in getting a recommendation for most products. For example, 25% said they’d want a recommendation from ChatGPT for a credit card—and the percentages go down from there.
- Respondents were “somewhat satisfied” with ChatGPT’s recommendations. On a 5-point scale (1=not satisfied, 5=very satisfied), the average overall satisfaction rating was 3.7, ranging from 3.6 among Gen Zers and baby boomers to 3.8 among millennials and 3.9 among Gen Xers.
According to the study, the most important factors determining consumers’ use of ChatGPT to find financial products are: 1) the performance and accuracy of the recommendations; 2) the ability to understand the logic behind the recommendations; and 3) the ability to verify information the recommendation is based on.
However, the conclusions from a new Apple study might make consumers rethink using ChatGPT—and other generative AI tools—to get financial advice. And they should temper the plans of bank and credit union executives to use artificial intelligence to offer financial advice and guidance to consumers.
Generative AI Falls Short on Mathematical Reasoning
Generative AI (genAI) tools can do lots of amazing things, but, as a new report from researchers at Apple demonstrates, large language models (LLMs) have some troubling limitations with “mathematical reasoning.” The Apple researchers concluded:
“Current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data. When we add a single clause that appears relevant to the question, we observe significant performance drops (up to 65%) across all state-of-the-art models. Importantly, we demonstrate that LLMs struggle even when provided with multiple examples of the same question or examples containing similar irrelevant information. This suggests deeper issues in their reasoning processes that cannot be easily mitigated through few-shot learning or fine-tuning.”
A recent TechCrunch article documented some seemingly simple mathematical calculations that LLMs get wrong. The article states, “Claude can’t solve basic word problems, Gemini fails to understand quadratic equations, and Llama struggles with straightforward addition.”
Why can’t LLMs do basic math? The problem, according to TechCrunch, is tokenization:
“The process of dividing data up into chunks (e.g., breaking the word ‘fantastic’ into the syllables ‘fan,’ ‘tas,’ and ‘tic’), tokenization helps AI densely encode information. But because tokenizers—the AI models that do the tokenizing—don’t really know what numbers are, they frequently end up destroying the relationships between digits. For example, a tokenizer might treat the number ‘380’ as one token but represent ‘381’ as a pair of digits (‘38’ and ‘1’).”
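The digit-splitting behavior TechCrunch describes can be sketched with a toy greedy tokenizer. This is an illustrative simplification with a hypothetical, hand-picked vocabulary—real tokenizers learn their vocabularies from data—but it shows how a vocabulary that happens to contain “380” as a single token, and not “381,” leaves the model seeing two adjacent numbers as entirely different shapes:

```python
# Toy greedy longest-match tokenizer. The vocabulary below is hypothetical
# and chosen to mimic the article's example: "380" is one token, "381" is not.
VOCAB = {"380", "38", "3", "8", "1", "0"}

def tokenize(text: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest substring first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(tokenize("380"))  # ['380']     -- a single token
print(tokenize("381"))  # ['38', '1'] -- the relationship between digits is lost
```

To the model, “380” and “381” are no longer neighbors on a number line—they are unrelated token sequences, which is one reason arithmetic over them is fragile.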
Machine Learning Has a Problem, As Well
Annoyingly, many people use the term “machine learning” when referring to regression analysis or some other form of statistical analysis. According to the University of California, Berkeley, machine learning has three components:
- A decision process. In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labeled or unlabeled, your algorithm will produce an estimate about a pattern in the data.
- An error function. An error function evaluates the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.
- A model optimization process. If the model can fit better to the data points in the training set, then weights are adjusted to reduce the discrepancy between the known example and the model estimate. The algorithm will repeat this iterative “evaluate and optimize” process, updating weights autonomously until a threshold of accuracy has been met.
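The three components above can be sketched in a few lines. This is a minimal illustration, not a production model: it fits a single weight to made-up data with gradient descent, and all names and numbers are invented for the example:

```python
# Toy (input, known label) pairs -- illustrative data only.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

w = 0.0  # the model's single adjustable weight

def predict(x):
    """1) Decision process: produce an estimate from input data."""
    return w * x

def error(pairs):
    """2) Error function: mean squared error against the known examples."""
    return sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

# 3) Model optimization process: repeatedly adjust the weight to reduce
# the discrepancy between the known labels and the model's estimates.
learning_rate = 0.01
for _ in range(1000):
    gradient = sum(2 * (predict(x) - y) * x for x, y in data) / len(data)
    w -= learning_rate * gradient
```

The crucial point for what follows: step 2 requires known, trackable outcomes to compare against. Without them, there is nothing for step 3 to optimize—plain regression run once on a dataset never enters that iterative loop.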
Regression analysis and most other forms of statistical analyses lack a model optimization process.
Here’s the real-world problem: While “investment” results are generally trackable, “spending” results are not. And without trackable outcomes, there are no known examples for an error function to evaluate—and nothing for a model optimization process to optimize. Yet for the vast majority of people, how they spend is a bigger determinant of their financial performance than investing is.
The other challenge is that we don’t only spend to optimize our financial performance. We spend to optimize our emotional performance. How is a machine learning model going to track that?
AI Is Not Ready for Prime Time in Financial Advice
The instructions needed to provide financial advice and guidance involve many “clauses.” In other words, the goals and objectives behind financial advice and guidance are not simple and straightforward—and it’s exactly these complex, multi-clause questions and instructions that the Apple researchers found genAI tools handle poorly.
Bottom line: Banks and credit unions shouldn’t rely on AI to provide financial advice and guidance—right now. Maybe someday, but not now, and not for another five, maybe 10, years. If vendors claim they’re using machine learning, ask them about their model optimization process. If they claim to have a large language model, ask them how it overcomes math computation limitations.
Ron Shevlin is chief research officer at Cornerstone Advisors. Tune in to Ron’s What’s Going On In Banking podcast and follow him on LinkedIn and X.