Apple Engineers Show How Flimsy AI Reasoning Can Be
Washington, for example, has not prevented Iran’s ongoing indirect oil exports to China in recent years. In addition, Iran’s leaders have been directing more and more of the country’s oil revenue toward defense. They recently announced a planned increase in military expenditure of 200%, and some members of the ruling elite have called for setting the defense budget as a fixed share of gross domestic product to ensure adequate funding for military priorities. The Iranian economy was already in a perilous state due in large part to the ongoing impact of US-led sanctions on Tehran and ongoing anxiety over the conflict in the Middle East.
This means that instead of just responding to prompts, AI agents can set objectives, plan steps and act to achieve them. IBM’s Deep Blue exemplifies symbolic AI, famously defeating world chess champion Garry Kasparov in 1997. Chess, with its intricate rules and vast number of possible moves, necessitates a strategic, logic-driven approach, which is precisely the strength of symbolic AI. Neural networks, by contrast, learn by analyzing patterns in vast amounts of data, loosely like neurons in the human brain, and underpin AI systems we use daily, such as ChatGPT and Google’s Gemini.
Contract analysis today is a tedious process fraught with the possibility of human error. Lawyers must painstakingly dissect agreements, identify conflicts and suggest optimizations, a time-consuming task that can lead to oversights. Neuro-symbolic AI could address this challenge by meticulously analyzing contracts, actively identifying conflicts and proposing optimizations. By breaking down problems systematically, o1 mimics human thought processes, considering strategies and recognizing mistakes. This ultimately leads to a more sophisticated ability to analyze information and solve complex problems. Additionally, o1 showcases elements of agentic AI, where systems can act independently to achieve goals.
This Apple AI study suggests ChatGPT and other chatbots can’t actually reason
The researchers propose that this reliable mode of failure means the models don’t really understand the problem at all. Their training data does allow them to respond with the correct answer in some situations, but as soon as the slightest actual “reasoning” is required, such as whether to count small kiwis, they start producing weird, unintuitive results. A group of AI research scientists at Apple released their paper, “Understanding the limitations of mathematical reasoning in large language models,” to general commentary Thursday. While the deeper concepts of symbolic learning and pattern reproduction are a bit in the weeds, the basic concept of their research is very easy to grasp. Unlike o1, which is a neural network employing extended reasoning, AlphaGeometry combines a neural network with a symbolic reasoning engine, creating a true neuro-symbolic model. Its application may be more specialized, but this approach represents a critical step toward AI models that can reason and think more like humans, capable of both intuition and deliberate analysis.
OpenAI’s GPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems). Generating fresh variants of each question helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the inherent mathematical reasoning at all, meaning models should theoretically perform just as well when tested on GSM-Symbolic as GSM8K.
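To make the idea of swapping names and numbers concrete, here is a minimal sketch of how a templated word problem could be instantiated many times. The template text, the name list, the number ranges, and the function names are all assumptions chosen for illustration; they are not taken from Apple's benchmark.

```python
import random

# Hypothetical template: the placeholders stand in for the names and
# numbers that a GSM-Symbolic-style benchmark varies between instances.
TEMPLATE = (
    "{name} picks {a} apples on Monday and {b} apples on Tuesday. "
    "How many apples does {name} have in total?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]  # illustrative names only


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one question together with its ground-truth answer."""
    name = rng.choice(NAMES)
    a = rng.randint(2, 60)
    b = rng.randint(2, 60)
    question = TEMPLATE.format(name=name, a=a, b=b)
    # The required reasoning (one addition) is identical for every variant.
    return question, a + b


def make_variants(n: int, seed: int = 0) -> list[tuple[str, int]]:
    """Generate n variants reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```

Every variant needs exactly the same arithmetic, so a solver that genuinely reasons should score the same on all of them; only the surface details differ.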
Apple’s AI study shows that changing trivial variables in math problems that wouldn’t fool kids or adding text that doesn’t alter how you’d solve the problem can significantly impact the reasoning performance of large language models. But if a new AI paper from Apple researchers is correct in its conclusions, then ChatGPT o1 and all other genAI models can’t actually reason. Apple’s study serves as a call to action for innovative strategies to enhance reasoning capabilities in AI models. Identifying and addressing these limitations is essential for advancing towards more sophisticated AI systems, including the long-term goal of Artificial General Intelligence (AGI). By focusing on these challenges, researchers and developers can contribute to the creation of AI systems that are not only more intelligent but also more reliable and aligned with human needs and ethical considerations. Adding these “seemingly relevant but ultimately inconsequential statements” to GSM-Symbolic templates leads to “catastrophic performance drops” for the LLMs.
Apple’s New Benchmark, ‘GSM-Symbolic,’ Highlights AI Reasoning Flaws – CircleID (posted Mon, 14 Oct 2024 07:00:00 GMT)
AllegroGraph is at the forefront of Neuro-Symbolic AI, a technology that uniquely integrates Machine Learning (Neuro AI) with knowledge and reasoning (Symbolic AI). This innovative approach sets a new benchmark in intelligent computing, ensuring AI reasoning is both contextually relevant and factually accurate. By leveraging Knowledge Graphs, AllegroGraph empowers organizations to harness AI insights for critical decision-making with unparalleled confidence and trust.
Apple’s research highlights a crucial gap in the reasoning capabilities of current LLMs, suggesting that merely scaling up data and computational power may not bridge this divide. While this prospect may sound daunting, it also opens the door to exciting possibilities for innovation. By understanding and addressing these limitations, we can pave the way for AI systems that not only excel in pattern recognition but also demonstrate true logical reasoning, ensuring they become reliable partners in our increasingly complex world. While you might assume that advanced models like GPT-4 possess robust reasoning skills, Apple’s research suggests a different reality.
Algeria marks 70th anniversary of liberation
We’re likely seeing a similar “illusion of understanding” with AI’s latest “reasoning” models, and seeing how that illusion can break when the model runs into unexpected situations. Second, we praise the current determination of our beloved people and its ambitious youth, who are carrying the torch of completing the national march toward a new Algeria, great in its potential, and the genius of its daughters and sons, strong and proud of their national history. Algeria is determined to achieve the highest levels of socio-economic development through the mobilization of resources and building strong partnerships with friendly countries based on common views and mutual interests. The scientists developed a version of the GSM8K benchmark, a set of over 8,000 grade-school math word problems that AI models are tested on. Called GSM-Symbolic, the new benchmark makes simple changes to the math problems, like modifying the characters’ names, relationships, and numbers. This is where neuro-symbolic AI comes into play: a hybrid approach that blends the strengths of neural networks (intuition) with the precision of symbolic AI (logic).
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” benchmark set (short for “no operation”), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.” With enough training data and computation, the AI industry will likely reach what you might call “the illusion of understanding” with AI video synthesis eventually… A key finding of the research is the models’ sensitivity to irrelevant information. When extraneous details are added to test questions, significant performance drops occur. This vulnerability to changes in names and numbers indicates potential issues with overfitting and data contamination.
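The "no operation" idea can be illustrated with a short sketch. The kiwi counts, the sentence wording, and the function names below are assumptions made up for illustration, and the "distracted" answer shows only one plausible way a pattern-matching solver might go wrong (treating the irrelevant detail as a subtraction).

```python
# Illustrative numbers and wording; not taken from the paper.
FRIDAY, SATURDAY = 40, 50
BASE = f"Sam picks {FRIDAY} kiwis on Friday and {SATURDAY} kiwis on Saturday."
NOOP = "Five of them were a bit smaller than average."
ASK = "How many kiwis does Sam have?"


def build_question(with_noop: bool) -> str:
    """Assemble the question, optionally inserting the irrelevant clause."""
    parts = [BASE] + ([NOOP] if with_noop else []) + [ASK]
    return " ".join(parts)


def correct_answer() -> int:
    # The size of a kiwi has no effect on how many kiwis there are.
    return FRIDAY + SATURDAY


def distracted_answer() -> int:
    # Hypothetical failure: the number in the irrelevant clause is
    # mistaken for an operation and subtracted from the total.
    return FRIDAY + SATURDAY - 5


if __name__ == "__main__":
    print(build_question(with_noop=True))
    print("correct:", correct_answer(), "distracted:", distracted_answer())
```

The ground-truth answer is the same with or without the extra clause, which is what makes any accuracy drop on such questions informative.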
Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names. However, headline benchmark metrics may not accurately reflect genuine improvements in reasoning capabilities. Apple’s introduction of the GSM-Symbolic benchmark reveals significant performance discrepancies when only names and values are altered in test questions. This finding suggests that previous benchmarks might not fully capture the models’ true reasoning abilities, potentially leading to overestimation of their capabilities. Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things.
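As a rough illustration of what a best-to-worst gap means, the following sketch summarizes per-run accuracies for a single model. The accuracy values are placeholders invented for the example, not results from the study, and `summarize_runs` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def summarize_runs(accuracies: list[float]) -> dict[str, float]:
    """Summarize accuracy across repeated runs of the same benchmark."""
    best, worst = max(accuracies), min(accuracies)
    return {
        "mean": mean(accuracies),
        "std": pstdev(accuracies),  # population standard deviation
        "best": best,
        "worst": worst,
        "gap": best - worst,  # spread between best and worst run
    }


if __name__ == "__main__":
    # Placeholder accuracies for five runs with different names and values.
    runs = [0.80, 0.74, 0.86, 0.71, 0.78]
    for key, value in summarize_runs(runs).items():
        print(f"{key}: {value:.3f}")
```

Reporting the gap alongside the mean shows how a model can look stable on average while still swinging noticeably from one set of names and numbers to the next.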
KMWorld is the leading publisher, conference organizer, and information provider serving the knowledge management, content management, and document management markets. Franz Inc. not only offers cutting-edge technology but also provides consulting services for building industrial-strength Knowledge Graphs for Neuro-Symbolic AI solutions. AllegroGraph is designed to seamlessly integrate with LLMs, providing the most secure and scalable AI solution for enterprises.
Their insights underscore the importance of human judgment and ethical considerations, especially in critical fields like law, where the stakes are exceptionally high. As AI technologies automate legal research and analysis, it’s easy to succumb to rapid judgments (thinking fast) — assuming the legal profession will be reshaped beyond recognition. However, as Kahneman suggests, “Nothing in life is as important as you think it is while you are thinking about it.” Taking a moment for deliberate reflection, we might realize that perhaps the transformation isn’t as earth-shattering as it seems — or perhaps it is.
In tests, AlphaGeometry solved 83% of International Mathematical Olympiad geometry problems, matching o1’s performance and nearly reaching that of human gold medalists. According to OpenAI, o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry and biology.” In a mock qualifying exam for the International Mathematical Olympiad, o1 correctly solved 83% of the problems, a dramatic improvement over GPT-4’s 13% success rate. Similarly, tax preparation software like TurboTax and H&R Block relies heavily on symbolic AI to navigate the intricate web of legal regulations and ensure accurate calculations.
Algerian-Turkish relations are a successful example of how to build strong and sustainable ties between countries based on shared history, common vision and mutual interests. By strengthening cooperation in various fields, the two countries can continue to achieve further progress and development for the benefit of their people and their region. Dr. Hopfield highlights that technological advancements like AI can bring both significant benefits and risks.
Given the long shared history between the two countries and the deep civilizational ties between them, the cultural aspect of this relationship had to be considered. In response to the wishes of the two peoples, the two presidents have agreed to reciprocally open cultural centers in Algiers and Istanbul. They also recognized the importance of working together in the field of Ottoman archives to explore and document the common history and deepen mutual understanding of the common past.
- They want to minimize the impact that Trump’s victory may have on their economy and are trying to reassure the domestic market.
- Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested.
- “Current LLMs are not capable of genuine logical reasoning,” the researchers hypothesize based on these results.
Biden’s looser approach to sanction enforcement saw Iranian oil exports increase to 2 million barrels a day, with most of that oil going to China. Under Trump’s “maximum pressure” policy, Iranian oil exports were down to 100, ,000 barrels a day. And even though the sanctions have remained in place, the Biden administration partially rolled back the enforcement of some of those prohibitions as an incentive for Iran during these back-channel negotiations.
And although it can follow complex chains of reasoning it has been exposed to before, the fact that this chain can be broken by even superficial deviations suggests that it doesn’t actually reason so much as replicate patterns it has observed in its training data. Replacing the name with something else and changing the numbers should not alter the performance of reasoning AI models like ChatGPT. After all, a grade schooler could still solve the problem even after changing these details. The ability to reason accurately and consistently is essential for AI applications in critical areas such as education, healthcare, and decision-making systems. Understanding the limitations of LLMs’ reasoning capabilities is crucial for ensuring AI safety and alignment with human values.
“We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data,” the researchers write. Apple’s recent research paper provides a critical analysis of the reasoning capabilities of current large language models (LLMs), challenging the widespread belief that these models possess genuine logical reasoning abilities and revealing instead a significant reliance on pattern recognition. These findings have far-reaching implications for the practical applications of LLMs and the future development of artificial intelligence. Imagine a world where AI is seamlessly integrated into critical areas like education and healthcare, making decisions that impact our daily lives. However, what if these systems falter when faced with unfamiliar situations or irrelevant details?
As AI continues to evolve, understanding and overcoming these reasoning limitations will be crucial in shaping the future of intelligent systems. This research from Apple not only highlights current shortcomings but also opens new avenues for innovation in AI development, potentially leading to more capable, reliable, and truly intelligent AI systems in the future. This observation is consistent with the other qualities often attributed to LLMs due to their facility with language. When, statistically, the phrase “I love you” is followed by “I love you, too,” the LLM can easily repeat that — but it doesn’t mean it loves you.
Iran’s Currency Was Already Tumbling − And Then Trump Won
It uses “chain-of-thought” prompting to break down problems into steps, much like a human would. It’s executing complex algorithms to produce this human-like reasoning, resulting in stronger problem-solving abilities. The results of this new GSM-Symbolic paper aren’t completely new in the world of AI research. Other recent papers have similarly suggested that LLMs don’t actually perform formal reasoning and instead mimic it with probabilistic pattern-matching of the closest similar data seen in their vast training sets.
These models often replicate reasoning steps from their training data without truly comprehending the underlying problems. This dependence on pattern recognition, rather than authentic logical reasoning, raises substantial concerns about their effectiveness in handling complex tasks. Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with performance drops between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values.
As a result of these concerns, Iranians have increasingly been converting most of their savings into US dollars or gold. (MENAFN- Asia Times)
As the world absorbed news of Donald Trump’s comeback victory in the 2024 US presidential race, concern in Iran turned to the impact of the election on its own economy amid escalating regional tensions. This project is expected to support other partnerships between the two countries in the energy sector that align with the joint strategy in this field. It is worth mentioning that although the current relations between the two countries enjoy increasing momentum, they are not historically new. They are rooted in the depths of history, as Algeria and Türkiye share distinctive friendly, civilizational and political ties, and the history of the North African and Mediterranean region is replete with great achievements and heroic moments shared by the two countries.
Such sensitivities could severely hinder the models’ application in dynamic real-world environments, where data is rarely static or predictable. It serves as a bridge between Kahneman’s concepts of thinking fast and thinking slow, aiming to deliver better reasoning with fewer mistakes. This approach paves the way for more advanced systems like AlphaGeometry that truly merge neural and symbolic approaches. OpenAI’s o1 model is not technically neuro-symbolic AI but rather a neural network designed to “think” longer before responding.
This meticulous, rule-based approach ensures each step is executed according to established guidelines. These are not well-defined concepts, and the questions tend to appear at the bleeding edge of AI research, where the state of the art changes on a daily basis. They want to minimize the impact that Trump’s victory may have on their economy and are trying to reassure the domestic market.
This methodical analysis, Kahneman’s “System 2 (slow)” thinking, finally exonerated the fans. The rial fell to a fresh record low as Donald Trump was claiming victory, trading above the symbolic marker of 700,000 rials to the dollar, according to traders in Tehran, just as results of the US election were coming in. One of the most visible areas of this strategic cooperation is the economic sector. Hence, Türkiye has become Algeria’s fifth-largest trading partner, and Algeria has become Türkiye’s second-largest partner on the African continent. Apple isn’t going after rivals here; it’s simply trying to determine whether current genAI tech allows these LLMs to reason. Dr. Hinton, often called the godfather of AI, warns that as AI systems begin to exceed human intellectual abilities, we face unprecedented challenges in controlling them.
The ties between the two countries have witnessed remarkable development at various levels and have remarkably accelerated since 2020. That said, it’ll be interesting to see how OpenAI, Google, Meta, and others challenge Apple’s findings in the future. Perhaps they’ll devise other ways to benchmark their AIs and prove they can reason. If anything, Apple’s data might be used to alter how LLMs are trained to reason, especially in fields requiring accuracy. Apple researcher Mehrdad Farajtabar has a thread on X that covers the kind of changes Apple performed for the new GSM-Symbolic benchmarks that include additional examples. This caution is echoed by John J. Hopfield and Geoffrey E. Hinton, pioneers in neural networks and recipients of the 2024 Nobel Prize in Physics for their contributions to AI.
These expectations are supported by significant domestic investment, with the registration of 9,000 projects worth nearly $25 billion. The economy is seeing improved performance in the industrial and agricultural sectors, with the industry’s contribution to GDP expected to grow from 7.5% in 2023 to 9.3% by 2026 and agriculture exceeding 5%. On the other hand, the focus is currently on the knowledge economy and digitization to include all sectors, with the establishment of business incubators and training in several fields, most notably artificial intelligence, to keep pace with new economies based on modern technologies and innovations. For a while now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems. Why would a model that understands the problem be thrown off so easily by a random, irrelevant detail?
Without addressing these issues, the deployment of AI in sensitive domains could lead to unreliable or potentially harmful outcomes. OpenAI o1 not only demonstrates advanced reasoning but also hints at the future potential of artificial general intelligence. AGI refers to AI systems that can understand, learn and apply intelligence broadly, much like humans. This data-driven processing aligns with Kahneman’s “thinking fast” — rapid, intuitive thinking. While neural networks excel at finding patterns and making quick decisions, they can sometimes lead to errors, referred to as “hallucinations” in the AI world, due to biases or insufficient data.
This is just one simple example out of hundreds of questions that the researchers lightly modified, nearly all of which led to enormous drops in success rates for the models attempting them. Apple’s study, available as a pre-print version at this link, details the types of experiments the researchers ran to see how the reasoning performance of various LLMs would vary. They looked at open-source models like Llama, Phi, Gemma, and Mistral and proprietary ones like OpenAI’s o1-preview, o1-mini, and GPT-4o. This innovative approach, merging the precision of symbolic AI with the adaptability of neural networks, offers a compelling solution to the limitations of existing legal AI tools.