Can AI survive in the crypto world: crypto experiments on 18 large models
How well does AI really understand crypto? Can an agent built on a large language model actually use crypto tools?
Original title: "Can AI survive in the crypto world: crypto experiments on 18 large models"
Original author: Wang Chao, Empower Labs
In the history of technological progress, revolutionary technologies usually emerge independently, each driving the transformation of its own era. When two of them meet, the collision often has an exponential impact. Today we stand at such a moment: artificial intelligence and crypto, two equally disruptive technologies, are stepping onto center stage hand in hand.
We imagine that crypto can solve many of AI's challenges; we expect AI agents to build autonomous economic networks and drive the large-scale adoption of crypto; we also hope AI can accelerate the development of existing use cases in the crypto field. Countless eyes are fixed on this intersection, and enormous amounts of capital are pouring in. Like any buzzword, it embodies people's desire for innovation and their vision of the future, along with unchecked ambition and greed.
Yet amid all this noise, we know very little about the most basic questions. How well does AI really understand crypto? Can an agent built on a large language model actually use crypto tools? How much do different models differ on crypto tasks?
The answers will shape how AI and crypto influence each other, and they matter for product direction and technology choices in this cross-disciplinary field. To explore these questions, I ran a set of evaluation experiments on large language models. By assessing their knowledge and capabilities in the crypto field, the experiments gauge how well AI can apply crypto and help judge the potential and challenges of integrating the two.
Conclusion first
Large language models perform well on the basics of cryptography and blockchain and have a good understanding of the crypto ecosystem, but they perform poorly on mathematical calculations and complex business logic analysis. On private keys and basic wallet operations, the models have a satisfactory foundation, but keeping private keys safe when the model runs in the cloud remains a serious challenge. Many models can generate working smart contract code for simple scenarios, but they cannot independently carry out difficult tasks such as contract auditing or writing complex contracts.
Commercial closed-source models are generally ahead. In the open-source camp only Llama 3.1-405B performs well, while all open-source models with smaller parameter counts fall short. There is potential, however: with prompt guidance, chain-of-thought reasoning, and few-shot learning, the performance of all models improves substantially. The leading models are already technically viable in some vertical application scenarios.
Experimental details
Eighteen representative language models were selected for evaluation, including:
· Closed-source models: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Gemini 1.5 Pro, Grok2 beta (closed source for now)
· Open-source models: Llama 3.1 8B/70B/405B, Mistral Nemo 12B, DeepSeek-coder-v2, Nous-hermes2, Phi3 3.8B/14B, Gemma2 9B/27B, Command-R
· Math-optimized models: Qwen2-math-72B, MathΣtral
These models cover the mainstream commercial and popular open-source models, with parameter counts ranging from 3.8B to 405B, a span of more than 100 times. Given the close relationship between crypto and mathematics, the experiment also deliberately included two math-optimized models.
The knowledge areas covered by the experiment include cryptography, blockchain basics, private key and wallet operations, smart contracts, DAOs and governance, consensus and economic models, DApps/DeFi/NFTs, on-chain data analysis, and more. Each area consists of a series of questions and tasks ranging from easy to difficult, testing not only the model's knowledge but also, through simulated tasks, its performance in application scenarios.
The tasks come from diverse sources: some were contributed by multiple experts in the crypto field, while others were generated with AI assistance and manually proofread to ensure accuracy and difficulty. Some tasks use simple multiple-choice questions so that they can be tested and scored automatically in a standardized way. Others use more complex questions and are evaluated through a combination of automated scripts, manual review, and AI. All tasks are evaluated zero-shot, without any examples, reasoning guidance, or instructional prompts.
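To make the automated scoring of multiple-choice questions concrete, here is a minimal sketch of how such grading could work. It is not the author's test harness: the function names, the A to D option letters, and the question IDs are all hypothetical, and the letter extraction is deliberately naive.

```python
import re
from typing import Dict, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone capital letter A-D in a model response.

    Naive on purpose: a response that starts with the article "A" would be
    misread, so a real harness would need a stricter answer format.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(responses: Dict[str, str], answer_key: Dict[str, str]) -> float:
    """Fraction of questions whose extracted choice matches the answer key."""
    correct = sum(
        1
        for question_id, expected in answer_key.items()
        if extract_choice(responses.get(question_id, "")) == expected
    )
    return correct / len(answer_key)


if __name__ == "__main__":
    # Hypothetical answer key and raw model outputs.
    key = {"q1": "B", "q2": "C", "q3": "A"}
    outputs = {"q1": "B", "q2": "I think C is right", "q3": "D"}
    print(f"accuracy: {accuracy(outputs, key):.2f}")
```

Constraining the model to reply with a single letter makes this kind of grading far more reliable than parsing free-form text.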
Because the experimental design is still fairly rough and lacks academic rigor, the questions and tasks are far from covering the entire crypto field, and the testing framework is immature. This article therefore does not list specific experimental data and instead focuses on sharing insights from the experiments.
Knowledge/Concepts
In the evaluation, the large language models performed well on basic knowledge tests across areas such as cryptographic algorithms, blockchain basics, and DeFi applications. For example, on an open-ended question testing the concept of data availability, all models gave accurate answers. On a question assessing their grasp of the Ethereum transaction structure, the answers differed slightly in detail but generally contained the correct key information. The multiple-choice concept questions were even easier, with almost all models scoring above 95% accuracy.
Conceptual questions pose little difficulty for large models.
Calculation/Business Logic
The situation reverses, however, on questions that require actual calculation. A simple RSA calculation question stumped most models. This is not hard to understand: large language models work mainly by recognizing and reproducing patterns in their training data, not by deeply understanding mathematical concepts. The limitation is especially evident with abstract operations such as modular arithmetic and exponentiation. Since crypto is closely tied to mathematics, relying directly on models for crypto-related calculations is unreliable.
The models' performance on other calculation questions was also unsatisfactory. On a simple question about the impermanent loss of an AMM, which involves no complex mathematics, only 4 of the 18 models gave the correct answer. An even more basic question about the probability of producing a block was answered incorrectly by every model. This exposes not only the models' weakness in precise calculation but also major problems in business logic analysis. Notably, even the math-optimized models showed no clear advantage on calculation questions, and their performance was disappointing.
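For reference, the kind of impermanent loss calculation described above reduces to a short closed-form expression. The sketch below assumes a constant-product (x * y = k) pool with two assets and ignores fees; the article does not give the actual test question, so the numbers here are illustrative only.

```python
import math


def impermanent_loss(price_ratio: float) -> float:
    """Impermanent loss of a 50/50 constant-product pool versus simply holding.

    price_ratio is new_price / initial_price of one asset against the other.
    The result is zero or negative, e.g. -0.2 means the pool position is
    worth 20% less than holding the two assets would have been.
    """
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 2 * math.sqrt(price_ratio) / (1 + price_ratio) - 1


if __name__ == "__main__":
    # Illustrative case: one asset quadruples in price against the other.
    print(f"{impermanent_loss(4.0):.2%}")
```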
Mathematical calculation problems are not unsolvable, however. If we adjust the task slightly and ask the LLM to produce Python code rather than compute the result directly, accuracy improves greatly. For the RSA question above, the Python code produced by most models runs successfully and returns the correct result. In a production environment, the LLM's own arithmetic can be bypassed altogether by supplying preset algorithm code, much as humans handle such tasks. At the business logic level, carefully designed prompts can also effectively improve model performance.
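As an illustration of what "code instead of mental arithmetic" looks like for an RSA question, here is a minimal textbook RSA sketch. The parameters are toy values chosen for this example, not the ones used in the experiment, and unpadded RSA with primes this small is for demonstration only.

```python
# Textbook RSA with toy parameters (illustrative assumptions, not test data).
p, q = 61, 53                # two small primes
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime with phi
d = pow(e, -1, phi)          # private exponent via modular inverse (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # encryption: m^e mod n
recovered = pow(ciphertext, d, n)  # decryption: c^d mod n

print(n, d, ciphertext, recovered)  # 3233 2753 2790 65
```

The built-in three-argument `pow` performs exact modular exponentiation, which is precisely the step models tend to get wrong when they try to compute it token by token.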
Private key management and wallet operation
If you ask what an agent's first use case for cryptocurrency would be, my answer is payments. Cryptocurrency can almost be regarded as the native form of money for AI. Compared with the many obstacles agents face in the traditional financial system, using crypto to give themselves digital identities and to manage funds through crypto wallets is the natural choice. Generating and managing private keys and performing wallet operations are therefore the most basic skills an agent needs to use crypto networks independently.
Secure private key generation depends on high-quality randomness, which large language models obviously cannot provide. The models do, however, understand private key security well. When asked to generate a private key, most models chose to provide code (for example, using Python libraries) so that users can generate the key themselves. Even when a model output a private key directly, it clearly stated that the key was for demonstration only and was not safe to use. In this respect, all the models performed satisfactorily.
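The kind of code a model might hand back can be sketched with Python's standard `secrets` module, which draws from the operating system's cryptographically secure random source. The sketch assumes the secp256k1 curve used by Bitcoin and Ethereum (the article does not name a curve), and the function name is hypothetical.

```python
import secrets

# Order of the secp256k1 group; a valid private key lies in [1, N - 1].
SECP256K1_N = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def generate_private_key() -> str:
    """Return a random secp256k1 private key as 64 hex characters."""
    # randbelow(N - 1) is uniform over [0, N - 2]; adding 1 shifts it to [1, N - 1].
    key_int = secrets.randbelow(SECP256K1_N - 1) + 1
    return key_int.to_bytes(32, "big").hex()


if __name__ == "__main__":
    key = generate_private_key()
    # Never log or print a real key; only its length is shown here.
    print(len(key))
```

The point is that the randomness comes from the local machine rather than from the model's output.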
Private key management does face challenges, but they stem mainly from inherent limitations of the technical architecture rather than from a lack of model capability. With a locally deployed model, the generated private key can be considered relatively secure. With a commercial cloud model, however, we must assume the private key is exposed to the model's operator the moment it is generated. Yet an agent that is meant to work independently needs access to the private key, which means the key cannot live only on the user's local machine. In this case the model alone cannot guarantee the key's security, and additional security services such as a trusted execution environment (TEE) or a hardware security module (HSM) are needed.
Assuming the agent already holds the private key securely, the models tested showed good capability in performing basic operations on that foundation. The steps and code they output often contain errors, but these problems can largely be solved with an appropriate engineering architecture. From a technical perspective, there are few obstacles to letting agents perform basic wallet operations autonomously.
Smart Contracts
The ability to understand, use, and write smart contracts and to identify their risks is key for AI agents performing complex tasks on-chain, so it was also a focus of the experiment. Large language models showed significant potential here, but also exposed some obvious problems.
In the tests, almost all models correctly answered questions on basic contract concepts and identified simple bugs. On gas optimization, most models identified the key optimization points and analyzed the conflicts the optimizations might cause. When deep business logic is involved, however, the limitations of large models begin to show.
Take a token vesting contract as an example. All models correctly understood what the contract does, and most found several medium- and low-risk vulnerabilities. However, no model independently discovered a high-risk vulnerability hidden in the business logic that could leave some funds locked under special circumstances. Across multiple tests using real contracts, the models performed about the same.
This shows that large models still understand contracts at the level of form and lack an understanding of the underlying business logic. After being given additional hints, however, some models were eventually able to find the deeper vulnerability hidden in the contract above on their own. Based on this performance, and with good engineering support, large models are essentially capable of serving as a co-pilot for smart contract work. They are still a long way from independently handling critical tasks such as contract auditing.
Note that the code-related tasks in the experiment mainly involved contracts with simple logic and fewer than 2,000 lines of code. Larger and more complex projects were not tested; I believe they are clearly beyond what current models can handle effectively without fine-tuning or elaborate prompt engineering. The tests also covered only Solidity, not other smart contract languages such as Rust and Move.
Beyond the tests above, the experiment also covers DeFi scenarios, DAOs and their governance, on-chain data analysis, consensus mechanism design, and tokenomics. The large language models demonstrated some capability in all of these areas. Since many of these tests are still in progress and the methods and frameworks are still being refined, this article does not discuss them in depth for now.
Differences between models
Among all the models evaluated, GPT-4o and Claude 3.5 Sonnet carry over their outstanding performance from other fields and are the undisputed leaders. Both give accurate answers to almost all basic questions and provide in-depth, well-reasoned insights when analyzing complex scenarios. They even show a high success rate on the computational tasks that large models are generally bad at, although "high" is relative here and still falls short of stable output in a production environment.
In the open-source camp, Llama 3.1-405B is far ahead of its peers thanks to its large parameter count and advanced model design. Among the other open-source models with smaller parameter counts, there is no significant performance gap. Their scores vary slightly, but all are far below the passing line.
Therefore, if you want to build crypto-related AI applications today, these small and medium-sized models are not suitable choices.
Two models stood out in the evaluation. The first is Microsoft's Phi-3 3.8B, the smallest model in the experiment. With less than half the parameters, it reached a level comparable to the 8B to 12B models and even outperformed them on certain categories of questions. This result highlights the importance of model architecture and training strategy, rather than relying on parameter count alone.
Cohere's Command-R, by contrast, turned out to be an unexpected "dark horse" in reverse. Command-R is less well known than the other models, but Cohere is a large-model company focused on the enterprise (B2B) market, which I think fits well with areas such as agent development, so I deliberately included it in the test. Yet Command-R, with 35B parameters, ranked last in most tests and was beaten by many models under 10B.
This result raises a question: when Command-R was released, it emphasized retrieval-augmented generation (RAG) and did not even publish standard benchmark scores. Does this mean it is a "specialized key" that unlocks its full potential only in specific scenarios?
Experimental limitations
This series of tests gave us a preliminary understanding of AI's capabilities in the crypto field. The tests are, of course, far from professional standards. The data set's coverage is insufficient, the criteria for grading answers are fairly rough, and a refined, more accurate scoring mechanism is still missing. All of this affects the accuracy of the results and may lead to some models being underestimated.
In terms of method, the experiment used only zero-shot evaluation and did not explore approaches such as chain-of-thought prompting and few-shot learning that can draw out more of a model's potential. All runs also used default model parameters, and the effect of different parameter settings on performance was not examined. This uniformity of method limits how fully we can assess the models' potential and leaves performance differences under specific conditions unexplored.
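The difference between the zero-shot setup used here and the chain-of-thought and few-shot variants left unexplored can be sketched as three ways of building a prompt. The templates and function names below are hypothetical and are not the prompts used in the experiment.

```python
from typing import List, Tuple


def zero_shot(question: str) -> str:
    """Bare question with no examples or guidance (the setup used in the tests)."""
    return f"Question: {question}\nAnswer:"


def chain_of_thought(question: str) -> str:
    """Same question, plus an instruction to reason step by step first."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, then state the final answer.\n"
        "Answer:"
    )


def few_shot(question: str, examples: List[Tuple[str, str]]) -> str:
    """Prepend worked (question, answer) pairs before the real question."""
    shots = "\n\n".join(f"Question: {q}\nAnswer: {a}" for q, a in examples)
    return f"{shots}\n\n{zero_shot(question)}"


if __name__ == "__main__":
    demo = [("What is 7 mod 5?", "2")]  # hypothetical worked example
    print(few_shot("What is 10 mod 3?", demo))
```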
Despite the relatively simple test conditions, the experiments still produced many valuable insights and offer a reference for developers building applications.
The crypto field needs its own benchmark
In the field of AI, benchmarks play a key role. The rapid development of modern deep learning can be traced back to ImageNet, a standardized benchmark and dataset for computer vision built by Professor Fei-Fei Li's team and released in 2009; the results on its 2012 challenge are widely regarded as the start of the deep learning boom.
By providing a unified evaluation standard, benchmarks give developers clear goals and reference points and drive technical progress across the industry. This explains why every newly released large language model highlights its results on various benchmarks. These results have become a "universal language" for model capability, allowing researchers to pinpoint breakthroughs, developers to choose the model best suited to a specific task, and users to make informed choices based on objective data. More importantly, benchmarks often signal the future direction of AI applications, guiding resource investment and research focus.
If we believe the intersection of AI and crypto holds great potential, then establishing a dedicated crypto benchmark becomes an urgent task. Such a benchmark could become a key bridge between the two fields, catalyzing innovation and providing clear guidance for future applications.
Compared with mature benchmarks in other fields, however, building one for crypto faces unique challenges: crypto technology is evolving rapidly, the industry's body of knowledge has not yet solidified, and consensus is lacking on several core directions. As an interdisciplinary field, crypto spans cryptography, distributed systems, economics, and more, and its complexity far exceeds that of any single discipline. Even more challenging, a crypto benchmark must evaluate not only knowledge but also AI's practical ability to use crypto technology, which requires designing an entirely new evaluation architecture. The lack of relevant data sets makes the task harder still.
The complexity and importance of this task mean it cannot be accomplished by a single person or team. It requires the collective wisdom of users, developers, cryptographers, crypto researchers, and other interdisciplinary contributors, and it depends on broad community participation and consensus. Crypto benchmarks therefore need wider discussion, because this is not only a technical undertaking but also a deep reflection on how we understand this emerging technology.
Postscript: The topic is far from exhausted. In the next article, I will explore in depth the specific ideas and challenges involved in building AI benchmarks for the crypto field. The experiment is still ongoing, and we are continuously refining the test methodology, enriching the data set, improving the evaluation framework, and improving the automated testing tooling. In the spirit of open collaboration, all related resources, including data sets, experimental results, evaluation frameworks, and automated testing code, will be open-sourced as public resources.
Disclaimer: The content of this article solely reflects the author's opinion and does not represent the platform in any capacity. This article is not intended to serve as a reference for making investment decisions.