Original title: Glue and coprocessor architectures
Original author: Vitalik Buterin, co-founder of Ethereum
Original translation: Deng Tong, Golden Finance
Special thanks to Justin Drake, Georgios Konstantopoulos, Andrej Karpathy, Michael Gao, Tarun Chitra and various Flashbots contributors for their feedback and comments.
If you analyze any resource-intensive computation going on in the modern world in moderate detail, one feature you’ll find again and again is that computation can be divided into two parts:
· A relatively small amount of complex but computationally inefficient “business logic”;
· A large amount of intensive but highly structured “expensive work”.
These two forms of computation are best handled in different ways: the former with an architecture that may be less efficient but must be very general, and the latter with an architecture that may be less general but must be very efficient.
First, let’s look at the environment I’m most familiar with: the Ethereum Virtual Machine (EVM). Here’s a geth debug trace of an Ethereum transaction I recently made: updating the IPFS hash of my blog on ENS. The transaction consumed a total of 46,924 gas, which can be broken down as follows:
· Base cost: 21,000
· Calldata: 1,556
· EVM execution: 24,368
· SLOAD opcodes: 6,400
· SSTORE opcodes: 10,100
· LOG opcodes: 2,149
· Other opcodes: 6,719
EVM trace of the ENS hash update. The second to last column is the gas consumption.
The moral of the story is this: the majority of execution (~73% if you look at just the EVM, ~85% if you include the portion of the base cost that covers computation) is concentrated in a very small number of structured expensive operations: storage reads and writes, logging, and cryptography (the base cost includes 3,000 to pay for signature verification, and the EVM portion includes a further 272 to pay for hashing). The rest of the execution is "business logic": shuffling bits of calldata around to extract the ID of the record I'm trying to set and the hash I'm setting it to, and so on. In a token transfer this would include adding and subtracting balances; in more advanced applications it might include loops, and so on.
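The ~73% figure can be reproduced from the trace numbers; here is a quick sketch (grouping SLOAD/SSTORE/LOG as the structured "expensive" part is my own illustrative reading of the trace, not an exact accounting):

```python
# Gas numbers from the EVM portion of the trace above; the split into
# structured "expensive" opcodes vs the rest is an illustrative grouping.
evm = {"SLOAD": 6_400, "SSTORE": 10_100, "LOG": 2_149, "other": 6_719}
expensive = evm["SLOAD"] + evm["SSTORE"] + evm["LOG"]
share = expensive / sum(evm.values())
print(f"expensive share of EVM execution: {share:.1%}")  # roughly the ~73% quoted above
```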
In the EVM, these two forms of execution are handled differently. High-level business logic is written in a higher-level language, typically Solidity, which compiles to the EVM. Expensive work is still triggered by EVM opcodes (SLOAD, etc.), but 99%+ of the actual computation is done in dedicated modules written directly inside client code (or even libraries).
To reinforce our understanding of this pattern, let’s explore it in another context: AI code written in Python using torch.
Forward pass of one block of a transformer model
What do we see here? We see a relatively small amount of “business logic” written in Python, which describes the structure of the operations being performed. In a real application, there would be another type of business logic that determines details such as how to get inputs and what operations to perform on outputs. However, if we drill down into each individual operation itself (the individual steps inside self.norm, torch.cat, +, *, self.attn, …), we see vectorized computation: the same operation computes a large number of values in parallel. Similar to the first example, a small portion of the computation is for business logic, and the majority of the computation is for performing large structured matrix and vector operations — in fact, most are just matrix multiplications.
Just like in the EVM example, these two types of work are handled in two different ways. The high-level business logic code is written in Python, a highly general and flexible language, but also very slow, and we simply accept the inefficiency because it only involves a small portion of the total computation cost. Meanwhile, the intensive operations are written in highly optimized code, typically CUDA code that runs on GPUs. We are even increasingly starting to see LLM inference being done on ASICs.
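As a rough illustration of this split, here is a minimal sketch of a transformer-style block, with NumPy standing in for torch (the single-head attention structure, shapes, and function names are simplifying assumptions, not the model in the figure): a handful of Python statements describe the structure, while nearly all the arithmetic happens inside a few large matrix multiplications.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention_block(x, Wq, Wk, Wv):
    # "Business logic": a few lines describing the structure of the computation.
    h = layer_norm(x)
    # "Expensive work": large vectorized matrix multiplications.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Softmax over each row of the score matrix.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v  # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out = attention_block(x, Wq, Wk, Wv)
```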
Modern programmable cryptography, like SNARKs, again follows a similar pattern on two levels. First, the prover can be written in a high-level language where the heavy lifting is done with vectorized operations, just like in the AI example above. The circle STARK code linked here demonstrates this. Second, the programs executed inside the cryptography itself can be written in a way that splits the difference between general business logic and highly structured expensive work.
To see how this works, we can look at one of the recent trends in STARK proofs. In order to be general and easy to use, teams are increasingly building STARK provers for widely adopted minimal virtual machines like RISC-V. Any program that needs to have an execution proven can be compiled to RISC-V, and the prover can then prove a RISC-V execution of that code.
Diagram from the RiscZero documentation
This is very convenient: it means we only have to write the proof logic once, and from then on, any program that needs a proof can be written in any "traditional" programming language (RiscZero supports Rust, for example). However, there is a problem: this approach incurs a lot of overhead. Programmable crypto is already very expensive; adding the overhead of running code in a RISC-V interpreter is too much. So the developers came up with a trick: identify the specific expensive operations that make up the bulk of the computation (usually hashing and signing), and then create specialized modules to prove those operations very efficiently. Then you just combine the inefficient but general RISC-V proof system with the efficient but specialized proof system, and you get the best of both worlds.
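A toy sketch of this hybrid: a deliberately naive interpreter handles the general path, while a lookup table of specialized handlers intercepts the expensive operations (all names here are illustrative; this is not RiscZero's actual API, and a real prover would of course be proving these steps, not merely running them):

```python
import hashlib

# Specialized "coprocessor" handlers for the expensive, structured operations.
PRECOMPILES = {
    "sha256": lambda data: hashlib.sha256(data).digest(),
}

def run(program, state):
    """Naive interpreter: cheap business logic runs op-by-op,
    expensive ops are dispatched to a specialized module."""
    for op, arg in program:
        if op in PRECOMPILES:
            state = PRECOMPILES[op](state)  # efficient specialized path
        elif op == "append":
            state = state + arg             # general but slow path
        else:
            raise ValueError(f"unknown op {op!r}")
    return state

digest = run([("append", b"hello"), ("sha256", None)], b"")
```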
Programmable crypto beyond ZK-SNARKs, such as multi-party computation (MPC) and fully homomorphic encryption (FHE), will likely be optimized using a similar approach.
Modern computing increasingly follows what I call a glue and coprocessor architecture: you have some central "glue" component, which is high in generality but low in efficiency, that is responsible for ferrying data between one or more coprocessor components, which are low in generality but high in efficiency.
This is a simplification: in practice, the trade-off curve between efficiency and generality almost always has more than two levels. GPUs and other chips, often called "coprocessors" in the industry, are less general than CPUs but more general than ASICs. The trade-offs around degree of specialization are complex, and depend on predictions and intuitions about which parts of an algorithm will remain the same in five years and which parts will change in six months. In ZK proof architectures, we often see similar multiple levels of specialization. But for a broad mental model, two levels are sufficient. There are parallels to this in many areas of computing: the EVM and its precompiles, Python and CUDA, RISC-V provers and their specialized proving modules.
From the above examples, it certainly seems like a law of nature that computation can be split up in this way. In fact, you can find examples of computation specialization going back decades. However, I think this separation is increasing. And I think there's a reason for this:
We've only recently reached the limits of CPU clock speed increases, so further gains are only possible through parallelization. However, parallelism is hard to reason about, so it's often more practical for developers to keep reasoning sequentially and let the parallelization happen in the back end, wrapped inside dedicated modules built for specific operations.
Computation has only recently become so cheap and fast that the computational cost of business logic has become truly negligible. In this world, it also makes sense to optimize the VM where the business logic runs for goals other than computational efficiency: developer friendliness, familiarity, security, and other similar goals. Meanwhile, dedicated "coprocessor" modules can continue to be designed for efficiency, and derive their security and developer friendliness from their relatively simple "interface" to the glue.
It is becoming increasingly clear what the most important expensive operations are. This is most evident in cryptography, where specific types of expensive operations are most likely to be used: modular operations, elliptic curve linear combinations (aka multi-scalar multiplication), fast Fourier transforms, and so on. This is also becoming increasingly clear in AI, where for more than two decades, most computation has been "mostly matrix multiplication" (albeit at varying levels of precision). Similar trends are occurring in other fields. There are far fewer unknown unknowns in (compute-intensive) computation than there were 20 years ago.
A key point is that the glue should be optimized to be a good glue, and the coprocessor should be optimized to be a good coprocessor. We can explore the implications of this in a few key areas.
Blockchain virtual machines (such as the EVM) do not need to be efficient, just familiar. With the addition of the right coprocessors (aka "precompiles"), computation in an inefficient VM can actually be as efficient as computation in a native, efficient VM. For example, the overhead incurred by the EVM's 256-bit registers is relatively small, while the benefits of the EVM's familiarity and existing developer ecosystem are large and lasting. Development teams optimizing the EVM have even found that the lack of parallelization is generally not a major barrier to scalability.
The best ways to improve the EVM are probably just (i) adding better precompiles or specialized opcodes, e.g. some combination of EVM-MAX and SIMD might be reasonable, and (ii) improving storage layout, e.g. the Verkle tree changes greatly reduce the cost of accessing storage slots that are adjacent to each other as a side effect.
The storage optimization in the Ethereum Verkle tree proposal puts adjacent storage keys together and adjusts gas costs to reflect this. Optimizations like this, combined with better precompiles, may be more important than tweaking the EVM itself.
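A toy cost model can make the adjacency effect concrete (the constants and the 256-slot grouping here are illustrative assumptions, not the actual proposed gas schedule):

```python
# Hypothetical cost model: the first access to a group ("stem") of 256
# adjacent storage slots pays a full cold cost; subsequent accesses to
# slots in the same group pay only a small marginal cost.
COLD_GROUP_COST = 2_000  # illustrative number, not a real gas value
WARM_SLOT_COST = 200     # illustrative number, not a real gas value

def access_cost(slots):
    seen_groups = set()
    total = 0
    for slot in slots:
        group = slot // 256
        if group not in seen_groups:
            seen_groups.add(group)
            total += COLD_GROUP_COST  # first touch of this group
        total += WARM_SLOT_COST
    return total

# Eight adjacent slots share one group; eight scattered slots each
# land in a different group and pay the cold cost every time.
adjacent = access_cost(range(8))
scattered = access_cost(range(0, 8 * 256, 256))
```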
One of the big challenges in improving the security of modern computing at the hardware level is its overly complex and proprietary nature: chips are designed to be efficient, which requires proprietary optimizations. Backdoors are easy to hide, and side-channel vulnerabilities are constantly being discovered.
There continues to be efforts to push for more open, more secure alternatives from multiple angles. Some computing is increasingly done in trusted execution environments, including on users’ phones, which has improved security for users. The push for more open source consumer hardware continues, with some recent wins, such as a RISC-V laptop running Ubuntu.
RISC-V laptop running Debian
However, efficiency remains an issue. The author of the above linked article writes:
Newer open source chip designs such as RISC-V cannot possibly compete with processor technology that has been around and improved for decades. Progress always has to start somewhere.
More paranoid ideas, such as this design of building a RISC-V computer on an FPGA, face greater overhead. But what if glue and coprocessor architectures mean that this overhead is not actually significant? What if we accept that open and secure chips will be slower than proprietary chips, even giving up common optimizations such as speculative execution and branch prediction if necessary, but try to make up for this by adding (proprietary if necessary) ASIC modules that are used for specific types of computation that are most intensive? Sensitive computations can be done in the "main chip", which will be optimized for security, open source design, and side-channel resistance. More intensive computations (e.g. ZK proofs, AI) will be done in ASIC modules, which will know less information about the computation being performed (potentially, through cryptographic blinding, and possibly even zero information in some cases).
Another key point is that all of this is very optimistic about cryptography, and especially programmable cryptography, going mainstream. We’ve already seen super-optimized implementations of some specific highly structured computations in SNARK, MPC, and other settings: some hash functions are only a few hundred times more expensive than running the computation directly, and AI (mostly matrix multiplication) has very low overhead as well. Further improvements such as GKR may drive this down even further. Fully general-purpose VM execution, especially when executed in a RISC-V interpreter, will probably continue to have an overhead of about ten thousand times, but for the reasons described above this won’t matter: as long as the most intensive parts of the computation are handled separately using efficient, specialized techniques, the total overhead will be manageable.
A simplified diagram of a specialized MPC for matrix multiplication, the largest component in AI model inference. See this article for more details, including how to keep models and inputs private.
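To give a flavor of how a coprocessor can do the heavy work while learning little about it, here is a much-simplified sketch of additive blinding for an outsourced matrix multiplication (this illustrates the blinding idea only, not the protocol from the linked article; the mask product `R @ W` is assumed to come from a cheap preprocessing phase):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4))   # model weights, held by the server
X = rng.standard_normal((8, 16))   # private client input

R = rng.standard_normal(X.shape)   # client's one-time random mask
RW = R @ W                         # assumed precomputed offline (hypothetical)

masked = X + R                     # the only thing the server ever sees
server_out = masked @ W            # heavy matmul runs on blinded data
result = server_out - RW           # client unblinds: (X + R)W - RW = XW

assert np.allclose(result, X @ W)
```

Over a finite field with uniform masks this kind of blinding hides `X` perfectly; real-valued Gaussian masks are used here only to keep the sketch short.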
One exception to the idea that the glue layer only needs to be familiar, not efficient, is latency, and to a lesser extent data bandwidth. If the computation involves heavy operations on the same data dozens of times (as is the case in cryptography and AI), then any latency caused by an inefficient glue layer can become a major bottleneck in runtime. Therefore, the glue layer also has efficiency requirements, although these are more specific.
Overall, I think the above trend is a very positive development from multiple perspectives. First, it is a reasonable way to maximize computational efficiency while remaining developer-friendly, and being able to get more of both is good for everyone. In particular, by enabling specialization on the client side for efficiency, it improves our ability to run sensitive and performance-demanding computations (e.g., ZK proofs, LLM inference) locally on user hardware. Second, it creates a huge window of opportunity to ensure that the pursuit of efficiency does not compromise other values, most notably security, openness, and simplicity: side-channel security and openness in computer hardware, reduced circuit complexity in ZK-SNARKs, and reduced complexity in virtual machines. Historically, the pursuit of efficiency has pushed these other factors to the back seat. With glue and coprocessor architectures, it no longer has to: one part of the machine optimizes for efficiency, another optimizes for generality and other values, and the two work together.
This trend is also very good for cryptography itself, which is a prime example of "expensive structured computation" and which this trend accelerates. It also adds another opportunity to improve security: in the blockchain world, we can worry less about optimizing the virtual machine and more about optimizing the precompiles and other features that coexist with it.
Third, this trend provides opportunities for smaller, newer players to participate. If computation becomes less monolithic and more modular, the barrier to entry drops substantially. Even an ASIC specialized for a single type of computation has the potential to make a difference; this is true in ZK proving and in EVM optimization alike. Writing code with near-frontier efficiency becomes easier and more accessible, and so does auditing and formally verifying such code. Finally, because these very different areas of computation are converging on some common patterns, there is more room for collaboration and learning between them.