Vitalik: Glue and Coprocessor Architecture, a New Concept to Improve Efficiency and Security
The glue should be optimized to be a good glue, and the coprocessor should also be optimized to be a good coprocessor.
Original Title: "Glue and coprocessor architectures"
Written by: Vitalik Buterin, co-founder of Ethereum
Translation: Deng Tong, Jinse Finance
Special thanks to Justin Drake, Georgios Konstantopoulos, Andrej Karpathy, Michael Gao, Tarun Chitra, and various Flashbots contributors for their feedback and comments.
If you analyze any resource-intensive computation in the modern world with a moderate level of detail, one feature you will find again and again is that computation can be divided into two parts:
- A relatively small amount of complex but not computationally heavy "business logic";
- A large amount of intensive but highly structured "expensive work".
These two forms of computation are best handled in different ways: the former may have an inefficient architecture but requires very high generality; the latter may have lower generality but requires very high efficiency.
What are some practical examples of these different approaches?
First, let's look at the environment I am most familiar with: the Ethereum Virtual Machine (EVM). Here is a geth debug trace of a recent Ethereum transaction I made: updating the IPFS hash of my blog on ENS. The transaction consumed a total of 46,924 gas, which can be categorized as follows:
- Base cost: 21,000
- Call data: 1,556
- EVM execution: 24,368
- SLOAD opcode: 6,400
- SSTORE opcode: 10,100
- LOG opcode: 2,149
- Others: 6,719
EVM trace of ENS hash update. The second-to-last column is gas consumption.
The moral of the story: most of the execution (about 73% if you look only at the EVM, about 85% if you include the portion of the base cost that covers computation) is concentrated in a very small number of structured, expensive operations: storage reads and writes, logs, and cryptography (the base cost includes 3,000 for signature verification, and the EVM also includes 272 for hashing). The rest of the execution is "business logic": shuffling the bits of the calldata to extract the ID of the record I am trying to set and the hash I am setting it to, and so on. In a token transfer, this would include adding and subtracting balances; in more advanced applications, it might include loops, and so on.
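As a rough illustration of how such a breakdown can be produced, here is a minimal sketch that tallies gas per category from a geth `debug_traceTransaction`-style JSON (the `structLogs` entries carry `op` and `gasCost` fields; the category mapping and the file name are illustrative, not exhaustive):

```python
import json
from collections import Counter

# Map a few "expensive" opcodes to categories; everything else is
# treated as general-purpose "business logic" execution.
EXPENSIVE = {
    "SLOAD": "storage read",
    "SSTORE": "storage write",
    "LOG0": "log", "LOG1": "log", "LOG2": "log", "LOG3": "log", "LOG4": "log",
    "SHA3": "hashing", "KECCAK256": "hashing",  # name varies by client version
}

def categorize(trace_path: str) -> Counter:
    """Roughly tally gas per category from a geth debug trace JSON file."""
    with open(trace_path) as f:
        trace = json.load(f)
    totals: Counter = Counter()
    for step in trace["structLogs"]:
        category = EXPENSIVE.get(step["op"], "business logic")
        totals[category] += step.get("gasCost", 0)
    return totals

# Hypothetical usage: print(categorize("ens_update_trace.json"))
```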
In the EVM, these two forms of execution are handled differently. High-level business logic is written in higher-level languages, usually Solidity, which can be compiled to EVM. Expensive work is still triggered by EVM opcodes (such as SLOAD), but more than 99% of the actual computation is done in dedicated modules written directly inside client code (or even libraries).
To reinforce this pattern, let's explore it in another context: AI code written in Python using torch.
Forward pass of a block in a transformer model
What do we see here? We see a relatively small amount of "business logic" written in Python, describing the structure of the operations being performed. In practice, there would also be another type of business logic that determines details such as how to obtain inputs and what to do with outputs. But if we delve into each individual operation itself (self.norm, torch.cat, +, *, the steps inside self.attn...), we see vectorized computation: the same operation is performed in parallel on a large number of values. As in the first example, a small portion of computation is for business logic, while most computation is for performing large, structured matrix and vector operations—in fact, most are just matrix multiplications.
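For readers without the original snippet at hand, a minimal torch sketch of such a block might look like the following. This is a generic reconstruction for illustration, not the exact code from the screenshot; the module names and layer choices are assumptions:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: a little Python 'business logic' wiring
    together a few large, highly structured operations (mostly matmuls)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap "business logic": a few lines describing the wiring.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out  # residual connection
        # The expensive, structured work lives inside attn and mlp:
        # batched matrix multiplications, typically executed on a GPU.
        x = x + self.mlp(self.norm2(x))
        return x

# Usage: Block(dim=512, n_heads=8)(torch.randn(2, 16, 512)) -> shape (2, 16, 512)
```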
Just like in the EVM example, these two types of work are handled in two different ways. High-level business logic code is written in Python, a highly general and flexible language, but also very slow; we simply accept the inefficiency because it only involves a small part of the total computation cost. Meanwhile, intensive operations are written in highly optimized code, usually CUDA code running on GPUs. We are even increasingly seeing LLM inference being performed on ASICs.
Modern programmable cryptography, such as SNARKs, again follows a similar pattern on two levels. First, the prover can be written in a high-level language, with the heavy work done through vectorized operations, just like the AI example above. My circle STARK code here demonstrates this. Second, the program executed inside the cryptography itself can be written in a way that divides between general business logic and highly structured expensive work.
To understand how this works, we can look at one of the latest trends in STARK proofs. For generality and ease of use, teams are increasingly building STARK provers for widely adopted minimal virtual machines such as RISC-V. Any program that needs to prove execution can be compiled to RISC-V, and then the prover can prove the RISC-V execution of that code.
Diagram from the RISC Zero documentation
This is very convenient: it means we only need to write the proving logic once, and from then on, any program that needs to be proven can be written in any "traditional" programming language (for example, RISC Zero supports Rust). However, there is a problem: this approach incurs significant overhead. Programmable cryptography is already very expensive; adding the overhead of running code inside a RISC-V interpreter is too much. So developers have come up with a trick: identify the specific expensive operations that make up most of the computation (usually hashing and signatures), and create specialized modules to prove those operations very efficiently. Then, by combining the inefficient-but-general RISC-V proving system with the efficient-but-specialized proving systems, you get the best of both worlds.
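The delegation idea can be sketched very roughly in Python. This is a toy illustration of the pattern only, not how RISC Zero or any real prover is implemented; the instruction names and the recorded trace format are made up for the example:

```python
import hashlib

HASH_SYSCALL = "sha256"  # the "expensive operation" we want to offload

def run_program(program, state):
    """Toy 'glue' interpreter: executes general instructions one by one,
    but delegates hash calls to a specialized handler and records them so
    a dedicated sub-prover could prove them in bulk later."""
    delegated_hashes = []  # witness data for the specialized prover
    for instr, arg in program:
        if instr == HASH_SYSCALL:
            # Coprocessor path: computed natively and earmarked for an
            # efficient specialized hash proof, not the general VM proof.
            digest = hashlib.sha256(state[arg]).digest()
            delegated_hashes.append((state[arg], digest))
            state[arg] = digest
        elif instr == "append":
            # "Business logic" path: cheap and general, covered by the slow
            # general-purpose VM proof -- inefficiency here is acceptable.
            state[arg] = state[arg] + b"!"
        else:
            raise ValueError(f"unknown instruction {instr}")
    return state, delegated_hashes

# Usage:
state, hashes = run_program([("append", "x"), ("sha256", "x")], {"x": b"hello"})
```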
Programmable cryptography beyond ZK-SNARKs, such as multi-party computation (MPC) and fully homomorphic encryption (FHE), may use similar methods for optimization.
Overall, what is the phenomenon?
Modern computation increasingly follows what I call the glue and coprocessor architecture: you have some central "glue" component, which is highly general but inefficient, responsible for shuttling data between one or more coprocessor components, which are less general but highly efficient.
This is a simplification: in practice, the trade-off curve between efficiency and generality almost always has more than two layers. GPUs and other chips commonly called "coprocessors" in the industry are less general than CPUs but more general than ASICs. The trade-offs of specialization are complex, and depend on predictions and intuitions about which parts of an algorithm will remain unchanged in five years and which will change in six months. In ZK proving architectures we often see similar multi-layer specialization. But for a broad mental model, two levels are sufficient, and similar situations exist across many fields of computation.
From the above examples, it is clear that computation can certainly be split in this way, and it seems to be a law of nature. In fact, you can find examples of computational specialization going back decades. However, I believe this separation is increasing. I think there are reasons for this:
- We have only recently hit the limits of CPU clock-speed improvements, so further gains can come only from parallelization. But parallelization is hard to reason about, so for developers it is often more practical to keep reasoning sequentially and let the parallelism happen behind the scenes, wrapped inside dedicated modules built for specific operations.
- Computation has only recently become fast enough that the cost of business logic is truly negligible. In this world, it also makes sense to optimize the VM running the business logic for goals other than computational efficiency: developer-friendliness, familiarity, security, and so on. Meanwhile, the dedicated "coprocessor" modules can keep being designed for efficiency, deriving their security and developer-friendliness from the relatively simple "interface" they present to the glue.
- It is becoming increasingly clear what the most important expensive operations are. This is most obvious in cryptography, where the specific expensive operations most likely to be used are modular arithmetic, elliptic curve linear combinations (aka multi-scalar multiplication), fast Fourier transforms, and so on. It is also becoming increasingly clear in AI, where for more than two decades most computation has been "mostly matrix multiplication" (albeit at varying levels of precision). Similar trends are emerging in other fields. Compared to 20 years ago, there are far fewer unknown unknowns in (compute-intensive) computation; the short sketch below illustrates how concentrated these primitives are.
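As a small illustration, much of the "expensive work" named above reduces to a handful of well-known library calls. The sizes and modulus below are arbitrary placeholders, and elliptic-curve multi-scalar multiplication is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Modular arithmetic: batched modular exponentiation
p = 2**61 - 1
bases = rng.integers(2, p, size=1000)
powers = [pow(int(b), 65537, p) for b in bases]

# Fast Fourier transform over a large vector
signal = rng.standard_normal(1 << 16)
spectrum = np.fft.fft(signal)

# Matrix multiplication, the dominant cost in AI workloads
A = rng.standard_normal((1024, 1024))
B = rng.standard_normal((1024, 1024))
C = A @ B
```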
What does this mean?
A key point is that the glue should be optimized to be a good glue, and the coprocessor should also be optimized to be a good coprocessor. We can explore the implications of this in several key areas.
EVM
Blockchain virtual machines (such as the EVM) do not need to be efficient, only familiar. By simply adding the right coprocessors (aka "precompiles"), computation in an inefficient VM can actually be as efficient as computation in a natively efficient VM. For example, the overhead caused by the EVM's 256-bit registers is relatively small, while the benefits brought by the EVM's familiarity and existing developer ecosystem are huge and lasting. Teams optimizing the EVM have even found that the lack of parallelization is often not the main bottleneck for scalability.
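To make the "precompile as coprocessor" idea concrete, here is a toy dispatch sketch. The addresses listed are the canonical EVM precompile addresses, but the dispatch function and the placeholder interpreter are illustrative only, not how any actual client implements it:

```python
import hashlib

# A few canonical EVM precompile addresses: calls to them are routed to
# optimized native code instead of being interpreted as EVM bytecode.
PRECOMPILES = {
    0x02: lambda data: hashlib.sha256(data).digest(),  # SHA-256
    0x04: lambda data: data,                           # identity / datacopy
    # 0x01 (ecrecover), 0x05 (modexp), 0x06-0x08 (bn254 ops), ... omitted here
}

def call(address: int, data: bytes) -> bytes:
    """Toy dispatch: precompiled addresses take the fast native path;
    everything else would go through the (slower) general EVM interpreter."""
    if address in PRECOMPILES:
        return PRECOMPILES[address](data)
    return run_evm_bytecode(address, data)

def run_evm_bytecode(address: int, data: bytes) -> bytes:
    # Placeholder for general-purpose bytecode interpretation ("the glue").
    raise NotImplementedError("general EVM interpretation goes here")

# Usage: call(0x02, b"hello") returns the 32-byte SHA-256 digest natively.
```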
The best way to improve the EVM may simply be (i) to add better precompiles or specialized opcodes, for example, some combination of EVM-MAX and SIMD might be reasonable, and (ii) to improve storage layout, for example, the changes brought by Verkle trees as a side effect greatly reduce the cost of accessing adjacent storage slots.
Storage optimization in the Ethereum Verkle tree proposal, placing adjacent storage keys together and adjusting gas costs to reflect this. Optimizations like this, along with better precompiles, may be more important than tweaking the EVM itself.
Secure Computing and Open Hardware
One major challenge in improving the security of modern computing at the hardware level is its overly complex and proprietary nature: chips are designed for efficiency, which requires proprietary optimizations. Backdoors are easy to hide, and side-channel vulnerabilities are constantly being discovered.
People continue to push for more open and secure alternatives from multiple angles. Some computation is increasingly done in trusted execution environments, including on users' phones, which has already improved user security. The push for more open-source consumer hardware continues, with some recent victories, such as a RISC-V laptop running Debian.
RISC-V laptop running Debian
However, efficiency is still an issue. The author of the linked article above wrote:
Newer open-source chip designs like RISC-V cannot compete with processor technologies that have existed and been improved for decades. Progress always has a starting point.
More paranoid ideas, such as this design for building a RISC-V computer on an FPGA, face even greater overhead. But what if the glue and coprocessor architecture means this overhead actually doesn't matter? What if we accept that open and secure chips will be slower than proprietary chips, and if necessary even give up common optimizations like speculative execution and branch prediction, but try to make up for this by adding (if necessary, proprietary) ASIC modules for the most intensive specific types of computation? Sensitive computation can be done in the "main chip," which will be optimized for security, open-source design, and side-channel resistance. More intensive computation (such as ZK proofs, AI) will be done in ASIC modules, which will know less about the computation being performed (possibly, through cryptographic blinding, in some cases even zero knowledge).
Cryptography
Another key point is that all of this is very optimistic for cryptography, especially programmable cryptography, going mainstream. We have already seen ultra-optimized implementations of certain highly structured computations in SNARKs, MPC, and other settings: for certain hash functions the overhead is only a few hundred times the cost of running the computation directly, and the overhead for AI (mostly matrix multiplication) is also very low. Further improvements such as GKR may reduce this even further. Fully general VM execution, especially when run inside a RISC-V interpreter, may continue to incur an overhead of roughly 10,000x, but for the reasons described in this article, that does not matter: as long as the most intensive parts of the computation are handled separately with efficient specialized techniques, the total overhead is manageable.
Simplified diagram of matrix multiplication-specific MPC, the largest component in AI model inference. See this article for more details, including how to keep the model and inputs private.
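The arithmetic behind "the total overhead is manageable" is worth spelling out. A toy calculation, with placeholder numbers in the spirit of the estimates above (not measurements):

```python
# Fraction of the underlying computation that is structured "expensive work"
# handled by a specialized prover, vs. general business logic in the VM.
structured_fraction = 0.99
specialized_overhead = 200      # e.g. an optimized hash/matmul sub-prover
general_overhead = 10_000       # e.g. proving a RISC-V interpreter run

blended = (structured_fraction * specialized_overhead
           + (1 - structured_fraction) * general_overhead)
print(blended)  # 298.0 -- dominated by the specialized path, not the glue
```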
An exception to the idea that "the glue layer only needs to be familiar, not efficient" is latency, and to a lesser extent, data bandwidth. If computation involves dozens of repeated heavy operations on the same data (as in cryptography and AI), any latency caused by an inefficient glue layer can become the main bottleneck for runtime. Therefore, the glue layer also has efficiency requirements, although these requirements are more specific.
Conclusion
Overall, I believe the above trends are very positive developments from multiple perspectives. First, this is a reasonable way to maximize computational efficiency while maintaining developer-friendliness, enabling more of both for everyone's benefit. In particular, by implementing specialization on the client side for efficiency, it increases our ability to run sensitive and performance-demanding computations (such as ZK proofs, LLM inference) locally on user hardware. Second, it creates a huge window of opportunity to ensure that the pursuit of efficiency does not come at the expense of other values, most notably security, openness, and simplicity: side-channel security and openness in computer hardware, reduced circuit complexity in ZK-SNARKs, and reduced complexity in virtual machines. Historically, the pursuit of efficiency has relegated these other factors to secondary status. With glue and coprocessor architectures, this is no longer necessary. Part of the machine is optimized for efficiency, the other part for generality and other values, and the two work together.
This trend is also very beneficial for cryptography, as cryptography itself is a major example of "expensive structured computation," and this trend accelerates its development. This adds another opportunity to improve security. In the blockchain world, improved security is also possible: we can worry less about optimizing the virtual machine and focus more on optimizing precompiles and other features that coexist with the VM.
Third, this trend provides opportunities for smaller and newer participants to get involved. If computation becomes less monolithic and more modular, it greatly lowers the barrier to entry. Even ASICs for one type of computation can make a difference. This is also true in the ZK proof field and EVM optimization. Writing code with near state-of-the-art efficiency becomes easier and more accessible. Auditing and formal verification of such code becomes easier and more accessible. Finally, as these very different computational fields converge on some common patterns, there is more room for collaboration and learning between them.