DNET: Towards Heterogeneous Multiprocessing for Deep Learning

A compiler-driven approach to unify diverse CPUs, GPUs, and NPUs into a seamless, scalable inference fabric, breaking vendor lock-in and democratizing AI compute.

1. Context

The concentration of high-performance AI in the hands of a few powerful entities poses significant risks: it fosters technological monopolies, restricts innovation through vendor lock-in, inflates costs, and exacerbates global inequalities.

This centralization amplifies systemic vulnerabilities in a world increasingly reliant on AI; hardware shortages, geopolitical tensions, or supply-chain disruptions could trigger widespread operational failures.

The world's energy already flows through vastly heterogeneous fleets of hardware, yet today's AI software treats that hardware as if it were a single, monolithic supercomputer.

Commoditizing and quantifying the ownership and exchange of intelligence opens the path to a world where problem-solving morphs from a human endeavour into an open market of data and semiconductors. As compute becomes a de facto universal currency for acting upon the world, we have to ensure that people's access to it is not blocked by rent-seeking, anti-competitive corporate practices or by proprietary ecosystems that, willingly or unwillingly, impose artificial limitations.

On the hardware front, DNET will serve as a general software interconnect, joining processors into a network and compounding the capabilities of everyday devices so that together they function as a single accelerator. In its mature form, our software stack should act as a generalizable abstraction layer over the complexities of hardware design and over the different trade-offs manufacturers accept for their niche of the market. It would allow developers to deploy to a heterogeneous network of devices without worrying about the topology and structure of each individual node.

Building upon the previous work available in the compiler literature, with its wide array of battle-tested algorithms and representations, restricting the set of programs to the common subset of modern machine learning algorithms, and focusing mainly on inference together make this herculean task manageable.

1.1. Execution engines

Devices handle the execution of instructions in their internal pipelines differently: more advanced systems might employ out-of-order execution, SIMD, branch prediction and speculative execution, while simpler embedded ones might stick to in-order, single-issue pipelines.

Different instruction cache structures, the choice of ISA (RISC or CISC) and of decoding into micro-ops, or even the number and width of register files and multiplexers can all influence execution time and scheduling, but many of these details are abstracted away by the ISA, kernel and ABI, leaving us with a mostly standardized interface. Scheduling can be complex, especially for out-of-order and SIMD designs, but the more challenging part is handling the memory needed by these execution pipelines.

1.2. The memory problem

Handling memory efficiently has been an industry-wide problem in recent years, driving innovation starting at the silicon layer: the advance of PIM (Processing in Memory) architectures depends on changes to the photolithography stage with the testing and introduction of new RAM cells like ReRAM, on the placement of memory modules on the chip, on TSVs (Through-Silicon Vias), and on the introduction of advanced packaging fabs and processes like TSMC's CoWoS.

Memory transfer speeds over custom interconnects have also been getting a lot of attention, with companies like Apple deploying multiple dies in one package, as in the M3 Ultra, and Nvidia developing custom memory modules for their server-class products.

On the software side, we have seen the introduction of RDMA and of libraries that abstract multiple devices and physical memory modules into a single unified address space. The network must adapt to all of these different devices and attempt to make use of the highest-bandwidth interface available.

The scheduling and transfer of memory across the entire network is the main bottleneck of distributed inference. In terms of the OSI model, the lowest layer one can specialize without explicit hardware support is the transport layer, using the TCP/UDP protocols, and even there we are bound by the hardware's packet-processing acceleration; everything below that is the responsibility of the hardware manufacturers. This leaves us responsible for discovering the networking capabilities of each node on the network and, if needed, scheduling around them to minimize halts spent waiting on slow nodes. The network should also be structured to minimize the communication required over these latency-heavy mediums and to maximize the use of each node's low-latency local environment.
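
To make the scheduling idea concrete, the sketch below shows one simple policy, proportional work splitting: given a per-node throughput estimate that already folds in both compute and link speed, each node receives a share of the work proportional to that estimate, so no node sits idle waiting on a slower peer. The types, names and numbers are illustrative assumptions, not DNET's actual scheduler.

#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double units_per_sec;    /* measured end-to-end throughput of this node */
} NodeCaps;

/* Give each node a share proportional to its throughput, so every node
 * finishes its share in roughly total_units / sum_of_throughputs seconds. */
static void partition_work(const NodeCaps *nodes, size_t n,
                           double total_units, double *shares)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += nodes[i].units_per_sec;
    for (size_t i = 0; i < n; i++)
        shares[i] = total_units * nodes[i].units_per_sec / sum;
}

int main(void)
{
    NodeCaps fleet[] = {
        { "phone-npu",    2.0 },
        { "laptop-gpu",  10.0 },
        { "desktop-gpu", 40.0 },
    };
    double shares[3];
    partition_work(fleet, 3, 1000.0, shares);
    for (int i = 0; i < 3; i++)
        printf("%-12s -> %.1f work units\n", fleet[i].name, shares[i]);
    return 0;
}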

Locally, incorrect management of the different devices can impair performance. Data transfers between system memory and accelerators running at different clock rates must be minimized and performed asynchronously. Data reuse must be maximized to minimize cache/TLB misses and page-table walks, especially to external memory. Most of these operations require communication through a specialized driver. Many manufacturers consider explicit control of the accelerator or chip to be proprietary IP, offering instead a closed-source kernel-module driver and a runtime that functions as an API. Many of the abstractions provided are helpful, built to ensure intricate subsystems work properly, but many others simply obfuscate hardware implementation details.
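
The asynchronous-transfer requirement typically reduces to a double-buffering pattern: while the accelerator computes on one staging buffer, the next chunk of data is already being copied into the other. The sketch below illustrates the pattern; the accel_* calls are hypothetical stand-ins for whatever asynchronous copy/fence/launch API a given driver exposes, stubbed here so the example compiles.

#include <stdio.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 1024
static float device_buf[2][CHUNK];             /* two staging buffers */

/* Hypothetical driver calls, stubbed so the sketch compiles. */
static void accel_copy_async(float *dst, const float *src, size_t n) {
    memcpy(dst, src, n * sizeof(float));       /* stand-in for an async DMA */
}
static void accel_wait(int slot) { (void)slot; /* stand-in for a fence */ }
static float accel_compute(const float *buf, size_t n) {
    float acc = 0.0f;                          /* stand-in for a kernel launch */
    for (size_t i = 0; i < n; i++) acc += buf[i];
    return acc;
}

/* While chunk i is being computed on, chunk i+1 is already being staged
 * into the other buffer, hiding transfer latency behind compute. */
static float pipeline(const float *host, size_t chunks)
{
    float total = 0.0f;
    accel_copy_async(device_buf[0], host, CHUNK);
    for (size_t i = 0; i < chunks; i++) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        if (i + 1 < chunks)                    /* prefetch the next chunk */
            accel_copy_async(device_buf[nxt], host + (i + 1) * CHUNK, CHUNK);
        accel_wait(cur);                       /* ensure chunk i has landed */
        total += accel_compute(device_buf[cur], CHUNK);
    }
    return total;
}

int main(void)
{
    static float host[4 * CHUNK];
    for (size_t i = 0; i < 4 * CHUNK; i++) host[i] = 1.0f;
    printf("sum = %.0f\n", pipeline(host, 4));
    return 0;
}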

2. General architecture and approach

To handle all of the above requirements, we have opted for a graph-centric compiler design, as opposed to a simpler inference-engine design biased towards a more eager execution style. Since scheduling and planning across all of the different sub-systems requires knowledge of the entire model graph, we transition through multiple IR representations, ranging from a restricted set of tensor-level logic and memory operations structured as a DAG down to an instruction stack.
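
As a rough illustration of what the tensor-level end of that range might carry, the sketch below defines a DAG node with a symbolic shape, explicit data dependencies, and a placement field; the names and layout are our own assumptions, not DNET's actual IR.

#include <stddef.h>

typedef enum { OP_CONST, OP_MATMUL, OP_ADD, OP_SOFTMAX, OP_COPY } OpKind;

typedef struct {
    long dims[4];                 /* -1 marks a symbolic, not-yet-known dim */
    int  rank;
} SymShape;

typedef struct IRNode {
    OpKind         kind;
    SymShape       shape;         /* shape of the tensor this node produces */
    struct IRNode *inputs[4];     /* DAG edges: the nodes this one consumes */
    int            n_inputs;
    int            device_id;     /* node of the network topology it maps to */
} IRNode;

int main(void)
{
    /* x[batch, seq, 4096] @ w[4096, 4096] -> y[batch, seq, 4096] */
    IRNode x = { OP_CONST,  { { -1, -1, 4096 }, 3 }, { NULL }, 0, 0 };
    IRNode w = { OP_CONST,  { { 4096, 4096 },   2 }, { NULL }, 0, 0 };
    IRNode y = { OP_MATMUL, { { -1, -1, 4096 }, 3 }, { &x, &w }, 2, 0 };
    (void)y;
    return 0;
}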

2.1. Pipeline

We begin by ingesting an abstract model graph from our frontend. This can take the form of Python-level bindings for declaring custom models in a familiar, high-level form, potentially even supporting input from popular frameworks like PyTorch. For a more involved interface, we introduce a DSL with syntax resembling Lisp/Scheme functional declarations and much deeper control over graph and member symbols; the example below shows the forward function of an attention object.

forward(Attention, {
  fn_argument(forward, x, S_TENSOR, SYM_EXTERNAL, Tensor(shape(WILDCARD, WILDCARD, hidden_size), D_F16, 0));
  fn_argument(forward, mask,  S_TENSOR, SYM_EXTERNAL | SYM_OPTIONAL, Tensor(shape(WILDCARD), D_F16, 0));
  fn_argument(forward, cache, S_TENSOR, SYM_EXTERNAL | SYM_OPTIONAL, Tensor(shape(WILDCARD), D_F16, 0)); 

  capture( forward, 
    call(scaled_dot_product_attention,
         mul(idiv(hidden_size, num_attention_heads), 
             call(project_and_format, q_proj, x, num_attention_heads, cache)),
         call(project_and_format, k_proj, x, num_key_value_heads, cache),
         call(project_and_format, v_proj, x, num_key_value_heads, cache),
         cache, mask));
}); 

Either way, this layer's representation stores separate graphs for each instantiation of a child object as member symbols, allowing us to schedule common object-level passes like constant propagation and folding, branch removal, and so on. Already at this stage we have access to all of the program's memory allocations, and we can begin registering tensors in scoped symbol tables and tracking symbolic shapes and memory-access descriptors throughout the entire data-flow pipeline.
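
A scoped symbol table of this kind can be pictured as a chain of scopes, where a lookup that misses locally keeps walking outward to the enclosing object. The sketch below is illustrative only; the structure and names are assumptions, not DNET's implementation.

#include <stdio.h>
#include <string.h>

#define SCOPE_CAP 32

typedef struct {
    const char *name;
    long        dims[4];          /* symbolic shape; -1 for an unknown dim */
    int         rank;
} TensorSym;

typedef struct Scope {
    struct Scope *parent;         /* enclosing scope, NULL at the root */
    TensorSym     entries[SCOPE_CAP];
    int           count;
} Scope;

static void scope_define(Scope *s, TensorSym sym)
{
    if (s->count < SCOPE_CAP)
        s->entries[s->count++] = sym;
}

/* A lookup that misses in the local scope keeps walking outward. */
static const TensorSym *scope_lookup(const Scope *s, const char *name)
{
    for (; s != NULL; s = s->parent)
        for (int i = 0; i < s->count; i++)
            if (strcmp(s->entries[i].name, name) == 0)
                return &s->entries[i];
    return NULL;
}

int main(void)
{
    Scope model = { 0 };          /* object-level scope */
    Scope attn  = { 0 };          /* nested member scope */
    attn.parent = &model;

    scope_define(&model, (TensorSym){ "embed_w", { 32000, 4096 }, 2 });
    scope_define(&attn,  (TensorSym){ "q_proj",  { 4096, 4096 },  2 });

    const TensorSym *t = scope_lookup(&attn, "embed_w");   /* found in parent */
    printf("%s rank=%d\n", t ? t->name : "missing", t ? t->rank : -1);
    return 0;
}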

A more complete picture is introduced with the lowering into the IR and the inlining of the separate objects into a full model graph we can reason about. Here we have instruction-level control over the scheduling of each operation, and complex data-flow analysis provides information on data dependencies and on any limits to parallelism imposed by the architecture at hand. If we are going to deploy on a network of nodes, we make use of a network topology graph that embeds information on the asymmetry between node capabilities and drives how we partition the pre-processed model graph into sub-graphs mapped to nodes of the topology, with the goal of synchronizing the production of output tensors. This sharding also has instruction-level scope, allowing us to optimize for any level of parallelism, including element-wise tensor tiling. Transformation passes are responsible for restructuring the graph at this stage into one that can run efficiently given the known limitations. Lowering into the executable can take two paths: either generating C code that is then passed into a traditional compiler, or directly emitting assembly.
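
For the C-generation path, the output for a single fused sub-graph might look like the kernel below (a bias add fused with a ReLU). This is purely illustrative of the kind of flat, dependency-free C a backend could emit; the function name and signature are hypothetical.

#include <stddef.h>

/* dnet_kernel_bias_relu_f32: hypothetical name for one emitted kernel. */
void dnet_kernel_bias_relu_f32(const float *restrict in,
                               const float *restrict bias,
                               float *restrict out,
                               size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            float v = in[r * cols + c] + bias[c];    /* bias add */
            out[r * cols + c] = v > 0.0f ? v : 0.0f; /* fused ReLU */
        }
    }
}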

In the case of a network, this process happens locally on every node, since it requires hardware information to correctly schedule the lowering of that node's sub-graph. Later on, a JIT runtime will be introduced at the execution stage to create a feedback loop and further optimize both the local execution process and the distribution of work over the entire network. Detailed performance counters can be requested from most devices' PMUs (Performance Monitoring Units) to understand chip behaviour in detail.
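
On Linux, one common way to reach such counters from user space is the perf_event_open interface; the minimal sketch below reads a single hardware counter (retired instructions) around a small workload. It follows the standard usage pattern for that syscall and is not DNET-specific code.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Thin wrapper; glibc does not export this syscall directly. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = (int)perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;                    /* the workload being measured */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("retired instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}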

3. Conclusion

DNET represents an ambitious step toward commodifying inference by building a robust software abstraction layer over heterogeneous hardware. By unifying disparate devices into a cooperative execution network, we aim to unlock unused silicon potential and reduce dependence on monolithic, vendor-locked accelerators.
