Core mechanism - distributed heterogeneous computing power infrastructure

DGPT adopts a distributed heterogeneous AI computing power operation platform as the main carrier for efficient and reliable processing and use of diversified data. The platform meets the varied demands of upper-layer applications for computing resources and computing power.
The heterogeneous AI computing power operation platform is a fusion and adaptation platform for diversified AI computing power. It matches hardware performance to computing requirements, adapts heterogeneous computing power to user requirements, schedules heterogeneous computing power flexibly among nodes, and supports intelligent operation and open sharing of diversified computing power. By processing workloads collaboratively across the different types of heterogeneous computing power, it aims to make full use of each type and to provide high-performance, highly reliable computing power support for diverse AI application scenarios. The platform consists of four parts: a hardware support platform, a heterogeneous AI computing power adaptation platform, a heterogeneous AI computing power scheduling platform, and an intelligent operation open platform (see Fig. 4). Relying on a converged architecture that combines software and hardware, it addresses the poor compatibility and low efficiency caused by multiple architectures, and uses software definition to classify and integrate hardware resources, pool and reconfigure them, and allocate them intelligently.
Fig. 4 Heterogeneous AI computing power operation platform architecture
Technical Architecture of the Heterogeneous AI Computing Power Operation Platform
The heterogeneous AI computing power operation platform adopts a software-hardware convergence architecture and uses a software-defined approach to pool, reconfigure, and intelligently allocate hardware resources.
Resource Reconfiguration Technical Scheme
Hardware resources are grouped by category (computing, storage, and network) into pools of the same kind, so that resources can be recombined on demand across different devices. On the hardware side, pooling through reconfiguration integrates CPUs more closely with accelerators such as GPUs, FPGAs, and xPUs, and a new fully interconnected, ultra-high-speed internal and external interconnect fuses the heterogeneous computing chips. Computing resources can then be scheduled flexibly according to the business scenario, and heterogeneous storage media such as NVMe, SSDs, and HDDs are joined over high-speed interconnects to form a storage resource pool. At the software level, adaptive reconfiguration of hardware resources enables dynamic adjustment, flexible combination, and intelligent distribution of resources in response to multi-application, multi-scenario demands.
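The pooling idea can be illustrated with a minimal Python sketch. It is not the platform's implementation: the `Device` record, the abstract capacity units, the node names, and the largest-first selection rule are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    node: str      # hypothetical node name
    category: str  # "compute", "storage", or "network"
    kind: str      # e.g. "GPU", "FPGA", "NVMe"
    capacity: int  # abstract capacity units (an assumption of this sketch)


def build_pools(devices):
    """Group devices from all nodes into one pool per resource category."""
    pools = defaultdict(list)
    for dev in devices:
        pools[dev.category].append(dev)
    return dict(pools)


def compose(pools, category, needed):
    """Pick devices from one pool until `needed` capacity units are covered.

    Returns the chosen devices, or None if the pool cannot cover the request.
    """
    chosen, total = [], 0
    # Largest devices first keeps the number of devices per request small.
    for dev in sorted(pools.get(category, []), key=lambda d: -d.capacity):
        if total >= needed:
            break
        chosen.append(dev)
        total += dev.capacity
    return chosen if total >= needed else None


devices = [
    Device("node-a", "compute", "GPU", 8),
    Device("node-a", "storage", "NVMe", 4),
    Device("node-b", "compute", "FPGA", 2),
    Device("node-b", "storage", "HDD", 16),
    Device("node-c", "compute", "GPU", 8),
]
pools = build_pools(devices)
selection = compose(pools, "compute", 12)  # draws GPUs from two different nodes
```

The request for 12 compute units is served from devices on separate nodes, which is the on-demand recombination across devices that pooling is meant to allow.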
Advantages of hardware-software convergence architecture technology
On the one hand, the software-hardware convergence architecture supports massive resource processing requirements. The heterogeneous AI computing power operation platform can meet the system's requirements for performance, efficiency, stability, and scalability. It satisfies the high-bandwidth, low-latency concurrent access requirements of GPU or CPU computing clusters in AI training, and adapts to growth in data volume to the petabyte (PB) or even exabyte (EB) level as business deployments grow linearly. At the same time, it significantly shortens AI model generation time and makes fuller use of the hardware's computing power.
On the other hand, the software-hardware convergence architecture can meet the intelligent demands of multiple application scenarios. Built on software-defined computing, software-defined storage, and software-defined networking, it gives full play to the application-aware capability of the resource management and scheduling system and establishes an intelligent converged architecture. That architecture converges computing and storage while separating control from computing, and fuses diversified computing power by relying on products such as intelligent network cards, so that all resources can be dynamically combined at the software level within the scheduling scope to satisfy the demands of diversified applications.
Functional Architecture of the Heterogeneous AI Computing Power Operation Platform
Hardware support platform
The hardware support platform is based on a converged architecture and virtualises and pools multiple kinds of hardware resources, such as CPUs, GPUs, NPUs, FPGAs, and ASICs.
It establishes "CPU + AI acceleration chip" architectures such as "CPU+GPU", "CPU+FPGA", and "CPU+ASIC (TPU, NPU, VPU, BPU)", which play to the respective strengths of the CPU and the AI acceleration chip: the CPU handles interactive response, and the accelerator handles highly parallel computing. In complex AI application scenarios that process diversified data, the hardware support platform can assign each kind of computing task to the most appropriate hardware module, with the aim of optimising the computing power of the platform as a whole.
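As a rough illustration of assigning tasks to the most appropriate hardware module, the sketch below uses a static preference table. The task traits and the orderings in `PREFERENCES` are hypothetical; a real platform would presumably derive them from profiling rather than hard-code them.

```python
# Hypothetical preference table: workload trait -> hardware kinds, best first.
PREFERENCES = {
    "interactive": ["CPU"],
    "parallel_training": ["GPU", "NPU", "CPU"],
    "low_latency_inference": ["ASIC", "FPGA", "GPU", "CPU"],
}


def dispatch(task_trait, available):
    """Return the first preferred hardware kind that is currently available.

    Unknown traits fall back to the CPU, mirroring the CPU's role as the
    general-purpose half of each "CPU + AI acceleration chip" pairing.
    """
    for kind in PREFERENCES.get(task_trait, ["CPU"]):
        if kind in available:
            return kind
    raise LookupError(f"no suitable hardware for {task_trait!r}")


# A node with a CPU and an NPU: training goes to the NPU because no GPU exists.
choice = dispatch("parallel_training", {"CPU", "NPU"})
```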
Heterogeneous AI Computing Power Adaptation Platform
The heterogeneous AI computing power adaptation platform is the core platform connecting upper-layer algorithmic applications with the underlying heterogeneous computing power devices. It drives the heterogeneous hardware and software and provides adaptation services covering the whole AI computing power workflow, so that users can migrate their applications from their original platform to the heterogeneous AI computing power adaptation platform. The adaptation platform comprises four parts: application framework, development kit, driver, and firmware (see Fig. 5).
Fig. 5 Heterogeneous AI computing power adaptation platform architecture
The application framework provides rich programming interfaces and operation methods. It adapts the programming frameworks of algorithmic models, abstracts the semantics of algorithmic computation, adapts to different application scenarios, and hides the implementation details of heterogeneous acceleration logic, so that programming frameworks can work with the differing heterogeneous computing power offered by various vendors. The development kit defines a set of heterogeneous programming models under computational-graph semantics. It is the key software layer for accelerating computational loads on their way from frameworks to hardware, and it simplifies, unifies, and optimises heterogeneous accelerated programming. The driver module adapts heterogeneous hardware so that it can interact with the operating system and the runtime environment. The firmware is adapted to the hardware support platform to provide security functions such as security verification, access isolation, and hardware status alarms, and can also directly act as other heterogeneous acceleration devices.
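How an application framework can hide vendor-specific acceleration details is sketched below. The `Backend` interface, the registry, and the `fake-npu` device are hypothetical names invented for this example; they do not describe the platform's actual API.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Hypothetical vendor-neutral interface a driver or dev-kit layer implements."""

    name: str

    @abstractmethod
    def matmul(self, a, b):
        """Multiply two matrices given as lists of rows."""


class CpuBackend(Backend):
    name = "cpu"

    def matmul(self, a, b):
        # Plain-Python reference implementation.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
                for row in a]


class FakeAcceleratorBackend(Backend):
    name = "fake-npu"

    def __init__(self):
        self.calls = 0

    def matmul(self, a, b):
        self.calls += 1
        # Stand-in for a vendor kernel; reuses the CPU path in this sketch.
        return CpuBackend().matmul(a, b)


_REGISTRY = {}


def register(backend):
    """Make a backend selectable by its name."""
    _REGISTRY[backend.name] = backend


def run_matmul(a, b, device="cpu"):
    """Application-facing call: the caller names a device, not a vendor API."""
    return _REGISTRY[device].matmul(a, b)


register(CpuBackend())
npu = FakeAcceleratorBackend()
register(npu)
result = run_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], device="fake-npu")
```

Application code calls `run_matmul` the same way whichever backend is selected, which is the kind of decoupling the application framework and development kit are described as providing.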
Heterogeneous AI Computing Power Scheduling Platform
The heterogeneous AI computing power scheduling platform schedules heterogeneous computing power flexibly among computing nodes, delivers high performance and high scalability, and forms a standardised, systematic design scheme. It supports AI model development, deployment, and run-time inference. Following the concept of software-hardware integration, it slices and schedules AI computing power at fine granularity, accelerates model iteration, empowers AI training, and improves compatibility with and adaptability to various types of AI models.
The heterogeneous AI computing power scheduling platform consists of three modules: full-stack training, resource management, and monitoring and alerting. The full-stack training module provides a full-stack service for AI computing power scheduling, from design and training through to online operation, and ensures that the whole training process can be inspected and analysed through visualisation tools. The resource management module provides operation and management strategies for multi-tenant resources, IT resources, servers, and scheduling resources, as well as report management, log management, fault management, and other services for the resources of the entire scheduling platform. The monitoring and alerting module monitors and manages the scheduling platform globally, covering resource usage, training tasks, server resources, key components, and so on, so as to achieve effective monitoring of, and timely alarms for, data collection and storage and business resources.
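A threshold check of the kind a monitoring and alerting module might run can be sketched as follows. The metric names, the threshold values, and the `Sample` record are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    source: str   # e.g. a server or training task (hypothetical names)
    metric: str   # e.g. "gpu_mem_used", "disk_used"
    value: float  # fraction in [0, 1] for the metrics used here


# Assumed alarm thresholds; metrics without an entry are not checked.
THRESHOLDS = {"gpu_mem_used": 0.95, "disk_used": 0.90}


def check(samples, thresholds=THRESHOLDS):
    """Return one alert message per sample that exceeds its threshold."""
    alerts = []
    for s in samples:
        limit = thresholds.get(s.metric)
        if limit is not None and s.value > limit:
            alerts.append(
                f"{s.source}: {s.metric}={s.value:.2f} exceeds {limit:.2f}"
            )
    return alerts


alerts = check([
    Sample("server-1", "disk_used", 0.93),
    Sample("server-2", "disk_used", 0.50),
    Sample("job-7", "loss", 3.0),  # no threshold configured, so ignored
])
```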
Intelligent Operation Open Platform
The intelligent operation open platform provides integrated software and hardware solutions for the whole industry, establishes an open, shared, and intelligent heterogeneous AI computing power support system and development environment, and delivers intelligent operation, security, reliability, and open sharing of heterogeneous AI computing power. In terms of intelligent operation, it manages physical resources, cluster nodes, and platform data in a unified way, establishes an allocation mechanism and process that match the characteristics of heterogeneous AI computing power resources, and supports the expansion of heterogeneous computing power through strong management, so that the platform can carry various AI model services and scenario applications.
In terms of security protection, it deploys an active defence trusted platform control module, integrates and adapts trusted operating systems and platform kernels, establishes a complete chain of trust throughout the platform management process, creates a trusted computing environment, a security control mechanism and trusted policy management, guards against malicious intrusion and equipment replacement, and enhances the level of platform security and controllability.
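One common way to build a chain of trust is a hash-chained measurement in the style of a TPM "extend" operation, where each component is hashed into a running register before it is trusted. The sketch below illustrates that general technique only; the source does not state which mechanism the platform uses, and the component names are hypothetical.

```python
import hashlib


def extend(register, component):
    """TPM-style extend: new = SHA-256(old || SHA-256(component))."""
    return hashlib.sha256(register + hashlib.sha256(component).digest()).digest()


def measure_chain(components):
    """Fold every component, in load order, into a single 32-byte measurement."""
    register = bytes(32)  # the register starts as all zeros
    for component in components:
        register = extend(register, component)
    return register


def verify(components, expected):
    """Check a measured chain against a known-good ("golden") value."""
    return measure_chain(components) == expected


# Hypothetical component images, measured in load order.
golden = measure_chain([b"firmware-v1", b"kernel-v1", b"platform-v1"])
ok = verify([b"firmware-v1", b"kernel-v1", b"platform-v1"], golden)
tampered = verify([b"firmware-v1", b"kernel-EVIL", b"platform-v1"], golden)
```

Because each step hashes the previous register value, replacing or reordering any component changes the final measurement, which is how tampering and equipment replacement can be detected.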
In terms of open sharing, the intelligent operation open platform is oriented to the development needs of the industry: it carries out technology research and development, turns results into deployed products, and builds an ecosystem community of developers. At the same time, it provides users with shared content such as resource libraries, development tool libraries, and solution libraries, and accelerates the integration and adoption of the heterogeneous AI computing power operation platform across industries and domains.
Heterogeneous computing power unified scheduling mechanism
To integrate diversified AI chips and computing power resources, the heterogeneous AI computing power operation platform needs to fuse diversified heterogeneous computing power, further strengthen the technical advantages of the converged architecture, and achieve unified scheduling and efficient allocation of diversified heterogeneous AI computing power. First, it improves the performance of the fusion technology to deepen software-hardware synergy. Through converged-architecture technologies such as the new ultra-high-speed internal and external interconnect, pooling and fusion, and reconfiguration, it promotes high-speed interconnection of multiple heterogeneous computing power facilities to form an efficient pooled computing centre. Through software definition, it manages the reconfigured hardware resource pools intelligently, which significantly improves software and hardware performance and supports flexible scheduling of business resources as well as intelligent operation, maintenance, monitoring, and management. Second, it schedules diversified heterogeneous computing power in a unified way, so that heterogeneous computing power resources are scheduled flexibly and allocated efficiently and the demands of various types of AI applications are answered promptly. Based on differences in application scenarios, interface configurations, and load capacity, it establishes a unified scheduling architecture between diversified heterogeneous computing power resources and upper-layer multi-scenario demands, with unified real-time resource sensing, abstract resource response, and application scheduling.
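The three elements of the unified scheduling architecture (resource sensing, abstract resources, and application scheduling) can be sketched in a few lines. The `Node` and `Request` records, the abstract compute units, and the best-fit policy are assumptions of this sketch, not the platform's documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    kind: str  # accelerator kind, e.g. "GPU" or "NPU"
    free: int  # free abstract compute units, as last reported


@dataclass
class Request:
    app: str
    units: int      # abstract compute units needed
    accepts: tuple  # accelerator kinds the application can run on


class UnifiedScheduler:
    def __init__(self):
        self.nodes = {}

    def report(self, node):
        """Resource sensing: a node pushes its latest free capacity."""
        self.nodes[node.name] = node

    def schedule(self, req):
        """Best fit: pick the acceptable node with the least leftover capacity.

        Returns the chosen node name, or None if no node can serve the request.
        """
        candidates = [n for n in self.nodes.values()
                      if n.kind in req.accepts and n.free >= req.units]
        if not candidates:
            return None
        best = min(candidates, key=lambda n: (n.free - req.units, n.name))
        best.free -= req.units
        return best.name


sched = UnifiedScheduler()
for node in (Node("n1", "GPU", 8), Node("n2", "NPU", 4), Node("n3", "GPU", 3)):
    sched.report(node)

placed = sched.schedule(Request("train", 3, ("GPU", "NPU")))
```

Best fit is used here because it leaves larger nodes free for larger requests; other policies (first fit, load spreading) would slot into `schedule` the same way.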
Deployment of intelligent computing power virtual resource pools
Forming a software-defined AI computing power virtual resource pool through virtualisation can strengthen the operation capability of the heterogeneous AI computing power operation platform and optimise the application architecture. First, it strengthens fine-grained slicing of computing resources. Slicing the computing resources in the virtual resource pool at fine granularity according to application requirements and business characteristics can make fuller use of computing power, improve resource utilisation, reduce computing costs, and avoid the tedious work of equipment selection and adaptation in large-scale computing clusters. Second, virtualisation is configured for the chip architecture of each heterogeneous computing server. The virtualisation technology needs to be configured according to the server's own chip architecture in order to guarantee the pooling of heterogeneous computing resources. Heterogeneous computing power servers, storage, network, and so on can be built into a virtual resource pool; the computing power resources required by upper-layer applications are obtained from the pool through an API, and the virtual resource pool is mapped to the physical resource pool.
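Fine-grained slicing and the virtual-to-physical mapping can be illustrated with a small model. The `VirtualPool` class, the equal-slice scheme, and the `vdev-N` identifiers are hypothetical; real accelerator partitioning depends on the chip architecture, as the text notes.

```python
import itertools


class VirtualPool:
    """Maps virtual slices onto physical accelerators (hypothetical model)."""

    def __init__(self, physical, slices_per_device=4):
        # Each physical device is split into equal slices, e.g. quarter-devices.
        self.free = {dev: slices_per_device for dev in physical}
        self.mapping = {}  # virtual id -> (physical device, slice count)
        self._ids = itertools.count(1)

    def acquire(self, slices):
        """Allocate `slices` on a single device; return a virtual id or None."""
        for dev, free in self.free.items():
            if free >= slices:
                self.free[dev] -= slices
                vid = f"vdev-{next(self._ids)}"
                self.mapping[vid] = (dev, slices)
                return vid
        return None

    def release(self, vid):
        """Return a virtual device's slices to its physical device."""
        dev, slices = self.mapping.pop(vid)
        self.free[dev] += slices


pool = VirtualPool(["gpu-0", "gpu-1"])
small = pool.acquire(1)  # a quarter of gpu-0
large = pool.acquire(4)  # needs a whole device, so it lands on gpu-1
```

The application only ever sees the virtual id; `pool.mapping` holds the mapping from the virtual resource pool to the physical one.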
Converged applications of the heterogeneous operation platform
The heterogeneous AI computing power operation platform should be deeply integrated with the intelligent transformation and upgrading of industry, providing high-performance, highly reliable computing power support for diversified AI application scenarios and broadening the platform's application scope and capability. For smart cities, it opens up multi-algorithm fusion scheduling, standardised big data processing, and multi-scenario application service capabilities. For smart parks, it adopts intelligent video management solutions and provides intelligent equipment management, AI analysis, and related services to create an integrated solution. For smart government, it provides video image model import, algorithm warehouse model import, intelligent analysis template orchestration, and other capabilities to respond quickly to various application needs. For industry, it enables intelligent applications such as predictive maintenance of production equipment, AI-driven high-precision mechanical equipment, and industrial AR production assistance. It also empowers scientific research collaboration and project innovation management with powerful intelligent computing power, and empowers intelligent finance and business innovation with high-speed, high-precision, large-volume data processing capabilities.
Forming a full-scenario cooperation matrix
An open ecosystem is an effective way to achieve the fusion of diversified computing power. Building a matrixed cooperation model can promote technology fusion and innovation, scenario fusion and application, and service fusion and delivery, and can improve the ecosystem framework for building and developing heterogeneous AI computing power operation platforms. On the one hand, it is necessary to build an integrated solution that covers the whole chain and addresses all scenarios, continuously promote multi-party cooperation, and establish an ecosystem architecture spanning hardware, algorithms, AI middle platforms, and industry applications. On the other hand, it is necessary to establish an open, open-source ecosystem, form a cooperative, win-win organisational alliance, change the production mode and the application service mode, and continuously improve the technical capabilities and construction level of the heterogeneous AI computing power operation platform. Open-source organisations such as the ODCC (Open Data Centre Committee) should give full play to their platform advantages, open up basic software and hardware, integrate capabilities, and incubate more multi-dimensional, composite intelligent scenario solutions.
Technical standards for heterogeneous computing power scheduling
Heterogeneous computing power scheduling capabilities should be standardised: unify the API standards and the runtime computing power base for heterogeneous hardware used in deep learning, standardise how deep learning computing tasks are defined and executed, and decouple upper-layer applications from the underlying heterogeneous hardware platforms. This requires coordination with manufacturers across the industry chain, focusing on unified management of heterogeneous devices and hardware, docking and adaptation of system-layer drivers, alignment of acceleration libraries at the model and operator layers, high-performance migration and optimisation of algorithm-layer frameworks, and independent development of the platform-layer scheduler, so as to ensure effective management and scheduling of heterogeneous computing power.
Unified Hardware Algorithm Adaptation Evaluation Methodology
Heterogeneous computing power adaptation standards are formulated from two perspectives, hardware adaptation and algorithm unification, to achieve interoperability and maximise performance across heterogeneous computing power. In terms of hardware adaptation: standardise heterogeneous chips and their underlying interfaces, and form standardised testing methods covering heterogeneous chip function, performance, stability, compatibility, and so on; standardise the technical requirements and performance specifications of heterogeneous AI servers, and determine their requirements in terms of design specifications, management strategies, and operating environments, so as to promote standardised server research and development, production, and testing. In terms of algorithm unification: standardise different types of models, such as supervised learning, unsupervised learning, and reinforcement learning, for distributed AI deep learning frameworks; standardise the evaluation indexes of heterogeneous AI algorithmic models in terms of convergence time, convergence accuracy, throughput, latency, and so on; and standardise the deployment requirements of heterogeneous AI algorithmic models in various application scenarios, the adaptation requirements in the design and development of distributed training platforms, and the methods for evaluating the energy efficiency of algorithmic model inference.
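The evaluation indexes named above can be made concrete with simple definitions. The formulas below are one plausible reading (time to reach a target accuracy, samples per second, nearest-rank latency percentile, inferences per joule); the source does not define them, and all sample numbers are invented for illustration.

```python
def convergence_time(history, target_accuracy):
    """First elapsed time at which accuracy reaches the target, else None.

    `history` is a list of (elapsed_seconds, accuracy) pairs in time order.
    """
    for elapsed, accuracy in history:
        if accuracy >= target_accuracy:
            return elapsed
    return None


def throughput(samples_processed, seconds):
    """Samples processed per second."""
    return samples_processed / seconds


def latency_percentile(latencies_ms, pct):
    """Nearest-rank percentile of a non-empty list; `pct` is an integer 1..100."""
    ordered = sorted(latencies_ms)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def energy_efficiency(inferences, joules):
    """Inferences completed per joule of energy consumed."""
    return inferences / joules


# Invented measurements for one model on one accelerator.
history = [(60, 0.71), (120, 0.88), (180, 0.93), (240, 0.95)]
latencies = [12, 15, 11, 40, 13, 14, 16, 12, 13, 90]

report = {
    "convergence_time_s": convergence_time(history, 0.90),
    "throughput_sps": throughput(12800, 4.0),
    "p50_latency_ms": latency_percentile(latencies, 50),
    "p90_latency_ms": latency_percentile(latencies, 90),
    "inferences_per_joule": energy_efficiency(5000, 250.0),
}
```

Computing the same report for each chip and model combination gives a like-for-like basis for the comparison that a unified evaluation methodology calls for.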