NPU HW/SW co-design

About

As deep learning spreads into more application areas, hardware that can accelerate these workloads efficiently has become increasingly important, and the development of Neural Processing Units (NPUs) for energy-efficient AI acceleration has gained momentum. However, the traditional design flow, in which software development starts only after the hardware is finished, has serious drawbacks. If design flaws or poor performance are discovered at that late stage, the entire design process must start over, and software development becomes complex and time-consuming.

To address these issues, our lab proposes an integrated hardware/software co-design methodology for NPUs. The key idea is to develop the hardware and the software concurrently, which shortens the design iteration cycle. Through virtual prototyping, we can predict software performance without the actual hardware. This lets compiler developers estimate the performance of their code early, and it also makes it easy to explore a variety of NPU architectures with a hardware simulator implemented in software.
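To illustrate the kind of estimate virtual prototyping makes possible, here is a minimal sketch that predicts the cycle count of one convolution layer from a few architecture parameters, with no hardware involved. It is a simple roofline-style model that assumes computation and DRAM transfers overlap perfectly. The class and parameter names (`NpuConfig`, `num_macs`, `dram_bytes_per_cycle`, `dram_latency_cycles`) and the example values are hypothetical; they are not taken from MIDAP or from our simulator.

```python
from dataclasses import dataclass


@dataclass
class NpuConfig:
    # Hypothetical architecture parameters (illustrative, not MIDAP's real values).
    num_macs: int                # MAC units that operate in parallel each cycle
    dram_bytes_per_cycle: float  # sustained off-chip bandwidth
    dram_latency_cycles: int     # fixed start-up latency charged once per layer


@dataclass
class ConvLayer:
    in_h: int
    in_w: int
    in_c: int
    out_c: int
    k: int                   # square kernel size
    stride: int = 1
    bytes_per_elem: int = 1  # e.g. 8-bit quantized activations and weights

    def out_hw(self) -> tuple[int, int]:
        # "Same" padding assumed: output size is ceil(input / stride).
        return (-(-self.in_h // self.stride), -(-self.in_w // self.stride))

    def macs(self) -> int:
        oh, ow = self.out_hw()
        return oh * ow * self.out_c * self.in_c * self.k * self.k

    def dram_bytes(self) -> int:
        # Assumes each tensor crosses the DRAM interface exactly once.
        oh, ow = self.out_hw()
        ifmap = self.in_h * self.in_w * self.in_c
        weights = self.k * self.k * self.in_c * self.out_c
        ofmap = oh * ow * self.out_c
        return (ifmap + weights + ofmap) * self.bytes_per_elem


def estimate_cycles(layer: ConvLayer, npu: NpuConfig) -> float:
    compute = -(-layer.macs() // npu.num_macs)  # ceil(MACs / MAC units)
    memory = npu.dram_latency_cycles + layer.dram_bytes() / npu.dram_bytes_per_cycle
    # Perfect overlap of compute and DMA (double buffering): the slower side dominates.
    return max(compute, memory)


npu = NpuConfig(num_macs=1024, dram_bytes_per_cycle=16.0, dram_latency_cycles=100)
layer = ConvLayer(in_h=56, in_w=56, in_c=64, out_c=64, k=3)
print(estimate_cycles(layer, npu))  # compute-bound: 112896 cycles
```

Changing `num_macs` or `dram_bytes_per_cycle` and re-running the estimate is a cheap way to see whether a layer is compute-bound or memory-bound on a candidate architecture; a real virtual prototype would model DRAM behavior far more accurately than this single bandwidth term.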

Using this methodology, we designed MIDAP, an energy-efficient CNN accelerator. MIDAP has a fully pipelined structure and integrates large SRAM banks to minimize resource contention and DRAM accesses. It also supports a variety of non-convolutional operations, and it uses standardized data formats that simplify software development, with the goal of maximizing end-to-end performance and MAC utilization.
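The benefit of pipelining layers while keeping intermediate data in on-chip SRAM can be seen with an idealized cycle model. This sketch compares running three layers one after another, spilling each intermediate feature map to DRAM, against running them concurrently as a pipeline over tiles. The stage costs, tile count, and spill cost are hypothetical numbers, and the model (fixed per-tile cost, unbounded buffering between stages) is a textbook pipeline approximation, not MIDAP's actual scheduling.

```python
def layer_by_layer_cycles(stage_cycles: list[int], num_tiles: int, spill_cycles: int) -> int:
    # Each layer finishes all tiles before the next layer starts, and every
    # intermediate feature map is written to DRAM and read back (spill_cycles
    # per layer boundary).
    boundaries = len(stage_cycles) - 1
    return num_tiles * sum(stage_cycles) + boundaries * spill_cycles


def pipelined_cycles(stage_cycles: list[int], num_tiles: int) -> int:
    # All layers run concurrently on successive tiles, and intermediate tiles
    # stay in on-chip SRAM. The first tile passes through every stage (pipeline
    # fill); after that, the slowest stage sets the rate at which tiles finish.
    return sum(stage_cycles) + (num_tiles - 1) * max(stage_cycles)


stages = [100, 120, 80]  # hypothetical per-tile cycles of three consecutive layers
print(layer_by_layer_cycles(stages, num_tiles=64, spill_cycles=2000))  # 23200
print(pipelined_cycles(stages, num_tiles=64))                          # 7860
```

In this model, pipelined throughput is limited by the slowest layer, so balancing work across layers to keep every MAC unit busy matters as much as removing the DRAM round trips.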

So far, we have carried out system-level simulation, software development through virtual prototyping, NPU compiler development, datapath and control structure design, RTL hardware design, and FPGA verification, and we are now in the chip fabrication stage. Thanks to the hardware/software co-design methodology, the simulator, the compiler, and the datapath design were all developed by a team of just two people in two years, which shows how effectively the methodology streamlines NPU development.

Publication

Seongwoo Choi, Hyunsu Moh, Changjae Yi, Joon Choi, Soonhoi Ha, "Enabling Decoder-only Language Model Inference on a CNN Accelerator", International Conference on Computer-Aided Design (ICCAD), Nov, 2025.

Jimin Lee, Soonhoi Ha, "Empowering Edge Devices With Processing-In-Memory for On-Device Language Inference", IEEE Embedded Systems Letters, Vol. 17, pp. 244-247, Aug, 2025.

Changjae Yi, Hyunsu Moh, Soonhoi Ha, "Vision Transformer Inference on a CNN Accelerator", IEEE 42nd International Conference on Computer Design (ICCD), Jan, 2025.

허정원, 김영진, 이지섭, 하순회, "3D Object Detection Using the N-Dolphin Embedded NPU", Korea Computer Congress 2024 (KCC2024), Jun, 2024.

Choonghoon Park, Soonhoi Ha, "A Novel Throughput Enhancement Method for Deep Learning Applications on Mobile Devices With Heterogeneous Processors", IEEE Access, Vol. 12, pp. 38773-38785, Mar, 2024.

Choonghoon Park, Hyunsu Moh, Jimin Lee, Changjae Yi, Soonhoi Ha, "Fast and Accurate Virtual Prototyping of an NPU with Analytical Memory Modeling", 34th International Workshop on Rapid System Prototyping (RSP 23), pp. 1-7, Sep, 2023.

Changjae Yi, Donghyun Kang, Soonhoi Ha, "Hardware-Software Codesign of a CNN Accelerator", 2022 25th Euromicro Conference on Digital System Design (DSD), pp. 348-356, Aug, 2022.

Duseok Kang, Donghyun Kang, Soonhoi Ha, "Multi-Bank On-chip Memory Management Techniques for CNN Accelerators", IEEE Transactions on Computers, Vol. 71, No. 5, May, 2022.

Keonjoo Lee, Donghyun Kang, Duseok Kang, Soonhoi Ha, "Analysis of the Effect of Off-chip Memory Access on the Performance of an NPU System", International Symposium on Quality Electronic Design (ISQED'22), Apr, 2022.

Donghyun Kang, Soonhoi Ha, "Datapath Extension of NPUs to Support Non-convolutional Layers Efficiently", IEEE Design & Test, Vol. 32, Issue 5, Mar, 2022.

Keonjoo Lee, Donghyun Kang, Duseok Kang, Soonhoi Ha, "A Memory Hierarchy to Reduce the Impact of DRAM Access Latency on NPU Performance", Korea Software Congress 2021 (KSC2021), Dec, 2021.

Duseok Kang, "Hardware-Aware Software Optimization Techniques for Convolutional Neural Networks on Embedded Systems", Seoul National University, Feb, 2021.

Donghyun Kang, Soonhoi Ha, "Tensor Virtualization Technique to Support Efficient Data Reorganization for CNN Accelerators", Design Automation Conference, Jul, 2020.

Donghyun Kang, Jintaek Kang, Hyungdal Kwon, Hyunsik Park, Soonhoi Ha, "A Novel CNN(Convolutional Neural Network) Accelerator That Enables Fully-pipelined Execution of Layers", 37th IEEE International Conference on Computer Design, Nov, 2019.
