MCUNetV1 / MCUNetV2 / MCUNetV3

Billions of IoT devices around the world are based on microcontrollers (MCUs): they are low-cost ($1-2), low-power, small, and almost everywhere in our lives. However, deploying and training AI models on MCUs is extremely hard: there is no DRAM, no OS, and a strict memory constraint (less than 256KB). Existing work optimizes for the number of parameters, but the activation size is the real memory bottleneck. We introduce MCUNet, the first model to achieve >70% ImageNet top-1 accuracy on a microcontroller. Our MCUNetV2 further improves the efficiency by up to 4x and raises ImageNet accuracy by 4.6%.

Beyond deployment, AI systems also need to adapt to new sensory data for customization and continual learning. Cloud-based learning raises privacy issues and incurs high cost, while direct on-device training is much more expensive than inference because of back-propagation, so it is extremely hard to fit on IoT devices (for example, an MCU with only 256KB of SRAM). We propose an algorithm-system co-design framework to make on-device training possible with 1000x memory reduction, thus enabling IoT devices to not only perform inference but also continuously learn from new data.


MCUNet: Tiny Deep Learning on IoT Devices

Ji Lin 1 , Wei-Ming Chen 1,2 , Yujun Lin 1 , John Cohn 3 , Chuang Gan 3 , Song Han 1
Massachusetts Institute of Technology, National Taiwan University, MIT-IBM Watson AI Lab

    title={MCUNet: Tiny Deep Learning on IoT Devices},
    author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Cohn, John and Gan, Chuang and Han, Song},
    booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},


Machine learning on tiny IoT devices based on microcontroller units (MCUs) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than that of mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e. device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library, to expand the search space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 3.4x and accelerating the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.5x less SRAM and 5.7x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 3.7-4.1x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.

Challenge: Memory Too Small to Hold DNNs

Existing Methods Reduce Model Size, but not the Activation Size

MCUNet: System-Algorithm Co-design

1. TinyNAS: Two-Stage NAS for Tiny Memory

2. TinyEngine: Memory-Efficient Inference Library

Experimental Results


Acknowledgments: We thank the MIT Satori cluster for providing the computation resource. We thank MIT-IBM Watson AI Lab, Qualcomm, NSF CAREER Award #1943349 and NSF RAPID Award #2027266 for supporting this research.

MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning

Ji Lin 1 , Wei-Ming Chen 1 , Han Cai 1 , Chuang Gan 2 , Song Han 1
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab


Use Cases


    title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
    author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
    booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},


Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32KB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

Problem: Imbalanced Memory Distribution of CNNs

The memory distribution of CNNs is usually highly imbalanced, with the first several layers dominating the memory usage.
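To make the imbalance concrete, here is a minimal sketch that estimates the per-block activation memory of a small downsampling CNN. The block configuration and the helper name activation_profile are hypothetical and are not taken from MCUNetV2. The estimate assumes int8 activations (one byte per element) and that each block only needs its input and output buffers alive at the same time, which ignores residual connections and any workspace an inference engine may add.

```python
def activation_profile(input_hw, in_channels, blocks):
    """Estimate int8 activation bytes (input + output buffer) for each block.

    blocks is a list of (out_channels, stride) tuples. This is a simplified
    model: one byte per element, no residual branches, no extra workspace.
    """
    h = w = input_hw
    c = in_channels
    profile = []
    for out_channels, stride in blocks:
        out_h, out_w = h // stride, w // stride
        in_bytes = h * w * c
        out_bytes = out_h * out_w * out_channels
        profile.append(in_bytes + out_bytes)
        h, w, c = out_h, out_w, out_channels
    return profile


# Hypothetical 5-block network on a 128x128 RGB input (illustration only).
blocks = [(16, 2), (24, 2), (32, 2), (64, 2), (128, 2)]
profile = activation_profile(128, 3, blocks)

for index, nbytes in enumerate(profile):
    print(f"block {index}: {nbytes / 1024:.1f} KB")
print(f"first/last ratio: {profile[0] / profile[-1]:.1f}x")
```

In this toy configuration the first block needs more than ten times the activation memory of the last one, even though the later blocks have more channels, because the spatial resolution shrinks faster than the channel count grows.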

1. Save Memory with Patch-based Inference

We can dramatically reduce the inference peak memory by using patch-based inference for the memory-intensive stage of CNNs.

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x.
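The idea can be illustrated with a toy example. The sketch below is not the MCUNetV2 implementation: it assumes a single-channel input, two stacked 3x3 convolutions without padding, and hypothetical helper names (conv3x3_valid, patch_based). It computes the output tile by tile, so only a small intermediate feature map exists at any time, and checks that the result matches layer-by-layer execution.

```python
def conv3x3_valid(x, k):
    """3x3 convolution without padding on a 2D list of numbers."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(x[i + di][j + dj] * k[di][dj] for di in range(3) for dj in range(3))
            for j in range(w - 2)
        ]
        for i in range(h - 2)
    ]


def per_layer(x, k1, k2):
    """Layer-by-layer execution: the whole intermediate map is materialized."""
    return conv3x3_valid(conv3x3_valid(x, k1), k2)


def patch_based(x, k1, k2, tile):
    """Compute the same output one tile at a time.

    Each output tile of size tile x tile needs an input patch with a
    4-pixel halo in total (2 pixels per 3x3 layer). Returns the output and
    the largest intermediate map (in elements) that was alive at once.
    """
    out_h, out_w = len(x) - 4, len(x[0]) - 4
    out = [[0] * out_w for _ in range(out_h)]
    peak = 0
    for r in range(0, out_h, tile):
        for c in range(0, out_w, tile):
            th, tw = min(tile, out_h - r), min(tile, out_w - c)
            patch = [row[c:c + tw + 4] for row in x[r:r + th + 4]]
            mid = conv3x3_valid(patch, k1)
            peak = max(peak, len(mid) * len(mid[0]))
            res = conv3x3_valid(mid, k2)
            for i in range(th):
                out[r + i][c:c + tw] = res[i]
    return out, peak


# Toy 12x12 integer input and two integer kernels (illustration only).
x = [[(3 * i + 5 * j) % 7 for j in range(12)] for i in range(12)]
k1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
k2 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

full = per_layer(x, k1, k2)
patched, peak_elems = patch_based(x, k1, k2, tile=4)
full_intermediate_elems = (len(x) - 2) * (len(x[0]) - 2)
```

With a 4x4 output tile, the largest intermediate map held in memory is 6x6 instead of 10x10, at the cost of recomputing the overlapping halo regions for neighboring tiles.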

2. Receptive Field Redistribution to Reduce Computation Overhead

Patch-based inference leads to computation overhead because neighboring patches overlap with each other. To reduce the overlap, we propose to redistribute the receptive field (RF) by reducing the RF of the per-patch stage and increasing the RF of the later per-layer stage.

After redistribution, the computation overhead of MobileNetV2 with patch-based inference drops from 10% to 3%, while the performance remains the same.
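The link between receptive field and overlap can be sketched with the standard receptive-field recurrence. The layer stacks and helper names below are hypothetical and do not reproduce the MobileNetV2 numbers above. The overhead measure is also a simplification: it only compares the input area read per patch with the area a patch would own without any overlap, and it ignores image borders.

```python
def receptive_field(layers):
    """Return (receptive field, total stride) for a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def patch_input_size(tile, rf, jump):
    """Input pixels per side needed to produce a tile x tile output patch."""
    return (tile - 1) * jump + rf


def input_overhead(tile, rf, jump):
    """Ratio of input area read per patch to the area it would own without overlap."""
    needed = patch_input_size(tile, rf, jump)
    exclusive = tile * jump
    return (needed / exclusive) ** 2


# Hypothetical per-patch stages: (kernel size, stride) per layer.
original = [(3, 2), (3, 1), (3, 2), (3, 1)]
redistributed = [(3, 2), (1, 1), (3, 2), (1, 1)]  # 3x3 layers swapped for 1x1

rf_a, jump_a = receptive_field(original)
rf_b, jump_b = receptive_field(redistributed)

print("original:     ", rf_a, patch_input_size(8, rf_a, jump_a))
print("redistributed:", rf_b, patch_input_size(8, rf_b, jump_b))
print("overhead:", input_overhead(8, rf_a, jump_a), input_overhead(8, rf_b, jump_b))
```

A smaller receptive field in the per-patch stage shrinks the halo each patch has to read, which is why moving large kernels to the later per-layer stage reduces the repeated computation.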

3. Joint Automated Optimization of Neural Architecture and Inference Scheduling

Redistributing RF requires manual tuning. We employ neural architecture search techniques to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.

Experimental Results

  • ImageNet (higher accuracy at the same memory budgets)
  • VWW (smaller memory usage, higher accuracy)
  • WIDER Face (better performance at the same memory budgets)

On-Device Training Under 256KB Memory

Ji Lin *1 , Ligeng Zhu *1 , Wei-Ming Chen 1 , Wei-Chen Wang 1 , Chuang Gan 2 , Song Han 1
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab
(* indicates equal contributions)


On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to mixed bit-precision and the lack of normalization; (2) the limited hardware resources (memory and computation) do not allow full backward computation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads the runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB SRAM), using less than 1/1000 of the memory of existing frameworks while matching the accuracy of cloud training+edge deployment for the tinyML application VWW. Our study enables IoT devices to not only perform inference but also continuously adapt to new data for on-device lifelong learning.
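Why skipping gradient computation saves memory can be sketched with a simple accounting model. For a convolution or linear layer, the weight gradient is computed from the layer's input activation and the output gradient, so the input must be kept until the backward pass, while the bias gradient only needs the output gradient. The sketch below is not the Tiny Training Engine: the layer sizes, the update scheme and the helper names are hypothetical, and the model ignores sub-tensor updates, masks needed to back-propagate through nonlinearities, and gradient or optimizer buffers.

```python
def saved_activation_bytes(input_act_bytes, scheme):
    """Bytes of input activations that must be kept for the backward pass.

    scheme[i] is "none", "bias" or "weight" for layer i. Only layers whose
    weights are updated need their input activation saved (simplified model).
    """
    return sum(
        nbytes for nbytes, mode in zip(input_act_bytes, scheme) if mode == "weight"
    )


def first_updated_layer(scheme):
    """Index of the earliest layer with any update; backward can stop there."""
    for index, mode in enumerate(scheme):
        if mode != "none":
            return index
    return None


# Hypothetical int8 input-activation sizes (bytes) for a 5-layer network.
input_act_bytes = [49152, 65536, 24576, 8192, 4096]

full_update = ["weight"] * 5
sparse_update = ["none", "none", "bias", "weight", "weight"]

full_bytes = saved_activation_bytes(input_act_bytes, full_update)
sparse_bytes = saved_activation_bytes(input_act_bytes, sparse_update)
stop_at = first_updated_layer(sparse_update)

print(f"full update:   {full_bytes / 1024:.0f} KB of saved activations")
print(f"sparse update: {sparse_bytes / 1024:.0f} KB of saved activations")
print(f"backward pass can stop at layer {stop_at}")
```

Because the early layers hold the largest activations, freezing them or updating only their biases removes most of the saved-activation memory in this toy model, and the backward graph before the first updated layer can be pruned entirely.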

Figure 1: Algorithm and system co-design reduces the training memory from 303MB (PyTorch) to 149KB with the same transfer learning accuracy, leading to a 2300x reduction. The numbers are measured with MobileNetV2-w0.35, batch size 1, and resolution 128x128. The model can be deployed to a microcontroller with 256KB SRAM.

Figure 2: Measured peak memory and latency: (a) Sparse update with our graph optimization reduces the measured peak memory by 20-21x. (b) Graph optimization consistently reduces the peak memory. (c) Sparse update with our operators achieves 23-25x faster training speed. For all numbers, we choose the config that achieves the same accuracy as full update.



    title     = {On-Device Training Under 256KB Memory},
    author    = {Lin, Ji and Zhu, Ligeng and Chen, Wei-Ming and Wang, Wei-Chen and Gan, Chuang and Han, Song},
    booktitle = {Annual Conference on Neural Information Processing Systems (NeurIPS)},
    year      = {2022}

Acknowledgments: We thank the National Science Foundation (NSF), MIT-IBM Watson AI Lab, MIT AI Hardware Program, Amazon, Intel, Qualcomm, Ford, and Google for supporting this research.