MCUNetV2: Memory-Efficient Patch-based Inference
for Tiny Deep Learning

Ji Lin 1 , Wei-Ming Chen 1 , Han Cai 1 , Chuang Gan 2 , Song Han 1
Massachusetts Institute of Technology, MIT-IBM Watson AI Lab





  title={MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning},
  author={Lin, Ji and Chen, Wei-Ming and Cai, Han and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},


Tiny deep learning on microcontroller units (MCUs) is challenging due to the limited memory size. We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs: the first several blocks have an order of magnitude larger memory usage than the rest of the network. To alleviate this issue, we propose a generic patch-by-patch inference scheduling, which operates only on a small spatial region of the feature map and significantly cuts down the peak memory. However, naive implementation brings overlapping patches and computation overhead. We further propose network redistribution to shift the receptive field and FLOPs to the later stage and reduce the computation overhead. Manually redistributing the receptive field is difficult. We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2. Patch-based inference effectively reduces the peak memory usage of existing networks by 4-8x. Co-designed with neural networks, MCUNetV2 sets a record ImageNet accuracy on MCU (71.8%), and achieves >90% accuracy on the visual wake words dataset under only 32kB SRAM. MCUNetV2 also unblocks object detection on tiny devices, achieving 16.9% higher mAP on Pascal VOC compared to the state-of-the-art result. Our study largely addressed the memory bottleneck in tinyML and paved the way for various vision applications beyond image classification.

Problem: Imbalanced Memory Distribution of CNNs

The memory distribution of CNNs is usually highly imbalanced, with the first several layers dominating the memory usage.
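To make the imbalance concrete, the sketch below estimates per-stage activation memory for a small CNN. The stage list (resolutions, channel counts, strides) is a hypothetical MobileNetV2-like configuration chosen for illustration, not the network measured here, and the estimate counts only each stage's int8 input and output feature maps, ignoring any expanded intermediate tensors.

```python
# Hypothetical stages: (name, input resolution, input channels, output channels, stride).
# These numbers are illustrative assumptions, not taken from the original page.
STAGES = [
    ("stage1", 224, 3, 16, 2),
    ("stage2", 112, 16, 24, 2),
    ("stage3", 56, 24, 32, 2),
    ("stage4", 28, 32, 64, 2),
    ("stage5", 14, 64, 160, 2),
]


def activation_bytes(resolution, channels, bytes_per_value=1):
    """Size of a square feature map, assuming int8 activations by default."""
    return resolution * resolution * channels * bytes_per_value


def stage_memory(stages):
    """Approximate memory per stage as input activation plus output activation."""
    usage = {}
    for name, res, c_in, c_out, stride in stages:
        out_res = res // stride
        usage[name] = activation_bytes(res, c_in) + activation_bytes(out_res, c_out)
    return usage


usage = stage_memory(STAGES)
for name, size in usage.items():
    print(f"{name}: {size / 1024:.1f} kB")
```

Under these assumptions the first stage needs roughly 17 times the activation memory of the last one, which matches the qualitative pattern described above: the early, high-resolution stages set the peak.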

1. Save Memory with Patch-based Inference

We can dramatically reduce the peak inference memory by using patch-based inference for the memory-intensive stage of CNNs.

For MobileNetV2, using patch-based inference allows us to reduce the peak memory by 8x.
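The idea can be illustrated with a toy example in plain Python. The sketch runs two unpadded 3x3 convolutions (an all-ones kernel, chosen only for simplicity) once on the whole input and once tile by tile, and checks that the results agree. The function names, the tile size, and the 20x20 input are assumptions made for illustration; this is not the scheduling used by MCUNetV2 itself.

```python
def conv3x3(x):
    """Valid (unpadded) 3x3 convolution with an all-ones kernel on a 2D list."""
    h, w = len(x), len(x[0])
    return [[sum(x[i + di][j + dj] for di in range(3) for dj in range(3))
             for j in range(w - 2)] for i in range(h - 2)]


def two_layers(x):
    """The 'memory-intensive stage' of this toy network: two stacked convolutions."""
    return conv3x3(conv3x3(x))


def patch_inference(x, tile):
    """Compute two_layers(x) one output tile at a time.

    Returns the assembled output and the largest input patch (in values)
    that had to be held at once.
    """
    halo = 4  # two valid 3x3 layers shrink each spatial dimension by 4
    out_h, out_w = len(x) - halo, len(x[0]) - halo
    out = [[0] * out_w for _ in range(out_h)]
    peak = 0
    for i0 in range(0, out_h, tile):
        for j0 in range(0, out_w, tile):
            th = min(tile, out_h - i0)
            tw = min(tile, out_w - j0)
            # Each output tile needs its own input region plus the halo.
            patch = [row[j0:j0 + tw + halo] for row in x[i0:i0 + th + halo]]
            peak = max(peak, len(patch) * len(patch[0]))
            result = two_layers(patch)
            for di in range(th):
                out[i0 + di][j0:j0 + tw] = result[di]
    return out, peak


x = [[(3 * i + 5 * j) % 7 for j in range(20)] for i in range(20)]
full = two_layers(x)
patched, peak = patch_inference(x, tile=8)
print(patched == full, peak, len(x) * len(x[0]))
```

With a tile size of 8, each patch covers 12x12 = 144 input values instead of the full 400, and the output is identical. The halo rows and columns are read by more than one patch, which is the source of the overlap discussed next.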

2. Receptive Field Redistribution to Reduce Computation Overhead

Patch-based inference leads to computation overhead because neighboring patches overlap. To reduce the overlap, we propose to redistribute the receptive field (RF) by reducing the RF of the per-patch stage and increasing the RF of the later per-layer stage.

After redistribution, the computation overhead of MobileNetV2 with patch-based inference reduces from 10% to 3%, while the performance remains the same.
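The link between receptive field and overlap can be sketched with the standard receptive-field recurrence for a stack of unpadded convolutions. The two layer lists below are hypothetical per-patch stages invented for illustration, and the overhead is measured as input values read across all patches relative to the full input, which is only a proxy for the FLOPs overhead quoted above.

```python
def receptive_field(layers):
    """Return (receptive field, cumulative stride) for a list of (kernel, stride)."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


def input_overhead(layers, out_size, patches_per_side):
    """Ratio of input values read by all patches to the size of the full input.

    Assumes unpadded convolutions and an output that splits evenly into patches.
    """
    rf, jump = receptive_field(layers)
    tile = out_size // patches_per_side
    patch_side = (tile - 1) * jump + rf      # input side needed for one patch
    input_side = (out_size - 1) * jump + rf  # input side needed for the full output
    return patches_per_side ** 2 * patch_side ** 2 / input_side ** 2


# Hypothetical per-patch stages as (kernel size, stride) pairs.
original = [(3, 2), (3, 1), (3, 2), (3, 1)]
redistributed = [(3, 2), (1, 1), (3, 2), (1, 1)]  # 3x3 layers swapped for 1x1

for name, layers in [("original", original), ("redistributed", redistributed)]:
    rf, _ = receptive_field(layers)
    ratio = input_overhead(layers, out_size=16, patches_per_side=2)
    print(f"{name}: receptive field {rf}, input read ratio {ratio:.2f}")
```

In this toy setting, shrinking the per-patch receptive field from 19 to 7 lowers the input read ratio from about 1.42 to about 1.09. The receptive field removed here would have to be restored in the later per-layer stage to preserve accuracy, as the text describes.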

3. Joint Automated Optimization of Neural Architecture and Inference Scheduling

Redistributing RF requires manual tuning. We employ neural architecture search techniques to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.

Experimental Results

  • ImageNet (higher accuracy at the same memory budgets)
  • VWW (smaller memory usage, higher accuracy)
  • WIDER Face (better performance at the same memory budgets)

MCUNet: Tiny Deep Learning on IoT Devices

Ji Lin 1 , Wei-Ming Chen 1,2 , Yujun Lin 1 , John Cohn 3 , Chuang Gan 3 , Song Han 1
Massachusetts Institute of Technology, National Taiwan University, MIT-IBM Watson AI Lab

  title={MCUNet: Tiny Deep Learning on IoT Devices},
  author={Lin, Ji and Chen, Wei-Ming and Lin, Yujun and Cohn, John and Gan, Chuang and Han, Song},
  booktitle={Annual Conference on Neural Information Processing Systems (NeurIPS)},


Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e. device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library to expand the search space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 3.4x, and accelerating the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.5x less SRAM and 5.7x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 3.7-4.1x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.

Challenge: Memory Too Small to Hold DNNs

Existing Methods Reduce Model Size, but not the Activation Size

MCUNet: System-Algorithm Co-design

1. TinyNAS: Two-Stage NAS for Tiny Memory

2. TinyEngine: Memory-Efficient Inference Library

Experimental Results


Acknowledgments: We thank MIT Satori cluster for providing the computation resource. We thank MIT-IBM Watson AI Lab, Qualcomm, NSF CAREER Award #1943349 and NSF RAPID Award #2027266 for supporting this research.