Understanding Mask R-CNN Basic Architecture

Basic architecture of Mask R-CNN network and the ideas behind it

Nov 14, 2021 by Xiang Zhang

Mask R-CNN is a popular deep learning framework for instance segmentation in computer vision. It adds a fully convolutional network (FCN) branch to Faster R-CNN to generate a mask for each object, whereas R-CNN, Fast R-CNN, and Faster R-CNN only perform bounding-box object detection. Mask R-CNN is composed of these parts: a backbone, a Region Proposal Network (RPN), a Region of Interest alignment layer (RoIAlign), a bounding-box object detection head, and a mask generation head. The first four make up the Faster R-CNN model (with RoIAlign taking the place of the RoIPool layer used in the original Faster R-CNN). The overall structure is illustrated in the following figure.

mask r-cnn network

1. Backbone

The backbone is the main feature extractor of Mask R-CNN. Common choices are residual networks (ResNets), with or without a Feature Pyramid Network (FPN). For simplicity, we take a ResNet without FPN as the backbone. When we feed a raw image into a ResNet backbone, the data passes through multiple residual bottleneck blocks and is turned into a feature map.

mask r-cnn backbone resnet

As the above figure shows, multiple residual bottleneck blocks with different channel configurations (d and d') are stacked to make a deep residual network. In one bottleneck block, the input goes through two paths. One is a stack of convolutional layers, and the other is an identity shortcut connection. The outputs from both paths are then added element-wise. In this way, gradients can propagate through blocks easily, and a block can easily learn an identity function.
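The two-path idea can be sketched in a few lines of plain Python. This is only an illustration on a 1-D list, not a real bottleneck block: `residual_block`, `zero_path`, and `scale_path` are hypothetical names, and the "convolutional path" is replaced by a simple per-element function.

```python
def residual_block(x, residual_fn):
    """Return residual_fn(x) + x, added element-wise (the shortcut is the identity)."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("both paths must produce the same shape")
    return [f + s for f, s in zip(fx, x)]


def zero_path(x):
    # Stand-in for a convolutional path whose weights are all zero.
    return [0.0 for _ in x]


def scale_path(x, w=0.5):
    # Stand-in for a convolutional path that has learned something non-trivial.
    return [w * v for v in x]


x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_path)   # block behaves as an identity function
scaled_out = residual_block(x, scale_path)    # block adds a learned residual to x
```

When the residual path outputs zeros, the block returns its input unchanged, which is why learning an identity function is easy for this design.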

The feature map from the final convolutional layer of the backbone contains abstract information about the image, e.g., the different object instances, their classes, and their spatial properties. It is then fed to the RPN.

2. RPN

RPN stands for Region Proposal Network. It scans the feature map and proposes regions that may contain objects (Regions of Interest, or RoIs).

mask r-cnn Region Proposal Network

Concretely, a convolutional layer processes the feature map and outputs a c-channel tensor in which each spatial vector (which also has c channels) is associated with an anchor center. A set of k anchor boxes with different scales and aspect ratios is generated around each anchor center. The anchor centers are evenly distributed over the whole image, so the anchor boxes together cover it completely.

Then two sibling 1-by-1 convolutional layers process the c-channel tensor. One is a binary classifier that predicts whether each anchor box contains an object. It maps each c-channel vector to a k-channel vector (one objectness score for each of the k anchor boxes sharing that anchor center). The other is an object bounding-box regressor that predicts the offsets between the true object bounding box and the anchor box. It maps each c-channel vector to a 4k-channel vector. Where several predicted bounding boxes overlap heavily and likely point to the same object, we keep the one with the highest objectness score and drop the others. This process is called non-maximum suppression (NMS).
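As a rough sketch of two of the steps above, the snippet below generates the k anchor boxes for one anchor center and runs a greedy non-maximum suppression over a few proposals. The function names, the toy scales and ratios, and the 0.5 IoU threshold are all assumptions chosen for illustration, not values taken from this post.

```python
import math


def make_anchors(cx, cy, scales, ratios):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centered at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:  # r = height / width; the box area stays s * s
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep one only if it does not
    overlap an already kept box by more than iou_threshold. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# k = 3 scales * 3 ratios = 9 anchors sharing one center (toy values).
anchors = make_anchors(8.0, 8.0, scales=(4.0, 8.0, 16.0), ratios=(0.5, 1.0, 2.0))

# Two heavily overlapping proposals and one separate proposal.
proposals = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(proposals, scores)
```

Here the second proposal overlaps the first with an IoU of about 0.68, so it is dropped and only the first and third proposals survive.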

This gives us a set of proposed RoIs. The next step is to find exactly where each RoI lies in the feature map and extract its features. This step is called RoIAlign.

3. RoIAlign

RoIAlign, or Region of Interest alignment, extracts feature vectors from a feature map based on the RoIs proposed by the RPN, and turns them into a fixed-size tensor for further processing.

mask r-cnn RoIAlign

This operation is illustrated in the above figure. We align the RoIs with their corresponding areas in the feature map by scaling. These regions come in different locations, scales, and aspect ratios. To get feature tensors of uniform shape, we sample over the aligned areas of the feature map. The white-bordered grid represents the feature map. The black-bordered grids represent RoIs. We divide each RoI into a fixed number of bins. In each bin, there are 4 dots representing sample locations. For each dot, we take the feature vectors at the surrounding feature map grid points and compute their bilinear interpolation as the dot's vector. Then we pool the dot vectors within each bin to get a smaller fixed-size feature map for each RoI. Next, we pass each RoI's feature map through a set of residual bottleneck blocks to extract further features. The result is a finer feature map for every RoI, which is processed by the two parallel branches that follow: the object detection branch and the mask generation branch.
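The sampling procedure can be sketched for a single-channel feature map as follows. This is a simplified illustration under several assumptions: the names `bilinear` and `roi_align` are hypothetical, feature values are treated as sitting at integer grid coordinates, the dots in a bin are combined by average pooling, and `scale` stands for the ratio between feature map size and image size. Real implementations differ in their coordinate conventions.

```python
import math


def bilinear(fmap, y, x):
    """Bilinearly interpolate a 2-D grid (list of rows) at a fractional (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the grid
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, roi, out_size=2, samples=2, scale=1.0):
    """Pool one RoI (x1, y1, x2, y2 in image coordinates) to out_size x out_size.

    Each bin holds samples x samples regularly spaced dots; each dot is
    bilinearly interpolated, and the dots in a bin are averaged.
    """
    x1, y1, x2, y2 = (v * scale for v in roi)  # align the RoI to the feature map
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(fmap, py, px))
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


# Toy 5x5 feature map whose value at (y, x) is x + 10 * y.
fmap = [[float(x + 10 * y) for x in range(5)] for y in range(5)]
pooled = roi_align(fmap, (0.0, 0.0, 4.0, 4.0))
```

With the default of 2 samples per axis, each bin contains the 4 dots described above. No coordinate is rounded to the nearest grid cell at any point, which is the property that distinguishes this from quantized pooling.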

4. Object detection branch

Once we have the feature map of an individual RoI, we can predict its object category and a finer instance bounding box. This branch is a fully connected layer that maps feature vectors to n class scores and 4n bounding-box coordinates (one box per class).
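A toy version of this mapping is sketched below. It assumes the fully connected layer is split into two linear maps (one for the n class scores, one for the 4n box values), uses a softmax over the class scores, and uses made-up weights. The names `linear`, `softmax`, and `detection_head` are hypothetical.

```python
import math


def linear(weights, bias, x):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_head(feature, cls_w, cls_b, box_w, box_b):
    """Map one RoI feature vector to (class index, class probability, 4 box values)."""
    probs = softmax(linear(cls_w, cls_b, feature))  # n class probabilities
    n = len(probs)
    box_out = linear(box_w, box_b, feature)         # 4 * n box values
    cls = max(range(n), key=lambda i: probs[i])
    return cls, probs[cls], box_out[4 * cls: 4 * cls + 4]


# n = 2 classes, feature dimension 2, made-up weights.
feature = [0.0, 2.0]
cls_w, cls_b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
box_w = [[0.0, 0.0] for _ in range(8)]
box_b = [0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4]
cls, prob, box = detection_head(feature, cls_w, cls_b, box_w, box_b)
```

The 4n-long box output is sliced by the predicted class, so each class gets its own set of 4 refined box values.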

mask r-cnn object detection

5. Mask generation branch

On the mask generation branch, we feed the RoI feature map to a transposed convolutional layer and then a convolutional layer. This branch is a fully convolutional network. It generates one binary segmentation mask per class. We then pick the output mask according to the class predicted by the object detection branch. In this way, each pixel's mask prediction avoids competition between classes.
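The selection step can be sketched as below. The snippet assumes the branch outputs one map of raw scores (logits) per class, applies a per-pixel sigmoid, and thresholds at 0.5. The name `select_mask`, the threshold, and the toy values are assumptions for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def select_mask(mask_logits, predicted_class, threshold=0.5):
    """mask_logits[c][y][x] holds one logit map per class.

    Only the map of the predicted class is used, and each pixel is decided by
    its own sigmoid, so classes never compete for a pixel.
    """
    chosen = mask_logits[predicted_class]
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in chosen]


# Two classes, 2x2 masks, made-up logits.
mask_logits = [
    [[5.0, 5.0], [5.0, 5.0]],     # class 0 map
    [[3.0, -3.0], [-3.0, 3.0]],   # class 1 map
]
mask = select_mask(mask_logits, predicted_class=1)
```

Note that the class 0 map has high scores everywhere, yet it has no effect on the result once class 1 is selected.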

mask r-cnn mask generation

6. Summary

That covers the basic architecture of Mask R-CNN. We conclude by reviewing a few key points.

  1. The whole model can be divided into two stages: the first stage proposes Regions of Interest, and the second stage predicts classes, bounding boxes, and masks for the RoIs.
  2. Instance mask generation is achieved by combining bounding-box object detection with binary mask generation for each class, and then relying on the class prediction to select the mask.
  3. An FPN can bring gains in average precision. When it is used, the feature map and RoIAlign must change accordingly.

That's it for this blog. I hope it's useful to you. If you have any suggestions, or want to quote this blog, please leave a message below. Thanks for reading.

