Towards cutting-edge object detection systems for defence applications

Helsing’s AI lab focuses on researching and implementing cutting-edge AI-based capabilities for defence applications. One of several key algorithmic building blocks for our perception systems is object detection. In this blog post, we recap a brief history of object detection methods and discuss the specific challenges for object detection systems in defence applications.

A brief history of object detection research

Object detection is the problem of automatically recognising and localising the objects present in an image or a video. The output of an object detection algorithm is typically visualised as labelled bounding boxes (see the example below). Although deep learning approaches have become the state-of-the-art method for object detection, the problem was first studied more than 20 years ago, long before deep learning became popular.

Generic object detection output. Source: Open Images V6 Dataset.

At that time, computational resources were limited and computer vision relied mostly on handcrafted features grounded in physical intuition. For example, the Viola-Jones object detection framework — considered the first real-time face detection system — uses so-called Haar features to locate distinctive facial features such as the nose, eyes, or lips. Other notable detection frameworks from before the emergence of deep learning are HOG detectors and Deformable Part-based Models (DPM).

The R-CNN model proposed by Girshick et al. in 2014 marks the seminal breakthrough in deep learning methods for object detection. As a two-stage model, R-CNN first extracts region proposals and then classifies each proposal as either object or background. This is in contrast to one-stage models, which perform localisation and classification in a single step. Let’s look at the two approaches in more detail.

R-CNN pipeline. Source: R-CNN paper by Girshick et al.

Two-stage object detection

Two-stage (or, more generally, multi-stage) object detectors build incrementally on the original R-CNN. These detectors make use of two separate models: the first extracts region proposals, and the second classifies the proposals and refines the localisation of the objects.

The main models of this family are R-CNN (2014), SPPNet (2014), Fast R-CNN (2015), Faster R-CNN (2015), Feature Pyramid Networks (2017), Mask R-CNN (2017), and Cascade R-CNN (2018).

Two-stage object detectors achieve high accuracy but are, in general, slower than one-stage detectors. However, recent research advances are closing this gap, making them fast enough for many applications.
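As a concrete illustration, a pretrained two-stage detector such as Faster R-CNN can be run in a few lines with torchvision. The sketch below is a minimal inference example; the random input tensor and the 0.5 confidence threshold are placeholders rather than settings we use in production.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with a ResNet-50 + FPN backbone, pretrained on COCO.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image with values in [0, 1]
with torch.no_grad():
    (pred,) = model([image])

# Each prediction holds boxes (x1, y1, x2, y2), class labels, and confidence scores.
keep = pred["scores"] > 0.5  # illustrative confidence threshold
print(pred["boxes"][keep], pred["labels"][keep])
```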

High-level representation of a two-stage object detector.

One-stage object detection

One-stage detectors localise and classify objects with a single model. They generally use grids and anchor boxes, also known as a priori boxes, as canonical bounding boxes from which to generate their predictions. The choice of anchor boxes is a crucial design decision when training a robust object detector. In general, one-stage detectors obtain multi-scale feature maps from the input image; each cell of these feature maps corresponds to a predefined location in the image. Anchor boxes with several aspect ratios are then considered at each predefined location in each feature map. Finally, for each anchor box, the model predicts an “objectness” score, bounding box offsets to refine the final detection, and class probabilities to recognise the object.
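To make the anchor mechanism concrete, here is a minimal sketch of anchor generation for a single feature map; the stride, scales, and aspect ratios below are illustrative values, not those of any particular detector.

```python
import itertools
import numpy as np

def generate_anchors(fmap_h, fmap_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchors, in image coordinates, for one feature map.

    Each feature-map cell covers a stride x stride patch of the input image;
    anchors of several scales and aspect ratios are centred on every cell.
    """
    anchors = []
    for y, x in itertools.product(range(fmap_h), range(fmap_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # centre of this cell
        for scale, ratio in itertools.product(scales, ratios):
            w, h = scale * np.sqrt(ratio), scale / np.sqrt(ratio)  # area stays ~scale**2
            anchors.append((cx, cy, w, h))
    return np.asarray(anchors)

# A 20x20 feature map at stride 16 with 2 scales x 3 ratios yields 2400 anchors;
# the detector then predicts objectness, 4 box offsets, and class scores per anchor.
print(generate_anchors(20, 20, stride=16).shape)  # (2400, 4)
```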

Anchor boxes give the canonical shape of the object and are further refined by our object detector. Source: Open Images V6 Dataset.

The main one-stage models are the family of YOLO detectors (starting in 2015), SSD (2015), RetinaNet (2017), CenterNet (2019) and EfficientDet (2020).

In general, two-stage detectors have shown stronger performance in terms of accuracy on typical object detection benchmarks. However, newer one-stage variants such as YOLOv7 (2022) are very competitive and offer large advantages in terms of runtime, so they are preferred in real-time applications.

High-level representation of a one-stage object detector.

New directions

With the rise in popularity of the transformer architecture, object detection has also been re-imagined. The DETR (2020) model and its variants use the transformer architecture and do not require any hand-crafted set of anchor boxes.

A transformer is an encoder-decoder network that exploits the self-attention mechanism to weigh the relative importance of different parts of the input data. In its simple formulation (see the illustration below), DETR uses a CNN backbone to learn 2D image features. This set of features is then fed into the transformer encoder jointly with a positional encoding. At the transformer decoder stage, a set of learnable positional embeddings, namely object queries, is provided as input. These embeddings guide the attention mechanism towards the encoder output features that matter for properly recognising and localising objects. Finally, a feed-forward network (FFN) takes each output of the transformer decoder and predicts either a detection or ‘no object’.
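The DETR paper ships a simplified PyTorch implementation of this pipeline, and the sketch below follows that spirit. The hidden width, number of queries, and the 50x50 cap on the positional embeddings are illustrative assumptions, and the training loss is omitted.

```python
import torch
from torch import nn
from torchvision.models import resnet50

class MinimalDETR(nn.Module):
    """DETR-style model: CNN backbone -> transformer -> per-query predictions."""

    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep 2D maps
        self.proj = nn.Conv2d(2048, hidden_dim, kernel_size=1)  # to transformer width
        self.transformer = nn.Transformer(hidden_dim, nheads,
                                          num_encoder_layers, num_decoder_layers)
        self.query_embed = nn.Parameter(torch.rand(num_queries, hidden_dim))  # object queries
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))  # learned positions,
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))  # up to 50x50 cells
        self.class_head = nn.Linear(hidden_dim, num_classes + 1)  # +1 for 'no object'
        self.bbox_head = nn.Linear(hidden_dim, 4)

    def forward(self, x):
        h = self.proj(self.backbone(x))  # (B, hidden_dim, H, W)
        B, _, H, W = h.shape
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)  # (H*W, 1, hidden_dim)
        src = pos + h.flatten(2).permute(2, 0, 1)  # encoder tokens with positions
        tgt = self.query_embed.unsqueeze(1).repeat(1, B, 1)  # one query set per image
        out = self.transformer(src, tgt)  # (num_queries, B, hidden_dim)
        return self.class_head(out), self.bbox_head(out).sigmoid()
```

At training time, DETR matches the fixed set of query outputs to the ground-truth objects with the Hungarian algorithm and optimises a set-based loss; that matching step is the other half of the method and is not shown here.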

DETR architecture. Source: DETR paper by Carion et al.

Object detection at Helsing

Object detection is a core building block for our perception systems. Interestingly, many of the challenges we encounter are both specific to applications in the defence domain and orthogonal to the direction of the broader research community. For us researchers, this means that we get to work on fascinating novel research challenges without having to continually play catch-up with the rest of the community. Let’s look at a few examples.

Tuneable performance

The standard metric for evaluating the quality of an object detection system is average precision (AP). We have found that AP alone is usually inadequate as a performance metric for complex functional chains composed of several AI and algorithmic capabilities, since the specific requirements for the object detection system vary between use cases.

For example, some applications require detections with near-zero false positives (FP), while in others false negatives (FN) must be avoided. We therefore need tools beyond AP to better understand model performance and to adjust models to the different use cases.
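As a sketch of what such tooling can look like, the helper below computes precision and recall at a given confidence threshold, assuming detections have already been matched to ground truth at a fixed IoU. Sweeping the threshold exposes the FP/FN trade-off for a given use case; the inputs are hypothetical.

```python
import numpy as np

def precision_recall_at_threshold(scores, is_true_positive, num_ground_truth, threshold):
    """Precision and recall for detections kept above a confidence threshold.

    scores:           confidence of each detection
    is_true_positive: per-detection flag, True if it matched a ground-truth box
    num_ground_truth: total number of ground-truth objects in the evaluation set
    """
    keep = scores >= threshold
    tp = np.sum(is_true_positive & keep)
    fp = np.sum(~is_true_positive & keep)
    fn = num_ground_truth - tp
    precision = tp / (tp + fp) if tp + fp > 0 else 1.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

scores = np.array([0.9, 0.8, 0.6, 0.4])
is_tp = np.array([True, True, False, True])
# A high threshold favours precision (few FP); a low one favours recall (few FN).
for t in (0.3, 0.5, 0.7):
    print(t, precision_recall_at_threshold(scores, is_tp, num_ground_truth=5, threshold=t))
```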

Additionally, the quality of the detections is not the only key factor when developing a model. For example, even a perfect object detector is unsuitable for our functional chains if it does not run on edge devices with limited compute resources. Since we dynamically adjust which models run on which devices, models also need to self-tune, for example to trade off the detection rate against the currently available compute resources.
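As a toy illustration of such self-tuning, a deployed model could select its input resolution from offline-profiled operating points; the profile numbers below are entirely hypothetical.

```python
def pick_operating_point(available_tops, profiles):
    """Pick the best operating point that fits the current compute budget.

    profiles: (input_resolution, tops_required, expected_recall) tuples measured
    offline for the deployed model; falls back to the cheapest point if nothing fits.
    """
    feasible = [p for p in profiles if p[1] <= available_tops]
    if not feasible:
        return min(profiles, key=lambda p: p[1])
    return max(feasible, key=lambda p: p[2])

profiles = [(320, 0.5, 0.70), (640, 2.0, 0.82), (1280, 8.0, 0.88)]  # hypothetical
print(pick_operating_point(available_tops=3.0, profiles=profiles))  # (640, 2.0, 0.82)
```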

In-mission learning

An adversarial attack? Source: DALL·E 2.

Running models in unforeseen scenarios or contexts (e.g., different weather, landscape, hardware, or perspectives) can yield significant domain shift, with obvious consequences for model performance. In general, retraining models offline is not feasible, and thus we require an “in-mission” learning loop.

Domain shifts like these are typical of the challenges that require in-mission learning.

Solving this problem requires infrastructure for in-mission data capture and re-training but, perhaps more interestingly, it also requires fundamentally different model architectures that allow for fast re-training in resource-limited environments.
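A minimal sketch of such a lightweight update is shown below, assuming a frozen backbone, a small buffer of labelled in-mission samples, and a classification-style loss standing in for a full detection loss.

```python
import torch
from torch import nn

def in_mission_update(backbone, head, captured_loader, steps=100, lr=1e-4):
    """Adapt only the lightweight head on freshly captured samples.

    Freezing the backbone keeps the update cheap and stable enough to run
    on a resource-limited edge device during the mission.
    """
    for p in backbone.parameters():
        p.requires_grad_(False)
    backbone.eval()
    optimiser = torch.optim.SGD(head.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for step, (images, labels) in enumerate(captured_loader):
        if step >= steps:
            break
        with torch.no_grad():
            features = backbone(images)  # frozen features, no gradients needed
        loss = criterion(head(features), labels)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```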

Current research at Helsing deals with, among others, the challenging tasks of (unsupervised) domain adaptation and continual and incremental learning.

Scale and class imbalance

An in-depth study of the ambitious problems we are considering in the defence sector, and in particular of the captured data, reveals severe imbalance in terms of both object scale and class frequency.

Since a well-balanced data distribution is rarely attainable in the defence domain, we must explore alternative solutions to these problems. Our current research lines focus on synthetic data generation (including multi-scale information) and on defining loss functions capable of balancing the information propagated through the network.
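A well-known example of such a balancing loss is the focal loss introduced with RetinaNet (Lin et al., 2017), which down-weights easy examples so that abundant background and majority classes do not dominate the gradient. The sketch below is the standard binary formulation, not necessarily the loss we use.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss; targets are 0/1 floats with the same shape as logits."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)  # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)**gamma is near 0 for well-classified examples, so hard and rare
    # examples contribute most of the gradient.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```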

Multi-modal models

The vast majority of existing object detection research focuses on RGB images captured by electro-optical cameras. We are actively working on object detection techniques for other sensor types, and, even more interestingly, on multi-modal models that can detect objects given inputs from multiple different types of sensors.
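As a sketch of what a multi-modal detector backbone can look like, the module below encodes two modalities separately and fuses their feature maps before a shared detection head. The modality names, channel counts, and the assumption of pixel-aligned sensors are all illustrative.

```python
import torch
from torch import nn

class MidFusionBackbone(nn.Module):
    """Toy mid-level fusion: per-modality encoders, concatenation, 1x1 projection."""

    def __init__(self, rgb_encoder, ir_encoder, rgb_ch, ir_ch, out_ch):
        super().__init__()
        self.rgb_encoder = rgb_encoder  # any CNN producing (B, rgb_ch, H, W)
        self.ir_encoder = ir_encoder    # any CNN producing (B, ir_ch, H, W)
        self.fuse = nn.Conv2d(rgb_ch + ir_ch, out_ch, kernel_size=1)

    def forward(self, rgb, ir):
        f_rgb = self.rgb_encoder(rgb)
        f_ir = self.ir_encoder(ir)  # assumes the sensors are spatially aligned
        return self.fuse(torch.cat([f_rgb, f_ir], dim=1))  # shared head consumes this
```

The fused feature map can then feed any of the detection heads described above; in practice, sensor registration and differing resolutions make this step considerably harder than the sketch suggests.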

Conclusion

While object detection may seem like a well-understood domain, we have identified and solved several novel research challenges, many of which are specific to the defence domain and its unique constraints and requirements. Of course, this is but one of several active AI research directions at Helsing. We are looking for the brightest of the brightest to join us on our journey; if you are interested, please get in touch!

Authors

Pau Riba, Sounak Dey, Jean-Marc Wanka, Rigas Kouskouridas