Object detection in computer vision

Object detection in computer vision uses deep learning, a family of machine learning models described in the form of neural networks organized in layers. Generally speaking, the supervised learning models used to detect objects in an image are artificial neural networks, trained with deep learning.

Deep learning can be either supervised or unsupervised, depending on the situation. In the remainder of this article, we will only talk about supervised learning. In this case, the data used for training are labeled. In other words, for each example used in the learning process, we know what result the model should achieve. The ultimate goal is to learn to associate each example with the right label.

Take a few minutes and let's talk about the main principles of supervised deep learning for object detection! Please note that I've made it a point not to go into a mathematical description of Deep Learning. This article is therefore intended to be accessible to everyone.

Supervised deep learning for object detection in computer vision

General principles of object detection in computer vision

An artificial neural network is a machine-learning model made up of computational units called neurons. A neuron calculates a weighted sum of inputs and applies a non-linear function to the result of this sum. This is the activation function, to which we'll return later.
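
To make this concrete, here is a minimal sketch of a single neuron in Python. The input values, the weights, the bias term and the choice of a sigmoid as activation function are all assumptions made for the illustration; they are not taken from a real model.

```python
import math

def sigmoid(z):
    """A classic non-linear activation function, with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Compute the weighted sum of the inputs, then apply the activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Hypothetical values, chosen only for the example.
inputs = [0.5, -1.0, 2.0]
weights = [0.8, 0.2, -0.5]
bias = 0.1  # assumption: a constant added to the weighted sum

output = neuron(inputs, weights, bias, sigmoid)
print(output)  # the weighted sum is about -0.7, so the output is roughly 0.33
```

Changing the weights changes the output for the same inputs, which is exactly what training will exploit.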

Usually, the output of one neuron is transmitted to the input of one or more neurons in the next layer, and so on from layer to layer until the output of the network is reached. Sometimes, a recurrent neural network is used instead. In this case, the output of one neuron can also be used as an input for the same neuron.
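
The layer-to-layer case can be sketched as follows. This is only an illustration: the layer sizes, the weights, the biases and the use of ReLU as activation are assumptions, and the recurrent case is not covered.

```python
def relu(z):
    """Activation used in this sketch: negative sums are set to zero."""
    return max(0.0, z)

def layer(inputs, weight_rows, biases):
    """Each row of weights (with its bias) defines one neuron of the layer."""
    return [
        relu(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weight_rows, biases)
    ]

# Hypothetical network: 2 inputs -> 3 hidden neurons -> 1 output neuron.
x = [1.0, 2.0]
hidden = layer(x, [[0.5, -1.0], [1.0, 1.0], [-0.5, 0.25]], [0.0, -1.0, 0.5])
output = layer(hidden, [[1.0, 0.5, -1.0]], [0.0])

print(hidden)  # [0.0, 2.0, 0.5]
print(output)  # [0.5]
```

The outputs of the hidden layer are simply reused as the inputs of the next one.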

Training a neural network involves optimizing the weights of each sum. In this way, the weights become the parameters of the network. To optimize the weights, we use a backpropagation algorithm to minimize the error between the network's prediction and the label supplied with each training example.

How a neuron works

Analogous to a biological neuron, an artificial neuron receives signals from other neurons as input. Synapses are modeled by weights assigned to input values.

Neurons are not always active. A large positive weight means the corresponding input tends to excite the neuron. Conversely, if the weight is negative, the input has an inhibitory action. If the weighted sum of these inputs exceeds a certain threshold, the neuron is activated. The role of the activation function is to decide how the neuron responds around this threshold. The result of the activation function is then transmitted to the following neurons.

Without going into detail, here are the four main activation functions. The slope of the leaky ReLU has been exaggerated for greater visibility (after [1]).

The curves of the four activation functions classically used in Deep Learning [1].
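
For readers who prefer code to curves, here is a short sketch of four activation functions. I am assuming that the four functions in question are the sigmoid, the hyperbolic tangent, ReLU and leaky ReLU (only the last one is named in the text), and the slope of 0.01 for the leaky ReLU is a common default, not a value taken from [1].

```python
import math

def sigmoid(z):
    """Squashes any value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes any value into the range (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Keeps positive values, sets negative values to zero."""
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Like ReLU, but negative values keep a small slope (assumed: 0.01)."""
    return z if z > 0 else slope * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

The only difference between ReLU and leaky ReLU is on the negative side, where the leaky version lets a small fraction of the signal through.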

Convolutional networks

When analyzing an image, the properties given as input to the neural network are the intensity values of each pixel. For an RGB image, there are three intensities per pixel, one for each color: red, green and blue.

In this case, and to avoid memory problems due to the large size of the images and the number of parameters to be processed, we can use a reduced-size network. Instead of connecting every pixel to every neuron, each pixel is processed together with its immediate neighbors, using convolutions. These are used to produce property maps (more commonly called feature maps). We can think of these as multi-layer images: each layer is a property obtained through a convolution filter.

Property maps are small in size and can be processed by a fully connected network without memory concerns. This type of model is known as a Convolutional Neural Network (CNN).

Convolutions

When a CNN processes a new image, it doesn't know which features are present in the image or where they might be. These networks therefore try to find out whether a feature is present in the image by means of filtering. The mathematical operation used for this is called a convolution. When processing an image, the position of each pixel within it is of fundamental importance, and this is precisely the information that a convolution layer exploits.

The convolution window (or kernel) is a small, square matrix. Its size is fixed when the network is designed, and is often 3×3. This window "slides" over the property map. At each location, we calculate the term-by-term (element-wise) product between the convolution window and the part of the property map on which it lies (see illustration below). Finally, a non-linear function is applied to the sum of the elements of the resulting matrix.

Illustration of a 3×3 convolution filter covering the grayed-out pixels of the input property map (in blue). The result is a layer of the output property map, in orange.
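
The sliding window described above can be sketched in a few lines of Python. This is a simplified version under several assumptions: a single-layer input, a window that moves one pixel at a time, no padding at the borders, and ReLU as the non-linear function. The small image and the kernel values are made up for the example.

```python
def relu(z):
    """Non-linear function applied to each sum (an assumption of this sketch)."""
    return max(0.0, z)

def convolve2d(property_map, kernel, activation=relu):
    """Slide the kernel over the map; at each location, sum the
    term-by-term products and apply the non-linear function."""
    k = len(kernel)
    rows = len(property_map) - k + 1
    cols = len(property_map[0]) - k + 1
    output = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    total += kernel[di][dj] * property_map[i + di][j + dj]
            out_row.append(activation(total))
        output.append(out_row)
    return output

# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

result = convolve2d(image, kernel)
print(result)  # [[3.0, 3.0, 0.0], [3.0, 3.0, 0.0]]
```

The output is high where the window straddles the edge and zero where the image is uniform. In a real CNN the kernel values are not chosen by hand: they are the weights learned during training.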

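During the convolution process, the weights are invariant: the same window weights are applied at every location of the property map. Because the same operation is repeated independently at each location, it can be run in parallel, which is how we harness the computing power of modern GPUs.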
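
Reusing the same weights at every location also keeps the number of parameters small. The quick calculation below compares a convolution layer with a fully connected layer. The image size (224×224 RGB), the 16 outputs and the presence of one bias per output are assumptions chosen for the example.

```python
def conv_parameters(kernel_size, in_channels, out_channels):
    """One kernel_size x kernel_size window per (input, output) channel pair,
    plus one bias per output channel (assumed)."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def dense_parameters(n_inputs, n_outputs):
    """One weight per (input, output) pair, plus one bias per output (assumed)."""
    return n_inputs * n_outputs + n_outputs

height, width, channels = 224, 224, 3  # hypothetical RGB image

conv = conv_parameters(3, channels, 16)
dense = dense_parameters(height * width * channels, 16)

print(conv)   # 448
print(dense)  # 2408464
```

With these assumptions, the convolution layer needs a few hundred weights where the fully connected layer needs more than two million.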

Pooling

Pooling is a method of taking an image and reducing its size, while preserving the most important information it contains.

The most common form is called Max-Pooling, which consists of sliding a small window over all parts of the image and taking the maximum value within this window at each step. This is usually a 2×2 window/matrix (see illustration below).

Illustration of the Max-Pooling principle.
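
Here is a minimal sketch of Max-Pooling with a 2×2 window. I am assuming that the window moves by its own size at each step, so that windows do not overlap, and the values of the small map are made up for the example.

```python
def max_pool(property_map, size=2):
    """Split the map into non-overlapping size x size windows
    and keep the maximum value of each window."""
    rows = len(property_map) // size
    cols = len(property_map[0]) // size
    return [
        [
            max(
                property_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]

# Hypothetical 4x4 map.
property_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool(property_map)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4×4 map becomes a 2×2 map: four times less data, while the strongest response of each region is kept.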

The major advantage of pooling is therefore to reduce the size of the data held in memory, which is a limiting factor when training a CNN. Pooling layers are handled differently from other layers during backpropagation, but we won't go into detail here.

Receptive fields

The receptive field of a neuron designates the region of the input image that can influence its output. Increasing the receptive field provides more context for classifying a location. It also enables the detection of larger objects within an image.

Pooling is one method of increasing the receptive field of a neuron. The activation function also modifies the effective receptive field, and we may talk about this in a future article.
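
The effect of pooling on the receptive field can be checked with a small calculation. The sketch below uses the usual recurrence for a stack of layers, each described by a window size and a step (stride). The two example stacks are hypothetical, and the calculation gives the theoretical receptive field only, not the effective one.

```python
def receptive_field(layers):
    """layers: list of (window_size, stride) pairs, from input to output.
    Returns the width, in input pixels, seen by one output neuron."""
    field = 1  # a neuron of the input "sees" exactly one pixel
    jump = 1   # distance, in input pixels, between two neighboring neurons
    for window_size, stride in layers:
        field += (window_size - 1) * jump
        jump *= stride
    return field

# Two 3x3 convolutions, one after the other.
without_pooling = receptive_field([(3, 1), (3, 1)])

# The same two convolutions with a 2x2 pooling (stride 2) in between.
with_pooling = receptive_field([(3, 1), (2, 2), (3, 1)])

print(without_pooling)  # 5
print(with_pooling)     # 8
```

With the same two convolutions, inserting a pooling step widens the region seen by each output neuron from 5 to 8 pixels.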

Backpropagation

In simple terms (and to avoid going into a complex mathematical description), backpropagation is the algorithm that works out how each of the neural network's weights should be adjusted to reduce the error. The aim is to be able to apply a form of gradient descent (an optimization algorithm for differentiable functions) to the network.

In fact, the CNN gives a "score" for each image it analyzes. The quality of our features and weights is determined by the number of errors we make in our classification/detection, compared with the labels. The features and weights are then adjusted to reduce the error, which is what learning is all about. Intuitively, the adjustments that are kept are those that reduce the error. This process is then repeated with each of the other images that have a label.

Illustration of the backpropagation principle. During the backpropagation process, if the error decreases, the adjustment is retained.
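
For the curious, here is a deliberately tiny sketch of this loop for a single neuron with one weight. Rather than trying adjustments at random, gradient descent uses the derivative of the error to know in which direction to move each weight. The four labeled examples, the squared error, the sigmoid activation, the bias term and the learning rate are all assumptions made for the illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(weight, bias, x, label, learning_rate=0.5):
    """One forward pass, one backward pass, one weight update."""
    # Forward pass: prediction and error for this example.
    prediction = sigmoid(weight * x + bias)
    error = 0.5 * (prediction - label) ** 2
    # Backward pass: chain rule, from the error back to the weighted sum.
    d_error_d_prediction = prediction - label
    d_prediction_d_sum = prediction * (1.0 - prediction)
    gradient = d_error_d_prediction * d_prediction_d_sum
    # Gradient descent: move each parameter against its gradient.
    weight -= learning_rate * gradient * x
    bias -= learning_rate * gradient
    return weight, bias, error

# Hypothetical labeled examples: negative inputs -> 0, positive inputs -> 1.
examples = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

weight, bias = 0.0, 0.0
errors = []
for epoch in range(200):
    total_error = 0.0
    for x, label in examples:
        weight, bias, error = train_step(weight, bias, x, label)
        total_error += error
    errors.append(total_error)

print(errors[0], errors[-1])  # the total error shrinks as training goes on
```

In a real network, backpropagation applies the same chain rule layer by layer, from the output back to the first layer, so that every weight receives its own gradient.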

Learn more about supervised deep learning for object detection in computer vision

In this article we have simply described the main principles of supervised deep learning for object detection in computer vision.

If you'd like to incorporate AI into your vision project, contact us and we'll put all our expertise at your service! We manage the creation of the dataset (database), which represents a significant amount of work. We can create a large database from images or videos (with several thousand annotated images). We then train a model by choosing the right neural network architecture and methodology.

In the next article, we'll present a case study of an object detection model developed by Imasolia. Stay tuned!

References:

[1] Rémi Delassus, Machine learning for anomaly detection in open data - application to cartography, thesis, 2018.

[2] Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng, Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations.
