Internals Of YOLOv1 For Object Detection

Niteshkumardew
17 min read · Mar 9, 2021

Contents

  • Overview
  • What is Object Detection
  • R-CNN Family In Brief
  • Internals of YOLO
  • Intersection Over Union (IOU)
  • Non Max Suppression
  • Loss Calculation
  • Mean Average Precision (mAP)
  • Implementation

Code Reference

This blog focuses on the internal workings and terminology used in the YOLO algorithm rather than on the implementation. To access the complete code for this blog, please refer to my Github.

Live Prediction on YOLO for Object Detection

Overview

YOLO stands for “You Only Look Once”. It is a state-of-the-art object detection technique first described in the seminal 2015 paper by Joseph Redmon et al. YOLO comes in many versions, such as YOLO v1, YOLO v2, YOLO v3, etc. YOLO v5 is the latest version at the time of writing this blog, but here we will focus on YOLO v1.

There are many algorithms and techniques for object detection, such as the R-CNN family, Histogram of Oriented Gradients (HOG), Single Shot Detector (SSD) and more, but YOLO has become the most popular technique for object detection among them. The two main reasons for that are:

  • Prediction in the YOLO model is almost in real time.
  • The complete object detection pipeline (from input image to final output) involves one single model.

Prediction in real time made it popular in applications like self-driving cars and vehicle identification at traffic signals, while the single-model training strategy made it easier to train for those applications.

What is Object Detection

To understand object detection clearly, we must first understand what localization and classification are.

Localization :

Localization means identifying the region where an object is situated. In general it means finding a rectangular box around the object.

Classification :

It is the most common task in computer vision. Classification means assigning an image to a particular class.

Object detection

Localization and classification are single-object tasks: if an image contains only one object, such as a cat or a dog, then categorizing that image as cat or dog is classification, and drawing a bounding box around that object is localization. When we perform these two tasks simultaneously on images that contain multiple objects, it is object detection.

Object detection is not an easy task, as it has to identify:

  • How many objects are there
  • At which locations these objects are placed
  • What is the size of each object
  • What are their classes

R-CNN family in brief

It is always good to know how the algorithms for a particular task evolved, and I find it easier to remember a new algorithm when I know what problems the existing algorithms were facing and how the new one solves them. With this approach, I will briefly cover the existing techniques, namely the sliding window technique and the R-CNN family, before YOLO.

Sliding window technique

This technique is based on finding the region in the image where the object is located and then feeding that particular section of the image (the cropped image) to a ConvNet for classification. This approach looks good at first glance, but it has a big problem.

It considers all possible regions of a predefined shape in the image to find an object. That means a fixed-size window slides over the image, and for each position the window is fed to the ConvNet to identify an object.

This approach fails in the real world because it involves a huge number of image regions to be processed by the ConvNet, and if objects come in different shapes then all these steps have to be repeated for every window shape.

R-CNN

R-CNN stands for Region-based Convolutional Neural Network. The improvement of R-CNN over the sliding window technique is that instead of tens of thousands of regions, it reduces the number to about 2000 regions per image, of varying sizes and shapes.

R-CNN uses the selective search technique to find those regions, which is based on the simple idea that any object in the image will exhibit varying scales, colors, textures, and enclosure. In R-CNN these regions are called Regions of Interest (ROI), and each one is reshaped and passed as input to the ConvNet.

Problems:

  • R-CNN involves multiple models that work independently. It uses a ConvNet for featurization, an SVM for classification and a separate regressor model for bounding box prediction.
  • It generates 2000 regions for every image and runs the ConvNet 2000 times to get the features for each region. This becomes even more expensive when the number of images is large.
  • Training such a model is very difficult as each of the models works separately.

Fast R-CNN

R-CNN runs the ConvNet 2000 times for every image, which is extremely time consuming. Fast R-CNN, as its name suggests, reduces the time required for prediction through the following modifications:

  • R-CNN generates Regions of Interest (ROI) from the original image and then sends them to the ConvNet for featurization, whereas Fast R-CNN reverses this sequence: it sends the image to the ConvNet first and then generates ROIs using a region proposal method on the feature maps. This is how Fast R-CNN saves time, since the ConvNet is used only once per image for featurization.
  • Fast R-CNN uses an ROI pooling layer which generates fixed-size feature maps for every ROI; that means irrespective of the size and shape of a Region of Interest, ROI pooling outputs the same sized feature map.
  • Fast R-CNN does not use separate models for the classification (class prediction) and regression (bounding box) tasks, as the ROI pooling layer generates fixed-size feature maps which can be stacked together and sent to a fully connected network.
  • In the last layer it uses softmax activation for classes and linear activation for boxes.

Problems:

Even Fast R-CNN has certain problem areas. It still uses selective search as the proposal method to find the Regions of Interest, which is a slow and time consuming process. It takes around 2 seconds per image to detect objects, which is much better than R-CNN, but when we consider large real-life datasets, even Fast R-CNN doesn’t look so fast anymore.

Faster R-CNN

Faster RCNN is the modified version of Fast RCNN. The major difference between them is that Fast RCNN uses selective search for generating Regions of Interest, while Faster RCNN uses “Region Proposal Network”, aka RPN. RPN takes image feature maps as an input and generates a set of object proposals, each with an objectness score as output.

The below steps are typically followed in a Faster RCNN approach:

  • We take an image as input and pass it to ConvNet which returns the feature map for that image.
  • Region proposal networks are applied on these feature maps. This returns the object proposals along with their objectness score.
  • A RoI pooling layer is applied on these proposals to bring down all the proposals to the same size.
  • Finally, the proposals are passed to a fully connected layer which has a softmax layer and a linear regression layer at its top, to classify and output the bounding boxes for objects.

Problems:

  • The algorithm requires many passes through a single image to extract all the objects.
  • As there are different systems working one after the other, the performance of the later stages depends on how the previous stages performed.

The reader can visit R-CNN Family to know more about these algorithms.

Internals of YOLO

Though YOLO is the most popular of all the object detection techniques, I found a lack of quality resources explaining the entire working of YOLO in a single place. With that in mind, I will try to explain every nitty-gritty detail of how YOLO works in this blog.

From now on, this blog will follow a question-and-answer pattern, which I think is the easiest way to explain it, because this is how I understood the algorithm.

From where to start?

Understanding the YOLO algorithm starts with understanding its model architecture.

The YOLO model accepts an RGB image of size (448, 448). Understanding the model architecture requires a basic understanding of convolutional neural networks. While building this architecture we just have to maintain the sequence of conv layers and max pooling layers; the rest of the intermediate layers are self explanatory.
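
To make the shape of the network concrete, here is a minimal PyTorch-style sketch of the idea. It is not the author's code and not the full 24-conv-layer network from the paper: the backbone below is a heavily simplified stand-in and the layer sizes are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes (PascalVOC)

# Heavily simplified stand-in for the paper's 24-conv-layer backbone:
# the real network stacks many more Conv + LeakyReLU + MaxPool blocks
# and ends with 1024 feature channels on a 7x7 map.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 192, kernel_size=3, padding=1), nn.LeakyReLU(0.1),
    nn.MaxPool2d(2, 2),
    nn.AdaptiveAvgPool2d((7, 7)),  # shortcut so this sketch reaches a 7x7 map
)

# Detection head: flatten the feature map and predict S*S*(C + 5*B) numbers.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(192 * 7 * 7, 4096), nn.LeakyReLU(0.1),
    nn.Linear(4096, S * S * (C + 5 * B)),  # 7 * 7 * 30 = 1470 outputs
)

x = torch.randn(1, 3, 448, 448)                    # one RGB image of size 448x448
out = head(backbone(x)).view(-1, S, S, C + 5 * B)  # reshape to (7, 7, 30)
print(out.shape)                                   # torch.Size([1, 7, 7, 30])
```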

The most important part of this architecture is to understand its output.

How to interpret model output?

The output of the above model has size (7, 7, 30) for one RGB image of size (448, 448). To interpret the model output, imagine that the image is divided into a 7 by 7 grid (49 in total) of equal sized cells, and each cell is represented by a 30 dimensional vector.

Since all the model parameters are trainable, the model is forced to learn these 30 values for every cell of an image during training. In the actual paper the image is divided into a 7 by 7 grid, but here, just for explanation, I will divide it into a 3 by 3 grid.

What are these 30 parameters?

The first 20 parameters are the outputs of a softmax activation, representing the probability of the object in the cell belonging to class i (i in 1 to 20). This YOLO architecture uses the PascalVOC dataset which has 20 different classes of objects, which is why the first 20 output parameters are assigned to the 20 classes. If you train the model on another custom dataset, the output should be modified accordingly.

For the training data, Pc, which is the 21st and 26th output parameter, will be a binary number, 0 or 1: 0 represents no object and 1 represents the presence of an object in the cell. x and y represent the x and y coordinates of the center point of the object, while w and h represent its width and height. Remember that x, y, w and h are all values with respect to a particular cell. We will understand why we have two sets of pc, x, y, w and h later in this blog.

Now let’s understand the output with a very simple example where we divide the image into a 3 by 3 grid and have only two possible classes, dog = 1 and cat = 2. The size of the output will then be

(grid, grid, num of classes + 5 * num of bounding boxes) → (3, 3, 2 + 5 * 2) → (3, 3, 12)

The first cell contains a small portion of the left dog but does not contain its midpoint (blue dot), so the output corresponding to this cell will be

[ * , * , 0 , * , * , * , * , 0 , * , * , * , * ] where 0 represents no object and ‘*’ means “don’t care”, i.e. these values will be ignored while training the model.

Now let’s try to calculate output for this cell

This cell contains the midpoint of the object, which is a dog, so c1 = 1 (dog), c2 = 0 (cat) and pc1 = 1 because it contains an object. Remember that the four parameters x, y, w and h are calculated with respect to the cell and not with respect to the image, so x and y will be float values between 0 and 1, but w and h might be greater than 1 when the object itself is larger than the cell.

So for the above cell, x = 0.9 because the midpoint is almost at the end of the x axis, and y = 0.55 because the midpoint is slightly below the center of the y axis.

The width w will be somewhere around 0.5 because the object’s width is half the width of the cell, and the height h will be 1.5 because the object’s height is one and a half times the height of the cell. So the final output corresponding to this cell will be [1, 0, 1, 0.9, 0.55, 0.5, 1.5, 0, *, *, *, *] and the box will look something like this

Similarly we can encode output for each of the cells.
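
As a rough illustration of the encoding described above, here is a small NumPy sketch that builds the target vector for the dog cell from the worked example. It assumes the layout [c1, c2, pc1, x1, y1, w1, h1, pc2, x2, y2, w2, h2] used in this example; the helper name encode_cell and the 0-based class indices are mine, not from the original code.

```python
import numpy as np

S, C, B = 3, 2, 2  # 3x3 grid, 2 classes (dog = index 0, cat = index 1), 2 boxes per cell

def encode_cell(class_idx, x, y, w, h):
    """Build the target vector for one cell, filling only the first box slot.

    x, y, w, h are already expressed relative to the cell, as in the example
    above (x, y lie in [0, 1]; w, h may exceed 1 when the object is larger
    than the cell).
    """
    cell = np.zeros(C + 5 * B)
    cell[class_idx] = 1.0             # one-hot class scores c1, c2
    cell[C] = 1.0                     # pc1: an object midpoint lies in this cell
    cell[C + 1:C + 5] = [x, y, w, h]  # box 1 coordinates
    return cell                       # the second box slot stays 0 ("don't care")

# The cell containing the dog's midpoint from the worked example:
print(encode_cell(class_idx=0, x=0.9, y=0.55, w=0.5, h=1.5))
# [1.   0.   1.   0.9  0.55 0.5  1.5  0.   0.   0.   0.   0.  ]
```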

What is an anchor box and why two anchor boxes in this architecture?

Anchor boxes are predefined rectangular boxes used to represent the region of an object within the image. The shapes of these boxes differ from each other, typically one vertically and one horizontally oriented. This concept is derived from the simple assumption that if a cell represents two objects, then those objects will most probably differ in shape: one might be taller and the other wider.

Consider a situation where midpoint of two objects lies in the same cell.

The midpoint of both the car and the person lies in the middle cell of the image, so we need two anchor boxes to represent both objects.

We need to fix the order of the anchor boxes across all cells of all images. That means if we use the first anchor box to represent taller objects and the second anchor box to represent wider objects, then this order must be followed for every cell of every image. The idea behind this is that the neurons of the last layer responsible for predicting taller objects will always predict taller objects and the neurons responsible for predicting wider objects will always predict wider objects, and in this way the weights converge faster.

Note that in our actual model we are dividing the image into a 7 by 7 grid and each cell can represent two objects, so theoretically we can represent 7 * 7 * 2 = 98 objects per image.

Intersection Over Union (IOU)

How to measure the correctness of a predicted box?

Let’s consider this prediction

If you get a prediction like this, how will you measure its correctness? IOU is the measure of correctness of the predicted box. As its name suggests, it is the ratio of the area of intersection to the area of union. It is equal to 1 for a perfect match and 0 when there is no intersection between the actual and predicted boxes. We choose a threshold for a good prediction, generally 0.7, but it can be changed based on requirements.
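
Here is a minimal sketch of the IOU computation, assuming the two boxes are given as (x1, y1, x2, y2) corner coordinates; the model's center/width/height outputs would first need to be converted to this format.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # corners of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```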

Non Max Suppression

What if we predict multiple boxes for one object?

If you have understood all the concepts so far, you might have noticed that practical images mostly have only two or three objects, yet we have a representation capacity of 98 objects per image. Don’t you think you will get a lot of predictions for one object? Of course you will, and this is where Non Max Suppression comes in to remove the unwanted predictions.

Let’s suppose we got three predictions, and remember that we predict boxes with objectness scores, which is Pc. Non Max Suppression removes unwanted predictions as follows:

  • First it picks the box with the highest objectness score.
  • Then it calculates the IOU between the box with the highest objectness score and the other boxes.
  • If the IOU is greater than some threshold value, such as 0.5, it removes those other boxes.

The idea is simple: a high IOU implies that both boxes predict the same object, but we don’t want multiple boxes, so the extra ones have to be removed. In the above example we will choose the box with the 0.9 objectness score and calculate its IOU with the other boxes.

Let’s say the IOU between the boxes with objectness scores 0.9 and 0.6 is 0.51, which is greater than the threshold of 0.5, so we will remove the box with the 0.6 score.

We will repeat the same with the box having the 0.3 objectness score.

Its IOU is 0.6, which is greater than the threshold of 0.5, so this box will also be removed. Finally, we are left with the best possible prediction.
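
Putting these steps together, here is a minimal sketch of Non Max Suppression, reusing the iou helper from the previous section. The box coordinates below are made up for illustration and do not reproduce the exact IOU values from the example above.

```python
def non_max_suppression(boxes, iou_threshold=0.5):
    """boxes: list of (objectness_score, x1, y1, x2, y2) tuples."""
    boxes = sorted(boxes, key=lambda b: b[0], reverse=True)  # highest score first
    kept = []
    while boxes:
        best = boxes.pop(0)  # pick the box with the highest objectness score
        kept.append(best)
        # drop every remaining box that overlaps the chosen box too much
        boxes = [b for b in boxes if iou(best[1:], b[1:]) < iou_threshold]
    return kept

# The worked example above: three boxes predicted for the same dog
predictions = [
    (0.9, 100, 100, 300, 400),
    (0.6, 120, 110, 310, 390),
    (0.3,  90, 130, 290, 410),
]
print(non_max_suppression(predictions))  # only the 0.9 box survives
```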

Loss Calculation

Things will become clearer once we understand how the loss is calculated while training the model.

This loss function is taken from the actual paper. Loss is calculated in five steps and we will understand each step separately.

Loss calculation on x and y coordinates

xi, yi are the actual and xi_hat, yi_hat are the predicted x and y coordinates of the midpoint of the object with respect to the i’th cell of the image. The loss is a squared error loss gated by an indicator function. The indicator function can take only two values, 0 and 1, and it is 1 only when the j’th anchor box of the i’th cell is responsible for an object in the training data; otherwise it is 0. lambda_coord is a hyperparameter, set to 5 in the paper.
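
For reference, this term as written in the paper (with S² cells, B boxes per cell, and the indicator selecting the responsible box) is:

$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right]$$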

To interpret the above loss function, let’s consider a dummy example with S = 3 (a 3 by 3 grid, so S² = 9 cells) and B = 2 (number of anchor boxes per cell).

The above loss function value will be 0 for all cells except cells 3 and 4. Suppose that our first anchor box represents a vertical box and the second anchor box represents a horizontal box; then we will take the x and y coordinates of the object’s midpoint from the first anchor box of the prediction, because the shape of the left dog is vertically oriented.

Finally, the above loss will be calculated on the x and y coordinates for i = 3 (third cell) with j = 0 (first anchor box) and for i = 4 (fourth cell) with j = 0 (first anchor box).

Loss calculation on width and height

The loss calculated on width and height is the same as what we calculated for the x and y coordinates, except that for width and height, instead of taking the actual values we take their square roots.

The idea behind the square root is simple: the width and height of the predicted box matter most when the object is small, meaning that for small objects we can’t afford errors, so the loss function should penalize such errors more. A small object fits within the cell, so its width and height values lie between 0 and 1, where the square root is larger than the raw value and the same absolute error produces a larger difference in square roots than it would for a big box, so this simple trick makes the loss more sensitive to errors on small objects.
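
Written out, the width and height term from the paper is:

$$\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right]$$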

Loss calculation on objectness score (with object)

Ci in the loss function is Pc_i in our notation, so don’t be confused by the difference in notation.

The objectness score loss is calculated for those anchor boxes, in those cells, that are responsible for an object. This loss is calculated only when the target objectness score Ci = 1; when Ci = 0 the indicator function is also 0.
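
The corresponding term from the paper is:

$$\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2$$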

Loss calculation on objectness score (with no object)

The loss associated with the objectness score for cells with no object is very important for selecting the best predicted box for an object. Here the indicator function will be 1 for all boxes in cells with no object.

While performing Non Max Suppression we assumed that a high objectness score means a more accurate box. The above loss term forces all predicted objectness scores that don’t correspond to any object towards 0, which further reduces the burden on Non Max Suppression. lambda_noobj is a hyperparameter, set to 0.5 in the paper.
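
The no-object term from the paper is:

$$\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2$$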

Loss calculation on class probabilities

pi(c) in the loss function is ci in our representation.

The above loss is calculated on the output of the softmax activation, which represents the probability of the predicted object belonging to each class. Usually in classification tasks we choose a cross entropy loss, but here a squared error loss is used. Note that in cross entropy loss we are only concerned with the predicted probability of the true class and simply ignore the predicted probabilities of the other classes, whereas with squared error loss we want both: the predicted probability of the true class to be 1 and the predicted probabilities of the other classes to be 0.
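
The class probability term from the paper, summed only over the cells that contain an object, is:

$$\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2$$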

Mean Average Precision (mAP)

mAP is the standard performance metric for object detection models. In this section we will understand mAP with a small example.

Let’s consider a test set with three images: image1, image2 and image3.

Here we will calculate the average precision for the class ‘DOG’. First we count the total number of dogs across all test images, which is four in our example: two in image1 and one each in image2 and image3. The green boxes are the actual (ground-truth) boxes, while the red boxes (each with an objectness score) are the predicted boxes for the class ‘DOG’. In the above images we can see that some of the predictions are good and some are extremely bad. How will we measure the performance of the model on such predictions?

To calculate mAP we should have an idea about precision and recall. I am assuming the reader already knows precision and recall.

First we will determine true positives (TP) and false positives (FP) for all the predictions on the test images.

A predicted box is counted as a TP if the IOU between the predicted box and an actual box is greater than 0.5, otherwise as an FP.

Now we will sort the table based on the confidence score (objectness score).

Now we will calculate precision and recall incrementally.

In simple words

precision = (number of correct predictions / total number of predictions) for that class

recall = (number of correct predictions / total number of actual labels) for that class

Now we will plot a curve of the precision and recall values we have just calculated; the area under this PR curve is the average precision (AP) for the class DOG.

The final model performance, mAP, is the mean of the APs over all classes.
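
As a sketch of how AP could be computed from such a table, here is a simple area-under-the-PR-curve version. Note that the official PascalVOC benchmark uses an interpolated variant, and the predictions below are hypothetical numbers for illustration only.

```python
import numpy as np

def average_precision(predictions, total_ground_truths):
    """predictions: list of (confidence, is_true_positive) for one class,
    where is_true_positive means the box had IOU > 0.5 with an actual box."""
    # sort by confidence (objectness score), highest first
    predictions = sorted(predictions, key=lambda p: p[0], reverse=True)
    tp = np.array([1.0 if is_tp else 0.0 for _, is_tp in predictions])
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / (cum_tp + cum_fp)  # running precision
    recall = cum_tp / total_ground_truths   # running recall

    # area under the precision-recall curve (trapezoid rule), starting the
    # curve at the point (recall = 0, precision = 1)
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    return float(np.sum((recall[1:] - recall[:-1]) *
                        (precision[1:] + precision[:-1]) / 2.0))

# Hypothetical 'DOG' predictions; there are 4 ground-truth dogs in the test set
preds = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.3, False)]
print(average_precision(preds, total_ground_truths=4))  # ≈ 0.68
```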

Implementation:

The blog is already very long, so the implementation part will be brief. The following steps were taken to train the model:

  • The PascalVOC dataset (number of classes = 20) is used to train the model.
  • The model architecture is the same as the one explained in the paper.

Training such a model requires a huge amount of computational power (multiple powerful GPUs) that I don’t have. The PascalVOC dataset also provides a small sample with only 100 examples, so I chose to train the model on that small dataset, because my motive for this blog is not to build a powerful model but to understand its internal workings.

It took me around 20 hours to train the model for 200 epochs with just 100 examples. Basically, I wanted to overfit the model on this small sample of the dataset just to see the predictions. I trained the model from scratch, but I recommend that the reader start from a pretrained model such as VGG16 or an Inception network to reduce the training time.

Here are the training logs for the last 10 epochs:
