YOLO2 Walkthrough with Examples (2024)

Yolo is one of the most successful object detection algorithms in the field, known for its lightning speed and decent accuracy. Compared to region-proposal frameworks that detect objects region by region, which requires running feature extraction many times, Yolo processes the input image only once. In this tutorial, we are going to take a peek into the code of Yolo2.


For those who want to run the code step by step instead of reading comments, check out my companion repo on GitHub! The repo has several tutorials covering all aspects of Yolo, as well as a ready-to-use library for you to play with!

To understand how Yolo2 works, it is critical to understand what the Yolo architecture looks like. Yolo2 uses a VGG-style CNN called DarkNet as its feature extractor. Please note that DarkNet is an umbrella term for a family of networks, and people use different variants to trade off speed and accuracy.


As you can see, Yolo's output is nothing like what we've seen before. There are 416 x 416 pixels in the image, but the output is 13 x 13. How on earth do we interpret the results?

Let's put Yolo aside for a moment and think about how we would do object detection in one pass. Here's my naive solution:

Suppose we have a network that takes an input image of size, say, 416 x 416 x 3, and there are 20 classes in the dataset. For every pixel in the image, we can predict a box with the following layout (Figure 2). The model output has shape 416 x 416 x 22.
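To make the shape bookkeeping concrete, here is a minimal NumPy sketch (the split of the 22 channels into box values and class probabilities follows figure 2):

```python
import numpy as np

H = W = 416       # input image size
BOX_INFO = 22     # per-pixel layout from figure 2 (box values + 20 class probs)

# The naive scheme: one dense prediction per pixel.
naive_output = np.zeros((H, W, BOX_INFO))
print(naive_output.shape)  # (416, 416, 22)
print(H * W)               # 173056 candidate boxes per image
```

That is 173,056 candidate boxes per image, which motivates the optimizations below.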

Yolo is engineered to be fast and accurate. Therefore, it is not ideal to predict one box per pixel (two adjacent pixels may belong to one object). The geeks who invented YOLO went to work and came up with a better idea.

Optimization 1 — reduce predicted box number

Instead of predicting one box per pixel, we divide an image into S x S grids and predict several boxes per grid.

With this optimization, the output can be reduced to something like 13 x 13 x 5 * 22 if we predict 5 boxes per grid. This is a significant drop in the number of boxes.

Optimization 2 — object score for filtering out low-confidence predictions

Yolo also introduces an object score in addition to the classification probabilities. The object score is an estimate of whether an object appears in the predicted box (it doesn't care which object; that's the job of the class probabilities). If a prediction has a low object score, it will be discarded in post-processing. With that being said, the bounding box should look like this:

With this optimization, the output will have shape 13 x 13 x 5 * (3 + 20).

Optimization 3 — tailor to the dataset


Instead of predicting the absolute size of boxes w.r.t. the entire image, Yolo introduces what is known as the Anchor Box: a list of predefined boxes that best match the desired objects (given the ground truths, run k-means clustering). The predicted box is scaled w.r.t. the anchors. More specifically:

  1. Predict the box center (tx and ty in figure 6) w.r.t. the top-left corner of its grid, scaled by grid width and height.
  2. Predict the width (tw) and height (th) of the box w.r.t. an anchor box (pw and ph).

Final Format

Now you know YOLO predicts several bounding boxes per grid instead of just one. The output shape would be something like 13 x 13 x NUM_ANCHOR x (BOX INFO), where the last dimension looks just like an upgraded version of the naive approach.

With all optimizations, the Yolo output can be interpreted as:

for every grid:
    for every anchor box (with different aspect ratios and sizes):
        predict a box

Thus, the Yolo output has shape 13 x 13 x 5 x 25, which in practice is reshaped into 13 x 13 x 125.
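A quick NumPy sketch of that last reshape (the same bookkeeping applies in any tensor library):

```python
import numpy as np

GRID, NUM_ANCHOR = 13, 5
BOX_INFO = 4 + 1 + 20   # coordinates + object score + 20 class probabilities

# The network emits a flat 13 x 13 x 125 tensor...
raw = np.zeros((GRID, GRID, NUM_ANCHOR * BOX_INFO))

# ...which we reshape so the last axis holds one 25-value box per anchor.
boxes = raw.reshape(GRID, GRID, NUM_ANCHOR, BOX_INFO)
print(boxes.shape)  # (13, 13, 5, 25)
```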

Now we understand the format of Yolo2. The next step is to extract the boxes from the raw tensor. Surely we can't use all 13 x 13 x 5 boxes, right? In this section, we are going to see how to extract information from the raw output tensor.

Let's assume the output Y has shape 2 x 2 x 2 * 6, meaning there are two anchors per grid and one class in the dataset.
Assume Y[0, 1, 0, :] = [0, 0, 0.4, 0.4, 0, 0.5]. This defines the red box in figure 8. But how do we decode it?

Step 1 — extract box coordinates

Let's take a look at the information:

[0, 0, 0.4, 0.4, 0, 0.5] =
[tx, ty, tw, th, obj score, class prob.]

Please refer to figure 6.

We need to convert the relative coordinates tx and ty into grid-scale coordinates bx and by, and do the same for tw and th. Here's how to do it in TensorFlow 2.
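The exact snippet isn't reproduced here, so below is a minimal NumPy sketch of the same math (the TF2 version simply swaps in tf.sigmoid and tf.exp). The anchor values (pw, ph) are made up to reproduce the numbers used in this example; real anchors come from k-means on the training boxes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw prediction for the red box: Y[0, 1, 0, :4]
tx, ty, tw, th = 0.0, 0.0, 0.4, 0.4

# Grid cell holding the box: row 0, column 1
cy, cx = 0, 1

# Hypothetical anchor size (width, height) in grid units.
pw, ph = 0.5, 0.76

by = sigmoid(ty) + cy   # 0.5  -> 0.5 grid heights from the top
bx = sigmoid(tx) + cx   # 1.5  -> 1.5 grid widths from the left
bh = ph * np.exp(th)    # ~1.13 grid heights
bw = pw * np.exp(tw)    # ~0.75 grid widths
```

The sigmoid keeps the predicted center inside its grid cell, while the exponential scales the anchor by a strictly positive factor.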

After this step the red box coordinates are converted from:
[0, 0, 0.4, 0.4] to
[0.5, 1.5, 1.13, 0.75], meaning the center is (0.5 grid heights, 1.5 grid widths) from the top-left image corner and the box has size (1.13 grid heights, 0.75 grid widths).

Now that we have the coordinates of the predicted box on the grid scale along with its size, it is very easy to calculate the coordinates of its corners (the purple and green dots in figure 8). We do this step because it is standard to represent a box by its corners rather than by its center and width/height.

To get the coordinates of the green and purple dots:

green dot = boxXY - boxWH / 2
purple dot = boxXY + boxWH / 2

(Please note that the top-left corner has the smaller coordinates in images.)
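As a quick NumPy sketch of the corner computation, using the grid-scale center and size from the red-box example (note the (x, y)/(w, h) ordering chosen here; any consistent ordering works):

```python
import numpy as np

boxXY = np.array([1.5, 0.5])    # box center (x, y), grid scale
boxWH = np.array([0.75, 1.13])  # box size (w, h), grid scale

green_dot  = boxXY - boxWH / 2  # top-left corner (smaller coordinates)
purple_dot = boxXY + boxWH / 2  # bottom-right corner
print(green_dot, purple_dot)
```

A slightly negative corner coordinate just means the box sticks out past the image border, which is fine; it can be clipped later.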

After this, we multiply the coordinates by the grid stride (32, 32) so that the bounding boxes are in image (pixel) scale. If you instead divide by the grid size (13, 13), you get normalized coordinates where (0, 0) is the top-left corner, (1, 1) is the bottom-right corner, and (0.5, 0.5) is the image center.

Step 2 — Filter out low quality boxes

For every grid cell and every anchor box, Yolo predicts a bounding box. In our case, this means 13 * 13 * 5 boxes are predicted. As you can imagine, not all boxes are accurate. Some of them might be false positives (no object), and some of them are predicting the same object (too much overlap). To obtain the final result, we need to:

  1. Filter out boxes with low confidence (object score).
  2. Filter out boxes that overlap too much (two boxes with high IOU).
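A bare-bones sketch of those two filters in plain Python (the thresholds are illustrative; in practice you would reach for something like tf.image.non_max_suppression):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def filter_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Object-score cut followed by greedy non-max suppression."""
    candidates = sorted(
        (bs for bs in zip(boxes, scores) if bs[1] >= score_thresh),
        key=lambda bs: bs[1], reverse=True)           # best boxes first
    kept = []
    for box, score in candidates:
        # Keep the box only if it doesn't overlap a better kept box too much.
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

boxes  = [(0, 0, 2, 2), (0.1, 0, 2, 2), (5, 5, 6, 6), (1, 1, 3, 3)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_boxes(boxes, scores))  # box 2 is suppressed, box 4 scored too low
```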

That’s it! Those are the only scripts you need to decode the yolo output. Let’s check out the results:


Reading scripts can be very confusing, which is why I strongly recommend you check out the repo and run it on Google Colab or your local computer.

Thank you so much for reading! In future tutorials, I'm going to talk about loading training data and transfer learning!
