YOLO2 Walkthrough with Examples (2024)

Yolo is one of the most successful object detection algorithms in the field, known for its lightning speed and decent accuracy. Compared to region-proposal frameworks that detect objects region by region, which requires running feature extraction many times, Yolo processes the input image only once. In this tutorial, we are going to take a peek into the code of Yolo2.


For those who want to run the code step by step instead of reading comments, check out my companion repo on GitHub! The repo has several tutorials covering all aspects of Yolo, as well as a ready-to-use library for you to play with!

To understand how Yolo2 works, it is critical to understand what the Yolo architecture looks like. Yolo2 uses a VGG-style CNN called DarkNet as its feature extractor. Please note that DarkNet is an umbrella term for a family of networks, and people use different variants to trade off speed and accuracy.


As you can see, Yolo's output is nothing like what we've seen before. There are 416 x 416 pixels in the image, but the output is 13 x 13. How on earth do we interpret the results?

Let's put Yolo aside for a moment and think about how we would do object detection in one pass. Here's my naive solution:

Suppose we have a network that takes an input image of size, say, 416 x 416 x 3, and there are 20 classes in the dataset. For every pixel in the image, we can predict a box with the following layout (Figure 2). The model output has shape 416 x 416 x 22.
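To make the shape bookkeeping concrete, here is a minimal NumPy sketch (the split of the 22 channels into box values and class probabilities follows figure 2):

```python
import numpy as np

H = W = 416       # input image size
BOX_INFO = 22     # per-pixel layout from figure 2 (box values + 20 class probs)

# The naive scheme: one dense prediction per pixel.
naive_output = np.zeros((H, W, BOX_INFO))
print(naive_output.shape)  # (416, 416, 22)
print(H * W)               # 173056 candidate boxes per image
```

That is 173,056 candidate boxes per image, which motivates the optimizations below.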

Yolo is engineered to be fast and accurate. Therefore, it is not ideal to predict one box per pixel (two adjacent pixels may belong to one object). The geeks who invented YOLO went to work and came up with a better idea.

Optimization 1 — reduce predicted box number

Instead of predicting one box per pixel, we divide an image into S x S grids and predict several boxes per grid.

With this optimization, the output can be reduced to something like 13 x 13 x 5 * 22 if we predict 5 boxes per grid. This is a significant drop in the number of boxes.

Optimization 2 — object score for filtering out low-confidence predictions

Yolo also introduces an object score in addition to the classification probabilities. The object score is an estimate of whether an object appears in the predicted box (it doesn't care which object; that's the job of the class probabilities). If a prediction has a low object score, it will be discarded in post-processing. With that being said, the bounding box should look like this:

With this optimization, the output will have shape 13 x 13 x 5 * (3 + 20).

Optimization 3 — tailor to the dataset


Instead of predicting the absolute size of boxes w.r.t. the entire image, Yolo introduces what is known as the Anchor Box: a list of predefined boxes that best match the desired objects (given the ground truths, run k-means clustering). The predicted box is scaled w.r.t. the anchors. More specifically:

  1. Predict the box center (tx and ty in figure 6) w.r.t. the top-left corner of its grid, scaled by grid width and height.
  2. Predict the width (tw) and height (th) of the box w.r.t. an anchor box (pw and ph).

Final Format

Now you know YOLO predicts several bounding boxes per grid instead of just one. The output shape would be something like 13 x 13 x NUM_ANCHOR x (BOX INFO), where the last dimension looks just like an upgraded version of the naive approach.

With all optimizations, the Yolo output can be interpreted as:

for every grid:
    for every anchor box (with different aspect ratios and sizes):
        predict a box

Thus, the Yolo output has shape 13 x 13 x 5 x 25, which in practice is reshaped into 13 x 13 x 125.
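A quick NumPy sketch of that last reshape (the same bookkeeping applies in any tensor library):

```python
import numpy as np

GRID, NUM_ANCHOR = 13, 5
BOX_INFO = 4 + 1 + 20   # coordinates + object score + 20 class probabilities

# The network emits a flat 13 x 13 x 125 tensor...
raw = np.zeros((GRID, GRID, NUM_ANCHOR * BOX_INFO))

# ...which we reshape so the last axis holds one 25-value box per anchor.
boxes = raw.reshape(GRID, GRID, NUM_ANCHOR, BOX_INFO)
print(boxes.shape)  # (13, 13, 5, 25)
```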

Now we understand the format of Yolo2. The next step is to extract the boxes from the raw tensor. Surely we can't use all 13 x 13 x 5 boxes, right? In this section, we are going to see how to extract information from the raw output tensor.

Let's assume the output Y has shape 2 x 2 x 2 * 6, meaning there are two anchors per grid and one class in the dataset.
Assume Y[0, 1, 0, :] = [0, 0, 0.4, 0.4, 0, 0.5]. This defines the red box in figure 8. But how do we decode it?

Step 1 — extract box coordinates

Let's take a look at the information:

[0, 0, 0.4, 0.4, 0, 0.5] =
[tx, ty, tw, th, obj score, class prob.]

Please refer to figure 6.

We need to convert the relative coordinates tx and ty into grid-scale coordinates bx and by, and do the same for tw and th. Here's how to do it in TensorFlow 2.
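The exact snippet isn't reproduced here, so below is a minimal NumPy sketch of the same math (the TF2 version simply swaps in tf.sigmoid and tf.exp). The anchor values (pw, ph) are made up to reproduce the numbers used in this example; real anchors come from k-means on the training boxes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Raw prediction for the red box: Y[0, 1, 0, :4]
tx, ty, tw, th = 0.0, 0.0, 0.4, 0.4

# Grid cell holding the box: row 0, column 1
cy, cx = 0, 1

# Hypothetical anchor size (width, height) in grid units.
pw, ph = 0.5, 0.76

by = sigmoid(ty) + cy   # 0.5  -> 0.5 grid heights from the top
bx = sigmoid(tx) + cx   # 1.5  -> 1.5 grid widths from the left
bh = ph * np.exp(th)    # ~1.13 grid heights
bw = pw * np.exp(tw)    # ~0.75 grid widths
```

The sigmoid keeps the predicted center inside its grid cell, while the exponential scales the anchor by a strictly positive factor.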

After this step the red box coordinates are converted from:
[0, 0, 0.4, 0.4] to
[0.5, 1.5, 1.13, 0.75], meaning the center is (0.5 grid heights, 1.5 grid widths) from the top-left image corner and the box has size (1.13 grid heights, 0.75 grid widths).

Now that we have the coordinates of the predicted box on the grid scale along with its size, it is very easy to calculate the coordinates of its corners (the purple and green dots in figure 8). We do this step because it is standard to represent a box by its corners rather than by its center and width/height.

To get the coordinates of the green and purple dots:

green dot = boxXY - boxWH / 2
purple dot = boxXY + boxWH / 2

(Please note that the top-left corner has the smaller coordinates in images.)
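As a quick NumPy sketch of the corner computation, using the grid-scale center and size from the red-box example (note the (x, y)/(w, h) ordering chosen here; any consistent ordering works):

```python
import numpy as np

boxXY = np.array([1.5, 0.5])    # box center (x, y), grid scale
boxWH = np.array([0.75, 1.13])  # box size (w, h), grid scale

green_dot  = boxXY - boxWH / 2  # top-left corner (smaller coordinates)
purple_dot = boxXY + boxWH / 2  # bottom-right corner
print(green_dot, purple_dot)
```

A slightly negative corner coordinate just means the box sticks out past the image border, which is fine; it can be clipped later.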

After this, we multiply the coordinates by the grid stride (32, 32) so that the bounding boxes are in image (pixel) scale. If you instead divide by the grid size (13, 13), you get normalized coordinates where (0, 0) is the top-left corner, (1, 1) is the bottom-right corner, and (0.5, 0.5) is the image center.

Step 2 — Filter out low quality boxes

For every grid cell and every anchor box, Yolo predicts a bounding box. In our case, this means 13 * 13 * 5 boxes are predicted. As you can imagine, not all boxes are accurate. Some of them might be false positives (no object), and some of them are predicting the same object (too much overlap). To obtain the final result, we need to:

  1. Filter out boxes with low confidence (object score).
  2. Filter out boxes that overlap too much (two boxes with high IOU).
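A bare-bones sketch of those two filters in plain Python (the thresholds are illustrative; in practice you would reach for something like tf.image.non_max_suppression):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) corner format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def filter_boxes(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Object-score cut followed by greedy non-max suppression."""
    candidates = sorted(
        (bs for bs in zip(boxes, scores) if bs[1] >= score_thresh),
        key=lambda bs: bs[1], reverse=True)           # best boxes first
    kept = []
    for box, score in candidates:
        # Keep the box only if it doesn't overlap a better kept box too much.
        if all(iou(box, k) < iou_thresh for k, _ in kept):
            kept.append((box, score))
    return kept

boxes  = [(0, 0, 2, 2), (0.1, 0, 2, 2), (5, 5, 6, 6), (1, 1, 3, 3)]
scores = [0.9, 0.8, 0.7, 0.3]
print(filter_boxes(boxes, scores))  # box 2 is suppressed, box 4 scored too low
```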

That’s it! Those are the only scripts you need to decode the yolo output. Let’s check out the results:


Reading scripts can be very confusing, which is why I strongly recommend you check out the repo and run it on Google Colab or your local computer.

Thank you so much for reading! In future tutorials, I'm going to talk about loading training data and transfer learning!
