You Only Look Once (YOLO): The beginning of on-device computer vision


Xnor's object detection at the edge

YOLO Food Detection

What is YOLO?

Humans have the miraculous ability to glance at a scene and instantly know the type and position of the objects they see with astounding accuracy. Computer vision applications have used several methods to try to duplicate that feat.

Early methods used a multi-step process: scan an image to generate potential bounding boxes, then run a classifier on each proposed box. YOLO scans the image just once to detect which objects are present and where they are in the frame. For more background on this process go here.

YOLO Detection System

How does YOLO work?

The original concept for YOLO was conceived by Xnor founder Joe Redmon around 2014. Over the next two years the concept was refined at the University of Washington’s Paul G. Allen School of Computer Science & Engineering. This research article, published in 2016, summarizes the findings.

YOLO’s technique involves regressing bounding boxes directly from a single frame. It frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the entire detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
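To make the "one evaluation" idea concrete, here is a minimal sketch of how a single output vector could be decoded into scored boxes. It assumes the grid setup described in the 2016 paper for PASCAL VOC (a 7 x 7 grid, 2 boxes per cell, 20 classes), which this article does not state. The per-cell memory layout, the `decode` function, and the 0.2 threshold are hypothetical choices for illustration and do not mirror any particular implementation.

```python
from typing import List, Tuple

# Assumed values (2016 paper, PASCAL VOC setup); not stated in this article.
S, B, C = 7, 2, 20           # grid size, boxes per cell, number of classes
CELL_LEN = B * 5 + C         # each box carries x, y, w, h, confidence

# (class_id, score, center_x, center_y, width, height), all image-relative
Detection = Tuple[int, float, float, float, float, float]


def decode(output: List[float], threshold: float = 0.2) -> List[Detection]:
    """Turn one flat S*S*(B*5+C) prediction vector into scored boxes.

    Hypothetical layout per cell: B boxes first, then C class probabilities.
    """
    if len(output) != S * S * CELL_LEN:
        raise ValueError("unexpected prediction length")
    detections: List[Detection] = []
    for row in range(S):
        for col in range(S):
            base = (row * S + col) * CELL_LEN
            cell = output[base:base + CELL_LEN]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                # x and y are offsets inside the cell; convert them to
                # a box center relative to the whole image.
                cx, cy = (col + x) / S, (row + y) / S
                for class_id, p in enumerate(class_probs):
                    score = conf * p  # class-specific confidence
                    if score >= threshold:
                        detections.append((class_id, score, cx, cy, w, h))
    return detections


# Usage: a made-up prediction with one confident box in the center cell.
example = [0.0] * (S * S * CELL_LEN)
center = (3 * S + 3) * CELL_LEN
example[center:center + 5] = [0.5, 0.5, 0.4, 0.6, 0.9]  # x, y, w, h, confidence
example[center + B * 5 + 11] = 0.8                      # probability of class 11
dets = decode(example)
print(dets)
```

Everything happens in one pass over one vector: there is no separate proposal stage and no per-box classifier call. A real pipeline would normally follow this with non-maximum suppression to merge duplicate boxes.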

YOLO System Model

A speed-optimized architecture

Our unified architecture is extremely fast. Our base YOLO model processes images in real time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.
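Frame rates are easier to reason about as a per-frame time budget. The sketch below converts the two throughput figures above into milliseconds per frame, which works out to roughly 22 ms for the base model and about 6.5 ms for Fast YOLO. It assumes frames are processed one at a time with no batching or pipelining, so throughput and latency are simply reciprocals. The function name `frame_budget_ms` is made up for this example.

```python
def frame_budget_ms(fps: float) -> float:
    """Per-frame time budget in milliseconds for a given throughput.

    Assumes one frame is processed at a time (no batching or pipelining).
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


for name, fps in [("YOLO", 45), ("Fast YOLO", 155)]:
    print(f"{name}: {frame_budget_ms(fps):.1f} ms per frame")
```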

YOLO Architecture

Accurate detection model

Early versions of YOLO made some localization errors but were far less likely to predict a false detection where nothing existed. When generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset, YOLO outperformed other detection methods, including DPM and R-CNN, by a wide margin.
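Localization quality is commonly measured with intersection over union (IoU) between a predicted box and the ground-truth box. The sketch below is a generic IoU helper, not code from YOLO itself. The corner-based box format is an assumption for this example, and the 0.5 cutoff mentioned afterwards is the usual PASCAL VOC convention rather than something this article specifies.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap along each axis, clamped at zero when the boxes do not touch.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


truth = (0.0, 0.0, 2.0, 2.0)
shifted = (1.0, 0.0, 3.0, 2.0)   # right object, box shifted sideways
print(iou(truth, shifted))
```

In this made-up case the shifted box covers half of the object yet scores an IoU of only one third, so under a 0.5 cutoff it would count as a localization error even though the detector found the right object. A false detection on empty background, by contrast, has little or no overlap with any ground-truth box.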

Each iteration of YOLO improves results. YOLOv2 was state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 got 76.8 mAP on VOC 2007. At 40 FPS it got 78.6 mAP, outperforming methods like Faster R-CNN with ResNet and SSD, while still running significantly faster.

YOLO Detection Accuracy
YOLO Unified Real-Time Object Detection

Scaling to more object classes

YOLO9000 achieved new levels of efficiency by simultaneously training on the COCO detection dataset and the ImageNet classification dataset.

Through joint training, YOLO9000 could predict detections for object classes that lack labeled detection data. It got 19.7 mAP on the ImageNet detection validation set, despite having detection data for just 44 of the 200 classes. On the 156 classes that aren’t in COCO, it got 16.0 mAP. And YOLO9000 is not limited to those 200 classes: it can predict detections for more than 9000 different object categories, all in real time!
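The two mAP figures above also imply something about the 44 classes that do have detection data. As a rough consistency check, the sketch below backs out their average. It assumes mAP is the unweighted mean of per-class average precision and that the published figures are rounded. The result, about 32.8, is derived here and is not a number reported in the source. The helper name `implied_subset_mean` is hypothetical.

```python
def implied_subset_mean(overall: float, n_total: int,
                        other_mean: float, n_other: int) -> float:
    """Mean over the remaining items, given the overall mean and one subset's mean."""
    n_rest = n_total - n_other
    if n_rest <= 0:
        raise ValueError("subset must be smaller than the total")
    return (overall * n_total - other_mean * n_other) / n_rest


# 200 classes at 19.7 mAP overall; 156 of them (not in COCO) at 16.0 mAP.
seen_mean = implied_subset_mean(19.7, 200, 16.0, 156)
print(round(seen_mean, 1))
```

Under those assumptions, classes with labeled detection data score roughly twice as high as classes learned from classification labels alone, which gives a sense of how much the joint training transfers.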

The unorthodox thinking that created YOLO exemplifies Xnor’s approach to solving the technical barriers that keep AI from reaching its full potential. That means finding ways to make computing even more efficient on resource-constrained devices.

“Just as we did with YOLO, Xnor continues to explore new applications for our deep learning technology, in areas like computer vision as well as many other exciting fields.”

YOLO Diver