Technical Case Study
How to Make Your Cameras Smarter and Your Customers’ Homes Safer with Xnor.ai’s State-of-the-Art AI Technology
Delivering delightful and highly demanded AI-powered experiences should not mean re-engineering your products, upgrading hardware or exposing customers to privacy risks.
As consumers embrace the capabilities enabled by sensors and connected devices, the smart home market is heating up. Increasingly, what distinguishes one smart home solution provider from another is a meaningful, seamless user experience built on robust yet affordable hardware. And while artificial intelligence can bring never-before-seen features, it often comes with complexity, security, and cost trade-offs that consumers are not willing to bear. Nor should they, because there is another way -- on-device, or edge, AI.
At Xnor we bring AI onto devices to improve the user experience without adding cost or privacy concerns. By this we mean more frames per second, a smaller memory footprint, more efficient power usage, and state-of-the-art accuracy. This is what we set out to do with our latest partner, Wyze Labs, which sold millions of its $20 smart home cameras in just over a year. In July 2019, Wyze delivered powerful AI features to its customers through a simple firmware update -- no monthly subscription fees, cloud costs, or new hardware required.
The feedback has been overwhelmingly positive. “The firmware update with Xnor so far has blown me away,” commented one WyzeCam user via Twitter. And the reaction we have seen from smart home providers is one of disbelief.
How do you add AI capabilities such as person detection, object detection, and face recognition to consumer-grade devices with minimal compute and memory resources? The answer lies in Xnor’s ML optimization process, designed to extract the most value from existing hardware, software, and data:
Step 1: Understanding the hardware powering your devices
We start by understanding the hardware parameters of your existing devices. Xnor works with several different hardware platforms based on Intel, ARM, MIPS, and others.
When we work on a hardware platform, the processor architecture is only one dimension we examine. For popular platforms such as the Raspberry Pi 3, where there is a large community-backed effort, we benefit from a well-understood, stable operating system environment and cross-compilation toolchain. For more esoteric platforms, we work with our customers to bridge our inference engine and models onto the platform. Usually this means spending a day or two understanding the manufacturer's Software Development Kit (SDK) and matching our compilation toolchains to the IoT fabric. This lets us use every computational resource a platform can offer. As a result, our partners receive a solution that exceeds Xnor's internal engineering bar.
Step 2: Making it faster with SIMD acceleration
Once we have confirmed the viability of a Minimum Viable Product (MVP), we optimize performance for a production setting. For these algorithms to be useful, they must meet performance targets set by our customers. This is especially difficult considering that other processes, such as image capture, image drawing, and video encoding, contend for precious CPU cycles. We are also competing with the services and processes that make up our customer’s application logic. We have to design a system that can peacefully coexist with our customer’s code while squeezing the most out of the hardware.
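Xnor's SIMD kernels are not public, but the underlying idea -- processing many data elements per instruction instead of one at a time -- can be sketched in plain Python using NumPy, whose vectorized operations dispatch to SIMD-accelerated native loops. This is an analogy for the technique, not Xnor's implementation:

```python
import numpy as np

def scalar_dot(a, b):
    """Baseline: one multiply-accumulate per loop iteration."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def vectorized_dot(a, b):
    """NumPy's dot dispatches to native, SIMD-capable code paths,
    processing multiple float32 lanes per instruction."""
    return float(np.dot(a, b))

a = np.arange(1024, dtype=np.float32)
b = np.ones(1024, dtype=np.float32)

# Both compute the same result; the vectorized path is what lets
# inference workloads meet frame-rate targets on constrained CPUs.
assert abs(scalar_dot(a, b) - vectorized_dot(a, b)) < 1e-3
```

On embedded ARM or MIPS cores, the same principle applies via NEON or MSA intrinsics: the inner loops of convolution and matrix multiplication are rewritten to consume several pixels or weights per cycle.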
Step 3: Training data
Based on the business or customer problem we are trying to solve, we identify a starting model from our set of pretrained models. This is the starting point for evaluating accuracy relative to the use case and typically is pre-trained from a publicly available dataset. To specialize for a customer’s use case and maximize the accuracy of the model, we work with the customer to acquire a customer-collected dataset.
Step 4: Training the model
Once we have a significant amount of labeled user video data, we can train a model. This process consists of teaching the model by providing positive and negative examples in the context of the use case. Each model takes about 7-10 days to fully train.
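The training loop itself can be illustrated at toy scale. The sketch below trains a tiny logistic-regression classifier on synthetic positive and negative examples; it is a minimal stand-in for the real process (which uses deep networks and labeled video frames), shown only to make the positive/negative supervision concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for labeled data: positives and negatives as 2-D points.
pos = rng.normal(loc=2.0, size=(100, 2))   # e.g. frames with a person
neg = rng.normal(loc=-2.0, size=(100, 2))  # e.g. frames without
X = np.vstack([pos, neg])
y = np.concatenate([np.ones(100), np.zeros(100)])

# Gradient descent on the logistic loss.
w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = np.mean(preds == y)
```

In production the same idea scales up: show the model labeled examples of what it should and should not fire on, and iterate until the loss stops improving.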
In every customer engagement, we apply a variety of our specialized training techniques that Xnor engineers and researchers have developed and published in peer reviewed academic venues.
Step 5: Testing and Quality Assurance
To understand the accuracy of the model we validate performance on a subset of the data that has purposefully been excluded from training. Accuracy on the validation set is a good proxy for how the solution will behave in production. Great validation sets capture the diversity and variability of the sample population, e.g., across various scenarios, viewpoints, and lighting conditions.
For every customer engagement, we create a small validation set drawn from data for the use case. If you’re interested in detecting people in a residential setting, we create a validation set just for that scenario. If you’re deploying at traffic intersections, we create one for that. For every model we train, we compare performance by calculating accuracy metrics appropriate to the task.
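For a detection task, the metrics of interest are typically precision (how many alerts were real) and recall (how many real events were caught), computed on a split that was never seen in training. The sketch below uses simulated detector scores purely to show the mechanics; the variable names and threshold are illustrative, not Xnor's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
labels = rng.integers(0, 2, size=n)          # ground truth: person / no person
# Simulated detector scores that correlate with the labels.
scores = labels * 0.3 + rng.random(n) * 0.7

# Hold out 20% of the data; these samples never touch training.
idx = rng.permutation(n)
val = idx[: n // 5]
preds = scores[val] > 0.5                    # illustrative decision threshold

tp = np.sum(preds & (labels[val] == 1))      # true alerts
fp = np.sum(preds & (labels[val] == 0))      # false alerts
fn = np.sum(~preds & (labels[val] == 1))     # missed events
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Sweeping the threshold trades precision against recall, which is how a product team tunes between "too many false notifications" and "missed events."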
Step 6: Shipping the model
Once a model is trained, Xnor packages it into a module known as an Xnor Bundle (XB), which contains both the model and the inference engine in a single library. This no-fuss, no-hassle approach simplifies the workflow for integrating edge AI algorithms onto arbitrary edge devices.
Working with an XB is easy. A developer selects a language binding -- e.g., C, Python, or Java -- and links a prebuilt, optimized library for their edge platform. The developer can then run inference by feeding image data to a stable, well-defined API. Depending on the model, the API can provide a variety of outputs such as string labels for image classification, bounding boxes for object detection, and segmentation masks for semantic segmentation.
For every customer, we provide two XBs: one built for the customer’s IoT platform and another for rapid prototyping and testing on a traditional x86_64 platform. With a few lines of code, the customer can integrate Xnor's AI algorithms into their production systems. Model updates are even easier: for every new model we ship, all the customer has to do is overwrite the old XB with the new one. No code changes necessary.
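The actual XB bindings are not reproduced here, so the sketch below uses hypothetical placeholder names (`DetectionModel`, `BoundingBox`, `evaluate`) with a stubbed result, only to show the shape of the "feed a frame, act on detections" workflow described above -- it is not the real Xnor API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BoundingBox:
    """Illustrative detection result: a labeled box in normalized coords."""
    label: str
    x: float
    y: float
    width: float
    height: float

class DetectionModel:
    """Hypothetical stand-in for a prebuilt model + inference engine
    shipped as a single XB library."""
    def evaluate(self, rgb_bytes: bytes, width: int, height: int) -> List[BoundingBox]:
        # A real XB would run on-device inference here; this stub
        # returns a fixed detection so the calling code's shape is clear.
        return [BoundingBox("person", 0.25, 0.1, 0.5, 0.8)]

# Application code: feed a frame, act on what the model found.
model = DetectionModel()
frame = bytes(640 * 480 * 3)  # placeholder RGB frame
for box in model.evaluate(frame, 640, 480):
    if box.label == "person":
        print("person detected at", box.x, box.y)
```

Because the model and engine live behind one call boundary, swapping in an updated XB changes what `evaluate` detects without touching the application code around it.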
Step 7: Refining the user experience
Introducing AI capabilities into a product requires careful thought and consideration. AI is a powerful technology, but it needs to be focused on solving user problems to be effective. Working with our partners to incorporate user feedback and create a powerful, user-centered experience is key to the success of any AI implementation. In the case of Wyze, Xnor’s product and design team collaborated closely with the customer on a simple implementation: improving user notifications and letting users organize video clips that contain people. Together we ensured that a complex technology did not create a complex user experience.
From home automation to security, better communication to improved energy management -- we are just scratching the surface of the smart home scenarios edge AI can enable. But while the technology is complex and the engineering process rigorous, collaboration with our customers is streamlined so they can focus on what they know best: delivering unique, delightful, valuable experiences to the consumer.
Arlo, Nest, Ring, Wyze: Solid False Alert Resistance
Over a total of several weeks of testing, we found that Arlo, Nest, Ring, and Wyze rejected most false alerts in our test scenes. For example, all four consistently rejected alerts from blowing foliage and the shadows it created, which frequently triggered even many commercial analytics systems in our tests. These models also rejected false alerts on animals, as well as more common indoor issues such as shadows cast by subjects walking through adjacent rooms.