What is Computer Vision?

David Ruddock

Computer vision is quite literally what it sounds like: making computers see. It may seem like science fiction, but in reality, computer vision has a very long history, going back to the early days of modern computing itself in the 1960s. There is no one technical definition of computer vision; it is used broadly to describe the technologies, techniques, and use cases for the interpretation of visual information by a computer. That can be something as relatively straightforward as OCR (optical character recognition) or as advanced as object recognition using machine learning in real time with multidimensional (color, light, depth) input.

What are Examples of Computer Vision in the Real World?

The most common type of computer vision in use today is optical character recognition — though the technology has evolved considerably since the early days. Using OCR, a computer takes an image file and, applying a computer vision algorithm, determines where known text characters occur inside that image. If you’ve ever used an online service that takes a picture of a check, an ID, or another document, you’ve interacted with an OCR system. Even hardware-enabled OCR exists, in the form of some document scanners that “convert” paper documents into editable digital files. OCR is one of the most basic forms of computer vision, even if its applications have become increasingly sophisticated.
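
To make this concrete, here is a minimal OCR sketch in Python using the open-source Tesseract engine via the pytesseract wrapper — one of many possible OCR tools, chosen purely for illustration. The file name is a placeholder, and the Tesseract binary must be installed separately.

```python
# Minimal OCR sketch (illustrative assumptions: pytesseract is installed and
# the Tesseract binary is available; "scanned_check.png" is a placeholder).
from PIL import Image
import pytesseract

# Load a scanned document or photo containing text.
image = Image.open("scanned_check.png")

# Tesseract locates text regions in the image and returns the recognized characters.
text = pytesseract.image_to_string(image)
print(text)
```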

More complex examples of computer vision exist in fields like retail, medicine, consumer electronics, and manufacturing. The growing industry of retail floor robots that scan for spills or check for incorrectly placed products is built almost entirely on computer vision. Using large arrays of cameras, LIDAR, and other sensors, these robots run computer vision algorithms to detect a spill or obstacle in a store, or to determine if an item on a shelf isn’t where it should be.

In manufacturing, computer vision is being used to evaluate products for quality and consistency as they come off the assembly line. In medicine, computer vision is being explored extensively for use in diagnostic medical imaging to detect abnormalities or disease. Consumers are becoming more familiar with computer vision through their smartphone cameras, which can now identify dog breeds or plant and insect species, and look up products based on photos of objects in the real world.

In the digital world, computer vision can be used to add visual constraints to an online retailer’s search engine. For example, if you want to search only for a white hooded sweatshirt, the retailer would need to have manually classified all hooded sweatshirts as such and correctly classified them by color — and to do so every time a new product is added to the store. For a specialty retailer, maybe this is scalable, but imagine you’re a marketplace retailer like Etsy or Amazon. To manually classify all products in this way would be virtually impossible. But with a computer vision algorithm, you could automate this process by training the algorithm to understand what a hooded sweatshirt looks like and what color it is, along with thousands of other apparel types.
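
As a rough illustration of how that automation might look, the sketch below scores a product photo against a handful of candidate labels using zero-shot image classification with a CLIP model via the Hugging Face transformers pipeline. This is just one possible approach, not how any particular retailer implements it; the model choice, label list, and file name are illustrative assumptions.

```python
# Hedged sketch of automated product tagging via zero-shot image classification.
# Assumptions: transformers (with a vision backend) is installed, and
# "product_photo.jpg" stands in for a real catalog image.
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="openai/clip-vit-base-patch32",
)

# Candidate tags the retailer might want to assign automatically.
candidate_labels = [
    "white hooded sweatshirt",
    "black hooded sweatshirt",
    "white t-shirt",
    "blue jeans",
]

# Score each candidate label against the product photo; the top-scoring label
# becomes the automatically assigned tag.
results = classifier("product_photo.jpg", candidate_labels=candidate_labels)
print(results[0]["label"], round(results[0]["score"], 3))
```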

How Does Computer Vision Work?

The exact method a given computer vision use case employs can vary, but most rely on object detection and classification algorithms. Not unlike large language models (LLMs), these computer vision algorithms are trained on immense data sets like Microsoft COCO or ImageNet so that they can correctly detect and identify (label, classify) objects in images, videos, or 3D media. Computer vision algorithms are usually specialized for a particular use case rather than generalist, though general-purpose tools like Google Lens do exist.
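
Here is a minimal sketch of that idea: running an off-the-shelf detection model pretrained on the COCO dataset (torchvision’s Faster R-CNN) over a single image and printing the detected object classes. The image path and the 0.8 confidence threshold are arbitrary choices for illustration.

```python
# Object detection with a COCO-pretrained model (a sketch, assuming torchvision
# >= 0.13 and a local image file named "street_scene.jpg").
import torch
from torchvision.io import read_image
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()

# Preprocess the image the same way the model was trained.
preprocess = weights.transforms()
image = read_image("street_scene.jpg")
batch = [preprocess(image)]

# Inference: each prediction contains boxes, class labels, and confidence scores.
with torch.no_grad():
    prediction = model(batch)[0]

# Map numeric class IDs back to human-readable COCO category names.
labels = [weights.meta["categories"][i] for i in prediction["labels"]]
for label, score in zip(labels, prediction["scores"]):
    if score > 0.8:
        print(label, float(score))
```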

On a more technical level, a computer vision system does not truly “see” the world in the way a human being does. Unsurprisingly, math is the underlying foundation here — lots and lots of math, in the form of data points arranged in matrices. A computer vision system starts by normalizing all the visual data to be analyzed in a step known as preprocessing. Then, it analyzes the normalized image data and breaks it down into vector data through a process called feature extraction (this data is typically stored in a vector database). That vector data is then run through the computer vision algorithm to generate output, sometimes called the decision or inference stage — this is where the “vision,” as we understand it, happens. Finally, the data is output to a labeling tool where it is represented as bounding boxes or polygons with labels (classifications) applied to a given image.
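
The sketch below roughly illustrates the first two of those stages — preprocessing and feature extraction — using a pretrained ResNet from torchvision as the feature extractor. Storing the resulting vector in a vector database and the downstream inference step are omitted, and the file name is a placeholder.

```python
# Preprocessing + feature extraction sketch (illustrative assumptions:
# torchvision >= 0.13, and "input.jpg" is a placeholder image).
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.eval()

# Preprocessing: resize, crop, and normalize the image into a regular tensor.
preprocess = weights.transforms()
image = preprocess(Image.open("input.jpg")).unsqueeze(0)

# Feature extraction: take the activations just before the classification head,
# which yields a fixed-length vector describing the image.
extractor = torch.nn.Sequential(*list(model.children())[:-1])
with torch.no_grad():
    features = extractor(image).flatten(1)  # shape: (1, 2048)

print(features.shape)
```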

In practice, “bounding” objects in an image (in a bounding box) — that is, defining where in a given image a detected object exists — is how computer vision is most often visualized. The labeling (also known as classification) process could tag that object as a person, product, animal, vehicle, or anything else you can think of, really. Imagine you’re looking at an image of a major traffic intersection, and around each vehicle is a descriptive box: “white sedan,” “gray SUV,” “red truck,” “blue motorcycle,” and so on. That gives you a sense of what computer vision “sees” in practice. Specific distinctions between concepts like image segmentation as opposed to object detection are also important for those working with and developing these tools to understand, but are beyond the scope of this article.
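
To show what that visualization step can look like in code, the snippet below draws labeled bounding boxes on an image with torchvision’s drawing utility. The boxes and labels are hard-coded stand-ins for real detector output, and the file names are assumptions.

```python
# Drawing labeled bounding boxes (a sketch; "intersection.jpg" is a placeholder
# traffic-camera frame, and the box coordinates are made up for illustration).
import torch
from torchvision.io import read_image
from torchvision.utils import draw_bounding_boxes
from torchvision.transforms.functional import to_pil_image

image = read_image("intersection.jpg")

# Example detections in (x_min, y_min, x_max, y_max) pixel coordinates.
boxes = torch.tensor([[40, 60, 220, 180], [300, 80, 480, 210]])
labels = ["white sedan", "red truck"]

# Overlay the boxes and labels on the image and save the annotated result.
annotated = draw_bounding_boxes(image, boxes, labels=labels, width=3)
to_pil_image(annotated).save("intersection_annotated.jpg")
```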

Does Computer Vision Happen in the Cloud, the Edge, or On-Device?

Computer vision can be implemented on-device at the hardware level, at the edge, or in the cloud. Some use cases, like retail robots, may require on-device algorithms to meaningfully function, since they must report hazards like spills in near real time. Other use cases, like identifying the faces of actors in films to display contextual content on an on-demand video service, are best done in the cloud, given there is no real-time element involved. Other scenarios sit in the middle and demand edge implementations — using a smartphone to identify a historic building probably requires a model too demanding to run on-device, but one that must also respond quickly enough for the feature to be useful.

The accuracy and performance of a given computer vision model depend on a number of factors, including the relevance of the image data it was trained on, the amount of manual correction that has been used in re-training the model, the difficulty of the detection and classification requirements, and whether the computer vision happens on-device, in the cloud, or at the edge. If you need a computer vision algorithm to identify which specific product SKU matches what is in someone’s hand using a live video feed, with a delay of no more than a minute, that’s going to be a far more difficult task than, say, detecting whether someone is smiling, or whether a traffic light is green or red. The former use case requires tremendous cloud computing power and, likely, a lot of human intervention to correct misclassified or unclassified objects. The latter may work almost perfectly as a relatively simple on-device model.

What is the Future of Computer Vision?

Computer vision is a massive field, and one that is constantly evolving. With the rise of LLMs, computer vision is only becoming more important, but it hardly needs LLMs to remain relevant. Given the massive potential for computer vision in manufacturing, medicine, scientific study, retail, web services, law enforcement, public infrastructure, and engineering — all fields where the technology has already proven highly valuable — we will continue to see advancements in the space for years to come.

The edge is likely to become increasingly relevant to computer vision going forward, too. Many computer vision models are simply too large or require too much ongoing training to run on end devices, but they still need to guarantee some level of performance (latency, responsiveness) to be useful to customers. Hybrid models may also hold promise, where some basic level of on-device computer vision occurs before data is sent to the edge or cloud in an optimized format for more rapid ingestion and analysis.

Perhaps the most promising (and at times overhyped) area for computer vision to date has been vehicular autonomy. Teaching a car to drive itself, using on-device algorithms in real time to interpret road conditions, traffic, signage, and the behavior of other drivers, is arguably the most ambitious use case for the technology yet, and one that has attracted incredible sums of investment. A good example of computer vision already deployed broadly in this space is the EU’s Intelligent Speed Assistance (ISA) regulation, which requires automakers to equip new cars with computer vision systems capable of detecting road speed limits. While fully self-driving cars may not be here just yet, continued iteration on autonomous and assisted driving powered by computer vision over the next 5-10 years seems assured.

David Ruddock
David's tech experience runs deep. His tech agnostic approach and general love for technology fueled the 14 years he spent as a technology journalist, where David worked with major brands like Google, Samsung, Qualcomm, NVIDIA, Verizon, and Amazon, reviewed hundreds of products, and broke dozens of exclusive stories. Now he lends that same passion and expertise to Esper's marketing team.