Deploying and maintaining software across thousands of devices is challenging. Countless hardware factors and situational realities can cause outages, bugs, and reliability issues. And as rapid release cycles become more commonplace, fleet operators must continually address the repercussions of new updates.
To help mitigate software errors, observability has emerged as an important facet of modern DevOps strategy. Observability extends traditional performance monitoring to offer deeper context into software issues. As of 2021, 61% of DevOps teams were practicing observability.
If applied to fleet management, observability could be a big win, helping pinpoint hardware and OS problems, address outages, patch specific clients, and more. Below, we'll define observability in a general sense. We'll also consider how to apply it to device fleet management, outlining some potential cases where observability could accelerate DevOps.
What is Observability?
You can think of observability as the next evolution of application performance monitoring (APM). It provides more active telemetry into application usage patterns to identify unknown issues. The goal of observability is thus to provide deeper context and reveal the root causes of problems. By correlating a broad range of datasets, operators such as quality assurance (QA) engineers or site reliability engineers (SREs) can better gauge a system's health, refine the software, and remediate problems in real time.
Observability typically encompasses three key areas: metrics, logs, and tracing.
- Metrics: A metric is a value produced by a software system. This includes typical application performance metrics such as response time, error rate, availability, latency, CPU usage, user satisfaction, and others.
- Logs: Logging refers to the process of storing historical application usage reports. This data is typically outputted in response to an event within the application. Logs usually display a user identifier, timestamp, an action, and the resource being acted upon. For example, a gateway will record a log when a user authenticates.
- Traces: Tracing is the more nuanced factor fueling modern observability. A trace shows the activity of a single request throughout an application flow. Traces help unite logs and identify what type of metrics are helpful to track and at what point.
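To make the relationship between the three signals concrete, here is a minimal sketch in Python. The record shapes, field names, device IDs, and the shared trace_id are all hypothetical, chosen for illustration rather than taken from any particular observability tool. The idea it shows is the one described above: a trace identifier ties metrics and logs to a single request.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Metric:
    trace_id: str
    device_id: str
    name: str
    value: float


@dataclass
class LogEntry:
    trace_id: str
    device_id: str
    timestamp: str
    user: str
    action: str
    resource: str


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float


def correlate(metrics, logs, spans):
    """Group all three signal types by the trace they belong to."""
    grouped = defaultdict(lambda: {"metrics": [], "logs": [], "spans": []})
    for metric in metrics:
        grouped[metric.trace_id]["metrics"].append(metric)
    for entry in logs:
        grouped[entry.trace_id]["logs"].append(entry)
    for span in spans:
        grouped[span.trace_id]["spans"].append(span)
    return dict(grouped)


# Hypothetical sample data for one slow request and one healthy request.
metrics = [Metric("t-1", "pos-042", "response_time_ms", 840.0)]
logs = [
    LogEntry("t-1", "pos-042", "2021-11-03T12:00:00Z",
             "cashier-7", "authenticate", "gateway"),
]
spans = [
    Span("t-1", "gateway.auth", 120.0),
    Span("t-1", "menu.load", 700.0),
    Span("t-2", "menu.load", 95.0),
]

by_trace = correlate(metrics, logs, spans)
# Within the slow request, find the step that took the longest.
slowest = max(by_trace["t-1"]["spans"], key=lambda s: s.duration_ms)
print(slowest.name)  # prints "menu.load"
```

Grouping by trace is only one possible correlation key. A fleet might just as reasonably group by device ID or software version, depending on the question being asked.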
Monitoring these data points in concert is becoming necessary to unify telemetry for popular cloud-native architectures, like microservices, service mesh, and serverless. Because these ecosystems are highly distributed, with communication occurring via API calls, operators need a better grasp of the correlations between components to establish a clearer picture of how applications talk to each other and of the overall user journey.
The Benefits of Observability
Most software, by design, produces a ton of useful usage data. But if this data isn't visible to operations, it's useless. This is one big benefit of the observability trend, as increased visibility into a software ecosystem has been shown to help identify bottlenecks and increase development fluidity. According to The State of Observability 2021, leaders in observability tend to have 2.9 times better visibility into application performance and deliver products 60% faster.
Decreasing mean time to recovery (MTTR) is a guiding light for most SREs. Yet software operators don't always have granular insight into how one change might affect other pieces of the broader ecosystem. Incorporating observability can help surface these unforeseen dependencies. This can help identify root causes, remediate bugs, and decrease incident response times.
"Using observability, SREs can achieve a number of objectives in their day-to-day work, like identifying the root cause of production issues, faster resolution of issues and striving for self-healing infrastructure setup with no-code," says Sushant Mehta, senior manager, application development, Diyar United Company.
Better visibility into production environments can also inform A/B testing, enabling developers to refine progressive software delivery. By maximizing performance for new features, engineers can also sustain a higher-quality user experience, which can have tangible business outcomes. An emphasis on observability pays off: a Forrester study on the use of one observability tool found an ROI of 296% and a net present value of $4.43 million over three years. Ensuring consistent availability may also be necessary to meet service-level agreements (SLAs).
Fleet Management Concerns
So how does observability tie into fleet management? Well, if you've ever worked in fleet management, you'll know that delivering software to thousands of devices in the field is fraught with complications. It can be a slow, tedious process.
First, consider the hardware. Your fleet is likely made of similar devices. But over time, new models have been added into the mix. A fleet may comprise different models of smartphones, tablets, point of sale systems, PCs, Internet of Things (IoT) devices, or proprietary equipment embedded within larger machines. These each come with their own unique computing limitations and interfaces for human interaction. Supporting more than one type of device likely means delivering different software versions and supporting legacy models, too.
Next, hardware may run different operating systems, like Linux (Ubuntu, for example), Android, Windows, or iOS. Devices could be running different OS versions, causing drift. For example, perhaps a gym machine manufacturer pushes an update to its fleet, but it causes bugs on equipment that has moved to an experimental OS version. Fleet managers will need to ensure compatibility as such patches occur.
Another consideration is the network. The days of manual software installation in the field are long gone. It's now standard practice to issue updates to hardware simultaneously over the air. However, constant connectivity isn't always guaranteed with large fleets. Devices go offline or have patchy Wi-Fi, hindering continuous upload and download. Thus, hardware in the field may remain outdated for some time. For example, vehicle fleets may be without service for hours or days when traveling through remote locations.
Lastly, setting up an automated deployment pipeline for massive fleets is a whole other problem to solve. Solutions on the market help operators maintain a central, unified command center for all devices. But without the right monitoring and deep visibility into production runtimes, it becomes harder to respond to many of the issues outlined above.
Applying Observability to Fleet Management
As we can see above, a great many factors could cause outages, incompatibilities, and poor user experiences when managing software updates for hardware at scale. So, how can operators respond to stability and reliability concerns? One answer is observability.
By applying the modern DevOps principle of observability to fleet management, engineers could unlock cues to remediate errors. Correlating metrics and logs with traces could help discover bugs and latencies, exposing the root cause behind them. To see how observability could aid device management at scale, let's consider some hypothetical scenarios.
Example #1: Restaurant POS System
A national fast-food restaurant chain issues a patch to its Point of Sale (POS) software to include seasonal menu items. However, the update causes bugs in a portion of the systems. Unknown to fleet managers, restaurant managers turned off certain stations to save on electricity during the off-season. Now that they're back online, they fail. This is inhibiting employees from jumping on more stations to respond to swells in foot traffic. With observability, fleet managers spot the non-functional systems and see that these devices never received an OS update that the new patch relies on. They quickly issue the required OS updates to these select systems and then issue the software patch.
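The check in this scenario could be sketched roughly as follows, in Python. The device records, the os_version field, and the required version "12.5" are hypothetical. A real fleet tool would expose its own inventory API, and version strings that aren't purely dotted numbers would need a more forgiving parser.

```python
def parse_version(text):
    """Turn '12.10.1' into (12, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def plan_rollout(devices, required_os):
    """Split devices into those ready for the patch and those needing an OS update first."""
    required = parse_version(required_os)
    ready, needs_os_update = [], []
    for device in devices:
        if parse_version(device["os_version"]) >= required:
            ready.append(device["id"])
        else:
            needs_os_update.append(device["id"])
    return ready, needs_os_update


# Hypothetical inventory: pos-003 sat powered off and missed an OS update.
devices = [
    {"id": "pos-001", "os_version": "12.5.0"},
    {"id": "pos-002", "os_version": "12.10.1"},
    {"id": "pos-003", "os_version": "12.4.0"},
]

ready, needs_os_update = plan_rollout(devices, required_os="12.5")
print(ready)            # prints ['pos-001', 'pos-002']
print(needs_os_update)  # prints ['pos-003']
```

Comparing parsed tuples rather than raw strings matters here: as plain text, "12.10.1" sorts before "12.5", which would wrongly flag an up-to-date device.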
Example #2: Factory Floor
A shipping company oversees a fleet of forklift-mounted tablets that assist workers on the factory floor. The on-device fulfillment software includes a spreadsheet that forklift operators rely on to find and store their inventory. However, on a select number of tablets, the data is taking far too long to load: the fields are completely blank for 10 seconds before finally populating. This is hurting productivity and causing workers to avoid using these "dumb" forklifts. With the right observability in place, operators see that the devices with high latency are set to use a default cloud database storage location instead of a geographically optimized one. They issue a simple patch that resolves the latency issues.
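One way operators might arrive at that diagnosis is by grouping a latency metric by device configuration and comparing the groups. The following Python sketch assumes hypothetical telemetry reports with load_ms and db_region fields. The numbers are invented to mirror the scenario, not measured.

```python
from collections import defaultdict
from statistics import median


def median_latency_by(reports, key):
    """Return the median load time in milliseconds for each value of a config key."""
    groups = defaultdict(list)
    for report in reports:
        groups[report[key]].append(report["load_ms"])
    return {value: median(samples) for value, samples in groups.items()}


# Hypothetical telemetry from six tablets.
reports = [
    {"device": "tab-01", "db_region": "default", "load_ms": 10200},
    {"device": "tab-02", "db_region": "default", "load_ms": 9800},
    {"device": "tab-03", "db_region": "default", "load_ms": 10050},
    {"device": "tab-04", "db_region": "regional", "load_ms": 310},
    {"device": "tab-05", "db_region": "regional", "load_ms": 290},
    {"device": "tab-06", "db_region": "regional", "load_ms": 350},
]

by_region = median_latency_by(reports, "db_region")
print(by_region)  # prints {'default': 10050, 'regional': 310}
```

The same function could be pointed at other keys, such as OS version or tablet model, to rule out competing explanations before issuing a patch.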
Example #3: Car Fleet
A sizable automotive fleet operator needs to issue a software update to in-vehicle dashboards. The update enables voice recognition to help drivers navigate hands-free. In doing so, the update installs an AI library to process voice recognition locally, as Internet connectivity may not always be possible. Fleet managers roll out the new functionality iteratively using A/B testing. With good observability, they are able to tune the feature for different dashboard models. Unfortunately, they discover that some dashboards lack the storage necessary to hold the AI library. Thus, they program in a notice to let drivers know this feature requires remote connectivity.
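The storage decision in this scenario could be reduced to a simple pre-install check, sketched below in Python. The function name, the mode labels, the library size, and the 256 MB headroom are all assumptions made for illustration.

```python
def assign_voice_mode(free_mb, library_mb, headroom_mb=256):
    """Pick on-device voice processing only if the library fits with room to spare."""
    if free_mb >= library_mb + headroom_mb:
        return "on_device"
    # Not enough space: fall back to cloud processing and notify the driver.
    return "cloud_only"


# Hypothetical dashboards reporting free storage in megabytes.
print(assign_voice_mode(free_mb=2048, library_mb=900))  # prints "on_device"
print(assign_voice_mode(free_mb=1000, library_mb=900))  # prints "cloud_only"
```

Leaving headroom rather than filling the disk exactly is a design choice here, intended to keep space for logs and future updates.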
How to Enable Observability For Hardware Fleets
Nowadays, the need to innovate quickly is driving shorter, more iterative release cycles, and fleet management solutions are catching up to this trend, enabling more of a DevOps approach. So far, observability has been largely confined to the realm of cloud-native technology, but these tenets could arguably be just as beneficial, if not more so, when applied to mass device management.
But how exactly would operators accomplish this? Enabling observability for fleet management will require continuous monitoring of clients, tooling to identify bugs, and dashboards for real-time reporting. This would likely require a unified device management center that provides the means to quickly diagnose issues and react.
In our hypothetical examples, we saw how observability correlates issues to causes. This could help identify OS version drift, external dependency problems, or hardware incompatibilities. Yet what's critical to keep in mind is that the end goal is to avoid such incidents entirely. With the right observability in place, engineers could anticipate hangups before issues become widespread in the field. This deeper understanding could help engineers better operate their fleets and ultimately lead to increased reliability and improved end-user satisfaction.