Solving for Innovation

Public cloud and its built-in technologies have become a foundational pillar of application and operations modernization efforts. These efforts are designed to optimize business outcomes around speed, efficiencies, and effectiveness for business agility and innovation. Most CEOs realize that their technology architecture is in fact their business architecture and are increasing the use of modern technologies that include public cloud platforms and services to deliver a better end user experience.

Management solutions (and related management capabilities) for the monitoring and observability of public cloud platforms and cloud-born services are critical operational requirements for these modern technologies.

A recent survey we conducted for Google Cloud suggests that the tools built by the cloud providers themselves should be considered a preferred starting place for operations and management teams working in the cloud. Respondents indicated that these built-in tools, used to collect performance, logging, and tracing data and metrics, can deliver better visibility into the applications and supporting infrastructure that power high-performing digital services.

These data and tools enable specialized teams in IT organizations, such as DevOps and SRE, to increase their understanding of critical performance data, and enhance their ability to identify, resolve, predict, and potentially auto-remediate service problems before they impact the end user experience. These tools from cloud platform providers and the data that power them can accelerate the evolution of IT organizations, even in the face of flat or decreased budgets.

In This White Paper

This IDC White Paper discusses results of a recent IDC survey, sponsored by Google Cloud, and goes into depth on the key business drivers and value for why stakeholders are adopting built-in monitoring and observability solutions. It provides actionable advice and thought leadership on how organizations are taking advantage of these solutions for the public cloud to expand, scale, and accelerate their DevOps, Site Reliability Engineering (SRE), observability, and modern operational practices and programs. It provides a value blueprint for stakeholders to consider adoption of built-in monitoring and observability solutions for their public cloud investments. The survey data shows that the top 4 leadership teams that influence decision making on the performance monitoring of their public cloud are the centralized IT operations team at 76%, cloud center of excellence at 53%, DevOps team at 53%, and SREs at 40%.

Key findings show significant interest in adopting built-in monitoring and observability cloud platform solutions, along with budgetary considerations:

69% of respondents trust that cloud providers can build better tools to manage their own clouds.
For those respondents who are already using cloud-born services, 82% are very satisfied or somewhat satisfied with their built-in public cloud monitoring and management tools.
56% of respondents expect their spending on tools for managing and monitoring the public cloud to remain the same or decrease over the next two years.
Of those planning to decrease their spending on tools to manage and monitor the public cloud, 83% will consolidate tools or move to a monthly subscription model.

The benefits and improvements from using built-in public cloud monitoring and management tools include elasticity/agility, customer (end user) experience, organizational maturity, and improved security.

Situation Overview

End users have low tolerance for poorly performing digital services, so public cloud and cloud-born technologies require high levels of system reliability, availability, and performance. Both the end user’s experience and a company’s brand reputation demand it. Otherwise, customers are quick to move to competitive options and form a negative perception of the product or business. To obtain and maintain high levels of system reliability and performance, IT operations, SREs, DevOps, and development teams have adopted efficient and effective built-in observability and management solutions to maintain and deliver a great end user experience.

The importance and value of built-in management and observability tools cannot be overstated. Most of the organizations surveyed agreed: 82% of respondents are very or somewhat satisfied with their built-in public cloud tools. Why? They cited troubleshooting for service management, performance data collection and insights, cost management, and security management capabilities as their top 4 reasons. Clearly, tool adopters are getting value across a myriad of capabilities for effective and efficient monitoring and observability of their cloud-born platforms.

Organizations also realize that there is a cost to downtime and poorly performing cloud-born services. Built-in service visibility, data collection, smart dashboarding, topology mapping, and alerting are necessities because problem identification and resolution have never been more important or valuable. For effective problem identification, resolution, prevention, and prediction as well as the ability to mature from a reactive to a proactive operational posture, IT organizations must consider the adoption of built-in management and observability solutions.

These solutions provide an important set of data (and an early warning system) for providing service transparency and performance across the end-to-end cloud environment, including the infrastructure and application stacks. Without this data, IT organizations are running blind, with an incomplete view of service performance. In addition, built-in technologies offer unique advantages that IT organizations can leverage using automation, security, data, and analytics. These capabilities empower development, SREs, platform engineers, and operational teams with a modern approach to practices that establish and accelerate transformations in their organizations.

The use of built-in management and observability solutions plays a central role in accelerating people, processes, technology, and cultural maturity. Many organizations have experienced challenges empowering SRE, DevOps, operations, and observability practices and teams using existing tools as they adopt and scale their use of cloud.

Challenges in the usage of existing tools include the following:

Tool sprawl and the need for performance data: Complex systems and multiple tools make critical metrics hard to find; the DevOps, operational, observability, and SRE models require a tremendous amount of data to speed problem identification and resolution across the entire service infrastructure. Both service-level agreements (SLAs) and service-level objectives (SLOs) require discussions on the “metrics that matter” for cloud-born services by understanding the user journey and what users value in their service experience.
User/internal customer understanding: DevOps managers, observability engineers, operations teams, and SREs work across the organization to define the user (or internal customer) and what they value. SLOs and SLAs map that value to the technology and performance metrics that drive the right level of service reliability.
Cost containment: SLOs place the focus only on the metrics that matter from monitoring systems to deliver the right level of reliability for each service. Costs are optimized with investments steered towards only services that need to be highly reliable to keep customers happy. It also helps reduce how many metrics are stored long term, further reducing the cost of monitoring.
DevOps, SRE, and observability modernization and acceleration: SREs enable a modern developer-driven approach to service reliability and customer experience capabilities. SLOs empower operations discussions to evaluate investment trade-offs and force the focus on the user journey and the associated metrics that support a positive, reliable user experience.
Explosive growth in telemetry data: Operations teams are faced with capturing and analyzing increasingly large volumes of telemetry data, including logs and traces, from highly distributed and ephemeral services. Because the volume of data collected in cloud services grows so quickly, organizations need observability solutions that scale well enough to handle it.
Business context: Operations and development teams must force a focus on the user (both internal and external) and their business journey with the company; it drives visibility into how a user experiences a service and what they value from the service. Great customer experiences drive trust, which requires reliable, high-performing, and secure services. SREs and DevOps can perform faster analysis and correlations with business and operational data stored in one system, accessible by easily compatible tools.
Service sprawl and dependency interface: SLOs and supporting SLIs (service-level indicators) help address tail latency problems driven by high-fan-out systems, which are increasingly common with container and microservices architectures. Performance data is a key requirement for solving these problems.
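To illustrate why high-fan-out systems make tail latency a concern, the following minimal sketch computes the chance that a request touching several backends encounters at least one slow response. It assumes each backend call is independent and is slow with the same probability; the function name and the 1% figure are illustrative assumptions, not survey data.

```python
def prob_slow_request(fan_out: int, p_slow_backend: float) -> float:
    """Probability that at least one of `fan_out` independent backend
    calls is slow, when each call is slow with probability `p_slow_backend`.

    Assumption: backend latencies are independent and identically distributed.
    """
    if fan_out < 0:
        raise ValueError("fan_out must be non-negative")
    if not 0.0 <= p_slow_backend <= 1.0:
        raise ValueError("p_slow_backend must be between 0 and 1")
    # P(at least one slow) = 1 - P(all calls are fast)
    return 1.0 - (1.0 - p_slow_backend) ** fan_out


if __name__ == "__main__":
    # Hypothetical case: each backend exceeds its latency threshold 1% of the time.
    for n in (1, 10, 100):
        share = prob_slow_request(n, 0.01)
        print(f"fan-out {n:>3}: {share:.1%} of requests wait on a slow backend")
```

Under these assumptions, a request that fans out to 100 backends waits on at least one slow call roughly 63% of the time, even though each individual backend is slow only 1% of the time. This is why per-service SLIs on tail latency matter more as fan-out grows.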

DevOps managers, observability engineers, developers, cloud platform teams, and SREs are leading their organizations toward modern IT operations practices, redefining the role of operations and system reliability by applying new organizational thinking and modern software capabilities to cloud infrastructure and operations tasks. These roles implement observability and monitoring with a focus on service reliability by using data collection, analysis, and automation, while focusing on the end user experience.

Modern leaders understand that a great customer experience drives revenues and growth. From a cultural perspective, these roles (and built-in monitoring and management tools) are helping CIOs migrate their culture toward a more data-driven, proactive, fact-based decision-making environment.

To deliver business outcomes, these leaders typically focus on three questions that shape their foundation for success for cloud-born monitoring and observability.

These are:

What is the service and user journey? These roles and teams must have a deep understanding of what the components of the service are, how the user journey is executed to deliver value, and what creates a positive customer experience. They must understand how the user experiences the service and what level of performance is required for satisfaction.
What is the SLO? SLOs establish performance and reliability targets for a service over a specified time. SLOs should reflect the level of service availability that would satisfy users, therefore defining the amount of unavailability that would not disappoint them to an unacceptable degree. This in turn has implications for budgeting the infrastructure spend.
What are the SLIs? SLIs are the performance metrics collected to inform the SLO and reliability targets. It is important to focus on the metrics that matter to the SLO reliability objective.
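The relationship between an SLI and an SLO can be sketched in a few lines. The example below assumes a request-based availability SLI (good events divided by total events) measured over one window; the class name, function names, and request counts are hypothetical and stand in for whatever a monitoring system would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A reliability target for one service (hypothetical structure)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests should be good


def availability_sli(good_events: int, total_events: int) -> float:
    """Request-based SLI: the fraction of events that were good."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    return good_events / total_events


def meets_slo(sli: float, slo: SLO) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo.target


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 requests, 800 of them failed.
    checkout = SLO(name="checkout-availability", target=0.999)
    sli = availability_sli(good_events=999_200, total_events=1_000_000)
    print(f"{checkout.name}: SLI={sli:.4%}, target={checkout.target:.1%}, "
          f"met={meets_slo(sli, checkout)}")
```

Keeping the SLI as a simple ratio makes it easy to compute from metrics that built-in monitoring tools already collect, and keeps the SLO discussion focused on the target rather than on the measurement mechanics.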

Why are SLOs important to driving an increase in team productivity and collaboration and higher levels of reliability? They offer multiple teams and stakeholders the ability to scale a service out as user demand grows, while assuring high levels of reliability with clearly defined performance and reliability targets that align team behaviors. SLOs empower SREs, DevOps, observability engineers, operations and developer teams to better understand what users care about in the service. They also allow teams to prioritize their work during the life of the service, across both software development and operations. As user demands change, the service can more efficiently adapt to deliver value.

Adopting SLOs helps engineering teams deliver optimized reliability, reduce team burnout, and provide a high-performing digital experience.

Business outcomes far transcend the technology tools used for cloud-born monitoring and observability. These solutions are expected to have a drastic impact on the core functions of a cloud platform: 71% of respondents believe that by using public cloud–provided IT operations management solutions, they can scale cloud-born/container-based applications and related processes faster than when applications run solely on premises (private cloud). For most public cloud transformational use cases, speed is a core competitive differentiation and expected outcome. Management tools enable the acceleration to speed-related business outcomes.

Creating Effective SLOs Requires Cloud-Born Management and Observability Data and Metrics

It takes a collaborative team that includes product managers, engineering/developers, DevOps, observability engineers, operations, and SREs to define SLOs. In addition, a good SLO reflects transaction volume and value. When enterprises take the time to build good SLOs, they ensure that their evaluation of reliability takes the user experience into account. At the core of these discussions is data collection and transparency. To collect the right level of performance data cardinality, cloud-born management tools are required. These tools often come from cloud platform providers; 69% of respondents trust that cloud providers can build tools better than others to manage their own clouds. To enable an end-to-end view of service reliability and performance, some enterprises have created a seven-step process for creating SLOs.

These steps are:

Determine and define the SLI types that best capture the users’ experience. Define how users most commonly interact with the service and the features they use, and understand the components that create the service. Examples include how long an operation takes to complete, whether it completes successfully, and whether the data it provides is correct.
Define and align the SLIs (metrics) that matter most in the user journey by defining the actions that matter most to the users.
Choose how to measure the SLIs using a monitoring/observability system and capture the actual user experience.
Collect SLIs (typically two to four per service) over several weeks to create a service performance baseline, then estimate the initial SLOs per service.
Create error budgets from the initial SLOs to establish targets that drive user happiness and a positive customer experience. An error budget is the maximum amount of time that a service can be unavailable without contractual or user consequences. Ensuring that SLOs have been reviewed and approved is a requirement before a new service can be put into production so that SLOs become part of the service definition.
Publish SLOs to broader stakeholders to make the SLO values available to users, helping them understand the reliability guarantees that the service offers. This often includes what the service is and how it’s used, the defined types of SLIs being measured, how the SLIs support the SLOs, the SLO definitions, and the business/technology context for why the SLIs and SLOs were chosen for the service.
Review outages and SLOs on a regular basis to ensure that the existing SLIs capture outage scenarios, and tune objectives where appropriate. This supports the idea of continuous improvement.
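The error budget described in the steps above follows directly from the SLO target. The sketch below assumes a time-based availability SLO over a 30-day window; the function names, the window length, and the downtime figure are illustrative assumptions rather than values from the survey.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO.

    Assumption: the SLO is time-based and evaluated over `window_days`.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent.

    A negative result means the service has exceeded its budget.
    """
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0.0:
        raise ValueError("a 100% target leaves no error budget")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    target = 0.999  # hypothetical 99.9% availability SLO
    print(f"budget: {error_budget_minutes(target):.1f} minutes per 30 days")
    print(f"remaining after 10.8 minutes of downtime: "
          f"{budget_remaining(target, 10.8):.0%}")
```

With a 99.9% target, the budget works out to about 43.2 minutes per 30 days. Tracking the remaining fraction gives teams a shared number to consult during the regular reviews in the final step.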

As IT organizations mature toward a modern operational approach for higher levels of system reliability, cloud-born management solutions can support various roles and meet organizations where they are in the maturity journey. Indeed, the data shows rising interest in extracting more value out of cloud-born tools without spending more and an awareness of the value built-in tools have compared to third-party operations and management solutions.

Future Outlook

For leadership teams using cloud platforms, the future requires a broader adoption of public cloud management and observability tools. The survey indicated that 45% of respondents expect to use more platform-based managed services as part of their observability and management strategy over the next three years. The collection of, and access to, performance data across teams and stakeholders have become critical in driving successful business outcomes while expanding the value of the respective cloud platform investments. For example, being able to track how security issues affect applications is a hot-button executive topic. Leaders need to be able to link threat and vulnerability detection and application code scanning to performance degradation in a more automated fashion. The ability to collect information across metrics, logs, traces, and security events holistically in a single console is becoming a valuable cloud platform asset.

Many IT leadership teams are investigating the ability to use this information to conduct security and compliance audits using cloud-born management and observability tools. The future is about collecting and analyzing performance data to quickly identify, resolve, predict, and prevent problems before they impact the customer. In many cases, IT organizations will use these tools to act and gather insights into the cloud platforms and resources faster than ever using actionable, roles-based dashboards. These cloud platform tools will support and plug into DevOps, SRE, observability, and modern software and operations deployment models and practices, accelerating the value of cloud platforms and increasing their derived ROI and transformational benefits.

Challenges/Opportunities

For all the benefits of using cloud-born management and observability tools, challenges remain for IT leadership teams to consider. As with any transformation, successful teams will consider these challenges prior to deployment and plan accordingly.

These challenges include:

The control over cloud-based tools versus a do-it-yourself strategy
The idea that the cost of cloud-based tools adoption is too high and does not add value
The inability to understand the value of cloud tools and the role they play in transformation and modernization strategies
The belief that third-party tools provide the necessary level and depth of information versus built-in tools
The idea that cloud-based services fail to perform at their expected levels
The conclusion that cloud-based management and observability tools don’t offer collaboration and value opportunities across multiple teams for identifying and resolving problems
The failure to consider the need to connect revenues, profitability, and brand reputation to highly performing cloud-born services and the role these tools play in assuring performance

Conclusion

As economic and business environments change, the role of cloud platforms continues to expand in adoption to provide agility, speed, and innovation. Most customers now use (and often depend on) digital services that are highly complex and that utilize cloud infrastructure to deliver a positive experience. Service reliability and deep customer relationships are fast becoming indicators of revenue growth, high customer renewal rates, and a positive business reputation. To drive competitive advantage through software reliability, organizations increasingly believe that built-in, cloud-born management and observability tools are required to contain costs and deliver the services that users appreciate and find valuable. As IT organizations continue to drive competitive advantages with the business, tools that are built by cloud providers for use in their clouds are fast becoming secret weapons in driving great user experience and loyalty.

About the Analysts

Stephen Elliot

Group Vice President, I&O, Cloud Operations, and DevOps, IDC

Stephen manages multiple programs spanning IT operations, enterprise management, ITSM, agile and DevOps, application performance, virtualization, multicloud management and automation, log analytics, container management, DaaS, and software-defined compute. Stephen advises senior IT, business, and investment executives globally in the creation of strategy and operational tactics that drive the execution of digital transformation and business growth.

Tim Grieser

Research Vice President, Enterprise System Management Software, IDC

Tim’s coverage includes software and SaaS solutions for managing systems, applications, and IT operations across a wide variety of deployment models, including on-premises and private and public clouds. Tim has published IDC research in market sizing, market forecasting, technological trends, vendor strategies, and IT user needs and priorities. Current interests include IT operations analytics, encompassing both log analysis and predictive insights and cognitive/AI technologies.

This publication was produced by IDC Custom Solutions. As a premier global provider of market intelligence, advisory services, and events for the information technology, telecommunications, and consumer technology markets, IDC’s Custom Solutions group helps clients plan, market, sell, and succeed in the global marketplace. We create actionable market intelligence and influential content marketing programs that yield measurable results.
