June 3, 2022
What is Observability?
Observability is understanding the full picture: there's something happening in your application landscape that's starting to impact your users, your clients, or your business.
And observability is also the practice of being able to drill down into the data when something like that happens. You can tell what's going on, remediate it (which means quickly fix it), and then put in a long-term solution so that it doesn't happen again.
A lot of organizations do monitoring, or Application Performance Monitoring. They're monitoring the infrastructure, monitoring compute, monitoring networking; basically they’re looking at the typical layers of the infrastructure (see figure below).
Where we want to take businesses is monitoring from the outside-in: focusing primarily on the layer between where their IT ends and their users connect, or where other businesses connect to the application.
Looking from the user experience inward also allows organizations to drill down all the way past the compute layer, which is where most monitoring and IT Ops teams are doing their monitoring today. With a Full Stack Observability mindset and tooling, teams can also drill down all the way into the networking layer. This helps IT Ops teams understand if compute is their problem, if networking is their problem, or if it's the application itself.
Or even further: is it a problem that I don't even own? It might be outside of my perimeter, via a third-party API, for example.
But, the key is that the prioritization of issues and KPI management is based on how the issue hits the user experience and thus affects revenue, Customer Satisfaction/ Net Promoter Score, or (depending on the industry) operations such as shipping, logistics, manufacturing, or internal users of software.
Organizations typically struggle because a central team provides the tooling and training on the tooling, but there's not nearly enough dialogue about how to do observability well.
At Evolutio, we have a very strong recommendation for how companies should do observability well. And that really is focusing from the outside-in.
How Does Lacking a Culture of Observability Manifest Today?
The number one outcome that we see is that it's mostly users and clients who are calling in issues. And if you have one user calling in an issue, you have 10 users that aren't saying anything at all.
When organizations have a large and persistent outage, they find out late. And when you find out about an outage late, it's very hard to tell what the root cause is.
Cisco's Full Stack Observability has a portion called Application Performance Monitoring (APM). Even with basic APM, it's easy to tell what started things: the full timeline of an outage looks like a big hairball, but the beginning of the outage will really tell you what the problem was. APM can even point to the issues that cascaded before the outage started.
Even APM alone has quick time-to-value, because just by installing the agents, organizations can look in the rearview mirror and get answers.
How is true observability different? It's when you start having those leading indicators that say: "Okay team, typically when this happens, we're going to roll into an outage and you'll want to get somebody on it right away, and get them looking at this exact thing so that they can stop the root cause and fix it before it starts impacting most of your client base."
What we have found is that a lot of times, organizations never really find the root cause. The process of getting to the problem can sometimes take five days or more. And if your teams don't find the root cause within those five days, a lot of times they have to abandon the pursuit.
As for typical troubleshooting, the team is looking through different application logs to try to piece together a story of what happened. The application in question has multiple outside dependencies, so how did it respond when one dependency changed? How about a second dependency? Successful root cause analysis is very dependent on having the right kind of logging and tooling in place. And a lot of times, when troubleshooting teams find anything that looks wrong, they immediately pin the problem on it. And a lot of times that is not the reason (or the whole reason).
And that, my readers, is why organizations have persistent outages: they think they've found the root cause, because the problem has all the characteristics of a root cause, and teams pin everything on it, only to find out later, when it happens again, that it wasn't the root cause at all.
Why is AppDynamics the Best Observability Solution on the Market?
We partner with AppDynamics because we believe it's the best-in-breed observability platform for achieving the goals that I've just laid out. Its analytics capabilities are very flexible, and it allows IT operators and business users alike to segment the data in a way that's meaningful for remediation and for making business decisions.
Let's say you have multiple users coming in and leveraging a platform. You can see which type of client or user is impacted, which region they're coming from, and exactly what they're doing. You can also see how pieces that don't look connected on the back end, because they're not connected by a thread, are really part of a complete business process.
Taking an eCommerce experience, for example, a complete business process (or Business Transaction) for a user is having the ability to place an order with a credit card, because oftentimes that's how companies make money. So, to be able to check out, the user has to be able to log in, go to the home page, do an effective search of the site, put something in their cart, hit checkout, go through the bill payment process, and then get a confirmation that the checkout was successful.
That's a complete business process.
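To make the idea concrete, here is a minimal sketch of treating that checkout flow as one ordered business process and measuring where users fall out of it. This is not AppDynamics code; the step names and the `(user, step)` event format are illustrative assumptions.

```python
# Ordered steps of the hypothetical checkout Business Transaction.
STEPS = ["login", "home", "search", "add_to_cart",
         "checkout", "payment", "confirmation"]

def funnel(events):
    """events: list of (user_id, step) tuples observed in some window.
    Returns, per step, how many distinct users reached it, so you can
    see where the complete business process breaks down."""
    reached = {step: set() for step in STEPS}
    for user, step in events:
        if step in reached:
            reached[step].add(user)
    return {step: len(users) for step, users in reached.items()}

events = [
    ("u1", "login"), ("u1", "home"), ("u1", "search"), ("u1", "add_to_cart"),
    ("u1", "checkout"), ("u1", "payment"), ("u1", "confirmation"),
    ("u2", "login"), ("u2", "home"), ("u2", "search"),
    ("u3", "login"), ("u3", "home"), ("u3", "search"), ("u3", "add_to_cart"),
    ("u3", "checkout"),  # u3 abandoned at payment
]

counts = funnel(events)
# Three users logged in, but only one reached confirmation: the
# business process is "breaking" somewhere between checkout and payment.
```

The point of watching the process end-to-end, rather than each server, is exactly this view: the individual steps can all look healthy while the complete transaction is failing.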
Being able to watch and monitor that complete business process as a single object, even though it's not connected by a thread, is something that AppDynamics is uniquely capable of doing. And that's really the approach we take with most of our clients. Obviously we want to watch our clients' user activity, but it's these key business processes that we want our clients to focus on, because if those business processes break or have an outage, our clients are losing revenue and their Customer Satisfaction is taking a hit. We want to help put process in place to avoid that, and a culture that proactively measures against those business processes.
The competitive landscape that AppDynamics is in is actually pretty crowded. There are several competitors with good offerings that they bring to their clients.
DataDog & Dynatrace
One of the ones that we run into from time to time is DataDog, which is a platform that's extremely easy to plug into.
Getting infrastructure metrics into DataDog is easy. The platform also has a large synthetics capability. Before getting into Application Performance Monitoring (APM), a lot of people start with synthetics: fake traffic that tests your website to make sure that certain business processes are functional.
What are synthetics? Imagine a web server out in Amazon Web Services that's running a checkout process. Monitoring that server with synthetic traffic is a wonderful way to check whether you have a complete and total outage, but it's actually a really poor way to verify whether swaths of users of a certain type are having a degraded or failed experience.
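As a rough illustration of what a synthetic check is (a generic sketch, not DataDog's or any vendor's synthetics API; the step functions here are stubs standing in for real HTTP calls), it is just scripted traffic that walks the business process and reports pass/fail and timing per step:

```python
import time

def run_synthetic(steps):
    """steps: list of (name, callable) pairs. Each callable performs one
    scripted action (in a real check, an HTTP request against the site)
    and raises on failure. Stops at the first failure, just as a real
    user would be blocked, and returns per-step results."""
    results = []
    for name, action in steps:
        start = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append({"step": name, "ok": ok, "ms": round(elapsed_ms, 1)})
        if not ok:
            break  # downstream steps are unreachable
    return results

def fail():
    # Stub simulating a broken step; a real check would surface
    # an HTTP error or timeout here.
    raise RuntimeError("payment gateway down")

checkout_flow = [
    ("login", lambda: None),
    ("search", lambda: None),
    ("checkout", fail),
    ("confirmation", lambda: None),  # never reached
]

report = run_synthetic(checkout_flow)
```

This also shows the limitation described above: one scripted path can tell you the process is broken for everyone, but it says nothing about which real user segments are degraded, because the traffic is fake.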
By using Application Performance Monitoring and analytics, you can tell which segment of users or clients is being impacted, and to what level. The thing that DataDog doesn't have is that ability to segment users based on real traffic, with real APM monitoring. Because it doesn't have the advanced analytics capabilities, it really cannot give you the ability to divide up the traffic you're seeing in a meaningful way.
With DataDog, an operational user can say, "OK, we're having an impact." However, it can't look at only the specific segment of users being impacted, nor can it see what percentage of users are impacted by the event.
Dynatrace, on the other hand, also has a very easy-to-use plug-in interface; getting data into Dynatrace is very easy. It also has really great agents for legacy software, which is a big strength of AppDynamics as well. But again, it doesn't have the analytics capabilities, so in my mind a lot of the things that we do for clients just aren't possible there.
The other difference between DataDog and AppDynamics is that in AppDynamics you do most of the configuration on the back plane. That means you don't have to touch the agent and you don't have to touch the code; you just do all the configuration in the AppDynamics controller, and it's extremely effective.
It's such a big breath of fresh air and a deviation from how people have done monitoring forever. With logs, you have to plan what you want to capture, write it to a log file, test that code, and release that code to get it out there.
Really, for DataDog it's the same scenario. If you want to change the way something looks in DataDog, you have to change the code to make that happen: make the changes, test them, and get them through release. In AppDynamics, you don't have to do that, which is a huge plus. The other thing that we love about AppDynamics is its dynamic baselining capabilities: out of the box, straightforward, easy to use, always on.
In a lot of other platforms, you do have to configure baselines by telling the tool, "I want this baseline, and I want it to look like this." Those platforms do have more tips and tricks you can use, such as leveraging seasonal baselines, so you can really shape how those baselines work.
AppDynamics doesn't have those specific baseline shaping capabilities, but it does have no-fuss baselines everywhere.
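To illustrate what a dynamic baseline does (a generic rolling-baseline sketch, not AppDynamics' actual algorithm, which is proprietary), the tool learns "normal" from recent observations and flags deviations without anyone hand-setting a threshold:

```python
import statistics

def is_anomalous(history, value, sigmas=3.0):
    """Flag `value` if it deviates more than `sigmas` standard
    deviations from the rolling history of recent observations.
    No hand-configured threshold: the baseline adapts to the data."""
    if len(history) < 10:          # not enough data to baseline yet
        return False
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) > sigmas * stdev

# Response times (ms) for a service that normally hovers around 100 ms.
history = [98, 102, 101, 97, 103, 99, 100, 102, 98, 101]

is_anomalous(history, 104)   # within normal variation -> False
is_anomalous(history, 450)   # clear deviation -> True
```

The value of this for a large IT landscape is exactly what the next section describes: nobody has to sit down and decide "alert above 200 ms" for every one of thousands of metrics.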
Scenario Examples for Observability
- Big Transformations: for large clients with a very large IT landscape, not having to put in specific thresholds or configure a baseline for each data point you want to alert on is a huge plus, because it saves a ton of work. Evolutio typically gets brought in when a company is going through some kind of big transformation. Maybe the client is bringing tons of new application features and capabilities to market, and they've made a huge investment to deploy the changes. It cannot fail.
- Migrating to the cloud or to a different environment: migrations are a huge, programmatic effort, sometimes a multi-year project that can't fail. And big projects have toil. Basically, toil is when IT teams spend lots of time troubleshooting, doing problem analysis, and tackling root cause analysis in a manual, time-consuming way. As Evolutio consults with our clients, a lot of what we're doing is trying to avoid toil: people spending time on work that isn't driving the business objectives, whether those are bringing new capabilities to market, going to the cloud, or both at the same time. When we help our clients reduce toil, we make a huge impact on the effectiveness of the IT organization, because the support structure is actually enabling the teams while the new capabilities come to market. The experience that Evolutio brings to the AppDynamics implementation process is that we understand how to help companies scale the effort with governance and cultural change, and they're faster...and I mean light years faster.
Center Of Excellence: Training the Client to Use AppDynamics
For a company to bring in AppDynamics on their own, they have to get fully trained on the tool, understand all the principles of observability and how to do it the right way, and then train their teams to do it in a scalable fashion. It's a nearly insurmountable task.
But because we've done so many of these large and successful implementations, we really have the playbook and we're able to train teams when we onboard an app.
Successful Implementation is a Five-step Process
That training is something we provide via video collateral in our Center of Excellence. When we're done, we enable the client to fully own a capability that is supported by the technology of AppDynamics, but it's much more about having the capability of observability than about the tool itself.
Evolutio's goal is to make sure our clients don't feel like some random implementation partner has come in, simply installed AppDynamics, and walked away with the tool running but no business impact or team cultural change.
Our Center of Excellence is enabling a full, end-to-end capability. It's in our ethos that Evolutio equips the organization to adopt a culture around observability.
Why Train & Enable a Culture of Observability?
By planning a relatively small project that's well planned and executed, and by using just a small amount of organizational time to organize and set up the monitoring, we make it so clients can focus on their normal day-to-day work, which is usually application development: building out new features or doing a cloud migration. Then they get the benefit of mostly doing planned work, and teams can stay in the realm of working the digital transformation roadmap instead of doing fire drills during an outage.
When an outage happens, and when a user is being impacted, that is unplanned work. And a lot of times it is a very large problem, and everyone drops everything to fix the issue now.
If organizations get into a cycle where they're constantly having to do this unplanned work that amounts to toil, it really has a psychological impact on IT teams. They're never hitting their deliverables; they're never hitting their targets.
And IT staff feel like every time they have a sprint or a burn-down meeting to talk about what happened, they're saying, "the reason I didn't hit my targets is because I was doing very important work for the company, making sure that this outage got resolved."
We do not want our teams in that state. We want them in a state where they're driving things forward, where they feel productive, and where the projects they start, they're able to complete. And that's really because they're not being shoved off the roadmap of planned work by forced, unplanned work fixing outages.
What are the Big Tech and Google IT Teams Doing?
Consider how our culture has shifted to being much more IT-driven and IT-oriented. Look at the cultural adoption of Facebook and Instagram: Millennials and Gen Z have grown up with these platforms and tools literally at their fingertips, and their patience as users for applications not working degrades every day. A user action in an application that took 2 seconds in 2013 should take 2 milliseconds in 2022. They have that level of attunement.
If you're an airline and a user is trying to buy a plane ticket, they want to be able to buy that plane ticket without hiccups, or they're going to go to a different site and buy that plane ticket.
It's becoming more and more of a mandate that organizations get in front of these user problems if they want retention, if they want business growth, and if they want new feature releases to be successful.
That's why you see observability initiatives springing up. You see technology leaders deciding, "OK, we need teams that own this."
Evolutio's clients grow into a Site Reliability Engineering (SRE) model, which is something we enable them to do, because SRE has the observability mindset ingrained across the board. SRE really becomes part of how development teams operate day-to-day, by embedding monitoring-as-code thinking earlier in the SDLC.
The Site Reliability Engineering (SRE) Playbook
Evolutio really follows the SRE model that was popularized by Google several years ago.
What is it? The Site Reliability Engineering model is where you have a team of site reliability engineers that's really held together by a Guild that's at the top. The SRE team is actually embedded with the software development teams, but their span of control is dozens or hundreds of applications. So a single SRE Engineer or Analyst may have two or three or ten applications that are under their purview. From there, that SRE gets all of their standards in place, and the rules get put in place by the SRE Guild.
With the Google SRE model, organizations want to focus on the Four Golden Signals that the end user is experiencing.
The Four Golden Signals are:
- Traffic: the amount of calls going through the system
- Latency: average response time (or 95th percentile response time, depending on how the data is shaped)
- Errors: the number of errors in the system
- Saturation: how much of the available capacity is in use
Think of it this way: of all the traffic coming through versus what's available, how much of a certain pipe is saturated? How much of a certain storage platform is saturated? Typically what we see is that at the outside, where the user lives, you have to have all four Golden Signals.
You have to understand where you're at. And then as you get lower in the stack, sometimes you only need one or two of those signals, because the number of errors going through may not be an issue. But maybe saturation is more important for something like a storage layer. So for different layers in your tech stack, you have different versions of those signals, but those are the ones you want to focus on. We have a standard alerting strategy that we roll out with our clients when they're implementing AppDynamics, and it focuses on the Four Golden Signals at scale, as a standard across the board for every app that we do.
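As a simplified sketch of how those four signals fall out of raw request data (the record format and the capacity figure are illustrative assumptions, not any vendor's data model):

```python
def golden_signals(requests, capacity_rps, window_secs):
    """requests: list of dicts with 'latency_ms' and 'error' keys,
    all observed within one window of `window_secs` seconds.
    Returns the Four Golden Signals for that window."""
    latencies = sorted(r["latency_ms"] for r in requests)
    # 95th percentile: often more honest than the average,
    # depending on how the data is shaped.
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    rps = len(requests) / window_secs
    return {
        "traffic_rps": rps,                                # amount of calls
        "latency_p95_ms": p95,                             # response time
        "errors": sum(1 for r in requests if r["error"]),  # error count
        "saturation": rps / capacity_rps,                  # used vs available
    }

# 100 synthetic records over a 10-second window; 1 error per 25 requests.
window = [{"latency_ms": 40 + i % 7, "error": i % 25 == 0}
          for i in range(100)]
signals = golden_signals(window, capacity_rps=50, window_secs=10)
# -> traffic 10 rps, 4 errors, saturation 0.2 against a 50 rps capacity
```

Lower in the stack you might keep only the saturation term (for storage) or only errors and latency (for an internal service), which is exactly the per-layer tailoring described above.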
Evolutio works hard with our clients to build a culture of observability, so that as their business grows and changes, observability is just immediately on the platform. As teams are doing development, they're working observability initiatives in, so they don't have to do a ton of rework at the end.
The best feeling is when I get to see organizations have that lightbulb moment. Clients become forward-thinking and can see over the horizon: "Where are we going to be in 3 to 5 years?" And they're eager to employ the tooling and KPIs properly today to get those desired results in the future. They start to say things like, "Let's start a Monitoring-as-Code initiative now to set us up for success in 36 months, because we're starting to collect the data."
I love when our clients are not only equipped to make performance decisions in the future, but are also enabled to make decisions on how to run their business and leverage software to solve the biggest operational problems they will face.
Evolutio can help you with your Observability initiatives and driving the cultural change for getting Site Reliability Engineering on the roadmap with your teams. Visit our Contact page to get in touch and learn more.