The reality of software development is that things break all the time. No matter the industry or application, software will inevitably break at some point.
Everything fails, all the time. - Werner Vogels (CTO @ Amazon)
This failure could be because the logic in the code is wrong. It could be because the underlying infrastructure our service runs on failed. Or maybe our users are using the system in a way we didn't design for.
The possibilities for failure are endless.
But we as developers shouldn’t be discouraged by this. In fact, we should embrace this and push ourselves to recognize that failures happen. Why? Because only via failure can we make our applications more resilient.
This raises the question though: how do we know when things fail? Or even better, how can we know when things are beginning to fail? There is an important distinction here.
Knowing when things have gone wrong is reactive. Whereas knowing when things are going wrong is proactive.
Gathering information about failures
Let’s start by getting from not knowing to at least knowing and reacting to failures.
A key to knowing when things have gone wrong is the definition of right versus wrong inside of our applications. Let’s take an example.
Here we have a function in C# that calls an external service (_externalClient stands in for whatever client makes that call). The call to the external service is wrapped in a try/catch block.

public void CallExternal()
{
    try { _externalClient.Call(); }
    catch (Exception e) { /* what does this mean? */ }
}
The first step to knowing when something has gone wrong in our applications is defining wrong. Taking a look at the sample code above, there are a few questions we can ask ourselves.
- If we end up in our catch block, what does that mean for this specific function?
- How does this exception impact the larger system?
- How does this exception impact our end user?
These are just three questions, but I think you can see the point I am trying to make. Catching an exception is great, but we must know what that means within the larger context of the system. Only then can we know when a true failure has occurred.
This brings us to the first step in knowing when something has gone wrong: logging. To know that something has gone wrong inside our application, we need to log exceptions when they occur.
After determining that an exception here is an error, we can now introduce logging to our function (_log stands in for a log4net-style logger).

public void CallExternal()
{
    try { _externalClient.Call(); }
    catch (Exception e)
    {
        _log.Error("Failed to call external service", e);
    }
}
Now we are logging information for ourselves so that we can know what went wrong inside of our application code. This is key because we need to know what went wrong when a failure happens. Logging helps us get there.
I say helps because there is good logging, mediocre logging, and bad logging. At this point, we have mediocre logging. We know that a failure happened, so we log a message and include the exception. But where are those logs going? Or put another way, what is the implementation of our logging?
That is where good logging comes in. In my experience, good logging is not just sound logging practices inside of the code, but solid practices in making those logs visible externally.
For instance, if our application is running on an AWS EC2 instance and we are just logging to a file on disk, how useful are those logs? Certainly, they are better than no logs, but we still have to log into the machine, download the file and examine the messages.
To get to the next level we should make those logs visible externally. This could be done by rotating the logs off the machine and into an S3 bucket. Maybe we could change our logging implementation to write directly to a Kinesis Firehose that puts our records into Elasticsearch. Or if we have the money, we could use a provider like Sumo Logic or Datadog and move our logs into their systems.
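As a sketch of the Kinesis Firehose option, here is roughly what shipping a log line off the machine could look like with the AWS SDK for .NET. The delivery stream name "app-logs" is an assumption; the stream itself would be configured separately to deliver into Elasticsearch.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Amazon.KinesisFirehose;
using Amazon.KinesisFirehose.Model;

// Hypothetical log shipper: writes each log line to a Kinesis Firehose
// delivery stream instead of (or in addition to) a file on disk.
public class FirehoseLogShipper
{
    private readonly IAmazonKinesisFirehose _firehose = new AmazonKinesisFirehoseClient();

    public Task ShipAsync(string logLine)
    {
        return _firehose.PutRecordAsync(new PutRecordRequest
        {
            DeliveryStreamName = "app-logs", // assumed stream name
            Record = new Record
            {
                // Firehose records are raw bytes; newline-delimit for Elasticsearch.
                Data = new MemoryStream(Encoding.UTF8.GetBytes(logLine + "\n"))
            }
        });
    }
}
```

In practice this would usually be wired in as a custom appender in the logging framework rather than called directly, so application code keeps using _log.Error as before.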
The point here is that in order for us to know when things have gone wrong, we need to define wrong and then log information when those failures happen.
An important thing to note here: we are using catch blocks to define wrong inside our applications. Meaning our business logic code must throw exceptions when it has encountered something it wasn't expecting.
This comes back to sound logging practices. A convention must exist within the team for logging; otherwise, we end up with sparse logs that are missing information.
We are now at a reactive state. If a user encounters an issue, or if we have basic alerting in place, we can respond to errors that happen in our application.
Moving from reactive to proactive
Logging is great and any production system should have it. The messages we log provide us insight into the behavior of our system and information regarding failures.
At this point, we are in a reactive environment. If a user tells us something unexpected happened, we can analyze our logs to determine what went wrong.
We can do better. In fact, we always want to be striving to know when things are going wrong before even the user knows.
This is where monitoring comes in.
It’s a broad topic that covers the entire system. So what are some things we should be monitoring?
- Error rates.
- Performance of our system.
- The health of our infrastructure.
These are three very broad categories so let’s dive into each one in more detail.
Monitoring error rates
Error rates are a key indicator of the health of our system as a whole. It is useful to think of these in terms of aggregate and fine-grained.
The latter allows us to leverage the sound logging practices we have in place. Let’s say we have two functions in our system that log out details about errors anytime they occur. Furthermore, we have made our logs visible externally by sending them to a third party service or just logging directly to AWS CloudWatch Logs.
With those two prerequisites taken care of, we can now monitor the error rates of these two functions. By leveraging our logging we can create fine-grained monitoring strategies that tell us whether a particular function is working as expected. We could even take this further by extending our logging to include performance metrics, which would allow us to know the performance of a given function in production.
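Extending the earlier CallExternal example, one way to fold performance metrics into our logging is to time the call with a Stopwatch and emit the duration as a structured log line (as before, _externalClient and _log are assumed dependencies).

```csharp
using System;
using System.Diagnostics;

// Sketch: log both failures and per-call duration for one function.
public void CallExternal()
{
    var timer = Stopwatch.StartNew();
    try
    {
        _externalClient.Call();
    }
    catch (Exception e)
    {
        _log.Error("Failed to call external service", e);
    }
    finally
    {
        timer.Stop();
        // A consistent message shape makes this easy to parse into a metric later.
        _log.Info($"CallExternal duration_ms={timer.ElapsedMilliseconds}");
    }
}
```

With logs like these flowing to CloudWatch Logs or a third-party service, a metric filter on `duration_ms` gives us per-function performance without touching the business logic again.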
Monitoring functions within our system is a fine-grained strategy, but it's also important to monitor the error rates of the entire system. This allows us to see the total health of the system at a high level.
This is important because with one aggregate metric we can quickly spot an issue within our system. For instance, if our aggregate error rate jumps from 0 errors/second to 1,000 errors/second, we know we have a problem and can begin diving into our fine-grained monitoring to determine the cause.
Fine-grained monitoring and aggregate monitoring really go hand in hand. In fact, we could even use our fine-grained monitoring to produce aggregates that show the error rate of the system as a whole.
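One way to produce both views from the same signal is to emit a counter metric with the function name as a dimension. This sketch uses CloudWatch custom metrics via the AWS SDK for .NET; the "MyApp" namespace and "Errors" metric name are assumptions.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

// Hypothetical error counter: each catch block calls RecordErrorAsync
// with its function name.
public class ErrorMetrics
{
    private readonly IAmazonCloudWatch _cloudWatch = new AmazonCloudWatchClient();

    public Task RecordErrorAsync(string functionName)
    {
        return _cloudWatch.PutMetricDataAsync(new PutMetricDataRequest
        {
            Namespace = "MyApp", // assumed namespace
            MetricData = new List<MetricDatum>
            {
                new MetricDatum
                {
                    MetricName = "Errors",
                    Unit = StandardUnit.Count,
                    Value = 1,
                    // The dimension gives us the fine-grained, per-function view.
                    Dimensions = new List<Dimension>
                    {
                        new Dimension { Name = "Function", Value = functionName }
                    }
                }
            }
        });
    }
}
```

Summing the per-function series (or also emitting a second datum without the dimension) then yields the aggregate error rate for the whole system.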
Performance of our system
Error rates are very clear to understand: either the functionality worked or it didn't. Response time and performance are not as clear-cut, but they are equally important to becoming a proactive team rather than a reactive one.
Measuring and monitoring the performance of our system allows us to track indicators of a system that is degrading. Compared to an error, performance has more of a gray area. This is because we need to monitor it over time in order to establish baselines for a healthy system.
However, once the baselines are established we can monitor our metrics to detect degradations before they become failures for our users. In essence, we want to track performance in our system so that we can proactively detect when we are approaching a failure state.
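As a toy illustration of establishing a baseline, here is a percentile calculation over recorded durations. In practice a monitoring tool would compute this over a rolling window; this sketch just shows the idea.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LatencyBaseline
{
    // Returns the p-th percentile (0 < p <= 1) of a non-empty sample set,
    // e.g. Percentile(durationsMs, 0.95) for a p95 latency baseline.
    public static double Percentile(IReadOnlyList<double> samples, double p)
    {
        var sorted = samples.OrderBy(x => x).ToList();
        int index = (int)Math.Ceiling(p * sorted.Count) - 1;
        return sorted[Math.Max(index, 0)];
    }
}
```

Once a p95 baseline is established from healthy traffic, a sustained run of measurements above it is the early-warning signal we want, well before requests start failing outright.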
By monitoring the performance of our system over time we can combine that information with error rates to know if an error is degrading performance or if performance issues are introducing errors. Both allow us to proactively monitor for when things are going wrong.
The health of our infrastructure
The reality of cloud providers like AWS, Azure and Google Cloud is that there is a universe of failures that are outside of our control. By leveraging a cloud provider we are offloading the compute resources, networking, and storage to someone else.
This is fantastic. It allows us to focus on the things that matter like building stellar applications for our users.
But we must recognize that these services can fail as well, and when they do, we need to know about it. Therefore we need to continuously monitor the overall health of our infrastructure.
If we are using one of the three cloud providers mentioned above, here are some things we can monitor.
- Overall instance health
- Load Balancer health
- Autoscaling failures
- Autoscaling and Instance level CPU utilization
- Database memory, CPU, storage space
These are just a few things worth monitoring. What we choose to monitor at an infrastructure level is going to depend on our individual applications. For instance, if our application were entirely serverless, we wouldn't monitor load balancers or autoscaling because they wouldn't exist.
The key thing to remember here is that a failure at the infrastructure level is often the most critical because it can take the entire system down. Therefore, when that failure does occur, we need to know about it.
Sound the fire alarm
Up to this point we have talked about using logging to gain insight into the behavior of our system at a fine-grained level. We dove into monitoring errors and performance at a function level as well as at an aggregate level. We also talked about monitoring the overall health of our infrastructure.
This is great, we know the basics of logging and monitoring. But these are just the tools that allow us to gather information, we still need to know when something is going wrong.
When there is a fire in the building we often pull a fire alarm to let everyone else know. We alert others to the fact that there is an issue that requires immediate attention.
By logging information about our system and monitoring the various pieces that compose it, we can use that information to alert us when there is an issue. This alerting could be one or more of these things.
- Send a text message to the team/support individual.
- Post a message to the team Slack channel.
- Send an email to the team email list.
- Sound an alarm in the shared workspace.
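On AWS, several of the options above can hang off a single SNS topic: publish one alert and let subscriptions fan it out to SMS, email, or a Slack-posting Lambda. A minimal sketch, assuming a topic named app-alerts with a made-up ARN:

```csharp
using System.Threading.Tasks;
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

public class Alerter
{
    private readonly IAmazonSimpleNotificationService _sns =
        new AmazonSimpleNotificationServiceClient();

    public Task RaiseAsync(string subject, string detail)
    {
        return _sns.PublishAsync(new PublishRequest
        {
            // Hypothetical topic ARN for illustration only.
            TopicArn = "arn:aws:sns:us-east-1:123456789012:app-alerts",
            Subject = subject,
            Message = detail
        });
    }
}
```

Who subscribes to the topic, and how, is then a configuration decision rather than a code change.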
The purpose of logging and monitoring our system is to allow us to take action when there is an issue or we are approaching one. Alerting is the trigger that allows us to take action.
We could begin with failure alerting, where we need to respond because a failure has occurred. Maybe this failure came from our fine-grained logs, or our error rate spiked in a way we have never seen before. This could allow us to track down what is causing the failure and resolve the issue.
Moving from reactive to proactive, we could start implementing proactive alerting. This is very relevant to the topic of performance. If we know the system begins to have issues when our instances reach 95% CPU utilization, we can alert ourselves when we cross 85-90%. This keeps us in tune with the current state of our system and lets us take action before we reach a failure condition.
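On AWS, that 85% early-warning threshold could be expressed as a CloudWatch alarm on EC2 CPU utilization that notifies an SNS topic. The autoscaling group name and topic ARN below are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Amazon.CloudWatch;
using Amazon.CloudWatch.Model;

public class ProactiveAlarms
{
    private readonly IAmazonCloudWatch _cloudWatch = new AmazonCloudWatchClient();

    public Task CreateCpuWarningAsync()
    {
        return _cloudWatch.PutMetricAlarmAsync(new PutMetricAlarmRequest
        {
            AlarmName = "high-cpu-warning", // assumed alarm name
            Namespace = "AWS/EC2",
            MetricName = "CPUUtilization",
            Dimensions = new List<Dimension>
            {
                // Hypothetical autoscaling group.
                new Dimension { Name = "AutoScalingGroupName", Value = "app-asg" }
            },
            Statistic = Statistic.Average,
            Period = 300,              // evaluate 5-minute averages
            EvaluationPeriods = 2,     // sustained, not a momentary spike
            Threshold = 85,            // warn well before the 95% failure point
            ComparisonOperator = ComparisonOperator.GreaterThanThreshold,
            // Hypothetical SNS topic that fans out to the team.
            AlarmActions = new List<string>
            {
                "arn:aws:sns:us-east-1:123456789012:app-alerts"
            }
        });
    }
}
```

Requiring two consecutive five-minute periods above the threshold is one way to avoid paging the team on a brief spike.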
The reality is that most software applications contain failures. These failures can stem from a multitude of things, from bad code to failing infrastructure. It is our responsibility to know when things have gone wrong.
Logging is a tool that allows us to gather information about our systems. We can log when an error happens that we weren’t expecting. We can also log performance information at a granular level. Then we can make those logs externally visible so that we can digest that information without having to log into a machine.
Monitoring is an accompanying tool that allows us to view the health and performance of our systems. This monitoring is what allows us to know if we need to research an issue in our logs or if our infrastructure is hitting unexpected issues.
Alerting is how we hold ourselves responsible for the health of our system in production. Logging and monitoring are only useful if we actually use them in response to an action. By notifying ourselves of an issue we are holding ourselves accountable to the fact that when an issue arises, we need to resolve it.