Elastic observability helping reduce time-to-resolution
By Hugo D’Arcy – Data Engineer & Elastic Expert
During my time in tech, one big question keeps coming up: why did this break? It is usually followed by: oh, this has been broken for days. We've all been there, and finding ways of getting to the root cause is nice; finding ways of getting there quickly is amazing. I've been working with the Elastic Stack for 18 months now, and whilst the standard version helped a lot with understanding various aspects of operational work and finding errors within data integrations, some of the features in the Gold Cloud tier will add some much-needed proactiveness to the monitoring.
This article will cover some of the Gold Cloud Elastic features I am looking forward to tinkering with to greatly reduce time-to-resolution, through a mix of proactive monitoring, more in-depth logs, metrics and traces, and a high-level overview of application environments. The example environment in this article consists of ActiveMQ (middleware), Talend (integration/ETL) and Postgres (source database); feel free to swap out each component for your own environment's equivalents.
Probably the most important feature for finding root causes significantly faster is third-party alerting: when a process fails, an alert is sent to Slack, email or Teams, based on custom rules. This serves as the catalyst in our root cause analysis. For our example, the rule fires when an ActiveMQ dead-letter queue contains a message, which gives us proactive monitoring instead of someone stumbling across it days later.
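As a minimal sketch of the idea, the snippet below mimics a threshold-style rule firing into a Slack webhook. The queue name, threshold and message text are my own assumptions, not a real Elastic rule definition; in practice you would configure this through Kibana's alerting rules and a Slack connector rather than hand-rolling it.

```python
import json

# Illustrative assumptions only: queue name and threshold are placeholders.
DLQ_NAME = "ActiveMQ.DLQ"
ALERT_THRESHOLD = 1  # fire as soon as a single message lands in the DLQ


def should_alert(queue_depth: int, threshold: int = ALERT_THRESHOLD) -> bool:
    """Mimics a simple metric-threshold rule: depth >= threshold fires."""
    return queue_depth >= threshold


def slack_payload(queue: str, depth: int) -> str:
    """Builds the JSON body a Slack webhook connector might send."""
    return json.dumps({
        "text": f":rotating_light: {queue} holds {depth} message(s); "
                "a consumer likely failed. Start root cause analysis now."
    })


if should_alert(3):
    print(slack_payload(DLQ_NAME, 3))
```

The important design choice is the threshold of one: for a dead-letter queue, any message at all is a signal worth waking someone up for.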
Another useful feature is Synthetic Monitoring, which allows us to write scripts that check various states of our applications. I will experiment with a script that checks the status of the dead-letter queue, returning a failure whenever it is not empty, and tie this to an alert rule. This triggers our catalyst, allowing the on-call team to start their incident analysis the moment the issue occurs, and gives them the precise moment to investigate. This is particularly useful as we can find the most effective ways of surfacing errors within our environments. From here, all the logs and metrics from ActiveMQ and Talend are already in our Elastic Stack, so we can pinpoint exactly where the issue came from.
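A rough sketch of what such a check might look like, assuming ActiveMQ's Jolokia endpoint is enabled on its default web console port; the host, broker name and queue name below are placeholders for your environment. Elastic's hosted synthetic monitors are typically written in JavaScript, so treat this plain Python probe as illustrating the pass/fail logic rather than the monitor itself.

```python
import json
import urllib.request

# Placeholder URL: adjust host, brokerName and destinationName to taste.
JOLOKIA_URL = (
    "http://localhost:8161/api/jolokia/read/"
    "org.apache.activemq:type=Broker,brokerName=localhost,"
    "destinationType=Queue,destinationName=ActiveMQ.DLQ/QueueSize"
)


def parse_queue_size(jolokia_response: str) -> int:
    """Extract the QueueSize attribute from a Jolokia read response body."""
    return int(json.loads(jolokia_response)["value"])


def dlq_is_healthy(queue_size: int) -> bool:
    """The check passes only while the dead-letter queue stays empty."""
    return queue_size == 0


def run_check() -> int:
    """Invoke from the monitor; a non-zero return marks the check failed."""
    with urllib.request.urlopen(JOLOKIA_URL, timeout=10) as resp:
        size = parse_queue_size(resp.read().decode())
    return 0 if dlq_is_healthy(size) else 1
```

Wiring the check's failure state to the alert rule from the previous section closes the loop: the queue going non-empty is detected and announced within one polling interval.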
As a result, by using precise alerting rules and leveraging knowledge of the middleware, we can proactively notify ourselves, greatly reducing time-to-resolution.
Now let us turn our attention to another avenue for improving observability: traces. A trace is a group of transactions and spans with a common root. I am intrigued to place an Elastic APM agent onto a Talend remote engine's JVM to see what kind of traces it returns and how these can improve efficiency and performance. We can use this to get information on exceptions and errors within our jobs, routes and connectors to other applications such as databases, thereby aiding the development and optimization of our data integrations. Together with the two features above, this will greatly aid our quest to find root causes quickly and efficiently.
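To illustrate what we would mine those traces for, here is a toy sketch over hypothetical span records, loosely shaped like APM span documents; the span names, field names and the 500 ms latency budget are all assumptions for illustration.

```python
# Assumed latency budget per span; tune to your integration's SLOs.
SLOW_MS = 500.0


def slow_spans(spans: list[dict], budget_ms: float = SLOW_MS) -> list[str]:
    """Names of spans within one trace that blew the latency budget."""
    return [s["name"] for s in spans if s["duration_ms"] > budget_ms]


def failed_spans(spans: list[dict]) -> list[str]:
    """Names of spans that ended with a failure outcome (e.g. an exception)."""
    return [s["name"] for s in spans if s.get("outcome") == "failure"]


# Invented trace: a Talend route reading Postgres and publishing to ActiveMQ.
trace = [
    {"name": "tRouteInput",      "duration_ms": 12.0,  "outcome": "success"},
    {"name": "postgres SELECT",  "duration_ms": 843.5, "outcome": "success"},
    {"name": "ActiveMQ publish", "duration_ms": 6.2,   "outcome": "failure"},
]
print(slow_spans(trace))    # the Postgres query is the slow link
print(failed_spans(trace))  # the broker publish is the one that failed
```

Even this toy version shows the value of spans: one trace separates "what was slow" from "what actually failed", which are often different components.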
Lastly, the services inventory gives us a high-level overview of all the applications within an environment, along with their dependencies and connections. At a glance we can see useful metrics and logs. Sometimes an error creeps into our dead-letter queue but stems from somewhere completely different, say a faulty database connector. Using the services inventory, we can see that the database's logs are spiking due to an error in the connector, and that this was the real root cause of our issue.
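As a sketch of that correlation step, the snippet below counts errors per service in a time window and flags the one that is spiking; the log records and service names are invented for illustration.

```python
from collections import Counter


def error_spike(logs: list[dict], min_errors: int = 3) -> list[str]:
    """Services whose ERROR count in the window meets the spike threshold."""
    counts = Counter(l["service"] for l in logs if l["level"] == "ERROR")
    return [svc for svc, n in counts.items() if n >= min_errors]


# Toy window of logs: the connector is erroring, the DLQ symptom is downstream.
window = [
    {"service": "postgres-connector", "level": "ERROR"},
    {"service": "postgres-connector", "level": "ERROR"},
    {"service": "postgres-connector", "level": "ERROR"},
    {"service": "activemq",           "level": "WARN"},
    {"service": "talend-job",         "level": "ERROR"},
]
print(error_spike(window))  # → ['postgres-connector']
```

This is exactly the judgment the services inventory view lets you make visually: the symptom lives in ActiveMQ, but the spike lives in the connector.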
In conclusion, I am really intrigued by the full range of features within the Elastic Stack for proactive observability. Be it a pointed alert on a dead-letter queue that is no longer empty, a slow span from a Talend remote engine, or a faulty connector surfaced in the services inventory, these features can save hours, if not days, of downtime, to say nothing of the time saved on root cause analysis.