
LGTM Stack: Total Visibility, Zero Blind Spots—Full Stack Observability for Every Workload


The concept of understanding how applications and servers behave has evolved over the years. We used to call it monitoring, but in recent years that word has morphed into a new one: Observability. But wait, why did it morph? Is observability just a new buzzword to catch people's attention? What concepts are driving this?

As I always say in my LinkedIn posts, the problems found in the current way of working lead to new principles, and to new names to describe those principles.

The term Observability was used in other engineering disciplines before it made its way into infrastructure management. A quick search on Wikipedia defines Observability as:

Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In control theory, the observability and controllability of a linear system are mathematical duals.

From the definition above (borrowed from Control Theory), the emphasis of Observability is on "how well internal states of a system can be inferred from knowledge of its external outputs".

This changes how we see systems: we prefer to diagnose issues from the inside out, not from the outside in as we used to with monitoring. This concept forms the basis of a pillar of Observability called Traces, and of another, less commonly used one called Profiling. When these two signals are combined with logs and metrics, they give you a holistic view of an application's behavior.

An OOM event that evicts a pod in Kubernetes and causes it to restart is an external output. The internal inference comes from profiling instrumentation, which can show which particular function in the code is causing the increase in memory consumption.

Traces can show the journey of a request within a service and identify slow operations or queries. These are the innovations that make observability different from monitoring.

However, the real value of Observability comes from merging three major signals to gain insight into the behavior of a system: Logs, Metrics and Traces.

You need to be able to collect, store and analyze these three to begin your observability journey.

Components of the LGTM Stack

L for Loki

G for Grafana

T for Tempo

M for Mimir

Loki


Loki is an open-source log storage engine, designed to be highly scalable and durable when storing log data. It is part of the Grafana ecosystem and is mostly used as part of an observability stack. Unlike traditional logging systems that index the entire log content, Loki takes a different approach by indexing only the metadata (labels) associated with the logs. This reduces the storage and indexing overhead while allowing Loki to handle huge amounts of log data. Loki further compresses the data and stores it in an object storage engine such as Amazon S3, Google Cloud Storage or MinIO. To learn more about Loki, read the Introduction to Loki article.
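To make the ingestion model concrete, here is a minimal sketch that pushes a single log line to Loki's HTTP push API with Python's requests library. The URL and labels are assumptions for a local, default-port setup; in practice an agent such as Promtail or Grafana Alloy would do this for you.

```python
import json
import time

import requests

# Loki's push endpoint; assumes a local Loki listening on its default port 3100.
LOKI_URL = "http://localhost:3100/loki/api/v1/push"

# Loki indexes only these labels, never the log line itself.
stream_labels = {"app": "demo", "env": "dev"}

# Timestamps are nanoseconds since the epoch, sent as strings.
timestamp_ns = str(time.time_ns())

payload = {
    "streams": [
        {
            "stream": stream_labels,
            "values": [[timestamp_ns, "hello from the LGTM stack"]],
        }
    ]
}

resp = requests.post(
    LOKI_URL,
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()  # Loki returns 204 No Content on success
```

Notice that the log line itself travels as an opaque string; only the small label set is indexed, which is exactly why Loki's storage footprint stays low.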

Grafana


This is the eye into the other systems, which are all storage engines. Grafana gives you the ability to query and visualize the data stored in Loki, Tempo and Mimir. It is the face of the Grafana stack and helps you make sense of the data collected by the different Grafana storage engines.

Grafana is open-source and allows users to create interactive dashboards and alerts by connecting to various data sources such as Loki, Mimir, Elasticsearch and more. Grafana also supports plugins for additional data sources, panels and apps that do not exist in the default setup. Read our Introduction to Grafana to gain more insights.
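Data sources can be added through the UI, but Grafana also exposes an HTTP API for automating this. Below is a small sketch that registers Loki as a data source via that API; the Grafana URL and the service-account token are placeholders, not values from this article.

```python
import requests

# Assumes a local Grafana and a service-account token with admin rights;
# both values below are placeholders for illustration.
GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "<your-service-account-token>"

datasource = {
    "name": "Loki",
    "type": "loki",             # plugin id; "tempo" and "prometheus" work the same way
    "url": "http://loki:3100",  # where Grafana should reach the storage engine
    "access": "proxy",          # Grafana proxies queries server-side
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    json=datasource,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
)
resp.raise_for_status()
print(resp.json())  # echoes the created data source definition
```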

Tempo


This is the database for storing and exploring trace data. Like Loki and Grafana, it is open-source and cost-effective. It is a powerful tool for troubleshooting applications, providing insights into the performance and behavior of microservices architectures. Just like its Loki counterpart, storage is cost-effective because it uses the same type of storage engine, the object store. This reduces storage costs for trace data. Tempo can be integrated with existing observability tools: it supports popular tracing protocols such as OpenTelemetry (OTel), Jaeger and Zipkin, which allows users to send trace data from various sources.
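As a sketch of what this looks like from the application side, the snippet below uses the OpenTelemetry Python SDK (the opentelemetry-sdk and opentelemetry-exporter-otlp packages) to send spans over OTLP to a Tempo endpoint. The endpoint address and service name are assumptions for a local setup.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumes Tempo's OTLP gRPC receiver is enabled on its default port 4317.
exporter = OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True)

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each span records one operation; nesting spans builds the request's journey.
with tracer.start_as_current_span("checkout") as span:
    span.set_attribute("cart.items", 3)
    with tracer.start_as_current_span("charge-card"):
        pass  # the actual payment call would go here
```

The nested spans are what let you see, in Grafana, exactly which step of a request is the slow one.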

It also integrates seamlessly with Grafana for visualization and correlation of traces with metrics and logs in a unified user interface. This provides an all-round observability solution that is helpful for effective troubleshooting. You can read more about Grafana Tempo in the official documentation.

Mimir


This is an open-source time-series engine and a senior colleague to Prometheus. Prometheus has a default retention period of 15 days for metrics, and the only way to keep data longer is to increase the retention time. Mimir steps into these shoes and allows storage of metrics in a more standard and scalable storage engine: object storage (Amazon S3, GCS, Azure Blob or MinIO). Mimir is highly available, scalable and serves as a data aggregator for Prometheus. It also supports multi-tenancy, enabling multiple teams or organizations to share the same infrastructure while keeping their data isolated. In a previous blog post, 3 Ways to aggregate and store Prometheus Metrics long-term, I explained how Mimir is a useful tool for storing metrics, and how it shares the same data source type as Prometheus when connected to Grafana.
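Because Mimir exposes a Prometheus-compatible query API, anything that can speak PromQL can read from it. Here is a small sketch of an instant query against a Mimir instance; the URL, port and tenant id are placeholders, and the X-Scope-OrgID header is only required when multi-tenancy is enabled.

```python
import requests

# Placeholder address; port 9009 matches Grafana's getting-started examples,
# your deployment may differ.
MIMIR_URL = "http://mimir:9009/prometheus/api/v1/query"
TENANT_ID = "team-a"  # illustrative tenant id

resp = requests.get(
    MIMIR_URL,
    params={"query": "sum(rate(http_requests_total[5m])) by (job)"},
    headers={"X-Scope-OrgID": TENANT_ID},  # keeps each tenant's data isolated
)
resp.raise_for_status()

# Instant queries return one (labels, value) pair per matching series.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])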

One last important part of Observability, which was mentioned previously but is not a member of the LGTM stack, is profiling. In the Grafana stack, the tool used for profiling applications is called Pyroscope. Like every other tool mentioned above, Pyroscope is an open-source storage engine that can store, and allow querying of, profiling telemetry data. Profiling in observability involves analyzing and measuring the performance and behavior of an application or system, such as CPU usage, memory allocation, and execution time. It helps identify bottlenecks, inefficiencies, and anomalies, enabling developers to optimize performance and ensure smooth operation in production environments.
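For completeness, here is a minimal sketch of continuous profiling with the Pyroscope Python SDK (the pyroscope-io package); the server address, application name and tags are illustrative.

```python
import pyroscope  # pip install pyroscope-io

# Assumes a Pyroscope server reachable at its default port 4040.
pyroscope.configure(
    application_name="checkout.service",
    server_address="http://pyroscope:4040",
    tags={"env": "dev"},
)

# From here on, the agent continuously samples the process, so a hot
# function like this one shows up by name in Pyroscope's flame graphs.
def busy_loop():
    total = 0
    for i in range(10_000_000):
        total += i * i
    return total

busy_loop()
```

This is the "internal inference" from the OOM example earlier: the flame graph points at the exact function burning CPU or allocating memory.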

Conclusion

Observing your application and infrastructure properly can determine your key DevOps metrics and how well you are able to respond to and manage incidents. It is important to have all signals ready, with alerting, so you can fix issues faster and more efficiently.

Observability is also coming to AI; see how to instrument it via OTel.

