Log-based metrics and monitoring at scale: adventures with the Elastic Stack

Monitoring and logging at scale is a critical part of system management: it is how you track the performance and health of your applications. A common practice in large organizations is to push everything into a shared data store, typically Elasticsearch, and manage metrics as logs. This setup is usually built on the ELK stack (Elasticsearch, Logstash, and Kibana).


In this system, almost everything is treated as a log: API latencies, metrics, responses from external APIs, and database queries. This approach is nice because metrics fall out of logs almost for free, making it straightforward to set up alarms or dashboards on top of them.
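For illustration, a single structured log entry might look like the sketch below. The field names here are hypothetical, but the idea holds: because latency and status live on the log document itself, dashboards and alarms can aggregate on them directly.

```typescript
// A hypothetical "API latency" log entry. Because latencyMs and statusCode are
// fields on the log document itself, Kibana can aggregate on them (p99 latency,
// error rate) without a separate metrics pipeline.
const apiLatencyLog = {
  "@timestamp": new Date().toISOString(),
  service: "payments-api",   // assumed service name
  event: "external_api_call",
  endpoint: "/v1/charges",
  latencyMs: 412,            // numeric, so it can be aggregated
  statusCode: 200,
};

console.log(JSON.stringify(apiLatencyLog)); // shipped to Logstash in practice
```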


However, this is hard to do with untyped, loosely structured logs, which is where most companies start. Logs often lack structure and can vary widely in size and field count. Our goal was to move our application servers to Elastic, let it manage all logs, and build metrics and dashboards on top of them that would be accessible throughout the company.
During implementation, we ran into issues with Elastic’s handling of mixed data types.

Elastic Indexes and Data Types

Elasticsearch requires that a field with a given key always holds the same data type. If the first log’s result field is a number, Elastic expects every subsequent log to send a number for that field. This becomes a problem when different data sources, such as Cashfree, Razorpay, or Google, return the same field with varying data types.
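As a sketch of the failure mode (the field names here are assumed): once dynamic mapping has seen a numeric result, a later document with a different shape for that field is rejected.

```typescript
// First document indexed: dynamic mapping locks "result" in as a numeric field.
const fromCashfree = { provider: "cashfree", result: 200 };

// A later document where "result" is an object (or a string) no longer fits
// the mapping, and Elasticsearch rejects it with a mapper_parsing_exception.
const fromRazorpay = { provider: "razorpay", result: { status: "captured" } };
```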

Structuring Logs for Efficient Indexing and Debugging

Every log entry contains two key components: metric data points and debug data points. This distinction allows for both efficient indexing and comprehensive debugging.

Metric Data Points: Structured and Indexed

Metric data points capture well-defined, structured information that can be indexed for querying and analysis. For example, if an API call to Google fails, the log should record essential details such as:

• The API endpoint called

• The user ID associated with the request

• The error code or status

This structured data follows a predefined log schema, ensuring consistency across logs and enabling Elasticsearch to index and query it efficiently.
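One way to pin the metric portion down is with a type. The interface below is a hypothetical sketch rather than our actual schema; the field names are placeholders.

```typescript
// Hypothetical metric schema: every field here is well-typed and safe to index.
interface MetricDataPoints {
  event: string;       // e.g. "google_api_call_failed"
  endpoint: string;    // the API endpoint called
  userId: string;      // the user ID associated with the request
  statusCode: number;  // the error code or status
  latencyMs?: number;  // optional, but always numeric when present
}

const failedGoogleCall: MetricDataPoints = {
  event: "google_api_call_failed",
  endpoint: "https://oauth2.googleapis.com/token",
  userId: "user_123",
  statusCode: 503,
};
```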

Debug Data Points: Contextual but Unindexed

Debug data points provide additional context but are not meant for indexing. These include verbose details such as the full API response, stack traces, or detailed request payloads. To prevent unnecessary indexing overhead, this information is placed in a dedicated field, such as _debugInfo, which is explicitly excluded from indexing.
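In Elasticsearch, this exclusion can be expressed in the index mapping by disabling the _debugInfo object: its contents are kept in _source but never parsed or indexed. Below is a minimal sketch of such a mapping; the index name and the other field names are assumptions.

```typescript
// Mapping sketch: enabled: false tells Elasticsearch to store _debugInfo in
// _source but skip parsing/indexing its contents entirely.
// This object would be sent as the body of an index-creation request,
// e.g. PUT /app-logs.
const appLogsMapping = {
  mappings: {
    properties: {
      event: { type: "keyword" },
      endpoint: { type: "keyword" },
      userId: { type: "keyword" },
      statusCode: { type: "integer" },
      latencyMs: { type: "long" },
      _debugInfo: { type: "object", enabled: false }, // stored, not indexed
    },
  },
};
```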

Balancing Indexing and Debugging

By structuring logs this way, we achieve:

• Efficient querying: Metrics are indexed for dashboards and analytics.

• Comprehensive debugging: Debug information is available for troubleshooting but doesn’t clutter indexed searches.

• Scalability: Consistent log structures prevent schema conflicts and ensure smooth integration with Elasticsearch.

The logger library seamlessly integrates both metric and debug components, ensuring logs are both queryable and human-readable without unnecessary indexing overhead.
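A rough sketch of what such a logger call could look like (an illustration, not the actual library): typed metric fields land at the top level of the document, while anything passed as debug context gets nested under _debugInfo.

```typescript
// Hypothetical logger sketch, not the actual library: typed metric fields are
// spread at the top level of the document (indexed), while free-form debug
// context is nested under _debugInfo (excluded from indexing by the mapping).
interface MetricFields {
  event: string;
  endpoint: string;
  userId: string;
  statusCode: number;
}

function logEvent(metric: MetricFields, debug?: unknown): void {
  const entry = {
    "@timestamp": new Date().toISOString(),
    ...metric,
    ...(debug !== undefined ? { _debugInfo: debug } : {}),
  };
  // In practice this would be shipped to Logstash/Elasticsearch; stdout here.
  console.log(JSON.stringify(entry));
}

// Usage: queryable fields up front, verbose context as the second argument.
logEvent(
  { event: "google_api_call_failed", endpoint: "/token", userId: "user_123", statusCode: 503 },
  { rawResponse: "<html>Service Unavailable</html>", stackTrace: "Error: timeout at ..." }
);
```

The point of the split is that the first argument is constrained by the type system, while the second accepts anything and stays quarantined from the index.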

This is still WIP. I’ll share what I think is a good interface that ensures projects have solid logging practices, enforced by strong types and other checks.