Log-based metrics and monitoring at scale: adventures with the Elastic Stack

Monitoring and logging at scale is a critical part of system management: it is how you track the performance and health of your applications. A common practice in large organizations is to push everything into a shared data store, typically Elasticsearch, and manage metrics as logs. This setup is usually built on the ELK stack (Elasticsearch, Logstash, and Kibana).


In this system, almost everything is treated as a log: API latencies, metrics, responses from external APIs, and database queries. This approach is nice because metrics fall out of logs almost for free, making it straightforward to set up alarms or dashboards on top of them.
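For illustration, a single structured log entry might look like the sketch below. The field names here are hypothetical, but the idea holds: because latency and status live on the log document itself, dashboards and alarms can aggregate on them directly.

```typescript
// A hypothetical "API latency" log entry. Because latencyMs and statusCode are
// fields on the log document itself, Kibana can aggregate on them (p99 latency,
// error rate) without a separate metrics pipeline.
const apiLatencyLog = {
  "@timestamp": new Date().toISOString(),
  service: "payments-api",   // assumed service name
  event: "external_api_call",
  endpoint: "/v1/charges",
  latencyMs: 412,            // numeric, so it can be aggregated
  statusCode: 200,
};

console.log(JSON.stringify(apiLatencyLog)); // shipped to Logstash in practice
```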


However, this is hard to do with untyped, loosely structured logs, which is where most companies start. Logs often lack structure and can vary widely in size and field count. Our goal was to move our application servers to Elastic, let it manage all logs, and build metrics and dashboards on top of them that would be accessible throughout the company.
During implementation, we ran into issues with Elastic’s handling of mixed data types.

Elastic Indexes and Data Types

Elasticsearch requires that a field with a given key always holds the same data type. If the first log’s result field is a number, Elastic expects every subsequent log to send a number for that field. This becomes a problem when different data sources, such as Cashfree, Razorpay, or Google, return the same field with varying data types.
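As a sketch of the failure mode (the field names here are assumed): once dynamic mapping has seen a numeric result, a later document with a different shape for that field is rejected.

```typescript
// First document indexed: dynamic mapping locks "result" in as a numeric field.
const fromCashfree = { provider: "cashfree", result: 200 };

// A later document where "result" is an object (or a string) no longer fits
// the mapping, and Elasticsearch rejects it with a mapper_parsing_exception.
const fromRazorpay = { provider: "razorpay", result: { status: "captured" } };
```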

Structuring Logs for Efficient Indexing and Debugging

Every log entry contains two key components: metric data points and debug data points. This distinction allows for both efficient indexing and comprehensive debugging.

Metric Data Points: Structured and Indexed

Metric data points capture well-defined, structured information that can be indexed for querying and analysis. For example, if an API call to Google fails, the log should record essential details such as:

• The API endpoint called

• The user ID associated with the request

• The error code or status

This structured data follows a predefined log schema, ensuring consistency across logs and enabling Elasticsearch to index and query it efficiently.
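One way to pin the metric portion down is with a type. The interface below is a hypothetical sketch rather than our actual schema; the field names are placeholders.

```typescript
// Hypothetical metric schema: every field here is well-typed and safe to index.
interface MetricDataPoints {
  event: string;       // e.g. "google_api_call_failed"
  endpoint: string;    // the API endpoint called
  userId: string;      // the user ID associated with the request
  statusCode: number;  // the error code or status
  latencyMs?: number;  // optional, but always numeric when present
}

const failedGoogleCall: MetricDataPoints = {
  event: "google_api_call_failed",
  endpoint: "https://oauth2.googleapis.com/token",
  userId: "user_123",
  statusCode: 503,
};
```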

Debug Data Points: Contextual but Unindexed

Debug data points provide additional context but are not meant for indexing. These include verbose details such as the full API response, stack traces, or detailed request payloads. To prevent unnecessary indexing overhead, this information is placed in a dedicated field, such as _debugInfo, which is explicitly excluded from indexing.
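In Elasticsearch, this exclusion can be expressed in the index mapping by disabling the _debugInfo object: its contents are kept in _source but never parsed or indexed. Below is a minimal sketch of such a mapping; the index name and the other field names are assumptions.

```typescript
// Mapping sketch: enabled: false tells Elasticsearch to store _debugInfo in
// _source but skip parsing/indexing its contents entirely.
// This object would be sent as the body of an index-creation request,
// e.g. PUT /app-logs.
const appLogsMapping = {
  mappings: {
    properties: {
      event: { type: "keyword" },
      endpoint: { type: "keyword" },
      userId: { type: "keyword" },
      statusCode: { type: "integer" },
      latencyMs: { type: "long" },
      _debugInfo: { type: "object", enabled: false }, // stored, not indexed
    },
  },
};
```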

Balancing Indexing and Debugging

By structuring logs this way, we achieve:

• Efficient querying: Metrics are indexed for dashboards and analytics.

• Comprehensive debugging: Debug information is available for troubleshooting but doesn’t clutter indexed searches.

• Scalability: Consistent log structures prevent schema conflicts and ensure smooth integration with Elasticsearch.

The logger library seamlessly integrates both metric and debug components, ensuring logs are both queryable and human-readable without unnecessary indexing overhead.
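A rough sketch of what such a logger call could look like (an illustration, not the actual library): typed metric fields land at the top level of the document, while anything passed as debug context gets nested under _debugInfo.

```typescript
// Hypothetical logger sketch, not the actual library: typed metric fields are
// spread at the top level of the document (indexed), while free-form debug
// context is nested under _debugInfo (excluded from indexing by the mapping).
interface MetricFields {
  event: string;
  endpoint: string;
  userId: string;
  statusCode: number;
}

function logEvent(metric: MetricFields, debug?: unknown): void {
  const entry = {
    "@timestamp": new Date().toISOString(),
    ...metric,
    ...(debug !== undefined ? { _debugInfo: debug } : {}),
  };
  // In practice this would be shipped to Logstash/Elasticsearch; stdout here.
  console.log(JSON.stringify(entry));
}

// Usage: queryable fields up front, verbose context as the second argument.
logEvent(
  { event: "google_api_call_failed", endpoint: "/token", userId: "user_123", statusCode: 503 },
  { rawResponse: "<html>Service Unavailable</html>", stackTrace: "Error: timeout at ..." }
);
```

The point of the split is that the first argument is constrained by the type system, while the second accepts anything and stays quarantined from the index.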

This is still WIP. I’ll share what I think is a good interface that ensures projects have solid logging practices, enforced by strong types and other checks.