APM best practices: Dos and don’ts guide for practitioners

Application performance management (APM) is the practice of regularly tracking, measuring, and analyzing the performance and availability of software applications. APM gives you visibility into complex microservices environments, which can otherwise overwhelm site reliability engineering (SRE) teams. The insights it generates help teams deliver a better user experience and meet business goals. It’s a complex process, but the goal is straightforward: ensuring that an application runs smoothly and meets the expectations of users and the business.

A clear understanding of an application's operation and a proactive APM practice are crucial for maintaining high-performing software applications. APM shouldn’t be an afterthought. It should be considered from the beginning. When implemented proactively, it can be incorporated into how software runs by embedding monitoring components directly into the application.

    from opentelemetry import trace
    from opentelemetry.trace import Status, StatusCode

    tracer = trace.get_tracer(__name__)

    # Auto-instrumentation handles the HTTP request span automatically
    @app.route('/api/orders')
    def create_order():
        # Add a manual span only for critical business logic
        with tracer.start_as_current_span("order.validation") as span:
            span.set_attribute("order.value", order_total)
            if not validate_order(order_data):
                span.set_status(Status(StatusCode.ERROR))
                return "Invalid order", 400

  • Do: Start with auto-instrumentation, then add manual spans for business-critical operations.

  • Don't: Manually instrument every function call — you'll create performance overhead and noise.

  • Pitfall: Over-instrumentation can add 15%–20% latency. Monitor your monitoring with baseline performance comparisons.
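One way to "monitor your monitoring" is to time the same code path with and without instrumentation and compare the results. The sketch below is a minimal illustration, not a benchmark harness: `median_latency` and `overhead_percent` are hypothetical helper names, and the two handlers passed in are stand-ins for your own uninstrumented and instrumented code.

```python
import time
from statistics import median


def median_latency(func, runs=200):
    """Return the median wall-clock time of func() over several runs, in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return median(samples)


def overhead_percent(baseline_s, instrumented_s):
    """Latency added by instrumentation, as a percentage of the baseline."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return (instrumented_s - baseline_s) / baseline_s * 100.0


if __name__ == "__main__":
    # Hypothetical stand-ins: replace with the real handler, run with
    # tracing disabled and then enabled.
    def handler():
        return sum(range(10_000))

    baseline = median_latency(handler)
    instrumented = median_latency(handler)  # same code here, so overhead is near zero
    print(f"overhead: {overhead_percent(baseline, instrumented):.1f}%")
```

Using the median rather than the mean keeps a single slow run (garbage collection, a cold cache) from distorting the comparison.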

A few components for an organization or business to consider when developing an APM strategy are:

  • Performance monitoring, including evaluating latency, service level objectives, response time, throughput, and request volumes

  • Error tracking, including exceptions, crashes, and failed API calls 

  • Infrastructure monitoring, including health and resource usage of servers, containers, and cloud environments that support the application

  • User experience metrics, including load times, session performance, click paths, and browser or device details (It’s important to keep in mind that even if system metrics look fine, users may still encounter performance issues.)
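The performance and error metrics listed above can all be derived from raw request records. The following sketch assumes each request is recorded as a `(duration_ms, status_code)` tuple and treats any 5xx status as an error. Both choices, and the function names, are illustrative assumptions rather than a prescribed schema.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Summarize (duration_ms, status_code) tuples observed during one time window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = [duration for duration, _ in requests]
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "throughput_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }
```

Percentiles are used instead of averages because a small number of very slow requests can hide behind a healthy-looking mean.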

Key principles of effective APM

The core principles of effective application performance management are end-to-end visibility (from the user's browser to the database), real-time monitoring and insights, and contextual insights, with a user- and business-objective focus. APM can improve application scalability by enabling continuous improvements and increasing performance over time.

  • Do: Implement real-time dashboards with SLO-based alerts rather than arbitrary thresholds.

  • Don't: Rely only on periodic performance reviews or CPU/memory alerts — instrument user experience metrics.

  • Pitfall: Alert fatigue from low-level system metrics. Focus on user-facing SLOs that indicate real problems.
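To make SLO-based alerting concrete, the sketch below computes how much of an error budget remains and how quickly it is being consumed. It assumes a simple request-based availability SLO. The paging threshold of 10x burn rate is a hypothetical value chosen for illustration, and the function names are not taken from any particular tool.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget left in the window (1.0 = untouched, <= 0 = exhausted)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target must be below 1.0 and traffic must be non-zero")
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo_target, total_requests, failed_requests):
    """Observed error rate divided by the allowed error rate (1.0 = exactly on budget)."""
    if total_requests <= 0:
        raise ValueError("no traffic in window")
    return (failed_requests / total_requests) / (1.0 - slo_target)


def should_page(slo_target, total_requests, failed_requests, threshold=10.0):
    """Page only when the budget is burning much faster than planned (threshold is an assumption)."""
    return burn_rate(slo_target, total_requests, failed_requests) >= threshold
```

For example, with a 99.9% target and 1,000,000 requests in the window, 1,000 failures are allowed. At 250 failures, 75% of the budget remains and the burn rate is 0.25, so nobody is paged.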

When creating an APM strategy, here are a few key principles to consider:

1. Proactive monitoring: Prevent issues before they impact users by setting up alerts and responding quickly to any anomalies. But try to avoid alert fatigue. Balance automated alerts with human oversight so important issues don’t get missed, focusing on outcomes rather than system metrics. 

2. Real-time insights: Move beyond logging issues and enable fast decision-making based on live data and real-time dashboards that prioritize the most critical business transactions. Use telemetry data (logs, metrics, and traces) to inform your performance insights.

3. End-to-end visibility: Monitor the application across the entire environment, the entire user flow, and all layers, from frontend to backend.

4. User-centric approach: Prioritize performance and experience from an end-user perspective, while considering key business objectives.

5. Real user monitoring: The work doesn’t stop when the application is in your users’ hands. By monitoring their experience, you can iterate and improve based on their feedback.

6. Continuous improvement: Use insights to optimize over time and regularly uncover and tackle unreported issues. Issues should be addressed dynamically rather than when discovered in periodic performance reviews. 

7. Context propagation: Ensure trace context flows through your entire request path, especially across service boundaries:

    import requests
    from opentelemetry import propagate

    # Outgoing request - inject context
    headers = {}
    propagate.inject(headers)
    response = requests.post('http://service-b/process', headers=headers)

8. Sampling strategy: Use intelligent sampling to balance visibility with performance:

  • 1%–10% head-based sampling for high-traffic services

  • 100% sampling for errors and slow requests using tail-based sampling

  • Monitor instrumentation overhead — aim for <5% performance impact

    import static io.opentelemetry.api.common.AttributeKey.stringKey;

    import io.opentelemetry.api.common.Attributes;
    import io.opentelemetry.api.trace.Span;

    @RestController
    public class OrderController {

        @PostMapping("/orders")
        public ResponseEntity createOrder(@RequestBody OrderRequest request) {
            // Auto-instrumentation captures this endpoint automatically.
            // Add custom business context to the current span.
            Span.current().setAllAttributes(Attributes.of(
                stringKey("order.value"), String.valueOf(request.getTotal()),
                stringKey("user.tier"), request.getUserTier()
            ));
            return ResponseEntity.ok(processOrder(request));
        }
    }

  • Do: Implement sampling strategies and monitor instrumentation overhead in production.

  • Don't: Use 100% sampling for high-traffic services — you'll impact performance and explode storage costs.

  • Pitfall: Head-based sampling can miss critical error traces. Use tail-based sampling to capture all errors while reducing volume.
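The difference between the two sampling styles can be sketched in plain Python. This is a simplified illustration, not how any specific collector implements sampling: `head_sample` and `tail_sample` are hypothetical names, the trace is assumed to be a dict with `trace_id`, `error`, and `duration_ms` fields, and the 5% rate and 2,000 ms slow threshold are example values.

```python
import hashlib


def head_sample(trace_id, rate):
    """Head-based decision made at trace start: hash the trace ID and keep a fixed fraction.

    Hashing makes the decision deterministic, so every service that sees
    the same trace ID reaches the same verdict.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def tail_sample(trace, head_rate=0.05, slow_ms=2000):
    """Tail-based decision made after the trace completes.

    Always keep errors and slow requests; sample everything else at head_rate.
    """
    if trace["error"] or trace["duration_ms"] >= slow_ms:
        return True
    return head_sample(trace["trace_id"], head_rate)
```

Because the tail-based decision sees the finished trace, it can keep every error while still dropping most routine traffic. The cost is that complete traces must be buffered until the decision is made.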

Here’s how to get it right:

  • Select the right APM solution: The right APM tool should align with an application's architecture and the organization's needs. The solution should provide an organization with the tools and capabilities it needs to monitor, track, measure, and analyze its software applications. A business may use OpenTelemetry, an open source observability framework, to instrument and collect telemetry data (traces, metrics, and logs) from applications. 

  • Manage cardinality to control costs: High-cardinality attributes can make metrics unusable and expensive:
    # Good - bounded cardinality
    span.set_attribute("user.tier", user.subscription_tier)       # 3-5 values
    span.set_attribute("http.status_code", response.status_code)  # ~10 values

    # Bad - unbounded cardinality
    span.set_attribute("user.id", user.id)            # Millions of values
    span.set_attribute("request.timestamp", now())    # Infinite values
  • Set up intelligent alerting based on SLOs rather than arbitrary thresholds. Use error budgets to determine when to page someone:
    slos:
      - name: checkout_availability
        target: 99.9%
        window: 7d
      - name: checkout_latency
        target: 95%  # 95% of requests under 500ms
        window: 7d

  • Train teams and promote collaboration. An APM strategy impacts a wide range of stakeholders, not just developers. Be sure to involve IT teams and other business stakeholders in cross-departmental collaboration. Work together by implementing APM into your organizational setup. Make sure to establish clear goals and KPIs that align with business needs and consider user experience. 

  • Review and evaluate. An APM strategy continues to evolve and change alongside application and business needs.
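    from prometheus_client import Histogram

    order_processing_duration = Histogram(
        "order_processing_seconds",
        "Time to process orders",
        ["payment_method", "order_size"]
    )

    # get_size_bucket is an application-defined helper that maps an order
    # total to a small, fixed set of labels to keep cardinality bounded
    with order_processing_duration.labels(
        payment_method=payment.method,
        order_size=get_size_bucket(order.total)
    ).time():
        process_order(order)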
  • Synthetic monitoring: Simulates user interactions to detect issues before real users are affected. Critical for external dependencies:
    // Synthetic check for critical user flow
    const syntheticCheck = async () => {
      const span = tracer.startSpan('synthetic.checkout_flow');
      try {
        await loginUser();
        await addItemToCart();
        await completePurchase();
        span.setStatus({ code: SpanStatusCode.OK });
      } catch (error) {
        span.recordException(error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    };

  • Deep-dive diagnostics and profiling: Helps troubleshoot complex performance bottlenecks, including those caused by third-party plugins or tools. Application profiling goes a level deeper, showing how the code performs function by function.

  • Distributed tracing: Essential for microservices architectures. Handle context propagation carefully across async boundaries:
    from opentelemetry import propagate

    # Event-driven systems - propagate context through messages
    def publish_order_event(order_data):
        headers = {}
        propagate.inject(headers)
        message = {
            'data': order_data,
            'trace_headers': headers  # Preserve trace context
        }
        kafka_producer.send('order-events', message)

APM data analysis and insights

Monitoring and gathering data is just the beginning. Businesses need to understand how to interpret application performance management data for tuning and decision-making.

Identifying trends and patterns helps teams proactively detect issues. Use correlation analysis to link user complaints with backend performance. See an example here using ES|QL (Elastic’s query language):

    FROM traces-apm*
    | WHERE user.id == "user_12345"
        AND @timestamp >= "2024-06-06T09:00:00"
        AND @timestamp <= "2024-06-06T10:00:00"
    | EVAL duration_ms = transaction.duration.us / 1000
    | KEEP trace.id, duration_ms, transaction.name, service.name, transaction.result
    | WHERE duration_ms > 2000
    | SORT duration_ms DESC
    | LIMIT 10

Detecting bottlenecks: APM reveals common performance anti-patterns, such as the N+1 query problem shown in the code below. Use APM data to find and fix them:

    # N+1 query problem detected by APM
    def get_user_orders_slow(user_id):
        user = User.query.get(user_id)
        orders = []
        for order_id in user.order_ids:  # Each iteration = 1 DB query
            orders.append(Order.query.get(order_id))
        return orders

    # Optimized after APM analysis
    def get_user_orders_fast(user_id):
        return Order.query.filter(Order.user_id == user_id).all()  # Single query
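In trace data, an N+1 pattern typically shows up as many near-identical database spans inside a single trace. The sketch below illustrates one way to flag it, assuming spans have been exported as dicts with `trace_id` and `db_statement` keys. The field names, the `find_n_plus_one` helper, and the threshold of five repeats are all assumptions made for this example.

```python
import re
from collections import Counter


def normalize(statement):
    """Replace literal values so queries that differ only in parameters group together."""
    statement = re.sub(r"'[^']*'", "?", statement)   # quoted string literals
    statement = re.sub(r"\b\d+\b", "?", statement)   # standalone numeric literals
    return " ".join(statement.split())               # collapse whitespace


def find_n_plus_one(spans, threshold=5):
    """Return {(trace_id, normalized_statement): count} for statements repeated within one trace."""
    counts = Counter(
        (span["trace_id"], normalize(span["db_statement"])) for span in spans
    )
    return {key: count for key, count in counts.items() if count >= threshold}
```

A trace containing one user lookup followed by ten `SELECT * FROM orders WHERE id = ...` queries would be reported once, with a count of 10, pointing directly at the loop that needs to be replaced by a single query.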

Correlating metrics and linking user complaints with backend performance data, including historical data, reveals how different parts of the system interact. This can help teams accurately diagnose root causes and understand the full impact of performance issues.
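As a rough illustration of correlating two signals, the sketch below computes the Pearson correlation coefficient between two equally sized series, for example hourly complaint counts and hourly p95 latency. The `pearson` function name and the choice of inputs are assumptions for this example. A value near 1 suggests the two move together, but it does not by itself identify a cause.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length and at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / math.sqrt(var_x * var_y)
```

A strong coefficient is a prompt to open the slow traces for that period, not a diagnosis in its own right.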

Automating root cause analysis and using AI/machine learning-based tools such as AIOps helps to accelerate diagnostics and resolution by pinpointing the source of problems, reducing downtime, and freeing up resources.

It’s important to use a holistic picture of your data to inform future decisions. The more data you have, the more you can leverage.

  • Do: Use distributed traces to identify the specific service and operation causing slowdowns.

  • Don't: Assume correlation means causation — verify with code-level profiling data.

  • Pitfall: Legacy systems often appear as black boxes in traces. Use log correlation and synthetic spans to maintain visibility.

    // Java - Auto-propagation with Spring Cloud
    @PostMapping("/orders")
    public ResponseEntity createOrder(@RequestBody OrderRequest request) {
        Span.current().setAllAttributes(Attributes.of(
            stringKey("order.type"), request.getOrderType(),
            longKey("order.value"), request.getTotalValue()));
        // OpenFeign automatically propagates context to downstream services
        return paymentClient.processPayment(request.getPaymentData());
    }

    // Go - Manual context extraction and propagation
    func processHandler(w http.ResponseWriter, r *http.Request) {
        ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
        ctx, span := tracer.Start(ctx, "process_payment")
        defer span.End()
        // Continue with trace context maintained
    }

Legacy system integration: Create observability bridges for systems that can't be directly instrumented:

    # Synthetic spans with correlation IDs for mainframe calls
    with tracer.start_as_current_span("mainframe.account_lookup") as span:
        correlation_id = format(span.get_span_context().trace_id, '032x')
        logger.info("CICS call started", extra={
            "correlation_id": correlation_id,
            "trace_id": span.get_span_context().trace_id
        })
        result = call_mainframe_service(account_data, correlation_id)
        span.set_attribute("account.status", result.status)

Advanced trace analysis with ES|QL: Link user complaints to backend performance using Elastic's query language:

    // Find slow requests during complaint timeframe
    FROM traces-apm*
    | WHERE user.id == "user_12345" AND @timestamp >= "2024-06-06T09:00:00"
    | EVAL duration_ms = transaction.duration.us / 1000
    | WHERE duration_ms > 2000
    | STATS avg_duration = AVG(duration_ms) BY service.name, transaction.name
    | SORT avg_duration DESC

    // Correlate errors across service boundaries
    // (multiplying by 100.0 avoids integer division)
    FROM traces-apm*
    | WHERE trace.id == "44b3c2c06e15d444a770b87daab45c0a"
    | EVAL is_error = CASE(transaction.result == "error", 1, 0)
    | STATS error_rate = SUM(is_error) * 100.0 / COUNT(*) BY service.name
    | WHERE error_rate > 0

Event-driven architecture patterns: Explicitly propagate context through message headers for async processing:

    # Producer - inject context into message
    headers = {}
    propagate.inject(headers)
    message = {
        'data': order_data,
        'trace_headers': headers  # Preserve trace context
    }
    await kafka_producer.send('order-events', message)

    # Consumer - extract and continue trace
    trace_headers = message.get('trace_headers', {})
    context = propagate.extract(trace_headers)
    with tracer.start_as_current_span("order.process", context=context):
        await process_order(message['data'])
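To show what actually travels inside those message headers, here is a dependency-free sketch of the W3C Trace Context `traceparent` header, which has the form `00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`. The `inject` and `extract` functions below are simplified stand-ins written for illustration. They are not the OpenTelemetry implementations and they ignore the companion `tracestate` header.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the current trace context into a header dict."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Consumer side: parse traceparent, returning None if it is missing or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": (int(flags, 16) & 1) == 1,
    }


def start_child_span(parent_context):
    """Continue the trace: keep the trace ID and generate a fresh span ID."""
    return {
        "trace_id": parent_context["trace_id"],
        "parent_span_id": parent_context["parent_span_id"],
        "span_id": secrets.token_hex(8),
    }
```

If the consumer receives no valid header, `extract` returns `None` and the consumer has to start a new trace. That is exactly the broken-trace symptom that appears when context is not propagated through the queue.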

  • Do: Use ES|QL for complex trace analysis that traditional dashboards can't handle.

  • Don't: Try to instrument legacy systems directly — use correlation IDs and synthetic spans.

  • Pitfall: Message queues and async processing break trace context unless explicitly propagated through headers.

  • Key insight: Perfect instrumentation isn't always possible. Strategic use of correlation IDs, synthetic spans, and intelligent querying provides comprehensive observability even in complex, hybrid environments.