Inside this week's LWN.net Weekly Edition:

Front: Kernel features from Python; i686 in Fedora; Kernel development with LLMs; Rust drivers; Load balancing with machine learning; Transparent huge pages.
Briefs: Bcachefs removal; Coccinelle for Rust; Netdev Foundation; Oracle Linux 10; GNU HHIS 5.0; Rust 1.88.0; Quotes; ...
Announcements: Newsletters, conferences, security updates, patches, and more.

APM best practices: Dos and don’ts guide for practitioners

2025-07-03 00:00

Application performance management (APM) is the practice of regularly tracking, measuring, and analyzing the performance and availability of software applications. APM gives you visibility into complex microservices environments, which can otherwise overwhelm site reliability engineering (SRE) teams. The insights it generates help teams deliver a better user experience and meet business objectives. It’s a complex process, but the goal is straightforward: ensuring that an application runs smoothly and meets the expectations of users and the business.

A clear understanding of an application's operation and a proactive APM practice are crucial for maintaining high-performing software applications. APM shouldn’t be an afterthought. It should be considered from the beginning. When implemented proactively, it can be incorporated into how software runs by embedding monitoring components directly into the application.

    # Auto-instrumentation handles this automatically
    @app.route('/api/orders')
    def create_order():
        # Add manual span only for critical business logic
        with tracer.start_as_current_span("order.validation") as span:
            span.set_attribute("order.value", order_total)
            if not validate_order(order_data):
                span.set_status(Status(StatusCode.ERROR))
                return 400

Do: Start with auto-instrumentation, then add manual spans for business-critical operations.
Don't: Manually instrument every function call — you'll create performance overhead and noise.
Pitfall: Over-instrumentation can add 15%–20% latency. Monitor your monitoring with baseline performance comparisons.
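A baseline comparison can be as simple as timing the same workload with and without instrumentation. The sketch below is a minimal illustration using only the standard library: `traced` is a hypothetical stand-in for a real instrumentation wrapper, and `handle_request` is a placeholder workload, so the printed figure says nothing about any particular APM agent.

```python
import time
from functools import wraps


def traced(func):
    """Hypothetical stand-in for an instrumentation wrapper: records call durations."""
    durations = []

    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            durations.append(time.perf_counter() - start)

    wrapper.durations = durations
    return wrapper


def measure(func, iterations=10_000):
    """Return total wall-clock seconds for `iterations` calls of func."""
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    return time.perf_counter() - start


def overhead_percent(baseline_seconds, instrumented_seconds):
    """Relative overhead of the instrumented run versus the baseline, in percent."""
    return (instrumented_seconds - baseline_seconds) / baseline_seconds * 100.0


def handle_request():
    # Placeholder workload standing in for real request handling.
    return sum(range(200))


if __name__ == "__main__":
    baseline = measure(handle_request)
    instrumented = measure(traced(handle_request))
    print(f"overhead: {overhead_percent(baseline, instrumented):.1f}%")
```

Running the comparison several times and on production-like hardware gives a more trustworthy number than a single local run.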

A few components for an organization or business to consider when developing an APM strategy are:

Performance monitoring, including evaluating latency, service level objectives, response time, throughput, and request volumes
Error tracking, including exceptions, crashes, and failed API calls
Infrastructure monitoring, including health and resource usage of servers, containers, and cloud environments that support the application
User experience metrics, including load times, session performance, click paths, and browser or device details (It’s important to keep in mind that even if system metrics look fine, users may still encounter performance issues.)

Key principles of effective APM

The core principles of effective application performance management are end-to-end visibility (from the user's browser to the database), real-time monitoring and insights, and contextual insights, with a user- and business-objective focus. APM can improve application scalability by enabling continuous improvements and increasing performance over time.

Do: Implement real-time dashboards with SLO-based alerts rather than arbitrary thresholds.
Don't: Rely only on periodic performance reviews or CPU/memory alerts — instrument user experience metrics.
Pitfall: Alert fatigue from low-level system metrics. Focus on user-facing SLOs that indicate real problems.

When creating an APM strategy, here are a few key principles to consider:

1. Proactive monitoring: Prevent issues before they impact users by setting up alerts and responding quickly to any anomalies. But try to avoid alert fatigue. Balance automated alerts with human oversight so important issues don’t get missed, focusing on outcomes rather than system metrics.

2. Real-time insights: Move beyond logging issues and enable fast decision-making based on live data and real-time dashboards that prioritize the most critical business transactions. Use telemetry data (logs, metrics, and traces) to parse your performance insights.

3. End-to-end visibility: Monitor the application across the entire environment, the entire user flow, and all layers, from frontend to backend.

4. User-centric approach: Prioritize performance and experience from an end-user perspective, while considering key business objectives.

5. Real user monitoring: The work doesn’t stop when it’s in your user’s hands. By monitoring their experience, you can iterate and improve based on their feedback.

6. Continuous improvement: Use insights to optimize over time and regularly uncover and tackle unreported issues. Issues should be addressed dynamically rather than when discovered in periodic performance reviews.

7. Context propagation: Ensure trace context flows through your entire request path, especially across service boundaries:

    # Outgoing request - inject context
    headers = {}
    propagate.inject(headers)
    response = requests.post('http://service-b/process', headers=headers)

8. Sampling strategy: Use intelligent sampling to balance visibility with performance:

1%–10% head-based sampling for high-traffic services
100% sampling for errors and slow requests using tail-based sampling
Monitor instrumentation overhead — aim for <5% performance impact

    @RestController
    public class OrderController {

        @PostMapping("/orders")
        public ResponseEntity createOrder(@RequestBody OrderRequest request) {
            // Auto-instrumentation captures this endpoint automatically
            // Add custom business context
            Span.current().setAllAttributes(Attributes.of(
                stringKey("order.value"), String.valueOf(request.getTotal()),
                stringKey("user.tier"), request.getUserTier()
            ));
            return ResponseEntity.ok(processOrder(request));
        }
    }

Do: Implement sampling strategies and monitor instrumentation overhead in production.
Don't: Use 100% sampling for high-traffic services — you'll impact performance and explode storage costs.
Pitfall: Head-based sampling can miss critical error traces. Use tail-based sampling to capture all errors while reducing volume.
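The tail-based rule described above can be pictured as a decision made once a trace has finished, when its outcome is known. The following is a minimal sketch of that decision only. `TraceSummary`, `keep_trace`, the 2,000 ms threshold, and the 5% baseline rate are illustrative assumptions and not part of any OpenTelemetry API; in practice this logic usually lives in a collector rather than in application code.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a completed trace."""
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace, slow_threshold_ms=2000.0, baseline_rate=0.05, rng=random.random):
    """Decide whether to keep a completed trace.

    Errors and slow requests are always kept; everything else is
    sampled at `baseline_rate` (a value between 0 and 1).
    """
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_threshold_ms:
        return True
    return rng() < baseline_rate


if __name__ == "__main__":
    traces = [
        TraceSummary("a", 120.0, False),
        TraceSummary("b", 3400.0, False),
        TraceSummary("c", 90.0, True),
    ]
    kept = [t.trace_id for t in traces if keep_trace(t)]
    print(kept)
```

Passing `rng` as a parameter keeps the decision deterministic in tests while defaulting to random sampling elsewhere.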

Here’s how to get it right:

Select the right APM solution: The right APM tool should align with an application's architecture and the organization's needs. The solution should provide an organization with the tools and capabilities it needs to monitor, track, measure, and analyze its software applications. A business may use OpenTelemetry, an open source observability framework, to instrument and collect telemetry data (traces, metrics, and logs) from applications.
Manage cardinality to control costs: High-cardinality attributes can make metrics unusable and expensive:

    # Good - bounded cardinality
    span.set_attribute("user.tier", user.subscription_tier)        # 3-5 values
    span.set_attribute("http.status_code", response.status_code)   # ~10 values

    # Bad - unbounded cardinality
    span.set_attribute("user.id", user.id)                # Millions of values
    span.set_attribute("request.timestamp", now())        # Infinite values

Set up intelligent alerting based on SLOs rather than arbitrary thresholds. Use error budgets to determine when to page someone:

    slos:
      - name: checkout_availability
        target: 99.9%
        window: 7d
      - name: checkout_latency
        target: 95%  # 95% of requests under 500ms
        window: 7d
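To make the error-budget idea concrete, here is a small sketch of how a budget and its burn rate could be computed from request counts. The `SLO` class, the function names, and the paging threshold of 10 are assumptions chosen for illustration; a real alerting policy would pick thresholds and evaluation windows to suit the service.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """Hypothetical SLO definition; target is a fraction, e.g. 0.999 for 99.9%."""
    name: str
    target: float


def error_budget(slo, total_requests):
    """Number of requests that may fail in the window without breaching the SLO."""
    return (1.0 - slo.target) * total_requests


def burn_rate(slo, total_requests, failed_requests):
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly at the pace
    the SLO allows; higher values mean it will run out early.
    """
    if total_requests == 0:
        return 0.0
    observed = failed_requests / total_requests
    allowed = 1.0 - slo.target
    return observed / allowed


def should_page(slo, total_requests, failed_requests, threshold=10.0):
    """Page only when the budget is burning much faster than allowed.

    The default threshold is an arbitrary illustrative value.
    """
    return burn_rate(slo, total_requests, failed_requests) >= threshold


if __name__ == "__main__":
    checkout = SLO("checkout_availability", 0.999)
    print(error_budget(checkout, 100_000))
    print(should_page(checkout, 100_000, 2_000))
```

Alerting on burn rate rather than on a raw error count ties the page to how quickly the SLO is at risk, which is the point of SLO-based alerting.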

Train teams and promote collaboration. An APM strategy impacts a wide range of stakeholders, not just developers. Be sure to involve IT teams and other business stakeholders in cross-departmental collaboration. Work together by implementing APM into your organizational setup. Make sure to establish clear goals and KPIs that align with business needs and consider user experience.
Review and evaluate. An APM strategy continues to evolve and change alongside application and business needs.

    order_processing_duration = Histogram(
        "order_processing_seconds",
        "Time to process orders",
        ["payment_method", "order_size"]
    )

    with order_processing_duration.labels(
        payment_method=payment.method,
        order_size=get_size_bucket(order.total)
    ).time():
        process_order(order)
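The histogram snippet calls `get_size_bucket`, which is not defined in the text. One plausible shape for such a helper, assuming its purpose is to keep the `order_size` label at a small, fixed number of values, is sketched below. The boundaries and label names are hypothetical.

```python
# Hypothetical bucket boundaries (upper bounds, exclusive) and their labels.
SIZE_BUCKETS = (
    (25.0, "small"),
    (100.0, "medium"),
    (500.0, "large"),
)
LARGEST_BUCKET = "xlarge"


def get_size_bucket(total):
    """Map an order total to one of four labels so label cardinality stays bounded."""
    for upper_bound, label in SIZE_BUCKETS:
        if total < upper_bound:
            return label
    return LARGEST_BUCKET


if __name__ == "__main__":
    for amount in (10.0, 25.0, 499.99, 500.0):
        print(amount, get_size_bucket(amount))
```

Because the function can only return four distinct strings, the metric produces at most four series per payment method, however many distinct order totals occur.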

Synthetic monitoring: Simulates user interactions to detect issues before real users are affected. Critical for external dependencies:

    // Synthetic check for critical user flow
    const syntheticCheck = async () => {
      const span = tracer.startSpan('synthetic.checkout_flow');
      try {
        await loginUser();
        await addItemToCart();
        await completePurchase();
        span.setStatus({code: SpanStatusCode.OK});
      } catch (error) {
        span.recordException(error);
        span.setStatus({code: SpanStatusCode.ERROR});
        throw error;
      } finally {
        span.end();
      }
    };

Deep-dive diagnostics and profiling: Helps troubleshoot complex performance bottlenecks, which could include third-party plugins or tools. Through application profiling, you can go deeper into your data and analyze how the application performs function by function.
Distributed tracing: Essential for microservices architectures. Handle context propagation carefully across async boundaries:

    # Event-driven systems - propagate context through messages
    def publish_order_event(order_data):
        headers = {}
        propagate.inject(headers)
        message = {
            'data': order_data,
            'trace_headers': headers  # Preserve trace context
        }
        kafka_producer.send('order-events', message)

APM data analysis and insights

Monitoring and gathering data is just the beginning. Businesses need to understand how to interpret application performance management data for tuning and decision-making.

Identifying trends and patterns helps teams proactively detect issues. Use correlation analysis to link user complaints with backend performance. See an example here using ES|QL (Elastic’s query language):

    FROM traces-apm*
    | WHERE user.id == "user_12345"
        AND @timestamp >= "2024-06-06T09:00:00"
        AND @timestamp <= "2024-06-06T10:00:00"
    | EVAL duration_ms = transaction.duration.us / 1000
    | KEEP trace.id, duration_ms, transaction.name, service.name, transaction.result
    | WHERE duration_ms > 2000
    | SORT duration_ms DESC
    | LIMIT 10

Detecting bottlenecks: APM reveals common performance anti-patterns, such as the N+1 query problem shown in the code below. Use what APM surfaces to optimize the code:

    # N+1 query problem detected by APM
    def get_user_orders_slow(user_id):
        user = User.query.get(user_id)
        orders = []
        for order_id in user.order_ids:  # Each iteration = 1 DB query
            orders.append(Order.query.get(order_id))
        return orders

    # Optimized after APM analysis
    def get_user_orders_fast(user_id):
        return Order.query.filter(Order.user_id == user_id).all()  # Single query
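How an APM tool detects this pattern varies by product. As a rough illustration of the idea, an N+1 shows up in trace data as the same database statement repeated many times within one trace. The sketch below assumes spans are available as plain dictionaries with hypothetical `trace_id`, `type`, and `statement` keys; it is not any vendor's detection logic.

```python
from collections import Counter


def find_n_plus_one(spans, min_repeats=5):
    """Return (trace_id, statement) pairs where one DB statement repeats within a trace.

    `spans` is an iterable of dicts with hypothetical keys
    'trace_id', 'type', and 'statement'.
    """
    counts = Counter(
        (span["trace_id"], span["statement"])
        for span in spans
        if span.get("type") == "db"
    )
    return sorted(key for key, count in counts.items() if count >= min_repeats)


if __name__ == "__main__":
    query = "SELECT * FROM orders WHERE id = ?"
    sample = [{"trace_id": "t1", "type": "db", "statement": query} for _ in range(6)]
    sample.append({"trace_id": "t1", "type": "db", "statement": "SELECT * FROM users WHERE id = ?"})
    sample.append({"trace_id": "t2", "type": "http", "statement": ""})
    print(find_n_plus_one(sample))
```

The approach relies on statements being parameterized, so that repeated lookups with different IDs compare as equal.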

Correlating metrics and linking user complaints with backend performance data, including historical data, reveals how different parts of the system interact. This can help teams accurately diagnose root causes and understand the full impact of performance issues.

Automating root cause analysis and using AI/machine learning-based tools such as AIOps helps to accelerate diagnostics and resolution by pinpointing the source of problems, reducing downtime, and freeing up resources.

It’s important to use a holistic picture of your data to inform future decisions. The more data you have, the more you can leverage.

Do: Use distributed traces to identify the specific service and operation causing slowdowns.
Don't: Assume correlation means causation — verify with code-level profiling data.
Pitfall: Legacy systems often appear as black boxes in traces. Use log correlation and synthetic spans to maintain visibility.

    // Java - Auto-propagation with Spring Cloud
    @PostMapping("/orders")
    public ResponseEntity createOrder(@RequestBody OrderRequest request) {
        Span.current().setAllAttributes(Attributes.of(
            stringKey("order.type"), request.getOrderType(),
            longKey("order.value"), request.getTotalValue()));
        // OpenFeign automatically propagates context to downstream services
        return paymentClient.processPayment(request.getPaymentData());
    }

    // Go - Manual context extraction and propagation
    func processHandler(w http.ResponseWriter, r *http.Request) {
        ctx := otel.GetTextMapPropagator().Extract(r.Context(),
            propagation.HeaderCarrier(r.Header))
        ctx, span := tracer.Start(ctx, "process_payment")
        defer span.End()
        // Continue with trace context maintained
    }

Legacy system integration: Create observability bridges for systems that can't be directly instrumented:

    # Synthetic spans with correlation IDs for mainframe calls
    with tracer.start_as_current_span("mainframe.account_lookup") as span:
        correlation_id = format(span.get_span_context().trace_id, '032x')
        logger.info("CICS call started", extra={
            "correlation_id": correlation_id,
            "trace_id": span.get_span_context().trace_id
        })
        result = call_mainframe_service(account_data, correlation_id)
        span.set_attribute("account.status", result.status)

Advanced trace analysis with ES|QL: Link user complaints to backend performance using Elastic's query language:

    // Find slow requests during complaint timeframe
    FROM traces-apm*
    | WHERE user.id == "user_12345" AND @timestamp >= "2024-06-06T09:00:00"
    | EVAL duration_ms = transaction.duration.us / 1000
    | WHERE duration_ms > 2000
    | STATS avg_duration = AVG(duration_ms) BY service.name, transaction.name
    | SORT avg_duration DESC

    // Correlate errors across service boundaries
    // (multiply by 100.0 first so the rate is not truncated by integer division)
    FROM traces-apm*
    | WHERE trace.id == "44b3c2c06e15d444a770b87daab45c0a"
    | EVAL is_error = CASE(transaction.result == "error", 1, 0)
    | STATS error_rate = SUM(is_error) * 100.0 / COUNT(*) BY service.name
    | WHERE error_rate > 0

Event-driven architecture patterns: Explicitly propagate context through message headers for async processing:

    # Producer - inject context into message
    headers = {}
    propagate.inject(headers)
    message = {
        'data': order_data,
        'trace_headers': headers  # Preserve trace context
    }
    await kafka_producer.send('order-events', message)

    # Consumer - extract and continue trace
    trace_headers = message.get('trace_headers', {})
    context = propagate.extract(trace_headers)
    with tracer.start_as_current_span("order.process", context=context):
        await process_order(message['data'])

Do: Use ES|QL for complex trace analysis that traditional dashboards can't handle.
Don't: Try to instrument legacy systems directly — use correlation IDs and synthetic spans.
Pitfall: Message queues and async processing break trace context unless explicitly propagated through headers.
Key insight: Perfect instrumentation isn't always possible. Strategic use of correlation IDs, synthetic spans, and intelligent querying provides comprehensive observability even in complex, hybrid environments.

SOC analyst vs. security analyst: What’s the difference?

Elastic

2025-07-03 00:00

A security operations center (SOC) analyst enhances your security posture by defending the organization against cybersecurity threats. Responsible for monitoring, detecting, investigating, and responding to cyber threats, the SOC analyst is the first line of defense in keeping the organization’s IT ecosystem secure when an incident arises.

A security analyst, similar to a SOC analyst, is responsible for proactive defense and security posture. However, security analysts tend to have a more strategic, preventive focus and may or may not work within the SOC.

With such critical responsibilities, what does it take to become a SOC analyst or security analyst? Let’s explore the job, required skills, and the career path of both.

Challenges SOC analysts face

With a job so rewarding and critical for an organization, it’s no surprise that SOC analysts face many challenges.

1. Alert fatigue: SOC analysts are overwhelmed by the volume of alerts, including false positives, generated by security tools. All these alerts require attention, triage, and intervention, potentially leading SOC analysts to overlook critical threats.

The potential solution: AI-driven security analytics significantly reduces the noise and prioritizes critical alerts, saving security analysts time and effort.

2. High stress levels and burnout: SOC analysts operate in a high-pressure environment, amid constant demands to respond to yet another threat. Then, there’s the added pressure of a dynamic threat landscape and the need to keep up with emerging and advanced threat actors, new vulnerabilities, and attack techniques.

The potential solution: An AI Assistant can help security analysts gain quicker insights and analysis and respond to threats faster and more efficiently.

3. Fear of being replaced by AI: As SOC analysts begin to rely on AI to make their jobs easier, many question whether their jobs will become obsolete. An AI Assistant can already triage alerts and monitor networks for threats more effectively than a junior security analyst. What will happen tomorrow?

The potential solution: AI won’t replace SOC teams, but it will fundamentally transform the role of tier 1 SOC analysts. Analysts will be able to forget about time-consuming manual tasks and get AI help in elevating their skills, so they can focus on more rewarding investigations and threat hunting.

Debian looking for testers with Apple M1/M2 machines

LWN

2025-07-02 15:32

Debian's Bananas team has put out a call for people with Apple M1 or M2 systems to help test Debian on those machines:

The Bananas Team has set up an installer at with images for GNOME, KDE and console installations. While we'd like to build an actual Debian installer sooner or later (we may need a heads-up from the Debian Images team for that), at this time we only provide an asahi-type installer, which installs both the "bootloader" and the OS partitions to disk from the network (as opposed to only installing the bootloader and then letting you install Debian using a d-i USB stick). We haven't forked Trixie from Testing yet, so what you'll get is Debian Testing quite deep into the freeze.

Three Ubisoft chiefs found guilty of enabling culture of sexual harassment

The Guardian

2025-07-02 15:12

Former staff likened offices of video game company in Paris to a ‘boys’ club above the law’

Three former executives at the video game company Ubisoft have been given suspended prison sentences for enabling a culture of sexual and psychological harassment in the workplace at the end of the first big trial to stem from the #MeToo movement in the gaming industry.

The court in Bobigny, north of Paris, had heard how the former executives used their position to bully or sexually harass staff, leaving women terrified and feeling like pieces of meat.

The Netdev Foundation launches

LWN

2025-07-02 14:47

The Netdev Foundation, which is "a user-led effort under the supervision of the Linux Foundation, focused on financially supporting Linux networking development", has announced its existence.

The initial motivation was to move the NIPA testing outside of Meta, so that more people can help and contribute. But there should be sufficient budget to sponsor more projects.

(NIPA is Netdev Infrastructure for Patch Automation).

[$] Accessing new kernel features from Python

LWN

2025-07-02 14:03

Every release of the Linux kernel has lots of new features, many of which are accessible from user space. Usually, though, the GNU C Library (glibc) and tools that access the Linux user-space API lag behind the kernel releases. Geoffrey Thomas showed how Python programs can access these new kernel features as soon as the kernel is released in his "What's New in the Linux Kernel... from Python" talk at PyCon US 2025. While he had two examples of accessing new kernel features, the real goal of the talk was to demonstrate how to go about connecting Python to the Linux kernel.

From Pong to Wii Sports: the surprising legacy of tennis in gaming history

The Guardian

2025-07-02 14:00

From the lab-born Tennis for Two to the console classics of Nintendo and Sega, the sport has been a constant, foundational force in gaming’s rise

With Wimbledon under way, I am going to grasp the opportunity to make a perhaps contentious claim: tennis is the most important sport in the history of video games.

Sure, nowadays the big sellers are EA Sports FC, Madden and NBA 2K, but tennis has been foundational to the industry. It was a simple bat-and-ball game, created in 1958 by scientist William Higinbotham at the Brookhaven National Laboratory in Upton, New York, that is widely considered the first ever video game created purely for entertainment. Tennis for Two ran on an oscilloscope and was designed as a minor diversion for visitors attending the lab’s annual open day, but when people started playing, a queue developed that eventually extended out of the front door and around the side of the building. It was the first indication that computer games might turn out to be popular.

Lobster bisque and onion soup on ISS menu for French astronaut

The Guardian

2025-07-02 12:42

Chef with 10 Michelin stars has designed meals for Sophie Adenot’s trip to International Space Station next year

Even by the exacting standards of France’s gastronomes, it sounds like a meal that is truly out of this world. When the French astronaut Sophie Adenot travels to the International Space Station next year, she will dine on French classics such as lobster bisque, foie gras and onion soup prepared specially for her by a chef with 10 Michelin stars.

Parsnip and haddock velouté, chicken with tonka beans and creamy polenta, and a chocolate cream with hazelnut cazette flower will also be on the menu, the European Space Agency said on Wednesday.