Skip to content

Tail-Based Sampling

Smart log sampling that decides after trace completion based on errors, latency, status codes, and custom business logic. Achieve 40-90% cost reduction while maintaining 100% error coverage.

Overview

Traditional (head-based) sampling decides immediately whether to log:

typescript
// ❌ Head-based sampling - might miss errors
if (Math.random() < 0.1) { // 10% sample rate
  logger.info('Request started')
  // ... process request ...
  logger.error('Request failed!') // If not sampled, this error is lost!
}

Tail-based sampling buffers logs and decides after the trace completes:

typescript
// ✅ Tail-based sampling - never miss errors
logger.info('Request started') // Buffered
// ... process request ...
logger.error('Request failed!') // Buffered

// On trace complete:
// - Has error? → Keep 100%
// - Slow (>1s)? → Keep 100%
// - Success? → Keep 1%

Benefits:

  • ✅ 100% error coverage (never miss issues)
  • ✅ 100% slow request coverage (catch performance problems)
  • ✅ 40-90% cost reduction (drop uninteresting logs)
  • ✅ Smart decisions (based on full trace context)
  • ✅ Budget-aware (adaptive sampling)

Basic Usage

typescript
import { createLogger, tailSamplingPlugin, tracingPlugin } from '@sylphx/cat'

const logger = createLogger({
  plugins: [
    tracingPlugin(), // Required for trace correlation
    tailSamplingPlugin() // Default rules
  ]
})

// Errors are kept (100%)
logger.error('Payment failed', { orderId: 'ORD-123' })

// Success logs are sampled (1%)
logger.info('Order processed', { orderId: 'ORD-124' })

Default Rules

typescript
[
  { name: 'errors', condition: hasError, sampleRate: 1.0 }, // Keep all errors
  { name: 'slow', condition: duration > 1000ms, sampleRate: 1.0 }, // Keep slow requests
  { name: '5xx', condition: statusCode >= 500, sampleRate: 1.0 }, // Keep server errors
  { name: '4xx', condition: statusCode >= 400, sampleRate: 0.5 }, // Sample client errors 50%
  { name: 'warnings', condition: hasWarning, sampleRate: 0.2 }, // Sample warnings 20%
  { name: 'default', condition: true, sampleRate: 0.01 } // Sample success 1%
]

Custom Rules

Define rules based on your business logic:

typescript
import { tailSamplingPlugin, type SamplingRule } from '@sylphx/cat'

const rules: SamplingRule[] = [
  // Rule 1: Keep all payment errors (highest priority)
  {
    name: 'payment-errors',
    priority: 100,
    condition: (trace) => {
      return trace.metadata.hasError && trace.metadata.customFields.category === 'payment'
    },
    sampleRate: 1.0
  },

  // Rule 2: Keep VIP user actions (50%)
  {
    name: 'vip-users',
    priority: 90,
    condition: (trace) => trace.metadata.customFields.userTier === 'vip',
    sampleRate: 0.5
  },

  // Rule 3: Keep slow checkouts (>3s)
  {
    name: 'slow-checkout',
    priority: 80,
    condition: (trace) => {
      return (
        trace.metadata.customFields.action === 'checkout' &&
        (trace.metadata.maxDuration || 0) > 3000
      )
    },
    sampleRate: 1.0
  },

  // Rule 4: Sample regular users at low rate
  {
    name: 'default',
    priority: 0,
    condition: () => true,
    sampleRate: 0.01 // 1%
  }
]

const logger = createLogger({
  plugins: [
    tracingPlugin(),
    tailSamplingPlugin({ rules })
  ]
})

Rule Structure

typescript
interface SamplingRule {
  name: string // Rule identifier
  priority?: number // Higher = evaluated first (default: 0)
  condition: (trace: TraceBuffer) => boolean // Keep if true
  sampleRate: number // 0.0 to 1.0 (0% to 100%)
}

Trace Metadata

typescript
interface TraceMetadata {
  hasError: boolean // Has error or fatal level
  hasWarning: boolean // Has warn level
  statusCode?: number // HTTP status code
  maxDuration?: number // Max duration in ms
  customFields: Record<string, unknown> // Your custom data
}

Adaptive Sampling

Automatically adjust sample rates based on budget:

typescript
const logger = createLogger({
  plugins: [
    tracingPlugin(),
    tailSamplingPlugin({
      adaptive: true,
      monthlyBudget: 10 * 1024 * 1024 * 1024, // 10 GB/month

      onBudgetUpdate: (stats) => {
        console.log(`Budget: ${stats.usedPercent}% used`)
        console.log(`Current sample rate: ${stats.currentSampleRate}`)
      }
    })
  ]
})

How it works:

  1. Monitor log volume in real-time
  2. If approaching budget (>80%), reduce sample rates
  3. Prioritize errors and slow requests
  4. Never drop errors (always 100%)

Example:

  • 0-50% budget: Sample success at 5%
  • 50-80% budget: Sample success at 2%
  • 80-95% budget: Sample success at 0.5%
  • 95-100% budget: Sample success at 0.1%

Configuration

typescript
tailSamplingPlugin({
  // Sampling rules
  rules?: SamplingRule[]

  // Enable adaptive sampling
  adaptive?: boolean // default: false

  // Monthly budget in bytes
  monthlyBudget?: number // default: undefined

  // Max buffer size (traces)
  maxBufferSize?: number // default: 1000

  // Max trace duration (ms)
  maxTraceDuration?: number // default: 60000 (1 minute)

  // Callback when trace is flushed
  onFlush?: (trace: TraceBuffer, kept: boolean) => void

  // Callback when budget is updated
  onBudgetUpdate?: (stats: BudgetStats) => void
})

Real-World Examples

E-Commerce API

typescript
const rules: SamplingRule[] = [
  // Critical: All errors
  { name: 'errors', condition: (t) => t.metadata.hasError, sampleRate: 1.0 },

  // Critical: All 5xx errors
  { name: '5xx', condition: (t) => (t.metadata.statusCode || 0) >= 500, sampleRate: 1.0 },

  // Important: Slow requests (>2s)
  { name: 'slow', condition: (t) => (t.metadata.maxDuration || 0) > 2000, sampleRate: 1.0 },

  // Important: Authentication issues
  { name: 'auth', condition: (t) => {
    const code = t.metadata.statusCode
    return code === 401 || code === 403
  }, sampleRate: 0.5 },

  // Medium: 4xx errors
  { name: '4xx', condition: (t) => {
    const code = t.metadata.statusCode || 0
    return code >= 400 && code < 500 && code !== 401 && code !== 403
  }, sampleRate: 0.1 },

  // Low: Success
  { name: 'success', condition: () => true, sampleRate: 0.01 } // 1%
]

const logger = createLogger({
  plugins: [
    tracingPlugin(),
    tailSamplingPlugin({
      rules,
      adaptive: true,
      monthlyBudget: 50 * 1024 * 1024 * 1024 // 50 GB
    })
  ]
})

Result: 40-90% cost reduction with 100% error coverage

Gaming Backend

typescript
const rules: SamplingRule[] = [
  // Critical: Crashes
  { name: 'crashes', condition: (t) => t.metadata.customFields.event === 'crash', sampleRate: 1.0 },

  // Critical: Cheat detection
  { name: 'cheat', condition: (t) => t.metadata.customFields.cheatDetected === true, sampleRate: 1.0 },

  // Important: High latency matchmaking
  { name: 'slow-match', condition: (t) => {
    return t.metadata.customFields.event === 'matchmaking' &&
      (t.metadata.maxDuration || 0) > 500
  }, sampleRate: 1.0 },

  // Low: Regular gameplay
  { name: 'gameplay', condition: () => true, sampleRate: 0.001 } // 0.1%
]

Financial API

typescript
const rules: SamplingRule[] = [
  // Critical: Transaction errors
  { name: 'tx-errors', condition: (t) => {
    return t.metadata.hasError && t.metadata.customFields.category === 'transaction'
  }, sampleRate: 1.0 },

  // Critical: Fraud detection
  { name: 'fraud', condition: (t) => t.metadata.customFields.fraudScore !== undefined, sampleRate: 1.0 },

  // Critical: High-value transactions
  { name: 'high-value', condition: (t) => {
    const amount = t.metadata.customFields.amount as number
    return amount !== undefined && amount > 10000
  }, sampleRate: 1.0 },

  // Important: Audit events (compliance)
  { name: 'audit', condition: (t) => t.metadata.customFields.auditRequired === true, sampleRate: 1.0 },

  // Low: Regular API calls
  { name: 'default', condition: () => true, sampleRate: 0.05 } // 5%
]

Distributed Tracing

Tail-based sampling works across microservices:

typescript
// Service A (API Gateway)
const gatewayLogger = createLogger({
  plugins: [
    tracingPlugin(), // Generates traceId
    tailSamplingPlugin()
  ]
})

gatewayLogger.info('Request received')

// Propagate traceId to Service B
const headers = tracingPlugin.toHeaders(traceContext)
await fetch('http://service-b', { headers })

// Service B
const authLogger = createLogger({
  plugins: [
    tracingPlugin({ getTraceContext: () => traceContext }), // Same traceId
    tailSamplingPlugin()
  ]
})

authLogger.error('Authentication failed')

// Both services' logs are kept (error detected)

Performance

Memory Usage

typescript
// Configure buffer limits
tailSamplingPlugin({
  maxBufferSize: 1000, // Max 1000 traces in memory
  maxTraceDuration: 60000 // Auto-flush after 1 minute
})

Memory estimate:

  • ~1 KB per log entry
  • 10 logs per trace average
  • 1000 traces max = ~10 MB

Throughput

Tail-based sampling adds minimal overhead:

  • Buffer write: ~1-2% CPU
  • Rule evaluation: ~0.5% CPU
  • Total overhead: ❤️% CPU

Monitoring

Track sampling statistics:

typescript
const logger = createLogger({
  plugins: [
    tracingPlugin(),
    tailSamplingPlugin({
      onFlush: (trace, kept) => {
        if (kept) {
          console.log(`✅ Kept trace ${trace.traceId}:`,
            `errors=${trace.metadata.hasError},`,
            `duration=${trace.metadata.maxDuration}ms`)
        } else {
          // Optionally log discarded traces
        }
      },

      onBudgetUpdate: (stats) => {
        console.log(`Budget: ${stats.usedPercent.toFixed(1)}% used`)
        console.log(`Sample rate: ${stats.currentSampleRate}`)
      }
    })
  ]
})

Testing

typescript
import { tailSamplingPlugin } from '@sylphx/cat'

describe('Tail-based sampling', () => {
  it('keeps errors', () => {
    const plugin = tailSamplingPlugin()

    const errorEntry = {
      level: 'error' as const,
      timestamp: Date.now(),
      message: 'Failed',
      traceId: 'trace-123',
      data: {}
    }

    plugin.onLog!(errorEntry)
    plugin.flush!('trace-123')

    // Verify error was kept
  })

  it('samples success at low rate', () => {
    const plugin = tailSamplingPlugin()
    let kept = 0

    for (let i = 0; i < 1000; i++) {
      const entry = {
        level: 'info' as const,
        timestamp: Date.now(),
        message: 'Success',
        traceId: `trace-${i}`,
        data: { statusCode: 200 }
      }

      plugin.onLog!(entry)
      plugin.flush!(`trace-${i}`)
    }

    // Expect ~1% kept (10 out of 1000)
    expect(kept).toBeLessThan(50)
  })
})

Best Practices

Always Keep Errors

typescript
// ✅ Good - never miss errors
{ name: 'errors', condition: (t) => t.metadata.hasError, sampleRate: 1.0 }

// ❌ Bad - might miss critical issues
{ name: 'errors', condition: (t) => t.metadata.hasError, sampleRate: 0.5 }

Prioritize Rules

typescript
// ✅ Good - high priority for important events
{ name: 'payment-errors', priority: 100, ... }
{ name: 'vip-users', priority: 90, ... }
{ name: 'default', priority: 0, ... }

// ❌ Bad - no priorities, order matters

Use Adaptive Sampling

typescript
// ✅ Good - automatically manages budget
adaptive: true,
monthlyBudget: 10 * 1024 * 1024 * 1024

// ❌ Bad - fixed rates might exceed budget
adaptive: false

Cost Savings

Real-world impact (100,000 requests/second):

StrategyLogs KeptError CoverageMonthly Cost
No sampling100,000/s100%$10,000
Head-based (10%)10,000/s~10% ❌$1,000
Tail-based5,000/s100% ✅$500

Tail-based achieves:

  • 50% cost vs. head-based
  • 95% cost savings vs. no sampling
  • 100% error coverage

See Also

Released under the MIT License.