Monitoring & Metrics

Overview

NanoARB exposes Prometheus metrics for comprehensive monitoring of trading performance, latency, and system health. The monitoring stack includes Prometheus for metrics collection and Grafana for visualization.

Architecture

The monitoring stack consists of:

NanoARB Engine: Exposes metrics on port 9090
Prometheus: Scrapes metrics every 1 second and stores time-series data
Grafana: Provides real-time dashboards and alerting

Setup

Starting the Monitoring Stack

With Docker Compose
Standalone

# Full stack (engine + monitoring)
cd docker
docker compose up -d

# Monitoring only (for local development)
docker compose -f docker-compose-monitoring.yml up -d

Start Prometheus:

prometheus --config.file=docker/prometheus.yml

Start Grafana:

grafana-server --config=/etc/grafana/grafana.ini

Accessing Dashboards

Once the stack is running:

Grafana: http://localhost:3000
- Username: admin
- Password: nanoarb
Prometheus: http://localhost:9091
Metrics Endpoint: http://localhost:9090/metrics

Prometheus Configuration

The Prometheus configuration in docker/prometheus.yml:

global:
  scrape_interval: 1s        # Scrape every second for HFT
  evaluation_interval: 1s
  external_labels:
    monitor: 'nanoarb'

scrape_configs:
  - job_name: 'nanoarb'
    static_configs:
      - targets: ['host.docker.internal:9090']
    scrape_interval: 1s      # High-frequency scraping
    metrics_path: /metrics

The 1-second scrape interval is optimized for high-frequency trading. For longer-term monitoring, increase to 5s or 15s to reduce storage requirements.

Data retention:

Default: 30 days (configured via --storage.tsdb.retention.time=30d)
Adjust in docker-compose.yml if you need longer retention

Available Metrics

NanoARB exposes metrics in the MetricsRegistry defined in crates/nano-gateway/src/metrics.rs:

Trading Metrics

Metric	Type	Description
`nanoarb_orders_total`	Counter	Total number of orders submitted
`nanoarb_fills_total`	Counter	Total number of fills received
`nanoarb_position`	Gauge	Current net position in contracts
`nanoarb_pnl`	Gauge	Current profit/loss in dollars
`nanoarb_events_total`	Counter	Total events processed by the engine

Latency Metrics

All latency metrics are recorded in nanoseconds with histogram buckets:

Metric	Type	Description
`nanoarb_inference_latency_ns`	Histogram	ML model inference time
`nanoarb_order_latency_ns`	Histogram	Order submission latency
`nanoarb_book_update_latency_ns`	Histogram	Order book update processing time
`nanoarb_event_latency_ns`	Histogram	Event processing latency

Histogram buckets:

Range: 100ns to ~100ms
Exponential buckets with factor of 2 (20 buckets total)
Enables percentile queries (p50, p95, p99)

Example Queries

# Orders per minute
rate(nanoarb_orders_total_total[1m]) * 60

# Percentage of orders that get filled
(rate(nanoarb_fills_total_total[5m]) / rate(nanoarb_orders_total_total[5m])) * 100

# 99th percentile inference time
histogram_quantile(0.99, 
  sum(rate(nanoarb_inference_latency_ns_bucket[1m])) by (le)
)

# Average position over 5 minutes
avg_over_time(nanoarb_position[5m])

Grafana Dashboards

The default dashboard is located at grafana/dashboards/main.json and includes:

1. Key Performance Indicators (Top Row)

P&L

Current profit/loss in dollars

Position

Current net position

Orders/min

Order submission rate

Fills/min

Fill execution rate

2. Equity Curve

Real-time visualization of cumulative P&L:

nanoarb_pnl

Shows your trading performance over time, helping identify profitable and unprofitable periods.

3. Inference Latency

Tracks ML model performance with percentiles:

p50 (median): Typical inference time
p95: 95th percentile - most requests complete within this time
p99: 99th percentile - worst-case latency for optimization

# p50
histogram_quantile(0.50, sum(rate(nanoarb_inference_latency_ns_bucket[1m])) by (le))

# p95
histogram_quantile(0.95, sum(rate(nanoarb_inference_latency_ns_bucket[1m])) by (le))

# p99
histogram_quantile(0.99, sum(rate(nanoarb_inference_latency_ns_bucket[1m])) by (le))

For HFT strategies, target p99 inference latency under 1 microsecond (1000ns). Higher latencies may result in adverse selection.

4. Position Over Time

Tracks position changes throughout the trading session:

nanoarb_position

Useful for:

Identifying position accumulation
Monitoring inventory risk
Verifying position flattening at session end

5. Event Processing Rate

Monitors system throughput:

rate(nanoarb_events_total_total[1m])

High event rates (>10,000 events/sec) indicate:

Heavy market data processing
Potential bottlenecks in event loop
Need for performance optimization

Dashboard Configuration

The dashboard is provisioned automatically in grafana/provisioning/:

grafana/
├── provisioning/
│   ├── dashboards/
│   │   └── dashboards.yml     # Dashboard provider config
│   └── datasources/
│       └── datasources.yml    # Prometheus datasource
└── dashboards/
    └── main.json               # Main trading dashboard

Adding Custom Panels

Navigate to Grafana: http://localhost:3000
Open Dashboard: “NanoARB Trading Dashboard”
Add Panel: Click “Add panel” in top-right
Configure Query: Use Prometheus queries from examples above
Save Dashboard: Exports to JSON for version control

Custom Dashboard Example

Sharpe Ratio Panel

{
  "title": "Sharpe Ratio (Rolling 1h)",
  "targets": [
    {
      "expr": "(avg_over_time(nanoarb_pnl[1h]) - avg_over_time(nanoarb_pnl[1h] offset 1h)) / stddev_over_time(nanoarb_pnl[1h])",
      "legendFormat": "Sharpe"
    }
  ],
  "type": "stat"
}

Calculates rolling Sharpe ratio based on P&L standard deviation.

Alerting

Configure Grafana Alerts

Create Alert Rule

In Grafana, go to Alerting → Alert rules → New alert rule

Define Condition

Example: Alert when P&L drops below -$10,000

nanoarb_pnl < -10000

Configure Notification

Set up notification channels (Slack, email, PagerDuty)

Save and Test

Test the alert and save the configuration

Example Alert Rules

High Latency Alert

Alert when p99 inference latency exceeds 10 microseconds:

histogram_quantile(0.99, 
  sum(rate(nanoarb_inference_latency_ns_bucket[1m])) by (le)
) > 10000

Position Limit Alert

Alert when position exceeds 80% of max:

abs(nanoarb_position) > 40  # Assuming max_position = 50

Drawdown Alert

Alert on significant drawdown (requires additional calculation):

(max_over_time(nanoarb_pnl[1d]) - nanoarb_pnl) > 50000

Metrics Export

Raw Metrics Format

View raw Prometheus metrics:

curl http://localhost:9090/metrics

Example output:

# HELP nanoarb_orders_total Total number of orders submitted
# TYPE nanoarb_orders_total counter
nanoarb_orders_total_total 1542

# HELP nanoarb_pnl Current P&L in dollars
# TYPE nanoarb_pnl gauge
nanoarb_pnl 2450.75

# HELP nanoarb_inference_latency_ns Model inference latency in nanoseconds
# TYPE nanoarb_inference_latency_ns histogram
nanoarb_inference_latency_ns_bucket{le="100"} 245
nanoarb_inference_latency_ns_bucket{le="200"} 489
nanoarb_inference_latency_ns_bucket{le="400"} 723
...

Export to CSV

Use Prometheus API to export historical data:

# Export P&L for last hour
curl -G http://localhost:9091/api/v1/query_range \
  --data-urlencode 'query=nanoarb_pnl' \
  --data-urlencode 'start='$(date -u -d '1 hour ago' +%s) \
  --data-urlencode 'end='$(date -u +%s) \
  --data-urlencode 'step=1s' \
  | jq -r '.data.result[0].values[] | @csv' > pnl.csv

Performance Monitoring

Key Metrics to Watch

Inference Latency

Target: <1μs p99Critical for strategy competitiveness

Order Latency

Target: <100μs p99Impacts fill probability

Event Processing Rate

Target: >50k events/secIndicates system capacity

Fill Ratio

Target: >80%Measures execution quality

Latency Optimization

If latencies are too high:

Check CPU pinning: Ensure process runs on isolated cores
Review event loop: Look for blocking operations
Profile code: Use perf or flamegraph to identify hotspots
Optimize model: Reduce inference complexity

See Production Deployment for optimization techniques.

Troubleshooting

Metrics Not Appearing

# Check if engine is exposing metrics
curl http://localhost:9090/metrics

# Check Prometheus targets
open http://localhost:9091/targets

# Verify Prometheus is scraping
docker compose logs prometheus

Dashboard Not Loading

# Check Grafana logs
docker compose logs grafana

# Verify datasource connection
curl http://admin:nanoarb@localhost:3000/api/datasources

# Restart Grafana
docker compose restart grafana

High Memory Usage

Prometheus stores metrics in memory. Reduce retention or scrape interval:

# In docker-compose.yml
command:
  - '--storage.tsdb.retention.time=7d'  # Reduce from 30d
  - '--storage.tsdb.retention.size=1GB' # Add size limit

​Overview

​Architecture

​Setup

​Starting the Monitoring Stack

​Accessing Dashboards

​Prometheus Configuration

​Available Metrics

​Trading Metrics

​Latency Metrics

​Example Queries

​Grafana Dashboards

​1. Key Performance Indicators (Top Row)

P&L

Position

Orders/min

Fills/min

​2. Equity Curve

​3. Inference Latency

​4. Position Over Time

​5. Event Processing Rate

​Dashboard Configuration

​Adding Custom Panels

​Custom Dashboard Example

​Alerting

​Configure Grafana Alerts

​Example Alert Rules

​Metrics Export

​Raw Metrics Format

​Export to CSV

​Performance Monitoring

​Key Metrics to Watch

Inference Latency

Order Latency

Event Processing Rate

Fill Ratio

​Latency Optimization

​Troubleshooting

​Metrics Not Appearing

​Dashboard Not Loading

​High Memory Usage

​Next Steps

Docker

Production

Overview

Architecture

Setup

Starting the Monitoring Stack

Accessing Dashboards

Prometheus Configuration

Available Metrics

Trading Metrics

Latency Metrics

Example Queries

Grafana Dashboards

1. Key Performance Indicators (Top Row)

2. Equity Curve

3. Inference Latency

4. Position Over Time

5. Event Processing Rate

Dashboard Configuration

Adding Custom Panels

Custom Dashboard Example

Alerting

Configure Grafana Alerts

Example Alert Rules

Metrics Export

Raw Metrics Format

Export to CSV

Performance Monitoring

Key Metrics to Watch

Latency Optimization

Troubleshooting

Metrics Not Appearing

Dashboard Not Loading

High Memory Usage

Next Steps