Overview
NanoARB provides a complete RL environment for training market-making agents using:
- Gym-style environment for market making
- State representations from order book data
- Action spaces for quote placement
- Reward functions for profit optimization
- Support for IQL and Decision Transformer algorithms
MarketMakingEnv
The RL environment simulates market-making dynamics:
nano-strategy/src/rl_env.rs:160-186
Creating an Environment
nano-strategy/src/rl_env.rs:188-207
Environment Configuration
nano-strategy/src/rl_env.rs:115-140
Default Configuration
nano-strategy/src/rl_env.rs:142-158
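As an illustration of how the configuration might be constructed, the sketch below defines a config type with a `Default` impl and shows overriding individual fields via struct-update syntax. The field names and default values here are assumptions for illustration, not the crate's actual `EnvConfig` definition.

```rust
// Hypothetical config sketch; fields and defaults are illustrative only.
#[derive(Clone, Debug)]
pub struct EnvConfig {
    pub max_inventory: f64,     // position limit used for normalization
    pub inventory_penalty: f64, // quadratic inventory penalty coefficient
    pub maker_fee: f64,         // per-fill maker fee (negative = rebate)
    pub episode_len: usize,     // steps per episode
}

impl Default for EnvConfig {
    fn default() -> Self {
        Self {
            max_inventory: 100.0,
            inventory_penalty: 0.01,
            maker_fee: -0.0001,
            episode_len: 1_000,
        }
    }
}

fn main() {
    // Take the defaults, overriding only the position limit.
    let cfg = EnvConfig {
        max_inventory: 50.0,
        ..EnvConfig::default()
    };
    println!("{:?}", cfg);
}
```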
Action Space
Actions control quote placement:
nano-strategy/src/rl_env.rs:8-21
Creating Actions
nano-strategy/src/rl_env.rs:36-58
Action Validation
nano-strategy/src/rl_env.rs:60-72
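A hedged sketch of what the action type and its validation might look like: quotes are expressed as tick offsets from the mid plus sizes, and `validate` rejects crossed or oversized quotes. The type and method names here are assumptions, not the crate's actual API.

```rust
// Hypothetical quote-placement action; names are illustrative only.
#[derive(Clone, Copy, Debug)]
pub struct Action {
    pub bid_offset_ticks: i32, // distance of bid below mid, in ticks
    pub ask_offset_ticks: i32, // distance of ask above mid, in ticks
    pub bid_size: f64,
    pub ask_size: f64,
}

impl Action {
    pub fn new(bid_offset_ticks: i32, ask_offset_ticks: i32, bid_size: f64, ask_size: f64) -> Self {
        Self { bid_offset_ticks, ask_offset_ticks, bid_size, ask_size }
    }

    /// Reject quotes that cross the mid or exceed the per-order size limit.
    pub fn validate(&self, max_size: f64) -> Result<(), String> {
        if self.bid_offset_ticks < 0 || self.ask_offset_ticks < 0 {
            return Err("offsets must be non-negative (quotes must not cross mid)".into());
        }
        if self.bid_size < 0.0 || self.ask_size < 0.0 {
            return Err("sizes must be non-negative".into());
        }
        if self.bid_size > max_size || self.ask_size > max_size {
            return Err("size exceeds per-order limit".into());
        }
        Ok(())
    }
}

fn main() {
    let a = Action::new(1, 1, 5.0, 5.0);
    assert!(a.validate(10.0).is_ok());
    // A negative offset would quote through the mid: rejected.
    assert!(Action::new(-1, 1, 5.0, 5.0).validate(10.0).is_err());
}
```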
State Space
The state representation includes:
nano-strategy/src/rl_env.rs:75-92
State Features
- LOB features - Flattened order book snapshot (prices, quantities, depths)
- Inventory - Current position normalized by max_inventory
- Unrealized P&L - Mark-to-market P&L
- Time since trade - Normalized time since last fill
- Spread - Current bid-ask spread
- Imbalance - Order book imbalance
- Recent returns - Last N price returns
Converting State to Array
nano-strategy/src/rl_env.rs:95-106
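The flattening might look like the sketch below, which concatenates the features listed above (LOB levels, then the scalar features, then recent returns) into one vector. Field names and ordering are assumptions for illustration, not the crate's actual layout.

```rust
// Hypothetical state type mirroring the feature list above.
pub struct State {
    pub lob: Vec<f64>,            // flattened order book levels (prices, sizes)
    pub inventory: f64,           // position normalized by max_inventory
    pub unrealized_pnl: f64,      // mark-to-market P&L
    pub time_since_trade: f64,    // normalized time since last fill
    pub spread: f64,              // current bid-ask spread
    pub imbalance: f64,           // order book imbalance
    pub recent_returns: Vec<f64>, // last N price returns
}

impl State {
    /// Concatenate all features into a single flat vector for the agent.
    pub fn to_array(&self) -> Vec<f64> {
        let mut v = Vec::with_capacity(self.lob.len() + 5 + self.recent_returns.len());
        v.extend_from_slice(&self.lob);
        v.push(self.inventory);
        v.push(self.unrealized_pnl);
        v.push(self.time_since_trade);
        v.push(self.spread);
        v.push(self.imbalance);
        v.extend_from_slice(&self.recent_returns);
        v
    }
}

fn main() {
    let s = State {
        lob: vec![0.0; 20],
        inventory: 0.1,
        unrealized_pnl: 0.0,
        time_since_trade: 0.5,
        spread: 0.01,
        imbalance: -0.2,
        recent_returns: vec![0.0; 10],
    };
    // 20 LOB values + 5 scalars + 10 returns = 35 features.
    assert_eq!(s.to_array().len(), 35);
}
```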
Reward Function
The reward balances multiple objectives:
nano-strategy/src/rl_env.rs:388-422
Reward Components
- Spread Capture - Positive reward for fills that capture spread
- Inventory Penalty - Quadratic penalty for large positions
- Adverse Selection - Penalty when market moves against position
- Fee Costs - Maker/taker fees reduce reward
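The four components above might combine as in the sketch below: spread capture enters positively, while a quadratic inventory term, an adverse-selection term, and fees are subtracted. The coefficient names and exact functional form are assumptions, not the crate's actual formula.

```rust
// Hypothetical reward combining the components listed above.
pub struct RewardConfig {
    pub inventory_penalty: f64, // weight on the quadratic inventory term
    pub adverse_penalty: f64,   // weight on the adverse-selection term
}

pub fn reward(
    spread_captured: f64, // P&L from fills at quoted prices this step
    inventory: f64,       // normalized position
    adverse_move: f64,    // mid move against the position since last step
    fees: f64,            // fees paid this step
    cfg: &RewardConfig,
) -> f64 {
    spread_captured
        - cfg.inventory_penalty * inventory * inventory // quadratic penalty
        - cfg.adverse_penalty * adverse_move.max(0.0)   // only penalize adverse moves
        - fees
}

fn main() {
    let cfg = RewardConfig { inventory_penalty: 0.01, adverse_penalty: 0.5 };
    let r = reward(0.02, 0.5, 0.0, 0.001, &cfg);
    println!("reward = {r}");
}
```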
Tuning Reward Coefficients
Training Loop
Standard RL training loop:
Environment API
reset
nano-strategy/src/rl_env.rs:209-223
step
nano-strategy/src/rl_env.rs:225-264
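Using the reset/step API above, a rollout loop might look like the sketch below. The environment here is a stub with the assumed `(next_state, reward, done)` step signature; it is not the crate's actual implementation.

```rust
// Stub environment illustrating the assumed reset/step contract.
struct Env {
    t: usize,
    horizon: usize,
}

impl Env {
    fn reset(&mut self) -> Vec<f64> {
        self.t = 0;
        vec![0.0; 4] // initial state features
    }

    /// Returns (next_state, reward, done).
    fn step(&mut self, _action: usize) -> (Vec<f64>, f64, bool) {
        self.t += 1;
        (vec![0.0; 4], 0.0, self.t >= self.horizon)
    }
}

/// Drive one episode to completion; returns the number of steps taken.
fn rollout(env: &mut Env) -> usize {
    let mut _state = env.reset();
    let mut steps = 0;
    loop {
        let action = 0; // a policy would map _state -> action here
        let (next_state, _reward, done) = env.step(action);
        _state = next_state;
        steps += 1;
        if done {
            break;
        }
    }
    steps
}

fn main() {
    let mut env = Env { t: 0, horizon: 100 };
    assert_eq!(rollout(&mut env), 100);
}
```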
Fill Simulation
The environment simulates realistic fills:
nano-strategy/src/rl_env.rs:266-325
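A simplified version of the fill rule can be sketched as: a resting quote is filled when the market trades through its price. This is a deliberate simplification (no queue position or partial fills) and the function names are illustrative, not the crate's actual fill logic.

```rust
/// A resting bid is filled when a sell trade prints at or below its price.
pub fn bid_filled(bid_price: f64, trade_price: f64) -> bool {
    trade_price <= bid_price
}

/// A resting ask is filled when a buy trade prints at or above its price.
pub fn ask_filled(ask_price: f64, trade_price: f64) -> bool {
    trade_price >= ask_price
}

fn main() {
    assert!(bid_filled(99.0, 98.5));  // market sold through our bid
    assert!(!bid_filled(99.0, 99.5)); // trade above our bid: no fill
    assert!(ask_filled(101.0, 101.5));
}
```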
IQL Training
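IQL's central ingredient is fitting the value function by expectile regression instead of a max over actions, which avoids querying out-of-distribution actions. A sketch of the expectile loss L_tau(u) = |tau - 1{u < 0}| * u^2 with u = q - v (tau near 1 approximates the max); this illustrates the technique generally, not this repo's specific trainer.

```rust
/// Expectile regression loss used by IQL to fit V toward an upper
/// expectile of Q: underestimation (q > v) is weighted by tau,
/// overestimation by (1 - tau).
pub fn expectile_loss(q: f64, v: f64, tau: f64) -> f64 {
    let u = q - v;
    let weight = if u < 0.0 { 1.0 - tau } else { tau };
    weight * u * u
}

fn main() {
    // With tau = 0.9, underestimating the value is penalized 9x more
    // than overestimating by the same amount.
    assert!(expectile_loss(1.0, 0.0, 0.9) > expectile_loss(0.0, 1.0, 0.9));
}
```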
Implicit Q-Learning for offline RL:
Decision Transformer
Sequence modeling approach to RL:
Deployment
Deploy the trained RL agent:
Best Practices
- Start with imitation learning - Pre-train on data from profitable strategies
- Tune reward coefficients - Balance spread capture vs inventory risk
- Use sufficient context - Include enough history for informed decisions
- Normalize state features - Ensure all features are on similar scales
- Monitor out-of-distribution - Track when live conditions differ from training
- Use offline RL for safety - Train on historical data before live deployment