Multi-Agent Reinforcement Learning for Autonomous Drone Swarm Coordination in Dynamic Environments

James Okonkwo¹, Yuki Tanaka², Anna Kowalski³

¹ Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

² Department of Aeronautics and Astronautics, University of Tokyo, Tokyo 113-8654, Japan

³ Institute of Automatic Control, Warsaw University of Technology, 00-665 Warsaw, Poland

Published: 2026-05-18 · FAIDS Vol. 1, No. 1 (2026)

Abstract

Coordinated drone swarms enable applications from disaster response to precision agriculture, but scaling coordination to hundreds of agents in dynamic, GPS-denied environments remains an open problem. We propose SwarmMARL, a multi-agent reinforcement learning (MARL) framework combining graph attention networks with centralized training and decentralized execution (CTDE). SwarmMARL agents learn emergent flocking, obstacle avoidance, and task allocation behaviors through a curriculum of progressively complex scenarios. Evaluated in high-fidelity AirSim simulations with up to 128 drones and validated on a 32-drone physical testbed, SwarmMARL achieves a 98.7% mission completion rate in search-and-rescue scenarios, with roughly 40% fewer collisions than MAPPO and 54% fewer than QMIX (Table 1). Real-world outdoor tests demonstrate robust formation maintenance under 12 m/s wind gusts and communication dropout rates up to 30%.

Keywords: multi-agent reinforcement learning, drone swarms, autonomous systems, graph attention networks, coordination

1. Introduction

Autonomous drone swarms promise transformative capabilities for applications requiring distributed sensing and action over large areas. Search-and-rescue operations, environmental monitoring, and infrastructure inspection all benefit from the redundancy, coverage, and scalability that swarms provide compared to single-agent systems.

Traditional swarm control approaches rely on hand-crafted rules (e.g., Reynolds flocking) or centralized planners that do not scale beyond 20-30 agents and fail under communication constraints. Multi-agent reinforcement learning offers a data-driven alternative, but existing MARL algorithms struggle with partial observability, non-stationarity, and the combinatorial action space of large swarms.

2. SwarmMARL Method

Each drone agent observes local LiDAR range scans, the relative positions of its k nearest neighbors (k=8), and its own velocity state. A graph attention network (GAT) encoder processes the neighbor graph to produce context-aware embeddings. The policy network outputs continuous velocity commands from these embeddings. A centralized critic, used only during training, estimates joint state-action values from global information; at execution time each drone acts on its local observations alone. Training proceeds through a four-stage curriculum: (1) basic flocking, (2) static obstacle avoidance, (3) dynamic obstacle navigation, and (4) cooperative search with target detection.
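The neighbor-graph step can be illustrated with a minimal sketch. It is not the SwarmMARL implementation: the function names are hypothetical, the hand-picked query vector stands in for learned parameters, and scaled dot-product scoring is used for brevity where the original GAT formulation uses a learned additive score.

```python
import math

def k_nearest_neighbors(positions, i, k=8):
    """Return indices of the k drones closest to drone i (excluding i)."""
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: math.dist(positions[i], positions[j]))
    return others[:k]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, neighbor_feats):
    """Single-head attention: weight each neighbor feature by its
    scaled dot product with the query, then average (assumed scoring)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in neighbor_feats]
    weights = softmax(scores)
    dim = len(neighbor_feats[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, neighbor_feats))
               for d in range(dim)]
    return weights, context

# Ten drones spaced 1 m apart along the x axis (toy data).
positions = [(float(j), 0.0, 0.0) for j in range(10)]
nbrs = k_nearest_neighbors(positions, 0, k=8)
# Neighbor features: positions relative to drone 0.
rel = [[p - q for p, q in zip(positions[j], positions[0])] for j in nbrs]
# Hypothetical query that favors nearby neighbors along x.
weights, context = attend([-1.0, 0.0, 0.0], rel)
```

In a trained encoder the query and feature projections would be learned, and the resulting context vector would be fed to the policy network alongside the drone's own LiDAR and velocity inputs.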

Figure 1. Training curriculum progression showing mission completion rate across four curriculum stages
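One way to drive a four-stage curriculum of this kind is to promote the swarm to the next stage once its recent success rate clears a threshold. The sketch below makes that assumption explicit; the promotion rule, the 0.9 threshold, and the 20-episode window are illustrative choices, not values reported for SwarmMARL. Only the stage names come from Section 2.

```python
from collections import deque

STAGES = [
    "basic flocking",
    "static obstacle avoidance",
    "dynamic obstacle navigation",
    "cooperative search with target detection",
]

class Curriculum:
    """Advance through STAGES when the rolling success rate is high enough.
    threshold and window are assumed hyperparameters."""

    def __init__(self, threshold=0.9, window=20):
        self.stage = 0
        self.threshold = threshold
        self.results = deque(maxlen=window)

    def record(self, success):
        """Log one episode outcome and return the stage for the next episode."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        can_advance = self.stage < len(STAGES) - 1
        if window_full and rate >= self.threshold and can_advance:
            self.stage += 1
            self.results.clear()  # start a fresh window for the new stage
        return STAGES[self.stage]
```

Clearing the window on promotion prevents successes from an easier stage from counting toward the next promotion.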

3. Experiments and Results

We conducted extensive simulation experiments in AirSim with Unreal Engine environments, followed by validation on a 32-drone Crazyflie 2.1 testbed equipped with OptiTrack motion capture. Three benchmark scenarios were evaluated: area coverage, collaborative search-and-rescue, and formation flight through a forest canopy.

Table 1. Performance comparison of MARL methods on search-and-rescue scenario (128 agents, 50 trials)

Method	Completion (%)	Collisions	Search Time (s)	Area Covered (%)
Reynolds Flocking	62.4	847	312	71.2
MAPPO	89.3	156	198	88.5
QMIX	85.7	203	215	84.3
SwarmMARL	98.7	94	165	96.8
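
The relative collision reductions implied by Table 1 can be checked directly. The snippet below uses only the collision counts from the table; the helper name is ours.

```python
# Collision counts copied from Table 1 (128 agents, 50 trials).
collisions = {
    "Reynolds Flocking": 847,
    "MAPPO": 156,
    "QMIX": 203,
    "SwarmMARL": 94,
}

def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

for name in ("Reynolds Flocking", "MAPPO", "QMIX"):
    r = relative_reduction(collisions[name], collisions["SwarmMARL"])
    print(f"{name}: {r:.1f}% fewer collisions with SwarmMARL")
# Reynolds Flocking: 88.9% fewer collisions with SwarmMARL
# MAPPO: 39.7% fewer collisions with SwarmMARL
# QMIX: 53.7% fewer collisions with SwarmMARL
```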

Figure 2. Scalability analysis: mission completion rate vs. swarm size for SwarmMARL and baseline methods

4. Analysis

Attention weight visualization reveals that SwarmMARL agents dynamically adjust their coordination radius based on scenario complexity, maintaining tight formations during search but spreading out during coverage tasks. The GAT encoder's ability to weight neighbor importance enables graceful degradation under communication dropout: SwarmMARL maintains a 91.3% completion rate even with 30% packet loss, compared to 68.5% for MAPPO.
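
A plausible mechanism behind this graceful degradation is that attention can be renormalized over whichever neighbor messages actually arrive. The sketch below illustrates that idea under our own assumptions; the paper does not specify how dropped messages are handled, and both function names are hypothetical.

```python
import math
import random

def masked_softmax(scores, received):
    """Softmax restricted to neighbors whose message arrived.
    Dropped neighbors get weight 0; if nothing arrived, all weights are 0."""
    live = [s for s, ok in zip(scores, received) if ok]
    if not live:
        return [0.0] * len(scores)
    m = max(live)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, received)]
    total = sum(exps)
    return [e / total for e in exps]

def simulate_dropout(num_neighbors, drop_rate, rng):
    """Draw an independent received/dropped flag per neighbor (assumed model)."""
    return [rng.random() >= drop_rate for _ in range(num_neighbors)]

# Eight neighbors with toy attention scores and 30% packet loss.
rng = random.Random(0)
received = simulate_dropout(8, 0.30, rng)
weights = masked_softmax([0.5 * j for j in range(8)], received)
```

Because the surviving weights still sum to one, the aggregated embedding keeps the same scale regardless of how many messages are lost, which is one reason an attention-based encoder might tolerate dropout better than a fixed-size concatenation of neighbor states.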

5. Conclusions

SwarmMARL demonstrates that graph attention-based MARL with curriculum learning can scale drone swarm coordination to 128 agents while maintaining robust performance under real-world disturbances. The framework's sim-to-real transfer capability, validated on physical drones, positions it as a practical solution for deploying autonomous swarms in disaster response and environmental monitoring applications.

References

Lowe, R.; Wu, Y.; Tamar, A. Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments. NeurIPS 2017.
Rashid, T.; Samvelyan, M.; Schroeder, C. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. ICML 2018.
Veličković, P.; Cucurull, G.; Casanova, A. Graph Attention Networks. ICLR 2018.
Yu, C.; Velu, A.; Vinitsky, E. The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games. NeurIPS 2022.
Tang, Y.; Haque, S.; Liu, J. Learning to Fly: A Deep Reinforcement Learning Approach for Quadrotor Control. ICRA 2018.
Şenbaşlar, D.; Sukhatme, G. S. Scaling Multi-Agent Reinforcement Learning for Real-World Drone Swarms. Science Robotics 2024, 9, eadh1725.

This article is published under the Creative Commons Attribution 4.0 International License (CC BY 4.0).