Why Edge, Not Cloud

The conventional approach to drone-based detection is conceptually simple: stream the camera feed from the drone to a ground station, run detection on the ground station (or in the cloud), and display results to the operator. The drone is a flying camera with a video transmitter. All the intelligence lives on the ground.

This architecture has a single point of failure that matters enormously in SAR: the communication link.

SAR operations happen in environments that are hostile to radio communication. RF signals attenuate through foliage, reflect off terrain features, and lose line-of-sight behind ridgelines. The Parrot ANAFI communicates over WiFi, which is adequate at close range but degrades rapidly with distance, obstruction, and interference. At 500m range with partial obstruction, you might get intermittent connectivity. At 1km through a forest canopy, you probably get nothing.

If detection depends on the video stream reaching the ground station, then detection fails when the link fails. The drone is still flying. The camera is still recording. A person may be directly below the drone. But nobody knows, because the frames are not reaching the detector.

Running inference onboard changes the failure mode. The detection pipeline operates on the drone itself, processing each camera frame locally. When a person is detected, the drone publishes a compact alert message — GPS coordinates, confidence score, a small thumbnail image. This alert is a few kilobytes, not a continuous video stream. It can be buffered and retransmitted when connectivity resumes. Even if the link is down for minutes, the detections are not lost.
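The store-and-forward behaviour is simple to sketch. The following is a minimal illustration in Python (the class, its names, and the transport callback are ours, not the actual AirSDK API): alerts go into a bounded queue and are drained whenever the link cooperates.

```python
import collections

class AlertBuffer:
    """Store-and-forward queue so alerts survive link outages.

    `send_fn` is a hypothetical transport callback that returns True on
    successful delivery and False (or raises OSError) when the link is down.
    """

    def __init__(self, send_fn, max_alerts=1000):
        self._send = send_fn
        # Bounded so a very long outage cannot exhaust onboard memory;
        # the oldest alerts are dropped first if the cap is reached.
        self._queue = collections.deque(maxlen=max_alerts)

    def publish(self, alert_bytes):
        """Queue a new alert and attempt to drain the queue immediately."""
        self._queue.append(alert_bytes)
        self.flush()

    def flush(self):
        """Drain the queue in order; stop at the first failure and retry
        later (e.g. when a link-restored event fires)."""
        while self._queue:
            try:
                if not self._send(self._queue[0]):
                    return
            except OSError:
                return
            self._queue.popleft()
```

Because each alert is a few kilobytes, even a queue of hundreds of detections is trivial to hold in memory and cheap to flush once the link returns.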

The tradeoff is compute. The drone has limited processing power compared to a ground station GPU. But for object detection at useful frame rates, it is sufficient. The Parrot ANAFI's Snapdragon 845 can run quantised TensorFlow Lite models at 5 to 10 frames per second, depending on model complexity. That is not real-time video analysis, but it is more than adequate for a drone flying at 5 m/s over a search grid — at that speed, a person-sized target remains in frame for multiple seconds.

The Detection Pipeline

The pipeline has six stages, each straightforward individually. The engineering challenge is making them run reliably and continuously on constrained hardware while the drone is simultaneously executing a complex flight mission.

1. Frame capture. The AirSDK provides a video sink that receives decoded camera frames. Each frame arrives as a raw image buffer in YUV or RGB format, at the camera's native resolution (up to 4K on the ANAFI). We do not need — and cannot afford to process — full-resolution frames for detection.

2. Resize. The frame is downscaled to the model's expected input size: 300 × 300 pixels for SSD MobileNet. This is a bilinear interpolation resize, done in software. The resize also handles colour space conversion if necessary (YUV to RGB). The 300 × 300 input means we are processing roughly 1% of the pixels in a 4K frame, which is the primary reason inference is fast enough to run on mobile hardware.

3. Inference. The resized frame is fed to the TensorFlow Lite interpreter running an SSD MobileNet v2 model quantised to 8-bit integers. The model was pre-trained on COCO, which includes "person" as one of its 80 object classes. Inference on the Snapdragon 845 takes approximately 100 to 150ms per frame using the NNAPI delegate, which offloads computation to the DSP.

4. Output parsing. The model outputs four tensors: bounding box coordinates (normalised to [0, 1]), class IDs, confidence scores, and a count of detections. We iterate through the detections, filter for class ID 0 ("person" in COCO), and check the confidence score against the threshold.

5. Geolocation. For each detection that passes the threshold, we compute an approximate geographic position. The bounding box centre gives us the detection's position in the image frame. Combined with the drone's GPS position (or VIO-estimated position), altitude, and camera orientation, we can project the image-space position to a ground-space coordinate. This is an approximate calculation — it assumes flat terrain and a nadir (straight-down) camera angle — but it is accurate enough to direct a ground team to within 10 to 15 metres of the target.

6. Alert publication. The detection is packaged into a protobuf message containing: timestamp, GPS coordinates of the detection, confidence score, and a JPEG-compressed crop of the original frame centred on the bounding box (with padding for context). This message is published to the ground station. If the comm link is active, it arrives immediately. If the link is down, it is queued for transmission when connectivity resumes.
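To make stage 4 concrete, here is a sketch of the output-parsing step, assuming the four SSD output tensors have already been read from the TensorFlow Lite interpreter. The function name and the use of plain lists are illustrative; in practice these would be NumPy arrays fetched via the interpreter's output details.

```python
PERSON_CLASS_ID = 0      # "person" in the COCO label map used here
CONF_THRESHOLD = 0.6     # default threshold, discussed below

def parse_detections(boxes, class_ids, scores, count,
                     person_id=PERSON_CLASS_ID, threshold=CONF_THRESHOLD):
    """Filter the four SSD output tensors down to person detections.

    boxes:     per-detection [ymin, xmin, ymax, xmax], normalised to [0, 1]
    class_ids: per-detection class indices
    scores:    per-detection confidence scores
    count:     number of valid detections in the tensors
    """
    people = []
    for i in range(int(count)):
        if int(class_ids[i]) == person_id and scores[i] >= threshold:
            people.append({"box": boxes[i], "score": scores[i]})
    return people
```

Everything downstream (geolocation, alert publication) runs only on the handful of detections this filter lets through, which is what keeps the per-frame cost dominated by inference rather than post-processing.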

Choosing a Confidence Threshold

The confidence threshold is the minimum score a detection must have before it is reported. This is the single most important tunable parameter in the system, and it involves a genuine engineering tradeoff.

Low threshold (e.g. 0.3): The system reports everything that vaguely resembles a person. Tree stumps, rocks with shadows, backpacks, oddly shaped bushes. The operator is overwhelmed with false positives. In testing, a 0.3 threshold over typical SAR terrain produced 50 to 100 false alerts per mission. Each alert requires the operator to look at the thumbnail, assess it, and dismiss it. This consumes operator attention and creates alert fatigue — after dismissing 30 false positives, the operator stops taking alerts seriously and risks dismissing a true positive.

High threshold (e.g. 0.9): The system only reports detections it is very confident about. False positives are rare. But so are true positives under challenging conditions. A person partially obscured by vegetation, lying in an unusual pose, wearing terrain-matching clothing, or at the edge of the frame might generate a confidence score of 0.5 to 0.7. A 0.9 threshold misses them entirely.

The default: 0.6. This is a pragmatic balance for visible-spectrum detection in SAR conditions. In our testing across varied terrain types, a 0.6 threshold produces approximately 3 to 8 false positives per 200m × 200m grid search. That is a manageable number — the operator can review each alert in seconds. The false negative rate at 0.6 is non-zero but significantly better than human visual monitoring of a video feed, which is the alternative.

The threshold is configurable per mission. An operator searching an open beach (high contrast, few occluding objects) might raise it to 0.7 or 0.8. An operator searching dense scrubland where a person could be partially hidden might lower it to 0.5 and accept more false positives as the cost of not missing someone.
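Expressed as configuration, the per-mission tuning described above might look like this. The preset names are ours; the threshold values follow the discussion.

```python
from dataclasses import dataclass

@dataclass
class MissionConfig:
    """Per-mission detection settings (illustrative, not the real schema)."""
    confidence_threshold: float = 0.6   # pragmatic default for SAR

# Illustrative presets matching the terrain examples in the text
PRESETS = {
    "open_beach": MissionConfig(confidence_threshold=0.75),  # high contrast, few occluders
    "dense_scrub": MissionConfig(confidence_threshold=0.5),  # accept more false positives
    "default": MissionConfig(),
}
```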

The fundamental principle for SAR is that false negatives are worse than false positives. A false positive wastes 10 seconds of operator attention. A false negative means a person is not found. The threshold should err on the side of reporting too much rather than too little.

From Detection to Action

When the model produces a detection above threshold, the system executes a rapid sequence of actions, all running onboard the drone.

Crop and annotate. The original camera frame (at full or near-full resolution) is cropped to the bounding box region with 30% padding on each side. This gives the operator a zoomed view of what triggered the detection, with enough surrounding context to assess the scene. The bounding box is drawn on the crop. The crop is JPEG-compressed at quality 85 — a balance between image quality for human assessment and file size for transmission.
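The geometry of the padded crop is simple enough to sketch. The actual pixel crop and the quality-85 JPEG encode would be done by an image library; this helper (name ours) computes only the crop rectangle, expanded by 30% per side and clamped to the frame.

```python
def padded_crop_box(bbox, frame_w, frame_h, pad_frac=0.30):
    """Expand a pixel-space bounding box by pad_frac on each side,
    clamped to the frame boundaries.

    bbox: (xmin, ymin, xmax, ymax) in pixels of the full-resolution frame.
    Returns the crop rectangle in the same format.
    """
    xmin, ymin, xmax, ymax = bbox
    pad_x = (xmax - xmin) * pad_frac
    pad_y = (ymax - ymin) * pad_frac
    return (max(0, int(xmin - pad_x)),
            max(0, int(ymin - pad_y)),
            min(frame_w, int(xmax + pad_x)),
            min(frame_h, int(ymax + pad_y)))
```

Note that the crop is taken from the original high-resolution frame, not the 300 × 300 model input, so the operator sees far more detail than the detector did.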

Compute position. The bounding box centre is projected from image coordinates to geographic coordinates. The calculation uses the drone's current position (latitude, longitude, altitude above ground), the camera's focal length and sensor dimensions, and the bounding box position in pixels. For a nadir camera at 30m altitude, each pixel in a 4K frame corresponds to roughly 1cm on the ground. The bounding box centre maps to a ground position with an accuracy of approximately 5 to 10m, accounting for position estimation error, attitude error, and the flat-terrain assumption.
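Under the stated flat-terrain, nadir assumptions, the projection reduces to a few lines. The sketch below parameterises the camera by its horizontal field of view rather than focal length and sensor dimensions (the two are interchangeable), and is illustrative rather than the flight code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; fine at this accuracy

def project_to_ground(px, py, img_w, img_h, alt_m, hfov_deg,
                      drone_lat, drone_lon, heading_deg):
    """Project an image point to latitude/longitude, assuming a nadir
    camera and flat terrain (the same simplifications described above).

    hfov_deg:    camera horizontal field of view in degrees
    heading_deg: drone heading, degrees clockwise from true north
    """
    # Ground footprint width under the nadir camera, and metres per pixel
    ground_w = 2 * alt_m * math.tan(math.radians(hfov_deg) / 2)
    m_per_px = ground_w / img_w

    # Offset from the image centre, in metres: +dx is right of heading,
    # +dy is ahead along heading (image y grows downward)
    dx = (px - img_w / 2) * m_per_px
    dy = (img_h / 2 - py) * m_per_px

    # Rotate the camera-frame offset into north/east components
    h = math.radians(heading_deg)
    north = dy * math.cos(h) - dx * math.sin(h)
    east = dy * math.sin(h) + dx * math.cos(h)

    # Small-offset conversion from metres to degrees
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(drone_lat))))
    return drone_lat + dlat, drone_lon + dlon
```

The simplifications are visible in the code: no terrain elevation model, no camera tilt, and a spherical-Earth small-offset conversion. Each contributes to the quoted error budget.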

Publish alert. The detection message is serialised as a protobuf and published over the drone's communication channel. The message includes: timestamp (UTC), detection ID (monotonically increasing), drone position at time of detection, estimated target position, confidence score, model class label, and the JPEG thumbnail. The total message size is typically 20 to 40 KB.
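A sketch of the message assembly follows. The real system serialises a protobuf; JSON stands in here purely so the example is self-contained, and the field names are ours. Note that protobuf would carry the thumbnail as a raw bytes field, whereas JSON forces a base64 detour.

```python
import base64
import json
import time

_next_id = 0

def build_alert(target_lat, target_lon, drone_lat, drone_lon, drone_alt,
                confidence, label, thumbnail_jpeg):
    """Assemble a detection alert (JSON stand-in for the protobuf message)."""
    global _next_id
    _next_id += 1  # monotonically increasing detection ID
    alert = {
        "timestamp_utc": time.time(),
        "detection_id": _next_id,
        "drone_position": {"lat": drone_lat, "lon": drone_lon, "alt": drone_alt},
        "target_position": {"lat": target_lat, "lon": target_lon},
        "confidence": confidence,
        "label": label,
        # JPEG bytes must be base64-encoded to travel inside JSON
        "thumbnail": base64.b64encode(thumbnail_jpeg).decode("ascii"),
    }
    return json.dumps(alert).encode("utf-8")
```

With a 20 to 40 KB thumbnail dominating the payload, the metadata fields are essentially free; there is no reason to economise on them.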

Display to operator. On the ground station, the alert appears on the mission map as a marker at the estimated position, with the thumbnail displayed in a sidebar. The operator can tap the marker to see the full crop, accept or dismiss the detection, and optionally redirect the drone to circle back for a closer look.

The entire pipeline from frame capture to alert display takes under 200ms in typical conditions. The operator sees a detection alert within a fraction of a second of the drone passing over the target. Compare this to the manual approach, where the operator must notice a person in a continuous video stream while simultaneously monitoring other instruments — and where "noticing" depends on sustained human attention that degrades measurably after 20 minutes.

What We Don't Do Yet

Honest engineering requires stating limitations clearly.

Thermal and infrared detection. The current system uses the ANAFI's visible-spectrum camera only. Thermal imaging would dramatically improve detection of people in vegetation, at night, and in low-contrast environments. It requires different hardware (a thermal camera payload) and a different model (trained on thermal imagery, where a person's heat signature is the primary feature). This is planned but not yet implemented.

Night operation. The visible-spectrum camera is useless in darkness, and VIO degrades severely in low light because feature tracking requires visible texture. Night SAR operations are among the most critical scenarios — a person lost after dark is in increasing danger — and our current system cannot address them. Thermal integration is the path forward, combined with active illumination for VIO.

Dense canopy penetration. Visible light does not penetrate leaves. A person under forest canopy is invisible to a camera flying above. Lower altitude helps in some cases (gaps in the canopy), but dense deciduous or coniferous forest is effectively opaque from above. This is a physics problem, not a software problem. LiDAR and thermal are partial mitigations but not complete solutions.

Prone or partially buried subjects. The COCO-trained model performs best on upright, fully visible humans. A person lying face-down in tall grass, partially buried in rubble, or curled in a fetal position may not match the training distribution well enough to trigger a confident detection. Fine-tuning the model on SAR-specific imagery (drone-perspective, varied poses, partial occlusion) would improve performance, but we have not yet assembled a large enough training dataset for this.

These are not disclaimers buried in fine print. They are the honest boundaries of what the current system can and cannot do. Every SAR tool has limitations. The question is whether, within its operational envelope, it performs better than the alternative — which is a fatigued human watching a video feed. Within that envelope, it does.