Every System Fails

No autonomous system is infallible. Hardware degrades. Software encounters edge cases. The physical environment does things that no simulation predicted. The question that matters is not whether failure happens — it will — but whether the system handles failure safely, transparently, and predictably.

An autonomous drone that "just works" until it doesn't, with no defined failure behaviour, is more dangerous than a manually piloted one. The manual pilot at least has situational awareness and can improvise. The autonomous system that fails without a plan simply falls out of the sky, flies into terrain, or continues executing a mission that no longer makes sense.

The value of Overwatch's safety architecture is not preventing all failures. That is impossible. The value is ensuring that every failure has a defined, deterministic response. When a failure occurs, the system does not guess. It does not retry indefinitely. It transitions to a known-safe state, alerts the operator, and logs everything.

This post documents every known failure mode across both Overwatch Core and Overwatch Orchestrate, and how the system handles each one. We are publishing this because transparency about limitations is more useful than marketing about capabilities.

Handoff Failures

This failure mode is specific to Orchestrate and its fleet relay operations. The scenario: the active drone's battery is approaching the swap threshold, Orchestrate dispatches the standby drone, and the standby drone fails to launch.

Causes fall into three categories. Hardware fault: a motor fails its pre-flight self-test, the GPS module does not acquire a fix within the launch timeout, or a battery contact is corroded and the drone cannot draw sufficient current. Battery error: the standby drone's state-of-charge is below the launch threshold despite the status indication showing it as ready — this happens when batteries have been sitting in cold conditions and their voltage sags under load. Communication failure: Orchestrate cannot establish the WiFi link to the standby drone to upload the mission.

Orchestrate's response is deterministic. First, it alerts the operator with a "handoff failure" notification identifying the drone and the cause. Second, it extends the active drone's patrol if battery allows: the active drone may fly past the normal swap threshold and closer to the RTH threshold, buying time. Third, it attempts to dispatch the next drone in the fleet roster. If no standby drone is available, Orchestrate raises a "fleet exhaustion" alert, and the active drone completes its current patrol leg and returns. A coverage gap occurs. The system logs the event with timestamps and cause codes.
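The response order above can be sketched in a few lines of Python. This is an illustrative sketch, not Orchestrate's actual code: the names `handle_handoff_failure`, `HandoffResult`, and `try_launch` are hypothetical, and the rule that the patrol is extended only while battery is above the RTH threshold is our reading of "if battery allows".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HandoffResult:
    launched: Optional[str] = None            # ID of the drone that launched, if any
    alerts: List[str] = field(default_factory=list)
    extend_active_patrol: bool = False


def handle_handoff_failure(
    failed_drone: str,
    cause: str,
    roster: List[str],
    try_launch: Callable[[str], bool],        # hypothetical launch hook
    active_battery_pct: float,
    rth_threshold_pct: float,
) -> HandoffResult:
    result = HandoffResult()
    # 1. Alert the operator, naming the drone and the cause.
    result.alerts.append(f"handoff failure: {failed_drone} ({cause})")
    # 2. Extend the active patrol only while battery allows (assumed rule).
    result.extend_active_patrol = active_battery_pct > rth_threshold_pct
    # 3. Try each remaining drone in roster order, once each. No retry loop.
    for drone_id in roster:
        if drone_id == failed_drone:
            continue
        if try_launch(drone_id):
            result.launched = drone_id
            return result
        result.alerts.append(f"handoff failure: {drone_id} (launch failed)")
    # 4. Nothing launched: raise fleet exhaustion; a coverage gap follows.
    result.alerts.append("fleet exhaustion")
    return result
```

Each drone is attempted exactly once, so the function always terminates with either a launched drone or a "fleet exhaustion" alert.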

The operator's decision at that point: swap the faulty drone's battery or hardware, insert a fresh drone into the fleet roster, or accept reduced coverage until the situation is resolved. Orchestrate does not make that decision. It presents the facts and waits.

Communications Loss

The WiFi link between the operator's device and the drone drops. This is one of the most common field failures, and the system is designed to handle it as a normal operating condition rather than an emergency.

Causes fall into three categories. Range: the drone flew beyond WiFi range during a patrol or search leg. Interference: the RF environment in the field is noisy, particularly near urban areas, ports, or military installations. Terrain: the drone is behind a cliff, building, or ridge that blocks line-of-sight to the ground station.

Core's response: the drone continues its mission autonomously. This is a fundamental design choice. The entire flight plan is onboard — every waypoint, every sweep line, the complete boustrophedon grid — stored as relative vectors in the AirSDK mission package. The drone does not need the link to navigate, search, or detect. It is not streaming commands from the ground station. It is executing a self-contained program.

Detections that occur during communications loss are stored locally on the drone. When the link recovers, they are transmitted to the ground station and appear on the operator's map. No detection data is lost.
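One common way to get this behaviour is a store-and-forward queue. The sketch below is a minimal illustration under our own assumptions: `DetectionBuffer` and the `send` callback are hypothetical names, and a real implementation would persist the queue to onboard storage rather than hold it in memory.

```python
from collections import deque
from typing import Callable, Deque, Dict


class DetectionBuffer:
    """Hypothetical store-and-forward queue for detections."""

    def __init__(self) -> None:
        self._pending: Deque[Dict] = deque()

    def record(self, detection: Dict) -> None:
        # Always queue locally first, whether or not the link is up.
        self._pending.append(detection)

    def flush(self, send: Callable[[Dict], bool]) -> int:
        """Send queued detections oldest-first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not send(self._pending[0]):
                break                      # link dropped again: keep the rest
            self._pending.popleft()        # remove only after a confirmed send
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)
```

A detection leaves the queue only after `send` reports success, so a link that drops mid-flush cannot lose data.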

There is one exception. If communications loss exceeds the configurable timeout AND battery is below the RTH threshold, the drone returns home. The rationale: the drone may not have enough battery to complete the mission and still make it back to the landing point. Without a link to the operator, it cannot request guidance, so it makes the conservative choice. The operator sees a "comms lost" alert with the drone's last known position and the timestamp of link loss.
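The exception is a conjunction of two conditions, which a short sketch makes explicit. The function and parameter names here are hypothetical; only the AND logic comes from the rule above.

```python
def should_return_home(
    link_lost_s: float,
    comms_timeout_s: float,
    battery_pct: float,
    rth_threshold_pct: float,
) -> bool:
    """Return True only when BOTH conditions from the rule above hold."""
    timed_out = link_lost_s > comms_timeout_s        # link down longer than the timeout
    battery_low = battery_pct < rth_threshold_pct    # battery below the RTH threshold
    return timed_out and battery_low
```

A long outage with a healthy battery does not trigger a return; the drone keeps flying its onboard plan.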

Detection Misses

The onboard detection model (TFLite SSD MobileNet v2, running at approximately 5 FPS on the drone's processor) uses a confidence threshold of 0.6. Every frame where the model's confidence exceeds 0.6 generates an alert. Every frame where it does not exceed 0.6 generates no alert. This creates a tradeoff that cannot be eliminated, only managed.
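The thresholding step can be sketched as follows. This is an illustration only: the detection dictionaries and the `score` key are hypothetical stand-ins for whatever the model's post-processing actually produces.

```python
from typing import Dict, List, Tuple

CONFIDENCE_THRESHOLD = 0.6  # value stated in this post


def split_by_confidence(
    detections: List[Dict],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Tuple[List[Dict], List[Dict]]:
    """Split detections into (alerts, sub_threshold).

    Only scores that exceed the threshold alert; everything else is
    kept separately so it can be logged rather than discarded.
    """
    alerts = [d for d in detections if d["score"] > threshold]
    sub_threshold = [d for d in detections if d["score"] <= threshold]
    return alerts, sub_threshold
```

Returning the sub-threshold detections instead of dropping them lets them be kept for later review.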

Below the threshold, the model may have seen something — a shape, a colour contrast, a partial outline — but was not confident enough to classify it as a person. This means missed detections are possible. We cannot state otherwise honestly.

Specific limitations we have observed. Prone or partially occluded persons are harder to detect than standing ones — the model was trained primarily on upright figures, and a person lying flat presents a significantly different silhouette from 30–50 metres altitude. Dense canopy blocks the camera entirely; the drone cannot see through trees. Night operations with visible-spectrum camera only: detection performance degrades severely after dark, as the model depends on visual contrast that does not exist without ambient light. Persons in water — overboard scenarios — present a small, low-contrast target against a moving, reflective background.

Mitigations exist but do not eliminate the problem. The boustrophedon sweep pattern with configurable overlap means each ground point is imaged from multiple frames at different angles and distances. A person missed in one frame may be detected in the next pass at a different viewing geometry. Higher overlap percentages increase this redundancy at the cost of longer flight time and therefore smaller searchable area per battery. Operators choose the tradeoff based on the scenario.
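The overlap tradeoff is easy to quantify with a rough model. The sketch below assumes a nadir-pointing camera over flat ground and ignores turn time at the end of each sweep line; the 90-degree field of view, 5 m/s ground speed, and 20-minute flight time in the usage note are hypothetical example values, not ANAFI UKR specifications.

```python
import math


def strip_width_m(altitude_m: float, hfov_deg: float) -> float:
    """Ground width imaged by a nadir camera with the given horizontal FOV."""
    return 2.0 * altitude_m * math.tan(math.radians(hfov_deg) / 2.0)


def line_spacing_m(strip_width: float, overlap: float) -> float:
    """Distance between adjacent sweep lines for a fractional overlap in [0, 1)."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return strip_width * (1.0 - overlap)


def area_per_battery_m2(ground_speed_mps: float, flight_time_s: float, spacing_m: float) -> float:
    """Approximate new ground covered per battery, ignoring turns."""
    return ground_speed_mps * flight_time_s * spacing_m
```

With these example values at 40 m altitude, the strip is about 80 m wide. At 50% overlap the line spacing is 40 m and one battery covers roughly 5 × 1200 × 40 = 240,000 m², half of what it would cover with no overlap.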

Thermal camera integration — on the roadmap — will address the night and vegetation limitations by detecting body heat rather than visual appearance. This is the single highest-impact improvement planned for the detection pipeline.

Weather Degradation

Wind. The ANAFI UKR's operational wind limit is 8 m/s (approximately 29 km/h, at the boundary between Beaufort 4 and 5). Above this, position holding degrades, ground speed becomes inconsistent across upwind and downwind legs, and the systematic sweep pattern breaks down. The spacing between sweep lines assumes consistent ground speed; variable wind makes the actual ground coverage unpredictable. The flight supervisor monitors wind via IMU-derived estimates. If sustained wind exceeds the threshold, the supervisor transitions to RTH. The drone does not attempt to continue the mission in conditions where the search pattern is no longer reliable.
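A "sustained" check has to distinguish steady wind from a single gust. The sketch below shows one way to do that with a sliding window. The `WindMonitor` class, the window length, and the definition of sustained (every sample in a full window above the limit) are assumptions for illustration, not the supervisor's actual logic.

```python
from collections import deque
from typing import Deque


class WindMonitor:
    """Hypothetical sustained-wind check over a fixed window of estimates."""

    def __init__(self, limit_mps: float = 8.0, window: int = 10) -> None:
        self.limit_mps = limit_mps
        self.samples: Deque[float] = deque(maxlen=window)

    def update(self, estimate_mps: float) -> bool:
        """Add one wind estimate; return True when RTH should be triggered."""
        self.samples.append(estimate_mps)
        window_full = len(self.samples) == self.samples.maxlen
        # Sustained = the window is full and even the calmest sample is over the limit.
        return window_full and min(self.samples) > self.limit_mps
```

A single reading at or below the limit holds off the trigger until that reading has aged out of the window, so isolated gusts do not force a return.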

Rain. Water on the lens degrades image quality for both navigation and detection. The visual-inertial odometry (VIO) system tracks visual features frame-to-frame to estimate position; water droplets on the lens reduce the number of trackable features and can cause false feature matches. Heavy rain can cause VIO drift to exceed the 5-metre threshold, triggering Emergency landing. Light rain may be tolerable but reduces detection confidence — the model sees blurred, streaked frames and its confidence scores drop below the 0.6 threshold more frequently.

Fog. Reduces visual range and VIO feature tracking quality. The effect is similar to rain — fewer trackable features, faster drift accumulation. VIO drift is again the governing constraint. In dense fog, the drone may not be able to maintain position accuracy long enough to complete a meaningful search.

The operator receives weather-related alerts for all of these conditions and can pre-emptively pause or abort a mission before conditions deteriorate to the point where the supervisor forces a return. Pre-emptive abort is always preferable to an automated one, because the operator can choose when and where the drone returns rather than having the supervisor make that decision mid-leg.

VIO Drift and Navigation Errors

When GPS is unavailable and the drone is navigating on VIO alone, position error accumulates. This is inherent in any dead-reckoning system — small errors in each frame's motion estimate compound over time. The longer the drone flies on VIO without a GPS correction, the larger the position error grows.

The 5-metre drift threshold is the point where the supervisor decides the position estimate is too unreliable to continue the mission. The number is not arbitrary. At 5 metres of accumulated error, the search coverage gaps become comparable to or larger than the sweep strip width at typical search altitudes. The drone is still flying a pattern, but the pattern is not where the drone thinks it is. The search grid on the operator's map no longer corresponds to what the drone is actually covering on the ground.

The response is Emergency landing: an immediate controlled descent at the current position. The rationale: it is better to land safely, accepting that the position estimate is known to be unreliable, than to continue flying a search that is not actually searching where it should be. A completed search pattern with 5+ metres of drift error gives the operator false confidence that an area has been covered when it has not. Landing and acknowledging the navigation failure is more honest and ultimately safer.
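Reduced to its core, the drift rule is a single comparison with no retry and no intermediate state. The sketch below uses hypothetical names (`SupervisorAction`, `check_vio_drift`) and takes the drift estimate as an input; how that estimate is computed is outside its scope.

```python
from enum import Enum

VIO_DRIFT_LIMIT_M = 5.0  # threshold stated in this post


class SupervisorAction(Enum):
    CONTINUE = "continue"
    EMERGENCY_LAND = "emergency_land"


def check_vio_drift(
    estimated_drift_m: float,
    limit_m: float = VIO_DRIFT_LIMIT_M,
) -> SupervisorAction:
    """Map an estimated VIO drift to a supervisor action."""
    if estimated_drift_m > limit_m:
        # Position is too unreliable to trust the search grid: land now.
        return SupervisorAction.EMERGENCY_LAND
    return SupervisorAction.CONTINUE
```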

After landing, the operator retrieves the drone, moves to a location with better GPS coverage or richer visual texture for VIO tracking, and relaunches. The mission can be resumed from the point of interruption — Overwatch logs the last completed waypoint, so the operator can generate a new mission covering the remaining area.

Operator Override

At any point during any mission — Core or Orchestrate, single drone or fleet — the operator can: pause the mission (drone holds position), command RTH (orderly return along a safe path), command emergency land (immediate descent at current position), or take manual control via the Parrot FreeFlight controller.

The autonomous system is always subordinate to human command. This is a design principle, not a feature. There is no "fully autonomous" mode that locks out the operator. There is no confirmation dialog that delays an emergency command. The operator presses the button; the drone obeys.

The reason is simple: the operator has context that the system does not. Weather changing faster than the IMU can estimate. Bystanders entering the search area. A vessel moving into the flight path. New information from ground teams about the missing person's likely location that changes the search priority. Radio traffic from other aircraft in the area. The system executes the plan; the human decides whether the plan still makes sense.

Every operator override is logged with the same fidelity as every autonomous state transition — timestamp, drone position, battery state, active mission phase, and the command issued. Post-mission review can reconstruct exactly when and why the operator intervened.
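As an illustration, a log record carrying those five fields might look like the sketch below. The class, field names, and JSON layout are hypothetical; only the list of fields and the four override commands come from this post.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Tuple


class OverrideCommand(Enum):
    PAUSE = "pause"
    RTH = "rth"
    EMERGENCY_LAND = "emergency_land"
    MANUAL_CONTROL = "manual_control"


@dataclass(frozen=True)
class OverrideLogEntry:
    timestamp_utc: str                      # e.g. an ISO 8601 string
    position: Tuple[float, float, float]    # drone position; frame is an assumption
    battery_pct: float
    mission_phase: str
    command: OverrideCommand

    def to_json(self) -> str:
        record = asdict(self)
        record["command"] = self.command.value   # store the enum as a plain string
        return json.dumps(record, sort_keys=True)
```

Frozen records with a stable serialisation make the log append-only in practice: an entry is written once and never edited afterwards.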

Transparency Over Perfection

The system does not hide its failures. Every state transition, every threshold breach, every detection — and every frame where the model saw something below the 0.6 confidence threshold — is logged with timestamps, GPS coordinates (when available), and the drone's internal state at that moment.

Post-mission analysis is deterministic. You can reconstruct exactly what the drone did, why it did it, and what it saw. If the drone returned early, the log shows the exact sensor reading that triggered the return and the supervisor state transition that executed it. If a detection was missed, the sub-threshold frames are available for review. If communications were lost, the log shows what the drone did autonomously during the gap.

This transparency is more valuable than perfection. A system that fails visibly and predictably can be trusted. The operator learns its behaviour, understands its limits, and can compensate for its weaknesses. A system that fails silently — that hides edge cases behind retry loops and optimistic status messages — cannot be trusted, because the operator never knows whether the system is working or merely appearing to work.

We chose to build the system that admits what it cannot do. Every limitation documented in this post is a limitation we have observed, measured, and designed a response for. The list will grow as the system encounters new environments and new edge cases. When it does, the response will be the same: define the failure, implement a deterministic response, log everything, and tell the operator the truth.