The Department of Defense and Department of Homeland Security use many threat detection systems, such as air cargo screeners and counter-Improvised Explosive Device systems. However, threat detection systems that perform well during testing are not always well received by their operators. Some systems may frequently “cry wolf,” generating alarms even when no true threat is present. As a result, operators may lose faith in the systems, ignoring them or even turning them off and taking the chance that a true threat will not appear. This paper reviews statistical concepts to reconcile the performance metrics that summarize a developer’s view of a system during testing with the metrics that describe an operator’s view of the system during real-world missions. Program managers can still make use of systems that “cry wolf” by arranging them into a tiered system that, overall, exhibits better performance than any individual system alone.
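
To make the gap between the two views concrete, the following is a minimal sketch rather than the paper's own method. It assumes the developer's view can be summarized by a probability of detection and a probability of false alarm, and the operator's view by the fraction of alarms that correspond to true threats, computed with Bayes' rule. All numbers, function names, and the independence assumption for the two-tier arrangement are hypothetical and chosen only for illustration.

```python
def fraction_of_alarms_that_are_true(prevalence, p_detect, p_false_alarm):
    """Operator's view: P(true threat | alarm), via Bayes' rule.

    prevalence    -- assumed fraction of screened items that are true threats
    p_detect      -- P(alarm | threat), as measured during testing
    p_false_alarm -- P(alarm | no threat), as measured during testing
    """
    p_alarm = p_detect * prevalence + p_false_alarm * (1.0 - prevalence)
    return p_detect * prevalence / p_alarm


def two_tier(p_detect_1, p_false_alarm_1, p_detect_2, p_false_alarm_2):
    """Combined rates when tier 2 re-examines only what tier 1 alarmed on.

    Assumes (hypothetically) that the two tiers err independently, so an
    overall alarm requires both tiers to alarm.
    """
    return p_detect_1 * p_detect_2, p_false_alarm_1 * p_false_alarm_2


# Hypothetical values: threats are rare, and the system tests well.
prevalence = 0.001
single = fraction_of_alarms_that_are_true(prevalence, 0.90, 0.05)

# Hypothetical tiered arrangement of two such systems in series.
p_d, p_fa = two_tier(0.90, 0.05, 0.90, 0.05)
tiered = fraction_of_alarms_that_are_true(prevalence, p_d, p_fa)

print(f"single system: {single:.1%} of alarms are true threats")
print(f"two tiers:     {tiered:.1%} of alarms are true threats")
```

With these illustrative inputs, fewer than 2% of the single system's alarms are true threats, even though it detects 90% of threats in testing, which is the “cry wolf” behavior an operator experiences. Placing two such systems in series raises that fraction to roughly 24%, at the cost of a lower combined detection probability (0.81 in this sketch), so whether the tiered arrangement is better overall depends on how the tiers are chosen and on how far the independence assumption holds.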