Files

Résumé

The pursuit of software security and reliability hinges on the identification and elimination of software vulnerabilities, a challenge compounded by the vast and evolving complexity of modern systems. Fuzzing has emerged as an indispensable technique for bug discovery, owing to its ability to automate the rapid generation and execution of test cases. However, its effectiveness is constrained by the quality of metrics employed for evaluation and optimization. This dissertation posits that for effective bug discovery in increasingly complex systems, fuzzing techniques must employ tailored metrics that capture application-specific features such as state and semantics. The dissertation first addresses a glaring gap in the fuzzing landscape—the lack of standardized benchmarks for fuzzer evaluation—through Magma, a fuzzer benchmark with ground-truth metrics. Magma enables objective assessment of fuzzer performance across diverse software targets by leveraging real-world bugs, instrumented to provide bug-centric metrics. Through rigorous experiments, set up over 200,000 CPU-hours and involving state-of-the-art mutation-based fuzzers, Magma highlights the limitations of using crash counts as the de facto evaluation metric, and provides a unified platform for an accurate evaluation and comparison of fuzzers. The second project, Igor, addresses the inefficacy of crash de-duplication techniques, which typically suffer from bug-count inflation and conflation. Igor builds on the insight that a bug cannot be triggered without executing its code. Through a process of test case minimization and execution trace matching, we introduce a metric for crash de-duplication that goes beyond code coverage and call stacks. By employing control-flow graph similarity comparisons over minimized execution traces, Igor demonstrates its capability to accurately group crashes, reducing "unique" bug counts by an order of magnitude compared to existing techniques. The penultimate project, Tango, addresses the inadequacy of traditional code coverage metrics in exploring the state spaces of complex systems like language parsers and video games. By incorporating "state" as a first-class citizen, Tango enhances the fuzzer's ability to navigate complex systems. State inference led to the discovery of previously undetected bugs, and it also highlighted a novel observation: code coverage is insufficient for describing state. Through our evaluation, Tango reveals that fuzzers which rely solely on code coverage could potentially spend upwards of 80% of their time duplicating their efforts in the face of stateful targets. Finally, Sensei aims to help fuzzers at incrementally exploring targets by fuzzing parsers in isolation and using those results to bootstrap the fuzzing of more complex targets. By leveraging the rich and specific domain knowledge encoded in Wireshark dissectors, Sensei incorporates parser-specific metrics to guide fuzzing in the direction of high-quality and diverse inputs, to aid the exploration of network protocol implementations. Collectively, these projects introduce novel metrics and methodologies for fuzzer development and evaluation. They provide empirical evidence supporting the thesis that tailored metrics are key to effective and successful fuzzing.

Détails

PDF