Learning from the Recent Windows/Falcon Sensor Outage: Causes and Potential Improvement Strategies in Linux Using Open Source Solutions
At the time of writing, most people have probably heard about the massive Windows outage caused by a faulty kernel driver in Falcon Sensor, a CrowdStrike product. On Friday, July 19, 2024, a configuration update designed to target newly observed malicious artifacts used in cyberattacks prevented several million Windows machines from booting. How can a configuration file crash an OS? The real issue is not the configuration file itself but the kernel driver that consumes it. Let’s take a quick, non-technical tour of the potential reasons behind this situation, how the Linux kernel addresses the problem, and what you as users or customers can do to avoid such issues.
Developing Kernel Drivers is Hard
Kernel drivers allow code to be executed at the kernel level without modifying the kernel itself. Many software vendors use them for low-level interactions such as talking to specific hardware, firewalling, and security monitoring. Bugs in these drivers are often critical, as they may be exploited by attackers or lead to a denial of service of the operating system.
Creating flawless code is a significant challenge for any developer. Most drivers are still developed in C or C++, which are powerful but complex and not inherently memory-safe languages. Moreover, achieving full code coverage through thorough testing of kernel drivers can be difficult due to the complexity of tracing a running kernel. These challenges, along with other factors such as dealing with undocumented kernel structures, closed-source kernels, and legacy code, can unfortunately lead to bugs and security vulnerabilities.
While we respect third-party companies, some may have less rigorous code review and development processes than OS kernel developers. This is a common issue in the software industry, where the focus may sometimes be on rapid feature development rather than code quality or stability. Consequently, as long as third-party companies continue to rely on kernel drivers, the risk of a kernel panic is likely to be higher from their code than from the OS kernel itself. (A kernel panic is a safety measure in an operating system, triggered by a critical error from which it cannot safely recover, causing the system to halt or restart.) Robust testing and advanced development practices are essential to mitigate these risks, but completely eliminating such errors remains a challenging task.
Can This Happen to Linux?
Kernel panics can occur on any operating system. However, the Linux kernel has been moving away from relying on third-party kernel drivers. How is this being achieved? The answer is eBPF (Extended Berkeley Packet Filter). This technology enables code to run within the Linux kernel without the need to modify kernel source code or load kernel modules. A full explanation of how eBPF works is beyond the scope of this post; if you’re interested in learning more, please refer to the references section.
The power of eBPF lies in the fact that the code loaded is not machine code but bytecode, which a virtual machine inside the kernel either interprets or compiles just in time. Before the bytecode is accepted, a verifier checks it for programming mistakes that could impact kernel functionality. Because every eBPF program is bounded in the number of instructions it can execute (no endless loops are possible), the verifier can check every single instruction. If the verifier detects an issue, it rejects the bytecode and prevents it from being loaded into the kernel. This mechanism aims to ensure that faulty code never reaches the kernel, so the only component that may fail is your userland application trying to load eBPF code. A clever way to avoid kernel panics, don’t you think?
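To make the "verify before load" idea concrete, here is a minimal sketch in Rust. It is a toy model only: the instruction set, the `verify` and `load_and_run` functions, and the limits are all hypothetical and far simpler than the real eBPF bytecode and Linux verifier. It shows the principle that a program is walked statically and rejected if it could loop, jump out of bounds, or read an uninitialized register, so the interpreter only ever sees programs already proven to terminate.

```rust
// Toy model: hypothetical names, NOT the real eBPF instruction set or verifier.
const MAX_INSNS: usize = 64; // assumed program size limit
const NUM_REGS: usize = 4; // registers r0..r3

#[derive(Debug, Clone, Copy)]
enum Insn {
    LoadImm { dst: usize, value: i64 }, // dst = value
    Add { dst: usize, src: usize },     // dst = dst + src
    Jump { offset: isize },             // relative to the next instruction
    Exit,                               // return r0
}

#[derive(Debug, PartialEq)]
enum VerifyError {
    TooLong,
    BadRegister(usize),
    UninitializedRead(usize),
    BackwardJump(usize),
    JumpOutOfBounds(usize),
    MissingExit,
}

/// Statically walks the program without executing it. Every step moves the
/// program counter forward, so the walk itself always terminates.
fn verify(prog: &[Insn]) -> Result<(), VerifyError> {
    if prog.len() > MAX_INSNS {
        return Err(VerifyError::TooLong);
    }
    let mut init = [false; NUM_REGS];
    let mut pc = 0;
    while pc < prog.len() {
        match prog[pc] {
            Insn::LoadImm { dst, .. } => {
                if dst >= NUM_REGS {
                    return Err(VerifyError::BadRegister(pc));
                }
                init[dst] = true;
                pc += 1;
            }
            Insn::Add { dst, src } => {
                if dst >= NUM_REGS || src >= NUM_REGS {
                    return Err(VerifyError::BadRegister(pc));
                }
                if !init[dst] || !init[src] {
                    return Err(VerifyError::UninitializedRead(pc));
                }
                pc += 1;
            }
            Insn::Jump { offset } => {
                // Crude rule: any backward jump could form a loop, so reject it.
                if offset < 0 {
                    return Err(VerifyError::BackwardJump(pc));
                }
                pc = (pc + 1)
                    .checked_add(offset as usize)
                    .filter(|&target| target < prog.len())
                    .ok_or(VerifyError::JumpOutOfBounds(pc))?;
            }
            Insn::Exit => {
                return if init[0] { Ok(()) } else { Err(VerifyError::UninitializedRead(pc)) };
            }
        }
    }
    Err(VerifyError::MissingExit)
}

/// Runs a program only after it has passed verification.
fn load_and_run(prog: &[Insn]) -> Result<i64, VerifyError> {
    verify(prog)?;
    let mut regs = [0i64; NUM_REGS];
    let mut pc = 0;
    loop {
        match prog[pc] {
            Insn::LoadImm { dst, value } => { regs[dst] = value; pc += 1; }
            Insn::Add { dst, src } => { regs[dst] = regs[dst].wrapping_add(regs[src]); pc += 1; }
            Insn::Jump { offset } => pc = pc + 1 + offset as usize,
            Insn::Exit => return Ok(regs[0]),
        }
    }
}

fn main() {
    let good = [
        Insn::LoadImm { dst: 0, value: 40 },
        Insn::LoadImm { dst: 1, value: 2 },
        Insn::Add { dst: 0, src: 1 },
        Insn::Exit,
    ];
    // Jumps back to instruction 0 forever; must never be allowed to run.
    let looping = [Insn::LoadImm { dst: 0, value: 1 }, Insn::Jump { offset: -2 }, Insn::Exit];
    println!("{:?}", load_and_run(&good)); // Ok(42)
    println!("{:?}", load_and_run(&looping)); // Err(BackwardJump(1))
}
```

The faulty program fails at load time with an ordinary error value returned to the caller, which mirrors the point above: the failure surfaces in the loading application, not in the component that executes the code. Rejecting every backward jump is a deliberate simplification for this sketch; the real verifier is considerably more sophisticated.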
Of course, no software is entirely free from bugs, and the eBPF subsystem is no exception. Bugs can and do occur, but the current design and verification process make the likelihood of eBPF bytecode causing a kernel panic extremely low.
Why Not eBPF on Windows?
Microsoft is currently working on bringing eBPF to Windows, and you can track their progress in the ebpf-for-windows repository. However, it’s important to note that it is still far from offering the same functionality as eBPF on Linux. This limitation restricts its use to specific networking tasks, leaving out critical areas such as performance monitoring and security tracing. For security vendors, these capabilities are essential for building robust and feature-rich products. As a result, they are likely to continue using traditional kernel drivers until eBPF for Windows meets all their requirements.
What Do We Do at CIRCL?
At CIRCL, we are strong advocates of open source. Consequently, we are investing time and resources to provide an open-source alternative to monitoring products for Linux. Introducing Kunai, a security monitoring tool for Linux written in Rust and powered by eBPF.
Inspired by Sysmon for Windows, we developed this project to offer a similar experience for Linux users. Like its Windows counterpart, Kunai monitors various system events, including binary and script execution, shared objects being loaded, network connections, and many other events that can be used to build robust threat detection and hunting scenarios. It also tracks security events generated by any Linux container technology, allowing users to monitor activities within their containers. Additionally, it features a rule engine for creating powerful detection and log filtering primitives. Furthermore, it can be integrated with any Threat Intelligence Provider, enabling real-time scanning based on IoCs (Indicators of Compromise).
Although Kunai is primarily designed for threat detection and hunting, it can also be a valuable ally for forensic analysis on Linux systems. Its logs provide detailed insights into the full activity of a given process, which is crucial for understanding what happened on a system.
If you have any security monitoring needs for your Linux hosts, consider giving Kunai a try. Don’t hesitate to open issues on GitHub or reach out to us if you encounter problems or want to contribute to its improvement.
What Can You Do as a Customer?
As a customer, you have the power to choose the best solutions and should thoroughly understand and review the technologies you use. Failing to do so can lead to vulnerabilities (including potential outages) in your infrastructure, making you partly responsible for any issues that arise. It is essential to have a robust technology and product evaluation process. If you find a vendor or technology that meets most of your requirements but falls short in some areas, request a roadmap to address these gaps and ensure your needs are fully met. As a general rule, running third-party kernel drivers should be avoided when viable alternatives are available.
If a product requires kernel drivers, a good evaluation metric is to check whether these drivers are implemented in a memory-safe language such as safe Rust. In addition to its memory safety capabilities, safe Rust enforces a number of good programming practices by design and provides strong guarantees about the code. This ultimately improves the overall quality of critical code running inside the kernel. Even though Rust is very attractive for kernel-level applications, some unsafe Rust (which offers guarantees similar to C/C++) will likely be necessary, but its use should be kept to a strict minimum. It is worth noting that kernel code implemented in Rust will not prevent kernel panics, but it will provide most of the Rust guarantees and inherently reduce the risk.
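The difference between safe and unsafe Rust can be illustrated with a small userland sketch. The function names and the lookup table below are hypothetical and are not taken from any real driver; the scenario assumed here is an index that arrives from untrusted input and is used to read from a fixed-size table.

```rust
// Illustrative only: `lookup_checked` and `lookup_unchecked` are made-up names.

/// Safe Rust: an out-of-range index becomes a recoverable `None`.
fn lookup_checked(table: &[u32], index: usize) -> Option<u32> {
    table.get(index).copied()
}

/// Unsafe Rust: no bounds check, comparable to raw pointer arithmetic in C.
///
/// # Safety
/// The caller must guarantee `index < table.len()`; anything else is
/// undefined behaviour.
unsafe fn lookup_unchecked(table: &[u32], index: usize) -> u32 {
    // SAFETY: the bounds requirement is delegated to the caller, see above.
    unsafe { *table.get_unchecked(index) }
}

fn main() {
    let table = [10, 20, 30];
    // Pretend this index was parsed from an untrusted configuration file.
    let untrusted_index = 7;

    match lookup_checked(&table, untrusted_index) {
        Some(value) => println!("value: {value}"),
        None => println!(
            "index {untrusted_index} rejected: table has {} entries",
            table.len()
        ),
    }

    // Plain indexing (`table[untrusted_index]`) is also memory-safe, but it
    // panics on a bad index instead of returning an error. This is why Rust
    // reduces the risk of panics without eliminating it.

    // The unsafe variant is only sound because the index is known to be valid.
    let value = unsafe { lookup_unchecked(&table, 1) };
    println!("unchecked value: {value}");
}
```

In the checked version the bad index is handled as an ordinary error path. In the unchecked version the compiler offers no help, which is why every `unsafe` block deserves the same scrutiny as equivalent C or C++ code.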
Conclusions
Kernel panics have existed for ages and will continue to occur. However, operating system developers take this issue very seriously. Alternatives to traditional kernel modules written in C or C++ exist, such as using a safer programming language or another technology like eBPF. Even so, some software providers may prefer the conventional approach, likely because rebuilding a product from scratch with a completely different method can be too costly. In such cases, remember that as a user or customer, it is your responsibility to push for improvements or opt out of such practices if you do not agree with them.
References
eBPF
Classification of this document
TLP:CLEAR information may be distributed without restriction, subject to copyright controls.
Revision
- Version 1.0 - TLP:CLEAR - First version - 22nd July 2024