In the past few days, there has been a lot of progress and a lot of publicity for this project, which shows the ecosystem’s desire and demand for lowering the barrier to entry on booting Arm SBCs, in this case the Raspberry Pi 4, of course.
Tweets, LinkedIn posts, CNX Software replies, and Hackster comments all tell the same story: allowing users to power on a single board computer, install the operating system of their choice using “normal” boot media, and proceed through an install process just as they would on a typical PC is a missing piece in the Arm ecosystem. Without the ability for “regular” users to start exploring Arm hardware and get up and running in a way they are used to, Arm Servers will remain a niche product.
So, join us on the Discord Server, help contribute patches and code if you can, or simply spread awareness of the project on your Social Media channels!
One of the gaps between ServerReady and existing “Edge” SoCs is the latter’s general reliance on OS-coordinated device initialization and power state management. A typical BSP or port of an OS would involve clock source, GPIO, PoR and pin control/multiplexing drivers as prerequisites for any embedded devices, and usually I2C, SPI and voltage regulator drivers to be able to do anything “interesting” such as accessing sensors, doing storage I/O or driving graphics outputs.
While firmware could hypothetically pre-initialize everything into a working state, this presents a dilemma for devices meant to operate at the lowest possible power setting. Pre-initialization also means no device reconfiguration after OS boot.
ACPI has somewhat adapted to this space, but it hasn’t abstracted away enough of the gory details. While ACPI does tidy up some platform device configuration via its interpreted AML byte code methods, it is only marginally better than device tree for non-server systems. For example, ACPI’s notion of Generic Serial Bus (I2C, SPI, UART) and GPIO OpRegions lets device AML methods perform I/O without resorting to bit-banging MMIO addresses, but it requires host operating system drivers to provide the underlying implementation. There are ACPI resource descriptors for tracking and describing GPIOs and pin configuration for devices, but these are again completely useless without the appropriate drivers.
Great, so this basically reduces ACPI on non-server systems to obscure machine code coordinating a bunch of OS drivers, sourced from silicon providers and platform integrators. But maybe we can throw all these new vendor-specific OpRegions away, for compatibility’s sake, and code like it’s ACPI 4.0a?
Is writing AML really feasible?
Surely, device- and platform-specific configuration logic can be neatly limited to AML methods?
Well, that’s highly overrated.
Above is an excerpt of a TAD device for a memory-mapped RTC. It’s a good warning against doing anything beyond basic I/O and arithmetic in AML. Considering how straightforward RTCs are as a device class, this bit of code (roughly 1/4 of all the AML required) is unmaintainable, and without comments it would be completely incomprehensible. Yes, that’s a five-second busy-wait in there, waiting on a completion from the device, but it might as well be looping forever. Note that AML methods in most implementations run under a global interpreter lock (which also means AML code is not reentrant – an OpRegion cannot be backed by AML).
AML is machine code. Really slow and limited machine code. ASL (the source compiled into AML) is about as expressive as 8048 assembly.
How about translating the following real-life SBC support code to AML? On the Raspberry Pi, the Arm cores don’t have access to device control blocks and must ask the VideoCore VPU (GPU) to act on their behalf via special “mailbox” requests in shared RAM. E.g.:
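To give a feel for what such a request involves, here is a minimal, hypothetical sketch of building a property-channel message in C. The layout follows the public Raspberry Pi mailbox property interface documentation; the helper name and tag choice are illustrative, not the routine shown above:

```c
#include <stdint.h>

/* VideoCore mailbox property message, per the public Raspberry Pi
 * documentation. The buffer lives in RAM shared with the VPU, which
 * reads and rewrites it (i.e. this is DMA from the Arm cores' view).
 */
#define MBOX_REQUEST        0x00000000u
#define TAG_GET_FW_REVISION 0x00000001u /* response: 4-byte revision */
#define TAG_END             0x00000000u

/* Build a single-tag "get firmware revision" request into buf
 * (at least 7 words). Returns the message size in bytes. On real
 * hardware buf must be 16-byte aligned, since the low 4 bits of the
 * address written to the mailbox carry the channel number.
 */
static uint32_t mbox_build_get_fw_revision(uint32_t *buf)
{
    uint32_t i = 0;
    buf[i++] = 0;                   /* [0] total size, patched below   */
    buf[i++] = MBOX_REQUEST;        /* [1] request code                */
    buf[i++] = TAG_GET_FW_REVISION; /* [2] tag identifier              */
    buf[i++] = 4;                   /* [3] value buffer size in bytes  */
    buf[i++] = 0;                   /* [4] tag request/response code   */
    buf[i++] = 0;                   /* [5] value buffer, filled by VPU */
    buf[i++] = TAG_END;             /* [6] terminating tag             */
    buf[0] = i * (uint32_t)sizeof(uint32_t);
    return buf[0];
}
```

On hardware, the buffer’s physical address (OR’d with the channel number) is then written to the mailbox write register, and the Arm core polls the status register until the VPU posts its response. Keep this picture in mind for what follows.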
First of all, this is doing DMA… so for an AML implementation, you’ll need a chunk of memory. You can’t use an actual AML buffer for this, so you’ll have to carve some physical memory out of the UEFI memory map, mark it with the right memory attributes matching your coherency requirements, set up a SystemMemory OpRegion…
Oh, I’m having so much fun!
Well, Barbie, we are just getting started…
As you can see, the routine is a mix of MMIO and CPU operations, e.g. data cache cleaning and invalidation. This is problematic for compiling to AML, which doesn’t include any cache operations. You could do away with cache operations entirely by carving out a non-cache coherent memory chunk. Or perhaps you’re lucky and your device supports cache coherent DMA… Well, then it would have to look like this:
No more cache operations! But now we have barriers. AML doesn’t have memory barriers. Maybe you can play fast and loose? Well, expect problems when you rev up the cores to something a bit more OoO. Also, consider that an operating system’s AML interpreter is probably getting scheduled across multiple CPUs while running your method…
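To make the ordering requirement concrete, here is a hypothetical C11 sketch of the doorbell pattern a cache-coherent mailbox needs: a release barrier before ringing the doorbell, and an acquire barrier before reading the response. The struct and helpers are illustrative; atomic_thread_fence stands in for Arm’s DMB, and the “device” here is just plain memory:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Simulated doorbell device: on real hardware 'doorbell' would be an
 * MMIO register, and 'payload'/'response' coherent shared memory.
 */
struct dev {
    uint32_t payload;
    uint32_t response;
    _Atomic uint32_t doorbell;
};

static void dev_send(struct dev *d, uint32_t cmd)
{
    d->payload = cmd;
    /* Make the payload visible before the doorbell fires -- this is
     * the barrier that AML byte code has no way to express. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&d->doorbell, 1, memory_order_relaxed);
}

static uint32_t dev_recv(struct dev *d)
{
    while (atomic_load_explicit(&d->doorbell, memory_order_relaxed) != 0)
        ; /* wait for the device to clear the doorbell */
    /* Don't read the response until the doorbell clear is observed. */
    atomic_thread_fence(memory_order_acquire);
    return d->response;
}
```

Neither the fences nor the bounded-timeout spin you’d want in production has any AML equivalent.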
So, no AML for anything involving DMA. What a mess…
No one wants to write AML
Considering how problematic it is to write AML, Microsoft introduced their own notion of Platform Extension Plugins (PEPs) to stand in for entire AML methods.
PEPs are intended to be used for off-SoC power management methods. Since they are installable binaries, they can be updated on-the-fly as opposed to ACPI firmware which requires a firmware flash. … Power management was the original intent for PEPs, but they can be used to provide or override any arbitrary ACPI runtime method.
Providing power management using PEPs can be much easier to debug than code written for the ACPI firmware. …
PEPs can override any method and provide methods you didn’t know were necessary. The AML code now just provides a “skeleton” and some dummy method implementations. This makes ACPI system descriptions that rely on PEPs completely useless, since PEPs are opaque platform knowledge, sourced from vendors and completely undocumented.
PEPs play no role in the construction of the ACPI namespace hierarchy because the namespace hierarchy must be provided in the firmware DSDT. When the ACPI driver evaluates a method at runtime, it will check against the PEP’s implemented methods for the device in question, and, if present, it will execute the PEP and ignore the firmware’s version. However, the device itself must be defined in the firmware.
Even worse, you can’t meaningfully tell which AML methods would be overridden by a PEP, or what kind of configuration data the PEP was meant to source. PEPs are even worse than device tree, because PEPs can fully hide configuration data that would otherwise be reported as properties through device tree.
This is a good explanation for why none of the Snapdragon laptops today can boot Linux using ACPI. And not just Linux – the HP Envy x2 cannot boot a “stock” Windows 10 image without Snapdragon customizations.
PEPs are not an answer.
Let’s put the BSP in TF-A
We know we don’t want to write in AML for a good number of reasons. In a few situations it is literally impossible to write safe and functional AML. We also want to avoid writing drivers for things OS vendors (and their customers) really don’t care about, like pin controllers and clock source management. And PEPs are not a standard interface, and their implementation is OS vendor specific.
It would appear that the best place to hide low-level platform drivers for device initialization and power state management is Trusted Firmware-A. TF-A is an industry-adopted TrustZone firmware, usually used to implement PSCI or as a foundation for a Trusted Execution Environment (TEE). It’s a rich enough environment to contain complex code written in a high-level language. Also, TF-A likely already includes some of the component drivers. This way, TF-A becomes a software-based System Control Processor (SCP).
If the platform BSP entrails are squirreled away in TF-A, how would an OS interact with them? Via ACPI AML methods of course. Regardless of how the ACPI interface works, the actual calls to a firmware-based SCP would be via the well-standardized SMCCC specification.
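For reference, the function identifier carried in such a call is a simple 32-bit encoding defined by SMCCC. A hypothetical helper (field layout per the Arm SMC Calling Convention):

```c
#include <stdint.h>

/* Build an SMCCC function ID, per the Arm SMC Calling Convention:
 *   bit  31    : 1 = fast call, 0 = yielding call
 *   bit  30    : 1 = SMC64, 0 = SMC32
 *   bits 29:24 : owning entity (0 = Arm arch, 2 = SiP, 4 = standard, ...)
 *   bits 15:0  : function number within that service range
 */
static uint32_t smccc_fn_id(uint32_t fast, uint32_t smc64,
                            uint32_t owner, uint32_t fn)
{
    return (fast << 31) | (smc64 << 30) | ((owner & 0x3fu) << 24) |
           (fn & 0xffffu);
}
```

PSCI_VERSION, for instance, is the fast SMC32 call with function number 0 in the standard secure service range (owner 4), i.e. smccc_fn_id(1, 0, 4, 0) == 0x84000000.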
Isn’t putting stuff in TF-A bad? It’s not ideal, but putting it into AML or OS vendor-specific drivers is much worse. Other platforms, such as IA-64 (SAL) and OpenPOWER (OPAL), rely on firmware interfaces to abstract some platform I/O and implementation-specific details.
Isn’t that cycle stealing? No, because we’re talking about operations done on behalf of the operating system requesting them.
Note: it’s not just generic off-the-shelf operating systems that would win from TF-A abstracting common SoC hardware and control interfaces. UEFI firmware itself could make good use of these, reducing the development and support effort for all platforms and removing similar/duplicated functionality, further reducing code size and bug counts.
But what about UEFI runtime services?
Runtime services are meant to abstract parts of the hardware implementation of the platform from the OS, but the interface is fairly limited in scope today. Here’s the current set of calls:
GetTime – Returns the current time, time context, and time keeping capabilities.
SetTime – Sets the current time and time context.
GetWakeupTime – Returns the current wakeup alarm settings.
SetWakeupTime – Sets the current wakeup alarm settings.
GetVariable – Returns the value of a named variable.
GetNextVariableName – Enumerates variable names.
SetVariable – Sets, and if needed creates, a variable.
SetVirtualAddressMap – Switches all runtime functions from physical to virtual addressing.
GetNextHighMonotonicCount – Subsumes the platform’s monotonic counter functionality.
ResetSystem – Resets all processors and devices and reboots the system.
UpdateCapsule – Passes capsules to the firmware with both virtual and physical mapping.
QueryVariableInfo – Returns information about the EFI variable store.
But what if this list could be extended? Instead of moving the BSP into TF-A, let’s make it all a UEFI runtime service, and figure out how to perform RT calls from AML.
That’s not a very good idea:
RT is fragile, as services share the same privilege level as the calling operating system. Differences in the way different operating systems call services are a constant source of bugs across vendor implementations, e.g. flat addressing or translated, interrupts enabled or disabled, UART enabled or disabled.
RT requires an environment – memory ranges must be correctly mapped, enough stack, disabled FP traps, etc. A single SMC instruction for trapping to TF-A is hard to beat.
RT is fragile, as runtime services are provided by the same drivers that provide boot time services in most UEFI implementations, such as Tiano. There’s no meaningful isolation between an RT driver and its (and other drivers’) BS components. The firmware programming model is just bad: it is extremely easy to make a seemingly benign change (a new global variable, a logging statement, a driver dependency) that breaks RT support for some users, yet is very difficult to track down.
Limited facilities, with no support for asynchronous implementations – e.g. you can’t take an exception on behalf of a service while in the OS. This may make some hardware hard to expose efficiently, or make certain workarounds impossible, e.g. a device quirk that relies on handling external aborts on a system with firmware-first error handling.
Simply revising the UEFI specification with new RT services won’t do anything for existing code bases. Retrofitting will be a significant effort. In contrast, TF-A is a simpler code base that is easier to swap out.
UEFI implementations have a poor track record of being open-sourced by firmware vendors. TF-A implementations are done by the silicon providers themselves, and have a better track record of being open source and auditable.
Tying SMCCC and ACPI together
We need a generic escape hatch mechanism from ACPI to the operating system, to be able to easily perform arbitrary SMCCC calls from AML device methods.
Escape Hatch #1 – FFH
The Functional Fixed Hardware (FFH) OpRegion type seems like a good fit.
Unfortunately, the ACPI specification gives no examples of FFH usage for anything outside of Register resource descriptors, as part of processor Low Power Idle States (_LPI) support.
I’ve never seen any examples of OpRegions declared as type FFixedHW, but the ACPICA compiler (iasl) didn’t barf at this quick draft:
The problem with Hatch #1 is the large number of changes required in operating system ACPI support. Additionally, the syntax is obtuse and doesn’t fit the semantics of a method invocation well, and creating the OpRegions and buffers adds overhead.
Escape Hatch #2 – an OSPM-provided method

To borrow a page from the PEP book, the OS ACPI interpreter could provide a method to perform SMCCC calls. Unlike PEPs, this could be a well-defined interface. Its presence could be negotiated using the standard _OSC (OS Capabilities) mechanism.
Escape Hatch #3 – PCC

The Platform Communication Channel (PCC) is a generic mechanism for OSPM to communicate with an entity in the platform, such as a BMC or an SCP (system control processor). PCC relies on a shared memory region and a doorbell register. Starting with ACPI 6.0, PCC is also a supported OpRegion type. Assuming the doorbell can be wired to an SMC, PCC could be used to communicate with a Trusted Firmware-based SCP.
PCC is a higher-level protocol than invoking random SMC calls from ACPI or the OS directly. The commands/messages all go through structured shared memory. There would be only one SMCCC call used – the doorbell itself.
To use PCC with SMC via FFH, the Arm FFH specification would need to be amended to cover the PCC use case.
Escape Hatch #4 – SCMI + FFH = 💖?
We can do better than raw PCC. Arm already has an expansive adaptation of PCC – the System Control and Management Interface (SCMI), which covers exactly the use case we are after – device control and power management. SCMI is a higher-level protocol with a concrete command set.
SCMI is pretty advanced, and even supports asynchronous (delayed response) commands.
If the doorbell is SMC or HVC based, it should follow the SMC Calling Convention [SMCCC]. The doorbell needs to provide the identifier of the Shared Memory area that contains the payload. The Shared Memory area containing the payload is updated with the SCMI return response when the call returns. The identifier of the Shared Memory area should be 32-bits and each identifier should map to a distinct Shared Memory area.
While SCMI supports SMC as a doorbell according to the spec, the details are unfortunately left out. Presumably SMC must be exposed via FFH, yet the Arm FFH specification doesn’t currently cover this scenario.
SCMI could be a reasonable interface for a Trusted Firmware-based SCP. SCMI supports a lot of the interfaces one would otherwise implement in an arbitrary fashion, such as sensors, clocks and power and reset control. And not just from AML – as a well defined specification, SCMI could be a generic (fallback?) implementation for OS drivers (GPIO, PCIe, I2C, etc).
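As a concrete illustration, an SCMI command is just a 32-bit header plus payload written into the shared memory area before the doorbell rings. A hypothetical encoder (field layout per the Arm SCMI specification):

```c
#include <stdint.h>

/* SCMI message header, per the Arm SCMI specification:
 *   bits 7:0   : message ID
 *   bits 9:8   : message type (0 = command)
 *   bits 17:10 : protocol ID (0x10 = base, 0x14 = clock, 0x15 = sensor...)
 *   bits 27:18 : token (sequence number chosen by the agent)
 */
static uint32_t scmi_msg_header(uint32_t msg_id, uint32_t msg_type,
                                uint32_t protocol_id, uint32_t token)
{
    return (msg_id & 0xffu) | ((msg_type & 0x3u) << 8) |
           ((protocol_id & 0xffu) << 10) | ((token & 0x3ffu) << 18);
}
```

A PROTOCOL_VERSION command (message ID 0, type 0) to the base protocol (ID 0x10) with token 0 thus encodes to 0x4000; the platform’s response lands back in the same shared memory area when the doorbell call returns.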
There are a few areas for improvement:
Add commands for pin and GPIO control.
Add commands to abstract embedded controller I/O (I2C, SPI, SMBUS).
Add commands to abstract PCIe CAM for systems without standards-compliant ECAM.
Add a purely SMC-based FastChannel transport, for areas where asynchronous support is irrelevant and where latency is key, like PCIe CAM, pin control or GPIO.
We looked at a number of schemes, but the most fully baked option appears to be exposing a Trusted Firmware-based SCP via SCMI with an SMC doorbell, although the Arm FFH support for SMC-based SCMI doorbells still needs to be figured out, and a few crucial categories of interfaces are missing, such as pin control and GPIO.
Of course, don’t forget that the Raspberry Pi 4, for example, already has its GPU/VPU acting as the System Control Processor, with its own SCMI-like mailbox. Could that be replaced with SCMI, avoiding a TF-A-based SCMI interface entirely?
Finally, firmware-based SCPs aren’t just for Edge or Client devices. Even on servers, systems today can rely on GPIO-signaled Events instead of Interrupt-signaled Events. GPIO-signaled system events require a vendor-specific GPIO driver. Thus for servers, SCMI could mean never having to worry about vendor-specific GPIO drivers.
Based on feedback in the Discord channel (thanks Samer!) I reworked the status page to scale better across future releases and provide more information in an easier-to-consume fashion.
Not only does the new version aggregate all of the release notes, it also provides an easy way to locate firmware features, development status, OS support and standards compliance reports for every new release. All of the OS and firmware status reports will be per release, meaning you will be able to reference old reports if needed.
I’ve added a new link to the menu – a status page, providing a fairly detailed view of things working or broken in the current release. It probably needs to be split into more pages to separate the UEFI implementation from operating system-visible aspects like SMBIOS and ACPI, but it’s a place to start.
It may not look like much, but it provides a radically improved networking experience, courtesy of a heads up by Jared and his recent work on the NetBSD GENET driver.
Of course, what use is improved GENET via ACPI when there is barely any OS support? Well, courtesy of the amazing work done by Pete and Jeremy Linton, the Linux ACPI patch for GENET has been merged into net-next.
NetBSD’s amazing Jared McNeill, who appears to crank out Arm platform support code for NetBSD at an inhuman rate, has coded up a driver for the on-board gigabit NIC (aka GENET).
While a great milestone for NetBSD, this is also the world’s first BSD-licensed implementation of a GENET driver. For our UEFI development effort, this finally means being able to implement a proper UEFI driver for the on-board NIC for PXE booting, iSCSI…you name it.
The NetBSD driver already supports the ACPI bindings for GENET, which first appeared in our 1.1 release, and its development is providing great feedback for further evolving the ACPI support. See, the MAC address is not stored in the NIC itself, but comes from the outside (via the mailbox interface, I’m guessing from OTP). You can hypothetically read it back from the NIC, but only if the NIC has been taken out of reset and the MAC has already been programmed.

NetBSD today can boot three ways on the Pi 4: TianoCore UEFI, U-Boot, and “straight up” via config.txt. When booting via UEFI, the NIC is taken out of reset and the MAC is programmed; in the other cases the MAC may not be programmed or the NIC may not be out of reset, making it unsafe to try to read the MAC address back, so a more reliable mechanism is needed. This probably means a local-mac-address _DSD property is in order for best compatibility. Having to fall back to the VPU mailbox interface in ACPI mode is a no-go: that would amount to Pi-specific platform knowledge and would definitely not be SBBR. Another angle to consider is operating systems performing a fast reboot (aka kexec on Linux): it would be totally unexpected for the MAC address to change across a kexec, which is another reason for persisting it via an ACPI property.
Stepping back, I want to extend a huge thanks to Jared, both for his feedback and for his work on supporting our firmware. NetBSD today is the most advanced OS to boot on the Pi 4B the SBBR way: networking, xHCI, 4GB boards, SD card, etc. Once we get the new SDHCI controller (MMC2) described in ACPI and working, this should also bring in Wi-Fi. Jared reports that the existing Arasan driver could be sufficient to support MMC2 – that is to say, the old Arasan SDHCI controller’s set of quirks appears to be a direct superset – at least on NetBSD. 🤣
NetBSD also is the only OS today to fully support ACPI _DMA descriptors for describing DMA translations/constraints. This is very important for supporting Pi and Pi-like platforms via straight-up ACPI and without platform DMA quirks. If you like what you’re seeing with NetBSD and Arm support, consider supporting the NetBSD Foundation.
As many of the regular Arm Server community members already know, there is a huge discrepancy between the small, cheap, fun “toys” known as Single Board Computers and the enterprise-grade, standards-compliant Arm Servers meant to be racked and deployed in datacenters. And in the middle exists… not much. There have been some attempts over the years, such as the SoftIron Overdrive 1000, the 96Boards Developer Box, and a few other specialty (read: expensive) adaptations of server parts, but for the most part the relatively cheap, standards-compliant Arm developer machine landscape has been barren.
As a result of this missing puzzle piece, Arm Server adoption has lagged behind industry and Arm’s own projections. Rewinding 5 years, in 2015 Arm predicted they would have a 20% market share in the Datacenter. Now here we are in 2020, and that clearly hasn’t come to fruition.
If the middle-ground is going to remain mostly empty, then we as a community need to focus on a different task: Showing the SBC makers how to build a standards-compliant system, and steer them towards making Arm boards that “just work” with any OS, boot in a fashion similar to what developers already know and understand (UEFI / ACPI), and act like an x86 machine. It shouldn’t take an Arm Engineer to boot an Arm board.
Then, once the SBC makers are nudged in this direction, two things can occur. First, more developers can begin building software that is Arm compatible and a first-class citizen on Arm Servers. Second, once individual developers get accustomed to using Arm SBCs for coding, they’ll want something a bit more powerful… creating more demand for that mid-tier Arm “PC” class of device.
And coming full circle, that means, it’s time to make the Raspberry Pi 4 ServerReady, to show people it can be done, and jumpstart this process.
The Pi 3B/3B+ is still a great platform to explore UEFI functionality, and its EBBR-compliant UEFI boot for Linux (that is, with device tree) is still way better than messing with config files. Plus you can do PXE and iSCSI. Isn’t that awesome? Or port Doom to it :-).