Linaro Tech Days talk on Pi UEFI

Originally, Samer and I were going to present the ongoing Raspberry Pi 4B UEFI work at Linaro Connect ’20 in Budapest. Of course, life had its own plans, but fortunately Linaro was gracious enough to host a virtual event and invite us to be a part of it.

We briefly talked about ServerReady and the standardization of server parts, mentioned the difficulties seen with current non-server parts like the Pi, described the approach taken, and provided a status update on where the UEFI/ACPI “ServerReady”-like firmware is today. We left the audience with a call to action: join the community effort, and help beyond the Pi with other devices, such as Rockchip- and NVIDIA-based platforms. We had some good questions from the audience, too.

The full video presentation is available on YouTube, and slides are over at LTD20 207 Making Pi ServerReady.

v1.7 – yes, only a day after v1.6.

I hadn’t even finished writing up the 1.6-related artifacts when Pete pushed the button on the new release. Huge thanks both to Pete for getting this out there so soon and to Ard Biesheuvel for reviewing and approving the edk2-platforms fixes.

What’s inside this release?

The ACPI fixes will mean improved hardware support in OSes, although today that mostly means improved NetBSD support. We definitely need some volunteers to help with enabling Pi 4 support with ACPI in upstream Linux – see the issue tracker.

Big shout-out to Jeremy Linton for the PPTT implementation (Processor Properties Topology Table) – a new ACPI 6.3 table describing the relation between CPUs and caches. Yes, that’s not a typo – our PPTT is the 2nd revision variant introduced in ACPI 6.3, whereas PPTT was first introduced in 6.2.

Note: the matching Pi 3 UEFI release (v1.19) combines the fixes for the Pi 4 v1.6 and v1.7 releases.

As always, read the release notes and usual caveats.

v1.6 out

A few more goodies…

What’s inside this release?

Biggest change here is that your Pi will now boot at the default (Pi Foundation-recommended) frequency, instead of the 600MHz minimum. One less configuration option to change on every upgrade!

IMPORTANT: HTTP(S) boot, like PXE and iSCSI, will not currently work with the internal network card, because the GENET driver has not been upstreamed yet. This means you need a supported USB interface (Ax88772b) to use this feature.

As always, read the release notes and usual caveats.

More GENET work

Our UEFI firmware already supports certain USB-based network adapters for network booting (or other purposes), but that requires an additional adapter, which becomes even more awkward if you want to use a PoE HAT. NetBSD Arm platform guru Jared McNeill has been working on something that Pi 4 UEFI users are going to find pretty cool. Dongle-less PXE and iSCSI are coming to a Pi 4 near you, because we’re getting native support for GENET networking soon! 😍🔥

This is being implemented as a Simple Network Protocol driver, so it will be usable by any UEFI driver or application.

v1.5 release available

A quick follow-up to v1.4, this release brings some important fixes.

Don’t stop me now (’cause I’m having a good time)
Don’t stop me now (yes, I’m havin’ a good time)
I don’t want to stop at all

Queen – Jazz

What’s inside this release?

This release mostly improves on the logic added in the v1.4 release for switching between 3GB/4GB modes on the 4GB Pi 4.

Now, when you have 3GiB mode selected (which is the default), you will see 3GiB being reflected in the UEFI setup screen.

The fix for external .dtb is worth diving into. Many of our readers will know that Raspberries traditionally boot operating systems with Device Tree, instead of ACPI. A dated overview can be found on the official Pi site. Long story short, if you want to boot a 64-bit Linux, NetBSD or FreeBSD today on the Pi with full I/O support, you still need the Device Tree that the VPU firmware prepares based on your config.txt settings (dtparams, overlays, etc).

The Pi 4 support for Device Tree is exactly like the Pi 3 support. Originally, UEFI relied on the Device Tree being placed in RAM after the UEFI image itself, basically overlaying a section of the UEFI image. Recently, new VideoCore firmware broke this approach (which I really shouldn’t have come up with in the first place!) by switching the load order: the Device Tree is now loaded before the UEFI image, and since the two images overlap, the Device Tree blob was getting overwritten. You can read more about it here and here. Anyway, it’s fixed now. The same fix still needs to be made to the Pi 3 build.

If you want to boot an OS using Device Tree, don’t forget to enable it. By default the Pi 4 will boot in ACPI mode.

Be mindful that the fix involved changing the load address for the Device Tree in a way that wouldn’t overlap the UEFI image. The new values are:

  • device_tree_address=0x1f0000
  • device_tree_end=0x200000
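To get a feel for how much room that leaves, here is a minimal Python sketch that computes the window size and checks whether a given .dtb would fit. It assumes device_tree_end is the exclusive upper bound of the window, and the helper names (dtb_window_size, dtb_fits) are made up for illustration. The only real format detail it relies on is the flattened device tree header, which starts with a big-endian magic (0xd00dfeed) followed by a big-endian totalsize.

```python
import struct

FDT_MAGIC = 0xD00DFEED  # big-endian magic at the start of a flattened device tree

# Window taken from the config.txt values listed above.
DEVICE_TREE_ADDRESS = 0x1F0000
DEVICE_TREE_END = 0x200000  # assumed to be an exclusive upper bound


def dtb_window_size(start=DEVICE_TREE_ADDRESS, end=DEVICE_TREE_END):
    """Bytes available between device_tree_address and device_tree_end."""
    return end - start


def dtb_fits(blob, start=DEVICE_TREE_ADDRESS, end=DEVICE_TREE_END):
    """Return True if blob looks like a DTB whose declared totalsize fits the window."""
    if len(blob) < 8:
        return False
    magic, totalsize = struct.unpack_from(">II", blob, 0)
    return magic == FDT_MAGIC and totalsize <= end - start


# A fake 40-byte blob with only the two header fields filled in.
fake_dtb = struct.pack(">II", FDT_MAGIC, 40) + bytes(32)
print(hex(dtb_window_size()))  # 0x10000, i.e. 64 KiB
print(dtb_fits(fake_dtb))      # True
```

In other words, the window is 64 KiB, so a Device Tree that grows past that (lots of overlays, say) would need different values.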

While looking at the above regression, we were also able to regain 2MiB of memory, as the Trusted Firmware footprint for Pi 4 is much smaller than on the Pi 3.

As always, read the release notes and usual caveats.

v1.4 release already out

Pete’s on a roll! Some significant improvements have landed in the latest and greatest release.

What’s inside this release?

Assuming you have a 4GiB Pi, you can now boot with the entire 4GiB available if you run NetBSD-current (generic 64-bit image), which already supports the ACPI interfaces required to make the Pi USB3 controller work with the full 4GiB RAM. Those living on the edge can try this Linux patch.

Note: upstreaming the Linux patch would be a great way to help this project.

Additionally, there have been some improvements to the setup option layout:

  • Reorder forms in the order they are most likely to be queried.
  • Rename Chipset Configuration, making CPU settings more prominent.
  • New Advanced Configuration. 3GB limit setting is grayed out on 1GB/2GB boards.
  • Grouping all SD/MMC settings together.

As always, read the release notes and usual caveats.

Abstracting SoC hardware initialization.

How to best support Edge with standards?

One of the gaps between ServerReady and existing “Edge” SoCs is the latter’s general reliance on OS-coordinated device initialization and power state management. A typical BSP or port of an OS would involve clock source, GPIO, PoR and pin control/multiplexing drivers as prerequisites for any embedded devices, and usually I2C, SPI and voltage regulator drivers to be able to do anything “interesting” such as accessing sensors, doing storage I/O or driving graphics outputs.

While firmware could hypothetically pre-initialize everything into a working state, this presents a dilemma for devices meant to operate at the lowest possible power setting. Pre-initialization also means no device reconfiguration after OS boot.

ACPI has somewhat adapted to this space, but has not abstracted away enough of the gory details. While ACPI does tidy up some platform device configuration via its interpreted AML byte code methods, it only appears to be marginally better than device tree for non-server systems. For example, ACPI’s notion of Generic Serial Bus (I2C, SPI, UART) and GPIO OpRegions lets device AML methods perform I/O without resorting to bit-banging MMIO addresses, but requires host operating system drivers to provide the underlying implementation. There are ACPI resource descriptors for tracking and describing GPIOs and pin configuration for devices, but these are again completely useless without appropriate drivers.

Great, so this basically reduces ACPI on non-server systems to obscure machine code coordinating a bunch of OS drivers, sourced from silicon providers and platform integrators. But maybe we can throw all these new vendor-specific OpRegions away, for compatibility’s sake, and code like it’s ACPI 4.0a?

Is writing AML really feasible?

Surely, device- and platform-specific configuration logic can be neatly limited to AML methods?

Well, that’s highly overrated.

A time set method for an ACPI Time and Alarm (TAD) device.

Above is an excerpt of a TAD device for a memory-mapped RTC. It’s a good warning against doing anything beyond basic I/O and arithmetic in AML. Considering how straightforward RTCs are as a device class, this bit of code (roughly 1/4 of all AML required) is unmaintainable and without comments would be completely incomprehensible. Yes, that’s a busy-wait in there for 5 seconds, waiting on a completion from the device, but it might as well be looping forever. Note that AML methods in most implementations run with a global interpreter lock (which also means AML code is not reentrant – an OpRegion cannot be backed by AML).

AML is machine code. Really slow and limited machine code. ASL (the source compiled into AML) is about as expressive as 8048 assembly.

How about translating the following real-life SBC support code to AML? On Raspberry Pi the Arm cores don’t have access to device control blocks and must request the VideoCore VPU (GPU) processor to act on its behalf via special “mailbox” requests in shared RAM. E.g.:

Excerpt of a mailbox communication routine for Raspberry Pi from TF-A

First of all, this is doing DMA…so for an AML implementation, you’ll need a chunk of memory. You can’t use an actual AML buffer for this, so you’ll have to carve some physical memory out in the UEFI memory map, mark it with the right memory attributes matching your coherency requirements, set up a SystemMemory OpRegion…

Oh, I’m having so much fun!

Well, Barbie, we are just getting started...

As you can see, the routine is a mix of MMIO and CPU operations, e.g. data cache cleaning and invalidation. This is problematic for compiling to AML, which doesn’t include any cache operations. You could do away with cache operations entirely by carving out a non-cache coherent memory chunk. Or perhaps you’re lucky and your device supports cache coherent DMA… Well, then it would have to look like this:

Excerpt of a mailbox communication routine for Raspberry Pi from UEFI

No more cache operations! But now we have barriers. AML doesn’t have memory barriers. Maybe you can play fast and loose? Well, expect problems when you rev up the cores to something a bit more OoO. Also, consider that an operating system’s AML interpreter is probably getting scheduled across multiple CPUs while running your method…

So, no AML for anything involving DMA. What a mess…

No one wants to write AML

Considering how problematic it is to write AML, Microsoft introduced their own notion of Platform Extension Plugins (PEPs) to stand in for entire AML methods.

PEPs are intended to be used for off-SoC power management methods. Since they are installable binaries, they can be updated on-the-fly as opposed to ACPI firmware which requires a firmware flash. … Power management was the original intent for PEPs, but they can be used to provide or override any arbitrary ACPI runtime method.

Providing power management using PEPs can be much easier to debug than code written for the ACPI firmware. …

PEPs can obscure any method and provide methods you didn’t know were necessary. The AML code now just provides a “skeleton” and some dummy method implementations. This makes ACPI system descriptions relying on PEPs completely useless, since PEPs are opaque platform knowledge, sourced from vendors and completely undocumented.

PEPs play no role in the construction of the ACPI namespace hierarchy because the namespace hierarchy must be provided in the firmware DSDT. When the ACPI driver evaluates a method at runtime, it will check against the PEP’s implemented methods for the device in question, and, if present, it will execute the PEP and ignore the firmware’s version. However, the device itself must be defined in the firmware.

Even worse, you can’t meaningfully tell which AML methods would be overridden by a PEP, or what kind of configuration data the PEP was meant to source. PEPs are even worse than device tree, because PEPs can fully hide configuration data that would otherwise be reported as properties through device tree.
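To make the shadowing problem concrete, here is a toy Python model of the lookup order described in the quote above. It is emphatically not how Windows implements it; the evaluate function, the GPU0 device and the XCLK method are all hypothetical names invented for this sketch.

```python
def evaluate(namespace, pep, device, method):
    """Toy model of the quoted lookup order (not Windows' actual code).

    namespace: {device: {method: callable}} as declared by the firmware DSDT.
    pep:       {(device, method): callable} supplied by the OS-side plugin.
    """
    if device not in namespace:
        # A PEP cannot create devices; they must exist in the firmware namespace.
        raise LookupError(f"{device} is not defined in the DSDT")
    override = pep.get((device, method))
    if override is not None:
        return override()  # the firmware's version, if any, is ignored
    return namespace[device][method]()


# Hypothetical firmware namespace: a skeleton with a dummy _PS0.
dsdt = {"GPU0": {"_PS0": lambda: "firmware stub", "_STA": lambda: 0xF}}

# Hypothetical plugin: overrides _PS0 and adds a method the DSDT never mentions.
plugin = {
    ("GPU0", "_PS0"): lambda: "vendor PEP logic",
    ("GPU0", "XCLK"): lambda: "method absent from the DSDT",
}

print(evaluate(dsdt, plugin, "GPU0", "_PS0"))  # vendor PEP logic
print(evaluate(dsdt, plugin, "GPU0", "_STA"))  # 15
print(evaluate(dsdt, plugin, "GPU0", "XCLK"))  # method absent from the DSDT
```

Reading the dsdt dictionary alone tells you nothing about what _PS0 really does, or that XCLK exists at all, which is exactly the situation someone disassembling the DSDT of a PEP-based machine is in.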

This is a good explanation for why none of the Snapdragon laptops today can boot Linux using ACPI. And not just Linux – the HP Envy x2 cannot boot a “stock” Windows 10 image without Snapdragon customizations.

PEPs are not an answer.

Let’s put the BSP in TF-A

We know we don’t want to write in AML for a good number of reasons. In a few situations it is literally impossible to write safe and functional AML. We also want to avoid writing drivers for things OS vendors (and their customers) really don’t care about, like pin controllers and clock source management. And PEPs are not a standard interface, and their implementation is OS vendor specific.

It would appear that the best place to hide low-level platform drivers for device initialization and power state management is Trusted Firmware-A. TF-A is an industry-adopted TrustZone firmware usually used to implement PSCI or as a foundation for a Trusted Execution Environment (TEE). It’s a rich enough environment to contain complex code written in a high-level language. Also, TF-A likely already includes some of the component drivers. This way, TF-A becomes a software-based System Control Processor (SCP).

Conceptually, it’s not a bad fit. TF-A already provides services that abstract underlying I/O and SoC control facilities for a general-purpose OS. For example, the Power State Coordination Interface is an abstraction over system state (power-off, reboot) and CPU state (secondary CPU start up) control. As another example, the Software Delegated Exception Interface can be used to abstract NMI support and implement firmware-first system error handling (RAS). Moreover, many vendors already use private calls between UEFI (or U-Boot) and TF-A firmware, for similar reasons.

If the platform BSP entrails are squirreled away in TF-A, how would an OS interact with them? Via ACPI AML methods of course. Regardless of how the ACPI interface works, the actual calls to a firmware-based SCP would be via the well-standardized SMCCC specification.

Trusted Firmware-based SCP calls are vendor-specific, and are part of SiP or OEM service calls.
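For readers who haven’t stared at SMCCC function identifiers before, here is a small Python sketch of how the 32-bit ID is put together: bit 31 selects a fast call, bit 30 selects the SMC64 convention, bits 29:24 hold the owning entity (2 is SiP, 3 is OEM, 4 is Standard Secure services such as PSCI) and bits 15:0 hold the function number. The encode/decode helpers are my own naming, and the OEM function number 0x1234 is just an illustrative value, not a real service.

```python
from collections import namedtuple

SmcccId = namedtuple("SmcccId", "fast smc64 owner function")

# Owning Entity Numbers from the SMC Calling Convention (subset).
OWNERS = {0: "Arm Architecture", 1: "CPU", 2: "SiP", 3: "OEM", 4: "Standard Secure"}


def encode(fast, smc64, owner, function):
    """Build a 32-bit SMCCC function identifier from its fields."""
    return (
        (int(fast) << 31)
        | (int(smc64) << 30)
        | ((owner & 0x3F) << 24)
        | (function & 0xFFFF)
    )


def decode(fid):
    """Split a 32-bit SMCCC function identifier into its fields."""
    return SmcccId(
        fast=bool((fid >> 31) & 1),
        smc64=bool((fid >> 30) & 1),
        owner=(fid >> 24) & 0x3F,
        function=fid & 0xFFFF,
    )


# A made-up OEM fast call using the 64-bit convention.
print(hex(encode(True, True, 3, 0x1234)))  # 0xc3001234

# PSCI_VERSION (0x84000000) is a 32-bit fast call owned by Standard Secure services.
psci_version = decode(0x84000000)
print(OWNERS[psci_version.owner], psci_version.smc64)  # Standard Secure False
```

A firmware-based SCP would live in the SiP or OEM ranges, so a dispatcher in TF-A only has to look at the owner field to route the call.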

Isn’t putting stuff in TF-A bad? It’s not ideal, but putting it into AML or OS vendor-specific drivers is much worse. Other platforms such as IA-64 (SAL) and OpenPower (OPAL) rely on firmware interfaces to abstract some platform I/O and implementation-specific details.

Don’t like blobs? Upstream and open-source your TF-A, like everybody else.

Isn’t that cycle stealing? No, because we’re talking about operations done on behalf of the operating system requesting them.

Note: It’s not just generic off-the-shelf operating systems that would win from TF-A abstracting common SoC hardware and control interfaces. UEFI firmware itself could make good use of these, reducing the development and support effort for all platforms, and removing similar or duplicated functionality, which further reduces code size and bug counts.

But what about UEFI runtime services?

Runtime services are meant to abstract parts of the hardware implementation of the platform from the OS, but the interface is fairly limited in scope today. Here’s the current set of calls:

GetTime(): Returns the current time, time context, and time keeping capabilities.
SetTime(): Sets the current time and time context.
GetWakeupTime(): Returns the current wakeup alarm settings.
SetWakeupTime(): Sets the current wakeup alarm settings.
GetVariable(): Returns the value of a named variable.
GetNextVariableName(): Enumerates variable names.
SetVariable(): Sets, and if needed creates, a variable.
SetVirtualAddressMap(): Switches all runtime functions from physical to virtual addressing.
GetNextHighMonotonicCount(): Subsumes the platform’s monotonic counter functionality.
ResetSystem(): Resets all processors and devices and reboots the system.
UpdateCapsule(): Passes capsules to the firmware with both virtual and physical mapping.
QueryVariableInfo(): Returns information about the EFI variable store.

But what if this list could be extended? Instead of moving the BSP into TF-A, let’s make it all a UEFI runtime service, and figure out how to perform RT calls from AML.

That’s not a very good idea:

  1. RT is fragile, as services share the same privilege level as the calling operating system. Differences in the way different operating systems call services are a constant source of bugs across vendor implementations, e.g. flat addressing or translated, interrupts enabled or disabled, UART enabled or disabled.
  2. RT requires an environment – memory ranges must be correctly mapped, enough stack, disabled FP traps, etc. A single SMC instruction for trapping to TF-A is hard to beat.
  3. RT is fragile, as runtime services are provided by the same drivers that provide boot time services in most UEFI implementations, such as Tiano. There’s no meaningful isolation between an RT driver and its (and other) BS components. The firmware programming model is just bad. It is extremely easy to make a seemingly benign change (new global variable, logging statement, driver dependency) that will break RT support for some users, but be very difficult to track down.
  4. Limited facilities with no support for asynchronous implementation, e.g. can’t take an exception on behalf of a service, while in OS. This may make some hardware hard to expose efficiently or mean that certain workarounds are impossible, e.g. a device quirk that relies on handling external aborts on a system with firmware-first error handling.
  5. Simply revising the UEFI specification with new RT services won’t do anything for existing code bases. Retrofitting will be a significant effort. In contrast, TF-A is a simpler code base that is easier to swap out.
  6. UEFI implementations have a poor track record of being open-sourced by firmware vendors. TF-A implementations are done by silicon providers themselves, and have a better track record of being open source and audit-able.

Tying SMCCC and ACPI together

We need a generic escape hatch mechanism from ACPI to the operating system, to be able to easily perform arbitrary SMCCC calls from AML device methods.

Escape Hatch #1 – FFH

The Functional Fixed Hardware (FFH) OpRegion type seems like a good fit (ACPI specification, page 114).

Unfortunately, the ACPI specification gives no examples of FFH usage for anything outside of Register resource descriptors, as part of processor Low Power Idle States (_LPI) support.

I never saw any examples of OpRegions being declared with type FFixedHW, but the ACPICA compiler (iasl) didn’t barf at this quick draft:

DefinitionBlock ("test.aml", "SSDT", 5, "FOO", "BAR", 6)
{
   Device (F000) {
      Name (_HID, "FOO1234")
      Method (_INI, 0, Serialized) {
         OperationRegion (VNCL, FFixedHW, 0xbeef0000, 0x8)
         Field (VNCL, BufferAcc, Lock, Preserve) {
            Offset (0),  AccessAs(BufferAcc, AttribRawBytes (0x58)),
            SMCC, 8,
         }

         // SMC call exchange buffer: 11 QWORD fields = 0x58 bytes
         Name (BUFF, Buffer(0x58){})
         CreateQWordField(BUFF, 0x0,  CALL) // Function identifier w0
         CreateQWordField(BUFF, 0x8,  AR1)  // Argument x1
         CreateQWordField(BUFF, 0x10, AR2)  // Argument x2
         CreateQWordField(BUFF, 0x18, AR3)  // Argument x3
         CreateQWordField(BUFF, 0x20, AR4)  // Argument x4
         CreateQWordField(BUFF, 0x28, AR5)  // Argument x5
         CreateQWordField(BUFF, 0x30, AR6)  // Argument x6
         CreateQWordField(BUFF, 0x38, RET0) // Result x0
         CreateQWordField(BUFF, 0x40, RET1) // Result x1
         CreateQWordField(BUFF, 0x48, RET2) // Result x2
         CreateQWordField(BUFF, 0x50, RET3) // Result x3

         CALL = 0xC3001234 // OEM call 0x1234
         AR1 = 0x1         // Some parameter for call
         SMCC = BUFF       // Invoke!
         If (RET0 == 0xffffffffffffffff) {
            // Failure.
            // ...
         }

         // Success
      }
   }
}

Encouraging, if messy. This could be an amendment to the Arm Functional Fixed Hardware Specification.

Escape Hatch #2 – OS method

The problem with Hatch #1 is the large number of changes it requires to operating system ACPI support. Additionally, the syntax is obtuse and doesn’t fit the semantics of a method invocation well, and the creation of OpRegions and buffers has additional overheads.

To borrow a page from the PEP book, the OS ACPI interpreter could provide a method to perform SMCCC calls. Unlike PEPs, this could be a well defined interface. Its presence could be negotiated using the standard _OSC (OS Capabilities) mechanism.

// OS ACPI interpreter implements OSMC.
//    Params: fn and arguments x1-x6,
//    Returns: Package{4} containing return values x0-x3.
External (OSMC, MethodObj, PkgObj, { IntObj, IntObj, IntObj, IntObj, IntObj, IntObj, IntObj })

Name (RETV, Package(4){})
RETV = OSMC(0xC3001234, 0x1, 0, 0, 0, 0, 0)
If (DeRefOf (RETV[0]) == 0xffffffffffffffff) {
   // SMCCC call failed.
   // ...
}

Escape Hatch #3 – PCC (and FFH, again).

The Platform Communication Channel (PCC) is a generic mechanism for OSPM to communicate with an entity in the platform, such as a BMC or an SCP (System Control Processor). PCC relies on a shared memory region and a doorbell register. Starting with ACPI 6.0, PCC also became a supported OpRegion type. Assuming the doorbell can be wired to an SMC, PCC could be used to communicate with a Trusted Firmware-based SCP.

PCC is a higher level protocol than invoking random SMC calls from ACPI or OS directly. The commands/messages all go via structured shared memory. There would be only one SMCCC call used – for the doorbell itself.

A PCC doorbell can be an FFH register access.

To use PCC with SMC via FFH, the Arm FFH specification would need to be amended to cover the PCC use case.
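To illustrate the shape of such an exchange, here is a rough Python simulation of a PCC-style round trip. The header layout follows the generic PCC shared memory region (a 32-bit signature of 0x50434300 OR'd with the subspace ID, a 16-bit command and a 16-bit status with bit 0 meaning Command Complete and bit 2 meaning Error). Everything else is an assumption for the sake of the sketch: the doorbell is a plain Python callback standing in for the single SMC, the call is treated as synchronous, and fake_platform with its "command 1 increments a 32-bit value" behaviour is entirely made up.

```python
import struct

PCC_SIGNATURE_BASE = 0x50434300  # OR'd with the subspace ID
STATUS_COMMAND_COMPLETE = 1 << 0
STATUS_ERROR = 1 << 2
HEADER = struct.Struct("<IHH")   # signature, command, status


def make_region(subspace_id, size=64):
    """Shared memory with an idle channel (Command Complete set)."""
    mem = bytearray(size)
    HEADER.pack_into(mem, 0, PCC_SIGNATURE_BASE | subspace_id, 0,
                     STATUS_COMMAND_COMPLETE)
    return mem


def os_send(mem, command, payload, ring_doorbell):
    """OS side: place a command in shared memory, ring, read the response."""
    signature, _, status = HEADER.unpack_from(mem, 0)
    if not status & STATUS_COMMAND_COMPLETE:
        raise RuntimeError("channel busy")
    if len(payload) > len(mem) - HEADER.size:
        raise ValueError("payload too large")
    HEADER.pack_into(mem, 0, signature, command, 0)  # clears Command Complete
    mem[HEADER.size:HEADER.size + len(payload)] = payload
    ring_doorbell()  # stand-in for the one SMC; synchronous in this model
    _, _, status = HEADER.unpack_from(mem, 0)
    if status & STATUS_ERROR or not status & STATUS_COMMAND_COMPLETE:
        raise RuntimeError("command failed")
    return bytes(mem[HEADER.size:HEADER.size + len(payload)])


def fake_platform(mem):
    """Hypothetical firmware side: command 1 increments a 32-bit value in place."""
    signature, command, _ = HEADER.unpack_from(mem, 0)
    status = STATUS_COMMAND_COMPLETE
    if command == 1:
        (value,) = struct.unpack_from("<I", mem, HEADER.size)
        struct.pack_into("<I", mem, HEADER.size, (value + 1) & 0xFFFFFFFF)
    else:
        status |= STATUS_ERROR
    HEADER.pack_into(mem, 0, signature, command, status)


region = make_region(subspace_id=0)
reply = os_send(region, 1, struct.pack("<I", 41), lambda: fake_platform(region))
print(struct.unpack("<I", reply)[0])  # 42
```

Note how the only thing crossing the OS/firmware boundary is the doorbell; all of the actual command and response data sits in the shared region.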

Escape Hatch #4 – SCMI + FFH = 💖?

We can do better than raw PCC. Arm already has an expansive adaptation of PCC – the System Control and Management Interface (SCMI), which covers exactly the use case we are after – device control and power management. SCMI is a higher-level protocol with a concrete command set.

Some of what SCMI can do.

SCMI is pretty advanced, and even supports asynchronous (delayed response) commands.

If the doorbell is SMC or HVC based, it should follow the SMC Calling Convention [SMCCC]. The doorbell needs to provide the identifier of the Shared Memory area that contains the payload. The Shared Memory area containing the payload is updated with the SCMI return response when the call returns. The identifier of the Shared Memory area should be 32-bits and each identifier should map to a distinct Shared Memory area.

While SCMI supports SMC as a doorbell according to the spec, the details are unfortunately left out. Presumably SMC must be exposed via FFH, yet the Arm FFH specification doesn’t currently cover this scenario.

SCMI could be a reasonable interface for a Trusted Firmware-based SCP. SCMI supports a lot of the interfaces one would otherwise implement in an arbitrary fashion, such as sensors, clocks and power and reset control. And not just from AML – as a well defined specification, SCMI could be a generic (fallback?) implementation for OS drivers (GPIO, PCIe, I2C, etc).
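As a taste of how regular the protocol is, here is a short Python sketch that packs and unpacks the 32-bit SCMI message header: message ID in bits 7:0, message type in bits 9:8, protocol ID in bits 17:10 and a caller-chosen token in bits 27:18. The protocol IDs listed are the standard ones (base, power domain, system power, performance, clock, sensor, reset), and message 0x0 is PROTOCOL_VERSION in every protocol. The function names are mine, and this only covers the header, not the shared memory layout or any transport.

```python
from collections import namedtuple

ScmiHeader = namedtuple("ScmiHeader", "protocol_id message_id message_type token")

# Standard SCMI protocol identifiers.
PROTOCOLS = {
    "base": 0x10, "power_domain": 0x11, "system_power": 0x12,
    "performance": 0x13, "clock": 0x14, "sensor": 0x15, "reset": 0x16,
}

PROTOCOL_VERSION = 0x0    # message ID shared by all protocols
MSG_TYPE_COMMAND = 0
MSG_TYPE_DELAYED_RESPONSE = 2
MSG_TYPE_NOTIFICATION = 3


def scmi_header(protocol_id, message_id, token=0, message_type=MSG_TYPE_COMMAND):
    """Pack the fields into a 32-bit SCMI message header."""
    return (
        ((token & 0x3FF) << 18)
        | ((protocol_id & 0xFF) << 10)
        | ((message_type & 0x3) << 8)
        | (message_id & 0xFF)
    )


def parse_header(header):
    """Unpack a 32-bit SCMI message header into its fields."""
    return ScmiHeader(
        protocol_id=(header >> 10) & 0xFF,
        message_id=header & 0xFF,
        message_type=(header >> 8) & 0x3,
        token=(header >> 18) & 0x3FF,
    )


# Ask the clock protocol for its version, tagging the request with token 1.
header = scmi_header(PROTOCOLS["clock"], PROTOCOL_VERSION, token=1)
print(hex(header))           # 0x45000
print(parse_header(header))
```

The token is what lets a caller match a delayed response back to the command that triggered it, which is where the asynchronous support mentioned earlier comes from.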

There are a few areas for improvement:

  • Add commands for pin and GPIO control.
  • Add commands to abstract embedded controller I/O (I2C, SPI, SMBUS).
  • Add commands to abstract PCIe CAM for systems without standards-compliant ECAM.
  • Add a purely SMC-based FastChannel transport, for areas where asynchronous support is irrelevant and where latency is key, like PCIe CAM, pin control or GPIO.

Parting thoughts

We looked at a number of schemes, but the most baked-through appears to be exposing a Trusted Firmware-based SCP via SCMI and an SMC doorbell, although the Arm FFH support for SMC-based SCMI doorbells still needs to be figured out, and there are a few crucial categories of interfaces missing, such as pin control and GPIO.

Of course, don’t forget that the Raspberry Pi 4, for example, has its GPU/VPU acting as the System Control Processor with its own SCMI-like mailbox. Could that be replaced with SCMI, avoiding a TF-A based SCMI interface entirely?

Finally, firmware-based SCPs aren’t just for Edge or Client devices. Even on servers, systems today can rely on GPIO-signaled Events instead of Interrupt-signaled Events. GPIO-signaled system events require a vendor-specific GPIO driver. Thus for servers, SCMI could mean never having to worry about vendor-specific GPIO drivers.

v1.3 release

This is a minor follow-up to v1.2 for Pi 4, intended to do some soak testing on the ongoing ACPI clean-up and factorization changes.

What’s inside this release?

As always, read the release notes and usual caveats.

Updates to the status page!

Based on feedback in the Discord channel (thanks Samer!) I reworked the status page to scale better across future releases and provide more information in an easier-to-consume fashion.

Not only does the new version aggregate all of the release notes, but it also provides an easy way to locate firmware features, development status, OS support and standards compliance reports for every new release. All of the OS and firmware status reports will be per release, meaning you will be able to reference old reports if needed.

Finally, the new reports link back to the issue tracker and we have our first set of ACS compliance reports now too.

The updated project status page.