v1.9 release is out

This mostly updates the underlying TF-A firmware, bringing a few important improvements.

https://github.com/pftf/RPi4/releases/tag/v1.9

What’s inside this release?

  • Update to TF-A v2.3 [tianocore/edk2-non-osi@96ec764]
  • Fix an ASSERT being produced with the DEBUG version, related to TFTP functionality [tianocore/edk2-platforms@f09ea1a]

The TF-A changes are worth describing in greater detail.

TF-A is the Arm secure firmware, providing services such as platform power off/reset and secondary CPU manipulation. The improvements to UART detection mean that TF-A firmware, just like UEFI, will honor config.txt selection of the UART (e.g. via overlay, although it’s really done by the VPU firmware). This is more developer oriented, and means not losing logging/initialization messages. Unrelated to Pi 4 itself, it paves the way for proper (transparent) Pi 3 UEFI support for Compute Module variants and that 64-bit variant of the Pi 2 (rev 1.2).

As always, read the release notes and usual caveats. Note that we do not distribute DEBUG builds. The PSCI CPU_OFF implementation allow a host OS to de-initialize individual CPU cores. More concretely, this will allow more of the Arm Architecture Compliance Suite (ACS) to pass.

The matching Pi 3 UEFI release is v1.22.

Abstracting SoC hardware initialization.

How to best support Edge with standards?

One of the gaps between ServerReady and existing “Edge” SoCs is the latter’s general reliance on OS-coordinated device initialization and power state management. A typical BSP or port of an OS would involve clock source, GPIO, PoR and pin control/multiplexing drivers as prerequisites for any embedded devices, and usually I2C, SPI and voltage regulator drivers to be able to do anything “interesting” such as accessing sensors, doing storage I/O or driving graphics outputs.

While firmware could hypothetically pre-initialize everything into a working state, this presents a dilemma for devices meant to operate at the lowest possible power setting. Pre-initialization also means no device reconfiguration after OS boot.

ACPI has somewhat adapted to this space, but not abstracted enough from gory details. While ACPI does tidy up some platform device configuration via its interpreted AML byte code methods, it only appears to be marginally better than device tree for non-server systems. For example, ACPI’s notion of Generic Serial Bus (I2C, SPI, UART) and GPIO OpRegions lets device AML methods perform I/O without resorting to bit-banging MMIO addresses, but requires host operating system drivers to provide the underlying implementation. There are ACPI resource descriptors for tracking and describing GPIOs and pin configuration for devices, but these are again completely useless without appropriate drivers.

Great, so this basically reduces ACPI on non-server systems to obscure machine code coordinating a bunch of OS drivers, sourced from silicon providers and platform integrators. But maybe we can throw all these new vendor-specific OpRegions away, for compatibility’s sake, and code like it’s ACPI 4.0a?

Is writing AML really feasible?

Surely, device- and platform-specific configuration logic can be neatly limited to AML methods?

Well, that’s highly overrated.

A time set method for an ACPI Time and Alarm (TAD) device.

Above is an excerpt of a TAD device for a memory-mapped RTC. It’s a good warning against doing anything beyond basic I/O and arithmetic in AML. Considering how straightforward RTCs are as a device class, this bit of code (roughly 1/4 of all AML required) is unmaintainable and without comments would be completely incomprehensible. Yes, that’s a busy-wait in there for 5 seconds, waiting on a completion from the device, but it might as well be looping forever. Note, that AML methods in most implementations run with a global interpreter lock (which also means AML code is not reentrant – an OpRegion cannot be backed by AML).

AML is machine code. Really slow and limited machine code. ASL (the source compiled into AML) is about as expressive as 8048 assembly.

How about translating the following real-life SBC support code to AML? On Raspberry Pi the Arm cores don’t have access to device control blocks and must request the VideoCore VPU (GPU) processor to act on its behalf via special “mailbox” requests in shared RAM. E.g.:

Excerpt of a mailbox communication routine for Raspberry Pi from TF-A

First of all, this is doing DMA…so for an AML implementation, you’ll need a chunk of memory. You can’t use an actual AML buffer for this, so you’ll have to carve some physical memory out in the UEFI memory map, mark it with the right memory attributes matching your coherency requirements, set up a SystemMemory OpRegion…

Oh, I’m having so much fun!

Well, Barbie, we are just getting started..

As you can see, the routine is a mix of MMIO and CPU operations, e.g. data cache cleaning and invalidation. This is problematic for compiling to AML, which doesn’t include any cache operations. You could do away with cache operations entirely by carving out a non-cache coherent memory chunk. Or perhaps you’re lucky and your device supports cache coherent DMA… Well, then it would have to look like this:

Excerpt of a mailbox communication routine for Raspberry Pi from UEFI

No more cache operations! But now we have barriers. AML doesn’t have memory barriers. Maybe you can play fast and loose? Well, expect problems when you rev up the cores to something a bit more OoO. Also, consider that an operating system’s AML interpreter is probably getting scheduled across multiple CPUs while running your method…

So, no AML for anything involving DMA. What a mess…

No one wants to write AML

Considering how problematic it is to write AML, Microsoft introduced their own notion of Platform Extension Plugins (PEPs) to stand in for entire AML methods.

PEPs are intended to be used for off-SoC power management methods. Since they are installable binaries, they can be updated on-the-fly as opposed to ACPI firmware which requires a firmware flash. … Power management was the original intent for PEPs, but they can be used to provide or override any arbitrary ACPI runtime method.

Providing power management using PEPs can be much easier to debug than code written for the ACPI firmware. …

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-peps-for-acpi-services

PEPs can obscure any method and provide methods you didn’t know were necessary. The AML code now just provides a “skeleton” and some dummy method implementations. This makes ACPI system descriptions relying on PEPs completely useless, since PEPs are opaque platform knowledge, sourced from vendors and completely undocumented.

PEPs play no role in the construction of the ACPI namespace hierarchy because the namespace hierarchy must be provided in the firmware DSDT. When the ACPI driver evaluates a method at runtime, it will check against the PEP’s implemented methods for the device in question, and, if present, it will execute the PEP and ignore the firmware’s version. However, the device itself must be defined in the firmware.

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-peps-for-acpi-services

Even worse, you can’t meaningfully tell which AML methods would be overridden by a PEP, or what kind of configuration data the PEP was meant to source. PEPs are even worse than device tree, because PEPs can fully hide configuration data that would otherwise be reported as properties through device tree.

This is a good explanation for why none of the Snapdragon laptops today can boot Linux using ACPI. And not just Linux – the HP Envy x2 cannot boot a “stock” Windows 10 image without Snapdragon customizations.

PEPs are not an answer.

Let’s put the BSP in TF-A

We know we don’t want to write in AML for a good number of reasons. In a few situations it is literally impossible to write safe and functional AML. We also want to avoid writing drivers for things OS vendors (and their customers) really don’t care about, like pin controllers and clock source management. And PEPs are not a standard interface, and their implementation is OS vendor specific.

It would appear that the best place to hide low-level platform drivers for device initialization and power state management is Trusted Firmware-A. TF-A is an industry-adopted TrustZone firmware usually used to implement PSCI or as a foundation for a Trusted Execution Environment (TEE). It’s a rich enough environment to contain complex code written in a high-level level. Also, TF-A likely already includes some of the component drivers. This way, TF-A becomes a software-based System Control Processor (SCP).

Conceptually, it’s not a bad fit. TF-A already provides services that abstract underlying I/O and SoC control facilities for a general-purpose OS. For example, the Power State Coordination Interface is an abstraction over system state (power-off, reboot) and CPU state (secondary CPU start up) control. As an another example, the Software Delegated Exception Interface can be used to abstract NMI support and implement firmware-first system error handling (RAS). Moreover, many vendors already use private calls between UEFI (or U-Boot) and TF-A firmware, for similar reasons.

If the platform BSP entrails are squirreled away in TF-A, how would an OS interact with them? Via ACPI AML methods of course. Regardless of how the ACPI interface works, the actual calls to a firmware-based SCP would be via the well-standardized SMCCC specification.

Trusted Firmware-based SCP calls are vendor-specific, and are part of SiP or OEM service calls.

Isn’t putting stuff in TF-A bad? It’s not ideal, but putting it into AML or OS vendor-specific drivers is much worse. Other platforms such as IA-64 (SAL) and OpenPower (OPAL) rely on firmware interfaces to abstract some platform I/O and implementation-specific details.

Don’t like blobs? Upstream and open-source your TF-A, like everybody else.

Isn’t that cycle stealing? No, because we’re talking about operations done on behalf of the operating system requesting them.

Note: It’s not just generic off-the-shelf operating systems that would win from TF-A abstracting common SoC hardware and control interfaces. UEFI firmware itself could make good use of these, reducing the development and support effort for all platforms, and removing similar/duplicated functionality further reducing code size and bug counts.

But what about UEFI runtime services?

Runtime services are meant to abstract parts of the hardware implementation of the platform from the OS, but the interface is fairly limited in scope today. Here’s the current set of calls:

NameDescription
GetTime()Returns the current time, time context, and time keeping capabilities.
SetTime()Sets the current time and time context.
GetWakeupTime()Returns the current wakeup alarm settings.
SetWakeupTime() Sets the current wakeup alarm settings.
GetVariable()Returns the value of a named variable.
GetNextVariableName()Enumerates variable names.
SetVariable()Sets, and if needed creates, a variable.
SetVirtualAddressMap()Switches all runtime functions from physical to virtual addressing.
GetNextHighMonotonicCount() Subsumes the platform’s monotonic counter functionality.
ResetSystem() Resets all processors and devices and reboots the system.
UpdateCapsule()Passes capsules to the firmware with both virtual and physical mapping.
QueryVariableInfo()Returns information about the EFI variable store.

But what if this list could be extended? Instead of moving the BSP into TF-A, let’s make it all a UEFI runtime service, and figure out how to perform RT calls from AML.

That’s not a very good idea:

  1. RT is fragile, as services share the same privilege level as the calling operating system. Differences in the way different operating systems call services are a constant source of bugs across vendor implementations, e.g. flat addressing or translated, interrupts enabled or disabled, UART enabled or disabled.
  2. RT requires an environment – memory ranges must be correctly mapped, enough stack, disabled FP traps, etc. A single SMC instruction for trapping to TF-A is hard to beat.
  3. RT is fragile, as runtime services are provided by the same drivers that provide boot time services in most UEFI implementation like Tiano. There’s no meaningful isolation between an RT driver and its (and other) BS components. The firmware programming model is just bad. It is extremely easy to make a seemingly benign change (new giobal variable, logging statement, driver dependency) that will break RT support for some users, but be very difficult to track down.
  4. Limited facilities with no support for asynchronous implementation, e.g. can’t take an exception on behalf of a service, while in OS. This may make some hardware hard to expose efficiently or mean that certain workarounds are impossible, e.g. a device quirk that relies on handling external aborts on a system with firmware-first error handling.
  5. Simply revising the UEFI specification with new RT services won’t do anything for existing code bases. Retrofitting will be a significant effort. In contrast, TF-A is a simpler code base that is easier to swap out.
  6. UEFI implementations have a poor track record of being open-sourced by firmware vendors. TF-A implementations are done by silicon providers themselves, and have a better track record of being open source and audit-able.

Tying SMCCC and ACPI together

We need a generic escape hatch mechanism from ACPI to the operating system, to be able to easily perform arbitrary SMCCC calls from AML device methods.

Escape Hatch #1 – FFH

The Functional Fixed Hardware (FFH) OpRegion type seems like a good fit.

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf, page 114

Unfortunately, the ACPI specification gives no examples of FFH usage for anything outside of Register resource descriptors, as part of processor Low Power Idle States (_LPI) support.

I never saw any examples with OpRegions being declared as of type FFixedHW, but the ACPICA compiler (iasl) didn’t barf at this quick draft:

DefinitionBlock ("test.aml", "SSDT", 5, "FOO", "BAR", 6) 
{
   Device (F000) {
      Name (_HID, "FOO1234")
      Method (_INI, 0, Serialized) {
         OperationRegion (VNCL, FFixedHW, 0xbeef0000, 0x8)
         Field (VNCL, BufferAcc, Lock, Preserve) {
            Offset (0),  AccessAs(BufferAcc, AttribRawBytes (58)),
            SMCC, 8,
         }

         // SMC call exchange buffer
         Name (BUFF, Buffer(58){})
         CreateQWordField(BUFF, 0x0,  CALL) // Function identifier w0
         CreateQWordField(BUFF, 0x8,  AR1)  // Argument x1
         CreateQWordField(BUFF, 0x10, AR2)  // Argument x2
         CreateQWordField(BUFF, 0x18, AR3)  // Argument x3
         CreateQWordField(BUFF, 0x20, AR4)  // Argument x4
         CreateQWordField(BUFF, 0x28, AR5)  // Argument x5
         CreateQWordField(BUFF, 0x30, AR6)  // Argument x6
         CreateQWordField(BUFF, 0x38, RET0) // Result x0
         CreateQWordField(BUFF, 0x40, RET1) // Result x1
         CreateQWordField(BUFF, 0x48, RET2) // Result x2
         CreateQWordField(BUFF, 0x50, RET3) // Result x3

         CALL = 0xC3001234 // OEM call 0x1234
         AR1 = 0x1         // Some parameter for call
         SMCC = BUFF       // Invoke! 
         If (RET0 == 0xffffffffffffffff) {
            // Failure.
            // ...
         }

         // Success
         //...
      }
   }
}

Encouraging, if messy. This could be an amendment to the Arm Functional Fixed Hardware Specification.

Escape Hatch #2 – OS method

The problem with Hatch #1 is the large amount of changes to operating system ACPI support. Additionally, the syntax is obtuse and doesn’t fit well the semantics of a method invocation, and the creation of OpRegions and buffers has additional overheads.

To borrow a page from the PEP book, the OS ACPI interpreter could provide a method to perform SMCCC calls. Unlike PEPs, this could be a well defined interface. Its presence could be negotiated using the standard _OSC (OS Capabilities) mechanism.

//
// OS ACPI interpreter implements OSMC.
//    Params: fn and arguments x1-x6,
//    Returns: Package{4} containing return values x0-x3.
//
External (OSMC, MethodObj, PkgObj, { IntObj, IntObj, IntObj, IntObj, IntObj, IntObj, IntObj })

Name (RETV, Package(4){})
RETV = OSMC(0xC3001234, 0x1, 0, 0, 0, 0, 0)
If (DeRefOf (RETV[0]) == 0xffffffffffffffff) {
   // SMCCC call failed.
   // ...
}

Escape Hatch #3 – PCC (and FFH, again).

The Platform Communication Channel (PCC) is a generic mechanism for OSPM to communicate with an entity in the platform, such as as a BMC, SCP (system control processor). PCC relies on a shared memory region and a doorbell register. Starting ACPI 6.0, PCC also became a supported OpRegion type. Assuming the doorbell can be wired to an SMC, PCC could be used to communicate with a Trusted Firmware-based SCP.

PCC is a higher level protocol than invoking random SMC calls from ACPI or OS directly. The commands/messages all go via structured shared memory. There would be only one SMCCC call used – for the doorbell itself.

A PCC doorbell can be an FFH register access.

To use PCC with SMC via FFH, the Arm FFH specification would need to be amended to cover the PCC use case.

Escape Hatch #4 – SCMI + FFH = 💖?

We can do better than raw PCC. Arm already has an expansive adaptation of PCC – the System Control and Management Interface (SCMI), which covers exactly the use case we are after – device control and power management. SCMI is a higher-level protocol with a concrete command set.

Some of what SCMI can do.

SCMI is pretty advanced, and even supports asynchronous (delayed response) commands.

If the doorbell is SMC or HVC based, it should follow the SMC Calling Convention [SMCCC]. The doorbell needs to provide the identifier of the Shared Memory area that contains the payload. The Shared Memory area containing the payload is updated with the SCMI return response when the call returns. The identifier of the Shared Memory area should be 32-bits and each identifier should map to a distinct Shared Memory area.

https://static.docs.arm.com/den0056/b/DEN0056B_System_Control_and_Management_Interface_v2_0.pdf

While SCMI supports SMC as a doorbell according to the spec, the details are unfortunately left out. Presumably SMC must be exposed via FFH, yet the Arm FFH specification doesn’t currently cover this scenario.

SCMI could be a reasonable interface for a Trusted Firmware-based SCP. SCMI supports a lot of the interfaces one would otherwise implement in an arbitrary fashion, such as sensors, clocks and power and reset control. And not just from AML – as a well defined specification, SCMI could be a generic (fallback?) implementation for OS drivers (GPIO, PCIe, I2C, etc).

There are a few areas for improvement:

  • Add commands for pin and GPIO control.
  • Add commands to abstract embedded controller I/O (I2C, SPI, SMBUS).
  • Add commands to abstract PCIe CAM for systems without standards-compliant ECAM.
  • Add a purely SMC-based FastChannel transport, for areas where asynchronous support is irrelevant and where latency is key, like PCIe CAM, pin control or GPIO.

Parting thoughts

We looked at a number of schemes, but the most baked-through appears to be exposing a Trusted Firmware-based SCP via SCMI and an SMC doorbell, although the Arm FFH support for SMC-based SCMI doorbells still needs to figured out, and there are a few crucial categories of interfaces missing, such pin control and GPIO.

Of course, don’t forget that the Raspberry Pi 4, for example, has its GPU/VPU acting as the System Control Processor with it’s own SCMI-like mailbox. Could that be replaced with SCMI, avoiding a TF-A based SCMI interface entirely?

Finally, firmware-based SCPs aren’t just for Edge or Client devices. Even on servers, systems today can rely on GPIO-signaled Events instead of Interrupt-signaled Events. GPIO-signaled system events require a vendor-specific GPIO driver. Thus for servers, SCMI could mean never having to worry about vendor-specific GPIO drivers.