Windows 10 drivers?

A recurring topic is Windows drivers for the Raspberry Pi 3 and 4 series.

MCCI DesignWare USB2 driver

This is for the front USB ports on Pi 3 and only the Type-C port on the Pi 4.

MCCI Corporation has made their TrueTask USB host stack available to the Raspberry Pi WoA community for non-commercial, evaluation purposes. MCCI did the original work for the 32-bit Windows IoT Core. It is available courtesy of Terrill Moore, CEO of MCCI, who graciously spent time in early 2019 to get it building and validated with the 64-bit Pi 3 UEFI.

If you like the drivers, I hope you’ll support The Things Network New York. MCCI does some pretty amazing things with LoRaWAN.

Driver is here. Launch announcement is here.

Note that the driver will not work correctly on Pi 4 boards with more than 1GB of RAM, unless you limit the RAM seen by Windows. See this guide.

OSS DesignWare USB2 driver

Before the MCCI driver was released, this was the only option. Originally based on an earlier version of the UEFI USB driver and the UCX framework, it’s not particularly stable or recommended. It was originally developed by @NTAuthority, who was the first person ever to show Windows running on a Pi 3 (rumor has it, with an early variant of the Pi 3 UEFI ;-)). And he thankfully left enough crumbs for the rest of us to pick up and carry the torch.

Driver repo is here.

Other BSP drivers

These were originally put up by Microsoft as part of the 32-bit Windows IoT Core BSP for the Pi 2/3. After a bit of cleaning, they build and run fine on 64-bit Windows.

Driver repo is here.

Guide – Windows 10 ARM64 on Pi 4B

So you want to install Windows 10 on this ’Berry. You’d better follow this guide closely.

Because the front (USB 3) ports are still unsupported, this guide uses the MCCI drivers for the “legacy” DWC2 USB controller, available via the Type-C port. Because of the limitations of the DWC2 driver, only 1GB of RAM will be usable under Windows 10.

Hardware needed

  • A PC with a recent Windows 10 build installed.
  • A micro SD card reader.
  • A powered USB Type-C hub, or just a Type-C OTG cable if you can power the Pi through the GPIO pins; a powered micro-USB hub with a Type-C-to-micro-USB adapter also works.
  • A USB mouse and keyboard.
  • A fast micro SD card – 16GB or larger – Class A1 or A2.
  • A Raspberry Pi 4B.
  • A micro HDMI cable.
  • A power supply (5V 3A+).

Downloads

Download Windows 10 installation files for arm64 from https://uup.rg-adguard.net/.

  • Choose the “Download ISO compiler in OneClick!” option.
  • Run the downloaded CMD file (creatingISO.cmd).

Alternatively, download via https://uupdump.ml/ using the aria2 download method, and convert by running aria2_download_windows.cmd after extracting the package.

Either of those services will generate an ISO file, but we only need the install.wim file from the sources folder on that ISO. Any build that passes OOBE without issues will be fine.

WoR (Windows on Raspberry) – download version 2.0.0-alpha.3 from https://worproject.ml/downloads.

Guide

Once you have downloaded everything above, you can proceed.

Open WoR and proceed as follows:

  1. Select the disk from the list – this will be your microSD card reader – and select Raspberry Pi 4 as the target device.
  2. Select the Windows build WoR should use by pointing it at the correct install.wim file.
  3. Use the latest drivers the WoR server provides.
  4. Select the latest UEFI for the Raspberry Pi 4.
  5. Make sure MBR is selected as the partition scheme. WoR will automatically limit memory to 1024MB, as this is still required by the Type-C USB driver.
  6. Edit the boot options if you need to (I always overclock, as my Pi has a fan attached).

WoR will then deploy Windows to the selected microSD card, which will take from 16 minutes to 3 hours depending on the speed of the card.

Safely remove the microSD card and move it to the Raspberry Pi.

Notes

This guide will most likely be updated if anything changes. The first boot will take between 12 minutes and 2 hours, depending on the speed of your microSD card. If there are issues during OOBE setup, pressing Shift+F10 and then typing

%windir%\System32\Sysprep\sysprep.exe /oobe /reboot 

might help. If it doesn’t, you will need to try a different build of Windows 10 arm64. Remember that only the Type-C port works correctly at the moment, so you will have to connect all other devices through it somehow. Good luck!

Momentum is Building

In the past few days, there has been a lot of progress and a lot of publicity for this project, which shows the ecosystem’s desire and demand for lowering the barrier to entry on booting Arm SBCs – in this case, the Raspberry Pi 4 of course.

Tweets, LinkedIn posts, CNX Software replies, and Hackster comments all tell the same story: allowing users to power on a single board computer, install the operating system of their choice using “normal” boot media, and proceed through an install process just as they would on a typical PC is a missing piece in the Arm ecosystem. Without the ability for “regular” users to start exploring Arm hardware and get up and running in the way they are used to, Arm servers will remain a niche product.

So, join us on the Discord Server, help contribute patches and code if you can, or simply spread awareness of the project on your Social Media channels!

Abstracting SoC hardware initialization

How to best support Edge with standards?

One of the gaps between ServerReady and existing “Edge” SoCs is the latter’s general reliance on OS-coordinated device initialization and power state management. A typical BSP or port of an OS would involve clock source, GPIO, PoR and pin control/multiplexing drivers as prerequisites for any embedded devices, and usually I2C, SPI and voltage regulator drivers to be able to do anything “interesting” such as accessing sensors, doing storage I/O or driving graphics outputs.

While firmware could hypothetically pre-initialize everything into a working state, this presents a dilemma for devices meant to operate at the lowest possible power setting. Pre-initialization also means no device reconfiguration after OS boot.

ACPI has somewhat adapted to this space, but not abstracted enough from gory details. While ACPI does tidy up some platform device configuration via its interpreted AML byte code methods, it only appears to be marginally better than device tree for non-server systems. For example, ACPI’s notion of Generic Serial Bus (I2C, SPI, UART) and GPIO OpRegions lets device AML methods perform I/O without resorting to bit-banging MMIO addresses, but requires host operating system drivers to provide the underlying implementation. There are ACPI resource descriptors for tracking and describing GPIOs and pin configuration for devices, but these are again completely useless without appropriate drivers.

Great, so this basically reduces ACPI on non-server systems to obscure machine code coordinating a bunch of OS drivers, sourced from silicon providers and platform integrators. But maybe we can throw all these new vendor-specific OpRegions away, for compatibility’s sake, and code like it’s ACPI 4.0a?

Is writing AML really feasible?

Surely, device- and platform-specific configuration logic can be neatly limited to AML methods?

Well, that’s highly overrated.

A time set method for an ACPI Time and Alarm (TAD) device.

Above is an excerpt of a TAD device for a memory-mapped RTC. It’s a good warning against doing anything beyond basic I/O and arithmetic in AML. Considering how straightforward RTCs are as a device class, this bit of code (roughly 1/4 of all the AML required) is unmaintainable, and without comments it would be completely incomprehensible. Yes, that’s a 5-second busy-wait in there, waiting on a completion from the device, but it might as well be looping forever. Note that AML methods in most implementations run under a global interpreter lock (which also means AML code is not reentrant – an OpRegion cannot be backed by AML).
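For contrast, the same wait-for-completion pattern in C makes both the bound and the failure path explicit – something AML gives you no good tools for. A contrived sketch with an injectable clock (all names here are hypothetical, not from the TAD code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Poll done() until it reports completion or timeout_ticks elapse
 * on the caller-supplied clock. In AML the equivalent is a While
 * loop with a Stall inside; here the timeout and the give-up path
 * are visible and reviewable. */
static bool poll_with_timeout(bool (*done)(void *), void *ctx,
                              uint64_t (*now)(void *), void *clk,
                              uint64_t timeout_ticks)
{
    uint64_t start = now(clk);
    while (!done(ctx)) {
        if (now(clk) - start >= timeout_ticks)
            return false; /* give up instead of spinning forever */
    }
    return true;
}

/* Demo harness: a fake tick counter and a "device" that completes
 * after a fixed number of polls. Purely illustrative. */
static uint64_t fake_now(void *clk) { return (*(uint64_t *)clk)++; }
static bool fake_done(void *ctx) { return --*(int *)ctx <= 0; }
```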

AML is machine code. Really slow and limited machine code. ASL (the source compiled into AML) is about as expressive as 8048 assembly.

How about translating the following real-life SBC support code to AML? On the Raspberry Pi the Arm cores don’t have access to device control blocks and must request that the VideoCore VPU (GPU) processor act on their behalf via special “mailbox” requests in shared RAM. E.g.:

Excerpt of a mailbox communication routine for Raspberry Pi from TF-A
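The basic shape of the protocol: poll the status register until the FULL bit clears, write the 16-byte-aligned request buffer address with the channel number packed into its low 4 bits, then poll for a reply word on the same channel. A hedged C sketch of just the word encoding (helper names are illustrative, not the actual TF-A code):

```c
#include <stdint.h>

/* A VPU mailbox word packs a 16-byte-aligned buffer address and a
 * 4-bit channel number (channel 8 is the property-tag channel)
 * into a single 32-bit value. */
#define MBOX_CHANNEL_MASK     0xFu
#define MBOX_CHANNEL_PROPERTY 8u

static inline uint32_t mbox_encode(uint32_t buf_addr, uint32_t channel)
{
    /* buf_addr must be 16-byte aligned, so its low bits are free. */
    return (buf_addr & ~MBOX_CHANNEL_MASK) | (channel & MBOX_CHANNEL_MASK);
}

static inline uint32_t mbox_channel(uint32_t word) { return word & MBOX_CHANNEL_MASK; }
static inline uint32_t mbox_addr(uint32_t word)    { return word & ~MBOX_CHANNEL_MASK; }
```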

First of all, this is doing DMA… so for an AML implementation, you’ll need a chunk of memory. You can’t use an actual AML buffer for this, so you’ll have to carve some physical memory out of the UEFI memory map, mark it with the right memory attributes matching your coherency requirements, set up a SystemMemory OpRegion…

Oh, I’m having so much fun!

Well, Barbie, we are just getting started…

As you can see, the routine is a mix of MMIO and CPU operations, e.g. data cache cleaning and invalidation. This is problematic for compiling to AML, which doesn’t include any cache operations. You could do away with cache operations entirely by carving out a non-cache-coherent memory chunk. Or perhaps you’re lucky and your device supports cache-coherent DMA… well, then it would have to look like this:

Excerpt of a mailbox communication routine for Raspberry Pi from UEFI

No more cache operations! But now we have barriers. AML doesn’t have memory barriers. Maybe you can play fast and loose? Well, expect problems when you rev up the cores to something a bit more OoO. Also, consider that an operating system’s AML interpreter is probably getting scheduled across multiple CPUs while running your method…
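In C terms, the ordering requirement is: finish the writes to the shared buffer, fence, then ring the doorbell. A contrived sketch using compiler builtins (`fake_doorbell` stands in for a real MMIO register; this is not the actual UEFI code):

```c
#include <stdint.h>

static uint32_t fake_doorbell; /* stands in for an MMIO doorbell register */

static void publish_request(volatile uint32_t *buf, uint32_t len,
                            uint32_t doorbell_val)
{
    for (uint32_t i = 0; i < len; i++)
        buf[i] = i; /* fill the shared request buffer */

    /* Ensure the device observes the buffer contents before it sees
     * the doorbell write telling it to go look. AML has no
     * equivalent of this fence. */
    __atomic_thread_fence(__ATOMIC_RELEASE);
    fake_doorbell = doorbell_val;
}
```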

So, no AML for anything involving DMA. What a mess…

No one wants to write AML

Considering how problematic it is to write AML, Microsoft introduced their own notion of Platform Extension Plugins (PEPs) to stand in for entire AML methods.

PEPs are intended to be used for off-SoC power management methods. Since they are installable binaries, they can be updated on-the-fly as opposed to ACPI firmware which requires a firmware flash. … Power management was the original intent for PEPs, but they can be used to provide or override any arbitrary ACPI runtime method.

Providing power management using PEPs can be much easier to debug than code written for the ACPI firmware. …

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-peps-for-acpi-services

PEPs can obscure any method and provide methods you didn’t know were necessary. The AML code now just provides a “skeleton” and some dummy method implementations. This makes ACPI system descriptions relying on PEPs completely useless, since PEPs are opaque platform knowledge, sourced from vendors and completely undocumented.

PEPs play no role in the construction of the ACPI namespace hierarchy because the namespace hierarchy must be provided in the firmware DSDT. When the ACPI driver evaluates a method at runtime, it will check against the PEP’s implemented methods for the device in question, and, if present, it will execute the PEP and ignore the firmware’s version. However, the device itself must be defined in the firmware.

https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/using-peps-for-acpi-services

Even worse, you can’t meaningfully tell which AML methods would be overridden by a PEP, or what kind of configuration data the PEP was meant to source. PEPs are even worse than device tree, because PEPs can fully hide configuration data that would otherwise be reported as properties through device tree.

This is a good explanation for why none of the Snapdragon laptops today can boot Linux using ACPI. And not just Linux – the HP Envy x2 cannot boot a “stock” Windows 10 image without Snapdragon customizations.

PEPs are not an answer.

Let’s put the BSP in TF-A

We know we don’t want to write in AML for a good number of reasons. In a few situations it is literally impossible to write safe and functional AML. We also want to avoid writing drivers for things OS vendors (and their customers) really don’t care about, like pin controllers and clock source management. And PEPs are not a standard interface, and their implementation is OS vendor specific.

It would appear that the best place to hide low-level platform drivers for device initialization and power state management is Trusted Firmware-A. TF-A is an industry-adopted TrustZone firmware, usually used to implement PSCI or as a foundation for a Trusted Execution Environment (TEE). It’s a rich enough environment to host complex code written in a high-level language. Also, TF-A likely already includes some of the component drivers. This way, TF-A becomes a software-based System Control Processor (SCP).

Conceptually, it’s not a bad fit. TF-A already provides services that abstract underlying I/O and SoC control facilities for a general-purpose OS. For example, the Power State Coordination Interface is an abstraction over system state (power-off, reboot) and CPU state (secondary CPU start up) control. As another example, the Software Delegated Exception Interface can be used to abstract NMI support and implement firmware-first system error handling (RAS). Moreover, many vendors already use private calls between UEFI (or U-Boot) and TF-A firmware, for similar reasons.

If the platform BSP entrails are squirreled away in TF-A, how would an OS interact with them? Via ACPI AML methods of course. Regardless of how the ACPI interface works, the actual calls to a firmware-based SCP would be via the well-standardized SMCCC specification.

Trusted Firmware-based SCP calls are vendor-specific, and are part of SiP or OEM service calls.
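For reference, an SMCCC function identifier packs the call type, register width, owning entity and function number into one 32-bit value. A small sketch of the encoding as I read the SMC Calling Convention (the helper name is made up):

```c
#include <stdint.h>

#define SMCCC_FAST_CALL   (1u << 31) /* bit 31: fast (atomic) call */
#define SMCCC_64BIT       (1u << 30) /* bit 30: SMC64 convention */
#define SMCCC_OWNER_SHIFT 24         /* bits [29:24]: owning entity */
#define SMCCC_OWNER_SIP   2u         /* SiP (silicon provider) service */
#define SMCCC_OWNER_OEM   3u         /* OEM service */
#define SMCCC_FN_MASK     0xFFFFu    /* bits [15:0]: function number */

static inline uint32_t smccc_fn_id(uint32_t fast, uint32_t is64,
                                   uint32_t owner, uint32_t fn)
{
    return (fast ? SMCCC_FAST_CALL : 0) |
           (is64 ? SMCCC_64BIT : 0) |
           (owner << SMCCC_OWNER_SHIFT) |
           (fn & SMCCC_FN_MASK);
}
```

So a fast SMC64 OEM call with function number 0x1234 encodes to 0xC3001234, matching the placeholder used in the FFH example later in this post.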

Isn’t putting stuff in TF-A bad? It’s not ideal, but putting it into AML or OS vendor-specific drivers is much worse. Other platforms such as IA-64 (SAL) and OpenPower (OPAL) rely on firmware interfaces to abstract some platform I/O and implementation-specific details.

Don’t like blobs? Upstream and open-source your TF-A, like everybody else.

Isn’t that cycle stealing? No, because we’re talking about operations done on behalf of the operating system requesting them.

Note: It’s not just generic off-the-shelf operating systems that would win from TF-A abstracting common SoC hardware and control interfaces. UEFI firmware itself could make good use of these, reducing the development and support effort for all platforms, and removing similar/duplicated functionality further reducing code size and bug counts.

But what about UEFI runtime services?

Runtime services are meant to abstract parts of the hardware implementation of the platform from the OS, but the interface is fairly limited in scope today. Here’s the current set of calls:

GetTime() – Returns the current time, time context, and time-keeping capabilities.
SetTime() – Sets the current time and time context.
GetWakeupTime() – Returns the current wakeup alarm settings.
SetWakeupTime() – Sets the current wakeup alarm settings.
GetVariable() – Returns the value of a named variable.
GetNextVariableName() – Enumerates variable names.
SetVariable() – Sets, and if needed creates, a variable.
SetVirtualAddressMap() – Switches all runtime functions from physical to virtual addressing.
GetNextHighMonotonicCount() – Subsumes the platform’s monotonic counter functionality.
ResetSystem() – Resets all processors and devices and reboots the system.
UpdateCapsule() – Passes capsules to the firmware with both virtual and physical mapping.
QueryVariableInfo() – Returns information about the EFI variable store.

But what if this list could be extended? Instead of moving the BSP into TF-A, let’s make it all a UEFI runtime service, and figure out how to perform RT calls from AML.

That’s not a very good idea:

  1. RT is fragile, as services share the same privilege level as the calling operating system. Differences in the way different operating systems call services are a constant source of bugs across vendor implementations, e.g. flat addressing or translated, interrupts enabled or disabled, UART enabled or disabled.
  2. RT requires an environment – memory ranges must be correctly mapped, enough stack, disabled FP traps, etc. A single SMC instruction for trapping to TF-A is hard to beat.
  3. RT is fragile, as runtime services are provided by the same drivers that provide boot time services in most UEFI implementations, such as Tiano. There’s no meaningful isolation between an RT driver and its (and other) BS components. The firmware programming model is just bad. It is extremely easy to make a seemingly benign change (new global variable, logging statement, driver dependency) that will break RT support for some users, but be very difficult to track down.
  4. Limited facilities with no support for asynchronous implementation, e.g. can’t take an exception on behalf of a service, while in OS. This may make some hardware hard to expose efficiently or mean that certain workarounds are impossible, e.g. a device quirk that relies on handling external aborts on a system with firmware-first error handling.
  5. Simply revising the UEFI specification with new RT services won’t do anything for existing code bases. Retrofitting will be a significant effort. In contrast, TF-A is a simpler code base that is easier to swap out.
  6. UEFI implementations have a poor track record of being open-sourced by firmware vendors. TF-A implementations are done by silicon providers themselves, and have a better track record of being open source and audit-able.

Tying SMCCC and ACPI together

We need a generic escape hatch mechanism from ACPI to the operating system, to be able to easily perform arbitrary SMCCC calls from AML device methods.

Escape Hatch #1 – FFH

The Functional Fixed Hardware (FFH) OpRegion type seems like a good fit.

https://uefi.org/sites/default/files/resources/ACPI_6_3_final_Jan30.pdf, page 114

Unfortunately, the ACPI specification gives no examples of FFH usage for anything outside of Register resource descriptors, as part of processor Low Power Idle States (_LPI) support.

I have never seen any examples of OpRegions being declared with type FFixedHW, but the ACPICA compiler (iasl) didn’t barf at this quick draft:

DefinitionBlock ("test.aml", "SSDT", 5, "FOO", "BAR", 6) 
{
   Device (F000) {
      Name (_HID, "FOO1234")
      Method (_INI, 0, Serialized) {
         OperationRegion (VNCL, FFixedHW, 0xbeef0000, 0x8)
         Field (VNCL, BufferAcc, Lock, Preserve) {
            Offset (0),  AccessAs(BufferAcc, AttribRawBytes (0x58)),
            SMCC, 8,
         }

         // SMC call exchange buffer
         Name (BUFF, Buffer(0x58){})
         CreateQWordField(BUFF, 0x0,  CALL) // Function identifier w0
         CreateQWordField(BUFF, 0x8,  AR1)  // Argument x1
         CreateQWordField(BUFF, 0x10, AR2)  // Argument x2
         CreateQWordField(BUFF, 0x18, AR3)  // Argument x3
         CreateQWordField(BUFF, 0x20, AR4)  // Argument x4
         CreateQWordField(BUFF, 0x28, AR5)  // Argument x5
         CreateQWordField(BUFF, 0x30, AR6)  // Argument x6
         CreateQWordField(BUFF, 0x38, RET0) // Result x0
         CreateQWordField(BUFF, 0x40, RET1) // Result x1
         CreateQWordField(BUFF, 0x48, RET2) // Result x2
         CreateQWordField(BUFF, 0x50, RET3) // Result x3

         CALL = 0xC3001234 // OEM call 0x1234
         AR1 = 0x1         // Some parameter for call
         SMCC = BUFF       // Invoke! 
         If (RET0 == 0xffffffffffffffff) {
            // Failure.
            // ...
         }

         // Success
         //...
      }
   }
}

Encouraging, if messy. This could be an amendment to the Arm Functional Fixed Hardware Specification.
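On the OS side, supporting this would mean teaching the ACPI interpreter’s FFixedHW region handler to unmarshal the 88-byte exchange buffer into SMC arguments and back. A hypothetical sketch (the handler name and the no-op SMC stand-in are made up):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Mirrors the BUFF layout in the ASL above: a function identifier
 * plus six arguments in, four results out, all 64-bit fields. */
typedef struct {
    uint64_t fn;      /* offset 0x00: SMCCC function identifier (w0) */
    uint64_t arg[6];  /* offsets 0x08..0x30: arguments x1..x6 */
    uint64_t ret[4];  /* offsets 0x38..0x50: results x0..x3 */
} smc_xchg_t;

static void ffh_region_handler(uint8_t *region_buf, size_t len)
{
    smc_xchg_t x;
    if (len < sizeof(x))
        return;
    memcpy(&x, region_buf, sizeof(x));
    /* ... here the real handler would issue the SMC via an
     * architecture-specific stub, filling x.ret from x0..x3 ... */
    x.ret[0] = 0; /* pretend success for the sketch */
    memcpy(region_buf, &x, sizeof(x));
}
```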

Escape Hatch #2 – OS method

The problem with Hatch #1 is the large amount of changes required to operating system ACPI support. Additionally, the syntax is obtuse and doesn’t fit the semantics of a method invocation well, and the creation of OpRegions and buffers has additional overheads.

To borrow a page from the PEP book, the OS ACPI interpreter could provide a method to perform SMCCC calls. Unlike PEPs, this could be a well defined interface. Its presence could be negotiated using the standard _OSC (OS Capabilities) mechanism.

//
// OS ACPI interpreter implements OSMC.
//    Params: fn and arguments x1-x6,
//    Returns: Package{4} containing return values x0-x3.
//
External (OSMC, MethodObj, PkgObj, { IntObj, IntObj, IntObj, IntObj, IntObj, IntObj, IntObj })

Name (RETV, Package(4){})
RETV = OSMC(0xC3001234, 0x1, 0, 0, 0, 0, 0)
If (DeRefOf (RETV[0]) == 0xffffffffffffffff) {
   // SMCCC call failed.
   // ...
}

Escape Hatch #3 – PCC (and FFH, again).

The Platform Communication Channel (PCC) is a generic mechanism for OSPM to communicate with an entity in the platform, such as a BMC or an SCP (System Control Processor). PCC relies on a shared memory region and a doorbell register. Starting with ACPI 6.0, PCC is also a supported OpRegion type. Assuming the doorbell can be wired to an SMC, PCC could be used to communicate with a Trusted Firmware-based SCP.

PCC is a higher level protocol than invoking random SMC calls from ACPI or OS directly. The commands/messages all go via structured shared memory. There would be only one SMCCC call used – for the doorbell itself.
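The shared memory region carries a small fixed header defined by the ACPI specification, followed by a communication space whose contents the subspace user defines. Roughly, in C (my reading of the spec’s Generic Communications Channel layout):

```c
#include <stdint.h>
#include <stddef.h>

/* Header of a PCC Generic Communications Channel Shared Memory
 * Region; the communication (payload) space follows immediately
 * after and is defined by the subspace user. */
#pragma pack(push, 1)
typedef struct {
    uint32_t signature; /* 0x50434300 OR'd with the subspace ID */
    uint16_t command;   /* bits [7:0]: command code */
    uint16_t status;    /* bit 0: command complete; bit 2: error */
    uint8_t  payload[]; /* subspace-defined communication space */
} pcc_shmem_t;
#pragma pack(pop)

static inline uint32_t pcc_signature(uint32_t subspace_id)
{
    return 0x50434300u | subspace_id;
}
```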

A PCC doorbell can be an FFH register access.

To use PCC with SMC via FFH, the Arm FFH specification would need to be amended to cover the PCC use case.

Escape Hatch #4 – SCMI + FFH = 💖?

We can do better than raw PCC. Arm already has an expansive adaptation of PCC – the System Control and Management Interface (SCMI), which covers exactly the use case we are after – device control and power management. SCMI is a higher-level protocol with a concrete command set.

Some of what SCMI can do.

SCMI is pretty advanced, and even supports asynchronous (delayed response) commands.

If the doorbell is SMC or HVC based, it should follow the SMC Calling Convention [SMCCC]. The doorbell needs to provide the identifier of the Shared Memory area that contains the payload. The Shared Memory area containing the payload is updated with the SCMI return response when the call returns. The identifier of the Shared Memory area should be 32-bits and each identifier should map to a distinct Shared Memory area.

https://static.docs.arm.com/den0056/b/DEN0056B_System_Control_and_Management_Interface_v2_0.pdf

While SCMI supports SMC as a doorbell according to the spec, the details are unfortunately left out. Presumably SMC must be exposed via FFH, yet the Arm FFH specification doesn’t currently cover this scenario.

SCMI could be a reasonable interface for a Trusted Firmware-based SCP. SCMI supports a lot of the interfaces one would otherwise implement in an arbitrary fashion, such as sensors, clocks and power and reset control. And not just from AML – as a well defined specification, SCMI could be a generic (fallback?) implementation for OS drivers (GPIO, PCIe, I2C, etc).
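For a flavor of the protocol: every SCMI message starts with a 32-bit header packing the message ID, message type, protocol ID and a sequence token. A sketch, with field positions and the clock-protocol example values taken from my reading of the SCMI specification:

```c
#include <stdint.h>

/* SCMI message header layout (32-bit word):
 *   bits [7:0]   message ID
 *   bits [9:8]   message type (0 = command)
 *   bits [17:10] protocol ID (e.g. 0x14 = clock protocol)
 *   bits [27:18] sequence token
 */
static inline uint32_t scmi_msg_hdr(uint32_t msg_id, uint32_t msg_type,
                                    uint32_t proto_id, uint32_t token)
{
    return (msg_id & 0xFFu) |
           ((msg_type & 0x3u) << 8) |
           ((proto_id & 0xFFu) << 10) |
           ((token & 0x3FFu) << 18);
}
```

For example, a CLOCK_RATE_SET command (message ID 0x5 in the clock protocol) with token 1 would carry the header below.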

There are a few areas for improvement:

  • Add commands for pin and GPIO control.
  • Add commands to abstract embedded controller I/O (I2C, SPI, SMBUS).
  • Add commands to abstract PCIe CAM for systems without standards-compliant ECAM.
  • Add a purely SMC-based FastChannel transport, for areas where asynchronous support is irrelevant and where latency is key, like PCIe CAM, pin control or GPIO.

Parting thoughts

We looked at a number of schemes, but the most baked-through appears to be exposing a Trusted Firmware-based SCP via SCMI with an SMC doorbell, although the Arm FFH support for SMC-based SCMI doorbells still needs to be figured out, and a few crucial categories of interfaces are missing, such as pin control and GPIO.

Of course, don’t forget that the Raspberry Pi 4, for example, has its GPU/VPU acting as the System Control Processor, with its own SCMI-like mailbox. Could that be replaced with SCMI, avoiding a TF-A-based SCMI interface entirely?

Finally, firmware-based SCPs aren’t just for Edge or Client devices. Even on servers, systems today can rely on GPIO-signaled Events instead of Interrupt-signaled Events. GPIO-signaled system events require a vendor-specific GPIO driver. Thus for servers, SCMI could mean never having to worry about vendor-specific GPIO drivers.