Latency critical DMA read via PCIe

Dear All,

I am currently developing a high-throughput audio system that operates over PCIe tunneled through a USB4 interface. It includes custom FPGA-based hardware and a custom Audio DriverKit driver.

While performing read operations via the hardware DMA (that is, a Host to Device transfer), I am noticing sporadic latency spikes in the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) normally take from 5us to 40us to complete, which is perfectly fine for my case. However, on some rare occasions, they can take up to 400us, which causes overruns. The measurements were carried out on the FPGA and include the overall request and transfer time.

While trying to tackle the problem, I'm investigating the possible power saving options and performance constraint methods at my disposal. I currently use these methods to mitigate the problem.

ChangePowerState(kIOServicePowerCapabilityOn);
SetPowerOverride(true);
RequireMaxBusStall(kIOMaxBusStall25usec);
CreatePMAssertion(kIOServicePMAssertionCPUBit | kIOServicePMAssertionForceFullWakeupBit, &ivars->PMAssertionID, false);

The buffers are currently about 16MB, single segment, 16KB aligned and, of course, "prepared" for DMA.

The system ran for 3 hours without any overrun, but I'm still not fully convinced of its reliability. Could someone comment on this? Are there profiling tools that I can use?

Feel free to ask me for any further detail. The testing system is a MacBook Pro M2 Pro.

Many Thanks and Best Regards

Francesco

While performing read operations via the hw DMA (that is, a Host to Device transfer), I am noticing sparse latency spikes into the read transfers. Specifically, 4KB operations (which I assume include MRd + CplD) take normally from 5us to 40us to be completed, perfectly fine for my case. However, in some rare occasions, they can end up to 400us, which causes me overruns.

How rare is "rare"? The system is complicated enough that, given enough time/work/complexity, "something" is all but guaranteed to go wrong. If you can narrow the failure down to some set of specific conditions, then a deeper investigation could be useful, but without that context, it's hard to guess about what happened or even whether it was a true problem.

Having said that, the "4KB operations" did jump out at me. Is that your hardware's normal work unit? Are you specifically preparing 4KB "chunks" as independent memory operations? If you are, then you might try operating on 16KB chunks, as that's the system’s natural page size, and sub-page mapping is more complicated for the DART to manage.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Thanks for the reply, Kevin. My apologies for the overly qualitative information. The device prototype has only just been set up, and I don't have good enough statistics yet. For now, I would like to ensure that all the proper driver technologies are in place; I will then start a long-run session.

Audio Buffers

Let me provide more detail about the system and the tests carried out so far. I will present the methods concerning the D2H path only (the one affected by the latency spike). The write path is completely equivalent anyway.

  • Buffer allocation (in audio device init):
```
OSSharedPtr<IOBufferMemoryDescriptor>   m_input_io_ring_buffer; // in ivars

IOBufferMemoryDescriptor::Create(kIOMemoryDirectionIn, buffer_size_bytes, 0x4000, ivars->m_input_io_ring_buffer.attach());
```
  • Buffer memory mapping (in audio device StartIO):
__block OSSharedPtr<IOMemoryDescriptor> input_iomd;

input_iomd->CreateMapping(0, 0, 0, 0, 0, ivars->m_input_memory_map.attach());

In all tests, a buffer of 16384 audio samples has been used. The total size depends on how many channels are interleaved. In particular, I tested systems with 16, 64 and 256 I/O audio channels, at 48kHz, in 32-bit integer format.
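The buffer sizes implied by these figures can be checked with a little arithmetic. The following is a minimal sketch, not the driver's code: the helper names `ring_buffer_bytes` and `is_page_multiple` are hypothetical, and it assumes that one "audio sample" in the buffer means one interleaved frame (one 4-byte sample per channel).

```cpp
#include <cassert>
#include <cstdint>

// 16KB, the page size and buffer alignment (0x4000) mentioned in this thread.
constexpr uint64_t kPageBytes = 0x4000;

// Total ring buffer size in bytes for interleaved audio.
// Assumption: "frames" is the number of audio samples per channel.
inline uint64_t ring_buffer_bytes(uint64_t frames, uint64_t channels,
                                  uint64_t bytes_per_sample) {
    assert(frames > 0 && channels > 0 && bytes_per_sample > 0);
    return frames * channels * bytes_per_sample;
}

// True when the size is a whole number of 16KB pages.
inline bool is_page_multiple(uint64_t bytes) {
    return bytes % kPageBytes == 0;
}
```

With 16384 frames of 32-bit samples, this gives 1 MiB for 16 channels, 4 MiB for 64 channels and 16 MiB for 256 channels, which is consistent with the "about 16MB" buffers mentioned in the first post. All three sizes are whole multiples of 16KB.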

DMA Buffer Preparation


D2HSegmentsN = 1 // Single segment forced (so far)

IODMACommand::Create(ivars->pciDevice, kIODMACommandCreateNoOptions, &dmaSpecification, &dmaCommandD2H);

dmaCommandD2H->PrepareForDMA(kIODMACommandPrepareForDMANoOptions, D2H_memory_buffer_descriptor, 0, virtualD2HSegment.length, &mem_direction_flags, &D2HSegmentsN, physicalD2HSegment);

PCIe Device

I followed the same procedure presented in the official Apple video for DMA bus mastering ("Modernize PCI and SCSI drivers with DriverKit").

// Enable memory space access and bus mastering for DMA
    ivars->pciDevice->ConfigurationRead16(kIOPCIConfigurationOffsetCommand, &commandReg);
    commandReg |= (kIOPCICommandBusMaster | kIOPCICommandMemorySpace);
    ivars->pciDevice->ConfigurationWrite16(kIOPCIConfigurationOffsetCommand, commandReg);

Performed Tests

  1. Very first test. No actions for CPU/DART/PCIe power management (all default), 16 channels, a single DMA burst for every audio sample (a 20.8us deadline), that is, 64 bytes (very inefficient). Frequent deadline misses (about 1 per minute) in the read operation. This is to be expected, since the baseline normally takes about 20 to 25us -> approach abandoned.
  2. Burst increased to 8 audio samples (a 167us deadline) with 16 interleaved channels (512 bytes). Operation is more stable (the read baseline is still about 10 to 40us). However, about once every 30 minutes I noticed a read spike exceeding the deadline -> host underrun (bad).
  3. Same burst morphology, but I applied power management and bus characteristic constraints. In particular:
pciDevice->EnablePCIPowerManagement(kPCIPMCSPowerStateD0);

pciDevice->SetASPMState(kIOPCILinkControlASPMBitsDisabled);

//This looks very critical <<<<-------
RequireMaxBusStall(kIOMaxBusStall25usec);

plus, in Info.plist:

IOPCITunnelL1Enable NO
IOPMPCISleepLinkDisable NO 
IOPMPCIConfigSpaceVolatile NO
IOPCIRetrainLinkWake YES

Now things are much better: read deadline misses occurred only about 3 times in a 12-hour test.

  4. Carried away by my enthusiasm, I tried an extreme test with 256 channels. The burst was 8 or 4 samples, which corresponds to 8KB or 4KB. The outcome seems very similar to case 3. But I’d like to eliminate the possibility of deadline misses entirely, so I investigated the power features further. I ended up adding these requirements before the audio I/O operation starts:
ChangePowerState(kIOServicePowerCapabilityOn);
SetPowerOverride(true);
CreatePMAssertion(kIOServicePMAssertionCPUBit | kIOServicePMAssertionForceFullWakeupBit, &ivars->PMAssertionID, false);

After this, over several days, I did not notice any relevant event, and my question is whether the problem has really been solved completely. I should probably try commenting out the called methods one by one to check which one is the game changer. Am I doing anything stupid? Are some of these methods redundant (probably yes)? Are there other relevant methods I'm missing, or profiling tools on the host system that I can use to track the system in the long term?

All the cited measurements have been carried out by the FPGA itself, so they are reliable in terms of precision.

Concerning your point about the 16KB, I know this is the page size. I can try to ask my DMA to produce such a burst. However, if I remember correctly, PCIe allows bursts of 4KB maximum, so I don't know if this will help. I can try. It is worth studying further whether such a large request can be issued in a single MRd, or whether a division into sub-chunks is unavoidable.
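The burst sizes and deadlines quoted in the tests above follow from the sample rate and channel count. Here is a minimal sketch of that arithmetic; the names `Burst` and `burst_for` are hypothetical, and the deadline is assumed to be simply the playback duration of the frames in one burst.

```cpp
#include <cassert>
#include <cstdint>

// Size of one DMA burst and the time available before the next one is due.
struct Burst {
    uint64_t bytes;
    double deadline_us;
};

// Assumption: one burst carries `frames` interleaved frames, and the deadline
// equals the time those frames last at the given sample rate.
inline Burst burst_for(uint64_t frames, uint64_t channels,
                       uint64_t bytes_per_sample, uint64_t sample_rate_hz) {
    assert(sample_rate_hz > 0);
    Burst b;
    b.bytes = frames * channels * bytes_per_sample;
    b.deadline_us = 1e6 * static_cast<double>(frames) /
                    static_cast<double>(sample_rate_hz);
    return b;
}
```

At 48kHz with 32-bit samples, `burst_for(1, 16, 4, 48000)` gives 64 bytes with a deadline of about 20.8us, `burst_for(8, 16, 4, 48000)` gives 512 bytes with about 167us, and `burst_for(8, 256, 4, 48000)` gives 8KB with the same 167us deadline. A 4-sample burst with 256 channels is 4KB with a deadline of about 83us.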

Thank you very much

Hi,

Concerning your point of the 16KB, I know this is the page size. I can try to ask my DMA to produce such a burst. However, if I remember correctly, PCIe allows bursts of 4KB maximum, so I don't know if this will help. I can try. It’s worth studying better if such a large request can be asked in a MRd, or a division into sub-chunks is unavoidable.

So, the issue here isn't about the PCI bus, it's about how the DART manages mappings. The DART can do sub-page size mapping, but it's generally "easier" and faster when you're working in full page increments. That is, you're better off using 1 16KB page vs 4 4KB pages, even though the final result is exactly the same.

Note that this ISN'T really about what you actually send to your PCI card. If you need to work in 4KB chunks (or any other weird size for that matter), then you can take that 16KB page and use simple math to subdivide the physical offsets. This post is an example of how the performance can vary.
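The "simple math" for subdividing one mapped page into smaller hardware requests could look like the sketch below. It is only an illustration: `Chunk` and `subdivide` are hypothetical names, and `base` stands for whatever bus address the mapped region starts at. No DriverKit calls are involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One request the hardware would issue: a start address and a byte count.
struct Chunk {
    uint64_t address;
    uint64_t length;
};

// Split a region of `length` bytes starting at `base` into pieces of at most
// `chunk_bytes` each. The last piece is shorter if the length does not divide
// evenly.
inline std::vector<Chunk> subdivide(uint64_t base, uint64_t length,
                                    uint64_t chunk_bytes) {
    assert(chunk_bytes > 0);
    std::vector<Chunk> chunks;
    for (uint64_t offset = 0; offset < length; offset += chunk_bytes) {
        const uint64_t remaining = length - offset;
        chunks.push_back(Chunk{base + offset,
                               remaining < chunk_bytes ? remaining : chunk_bytes});
    }
    return chunks;
}
```

For example, `subdivide(base, 0x4000, 0x1000)` yields four 4KB chunks at offsets 0x0000, 0x1000, 0x2000 and 0x3000 from `base`, so the memory is mapped once as a full 16KB page while the device still works in 4KB requests.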

D2HSegmentsN = 1 // Single segment forced (so far)

No, not just so far. IODMACommand.PrepareForDMA() returns a segment count and an array of segments; however, that detail is effectively a vestigial appendage, not a useful feature. I have a post that describes what's going on in more detail here, but you're only going to ever get "1" segment back. I'd actually recommend that you check that segmentsCount==1 and simply terminate your driver if you get anything else, as a different value would imply significant enough architectural changes that "blindly" continuing is unwise.

Note that the underlying behavior here is effectively a fundamental side effect of the DART, not an accident or DriverKit-specific feature. Even within the kernel, it's not entirely clear to me how you'd get IODMACommand to generate multiple segments, and that's with a much broader set of memory descriptor functionality than DriverKit exposes.
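The recommended sanity check can be sketched as a small predicate. This is an illustration under stated assumptions, not DriverKit code: `Segment` is a local stand-in with the same address/length shape assumed for the segment array filled in by PrepareForDMA, and `single_segment_ok` is a hypothetical helper. The length check is an extra assumption added here; the advice above only concerns the segment count.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for one returned DMA segment (assumed shape: address + length).
struct Segment {
    uint64_t address;
    uint64_t length;
};

// Returns true only when exactly one segment came back and it covers the
// whole range that was prepared. A driver would bail out on `false` rather
// than continue blindly.
inline bool single_segment_ok(uint32_t segments_count, const Segment* segments,
                              uint64_t expected_length) {
    assert(expected_length > 0);
    if (segments_count != 1 || segments == nullptr) {
        return false;
    }
    return segments[0].length >= expected_length;
}
```

In a driver, this check would run right after the prepare call, and a `false` result would lead to tearing the driver down instead of programming the hardware.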

Followed the same procedure presented in the official Apple video for DMA bus mastering ("Modernize PCI and SCSI drivers with DriverKit").

That's a good reference, particularly since SCSIControllerDriverKit is basically designed around subdividing I/O buffers. However, I'll also note that the IOPCIFamily's implementation (including DriverKit) is open source, so there is another resource you may find useful.

Carried away by my enthusiasm, I tried an extreme test with 256 channels. The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3. But I’d like to eliminate the possibility of deadline misses entirely.

So, the critical factors here are:

  1. How fast you're attempting to perform operations.

  2. How much data you're trying to transfer.

...but all of the transfers you're describing are sufficiently small that #1 is the primary factor. Note the dynamic here:

The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3.

...is exactly what I'd expect. That is, the bus has sufficient bandwidth that I'd expect the behavior to be indistinguishable all the way from 64 bytes -> 16+ KB. That entire range is basically "almost nothing". You can see that same dynamic in the storage post I mentioned earlier— at "bulk" scale, 16KB transfers were faster than 4KB transfers because the actual "time on PCI bus" was the same, but the DART was slower with smaller transfers.

Moving to here:

My question is if the problem has been really solved completely.

The word "completely" here is tricky. macOS is not designed around truly "guaranteed" I/O time, which means it's basically "always" possible to create circumstances where SOME kind of disruption will occur. As the most obvious example, it's hard to guarantee your transfer will occur in time if/when I pile enough other "stuff" on to the same bus. More practically, the "weak" link here tends to be shunting data to and from user space, not the PCI bus or your driver. There's not a lot your driver can do if user space ends up stalled for several minutes.

That reality is why things like real-time threads exist; however, those have their own limits as well. The real-time thread can and will continue firing exactly on schedule, but you'll still lose data if/when it can't shunt data off that thread... because the VM shortage that's stalling the system is exactly the same reason it can't allocate memory.

The practical answer here is to ensure your transfer cadence is long enough that the system can reliably service that cadence. I can't provide hard numbers for that (audio is not my core area and there isn't really a fixed value), but the general guidance is that the less often you transfer data, the better it is.
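One way to reason about cadence is to size the burst so that its deadline comfortably exceeds the worst latency observed so far. The sketch below does that with integer arithmetic. The helper `min_burst_frames` and the idea of a safety factor are assumptions for illustration only; as noted above, there is no fixed value that guarantees anything.

```cpp
#include <cassert>
#include <cstdint>

// Smallest number of frames per burst whose duration is at least
// `safety_factor` times the worst observed transfer latency.
// Assumption: the deadline of a burst equals the duration of its frames.
inline uint64_t min_burst_frames(uint64_t worst_latency_us,
                                 uint64_t safety_factor,
                                 uint64_t sample_rate_hz) {
    assert(safety_factor > 0 && sample_rate_hz > 0);
    const uint64_t scaled = worst_latency_us * safety_factor * sample_rate_hz;
    return (scaled + 999999) / 1000000;  // ceiling of scaled / 1e6 (us per second)
}
```

With the 400us spike reported in this thread and a 48kHz sample rate, `min_burst_frames(400, 1, 48000)` is 20 frames, and a (hypothetical) safety factor of 2 raises that to 39 frames. This only covers the single worst case measured so far; a longer stall would still miss the deadline.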

I should probably try to comment the called method one by one and check what is the game changer. Am I doing some stupidities?

I don't see any obvious issue. Even the issues around preparing memory and the DART are less of an issue if you're reusing memory (which is how audio is typically handled).

Are there other relevant methods I'm missing or some profile tools from the host system which I can use to track the system in the long term?

The default answer here is Instruments, though I'll admit that I haven't actually used it all that much with DriverKit. However, it should be able to show you your interrupt cadence, as well as what other activity is occurring around that window.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
