Hi,
Concerning your point of the 16KB, I know this is the page size. I can try to ask my DMA to produce such a burst. However, if I remember correctly, PCIe allows bursts of 4KB maximum, so I don't know if this will help. I can try. It’s worth studying better if such a large request can be asked in a MRr, or a division into sub-chunks is unavoidable.
So, the issue here isn't about the PCI bus, it's about how the DART manages mappings. The DART can do sub-page size mapping, but it's generally "easier" and faster when you're working in full page increments. That is, you're better off using one 16KB page vs. four 4KB pages, even though the final result is exactly the same.
Note that this ISN'T really about what you actually send to your PCI card. If you need to work in 4KB chunks (or any other weird size for that matter), then you can take that 16KB page and use simple math to subdivide the physical offsets. This post is an example of how the performance can vary.
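As a rough sketch of that "simple math", assuming you already have the single device-visible base address and length of the mapped buffer, the subdivision could look something like this. DmaChunk and subdivide() are hypothetical names used only for illustration; they are not part of DriverKit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one descriptor handed to the card.
struct DmaChunk {
    uint64_t busAddress;  // device-visible address of this chunk
    uint64_t length;      // chunk length in bytes
};

// Split one contiguous mapped range into chunks of at most chunkSize bytes.
// The final chunk is shorter when length is not a multiple of chunkSize.
std::vector<DmaChunk> subdivide(uint64_t baseAddress, uint64_t length, uint64_t chunkSize) {
    assert(chunkSize > 0);
    std::vector<DmaChunk> chunks;
    for (uint64_t offset = 0; offset < length; offset += chunkSize) {
        uint64_t remaining = length - offset;
        uint64_t thisLength = remaining < chunkSize ? remaining : chunkSize;
        chunks.push_back({baseAddress + offset, thisLength});
    }
    return chunks;
}
```

Calling subdivide(base, 16384, 4096) yields four chunks at base, base + 4096, base + 8192 and base + 12288. The mapping itself is still done once, for the full page.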
D2HSegmentsN = 1 // Single segment forced (so far)
No, not just so far. IODMACommand.PrepareForDMA() returns a segment count and an array of segments; however, that detail is effectively a vestigial appendage, not a useful feature. I have a post that describes what's going on in more detail here, but you're only going to ever get "1" segment back. I'd actually recommend that you check that segmentsCount==1 and simply terminate your driver if you get anything else, as a different value would imply significant enough architectural changes that "blindly" continuing is unwise.
Note that the underlying behavior here is effectively a fundamental side effect of the DART, not an accident or DriverKit-specific feature. Even within the kernel, it's not entirely clear to me how you'd get IODMACommand to generate multiple segments, and that's with a much broader set of memory descriptor functionality than DriverKit exposes.
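For what it's worth, the segment-count check I'm recommending could be sketched roughly like this. AddressSegment is a hypothetical stand-in for the segment type your driver gets back (an address plus a length), and hasExactlyOneSegment() is an invented helper name, so treat this as an illustration rather than actual DriverKit code.

```cpp
#include <cstdint>

// Hypothetical stand-in for a DMA segment (address + length), illustration only.
struct AddressSegment {
    uint64_t address;
    uint64_t length;
};

// Returns true only for the one layout this driver is written to handle:
// exactly one segment, with a valid segment array behind it.
bool hasExactlyOneSegment(uint32_t segmentsCount, const AddressSegment* segments) {
    return segmentsCount == 1 && segments != nullptr;
}
```

In a real driver you'd run a check like this right after preparing the DMA command and fail your start/setup path if it returns false, rather than trying to handle the extra segments.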
Followed the same procedure presented in the official Apple video for DMA bus mastering ("Modernize PCI and SCSI drivers with DriverKit").
That's a good reference, particularly since SCSIControllerDriverKit is basically designed around subdividing I/O buffers. However, I'll also note that the IOPCIFamily's implementation (including DriverKit) is open source, so there is another resource you may find useful.
Carried away by my enthusiasm, I tried an extreme test with 256 channels. The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3. But I’d like to eliminate the possibility of deadline misses entirely.
So, the critical factors here are:
1. How fast you're attempting to perform operations.
2. How much data you're trying to transfer.
...but all of the transfers you're describing are sufficiently small that #1 is the primary factor. Note the dynamic here:
The burst was of 8 or 4 samples, which indeed corresponds to 8KB or 4KB. The outcome seems very similar to case 3.
...is exactly what I'd expect. That is, the bus has sufficient bandwidth that I'd expect the behavior to be indistinguishable all the way from 64 bytes up to 16+ KB. That entire range is basically "almost nothing". You can see that same dynamic in the storage post I mentioned earlier: at "bulk" scale, 16KB transfers were faster than 4KB transfers because the actual "time on PCI bus" was the same, but the DART was slower with smaller transfers.
Moving to here:
My question is if the problem has been really solved completely.
The word "completely" here is tricky. macOS is not designed around truly "guaranteed" I/O time, which means it's basically "always" possible to create circumstances where SOME kind of disruption will occur. As the most obvious example, it's hard to guarantee your transfer will occur in time if/when I pile enough other "stuff" onto the same bus. More practically, the "weak" link here tends to be shunting data to and from user space, not the PCI bus or your driver. There's not a lot your driver can do if user space ends up stalled for several minutes.
That reality is why things like real-time threads exist; however, those have their own limits as well. The real-time thread can and will continue firing exactly on schedule, but you'll still lose data if/when it can't shunt data off that thread... because the VM shortage that's stalling the system is exactly the same reason it can't allocate memory.
The practical answer here is to ensure your transfer cadence is long enough that the system can reliably service that cadence. I can't provide hard numbers for that (audio is not my core area and there isn't really a fixed value), but the general guidance is that the less often you transfer data, the better it is.
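To put rough numbers on "cadence", here's a small sketch of the arithmetic. It assumes 4-byte samples (which is what makes 256 channels x 8 samples come out to the 8KB you quoted) and a 48 kHz sample rate, which is purely my assumption since you didn't state one. BurstTiming and burstTiming() are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct BurstTiming {
    uint64_t bytesPerBurst;     // payload moved by one DMA burst
    double periodMicroseconds;  // time between consecutive bursts
};

// One burst carries framesPerBurst samples for every channel, so the burst
// period is simply framesPerBurst divided by the sample rate.
BurstTiming burstTiming(uint32_t channels, uint32_t bytesPerSample,
                        uint32_t framesPerBurst, double sampleRateHz) {
    assert(sampleRateHz > 0.0);
    BurstTiming timing;
    timing.bytesPerBurst = uint64_t(channels) * bytesPerSample * framesPerBurst;
    timing.periodMicroseconds = framesPerBurst / sampleRateHz * 1e6;
    return timing;
}
```

Under those assumptions, an 8-sample burst is 8192 bytes and has to be serviced roughly every 167 microseconds, and a 4-sample burst halves both numbers. Increasing the frames per burst grows the transfer size (which the bus barely notices) while directly lengthening the window the system has to service each transfer.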
I should probably try to comment the called method one by one and check what is the game changer. Am I doing some stupidities?
I don't see any obvious issue. Even the issues around preparing memory and the DART are less of an issue if you're reusing memory (which is how audio is typically handled).
Are there other relevant methods I'm missing or some profile tools from the host system which I can use to track the system in the long term?
The default answer here is Instruments, though I'll admit that I haven't actually used it all that much with DriverKit. However, it should be able to show you your interrupt cadence, as well as what other activity is occurring around that window.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware