OpenZFS on FSKit — Proof of Concept

Question

Created 2w

Installing ZFSFSKit.appex → /Library/ExtensionKit/Extensions/
  Substituting real Mach-O (libtool wrapper → .libs/ZFSFSKit)
                                                                                
Installing zfs.fs → /Library/Filesystems/
  mount_zfs: Mach-O 64-bit executable arm64                                     
  Done.                                                                         
                                                                                
Signing (before pluginkit, so it sees a valid signature)...                     
Re-signing /Library/ExtensionKit/Extensions/ZFSFSKit.appex ad-hoc (no identity).
  Note: requires amfi_get_out_of_my_way=1 in boot-args.                         
  Team ID: ADHOC                                                                
/Library/ExtensionKit/Extensions/ZFSFSKit.appex: replacing existing signature   
                                                                                
Done. Signature:                                                                
Identifier=org.openzfsonosx.filesystems.zfs.fsext                               
Signature=adhoc                                                                 
TeamIdentifier=not set                                                          
Entitlements:                                                                   
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "https://www.apple.com/DTDs/PropertyList-1.0.dtd"><plist version="1.0"><dict><key>com.apple.application-identifier</key><string>ADHOC.org.openzfsonosx.filesystems.zfs.fsext</string><key>com.apple.developer.fskit.fsmodule</key><true/><key>com.apple.developer.team-identifier</key><string>ADHOC</string><key>com.apple.security.app-sandbox</key><true/></dict></plist>                         
                                                                                
Registering with pluginkit...                                                   
pluginkit -a done.                                                              
                                                                                
Restarting fskitd...      

# sudo pluginkit -v -m -p com.apple.fskit.fsmodule
+    org.openzfsonosx.filesystems.zfs.fsext((null))     6A12A41280FB-4190-B957-FA94DC89BB1E    2026-05-29 01:17:58 +0000       /Library/ExtensionKit/Extensions/ZFSFSKit.appex
                                                      

# sudo mkdir /Volumes/tank                                                                   
# sudo mount -F -t zfs /dev/disk4 /Volumes/tank                                              
# ls -la /Volumes/tank                                                                       
total 3                                                                         
drwxr-xr-x  3 lundman  staff    4 May 29 09:21 .                                
drwxr-xr-x  4 root     wheel  128 May 29 10:18 ..                               
-rw-r--r--  1 lundman  staff   11 May 29 09:21 file.txt                         
drwxr-xr-x  2 lundman  staff    2 May 29 09:21 HelloWorld                       
# cat /Volumes/tank/file.txt                                                                 
HelloWorld

Even though FSKit isn't quite ready, I built a proof-of-concept FSKit extension to understand what the migration path looks like. This post shares what we got working, specific technical findings that weren't documented, and the gaps we hit that would need Apple's attention for a production implementation.

Luckily, OpenZFS already compiles in userland for the "zdb" utility so not much work was required on that side.

There was a certain amount of desperation applied when we hit hurdles, so some of the assumptions we formed may not be correct. (We didn't go back and confirm each problem after things started working.)

Answer 1

lundman OP

2w

Technical Findings — Things That Weren't Obvious

These cost significant debugging time and aren't documented anywhere:

startCheckWithTask: must complete asynchronously

Calling [task didCompleteWithError:nil] synchronously inside startCheckWithTask: causes fskitd to receive "task completed" before "task started" over XPC.

fskitd rejects this with FSKitErrorDomain Code=27503 "Task didn't start yet" and never spawns the activate instance. The fix is to dispatch the completion with even a 1ms delay so FSKit can send the "started" notification first.
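To make the ordering concrete, here is a toy model in plain C. It is not FSKit code: host_t, start_check, run_check and the single-slot deferred queue are all hypothetical names, and the only value taken from the post is the error code 27503. It assumes the framework reports "started" only after the extension's callback has returned, which matches the behaviour described above but is an assumption, not documented fact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ERR_TASK_NOT_STARTED 27503  /* code quoted in the post */

/* Hypothetical stand-in for the receiving side (fskitd in the post). */
typedef struct {
    bool started_seen;  /* "task started" has arrived */
    bool completed;     /* "task completed" was accepted */
    int  last_error;    /* 0, or ERR_TASK_NOT_STARTED */
} host_t;

static void host_recv_started(host_t *h) { h->started_seen = true; }

/* The host rejects a completion that arrives before "started". */
static void host_recv_completed(host_t *h) {
    if (!h->started_seen) {
        h->last_error = ERR_TASK_NOT_STARTED;
        return;
    }
    h->completed = true;
}

/* Hypothetical one-slot queue of work to run after the callback returns. */
typedef struct { void (*pending)(host_t *); } deferred_t;

/* Extension callback: complete inline (the bug) or defer (the fix). */
static void start_check(host_t *h, deferred_t *q, bool defer) {
    assert(h != NULL && q != NULL);
    if (defer)
        q->pending = host_recv_completed;
    else
        host_recv_completed(h);
}

/* Assumed framework order: callback, then "started", then deferred work. */
static int run_check(bool defer) {
    host_t h = {0};
    deferred_t q = {0};
    start_check(&h, &q, defer);
    host_recv_started(&h);
    if (q.pending != NULL)
        q.pending(&h);
    return h.completed ? 0 : h.last_error;
}
```

In this model run_check(false) yields 27503 and run_check(true) yields 0. In the real extension the deferral is the short delayed dispatch described above; the model only shows why an inline completion reaches the host first.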

app-sandbox=true is required in entitlements

ExtensionKit rejects the extension entirely without com.apple.security.app-sandbox=true, even though sandboxing a filesystem extension feels counterintuitive. This needs to be paired with com.apple.developer.fskit.fsmodule=true.

Container identifier must equal volume identifier for FSUnaryFileSystem

FSContainer.h documents this but it's easy to miss: for unary file systems, the container identifier passed in probeResource: must exactly match the volume identifier used when constructing the FSVolume. A mismatch causes loadResource: to fail with EAGAIN "unexpected container state."

What's Missing for Production

We see three hard blockers and several important-but-workable gaps.

Blocker 1: No management plane

Every ZFS management tool — zpool, zfs, zdb, zed, zinject — communicates with the ZFS engine via /dev/zfs ioctls (ZFS_IOC_*). Without an equivalent:

No pool create, destroy, import, export, scrub, resilver, or status
No snapshots, clones, send/receive, or bookmark management
No dataset property management
No ZFS event daemon (zed) for fault handling and auto-replacement

The need isn't specifically "ioctls" — it's a defined IPC contract between the FSKit-hosted ZFS engine and management tools. Whether that's a character device DEXT, an XPC service exposed by the extension, or a new FSKit API, the mechanism needs to exist. The management tools would be adapted to whatever is provided.

Question: Is there a recommended pattern for a filesystem extension to expose a management interface to non-sandboxed privileged tools?

Blocker 2: No virtual block device publication

ZFS Volumes (zvols) are block devices backed by the ZFS storage pool — used for VM disks, iSCSI targets, swap, and more.

The current kext implementation (IOBlockStorageDevice subclass) works because it creates a kernel service that IOKit matches into IOMedia → /dev/diskN.

There is no userspace equivalent.

DriverKit's IOUserBlockStorageDevice is the closest analog, but IOKit matching is hardware-triggered. There's no mechanism for a filesystem extension discovering a zvol inside a pool to say "please also publish a block device for this." A static DEXT can't know what zvols exist until the pool is imported.

Question: What is the intended path for a filesystem to publish virtual block devices — for example, a software RAID layer or a volume manager that needs to create block device nodes dynamically at runtime?

Blocker 3: N:M — multiple devices per pool, multiple datasets per mount

ZFS has a fundamental mismatch with FSKit's current resource model on two axes:

N devices → 1 pool. ZFS pools span multiple block devices: a 3-disk RAIDZ, a 4-disk mirror, etc. FSKit's probe/activate model is one FSBlockDeviceResource per activation. For a RAIDZ pool, all member device fds need to be available at spa_import time. There's no multi-resource activation concept, and no way for probing one member disk to "claim" the others. (Our PoC works only because we used a single-vdev file pool.)

1 pool → M mounts. A pool typically contains many datasets, each with its own mountpoint (tank, tank/home, tank/data, tank/vm). These should each appear as separately mounted volumes. FSUnaryFileSystem is explicitly one volume per activation — there's no mechanism for one pool import to produce multiple mounted volumes.

These likely need separate primitives: something like a multi-resource pool probe that coalesces member devices, and a dataset iterator that produces multiple FSVolume instances per pool activation.
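As a rough illustration of what "coalescing member devices" could mean, here is a minimal C sketch of the bookkeeping only. Everything in it is hypothetical: pool_table_t, pool_table_add and the idea that each probe yields a pool GUID plus an expected member count are assumptions for the example, not FSKit API and not how the PoC works. It also does not de-duplicate a device that is probed twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_POOLS 8

/* One partially assembled pool (hypothetical bookkeeping). */
typedef struct {
    uint64_t pool_guid;  /* pool identity read from a member's label */
    unsigned expected;   /* member count that label claims */
    unsigned seen;       /* members probed so far */
} pool_slot_t;

typedef struct {
    pool_slot_t slots[MAX_POOLS];
    size_t used;
} pool_table_t;

/*
 * Record one probed member device.
 * Returns true once every expected member of its pool has been seen,
 * false while members are still missing or if the table is full.
 */
static bool pool_table_add(pool_table_t *t, uint64_t guid, unsigned expected) {
    assert(t != NULL && expected > 0);
    for (size_t i = 0; i < t->used; i++) {
        if (t->slots[i].pool_guid == guid) {
            t->slots[i].seen++;
            return t->slots[i].seen >= t->slots[i].expected;
        }
    }
    if (t->used == MAX_POOLS)
        return false;
    t->slots[t->used] = (pool_slot_t){ guid, expected, 1 };
    t->used++;
    return expected == 1;  /* a single-device pool is complete at once */
}
```

With a zero-initialised table, the first two probes of a 3-disk pool return false and the third returns true, which is the point at which an import could be attempted. Today there is nowhere in the probe/activate flow to hang this kind of state, which is the gap described above.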

Question: Is multi-resource activation (multiple FSBlockDeviceResource objects for one filesystem instance) on the roadmap? Is there a pattern for one extension activation to produce multiple mount points?

Secondary Gaps (important but not immediate blockers)

ARC memory limits. ZFS's Adaptive Replacement Cache is designed to use a significant portion of system RAM. Sandbox memory limits constrain it. A declared "buffer cache" process role with higher memory entitlements would meaningfully improve performance.

Background operations. Pool scrub and vdev resilver (RAID rebuild) are long-running background I/O tasks essential for data integrity. There's no mechanism for an FSKit extension to run sustained background work while mounted. Resilver after a disk failure can't wait for user interaction.

No unified buffer cache integration. All reads go through the extension process with no kernel page cache sharing. mmap either goes through readFromFile: per page fault or doesn't work at all. This is significant for database and VM image workloads.

NFS/SMB re-export. ZFS is widely used as a NAS backend. Correctness of persistent file IDs, fsid stability, and server-side locking semantics under FSKit needs validation.

The code is open source at https://github.com/openzfsonosx/openzfs-fork/tree/FSKit

Jorgen Lundman
Co-Authored-By: Claude Sonnet 4.6

Answer 2

DTS Engineer OP

Apple

1w

Part 1:

Technical Findings — Things That Weren't Obvious

First off, thank you for all your comments! I'm a bit buried preparing for WWDC, but I do have some comments to pass along:

startCheckWithTask: must complete asynchronously

This is actually fairly common in our frameworks, but probably worth documenting better.

app-sandbox=true is required in entitlements

ExtensionKit rejects the extension entirely without com.apple.security.app-sandbox=true, even though sandboxing a filesystem extension feels counterintuitive. This needs to be paired with com.apple.developer.fskit.fsmodule=true.

This isn't so much saying "I'm sandboxed" as it's saying "I acknowledge that I will be executing inside a sandbox". ALL ExtensionKit extensions ARE sandboxed because part of what the extension point's execution environment defines... is the sandbox environment that the extension will run in.

I think part of the confusion here is that we commonly use the term "sandboxed" to mean "an app/process that can't do as much as it used to be able to" when what the system ACTUALLY means is more like "a process that's opted into the system used to explicitly define what it's allowed to do". Architecturally, the goal here is that ALL processes on the system should be sandboxed, as there's no reason why every process can't/shouldn't be able to declare "this is what I need to be able to do".

What's Missing for Production

We see three hard blockers and several important-but-workable gaps.

I have some specific comments below, but please file separate bugs on all of these and then post the bug numbers back here.

Blocker 1: No management plane ... Every ZFS management tool — zpool, zfs, zdb, zed, zinject — communicates with the ZFS engine via /dev/zfs ioctls (ZFS_IOC_*). ... Question: Is there a recommended pattern for a filesystem extension to expose a management interface to non-sandboxed privileged tools?

It's possible we might add something like this into FSKit in the future; however, I think you could actually do whatever you need to do today. Your FSKit extension's access to the broader system isn't THAT limited, so it should be able to use either of the two "standard" options:

XPC -> This is generally our "preferred" option, since it's somewhat higher performance and allows for a richer set of data transfer options (for example, sharing memory between processes). However, it may not always be usable depending on the relationship between processes and the limits the system imposes on Mach port connections.
UNIX domain sockets -> As long as any two processes have write access to the same file system location, they can communicate using a UNIX domain socket, which means this basically works "everywhere".
Your file system (bonus option) -> It's more complicated when you're building on an existing file system, but your FSKit extension does have full control over its own file system. Nothing prevents you from designating a particular location/file name/etc. as "special" and then using I/O to that target as a control interface instead of a standard I/O path.

Blocker 2: No virtual block device publication ... Question: What is the intended path for a filesystem to publish virtual block devices — for example, a software RAID layer or a volume manager that needs to create block device nodes dynamically at runtime?

At this point, I don't think there is any great solution for this, as you're basically talking about creating a disk image device/driver, and our only current support for that involves a KEXT. However, one slightly odd option that might work today is to use hdiutil with the "diskimage-class=CRawDiskImage" option (see this thread for an example of that). I'm not sure how this works within ZFS, but if you can present a file object to "the system", then that option will make hdiutil generate a completely new dev node which uses that file as its final I/O target.

Blocker 3: N:M — multiple devices per pool, multiple datasets per mount ... Question: Is multi-resource activation (multiple FSBlockDeviceResource objects for one filesystem instance) on the roadmap? Is there a pattern for one extension activation to produce multiple mount points?

So, I can't talk about our future plans in any concrete way, but this is a clear limitation of the current architecture which the FSKit team is definitely aware of. As the most obvious issue, this is a problem for APFS for largely the same reasons as ZFS.

If you REALLY wanted to get something working "today", I think both of these could TECHNICALLY be implemented using "CRawDiskImage" and some cleverness, but I suspect the performance would be bad enough that I'm not sure this is really a viable approach.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 3

DTS Engineer OP

Apple

1w

Part 2:

ARC memory limits. ZFS's Adaptive Replacement Cache is designed to use a significant portion of system RAM. Sandbox memory limits constrain it. A declared "buffer cache" process role with higher memory entitlements would meaningfully improve performance.

I confess, I haven't actually looked into this. What's the limit you’re actually hitting and how much memory do you think you need/want? Also, what API are you reading with and have you tried using metadataRead(into:startingAt:length:)? I'm not sure if it will help or not (it depends on what limit you’re actually hitting), but part of the reason there are two APIs is that the metadata APIs change how memory ownership ("the kernel" vs. "your process") is "accounted" for, which can reduce your process's "perceived" footprint.

Background operations. Pool scrub and vdev resilver (RAID rebuild) are long-running background I/O tasks essential for data integrity. There's no mechanism for an FSKit extension to run sustained background work while mounted.

I'm not sure what you mean here. As long as your volume is mounted, you should be able to create threads and do whatever you want. Have you tried that? What's going wrong?

No unified buffer cache integration. All reads go through the extension process with no kernel page cache sharing. mmap either goes through readFromFile: per page fault or doesn't work at all. This is significant for database and VM image workloads.

Not exactly. I don't think the problem here is a lack of UBC integration, it's that things like LVM (logical volume management) and encryption mean that you can't really use the UBC the way it was intended. Your FSKit extension IS using the UBC, as it's the underlying data source used to return data from FSBlockDeviceResource. In a simpler file system, you'd then use FSVolumeKernelOffloadedIOOperations to shift all of the actual I/O out of your extension, letting the kernel directly interact with the final disk I/O target.

The problem is that this approach doesn't work when you need to modify the bytes being written before they actually reach the disk. What you actually need/want here is an I/O layer UNDERNEATH the UBC so that the UBC can perform its normal work on the "real" data but you can modify it before it actually hits the disk.

As it happens, this problem already exists without FSKit - it's why CoreStorage, APFS, and (I think) the ZFS kernel implementation all have both an IOKit IOMedia driver as well as their VFS driver. The VFS driver sits above the UBC feeding final data to user space, while the IOMedia driver handles final data modification before it reaches the disk.

Unfortunately, we don't have a great solution for that today, as it's basically another variant of the disk image issue I referenced above. Theoretically, you might be able to use two FSKit extensions and CRawDiskImage to recreate the same layered architecture (see #2 above), but I'm not confident performance would be all that great. In terms of the broader system, I think it might be possible to create a virtual I/O device using SCSIControllerDriverKit and UserGetDataBuffer or UserProcessBundledParallelTasks, but the work involved would be trifficult[1] and, again, I'm not sure performance would be great.

[1] "Tricky and difficult", Alexander J. Elliott

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 4

lundman OP

4d

Kevin, thank you for the detailed responses — this is exactly the kind of clarity we were hoping for. Let me work through your points.

On app-sandbox=true

The reframe is helpful. We were reading "sandbox" as "locked down" rather than "opt-in capability declaration." That makes much more sense architecturally. The practical follow-on question: what entitlements exist today for block device ioctls (DKIOCGETBLOCKSIZE, DKIOCGETBLOCKCOUNT)? We currently work around their absence with a path→size registry, but if we can simply declare the entitlement, that's the right fix. Similarly for any IPC socket path we'd use for the management layer.

On the management plane

Good news: we already implemented a UNIX socket approach — a zfsd management daemon that receives ZFS_IOC_* requests over a socket, paired with a libzfs_core transport that connects to it instead of opening /dev/zfs. The open question is whether the sandboxed extension can create and bind a socket at a path reachable by privileged tools outside the sandbox (e.g., /var/run/zfs/zfsd.sock). If that's permitted without a special entitlement, this blocker is resolved for us.
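For readers following along, the general shape of such a transport can be sketched in C as below. This is a minimal illustration under stated assumptions, not the zfsd code: mgmt_hdr_t, mgmt_listen, mgmt_connect, mgmt_send and mgmt_recv are hypothetical names, the header layout is invented for the example, and the cmd field merely stands in for a ZFS_IOC_* number. It uses only standard AF_UNIX socket calls and does not retry on EINTR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical wire header: one ioctl-style command plus payload length. */
typedef struct {
    uint32_t cmd;  /* stands in for a ZFS_IOC_* number */
    uint32_t len;  /* bytes of payload that follow the header */
} mgmt_hdr_t;

static int fill_addr(struct sockaddr_un *sa, const char *path) {
    assert(path != NULL);
    if (strlen(path) >= sizeof(sa->sun_path))
        return -1;  /* path does not fit in sun_path */
    memset(sa, 0, sizeof(*sa));
    sa->sun_family = AF_UNIX;
    strcpy(sa->sun_path, path);
    return 0;
}

/* Daemon side: bind and listen at path. Returns a listening fd or -1. */
static int mgmt_listen(const char *path) {
    struct sockaddr_un sa;
    if (fill_addr(&sa, path) != 0) return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    unlink(path);  /* drop a stale socket left by a previous run */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0 ||
        listen(fd, 4) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Tool side: connect to the daemon. Returns a connected fd or -1. */
static int mgmt_connect(const char *path) {
    struct sockaddr_un sa;
    if (fill_addr(&sa, path) != 0) return -1;
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

static int write_all(int fd, const void *buf, size_t n) {
    const char *p = buf;
    while (n > 0) {
        ssize_t r = write(fd, p, n);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

static int read_all(int fd, void *buf, size_t n) {
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0) return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Send one request: header first, then len bytes of payload. */
static int mgmt_send(int fd, uint32_t cmd, const void *payload, uint32_t len) {
    mgmt_hdr_t h = { cmd, len };
    if (write_all(fd, &h, sizeof(h)) != 0) return -1;
    return len ? write_all(fd, payload, len) : 0;
}

/* Receive one request into buf (capacity cap). Fails if it does not fit. */
static int mgmt_recv(int fd, uint32_t *cmd, void *buf, uint32_t cap,
                     uint32_t *len) {
    mgmt_hdr_t h;
    if (read_all(fd, &h, sizeof(h)) != 0 || h.len > cap) return -1;
    *cmd = h.cmd;
    *len = h.len;
    return h.len ? read_all(fd, buf, h.len) : 0;
}
```

A daemon would call mgmt_listen, accept() connections and loop on mgmt_recv; a tool would call mgmt_connect and mgmt_send. Whether the sandboxed extension is allowed to bind at a path such as /var/run/zfs/zfsd.sock is the open question here, and the sketch does not answer it.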

Your third option — using the filesystem itself as a control channel — is interesting for in-band dataset operations, but circular for pool-level work (create, import, destroy) that must happen before any dataset is mounted.

On zvols

The hdiutil/CRawDiskImage suggestion could work for VM disk image consumers (Parallels, VMware) if the extension can present zvol I/O as a file-like object. Two follow-up questions:

Is IOUserBlockStorageDevice in DriverKit a viable path for dynamically publishing block devices from userspace? Could an FSKit extension coordinate with a companion DriverKit extension to publish zvols when a pool is imported?
Is the hdiutil path accessible from within the extension's sandbox, or does it require coordination with a process outside?

On N:M

The APFS observation is the most useful validation we could have asked for. Apple knows about it.

On ARC memory limits

Fair point — we were speculating. We haven't actually hit a limit yet. The concern was that ARC on a 64 GB system typically uses 20–30 GB, and we were anticipating that a sandboxed process holding that much memory would attract attention from the system.

The metadataRead suggestion is interesting. We're currently using the standard read path. If metadata reads are accounted to the kernel rather than the process, that could meaningfully reduce ARC's apparent footprint. We'll revisit with actual numbers when we get there. Is there documentation on what memory limits a sandboxed FSKit extension actually operates under?

On background operations

We were speculating here too — haven't tried it. Good to know threads are unrestricted while mounted. We'll come back if we actually hit a problem.

On UBC / buffer cache

Thank you for the correction — that's a much clearer framing. The issue isn't UBC access; it's that ZFS needs to sit below the UBC as well as above it, and FSKit currently only provides the top layer. The two-layer architecture you described (IOMedia below UBC for transformation + VFS above for the POSIX layer) is exactly what the kext implements today.

You mentioned SCSIControllerDriverKit with UserGetDataBuffer/UserProcessBundledParallelTasks as a potential path for a virtual I/O transformation layer. Is that a supported and documented route for this use case, even if difficult? Or is it more theoretical? If it's genuinely viable, we'd rather know that now than assume the door is closed.

Action items

We'll file separate bugs for the three hard blockers (management plane, zvols, N:M multi-device/multi-dataset) and post the numbers back here.

But at the end of the day, we wanted to test out the fit with FSKit and it is rather pleasing we got it to do "something at all", which is quite hopeful if'en Apple does decide to remove KEXT support. Shiny.

Answer 5

rottegift OP

3d

Thanks for the superb answer, Kevin!

I'll just touch on the memory issue:

I confess, I haven't actually looked into this. What's the limit you’re actually hitting and how much memory do you think you need/want?

We mostly use IOMallocAligned(), and then manage chunks within the kext mostly using the vmem and kmem caching originated in OpenSolaris, adapted to a system where there is a parent memory system using alloc/free style calls rather than a very large range of physical pages. (This was the starting point for zfsonlinux, too). We would like as much as a knowledgeable user expecting to do large amounts of cacheable I/O might want to dedicate. Ideally, it would be fully automatic and dynamically reactive to the amount of otherwise-unused memory on the system and the amount of cache-serviced I/O actually observed.

In the case of APFS inside a zvol for example, consumers of files in the APFS filesystem will make use of the UBC just fine. This performs surprisingly well in two ways: firstly, ARC and other ZFS caching hides a lot of drive-seek latencies from APFS; and secondly, because APFS caches data in the UBC, the number of reads into the ARC for "double-cached" data largely vanishes, and consequently the ARC copy is more likely to stay off the frequently-read part of the cache and thus more eligible for replacement.

What ARC caches in APFS-in-zvol is morally equivalent to sets of disk blocks (from the APFS perspective) -- there is no knowledge of individual APFS files or other objects.

However, for datasets (by far the most common use of ZFS everywhere) UBC is not engaged, and I think that's the focus of lundman's comments on UBC. Here what is cached is ranges of a "DMU" object, which is roughly equivalent to ranges of an ordinary file (or file-like directory or other metadata).

Multiple read(2)s of a part of a file in a dataset are served by caches within the kext, and we jump through a couple of additional hoops when a file is mmap()ed. Writes (including msync()s) are also cached (after compression and checksumming and other processing), and aggregated.

(Experimental plumbings between the UBC and the ARC using UPLs mostly ended up having poor performance, although that was many years ago. One current blocker that has arisen since then is that the modern ARC typically retains mostly compressed data, with short-lived caching of uncompressed data. (The data is almost always stored compressed on the underlying media, and what's cached is what's read off the media). Extra copies of uncompressed data in the UBC, or decompressing data in the kext to hand to the UBC to generate and keep a compressed copy of its own, seems a little wasteful.)

Because we starve all the other kernel clients of RAM if we go over some threshold (e.g. 60 GiB on a 512 GiB M3 Mac Studio) we have some defensive capping well below that.

We do give memory back via IOFreeAligned() if the ARC shrinks because of user intervention or automatically if memory pressure is detected. Since the project goes back to (and still runs on) old Intel Mac Minis, where page table shootdowns and older kernel-internal memory management made bursts of many small frees expensive, we mostly exchange fairly large chunks of memory with the kernel. Because of the nature of a hierarchy of slab allocators, that risks pinning more memory in the kext than we might want, or through another lens, makes us less reactive to memory pressure (or a manual reduction in footprint) than I'd like.

Maybe on newer Apple Silicon machines with newer kernels with Apple's many interesting memory management innovations, allocating/freeing smaller chunks of memory might be less noticeable. Additionally, if plumbing between UBC and the kext's caches were "easier", we could hold on to less memory.

("Easier" is mostly about correctness in plumbing between existing code and UPLs given the restricted interface for the latter and the need to handle files which are potentially mmap()ed and having read()/write() done on them simultaneously. Think of a file on which some application is doing ordinary read()/write() when mds and friends come along to index it.)

Finally, for completeness, there are minor uses of other IO alloc/free calls like IOMallocType() too, and apparently two corner cases where kmalloc() is used.

Sean Doran
Not coauthored by any intelligence and it probably shows

Answer 6

DTS Engineer OP

Apple

3d

The practical follow-on question: what entitlements exist today for block device ioctls (DKIOCGETBLOCKSIZE, DKIOCGETBLOCKCOUNT)?

I'm not sure what you mean here. Having said that, the short answer is that I'm not aware of any entitlements relevant to this "area" of the system.

The open question is whether the sandboxed extension can create and bind a socket at a path reachable by privileged tools outside the sandbox (e.g., /var/run/zfs/zfsd.sock). If that's permitted without a special entitlement, this blocker is resolved for us.

First, just to clarify, "privilege" (that is, running as root) as such doesn't actually matter here. Your FSKit extension can basically "talk to" whoever it wants (particularly through a socket), so any protection you're doing needs to be something "you" implement, not the system. I haven't played around much with what's actually possible, however:

Your FSKit extension has access to an app group container as well as its own and any tool you write can have access to the same location.
If you really wanted it "somewhere else", the file access entitlement should let you make any location you want accessible.

https://developer.apple.com/library/archive/documentation/Miscellaneous/Reference/EntitlementKeyReference/Chapters/AppSandboxTemporaryExceptionEntitlements.html#//apple_ref/doc/uid/TP40011195-CH5-SW7

There are a lot of edge cases and details involved, but you could basically make almost anything work. If/when you're actually ready to implement something, I'd suggest doing that in a new thread that's JUST focused on the codesigning and IPC issues.

Your third option — using the filesystem itself as a control channel — is interesting for in-band dataset operations, but circular for pool-level work (create, import, destroy) that must happen before any dataset is mounted.

The overarching issue here is that, given the current capabilities of FSKit, there isn't really any way to fully implement ZFS in a way that directly matches its kernel implementation. I can think of a lot of different ways to "project" ZFS into macOS (for example, creating a "management" volume that exports pools to other targets), but a lot of this is ultimately going to come down to making your own choices about how this should work.

Moving on:

Is IOUserBlockStorageDevice in DriverKit a viable path for dynamically publishing block devices from userspace?

I'll be coming back to DriverKit shortly, but no, IOUserBlockStorageDevice won't work. It's something like a "side car" class which allows commands to be sent directly to the device. It isn't really on the I/O path, which is what you'd actually need.

Is the hdiutil path accessible from within the extension's sandbox, or does it require coordination with a process outside?

I haven't actually tested, but I suspect it's directly accessible. We also just announced DiskImageKit, which might be useful.

https://developer.apple.com/documentation/DiskImageKit

That leads to here:

The issue isn't UBC access; it's that ZFS needs to sit below the UBC as well as above it, and FSKit currently only provides the top layer. The two-layer architecture you described (IOMedia below UBC for transformation + VFS above for the POSIX layer) is exactly what the kext implements today.

So, if you haven't already, please file a bug asking us to create a solution for this, then post the bug number back here. I don't know if this is an issue we'll address, but the "best" option here would likely be an API that actually addressed these issues.

You mentioned SCSIControllerDriverKit with UserGetDataBuffer/UserProcessBundledParallelTasks as a potential path for a virtual I/O transformation layer. Is that a supported and documented route for this use case, even if difficult? Or is it more theoretical?

All of the above. It's theoretical in the sense that I'm not aware of anyone having used this approach and that this obviously wasn't what SCSIControllerDriverKit was designed for. However, it's also not really "secret" or even particularly obscure. More to the point, I haven't been able to think of any reason why it wouldn’t work.

If it's genuinely viable, we'd rather know that now than assume the door is closed.

That's a bit of an open question. No one has ever asked for the entitlement to be used for this and, to be honest, I only really thought of this approach a day or two ago, so it hasn't been given that much consideration.

HOWEVER, writing this has led to a new thought as well. Starting here:

In the case of APFS inside a zvol for example, consumers of files in the APFS filesystem will make use of the UBC just fine. This performs surprisingly well in two ways: firstly, ARC and other ZFS caching hides a lot of drive-seek latencies from APFS; and secondly, because APFS caches data in the UBC, the number of reads into the ARC for "double-cached" data largely vanishes, and consequently the ARC copy is more likely to stay off the frequently-read part of the cache and thus more eligible for replacement.

Have you considered replicating this basic architecture for data volumes as well? That is:

The device initially mounts with a "pools volume" (pondscape volume?), which exposes individual raw files for each pool.
Individual pools are mounted by pointing hdiutil at the pool file.
Your "data volume" file system driver matches and presents the actual data volume.

I don't know enough about ZFS's format to know how well this would work, but the ideal here is that the data being "exported" out of the pool file is exported as unencrypted data suitable for immediate "presentation". That then allows your "data volume" driver to use kernel I/O, taking advantage of the UBC and the performance gain provided by kernel offload.

I honestly don't know what the results of that will be, but it's possible that will give you reasonably good performance with a relatively straightforward implementation.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware