Kernel panic when using fclonefileat from ES

Question

Created 1w

Hi, I am developing an instant snapshot backup solution for macOS using Endpoint Security. We have stumbled upon a kernel panic when using the "fclonefileat" API.

We are seeing a kernel panic on customer machines when attempting to clone a file during an ES sync callback:

panic(cpu 0 caller 0xfffffe002c495508): "apfs_io_lock_exclusive : Recursive exclusive lock attempt" @fs_utils.c:435

I have symbolized the backtrace, and it points at the clone operation:

apfs_io_lock_exclusive
apfs_clone_internal
apfs_vnop_clonefile

I made a minimal repro that boils down to the following operations:

apfs_crash_stress - launches worker threads that write to resource forks

static void *rsrc_write_worker(void *arg)
{
  int id = (int)(long)arg;
  char buf[8192];
  long n = 0;
  
  fill_pattern(buf, sizeof(buf), 'W' + id);
  
  while (n < ITERATION_LIMIT) {
    int file_idx = n % NUM_SOURCE_FILES;
    int fd = open(g_src_rsrc[file_idx], O_WRONLY | O_CREAT, 0644);
    if (fd >= 0) {
      off_t off = ((n * 4096) % RSRC_DATA_SIZE);
      pwrite(fd, buf, sizeof(buf), off);
      if ((n & 0x7) == 0)
        fsync(fd);
      
      close(fd);
    } else {
      setxattr(g_src[file_idx], "com.apple.ResourceFork",
               buf, sizeof(buf), 0, 0);
    }
    
    n++;
  }
  printf("[rsrc_wr_%d] done (%ld ops)\n", id, n);
  return NULL;
}

apfs_crash_es - simple ES client that is cloning the file (error checking omitted for brevity)

static std::string volfsPath(uint64_t devId, uint64_t vnodeId)
{
  return "/.vol/" + std::to_string(devId) + "/" + std::to_string(vnodeId);
}

static void cloneAndScheduleDelete(const std::string& sourcePath, dispatch_queue_t queue, uint64_t devId, uint64_t vnodeId)
{
  struct stat st;
  if (stat(sourcePath.c_str(), &st) != 0 || !S_ISREG(st.st_mode))
    return;

  int srcFd = open(sourcePath.c_str(), O_RDONLY);
  const char* cloneDir = "/Users/admin/Downloads/_clone";
  mkdir(cloneDir, 0755);

  const char* filename = strrchr(sourcePath.c_str(), '/');
  filename = filename ? filename + 1 : sourcePath.c_str();
  
  std::string cloneFilename = std::string(filename) + ".clone." + std::to_string(time(nullptr)) + "." + std::to_string(getpid());
  std::string clonePath = std::string(cloneDir) + "/" + cloneFilename;

  fclonefileat(srcFd, AT_FDCWD, clonePath.c_str(), 0);
  {
    dispatch_after(dispatch_time(DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC), queue, ^{
      if (unlink(clonePath.c_str()) == 0)
      {
        LOG("Deleted clone: %s", clonePath.c_str());
      }
      else
      {
        LOG("Failed to delete clone: %s", clonePath.c_str());
      }
    });
  }
  close(srcFd);
}

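static const es_file_t* file(const es_message_t* msg)
{
  switch (msg->event_type)
  {
    case ES_EVENT_TYPE_AUTH_OPEN:
      return msg->event.open.file;
    case ES_EVENT_TYPE_AUTH_EXEC:
      return msg->event.exec.target->executable;
    case ES_EVENT_TYPE_AUTH_RENAME:
      return msg->event.rename.source;
    case ES_EVENT_TYPE_AUTH_UNLINK:
      // UNLINK is subscribed to and flagged for cloning, so it needs a target too.
      return msg->event.unlink.target;
    default:
      return nullptr;
  }
}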

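int main(void)
{
  // esQueue, cloneQueue, respond() and LOG() are not shown in this excerpt.
  es_client_t* cli = nullptr;
  auto ret = es_new_client(&cli, ^(es_client_t* client, const es_message_t* msgc)
  {
    if (msgc->process->is_es_client)
    {
      es_mute_process(client, &msgc->process->audit_token);
      return respond(client, msgc, true);
    }

    // The message is only guaranteed valid inside the handler, so retain it
    // before handing it to another queue (es_retain_message, macOS 11+).
    es_retain_message(msgc);
    dispatch_async(esQueue, ^{
      bool shouldClone = false;
      if (msgc->event_type == ES_EVENT_TYPE_AUTH_OPEN)
      {
        auto& ev = msgc->event.open;
        // fflag holds kernel FFLAGS (FREAD/FWRITE), not the O_* access mode.
        // O_WRONLY has the same value as FREAD, so testing it would also
        // match read-only opens.
        if (ev.fflag & (FWRITE | O_TRUNC | O_APPEND))
        {
          shouldClone = true;
        }
      }
      else if (msgc->event_type == ES_EVENT_TYPE_AUTH_UNLINK || msgc->event_type == ES_EVENT_TYPE_AUTH_RENAME)
      {
        shouldClone = true;
      }

      if (shouldClone)
      {
        if (auto f = ::file(msgc))
          cloneAndScheduleDelete(f->path.data, cloneQueue, f->stat.st_dev, f->stat.st_ino);
      }

      respond(client, msgc, true);
      es_release_message(msgc);
    });
  });
  LOG("es_new_client -> %d", ret);
  if (ret != ES_NEW_CLIENT_RESULT_SUCCESS)
    return 1;

  es_event_type_t events[] = {
    ES_EVENT_TYPE_AUTH_OPEN,
    ES_EVENT_TYPE_AUTH_EXEC,
    ES_EVENT_TYPE_AUTH_RENAME,
    ES_EVENT_TYPE_AUTH_UNLINK,
  };
  es_subscribe(cli, events, sizeof(events) / sizeof(*events));

  // Keep the process alive so the handler can run; dispatch_main() never returns.
  dispatch_main();
}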

Create 2 terminal sessions and run the following commands:

 % sudo ./apfs_crash_es
 % sudo ./apfs_crash_stress ~/Downloads/test/

The machine will very quickly panic due to an APFS deadlock. I expect that no userspace syscall should be able to cause a kernel panic. It looks like a bug in the APFS implementation and requires a fix on the XNU/kext side.

We were able to reproduce this issue on macOS 26.3.1 and 15.6.1, on both Intel and ARM machines.

Here is the panic string:

panic_string.txt

Source code without the Xcode project:

apfs_crash_es.cpp

apfs_crash_stress.cpp

The full Xcode project and the full panic log are available at https://www.icloud.com/iclouddrive/0f215KkZffPOTLpETPo-LdaXw#apfs%5Fcrash%5Fes

Answered by DTS Engineer in 881596022

Answer 1

DTS Engineer OP

Apple

6d

Accepted Answer

Recommended

> Hi, I am developing an instant snapshot backup solution for macOS using Endpoint Security. We have stumbled upon a Kernel Panic when using the "fclonefileat" API.

Yes, this is a known bug (r.161340058). More specifically, the "..namedfork/rsrc" construct is a longstanding part of the system, but it's not widely used and it allows the expression of operations that aren't entirely "coherent" (like cloning an object that isn't a file). Have you tested this on macOS 26.4? I believe the issue should now be fixed.

Having said that:

> I expect that no userspace syscall should be able to cause a kernel panic. It looks like a bug in the APFS implementation and requires a fix on the XNU/kext side.

...this is a good example of one of my long-standing warnings, namely that us fixing kernel bugs doesn't mean your code will work. The panic above is actually fixed in the VFS layer by having "fclonefileat" fail, just like clonefile does when presented with the same scenario.

As a more general comment, I'd suggest adding a check for the "..namedfork/" construct in the ES client and handling these files as a special case. Typically, that means stripping the suffix off the path so that you're working with the actual file. I'd actually add two checks, one that is specific for "..namedfork/rsrc" and another for "..namedfork/<anything else>" to watch for anything "new". As far as I'm aware, we ONLY use "..namedfork/rsrc" and I'm not sure the more general syntax even works, but it's an easy edge case to watch for.
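The two checks described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the names `ForkKind`, `ForkInfo`, and `classifyNamedFork` are hypothetical, and the sketch assumes the fork reference always appears as a trailing "/..namedfork/<name>" suffix on the path reported by ES.

```cpp
#include <string>

// Hypothetical classification of a path reported by an ES event.
enum class ForkKind { None, Resource, Other };

struct ForkInfo {
  ForkKind kind;
  std::string basePath;  // path with any trailing "/..namedfork/<name>" removed
};

// Detects a trailing "/..namedfork/<name>" suffix and strips it so the
// caller can work with the actual file. "rsrc" is reported separately from
// any other fork name, so unexpected names can be logged or rejected.
inline ForkInfo classifyNamedFork(const std::string& path)
{
  static const std::string kMarker = "/..namedfork/";
  const std::string::size_type pos = path.rfind(kMarker);
  if (pos == std::string::npos)
    return {ForkKind::None, path};

  const std::string forkName = path.substr(pos + kMarker.size());
  // Only treat this as a fork reference if the fork name is the last component.
  if (forkName.empty() || forkName.find('/') != std::string::npos)
    return {ForkKind::None, path};

  const ForkKind kind = (forkName == "rsrc") ? ForkKind::Resource : ForkKind::Other;
  return {kind, path.substr(0, pos)};
}
```

In an ES client like the one in the question, the result would presumably be used before the clone call: clone `basePath` for `ForkKind::Resource`, and log or skip `ForkKind::Other`, since that case is not expected to occur.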

Finally, I'll note that the bug above has raised a more general concern about "..namedfork" being handled in a consistent way. There is now some ongoing work (r.161084094) to standardize how these checks are handled in the VFS layer, and it's likely that will eventually lead to some small changes in API behavior. Notably, "truncate" currently fails on named forks and it's likely that will at some point start working (like ftruncate already does).

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware

Answer 2

deniskopyrin OP

6d

Hi Kevin,

I have tested it on macOS 26.4. The bug no longer reproduces with my test samples. Thank you for the fix!

Regarding the "fclonefileat" API, my expectations are actually met in macOS 26.4 now: I do expect the kernel to be capable of rejecting invalid "clonefile" calls and failing the operation. It is tricky to predict the future, as there might be new regular file types that are not clonable or that become clonable, so I always delegate the clonability decision to the kernel. My userland code is there to handle the errors gracefully, although the fallback in the rsrc case needs to be reconsidered... That is on my side though; I do not see any further enhancements needed in the kernel currently.

For backwards compatibility, I will include explicit, careful checks for resource forks and paths with "..namedfork/" as you have suggested.

I appreciate your help!

Answer 3

DTS Engineer OP

Apple

5d

> Regarding "fclonefileat" API, my expectations are actually met in macOS 26.4 now - I do expect that the Kernel is capable of rejecting invalid calls of "clonefile" API and does fail the operation. It is tricky to predict the future as there might be new regular file types that might not be clonable/will become clonable, so I always delegate the file clonability decision to the Kernel.

So, stepping back for a moment, my larger advice here in terms of "future proofing" would actually be to avoid the direct call to clonefile and instead call copyfile. While kernel panic'ing is a relatively "extreme" failure, clonefile is a low-level syscall, and I'd fully expect that there can/are/will be edge cases where clonefile fails and copyfile does not. The difference here is a matter of API role - clonefile is only designed to do the specific task of sending a specific request to the VFS system, while copyfile() is designed around the broader "job". If you specifically want to fail in any case where a file can't be cloned, then passing COPYFILE_CLONE_FORCE will do that.
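As a rough illustration of that suggestion, a copyfile-based backup call might look like the sketch below. The helper name `backupFile` and its `requireClone` parameter are hypothetical. The flag behavior in the comments follows the copyfile(3) man page as I understand it, so it is worth verifying there. The non-Apple branch exists only so the sketch compiles elsewhere.

```cpp
#include <cerrno>
#if defined(__APPLE__)
#include <copyfile.h>
#endif

// Hypothetical helper: back up `src` to `dst`, preferring an APFS clone.
// Returns 0 on success, or -1 with errno set on failure.
inline int backupFile(const char* src, const char* dst, bool requireClone)
{
#if defined(__APPLE__)
  // COPYFILE_CLONE: try to clone, fall back to a regular copy if cloning fails.
  // COPYFILE_CLONE_FORCE: clone, or return an error if cloning is not possible.
  // Per copyfile(3), both imply COPYFILE_EXCL, so `dst` must not already exist.
  const copyfile_flags_t flags = requireClone ? COPYFILE_CLONE_FORCE : COPYFILE_CLONE;
  return copyfile(src, dst, nullptr, flags);
#else
  (void)src; (void)dst; (void)requireClone;
  errno = ENOTSUP;  // copyfile(3) is a Darwin API
  return -1;
#endif
}
```

With `requireClone` set to false, the call should succeed even where a clone is not possible, at the cost of a full data copy. With it set to true, the failure is reported through errno, much as with a direct clonefile call.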

> For backwards compatibility, I will include explicit, careful checks for resource forks and paths with "..namedfork/" as you have suggested.

This is not just an issue of backward compatibility. Anytime you see the "..namedfork" construct, that means the "file path" you're interacting with is NOT in fact "a file" but is actually some "sub-part" of that file. To be perfectly frank, that's a DEEPLY weird situation and I won't guarantee that our "general" APIs will actually handle it properly. Keep in mind that this syntax was NEVER intended to be a "general" API pattern, but was ONLY created to simplify file copying. That is, the ONLY legitimate way to use "..namedfork" was to use it to read the entire contents of one resource fork and then write those full contents to a different resource fork. ANY other use was inherently a programmer error. We've now repurposed the resource fork for a new use (file system compressed files) but the general "rule" remains the same.

History Lesson for the Curious

As originally envisioned (classic Mac OS in the later 1980s) in HFS (and then later HFS+), any file could consist of two "forks": the "data fork" (which is where "normal" file data existed) and the "resource fork". The resource fork had its own dedicated access API ("the Resource Manager"), which managed the data in the resource fork as a sort of small database. Critically, the data stored in the resource fork was often a mix of "system data" (like custom icons) and app data (like localization resources), so using any API other than the Resource Manager to access that data was a great way to disrupt all sorts of things, not just your app. Direct fork level access DID exist, but only so apps could blind copy data "in bulk", NOT so they could actually look at or interpret that data.

That dynamic is also why we felt comfortable repurposing the resource fork for file system level compression: we'd already deprecated and disabled the Resource Manager, which meant we'd entirely removed the ONLY API apps were allowed to use when actually interpreting that data.

In any case, all of that's why I think you should be careful when manipulating "..namedfork" paths. It's an API pattern that was never particularly well known and, more importantly, was only ever used in very narrow and specific ways. Ignoring that is very likely to cause problems, exactly like what happened here.

__
Kevin Elliott
DTS Engineer, CoreOS/Hardware
