Render advanced 3D graphics and perform data-parallel computations using graphics processors using Metal.

Posts under Metal tag

117 Posts

Post

Replies

Boosts

Views

Activity

关于我使用Swift和Metal制作的神经网络引擎
我今年18岁。没有机器学习背景,没有上过大学,高中都没去上,没有导师。 几天前我盯着一张纸发呆。突然想:为什么计算机神经网络一定要是2D的?可以模拟生物吗?为什么一定要在平面上算?如果多个平面,岂不是翻倍?如果把六张纸想象成一个魔方,六个面各自承载神经元,八条体对角线变成新的通信通道会怎么样? 我真的很喜欢折腾这些,然后我立刻制定了详细计划,使用AI工具辅助写下了第一个 kernel。跑崩了。我又重新想了一下,和qq群友分享了我的目标,又写。又崩。连续几十次。没有 PyTorch,没有 TensorFlow,没有 CUDA。只有Swift和Metal。因为我的电脑显卡是AMD Vega 64,没装任何框架辅助,因为我想明白最底层的运行方式是什么原理。 这就是CubeNN。 ##以下为AI的详细解答,内容与架构改动太多,我在这里一次讲不清楚 它是什么 一个用魔方几何作为计算架构的神经网络引擎。 标准 Transformer: 把数据排成一行,O(n²) 地互相看 CubeNN: 把数据分布在 14 个面上,只在该看的地方看 6 个标准面 → 块稀疏注意力(粗看全局 + 细看局部) 8 个 X 面对角线 → 跨面信息桥(不做 Attention,只负责传递) 每轮:6 面算 → 投影到 8 X 面 → 上采样精炼 → 融合回 6 面 最关键的是 Cube Cascade——一个树+链级联推理: 树阶段: 1 个魔方 spawn 8 个 → 8 个 spawn 64 个 → 73 个并行探索 GPU 上同时跑,选最优路径 链阶段: 最优叶子无限深度精炼 3-5 步收敛,方差提升 ~7% 怎么实现的 纯 Swift + Metal。零依赖。零框架。 // 大致代码就是这些 import Metal import Foundation let device = MTLCreateSystemDefaultDevice()! let library = try! device.makeLibrary(filepath: "cube_nn.metallib") // ...12 个 GPU kernel,12,000 次 dispatch 关键技术决策: 单 Command Buffer:整个树阶段 73 个魔方的全部 kernel dispatch 打包进一个 CB,0 次 CPU-GPU 同步 Pipeline State 缓存:编码从 1022ms 降到 42ms Buffer 偏移:所有 73 个魔方的 14 个面存进一个连续 buffer,kernel 通过 buffer(15) 传偏移量 FP16:N≥64 时半精度提速 21% 性能 ##经过测试,但是因设备差异可能不准确,仅参考 AMD Radeon RX Vega 64 (2017 年显卡, 14nm, 295W): 规模 神经元 魔方数 耗时 N=32 6,144 73 (树) 435ms N=64 24,576 21 (树) 817ms N=128 98,304 1 116ms N=32 全连接 Attention 每层 201M FLOP → CubeNN 块稀疏 370K FLOP (544× 减少) N=128 全连接需要 32GB 显存(物理上不存在)→ CubeNN 用 192KB N=256 全连接需要 2.2T FLOP → CubeNN 52M FLOP (42,300× 减少) 代码体积:161KB。 对比 PyTorch 的 800MB。 我经历了什么 这个项目最困难的不是写 kernel,是在没有任何人告诉我"能不能做"的情况下,靠反复试错找到路。 第一次试图跑 73 个魔方,GPU 直接 hang 了。花了 3 天定位到是 Command Buffer 堆叠过多。 改了 single encoder 方案,又碰上 SIGILL——Metal 不允许 makeBuffer(length: 0),B=0 时创建了零长度 buffer。 想用 threadgroup memory 做 kernel fusion,结果跨 threadgroup 读不到数据,才明白 LDS 是 per-group 的。 N=64 的 FP16 要手动写 float↔half 转换函数,因为 macOS 11 上 Float16 类型被标为 unavailable。 每一次崩溃都教会我一个 Metal 的底层细节。没有人教我,但 Metal 的报错信息就是最好的老师。 为什么发在 Apple 开发者论坛 因为这是为苹果生态而生的项目。CubeNN 从头到尾只用了两个东西:Swift 和 Metal。它不需要移植就能跑在任何 Apple Silicon Mac 上(API兼容)。如果未来能把部分 kernel 映射到 Neural Engine,效率会再翻几倍。 我想问 Apple 的 Metal 工程师和 Core ML 团队: ** 有没有更好的 GPU 任务调度方式?**目前表现仍然欠佳(对于我这个完美主义者来说),可能改得有点乱了 有没有兴趣评估这个架构在 M4 上的表现? 我手里只有 Vega 64。M4 GPU + ANE方法 跑 CubeNN 会是什么效果? 源代码 ├── run.swift # 统一 CLI,参数化 N/B/depth ├── src/ │ ├── cube_nn.metal # FP16 kernel │ └── cube_nn_fp32.metal # FP32 kernel └── benchmarks/ # 实测数据 如果你读到了这里——谢谢你。一个门外汉靠痴狂的,纯粹到几乎是妄想的主意和Metal走到了这里。我懂的不是很多,如果这个架构有任何价值,我想让它变得更好。任何建议、批评、或者指教,都非常欢迎。
0
0
43
3d
_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights). the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer: import tensorflow as tf import numpy as np # Any shape works; larger K makes the error more obvious M, K, N = 64, 2048, 1024 W = tf.Variable(tf.random.normal([K, N])) b = tf.Variable(tf.random.normal([N])) x = tf.random.normal([M, K]) @tf.function def graph_fused(x): return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) @tf.function def graph_safe(x): return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct fused_out = graph_fused(x) # Metal graph mode = WRONG safe_out = graph_safe(x) # Metal graph mode = correct print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}") # ^ typically 30–80+ (WRONG) print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}") # ^ typically ~1e-5 (correct) Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
0
0
1.1k
2w
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets Summary The Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71) for the M5 Pro chip crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) when running LLM inference workloads with working sets as small as ~1.5GB, despite 24GB of unified memory being available and Apple Diagnostics confirming the hardware is fully functional. This affects multiple tools: MLX, llama.cpp (Metal backend), and native apps using Metal for inference. System Component Value Model MacBook Pro (Mac17,9) Chip Apple M5 Pro (applegpu_g17s) GPU Cores 16 RAM 24 GB LPDDR5 macOS 26.5 (25F71) Metal Metal 4 GPU Driver AGXMetalG17X 351.2 Xcode 26.5 (17F42) Reproduction MLX (Python) pip install mlx mlx-lm python -m mlx_lm.generate \ --model mlx-community/Qwen2.5-3B-Instruct-4bit \ --max-tokens 10 \ --prompt "Hello" Expected: Normal text generation Actual: Crash with: libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) llama.cpp brew install llama.cpp llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99 Expected: Fast GPU generation Actual: Process hangs indefinitely Test Results Tool Model Peak Memory Result MLX Qwen2.5-0.5B-4bit 0.36 GB ✅ Works MLX Qwen2.5-1.5B-4bit 0.98 GB ✅ Works MLX Qwen3-1.7B-4bit 1.01 GB ✅ Works MLX Qwen2.5-3B-4bit ~1.5 GB ❌ Metal OOM crash MLX Qwen3-4B-4bit ~2.1 GB ❌ Metal OOM crash MLX Qwen3-8B-4bit ~4.5 GB ❌ Metal OOM crash llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ❌ Hangs with GPU llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ✅ Works with CPU only Key Evidence Hardware is healthy — Apple Diagnostics passed all tests Basic Metal works — matmul, array ops work fine CPU inference works — llama.cpp with -ngl 0 runs correctly The error is NOT about actual memory exhaustion — kIOGPUCommandBufferCallbackErrorOutOfMemory means the kernel rejects the Metal memory commit, not that physical memory is full. The system reports 17.76GB available for Metal working set. Crash Log Extract Thread 31 Crashed: 0 libsystem_kernel.dylib __pthread_kill + 8 1 libsystem_pthread.dylib pthread_kill + 296 2 libsystem_c.dylib abort + 148 3 Metal MTLReportFailure.cold.1 + 48 4 Metal MTLReportFailure + 576 5 Metal -[_MTLCommandBuffer addCompletedHandler:] + 104 ... Exception Type: EXC_CRASH (SIGABRT) Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6 Related Issues ml-explore/mlx#3586 — Metal compiler regression on macOS 26.5 ml-explore/mlx#3534 — M5 float32 precision issue ml-explore/mlx#3568 — M5 random divergence ml-explore/mlx#3539 — Metal residency OOM (M4 Max) Request Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available). Happy to provide full crash logs, sysdiagnose archives, or run additional tests.
0
0
183
2w
MetalToolchain and auto updates...
Hello, I can understand why you do not ship the MetalToolchain with the default Xcode installation any more due to the relatively low usage and high download size. That said, every time Xcode runs an auto update it wipes MetalToolchain and breaks my local development build. It would be nice if the updates would be smart enough to honor the fact that. I have already run: "xcodebuild -downloadComponent MetalToolchain" and include that in the update, rather than deleting the module. Thanks, Chris
1
0
229
3w
Inexplicable Metal crash ever since iOS 26.5 beta 4
Hi all, I'm working on updating my audio visualizer app. I'm adding new visualizers based on Metal 4 compute shaders. They worked in iOS 26.4 and iOS 26.5 up until beta 3. However, after that, the visualizers started crashing the phone and forcing a restart. On the latest version of iOS 26.5, the crash is still there. I submitted feedback, but haven't heard anything back just yet. I was wondering if others have faced this same issue, and if there are any workarounds. Here is my repo if you want to look at the code (forgive me if it's sloppy, I'm quite new to graphics programming and Metal): https://github.com/aabagdi/VisualMan/tree/main Thank you!
4
0
1.4k
3w
XPC Communication between Editor app and user-compiled code
Hello! I'm trying to implement an editor app (macOS) that allows the user to write code, which will be compiled and executed, showing the result in the editor window. Imagine it like SwiftUI previews, but the graphic output is created with Metal, not SwiftUI. I found that IOSurface can be used to share that kind of data over XPC, so I would not have to rely on the private NSRemoteView. However, I'm confused if it is, at all, possible for my editor app to connect to an XPC Service, that was NOT bundled with it (but compiled by it at runtime). I succeeded to launch an XPC service defined as: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>Label</key> <string>com.myteam.myproject.service</string> <key>MachServices</key> <dict> <key>com.myteam.myproject.service</key> <true/> </dict> <key>Program</key> <string>/Path/to/service/run_my_service.sh</string> </dict> </plist> But the call to let connection = NSXPCConnection(machServiceName: "com.myteam.myproject.service") let proxy = connection.remoteObjectProxyWithErrorHandler { error in continuation.resume(throwing: error) } as? MyServiceProtocol fails with "The connection to service named com.myteam.myproject.service was invalidated: Connection init failed at lookup with error 3 - No such process." I have added <key>com.apple.security.temporary-exception.mach-lookup.global-name</key> <array> <string>com.myteam.myproject.service</string> </array> to my entitlements. Since the tutorials I followed are quite old, I'm wondering if support for something like this was dropped at some point. Thanks for any advice!
6
0
644
4w
Possibilities of Overclocking Apple Silicon
I've been testing Apple Silicon devices in their desktop configurations on the Mac Studio and now retired Mac Pro and it seems like they're greatly bottlenecked by their clock speeds. For reference here's my testing results. Testing Results: Mac Studio M2 Max • 32GBs RAM • 30 core GPU • 1TB Storage CPU Utilization • 60% • 20W CPU Temperature • 47ºC GPU Utilization • 100% • 20W GPU Temperature • 55ºC Fan Speed • 50% Workload Duration • 2hrs Another point is that the clock speed on the M2 Max's CPU is 3.5 GHz and on the GPU it is 1.44 GHz at max performance. Which the Mac Studio has no trouble pushing. My question is how do I push those clock speeds higher? Cause 1.44 GHz at 55ºC is evidence for extensive headroom. I'm sure there are tools internally for testing the upper limits of the silicon, but it makes no sense why it would be set so low the Mac Studio is at no worries of melting. Is there any way to push the performance of my Mac Studio? FB22713867 - Possibilities of Overclocking Apple Silicon
1
0
322
May ’26
Metal, Vulkan, OpenGL & Godot
Greetings! I'm preparing to publish an app in Apple Store. It's a 2D Audio app made in Godot, already published in Google Store.. As we know, OpenGL is considered deprecated since iOS 12 / 2018 .. However given the current state of Metal, or Vulkan integration in Godot, and with the idea of bringing the Best possible experience on iOS.. I'm not completely sure what will be the best API to use as primary option.. -As good as Metal, or even Vulkan work in Godot; the fact of the matter is, each API has its strong and weak points.. -Metal: Native on iOS, fully compliant and supported. However it has two weak points: Initial Compilation Freeze - +5 sec. Performance Hit, (although negligible for final user) app uses 25% more CPU (on my iPhone 12). Battery drain? -Vulkan: In godot, Vulkan > MoltenVk > Metal More complex translation layer, but interestingly gives slightly better Performance than Metal.. Initial Compilation doesn't cause Freeze, because is lazy/delayed and performed while the app is starting. Uses 25% less CPU than Metal and gives slightly more stable Framerate. (iPhone 12) However, given the extra complexity it could be more prone to error, or Compatibility Problems, which are known and have been reported with older iOS devices (iPads come to mind..) Right? -OpenGL: No Initial Compilation Needed Max Performance, No CPU munch Universally supported, (in theory?) works Perfectly on my iPhone 12 with iOS 26.3 and 26.4.2 And all in all, gives the best Performance and user experience. -And that's pretty much the situation! Since the graphics API of choice, will have an effect and directly translate to User experience... what's then the best one? -This will be the first app I Publish on Apple Store, so as you can imagine I want to Comply with Apple as much as possible; and bring iOS users the best possible experience. However each one of the APIs seem to have a negative aspect.. Metal: 5sec Compilation Freeze Vulkan: Compatibility Problems? OpenGL: "Deprecated" In practical terms, right now, OpenGL gives the best Performance, and the best User Experience.. So what to do? -The Android version is published in Google Store in OpenGL Compat mode. Works perfectly. Even tho OpenGL has been Deprecated on iOS for 7+ years, it has survived all along, with no announced removal date from Apple. And it seems to work perfectly and be fully operational up to the latest iOS 26 version.. right? Maybe Apple is maintaining it for stability and compatibility reasons, even if they're no longer actively developing it? Butthee "deprecated" label sounds alarming, as if support could drop any day.. So what will be the best choice in this situation? -Will an app built primarily for OpenGL, (with Metal fallback) be Rejected right away in Apple Store? -Otoh Vulkan (via MoltenVK) could be a middle term solution, second best Performance, no Compilation Freeze.. But yeah, the Compatibility aspect is important; and while considerable improvements have been made in Godot's implementation, the current status or possible outcome is harder to assess.. Both Metal and OpenGL seem safer options in that sense..
5
0
1.1k
Apr ’26
LowLevelInstanceData & animation
AppleOS 26 introduces LowLevelInstanceData that can reduce CPU draw calls significantly by instancing. However, I have noticed trouble with animating each individual instance. As I wanted low-level control, I'm using a custom system and LowLevelInstanceData.replace(using:) to update the transform each frame. The update closure itself is extremely efficient (Xcode Instruments reports nearly no cost). But I noticed extremely high runloop time, reach around 20ms. Time Profiler shows that the CPU is blocked by kernel.release.t6401. I think it is caused by synchronization between CPU and GPU, however, as I am already using a MTLCommandBuffer to coordinate it, I don't understand why I am still seeing large CPU time.
3
0
956
Apr ’26
SCNTechnique clearColor Always Shows sceneBackground When Passes Share Depth Buffer
Problem Description I'm encountering an issue with SCNTechnique where the clearColor setting is being ignored when multiple passes share the same depth buffer. The clear color always appears as the scene background, regardless of what value I set. The minimal project for reproducing the issue: https://www.dropbox.com/scl/fi/30mx06xunh75wgl3t4sbd/SCNTechniqueCustomSymbols.zip?rlkey=yuehjtk7xh2pmdbetv2r8t2lx&st=b9uobpkp&dl=0 Problem Details In my SCNTechnique configuration, I have two passes that need to share the same depth buffer for proper occlusion handling: "passes": [ "box1_pass": [ "draw": "DRAW_SCENE", "includeCategoryMask": 1, "colorStates": [ "clear": true, "clearColor": "0 0 0 0" // Expecting transparent black ], "depthStates": [ "clear": true, "enableWrite": true ], "outputs": [ "depth": "box1_depth", "color": "box1_color" ], ], "box2_pass": [ "draw": "DRAW_SCENE", "includeCategoryMask": 2, "colorStates": [ "clear": true, "clearColor": "0 0 0 0" // Also expecting transparent black ], "depthStates": [ "clear": false, "enableWrite": false ], "outputs": [ "depth": "box1_depth", // Sharing the same depth buffer "color": "box2_color", ], ], "final_quad": [ "draw": "DRAW_QUAD", "metalVertexShader": "myVertexShader", "metalFragmentShader": "myFragmentShader", "inputs": [ "box1_color": "box1_color", "box2_color": "box2_color", ], "outputs": [ "color": "COLOR" ] ] ] And the metal shader used to display box1_color and box2_color with splitting: fragment half4 myFragmentShader(VertexOut in [[stage_in]], texture2d<half, access::sample> box1_color [[texture(0)]], texture2d<half, access::sample> box2_color [[texture(1)]]) { half4 color1 = box1_color.sample(s, in.texcoord); half4 color2 = box2_color.sample(s, in.texcoord); if (in.texcoord.x < 0.5) { return color1; } return color2; }; Expected Behavior Both passes should clear their color targets to transparent black (0, 0, 0, 0) The depth buffer should be shared between passes for proper occlusion Actual Behavior Both box1_color and box2_color targets contain the scene background instead of being cleared to transparent (see attached image) This happens even when I explicitly set clearColor: "0 0 0 0" for both passes Setting scene.background.contents = UIColor.clear makes the clearColor work as expected, but I need to keep the scene background for other purposes What I've Tried Setting different clearColor values - all are ignored when sharing depth buffer Using DRAW_NODE instead of DRAW_SCENE - didn't solve the issue Creating a separate pass to capture the background - the background still appears in the other passes Various combinations of clear flags and render orders Environment iOS/macOS, running with "My Mac (Designed for iPad)" Xcode 16.2 Question Is this a known limitation of SceneKit when passes share a depth buffer? Is there a workaround to achieve truly transparent clear colors while maintaining a shared depth buffer for occlusion testing? The core issue seems to be that SceneKit automatically renders the scene background in every DRAW_SCENE pass when a shared depth buffer is detected, overriding any clearColor settings. Any insights or workarounds would be greatly appreciated. Thank you!
1
0
954
Apr ’26
Cannot load .mtlpackage to MTLLibrary
After watching WWDC 2025 session "Combine Metal 4 machine learning and graphics", I have decided to give it a shot to integrate the latest MTL4MachineLearningCommandEncoder to my existing render pipeline. After a lot of trial and errors, I managed to set up the pipeline and have the app compiled. However, I am now stuck on creating a MTLLibrary with .mtlpackage. Here is the code I have to create a MTLLibrary according the WWDC session https://developer.apple.com/videos/play/wwdc2025/262/?time=550: let coreMLFilePath = bundle.path(forResource: "my_model", ofType: "mtlpackage")! let coreMLURL = URL(string: coreMLFilePath)! do { metalDevice.makeLibrary(URL: coreMLURL) } catch { print("error: \(error)") } With the above code, I am getting error: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid metal package" UserInfo={NSLocalizedDescription=Invalid metal package} What is the correct way to create a MTLLibrary with .mtlpackage? Do I see this error because the .mtlpackage I am using is incorrect? How should I go with debugging this? I'd really appreciate if I could get some help on this as I have been stuck with it for some time now. Thanks in advance!
1
0
593
Apr ’26
Can a compute pipeline be as efficient as a render pipeline for rasterization?
I'm new to graphics and game design and I just wanted to know if a compute pipeline could be as efficient as a render pipeline for rasterization and an explanation on how and why. Also is it possible to manually perform rasterization with a render pipeline as in manipulate individual pixel data in a metal texture yourself but do it with a render pipeline?
1
0
655
Apr ’26
GPTK 3 and D3DMetal issue with Modern Pipeline Creation
Death Stranding 2: On the Beach (v1.0.48.0, Steam) crashes during rendering initialization when running through CrossOver 26 with D3DMetal 3.0 on an Apple M2 Max Mac Studio running macOS Sequoia. The game successfully initializes Streamline, NVAPI, DLSS (Result::eOk), DLSSG (Result::eOk), Reflex, and XeSS — all subsystems report success. The crash occurs immediately after, during rendering pipeline creation, before the game reaches NXStorage initialization or window creation. Minidump analysis confirms the crash is an access violation (0xc0000005) at DS2.exe+0x67233d, writing to address 0x0. RAX=0x0 (null pointer being dereferenced), R12=0xFFFFFFFFFFFFFFFF (error/invalid handle return). The game appears to call a D3D12 API — likely CheckFeatureSupport or a pipeline state creation function — that D3DMetal acknowledges as supported but returns null or invalid data for. The game trusts the response and dereferences the null pointer. Two other Nixxes titles using the same engine and D3DMetal setup run without issue: Spider-Man 2 (~50 FPS) and Horizon Zero Dawn Remastered (~34 FPS). DS2 uses newer technology versions (DLSS 4, FSR 4, XeSS 2) and a newer DirectX 12 Agility SDK, which likely queries D3D12 features that D3DMetal does not yet fully implement. The crash also reproduces when D3DMetal reports as AMD vendor (1002) instead of NVIDIA (10de), crashing at the same executable offset, confirming it is a D3D12 feature reporting gap in D3DMetal rather than a vendor-specific issue. How To Reproduce Install Crossover 26+ on MacOS 26.4 Install Steam and download Death Stranding 2 Run Death Stranding 2 and check logs after crash in Documents\DEATH STRANDING 2 ON THE BEACH Feedback Requests FB22285513 — Game Porting Toolkit 3 issue with Modern Pipeline Creation
1
4
877
Apr ’26
Xcode26 Replay frame broken
Got a broken frame when using Xcode to capture a frame and replay it from a Unity game. It seems like the vertex buffer is broken; I see a bunch of "nan"s in the vertex buffer. However, the game displays correct when running, and it only happend when I upgrade my Xcode and iphone to Xcode26 and IOS26 ios26
1
0
466
Apr ’26
Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3
I want to address the missing or incomplete DirectX calls from D3DMetal and Game Porting Toolkit 3. These missing calls have in part caused issue with our porting process and we are reconsidering. Missing or Incomplete Calls DXGI_FEATURE_PRESENT_ALLOW_TEARING — IDXGIFactory5::CheckFeatureSupport — this calls has to do with how VSync is handled and some modern games require it to initialize. Currently D3DMetal return 0 maybe by design but most likely because it’s not integrated. Adding a stub that returns 1 can fix this. I’m my use case I simply Noped the check and forced it to continue. D3D12_FEATURE_D3D12_OPTIONS2.DepthBoundsTestSupported — this call is also not present. Which causes games to not initialize rendering. Thankfully this was fixed by once again skipping the check. But this is essential for water rendering. This could be one reason currently water is not rendering in our game. IDXGIOutput6::GetDesc1().ColorSpace — returns DXGI_COLOR_SPACE_RGB_FULL_G22_NONE_P709 (SDR) on external HDR compatible displays. We were able to fix this by forcing HDR to be enabled. It should return HDR support. These calls may exist but they need to be updated to return the correct values. Specifically for depth bound test you can reference MoltenVK which sets it up on top of Metal since it’s not a native feature. The water issue could be also an issue with how the shaders are compiled. But I’m unable to check because of the closed source nature of GPTK and its debuggers. What is a better way we can debug our game to see why the water isn’t rendering. Does D3DMetal have some debug options or something similar? Feedback Number FB22330617 - Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3 We hope these issues are resolved quickly because we were thinking of a simultaneous release with our Windows version, but we can't ship with such large bugs.
6
3
637
Apr ’26
Xcode Metal Capture crash when using MTLSamplerState
The sample code just draw a triangle and sample texture. both sample code can draw a correct triangle and sample texture as expected. there are no error message from terminal. Sample code using constexpr Sampler can capture and replay well. Sample code using a argumentTable to bind a MTLSamplerState was crashed when using Metal capture and replay on Xcode. Here are sample codes. Sample Code Test Environment: M1 Pro MacOS 26.3 (25D125) Xcode Version 26.2 (17C52) Feedback ID: FB22031701
2
0
486
Apr ’26
D3DMetal Extreme Over Synchronization Issues
Explanation Currently, D3DMetal’s GPU synchronization approach introduces significant compute overhead on the CPU. This specifically affects D3D12 games that use modern rendering pipelines on Apple Silicon. Specifically, I’ve tested Death Stranding 2 On the Beach for how it handles its rendering. And the results are extreme: frame times are suffering from a 42% decrease from synchronization. Although there are obviously other effects at play, such as the overhead introduced by Rosetta and Wine, both of them don’t introduce as much overhead as D3DMetal. This issue isn’t just specific to Death Stranding 2 On the Beach; most games running through D3DMetal suffer from this. Most games still seem to force synchronization to ~30 ms to reach the 30 fps amount. But it could be better with better synchronization, such as how DXMT handles it. Instead of doubling the work, it allows Metal to single-handedly track resource dependencies internally. This is in part due to the unfortunate bad mapping of D3D12 calls onto shared logic between D3D11 and D3D12. System M2 Max Mac Studio — 32 GBs — 30-core GPU macOS 26.4 Tahoe CrossOver 26.1 RC Death Stranding 2 On the Beach — Steam Assassin’s Creed Valhalla — Steam & Ubisoft Connect Thank you for your commitment. Another game that I recommend testing to really see this swell is Assassin’s Creed Valhalla. Feedback FB22426600 - D3DMetal Extereme Over Syncranization Issues
1
1
681
Apr ’26
Xcode_26 not compiling Metal project
Hello Xcode 26.0.1 (17A400) Missing some Metal components When building a program using Metal, it induces an unexpected error : “error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain Command CompileMetalFile failed with a nonzero exit code” Which terminates the build The fix given “xcodebuild -downloadComponent MetalToolchain” using sudo does not work Did someone find a work around or could resolve the issue? Many thanks Jean MacBook Air M4; macOS 26.0.1; Xcode 26.0.1
5
2
519
Mar ’26
How to load and draw texture with opacity in Metal
The background I'm finally working to convert my very old Mac kaleidoscope application, ScopeWorks, which was written in OpenGL and Objective-C, to a Multiplatform app in SwiftUI and Metal. I'm using the MetalKit MTKView class, wrapped for SwiftUI as an NSViewRepresentable or UIViewRepresentable. I then provide an MTKViewDelegate that provides a draw method. The draw method fetches the current render pass descriptor, creates a command buffer, sets up a render pipeline, and does its drawing. My renderer's makePipeline method looks like this: func makePipeline() { let library = device.makeDefaultLibrary() let pipelineDesc = MTLRenderPipelineDescriptor() pipelineDesc.vertexFunction = library?.makeFunction(name: "vertex_main") pipelineDesc.fragmentFunction = library?.makeFunction(name: "fragment_main") pipelineDesc.colorAttachments[0].pixelFormat = .bgra8Unorm pipeline = try! device.makeRenderPipelineState(descriptor: pipelineDesc) } And my shaders look like this: struct VertexOut { float4 position [[position]]; float2 texCoord; }; vertex VertexOut vertex_main(const device float2* position [[buffer(0)]], uint vid [[vertex_id]]) { VertexOut out; float2 pos = position[vid]; out.position = float4(pos, 0, 1); out.texCoord = pos * 0.5 + 0.5; // basic mapping return out; } fragment float4 fragment_main(VertexOut in [[stage_in]], texture2d<float> tex [[texture(0)]], constant float4& color [[buffer(1)]]) { constexpr sampler s(address::repeat, filter::linear); // float4 texColor = tex.sample(s, in.texCoord); // return texColor * color; float4 textureColor = {1, 2, 3, 4}; if (all(color == textureColor)) { return tex.sample(s, in.texCoord); } else { return color; } // Sample the texture directly — no color tint applied return tex.sample(s, in.texCoord); } The first part of my MTKViewDelegate's draw method looks like this: func draw(in view: MTKView) { guard let drawable = view.currentDrawable, let descriptor = view.currentRenderPassDescriptor, let pipeline = pipeline, let texture = texture else { return } let commandBuffer = commandQueue.makeCommandBuffer()! let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor)! encoder.setRenderPipelineState(pipeline) encoder.setFragmentTexture(texture, index: 0) descriptor.colorAttachments[0].clearColor = MTLClearColor(red: 0.0, green: 0, blue: 0, alpha: 1.0) // Draw six equilateral triangles forming the hexagon let radius: Float = 0.6 for i in 0..<6 { let angle = Float(i) * (.pi / 3) let cosA = cos(angle) let sinA = sin(angle) let nextA = Float(i+1) * (.pi / 3) let cosB = cos(nextA) let sinB = sin(nextA) let verts: [simd_float2] = [ simd_float2(0, 0), simd_float2(radius * cosA, radius * sinA), simd_float2(radius * cosB, radius * sinB) ] encoder.setVertexBytes(verts, length: MemoryLayout<simd_float2>.stride * 3, index: 0) // Tell the fragment shader to use the texture color. var textureColor: simd_float4 = simd_float4(1, 2, 3, 4) encoder.setFragmentBytes(&textureColor, length: MemoryLayout<SIMD4<Float>>.stride, index: 1) encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3) One of the things the existing app does is load PNG or TIFF images with an alpha channel, and then overlay parts of the image on top of themselves flipped, so you get interesting Moiré patterns in the lines in the resulting kaleidoscope. For now I'm working on a single sample image, loading it into a texture in Metal, and just rendering it as a hexagon and drawing lines for the triangles that make up the hexagon. (For now I'm using the vertex coordinates as the texture coordinates, so I get a hexagonal part of my texture rather than a single triangular part tessellated into a hexagon. I'll fix that later.) In both iOS and OS I set the clear color to black at the beginning of the draw function. The issue: The source image is mostly transparent, but with a lot of partly transparent pixels. Here's what it looks like in Photoshop, where you can see the transparent parts as a checkerboard pattern: (I tried to crop the original image to show the approximate part that I'm rendering in a hexagon, but it's not exact. Look for the same shapes in the different images to compare them.) When I render my hexagon in the Metal view in the iOS version of the app, it looks like it's forcing each pixel to fully opaque or fully transparent: And in the macOS version of the app, it seems to force ALL the pixels to opaque: I haven't shown all the setup code, because it's' a lot. Is there some rendering mode setup I'm missing in order to get it to draw the pixels into the output based on their opacity, including partial opacity?
2
0
1.1k
Mar ’26
CoreML regression between macOS 26.0.1 and macOS 26.1 Beta causing scrambled tensor outputs
We’ve encountered what appears to be a CoreML regression between macOS 26.0.1 and macOS 26.1 Beta. In macOS 26.0.1, CoreML models run and produce correct results. However, in macOS 26.1 Beta, the same models produce scrambled or corrupted outputs, suggesting that tensor memory is being read or written incorrectly. The behavior is consistent with a low-level stride or pointer arithmetic issue — for example, using 16-bit strides on 32-bit data or other mismatches in tensor layout handling. Reproduction Install ON1 Photo RAW 2026 or ON1 Resize 2026 on macOS 26.0.1. Use the newest Highest Quality resize model, which is Stable Diffusion–based and runs through CoreML. Observe correct, high-quality results. Upgrade to macOS 26.1 Beta and run the same operation again. The output becomes visually scrambled or corrupted. We are also seeing similar issues with another Stable Diffusion UNet model that previously worked correctly on macOS 26.0.1. This suggests the regression may affect multiple diffusion-style architectures, likely due to a change in CoreML’s tensor stride, layout computation, or memory alignment between these versions. Notes The affected models are exported using standard CoreML conversion pipelines. No custom operators or third-party CoreML runtime layers are used. The issue reproduces consistently across multiple machines. It would be helpful to know if there were changes to CoreML’s tensor layout, precision handling, or MLCompute backend between macOS 26.0.1 and 26.1 Beta, or if this is a known regression in the current beta.
8
4
2.4k
Mar ’26
关于我使用Swift和Metal制作的神经网络引擎
我今年18岁。没有机器学习背景,没有上过大学,高中都没去上,没有导师。 几天前我盯着一张纸发呆。突然想:为什么计算机神经网络一定要是2D的?可以模拟生物吗?为什么一定要在平面上算?如果多个平面,岂不是翻倍?如果把六张纸想象成一个魔方,六个面各自承载神经元,八条体对角线变成新的通信通道会怎么样? 我真的很喜欢折腾这些,然后我立刻制定了详细计划,使用AI工具辅助写下了第一个 kernel。跑崩了。我又重新想了一下,和qq群友分享了我的目标,又写。又崩。连续几十次。没有 PyTorch,没有 TensorFlow,没有 CUDA。只有Swift和Metal。因为我的电脑显卡是AMD Vega 64,没装任何框架辅助,因为我想明白最底层的运行方式是什么原理。 这就是CubeNN。 ##以下为AI的详细解答,内容与架构改动太多,我在这里一次讲不清楚 它是什么 一个用魔方几何作为计算架构的神经网络引擎。 标准 Transformer: 把数据排成一行,O(n²) 地互相看 CubeNN: 把数据分布在 14 个面上,只在该看的地方看 6 个标准面 → 块稀疏注意力(粗看全局 + 细看局部) 8 个 X 面对角线 → 跨面信息桥(不做 Attention,只负责传递) 每轮:6 面算 → 投影到 8 X 面 → 上采样精炼 → 融合回 6 面 最关键的是 Cube Cascade——一个树+链级联推理: 树阶段: 1 个魔方 spawn 8 个 → 8 个 spawn 64 个 → 73 个并行探索 GPU 上同时跑,选最优路径 链阶段: 最优叶子无限深度精炼 3-5 步收敛,方差提升 ~7% 怎么实现的 纯 Swift + Metal。零依赖。零框架。 // 大致代码就是这些 import Metal import Foundation let device = MTLCreateSystemDefaultDevice()! let library = try! device.makeLibrary(filepath: "cube_nn.metallib") // ...12 个 GPU kernel,12,000 次 dispatch 关键技术决策: 单 Command Buffer:整个树阶段 73 个魔方的全部 kernel dispatch 打包进一个 CB,0 次 CPU-GPU 同步 Pipeline State 缓存:编码从 1022ms 降到 42ms Buffer 偏移:所有 73 个魔方的 14 个面存进一个连续 buffer,kernel 通过 buffer(15) 传偏移量 FP16:N≥64 时半精度提速 21% 性能 ##经过测试,但是因设备差异可能不准确,仅参考 AMD Radeon RX Vega 64 (2017 年显卡, 14nm, 295W): 规模 神经元 魔方数 耗时 N=32 6,144 73 (树) 435ms N=64 24,576 21 (树) 817ms N=128 98,304 1 116ms N=32 全连接 Attention 每层 201M FLOP → CubeNN 块稀疏 370K FLOP (544× 减少) N=128 全连接需要 32GB 显存(物理上不存在)→ CubeNN 用 192KB N=256 全连接需要 2.2T FLOP → CubeNN 52M FLOP (42,300× 减少) 代码体积:161KB。 对比 PyTorch 的 800MB。 我经历了什么 这个项目最困难的不是写 kernel,是在没有任何人告诉我"能不能做"的情况下,靠反复试错找到路。 第一次试图跑 73 个魔方,GPU 直接 hang 了。花了 3 天定位到是 Command Buffer 堆叠过多。 改了 single encoder 方案,又碰上 SIGILL——Metal 不允许 makeBuffer(length: 0),B=0 时创建了零长度 buffer。 想用 threadgroup memory 做 kernel fusion,结果跨 threadgroup 读不到数据,才明白 LDS 是 per-group 的。 N=64 的 FP16 要手动写 float↔half 转换函数,因为 macOS 11 上 Float16 类型被标为 unavailable。 每一次崩溃都教会我一个 Metal 的底层细节。没有人教我,但 Metal 的报错信息就是最好的老师。 为什么发在 Apple 开发者论坛 因为这是为苹果生态而生的项目。CubeNN 从头到尾只用了两个东西:Swift 和 Metal。它不需要移植就能跑在任何 Apple Silicon Mac 上(API兼容)。如果未来能把部分 kernel 映射到 Neural Engine,效率会再翻几倍。 我想问 Apple 的 Metal 工程师和 Core ML 团队: ** 有没有更好的 GPU 任务调度方式?**目前表现仍然欠佳(对于我这个完美主义者来说),可能改得有点乱了 有没有兴趣评估这个架构在 M4 上的表现? 我手里只有 Vega 64。M4 GPU + ANE方法 跑 CubeNN 会是什么效果? 源代码 ├── run.swift # 统一 CLI,参数化 N/B/depth ├── src/ │ ├── cube_nn.metal # FP16 kernel │ └── cube_nn_fp32.metal # FP32 kernel └── benchmarks/ # 实测数据 如果你读到了这里——谢谢你。一个门外汉靠痴狂的,纯粹到几乎是妄想的主意和Metal走到了这里。我懂的不是很多,如果这个架构有任何价值,我想让它变得更好。任何建议、批评、或者指教,都非常欢迎。
Replies
0
Boosts
0
Views
43
Activity
3d
_FusedMatMul with [BiasAdd, Relu] produces incorrect results in graph mode on Metal GPU
When running a tf.function-traced graph on the Metal GPU, any operation that combines MatMul → BiasAdd → Relu (the fused pattern emitted by tf.keras.layers.Dense(activation='relu')) produces numerically incorrect output — errors on the order of tens of units, not floating-point noise. Eager mode on the same Metal GPU is correct. Graph mode forced to CPU (tf.config.set_visible_devices([], 'GPU')) is also correct. The bug is deterministic and data-independent (reproduces with random weights). the three-op combination of MatMul + BiasAdd + Relu trigger the error. Specifically: relu(tf.nn.bias_add(tf.matmul(x, W), b)) in graph mode on Metal is wrong, while relu(tf.matmul(x, W) + b) (using AddV2 instead of BiasAdd) is correct. Removing the Relu also makes the result correct — tf.nn.bias_add(tf.matmul(x, W), b) without a following Relu produces correct output at every shape tested. This points to the Metal plugin's fused _FusedMatMul kernel with fused_ops=[BiasAdd, Relu] as the culprit. Disabling the TF core grappler remapping pass (tf.config.optimizer.set_experimental_options({'remapping': False})) does not fix the issue, confirming that the fusion decision is made inside the Metal plugin's own kernel selection, below the TF core graph optimizer. The bug reproduces across all shapes tested (batch 4–200, inner dimension K 512–8192, output 128–2048) and is not specific to any particular weight values. A minimal reproducer: import tensorflow as tf import numpy as np # Any shape works; larger K makes the error more obvious M, K, N = 64, 2048, 1024 W = tf.Variable(tf.random.normal([K, N])) b = tf.Variable(tf.random.normal([N])) x = tf.random.normal([M, K]) @tf.function def graph_fused(x): return tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) @tf.function def graph_safe(x): return tf.nn.relu(tf.matmul(x, W) + b) # AddV2 instead of BiasAdd eager_ref = tf.nn.relu(tf.nn.bias_add(tf.matmul(x, W), b)) # eager = correct fused_out = graph_fused(x) # Metal graph mode = WRONG safe_out = graph_safe(x) # Metal graph mode = correct print(f"eager vs graph_fused (BiasAdd): {tf.reduce_max(tf.abs(eager_ref - fused_out)).numpy():.1f}") # ^ typically 30–80+ (WRONG) print(f"eager vs graph_safe (AddV2): {tf.reduce_max(tf.abs(eager_ref - safe_out)).numpy():.2e}") # ^ typically ~1e-5 (correct) Environment: TensorFlow 2.18.1, Keras 3.11.2, tensorflow-metal (latest as of 2026-05-26), Apple Silicon Mac. Impact: This breaks any Keras model that uses Dense(activation='relu') when called inside a tf.function or via SavedModel serving on the Metal GPU. Eager-mode inference is unaffected.
Replies
0
Boosts
0
Views
1.1k
Activity
2w
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets Summary The Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71) for the M5 Pro chip crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) when running LLM inference workloads with working sets as small as ~1.5GB, despite 24GB of unified memory being available and Apple Diagnostics confirming the hardware is fully functional. This affects multiple tools: MLX, llama.cpp (Metal backend), and native apps using Metal for inference. System Component Value Model MacBook Pro (Mac17,9) Chip Apple M5 Pro (applegpu_g17s) GPU Cores 16 RAM 24 GB LPDDR5 macOS 26.5 (25F71) Metal Metal 4 GPU Driver AGXMetalG17X 351.2 Xcode 26.5 (17F42) Reproduction MLX (Python) pip install mlx mlx-lm python -m mlx_lm.generate \ --model mlx-community/Qwen2.5-3B-Instruct-4bit \ --max-tokens 10 \ --prompt "Hello" Expected: Normal text generation Actual: Crash with: libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) llama.cpp brew install llama.cpp llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99 Expected: Fast GPU generation Actual: Process hangs indefinitely Test Results Tool Model Peak Memory Result MLX Qwen2.5-0.5B-4bit 0.36 GB ✅ Works MLX Qwen2.5-1.5B-4bit 0.98 GB ✅ Works MLX Qwen3-1.7B-4bit 1.01 GB ✅ Works MLX Qwen2.5-3B-4bit ~1.5 GB ❌ Metal OOM crash MLX Qwen3-4B-4bit ~2.1 GB ❌ Metal OOM crash MLX Qwen3-8B-4bit ~4.5 GB ❌ Metal OOM crash llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ❌ Hangs with GPU llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ✅ Works with CPU only Key Evidence Hardware is healthy — Apple Diagnostics passed all tests Basic Metal works — matmul, array ops work fine CPU inference works — llama.cpp with -ngl 0 runs correctly The error is NOT about actual memory exhaustion — kIOGPUCommandBufferCallbackErrorOutOfMemory means the kernel rejects the Metal memory commit, not that physical memory is full. The system reports 17.76GB available for Metal working set. Crash Log Extract Thread 31 Crashed: 0 libsystem_kernel.dylib __pthread_kill + 8 1 libsystem_pthread.dylib pthread_kill + 296 2 libsystem_c.dylib abort + 148 3 Metal MTLReportFailure.cold.1 + 48 4 Metal MTLReportFailure + 576 5 Metal -[_MTLCommandBuffer addCompletedHandler:] + 104 ... Exception Type: EXC_CRASH (SIGABRT) Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6 Related Issues ml-explore/mlx#3586 — Metal compiler regression on macOS 26.5 ml-explore/mlx#3534 — M5 float32 precision issue ml-explore/mlx#3568 — M5 random divergence ml-explore/mlx#3539 — Metal residency OOM (M4 Max) Request Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available). Happy to provide full crash logs, sysdiagnose archives, or run additional tests.
Replies
0
Boosts
0
Views
183
Activity
2w
MetalToolchain and auto updates...
Hello, I can understand why you do not ship the MetalToolchain with the default Xcode installation any more due to the relatively low usage and high download size. That said, every time Xcode runs an auto update it wipes MetalToolchain and breaks my local development build. It would be nice if the updates would be smart enough to honor the fact that. I have already run: "xcodebuild -downloadComponent MetalToolchain" and include that in the update, rather than deleting the module. Thanks, Chris
Replies
1
Boosts
0
Views
229
Activity
3w
Inexplicable Metal crash ever since iOS 26.5 beta 4
Hi all, I'm working on updating my audio visualizer app. I'm adding new visualizers based on Metal 4 compute shaders. They worked in iOS 26.4 and iOS 26.5 up until beta 3. However, after that, the visualizers started crashing the phone and forcing a restart. On the latest version of iOS 26.5, the crash is still there. I submitted feedback, but haven't heard anything back just yet. I was wondering if others have faced this same issue, and if there are any workarounds. Here is my repo if you want to look at the code (forgive me if it's sloppy, I'm quite new to graphics programming and Metal): https://github.com/aabagdi/VisualMan/tree/main Thank you!
Replies
4
Boosts
0
Views
1.4k
Activity
3w
XPC Communication between Editor app and user-compiled code
Hello! I'm trying to implement an editor app (macOS) that allows the user to write code, which will be compiled and executed, showing the result in the editor window. Imagine it like SwiftUI previews, but the graphic output is created with Metal, not SwiftUI. I found that IOSurface can be used to share that kind of data over XPC, so I would not have to rely on the private NSRemoteView. However, I'm confused if it is, at all, possible for my editor app to connect to an XPC Service, that was NOT bundled with it (but compiled by it at runtime). I succeeded to launch an XPC service defined as: <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>Label</key> <string>com.myteam.myproject.service</string> <key>MachServices</key> <dict> <key>com.myteam.myproject.service</key> <true/> </dict> <key>Program</key> <string>/Path/to/service/run_my_service.sh</string> </dict> </plist> But the call to let connection = NSXPCConnection(machServiceName: "com.myteam.myproject.service") let proxy = connection.remoteObjectProxyWithErrorHandler { error in continuation.resume(throwing: error) } as? MyServiceProtocol fails with "The connection to service named com.myteam.myproject.service was invalidated: Connection init failed at lookup with error 3 - No such process." I have added <key>com.apple.security.temporary-exception.mach-lookup.global-name</key> <array> <string>com.myteam.myproject.service</string> </array> to my entitlements. Since the tutorials I followed are quite old, I'm wondering if support for something like this was dropped at some point. Thanks for any advice!
Replies
6
Boosts
0
Views
644
Activity
4w
Possibilities of Overclocking Apple Silicon
I've been testing Apple Silicon devices in their desktop configurations on the Mac Studio and now retired Mac Pro and it seems like they're greatly bottlenecked by their clock speeds. For reference here's my testing results. Testing Results: Mac Studio M2 Max • 32GBs RAM • 30 core GPU • 1TB Storage CPU Utilization • 60% • 20W CPU Temperature • 47ºC GPU Utilization • 100% • 20W GPU Temperature • 55ºC Fan Speed • 50% Workload Duration • 2hrs Another point is that the clock speed on the M2 Max's CPU is 3.5 GHz and on the GPU it is 1.44 GHz at max performance. Which the Mac Studio has no trouble pushing. My question is how do I push those clock speeds higher? Cause 1.44 GHz at 55ºC is evidence for extensive headroom. I'm sure there are tools internally for testing the upper limits of the silicon, but it makes no sense why it would be set so low the Mac Studio is at no worries of melting. Is there any way to push the performance of my Mac Studio? FB22713867 - Possibilities of Overclocking Apple Silicon
Replies
1
Boosts
0
Views
322
Activity
May ’26
Metal, Vulkan, OpenGL & Godot
Greetings! I'm preparing to publish an app in Apple Store. It's a 2D Audio app made in Godot, already published in Google Store.. As we know, OpenGL is considered deprecated since iOS 12 / 2018 .. However given the current state of Metal, or Vulkan integration in Godot, and with the idea of bringing the Best possible experience on iOS.. I'm not completely sure what will be the best API to use as primary option.. -As good as Metal, or even Vulkan work in Godot; the fact of the matter is, each API has its strong and weak points.. -Metal: Native on iOS, fully compliant and supported. However it has two weak points: Initial Compilation Freeze - +5 sec. Performance Hit, (although negligible for final user) app uses 25% more CPU (on my iPhone 12). Battery drain? -Vulkan: In godot, Vulkan > MoltenVk > Metal More complex translation layer, but interestingly gives slightly better Performance than Metal.. Initial Compilation doesn't cause Freeze, because is lazy/delayed and performed while the app is starting. Uses 25% less CPU than Metal and gives slightly more stable Framerate. (iPhone 12) However, given the extra complexity it could be more prone to error, or Compatibility Problems, which are known and have been reported with older iOS devices (iPads come to mind..) Right? -OpenGL: No Initial Compilation Needed Max Performance, No CPU munch Universally supported, (in theory?) works Perfectly on my iPhone 12 with iOS 26.3 and 26.4.2 And all in all, gives the best Performance and user experience. -And that's pretty much the situation! Since the graphics API of choice, will have an effect and directly translate to User experience... what's then the best one? -This will be the first app I Publish on Apple Store, so as you can imagine I want to Comply with Apple as much as possible; and bring iOS users the best possible experience. However each one of the APIs seem to have a negative aspect.. Metal: 5sec Compilation Freeze Vulkan: Compatibility Problems? OpenGL: "Deprecated" In practical terms, right now, OpenGL gives the best Performance, and the best User Experience.. So what to do? -The Android version is published in Google Store in OpenGL Compat mode. Works perfectly. Even tho OpenGL has been Deprecated on iOS for 7+ years, it has survived all along, with no announced removal date from Apple. And it seems to work perfectly and be fully operational up to the latest iOS 26 version.. right? Maybe Apple is maintaining it for stability and compatibility reasons, even if they're no longer actively developing it? Butthee "deprecated" label sounds alarming, as if support could drop any day.. So what will be the best choice in this situation? -Will an app built primarily for OpenGL, (with Metal fallback) be Rejected right away in Apple Store? -Otoh Vulkan (via MoltenVK) could be a middle term solution, second best Performance, no Compilation Freeze.. But yeah, the Compatibility aspect is important; and while considerable improvements have been made in Godot's implementation, the current status or possible outcome is harder to assess.. Both Metal and OpenGL seem safer options in that sense..
Replies
5
Boosts
0
Views
1.1k
Activity
Apr ’26
LowLevelInstanceData & animation
AppleOS 26 introduces LowLevelInstanceData that can reduce CPU draw calls significantly by instancing. However, I have noticed trouble with animating each individual instance. As I wanted low-level control, I'm using a custom system and LowLevelInstanceData.replace(using:) to update the transform each frame. The update closure itself is extremely efficient (Xcode Instruments reports nearly no cost). But I noticed extremely high runloop time, reach around 20ms. Time Profiler shows that the CPU is blocked by kernel.release.t6401. I think it is caused by synchronization between CPU and GPU, however, as I am already using a MTLCommandBuffer to coordinate it, I don't understand why I am still seeing large CPU time.
Replies
3
Boosts
0
Views
956
Activity
Apr ’26
SCNTechnique clearColor Always Shows sceneBackground When Passes Share Depth Buffer
Problem Description I'm encountering an issue with SCNTechnique where the clearColor setting is being ignored when multiple passes share the same depth buffer. The clear color always appears as the scene background, regardless of what value I set. The minimal project for reproducing the issue: https://www.dropbox.com/scl/fi/30mx06xunh75wgl3t4sbd/SCNTechniqueCustomSymbols.zip?rlkey=yuehjtk7xh2pmdbetv2r8t2lx&st=b9uobpkp&dl=0 Problem Details In my SCNTechnique configuration, I have two passes that need to share the same depth buffer for proper occlusion handling: "passes": [ "box1_pass": [ "draw": "DRAW_SCENE", "includeCategoryMask": 1, "colorStates": [ "clear": true, "clearColor": "0 0 0 0" // Expecting transparent black ], "depthStates": [ "clear": true, "enableWrite": true ], "outputs": [ "depth": "box1_depth", "color": "box1_color" ], ], "box2_pass": [ "draw": "DRAW_SCENE", "includeCategoryMask": 2, "colorStates": [ "clear": true, "clearColor": "0 0 0 0" // Also expecting transparent black ], "depthStates": [ "clear": false, "enableWrite": false ], "outputs": [ "depth": "box1_depth", // Sharing the same depth buffer "color": "box2_color", ], ], "final_quad": [ "draw": "DRAW_QUAD", "metalVertexShader": "myVertexShader", "metalFragmentShader": "myFragmentShader", "inputs": [ "box1_color": "box1_color", "box2_color": "box2_color", ], "outputs": [ "color": "COLOR" ] ] ] And the metal shader used to display box1_color and box2_color with splitting: fragment half4 myFragmentShader(VertexOut in [[stage_in]], texture2d<half, access::sample> box1_color [[texture(0)]], texture2d<half, access::sample> box2_color [[texture(1)]]) { half4 color1 = box1_color.sample(s, in.texcoord); half4 color2 = box2_color.sample(s, in.texcoord); if (in.texcoord.x < 0.5) { return color1; } return color2; }; Expected Behavior Both passes should clear their color targets to transparent black (0, 0, 0, 0) The depth buffer should be shared between passes for proper occlusion Actual Behavior Both box1_color and box2_color targets contain the scene background instead of being cleared to transparent (see attached image) This happens even when I explicitly set clearColor: "0 0 0 0" for both passes Setting scene.background.contents = UIColor.clear makes the clearColor work as expected, but I need to keep the scene background for other purposes What I've Tried Setting different clearColor values - all are ignored when sharing depth buffer Using DRAW_NODE instead of DRAW_SCENE - didn't solve the issue Creating a separate pass to capture the background - the background still appears in the other passes Various combinations of clear flags and render orders Environment iOS/macOS, running with "My Mac (Designed for iPad)" Xcode 16.2 Question Is this a known limitation of SceneKit when passes share a depth buffer? Is there a workaround to achieve truly transparent clear colors while maintaining a shared depth buffer for occlusion testing? The core issue seems to be that SceneKit automatically renders the scene background in every DRAW_SCENE pass when a shared depth buffer is detected, overriding any clearColor settings. Any insights or workarounds would be greatly appreciated. Thank you!
Replies
1
Boosts
0
Views
954
Activity
Apr ’26
Cannot load .mtlpackage to MTLLibrary
After watching WWDC 2025 session "Combine Metal 4 machine learning and graphics", I have decided to give it a shot to integrate the latest MTL4MachineLearningCommandEncoder to my existing render pipeline. After a lot of trial and errors, I managed to set up the pipeline and have the app compiled. However, I am now stuck on creating a MTLLibrary with .mtlpackage. Here is the code I have to create a MTLLibrary according the WWDC session https://developer.apple.com/videos/play/wwdc2025/262/?time=550: let coreMLFilePath = bundle.path(forResource: "my_model", ofType: "mtlpackage")! let coreMLURL = URL(string: coreMLFilePath)! do { metalDevice.makeLibrary(URL: coreMLURL) } catch { print("error: \(error)") } With the above code, I am getting error: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid metal package" UserInfo={NSLocalizedDescription=Invalid metal package} What is the correct way to create a MTLLibrary with .mtlpackage? Do I see this error because the .mtlpackage I am using is incorrect? How should I go with debugging this? I'd really appreciate if I could get some help on this as I have been stuck with it for some time now. Thanks in advance!
Replies
1
Boosts
0
Views
593
Activity
Apr ’26
Can a compute pipeline be as efficient as a render pipeline for rasterization?
I'm new to graphics and game design and I just wanted to know if a compute pipeline could be as efficient as a render pipeline for rasterization and an explanation on how and why. Also is it possible to manually perform rasterization with a render pipeline as in manipulate individual pixel data in a metal texture yourself but do it with a render pipeline?
Replies
1
Boosts
0
Views
655
Activity
Apr ’26
GPTK 3 and D3DMetal issue with Modern Pipeline Creation
Death Stranding 2: On the Beach (v1.0.48.0, Steam) crashes during rendering initialization when running through CrossOver 26 with D3DMetal 3.0 on an Apple M2 Max Mac Studio running macOS Sequoia. The game successfully initializes Streamline, NVAPI, DLSS (Result::eOk), DLSSG (Result::eOk), Reflex, and XeSS — all subsystems report success. The crash occurs immediately after, during rendering pipeline creation, before the game reaches NXStorage initialization or window creation. Minidump analysis confirms the crash is an access violation (0xc0000005) at DS2.exe+0x67233d, writing to address 0x0. RAX=0x0 (null pointer being dereferenced), R12=0xFFFFFFFFFFFFFFFF (error/invalid handle return). The game appears to call a D3D12 API — likely CheckFeatureSupport or a pipeline state creation function — that D3DMetal acknowledges as supported but returns null or invalid data for. The game trusts the response and dereferences the null pointer. Two other Nixxes titles using the same engine and D3DMetal setup run without issue: Spider-Man 2 (~50 FPS) and Horizon Zero Dawn Remastered (~34 FPS). DS2 uses newer technology versions (DLSS 4, FSR 4, XeSS 2) and a newer DirectX 12 Agility SDK, which likely queries D3D12 features that D3DMetal does not yet fully implement. The crash also reproduces when D3DMetal reports as AMD vendor (1002) instead of NVIDIA (10de), crashing at the same executable offset, confirming it is a D3D12 feature reporting gap in D3DMetal rather than a vendor-specific issue. How To Reproduce Install Crossover 26+ on MacOS 26.4 Install Steam and download Death Stranding 2 Run Death Stranding 2 and check logs after crash in Documents\DEATH STRANDING 2 ON THE BEACH Feedback Requests FB22285513 — Game Porting Toolkit 3 issue with Modern Pipeline Creation
Replies
1
Boosts
4
Views
877
Activity
Apr ’26
Xcode26 Replay frame broken
Got a broken frame when using Xcode to capture a frame and replay it from a Unity game. It seems like the vertex buffer is broken; I see a bunch of "nan"s in the vertex buffer. However, the game displays correct when running, and it only happend when I upgrade my Xcode and iphone to Xcode26 and IOS26 ios26
Replies
1
Boosts
0
Views
466
Activity
Apr ’26
Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3
I want to address the missing or incomplete DirectX calls from D3DMetal and Game Porting Toolkit 3. These missing calls have in part caused issue with our porting process and we are reconsidering. Missing or Incomplete Calls DXGI_FEATURE_PRESENT_ALLOW_TEARING — IDXGIFactory5::CheckFeatureSupport — this calls has to do with how VSync is handled and some modern games require it to initialize. Currently D3DMetal return 0 maybe by design but most likely because it’s not integrated. Adding a stub that returns 1 can fix this. I’m my use case I simply Noped the check and forced it to continue. D3D12_FEATURE_D3D12_OPTIONS2.DepthBoundsTestSupported — this call is also not present. Which causes games to not initialize rendering. Thankfully this was fixed by once again skipping the check. But this is essential for water rendering. This could be one reason currently water is not rendering in our game. IDXGIOutput6::GetDesc1().ColorSpace — returns DXGI_COLOR_SPACE_RGB_FULL_G22_NONE_P709 (SDR) on external HDR compatible displays. We were able to fix this by forcing HDR to be enabled. It should return HDR support. These calls may exist but they need to be updated to return the correct values. Specifically for depth bound test you can reference MoltenVK which sets it up on top of Metal since it’s not a native feature. The water issue could be also an issue with how the shaders are compiled. But I’m unable to check because of the closed source nature of GPTK and its debuggers. What is a better way we can debug our game to see why the water isn’t rendering. Does D3DMetal have some debug options or something similar? Feedback Number FB22330617 - Missing DirectX Calls for Tearing and Depth Bound Test in D3DMetal and GPTK 3 We hope these issues are resolved quickly because we were thinking of a simultaneous release with our Windows version, but we can't ship with such large bugs.
Replies
6
Boosts
3
Views
637
Activity
Apr ’26
Xcode Metal Capture crash when using MTLSamplerState
The sample code just draw a triangle and sample texture. both sample code can draw a correct triangle and sample texture as expected. there are no error message from terminal. Sample code using constexpr Sampler can capture and replay well. Sample code using a argumentTable to bind a MTLSamplerState was crashed when using Metal capture and replay on Xcode. Here are sample codes. Sample Code Test Environment: M1 Pro MacOS 26.3 (25D125) Xcode Version 26.2 (17C52) Feedback ID: FB22031701
Replies
2
Boosts
0
Views
486
Activity
Apr ’26
D3DMetal Extreme Over Synchronization Issues
Explanation Currently, D3DMetal’s GPU synchronization approach introduces significant compute overhead on the CPU. This specifically affects D3D12 games that use modern rendering pipelines on Apple Silicon. Specifically, I’ve tested Death Stranding 2 On the Beach for how it handles its rendering. And the results are extreme: frame times are suffering from a 42% decrease from synchronization. Although there are obviously other effects at play, such as the overhead introduced by Rosetta and Wine, both of them don’t introduce as much overhead as D3DMetal. This issue isn’t just specific to Death Stranding 2 On the Beach; most games running through D3DMetal suffer from this. Most games still seem to force synchronization to ~30 ms to reach the 30 fps amount. But it could be better with better synchronization, such as how DXMT handles it. Instead of doubling the work, it allows Metal to single-handedly track resource dependencies internally. This is in part due to the unfortunate bad mapping of D3D12 calls onto shared logic between D3D11 and D3D12. System M2 Max Mac Studio — 32 GBs — 30-core GPU macOS 26.4 Tahoe CrossOver 26.1 RC Death Stranding 2 On the Beach — Steam Assassin’s Creed Valhalla — Steam & Ubisoft Connect Thank you for your commitment. Another game that I recommend testing to really see this swell is Assassin’s Creed Valhalla. Feedback FB22426600 - D3DMetal Extereme Over Syncranization Issues
Replies
1
Boosts
1
Views
681
Activity
Apr ’26
Xcode_26 not compiling Metal project
Hello Xcode 26.0.1 (17A400) Missing some Metal components When building a program using Metal, it induces an unexpected error : “error: error: cannot execute tool 'metal' due to missing Metal Toolchain; use: xcodebuild -downloadComponent MetalToolchain Command CompileMetalFile failed with a nonzero exit code” Which terminates the build The fix given “xcodebuild -downloadComponent MetalToolchain” using sudo does not work Did someone find a work around or could resolve the issue? Many thanks Jean MacBook Air M4; macOS 26.0.1; Xcode 26.0.1
Replies
5
Boosts
2
Views
519
Activity
Mar ’26
How to load and draw texture with opacity in Metal
The background I'm finally working to convert my very old Mac kaleidoscope application, ScopeWorks, which was written in OpenGL and Objective-C, to a Multiplatform app in SwiftUI and Metal. I'm using the MetalKit MTKView class, wrapped for SwiftUI as an NSViewRepresentable or UIViewRepresentable. I then provide an MTKViewDelegate that provides a draw method. The draw method fetches the current render pass descriptor, creates a command buffer, sets up a render pipeline, and does its drawing. My renderer's makePipeline method looks like this: func makePipeline() { let library = device.makeDefaultLibrary() let pipelineDesc = MTLRenderPipelineDescriptor() pipelineDesc.vertexFunction = library?.makeFunction(name: "vertex_main") pipelineDesc.fragmentFunction = library?.makeFunction(name: "fragment_main") pipelineDesc.colorAttachments[0].pixelFormat = .bgra8Unorm pipeline = try! device.makeRenderPipelineState(descriptor: pipelineDesc) } And my shaders look like this: struct VertexOut { float4 position [[position]]; float2 texCoord; }; vertex VertexOut vertex_main(const device float2* position [[buffer(0)]], uint vid [[vertex_id]]) { VertexOut out; float2 pos = position[vid]; out.position = float4(pos, 0, 1); out.texCoord = pos * 0.5 + 0.5; // basic mapping return out; } fragment float4 fragment_main(VertexOut in [[stage_in]], texture2d<float> tex [[texture(0)]], constant float4& color [[buffer(1)]]) { constexpr sampler s(address::repeat, filter::linear); // float4 texColor = tex.sample(s, in.texCoord); // return texColor * color; float4 textureColor = {1, 2, 3, 4}; if (all(color == textureColor)) { return tex.sample(s, in.texCoord); } else { return color; } // Sample the texture directly — no color tint applied return tex.sample(s, in.texCoord); } The first part of my MTKViewDelegate's draw method looks like this: func draw(in view: MTKView) { guard let drawable = view.currentDrawable, let descriptor = view.currentRenderPassDescriptor, let pipeline = pipeline, let texture = texture else { return } let commandBuffer = commandQueue.makeCommandBuffer()! let encoder = commandBuffer.makeRenderCommandEncoder(descriptor: descriptor)! encoder.setRenderPipelineState(pipeline) encoder.setFragmentTexture(texture, index: 0) descriptor.colorAttachments[0].clearColor = MTLClearColor(red: 0.0, green: 0, blue: 0, alpha: 1.0) // Draw six equilateral triangles forming the hexagon let radius: Float = 0.6 for i in 0..<6 { let angle = Float(i) * (.pi / 3) let cosA = cos(angle) let sinA = sin(angle) let nextA = Float(i+1) * (.pi / 3) let cosB = cos(nextA) let sinB = sin(nextA) let verts: [simd_float2] = [ simd_float2(0, 0), simd_float2(radius * cosA, radius * sinA), simd_float2(radius * cosB, radius * sinB) ] encoder.setVertexBytes(verts, length: MemoryLayout<simd_float2>.stride * 3, index: 0) // Tell the fragment shader to use the texture color. var textureColor: simd_float4 = simd_float4(1, 2, 3, 4) encoder.setFragmentBytes(&textureColor, length: MemoryLayout<SIMD4<Float>>.stride, index: 1) encoder.drawPrimitives(type: .triangle, vertexStart: 0, vertexCount: 3) One of the things the existing app does is load PNG or TIFF images with an alpha channel, and then overlay parts of the image on top of themselves flipped, so you get interesting Moiré patterns in the lines in the resulting kaleidoscope. For now I'm working on a single sample image, loading it into a texture in Metal, and just rendering it as a hexagon and drawing lines for the triangles that make up the hexagon. (For now I'm using the vertex coordinates as the texture coordinates, so I get a hexagonal part of my texture rather than a single triangular part tessellated into a hexagon. I'll fix that later.) In both iOS and OS I set the clear color to black at the beginning of the draw function. The issue: The source image is mostly transparent, but with a lot of partly transparent pixels. Here's what it looks like in Photoshop, where you can see the transparent parts as a checkerboard pattern: (I tried to crop the original image to show the approximate part that I'm rendering in a hexagon, but it's not exact. Look for the same shapes in the different images to compare them.) When I render my hexagon in the Metal view in the iOS version of the app, it looks like it's forcing each pixel to fully opaque or fully transparent: And in the macOS version of the app, it seems to force ALL the pixels to opaque: I haven't shown all the setup code, because it's' a lot. Is there some rendering mode setup I'm missing in order to get it to draw the pixels into the output based on their opacity, including partial opacity?
Replies
2
Boosts
0
Views
1.1k
Activity
Mar ’26
CoreML regression between macOS 26.0.1 and macOS 26.1 Beta causing scrambled tensor outputs
We’ve encountered what appears to be a CoreML regression between macOS 26.0.1 and macOS 26.1 Beta. In macOS 26.0.1, CoreML models run and produce correct results. However, in macOS 26.1 Beta, the same models produce scrambled or corrupted outputs, suggesting that tensor memory is being read or written incorrectly. The behavior is consistent with a low-level stride or pointer arithmetic issue — for example, using 16-bit strides on 32-bit data or other mismatches in tensor layout handling. Reproduction Install ON1 Photo RAW 2026 or ON1 Resize 2026 on macOS 26.0.1. Use the newest Highest Quality resize model, which is Stable Diffusion–based and runs through CoreML. Observe correct, high-quality results. Upgrade to macOS 26.1 Beta and run the same operation again. The output becomes visually scrambled or corrupted. We are also seeing similar issues with another Stable Diffusion UNet model that previously worked correctly on macOS 26.0.1. This suggests the regression may affect multiple diffusion-style architectures, likely due to a change in CoreML’s tensor stride, layout computation, or memory alignment between these versions. Notes The affected models are exported using standard CoreML conversion pipelines. No custom operators or third-party CoreML runtime layers are used. The issue reproduces consistently across multiple machines. It would be helpful to know if there were changes to CoreML’s tensor layout, precision handling, or MLCompute backend between macOS 26.0.1 and 26.1 Beta, or if this is a known regression in the current beta.
Replies
8
Boosts
4
Views
2.4k
Activity
Mar ’26