Live 3D body capture from a webcam, into a POP, at 30 fps
Hey all. Wanted to share what I learned building a TouchDesigner component that takes any TOP — webcam, video, still image — and pushes out a live SMPL body mesh as a POP, one mesh per person in the frame. Up to 10 simultaneous bodies, positioned in real 3D world space, running at around 30 fps on M-series Macs and 45-50 fps on Windows with an NVIDIA card.
This post is the tutorial. If you'd rather skip the work, I packaged the whole thing up — link at the end.
Example output, running live from a local webcam: a 3D mesh is generated for each person. Note: this is not a silhouette; it's a full 3D mesh with depth and volume.
What I wanted
A TD-native way to say "give me the body of every person in this video signal, as 3D geometry I can do TD things to." No external server, no GPU box on the network, no IPC sockets, no bespoke Blender or Unity exporter in the middle. Just plug a TOP in, get a POP out.
There are three free monocular human reconstruction models worth knowing about:
- ROMP (Sun et al., ICCV 2021) — single-shot, multi-person, fast. Outputs SMPL pose + shape + a 6890-vertex mesh per person.
- BEV (Sun et al., CVPR 2022) — same authors, adds depth-ordering between people and supports children. About 2× slower than ROMP.
- TRACE (Sun et al., CVPR 2023) — adds temporal tracking with persistent IDs and globally-stable trajectories under camera motion. About 5× slower.
For real-time TD work, ROMP is the right pick. The simple-romp Python package (pip install simple-romp) wraps it cleanly.
That's the easy part. The interesting part is everything between pip install simple-romp and a working TD component that doesn't crash, doesn't run at 2 fps, and isn't horrible to install. Six specific problems, all solvable, all worth knowing about even if you never touch ROMP.
Problem 1: simple-romp on a Mac runs at 2 fps
If you pip install simple-romp and run inference on a 1920×1080 frame, your Mac (M2, M3, doesn't matter which) sits at around 485 ms per frame. That's about 2 fps. For a research demo, fine. For real-time TD, useless.
Why? simple-romp's ONNX runtime session has its provider list hardcoded inside romp/main.py:
    self.ort_session = onnxruntime.InferenceSession(
        self.settings.model_onnx_path,
        providers=['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'])

On a Mac, the first two don't exist. ONNX Runtime silently falls back to the CPU provider and you get pure-CPU inference. Meanwhile your Mac has a Neural Engine sitting idle.
The fix is one line of monkey-patching after the model loads:
    import onnxruntime as ort

    self._model.ort_session = ort.InferenceSession(
        self._model.settings.model_onnx_path,
        providers=['CoreMLExecutionProvider', 'CPUExecutionProvider'],
    )
CoreMLExecutionProvider ships with the standard onnxruntime package on macOS arm64. It routes ONNX ops to a mix of CPU, GPU, and the Apple Neural Engine. Inference drops from 485 ms to 18 ms on my M-series box. That's a 27× speedup, no model changes, no model file conversion, no extra deps. Single biggest perf win in the whole pipeline.
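If anyone wants to go further: converting the model to an mlprogram with coremltools, along the lines of ct.convert(model, convert_to="mlprogram", compute_units=ct.ComputeUnit.ALL, minimum_deployment_target=ct.target.macOS13), would let Core ML schedule it on the ANE more directly. Note that recent coremltools releases convert from PyTorch or TensorFlow models rather than from an .onnx file, so this would have to start from the PyTorch checkpoint. Probably another 2-3× on top. I haven't done this yet but it's the next step.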
On Windows with an NVIDIA card, the equivalent move is just leaving CUDAExecutionProvider in the list and installing the CUDA-enabled PyTorch wheels — that gets you ~10 ms inference on something like an RTX 3070.
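The per-platform provider choice above can be written once instead of hardcoding a list per OS. This is a minimal sketch, not what simple-romp does: `pick_providers` is a hypothetical helper name, and the preference order is my assumption. It takes the list of available providers as an argument so it can be tried without onnxruntime installed; inside TD you would pass it `ort.get_available_providers()`.

```python
def pick_providers(available, preferred=(
        "CoreMLExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider")):
    """Return the preferred providers that this onnxruntime build offers."""
    chosen = [name for name in preferred if name in available]
    # Always keep a CPU fallback so the list is never empty.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Inside TD the real list would come from onnxruntime, for example:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession(model_path, providers=providers)
mac_like = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
print(pick_providers(mac_like))
```

Filtering against the available list also makes the silent CPU fallback visible: if the result is just the CPU provider, you know the accelerated one is missing from your onnxruntime build.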
Problem 2: TD's threading model
ROMP inference takes ~18 ms per frame even with Core ML. At 60 fps the whole frame budget is about 16.7 ms, so if you put that in your cook callback, you've blown it. Inference has to run on a worker thread.
TD's main-thread rule: anything in the td module — op(), parent(), parameter reads, DAT.text writes, OP attribute access — must happen on the main thread. Touch any of it from a worker and you get td.tdError: TouchDesigner objects cannot be accessed outside the main thread. Crash, in some cases.
There's a documented workaround: td.run("...some Python source...") schedules a string of code to execute on the main thread on the next cook. People tell you to use that for thread→main marshalling.
That's also broken. td.run() itself isn't thread-safe. Calling it from a worker thread also raises tdError. Don't believe me? Try this from a worker:
    import threading, td

    def w():
        td.run("op('/').time")

    threading.Thread(target=w).start()
    # tdError: TouchDesigner objects cannot be accessed outside the main thread
What does work: a plain queue.Queue. The worker pushes tuples like ("set", "/path/to/dat", "some text") into the queue. An Execute DAT's onFrameStart callback runs on the main thread and drains the queue every cook, applying messages.
The crucial detail: the worker thread can't even read TD object attributes for path strings. If you do outbox.put(("set", op('worker_status').path, msg)) from the worker, TD raises tdError at the .path access. You have to capture all the strings on the main thread before spawning the worker, and pass them in as closure locals.
    import queue
    import threading

    # main thread: precompute path strings
    # (comp is the component, proc is the setup subprocess; both are created elsewhere)
    log_path = str(comp.op('setup_log').path)
    status_path = str(comp.op('worker_status').path)
    outbox = queue.Queue()

    # worker thread: only touches plain Python objects
    def _pump(outbox=outbox, log_path=log_path, status_path=status_path):
        for line in proc.stdout:
            outbox.put(("append", log_path, line))

    threading.Thread(target=_pump, daemon=True).start()

    # back on main thread (Execute DAT onFrameStart):
    def drain_main_outbox():
        while True:
            try:
                kind, path, text = outbox.get_nowait()
            except queue.Empty:
                return
            d = op(path)
            if kind == "append":
                d.text = (d.text or "") + text
            elif kind == "set":
                d.text = text
Once you've got this pattern, no thread warnings, no crashes, no mysterious freezes. Used everywhere in the component — from the inference worker, from the subprocess pipe pump during setup, everywhere.
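If you want to try the outbox pattern outside TD first, here is a minimal self-contained sketch. `FakeDAT` and the `dats` dictionary are stand-ins I made up for Text DATs and `op()` lookups, and the paths are placeholders; only the queue handling mirrors the pattern described above.

```python
import queue
import threading


class FakeDAT:
    """Stand-in for a Text DAT: all it has is a .text attribute."""
    def __init__(self):
        self.text = ""


# Stand-in for op(path); in TD the main thread would call the real op().
dats = {"/project1/setup_log": FakeDAT(), "/project1/worker_status": FakeDAT()}
outbox = queue.Queue()


def worker(outbox, log_path, status_path):
    # Worker side: plain strings and tuples only, never a TD object.
    for i in range(3):
        outbox.put(("append", log_path, f"line {i}\n"))
    outbox.put(("set", status_path, "done"))


def drain(outbox, lookup):
    # Main-thread side: apply everything queued since the last frame.
    while True:
        try:
            kind, path, text = outbox.get_nowait()
        except queue.Empty:
            return
        dat = lookup[path]
        if kind == "append":
            dat.text += text
        elif kind == "set":
            dat.text = text


t = threading.Thread(
    target=worker,
    args=(outbox, "/project1/setup_log", "/project1/worker_status"),
    daemon=True,
)
t.start()
t.join()  # demo only; in TD the main thread should never block on the worker
drain(outbox, dats)
print(dats["/project1/worker_status"].text)
```

The queue is the only object both threads share, and `queue.Queue` does its own locking, so neither side needs an explicit lock.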
Problem 3: Reset leaks worker threads
Every time you change a parameter that requires re-booting the worker (Device, Maxbodies, etc.), you tear down the old worker and start a new one. Standard pattern. The catch: if worker.stop() returns before the thread has actually joined, and you've already lost your reference (self._worker = None), the thread keeps running with the ONNX/Core ML session still loaded. Press Reset 47 times during debugging and you have 47 zombie inference threads sharing the GPU. fps drops to 1.
Two fixes I needed:
- Generous join timeout in teardown. I bumped to 4 seconds and log a warning if the join times out. Workers stuck deep in backend.infer() (which can't be interrupted from outside) will still exit at the top of their next loop iteration, but the teardown waits long enough to actually catch most cases.
- A class-level "everyone stop" event on the InferenceWorker class that every loop iteration checks alongside the per-instance _stop event. So a single InferenceWorker.shutdown_all() call trips the flag for every worker thread alive in the process, regardless of whether you have a Python handle to it. Plus a module-level registry of every worker spawned, deregistered on successful join, walked by a "Kill zombies" button on the parameter page.
Threads from before this code shipped still can't be reached (Python won't let you force-kill threads), so a TD restart is the last-resort nuke. But going forward, the kill button cleans the registered workers and the global stop flag catches any that the registry missed.
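The two fixes can be sketched together roughly like this. It is an illustration under assumptions, not the component's actual code: the class name and `shutdown_all()` come from the description above, but `allow_new_workers()`, the registry being a class attribute rather than module-level, and the sleep standing in for `backend.infer()` are all mine.

```python
import threading
import time


class InferenceWorker:
    _stop_all = threading.Event()       # class-level "everyone stop" flag
    _registry = set()                   # workers that may still be running
    _registry_lock = threading.Lock()

    def __init__(self):
        self._stop = threading.Event()  # per-instance stop flag
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        with self._registry_lock:
            self._registry.add(self)
        self._thread.start()

    def _loop(self):
        # Both flags are checked at the top of every iteration.
        while not (self._stop.is_set() or self._stop_all.is_set()):
            time.sleep(0.01)            # placeholder for backend.infer(frame)

    def stop(self, timeout=4.0):
        self._stop.set()
        self._thread.join(timeout)
        joined = not self._thread.is_alive()
        if joined:
            with self._registry_lock:
                self._registry.discard(self)
        return joined                   # False: log a warning, it may be a zombie

    @classmethod
    def shutdown_all(cls, timeout=4.0):
        cls._stop_all.set()             # also reaches threads with no handle left
        with cls._registry_lock:
            workers = list(cls._registry)
        return [w for w in workers if not w.stop(timeout)]

    @classmethod
    def allow_new_workers(cls):
        # Hypothetical helper: clear the global flag before spawning again.
        cls._stop_all.clear()


first = InferenceWorker()
first.start()
InferenceWorker().start()               # handle deliberately dropped
stuck = InferenceWorker.shutdown_all()
print(len(stuck), first._thread.is_alive())
```

In this sketch the global flag stays set after `shutdown_all()`, so a worker that missed the join window still exits on its next iteration. The cost is that you have to clear the flag explicitly before starting replacements.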
Problem 4: Geometry path — Script SOP into SOP-to-POP
POPs are GPU-side. ROMP gives you a NumPy array of vertex positions on CPU. There's no Script POP yet, and even if there were, the straight-from-NumPy path matters for performance.
What works: a Script SOP that bakes the SMPL topology once (6890 points + 13776 triangle faces — fixed for SMPL), then on every cook just updates the point positions from the latest worker output. A SOP-to-POP node lifts to GPU. Downstream POPs work normally.
The Script SOP onCook callback is around 30 ms when iterating ~28k point positions for 4 stacked bodies (each body adds 6890 points). That's the next perf bottleneck after Core ML — moving this to a Script CHOP and using CHOP-to-POP would cut it.
The cook callback also has to be force-triggered every frame. By default a Script SOP only cooks when its dependency graph dirties, and a worker-thread-written buffer doesn't dirty it. So I added op('humanmesh').cook(force=True) to the Execute DAT's onFrameStart. Cheap and effective.
Problem 5: Multi-person bodies stack at the origin
ROMP returns SMPL vertices in the SMPL coordinate frame — every person centred at their own origin. Hand four people's verts to a SOP and they all overlap at (0,0,0).
ROMP's output dict has a cam_trans key — a per-person world-space translation in metres, derived from the perspective camera fit. Apply that per person before stacking:
    verts_all = np.asarray(out['verts'])        # (P, 6890, 3)
    cam_trans = np.asarray(out['cam_trans'])    # (P, 3)
    verts_all = verts_all + cam_trans[:, None, :]   # broadcast

People in the back of the room land at large +Z; people in front at small +Z. This falls apart when the camera moves (TRACE solves that), but for a static camera it gives you very good world-space positions for free.
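To see what the broadcast does, here is a toy version with two people and four vertices each instead of 6890. The numbers are invented, and the final reshape into one flat point list is my assumption about how the bodies get stacked for a single SOP.

```python
import numpy as np

P, V = 2, 4                                # 2 people, 4 vertices each (SMPL: V = 6890)
verts = np.zeros((P, V, 3))                # every body centred on its own origin
cam_trans = np.array([[0.0, 0.0, 2.0],     # person 0: 2 m in front of the camera
                      [1.0, 0.0, 5.0]])    # person 1: 1 m to the side, 5 m away

# cam_trans[:, None, :] has shape (P, 1, 3) and broadcasts over the V axis,
# so every vertex of person p is shifted by that person's translation.
world = verts + cam_trans[:, None, :]
stacked = world.reshape(-1, 3)             # (P * V, 3): one flat list of points

print(world.shape, stacked.shape)
```

Without the `None` axis, a `(P, 3)` array cannot in general broadcast against `(P, V, 3)`: NumPy would try to align P with V and raise a shape error.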
Problem 6: Axis conventions
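ROMP's output sits in camera-style axes (Y pointing down, Z pointing away from the camera), while TD is Y-up with the camera looking down -Z, so the raw mesh tends to show up upside down and facing the wrong way. Easiest fix: a Transform POP between SOP-to-POP and your output, with sx/sy/sz bound via expressions to three Toggle parameters on the COMP (Flip X, Flip Y, Flip Z). Plus a uniform scale. The user toggles to match their scene. Defaults of Y-on, Z-on, X-off match a static non-mirrored camera. Selfie webcam? Toggle X.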
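The toggle-to-scale mapping is just a sign flip per axis. Here is a tiny sketch of the logic, with `flip_scale` as a made-up helper name. How you write the actual parameter expressions in TD is left out here.

```python
def flip_scale(flip_x, flip_y, flip_z, uniform=1.0):
    """Map three flip toggles and a uniform scale to (sx, sy, sz)."""
    return tuple(-uniform if flip else uniform
                 for flip in (flip_x, flip_y, flip_z))


# Defaults from the text: Y on, Z on, X off.
print(flip_scale(False, True, True))  # (1.0, -1.0, -1.0)
```

Flipping an odd number of axes mirrors the mesh, which is what you want for a selfie view. Mirroring also reverses triangle winding, so check backface culling if the body suddenly looks inside out.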
Putting it together
- Webcam TOP → wired into the COMP's left input.
- inTOP child reads the frame.
- HumanPopExt.OnFrameLive() pushes the latest BGR frame to the worker's queue (drop-stale-frame semantics: always-latest).
- Worker thread pulls a frame, runs model(bgr) (~18 ms with Core ML), applies cam_trans per person, writes verts into a lock-protected NumPy buffer.
- Script SOP onCook reads the buffer (under lock) and updates point positions. Bakes topology on first cook.
- SOP-to-POP lifts to GPU.
- Transform POP applies user-tunable axis flips and scale.
- Out POP carries the geometry to the rest of your network.
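The "always-latest" frame handoff in the third step can be built from a one-item mailbox instead of a growing queue. This is a minimal sketch with a hypothetical `LatestSlot` class, not necessarily how the component does it: `put()` overwrites whatever is waiting, so a slow worker never processes a stale frame.

```python
import threading


class LatestSlot:
    """Single-item mailbox: put() overwrites, take() returns the newest item once."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._fresh = False

    def put(self, item):
        with self._cond:
            self._item = item            # any unread older frame is dropped here
            self._fresh = True
            self._cond.notify()

    def take(self, timeout=None):
        with self._cond:
            if not self._cond.wait_for(lambda: self._fresh, timeout):
                return None              # nothing new arrived within the timeout
            self._fresh = False
            return self._item


slot = LatestSlot()
for frame_id in range(5):                # producer outruns the consumer
    slot.put(frame_id)
print(slot.take(timeout=0.1))            # 4
print(slot.take(timeout=0.1))            # None
```

The same shape works in the other direction for the vertex buffer: the worker puts the newest result, and the Script SOP cook takes whatever is freshest.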
Per-frame overhead:
- Frame read from TOP: ~1 ms
- Worker queue put: <1 ms
- Inference (worker thread, parallel): ~18 ms
- Buffer read on cook: <1 ms
- Script SOP point update for 4 bodies (~28k points): ~30 ms
- SOP-to-POP + Transform POP on GPU: ~1 ms
End-to-end you're around 30 fps with one body, 18-22 fps with four.
Other things that mattered
The setup-side stuff is half the work for a distributable component:
- Cross-platform venv bootstrap. TD ships its own Python 3.11. A Setup pulse launches a host Python 3.11 (homebrew on Mac, python.org on Windows) as a subprocess. That host Python creates a venv at a user-configurable path, pip-installs everything from a pinned requirements file, and the TD COMP's init_exec prepends the venv's site-packages to sys.path at startup. No conflict with TD's own NumPy.
- simple-romp 1.1.4 needs Cython at build time. Its setup.py imports Cython but doesn't declare it in pyproject.toml's build-system.requires. With PEP 517 build isolation that fails. Pre-install Cython into the venv, then run the main install with --no-build-isolation.
- MPII SMPL .pkl + Python 3.11 + NumPy 1.26 = chumpy hell. SMPL's .pkl files are pickled with chumpy.Ch wrappers around their numpy arrays. chumpy 0.70 is the latest release, hasn't been updated, and uses inspect.getargspec (removed in Python 3.11) and np.bool (deprecated in NumPy 1.20, removed in 1.24). The "official" romp.prepare_smpl tool just imports chumpy and dies. I wrote a 30-line chumpy.Ch stub that gets installed into sys.modules before pickle.load, deserialises the array contents, and produces the same SMPL_NEUTRAL.pth ROMP expects, without ever touching the broken chumpy package.
- TD's auto-globals don't propagate. When you import a module from a Text DAT vs. paste it into a Text DAT and run, you get different scopes. op, me, parent, td, baseCOMP, etc. are injected only into the DAT's exec context. Modules imported via import from the textport or from another DAT don't see them. If you write tools/build_component.py, you have to either eval it in a DAT or do import td; op = td.op; baseCOMP = td.baseCOMP; ... at the top.
- Custom param expressions can't use op() directly. They can use me.op(...). Don't ask me why, but op('worker_status').text in a Str parameter expression raises an error, while me.op('worker_status').text works fine.
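For the chumpy item above, the general trick of registering a stand-in module in sys.modules before unpickling looks roughly like this. It is a sketch of the technique, not the converter described in the post. The attribute name `x` for the wrapped array is an assumption you should check against a real file, and the demo pickle is hand-made, with a list standing in for the NumPy array.

```python
import pickle
import sys
import types


class Ch:
    """Stand-in for chumpy.Ch that simply keeps whatever state was pickled."""

    def __setstate__(self, state):
        if isinstance(state, dict):
            self.__dict__.update(state)
        else:
            self.raw_state = state


def install_chumpy_stub():
    """Register fake 'chumpy' and 'chumpy.ch' modules so pickle can resolve Ch."""
    Ch.__module__ = "chumpy.ch"
    pkg = types.ModuleType("chumpy")
    sub = types.ModuleType("chumpy.ch")
    pkg.Ch = sub.Ch = Ch
    pkg.ch = sub
    sys.modules["chumpy"] = pkg
    sys.modules["chumpy.ch"] = sub


def unwrap(value):
    """Return the wrapped payload of a stub Ch, or the value unchanged.

    ASSUMPTION: the payload lives in an attribute named 'x'.
    """
    return getattr(value, "x", value) if isinstance(value, Ch) else value


install_chumpy_stub()  # must run before pickle.load / pickle.loads

# Hand-made demo pickle; the keys are for illustration only.
wrapped = Ch()
wrapped.x = [1.0, 2.0, 3.0]
blob = pickle.dumps({"v_template": wrapped, "note": "plain value"})
loaded = {key: unwrap(val) for key, val in pickle.loads(blob).items()}
print(loaded["v_template"])
```

Because pickle resolves classes by module path and name at load time, the real chumpy package never has to be importable. A real SMPL file may reference other classes or need extra load options, which this sketch does not cover.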
Performance reality
On a 1920×1080 input with twelve people detected, top four shown:
- Pure inference (no TD overhead, just model(bgr) in a loop):
  - Stock simple-romp on macOS: ~485 ms / frame (2 fps)
  - With Core ML provider patched in: ~18 ms / frame (55 fps)
- End-to-end inside TD with four-body output:
  - macOS M2/M3: ~30-35 fps
  - Windows + NVIDIA RTX (e.g. a 3070): ~60-90 fps depending on the card; you may be able to reach 90, but no promises
The Core ML patch is the lever; everything else is engineering plumbing.
If you want to skip the build
I packaged all of this — the .tox, install scripts, model files, documentation — as a single download for patrons. ~224 MB, unzip, follow the install guide, you're up in five minutes plus PyTorch download time. It includes:
- The HumanPop .tox itself, parameter-tuned for sensible defaults
- A cross-platform setup script that handles venv + Cython + SMPL conversion automatically
- The chumpy-free SMPL converter as a standalone script
- ROMP weights pre-downloaded (MIT, redistributable)
- Sample multi-person test image
- Full user manual + troubleshooting guide
If you build something with this, post it here. I want to see what people make.
Happy to answer questions about any of the six problems above — especially the threading and Core ML stuff, which I think are useful beyond just ROMP.







