TDSegment - Run face, clothing and area segmentation models locally (for beginners)

TDSegment runs any SegFormer-family CoreML model on Apple Silicon (clothing, fashion, face parts, 150-class ADE20K scene parsing) from a single dropdown menu. This post is about what's happening inside, because every parameter on every tab maps to one step of the same underlying pipeline, regardless of which model you picked.

The model is a single idea in seven skins

Every model I bundled is a SegFormer (a transformer-based image-to-image network). They all do one thing: for each pixel in your input, output a number that says "I think this pixel belongs to class N." The only thing that changes between "clothing" and "face" and "scene" is which classes N can be — 18 for human parsing, 19 for face regions, 47 for fine-grained fashion items, 150 for scenes.

So the TOX's job is the same regardless: take a frame, run it through the network, and give you three different ways to consume the result.

The architecture:

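A SegFormer has two halves. The encoder runs your image through four hierarchical transformer stages: the first reduces resolution to a quarter of the input, and each later stage halves it again while widening the feature depth. By the last stage it has thrown away most of the spatial information but built up a rich understanding of what is in each region. The decoder is a lightweight all-MLP head rather than a mirrored U-shape: it upsamples the features from all four stages to a common quarter resolution, fuses them, and predicts per-pixel classes. The fine-grained early-stage features are what keep the output edges crisp, and the result is then upsampled to the input resolution.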

The final layer produces, for each pixel, a vector of logits — one number per class. Argmax of that vector gives you the class ID. That class-ID map, at input resolution, is what the TOX receives from CoreML.
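As a minimal sketch of that last step, assume the network output for a tiny 2×2 image is a nested list of per-pixel logit vectors. The shapes and values are invented for illustration, and plain lists are used only to keep the example dependency-free.

```python
def argmax(values):
    """Return the index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def class_id_map(logits):
    """Turn an H x W grid of per-pixel logit vectors into an H x W grid of class IDs."""
    return [[argmax(pixel) for pixel in row] for row in logits]

# Hypothetical 2x2 image with 3 classes per pixel.
logits = [
    [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2]],
    [[-0.5, 0.0, 4.0], [1.0, 1.0, 0.9]],
]
class_ids = class_id_map(logits)
```

Everything downstream works on `class_ids` alone; the logits themselves are no longer needed once the winning class is known.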

The three outputs are three interpretations of that one map

  1. out3_classindex: the raw class-ID map as grayscale. The most honest view: each class gets its own gray value, so you can see exactly which class the network assigned to each pixel. Useful for debugging or for driving things by the raw class ID.

  2. out2_colored: a coloured overlay. Each class gets a colour from the 18-slot palette on the Palette tab. This is implemented in a GLSL fragment shader, so it runs entirely on the GPU: the shader reads the class ID from the class map, looks up the colour in a 1×18 palette texture, and writes it out. Zero CPU cost per pixel.

  3. out1_masked: your input with the pixels of hidden classes replaced. The coloured overlay doubles as an alpha mask; a multiplyTOP combines your input's RGB with that mask's alpha, then a compositeTOP blends it over a background colour (black / white / transparent / passthrough). You can toggle which classes are "visible" independently: show only clothing, or only body parts, or any subset.
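The three views can be sketched on the CPU in a few lines. This is only an illustration under assumptions: the TOX does the colouring in a GLSL shader, and the function names, the tiny 3-class palette, the gray-value spacing, and the hard keep-or-replace mask below are mine, not the TOX's.

```python
def to_gray(class_ids, num_classes):
    """Class-index view: spread class IDs evenly over 0..255."""
    step = 255 // max(num_classes - 1, 1)
    return [[cid * step for cid in row] for row in class_ids]

def to_colored(class_ids, palette):
    """Overlay view: look each class ID up in an RGBA palette."""
    return [[palette[cid] for cid in row] for row in class_ids]

def to_masked(image_rgb, colored, background=(0, 0, 0)):
    """Masked view: keep input RGB where the palette alpha is > 0, else use the background."""
    return [
        [rgb if rgba[3] > 0 else background for rgb, rgba in zip(img_row, col_row)]
        for img_row, col_row in zip(image_rgb, colored)
    ]

# Hypothetical data: class 0 is transparent background, classes 1 and 2 are opaque.
class_ids = [[0, 1], [2, 1]]
palette = [(0, 0, 0, 0), (255, 0, 0, 255), (0, 0, 255, 255)]
image = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (40, 40, 40)]]

gray = to_gray(class_ids, num_classes=3)
colored = to_colored(class_ids, palette)
masked = to_masked(image, colored)
```

All three results derive from the same `class_ids` grid, which is why switching outputs never requires a second inference pass.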

Why the palette has both colour and opacity per class

A pixel labelled "Background" is one the network is confident isn't anything interesting. You usually want that to be transparent. A pixel labelled "Upper-clothes" you probably want to keep. A pixel labelled "Sunglasses" you might want to show only when a particular scene calls for it. The per-class visibility toggles let you express that directly. Rather than hiding pixels in the output afterwards, a toggle tells the shader to use palette alpha 0 for that class, so the mask excludes its pixels automatically.
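A sketch of that idea, assuming a palette stored as RGBA tuples with alpha in 0..1 and a set of visible class IDs. Both representations are assumptions for illustration; the real palette lives in a texture and the blend happens in TOPs.

```python
def apply_visibility(palette, visible):
    """Return a copy of the palette with alpha forced to 0 for hidden classes."""
    return [
        (r, g, b, a if cid in visible else 0.0)
        for cid, (r, g, b, a) in enumerate(palette)
    ]

def composite_pixel(rgb, alpha, background):
    """Blend one input pixel over the background, weighted by the mask alpha."""
    return tuple(round(c * alpha + bg * (1.0 - alpha)) for c, bg in zip(rgb, background))

# Hypothetical 3-class palette: background, upper-clothes, sunglasses.
palette = [(0, 0, 0, 0.0), (255, 0, 0, 1.0), (0, 255, 0, 0.5)]
shown = apply_visibility(palette, visible={1})  # sunglasses toggled off
```

Because the toggle only edits the palette, the class map itself is untouched and a class can be switched back on without rerunning the model.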

The parameter tabs as layers of the same pipeline

Every parameter on the TOX maps to one specific step in the path from your input to your output. Nothing is decorative.

  • Pre Process — the input to the network. The model was trained on photos of people. If your input is dark, noisy, rotated, or zoomed in on the wrong region, it performs worse. Brightness / contrast / gamma / saturation / hue / pre-blur / crop / rotate all adjust the image before the network sees it, so you can match the domain it was trained on.

  • Palette + Classes — how to colour and show/hide classes after the network has labelled them. Colour is per-class (18 slots). Visibility and opacity are per-class toggles. Preset buttons set groups ("only clothing", "only body parts") in one click.

  • Mask — feather, blur, and erode/dilate on the alpha mask used for the masked-RGB composite. Purely post-processing.

  • Output — how the downstream TD network sees the result. Background mode (transparent / black / white / custom colour / passthrough input), premultiply, per-output invert, swap outputs, output pixel format. Doesn't change what the model saw — just how it's presented.

  • Performance — knobs for how often the model runs (Skipframes), whether to pipeline the input read for GPU-CPU overlap (Delayedread), whether to pre-warm the ANE on load, how often to update the Status readouts. None of these change the output; they only change the rate.

  • Telemetry — read-only. Peak and minimum observed FPS, cook counter, cache hit counter, model load time, model metadata.

  • Debug — verbose logging, per-stage profiling, frame dumping to disk.
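To make the Performance and Telemetry tabs concrete, here is a minimal sketch of frame skipping with a cached result. The class name, the `run_model` callable, and the exact meaning of `skip_frames` (run inference once, then reuse the result for that many cooks) are assumptions; the TOX may count differently.

```python
class SkipFrameRunner:
    """Run an expensive model every (skip_frames + 1) cooks and reuse the last result in between."""

    def __init__(self, run_model, skip_frames=0):
        self.run_model = run_model
        self.skip_frames = skip_frames
        self.cook_count = 0   # incremented on every call to cook()
        self.cache_hits = 0   # cooks served from the cached result
        self._cached = None

    def cook(self, frame):
        due = self.cook_count % (self.skip_frames + 1) == 0
        self.cook_count += 1
        if due or self._cached is None:
            self._cached = self.run_model(frame)
        else:
            self.cache_hits += 1
        return self._cached
```

With `skip_frames=2` the model runs on cooks 0, 3, 6 and so on, and the two cooks in between return the cached map, which is what a cache hit counter would report.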

The seven presets, and why you pick one

The Modelpreset dropdown is the simplest interface. Pick one:

  • cloth_binary — the original clothSegmentation model. Binary mask (is-this-cloth y/n). Smallest, ~30 fps. Use when all you need is "show cloth, hide the rest."

  • cloth_b0_fast — SegFormer B0 trained on human-parsing. 18 classes, ~60 fps. Use for live-video / real-time applications.

  • cloth_b3_quality — SegFormer B3, same classes as the fast one but much more accurate, ~25 fps. Use for photos or when you can spare the milliseconds.

  • fashion_b3 — SegFormer B3 fine-grained fashion. 47 classes including individual garment sub-parts (sleeves, collars, zippers). For fashion/retail use cases.

  • face_b5 — SegFormer B5 face parsing. 19 face regions. Trained on CelebAMask-HQ. For face-specific compositing (swap lips, recolour hair, track face parts).

  • scene_b0_fast — SegFormer B0 on ADE20K. 150 scene classes — wall, building, sky, floor, tree, ceiling, road, person, furniture, everything. ~60 fps. For environmental augmentation.

  • scene_b4_quality — SegFormer B4 on ADE20K. Same 150 classes, much sharper. ~20 fps.

Picking a preset writes the correct Modelfile path, sets an appropriate confidence threshold, auto-applies a palette tuned to that model's semantics, and asks you to pulse Reload Model to swap it in. You're not limited to the presets — the Modelfile parameter takes any CoreML .mlpackage you point it at. It just needs to be a SegFormer-family model trained on a standard image-segmentation task.
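The preset list can be written down as a small lookup table. The class counts and frame rates are the approximate figures quoted above; the dataclass, the field names, the task labels, and the `fastest` helper are illustrative only and say nothing about how the TOX stores its presets.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Preset:
    task: str
    classes: int
    approx_fps: Optional[int]  # None where no figure is quoted

PRESETS = {
    "cloth_binary":     Preset("clothing", 2, 30),   # binary mask: cloth / not cloth
    "cloth_b0_fast":    Preset("clothing", 18, 60),
    "cloth_b3_quality": Preset("clothing", 18, 25),
    "fashion_b3":       Preset("fashion", 47, None),
    "face_b5":          Preset("face", 19, None),
    "scene_b0_fast":    Preset("scene", 150, 60),
    "scene_b4_quality": Preset("scene", 150, 20),
}

def fastest(task):
    """Name of the preset with the highest quoted fps for a task, or None."""
    rated = [n for n, p in PRESETS.items() if p.task == task and p.approx_fps is not None]
    return max(rated, key=lambda n: PRESETS[n].approx_fps) if rated else None
```

For example, `fastest("scene")` picks the B0 scene model, which matches the advice to use the fast variants for live video.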

The Apple Silicon secret that makes this work

Every M-series Mac has three processors running in parallel: CPU, GPU, and the Apple Neural Engine (ANE). Apple's CoreML framework can split a single model across all three with one flag: compute_units=ComputeUnit.ALL. In my benchmarks the ANE is the largest contributor — the B0 clothes model runs about 10× faster on ANE+GPU than it would on CPU alone. That's the difference between 6 fps (CPU) and 60 fps (all three).
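In coremltools the flag is passed when the model is loaded. A minimal sketch, assuming coremltools is installed: the helper name and the model path are placeholders, and the import sits inside the function so the snippet can be read on machines without coremltools.

```python
COMPUTE_UNIT_NAMES = ("ALL", "CPU_ONLY", "CPU_AND_GPU")

def load_model(path, units="ALL"):
    """Load a CoreML model and let CoreML schedule it on the requested hardware."""
    if units not in COMPUTE_UNIT_NAMES:
        raise ValueError(f"unknown compute unit: {units!r}")
    import coremltools as ct  # imported lazily; prediction itself needs macOS
    return ct.models.MLModel(path, compute_units=ct.ComputeUnit[units])

# Usage (placeholder path):
#   model = load_model("/path/to/model.mlpackage")            # CPU + GPU + ANE
#   cpu_only = load_model("/path/to/model.mlpackage", "CPU_ONLY")  # baseline for comparison
```

Loading the same file once with `ALL` and once with `CPU_ONLY` is a simple way to reproduce a CPU-versus-everything comparison like the one described above.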

The catch: TouchDesigner's Python sandbox normally can't load coremltools' native libraries because macOS's hardened runtime refuses unsigned dylibs in signed host processes. Every previous attempt at CoreML-in-TD I've seen gives up there.

The fix is one command:

codesign -s - <library>.so

Ad-hoc signing each unsigned .so / .dylib in TD's Python site-packages satisfies the hardened runtime without changing what the libraries do. The setup script (setup_codesigning.sh) does this for every file in one pass. Run it once, and coremltools' MLModel.predict() works natively inside TD from then on.
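The same pass can be sketched in Python. This is not the contents of setup_codesigning.sh, just an assumed equivalent: it walks a directory for .so and .dylib files and builds the codesign command shown above for each. The function names and the dry-run default are my own.

```python
import subprocess
from pathlib import Path

def find_native_libs(root):
    """Return all .so / .dylib files under root, sorted for stable output."""
    root = Path(root)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix in {".so", ".dylib"})

def adhoc_sign(root, dry_run=True):
    """Build (and optionally run) an ad-hoc codesign command per native library."""
    commands = [["codesign", "-s", "-", str(lib)] for lib in find_native_libs(root)]
    if not dry_run:
        for cmd in commands:
            # macOS only. Failures are not fatal, so one file that is
            # already signed does not abort the whole pass.
            subprocess.run(cmd, check=False)
    return commands
```

Calling `adhoc_sign(path)` with the default `dry_run=True` only lists what would be signed, which is a safe way to check the path before touching anything.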

Once that's done, any CoreML model runs on the Neural Engine automatically. The TOX is a thin shell around that — it just handles reading frames, calling predict(), colorizing the result, and compositing.

Memory hygiene

One thing I ran into during development: coremltools' MLModel.predict() accumulates internal IOSurface references over thousands of cooks. On long sessions this can eventually freeze TD when the process runs out of GPU memory. The callback now runs gc.collect() every 256 cooks to clear that, and explicitly deletes (del) intermediate PIL images and numpy arrays as it goes. The old model reference is also freed before a new one is loaded (not just reassigned) so the GC can reclaim its weights.
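The pattern looks roughly like this. It is a sketch of the idea, not the TOX's callback: the class, the `holder` dict, and the `load_new` callable are hypothetical names, and only the 256-cook interval comes from the description above.

```python
import gc

GC_INTERVAL = 256  # cooks between forced collections

class CookHygiene:
    """Counts cooks and forces a garbage collection every `interval` cooks."""

    def __init__(self, interval=GC_INTERVAL):
        self.interval = interval
        self.cooks = 0
        self.collections = 0

    def after_cook(self):
        """Call once at the end of every cook."""
        self.cooks += 1
        if self.cooks % self.interval == 0:
            gc.collect()
            self.collections += 1

def swap_model(holder, load_new):
    """Release the old model before loading the new one, so both are never resident at once."""
    holder["model"] = None   # drop the old reference first
    gc.collect()             # give the collector a chance to reclaim it
    holder["model"] = load_new()
    return holder["model"]
```

Plain reassignment would keep the old model alive until the new one has finished loading; clearing the reference first is what stops both sets of weights from being in memory at the same time.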

The practical effect: you can leave it running for hours without the "freezes after a while" pattern.

The one-sentence design logic

The model produces one class map per frame; the TOX gives you every reasonable way to shape, colour, and composite that map without losing information you didn't choose to throw away, regardless of whether the labels are clothing, face parts, or scene objects.

That's why there's only one TOX and seven presets instead of seven TOXes. The pipeline doesn't care what the labels mean — it cares about how you want to consume them.

What's in the package

Two zips:

  1. TOX zip (small): clothSeg3.tox + setup_codesigning.sh + this README

  2. Models zip (~1.5 GB): seven .mlpackage / .mlmodel files + MODELS.md with attributions

Drop the models anywhere, run the setup script once, import the TOX. Pick a preset. Pulse Reload Model. Flip Enable on. Done.

 
