Bit-packing 16-bit images without Pack TOP

Suppose you have a 32-bit RGBA image and you use the Pack TOP to convert it to 8-bit. You’ll end up with four times as many pixels, which is what you’d expect. Now suppose you have a 16-bit RGBA image and you use the Pack TOP. You’ll still get four times as many pixels, even though twice as many would be enough. This sample project shows how to get by with only twice as many. I’ve included examples in both directions for 16-bit RGBA to 8-bit RGBA and for 16-bit R to 8-bit RGBA (where you need half as many pixels). There’s a very small error term in the 16-bit RGBA example, but the 16-bit R example is exact. Let me know if you can identify the problem in the first one.
bitpacking.toe (5.88 KB)
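For anyone who wants the arithmetic without opening the file, here is a minimal sketch in Python (not the GLSL from the project) of the 16-bit R to 8-bit RGBA idea: each 16-bit value is split into a high byte and a low byte, so two 16-bit samples fit in one 8-bit RGBA pixel. The function names and the byte order (high byte first) are assumptions made for illustration and may not match what the attached file does.

```python
def split_u16(v):
    """Split a 16-bit integer into (high byte, low byte)."""
    return (v >> 8) & 0xFF, v & 0xFF


def join_u16(hi, lo):
    """Rebuild a 16-bit integer from its high and low bytes."""
    return (hi << 8) | lo


def r16_pair_to_rgba8(a, b):
    """Pack two 16-bit R samples into one 8-bit RGBA pixel (assumed layout)."""
    return split_u16(a) + split_u16(b)


def rgba8_to_r16_pair(pixel):
    """Recover the two 16-bit R samples from one 8-bit RGBA pixel."""
    r, g, b, a = pixel
    return join_u16(r, g), join_u16(b, a)


pixel = r16_pair_to_rgba8(0x1234, 0xABCD)
print(pixel, rgba8_to_r16_pair(pixel))
```

Because everything here is integer shifting and masking, the round trip loses nothing. A 16-bit RGBA pixel holds four such values, which is why it needs two 8-bit RGBA pixels rather than four.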

My goal is to put a UV pair in each channel of a 32-bit image. Without any bit-packing, you could say R is U1, G is V1, B is U2, and A is V2. Instead, I’m envisioning that each 32-bit float channel holds a packed UV pair with 16 bits per component: R is UV1, G is UV2, B is UV3, and A is UV4. Wouldn’t that be useful?
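To make the layout concrete, here is a rough Python sketch of one UV pair stored in a single 32-bit word, with the word then reinterpreted as a float the way a 32-bit float channel would see it. The helper names and the choice of U in the low 16 bits are assumptions for illustration only. One thing the sketch shows is that some bit patterns read as NaN when viewed as a float, and I can't say whether a given GPU or texture path keeps those bits intact.

```python
import math
import struct


def pack_uv16(u, v):
    """Put u in the low 16 bits and v in the high 16 bits (assumed layout)."""
    return (u & 0xFFFF) | ((v & 0xFFFF) << 16)


def unpack_uv16(word):
    """Recover (u, v) from a packed 32-bit word."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def bits_to_float(word):
    """Reinterpret a 32-bit unsigned integer as an IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<I", word))[0]


word = pack_uv16(1, 0x7F80)
print(hex(word), unpack_uv16(word), bits_to_float(word))
```

As integers, the pair survives packing and unpacking exactly. The example word 0x7F800001 has all exponent bits set and a nonzero mantissa, so as a float it is a NaN, which is the kind of value worth testing separately if the packed data travels through float textures.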

I’m trying to use the built-in bit-packing functions packUnorm2x16 and unpackUnorm2x16. Unorm stands for unsigned normalized, which I take to mean that the inputs to the function should be between 0.0 and 1.0. So if I have integers from 0 to 65535, I should divide them by 65535 before calling packUnorm2x16, just as you would divide an 8-bit number by 255 in the 8-bit case.
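Here is a small Python emulation of what the GLSL specification says these two functions do: packUnorm2x16 clamps each component to [0, 1], multiplies by 65535.0, rounds, and writes the first component to the low 16 bits, and unpackUnorm2x16 divides each 16-bit field by 65535.0. The `f32` helper is my own addition that rounds intermediate values to 32-bit float precision to roughly mimic shader arithmetic. It is an approximation, not a model of any particular GPU.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_u16(c):
    """Clamp to [0, 1], scale by 65535 and round to the nearest integer."""
    c = min(max(c, 0.0), 1.0)
    return int(math.floor(f32(c * 65535.0) + 0.5))


def pack_unorm_2x16(x, y):
    """Emulate packUnorm2x16: x lands in the low 16 bits, y in the high 16."""
    return to_u16(x) | (to_u16(y) << 16)


def unpack_unorm_2x16(word):
    """Emulate unpackUnorm2x16: each 16-bit field divided by 65535."""
    return f32((word & 0xFFFF) / 65535.0), f32((word >> 16) / 65535.0)


def roundtrip(u, v):
    """Normalize two 16-bit integers, pack, unpack and scale back."""
    word = pack_unorm_2x16(f32(u / 65535.0), f32(v / 65535.0))
    x, y = unpack_unorm_2x16(word)
    return to_u16(x), to_u16(y)


print(hex(pack_unorm_2x16(1.0, 0.0)), roundtrip(12345, 54321))
```

In this emulation, dividing by 65535 before packing brings every integer from 0 to 65535 back unchanged, so the normalization step on its own does not seem to explain an imprecise result. That points toward what happens to the packed 32-bit word afterwards, though the emulation cannot confirm it.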

Unfortunately, this results in an imprecise packing as demonstrated by the TOP to CHOPs in the attached file. I’ve also looked into using the other built-in functions such as packHalf2x16, but I haven’t had luck with those either. This is a very dry topic, but I’d appreciate it a lot if someone can help out.
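On packHalf2x16 specifically, a quick Python check suggests why it is a poor fit for this job: it converts each component to a 16-bit half float, which has an 11-bit significand, so most of the 65536 evenly spaced levels between 0.0 and 1.0 have no exact half-float representation. The sketch below uses Python's `struct` half-precision format as a stand-in for the GPU conversion. The function names are mine, and the rounding behaviour of a real GPU may differ.

```python
import math
import struct


def to_half_and_back(x):
    """Round a value to the nearest 16-bit half float, then widen it again."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def roundtrip_via_half(v):
    """Normalize a 16-bit integer, squeeze it through a half float, scale back."""
    x = to_half_and_back(v / 65535.0)
    return int(math.floor(x * 65535.0 + 0.5))


# Count how many of the 65536 input levels come back unchanged.
survivors = sum(roundtrip_via_half(v) == v for v in range(65536))
print(survivors, "of 65536 values survive;", 65534, "->", roundtrip_via_half(65534))
```

For example, 65534 / 65535 rounds to exactly 1.0 as a half float, so it comes back as 65535. The half-float functions store a different kind of number in those 16 bits, so they cannot carry arbitrary 16-bit integers without loss.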

Edit:
I fixed the 16-bit → 8-bit → 16-bit pipeline too and attached it.
bitpacking.toe (6.64 KB)