TWIMC,
I've played a little with different optimization options in flambda (OCaml 4.04), and finally all three versions of the loop (curried, uncurried, and the for-loop) have the same performance, though they still lose about 30% to the C version, due to tagging.
Basically, this means that flambda was able to get rid of the allocation. I don't actually know which of the options finally made the difference, but this is how I compiled it:
ocamlopt.opt -c -S -inlining-report -unbox-closures -O3 -rounds 8 -inline-max-depth 256 -inline-max-unroll 1024 -o loop.cmx loop.ml
ocamlopt.opt loop.cmx -o loop.native
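For anyone who wants to try the same flags, here is a rough sketch of what the three loop shapes might look like. This is an assumption for illustration only: the function names (sum_curried, sum_uncurried, sum_for) and the summation body are hypothetical and are not the actual contents of loop.ml, and the sketch does not show which variant originally allocated.

```ocaml
(* Hypothetical stand-in for loop.ml: the real file is not shown in this
   message, so the names and the loop body are assumptions. *)

(* Curried: accumulator, counter and bound are separate arguments. *)
let rec sum_curried acc i n =
  if i > n then acc else sum_curried (acc + i) (i + 1) n

(* Uncurried: the same state is passed as a single tuple argument. *)
let rec sum_uncurried (acc, i, n) =
  if i > n then acc else sum_uncurried (acc + i, i + 1, n)

(* For-loop: imperative version using a mutable reference. *)
let sum_for n =
  let acc = ref 0 in
  for i = 1 to n do
    acc := !acc + i
  done;
  !acc

let () =
  let n = 1_000_000 in
  let a = sum_curried 0 1 n in
  let b = sum_uncurried (0, 1, n) in
  let c = sum_for n in
  (* All three variants must compute the same sum. *)
  assert (a = b && b = c);
  Printf.printf "%d\n" a
```

Since -inlining-report is among the flags above, comparing the report it produces for each of the three functions is probably the easiest way to see what flambda decided for each shape.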
Regards,
Ivan