I used to be a native asm programmer in Z80 and 680x0, and one reason for using XOR rather than MOV is to do with condition codes: the XOR operation will most likely update the condition codes (notably, the Zero flag), whereas MOV will probably not.
Often you would not want the flags updated when simply clearing a register: you're hardly likely to test the Zero flag having just set something to zero, because the result is obvious. More importantly, you may want to set a register to zero while preserving the flags from a previous operation.
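For example (a hypothetical x86 fragment), here MOV is used precisely because the flags set by the compare must survive until the conditional jump:

    cmp ecx, edx      ; sets the flags we actually want to test
    mov eax, 0        ; zero eax WITHOUT touching the flags
    jne .different    ; still branches on the result of the cmp

Had this used `xor eax, eax` instead, the XOR would have set the Zero flag itself and the `jne` would test the wrong result.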
But often you don't care about the flags, so you can use the slightly shorter and/or faster XOR. It was generally shorter and faster because a MOV of an immediate zero had the extra step of fetching that zero from memory (it is part of the instruction encoding), making the instruction both longer and slower.
And that's why it changes with different optimisation levels - the compiler knows when the flags need to be preserved, and if they don't it can get away with using XOR.
It's been a while since I programmed at a low level, but I think it was around the 68k series that caches and multi-stage instruction pipelines started to appear. By alternating instructions working on different things you could get a decent performance gain: if every instruction had to wait for the result of the previous one to complete, the pipeline wasn't running at its best. With careful planning you could insert 'free' instructions, but you had to watch how the flags were altered. We used to spend quite a bit of time optimising code to this level, eking every bit of performance out of the hardware. Great fun.
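A sketch of that kind of interleaving, in x86-style syntax for familiarity (the same idea applied on the 68k):

    ; naive order: the second add stalls waiting for eax,
    ; the fourth stalls waiting for edx
    add eax, ebx
    add eax, ecx
    add edx, esi
    add edx, edi

    ; interleaved: the two independent chains alternate,
    ; so an in-order pipeline has no back-to-back stalls
    add eax, ebx
    add edx, esi
    add eax, ecx
    add edx, edi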
Sure, things have moved on a lot since those days. On modern RISC architectures you can even specify whether an instruction should set the condition flags - ARM, for instance, has a plain `ADD` alongside a flag-setting `ADDS`.
> > Also, I like how returning 0 is "xor eax, eax".
> -O1 is `mov eax, 0`
Simply because it is shorter: on x86-64 (and x86-32), `xor eax, eax` encodes as `31h C0h` or `33h C0h` (depending on the assembler; typically the first one is used) - 2 bytes - while `mov eax, 0x0` encodes as `B8h 00h 00h 00h 00h` - 5 bytes.
Having privately analyzed some 256b demos, I cannot imagine how anyone could come up with the idea of using `mov r32, imm32` to zero a register (except that people don't want to understand how the assembly code is actually encoded) - the canonical way is `xor` (`sub` also works in principle, but `xor` is the way recommended by Intel).
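Side by side (a sketch; the byte counts are easy to verify by assembling with `nasm -f bin` and a `-l` listing):

    xor eax, eax      ; 31h C0h             - 2 bytes, the idiom Intel recommends
    sub eax, eax      ; 29h C0h             - 2 bytes, also always yields zero
    mov eax, 0        ; B8h 00h 00h 00h 00h - 5 bytes

In a 256-byte demo, three bytes saved per zeroed register is a big deal.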
It's not just shorter, it's also faster. But see my answer also: there are condition flag implications of using XOR and sometimes MOV will be preferable. The optimiser will always know best :)
"On Sandybridge this gets even better. The register renamer detects certain instructions (xor reg, reg and sub reg, reg and various others) that always zero a register. In addition to realizing that these instructions do not really have data dependencies, the register renamer also knows how to execute these instructions – it can zero the registers itself. It doesn’t even bother sending the instructions to the execution engine, meaning that these instructions use zero execution resources, and have zero latency! See section 2.1.3.1 of Intel’s optimization manual where it talks about dependency breaking idioms. It turns out that the only thing faster than executing an instruction is not executing it."
It's fascinating how deep the rabbit hole goes these days. One might think machine code as emitted by compilers would be pretty close to where the buck stops, but no: named registers are just an abstraction over a larger physical register pool, opcodes get decoded and optimized into internal micro-ops, and execution order is mostly a hint the processor will ignore if it can get things done faster by reordering or parallelizing... And memory access is probably the greatest illusion of all.
What I also find rather interesting is the concept of macro-op fusion, which Intel introduced with the Core 2 processors: a `cmp ...` (or `test ...`) followed by a conditional jump can/will be fused into a single micro-op. In other words, a sequence of two instructions suddenly maps to one internal micro-op. If you are interested in the details, read section 8.5 in http://www.agner.org/optimize/microarchitecture.pdf
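A sketch of a fusible pair (exactly which compare/branch combinations fuse varies by microarchitecture; Agner's manual has the tables):

    cmp eax, ebx      ; compare...
    jne .mismatch     ; ...and branch: decoded together as one micro-op

    cmp eax, ebx
    mov ecx, edx      ; any instruction between the pair prevents the fusion
    jne .mismatch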
The lower optimization levels are supposed to be more straightforward translations of the high-level language code. You can imagine this might be useful if you are debugging at the assembly level.
On the other hand, I find -O0 is significantly worse than what even a novice human asm programmer would do if asked to manually compile code, and -O1 is around the same as a novice human.
Yes, I used to find that too. It's because, pre-optimization, on older architectures, the compiler outputs chunks of asm as if from a recipe book: loads of unnecessary memory accesses, pointless shuffling of data between registers, and so on.
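For example, roughly what gcc produces for `int f(void) { return 0; }` (Intel syntax, from memory - exact output varies by version and target):

    ; gcc -O0: recipe-book prologue and epilogue around one constant
    f:  push rbp
        mov  rbp, rsp
        mov  eax, 0
        pop  rbp
        ret

    ; gcc -O2: the recipe boiled down
    f:  xor  eax, eax
        ret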
A proficient human coder, on the other hand, writes assembler that is partly optimized by default.
But few humans could write code like a seriously optimizing compiler, esp. on modern pipelined architectures - that stuff is unintelligible. Which is as it should be, because modern processors are not designed to be programmed directly by humans.
Why is it so different with different optimisation levels? The default emits quite a bit of code, while -O1 is just `mov eax, 0`.