My personal story: I used to write exclusively in assembler for the 6502 and 8086, because it actually ran fast enough. In the mid-90s I looked at the code Delphi generated (and Delphi was not known for its optimizations), and it was able to use Pentium instruction pairing, which takes quite an effort to accomplish by hand.
But a human would almost never use some of those more complex instructions, for a very simple reason: they eat too many clock cycles. When one is coding in assembler, one usually targets two constraints:
1. the fewest clock cycles needed to pull off an operation;
2. the fewest bytes needed to encode the operation.
Where those two meet is where the best coders get unbelievable performance out of the hardware. At least that's the case in the demo scene, although many nowadays cheat by banging on the GPU with CUDA or OpenGL.