This is one of those times where I get kind of annoyed that what I tell my 'high-level' language gets 'interpreted' for me, because it knows better. I feel this post may be slightly less arbitrary than the rest, in the sense that the machine-level hack covers something you should be able to use but, depending on your assembler, may not be able to.
Be prepared for a deep dive into ModR/M + SIB with all the Mod and gratuitous binary...because ModR/M and SIB.
Exhibit A:
As we will soon find out, even our debugger is lying to us a little bit. Wow, what a fun journey this is going to be.
Let's take that assembly from the debugger, put it into a source file, assemble it, and see what it looks like in the debugger again:
Exhibit B:
Woah now, assembly is the exact same, but the machine code got a little condensed. Where all the nulls at?
Here's some more machine code:
Oh fancy! Same number of bytes as the first example, but a little more filled out on the assembly end. It's obvious what the 11's in the machine code represent on the assembly side. For now I will say that the 84-40 sequence is what's giving us the rax*2, though it's a little more complicated than that. We want to go on a journey to find out what assembly actually corresponds to the first machine code example.
We are dealing with the below XOR instruction for now:
That whole r/m32,r32 part requires the use of a ModR/M byte.
A lot is going on in these 8 bits, and depending on the first 2 bits, there may be more bytes afterwards. Here's a visual to break apart the pieces in a general case.
In all the machine code for this type of XOR (0x31), this applies to the byte after the 0x31. Before trying to apply that to the machine code above, know that things change if register 2 (the last 3 bits of the byte) is 100 in binary: this opens up SIB byte land. It is regular ESP if Mod is binary 11; otherwise it means a SIB byte follows. Let's work a couple of non-SIB examples to solidify.
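Before working those, here is the field split described above as a minimal Python sketch. The names `split_modrm` and `needs_sib` are mine, made up for illustration, and 0x84 is just the ModR/M-looking byte from the machine code earlier in the post.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11   # top 2 bits: which kind of pointer/offset
    reg = (byte >> 3) & 0b111  # middle 3 bits: the plain register operand
    rm = byte & 0b111          # low 3 bits: the register used as the pointer
    return mod, reg, rm


def needs_sib(byte):
    """True when the low field is 100 and Mod is not 11 (then it is just ESP)."""
    mod, _, rm = split_modrm(byte)
    return rm == 0b100 and mod != 0b11


if __name__ == "__main__":
    mod, reg, rm = split_modrm(0x84)
    print(f"mod={mod:02b} reg={reg:03b} rm={rm:03b} sib={needs_sib(0x84)}")
```

Nothing fancy, just shifts and masks, but it makes the "2 bits, 3 bits, 3 bits" layout easy to poke at.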
We want to do:
XOR [ecx], eax
XOR [ebp + 55h], ecx
XOR [esi + 11111111h], edx
Let's look at the Mod bits for each. The first one is just a pointer with no offset; this corresponds to binary 00 from our image above. The 2nd instruction has an 8-bit offset, which is binary 01 in our image. And the 3rd instruction has a 32-bit offset, which corresponds to Mod 10 (in binary). So far our binary for each ModR/M byte is:
00 xxx xxx
01 xxx xxx
10 xxx xxx
Now let's look at the registers. For the first instruction we have ecx (001), eax (000). The 2nd instruction is ebp (101), ecx (001). Finally, the 3rd instruction is esi (110), edx (010). Know that these two 3-bit fields land in the byte in the opposite order from the assembly: the 2nd operand's register comes first, then the pointer register. The binary for each ModR/M byte now becomes:
00 000 001 / 0000 0001 / 0x01
01 001 101 / 0100 1101 / 0x4d
10 010 110 / 1001 0110 / 0x96
So this now leaves us with these instructions:
31 01
31 4d
31 96
Almost done. The 55h and 11111111h offsets need encoding too. It's pretty simple; the offset just comes right after the ModR/M byte (least significant byte first, not that you can tell with all those 1's). So:
31 01
31 4d 55
31 96 11 11 11 11
Am I lying? Let's look at the debugger:
If you can excuse the 64-bit registers (rcx, rbp, and rsi) and just picture an 'e' instead of the 'r', we are looking good so far.
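The three encodings above can also be rebuilt by hand in a few lines of Python. This is only a sketch under my own assumptions: `encode_xor` and the `REG32` table are names I made up, the caller picks Mod explicitly (that is the whole point of this post), and offsets are treated as raw unsigned values.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_xor(base, src, mod, disp=0):
    """Encode XOR [base + disp], src (opcode 0x31) without a SIB byte."""
    if REG32[base] == 0b100:
        raise ValueError("a pointer field of 100 means SIB, not handled here")
    if mod == 0 and REG32[base] == 0b101:
        raise ValueError("Mod 00 with 101 is a special case, not plain [ebp]")
    # Field order in the byte: Mod, then the 2nd operand, then the pointer register.
    modrm = (mod << 6) | (REG32[src] << 3) | REG32[base]
    disp_len = {0: 0, 1: 1, 2: 4}[mod]  # Mod 00: none, 01: 8-bit, 10: 32-bit
    return bytes([0x31, modrm]) + disp.to_bytes(disp_len, "little")


if __name__ == "__main__":
    print(encode_xor("ecx", "eax", 0).hex(" "))
    print(encode_xor("ebp", "ecx", 1, 0x55).hex(" "))
    print(encode_xor("esi", "edx", 2, 0x11111111).hex(" "))
```

Asking for Mod 10 with an offset of 0 hands you four honest zero bytes, which is more than I can say for some assemblers.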
So what if that 2nd register is a binary 100 putting us into SIB land, what does that do? It makes us have to add a 2nd byte after the ModR/M. This byte allows us to add another kind of offset: register + register * [1,2,4, or 8] + [possible offset defined by ModR/M]. So the plot thickens a little. Here's another diagram:
So for an instruction like XOR [ebx + ecx*4], it's the first 2 bits that define the multiplier of 4, the next 3 bits that define ecx, and the last 3 bits that define ebx (note that the registers get swapped just like in the ModR/M). Without lots of gratuitous examples, let's just encode this one and validate in the debugger. A multiplier of 4 is binary 10, ebx is 011, and ecx is 001; swap the register order and our binary is 10 001 011, or 1000 1011, or 0x8b. So our SIB byte is 8b.
Now let's look back at what the ModR/M would be. We have no offset but are dealing with a pointer, so this is Mod 00. We haven't established the 2nd operand register, so let's just call it 000 (eax). And the other register needs to be 100, which is what specifies the SIB in the first place. So the ModR/M byte would be 00 000 100, or 0000 0100, or 0x04. Our full instruction would then be 31 04 8b. Let's see this in a debugger:
Again, don't mind the 64-bit registers, but this looks perfect.
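Same idea with the SIB byte bolted on, again as a hedged sketch. `encode_xor_sib`, `REG32` and `SCALE_BITS` are illustrative names of my own, and the two special cases I know of (an index field of 100, and a base of 101 under Mod 00) are simply rejected rather than handled.

```python
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}
SCALE_BITS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def encode_xor_sib(base, index, scale, src, mod=0, disp=0):
    """Encode XOR [base + index*scale + disp], src using ModR/M + SIB."""
    if REG32[index] == 0b100:
        raise ValueError("an index field of 100 means 'no index', skipped here")
    if mod == 0 and REG32[base] == 0b101:
        raise ValueError("base 101 with Mod 00 is a special case, skipped here")
    modrm = (mod << 6) | (REG32[src] << 3) | 0b100  # 100 says: SIB follows
    sib = (SCALE_BITS[scale] << 6) | (REG32[index] << 3) | REG32[base]
    disp_len = {0: 0, 1: 1, 2: 4}[mod]
    return bytes([0x31, modrm, sib]) + disp.to_bytes(disp_len, "little")


if __name__ == "__main__":
    # XOR [ebx + ecx*4], eax
    print(encode_xor_sib("ebx", "ecx", 4, "eax").hex(" "))
    # XOR [eax + eax*2 + 11111111h], eax
    print(encode_xor_sib("eax", "eax", 2, "eax", mod=2, disp=0x11111111).hex(" "))
```

The second call reproduces the 84-40 sequence from earlier: 0x84 is the ModR/M saying "32-bit offset, SIB follows", and 0x40 is the SIB saying "eax + eax*2".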
Our first example of xor [eax + eax], eax had the machine code of: 31 84 00 00 00 00 00.
Let's break this down. We know 31 is the XOR part. 0x84 is our ModR/M byte. In binary, that's 1000 0100, or 10 000 100. So the Mod 10 makes this a pointer with a 4-byte offset, interesting. The 000 just means eax/rax. And the 100...oh, we got a SIB, so the next byte will be the SIB. That next byte is 0x00. OK, so 00 000 000 in binary. This means a multiplier of 1, and both registers are eax. The remaining four 00 bytes are the 32-bit offset. This should then be: XOR [eax + eax*1 + 00000000h], eax.
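The breakdown above can be mechanized with a small, deliberately pedantic decoder that refuses to hide the *1 or the zero offset. It's a sketch on my own assumptions: `decode_xor_31` is a made-up name, it only understands opcode 0x31 with 32-bit registers, and it bails out on the special cases instead of decoding them.

```python
REG32_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_xor_31(code):
    """Decode one XOR r/m32, r32 (opcode 0x31), keeping every encoded detail."""
    if code[0] != 0x31:
        raise ValueError("only opcode 0x31 is handled in this sketch")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod == 3:  # register to register, no pointer at all
        return f"xor {REG32_NAMES[rm]}, {REG32_NAMES[reg]}"
    pos = 2
    if rm == 4:  # 100 with Mod != 11: a SIB byte follows
        sib = code[pos]
        pos += 1
        scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
        if index == 4 or (mod == 0 and base == 5):
            raise ValueError("special SIB case not handled in this sketch")
        addr = f"{REG32_NAMES[base]} + {REG32_NAMES[index]}*{scale}"
    else:
        if mod == 0 and rm == 5:
            raise ValueError("special ModR/M case not handled in this sketch")
        addr = REG32_NAMES[rm]
    disp_len = {0: 0, 1: 1, 2: 4}[mod]
    if disp_len:
        disp = int.from_bytes(code[pos:pos + disp_len], "little")
        width = disp_len * 2  # hex digits to print, so zeros stay visible
        addr += f" + {disp:0{width}x}h"
    return f"xor [{addr}], {REG32_NAMES[reg]}"


if __name__ == "__main__":
    print(decode_xor_31(bytes.fromhex("31840000000000")))
```

Unlike a polite disassembler, this one prints exactly what the bytes say, gratuitous parts included.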
Cool, now we're getting somewhere, let's assemble that and see what we get:
FFFFFFFFFFFFFFFFFffffflkdskaalejpaoifjdasfke!!!!!!!!
C'mon assembler! You totally just ignored me. You had no problem when I made the *1 a *2 and the 00000000h a 11111111h in one of the examples above. But if I use *1 you say "meh, it's just rax+rax anyway, don't need the gratuitous *1" and "pfffh, +00000000h is just the same as plus 0, which doesn't do anything, let's just leave that part out."
I know, the actual logical results are no different (though the machine code is different). Why does it matter to anybody if there is no difference? First of all, that's nobody's decision to make but my own. As a general rule: I'm Programmer = I'm the God of this computer! I make the decisions, not the compiler or assembler. In a perfect world maybe. Second of all, here is exhibit C:
So we are seeing a 'multi-byte' nop. If you really look hard, it is pulling this trick off by giving the nop instruction a 32-bit operand. Because ModR/M + SIB apply, and the machine-code size can vary so much with these encodings, this trick can be used to make these NOPs vary in size quite a bit. You need the 66 prefix to help out for the 2, 6 and 9 byte instructions, but still, pretty cool. Based on the long-winded journey above, do you think things will work this ideally? Let's humor this notion:
So obviously that sucks and is terrible. Although, it was just a "Recommended" way to do this from Intel. Let's trick it by using more than *1 and actual values in the offsets:
HAHA, that's better. But remember the point of this series of blog posts? Since Assembly is Too High-Level, we are going to take Intel's recommendation to heart anyway (by programming directly in machine-code):
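For anyone following along without an assembler handy, here are those recommended byte sequences as a Python table, plus a tiny padding helper. Treat it as a sketch: the 2 to 9 byte entries are the ones from Intel's recommended multi-byte NOP table as I know it, the 1-byte 0x90 is the plain old nop thrown in for convenience, and `INTEL_NOPS` and `nop_pad` are names I made up.

```python
# Recommended multi-byte NOP encodings, keyed by length in bytes.
INTEL_NOPS = {
    1: "90",                          # plain nop, not part of the multi-byte table
    2: "66 90",
    3: "0f 1f 00",                    # ModR/M 00: [eax]
    4: "0f 1f 40 00",                 # ModR/M 40: [eax + 8-bit offset]
    5: "0f 1f 44 00 00",              # ModR/M 44 + SIB 00 + 8-bit offset
    6: "66 0f 1f 44 00 00",
    7: "0f 1f 80 00 00 00 00",        # ModR/M 80: [eax + 32-bit offset]
    8: "0f 1f 84 00 00 00 00 00",     # ModR/M 84 + SIB 00 + 32-bit offset
    9: "66 0f 1f 84 00 00 00 00 00",
}


def nop_pad(n):
    """Return n bytes of padding, using the longest NOPs first."""
    out = b""
    while n > 0:
        step = min(n, 9)
        out += bytes.fromhex(INTEL_NOPS[step])
        n -= step
    return out


if __name__ == "__main__":
    for length in sorted(INTEL_NOPS):
        print(length, INTEL_NOPS[length])
    print(nop_pad(12).hex(" "))
```

Notice the 8-byte entry: 84 00 followed by four zero bytes is exactly the ModR/M + SIB + zero offset shape the assembler kept "fixing" for us.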
So that works, and is still pretty terrible; just 2-9 bytes of multi-byte nops? I thought Intel supported instructions up to 15 bytes. ModR/M doesn't really help us turn it up that high. And multi-byte nops really suck for sledding too. So I propose a trick from an older post:
Ahh, the world in the machine is right again. Sled on!