Continued from Optimizing away. Ok, now I have worked up the courage.
Prepare yourself for a major disappointment. I really do not know how to tweak that C#-loop to run a nanosecond faster. But I can do the same calculation much faster. How? Just my old favorite party trick. It goes like this:
1. Add a new project to your solution
2. Choose Visual C++ / CLR / Class Library
3. Insert the following managed class:
public ref class FastImageCompare
{
public:
    static double compare( void* p1, void* p2, int count )
    {
        return NativeCode::fastImageCompare( p1, p2, count );
    }

    static double compare( IntPtr p1, IntPtr p2, int count )
    {
        return NativeCode::fastImageCompare( p1.ToPointer(), p2.ToPointer(), count );
    }
};
4. Insert the following function into an unmanaged class (which I happened to call NativeCode):
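unsigned long long NativeCode::fastImageCompare( void* p1, void* p2, int count )
{
    int high32 = 0;             // upper 32 bits of the 64-bit sum
    _asm
    {
        push    ebx
        push    esi
        push    edi
        mov     esi, p1
        mov     edi, p2
        xor     eax, eax        // eax = lower 32 bits of the sum
    again:
        dec     count
        js      done            // stop once count pixels are processed
        movzx   ebx, [esi]      // byte 0: squared difference into edx
        movzx   edx, [edi]
        sub     edx, ebx
        imul    edx, edx
        movzx   ebx, [esi+1]    // byte 1: squared difference added to edx
        movzx   ecx, [edi+1]
        sub     ebx, ecx
        imul    ebx, ebx
        add     edx, ebx
        movzx   ebx, [esi+2]    // byte 2: squared difference added to edx
        movzx   ecx, [edi+2]
        sub     ebx, ecx
        imul    ebx, ebx
        add     edx, ebx
        add     esi, 4          // next 4-byte pixel (byte 3 is ignored)
        add     edi, 4
        add     eax, edx
        jnc     again
        inc     high32          // carry out of eax goes into high32
        jmp     again
    done:
        mov     edx, high32     // result is returned in edx:eax
        pop     edi
        pop     esi
        pop     ebx
    }
}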
Yeah. That’s it. Hand-tuned assembly language within a .NET assembly.
UPDATE 2009-01-01: The return type of the function changed from “unsigned long” to “unsigned long long”, see here.
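For anyone who does not read x86 fluently, here is roughly what that loop computes, written as plain C++. Take it as a sketch for checking results, not as a replacement: the name compareReference is made up for this example, and it assumes (as the assembly does) that every pixel occupies 4 bytes of which only the first three are compared.

```cpp
#include <cstdint>

// Portable reference for the assembly loop: the sum of squared differences
// over the first three bytes of every 4-byte pixel. The fourth byte is
// skipped, just like the "add esi, 4" / "add edi, 4" step does.
// compareReference is a hypothetical name, not part of the original code.
std::uint64_t compareReference(const void* p1, const void* p2, int count)
{
    const unsigned char* a = static_cast<const unsigned char*>(p1);
    const unsigned char* b = static_cast<const unsigned char*>(p2);
    std::uint64_t sum = 0;  // plays the role of high32:eax
    for (int i = 0; i < count; ++i, a += 4, b += 4) {
        for (int c = 0; c < 3; ++c) {
            const int d = int(a[c]) - int(b[c]);
            sum += static_cast<std::uint64_t>(d * d);
        }
    }
    return sum;
}
```

Running both versions over the same pair of buffers and comparing the sums is a cheap way to catch a mistake in the hand-written code.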
I guess that’s almost cheating. And we will be locked into the Intel platform. Most people won’t mind, I guess, but others may have very strong feelings about it. If we really want to exploit this kind of optimization while still being portable (to Mono/Mac, for example), one possibility would be to load the assembly with the native code dynamically. If that fails, we could fall back to an alternative version written in pure managed code.
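The fallback idea can be sketched like this. Note that this is a plain C++ analogy and not the .NET mechanism itself: in a real C# project the probing step would be a dynamic assembly load wrapped in error handling. All names here (CompareFn, Loader, selectCompare, comparePortable) are invented for the illustration.

```cpp
#include <cstdint>

// Signature shared by the fast and the portable implementation.
using CompareFn = std::uint64_t (*)(const void*, const void*, int);

// Hypothetical probe: returns the optimized routine, or nullptr when it
// is not available on this platform.
using Loader = CompareFn (*)();

// Portable fallback doing the same arithmetic without any assembly:
// squared differences over bytes 0..2 of each 4-byte pixel.
std::uint64_t comparePortable(const void* p1, const void* p2, int count)
{
    const unsigned char* a = static_cast<const unsigned char*>(p1);
    const unsigned char* b = static_cast<const unsigned char*>(p2);
    std::uint64_t sum = 0;
    for (int i = 0; i < count * 4; ++i) {
        if (i % 4 == 3) continue;  // skip the fourth byte of each pixel
        const int d = int(a[i]) - int(b[i]);
        sum += static_cast<std::uint64_t>(d * d);
    }
    return sum;
}

// Try the optimized version first; use the portable one if the probe
// is missing or reports failure.
CompareFn selectCompare(Loader tryLoadNative)
{
    CompareFn fast = tryLoadNative ? tryLoadNative() : nullptr;
    return fast ? fast : comparePortable;
}
```

The selection only has to happen once at startup; after that the rest of the program calls through the returned pointer and never needs to know which version it got.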
(I know from experience that some people with lesser programming skills react to this with a “what? it must be a crappy compiler if you can write faster code by yourself”. Let me assure you that this is not the case. On the contrary: I’m amazed by the quality of the code emitted by the C# and .NET JIT compilers.)
