May 5, 2000
GCC now supports reordering of basic blocks to attempt to reduce the number of mis-predicted branches in the final executable.
GCC will use dynamic branch predictions from profiling or static branch predictions to drive the block reordering algorithm.
An excellent example of how block reordering can improve code can be found in the following example (inner loop from the infamous eqntott.c benchmark):
int cmppt (a, b) PTERM *a[], *b[]; { register int i, aa, bb; for (i = 0; i < ninputs; i++) { aa = a[0]->ptand[i]; bb = b[0]->ptand[i]; if (aa == 2) aa = 0; if (bb == 2) bb = 0; if (aa != bb) { if (aa < bb) { return (-1); } else { return (1); } } } return (0); }
Note the code inside the loop which returns (1) or (-1) in some circumstances. Without block reordering those statements will be inside the loop in the assembly output:
.L6: movswl (%edi,%ebx,2),%ecx movl -16(%ebp), %eax movswl (%eax,%ebx,2),%edx xorl %eax, %eax cmpl $2, %edx sete %al decl %eax andl %eax, %edx xorl %eax, %eax cmpl $2, %ecx sete %al decl %eax andl %eax, %ecx cmpl %ecx, %edx je .L5 movl $0, %eax setge %al leal -1(%eax,%eax), %eax jmp .L2 .p2align 4,,7 .L5: incl %ebx cmpl %esi, %ebx jl .L6
There are two significant problems with the generated code. First many processors will predict the "je .L5" jump as not taken, when it fact it is almost always a taken branch. This can significantly harm performance on such processors since we have a mis-predicted branch each iteration of the loop. Second, the code after "je .L5" up to and including the "jmp .L2" statement pollutes the instruction cache with code that is usually not executed.
With block reordering enabled, we get the following code for the loop:
.L6: movswl (%edi,%ebx,2),%ecx movl -16(%ebp), %eax movswl (%eax,%ebx,2),%edx xorl %eax, %eax cmpl $2, %edx sete %al decl %eax andl %eax, %edx xorl %eax, %eax cmpl $2, %ecx sete %al decl %eax andl %eax, %ecx cmpl %ecx, %edx jne .L14 incl %ebx cmpl %esi, %ebx jl .L6
As you can see, the jump to the exit code has been removed from the loop. Most processors will properly predict the "jne .L14" as not taken resulting in better performance. As a secondary benefit the loop itself is smaller which may improve instruction cache behavior.
Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
These pages are maintained by the GCC team. Last modified 2022-10-26.