Recently in Programming Category
For our example, we'll have an array called data where all values are stored, and an array called results, where an integer value is stored, indicating if the input was bigger or smaller than 50.
So, this is the straighfoward way to do it.
for(int i=0;i<NUM_DATA;++i)
{
if(data[i] <= 50)
results[i] = SMALLERTHAN_50;
else
results[i] = BIGGERTHAN_50;
}
And most of us (and myself 2 days ago) would say: "you won't get much faster than that!". Well, in fact you can. How? By not branching!
In school they show you that branching can make the processor flush the data it has in the pipeline and start over. They also tell you that today's general-purpose processors do not generally take a big hit when branching, but that hit still exists (I'm talking x86 PCs, here. Consoles DO have serious problems with branching).
So, how do you do this without branching? With a small trick ;-)
result = x + (y-x) & condition
If condition has all bits to 0, the second half of the computation will be 0, so result ends up being x. On the other side, if condition has all bits to 1, the x cancels out, and we get result=y.
So we get a function like this (code shamelessly taken from here):
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
return x + ((y - x) & mask); // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
}
If a is positive, it returns x, if negative, returns y.So using this function, we can get our array sorted out just like this:
for(int i=0;i<NUM_DATA;++i)
results[i] = isel(50 - data[i], SMALLERTHAN_50, BIGGERTHAN_50);
It works just as well... the only question is performance. What will run faster?
Well, in my machine (a Core 2 Duo at 2.4GHz), for NUM_DATA=10,000, and 200,000 iterations of those loops, I get:
- Using GCC with no optimizations:
Elapsed time BRANCHING: 19.156000000 sec Elapsed time NOT BRANCHING: 17.566000000 sec- Using GCC with -O3:
Elapsed time BRANCHING: 2.605000000 sec Elapsed time NOT BRANCHING: 1.981000000 secThe difference is not that big without optimizations, but with them, it's clear as day: not branching gets you further. More than a 20%, in this case!!
And this goes to show you what some people might already have told you: compilers inline the hell out of your code if set to high optimization values, even if you don't explicitly ask them to. This code is faster using a function call inside the main loop than without it!
By the way, GCC is NOT generating MMX, SSE, or any instructions like that. This is the disassembly for the main loop in the branchless version (the text on the right is my interpretation for the asm):
.text:00401170 loc_401170: .text:00401170 mov eax, ecx ## eax=50 .text:00401172 sub eax, [ebx+edx*4] ## eax-=data[i] .text:00401175 shr eax, 1Fh ## eax>>=31 .text:00401178 mov [edi+edx*4], eax ## result[i] = eax .text:0040117B inc edx ## ++i .text:0040117C cmp edx, 270Fh ## if(i<NUM_DATA) .text:00401182 jle short loc_401170 ## repeat loop
The process is heavily optimized by GCC, but it's all there in normal, honest-to-fsm, everyday-life instructions.
I do still have some doubts about this, however, as the GCC optimization for the branching loop doesn't use a
j(n)le or jz instruction, but setnle. And I can't find out if this instruction does in fact flush the pipeline or not. Anyway, it still results in a faster processing when using the isel version.
For comparison, here's the optimized branching loop:
.text:004010F0 loc_4010F0: .text:004010F0 xor eax, eax .text:004010F2 cmp dword ptr [ebx+edx*4], 32h .text:004010F6 setnle al .text:004010F9 mov [esi+edx*4], eax .text:004010FC inc edx .text:004010FD cmp edx, 270Fh .text:00401103 jle short loc_4010F0And, well, here's the whole testing code if anyone is curious about it:
#include <cstdlib>
#include <cstdio>
#include <ctime>
enum{ SMALLERTHAN_50, BIGGERTHAN_50};
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
return x + ((y - x) & mask); // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
}
int main ()
{
const int NUM_DATA = 10000;
const size_t NUM_ITERS = 200000;
clock_t startTime, endTime;
int * data = new int[NUM_DATA];
int * results = new int[NUM_DATA];
int * results2 = new int[NUM_DATA];
// Initialize test data
for(int i=0;i<NUM_DATA;++i)
data[i]= rand()%100;
startTime = clock();
// ------------------------------
// Branch test
for (int j=0;j<NUM_ITERS;++j)
for(int i=0;i<NUM_DATA;++i)
{
if(data[i] <= 50)
results[i] = SMALLERTHAN_50;
else
results[i] = BIGGERTHAN_50;
}
// ------------------------------
endTime = clock();
printf ("\tElapsed time BRANCHING: %.9f sec\n",
(float)(endTime-startTime) / CLOCKS_PER_SEC);
startTime = clock();
// ------------------------------
// Branchless test
for (int j=0;j<NUM_ITERS;++j)
for(int i=0;i<NUM_DATA;++i)
results2[i] = isel(
50 - data[i],
SMALLERTHAN_50,
BIGGERTHAN_50);
// ------------------------------
endTime = clock();
printf ("\tElapsed time NOT BRANCHING: %.9f sec\n",
(float)(endTime-startTime) / CLOCKS_PER_SEC);
// Check we didn't mess things up
for(int i=0;i<NUM_DATA;++i)
if( results[i] != results2[i])
printf ("ERROR in elem %i=%i, first value: %i, second value: %i\n",
i, data[i], results[i], results2[i]);
}
I wasn't always like that, mind you. As a teenager, my code was just a horrible mess of spaghetti code. But Pablo Orduña showed me long ago the value of easily understandable code, and since then I always try my best to code in a clear and non-hackish way.
Anyway, these days I've found myself working with code from other people. And, at that, people that didn't try to create readable code. I'm sure they did know how to do it. They just weren't in the mood. You can just read it in the code: "if it works, ship it!"
As a small example (there are much much worse things than this), here's a slightly modified version of a function I found the other day:
bool RetrieveProperty(object_t * object, string propName, VARIANT* pPropertyValue)
{
PropertyReader * pPropertyReader = NULL;
object->FromInterface(IID_IPropertyReader, (void**)&pPropertyReader);
if (pPropertyReader)
{
Properties * pProperties = NULL;
pPropertyReader->get_Properties(&pProperties);
if (pProperties)
{
long propCount = 0;
if (pProperties->get_Count(&propCount) == OK)
{
for (long i = 0; i < propCount; i++)
{
Property * pProperty = NULL;
pProperties->get_Item(i, &pProperty);
if (pProperty)
{
PropertyDictionary * pPropertyDictionary = NULL;
pProperty->FromInterface(IID_PropertyDictionary, (void**)&pPropertyDictionary);
if (pPropertyDictionary)
{
string bDictionaryName;
pPropertyDictionary->get_Name(&bDictionaryName);
if (!strcmp((char*)bDictionaryName.c_str(), "MY_DICTIONARY"))
{
HRESULT hr;
if ((hr = pPropertyDictionary->get_Item(PropertyName, pPropertyValue)) == OK)
return true;
}
}
}
}
}
}
}
return false;
}
Now, did anyone notice something odd? The nested ifs and loops? They make it unnecessarily hard to follow the program flow and they don't buy us anything in return!
In fact,the whole structure of the function allows us to remove most of the 9 (yes, 9) indentation levels in that code. We just have to reverse the ifs and return early. Like this:
bool RetrieveProperty(object_t * object, string propName, VARIANT* pPropertyValue)
{
PropertyReader * pPropertyReader = NULL;
object->FromInterface( IID_IPropertyReader, (void**)&pPropertyReader);
if (!pPropertyReader)
return false;
Properties * pProperties = NULL;
pPropertyReader->get_Properties(&pProperties);
if (!pProperties)
return false;
long propCount = 0;
if (pProperties->get_Count(&propCount) != OK)
return false;
for (long i = 0; i < propCount; i++)
{
Property * pProperty = NULL;
pProperties->get_Item(i, &pProperty);
if (!pProperty)
continue;
PropertyDictionary * pPropertyDictionary = NULL;
pProperty->FromInterface(IID_PropertyDictionary, (void**)&pPropertyDictionary);
if (!pPropertyDictionary)
continue;
string bDictionaryName;
pPropertyDictionary->get_Name(&bDictionaryName);
if (strcmp((char*)bDictionaryName.c_str(), "MY_DICTIONARY"))
continue;
HRESULT hr; // why do we need this, I say?
if (hr = pPropertyDictionary->get_Item(PropertyName, pPropertyValue) == OK)
return true;
}
return false;
}
Max indentation: 3 levels.
It takes less than a minute, it's safe to do, the meaning of the program is exactly the same... and it's much cleaner and easy to understand! And as a bonus, it isn't even longer than the previous version!!
There are even clever people that have said this before me several times!
So please, do try. I know it seems easier and faster not to, but it's only in the short run. Some day (when you have to go back to try to understand your own code) you'll regret it.
Still unconvinced? Now try imagining this with 700 loc functions (and yes, I have a few like that one here...)
Disclaimer: Really long post. If you're not interested in C++ templates (or in correcting me when I write about them), don't read this! Also, if you know more about C++ than I do, please respond with a better solution!
Now, these days I've been reading one of Herb Sutter's wonderful books (Exceptional C++ Style), and one advice he gives is "Where possible, prefer writing functions as nonmembers nonfriends". His arguments seemed pretty solid, so I decided to give it a shot. However, not a day after reading that, I've already found something that keeps me from doing it in certain types of templated code.
For the small engine I've been writing, I've coded a templated Vec2<T> class. And writing some code to test it, g++ gave this wonderful error:
error: no match for 'operator*' in 'geom::Vec2<float>(((const float&)((const float*)(&-1.0e+1f))), ((const float&)((const float*)(&0.0f)))) * ((#'float_expr' not supported by dump_expr#<expression error> * 5.00000000000000010408340855860842566471546888351e-3) + 1.0e+0)'
The line that gave that error is this one:
geom::Vec2<float> v = geom::Vec2<float>(-10,0) * (1+rand()%10*0.005);
The problem (leaving aside some really weird things about float_expr and dump_expr) seems to be that the compiler can't find operator*, even though I did write one. So, what's happening?
Let's see. operator* is defined as a nonmember method, like this:
template <typename T> const Vec2<T> operator*(const Vec2<T> &lhs, T rhs) { return Vec2<T>(lhs.x*rhs,lhs.y*rhs); }
And if we look close at the line where it fails, we see it's a multiplication just like
Vec2<float>(10,0) * 0.005
See the problem now?
The compiler, at the moment of the operator* template instantiation, has to choose the type of the template to instantiate, but finds none that fits exactly. It finds that it should call operator*(Vec2<float>, double), but there's only operator*(Vec2<T>,T) defined, so it just sighs and proclaims "What the hell do you want me to do with this?".
In fact, what we probably want it to do is to convert that double to a float, and then choose the float version of operator*. However, the compiler is not smart enough to do that. As it happens, templated functions parameter type selection and automatic type conversion don't usually mix very well (I read something about it from one of Sutter's books, but I don't remember the exact details). So what can we do?
Option 1 (bad)
One option would be to give the compiler a little nudge (well, not so little really) so it chooses the correct template instantiation. For example, this would compile cleanly:
geom::Vec2<float> v = geom::Vec2<float>(-10,0) * (float)(1+rand()%10*0.005);
However, it's not very polite of us, as utility class programmers, to negate the class user the option to multiply a float vector by a double scalar. So what other option is there?
Option 2 (better?)
Another option would be to make the operator* method to be a template with 2 typenames, one for each side of the operation. Like this:
template <typename T,typename Y>
const Vec2<T> operator*(const Vec2<T> &lhs, Y rhs)
{
return Vec2<T>(lhs.x*rhs,lhs.y*rhs);
}
This would work pretty well.
As a side note, we would have problems if we tried to do multiplications with the scalar in the left side, like:
Vec2<float> v = 2.0 * Vec2<float>(10,0);
So we would have to define the swapped version too:
template <typename T,typename Y> const Vec2<T> operator*(Y lhs, const Vec2<T> &rhs)
{
return Vec2<T>(rhs.x*lhs,rhs.y*lhs);
}
The problem with this approach is that every different type we use will instantiate another version of the code. For this particular method it won't really matter because the compiler will probably inline it anyway. But for more complicated methods it will just instantiate another full version of the method, creating a serious case of template bloat. For example (assuming they are not inlined), each of these expressions would instantiate another version of operator*:
Vec2<float> v1 = Vec2<float>(10,0) * 65; Vec2<float> v2 = Vec2<float>(10,0) * 65.0; Vec2<float> v3 = Vec2<float>(10,0) * 65.0f; Vec2<float> v4 = Vec2<float>(10,0) * 'a';
In fact, if we compile those lines with the -ggdb flag, and then inspect them with gdb, we'll see all the instantiations appear:
(gdb) info function geom::operator*
All functions matching regular expression "geom::operator":
File test.cpp:
const geom::Vec2<float> geom::Vec2<float> const geom::operator*<float, char>(geom::Vec2<float> const&, char);
const geom::Vec2<float> geom::Vec2<float> const geom::operator*<float, double>(geom::Vec2<float> const&, double);
const geom::Vec2<float> geom::Vec2<float> const geom::operator*<float, float>(geom::Vec2<float> const&, float);
const geom::Vec2<float> geom::Vec2<float> const geom::operator*<float, int>(geom::Vec2<float> const&, int);
I know this is an exaggerated example, but template bloat (excessive template instantiation) can be a real problem. So, in the end, I'm not sure this option is a very good idea.
Option 3 (almost good?)
Another option (even though we wanted to avoid it from the start to follow the advice of Mr Sutter) is a member method. As simple as this:
template <typename T> struct Vec2{...Vec2<T> operator*(T scalar) const{return Vec2<T>(x*scalar,y*scalar);}};
T it can accept, and it's clear that we want to convert the type of scalar to that type T. And for the same reason, it would only instantiate once (again, assuming the compiler won't inline it because of size or something).But we would still have the problem with the swapped version. If we try to do a multiplication like 2.0*Vec2<float>(10,0), we'll get a nice compiler error. What can we do about that?
The best way I've found is to do this:
template <typename T, typename Y>const Vec2<T> operator*(Y lhs, const Vec2<T> &rhs){return rhs*lhs;}
And I've already run out of ideas to get this to work properly. If anyone knows of a more elegant way to do it, or has found an error in code, logic, technique, grammar or whatever, please leave a comment :-)