The value of being stuck
If you see me traveling alone in a bus, train, underground or plane, it’s more than likely you’ll find me completely absorbed, writing on a small notebook. My father says I look like a paranoid schizophrenic, and he might have a point. However, that habit has given me the best ideas I’ve ever had. For example, I developed 90% of the techniques I used in my masters’ thesis in those small notebooks, while trying to keep my balance in the Madrid underground.
Having half an hour a day where I have no internet connection, no TV, no books and nobody to talk to gives my brain the opportunity to give shape to any small idea I might have had in the last days. Basically, I need to be stuck in a place where I can do nothing else but think. Some people use the shower for that, but I have a terrible memory, so I need to write things down to remember them later!
From late 2009 to late 2011 I worked from home, and lately I’ve come to realize that staying at home all day meant I didn’t really have any time to think. I was always online, always reading some article, blog post or documentation, writing code or watching a TV series. I learned a terrible amount of things in that time, but in the end all this meant that my brain was acting only as a data storage system, recording every little fact it could, but never thinking about them.
Now I have at least 30 minutes a day of “alone time”, and I always use that time to think about any interesting projects I have in mind. And let me tell you, I’ve never had so many good ideas in my life.
So, find a way to be stuck in one place for a few minutes every day, and think!
Motivation in games
Lately I’ve been thinking about games, and what is it that makes them work. Most likely, everything I’ll write here has already been said by somebody else, but I find that putting things down in paper (or blog) is useful to get my thoughts clear.
- Acceptance, the need for approval
- Curiosity, the need to learn
- Eating, the need for food
- Family, the need to raise children
- Honor, the need to be loyal to the traditional values of one’s clan/ethnic group
- Idealism, the need for social justice
- Independence, the need for individuality
- Order, the need for organized, stable, predictable environments
- Physical activity, the need for exercise
- Power, the need for influence of will
- Romance, the need for sex
- Saving, the need to collect
- Social contact, the need for friends (peer relationships)
- Social status, the need for social standing/importance
- Tranquility, the need to be safe
- Vengeance, the need to strike back/to win
New year, new me
Trip to Kenya
I didn’t tell most people, but on April 1st I went on a 10 day trip to Kenya. I’m planning to upload here a good portion of the nearly a thousand photos we made during the trip, but until then, here’s a good one to show that IT-related services is never the best paid career. Not even down there ;-)
GPU Raymarching with Distance Fields
These last two weeks I’ve been pretty distracted with personal issues, and to get back on track it always helps me to start a small personal project in which I experiment something I’ve never done before. So here we go: GPU Raymarching using Distance Fields.
On C++ performance: The Evil Mr. Branch
Here’s a simple problem . For some reason, you’ve got to select between two types or values: smaller (or equal) than 50, and bigger than 50. And you need to do that fast using one core (no multithreading). ¿How do you do it?
For our example, we’ll have an array called data where all values are stored, and an array called results, where an integer value is stored, indicating if the input was bigger or smaller than 50.
So, this is the straighfoward way to do it.
|
1 2 3 4 5 6 7 |
for(int i=0;i<1000;++i)
{
if(data[i] < 50)
results[i] = SMALLERTHAN_50;
else
results[i] = BIGGERTHAN_50;
} |
And most of us (and myself 2 days ago) would say: “you won’t get much faster than that!”. Well, in fact you can. How?
By not branching!
In school they show you that branching can make the processor flush the data it has in the pipeline and start over. They also tell you that today’s general-purpose processors do not generally take a big hit when branching, but that hit still exists (I’m talking x86 PCs, here. Consoles DO have serious problems with branching).
So, how do you do this without branching? With a small trick ;-)
|
1 |
result = x + (y-x) & condition |
If condition has all bits to 0, the second half of the computation will be 0, so result ends up being x. On the other side, if condition has all bits to 1, the x cancels out, and we get result=y.
So we get a function like this (code shamelessly taken from here):
|
1 2 3 4 5 |
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
return x + ((y - x) & mask); // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
} |
If a is positive, it returns x, if negative, returns y.
So using this function, we can get our array sorted out just like this:
|
1 |
for(int i=0;i<100;++i) results[i] = isel(50 - data[i], SMALLERTHAN_50, BIGGERTHAN_50); |
It works just as well… the only question is performance. What will run faster?
Well, in my machine (a Core 2 Duo at 2.4GHz), for NUM_DATA=10000, and 200000 iterations of those loops, I get:
- Using GCC with no optimizations:
|
1 2 |
Elapsed time BRANCHING: 19.156000000 sec
Elapsed time NOT BRANCHING: 17.566000000 sec |
- Using GCC with -O3:
|
1 2 |
Elapsed time BRANCHING: 2.605000000 sec
Elapsed time NOT BRANCHING: 1.981000000 sec |
The difference is not that big without optimizations, but with them, it’s clear as day: not branching gets you further. More than a 20%, in this case!!
And this goes to show you what some people might already have told you: compilers inline the hell out of your code if set to high optimization values, even if you don’t explicitly ask them to. This code is faster using a function call inside the main loop than without it!
By the way, GCC is NOT generating MMX, SSE, or any instructions like that. This is the disassembly for the main loop in the branchless version (the text on the right is my interpretation for the asm):
|
1 2 3 4 5 6 7 8 |
.text:00401170 loc_401170:
.text:00401170 mov eax, ecx ## eax=50
.text:00401172 sub eax, [ebx+edx*4] ## eax-=data[i]
.text:00401175 shr eax, 1Fh ## eax>>=31
.text:00401178 mov [edi+edx*4], eax ## result[i] = eax
.text:0040117B inc edx ## ++i
.text:0040117C cmp edx, 270Fh ## if(i<NUM_DATA)
.text:00401182 jle short loc_401170 ## repeat loop |
The process is heavily optimized by GCC, but it’s all there in normal, honest-to-fsm, everyday-life instructions.
I do still have some doubts about this, however, as the GCC optimization for the branching loop doesn’t use a j(n)le or jz instruction, but setnle. And I can’t find out if this instruction does in fact flush the pipeline or not. Anyway, it still results in a faster processing when using the isel version.
For comparison, here’s the optimized branching loop:
|
1 2 3 4 5 6 7 8 |
.text:004010F0 loc_4010F0:
.text:004010F0 xor eax, eax
.text:004010F2 cmp dword ptr [ebx+edx*4], 32h
.text:004010F6 setnle al
.text:004010F9 mov [esi+edx*4], eax
.text:004010FC inc edx
.text:004010FD cmp edx, 270Fh
.text:00401103 jle short loc_4010F0 |
And, well, here’s the whole testing code if anyone is curious about it:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
#include
#include
#include
enum{ SMALLERTHAN_50, BIGGERTHAN_50};
int isel( int a, int x, int y )
{
int mask = a >> 31; // arithmetic shift right, splat out the sign bit
return x + ((y - x) & mask); // mask is 0xFFFFFFFF if (a < 0) and 0x00 otherwise.
}
int main ()
{
const int NUM_DATA = 10000;
const size_t NUM_ITERS = 200000;
clock_t startTime, endTime;
int * data = new int[NUM_DATA];
int * results = new int[NUM_DATA];
int * results2 = new int[NUM_DATA];
// Initialize test data
for(int i=0;i<100;++i) data[i]= rand()%100;
startTime = clock();
// ------------------------------
// Branch test
for (int j=0;j<NUM_ITERS;++j)
for(int i=0;i<NUM_DATA;++i) {
if(data[i] < 50 )
results[i] = SMALLERTHAN_50;
else
results[i] = BIGGERTHAN_50;
}
// ------------------------------
endTime = clock();
printf ("\tElapsed time BRANCHING: %.9f sec\n",(float)(endTime-startTime) / CLOCKS_PER_SEC);
startTime = clock();
// ------------------------------
// Branchless test
for (int j=0;j<NUM_ITERS;++j)
for(int i=0;i<NUM_DATA;++i)
results2[i] = isel( 50 - data[i], SMALLERTHAN_50, BIGGERTHAN_50);
// ------------------------------
endTime = clock();
printf ("\tElapsed time NOT BRANCHING: %.9f sec\n",(float)(endTime-startTime) / CLOCKS_PER_SEC);
// Check we didn't mess things up
for(int i=0;i<NUM_DATA;++i)
if( results[i] != results2[i])
printf ("ERROR in elem %i=%i, first value: %i, second value: %i\n",i, data[i], results[i],results2[i]);
} |
C++: Could you please code cleanly? Pretty please?
One of my little things when coding is that I always feel ashamed when my code is not clean.
This means I try very hard to make my code simple and easy to understand. I don’t always succeed (not even close), but I try.
I wasn’t always like that, mind you. As a teenager, my code was just a horrible mess of spaghetti code. But Pablo Orduña showed me long ago the value of easily understandable code, and since then I always try my best to code in a clear and non-hackish way.
Anyway, these days I’ve found myself working with code from other people. And, at that, people that didn’t try to create readable code. I’m sure they did know how to do it. They just weren’t in the mood. You can just read it in the code: “if it works, ship it!”
As a small example (there are much much worse things than this), here’s a slightly modified version of a function I found the other day:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
bool RetrieveProperty(object_t * object, string propName, VARIANT* pPropertyValue)
{
PropertyReader * pPropertyReader = NULL;
object->FromInterface(IID_IPropertyReader, (void**)&pPropertyReader);
if (pPropertyReader)
{
Properties * pProperties = NULL;
pPropertyReader->get_Properties(&pProperties);
if (pProperties)
{
long propCount = 0;
if (pProperties->get_Count(&propCount) == OK)
{
for (long i = 0; i < propCount; i++)
{
Property * pProperty = NULL;
pProperties->get_Item(i, &pProperty);
if (pProperty)
{
PropertyDictionary * pPropertyDictionary = NULL;
pProperty->FromInterface(IID_PropertyDictionary, (void**)&pPropertyDictionary);
if (pPropertyDictionary)
{
string bDictionaryName;
pPropertyDictionary->get_Name(&bDictionaryName);
if (!strcmp((char*)bDictionaryName.c_str(), "MY_DICTIONARY"))
{
HRESULT hr;
if ((hr = pPropertyDictionary->get_Item(PropertyName, pPropertyValue)) == OK)
return true;
}
}
}
}
}
}
}
return false;
} |
Now, did anyone notice something odd? The nested ifs and loops? They make it unnecessarily hard to follow the program flow and they don’t buy us anything in return!
In fact,the whole structure of the function allows us to remove most of the 9 (yes, 9) indentation levels in that code. We just have to reverse the ifs and return early. Like this:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
bool RetrieveProperty(object_t * object, string propName, VARIANT* pPropertyValue)
{
PropertyReader * pPropertyReader = NULL;
object->FromInterface( IID_IPropertyReader, (void**)&pPropertyReader);
if (!pPropertyReader)
return false;
Properties * pProperties = NULL;
pPropertyReader->get_Properties(&pProperties);
if (!pProperties)
return false;
long propCount = 0;
if (pProperties->get_Count(&propCount) != OK)
return false;
for (long i = 0; i < propCount; i++)
{
Property * pProperty = NULL;
pProperties->get_Item(i, &pProperty);
if (!pProperty)
continue;
PropertyDictionary * pPropertyDictionary = NULL;
pProperty->FromInterface(IID_PropertyDictionary, (void**)&pPropertyDictionary);
if (!pPropertyDictionary)
continue;
string bDictionaryName;
pPropertyDictionary->get_Name(&bDictionaryName);
if (strcmp((char*)bDictionaryName.c_str(), "MY_DICTIONARY"))
continue;
HRESULT hr; // why do we need this, I say?
if (hr = pPropertyDictionary->get_Item(PropertyName, pPropertyValue) == OK)
return true;
}
return false;
} |
Max indentation: 3 levels.
It takes less than a minute, it’s safe to do, the meaning of the program is exactly the same… and it’s much cleaner and easy to understand! And as a bonus, it isn’t even longer than the previous version!!
There are even clever people that have said this before me several times!
So please, do try. I know it seems easier and faster not to, but it’s only in the short run. Some day (when you have to go back to try to understand your own code) you’ll regret it.
Still unconvinced? Now try imagining this with 700 loc functions (and yes, I have a few like that one here…)
New job!
Tomorrow monday is officially my first day working for RandomControl as the fryrender plugins subsystem developer and maintainer.
Voronoi video-filling
After showing the last test to a colleague, he guessed that this method would look really bad if seen in motion, using a random sampling method for the image. He thought that a constant sampling pattern would be needed… and so I had to test it out.
Voronoi image-filling
Little test to see what can you do to approximate a big image using only a few samples, and using Voronoi diagrams for that. It’s not amazingly cool or anything, but it runs in real-time :)