GPU-Based Object Picking Without Physics!

Earlier this week, my team and I were discussing the level editor. Since we’re making a first-person shooter, level design will be critical to the game’s success, and one of the big things we hadn’t implemented yet was object-picking. If you aren’t familiar, object-picking is the process by which you select an object in 3D space, normally using the mouse. Although we’re using the Bullet physics engine, the system for adding physics bodies to objects isn’t fully in place, so ray-casting was off the table. With the milestone presentation next week (and object-picking being pretty mandatory), we were bouncing ideas off each other as to how we were going to implement it. In an off-handed comment, a teammate asked me if there was anything we could possibly do with the GPU. Hmmm… How could I replace ray-casting?

Now, ray-casting is fairly cut and dried. All it really requires is firing a ray off into your game world and detecting where/what it hits. When it comes to tying this in with object-picking, the process is pretty straightforward:

  1. Convert your mouse coordinates into world-space coordinates.
  2. Get a vector from the camera’s position in world space to the mouse’s position in world space (steps 1 and 2 are sketched in code just after this list).
  3. Using this vector as the basis of your ray-cast, go through all the objects in your scene, using some linear algebra to solve for intersections. Usually some culling is done here to help reduce the computation.
  4. Find the object you’re looking for.
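
For reference, steps 1 and 2 usually boil down to unprojecting the mouse position at the near and far planes. Here’s a rough sketch using DirectXMath rather than anything from our engine; the viewport size and the camera’s view/projection matrices are assumed to come from the caller:

#include <DirectXMath.h>
using namespace DirectX;

// Build a world-space picking ray from a mouse position (in pixels).
// The view/projection matrices and viewport size come from the camera.
void MouseToRay(float mouseX, float mouseY,
                float viewportWidth, float viewportHeight,
                FXMMATRIX view, CXMMATRIX proj,
                XMVECTOR& rayOrigin, XMVECTOR& rayDir)
{
    // Unproject the mouse position at the near plane (z = 0) and far plane (z = 1).
    XMVECTOR nearPoint = XMVector3Unproject(
        XMVectorSet(mouseX, mouseY, 0.0f, 0.0f),
        0.0f, 0.0f, viewportWidth, viewportHeight, 0.0f, 1.0f,
        proj, view, XMMatrixIdentity());
    XMVECTOR farPoint = XMVector3Unproject(
        XMVectorSet(mouseX, mouseY, 1.0f, 0.0f),
        0.0f, 0.0f, viewportWidth, viewportHeight, 0.0f, 1.0f,
        proj, view, XMMatrixIdentity());

    // The ray starts at the near-plane point and heads toward the far-plane point.
    rayOrigin = nearPoint;
    rayDir = XMVector3Normalize(XMVectorSubtract(farPoint, nearPoint));
}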

These steps aren’t too complicated, but step three is the troublesome part. Without integrated physics, we have no ray-casting system available to us. Doing object-picking this way is also a pain because everything, whether it really needs physics or not, must be assigned a physics object. Not excruciating, but I asked myself if there was another way to do it. After a little thinking, I came up with a wonky but potentially workable solution: I’d store information about each pixel on the screen, then sample it using the mouse’s position!

Now, you may recognize this as the basis of deferred rendering, which is exactly why I chose it. Our graphics engine currently supports a full deferred pipeline, so I could hide the ID data in a little bit of extra space in my last buffer. I won’t go into how deferred rendering works as a whole, but my current G-buffer layout looks like this (minus the depth buffer):

[Figure: my current G-buffer layout. The last render target (DXGI_FORMAT_R8G8B8A8_UNORM) only uses its first channel for the normalized specular power, leaving 24 bits free in the remaining three channels.]

See those 24 bits at the bottom? That’s my ticket. My plan was to store the entity ID of each object in the pixels it renders to. Seems easy enough, right? I thought so too! So I tried to implement this, then immediately ran into several problems.

  1. DirectX 11 doesn’t support a render-target format like DXGI_FORMAT_R8_UNORM_G8B8A8_UINT.
  2. HLSL doesn’t have a method for directly interpreting bits in the 4.0 shader model (it’s in 5.0 though!).
  3. SAMPLING. You really don’t want to bilinearly interpolate your ID values; who knows what you’d get back?
  4. I have a 24-bit number, but only three 8-bit channels to store it in (assuming they even fit properly).

So, first problem: the render target for the specular power has the format DXGI_FORMAT_R8G8B8A8_UNORM. Each pixel has four 8-bit unsigned-normalized floats allocated to it. I want the floating-point accuracy for the specular power in the first channel, but I want to use the other three channels as integers. Off the bat, I assumed they just crammed the IEEE floating-point format into 8 bits, but I was very wrong.

In reality, a UNORM isn’t really a float. If you were able to look at the bit representation, you’d see that it looks just like an integer in memory, because it is. The way a UNORM works is that each bit pattern maps to a floating-point value, evenly distributed from zero to one. For example, if we had only 2 bits, our values would be:

bits  float
 00   0.00f
 01   0.33f
 10   0.66f
 11   1.00f

So for an 8-bit UNORM value, each step is 1/255. The GPU reads in the bit pattern, converts it to a float, then passes it to the shader.

That means (in theory) I could cast my integer to a normalized float in one shader, which would get turned back into an integer by the GPU, which would then be converted back into a float for another shader, just in time for me to cast that value back into an integer. Sounds insane, but it works. I was worried about floating-point error, as nothing would break this faster than my object IDs picking up noise along the way. However, this worked out fine, solving the first two problems.
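
If you want to convince yourself, here’s the round trip written out on the CPU. It’s just a sketch of how I understand the conversion to behave (assuming the hardware rounds to the nearest 8-bit step when converting a float to UNORM), not engine code:

#include <cassert>
#include <cmath>
#include <cstdint>

int main()
{
    // Round-trip every possible byte of an object ID through the UNORM encoding:
    // the shader writes id / 255, the GPU stores the nearest 8-bit step,
    // a later shader reads step / 255, and we scale back up and round.
    for (uint32_t id = 0; id <= 255; ++id)
    {
        float written = id / 255.0f;                                   // what the first shader outputs
        uint8_t stored = (uint8_t)std::lround(written * 255.0f);       // float -> UNORM (assumed nearest-step)
        float readBack = stored / 255.0f;                              // UNORM -> float on the next pass
        uint32_t recovered = (uint32_t)std::lround(readBack * 255.0f); // scale back up in the shader
        assert(recovered == id);                                       // exact for every 8-bit value
    }
    return 0;
}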

To fix the sampling problem, HLSL supports a small `Load` function that I haven’t found mentioned in many places online. It accesses the resource directly, like an array, with no filtering, and takes an int3 as the parameter. The first two integers are the pixel coordinates, with the Z integer being the mip level.

Finally, I had to distribute a 24-bit number into three parts. Splitting a number up is trivial with some bit-wise operators, and with a little work I reached the following:

// packing function
PS_GBUFFER_OUT PackGBuffer(
    float3 BaseColor,
    float3 Normal,
    float SpecIntensity,
    float SpecPower,
    float emissive )
{
    PS_GBUFFER_OUT Out;

    // Normalize the specular power
    float SpecPowerNorm = (SpecPower - g_SpecPowerRange.x) / g_SpecPowerRange.y;

    // split the 24-bit entity ID into three bytes
    // (objID is assumed to come in from a per-object constant buffer)
    uint word1 = objID & 0xff;
    uint word2 = (objID >> 8) & 0xff;
    uint word3 = (objID >> 16) & 0xff;

    // Pack all the data into the GBuffer structure
    Out.ColorSpecInt = float4(BaseColor.rgb, SpecIntensity);
    Out.Normal = float4(Normal.xyz * 0.5 + 0.5, emissive);
    Out.SpecPow = float4(SpecPowerNorm, (float)(word1) / 255.f, (float)(word2) / 255.f, (float)(word3) / 255.f);

    return Out;
}

Awesome! The data was being safely, accurately stored within video memory. But now I had to bring that data back to the CPU.

My first attempt was to make the render target readable from the CPU and map the whole buffer back for reading. This was a terrible idea. Performance tanked. Turns out trying to move an entire 1920×1080 buffer from the GPU to the CPU 60 times a second is a bad idea. This brought me to compute shaders. Using a compute shader, I could look at a tiny part of the buffer, bundle it up, and send just that back to the CPU. This turned out to be fairly easy, and after a little tinkering I integrated this into the engine:

//input texture we want to read from
Texture2D inputTexture : register(t0);

//input of what position we want to get
cbuffer MousePosition : register(b0)
{
    int4 mousePos;
};

//defining the output to CPU
struct CS_OUTPUT
{
    uint id;
};

//specify the output to the CPU as a read-write buffer
RWStructuredBuffer<CS_OUTPUT> gOutput : register(u0);

//entry point, we only need 1 thread
[numthreads(1, 1, 1)]
void main(void)
{
    //read value from texture
    float4 value = inputTexture.Load(int3(mousePos.x, mousePos.y, 0));

    //grab the raw ID values and scale them back to their integer values
    //(round before truncating so floating-point error can't knock a value down by one)
    uint word1 = (uint)round(value.y * 255.f);
    uint word2 = (uint)round(value.z * 255.f);
    uint word3 = (uint)round(value.w * 255.f);

    //shift values into place, combine to get the final ID
    gOutput[0].id = word1 + (word2 << 8) + (word3 << 16);
}

To my amazement, it worked after only a few bumps in the road. Turns out sending/retrieving data to the compute shader is a little bit (read: a lot) different than regular shaders, since compute shaders aren’t a part of the regular pipeline, but overall those problems were more syntax-related in nature, so I won’t bother mentioning them.
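
For anyone curious about the CPU side, the D3D11 plumbing looks roughly like this. This is a minimal sketch rather than our engine code: the device, context, compiled compute shader, the SRV for the G-buffer target holding the IDs, and the mouse-position constant buffer (pickingCS, idTextureSRV, mousePosCB below) are all assumed to already exist.

#include <d3d11.h>

// Matches the CS_OUTPUT struct in the compute shader.
struct CS_OUTPUT { UINT id; };

// One-element structured buffer the compute shader writes into (u0),
// plus a staging copy that the CPU is allowed to map.
ID3D11Buffer* outputBuffer = nullptr;
ID3D11Buffer* stagingBuffer = nullptr;
ID3D11UnorderedAccessView* outputUAV = nullptr;

void CreatePickingResources(ID3D11Device* device)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(CS_OUTPUT);
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(CS_OUTPUT);
    device->CreateBuffer(&desc, nullptr, &outputBuffer);

    D3D11_BUFFER_DESC stagingDesc = desc;
    stagingDesc.Usage = D3D11_USAGE_STAGING;
    stagingDesc.BindFlags = 0;
    stagingDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    device->CreateBuffer(&stagingDesc, nullptr, &stagingBuffer);

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_UNKNOWN;
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.NumElements = 1;
    device->CreateUnorderedAccessView(outputBuffer, &uavDesc, &outputUAV);
}

// Run once per frame after the G-buffer pass.
UINT ReadPickedID(ID3D11DeviceContext* context,
                  ID3D11ComputeShader* pickingCS,
                  ID3D11ShaderResourceView* idTextureSRV,
                  ID3D11Buffer* mousePosCB)
{
    // Bind the shader, the ID texture (t0), the mouse position (b0),
    // and the output buffer (u0), then run a single thread group.
    context->CSSetShader(pickingCS, nullptr, 0);
    context->CSSetShaderResources(0, 1, &idTextureSRV);
    context->CSSetConstantBuffers(0, 1, &mousePosCB);
    context->CSSetUnorderedAccessViews(0, 1, &outputUAV, nullptr);
    context->Dispatch(1, 1, 1);

    // Copy the 4-byte result into the staging buffer and map it.
    context->CopyResource(stagingBuffer, outputBuffer);

    UINT pickedID = 0;
    D3D11_MAPPED_SUBRESOURCE mapped;
    if (SUCCEEDED(context->Map(stagingBuffer, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        pickedID = static_cast<CS_OUTPUT*>(mapped.pData)->id;
        context->Unmap(stagingBuffer, 0);
    }
    return pickedID;
}

One thing worth noting: mapping the staging buffer right after the dispatch makes the CPU wait for that bit of GPU work to finish, which is part of why doing this once per frame (and living with the one-frame delay described further down) is appealing.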

Running this through my GPU profiler, I couldn’t actually see any noticeable overhead, which is pretty awesome. I’m trying to get a lower-end computer to run it on so that the impact might be a little more apparent. But for now, I’m happy enough to let it stay in the engine.


Now, this method isn’t perfect, and it has a few glaring issues:

  1. Any object that is completely occluded cannot be selected under any circumstance. Since the depth buffer rejects occluded pixels, only the front-most pixels end up stored in the buffer. The main edge cases to think about are objects that are completely hidden behind something else, and objects so far from the camera that they’re too small to actually be rasterized. Also, with a box selection you wouldn’t actually select all the objects inside the box, just the ones on top. Personally, I write this off as a hidden bonus (I hate this about Maya, I’m always selecting hidden vertices/edges), but some people disagree.
  2. Dynamic bounding-box selection can’t happen. You have to define the type and size of your buffers before you can use them, and dynamically creating memory buffers at run-time is just asking for a performance hit. Allocating enough memory for the whole screen would take us back to square one, and we might as well just copy the whole screen back at that point.
  3. How on earth do you get the information back to the game engine?

Fixing the first one would be a nightmare. You’d need… Tiered render-targets? Multiple render passes? Both are terrible ideas. Maybe attempt to somehow use a mask of sorts? No idea.

However, that second problem could be addressed. As a preface: I haven’t implemented this side of it yet, so this is pure speculation.

You’d definitely need a bigger buffer to ferry data back to the CPU. This would take some experimentation, as you’d have to balance out your buffer size. A bigger buffer means fewer trips, but could mean that your last trip comes back mostly empty or, in the case of a small box, is complete overkill. A smaller buffer means less waste, but with large bounding boxes you’d have to make a ton of trips; for a normal amount of data, though, it wouldn’t be as wasteful. I think the ideal solution is to use two buffers: one bigger buffer and one smaller one. You could pass some type of boolean to the GPU to specify where to read, and carefully use some conditionals that don’t perform any real branching. You would then make several calls to the compute shader, using thread groups to process the area of the screen that you want. It would cull copies, load up the buffer with the IDs, then ship it back to you. Presto!

The third problem is easy when you’re only picking a single spot. Simply sample the ID buffer every frame, then store the result. I plan to multi-thread my graphics engine in the upcoming months, so there’s no way I’m letting people make calls to the GPU requesting ID data whenever they want; that would make the pipeline unmanageable. Implementing it this way means that I’m always one frame behind; however, after testing I couldn’t notice the delay. 1/60th of a second is pretty fast, so if somebody notices I’d be pretty surprised.
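
In engine terms, I picture the interface looking something like this; a toy sketch with made-up names, not the actual system:

#include <cstdint>

// Hypothetical picking interface: the renderer refreshes the cached ID
// once per frame (right after the G-buffer pass), and gameplay/editor code
// only ever reads last frame's result. Nobody outside the renderer
// touches the GPU.
class PickingSystem
{
public:
    // Called by the renderer after it dispatches the picking compute
    // shader and reads the staging buffer back.
    void StorePickResult(uint32_t entityID) { m_lastPickedID = entityID; }

    // Called from anywhere, any time; always one frame behind.
    uint32_t GetHoveredEntityID() const { return m_lastPickedID; }

private:
    uint32_t m_lastPickedID = 0;
};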

Doing this for a bounding box is another story. I guess you could try it with the one frame delay, but how do other programmers request it? You’d have to preemptively start scanning before the call was finished. In reality it would always be tied to the mouse, so maybe you could use the delta in the mouse’s direction to estimate its future position, and therefore a box? Perhaps have some system where you keep the whole frame in CPU memory, only reading when it gets invalidated? Definitely would be sketchy, but might be possible. Something to be tried in the future.
