Zebadiah
 
The Blog

Pseudo-instancing via the Texture Buffer in OpenGL and Java

Putting this here mostly for myself, but maybe someone else will find it useful.

I’ve been working on writing a versatile GLRenderer that will gracefully degrade on less-than-modern hardware, but that is also compatible with the new ways of doing things in OpenGL (no matrix stack and whatnot). One of the more important use cases for the project I’m working on is drawing many instances of the same geometry at different locations on the screen. It is also important that the position of each instance can be updated every draw step. On modern hardware, you would use geometry instancing via a command like glDrawElementsInstanced, but in this case we’re going to assume that instancing support is not available and instead use some neat tricks and a custom vertex shader to fake it. It was a pretty painful process to figure this out, especially in Java, where there seem to be so few examples online, so I’m going to try to walk through this in excruciating detail for when I forget it all in a few months. This is all JOGL/GLSL btw.

I started from a very basic set up, with one caveat that everything would be specialized from the beginning for 2D rendering, since that is all that is required for this project. You can see the full source here. The key drawing parts are explained below.

Vertex Shader

uniform mat3 mvp;
attribute vec2 vertex;

void main(void)  
{
	gl_Position = vec4(mvp*vec3(vertex, 1), 1); 
}

The vertex shader starts out about as simple as it gets. ‘mvp’ is the combined model/view/projection matrix that is used to transform the vertices and ‘vertex’ is of course the position of the current vertex. The only thing to note is how I’m creating a vec3 from (x, y, 1) to multiply by the mvp matrix. See homogeneous coordinates to understand the reasoning behind this, but the short version is that the extra 1 lets a single matrix multiplication apply a translation as well as scaling and rotation.
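To make the homogeneous-coordinate trick concrete, here is a minimal Java sketch of what the shader’s multiplication does, worked out on the CPU. The class and method names are made up for illustration, and it assumes the column-major layout that GLSL uses for mat3. It is not the Matrix3x3 class from the repo.

```java
// Hypothetical helper, not part of the original project.
final class Homogeneous2D
{
	// Multiplies a column-major 3x3 matrix by the column vector (x, y, 1)
	// and returns the resulting {x', y', w'}.
	static float[] transform(float[] m, float x, float y)
	{
		return new float[] {
			m[0] * x + m[3] * y + m[6],
			m[1] * x + m[4] * y + m[7],
			m[2] * x + m[5] * y + m[8]
		};
	}

	// A pure translation: the offsets sit in the third column, which only
	// takes effect because the vertex was extended with a trailing 1.
	static float[] translation(float tx, float ty)
	{
		return new float[] {
			1, 0, 0,   // first column
			0, 1, 0,   // second column
			tx, ty, 1  // third column
		};
	}
}
```

Transforming the point (1, 2) with translation(3, 4) gives (4, 6, 1). Without the trailing 1 the third column would never contribute and the translation would be lost.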

GLRenderer

public class GLRenderer extends GLCanvas implements GLEventListener, WindowListener
{
	static final int NUM_THINGS = 50000;
	static final float SIZE = .1f;
	static final float MAX_SPEED = .2f;
	static final int FLOAT_BYTES = Float.SIZE / Byte.SIZE;
	
	// The base geometry, just a triangle
	float[] vertices = { -SIZE, -SIZE, -SIZE, SIZE, SIZE, SIZE };
	
	// x,y positions and velocities of each thing
	float[] positions = new float[NUM_THINGS * 2];
	float[] velocities = new float[NUM_THINGS * 2];

	// Always use VBOs when possible
	int vertexBufferID = 0;
	
	// Shader attributes
	int shaderProgram;
	int mvpAttribute, vertexAttribute;
	

We’ll be drawing 50,000 triangles to start. As good a number as any, and it should be sufficient to put a minor strain on the computer. The positions and velocities will be initialized to random values. When the GL context becomes available we load the vertex data into a buffer on the graphics card, and tell OpenGL to use that buffer as the source for vertex data. Standard VBO stuff.

public void init(GLAutoDrawable d)
{
	final GL2 gl = d.getGL().getGL2();
	
	gl.glClearColor(0f, 0f, 0f, 1f);
	
	// check out the ShaderLoader in the repo for an example of how to load shaders from files
	shaderProgram = ShaderLoader.compileProgram(gl, "default");
	gl.glLinkProgram(shaderProgram);
	
	// Grab references to the shader attributes
	vertexAttribute = gl.glGetAttribLocation(shaderProgram, "vertex");
	mvpAttribute    = gl.glGetUniformLocation(shaderProgram, "mvp");
	
	// Enable the vertex attribute only now that its location has been looked up
	gl.glEnableVertexAttribArray(vertexAttribute);
	
	_loadVertexData(gl);
	
	gl.glUseProgram(shaderProgram);
}
	
/**
 * Loads the vertex data into a buffer on the graphics card.
 * 
 * @param gl
 */
private void _loadVertexData(GL2 gl)
{
	// Buffer lengths are in bytes
	int numBytes = vertices.length * FLOAT_BYTES;
	
	// Bind a buffer on the graphics card
	vertexBufferID = _generateBufferID(gl);
	gl.glBindBuffer(GL2.GL_ARRAY_BUFFER, vertexBufferID);
	
	// Allocate some space
	gl.glBufferData(GL2.GL_ARRAY_BUFFER, numBytes, null, GL2.GL_STATIC_DRAW);
	
	// Tell OpenGL to use our vertexAttribute as _the_ vertex attribute in
	// the shader and to use the currently bound buffer as the data source
	gl.glVertexAttribPointer(vertexAttribute, 2, GL2.GL_FLOAT, false, 0, 0);
	
	// Map the buffer so that we can insert some data
	ByteBuffer vertexBuffer = gl.glMapBuffer(GL2.GL_ARRAY_BUFFER, GL2.GL_WRITE_ONLY);
	FloatBuffer vertexFloatBuffer = vertexBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();
	
	vertexFloatBuffer.put(vertices);
	
	gl.glUnmapBuffer(GL2.GL_ARRAY_BUFFER);
	gl.glBindBuffer(GL2.GL_ARRAY_BUFFER, 0);
}
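The mapping step is easier to follow without a GL context. The sketch below uses a directly allocated ByteBuffer as a stand-in for what glMapBuffer returns (that substitution is an assumption made purely for illustration) and shows that writes through the FloatBuffer view land in the underlying bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Hypothetical stand-in for the mapped VBO; no OpenGL involved.
final class MappedBufferSketch
{
	static ByteBuffer fill(float[] vertices)
	{
		// Buffer lengths are in bytes, just like with glBufferData
		ByteBuffer bytes = ByteBuffer.allocateDirect(vertices.length * Float.BYTES);

		// The view shares storage with 'bytes'; native order is what the driver reads
		FloatBuffer floats = bytes.order(ByteOrder.nativeOrder()).asFloatBuffer();
		floats.put(vertices);

		return bytes;
	}
}
```

Reading the result back with getFloat(12) returns the fourth vertex component, since each float occupies 4 bytes.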

You can see the Matrix3x3 class in use below, where it applies a 2D orthographic transformation. Once the view is ready, the positions and velocities can be randomized.

public void reshape(GLAutoDrawable d, int x, int y, int width, int height)
{
	final GL2 gl = d.getGL().getGL2();
	
	gl.glViewport(0, 0, width, height);
	float ratio = (float) height / width;
	
	// width is fixed at 100
	viewWidth = 100;
	// height is whatever it needs to be so that the aspect ratio is the same as the viewport
	viewHeight = viewWidth * ratio;
	
	// Set the current matrix to an orthographic transformation matrix
	Matrix3x3.ortho(0, viewWidth, 0, viewHeight);
	
	if (!didInit)
	{
		viewInit(gl);
		didInit = true;
	}
}	

/**
 * Called the first time 'reshape' is called. Useful for things that
 * can't be initialized until the screen size is known.
 * 
 * @param gl
 */
public void viewInit(GL2 gl)
{
	// Give each thing a random starting position within the bounds of the view
	for (int i = 0; i < NUM_THINGS; i++)
	{
		positions[i * 2] = (float) (Math.random() * viewWidth);
		positions[i * 2 + 1] = (float) (Math.random() * viewHeight);
		
		velocities[i * 2] = (float) (Math.random() * MAX_SPEED * 2 - MAX_SPEED);
		velocities[i * 2 + 1] = (float) (Math.random() * MAX_SPEED * 2 - MAX_SPEED);
	}
}
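The Matrix3x3 class lives in the repo and is not reproduced here. As a rough sketch of what a 2D orthographic matrix might look like, assuming a column-major 3x3 layout and the usual mapping onto the [-1, 1] clip range, something like the following would do. The class name and both methods are hypothetical.

```java
// Hypothetical sketch; the real Matrix3x3 class is in the repo and may differ.
final class Ortho2D
{
	// Column-major 3x3 matrix mapping [left, right] x [bottom, top] onto [-1, 1] x [-1, 1]
	static float[] ortho(float left, float right, float bottom, float top)
	{
		float sx = 2f / (right - left);
		float sy = 2f / (top - bottom);
		float tx = -(right + left) / (right - left);
		float ty = -(top + bottom) / (top - bottom);

		return new float[] {
			sx, 0, 0,
			0, sy, 0,
			tx, ty, 1
		};
	}

	// Applies the matrix to the point (x, y, 1) and returns {x', y'}
	static float[] apply(float[] m, float x, float y)
	{
		return new float[] { m[0] * x + m[3] * y + m[6], m[1] * x + m[4] * y + m[7] };
	}
}
```

With a view of 100 by 50 units, the corner (0, 0) lands on (-1, -1) and the centre (50, 25) lands on roughly (0, 0), which is why positions can be expressed in view units everywhere else in the renderer.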

Now that the data is loaded we just need to tell OpenGL to draw all of the things and keep track of how long it takes. This will give us a baseline to compare pseudo-instancing to, which we hope will be much faster (spoiler alert: it is).

public void display(GLAutoDrawable d)
{
	if (numDrawIterations == 0)
	{
		startDrawTime = System.currentTimeMillis();
	}
	
	final GL2 gl = d.getGL().getGL2();
	
	gl.glClear(GL2.GL_COLOR_BUFFER_BIT | GL2.GL_DEPTH_BUFFER_BIT);
	
	// Render all of the things
	for (int i = 0; i < NUM_THINGS; i++)
	{
		positions[i * 2] += velocities[i * 2];
		positions[i * 2 + 1] += velocities[i * 2 + 1];
		
		_render(gl, positions[i * 2], positions[i * 2 + 1]);
	}
	
	// Output the average render time per frame
	numDrawIterations++;
	if (numDrawIterations > 100)
	{
		// Make sure opengl is done before we calculate the time
		gl.glFinish();
		
		long totalDrawTime = System.currentTimeMillis() - startDrawTime;
		
		// Divide before resetting the counter, otherwise this divides by zero
		System.out.println(totalDrawTime / numDrawIterations);
		numDrawIterations = 0;
	}
}
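The timing logic can also be pulled out into a small helper. This is only a sketch with made-up names; timestamps are passed in explicitly so it can be exercised without a GL context, and in the renderer you would presumably pass System.currentTimeMillis() after calling glFinish.

```java
// Hypothetical helper for averaging frame times; names are illustrative.
final class FrameTimerSketch
{
	private final int window;
	private int frames = 0;
	private long startMillis = 0;

	FrameTimerSketch(int window)
	{
		this.window = window;
	}

	// Call at the top of display(); remembers when the current window started
	void begin(long nowMillis)
	{
		if (frames == 0)
		{
			startMillis = nowMillis;
		}
	}

	// Call at the bottom of display(). Returns the average milliseconds per
	// frame once 'window' frames have elapsed, or -1 while still counting.
	double end(long nowMillis)
	{
		frames++;
		if (frames < window)
		{
			return -1;
		}

		// Divide first, reset second
		double average = (double) (nowMillis - startMillis) / frames;
		frames = 0;
		return average;
	}
}
```

Returning a double rather than an integer keeps sub-millisecond averages visible, which matters once a frame takes less than 1 ms.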

The actual rendering is done via glDrawArrays, which will draw the vertices pointed to by the glVertexAttribPointer we set earlier. A translation is applied to the current matrix before it is sent to the vertex shader in order to position each thing.

public void _render(GL2 gl, float x, float y)
{
	Matrix3x3.push();
		
	// Check out the Matrix3x3 class in the repo for details.
	// Modifies the transformation matrix that we're about to send to the vertex shader
	Matrix3x3.translate(x, y);
		
	// Send the MVP matrix to the shader
	gl.glUniformMatrix3fv(mvpAttribute, 1, false, Matrix3x3.getMatrix());
		
	// Draw the vertices pointed to by the glVertexAttribPointer
	gl.glDrawArrays(GL2.GL_TRIANGLES, 0, 3);
		
	Matrix3x3.pop();
}
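For readers who don’t want to dig through the repo, here is a rough idea of what push, translate and pop could look like for a 3x3 matrix. This is a hypothetical stand-in written for this walkthrough (instance-based rather than static, column-major layout assumed), not the actual Matrix3x3 implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical stand-in for the repo's Matrix3x3 class; the real one may differ.
final class MatrixStackSketch
{
	// Column-major 3x3 matrix, starting out as the identity
	private float[] current = { 1, 0, 0, 0, 1, 0, 0, 0, 1 };
	private final Deque<float[]> saved = new ArrayDeque<>();

	// Saves a copy of the current matrix
	void push()
	{
		saved.push(current.clone());
	}

	// Restores the most recently saved matrix
	void pop()
	{
		current = saved.pop();
	}

	// Post-multiplies the current matrix by a translation: current = current * T(x, y)
	void translate(float x, float y)
	{
		current[6] += current[0] * x + current[3] * y;
		current[7] += current[1] * x + current[4] * y;
		current[8] += current[2] * x + current[5] * y;
	}

	float[] getMatrix()
	{
		return current;
	}
}
```

Every instance pays for one push, one translate, one uniform upload and one pop per frame, which is exactly the per-instance overhead the rest of this post is about removing.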

That pretty much covers the starting implementation. The two big bottlenecks here are the calls to glUniformMatrix3fv and glDrawArrays that happen for every single instance of the geometry. Our goal is to get rid of these calls, and instead pass in all of the position data to the vertex shader at once and draw every instance of the geometry with a single draw call. You can see the full source of the finished product here. The key changes are explained below.

The trick is to store all of the position data in a texture that can be accessed from the shader. One big bonus of this method is that we won’t need to calculate and pass in a transformation matrix for each instance of the geometry. We will only need to pass in the projection matrix in order to apply the orthographic projection, which luckily only changes when the view is resized.

Vertex Shader

uniform mat3 projection;
attribute vec3 vertex;

// This is where the position data comes from
uniform samplerBuffer positionSampler;

// Fetches the position from the texture buffer
float positionFetch(int index)
{
	// each set of 4 position elements (2 xy pairs) is represented by a single vec4 (for RGBA)
	int item_index = int(index/4);
	// use the remainder to figure out which component of the vector we are interested in
	int component_index = index%4;

	return texelFetch(positionSampler, item_index)[component_index];
}

void main(void)  
{ 
	// Since we are just translating we don't need any fancy model/view matrix, just add to the vertex position
	float x = vertex.x + positionFetch( int(vertex.z)*2 );
	float y = vertex.y + positionFetch( int(vertex.z)*2 + 1);
	vec3 real_position = vec3(x, y, 1);
	 
	gl_Position = vec4(projection * real_position, 1); 
}

The real magic here is performed by the texelFetch function that is used to look up a value in a texture. In this case the texture is a one-dimensional buffer of type RGBA32F that is set up in the GLRenderer and that holds the position for each instance of the geometry. Normally the shader would have no sense of which instance a given vertex belongs to, so to work around this we set the z-component of each vertex to an index that represents which instance of the geometry the vertex belongs to. So, since we are drawing triangles, the first 3 vertices rendered will have a z-component of 0, the next 3 will have a z-component of 1, then 2, etc. It is this index that we use to look up the vertex’s position in the texture buffer. Now let’s see how this is set up in the GLRenderer.
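The index arithmetic in positionFetch is easy to get wrong, so here is the same lookup mirrored on the CPU in Java. The class is hypothetical and a flat float array stands in for the texture buffer, with every 4 consecutive floats playing the role of one RGBA texel.

```java
// Hypothetical CPU-side mirror of the shader's positionFetch; for illustration only.
final class PositionFetchSketch
{
	// 'texels' plays the role of the RGBA32F texture buffer: every 4 floats form one texel
	static float positionFetch(float[] texels, int index)
	{
		int itemIndex = index / 4;      // which vec4 texel
		int componentIndex = index % 4; // which of r, g, b, a inside it
		return texels[itemIndex * 4 + componentIndex];
	}

	// Looks up {x, y} for an instance the way main() does with vertex.z
	static float[] instancePosition(float[] texels, int instance)
	{
		return new float[] {
			positionFetch(texels, instance * 2),
			positionFetch(texels, instance * 2 + 1)
		};
	}
}
```

For instance 3, the x value sits at flat index 6, which is texel 1, component 2, and the y value sits right after it in component 3. Each texel therefore holds the positions of two consecutive instances.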

GLRenderer

// Interleaved x,y velocities of each thing
float[] velocities = new float[NUM_THINGS*2];
// positions array was here (the positions now live in the texture buffer)

// Always use VBOs when possible
int vertexBufferID  = 0;
int positionBufferID = 0;

// Shader attributes
int shaderProgram;
int projectionAttribute, vertexAttribute, positionAttribute;

...

public void init(GLAutoDrawable d)
{		
	...

	// Grab references to the shader attributes
	projectionAttribute = gl.glGetUniformLocation(shaderProgram, "projection");
	vertexAttribute     = gl.glGetAttribLocation(shaderProgram, "vertex");
	positionAttribute   = gl.glGetUniformLocation(shaderProgram, "positionSampler");
		
	_loadVertexData(gl);
	_preparePositionBuffer(gl);

	gl.glUseProgram(shaderProgram);
}

The first change to notice is that we are no longer storing the position data outside of the graphics card at all, so we can get rid of the ‘positions’ array. We instead now have a positionBufferID that will refer to the texture buffer on the graphics card that we load the position data into. The mvp shader attribute has also been replaced by the ‘projectionAttribute’ and there is a new ‘positionAttribute’ that will refer to the ‘positionSampler’ in the vertex shader.

	 /**
	 * Loads NUM_THINGS instances of the vertex data into a buffer on the graphics card.
	 * Each vertex's z-component is the index of the instance and can be used to
	 * look up the instance's position using the positionSampler in the vertex shader.
	 * 
	 * @param gl
	 */
	private void _loadVertexData(GL2 gl)
	{
        // Buffer lengths are in bytes
	    int numBytes = 3*vertices.length*FLOAT_BYTES*NUM_THINGS/2;
	    
		// Bind a buffer on the graphics card and load our vertexBuffer into it
        vertexBufferID = _generateBufferID(gl);
		gl.glBindBuffer(GL2.GL_ARRAY_BUFFER, vertexBufferID);
		
		// Allocate some space
		gl.glBufferData(GL2.GL_ARRAY_BUFFER, numBytes, null, GL2.GL_STATIC_DRAW);
		
		// Tell OpenGL to use our vertexAttribute as _the_ vertex attribute in the shader and to use
		// the currently bound buffer as the data source
		gl.glVertexAttribPointer(vertexAttribute, 3, GL2.GL_FLOAT, false, 0, 0);
		
		// Map the buffer so that we can insert some data
		ByteBuffer vertexBuffer = gl.glMapBuffer(GL2.GL_ARRAY_BUFFER, GL2.GL_WRITE_ONLY);
		// Do this rather than using 'putFloat' directly on vertexBuffer, it's faster
		FloatBuffer vertexFloatBuffer = vertexBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();
		
		// Add the vertices to the FloatBuffer
		// The z-component is used to store an index that the shader will use to 
		// look up the position for each vertex
        for(int i = 0; i < NUM_THINGS; i++)
        {
        	for(int v = 0; v < vertices.length; v+=2)
            {
        		vertexFloatBuffer.put(vertices[v]);
        		vertexFloatBuffer.put(vertices[v+1]);
        		vertexFloatBuffer.put(i); // the index
            }
        }
        
        gl.glUnmapBuffer(GL2.GL_ARRAY_BUFFER);
	    gl.glBindBuffer(GL2.GL_ARRAY_BUFFER, 0);
	}

The changes to _loadVertexData are fairly significant. We are no longer loading just one instance of the vertex data. Instead we are loading NUM_THINGS instances so that we can draw them all at once with a single call. This of course uses up more memory, but in general the trade off is worth it since the speed increase is so significant and memory is so cheap. The z-component of each vertex is used to store an index into the position texture buffer as explained above.
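The replication loop can be tried out on a plain array to see exactly what ends up in the buffer. The helper below is hypothetical and skips OpenGL entirely; it just builds the same x, y, index triples that _loadVertexData writes.

```java
// Hypothetical helper that builds the replicated vertex data on the CPU.
final class InstancedVerticesSketch
{
	// Turns interleaved x,y pairs into numThings copies of x,y,index triples
	static float[] expand(float[] vertices, int numThings)
	{
		int verticesPerThing = vertices.length / 2;
		float[] out = new float[numThings * verticesPerThing * 3];

		int o = 0;
		for (int i = 0; i < numThings; i++)
		{
			for (int v = 0; v < vertices.length; v += 2)
			{
				out[o++] = vertices[v];
				out[o++] = vertices[v + 1];
				out[o++] = i; // instance index, exact as a float up to 2^24
			}
		}
		return out;
	}
}
```

With the triangle from this post and NUM_THINGS = 50000 this comes to 450,000 floats, or 1.8 MB, compared to 24 bytes for the single triangle in the first version.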

	 /**
	 * Set up a texture buffer to hold the position data and tell TEXTURE0 to use it.
	 * Also makes sure that the positionSampler is hooked up to TEXTURE0.
	 * 
	 * @param gl
	 */
	private void _preparePositionBuffer(GL2 gl)
	{
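	    // Make sure the position sampler is bound to TEXTURE0 and TEXTURE0 is active.
	    // Sampler uniforms are integers, so they are set with glUniform1i, and the
	    // program has to be in use for glUniform* to reach it.
	    gl.glUseProgram(shaderProgram);
	    gl.glUniform1i(positionAttribute, 0); // 0 means TEXTURE0
	    gl.glActiveTexture(GL2.GL_TEXTURE0);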
	    
		// Bind a texture buffer
		positionBufferID = _generateBufferID(gl);
		gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, positionBufferID);
	    
	    // Allocate some space
	    int size = NUM_THINGS * 2 * FLOAT_BYTES;
	    // Use STREAM_DRAW since the positions get updated very often 
	    gl.glBufferData(GL2.GL_TEXTURE_BUFFER, size, null, GL2.GL_STREAM_DRAW);
	    
	    // Unbind
	    gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, 0);

	    // The magic: Point the active texture (TEXTURE0) at the position texture buffer
	    // Right now the buffer is empty, but once we fill it, the positionSampler in
	    // the vertex shader will be able to access the data using texelFetch
	    gl.glTexBuffer(GL2.GL_TEXTURE_BUFFER, GL2.GL_RGBA32F, positionBufferID);
	}

The comments above explain what’s going on pretty well. This is how we bind the ‘positionSampler’ in the vertex shader to the buffer represented by positionBufferID. I found a lot of conflicting and over-complicated examples of how to do this on Google, but after much experimentation this seems to be the simplest way to go about it. Notice the use of GL_RGBA32F when we point TEXTURE0 at the position buffer. This is what makes our data come into the vertex shader as a vec4. We could also have used GL_R32F here, in which case each texel would hold a single float and looking up the position of each vertex would be a little easier, but GL_RGBA32F is likely the better fit since it makes use of the natural vec4 packing that GLSL tends to prefer.
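To compare the two formats, it helps to count texels. The floats in the buffer are the same either way; only how they are grouped changes. The helper below is a hypothetical bit of arithmetic, and the rounding up is an assumption on my part: as far as I understand the texture buffer rules, a partially filled last texel is not addressable, so an odd number of things under GL_RGBA32F would want a padded buffer.

```java
// Hypothetical arithmetic helper; not part of the original renderer.
final class TexelMathSketch
{
	// Number of texels needed to hold numThings x,y pairs when each texel
	// carries 'componentsPerTexel' floats (4 for GL_RGBA32F, 1 for GL_R32F)
	static int texelCount(int numThings, int componentsPerTexel)
	{
		int floats = numThings * 2;
		// Round up so a partially filled last texel is still counted
		return (floats + componentsPerTexel - 1) / componentsPerTexel;
	}

	// Size in bytes of a buffer that holds whole texels only
	static int bufferBytes(int numThings, int componentsPerTexel)
	{
		return texelCount(numThings, componentsPerTexel) * componentsPerTexel * Float.BYTES;
	}
}
```

For 50,000 things that is 25,000 texels with four components each, or 100,000 texels with one component each, and 400,000 bytes in both cases.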

	public void viewInit(GL2 gl)
	{
		// Bind and fill the position buffer
		gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, positionBufferID);
	    ByteBuffer textureBuffer = gl.glMapBuffer(GL2.GL_TEXTURE_BUFFER, GL2.GL_WRITE_ONLY);
	    FloatBuffer textureFloatBuffer = textureBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();

		for(int i = 0; i < NUM_THINGS; i++)
		{
			// Give each thing a random starting position within the bounds of the view
			textureFloatBuffer.put((float) (Math.random()*viewWidth));
			textureFloatBuffer.put((float) (Math.random()*viewHeight));
			
			// and random starting velocity
			// same range as the baseline version: -MAX_SPEED to MAX_SPEED
			velocities[i*2] = (float) (Math.random()*MAX_SPEED*2 - MAX_SPEED);
			velocities[i*2+1] = (float) (Math.random()*MAX_SPEED*2 - MAX_SPEED);
		}
	    
	    gl.glUnmapBuffer(GL2.GL_TEXTURE_BUFFER);
	    gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, 0);
	}

Initializing the positions and velocities looks a little different now, but the concept is the same. The only difference is that now we are pushing the position data straight to the graphics card. We are loading the TEXTURE_BUFFER here just like we would load vertex data or anything else.

	public void display(GLAutoDrawable d)
	{
		...

		_updatePositions(gl);

		// Render all of the things
		gl.glDrawArrays(GL2.GL_TRIANGLES, 0, NUM_THINGS*3);

		...
	}

We can now draw everything with a single call to glDrawArrays, which is exactly what we wanted. The vertex shader will take care of making sure everything is drawn at the right position. The only remaining task is to update the positions on the graphics card, but it’s pretty simple.

	public void _updatePositions(GL2 gl)
	{
		gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, positionBufferID);
		ByteBuffer textureBuffer = gl.glMapBuffer(GL2.GL_TEXTURE_BUFFER, GL2.GL_READ_WRITE);
		
	    FloatBuffer textureFloatBuffer = textureBuffer.order(ByteOrder.nativeOrder()).asFloatBuffer();
		for(int i = 0; i < NUM_THINGS*2; i++)
		{
			textureFloatBuffer.put(i, textureFloatBuffer.get(i) + velocities[i]);
		}
		
	    gl.glUnmapBuffer(GL2.GL_TEXTURE_BUFFER);
	    gl.glBindBuffer(GL2.GL_TEXTURE_BUFFER, 0);
	}
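The update loop relies on the absolute get and put methods of FloatBuffer, which read and write at a given index without moving the buffer’s position. Here is a stand-alone sketch of the same loop, with a wrapped float array standing in for the mapped texture buffer (an assumption for illustration; the class name is made up).

```java
import java.nio.FloatBuffer;

// Hypothetical stand-alone version of the update loop; a plain FloatBuffer
// stands in for the mapped texture buffer.
final class UpdatePositionsSketch
{
	static void step(FloatBuffer positions, float[] velocities)
	{
		// Absolute get/put leave the buffer's position untouched
		for (int i = 0; i < velocities.length; i++)
		{
			positions.put(i, positions.get(i) + velocities[i]);
		}
	}
}
```

Because positions and velocities are both interleaved as x, y pairs in the same order, a single flat loop over all NUM_THINGS*2 floats is enough.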

Oh, and don’t forget to set the projection matrix in ‘reshape’ or things probably won’t look right.

	public void reshape(GLAutoDrawable d, int x, int y, int width, int height)
	{
		...
		Matrix3x3.ortho(0, viewWidth, 0, viewHeight);
	    
		// Send the projection matrix to the shader, only needs to be sent once per resize
        gl.glUniformMatrix3fv(projectionAttribute, 1, false, Matrix3x3.getMatrix());
		...
	}

And there you have it. If you followed along closely you should now be able to render millions of moving triangles without any noticeable lag. In my own personal benchmarks I was able to render ~200 million triangles per second with this method as opposed to only ~1 million before implementing pseudo-instancing. Not too shabby.
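If you want to turn the frame times printed by ‘display’ into a triangles-per-second figure for your own machine, the conversion is a one-liner. The helper below is hypothetical, and the numbers in the note after it are example inputs, not measurements from this post.

```java
// Hypothetical arithmetic helper for reading the benchmark output.
final class ThroughputSketch
{
	// Converts an average frame time into triangles drawn per second
	static double trianglesPerSecond(int trianglesPerFrame, double millisPerFrame)
	{
		return trianglesPerFrame * (1000.0 / millisPerFrame);
	}
}
```

As an example, 50,000 triangles at a made-up 50 ms per frame works out to 1 million triangles per second. Going the other way, 200 million triangles per second at 50,000 triangles per frame would mean 0.25 ms per frame, which the whole-millisecond timing in ‘display’ cannot resolve, so a much larger NUM_THINGS is presumably needed to measure rates that high.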

Feel free to leave a comment below or otherwise contact me if you have questions. Thanks for reading.

First Post

Just throwing some crap in here for now to test things.

int x = 2;
int y = 1;
int s = x;
x = y;
y = s;