Here's the third part in a series of posts describing my vector graphics renderer. It will discuss the practical aspects of passing down data to the GPU with OpenGL and trying to optimize the performance of the renderer. Here is also a good time to state that I'm far from a graphics programming expert, so this is just an account of how I made my renderer faster (or at least less excrutiatingly slow), not necessarily how you should do it.
In Part 1 , we saw how paths were built with GraphicsMoveTo(), GraphicsLineTo() and GraphicsCurveTo() functions, and how path stroking was implemented in GraphicsStroke(). In Part 2 we discussed polygon triangulation algorithms and how they were used to fill a path with GraphicsFill(). Our vector graphics system also exposes function to create fonts and gradients, to apply transformations, and to limit drawing to a given area defined by a clipping path. We briefly discuss them in the following paragraphs before touching on the OpenGL code.
The graphics context maintains a stack of transformation matrices. The matrix on top of the stack is used to translate/scale/rotate paths. The two following functions are used to push/pop matrices on the stack:
Note that GraphicsPushMatrix() multiplies its matrix argument with the current (ie. top of the stack) matrix before pushing it on the stack.
Vertex transformation occurs in our shader, so we use a uniform to pass our current transformation matrix to the shader before drawing, with the help of the following function :
Loading font files and creating a font atlas texture is done with the help of stb_ttf, by Sean Barett. I was already in debt to Mr Barett after so many hours playing Thief: the dark project in my young days (he wrote the software renderer when working at Looking Glass), but his stb libraries are also awesome: public domain, header only, easy to integrate and use, entirely readable and transparent and giving you total freedom over data and memory management.
It was a breeze to get font textures and metrics working, and to write a naive text rendering function using textured quads. I could (and will) put some more work here, regarding font scaling, kerning, etc. but that's really all there is to it for now: you get a big alpha texture with all you glyphs at a given size, and you render text by drawing quads with the corresponding UV coordinates, using the alpha value of the texel to multiply the alpha value of the current color.
We saw in Part 2 that the filling algorithm computed UV coordinates corresponding to each triangle vertex, which allows to fill a path not only with a solid color, but also with a texture. The result is as if we scaled the texture image to fit the bounding rectangle of our path, and draw the image pixels only if they're inside the polygon defined by the path. This is used mostly with gradient textures:
The function GraphicsCreateGradientLinear() creates a texture image corresponding to a linear gradient from a start color to a stop color along a given vector. We can then set this texture current to the graphics context by using GraphicsSetGradient(), so that it will be used in subsequent calls of GraphicsFill().
We will see later the implementation of GraphicsCreateGradientLinear(), once we have described the way we handle textures.
Here's one of the features whose implementation i'm not really satisfied with at the moment, for reasons we will see in a few paragraphs, but here it is: we need a way to restrict drawing to arbitrary (i.e. path-defined) regions of the window, in order to ease the drawing of nested views, scrollable areas, etc, by the GUI system.
As such, we define a clip region, which will initialy cover the entire window, and can be restricted or expanded by the user's drawing code. We expose two functions for this purpose:
The function GraphicsPushClip() restricts the clip region to the intersection of the current path with the current clip region, whereas GraphicsPopClip() reverts the clip region to its previous state.
The easy way to do this is to use the OpenGL stencil buffer. During normal drawing operations, the stencil test is set to pass where the stencil value is zero and fail otherwise.
In GraphicsPushClip(), we first increment all values in the stencil buffer. Then we temporarily set the stencil test to fail on ones and pass otherwise, and the stencil operation to zero the stencil value upon failure. We then fill the current path, thus zeroing the intersection of the previous clip region (were values are now all ones) and the current path.
Fig. 1: Pushing a clip path
To pop the clipping, we only have to decrement all values in the stencil buffer (without wrapping, ie. zeros will not be modified).
Fig. 2: Poping a clip path
Here is the definition of GraphicsPushClip() and GraphicsPopClip():
While this implementation is relatively straightforward, we will see in a moment why it is far from ideal in our practical use case.
In a very first implementation of the renderer, we could pass vertex data and request OpenGL to draw them as soon as our stroking/filling functions have computed the needed triangles. That approach would interleave the immediate mode graphical interface logic with the drawing logic, and would not be very efficient in cases were we realize after the fact that we didn't need to draw some portions of the GUI. That's why, instead of directly calling our drawing code, we merely record our drawing requests into a command buffer, and actually call our drawing code when the context is flushed by a call to GraphicsFlush().
For instance, calling GraphicsStroke() will not immediately compute triangles and send them to the GPU, but rather append a command to a buffer. When GraphicsFlush() is called, it will loop through the entire buffer and eventually call GraphicsGLStroke() which is the name of the actual stroking function we discussed in Part 1 .
This way we avoid one of the disadvantages of naive immediate mode GUI implementation, where the GUI sometimes has to be drawn twice per frame due to a GUI event changing the look of the GUI. Here, we just discard the previous command buffer. Still, our triangles are passed to the GPU and rendered as soon as they are computed, with calls to glBufferData() and glDrawArrays().
Our shader code would be something along the lines of:
On top of our renderer is an immediate mode GUI which handles the interactions of the user with our application. To get a ballpark estimation of our renderer performance and compare different approaches, we write a GUI that fills the whole window, and draw all the GUI each frame.
We will see at the end that this is a worst case because in most circumstances we need to draw only parts of the GUI, and in fact most of the time we don't need to redraw the GUI at all. Still, we need to ensure that if the GUI must be entirely redrawn for whatever reason (eg. if the window is resized, maximized, etc...), that task can be achieved in some reasonable amount of time so that the user won't notice a lag. We are not aiming for 60fps of a constantly changing world here. An occasional 50ms to draw a frame would be ok (of course, if we can do better it's all good).
The first iteration of our rendering code, while simple to reason about, gave extremely poor performance results. It achieved a frame rate of roughly 3 FPS (approximately 333ms/frame) and wasted a lot of CPU time. Using the profiler and the GPU driver monitor enabled to identify the following bottlenecks:
For clarity, let's first eliminate the last minor bottleneck, i.e. memory management, even if in a real scenario you would likely try to address it only after having dealt with the first three.
We already saw in Part 1 how the malloc()/free() operations were replaced inside the drawing functions by a linear allocation scheme, so this is just a reminder: at initialization, the graphics context allocates a “big” block of memory (1MB) that it uses to store the data structures needed by the path construction and triangulation functions. It maintains an offset in the buffer pointing to the first “free” byte, and simply returns the address of that byte and increments the offset whenever a memory block is requested by the use of the allocation function:
The only subtlety here is that the blocks are all aligned on 16 bytes boundaries to ease some SIMD optimizations on our vector type.
There is two kind of overhead to consider regarding our OpenGL code:
To reduce the overhead due to OpenGL API calls, the idea is to gather larger chunks of vertex data and send them in batch, using only a few calls. However, each batch should have the same “state”, ie use the same values for our shader uniforms. That means that each time we change the transformation matrix, the color, or the texture, we must call OpenGL to draw the data we gather so far, then start collecting a new batch...
So we want to reduce the number of shader state changes to allow us to send larger batches. In particular, we want to be able to use multiple textures inside the same batch, without having to modify texture units. That's where a Texture Atlas comes in handy: instead of creating a texture for each gradient, we create a big texture at initialization time, and allocate some rectangular area inside this texture each time we need to create a gradient. In our vertex attributes data, we adjust the texture UV coordinates so that they map to the correct area of the atlas.
We use a super simple (and somewhat wasteful) packing scheme for the texture atlas: we allocate regions side by side on horizontal lines, and jump to a new line each time we can't find enough space at the end of the current line. We try to reuse deallocated regions but don't try to prevent fragmentation. For now it's a reasonable assumption, in the context of a GUI, that most of the textures will be allocated once and be kept in the atlas until the end of the program.
The texture atlas consists in the following members inside our graphics context structure:
The member atlasTexture is an OpenGL texture handle created by glGenTextures(). atlasFreeOffset is the next position at which we could try to allocate space for a new texture. atlasMaxLineHeight is the maximum height of all the areas on the current line. atlasFreeAreas is a list of areas that were previously marked as “deallocated” and can be reused for future textures.
Our gradient creation function, GraphicsGradientCreateLinear(), first try to allocate a region from the texture atlas as follows:
It then proceeds to draw a color gradient inside the allocated region:
In this snippet, we first create a frame buffer attached to our texture atlas, and set the viewport to draw in the previously allocated area. Then we compute the colors for each corner of the area and draw a quad over the entire viewport. We finally destroy the frame buffer and reset the viewport to its original state.
Now let's talk about our vertex attribute format, that we just saw in use in the previous code snippet. In order to maximize the size of batches that we can send to the GPU without uniform or shader changes, we use a common vertex attributes format for all our primitives:
This structure contains the position of the vertex, its color, its UV coordinates in the texture atlas, and a shader mode. Note that we use 32 bit integers to pack our Color and UV attributes. Once these attributes have been computed using 32 bits floating point precision for each component, our shader doesn't do much more computation on them. We can afford to loose some precision in there, so we can spare a lot of data transmission overhead and get a big speed-up.
The mode can be one the following values:
The fragment shader code uses masking to select the appropriate operation:
Now that we have a uniform way to send vertex data to the GPU while using different shading modes and textures, we can modify our drawing code to gather vertex data and send it in large batches to the GPU. To this effect we define two functions:
GraphicsGLPushVertexData() appends a buffer of vertex_attributes structures of length count to the vertex buffer. Each time we modify a uniform of our shaders, or when the temporary buffer is full, we call GraphicsGLDrawTriangles() to send the data to the GPU and draw the triangles, using glDrawArrays().
We then orphan the buffer using glBufferData(), indicating to OpenGL that we need a fresh buffer to write to. Otherwise, we would block inside GraphicsGLPushVertexData() when we call glBufferSubData(), until OpenGL has finished processing the vertex buffer.
So far we managed to reduce the cost of OpenGL calls to a much smaller amount of time and reached a frame rate of approximately 17fps (59ms/frame).
The problem here is that our calls to glBufferData() and glBufferSubData() must still do a bunch of work copying our vertex data and will eventually block to wait for the GPU. Not knowing the implementation details of the driver, it is hard to tell why it is blocking, but there are a few possible explanations I can only speculate (if someone is a graphics driver specialist, feel free to chime in and tell me if those are correct or completely wrong guesses! There's not a lot of detailed info on that subject online and I would love to hear about it):
A second technique we can use to improve our vertex submission rate and avoid waiting for the GPU is buffer object streaming :
Our GraphicsGLPushVertexData() and GraphicsGLDrawTriangles() functions now look like this:
There is a tradeoff to be found regarding the size of the vertex buffer. We want it to be big enough to have several batches fit in, but if it is too big the oprhaning may be too slow, creating a performance degradation every so often.
Buffer object streaming completely got rid of the CPU waiting for GPU time, and also significantly reduced the time spent in OpenGL calls. The frame rate we achieved at this point is around 24fps (41ms/frame), but now OpenGL plays a minor part in the total frame time:
The major part of the rendering time is now spent in our stroking and filling algorithms, so that's where we could improve things further, and that will come up in the next months for sure!
An important thing to note though is that these results rely on our ability to send a large number of triangle to the GPU in one call. Each action that require the shader uniforms or the texture atlas to be changed forces us to send the current batch and start a new one. If called frequently, it will kill all the benefits from this approach.
That's why I said earlier that I'm not happy with the way the clip area is implemented. Each clipping path push or pop needs to flush the current batch. This was a problem because the GUI massively relied on clipping to draw its widgets and views. I temporarily worked around that by doing more rounded rectangles fills and stroke, and clipping text early with rectangles. But i'm still bothered by this problem. One option would be to have a clipping paths stack and clip the current drawing path with the current clipping path when we stroke or fill. But that would mean transforming both paths to the same coordinate system and we would loose the benefit of doing coordinate transforms inside the shader...
Most of the time, a GUI widget doesn't need to be redrawn each frame. In fact it requires a redraw only if its aspect changed due to user interaction or dynamic styling. While designed in an immediate mode style, the GUI system of Top! caches enough info to determines if a redraw is needed, at the widget level. For instance, when a button is interacted with by the user, it will, in addition to issuing its drawing commands, emmit a “dirty rectangle”. Dirty rects are transformed into the window coordinate system and merged by the renderer to form a set of disjoint rectangles, which determines the areas of the window that must be redrawn.
Note that if the button is not interacted with, it still has to issue its drawing commands, because another widget could emmit a dirty rectangle that overlaps its own area. To preserve a consistent (z-ordered) result, the untouched widget would have to be (partially) redrawn.
We expose a variant of GraphicsFlush() called GraphicsFlushForDirtyRect(). This function serves the same purpose as GraphicsFlush(), that is, running through the command buffer and executing our vector drawing functions. But it takes the dirty rectangles emmited during the last frame into account to skip or clip some commands.
Each drawing primitive (Stroke, Fill, ...), has an implicit bounding box corresponding to the extents of the current path (in some cases augmented by the maximum of the line width and miter limit). If the bounding box doesn't intersect any dirty rect, we can skip the primitive entirely. If it's contained in a dirty rect, we can process as usual. But what happens if the bounding box partially overlaps with one (or more) dirty rectangles ? We must draw the primitive but clip the areas that fall outside of dirty rects.
Here we employ a technique similar to our path-defined clip area, using the stencil buffer:
This way, most of the time the rendering function is entirely skipped, and when the GUI needs to be redrawn it's often limited to one small widget bounding box. This allows us to achieve a good display reactivity while consumming very little CPU time.
This wraps up this series on my vector graphics renderer for now! Thanks for reading!