FlushGPUBuffer: How to flush the GPU command buffer regularly to avoid stalls or frame rate stuttering

Sometimes you get an inconsistent frame rate or a lagging response to input because your CPU outpaces your GPU by a wide enough margin that the driver's command buffer fills up, causing a periodic stall in your frame rate. One way to deal with this is to flush the GPU command buffer manually on a more regular basis, distributing the mismatch more evenly across frames.

You do not, however, want to lock the CPU and GPU into a synchronous frame loop. This is very inefficient because it prevents the GPU and CPU from working independently, and on more powerful hardware it's vital to queue more than one frame at once to get maximum efficiency; this is particularly important in SLI configurations. For that reason, the often-suggested trick of blitting a pixel from the primary frame buffer back to the CPU to force the buffer to be flushed is not advisable.
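
As a rough illustration of that trade-off, here is a small self-contained sketch that models the two processors with made-up timings. Everything in it is an assumption for illustration only: `simulate`, `SimResult`, the fixed per-frame costs, and the rule that the CPU may run at most `maxQueued` frames ahead of the GPU. None of it is Ogre API, and real drivers behave in more complicated ways.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for the toy model below.
struct SimResult
{
    double totalMs;    // time at which the GPU finishes the last frame
    double latencyMs;  // delay from the CPU starting the last frame to the GPU finishing it
};

// Toy model (not Ogre code): the CPU needs cpuMs to record a frame, the GPU
// needs gpuMs to execute it, and the CPU may not start frame i until frame
// (i - maxQueued) has been completed by the GPU.
SimResult simulate(double cpuMs, double gpuMs,
                   std::size_t maxQueued, std::size_t frames)
{
    assert(maxQueued > 0 && frames > 0);
    std::vector<double> gpuDone(frames, 0.0);
    double cpuTime = 0.0;
    double latency = 0.0;
    for (std::size_t i = 0; i < frames; ++i)
    {
        // Stall the CPU until the frame maxQueued steps back has left the queue.
        if (i >= maxQueued)
            cpuTime = std::max(cpuTime, gpuDone[i - maxQueued]);
        const double cpuStart = cpuTime;
        cpuTime += cpuMs;  // record this frame's commands

        // The GPU starts once the commands exist and the previous frame is done.
        const double gpuStart =
            (i == 0) ? cpuTime : std::max(cpuTime, gpuDone[i - 1]);
        gpuDone[i] = gpuStart + gpuMs;
        latency = gpuDone[i] - cpuStart;
    }
    return SimResult{ gpuDone[frames - 1], latency };
}
```

With 3 ms of CPU work and 4 ms of GPU work per frame, `simulate(3, 4, 1, 100)` finishes at 700 ms in this model because the two processors take turns, while `simulate(3, 4, 2, 100)` finishes at 403 ms. Raising the limit to 8 does not finish any sooner, but it stretches the delay between recording a frame and the GPU completing it from 8 ms to 32 ms, which is the kind of input lag a bounded queue is meant to avoid.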

Another way to tackle this is via hardware occlusion queries. The nice thing about these is that you can delimit frames with them, and then only wait on a query's result if more than a given number of frames are still queued. You can use a round-robin assignment of queries to make sure that you always force the flush on a frame a fixed number of frames back, thus keeping the command buffer at a bounded depth while still keeping multiple frames on the queue. Here's a class that does that:

OgreGpuCommandBufferFlush.h:

#ifndef __GPUCOMMANDBUFFERFLUSH_H__
#define __GPUCOMMANDBUFFERFLUSH_H__

#include <vector>

#include "OgrePrerequisites.h"
#include "OgreFrameListener.h"

namespace Ogre
{
    
    /** Helper class which can assist you in making sure the GPU command
        buffer is regularly flushed, so in cases where the CPU is outpacing the
        GPU we do not hit a situation where the CPU suddenly has to stall to
        wait for more space in the buffer.
    */
    class GpuCommandBufferFlush : public FrameListener
    {
    protected:
        bool mUseOcclusionQuery;
        typedef std::vector<HardwareOcclusionQuery*> HOQList;
        HOQList mHOQList;
        size_t mMaxQueuedFrames;
        size_t mCurrentFrame;
        bool mStartPull;
        bool mStarted;

    public:
        GpuCommandBufferFlush();
        virtual ~GpuCommandBufferFlush();

        void start(size_t maxQueuedFrames = 2);
        void stop();
        bool frameStarted(const FrameEvent& evt);
        bool frameEnded(const FrameEvent& evt);

    };

}

#endif

OgreGpuCommandBufferFlush.cpp:

#include "OgreGpuCommandBufferFlush.h"
#include "OgreRoot.h"
#include "OgreRenderSystem.h"
#include "OgreHardwareOcclusionQuery.h"

namespace Ogre
{
    //---------------------------------------------------------------------
    GpuCommandBufferFlush::GpuCommandBufferFlush()
        : mUseOcclusionQuery(true)
        , mMaxQueuedFrames(2)
        , mCurrentFrame(0)
        , mStartPull(false)
        , mStarted(false)
    {

    }
    //---------------------------------------------------------------------
    GpuCommandBufferFlush::~GpuCommandBufferFlush()
    {
        stop();
    }
    //---------------------------------------------------------------------
    void GpuCommandBufferFlush::start(size_t maxQueuedFrames)
    {
        if (!Root::getSingletonPtr() || !Root::getSingletonPtr()->getRenderSystem())
            return;

        stop();
        // Clamp to at least 1: frameEnded() uses this value as a modulus
        mMaxQueuedFrames = maxQueuedFrames > 0 ? maxQueuedFrames : 1;
        RenderSystem* rsys = Root::getSingleton().getRenderSystem();
        mUseOcclusionQuery = rsys->getCapabilities()->hasCapability(RSC_HWOCCLUSION);

        if (mUseOcclusionQuery)
        {
            for (size_t i = 0; i < mMaxQueuedFrames; ++i)
            {
                HardwareOcclusionQuery* hoq = rsys->createHardwareOcclusionQuery();
                mHOQList.push_back(hoq);
            }
        }

        mCurrentFrame = 0;
        mStartPull = false;

        Root::getSingleton().addFrameListener(this);

        mStarted = true;

    }
    //---------------------------------------------------------------------
    void GpuCommandBufferFlush::stop()
    {
        if (!mStarted || !Root::getSingletonPtr() || !Root::getSingletonPtr()->getRenderSystem())
            return;

        RenderSystem* rsys = Root::getSingleton().getRenderSystem();
        for (HOQList::iterator i = mHOQList.begin(); i != mHOQList.end(); ++i)
        {
            rsys->destroyHardwareOcclusionQuery(*i);
        }
        mHOQList.clear();

        Root::getSingleton().removeFrameListener(this);

        mStarted = false;

    }
    //---------------------------------------------------------------------
    bool GpuCommandBufferFlush::frameStarted(const FrameEvent& evt)
    {
        if (mUseOcclusionQuery)
        {

            mHOQList[mCurrentFrame]->beginOcclusionQuery();

        }

        return true;
    }
    //---------------------------------------------------------------------
    bool GpuCommandBufferFlush::frameEnded(const FrameEvent& evt)
    {
        if (mUseOcclusionQuery)
        {
            mHOQList[mCurrentFrame]->endOcclusionQuery();
        }
        mCurrentFrame = (mCurrentFrame + 1) % mMaxQueuedFrames;
        // If we've wrapped around, time to start pulling
        if (mCurrentFrame == 0)
            mStartPull = true;

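        // Only pull when queries exist; without occlusion query support
        // mHOQList is empty and indexing it would be invalid
        if (mUseOcclusionQuery && mStartPull)
        {
            // Blocks until the GPU has finished the frame that last used this slot
            unsigned int dummy;
            mHOQList[mCurrentFrame]->pullOcclusionQuery(&dummy);
        }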

        return true;
    }
    //---------------------------------------------------------------------

}
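
To see which frame the listener ends up waiting on, the slot bookkeeping from frameStarted() and frameEnded() can be replayed without any Ogre types. This is only a sketch: `pulledFrames` and `slotOwner` are hypothetical names introduced here, and the function merely mirrors the index arithmetic of the class above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each rendered frame, returns the number of the earlier frame whose
// query would be pulled (waited on) at the end of that frame, or -1 when
// nothing is pulled yet. Hypothetical helper, not part of Ogre.
std::vector<long> pulledFrames(std::size_t maxQueued, std::size_t frames)
{
    assert(maxQueued > 0);
    std::vector<long> slotOwner(maxQueued, -1);  // frame that last used each slot
    std::vector<long> pulled;
    std::size_t current = 0;
    bool startPull = false;
    for (std::size_t f = 0; f < frames; ++f)
    {
        // begin/end of the query in slot 'current' brackets frame f
        slotOwner[current] = static_cast<long>(f);

        // advance round-robin, exactly as frameEnded() does
        current = (current + 1) % maxQueued;
        if (current == 0)
            startPull = true;  // every slot has been used at least once

        pulled.push_back(startPull ? slotOwner[current] : -1);
    }
    return pulled;
}
```

For `pulledFrames(3, 6)` the result is -1, -1, 0, 1, 2, 3: nothing is pulled during the first two frames, and from then on the end of frame f waits for frame f - 2, leaving the two most recent frames on the queue. With a single slot, `pulledFrames(1, 2)` gives 0, 1, meaning every frame waits on itself, which is the synchronous loop described earlier as something to avoid.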

Usage

To use it, just declare an instance somewhere in your app:

GpuCommandBufferFlush mBufferFlush;


And initialise it somewhere after you've created your first RenderWindow:

mBufferFlush.start(numberOfQueuedFrames);


Ideally call stop() or destroy it before you shut down Root, but it should behave safely if you forget.

Currently this class requires hardware occlusion query support and does nothing if it's not available. An alternative path could be added that uses a round-robin set of render textures, rendered to using a dedicated SceneManager, whose contents are blitted back to the CPU N frames later. This would achieve the same result, assuming the driver queues everything in order for all targets. However, hardware occlusion queries are pretty ubiquitous now, so this is left as an exercise for the reader.

This class is in response to user issues described in this thread.