Getting the Best Performance from the API

System performance (raw throughput, latency, and jitter tolerance) can be affected by a variety of factors. One of these factors is how the client application uses VDPAU: for example, the number of surfaces it allocates for buffering and the order in which it issues operations.

NVIDIA GPUs typically contain a number of separate hardware modules that are capable of performing different parts of the video decode, post-processing, and display operations in parallel. To obtain the best performance, the client application must attempt to keep all these modules busy with work at all times.

Consider the decoding process. At a bare minimum, the application must allocate one video surface for each reference frame that the stream can use (2 for MPEG or VC-1, a variable stream-dependent number for H.264) plus one surface for the picture currently being decoded. However, if only this minimum number of surfaces is used, performance may be poor, because back-to-back decodes of non-reference frames must be written into the same video surface. Decoding of the second frame then has to wait until decoding of the first has completed, which is a pipeline stall.

Further, if the video surfaces are being read by the video mixer for post-processing and eventual display, the surfaces are "locked" for even longer: until the video mixer has finished reading a surface, no subsequent decode operation can write to it. Recall that when advanced de-interlacing techniques are used, a history of video surfaces must be provided to the video mixer, which requires even more video surfaces to be allocated.

For this reason, NVIDIA recommends the following number of video surfaces be allocated:

Next, consider the display path via the presentation queue. This portion of the pipeline requires at least 2 output surfaces; one that is being actively displayed by the presentation queue, and one being rendered to for subsequent display. As before, using this minimum number of surfaces may not be optimal. For some video streams, the hardware may only achieve real-time decoding on average, not for each individual frame. Using compositing APIs to render on-screen displays, graphical user interfaces, etc., may introduce extra jitter and latency into the pipeline. Similarly, system level issues such as scheduler algorithms and system load may prevent the CPU portion of the driver from operating for short periods of time. All of these potential issues may be solved by allocating more output surfaces, and queuing more than one outstanding output surface into the presentation queue.
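The rotation through several output surfaces can be sketched as a small ring. This is only an illustrative model, not part of VDPAU: the `output_ring` type and the `ring_*` helpers are hypothetical names, and the bookkeeping is deliberately simplified (a surface is only counted as free again once the client has waited on it). In a real client, the point marked in `ring_acquire` is where a call such as VdpPresentationQueueBlockUntilSurfaceIdle would be made before rendering into the surface again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a ring of output surfaces. */
struct output_ring {
    size_t count;     /* number of output surfaces allocated (at least 2) */
    size_t next;      /* index of the surface to render into next */
    size_t in_flight; /* surfaces handed to the presentation queue and
                         not yet waited on */
};

static void ring_init(struct output_ring *r, size_t count)
{
    assert(r != NULL && count >= 2);
    r->count = count;
    r->next = 0;
    r->in_flight = 0;
}

/* Returns the index of the surface to render into next. Sets *must_wait
 * when every surface has already been queued, meaning the caller has to
 * wait for this (oldest) surface to become idle before drawing into it. */
static size_t ring_acquire(struct output_ring *r, bool *must_wait)
{
    *must_wait = false;
    if (r->in_flight == r->count) {
        /* Real client: block until the surface at r->next is idle. */
        r->in_flight--;
        *must_wait = true;
    }
    return r->next;
}

/* Records that the acquired surface was rendered and queued for display. */
static void ring_queue(struct output_ring *r)
{
    assert(r->in_flight < r->count);
    r->in_flight++;
    r->next = (r->next + 1) % r->count;
}
```

With three surfaces in the ring, the client can render and queue three frames before it ever has to wait; with only two, it waits from the third frame onward. That slack is what absorbs short bursts of jitter.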

The reason for using more than the minimum number of video surfaces is to ensure that the decoding and post-processing pipeline is not stalled, and hence is kept busy for the maximum amount of time possible. In contrast, the reason for using more than the minimum number of output surfaces is to hide jitter and latency in various GPU and CPU operations.

The choice of exactly how many surfaces to allocate is a trade-off between resource usage and performance. Allocating more than the minimum number of surfaces will generally increase performance, but uses proportionally more video RAM, which may cause allocations to fail. This could be particularly problematic on systems with a small amount of video RAM. A stellar application would adjust to this automatically by first allocating the bare minimum number of surfaces (treating failures as fatal), then attempting to allocate more and more surfaces, as long as the allocations keep succeeding, up to the suggested limits above.
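The adaptive strategy described above might look like the following minimal sketch. It is written against a hypothetical allocator callback, `surface_alloc_fn`, rather than the real API, so that the logic can be shown and exercised without a GPU; in a real client the callback would wrap a call such as VdpVideoSurfaceCreate. The `fake_vram` stand-in and all function names here are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical allocator callback: returns true and stores a surface
 * handle on success, false when the allocation fails. */
typedef bool (*surface_alloc_fn)(void *ctx, unsigned *handle_out);

/* Tries to allocate up to target_count surfaces and returns how many
 * were obtained. If the result is below min_count, the caller should
 * release handles[0..result) and treat the failure as fatal; any count
 * from min_count up to target_count is usable. */
static size_t allocate_surfaces(surface_alloc_fn alloc, void *ctx,
                                unsigned *handles,
                                size_t min_count, size_t target_count)
{
    assert(alloc != NULL && handles != NULL);
    assert(min_count <= target_count);

    size_t n = 0;
    while (n < target_count && alloc(ctx, &handles[n]))
        n++;
    return n;
}

/* Stand-in allocator for demonstration only: succeeds until a fixed
 * budget is exhausted, mimicking video RAM running out. */
struct fake_vram {
    unsigned remaining;   /* allocations that will still succeed */
    unsigned next_handle; /* handle value to hand out next */
};

static bool fake_alloc(void *ctx, unsigned *handle_out)
{
    struct fake_vram *vram = ctx;
    if (vram->remaining == 0)
        return false;
    vram->remaining--;
    *handle_out = vram->next_handle++;
    return true;
}
```

For example, with a minimum of 3, a target of 8, and a budget that allows only 5 allocations, `allocate_surfaces` returns 5 and the application runs with fewer surfaces than the target instead of failing outright.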

The video decoder's memory usage is also proportional to the maximum number of reference frames specified at creation time. Requesting a larger number of reference frames can significantly increase memory usage. Hence it is best for applications that decode H.264 to request only the actual number of reference frames specified in the stream, rather than e.g. hard-coding a limit of 16, or even the maximum number of surfaces allowable by some specific H.264 level at the stream's resolution.
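To make the difference between these choices concrete, the sketch below compares the level-derived cap with the value taken from the stream. Both helper names are hypothetical. The level cap uses the H.264 decoded picture buffer limit, min(MaxDpbMbs / (PicWidthInMbs * FrameHeightInMbs), 16), where MaxDpbMbs comes from the level table in the H.264 specification; the stream value is assumed to come from the sequence parameter set field max_num_ref_frames (num_ref_frames in older editions), which the application has to parse itself.

```c
#include <assert.h>

/* Largest number of frames the decoded picture buffer may hold for a
 * given level and resolution. max_dpb_mbs is the MaxDpbMbs value from
 * the H.264 level table; picture sizes are in 16x16 macroblocks. */
static unsigned level_max_dpb_frames(unsigned max_dpb_mbs,
                                     unsigned width_in_mbs,
                                     unsigned height_in_mbs)
{
    assert(width_in_mbs > 0 && height_in_mbs > 0);
    unsigned frames = max_dpb_mbs / (width_in_mbs * height_in_mbs);
    return frames < 16 ? frames : 16;
}

/* Value to request as the decoder's maximum reference count: the number
 * the stream itself signals, clamped to the format-wide maximum of 16. */
static unsigned choose_max_references(unsigned stream_num_ref_frames)
{
    return stream_num_ref_frames < 16 ? stream_num_ref_frames : 16;
}
```

As an illustration, take a 1920x1088 stream (120 x 68 macroblocks) and assume MaxDpbMbs = 32768, the figure listed for Level 4.1 (check it against the specification). The level cap works out to 4 frames, a hard-coded limit would be 16, and if the stream signals only 2 reference frames, `choose_max_references(2)` requests 2.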

Note that the NVIDIA implementation correctly implements all required interlocks between the various pipelined hardware modules. Applications never need to worry about correctness (provided their API usage is legal and sensible); they only have to worry about performance.