A Study on the Usefulness of H.264 Encoding in a VNC Environment
Author: D. R. Commander
Update 8/19/14: After the initial study was performed, a member of the TigerVNC community pointed out that the x264 encoder's default settings cause a new I-frame (full frame) to be generated for every 250 frames. Given that the 2D datasets, in particular, contain a very large number of incremental framebuffer updates (approx. 19,000 in the case of slashdot-24) and that most of these encompass small regions of pixels, generating an I-frame for every 250 of them has a huge impact on the overall compression ratio. Thus, a new set of results was obtained, this time with the keyframe interval extended so that the encoder was no longer forced to generate periodic I-frames.
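The impact of the default keyframe interval can be illustrated with a back-of-the-envelope model. The per-frame byte counts below are purely illustrative assumptions, not measurements from this study:

```python
# Rough model of how x264's default keyframe interval inflates the encoded
# stream when every tiny VNC update is submitted as a "frame".  The byte
# counts below are illustrative assumptions, not measurements.

def stream_size(n_frames, keyint, i_frame_bytes, p_frame_bytes):
    """Total encoded bytes if an I-frame is forced every `keyint` frames."""
    n_i = (n_frames + keyint - 1) // keyint   # number of forced I-frames
    n_p = n_frames - n_i                      # remaining P-frames
    return n_i * i_frame_bytes + n_p * p_frame_bytes

N = 19_000            # approx. number of updates in the slashdot-24 dataset
I, P = 100_000, 500   # assumed bytes per I-frame vs. per tiny P-frame

default = stream_size(N, 250, I, P)   # x264 default: I-frame every 250 frames
single  = stream_size(N, N, I, P)     # one I-frame for the whole run
print(f"keyint=250: {default / 1e6:.1f} MB, single I-frame: {single / 1e6:.1f} MB")
```

Under these assumptions, the forced I-frames nearly double the stream size, which is why extending the keyframe interval was necessary for a fair comparison.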
As of this writing, H.264 (AKA "AVC") is one of the most popular video encoding standards in use, and its applications include Blu-ray discs, consumer digital video cameras, satellite TV, and numerous streaming video web sites (Vimeo, YouTube, etc.) H.264 employs advanced interframe compression techniques, which attempt to eliminate duplicate data between frames as well as to predict the content of frames based on previous or subsequent frames. Thus, as with other interframe codecs, a full frame is encoded only occasionally, and other frames are encoded relative to the full frame or relative to each other. This is similar in concept to the "full" and "incremental" framebuffer updates that VNC servers send to VNC viewers.
It should be noted that H.264's interframe encoding techniques are all performed "behind the scenes", and this already makes H.264 a bit of an awkward fit for VNC. VNC servers rely on the ability to encode and send only the changed pixels from the remote framebuffer to the VNC viewer. With H.264, on the other hand, it is necessary to encode the entire remote framebuffer image each time it changes (even if only one pixel changed) and to let the H.264 encoder analyze the image for differences relative to the previous image. Further, TurboVNC employs a highly adaptive codec that encodes subrectangles using the most efficient means possible, depending on the number of colors in the subrectangle. TurboVNC's codec will generally only use lossy compression if it detects that a subrectangle has sufficient color depth to benefit from it. H.264, in contrast, is designed around the needs of motion pictures and games and other content for which quality is perceived more in the aggregate than on a frame-by-frame basis. Thus, even prior to conducting this research, there was a sense that, if H.264 improved upon the TurboVNC codec at all, the improvement was likely to be confined to only certain types of applications.
The goal of this research was to investigate whether a compelling case could be made for implementing H.264 encoding in TurboVNC. If such a case could be made, then a great deal of follow-on work would be required in order to productize the technology, including:
A strawman H.264 encoder for the TurboVNC Benchmark Tools was developed using the open source x264 library and compared against the existing TurboVNC encoder at the low level. The new H.264 encoder implements the same interface as the existing encoders in the benchmark tools, so the two could be compared directly on the same datasets.
In order to verify that the H.264 encoder was producing a correct video stream, the encoder was extended such that it generates an FLV video file whenever a dataset is encoded. Playing back this file in a standard media player confirmed that the encoded stream was valid.
x264 provides several methods for controlling the size of the encoded stream, including specifying a constant frame rate and a constant bit rate. For the purposes of this research, however, the H.264 stream was encoded with a constant quality so as to produce an apples-to-apples comparison relative to the TurboVNC encoder. Constant frame rate and constant bit rate are not really meaningful in such a low-level experiment as this, since the experiment is designed to measure the encoder's efficiency irrespective of the constraints of a network.
All of the 24-bit datasets from the initial TurboVNC codec research were tested both with the TurboVNC 1.1+ encoder and with the strawman H.264 encoder. 4:2:0 subsampling was used in the TurboVNC encoder (since the H.264 encoder also uses 4:2:0), and two JPEG quality levels, 80 and 40, were chosen to represent "medium quality" and "low quality". The quality of the H.264 stream was adjusted until it produced approximately the same DSSIM on a specific frame of the GLXspheres dataset as did the "medium-quality" and "low-quality" TurboVNC baselines (in all cases, the DSSIM was measured against the same frame compressed with 4:2:0 subsampling and 100% quality.) The test images are included below so that the reader may visually confirm that they appear to be of approximately equal quality.
One will notice that JPEG produces ringing artifacts around the edges of the spheres, whereas H.264 produces more loss of clarity and blocking artifacts in the interior gradients. This makes sense, given H.264's nature as a motion picture codec. Such loss of detail would be the least noticeable to the human eye in 30-frame-per-second cinematic content, but such is not necessarily the case in a 3D visualization application. In volume visualization applications, for instance, these "interior gradients" are the actual data!
Note that this is admittedly not a particularly rigorous method for establishing a quality equivalence between the TurboVNC and H.264 encoders. If more time were available, then it would have been better to compare a frame from each of the datasets. The process of obtaining the frame captures is rather tedious, however.
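For reference, the DSSIM comparison described above can be sketched as follows. This is a minimal illustration that computes a single global SSIM over the whole image; practical DSSIM tools compute SSIM over small local windows and aggregate the results:

```python
# Minimal global SSIM/DSSIM sketch for grayscale images stored as flat
# lists of 8-bit luma values.  Real DSSIM tools compute SSIM over small
# sliding windows and aggregate; this global version is for illustration.

def ssim(x, y, L=255):
    n = len(x)
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # standard stabilizing constants
    mx, my = sum(x) / n, sum(y) / n             # means
    vx = sum((a - mx) ** 2 for a in x) / n      # variances
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx * mx + my * my + C1) * (vx + vy + C2))

def dssim(x, y):
    """One common definition: DSSIM = 1/SSIM - 1 (0 for identical images)."""
    return 1.0 / ssim(x, y) - 1.0
```

Identical images yield a DSSIM of 0, and larger values indicate greater distortion. (Note that more than one DSSIM definition is in use; (1 - SSIM)/2 is another common one.)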
The attached spreadsheet lists the results of the study. The first thing to note is that, even with the fastest settings, encoding with x264 was incredibly slow, particularly for the 2D datasets. Drilling down into the results reveals that, in fact, the encoder was compressing 80-90 frames/sec/core on these datasets, but each "frame" is not really a frame-- it is usually a tiny updated region of the remote display, and many of these tiny updates may be required in order to fully refresh the display (the 2D apps captured in these datasets were not using double buffering.) Because the entire framebuffer was being re-compressed each time it was updated, every framebuffer update took a roughly constant amount of time to encode, regardless of its size. Thus, the total encoding time with H.264 was essentially proportional to the number of updates rather than to the number of pixels. The performance of the H.264 encoder was unacceptably slow (much less than 1 Megapixel/sec) on the 2D datasets because of this phenomenon. By comparison, the TurboVNC encoder processed 20,000-30,000 updates/second on the 2D datasets, because its encoding time is generally proportional to the number of pixels, not the number of updates. This equated to 50-80 Megapixels/sec, far more than could be streamed over most broadband connections (in other words, the TurboVNC encoder can fill a broadband connection without using a whole CPU core.)
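The shape of this problem can be captured in a toy cost model; the framebuffer size, update sizes, and cost constants below are arbitrary assumptions for illustration:

```python
# Toy cost model contrasting a region-based encoder (cost proportional to
# the pixels actually encoded) with a frame-based encoder (cost proportional
# to the full framebuffer for every invocation).  All constants are
# arbitrary assumptions for illustration only.

FB_W, FB_H = 1280, 1024           # assumed remote framebuffer size
COST_PER_PIXEL = 1.0              # abstract cost units

def region_based_cost(updates):
    """Each update costs only its own area (TurboVNC-style encoding)."""
    return sum(w * h for (w, h) in updates) * COST_PER_PIXEL

def frame_based_cost(updates):
    """Each update forces a full-framebuffer encode (H.264-style encoding)."""
    return len(updates) * FB_W * FB_H * COST_PER_PIXEL

# 19,000 tiny (64x32) updates, as in a typical 2D dataset:
tiny = [(64, 32)] * 19_000
print(frame_based_cost(tiny) / region_based_cost(tiny))
```

With these assumptions, the frame-based encoder does 640x as much work as the region-based encoder on the same stream of tiny updates, which is consistent with the orders-of-magnitude slowdown observed on the 2D datasets.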
There were only seven datasets for which H.264 was capable of producing significantly better compression than TurboVNC: Photos (16-19% data reduction relative to TurboVNC), KDE Hearts (31% data reduction, but only in the low-quality case), TeamCenter Vis (13-14% data reduction), UGS/NX (15-24% data reduction), GLXspheres (51-60% data reduction), Google Earth (56-67% data reduction), and Quake 3 (44-62% data reduction.) The latter three were particularly compelling, probably because these are the datasets whose interframe differences are the most predictable. However, the performance of the Google Earth and Quake 3 datasets was abysmal-- generally less than 1 Megapixel/sec/core. Examining the video from each revealed the source of their poor performance: a notification icon was flashing in the bottom right of the window manager when the dataset was recorded, and this caused the dataset to contain a lot of small 2D framebuffer updates in addition to the large updates generated by the 3D application. This would be a non-issue if the applications were running in full-screen mode, but still, those workloads are quite normal for a VNC environment, and it is not unreasonable to expect the encoder to handle them. The performance of the other 3D datasets was generally about 15-30 Megapixels/sec/core, which was generally sufficient to fill a modest (10-15 megabit/second) broadband connection, but almost all of these datasets compressed significantly worse with H.264 than with the TurboVNC encoder. The total size of all of the compressed 3D datasets was about the same between H.264 and the TurboVNC encoder's fastest mode. The gains in compression ratio on tcvis-01-24, ugnx-01-24, glxspheres-24, googleearth-24, and q3demo-24 were cancelled out by losses on the other datasets.
Referring to the TurboVNC User's Guide, Compression Level 9 represents something of a poor trade-off between compression ratio and performance, since it nearly doubles the CPU time relative to Compression Level 2 while improving compression ratio by at most about 15% (and usually much less.) CL 9 was implemented solely for the purposes of emulating the "tightest" mode in TightVNC 1.3.x, in order to ease the minds of former TightVNC users who were contemplating upgrading to TurboVNC. However, as slow as CL 9 is, it is still (in the aggregate) about 80-90 times faster than H.264 when compressing 2D app workloads and about 5-6 times faster than H.264 when compressing 3D app workloads.
But let's back up for a moment. The TurboVNC Benchmark Tools are designed to measure the efficiency of various VNC encodings and subencodings, but the underlying assumptions that they make do not play to the strengths of H.264. H.264 is a frame-based codec, so what if we were to treat it like one? Since the H.264 encoder already stores each new update in an I420 holding buffer, we can simply defer the compression of the holding buffer until a specified interval has elapsed-- or, to put it another way, we can place a frame rate governor on the H.264 encoder. It is not technically fair to compare the resulting datastream with the datastream produced by the TurboVNC encoder, because the H.264 datastream no longer includes all of the data. However, what we're trying to do here is approach the problem from a usability point of view. In a real TurboVNC server environment, there are several mechanisms that act to coalesce or limit the framebuffer updates that are sent to a particular client:

- The deferred update timer, which coalesces framebuffer changes that occur close together in time into a single framebuffer update
- The RFB flow control extensions, which throttle the stream of updates based on the client's ability to receive them
- The client itself, which must request framebuffer updates and thus naturally paces the server on slow connections
It normally doesn't make sense to emulate these behaviors in the TurboVNC Benchmark Tools, because all of the other VNC encoders are region-based. Thus, their low-level performance and efficiency are not generally affected by any of the mechanisms above. In the case of H.264, however, since each invocation of the encoder incurs such high overhead, coalescing small framebuffer updates is critical to achieving decent performance.
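A frame rate governor of the sort described above might be sketched as follows (the class and method names are hypothetical, not actual TurboVNC code). Updates are merged into the holding buffer immediately, but the expensive full-frame encode occurs at most once per interval:

```python
# Hypothetical frame rate governor.  Incoming updates are assumed to have
# already been merged into the I420 holding buffer; this class merely
# decides when the (expensive) full-frame H.264 encode should run.
# Times are in milliseconds for this sketch.

class FrameRateGovernor:
    def __init__(self, max_fps=25.0):
        self.interval_ms = 1000.0 / max_fps
        self.last_encode_ms = None
        self.dirty = False          # holding buffer has unencoded changes

    def update(self, now_ms):
        """Record a framebuffer change; return True if the caller should
        invoke the H.264 encoder on the holding buffer now."""
        self.dirty = True
        if (self.last_encode_ms is None
                or now_ms - self.last_encode_ms >= self.interval_ms):
            self.last_encode_ms = now_ms
            self.dirty = False
            return True
        return False

gov = FrameRateGovernor(max_fps=25)
# 1,000 small updates arriving over one second trigger at most ~25 encodes:
encodes = sum(gov.update(t) for t in range(1000))
print(encodes)
```

A real implementation would also need a timer to flush the holding buffer when updates stop arriving while it is still dirty; otherwise, the final state of the display would never reach the client.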
From a user experience point of view, the TurboVNC encoder in low-quality mode is never going to be able to stream more than about 2 Megapixels/sec for every Megabit/sec of bandwidth. Thus, for a modest broadband connection (10-15 Megabits/sec), we can achieve a similar level of usability with the H.264 encoder if we can make it compress about 20-30 Megapixels/sec.
Setting the maximum frame rate of the H.264 encoder to 25 had a dramatic effect on the performance of the 2D datasets and the Google Earth and Quake 3 3D datasets. It was now possible to compress 15 Megapixels/sec or more in all cases using a single CPU core. For the seven datasets mentioned above, the H.264 encoder was now able to produce a 23-85% smaller datastream than the TurboVNC encoder, but it also required 2.2 to 6.5x as much CPU time as the TurboVNC encoder. Further, for the other datasets, H.264 still did a worse job of compression than the TurboVNC encoder, despite the fact that H.264 was being given an unfair advantage.
Translating these results into an actual TurboVNC server environment could prove non-trivial. Given the complex interaction between the deferred update mechanism and the RFB flow control extensions, it is doubtful that a simple frame rate governor would suffice.
H.264 is definitely not a suitable general-purpose replacement for the TurboVNC encoder. When configured and tuned correctly, and when the encoder architecture is restructured to be frame-based instead of region-based, H.264 can provide very compelling compression ratio improvements on certain types of workloads-- particularly games and other application workloads that produce relatively predictable motion between frames. However, it should also be noted that frame spoiling will interfere with the interframe comparison techniques used by H.264, so these results may not bear out in an actual VirtualGL/TurboVNC environment. Since some of the 3D datasets used in this research were originally captured with frame spoiling turned on, that could have contributed to the lackluster compression ratios achieved by H.264 on those datasets. Frame spoiling is a necessary feature of remote 3D display environments, and any encoder that is to be used in such environments needs to be able to properly deal with it.
There may be faster H.264 implementations, but x264 is generally regarded as being the best open source solution, and at the moment, its performance is generally too slow to be of much use in most TurboVNC deployments. Using an entire core's worth of CPU time, it was able to provide barely sufficient performance to remotely display a 1-megapixel 3D app at interactive frame rates. The whole purpose of H.264 would be to reduce bandwidth so that more frames/second could be squeezed into a low-bandwidth network connection, but if the encoder is CPU-limited, then it doesn't matter how much you shrink the bandwidth-- the frame rate will stay the same. Using a whole core per user is generally unacceptable in a large-scale, multi-user TurboVNC server environment. Since x264 is already using SIMD acceleration, it is doubtful that there is that much room for improvement, although the encoder did prove to be quite scalable in a multi-core environment (more than doubling its performance with 4 cores relative to 1.) This indicates that H.264 is likely a much better candidate for GPU acceleration than JPEG is.
Given the performance challenges involved with integrating H.264 encoding into TurboVNC, and given the relatively slow performance of x264, a more viable near-term approach might be to integrate H.264 into VirtualGL. Since VirtualGL deals only with 3D images with generally constant sizes and predictable motion, and since VirtualGL could take advantage of the H.264 encoding capabilities on high-end nVidia GPUs, it would seemingly be a better fit. The downside to this, however, is that such a solution would require a set of X and RFB extensions to allow it to tunnel the H.264 stream through TurboVNC, and implementing these extensions would be difficult and time-consuming.
It is worth noting that the "constant frame rate" and "constant bit rate" features of the H.264 encoder would be more useful in a production environment than they were in these low-level tests. Those features could serve as the basis for an adaptive quality mechanism, perhaps even obviating the automatic lossless refresh (ALR) feature in TurboVNC. However, the fact that H.264 may facilitate the implementation of an adaptive quality mechanism is of little consequence until we can improve the performance of the encoder. The assumption behind adaptive quality is that performance is network-limited and that you can improve the frame rate by decreasing the quality during periods of high activity. If the performance is not network-limited, then there is no point to this.
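As a sketch of what such an adaptive quality mechanism might look like (a hypothetical controller, not an existing TurboVNC feature), the server could measure the achieved bitrate over each interval and nudge the quality toward a network budget:

```python
# Hypothetical adaptive quality controller: lower the encoding quality
# when the measured bitrate exceeds the network's budget, and raise it
# when there is headroom.  All thresholds and step sizes are arbitrary
# assumptions for illustration.

def adjust_quality(quality, measured_kbps, target_kbps, step=5,
                   lo=30, hi=95):
    if measured_kbps > target_kbps:
        quality = max(lo, quality - step)      # too much data: degrade
    elif measured_kbps < 0.8 * target_kbps:
        quality = min(hi, quality + step)      # headroom: improve
    return quality                             # otherwise: hold steady

q = 80
q = adjust_quality(q, measured_kbps=12_000, target_kbps=10_000)  # congested
print(q)
```

As noted above, this kind of controller only helps when the network, rather than the encoder's CPU usage, is the bottleneck.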
Although H.264 is not going to be a general-purpose replacement for the TurboVNC encoder, it might still be beneficial to add support for it, as a way of letting users explore its potential. For instance, despite its high CPU usage, it might do a much better job of transmitting certain types of workloads over very-low-bandwidth connections (e.g. satellite.) It would take some effort to implement the encoder in TurboVNC, effort that would need financial sponsorship, but the project should be mostly straightforward.
This research did show that H.264 can be beneficial for certain types of apps, particularly those that generate somewhat game-like or video-like image sequences. X servers won't be around forever. At some point, Wayland will enter the picture, and we (as an industry) will have to figure out how to remotely display those applications. Even sooner than that, apps will increasingly stop using X primitives in favor of X RENDER and other image-based drawing methods, so it is likely that application workloads will become more H.264-friendly as time progresses. Open source H.264 codecs and CPUs will also improve in that same time period. This situation will continue to be tracked in the meantime.
Further research in the near term could include farming off the H.264 encoding to an nVidia GPU using NVENC. It is unknown how much overhead this would incur, but it is worth studying. If your organization would like to sponsor this research, or the efforts to integrate H.264 into TurboVNC, please contact me.
All content on this web site is licensed under the Creative Commons Attribution 2.5 License. Any works containing material derived from this web site must cite The VirtualGL Project as the source of the material and list the current URL for the TurboVNC web site.