A Study on the Usefulness of H.264 Encoding in a VNC Environment
Author: D. R. Commander
Update 8/19/14: After the initial study was performed, a member of the TigerVNC community pointed out that the x264 encoder's default settings cause a new I-frame (full frame) to be generated for every 250 frames. Given that the 2D datasets, in particular, contain a very large number of incremental framebuffer updates (approx. 19,000 in the case of slashdot-24) and that most of these encompass small regions of pixels, generating an I-frame for every 250 of them has a huge impact on the overall compression ratio. Thus, a new set of results was obtained, this time with the keyframe interval extended so that the encoder was no longer forced to generate periodic I-frames.
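The impact of the default keyframe interval can be illustrated with a back-of-the-envelope model. The per-frame byte counts below are purely illustrative assumptions, not measurements from this study:

```python
# Rough model of how x264's default keyframe interval inflates the encoded
# stream when every tiny VNC update is submitted as a "frame".  The byte
# counts below are illustrative assumptions, not measurements.

def stream_size(n_frames, keyint, i_frame_bytes, p_frame_bytes):
    """Total encoded bytes if an I-frame is forced every `keyint` frames."""
    n_i = (n_frames + keyint - 1) // keyint   # number of forced I-frames
    n_p = n_frames - n_i                      # remaining P-frames
    return n_i * i_frame_bytes + n_p * p_frame_bytes

N = 19_000            # approx. number of updates in the slashdot-24 dataset
I, P = 100_000, 500   # assumed bytes per I-frame vs. per tiny P-frame

default = stream_size(N, 250, I, P)   # x264 default: I-frame every 250 frames
single  = stream_size(N, N, I, P)     # one I-frame for the whole run
print(f"keyint=250: {default / 1e6:.1f} MB, single I-frame: {single / 1e6:.1f} MB")
```

Under these assumptions, the forced I-frames nearly double the stream size, which is why extending the keyframe interval was necessary for a fair comparison.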
As of this writing, H.264 (AKA "AVC") is one of the most popular video encoding standards in use, and its applications include Blu-ray discs, consumer digital video cameras, satellite TV, and numerous streaming video web sites (Vimeo, YouTube, etc.) H.264 employs advanced interframe compression techniques, which attempt to eliminate duplicate data between frames as well as to predict the content of frames based on previous or subsequent frames. Thus, as with other interframe codecs, a full frame is encoded only occasionally, and other frames are encoded relative to the full frame or relative to each other. This is similar in concept to the "full" and "incremental" framebuffer updates that VNC servers send to VNC viewers.
It should be noted that H.264's interframe encoding techniques are all performed "behind the scenes", and this already makes H.264 a bit of an awkward fit for VNC. VNC servers rely on the ability to encode and send only the changed pixels from the remote framebuffer to the VNC viewer. With H.264, on the other hand, it is necessary to encode the entire remote framebuffer image each time it changes (even if only one pixel changed) and to let the H.264 encoder analyze the image for differences relative to the previous image. Further, TurboVNC employs a highly adaptive codec that encodes subrectangles using the most efficient means possible, depending on the number of colors in the subrectangle. TurboVNC's codec will generally only use lossy compression if it detects that a subrectangle has sufficient color depth to benefit from it. H.264, in contrast, is designed around the needs of motion pictures and games and other content for which quality is perceived more in the aggregate than on a frame-by-frame basis. Thus, even prior to conducting this research, there was a sense that, if H.264 improved upon the TurboVNC codec at all, the improvement was likely to be confined to only certain types of applications.
The goal of this research was to investigate whether a compelling case could be made for implementing H.264 encoding in TurboVNC. If such a case could be made, then a great deal of follow-on work would be required in order to productize the technology, including:
A strawman H.264 encoder for the TurboVNC Benchmark Tools was developed using the open source x264 library and compared against the existing TurboVNC encoder at the low level. The new H.264 encoder implements the same interface as the existing encoders in the benchmark tools, so the two could be compared directly on the same datasets.
In order to verify that the H.264 encoder was producing a correct video stream, the encoder was extended such that it generates an FLV video file whenever a dataset is encoded. Playing back this file in a standard media player confirmed that the encoded stream was valid.
x264 provides several methods for controlling the size of the encoded stream, including specifying a constant frame rate and a constant bit rate. For the purposes of this research, however, the H.264 stream was encoded with a constant quality so as to produce an apples-to-apples comparison relative to the TurboVNC encoder. Constant frame rate and constant bit rate are not really meaningful in such a low-level experiment as this, since the experiment is designed to measure the encoder's efficiency irrespective of the constraints of a network.
All of the 24-bit datasets from the initial TurboVNC codec research were tested both with the TurboVNC 1.1+ encoder and with the strawman H.264 encoder. 4:2:0 subsampling was used in the TurboVNC encoder (since the H.264 encoder also uses 4:2:0), and two JPEG quality levels, 80 and 40, were chosen to represent "medium quality" and "low quality". The quality of the H.264 stream was adjusted until it produced approximately the same DSSIM on a specific frame of the GLXspheres dataset as did the "medium-quality" and "low-quality" TurboVNC baselines (in all cases, the DSSIM was measured against the same frame compressed with 4:2:0 subsampling and 100% quality.) The test images are included below so that the reader may visually confirm that they appear to be of approximately equal quality.
One will notice that JPEG produces ringing artifacts around the edges of the spheres, whereas H.264 produces more loss of clarity and blocking artifacts in the interior gradients. This makes sense, given H.264's nature as a motion picture codec. Such loss of detail would be the least noticeable to the human eye in 30-frame-per-second cinematic content, but such is not necessarily the case in a 3D visualization application. In volume visualization applications, for instance, these "interior gradients" are the actual data!
Note that this is admittedly not a particularly rigorous method for establishing a quality equivalence between the TurboVNC and H.264 encoders. If more time were available, then it would have been better to compare a frame from each of the datasets. The process of obtaining the frame captures is rather tedious, however.
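For reference, the DSSIM comparison described above can be sketched as follows. This is a minimal illustration that computes a single global SSIM over the whole image; practical DSSIM tools compute SSIM over small local windows and aggregate the results:

```python
# Minimal global SSIM/DSSIM sketch for grayscale images stored as flat
# lists of 8-bit luma values.  Real DSSIM tools compute SSIM over small
# sliding windows and aggregate; this global version is for illustration.

def ssim(x, y, L=255):
    n = len(x)
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2   # standard stabilizing constants
    mx, my = sum(x) / n, sum(y) / n             # means
    vx = sum((a - mx) ** 2 for a in x) / n      # variances
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx * mx + my * my + C1) * (vx + vy + C2))

def dssim(x, y):
    """One common definition: DSSIM = 1/SSIM - 1 (0 for identical images)."""
    return 1.0 / ssim(x, y) - 1.0
```

Identical images yield a DSSIM of 0, and larger values indicate greater distortion. (Note that more than one DSSIM definition is in use; (1 - SSIM)/2 is another common one.)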
The attached spreadsheet lists the results of the study. The first thing to note is that, even with the fastest settings, encoding with x264 was incredibly slow, particularly for the 2D datasets. Drilling down into the results reveals that, in fact, the encoder was compressing 80-90 frames/sec/core on these datasets, but each "frame" is not really a frame-- it is usually a tiny updated region of the remote display, and many of these tiny updates may be required in order to fully refresh the display (the 2D apps captured in these datasets were not using double buffering.) Because the entire framebuffer was being re-compressed each time it was updated, every framebuffer update took a roughly constant amount of time to encode, regardless of its size. Thus, the total encoding time with H.264 was essentially proportional to the number of updates rather than to the number of pixels. The performance of the H.264 encoder was unacceptably slow (much less than 1 Megapixel/sec) on the 2D datasets because of this phenomenon. By comparison, the TurboVNC encoder processed 20,000-30,000 updates/second on the 2D datasets, because its encoding time is generally proportional to the number of pixels, not the number of updates. This equated to 50-80 Megapixels/sec, far more than could be streamed over most broadband connections (in other words, the TurboVNC encoder can fill a broadband connection without using a whole CPU core.)
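The shape of this problem can be captured in a toy cost model; the framebuffer size, update sizes, and cost constants below are arbitrary assumptions for illustration:

```python
# Toy cost model contrasting a region-based encoder (cost proportional to
# the pixels actually encoded) with a frame-based encoder (cost proportional
# to the full framebuffer for every invocation).  All constants are
# arbitrary assumptions for illustration only.

FB_W, FB_H = 1280, 1024           # assumed remote framebuffer size
COST_PER_PIXEL = 1.0              # abstract cost units

def region_based_cost(updates):
    """Each update costs only its own area (TurboVNC-style encoding)."""
    return sum(w * h for (w, h) in updates) * COST_PER_PIXEL

def frame_based_cost(updates):
    """Each update forces a full-framebuffer encode (H.264-style encoding)."""
    return len(updates) * FB_W * FB_H * COST_PER_PIXEL

# 19,000 tiny (64x32) updates, as in a typical 2D dataset:
tiny = [(64, 32)] * 19_000
print(frame_based_cost(tiny) / region_based_cost(tiny))
```

With these assumptions, the frame-based encoder does 640x as much work as the region-based encoder on the same stream of tiny updates, which is consistent with the orders-of-magnitude slowdown observed on the 2D datasets.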
There were only seven datasets for which H.264 was capable of producing significantly better compression than TurboVNC: Photos (16-19% data reduction relative to TurboVNC), KDE Hearts (31% data reduction, but only in the low-quality case), TeamCenter Vis (13-14% data reduction), UGS/NX (15-24% data reduction), GLXspheres (51-60% data reduction), Google Earth (56-67% data reduction), and Quake 3 (44-62% data reduction.) The latter three were particularly compelling, probably because these are the datasets whose interframe differences are the most predictable. However, the performance of the Google Earth and Quake 3 datasets was abysmal-- generally less than 1 Megapixel/sec/core. Examining the video from each revealed the source of their poor performance: a notification icon was flashing in the bottom right of the window manager when the dataset was recorded, and this caused the dataset to contain a lot of small 2D framebuffer updates in addition to the large updates generated by the 3D application. This would be a non-issue if the applications were running in full-screen mode, but still, those workloads are quite normal for a VNC environment, and it is not unreasonable to expect the encoder to handle them. The performance of the other 3D datasets was generally about 15-30 Megapixels/sec/core, which was generally sufficient to fill a modest (10-15 megabit/second) broadband connection, but almost all of these datasets compressed significantly worse with H.264 than with the TurboVNC encoder. The total size of all of the compressed 3D datasets was about the same between H.264 and the TurboVNC encoder's fastest mode. The gains in compression ratio on tcvis-01-24, ugnx-01-24, glxspheres-24, googleearth-24, and q3demo-24 were cancelled out by losses on the other datasets.
Referring to the TurboVNC User's Guide, Compression Level 9 represents something of a poor trade-off between compression ratio and performance, since it nearly doubles the CPU time relative to Compression Level 2 while improving compression ratio by at most about 15% (and usually much less.) CL 9 was implemented solely for the purposes of emulating the "tightest" mode in TightVNC 1.3.x, in order to ease the minds of former TightVNC users who were contemplating upgrading to TurboVNC. However, as slow as CL 9 is, it is still (in the aggregate) about 80-90 times faster than H.264 when compressing 2D app workloads and about 5-6 times faster than H.264 when compressing 3D app workloads.
But let's back up for a moment. The TurboVNC Benchmark Tools are designed to measure the efficiency of various VNC encodings and subencodings, but the underlying assumptions that they make do not play to the strengths of H.264. H.264 is a frame-based codec, so what if we were to treat it like one? Since the H.264 encoder already stores each new update in an I420 holding buffer, we can simply defer the compression of the holding buffer until a specified interval has elapsed-- or, to put it another way, we can place a frame rate governor on the H.264 encoder. It is not technically fair to compare the resulting datastream with the datastream produced by the TurboVNC encoder, because the H.264 datastream no longer includes all of the data. However, what we're trying to do here is approach the problem from a usability point of view. In a real TurboVNC server environment, there are several mechanisms that act to coalesce or limit the framebuffer updates that are sent to a particular client:

- The deferred update timer, which coalesces framebuffer changes that occur close together in time into a single framebuffer update
- The RFB flow control extensions, which throttle the stream of updates based on the client's ability to receive them
- The client itself, which must request framebuffer updates and thus naturally paces the server on slow connections
It normally doesn't make sense to emulate these behaviors in the TurboVNC Benchmark Tools, because all of the other VNC encoders are region-based. Thus, their low-level performance and efficiency are not generally affected by any of the mechanisms above. In the case of H.264, however, since each invocation of the encoder incurs such high overhead, coalescing small framebuffer updates is critical to achieving decent performance.
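A frame rate governor of the sort described above might be sketched as follows (the class and method names are hypothetical, not actual TurboVNC code). Updates are merged into the holding buffer immediately, but the expensive full-frame encode occurs at most once per interval:

```python
# Hypothetical frame rate governor.  Incoming updates are assumed to have
# already been merged into the I420 holding buffer; this class merely
# decides when the (expensive) full-frame H.264 encode should run.
# Times are in milliseconds for this sketch.

class FrameRateGovernor:
    def __init__(self, max_fps=25.0):
        self.interval_ms = 1000.0 / max_fps
        self.last_encode_ms = None
        self.dirty = False          # holding buffer has unencoded changes

    def update(self, now_ms):
        """Record a framebuffer change; return True if the caller should
        invoke the H.264 encoder on the holding buffer now."""
        self.dirty = True
        if (self.last_encode_ms is None
                or now_ms - self.last_encode_ms >= self.interval_ms):
            self.last_encode_ms = now_ms
            self.dirty = False
            return True
        return False

gov = FrameRateGovernor(max_fps=25)
# 1,000 small updates arriving over one second trigger at most ~25 encodes:
encodes = sum(gov.update(t) for t in range(1000))
print(encodes)
```

A real implementation would also need a timer to flush the holding buffer when updates stop arriving while it is still dirty; otherwise, the final state of the display would never reach the client.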
From a user experience point of view, the TurboVNC encoder in low-quality mode is never going to be able to stream more than about 2 Megapixels/sec for every Megabit/sec of bandwidth. Thus, for a modest broadband connection (10-15 Megabits/sec), we can achieve a similar level of usability with the H.264 encoder if we can make it compress about 20-30 Megapixels/sec.
Setting the maximum frame rate of the H.264 encoder to 25 had a dramatic effect on the performance of the 2D datasets and the Google Earth and Quake 3 3D datasets. It was now possible to compress 15 Megapixels/sec or more in all cases using a single CPU core. For the seven datasets mentioned above, the H.264 encoder was now able to produce a 23-85% smaller datastream than the TurboVNC encoder, but it also required 2.2 to 6.5x as much CPU time as the TurboVNC encoder. Further, for the other datasets, H.264 still did a worse job of compression than the TurboVNC encoder, despite the fact that H.264 was being given an unfair advantage.
Translating these results into an actual TurboVNC server environment could prove non-trivial. Given the complex interaction between the deferred update mechanism and the RFB flow control extensions, it is doubtful that a simple frame rate governor would suffice.
H.264 is definitely not a suitable general-purpose replacement for the TurboVNC encoder. When configured and tuned correctly, and when the encoder architecture is restructured to be frame-based instead of region-based, H.264 can provide very compelling compression ratio improvements on certain types of workloads-- particularly games and other application workloads that produce relatively predictable motion between frames. However, it should also be noted that frame spoiling will interfere with the interframe comparison techniques used by H.264, so these results may not bear out in an actual VirtualGL/TurboVNC environment. Since some of the 3D datasets used in this research were originally captured with frame spoiling turned on, that could have contributed to the lackluster compression ratios achieved by H.264 on those datasets. Frame spoiling is a necessary feature of remote 3D display environments, and any encoder that is to be used in such environments needs to be able to properly deal with it.
There may be faster H.264 implementations, but x264 is generally regarded as being the best open source solution, and at the moment, its performance is generally too slow to be of much use in most TurboVNC deployments. Using an entire core's worth of CPU time, it was able to provide barely sufficient performance to remotely display a 1-megapixel 3D app at interactive frame rates. The whole purpose of H.264 would be to reduce bandwidth so that more frames/second could be squeezed into a low-bandwidth network connection, but if the encoder is CPU-limited, then it doesn't matter how much you shrink the bandwidth-- the frame rate will stay the same. Using a whole core per user is generally unacceptable in a large-scale, multi-user TurboVNC server environment. Since x264 is already using SIMD acceleration, it is doubtful that there is that much room for improvement, although the encoder did prove to be quite scalable in a multi-core environment (more than doubling its performance with 4 cores relative to 1.) This indicates that H.264 is likely a much better candidate for GPU acceleration than JPEG is.
Given the performance challenges involved with integrating H.264 encoding into TurboVNC, and given the relatively slow performance of x264, a more viable near-term approach might be to integrate H.264 into VirtualGL. Since VirtualGL deals only with 3D images with generally constant sizes and predictable motion, and since VirtualGL could take advantage of the H.264 encoding capabilities on high-end nVidia GPUs, it would seemingly be a better fit. The downside to this, however, is that such a solution would require a set of X and RFB extensions to allow it to tunnel the H.264 stream through TurboVNC, and implementing these extensions would be difficult and time-consuming.
It is worth noting that the "constant frame rate" and "constant bit rate" features of the H.264 encoder would be more useful in a production environment than they were in these low-level tests. Those features could serve as the basis for an adaptive quality mechanism, perhaps even obviating the automatic lossless refresh (ALR) feature in TurboVNC. However, the fact that H.264 may facilitate the implementation of an adaptive quality mechanism is of little consequence until we can improve the performance of the encoder. The assumption behind adaptive quality is that performance is network-limited and that you can improve the frame rate by decreasing the quality during periods of high activity. If the performance is not network-limited, then there is no point to this.
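As a sketch of what such an adaptive quality mechanism might look like (a hypothetical controller, not an existing TurboVNC feature), the server could measure the achieved bitrate over each interval and nudge the quality toward a network budget:

```python
# Hypothetical adaptive quality controller: lower the encoding quality
# when the measured bitrate exceeds the network's budget, and raise it
# when there is headroom.  All thresholds and step sizes are arbitrary
# assumptions for illustration.

def adjust_quality(quality, measured_kbps, target_kbps, step=5,
                   lo=30, hi=95):
    if measured_kbps > target_kbps:
        quality = max(lo, quality - step)      # too much data: degrade
    elif measured_kbps < 0.8 * target_kbps:
        quality = min(hi, quality + step)      # headroom: improve
    return quality                             # otherwise: hold steady

q = 80
q = adjust_quality(q, measured_kbps=12_000, target_kbps=10_000)  # congested
print(q)
```

As noted above, this kind of controller only helps when the network, rather than the encoder's CPU usage, is the bottleneck.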
Although H.264 is not going to be a general-purpose replacement for the TurboVNC encoder, it might still be beneficial to add support for it, as a way of letting users explore its potential. For instance, despite its high CPU usage, it might do a much better job of transmitting certain types of workloads over very-low-bandwidth connections (e.g. satellite.) It would take some effort to implement the encoder in TurboVNC, effort that would need financial sponsorship, but the project should be mostly straightforward.
This research did show that H.264 can be beneficial for certain types of apps, particularly those that generate somewhat game-like or video-like image sequences. X servers won't be around forever. At some point, Wayland will enter the picture, and we (as an industry) will have to figure out how to remotely display those applications. Even sooner than that, apps will increasingly stop using X primitives in favor of X RENDER and other image-based drawing methods, so it is likely that application workloads will become more H.264-friendly as time progresses. Open source H.264 codecs and CPUs will also improve in that same time period. This situation will continue to be tracked in the meantime.
Further research in the near term could include farming off the H.264 encoding to an nVidia GPU using NVENC. It is unknown how much overhead this would incur, but it is worth studying. If your organization would like to sponsor this research, or the efforts to integrate H.264 into TurboVNC, please contact me.
All content on this web site is licensed under the Creative Commons Attribution 2.5 License. Any works containing material derived from this web site must cite The VirtualGL Project as the source of the material and list the current URL for the TurboVNC web site.