CPU vs GPU: A Discussion about Hardware Acceleration and Rendering

(Interview with Prof. Dr. Philipp Slusallek, an accomplished scientist and veteran of the computer graphics field, conducted in January 2015 for Seekscale, a cloud rendering startup)

Hardware acceleration is hot in the rendering industry right now, and while people are still trying to figure out how to split workloads between CPUs and GPUs, FPGAs are quickly rising. We interviewed Philipp Slusallek, Scientific Director and Computer Graphics professor at Saarland University, and talked about what’s hot right now, and above all what we can expect in CG in the near future! We first heard about Philipp by coming across this great discussion.

– What rendering techniques are best adapted to each hardware (GPU/CPU)?

We basically have two key algorithms for rendering: rasterization and ray tracing. Rasterization is a “forward” rendering approach that renders the triangles in a scene one by one (conceptually) and each time updates all pixels covered by that triangle. Ray tracing, on the other hand, traces rays for each pixel to find out which triangle is visible for that pixel.
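
The difference in loop order can be sketched in a few lines. This is only a toy illustration, not code from any real renderer: axis-aligned rectangles stand in for triangles, every "ray" is just a pixel coordinate, and all names (`SCENE`, `rasterize`, `ray_trace`, `covers`) are made up for the example.

```python
# Toy 2D "scene": axis-aligned rectangles standing in for triangles.
# Each primitive is (x0, y0, x1, y1, depth, ident); smaller depth is closer.
# All names here are illustrative assumptions, not from a real renderer.

WIDTH, HEIGHT = 8, 6
BACKGROUND = 0

SCENE = [
    (0, 0, 5, 4, 2.0, 1),
    (3, 2, 8, 6, 1.0, 2),
    (1, 1, 3, 3, 3.0, 3),
]


def covers(prim, x, y):
    """Return True if pixel (x, y) lies inside the primitive."""
    x0, y0, x1, y1, _, _ = prim
    return x0 <= x < x1 and y0 <= y < y1


def rasterize(scene):
    """Outer loop over primitives; a depth buffer resolves visibility."""
    depth = [[float("inf")] * WIDTH for _ in range(HEIGHT)]
    image = [[BACKGROUND] * WIDTH for _ in range(HEIGHT)]
    for x0, y0, x1, y1, z, ident in scene:
        # Visit only the pixels this primitive covers.
        for y in range(max(y0, 0), min(y1, HEIGHT)):
            for x in range(max(x0, 0), min(x1, WIDTH)):
                if z < depth[y][x]:
                    depth[y][x] = z
                    image[y][x] = ident
    return image


def ray_trace(scene):
    """Outer loop over pixels; each pixel searches for its nearest hit."""
    image = [[BACKGROUND] * WIDTH for _ in range(HEIGHT)]
    for y in range(HEIGHT):
        for x in range(WIDTH):
            nearest = float("inf")
            for prim in scene:
                if covers(prim, x, y) and prim[4] < nearest:
                    nearest = prim[4]
                    image[y][x] = prim[5]
    return image
```

Both functions produce the same image for this scene; only the nesting of the loops differs. A practical ray tracer would replace the linear scan over primitives with a spatial index structure, which this sketch leaves out.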

While the two seem very different, there are intermediate versions, such as “tile-based rasterization” or “frustum tracing”, where only the triangles covering a part of the screen are rasterized or all rays of a tile are traced together, respectively. If you make these tiles smaller, eventually a single pixel in size, rasterization starts to look much like ray tracing, and vice versa as you trace larger and larger frusta. However, we still do not have algorithms that really cover the whole range of options well.

Additionally, we are seeing an increased need for advanced (programmable) shading and lighting effects (e.g. global illumination for smooth indirect illumination). While they are sitting on top of the core renderer, they impose many requirements that will determine what rendering approach can or cannot be used.

On the HW (hardware) side: GPUs started as HW dedicated to rasterization. But they have long since evolved to become essentially large parallel compute engines (with some dedicated HW still thrown in for rasterization). On the other hand, CPUs have become much better at parallel workloads with multi-core, multi-threading, and more and increasingly wide SIMD compute units (MMX, SSE, AVX, and the now upcoming AVX-512).

In some sense the two general HW designs are converging towards a similar sweet spot, but coming from two very different starting points.

Rasterization is more dependent on good HW support to be fast and so runs best on GPUs. Ray tracing can be implemented with very good performance on either HW architecture, probably with some advantage for CPU-like architectures if you look at the latest comparison in the Embree paper at Siggraph 2014 (which may be a bit biased).

– For a developer working on a renderer, what is the technical trade-off between writing for a CPU and writing for a GPU?

Let me focus on writing a full ray tracing renderer using the raw HW. Writing a “renderer” that makes good use of OpenGL/DirectX is a very different story.

First of all writing a really good renderer is a big challenge — but also a lot of fun, independent of the choice of rendering algorithm or HW. It reaches across many levels from highly optimized inner loops all the way up to data management of often complex scenes. It thus touches many aspects of HW and SW and is a perfect poster child for how to best design efficient code for certain HW architectures.

An obvious key element is strongly optimizing the inner loops of a renderer, as they determine the upper limit of the achievable rendering performance. This is where a lot of effort and research has been, and still is, spent. Today, getting the best performance usually still means hand-optimizing the code (at the assembly or intrinsics level) for specific architectures and different choices of renderer configurations. While this often pays off, it requires expert knowledge, is very tedious, and the code often needs to be completely rewritten as the HW or the requirements change.

In addition, there are also many interdependencies between choices made at this low level and the various levels of algorithms above it. This means that one needs to keep the code flexible in order for it to adapt to these requirements. However, with current technology, flexibility conflicts strongly with achieving optimal performance.

As a result of all this, almost all renderers have been targeting only one HW architecture, GPUs or CPUs, and even within such an architecture, algorithms might have to be configured very differently on different instances. While cross-platform renderers have been developed, e.g. in OpenCL, OpenCL is still a very low-level language that is not well suited for all HW architectures, thus limiting what can be achieved.

In contrast, the hardware vendors each support renderer development with their HW-specific frameworks: Nvidia’s OptiX is focused essentially only on GPUs, while Intel’s Embree targets CPU-like architectures (including their MIC/Xeon Phi processors). One interesting note is that both tools (and others) had to develop their own compiler technology to build these frameworks. While both of these frameworks are very good choices, they are black boxes that each come with their own set of drawbacks and limitations.

Already in 2008 we showed that, with the right language/compiler tools, high optimization of low-level code does not necessarily limit the flexibility of the overall design. Our RTfact system [HPG 2008] used C++ template metaprogramming to specialize generic code from across several levels of abstraction at compile time. With this framework, we were able to easily configure very different renderers based on the same generic code and still achieve performance within about 10% of our previous hand-optimized code in each case, which was really remarkable. Unfortunately, writing the core code via C++ template metaprogramming is very difficult and results in “write-only code” that is hard to maintain. But it showed that performance and flexibility are NOT mutually exclusive!

Since then we have teamed up with one of the best compiler research groups here at Saarland University (Prof. Sebastian Hack) and have jointly developed “AnyDSL”: It picks up the original idea of RTfact, generalizes it beyond just rendering, and uses much improved compiler technology to implement automatic specialization of generic code. AnyDSL provides a completely new programming model that offers developers a way to formulate hierarchies of conceptual abstractions while still allowing the compiler to eliminate any overhead usually associated with such abstractions (e.g. virtual function calls in C++).

This finally allows for writing completely generic code at each level of abstraction, specifying how that code should make use of abstractions at the lower levels. Finally, at the lowest level, code that has been coming from the different levels is combined and aggressively specialized using novel compiler algorithms to essentially interpret and partially execute any code that is known at compile time. This is so effective because at this low level a lot of information that we abstracted from at the high levels is now available, such as information about the HW and the context in which some generic high-level code should be executed, which can be taken into account by the compiler. This approach combines the flexibility of hierarchical abstractions with the ability to generate optimized code that rivals and often actually exceeds the performance of hand-optimized code.
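
To make the idea of specialization concrete, here is a toy sketch. It is not AnyDSL or RTfact code and does not reflect their syntax or internals; it only mimics the principle by generating Python source for a 1D stencil once the filter weights are known. The names `convolve_generic` and `specialize_convolve` are hypothetical, and an odd number of weights is assumed.

```python
def convolve_generic(signal, weights):
    """Generic 1D stencil: works for any weights, loops over them per sample."""
    r = len(weights) // 2
    out = []
    for i in range(r, len(signal) - r):
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w * signal[i + k - r]
        out.append(acc)
    return out


def specialize_convolve(weights):
    """Generate a function with the weights baked in.

    The inner loop over weights is unrolled and taps whose weight is zero
    are dropped, imitating what a partial evaluator could do when the
    weights are known ahead of time. Returns (function, generated source).
    """
    r = len(weights) // 2
    terms = []
    for k, w in enumerate(weights):
        if w == 0:
            continue  # a zero tap contributes nothing, so omit it entirely
        terms.append(f"{w!r} * s[i + {k - r}]")
    body = " + ".join(terms) if terms else "0.0"
    source = (
        "def convolve_specialized(s):\n"
        f"    return [{body} for i in range({r}, len(s) - {r})]\n"
    )
    namespace = {}
    exec(source, namespace)  # compile the generated, specialized code
    return namespace["convolve_specialized"], source
```

Calling `specialize_convolve([-1.0, 0.0, 1.0])` yields a function whose generated body reads only the two non-zero taps, while `convolve_generic` keeps the single generic definition. In a compiled setting the same principle removes abstraction overhead without a separate hand-written variant per configuration.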

For example, our tests in image processing show that this approach often beats hand-optimized code, because that code is so tedious and difficult to write that not all promising optimization options are actually explored. Since we simply use the generic code plus some rather simple mapping code for each specific HW, we can explore a much larger space of possible optimizations.

By now, we have already written a full ray tracer in AnyDSL that compiles and runs efficiently both on various different GPUs and CPUs. However, not all optimizations have been applied yet, so that we cannot really compare its performance yet. This will happen in the next few months but the partial results we have so far are already very promising.

– Corona Renderer, for example, uses Intel’s Embree ray tracing library for the CPU. What do you think of this solution?

As I argue above, these tools are very useful. They provide hand-optimized kernels that solve some of the core problems, particularly the inner loops, and can thus be used as the basis for the higher layers, just as Corona did. However, they are essentially black boxes, which makes it harder to adapt them to different purposes.

I believe that we need to go beyond these individual optimized kernels and create generic, adaptable, and reusable building blocks for rendering that can be combined in flexible ways to create optimized renderers for many configurations and HW platforms. So far, we have been missing the right tools to even think about how this could be done.

With AnyDSL we now have a tool that for the first time allows us to address this challenge. Now we are investigating how to design such a flexible rendering framework: What are the right abstractions to be used at the different levels? How do we best map them to HW platforms? Which algorithms work best in what context and how do we need to formulate them, so they can easily be combined? How do we determine the data layout to optimize for different algorithms using it? And so on.

Essentially, AnyDSL allows us to rethink the design of renderers in a fundamentally new way.

– In a GameStar interview you mentioned you were working on FPGAs. FPGAs are a quickly growing field for those who need to optimize hardware for specific use cases. A GPU is by definition hardware optimized for CG, yet you work on FPGA hardware for ray tracing. Should we infer that, according to you, GPUs are not evolving in the right direction right now?

Yes, this interview took place in 2004 at the height of the ray tracing versus rasterization discussion after real-time ray tracing became widely available. At the time, David was defending rasterization strongly, but the situation has changed dramatically since then. By now, Nvidia has embraced ray tracing fully, e.g. through OptiX and the Mental-Ray products they acquired. They have also significantly optimized their HW for ray tracing (which I was fortunate enough to contribute to in 2007/08 as a visiting professor at David’s invitation).

Regarding FPGAs: We actually published our FPGA design (the RPU architecture) at Siggraph 2005 and various other conferences. At the time it was the first full ray tracing HW architecture that included everything: intersection computations, scene traversal, fully programmable shading, construction of spatial index structures, and much more. We even did extensive evaluations of how it could be mapped to ASICs and the results were very promising.

This was later picked up by industry, including Samsung, Imagination, and others. Imagination now offers a HW ray tracing engine as an add-on to their GPU SoC designs. Most interestingly, they argue (convincingly!) that with the tremendous power cost of memory access, ray tracing will likely have significant advantages over rasterization on mobile platforms, particularly for advanced rendering effects. This is something we had already thought about a lot in 2005, but back then it was very difficult to get hard numbers on the power efficiency of GPU designs for comparison, something Imagination does of course have.

Even more interesting has been research published at HPG 2014 this year, where some very innovative algorithmic changes were combined with rather small changes to a GPU architecture to achieve amazing ray tracing performance (in simulations).

These are very promising results and it will be very interesting to see how this field will develop over the next few years. It seems that ray tracing is very strong already (being partially used in many games already), but the best time for ray tracing may be still to come.

– For a software developer who would like to get into FPGAs, what kind of skills are needed, and where do you draw the line between the computations that are handled by hardware and those handled by software?

FPGAs are definitely a very interesting HW architecture because they are so flexible. But they also come with significant cost, in terms of low clock rate, limited floating point support, and also financial cost (for large designs). I am skeptical they will be able to provide the performance at the power envelope we are looking for today. But they are absolutely great for trying out new designs. With some of the latest ideas I would not be surprised to see some great new developments in this context.

– FPGAs are right now mostly in the R&D stage, and designing production-ready specific hardware is still a massive resource commitment. Are your prototypes viable for real-world use, and do you think FPGAs and the needed knowledge can become commoditized?

We are definitely looking into that right now. For our RPU design in 2005, Sven Woop (then a PhD student in my lab, today one of the researchers behind Embree) first designed a functional programming language for expressing HW designs at a high level (today we would say a Domain Specific Language, DSL). This allowed him to design the HW in an extremely efficient way: It took him about 6 months to produce a working prototype of the RPU, including the time it took him to design the DSL in the first place!

Today we have AnyDSL, which already is a functional programming language that we use to express complex SW designs. An obvious question is what changes would be required to enable AnyDSL to also express HW aspects of our algorithms, including the mapping to low-level HW building blocks like various functional blocks, different memories, control logic, and so on. This is a big challenge in general but one that might be able to change the way we think about HW and SW, as well as the interface between the two.

– Will we see one day for example fluid simulation processing units?

The algorithms are pretty regular and can likely benefit from direct HW support. But it depends on two things: First, on how easy and efficient we can make it for developers to even design such HW. An extended version of AnyDSL or similar new tools could probably help here.

Second, what is financially viable for a given market is mainly a question of economy of scale: Pure SW solutions are always possible, but an FPGA approach or even an ASIC depends on the return on investment. Allowing an easier and cheaper way to develop an initial software solution and then incrementally evaluate different HW designs without having to start from scratch would definitely be a big step forward.

While I can see FPGA solutions for this (and there have been designs already), I do not think the market is large enough for custom hardware.

– If you look at the past 20 years, what are according to you the most significant changes in the approach industry has taken to solve the rendering equation?

This is actually a very interesting question for me: More or less exactly twenty years ago I finished my PhD thesis at Erlangen University about the “Vision” architecture: A comprehensive SW framework for solving the rendering equation, offering most of the available approaches at the time. It included ray tracing and rasterization as well as some of the latest global illumination techniques that we had developed. We actually were one of the first to combine full RenderMan compatibility with global illumination computations, including novel Monte-Carlo techniques.

Interestingly, back then almost the entire movie industry insisted that Monte-Carlo would never be used in film because of the inherent noise and because it is physically correct by default, which, they thought, would take away the artistic control they needed. This position actually persisted until only a few years ago, when we finally saw a complete switch of the entire industry to physically-based Monte-Carlo methods within a single year or so.

Today everyone makes use of global illumination using advanced Monte-Carlo techniques. Some of them were actually developed in my lab, like the “Vertex Connection and Merging (VCM)” technique by Iliyan Georgiev et al. [Siggraph Asia 2012] that has taken the industry by storm. Within a few months after its publication it had been integrated into at least two commercial renderers. And after ~30 years of using their REYES renderer, Pixar finally revealed their new architecture at Siggraph this year, which is using VCM as its core.

It is fair to say that today we have a pretty good understanding of how to solve the rendering equation in general. VCM finally integrated Photon Mapping into the general Monte-Carlo techniques in a mathematically clean and very efficient way. This solved a lot of the hard problems in an elegant way. Lighting in participating media (volumes) is also well advanced now due to recent follow-up work to VCM.
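
As a reminder of what "Monte-Carlo" means in this context, the following is a minimal sketch and is far simpler than VCM or any production technique. It estimates one small piece of the lighting integral, the cosine term integrated over the hemisphere (whose exact value is pi), by averaging random samples. The function name and the choice of uniform hemisphere sampling are assumptions made for the example.

```python
import math
import random


def estimate_cosine_integral(num_samples, seed=0):
    """Monte-Carlo estimate of the integral of cos(theta) over the hemisphere.

    Directions are sampled uniformly by solid angle, so the density is
    1 / (2 * pi) and cos(theta) itself is uniformly distributed in [0, 1).
    The exact value of the integral is pi.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pdf = 1.0 / (2.0 * math.pi)
    total = 0.0
    for _ in range(num_samples):
        cos_theta = rng.random()
        total += cos_theta / pdf  # integrand divided by sample density
    return total / num_samples
```

With few samples the estimate scatters visibly around pi, which is the kind of noise the film industry was concerned about. The standard deviation of the estimate shrinks in proportion to one over the square root of the sample count, which is why better sampling strategies matter so much in practice.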

Probably the biggest challenge today is performance. In the EU project “Dreamspace” we are working on real-time realistic rendering for film and video productions. The goal is to integrate the entire post-production into the live production process such that everyone on-set can see the final results in real-time and at least close to full quality. This requires us to develop highly scalable real-time algorithms that take optimal advantage of the available HW to fulfill the significant compute requirements needed.

One other idea that, from my point of view, should receive more attention in ray tracing-based approaches is finding shortcuts and special-case solutions. This has been at the core of much of rasterization research over the years but has been largely ignored in ray tracing (partially for good reasons :-). However, I believe that we can get a lot of performance that way. Good image space filtering techniques to eliminate residual noise in early results are one aspect of that development, and we have seen some good techniques here already. I am sure there will be more.

Some other remaining challenges are:

— Lighting via glossy surfaces: We are good at handling mostly diffuse and mostly mirror-like surfaces, but despite many attempts the intermediate range of glossy surfaces can still cause significant noise and artifacts.

— Portable material models: It is a real shame that we still do not have a general way to exchange materials between two rendering systems. With “shade.js” [Pacific Graphics 2014] we have recently taken some steps in that direction but much work remains to be done here.

— Handling of large scenes and adaptive LOD: We need to be able to scale our models across much larger ranges. It is still hard to design a system that can render very large models in real-time. As an extreme example, think of how to implement a renderer that can render a 3D model of the entire world (think OpenStreetMap-3D). We have neither a suitable format to store and transmit such content, nor ways to efficiently query for the right LOD and to manage the data dynamically as we roam around it. This would equally be applicable to rendering complete and detailed models of entire airplanes, automobiles, and other complex models. With our Blast format and 3D-Repo/Assetserver [Web3D 2014], we are looking at providing solutions here that are already being picked up by industry, but again much work remains.

— Handling measured 3D models and materials: In the future, a lot of 3D data will be coming from capturing the real world around us via smart phones and other devices. It is still unclear how such models should best be represented, shared, modified, and finally be rendered.

— Web-based interactive and realistic 3D Graphics: To a large degree 3D graphics has been limited to custom native applications, like games, CAD-, and animation systems. With XML3D [Web3D 2013] we are offering a very interesting option to integrate 3D models into HTML and the DOM directly. This allows any Web developer to interact with the 3D models in the same way as they interact with text, images, and video today. This way, HTML becomes even more of a rich media interface than it already is. This is a very compelling idea to me.

We are currently integrating XML3D with Server-Based Rendering to also enable highly realistic, real-time rendered content in any browser. I believe that this could have a huge impact and will eventually also be integrated into the client directly.

– According to you, what is the hottest field in CG research right now, and where are the next breakthroughs going to come from?

There are so many fields that it is hard to answer that in detail.

However, I believe that we will see a huge increase in data-driven graphics in the near future. The ability to capture 3D models, materials, and illumination from the real world is getting to the point where it becomes a very useful tool. OpenStreetMap in full 3D is not really that far away. This opens tremendous opportunities but also many challenges from a graphics and visual computing point of view.

I am particularly interested in what we call “Intelligent Simulated Reality” at the intersection of graphics, artificial intelligence, high-performance computing, and security: How can we bring models of the world into the computer, how can we attach semantics to those models such that the computer “understands” these models, how can we then run simulations efficiently on such models across the Internet, and how can we allow people to visualize and interact with these models to better understand the world around them, how can we enable them to better evaluate the possible outcomes of their decisions ahead of time in order to make this world a better place.

Graphics — and Visual Computing in general — play key roles here as they are the only means to make such data accessible and understandable to everyone.
