Deep Learning in the Real World: How to Deal with Non-differentiable Loss Functions

Over the past few years, deep learning has taken many industries by storm. From voice recognition to image analysis and synthesis, neural networks have turned out to be very efficient at solving a vast number of problems. Concretely, if you can define your problem as the minimization of a differentiable objective function, there is a very good chance deep learning can do something for you. A neural net will do badly at first, but as the error is backpropagated from the objective function through all the layers of the network, the optimizer tweaks the weights until the error for your problem is minimized, at least locally.

The problem is: can you always frame your problem as the minimization of a differentiable function? In a lot of real-world scenarios, there is no obvious answer. Here are a few examples:

  • autonomous cars: various teams try to teach neural networks to understand a 3D scene and figure out which objects are present, and where, in 3D space. The most obvious way is to use geometric priors and “invert” the image rendering process (you basically pick a game engine and have a neural network infer the 3D scene that is compatible with what’s visible on camera). The problem is that traditional rendering software is usually not differentiable, so you can’t backpropagate the error through the game engine.


  • third-party software control: you may want to design a neural network to play a game, or teach a neural network to control an industrial robot. The neural network would issue commands and try to minimize the error. Unfortunately, there is little chance the game or the industrial robot control software is differentiable. This means that once you measure the error, you have no way to backpropagate it through the objective function (the game or control software) and teach the neural net to do better.


  • inverse problems: many physical processes are well known and accurately modelled by scientists, but the simulation software is usually not invertible. For example, scientists listen to earthquake shockwaves and try to figure out which earth mantle composition is compatible with the measured shockwave propagation. You could train a neural network to output earth mantle compositions, simulate what the resulting theoretical shockwaves would look like, and measure the error by comparing them to the actually measured shockwaves. There again, if your geophysics simulation software is not differentiable, you have no way to backpropagate the error and teach the neural net how to improve.


In all those cases, it’s not hard to find what to minimize, but the quantity of interest usually cannot be computed by a nice differentiable Python function that you can just plug into your TensorFlow or PyTorch training loop. This doesn’t mean your objective function is not differentiable in theory (although that is possible), it just means that you don’t have a gradient computation mechanism in place.

So, what are the strategies available?

  • Reinforcement learning: if your problem can be framed as training an agent to take the best actions in an environment where the consequences of those actions are perfectly known and can be easily measured, reinforcement learning may be relevant. It’s very hard to get right, but some impressive successes have been achieved.


  • Differentiable approximation: if your function is not too expensive to evaluate, you can treat it as a black box, generate large amounts of input/output pairs, and use them as a training set to train a neural network to approximate the function. Since neural networks are themselves differentiable, you can use the resulting network as a differentiable loss function (don’t forget to freeze its weights). This approach has been used, among other things, for differentiable rendering. It’s a very powerful approach, since it yields a differentiable approximation of the loss function of interest even in cases where that function is not differentiable at all.


  • Manual gradient computations: if you’re motivated and familiar with the specific problem at hand, you can design custom TensorFlow or PyTorch operators and provide the gradients yourself. TensorFlow Graphics uses this for its differentiable OpenGL renderer.


  • Automatic differentiation: if you have access to the source code of the software that computes your function of interest, you can use automatic differentiation tools (Autograd, Tangent, etc.) to make it differentiable. At the end of the day, any software only computes lots of basic operations on the CPU/GPU, that are all differentiable. Those automatic differentiation libraries usually rely on source code transformation or operator overloading to compute the gradients.
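As a minimal sketch of the differentiable-approximation strategy above: the snippet below uses PyTorch, with a toy quantizer standing in for the non-differentiable black box (the quantizer, network size, and training budget are all illustrative assumptions, not taken from any particular project). It samples input/output pairs, trains a small surrogate network, freezes it, and then differentiates through it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical non-differentiable black box: a quantizer whose gradient is
# zero almost everywhere, so backpropagating through it directly is useless.
def black_box(x):
    return torch.round(4.0 * torch.sin(x)) / 4.0

# 1. Sample input/output pairs from the black box to build a training set.
xs = torch.linspace(-3.0, 3.0, 2000).unsqueeze(1)
ys = black_box(xs)

# 2. Train a small MLP to approximate the black box.
surrogate = nn.Sequential(
    nn.Linear(1, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(surrogate.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(surrogate(xs), ys)
    loss.backward()
    opt.step()

# 3. Freeze the surrogate so its weights are not updated later.
for p in surrogate.parameters():
    p.requires_grad_(False)

# 4. The frozen surrogate is now a differentiable stand-in: gradients flow
#    through it back to whatever produced its input.
x_in = torch.tensor([[0.1]], requires_grad=True)
approx = surrogate(x_in)
approx.backward()
# x_in.grad now holds a usable gradient, unlike the raw black box.
```

In a real setup the surrogate would be trained on logged inputs/outputs of the actual software, then plugged in as the loss at the end of the main network.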

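The manual-gradient strategy above can be sketched with a custom PyTorch operator: `torch.autograd.Function` lets you pair a non-differentiable forward pass with a backward pass you define yourself. The example below uses rounding with a straight-through estimator as the hand-chosen gradient; the estimator choice is illustrative, not prescribed by the article.

```python
import torch

# "Provide the gradients yourself": the forward pass is non-differentiable
# rounding, and the backward pass supplies a manually defined gradient
# (the straight-through estimator, which passes gradients through unchanged).
class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Manually defined gradient: pretend round(x) behaves like identity.
        return grad_output

x = torch.tensor([0.4, 1.6], requires_grad=True)
y = RoundSTE.apply(x).sum()
y.backward()
print(x.grad)  # tensor([1., 1.]) — the hand-written gradient
```

A differentiable renderer does the same thing at a larger scale: the forward pass is the real (non-differentiable) computation, and the backward pass encodes the author’s knowledge of how the output should respond to input changes.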

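To illustrate the operator-overloading flavor of automatic differentiation: if a simulation is written against PyTorch tensors, every basic arithmetic operation in the loop is recorded, and the whole program becomes differentiable end to end. The toy spring simulation below is a made-up example, not real geophysics code.

```python
import torch

# A plain Python "simulation": explicit Euler steps for a frictionless spring.
# Because it operates on torch tensors, operator overloading records every
# basic operation, so the entire loop is differentiable end to end.
def simulate(stiffness, steps=100, dt=0.01):
    x = torch.tensor(1.0)  # initial position
    v = torch.tensor(0.0)  # initial velocity
    for _ in range(steps):
        a = -stiffness * x  # spring acceleration
        v = v + dt * a
        x = x + dt * v
    return x

stiffness = torch.tensor(2.0, requires_grad=True)
final_position = simulate(stiffness)
final_position.backward()
print(stiffness.grad)  # sensitivity of the final position to the stiffness
```

Source-code-transformation tools (Autograd, Tangent) achieve the same result by rewriting ordinary NumPy-style code instead of overloading tensor operations.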
I personally think these four techniques are going to fuel the deep learning revolution in the coming years, as people try to embed neural networks in more and more real-world operations and business-specific software.

Mathematics is very often about establishing theorems and results, and then finding ways to frame real-world problems so that those theorems apply. When people first started working on neural networks decades ago, a very powerful theorem was established: the universal approximation theorem, which basically states that any continuous function (so basically almost any phenomenon in nature and life) can be approximated arbitrarily well by a neural network. Deep learning scientists then stacked more and more layers of neurons to increase the power of neural networks, and trained them to do ever more impressive things, always under the assumption that the objective function to minimize has to be differentiable (otherwise backpropagation is not possible).

To many, this limitation is severe (many problems are not differentiable per se), but that is missing the point: differentiability is indeed a hypothesis on which deep learning methods are built, but it’s then up to clever engineers to frame their problems as differentiable problems. We all know planets are not perfect spheres, yet NASA and ESA have been using approximate models for decades to successfully compute gravitational interactions and land spacecraft off-world.

To conclude, never forget that theoretical assumptions and hypotheses are not limitations; they’re a framework for clever engineers to frame real-world problems and get results. As Galileo said, “The book of nature is written in the language of mathematics.”

7 thoughts on “Deep Learning in the Real World: How to Deal with Non-differentiable Loss Functions”

  1. “At the end of the day, any software only computes lots of basic operations on the CPU/GPU, that are all differentiable”

    This is an interesting thought, but isn’t the basic “JUMP” operation non-differentiable?

    Also, even if it is differentiable, that gradient might not be very useful. Consider trying to do gradient descent on something like a sawtooth shaped function.


    1. You’re right, not *all* operations are differentiable. Maybe a better reformulation is this: a lot of phenomena in nature are not differentiable, but very often the software we write to simulate them has to use numerical approximations that are themselves often differentiable. Of course there are exceptions, in which case the best option is probably to provide the gradients manually.
      You mention a sawtooth-shaped function; in computer graphics, for example, rasterization is not differentiable because of discontinuities in the triangle visibility functions. AFAIK differentiable rasterizers work around that by providing manually defined gradients.
      So, you’re right, but it’s often possible to find a way to have gradients even for shit-shaped functions 😉

