I’ve been working lately on computer vision projects involving TensorFlow for deep learning, OpenCV for computer vision and OpenGL for computer graphics. I’m especially interested in hybrid approaches, where I mix deep learning, OpenCV, and the classic OpenGL pipeline. The main idea is to avoid framing everything as a black box problem, throwing a neural network at it and hoping for the best. Instead, I try to do the maximum amount of work with proven technologies, and let deep learning work only on a well-defined subset of the problem.
This time, I was working on an augmented reality problem, where I have an image and I want to overlay stuff on it. With OpenCV, you can estimate camera parameters from images; these are called “intrinsic camera parameters”. In the OpenCV pinhole camera model, those parameters are: fx (horizontal focal length), fy (vertical focal length), cx (camera center X coord) and cy (camera center Y coord), all expressed in pixels. The camera center (cx, cy) is what OpenCV calls the principal point.
This is the OpenCV camera matrix:
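With the four parameters fx, fy, cx and cy, and assuming zero skew (the usual case), the camera matrix has this standard form:

```latex
\[
K =
\begin{pmatrix}
f_x & 0   & c_x \\
0   & f_y & c_y \\
0   & 0   & 1
\end{pmatrix}
\]
```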
You want to overlay stuff on the original image. Now that you have estimated the OpenCV camera parameters, you need to turn them into an OpenGL projection matrix, so that you can render stuff on top of the original image using the OpenGL graphics pipeline. This problem of computing the OpenGL projection matrix from the OpenCV camera matrix is NOT easy.
First of all, the OpenCV camera matrix projects vertices directly to screen coordinates (NOTE: don’t forget to then divide by the z component). The OpenGL projection matrix projects vertices to clip space. The conversion from clip space to NDC (normalized device coordinates, which means division by the w component) is handled by OpenGL, and so is the conversion from NDC to screen space. So the first problem is that we’re not looking at exactly the same transformations.
The second problem is that in OpenGL you usually assume that your camera center is at the origin, i.e. in the middle of the viewport (it’s the convention). That’s not the case in OpenCV: the camera parameters cx and cy let you have the camera center anywhere, it’s a degree of freedom like any other. Most of the OpenGL projection matrix formulas you will find on the Internet do not account for that.
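To make the difference concrete, the two chains of transformations can be written side by side. The symbols here are my own notation, not taken from either library: (X, Y, Z) is a point in OpenCV camera space, the subscript e marks OpenGL eye space, P is the OpenGL projection matrix, and W, H, x_0, y_0 are the width, height and origin passed to glViewport. Zero skew is assumed on the OpenCV side.

```latex
% OpenCV: camera-space point (X, Y, Z) to pixel coordinates (u, v)
\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = K \begin{pmatrix} X \\ Y \\ Z \end{pmatrix},
\qquad u = \frac{x}{z} = f_x \frac{X}{Z} + c_x,
\qquad v = \frac{y}{z} = f_y \frac{Y}{Z} + c_y
\]

% OpenGL: eye space -> clip space (your matrix P), then NDC and window coordinates (done by OpenGL)
\[
\begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} = P \begin{pmatrix} x_e \\ y_e \\ z_e \\ 1 \end{pmatrix},
\qquad x_{ndc} = \frac{x_c}{w_c},
\qquad y_{ndc} = \frac{y_c}{w_c}
\]
\[
x_{win} = \frac{W}{2}\,(x_{ndc} + 1) + x_0,
\qquad y_{win} = \frac{H}{2}\,(y_{ndc} + 1) + y_0
\]
```

So K does in one step (plus a division) what OpenGL splits into a matrix you provide and two fixed stages you don’t control.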
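A quick way to see how much this matters: a formula that assumes a centered camera puts the principal point at (width / 2, height / 2), so every projected point ends up shifted by a constant number of pixels compared to OpenCV. The helper below is a hypothetical illustration (the function name and the sample values are mine), not part of any library.

```python
def centered_formula_error_px(cx, cy, width, height):
    """Pixel offset (du, dv) introduced by assuming the principal point
    sits exactly at the image center instead of at (cx, cy)."""
    return (width / 2.0 - cx, height / 2.0 - cy)


# Sample intrinsics for a 640x480 image (made-up values for illustration).
print(centered_formula_error_px(300.0, 260.0, 640, 480))  # (20.0, -20.0)
```

With these sample values, an overlay rendered with a centered formula would be off by 20 pixels on each axis.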
In the end, I checked many sources:
But the one that saved my day was this one: https://strawlab.org/2011/11/05/augmented-reality-with-OpenGL
The formula there is accurate (you can replace K with the OpenCV camera matrix).
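For reference, here is the matrix I get when I work it out under one specific set of assumptions. It is derived from those assumptions rather than quoted from the linked post, so check it against your own conventions: zero skew, a viewport covering the whole image with its origin at (0, 0), OpenCV pixel coordinates with the origin at the top-left and y pointing down, an OpenGL camera looking down -z with y pointing up, and pixel-center half offsets ignored. W and H are the image width and height in pixels, n and f are the near and far clipping distances (names chosen here for illustration).

```latex
\[
P =
\begin{pmatrix}
\dfrac{2 f_x}{W} & 0 & \dfrac{W - 2 c_x}{W} & 0 \\[2ex]
0 & \dfrac{2 f_y}{H} & \dfrac{2 c_y - H}{H} & 0 \\[2ex]
0 & 0 & -\dfrac{f + n}{f - n} & -\dfrac{2 f n}{f - n} \\[2ex]
0 & 0 & -1 & 0
\end{pmatrix}
\]
```

The two terms in the third column are where cx and cy come in; they vanish when the camera center sits exactly in the middle of the image.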
Here is a source code sample to demonstrate how you can project a point with OpenCV and OpenGL and get the same results (therefore validating your matrices):
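As a minimal sketch of that check, in plain Python with no dependencies: it projects one point with the OpenCV camera matrix, then with an OpenGL projection matrix built from the same parameters, and compares the pixel coordinates. The function names, the sample intrinsics and the test point are all made up for illustration. It assumes zero skew, a viewport covering the whole image, and an OpenGL camera looking down -z with y up, so the point is flipped on y and z before being fed to the OpenGL matrix.

```python
def opencv_project(K, point):
    """Project a camera-space point with K, then divide by the z component."""
    x, y, z = (sum(K[r][c] * point[c] for c in range(3)) for r in range(3))
    return x / z, y / z


def opengl_projection_from_K(K, width, height, near, far):
    """Build an OpenGL projection matrix (row-major here) from OpenCV intrinsics.
    Assumes zero skew and a viewport covering the whole image."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [
        [2.0 * fx / width, 0.0, (width - 2.0 * cx) / width, 0.0],
        [0.0, 2.0 * fy / height, (2.0 * cy - height) / height, 0.0],
        [0.0, 0.0, -(far + near) / (far - near), -2.0 * far * near / (far - near)],
        [0.0, 0.0, -1.0, 0.0],
    ]


def opengl_project(P, point_cv, width, height):
    """Replay what OpenGL does after the projection matrix, on the CPU."""
    X, Y, Z = point_cv
    eye = [X, -Y, -Z, 1.0]  # OpenCV camera axes -> OpenGL eye axes
    clip = [sum(P[r][c] * eye[c] for c in range(4)) for r in range(4)]
    ndc_x, ndc_y = clip[0] / clip[3], clip[1] / clip[3]  # perspective divide
    x_win = (ndc_x + 1.0) * width / 2.0   # viewport transform, origin bottom-left
    y_win = (ndc_y + 1.0) * height / 2.0
    return x_win, height - y_win  # flip y to compare with OpenCV's top-left origin


# Made-up intrinsics for a 640x480 image, with an off-center camera center.
K = [[800.0, 0.0, 300.0],
     [0.0, 820.0, 260.0],
     [0.0, 0.0, 1.0]]
width, height = 640, 480
P = opengl_projection_from_K(K, width, height, near=0.1, far=100.0)

point = (0.3, -0.2, 2.5)  # in OpenCV camera space, in front of the camera
print("OpenCV:", opencv_project(K, point))
print("OpenGL:", opengl_project(P, point, width, height))
```

Both lines should print the same pixel coordinates up to floating-point rounding. In a real program, remember that OpenGL expects matrices in column-major order, so the row-major list above has to be transposed before being uploaded.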
Full source code can be found here: https://github.com/francoisruty/fruty_opencv-opengl-projection-matrix
Once you have your OpenGL projection matrix, you can render and overlay all the stuff you need on your image. I initially expected this step to take me 1 or 2 hours and it ended up taking me 6 or 7, so I thought I would share the solution.