We all have to arrange and rearrange things on a daily basis. We arrange things in our personal lives—our sock drawers, puzzles, furniture, or a bouquet of flowers—and in our professional lives—Photoshop layers, audio tracks, the events in a project timeline, or the distribution of our 401(k)s.
Today, I was arranging point clouds.
The kind of cloud I’m dealing with consists of thousands of points in 3D. It looks a little something like this:
Software inside the Kinect figures out how far away objects in its field of view are based on the distance between the dots (the math is more complicated than I make it sound).
Point clouds have been around a lot longer than the Kinect: 3D scanning applications have long been used to build models for manufacturing and quality assurance, to digitize real-life objects for use in movie special effects, and in the field of medical imaging, among other uses. However, the devices used to perform these 3D scans have been exorbitantly expensive.
Here at Second Story, we’ve been excited by the potential of such an inexpensive sensing device ($149.99 at the time of this writing). It wasn’t long before a project came along where we wanted to map a large space. This space would require three Kinects to cover it. The two questions that immediately came to mind were, “Can we run multiple Kinects on one computer?” and “How do we align the multiple point clouds?”
That blurry mess is three superimposed point clouds, two from Kinects and one from an Asus Xtion. The Asus Xtion is a device that has some of the same 3D sensing capabilities that the Kinect does, primarily the ability to generate a point cloud out of the scene it is viewing. It costs a little more, weighing in at $189.00.
Each individual point cloud is represented by a different color. In the foreground you may be able to see the red point cloud that shows my face, and although it’s difficult to tell at this angle, there are corresponding yellow and green head blobs.
The reason they are not in the same place is that the cameras are in different locations and know nothing about one another. It’s similar to watching a movie of a scene shot from two different angles.
The next step is to ask, “How do we align these point clouds in 3D space?” In theory, if we knew precisely how each camera sees (its field of view, its position in space, and its direction), we could use those details to align the point clouds. In practice, it doesn’t work: the alignment would only be as good as your measurements and your assumptions about how the camera sees. Be just a little bit off and you end up with a poorly aligned, disappointing set of point clouds. Furthermore, if someone accidentally nudges your Kinect, it will send you into a screaming, hair-pulling frenzy.
One idea for how to align point clouds is to use an object with known dimensions to calibrate the cameras. Place a cube in the cameras’ fields of view, and based on your knowledge of the shape of a cube you should be able to figure out how to align the point clouds. The disadvantages are that you always have to have a cube with you (possibly a very specific cube), you have to carefully control the elements in the scene, and you have to be able to extract the cube data from the scene. It turns out that separating an object from the rest of the scene (a process called segmentation) is not a trivial task.
So what is the solution? Well, the first thing to do is call it by its correct name. The process of aligning point clouds to one another is called registration.
The kind of registration we need to perform is a registration that is based on rigid transformations, which means that the objects in a scene aren’t deformed from one point cloud to another. What one camera sees is just a moved and rotated view of what another camera sees. Nothing is squashed or bent or extruded.
The registration algorithms make a guess as to which points are the “same” point from one point cloud to another and then the point clouds are registered. For example, the point at the tip of my nose is the same point no matter what camera is looking at it. If an algorithm can somehow detect that that the points on my face are the same points even though they’re from different cameras, then we can move and rotate one of the point clouds until it “fits” with the other point cloud.
The library I used to register the point clouds from multiple Kinects is the Point Cloud Library (PCL). PCL is a C++ programming library developed specifically for doing things with point clouds. These things include, but are not limited to:
Filtering – getting rid of noisy data, downsampling
Segmentation – breaking up a scene into different components
Registration – stitching various point clouds together
Feature Detection – describing point clouds for use in comparisons and object detection
Surface Reconstruction – creating hulls and meshes from points
Visualization – viewing and exploring point clouds
IO – loading and saving point clouds from files and devices (such as the Kinect)
Before I get into anything technical, check out the video I made of the registration of three point clouds captured from Kinects. You’ll get the general gist of what’s going on. I’ll wait.
Okay, now I’ll quickly outline the process of registration without getting into too much of the nitty-gritty.
The first step is to downsample the point clouds. Downsampling reduces the number of points in a cloud while still representing the object well enough, much like converting a movie from HD to standard definition: you give up some information in exchange for having less data. Fewer points means quicker registration calculations. I used PCL’s VoxelGrid functionality to downsample the point clouds, which also guarantees that there aren’t multiple points in the exact same spot (some algorithms really hate points in the same spot). To get rid of even more points, I filtered out everything past a certain distance from the camera, so that we could focus on aligning points from an object in the foreground. Based purely on empirical observation, this worked better than including the entire scene.
After the three point clouds were downsampled and filtered, they were ready for registration. The point clouds were registered sequentially. I registered cloud 2 against cloud 1, and cloud 3 against cloud 2.
The methodology I decided to use was to align the point clouds using a sample consensus initial alignment. Sample consensus methods are iterative: at each iteration, points from one cloud are tested against a model (in this case, another cloud) to decide whether they are “inliers,” i.e., whether they are similar enough to probably be the same point. There are two takeaways from this. First, the more points two cameras share, the more likely you are to get a quick and decent registration. Second, since it’s an iterative method, the longer you’re willing to wait, the better a match you’ll get.
Finally, the point clouds aren’t actually compared directly. Rather, a set of feature descriptors is derived from each point cloud and those feature descriptors are compared against each other. It’s useless to say that Donny (me) is standing 3 meters away at an angle of 90 degrees in the picture from one camera and 5.5 meters at an angle of 20 degrees in the picture from the other camera, especially when we have no idea where the cameras are in relation to each other. Instead, if we describe Donny, and say he is 6 feet tall, has dark brown hair, a cleft chin, and is wearing blue jeans and a gray t-shirt, it would be easier to identify and match him between the two cameras. The above description of me is equivalent to the feature descriptor for a point. It’s these feature descriptors that we use in our sample consensus comparisons to determine the aforementioned “inliers.”
The ability to perform accurate sensing in a large space opens the door to many possibilities. Imagine a hall of screens, each reacting as you move toward it; or a projection of an immersive environment a visitor can walk through and interact with; or strips of LEDs in the floor that trace out where you’ve been walking, letting you become your own life-sized paintbrush. Interactive installations like these have been possible before, but not at such an accessible level. The combination of the drop in price for advanced sensing technology, the wealth of software that has been developed for it, and our insatiable desire to push the envelope makes this an exciting time to be a developer at Second Story.