Dense SLAM with Point-based Fusion
1 Overview
In this homework you will implement a 3D dense SLAM system using point-based fusion, which can be regarded as a simplified version of [1]. There are three main steps to set up the system:
• Localization: projective ICP, a common implementation of visual odometry that is similar to [2].
• Mapping: point-based fusion, a simplified map fusion algorithm.
• SLAM: glue the components and build the entire system.
You will be testing the system on the synthetic ICL-NUIM [3] dataset. Unlike real-world data, you do not have to worry about holes and noise, which can be tedious to deal with. Please download the dataset from this link. Prior to running the system, use python preprocess.py /path/to/dataset to generate normal maps.
2 Iterative Closest Point (ICP) (50 points)
Please read Section 5 in [1] before working on the following problems. Note that many equations in [1] use unconventional parameterizations, and the notation is not easy to follow. We provide our derivation below as a reference.
As its name indicates, ICP runs iteratively. Here is the general flow:
1. Set up data correspondences between the source and target point clouds given the current pose [R_t | t_t] (usually initialized with the identity).
2. Linearize the loss function at [R_t | t_t] and solve for the incremental update [δR | δt].
3. Update by left-multiplying the incremental transformation: [R_{t+1} | t_{t+1}] = [δR | δt][R_t | t_t].
4. Go back to step 1 until convergence.
You will be working on icp.py in this section.
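To make the flow concrete, here is a minimal, hypothetical Python sketch of the loop. The three callables stand in for the data-association, linearize-and-solve, and pose-conversion steps; the names and signatures are illustrative assumptions, not the interface of the provided icp.py.

import numpy as np

def run_projective_icp(T_init, associate, linearize_and_solve, pose2transformation,
                       max_iters=10):
    # Hypothetical skeleton of the four steps listed above.
    T = np.asarray(T_init, dtype=float).copy()        # current pose [R_t | t_t] as a 4x4 matrix
    for _ in range(max_iters):
        correspondences = associate(T)                 # step 1: projective data association
        delta = linearize_and_solve(correspondences)   # step 2: 6D increment (alpha, ..., tz)
        T = pose2transformation(delta) @ T             # step 3: left-multiply the update
    return T                                           # step 4: stop after max_iters (or on convergence)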
2.1 Projective data association
Data association is critical for point cloud registration, since the overall spirit is to push corresponding points together. In conventional point cloud ICP, this is achieved by nearest neighbor search, usually relying on a KD-tree. Although there are various fast implementations, building and searching a KD-tree can still be relatively slow.
In KinectFusion, a more efficient variant called projective data association is used. Suppose the target point cloud is stored as a vertex map (where each pixel corresponds to a point in 3D); then there is naturally an indexable geometric layout. Now, given a point p, instead of searching for q in 3D space, we can, after the initial transformation, project p onto the 2D image to obtain the pixel coordinate (u, v) and directly read the corresponding point q from the vertex map. In this case, p and q are not strictly nearest neighbors in 3D, but projective nearest neighbors. In practice, this is very efficient and works quite well.
In case you are unfamiliar with the projective pinhole camera model, we have provided the project function for you.
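As a hedged sketch of how the lookup works (the function name and signature below are illustrative, assuming a 3 × 3 intrinsic matrix K and a 4 × 4 pose T; this is not the provided interface):

import numpy as np

def projective_lookup(p, T, vertex_map, normal_map, K):
    # Associate one source point p (shape (3,)) with a target pixel by projection.
    H, W = vertex_map.shape[:2]
    p_cam = T[:3, :3] @ p + T[:3, 3]                   # transform into the target camera frame
    d = p_cam[2]
    u = int(round(K[0, 0] * p_cam[0] / d + K[0, 2]))   # pinhole projection (the provided
    v = int(round(K[1, 1] * p_cam[1] / d + K[1, 2]))   # `project` helper plays this role)
    # The validity and distance checks are exactly what the questions below ask for;
    # once (u, v, d) is valid, the maps are simply indexed:
    q = vertex_map[v, u]                                # projective nearest neighbor
    n_q = normal_map[v, u]
    return q, n_q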
Question (5 points): Suppose you have projected a point p onto a vertex map and obtained the (u, v) coordinate with a depth d. The vertex map's height and width are H and W. Write down the conditions u, v, d must satisfy to set up a valid correspondence. Also implement the TODO: first filter section in the function find_projective_correspondence.
After obtaining the correspondence q from the vertex map and the corresponding normal n_q from the normal map, you will need to additionally filter the pairs by a distance threshold so that ‖p − q‖ < d_thr.
Question (5 points): Why is this step necessary? Implement this filter in the TODO: second filter section in the function find_projective_correspondence.
2.2 Linearization
KinectFusion seeks to minimize the point-to-plane error between associated point pairs (p, q), where p is from the source and q is from the target. The error function can be written as
E(R, t) = \sum_i r_i^2(R, t) = \sum_i \left( n_{q_i}^\top (R p_i + t - q_i) \right)^2,    (1)
where (p_i, q_i) is the i-th associated point pair in the association set Ω = {(p_n, q_n)}, and n_{q_i} is the estimated normal at q_i (we have already covered the data association step, i.e., how to associate p_i and q_i).
It is hard to directly solve for R ∈ SO(3), so we rely on the small-angle assumption and parameterize δR as
\delta R \approx \begin{bmatrix} 1 & -\gamma & \beta \\ \gamma & 1 & -\alpha \\ -\beta & \alpha & 1 \end{bmatrix},    (2)
and also introduce δt = [t_x, t_y, t_z]. Note that this parameterization is unlike [2] and is closer to the conventional Lie algebra formulation [4].
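For reference, Eq. (2) translates directly into numpy; this is a small sketch, not part of the provided code:

import numpy as np

def small_angle_rotation(alpha, beta, gamma):
    # First-order approximation of a rotation (Eq. 2), valid for small angles.
    return np.array([[1.0,   -gamma,  beta ],
                     [gamma,  1.0,   -alpha],
                     [-beta,  alpha,  1.0 ]])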
With δR and δt, we can write down the incremental-update version of the residual
, (3)
where p_i^0 = R^0 p_i + t^0; the goal is to solve for α, β, γ, t_x, t_y, t_z given the initial R^0, t^0.
Question (15 points): Now reorganize the parameters and rewrite r_i(δR, δt) in the form

r_i(\alpha, \beta, \gamma, t_x, t_y, t_z) = A_i \begin{bmatrix} \alpha \\ \beta \\ \gamma \\ t_x \\ t_y \\ t_z \end{bmatrix} + b_i,

where A_i is a 1 × 6 matrix and b_i is a scalar.
2.3 Optimization
The step above in fact performs the linearization. Now suppose we have collected n associated pairs; we move on to minimizing
E(\alpha, \beta, \gamma, t_x, t_y, t_z) = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} \left( A_i x + b_i \right)^2, \quad x = [\alpha, \beta, \gamma, t_x, t_y, t_z]^\top.    (4)
By solving the linear system you derived above, you can obtain the incremental transformation update in the tangent space. This 6D vector can be converted to a 4 × 4 matrix by the pose2transformation function we provide.
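As a hedged sketch of the solve step, assuming you have stacked the rows A_i into an (n, 6) matrix A and the scalars b_i into an (n,) vector b:

import numpy as np

def solve_increment(A, b):
    # Minimizing Eq. (4) is the least-squares problem A x = -b; x collects
    # (alpha, beta, gamma, tx, ty, tz) and can then be fed to pose2transformation.
    x, *_ = np.linalg.lstsq(A, -b, rcond=None)
    return x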
Figure 1: Point clouds before and after registration.
Now, run python icp.py /path/to/dataset. If everything works, you will find that ICP's loss converges in fewer than 10 iterations and the inlier count increases correspondingly. The two point clouds fit better, resulting in the interleaved pattern shown in Fig. 1.
Question (10 points): Report your visualization before and after ICP with the default source and target (frames 10 and 50). Then choose another, more challenging source and target pair yourself (e.g., frames 10 and 100) and report the visualization. Analyze the reasons for its failure or success.
Note: Press P to take a screenshot.
3 Point-based Fusion (40 points)
Please read Section 4 in [1] before working on the following problems.
KinectFusion uses a volumetric TSDF as the map representation, which is nontrivial to implement. Point-based fusion, on the other hand, maintains a weighted point cloud and actively merges incoming points, and is hence easy to implement.
The essential operations of point-based fusion are very similar to projective data association. Given an estimated pose, we first project the point p in the map to the image and obtain its corresponding (u,v) coordinates. Then, after proper filtering, we compute a weighted average of the properties (positions, normals, colors).
You will be working on fusion.py in this section.
3.1 Filter
Similar to projective data association, we need a transformation. In a typical SLAM system, this is obtained by accumulating ICP estimates. In this section, however, we assume the ground truth transformations are known. The transformation maps the input camera frame's coordinate system to the world's coordinate system.
With the available transformation, you need to perform filtering similar to projective data association. The only difference is that you will need to add a normal angle constraint to be more strict with filtering.
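The angle constraint is commonly implemented through the dot product of unit normals; a minimal sketch (the 30-degree threshold is an assumption, not the assignment's value):

import numpy as np

def normal_angle_mask(map_normals, frame_normals, max_angle_deg=30.0):
    # Keep associations whose map and frame normals (unit length, shape (n, 3))
    # differ by less than max_angle_deg.
    cos_angle = np.sum(map_normals * frame_normals, axis=-1)
    return cos_angle > np.cos(np.deg2rad(max_angle_deg))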
Question (5 points): Implement filter_pass1 and filter_pass2 to obtain mask arrays before merging and adding input points.
3.2 Merge
The merge operation updates existing points in the map by calculating a weighted average on the desired properties.
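As a generic sketch of this running weighted-average pattern (variable names are illustrative, q is assumed to be already transformed into the map coordinate system, and fusion.py's actual arrays may differ):

import numpy as np

def weighted_merge(p_map, w_map, q_obs):
    # The map point carries weight w_map; the new observation carries weight 1.
    # Merged normals follow the same pattern but must be re-normalized afterwards.
    p_new = (w_map * p_map + q_obs) / (w_map + 1.0)
    w_new = w_map + 1.0
    return p_new, w_new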
Figure 2: Fusion with ground truth poses with a normal map.
Question (15 points): Given p ∈ R^3 in the map coordinate system with a weight w and its corresponding point q ∈ R^3 (read from the vertex map) in the frame's coordinate system with a weight of 1, write down the weighted average of the positions in terms of p, q, and w. Similarly, write down the weighted average of the normals in terms of the corresponding normals and w. Implement the corresponding part in merge. Note that a normalization of the normals is required after the weighted average.
3.3 Addition
Question (10 points): Implement the corresponding part in add. You will need to select the unassociated points and concatenate the properties to the existing map.
3.4 Results
Now run python fusion.py /path/to/dataset. The system will take the ground truth poses and produce the resulting reconstruction after fusing 200 frames; see Fig. 2.
Question (10 points): Report your visualization with a normal map, and the final number of points in the map. Estimate the compression ratio by comparing the number of points you obtain against the number of points you would get by naively concatenating all the input. Note that, to speed things up, we use a downsample factor of 2 for all input. Note: Press N to visualize the normal map.
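As a hedged sanity check for the naive count (assuming the standard 640 × 480 ICL-NUIM resolution): with a downsample factor of 2, each frame contributes 320 × 240 = 76,800 points, so naively concatenating 200 frames gives 200 × 76,800 = 15,360,000 points; the compression ratio is this count divided by your final map size.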
4 The dense SLAM system (20 points + 10 points)
Now we put both the ICP and fusion code together in main.py. For your convenience, we have already wired up calls to the relevant functions you implemented.
Question (5 points): For ICP between the map and the input RGB-D frame, which is the source and which is the target? Can we swap their roles? Why or why not?
Run python main.py /path/to/dataset. By default, 200 frames will be tracked and mapped.
Question (15 points): Report your visualization. Also, record the estimated trajectory and compare it against the ground truth. Drift is acceptable in this case, to the extent shown in Fig. 3.
Figure 3: Fusion with poses estimated from frame-to-model ICP.
Bonus (10 points): Can you try to reduce the drift? There are several possible methods, from improving the RGB-D odometry to improving the fusion by rejecting outliers and eliminating dormant points (i.e., points not associated with any input for a certain period). Report your updated visualization and trajectories.
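One hypothetical sketch of the dormant-point idea (the per-point last_seen_frame attribute and the 30-frame threshold are assumptions, not part of the provided fusion.py):

import numpy as np

def remove_dormant(points, normals, weights, last_seen_frame, current_frame, max_idle=30):
    # Drop map points that have not been associated with any input for more
    # than max_idle frames.
    keep = (current_frame - last_seen_frame) <= max_idle
    return points[keep], normals[keep], weights[keep], last_seen_frame[keep]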
5 Code Submission Rules
Upload all of your Python code in a folder. Include all your visualizations in your PDF file. Do not upload the dataset when you submit.
References
[1] Keller, Maik, et al. “Real-time 3D reconstruction in dynamic scenes using point-based fusion.” International Conference on 3D vision (3DV), 2013. Link: http://ieeexplore.ieee.org/document/6599048/.
[2] Newcombe, Richard A., et al. “KinectFusion: Real-time dense surface mapping and tracking.” 10th IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2011. Link: http://ieeexplore.ieee.org/document/6162880/.
[3] Handa, Ankur, et al. “A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM.” IEEE International Conference on Robotics and Automation (ICRA), 2014. Website: https://www.doc.ic.ac.uk/~ahanda/VaFRIC/iclnuim.html.