Markerless Augmented Reality

Greg Eddington

Project Goals

  • The ability to augment reality by placing a virtual image on top of a camera frame and having it mimic real-world transformations.
  • Markerless tracking, so that any flat surface can serve as the tracking target.
  • Use of a single camera, allowing augmentation with non-stereo devices such as webcams and phone cameras.
  • Minimal stored state, to reduce the memory utilization of the augmentation program.
  • Written in C/C++ to allow real-time execution in the future.

Description

The algorithm presented in this project rests on the fact that if three points on an affine surface can be found and matched between two frames of a video, then an affine transformation can be computed which maps those points from their positions in the first frame to their positions in the second, using the equation below:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}$$

Here (x, y) is a point in the first frame and (x', y') is its matched position in the second; the three point correspondences supply the six equations needed to solve for the six unknowns a11, a12, a21, a22, tx, and ty.

This transformation can then be applied to the vertices used to draw a virtual image, making the image appear to move in the same manner as the surface in the video. This makes it possible to augment the video with the virtual image in a realistic way, because the virtual image moves through the scene as a real object would.
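
For illustration, the six affine parameters can be recovered from the three correspondences by solving the six-equation linear system directly. The sketch below uses Armadillo, one of the libraries the project links against; the function name and structure are my own assumptions for illustration, not the project's actual code.

    #include <armadillo>

    // Solve for the affine transform mapping three source points to three
    // destination points. Each correspondence (x, y) -> (x', y') contributes
    // two equations:
    //   x' = a11*x + a12*y + tx
    //   y' = a21*x + a22*y + ty
    arma::mat solve_affine(const double src[3][2], const double dst[3][2])
    {
        arma::mat A(6, 6, arma::fill::zeros);
        arma::vec b(6);

        for (int i = 0; i < 3; ++i) {
            A(2*i, 0) = src[i][0];      // coefficient of a11
            A(2*i, 1) = src[i][1];      // coefficient of a12
            A(2*i, 2) = 1.0;            // coefficient of tx
            b(2*i)    = dst[i][0];

            A(2*i + 1, 3) = src[i][0];  // coefficient of a21
            A(2*i + 1, 4) = src[i][1];  // coefficient of a22
            A(2*i + 1, 5) = 1.0;        // coefficient of ty
            b(2*i + 1)    = dst[i][1];
        }

        arma::vec x = arma::solve(A, b);  // singular if the points are collinear

        // Pack the six parameters into a 3x3 homogeneous matrix.
        arma::mat T(3, 3, arma::fill::eye);
        T(0, 0) = x(0); T(0, 1) = x(1); T(0, 2) = x(2);
        T(1, 0) = x(3); T(1, 1) = x(4); T(1, 2) = x(5);
        return T;
    }

Note that if the three points are collinear the system is singular, which is one reason the choice of keypoint pairs matters.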

In order to match three points from an affine surface in one frame to the same affine surface in another, two identification processes must first occur. The first is to define the region of the affine surface and limit the search for points to that surface alone. The second is to search that region for points which are unique and distinguishable, so that they can be matched across sequential frames of the video. This project used two existing algorithms to perform these functions: MSER to find the affine surface, and SIFT to find keypoints within it.

Maximally Stable Extremal Regions (MSER)
MSER searches an image and finds regions of maximum stability.

Scale-Invariant Feature Transform (SIFT)
SIFT searches an image and finds unique and distinguishable keypoints.

MSER sorts all gray-scale pixels in an image by intensity, then iterates through ascending intensity thresholds to produce binary black-and-white images, with the first image being all white and the last all black. It finds the blobs produced during this thresholding process and marks blobs which maintain their shape across a certain number of thresholds as stable regions. Each stable blob is then abstracted as an ellipse, which is used to constrain the search space.
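
The thresholding sweep at the core of MSER can be sketched in a few lines. The following is a minimal illustration, assuming an 8-bit gray-scale image stored as a flat array; the connected-component tracking and ellipse fitting that a full implementation (such as VLFeat's) performs are only indicated by comments.

    #include <cstdint>
    #include <vector>

    // Sweep every gray level as a threshold, producing the sequence of
    // binary images that MSER analyzes: all white at t = 0, all black at
    // t = 256. Real MSER tracks each blob's area across consecutive
    // thresholds and marks blobs whose area stays nearly constant over
    // many levels as maximally stable regions.
    void threshold_sweep(const std::vector<std::uint8_t>& gray, int w, int h)
    {
        for (int t = 0; t <= 256; ++t) {
            std::vector<std::uint8_t> binary(gray.size());
            for (int i = 0; i < w * h; ++i)
                binary[i] = (gray[i] < t) ? 0 : 255;  // pixels below t go black
            // ... find connected components of `binary`, compare their areas
            // with the components from previous thresholds, record those that
            // remain stable, and fit an ellipse to each stable region ...
        }
    }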

SIFT begins by taking a frame and creating several blurred versions of it using a Gaussian blur, to make the algorithm invariant to scale. Difference of Gaussians (DoG) images are then created by subtracting adjacent images in the Gaussian blurring process, approximating a Laplacian of Gaussian. Extrema are then found by iterating through all pixels and looking for pixels which are the maximum or minimum of their 26 adjacent pixels (8 neighbors in the current DoG, 9 in the DoG produced by the previous level of blurring, and 9 in the DoG produced by the next level of blurring). These pixels become candidate keypoints and are further analyzed to filter out noisy ones. The contrast of each keypoint with its neighbors is analyzed, and low-contrast points are removed. Poorly localized points are then filtered out by removing points which have a high principal curvature in one direction but not the other, suggesting the point lies along an edge. Finally, a 128-element descriptor key is calculated for each remaining point from histograms of the gradient orientations of its neighboring pixels.
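
The 26-neighbor extremum test can be written out directly. A minimal sketch, assuming the three adjacent DoG levels are stored as flat float arrays of the same width and that (x, y) is not on the image border:

    #include <vector>

    // Test whether the pixel at (x, y) in the middle DoG level is a local
    // extremum of its 26 neighbors: 8 in the same level plus 9 each in the
    // levels produced by the previous and next amounts of blurring.
    bool is_dog_extremum(const std::vector<float>& below,
                         const std::vector<float>& mid,
                         const std::vector<float>& above,
                         int w, int x, int y)
    {
        float v = mid[y * w + x];
        bool is_max = true, is_min = true;

        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                int idx = (y + dy) * w + (x + dx);
                // The adjacent scales contribute all 9 pixels each.
                if (below[idx] >= v || above[idx] >= v) is_max = false;
                if (below[idx] <= v || above[idx] <= v) is_min = false;
                // The same scale contributes only the 8 surrounding pixels.
                if (dx != 0 || dy != 0) {
                    if (mid[idx] >= v) is_max = false;
                    if (mid[idx] <= v) is_min = false;
                }
            }
        }
        return is_max || is_min;
    }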

Implementation

A high-level flow of the algorithm is shown below.

[Figure: high-level flow of the algorithm]

The keyframe, relative to which augmentation is performed, is the first frame of the input. The user can select an MSER ellipse to use as the augmentation surface. Before augmentation occurs, every frame is loaded into memory and the MSER and SIFT algorithms are run on every frame. This is not strictly necessary, as only the initial keyframe requires the MSER algorithm, but being able to visualize the MSER ellipses across all frames proved useful when implementing the algorithm. In a real-time system, MSER should only be performed until a keyframe is selected. Similarly, only the keyframe and the frame currently being augmented are needed, but storing every frame was useful for analysis. Once the keyframe is selected, a SIFT search is performed on it to find all keypoints within the selected MSER ellipse.

For every subsequent frame, a SIFT search is performed to find all keypoints within the image. Using the 128-element key, each keypoint is compared to each keypoint within the MSER ellipse of the keyframe by calculating the distance between the two keys. Pairs whose distance falls below a defined threshold are considered matching; all others are considered unmatched. The three best-matching keypoint pairs are chosen for computing the affine transformation matrix. This transformation is stored for the given frame, and the process is repeated for every frame in the sequence.
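
A minimal sketch of this matching step, assuming descriptors are stored as 128-element float arrays; the brute-force comparison and plain Euclidean distance follow the description above, while the names and the threshold parameter are illustrative:

    #include <algorithm>
    #include <array>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Descriptor = std::array<float, 128>;

    // A match between a keyframe keypoint and a current-frame keypoint,
    // with the Euclidean distance between their descriptor keys.
    struct Match { std::size_t key_idx, cur_idx; float dist; };

    // Compare every current-frame descriptor against every keyframe
    // descriptor, keep pairs whose distance falls below the threshold,
    // and return the three closest pairs for the affine computation.
    std::vector<Match> best_three_matches(const std::vector<Descriptor>& keyframe,
                                          const std::vector<Descriptor>& current,
                                          float threshold)
    {
        std::vector<Match> matches;
        for (std::size_t i = 0; i < keyframe.size(); ++i) {
            for (std::size_t j = 0; j < current.size(); ++j) {
                float d2 = 0.f;
                for (int k = 0; k < 128; ++k) {
                    float diff = keyframe[i][k] - current[j][k];
                    d2 += diff * diff;
                }
                float d = std::sqrt(d2);
                if (d < threshold)
                    matches.push_back({i, j, d});
            }
        }
        // Sort ascending by distance and keep at most three pairs.
        std::sort(matches.begin(), matches.end(),
                  [](const Match& a, const Match& b) { return a.dist < b.dist; });
        if (matches.size() > 3) matches.resize(3);
        return matches;
    }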

The user is able to cycle through all the frames of the input, and can turn on and off various view controls. The MSER ellipses can be shown as blue ellipses and the SIFT keypoints can be shown as red glyphs. The ellipses and keypoints can also be narrowed down to the selected ellipse and matching keypoints. The user can also toggle augmentation on and off. Cycling through all frames with augmentation on allows the user to view the augmented video.

Results

Translation on a Wall
A coupon was placed on a wall, and a clock was augmented into the scene on top of where the coupon was. The camera panned across the wall to verify that translation was working correctly.

[Video]
Rotating on a Wall
A coupon was placed on a wall, and a clock was augmented into the scene on top of where the coupon was. The camera was rotated to verify that rotation was working correctly.

[Video]
Tracking in a Busy Scene with Translation, Scaling, and Rotations
A coupon was placed on the side of a wall, with several different planes and objects present in the scene. A clock was augmented into the scene on top of where the coupon was. The camera was then moved through a series of motions, including panning, zooming, and rotation, to verify that the augmentation was working correctly.

[Video]
Augmentation on a Painting
A painting was used as the surface for augmentation, with a clock being augmented into the scene.

[Video]
Augmentation on a Moving Object
A video game cartridge was used as the surface for augmentation, with a video game character being augmented into the scene. Instead of moving the camera, the augmented surface itself was moved. Note: the input video has a noticeable skip at the beginning, and the duration is relatively short due to the higher resolution.

[Video]
Augmentation on a Manual
A graphics card manual is used as the surface for augmentation, and a sign which reads "Hello World" is augmented into the scene.

[Video]

Download

Source: .zip
Required Libraries: OpenGL, GLU, GLUT, VLFeat, libPNG, zlib, Armadillo, BLAS, LAPACK
Operating System: Linux

How To Use

usage: aug <input> <image> <target-ellipse-x> <target-ellipse-y> <aug-x> <aug-y>

input A .txt file which contains a list of .PNG files, each being a frame of a video capture. The filenames should be relative to the text file. Note: the number of frames should be 128 or fewer; all frames exist in memory at once, so care should be taken not to use too many.
image A .PNG filename of the virtual image to augment into the scene.
target-ellipse-x The X coordinate of a point identifying the MSER ellipse to use for SIFT keypoint matching.
target-ellipse-y The Y coordinate of a point identifying the MSER ellipse to use for SIFT keypoint matching.
aug-x The X coordinate of where to initially place the virtual image in the scene.
aug-y The Y coordinate of where to initially place the virtual image in the scene.
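
A hypothetical invocation (the file names and coordinates below are invented for illustration) would look like:

    aug frames.txt clock.png 210 152 200 140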

Controls

A Turn augmentation on or off.
M Turn MSER ellipses on or off.
E Toggle between showing only the selected ellipse and all ellipses.
S Turn SIFT keypoints on or off.
K Toggle between showing only matching keypoints or all keypoints.
- Switch to the previous frame.
+ Switch to the next frame.
Q Quit.

Future Work

I plan on continuing this work for my senior project. The main goal will be real-time augmentation, which will require two changes to the current program. The first is to change the program's input source from a series of image files to a webcam, using a library designed to continuously pull frames from the camera. The second is to profile and optimize the augmentation algorithm so that it runs at an acceptable rate for a real-time graphics application.

The second main goal of the project is to create a framework which provides a set of tools for programmers to incorporate markerless tracking into their programs. This entails making parameterization of the algorithm easier, providing a simpler interface for selecting a source to augment, and providing an API so that users can incorporate the library into their code.