Homework Assignment 02

Computer Science 3100
Algorithms

Closest Pair

Overview

Topics: Divide-and-conquer algorithm for finding the closest pair of points
Related Reading: Section 33.4 of CLRS
Due: 10:00am, Monday, September 26, 2016

You must adhere to the policies on academic integrity, paying particular attention to the limits on collaboration.

Required Implementation
Required Analysis
Submission Procedures
Advice
Grading Rubric
Extra Credit

Required Implementation

For this homework, you will implement the O(n lg n) divide-and-conquer algorithm for finding the closest pair among a given set of points in the plane. The algorithm is described in Section 33.4 of the text, but care will be needed to properly implement all aspects of a correct and efficient algorithm.

We will get you started by providing a framework for the project, which you may download as individual files or as a single zipfile. Our program contains a main driver that parses command-line arguments of the following form:

  driver A N [S]

where A is an integer with value 0, 1, or 2 that designates the algorithm to be used, N is the number of points in the data set, and S is an (optional) integer seed to the random-number generator, thereby allowing you to repeat trials on the same data set. Algorithm "0" is the quadratic-time brute force implementation, which we've already implemented as a reference algorithm. Algorithm "1" is the divide-and-conquer algorithm which you must implement. Algorithm "2" is an extra credit algorithm described at the end of this document.

Our driver takes care of pseudorandomly generating N points with nonnegative integer coordinates, sending the points as a parameter to the indicated algorithm, and reporting the running time of the algorithm. Our codebase already includes an implementation of a basic Point struct with fields x and y, together with useful utility functions:

distSquared(a, b), which returns the square of the distance between points a and b. Note that we rely on the square of the distance rather than the actual distance because this allows exact comparison of integer values, rather than the floating-point values that would result from the square-root computation needed for the actual distance.
compareByX(a, b), a boolean function that can be used with the standard sorting tools to sort a collection of points according to x-coordinate (and using the y-coordinate as a tie-breaker).
compareByY(a, b), a boolean function that can be used with the standard sorting tools to sort a collection of points according to y-coordinate (and using the x-coordinate as a tie-breaker).

Our codebase also includes the definition for a simple Outcome struct, which allows an algorithm to return the closest pair of points and the square of the distance between them.
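To make the description above concrete, here is a rough sketch of what such utilities might look like. The definitions in the provided framework may well differ: the 64-bit coordinate type, the Outcome field names (first, second, distSq), and the function name bruteForce are all assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the framework's types; the provided code may differ.
struct Point {
    long long x, y;   // assumed 64-bit so that squared distances cannot overflow
};

// Square of the Euclidean distance, computed exactly in integer arithmetic.
long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Strict weak orderings usable with std::sort; the other coordinate breaks ties.
bool compareByX(const Point& a, const Point& b) {
    return a.x < b.x || (a.x == b.x && a.y < b.y);
}
bool compareByY(const Point& a, const Point& b) {
    return a.y < b.y || (a.y == b.y && a.x < b.x);
}

struct Outcome {
    Point first, second;   // assumed field names
    long long distSq;      // square of the distance between first and second
};

// Quadratic-time reference: examine every pair. Requires at least two points.
Outcome bruteForce(const std::vector<Point>& data) {
    Outcome best{data[0], data[1], distSquared(data[0], data[1])};
    for (std::size_t i = 0; i < data.size(); ++i)
        for (std::size_t j = i + 1; j < data.size(); ++j) {
            long long d = distSquared(data[i], data[j]);
            if (d < best.distSq) best = {data[i], data[j], d};
        }
    return best;
}
```

Because every comparison is between integers, two runs on the same data always agree on the minimum squared distance, even when they report different tied pairs.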

Your Task

All of your code must be placed in the file closestpair.cpp. Within that file, you must implement the function

Outcome efficient(const vector<Point>& data)

which returns information about the closest pair of points from the data set. In case of a tie, you may report any pair of points that achieves the minimum distance.

While our driver relies on calling the efficient function, you are welcome to introduce any additional functions within that file, such as the one needed for the recursive divide-and-conquer algorithm. However, you should not make any changes to any files other than closestpair.cpp.

Required Analysis

In addition to your code, you must create a written document addressing several aspects of your implementation and its efficiency.

Provide a brief overview of your implementation, discussing any unusual design decisions, and disclosing any known problems with the correctness of your code.
Give a more specific explanation of the data structures that you use for representing collections of points, and how this information is passed from one recursive level to another.
Just as it is common for a merge-sort implementation to eventually revert to insertion sort for small enough data sets, you can optimize your closest-pair implementation by relying on the existing brute-force implementation once N falls below some threshold. Please explain what value you have chosen for such a threshold and what experimental data you gathered to support that decision.

Provide a table such as the following showing observed running times of your program for various choices of N. Report on both the brute-force and the divide-and-conquer solutions, but omit values once running times get beyond 20 seconds or so.

N	brute-force	divide-and-conquer
1,000
2,000
4,000
8,000
16,000
32,000
64,000
128,000
256,000
512,000
1,000,000
2,000,000
4,000,000
8,000,000
16,000,000
32,000,000

For both brute-force and divide-and-conquer, provide a brief analysis of how the theoretical asymptotic performance of the algorithm is (or is not) reflected in your observed data.
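One way to relate the table to the theory is to look at the ratio of consecutive running times on rows where N doubles. For a quadratic algorithm, T(2N)/T(N) should approach 4; for an n lg n algorithm, the ratio is 2(1 + 1/lg N), which stays only slightly above 2. The sketch below shows one possible way to collect and compare such measurements; the helper names timeSeconds and doublingRatios are hypothetical and are not part of the provided framework.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Wall-clock seconds taken by a single call to f().
template <typename F>
double timeSeconds(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Given times measured at N, 2N, 4N, ..., return T(2N)/T(N) for each step.
std::vector<double> doublingRatios(const std::vector<double>& times) {
    std::vector<double> ratios;
    for (std::size_t i = 1; i < times.size(); ++i)
        ratios.push_back(times[i] / times[i - 1]);
    return ratios;
}
```

Very small running times are dominated by noise, so ratios computed from the first few rows of the table should be treated with caution.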

Submission Procedures

You are required to submit two files:

Your edited version of closestpair.cpp
A text document including the analysis requested in the preceding section

Ensure that both files include your name in the headers, and then email them both as attachments to the instructor.

Advice

The key to the implementation is the proper management of the data through the recursive process. To achieve O(n lg n) asymptotics, you must sort only at the top level of your algorithm, before entering the recursive divide-and-conquer. Also, to minimize the constant factor hidden within the asymptotics for real-world efficiency, avoid creating unnecessary copies of data wherever possible.
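The following is a minimal sketch of that overall structure, not a reference solution: the points are sorted by x and by y exactly once, and each recursive call splits its y-sorted list in linear time rather than sorting again. The Point and Outcome definitions, the Outcome field names, and the helper name closestRec are assumptions made for this illustration; the framework's own definitions should be used in practice.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { long long x, y; };                            // stand-in for the framework's Point
struct Outcome { Point first, second; long long distSq; };   // assumed field names

long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}
bool compareByX(const Point& a, const Point& b) {
    return a.x < b.x || (a.x == b.x && a.y < b.y);
}
bool compareByY(const Point& a, const Point& b) {
    return a.y < b.y || (a.y == b.y && a.x < b.x);
}

// px[lo, hi) is sorted by x; py holds the same points sorted by y.
// Assumes all points are distinct and hi - lo >= 2.
Outcome closestRec(const std::vector<Point>& px, std::size_t lo, std::size_t hi,
                   const std::vector<Point>& py) {
    if (hi - lo <= 3) {                              // small case: check every pair
        Outcome best{px[lo], px[lo + 1], distSquared(px[lo], px[lo + 1])};
        for (std::size_t i = lo; i < hi; ++i)
            for (std::size_t j = i + 1; j < hi; ++j)
                if (distSquared(px[i], px[j]) < best.distSq)
                    best = {px[i], px[j], distSquared(px[i], px[j])};
        return best;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    const Point pivot = px[mid];
    // Split the y-sorted list in linear time; each half stays sorted by y.
    std::vector<Point> leftY, rightY;
    leftY.reserve(mid - lo);
    rightY.reserve(hi - mid);
    for (const Point& p : py)
        (compareByX(p, pivot) ? leftY : rightY).push_back(p);

    Outcome a = closestRec(px, lo, mid, leftY);
    Outcome b = closestRec(px, mid, hi, rightY);
    Outcome best = (a.distSq <= b.distSq) ? a : b;

    // Points within the current best distance of the dividing line, in y order.
    std::vector<Point> strip;
    for (const Point& p : py)
        if ((p.x - pivot.x) * (p.x - pivot.x) < best.distSq) strip.push_back(p);
    for (std::size_t i = 0; i < strip.size(); ++i)
        for (std::size_t j = i + 1; j < strip.size(); ++j) {
            long long dy = strip[j].y - strip[i].y;
            if (dy * dy >= best.distSq) break;       // later points are farther still
            long long d = distSquared(strip[i], strip[j]);
            if (d < best.distSq) best = {strip[i], strip[j], d};
        }
    return best;
}

Outcome efficient(const std::vector<Point>& data) {  // requires data.size() >= 2
    std::vector<Point> px(data), py(data);
    std::sort(px.begin(), px.end(), compareByX);     // the only sorts performed
    std::sort(py.begin(), py.end(), compareByY);
    for (std::size_t i = 1; i < px.size(); ++i)      // duplicate points give distance 0
        if (px[i - 1].x == px[i].x && px[i - 1].y == px[i].y)
            return {px[i - 1], px[i], 0};
    return closestRec(px, 0, px.size(), py);
}
```

This sketch still allocates new vectors for the y-sorted halves and the strip at every level, which is exactly the kind of copying that a tuned implementation would try to reduce, for example by working with indices into preallocated buffers. The up-front duplicate check is what makes the linear-time split safe, since it guarantees that compareByX places every point on the same side in both lists.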

In terms of efficiency, we observe that the brute force implementation when run on hopper solves a problem with N=50,000 in 3 seconds, while our divide-and-conquer approach solves that same problem in 0.028 seconds, and solves a problem with N=5,000,000 in under 4 seconds.

Finally, it is imperative that the new implementation correctly solves the problem. For small to moderate test cases, you should compare the output of your divide-and-conquer algorithm to the output reported by the brute force implementation, as the answers should agree. If you'd like a few known results for bigger cases, we find a distance squared value of 1394 when using N=100,000 and seed S=12, with that closest pair being points (27822826, 24063096) and (27822831, 24063059). For N=1,000,000 and S=345, the closest pair (22944166, 21376804) and (22944158, 21376802) has distance squared of 68.
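The two quoted values can be confirmed directly from the coordinates, and the same arithmetic gives a simple way to check a reported answer against all pairs. In the sketch below, Point and distSquared are stand-ins for the framework's versions, and knownCaseA, knownCaseB, and agreesWithAllPairs are hypothetical helper names.

```cpp
#include <cstddef>
#include <vector>

struct Point { long long x, y; };   // stand-in for the framework's Point

long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// N = 100,000, S = 12: dx = 5 and dy = 37, so 25 + 1369 = 1394.
long long knownCaseA() {
    return distSquared(Point{27822826, 24063096}, Point{27822831, 24063059});
}

// N = 1,000,000, S = 345: dx = 8 and dy = 2, so 64 + 4 = 68.
long long knownCaseB() {
    return distSquared(Point{22944166, 21376804}, Point{22944158, 21376802});
}

// True if claimedSq equals the minimum squared distance over all pairs.
// Quadratic time, so intended only for small to moderate inputs.
bool agreesWithAllPairs(const std::vector<Point>& data, long long claimedSq) {
    bool achieved = false;
    for (std::size_t i = 0; i < data.size(); ++i)
        for (std::size_t j = i + 1; j < data.size(); ++j) {
            long long d = distSquared(data[i], data[j]);
            if (d < claimedSq) return false;     // some pair is strictly closer
            if (d == claimedSq) achieved = true;
        }
    return achieved;
}
```

Comparing squared distances rather than the reported pairs avoids false alarms when several pairs tie for the minimum.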

Grading Rubric

The homework is worth up to 100 points, as follows:

(50 points) Quality/effort in core implementation
(10 points) Accurate calculation of closest pairs
(10 points) Efforts to streamline efficiency of implementation
(6 points) Overview of project in text document
(6 points) Discussion of data structure choices
(6 points) Discussion of threshold choice for when to stop divide-and-conquer
(6 points) Completion of table of running times
(6 points) Discussion of asymptotic analysis and running times

Extra Credit

It turns out that there is a corresponding lower bound showing that Θ(n lg n) is the best possible running time for a deterministic algorithm for the closest pair problem in a model that assumes use only of basic algebraic operations. However, if use of a constant-time floor operation is allowed, there is a better algorithm, and if randomization is allowed, there are several algorithms that compute the closest pair with an expected running time of O(n). For up to 20 points of extra credit, you are to implement one of these algorithms and compare its running time to the standard divide-and-conquer algorithm.

Most of these O(n) algorithms use the following framework. If we have any upper bound, δ, on the true closest distance, and we overlay a δxδ grid over the entire plane and bucket points that lie in the same cell of the grid, then the two points of the true closest pair must lie either in the same cell or in two neighboring (possibly diagonally neighboring) cells. This is because any two points in non-neighboring cells are separated by strictly more than δ and thus cannot be the closest pair.

Which grid cell a particular point lies in can be computed by taking its distance from the origin along the x- and y-axes and using the floor computation when dividing by δ. While we may consider the grid infinitely large, there can be at most N nonempty cells. Furthermore, the technique of hashing can be used to associate a data structure with each cell, with operations having expected O(1) time. If δ were the actual closest pair distance, then we could be sure that there are at most 4 points in any cell, but if the estimate is worse, there could be many more points in the same cell.
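As a rough illustration of the cell computation and hashing just described, the sketch below buckets point indices by grid cell. It makes several simplifying assumptions that are not part of the assignment: the cell side delta is a positive integer, coordinates are nonnegative (so integer division acts as the floor), and cell indices fit in 32 bits so that a column and row can be packed into a single key. The names cellKey, Grid, and buildGrid are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { long long x, y; };   // stand-in for the framework's Point

// Pack the (column, row) of the cell containing p into one 64-bit key.
// Assumes nonnegative coordinates and cell indices below 2^32.
std::uint64_t cellKey(const Point& p, long long delta) {
    std::uint64_t col = static_cast<std::uint64_t>(p.x / delta);  // floor, since x >= 0
    std::uint64_t row = static_cast<std::uint64_t>(p.y / delta);  // floor, since y >= 0
    return (col << 32) | row;
}

// Each nonempty cell maps to the indices of the points that lie in it.
using Grid = std::unordered_map<std::uint64_t, std::vector<std::size_t>>;

Grid buildGrid(const std::vector<Point>& pts, long long delta) {
    Grid grid;
    grid.reserve(pts.size());   // at most one nonempty cell per point
    for (std::size_t i = 0; i < pts.size(); ++i)
        grid[cellKey(pts[i], delta)].push_back(i);
    return grid;
}
```

Storing indices rather than copies of the points keeps each bucket small, though the per-cell vectors are still a noticeable source of overhead.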

The specific algorithm we consider was originally described by Michael Rabin (a more recent description is given by Richard Lipton). It begins by considering a random sample of sqrt(N) points and then uses a brute-force algorithm to find the closest pair in that subset. It uses the closest pair distance from that subset as the δ estimate, and then computes the grid decomposition of all points using δxδ sized cells. It then computes the actual distance between each point p and every other point that lies in the "neighborhood" of p, where the neighborhood is the same cell or any cell that touches it horizontally, vertically, or diagonally. It is clear that this algorithm must identify the true closest pair. The more interesting challenge is the mathematical analysis to show that the overall amount of work performed has expected value of O(n); fortunately, Dr. Rabin proved this result.
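The steps above might be sketched as follows. This is an untuned illustration under stated assumptions rather than a faithful rendering of Rabin's algorithm: coordinates are nonnegative, there are at least two points, cell indices stay below 2^31, and the cell side is rounded up to an integer s that is at least δ (any upper bound on the closest distance works). The names gridClosest and consider, the seed parameter, and the Outcome field names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <unordered_map>
#include <vector>

struct Point { long long x, y; };                            // stand-in types
struct Outcome { Point first, second; long long distSq; };   // assumed field names

long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Record the pair (i, j) if it beats the best outcome found so far.
void consider(const std::vector<Point>& p, std::size_t i, std::size_t j, Outcome& best) {
    long long d = distSquared(p[i], p[j]);
    if (d < best.distSq) best = {p[i], p[j], d};
}

Outcome gridClosest(const std::vector<Point>& p, unsigned seed) {
    const std::size_t n = p.size();
    // 1. Brute force over a random sample of about sqrt(n) distinct points.
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::mt19937 gen(seed);
    std::shuffle(idx.begin(), idx.end(), gen);
    std::size_t root = static_cast<std::size_t>(std::sqrt(static_cast<double>(n)));
    std::size_t m = std::max<std::size_t>(2, root);
    Outcome best{p[idx[0]], p[idx[1]], distSquared(p[idx[0]], p[idx[1]])};
    for (std::size_t a = 0; a < m; ++a)
        for (std::size_t b = a + 1; b < m; ++b)
            consider(p, idx[a], idx[b], best);
    if (best.distSq == 0) return best;               // duplicate points: nothing is closer

    // 2. Integer cell side s with s >= delta, where delta^2 = best.distSq.
    long long s = static_cast<long long>(std::sqrt(static_cast<double>(best.distSq)));
    while (s * s < best.distSq) ++s;

    // 3. Bucket every point by its cell (integer division is floor for values >= 0).
    auto key = [](long long col, long long row) {
        return (static_cast<std::uint64_t>(col) << 32) | static_cast<std::uint64_t>(row);
    };
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    grid.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        grid[key(p[i].x / s, p[i].y / s)].push_back(i);

    // 4. Compare each point with the points in its 3x3 neighborhood of cells.
    for (std::size_t i = 0; i < n; ++i) {
        long long col = p[i].x / s, row = p[i].y / s;
        for (long long c = std::max(0LL, col - 1); c <= col + 1; ++c)
            for (long long r = std::max(0LL, row - 1); r <= row + 1; ++r) {
                auto it = grid.find(key(c, r));
                if (it == grid.end()) continue;
                for (std::size_t j : it->second)
                    if (i < j) consider(p, i, j, best);   // visit each pair once
            }
    }
    return best;
}
```

Shuffling the full index vector is a simple way to obtain distinct sample indices in O(n) time; sampling with replacement could pick the same point twice and yield a spurious distance of zero. The result is correct for any seed, since the sample only affects how fine the grid is and therefore how much work step 4 performs.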

The biggest challenge in making this algorithm effective is in minimizing the inherent expense associated with managing a hash table of secondary containers to represent the grid decomposition. While there are supporting data structures available within C++11 for both hashing and secondary containers, care is needed to get this algorithm to outperform the O(n lg n) divide-and-conquer algorithm.

Two other such O(n) algorithms based on a grid decomposition are presented by Khuller and Matias, and by Golin, Raman, Schwarz, and Smid.

For the sake of full disclosure, I will note that (as of September 5) I have not yet been able to get my linear time implementation of Rabin's algorithm to outperform the earlier algorithm because my underlying constant factors are too high. In fact, my extra credit version takes twice the time of the required version, but I haven't yet worked on fully optimizing the implementation.

CSCI 3100, Fall 2016
Last modified: Monday, 05 September 2016

Saint Louis University

Michael Goldwasser

Fall 2016

Dept. of Computer Science