Saint Louis University | Computer Science 3100 | Dept. of Computer Science
Topics: Divide-and-conquer algorithm for finding the closest pair of points
Related Reading: Section 33.4 of CLRS
Due: 10:00am, Monday, September 26, 2016
You must adhere to the policies on academic integrity, paying particular attention to the limits on collaboration.
For this homework, you will implement the
We will get you started by providing a framework for the project, which you may download as individual files or as a single zipfile. Our program contains a main driver that parses command-line arguments of the following form:
driver A N [S]
where A is an integer with value
Our driver takes care of pseudorandomly generating N points with nonnegative integer coordinates, sending the points as a parameter to the indicated algorithm, and reporting the running time of the algorithm. Our codebase already includes an implementation of a basic Point struct with fields x and y, together with useful utility functions:
distSquared(a, b), which returns the square of the distance between points a and b. Note that we rely on the square of the distance rather than the actual distance because this allows exact comparison of integer values, rather than the floating-point values that would result from the square-root computation needed for the actual distance.
compareByX(a, b), a boolean function that can be used with the standard sorting tools to sort a collection of points according to x-coordinate (and using the y-coordinate as a tie-breaker).
compareByY(a, b), a boolean function that can be used with the standard sorting tools to sort a collection of points according to y-coordinate (and using the x-coordinate as a tie-breaker).
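To make the role of these utilities concrete, here is a minimal sketch of what such helpers might look like. It is an illustration only: the definitions in the provided codebase are authoritative, and the coordinate type (long long) and the helper sortedByX are assumptions made for this example.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-ins for the helpers supplied with the project.
// The coordinate type is an assumption; the provided code may differ.
struct Point {
    long long x;
    long long y;
};

// Squared Euclidean distance, computed entirely in integer arithmetic
// so that comparisons between distances are exact.
long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x;
    long long dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Orders by x-coordinate, using the y-coordinate as a tie-breaker.
bool compareByX(const Point& a, const Point& b) {
    return a.x < b.x || (a.x == b.x && a.y < b.y);
}

// Orders by y-coordinate, using the x-coordinate as a tie-breaker.
bool compareByY(const Point& a, const Point& b) {
    return a.y < b.y || (a.y == b.y && a.x < b.x);
}

// Hypothetical usage helper: returns a copy of the points sorted by x.
std::vector<Point> sortedByX(std::vector<Point> pts) {
    std::sort(pts.begin(), pts.end(), compareByX);
    return pts;
}
```

Because each comparison function defines a strict ordering (it never reports that a point precedes itself), it can be passed directly as the third argument to std::sort.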
Our codebase also includes the definition for a simple Outcome struct, which allows an algorithm to return the pair of closest points and the square of the distance between them.
All of your code must be placed in the file closestpair.cpp. Within that file, you must implement the function
Outcome efficient(const vector<Point>& data)
which returns information about the closest pair of points from the data set. In case of a tie, you may report any pair of points that achieves the minimum distance.
While our driver relies on calling the efficient function, you are welcome to introduce any additional functions within that file, such as the one needed for the recursive divide-and-conquer algorithm. However, you should not make any changes to any files other than closestpair.cpp.
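As a rough illustration of how a public entry point can delegate to a recursive helper, consider the following sketch. It is deliberately not the target solution: it re-sorts the strip by y at every level, which gives O(N log^2 N) time rather than O(N log N). The names solve, bruteRange, efficientSketch, and THRESHOLD, as well as the fields of Outcome, are assumptions made for this example and need not match the provided codebase.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the project's types; the field names of
// Outcome are assumptions made for this sketch.
struct Point { long long x, y; };
struct Outcome { Point a, b; long long distSq; };

long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

const std::size_t THRESHOLD = 16;  // hypothetical cutoff; tune experimentally

// Brute force over pts[lo, hi); requires hi - lo >= 2.
Outcome bruteRange(const std::vector<Point>& pts, std::size_t lo, std::size_t hi) {
    Outcome best{pts[lo], pts[lo + 1], distSquared(pts[lo], pts[lo + 1])};
    for (std::size_t i = lo; i < hi; ++i)
        for (std::size_t j = i + 1; j < hi; ++j) {
            long long d = distSquared(pts[i], pts[j]);
            if (d < best.distSq) best = Outcome{pts[i], pts[j], d};
        }
    return best;
}

// Recursive helper on pts[lo, hi), which must already be sorted by x.
Outcome solve(const std::vector<Point>& pts, std::size_t lo, std::size_t hi) {
    if (hi - lo <= THRESHOLD) return bruteRange(pts, lo, hi);
    std::size_t mid = lo + (hi - lo) / 2;
    long long midX = pts[mid].x;
    Outcome left = solve(pts, lo, mid);
    Outcome right = solve(pts, mid, hi);
    Outcome best = (left.distSq <= right.distSq) ? left : right;

    // Collect the points whose horizontal distance to the dividing line
    // is less than the best distance so far, then sort that strip by y.
    std::vector<Point> strip;
    for (std::size_t i = lo; i < hi; ++i) {
        long long dx = pts[i].x - midX;
        if (dx * dx < best.distSq) strip.push_back(pts[i]);
    }
    std::sort(strip.begin(), strip.end(),
              [](const Point& a, const Point& b) { return a.y < b.y; });

    // Compare each strip point only with those close enough vertically.
    for (std::size_t i = 0; i < strip.size(); ++i)
        for (std::size_t j = i + 1; j < strip.size(); ++j) {
            long long dy = strip[j].y - strip[i].y;
            if (dy * dy >= best.distSq) break;
            long long d = distSquared(strip[i], strip[j]);
            if (d < best.distSq) best = Outcome{strip[i], strip[j], d};
        }
    return best;
}

// Entry point in the style of efficient(); requires data.size() >= 2.
Outcome efficientSketch(const std::vector<Point>& data) {
    std::vector<Point> pts(data);  // sort a copy, since data is const
    std::sort(pts.begin(), pts.end(), [](const Point& a, const Point& b) {
        return a.x < b.x || (a.x == b.x && a.y < b.y);
    });
    return solve(pts, 0, pts.size());
}
```

Passing index ranges into one shared sorted vector avoids copying points at each level of the recursion; only the strip is copied. Removing the per-level sort of the strip is the step that separates this sketch from an O(N log N) implementation.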
In addition to your code, you must create a written document addressing several aspects of your implementation and its efficiency.
Provide a brief overview of your implementation, discussing any unusual design decisions, and disclosing any known problems with the correctness of your code.
Give a more specific explanation of the data structures that you use for representing collections of points, and how this information is passed from one recursive level to another.
Just as it is common for a merge-sort implementation to eventually revert to insertion sort for small enough data sets, you can optimize your closest-pair implementation by relying on the existing brute-force implementation once N falls below some threshold. Please explain what value you have chosen for such a threshold and what experimental data you gathered to support that decision.
Provide a table such as the following showing observed running times of your program for various choices of N. Report on both the brute-force and the divide-and-conquer solutions, but omit values once running times get beyond 20 seconds or so.
N | brute-force | divide-and-conquer |
---|---|---|
1,000 | | |
2,000 | | |
4,000 | | |
8,000 | | |
16,000 | | |
32,000 | | |
64,000 | | |
128,000 | | |
256,000 | | |
512,000 | | |
1,000,000 | | |
2,000,000 | | |
4,000,000 | | |
8,000,000 | | |
16,000,000 | | |
32,000,000 | | |
For both brute-force and divide-and-conquer, provide a brief analysis of how the theoretical asymptotic performance of the algorithm is (or is not) reflected in your observed data.
You are required to submit two files
The key to the implementation is the proper management of the data
through the recursive process. To achieving
In terms of efficiency, we observe that the brute-force implementation, when run on hopper, solves a problem with N=50,000 in 3 seconds, while our divide-and-conquer approach solves that same problem in 0.028 seconds and solves a problem with N=5,000,000 in under 4 seconds.
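As a quick sanity check, these timings can be compared against the textbook growth rates. The worked estimate below assumes the idealized models T_bf(N) ≈ c1·N^2 and T_dc(N) ≈ c2·N·log2(N), with constants c1 and c2 fitted from the N=50,000 measurements above. It is an extrapolation under those assumptions, not a measurement.

```latex
\begin{align*}
\frac{T_{\mathrm{bf}}(2N)}{T_{\mathrm{bf}}(N)} &= \frac{c_1 (2N)^2}{c_1 N^2} = 4,
&
\frac{T_{\mathrm{dc}}(2N)}{T_{\mathrm{dc}}(N)} &= \frac{2N \log_2 (2N)}{N \log_2 N}
  = 2\left(1 + \frac{1}{\log_2 N}\right) \\[6pt]
T_{\mathrm{dc}}(5 \times 10^6) &\approx 0.028\,\mathrm{s} \times 100 \times
  \frac{\log_2 (5 \times 10^6)}{\log_2 (5 \times 10^4)}
  \approx 0.028\,\mathrm{s} \times 100 \times \frac{22.25}{15.61}
  \approx 3.99\,\mathrm{s} \\[6pt]
T_{\mathrm{bf}}(5 \times 10^6) &\approx 3\,\mathrm{s} \times 100^2
  = 3 \times 10^4\,\mathrm{s} \approx 8.3\ \text{hours}
\end{align*}
```

The divide-and-conquer estimate agrees with the observed time of just under 4 seconds. When filling in your own table, each doubling of N should roughly quadruple the brute-force time and slightly more than double the divide-and-conquer time.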
Finally, it is imperative that the new implementation correctly solves
the problem. For small to moderate test cases, you should compare the
output of your divide-and-conquer algorithm to the output reported by
the brute-force implementation, as the answers should agree. If you'd
like a few known results for a bigger case, we find a distance squared
value of 1394 when using N=100,000 and seed S=12, with that
closest pair being points
The homework is worth up to 100 points, as follows:
It turns out that there is a corresponding lower bound showing that
Most of these
The grid cell containing a particular point can be computed by dividing each of its x- and y-coordinates by δ and taking the floor of the result. While we may consider the grid infinitely large, there can be at most N nonempty cells. Furthermore, hashing can be used to associate a data structure with each cell, with operations taking expected O(1) time. If δ were the actual closest-pair distance, then we could be sure that there are at most 4 points in any cell, but if the estimate is worse, there could be many more points in the same cell.
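The cell computation and the hashed grid might be sketched as follows. The names cellKey and buildGrid are hypothetical, and packing the two cell indices into a single 64-bit key is just one possible choice of hash key. It assumes nonnegative coordinates, as produced by our driver, and cell indices that fit in 32 bits.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point { long long x, y; };  // hypothetical stand-in for the project's Point

// Packs the two cell indices of p into one 64-bit hash key.
// Assumes nonnegative coordinates and cell indices below 2^32.
std::uint64_t cellKey(const Point& p, double delta) {
    std::uint64_t cx = static_cast<std::uint64_t>(std::floor(p.x / delta));
    std::uint64_t cy = static_cast<std::uint64_t>(std::floor(p.y / delta));
    return (cx << 32) | cy;
}

// Maps each nonempty cell to the indices of the points that lie in it.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
buildGrid(const std::vector<Point>& pts, double delta) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    grid.reserve(pts.size());  // there are at most N nonempty cells
    for (std::size_t i = 0; i < pts.size(); ++i)
        grid[cellKey(pts[i], delta)].push_back(i);
    return grid;
}
```

Only nonempty cells ever receive an entry, so the table holds at most N keys regardless of how large the coordinate range is.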
The specific algorithm we consider was originally described by Michael
Rabin (a more recent description is given by Richard Lipton).
It begins by considering a random sample of sqrt(N) points and then
uses a brute-force algorithm to find the closest pair in that
subset. It uses the closest pair distance from that subset as the
δ estimate, and then computes the grid decomposition of all points
using cells of size δ×δ. It then computes the
actual distance between each point p and any point that
lies in the "neighborhood" of p, where the neighborhood is
the same cell or any cell which touches horizontally, vertically, or
diagonally. It is clear that this algorithm must identify the true
closest pair. The more interesting challenge is the mathematical
analysis to show that the overall amount of work performed has
expected value of
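The steps described above might be put together as in the following sketch. It is only one possible rendering and is not tuned for speed. The name gridSketch, the fields of Outcome, and the seed parameter are assumptions made for this example. It also takes two liberties: the sample is drawn with replacement, and the cell side is the estimate δ rounded up to an integer, which keeps all cell arithmetic exact without affecting correctness.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

struct Point { long long x, y; };                  // hypothetical stand-ins
struct Outcome { Point a, b; long long distSq; };  // field names are assumptions

long long distSquared(const Point& a, const Point& b) {
    long long dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

// Randomized grid approach. Requires data.size() >= 2 and nonnegative
// coordinates whose cell indices fit in 32 bits.
Outcome gridSketch(const std::vector<Point>& data, unsigned seed) {
    const std::size_t n = data.size();
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);

    // Step 1: brute force over about sqrt(N) randomly chosen indices.
    std::size_t m = static_cast<std::size_t>(std::sqrt(static_cast<double>(n)));
    if (m < 2) m = 2;
    std::vector<std::size_t> sample(m);
    for (std::size_t& s : sample) s = pick(rng);
    Outcome best{data[0], data[1], distSquared(data[0], data[1])};
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = i + 1; j < m; ++j) {
            if (sample[i] == sample[j]) continue;  // same point drawn twice
            long long d = distSquared(data[sample[i]], data[sample[j]]);
            if (d < best.distSq) best = Outcome{data[sample[i]], data[sample[j]], d};
        }
    if (best.distSq == 0) return best;  // coincident points: cannot improve

    // Step 2: cell side = smallest integer that is at least the estimate.
    long long side = static_cast<long long>(std::sqrt(static_cast<double>(best.distSq)));
    while (side * side < best.distSq) ++side;

    // Step 3: hash every point index into its grid cell.
    auto key = [](long long cx, long long cy) {
        return (static_cast<std::uint64_t>(cx) << 32) | static_cast<std::uint64_t>(cy);
    };
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> grid;
    grid.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        grid[key(data[i].x / side, data[i].y / side)].push_back(i);

    // Step 4: compare each point with the points in its 3x3 neighborhood.
    for (std::size_t i = 0; i < n; ++i) {
        long long cx = data[i].x / side, cy = data[i].y / side;
        for (long long dx = -1; dx <= 1; ++dx)
            for (long long dy = -1; dy <= 1; ++dy) {
                if (cx + dx < 0 || cy + dy < 0) continue;
                auto it = grid.find(key(cx + dx, cy + dy));
                if (it == grid.end()) continue;
                for (std::size_t j : it->second) {
                    if (j <= i) continue;  // visit each unordered pair once
                    long long d = distSquared(data[i], data[j]);
                    if (d < best.distSq) best = Outcome{data[i], data[j], d};
                }
            }
    }
    return best;
}
```

The result is correct because the true closest pair is no farther apart than the sampled estimate, so its two points differ by at most one cell index in each coordinate and therefore lie in each other's neighborhood.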
The biggest challenge in making this algorithm effective is in minimizing
the inherent expense associated with managing a hash-table of
secondary containers to represent the grid decomposition. While there
are supporting data structures available within C++11 for both hashing
and secondary containers, care is needed to get this algorithm to
outperform the
Two other such
For the sake of full disclosure, I will note that (as of September 5) I have not yet been able to get my linear-time implementation of Rabin's algorithm to outperform the earlier algorithm because my underlying constant factors are too high. In fact, my extra-credit version takes twice the time of the required version, but I haven't yet worked on fully optimizing the implementation.