A Brief Introduction to Skyline Problem (Pareto-optimal Tuples) (1)

The skyline problem is to compute the best tuples from a set of ordered d-tuples. The name is originated from what the solution represented on 2d plane resembles the scene that urban buildings comprise. Skyline is one of the recommendation queries, and it is considering multi criteria. It is very interesting problem as well as very useful query. This problem has been being intensively studied for recent years. Today, I’m going to present the problem definition of skyline. Next time, I’ll describe several algorithms for the skyline problem.

Singapore Skyline (#12) First of all, let us know the input data. The input data D^{d} of skyline is a set of n ordered d-tuples, each of which consists of ordered d scalar values. They are shown in below formulas:

D^{d} = {tp_{1},tp_{2},...tp_{n}}

tp_{i} = (v_{1},v_{2},...,v_{d})

tp_{i} denotes a d-tuple. And, we need to understand the definition of the dominance relation. In addition, because the skyline problem is to find the better tuples, we need an assumption about ‘better’. In most literature, it is assumed that the less value is better, so we follow this assumption.

Definition 1 (Dominance). Let tp and tp’ be tuples in D^{d} where v_{i} is an element of tp and u_{i} is an element of tp’ for 1 < i leq d. Then, tp dominates tp’ if and only if  forall{i}, v_{i} leq u_{i} land exists{j}, v_{j} < u_{j}.

In other words, it is said that one tuple tp dominates another tuple tp' if tp is not worse (not greater) than tp' in all dimensions and tp is better (less) than tp' in at least one dimension.

Definition 2 (Skyline) Given a data set D^{d}, a skyline contains tuples that is not dominated any other tuples in D^{d}.

As I described above definition, a skyline is a set of tuples and the tuples are not dominated by any other tuples in D^{d}. In literature, a d-dimensional data set and above two definitions are usually represented for comprehensive description to d-points on d-axies.

Without loss of generality, we assume that D^{d} is a 2d data set (i.e., d=2). A data set is given as follows:

  • a = (3,2)
  • b = (8,1)
  • c = (1,10)
  • d = (4,3)
  • e = (8,6)

Each element of a tuple in D^{d} can be represented to one axis. In other words, the first element and the second element of tuples are represented to X and Y axies respectively. Then, tuples of above list are represented to 2d points as shown in Fig. 1.

Fig. 1. An example of a skyline

Fig. 1. An example of a skyline

In Fig. 1, let us look into a dominance relation. The point a dominates the points {d,e} since elements of the point a less than those of {d,e} in X and Y. The point b dominates only e since X values of {b,e} are same (i.e., X=8) but Y of b (i.e., 1) is less than that (i.e., 6) of e. The points {d,e} cannot belong to the skyline because they are dominated by other tuples. Consequently, the points a,b, and c belong to the skyline since they are not dominated by any other tuples.

Initially, the skyline problem was known as the maxima vector problem (H. T. Kung et. al 1975) for traditional processing system. However, this problem was revisited by the Skyline Operator (Stephan Börzsönyi et. al 2001). Since then, this problem has been intensively studied in database area.

Next time, I’ll describe several algorithms including above algorithms in detail.