A Brief Introduction to Skyline Problem (Pareto-optimal Tuples) (1)

The skyline problem is to compute the best tuples from a set of ordered d-tuples. The name is originated from what the solution represented on 2d plane resembles the scene that urban buildings comprise. Skyline is one of the recommendation queries, and it is considering multi criteria. It is very interesting problem as well as very useful query. This problem has been being intensively studied for recent years. Today, I’m going to present the problem definition of skyline. Next time, I’ll describe several algorithms for the skyline problem.

Singapore Skyline (#12) First of all, let us know the input data. The input data D^{d} of skyline is a set of n ordered d-tuples, each of which consists of ordered d scalar values. They are shown in below formulas:

D^{d} = {tp_{1},tp_{2},...tp_{n}}

tp_{i} = (v_{1},v_{2},...,v_{d})

tp_{i} denotes a d-tuple. And, we need to understand the definition of the dominance relation. In addition, because the skyline problem is to find the better tuples, we need an assumption about ‘better’. In most literature, it is assumed that the less value is better, so we follow this assumption.

Definition 1 (Dominance). Let tp and tp’ be tuples in D^{d} where v_{i} is an element of tp and u_{i} is an element of tp’ for 1 < i leq d. Then, tp dominates tp’ if and only if  forall{i}, v_{i} leq u_{i} land exists{j}, v_{j} < u_{j}.

In other words, it is said that one tuple tp dominates another tuple tp' if tp is not worse (not greater) than tp' in all dimensions and tp is better (less) than tp' in at least one dimension.

Definition 2 (Skyline) Given a data set D^{d}, a skyline contains tuples that is not dominated any other tuples in D^{d}.

As I described above definition, a skyline is a set of tuples and the tuples are not dominated by any other tuples in D^{d}. In literature, a d-dimensional data set and above two definitions are usually represented for comprehensive description to d-points on d-axies.

Without loss of generality, we assume that D^{d} is a 2d data set (i.e., d=2). A data set is given as follows:

  • a = (3,2)
  • b = (8,1)
  • c = (1,10)
  • d = (4,3)
  • e = (8,6)

Each element of a tuple in D^{d} can be represented to one axis. In other words, the first element and the second element of tuples are represented to X and Y axies respectively. Then, tuples of above list are represented to 2d points as shown in Fig. 1.

Fig. 1. An example of a skyline

Fig. 1. An example of a skyline

In Fig. 1, let us look into a dominance relation. The point a dominates the points {d,e} since elements of the point a less than those of {d,e} in X and Y. The point b dominates only e since X values of {b,e} are same (i.e., X=8) but Y of b (i.e., 1) is less than that (i.e., 6) of e. The points {d,e} cannot belong to the skyline because they are dominated by other tuples. Consequently, the points a,b, and c belong to the skyline since they are not dominated by any other tuples.

Initially, the skyline problem was known as the maxima vector problem (H. T. Kung et. al 1975) for traditional processing system. However, this problem was revisited by the Skyline Operator (Stephan Börzsönyi et. al 2001). Since then, this problem has been intensively studied in database area.

Next time, I’ll describe several algorithms including above algorithms in detail.


Some Interesting Papers of ACM SIGMOD Conference 2009

ACM SIGMOD Conference 2009 was held in Providence, Rhode Island from June 29 through July 2. Then, the electronic proceedings are available online. Among many nice papers, I tried to choose some interesting papers as follows:

MapReduce & Hadoop

  • “A Comparison of Approaches to Large Scale Data Analysis,” Andrew Pavlo, Samuel Madden, David DeWitt, Michael Stonebraker, Alexander Rasin, Erik Paulson, Lakshmikant Shrinivas and Daniel Abadi.

Some of the authors are members of vertica, a parallel database. Prof. Dwitt strongly attacked MapReduce (MapReduce: A major step backwards, MapReduce II). So, I wonder how did they benchmark both architectures.

Skyline Queries

  • “Minimizing the Communication Cost for Continuous Skyline Maintenance,” Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, Anthony K. H. Tung.
  • “Scalable Skyline Computation Using Object-based Space Partitioning,” ZHANG Shiming, Nikos Mamoulis, David Cheung.
  • “Kernel-Based Skyline Cardinality Estimation,” Zhenjie Zhang, Yin Yang, Ruichu Cai, Dimitris Papadias, Anthony and K. H. Tung.

Since I first met the skyline problem, I have been always interested in skyline queries. Considering multi-criteria, Skyline queries retrieve the best tuples among multi-dimensional objects.

Graph Query Processing

  • “3-HOP: A High-Compression Indexing Scheme for Reachability Query,” Ruoming Jin, Yang Xiang, Ning Ruan, and Dave Fuhry.

Rechability query is to compute whether two given vertices are rechable, or not. Rechability query is one of the most fundamental operations in graph querying. it can be usually used in a primitive operation for complex graph queries.

RDF Query Processing

  • “Scalable Join Processing on Very Large RDF Graphs,” Thomas Neumann and Gerhard Weikum.

The issue with which I’m primarily concerned is RDF query processing. As linked data are gaining attention, this issue will be more dealt with in the database community.

Spatial Query Processing

  • “Quality and Efficiency in High Dimensional Nearest Neighbor Search,” Yufei Tao, Ke Yi, Cheng Sheng and Panos Kalnis.
  • “Continuous Obstructed Nearest Neighbor Queries in Spatial Databases,” Yunjun Gao and Baihua Zheng.
  • “A Revised R*-tree in Comparison with Related Index Structures,” Norbert Beckmann and Bernhard Seeger.

While I was taking M.S. program, I studied many spatial query processing issues. Hence, I try to keep in touch with recent spatial database issues.

They are seem to be very interesting. Later, I will post paper reviews about above papers.


HadoopDB: An Open Source Parallel Database for Analytical Workloads

With the increasingly growing volume of data, the techniques to manage big data are needed in many areas. Open source community and many companies have attempted developing solutions to deal with big data.

Recently, Prof. Daniel Abadi, who is an Assistant Professor of Computer Science at Yale University, announced HadoopDB release and the paper published in VLDB’09. HadoopDB is an open source analytical database, being developed by him and his students. The paper states that HadoopDB is a hybrid of both MapReduce and parallel  database and it takes the best features from both.

Hadoop LogoActually, MapReduce has made controversial issues from a database point of view. Formerly, there was some debates. Representatively, Prof. David Dewitt, who is well known as a great master of (parallel) database, critiqued that MapReduce is a major step backwards. On the other hand, proponents of MapReduce argue that MapReduce outperforms parallel database in respect of scalability, fault tolerance, and flexibility to unstructured data.

This paper concludes that HadoopDB is close to the performance of parallel databases while it is similar score on fault tolerance and feasibility in heterogeneous systems as Hadoop.

In sum, HadoopDB is a hybrid system of MapReduce and parallel DBMS. It is quite interesting achievement. I respect their decision to release HadoopDB as open source because their achievement will more broadly contribute to Hadoop and data analytical database. Still, I do not read this paper completely, and sooner I will discuss HadoopDB in detail.

Some interesting points:

  • They carried out experiments on a 100 node of amazon EC2 cluster.
  • They try to deal with semantic web data (i.e., RDF) by HadoopDB.
  • HadoopDB is a full open source project.
  • HadoopDB isn’t well suited for real-time data yet.
  • I can participate in his presentation at the session at VLDB.

See Also:


Follow

Get every new post delivered to your Inbox.

Join 412 other followers