Some Interesting Papers of ACM SIGMOD Conference 2009
Posted: August 8, 2009 Filed under: Research | Tags: database, paper, SIGMOD
ACM SIGMOD Conference 2009 was held in Providence, Rhode Island from June 29 through July 2, and the electronic proceedings are now available online. Among the many nice papers, I have picked out a few that caught my interest:
MapReduce & Hadoop
- “A Comparison of Approaches to Large-Scale Data Analysis,” Andrew Pavlo, Samuel Madden, David DeWitt, Michael Stonebraker, Alexander Rasin, Erik Paulson, Lakshmikant Shrinivas, and Daniel Abadi.
Some of the authors are affiliated with Vertica, a parallel database company. Prof. DeWitt strongly attacked MapReduce (MapReduce: A major step backwards, MapReduce II), so I am curious how they benchmarked the two architectures.
Skyline Queries
- “Minimizing the Communication Cost for Continuous Skyline Maintenance,” Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, Anthony K. H. Tung.
- “Scalable Skyline Computation Using Object-based Space Partitioning,” Shiming Zhang, Nikos Mamoulis, and David Cheung.
- “Kernel-Based Skyline Cardinality Estimation,” Zhenjie Zhang, Yin Yang, Ruichu Cai, Dimitris Papadias, and Anthony K. H. Tung.
Ever since I first encountered the skyline problem, I have been interested in skyline queries. Given multiple criteria, a skyline query retrieves the best tuples from a multi-dimensional dataset: those not dominated by any other tuple, where one tuple dominates another if it is at least as good in every dimension and strictly better in at least one. The small sketch below illustrates the idea.
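To make dominance concrete, here is a minimal Python sketch of the naive quadratic skyline computation (the hotel data are a hypothetical example of mine, and this is not one of the optimized algorithms from the papers above):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly
    better in at least one (here, lower values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points):
    """Naive O(n^2) skyline: keep each point dominated by no other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]

# Hotels as (price, distance to beach); cheaper and closer is better.
hotels = [(50, 8), (80, 2), (60, 5), (90, 1), (70, 6)]
print(skyline(hotels))  # [(50, 8), (80, 2), (60, 5), (90, 1)]
```

Here (70, 6) is dropped because (60, 5) is both cheaper and closer; every surviving hotel is best under some trade-off between the two criteria.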
Graph Query Processing
- “3-HOP: A High-Compression Indexing Scheme for Reachability Query,” Ruoming Jin, Yang Xiang, Ning Ruan, and Dave Fuhry.
A reachability query asks whether one vertex can reach another along the edges of a graph. It is one of the most fundamental operations in graph querying and is commonly used as a primitive inside more complex graph queries; a small sketch follows.
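On a small graph, a reachability query can be answered with a plain breadth-first search, as in this minimal Python sketch (the adjacency list is a hypothetical example; indexing schemes such as 3-HOP exist precisely to avoid this per-query traversal on large graphs):

```python
from collections import deque

def reachable(graph, src, dst):
    """BFS over a directed adjacency list: is dst reachable from src?"""
    seen, queue = {src}, deque([src])
    while queue:
        v = queue.popleft()
        if v == dst:
            return True
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return False

# 1 -> 2 -> 3, plus an isolated vertex 4.
g = {1: [2], 2: [3], 4: []}
print(reachable(g, 1, 3))  # True
print(reachable(g, 3, 1))  # False
```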
RDF Query Processing
- “Scalable Join Processing on Very Large RDF Graphs,” Thomas Neumann and Gerhard Weikum.
RDF query processing is the issue I am primarily concerned with. As linked data gains attention, I expect this topic to be addressed more and more in the database community.
Spatial Query Processing
- “Quality and Efficiency in High Dimensional Nearest Neighbor Search,” Yufei Tao, Ke Yi, Cheng Sheng and Panos Kalnis.
- “Continuous Obstructed Nearest Neighbor Queries in Spatial Databases,” Yunjun Gao and Baihua Zheng.
- “A Revised R*-tree in Comparison with Related Index Structures,” Norbert Beckmann and Bernhard Seeger.
During my M.S. program, I studied many spatial query processing problems, so I try to stay current with recent spatial database work.
All of these papers seem very interesting. Later, I will post reviews of them.
HadoopDB: An Open Source Parallel Database for Analytical Workloads
Posted: July 31, 2009 Filed under: Research | Tags: database, hadoop, hadoopdb, map-reduce, vldb
With the ever-growing volume of data, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with big data.
Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB together with a paper published at VLDB ’09. HadoopDB is an open source analytical database being developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel databases, taking the best features from both.
MapReduce has been a controversial topic from the database point of view, and there has been considerable debate. Most prominently, Prof. David DeWitt, well known as a master of (parallel) databases, critiqued MapReduce as a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility with unstructured data.
The paper concludes that HadoopDB approaches the performance of parallel databases while scoring similarly to Hadoop on fault tolerance and feasibility in heterogeneous systems.
In sum, HadoopDB is a hybrid of MapReduce and a parallel DBMS, and it is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because their work will contribute more broadly to Hadoop and analytical databases. I have not yet read the paper completely; I will discuss HadoopDB in detail soon.
Some interesting points:
- They carried out experiments on a 100-node Amazon EC2 cluster.
- They are trying to handle semantic web data (i.e., RDF) with HadoopDB.
- HadoopDB is a fully open source project.
- HadoopDB isn’t well suited to real-time data yet.
- I will be able to attend his presentation at the VLDB session.
See Also:
- Yale researchers create database-Hadoop hybrid, Computerworld
- HadoopDB: An Open Source Parallel Database, O’Reilly Radar
- MapReduce: A major step backwards
- MapReduce: A major step backwards (II)
Paper: Graph Twiddling in a MapReduce World
Posted: July 17, 2009 Filed under: Research | Tags: graph, graph cluster, map-reduce, scalable computing
Today at the lab seminar I presented the paper “Graph Twiddling in a MapReduce World,” published in IEEE Computing in Science & Engineering. The paper investigates the feasibility of decomposing graph operations into a series of MapReduce processes. In this post, I will discuss it briefly.
As mentioned above, this paper discusses the feasibility of decomposing graph operations into a series of MapReduce processes. As you know, MapReduce has been gaining attention in various applications that cope with large-scale datasets. However, to the best of my knowledge, there had been no prior studies on handling graphs with MapReduce. The paper proposes the following operations:
- Augmenting Edges with Degrees
- Simplifying the Graph
- Enumerating Triangles
- Enumerating Rectangles
- Finding Trusses
- Barycentric Clustering
- Finding Components
Some operations are performed in combination with others. Several of them would actually be easy problems if the graph could be traversed directly. However, as the author notes, traversing a graph with MapReduce is very inefficient (it causes many MapReduce iterations), because a mapper sees only one record at a time and cannot follow edges across records. Accordingly, all the operations proposed in the paper avoid graph traversal. Instead, they share a common pattern, as follows (a small Python sketch of the pattern follows the list):
- A map operation: read and process every edge (or vertex) record, possibly changing some piece of edge (or vertex) information, and emit records keyed by vertex.
- A first reduce operation: for each group of records from the map, determine the updated state of the vertex or edge and emit partially (locally) updated records.
- A second reduce operation: for each group of records from the previous reduce, combine the partial updates globally and emit the completed information.
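As an illustration, here is a toy in-memory Python simulation of this map/reduce/reduce pattern applied to the paper’s first operation, augmenting edges with the degrees of their endpoints. The shuffle between phases is simulated with a dictionary, so this sketches the data flow only; it is not Hadoop code, and all function names are my own:

```python
from collections import defaultdict

def group(pairs):
    """Simulate the MapReduce shuffle: group (key, value) pairs by key."""
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

# Map: emit each edge once per endpoint, keyed by that vertex.
def map_edges(edges):
    for u, v in edges:
        yield u, (u, v)
        yield v, (u, v)

# Reduce 1: per vertex, the degree is the number of incident edges;
# re-emit each edge keyed by the edge, carrying a partial (vertex, degree).
def reduce_degree(vertex, incident_edges):
    d = len(incident_edges)
    for e in incident_edges:
        yield e, (vertex, d)

# Reduce 2: per edge, combine the two partial records into one final record.
def reduce_combine(edge, partials):
    degrees = dict(partials)  # {vertex: degree} for both endpoints
    u, v = edge
    return edge, (degrees[u], degrees[v])

edges = [(1, 2), (2, 3), (2, 4)]
stage1 = group(map_edges(edges))
stage2 = group(kv for vtx, es in stage1.items() for kv in reduce_degree(vtx, es))
result = [reduce_combine(e, ps) for e, ps in stage2.items()]
print(result)  # [((1, 2), (1, 3)), ((2, 3), (3, 1)), ((2, 4), (3, 1))]
```

On a real cluster, each `group` call would be a full shuffle between a map and a reduce phase, so even this single logical operation already costs two chained MapReduce jobs, which is exactly the iteration overhead discussed below.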
Discussion
Even though this paper proposes several graph operations, they still feel unnatural because they require many MapReduce iterations, and, to the best of my knowledge, the start-up cost of each MapReduce job is very high. Since a mapper can only read records sequentially, the proposed graph operations incur the overhead of both repeated MR iterations and communication. As a result, the primitive graph operations that are feasible with MapReduce are very limited. In addition, there is further evidence that MapReduce is not suited to graph operations, which I will present later.
Therefore, I think a new programming model for graphs (or complex data) is needed. Ideally, such a model should support graph traversal. In addition, data should preserve locality with respect to connectivity even when distributed across a number of data nodes. Based on these ideas, I am fleshing out “Hamburg: A New Programming Model for Graph Data,” inspired by the blog post “Large-scale Graph Computing at Google.”
References
- Jonathan Cohen, “Graph Twiddling in a MapReduce World,” IEEE Computing in Science & Engineering, Volume 11, Issue 4, pp. 29–41, July–Aug. 2009.
- Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI ’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
- Google Cluster Computing and MapReduce Lecture 5
- Breadth-first graph search using an iterative map-reduce algorithm
- Hamburg, Hadoop Wiki