ACM SIGMOD Conference 2009 was held in Providence, Rhode Island from June 29 through July 2. The electronic proceedings are now available online. Among the many good papers, I picked out some that look interesting:
MapReduce & Hadoop
- “A Comparison of Approaches to Large Scale Data Analysis,” Andrew Pavlo, Samuel Madden, David DeWitt, Michael Stonebraker, Alexander Rasin, Erik Paulson, Lakshmikant Shrinivas and Daniel Abadi.
Some of the authors are affiliated with Vertica, a parallel database. Prof. DeWitt strongly attacked MapReduce (MapReduce: A major step backwards, MapReduce II). So I wonder how they benchmarked the two architectures.
Skyline Query Processing
- “Minimizing the Communication Cost for Continuous Skyline Maintenance,” Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, Anthony K. H. Tung.
- “Scalable Skyline Computation Using Object-based Space Partitioning,” ZHANG Shiming, Nikos Mamoulis, David Cheung.
- “Kernel-Based Skyline Cardinality Estimation,” Zhenjie Zhang, Yin Yang, Ruichu Cai, Dimitris Papadias, and Anthony K. H. Tung.
I have been interested in skyline queries ever since I first encountered the skyline problem. Given multiple criteria, a skyline query retrieves the best tuples among multi-dimensional objects, that is, the tuples not dominated by any other tuple (no other tuple is at least as good on every criterion and strictly better on at least one).
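To make the idea concrete, here is a minimal sketch of a skyline query in Python. It is a naive quadratic scan, not the method of any of the papers above, and it assumes that smaller values are better in every dimension. The names `dominates`, `skyline`, and the `hotels` data are made up for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is at least as good as b in every dimension
    and strictly better in at least one (assumption: smaller is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach); lower is better for both.
hotels = [(50, 8), (60, 2), (80, 1), (70, 5), (90, 9)]
print(skyline(hotels))
```

Here (70, 5) drops out because (60, 2) is both cheaper and closer, and (90, 9) drops out because (50, 8) beats it on both criteria. The remaining three hotels are all reasonable trade-offs between price and distance.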
Graph Query Processing
- “3-HOP: A High-Compression Indexing Scheme for Reachability Query,” Ruoming Jin, Yang Xiang, Ning Ruan, and Dave Fuhry.
A reachability query determines whether one given vertex is reachable from another in a graph. It is one of the most fundamental operations in graph querying, and it is often used as a primitive operation in more complex graph queries.
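As a rough illustration, the sketch below answers a reachability query with a plain breadth-first search over an adjacency list. This is only the index-free baseline, not the 3-HOP scheme from the paper; indexing schemes exist precisely to avoid traversing the graph for every query. The function name `reachable` and the toy graph `g` are assumptions made up for this example.

```python
from collections import deque
from typing import Dict, Hashable, Iterable

Graph = Dict[Hashable, Iterable[Hashable]]


def reachable(graph: Graph, source: Hashable, target: Hashable) -> bool:
    """Return True if there is a directed path from source to target."""
    if source == target:
        return True
    seen = {source}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, ()):
            if neighbor == target:
                return True
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical directed graph: d -> a -> b -> c
g = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
print(reachable(g, "a", "c"))
print(reachable(g, "c", "a"))
```

Each query costs time linear in the size of the graph in the worst case, which is why precomputed reachability indexes matter for large graphs.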
RDF Query Processing
- “Scalable Join Processing on Very Large RDF Graphs,” Thomas Neumann and Gerhard Weikum.
The issue I'm primarily concerned with is RDF query processing. As linked data gains attention, the database community will deal with this issue more and more.
Spatial Query Processing
- “Quality and Efficiency in High Dimensional Nearest Neighbor Search,” Yufei Tao, Ke Yi, Cheng Sheng and Panos Kalnis.
- “Continuous Obstructed Nearest Neighbor Queries in Spatial Databases,” Yunjun Gao and Baihua Zheng.
- “A Revised R*-tree in Comparison with Related Index Structures,” Norbert Beckmann and Bernhard Seeger.
During my M.S. program, I studied many spatial query processing issues, so I try to keep up with recent spatial database work.
These papers all seem very interesting. Later, I will post reviews of the papers above.
As the volume of data keeps growing, techniques for managing big data are needed in many areas. The open source community and many companies have attempted to develop solutions for dealing with big data.
Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB and the accompanying paper published at VLDB’09. HadoopDB is an open source analytical database being developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel databases that takes the best features of both.
MapReduce has actually been controversial from a database point of view, and there has been some debate. Most notably, Prof. David DeWitt, well known as a great master of (parallel) databases, criticized MapReduce as a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility in handling unstructured data.
The paper concludes that HadoopDB comes close to parallel databases in performance while scoring similarly to Hadoop on fault tolerance and the ability to run in heterogeneous environments.
In sum, HadoopDB is a hybrid of MapReduce and a parallel DBMS. It is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it lets their work contribute more broadly to Hadoop and analytical databases. I have not read the paper completely yet; I will discuss HadoopDB in detail soon.
Some interesting points:
- They carried out experiments on a 100-node Amazon EC2 cluster.
- They try to handle semantic web data (i.e., RDF) with HadoopDB.
- HadoopDB is a full open source project.
- HadoopDB isn’t well suited for real-time data yet.
- I can attend his presentation at the VLDB session.
- Yale researchers create database-Hadoop hybrid, Computerworld
- HadoopDB: An Open Source Parallel Database, O’Reilly Radar
- MapReduce: A major step backwards
- MapReduce: A major step backwards (II)
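- The Database Column – As the name says, it covers database issues. Recently it has also touched on cloud computing. What makes this blog really great is Samuel Madden
- Gödel’s Lost Letter and P=NP – Judging by the title alone, it seems to deal mainly with NP problems, but it covers a wide variety of problems and algorithms (actually, I only discovered it today). It looks very useful, but also difficult (@_@).
- All Things Distributed – The blog of Werner Vogels, Amazon's CTO. It covers issues in scalable and distributed computing.
My original plan was to introduce five blogs at a time, ten in total over two posts, but since I don't have much to post about these days…… I will continue with the rest next time.
P.S. There are many posts I want to read on those blogs, but the rate at which they are updated is no joke… it is really hard to keep up ~(~_~)~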