HadoopDB: An Open Source Parallel Database for Analytical Workloads

As the volume of data keeps growing, techniques for managing big data are needed in many areas. The open source community and many companies have been trying to develop solutions for dealing with big data.

Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB along with a paper published at VLDB’09. HadoopDB is an open source analytical database developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel database technology that takes the best features of both.

MapReduce has actually been controversial from a database point of view, and there has been some debate about it. Notably, Prof. David DeWitt, a well-known authority on (parallel) databases, criticized MapReduce as a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility in handling unstructured data.

The paper concludes that HadoopDB comes close to the performance of parallel databases while scoring similarly to Hadoop on fault tolerance and the ability to run in heterogeneous environments.

In sum, HadoopDB is a hybrid of MapReduce and a parallel DBMS. It is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it will allow their work to contribute more broadly to Hadoop and to analytical databases. I have not yet read the paper completely, and I will discuss HadoopDB in more detail soon.

Some interesting points:

  • They carried out experiments on a 100-node Amazon EC2 cluster.
  • They try to handle semantic web data (i.e., RDF) with HadoopDB.
  • HadoopDB is a fully open source project.
  • HadoopDB isn’t well suited for real-time data yet.
  • I will be able to attend his presentation in the session at VLDB.

3 Comments on “HadoopDB: An Open Source Parallel Database for Analytical Workloads”

  1. Hyunsik Choi says:

    #chunglab HadoopDB Releases! http://bit.ly/ZntJ5

  2. woorung says:

    I think that the final winner between anti-RDBMS and parallel RDBMS will be a hybrid system that aims at integrating MapReduce into an RDBMS. Actually, GreenPlum and HadoopDB are doing that. Both of them come from RDBMS advocates, since they have a lot of knowledge and experience from RDBMS research. In particular, a hybrid system needs SQL-like query analysis and optimization to manipulate distributed DBMSs with MapReduce. On this point, I think that RDBMS advocates will inevitably defeat anti-RDBMS advocates, unfortunately. Nevertheless, they won't lead the IT industry and market because of the somewhat high cost. To reduce cost, most people want to take advantage of open source. I think that the debates have ended with HadoopDB. Woohyun Kim, a creator of open source coord, which provides a C++ MapReduce framework and a distributed key-value store.

  3. Hyunsik Choi says:

    I also think that a hybrid system will serve as an alternative between anti-RDBMS and parallel RDBMS. Actually, proponents of anti-RDBMS claim that the RDBMS has already reached its limit in satisfying the manifold demands that arise in various environments. I agree with some of their arguments. However, MapReduce is not a Swiss Army knife, so we need an alternative that combines the abilities of both.
