HadoopDB: An Open Source Parallel Database for Analytical Workloads

Posted: July 31, 2009 | Author: Hyunsik Choi | Filed under: Research | Tags: database, hadoop, hadoopdb, map-reduce, vldb |3 Comments

As data volumes keep growing, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with it.

Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB along with a paper published at VLDB’09. HadoopDB is an open source analytical database developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel database technology that takes the best features of both.

MapReduce has been controversial from a database point of view, and there has been some debate about it. Most notably, Prof. David DeWitt, a well-known authority on parallel databases, criticized MapReduce as a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in terms of scalability, fault tolerance, and flexibility in handling unstructured data.

The paper concludes that HadoopDB comes close to parallel databases in performance while scoring similarly to Hadoop on fault tolerance and on the ability to run in heterogeneous environments.

In sum, HadoopDB is a hybrid of MapReduce and a parallel DBMS. It is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it lets their work contribute more broadly to Hadoop and to analytical databases. I have not read the paper completely yet; I will discuss HadoopDB in more detail soon.
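To make the hybrid idea a little more concrete, here is a minimal sketch of the general pattern: each node runs a SQL aggregation against its own local database (the "map" step), and the partial results are merged afterwards (the "reduce" step). This is not HadoopDB code. The in-memory SQLite databases, the `visits` table, the sample rows, and the function names are all assumptions made up for illustration.

```python
import sqlite3
from collections import Counter
from functools import reduce

# Hypothetical data: page-visit rows split across three "nodes".
PARTITIONS = [
    [("a.com", 3), ("b.com", 1)],
    [("a.com", 2), ("c.com", 5)],
    [("b.com", 4), ("c.com", 1)],
]


def make_node_db(rows):
    """Create an in-memory SQLite database standing in for one node's local DBMS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (url TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map step: push the aggregation down to the node's database as SQL."""
    cur = conn.execute("SELECT url, SUM(hits) FROM visits GROUP BY url")
    return Counter(dict(cur.fetchall()))


def reduce_task(left, right):
    """Reduce step: merge the partial aggregates of two nodes."""
    return left + right


nodes = [make_node_db(rows) for rows in PARTITIONS]
partials = [map_task(conn) for conn in nodes]
totals = dict(reduce(reduce_task, partials, Counter()))
print(totals)
```

The point of the sketch is only the division of labor: the per-node database does the work it is good at (local SQL aggregation), while the surrounding map and reduce steps handle distribution and merging.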

Some interesting points:

They carried out experiments on a 100-node Amazon EC2 cluster.
They try to handle Semantic Web data (i.e., RDF) with HadoopDB.
HadoopDB is a fully open source project.
HadoopDB isn’t well suited for real-time data yet.
I will be able to attend his presentation at the VLDB session.

3 Comments on “HadoopDB: An Open Source Parallel Database for Analytical Workloads”

Hyunsik Choi says:

July 31, 2009 at 2:18 am

#chunglab HadoopDB Releases! http://bit.ly/ZntJ5

woorung says:

July 31, 2009 at 11:40 am

I think that the final winner between anti-RDBMS and parallel RDBMS will be a hybrid system that aims at integrating MapReduce into the RDBMS. Actually, GreenPlum and HadoopDB are doing that. Both of them come from RDBMS advocates, since they have lots of knowledge and experience from RDBMS research. In particular, a hybrid system needs SQL-like query analysis and optimization to drive distributed DBMSs with MapReduce. On this point, I think that RDBMS advocates cannot help but defeat anti-RDBMS advocates, unfortunately. Nevertheless, they won't lead the IT industry and market because of the somewhat high cost. To reduce the cost, most people want to take advantage of open source. I think the debate has ended with HadoopDB.

Woohyun Kim, a creator of the open source project coord, which provides a C++ MapReduce framework and a distributed key-value store.

Hyunsik Choi says:

July 31, 2009 at 1:15 pm

I also think that a hybrid system will serve as an alternative between anti-RDBMS and parallel RDBMS. Actually, proponents of anti-RDBMS claim that the RDBMS has already reached its limit in satisfying the manifold demands that arise in various environments. I agree with some of their arguments. However, MapReduce is not a Swiss Army knife, so we need an alternative that takes the strengths of both.


Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database
