MapReduce Online Comes Out!

Posted: October 20, 2009 | Author: Hyunsik Choi | Filed under: Research | Tags: hadoop, map-reduce, online aggregation, stream queries | 3 Comments

MapReduce has been gaining much attention in the data-intensive computing field. It is well known as a very popular framework for batch processing.

Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, released MapReduce Online. I heard this news today from Data Beta. It is remarkable work, since the original MapReduce was designed specifically for batch processing. Moreover, most people believed that MapReduce would remain a batch-processing framework.

The essence of MapReduce Online is that it tries to keep the fault-tolerance model of the original MapReduce while pipelining results across tasks and jobs, instead of materializing the output of each MapReduce task and job to disk. Consequently, MapReduce Online lets a program return results earlier from a big job.
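To make the difference concrete, here is a minimal sketch in plain Python. It is not the MapReduce Online implementation and does not use any Hadoop API. The function names (`map_words`, `batch_reduce`, `pipelined_reduce`) and the `snapshot_every` parameter are my own assumptions for illustration. It contrasts a reducer that waits for all map output with one that consumes pairs as they arrive and reports running totals early. Fault tolerance is not modeled at all.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def map_words(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) pairs one at a time as a stream."""
    for line in lines:
        for word in line.split():
            yield word, 1


def batch_reduce(pairs: Iterable[Tuple[str, int]]) -> Counter:
    """Batch style: materialize all map output first, then reduce once."""
    materialized = list(pairs)  # stands in for writing map output to disk
    totals: Counter = Counter()
    for word, count in materialized:
        totals[word] += count
    return totals


def pipelined_reduce(
    pairs: Iterable[Tuple[str, int]], snapshot_every: int
) -> Iterator[Counter]:
    """Pipelined style: consume pairs as they arrive and yield running totals."""
    totals: Counter = Counter()
    seen = 0
    for word, count in pairs:
        totals[word] += count
        seen += 1
        if seen % snapshot_every == 0:
            yield Counter(totals)  # early, partial answer
    yield Counter(totals)  # final answer (may repeat the last snapshot)


if __name__ == "__main__":
    lines = ["a b a", "b a c"]
    for snapshot in pipelined_reduce(map_words(lines), snapshot_every=2):
        print(dict(snapshot))
```

The last snapshot equals what `batch_reduce` returns. The earlier ones are the kind of partial answer that pipelining makes available before the whole job finishes.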

You can get further information from MapReduce Online.

BSP Library on Hadoop?

Posted: October 9, 2009 | Author: Hyunsik Choi | Filed under: FOSS, Research | Tags: angrapa, apache, bsp, bulk synchronization parallel, distributed systems, hadoop, hama | 2 Comments

Recently, I started to participate in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been taking the time to develop a bulk synchronous parallel (BSP) library on Hadoop (HAMA-195). I’m getting help from Edward Yoon, a founder of the Hama project. The motivation for the BSP library is clear.

Hadoop is installed at cloud computing service providers and many companies, as you can see at http://wiki.apache.org/hadoop/PoweredBy. However, most of them probably run only MapReduce programs. Although MapReduce is very scalable, it provides only a simple programming model. Many programmers want to use a wider variety of programming models without changing the platform (i.e., Hadoop). The BSP library is a first step toward meeting that need. Like MapReduce, however, BSP is unlikely to be a Swiss Army knife. Once we find appropriate applications, the BSP library on Hadoop will be valued for its scalability and capability.
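For readers who have not met BSP before, here is a small single-process simulation of the model in Python. It is only a sketch under my own assumptions, not the Hama BSP API (which is still being developed), and the function name `bsp_ring_max` is hypothetical. Each peer holds one number, and every superstep consists of sending a message to a neighbor, waiting at a barrier, and then doing local computation on the received messages.

```python
from typing import Dict, List


def bsp_ring_max(values: List[int]) -> List[int]:
    """Simulate BSP supersteps on a ring of peers to find the global maximum."""
    n = len(values)
    state = list(values)  # state[peer] is the largest value that peer has seen
    for _ in range(max(n - 1, 0)):  # n - 1 supersteps suffice on a ring
        # Communication phase: each peer sends its current maximum
        # to the next peer in the ring.
        inbox: Dict[int, List[int]] = {peer: [] for peer in range(n)}
        for peer in range(n):
            inbox[(peer + 1) % n].append(state[peer])
        # Barrier: messages only become visible once every peer has sent.
        # Local computation phase: each peer merges what it received.
        state = [max([state[peer]] + inbox[peer]) for peer in range(n)]
    return state


if __name__ == "__main__":
    print(bsp_ring_max([3, 9, 1, 4]))
```

The point of the sketch is the shape of the loop: unlike a MapReduce job, the peers keep their local state across supersteps and exchange messages directly, which is what makes the model attractive for iterative matrix and graph computations.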

Soon, I’ll post articles about the progress of the BSP library and Angrapa (the graph package on Hama).

HadoopDB: An Open Source Parallel Database for Analytical Workloads

Posted: July 31, 2009 | Author: Hyunsik Choi | Filed under: Research | Tags: database, hadoop, hadoopdb, map-reduce, vldb | 3 Comments

With the ever-growing volume of data, techniques for managing big data are needed in many areas. The open source community and many companies have been trying to develop solutions for dealing with big data.

Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB and the accompanying paper published at VLDB’09. HadoopDB is an open source analytical database being developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel databases that takes the best features from both.

MapReduce has actually been controversial from a database point of view, and there have been some debates. Most notably, Prof. David DeWitt, who is well known as a leading expert on (parallel) databases, argued that MapReduce is a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility in handling unstructured data.

The paper concludes that HadoopDB comes close to the performance of parallel databases while scoring similarly to Hadoop on fault tolerance and the ability to run in heterogeneous environments.

In sum, HadoopDB is a hybrid system of MapReduce and a parallel DBMS. It is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it will contribute more broadly to Hadoop and analytical databases. I have not read the paper completely yet, and I will discuss HadoopDB in detail soon.

Some interesting points:

They carried out experiments on a 100-node Amazon EC2 cluster.
They try to handle Semantic Web data (i.e., RDF) with HadoopDB.
HadoopDB is a full open source project.
HadoopDB isn’t well suited for real-time data yet.
I will be able to attend his presentation in the session at VLDB.

Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database
