BSP Library on Hadoop?

Recently, I started to participate in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have taken the times to develop the bulk synchronization parallel (BSP) library on Hadoop (HAMA-195); I’m getting help from Edword Yoon, a founder of Hama project. The motivation of BSP lib is definitely clear.

The hadoop platforms are installed in cloud computing service providers and many companies as you can see in However, most of them may use only MapReduce programs. As you know although MapReduce is very scalability, but it provides only the simple programming model. Many programmers want to use more various programming model without changing the platform (i.e., Hadoop). This BSP lib will be the beginning for their desires. However, like MapReduce, BSP may also be not swiss army knife. When we find appropriate applications, BSP lib on Hadoop will be valued for its scalability and ability.

Sooner, I’ll post articles about the progress of BSP library and Angrapa (the graph package on Hama).


HadoopDB: An Open Source Parallel Database for Analytical Workloads

With the increasingly growing volume of data, the techniques to manage big data are needed in many areas. Open source community and many companies have attempted developing solutions to deal with big data.

Recently, Prof. Daniel Abadi, who is an Assistant Professor of Computer Science at Yale University, announced HadoopDB release and the paper published in VLDB’09. HadoopDB is an open source analytical database, being developed by him and his students. The paper states that HadoopDB is a hybrid of both MapReduce and parallel  database and it takes the best features from both.

Hadoop LogoActually, MapReduce has made controversial issues from a database point of view. Formerly, there was some debates. Representatively, Prof. David Dewitt, who is well known as a great master of (parallel) database, critiqued that MapReduce is a major step backwards. On the other hand, proponents of MapReduce argue that MapReduce outperforms parallel database in respect of scalability, fault tolerance, and flexibility to unstructured data.

This paper concludes that HadoopDB is close to the performance of parallel databases while it is similar score on fault tolerance and feasibility in heterogeneous systems as Hadoop.

In sum, HadoopDB is a hybrid system of MapReduce and parallel DBMS. It is quite interesting achievement. I respect their decision to release HadoopDB as open source because their achievement will more broadly contribute to Hadoop and data analytical database. Still, I do not read this paper completely, and sooner I will discuss HadoopDB in detail.

Some interesting points:

  • They carried out experiments on a 100 node of amazon EC2 cluster.
  • They try to deal with semantic web data (i.e., RDF) by HadoopDB.
  • HadoopDB is a full open source project.
  • HadoopDB isn’t well suited for real-time data yet.
  • I can participate in his presentation at the session at VLDB.

See Also:

Hadoop: The Definitive Guide

O’REILLY 에서 책이 출시 된 것 같네요.
다음과 같은 내용을 다루고 있다고 합니다.

  • Use the Hadoop Distributed File System (HDFS) for storing large
    datasets, and run distributed computations over those datasets using
  • Become familiar with Hadoop’s data and I/O building blocks for compression, data integrity, serialization, and persistence
  • Discover common pitfalls and advanced features for writing real-world MapReduce programs
  • Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
  • Use Pig, a high-level query language for large-scale data processing
  • Take advantage of HBase, Hadoop’s database for structured and semi-structured data
  • Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems

그동안 소스도 분석해 보고 Hadoop 기반 어플리케이션도 짜보고 했지만 좀 더 체계적으로 알고 싶은 마음에 질러볼까합니다.
그런데 바빠서 볼 수 있을지 ~(~_~)~

Adding new data nodes to Hadoop without rebooting

Usually, I have been wonder how to new data nodes (or recovered nodes) to Hadoop without rebooting. Recently, I found the solution from hadoop core-user mailing list.

The way is very simple as follows:

1. configure conf/slaves and *.xml files on master machine
2. configure conf/master and *.xml files on slave machine
3. run ${HADOOP}/bin/hadoop datanode

If you have to add more than one data node to Hadoop, run the following command (instead of the third command above) on master machine.


Additionally, the way to add a region server to Hbase master without restarting all is similar to that of Hadoop.

1. configure conf/regionservers and *.xml files on master machine
2. configure conf/*.xml files on slave machine
3. run ${HBASE}/bin/hbase regionserver start

Three nice articles that address Very Large Data Base