<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dive into A Data Deluge &#187; hadoop</title>
	<atom:link href="http://diveintodata.org/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://diveintodata.org</link>
	<description>Discussion about Newly Emerging Issues on Database</description>
	<lastBuildDate>Mon, 06 Sep 2010 12:13:07 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/><atom:link rel="hub" href="http://superfeedr.com/hubbub"/>		<item>
		<title>Attempts to Improve HDFS Scalability (1)</title>
		<link>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/</link>
		<comments>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/#comments</comments>
		<pubDate>Mon, 24 May 2010 05:21:51 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[distributed file systems]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google file system]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[improvement]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scale-out]]></category>
		<category><![CDATA[scale-up]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=761</guid>
		<description><![CDATA[A while ago, the HDFS team at Yahoo! proposed using multiple nodes to improve the horizontal scalability of the HDFS NameNode (HDFS-1052). After that, a Hadoop committer named Dhruba Borthakur proposed ways to improve its vertical scalability (The Curse of Singletons! The Vertical Scalability of Hadoop NameNode, HDFS-1093, HADOOP-6713). According to his LinkedIn profile, Borthakur is a Hadoop engineer currently working at Facebook. Looking at these two proposals, [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" width="200" height="50" /><br />
A while ago, the HDFS team at Yahoo! proposed using multiple nodes to improve the horizontal scalability of the HDFS NameNode (<a href="https://issues.apache.org/jira/browse/HDFS-1052" target="_blank">HDFS-1052</a>). After that, a Hadoop committer named <a href="http://www.linkedin.com/in/dhruba" target="_blank">Dhruba Borthakur</a> proposed ways to improve its vertical scalability (<a href="http://hadoopblog.blogspot.com/2010/04/curse-of-singletons-vertical.html" target="_blank">The Curse of Singletons! The Vertical Scalability of Hadoop NameNode</a>, <a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>). According to his LinkedIn profile, Borthakur is a Hadoop engineer currently working at Facebook.</p>
<p>These two proposals use the terms vertical scalability and horizontal scalability. Vertical scalability is the scalability gained by upgrading the hardware of a single system, mainly its CPU, memory, and hard disks. For a distributed system such as Hadoop, an increase in the number of cores per machine also falls under vertical scalability. Horizontal scalability, in contrast, is the scalability gained by increasing the number of systems, for example when a cluster grows from 10 nodes to 20. The terms scale-up and scale-out are used with the same respective meanings.</p>
<p>Of the two proposals, this post covers the one by Dhruba Borthakur for improving vertical scalability. He points out the following bottlenecks in the Hadoop NameNode (Hadoop 0.21 at the time of writing).</p>
<ul>
<li><strong>Network</strong>: The cluster he uses at Facebook consists of about 2,000 nodes, and each server is configured to run 9 mappers and 6 reducers when MapReduce programs run. When MapReduce jobs run on this cluster, clients send about 30k requests to the NameNode simultaneously. However, because the Listener thread of the Hadoop RPC server is implemented as a singleton and handles every incoming message, considerable latency builds up, and adding CPU cores did not help.</li>
<li><strong>CPU</strong>: Because of the FSNamesystem lock mechanism, the NameNode typically uses only 2 cores even though it runs on an 8-core machine. According to Borthakur, the locking mechanism used by FSNamesystem is too simplistic, and although <a href="https://issues.apache.org/jira/browse/HADOOP-1269" target="_blank">HADOOP-1269</a> improved the situation, there is still room for improvement.</li>
<li><strong>Memory<span style="font-weight: normal;">:</span></strong> Faithful to the Google File System paper, Hadoop's NameNode keeps all metadata in memory. The HDFS of the cluster Borthakur uses holds 60 million files and 80 million blocks, and keeping the metadata for them required as much as 58GB of heap space.</li>
</ul>
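<p>As a rough sanity check on the numbers above, the following sketch multiplies out the task slots and divides the heap by the number of namespace objects. The class and method names are made up for illustration, and two simplifying assumptions are baked in: every task slot is busy at once, and the whole 58GB heap is spent on file and block metadata (with 1GB taken as 2^30 bytes).</p>

```java
// Back-of-the-envelope checks for the figures quoted above.
// NameNodeEstimates is a hypothetical helper, not part of Hadoop.
final class NameNodeEstimates {

    // Concurrent tasks if every node runs all of its map and reduce slots.
    static long concurrentTasks(long nodes, long mappersPerNode, long reducersPerNode) {
        return nodes * (mappersPerNode + reducersPerNode);
    }

    // Average heap bytes per namespace object (files plus blocks),
    // assuming the entire heap holds metadata.
    static long heapBytesPerObject(long heapBytes, long files, long blocks) {
        return heapBytes / (files + blocks);
    }

    public static void main(String[] args) {
        long tasks = concurrentTasks(2000, 9, 6);
        long heap = 58L * 1024 * 1024 * 1024; // 58 GB, assuming 1 GB = 2^30 bytes
        long perObject = heapBytesPerObject(heap, 60_000_000L, 80_000_000L);
        System.out.println("concurrent tasks: " + tasks);
        System.out.println("heap bytes per object: " + perObject);
    }
}
```

<p>Under these assumptions the program reports 30000 concurrent tasks, which matches the roughly 30k simultaneous requests mentioned above, and about 444 bytes of heap per file or block.</p>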
<p>Borthakur proposed the following solutions to these problems.</p>
<ul>
<li><strong>RPC Server</strong>: A pool of Reader threads was attached to the Listener thread, which had been implemented as a singleton. The Listener thread now only accepts connection requests, and one of the Reader threads handles the RPC itself. As a result, more CPU cores can be used when a large number of RPC requests arrive (<a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>).</li>
<li><strong>FSNamesystem lock</strong>: Borthakur gathered statistics on which file operations acquire the lock and found that getting the status of a file or directory and opening a file for reading account for 90% of all lock acquisitions. Because both are read-only operations, he replaced the lock with a read-write lock to improve performance (<a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>). This technique is also explained well in Section 4.1, Namespace Management and Locking, of the GFS paper (<a href="http://labs.google.com/papers/mapreduce.html" target="_blank">The Google File System</a>). GFS already describes acquiring read locks on the ancestor directories in the namespace data structure and a read or write lock on the target directory or file, so that as many operations as possible can access shared data concurrently while consistency is preserved. Much like file append, which was added in Hadoop 0.20, this was probably a part that is described in the paper but had not been implemented. A look at the source code would be needed to know for sure.</li>
</ul>
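<p>The read-write lock idea can be sketched in plain Java with java.util.concurrent.locks.ReentrantReadWriteLock. TinyNamespace and its methods are hypothetical stand-ins chosen for this illustration; they are not the actual FSNamesystem code or the patch in HDFS-1093.</p>

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a namespace guarded by one global lock.
final class TinyNamespace {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long fileCount = 0;

    // Read-only operation: many threads may hold the read lock at once.
    long getFileCount() {
        lock.readLock().lock();
        try {
            return fileCount;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Mutating operation: the write lock excludes readers and other writers.
    void createFile() {
        lock.writeLock().lock();
        try {
            fileCount++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TinyNamespace ns = new TinyNamespace();
        Runnable writer = () -> {
            for (int i = 0; i < 1000; i++) {
                ns.createFile();
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(ns.getFileCount());
    }
}
```

<p>Read-only calls such as getFileCount() can proceed in parallel, while createFile() still gets exclusive access, so the two writer threads in the demo end with a count of 2000.</p>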
<p>The memory problem, however, has not been solved yet. Even so, according to Borthakur, fixing the other two problems alone improved scalability by as much as 8 times.</p>
<p>Attempts to improve HDFS scalability have caught my eye lately and looked interesting, so a month ago I decided to &#8216;summarize them on the blog when I have some spare time&#8217;, and I have only now finished the first one. I have been short on time, so this post took about three weeks from start to finish. In the meantime, <em>;login:</em>, the <em>USENIX</em> magazine, published an article on Hadoop NameNode scalability, <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">HDFS Scalability: The Limits to Growth</a>. The Yahoo! Developer Network blog also published a post introducing the article (<a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">Scalability of the Hadoop Distributed File System</a>). I will cover the rest as time permits.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Implementing a Server and Client with Hadoop RPC</title>
		<link>http://diveintodata.org/2010/04/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/</link>
		<comments>http://diveintodata.org/2010/04/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 12:04:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[rpc]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=659</guid>
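		<description><![CDATA[Hadoop is by now a very well-known distributed computing framework. Many people think first of MapReduce programming when they hear Hadoop, but Hadoop RPC and the distributed file system HDFS, both of which Hadoop provides itself, also look like they can be used for interesting things. This post introduces how to implement a simple server and client program using Hadoop RPC. Hadoop RPC Concept Hadoop RPC generally works with one protocol interface, one Server [...]]]></description>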
			<content:encoded><![CDATA[<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="Apache Hadoop" width="200" height="50" /></p>
<p><a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> is by now a very well-known distributed computing framework. Many people think first of <a href="http://labs.google.com/papers/mapreduce.html" target="_blank">MapReduce</a> programming when they hear Hadoop, but Hadoop RPC and the distributed file system HDFS, both of which Hadoop provides itself, also look like they can be used for interesting things. This post introduces how to implement a simple server and client program using Hadoop RPC.</p>
<h3><strong>Hadoop RPC Concept</strong></h3>
<p>Hadoop RPC generally works with one protocol interface, one server, and one or more clients. Instances of the Hadoop RPC server and of the client proxy are obtained through the class org.apache.hadoop.ipc.RPC, which is implemented internally with Java reflection. The parameters and return values of an RPC method can only be Java primitive types (e.g., int and long), String, or concrete classes that implement the Writable interface. Hadoop RPC also provides the basic server and client functionality itself, so there is no need to implement threads or socket communication by hand; the developer only has to define the RPC protocol interface and fill in the RPC methods.</p>
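<p>To illustrate the Writable requirement, here is a minimal sketch of a value class with the two methods that org.apache.hadoop.io.Writable declares, write(DataOutput) and readFields(DataInput). HeartBeatReply, its fields, and the roundTrip helper are invented for this example, and the class deliberately does not import Hadoop so that it compiles on its own; in a real project it would declare implements Writable.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical value type with the same method shapes as Hadoop's Writable.
final class HeartBeatReply {
    private String host = "";
    private long timestamp;

    // No-arg constructor so an empty instance can be created before readFields().
    HeartBeatReply() { }

    HeartBeatReply(String host, long timestamp) {
        this.host = host;
        this.timestamp = timestamp;
    }

    // Serializes the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(host);
        out.writeLong(timestamp);
    }

    // Reads the fields back in the same order they were written.
    public void readFields(DataInput in) throws IOException {
        host = in.readUTF();
        timestamp = in.readLong();
    }

    String getHost() { return host; }

    long getTimestamp() { return timestamp; }

    // Serializes to a byte array and back, as a wire transfer would.
    static HeartBeatReply roundTrip(HeartBeatReply original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            HeartBeatReply copy = new HeartBeatReply();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HeartBeatReply copy = roundTrip(new HeartBeatReply("node-1", 42L));
        System.out.println(copy.getHost() + " " + copy.getTimestamp());
    }
}
```

<p>The important design point is that write() and readFields() handle the fields in exactly the same order, since the byte stream itself carries no field names.</p>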
<h3>Implementation of RPC Protocol</h3>
<p>An RPC protocol must be defined as an interface, and that interface must extend org.apache.hadoop.ipc.VersionedProtocol. VersionedProtocol declares the getProtocolVersion() method, which prevents a server and a client from communicating with different versions of a protocol when several versions exist.</p>
<p>An RPC protocol can be written as simply as the following example, which is an RPC protocol interface with a single RPC method, heartBeat(), that returns a String.</p>
<pre class="brush: java;">
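import java.io.IOException;
import org.apache.hadoop.ipc.VersionedProtocol;

public interface RPCProtocol extends VersionedProtocol {
  // Protocol version that the client passes when it asks for a proxy
  public long versionID=0;
  // Sample RPC method that returns a String
  public String heartBeat() throws IOException;
}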
</pre>
<h3>Implementation of RPC Server</h3>
<p>Next, implement a concrete class that acts as the server for the RPC protocol described above. The server class simply implements the RPCProtocol interface defined earlier (see the example below).</p>
<pre class="brush: java;">
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.RPC.Server;

public class TestServer implements RPCProtocol {

  @Override
  public String heartBeat() throws IOException {
    return &quot;Hello&quot;;
  }

  @Override
  public long getProtocolVersion(String arg0, long arg1) throws IOException {
    return 0;
  }

  /**
   * @param args
   * @throws IOException
   * @throws InterruptedException
   */
  public static void main(String[] args) throws IOException, InterruptedException {
    TestServer s = new TestServer();
    Configuration conf = new Configuration();
    Server server = RPC.getServer(s, &quot;localhost&quot;, 10000, conf);
    server.start();
    server.join();
  }
}
</pre>
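<p>The String heartBeat() method defined in the RPCProtocol interface is implemented here as well. The return value &#8220;Hello&#8221; is delivered to the RPC client that made the call.</p>
<p>The server is started in the main method. First, an instance of the concrete protocol class (TestServer) is created and passed as an argument to RPC.getServer(). The getServer method additionally takes the IP address and port number the server will bind to, and returns an instance of the Server class. (Internally, it creates a Listener thread for the TestServer instance and binds it to the IP address and port passed as parameters. Whenever an RPC call arrives, the corresponding method of TestServer is invoked, and the result is returned through the Responder thread.)</p>
<p>The signature of the RPC.getServer method is as follows.</p>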
<table border="1" cellspacing="0" cellpadding="3" width="100%">
<tbody>
<tr bgcolor="white">
<td width="1%" align="right" valign="top"><code><span style="color: #000000;">static RPC.Server</span></code></td>
<td><code><strong><span style="color: #000000;">getServer</span></strong><span style="color: #000000;">(Object instance, String bindAddress, int port, Configuration conf)</span></code><span style="color: #000000;"><br />
</span></td>
</tr>
</tbody>
</table>
<h3>Implementation of RPC Client</h3>
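<p>A client proxy is obtained simply through the RPC.waitForProxy method. Using the returned proxy instance, the client can easily call RPC methods and receive responses from the server. The signature of the related RPC.getProxy method is shown below.</p>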
<table border="1" cellspacing="0" cellpadding="3" width="100%">
<tbody>
<tr bgcolor="white">
<td width="1%" align="right" valign="top"><code><span style="color: #000000;">static VersionedProtocol</span></code></td>
<td><code><strong><span style="color: #000000;">getProxy</span></strong><span style="color: #000000;">(Class&lt;?&gt; protocol, long clientVersion, InetSocketAddress addr, UserGroupInformation ticket,Configuration conf, SocketFactory factory)</span></code></td>
</tr>
</tbody>
</table>
<pre class="brush: plain;">
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class TestClient {

  /**
   * @param args
   * @throws IOException
   * @throws InterruptedException
   */
  public static void main(String[] args) throws IOException, InterruptedException {
    Configuration conf = new Configuration();
    InetSocketAddress addr = new InetSocketAddress(&quot;localhost&quot;, 10000);
    RPCProtocol rpc = (RPCProtocol) RPC.waitForProxy(RPCProtocol.class,
        RPCProtocol.versionID, addr, conf);

    String msg = null;
    while(true) {
      Thread.sleep(1000);
      msg = rpc.heartBeat();
      System.out.println(msg);
    }
  }
}
</pre>
<p>The example above shows how to call rpc.heartBeat() through the proxy instance variable rpc and obtain the result from the server.</p>
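<p>Since Hadoop RPC is built on Java reflection, it may help to see how a dynamic proxy can stand in for an interface. The following is a local sketch using java.lang.reflect.Proxy; HeartBeatProtocol and LocalProxyDemo are hypothetical names, and the handler simply calls a local object at the point where a real RPC client would send the call over the network. It is not Hadoop's actual implementation.</p>

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;

// Hypothetical protocol, mirroring the heartBeat() method used in this post.
interface HeartBeatProtocol {
    String heartBeat();
}

final class LocalProxyDemo {

    // Builds a proxy whose method calls are forwarded, via reflection, to 'target'.
    // A real RPC client would serialize the call and send it over a socket
    // here instead of invoking a local object.
    static HeartBeatProtocol proxyFor(HeartBeatProtocol target) {
        InvocationHandler handler =
                (proxy, method, methodArgs) -> method.invoke(target, methodArgs);
        return (HeartBeatProtocol) Proxy.newProxyInstance(
                HeartBeatProtocol.class.getClassLoader(),
                new Class[] { HeartBeatProtocol.class },
                handler);
    }

    public static void main(String[] args) {
        HeartBeatProtocol server = () -> "Hello";
        HeartBeatProtocol rpc = proxyFor(server);
        System.out.println(rpc.heartBeat());
    }
}
```

<p>The caller only ever sees the protocol interface, which is why the client code above can treat the proxy like an ordinary local object.</p>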
<h3>Test</h3>
<p>Start the server first and then the client. In fact, starting them in the opposite order is not a big problem: if the Hadoop RPC client starts first, it retries the connection once per second until it reaches the RPC server.</p>
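<p>The retry behaviour can be pictured with a small loop like the one below. This is only an illustration under assumed names (RetryUntilConnected, waitFor); the connection attempt is simulated with a BooleanSupplier, and Hadoop's own retry logic is not reproduced here.</p>

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper showing a "retry at a fixed interval until the server
// answers" loop; it is not Hadoop's implementation.
final class RetryUntilConnected {

    // Calls tryConnect until it returns true, sleeping between attempts.
    // Returns the number of attempts that were made.
    static int waitFor(BooleanSupplier tryConnect, long sleepMillis) {
        int attempts = 0;
        while (true) {
            attempts++;
            if (tryConnect.getAsBoolean()) {
                return attempts;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated server that becomes reachable on the third attempt.
        int[] calls = {0};
        int attempts = waitFor(() -> ++calls[0] >= 3, 10L);
        System.out.println("connected after " + attempts + " attempts");
    }
}
```

<p>The demo uses a 10 ms interval to finish quickly; passing 1000 would give the one-second spacing described above.</p>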
<p>When everything runs normally, you will see output like the following.</p>
<pre>Hello
Hello
Hello
Hello
Hello
...</pre>
<h3>References</h3>
<ul>
<li><a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.html">http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.html</a></li>
<li><a href="http://www.supermind.org/blog/520">http://www.supermind.org/blog/520</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/04/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Data-Intensive Text Processing with MapReduce: Draft Available Online</title>
		<link>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/</link>
		<comments>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 01:46:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data intensive]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[text processing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=605</guid>
		<description><![CDATA[Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although it is still a draft, it seems well organized and very interesting. In addition, the book contains some basic graph algorithms [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.umiacs.umd.edu/~jimmylin/book.html" target="_blank">Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer</a></p>
<p>Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although it is still a draft, it seems well organized and very interesting. In addition, the book contains some basic graph algorithms using MapReduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Create A Table in HBase for Beginners</title>
		<link>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/</link>
		<comments>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 02:33:36 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[create table]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=527</guid>
		<description><![CDATA[I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase through several projects. From now on, I&#8217;ll post HBase know-how periodically. Today, I&#8217;m going to introduce a way to create an HBase table in Java. HBase provides two ways for an HBase client to connect to the HBase master. [...]]]></description>
			<content:encoded><![CDATA[<p>I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase since I participated in some projects. From hence, I&#8217;ll post the know-how of HBase by period. Today, I&#8217;m going to introduce a way to make a hbase table in java.</p>
<p>HBase provides two ways to allow a Hbase client to connect HBase master. One is to use a instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html" target="_blank">HBaseAdmin</a> class. HBaseAdmin provides some methods for creating, modifying, and deleting tables and column families. Another way is to use an instance of HTable class. This class almost provides some methods to manipulate data like inserting, modifying, and deleting rows and cells.</p>
<p>Thus, in order to make a hbase table, we need to connect a HBase master by initializing a instance of HBaseAdmin like line 4. HBaseAdmin requires an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HBaseConfiguration.html" target="_blank">HBaseConfiguration</a>. If necessary, you may set some configurations like line 2.</p>
<p>In order to describe HBase schema,  we make an instances of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HColumnDescriptor.html" target="_blank">HColumnDescriptor</a> for each column family. In addition to column family names, HColumnDescriptor enables you to set various parameters, such as maxVersions, compression type, timeToLive, and bloomFilter. Then, we can create a HBase table by invoking createTable like line 10.</p>
<pre class="brush: java;">
HBaseConfiguration conf = new HBaseConfiguration();
conf.set(&quot;hbase.master&quot;,&quot;localhost:60000&quot;);

HBaseAdmin hbase = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor(&quot;TEST&quot;);
HColumnDescriptor meta = new HColumnDescriptor(&quot;personal&quot;.getBytes());
HColumnDescriptor prefix = new HColumnDescriptor(&quot;account&quot;.getBytes());
desc.addFamily(meta);
desc.addFamily(prefix);
hbase.createTable(desc);
</pre>
<p>Finally, you can check your HBase table with the following commands.</p>
<pre class="brush: bash;">
c0d3h4ck@code:~/Development/hbase$ bin/hbase shell
HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
Version: 0.20.1, r822817, Wed Oct  7 11:55:42 PDT 2009
hbase(main):001:0&gt; list
TEST

1 row(s) in 0.0940 seconds
</pre>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>MapReduce Online Comes Out!</title>
		<link>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/</link>
		<comments>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 15:49:37 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[online aggregation]]></category>
		<category><![CDATA[stream queries]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=461</guid>
		<description><![CDATA[MapReduce has been gaining much attention in the data-intensive computing field. As you know, it is well known as a very popular framework for batch processing. Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, released MapReduce Online. I heard this news today from Data Beta. It is amazing work, since the [...]]]></description>
			<content:encoded><![CDATA[<p>MapReduce has been gaining much attention in data intensive computing field. As you know, it is well known as a very popular framework for batch-processing.</p>
<p>Recently, however, Tyson Condie who is a Ph.D student in UC Berkeley accomplishes <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>. Today, I heard this news from <a href="http://databeta.wordpress.com/2009/10/18/mapreduce-online/" target="_self">Data Beta</a>. Actually, It is amazing works since the original MapReduce is specialized and designed for only batch-processing. In addition, most people believe that MapReduce will remain a batch-processing.</p>
<p>The essential of MapReduce online is that it tries to hold the fault-tolerance model of the <a href="http://labs.google.com/papers/mapreduce.html" target="_self">original MapReduce</a>, whereas it provides the the pipelining of results across tasks and jobs instead of materializing the output of each MapReduce task and job into disk. Consequently, MapReduce online enables the program to return the result earlier from a big job.</p>
<p>You can get further information from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>BSP Library on Hadoop?</title>
		<link>http://diveintodata.org/2009/10/bsp-library-on-hadoop/</link>
		<comments>http://diveintodata.org/2009/10/bsp-library-on-hadoop/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 11:45:33 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[angrapa]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[bsp]]></category>
		<category><![CDATA[bulk synchronization parallel]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hama]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=443</guid>
		<description><![CDATA[Recently, I started to participate in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been taking the time to develop a bulk synchronous parallel (BSP) library on Hadoop (HAMA-195), with help from Edward Yoon, a founder of the Hama project. The motivation for the BSP library is [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I started to participate in the <a href="http://incubator.apache.org/hama/" target="_self">Hama project</a> (a distributed scientific package on Hadoop for massive matrix and graph data), and I have taken the times to develop the <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel" target="_self">bulk synchronization parallel</a> (BSP) library on Hadoop (<a href="https://issues.apache.org/jira/browse/HAMA-195" target="_self">HAMA-195</a>); I&#8217;m getting help from <a href="http://blog.udanax.org/" target="_self">Edword Yoon</a>, a founder of Hama project. The motivation of BSP lib is definitely clear.</p>
<p>The hadoop platforms are installed in cloud computing service providers and many companies as you can see in <a href="http://wiki.apache.org/hadoop/PoweredBy" target="_self">http://wiki.apache.org/hadoop/PoweredBy</a>. However, most of them may use only MapReduce programs. As you know although MapReduce is very scalability, but it provides only the simple programming model. Many programmers want to use more various programming model without changing the platform (i.e., <a href="http://hadoop.apache.org" target="_self">Hadoop</a>). This BSP lib will be the beginning for their desires. However, like MapReduce, BSP may also be not swiss army knife. When we find appropriate applications, BSP lib on Hadoop will be valued for its scalability and ability.</p>
<p>Sooner, I&#8217;ll post articles about the progress of BSP library and <a href="http://wiki.apache.org/hama/GraphPackage" target="_self">Angrapa</a> (the graph package on Hama).</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/bsp-library-on-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>HadoopDB: An Open Source Parallel Database for Analytical Workloads</title>
		<link>http://diveintodata.org/2009/07/hadoopdb-releases/</link>
		<comments>http://diveintodata.org/2009/07/hadoopdb-releases/#comments</comments>
		<pubDate>Thu, 30 Jul 2009 15:01:15 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopdb]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[vldb]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=155</guid>
		<description><![CDATA[With the ever-growing volume of data, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with big data. Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the HadoopDB release and the paper published [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">W</span>ith the increasingly growing volume of data, the techniques to manage big data are needed in many areas. Open source community and many companies have attempted developing solutions to deal with big data.</p>
<p>Recently, <a href="http://cs-www.cs.yale.edu/homes/dna/" target="_blank">Prof. Daniel Abadi</a>, who is an Assistant Professor of Computer Science at Yale University, announced <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html" target="_blank">HadoopDB release and the paper</a> published in <a href="http://vldb2009.org/" target="_blank">VLDB’09</a>. HadoopDB is an open source analytical database, being developed by him and his students. The paper states that HadoopDB is a hybrid of both MapReduce and parallel  database and it takes the best features from both.</p>
<p><img style="display: inline; margin-left: 0px; margin-right: 0px" title="Hadoop Logo" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="Hadoop Logo" width="198" height="47" align="right" />Actually, MapReduce has made controversial issues from a database point of view. Formerly, there was some debates. Representatively, <a href="http://pages.cs.wisc.edu/~dewitt/" target="_blank">Prof. David Dewitt</a>, who is well known as a great master of (parallel) database, critiqued that <a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html" target="_blank">MapReduce is a major step backwards</a>. On the other hand, proponents of MapReduce argue that MapReduce outperforms parallel database in respect of scalability, fault tolerance, and flexibility to unstructured data.</p>
<p>This paper concludes that HadoopDB is close to the performance of parallel databases while it is similar score on fault tolerance and feasibility in heterogeneous systems as Hadoop.</p>
<p>In sum, HadoopDB is a hybrid system of MapReduce and parallel DBMS. It is quite interesting achievement. I respect their decision to release HadoopDB as open source because their achievement will more broadly contribute to Hadoop and data analytical database. Still, I do not read this paper completely, and sooner I will discuss HadoopDB in detail.</p>
<h3>Some interesting points:</h3>
<ul>
<li>They carried out experiments on a 100-node Amazon EC2 cluster.</li>
<li>They try to handle semantic web data (i.e., RDF) with HadoopDB.</li>
<li>HadoopDB is a full open source project.</li>
<li>HadoopDB isn’t well suited for real-time data yet.</li>
<li>I can attend his presentation in the session at VLDB.</li>
</ul>
<h3>See Also:</h3>
<ul>
<li><a href="http://news.idg.no/cw/art.cfm?id=9D2C109A-1A64-6A71-CE90BD44D98F12B1" target="_blank">Yale researchers create database-Hadoop hybrid</a>, Computer World</li>
<li><a href="http://radar.oreilly.com/2009/07/hadoopdb-an-open-source-parallel-database.html" target="_blank">HadoopDB: An Open Source Parallel Database</a>, <a href="http://radar.oreilly.com/" target="_blank">O’REILLY radar</a></li>
<li><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html" target="_blank">MapReduce: A major step backwards</a></li>
<li><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-continued.html" target="_blank">MapReduce: A major step backwards (II)</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/07/hadoopdb-releases/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Hadoop: The Definitive Guide</title>
		<link>http://diveintodata.org/2009/06/hadoop-the-definitive-guide/</link>
		<comments>http://diveintodata.org/2009/06/hadoop-the-definitive-guide/#comments</comments>
		<pubDate>Tue, 09 Jun 2009 01:33:57 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[Pig]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://diveintodata.org/2009/06/hadoop-the-definitive-guide/</guid>
		<description><![CDATA[It looks like the book has been released by O&#8217;REILLY. http://oreilly.com/catalog/9780596521974/ It is said to cover the following topics. Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce Become familiar with Hadoop&#8217;s data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced [...]]]></description>
			<content:encoded><![CDATA[<p>It looks like the book has been released by O&#8217;REILLY. <a title="Go to [http://oreilly.com/catalog/9780596521974/]." target="_blank" href="http://oreilly.com/catalog/9780596521974/">http://oreilly.com/catalog/9780596521974/</a><br />
It is said to cover the following topics. </p>
<ul>
<li>Use the Hadoop Distributed File System (HDFS) for storing large<br />
datasets, and run distributed computations over those datasets using<br />
MapReduce</li>
<li>Become familiar with Hadoop&#8217;s data and I/O building blocks for compression, data integrity, serialization, and persistence</li>
<li>Discover common pitfalls and advanced features for writing real-world MapReduce programs</li>
<li>Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud</li>
<li>Use Pig, a high-level query language for large-scale data processing</li>
<li>Take advantage of HBase, Hadoop&#8217;s database for structured and semi-structured data</li>
<li>Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems</li>
</ul>
<p>
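I have already analyzed the source code and written Hadoop-based applications, but I am thinking of buying the book because I want to understand Hadoop more systematically.<br />
Then again, I am so busy that I am not sure I will find time to read it ~(~_~)~</p>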
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/06/hadoop-the-definitive-guide/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Adding new data nodes to Hadoop without rebooting</title>
		<link>http://diveintodata.org/2008/10/adding-new-data-nodes-to-hadoop-without-rebooting/</link>
		<comments>http://diveintodata.org/2008/10/adding-new-data-nodes-to-hadoop-without-rebooting/#comments</comments>
		<pubDate>Thu, 23 Oct 2008 09:16:26 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[hadoop]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=67</guid>
		<description><![CDATA[I had long wondered how to add new data nodes (or recovered nodes) to Hadoop without rebooting the cluster. Recently, I found a solution on the hadoop core-user mailing list. The procedure is simple: 1. configure conf/slaves and *.xml files on master machine 2. configure conf/master and *.xml files on slave machine 3. run ${HADOOP}/bin/hadoop [...]]]></description>
			<content:encoded><![CDATA[<p>I had long wondered how to add new data nodes (or recovered nodes) to Hadoop without rebooting the cluster. Recently, I found a solution on the hadoop core-user mailing list.</p>
<p>The procedure is simple:</p>
<blockquote><p>1. configure conf/slaves and *.xml files on master machine<br />
2. configure conf/master and *.xml files on slave machine<br />
3. run ${HADOOP}/bin/hadoop datanode</p></blockquote>
<p>If you have to add more than one data node to Hadoop, run the following command on the master machine instead of the third command above:</p>
<blockquote><p>${HADOOP}/bin/start-all.sh</p></blockquote>
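<p>The first and third steps above can be sketched as two small shell helpers. This is only a minimal sketch under stated assumptions: add_slave and datanode_cmd are hypothetical names that are not part of Hadoop, and the hostname in the usage note is made up. The first helper registers a host in a slaves file without creating duplicate entries; the second only prints the step 3 command so that it can be reviewed before it is run on the new node.</p>

```shell
#!/bin/sh
# Sketch only: add_slave and datanode_cmd are hypothetical helpers,
# not part of the Hadoop distribution.

# Step 1 (on the master): append a hostname to a slaves file
# unless that exact hostname is already listed.
add_slave() {
    slaves_file="$1"
    host="$2"
    # -x matches the whole line, -F treats the hostname as a literal string.
    if ! grep -qxF "$host" "$slaves_file" 2>/dev/null; then
        printf '%s\n' "$host" >> "$slaves_file"
    fi
}

# Step 3 (on the new slave): print the data node command for a given
# Hadoop install directory instead of running it directly.
datanode_cmd() {
    printf '%s/bin/hadoop datanode\n' "$1"
}
```

<p>For example, add_slave "${HADOOP}/conf/slaves" node05.example.com could be run on the master, and the line printed by datanode_cmd "${HADOOP}" could then be run on the new node. The *.xml configuration in steps 1 and 2 still has to be done by hand.</p>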
<p>Additionally, adding a region server to an HBase master without restarting the whole cluster works in much the same way:</p>
<blockquote><p>1. configure conf/regionservers and *.xml files on master machine<br />
2. configure conf/*.xml files on slave machine<br />
3. run ${HBASE}/bin/hbase regionserver start</p></blockquote>
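<p>A pre-flight check for the HBase steps might look like the sketch below. It assumes it is run against the master's HBase directory, where conf/regionservers was edited in step 1. regionserver_cmd is a hypothetical helper, not an HBase command: it prints the step 3 command only when the given host is already listed, and otherwise reports the missing entry on standard error and returns a non-zero status.</p>

```shell
#!/bin/sh
# Sketch only: regionserver_cmd is a hypothetical helper, not part of HBase.

# Print the step 3 command if the host is listed in conf/regionservers
# under the given HBase directory; otherwise complain and return 1.
regionserver_cmd() {
    hbase_home="$1"
    host="$2"
    # -x matches the whole line, -F treats the hostname as a literal string.
    if grep -qxF "$host" "$hbase_home/conf/regionservers" 2>/dev/null; then
        printf '%s/bin/hbase regionserver start\n' "$hbase_home"
    else
        printf 'host %s is not listed in %s/conf/regionservers\n' \
            "$host" "$hbase_home" >&2
        return 1
    fi
}
```

<p>The printed command is the one to run on the new slave machine once its conf/*.xml files are in place.</p>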
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2008/10/adding-new-data-nodes-to-hadoop-without-rebooting/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Three nice articles that address Very Large Data Base</title>
		<link>http://diveintodata.org/2008/09/three-nice-articles-that-address-very-large-data-base/</link>
		<comments>http://diveintodata.org/2008/09/three-nice-articles-that-address-very-large-data-base/#comments</comments>
		<pubDate>Thu, 25 Sep 2008 23:23:38 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[bigTable]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[vldb]]></category>

		<guid isPermaLink="false">http://diveintodata.org/2008/09/three-nice-articles-that-address-very-large-data-base/</guid>
		<description><![CDATA[Big Data: The future of biocuration, Nature; Greenplum MapReduce for the Petabyte Database; Aster nCluster: In-Database MapReduce]]></description>
			<content:encoded><![CDATA[<ul style="list-style-type: disc;">
<li><a title="Go to [http://www.nature.com/nature/journal/v455/n7209/full/455047a.html]." target="_blank" href="http://www.nature.com/nature/journal/v455/n7209/full/455047a.html">Big Data: The future of biocuration, Nature</a>
</li>
<li><a title="Go to [http://www.greenplum.com/resources/mapreduce/]." target="_blank" href="http://www.greenplum.com/resources/mapreduce/">Greenplum MapReduce for the Petabyte Database</a></li>
<li><a title="Go to [http://www.asterdata.com/product/mapreduce.html]." target="_blank" href="http://www.asterdata.com/product/mapreduce.html">Aster nCluster: In-Database MapReduce</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2008/09/three-nice-articles-that-address-very-large-data-base/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
