<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dive into A Data Deluge &#187; Research</title>
	<atom:link href="http://diveintodata.org/category/research/feed/" rel="self" type="application/rss+xml" />
	<link>http://diveintodata.org</link>
	<description>Discussion about Newly Emerging Issues on Database</description>
	<lastBuildDate>Tue, 01 Jun 2010 08:15:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
	<atom:link rel="hub" href="http://pubsubhubbub.appspot.com"/>
	<atom:link rel="hub" href="http://superfeedr.com/hubbub"/>
		<item>
		<title>VoltDB and its related links</title>
		<link>http://diveintodata.org/2010/06/voltdb-and-its-related-links/</link>
		<comments>http://diveintodata.org/2010/06/voltdb-and-its-related-links/#comments</comments>
		<pubDate>Tue, 01 Jun 2010 05:26:55 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[ACID]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[shared-nothing architecture]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[VoltDB]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=842</guid>
		<description><![CDATA[VoltDB (known in academia as H-Store&#160;[5]) has generated a lot of buzz over the past week. M. Stonebraker leads VoltDB, an open source OLTP DBMS. Some of its notable features: runs on shared-nothing clusters of commodity hardware; in-memory database; SQL support; ACID; linear scalability; released as open source software [...]]]></description>
			<content:encoded><![CDATA[<p>VoltDB (known in academia as H-Store&nbsp;<a href="#ref-5">[5]</a>) has generated a lot of buzz over the past week. M. Stonebraker leads VoltDB, an open source OLTP DBMS. Some of its notable features:</p>
<ul>
<li>Running on shared-nothing clusters of commodity hardware</li>
<li>In-memory database</li>
<li>SQL support</li>
<li>ACID</li>
<li>Linear scalability</li>
<li>Released as open source software</li>
</ul>
<p>OLTP databases that run on shared-nothing clusters already exist. However, they cannot fully exploit the scalability of a shared-nothing architecture because of how they are implemented, for example their complex&nbsp;distributed locking and commit protocols <a href="#ref-1">[1]</a>. In addition, according to&nbsp;<a href="#ref-3">[3]</a>, traditional RDBMSs carry four sources of overhead: logging, locking, latching, and buffer management. M. Stonebraker claims that VoltDB eliminates these legacy overheads.</p>
<p>Of its many features, I am most interested in its combination of linear scalability, ACID guarantees, and performance. This matters because it gives today&#8217;s web applications an alternative to NoSQL data stores. Although VoltDB is still under heavy development, these features and the following benchmark result look promising.</p>
<ul>
<li><a href="https://voltdb.com/blog/key-value-benchmarking">Key-Value Benchmark</a> (VoltDB versus Cassandra)</li>
</ul>
<p><a href="http://cassandra.apache.org/" target="_blank">Cassandra</a> is a remarkable&nbsp;key-value store, developed as an open source project by Apache committers. It is now widely regarded as the best-performing of the existing NoSQL stores.&nbsp;According to this benchmark, however,&nbsp;VoltDB outperforms Cassandra in every case, although the fairness of the experiments is disputed.</p>
<ul>
<li><a href="http://community.voltdb.com/roadmap" target="_blank">VoltDB Roadmap</a></li>
</ul>
<p>I am also looking forward to its future plans. I wonder how much attention VoltDB will get from the community and from industry.</p>
<h4>See Also:</h4>
<ol>
<li><a id="ref-1" href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-cloud-ieee09.pdf" target="_blank">Data Management in the Cloud: Limitations and Opportunities</a></li>
<li><a href="http://pgsnake.blogspot.com/2010/05/comparing-voltdb-to-postgres.html" target="_blank">Comparing VoltDB vs PostgreSQL</a></li>
<li><a id="ref-3" href="http://cs-www.cs.yale.edu/homes/dna/papers/oltpperf-sigmod08.pdf" target="_blank">OLTP Through the Looking Glass, and What We Found There, ACM SIGMOD 2008</a></li>
<li><a href="http://voltdb.com/product">http://voltdb.com/product</a></li>
<li><a id="ref-5" href="http://db.cs.yale.edu/hstore/" target="_blank">H-Store: A Next Generation OLTP DBMS</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/06/voltdb-and-its-related-links/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Attempts to Improve HDFS Scalability (1)</title>
		<link>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/</link>
		<comments>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/#comments</comments>
		<pubDate>Mon, 24 May 2010 05:21:51 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[distributed file systems]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google file system]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[improvement]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scale-out]]></category>
		<category><![CDATA[scale-up]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=761</guid>
		<description><![CDATA[Not long ago, the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS NameNode by using multiple nodes (HDFS-1052). After that, a Hadoop committer named Dhruba Borthakur proposed a way to improve its vertical scalability (The Curse of Singletons! The Vertical Scalability of Hadoop NameNode, HDFS-1093, HADOOP-6713). According to his LinkedIn profile, Borthakur is a Hadoop engineer currently working at Facebook. Both proposals [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" width="200" height="50" /><br />
Not long ago, the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS NameNode by using multiple nodes (<a href="https://issues.apache.org/jira/browse/HDFS-1052" target="_blank">HDFS-1052</a>). After that, a Hadoop committer named <a href="http://www.linkedin.com/in/dhruba" target="_blank">Dhruba Borthakur</a> proposed a way to improve its vertical scalability (<a href="http://hadoopblog.blogspot.com/2010/04/curse-of-singletons-vertical.html" target="_blank">The Curse of Singletons! The Vertical Scalability of Hadoop NameNode</a>, <a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>). According to his LinkedIn profile, Borthakur is a Hadoop engineer currently working at Facebook.</p>
<p>Both proposals use the terms vertical scalability and horizontal scalability. Vertical scalability is the scalability gained by upgrading the hardware of a single system, typically its CPU, memory, or hard disk. For a distributed system such as Hadoop, an increase in the number of cores per machine also falls under vertical scalability. Horizontal scalability, by contrast, is the scalability gained by adding more systems, for example by growing a cluster from 10 nodes to 20. The terms scale-up and scale-out are used with the same respective meanings.</p>
<p>This post covers the second of the two proposals, Dhruba Borthakur&#8217;s proposal for improving vertical scalability. He points out the following bottlenecks in the Hadoop NameNode (as of Hadoop 0.21).</p>
<ul>
<li><strong>Network</strong>: The cluster he uses at Facebook reportedly has about 2000 nodes, and each server is configured to run 9 mappers and 6 reducers for MapReduce jobs. When MapReduce runs on a cluster configured this way, the clients send about 30k simultaneous requests to the NameNode. However, a single Listener thread in the Hadoop RPCServer, implemented as a singleton, handles every message, so considerable latency builds up, and adding CPU cores reportedly did not help.</li>
<li><strong>CPU</strong>: Because of the FSNamesystem lock mechanism, the NameNode typically uses only 2 cores even though it runs on an 8-core machine. According to Borthakur, the locking mechanism used by FSNamesystem is too simplistic, and although <a href="https://issues.apache.org/jira/browse/HADOOP-1269" target="_blank">HADOOP-1269</a> improved matters, there is still room for improvement.</li>
<li><strong>Memory</strong>: The Hadoop NameNode keeps all metadata in memory, faithful to the Google File System paper. The HDFS of the cluster Borthakur uses holds 60 million files and 80 million blocks, and maintaining their metadata reportedly required as much as 58GB of heap space.</li>
</ul>
<p>Borthakur proposed the following solutions to these problems.</p>
<ul>
<li><strong>RPC Server</strong>: He attached a pool of Reader threads to the Listener thread, which had been implemented as a singleton. The Listener thread now only accepts connection requests, and one of the Reader threads handles the RPC itself. As a result, more CPU cores can be put to work when a large number of RPC requests arrive (<a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>).</li>
<li><strong>FSNamesystem lock</strong>: Borthakur gathered statistics on which file operations acquire the lock and found that two of them, getting the status of a file or directory and opening a file for reading, account for 90% of all lock acquisitions. Because both operations are read-only, he replaced the lock with a read-write lock to improve performance (<a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>). This approach is also explained well in section 4.1, Namespace Management and Locking, of the Google File System paper (<a href="http://labs.google.com/papers/gfs.html" target="_blank">The Google File System</a>). GFS already describes taking read locks on the parent directories in the namespace data structure and a read or write lock on the target directory or file, so that as many operations as possible can access shared data concurrently while consistency is preserved. Like file append, which was added in Hadoop 0.20, this seems to be something the paper describes but Hadoop had not yet implemented. I would have to study the source code to know for sure.</li>
</ul>
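<p>The read-write locking idea in the FSNamesystem item can be sketched in a few lines of Java. This is only a minimal illustration built on the standard java.util.concurrent.locks.ReentrantReadWriteLock class, not the actual FSNamesystem code: the class NamespaceSketch, its methods, and the single counter standing in for the namespace metadata are all hypothetical.</p>
<pre class="brush: java;">
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a namespace guarded by one lock; not Hadoop code.
class NamespaceSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private long fileCount = 0; // stands in for the in-memory metadata

    // Read-only operation: many threads may hold the read lock at once.
    long getFileCount() {
        lock.readLock().lock();
        try {
            return fileCount;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Mutating operation: the write lock excludes all readers and writers.
    void createFile() {
        lock.writeLock().lock();
        try {
            fileCount++;
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        NamespaceSketch ns = new NamespaceSketch();
        ns.createFile();
        ns.createFile();
        System.out.println(ns.getFileCount());
    }
}
</pre>
<p>With a single exclusive lock, status lookups queue behind one another. With the read lock they can proceed in parallel, and only mutations such as createFile are serialized.</p>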
<p>The memory problem, however, remains unsolved. Even so, according to Borthakur, fixing the two problems above was enough to improve scalability by as much as 8 times.</p>
<p>About a month ago I noticed the recent attempts to improve HDFS scalability, found them interesting, and decided to write them up on this blog when I had some spare time. I have only just finished the first one. I have been short of time lately, so this post took about three weeks from start to finish. In the meantime, <em>;login:</em>, the <em>Usenix</em> magazine, published an article on the scalability of the Hadoop NameNode, <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">HDFS Scalability: The Limits to Growth</a>. The Yahoo! Developer Network blog also posted an introduction to the article (<a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">Scalability of the Hadoop Distributed File System</a>). I will write up the rest as time permits.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/05/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>A Brief Summary of Independent Set in Graph Theory</title>
		<link>http://diveintodata.org/2010/04/a-brief-summary-of-independent-set-in-graph-theory/</link>
		<comments>http://diveintodata.org/2010/04/a-brief-summary-of-independent-set-in-graph-theory/#comments</comments>
		<pubDate>Sat, 24 Apr 2010 02:27:34 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[coloring problem]]></category>
		<category><![CDATA[dominating set]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graph coloring]]></category>
		<category><![CDATA[independent set]]></category>
		<category><![CDATA[maximal independent set]]></category>
		<category><![CDATA[maximum independent set]]></category>
		<category><![CDATA[mis]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=707</guid>
		<description><![CDATA[Graph Basics Let G = (V, E) be an undirected graph, where V is a set of vertices and E is a set of edges. Every edge e in E consists of two vertices of V, its end points, and is said to connect, join, or link them. Independent Set An independent set S [...]]]></description>
			<content:encoded><![CDATA[<h3>Graph Basics</h3>
<p>Let <em>G = (V, E)</em> be an undirected graph, where <em>V</em> is a set of vertices and <em>E</em> is a set of edges. Every edge <em>e</em> in <em>E</em> consists of two vertices of <em>V</em>, its end points, and is said to connect, join, or link them.</p>
<h3>Independent Set</h3>
<p>An independent set <em>S</em> is a subset of <em>V</em> in <em>G</em> such that no two vertices in <em>S</em> are adjacent. I suppose the name reflects that the vertices of <em>S</em> are independent of one another with respect to the edges of <em>G</em>. As with other kinds of vertex sets in graph theory, an independent set can be maximal or maximum:</p>
<blockquote><p>The independent set <em>S</em> is <strong><em>maximal</em></strong> if <em>S</em> is not a proper subset of any independent set of <em>G</em>.</p></blockquote>
<blockquote><p>The independent set <em>S</em> is <strong><em>maximum</em></strong> if no other independent set has more vertices than <em>S</em>.</p></blockquote>
<p>That is, a largest maximal independent set is called a maximum independent set. The maximum independent set problem is an NP-hard optimization problem.</p>
<p>Every graph has independent sets (the empty set is one). The independence number <em>α</em>(<em>G</em>) of a graph <em>G</em> is the cardinality of a maximum independent set.</p>
<h3>Relations to Dominating Sets</h3>
<ul>
<li>A dominating set in a graph <em>G</em> is a subset <em>D</em> of <em>V</em> such that every vertex not in <em>D</em> is joined to at least one member of <em>D</em> by some edge.</li>
<li>In other words, a vertex set <em>D</em> is a dominating set in <em>G</em> if and only if every vertex of <em>G</em> is contained in <em>D</em> or is adjacent to a vertex in <em>D</em>.</li>
<li>Every maximal independent set <em>S</em> of vertices in a simple graph <em>G</em> has the property that every vertex of the graph either is contained in <em>S</em> or is adjacent to a vertex in <em>S</em>.
<ul>
<li>That is, an independent set is a dominating set if and only if it is a maximal independent set.</li>
</ul>
</li>
</ul>
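<p>The definitions above can be checked mechanically on a small example. The following is a minimal Java sketch, not taken from the references: the class name, the helper names, and the five-vertex path graph are all assumptions made for illustration. It tests whether a vertex set is independent, maximal, and dominating.</p>
<pre class="brush: java;">
// Illustrative sketch only: the graph and all names here are hypothetical.
class IndependentSetSketch {
    // Example graph: the path 0 - 1 - 2 - 3 - 4, stored as an edge list.
    static final int[] VERTICES = {0, 1, 2, 3, 4};
    static final int[][] EDGES = {{0, 1}, {1, 2}, {2, 3}, {3, 4}};

    // Builds the membership array of a vertex set from its members.
    static boolean[] setOf(int... members) {
        boolean[] in = new boolean[VERTICES.length];
        for (int v : members) {
            in[v] = true;
        }
        return in;
    }

    // S is independent if no edge has both end points in S.
    static boolean isIndependent(boolean[] inS) {
        for (int[] e : EDGES) {
            if (inS[e[0]]) {
                if (inS[e[1]]) {
                    return false;
                }
            }
        }
        return true;
    }

    // An independent S is maximal if adding any outside vertex breaks independence.
    static boolean isMaximalIndependent(boolean[] inS) {
        if (!isIndependent(inS)) {
            return false;
        }
        for (int v : VERTICES) {
            if (inS[v]) {
                continue;
            }
            boolean[] bigger = inS.clone();
            bigger[v] = true;
            if (isIndependent(bigger)) {
                return false;
            }
        }
        return true;
    }

    // D is dominating if every vertex is in D or adjacent to a vertex in D.
    static boolean isDominating(boolean[] inD) {
        boolean[] covered = inD.clone();
        for (int[] e : EDGES) {
            if (inD[e[0]]) {
                covered[e[1]] = true;
            }
            if (inD[e[1]]) {
                covered[e[0]] = true;
            }
        }
        for (boolean c : covered) {
            if (!c) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean[] s = setOf(1, 3);
        System.out.println(isMaximalIndependent(s) + " " + isDominating(s));
    }
}
</pre>
<p>On this path, {1, 3} and {0, 2, 4} are both maximal independent sets and both dominating, but only {0, 2, 4} is maximum. The set {0, 4} is independent yet neither maximal nor dominating, because vertex 2 is left uncovered.</p>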
<h3>Relations to Graph Coloring</h3>
<ul>
<li>The independent set problem is related to the graph coloring problem: in a proper coloring, all vertices in an independent set can share one color, so each color class is an independent set.</li>
</ul>
<h3>References</h3>
<ul>
<li>Chapter 10, <a href="http://www.amazon.com/Graph-Theory-Modeling-Applications-Algorithms/dp/0131423843" target="_blank">Graph Theory: Modeling, Applications, and Algorithms</a></li>
<li><a href="http://en.wikipedia.org/wiki/Independent_set_(graph_theory)">http://en.wikipedia.org/wiki/Independent_set_(graph_theory)</a></li>
<li><a href="http://en.wikipedia.org/wiki/Dominating_set">http://en.wikipedia.org/wiki/Dominating_set</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/04/a-brief-summary-of-independent-set-in-graph-theory/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Data-Intensive Text Processing with MapReduce Draft Available Online</title>
		<link>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/</link>
		<comments>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 01:46:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data intensive]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[text processing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=605</guid>
		<description><![CDATA[Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer Until now, there has been no book that deals directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although it is still a draft, it seems well organized and very interesting. It also covers some basic graph algorithms [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.umiacs.umd.edu/~jimmylin/book.html" target="_blank">Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer</a></p>
<p>Until now, there has been no book that deals directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although it is still a draft, it seems well organized and very interesting. It also covers some basic graph algorithms using MapReduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/03/data-intensive-text-processing-with-mapreduce-draft-available-in-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How to Create A Table in HBase for Beginners</title>
		<link>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/</link>
		<comments>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 02:33:36 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[create table]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=527</guid>
		<description><![CDATA[I have picked up some knowledge and know-how about MapReduce, Hadoop, and HBase from the projects I have worked on. From now on, I&#8217;ll post HBase tips periodically. Today, I&#8217;m going to show how to create an HBase table in Java. HBase provides two ways for a client to work with an HBase cluster. [...]]]></description>
			<content:encoded><![CDATA[<p>I have picked up some knowledge and know-how about MapReduce, Hadoop, and HBase from the projects I have worked on. From now on, I&#8217;ll post HBase tips periodically. Today, I&#8217;m going to show how to create an HBase table in Java.</p>
<p>HBase provides two ways for a client to work with an HBase cluster. One is to use an instance of the <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html" target="_blank">HBaseAdmin</a> class, which provides methods for creating, modifying, and deleting tables and column families. The other is to use an instance of the HTable class, which mostly provides methods for manipulating data, such as inserting, modifying, and deleting rows and cells.</p>
<p>Thus, to create an HBase table, we connect to the HBase master by creating an instance of HBaseAdmin, as on line 4. HBaseAdmin requires an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HBaseConfiguration.html" target="_blank">HBaseConfiguration</a>. If necessary, you can set configuration properties, as on line 2.</p>
<p>To describe the HBase schema, we create an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HColumnDescriptor.html" target="_blank">HColumnDescriptor</a> for each column family and add it to the table&#8217;s HTableDescriptor (lines 5 to 9). In addition to the column family name, HColumnDescriptor lets you set various parameters, such as maxVersions, compression type, timeToLive, and bloomFilter. Then we create the HBase table by invoking createTable, as on line 10.</p>
<pre class="brush: java;">
HBaseConfiguration conf = new HBaseConfiguration();
conf.set(&quot;hbase.master&quot;, &quot;localhost:60000&quot;); // optional: override settings here

HBaseAdmin hbase = new HBaseAdmin(conf); // connects to the HBase master
HTableDescriptor desc = new HTableDescriptor(&quot;TEST&quot;);
HColumnDescriptor personal = new HColumnDescriptor(&quot;personal&quot;.getBytes());
HColumnDescriptor account = new HColumnDescriptor(&quot;account&quot;.getBytes());
desc.addFamily(personal);
desc.addFamily(account);
hbase.createTable(desc); // creates table TEST with both column families
</pre>
<p>Finally, you can verify that the table exists from the HBase shell:</p>
<pre class="brush: bash;">
c0d3h4ck@code:~/Development/hbase$ bin/hbase shell
HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
Version: 0.20.1, r822817, Wed Oct  7 11:55:42 PDT 2009
hbase(main):001:0&gt; list
TEST

1 row(s) in 0.0940 seconds
</pre>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/how-to-make-a-table-in-hbase-for-beginners/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>ACM SIGMOD 2010 Programming Contest</title>
		<link>http://diveintodata.org/2009/11/acm-sigmod-2010-programming-contest/</link>
		<comments>http://diveintodata.org/2009/11/acm-sigmod-2010-programming-contest/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 11:44:06 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[programming contest]]></category>
		<category><![CDATA[relational database]]></category>
		<category><![CDATA[SIGMOD]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=515</guid>
		<description><![CDATA[As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. It holds an annual conference that is regarded as one of the best conferences in computer science, and it organizes a programming contest in parallel with the conference. The call for this year&#8217;s contest is quoted below. [...]]]></description>
			<content:encoded><![CDATA[<p>As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. It holds an annual conference that is regarded as one of the best conferences in computer science, and it organizes a programming contest in parallel with the conference. The call for this year&#8217;s contest is quoted below. This year&#8217;s subject seems very interesting! The task is to implement a simple distributed query executor built on top of last year&#8217;s main-memory index. The environment on which contestants will test their implementations may be provided by Amazon. If you are interested in the contest, give it a try. You can find further information here (<a href="http://dbweb.enst.fr/events/sigmod10contest/" target="_blank">http://dbweb.enst.fr/events/sigmod10contest</a>).</p>
<blockquote><p>A programming contest is organized in parallel with the ACM SIGMOD 2010 conference, following the success of the first annual SIGMOD programming contest organized last year. Student teams from degree-granting institutions are invited to compete to develop a distributed query engine over relational data. Submissions will be judged on the overall performance of the system on a variety of workloads. A shortlist of finalists will be invited to present their implementation at the SIGMOD conference in June 2010 in Indianapolis, USA. The winning team, to be selected during the conference, will be awarded a prize of 5,000 USD and will be invited to a one-week research visit in Paris. The winning system, released in open source, will form a building block of a complete distributed database system which will be built over the years, throughout the programming contests.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/acm-sigmod-2010-programming-contest/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CIKM 2009 in Hong Kong</title>
		<link>http://diveintodata.org/2009/11/cikm-2009-in-hong-kong/</link>
		<comments>http://diveintodata.org/2009/11/cikm-2009-in-hong-kong/#comments</comments>
		<pubDate>Mon, 09 Nov 2009 15:08:26 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[cikm]]></category>
		<category><![CDATA[cikm09]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[Hong Kong]]></category>
		<category><![CDATA[spider]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=473</guid>
		<description><![CDATA[Together with Min Kyoung Sung, a coauthor of &#8216;SPIDER : A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data&#8217;, I attended the 18th ACM CIKM 2009 (Conference on Information and Knowledge Management), held in Hong Kong. We stayed at the Marriott Hotel near the Asia World-Expo, where CIKM 2009 was held. At [...]]]></description>
			<content:encoded><![CDATA[<p>Together with Min Kyoung Sung, a coauthor of &#8216;<a href="http://dbserver.korea.ac.kr/projects/spider/" target="_blank"><em>SPIDER : A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data</em></a>&#8217;, I attended the <a href="http://www.comp.polyu.edu.hk/conference/cikm2009/about/index.htm" target="_blank">18th ACM CIKM 2009 (Conference on Information and Knowledge Management)</a>, held in Hong Kong. We stayed at the Marriott Hotel near the <a href="http://www.asiaworld-expo.com/" target="_blank">Asia World-Expo</a>, where CIKM 2009 was held. At the conference, I spent time with several Korean researchers (Kyong-Ha Lee, Jinoh Oh, and Sangchul Kim), and during the demonstration session I discussed SPIDER with researchers interested in RDF data processing.</p>
<p>At CIKM 2009, I sensed that the focus of web data management is shifting from unstructured data toward information extraction and semantic or structured web data. Many papers and posters addressed these issues. In addition, the subject of the panel was &#8216;<strong><em>Information extraction meets relational databases: Where are we heading?</em></strong>&#8217; One of the panelists said that the hot spot of web data management research is moving from crawling, indexing, and searching to information extraction and semantic data. These changes raise a variety of new data and knowledge management issues. Besides information extraction, graph data mining was another major topic at CIKM 2009.</p>
<p>In the main keynote, Kyu-Young Whang (KAIST, Korea) spoke on &#8216;<strong><em>DB-IR Integration and Its Application to a Massively-Parallel Search Engine</em></strong>&#8217;. His key point was that DB-IR integration is becoming one of the major challenges in the database area and is leading to new DBMS architectures suited to it. In addition, Edward Chang (Google Research China) and Clement Yu (University of Illinois at Chicago) spoke on &#8216;<strong><em>Confucius and its intelligent Disciples</em></strong>&#8217; and &#8216;<strong><em>Advanced Metasearch Engines</em></strong>&#8217; respectively.</p>
<p style="text-align: center;"><a class="flickr-image alignnone" title="Coffee Break at CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088464259/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2764/4088464259_4f6498eca2_m.jpg" alt="Coffee Break at CIKM 2009" /></a><a class="flickr-image alignnone" title="SPIDER in Demo Session" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088463803/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2752/4088463803_b53bbd8646_m.jpg" alt="SPIDER in Demo Session" /></a></p>
<p style="text-align: center;"><a class="flickr-image alignnone" title="Tian Tan Buddha Statue in Hong Kong" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088461317/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2609/4088461317_5546d70eff_m.jpg" alt="Tian Tan Buddha Statue in Hong Kong" /></a><a class="flickr-image alignnone" title="The lunch time in CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088462251/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2591/4088462251_d5875a68e3_m.jpg" alt="The lunch time in CIKM 2009" /></a></p>
<p>This conference was a really nice experience for me. I enjoyed the conference, the reception, and the banquet. My one regret is that I did not attend <a href="http://www.clouddb.org/CloudDB09/" target="_blank">the 1st Workshop CloudDB 2009</a>, held in conjunction with CIKM 2009.</p>
<p>In any case, the conference inspired both Min Kyoung Sung and me, and we will remember it for a long time.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/cikm-2009-in-hong-kong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>MapReduce Online Comes Out!</title>
		<link>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/</link>
		<comments>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 15:49:37 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[online aggregation]]></category>
		<category><![CDATA[stream queries]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=461</guid>
		<description><![CDATA[MapReduce has been gaining much attention in data intensive computing field. As you know, it is well known as a very popular framework for batch-processing. Recently, however, Tyson Condie who is a Ph.D student in UC Berkeley accomplishes MapReduce Online. Today, I heard this news from Data Beta. Actually, It is amazing works since the [...]]]></description>
			<content:encoded><![CDATA[<p>MapReduce has been gaining much attention in the data-intensive computing field. As you know, it is a very popular framework for batch processing.</p>
<p>Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, has produced <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>. Today, I heard this news from <a href="http://databeta.wordpress.com/2009/10/18/mapreduce-online/" target="_self">Data Beta</a>. It is amazing work, since the original MapReduce was designed only for batch processing. In addition, most people believed that MapReduce would remain a batch-processing framework.</p>
<p>The essence of MapReduce Online is that it keeps the fault-tolerance model of the <a href="http://labs.google.com/papers/mapreduce.html" target="_self">original MapReduce</a> while pipelining results across tasks and jobs instead of materializing the output of each MapReduce task and job to disk. Consequently, MapReduce Online lets a program return results earlier from a big job.</p>
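<p>To make the difference concrete, here is a minimal single-threaded sketch of my own. It is not the MapReduce Online or Hadoop API, and every name in it (<code>PipelineSketch</code>, <code>mapSplit</code>, <code>batchTotal</code>, <code>pipelinedSnapshots</code>) is hypothetical. It only contrasts materializing all map output before reducing with pushing each map output to the reducer right away, and it ignores fault tolerance and real parallelism.</p>
<pre class="brush: java;">
// Hypothetical illustration only; this is not the MapReduce Online or Hadoop API.
public class PipelineSketch {

    // "Map" step: count the records in one input split.
    static int mapSplit(String[] split) {
        return split.length;
    }

    // Batch style: all map output is materialized before the reduce step runs,
    // so only the final total is ever visible to the caller.
    static int batchTotal(String[][] splits) {
        int[] materialized = new int[splits.length];
        int next = 0;
        for (String[] split : splits) {
            materialized[next] = mapSplit(split);
            next++;
        }
        int total = 0;
        for (int partial : materialized) {
            total += partial;
        }
        return total;
    }

    // Pipelined style: each map output is handed to the reduce step right away,
    // which records a snapshot of the running total after every split.
    static int[] pipelinedSnapshots(String[][] splits) {
        int[] snapshots = new int[splits.length];
        int running = 0;
        int next = 0;
        for (String[] split : splits) {
            running += mapSplit(split);
            snapshots[next] = running;
            next++;
        }
        return snapshots;
    }

    public static void main(String[] args) {
        String[][] splits = {
            {"a", "b", "c"},
            {"d"},
            {"e", "f"}
        };
        for (int snapshot : pipelinedSnapshots(splits)) {
            System.out.println("early result: " + snapshot);
        }
        System.out.println("final result: " + batchTotal(splits));
    }
}
</pre>
<p>In this toy version, the last snapshot equals the batch total, and the earlier snapshots stand in for the early results that pipelining makes available.</p>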
<p>You can get further information from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/mapreduce-onlie-comes-out/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>BSP Library on Hadoop?</title>
		<link>http://diveintodata.org/2009/10/bsp-library-on-hadoop/</link>
		<comments>http://diveintodata.org/2009/10/bsp-library-on-hadoop/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 11:45:33 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[angrapa]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[bsp]]></category>
		<category><![CDATA[bulk synchronization parallel]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hama]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=443</guid>
		<description><![CDATA[Recently, I started to participate in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing the bulk synchronous parallel (BSP) library on Hadoop (HAMA-195); I&#8217;m getting help from Edward Yoon, a founder of the Hama project. The motivation for the BSP library is [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I started to participate in the <a href="http://incubator.apache.org/hama/" target="_self">Hama project</a> (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing the <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel" target="_self">bulk synchronous parallel</a> (BSP) library on Hadoop (<a href="https://issues.apache.org/jira/browse/HAMA-195" target="_self">HAMA-195</a>); I&#8217;m getting help from <a href="http://blog.udanax.org/" target="_self">Edward Yoon</a>, a founder of the Hama project. The motivation for the BSP library is clear.</p>
<p>Hadoop is installed at many cloud computing service providers and companies, as you can see at <a href="http://wiki.apache.org/hadoop/PoweredBy" target="_self">http://wiki.apache.org/hadoop/PoweredBy</a>. However, most of them probably run only MapReduce programs. As you know, although MapReduce is very scalable, it provides only a simple programming model. Many programmers want to use a wider variety of programming models without changing the platform (i.e., <a href="http://hadoop.apache.org" target="_self">Hadoop</a>). This BSP library is a first step toward meeting that need. However, like MapReduce, BSP is no Swiss Army knife either. Once we find appropriate applications, the BSP library on Hadoop will be valued for its scalability and capability.</p>
<p>Soon, I&#8217;ll post articles about the progress of the BSP library and <a href="http://wiki.apache.org/hama/GraphPackage" target="_self">Angrapa</a> (the graph package on Hama).</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/bsp-library-on-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Java Universal Network/Graph Framework</title>
		<link>http://diveintodata.org/2009/09/java-universal-networkgraph-framework/</link>
		<comments>http://diveintodata.org/2009/09/java-universal-networkgraph-framework/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 23:30:45 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jung]]></category>
		<category><![CDATA[visualization tools]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=349</guid>
		<description><![CDATA[Recently, I have been primarily concerned with large-scale graph data processing. Visualizing a graph can be a good way to observe some properties of graph data sets. Today, I&#8217;m going to introduce a graph framework called Java Universal Network/Graph Framework (Jung). Jung provides data structures for graphs, a programming interface built around graph [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I have been primarily concerned with large-scale graph data processing. Visualizing a graph can be a good way to observe some properties of graph data sets. Today, I&#8217;m going to introduce a graph framework called <em><a href="http://jung.sourceforge.net/" target="_blank">Java Universal Network/Graph Framework (Jung)</a>. </em>Jung provides data structures for graphs, a programming interface built around graph concepts, some fundamental graph algorithms (e.g., minimum spanning tree, depth-first search, breadth-first search, and Dijkstra&#8217;s algorithm), and even visualization methods. I&#8217;m especially interested in its visualization methods.</p>
<p>The following Java source shows the programming interface of Jung. In more detail, this program makes a graph, adds three vertices to it, and connects the vertices with edges. The source code is taken from the <a href="http://jung.sourceforge.net/doc/index.html" target="_blank">Jung tutorial</a>. As you can see, Jung&#8217;s APIs are very easy to use.</p>
<pre class="brush: java;">
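  // Imports needed at the top of the file (JUNG 2.x, AWT, and Swing).
  import java.awt.Dimension;
  import javax.swing.JFrame;
  import edu.uci.ics.jung.algorithms.layout.KKLayout;
  import edu.uci.ics.jung.algorithms.layout.Layout;
  import edu.uci.ics.jung.graph.Graph;
  import edu.uci.ics.jung.graph.SparseMultigraph;
  import edu.uci.ics.jung.graph.util.EdgeType;
  import edu.uci.ics.jung.visualization.BasicVisualizationServer;
  import edu.uci.ics.jung.visualization.decorators.ToStringLabeller;

  // Create a graph backed by a SparseMultigraph instance
  // (vertices are Integers, edges are Strings).
  Graph&lt;Integer, String&gt; g = new SparseMultigraph&lt;Integer, String&gt;();
  g.addVertex(1); // Add a vertex whose value is the integer 1.
  g.addVertex(2);
  g.addVertex(3);
  g.addEdge(&quot;Edge-A&quot;, 1, 3); // Add an edge between vertices 1 and 3.
  g.addEdge(&quot;Edge-B&quot;, 2, 3, EdgeType.DIRECTED); // Directed edge from 2 to 3.
  g.addEdge(&quot;Edge-C&quot;, 3, 2, EdgeType.DIRECTED); // Directed edge from 3 to 2.
  g.addEdge(&quot;Edge-P&quot;, 2, 3); // A parallel edge between vertices 2 and 3.

  // Create the objects for graph layout and visualization.
  Layout&lt;Integer, String&gt; layout = new KKLayout&lt;Integer, String&gt;(g);
  BasicVisualizationServer&lt;Integer, String&gt; vv =
      new BasicVisualizationServer&lt;Integer, String&gt;(layout);
  vv.setPreferredSize(new Dimension(800, 800));

  // Determine how each vertex's value is rendered as a label in the diagram.
  ToStringLabeller&lt;Integer&gt; vertexPaint = new ToStringLabeller&lt;Integer&gt;() {
    public String transform(Integer i) {
      return String.valueOf(i);
    }
  };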

  vv.getRenderContext().setVertexLabelTransformer(vertexPaint);

  JFrame frame = new JFrame(&quot;Simple Graph View&quot;);
  frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
  frame.getContentPane().add(vv);
  frame.pack();
  frame.setVisible(true);
</pre>
<p>Some of Jung&#8217;s APIs are based on generic programming, so vertices and edges can easily carry user-defined data. If you want more detailed information, visit <a href="http://jung.sourceforge.net/">http://jung.sourceforge.net</a>.</p>
<p>The source code above produces the following diagram.<br />
<a class="flickr-image aligncenter" title="Jung example" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3919489249/"><img class="flickr-medium aligncenter" src="http://farm3.static.flickr.com/2646/3919489249_3377cc8c63.jpg" alt="Jung example" width="347" height="346" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/java-universal-networkgraph-framework/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
