<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Dive Into A Data Deluge</title>
	<atom:link href="http://diveintodata.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://diveintodata.org</link>
	<description>Discussion about Newly Emerging Issues on Database</description>
	<lastBuildDate>Thu, 29 Mar 2012 09:43:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='diveintodata.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Dive Into A Data Deluge</title>
		<link>http://diveintodata.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://diveintodata.org/osd.xml" title="Dive Into A Data Deluge" />
	<atom:link rel='hub' href='http://diveintodata.org/?pushpress=hub'/>
		<item>
		<title>How to Launch a Hadoop Cluster on Amazon EC2 Using whirr</title>
		<link>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/</link>
		<comments>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 03:06:58 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[configuration]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[whirr]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=1060</guid>
		<description><![CDATA[I have recently been running experiments on a Hadoop cluster built on Amazon EC2 to validate my research. Because I knew little about the Amazon EC2 environment and documentation was scarce, I had to work through quite a few problems on my own while setting up the cluster. In this post I describe the approach I tried and share my experience. To begin with, I assume that readers have an Amazon EC2 account and [...]]]></description>
			<content:encoded><![CDATA[<p>I have recently been running experiments on a Hadoop cluster built on Amazon EC2 to validate my research. Because I knew little about the Amazon EC2 environment and documentation was scarce, I had to work through quite a few problems on my own while setting up the cluster. In this post I describe the approach I tried and share my experience.</p>
<p>To begin with, I assume that readers have an Amazon EC2 account and are familiar with AMIs, instances, and EC2 key pairs.</p>
<p><span style="font-size:20px;font-weight:bold;">Launching a Hadoop Cluster on Amazon EC2</span></p>
<p>At the moment there are not many ways to launch a Hadoop cluster in the Amazon EC2 environment.</p>
<ol>
<li>Use an image with Hadoop already installed and configure it manually</li>
<li>Install Hadoop on an EBS-backed AMI, copy it, and configure it manually</li>
<li>Use whirr</li>
<li>Use whirr's hadoop-ec2</li>
</ol>
<p>This post covers option 3, building the cluster with whirr. The method is very simple, but it has one restriction: it can basically use only instance-store-backed AMIs. The HDFS of the cluster therefore lives on instance store, and all data in HDFS is removed when the cluster is shut down. (Every Amazon EC2 instance has to use a separate storage service such as EBS or S3 to store data persistently.)</p>
<p>I first tried both options 1 and 2, but ran into the following problems.</p>
<ul>
<li>No AMI with a recent Hadoop release (0.20 or later) installed</li>
<li>Automating the launch of dozens of instances</li>
<li>IP addresses that are newly assigned on every launch, which makes it hard to configure Hadoop and distribute the configuration</li>
</ul>
<p>From what I found, automating instance launch, collecting the IP list of the launched instances, and distributing the configuration smoothly requires development work against the <a href="http://aws.amazon.com/sdkforjava/">Amazon AWS API</a> or a third-party library such as <a href="http://code.google.com/p/boto/">boto (Python interface to Amazon Web Services)</a> or <a href="http://code.google.com/p/jclouds/downloads/list">jclouds (multi-cloud library)</a>. That takes a lot of time. I gave up on installing Hadoop directly on an EBS-backed AMI for similar reasons.</p>
<p>Option 4, hadoop-ec2, was originally part of Hadoop's contrib and is now developed within whirr, but it does not appear to be actively maintained, so I did not try it. whirr's change log also has little to say about it.</p>
<p>For now, whirr seems to be the most convenient option.</p>
<p><span style="font-size:20px;font-weight:bold;">whirr?</span></p>
<p>whirr is an Apache Incubator project: a library that automatically installs, configures, and runs a chosen service in commercial cloud environments such as Amazon EC2. The services it currently supports are <a href="http://hadoop.apache.org/">Apache Hadoop</a>, <a href="http://cassandra.apache.org/">Cassandra</a>, <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution for Hadoop (CDH)</a>, and <a href="http://hadoop.apache.org/zookeeper/">Zookeeper</a>, and <a href="http://hbase.apache.org/">HBase</a> is reportedly going to be added in the upcoming 0.4-incubating release.</p>
<h2>How It Works</h2>
<p>Before explaining how to use whirr, here is a brief overview of how it operates.</p>
<p>When the user issues the &#8216;<em>launch-cluster</em>&#8217; command, whirr starts a number of instances from an instance-store-backed AMI and installs and configures the JDK and Hadoop on all of them in one pass. When this finishes, a Hadoop cluster is running inside EC2.</p>
<p>The user then controls the Hadoop cluster running in EC2 through the Hadoop program installed on the local machine. However, instances inside EC2 are by default assigned only private IPs and cannot be reached from outside, and the default firewall settings are strict, so Hadoop RPC and the web UI are not accessible without additional setup. You therefore run the proxy program that whirr provides first, and then use the locally installed Hadoop program to control the cluster inside EC2.</p>
<h2>Launching the Hadoop Cluster</h2>
<h3>Creating an Account</h3>
<p>The Hadoop cluster created through whirr is started on Linux under the username <em>hadoop</em>. Hadoop grants superuser privileges to a client that connects under the same account that started the cluster. So, to act as superuser on the Hadoop cluster running inside Amazon EC2, create a local account named hadoop and perform the steps below from that account. If you only need to run MapReduce programs, any account will do.</p>
<h3>Installing Hadoop and whirr on the Local Machine, and the Hadoop Version Problem</h3>
<p>Download Hadoop and whirr from <a href="http://archive.apache.org/dist/hadoop/core/">http://archive.apache.org/dist/</a> and install them on the local machine. The question at this point is which Hadoop version to install. Hadoop is currently being developed rapidly, and the internal protocols of different versions are not compatible with each other.</p>
<p>The Hadoop cluster that whirr installs automatically and the Hadoop on the local machine must therefore be the same version. Changing the version requires editing the install and configuration scripts yourself, as the <a href="http://incubator.apache.org/whirr/faq.html">whirr FAQ</a> explains below.</p>
<blockquote>
<h3>How do I specify the service version and other service properties?</h3>
<p>Currently the only way to do this is to modify the scripts to install a particular version of the service, or to change the service properties from the defaults.</p>
<p>See &#8220;How to modify the instance installation and configuration scripts&#8221; above for details on how to do this.</p>
<p>from <a href="http://incubator.apache.org/whirr/faq.html">http://incubator.apache.org/whirr/faq.html</a></p></blockquote>
<p>Besides the Apache Hadoop distribution, whirr can also install Cloudera's Hadoop distribution. See the notes on whirr.hadoop-install-runurl and whirr.hadoop-configure-runurl under &#8216;whirr Configuration File&#8217; below.</p>
<h3>whirr Configuration File</h3>
<p>Launching a cluster with whirr starts with writing a configuration file for the cluster. This post explains the contents of the cluster.properties file below, and the rest of the post assumes this configuration.</p>
<p>(The following applies to 0.3-incubating, the latest version. The configuration method is expected to change when 0.4-incubating is released. I will update this post after the release.)</p>
<p><pre class="brush: plain;">
whirr.cluster-name=mycluster
whirr.instance-templates=1 jt+nn,16 dn+tt
whirr.provider=ec2
whirr.identity=ACCESS_KEY
whirr.credential=SECRET_KEY
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.location-id=us-east-1d
whirr.hardware-id=m1.small
whirr.service-name=hadoop
#whirr.hadoop-install-runurl=cloudera/cdh/install
#whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
</pre></p>
<p>Each item is described below.</p>
<ul>
<li>whirr.cluster-name: The name that identifies the cluster to launch. When the cluster is launched, the directory ${HOME}/.whirr/<em>&lt;cluster-name&gt;</em> is created, and it holds the files needed to access the Hadoop cluster.</li>
<li>whirr.instance-templates: The layout of the cluster to launch. jt stands for jobtracker, nn for name node, dn for data node, and tt for task tracker. This setting allows flexible layouts. To add more data nodes and task trackers, change the number in front of dn+tt.</li>
<li>whirr.provider: The cloud service provider. Amazon EC2 and Rackspace Cloud Servers are currently supported.</li>
<li>whirr.identity: Your AWS access key.</li>
<li>whirr.credential: Your AWS secret key.</li>
<li>whirr.private-key-file: This setting and whirr.public-key-file specify the key used when creating the EC2 instances. To use the paths from the example above, generate an SSH key as shown below. Alternatively, point them at an EC2 key pair you already created for other instances.</li>
</ul>
<p><pre class="brush: plain;">
$ ssh-keygen -t rsa -P ''
</pre></p>
<ul>
<li>whirr.location-id: The availability zone to use. If it is not set, the same zone as the instance running whirr is used.</li>
<li>whirr.hardware-id: The instance type to use. The instance types that Amazon EC2 offers are listed on the <a href="http://aws.amazon.com/ec2/instance-types/">Amazon EC2 Instance Types</a> page. Use the API name shown for each type as the value of this setting.</li>
<li>whirr.service-name: The service to set up. This post is about Hadoop, so leave it as hadoop.</li>
<li>whirr.hadoop-install-runurl, whirr.hadoop-configure-runurl: Hadoop comes in an Apache version and a CDH version. Removing the comment marker (#) from these two lines in the example above launches the CDH version.</li>
</ul>
<p>For more details on configuration, see the <a href="http://incubator.apache.org/whirr/configuration-guide.html">Whirr Configuration Guide</a>.</p>
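<p>The layout string in whirr.instance-templates is easy to sanity-check before launching anything. The following is a minimal sketch in plain Java and is not part of whirr. It loads properties text with java.util.Properties and counts how many instances the template asks for. The names ClusterConfigSketch and totalInstances are made up for illustration, and the parsing rule (comma-separated groups of a count followed by roles) is an assumption based only on the example file above.</p>
<p><pre class="brush: java;">
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative helper, not part of whirr.
final class ClusterConfigSketch {

  // Sums the leading counts of a template such as "1 jt+nn,16 dn+tt".
  static int totalInstances(String template) {
    int total = 0;
    for (String group : template.split(",")) {
      // Each group is assumed to look like "COUNT ROLE+ROLE".
      String[] parts = group.trim().split("\\s+", 2);
      total += Integer.parseInt(parts[0]);
    }
    return total;
  }

  // Reads whirr.instance-templates from properties text and counts the instances.
  static int totalInstancesFromProperties(String propertiesText) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(propertiesText));
    } catch (IOException e) {
      throw new IllegalStateException("could not read properties", e);
    }
    String template = props.getProperty("whirr.instance-templates");
    if (template == null) {
      throw new IllegalArgumentException("whirr.instance-templates is not set");
    }
    return totalInstances(template);
  }

  public static void main(String[] args) {
    String text = "whirr.cluster-name=mycluster\n"
        + "whirr.instance-templates=1 jt+nn,16 dn+tt\n";
    System.out.println(totalInstancesFromProperties(text)); // prints 17
  }
}
</pre></p>
<p>For the example template the total is 17, which is also the instance count reported in the launch-cluster output further down.</p>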
<h3>Starting the Hadoop Cluster</h3>
<p>Start the cluster with the following command. Internally, this creates instances through the Amazon EC2 API, installs required packages such as the JDK, downloads the Hadoop distribution, and configures it. Startup therefore takes from a few minutes up to about 10 minutes. When it completes, the shell prompt returns.</p>
<p><pre class="brush: plain;">
$ whirr/bin/whirr launch-cluster --config cluster.properties
Bootstrapping cluster
Configuring template
Starting 16 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]
Nodes started: [[id=us-east-1/i-a45eb7cb, providerId=i-a45eb7cb, ...]]
.....
Nodes started: [[id=us-east-1/i-7a51b815, providerId=i-7a51b815, ...]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of mycluster
Web UI available at http://ec2-72-44-43-29.compute-1.amazonaws.com
Wrote Hadoop site file /home/-----/.whirr/mycluster/hadoop-site.xml
Wrote Hadoop proxy script /home/-----/.whirr/mycluster/hadoop-proxy.sh
Wrote instances file /home/-----/.whirr/mycluster/instances
Started cluster of 17 instances
Cluster{instances=[Instance{roles=[tt, dn], ...}]}
$
</pre></p>
<h3>Opening the Proxy</h3>
<p>To access the Hadoop cluster launched through whirr, the locally installed Hadoop has to be configured. This is simply a matter of using the file that whirr generates after starting the cluster. When &#8216;<em>whirr launch-cluster</em>&#8217; completes, the file ${HOME}/.whirr/<em>&lt;cluster-name&gt;/</em>hadoop-site.xml is created. Either copy this file into ${HADOOP_HOME}/conf of the local Hadoop installation, or override Hadoop's configuration directory by setting the environment variable as follows.</p>
<p><pre class="brush: plain;">
export HADOOP_CONF_DIR=~/.whirr/&lt;cluster-name&gt;
</pre></p>
<p>However, the nodes inside EC2 have only private IPs, so the Hadoop cluster cannot be reached directly. It becomes accessible only after you run ${HOME}/.whirr/<em>&lt;cluster-name&gt;/hadoop-proxy.sh</em>.</p>
<p><pre class="brush: plain;">
$ chmod +x ~/.whirr/mycluster/hadoop-proxy.sh
$ ~/.whirr/mycluster/hadoop-proxy.sh

Running proxy to Hadoop cluster at ec2-72-44-43-29.compute-1.amazonaws.com. Use Ctrl-c to quit.
</pre></p>
<p><pre class="brush: plain;">
$ bin/hadoop dfs -ls
...
</pre></p>
<p>Running <em>hadoop-proxy.sh</em> also gives access to the web UI of the Hadoop cluster running in EC2. This requires a small change to the web browser settings.</p>
<ul>
<li>In Chrome, go to Preferences -&gt; Under the Hood tab -&gt; Network -&gt; Change Proxy Settings. Set the SOCKS address to localhost and the port to 6666.</li>
<li>Firefox by default uses the DNS configured on the local machine even when SOCKS is in use, so change that setting first. Enter about:config in the address bar to open the advanced settings and set network.proxy.socks_remote_dns to true. Then go to Preferences -&gt; Advanced -&gt; Network tab -&gt; Settings in the Connection section and set the SOCKS Host address to localhost and the port to 6666.</li>
</ul>
<p>After making these changes, open the URL that was printed when hadoop-proxy.sh was run.</p>
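<p>The same SOCKS proxy can be used from code as well as from a browser. The sketch below is only an illustration and is not something whirr generates. It builds a java.net.Proxy for localhost and port 6666 (the values given above) and opens a URL connection through it. The class name SocksProxySketch and its methods are hypothetical, and the example assumes that hadoop-proxy.sh is already running.</p>
<p><pre class="brush: java;">
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

// Illustrative helper, not generated by whirr.
final class SocksProxySketch {

  // Describes the local SOCKS endpoint that hadoop-proxy.sh is assumed to open.
  static Proxy localSocksProxy(int port) {
    return new Proxy(Proxy.Type.SOCKS, new InetSocketAddress("localhost", port));
  }

  // Creates a connection object that will be routed through the proxy once it connects.
  static URLConnection openThroughProxy(String url, int port) throws IOException {
    return new URL(url).openConnection(localSocksProxy(port));
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: SocksProxySketch WEB_UI_URL");
      return;
    }
    URLConnection connection = openThroughProxy(args[0], 6666);
    connection.connect(); // fails unless the proxy is running
    System.out.println("Connected to " + connection.getURL());
  }
}
</pre></p>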
<h3>Shutting Down the Cluster</h3>
<p>When all work with Hadoop is done, shut the cluster down with the following command. <span style="text-decoration:underline;">As noted above, whirr so far supports only instance-store-backed clusters, so keep in mind that all data in HDFS will be removed.</span></p>
<p><pre class="brush: plain;">
$ whirr/bin/whirr destroy-cluster --config=cluster.properties
</pre></p>
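<h2>Conclusion</h2>
<p>This post explained how to launch a Hadoop cluster with whirr. whirr still has minor bugs and configuration limits, but it considerably reduces the effort for users who would otherwise have to build a cluster by hand. I think it is especially useful for users who need to run MapReduce programs on a large cluster on demand.</p>
<h2>References</h2>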
<ul>
<li><a href="http://incubator.apache.org/whirr/quick-start-guide.html" target="_blank">Getting Started with Whirr</a></li>
<li><a title="Map-Reduce With Ruby Using Hadoop" href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop" target="_blank">Map-Reduce With Ruby Using Hadoop</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>An Example of Hadoop MapReduce Counter</title>
		<link>http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/</link>
		<comments>http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/#comments</comments>
		<pubDate>Mon, 14 Mar 2011 15:56:42 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[counter]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=1014</guid>
		<description><![CDATA[Hadoop MapReduce counters provide a way to measure the progress of a MapReduce program or the number of operations that occur within it. The framework provides a number of built-in counters that measure basic I/O operations, such as FILE_BYTES_READ/WRITTEN and Map/Combine/Reduce input/output records. These counters are especially useful when you evaluate a MapReduce program. In addition, [...]]]></description>
			<content:encoded><![CDATA[<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" width="200" height="50" /></p>
<h2>MapReduce Counter</h2>
<p>Hadoop MapReduce counters provide a way to measure the progress of a MapReduce program or the number of operations that occur within it. The framework provides a number of built-in counters that measure basic I/O operations, such as FILE_BYTES_READ/WRITTEN and Map/Combine/Reduce input/output records. These counters are especially useful when you evaluate a MapReduce program. In addition, you can define your own counters. Because counters are automatically aggregated over the Map and Reduce phases, they are one of the easiest ways to investigate the internal behavior of a MapReduce program. This post shows how to use your own MapReduce counters. The example code is based on the Hadoop 0.21 API.</p>
<h2>Incrementing your counter</h2>
<p>To create your own MapReduce counter, first define an <em>enum</em> type as follows:</p>
<p><pre class="brush: java;">
public static enum MATCH_COUNTER {
  INCOMING_GRAPHS,
  PRUNING_BY_NCV,
  PRUNING_BY_COUNT,
  PRUNING_BY_ISO,
  ISOMORPHIC
};
</pre></p>
<p>Then, whenever you want to increment the counter, call its <em>increment</em> method as follows:</p>
<p><pre class="brush: java;">
context.getCounter(MATCH_COUNTER.INCOMING_GRAPHS).increment(1);
</pre></p>
<p>The <em>context</em> instance is available within the <em>setup</em>, <em>cleanup</em>, <em>map</em>, and <em>reduce</em> methods of a Mapper or Reducer class. You obtain the desired counter by calling <em>context.getCounter</em> with an enum value.</p>
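<p>To make the aggregation concrete without a cluster, here is a minimal plain-Java sketch of the idea. It does not use the Hadoop API. Each simulated task keeps its own tally per enum constant, and the tallies are summed at the end, which is roughly what the framework does for you. The names CounterAggregationSketch, MatchCounter, and TaskTally are hypothetical.</p>
<p><pre class="brush: java;">
// Plain-Java illustration only; no Hadoop classes are involved.
final class CounterAggregationSketch {

  // Stand-in for a user-defined counter enum such as MATCH_COUNTER.
  enum MatchCounter { INCOMING_GRAPHS, ISOMORPHIC }

  // Per-task tally, indexed by the ordinal of the enum constant.
  static final class TaskTally {
    private final long[] values = new long[MatchCounter.values().length];

    void increment(MatchCounter counter, long amount) {
      values[counter.ordinal()] += amount;
    }

    long get(MatchCounter counter) {
      return values[counter.ordinal()];
    }
  }

  // Sums one counter over all task tallies, giving a job-level total.
  static long aggregate(MatchCounter counter, TaskTally... tasks) {
    long total = 0;
    for (TaskTally task : tasks) {
      total += task.get(counter);
    }
    return total;
  }

  public static void main(String[] args) {
    TaskTally mapTask1 = new TaskTally();
    TaskTally mapTask2 = new TaskTally();
    mapTask1.increment(MatchCounter.INCOMING_GRAPHS, 3);
    mapTask2.increment(MatchCounter.INCOMING_GRAPHS, 4);
    mapTask2.increment(MatchCounter.ISOMORPHIC, 1);
    System.out.println(aggregate(MatchCounter.INCOMING_GRAPHS, mapTask1, mapTask2)); // prints 7
  }
}
</pre></p>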
<h2>Finding your counter</h2>
<p>You can obtain the <em>Counters</em> of a finished job as follows:</p>
<p><pre class="brush: java;">
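// Classes come from org.apache.hadoop.conf and org.apache.hadoop.mapreduce.
Configuration conf = new Configuration();
Cluster cluster = new Cluster(conf);
Job job = Job.getInstance(cluster, conf);
// Blocks until the job finishes; true prints progress while waiting.
boolean result = job.waitForCompletion(true);
...
Counters counters = job.getCounters();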
</pre></p>
<p>The <em>Counters</em> instance contains all of the counters collected from the job. To get your own counter, call the <em>findCounter</em> method with an <em>enum</em> value as follows:</p>
<p><pre class="brush: java;">
Counter c1 = counters.findCounter(MATCH_COUNTER.INCOMING_GRAPHS);
System.out.println(c1.getDisplayName()+&quot;:&quot;+c1.getValue());
</pre></p>
<p>The example below iterates over every counter group of the job, including the built-in groups that Hadoop provides.</p>
<p><pre class="brush: java;">
for (CounterGroup group : counters) {
  System.out.println(&quot;* Counter Group: &quot; + group.getDisplayName() + &quot; (&quot; + group.getName() + &quot;)&quot;);
  System.out.println(&quot;  number of counters in this group: &quot; + group.size());
  for (Counter counter : group) {
    System.out.println(&quot;  - &quot; + counter.getDisplayName() + &quot;: &quot; + counter.getName() + &quot;: &quot;+counter.getValue());
  }
}
</pre> </p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/images/hadoop-logo.jpg" medium="image">
			<media:title type="html">Apache Hadoop</media:title>
		</media:content>
	</item>
		<item>
		<title>VoltDB and its related links</title>
		<link>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/</link>
		<comments>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/#comments</comments>
		<pubDate>Tue, 01 Jun 2010 05:26:55 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[ACID]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[shared-nothing architecture]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[VoltDB]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=842</guid>
		<description><![CDATA[There has been a lot of buzz about VoltDB (its academic name is H-Store [5]) over the past week. VoltDB is led by M. Stonebraker, and it is an open source OLTP DBMS. It has several interesting properties: it runs on shared-nothing clusters of commodity hardware, is an in-memory database, supports SQL and ACID, scales linearly, and is released as open source software [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg"><img class="alignright size-full wp-image-954" title="VoltDB" src="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg?w=590" alt=""   /></a>There has been a lot of buzz about <em><span style="font-style:normal;">VoltDB (its academic name is H-Store <a href="#ref-5">[5]</a>)</span><span style="font-style:normal;"> over the past week. VoltDB is led by <em>M. Stonebraker</em>, and it is an open source OLTP DBMS. It has several interesting properties:</span></em></p>
<ul>
<li>Running on shared-nothing clusters of commodity hardware</li>
<li>In-memory database</li>
<li>SQL support</li>
<li>ACID</li>
<li>Linear Scalability</li>
<li>Released as open source software</li>
</ul>
<p>There have already been some OLTP databases that run on shared-nothing clusters. However, they cannot take full advantage of the scalability of a shared-nothing architecture because of how they are implemented, for example their complex distributed locking and commit protocols <a href="#ref-1">[1]</a>. In addition, according to <a href="#ref-3">[3]</a>, traditional RDBMSs carry four overhead components: logging, locking, latching, and buffer management. M. Stonebraker claims that VoltDB eliminates these legacy overheads.</p>
<p>Among its many features, I am especially interested in its linear scalability combined with ACID guarantees and performance. This is meaningful because it gives today&#8217;s web applications an alternative to NoSQL data stores. Although VoltDB is still under heavy development, the features above and the following benchmark result look promising.</p>
<ul>
<li><a href="https://voltdb.com/blog/key-value-benchmarking">Key-Value Benchmark</a> (VoltDB versus Cassandra)</li>
</ul>
<p><a href="http://cassandra.apache.org/" target="_blank">Cassandra</a> is a remarkable key-value store and an open source project developed by Apache committers. It is well known as the most performant of the existing NoSQL stores. According to this benchmark result, however, VoltDB outperforms Cassandra in every case, although the fairness of the experiments is disputed.</p>
<ul>
<li><a href="http://community.voltdb.com/roadmap" target="_blank">VoltDB Roadmap</a></li>
</ul>
<p>Its future plans also look promising. I wonder how much attention VoltDB will get from the community and from industry.</p>
<h4>See Also:</h4>
<ol>
<li><a name="ref-1"></a><a id="ref-1" href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-cloud-ieee09.pdf" target="_blank">Data Management in the Cloud: Limitations and Opportunities</a></li>
<li><a href="http://pgsnake.blogspot.com/2010/05/comparing-voltdb-to-postgres.html" target="_blank">Comparing VoltDB vs Postgresql</a></li>
<li><a name="ref-3"></a><a href="http://cs-www.cs.yale.edu/homes/dna/papers/oltpperf-sigmod08.pdf" target="_blank">OLTP through the looking glass, and what we found there, ACM SIGMOD 2008</a></li>
<li><a href="http://voltdb.com/product">http://voltdb.com/product</a></li>
<li><a name="ref-5"></a><a href="http://db.cs.yale.edu/hstore/" target="_blank">H-Store: A Next Generation OLTP DBMS</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg" medium="image">
			<media:title type="html">VoltDB</media:title>
		</media:content>
	</item>
		<item>
		<title>Attempts to Improve HDFS Scalability (1)</title>
		<link>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/</link>
		<comments>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/#comments</comments>
		<pubDate>Mon, 24 May 2010 05:21:51 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[distributed file systems]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google file system]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[improvement]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scale-out]]></category>
		<category><![CDATA[scale-up]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=761</guid>
		<description><![CDATA[Not long ago, the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS namenode by using multiple nodes (HDFS-1052). After that, a Hadoop committer named Dhruba Borthakur proposed improvements to vertical scalability (The Curse of Singletons! The Vertical Scalability of Hadoop NameNode, HDFS-1093, HADOOP-6713). According to LinkedIn, Borthakur is a Hadoop engineer currently working at Facebook. The two proposals [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" width="200" height="50" /><br />
Not long ago, the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS namenode by using multiple nodes (<a href="https://issues.apache.org/jira/browse/HDFS-1052" target="_blank">HDFS-1052</a>). After that, a Hadoop committer named <a href="http://www.linkedin.com/in/dhruba" target="_blank">Dhruba Borthakur</a> proposed improvements to vertical scalability (<a href="http://hadoopblog.blogspot.com/2010/04/curse-of-singletons-vertical.html" target="_blank">The Curse of Singletons! The Vertical Scalability of Hadoop NameNode</a>, <a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>). According to LinkedIn, Borthakur is a Hadoop engineer currently working at Facebook.</p>
<p>The two proposals use the terms vertical scalability and horizontal scalability. Vertical scalability is the scalability gained by upgrading the specification of a system, mainly its CPU, memory, and hard disks. In a distributed system such as Hadoop, an increase in the number of cores per machine also counts as vertical scalability. Horizontal scalability, by contrast, is the scalability gained by increasing the number of systems, for example when the number of nodes grows from 10 to 20. The terms scale-up and scale-out are used with the same respective meanings.</p>
<p>Of the two proposals, this post introduces Dhruba Borthakur's proposal for improving vertical scalability. The bottlenecks that Borthakur identified in the Hadoop NameNode (currently Hadoop 0.21) are as follows.</p>
<ul>
<li><strong>Network</strong>: The cluster he uses at Facebook reportedly consists of about 2,000 nodes, and each server is configured to run 9 mappers and 6 reducers when a MapReduce program is running. When MapReduce runs on a cluster configured this way, the clients send about 30k requests to the NameNode at the same time. However, a single Listener thread in the Hadoop RPC server, implemented as a singleton, handles all of these messages, so considerable delays occurred and adding CPU cores reportedly had no effect.</li>
<li><strong>CPU</strong>: Because of the FSNamesystem locking mechanism, the namenode usually makes use of only about 2 cores even though it runs on an 8-core machine. According to Borthakur, the locking mechanism used by FSNamesystem is too simplistic, and although <a href="https://issues.apache.org/jira/browse/HADOOP-1269" target="_blank">HADOOP-1269</a> improved the situation, there is still room for improvement.</li>
<li><strong>Memory<span style="font-weight:normal;">:</span></strong> Hadoop's NameNode, faithful to the Google File System paper, keeps all metadata in memory. The HDFS of the cluster Borthakur uses holds 60 million files and 80 million blocks, and keeping the metadata for these files reportedly required as much as 58 GB of heap space.</li>
</ul>
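<p>The figures in the list above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope estimate of my own: it assumes that every mapper and reducer slot acts as one concurrent client, that 1 GB means 2^30 bytes, and that the heap is spread evenly over files and blocks. The class and method names are made up for illustration.</p>

```java
// Back-of-the-envelope estimates from the figures quoted in the post.
// Assumptions (not from the original source): one client per task slot,
// 1 GB = 2^30 bytes, and heap shared evenly by all files and blocks.
class NameNodeEstimates {

    // Concurrent clients if every mapper and reducer slot talks to the NameNode.
    static long concurrentClients(long nodes, long mappersPerNode, long reducersPerNode) {
        return nodes * (mappersPerNode + reducersPerNode);
    }

    // Average heap bytes per namespace object (file or block), rounded down.
    static long heapBytesPerObject(long heapGigabytes, long files, long blocks) {
        long heapBytes = heapGigabytes * (1L << 30);
        return heapBytes / (files + blocks);
    }

    public static void main(String[] args) {
        System.out.println(concurrentClients(2000, 9, 6));
        System.out.println(heapBytesPerObject(58, 60_000_000L, 80_000_000L));
    }
}
```

<p>Under these assumptions, 2,000 nodes with 15 task slots each give 30,000 concurrent clients, which lines up with the roughly 30k requests mentioned above, and the heap works out to a few hundred bytes per file or block.</p>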
<p>The solutions Borthakur proposed for these problems are as follows.</p>
<ul>
<li><strong>RPC Server</strong>: He attached a pool of Reader threads to the Listener thread, which had been implemented as a singleton. The Listener thread now only accepts connection requests, and one of the Reader threads handles the RPC itself. As a result, a large volume of RPC requests can make use of more CPU cores (<a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>).</li>
<li><strong>FSNamesystem lock</strong>: Borthakur gathered statistics on which file operations acquire the lock and found that getting the status of a file or directory and opening a file for reading account for 90% of all lock acquisitions. Because these two operations are read-only, he replaced the lock with a read-write lock to improve performance (<a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>). This approach is also explained well in Section 4.1, Namespace Management and Locking, of the Google File System paper (<a href="http://labs.google.com/papers/mapreduce.html" target="_blank">The Google File System</a>). The paper already describes acquiring read locks on the ancestor directories in the namespace data structure and a read or write lock on the target directory or file, so that as many operations as possible can access the shared data concurrently while consistency is maintained. Much like file append, which was added in Hadoop 0.20, this seems to have been a part that was described in the paper but not yet implemented. The details will only become clear after analyzing the source.</li>
</ul>
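<p>The read-write lock idea can be sketched with the standard java.util.concurrent classes. This is not the FSNamesystem code: ToyNamespace and its methods are hypothetical and only show the pattern, where read-only operations share a read lock and mutations take the exclusive write lock.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy namespace guarded by a read-write lock. NOT FSNamesystem code;
// the class, fields, and methods are hypothetical and only show the locking pattern.
class ToyNamespace {
    private final Map<String, Long> fileLengths = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Read-only operation (like getting a file's status): takes the shared
    // read lock, so many readers can proceed at the same time.
    Long getFileLength(String path) {
        lock.readLock().lock();
        try {
            return fileLengths.get(path);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Mutating operation (like creating a file): takes the exclusive write lock,
    // which waits until no reader or other writer holds the lock.
    void createFile(String path, long length) {
        lock.writeLock().lock();
        try {
            fileLengths.put(path, length);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Number of read locks currently held (for demonstration only).
    int activeReaders() {
        return lock.getReadLockCount();
    }
}
```

<p>With a single exclusive lock, every status lookup would serialize behind every other one. With this pattern, only calls such as createFile block the readers, which is presumably why the change pays off when read-only operations dominate.</p>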
<p>The memory problem, however, has reportedly not been solved yet. Even so, according to Borthakur, fixing the two problems above alone improved scalability by as much as 8x.</p>
<p>Attempts to improve HDFS scalability have been catching my eye lately and looked interesting, so a month ago I decided &#8216;I should summarize them on the blog when I have time&#8217;, and I have only just finished the first one. I have been short of time, so this post took about three weeks from start to finish. In the meantime, <em>;login:</em>, the <em>Usenix</em> magazine, published <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">HDFS Scalability: The Limits to Growth</a>, an article on the scalability of the Hadoop Namenode, and the Yahoo! Developer Network blog posted an introduction to the article (<a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">Scalability of the Hadoop Distributed File System</a>). I will summarize the rest as time permits.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/images/hadoop-logo.jpg" medium="image">
			<media:title type="html">Apache Hadoop</media:title>
		</media:content>
	</item>
		<item>
		<title>A Brief Summary of Independent Set in Graph Theory</title>
		<link>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/</link>
		<comments>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/#comments</comments>
		<pubDate>Sat, 24 Apr 2010 02:27:34 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[coloring problem]]></category>
		<category><![CDATA[dominating set]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graph coloring]]></category>
		<category><![CDATA[independent set]]></category>
		<category><![CDATA[maximal independent set]]></category>
		<category><![CDATA[maximum independent set]]></category>
		<category><![CDATA[mis]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=707</guid>
		<description><![CDATA[Graph Basics Let G be an undirected graph. G=(V,E), where V is a set of vertices and E is a set of edges.  Every edge e in E consists of two vertices in V of G. It is said to connect, join, or link the two vertices (or end points). Independent Set An independent set S [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=diveintodata.org&amp;blog=12237478&amp;post=707&amp;subd=diveintodata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<h3>Graph Basics</h3>
<p>Let <em>G</em> be an undirected graph. <em>G=(V,E)</em>, where <em>V</em> is a set of vertices and <em>E</em> is a set of edges.  Every edge <em>e</em> in <em>E</em> consists of two vertices in <em>V</em> of <em>G</em>. It is said to connect, join, or link the two vertices (or end points).</p>
<h3>Independent Set</h3>
<p>An independent set <em>S</em> is a subset of <em>V</em> in <em>G</em> such that no two vertices in <em>S</em> are adjacent. I suppose the name means that the vertices in an independent set <em>S</em> are independent of the edges of the graph <em>G</em>, in the sense that no edge joins two of them. Like other vertex sets in graph theory, independent sets come in maximal and maximum variants:</p>
<blockquote><p>The independent set <em>S</em> is <em><strong>maximal</strong><span style="font-style:normal;"> if </span>S</em> is not a proper subset of any independent set of <em>G.</em></p></blockquote>
<blockquote><p>The independent set <em>S</em> is <strong><em>maximum</em></strong> if no other independent set of <em>G</em> has more vertices than <em>S</em>.</p></blockquote>
<p>That is, a largest maximal independent set is called a maximum independent set. The maximum independent set problem is an NP-hard optimization problem.</p>
<p>Every graph has independent sets (the empty set is trivially independent). The independence number <em>α</em>(<em>G</em>) of a graph <em>G</em> is the cardinality of a maximum independent set.</p>
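<p>The difference between maximal and maximum is easy to see in code. The following is a minimal sketch, assuming the graph is given as a symmetric adjacency matrix; the class and method names are my own. It checks independence and builds a maximal independent set greedily.</p>

```java
// Graph as an adjacency matrix: adj[u][v] is true when u and v are joined by an edge.
class IndependentSets {

    // True if no two vertices marked in inSet are adjacent.
    static boolean isIndependent(boolean[][] adj, boolean[] inSet) {
        int n = adj.length;
        for (int u = 0; u < n; u++) {
            for (int v = u + 1; v < n; v++) {
                if (inSet[u] && inSet[v] && adj[u][v]) {
                    return false;
                }
            }
        }
        return true;
    }

    // Greedy construction of a maximal (not necessarily maximum) independent set:
    // visit vertices in index order and add each one that has no neighbor in the set.
    static boolean[] greedyMaximal(boolean[][] adj) {
        int n = adj.length;
        boolean[] inSet = new boolean[n];
        for (int u = 0; u < n; u++) {
            boolean hasNeighborInSet = false;
            for (int v = 0; v < n; v++) {
                if (inSet[v] && adj[u][v]) {
                    hasNeighborInSet = true;
                }
            }
            inSet[u] = !hasNeighborInSet;
        }
        return inSet;
    }

    // Helper: builds a symmetric adjacency matrix from an edge list.
    static boolean[][] fromEdges(int n, int[][] edges) {
        boolean[][] adj = new boolean[n][n];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        return adj;
    }
}
```

<p>On a star graph with center 0 and leaves 1, 2, 3, the greedy pass returns {0}. That set is maximal, because every leaf is adjacent to 0, but it is not maximum: the maximum independent set is {1, 2, 3}, so <em>α</em>(<em>G</em>) = 3.</p>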
<h3><strong>Relations to Dominating Sets</strong></h3>
<ul>
<li>A dominating set in a graph <em>G</em> is a subset <em>D</em> of <em>V</em> such that every vertex not in <em>D</em> is joined to at least one member of <em>D</em> by some edge.</li>
<li>In other words, a vertex set <em>D</em> is a dominating set in <em>G</em> if and only if every vertex of <em>G</em> is either contained in <em>D</em> or adjacent to a vertex in <em>D.</em></li>
<li>Every maximal independent set <em>S</em> of vertices in a simple graph <em>G</em> has the property that every vertex of the graph either is contained in <em>S</em> or is adjacent to a vertex in <em>S</em>.
<ul>
<li>That is, an independent set is a dominating set if and only if it is a maximal independent set.</li>
</ul>
</li>
</ul>
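<p>The relation in the list above can be checked on a small example. This is a minimal sketch under the same adjacency-matrix assumption; the names are hypothetical.</p>

```java
class DominatingSets {

    // True if every vertex is in the set or adjacent to a vertex in the set.
    static boolean isDominating(boolean[][] adj, boolean[] inSet) {
        int n = adj.length;
        for (int u = 0; u < n; u++) {
            boolean dominated = inSet[u];
            for (int v = 0; v < n && !dominated; v++) {
                dominated = inSet[v] && adj[u][v];
            }
            if (!dominated) {
                return false;
            }
        }
        return true;
    }

    // Path graph 0 - 1 - 2 - 3 as a symmetric adjacency matrix.
    static boolean[][] path4() {
        boolean[][] adj = new boolean[4][4];
        for (int u = 0; u + 1 < 4; u++) {
            adj[u][u + 1] = true;
            adj[u + 1][u] = true;
        }
        return adj;
    }
}
```

<p>On the path 0 - 1 - 2 - 3, the independent set {0, 2} is maximal and isDominating returns true for it. The independent set {0} is not maximal (vertex 2 or 3 could still be added), and it is not dominating either, since vertices 2 and 3 have no neighbor in it.</p>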
<h3>Relations to Graph Coloring</h3>
<ul>
<li>The independent set problem is related to the graph coloring problem: in a proper coloring, the vertices that share a color form an independent set, so all vertices of an independent set can be given the same color.</li>
</ul>
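<p>To make the connection to coloring concrete, here is a sketch of greedy coloring, again assuming an adjacency matrix and using names of my own. Greedy coloring produces a proper coloring but not necessarily one with the fewest colors; the second method checks that each color class is an independent set.</p>

```java
class GreedyColoring {

    // Assigns each vertex the smallest color not used by an already colored neighbor.
    // The result is a proper coloring, but not necessarily one with the fewest colors.
    static int[] color(boolean[][] adj) {
        int n = adj.length;
        int[] colorOf = new int[n];
        for (int u = 0; u < n; u++) {
            boolean[] used = new boolean[n];
            for (int v = 0; v < u; v++) {
                if (adj[u][v]) {
                    used[colorOf[v]] = true;
                }
            }
            int c = 0;
            while (used[c]) {
                c++;
            }
            colorOf[u] = c;
        }
        return colorOf;
    }

    // True if no edge joins two vertices of the same color,
    // i.e. every color class is an independent set.
    static boolean colorClassesAreIndependent(boolean[][] adj, int[] colorOf) {
        int n = adj.length;
        for (int u = 0; u < n; u++) {
            for (int v = u + 1; v < n; v++) {
                if (adj[u][v] && colorOf[u] == colorOf[v]) {
                    return false;
                }
            }
        }
        return true;
    }
}
```

<p>On the 4-cycle 0 - 1 - 2 - 3 - 0, the greedy pass assigns colors 0, 1, 0, 1, so the two color classes {0, 2} and {1, 3} are both independent sets.</p>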
<h3>References</h3>
<ul>
<li>Chapter 10, <a href="http://www.amazon.com/Graph-Theory-Modeling-Applications-Algorithms/dp/0131423843" target="_blank">Graph Theory: Modeling, Applications, and Algorithms</a></li>
<li><a href="http://en.wikipedia.org/wiki/Independent_set_(graph_theory)">http://en.wikipedia.org/wiki/Independent_set_(graph_theory)</a></li>
<li><a href="http://en.wikipedia.org/wiki/Dominating_set">http://en.wikipedia.org/wiki/Dominating_set</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Implementing a Server/Client with Hadoop RPC</title>
		<link>http://diveintodata.org/2010/04/20/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/</link>
		<comments>http://diveintodata.org/2010/04/20/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 12:04:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[rpc]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=659</guid>
		<description><![CDATA[Hadoop is by now a thoroughly well-known distributed computing framework. Most people think mainly of MapReduce programming when they hear Hadoop, but the Hadoop RPC it ships with and HDFS, its distributed file system, also look like good material for interesting experiments. This post introduces how to implement a simple server/client program using Hadoop RPC. Hadoop RPC Concept Hadoop RPC generally operates with one protocol interface and one Server [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=diveintodata.org&amp;blog=12237478&amp;post=659&amp;subd=diveintodata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="Apache Hadoop" width="200" height="50" /></p>
<p><a href="http://hadoop.apache.org/" target="_blank">Hadoop</a> is by now a thoroughly well-known distributed computing framework. Most people think mainly of <a href="http://labs.google.com/papers/mapreduce.html" target="_blank">MapReduce</a> programming when they hear Hadoop, but the Hadoop RPC it ships with and HDFS, its distributed file system, also look like good material for interesting experiments. This post introduces how to implement a simple server/client program using Hadoop RPC.</p>
<h3><strong>Hadoop RPC Concept</strong></h3>
<p>Hadoop RPC generally operates with one protocol interface, one Server, and one or more Clients. Instances of the Hadoop RPC server and of the client proxy are obtained through the class org.apache.hadoop.ipc.RPC, which is implemented internally with Java reflection. The parameters and return values of an RPC method can only be Java primitive types (e.g. int, long), String, and concrete classes that implement the Writable interface. Hadoop RPC also provides the basic server and client machinery itself, so there is no need to implement threads or socket communication by hand; the developer only has to define the RPC protocol interface and fill in the RPC methods.</p>
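<p>To get a feel for what a reflection-based client proxy means, here is a simplified, in-process sketch using only java.lang.reflect.Proxy from the JDK. It is not Hadoop's implementation: there is no network or serialization, and every name in it (HeartBeatProtocol, LocalServer, LocalRpc) is hypothetical.</p>

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// Simplified, in-process illustration of a reflection-based RPC proxy.
// Not Hadoop code: no network, no serialization, all names are hypothetical.
interface HeartBeatProtocol {
    String heartBeat();
}

class LocalServer implements HeartBeatProtocol {
    public String heartBeat() {
        return "Hello";
    }
}

class LocalRpc {
    // Returns a proxy implementing the protocol interface. Each call on the proxy
    // arrives here as (method, args) and is dispatched to the target by reflection;
    // a real RPC client would instead send the method name and arguments to the server.
    @SuppressWarnings("unchecked")
    static <T> T getProxy(Class<T> protocol, Object target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Method serverMethod =
                target.getClass().getMethod(method.getName(), method.getParameterTypes());
            try {
                return serverMethod.invoke(target, args);
            } catch (InvocationTargetException e) {
                // Rethrow the exception raised by the server method itself.
                throw e.getCause();
            }
        };
        return (T) Proxy.newProxyInstance(
            protocol.getClassLoader(), new Class<?>[] {protocol}, handler);
    }
}
```

<p>Calling LocalRpc.getProxy(HeartBeatProtocol.class, new LocalServer()).heartBeat() returns the server's "Hello" even though the caller only ever sees the interface. That is the property that lets the developer write nothing but the protocol interface and its methods.</p>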
<h3>Implementation of RPC Protocol</h3>
<p>An RPC protocol must be defined as an interface, and that interface must extend org.apache.hadoop.ipc.VersionedProtocol. VersionedProtocol declares the getProtocolVersion() method, which prevents a server and a client from communicating with different versions of the protocol when several versions exist.</p>
<p>An RPC protocol can be written as simply as the following example, an RPC protocol interface with a single RPC method, heartBeat(), that returns a String.</p>
<p><pre class="brush: java;">
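import java.io.IOException;
import org.apache.hadoop.ipc.VersionedProtocol;

public interface RPCProtocol extends VersionedProtocol {
  // Protocol version; the client passes this value to RPC.waitForProxy.
  public long versionID = 0;
  // The single RPC method: returns a String to the caller.
  public String heartBeat() throws IOException;
}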
</pre></p>
<h3>Implementation of RPC Server</h3>
<p>Next, implement the concrete class that acts as the server for the RPC protocol described above. The server class simply implements the RPCProtocol interface defined earlier (see the example below).</p>
<p><pre class="brush: java;">
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.ipc.RPC.Server;

public class TestServer implements RPCProtocol {

  // RPC method invoked by clients through their proxy.
  @Override
  public String heartBeat() throws IOException {
    return "Hello";
  }

  // Reports the protocol version so that a version mismatch can be detected.
  @Override
  public long getProtocolVersion(String protocol, long clientVersion) throws IOException {
    return RPCProtocol.versionID;
  }

  /**
   * Starts the RPC server on localhost:10000 and blocks until it stops.
   */
  public static void main(String[] args) throws IOException, InterruptedException {
    TestServer s = new TestServer();
    Configuration conf = new Configuration();
    // Bind the protocol implementation to the given address and port.
    Server server = RPC.getServer(s, "localhost", 10000, conf);
    server.start();
    server.join();
  }
}
</pre></p>
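<p>The String heartBeat() method declared in the RPCProtocol interface is implemented here as well. Its return value, &#8220;Hello&#8221;, is delivered to the RPC client that made the call.</p>
<p>The server is started in the main method. It first creates an instance of the concrete protocol class (TestServer) and passes it as an argument to RPC.getServer(). The getServer method additionally takes the IP address and port number the server should bind to, and returns an instance of the Server class. (Internally, it creates a Listener thread for the TestServer instance and binds it to the IP address and port passed as parameters. Whenever an RPC call arrives, the corresponding TestServer method is invoked, and the result is returned through a Responder thread.)</p>
<p>The signature of the RPC.getServer method is as follows.</p>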
<table border="1" cellspacing="0" cellpadding="3" width="100%">
<tbody>
<tr bgcolor="white">
<td width="1%" align="right" valign="top"><code><span style="color:#000000;">static RPC.Server</span></code></td>
<td><code><strong><span style="color:#000000;">getServer</span></strong><span style="color:#000000;">(Object instance, String bindAddress, int port, Configuration conf)</span></code><span style="color:#000000;"><br />
</span></td>
</tr>
</tbody>
</table>
<h3>Implementation of RPC Client</h3>
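<p>A client is obtained simply by calling the RPC.waitForProxy method. The client then uses the returned proxy instance to call RPC methods and receive responses from the server. The table below shows the signature of RPC.getProxy, the related method that creates such a proxy.</p>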
<table border="1" cellspacing="0" cellpadding="3" width="100%">
<tbody>
<tr bgcolor="white">
<td width="1%" align="right" valign="top"><code><span style="color:#000000;">static VersionedProtocol</span></code></td>
<td><code><strong><span style="color:#000000;">getProxy</span></strong><span style="color:#000000;">(Class&lt;?&gt; protocol, long clientVersion, InetSocketAddress addr, UserGroupInformation ticket,Configuration conf, SocketFactory factory)</span></code></td>
</tr>
</tbody>
</table>
<p><pre class="brush: plain;">
import java.io.IOException;
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

public class TestClient {

  /**
   * Connects to the RPC server on localhost:10000 and calls heartBeat() once per second.
   */
  public static void main(String[] args) throws IOException, InterruptedException {
    Configuration conf = new Configuration();
    InetSocketAddress addr = new InetSocketAddress("localhost", 10000);
    // Obtain a client-side proxy that implements RPCProtocol.
    RPCProtocol rpc = (RPCProtocol) RPC.waitForProxy(RPCProtocol.class,
        RPCProtocol.versionID, addr, conf);

    String msg = null;
    while (true) {
      Thread.sleep(1000);
      // Remote call: executed by TestServer, result returned to the client.
      msg = rpc.heartBeat();
      System.out.println(msg);
    }
  }
}
</pre></p>
<p>The example above shows how the proxy instance variable rpc is used to call rpc.heartBeat() and obtain the result from the server.</p>
<h3>Test</h3>
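<p>Start the server first and then the client. In fact, starting them in the opposite order does not cause much trouble either: if the Hadoop RPC client is started first, it retries the connection at one-second intervals until it reaches the RPC server.</p>
<p>If everything runs correctly, you will see output like the following.</p>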
<pre>Hello
Hello
Hello
Hello
Hello
...</pre>
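<p>The retry behaviour described in this section, where the client keeps trying at a fixed interval until the server is reachable, can be sketched as a small generic helper. This is a hypothetical illustration, not the code inside RPC.waitForProxy; the maxAttempts parameter in particular is my own addition so that the loop is guaranteed to end.</p>

```java
import java.util.concurrent.Callable;

// Generic retry helper illustrating a fixed-interval reconnect loop.
// Hypothetical sketch; this is not the code inside RPC.waitForProxy.
class RetryUntilConnected {

    // Calls connect until it succeeds, sleeping intervalMillis after each failure.
    // Gives up and rethrows the last failure after maxAttempts attempts.
    static <T> T retry(Callable<T> connect, long intervalMillis, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(intervalMillis);
                }
            }
        }
        throw last;
    }
}
```

<p>A call such as RetryUntilConnected.retry(() -&gt; openConnection(), 1000, 60), with openConnection standing in for whatever establishes the connection, would try once per second for up to a minute.</p>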
<h3>References</h3>
<ul>
<li><a href="http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.html">http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/ipc/RPC.html</a></li>
<li><a href="http://www.supermind.org/blog/520">http://www.supermind.org/blog/520</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/04/20/hadoop-rpc%eb%a5%bc-%ec%9d%b4%ec%9a%a9%ed%95%9c-%ec%84%9c%eb%b2%84%ed%81%b4%eb%9d%bc%ec%9d%b4%ec%96%b8%ed%8a%b8-%ea%b5%ac%ed%98%84/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/images/hadoop-logo.jpg" medium="image">
			<media:title type="html">Apache Hadoop</media:title>
		</media:content>
	</item>
		<item>
		<title>An Attempt at Korean Full Text Search with Postgresql</title>
		<link>http://diveintodata.org/2010/03/22/postgresql%eb%a1%9c-%ed%95%9c%ea%b8%80-full-text-search-%ec%8b%9c%eb%8f%84%ea%b8%b0/</link>
		<comments>http://diveintodata.org/2010/03/22/postgresql%eb%a1%9c-%ed%95%9c%ea%b8%80-full-text-search-%ec%8b%9c%eb%8f%84%ea%b8%b0/#comments</comments>
		<pubDate>Mon, 22 Mar 2010 10:40:40 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[full text search]]></category>
		<category><![CDATA[hunspell]]></category>
		<category><![CDATA[한글]]></category>
		<category><![CDATA[korean]]></category>
		<category><![CDATA[postgresql]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=618</guid>
		<description><![CDATA[I recently had occasion to try full text search (FTS) with Postgresql. Perhaps because Postgresql has such a long history, it offers a variety of approaches to full text search. It provides methods such as pg_trgm and tsearch2, and index types such as GIN (Generalized Inverted Index) and GiST (Generalized Search Tree). It is generally said to deliver satisfactory performance up to about one million rows. There is a reason, though, that the title stops at &#8216;attempt&#8217;. To explain, a few words about FTS first [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=diveintodata.org&amp;blog=12237478&amp;post=618&amp;subd=diveintodata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>I recently had occasion to try full text search (FTS) with <a href="postgresql.org" target="_blank">Postgresql</a>. Perhaps because Postgresql has such a long history, it offers a variety of approaches to full text search. It provides methods such as <a href="http://www.postgresql.org/docs/current/static/pgtrgm.html" target="_blank">pg_trgm</a> and <a href="http://www.postgresql.org/docs/current/static/tsearch2.html" target="_blank">tsearch2</a>, and index types such as <a href="http://www.postgresql.org/docs/current/static/gin.html" target="_blank">GIN (Generalized Inverted Index)</a> and <a href="http://www.postgresql.org/docs/current/static/gist.html" target="_blank">GiST (Generalized Search Tree)</a>. It is generally said to deliver satisfactory performance up to about one million rows.</p>
<p>There is a reason, though, that the title stops at &#8216;attempt&#8217;. To explain, a few words about FTS first: FTS is not simply a matter of finding rows that contain a substring, as SQL LIKE does. It excludes stop words such as particles from the document, extracts the stems of words, and even takes the edit distance between words into account so that similarly spelled words also appear in the search results. A dictionary is therefore essential for stop-word handling and morphological analysis. According to the Postgresql <a href="http://www.postgresql.org/docs/8.4/static/textsearch-dictionaries.html">manual</a>, the ispell, myspell, and hunspell formats are supported.</p>
<p>Looking for a dictionary, I found the <a href="http://code.google.com/p/spellcheck-ko/" target="_blank">hunspell-ko</a> project led by the Debian maintainer cwryu. Relief&#8230; but the joy was short-lived: postgresql could not read the dictionary properly. Advice from cwryu on IRC solved that problem, but postgresql does not accept a dictionary unless its flag option is set to ASCII (FLAG default), and Korean cannot be handled with the default setting.</p>
<p>I have <a href="http://archives.postgresql.org/pgsql-bugs/2010-03/msg00163.php" target="_blank">reported</a> this issue to the postgresql bugs mailing list. My conclusion is that proper Korean full text search with postgresql still seems a long way off. pg_trgm works for short sentences, but false negatives occur for runs of about five or more characters written without spaces. False positives can be dealt with by applying one more filter, even if it costs some performance, but false negatives cannot be worked around this way.</p>
<p>Even if it takes a while, I intend to make time and keep reporting on this problem. I will post progress updates along the way.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/03/22/postgresql%eb%a1%9c-%ed%95%9c%ea%b8%80-full-text-search-%ec%8b%9c%eb%8f%84%ea%b8%b0/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Data-Intensive Text Processing with MapReduce Draft Available Online</title>
		<link>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/</link>
		<comments>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 01:46:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data intensive]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[text processing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=605</guid>
		<description><![CDATA[Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although the book is still a draft, it seems well organized and very interesting. In addition, the book contains some basic graph algorithms [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=diveintodata.org&amp;blog=12237478&amp;post=605&amp;subd=diveintodata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.umiacs.umd.edu/~jimmylin/book.html" target="_blank">Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer</a></p>
<p>Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although the book is still a draft, it seems well organized and very interesting. In addition, the book contains some basic graph algorithms using MapReduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Apple's Tablet, the iPad, Has Been Announced</title>
		<link>http://diveintodata.org/2010/01/28/%ec%95%a0%ed%94%8c-%ed%83%80%ed%94%8c%eb%a6%bf-ipad-%eb%b0%9c%ed%91%9c-%eb%90%90%ea%b5%b0%ec%9a%94/</link>
		<comments>http://diveintodata.org/2010/01/28/%ec%95%a0%ed%94%8c-%ed%83%80%ed%94%8c%eb%a6%bf-ipad-%eb%b0%9c%ed%91%9c-%eb%90%90%ea%b5%b0%ec%9a%94/#comments</comments>
		<pubDate>Wed, 27 Jan 2010 20:08:16 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[apple]]></category>
		<category><![CDATA[ipad]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=598</guid>
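		<description><![CDATA[There was a lot of noise even before it came out, and it seems it was not mere media hype after all. The links below cover the announcement, product photos, and a video. The starting price of $499 is a bit of a burden. http://www.engadget.com/2010/01/27/live-from-the-apple-tablet-latest-creation-event/ http://www.apple.com/ipad/ http://www.apple.com/ipad/#video What I found interesting is that the SDK, the programming guidelines, and even the human interface guidelines were already prepared at the time of the announcement and were introduced on the website right away. Some Korean companies that play the media as a matter of routine could learn [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=diveintodata.org&amp;blog=12237478&amp;post=598&amp;subd=diveintodata&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>There was a lot of noise even before it came out, and it seems it was not mere media hype after all. The links below cover the announcement, product photos, and a video. The starting price of $499 is a bit of a burden.</p>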
<ul>
<li><a href="http://www.engadget.com/2010/01/27/live-from-the-apple-tablet-latest-creation-event/">http://www.engadget.com/2010/01/27/live-from-the-apple-tablet-latest-creation-event/</a></li>
<li><a href="http://www.apple.com/ipad/">http://www.apple.com/ipad/</a></li>
<li><a href="http://www.apple.com/ipad/#video">http://www.apple.com/ipad/#video</a></li>
</ul>
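<p>What I found interesting is that the SDK, the programming guidelines, and even the human interface guidelines were already prepared at the time of the announcement and were introduced on the website right away. Some Korean companies that play the media as a matter of routine could learn a thing or two from this.</p>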
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/01/28/%ec%95%a0%ed%94%8c-%ed%83%80%ed%94%8c%eb%a6%bf-ipad-%eb%b0%9c%ed%91%9c-%eb%90%90%ea%b5%b0%ec%9a%94/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>A New Concept in Social Services: Sekai Camera</title>
		<link>http://diveintodata.org/2009/12/22/%ec%83%88%eb%a1%9c%ec%9a%b4-%ea%b0%9c%eb%85%90%ec%9d%98-%ec%86%8c%ec%85%9c-%ec%84%9c%eb%b9%84%ec%8a%a4-sekai-camera/</link>
		<comments>http://diveintodata.org/2009/12/22/%ec%83%88%eb%a1%9c%ec%9a%b4-%ea%b0%9c%eb%85%90%ec%9d%98-%ec%86%8c%ec%85%9c-%ec%84%9c%eb%b9%84%ec%8a%a4-sekai-camera/#comments</comments>
		<pubDate>Tue, 22 Dec 2009 04:27:57 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[augmented reality]]></category>
		<category><![CDATA[iphone]]></category>
		<category><![CDATA[social networks]]></category>
		<category><![CDATA[ucc]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=585</guid>
		<description><![CDATA[An app called Sekai Camera has reportedly been released on the App Store in a global version. From a quick look, it seems to be a new kind of social service that combines augmented reality, UCC (user-created content), and social networking. Services like this, built on a variety of media and devices, have been springing up everywhere lately, and I am really looking forward to what the next three to five years will bring. Many related data management issues will be raised as well. However, domestic [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align:center;"><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='425' height='344' src='http://www.youtube.com/embed/KgTwSXK_5dg?version=3&amp;rel=1&amp;fs=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent' frameborder='0'></iframe></span></p>
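<p>An app called Sekai Camera has reportedly been released on the App Store in a global version. From a quick look, it seems to be a new kind of social service that combines augmented reality, UCC (user-created content), and social networking. Services like this, built on a variety of media and devices, have been springing up everywhere lately, and I am really looking forward to what the next three to five years will bring. Many related data management issues will be raised as well. I do wonder, though, what ideas Korean IT companies currently have for preparing for the future amid such rapid changes in media and technology.</p>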
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/12/22/%ec%83%88%eb%a1%9c%ec%9a%b4-%ea%b0%9c%eb%85%90%ec%9d%98-%ec%86%8c%ec%85%9c-%ec%84%9c%eb%b9%84%ec%8a%a4-sekai-camera/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>How to Create A Table in HBase for Beginners</title>
		<link>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/</link>
		<comments>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 02:33:36 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[create table]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=527</guid>
		<description><![CDATA[I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase by participating in several projects. From now on, I&#8217;ll post HBase know-how periodically. Today, I&#8217;m going to show how to create an HBase table in Java. HBase provides two ways for an HBase client to connect to the HBase master. [...]]]></description>
			<content:encoded><![CDATA[<p>I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase by participating in several projects. From now on, I&#8217;ll post HBase know-how periodically. Today, I&#8217;m going to show how to create an HBase table in Java.</p>
<p>HBase provides two ways for an HBase client to connect to the HBase master. One is to use an instance of the <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html" target="_blank">HBaseAdmin</a> class. HBaseAdmin provides methods for creating, modifying, and deleting tables and column families. The other is to use an instance of the HTable class, which mainly provides methods for manipulating data, such as inserting, modifying, and deleting rows and cells.</p>
<p>Thus, to create an HBase table, we connect to the HBase master by creating an instance of HBaseAdmin, as in the <em>new HBaseAdmin(conf)</em> line of the listing below. HBaseAdmin requires an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HBaseConfiguration.html" target="_blank">HBaseConfiguration</a>. If necessary, you can set configuration properties on it, as the <em>conf.set</em> call does.</p>
<p>To describe the HBase schema, we create an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HColumnDescriptor.html" target="_blank">HColumnDescriptor</a> for each column family. In addition to the column family name, HColumnDescriptor lets you set various parameters, such as maxVersions, compression type, timeToLive, and bloomFilter. Then we create the HBase table by invoking createTable, as in the last statement of the listing.</p>
<p><pre class="brush: java;">
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateTable {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    conf.set(&quot;hbase.master&quot;, &quot;localhost:60000&quot;); // address of the HBase master
    HBaseAdmin hbase = new HBaseAdmin(conf); // connects to the master
    HTableDescriptor desc = new HTableDescriptor(&quot;TEST&quot;);
    // One HColumnDescriptor per column family.
    HColumnDescriptor personal = new HColumnDescriptor(&quot;personal&quot;.getBytes());
    HColumnDescriptor account = new HColumnDescriptor(&quot;account&quot;.getBytes());
    desc.addFamily(personal);
    desc.addFamily(account);
    hbase.createTable(desc);
  }
}
</pre></p>
<p>Finally, you can check your HBase table from the HBase shell with the following commands.</p>
<p><pre class="brush: bash;">
c0d3h4ck@code:~/Development/hbase$ bin/hbase shell
HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
Version: 0.20.1, r822817, Wed Oct  7 11:55:42 PDT 2009
hbase(main):001:0&gt; list
TEST

1 row(s) in 0.0940 seconds
</pre> </p>
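<p>To make the maxVersions parameter mentioned above more concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not HBase code: the class ColumnFamilySketch and its put and get methods are hypothetical names invented for illustration, and it trims old versions eagerly for simplicity, which is an assumption of the sketch rather than a description of HBase internals. It only shows the idea that a column family keeps at most maxVersions timestamped values per cell and returns the newest first.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical toy model of one column family; not part of the HBase API.
final class ColumnFamilySketch {
    private final int maxVersions;
    // row key -> column qualifier -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    ColumnFamilySketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    // Stores a new version of a cell and discards versions beyond maxVersions.
    void put(String row, String qualifier, long timestamp, String value) {
        NavigableMap<Long, String> versions = rows
                .computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(qualifier, q -> new TreeMap<>(Comparator.<Long>reverseOrder()));
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // last entry in reverse order is the oldest version
        }
    }

    // Returns the retained versions of a cell, newest first (empty if the cell is absent).
    List<String> get(String row, String qualifier) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(qualifier)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(columns.get(qualifier).values());
    }

    public static void main(String[] args) {
        ColumnFamilySketch personal = new ColumnFamilySketch(2);
        personal.put("row1", "name", 1L, "first");
        personal.put("row1", "name", 2L, "second");
        personal.put("row1", "name", 3L, "third");
        System.out.println(personal.get("row1", "name"));
    }
}
```

<p>With maxVersions set to 2, the oldest of the three writes is no longer returned. In real HBase code the equivalent setting would be made on the HColumnDescriptor before calling createTable.</p>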
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>ACM SIGMOD 2010 Programming Contest</title>
		<link>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/</link>
		<comments>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 11:44:06 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[programming contest]]></category>
		<category><![CDATA[relational database]]></category>
		<category><![CDATA[SIGMOD]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=515</guid>
		<description><![CDATA[As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. SIGMOD holds an annual conference that is regarded as one of the best conferences in computer science. SIGMOD also organizes a programming contest in parallel with the ACM SIGMOD conference. The description below is the call for this year&#8217;s programming contest. [...]]]></description>
			<content:encoded><![CDATA[<p>As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. SIGMOD holds an annual conference that is regarded as one of the best conferences in computer science. SIGMOD also organizes a programming contest in parallel with the ACM SIGMOD conference. The description below is the call for this year&#8217;s programming contest. This year&#8217;s subject seems very interesting! The task is to implement a simple distributed query executor built on top of last year&#8217;s main-memory index. The environment on which contestants will test their implementations may be provided by Amazon. If you are interested in this programming contest, give it a try. You can get further information here (<a href="http://dbweb.enst.fr/events/sigmod10contest/" target="_blank">http://dbweb.enst.fr/events/sigmod10contest</a>).</p>
<blockquote><p>A programming contest is organized in parallel with the ACM SIGMOD 2010 conference, following the success of the first annual SIGMOD programming contest organized last year. Student teams from degree-granting institutions are invited to compete to develop a distributed query engine over relational data. Submissions will be judged on the overall performance of the system on a variety of workloads. A shortlist of finalists will be invited to present their implementation at the SIGMOD conference in June 2010 in Indianapolis, USA. The winning team, to be selected during the conference, will be awarded a prize of 5,000 USD and will be invited to a one-week research visit in Paris. The winning system, released in open source, will form a building block of a complete distributed database system which will be built over the years, throughout the programming contests.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>CIKM 2009 in Hong Kong</title>
		<link>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/</link>
		<comments>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/#comments</comments>
		<pubDate>Mon, 09 Nov 2009 15:08:26 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[cikm]]></category>
		<category><![CDATA[cikm09]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[Hong Kong]]></category>
		<category><![CDATA[spider]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=473</guid>
		<description><![CDATA[Together with Min Kyoung Sung, a coauthor of &#8216;SPIDER : A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data&#8217;, I attended the 18th ACM CIKM 2009 (Conference on Information and Knowledge Management), held in Hong Kong. We stayed at the Marriott Hotel near the Asia World-Expo, where CIKM 2009 was held. At [...]]]></description>
			<content:encoded><![CDATA[<p>Together with Min Kyoung Sung, a coauthor of &#8216;<a href="http://dbserver.korea.ac.kr/projects/spider/" target="_blank"><em>SPIDER : A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data</em></a>&#8217;, I attended the <a href="http://www.comp.polyu.edu.hk/conference/cikm2009/about/index.htm" target="_blank">18th ACM CIKM 2009 (Conference on Information and Knowledge Management)</a>, held in Hong Kong. We stayed at the Marriott Hotel near the <a href="http://www.asiaworld-expo.com/" target="_blank">Asia World-Expo</a>, where CIKM 2009 was held. At the conference, I spent time with several Korean researchers (Kyong-Ha Lee, Jinoh Oh, and Sangchul Kim), and during the demonstration session I discussed SPIDER with researchers interested in RDF data processing.</p>
<p>At CIKM 2009, I felt that the focus of web data management is shifting toward information extraction and semantic or structured web data rather than unstructured data. Many papers and posters addressed these issues. In addition, the subject of the panel was ‘<span><strong><em>Information extraction meets relational databases: Where are we heading?</em></strong></span>’ One of the panelists said that the hot spot of web data management research is moving from crawling, indexing, and searching to information extraction and semantic data. These changes raise a variety of new data and knowledge management issues. Besides information extraction, graph data mining was one of the hottest topics at CIKM 2009.</p>
<p>In the main keynote, Kyu-Young Whang (KAIST, Korea) presented &#8216;<span style="font-style:italic;font-weight:bold;">DB-IR Integration and Its Application to a Massively-Parallel Search Engine</span>&#8217;. His central point was that DB-IR integration is becoming one of the major challenges in the database area and is leading to new DBMS architectures suited to it. In addition, Edward Chang (Google Research China) and Clement Yu (University of Illinois at Chicago) presented &#8216;<strong><em>Confucius and its intelligent Disciples</em></strong>&#8217; and &#8216;<strong><em>Advanced Metasearch Engines</em></strong>&#8217;, respectively.</p>
<p style="text-align:center;"><a class="flickr-image alignnone" title="Coffee Break at CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088464259/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2764/4088464259_4f6498eca2_m.jpg" alt="Coffee Break at CIKM 2009" /></a><a class="flickr-image alignnone" title="SPIDER in Demo Session" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088463803/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2752/4088463803_b53bbd8646_m.jpg" alt="SPIDER in Demo Session" /></a></p>
<p style="text-align:center;"><a class="flickr-image alignnone" title="Tian Tan Buddha Statue in Hong Kong" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088461317/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2609/4088461317_5546d70eff_m.jpg" alt="Tian Tan Buddha Statue in Hong Kong" /></a><a class="flickr-image alignnone" title="The lunch time in CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088462251/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2591/4088462251_d5875a68e3_m.jpg" alt="The lunch time in CIKM 2009" /></a></p>
<p>This conference was a really nice experience for me. I enjoyed the conference, the reception, and the banquet. However, I regret that I didn&#8217;t attend <a href="http://www.clouddb.org/CloudDB09/" target="_blank">the 1st Workshop CloudDB 2009</a>, held in conjunction with CIKM 2009.</p>
<p>In any case, this conference inspired both Min Kyoung Sung and me. It will stay in our minds for a long time.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2764/4088464259_4f6498eca2_m.jpg" medium="image">
			<media:title type="html">Coffee Break at CIKM 2009</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2752/4088463803_b53bbd8646_m.jpg" medium="image">
			<media:title type="html">SPIDER in Demo Session</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2609/4088461317_5546d70eff_m.jpg" medium="image">
			<media:title type="html">Tian Tan Buddha Statue in Hong Kong</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2591/4088462251_d5875a68e3_m.jpg" medium="image">
			<media:title type="html">The lunch time in CIKM 2009</media:title>
		</media:content>
	</item>
		<item>
		<title>MapReduce Online Comes Out!</title>
		<link>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/</link>
		<comments>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 15:49:37 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[online aggregation]]></category>
		<category><![CDATA[stream queries]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=461</guid>
		<description><![CDATA[MapReduce has been gaining much attention in the field of data-intensive computing. As you know, it is well known as a very popular framework for batch processing. Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, has developed MapReduce Online. I heard this news today from Data Beta. It is impressive work, since the [...]]]></description>
			<content:encoded><![CDATA[<p>MapReduce has been gaining much attention in the field of data-intensive computing. As you know, it is well known as a very popular framework for batch processing.</p>
<p>Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, has developed <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>. I heard this news today from <a href="http://databeta.wordpress.com/2009/10/18/mapreduce-online/" target="_self">Data Beta</a>. It is impressive work, since the original MapReduce was designed specifically for batch processing. In addition, most people believed that MapReduce would remain a batch-processing framework.</p>
<p>The essence of MapReduce Online is that it aims to preserve the fault-tolerance model of the <a href="http://labs.google.com/papers/mapreduce.html" target="_self">original MapReduce</a> while pipelining results across tasks and jobs instead of materializing the output of each MapReduce task and job to disk. Consequently, MapReduce Online enables a program to return results earlier from a big job.</p>
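<p>The difference between materializing and pipelining can be sketched in a few lines of plain Java. This is only an illustration of the idea, not code from MapReduce Online or Hadoop: the class PipelinedWordCountSketch and its run method are hypothetical, everything runs in one thread, and fault tolerance is not modeled at all. Each input split is mapped and its records are folded into the reducer&#8217;s running state immediately, so a snapshot of partial counts is available after every split rather than only at the end of the job.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-threaded illustration of pipelined word counting.
final class PipelinedWordCountSketch {
    /**
     * Feeds each input split through "map" and straight into the reducer's
     * running state, recording a snapshot of the partial counts after every split.
     */
    static List<Map<String, Integer>> run(List<String> splits) {
        Map<String, Integer> running = new TreeMap<>();       // reducer state kept in memory
        List<Map<String, Integer>> snapshots = new ArrayList<>();
        for (String split : splits) {
            for (String word : split.trim().split("\\s+")) {   // map: one record per word
                if (!word.isEmpty()) {
                    running.merge(word, 1, Integer::sum);      // reduce: fold the record in at once
                }
            }
            snapshots.add(new TreeMap<>(running));             // early, partial answer
        }
        return snapshots;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> snapshots = run(Arrays.asList("a b a", "b c"));
        for (Map<String, Integer> snapshot : snapshots) {
            System.out.println(snapshot);
        }
    }
}
```

<p>A batch-style job would expose only the last snapshot. The earlier snapshots correspond to the early results that pipelining makes possible.</p>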
<p>You can get further information from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>BSP Library on Hadoop?</title>
		<link>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/</link>
		<comments>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 11:45:33 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[angrapa]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[bsp]]></category>
		<category><![CDATA[bulk synchronization parallel]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hama]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=443</guid>
		<description><![CDATA[Recently, I started participating in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing a bulk synchronous parallel (BSP) library on Hadoop (HAMA-195); I&#8217;m getting help from Edward Yoon, a founder of the Hama project. The motivation for the BSP library is [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I started participating in the <a href="http://incubator.apache.org/hama/" target="_self">Hama project</a> (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing a <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel" target="_self">bulk synchronous parallel</a> (BSP) library on Hadoop (<a href="https://issues.apache.org/jira/browse/HAMA-195" target="_self">HAMA-195</a>); I&#8217;m getting help from <a href="http://blog.udanax.org/" target="_self">Edward Yoon</a>, a founder of the Hama project. The motivation for the BSP library is clear.</p>
<p>Hadoop is deployed at cloud computing service providers and many companies, as you can see at <a href="http://wiki.apache.org/hadoop/PoweredBy" target="_self">http://wiki.apache.org/hadoop/PoweredBy</a>. However, most of them probably run only MapReduce programs. Although MapReduce is highly scalable, it provides only a simple programming model. Many programmers want to use a wider variety of programming models without changing the platform (i.e., <a href="http://hadoop.apache.org" target="_self">Hadoop</a>). This BSP library is a first step toward meeting that need. However, like MapReduce, BSP is not a Swiss Army knife either. Once we find suitable applications, the BSP library on Hadoop will be valued for its scalability and capability.</p>
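<p>For readers who have not seen the BSP model, here is a minimal sketch in plain Java threads. It is not the Hama BSP API from HAMA-195: the class BspMaxSketch and its run method are hypothetical names, and a real library would distribute the workers across machines. A BSP program proceeds in supersteps in which each worker sends messages, waits at a barrier, and then computes locally on the messages it received. In this toy, workers are arranged in a ring (an assumption made for the example) and every worker ends up knowing the global maximum.</p>

```java
import java.util.Arrays;
import java.util.concurrent.CyclicBarrier;

// Hypothetical single-process illustration of BSP supersteps.
final class BspMaxSketch {
    /** Every worker starts with one number and ends up knowing the global maximum. */
    static int[] run(int[] initial) {
        final int n = initial.length;            // assumes at least one worker
        final int[] value = initial.clone();     // local state, one slot per worker
        final int[][] inbox = new int[2][n];     // double-buffered message slots
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] workers = new Thread[n];
        for (int w = 0; w < n; w++) {
            final int id = w;
            workers[w] = new Thread(() -> {
                try {
                    for (int step = 0; step < n - 1; step++) {
                        int[] box = inbox[step % 2];
                        box[(id + 1) % n] = value[id];            // communication: send to the right neighbour
                        barrier.await();                          // barrier: all messages are now delivered
                        value[id] = Math.max(value[id], box[id]); // local computation on the received message
                    }
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            workers[w].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(new int[] {3, 9, 1, 4})));
    }
}
```

<p>The two alternating inboxes let a single barrier per superstep suffice: a worker can only overwrite a slot two supersteps later, by which time its neighbour has already read it.</p>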
<p>Soon, I&#8217;ll post articles about the progress of the BSP library and <a href="http://wiki.apache.org/hama/GraphPackage" target="_self">Angrapa</a> (the graph package on Hama).</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Google&#8217;s New Location-based Service</title>
		<link>http://diveintodata.org/2009/10/02/googles-new-location-based-service/</link>
		<comments>http://diveintodata.org/2009/10/02/googles-new-location-based-service/#comments</comments>
		<pubDate>Thu, 01 Oct 2009 15:08:07 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[google]]></category>
		<category><![CDATA[location-based service]]></category>
		<category><![CDATA[map]]></category>
		<category><![CDATA[mobile service]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=430</guid>
		<description><![CDATA[I always feel that Google leads the way in internet services. Recently, Google Maps began letting users search for locations by query keywords such as restaurant, hospital, and gas station. The results can be ordered by distance from the user&#8217;s location, by user-preferred ranking, or by both. In addition, Google presents the new local [...]]]></description>
			<content:encoded><![CDATA[<p>I always feel that Google leads the way in internet services. Recently, <a href="http://maps.google.com/" target="_blank">Google Maps</a> began letting users search for locations by query keywords such as restaurant, hospital, and gas station. The results can be ordered by distance from the user&#8217;s location, by user-preferred ranking, or by both. In addition, Google has introduced a new local search for mobile. This service lets users mark locations with stars and call starred places with only a few clicks. The video below shows the service.</p>
<div><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='560' height='340' src='http://www.youtube.com/embed/Y_62nFjUW7Q?version=3&amp;rel=1&amp;fs=1&amp;showsearch=0&amp;showinfo=1&amp;iv_load_policy=1&amp;wmode=transparent' frameborder='0'></iframe></span></div>
<div>These services are not actually new from an academic point of view, but Google is turning ideas described in the literature into real products.</div>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/02/googles-new-location-based-service/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Java Universal Network/Graph Framework</title>
		<link>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/</link>
		<comments>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 23:30:45 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jung]]></category>
		<category><![CDATA[visualization tools]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=349</guid>
		<description><![CDATA[Lately, I have been working mainly on large-scale graph data processing. Visualizing a graph is sometimes a good way to observe properties of a graph data set. Today, I&#8217;m going to introduce a graph framework called the Java Universal Network/Graph Framework (Jung). Jung provides data structures for graphs, a programming interface built around graph [...]]]></description>
			<content:encoded><![CDATA[<p>Lately, I have been working mainly on large-scale graph data processing. Visualizing a graph is sometimes a good way to observe properties of a graph data set. Today, I&#8217;m going to introduce a graph framework called the <em><a href="http://jung.sourceforge.net/" target="_blank">Java Universal Network/Graph Framework (Jung)</a>. </em>Jung provides data structures for graphs, a programming interface built around graph concepts, some fundamental graph algorithms (e.g., minimum spanning tree, depth-first search, breadth-first search, and Dijkstra&#8217;s algorithm), and even visualization methods. I&#8217;m especially interested in its visualization methods.</p>
<p>The following Java source shows Jung&#8217;s programming interface. The program creates a graph, adds three vertices to it, and connects them with edges. The code is taken from the <a href="http://jung.sourceforge.net/doc/index.html" target="_blank">Jung tutorial</a>. As you can see, Jung&#8217;s APIs are very easy to use.</p>
<p><pre class="brush: java;">
// The imports below assume the JUNG 2.0 package layout.
import java.awt.Dimension;
import javax.swing.JFrame;

import edu.uci.ics.jung.algorithms.layout.KKLayout;
import edu.uci.ics.jung.algorithms.layout.Layout;
import edu.uci.ics.jung.graph.Graph;
import edu.uci.ics.jung.graph.SparseMultigraph;
import edu.uci.ics.jung.graph.util.EdgeType;
import edu.uci.ics.jung.visualization.BasicVisualizationServer;
import edu.uci.ics.jung.visualization.decorators.ToStringLabeller;

public class SimpleGraphView {
  public static void main(String[] args) {
    // Create a graph backed by a SparseMultigraph instance.
    Graph&lt;Integer, String&gt; g = new SparseMultigraph&lt;Integer, String&gt;();
    g.addVertex((Integer)1); // Add a vertex whose value is the integer 1
    g.addVertex((Integer)2);
    g.addVertex((Integer)3);
    g.addEdge(&quot;Edge-A&quot;, 1, 3); // Add an edge connecting vertices 1 and 3
    g.addEdge(&quot;Edge-B&quot;, 2, 3, EdgeType.DIRECTED);
    g.addEdge(&quot;Edge-C&quot;, 3, 2, EdgeType.DIRECTED);
    g.addEdge(&quot;Edge-P&quot;, 2, 3); // A parallel edge

    // Create the objects for graph layout and visualization.
    Layout&lt;Integer, String&gt; layout = new KKLayout&lt;Integer, String&gt;(g);
    BasicVisualizationServer&lt;Integer, String&gt; vv =
        new BasicVisualizationServer&lt;Integer, String&gt;(layout);
    vv.setPreferredSize(new Dimension(800, 800));

    // Determines how the value of each vertex is rendered as a label in the diagram.
    ToStringLabeller&lt;Integer&gt; vertexLabeller = new ToStringLabeller&lt;Integer&gt;() {
      public String transform(Integer i) {
        return &quot;&quot; + i;
      }
    };

    vv.getRenderContext().setVertexLabelTransformer(vertexLabeller);

    // Show the visualization in a Swing window.
    JFrame frame = new JFrame(&quot;Simple Graph View&quot;);
    frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    frame.getContentPane().add(vv);
    frame.pack();
    frame.setVisible(true);
  }
}
</pre></p>
<p>Many of Jung&#8217;s APIs are generic, so vertices and edges can easily carry user-defined data. For more details, visit <a href="http://jung.sourceforge.net/">http://jung.sourceforge.net</a>.</p>
<p>The source code above produces the following diagram.<br />
<a class="flickr-image aligncenter" title="Jung example" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3919489249/"><img class="flickr-medium aligncenter" src="http://farm3.static.flickr.com/2646/3919489249_3377cc8c63.jpg" alt="Jung example" width="347" height="346" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2646/3919489249_3377cc8c63.jpg" medium="image">
			<media:title type="html">Jung example</media:title>
		</media:content>
	</item>
		<item>
		<title>Zipf Distribution Generator in Java</title>
		<link>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/</link>
		<comments>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/#comments</comments>
		<pubDate>Sun, 13 Sep 2009 14:17:34 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[zipf]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=369</guid>
		<description><![CDATA[When I run experiments, I usually generate synthetic data sets from probability distributions. The Zipf distribution in particular is frequently used for synthetic data. It is one of the discrete power-law probability distributions. You can find detailed information under Zipf&#8217;s law on Wikipedia. Anyway, I attached my own Java [...]]]></description>
			<content:encoded><![CDATA[<p>When I run experiments, I usually generate synthetic data sets from probability distributions. The Zipf distribution in particular is frequently used for synthetic data. It is one of the discrete power-law probability distributions. You can find detailed information under <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf&#8217;s law</a> on Wikipedia. Anyway, I attached my own Java class for the Zipf distribution. The graphs below were generated with my Java code and gnuplot.</p>
<pre><a class="flickr-image alignleft" title="Zipf Distribution (s=1)" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3914971725/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2528/3914971725_39800bd7f5_m.jpg" alt="Zipf Distribution (s=1)" /></a><a class="flickr-image alignnone" title="Zipf Distribution with log scale (s=1)" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3914971927/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2486/3914971927_df23796db2_m.jpg" alt="Zipf Distribution with log scale (s=1)" /></a>

<pre class="brush: java;">
import java.util.Random;

public class ZipfGenerator {
 private final Random rnd = new Random(System.currentTimeMillis());
 private final int size;
 private final double skew;
 private double bottom = 0; // normalization constant: sum of 1/i^skew for i = 1..size

 public ZipfGenerator(int size, double skew) {
  this.size = size;
  this.skew = skew;

  // Include rank 'size' itself so that the probabilities of ranks 1..size sum to 1.
  for (int i = 1; i &lt;= size; i++) {
   this.bottom += (1 / Math.pow(i, this.skew));
  }
 }

 // Returns a rank id in [1, size]. The frequency of the returned rank ids follows a Zipf distribution.
 // Uses rejection sampling: draw a rank uniformly, then accept it with probability getProbability(rank).
 public int next() {
   int rank;
   double frequency;
   double dice;

   do {
     rank = rnd.nextInt(size) + 1; // nextInt(size) is 0-based, but ranks start at 1
     frequency = getProbability(rank);
     dice = rnd.nextDouble();
   } while (!(dice &lt; frequency));

   return rank;
 }

 // Returns the probability that the given rank (1-based) occurs.
 public double getProbability(int rank) {
   return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
 }

 public static void main(String[] args) {
   if (args.length != 2) {
     System.out.println(&quot;usage: ./zipf size skew&quot;);
     System.exit(-1);
   }

   ZipfGenerator zipf = new ZipfGenerator(Integer.parseInt(args[0]),
       Double.parseDouble(args[1]));
   // Print the probabilities of the first 100 ranks, one "rank probability" pair per line.
   for (int i = 1; i &lt;= 100; i++)
     System.out.println(i + &quot; &quot; + zipf.getProbability(i));
 }
}
</pre>
</pre>
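<p>The generator above uses rejection sampling, which can need many uniform draws per accepted rank when the number of ranks is large. As a rough alternative sketch (a different technique, not the class above), ranks can also be drawn by inverting a precomputed cumulative distribution with a binary search. The class name <em>ZipfCdfSampler</em> and the explicit <em>seed</em> parameter are assumptions made for this illustration.</p>

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical alternative to ZipfGenerator: inverse-CDF sampling.
public class ZipfCdfSampler {
    private final double[] cdf; // cdf[k-1] = P(rank <= k)
    private final Random rnd;

    public ZipfCdfSampler(int size, double skew, long seed) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        this.cdf = new double[size];
        this.rnd = new Random(seed);

        // Accumulate the unnormalized weights 1/k^skew for k = 1..size.
        double sum = 0;
        for (int k = 1; k <= size; k++) {
            sum += 1.0 / Math.pow(k, skew);
            cdf[k - 1] = sum;
        }
        // Normalize so that the last entry is exactly 1.
        for (int k = 0; k < size; k++) {
            cdf[k] /= sum;
        }
        cdf[size - 1] = 1.0; // guard against rounding error
    }

    // Returns a rank in [1, size] by locating a uniform draw in the CDF table.
    public int next() {
        double u = rnd.nextDouble(); // uniform in [0, 1)
        int idx = Arrays.binarySearch(cdf, u);
        // Exact hit at idx -> rank idx + 1; otherwise binarySearch returns
        // -(insertionPoint) - 1, and the rank is insertionPoint + 1 = -idx.
        return (idx >= 0) ? idx + 1 : -idx;
    }

    // Probability of the given 1-based rank, read back from the CDF table.
    public double getProbability(int rank) {
        double below = (rank > 1) ? cdf[rank - 2] : 0.0;
        return cdf[rank - 1] - below;
    }

    public static void main(String[] args) {
        ZipfCdfSampler zipf = new ZipfCdfSampler(10, 1.0, 42L);
        int[] counts = new int[11];
        for (int i = 0; i < 100000; i++) {
            counts[zipf.next()]++;
        }
        // Print "rank observed-count expected-probability" for each rank.
        for (int k = 1; k <= 10; k++) {
            System.out.println(k + " " + counts[k] + " " + zipf.getProbability(k));
        }
    }
}
```

<p>Building the table costs O(<em>size</em>) time and memory once, and each draw then takes O(log <em>size</em>) steps, so this trade-off tends to pay off when many samples are needed.</p>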
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2528/3914971725_39800bd7f5_m.jpg" medium="image">
			<media:title type="html">Zipf Distribution (s=1)</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2486/3914971927_df23796db2_m.jpg" medium="image">
			<media:title type="html">Zipf Distribution with log scale (s=1)</media:title>
		</media:content>
	</item>
		<item>
		<title>One-column abstract in two-column layouts in articles on Latex</title>
		<link>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/</link>
		<comments>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 01:43:39 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[abstract]]></category>
		<category><![CDATA[latex]]></category>
		<category><![CDATA[one-column]]></category>
		<category><![CDATA[two-column]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=359</guid>
		<description><![CDATA[If you want a one-column abstract in a two-column LaTeX article, just load the abstract package and follow the source code below. Ubuntu Linux users can install the abstract package from the &#8216;texlive-latex-extra&#8217; package via Synaptic. Others can install it from http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/ Move &#8216;\maketitle&#8217; inside &#8216;\twocolumn&#8217; as in the code above and remove the original &#8216;abstract&#8217; environment. You can [...]]]></description>
			<content:encoded><![CDATA[<p>If you want a one-column abstract in a two-column LaTeX article, just load the abstract package and follow the source code below. Ubuntu Linux users can install the abstract package from the &#8216;texlive-latex-extra&#8217; package via Synaptic. Others can install it from <a href="http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/" target="_blank">http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/</a></p>
<p><pre class="brush: plain;">
\usepackage{abstract}

\twocolumn[
  \maketitle
  \begin{onecolabstract}
    The one-column abstract goes here
  \end{onecolabstract}
]
</pre></p>
<p>As in the code above, move &#8216;\maketitle&#8217; inside the optional argument of &#8216;\twocolumn&#8217; and remove the original &#8216;abstract&#8217; environment.</p>
<p>You can find further information about the abstract package in <a href="http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/abstract.pdf" target="_blank">http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/abstract.pdf</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>A Brief Introduction to Skyline Problem (Pareto-optimal Tuples) (1)</title>
		<link>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/</link>
		<comments>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/#comments</comments>
		<pubDate>Sun, 06 Sep 2009 06:27:09 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[decision making]]></category>
		<category><![CDATA[pareto tuples]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[skyline]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=78</guid>
		<description><![CDATA[The skyline problem is to compute the best tuples from a set of ordered d-tuples. The name comes from the fact that the solution, plotted on a 2D plane, resembles the outline of buildings in a city. A skyline query is a kind of recommendation query that considers multiple criteria. It is an interesting problem as well as [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">The s</span>kyline problem is to compute the best tuples from a set of ordered <em>d</em>-tuples. The name comes from the fact that the solution, plotted on a 2D plane, resembles the outline of buildings in a city. A skyline query is a kind of recommendation query that considers multiple criteria. It is an interesting problem as well as a very useful query, and it has been studied intensively in recent years. Today, I’m going to present the problem definition of the skyline. Next time, I&#8217;ll describe several algorithms for the skyline problem.</p>
<p><a style="float:left;margin-right:5px;" title="Singapore Skyline (#12) by Christopher Chan, on Flickr" href="http://www.flickr.com/photos/chanc/469796567/"><img src="http://farm1.static.flickr.com/226/469796567_311f4a3b79.jpg" alt="Singapore Skyline (#12)" width="250" /></a> First, let us look at the input data. The input <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> of a skyline query is a set of <em>n</em> ordered <em>d-</em>tuples, each of which consists of <em>d</em> ordered scalar values, as shown in the formulas below:</p>
<p><img style="display:block;float:none;margin-left:auto;margin-right:auto;" src="http://www.codecogs.com/eq.latex?D^{d}%20=%20{tp_{1},tp_{2},...tp_{n}}" alt="D^{d} = {tp_{1},tp_{2},...tp_{n}}" /></p>
<div id="equationview" style="text-align:center;">
<div id="equationview"><img src="http://www.codecogs.com/eq.latex?tp_%7Bi%7D%20=%20%28v_%7B1%7D,v_%7B2%7D,...,v_%7Bd%7D%29" border="0" alt="tp_{i} = (v_{1},v_{2},...,v_{d})" align="absmiddle" /></div>
</div>
<p><em> </em></p>
<div id="equationview"><img src="http://www.codecogs.com/eq.latex?tp_%7Bi%7D" border="0" alt="tp_{i}" align="absmiddle" /> denotes a <em>d</em>-tuple. Next, we need the definition of the dominance relation. Because the skyline problem is to find the better tuples, we also need an assumption about what &#8216;better&#8217; means. Most of the literature assumes that a smaller value is better, so we follow this assumption.</div>
<blockquote><p><span style="background-color:#ffffff;"><strong>Definition 1 (Dominance). </strong></span><span style="background-color:#ffffff;">Let <em>tp</em> and <em>tp’</em> be tuples in <img src="http://www.codecogs.com/eq.latex?D^{d}" alt="D^{d}" /> where </span><img src="http://www.codecogs.com/eq.latex?v_%7Bi%7D" border="0" alt="v_{i}" align="absmiddle" /> <span style="background-color:#ffffff;">is the <em>i</em>-th element of <em>tp</em> and </span><img src="http://www.codecogs.com/eq.latex?u_%7Bi%7D" border="0" alt="u_{i}" align="absmiddle" /> <span style="background-color:#ffffff;">is the <em>i</em>-th element of <em>tp&#8217; </em>for </span><img src="http://www.codecogs.com/eq.latex?1%20%5Cleq%20i%20%5Cleq%20d" border="0" alt="1 \leq i \leq d" align="absmiddle" /><span style="background-color:#ffffff;">. Then, <em>tp</em> <strong>dominates</strong> <em>tp’</em> </span><span style="background-color:#ffffff;">if and only if  <img src="http://www.codecogs.com/eq.latex?%5Cforall%20i,%20v_{i}%20%5Cleq%20u_{i}%20%5Cland%20%5Cexists%20j,%20v_{j}%20%3C%20u_{j}" alt="\forall i, v_{i} \leq u_{i} \land \exists j, v_{j} &lt; u_{j}" width="182" height="17" />.</span></p></blockquote>
<p>In other words, it is said that one tuple <img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> dominates another tuple <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> if <img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> is not worse (not greater) than <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> in all dimensions and<em> </em><img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> is better (less) than <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> in at least one dimension.</p>
<blockquote><p><strong>Definition 2 (Skyline).</strong> Given a data set <img src="http://www.codecogs.com/eq.latex?D^{d}" alt="D^{d}" />, the skyline consists of the tuples that are not dominated by any other tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" />.</p></blockquote>
<p>As the definition above states, a skyline is the set of tuples that are not dominated by any other tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" />. In the literature, a <em>d</em>-dimensional data set and the two definitions above are usually illustrated by plotting the tuples as points in a space with <em>d</em> axes.</p>
<p style="text-align:left;">For simplicity, we assume that <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> is a 2D data set (i.e., <em>d</em>=2). The data set is given as follows:</p>
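<p>For readers who prefer set notation, the two definitions can be written compactly as below. The symbol for &#8216;dominates&#8217; and the name Sky are notational assumptions made here for illustration; they do not come from the cited papers.</p>

```latex
% tp = (v_1, ..., v_d) and tp' = (u_1, ..., u_d); a smaller value is better.
\[
  tp \prec tp' \iff
  \left( \forall i \in \{1,\dots,d\} :\ v_i \leq u_i \right)
  \land
  \left( \exists j \in \{1,\dots,d\} :\ v_j < u_j \right)
\]
\[
  \mathrm{Sky}(D^{d}) =
  \{\, tp \in D^{d} \mid \neg \exists\, tp' \in D^{d} :\ tp' \prec tp \,\}
\]
```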
<ul>
<li>a = (3,2)</li>
<li>b = (8,1)</li>
<li>c = (1,10)</li>
<li>d = (4,3)</li>
<li>e = (8,6)</li>
</ul>
<p style="text-align:left;">Each element of a tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> corresponds to one axis. In other words, the first and second elements of each tuple are mapped to the X and Y axes, respectively. The tuples in the list above then become 2D points, as shown in Fig. 1.</p>
<div id="attachment_324" class="wp-caption aligncenter" style="width: 300px"><img class="size-full wp-image-324" title="Fig. 1. An example of a skyline" src="http://diveintodata.files.wordpress.com/2009/09/skyline_intro.png?w=590" alt="Fig. 1. An example of a skyline"   /><p class="wp-caption-text">Fig. 1. An example of a skyline</p></div>
<p>Let us look at the dominance relations in Fig. 1. The point <em>a</em> dominates the points {<em>d,e</em>} since both coordinates of <em>a</em> are less than those of <em>d</em> and <em>e</em>. The point <em>b</em> dominates only <em>e</em> since the X values of <em>b</em> and <em>e</em> are the same (i.e., X=8) but the Y value of <em>b</em> (i.e., 1) is less than that of <em>e</em> (i.e., 6). The points {<em>d,e</em>} cannot belong to the skyline because they are dominated by other tuples. Consequently, the points <em>a</em>, <em>b</em>, and <em>c</em> form the skyline since they are not dominated by any other tuple.</p>
<p>The skyline problem was originally known as the <em><a href="http://portal.acm.org/citation.cfm?id=321910" target="_blank">maxima vector problem (H. T. Kung et al., 1975)</a></em> for traditional processing systems. It was later revisited in a database context by <a href="http://portal.acm.org/citation.cfm?id=656550&amp;dl=" target="_blank">the Skyline Operator (Stephan Börzsönyi et al., 2001)</a>. Since then, the problem has been studied intensively in the database area.</p>
<p>Next time, I&#8217;ll describe several algorithms, including those from the papers above, in detail.</p>
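<p>The dominance check and the resulting skyline for the five example points can be reproduced with a few lines of Java. This is only a naive sketch that compares every pair of tuples, not one of the published skyline algorithms; the class and method names (<em>SkylineExample</em>, <em>dominates</em>, <em>skyline</em>, <em>sampleData</em>) are made up for this illustration.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Definitions 1 and 2 on the example data set.
public class SkylineExample {

    // Definition 1: p dominates q if p is no greater than q in every dimension
    // and strictly smaller in at least one (a smaller value is better).
    public static boolean dominates(int[] p, int[] q) {
        boolean strictlyBetter = false;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > q[i]) {
                return false; // p is worse in dimension i
            }
            if (p[i] < q[i]) {
                strictlyBetter = true;
            }
        }
        return strictlyBetter;
    }

    // Definition 2: keep every tuple that no other tuple dominates.
    // Compares all pairs, so it takes O(n^2) dominance checks.
    public static List<String> skyline(Map<String, int[]> data) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, int[]> candidate : data.entrySet()) {
            boolean dominated = false;
            for (int[] other : data.values()) {
                if (dominates(other, candidate.getValue())) {
                    dominated = true;
                    break;
                }
            }
            if (!dominated) {
                result.add(candidate.getKey());
            }
        }
        return result;
    }

    // The five 2D tuples from the list above, in insertion order.
    public static Map<String, int[]> sampleData() {
        Map<String, int[]> data = new LinkedHashMap<>();
        data.put("a", new int[] {3, 2});
        data.put("b", new int[] {8, 1});
        data.put("c", new int[] {1, 10});
        data.put("d", new int[] {4, 3});
        data.put("e", new int[] {8, 6});
        return data;
    }

    public static void main(String[] args) {
        System.out.println(skyline(sampleData())); // prints [a, b, c]
    }
}
```

<p>A tuple never dominates itself under this check, because no dimension is strictly smaller, so the inner loop does not need to skip the candidate.</p>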
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm1.static.flickr.com/226/469796567_311f4a3b79.jpg" medium="image">
			<media:title type="html">Singapore Skyline (#12)</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd%20=%20tp_1,tp_2,...tp_n" medium="image">
			<media:title type="html">D^{d} = \{tp_{1},tp_{2},...,tp_{n}\}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp_%7Bi%7D%20=%20%28v_%7B1%7D,v_%7B2%7D,...,v_%7Bd%7D%29" medium="image">
			<media:title type="html">tp_{i} = (v_{1},v_{2},...,v_{d})</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp_%7Bi%7D" medium="image">
			<media:title type="html">tp_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?v_%7Bi%7D" medium="image">
			<media:title type="html">v_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?u_%7Bi%7D" medium="image">
			<media:title type="html">u_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?1%20%3C%20i%20%5Cleq%20d" medium="image">
			<media:title type="html">1 &#60; i \leq d</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?foralli,%20v_i%20leq%20u_i%20land%20existsj,%20v_j%20%3C%20u_j" medium="image">
			<media:title type="html">\forall i, v_{i} \leq u_{i} \land \exists j, v_{j} &#60; u_{j}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://diveintodata.files.wordpress.com/2009/09/skyline_intro.png" medium="image">
			<media:title type="html">Fig. 1. An example of a skyline</media:title>
		</media:content>
	</item>
	</channel>
</rss>
