<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Dive Into A Data Deluge &#187; Research</title>
	<atom:link href="http://diveintodata.org/category/research/feed/" rel="self" type="application/rss+xml" />
	<link>http://diveintodata.org</link>
	<description>Discussion about Newly Emerging Issues on Database</description>
	<lastBuildDate>Thu, 29 Mar 2012 09:43:57 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='diveintodata.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Dive Into A Data Deluge &#187; Research</title>
		<link>http://diveintodata.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://diveintodata.org/osd.xml" title="Dive Into A Data Deluge" />
	<atom:link rel='hub' href='http://diveintodata.org/?pushpress=hub'/>
		<item>
		<title>How to Launch a Hadoop Cluster on Amazon EC2 Using Whirr</title>
		<link>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/</link>
		<comments>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/#comments</comments>
		<pubDate>Sat, 19 Mar 2011 03:06:58 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Cloud Computing]]></category>
		<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[amazon ec2]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[configuration]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[howto]]></category>
		<category><![CDATA[whirr]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=1060</guid>
		<description><![CDATA[I have recently been building a Hadoop cluster on Amazon EC2 to run experiments that validate my research. Because I knew little about the EC2 environment and documentation was scarce, I had to work through quite a few problems on my own. In this post I describe the approach I took and share what I learned. I assume that readers already have an Amazon EC2 account and [...]]]></description>
			<content:encoded><![CDATA[<p>I have recently been building a Hadoop cluster on Amazon EC2 to run experiments that validate my research. Because I knew little about the EC2 environment and documentation was scarce, I had to work through quite a few problems on my own. In this post I describe the approach I took and share what I learned.</p>
<p>I assume that readers already have an Amazon EC2 account and are familiar with AMIs, instances, and EC2 key pairs.</p>
<p><span style="font-size:20px;font-weight:bold;">Running a Hadoop Cluster on Amazon EC2</span></p>
<p>At present there are not many ways to bring up a Hadoop cluster on Amazon EC2:</p>
<ol>
<li>Use an AMI with Hadoop preinstalled and configure it manually</li>
<li>Install Hadoop on an EBS-backed AMI, copy it, and configure it manually</li>
<li>Use whirr</li>
<li>Use the hadoop-ec2 scripts that ship with whirr</li>
</ol>
<p>This post covers option 3, building the cluster with whirr. The method is very simple, but it has one limitation: by default it can only use instance-store-backed AMIs. The HDFS of the cluster therefore lives on the instance store, and all HDFS data is lost when the cluster is terminated. (Any EC2 instance that needs persistent data must use a separate storage service such as EBS or S3.)</p>
<p>I first tried options 1 and 2, but ran into the following problems:</p>
<ul>
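<li>No AMI was available with a recent Hadoop release (0.20 or later) installed</li>
<li>Launching dozens of instances had to be automated</li>
<li>IP addresses are newly assigned on every launch, which makes it hard to configure Hadoop and distribute the configuration</li>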
</ul>
<p>As far as I could tell, automating instance launch, collecting the IP addresses of the launched instances, and distributing the configuration smoothly requires writing code against the <a href="http://aws.amazon.com/sdkforjava/">Amazon AWS API</a> or a third-party library such as <a href="http://code.google.com/p/boto/">boto (Python interface to Amazon Web Services)</a> or <a href="http://code.google.com/p/jclouds/downloads/list">jclouds (multi-cloud library)</a>. That takes a lot of time. I gave up on installing Hadoop directly on an EBS-backed AMI for similar reasons.</p>
<p>Option 4, hadoop-ec2, was originally part of the Hadoop contrib tree and now lives in whirr, but it does not appear to be actively maintained, so I did not try it. I could find hardly any mention of it in the whirr change log either.</p>
<p>For now, whirr seems to be the most convenient option.</p>
<p><span style="font-size:20px;font-weight:bold;">whirr?</span></p>
<p>whirr is an Apache Incubator project: a library that automatically installs, configures, and runs services in commercial cloud environments such as Amazon EC2. It currently supports <a href="http://hadoop.apache.org/">Apache Hadoop</a>, <a href="http://cassandra.apache.org/">Cassandra</a>, <a href="http://www.cloudera.com/hadoop/">Cloudera&#8217;s Distribution for Hadoop (CDH)</a>, and <a href="http://hadoop.apache.org/zookeeper/">Zookeeper</a>, and <a href="http://hbase.apache.org/">HBase</a> is expected to be added in the upcoming 0.4-incubating release.</p>
<h2>How It Works</h2>
<p>Before explaining how to use whirr, here is a brief overview of the whole process.</p>
<p>When the user runs the &#8216;<em>launch-cluster</em>&#8217; command, whirr starts a number of instances from an instance-store-backed AMI, then installs and configures the JDK and Hadoop on all of them in one pass. When this finishes, a Hadoop cluster is running inside EC2.</p>
<p>The user then controls that cluster through a Hadoop installation on the local machine. However, instances inside EC2 are assigned only private IP addresses by default and cannot be reached from outside, and the default firewall settings are strict, so Hadoop RPC and the web UI are not reachable without additional setup. The user therefore starts the proxy program that whirr provides first, and then uses the local Hadoop installation to control the cluster inside EC2.</p>
<h2>Running the Hadoop Cluster</h2>
<h3>Creating an Account</h3>
<p>A Hadoop cluster created by whirr is started under the Linux username <em>hadoop</em>. Hadoop grants superuser privileges to a client that connects under the same account that started the cluster. To act as superuser on the cluster running inside Amazon EC2, create a local account named hadoop and perform the steps below from it. If you only need to run MapReduce programs, any account will do.</p>
<h3>Installing Hadoop and whirr Locally, and the Hadoop Version Problem</h3>
<p>Download Hadoop and whirr from <a href="http://archive.apache.org/dist/hadoop/core/">http://archive.apache.org/dist/</a> and install them on the local machine. The question at this point is which Hadoop version to install. Hadoop is under rapid development, and the internal protocols of different versions are not compatible with each other.</p>
<p>The local Hadoop must therefore be the same version as the Hadoop that whirr installs on the cluster. To change the version, you have to edit the install and configuration scripts yourself, as the <a href="http://incubator.apache.org/whirr/faq.html">whirr FAQ</a> explains:</p>
<blockquote>
<h3>How do I specify the service version and other service properties?</h3>
<p>Currently the only way to do this is to modify the scripts to install a particular version of the service, or to change the service properties from the defaults.</p>
<p>See &#8220;How to modify the instance installation and configuration scripts&#8221; above for details on how to do this.</p>
<p>from <a href="http://incubator.apache.org/whirr/faq.html">http://incubator.apache.org/whirr/faq.html</a></p></blockquote>
<p>Besides the Apache Hadoop distribution, whirr can also install the Cloudera Hadoop distribution. See the description of whirr.hadoop-install-runurl and whirr.hadoop-configure-runurl under &#8216;whirr Configuration File&#8217; below.</p>
<h3>whirr Configuration File</h3>
<p>Launching a cluster with whirr starts with writing a configuration file for it. This post walks through the cluster.properties file below, and the rest of the post assumes this configuration.</p>
<p>(The following applies to 0.3-incubating, the latest version. The configuration format is expected to change in 0.4-incubating; I will update this post once it is released.)</p>
<p><pre class="brush: plain;">
whirr.cluster-name=mycluster
whirr.instance-templates=1 jt+nn,16 dn+tt
whirr.provider=ec2
whirr.identity=ACCESS_KEY
whirr.credential=SECRET_KEY
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.location-id=us-east-1d
whirr.hardware-id=m1.small
whirr.service-name=hadoop
#whirr.hadoop-install-runurl=cloudera/cdh/install
#whirr.hadoop-configure-runurl=cloudera/cdh/post-configure
</pre></p>
<p>Each property is explained below.</p>
<ul>
<li>whirr.cluster-name: the name that identifies the cluster. Launching the cluster creates the directory ${HOME}/.whirr/<em>&lt;cluster-name&gt;</em>, which holds the files needed to access the Hadoop cluster.</li>
<li>whirr.instance-templates: the layout of the cluster. jt stands for jobtracker, nn for name node, dn for data node, and tt for task tracker. Each comma-separated entry gives a number of instances followed by the roles they run, joined with +. To add more data nodes and task trackers, change the number in front of dn+tt.</li>
<li>whirr.provider: the cloud provider. Amazon EC2 and Rackspace Cloud Servers are currently supported.</li>
<li>whirr.identity: your AWS access key.</li>
<li>whirr.credential: your AWS secret key.</li>
<li>whirr.private-key-file: this property and the one after it (whirr.public-key-file) specify the key pair used when creating the EC2 instances. To use the paths from the example above, generate an SSH key as shown below. Alternatively, point them at an EC2 key pair you already created for other instances.</li>
</ul>
<p><pre class="brush: plain;">
$ ssh-keygen -t rsa -P ''
</pre></p>
<ul>
<li>whirr.location-id: the availability zone to use. If it is not set, the zone of the instance running whirr is used.</li>
<li>whirr.hardware-id: the instance type. The types offered by Amazon EC2 are listed on the <a href="http://aws.amazon.com/ec2/instance-types/">Amazon EC2 Instance Types</a> page; use the API name shown for each type as the value.</li>
<li>whirr.service-name: the service to deploy. Since this post is about Hadoop, leave it as hadoop.</li>
<li>whirr.hadoop-install-runurl, whirr.hadoop-configure-runurl: Hadoop comes in an Apache version and a CDH version. Removing the comment marker (#) from these two lines in the example above launches the CDH version.</li>
</ul>
<p>For more details on configuration, see the <a href="http://incubator.apache.org/whirr/configuration-guide.html">Whirr Configuration Guide</a>.</p>
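<p>As a rough illustration of how the whirr.instance-templates value is structured, the following Python sketch splits it into role groups and counts the instances it asks for. The function name parse_instance_templates is made up for this example and is not part of whirr; the sketch only assumes the count-then-roles format used in the cluster.properties example.</p>

```python
def parse_instance_templates(value: str) -> dict[str, int]:
    """Map each role group (for example 'jt+nn') to its instance count."""
    groups = {}
    for entry in value.split(","):
        # Each entry looks like "16 dn+tt": a count, whitespace, then roles.
        count, roles = entry.strip().split(None, 1)
        groups[roles.strip()] = int(count)
    return groups


templates = parse_instance_templates("1 jt+nn,16 dn+tt")
total_instances = sum(templates.values())
print(templates)        # {'jt+nn': 1, 'dn+tt': 16}
print(total_instances)  # 17
```

<p>The total of 17 is simply 1 master instance plus 16 worker instances, which is what this configuration requests.</p>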
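<h3>Starting the Hadoop Cluster</h3>
<p>Start the cluster with the following command. Internally, whirr creates the instances through the Amazon EC2 API, installs required packages such as the JDK, downloads the Hadoop distribution, and configures it. Startup therefore takes anywhere from a few minutes to about ten minutes. The shell prompt returns once the cluster is up.</p>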
<p><pre class="brush: plain;">
$ whirr/bin/whirr launch-cluster --config cluster.properties
Bootstrapping cluster
Configuring template
Starting 16 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]
Nodes started: [[id=us-east-1/i-a45eb7cb, providerId=i-a45eb7cb, ...]]
.....
Nodes started: [[id=us-east-1/i-7a51b815, providerId=i-7a51b815, ...]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of mycluster
Web UI available at http://ec2-72-44-43-29.compute-1.amazonaws.com
Wrote Hadoop site file /home/-----/.whirr/mycluster/hadoop-site.xml
Wrote Hadoop proxy script /home/-----/.whirr/mycluster/hadoop-proxy.sh
Wrote instances file /home/-----/.whirr/mycluster/instances
Started cluster of 17 instances
Cluster{instances=[Instance{roles=[tt, dn], ...}]}
$
</pre></p>
<h3>Opening the Proxy</h3>
<p>To access the Hadoop cluster launched by whirr, you need to configure the Hadoop installed on the local machine. This is simply a matter of using the files whirr generates after the cluster starts. Once &#8216;<em>whirr launch-cluster</em>&#8217; completes, the file ${HOME}/.whirr/<em>&lt;cluster-name&gt;/</em>hadoop-site.xml is created. Either copy it into ${HADOOP_HOME}/conf of the local Hadoop installation, or override the Hadoop configuration directory by setting an environment variable as follows.</p>
<p><pre class="brush: plain;">
export HADOOP_CONF_DIR=~/.whirr/&lt;cluster-name&gt;
</pre></p>
<p>However, the cluster nodes inside EC2 have only private IP addresses, so the Hadoop cluster cannot be reached directly. You must first run ${HOME}/.whirr/<em>&lt;cluster-name&gt;/hadoop-proxy.sh</em> to gain access to the cluster running in EC2.</p>
<p><pre class="brush: plain;">
$ chmod +x ~/.whirr/mycluster/hadoop-proxy.sh
$ ~/.whirr/mycluster/hadoop-proxy.sh

Running proxy to Hadoop cluster at ec2-72-44-43-29.compute-1.
amazonaws.com. Use Ctrl-c to quit.
</pre></p>
<p><pre class="brush: plain;">
$ bin/hadoop dfs -ls
...
</pre></p>
<p>Running <em>hadoop-proxy.sh</em> also makes the web UI of the Hadoop cluster in EC2 accessible, but this requires a small change to the browser settings.</p>
<ul>
<li>Chrome: go to Preferences -&gt; Under the Hood tab -&gt; Network -&gt; Change Proxy Settings. Set the SOCKS host to localhost and the port to 6666.</li>
<li>Firefox: by default Firefox resolves names with the DNS configured on the local machine even when SOCKS is in use, so change that first. Enter about:config in the address bar and set network.proxy.socks_remote_dns to true. Then, under Preferences -&gt; Advanced -&gt; Network tab -&gt; Connection section -&gt; Settings, set the SOCKS host to localhost and the port to 6666.</li>
</ul>
<p>After making these changes, open the address that hadoop-proxy.sh printed when it was started.</p>
<h3>Shutting Down the Cluster</h3>
<p>When all work with Hadoop is finished, shut the cluster down with the following command. <span style="text-decoration:underline;">As mentioned above, whirr so far supports only instance-store-backed clusters, so keep in mind that all data in HDFS will be lost.</span></p>
<p><pre class="brush: plain;">
$ whirr/bin/whirr destroy-cluster --config=cluster.properties
</pre></p>
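<h2>Conclusion</h2>
<p>This post described how to launch a Hadoop cluster with whirr. whirr still has minor bugs and configuration limits, but it greatly reduces the effort of building a cluster by hand. I think it is especially useful for people who need to run MapReduce programs on a large cluster on demand.</p>
<h2>References</h2>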
<ul>
<li><a href="http://incubator.apache.org/whirr/quick-start-guide.html" target="_blank">Getting Started with Whirr</a></li>
<li><a title="Map-Reduce With Ruby Using Hadoop" href="http://www.philwhln.com/map-reduce-with-ruby-using-hadoop" target="_blank">Map-Reduce With Ruby Using Hadoop</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2011/03/19/whirr-usage-for-hadoop-cluster-in-amazon-ec2/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>VoltDB and its related links</title>
		<link>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/</link>
		<comments>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/#comments</comments>
		<pubDate>Tue, 01 Jun 2010 05:26:55 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[ACID]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[OLTP]]></category>
		<category><![CDATA[shared-nothing architecture]]></category>
		<category><![CDATA[sql]]></category>
		<category><![CDATA[VoltDB]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=842</guid>
		<description><![CDATA[There has been a lot of buzz about VoltDB (known in academia as H-Store [5]) over the past week. VoltDB is an open source OLTP DBMS whose development is led by M. Stonebraker. Some of its interesting points: it runs on shared-nothing clusters of commodity hardware; it is an in-memory database; SQL support; ACID; linear scalability; released as open source software [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg"><img class="alignright size-full wp-image-954" title="VoltDB" src="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg?w=590" alt=""   /></a>There has been a lot of buzz about <em><span style="font-style:normal;">VoltDB (known in academia as H-Store <a href="#ref-5">[5]</a>)</span><span style="font-style:normal;"> over the past week. VoltDB is an open source OLTP DBMS whose development is led by <em>M. Stonebraker</em>. Some of its interesting points:</span></em></p>
<ul>
<li>Running on shared-nothing clusters of commodity hardware</li>
<li>In-memory database</li>
<li>SQL support</li>
<li>ACID</li>
<li>Linear Scalability</li>
<li>Released as an Open Source software</li>
</ul>
<p>There have already been OLTP databases that run on shared-nothing clusters. However, they cannot take full advantage of the scalability of a shared-nothing architecture because of the nature of their implementations, such as complex distributed locking and commit protocols <a href="#ref-1">[1]</a>. In addition, according to <a href="#ref-3">[3]</a>, traditional RDBMSs carry four sources of overhead: logging, locking, latching, and buffer management. M. Stonebraker claims that VoltDB eliminates these legacy overheads.</p>
<p>Among its many features, I am especially interested in its linear scalability combined with ACID guarantees and performance. This matters because it gives today&#8217;s web applications an alternative to NoSQL data stores. Although VoltDB is still under heavy development, the features above and the following benchmark result show its promise.</p>
<ul>
<li><a href="https://voltdb.com/blog/key-value-benchmarking">Key-Value Benchmark</a> (VoltDB versus Cassandra)</li>
</ul>
<p><a href="http://cassandra.apache.org/" target="_blank">Cassandra</a> is a remarkable key-value store and an open source project developed by Apache committers. It is widely regarded as the most performant of the existing NoSQL stores. According to this benchmark, however, VoltDB outperforms Cassandra in every case, although the fairness of the experiments is disputed.</p>
<ul>
<li><a href="http://community.voltdb.com/roadmap" target="_blank">VoltDB Roadmap</a></li>
</ul>
<p>Its future plans also look promising. I wonder how much attention VoltDB will get from the community and from industry.</p>
<h4>See Also:</h4>
<ol>
<li><a name="ref-1"></a><a id="ref-1" href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-cloud-ieee09.pdf" target="_blank">Data Management in the Cloud: Limitations and Opportunities</a></li>
<li><a href="http://pgsnake.blogspot.com/2010/05/comparing-voltdb-to-postgres.html" target="_blank">Comparing VoltDB vs Postgresql</a></li>
<li><a name="ref-3"></a><a href="http://cs-www.cs.yale.edu/homes/dna/papers/oltpperf-sigmod08.pdf" target="_blank">OLTP through the looking glass, and what we found there, ACM SIGMOD 2008</a></li>
<li><a href="http://voltdb.com/product">http://voltdb.com/product</a></li>
<li><a name="ref-5"></a><a href="http://db.cs.yale.edu/hstore/" target="_blank">H-Store: A Next Generation OLTP DBMS</a></li>
</ol>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/06/01/voltdb-and-its-related-links/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://diveintodata.files.wordpress.com/2010/06/gi_voltdb-gif.jpg" medium="image">
			<media:title type="html">VoltDB</media:title>
		</media:content>
	</item>
		<item>
		<title>Attempts to Improve HDFS Scalability (1)</title>
		<link>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/</link>
		<comments>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/#comments</comments>
		<pubDate>Mon, 24 May 2010 05:21:51 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[distributed file systems]]></category>
		<category><![CDATA[gfs]]></category>
		<category><![CDATA[google file system]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hdfs]]></category>
		<category><![CDATA[improvement]]></category>
		<category><![CDATA[performance]]></category>
		<category><![CDATA[scale-out]]></category>
		<category><![CDATA[scale-up]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=761</guid>
		<description><![CDATA[A while ago the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS namenode by using multiple nodes (HDFS-1052). After that, a Hadoop committer named Dhruba Borthakur proposed improvements to its vertical scalability (The Curse of Singletons! The Vertical Scalability of Hadoop NameNode, HDFS-1093, HADOOP-6713). According to LinkedIn, Borthakur is a Hadoop engineer currently working at Facebook. These two proposals [...]]]></description>
			<content:encoded><![CDATA[<div>
<p><img class="alignright" title="Apache Hadoop" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" width="200" height="50" /><br />
A while ago the HDFS team at Yahoo! proposed improving the horizontal scalability of the HDFS namenode by using multiple nodes (<a href="https://issues.apache.org/jira/browse/HDFS-1052" target="_blank">HDFS-1052</a>). After that, a Hadoop committer named <a href="http://www.linkedin.com/in/dhruba" target="_blank">Dhruba Borthakur</a> proposed improvements to its vertical scalability (<a href="http://hadoopblog.blogspot.com/2010/04/curse-of-singletons-vertical.html" target="_blank">The Curse of Singletons! The Vertical Scalability of Hadoop NameNode</a>, <a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>, <a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>). According to LinkedIn, Borthakur is a Hadoop engineer currently working at Facebook.</p>
<p>These two proposals use the terms vertical scalability and horizontal scalability. Vertical scalability is the scalability gained by upgrading the hardware of a single system, typically its CPU, memory, or disks. For a distributed system such as Hadoop, an increase in the number of cores per machine also counts as vertical scaling. Horizontal scalability is the scalability gained by increasing the number of systems, for example going from 10 nodes to 20. The terms scale-up and scale-out are used with the same respective meanings.</p>
<p>This post covers the second of the two proposals, the vertical scalability improvements proposed by Dhruba Borthakur. He points out the following bottlenecks in the Hadoop NameNode (as of Hadoop 0.21).</p>
<ul>
<li><strong>Network</strong>: The cluster he uses at Facebook has about 2000 nodes, and each server is configured to run 9 mappers and 6 reducers when MapReduce programs run. When MapReduce runs on this cluster, clients send roughly 30k requests to the NameNode at the same time. Because a single Listener thread in the Hadoop RPCServer (implemented as a singleton) handles every message, significant delays occur, and adding CPU cores did not help.</li>
<li><strong>CPU</strong>: Because of the FSNamesystem lock mechanism, the namenode typically uses only 2 cores even though the machine has 8. According to Borthakur, the locking mechanism used by FSNamesystem is too simplistic, and although <a href="https://issues.apache.org/jira/browse/HADOOP-1269" target="_blank">HADOOP-1269</a> improved matters, there is still room for improvement.</li>
<li><strong>Memory<span style="font-weight:normal;">:</span></strong> Faithful to the Google File System paper, the Hadoop NameNode keeps all metadata in memory. The HDFS of the cluster Borthakur uses holds 60 million files and 80 million blocks, and keeping their metadata required as much as 58GB of heap space.</li>
</ul>
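<p>The figures quoted above can be sanity-checked with a little arithmetic. The Python sketch below is only a back-of-the-envelope estimate: it assumes every task slot has one request in flight at a time and that 58GB means 58 GiB, and neither assumption comes from the original proposal.</p>

```python
# Network: how many task slots can talk to the NameNode at once?
nodes = 2000
mappers_per_node = 9
reducers_per_node = 6
concurrent_tasks = nodes * (mappers_per_node + reducers_per_node)
print(concurrent_tasks)  # 30000

# Memory: average heap per namespace object (files plus blocks).
files = 60_000_000
blocks = 80_000_000
heap_bytes = 58 * 1024**3  # assumption: "58GB" is read as 58 GiB
bytes_per_object = heap_bytes / (files + blocks)
print(round(bytes_per_object))  # 445
```

<p>Under these assumptions the 30k figure is just the total number of task slots in the cluster, and each file or block costs a few hundred bytes of NameNode heap on average.</p>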
<p>To address these problems, Borthakur proposed the following.</p>
<ul>
<li><strong>RPC Server</strong>: A pool of Reader threads was attached to the Listener thread, which used to be a singleton. The Listener thread now only accepts connection requests, and one of the Reader threads handles the RPC itself. As a result, more CPU cores can be put to work when RPC requests arrive in large numbers (<a href="https://issues.apache.org/jira/browse/HADOOP-6713" target="_blank">HADOOP-6713</a>).</li>
<li><strong>FSNamesystem lock</strong>: Borthakur gathered statistics on which file operations take the lock and found that getting the status of a file or directory and opening a file for reading account for 90% of all lock acquisitions. Since both are read-only operations, he replaced the lock with a read-write lock to improve performance (<a href="https://issues.apache.org/jira/browse/HDFS-1093" target="_blank">HDFS-1093</a>). This is also well explained in Section 4.1, Namespace Management and Locking, of the GFS paper (<a href="http://labs.google.com/papers/mapreduce.html" target="_blank">The Google File System</a>). GFS takes read locks on the ancestor directories in the namespace data structure and a read or write lock on the target directory or file, which lets as many operations as possible access shared data concurrently while preserving consistency. Much like file append, which was only added in Hadoop 0.20, this was probably something described in the paper but not yet implemented. I would have to study the source code to know for sure.</li>
</ul>
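<p>The idea behind swapping an exclusive lock for a read-write lock can be sketched in a few lines of Python. This is not the FSNamesystem code (which is written in Java); the ReadWriteLock class and the get_file_status function are hypothetical names for this illustration, and this simple variant favors readers, so a steady stream of readers could starve a writer.</p>

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers, or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read_locked(self):
        with self._cond:
            while self._writer:          # readers only wait for a writer
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:   # last reader out wakes writers
                    self._cond.notify_all()

    @contextmanager
    def write_locked(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


lock = ReadWriteLock()
both_inside = threading.Barrier(2)
log = []


def get_file_status(name):
    # A read-only operation: two of these may hold the lock together.
    with lock.read_locked():
        both_inside.wait(timeout=5)  # passes only if both readers are inside
        log.append(name)


readers = [threading.Thread(target=get_file_status, args=(n,)) for n in ("a", "b")]
for t in readers:
    t.start()
for t in readers:
    t.join()
print(sorted(log))  # ['a', 'b']
```

<p>With a plain exclusive lock the two readers could never meet at the barrier. With the read-write lock they hold the lock together, which is why read-heavy workloads benefit.</p>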
<p>The memory problem, however, has not been solved yet. Even so, according to Borthakur, fixing the two problems above improved scalability by a factor of 8.</p>
<p>Attempts to improve HDFS scalability had been catching my eye for a while and looked interesting, so a month ago I decided to write them up on the blog when I had time, and I have only now finished the first part. I have been short on time lately, so this post took about three weeks from start to finish. In the meantime, <em>;login:</em>, the <em>Usenix</em> magazine, published an article on the scalability of the Hadoop NameNode, <a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">HDFS Scalability: The Limits to Growth</a>. The Yahoo! Developer Network blog also posted an introduction to the article (<a href="http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html" target="_blank">Scalability of the Hadoop Distributed File System</a>). I will continue this write-up as time permits.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/05/24/hdfs-scalability-%ed%96%a5%ec%83%81%ec%9d%98-%ec%8b%9c%eb%8f%84%eb%93%a4-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/images/hadoop-logo.jpg" medium="image">
			<media:title type="html">Apache Hadoop</media:title>
		</media:content>
	</item>
		<item>
		<title>A Brief Summary of Independent Set in Graph Theory</title>
		<link>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/</link>
		<comments>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/#comments</comments>
		<pubDate>Sat, 24 Apr 2010 02:27:34 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[coloring problem]]></category>
		<category><![CDATA[dominating set]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graph coloring]]></category>
		<category><![CDATA[independent set]]></category>
		<category><![CDATA[maximal independent set]]></category>
		<category><![CDATA[maximum independent set]]></category>
		<category><![CDATA[mis]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=707</guid>
		<description><![CDATA[Graph Basics: Let G be an undirected graph G=(V,E), where V is a set of vertices and E is a set of edges. Every edge e in E consists of two vertices in V. It is said to connect, join, or link the two vertices (its end points). Independent Set: An independent set S [...]]]></description>
			<content:encoded><![CDATA[<h3>Graph Basics</h3>
<p>Let <em>G</em> be an undirected graph <em>G=(V,E)</em>, where <em>V</em> is a set of vertices and <em>E</em> is a set of edges. Every edge <em>e</em> in <em>E</em> consists of two vertices in <em>V</em>. It is said to connect, join, or link the two vertices (its end points).</p>
<h3>Independent Set</h3>
<p>An independent set <em>S</em> is a subset of <em>V</em> in <em>G</em> such that no two vertices in <em>S</em> are adjacent. I suppose the name means that the vertices in an independent set <em>S</em> are independent of the edge set of <em>G</em>: no edge joins two of them. Like other vertex sets in graph theory, an independent set can be maximal or maximum:</p>
<blockquote><p>The independent set <em>S</em> is <em><strong>maximal</strong><span style="font-style:normal;"> if </span>S</em> is not a proper subset of any independent set of <em>G.</em></p></blockquote>
<blockquote><p>The independent set <em>S</em> is <strong><em>maximum</em></strong> if there is no other independent set has more vertices than <em>S</em>.</p></blockquote>
<p>That is, a largest maximal independent set is called a maximum independent set. The maximum independent set problem is an NP-hard optimization problem.</p>
<p>Every graph has an independent set, since the empty set is trivially independent. The independence number <em>α</em>(<em>G</em>) of a graph <em>G</em> is the cardinality of a maximum independent set of <em>G</em>.</p>
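<p>To make these definitions concrete, here is a minimal Java sketch. The class name, the method names, and the edge-list representation (each edge is a pair of vertex numbers, and a set is a boolean array indexed by vertex) are my own assumptions for illustration. The independence number is computed by trying every subset, which takes exponential time and is only practical for tiny graphs.</p>
<p><pre class="brush: java;">
class IndependentSetSketch {

    // True when no edge has both of its end points in S (inS[v] marks membership).
    static boolean isIndependent(int[][] edges, boolean[] inS) {
        for (int[] e : edges) {
            if (inS[e[0]]) {
                if (inS[e[1]]) {
                    return false;
                }
            }
        }
        return true;
    }

    // Number of vertices in S.
    static int size(boolean[] inS) {
        int count = 0;
        for (boolean member : inS) {
            if (member) {
                count++;
            }
        }
        return count;
    }

    // Independence number by exhaustive search: vertex v is tried both outside
    // and inside S. Call it with an all-false array and v = 0.
    static int independenceNumber(int[][] edges, boolean[] inS, int v) {
        if (v == inS.length) {
            return isIndependent(edges, inS) ? size(inS) : 0;
        }
        inS[v] = false;
        int withoutV = independenceNumber(edges, inS, v + 1);
        inS[v] = true;
        int withV = independenceNumber(edges, inS, v + 1);
        inS[v] = false; // restore the array for the caller
        return Math.max(withoutV, withV);
    }

    public static void main(String[] args) {
        // Star graph: vertex 0 is joined to the leaves 1, 2, and 3.
        int[][] edges = { {0, 1}, {0, 2}, {0, 3} };
        boolean[] center = { true, false, false, false };
        boolean[] leaves = { false, true, true, true };

        System.out.println(isIndependent(edges, center));                  // true
        System.out.println(size(center));                                  // 1
        System.out.println(isIndependent(edges, leaves));                  // true
        System.out.println(independenceNumber(edges, new boolean[4], 0));  // 3
    }
}
</pre></p>
<p>In the star graph, the center alone is a maximal independent set of size 1, while the three leaves form a maximum independent set, so a maximal set is not necessarily a maximum one.</p>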
<h3>Relations to Dominating Sets</h3>
<ul>
<li>A dominating set in a graph <em>G</em> is a subset <em>D</em> of <em>V</em> such that every vertex not in <em>D</em> is joined to at least one member of <em>D</em> by some edge.</li>
<li>In other words, a vertex set <em>D</em> is a dominating set in <em>G</em> if and only if every vertex of <em>G</em> is contained in <em>D</em> or is adjacent to a vertex in <em>D</em>.</li>
<li>Every maximal independent set <em>S</em> of vertices in a simple graph <em>G</em> has the property that every vertex of the graph either is contained in <em>S</em> or is adjacent to a vertex in <em>S</em>.
<ul>
<li>That is, an independent set is a dominating set if and only if it is a maximal independent set.</li>
</ul>
</li>
</ul>
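<p>The relation above can be checked on a small graph with the following Java sketch. As before, the class name, the method names, and the edge-list representation are assumptions of mine, not part of any library.</p>
<p><pre class="brush: java;">
class DominatingSetSketch {

    // True when no edge has both end points in the set (member[v] marks membership).
    static boolean isIndependent(int[][] edges, boolean[] member) {
        for (int[] e : edges) {
            if (member[e[0]]) {
                if (member[e[1]]) {
                    return false;
                }
            }
        }
        return true;
    }

    // True when every vertex is in the set or adjacent to a vertex in the set.
    static boolean isDominating(int[][] edges, boolean[] member) {
        boolean[] covered = member.clone();
        for (int[] e : edges) {
            if (member[e[0]]) {
                covered[e[1]] = true;
            }
            if (member[e[1]]) {
                covered[e[0]] = true;
            }
        }
        for (boolean c : covered) {
            if (!c) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Path graph 0 - 1 - 2 - 3.
        int[][] edges = { {0, 1}, {1, 2}, {2, 3} };
        boolean[] maximal = { true, false, true, false };         // {0, 2}
        boolean[] notMaximal = { true, false, false, false };     // {0}: vertex 2 or 3 can still be added
        boolean[] notIndependent = { false, true, true, false };  // {1, 2}

        System.out.println(isDominating(edges, maximal));          // true
        System.out.println(isDominating(edges, notMaximal));       // false
        System.out.println(isDominating(edges, notIndependent));   // true
        System.out.println(isIndependent(edges, notIndependent));  // false
    }
}
</pre></p>
<p>The set {0, 2} is independent and dominating, hence maximal. The set {0} is independent but not dominating, hence not maximal. The set {1, 2} shows that a dominating set need not be independent.</p>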
<h3>Relations to Graph Coloring</h3>
<ul>
<li>The independent set problem is related to the graph coloring problem: the vertices of an independent set can all receive the same color, so a proper coloring partitions the vertices into independent sets.</li>
</ul>
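<p>As a small illustration of this relation, the Java sketch below checks a coloring of a 4-cycle and confirms that each color class is an independent set. The class name, the method names, and the representation (color[v] is the color of vertex v) are assumptions for illustration only.</p>
<p><pre class="brush: java;">
class ColoringSketch {

    // A coloring is proper when the end points of every edge have different colors.
    static boolean isProperColoring(int[][] edges, int[] color) {
        for (int[] e : edges) {
            if (color[e[0]] == color[e[1]]) {
                return false;
            }
        }
        return true;
    }

    // True when the vertices carrying color c form an independent set.
    static boolean colorClassIsIndependent(int[][] edges, int[] color, int c) {
        for (int[] e : edges) {
            if (color[e[0]] == c) {
                if (color[e[1]] == c) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Cycle 0 - 1 - 2 - 3 - 0.
        int[][] edges = { {0, 1}, {1, 2}, {2, 3}, {3, 0} };
        int[] proper = { 0, 1, 0, 1 };
        int[] improper = { 0, 0, 1, 1 };

        System.out.println(isProperColoring(edges, proper));              // true
        System.out.println(colorClassIsIndependent(edges, proper, 0));    // true
        System.out.println(colorClassIsIndependent(edges, proper, 1));    // true
        System.out.println(isProperColoring(edges, improper));            // false
        System.out.println(colorClassIsIndependent(edges, improper, 0));  // false
    }
}
</pre></p>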
<h3>References</h3>
<ul>
<li>Chapter 10, <a href="http://www.amazon.com/Graph-Theory-Modeling-Applications-Algorithms/dp/0131423843" target="_blank">Graph Theory: Modeling, Applications, and Algorithms</a></li>
<li><a href="http://en.wikipedia.org/wiki/Independent_set_(graph_theory)">http://en.wikipedia.org/wiki/Independent_set_(graph_theory)</a></li>
<li><a href="http://en.wikipedia.org/wiki/Dominating_set">http://en.wikipedia.org/wiki/Dominating_set</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/04/24/a-brief-summary-of-independent-set-in-graph-theory/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Data-Intensive Text Processing with MapReduce: Draft Available Online</title>
		<link>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/</link>
		<comments>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 01:46:24 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data intensive]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[mapreduce]]></category>
		<category><![CDATA[text processing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=605</guid>
		<description><![CDATA[Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although the book is still a draft, it seems well organized and very interesting. In addition, it contains some basic graph algorithms [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.umiacs.umd.edu/~jimmylin/book.html" target="_blank">Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer</a></p>
<p>Until now, there have been no books that deal directly with MapReduce programming and algorithms. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although the book is still a draft, it seems well organized and very interesting. In addition, it contains some basic graph algorithms using MapReduce.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2010/03/11/data-intensive-text-processing-with-mapreduce-draft-available-in-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>How to Create A Table in HBase for Beginners</title>
		<link>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/</link>
		<comments>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/#comments</comments>
		<pubDate>Fri, 27 Nov 2009 02:33:36 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[create table]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hbase]]></category>
		<category><![CDATA[table]]></category>
		<category><![CDATA[tutorial]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=527</guid>
		<description><![CDATA[I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase through the projects I have participated in. From now on, I&#8217;ll post HBase know-how from time to time. Today, I&#8217;m going to introduce a way to create an HBase table in Java. HBase provides two ways for an HBase client to connect to the HBase master. [...]]]></description>
			<content:encoded><![CDATA[<p>I have accumulated some knowledge and know-how about MapReduce, Hadoop, and HBase through the projects I have participated in. From now on, I&#8217;ll post HBase know-how from time to time. Today, I&#8217;m going to introduce a way to create an HBase table in Java.</p>
<p>HBase provides two ways for an HBase client to connect to the HBase master. One is to use an instance of the <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html" target="_blank">HBaseAdmin</a> class. HBaseAdmin provides methods for creating, modifying, and deleting tables and column families. The other is to use an instance of the HTable class, which mostly provides methods for manipulating data, such as inserting, modifying, and deleting rows and cells.</p>
<p>Thus, to create an HBase table, we first connect to the HBase master by creating an instance of HBaseAdmin. HBaseAdmin requires an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HBaseConfiguration.html" target="_blank">HBaseConfiguration</a>. If necessary, you can set configuration properties on it, as the code below does for the hbase.master property.</p>
<p>To describe the HBase schema, we create an instance of <a href="http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/HColumnDescriptor.html" target="_blank">HColumnDescriptor</a> for each column family. In addition to the column family name, HColumnDescriptor lets you set various parameters, such as maxVersions, compression type, timeToLive, and bloomFilter. Finally, we create the HBase table by invoking createTable on the HBaseAdmin instance.</p>
<p><pre class="brush: java;">
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

// Client configuration; point it at the HBase master if necessary.
HBaseConfiguration conf = new HBaseConfiguration();
conf.set(&quot;hbase.master&quot;,&quot;localhost:60000&quot;);

// HBaseAdmin connects to the master and handles schema operations.
HBaseAdmin hbase = new HBaseAdmin(conf);

// Describe the table TEST with two column families, then create it.
HTableDescriptor desc = new HTableDescriptor(&quot;TEST&quot;);
HColumnDescriptor personal = new HColumnDescriptor(&quot;personal&quot;.getBytes());
HColumnDescriptor account = new HColumnDescriptor(&quot;account&quot;.getBytes());
desc.addFamily(personal);
desc.addFamily(account);
hbase.createTable(desc);
</pre></p>
<p>Finally, you can check that the table exists by running the list command in the HBase shell:</p>
<p><pre class="brush: bash;">
c0d3h4ck@code:~/Development/hbase$ bin/hbase shell
HBase Shell; enter 'help&lt;RETURN&gt;' for list of supported commands.
Version: 0.20.1, r822817, Wed Oct  7 11:55:42 PDT 2009
hbase(main):001:0&gt; list
TEST

1 row(s) in 0.0940 seconds
</pre> </p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/27/how-to-make-a-table-in-hbase-for-beginners/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>ACM SIGMOD 2010 Programming Contest</title>
		<link>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/</link>
		<comments>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/#comments</comments>
		<pubDate>Fri, 20 Nov 2009 11:44:06 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[News]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[acm]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[programming contest]]></category>
		<category><![CDATA[relational database]]></category>
		<category><![CDATA[SIGMOD]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=515</guid>
		<description><![CDATA[As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. SIGMOD holds an annual conference that is regarded as one of the best conferences in computer science. SIGMOD also organizes a programming contest in parallel with the conference. Quoted below is the call for this year&#8217;s contest. [...]]]></description>
			<content:encoded><![CDATA[<p>As you know, SIGMOD is ACM&#8217;s Special Interest Group on Management of Data. SIGMOD holds an annual conference that is regarded as one of the best conferences in computer science. SIGMOD also organizes a programming contest in parallel with the conference. Quoted below is the call for this year&#8217;s contest. This year&#8217;s subject seems very interesting! The task is to implement a simple distributed query executor built on top of last year&#8217;s main-memory index. The environment on which contestants will test their implementations may be provided by Amazon. If you are interested in the contest, give it a try. You can get further information from here (<a href="http://dbweb.enst.fr/events/sigmod10contest/" target="_blank">http://dbweb.enst.fr/events/sigmod10contest</a>).</p>
<blockquote><p>A programming contest is organized in parallel with the ACM SIGMOD 2010 conference, following the success of the first annual SIGMOD programming contest organized last year. Student teams from degree-granting institutions are invited to compete to develop a distributed query engine over relational data. Submissions will be judged on the overall performance of the system on a variety of workloads. A shortlist of finalists will be invited to present their implementation at the SIGMOD conference in June 2010 in Indianapolis, USA. The winning team, to be selected during the conference, will be awarded a prize of 5,000 USD and will be invited to a one-week research visit in Paris. The winning system, released in open source, will form a building block of a complete distributed database system which will be built over the years, throughout the programming contests.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/20/acm-sigmod-2010-programming-contest/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>CIKM 2009 in Hong Kong</title>
		<link>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/</link>
		<comments>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/#comments</comments>
		<pubDate>Mon, 09 Nov 2009 15:08:26 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[cikm]]></category>
		<category><![CDATA[cikm09]]></category>
		<category><![CDATA[conference]]></category>
		<category><![CDATA[Hong Kong]]></category>
		<category><![CDATA[spider]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=473</guid>
		<description><![CDATA[Together with Min Kyoung Sung, a coauthor of &#8216;SPIDER: A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data&#8217;, I attended the 18th ACM CIKM 2009 (Conference on Information and Knowledge Management), held in Hong Kong. We stayed at the Marriott Hotel near the Asia World-Expo, where CIKM 2009 was held. At [...]]]></description>
			<content:encoded><![CDATA[<p>Together with Min Kyoung Sung, a coauthor of &#8216;<a href="http://dbserver.korea.ac.kr/projects/spider/" target="_blank"><em>SPIDER: A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data</em></a>&#8217;, I attended the <a href="http://www.comp.polyu.edu.hk/conference/cikm2009/about/index.htm" target="_blank">18th ACM CIKM 2009 (Conference on Information and Knowledge Management)</a>, held in Hong Kong. We stayed at the Marriott Hotel near the <a href="http://www.asiaworld-expo.com/" target="_blank">Asia World-Expo</a>, where CIKM 2009 was held. At the conference, I spent time with several Korean researchers (Kyong-Ha Lee, Jinoh Oh, and Sangchul Kim), and during the demonstration session I discussed SPIDER with researchers who are interested in RDF data processing.</p>
<p>At CIKM 2009, I felt that the trend in web data management is shifting toward information extraction and semantic or structured web data, rather than unstructured data. Many papers and posters addressed these issues. In addition, the subject of the panel was ‘<span><strong><em>Information extraction meets relational databases: Where are we heading?</em></strong></span>’ One of the panelists said that the hot spot of web data management research is moving from crawling, indexing, and searching to information extraction and semantic data. These changes lead to a variety of new data and knowledge management issues. Besides information extraction, graph data mining was one of the hot topics at CIKM 2009.</p>
<p>In the main keynote, Kyu-Young Whang (KAIST, Korea) spoke on &#8216;<strong><em>DB-IR Integration and Its Application to a Massively-Parallel Search Engine</em></strong>&#8217;. His key point was that DB-IR integration is becoming one of the major challenges in the database area, and that it is leading to new DBMS architectures suited to DB-IR integration. In addition, Edward Chang (Google Research China) and Clement Yu (University of Illinois at Chicago) spoke on &#8216;<strong><em>Confucius and its intelligent Disciples</em></strong>&#8217; and &#8216;<strong><em>Advanced Metasearch Engines</em></strong>&#8217;, respectively.</p>
<p style="text-align:center;"><a class="flickr-image alignnone" title="Coffee Break at CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088464259/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2764/4088464259_4f6498eca2_m.jpg" alt="Coffee Break at CIKM 2009" /></a><a class="flickr-image alignnone" title="SPIDER in Demo Session" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088463803/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2752/4088463803_b53bbd8646_m.jpg" alt="SPIDER in Demo Session" /></a></p>
<p style="text-align:center;"><a class="flickr-image alignnone" title="Tian Tan Buddha Statue in Hong Kong" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088461317/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2609/4088461317_5546d70eff_m.jpg" alt="Tian Tan Buddha Statue in Hong Kong" /></a><a class="flickr-image alignnone" title="The lunch time in CIKM 2009" rel="flickr-mgr[CIKM]" href="http://www.flickr.com/photos/hyunsik/4088462251/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2591/4088462251_d5875a68e3_m.jpg" alt="The lunch time in CIKM 2009" /></a></p>
<p>This conference was a really nice experience for me. I enjoyed the conference, reception, and banquet. However, I regret that I didn&#8217;t attend <a href="http://www.clouddb.org/CloudDB09/" target="_blank">the 1st Workshop CloudDB 2009</a>, held in conjunction with CIKM 2009.</p>
<p>In any case, this conference inspired Min Kyoung Sung and me. We will remember it for a long time.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/11/10/cikm-2009-in-hong-kong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2764/4088464259_4f6498eca2_m.jpg" medium="image">
			<media:title type="html">Coffee Break at CIKM 2009</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2752/4088463803_b53bbd8646_m.jpg" medium="image">
			<media:title type="html">SPIDER in Demo Session</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2609/4088461317_5546d70eff_m.jpg" medium="image">
			<media:title type="html">Tian Tan Buddha Statue in Hong Kong</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2591/4088462251_d5875a68e3_m.jpg" medium="image">
			<media:title type="html">The lunch time in CIKM 2009</media:title>
		</media:content>
	</item>
		<item>
		<title>MapReduce Online Comes Out!</title>
		<link>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/</link>
		<comments>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/#comments</comments>
		<pubDate>Mon, 19 Oct 2009 15:49:37 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[online aggregation]]></category>
		<category><![CDATA[stream queries]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=461</guid>
		<description><![CDATA[MapReduce has been gaining much attention in the data-intensive computing field. As you know, it is well known as a very popular framework for batch processing. Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, developed MapReduce Online. Today, I heard this news from Data Beta. It is amazing work, since the [...]]]></description>
			<content:encoded><![CDATA[<p>MapReduce has been gaining much attention in the data-intensive computing field. As you know, it is well known as a very popular framework for batch processing.</p>
<p>Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, developed <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>. Today, I heard this news from <a href="http://databeta.wordpress.com/2009/10/18/mapreduce-online/" target="_self">Data Beta</a>. It is amazing work, since the original MapReduce was designed only for batch processing. In addition, most people believed that MapReduce would remain a batch-processing framework.</p>
<p>The essence of MapReduce Online is that it preserves the fault-tolerance model of the <a href="http://labs.google.com/papers/mapreduce.html" target="_self">original MapReduce</a> while pipelining results across tasks and jobs, instead of materializing the output of each MapReduce task and job to disk. Consequently, MapReduce Online lets a program return early results from a big job.</p>
<p>You can get further information from <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html" target="_self">MapReduce Online</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/20/mapreduce-onlie-comes-out/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>BSP Library on Hadoop?</title>
		<link>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/</link>
		<comments>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 11:45:33 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[FOSS]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[angrapa]]></category>
		<category><![CDATA[apache]]></category>
		<category><![CDATA[bsp]]></category>
		<category><![CDATA[bulk synchronization parallel]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hama]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=443</guid>
		<description><![CDATA[Recently, I started to participate in the Hama project (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing the bulk synchronous parallel (BSP) library on Hadoop (HAMA-195); I&#8217;m getting help from Edward Yoon, a founder of the Hama project. The motivation for the BSP library is [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I started to participate in the <a href="http://incubator.apache.org/hama/" target="_self">Hama project</a> (a distributed scientific package on Hadoop for massive matrix and graph data), and I have been spending time developing the <a href="http://en.wikipedia.org/wiki/Bulk_synchronous_parallel" target="_self">bulk synchronous parallel</a> (BSP) library on Hadoop (<a href="https://issues.apache.org/jira/browse/HAMA-195" target="_self">HAMA-195</a>); I&#8217;m getting help from <a href="http://blog.udanax.org/" target="_self">Edward Yoon</a>, a founder of the Hama project. The motivation for the BSP library is clear.</p>
<p>Hadoop is installed at cloud computing service providers and many companies, as you can see at <a href="http://wiki.apache.org/hadoop/PoweredBy" target="_self">http://wiki.apache.org/hadoop/PoweredBy</a>. However, most of them probably run only MapReduce programs. Although MapReduce is very scalable, it provides only a simple programming model. Many programmers want to use a wider variety of programming models without changing the platform (i.e., <a href="http://hadoop.apache.org" target="_self">Hadoop</a>). The BSP library will be a first step toward meeting that need. However, like MapReduce, BSP is not a Swiss Army knife either. Once we find appropriate applications, the BSP library on Hadoop will be valued for its scalability and capability.</p>
<p>Soon, I&#8217;ll post articles about the progress of the BSP library and <a href="http://wiki.apache.org/hama/GraphPackage" target="_self">Angrapa</a> (the graph package on Hama).</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/10/09/bsp-library-on-hadoop/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Java Universal Network/Graph Framework</title>
		<link>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/</link>
		<comments>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/#comments</comments>
		<pubDate>Mon, 14 Sep 2009 23:30:45 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jung]]></category>
		<category><![CDATA[visualization tools]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=349</guid>
		<description><![CDATA[Lately, I have been primarily concerned with large-scale graph data processing. Visualizing a graph can be a good way to observe some properties of graph data sets. Today, I&#8217;m going to introduce a graph framework called the Java Universal Network/Graph Framework (Jung). Jung provides data structures for graphs, a programming interface tailored to graph [...]]]></description>
			<content:encoded><![CDATA[<p>Lately, I have been primarily concerned with large-scale graph data processing. Visualizing a graph can be a good way to observe some properties of graph data sets. Today, I&#8217;m going to introduce a graph framework called the <em><a href="http://jung.sourceforge.net/" target="_blank">Java Universal Network/Graph Framework (Jung)</a>. </em>Jung provides data structures for graphs, a programming interface tailored to graph features, some fundamental graph algorithms (e.g., minimum spanning tree, depth-first search, breadth-first search, and Dijkstra&#8217;s algorithm), and even visualization methods. I&#8217;m especially interested in its visualization methods.</p>
<p>The following Java source shows the programming interface of Jung. In more detail, this program makes a graph, adds three vertices to it, and connects the vertices with edges. The source code is taken from the <a href="http://jung.sourceforge.net/doc/index.html" target="_blank">Jung tutorial</a>. As you can see, Jung&#8217;s APIs are very easy to use.</p>
<p><pre class="brush: java;">
  // Imports needed by this snippet (JUNG 2.0, AWT, and Swing).
  import java.awt.Dimension;
  import javax.swing.JFrame;
  import edu.uci.ics.jung.algorithms.layout.KKLayout;
  import edu.uci.ics.jung.algorithms.layout.Layout;
  import edu.uci.ics.jung.graph.Graph;
  import edu.uci.ics.jung.graph.SparseMultigraph;
  import edu.uci.ics.jung.graph.util.EdgeType;
  import edu.uci.ics.jung.visualization.BasicVisualizationServer;
  import edu.uci.ics.jung.visualization.decorators.ToStringLabeller;

  // The statements below belong inside a method such as main().
  // Make a graph as a SparseMultigraph instance (Integer vertices, String edges).
  Graph&lt;Integer, String&gt; g = new SparseMultigraph&lt;Integer, String&gt;();
  g.addVertex((Integer)1); // Add a vertex with the integer value 1.
  g.addVertex((Integer)2);
  g.addVertex((Integer)3);
  g.addEdge("Edge-A", 1, 3); // Add an edge connecting vertices 1 and 3.
  g.addEdge("Edge-B", 2, 3, EdgeType.DIRECTED);
  g.addEdge("Edge-C", 3, 2, EdgeType.DIRECTED);
  g.addEdge("Edge-P", 2, 3); // A parallel edge

  // Make some objects for graph layout and visualization.
  Layout&lt;Integer, String&gt; layout = new KKLayout&lt;Integer, String&gt;(g);
  BasicVisualizationServer&lt;Integer, String&gt; vv =
      new BasicVisualizationServer&lt;Integer, String&gt;(layout);
  vv.setPreferredSize(new Dimension(800, 800));

  // Determines how each vertex value is rendered as a label in the diagram.
  ToStringLabeller&lt;Integer&gt; vertexLabeller = new ToStringLabeller&lt;Integer&gt;() {
    public String transform(Integer i) {
      return "" + i;
    }
  };

  vv.getRenderContext().setVertexLabelTransformer(vertexLabeller);

  // Show the visualization in a Swing window.
  JFrame frame = new JFrame("Simple Graph View");
  frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
  frame.getContentPane().add(vv);
  frame.pack();
  frame.setVisible(true);
</pre></p>
<p>Some of Jung&#8217;s APIs are based on generic programming, so vertices and edges can easily carry user-defined data. If you want more detailed information, visit <a href="http://jung.sourceforge.net/">http://jung.sourceforge.net</a>.</p>
<p>The source code above produces the following diagram.<br />
<a class="flickr-image aligncenter" title="Jung example" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3919489249/"><img class="flickr-medium aligncenter" src="http://farm3.static.flickr.com/2646/3919489249_3377cc8c63.jpg" alt="Jung example" width="347" height="346" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/15/java-universal-networkgraph-framework/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2646/3919489249_3377cc8c63.jpg" medium="image">
			<media:title type="html">Jung example</media:title>
		</media:content>
	</item>
		<item>
		<title>Zipf Distribution Generator in Java</title>
		<link>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/</link>
		<comments>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/#comments</comments>
		<pubDate>Sun, 13 Sep 2009 14:17:34 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[distribution]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[zipf]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=369</guid>
		<description><![CDATA[When I carry out experiments, I usually generate synthetic data sets from some probability distribution. The Zipf distribution in particular is frequently used for synthetic data sets. It is one of the discrete power-law probability distributions. You can find detailed information under Zipf&#8217;s law in Wikipedia. I have attached my own Java [...]]]></description>
			<content:encoded><![CDATA[<p>When I carry out experiments, I usually generate synthetic data sets from some probability distribution. The Zipf distribution in particular is frequently used for synthetic data sets. It is one of the discrete power-law probability distributions. You can find detailed information under <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf&#8217;s law</a> in Wikipedia. I have attached my own Java class for the Zipf distribution. The graphs below were generated with my Java code and gnuplot.</p>
<pre><a class="flickr-image alignleft" title="Zipf Distribution (s=1)" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3914971725/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2528/3914971725_39800bd7f5_m.jpg" alt="Zipf Distribution (s=1)" /></a><a class="flickr-image alignnone" title="Zipf Distribution with log scale (s=1)" rel="flickr-mgr" href="http://www.flickr.com/photos/hyunsik/3914971927/"><img class="flickr-medium" src="http://farm3.static.flickr.com/2486/3914971927_df23796db2_m.jpg" alt="Zipf Distribution with log scale (s=1)" /></a>

<pre class="brush: java;">
import java.util.Random;

public class ZipfGenerator {
  private final Random rnd = new Random(System.currentTimeMillis());
  private final int size;
  private final double skew;
  // Normalization constant: the sum of 1/i^skew for i = 1..size.
  private double bottom = 0;

  public ZipfGenerator(int size, double skew) {
    this.size = size;
    this.skew = skew;

    // Include rank == size so that the probabilities of ranks 1..size sum to 1.
    for (int i = 1; i &lt;= size; i++) {
      this.bottom += (1 / Math.pow(i, this.skew));
    }
  }

  // Returns a rank id in [1, size]. The frequency of the returned rank ids
  // follows the Zipf distribution (rejection sampling: draw a rank uniformly
  // and accept it with probability getProbability(rank)).
  public int next() {
    int rank;
    double frequency;
    double dice;

    do {
      // nextInt(size) yields 0..size-1; shift to 1..size because rank 0
      // would make 1/rank^skew infinite.
      rank = rnd.nextInt(size) + 1;
      frequency = getProbability(rank);
      dice = rnd.nextDouble();
    } while (!(dice &lt; frequency));

    return rank;
  }

  // Returns the probability that the given rank occurs.
  public double getProbability(int rank) {
    return (1.0d / Math.pow(rank, this.skew)) / this.bottom;
  }

  public static void main(String[] args) {
    if (args.length != 2) {
      System.out.println("usage: ./zipf size skew");
      System.exit(-1);
    }

    ZipfGenerator zipf = new ZipfGenerator(Integer.parseInt(args[0]),
        Double.parseDouble(args[1]));
    // Print "rank probability" pairs, e.g. as input for gnuplot.
    for (int i = 1; i &lt;= 100; i++)
      System.out.println(i + " " + zipf.getProbability(i));
  }
}
</pre>

</pre>
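<p>The rejection loop in next() accepts a uniformly drawn rank with probability equal to that rank&#8217;s Zipf probability, so on average it needs about <em>size</em> draws per returned rank. The following is a minimal sketch of one possible alternative, not part of the original class: it precomputes the cumulative probabilities once and then finds a rank by binary search. The class name ZipfTableSampler, its methods, and the explicit seed parameter are assumptions made for illustration.</p>

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical alternative sampler (not from the original post):
// inverse-CDF sampling over a precomputed cumulative table.
class ZipfTableSampler {
  private final double[] cdf; // cdf[i] = P(rank <= i + 1)
  private final Random rnd;

  ZipfTableSampler(int size, double skew, long seed) {
    this.cdf = new double[size];
    this.rnd = new Random(seed);
    double sum = 0;
    for (int i = 1; i <= size; i++) {
      sum += 1.0 / Math.pow(i, skew);
      cdf[i - 1] = sum; // unnormalized running sum
    }
    for (int i = 0; i < size; i++) {
      cdf[i] /= sum; // normalize so the last entry is 1
    }
    cdf[size - 1] = 1.0; // guard against floating-point rounding
  }

  // Probability of a single rank, recovered from the cumulative table.
  double probability(int rank) {
    double below = (rank > 1) ? cdf[rank - 2] : 0.0;
    return cdf[rank - 1] - below;
  }

  // Returns the smallest rank r in [1, size] with u < cdf[r - 1].
  int next() {
    double u = rnd.nextDouble(); // uniform in [0, 1)
    int pos = Arrays.binarySearch(cdf, u);
    // Not found: pos == -(insertionPoint) - 1, where insertionPoint is the
    // index of the first entry greater than u. Found: step past the match.
    int index = (pos >= 0) ? pos + 1 : -pos - 1;
    return index + 1;
  }

  public static void main(String[] args) {
    ZipfTableSampler sampler = new ZipfTableSampler(10, 1.0, 42L);
    int[] counts = new int[11];
    for (int i = 0; i < 100000; i++) {
      counts[sampler.next()]++;
    }
    // Print "rank count" pairs; lower ranks should appear more often.
    for (int rank = 1; rank <= 10; rank++) {
      System.out.println(rank + " " + counts[rank]);
    }
  }
}
```

<p>The table costs memory proportional to <em>size</em>, but each sample then takes a single random number and a logarithmic-time search, which matters when many samples are drawn from a large rank range.</p>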
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/13/zipf-distribution-generator-in-java/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2528/3914971725_39800bd7f5_m.jpg" medium="image">
			<media:title type="html">Zipf Distribution (s=1)</media:title>
		</media:content>

		<media:content url="http://farm3.static.flickr.com/2486/3914971927_df23796db2_m.jpg" medium="image">
			<media:title type="html">Zipf Distribution with log scale (s=1)</media:title>
		</media:content>
	</item>
		<item>
		<title>One-column abstract in two-column layouts in articles on Latex</title>
		<link>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/</link>
		<comments>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 01:43:39 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[abstract]]></category>
		<category><![CDATA[latex]]></category>
		<category><![CDATA[one-column]]></category>
		<category><![CDATA[two-column]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=359</guid>
		<description><![CDATA[If you want a one-column abstract in a two-column article in LaTeX, just load the abstract package and follow the source code below. Ubuntu Linux users can install the abstract package as part of the &#8216;texlive-latex-extra&#8217; package via Synaptic. Others can get it from http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/ You should move &#8216;\maketitle&#8217; inside &#8216;\twocolumn&#8217;, as in the code, and remove &#8216;abstract&#8217;. You can [...]]]></description>
			<content:encoded><![CDATA[<p>If you want a one-column abstract in a two-column article in LaTeX, just load the abstract package and follow the source code below. Ubuntu Linux users can install the abstract package as part of the &#8216;texlive-latex-extra&#8217; package via Synaptic. Others can get it from <a href="http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/" target="_blank">http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/</a></p>
<p><pre class="brush: plain;">
\usepackage{abstract}

\twocolumn[
  \maketitle
  \begin{onecolabstract}
    The one-column abstract goes here.
  \end{onecolabstract}
]
</pre></p>
<p>You should move &#8216;\maketitle&#8217; inside &#8216;\twocolumn[...]&#8217;, as in the code above, and remove the existing &#8216;abstract&#8217; environment.</p>
<p>You can find further information about the abstract package from <a href="http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/abstract.pdf" target="_blank">http://www.tex.ac.uk/tex-archive/macros/latex/contrib/abstract/abstract.pdf</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/11/one-column-abstract-in-two-column-layouts-in-articles-on-latex/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>A Brief Introduction to Skyline Problem (Pareto-optimal Tuples) (1)</title>
		<link>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/</link>
		<comments>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/#comments</comments>
		<pubDate>Sun, 06 Sep 2009 06:27:09 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[decision making]]></category>
		<category><![CDATA[pareto tuples]]></category>
		<category><![CDATA[query]]></category>
		<category><![CDATA[skyline]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=78</guid>
		<description><![CDATA[The skyline problem is to compute the best tuples from a set of ordered d-tuples. The name comes from the fact that the solution plotted on a 2D plane resembles the outline formed by urban buildings. Skyline is a kind of recommendation query that considers multiple criteria. It is a very interesting problem as well as [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">The s</span>kyline problem is to compute the best tuples from a set of ordered <em>d</em>-tuples. The name comes from the fact that the solution plotted on a 2D plane resembles the outline formed by urban buildings. Skyline is a kind of recommendation query that considers multiple criteria. It is a very interesting problem as well as a very useful query, and it has been intensively studied in recent years. Today, I’m going to present the problem definition of skyline. Next time, I&#8217;ll describe several algorithms for the skyline problem.</p>
<p><a style="float:left;margin-right:5px;" title="Singapore Skyline (#12) by Christopher Chan, on Flickr" href="http://www.flickr.com/photos/chanc/469796567/"><img src="http://farm1.static.flickr.com/226/469796567_311f4a3b79.jpg" alt="Singapore Skyline (#12)" width="250" /></a> First of all, let us look at the input data. The input data <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> of the skyline problem is a set of <em>n</em> ordered <em>d</em>-tuples, each of which consists of <em>d</em> ordered scalar values, as shown in the formulas below:</p>
<p><img style="display:block;float:none;margin-left:auto;margin-right:auto;" src="http://www.codecogs.com/eq.latex?D^{d}%20=%20%5C{tp_{1},tp_{2},...,tp_{n}%5C}" alt="D^{d} = \{tp_{1},tp_{2},...,tp_{n}\}" /></p>
<div id="equationview" style="text-align:center;">
<div id="equationview"><img src="http://www.codecogs.com/eq.latex?tp_%7Bi%7D%20=%20%28v_%7B1%7D,v_%7B2%7D,...,v_%7Bd%7D%29" border="0" alt="tp_{i} = (v_{1},v_{2},...,v_{d})" align="absmiddle" /></div>
</div>
<p><em> </em></p>
<div id="equationview"><img src="http://www.codecogs.com/eq.latex?tp_%7Bi%7D" border="0" alt="tp_{i}" align="absmiddle" /> denotes a <em>d</em>-tuple. Next, we need to understand the definition of the dominance relation. Because the skyline problem is to find the better tuples, we also need an assumption about what &#8216;better&#8217; means. In most of the literature, a smaller value is assumed to be better, so we follow this assumption.</div>
<blockquote><p><span style="background-color:#ffffff;"><strong>Definition 1 (Dominance). </strong></span><span style="background-color:#ffffff;">Let <em>tp</em> and <em>tp’</em> be tuples in <img src="http://www.codecogs.com/eq.latex?D^{d}" alt="D^{d}" />, where </span><img src="http://www.codecogs.com/eq.latex?v_%7Bi%7D" border="0" alt="v_{i}" align="absmiddle" /> <span style="background-color:#ffffff;">is the <em>i</em>-th element of <em>tp</em> and </span><img src="http://www.codecogs.com/eq.latex?u_%7Bi%7D" border="0" alt="u_{i}" align="absmiddle" /> <span style="background-color:#ffffff;">is the <em>i</em>-th element of <em>tp&#8217; </em>for </span><img src="http://www.codecogs.com/eq.latex?1%20%5Cleq%20i%20%5Cleq%20d" border="0" alt="1 \leq i \leq d" align="absmiddle" /><span style="background-color:#ffffff;">. Then, <em>tp</em> <strong>dominates</strong> <em>tp’</em> </span><span style="background-color:#ffffff;">if and only if  <img src="http://www.codecogs.com/eq.latex?%5Cforall{i},%20v_{i}%20%5Cleq%20u_{i}%20%5Cland%20%5Cexists{j},%20v_{j}%20%3C%20u_{j}" alt="\forall{i}, v_{i} \leq u_{i} \land \exists{j}, v_{j} &lt; u_{j}" width="182" height="17" />.</span></p></blockquote>
<p>In other words, it is said that one tuple <img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> dominates another tuple <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> if <img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> is not worse (not greater) than <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> in all dimensions and<em> </em><img src="http://www.codecogs.com/eq.latex?tp" border="0" alt="tp" align="absmiddle" /> is better (less) than <img src="http://www.codecogs.com/eq.latex?tp%27" border="0" alt="tp'" align="absmiddle" /> in at least one dimension.</p>
<blockquote><p><strong>Definition 2 (Skyline).</strong> Given a data set <img src="http://www.codecogs.com/eq.latex?D^{d}" alt="D^{d}" />, the skyline consists of the tuples that are not dominated by any other tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" />.</p></blockquote>
<p>As stated in the definition above, a skyline is the set of tuples that are not dominated by any other tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" />. In the literature, a <em>d</em>-dimensional data set and the two definitions above are usually illustrated, for ease of understanding, as points on <em>d</em> axes.</p>
<p style="text-align:left;">For simplicity, we assume that <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> is a 2D data set (i.e., <em>d</em>=2). A data set is given as follows:</p>
<ul>
<li>a = (3,2)</li>
<li>b = (8,1)</li>
<li>c = (1,10)</li>
<li>d = (4,3)</li>
<li>e = (8,6)</li>
</ul>
<p style="text-align:left;">Each element of a tuple in <img src="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" alt="D^{d}" /> can be mapped to one axis. In other words, the first and second elements of each tuple are mapped to the X and Y axes, respectively. The tuples in the list above are then plotted as 2D points, as shown in Fig. 1.</p>
<div id="attachment_324" class="wp-caption aligncenter" style="width: 300px"><img class="size-full wp-image-324" title="Fig. 1. An example of a skyline" src="http://diveintodata.files.wordpress.com/2009/09/skyline_intro.png?w=590" alt="Fig. 1. An example of a skyline"   /><p class="wp-caption-text">Fig. 1. An example of a skyline</p></div>
<p>In Fig. 1, let us look at the dominance relations. The point <em>a</em> dominates the points {<em>d,e</em>} since both coordinates of <em>a</em> are less than those of <em>d</em> and <em>e</em>. The point <em>b</em> dominates only <em>e</em>, since the X values of <em>b</em> and <em>e</em> are the same (i.e., X=8) but the Y value of <em>b</em> (i.e., 1) is less than that of <em>e</em> (i.e., 6). The points {<em>d,e</em>} cannot belong to the skyline because they are dominated by other tuples. Consequently, the points <em>a</em>, <em>b</em>, and <em>c</em> belong to the skyline since they are not dominated by any other tuple.</p>
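<p>The two definitions can also be checked mechanically on the five example tuples. The following is a minimal sketch, assuming that smaller values are better as stated earlier; it uses a naive pairwise comparison rather than any of the published skyline algorithms, and the names SkylineExample, dominates, skyline, and exampleData are hypothetical.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of Definitions 1 and 2 (smaller is better).
class SkylineExample {
  // Definition 1: p dominates q if p is not greater than q in every
  // dimension and strictly less than q in at least one dimension.
  static boolean dominates(int[] p, int[] q) {
    boolean strictlyBetter = false;
    for (int i = 0; i < p.length; i++) {
      if (p[i] > q[i]) {
        return false; // worse in dimension i, so p cannot dominate q
      }
      if (p[i] < q[i]) {
        strictlyBetter = true;
      }
    }
    return strictlyBetter;
  }

  // Definition 2: keep every tuple that no other tuple dominates.
  // Naive O(n^2) pairwise check; a tuple never dominates itself.
  static List<String> skyline(Map<String, int[]> data) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, int[]> candidate : data.entrySet()) {
      boolean dominated = false;
      for (int[] other : data.values()) {
        if (dominates(other, candidate.getValue())) {
          dominated = true;
          break;
        }
      }
      if (!dominated) {
        result.add(candidate.getKey());
      }
    }
    return result;
  }

  // The example data set from the list above, in insertion order.
  static Map<String, int[]> exampleData() {
    Map<String, int[]> data = new LinkedHashMap<>();
    data.put("a", new int[] {3, 2});
    data.put("b", new int[] {8, 1});
    data.put("c", new int[] {1, 10});
    data.put("d", new int[] {4, 3});
    data.put("e", new int[] {8, 6});
    return data;
  }

  public static void main(String[] args) {
    System.out.println(skyline(exampleData())); // prints [a, b, c]
  }
}
```

<p>The pairwise check is only meant to mirror the definitions; it compares every tuple with every other tuple, so it is not a substitute for the dedicated skyline algorithms.</p>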
<p>Initially, the skyline problem was known as the <em><a href="http://portal.acm.org/citation.cfm?id=321910" target="_blank">maxima vector problem (H. T. Kung et al. 1975)</a></em> for traditional processing systems. The problem was later revisited in <a href="http://portal.acm.org/citation.cfm?id=656550&amp;dl=" target="_blank">the Skyline Operator (Stephan Börzsönyi et al. 2001)</a>. Since then, it has been intensively studied in the database area.</p>
<p>Next time, I&#8217;ll describe several algorithms in detail, including those from the papers above.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/09/06/a-brief-introduction-to-skyline-problem-pareto-optimal-tuples-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://farm1.static.flickr.com/226/469796567_311f4a3b79.jpg" medium="image">
			<media:title type="html">Singapore Skyline (#12)</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd%20=%20tp_1,tp_2,...tp_n" medium="image">
			<media:title type="html">D^{d} = {tp_{1},tp_{2},...tp_{n}}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp_%7Bi%7D%20=%20%28v_%7B1%7D,v_%7B2%7D,...,v_%7Bd%7D%29" medium="image">
			<media:title type="html">tp_{i} = (v_{1},v_{2},...,v_{d})</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp_%7Bi%7D" medium="image">
			<media:title type="html">tp_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?v_%7Bi%7D" medium="image">
			<media:title type="html">v_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?u_%7Bi%7D" medium="image">
			<media:title type="html">u_{i}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?1%20%3C%20i%20%5Cleq%20d" medium="image">
			<media:title type="html">1 &#60; i leq d</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?foralli,%20v_i%20leq%20u_i%20land%20existsj,%20v_j%20%3C%20u_j" medium="image">
			<media:title type="html">forall{i}, v_{i} leq u_{i} land exists{j}, v_{j} &#60; u_{j}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp" medium="image">
			<media:title type="html">tp</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?tp%27" medium="image">
			<media:title type="html">tp&#039;</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?Dd" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://www.codecogs.com/eq.latex?D%5E%7Bd%7D" medium="image">
			<media:title type="html">D^{d}</media:title>
		</media:content>

		<media:content url="http://diveintodata.files.wordpress.com/2009/09/skyline_intro.png" medium="image">
			<media:title type="html">Fig. 1. An example of a skyline</media:title>
		</media:content>
	</item>
		<item>
		<title>Some Interesting Papers of ACM SIGMOD Conference 2009</title>
		<link>http://diveintodata.org/2009/08/08/some-interesting-papers-of-acm-sigmod-conference-2009/</link>
		<comments>http://diveintodata.org/2009/08/08/some-interesting-papers-of-acm-sigmod-conference-2009/#comments</comments>
		<pubDate>Sat, 08 Aug 2009 13:31:48 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[paper]]></category>
		<category><![CDATA[SIGMOD]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=87</guid>
		<description><![CDATA[The ACM SIGMOD Conference 2009 was held in Providence, Rhode Island, from June 29 through July 2, and the electronic proceedings are now available online. Among the many good papers, I picked out some that I find interesting: MapReduce &#38; Hadoop “A Comparison of Approaches to Large Scale Data Analysis,” Andrew Pavlo, Samuel Madden, David DeWitt, Michael [...]]]></description>
			<content:encoded><![CDATA[<p>The ACM SIGMOD Conference 2009 was held in Providence, Rhode Island, from June 29 through July 2, and the electronic proceedings are now available online. Among the many good papers, I picked out some that I find interesting:</p>
<h4>MapReduce &amp; Hadoop</h4>
<ul>
<li>“A Comparison of Approaches to Large Scale Data Analysis,” Andrew Pavlo, Samuel Madden, David DeWitt, Michael Stonebraker, Alexander Rasin, Erik Paulson, Lakshmikant Shrinivas and Daniel Abadi.</li>
</ul>
<p><span style="color:#400080;"><strong>Some of the authors are affiliated with Vertica, a parallel database. Prof. DeWitt strongly criticized MapReduce (<em><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html" target="_blank">MapReduce: A major step backwards</a></em>, </strong></span><span style="color:#400080;"><strong><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-continued.html" target="_blank">MapReduce II</a></strong></span><span style="color:#400080;"><strong>), so I wonder how they benchmarked the two architectures.</strong></span></p>
<h4>Skyline Queries</h4>
<ul>
<li>“Minimizing the Communication Cost for Continuous Skyline Maintenance,” Zhenjie Zhang, Reynold Cheng, Dimitris Papadias, Anthony K. H. Tung.</li>
<li>“Scalable Skyline Computation Using Object-based Space Partitioning,” ZHANG Shiming, Nikos Mamoulis, David Cheung.</li>
<li>“Kernel-Based Skyline Cardinality Estimation,” Zhenjie Zhang, Yin Yang, Ruichu Cai, Dimitris Papadias, and Anthony K. H. Tung.</li>
</ul>
<p><strong><span style="color:#400080;">Ever since I first encountered the skyline problem, I have been interested in skyline queries. Given multiple criteria, a skyline query retrieves the best tuples among multi-dimensional objects, that is, the tuples that no other tuple dominates.</span></strong></p>
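<p>As a minimal sketch of what a skyline query computes, the Python snippet below uses a naive quadratic scan rather than any of the algorithms from the papers above. It assumes that smaller values are better in every dimension; the names <code>dominates</code> and <code>skyline</code> and the hotel data are hypothetical and chosen only for illustration.</p>

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """Return True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller values are better in every dimension.
    """
    no_worse = all(b >= a for a, b in zip(p, q))
    strictly_better = any(b > a for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive quadratic scan: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach); lower is better for both.
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
print(skyline(hotels))
```

<p>Here (70, 5) drops out because (60, 3) is both cheaper and closer, and (90, 2) drops out because of (80, 1). The remaining three hotels are incomparable trade-offs, so they form the skyline.</p>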
<h4>Graph Query Processing</h4>
<ul>
<li>“3-HOP: A High-Compression Indexing Scheme for Reachability Query,” Ruoming Jin, Yang Xiang, Ning Ruan, and Dave Fuhry.</li>
</ul>
<p><span style="color:#400080;"><strong>A reachability query determines whether one given vertex is reachable from another. It is one of the most fundamental operations in graph querying and is often used as a primitive operation in more complex graph queries.</strong></span></p>
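<p>To make the operation concrete, here is a rough sketch that answers a reachability query with a plain breadth-first search. This is not the 3-HOP index from the paper; as far as I understand, indexing schemes exist precisely to avoid running such a traversal for every query. The function name <code>reachable</code> and the small graph are hypothetical.</p>

```python
from collections import deque
from typing import Dict, Hashable, List


def reachable(graph: Dict[Hashable, List[Hashable]], source: Hashable, target: Hashable) -> bool:
    """Breadth-first search from source; True if target gets visited."""
    seen = {source}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            return True
        for neighbor in graph.get(vertex, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


# Hypothetical directed graph stored as adjacency lists.
edges = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(reachable(edges, "a", "d"))
print(reachable(edges, "d", "a"))
```

<p>Each query costs time proportional to the part of the graph it visits, which is why an index becomes attractive on large graphs.</p>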
<h4>RDF Query Processing</h4>
<ul>
<li>“Scalable Join Processing on Very Large RDF Graphs,” Thomas Neumann and Gerhard Weikum.</li>
</ul>
<p><strong><span style="color:#400080;">The issue I am primarily concerned with is RDF query processing. As linked data gains attention, I expect the database community to deal with this issue more and more.</span></strong></p>
<h4>Spatial Query Processing</h4>
<ul>
<li>“Quality and Efficiency in High Dimensional Nearest Neighbor Search,” Yufei Tao, Ke Yi, Cheng Sheng and Panos Kalnis.</li>
<li>“Continuous Obstructed Nearest Neighbor Queries in Spatial Databases,” Yunjun Gao and Baihua Zheng.</li>
<li>“A Revised R*-tree in Comparison with Related Index Structures,” Norbert Beckmann and Bernhard Seeger.</li>
</ul>
<p><strong><span style="color:#400080;">While I was in my M.S. program, I studied many spatial query processing issues, so I try to keep up with recent spatial database research.</span></strong></p>
<p>They all seem very interesting. Later, I will post reviews of the papers above.</p>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/08/08/some-interesting-papers-of-acm-sigmod-conference-2009/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>HadoopDB: An Open Source Parallel Database for Analytical Workloads</title>
		<link>http://diveintodata.org/2009/07/31/hadoopdb-releases/</link>
		<comments>http://diveintodata.org/2009/07/31/hadoopdb-releases/#comments</comments>
		<pubDate>Thu, 30 Jul 2009 15:01:15 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[hadoopdb]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[vldb]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=155</guid>
		<description><![CDATA[With the volume of data growing rapidly, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with big data. Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB and the paper published [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">W</span>ith the volume of data growing rapidly, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with big data.</p>
<p>Recently, <a href="http://cs-www.cs.yale.edu/homes/dna/" target="_blank">Prof. Daniel Abadi</a>, an Assistant Professor of Computer Science at Yale University, announced <a href="http://dbmsmusings.blogspot.com/2009/07/announcing-release-of-hadoopdb-longer.html" target="_blank">the release of HadoopDB and the paper</a> published in <a href="http://vldb2009.org/" target="_blank">VLDB’09</a>. HadoopDB is an open source analytical database being developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel databases that takes the best features of both.</p>
<p><img style="display:inline;margin-left:0;margin-right:0;" title="Hadoop Logo" src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="Hadoop Logo" width="198" height="47" align="right" />MapReduce has been controversial from a database point of view, and there has been some debate about it. Most notably, <a href="http://pages.cs.wisc.edu/~dewitt/" target="_blank">Prof. David DeWitt</a>, a well-known authority on (parallel) databases, argued that <a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html" target="_blank">MapReduce is a major step backwards</a>. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility in handling unstructured data.</p>
<p>The paper concludes that HadoopDB comes close to parallel databases in performance while scoring similarly to Hadoop on fault tolerance and on the ability to run in heterogeneous environments.</p>
<p>In sum, HadoopDB is a hybrid of MapReduce and a parallel DBMS, which is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it lets their work contribute more broadly to Hadoop and to analytical databases. I have not yet read the paper completely; I will discuss HadoopDB in detail soon.</p>
<h3>Some interesting points:</h3>
<ul>
<li>They ran their experiments on a 100-node Amazon EC2 cluster.</li>
<li>They are trying to handle semantic web data (i.e., RDF) with HadoopDB.</li>
<li>HadoopDB is a fully open source project.</li>
<li>HadoopDB isn’t well suited for real-time data yet.</li>
<li>I will be able to attend his presentation in the session at VLDB.</li>
</ul>
<h3>See Also:</h3>
<ul>
<li><a href="http://news.idg.no/cw/art.cfm?id=9D2C109A-1A64-6A71-CE90BD44D98F12B1" target="_blank">Yale researchers create database-Hadoop hybrid</a>, Computer World</li>
<li><a href="http://radar.oreilly.com/2009/07/hadoopdb-an-open-source-parallel-database.html" target="_blank">HadoopDB: An Open Source Parallel Database</a>, <a href="http://radar.oreilly.com/" target="_blank">O’REILLY radar</a></li>
<li><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-a-major-step-back.html" target="_blank">MapReduce: A major step backwards</a></li>
<li><a href="http://databasecolumn.vertica.com/2008/01/mapreduce-continued.html" target="_blank">MapReduce: A major step backwards (II)</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/07/31/hadoopdb-releases/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://hadoop.apache.org/images/hadoop-logo.jpg" medium="image">
			<media:title type="html">Hadoop Logo</media:title>
		</media:content>
	</item>
		<item>
		<title>Paper: Graph Twiddling in a MapReduce World</title>
		<link>http://diveintodata.org/2009/07/17/paper-graph-twiddling-in-a-mapreduce-world/</link>
		<comments>http://diveintodata.org/2009/07/17/paper-graph-twiddling-in-a-mapreduce-world/#comments</comments>
		<pubDate>Fri, 17 Jul 2009 09:32:03 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graph cluster]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[scalable computing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=89</guid>
		<description><![CDATA[Today, at our lab seminar, I presented the paper “Graph Twiddling in a MapReduce World,” published in IEEE Computing in Science &#38; Engineering. The paper investigates the feasibility of decomposing graph operations into a series of MapReduce processes. In this post, I’m going to discuss it briefly. As I mentioned above, [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">T</span>oday, at our lab seminar, I presented the paper “<a href="http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5076317&amp;tag=1" target="_blank">Graph Twiddling in a MapReduce World</a>,” published in <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5992" target="_blank">IEEE Computing in Science &amp; Engineering</a>. The paper investigates the feasibility of decomposing graph operations into a series of MapReduce processes. In this post, I’m going to discuss it briefly.</p>
<p>As I mentioned above, this paper discusses the feasibility of decomposing graph operations into a series of MapReduce processes. As you know, <a href="http://labs.google.com/papers/mapreduce.html" target="_blank">MapReduce</a> has been gaining attention in various applications that cope with large-scale datasets. However, to the best of my knowledge, there have been no studies on processing graphs with MapReduce. This paper proposes the following operations:</p>
<ul>
<li>Augmenting Edges with Degrees</li>
<li>Simplifying the Graph</li>
<li>Enumerating Triangles</li>
<li>Enumerating Rectangles</li>
<li>Finding Trusses</li>
<li>Barycentric Clustering</li>
<li>Finding Components</li>
</ul>
<p>Some operations are performed in combination with others. Some of them would be very easy problems if the graph could be traversed. However, as the author says, traversing graphs with MapReduce is very inefficient (it requires many MapReduce iterations) because each map operation sees only one record at a time, in arbitrary order. All the operations proposed in the paper therefore avoid traversing the graph. Instead, the proposed graph algorithms share the following pattern:</p>
<ol>
<li>A map operation: read all the edges (or vertices), possibly changing some piece of edge (or vertex) information, and emit records keyed by vertex.</li>
<li>A reduce operation: for each record obtained from the previous map operation, determine the updated state of the vertex or edge, and emit this information as partially (locally) updated records.</li>
<li>A reduce operation: for each record from the previous reduce operation, combine the updates globally to produce the fully updated information.</li>
</ol>
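<p>The pattern above can be sketched with a small in-memory simulation of the first operation in the list, augmenting edges with degrees. This is my own simplified reading, not the paper&#8217;s code and not the Hadoop API: the shuffle phase is replaced by a dictionary, the graph is assumed to be simple and undirected, and all function names are hypothetical.</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def map_edges(edges: List[Edge]) -> List[Tuple[str, Edge]]:
    """Map: emit each edge twice, keyed once by each endpoint."""
    records = []
    for u, v in edges:
        records.append((u, (u, v)))
        records.append((v, (u, v)))
    return records


def group_by_key(records):
    """Stand-in for the shuffle phase: collect values that share a key."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def reduce_by_vertex(groups) -> List[Tuple[Edge, Tuple[str, int]]]:
    """First reduce: a vertex's degree is the number of edges it received."""
    records = []
    for vertex, incident in groups.items():
        degree = len(incident)
        for edge in incident:
            # Partial update: this edge now knows the degree of one endpoint.
            records.append((edge, (vertex, degree)))
    return records


def reduce_by_edge(groups) -> Dict[Edge, Dict[str, int]]:
    """Second reduce: merge the two partial records of each edge."""
    return {edge: dict(partials) for edge, partials in groups.items()}


def augment_with_degrees(edges: List[Edge]) -> Dict[Edge, Dict[str, int]]:
    partial = reduce_by_vertex(group_by_key(map_edges(edges)))
    return reduce_by_edge(group_by_key(partial))


# Hypothetical undirected graph: a triangle a-b-c plus a pendant vertex d.
graph = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
for edge, degrees in augment_with_degrees(graph).items():
    print(edge, degrees)
```

<p>Note that no step follows an edge from one vertex to the next. Every record is processed on its own, and the only global coordination is the grouping by key.</p>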
<h3>Discussion</h3>
<p>Even though the paper proposes several graph operations, they remain unnatural because they need too many MapReduce iterations, and to the best of my knowledge the startup cost of each MapReduce job is very high. The iterations are needed because a mapper can only read records sequentially. The proposed MapReduce-based graph operations therefore incur the overhead of both MR iterations and communication. As a result, the primitive graph operations that are feasible with MapReduce are very limited. There is further evidence that MapReduce is not well suited to graph operations, but I will present it later.</p>
<p>Therefore, I think a new programming model for graphs (or complex data) is needed. Ideally, the new model should support graph traversal. In addition, data should preserve locality with respect to connectivity, even though it is distributed across many data nodes. Based on these ideas, I am fleshing out “<a href="http://wiki.apache.org/hadoop/Hamburg" target="_blank">Hamburg: A New Programming Model for Graph Data</a>”, inspired by the blog post “<a href="http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html" target="_blank">Large-scale Graph Computing at Google</a>”.</p>
<h3>References</h3>
<ul>
<li>Jonathan Cohen, “<a href="http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=5076317" target="_blank">Graph Twiddling in a MapReduce World</a>”, Volume 11, Issue 4, pp 29–41, <a href="http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=5992" target="_blank">IEEE Computing in Science &amp; Engineering</a>, July-Aug, 2009.</li>
<li>Jeffrey Dean and Sanjay Ghemawat, “<a href="http://labs.google.com/papers/mapreduce.html" target="_blank">MapReduce: Simplified Data Processing on Large Clusters</a>”, OSDI&#8217;04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.</li>
<li><a href="http://www.youtube.com/watch?v=BT-piFBP4fE" target="_self">Google Cluster Computing and MapReduce Lecture 5</a></li>
<li><a href="http://www.johnandcailin.com/blog/cailin/breadth-first-graph-search-using-iterative-map-reduce-algorithm" target="_self">Breadth-first graph search using an iterative map-reduce algorithm</a></li>
<li><a href="http://wiki.apache.org/hadoop/Hamburg" target="_blank">Hamburg</a>, Hadoop Wiki</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/07/17/paper-graph-twiddling-in-a-mapreduce-world/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>What is the Common Tag?</title>
		<link>http://diveintodata.org/2009/07/16/what-is-the-common-tag/</link>
		<comments>http://diveintodata.org/2009/07/16/what-is-the-common-tag/#comments</comments>
		<pubDate>Thu, 16 Jul 2009 09:33:41 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[common tag]]></category>
		<category><![CDATA[rdf]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[tagging]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=94</guid>
		<description><![CDATA[Recently, a new keyword, Common Tag (http://www.commontag.org), has appeared in the Semantic Web community. Tags themselves are already a very familiar system. But since Common Tag has been mentioned a lot lately, I read some related articles to find out what it is and how it differs from existing tags, and summarized them briefly. The official site describes Common Tag as follows. Common Tag is an open tagging format developed to make content more [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, a new keyword, Common Tag (<a title="http://www.commontag.org" href="http://www.commontag.org">http://www.commontag.org</a>), has appeared in the Semantic Web community. Tags themselves are already a very familiar system. But since Common Tag has been mentioned a lot lately, I read some related articles to find out what it is and how it differs from existing tags, and summarized them briefly.</p>
<p>The official site describes Common Tag as follows:</p>
<blockquote><p>Common Tag is an open tagging format developed to make content more connected, discoverable and engaging. Unlike free-text tags, Common Tags are references to unique, well-defined concepts, complete with metadata and their own URLs</p></blockquote>
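<blockquote><p>In other words, Common Tag is an open tag format developed to improve how well content is connected, how easily it can be found, and how readily applications can make use of it. Unlike existing free-text tags, a Common Tag refers to a unique, well-defined concept, is complete with metadata, and has its own URL.</p></blockquote>
<p><img class="alignright size-full wp-image-168" style="border:1px solid black;" title="The Common Tag" src="http://diveintodata.files.wordpress.com/2009/07/commontag.png?w=590" alt="The Common Tag"   />Put simply, whereas existing tags are free text that users type in as they please, Common Tag assigns a URI to each well-defined concept in advance and uses that URI as the tag. Until now, much Web 2.0 content has been classified and searched by free-text tags. However, existing tags have not really done their job. Because of homonyms and synonyms, classification was inaccurate, and search results were accordingly of mediocre quality. Common Tag appears to have been proposed to remedy these shortcomings.</p>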
<h3>Why use Common Tag?</h3>
<p>So what gets better when we use Common Tag? For the following reasons, it makes life easier for content producers and consumers as well as for developers of related applications.</p>
<ol>
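<li><a href="http://en.wikipedia.org/wiki/Findability" target="_blank">Findability</a> improves. With Common Tags that have clear meanings and are used uniformly, you can find exactly the data you want.</li>
<li>Connectivity between pieces of information improves, because each Common Tag has a well-defined meaning and a unique key. That is, content items that carry the same Common Tag are linked through the URI of that tag. With existing tags, homonyms and synonyms often produced wrong links or no links at all.</li>
<li>A Common Tag is identified and referenced by a URI rather than a plain string, which makes it easier for programs to process. Programs also cannot really tell homonyms apart, but this problem does not arise with Common Tags because they are identified by URIs.</li>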
</ol>
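<p>The difference between free-text tags and URI-based tags can be illustrated with a small sketch. Everything in it is hypothetical: the documents are made up, and the <code>example.org</code> URIs are placeholders, not real Common Tag concept identifiers.</p>

```python
from collections import defaultdict

# Hypothetical documents and tags; the example.org URIs are placeholders,
# not real Common Tag concept identifiers.
free_text_tags = {
    "doc1": "jaguar",       # about the animal
    "doc2": "jaguar",       # about the car maker
    "doc3": "Jaguar Cars",  # also about the car maker
}
uri_tags = {
    "doc1": "http://example.org/concept/jaguar_animal",
    "doc2": "http://example.org/concept/jaguar_cars",
    "doc3": "http://example.org/concept/jaguar_cars",
}


def group_by_tag(tags):
    """Group document ids under the tag value they carry."""
    groups = defaultdict(set)
    for doc, tag in tags.items():
        groups[tag].add(doc)
    return dict(groups)


print(group_by_tag(free_text_tags))
print(group_by_tag(uri_tags))
```

<p>With free-text tags, the homonym &#8220;jaguar&#8221; wrongly links doc1 and doc2, while the synonym &#8220;Jaguar Cars&#8221; leaves doc3 unlinked. With one URI per concept, doc2 and doc3 end up together and doc1 stays separate.</p>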
<h3>How Can We Make Use of Common Tag?</h3>
<p>You can tag a web document (HTML document) itself with Common Tag, and you can also tag a particular section, a particular word, or a media file in the document. As an aside, HTML was originally designed mainly for presentation; it is thanks to <a href="http://www.w3.org/TR/xhtml-rdfa-primer" target="_blank">RDFa</a> that semantic data can be embedded in HTML like this. <a href="http://www.w3.org/TR/xhtml-rdfa-primer" target="_blank">RDFa</a> is a W3C Recommendation that makes it possible to embed RDF in HTML. I will cover RDFa and RDF in a later post.</p>
<p>As one example, Common Tag lets you tag the anchor text of an HTML link as follows. For other uses, see <a href="http://www.commontag.org/QuickStartGuide" target="_blank">Common Tagging’s QuickStartGuide</a>.</p>
<pre style="border:1px solid #cecece;overflow:auto;min-height:40px;width:650px;background-color:#fbfbfb;padding:5px;">
<pre style="font-size:12px;width:100%;font-family:consolas,'Courier New',courier,monospace;background-color:#fbfbfb;margin:0;"><span style="color:#0000ff;">&lt;</span><span style="color:#800000;">div</span> <span style="color:#ff0000;">xmlns</span>:<span style="color:#ff0000;">ctag</span>=<span style="color:#0000ff;">"http://commontag.org/ns#"</span> <span style="color:#ff0000;">rel</span>=<span style="color:#0000ff;">"ctag:tagged"</span><span style="color:#0000ff;">&gt;</span></pre>
<pre style="font-size:12px;width:100%;font-family:consolas,'Courier New',courier,monospace;background-color:#fbfbfb;margin:0;">   NASA's <span style="color:#0000ff;">&lt;</span><span style="color:#800000;">a</span> <span style="color:#ff0000;">typeof</span>=<span style="color:#0000ff;">"ctag:Tag"</span> <span style="color:#ff0000;">rel</span>=<span style="color:#0000ff;">"ctag:means"</span></pre>
<pre style="font-size:12px;width:100%;font-family:consolas,'Courier New',courier,monospace;background-color:#fbfbfb;margin:0;">               <span style="color:#ff0000;">href</span>=<span style="color:#0000ff;">"http://rdf.freebase.com/ns/en.phoenix_mars_mission"</span></pre>
<pre style="font-size:12px;width:100%;font-family:consolas,'Courier New',courier,monospace;background-color:#fbfbfb;margin:0;">               <span style="color:#ff0000;">property</span>=<span style="color:#0000ff;">"ctag:label"</span><span style="color:#0000ff;">&gt;</span>Phoenix Mars Lander<span style="color:#0000ff;">&lt;/</span><span style="color:#800000;">a</span><span style="color:#0000ff;">&gt;</span> has deployed its robotic arm.</pre>
<pre style="font-size:12px;width:100%;font-family:consolas,'Courier New',courier,monospace;background-color:#fbfbfb;margin:0;"><span style="color:#0000ff;">&lt;/</span><span style="color:#800000;">div</span><span style="color:#0000ff;">&gt;</span></pre></pre>
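<p>As shown above, it slots neatly into existing HTML. Looking only at the technical side, Common Tag seems very easy to apply to web documents. However, while Common Tag can be very useful for data that has already been tagged, the tagging itself still requires picking the Common Tag that carries the meaning you intend. In other words, it has to go through human hands, which I expect to be an inconvenience. This is unavoidable at the current level of technology, so it will probably have to be covered by features such as Common Tag auto-completion or search in web authoring tools.</p>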
<h3>Conclusion</h3>
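<p>In the end, if Common Tag spreads widely along with <a href="http://www.w3.org/TR/xhtml-rdfa-primer" target="_blank">RDFa</a>, and web authoring tools and web applications come to support them well, I expect semantic data to become much richer. Common Tag will also make it possible to accurately tag documents, or parts of documents, with people, things, geographic information, and abstract concepts, and I look forward to a flood of interesting applications built on that. If anything in this post is wrong, please point it out.</p>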
<h3>See Also:</h3>
<ul>
<li><a href="http://www.commontag.org">http://www.commontag.org</a></li>
<li><a href="http://www.w3.org/TR/xhtml-rdfa-primer" target="_blank">RDFa Primer</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/07/16/what-is-the-common-tag/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>

		<media:content url="http://diveintodata.files.wordpress.com/2009/07/commontag.png" medium="image">
			<media:title type="html">The Common Tag</media:title>
		</media:content>
	</item>
		<item>
		<title>How to Display Mathematical Symbols Online</title>
		<link>http://diveintodata.org/2009/07/09/how-to-display-mathematics-symbols-in-online/</link>
		<comments>http://diveintodata.org/2009/07/09/how-to-display-mathematics-symbols-in-online/#comments</comments>
		<pubDate>Thu, 09 Jul 2009 09:12:15 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[Blogging]]></category>
		<category><![CDATA[Formula]]></category>
		<category><![CDATA[Mathematics]]></category>

		<guid isPermaLink="false">http://diveintodata.org/?p=65</guid>
		<description><![CDATA[Sometimes I need to write mathematical symbols or formulas online. With LaTeX or a word processor we can write them easily, but it is difficult to do so online. However, I found some convenient ways to do it. This site (http://sixthform.info/steve/wordpress/?p=59) introduces many ways to easily write mathematical symbols [...]]]></description>
			<content:encoded><![CDATA[<p><span class="dropcaps">S</span>ometimes I need to write mathematical symbols or formulas online. With LaTeX or a word processor we can write them easily, but it is difficult to do so online. However, I found some convenient ways to do it. This site (<a href="http://sixthform.info/steve/wordpress/?p=59" target="_blank">http://sixthform.info/steve/wordpress/?p=59</a>) introduces many ways to easily write mathematical symbols or formulas online. Among them, I prefer the following tools because they immediately provide an image URL for the math symbols generated from what you type in.</p>
<ul>
<li><a href="http://thornahawk.unitedti.org/equationeditor/equationeditor.php" target="_self">Online LaTeX Equation Editor</a> (the best option, in my opinion)</li>
<li><a href="http://www.codecogs.com/components/equationeditor/equationeditor.php" target="_blank">LaTeX Equation Editor</a></li>
<li><a href="http://mathurl.com/" target="_blank">mathurl</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/07/09/how-to-display-mathematics-symbols-in-online/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
		<item>
		<title>Recommended Blogs for Computer Scientists (1)</title>
		<link>http://diveintodata.org/2009/06/24/computer-scientist%eb%93%a4%ec%9d%84-%ec%9c%84%ed%95%9c-%ec%b6%94%ec%b2%9c-%eb%b8%94%eb%a1%9c%ea%b7%b8-1/</link>
		<comments>http://diveintodata.org/2009/06/24/computer-scientist%eb%93%a4%ec%9d%84-%ec%9c%84%ed%95%9c-%ec%b6%94%ec%b2%9c-%eb%b8%94%eb%a1%9c%ea%b7%b8-1/#comments</comments>
		<pubDate>Tue, 23 Jun 2009 15:11:35 +0000</pubDate>
		<dc:creator>Hyunsik Choi</dc:creator>
				<category><![CDATA[Research]]></category>
		<category><![CDATA[computer science]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[distributed systems]]></category>
		<category><![CDATA[P=NP]]></category>
		<category><![CDATA[scalable computing]]></category>

		<guid isPermaLink="false">http://diveintodata.org/2009/06/computer-scientist%eb%93%a4%ec%9d%84-%ec%9c%84%ed%95%9c-%ec%b6%94%ec%b2%9c-%eb%b8%94%eb%a1%9c%ea%b7%b8-1/</guid>
		<description><![CDATA[Today I would like to introduce a few well-known blogs that cover problems and the latest issues in computer science. They are so famous that many of you probably know them already, but I am introducing them here in case some of you do not. The Database Column &#8211; As the name says, it covers database issues. Recently it has also touched on cloud computing. What makes this blog truly great is that leading figures such as Michael Stonebraker, Daniel Abadi, David DeWitt, [...]]]></description>
			<content:encoded><![CDATA[<p><span style="font-size:9pt;">Today I would like to introduce a few well-known blogs that cover problems and the latest issues in computer science. They are so famous that many of you probably know them already, but I am introducing them here in case some of you do not. </span><span style="font-weight:bold;"><br />
</span> <span style="font-size:10pt;font-weight:normal;"><span style="font-size:10pt;"><a title="Go to [http://www.databasecolumn.com/]." href="http://www.databasecolumn.com/" target="_blank"><span style="font-size:10pt;"></span></a></span></span></p>
<ul>
<li><span style="font-size:10pt;font-weight:normal;"><span style="font-size:10pt;"><a title="Go to [http://www.databasecolumn.com/]." href="http://www.databasecolumn.com/" target="_blank"><span style="font-size:10pt;"><span style="font-size:9pt;">The Database Column</span></span></a><span style="font-size:10pt;"><span style="font-size:9pt;"> &#8211; As the name says, it covers database issues. Recently it has also touched on cloud computing. What makes this blog truly great is that leading figures such as </span></span></span></span><span class="entry-author-name" style="font-weight:normal;"><span style="font-size:9pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;">Michael Stonebraker, </span></span></span></span></span><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;">Daniel Abadi, </span></span></span></span></span><span style="font-size:9pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;">David DeWitt, Stan Zdonik, and </span></span></span></span></span></span><span style="font-weight:normal;font-family:Helvetica;"><span style="font-size:9pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-family:Gulim;"><span style="font-size:9pt;">Samuel Madden</span></span></span></span></span></span></span><span class="entry-author-name" style="font-weight:normal;"><span style="font-size:9pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;"> write for it. If you want to know which topics the database research community has been interested in lately, skimming the post titles is enough.</span></span></span></span></span></span></li>
<li><span class="entry-author-name" style="font-weight:normal;"><span style="font-size:9pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:10pt;"></span></span></span></span></span><a href="http://rjlipton.wordpress.com/"></a><span style="font-size:10pt;font-weight:normal;"><a title="Go to [http://rjlipton.wordpress.com/]." href="http://rjlipton.wordpress.com/" target="_blank"><span style="font-size:10pt;"><span style="font-size:9pt;">Gödel’s Lost Letter and P=NP</span></span></a><span style="font-size:10pt;"><span style="font-size:9pt;"> &#8211; </span></span></span><span style="font-size:9pt;"><span style="font-size:10pt;font-weight:normal;"><span style="font-size:10pt;"><span style="font-size:9pt;">Judging by the title alone, it seems to deal mainly with NP problems, but it actually covers a wide range of problems and algorithms (in fact, I only discovered it today). It looks very informative, but also difficult (@_@)</span></span></span><span style="font-size:10pt;"><span style="font-size:10pt;font-weight:normal;"><span style="font-size:9pt;">. </span></span></span></span><a title="Go to [http://www.allthingsdistributed.com/]." href="http://www.allthingsdistributed.com/" target="_blank"><span style="font-size:10pt;"></span></a></li>
<li><a title="Go to [http://www.allthingsdistributed.com/]." href="http://www.allthingsdistributed.com/" target="_blank"><span style="font-size:10pt;"><span style="font-size:9pt;">All Things Distributed</span></span></a><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;"> &#8211; The blog of Amazon CTO </span></span></span><span style="font-size:10pt;"><span style="font-size:10pt;"><span style="font-size:9pt;">Werner Vogels. It covers issues in scalable and distributed computing.</span></span></span></li>
</ul>
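<p><span style="font-size:9pt;"> My original plan was to introduce five blogs at a time, ten in total over two posts, but since I have not had much to post about lately&#8230;&#8230; I will continue with the rest next time. </span></p>
<p><span style="font-size:9pt;"><br />
P.S. There are many posts on those blogs that I want to read, but they are updated so often&#8230; it is really hard to keep up ~(~_~)~</span><br />
<strong></strong></p>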
]]></content:encoded>
			<wfw:commentRss>http://diveintodata.org/2009/06/24/computer-scientist%eb%93%a4%ec%9d%84-%ec%9c%84%ed%95%9c-%ec%b6%94%ec%b2%9c-%eb%b8%94%eb%a1%9c%ea%b7%b8-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/4213567e11cad51fc02bc2038e9ace27?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">Hyunsik Choi</media:title>
		</media:content>
	</item>
	</channel>
</rss>
