데이터베이스 시스템의 주제별 기초 논문들

Posted: May 25, 2015 | Author: Hyunsik Choi | Filed under: Research | Tags: database, paper | Leave a comment

데이터베이스 시스템 이라는 큰 주제 아래 각 세부 주제에 대한 기초 논문 목록 들이다. 한참 학교에서 공부하던 시절에 정리하고 틈틈히 업데이트 했던 것 같다. 추후에 data processing이나 column store에 대한 논문들도 공유하도록 하겠다.

데이터베이스 분야는 일반적으로 순수한 알고리즘이나 자료구조 부터 다양한 응용 문제나 이론까지 아주 광범히 하다. 말 자체는 ‘데이터베이스’라서 약간은 고리타분해 보이기도 하지만 데이터에 대해 일반화 가능한 모든 연구라고 봐도 무방할 만큼 해당 학계에서 다양한 연구를 다룬다. 최근 큰 인기를 얻고 있는 하둡이나 분산 데이터처리 역시 데이터베이스 분야에서 활발히 다루어지고 있다. 마이닝의 많은 연구들 또한 이 분야에서 다루어진다. 제목에서 언급한 ‘데이터베이스 시스템’이라고 하면 일반적으로 시스템 구현기술과 이론에 해당되는 내용을 말한다.

개인적으로 해당 분야나 문제를 접할 때 그 문제에 대한 가장 초기 논문들은 꼭 읽어보려고 노력한다. 그 이유는 그 논문들이 그 문제에 대해 가장 깊은 통찰력과 고민들을 많이 담고 있기 때문이며 후대의 논문들 일수록 초기 논문들이 한 고민이나 통찰은 기본적인 전제로 사용되고 문제 풀이 아이디어 위주로 기술되기 때문이다. 그래서 아래 리스트는 각 분야의 초기 논문들 및 전체를 정리하는 논문들 위주로 리스팅이 되어 있다.

주제별로 중요한 논문을 다 담긴 것은 아니다. 주로 내가 관심 있었던 것들 위주이다. 또한 See Also에는 대학들의 좋은 커리큘럼이나 읽어볼만한 주제별 논문에 대해 정리한 리스트의 링크를 담고 있다. 그리고 그 동안은 개인적인 위키에 업데이트를 했었는데 앞으로는 이곳에서 업데이트를 하도록 하겠다.

History

M. Stonebraker and J. M. Hellerstein, “What Goes Around, Comes Around,” in Readings in Database Systems, 2005, pp. 2-41.
J. M. Hellerstein and M. Stonebraker, “Anatomy of a Database System,” in Readings in Database Systems, 2005, pp. 42-95.
Thomas Haigh, Fifty Years of Databases, ACM SIGMOD Blog, 2012.

Architecture

M. M. ASTRAHAN et al., System R: relational approach to database management, ACM TODS, 1976.
Donald D. Chamberlin et al., A History and Evaluation of System R, Communications of the ACM, 1981.
Michael Stonebraker et al., The Design and Implementation of Ingres, ACM TODS, 1976.
Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton, Architecture of a Database System, Foundations and Trends in Databases, 2007.

Query Processing

Leonard D. et al., Join processing in database systems with large main memories, ACM TODS, 1986.
D. J. DeWitt and Jim Gray, Parallel Database Systems: The Future of High Performance Database Processing, CACM 1992.
Chris Nyberg et al., AlphaSort: a cache-sensitive parallel external sort, VLDB Journal, 1995.
Goetz Graefe, Encapsulation of Parallelism in the Volcano Query Processing System, ACM SIGMOD, 1990.
Lothar F. Mackert et al., R* Optimizer Validation and Performance Evaluation for Distributed Queries, VLDB Conf, 1986.

Access Method

P. Griffiths Selinger et al., Access Path Selection in a Relational Database Management System, ACM SIGMOD, 1979.
Jim Gray et al., The Five-Minute Rule Ten Years Later, and Other Computer Storage Rules of Thumb, ACM SIGMOD Record, 1997.

Transaction Management

Logging

C. Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-ahead Logging, ACM TODS, 1987.
C. Mohan, Repeating History Beyond ARIES, VLDB Conf, 1999.
Russel Sears et al,, Segment-Based Recovery: Write-ahead logging revisited, PVLDB, 2009.
Philip L. Lehman et al., Efficient Locking for Concurrent Operations on B-Trees, ACM TODS, 1981.

Concurrency Control

Jim Gray et al., Granularity of Locks and Degrees of Consistency in a Shared Data Base, Readings in database systems, 1976.
Concurrency Control in Database Systems, ACM Computing Survey, 1981.
H. T. Kung, On optimistic methods for concurrency control, ACM TODS, 1981.
Rakesh Agrawal et al., Concurrency Control Performance Modeling: Alternatives and Implications, ACM TODS, 1987.
C. Mohan et al., Transaction Management in the R* Distributed Database Management System, ACM TODS, 1986.
- Two Phase Commit

Data Warehouse

Surajit Chaudhuri and Umeshwar Dayal., An Overview of Data Warehousing and OLAP Technology, ACM SIGMOD Record, 1997.
Patrick O’Neil and Dallan Quass, Improved Query Performance with Variant Indexes, ACM SIGMOD, 1997.
Jim Gray et al., Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-Tab, and Sub Totals, Data Mining and Knowledge Discovery, 1997.
Yihong Zhao et al., An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, ACM SIGMOD, 1997.
C. Mohan and Inderpal Narang, Algorithms for creating indexes for very large tables without quiescing updates, ACM SIGMOD, 1992.

Amazon EC2에서 whirr을 이용한 Hadoop 클러스터 구동 방법

Posted: March 19, 2011 | Author: Hyunsik Choi | Filed under: Cloud Computing, FOSS, Research | Tags: amazon ec2, cloud computing, configuration, hadoop, howto, whirr | 3 Comments

최근 연구내용 검증을 위해 Amazon EC2에서 Hadoop 클러스터를 구축하여 실험을 수행 하는 중입니다. 그런데 Hadoop 클러스터를 EC2에 구축하는데 있어 Amazon EC2 환경에 대한 이해 부족과 자료의 부족으로 직접 부딪혀서 해결해야 하는 부분들이 꽤 있었습니다. 저는 이 포스팅을 통해 제가 시도했던 방법을 소개하고 제 경험을 공유하고자 합니다.

우선 이 글을 읽는 분들은 Amazon EC2 계정이 있고 AMI, Instance, EC2 Key Pair에 대해 알고 계시다고 전제하겠습니다.

Amazon EC2에서 Hadoop 클러스터 구동

현 시점에서 Amazon EC2 환경에 Hadoop 클러스터를 구동 방법은 선택의 폭이 그리 넓지 않습니다.

Hadoop이 이미 설치된 이미지를 사용하고 수동 설정하는 방법
EBS 기반 AMI에 하둡 설치 및 복사 그리고 수동 설정
whirr을 사용하는 방법
whirr의 hadoop-ec2를 사용하는 방법

이 포스팅에서는 3번인 whirr을 이용한 구축방법을 설명합니다. 그런데 이 방법은 정말 간단하지만 한 가지 제약을 가지고 있습니다. 이 방법은 기본적으로 instance store 기반의 AMI만 활용 할 수 있습니다. 따라서 Hadoop 클러스터의 HDFS는 instance store에 기반을 두게 되며 클러스터 종료 시 HDFS의 모든 데이터가 제거됩니다 (Amazon EC2의 모든 인스턴스는 영속적인 데이터 저장을 위해 EBS나 S3와 같은 별도의 저장 서비스를 사용해야 합니다.)

저의 경우 처음에 1,2번 방법을 모두 시도했었습니다. 그러나 다음과 같은 문제점이 있었습니다.

최신 Hadoop 배포본(0.20 이상)이 설치된 AMI의 부재
수십 여개의 인스턴스의 시동(launch)의 자동화
매번 새로 할당 받는 IP 주소와 이에 따른 Hadoop 설정과 설정 배포의 어려움

조사해보니 인스턴스를 시동을 자동화하고 시동된 인스턴스의 IP 목록을 얻어 설정 배포까지 원활히 하기 위해서는 Amazon AWS API를 이용하거나 boto (Python interface to Amazon Web Services), jcloud (multi-cloud library) 와 같은 third-party 라이브러리를 이용해 개발을 해야합니다. 그러나 이는 많은 시간을 요구합니다. EBS 기반 AMI에 Hadoop을 직접 설치하는 사용하는 방법 역시 비슷한 이유로 포기했습니다.

위에 4번 방법인 hadoop-ec2는 원래 Hadoop의 contrib 에 속했던 프로그램으로 현재는 whirr에서 진행되고 있지만 지속적으로 유지보수가 되지 않는 것으로 보여 시도하지 않았습니다. whirr의 Change Log를 봐도 4번에 대한 내용은 찾기 어려웠습니다.

현재로써는 whirr이 가장 편리한 방법이라 여겨 집니다.

whirr?

whirr는 Apache Incubator에 속한 프로젝트로 Amazon EC2와 같은 상용 클라우드 환경에서 원하는 서비스에 대한 설치, 설정, 실행을 자동으로 수행하는 라이브러리입니다. 현재 제공하는 서비스로는 Apache Hadoop, Cassandra, Cloudera’s Distribution for Hadoop (CDH), Zookeeper가 있으며 조만간 릴리즈 될 0.4-incubating 버전에 Hbase가 추가될 예정이라고 합니다.

동작의 개요

whirr의 사용방법을 설명하기에 앞서 전체적인 동작에 대해 개략적으로 설명을 드리겠습니다.

사용자가 ‘cluster-lunch’ 커맨드를 주면 whirr은 instance store 기반의 AMI를 이용해 다수의 인스턴스를 가동하고 모든 인스턴스들에 JDK 및 Hadoop의 설치와 설정을 일괄적으로 수행합니다. 이 과정이 끝나면 EC2 내부에서 Hadoop 클러스터가 동작하고 있게 됩니다.

그리고 사용자가 로컬 머쉰에 설치한 Hadoop 프로그램 통해 EC2에서 구동되는 Hadoop 클러스터를 제어하게 됩니다. 그런데 EC2 내부의 인스턴스들은 기본적으로 private IP만을 할당 받아 외부에서 접근할 수가 없고 기본적으로 방화벽 설정이 까다롭게 되어 있기 때문에 추가적인 설정 없이 Hadoop RPC나 웹 UI를 통한 접근이 불가능 합니다. 따라서 whirr이 제공하는 proxy 프로그램을 실행하고 난 뒤에 로컬 머쉰에 설치된 Hadoop 프로그램을 이용하여 EC2 내부의 클러스터를 제어하게 됩니다.

Hadoop 클러스터 구동

계정 생성

whirr을 통해 생성하는 Hadoop 클러스터는 linux에서 hadoop이라는 username의해 시동됩니다. Hadoop은 클러스터를 구동한 계정과 같은 계정으로 접근할 때 superuser 권한을 가집니다. 따라서 로컬에 hadoop이라는 계정을 생성하여 아래 작업을 수행해야 Amazon EC2 내부에서 동작하는 Hadoop 클러스터에 대한 superuser권한을 행사할 수 있습니다. 하지만 단순히 MapReduce 프로그램만 실행 시킨다면 아무 계정에서 작업해도 문제 없습니다.

로컬 머쉰에 Hadoop과 whirr의 설치 그리고 Hadoop Version 문제

Hadoop과 whirr은 http://archive.apache.org/dist/ 에서 다운 받아 로컬 머쉰에 설치합니다. 그런데 현 시점에서는 ‘어떤 Hadoop 버전을 설치해야 하는가’가 문제가됩니다. Hadoop은 현재 한참 빠르게 개발되고 있으며 다른 버전간 내부 프로토콜이 호환되지 않습니다.

따라서 whirr이 자동으로 설치해주는 Hadoop 클러스터와 로컬 머쉰의 Hadoop은 같은 버전으로 맞춰야 합니다. 버전을 바꾸는 것은 whirr FAQ에 아래와 같이 설명되어 있는 것 처럼 직접 install, configuration 스크립트를 수정해야 합니다.

How do I specify the service version and other service properties?

Currently the only way to do this is to modify the scripts to install a particular version of the service, or to change the service properties from the defaults.

See “How to modify the instance installation and configuration scripts” above for details on how to do this.

from http://incubator.apache.org/whirr/faq.html

whirr은 Apache Hadoop 배포본외에도 Cloudera의 Hadoop 배포본을 설치할 수 있습니다. 이는 아래 ‘whirr 설정 파일’에서 whirr.hadoop-install-runurl과 whirr.hadoop-configure-runurl에 대한 내용을 참고하시면 됩니다.

whirr 설정 파일

whirr을 이용한 클러스터의 구동은 클러스터에 대한 설정 파일을 만드는 것으로 시작합니다. 이 포스팅에서는 아래 cluster.properties 파일의 내용을 설명하고 이후 내용도 이 설정을 기준으로 설명하도록 하겠습니다.

(아래 내용은 최신인 0.3-incubating 버전에 대한 내용입니다. 0.4-incubating 버전이 릴리즈 되면 설정 방법이 변경될 예정입니다. 릴리즈 되고 나면 포스팅을 업데이트 하도록 하겠습니다.)

whirr.cluster-name=mycluster
whirr.instance-templates=1 jt+nn,16 dn+tt
whirr.provider=ec2
whirr.identity=ACCESS_KEY
whirr.credential=SECRET_KEY
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
whirr.location-id=us-east-1d
whirr.hardware-id=m1.small
whirr.service-name=hadoop
#whirr.hadoop-install-runurl=cloudera/cdh/install
#whirr.hadoop-configure-runurl=cloudera/cdh/post-configure

각 항목에 대한 설명은 다음과 같습니다.

whirr.cluster-name : 구동할 클러스터를 식별하는 이름입니다. 클러스터를 구동하면 ${HOME}/.whirr/<cluster-name> 가 디렉토리가 생성되며 이 디렉토리에는 Hadoop 클러스터에 접근하는데 필요한 파일들이 저장됩니다.
whirr.instance-templates: 구동할 클러스터의 구성을 설정합니다. jt는 jobtracker, nn은 name node, dn은 data node, tt는 task tracker를 의미합니다. 이 설정을 통해 유연한 설정이 가능합니다. data node와 task tracker를 더 늘리고 싶을 때는 dn+tt 앞에 쓰여진 숫자를 변경해 주시면 됩니다.
whirr.provider: 클러스터 서비스 제공자를 설정합니다. 현재는 Amazon EC2와 Rackspace Cloud Servers 두 가지를 지원합니다.
whirr.identity: AWS의 access key를 입력하시면 됩니다.
whirr.credential: AWS의 secret key를 입력 합니다.
whirr.private-key-file: 이 설정과 아래 설정은 ec2 인스턴스를 생성할 때 사용할 key로 사용됩니다. 위 예제처럼 하기 위해서는 아래와 같이 ssh키를 생성해야 한다. 또는 기존에 다른 인스턴스를 위해 만들어 놓은 EC2 Key pair의 경로를 설정해도 됩니다.

$ ssh-keygen -t rsa -P ''

whirr.location-id: 원하는 availability zone을 설정한다. 설정하지 않으면 whirr을 실행하는 인스턴스와 같은 zone이 설정된다.
whirr.hardware-id: 원하는 인스턴스 유형을 설정한다. Amazon EC2가 제공하는 인스턴스 유형은 Amzon EC2 Instance Types 페이지에서 확인할 수 있으며 각 유형에 써 있는 API name을 이 설정에 적용하면 됩니다.
whirr.service-name: 구축할 서비스를 설정합니다. 이 글은 Hadoop을 위한 것이므로 hadoop으로 남겨 둡니다.
whirr.hadoop-install-runurl, whirr.hadoop-configure-runurl: Hadoop의 경우 apache 버전과 CDH 버전이 있습니다. 위 예제에서 주석(#)을 제거해 주면 CDH버전을 구동하게 됩니다.

설정에 대한 추가적인 설명은 Whirr Configuration Guide 문서를 참고하시면 됩니다.

Hadoop 클러스터 시동

클러스터의 시동은 다음과 같은 커맨드로 수행합니다. 클러스터를 시동하면 내부적으로 Amazon의 EC2 API를 통해 인스턴스를 생성해 필수 패키지(JDK)등을 설치하고 Hadoop 배포 버전을 다운로드 받아 설정을 하는 과정이 수행됩니다. 따라서 클러스터가 구동되는데 짧게는 수 분에서 길게는 10분 정도 소요됩니다. 클러스터 구동이 완료되면 다시 쉘 프롬프트가 뜨게 됩니다.

$ whirr/bin/whirr launch-cluster --config cluster.properties
Bootstrapping cluster
Configuring template
Starting 16 node(s) with roles [tt, dn]
Configuring template
Starting 1 node(s) with roles [jt, nn]
Nodes started: [[id=us-east-1/i-a45eb7cb, providerId=i-a45eb7cb, ...]]
.....
Nodes started: [[id=us-east-1/i-7a51b815, providerId=i-7a51b815, ...]]
Authorizing firewall
Running configuration script
Configuration script run completed
Running configuration script
Configuration script run completed
Completed configuration of mycluster
Web UI available at http://ec2-72-44-43-29.compute-1.amazonaws.com
Wrote Hadoop site file /home/-----/.whirr/mycluster/hadoop-site.xml
Wrote Hadoop proxy script /home/-----/.whirr/mycluster/hadoop-proxy.sh
Wrote instances file /home/-----/.whirr/mycluster/instances
Started cluster of 17 instances
Cluster{instances=[Instance{roles=[tt, dn], ...}]}
$

프록시 열기

whirr을 통해 구동한 Hadoop 클러스터에 접근하기 위해서는 로컬에 설치했던 Hadoop을 설정을 해야 합니다. 설정은 간단히 whirr이 클러스터 시동 후 생성해 주는 파일을 사용하면 됩니다. ‘whirr launch-cluster’ 가 완료되고 나면 ${HOME}/.whirr/<cluster-name>/hadoop-site.xml 파일이 생성됩니다. 이 파일을 로컬에 설치한 Hadoop의 ${HADOOP_HOME}/conf에 간단하게 복사하거나 다음과 같이 환경변수를 설정하여 Hadoop의 설정 디렉토리를 override 하면 됩니다.

export HADOOP_CONF_DIR=~/.whirr/<cluster-name>

하지만 EC2 내부의 클러스터들은 private IP만을 가지기 때문에 바로 Hadoop 클러스터에 접근할 수는 없습니다. ${HOME}/.whirr/<cluster-name>/hadoop-proxy.sh를 실행 해야 비로소 EC2에 구동된 Hadoop 클러스터에 접근할 수 있습니다.

$ chmod +x ~/.whirr/mycluster/hadoop-proxy
$ ~/.whirr/mycluster/hadoop-proxy

Running proxy to Hadoop cluster at ec2-72-44-43-29.compute-1.
amazonaws.com. Use Ctrl-c to quit.

$ bin/hadoop dfs -ls
...

hadoop-proxy.sh를 실행하면 EC2에서 동작하는 Hadoop 클러스터 웹 UI도 접근할 수 있습니다. 그러나 이를 위해서는 간단한 웹 브라우져 설정이 요구됩니다.

크롬은 Preferences -> Under the Hood 탭 -> Network -> Change Proxy Settings에서 설정하면 됩니다. Socks 설정의 주소는 localhost, 포트는 6666으로 해주시면 됩니다.
파폭은 기본적으로 SOCKS를 사용하더라도 로컬 머신에 설정된 DNS를 사용하게 되어 있는데 먼저 이 설정을 변경해 주셔야 합니다. 이를 위해서는 주소창에 about:config를 입력해 세부 설정으로 들어가 network.proxy.socks_remote_dns 설정을 true로 변경해주셔야 합니다. 그리고 Preferences -> Advanced -> Network 탭 -> Connection 섹션의 Settings에서 SOCKS Host를 주소는 localhost, 포트는 6666으로 설정해주시면 됩니다.

위와 같이 수정하고 hadoop-proxy.sh를 실행 했을 때 출력되는 URL에 접속하시면 됩니다.

클러스터 종료

Hadoop을 이용한 모든 작업이 끝나면 종료는 다음 커맨드를 통해 수행합니다. 위에서 언급한 바와 같이 whirr은 아직까지 instance store 기반 클러스터 구축 밖에 지원하지 못하므로 HDFS의 모든 데이터가 제거되는 사실을 염두하셔야 합니다.

$ whirr/bin/whirr destroy-cluster --config=cluster.properties

결론

whirr을 통한 Hadoop 클러스터 구동 방법을 설명했습니다. whirr은 아직 사소한 버그와 설정의 한계가 있지만 직접 클러스터 를 구축해야 하는 사용자의 노력을 상당히 줄여줍니다. 필요에 따라 MapReduce 프로그램을 대규모 클러스터에 동작시켜야 하는 사용자들에게는 특히 유용하다고 생각합니다.

참고문서

VoltDB and its related links

Posted: June 1, 2010 | Author: Hyunsik Choi | Filed under: FOSS, Research | Tags: ACID, cloud computing, databases, OLTP, shared-nothing architecture, sql, VoltDB | 6 Comments

There has been lots of buzz about VoltDB (academic name is H-Store [5]) since a week ago. VoltDB is lead by M. Stonebraker, and it is an open source OLTP DBMS. There are some interesting points:

Running on shared-nothing clusters of commodity hardware
In-memory database
SQL support
ACID
Linear Scalability
Released as an Open Source software

Actually, there have already been some OLTP databases running on shared-nothing clusters. However, they cannot take advantage from the scalability of shared-nothing architecture due to their implementation’s natures, such as complex distributed locking and commit protocols [1]. In addition, according to [3], traditional RDBMSs have four overhead components, which are logging, locking, latching, and buffer management. However, M. Stonebraker claims that VoltDB eliminated these legacy overheads.

Among many features, especially I have interest in its linear scalability with ACID and performance. It is meaningful in that today’s web applications have another alternative to NoSQL data stores. Although VoltDB is under heavy development, the above features and the next benchmark result show its promising.

Key-Value Benchmark (VoltDB versus Cassandra)

Cassandra is a remarkable key-value store and an open source project developed by apache committers. Now, it is well known as the most performant one in existing NoSQL stores. According to this benchmark result, however, in all cases VoltDB dominates Cassandra although the fairness of experiments is controversial.

VoltDB Roadmap

It’s future plan is also expected. I wonder how much attention VoltDB will be getting from communities and industrials.

Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database

데이터베이스 시스템의 주제별 기초 논문들

History

Architecture

Query Processing

Access Method

Transaction Management

Logging

Concurrency Control

Data Warehouse

See Also

Categories

Archives

Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database

데이터베이스 시스템의 주제별 기초 논문들

History

Architecture

Query Processing

Access Method

Transaction Management

Logging

Concurrency Control

Data Warehouse

See Also

Share this:

Amazon EC2에서 whirr을 이용한 Hadoop 클러스터 구동 방법

동작의 개요

Hadoop 클러스터 구동

계정 생성

로컬 머쉰에 Hadoop과 whirr의 설치 그리고 Hadoop Version 문제

How do I specify the service version and other service properties?

whirr 설정 파일

Hadoop 클러스터 시동

프록시 열기

클러스터 종료

결론

참고문서

Share this:

VoltDB and its related links

See Also:

Share this:

Categories

Archives