HadoopDB: An Open Source Parallel Database for Analytical Workloads

As the volume of data keeps growing, techniques for managing big data are needed in many areas. The open source community and many companies have been developing solutions to deal with big data.

Recently, Prof. Daniel Abadi, an Assistant Professor of Computer Science at Yale University, announced the release of HadoopDB and the accompanying paper published at VLDB’09. HadoopDB is an open source analytical database being developed by him and his students. The paper states that HadoopDB is a hybrid of MapReduce and parallel database technology that takes the best features of both.

MapReduce has actually been controversial from a database point of view, and there have been some debates. Notably, Prof. David DeWitt, who is well known as a leading figure in (parallel) databases, criticized MapReduce as a major step backwards. On the other hand, proponents of MapReduce argue that it outperforms parallel databases in scalability, fault tolerance, and flexibility in handling unstructured data.

The paper concludes that HadoopDB approaches the performance of parallel databases while scoring similarly to Hadoop on fault tolerance and on the ability to run in heterogeneous environments.

In sum, HadoopDB is a hybrid system of MapReduce and parallel DBMS. It is quite an interesting achievement. I respect their decision to release HadoopDB as open source, because it will let their work contribute more broadly to Hadoop and to analytical databases. I have not read the paper completely yet; I will discuss HadoopDB in detail soon.

Some interesting points:

  • They carried out experiments on a 100-node Amazon EC2 cluster.
  • They are trying to handle Semantic Web data (i.e., RDF) with HadoopDB.
  • HadoopDB is a fully open source project.
  • HadoopDB isn’t well suited for real-time data yet.
  • I can attend his presentation at the VLDB session.


My blog got a new domain name: diveintodata.org

Two weeks have passed since I moved this blog here. Three days ago I finally got the domain name “diveintodata.org”, and then I set my blog to be published under the new domain. Now, I’m happy :)

Anyway, from now on I’m going to write at least one article a week. The articles will discuss computer science issues, especially newly emerging database issues.


Paper: Graph Twiddling in a MapReduce World

Today, at the lab seminar, I presented the paper “Graph Twiddling in a MapReduce World”, published in IEEE Computing in Science & Engineering. The paper investigates the feasibility of decomposing graph operations into a series of MapReduce processes. In this post, I’m going to discuss it briefly.

As you know, MapReduce has been gaining attention in various applications that cope with large-scale datasets. However, to the best of my knowledge, there have been no studies on processing graphs with MapReduce. The paper proposes the following operations:

  • Augmenting Edges with Degrees
  • Simplifying the Graph
  • Enumerating Triangles
  • Enumerating Rectangles
  • Finding Trusses
  • Barycentric Clustering
  • Finding Components

Some operations are performed in combination with others. Some of them would be very easy problems if the graph could be traversed. However, as the author says, traversing graphs with MapReduce is very inefficient (i.e., it requires many MapReduce iterations), because each map operation reads only a single record, in no particular order. Therefore, all the operations proposed in the paper avoid traversing graphs. Instead, the proposed graph algorithms share the following common pattern:

  1. A map operation: read and process all the edges (or vertices), or change some piece of edge (or vertex) information. Then emit the resulting records keyed by vertex.
  2. A reduce operation: for each record obtained from the previous map operation, read and determine the updated state of the vertex or edge, and emit this information as partially (or locally) updated records.
  3. A reduce operation: for each record from the previous reduce operation, combine the updates globally to complete the updated information.
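
As a rough illustration of this map, reduce, reduce pattern, here is a minimal single-process Python sketch of my reading of the first operation, “Augmenting Edges with Degrees”. It is not the paper's code: the helper names (group_by_key, map_edges, reduce_vertex, reduce_edge) are hypothetical, the shuffle phase is simulated with an in-memory dictionary, and the graph is assumed to be simple (no duplicate edges or self-loops).

```python
from collections import defaultdict


def group_by_key(pairs):
    """Simulate the shuffle phase: collect all values emitted under the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_edges(edges):
    """Map: emit each edge twice, keyed by each of its two endpoints."""
    for u, v in edges:
        yield u, (u, v)
        yield v, (u, v)


def reduce_vertex(vertex, incident_edges):
    """First reduce: the degree of a vertex is the number of edges grouped under it.
    Emit a partial (local) update for every incident edge, keyed by the edge."""
    degree = len(incident_edges)
    for edge in incident_edges:
        yield edge, (vertex, degree)


def reduce_edge(edge, partial_updates):
    """Second reduce: combine the two partial updates into one complete record."""
    return edge, dict(partial_updates)


def augment_edges_with_degrees(edges):
    by_vertex = group_by_key(map_edges(edges))
    partial = []
    for vertex, incident in by_vertex.items():
        partial.extend(reduce_vertex(vertex, incident))
    by_edge = group_by_key(partial)
    return dict(reduce_edge(edge, updates) for edge, updates in by_edge.items())


# Hypothetical toy graph, assumed simple (no duplicate edges, no self-loops).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
result = augment_edges_with_degrees(edges)
for edge, degrees in result.items():
    print(edge, degrees)
```

Note that no step follows an edge from one vertex to another: every function sees only the records grouped under a single key, which is the restriction the pattern is designed around.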

Discussion

Even though the paper proposes several graph operations, they are still unnatural because they need too many MapReduce iterations, which follows from the fact that a mapper can only read records sequentially. To the best of my knowledge, the initialization cost of each MapReduce job is also very high. The proposed MapReduce-based graph operations therefore incur the overhead of both MR iterations and communication. As a result, the primitive graph operations that are feasible with MapReduce are very limited. There is further evidence that MapReduce is not suited to graph operations, but I will present it later.

Therefore, I think a new programming model for graphs (or complex data) is needed. Ideally, the new programming model should support graph traversal. In addition, data should be kept local with respect to their connectivity, even though they are distributed across a number of data nodes. Based on these ideas, I’m fleshing out “Hamburg: A New Programming Model for Graph Data”, inspired by the blog post “Large-scale Graph Computing at Google”.


What is the Common Tag?

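Recently a new keyword, Common Tag (http://www.commontag.org), has appeared in the Semantic Web community. Tags themselves are already a very familiar system. Since Common Tag has been mentioned a lot lately, I read some related articles and briefly summarized what Common Tag is and how it differs from existing tags.

The official site describes Common Tag as follows.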

Common Tag is an open tagging format developed to make content more connected, discoverable and engaging. Unlike free-text tags, Common Tags are references to unique, well-defined concepts, complete with metadata and their own URLs

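In other words, Common Tag is an open tag format developed to improve how well content is connected, how easily it can be found, and how readily applications can make use of it. Unlike existing free-text tags, a Common Tag refers to a unique, well-defined concept, comes complete with metadata, and has its own URL.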

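Simply put, whereas existing tags were free text that users typed in as they liked, Common Tag assigns URIs to well-defined concepts in advance and uses those URIs as tags. Until now, much Web 2.0 content has been classified and searched by means of free-text tags. However, existing tags did not really do their job. Because of homonyms and synonyms, classification was inaccurate, and search results were correspondingly mediocre. Common Tag appears to have been proposed to make up for these shortcomings.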

Why use Common Tag?

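So what improves when we use Common Tag? For the following reasons, things become more convenient for content producers and consumers, as well as for developers of related applications.

  1. Findability improves. With unified Common Tags that have a clear meaning, the desired data can be found accurately.
  2. Connectivity between pieces of information improves, because a Common Tag has a well-defined meaning and a unique key. That is, content items that share a Common Tag are linked through the URI corresponding to that tag. Existing tags were often linked incorrectly, or not linked at all, because of homonyms and synonyms.
  3. Because a Common Tag is identified and referenced by a URI rather than a plain string, it is easier for programs to process. Programs cannot really tell homonyms apart, but this problem does not arise with Common Tag because each tag is identified by a URI.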

How Can We Make use of Common Tag?

A whole web document (HTML document) can be tagged with Common Tags, and so can a specific section of a document, a specific word, or a media file. As an aside, HTML was originally designed mainly for presentation; it is thanks to RDFa that semantic data can be embedded in HTML like this. RDFa is a W3C Recommendation that makes it possible to embed RDF in HTML. I will cover RDFa and RDF in a later post.

As one example, the anchor text in HTML can be tagged with a Common Tag as follows. For other uses, see Common Tagging’s QuickStartGuide.

<div xmlns:ctag="http://commontag.org/ns#" rel="ctag:tagged">
   NASA's <a typeof="ctag:Tag" rel="ctag:means"
               href="http://rdf.freebase.com/ns/en.phoenix_mars_mission"
               property="ctag:label">Phoenix Mars Lander</a> has deployed its robotic arm.
</div>

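As shown above, it slips neatly into existing HTML. Looking only at the technical side, Common Tag seems very easy to apply to web documents. However, while Common Tag can be very useful for data that has already been tagged, I expect it to be an inconvenience that tagging ultimately requires human effort: someone has to pick out the Common Tag that carries the meaning they want to express. This is unavoidable at the current level of technology, and it will probably have to be covered by features such as Common Tag auto-completion or search in web authoring tools.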
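
To illustrate why URI-identified tags are easy for programs to process, here is a minimal Python sketch that pulls (label, URI) pairs out of the HTML fragment from the example. It uses only the standard library's html.parser. The class name CommonTagExtractor is hypothetical, and the sketch assumes the prefix is literally written as "ctag:" on an anchor element, as in the example; a real RDFa processor would resolve the namespace prefix properly and handle other element types.

```python
from html.parser import HTMLParser


class CommonTagExtractor(HTMLParser):
    """Collect (label, concept URI) pairs from anchors marked up as ctag:Tag.

    Simplification (an assumption of this sketch): the prefix is matched as
    the literal string "ctag:", and only <a> elements are inspected.
    """

    def __init__(self):
        super().__init__()
        self.tags = []            # list of (label, uri) tuples found so far
        self._current_uri = None  # URI of the tag anchor we are inside, if any
        self._label_parts = []    # text collected inside that anchor

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if (tag == "a"
                and attr.get("typeof") == "ctag:Tag"
                and attr.get("rel") == "ctag:means"):
            self._current_uri = attr.get("href")
            self._label_parts = []

    def handle_data(self, data):
        # Only text inside a tag anchor contributes to the label.
        if self._current_uri is not None:
            self._label_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_uri is not None:
            label = "".join(self._label_parts).strip()
            self.tags.append((label, self._current_uri))
            self._current_uri = None


html_fragment = """
<div xmlns:ctag="http://commontag.org/ns#" rel="ctag:tagged">
   NASA's <a typeof="ctag:Tag" rel="ctag:means"
               href="http://rdf.freebase.com/ns/en.phoenix_mars_mission"
               property="ctag:label">Phoenix Mars Lander</a> has deployed its robotic arm.
</div>
"""

extractor = CommonTagExtractor()
extractor.feed(html_fragment)
extractor.close()
tags = extractor.tags
print(tags)
```

Two documents that carry the same URI can then be linked by comparing the URIs directly, without guessing whether two label strings mean the same thing.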

Conclusion

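In conclusion, if Common Tag spreads widely along with RDFa, and web authoring tools and web applications come to support them well, I expect semantic data to become much richer. In addition, Common Tag will make it possible to accurately tag documents, or parts of documents, with people, things, geographic information, and abstract concepts, and I expect a flood of interesting applications built on this. If anything in this post is wrong, please point it out.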


How to Display Mathematical Symbols Online

Sometimes I need to write mathematical symbols or formulas online. With LaTeX or a word processor this is easy, but it is difficult to do online. However, I found some convenient ways to do it. This site (http://sixthform.info/steve/wordpress/?p=59) introduces many ways to easily write mathematical symbols or formulas online. Among them, I prefer the following methods because they immediately provide image URLs for the math symbols generated from online input.

