Paper: Graph Twiddling in a MapReduce World

Posted: July 17, 2009 | Author: Hyunsik Choi | Filed under: Research | Tags: graph, graph cluster, map-reduce, scalable computing | 5 Comments

Today, at the lab seminar I presented the paper “Graph Twiddling in a MapReduce World” published in IEEE Computing in Science & Engineering. This paper addresses an investigation into the feasibility of decomposion graph operations into a series of MapReduce processes. In this post, I’m going to discuss this paper briefly.

As I mentioned above, this paper discusses the feasibility of decompositing graph operations into a series of MapReduce processes. As you know, the MapReduce has been gaining attentions in various applications that cope with large-scale datasets. However, to the best of my knowledge there have been no studies for dealing with graphs on MapReduce. This paper proposes several operations as follows:

Augmenting Edges with Degrees
Simplifying the Graph
Enumerating Triangles
Enumerating Rectangles
Finding Trusses
Barycentric Clustering
Finding Components

Some operations are performed in combination with other operations. Actually, some of them are very easy problems if they can traverse graphs. However, as the author said, traversing graphs with MapReduce is very inefficient (i.e., causing many MapReduce iterations) because a mapper reads only a record randomly for each map operation. Anyway, all the operations that the paper proposed avoid traversing graphs. Instead, their common pattern in graph algorithms proposed is as follows:

A map operation: Read and process all the edges (or vertex) or changing some piece of edge (or vertex) information. Then, result in records by vertex as key.
A reduce oprtation: For each record obtained from the previous map operation, read and determine the updated state of vertex or edge; emit this information in partially (or locally) updated records. Then, results in them.
A reduce opration: For each record from the previous reduce operation, combine the updates globally and complete updated information.

Discussion

Even though this paper proposes several graph operations, they are still unnatural owing to too many MapReduce iterations; to the best of my knowledge, each MapReduce job’s initializing cost is very expensive. It is because mapper only can read record sequentially. The proposed graph operations based on MapReduce will cause the overhead of both MR iteration and communication. As a result, the feasible primitive graph operations with MapReduce are very limited. In addition, there are evidences to show the MapReduce is not suited to graph operations, but I will state them later.

Therefore, I think that a new programming model for graph (or complexity data) are needed. Ideally, the new programming model for graph must support graph traversing. In addition, data are needed to be preserved in locality in regards with their connectivity although data are distributed across a number of data nodes. Actually, basing these ideas I’m concreting “Hamburg: A New Programming Model for Graph Data” inspired by a blog post “Large-scale Graph Computing at Google”

References

Jonathan Conhen, “Graph Twiddling in a MapReduce World”, Volume 11, Issue 4, pp 29–41, IEEE Computing in Science & Engineering, July-Aug, 2009.
Jeffrey Dean and Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
Google Cluster Computing and MapReduce Lecture 5
Breath-first graph search using an iterative map-reduce algorithm
Hamburg, Hadoop Wiki

Computer Scientist들을 위한 추천 블로그 (1)

Posted: June 24, 2009 | Author: Hyunsik Choi | Filed under: Research | Tags: computer science, database, distributed systems, P=NP, scalable computing | 2 Comments

오늘은 Computer Science 분야의 문제들 및 최신 이슈들을 다루는 몇몇 유명 블로그들을 소개하려고 한다. 워낙 유명한 블로그들이라 이미 많은 분들이 아실꺼라 생각이 들지만 혹시 모르는 분들이 있을까 이렇게 소개해 본다.

The Database Column – 말 그대로 데이터베이스 이슈들을 다룬다. 최근에는 클라우드 컴퓨팅에 대한 이슈도 언급된다. 이 블로그는 진짜 짱인게 Michael Stonebraker, Daniel Abadi, David DeWitt, Stan Zdonik, Samuel Madden 같은 대가들이 글을 쓴다. 최근 database 학계에서 어떤 주제에 관심을 가지고 있는지 알고 싶다면 제목만 훑어봐도 된다.
Gödel’s Lost Letter and P=NP – 제목만보면 NP문제를 주로 다루는 것 같지만 다양한 문제들과 알고리즘들을 다루고 있다(사실 오늘 발견함). 상당히 유익해 보이는 반면 어려워 보인다 (@_@).
All Things Distributed – Amazon CTO인 Werner Vogels의 블로그 이다. Scalable and distributed Computing에 대한 이슈를 다룬다.

원래 계획은 5개씩 소개하여 2회에 총 10개 소개였는데 요즘 포스팅 거리도 없고 하니…… 나머지는 다음에 이어서 쓰겠다.

덧붙임. 저 블로그들에 읽고 싶은 글들은 많은데 업데이트되는 수가 장난이 아니라…따라가기 참 힘들구나 ~(~_~)~

Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database

Paper: Graph Twiddling in a MapReduce World

Discussion

References

Computer Scientist들을 위한 추천 블로그 (1)

Categories

Archives

Dive Into A Data Deluge

Discussion about Newly Emerging Issues on Database

Paper: Graph Twiddling in a MapReduce World

Discussion

References

Share this:

Computer Scientist들을 위한 추천 블로그 (1)

Share this:

Categories

Archives