Draft of Data-Intensive Text Processing with MapReduce Available Online

Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer

As far as I know, no book so far has dealt directly with MapReduce programming and algorithm design. This book covers topics ranging from MapReduce algorithm design to EM algorithms for text processing. Although it is still a draft, it seems well organized and very interesting. It also covers some basic graph algorithms using MapReduce.


Apple's Tablet, the iPad, Has Been Announced

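There was a lot of noise even before it came out, but it seems this was not just media hype. The two links below point to the announcement, product photos, and videos. A price starting at $499 is a bit steep, though.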

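What I found interesting is that the SDK, the programming guidelines, and even the human interface guidelines were already prepared at the time of the announcement and were posted on the website right away. Some Korean companies that resort to media hype as a matter of habit could learn a thing or two from this.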


A New Kind of Social Service – Sekai Camera

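I hear that an app called Sekai Camera has been released on the App Store as a global version. From what I can see, it looks like a new kind of social service that combines augmented reality, UCC (user-created content), and social networking. Services like this, built on a variety of media and devices, have been springing up everywhere recently, and I am really looking forward to what the next 3 to 5 years will bring. Many related data management issues will be raised as well. I really wonder, though, what ideas Korean IT companies currently have for preparing for the future amid such rapid changes in media and technology.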


How to Create A Table in HBase for Beginners

I have picked up some practical knowledge of MapReduce, Hadoop, and HBase while working on several projects. From now on, I'll post HBase tips periodically. Today, I'm going to show how to create an HBase table in Java.

The HBase client API offers two main entry points. One is the HBaseAdmin class, which provides methods for creating, modifying, and deleting tables and column families. The other is the HTable class, which provides methods for manipulating data, such as inserting, modifying, and deleting rows and cells.

Thus, to create an HBase table, we need to connect to the HBase master by creating an instance of HBaseAdmin, as on line 4 of the listing below (new HBaseAdmin(conf)). HBaseAdmin requires an instance of HBaseConfiguration. If necessary, you can set configuration properties first, as on line 2 (conf.set).

To describe the table schema, we create an instance of HColumnDescriptor for each column family. In addition to the column family name, HColumnDescriptor lets you set various parameters, such as maxVersions, compression type, timeToLive, and bloomFilter. Then we create the HBase table by invoking createTable, as on line 10 (hbase.createTable(desc)).

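HBaseConfiguration conf = new HBaseConfiguration();   // loads the HBase client configuration
conf.set("hbase.master","localhost:60000");           // optional: set the master address

HBaseAdmin hbase = new HBaseAdmin(conf);              // connects to the HBase master
HTableDescriptor desc = new HTableDescriptor("TEST"); // describes a table named TEST
HColumnDescriptor meta = new HColumnDescriptor("personal".getBytes());  // column family "personal"
HColumnDescriptor prefix = new HColumnDescriptor("account".getBytes()); // column family "account"
desc.addFamily(meta);
desc.addFamily(prefix);
hbase.createTable(desc);                              // creates TEST with both column families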

Finally, you can verify that the table exists by running the list command in the HBase shell:

c0d3h4ck@code:~/Development/hbase$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.1, r822817, Wed Oct  7 11:55:42 PDT 2009
hbase(main):001:0> list
TEST

1 row(s) in 0.0940 seconds
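
If it helps to picture what the "personal" and "account" column families mean, here is a toy sketch in plain Java. It is only my simplified mental model of an HBase table as nested sorted maps, not the HBase client API: the ToyTable class and its put, get, and versionCount methods are made up for illustration, and the maxVersions handling (keep only the newest N values per cell) is an assumption about the general idea rather than HBase's actual implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;

// Toy, in-memory picture of an HBase-style table. ToyTable is a hypothetical
// class for illustration only; it is not part of the HBase client API.
class ToyTable {
    private final int maxVersions;      // number of versions kept per cell
    private final Set<String> families; // column families are fixed at creation
    // row key -> "family:qualifier" -> timestamp -> value (every level sorted)
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    ToyTable(int maxVersions, String... families) {
        this.maxVersions = maxVersions;
        this.families = new HashSet<>(Arrays.asList(families));
    }

    // Stores one versioned value; qualifiers are free-form, families are not.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        if (!families.contains(family)) {
            throw new IllegalArgumentException("unknown column family: " + family);
        }
        NavigableMap<Long, String> versions = rows
                .computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(family + ":" + qualifier, c -> new TreeMap<>());
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry(); // drop the oldest version
        }
    }

    // Returns the newest value of a cell, or null if the cell is empty.
    String get(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = cell(row, family, qualifier);
        return versions == null ? null : versions.lastEntry().getValue();
    }

    // Returns how many versions of a cell are currently kept.
    int versionCount(String row, String family, String qualifier) {
        NavigableMap<Long, String> versions = cell(row, family, qualifier);
        return versions == null ? 0 : versions.size();
    }

    private NavigableMap<Long, String> cell(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(family + ":" + qualifier);
    }
}
```

In this picture, new ToyTable(2, "personal", "account") plays the role of the TEST table above: a put into the "personal" or "account" family succeeds with any qualifier, while a put into an undeclared family is rejected.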

ACM SIGMOD 2010 Programming Contest

As you know, SIGMOD is ACM’s Special Interest Group on Management of Data. SIGMOD holds an annual conference that is regarded as one of the best conferences in computer science. In addition, SIGMOD organizes a programming contest in parallel with the conference. The description below is this year's call for the contest. This year's subject seems very interesting: the task is to implement a simple distributed query executor built on top of last year’s main-memory index. The environment in which contestants test their implementations may be provided by Amazon. If you are interested in the contest, give it a try. You can get further information here (http://dbweb.enst.fr/events/sigmod10contest).

A programming contest is organized in parallel with the ACM SIGMOD 2010 conference, following the success of the first annual SIGMOD programming contest organized last year. Student teams from degree-granting institutions are invited to compete to develop a distributed query engine over relational data. Submissions will be judged on the overall performance of the system on a variety of workloads. A shortlist of finalists will be invited to present their implementation at the SIGMOD conference in June 2010 in Indianapolis, USA. The winning team, to be selected during the conference, will be awarded a prize of 5,000 USD and will be invited to a one-week research visit in Paris. The winning system, released in open source, will form a building block of a complete distributed database system which will be built over the years, throughout the programming contests.


CIKM 2009 in Hong Kong

With Min Kyoung Sung, a coauthor of ‘SPIDER : A System for Scalable, Parallel / Distributed Evaluation of large-scale RDF Data’, I attended the 18th ACM CIKM 2009 (Conference on Information and Knowledge Management), held in Hong Kong. We stayed at the Marriott Hotel near AsiaWorld-Expo, where CIKM 2009 was held. At the conference, I spent time with several Korean researchers (Kyong-Ha Lee, Jinoh Oh, and Sangchul Kim), and during the demonstration session I discussed SPIDER with researchers interested in RDF data processing.

At CIKM 2009, I felt that the focus of web data management is shifting from unstructured data toward information extraction and semantic or structured web data. Many papers and posters addressed these issues. In addition, the panel topic was ‘Information extraction meets relational databases: Where are we heading?’ One of the panelists said that the hot spot of web data management research is moving from crawling, indexing, and searching to information extraction and semantic data. These changes raise a variety of new data and knowledge management issues. Besides information extraction, graph data mining was one of the hottest topics at CIKM 2009.

In the main keynote, Kyu-Young Whang (KAIST, Korea) gave a talk titled ‘DB-IR Integration and Its Application to a Massively-Parallel Search Engine.’ Its key point was that DB-IR integration is becoming one of the major challenges in the database area and is leading to new DBMS architectures suited to it. In addition, Edward Chang (Google Research China) and Clement Yu (University of Illinois at Chicago) spoke on ‘Confucius and its intelligent Disciples’ and ‘Advanced Metasearch Engines’, respectively.

Photos: Coffee Break at CIKM 2009 | SPIDER in Demo Session

Photos: Tian Tan Buddha Statue in Hong Kong | Lunch time at CIKM 2009

This conference was a really nice experience for me. I enjoyed the conference, the reception, and the banquet. However, I regret that I didn’t attend the 1st CloudDB 2009 workshop, held in conjunction with CIKM 2009.

Anyway, this conference inspired Min Kyoung Sung and me. We will remember it for a long time.


MapReduce Online Comes Out!

MapReduce has been gaining a lot of attention in the data-intensive computing field. As you know, it is a very popular framework for batch processing.

Recently, however, Tyson Condie, a Ph.D. student at UC Berkeley, has built MapReduce Online. I heard the news today from Data Beta. It is impressive work, since the original MapReduce was designed only for batch processing, and most people believed that MapReduce would remain a batch-processing system.

The essence of MapReduce Online is that it tries to preserve the fault-tolerance model of the original MapReduce while pipelining results across tasks and jobs, instead of materializing the output of each MapReduce task and job to disk. Consequently, MapReduce Online lets a program return results from a big job earlier.
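
To make the contrast concrete, here is a minimal sketch in plain Java of the pipelining idea only. It is not the MapReduce Online code or the Hadoop API: the class PipelinedWordCount, the run method, and the snapshotEvery parameter are all made up for illustration, there is one mapper and one reducer inside a single process, and fault tolerance is not modeled at all. The mapper pushes each word to the reducer through a small in-memory queue as soon as it is produced, so the reducer can report partial counts before the whole input has been processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical single-process sketch of pipelined word count.
class PipelinedWordCount {
    // Sentinel the mapper sends to signal the end of its output.
    private static final String END = "\u0000END";

    // Returns partial counts taken every snapshotEvery records,
    // followed by the final counts as the last list element.
    static List<Map<String, Integer>> run(List<String> lines, int snapshotEvery) {
        // Bounded channel: the mapper blocks when the reducer falls behind.
        BlockingQueue<String> channel = new ArrayBlockingQueue<>(16);

        // "Map" side: emit each word immediately instead of writing
        // the complete map output somewhere first.
        Thread mapper = new Thread(() -> {
            try {
                for (String line : lines) {
                    for (String word : line.trim().split("\\s+")) {
                        if (!word.isEmpty()) {
                            channel.put(word);
                        }
                    }
                }
                channel.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        mapper.start();

        // "Reduce" side: fold records in as they arrive and take snapshots.
        Map<String, Integer> counts = new TreeMap<>();
        List<Map<String, Integer>> snapshots = new ArrayList<>();
        int seen = 0;
        try {
            while (true) {
                String word = channel.take();
                if (word.equals(END)) {
                    break;
                }
                counts.merge(word, 1, Integer::sum);
                if (++seen % snapshotEvery == 0) {
                    snapshots.add(new TreeMap<>(counts)); // early, partial result
                }
            }
            mapper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while reducing", e);
        }
        snapshots.add(new TreeMap<>(counts)); // final result
        return snapshots;
    }
}
```

With the input lines "a b a" and "b a c" and snapshotEvery set to 3, the first snapshot is {a=2, b=1} and the final result is {a=3, b=2, c=1}. A batch-style version would only be able to return the second of these.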

You can get further information from MapReduce Online.

