VIPS: a Vision-based Page Segmentation Algorithm

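A paper on a search algorithm that is expected to be built into the search engine Microsoft is developing has been published on the company's "Microsoft Research" site. It is titled "VIPS: a Vision-based Page Segmentation Algorithm." Microsoft appears to be putting considerable effort into the Vision-based Page Segmentation Algorithm (hereafter VIPS) that the paper describes.

First, an overview of VIPS.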

A new web content structure analysis based on visual representation is proposed in this paper. Many web applications such as information retrieval, information extraction and automatic page adaptation can benefit from this structure. This paper presents an automatic top-down, tag-tree independent approach to detect web content structure. It simulates how a user understands web layout structure based on his visual perception. Comparing to other existing techniques, our approach is independent to underlying documentation representation such as HTML and works well even when the HTML structure is far different from layout structure. Experiments show satisfactory results.

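VIPS is an algorithm that tries to understand information by imitating human visual perception. We sometimes understand a piece of information not from its content but from the positional cues we take in visually. Take asahi.com as an example. Regular users of the site know that the asahi.com logo (the site ID) sits at the top right of the screen, that the row of text along the top is the navigation, and that the headlines and lead paragraphs come below it. Even if the page were written in English, Korean, or Hebrew, they could still tell from position alone what role each element plays on the page (headline, navigation, or site ID). That is "visual perception," and VIPS tries to reproduce it algorithmically.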

Today the Web has become the largest information source for people. Most information retrieval systems on the Web consider web pages as the smallest and undividable units, but a web page as a whole may not be appropriate to represent a single semantic. A web page usually contains various contents such as navigation, decoration, interaction and contact information, which are not related to the topic of the web-page. Furthermore, a web page often contains multiple topics that are not necessarily relevant to each other. Therefore, detecting the semantic content structure of a web page could potentially improve the performance of web information retrieval. [VIPS: a Vision-based Page Segmentation Algorithm / Technical Report MSR-TR-2003-79 / Microsoft Research Microsoft Corporation]

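A common piece of advice for building pages that search engines can interpret easily is "put one topic (subject or theme) on one page." The reasoning is that the more narrowly a page's topic is focused, the more specialized the page becomes on that topic. Conversely, a page that gathers many unrelated topics ends up with no distinguishing feature and no strong relationship to any particular keyword. Search engines list results starting from the page they judge most relevant to the query, so the logic goes that they will rate a specialized page, one whose topic matches the search keywords, more highly.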

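On the Web, however, a page almost never contains "only" one topic. There is navigation, multimedia content such as Macromedia Flash or Shockwave, announcements and latest news, advertisements, and all sorts of other information on the same page, so a web page devoted purely to a single topic is not realistic.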

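This is where segmenting a page into meaningful information structures, and recognizing where its main content sits, can improve information retrieval. For example, if the information on a page can be classified into semantically coherent groups such as "company information," "body text," "site ID," and "navigation," a search engine can understand more accurately what the page is about.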

If we can get a semantic content structure of the web page, wrappers can be more easily built and information can be more easily extracted. Moreover, Link analysis has received much attention in recent years. Traditionally different links in a page are treated identically. The basic assumption of link analysis is that if there is a link between two pages, there is some relationship between the two whole pages. But in most cases, a link from page A to page B just indicates that there might be some relationship between some certain part of page A and some certain part of page B. Also, the existence of large quantity of noisy links will cause the topic drift problem in HITS algorithm [7, 20]. Recent works on topic distillation [11, 12] and focused crawling [13] strengthen our observation. However, these works are based on DOM (Document Object Model) tree of the web page which has no sufficient power to semantically segment the web page as we show in the experimental section. Furthermore, efficient browsing of large web pages on small handheld devices also necessitates semantically segmentation of web pages [19].

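Being able to classify information also helps link analysis. Most current search algorithms assume that when one page links to another, the two pages as a whole are related in some way. In reality, the relationship often holds only between one part of the linking page and one part of the linked page. If page content can be classified semantically, links that are in practice unrelated can be recognized as such.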
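As a rough illustration of the idea, the sketch below assumes a page has already been segmented into labeled blocks and simply drops links that come from blocks treated as noise. The block labels, the `Page` type, and the `content_links` function are hypothetical names chosen for this example; the paper does not describe VIPS as producing such labels, and this is not how any particular search engine is known to work.

```python
from typing import Dict, FrozenSet, List

# Hypothetical representation of a segmented page:
# block label -> URLs linked from inside that block.
Page = Dict[str, List[str]]

# Assumed set of block labels whose links are treated as noise.
NOISY_LABELS: FrozenSet[str] = frozenset({"navigation", "advertisement", "footer"})


def content_links(page: Page, noisy_labels: FrozenSet[str] = NOISY_LABELS) -> List[str]:
    """Return only the links that appear in blocks not considered noise."""
    links: List[str] = []
    for label, urls in page.items():
        if label not in noisy_labels:
            links.extend(urls)
    return links


page: Page = {
    "navigation": ["/home", "/about"],
    "body": ["https://example.com/vips"],
    "advertisement": ["https://ads.example.com/x"],
}
print(content_links(page))
```

With whole-page link analysis all four links would count equally; here only the link in the body block would be passed on to the link-analysis step.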

In this paper, we propose VIPS (VIsion-based Page Segmentation) algorithm to extract the semantic structure for a web page. Such semantic structure is a hierarchical structure in which each node will correspond to a block. Each node will be assigned a value (Degree of Coherence) to indicate how coherent of the content in the block based on visual perception. The VIPS algorithm makes full use of page layout feature: it first extracts all the suitable blocks from the html DOM tree, then it tries to find the separators between these extracted blocks. Here, separators denote the horizontal or vertical lines in a web page that visually cross with no blocks. Finally, based on these separators, the semantic structure for the web page is constructed. VIPS algorithm employs a top-down approach, which is very effective.
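The separator step described in the quotation can be sketched in a few lines. This is a deliberately simplified illustration, not the paper's implementation: it assumes each block is reduced to a vertical extent in pixels, looks only for horizontal separators, and also reports empty space at the page edges. The names `Block` and `horizontal_separators` are made up for this example.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A visual block reduced to its vertical extent (assumed pixel units)."""
    name: str
    top: int
    bottom: int  # exclusive


def horizontal_separators(blocks: List[Block], page_height: int) -> List[Tuple[int, int]]:
    """Return (start, end) vertical ranges that no block overlaps.

    A horizontal line drawn anywhere inside such a range crosses no block,
    which matches the quoted description of a separator.
    """
    separators: List[Tuple[int, int]] = []
    cursor = 0  # lowest point covered by the blocks seen so far
    for block in sorted(blocks, key=lambda b: b.top):
        if block.top > cursor:
            separators.append((cursor, block.top))
        cursor = max(cursor, block.bottom)
    if cursor < page_height:
        separators.append((cursor, page_height))
    return separators


blocks = [
    Block("logo", 0, 80),
    Block("nav", 90, 120),
    Block("headline", 150, 400),
    Block("sidebar", 150, 300),
]
print(horizontal_separators(blocks, page_height=420))
# [(80, 90), (120, 150), (400, 420)]
```

The sidebar and headline overlap vertically, so no horizontal separator is found between them; a vertical pass over left and right extents would be needed to split them.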

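In short, VIPS divides a page into blocks and assigns each block a value based on how coherent its content is.

It will be interesting to see in what form the VIPS algorithm ends up in a search engine.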

SEM Research (SEMリサーチ)

Covers specialist topics in Internet search and SEO for webmasters working at companies.
