How Amazon.com Built Its Full-Text Book Search Service

Amazon.com in the US has launched a full-text search service covering the text of books. Have you wondered how it got the full text of more than 120,000 books into its data?

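USA Today has an account of how Amazon.com digitized its book content. The company scanned all 33 million pages and stored them as images, then converted those images into text that a search engine can reference and access. Apparently this would have been impossible ten years ago.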

It took a bold stroke for Amazon, the world's largest online retailer, to make the new service available. First, it had to scan 33 million book pages into an image archive, in some cases manually tearing pages from bindings to run through a scanner, in others, shipping caches of books to scanning centers in India and the Philippines.

Udi Manber, Amazon's vice president of search algorithms, then used processing power borrowed from the company's backup computers to convert the images into text data that could be cross-referenced and accessed by a custom-built search engine. "Ten years ago, this was all science fiction," says Manber.

[Source]

Amazon opens pages to perusal [USA TODAY / Posted 10/26/2003 10:35 PM]

Google Begins Building a Full-Text Book Search Service

SEM Research (SEMリサーチ)

Covers specialized topics in internet search and SEO for webmasters working in companies.
