Common Crawl provides public access to its huge web index.

Google is a powerful search engine, as are Bing, Yandex, et al., but they're all proprietary: their spiders crawl the web and vacuum up information, which they store within their own walls. (Google calls its web index BigTable.) Yes, we can use their search engine user interfaces, but exactly which algorithms they use remains proprietary and, for the most part, secret.

The Common Crawl Foundation (Commoncrawl.org) was created in 2007 with the goal of crawling the web and making the discovered information available to the public, to do with as it pleases. Common Crawl claims to have stored about six billion web pages in its index, and it publishes a free library of program code to access it.
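The post doesn't name that code library, so here is just one illustrative sketch of how you can look pages up in Common Crawl's public index today, via its CDX index API. The crawl label `CC-MAIN-2023-50` and the sample record below are assumptions for illustration; real code would pick a current crawl from the list published at index.commoncrawl.org.

```python
# Hedged sketch: querying Common Crawl's CDX index for captures of a URL.
# Assumption: the CDX API at index.commoncrawl.org, which returns one
# JSON record per line; "CC-MAIN-2023-50" is an example crawl label.
import json
from urllib.parse import urlencode

CDX_HOST = "https://index.commoncrawl.org"

def cdx_query_url(crawl, url, limit=5):
    """Build a CDX index query URL for the given crawl and target URL."""
    params = urlencode({"url": url, "output": "json", "limit": limit})
    return f"{CDX_HOST}/{crawl}-index?{params}"

def parse_cdx_record(line):
    """Parse one JSON line of a CDX response into the fields needed to
    fetch the page body from the WARC archive (filename, offset, length)."""
    rec = json.loads(line)
    return {
        "url": rec["url"],
        "warc_file": rec["filename"],
        "offset": int(rec["offset"]),
        "length": int(rec["length"]),
    }

# A hypothetical sample record in the shape the index returns:
sample = ('{"urlkey": "com,example)/", "timestamp": "20231201000000", '
          '"url": "http://example.com/", "status": "200", '
          '"filename": "crawl-data/CC-MAIN-2023-50/segments/example.warc.gz", '
          '"offset": "12345", "length": "6789"}')

print(cdx_query_url("CC-MAIN-2023-50", "example.com"))
print(parse_cdx_record(sample)["warc_file"])
```

With the `offset` and `length` from a record, a client can issue an HTTP range request against the named WARC file to pull down just that one archived page, rather than an entire multi-gigabyte crawl segment.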

Applications that use the Common Crawl index are beginning to appear. Lucky Oyster uses the Common Crawl index to reveal previously hidden social networking relationships to users.

MIT's Technology Review recently published an article speculating that, thanks to Common Crawl, Google-scale start-ups can now get underway without having to crawl the web themselves, dramatically reducing their need for capital. Walled gardens such as Facebook and LinkedIn block spiders from crawling their sites; they're all about locking up information. It'll be fun to watch the tug of war between the proprietary and the open model in the web search arena. My money is on the open model.

Visit my website: http://russbellew.com
© Russ Bellew · Fort Lauderdale, Florida, USA · phone 954 873-4695
