Common Crawl provides public access to its huge web index.

Google is a powerful search engine, as are Bing, Yandex, and others, but they’re all proprietary: their spiders crawl the web and vacuum up information, which they store within their own walls. (Google stores its web index in BigTable, its internal distributed storage system.) Yes, we can use their search engine user interfaces, but exactly which algorithms they use remains proprietary and, for the most part, secret.

The Common Crawl Foundation (commoncrawl.org) was created in 2007 with the goal of crawling the web and making the discovered information available to the public, to do with as it pleases. Common Crawl claims to have stored about six billion web pages in its index, and it publishes a free library of program code to access it.
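Common Crawl also exposes a public index API over HTTP. Here’s a minimal sketch in Python of one way to look up captured pages for a site; the crawl label “CC-MAIN-2024-10” is an assumption, and you should check index.commoncrawl.org for the crawls actually available, since they change over time.

import json
import requests

# Assumed crawl label and endpoint -- verify against index.commoncrawl.org.
INDEX = "https://index.commoncrawl.org/CC-MAIN-2024-10-index"

def lookup(url_pattern, limit=5):
    # Ask the index for up to `limit` captures matching the URL pattern.
    resp = requests.get(
        INDEX,
        params={"url": url_pattern, "output": "json", "limit": limit},
    )
    resp.raise_for_status()
    # The server returns one JSON record per line.
    return [json.loads(line) for line in resp.text.splitlines()]

for record in lookup("commoncrawl.org/*"):
    # Each record names the archive file in the public dataset, plus the
    # byte offset and length that locate the captured page inside it.
    print(record["timestamp"], record["url"], record["filename"])

Each result points into a WARC archive file in Common Crawl’s public dataset, so you can fetch the stored page itself without ever running a spider of your own.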

Applications built on the Common Crawl index are beginning to appear. Lucky Oyster, for example, uses it to reveal previously hidden social networking relationships to users.

MIT’s Technology Review recently published an article speculating that, thanks to Common Crawl, Google-scale start-ups can now get underway without having to crawl the web themselves, dramatically reducing their need for capital. Walled gardens such as Facebook and LinkedIn block spiders from crawling their sites — they’re all about locking up information. It’ll be fun to watch the tug of war between the proprietary and open models in the web search arena. My money is on the open model.

Visit my website: http://russbellew.com
© Russ Bellew · Fort Lauderdale, Florida, USA · phone 954 873-4695