Anybody reading this blog probably uses the Internet regularly, and we all have a general understanding of what the web is. It’s Google search, Facebook, Amazon.com, blogs, news sites and a wide range of other material that somebody has taken the time to put onto the Internet. But what surprises many people is that everything we can see on the web is just a minuscule fraction of the actual web, and that most of what is out there is unavailable to us.
Mike Bergman, the founder of BrightPlanet, coined the phrase ‘deep web’ to describe all of the content on the Internet that ordinary searches can’t see. The part of the web we can see is called the surface web, and it’s been estimated that the deep web is at least 500 times larger than the surface web.
The Google search engine probably crawls more of the surface web than any other, and it’s been estimated that Google indexes perhaps 15% of the surface web and none of the deep web. This means your Google searches draw on only about 0.03% of what is actually on the web. And if anything that estimate is high, since the deep web seems to be growing exponentially.
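If you want to check that 0.03% figure yourself, it falls straight out of the two estimates above. A quick back-of-the-envelope calculation (the 500x and 15% figures are the article’s estimates, not hard data):

```python
# If the deep web is ~500x the size of the surface web, and Google indexes
# ~15% of the surface web, what share of the *whole* web does Google see?
surface = 1.0               # surface web, as one unit
deep = 500 * surface        # deep web, estimated at 500x larger
indexed = 0.15 * surface    # Google's estimated 15% of the surface web

share = indexed / (surface + deep)
print(f"{share:.2%}")  # -> 0.03%
```

So the 0.03% in the text is just 15% of one part in 501.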
So what are all of these things we can’t see on the web? They fall into a number of different categories:
- Private web content that requires a password. This includes huge databases like Lexis-Nexis (a paid repository of court records and other legal documents), most scientific papers, trade group papers that are available to members only, corporate information that is meant only for employees, and anything for which somebody wants to control (or charge for) access.
- Unlinked web pages. Many web sites include pages that cannot be reached through links from the main page. Crawlers can’t generally find such pages.
- Content that lies behind a form. In this industry I am often asked for my name and company before being given access to whitepapers and other content.
- Web sites that are hidden on purpose. It’s possible to have a web page that fends off crawlers through the use of techniques such as the Robots Exclusion Standard or CAPTCHAs. These kinds of web sites are often part of the darknet, which consists of web sites used for nefarious purposes such as pornography, selling drugs, trading hacker information, sharing copyrighted material, and numerous other things that the content owners want to keep under the radar. But the darknet isn’t always nefarious; it is also used by political dissidents and others trying to hide their activities from authorities.
- Cached content that is stored as images rather than as text, a PDF, or another format a crawler can read.
- Content in the form of a video. A search engine might note that a video exists, but cannot know the content of the video.
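To make the “hidden on purpose” item above concrete: the Robots Exclusion Standard works through a plain-text robots.txt file that tells well-behaved crawlers which paths to stay out of. Here’s a quick sketch using Python’s standard-library robots.txt parser; the rules and URLs are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that tells every crawler to skip /private/
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler checks before fetching each page
print(rp.can_fetch("GoogleBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("GoogleBot", "https://example.com/public/page.html"))   # True
```

Note that this only keeps out crawlers that choose to obey it, which is why darknet sites layer on stronger measures.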
There are techniques for finding things on the deep web, and one can imagine that governments around the world constantly search the darknet. Normal web crawlers explore the web by following hyperlinks, the links that connect web pages. But that approach can’t uncover the deep web, because deep web content isn’t reachable through hyperlinks.
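The link-following behavior described above is easy to see in miniature. Here’s a toy crawler over a made-up “site” held in a dictionary; the page names are invented for illustration. Notice that a page nobody links to simply never gets visited, which is exactly how unlinked pages end up in the deep web:

```python
from collections import deque

# A toy "surface web": each page maps to the hyperlinks it contains.
site = {
    "/home":   ["/about", "/blog"],
    "/about":  ["/home"],
    "/blog":   ["/home", "/about"],
    "/hidden": [],   # this page exists, but no other page links to it
}

def crawl(start):
    """Breadth-first crawl: visit a page, then queue every link found on it."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in site.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(sorted(crawl("/home")))  # ['/about', '/blog', '/home'] -- /hidden is never found
```

Real crawlers add politeness delays, robots.txt checks, and HTML parsing, but the core loop is the same.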
In 2005 Google introduced the Sitemaps Protocol, which lets site owners hand crawlers an explicit list of their pages, including pages that no hyperlink points to. Over time this allows a company like Google to index a meaningful slice of the deep web, at least the part whose owners want to be found. But that is only half of the battle, and much of the deep web sits behind passwords or on anonymizing networks like Tor, out of reach of normal web crawlers. So the challenge remains to find a way to map and uncover data on the deep web and share it in a format that can be easily read and understood.
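A sitemap is just an XML file listing a site’s URLs. Here’s a sketch of building a minimal one in Python; the example.com URLs are hypothetical, and the namespace is the one defined by the Sitemaps Protocol:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps Protocol (sitemaps.org)
NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Build a minimal sitemap.xml listing every page, linked or not."""
    root = ET.Element("urlset", xmlns=NS)
    for u in urls:
        loc = ET.SubElement(ET.SubElement(root, "url"), "loc")
        loc.text = u
    return ET.tostring(root, encoding="unicode")

sitemap = build_sitemap([
    "https://example.com/home",
    "https://example.com/hidden",  # an unlinked page a crawler would otherwise miss
])
print(sitemap)
```

By publishing this file, a site owner deliberately moves unlinked pages from the deep web onto the surface web.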
One imagines the NSA spends a lot of time crawling around the deep web, particularly the darknet, but for the rest of us this is largely going to remain hidden and out of sight. I’ve always wondered why some topics I look for don’t seem to be found on the web – now I know they are probably there somewhere in the 99.97% of the web that the Google search engine doesn’t see.