The Deep Web
The deep Web has gotten a lot of press in recent years. The Web is becoming a complex entity that contains information from a variety of source types. It is much more than fixed Web pages. In fact, the part of the Web that is not fixed, and is served dynamically "on the fly," is far larger than the fixed documents that many associate with the Web. Some people incorrectly refer to this content as the "invisible Web," for reasons that will be explained below.
When we refer to the deep Web, we are usually talking about the following:
- The content of databases accessible on the Web. Databases contain information stored in tables created by such programs as Access, Oracle, SQL Server and DB2. Information stored in databases is accessible only by query. This is distinct from static, fixed Web pages, which are documents that can be accessed directly. A significant amount of valuable information on the Web is generated from databases. In fact, it has been estimated that content on the deep Web may be 500 times larger than the fixed Web.
- Non-textual files such as multimedia files, graphical files, software, and documents in formats such as Portable Document Format (PDF).
- Content available on sites protected by passwords or other restrictions. Some of this is fee-based content, such as subscription content paid for by libraries and available to their users based on various authentication schemes.
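The difference between a fixed page and database content that is "accessible only by query" can be sketched in a few lines of Python. This is only an illustration: the page address, the table, and the sample titles are hypothetical, and SQLite stands in for whatever database program a real site might use.

```python
import sqlite3

# A static page: one fixed document, retrievable directly by its address.
STATIC_PAGES = {"/about.html": "<html><body>About our library</body></html>"}


def build_catalog() -> sqlite3.Connection:
    """Create a small in-memory catalog (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE books (title TEXT, subject TEXT)")
    conn.executemany(
        "INSERT INTO books VALUES (?, ?)",
        [
            ("Introduction to Patents", "law"),
            ("Field Guide to Birds", "nature"),
            ("Copyright Basics", "law"),
        ],
    )
    return conn


def search_catalog(conn: sqlite3.Connection, subject: str) -> str:
    """Generate a result page on the fly for one user query."""
    rows = conn.execute(
        "SELECT title FROM books WHERE subject = ? ORDER BY title", (subject,)
    ).fetchall()
    items = "".join(f"<li>{title}</li>" for (title,) in rows)
    return f"<html><body><ul>{items}</ul></body></html>"


if __name__ == "__main__":
    catalog = build_catalog()
    print(STATIC_PAGES["/about.html"])     # fixed document, fetched directly
    print(search_catalog(catalog, "law"))  # page exists only after a query
```

The static page exists before anyone asks for it. The result page for "law" does not exist anywhere until a query is submitted, which is why such content cannot be reached simply by following links.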
The phenomenon of databases on the Web has been talked about for years, before the terms "invisible Web" or "deep Web" were coined. People sometimes referred to them as specialty databases, subject-specific databases, virtual libraries, and other similar terms. As Web technology develops and greater amounts of information are mounted on the Web, these databases take on primary importance as information finding tools.
The concept of the deep Web is becoming more complex as search engines such as Google have found ways to integrate deep Web content into their centralized search function. This includes everything from airline flights to documents in Word format. However, even a search engine as innovative as Google provides access to only a very small part of the deep Web.
Terminology
Why is this content referred to as the "invisible Web"? This is because the content of databases rarely shows up in a search engine result. Search engine spiders cannot or will not go inside database tables and extract the data. Database content is therefore "invisible" to them.
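A rough sketch of why a spider misses database content follows. It assumes a simplified spider that only follows links, and the page names and records are hypothetical. Real spiders are more sophisticated, but the basic limitation is the same: a record that no page links to is never visited.

```python
from __future__ import annotations

from collections import deque

# Hypothetical site: static pages link to each other. The search page is a
# query form, so it contains no links to individual database records.
PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],
}

# Hypothetical database records, returned only in response to a query.
RECORDS = {"patent 101": "Improved mousetrap", "patent 102": "Folding ladder"}


def crawl(start: str) -> set[str]:
    """Follow links breadth-first, as a simple spider would."""
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for link in PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def search(term: str) -> list[str]:
    """What a person gets by typing a query into the search form."""
    return [text for key, text in RECORDS.items() if term in key]


if __name__ == "__main__":
    print(sorted(crawl("/")))  # the spider reaches the form, not the records
    print(search("101"))       # a user query retrieves the record
```

The spider finds the search page itself but none of the records behind it, while a person who types a query into the form retrieves them immediately.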
However, the term "invisible Web" is a poor choice for these reasons:
- The term is very search engine-centric. It assumes that the only way to find information on the Web is to consult a search engine, so that if the information cannot be found there, you're out of luck. This is simply not the case.
- There is no such thing as recorded information that is invisible. Some information may be more of a challenge to find than others, but this is not the same as invisibility.
- Informational databases have been available for years. Many of us are familiar with a library's collection of Web-based e-journals and databases. We use online catalogs, which are databases of a library's holdings. No one has ever called this information a part of the "invisible library." These are simply databases whose content is available through user query. Like a library, the Web contains information of different types that is stored and retrieved in different ways.
- The content of search engines on the Web is itself stored in databases and available only through user query. Shouldn't we call this invisible, too? We're labelling as invisible something that is available only through user query (the invisible Web) because it isn't accessible from within something else that is also available only through user query (search engines). The logic of this terminology just doesn't hold up.
A company called BrightPlanet has coined the term "deep Web" to describe the phenomenon of searchable databases on the Web. (The static Web is referred to as the "surface Web.") This term is much better, since database content is visible with the appropriate search and retrieval technology.
A Few Tips for Dealing with the Deep Web
When dealing with the deep Web, keep these points in mind:
- Information that is likely to be stored in a database is a part of the deep Web. This can include large listings of things with a common theme. All directories are part of the deep Web. A few examples include:
  - phone books
  - "people finders," such as lists of doctors, lawyers, and other professionals
  - patents
  - laws
  - dictionary definitions
  - items for sale in a Web store or on Web-based auctions
  - digital exhibits
  - multimedia and graphical files
- Information that is new and dynamically changing in content will appear on the deep Web. Look to the deep Web for late-breaking items, such as:
  - news
  - job postings
  - available airline flights, hotel rooms, etc.
  - stock and bond prices, market averages, etc.
- Web sites of searchable databases can be retrieved via directories and search engines. For example, a Google search on "United States newspapers" will retrieve the site of NewsDirectory, a database of links to newspaper sites around the world. This may be thought of as "split level searching." For the first level, search for the database site. For the second level, go to the site and search the database itself for the information you want.
- Many search engine sites and commercial portals feature searchable databases as part of their package of services. This phenomenon falls under the heading of converging content, mentioned earlier in this tutorial. For example, you can visit AlltheWeb and look up news, retrieve pictures and multimedia, etc., all things outside the purview of a spider-gathered index. As another example, Google integrates searches of PDF, Word and other file types into its general search service.
- Some search engines will search the deep Web for related content subsequent to an initial search. For example, try a search on Google for "World Trade Center" and select the Images tab. This will retrieve many pages of images of the World Trade Center. Look for this type of feature on other search engines.
- Topical coverage on the deep Web is extremely varied. This presents a challenge, since it is impossible to anticipate what might turn up in a database. In addition, this coverage will be fluid as databases proliferate on the Web.
- Some of the information stored on Web-accessible databases may not be substantive or useful to most searchers. As with all of Web searching, it is important to tailor the query to the tool. The deep Web is highly valuable to those seeking the kind of targeted information listed above. It is also important to know where to look for useful content.
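The "split level searching" described in the tips above can be sketched as two separate lookups. Everything in this example is a hypothetical stand-in: the directory entries, the database names, and the records are invented for illustration, and simple word matching takes the place of a real search engine.

```python
from __future__ import annotations

# Level 1: a hypothetical directory that describes database sites.
DIRECTORY = {
    "newspaper-db": "database of links to United States newspapers",
    "patent-db": "searchable database of patents",
}

# Level 2: hypothetical contents of each database, reachable only by query.
DATABASES = {
    "newspaper-db": ["Boston Daily (Massachusetts)", "Austin Courier (Texas)"],
    "patent-db": ["Folding ladder", "Improved mousetrap"],
}


def find_database_sites(query: str) -> list[str]:
    """Level 1: match every query word against the site descriptions."""
    words = query.lower().split()
    return [
        site
        for site, description in DIRECTORY.items()
        if all(word in description.lower() for word in words)
    ]


def search_database(site: str, query: str) -> list[str]:
    """Level 2: search inside the chosen database itself."""
    return [rec for rec in DATABASES[site] if query.lower() in rec.lower()]


def split_level_search(site_query: str, record_query: str) -> dict[str, list[str]]:
    """Run both levels: find the database sites, then query each one."""
    return {
        site: search_database(site, record_query)
        for site in find_database_sites(site_query)
    }


if __name__ == "__main__":
    print(split_level_search("United States newspapers", "Texas"))
```

The first query only has to match the description of the database site. The second query runs against the records themselves, which the first-level tool never sees.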
Sources of Deep Web Content
As noted above, deep Web sites can be located in subject directories and search engines. In addition, deep Web content is available on search engine sites as featured content such as news, video, images, etc.
If you're interested in this topic, take a look at Deep Web Technologies. This company has developed a few databases.
The number of deep Web sources is vast and continues to grow. The Online Education Database maintains a nice sample list of deep Web resources.
The Future of the Deep Web
The lines between search engine content and the deep Web have begun to blur as search services provide access to part or all of once-restricted content. These services offer free search of the content of books and scholarly papers. Google Book Search, Google Scholar, Live Search Academic and other up-and-coming services are examples of this phenomenon.
Generally speaking, if a book is out of copyright, you can view the text in its entirety. However, the issue of full text availability is complex. Google, for example, often restricts access to the full text of out-of-copyright books when publishers with which it has agreements are selling them. Access to scholarly papers is also tricky. Some papers are posted on preprint or postprint archives, in Open Access journals, or on personal Web sites. When these show up in search engine results, you can read the full text. In other cases, the search is free but you must pay to access the content.
In essence, an increasing amount of deep Web content, especially scholarly content, is opening up to free search. As more and more publishers and libraries make agreements with the big search engines, more content will be searchable from central locations. Access to this content is a mixed bag. It may be that the future of the deep Web will be defined less by the opportunity for search than by access fees or other types of authentication.