The Deep Web
The deep Web is usually defined as the content on the Web that is not accessible through a search on general search engines. This content is sometimes also referred to as the hidden or invisible Web.
The Web is a complex entity that contains information from a variety of source types and includes an evolving mix of different file types and media. It is much more than static, self-contained Web pages. In fact, the part of the Web that is not static, and is served dynamically "on the fly," is far larger than the static documents that many associate with the Web.
The concept of the deep Web is becoming more complex as search engines have found ways to integrate deep Web content into their central search function. This includes everything from airline flights to news to stock quotations to addresses to maps to activities on Facebook accounts. In the screenshot below, notice the various deep Web sources offered by Google, including images, maps, news, video, shopping, scholarly content, blogs, and so on. However, even a search engine as far-reaching as Google provides access to only a very small part of the deep Web.
Content on the deep Web
When we refer to the deep Web, we are usually talking about the following:
- The content of databases. Databases contain information stored in tables created by such programs as Access, Oracle, SQL Server, and MySQL. (There are other types of databases, but we will focus on database tables for the sake of simplicity.) Information stored in databases is accessible only by query. In other words, the database must somehow be searched and the data retrieved and then displayed on a Web page. This is distinct from static, self-contained Web pages, which can be accessed directly. A significant amount of valuable information on the Web is generated from databases.
- Non-text files such as multimedia, images, software, and documents in formats such as Portable Document Format (PDF) and Microsoft Word. For example, see Digital Image Resources on the Deep Web for a good indication of what is out there for images.
- Content available on sites protected by passwords or other restrictions. Some of this is fee-based content, such as subscription content paid for by libraries or private companies and available to their users based on various authentication schemes.
- Special content not presented as Web pages, such as full-text articles and books.
- Dynamically changing, updated content, such as news and airline flights.
This is usually the basic, "traditional" list. In these days of the social Web, let's consider adding new content to our list of deep Web sources. For example:
- Blog postings
- Discussions and other communication activities on social networking sites, for example Facebook and Twitter
- Bookmarks and citations stored on social bookmarking sites
As you can see, based on these few examples, the deep Web is expanding.
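To make the first item on the traditional list concrete, the sketch below shows why database content is invisible to a crawler that only follows links: the result page does not exist until someone submits a query. It is a minimal illustration in Python using the standard sqlite3 module. The table name, the sample rows, and the function names build_demo_db and render_results_page are all hypothetical and are not taken from any real site.

```python
import sqlite3
from html import escape


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doctors (name TEXT, specialty TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO doctors VALUES (?, ?, ?)",
        [
            ("A. Rivera", "cardiology", "Albany"),
            ("B. Chen", "dermatology", "Albany"),
            ("C. Okafor", "cardiology", "Buffalo"),
        ],
    )
    return conn


def render_results_page(conn: sqlite3.Connection, specialty: str) -> str:
    """Build an HTML page on the fly from whatever the query returns."""
    rows = conn.execute(
        "SELECT name, city FROM doctors WHERE specialty = ? ORDER BY name",
        (specialty,),
    ).fetchall()
    items = "".join(
        f"<li>{escape(name)} ({escape(city)})</li>" for name, city in rows
    )
    return (
        f"<html><body><h1>{escape(specialty)}</h1>"
        f"<ul>{items}</ul></body></html>"
    )


if __name__ == "__main__":
    conn = build_demo_db()
    # The page below is generated only because a query was submitted.
    print(render_results_page(conn, "cardiology"))
```

Calling render_results_page with "cardiology" yields a page listing two of the three sample rows. No such page is stored anywhere before the call, which is the sense in which this kind of content sits below the surface of the Web.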
Tips for dealing with deep Web content
- Vertical search can solve some of the problems with the deep Web. With vertical search, you can query a collection of data focused on a specific topic, industry, type of content, geographical location, language, file type, website, piece of data, and so on. For example, consider MedNar and PubMed to search for medical topics. On the social Web, there are search engines for blogs, RSS feeds, Twitter content, and so on.
Tip! See the tutorial on Vertical Search Engines for more information.
- Use a general search engine to locate a vertical search engine. For example, a Google search on "stock market search" will retrieve sites that allow you to search for current stock prices, market news, and so on. This may be thought of as split-level searching. At the first level, search for the database site. At the second level, go to that site and search the database itself for the information you want.
- A number of general search engines will search the deep Web for related content subsequent to an initial search. For example, try a search on Google for "World Trade Center" and select the Images tab. This will retrieve many pages of images of the World Trade Center. Look for this type of feature on other search engines.
- Try to figure out which kinds of information might be stored in a database. There is no general rule, but think about large listings of things with a common theme. A few examples of database-driven content include:
  - phone books
  - "people finders," for example lists of professionals such as doctors or lawyers
  - dictionary definitions
  - items for sale in a Web store or on Web-based auctions
  - digital exhibits
  - images and multimedia
  - full-text articles and books
- Information that is new and dynamically changing in content will appear on the deep Web. Look to the deep Web for late-breaking items, such as:
  - job postings
  - available airline flights and hotel rooms
  - stock and bond prices, market averages
- The social Web often jumps on a late-breaking situation with news items and commentary. Blogs, Facebook, Twitter, and other social networking environments sometimes get out the word before more traditional sources.
- Topical coverage on the deep Web is extremely varied. This presents a challenge, since it is impossible to anticipate exactly what might turn up.
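The split-level searching described in the tips above can be sketched as a two-step lookup: first find a vertical search engine for the topic, then search that engine's own collection. The example below is a minimal in-memory illustration in Python and makes no network calls. The VerticalEngine class, the functions find_engines and split_level_search, the engine names, and the sample records are all hypothetical stand-ins, not real services or data.

```python
from dataclasses import dataclass, field


@dataclass
class VerticalEngine:
    """A stand-in for a topic-specific search site and its database."""
    name: str
    topics: set[str]
    records: list[str] = field(default_factory=list)

    def search(self, term: str) -> list[str]:
        """Second level: search this engine's own records."""
        term = term.lower()
        return [r for r in self.records if term in r.lower()]


def find_engines(directory: list[VerticalEngine], topic: str) -> list[VerticalEngine]:
    """First level: locate engines that cover the topic."""
    topic = topic.lower()
    return [e for e in directory if topic in e.topics]


def split_level_search(
    directory: list[VerticalEngine], topic: str, term: str
) -> dict[str, list[str]]:
    """Run both levels and group the hits by engine name."""
    return {e.name: e.search(term) for e in find_engines(directory, topic)}


def demo_directory() -> list[VerticalEngine]:
    """Made-up engines and records, for illustration only."""
    return [
        VerticalEngine(
            "Demo Medical Index",
            {"medicine", "health"},
            ["Aspirin dosage guidelines", "Seasonal flu overview"],
        ),
        VerticalEngine(
            "Demo Market Quotes",
            {"stocks", "finance"},
            ["Aspirin maker share price", "Bond market averages"],
        ),
    ]


if __name__ == "__main__":
    print(split_level_search(demo_directory(), "medicine", "aspirin"))
```

Note that the same term, "aspirin", matches records in both sample collections, but only the medical engine is consulted because the first level narrowed the search by topic. That narrowing is the practical benefit of locating the right vertical source before searching it.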
Tip! For insights into finding scholarly content on the Web - the deep and the not-so-deep - see the tutorial Finding Scholarly Content on the Web.