
Searching the Internet:
Recommended Sites and Search Techniques

Search Engines

Definition: A search engine is a searchable database of Internet files collected by a computer program (called a wanderer, crawler, robot, worm, or spider). An index is created from the collected files, covering elements such as title, full text, size, and URL. There are no selection criteria for the collection of files, though evaluation can be applied to the ranking schemes that order the results of a query.

A search engine might well be called a search engine service or a search service. As such, it consists of three components:

There are two major types of search engines:

General Tips

  1. Search engines do not index all the documents available on the Web. For example, most engines cannot index files on password-protected sites. Good examples are the research databases and e-journals licensed for use by libraries and made available to affiliated users only. Documents behind a firewall are likewise inaccessible to a search engine spider. Other files can be excluded from search engines by Web server software at the host site, or by a command within the file itself. Still other Web pages may not be picked up because no other pages link to them, so a search engine spider misses them as it crawls from one page to the next. Search engines rarely contain the most recent documents posted to the Internet; do not look for yesterday's news on a search engine (unless, of course, it offers a separate news search).
  2. The content of databases generally will not show up in a search engine result. This is because search engine spiders cannot or will not get inside database tables and extract the data. The phenomenon is sometimes referred to as the deep Web. Later on in this tutorial, we will examine the nature of the deep Web.
  3. Search engine features are proliferating and are in a constant state of flux. Don't try to keep track of everything! For example, several search services offer searches on various fields, programming languages, domain locations, dates, and so on. As search engines develop and the competition among them intensifies, more features are available to users in all sorts of combinations. For a review of some of these features, and the engines that support them, see How to Choose a Search Engine or Research Database.
  4. Most major search engine indexes consist of the full text of source files. When you search a full text index, you will retrieve a file even if your search terms appear only once in the text and do not represent the primary topic of the document. (Of course, a good engine will place this type of file low in the list of results based on its relevancy ranking scheme.) Limiting your search to fields or using proximity operators (explained below) can be a useful way to boost the relevancy of your results.
  5. Many search engines have an interface for basic searches as well as a separate interface for advanced or more full-featured queries. Be sure to explore both interfaces and to use the one that is appropriate for your query. Keep in mind that some advanced search interfaces may actually be easier to use than the interface on the main screen. Visit AltaVista for an example of this.
  6. Because of the potentially large number of pages that can be retrieved by a search, good relevancy ranking is important. Most search engines use various criteria to construct a relevancy rating of each hit and will present your search results in this order. First generation search engines primarily use term relevancy ranking. This type of ranking judges relevancy based on the presence of your search terms in Web documents. For example, ranking will be based on: the presence of search terms in the title, URL, first heading; the number of times search terms appear in the document; search terms appearing early in the document; search terms appearing close together; etc. This is known as "on the page" ranking, since the engine looks at content on the page to determine its relevancy. The use of "on the page" ranking as the sole ranking scheme has been fading from the search engine scene because it has proven to be too simplistic for the Web environment.
  7. One of the most interesting developments in search engine technology is the organization of search results by peer ranking and the bundling of results into component concepts, domains and sites in addition to term relevancy. This type of ranking looks at "off the page" information to determine the order of your search results. Search engines that employ this alternative may be thought of as second generation search services. For example:
    • Google ranks by the number of links from pages ranked high by the service
    • Ask.com ranks according to the number of links from topically relevant pages
    • Clusty sorts results into categories representing concepts derived from your search

    A more detailed look at second generation search services may be found in the tutorial Second Generation Searching on the Web.

  8. Search tools generally present results in one of two ways:
    • Vertical layout: Your results are presented in one long list. This is by far the most common method of presentation. In these cases, you need to examine each source to determine if it addresses aspects of your topic that interest you.
    • Horizontal layout: Certain concept grouping engines offer results in a horizontal layout. With this feature, you can first review concept categories retrieved by your search before examining the results within particular categories. This type of organization can make it easier to determine if your results relate to the aspect of the topic that interests you. Examples of these tools are Query Server and Clusty.
  9. Don't be impressed by a large number of hits in response to a search. Often multiple pages are returned from a single site because they all contain your search terms. AltaVista is one search engine that avoids this with a technique called results grouping, whereby all the results from one site are clustered together into one result. You are then given the opportunity to view all the retrieved pages from that site if you choose. With engines that group results this way, you may get a smaller number of results from a search, but each result comes from a different site.
  10. Offered features do not always work perfectly. Don't look for perfection. Just relax and get what you can out of the search.
  11. It is helpful to understand that not all aspects of search engine technology are revealed to the public. In the world of commercial search engines, trade secrets abound. Help files tend to be general in nature when explaining how the technology works. This writer has queried services via e-mail for more details, only to get back slightly more substantive information.
  12. Watch for converging content. Most well-known sites contain information from an array of sources. Some have applied the term "portal" to describe this phenomenon. Offerings on a portal can increase the usefulness of search sites, but also can create confusion in terms of the information source. For example, consider what you may find on a commercial search engine service:
    • Spider gathered index: The mechanism for searching a spider-gathered index is the feature people usually associate with a search engine.
    • Results from other search services: It is common for a search engine to return results from other services with which it has partnered. Each partner service offers an enhancement over search results that are derived from the Web. This represents an interesting combination of first and second generation search technologies appearing on the same site.
    • Directory: Many search services offer a directory on their sites. This directory may be a name brand such as LookSmart or the Open Directory Project, or a directory compiled by a site's own editors. Results from the directory may appear automatically with results from the spider-crawled Web, or the directory may be searched or browsed separately.
    • Deep Web: Many search services offer the option to search databases offering specific content. Included may be news, business, shopping, multimedia files, and so on. These databases constitute a small subset of the deep Web.

    All of this points to a blurring of the distinctions among sites that provide directory content, those that offer results from the spider-crawled Web, and those that provide access to content on the deep Web.
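
To make the "on the page" term relevancy ranking described in the tips above more concrete, here is a minimal sketch in Python. It is an illustration only, not the method of any actual search engine: the function names, the sample documents, and the weights (3 for a title match, 2 for a URL match, and so on) are assumptions chosen for the example. It covers the presence of search terms in the title and URL, the number of times they appear, and how early they appear; first-heading and proximity signals are left out for brevity.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def on_page_score(query, title, url, body):
    """Score one document using only signals found on the page itself.

    All weights below are arbitrary, illustrative values.
    """
    title_words = set(tokenize(title))
    url_words = set(tokenize(url))
    body_words = tokenize(body)

    score = 0.0
    for term in tokenize(query):
        if term in title_words:
            score += 3.0  # search term appears in the title
        if term in url_words:
            score += 2.0  # search term appears in the URL
        positions = [i for i, word in enumerate(body_words) if word == term]
        score += 1.0 * len(positions)  # one point per occurrence in the text
        if positions:
            # Bonus for appearing early; it shrinks as the first
            # occurrence moves further into the document.
            score += 1.0 / (1 + positions[0])
    return score


def rank(query, documents):
    """Return the documents sorted from highest to lowest on-page score."""
    return sorted(
        documents,
        key=lambda d: on_page_score(query, d["title"], d["url"], d["body"]),
        reverse=True,
    )


# Hypothetical sample documents, invented for this example.
documents = [
    {
        "title": "Gardening tips",
        "url": "http://example.org/garden",
        "body": "A long article about soil that mentions the deep Web "
                "only once near the end.",
    },
    {
        "title": "Deep Web basics",
        "url": "http://example.org/deep-web",
        "body": "The deep Web is content that spiders cannot reach.",
    },
]

for doc in rank("deep web", documents):
    print(doc["title"])
```

Both sample documents contain the search terms, so a full text index would retrieve both. The scoring step is what pushes the page that merely mentions the terms in passing below the page that is actually about them.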

NOTE! New search engines are springing up all the time. Because Google has become so dominant, innovative new tools are sometimes called Alternative Search Engines (ASEs). To keep up to date on this exciting field, track the blog Alt Search Engines.


Now that you understand the basics of search engines, let's consider some useful general search strategies.
