The World of Search Engines
A search engine is a searchable database of Internet files collected by a computer program called a crawler, robot, worm, or spider. An index is built from the collected files, drawing on elements such as title, full text, date last modified, URL, and language. Results are ranked by relevance, and the ranking method varies among search engines.
In essence, a search engine consists of three components:
- Spider: Program that traverses the Web from link to link, identifying and reading pages
- Index: Database containing a copy of each Web page or other file gathered by the spider
- Search and retrieval mechanism: Technology that enables you to search the index and that returns results in a relevancy-ranked order
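To make the three components concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any real search engine is built: the "Web" is a hypothetical in-memory dictionary (TOY_WEB), the URLs are invented, and ranking is a simple count of how often the search terms appear.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("vegan diet recipes and tips",
                       ["a.example/diet", "b.example/news"]),
    "a.example/diet": ("vegan diet basics: a vegan diet excludes animal products",
                       ["a.example/home"]),
    "b.example/news": ("search engine news", []),
    "c.example/orphan": ("vegan diet page nobody links to", []),
}

def crawl(start_url):
    """Spider: follow links breadth-first from start_url and read each page."""
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue:
        url = queue.popleft()
        text, links = TOY_WEB[url]
        pages[url] = text
        for link in links:
            if link in TOY_WEB and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages

def build_index(pages):
    """Index: map each word to the pages containing it, with a count per page."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word in text.lower().replace(":", " ").split():
            index[word][url] = index[word].get(url, 0) + 1
    return index

def search(index, query):
    """Retrieval: pages containing every query term, most occurrences first."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], {}))
    for term in terms[1:]:
        hits &= set(index.get(term, {}))

    def score(url):
        return sum(index[t][url] for t in terms)

    return sorted(hits, key=lambda u: (-score(u), u))

pages = crawl("a.example/home")
index = build_index(pages)
results = search(index, "vegan diet")
print(results)
```

Note that the page at c.example/orphan never reaches the index, because no crawled page links to it. The spider can only find what it can reach by following links.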
This tutorial will cover four types of search engines: general, meta, concept categorizing, and vertical. We'll take these one at a time.
First, let's look at the search engine scene:
~ Search engines don't index all the documents on the Web. Far from it. Here are some examples of the type of content (often referred to as the deep web) that usually does not appear in your search engine results:
- Pages behind password-protected sites, such as the research databases and e-journals licensed for use by libraries and made available only to affiliated users
- Pages behind a firewall
- The content of many databases, which together hold a vast amount of the content on the Web
- The very latest pages posted to the Web
- Pages excluded from search spiders by Web server software at the host site, or by a command within the Web page itself
- Pages that are not linked to other pages, and are therefore missed by a search engine spider as it crawls from one page to the next
- Many of the activities on the social Web - but this is changing
tip! See the tutorial on The Deep Web for more information on the content generally not found on search engines.
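The exclusion of pages from spiders, mentioned in the list above, can be illustrated with a short Python sketch. It assumes the two most common mechanisms, which the list does not name: a robots.txt file at the host site and a robots meta tag inside the page. The sample rules, the spider name ExampleSpider, and the example.com URLs are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a host site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())  # parse the rules without any network access

allowed = robots.can_fetch("ExampleSpider", "https://www.example.com/index.html")
blocked = robots.can_fetch("ExampleSpider", "https://www.example.com/private/report.html")

class RobotsMetaParser(HTMLParser):
    """Detects a <meta name="robots" content="...noindex..."> tag in a page."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True

def has_noindex(html):
    """Return True if the page itself asks spiders not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex

PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(allowed, blocked, has_noindex(PAGE))
```

A well-behaved spider checks both: the host's rules before requesting a page, and the page's own meta tag before adding it to the index. Compliance is voluntary on the spider's part.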
~ On the other hand, some search engines do retrieve a limited amount of content from the deep Web. For example, take a look at this search result from Google on "vegan diet". Notice that the results include videos from YouTube - which, conveniently, Google owns.
~ Because of the potentially large number of pages that can be retrieved by a search, good relevancy ranking is important. Search engines use various criteria to construct a relevancy rating of each search result and will present your results in this order.
- On-the-page ranking: All search engines look at the Web page itself in determining its relevancy to your search. This usually takes the form of term relevancy ranking. With this common type of ranking, your results are listed based on the presence of your search terms on each page. For example, ranking can be based on: the presence of search terms in the title, URL, or first heading; the number of times search terms appear on the page; search terms appearing early on the page; search terms appearing close together; etc.
- Off-the-page ranking: This ranking pays attention to factors beyond the Web page itself. It can take several forms, for example: semantic term matching, link ranking, and the bundling of results into concept categories, domains, and sites. Personalization of results (see below) is another example. This type of ranking looks at "off the page" information to determine the order of your search results, along with any supplementary information that may appear. Sometimes human judgment and actions are factors in the presentation of results. The real action nowadays on search engines is happening off the page.
- Google uses its PageRank system: it ranks a page by the number of links pointing to it, giving greater weight to links from pages that are themselves ranked high by the service; this is a type of peer ranking
- iBoogie sorts results into categories representing concepts derived from your search
- Hakia derives some of its results from librarian-recommended pages and emphasizes the most current content as well as content that is semantically relevant to your search terms
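The two kinds of ranking described above can be sketched in Python. This is an illustration under stated assumptions, not any search engine's actual formula: the weights in on_page_score are invented, proximity of terms is left out for brevity, and pagerank is a simplified textbook version of link ranking that requires every link target to appear as a key in the links dictionary.

```python
def on_page_score(query, title, url, body):
    """On-the-page ranking: score a page by where and how often terms appear."""
    words = body.lower().split()
    score = 0.0
    for term in query.lower().split():
        if term in title.lower():
            score += 3.0            # term in the title (hypothetical weight)
        if term in url.lower():
            score += 2.0            # term in the URL (hypothetical weight)
        score += words.count(term)  # number of times the term appears
        if term in words[:10]:
            score += 1.0            # term appears early on the page
    return score

def pagerank(links, damping=0.85, iterations=50):
    """Off-the-page ranking: simplified link ranking by repeated averaging.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages  # a page with no links shares evenly
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages link to "hub"; nothing links to "c".
LINKS = {"hub": ["a", "b"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
rank = pagerank(LINKS)
best = max(rank, key=rank.get)

strong = on_page_score("vegan diet", "Vegan Diet Guide",
                       "example.com/vegan-diet",
                       "a vegan diet excludes animal products")
weak = on_page_score("vegan diet", "Cooking", "example.com/cooking",
                     "recipes include one vegan option")
print(best, strong, weak)
```

In the link example, "hub" comes out on top because the most pages point to it, while "c" ranks lowest because no page links to it. A real engine combines many such signals, and, as noted elsewhere in this tutorial, the exact recipe is rarely made public.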
~ In addition to relevancy ranking, supplementary material may appear with your search results to help you focus on your desired topic and learn more about it. Some search engines are using semantics to present suggested topics, related data, meanings and attributes relating to your search. This is moving search in significant new directions. The screenshot on the right shows useful information retrieved in a Google search about the artist Pablo Picasso.
~ In this era of personalization, search results can be different for different people using the same search engine. For example, Google personalizes your search results based on sites you've selected from previous search results. This feature is called Web History, and you can opt out of it if you wish. This is an important trend to watch as the Web experience becomes more individualized to your preferences and needs.
~ Real-time search is important to the search experience on the Web. General search engines such as Bing and social search engines such as FriendFeed Search are bringing the real-time stream to everyone. If it's happening now on the Web, you can search for it.
~ It is helpful to understand that not all aspects of a search engine's technology are revealed to the public. In the world of commercial search engines, trade secrets abound. Help files tend to be general in nature when explaining how the technology works.
~ Don't expect search engines to work perfectly. Sometimes they just don't. If your results look strange, try a different search or a different search engine.
~ Google isn't the only search engine on the Web! Sure, Google is great, but there are other excellent search engines that deserve to be explored. Because Google has become so dominant, it can be easy to overlook useful alternatives. Check out this site's search engine page to explore some of them.
~ And finally... Beware of search results! Some search engines load the top of their results pages with paid listings. These are sites whose owners have paid for high placement. In other words, they are advertisements. Not all search engines do this, and some are clearer than others about what has been paid for and what has not. A good overview of this phenomenon can be found in the 2007 article "Buying Your Way In: Search Engine Advertising Chart" by Danny Sullivan of Search Engine Watch. If you're interested, read the story.
Now, on to a discussion of various types of search engines.
