THIS COLUMN appears weekly not only in newsprint but also on the World Wide Web. Yet if you hunt for it with one of the popular Web-wide search engines like Altavista, Infoseek or Hotbot, you will probably never find it.
So-called web crawlers and their search engines are indispensable tools that help users sift diamonds from the rough. The engines can deliver so much information that they may seem infallible and complete.
But when a search fails to turn up information that you know has existed on the Web for months, or sends you to a nonexistent page or one that lacks the information you were hunting for, you sense something is amiss. Despite one product's claims that it "was the first search engine to index and search the entire World Wide Web," it is impossible to sniff out every last nook and cranny.
The reason lies in Web crawlers' very nature. They typically begin at a known page and follow links from it to others, downloading pages and indexing them as they go. That strategy means that if no reference to a document exists in the outside world, the Web crawler will never find it.
John, June and Jake may have wonderful Web pages linked only to one another, but though their friends may see the sites by typing addresses, no crawler will ever reach Jayville until another site cites it. To address that problem, search engines prominently offer a form for submitting new addresses, but site developers do not always send out birth announcements.
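The Jayville problem can be sketched in a few lines of Python. This is a toy illustration, not the code of any real search engine: the link graph, the example addresses and the `crawl` function are all hypothetical, and a real crawler would download pages over the network rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
# All addresses are invented for illustration.
LINKS = {
    "portal.example/index": ["news.example/today", "shop.example/home"],
    "news.example/today": ["portal.example/index", "news.example/archive"],
    "news.example/archive": [],
    "shop.example/home": [],
    # Jayville: three pages that link only to one another.
    "jayville.example/john": ["jayville.example/june"],
    "jayville.example/june": ["jayville.example/jake"],
    "jayville.example/jake": ["jayville.example/john"],
}


def crawl(seed, links):
    """Follow links breadth-first from a seed page; return every page reached."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


reached = crawl("portal.example/index", LINKS)
missed = sorted(set(LINKS) - reached)
print("Never reached:", missed)
```

Starting from the portal page, the crawl reaches the news and shopping pages but none of the three Jayville pages, because nothing outside Jayville links in. Adding a single link from any reached page to one of them would bring all three into the index.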
Other omissions can be rectified only with difficulty, if at all. Most search engines skip sites that demand a password for entrance, even those like the New York Times on the Web that offer passwords free of charge. Internal search tools are usually available, some better than others, but you cannot hunt for information until you get past the password. Some sites lack links to archival pages and make them accessible only through a local search engine. Web crawlers cannot see them.
Web sites can use something called the Robots Exclusion Protocol to tell Web crawlers what to skip. For example, pages that are updated frequently are poor candidates for inclusion in indexes; by the time the index is created, the content it points to has long since disappeared or moved. Unsophisticated Webmasters may not know about the protocol and let crawlers index pages changed daily, weekly or even monthly, thus guaranteeing frustration when users arrive at pages lacking what they expected to find.
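For the curious, here is roughly how a well-behaved crawler might consult those instructions, using the `urllib.robotparser` module from Python's standard library. The rules, the site address `www.example.com`, the crawler name `ExampleCrawler` and the `may_index` helper are assumptions made up for this sketch; an actual site would publish its own rules in a file named robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site that keeps its fast-changing
# sections out of search indexes.
ROBOTS_TXT = """\
User-agent: *
Disallow: /daily/
Disallow: /weekly/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_index(path, agent="ExampleCrawler"):
    """Return True if the rules above let this crawler fetch the given path."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(may_index("/about.html"))            # a stable page
print(may_index("/daily/headlines.html"))  # a page that changes every day
```

With these rules the stable page may be indexed and the daily page may not. The protocol is purely advisory: it works only if the Webmaster writes the file and the crawler chooses to honor it.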
If you believe that the Web is the repository of all human knowledge, an unusual new site called AT1 (www.at1.com) will demonstrate otherwise. It offers free searches into what it calls "the invisible Web," which typical Web crawlers do not reach. Much of this information is not on the Web at all, but tucked away on CD-ROMs or in proprietary databases that charge as much as $300 an hour, plus annual service fees, so venturing past the result screens can cost real money. But there is also free material from a variety of sources and an extensive index of material from America Online. The ability to search an index of indexes that Web crawlers miss is a concept that deserves to catch on.
Meanwhile, search services are adding useful options to the information they do find. Altavista is introducing Livetopics, an enhancement that analyzes the information from a particular search and generates categories and related words to help users home in on the data they seek. Though the pre-release version's interface displays rough edges, the idea seems attractive as a computer-generated supplement to the category lists compiled at sites like Yahoo.
And the recent arrival of Excite's Newstracker (at nt.excite.com) means you can search recent editions of 300 Web-based publications, including many online versions of newspapers and magazines. This crawler cleverly peeks into password-protected sites, but it is up to you to register and, if necessary, pay for access to those sites if you want the full text of articles the program finds. The time lag for indexing means up-to-the-minute news can be missing, and material disappears after about a week, but Newstracker looks like a useful tool. However, it could not find last week's edition of this column. Why not? Search me!
Pub Date: 2/17/97