Chemical Information Resources on the World Wide Web

R. Stembridge, Sales Development Manager Europe, Knight Ridder Information

Introduction

Few developments in the history of science can match the significance of the advent of the World Wide Web (WWW). Mendeleev's periodic table, the splitting of the atom, the isolation of penicillin: the WWW has the potential to stand alongside these. Why is this?

The essence of science is discovery - the analysis of problems and the use of knowledge and application of original thinking to that knowledge to arrive at solutions. Thus we see that the cornerstone of science, of discovery, is knowledge.

Knowledge is gained through understanding: understanding that comes from the acquisition of information and the rationalisation of that information through questioning, probing, sharing and discussion. These are the two fundamental elements of the WWW: a vast storehouse of information, coupled with the vehicle for the dissemination and discussion of that information by other minds. Thus the WWW is already having, and increasingly will have, a profound impact on the way that science is done.

This paper seeks to explore the information resources available for the chemical research professional today, some of the issues surrounding access to that information, and a glimpse into the near future of one possible solution.

Needs of the chemical research professional

The life blood of chemical research is knowledge which is built on the understanding of information and access to that information. The key to this process is access to the right information, at the right time, and in the right way.

Since the use of computers is now integral to chemical research, clearly the right way to access information is through the computer. Connection to the WWW offers every chemical research professional access to information in this way. To access the WWW requires two things:

- Internet connection
- WWW browser

Internet connectivity is largely a closed book to the speaker, and he is grateful for the presence at this conference of his colleague, Mark Dooley, who will be able to answer any questions there may be on this topic; he can be reached through the Knight Ridder Information booth in the exhibit hall. Essentially, there are two ways of connecting to the Internet: either your organisation will be registered as an Internet site and you can connect via your internal network, or you can register with one of the commercial Internet service providers such as Pipex or Demon and connect via the provider using a modem and phone line.

To take full advantage of the information on the WWW, it is desirable to have a WWW browser capable of interpreting and displaying HTML format documents, since a great deal of information on the WWW is in this form. There are two popular WWW browsers of this type, Mosaic and Netscape. Of these, it is said that 70% of people are using Netscape.

How, then, do we access the right information at the right time? This depends on being able to locate what is needed quickly and easily. There are a number of issues which affect this which will be addressed later, but for now, I would like to conduct a review of some of the information resources available to the chemical research professional on the WWW today.

In the time available, it is not possible to do a comprehensive review; rather, a broad overview and selected highlights will be covered. If your favourite site or resource is not mentioned, it is omitted through ignorance or time constraints rather than prejudice; therefore, your understanding and forgiveness is sought!

Navigation aids

Estimates of the size of the WWW vary according to how that estimate is calculated, but Carnegie Mellon University's may be taken as indicative; they currently estimate the size of the web to be over 5 million documents not including pages inside databases like the Library of Congress, the Human Genome Database, or WAIS indexes.

Therefore, the virtue of the WWW is also its biggest vice - amongst the wealth of information available, how do you find what's needed? If you're trying to find a particular site or document on the Internet or just looking for a resource list on a particular subject, you can use one of the many available on-line search engines. These engines allow you to search for information in many different ways - some search titles or headers of documents, others search the documents themselves, and still others search other indexes or directories.

An alternative approach is to use one of the "Subject Guides" to the WWW. These take the form of a subject index to the WWW and allow you to home in on the sites on the Web that are likely to contain information of use to you.

Of the many aids available to help navigate round the WWW, three search engines (Aliweb, Lycos, WebCrawler) and one "Subject Guide" (Yahoo) are reviewed here.

Aliweb
Aliweb is one of the public services provided by Nexor, based in the UK.

The idea behind ALIWEB is very simple. The World Wide Web is growing too big to find things easily. It is impossible to keep track of all the services other people provide; they change often, and there are simply too many of them. Therefore ALIWEB proposes that people just keep track of the services they provide, in such a way that automatic programs can simply pick up their descriptions, and combine them into a searchable database.

So the way ALIWEB works is as follows:
People write descriptions of their services in a standard format into a file on the Web, by hand or using automatic tools; they then tell ALIWEB about this file. ALIWEB regularly retrieves all these files, and combines them into a searchable database. Anybody can come and search this database from the Web. Because the database can be updated regularly (currently once a day) the data is very up-to-date. Since ALIWEB does all the work of retrieving and combining these files, people only need to worry about descriptions of their own services; so the information is likely to be correct and informative. And as only these small description files need to be gathered there is little overhead.
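
The workflow just described (sites publish small description files, which are then gathered and combined into one searchable database) can be sketched roughly as follows. This is only an illustration: the "Key: value" record layout, the function names and the example URLs are assumptions made for the sketch, not the format or code that ALIWEB itself uses.

```python
# Minimal sketch of the ALIWEB idea: each site publishes a small description
# file, and a central program merges them into one searchable index.
# The "Key: value" record layout and the sample URLs are hypothetical.

def parse_descriptions(text):
    """Split a description file into records (blank line between records)."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip().lower()] = value.strip()
    if current:
        records.append(current)
    return records


def build_index(description_files):
    """Combine the records from every site's file into a single list."""
    index = []
    for text in description_files:
        index.extend(parse_descriptions(text))
    return index


def search(index, term):
    """Return records whose title or description mentions the term."""
    term = term.lower()
    return [
        rec for rec in index
        if term in rec.get("title", "").lower()
        or term in rec.get("description", "").lower()
    ]


SITE_A = """Title: NMR Spectra Archive
URL: http://example.org/nmr
Description: Sample NMR spectra for teaching chemistry

Title: Staff Directory
URL: http://example.org/staff
Description: Contact details for department staff
"""

SITE_B = """Title: Polymer Chemistry Group
URL: http://example.net/polymers
Description: Research summaries and preprints
"""

index = build_index([SITE_A, SITE_B])
hits = search(index, "chemistry")
for rec in hits:
    print(rec["title"], "-", rec["url"])
```

Because each site maintains only its own short file, the central program has very little to fetch, which is the low-overhead property noted above.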

Lycos
One of the so-called "robot-generated" WWW indices, this is provided by Lycos, Inc, based at Carnegie Mellon University. The name "Lycos" comes from the first 5 letters of the Latin name for the wolf spider.

The Lycos web explorer searches the World Wide Web every day, building a database of all the web pages it finds. The index is updated weekly. The Pursuit search engine provides probabilistic retrieval from this catalogue, taking a user's query and returning a sorted list of hits (the list is sorted by match score, and only documents with scores above the threshold are retrieved).
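
To make the idea of a score-sorted hit list with a cut-off concrete, here is a small sketch. The scoring rule (the fraction of query words found in a document), the threshold values and the sample catalogue are all assumptions chosen for illustration; the text does not describe how Pursuit actually computes its match scores.

```python
# Sketch of threshold-based ranked retrieval. The scoring function (fraction
# of query words found in the document) is an illustrative assumption, not
# the scoring actually used by Lycos.

def match_score(query, document):
    """Fraction of distinct query words that occur in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def retrieve(query, catalogue, threshold=0.5):
    """Return (score, title) pairs strictly above the threshold, best first."""
    scored = [(match_score(query, text), title) for title, text in catalogue.items()]
    kept = [pair for pair in scored if pair[0] > threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


# Invented catalogue entries: title -> document text.
CATALOGUE = {
    "Library synthesis": "combinatorial chemistry methods for building compound libraries",
    "Organic course": "lecture notes for an organic chemistry course",
    "Cricket scores": "county cricket results and fixtures",
}

results = retrieve("combinatorial chemistry", CATALOGUE, threshold=0.4)
for score, title in results:
    print(f"{score:.2f}  {title}")
```

Raising the threshold shortens the list: with a threshold of 0.5 only the document matching both query words would be returned.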

This search engine will allow you to search on document titles and content. Its July 21 database contains 5.5 million link descriptors. The Lycos index is built by a Web crawler that can bring in 5000 documents per day. The index covers the document titles, headings, links, and keywords it locates in these documents.

Lycos is sometimes perceived to be slow. This is because, although it is actually fast with searches against the 5.5 million record database taking only a few seconds, Lycos currently have only 7 computers handling up to 150,000 users per week. So it seems slow because you're sharing it with 149,999 other people. Good times to search are before 11am EST, or after 6pm. To cope with the load, a reduced-sized catalogue has been provided with about 3/4 million entries.

WebCrawler
Another robot-generated WWW index, WebCrawler is based at the University of Washington and is operated by America Online as a service to the Internet.

The WebCrawler is a tool for searching the Web. It operates by traversing the Web, either building an index for later use or searching in real time for a query. The index built by the WebCrawler is available for searching via the WebCrawler Search Page. The engine allows searches by document title and content. It is part of the WebCrawler project, managed by Brian Pinkerton at the University of Washington, which collects documents from the Web.
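
As a rough picture of what "traversing the Web and building an index" involves, the sketch below follows links breadth-first across a tiny in-memory stand-in for the Web and records which pages contain which words. The page names, their contents, the breadth-first strategy and the `limit` parameter are all assumptions made for the example; this is not WebCrawler's own code.

```python
from collections import deque

# Sketch of a crawler that traverses links breadth-first and builds an index.
# The tiny in-memory "web" below stands in for real pages; page names and
# contents are invented for illustration.

WEB = {
    "home": {"text": "chemistry department home page", "links": ["staff", "research"]},
    "staff": {"text": "staff list", "links": ["home"]},
    "research": {"text": "combinatorial chemistry research", "links": ["papers"]},
    "papers": {"text": "research papers", "links": []},
}


def crawl(start, web, limit=None):
    """Visit pages breadth-first from start; return (index, unvisited)."""
    index = {}          # word -> set of pages containing it
    visited = set()
    queue = deque([start])
    while queue and (limit is None or len(visited) < limit):
        page = queue.popleft()
        if page in visited or page not in web:
            continue
        visited.add(page)
        for word in web[page]["text"].split():
            index.setdefault(word, set()).add(page)
        for link in web[page]["links"]:
            if link not in visited:
                queue.append(link)
    # Pages still queued are known about but have not been explored yet.
    unvisited = {page for page in queue if page not in visited}
    return index, unvisited


full_index, pending = crawl("home", WEB)
partial_index, still_pending = crawl("home", WEB, limit=2)
print(sorted(full_index["chemistry"]))
print(sorted(still_pending))
```

Stopping the crawl early leaves a set of pages that have been discovered through links but not yet read, which is why a crawler can know about far more documents than it has indexed.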

Results of a WebCrawler search are returned in relevance ranked order. The numbers at the left side of the query results are an indication of the relevance of a particular document to your query. The documents are presented in order of decreasing relevance, and the numbers are normalized to the most relevant document. So, a document with a score of 500 would be "half as relevant" as the one with a score of 1000. It's not particularly scientific, but it gives you a feeling for just how helpful a document might be before you click on it.
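
The normalisation described here, in which the most relevant document scores 1000 and the others are scaled against it, amounts to a single division. A minimal sketch follows; the raw scores and document names are invented, since the text does not say how WebCrawler derives its underlying relevance values.

```python
# Sketch of normalising raw relevance scores so the best document scores 1000.
# The raw scores and document names are invented for illustration.

def normalise(raw_scores, top=1000):
    """Scale scores so that the highest raw score maps to `top`."""
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best == 0:
        return {name: 0 for name in raw_scores}
    return {name: round(top * score / best) for name, score in raw_scores.items()}


RAW = {"doc_a": 8.0, "doc_b": 4.0, "doc_c": 2.0}
scores = normalise(RAW)
print(scores)
```

Here `doc_b` has half the raw score of `doc_a` and so is reported as 500, the "half as relevant" case mentioned above.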

The WebCrawler Database has a content index of about 100MB. It contains information on over 150,000 different documents that the WebCrawler has explored. The rest of the WebCrawler database (tables of all known, unvisited documents) occupies another 100MB or so, and contains data on over 1,500,000 different documents. As you can see, the WebCrawler has a way to go before it explores all the documents it knows about!

Yahoo
This is probably the most popular of the manual indices of WWW-based information and is run by Yahoo Corp based in Mountain View, California. It features a hierarchically organised subject tree.

Each site is reviewed, validated and categorised before being added to Yahoo. Currently, about 600 new sites are being added to Yahoo each day. Over 4,500 sites are listed under "Science", of which over 240 are categorised under "Chemistry".

Examples of what's listed at this site are given below:

Science:Chemistry
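
A hierarchically organised subject tree, addressed by paths such as Science:Chemistry, can be pictured as nested categories in which each category holds its own sites plus any sub-categories. The sketch below is one possible representation, not a description of how Yahoo stores its index; the function names and the example site names are invented.

```python
# Sketch of a hierarchical subject tree addressed by colon-separated paths
# such as "Science:Chemistry". Site names are invented for illustration.

def add_site(tree, path, site):
    """File a site under a colon-separated category path."""
    node = tree
    for category in path.split(":"):
        node = node.setdefault("children", {}).setdefault(category, {})
    node.setdefault("sites", []).append(site)


def find(tree, path):
    """Return the node for a colon-separated category path."""
    node = tree
    for category in path.split(":"):
        node = node["children"][category]
    return node


def count_sites(node):
    """Count sites at this node and in every category beneath it."""
    total = len(node.get("sites", []))
    for child in node.get("children", {}).values():
        total += count_sites(child)
    return total


tree = {}
add_site(tree, "Science:Chemistry", "Example University Chemistry Department")
add_site(tree, "Science:Chemistry", "Example Polymer Group")
add_site(tree, "Science:Physics", "Example Optics Laboratory")

science_total = count_sites(find(tree, "Science"))
chemistry_total = count_sites(find(tree, "Science:Chemistry"))
print(science_total, chemistry_total)
```

Counting recursively is what allows a statement of the form "so many sites under Science, of which so many are under Chemistry".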

Others
Cusi
Nexor U.K. offers this tool, a single form to search a large number of different WWW engines for documents, people, software, dictionaries, and more.

Infoseek
Infoseek is a comprehensive and accurate WWW search engine. You can type your search in plain English or just enter key words and phrases. You can also use special query operators. Infoseek can be accessed directly from the Netscape browser menu under Net Search. This links to http://www2.infoseek.com, which will search WWW sites and return the top 10 hits.

Comparison of navigation aids

The examples of navigation aids given above are of two basic types: manually generated and robot-generated.

The indices of WWW-based resources generated by robots are very complete, but are more likely to find too much information. The Lycos robot is probably the largest. WebCrawler is smaller, but more up-to-date.

To compare the differences between these aids, a sample search for information on the Web relating to one of the current hot topics in Chemistry was performed using the three search engines reviewed above and the results compared.

The topic chosen was "Combinatorial Chemistry", the technique of building & managing large catalogues of related chemical entities, which is being applied particularly in the pharmaceutical area to help speed up the discovery and development of biologically active molecules.

The results are as follows:

Aliweb

Lycos
WebCrawler
Thus, we can see that the robot-generated indices produce broad retrieval of subject-matter across a wide variety of information types; if one can tolerate a certain amount of serendipity, these are useful tools for navigating the Web.

For more focussed retrieval of information, however, the Subject Guides probably offer more help. Besides Yahoo, there are a couple of very useful sites specific to Chemistry which gather together links to a wide array of chemical information in one place.

Some specific Chemical Information sites

Sites covered here relate to two basic types: those that provide information about organisations and their products & services, and those that give access to chemical information either by making documents available directly or by acting as gateways to other document sites.

Organisations, their products & services

ACS
Since this paper is being presented to this ACS meeting, I can hardly not mention their Web site here!

This WWW server provides information on the American Chemical Society products and services.

Amongst other topics, the site gives information about recent developments at ACS, news and details of forthcoming meetings, expositions, symposia & colloquia and details of ACS publications and software.

Derwent
Together with details of Derwent's products and services, a number of other useful options are available at this site:

Derwent selected Usenet Newsgroups by industry sector
Easy access to Derwent's Online Host's Home Pages
Links to the Online Hosts via telnet

There is also a very useful collection of links to other WWW chemical information resources here under the "Derwent Information Limited Internet Links Page - General Chemistry"

The resources are grouped into Databases, Literature and Other Resources.

Chemistry Databases
Fullerene Database; Fullerene Contents Alert; Protein Database

Chemistry Literature
ACS Gopher; Journal of Chemical Physics Gopher (JCP Express); Chemistry Textbooks in Print; First Online Chemistry Conference in Chemical Education (CHEMCONF); Chemical Physics Preprint Database

Chemistry Other Resources
Chemistry on the Internet from InterNIC; Computational Chemistry List; MIME Types for Chemistry; Scientific Visualization and Graphics; Home pages for many chemical institutes & University departments

Document sites/gateways

Three of the most useful resources known to the speaker are covered here:

EiNet Galaxy
The Gary Wiggins Chemistry resource list
Carl/UnCover document delivery

EiNet Galaxy

Under EiNet home page, the "Chemistry - Science" link takes you to a truly comprehensive and valuable listing of links to many useful sites. To give a flavour of what's here, I've listed the headings together with a couple of examples for each heading:

Chemistry - Science
Guest editor Vineet Gupta and EINet bring you the WWW Hub for Chemistry.

New Items less than 7 days old (e.g. University of Massachusetts, Department of Chemistry)
Articles (e.g. MIME types for Chemistry)
Guides (e.g. Some Chemistry Resources on the Internet)
Software (e.g. GraphPad Prism -- Scientific Graphing)
Product and Service Descriptions (e.g. Chemical Abstracts WWW Server)
Collections (e.g. Amazing Science at the Roxy (Chemistry), Chemistry Resources on the Internet, The World-Wide Web Virtual Library: Chemistry)
Directories (e.g. Computers in Teaching Chemistry)
Organizations (e.g. Chemistry at Center for Scientific Computing (Finland))
Academic Organizations (e.g. Cambridge University Chemical Laboratory, Department of Chemistry UCLA)
Government Organizations (e.g. Combustion Chemistry Laboratory, Sandia National Laboratories)
Commercial Organizations (e.g. Cambridge Scientific Computing, Inc.)

Gary Wiggins
Several individuals on the Web have compiled collections of chemical information resources on the Web. One of the most useful known to the speaker at this time is that compiled by Gary Wiggins of Indiana University. Topics covered here include:

BOOK CATALOGS
DATABASES
DOCUMENT DELIVERY
LIST-SERVS, NEWSGROUPS, ETC.
FTP RESOURCES
GOPHERS
GUIDES TO INTERNET RESOURCES
ON-LINE SEARCH SERVICES
PERIODICALS AND CONFERENCE PROCEEDINGS (FULL TEXT)
PERIODICALS AND OTHER DOCUMENTS (CURRENT AWARENESS)
SOFTWARE (INCLUDING USERS GROUPS)
TEACHING RESOURCES
WORLD-WIDE WEB RESOURCES
CORPORATE WEB RESOURCES

Carl/UnCover document delivery (Telnet session)
UnCover, a service from the Carl Corporation, contains records describing journals and their contents. Over 4000 current citations are added daily. UnCover offers you the opportunity to order fax copies of articles from this database. UnCover can be searched in several ways: by WORD or TOPIC, by AUTHOR, or by BROWSEing journal titles.

To order an article from the UnCover data base, perform a search to locate the article you want and then mark the article. Once you have marked all of the articles you wish to order, directions on the screen are given for ordering.

Articles will be delivered directly to your fax machine within 24 hours of ordering (Mon-Fri). Watch for '1 Hour' articles -- available for immediate delivery -- 24 hours/day, 7 days/week.

Copyright fees have been provided by and are paid to the Copyright Clearance Center in Salem, Mass., wherever possible, or directly to publishers in all other cases or upon request.

Issues affecting access to information on the WWW

Speed of access to sites on the Web and reliability of connection continue to be issues. The growth of traffic on the Internet continues unabated. As fast as capacity is built into the networks, new users absorb that capacity and more. It remains to be seen whether the phenomenal growth in usage will continue and whether capacity can be increased to keep pace with it.

As the quantity of information on the WWW grows exponentially, so the problem of knowing what to believe and rely on, and what to discard as rumour or speculation, will increase. The growth of sites on the Web with editorial control, from newsletters to well-known information providers and publishers, is, I believe, a natural response to this need. Many of these sites are, and will increasingly be, made available by commercial organisations; the choice for the user then becomes no different from the daily experience of whether to believe what you hear through the grapevine for free, or to trust your favourite scientific journal for which you have paid.

The future

As discussed, much of the future development of chemical information resources on the WWW will be made by commercial organisations. One such development to be made available next year is KR ScienceBase. This is designed to provide a site which draws together in one place much of the body of published chemical literature and patents, and to make this easily and precisely searchable through a combination of keywords, registry numbers and appropriate subject lists using Netscape and search forms. Sources such as Chemical Abstracts, Derwent, Medline and Biosis amongst many others will be included. Thus, for the first time, a vast store of validated, edited and evaluated chemical literature will be available in one place on the WWW accessible by all chemical research professionals who need this type of information.

Conclusion

I said in my introduction that the WWW has the potential to stand alongside other major milestones in the development of science; if the issues raised here can be satisfactorily addressed, I believe the impact of the WWW on science in general, and chemistry in particular, will be increasingly profound.

Ladies & Gentlemen, thank you for your kind attention.