author Thomas E. Dickey <dickey@invisible-island.net> 2010-04-29 22:00:22 -0400
committer Thomas E. Dickey <dickey@invisible-island.net> 2010-04-29 22:00:22 -0400
commit dc748b1c47baadafae2c90f0e188927b11b7e029
tree c728869dc6504570b9bffb7459ccbdd1bf264a9f /docs/CRAWL.announce
parent d4093cadbda3787dfb165954f8f6521790cfac86
download lynx-snapshots-dc748b1c47baadafae2c90f0e188927b11b7e029.tar.gz
snapshot of project "lynx", label v2_8_8dev_6c
Diffstat (limited to 'docs/CRAWL.announce')
-rw-r--r-- docs/CRAWL.announce 131
1 files changed, 0 insertions, 131 deletions
diff --git a/docs/CRAWL.announce b/docs/CRAWL.announce
deleted file mode 100644
index e734bcba..00000000
--- a/docs/CRAWL.announce
+++ /dev/null
@@ -1,131 +0,0 @@
-The TRAVERSAL code from old versions of Lynx has been upgraded by David
-Mathog (mathog@seqaxp.bio.caltech.edu) so that it works again, can be
-implemented via a command line switch (-traversal) instead of via a
-compilation symbol for creating a separate Lynx executable as in those
-previous versions, and can be used in conjunction with a -crawl switch
-to make Lynx a front end for a Web Crawler.
- 
-
-Usage:
-
-   lynx [-traversal] [-realm] [-crawl] ["startpage"]
-
-
-Added switches are:
-
-  -traversal      Follow all http links derived from startpage that are
-                  on the same server as startpage.  If startpage isn't
-                  specified then the traversal begins with the default
-                  startfile or WWW_HOME.
-
-  -realm	  Further restrict http links to ones in the same realm
-                  (having a matching base URI) as the startpage (e.g.,
-		  http://host/~user/ will restrict the traversal to that
-		  user's public html tree).
-
-  -crawl          With [-traversal] outputs each unique hypertext page
-                  as an lnk########.dat file in the format specified
-                  below.  With [-dump] outputs only the startpage, in
-		  the same format, to stdout.
-
-
-Note on startpage:
-
-                  If a startpage is specified and contains any uppercase
-		  characters, on VMS it should be enclosed in double-quotes.
-		  The code that extracts the access and host fields from
-                  startpage for comparisons with links to ensure they are
-                  not on another server, and the comparisons with already
-                  traversed links, are case sensitive, and the startpage
-                  will go to all lowercase on VMS if no double-quotes are
-                  supplied, such that it might be treated as a new link if
-                  encountered with uppercase letters.
-
-
-Files created and/or used with the -traversal switch, based on definitions
-in userdefs.h:
-
-TRAVERSE_FILE (traverse.dat):
-                  Contains a list of all URLs that were traversed.  Note
-                  that if a URL appears in this file it will not be 
-                  traversed again (important if runs are started and 
-                  stopped).  Placing an entry in this file BEFORE the
-                  run will block traversal of that URL.  Unlike reject.dat
-                  a final * has no effect (see below).  Note that Lynx
-		  internal client-side image MAP URLs will be included in
-		  this file (e.g., LYNXIMGMAP:http://server/foo.html#map1),
-		  in addition to the "real" (external) http URLs.
-
-TRAVERSE_FOUND_FILE (traverse2.dat):
-                  Contains a list of all URLs that were traversed, in the
-                  order encountered or re-encountered (but not re-traversed)
-                  during a traversal run, and the TITLEs of the documents
-                  (separated from the URLs by TABs).  A URL and TITLE may be
-                  present in this list many times.  To simplify the list,
-                  on VMS use:  sort/nodups traverse2.dat;1 ;2
-		  Note that the URLs and TITLEs of the Lynx internal
-		  client-side image MAP pseudo-documents will not be
-		  included in this file, though "traversed", but only the
-		  http URLs and TITLEs derived from the MAP's AREA tag
-		  HREFs that were traversed.
-
-TRAVERSE_REJECT_FILE (reject.dat):
-                  Contains a list of URLs that have been rejected from the
-                  traversal.  Once a URL has been entered in this list, it
-                  will not be traversed.  URLs that end in a * will cause
-		  rejection of all URLs that match up to the character before
-		  the *. So for instance, to reject all htbin references on a
-		  site put this line in the reject.dat file BEFORE starting
-		  the run:  http://www.site.wherever:8000/htbin*
-
-TRAVERSE_ERRORS (traverse.errors):
-		  A list of links that could not be accessed or had an
-		  unknown status returned by the http server.  If the
-		  owner of the document containing the link is known via
-		  a LINK REV="made" HREF="mailto:foo" in it and the
-		  MAIL_SYSTEM_ERROR_LOGGING was set true in userdefs.h
-		  or lynx.cfg (not recommended!!!), a message about the
-		  problem will be mailed to the owner as well. 
-
-
-Files created during traversals if the -crawl switch is included with the
--traversal switch:
-
-lnk########.dat   Numbered output files containing the contents of traversed
-		  hypertext documents in text format.  All hypertext links
-		  within the document have been stripped, and the URL and
-		  TITLE of the document are recorded as the first two lines,
-		  e.g., for the seqaxp.bio.caltech.edu home page the first
-		  two lines will be:
-
-                  THE_URL:http://seqaxp.bio.caltech.edu:8000/
-                  THE_TITLE:SAF Web server home page
-
-                  The VMSIndex software is being adapted to use this
-		  information to extract the corresponding URL and TITLE
-		  for use in indexing the lnk########.dat files, e.g.:
-
-                  $ build_index -
-                    /url=(text="THE_URL:") -
-                    /topic=(text="THE_TITLE:",EXCLUDE) -
-                    /output=INDEX_NAME -
-                    lnk*.dat
-
-		  A clever person should be able to figure out a way to
-		  index the lnk########.dat files on Unix as well.
-
-		  If you want the hypertext links in the document to be
-		  numbered, include the -number_links switch.  By default,
-		  this will cause the list of References (URLs for the
-		  numbered links) to be appended as well.  If you want
-		  numbered links but not the References list, include the
-		  -nolist switch as well.
-
-		  Note that any client-side image MAP pseudo documents
-		  that were "traversed" will not have lnk########.dat
-		  output files created for them, but output files will
-		  be created for "real" documents that were traversed
-		  based on the HREFs of the MAP's AREA tags.
-
-This functionality is still under development.  Feedback and suggestions
-are welcome.
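
Reviewer note: the removed text describes two rules that are easy to misread, the trailing-* matching in reject.dat and the two-line THE_URL/THE_TITLE header of each lnk########.dat file. The Python sketch below illustrates both and is not taken from the Lynx sources. The function names is_rejected and read_crawl_header are hypothetical, and it assumes that reject.dat entries without a trailing * are compared by exact, case-sensitive string equality.

```python
def is_rejected(url, reject_entries):
    """Return True if url matches a reject.dat-style entry.

    An entry ending in * rejects every URL that matches up to the
    character before the *.  Any other entry is compared for exact
    equality (an assumption; the announcement does not spell this out).
    Comparisons are case sensitive, as the announcement notes.
    """
    for entry in reject_entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        if entry.endswith("*"):
            if url.startswith(entry[:-1]):
                return True
        elif url == entry:
            return True
    return False


def read_crawl_header(lines):
    """Return (url, title) from the first two lines of a crawl output file."""
    if len(lines) < 2:
        raise ValueError("expected at least two header lines")
    url_line, title_line = lines[0], lines[1]
    if not url_line.startswith("THE_URL:"):
        raise ValueError("first line does not start with THE_URL:")
    if not title_line.startswith("THE_TITLE:"):
        raise ValueError("second line does not start with THE_TITLE:")
    url = url_line[len("THE_URL:"):].strip()
    title = title_line[len("THE_TITLE:"):].strip()
    return url, title


if __name__ == "__main__":
    rejects = ["http://www.site.wherever:8000/htbin*"]
    print(is_rejected("http://www.site.wherever:8000/htbin/query", rejects))
    header = [
        "THE_URL:http://seqaxp.bio.caltech.edu:8000/\n",
        "THE_TITLE:SAF Web server home page\n",
    ]
    print(read_crawl_header(header))
```

On Unix, a simple index of crawl output could be built by calling read_crawl_header on the first two lines of each lnk*.dat file and storing the resulting (url, title) pair next to the file name.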